Reliability contract
SchemaBrain’s reliability targets, what they cover, and how each one maps to an observable signal in the event bus or the audit log. The numbers are operator-facing promises. They are aspirational at the current release; we publish them so the conversation about reliability is concrete instead of vague.
Why publish targets we have not yet defended at scale
SchemaBrain has shipped against a bundled six-table fixture and a fifteen-table real-world schema (Pagila). The local p95 measurements for read-side tools sit well inside the targets below. We have no production-scale measurements yet; those come when the first donor schema connects. Publishing the targets now serves three purposes:
- Contract clarity. Operators know what to expect; we know what to defend.
- Regression budget. Every PR that meaningfully changes a hot path is checked against these numbers. A target burned is a feature frozen.
- Honest framing. The doc tells the truth about what is measured vs. what is assumed. A reliability story you cannot pressure-test is marketing.
SLOs — read-side MCP tools
These tools read from the local SQLite store and the in-process embedding matrix. They do not touch Postgres.
| Tool | p95 latency target | Error budget |
|---|---|---|
| find_relevant_tables | 250 ms | 0.5 % |
| find_relevant_entities | 250 ms | 0.5 % |
| describe_table | 50 ms | 0.1 % |
| describe_column | 50 ms | 0.1 % |
| get_example_queries | 100 ms | 0.5 % |
| suggest_joins | 200 ms | 0.5 % |
| list_entities | 100 ms | 0.5 % |
| list_metrics | 100 ms | 0.5 % |
| list_joins | 100 ms | 0.5 % |
| describe_entity | 100 ms | 0.5 % |
| resolve_join | 150 ms | 0.5 % |
The error budget counts status: error envelopes only. status: empty is a
correct response (no matching data), not an error.
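The counting rule above can be sketched in a few lines. This is an illustration only, not SchemaBrain code: the function names are hypothetical, the success status is assumed to be called "ok", and treating budget consumption as the observed error rate divided by the budgeted rate is one plausible reading of the table.

```python
from collections import Counter


def error_rate(statuses):
    """Fraction of calls whose status is "error".

    "empty" is a correct response, so it lands in the denominator
    only. The "ok" status name used below is an assumption.
    """
    counts = Counter(statuses)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return counts["error"] / total


def budget_burn(statuses, budget):
    """Observed error rate as a fraction of the allowed error rate."""
    return error_rate(statuses) / budget


# 1000 calls to a tool with a 0.5 % budget: one error, three empty results.
sample = ["ok"] * 996 + ["empty"] * 3 + ["error"]
print(round(error_rate(sample), 4))          # one error in 1000 calls
print(round(budget_burn(sample, 0.005), 3))  # share of the budget consumed
```

Under these assumptions, one error in a thousand calls consumes about a fifth of a 0.5 % budget, while the three empty results consume none of it.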
SLOs — execute-side MCP tools
get_metric is the only tool that emits SQL to the source database.
Its latency is bounded mostly by the Postgres execution time and the
operator’s statement_timeout.
| Tool | p95 latency target | Error budget |
|---|---|---|
| get_metric | 1500 ms | 1.0 % |
The hard ceiling on any single query is statement_timeout, which schemabrain serve
configures via --statement-timeout-ms (default 30000 ms, i.e. 30 s; pass 0 to disable).
A query that exceeds it surfaces as status: error from the Postgres
operational error and counts against the error budget.
SLOs — pipeline operations
| Operation | Target | Notes |
|---|---|---|
| index throughput | ≥ 100 columns / minute | with default Haiku 4.5 + local embedder |
| serve cold-start | < 800 ms | from process start to first tool ready |
| Audit-write success | 99.99 % | failures land in the events log |
What is measured today vs. what is aspirational
| Component | Measurement status |
|---|---|
| Read-side p95 against the bundled fixture | Measured locally; comfortably inside targets |
| Read-side p95 against Pagila (15 tables, 87 columns) | Measured locally; comfortably inside targets |
| get_metric p95 against a real schema | Not measured at scale |
| index throughput on a 200-table schema | Extrapolated from two anchors; see architecture.md |
| Audit-write success rate | No long-running soak test yet |
How to verify
Every target maps to an event recorded in ~/.schemabrain/events.jsonl
(per the event bus introduced in the observability pane) or a row in
the mcp_audit table (per the audit-substrate work).
| Target | Where to look |
|---|---|
| Read-side latency | event.kind == "tool_call" records duration_ms |
| Read-side error budget | event.status values across a rolling window |
| get_metric latency | Same tool_call event; tool_name == "get_metric" |
| index throughput | index_progress events; columns / wall-clock |
| serve cold-start | server_start event timestamp vs. process start |
| Audit-write success | Absence of audit_write_failed events |
schemabrain tail surfaces these events live. schemabrain audit list
filters by tool and status. JSON-mode output on both subcommands
pipes cleanly into jq or whatever observability sink the operator
already runs.
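As a rough illustration of checking a latency target offline, the sketch below computes a nearest-rank p95 per tool from JSONL event lines. It assumes each event is a flat JSON object with kind, tool_name, and duration_ms keys, matching the names in the table above; the exact record layout is not documented here, and the helper names are hypothetical.

```python
import json
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    # Integer form of ceil(0.95 * n), which avoids float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def p95_by_tool(lines):
    """Group tool_call durations by tool_name and return the p95 per tool.

    `lines` is any iterable of JSONL strings, for example an open file.
    The flat key layout (kind, tool_name, duration_ms) is an assumption.
    """
    durations = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("kind") != "tool_call":
            continue
        durations[event["tool_name"]].append(event["duration_ms"])
    return {tool: p95(vals) for tool, vals in durations.items()}


# In-memory sample: 20 describe_table calls taking 1..20 ms, plus one
# unrelated event that the filter skips.
sample = [
    json.dumps({"kind": "tool_call", "tool_name": "describe_table",
                "duration_ms": ms, "status": "ok"})
    for ms in range(1, 21)
]
sample.append(json.dumps({"kind": "server_start"}))
print(p95_by_tool(sample))
```

Passing an open handle on the events file instead of the in-memory sample would give the per-tool figures to compare against the targets. Nearest-rank is only one of several percentile definitions, so small samples can differ slightly from what an observability sink reports.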
Error budget policy
If a target burns more than 50 % of its error budget in a rolling seven-day window, the next release ships a reliability fix, not a feature. This is a discipline-by-convention policy at the current release; once a hosted variant exists, the budget burn becomes a hard gate on the deployment pipeline.
What is explicitly out of scope
- Availability of the source database. SchemaBrain is a client; source availability is the operator’s existing infrastructure.
- Availability of the Anthropic API. index retries with exponential backoff and surfaces a guided error when retries are exhausted; the operator can re-run.
- Disk space exhaustion under ~/.schemabrain/. The events file rotates at 10 MiB; the store grows with the schema. Disk monitoring is at the OS layer.
- Network reliability between the agent and the MCP stdio server. Stdio is in-process; the agent and SchemaBrain share a parent.