Threat model
SchemaBrain’s attack surface, the threats we model against, what we mitigate today, and the residual risk we accept. This is a living document; assumptions and mitigations are updated per release.Trust model
SchemaBrain runs as a local single-process MCP server on the operator’s own machine. It connects to a Postgres database the operator controls, reads schema and sample values, calls Anthropic for enrichment, and exposes read-only tools to an agent (Claude Desktop, Cursor, or a custom harness) over stdio. The operator is trusted. The agent is not fully trusted — agents are LLMs and can be confused, manipulated, or compromised through prompt-injection. The database is trusted to the extent the operator’s existing infrastructure trusts it, but its contents (column names, comments, sample values) may contain adversarial input.Asset 1: MCP tool surface
The MCP server exposes 12 tools over stdio. Each tool reads from the local SQLite store, the Postgres source, or both. The attack surface includes the tool arguments an agent passes, the connection string the operator configured, and the audit log that records every call. Assets at risk- Connection URL (with embedded password)
- Source-database schema introspection results
- Indexed store contents (
./schemabrain.db) - Audit log integrity (
mcp_audittable)
| Id | Threat | Severity |
|---|---|---|
| T1.1 | Database credentials land in argv, journald, or shell history | High |
| T1.2 | SQL injection via tool arguments (qualified names) | High |
| T1.3 | Audit log tampered after the fact to hide a tool call | Medium |
| T1.4 | URL query string smuggles a session config (?options=...) | Medium |
| T1.5 | Connection pool reuses a connection with poisoned session state | Medium |
| T1.6 | Agent reads a PII row value through MAX/MIN over a tagged column | High |
| T1.7 | Operator runs --pii-block against an unclassified source and the firewall silently fails open | High |
| T1.8 | Agent enumerates raw PII values through group_by on a tagged column (e.g. count(*) ... GROUP BY email) | High |
- T1.1: Connection URLs are read from environment variables only.
schemabrain index --url-env DBURLaccepts the env-var name; the URL itself never appears in argv. See schemabrain/cli.py_resolve_url_source. The positional-URL form prints a deprecation warning that does not echo the password. - T1.2: All qualified names flow through
_validate_identin schemabrain/mcp/_helpers.py, which enforces the regex^[A-Za-z_][A-Za-z0-9_$]*$and a 63-character bound (PostgresNAMEDATALEN-1). Tool SQL is parameterised; user input never enters a query string by concatenation. Identifier-only SQL fragments (table/column names that must be inline because Postgres does not parameterise schema objects) pass through_validate_identfirst. - T1.3: The
mcp_audittable is append-only at the SQLite level. DDL in schemabrain/audit/ddl.py installsmcp_audit_no_updateandmcp_audit_no_deletetriggers that raise on anyUPDATEorDELETEagainst the table. Tampering requires direct SQLite-file write access (an OS-level compromise), at which point the audit chain’s hash field detects mutation viaschemabrain audit verify. - T1.4: Connection URLs are filtered through
safe_engine_urlin schemabrain/connectors/_url.py before reachingcreate_engine. Onlysslmode,sslrootcert, andapplication_namequery parameters survive; everything else (notablyoptions=-c statement_timeout=1) is stripped. - T1.5: Both the profiler and the metric executor use SQLAlchemy
NullPool(see schemabrain/connectors/postgres.py and schemabrain/cli.py —_make_metric_engine). Every database call opens a fresh connection and disposes of it; there is no pool state to escape from. - T1.6:
MAXandMINover a text column return the lexicographically ordered row value, not a derived aggregate — the returned value IS a row value, indistinguishable from aSELECT._resolve_pii_categoriesin schemabrain/mcp/get_metric.py refuses any metric whoseaggisminormaxover a column with non-empty PII categories, independent of the operator’s--pii-blockpolicy. The operator’s policy governs which aggregate categories may pass; MIN/MAX of a tagged column isn’t aggregation.SUM,COUNT,COUNT_DISTINCT, andAVGremain pure aggregates and continue to flow through the existing--pii-blockgate — unless the metric groups BY a tagged column, which projects raw values into the result rows regardless of the aggregate (see T1.8). Composite expressions (expression: a * b) under MIN/MAX refuse if any operand is tagged. - T1.8:
group_byon a PII-tagged column projects that column’s raw values into the result rows as the group keys —count(*) ... GROUP BY emailreturns every distinct email, indistinguishable fromSELECT DISTINCT email._resolve_pii_categoriesin schemabrain/mcp/get_metric.py refuses any metric that groups by a column with non-empty PII categories, independent of the operator’s--pii-blockpolicy — the same row-level-disclosure rule as T1.6 (MIN/MAX), applied to the projection surface. Aggregating OVER or filtering BY a tagged column stays governed by the--pii-blockcategory gate (no raw values leave the database); only thegroup_byprojection is categorically refused. The refusal happens before any SQL is emitted. The escape hatch for a benign-but-tagged column (e.g. a 2-lettercountry_codethe name-based classifier tagscontact) is to reclassify itpublic, not to weaken the rule. - T1.7: When
--pii-blockenforcement is active but the store returns zero PII tag rows across every column touched, the call refuses withPiiBlockedErrorrather than letting the empty lookup pose as confirmed-clean data. The operator opted into enforcement, so the firewall must surface tag-coverage gaps explicitly rather than silently passing. The opt-out is the existing--pii-block ''(no enforcement) flag; the matchingschemabrain indexinvocation populates the tag table for sources the operator intends to keep classified. See_resolve_pii_categoriesin schemabrain/mcp/get_metric.py.
- An operator who deliberately passes a positional URL despite the deprecation warning will leak credentials to argv. The fix is operator discipline plus the deprecation warning, not a mitigation.
- Direct filesystem access to
~/.schemabrain/allows reading the store and audit log. We set the events file to mode0600(see schemabrain/observability/bus.py); the store relies on OS umask. Operators on shared machines should rely on full-disk encryption and per-user home directories.
Asset 2: Prompt injection into the database
Duringindex, schema names, column names, column comments, and sample values
flow into the Anthropic enrichment prompt. An adversary who can write into the
database can attempt to override the system prompt or coerce the LLM into
emitting unexpected content.
Assets at risk
- The enrichment LLM’s output (column descriptions, table descriptions)
- The eventual agent’s behaviour, downstream of those descriptions
| Id | Threat | Severity |
|---|---|---|
| T2.1 | Column comment overrides the system prompt | Medium |
| T2.2 | Sample value contains exfiltration payload that surfaces verbatim in a description | Medium |
| T2.3 | Adversarial column name encodes an instruction the LLM follows | Low |
- T2.1, T2.3: The enrichment prompt is structured for resistance to override. Schema content is enclosed in clearly delimited JSON; the system prompt instructs the model to produce schema-description output only. Confusion remains possible (LLMs are not parsers) — defence in depth is what follows.
- T2.2: Sample values flow through
redact_piiin schemabrain/profiler/stats.py before reaching the prompt. Email-shaped, SSN-shaped, and credit-card-shaped strings are replaced with<redacted>tokens. Generic injection payloads (“ignore previous instructions”) survive this filter — it is a PII filter, not an injection filter. - All three: Descriptions are written into the store, not executed. The
agent reads them as context; the operator can re-run
schemabrain indexto overwrite a description they find suspicious;schemabrain audit listshows every retrieval if anomalous behaviour is observed downstream.
- A determined adversary controlling a column comment can probably bias the enrichment description’s tone or content. The blast radius is bounded by what the description can express (free-text English; not code execution). Operators indexing untrusted donor schemas should treat the resulting descriptions as data, not as gospel.
Asset 3: Agent pathological loops
A confused agent can call MCP tools repeatedly. The operator’s exposure is LLM credit (Anthropic side) and database load (Postgres side). Assets at risk- Operator LLM budget
- Source-database CPU and connection availability
- Postgres
pg_stat_statementsaccuracy (a runaway loop pollutes it)
| Id | Threat | Severity |
|---|---|---|
| T3.1 | Runaway tool-call loop drains the agent’s API budget | Medium |
| T3.2 | get_metric blow-up via repeated grain-mismatched queries | Medium |
| T3.3 | find_relevant_tables spammed with identical embeddings | Low |
- T3.1: The agent’s budget is enforced at the agent layer (Anthropic API
keys, per-conversation limits). SchemaBrain’s
--max-costdefault of $1 perindexrun caps the enrichment side; once exceeded the run halts withstatus: degraded. Read-side tools (describe_table,find_relevant_tables, etc.) are read-only from the store and cost effectively zero per call. - T3.2:
get_metricenforces a Postgres-sidestatement_timeout(operator-tunable via--statement-timeout-ms; default 30000ms / 30s; pass0to disable). The value is injected intoconnect_args.optionsso it can’t be overridden via URL query params, and a runaway query aborts with a clearOperationalError. The query compiler also refuses grain mismatches at parse time, before any SQL leaves the process. - T3.2b:
get_metricenforces an application-level row cap (operator-tunable via--max-rows-per-result; default 10000; pass0to disable). This is a payload-size guard, not a query-cost guard — the source DB still does the full scan, but the MCP response is bounded before reaching the agent. Use--statement-timeout-msfor query-cost bounding. EXPLAIN-based pre-execution cost caps are on the v0.5+ roadmap. - T3.3: Read-side retrieval is stateless and bounded — every call returns
at most
limitrows; embeddings are cached; cosine is computed against a fixed-size matrix.
- A pathological agent can still issue thousands of cheap read-side calls
in a session. The audit log records each one; the events file shows the
pattern in
schemabrain tail. Rate limiting on the MCP tool surface itself is a planned enhancement.
Asset 4: Adversarial schema names
Operators sometimes index schemas they do not control (donor demos, customer onboarding, evaluation suites). Schema names can contain control characters, oversized identifiers, Unicode confusables, or embedded ANSI escape sequences. Assets at risk- Operator terminal (ANSI escapes injected into rendered output)
- Agent context window (oversized identifiers echoed back into context)
- Display integrity (Unicode confusables swap for visually identical Latin)
| Id | Threat | Severity |
|---|---|---|
| T4.1 | 150 KB identifier exhausts the agent’s context window | High |
| T4.2 | Unicode confusable schema names mislead the operator | Low |
| T4.3 | Embedded ANSI escape sequences corrupt terminal output | Low |
- T4.1:
_validate_identrejects any identifier longer than 63 characters. Error messages bound their echoed input via_bounded_repr(capped at three identifier lengths). An adversary cannot force a 150 KB string back into agent context through SchemaBrain. - T4.2, T4.3: The unquoted-identifier regex
^[A-Za-z_][A-Za-z0-9_$]*$rejects control characters, ANSI escapes, and Unicode confusables outside the ASCII alphabet. SQL-standard double-quoted syntax ("Order Items","Percent (%) Eligible") is supported for schemas that legitimately carry non-identifier-shaped names, with a control-character denylist (C0 controls\x00-\x1fplus DEL\x7f) enforced at the parser boundary — even a Postgres catalog that admits an ANSI escape in a column name is refused before it can reach an agent-rendered surface.
- The control-char denylist covers C0 + DEL but not Unicode line/paragraph
separators (
,) or C1 controls (\x80-\x9f). These have no known terminal-reflow vector through the MCP response shape, but expanding the denylist if the threat profile widens is a one-line change.
Threats explicitly out of scope
- Denial of service against the host process. A misbehaving agent can spike CPU on the host; the operator’s infrastructure (cgroups, ulimits, container limits) is the correct layer for that defence.
- Filesystem attacks on
~/.schemabrain/. OS permissions and full-disk encryption are the right layer. We set the events file to mode0600and rely on the standard umask for the store. - Side-channel attacks on the local SQLite cache. Out of scope for a local single-process tool. Relevant for a future hosted variant.
- Supply-chain attacks on PyPI dependencies. Closed by a separate
programme: SECURITY.md, Dependabot,
pip-audit, Bandit, and Semgrep all run in CI. See SECURITY.md. - Multi-tenant isolation. SchemaBrain is single-tenant by design. Multi-tenancy is a hosted-variant concern.