Building a semantic layer
Three concepts compose SchemaBrain’s semantic layer:
- Entities — a domain name (e.g.
customer, order) bound to one physical table.
- Metrics — aggregations anchored on an entity, with grain (e.g.
total_revenue by month).
- Canonical joins — the persisted answer to “how do entity A and entity B connect?”
All three are agent-visible through dedicated MCP tools and compile to parameterized SQL the agent never sees.
How the agent reaches the semantic layer
| Tool | What the agent asks it |
|---|
find_relevant_entities(query) | ”Which entities match this business concept?” — semantic search over the layer. |
list_entities() | ”What entities exist in this database?” |
describe_entity(name) | ”What does this entity expose? Columns, PII sensitivity, bound table.” |
resolve_join(entity_a, entity_b) | ”Give me the canonical SQL JOIN between these two entities.” |
get_metric(name, by=..., filter=..., via=...) | ”Compute this aggregation. Return rows + the SQL + an audit fingerprint.” |
The five physical-schema tools (find_relevant_tables, describe_table, describe_column, suggest_joins, get_example_queries) sit below them. Full reference: the MCP tool reference.
Author from scratch (no LLM)
Every definition is a small YAML file you can write by hand — the suggest commands in the sections below are just an LLM-assisted shortcut for producing the same files. To build a layer manually, follow three steps in order.
1. Index the source first. Entities bind to physical tables, so the store must know the schema before any definition will apply:
schemabrain index --url-env DATABASE_URL --store-path ./schemabrain.db
2. Apply in dependency order. The store enforces references between definitions — an entity must exist before a join or metric can point at it — so apply entities → joins → metrics:
schemabrain entities apply ./schemabrain/entities --url-env DATABASE_URL --store-path ./schemabrain.db
schemabrain joins apply ./schemabrain/joins --url-env DATABASE_URL --store-path ./schemabrain.db
schemabrain metrics apply ./schemabrain/metrics --url-env DATABASE_URL --store-path ./schemabrain.db
A minimal layer (origin: manual marks a hand-authored definition). The
join below connects two entities, so both must be defined — apply the
entities before the join:
# schemabrain/entities/order.yaml
version: 1
name: order
description: A customer order.
binding:
single_table: public.orders
identity: id
origin: manual
# schemabrain/entities/customer.yaml
version: 1
name: customer
description: A customer who places orders.
binding:
single_table: public.customers
identity: id
origin: manual
# schemabrain/metrics/total_revenue.yaml
version: 1
name: total_revenue
description: Sum of order totals per requested grain.
entity: order
measure:
agg: sum
column: total_cents
time_dimension: order.placed_at
time_grains: [day, week, month]
# schemabrain/joins/order_customer.yaml
version: 1
name: order_customer
description: Links each order to the customer who placed it.
source_entity: order
target_entity: customer
"on":
- source: user_id
target: id
cardinality: many_to_one
3. Verify it resolves. Column references — a metric’s measure column, a join’s on columns, an entity’s identity — are validated when a query compiles, not when it applies. So apply accepting a file is not proof the layer works; confirm it end to end:
schemabrain entities list --store-path ./schemabrain.db # are all three present?
schemabrain metrics list --store-path ./schemabrain.db
Then ask the agent (or call get_metric) for one real number. A typo in a column name surfaces here as an unknown_name envelope listing the valid columns — not as a silent success at apply time.
Use the same database URL everywhere. The store keys everything by a source_connection_id derived from the connection URL, so schemabrain index and every later apply / serve must pass the same URL — same host, port, and scheme. Index under postgresql://… and apply under postgresql+psycopg://… and they key to different stores: the second command reports an empty / unindexed source even though the tables are right there. Pick one URL form and pass it consistently.
Entities
schemabrain entities suggest --url-env DATABASE_URL --dry-run
| Mode | What it does |
|---|
--dry-run | Print candidates to stdout with confidence + rationale + PII hints. No writes. |
--out-dir ./suggestions | Write one <entity>.yaml per candidate. Edit before applying. |
--apply | Write candidates straight into the store. |
Spend is bounded by --max-cost-usd (default $1.00) or $SCHEMABRAIN_MAX_LLM_COST_USD. Pair with --top-k N to cap candidate count.
Sample dry-run output:
# confidence: high
# rationale: users has id PK, NOT NULL email, referenced by sessions.user_id
# pii_hints:
# email: pii
version: 1
name: user
description: A workspace member account
binding:
single_table: public.users
identity: id
origin: suggested
-- 3 candidate(s) | model: claude-sonnet-4-6 | cost: $0.0271
Once entities are in the store, the MCP server exposes them via list_entities and describe_entity.
Metrics
metrics suggest mirrors entities suggest — same three modes, same cost guards. The LLM picks the measure column (or composite expression), aggregation function, optional time dimension, and grain:
schemabrain metrics suggest --url-env DATABASE_URL --dry-run
schemabrain metrics suggest --url-env DATABASE_URL --out-dir ./metric-candidates
schemabrain metrics list --store-path ./schemabrain.db
Metrics anchor on an entity that already exists in the store. If you haven’t curated entities first, metrics suggest refuses with a guided error pointing at entities apply.
Composite-expression measures
A metric’s measure can be either a single column or a composite expression over multiple columns of the same anchor table:
# Single column
measure:
agg: sum
column: amount_cents
# Composite expression (whitelist: identifiers, + - * /, parens, numeric literals)
measure:
agg: sum
expression: unit_price_cents * quantity
Exactly one of column or expression is required. PII propagation walks every column the expression references — touching one PII-tagged column taints the whole metric.
Canonical joins
Where entities suggest infers WHAT to query, joins suggest infers HOW two entities connect. Candidates are mined from FK constraints (always present) and query-log evidence (when schemabrain mine-queries has populated the example_queries table from pg_stat_statements).
schemabrain joins suggest --url-env DATABASE_URL --dry-run
schemabrain joins suggest --url-env DATABASE_URL --out-dir ./join-candidates
schemabrain joins apply ./join-candidates --url-env DATABASE_URL
schemabrain joins list --store-path ./schemabrain.db
Once applied, the agent-facing resolve_join MCP tool returns the canonical join with a paste-ready JOIN ... ON ... skeleton. Multi-canonical-per-pair (billing vs shipping address, primary vs secondary user) is supported: pass name=<canonical_name> to disambiguate, or get a structured ambiguity refusal listing both.
Multi-hop join paths
get_metric accepts group_by= columns that live on a table reachable through multiple canonical joins from the metric’s anchor. The compiler BFSes the canonical-join graph (default max_hops=6), emits each JOIN against the previous hop’s alias, and returns a topologically chain-ordered MetricPlan.joins. When multiple paths are equally short, the agent gets a structured ambiguous_path refusal — disambiguate via via=(join_name, ...).
Import from dbt
If you already curate models in dbt, point SchemaBrain at your compiled target/manifest.json and import them as entities. Scope today: each dbt model with a single-column primary key becomes a SchemaBrain entity (dbt owns those rows), and dbt metrics import opt-in via --include-metrics. dbt relationships tests import as canonical joins (origin="dbt_import"); FK + query-log mining fills in any joins dbt doesn’t declare. Two entry points:
During init (auto-detected or explicit): the wizard’s stage 1 auto-detects a manifest from $DBT_PROJECT_DIR/target/manifest.json or by walking up from the cwd looking for dbt_project.yml. When found, stages 3 (entities) and 4 (metrics) route through the importer instead of the LLM. Force a specific manifest with --from-dbt PATH:
schemabrain init --url-env DATABASE_URL --from-dbt /path/to/dbt/target/manifest.json
Stage 5 (joins) imports dbt relationships schema tests as canonical joins (each declared FK becomes a source → target join with origin="dbt_import"), and still mines FK constraints + the query log for joins dbt doesn’t declare.
Standalone import: if you want to import without going through the wizard, point the importer directly at a manifest. The importer binds each model to a table in the local index, so the store must already be indexed for the same source URL:
# Step 0 — index the schema first (skip if you already ran `init` against this URL).
schemabrain index --url-env DATABASE_URL --store-path ./schemabrain.db # add --no-enrich for a cost-free index
# Step 1 — import the dbt models as entities.
schemabrain import dbt path/to/target/manifest.json --url-env DATABASE_URL --store-path ./schemabrain.db
Run import dbt against a store that was never indexed for this URL and it short-circuits with a guided “run schemabrain index first” error rather than failing per-model. The importer also needs the physical tables to exist (run dbt run, not just dbt compile) since it verifies each model against the live schema — even under --dry-run.
Each dbt model with a single-column primary key lands as a SchemaBrain entity with origin="dbt_import", and each relationships test between two imported models lands as a canonical join (origin="dbt_import"). Re-running is idempotent; entities that previously had origin="manual" or "suggested" flip to "dbt_import" (dbt takes ownership), and joins upsert by name. Subsequent manual edits to dbt-owned rows are refused at the store boundary. A relationship whose endpoint model isn’t imported (or resolves to a self-join) is skipped and surfaced in the import summary.
| Flag | Behaviour |
|---|
| (default) | Plan + apply (entities only). |
--include-metrics | Also import type=simple dbt metrics (off by default). |
--dry-run | Compute the plan; write nothing (still connects to the live DB). |
--report report.json | Emit a CI-friendly JSON report. |
A bundled fixture demonstrates the flow:
schemabrain import dbt $(schemabrain fixture-path ecommerce_manifest.json) \
--url-env DATABASE_URL --dry-run
Audit and repair
The local store can be checked for corruption and repaired:
schemabrain metrics audit # report-only
schemabrain metrics audit --fix # delete metrics with malformed measures
Useful after a manual edit to YAML candidates or a schema migration in the underlying source.