> ## Documentation Index
> Fetch the complete documentation index at: https://schemabrain.mintlify.app/llms.txt
> Use this file to discover all available pages before exploring further.

# Semantic layer

> How entities, canonical joins, and metrics compose into business-named queries that bypass FK guessing.

# Building a semantic layer

Three concepts compose SchemaBrain's semantic layer:

* **Entities** — a domain name (e.g. `customer`, `order`) bound to one physical table.
* **Metrics** — aggregations anchored on an entity, with grain (e.g. `total_revenue` by `month`).
* **Canonical joins** — the persisted answer to "how do entity A and entity B connect?"

All three are agent-visible through dedicated MCP tools and compile to parameterized SQL the agent never sees.

## How the agent reaches the semantic layer

| Tool                                            | What the agent asks it                                                          |
| ----------------------------------------------- | ------------------------------------------------------------------------------- |
| `find_relevant_entities(query)`                 | "Which entities match this business concept?" — semantic search over the layer. |
| `list_entities()`                               | "What entities exist in this database?"                                         |
| `describe_entity(name)`                         | "What does this entity expose? Columns, PII sensitivity, bound table."          |
| `resolve_join(entity_a, entity_b)`              | "Give me the canonical SQL JOIN between these two entities."                    |
| `get_metric(name, by=..., filter=..., via=...)` | "Compute this aggregation. Return rows + the SQL + an audit fingerprint."       |

The five physical-schema tools (`find_relevant_tables`, `describe_table`, `describe_column`, `suggest_joins`, `get_example_queries`) sit below them. Full reference: [the MCP tool reference](/reference/mcp-tools/overview).

## Author from scratch (no LLM)

Every definition is a small YAML file you can write by hand — the `suggest` commands in the sections below are just an LLM-assisted shortcut for producing the same files. To build a layer manually, follow three steps in order.

**1. Index the source first.** Entities bind to physical tables, so the store must know the schema before any definition will apply:

```bash theme={null}
schemabrain index --url-env DATABASE_URL --store-path ./schemabrain.db
```

**2. Apply in dependency order.** The store enforces references between definitions — an entity must exist before a join or metric can point at it — so apply **entities → joins → metrics**:

```bash theme={null}
schemabrain entities apply ./schemabrain/entities --url-env DATABASE_URL --store-path ./schemabrain.db
schemabrain joins    apply ./schemabrain/joins    --url-env DATABASE_URL --store-path ./schemabrain.db
schemabrain metrics  apply ./schemabrain/metrics  --url-env DATABASE_URL --store-path ./schemabrain.db
```

A minimal layer (`origin: manual` marks a hand-authored definition). The
join below connects two entities, so **both must be defined** — apply the
entities before the join:

```yaml theme={null}
# schemabrain/entities/order.yaml
version: 1
name: order
description: A customer order.
binding:
  single_table: public.orders
identity: id
origin: manual
```

```yaml theme={null}
# schemabrain/entities/customer.yaml
version: 1
name: customer
description: A customer who places orders.
binding:
  single_table: public.customers
identity: id
origin: manual
```

```yaml theme={null}
# schemabrain/metrics/total_revenue.yaml
version: 1
name: total_revenue
description: Sum of order totals per requested grain.
entity: order
measure:
  agg: sum
  column: total_cents
time_dimension: order.placed_at
time_grains: [day, week, month]
```

```yaml theme={null}
# schemabrain/joins/order_customer.yaml
version: 1
name: order_customer
description: Links each order to the customer who placed it.
source_entity: order
target_entity: customer
"on":
  - source: user_id
    target: id
cardinality: many_to_one
```

**3. Verify it resolves.** Column references — a metric's measure column, a join's `on` columns, an entity's `identity` — are validated when a query **compiles**, not when it applies. So `apply` accepting a file is not proof the layer works; confirm it end to end:

```bash theme={null}
schemabrain entities list --store-path ./schemabrain.db   # are all three present?
schemabrain metrics  list --store-path ./schemabrain.db
```

Then ask the agent (or call `get_metric`) for one real number. A typo in a column name surfaces here as an `unknown_name` envelope listing the valid columns — not as a silent success at apply time.

<Warning>
  **Use the same database URL everywhere.** The store keys everything by a `source_connection_id` derived from the connection URL, so `schemabrain index` and every later `apply` / `serve` must pass the **same** URL — same host, port, and scheme. Index under `postgresql://…` and apply under `postgresql+psycopg://…` and they key to different stores: the second command reports an empty / unindexed source even though the tables are right there. Pick one URL form and pass it consistently.
</Warning>

## Entities

```bash theme={null}
schemabrain entities suggest --url-env DATABASE_URL --dry-run
```

| Mode                      | What it does                                                                   |
| ------------------------- | ------------------------------------------------------------------------------ |
| `--dry-run`               | Print candidates to stdout with confidence + rationale + PII hints. No writes. |
| `--out-dir ./suggestions` | Write one `<entity>.yaml` per candidate. Edit before applying.                 |
| `--apply`                 | Write candidates straight into the store.                                      |

Spend is bounded by `--max-cost-usd` (default `$1.00`) or `$SCHEMABRAIN_MAX_LLM_COST_USD`. Pair with `--top-k N` to cap candidate count.

Sample dry-run output:

```
# confidence: high
# rationale: users has id PK, NOT NULL email, referenced by sessions.user_id
# pii_hints:
#   email: pii
version: 1
name: user
description: A workspace member account
binding:
  single_table: public.users
identity: id
origin: suggested

-- 3 candidate(s) | model: claude-sonnet-4-6 | cost: $0.0271
```

Once entities are in the store, the MCP server exposes them via `list_entities` and `describe_entity`.

## Metrics

`metrics suggest` mirrors `entities suggest` — same three modes, same cost guards. The LLM picks the measure column (or composite expression), aggregation function, optional time dimension, and grain:

```bash theme={null}
schemabrain metrics suggest --url-env DATABASE_URL --dry-run
schemabrain metrics suggest --url-env DATABASE_URL --out-dir ./metric-candidates
schemabrain metrics list --store-path ./schemabrain.db
```

Metrics anchor on an entity that already exists in the store. If you haven't curated entities first, `metrics suggest` refuses with a guided error pointing at `entities apply`.

### Composite-expression measures

A metric's measure can be either a single column or a composite expression over multiple columns of the same anchor table:

```yaml theme={null}
# Single column
measure:
  agg: sum
  column: amount_cents

# Composite expression (whitelist: identifiers, + - * /, parens, numeric literals)
measure:
  agg: sum
  expression: unit_price_cents * quantity
```

Exactly one of `column` or `expression` is required. PII propagation walks every column the expression references — touching one PII-tagged column taints the whole metric.

## Canonical joins

Where `entities suggest` infers WHAT to query, `joins suggest` infers HOW two entities connect. Candidates are mined from FK constraints (always present) and query-log evidence (when `schemabrain mine-queries` has populated the `example_queries` table from `pg_stat_statements`).

```bash theme={null}
schemabrain joins suggest --url-env DATABASE_URL --dry-run
schemabrain joins suggest --url-env DATABASE_URL --out-dir ./join-candidates
schemabrain joins apply ./join-candidates --url-env DATABASE_URL
schemabrain joins list --store-path ./schemabrain.db
```

Once applied, the agent-facing `resolve_join` MCP tool returns the canonical join with a paste-ready `JOIN ... ON ...` skeleton. Multi-canonical-per-pair (billing vs shipping address, primary vs secondary user) is supported: pass `name=<canonical_name>` to disambiguate, or get a structured ambiguity refusal listing both.

### Multi-hop join paths

`get_metric` accepts `group_by=` columns that live on a table reachable through multiple canonical joins from the metric's anchor. The compiler BFSes the canonical-join graph (default `max_hops=6`), emits each JOIN against the previous hop's alias, and returns a topologically chain-ordered `MetricPlan.joins`. When multiple paths are equally short, the agent gets a structured `ambiguous_path` refusal — disambiguate via `via=(join_name, ...)`.

## Import from dbt

If you already curate models in dbt, point SchemaBrain at your compiled `target/manifest.json` and import them as entities. Scope today: each dbt **model** with a single-column primary key becomes a SchemaBrain **entity** (dbt owns those rows), and dbt **metrics** import opt-in via `--include-metrics`. dbt **`relationships` tests import as canonical joins** (`origin="dbt_import"`); FK + query-log mining fills in any joins dbt doesn't declare. Two entry points:

**During `init` (auto-detected or explicit):** the wizard's stage 1 auto-detects a manifest from `$DBT_PROJECT_DIR/target/manifest.json` or by walking up from the cwd looking for `dbt_project.yml`. When found, stages 3 (entities) and 4 (metrics) route through the importer instead of the LLM. Force a specific manifest with `--from-dbt PATH`:

```bash theme={null}
schemabrain init --url-env DATABASE_URL --from-dbt /path/to/dbt/target/manifest.json
```

Stage 5 (joins) imports dbt `relationships` schema tests as canonical joins (each declared FK becomes a `source → target` join with `origin="dbt_import"`), and still mines FK constraints + the query log for joins dbt doesn't declare.

**Standalone import:** if you want to import without going through the wizard, point the importer directly at a manifest. The importer binds each model to a table in the local index, so the store must already be indexed for the **same** source URL:

```bash theme={null}
# Step 0 — index the schema first (skip if you already ran `init` against this URL).
schemabrain index --url-env DATABASE_URL --store-path ./schemabrain.db   # add --no-enrich for a cost-free index

# Step 1 — import the dbt models as entities.
schemabrain import dbt path/to/target/manifest.json --url-env DATABASE_URL --store-path ./schemabrain.db
```

Run `import dbt` against a store that was never indexed for this URL and it short-circuits with a guided *"run `schemabrain index` first"* error rather than failing per-model. The importer also needs the physical tables to exist (run `dbt run`, not just `dbt compile`) since it verifies each model against the live schema — even under `--dry-run`.

Each dbt model with a single-column primary key lands as a SchemaBrain entity with `origin="dbt_import"`, and each `relationships` test between two imported models lands as a canonical join (`origin="dbt_import"`). Re-running is idempotent; entities that previously had `origin="manual"` or `"suggested"` flip to `"dbt_import"` (dbt takes ownership), and joins upsert by name. Subsequent manual edits to dbt-owned rows are refused at the store boundary. A relationship whose endpoint model isn't imported (or resolves to a self-join) is skipped and surfaced in the import summary.

| Flag                   | Behaviour                                                        |
| ---------------------- | ---------------------------------------------------------------- |
| *(default)*            | Plan + apply (entities only).                                    |
| `--include-metrics`    | Also import `type=simple` dbt metrics (off by default).          |
| `--dry-run`            | Compute the plan; write nothing (still connects to the live DB). |
| `--report report.json` | Emit a CI-friendly JSON report.                                  |

A bundled fixture demonstrates the flow:

```bash theme={null}
schemabrain import dbt $(schemabrain fixture-path ecommerce_manifest.json) \
    --url-env DATABASE_URL --dry-run
```

## Audit and repair

The local store can be checked for corruption and repaired:

```bash theme={null}
schemabrain metrics audit                  # report-only
schemabrain metrics audit --fix            # delete metrics with malformed measures
```

Useful after a manual edit to YAML candidates or a schema migration in the underlying source.
