Mechanism: PII taxonomy
One-line claim: Every column carries a typed PII category tag at index time; the tags propagate through joins and aggregations at compile time, so
get_metric refuses queries that would touch blocked PII before the database is ever queried.pii: true annotation that a human reviewer looks at. SchemaBrain’s PII tags are structural inputs to the SQL compiler. A metric whose join path crosses a blocked-category column refuses with a typed pii_blocked envelope, and the compiled SQL never reaches the database.
This page documents the four mechanisms — the closed category set, the column-level classifier, the propagation rules, and the catastrophic-leak defaults.
1. The closed category set — 12 categories from real regulation
PIICategory is a Pydantic Literal — a closed set of 12 strings. Adding a new category is a minor charter bump.
| Category | Regulation that defines it | Example columns |
|---|---|---|
contact | GDPR Art. 4(1) “identifiable natural person” | email, phone, address, first_name |
financial | GLBA “non-public personal information”; GDPR Art. 9 income | salary, balance, revenue, transaction_amount |
payment_card | PCI DSS scope (PAN, CVV, cardholder data) | card_number, cvv, iban, account_number |
health | HIPAA PHI, GDPR Art. 9 health data | diagnosis, medication, mrn, patient_id, encounter_id |
genetic | GDPR Art. 9 genetic data | genome, genotype, dna_seq, rsid |
biometric | GDPR Art. 9 biometric data | fingerprint, face_embedding, iris, voiceprint |
behavioral | GDPR Recital 30 (browse / clickstream patterns) | purchase_history, clickstream, event_log |
online_identifier | GDPR Art. 4(1); CCPA “unique personal identifier” | ip, cookie_id, device_id, wallet_address |
credential | PCI DSS Req. 8; SOX access control | password, api_key, token, private_key |
government_id | HIPAA Safe Harbor; ISO 27018 | ssn, passport, national_id, npi, tax_id |
location | GDPR Art. 4(1) location data | latitude, longitude, gps, geolocation |
demographic_protected | GDPR Art. 9 special categories; HIPAA age | dob, race, ethnicity, religion, political_party |
2. Column-level classification — regex + integer-FK guard
schemabrain/pii/classifier.py takes a column name (plus its declared type and primary-key flag) and returns (sensitivity, frozenset[PIICategory]). The classifier is regex-based but with two refinements that pure regex can’t express.
Layer 1 — regex with snake_case-aware boundaries
The standard\b regex boundary treats _ as a word character, so \bemail\b doesn’t match email inside email_address. SQL columns are snake_case overwhelmingly, so we use custom boundary semantics:
email now matches inside email_address, user_email, and e_mail, but not inside emailish. The rule table is closed at v1: when an operator hits a column name our rules miss, the fix is to add a rule. CI pins (a) the rule count and (b) that every PIICategory has at least one producing rule — drift fails CI before merge.
Layer 2 — S1 <noun>_name denylist
product_name in a catalog table previously matched the bare name rule and tagged the column as contact (a person’s name). The S1 guard suppresses the name rule when the column matches ^<denylist>_name$ for a closed list of obviously-not-personal nouns (product, brand, category, language, currency, tag, …). Bare name on a denylisted table (products.name, categories.name) also skips the rule.
The denylist is deliberately conservative — company_name, team_name, and region_name stay tagged because they can refer to a natural person’s affiliation in CRM contexts.
Layer 3 — S2 integer-FK guard
An integer-FK column whose name contains a category keyword (e.g.,address_id BIGINT matching the address rule → contact) tags the FK as PII even though the actual PII lives on the referenced table. The S2 guard, when the column is integer-typed AND matches the <word>_id shape AND is not a primary key, intersects the matched categories with _FK_SAFE_CATEGORIES:
patient_id is HIPAA-sensitive even without the patient’s name; cookie_id is GDPR-sensitive). Other categories (contact, financial, payment_card, biometric, genetic) are stripped from FK columns because the integer doesn’t leak that content — only the referenced row does.
3. Propagation through joins and aggregations — the compiler-layer mechanism
This is the load-bearing claim. Column tags don’t just sit in the catalog; they propagate through the metric compiler so aget_metric whose join path touches a blocked column refuses, even when the touched column appears nowhere in the projected SQL.
schemabrain/mcp/get_metric.py walks five propagation surfaces before policy intersection:
- The measure column(s) — including composite-expression operands (e.g.,
email_hash * weightwalks both operands, not just the projected column). - The
time_dimensioncolumn — when the metric declares one. - The
group_bycolumns — every column the agent groups by. - The filter predicates — every column referenced in a
WHEREclause. - The JOIN ON pairs — every column on either side of every join hop, including pure-FK columns that never appear in projection or filters.
--pii-block. If the intersection is non-empty, the compiler raises PiiBlockedError. The compiled SQL never runs. The agent receives a pii_blocked refusal with only the blocked categories surfaced — see /mechanism/structured-recovery for the envelope shape, and the §“Probe oracle” section below for why we report only the blocked subset.
The Willison Mirror scenario
The compiler-layer propagation closes the Simon-Willison “lethal trifecta” attack (private data exposure × untrusted content × external communication). Concretely: The same defense applies if the password column appears only in a JOIN ON pair or only as a composite-measure operand — every surface walks every column. A row-level filter would have been too late.Grouping by a PII column is row-level disclosure
Walking thegroup_by surface (above) feeds the policy intersection — but group_by is special in a second way: the grouped column’s distinct values become the result rows. count(*) ... GROUP BY email returns every email — the same data as SELECT DISTINCT email. That is row-level disclosure, not aggregation. So, exactly like MIN/MAX over a tagged column, grouping by a column with any non-empty PII category is refused independent of --pii-block — even under an empty policy, raw PII values never reach the output rows. Aggregating OVER or filtering BY a tagged column stays governed by the category-policy gate (no raw values leave the database); only the group_by projection is categorically refused. This generalises the Willison Mirror defense above from the catastrophic floor to every category: the staff.password example refuses on the floor, and user.email (contact) refuses on the same row-level-disclosure rule.
4. Catastrophic-leak defaults — credential / payment_card / government_id
These three categories have no plausible aggregate-analytics use case — there is no legitimate dashboard that groups by SSN or sums credit-card numbers. They are pinned as CATASTROPHIC_LEAK_CATEGORIES:
Safe-by-default Enforcement Layers:
schemabrain servedefaults--pii-blockto this set when the flag is absent. A zero-config operator runningschemabrain serveon a fresh install gets safe-by-default enforcement:password,credit_card, andssnare all refused without explicit configuration.describe_entityalways redacts the column description when any of these categories appears, even with an empty--pii-block. The agent never reads “password hash” or “social security number” semantics regardless of operator policy. Override is possible per-column but requires deliberate configuration.
contact is not in the set — there are legitimate aggregate analytics over contact data (newsletter open rates by region, etc.), so blocking the aggregate by default would break common workflows. Grouping BY a contact column is a different matter: it projects raw emails/names into the result rows, so it is refused as row-level disclosure regardless of policy (see §“Grouping by a PII column is row-level disclosure” above). The “by region” workflow still works because region classifies public; a name-based-tagged-but-benign column like country_code (which the classifier marks contact) is reclassified public to group by it.
Probe oracle — why the refusal envelope reports only blocked, not attempted
A naive refusal envelope would return all categories the query touched:
get_metric with different group_by columns, read which categories surface in the refusal, and reconstruct the entire PII tagging in O(columns) calls. The PII policy itself becomes a probe oracle.
SchemaBrain’s refusal envelope reports only the blocked set — the intersection of attempted ∩ --pii-block:
--pii-block set is never leaked. Note: the audit row records the same blocked set the envelope returns, not the full attempted set — the latter currently lives only inside the in-process PiiBlockedError exception. A separate operator-visible field carrying the attempted set is on the backlog; see /mechanism/audit-chain §“What this is not” for the caveat.
What this is not
- It is not yet content-aware. The v0.4 classifier is name-based + structural (integer-FK guard). A column literally named
attr_1holding email addresses getspublic. Data-profile-aware classification (sample-based content matching) is deferred to a future minor version — the sampling pipeline already exists for the LLM-enrichment path, but plumbing it through the classifier with appropriate sampling caps and safe redaction is non-trivial. - It is not a substitute for column-level access control at the database. SchemaBrain refuses queries it would have run. A direct database connection bypasses SchemaBrain entirely. Postgres role-level grants are the right complement.
- It is not a single classifier the operator can swap. The rule table is a Python module today. YAML-overlay extensibility was deferred to first real demand — see
schemabrain/pii/classifier.pydocstring for the rationale. - It is not the same posture across all tools.
get_metricrefuses on blocked-category touch.describe_entityanddescribe_columnmark blocked columns withredacted=Trueand clear the LLM description, but they don’t refuse — the agent still sees the entity name and its non-PII columns. Uniform SQL-layer enforcement against agent-emitted SQL is a v2 surface.
Verify it yourself
schemabrain/pii/classifier.py; the propagation logic is in schemabrain/mcp/get_metric.py.
Related
Read-only
Why writes are impossible at the type level.
Structured recovery
Recovery envelope on a PII refusal.
Audit chain
Every refusal is recorded in the chain.
Threat model
Full PII surface analysis.