Mechanism: PII taxonomy

One-line claim: Every column carries a typed PII category tag at index time; the tags propagate through joins and aggregations at compile time, so get_metric refuses queries that would touch blocked PII before the database is ever queried.

Most “PII protection” in catalogs is a documentation feature — a column gets a pii: true annotation that a human reviewer looks at. SchemaBrain’s PII tags are structural inputs to the SQL compiler. A metric whose join path crosses a blocked-category column refuses with a typed pii_blocked envelope, and the compiled SQL never reaches the database. This page documents the four mechanisms — the closed category set, the column-level classifier, the propagation rules, and the catastrophic-leak defaults.

1. The closed category set — 12 categories from real regulation

PIICategory is a Pydantic Literal — a closed set of 12 strings. Adding a new category is a minor charter bump.

Literal["contact", "financial", "payment_card", "health", "genetic",
        "biometric", "behavioral", "online_identifier", "credential",
        "government_id", "location", "demographic_protected"]

Category	Regulation that defines it	Example columns
`contact`	GDPR Art. 4(1) “identifiable natural person”	`email`, `phone`, `address`, `first_name`
`financial`	GLBA “non-public personal information”; GDPR Art. 9 income	`salary`, `balance`, `revenue`, `transaction_amount`
`payment_card`	PCI DSS scope (PAN, CVV, cardholder data)	`card_number`, `cvv`, `iban`, `account_number`
`health`	HIPAA PHI, GDPR Art. 9 health data	`diagnosis`, `medication`, `mrn`, `patient_id`, `encounter_id`
`genetic`	GDPR Art. 9 genetic data	`genome`, `genotype`, `dna_seq`, `rsid`
`biometric`	GDPR Art. 9 biometric data	`fingerprint`, `face_embedding`, `iris`, `voiceprint`
`behavioral`	GDPR Recital 30 (browse / clickstream patterns)	`purchase_history`, `clickstream`, `event_log`
`online_identifier`	GDPR Art. 4(1); CCPA “unique personal identifier”	`ip`, `cookie_id`, `device_id`, `wallet_address`
`credential`	PCI DSS Req. 8; SOX access control	`password`, `api_key`, `token`, `private_key`
`government_id`	HIPAA Safe Harbor; ISO 27018	`ssn`, `passport`, `national_id`, `npi`, `tax_id`
`location`	GDPR Art. 4(1) location data	`latitude`, `longitude`, `gps`, `geolocation`
`demographic_protected`	GDPR Art. 9 special categories; HIPAA age	`dob`, `race`, `ethnicity`, `religion`, `political_party`

The closed set means an agent or audit pipeline can switch on the category without parsing strings. New regulatory categories (e.g., AI Act biometric-emotion data) trigger a minor charter bump rather than a free-form additions.

2. Column-level classification — regex + integer-FK guard

schemabrain/pii/classifier.py takes a column name (plus its declared type and primary-key flag) and returns (sensitivity, frozenset[PIICategory]). The classifier is regex-based but with two refinements that pure regex can’t express.

Layer 1 — regex with snake_case-aware boundaries

The standard \b regex boundary treats _ as a word character, so \bemail\b doesn’t match email inside email_address. SQL columns are snake_case overwhelmingly, so we use custom boundary semantics:

# schemabrain/pii/classifier.py — _kw()
re.compile(rf"(?<![A-Za-z0-9])(?:{alts})(?![A-Za-z0-9])", re.IGNORECASE)

email now matches inside email_address, user_email, and e_mail, but not inside emailish. The rule table is closed at v1: when an operator hits a column name our rules miss, the fix is to add a rule. CI pins (a) the rule count and (b) that every PIICategory has at least one producing rule — drift fails CI before merge.

Layer 2 — S1 `<noun>_name` denylist

product_name in a catalog table previously matched the bare name rule and tagged the column as contact (a person’s name). The S1 guard suppresses the name rule when the column matches ^<denylist>_name$ for a closed list of obviously-not-personal nouns (product, brand, category, language, currency, tag, …). Bare name on a denylisted table (products.name, categories.name) also skips the rule. The denylist is deliberately conservative — company_name, team_name, and region_name stay tagged because they can refer to a natural person’s affiliation in CRM contexts.

Layer 3 — S2 integer-FK guard

An integer-FK column whose name contains a category keyword (e.g., address_id BIGINT matching the address rule → contact) tags the FK as PII even though the actual PII lives on the referenced table. The S2 guard, when the column is integer-typed AND matches the <word>_id shape AND is not a primary key, intersects the matched categories with _FK_SAFE_CATEGORIES:

# schemabrain/pii/classifier.py — _FK_SAFE_CATEGORIES
frozenset({
    "credential", "online_identifier", "government_id",
    "health", "demographic_protected", "behavioral", "location",
})

These are the categories where the FK itself remains PII as an integer reference (patient_id is HIPAA-sensitive even without the patient’s name; cookie_id is GDPR-sensitive). Other categories (contact, financial, payment_card, biometric, genetic) are stripped from FK columns because the integer doesn’t leak that content — only the referenced row does.

3. Propagation through joins and aggregations — the compiler-layer mechanism

This is the load-bearing claim. Column tags don’t just sit in the catalog; they propagate through the metric compiler so a get_metric whose join path touches a blocked column refuses, even when the touched column appears nowhere in the projected SQL. schemabrain/mcp/get_metric.py walks five propagation surfaces before policy intersection:

The measure column(s) — including composite-expression operands (e.g., email_hash * weight walks both operands, not just the projected column).
The time_dimension column — when the metric declares one.
The group_by columns — every column the agent groups by.
The filter predicates — every column referenced in a WHERE clause.
The JOIN ON pairs — every column on either side of every join hop, including pure-FK columns that never appear in projection or filters.

All five contribute to the attempted (propagated) set. The set is intersected with --pii-block. If the intersection is non-empty, the compiler raises PiiBlockedError. The compiled SQL never runs. The agent receives a pii_blocked refusal with only the blocked categories surfaced — see /mechanism/structured-recovery for the envelope shape, and the §“Probe oracle” section below for why we report only the blocked subset.

The Willison Mirror scenario

The compiler-layer propagation closes the Simon-Willison “lethal trifecta” attack (private data exposure × untrusted content × external communication). Concretely:

Adversarial Willison Mirror Attack Closed:

An innocent metric staff_count defined as COUNT(*) over the staff table is not PII-tagged in isolation.
A prompt-injected agent calls get_metric(name="staff_count", group_by=["staff.password"]).
group_by walking adds credential to the attempted set.
Policy intersection finds credential ∈ pii_block.
The compiler raises PiiBlockedError and rejects execution. The SQL never runs, and no database query occurs.

The same defense applies if the password column appears only in a JOIN ON pair or only as a composite-measure operand — every surface walks every column. A row-level filter would have been too late.

Grouping by a PII column is row-level disclosure

Walking the group_by surface (above) feeds the policy intersection — but group_by is special in a second way: the grouped column’s distinct values become the result rows. count(*) ... GROUP BY email returns every email — the same data as SELECT DISTINCT email. That is row-level disclosure, not aggregation. So, exactly like MIN/MAX over a tagged column, grouping by a column with any non-empty PII category is refused independent of --pii-block — even under an empty policy, raw PII values never reach the output rows. Aggregating OVER or filtering BY a tagged column stays governed by the category-policy gate (no raw values leave the database); only the group_by projection is categorically refused. This generalises the Willison Mirror defense above from the catastrophic floor to every category: the staff.password example refuses on the floor, and user.email (contact) refuses on the same row-level-disclosure rule.

4. Catastrophic-leak defaults — `credential` / `payment_card` / `government_id`

These three categories have no plausible aggregate-analytics use case — there is no legitimate dashboard that groups by SSN or sums credit-card numbers. They are pinned as CATASTROPHIC_LEAK_CATEGORIES:

# schemabrain/pii/categories.py
CATASTROPHIC_LEAK_CATEGORIES: frozenset[PIICategory] = frozenset(
    {"credential", "payment_card", "government_id"}
)

Safe-by-default Enforcement Layers:

schemabrain serve defaults --pii-block to this set when the flag is absent. A zero-config operator running schemabrain serve on a fresh install gets safe-by-default enforcement: password, credit_card, and ssn are all refused without explicit configuration.
describe_entity always redacts the column description when any of these categories appears, even with an empty --pii-block. The agent never reads “password hash” or “social security number” semantics regardless of operator policy. Override is possible per-column but requires deliberate configuration.

The closed set is auditable; additions are a minor charter bump. contact is not in the set — there are legitimate aggregate analytics over contact data (newsletter open rates by region, etc.), so blocking the aggregate by default would break common workflows. Grouping BY a contact column is a different matter: it projects raw emails/names into the result rows, so it is refused as row-level disclosure regardless of policy (see §“Grouping by a PII column is row-level disclosure” above). The “by region” workflow still works because region classifies public; a name-based-tagged-but-benign column like country_code (which the classifier marks contact) is reclassified public to group by it.

Probe oracle — why the refusal envelope reports only `blocked`, not `attempted`

A naive refusal envelope would return all categories the query touched:

// NAIVE — leaks every category the schema has
{"pii_categories": ["contact", "credential"]}

An attacker could iterate get_metric with different group_by columns, read which categories surface in the refusal, and reconstruct the entire PII tagging in O(columns) calls. The PII policy itself becomes a probe oracle. SchemaBrain’s refusal envelope reports only the blocked set — the intersection of attempted ∩ --pii-block:

// SchemaBrain — surfaces only what triggered the refusal
{"pii_categories": ["credential"]}

The operator’s full --pii-block set is never leaked. Note: the audit row records the same blocked set the envelope returns, not the full attempted set — the latter currently lives only inside the in-process PiiBlockedError exception. A separate operator-visible field carrying the attempted set is on the backlog; see /mechanism/audit-chain §“What this is not” for the caveat.

What this is not

It is not yet content-aware. The v0.4 classifier is name-based + structural (integer-FK guard). A column literally named attr_1 holding email addresses gets public. Data-profile-aware classification (sample-based content matching) is deferred to a future minor version — the sampling pipeline already exists for the LLM-enrichment path, but plumbing it through the classifier with appropriate sampling caps and safe redaction is non-trivial.
It is not a substitute for column-level access control at the database. SchemaBrain refuses queries it would have run. A direct database connection bypasses SchemaBrain entirely. Postgres role-level grants are the right complement.
It is not a single classifier the operator can swap. The rule table is a Python module today. YAML-overlay extensibility was deferred to first real demand — see schemabrain/pii/classifier.py docstring for the rationale.
It is not the same posture across all tools. get_metric refuses on blocked-category touch. describe_entity and describe_column mark blocked columns with redacted=True and clear the LLM description, but they don’t refuse — the agent still sees the entity name and its non-PII columns. Uniform SQL-layer enforcement against agent-emitted SQL is a v2 surface.

Verify it yourself

# Confirm the closed category set
python -c "from schemabrain.pii.categories import PII_CATEGORIES; print(sorted(PII_CATEGORIES))"

# Confirm catastrophic-leak defaults
python -c "from schemabrain.pii.categories import CATASTROPHIC_LEAK_CATEGORIES; print(sorted(CATASTROPHIC_LEAK_CATEGORIES))"

# Classify a column
python -c "from schemabrain.pii.classifier import classify_column; print(classify_column('email', 'TEXT'))"
# ('pii', frozenset({'contact'}))

python -c "from schemabrain.pii.classifier import classify_column; print(classify_column('product_name', 'TEXT'))"
# ('public', frozenset())  — S1 denylist kicked in

python -c "from schemabrain.pii.classifier import classify_column; print(classify_column('patient_id', 'BIGINT'))"
# ('pii', frozenset({'health'}))  — S2 FK guard kept the category

The rule table itself is in schemabrain/pii/classifier.py; the propagation logic is in schemabrain/mcp/get_metric.py.

Read-only

Why writes are impossible at the type level.

Structured recovery

Recovery envelope on a PII refusal.

Audit chain

Every refusal is recorded in the chain.

Threat model

Full PII surface analysis.

​Mechanism: PII taxonomy

​1. The closed category set — 12 categories from real regulation

​2. Column-level classification — regex + integer-FK guard

​Layer 1 — regex with snake_case-aware boundaries

​Layer 2 — S1 <noun>_name denylist

​Layer 3 — S2 integer-FK guard

​3. Propagation through joins and aggregations — the compiler-layer mechanism

​The Willison Mirror scenario

​Grouping by a PII column is row-level disclosure

​4. Catastrophic-leak defaults — credential / payment_card / government_id

​Probe oracle — why the refusal envelope reports only blocked, not attempted

​What this is not

​Verify it yourself

​Related

Read-only

Structured recovery

Audit chain

Threat model

Mechanism: PII taxonomy

1. The closed category set — 12 categories from real regulation

2. Column-level classification — regex + integer-FK guard

Layer 1 — regex with snake_case-aware boundaries

Layer 2 — S1 `<noun>_name` denylist

Layer 3 — S2 integer-FK guard

3. Propagation through joins and aggregations — the compiler-layer mechanism

The Willison Mirror scenario

Grouping by a PII column is row-level disclosure

4. Catastrophic-leak defaults — `credential` / `payment_card` / `government_id`

Probe oracle — why the refusal envelope reports only `blocked`, not `attempted`

What this is not

Verify it yourself

Related