DLP scanning on ingest

Every uploaded document is pattern-scanned for SSNs, credit cards, routing numbers, MRNs, AWS keys, and other regulated PII / credentials. High-confidence findings auto-escalate sensitivity to "regulated" before the document is searchable.

Updated 2026-04-25

When extraction succeeds, Kodori runs a deterministic pattern scan over the extracted text and surfaces findings in three confidence bands.

**High confidence** — Luhn-validated credit-card number, ABA-validated US routing number, multiple distinct SSNs in one document, AWS access key, GitHub personal access token, PEM-encoded private-key block. These auto-escalate the document's sensitivity to **regulated** the moment the scan lands. The escalation emits a `document.sensitivity-changed` event with actorKind=system and reason "DLP auto-escalation: high-confidence regulated PII detected" — the audit trail captures exactly why the upgrade happened. The document never appears at a lower tier than its content warrants.
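The two checksum validations named above are what separate a high-confidence hit from a digit string that merely has the right length. The following is a minimal sketch of the standard Luhn and ABA routing checksums, not Kodori's actual detector code; the function names and the 13 to 19 digit length window are assumptions made for illustration.

```python
import re


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    # Assumed length window for card numbers; a real detector may differ.
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def aba_valid(routing: str) -> bool:
    """Return True if a 9-digit string passes the ABA routing checksum."""
    if not re.fullmatch(r"\d{9}", routing):
        return False
    # Standard ABA weights: 3, 7, 1 repeated across the nine digits.
    weights = (3, 7, 1, 3, 7, 1, 3, 7, 1)
    return sum(w * int(c) for w, c in zip(weights, routing)) % 10 == 0
```

A candidate that matches the shape but fails the checksum would simply not count as a high-confidence match in this sketch.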

**Medium confidence** — single SSN, single MRN-prefixed identifier, JWT-shaped token, generic "key=value" secret pattern. Surfaces as an open finding on the document page; an operator confirms (it's real PII) or dismisses (false positive). Each decision is captured on the audit log.

**Low confidence** — date-of-birth-shaped string, three-or-more phone-number-shaped strings. Logged for visibility but doesn't auto-escalate or surface a confirmation prompt; useful as context if you're already reviewing a document for PII.
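The band rules above can be read as a small decision table. The sketch below is one hypothetical way to encode them; the detector names, the `Hit` type, and the `sensitivity` field on the event are assumptions, while the event type, `actorKind`, and reason string come from the description above. It also assumes that fewer than three phone-shaped strings produce no finding and that any number of MRN matches stays at medium, neither of which the text states explicitly.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Hit:
    detector: str          # hypothetical detector name, e.g. "ssn"
    distinct_matches: int  # number of distinct matched values


# Detectors treated as high confidence on any validated match.
ALWAYS_HIGH = {"credit_card", "aba_routing", "aws_access_key",
               "github_pat", "pem_private_key"}
ALWAYS_MEDIUM = {"mrn", "jwt", "generic_secret"}


def band_for(hit: Hit) -> Optional[str]:
    """Map one detector hit to "high", "medium", "low", or None."""
    if hit.distinct_matches < 1:
        return None
    if hit.detector in ALWAYS_HIGH:
        return "high"
    if hit.detector == "ssn":
        # Multiple distinct SSNs are high; a single SSN is medium.
        return "high" if hit.distinct_matches >= 2 else "medium"
    if hit.detector in ALWAYS_MEDIUM:
        return "medium"
    if hit.detector == "date_of_birth":
        return "low"
    if hit.detector == "phone":
        # Assumption: fewer than three phone-shaped strings is not a finding.
        return "low" if hit.distinct_matches >= 3 else None
    return None


def escalation_event(hits: Iterable[Hit]) -> Optional[dict]:
    """Return a sensitivity-change event if any hit is high confidence."""
    if any(band_for(h) == "high" for h in hits):
        return {
            "type": "document.sensitivity-changed",
            "actorKind": "system",
            "sensitivity": "regulated",  # hypothetical field name
            "reason": "DLP auto-escalation: high-confidence regulated PII detected",
        }
    return None
```

Under these assumptions, a document with one SSN and five phone numbers yields a medium and a low finding but no escalation event, while a single validated credit-card number escalates on its own.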

The matched value is **never stored**. Each finding row carries only a pre-redacted preview (`XXX-XX-1234` for SSN, `····-····-····-1234` for credit card) plus a count of distinct matches and a one-line reasoning. The document itself is the durable record; the finding is just a pointer.
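To make the "pointer, not a copy" idea concrete, here is a minimal sketch of how a finding row could be built so that only the last four digits survive. The `Finding` type, its field names, and the helper functions are hypothetical; only the preview formats are taken from the text above.

```python
import re
from dataclasses import dataclass
from typing import Sequence


def ssn_preview(value: str) -> str:
    """Redact an SSN to the XXX-XX-1234 form, keeping the last four digits."""
    digits = re.sub(r"\D", "", value)
    return f"XXX-XX-{digits[-4:]}"


def card_preview(value: str) -> str:
    """Redact a card number to the ····-····-····-1234 form."""
    digits = re.sub(r"\D", "", value)
    return "-".join(["····"] * 3 + [digits[-4:]])


@dataclass(frozen=True)
class Finding:
    detector: str
    preview: str           # pre-redacted; the raw match is never kept
    distinct_matches: int
    reasoning: str


def build_ssn_finding(matches: Sequence[str]) -> Finding:
    """Build a finding from raw SSN matches without retaining them."""
    # Normalize so "123-45-6789" and "123456789" count as one value.
    distinct = {re.sub(r"\D", "", m) for m in matches}
    return Finding(
        detector="ssn",
        preview=ssn_preview(matches[0]),
        distinct_matches=len(distinct),
        reasoning=f"{len(distinct)} distinct SSN-shaped value(s) found",
    )
```

The raw matches exist only as local variables during the scan; once the `Finding` is returned, nothing in it can be used to reconstruct the full value.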

Tenant-wide DLP posture is shown on `/compliance`: status counts, active findings by type, and a bar chart that shows whether one detector is firing on most uploads (a sign of either a real systemic issue or a noisy pattern that needs tuning).
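The "firing on most uploads" signal is a per-detector fire rate across uploads. As a rough sketch of the arithmetic behind such a chart (the function names, input shape, and the 0.5 threshold for "most" are assumptions, not the product's actual query):

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def detector_fire_rates(uploads: List[Set[str]]) -> Dict[str, float]:
    """Fraction of uploads on which each detector fired at least once.

    `uploads` holds one set of detector names per uploaded document.
    """
    total = len(uploads)
    if total == 0:
        return {}
    counts = Counter(d for fired in uploads for d in fired)
    return {d: n / total for d, n in counts.items()}


def noisy_detectors(uploads: List[Set[str]], threshold: float = 0.5) -> List[str]:
    """Detectors firing on more than `threshold` of uploads (assumed cutoff)."""
    rates = detector_fire_rates(uploads)
    return sorted(d for d, rate in rates.items() if rate > threshold)
```

A detector that clears the threshold is a candidate for review, not proof of noise: the same rate could reflect a genuine systemic issue in what is being uploaded.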

Why deterministic regex instead of an LLM detector: regex is auditable. A regulated customer can read the patterns and reason about false-positive rates without reading a model card. We will add an LLM detector when a customer asks for long-tail coverage, and not before.