Content types — required fields + auto-naming per kind of document

Tell Kodori what every kind of document needs. Each content type binds a docType to a list of required metadata fields + an optional auto-naming template. The fields surface as a Required panel on the doc page until the operator fills them.

Updated 2026-05-20

A **content type** is the answer to the question Kevin asked at our 2026-05-18 working meeting: *"if I hold up a document and say 'that is a fingerprint card,' what do you need to know when you go looking for it?"* The fields you list — date received, county, sentence, custodian — become the required-metadata schema for every document of that type. When a doc's docType matches a content type's label, Kodori surfaces those fields on the doc page until the operator fills them in.

Roy + Kevin both wanted "content type" as the operator term (not "schema" — that reads as developer jargon). Kodori uses "content type" everywhere a user sees it.

**Where they live.** /admin/content-types is the owner / admin page where you define them. Each content type carries:

- **Display name** — what operators see. ("Fingerprint card", "Bill of cost", "Contract".)
- **docType label** — the lowercase-kebab string that gets written to a document's metadata.docType when it's classified. ("fingerprint-card", "bill-of-cost".) The auto-classifier writes this; an operator can also set it manually in the Metadata panel.
- **Required fields** — the metadata keys + types + (for enum kinds) the allowed options. The operator MUST fill these on every doc of this type before the doc is "complete."
- **Auto-name template (D343)** — optional. When enabled, saving metadata on a doc of this content type rewrites the display_name from a template like `{{docTypeLabel}}-{{lastName}}-{{dateReceived}}`. Useful for matching file-naming conventions your existing systems expect.
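To make the auto-name template concrete, here is a minimal sketch of placeholder substitution. The helper name `renderAutoName` and the choice to leave an unfilled placeholder visible are assumptions made for illustration; the article does not say how Kodori treats missing values.

```typescript
// Hypothetical helper: fills {{placeholders}} in an auto-name template
// from a flat record of metadata values.
type TemplateValues = Record<string, string | number | undefined>;

function renderAutoName(template: string, values: TemplateValues): string {
  // Match {{key}} where key is made of word characters.
  return template.replace(/\{\{(\w+)\}\}/g, (placeholder: string, key: string) => {
    const value = values[key];
    // Assumption: an unfilled key leaves the placeholder in place
    // instead of silently producing a gap in the name.
    return value === undefined || value === "" ? placeholder : String(value);
  });
}

const exampleName = renderAutoName("{{docTypeLabel}}-{{lastName}}-{{dateReceived}}", {
  docTypeLabel: "fingerprint-card",
  lastName: "Alvarez",
  dateReceived: "2026-05-18",
});
console.log(exampleName); // fingerprint-card-Alvarez-2026-05-18
```

The sample values ("Alvarez", the date) are invented for the example.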

**One docType, one content type.** The (tenant, docTypeLabel) pair is unique among live content types. If you need different field sets for "contracts under federal law" vs. "contracts under state law," create two content types with distinct docType labels ("contract-federal", "contract-state"). The auto-classifier will pick the matching one.

**Required field shape.** Each field has up to five parts (`options` applies only to enum fields):

```json
{
  "key": "matterNumber",
  "label": "Matter number",
  "kind": "string",
  "hint": "e.g. 24-1234"
}
```

- **key** — camelCase identifier; written to `document_objects.metadata[key]` when filled.
- **label** — what the operator sees on the doc page.
- **kind** — string | number | date | enum. Picks the input affordance + the resolution check.
- **hint** — optional placeholder / description rendered below the input.
- **options** — required when kind="enum"; the allowed values.
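The field shape above can be written down as a type, with a small check for the one conditional rule (enum fields need options). This is a sketch: the type and function names are hypothetical, and treating `options` as a list of strings is an assumption.

```typescript
// Illustrative types only; property names follow the JSON shape in this
// article, but the type names are hypothetical.
type FieldKind = "string" | "number" | "date" | "enum";

interface RequiredField {
  key: string;        // camelCase identifier, written to metadata[key]
  label: string;      // shown to the operator on the doc page
  kind: FieldKind;    // picks the input and the resolution check
  hint?: string;      // optional placeholder / description
  options?: string[]; // assumed to be strings; required when kind is "enum"
}

// Returns a list of problems with a field definition (empty list = valid).
function validateFieldDefinition(field: RequiredField): string[] {
  const problems: string[] = [];
  if (!/^[a-z][a-zA-Z0-9]*$/.test(field.key)) {
    problems.push(`key "${field.key}" is not camelCase`);
  }
  if (field.label.trim() === "") {
    problems.push("label is empty");
  }
  if (field.kind === "enum" && (!field.options || field.options.length === 0)) {
    problems.push("enum fields need at least one option");
  }
  return problems;
}

console.log(validateFieldDefinition({ key: "jurisdiction", label: "Jurisdiction", kind: "enum" }));
```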

**Resolution rules.** A required field is satisfied when:

- string: `metadata[key]` is a non-empty trimmed string.
- number: `metadata[key]` is a finite number (or a string that parses to one).
- date: `metadata[key]` parses as a Date.
- enum: `metadata[key]` (or its string form) is in `options`.

The checks are deliberately lenient on input shape because operators paste from many sources. The Required panel clears automatically the moment the field is satisfied.
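The four resolution rules could be expressed roughly as follows. This is a sketch under assumptions, not Kodori's actual check: the names `isSatisfied` and `DocMetadata` are hypothetical, and "parses as a Date" is approximated with `Date.parse`, whose handling of non-ISO strings varies between JavaScript engines.

```typescript
type FieldKind = "string" | "number" | "date" | "enum";

interface RequiredField {
  key: string;
  label: string;
  kind: FieldKind;
  options?: string[];
}

type DocMetadata = Record<string, unknown>;

// Returns true when metadata[field.key] satisfies the rule for field.kind.
function isSatisfied(field: RequiredField, metadata: DocMetadata): boolean {
  const value = metadata[field.key];
  switch (field.kind) {
    case "string":
      // Non-empty after trimming.
      return typeof value === "string" && value.trim().length > 0;
    case "number":
      // A finite number, or a non-blank string that parses to one.
      if (typeof value === "number") return isFinite(value);
      if (typeof value === "string" && value.trim() !== "") return isFinite(Number(value));
      return false;
    case "date":
      // A valid Date object, or a non-blank string that Date.parse accepts.
      if (value instanceof Date) return !isNaN(value.getTime());
      if (typeof value === "string" && value.trim() !== "") return !isNaN(Date.parse(value));
      return false;
    case "enum":
      // The value's string form must be one of the allowed options.
      if (value === undefined || value === null) return false;
      return (field.options || []).indexOf(String(value)) !== -1;
    default:
      return false;
  }
}

const matter: RequiredField = { key: "matterNumber", label: "Matter number", kind: "string" };
console.log(isSatisfied(matter, { matterNumber: "24-1234" })); // true
console.log(isSatisfied(matter, { matterNumber: "   " }));     // false
```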

**Editing a content type.** Already-classified documents see the new required-field set on next render — there's no migration of historical metadata. Drop a field and operators stop seeing it as a requirement; add a field and the panel surfaces it. The audit log records every change to the content type itself.

**Archive vs. delete.** Archiving a content type stops it appearing in the docType assignment dropdown but keeps the row intact (already-classified docs keep their docType value; the audit chain referencing the content type stays valid). Re-creating a content type with the same docType label is allowed once the existing one is archived — the partial unique index ignores archived rows.
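The "unique among live content types" rule can be pictured as an in-memory check. This is only a sketch of the behavior described above; in Kodori the constraint is enforced by the partial unique index, and the row shape and `archivedAt` property here are hypothetical.

```typescript
interface ContentTypeRow {
  tenantId: string;
  docTypeLabel: string;
  archivedAt: string | null; // hypothetical column; null means the row is live
}

// A new content type is allowed when no live row for the same tenant
// already uses the docType label. Archived rows are ignored.
function canCreate(rows: ContentTypeRow[], tenantId: string, docTypeLabel: string): boolean {
  return !rows.some(
    (row) =>
      row.archivedAt === null &&
      row.tenantId === tenantId &&
      row.docTypeLabel === docTypeLabel
  );
}

const rows: ContentTypeRow[] = [
  { tenantId: "t1", docTypeLabel: "contract", archivedAt: "2026-05-01" },
  { tenantId: "t1", docTypeLabel: "invoice", archivedAt: null },
];
console.log(canCreate(rows, "t1", "contract")); // true: the existing row is archived
console.log(canCreate(rows, "t1", "invoice"));  // false: a live row already uses it
```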

**What about templates?** A content type defines what a kind of document *is*. The Templates feature (/help/document-templates) is something different — it's an AI-generated drafting path where you start from an existing doc as a structural reference and Kodori writes a new one. They coexist; one is metadata schema, the other is doc-creation flow.

**Observed docTypes (D345).** Between the Defined list and the create form is the Observed docTypes panel. It queries every distinct `metadata.docType` value across your live corpus, sorted by usage. Each row shows the docType value, the doc count, and a mapped-status chip:

- **Green ✓ Mapped** — a defined content type's `docTypeLabel` matches literally. The doc-to-content-type binding is working.
- **Amber ⚠ Similar to {name}** — normalization (lowercase + hyphenize non-alphanumeric runs) reveals that this observed docType collides with an existing content type's label. Common causes: the AI wrote `Fingerprint Card` and you defined `fingerprint-card`; one teammate's batch had `Contract` and another had `contract`. **D347 ships the one-click fix:** expand the amber row's "Reclassify N docs as `<canonical-label>`" disclosure, edit the audit reason if you want, click Reclassify. Every matching doc updates to the canonical label and emits its own `document.metadata-set` audit event with previous + next + reason. Reversible per-doc from each doc's history. Match is literal case-preserving — only docs with the exact source label get touched, so consolidating three normalized-equivalent variants is three separate clicks (which keeps the audit trail honest about which path each doc came from). **D350 adds an opt-in checkbox: "Also re-run AI classification on these docs"** — when checked, the server action queues fresh `document/auto-classify.requested` events for every updated doc so the AI re-proposes values that fit the new content type's required-field schema (sensitivity / keywords / docType / content-type-fields all get fresh proposals). Default OFF because cost is ~$0.005/doc Haiku — operators consolidating a few docs probably want it; operators consolidating thousands probably want to think about it. The banner after submit reports "Queued AI re-classification on N docs — proposed values appear on each doc's Suggestions panel within a minute."
- **Indigo ≈ Synonym of {name} (D360)** — when exact + normalization both miss, Kodori checks a curated synonym dictionary covering legal-agreements (contract / agreement / engagement-letter / MSA / SOW), NDAs, MOUs, financial (invoice / bill / statement-of-account / PO / receipt), law-enforcement (fingerprint-card / booking-card / tenprint, mugshot / booking-photo, rap-sheet / criminal-history, motion / pleading, deposition / depo, subpoena / warrant), meeting notes / minutes / MOM, email / correspondence / letter, AEC (RFI / change-order / submittal / drawing-blueprint-plan-spec), HR (resume / CV / offer-letter / performance-review), and real-estate (lease / deed / appraisal). Synonym matches surface an indigo "Reclassify N docs as ..." disclosure with the same form shape as the amber branch — reason defaults to "Consolidate synonym ..." so the audit log distinguishes synonym consolidation from case-normalization fixes. Why a dictionary instead of embeddings for v1? Zero new infra; instant render; covers the high-volume common cases across our three beta verticals. The interface is shaped so a future pgvector implementation drops in without changing this UX. Curation principles: only "obviously same kind of doc" pairs included; context-sensitive terms ("note" could be meeting-note OR loan-note) are deliberately left out — better to miss them than to suggest a bad merge.
- **Plain (unmapped)** — no content type covers this docType, no normalization match, no synonym match. Click "Create content type from this →" to pre-fill the create form with the observed name + a normalized docTypeLabel; edit and save. **D357 — same affordance on /doc/[id]:** when you open an individual document whose docType is unmapped, the Maps-to chip slot surfaces an amber "no content type defined yet · [Create one →]" that deep-links to the create form pre-filled with the doc's docType. Operators with the most context on a doc (they just opened it; they know what kind it is) get a path to fix the binding from where they are.
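The order in which the chips are decided (literal match, then normalization, then synonyms, then unmapped) can be sketched as below. The function names, the two-group sample dictionary, and the trimming of leading and trailing hyphens are assumptions for illustration; the real synonym dictionary is larger and its internal shape is not documented here.

```typescript
type MappedStatus =
  | { kind: "mapped"; label: string }
  | { kind: "similar"; label: string }
  | { kind: "synonym"; label: string }
  | { kind: "unmapped" };

// Lowercase, then turn each run of non-alphanumeric characters into one hyphen.
// Assumption: leading and trailing hyphens are trimmed.
function normalizeDocType(raw: string): string {
  return raw.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}

// Tiny sample of synonym groups taken from the list above (not the full dictionary).
const SYNONYM_GROUPS: string[][] = [
  ["contract", "agreement", "engagement-letter"],
  ["fingerprint-card", "booking-card", "tenprint"],
];

function classifyObserved(observed: string, definedLabels: string[]): MappedStatus {
  // 1. Literal, case-preserving match.
  if (definedLabels.indexOf(observed) !== -1) {
    return { kind: "mapped", label: observed };
  }
  // 2. Normalized forms collide.
  const normalized = normalizeDocType(observed);
  const similar = definedLabels.filter((l) => normalizeDocType(l) === normalized)[0];
  if (similar !== undefined) {
    return { kind: "similar", label: similar };
  }
  // 3. Observed value and a defined label share a synonym group.
  const group = SYNONYM_GROUPS.filter((g) => g.indexOf(normalized) !== -1)[0];
  if (group !== undefined) {
    const synonym = definedLabels.filter((l) => group.indexOf(normalizeDocType(l)) !== -1)[0];
    if (synonym !== undefined) {
      return { kind: "synonym", label: synonym };
    }
  }
  return { kind: "unmapped" };
}

const defined = ["fingerprint-card", "contract"];
console.log(classifyObserved("Fingerprint Card", defined).kind); // similar
console.log(classifyObserved("tenprint", defined).kind);         // synonym
```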

The panel caps at 50 rows ORDER BY count DESC — the most-used values float to the top, which are the ones worth formalizing first. If your corpus has > 50 distinct docTypes, that's a sign to consolidate before defining more content types.
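As an in-memory picture of what the panel's query returns, the sketch below counts distinct docType values, sorts by usage, and applies the cap. The real panel does this in SQL; the function and type names are hypothetical.

```typescript
interface ObservedRow {
  docType: string;
  count: number;
}

// Counts each distinct docType (case-preserving), sorts by count descending,
// and keeps at most `cap` rows.
function observedDocTypes(docTypes: string[], cap: number = 50): ObservedRow[] {
  // Object.create(null) avoids collisions with inherited keys like "constructor".
  const counts: Record<string, number> = Object.create(null);
  for (const docType of docTypes) {
    counts[docType] = (counts[docType] || 0) + 1;
  }
  return Object.keys(counts)
    .map((docType) => ({ docType: docType, count: counts[docType] || 0 }))
    .sort((a, b) => b.count - a.count)
    .slice(0, cap);
}

const sample = ["contract", "Contract", "contract", "invoice", "contract", "invoice"];
console.log(observedDocTypes(sample));
```

Note that `contract` and `Contract` stay separate rows here, which is what lets the amber chip flag them as candidates for consolidation.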

**Starting from scratch (D358).** Brand-new tenants land on /admin/content-types and see "No content types yet." Below the empty state, admins get a "Load sample content types" button that seeds three generic templates — Contract, Invoice, Meeting notes — each with sensible required fields + auto-naming. Each seed goes through the same createContentTypeTool path as a manual create, so the audit chain records each as a normal content-type.created event with you as actor. Conflicts (same name OR same docType label) skip silently — re-clicking the button is idempotent. The seeds are intentionally generic and editable; the goal is "see what good looks like in 1 click," not "lock you into a specific vocabulary."
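The skip-on-conflict behavior that makes the seed button idempotent might look like the following sketch. The names are hypothetical, the seed docType labels are guesses at what the three templates use, and exact string comparison for conflicts is an assumption.

```typescript
interface SeedContentType {
  displayName: string;
  docTypeLabel: string;
}

interface SeedPlan {
  create: SeedContentType[];
  skipped: SeedContentType[];
}

// A seed conflicts when an existing content type has the same display name
// OR the same docType label; conflicting seeds are skipped, not errors.
function planSeeds(existing: SeedContentType[], seeds: SeedContentType[]): SeedPlan {
  const plan: SeedPlan = { create: [], skipped: [] };
  for (const seed of seeds) {
    const conflict = existing.concat(plan.create).some(
      (ct) => ct.displayName === seed.displayName || ct.docTypeLabel === seed.docTypeLabel
    );
    if (conflict) {
      plan.skipped.push(seed);
    } else {
      plan.create.push(seed);
    }
  }
  return plan;
}

// Assumed labels for the three generic seeds.
const SEEDS: SeedContentType[] = [
  { displayName: "Contract", docTypeLabel: "contract" },
  { displayName: "Invoice", docTypeLabel: "invoice" },
  { displayName: "Meeting notes", docTypeLabel: "meeting-notes" },
];

const firstClick = planSeeds([], SEEDS);
const secondClick = planSeeds(firstClick.create, SEEDS);
console.log(firstClick.create.length, secondClick.create.length); // 3 0
```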

**Completeness scorecard (D361).** Every content-type row carries twin horizontal progress bars below its chips:

- **Finalized** (emerald) — what % of docs bound to this content type have been operator-approved through /triage (`finalized_at IS NOT NULL`). High % means the queue is being worked through; low % means there's a backlog.
- **Required keys present** (amber) — what % of bound docs have ALL the keys defined by this content type's required fields present in their metadata. High % means operators and the AI auto-fill (D342) are doing their jobs; low % means lots of docs are stuck waiting for someone to fill the required fields.

Both metrics are floor-rounded — "99.6%" never reads as "100%" while one doc is still missing. The second bar suppresses entirely when the content type has no required fields (every doc is trivially complete in that case). The same scorecard renders on the per-content-type edit page so admins drilling in see identical math.
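Floor rounding is easy to get subtly wrong with floating point, so here is a small sketch. The function name and the `null` result for a content type with no bound docs are assumptions. Multiplying before dividing keeps the arithmetic on integers, which avoids cases where `(done / total) * 100` lands just under a whole number.

```typescript
// Floor-rounded percentage of `done` out of `total`.
// Assumption: returns null when there are no docs to measure.
function floorPercent(done: number, total: number): number | null {
  if (total <= 0) return null;
  // Multiply first so exact ratios such as 29/100 yield exactly 29.
  return Math.floor((done * 100) / total);
}

console.log(floorPercent(249, 250)); // 99  (99.6% is not shown as 100%)
console.log(floorPercent(250, 250)); // 100
```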

**What "required keys present" means precisely.** The metric uses Postgres' `?&` JSONB operator — "do all these keys exist as top-level keys in this doc's metadata?" It's a proxy for the full satisfaction check (non-empty string, valid date, enum membership in the allowed options). The precise check still gates Finalize on the doc page — the scorecard is honest about being a corpus-wide signal, not a per-doc gate. Why the proxy? Running the precise JS check across millions of docs on every /admin/content-types render would be unbounded; the SQL-side keys-present check is O(N) on a single index probe per content type.
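The gap between the proxy and the precise check shows up when a key exists but holds an unusable value. The sketch below mirrors the keys-present question in TypeScript and compares it with a non-empty-string check; the function names and the sample doc are hypothetical, and both fields are assumed to be of kind string.

```typescript
type DocMetadata = Record<string, unknown>;

// Proxy: do all required keys exist as top-level keys? (Same question the
// JSONB ?& operator answers; the value itself is not inspected.)
function hasAllKeys(metadata: DocMetadata, keys: string[]): boolean {
  return keys.every((key) => Object.prototype.hasOwnProperty.call(metadata, key));
}

// Precise rule for string fields: non-empty after trimming.
function isNonEmptyString(value: unknown): boolean {
  return typeof value === "string" && value.trim().length > 0;
}

const requiredKeys = ["matterNumber", "county"];
const doc: DocMetadata = { matterNumber: "   ", county: "Harris" };

const proxyOk = hasAllKeys(doc, requiredKeys);
const preciseOk = requiredKeys.every((key) => isNonEmptyString(doc[key]));
console.log(proxyOk, preciseOk); // true false
```

A doc like this counts toward the amber bar but would still be blocked at Finalize until the blank value is filled in.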

**Coming after D341–D360.** The visible loop is closed: content types define the schema (D341), AI auto-fill proposes values (D342), auto-naming rewrites display names (D343), the triage queue surfaces unfinalized docs + finalize gates required fields (D344, D346), the Observed-docTypes panel keeps the schema in sync with reality (D345), one-click reclassify consolidates near-duplicates (D347, D350), docs-by-content-type list appears on the edit page (D353), inline create-from-this-doc closes the loop from the doc page (D357), the seeder solves the cold-start (D358), and synonym detection (D360) catches semantic duplicates that normalization can't reach. Follow-ups in the roadmap: tenant-specific synonyms keyed by industry vocabulary (D361+ pgvector embeddings).