How extraction works — what file types Kodori reads, in what order

When you upload a document to Kodori, an Inngest function runs the extractor cascade — the goal is to get readable text out of the bytes so search, the agent, and auto-classify all have something to work with.

**The cascade order** (first adapter whose `supports(mimeType)` returns true wins):

1. **Azure Document Intelligence**: primary cloud OCR for PDFs and most images (PDF, JPEG, PNG, BMP, TIFF, HEIF). Self-reports unsupported until `AZURE_DOC_INTEL_ENDPOINT` + `AZURE_DOC_INTEL_KEY` are set, so unconfigured deployments skip it cleanly. Best on dense scanned text.
2. **Office adapters**: `.docx` via mammoth, `.xlsx`/`.xls`/`.ods` via SheetJS, `.pptx` via jszip + fast-xml-parser. Pure-JS, no LLM cost. Office formats can't go through cloud OCR anyway (those expect rasters or PDFs).
3. **Adobe Illustrator**: `.ai` files sniff for the `%PDF-` magic header and route the embedded PDF through the cloud / Claude path. Illustrator-9+ files are PDFs in disguise.
4. **Whisper transcribe**: any `audio/*` MIME type. Voice notes from /capture, dictated depositions, recorded calls. Self-reports unsupported until `OPENAI_API_KEY` is set.
5. **Google Document AI**: fallback cloud OCR. Better than Azure on handwriting + dense legal scans in our testing; cheaper than Claude vision per page. Self-reports unsupported until `GOOGLE_DOCAI_PROJECT_ID` + `GOOGLE_DOCAI_PROCESSOR_ID` are set.
6. **Raster convert**: handles `image/tiff`, `image/bmp`, `image/heic`, `image/heif` when no cloud OCR is configured. `sharp` (libvips + libheif) decodes the source bytes; single-page sources convert to PNG and route to Claude vision; multi-page sources (legal fax-scan TIFFs are typically 5-50 pages) convert page-by-page and wrap into a single PDF that Claude reads in one request. Capped at 50 pages and 20 MB per call. **The reason this exists**: Claude vision rejects TIFF / BMP / HEIC bytes directly; without conversion, a fresh tenant uploading a court-filing TIFF or an iPhone-camera HEIC would silent-dead-letter to `status='unsupported'`.
7. **Claude vision (claude-pdf)**: last LLM-driven fallback. Handles PDFs (via Anthropic's `document` block) and PNG / JPEG / GIF / WebP (via Anthropic's `image` block; the request-shape branch was fixed in D298 after a silent production bug surfaced where all four image MIMEs were getting routed through the PDF block and rejected with `'Non-PDF files in user messages' functionality not supported`). The most expensive per-page path; the most flexible at understanding semi-structured layouts (multi-column briefs, footnotes, mixed text + diagrams).
8. **Built-in text**: UTF-8 decode for `text/*` MIME types, plus JSON / YAML / XML / CSV / SVG. Zero-cost shortcut for files that are already text.
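
The first-match rule and the "self-reports unsupported until configured" behavior can be sketched roughly as follows. This is a minimal illustration, not Kodori's actual code: the `Extractor` shape, `configGated`, `buildCascade`, `pickExtractor`, and the adapter names other than `claude-pdf` are all hypothetical, and the cascade is shortened to four adapters.

```typescript
// Hypothetical adapter shape; Kodori's real interface may differ.
interface Extractor {
  name: string;
  supports(mimeType: string): boolean;
}

type Env = Record<string, string | undefined>;

// Build an adapter that self-reports unsupported until every required
// environment variable is set, so unconfigured deployments skip it.
function configGated(
  name: string,
  requiredVars: string[],
  mimeTypes: string[],
  env: Env,
): Extractor {
  const configured = requiredVars.every((v) => Boolean(env[v]));
  return {
    name,
    supports: (mimeType) => configured && mimeTypes.includes(mimeType),
  };
}

// A shortened cascade (four of the eight adapters), kept in cascade order.
function buildCascade(env: Env): Extractor[] {
  return [
    configGated(
      "azure-doc-intel", // hypothetical adapter name
      ["AZURE_DOC_INTEL_ENDPOINT", "AZURE_DOC_INTEL_KEY"],
      ["application/pdf", "image/jpeg", "image/png", "image/bmp", "image/tiff", "image/heif"],
      env,
    ),
    {
      name: "raster-convert", // hypothetical adapter name
      supports: (m) => ["image/tiff", "image/bmp", "image/heic", "image/heif"].includes(m),
    },
    {
      name: "claude-pdf",
      supports: (m) =>
        ["application/pdf", "image/png", "image/jpeg", "image/gif", "image/webp"].includes(m),
    },
    { name: "built-in-text", supports: (m) => m.startsWith("text/") },
  ];
}

// First adapter whose supports() returns true wins; undefined means
// every adapter said no.
function pickExtractor(cascade: Extractor[], mimeType: string): Extractor | undefined {
  return cascade.find((e) => e.supports(mimeType));
}
```

With an empty environment, `image/tiff` falls through to the raster-convert entry; once both Azure variables are set, the same MIME type is claimed by the first adapter instead.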

**What "supported" means.** Each extractor reads the doc's MIME type and decides yes / no. The first yes wins. If every extractor says no, the document still uploads but `document_content.status='unsupported'` and search / agent / auto-classify can't see the contents — only the file name + MIME type are searchable.

**TIFF specifically.** Two paths cover it:

- Tenants with **Azure or Google Document AI configured** route TIFFs through cloud OCR. Best for high-volume scanned-record workloads: cheaper per page, higher accuracy on dense text, no Claude vision spend.
- Tenants **without cloud OCR configured** route TIFFs through the raster-convert extractor. Single-page TIFFs convert to PNG and go to Claude. Multi-page TIFFs convert each page, wrap into a PDF, and go to Claude as one multi-page request. The 20 MB cap and 50-page cap are practical limits; beyond that, configure cloud OCR.
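
The raster-convert decision for a TIFF can be pictured as a small planning function. This is only a sketch under assumptions: `planRasterConvert` and the `RasterPlan` type are hypothetical names, "20 MB" is read as 20 MiB of source bytes, and the `over-cap` outcome stands in for whatever Kodori actually does when a file exceeds the caps.

```typescript
const MAX_PAGES = 50;
const MAX_BYTES = 20 * 1024 * 1024; // assumption: "20 MB" means 20 MiB

// Hypothetical result type describing how a raster source would be routed.
type RasterPlan =
  | { kind: "png-to-claude" }
  | { kind: "pdf-to-claude"; pages: number }
  | { kind: "over-cap"; reason: string };

function planRasterConvert(pageCount: number, byteLength: number): RasterPlan {
  if (byteLength > MAX_BYTES) {
    return { kind: "over-cap", reason: "exceeds 20 MB" };
  }
  if (pageCount > MAX_PAGES) {
    return { kind: "over-cap", reason: "exceeds 50 pages" };
  }
  // Single-page sources become one PNG; multi-page sources are wrapped
  // into a single PDF so Claude reads them in one request.
  return pageCount <= 1
    ? { kind: "png-to-claude" }
    : { kind: "pdf-to-claude", pages: pageCount };
}
```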

**HEIC / HEIF specifically.** iPhones default to HEIC since iOS 11. The /capture mobile camera workflow uploads HEIC unless the user has switched the camera to "Most Compatible" in Settings (most users haven't). The raster-convert extractor handles HEIC the same way it handles TIFF — sharp decodes via libheif, normalizes to PNG, routes to Claude. Works on every fresh tenant.

**Legacy Office (.doc), Outlook emails (.msg), and ZIP archives** (D317, v0.7.101) — `.doc` (1997-2003 binary Word) routes through `word-extractor` (pure-JS, no LibreOffice dep); `.msg` (Outlook saved emails) routes through `@kenjiuno/msgreader` and produces a transcript-style text (Subject / From / To / Cc / Date / Body) plus a structured email-metadata block including the attachment list. `.zip` produces a manifest text listing every inner file's path + byte size so the archive becomes searchable by inner file name. Auto-fanout (each inner file becomes its own document) is on the roadmap but not in v0.7.101 — for now, if you want the inner files individually searchable, unzip locally and re-upload.
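
To make the two text shapes concrete, here is a rough sketch of a ZIP manifest and an email transcript. The function names, interfaces, and exact line formats are assumptions for illustration; only the field order (Subject / From / To / Cc / Date / Body) and the "path + byte size" content come from the description above.

```typescript
// Hypothetical entry shape for one file inside an archive.
interface ZipEntry {
  path: string;
  byteSize: number;
}

// One line per inner file, so the archive is searchable by inner file name.
function zipManifest(entries: ZipEntry[]): string {
  return entries.map((e) => `${e.path} (${e.byteSize} bytes)`).join("\n");
}

// Hypothetical parsed-email shape.
interface EmailFields {
  subject: string;
  from: string;
  to: string[];
  cc: string[];
  date: string;
  body: string;
}

// Transcript-style text in the order Subject / From / To / Cc / Date / Body.
function msgTranscript(m: EmailFields): string {
  return [
    `Subject: ${m.subject}`,
    `From: ${m.from}`,
    `To: ${m.to.join(", ")}`,
    `Cc: ${m.cc.join(", ")}`,
    `Date: ${m.date}`,
    "",
    m.body,
  ].join("\n");
}
```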

**What's still NOT supported today.** `.one` (Microsoft OneNote — proprietary binary format, no good open-source library; convert to PDF in OneNote → Print to PDF before upload), `.tar.gz` / `.7z` / `.rar` (unzip locally and re-upload contents), raw camera files (NEF / CR2 / ARW / DNG — convert to JPEG or shoot in JPEG mode), CAD (`.dwg` / `.dxf` — export to PDF from AutoCAD), encrypted PDFs (remove the password and re-upload), video (`.mp4` / `.mov` — extract audio and re-upload as `.m4a` for Whisper transcription), and `application/octet-stream` (browser couldn't determine the MIME on upload — re-upload with the correct file extension). If you hit a gap not on this list, the doc lands at `status='unsupported'` — file an issue or ping us; we add formats by customer demand.

**Cost-bearing extractors.** Azure / DocAI / Whisper / Claude all cost money per call. The plan-tier quota gate (`requireQuota`) checks before fetching blob bytes so a Free-tier tenant can't bulk-upload 100 PDFs and rack up Anthropic vision charges past the monthly cap. The raster-convert extractor delegates to Claude under the hood, so it's also gated.
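
The ordering matters more than the check itself: the gate runs before any bytes are fetched. A minimal sketch of that ordering, assuming a simple per-call counter; the `Quota` shape, the `requireQuota` signature shown here, and `extractCostBearing` are illustrative guesses, not the real implementation.

```typescript
// Hypothetical quota record; the real requireQuota signature is not shown here.
interface Quota {
  usedThisMonth: number;
  monthlyCap: number;
}

function requireQuota(quota: Quota): void {
  if (quota.usedThisMonth >= quota.monthlyCap) {
    throw new Error("quota exceeded");
  }
}

async function extractCostBearing(
  quota: Quota,
  fetchBlob: () => Promise<Uint8Array>,
  extract: (bytes: Uint8Array) => Promise<string>,
): Promise<string> {
  // Gate first: an over-quota tenant never triggers the blob fetch,
  // let alone the paid API call.
  requireQuota(quota);
  const bytes = await fetchBlob();
  const text = await extract(bytes);
  quota.usedThisMonth += 1; // assumption: one quota unit per call
  return text;
}
```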

**Re-running extraction in bulk.** The dashboard's "Re-run for all (N)" button (visible to owners + admins, only when N > 0) re-queues every doc that has never been extracted, has failed, returned `unsupported`, or has been pending / running for more than 5 minutes (events that lost their worker). The click is durable: the action returns sub-second with "Queued ~N — running in background", then a background Inngest workflow (`extract-all-pending`) handles the bulk in 1,000-row chunks. This works at any tenant size; before D305, the inline implementation timed out at around 14k stuck docs, and the page looked as if nothing had happened, with the counts unchanged. A per-tenant concurrency key with a limit of 1 prevents double-fanout from rapid re-clicks; the dashboard's auto-refresh ticks the counts down as workers finish.
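
The selection rule and the chunking can be sketched as two small helpers. Everything named here is an assumption for illustration: the status names other than `unsupported`, the `DocRow` shape, and the use of an `updatedAt` timestamp to measure how long a doc has been pending or running.

```typescript
// Hypothetical status values; only 'unsupported' is named in this article.
type ExtractStatus = "none" | "pending" | "running" | "failed" | "unsupported" | "done";

interface DocRow {
  id: string;
  status: ExtractStatus;
  updatedAt: number; // epoch milliseconds of the last status change
}

const STALE_MS = 5 * 60 * 1000; // pending / running longer than 5 minutes

function needsRerun(doc: DocRow, now: number): boolean {
  switch (doc.status) {
    case "none":
    case "failed":
    case "unsupported":
      return true;
    case "pending":
    case "running":
      // Treat long-running work as an event that lost its worker.
      return now - doc.updatedAt > STALE_MS;
    default:
      return false;
  }
}

// Split the re-queue list into fixed-size chunks (1,000 rows by default).
function chunk<T>(rows: T[], size: number = 1000): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < rows.length; i += size) {
    out.push(rows.slice(i, i + size));
  }
  return out;
}
```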

**Reading image previews.** The /doc/[id] page renders every image MIME (PNG / JPEG / TIFF / HEIC / etc.) inline at up to 80% of the viewport height — but court-filing TIFFs and high-resolution scans are routinely 2,000+ px wide, so even a generous inline view doesn't show enough detail to read. Click the inline image to open the full-window lightbox: mouse-wheel zooms toward the cursor, drag pans when zoomed beyond the viewport, the toolbar exposes +/− buttons + a "Fit" reset, and Esc closes. Keyboard: `+` / `−` zoom, `0` resets, `Esc` closes. Works on any image MIME including the TIFF / HEIC / BMP types that route through Kodori's server-side raster-convert pipeline (every browser except Safari refuses to render those natively, so the preview endpoint converts them to PNG before serving — the lightbox sees a PNG regardless of the source format).
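
"Zooms toward the cursor" usually means the image point under the pointer stays fixed while the scale changes. The sketch below shows that common technique; it is an assumption about how such a lightbox could work, not a description of Kodori's viewer, and `View` and `zoomAt` are hypothetical names.

```typescript
// Hypothetical view state: the image is drawn at (tx, ty) with a uniform scale.
interface View {
  scale: number;
  tx: number;
  ty: number;
}

// Zoom by `factor` while keeping the image point under the cursor (cx, cy)
// at the same screen position. The image-space point under the cursor is
// (cx - tx) / scale; solving for the new offset gives the expressions below.
function zoomAt(view: View, cx: number, cy: number, factor: number): View {
  return {
    scale: view.scale * factor,
    tx: cx - (cx - view.tx) * factor,
    ty: cy - (cy - view.ty) * factor,
  };
}
```

For example, starting from scale 1 with no offset, zooming 2x at cursor (100, 50) gives scale 2 with offset (-100, -50), so the pixel that was under the cursor is still under it.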

