When a file from SharePoint, OneDrive, or Google Drive lands in Kodori through a connector sync, its metadata (name, MIME type, URL, size, last-modified) is persisted immediately. The bytes themselves are downloaded and extracted to plain text in a follow-on background step, so the file becomes searchable on its body content, not just its filename.
## Pipeline
1. **Sync worker** pulls the file metadata via the vendor's delta API and inserts a row in `external_documents` with `text: null`.
2. **Sync function** detects newly inserted rows (via `onConflictDoNothing().returning(...)`) and fires one `external-document/extract.requested` Inngest event per row.
3. **Extract function** (concurrency-keyed on `documentId` so two parallel ticks don't double-spend the LLM call):
   - Decrypts the connector's access token via the application-layer token vault
   - Downloads bytes via the vendor-specific helper (Graph `/drives/{id}/items/{id}/content` for SharePoint/OneDrive; `/files/{id}?alt=media` for Drive binaries; `/files/{id}/export?mimeType=text/plain` for Google-native Docs/Sheets/Slides)
   - Runs the existing extractor cascade against the bytes: Azure Doc Intel → Office adapters → illustrator-ai → Whisper → Google DocAI → Claude PDF → builtin-text
   - `UPDATE`s `external_documents.text` + `extracted_at`
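Step 2 is essentially a pure mapping from the rows returned by `onConflictDoNothing().returning(...)` to event payloads. A minimal sketch, assuming an illustrative row shape and event payload (the real schema and Inngest event data may differ):

```typescript
// Sketch of step 2: one extract.requested event per newly inserted row.
// InsertedRow fields and the event `data` shape are assumptions.
interface InsertedRow {
  id: string;
  connectorId: string;
  text: string | null;
}

interface ExtractEvent {
  name: "external-document/extract.requested";
  data: { documentId: string; connectorId: string };
}

function extractEventsFor(insertedRows: InsertedRow[]): ExtractEvent[] {
  return insertedRows
    .filter((row) => row.text === null) // only rows that still need extraction
    .map((row) => ({
      name: "external-document/extract.requested",
      data: { documentId: row.id, connectorId: row.connectorId },
    }));
}
```

Because `returning(...)` only yields rows the insert actually created, re-running the sync against unchanged vendor state produces zero events, which is what keeps extraction idempotent per file.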
## What gets extracted
- **Office formats** (.docx, .xlsx, .pptx) — pure-JS adapters, no LLM cost
- **PDFs + images** — Azure Doc Intel when configured, Google DocAI as fallback, Claude vision as a last resort
- **Plain text + JSON + YAML + CSV** — built-in UTF-8 decode, sub-millisecond
- **Adobe Illustrator (.ai)** — sniffs the embedded `%PDF-` magic bytes and routes through Claude PDF
- **Audio attachments** — Whisper transcription when `OPENAI_API_KEY` is set
- **Google Docs / Sheets / Slides** — exported as plain text by Drive's `/export` endpoint, no extractor cascade needed
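The routing above can be sketched as a dispatch on extension plus, for `.ai`, a scan for the embedded PDF magic bytes. This is illustrative only: `pickRoute`, the route names, and the exact precedence are assumptions; only the `%PDF-` sniff mirrors the documented behavior.

```typescript
// .ai files wrap a PDF; scan the first 1 KB of bytes for "%PDF-".
function sniffsAiAsPdf(bytes: Uint8Array): boolean {
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  const limit = Math.min(bytes.length, 1024) - magic.length;
  outer: for (let i = 0; i <= limit; i++) {
    for (let j = 0; j < magic.length; j++) {
      if (bytes[i + j] !== magic[j]) continue outer;
    }
    return true;
  }
  return false;
}

// Hypothetical dispatcher over the documented format groups.
function pickRoute(filename: string, bytes: Uint8Array): string {
  const ext = filename.toLowerCase().split(".").pop() ?? "";
  if (["docx", "xlsx", "pptx"].includes(ext)) return "office-adapter";
  if (ext === "ai" && sniffsAiAsPdf(bytes)) return "claude-pdf";
  if (["txt", "json", "yaml", "yml", "csv"].includes(ext)) return "builtin-text";
  if (["mp3", "wav", "m4a"].includes(ext)) return "whisper";
  return "ocr-cascade"; // PDFs/images fall through to Doc Intel → DocAI → Claude
}
```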
## Caps
- **50MB byte cap** matches the Kodori upload limit. Files larger than this fail with `oversize_<bytes>_bytes`, and that error persists to `extraction_error` so they're visible for triage but don't block subsequent extractions.
- **2MB stored-text cap** truncates the extracted output. 2MB of dense prose is roughly 500 pages, well past the threshold where additional content adds retrieval value. The truncation is silent — the row still gets `extracted_at` populated so we don't retry.
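Both caps reduce to small pure checks. A sketch, assuming the text cap is measured in characters and that these helper names exist only for illustration:

```typescript
const BYTE_CAP = 50 * 1024 * 1024; // 50MB download limit
const TEXT_CAP = 2 * 1024 * 1024;  // 2MB stored-text limit

// Returns the error string persisted to extraction_error, or null if the
// file is within the cap.
function oversizeError(sizeBytes: number): string | null {
  return sizeBytes > BYTE_CAP ? `oversize_${sizeBytes}_bytes` : null;
}

// Silent truncation: the caller still sets extracted_at, so the row is
// treated as successfully extracted and never retried.
function capText(text: string): string {
  return text.length > TEXT_CAP ? text.slice(0, TEXT_CAP) : text;
}
```

Embedding the byte count in the error string means the triage view shows how far over the cap each file was without a separate column.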
## What does NOT get extracted
- **Slack / Gmail / Outlook** — message kinds already carry plain text in the `body` column (HTML bodies are stripped to plain text by the message workers). Firing extract events for messages would be a no-op.
- **Folders** — sync workers skip folder rows; nothing to extract.
- **Soft-deleted vendor files** — sync workers skip these on subsequent runs; previously extracted rows stay until a cron prunes them.
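The skip rules above amount to a single predicate the sync workers can apply before firing events. A sketch with assumed field names (`kind`, `vendorDeleted`), not the real schema:

```typescript
interface SyncedItem {
  kind: "file" | "message" | "folder";
  vendorDeleted: boolean;
}

// Hypothetical gate: only live vendor files earn an extract event.
function shouldFireExtract(item: SyncedItem): boolean {
  if (item.kind === "message") return false; // body column already has plain text
  if (item.kind === "folder") return false;  // nothing to extract
  if (item.vendorDeleted) return false;      // skipped on subsequent sync runs
  return true;
}
```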
## Extraction status
Each `external_documents` row carries:
- `extracted_at IS NULL AND extraction_error IS NULL` — never tried
- `extracted_at IS NOT NULL` — successfully extracted (text in the `text` column)
- `extraction_error IS NOT NULL` — last attempt failed; retry candidate
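Derived in application code, the three states collapse to one small function. A sketch that assumes success takes precedence when both columns happen to be set (e.g. a retry succeeded after an earlier failure); the real precedence may differ:

```typescript
type ExtractionStatus = "never_tried" | "extracted" | "failed";

function extractionStatus(row: {
  extractedAt: Date | null;
  extractionError: string | null;
}): ExtractionStatus {
  if (row.extractedAt !== null) return "extracted";   // text column is populated
  if (row.extractionError !== null) return "failed";  // retry candidate
  return "never_tried";
}
```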
Future: a per-connector "what % of my files extracted clean?" panel on `/integrations/[id]`.
## Search impact
After extraction, `searchExternalContent` (the agent's connector-search MCP tool) hits FTS + pgvector over the new `text` content. A SharePoint Word doc named `Q3-board-deck.docx` containing the phrase "indemnity language" now surfaces on either query — by name OR by content. Pre-D229, only the filename match worked.