When a file from SharePoint, OneDrive, or Google Drive lands in Kodori through a connector sync, its metadata (name, MIME type, URL, size, last-modified) is persisted immediately. The bytes themselves are downloaded and extracted to plain text in a follow-on background step, so the file becomes searchable on its body content, not just its filename.
## Pipeline
1. **Sync worker** pulls the file metadata via the vendor's delta API and inserts a row in `external_documents` with `text: null`.
2. **Sync function** detects newly inserted rows (via `onConflictDoNothing().returning(...)`) and fires one `external-document/extract.requested` Inngest event per row.
3. **Extract function** (concurrency-keyed on `documentId` so two parallel ticks don't double-spend the LLM call):
   - Decrypts the connector's access token via the application-layer token vault
   - Downloads bytes via the vendor-specific helper (Graph `/drives/{id}/items/{id}/content` for SharePoint/OneDrive; `/files/{id}?alt=media` for Drive binaries; `/files/{id}/export?mimeType=text/plain` for Google-native Docs/Sheets/Slides)
   - Runs the existing extractor cascade against the bytes: Azure Doc Intel → Office adapters → illustrator-ai → Whisper → Google DocAI → Claude PDF → builtin-text
   - `UPDATE`s `external_documents.text` + `extracted_at`
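Step 2 is essentially a pure mapping from the rows returned by `onConflictDoNothing().returning(...)` to event payloads. A minimal sketch, assuming an illustrative row shape and event payload (the real schema and Inngest event data may differ):

```typescript
// Sketch of step 2: one extract.requested event per newly inserted row.
// InsertedRow fields and the event `data` shape are assumptions.
interface InsertedRow {
  id: string;
  connectorId: string;
  text: string | null;
}

interface ExtractEvent {
  name: "external-document/extract.requested";
  data: { documentId: string; connectorId: string };
}

function extractEventsFor(insertedRows: InsertedRow[]): ExtractEvent[] {
  return insertedRows
    .filter((row) => row.text === null) // only rows that still need extraction
    .map((row) => ({
      name: "external-document/extract.requested",
      data: { documentId: row.id, connectorId: row.connectorId },
    }));
}
```

Because `returning(...)` only yields rows the insert actually created, re-running the sync against unchanged vendor state produces zero events, which is what keeps extraction idempotent per file.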
## What gets extracted
- **Office formats** (.docx, .xlsx, .pptx) — pure-JS adapters, no LLM cost
- **PDFs + images** — Azure Doc Intel when configured, Google DocAI as fallback, Claude vision as a last resort
- **Plain text + JSON + YAML + CSV** — built-in UTF-8 decode, sub-millisecond
- **Adobe Illustrator (.ai)** — sniffs the embedded `%PDF-` magic bytes and routes through Claude PDF
- **Audio attachments** — Whisper transcription when `OPENAI_API_KEY` is set
- **Google Docs / Sheets / Slides** — exported as plain text by Drive's `/export` endpoint, no extractor cascade needed
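The routing above can be sketched as a dispatch on extension plus, for `.ai`, a scan for the embedded PDF magic bytes. This is illustrative only: `pickRoute`, the route names, and the exact precedence are assumptions; only the `%PDF-` sniff mirrors the documented behavior.

```typescript
// .ai files wrap a PDF; scan the first 1 KB of bytes for "%PDF-".
function sniffsAiAsPdf(bytes: Uint8Array): boolean {
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  const limit = Math.min(bytes.length, 1024) - magic.length;
  outer: for (let i = 0; i <= limit; i++) {
    for (let j = 0; j < magic.length; j++) {
      if (bytes[i + j] !== magic[j]) continue outer;
    }
    return true;
  }
  return false;
}

// Hypothetical dispatcher over the documented format groups.
function pickRoute(filename: string, bytes: Uint8Array): string {
  const ext = filename.toLowerCase().split(".").pop() ?? "";
  if (["docx", "xlsx", "pptx"].includes(ext)) return "office-adapter";
  if (ext === "ai" && sniffsAiAsPdf(bytes)) return "claude-pdf";
  if (["txt", "json", "yaml", "yml", "csv"].includes(ext)) return "builtin-text";
  if (["mp3", "wav", "m4a"].includes(ext)) return "whisper";
  return "ocr-cascade"; // PDFs/images fall through to Doc Intel → DocAI → Claude
}
```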
## Caps
- **50MB byte cap** matches the Kodori upload limit. Files larger than this fail with `oversize_<bytes>_bytes`, and that error persists to `extraction_error` so they're visible for triage but don't block subsequent extractions.
- **2MB stored-text cap** truncates the extracted output. 2MB of dense prose is roughly 500 pages, well past the threshold where additional content adds retrieval value. The truncation is silent — the row still gets `extracted_at` populated so we don't retry.
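Both caps reduce to small pure checks. A sketch, assuming the text cap is measured in characters and that these helper names exist only for illustration:

```typescript
const BYTE_CAP = 50 * 1024 * 1024; // 50MB download limit
const TEXT_CAP = 2 * 1024 * 1024;  // 2MB stored-text limit

// Returns the error string persisted to extraction_error, or null if the
// file is within the cap.
function oversizeError(sizeBytes: number): string | null {
  return sizeBytes > BYTE_CAP ? `oversize_${sizeBytes}_bytes` : null;
}

// Silent truncation: the caller still sets extracted_at, so the row is
// treated as successfully extracted and never retried.
function capText(text: string): string {
  return text.length > TEXT_CAP ? text.slice(0, TEXT_CAP) : text;
}
```

Embedding the byte count in the error string means the triage view shows how far over the cap each file was without a separate column.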
## What does NOT get extracted
- **Slack / Gmail / Outlook** — message kinds already carry plain text in the `body` column (HTML bodies are stripped to plain text by the message workers). Firing extract events for messages would be a no-op.
- **Folders** — sync workers skip folder rows; nothing to extract.
- **Soft-deleted vendor files** — sync workers skip these on subsequent runs; previously extracted rows stay until a cron prunes them.
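The skip rules above amount to a single predicate the sync workers can apply before firing events. A sketch with assumed field names (`kind`, `vendorDeleted`), not the real schema:

```typescript
interface SyncedItem {
  kind: "file" | "message" | "folder";
  vendorDeleted: boolean;
}

// Hypothetical gate: only live vendor files earn an extract event.
function shouldFireExtract(item: SyncedItem): boolean {
  if (item.kind === "message") return false; // body column already has plain text
  if (item.kind === "folder") return false;  // nothing to extract
  if (item.vendorDeleted) return false;      // skipped on subsequent sync runs
  return true;
}
```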
## Extraction status
Each `external_documents` row carries:
- `extracted_at IS NULL AND extraction_error IS NULL` — never tried
- `extracted_at IS NOT NULL` — successfully extracted (text in the `text` column)
- `extraction_error IS NOT NULL` — last attempt failed; retry candidate
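Derived in application code, the three states collapse to one small function. A sketch that assumes success takes precedence when both columns happen to be set (e.g. a retry succeeded after an earlier failure); the real precedence may differ:

```typescript
type ExtractionStatus = "never_tried" | "extracted" | "failed";

function extractionStatus(row: {
  extractedAt: Date | null;
  extractionError: string | null;
}): ExtractionStatus {
  if (row.extractedAt !== null) return "extracted";   // text column is populated
  if (row.extractionError !== null) return "failed";  // retry candidate
  return "never_tried";
}
```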
Future: a per-connector "what % of my files extracted clean?" panel on `/integrations/[id]`.
## Search impact
After extraction, `searchExternalContent` (the agent's connector-search MCP tool) hits FTS + pgvector over the new `text` content. A SharePoint Word doc named `Q3-board-deck.docx` containing the phrase "indemnity language" now surfaces on either query — by name OR by content. Pre-D229, only the filename match worked.