Connectors today ship in two shapes:
- **File-storage vendors** (SharePoint, OneDrive, Google Drive): every synced item is a document.
- **Message vendors** (Slack, Outlook, Gmail): the message body indexes into `external_messages`, and file attachments land in `external_documents`.

After v0.7.36 (Slack files) and v0.7.37 (Outlook and Gmail attachments), all six connector kinds handle attached file content uniformly. A PDF contract emailed via Outlook, a board deck attached to a Gmail thread, or a .docx shared in a Slack channel all flow into the same extraction pipeline as a direct SharePoint upload, and all are searchable in `searchExternalContent` by body content, not just filename.
## What's collected
- **Slack files**: every file uploaded to a channel the bot has been invited to. `files.list` is workspace-wide, so coverage is determined by which channels you've invited @Kodori to.
- **Outlook attachments**: every `fileAttachment` on every message in your Inbox. Inline images (signature scans, embedded screenshots) are skipped because they are already in the message body. `itemAttachment` (forwarded emails) and `referenceAttachment` (OneDrive links) are skipped to avoid duplicating content that is already synced from its actual source.
- **Gmail attachments**: every message part that has `body.attachmentId` set and a discoverable filename (either `payload.filename` or the `Content-Disposition` header).
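
The Outlook and Gmail rules above can be sketched as two small predicates. This is a minimal illustration, not the connector's actual code: the function names are hypothetical, and the input dictionaries are assumed to follow the Microsoft Graph attachment shape (`@odata.type`, `isInline`) and the Gmail message-part shape (`filename`, `headers`, `body.attachmentId`).

```python
from email.message import Message
from typing import Any, Optional

FILE_ATTACHMENT = "#microsoft.graph.fileAttachment"


def should_collect_outlook(attachment: dict[str, Any]) -> bool:
    """Keep only non-inline fileAttachment entries (hypothetical helper)."""
    if attachment.get("@odata.type") != FILE_ATTACHMENT:
        # itemAttachment and referenceAttachment are skipped.
        return False
    return not attachment.get("isInline", False)


def gmail_filename(part: dict[str, Any]) -> Optional[str]:
    """Find a filename on the part itself or in its Content-Disposition header."""
    if part.get("filename"):
        return part["filename"]
    for header in part.get("headers", []):
        if header.get("name", "").lower() == "content-disposition":
            # Let the standard library parse the header parameters.
            msg = Message()
            msg["Content-Disposition"] = header.get("value", "")
            return msg.get_filename()
    return None


def should_collect_gmail(part: dict[str, Any]) -> bool:
    """Require both an attachment id and a discoverable filename."""
    has_id = bool(part.get("body", {}).get("attachmentId"))
    return has_id and gmail_filename(part) is not None
```

Under these assumptions, a Gmail part with an empty `filename` but a `Content-Disposition: attachment; filename="deck.pdf"` header is still collected, while an inline Outlook signature image is not.
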
## How extraction works
Each new attachment row in `external_documents` fires an `external-document/extract.requested` event. The extractor cascade picks the adapter based on MIME type: Azure Doc Intel for PDFs, Office adapters for .docx/.xlsx/.pptx, Whisper for audio, and Claude vision as the last, LLM-driven fallback. The extracted text lands in `external_documents.text` and becomes searchable via the FTS + pgvector hybrid retrieval (D226).
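
As a rough illustration of MIME-based dispatch, the sketch below maps a MIME type to an adapter label. The adapter labels and the `pick_adapter` function are hypothetical, and the sketch sends every unmatched type to the vision fallback; the real cascade presumably rejects some types outright, since unsupported MIME types are listed among the permanent failures.

```python
# Standard MIME types for the three Office formats named above.
OFFICE_MIME_TYPES = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",  # .docx
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",  # .xlsx
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",  # .pptx
}


def pick_adapter(mime_type: str) -> str:
    """Return a hypothetical adapter label for the given MIME type."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = mime_type.split(";")[0].strip().lower()
    if mime == "application/pdf":
        return "azure_doc_intel"
    if mime in OFFICE_MIME_TYPES:
        return "office"
    if mime.startswith("audio/"):
        return "whisper"
    # Assumption: everything else goes to the last, LLM-driven fallback.
    return "claude_vision"
```

Ordering matters in a cascade like this: the cheaper, format-specific adapters are checked first, and the LLM-driven path is reached only when nothing else matches.
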
## Caps and back-pressure
- **50MB byte cap** per attachment (matches the Kodori upload limit). Oversize attachments fail with `oversize_<bytes>_bytes`, which is visible in `extraction_error`.
- **2MB stored-text cap**: extracted text is truncated to 2MB (≈500 pages of dense prose).
- **Outlook fan-out cap of 50 messages-with-attachments per sync run**: subsequent ticks pick up the rest.
- **Gmail attachments** are bounded by the per-run message cap. There is no separate attachment cap; every message's attachments are processed inline.
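
The two size caps can be sketched as follows. The helper names are hypothetical, and two details are assumptions rather than documented behavior: that 1MB means 1024 × 1024 bytes, and that the stored-text cap is measured in UTF-8 bytes.

```python
from typing import Optional

MAX_ATTACHMENT_BYTES = 50 * 1024 * 1024  # assumption: binary megabytes
MAX_STORED_TEXT_BYTES = 2 * 1024 * 1024  # assumption: measured in UTF-8 bytes


def oversize_error(size_bytes: int) -> Optional[str]:
    """Return the error code for an oversize attachment, or None if it fits."""
    if size_bytes > MAX_ATTACHMENT_BYTES:
        return f"oversize_{size_bytes}_bytes"
    return None


def truncate_text(text: str) -> str:
    """Cut extracted text down to the stored-text cap."""
    encoded = text.encode("utf-8")
    if len(encoded) <= MAX_STORED_TEXT_BYTES:
        return text
    # errors="ignore" drops a multi-byte character split by the cut.
    return encoded[:MAX_STORED_TEXT_BYTES].decode("utf-8", errors="ignore")
```

Checking the byte cap before download or extraction keeps an oversize file from consuming extractor time, while the text cap applies only after extraction has succeeded.
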
## What's NOT collected
- Slack DM file uploads (`im.files`): requires the `im:read` and `im:history` scopes; deferred until customer signal.
- Outlook `itemAttachment` recursion (file attachments inside forwarded emails): also deferred until customer signal.
- Gmail attachments in labels other than INBOX: currently only the Inbox label syncs.
- Inline images and signature graphics: intentional skip.
## Retry behavior
Failed attachment extractions stay on the row with `extraction_error` set. There are two recovery paths:
- **Manual: "Retry failed" button on /integrations/[id]** (admin / owner only). It appears next to the failed-count pill on the extraction status panel when the failure count is above 0. Each click fans out up to 500 retries; if more remain, click again. The request returns within seconds. Use this after a known outage.
- **Automatic: 6-hour retry-cron (D231)** sweeps stale failures across every connected connector and re-fires extract events. Permanent failures (oversize, unsupported mime, vendor 404) get one cheap retry every 6h before stopping; transient outages (Anthropic 503, Azure rate-limit storm) recover within 1-4 retries without operator action.
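
The manual path's batching can be pictured with a short sketch. The function name and the event payload shape (`data.documentId`) are hypothetical; only the batch size of 500 and the event name come from the description above.

```python
RETRY_BATCH_SIZE = 500  # per-click fan-out size described above
EXTRACT_EVENT = "external-document/extract.requested"


def plan_retry_batch(
    failed_ids: list[str], batch_size: int = RETRY_BATCH_SIZE
) -> tuple[list[dict], int]:
    """Build extract events for one batch and report how many failures remain."""
    batch = failed_ids[:batch_size]
    # Hypothetical payload shape: one event per failed document row.
    events = [{"name": EXTRACT_EVENT, "data": {"documentId": doc_id}} for doc_id in batch]
    remaining = len(failed_ids) - len(batch)
    return events, remaining
```

With 1,200 failed rows, the first call yields 500 events and reports 700 remaining, so clearing the backlog takes three clicks.
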