Build a Document Intelligence Agent

Extract structured, queryable data from PDFs and unstructured documents so the agent can reason over it, store it, and hand off clean records downstream.

The actual problem is that your documents do not agree on a format. One vendor sends a two-column PDF invoice. Another sends a scanned image. A third sends a Word doc with merged cells. Engineers building document pipelines spend most of their time writing one-off parsers for each new format, then debugging the silent failures when a table shifts by one column. The people who need this stack are ops teams at law firms, finance teams processing supplier invoices, and SaaS teams whose customers upload anything. The hard part is not the extraction itself but getting the model to return a reliably typed object every time, rather than a plausible-looking string that breaks downstream on Wednesday.

The stack

Each pick is a real entry on the index.

1

Agents: Driver

Claude Code

by Anthropic

Why this: Headless Claude Code processes incoming documents in batch via a cron job or file-watch trigger without any UI wrapper. Skills encode the extraction contract per document type, so the agent runs the same typed pipeline whether the input is an invoice PDF or a scanned W-9.
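A minimal sketch of the batch trigger described above, assuming a cron job calls `run_batch` against an inbox folder. The prompt text, the folder layout, and the helper names (`build_command`, `pending_documents`, `run_batch`) are illustrative assumptions; the only Claude Code specifics used are the `-p` print mode and the `--output-format json` flag.

```python
import subprocess
from pathlib import Path

# Hypothetical extraction prompt; in the stack this contract would live in a skill.
PROMPT = "Extract the invoice fields from {path} and reply with JSON only."


def build_command(doc: Path) -> list[str]:
    """Build a headless Claude Code invocation for one document."""
    return [
        "claude",
        "-p", PROMPT.format(path=doc),  # -p runs one prompt without the interactive UI
        "--output-format", "json",      # machine-readable result on stdout
    ]


def pending_documents(inbox: Path, done: Path) -> list[Path]:
    """Return PDFs in the inbox that have no result file in `done` yet."""
    return sorted(
        p for p in inbox.glob("*.pdf")
        if not (done / f"{p.stem}.json").exists()
    )


def run_batch(inbox: Path, done: Path, runner=subprocess.run) -> int:
    """Process every pending document once; meant to be called from cron."""
    done.mkdir(parents=True, exist_ok=True)
    processed = 0
    for doc in pending_documents(inbox, done):
        result = runner(build_command(doc), capture_output=True, text=True)
        if result.returncode != 0:
            continue  # leave the file pending so the next run retries it
        (done / f"{doc.stem}.json").write_text(result.stdout)
        processed += 1
    return processed
```

Because a document only counts as done once its result file exists, a failed run is retried on the next schedule instead of disappearing. The `runner` parameter exists so the loop can be exercised without the CLI installed.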
2

MCP: PDF surface

AryanBV/pdf-toolkit-mcp

by Various

Why this: Twenty-two tools covering read, render, and transform. For scanned pages the vision rendering path is what prevents the pipeline from silently returning empty text on image-heavy PDFs. Zero native dependencies keeps the deployment footprint small.
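The empty-text failure mode can be guarded against with a small routing step. This is a sketch under assumptions: `read_text` and `render_page` are hypothetical wrappers around whichever read and render tools the MCP server exposes (their real names are not given here), and the `MIN_CHARS` threshold is an arbitrary starting value.

```python
from dataclasses import dataclass
from typing import Callable

MIN_CHARS = 20  # assumed threshold below which a page counts as having no text layer


@dataclass
class PageContent:
    page: int
    kind: str        # "text" or "image"
    payload: object  # extracted string, or rendered image bytes for the vision path


def extract_page(
    page: int,
    read_text: Callable[[int], str],      # hypothetical wrapper around the text-read tool
    render_page: Callable[[int], bytes],  # hypothetical wrapper around the render tool
) -> PageContent:
    """Prefer the text layer; fall back to a rendered image when it is missing."""
    text = read_text(page).strip()
    if len(text) >= MIN_CHARS:
        return PageContent(page, "text", text)
    # Scanned or image-heavy page: hand the model a rendered image instead of
    # passing an empty string downstream.
    return PageContent(page, "image", render_page(page))
```

Tagging each page with its `kind` keeps the fallback visible in logs, so a batch that suddenly shifts to mostly image pages is noticed rather than silently degraded.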
3

MCP: Format normaliser

microsoft/markitdown

by Various

Why this: Not every document is a PDF. This tool converts Word docs, Excel sheets, and image files into clean Markdown before the extraction step, so the agent sees a consistent text format regardless of what the user uploaded. Its 138k stars suggest the conversion has been exercised on many of the weird edge cases.
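As a rough illustration of the normalisation step, the sketch below uses MarkItDown's Python interface (`MarkItDown().convert(path).text_content`) directly rather than through MCP. The split between PDFs and everything else, and the helper names `needs_normalising` and `to_markdown`, are assumptions made for the example.

```python
from pathlib import Path

# Formats sent through the PDF tooling instead of the normaliser (an assumption
# about how the two picks divide the work).
PDF_SUFFIXES = {".pdf"}


def needs_normalising(path: str) -> bool:
    """True for anything that is not a PDF, e.g. .docx, .xlsx, .png."""
    return Path(path).suffix.lower() not in PDF_SUFFIXES


def to_markdown(path: str, converter=None) -> str:
    """Convert one file to Markdown text."""
    if converter is None:
        # Imported lazily so the routing logic works without the package installed.
        from markitdown import MarkItDown
        converter = MarkItDown()
    return converter.convert(path).text_content
```

Calling `to_markdown("supplier_list.xlsx")` would return the sheet as Markdown text, assuming the `markitdown` package is installed along with the optional dependencies for that format.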
4

OSS: Structured output

Instructor

by Jason Liu (community)

Why this: Instructor patches the Anthropic client to return a Pydantic model on every extraction call instead of a raw string. When the model returns a malformed field it retries automatically. For a pipeline that processes hundreds of documents a day, that retry logic is what prevents silent data loss at row 247.
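The validate-then-retry idea can be shown without any dependencies. The sketch below is a standard-library stand-in for the pattern, not Instructor's API: with Instructor the same behaviour comes from passing `response_model` and `max_retries` to the patched client. The `Invoice` fields and the `ask_model` callable are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Invoice:
    vendor: str
    invoice_number: str
    total: float


def parse_invoice(raw: str) -> Invoice:
    """Turn a model reply into a typed record, raising ValueError on any bad field."""
    try:
        data = json.loads(raw)
        return Invoice(
            vendor=str(data["vendor"]),
            invoice_number=str(data["invoice_number"]),
            total=float(data["total"]),
        )
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"malformed extraction: {exc!r}") from exc


def extract_with_retry(
    ask_model: Callable[[str], str],  # hypothetical function: prompt in, reply text out
    document: str,
    max_retries: int = 2,
) -> Invoice:
    """Ask for JSON, validate it, and re-ask with the error message on failure."""
    prompt = f"Extract vendor, invoice_number and total as JSON.\n\n{document}"
    last_error: Exception = ValueError("no attempts made")
    for _ in range(max_retries + 1):
        try:
            return parse_invoice(ask_model(prompt))
        except ValueError as exc:
            last_error = exc
            # Feed the validation error back so the next attempt can correct it.
            prompt = f"{prompt}\n\nYour previous reply was rejected: {exc}. Reply with valid JSON only."
    raise last_error
```

The important property is that a reply which fails validation never reaches storage: it is either corrected on a later attempt or raised as an error that the batch job can record.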
5

OSS: Storage + retrieval

pgvector

by Community

Why this: Extracted records land in Postgres with embeddings stored via pgvector. This means the agent can do both exact-match queries (give me all invoices from vendor X) and semantic queries (find contracts that mention termination for convenience) against the same table without standing up a separate vector database.
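To make the "same table, two query styles" point concrete, here is a sketch of what the schema and queries might look like. The table and column names are hypothetical, the `%s` placeholders assume a psycopg-style driver, and the three-dimensional vector is a toy size. `vector(n)` and the `<=>` cosine distance operator are pgvector's own.

```python
# Hypothetical table for extracted records; column names are illustrative.
SCHEMA = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (
    id        bigserial PRIMARY KEY,
    vendor    text NOT NULL,
    doc_type  text NOT NULL,
    body      text NOT NULL,
    embedding vector(3)  -- toy dimension; use your embedding model's size
);
"""

# Exact match: all invoices from one vendor.
EXACT_QUERY = """
SELECT id, vendor FROM documents
WHERE vendor = %s AND doc_type = 'invoice'
"""

# Semantic match: contracts closest in meaning to a query embedding.
SEMANTIC_QUERY = """
SELECT id, vendor, body FROM documents
WHERE doc_type = 'contract'
ORDER BY embedding <=> %s::vector  -- <=> is pgvector's cosine distance
LIMIT %s
"""


def vector_literal(values) -> str:
    """Format floats in pgvector's text input form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def semantic_params(query_embedding, limit: int = 5) -> tuple:
    """Parameters for SEMANTIC_QUERY, in placeholder order."""
    return (vector_literal(query_embedding), limit)
```

With a psycopg cursor this would be run as `cur.execute(SEMANTIC_QUERY, semantic_params(embedding))`, where `embedding` comes from whichever embedding model the pipeline uses for the phrase "termination for convenience".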
6

MCP: Query interface

Postgres MCP Server

by Model Context Protocol (reference)

Why this: Once the extracted data is in Postgres the rest of your stack needs to query it. The reference Postgres MCP server gives any downstream agent read access to the schema and extracted records without anyone copy-pasting a connection string or opening a DB GUI.
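One way to keep the connection string out of anyone's clipboard is to assemble the server launch command from the environment at start-up. The sketch assumes the reference server is published on npm as `@modelcontextprotocol/server-postgres` and takes the database URL as its only argument; check both against the package's current documentation. The variable name `READONLY_DATABASE_URL` is hypothetical.

```python
import os

# Assumed npm package name of the reference server; verify before use.
SERVER_PACKAGE = "@modelcontextprotocol/server-postgres"


def server_command(env=None, var: str = "READONLY_DATABASE_URL") -> list[str]:
    """Build the command an MCP client would launch, reading the URL from the environment."""
    env = os.environ if env is None else env
    url = env.get(var)
    if not url:
        raise RuntimeError(f"{var} is not set; refusing to start without a database URL")
    if not url.startswith(("postgres://", "postgresql://")):
        raise RuntimeError(f"{var} does not look like a Postgres URL")
    return ["npx", "-y", SERVER_PACKAGE, url]
```

Pointing the variable at a database role that only has SELECT privileges on the extraction tables is a sensible extra guard, so read access is enforced by Postgres as well as by the server.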

Why we picked this stack

Get this running with Enterprise DNA.

Enterprise DNA connects this stack to an operating layer that most document pipelines are missing. Each extraction job runs as a work item in OPM so you know which documents processed, which failed, and who owns the follow-up. Extracted records that need a human decision land in the inbox rather than a Slack message nobody finds. Secrets for the Postgres connection and Anthropic key live in Infisical and get pulled at runtime, so neither ends up in a .env file on a developer's laptop. The CRM record for each customer is the downstream destination for the structured fields the agent extracts, closing the loop from document upload to live account data without a manual copy-paste step.


Alternative stacks

Different angles on the same outcome.

Alternative

Build a research agent

If the documents are public web sources rather than uploaded files, the research agent stack replaces the extraction pipeline with search MCPs and a delivery layer.

Alternative

Build a personal email assistant

Email attachments are often the delivery mechanism for the same PDFs and invoices. Combining the email assistant with this extraction stack routes attachments straight into the structured data pipeline.

Alternative