Use case
Build a Data Extraction Pipeline
Scrape structured data from web sources at scale and land it in a queryable database without writing a custom scraper per site.
The problem is not fetching a URL. The problem is that every site is different: some render in the browser, some block bots, some paginate in undocumented ways, and almost none return clean structured data. Data engineers who build these pipelines manually spend most of their time writing site-specific glue code that breaks every time the source redesigns a page. The stack below replaces that glue with an agent loop that adapts to layout changes, a browser layer for JavaScript-heavy targets, and a typed output surface that keeps downstream consumers from breaking when the shape shifts. The people who build these are usually data engineers, product analytics teams, and solo founders who need competitor or market data they cannot buy.
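As a rough illustration of what a typed output surface can look like, the sketch below validates one raw extraction against a fixed schema before it goes anywhere downstream. The `ProductRow` fields, the `from_raw` helper, and the price handling are all hypothetical and stand in for whatever shape your target table actually has.

```python
from dataclasses import dataclass
from typing import Any, Mapping


class RowValidationError(ValueError):
    """Raised when a raw extraction does not fit the target schema."""


@dataclass(frozen=True)
class ProductRow:
    # Hypothetical target schema; swap in the fields your table needs.
    source_url: str
    name: str
    price_cents: int

    @classmethod
    def from_raw(cls, raw: Mapping[str, Any]) -> "ProductRow":
        # Reject rows with absent or empty required fields up front.
        required = ("source_url", "name", "price")
        missing = [f for f in required if raw.get(f) in (None, "")]
        if missing:
            raise RowValidationError(f"missing fields: {missing}")
        try:
            # Accept "19.99", "$19.99" or 19.99 and store integer cents.
            cleaned = str(raw["price"]).replace("$", "").replace(",", "")
            price = float(cleaned)
        except ValueError as exc:
            raise RowValidationError(f"bad price: {raw['price']!r}") from exc
        return cls(
            source_url=str(raw["source_url"]),
            name=str(raw["name"]).strip(),
            price_cents=round(price * 100),
        )
```

Because every row passes through one constructor, a layout change at the source shows up as a `RowValidationError` at extraction time rather than as a malformed row in the database.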
The stack
Each pick is a real entry on the index. Click any one for the full detail page.
- 1. Agents: Pipeline driver
Claude Code
by Anthropic
Why this: Claude Code in headless mode runs the extraction loop: fetch a batch of URLs, decide whether to use the fetch or browser path, parse the raw output into a schema, and write rows to the database. Its skills system lets you encode the target schema once and reload it on every run without re-prompting. Cron wires it to a schedule.
- 2. MCP: Static page reader
Fetch MCP Server
by Anthropic (reference implementation)
Why this: Most pages on a target list are static HTML. The Fetch MCP server converts them to clean markdown in one call, which costs far fewer tokens than a screenshot and skips the overhead of launching a browser. Use this for the majority path and fall back to Playwright for the minority of JavaScript-rendered targets.
- 3. MCP: Browser layer for JS-rendered pages
Playwright MCP
by Microsoft
Why this: The sites that matter most for competitive or market data are usually React or Vue SPAs that return empty markup to a plain fetch. Playwright MCP launches a real browser for those targets, waits for the DOM to stabilise, and hands the agent a rendered page it can actually read. Pairing fetch and Playwright in the same agent loop means one pipeline handles both cases.
- 4. OSS: Orchestration
LangGraph
by LangChain
Why this: A data extraction pipeline has a clear shape: fetch, parse, validate, write, handle errors, retry. LangGraph encodes that as a state machine so a failure at the parse step does not silently drop rows. The explicit state model also makes it straightforward to add a human-review step for uncertain extractions before they hit the database.
- 5. MCP: Output store
Supabase MCP Server
by Supabase
Why this: Extracted rows need to land somewhere query-friendly. The Supabase MCP server lets the agent write directly to a typed table through a scoped token, so no connection string is ever passed to the model. Row-level security keeps multi-source pipelines from bleeding data between projects, and Supabase's REST API means downstream consumers can query the data without a separate API layer.
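To make the control flow concrete, here is a minimal sketch of one URL moving through the pipeline: try the cheap fetch path, fall back to the browser path when the page looks unrendered, then parse and write, with failures recorded instead of dropped. It is plain Python and does not use the LangGraph, Fetch MCP, or Playwright MCP APIs. The `fetch`, `browser`, `parse`, and `write` callables, the `RunState` fields, and the 200-character heuristic are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Fetcher = Callable[[str], str]            # takes a URL, returns page text
Parser = Callable[[str], Optional[dict]]  # returns None when unsure


@dataclass
class RunState:
    url: str
    text: str = ""
    path: str = ""            # "fetch" or "browser"
    row: Optional[dict] = None
    errors: List[str] = field(default_factory=list)
    status: str = "pending"   # "written", "needs_review", or "failed"


def looks_unrendered(text: str, min_chars: int = 200) -> bool:
    # Heuristic assumption: an SPA shell yields very little readable text.
    return len(text.strip()) < min_chars


def run_one(url: str, fetch: Fetcher, browser: Fetcher, parse: Parser,
            write: Callable[[dict], None], max_attempts: int = 2) -> RunState:
    state = RunState(url=url)
    for attempt in range(1, max_attempts + 1):
        try:
            # Majority path first; only launch a browser when needed.
            state.text, state.path = fetch(url), "fetch"
            if looks_unrendered(state.text):
                state.text, state.path = browser(url), "browser"
            break
        except Exception as exc:
            state.errors.append(f"attempt {attempt}: {exc}")
    else:
        # Every attempt raised: record the failure, do not drop it.
        state.status = "failed"
        return state

    state.row = parse(state.text)
    if state.row is None:
        # Uncertain extraction: hold for human review, skip the write.
        state.status = "needs_review"
        return state

    write(state.row)
    state.status = "written"
    return state
```

Every URL ends in exactly one of three named states, which is the property the orchestration layer is there to guarantee. In a real build, each branch would become a node in the graph and the callables would wrap the corresponding MCP tool calls.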
Get this running with Enterprise DNA.
Enterprise DNA connects the extraction pipeline to the rest of the operating layer. Each pipeline run logs to an OPM work item so you can see which runs succeeded, which stalled on a blocked domain, and which produced fewer rows than expected. Secrets for the Supabase write token and any proxy credentials live in Infisical, pulled at runtime rather than hardcoded. When a run produces a dataset worth reviewing, the result lands in your inbox via the sealed Omni Mail client rather than sitting in a dashboard you have to remember to check.
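A minimal sketch of the "pulled at runtime rather than hardcoded" pattern, assuming the secrets manager injects values into the process environment before the run starts. The variable names `SUPABASE_WRITE_TOKEN` and `PROXY_URL` and the `PipelineSecrets` type are hypothetical, and nothing here calls an Infisical API.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PipelineSecrets:
    supabase_token: str
    proxy_url: Optional[str] = None


def load_secrets(env: Mapping[str, str] = os.environ) -> PipelineSecrets:
    # Hypothetical variable names; match them to your own secret keys.
    token = env.get("SUPABASE_WRITE_TOKEN", "")
    if not token:
        # Fail before any fetching starts rather than at the first write.
        raise RuntimeError("SUPABASE_WRITE_TOKEN is not set")
    # Proxy credentials are optional; an empty string counts as unset.
    return PipelineSecrets(
        supabase_token=token,
        proxy_url=env.get("PROXY_URL") or None,
    )
```

Passing the environment in as a parameter keeps the loader testable with a plain dictionary, and failing fast on a missing token means a misconfigured run stops before it spends time on extraction.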
Alternative stacks
Different angles on the same outcome.
Research agent that reads and synthesises
If the goal is briefings rather than structured rows, a research agent that reads pages and synthesises findings is the right shape. No database, no schema, just a written output.
Content publishing pipeline
If the extracted data feeds a content workflow rather than a database, the content pipeline stack handles the downstream formatting and scheduling work.
Code review bot for data quality
Once the pipeline is running, a code review bot applied to the extraction scripts catches schema drift, brittle selectors, and missing error handling before they break production runs.
Other use cases
More curated stacks from the index.
Build a customer support agent
A working customer-support agent that triages tickets, answers from your docs, and escalates with full context.
Build a research agent
An agent that watches sources, synthesises findings, and ships you a briefing on the days something matters.
Build a sales outreach agent
An outreach agent that drafts personal-feeling email, qualifies replies on the phone, and updates the CRM without anyone copy-pasting notes.