Build a Data Extraction Pipeline

Scrape structured data from web sources at scale and land it in a queryable database without writing a custom scraper per site.

The problem is not fetching a URL. The problem is that every site is different: some render in the browser, some block bots, some paginate in undocumented ways, and almost none return clean structured data. Data engineers who build these pipelines manually spend most of their time writing site-specific glue code that breaks every time the source redesigns a page. The stack below replaces that glue with an agent loop that adapts to layout changes, a browser layer for JavaScript-heavy targets, and a typed output surface that keeps downstream consumers from breaking when the shape shifts. The people who build these are usually data engineers, product analytics teams, and solo founders who need competitor or market data they cannot buy.

The stack

Each pick is a real entry on the index.

1

Agents: Pipeline driver

Claude Code

by Anthropic

Why this: Claude Code in headless mode runs the extraction loop: fetch a batch of URLs, decide whether to use the fetch or browser path, parse the raw output into a schema, and write rows to the database. Its skills system lets you encode the target schema once and reload it on every run without re-prompting. A cron job runs it on a schedule.
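The "parse the raw output into a schema" step can be sketched as a typed record plus a coercion function. This is a minimal sketch, not part of Claude Code itself: `ProductRow`, its fields, and `to_row` are hypothetical names chosen for illustration, and a real target schema would differ.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal, InvalidOperation


@dataclass(frozen=True)
class ProductRow:
    """Hypothetical target schema: one row per product listing."""
    source_url: str
    name: str
    price: Decimal
    currency: str
    scraped_at: datetime


def to_row(raw: dict, source_url: str) -> ProductRow:
    """Coerce loosely typed extractor output into a ProductRow.

    Raises ValueError so the caller can route the record to review
    instead of writing a malformed row.
    """
    name = str(raw.get("name", "")).strip()
    if not name:
        raise ValueError("missing name")
    try:
        # Accept "1,299.00" style strings as well as plain numbers.
        price = Decimal(str(raw["price"]).replace(",", ""))
    except (KeyError, InvalidOperation) as exc:
        raise ValueError(f"bad price: {raw.get('price')!r}") from exc
    if not price.is_finite() or price < 0:
        raise ValueError(f"bad price: {price}")
    currency = str(raw.get("currency", "")).strip().upper()
    if len(currency) != 3:
        raise ValueError(f"bad currency: {currency!r}")
    return ProductRow(source_url, name, price, currency,
                      datetime.now(timezone.utc))
```

Keeping the coercion in one function means a change to the schema touches one place, and every rejected record carries a reason that can be logged.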
2

MCP: Static page reader

Fetch MCP Server

by Anthropic (reference implementation)

Why this: Most pages on a target list are static HTML. The Fetch MCP server converts them to clean markdown in one call, which costs far fewer tokens than a screenshot and skips the overhead of launching a browser. Use this for the majority path and fall back to Playwright for the minority of JavaScript-rendered targets.
3

MCP: Browser layer for JS-rendered pages

Playwright MCP

by Microsoft

Why this: The sites that matter most for competitive or market data are usually React or Vue SPAs that return empty markup to a plain fetch. Playwright MCP launches a real browser for those targets, waits for the DOM to stabilise, and hands the agent a rendered page it can actually read. Pairing fetch and Playwright in the same agent loop means one pipeline handles both cases.
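One way to decide between the fetch path and the browser path is to check how much visible text a plain fetch returned. The sketch below is a heuristic under stated assumptions, not something either MCP server provides: `needs_browser` and the 200-character threshold are hypothetical and would need tuning per target list.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text outside <script>, <style>, and <noscript> elements."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def needs_browser(html: str, min_chars: int = 200) -> bool:
    """Return True when a plain fetch yielded too little visible text.

    An SPA shell is typically a near-empty <div> plus script tags, so
    its visible text falls under the (assumed) threshold.
    """
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks)) < min_chars
```

A pipeline could call `needs_browser` on the fetch result and send the URL to the browser layer only when it returns `True`, which keeps the cheaper path as the default.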
4

OSS: Orchestration

LangGraph

by LangChain

Why this: A data extraction pipeline has a clear shape: fetch, parse, validate, write, handle errors, retry. LangGraph encodes that as a state machine so a failure at the parse step does not silently drop rows. The explicit state model also makes it straightforward to add a human-review step for uncertain extractions before they hit the database.
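The fetch, parse, validate, write, retry shape can be illustrated with a plain-Python stand-in. This sketch deliberately does not use LangGraph's API. It only shows the idea of an explicit state object where every exit path records a status. `RunState`, `run`, and the status names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RunState:
    """State carried through one URL's trip through the pipeline."""
    url: str
    raw: str = ""
    record: dict = field(default_factory=dict)
    attempts: int = 0
    status: str = "pending"  # pending | written | needs_review | failed
    error: str = ""


def run(state, fetch, parse, validate, write, max_attempts=3):
    """Drive one URL through fetch -> parse -> validate -> write.

    Fetch errors are retried up to max_attempts. Parse and validation
    problems go to review instead of being dropped. A write error
    propagates to the caller rather than being swallowed.
    """
    while state.attempts < max_attempts:
        state.attempts += 1
        try:
            state.raw = fetch(state.url)
        except Exception as exc:  # treated as transient: retry
            state.error = f"fetch: {exc}"
            continue
        try:
            state.record = parse(state.raw)
        except Exception as exc:
            state.status, state.error = "needs_review", f"parse: {exc}"
            return state
        if not validate(state.record):
            state.status, state.error = "needs_review", "validation failed"
            return state
        write(state.record)
        state.status, state.error = "written", ""
        return state
    state.status = "failed"
    return state
```

Because the function always returns a state with a final status, a caller can count `written`, `needs_review`, and `failed` outcomes per run and notice when rows go missing.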
5

MCP: Output store

Supabase MCP Server

by Supabase

Why this: Extracted rows need to land somewhere query-friendly. The Supabase MCP server lets the agent write directly to a typed table through a scoped token, with no connection string passed to the model. Row-level security keeps multi-source pipelines from bleeding data between projects, and Supabase's REST API means downstream consumers can query the data without a separate API layer.
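What a typed table buys you can be shown locally. In the sketch below, SQLite stands in for the Supabase Postgres table so the example runs without credentials. The `products` table, its columns, and the upsert key are hypothetical, and the upsert is an assumption about how repeated runs should behave, not something the Supabase MCP server does on its own.

```python
import sqlite3

# Hypothetical table: constraints reject malformed rows at write time.
DDL = """
CREATE TABLE IF NOT EXISTS products (
    source_url TEXT NOT NULL,
    name       TEXT NOT NULL,
    price      TEXT NOT NULL,
    currency   TEXT NOT NULL CHECK (length(currency) = 3),
    scraped_at TEXT NOT NULL,
    PRIMARY KEY (source_url, name)
)
"""

# Re-scraping the same listing updates the row instead of duplicating it.
UPSERT = """
INSERT INTO products (source_url, name, price, currency, scraped_at)
VALUES (:source_url, :name, :price, :currency, :scraped_at)
ON CONFLICT (source_url, name) DO UPDATE SET
    price = excluded.price,
    currency = excluded.currency,
    scraped_at = excluded.scraped_at
"""


def write_rows(conn: sqlite3.Connection, rows: list) -> int:
    """Upsert a batch of dict rows and return the table's row count."""
    with conn:  # commits on success, rolls back the batch on error
        conn.executemany(UPSERT, rows)
    return conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```

The same `INSERT ... ON CONFLICT` form is valid in Postgres, so the statement should carry over to a Supabase table with a matching unique key.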

Why we picked this stack

Get this running with Enterprise DNA.

Enterprise DNA connects the extraction pipeline to the rest of the operating layer. Each pipeline run logs to an OPM work item so you can see which runs succeeded, which stalled on a blocked domain, and which produced fewer rows than expected. Secrets for the Supabase write token and any proxy credentials live in Infisical, pulled at runtime rather than hardcoded. When a run produces a dataset worth reviewing, the result lands in your inbox via the sealed Omni Mail client rather than sitting in a dashboard you have to remember to check.


Alternative stacks

Different angles on the same outcome.

Alternative

Research agent that reads and synthesises

If the goal is briefings rather than structured rows, a research agent that reads pages and synthesises findings is the right shape. No database, no schema, just a written output.

Alternative

Content publishing pipeline

If the extracted data feeds a content workflow rather than a database, the content pipeline stack handles the downstream formatting and scheduling work.

Alternative

Code review bot for data quality

Once the pipeline is running, a code review bot applied to the extraction scripts catches schema drift, brittle selectors, and missing error handling before they break production runs.

