Pachyderm
by Community
Data-Centric Pipelines and Data Versioning
OSS
Added 1 June 2026
Overview
Pachyderm is an open-source platform for data-centric pipelines and data versioning. It applies version control to datasets, so a pipeline run can be traced back to the exact data it consumed and reproduced later. It is written in Go and treats data, rather than code alone, as a first-class citizen in the pipeline lifecycle.
Best for
Data engineers and ML teams needing reproducible data pipelines
Use cases
- Versioning datasets for machine learning experiments
- Building reproducible data pipelines
- Tracking data lineage and provenance
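The idea behind these use cases can be illustrated with a toy sketch. This is not Pachyderm's API or storage format. The `commit` function and its `parent` and `provenance` fields are hypothetical names chosen for illustration, assuming a Git-like model in which each dataset version is identified by a hash of its contents and records which upstream version it was derived from.

```python
import hashlib
import json


def commit(files, parent=None, provenance=()):
    """Return a commit record whose ID is derived from its contents.

    Hypothetical helper for illustration only; not a Pachyderm call.
    """
    # Hash each file's bytes so identical data always maps to the same digest.
    tree = {
        path: hashlib.sha256(data).hexdigest()
        for path, data in sorted(files.items())
    }
    body = {"tree": tree, "parent": parent, "provenance": sorted(provenance)}
    # The commit ID covers the file digests, the parent, and the provenance.
    commit_id = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    return {"id": commit_id, **body}


# Two versions of a raw dataset; the second points back to the first.
raw_v1 = commit({"/images/a.png": b"pixels-1"})
raw_v2 = commit({"/images/a.png": b"pixels-2"}, parent=raw_v1["id"])

# A derived dataset records which input commit produced it (lineage).
edges = commit({"/edges/a.png": b"edges-of-2"}, provenance=[raw_v2["id"]])
```

Because the ID is computed from the data, re-committing identical files yields the same ID, while any change to the data produces a new one. The `provenance` list is what lets a derived result be traced back to the input version that produced it.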
Notes
6,295 stars on GitHub. Last updated 2025-02-03. Licensed Apache-2.0.
Pros
- Open source with a strong community (over 6,000 GitHub stars)
- Data versioning similar to Git for code
- Scalable pipeline execution with parallel processing
Cons
- Steep learning curve for data versioning concepts
- Requires significant infrastructure setup (e.g., Kubernetes)
- Focused on data-centric workflows; not a general observability tool
Indexed from awesome-llmops and enriched against its public facts.