Enterprise DNA
Open Source Observability medium

Pachyderm

by Community

Data-Centric Pipelines and Data Versioning

OSS

Added 1 June 2026

#analytics #big-data #containers #data-analysis #data-science #distributed-systems #docker #go

Overview

Pachyderm is an open-source platform for data-centric pipelines and data versioning. It provides version control for datasets and enables reproducible data processing workflows. Written in Go, it treats data as a first-class citizen in the pipeline lifecycle.

Best for

Data engineers and ML teams needing reproducible data pipelines

Use cases

  • Versioning datasets for machine learning experiments
  • Building reproducible data pipelines
  • Tracking data lineage and provenance
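The versioning and lineage ideas in the use cases above can be illustrated with a small sketch. This is a toy model of content-addressed dataset commits, not Pachyderm's API: the `Commit` type and the `NewCommit` function are hypothetical names made up for illustration. It shows only the general principle that a dataset snapshot can get an identifier derived from its contents and a pointer to its parent, much as Git does for code.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Commit is a hypothetical, simplified snapshot of a dataset.
// It is an illustration only and does not mirror Pachyderm's data model.
type Commit struct {
	ID     string            // identifier derived from parent ID and file hashes
	Parent string            // ID of the previous commit, "" for the first one
	Files  map[string]string // file path -> hex SHA-256 of the file contents
}

// hashBytes returns the hex-encoded SHA-256 digest of b.
func hashBytes(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// NewCommit snapshots files on top of parent. Identical inputs always
// produce the same ID, so a commit ID pins down exactly which data was used.
func NewCommit(parent string, files map[string][]byte) Commit {
	paths := make([]string, 0, len(files))
	for p := range files {
		paths = append(paths, p)
	}
	sort.Strings(paths) // map iteration order is random; sort for determinism

	hashes := make(map[string]string, len(files))
	h := sha256.New()
	h.Write([]byte(parent + "\x00"))
	for _, p := range paths {
		hashes[p] = hashBytes(files[p])
		h.Write([]byte(p + "\x00" + hashes[p] + "\x00"))
	}
	return Commit{ID: hex.EncodeToString(h.Sum(nil)), Parent: parent, Files: hashes}
}

func main() {
	v1 := NewCommit("", map[string][]byte{"train.csv": []byte("a,b\n1,2\n")})
	v2 := NewCommit(v1.ID, map[string][]byte{"train.csv": []byte("a,b\n1,2\n3,4\n")})

	// Following Parent links from v2 back to v1 is the lineage of the dataset.
	fmt.Println("v1:", v1.ID[:12])
	fmt.Println("v2:", v2.ID[:12], "parent:", v2.Parent[:12])
}
```

Recording the commit ID next to a trained model or a report is what makes a run reproducible in this model: the same ID always refers to the same bytes.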

Notes

6,295 stars on GitHub. Last updated 2025-02-03. Licensed Apache-2.0.

Pros

  • Open source with a strong community (over 6,000 stars)
  • Data versioning similar to Git for code
  • Scalable pipeline execution with parallel processing
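As a rough illustration of the parallel-processing point above, the sketch below fans independent units of input out to a fixed pool of workers. It is a generic Go worker pool, not Pachyderm code: `ProcessAll` is a hypothetical helper, and the word "datum" is borrowed loosely to mean one independently processable piece of input.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// ProcessAll applies fn to every datum using up to `workers` goroutines
// and returns the outputs keyed by datum name. Hypothetical helper for
// illustration; it is not part of any Pachyderm library.
func ProcessAll(datums map[string]string, workers int, fn func(string) string) map[string]string {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	results := make(map[string]string, len(datums))
	var mu sync.Mutex // guards results, which several workers write to
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range jobs {
				out := fn(datums[name]) // datums is only read here, never written
				mu.Lock()
				results[name] = out
				mu.Unlock()
			}
		}()
	}

	for name := range datums {
		jobs <- name
	}
	close(jobs) // lets the workers' range loops finish
	wg.Wait()
	return results
}

func main() {
	datums := map[string]string{"a.txt": "hello", "b.txt": "world"}
	out := ProcessAll(datums, 2, strings.ToUpper)
	fmt.Println(out["a.txt"], out["b.txt"]) // prints "HELLO WORLD"
}
```

Because each datum is processed independently, adding workers scales throughput without changing the result, which is the property that makes this style of pipeline straightforward to parallelize.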

Cons

  • Steep learning curve for data versioning concepts
  • Requires significant infrastructure setup (e.g., Kubernetes)
  • Limited to data-centric workflows, not general observability

Indexed from the awesome-llmops list and enriched with publicly available facts about the project.
