Datatrove

by Community

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Added 1 June 2026

Overview

Datatrove is an open-source Python framework for building platform-agnostic data processing pipelines. It provides customizable processing blocks that users chain together into workflows, which reduces the need for one-off custom scripts. The project is maintained by the Hugging Face community and has over 3,000 stars on GitHub.
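
To make the block idea concrete, here is a minimal sketch in plain Python of how small processing blocks can be chained into a pipeline. It does not use Datatrove's API: the names `read_texts`, `min_length_filter`, `lowercase`, and `run_pipeline` are hypothetical and exist only for illustration.

```python
from typing import Callable, Iterable, Iterator

# Illustrative only: a "block" here is any callable that takes a stream of
# documents and returns a new stream. This is NOT Datatrove's actual API.
Doc = dict
Block = Callable[[Iterable[Doc]], Iterator[Doc]]


def read_texts(texts: list) -> Block:
    """Hypothetical source block: ignores its input and emits one doc per text."""
    def block(_docs: Iterable[Doc]) -> Iterator[Doc]:
        for i, text in enumerate(texts):
            yield {"id": i, "text": text}
    return block


def min_length_filter(min_chars: int) -> Block:
    """Hypothetical filter block: keeps docs with at least `min_chars` characters."""
    def block(docs: Iterable[Doc]) -> Iterator[Doc]:
        for doc in docs:
            if len(doc["text"]) >= min_chars:
                yield doc
    return block


def lowercase(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Hypothetical transform block: lowercases the text of every doc."""
    for doc in docs:
        yield {**doc, "text": doc["text"].lower()}


def run_pipeline(blocks: list, docs: Iterable[Doc] = ()) -> list:
    """Feed the output stream of each block into the next, then collect results."""
    stream = iter(docs)
    for block in blocks:
        stream = block(stream)
    return list(stream)


result = run_pipeline([
    read_texts(["Hello World", "hi", "Data Pipelines"]),
    min_length_filter(5),
    lowercase,
])
print(result)
```

Because every block shares the same stream-in, stream-out shape, blocks can be reordered, swapped, or reused across pipelines without glue code, which is the property the description above attributes to Datatrove's design.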

Best for

Developers who need flexible, reusable data pipelines without locking into a specific platform

Use cases

  • Assembling modular data preprocessing pipelines for machine learning
  • Creating reusable data cleaning and transformation workflows
  • Building scalable ETL processes without writing glue code

Notes

3,076 stars on GitHub. Last updated 2026-05-26. Licensed Apache-2.0.

Pros

  • Modular block-based design promotes code reuse and clarity
  • Platform-agnostic, works across different execution environments
  • Backed by a large open-source community with active development

Cons

  • Requires learning the block abstraction paradigm
  • May introduce overhead for simple or one-off data tasks
  • Documentation and examples may lag behind rapid development

Indexed from awesome-llm and enriched against its public facts.
