Datatrove
by Community
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
OSS
Added 1 June 2026
Overview
Datatrove is an open-source Python framework for building platform-agnostic data processing pipelines. It provides customizable processing blocks that users chain into workflows, so the same pipeline definition can be reused instead of rewritten as one-off scripts. The project is maintained by Hugging Face and its community and has more than 3,000 stars on GitHub.
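To make the idea of assembling blocks into a workflow concrete, here is a minimal sketch in plain Python. It does not use Datatrove's own API: the names `drop_short`, `normalize_whitespace`, and `run_pipeline`, and the dictionary-based document shape, are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical record type for this sketch: a dict with a "text" field.
Document = dict
# A block takes a stream of documents and yields a (possibly changed) stream.
Block = Callable[[Iterable[Document]], Iterator[Document]]


def normalize_whitespace(docs: Iterable[Document]) -> Iterator[Document]:
    """Collapse runs of whitespace in each document's text."""
    for doc in docs:
        yield {**doc, "text": " ".join(doc["text"].split())}


def drop_short(min_chars: int) -> Block:
    """Build a filter block that drops documents shorter than min_chars."""
    def block(docs: Iterable[Document]) -> Iterator[Document]:
        for doc in docs:
            if len(doc["text"]) >= min_chars:
                yield doc
    return block


def run_pipeline(source: Iterable[Document], blocks: list[Block]) -> list[Document]:
    """Chain the blocks lazily, then collect the final stream into a list."""
    stream: Iterable[Document] = iter(source)
    for block in blocks:
        stream = block(stream)
    return list(stream)


docs = [{"text": "  hello   world  "}, {"text": "hi"}]
result = run_pipeline(docs, [normalize_whitespace, drop_short(5)])
print(result)
```

Because each block only consumes and produces a stream of documents, blocks can be reordered, swapped, or reused across pipelines without touching the code that runs them, which is the property the block-based design is aiming for.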
Best for
Developers who need flexible, reusable data pipelines without being locked into a specific platform
Use cases
- Assembling modular data preprocessing pipelines for machine learning
- Creating reusable data cleaning and transformation workflows
- Building scalable ETL processes without writing glue code
Notes
3,076 stars on GitHub. Last updated 2026-05-26. Licensed Apache-2.0.
Pros
- Modular block-based design promotes code reuse and clarity
- Platform-agnostic, works across different execution environments
- Backed by a large open-source community with active development
Cons
- Requires learning the block abstraction paradigm
- May introduce overhead for simple or one-off data tasks
- Documentation and examples may lag behind rapid development
Indexed from awesome-llm and enriched with publicly available facts about the project.
Pairs with
Other entries in the index that connect to this one.
DeepSpeed
Community
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Colossal-AI
Community
Making large AI models cheaper, faster and more accessible