IBM data-prep-kit
by Community
Open source project for data preparation for GenAI applications
OSS
Added 1 June 2026
Overview
An open source framework from IBM for preparing data for generative AI applications. It provides tools and pipelines to clean, transform, and structure raw data into formats suitable for training or fine-tuning models.
Best for
Developers building GenAI applications who need a focused, open source data preparation framework.
Use cases
- Cleaning and normalizing text datasets for LLM fine-tuning
- Transforming unstructured data into structured training examples
- Building repeatable data preparation pipelines for GenAI workflows
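As a rough illustration of the first and third use cases, the sketch below chains small text-cleaning steps into a repeatable pipeline. It uses only the Python standard library and does not use data-prep-kit's own API; the names normalize_unicode, collapse_whitespace, and run_pipeline are hypothetical and chosen for this example only.

```python
import re
import unicodedata
from typing import Callable, Iterable, List

# A transform is any function that maps one text record to a cleaned record.
Transform = Callable[[str], str]


def normalize_unicode(text: str) -> str:
    """Fold compatibility characters (ligatures, non-breaking spaces) via NFKC."""
    return unicodedata.normalize("NFKC", text)


def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def run_pipeline(records: Iterable[str], steps: List[Transform]) -> List[str]:
    """Apply each step in order, then drop empty and exact-duplicate records."""
    cleaned: List[str] = []
    seen = set()
    for record in records:
        for step in steps:
            record = step(record)
        if record and record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned


if __name__ == "__main__":
    raw = ["  Hello\u00a0 world ", "Hello world", "", "\ufb01ne  tuning"]
    print(run_pipeline(raw, [normalize_unicode, collapse_whitespace]))
```

Keeping each step as a plain function means the same list of steps can be reused across datasets, which is the general idea behind a modular pipeline; the framework's actual transforms and configuration may look quite different.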
Notes
934 stars on GitHub. Last updated 2026-05-15. Licensed Apache-2.0.
Pros
- Open source with community contributions and IBM backing
- Designed specifically for GenAI data needs, not general ETL
- Modular pipeline approach supports customization and reuse
Cons
- Limited to data preparation, not a full ML pipeline tool
- Relatively new project with a smaller community (934 stars)
- Documentation and examples may be sparse for advanced use cases
Indexed from awesome-llm and enriched against its public facts.