IBM data-prep-kit

by Community

Open source project for data preparation for GenAI applications

Added 1 June 2026

#code-quality #data #data-prep #data-preparation #data-preprocessing #data-preprocessing-pipelines #datacuration #datarecipes

Overview

An open source framework from IBM for preparing data for generative AI applications. It provides tools and pipelines to clean, transform, and structure raw data into formats suitable for training or fine-tuning models.

Best for
Developers building GenAI applications who need a focused, open source data preparation framework.

Use cases

Cleaning and normalizing text datasets for LLM fine-tuning
Transforming unstructured data into structured training examples
Building repeatable data preparation pipelines for GenAI workflows
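
The use cases above can be illustrated with a minimal sketch of a repeatable cleaning pipeline. It uses only the Python standard library and does not use data-prep-kit's own API; the names `Step`, `normalize_unicode`, `collapse_whitespace`, `drop_short`, and `run_pipeline` are hypothetical and chosen for illustration. It shows the general idea of chaining small, reusable steps that normalize text, filter out unusable records, remove exact duplicates, and emit structured examples.

```python
import re
import unicodedata
from typing import Callable, Iterable, List, Optional

# A step maps one text record to a cleaned record,
# or returns None when the record should be dropped.
Step = Callable[[str], Optional[str]]


def normalize_unicode(text: str) -> Optional[str]:
    # NFKC folds compatibility characters (full-width letters,
    # non-breaking spaces) into their plain equivalents.
    return unicodedata.normalize("NFKC", text)


def collapse_whitespace(text: str) -> Optional[str]:
    # Replace any run of whitespace with a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def drop_short(min_chars: int) -> Step:
    # Build a filter step that discards records shorter than min_chars.
    def step(text: str) -> Optional[str]:
        return text if len(text) >= min_chars else None
    return step


def run_pipeline(records: Iterable[str], steps: List[Step]) -> List[dict]:
    # Apply every step in order, skip dropped records, and remove
    # exact duplicates that appear after cleaning.
    seen = set()
    examples = []
    for record in records:
        current: Optional[str] = record
        for step in steps:
            current = step(current)
            if current is None:
                break
        if current is None or current in seen:
            continue
        seen.add(current)
        examples.append({"text": current})
    return examples


raw = [
    "  Hello\u00a0 world!  ",
    "Hello world!",
    "ok",
    "\uff26ullwidth  text here",
]
steps = [normalize_unicode, collapse_whitespace, drop_short(5)]
print(run_pipeline(raw, steps))
```

Because each step is an ordinary function, the same list of steps can be rerun on new data or reordered without touching the runner, which is the sense in which such a pipeline is repeatable.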

Notes

934 stars on GitHub. Last updated 2026-05-15. Licensed Apache-2.0.

Pros

Open source with community contributions and IBM backing
Designed specifically for GenAI data needs, not general ETL
Modular pipeline approach supports customization and reuse

Cons

Limited to data preparation, not a full ML pipeline tool
Relatively new project with smaller community (934 stars)
Documentation and examples may be sparse for advanced use cases

Indexed from awesome-llm and enriched against its public facts.

Pairs with

Other entries in the index that connect to this one.

Built with (1 entry)

PyTorch (Community): Tensors and dynamic neural networks in Python with strong GPU acceleration. ★ 100,318, updated 1mo ago.

Pairs with (2 entries)

LangChain (Community): The agent engineering platform. ★ 138,234, updated 1mo ago.
vLLM (Community): A high-throughput and memory-efficient inference and serving engine for LLMs. ★ 81,619, updated 1mo ago.
