The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

by Community

The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LL

Added 2 June 2026

Overview

FineWeb is a 15-trillion token pretraining dataset derived from 96 Common Crawl snapshots. It is designed to produce better-performing large language models than other open datasets. The dataset and its curation methodology are fully documented and ablated to advance understanding of high-quality data curation.

Best for

Researchers and engineers building or benchmarking open LLMs with high-quality pretraining data

Use cases

Pretraining large language models from scratch
Ablation studies on data curation techniques
Benchmarking open-source dataset quality for LLM training

Notes

Indexed from awesome-llm and enriched against its public facts.

Pros

Reported, with supporting ablations, to produce better-performing LLMs than other open datasets
Fully documented and ablated curation process
Large scale with 15 trillion tokens from diverse web sources

Cons

Requires significant compute resources to process and use
Derived only from Common Crawl, limiting domain coverage
Not a ready-to-use tool; requires integration into training pipelines
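
Because the dataset has to be wired into a training pipeline by hand, a small ablation run usually starts by streaming a subset and capping it at a token budget. The sketch below is one way that could look, not the authors' tooling. It assumes the dataset is read through the Hugging Face `datasets` library under the ID `HuggingFaceFW/fineweb` with the `sample-10BT` config, and that each record carries `text` and `token_count` fields; none of these details appear in this entry, so verify them before relying on the code. The helper `take_token_budget` is a hypothetical name introduced here for illustration.

```python
from typing import Iterable, Iterator


def take_token_budget(docs: Iterable[dict], max_tokens: int) -> Iterator[dict]:
    """Yield documents in order until adding the next one would exceed max_tokens.

    Assumes every record has an integer `token_count` field.
    """
    used = 0
    for doc in docs:
        n = doc["token_count"]
        if used + n > max_tokens:
            break
        used += n
        yield doc


def stream_fineweb_sample():
    """Open a streaming view of a FineWeb sample (needs network access).

    Assumptions: `pip install datasets`, dataset ID `HuggingFaceFW/fineweb`,
    config name `sample-10BT`. Streaming avoids downloading the full dataset.
    """
    from datasets import load_dataset  # imported lazily so the helper above works offline

    return load_dataset(
        "HuggingFaceFW/fineweb",
        name="sample-10BT",
        split="train",
        streaming=True,
    )


if __name__ == "__main__":
    # Take roughly one million tokens for a quick experiment.
    for doc in take_token_budget(stream_fineweb_sample(), max_tokens=1_000_000):
        print(doc["token_count"], doc["text"][:80].replace("\n", " "))
```

The budget helper works on any iterable of records, so it can be exercised with in-memory data before pointing it at the real stream.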

Pairs with

Other entries in the index that connect to this one (4 entries).

O OSS Framework medium

DeepSpeed

Community

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

★ 42,436 updated 23d ago

O OSS Framework medium

Megatron-LM

Community

Ongoing research training transformer models at scale

★ 16,545 updated 23d ago

O OSS Framework medium

NeMo Framework

Community

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech

★ 17,285 updated 23d ago

O OSS Framework medium

LitGPT

Community

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.

★ 13,395 updated 23d ago
