The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
by Community
The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LL
OSS
Added 2 June 2026
Overview
FineWeb is a 15-trillion-token pretraining dataset derived from 96 Common Crawl snapshots. It is designed to produce better-performing large language models than other open pretraining datasets. Both the dataset and its curation methodology are documented, and the curation choices are ablated, to advance understanding of how high-quality pretraining data is curated.
Best for
Researchers and engineers building or benchmarking open LLMs with high-quality pretraining data
Use cases
- Pretraining large language models from scratch
- Ablation studies on data curation techniques
- Benchmarking open-source dataset quality for LLM training
Notes
Pros
- Reported to produce better-performing LLMs than other open pretraining datasets
- Fully documented and ablated curation process
- Large scale with 15 trillion tokens from diverse web sources
Cons
- Requires significant compute resources to process and use
- Derived only from Common Crawl, limiting domain coverage
- Not a ready-to-use tool; requires integration into training pipelines
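As a rough illustration of the integration step mentioned in the last point, the sketch below streams documents and groups them into batches that a tokenizer or training loop could consume. It rests on assumptions that do not come from this entry: that the Hugging Face `datasets` package is installed, that the dataset is hosted under the identifier `HuggingFaceFW/fineweb` with a `sample-10BT` configuration, and that each record carries a `text` field. The helper names `load_fineweb_stream`, `iter_texts`, and `batched` are hypothetical.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def load_fineweb_stream(config: str = "sample-10BT") -> Iterable[Dict[str, Any]]:
    """Open a FineWeb split lazily instead of downloading it in full.

    Assumption: the dataset id and config name below are not stated in
    this entry; adjust them to whatever the hosting page actually lists.
    """
    # Imported here so the pure helpers below work without `datasets`.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceFW/fineweb", name=config, split="train", streaming=True
    )


def iter_texts(records: Iterable[Dict[str, Any]], min_chars: int = 1) -> Iterator[str]:
    """Yield the `text` field of each record, skipping very short documents."""
    for record in records:
        text = record.get("text", "")
        if len(text) >= min_chars:
            yield text


def batched(texts: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group an iterable of strings into lists of at most `batch_size` items."""
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Stand-in records so the example runs offline; swap in
    # load_fineweb_stream() to read the real data.
    demo = [{"text": "first document"}, {"text": ""}, {"text": "second document"}]
    for batch in batched(iter_texts(demo), batch_size=2):
        print(batch)
```

Streaming avoids materializing the corpus on disk, which matters at this scale. A real pipeline would still need tokenization, shuffling, and sharding across workers on top of this.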
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
DeepSpeed
Community
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Megatron-LM
Community
Ongoing research training transformer models at scale
NeMo Framework
Community
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech
LitGPT
Community
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.