tokenizers
by Community
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
OSS
Added 1 June 2026
Overview
A Rust implementation of fast tokenizers, optimized for both research and production NLP pipelines. It provides subword tokenization algorithms such as BPE, WordPiece, and Unigram with full alignment tracking. The library is framework-agnostic and includes Python bindings for easy integration.
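To make "subword tokenization with alignment tracking" concrete, here is a toy BPE sketch in plain Python. It is an illustration only, not the library's Rust implementation: the helper names (`to_pieces`, `merge_pair`, `learn_merges`, `encode`) are hypothetical, the whitespace splitting is a simplifying assumption, and the training loop ignores every optimization a real tokenizer uses. Each piece carries the character span it covers in the original text, which is the idea behind alignment tracking.

```python
import re
from collections import Counter


def to_pieces(word, base=0):
    """Split a word into single characters, each tagged with its (start, end) span."""
    return [(ch, base + i, base + i + 1) for i, ch in enumerate(word)]


def merge_pair(pieces, pair):
    """Merge every adjacent occurrence of `pair`, joining the two spans."""
    out, i = [], 0
    while i < len(pieces):
        if i + 1 < len(pieces) and (pieces[i][0], pieces[i + 1][0]) == pair:
            left, right = pieces[i], pieces[i + 1]
            out.append((left[0] + right[0], left[1], right[2]))
            i += 2
        else:
            out.append(pieces[i])
            i += 1
    return out


def learn_merges(words, num_merges):
    """Learn up to `num_merges` merge rules by repeatedly fusing the most frequent pair."""
    corpus = [to_pieces(w) for w in words]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for pieces in corpus:
            for a, b in zip(pieces, pieces[1:]):
                counts[(a[0], b[0])] += 1
        if not counts:
            break
        # Break frequency ties by the pair itself so the result is deterministic.
        best = max(counts, key=lambda p: (counts[p], p))
        merges.append(best)
        corpus = [merge_pair(p, best) for p in corpus]
    return merges


def encode(text, merges):
    """Split on whitespace, apply merges in learned order, keep character offsets."""
    result = []
    for match in re.finditer(r"\S+", text):
        pieces = to_pieces(match.group(), base=match.start())
        for pair in merges:
            pieces = merge_pair(pieces, pair)
        result.extend(pieces)
    return result


words = ["low", "low", "lower", "lowest", "newer", "newer", "wider"]
merges = learn_merges(words, 5)
text = "lower  newest"
encoded = encode(text, merges)
for token, start, end in encoded:
    # Every token maps back to the exact slice of the input it came from.
    print(token, (start, end), repr(text[start:end]))
```

Because each token keeps its span, downstream code can map a token back to the characters it came from, even when the input contains irregular whitespace.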
Best for
Developers needing high-throughput tokenization for NLP model training or serving
Use cases
- Tokenizing large text corpora for model training
- Integrating tokenization into production inference systems
- Building custom tokenizers for specialized vocabularies
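As a rough sketch of the use cases above, the following trains a small BPE tokenizer through the Python bindings and then encodes single and batched inputs. It assumes the `tokenizers` package is installed; the corpus, the vocabulary size, and the `[UNK]` token name are made up for illustration, and a real run would stream text from files rather than a three-line list.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Tiny in-memory corpus standing in for a specialized domain (illustrative only).
corpus = [
    "the patient presented with acute myocarditis",
    "myocarditis and pericarditis are inflammatory conditions",
    "the patient recovered after treatment",
]

# A BPE model with an explicit unknown token, splitting on whitespace and punctuation.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# vocab_size is an arbitrary small value chosen for this toy corpus.
trainer = BpeTrainer(vocab_size=300, special_tokens=["[UNK]"], show_progress=False)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Encode one string; offsets point back into the original text.
text = "acute pericarditis"
encoding = tokenizer.encode(text)
for token, (start, end) in zip(encoding.tokens, encoding.offsets):
    print(token, (start, end), repr(text[start:end]))

# Encode many strings in one call, as a serving or preprocessing path might.
batch = tokenizer.encode_batch(corpus)
print([len(e.ids) for e in batch])
```

A trained tokenizer can be written to a single JSON file with `tokenizer.save(...)` and reloaded with `Tokenizer.from_file(...)`, which is one way to move it from training into an inference service.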
Notes
10,782 stars on GitHub. Last updated 2026-05-26. Licensed Apache-2.0.
Pros
- High throughput from the Rust core
- Supports multiple tokenization algorithms behind a consistent API
- Seamless Python bindings for integration with ML workflows
Cons
- Limited to tokenization tasks without broader NLP utilities
- Installation requires either a pre-built wheel or compiling the Rust source
- Smaller community compared to Python-native tokenizers
Indexed from awesome-llmops and enriched with the project's public metadata.