Horovod
by Community
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
OSS
Added 1 June 2026
Overview
Horovod is a distributed training framework that scales deep learning across multiple GPUs and nodes for TensorFlow, Keras, PyTorch, and Apache MXNet. It abstracts communication patterns like all-reduce to simplify multi-machine training without requiring extensive code rewrites. Developers add a few lines to existing training scripts to enable distributed execution.
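As a rough illustration of what those "few lines" can look like for PyTorch, the sketch below uses calls from Horovod's public `horovod.torch` module (`init`, `broadcast_parameters`, `broadcast_optimizer_state`, `DistributedOptimizer`). The helper name `wrap_for_horovod` and the choice to pass the module in as the `hvd` argument are assumptions made for illustration, not part of Horovod. The model and optimizer stand for whatever the existing script already builds.

```python
def wrap_for_horovod(hvd, model, optimizer):
    """Apply the usual Horovod additions to an existing PyTorch setup.

    `hvd` is expected to be the `horovod.torch` module; it is passed in
    (an illustrative choice, not a Horovod convention) so the helper can
    be read and exercised without a cluster.
    """
    # Start Horovod; after this every process knows its rank and world size.
    hvd.init()

    # Make all workers start from rank 0's weights and optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    # Wrap the optimizer so gradients are averaged across workers
    # (via all-reduce) before each parameter update.
    return hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )
```

In a real script this would be called once, after `import horovod.torch as hvd` and after the model and optimizer exist, and the returned optimizer would replace the original in the training loop. Scripts usually also pin each process to one GPU with `torch.cuda.set_device(hvd.local_rank())` and are started with a launcher command such as `horovodrun -np 4 python train.py`, where `train.py` is a placeholder name.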
Best for
ML engineers training large models who need to scale across multiple GPUs or nodes without rewriting training logic
Use cases
- Training large models faster across multiple GPUs
- Scaling PyTorch or TensorFlow experiments to multi-node clusters
- Reducing training time for production ML pipelines
Notes
14,696 stars on GitHub. Last updated 2025-12-01.
Pros
- Works with major frameworks (PyTorch, TensorFlow, Keras, MXNet) with minimal code changes
- Handles communication optimization automatically, reducing boilerplate
- Well-tested in production, with 14k+ GitHub stars and an active community
Cons
- Requires infrastructure setup (multiple GPUs/nodes) to see benefits
- Learning curve for distributed training concepts and debugging across machines
- Performance gains depend on network bandwidth and cluster configuration
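To make the bandwidth dependence concrete, here is a back-of-the-envelope estimate under stated assumptions: gradients are synchronized with a ring all-reduce, in which each worker sends roughly 2(N-1)/N times the gradient payload, and latency and compute overlap are ignored. The function name `ring_allreduce_seconds` and the example figures are hypothetical, and real clusters will deviate from this estimate.

```python
def ring_allreduce_seconds(num_workers, payload_bytes, bandwidth_bytes_per_s):
    """Estimate per-step gradient sync time for a ring all-reduce.

    Simplifying assumptions: only bandwidth matters (no latency term),
    and every link runs at the same speed.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth_bytes_per_s must be positive")
    # Each worker sends 2 * (N - 1) / N of the payload in total:
    # one pass to reduce the chunks, one pass to share the results.
    bytes_sent = 2 * (num_workers - 1) / num_workers * payload_bytes
    return bytes_sent / bandwidth_bytes_per_s


# Illustrative numbers: 4 workers, 1 GB of gradients, 10 Gbit/s links.
estimate = ring_allreduce_seconds(4, 1e9, 1.25e9)
print(f"{estimate:.2f} s per synchronization")
```

Under this model the per-worker traffic approaches twice the payload as workers are added, so a slow interconnect caps the achievable speedup no matter how many GPUs are available.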
Indexed from awesome-llmops and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
TensorFlow
Community
An Open Source Machine Learning Framework for Everyone
PyTorch
Community
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Keras
Community
Deep Learning for humans
Apache MXNet
Community
Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more