Horovod

by Community

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Added 1 June 2026

#baidu #deep-learning #deeplearning #keras #machine-learning #machinelearning #mpi #mxnet

Overview

Horovod is a distributed training framework that scales deep learning across multiple GPUs and nodes for TensorFlow, Keras, PyTorch, and Apache MXNet. It abstracts communication patterns like all-reduce to simplify multi-machine training without requiring extensive code rewrites. Developers add a few lines to existing training scripts to enable distributed execution.
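For PyTorch, the "few lines" usually amount to initializing Horovod, pinning each process to a GPU, wrapping the optimizer, and broadcasting the initial state. The following is a minimal sketch, assuming Horovod was installed with PyTorch support. The helper name `wrap_for_horovod` is hypothetical; the `hvd.*` calls are taken from Horovod's PyTorch API.

```python
def wrap_for_horovod(model, optimizer):
    """Hypothetical helper: prepare an existing PyTorch model and optimizer for Horovod."""
    # Imported lazily so this module still loads where Horovod is not installed.
    import torch
    import horovod.torch as hvd

    hvd.init()  # start Horovod; one process per GPU is the usual layout

    # Pin this process to a single GPU, chosen by its rank on the local machine.
    if torch.cuda.is_available():
        torch.cuda.set_device(hvd.local_rank())

    # Wrap the optimizer so gradients are averaged across workers before each step.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )

    # Make every worker start from the same weights and optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
    return optimizer
```

The rest of the training loop stays as it was. A script using this helper would typically be launched with something like `horovodrun -np 4 python train.py`, where `train.py` is a placeholder for your own script.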

Best for

ML engineers training large models who need to scale across multiple GPUs or nodes without rewriting training logic

Use cases

Training large models faster across multiple GPUs
Scaling PyTorch or TensorFlow experiments to multi-node clusters
Reducing training time for production ML pipelines

Notes

14,696 stars on GitHub. Last updated 2025-12-01.

Pros

Works with major frameworks (PyTorch, TensorFlow, Keras, MXNet) with minimal code changes
Handles communication optimization automatically, reducing boilerplate
Widely adopted, with 14k+ GitHub stars and an active community
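The communication that Horovod handles for you is essentially gradient averaging through all-reduce. To make the pattern concrete, here is a pure-Python simulation of a ring all-reduce. It is an illustrative sketch only, not Horovod's implementation (Horovod delegates the real work to backends such as MPI, NCCL, or Gloo), and the function name `ring_allreduce_mean` is hypothetical.

```python
def ring_allreduce_mean(grads):
    """Average equal-length gradient lists across workers using a simulated ring."""
    n = len(grads)
    length = len(grads[0])
    # Chunk c covers indices bounds[c] .. bounds[c + 1] - 1.
    bounds = [c * length // n for c in range(n + 1)]
    bufs = [list(g) for g in grads]  # each worker's private buffer

    def chunk(worker, c):
        # Slicing copies, so this is a snapshot of what the worker sends.
        return bufs[worker][bounds[c]:bounds[c + 1]]

    # Phase 1 (scatter-reduce): after n - 1 steps, worker i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        sent = [((i - step) % n, chunk(i, (i - step) % n)) for i in range(n)]
        for i, (c, data) in enumerate(sent):
            dst = (i + 1) % n  # right-hand neighbour in the ring
            for k, value in enumerate(data):
                bufs[dst][bounds[c] + k] += value

    # Phase 2 (all-gather): pass the completed chunks around the ring
    # until every worker has all of them.
    for step in range(n - 1):
        sent = [((i + 1 - step) % n, chunk(i, (i + 1 - step) % n)) for i in range(n)]
        for i, (c, data) in enumerate(sent):
            dst = (i + 1) % n
            bufs[dst][bounds[c]:bounds[c + 1]] = data

    # Divide by the worker count to turn sums into averages.
    return [[value / n for value in buf] for buf in bufs]
```

Each worker only ever talks to its neighbour, which is why the pattern spreads traffic evenly instead of funnelling it through one parameter server.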

Cons

Requires infrastructure setup (multiple GPUs/nodes) to see benefits
Learning curve for distributed training concepts and debugging across machines
Performance gains depend on network bandwidth and cluster configuration
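To get a rough feel for the bandwidth dependence noted above, a back-of-the-envelope estimate can help. The sketch below assumes a ring all-reduce, in which each worker sends about 2(N - 1)/N times the gradient size per step. It ignores latency, overlap with computation, and compression, so treat it as a lower bound on communication time rather than a prediction. The function name and the example numbers are hypothetical.

```python
def ring_allreduce_seconds(num_workers, payload_bytes, bandwidth_bytes_per_s):
    """Estimate per-step communication time for a bandwidth-bound ring all-reduce."""
    if num_workers < 2:
        return 0.0  # a single worker has nothing to exchange
    # Each worker sends roughly 2 * (N - 1) / N times the payload in total.
    traffic = 2 * (num_workers - 1) / num_workers * payload_bytes
    return traffic / bandwidth_bytes_per_s


# Hypothetical example: 100M float32 parameters (400 MB) across 4 workers
# on a 10 Gbit/s link (1.25e9 bytes per second).
estimate = ring_allreduce_seconds(4, 400_000_000, 1_250_000_000)
```

If the estimate is comparable to the compute time of one training step, adding more workers on the same network is unlikely to scale well.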

Indexed from awesome-llmops and enriched against its public facts.

Pairs with

Other entries in the index that connect to this one.

Uses (3 entries)

TensorFlow

Community

An Open Source Machine Learning Framework for Everyone

★ 195,356 updated 1mo ago

PyTorch

Community

Tensors and Dynamic neural networks in Python with strong GPU acceleration

★ 100,318 updated 1mo ago

Keras

Community

Deep Learning for humans

★ 64,079 updated 1mo ago

Pairs with (1 entry)

Apache MXNet

Community

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more

★ 20,809 updated 2y ago
