Attention Is All You Need
by Community
Transformers
OSS
Added 1 June 2026
Overview
The seminal 2017 paper by Vaswani et al. that introduced the Transformer architecture, replacing recurrent layers with multi-head self-attention for sequence transduction. It shows that attention alone, without recurrence or convolution, can achieve state-of-the-art machine translation results, and the architecture became the foundation of modern large language models.
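As a rough illustration of the attention mechanism described above, the following is a minimal single-head sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, in plain Python. It is not the paper's reference implementation: the function names are hypothetical, the learned query, key, and value projections and the multi-head split are omitted, and the input vectors are made up for the example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d_k)) V.

    queries, keys, and values are lists of vectors (lists of floats).
    Queries and keys must share a dimension d_k; keys and values must
    have the same number of rows.
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    scale = 1.0 / math.sqrt(d_k)

    outputs, weights = [], []
    for q in queries:
        # One score per key: scaled dot product of this query with the key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(w_j * v[c] for w_j, v in zip(w, values)) for c in range(d_v)]
        )
    return outputs, weights


# Self-attention: the same (made-up) token vectors serve as Q, K, and V.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each output vector is a weighted average of the value vectors, with weights set by how strongly that position's query matches each key. No position depends on the output of another, which is what makes the computation easy to parallelize.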
Best for
Researchers and engineers building or modifying Transformer-based models for NLP and other domains
Use cases
- Foundational reference for implementing Transformer-based NLP models
- Understanding self-attention and positional encoding for sequence tasks
- Building encoder-decoder architectures for machine translation and summarization
Notes
Pros
- Introduced a highly parallelizable architecture that made training on large datasets practical
- Established attention as a core building block for countless follow-up models
- Simple yet powerful concept that generalizes beyond NLP to vision and other modalities
Cons
- Lacks inherent positional awareness, requiring explicit positional encodings
- Self-attention cost grows quadratically with sequence length, which limits long-context efficiency
- Original results require large compute and data; not a drop-in beginner tutorial
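The first two points can be made concrete with a short sketch. It assumes the sinusoidal encoding from the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the function names and the sizes used below are illustrative and not taken from any library.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Build a seq_len x d_model table of fixed sinusoidal encodings.

    Even columns hold sines and odd columns hold cosines, with
    wavelengths that grow geometrically across the columns.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even for paired sin/cos columns")
    table = []
    for pos in range(seq_len):
        row = [0.0] * d_model
        # `even` plays the role of 2i in the formula above.
        for even in range(0, d_model, 2):
            angle = pos / (10000 ** (even / d_model))
            row[even] = math.sin(angle)
            row[even + 1] = math.cos(angle)
        table.append(row)
    return table


def attention_score_count(seq_len):
    """Query-key scores a single attention head computes for one sequence."""
    return seq_len * seq_len


# The table is added to the token embeddings so the model can tell positions apart.
encoding = sinusoidal_positional_encoding(seq_len=4, d_model=8)

# Illustrative sequence lengths only.
for n in (512, 1024, 2048):
    print(n, attention_score_count(n))
```

Doubling the sequence length quadruples the number of scores per head, which is the quadratic growth noted above.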
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
PyTorch
Community
Tensors and Dynamic neural networks in Python with strong GPU acceleration
TensorFlow
Community
An Open Source Machine Learning Framework for Everyone
Jax
Community
Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
Keras
Community
Deep Learning for humans