concept technology ★ seed

Transformers

Attention is all you need. Self-attention lets a model weigh the relevance of every part of the input at once.

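The Transformer architecture (Vaswani et al., 2017) replaced recurrence with self-attention: instead of processing tokens sequentially, each token attends to every token in the sequence in parallel. Each token is projected to a query, a key, and a value; relevance scores are scaled dot products of queries with keys, normalized with a softmax and used to take a weighted sum of the values. Positional encodings are added to the token embeddings to inject sequence order, since attention by itself ignores order. Multi-head attention runs several attention operations in parallel on different learned projections, so the model can attend to different relationship types simultaneously. Because no position has to wait for the previous position's hidden state, training parallelizes across the whole sequence, which enabled scaling to billions of parameters. Transformers first took over NLP (BERT, GPT), then spread to vision (ViT), protein structure prediction (AlphaFold 2), and generative AI, which suggests that attention works as a general-purpose computational primitive, although in practice it is always paired with feed-forward layers, residual connections, and normalization.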
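The query/key/value mechanism and the positional encoding described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the reference implementation: it uses plain Python lists instead of a tensor library, a single attention head, no masking, and the sinusoidal encoding from Vaswani et al. (2017). The function names and the weight matrices `w_q`, `w_k`, `w_v` are hypothetical and chosen only for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def positional_encoding(n_tokens, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    pe = []
    for pos in range(n_tokens):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (no masking).

    x is n_tokens x d_model; w_q, w_k, w_v are d_model x d_k
    (hypothetical projection weights, normally learned).
    Returns (output, attention_weights).
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    # Every query is scored against every key in one matrix product.
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


# Toy input: 3 tokens with d_model = 4; identity projections stand in
# for learned weights (an assumption made to keep the example small).
tokens = [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 2.0], [1.0, 1.0, 1.0, 1.0]]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
pe = positional_encoding(3, 4)
x = [[t + p for t, p in zip(t_row, p_row)] for t_row, p_row in zip(tokens, pe)]
out, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` is one token's distribution over all tokens, so it sums to 1, and `out` has one row per input token. Multi-head attention would repeat `self_attention` with separate projections per head and concatenate the results; that step is left out here to keep the sketch short.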

#transformers #attention #nlp #scaling

References

Vaswani et al. (2017). Attention Is All You Need. arxiv.org/abs/1706.03762