From RNNs to transformers: The evolution of attention in AI
How attention solved the bottleneck of RNNs and paved the way for transformers
In this lecture, we begin by tracing the journey of sequence modeling in AI. While today’s transformers dominate natural language processing and vision tasks, the path to their creation is deeply connected to earlier architectures like recurrent neural networks (RNNs) and long short-term memory networks (LSTMs). Understanding this evolution helps us appreciate why attention became such a pivotal breakthrough.
Neural Machine Translation: The sequence-to-sequence problem
The classic example for sequence modeling is machine translation. Take the sentence:
Input (English): “This is a cat.”
Output (French): “C’est un chat.”
At its core, this is a sequence-to-sequence mapping problem. The model must take an input sequence of tokens in one language and generate a corresponding output sequence in another language.
The early solution was the encoder-decoder architecture:
The encoder reads the input sequence and produces a fixed-size hidden representation.
The decoder generates the output sequence step by step, conditioned on that hidden representation.
Encoder-Decoder with RNNs
In the original design, both encoder and decoder were implemented as recurrent neural networks (RNNs).
The encoder processed the input tokens sequentially, updating a hidden state at each step. After the final token, the hidden state (often called the context vector) summarized the entire input sentence.
The decoder then used this single vector as its initial hidden state to begin generating the translated sentence.
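To make this concrete, here is a minimal sketch of such an encoder-decoder in PyTorch. The GRU cells, dimensions, start-of-sequence token, and greedy decoding loop are illustrative choices rather than the exact setup of the original papers; the point is simply that the decoder receives nothing from the encoder except its final hidden state.

```python
import torch
import torch.nn as nn

# Minimal sketch: an RNN (GRU) encoder-decoder where the encoder's final
# hidden state is the single fixed-size context vector handed to the decoder.
# Vocabulary sizes and dimensions are illustrative placeholders.

class Encoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):                 # (batch, src_len)
        embedded = self.embed(src_tokens)          # (batch, src_len, emb_dim)
        outputs, hidden = self.rnn(embedded)       # hidden: (1, batch, hidden_dim)
        return outputs, hidden                     # hidden acts as the context vector

class Decoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_token, hidden):         # one decoding step at a time
        embedded = self.embed(prev_token)          # (batch, 1, emb_dim)
        output, hidden = self.rnn(embedded, hidden)
        logits = self.out(output.squeeze(1))       # (batch, vocab_size)
        return logits, hidden

# The decoder is seeded with the encoder's final hidden state only:
encoder, decoder = Encoder(), Decoder()
src = torch.randint(0, 1000, (1, 5))               # a 5-token "sentence"
_, context = encoder(src)
token = torch.zeros(1, 1, dtype=torch.long)        # assumed start-of-sequence id 0
hidden = context
for _ in range(4):
    logits, hidden = decoder(token, hidden)
    token = logits.argmax(dim=-1, keepdim=True)    # greedy choice of next token
```

Everything the decoder knows about the source sentence has to pass through that single context tensor.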
This approach worked for short sentences but quickly ran into limitations.
The problem with RNNs
RNNs, by design, process sequences step by step. The hidden state must carry forward all relevant information, but it has limited capacity.
For short sentences, this can work reasonably well.
For long sentences, however, the context vector becomes a bottleneck. Information from earlier tokens fades as more tokens are processed, leading to poor translations.
This is often described as the long-range dependency problem. Models struggle to retain and use information from distant tokens in the sequence.
Key milestones in sequence modeling
The path toward solving this challenge unfolded over decades:
1980s: Early RNNs were proposed but struggled with vanishing gradients.
1997: LSTMs introduced gating mechanisms to mitigate vanishing gradients, improving performance on longer sequences.
2014: Attention mechanisms were introduced, allowing models to focus on relevant parts of the input sequence dynamically.
2017: Self-attention was generalized in the Attention Is All You Need paper, giving rise to the modern transformer architecture.
Bahdanau attention: A breakthrough in 2014
In 2014, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio introduced the first attention mechanism for neural machine translation.
The key idea: Instead of compressing the entire input sentence into a single context vector, the decoder could attend to different parts of the input at each output step.
For example, when predicting the French word “chat,” the decoder could place more weight on the English word “cat.”
This solved the bottleneck problem by allowing the decoder to access all encoder states and dynamically decide which ones were most relevant at each step.
How Bahdanau attention works
Each input token is encoded into a hidden state by the RNN encoder.
When the decoder is about to generate a token, it computes alignment scores between its hidden state from the previous step and each encoder hidden state.
These scores are normalized into a probability distribution (using softmax).
The decoder forms a context vector as a weighted sum of the encoder states, using these attention weights.
This context vector, combined with the decoder’s hidden state, produces the next output token.
The process repeats for each token in the output sequence.
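To ground these steps, here is a small NumPy sketch of a single decoding step with additive (Bahdanau-style) attention. The matrices W_a, U_a, and v_a stand in for the learned parameters of the alignment model, and all values are random placeholders; only the shapes and the sequence of operations matter.

```python
import numpy as np

# One decoder step with additive (Bahdanau-style) attention.
# W_a, U_a, v_a play the role of the learned alignment parameters
# in e_j = v_a . tanh(W_a s_{i-1} + U_a h_j).

rng = np.random.default_rng(0)
src_len, enc_dim, dec_dim, attn_dim = 5, 8, 8, 16

H = rng.normal(size=(src_len, enc_dim))    # encoder hidden states h_1..h_T
s_prev = rng.normal(size=(dec_dim,))       # decoder hidden state from previous step

W_a = rng.normal(size=(attn_dim, dec_dim))
U_a = rng.normal(size=(attn_dim, enc_dim))
v_a = rng.normal(size=(attn_dim,))

# 1. Alignment scores between the decoder state and every encoder state
scores = np.tanh(W_a @ s_prev + H @ U_a.T) @ v_a          # (src_len,)

# 2. Softmax turns the scores into attention weights that sum to 1
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# 3. Context vector: weighted sum of the encoder states
context = weights @ H                                       # (enc_dim,)

# 4. The decoder would now combine `context` with its hidden state and the
#    previous output token to produce the next token's distribution.
print(weights.round(3), context.shape)
```

Note that the weights are recomputed at every output step, so the decoder can look at a different part of the source sentence each time.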
Visualizing attention
One of the appealing aspects of attention mechanisms is their interpretability. Alignment matrices show which input tokens are attended to during each output step. For example:
When generating “chat,” the highest attention weight would likely fall on “cat.”
When generating “un,” the attention might shift toward “a.”
This visualization provides intuitive evidence of how attention distributes focus across tokens.
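As a sketch, the matrix below is written by hand purely for illustration (it is not the output of a trained model), but plotting it the same way one would plot real attention weights shows how such an alignment visualization reads.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative only: a hand-made alignment matrix (rows = output tokens,
# columns = input tokens), not weights produced by a trained model.
src = ["This", "is", "a", "cat", "."]
tgt = ["C'est", "un", "chat", "."]
attn = np.array([
    [0.45, 0.40, 0.05, 0.05, 0.05],   # "C'est" attends to "This is"
    [0.05, 0.05, 0.80, 0.05, 0.05],   # "un"    attends to "a"
    [0.05, 0.05, 0.05, 0.80, 0.05],   # "chat"  attends to "cat"
    [0.05, 0.05, 0.05, 0.05, 0.80],   # "."     attends to "."
])

plt.imshow(attn, cmap="viridis")
plt.xticks(range(len(src)), src)
plt.yticks(range(len(tgt)), tgt)
plt.colorbar(label="attention weight")
plt.title("Alignment matrix (illustrative)")
plt.show()
```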
Beyond Bahdanau: Self-attention and transformers
Bahdanau’s work marked the first major pivot away from RNN bottlenecks. Soon after, researchers generalized attention beyond sequence-to-sequence translation.
Luong attention (2015) introduced simpler alignment scoring functions, including dot-product and bilinear ("general") variants.
Self-attention (2017) extended the idea further: instead of the decoder attending to the encoder, a sequence attends to itself, capturing dependencies among all of its tokens.
This self-attention mechanism became the foundation of the transformer architecture, eliminating the need for recurrence altogether.
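As a preview of the next lecture, here is a minimal NumPy sketch of single-head scaled dot-product self-attention, the core operation of the transformer. The projection matrices are random stand-ins for learned weights, and the dimensions are arbitrary.

```python
import numpy as np

# Minimal single-head self-attention: every position attends to every other
# position in the same sequence. Random matrices stand in for learned weights.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 16, 16

X = rng.normal(size=(seq_len, d_model))            # token representations
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v                # queries, keys, values

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / np.sqrt(d_k)                    # (seq_len, seq_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

output = weights @ V                               # (seq_len, d_k)
print(weights.shape, output.shape)
```

Because the scores for all positions come from a single matrix product rather than a step-by-step recurrence, this computation parallelizes far better than an RNN.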
Conclusion
The history of attention illustrates a broader theme in AI: the shift from sequential, memory-constrained models to architectures that can dynamically focus on relevant information. Bahdanau’s attention mechanism was the bridge that made this possible, paving the way for transformers to become the dominant architecture in both language and vision tasks today.
In the next part of this series, we will dive deeper into self-attention, unpacking how it works mathematically and why it scales so effectively to large models.