Revolutionizing Sequence Modeling with Transformers

Jan 9, 2025

Attention Is All You Need

Authors

  • Ashish Vaswani (Google Brain)
  • Noam Shazeer (Google Brain)
  • Niki Parmar (Google Research)
  • Jakob Uszkoreit (Google Research)
  • Llion Jones (Google Research)
  • Aidan N. Gomez (University of Toronto)
  • Łukasz Kaiser (Google Brain)
  • Illia Polosukhin

Abstract

  • Dominant sequence transduction models are based on complex recurrent or convolutional networks with an encoder-decoder structure; the best of them also connect encoder and decoder through an attention mechanism.
  • The proposed Transformer dispenses with recurrence and convolutions entirely, relying solely on attention mechanisms.
  • Advantages of the Transformer:
    • More parallelizable
    • Requires less training time
  • Achieved state-of-the-art results:
    • 28.4 BLEU on WMT 2014 English-to-German
    • 41.0 BLEU on WMT 2014 English-to-French

1. Introduction

  • Recurrent neural networks, in particular LSTMs and gated recurrent units, were firmly established as the state of the art in sequence modeling and transduction tasks such as machine translation.
  • Recurrent models compute hidden states sequentially along the symbol positions, which precludes parallelization within a training example.
  • Attention mechanisms model dependencies without regard to distance.
  • The Transformer offers:
    • Improved parallelization
    • Faster training
    • Superior translation quality

2. Background

  • Models such as the Extended Neural GPU, ByteNet, and ConvS2S reduce sequential computation by using convolutional networks, but the number of operations needed to relate two positions grows with the distance between them.
  • The Transformer reduces the number of operations required to relate signals from any two input or output positions to a constant, at the cost of reduced effective resolution, which multi-head attention counteracts.
  • Self-attention (intra-attention) had already been used successfully in tasks such as reading comprehension and language modeling.
  • The Transformer is the first transduction model that relies entirely on self-attention to compute representations of its input and output, without sequence-aligned RNNs or convolution.

3. Model Architecture

  • Follows the encoder-decoder structure, built from stacked self-attention and position-wise, fully connected layers.
  • Encoder Stack:
    • A stack of 6 identical layers, each with two sub-layers: multi-head self-attention and a position-wise feed-forward network.
    • Each sub-layer is wrapped in a residual connection followed by layer normalization (see the sketch after this list).
  • Decoder Stack:
    • Also 6 identical layers, with a third sub-layer that performs multi-head attention over the encoder output.
    • Self-attention over the output is masked so that each position can only attend to earlier positions, preserving the auto-regressive property.
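
The "add & norm" pattern around every sub-layer is LayerNorm(x + Sublayer(x)). Below is a minimal NumPy sketch of that wrapper, with a toy random sub-layer and no learned layer-norm gain or bias (both simplifications of mine, not details from the paper):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    """LayerNorm(x + Sublayer(x)): the wrapper used around every sub-layer."""
    return layer_norm(x + sublayer(x))

# Toy usage: a (sequence length 4, d_model 8) input with a random linear sub-layer.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 8)) * 0.1
print(residual_sublayer(x, lambda h: h @ W).shape)  # (4, 8)
```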

3.2 Attention

  • Scaled Dot-Product Attention:
    • Computes dot products of the queries with all keys, divides each by the square root of the key dimension d_k, and applies a softmax to obtain the weights on the values.
    • The scaling counteracts the extremely small softmax gradients that large dot products would otherwise cause.
  • Multi-Head Attention:
    • Linearly projects the queries, keys, and values h = 8 times with different learned projections, runs attention in parallel, and concatenates the results.
    • Lets the model jointly attend to information from different representation subspaces at different positions (see the sketch after this list).
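
A compact NumPy sketch of both mechanisms; the head count, dimensions, and random weights below are illustrative choices, not values tied to any trained model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (heads, seq_q, seq_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)          # where mask is False, use a large negative score
    return softmax(scores) @ V

def multi_head_attention(x, W_q, W_k, W_v, W_o, h=8):
    """Project into h heads, attend in parallel, concatenate, project back."""
    seq_len, d_model = x.shape
    d_k = d_model // h

    def split_heads(W):
        return (x @ W).reshape(seq_len, h, d_k).transpose(1, 0, 2)  # (h, seq_len, d_k)

    Q, K, V = split_heads(W_q), split_heads(W_k), split_heads(W_v)
    heads = scaled_dot_product_attention(Q, K, V)       # (h, seq_len, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Toy self-attention over a length-5 sequence with d_model = 64 and 8 heads.
d_model, seq_len = 64, 5
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_attention(x, W_q, W_k, W_v, W_o).shape)  # (5, 64)
```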

3.3 Position-wise Feed-Forward Networks

  • Each encoder and decoder layer also contains a feed-forward network applied to each position separately and identically: two linear transformations with a ReLU activation in between (d_model = 512, inner dimension d_ff = 2048), as sketched below.
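
The paper gives the exact form FFN(x) = max(0, x W1 + b1) W2 + b2; a direct NumPy sketch with random stand-in weights:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Paper dimensions: d_model = 512, d_ff = 2048; the weights here are random stand-ins.
d_model, d_ff, seq_len = 512, 2048, 10
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```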

3.4 Embeddings and Softmax

  • Uses learned embeddings to convert input and output tokens to vectors of dimension d_model, shares the same weight matrix between the two embedding layers and the pre-softmax linear transformation, and multiplies the embedding weights by sqrt(d_model) (see the sketch below).
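
A small sketch of that weight sharing: one matrix serves as the token embedding table (scaled by sqrt(d_model)) and, transposed, as the pre-softmax projection. The vocabulary size and random weights are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

vocab_size, d_model = 1000, 512
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, d_model)) * 0.02  # the single shared weight matrix

# Input side: embedding lookup, multiplied by sqrt(d_model).
token_ids = np.array([3, 17, 42])
embedded = E[token_ids] * np.sqrt(d_model)              # (3, d_model)

# Output side: the same matrix, transposed, acts as the pre-softmax linear transformation.
decoder_states = rng.standard_normal((3, d_model))
next_token_probs = softmax(decoder_states @ E.T)        # (3, vocab_size)
print(embedded.shape, next_token_probs.shape)
```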

3.5 Positional Encoding

  • Because the model contains no recurrence or convolution, sine and cosine functions of different frequencies are added to the input embeddings to inject information about token position (sketched below).
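
The encodings are PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A NumPy sketch that builds the full table (the sequence length below is arbitrary):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: even dimensions use sine, odd dimensions use cosine."""
    pos = np.arange(max_len)[:, None]                    # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]                # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512); added to the input embeddings before the first layer
```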

4. Why Self-Attention

  • Compared to recurrent and convolutional layers, self-attention connects all positions with a constant number of sequential operations and a shorter maximum path length between long-range dependencies, and it is cheaper per layer whenever the sequence length is smaller than the representation dimension (see the comparison after this list).
  • For very long sequences, self-attention could be restricted to a neighborhood around each position, an option left for future work.
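
For context, the per-layer figures summarized in Table 1 of the paper (n = sequence length, d = representation dimension, k = convolution kernel width):

  • Self-attention: O(n²·d) per layer, O(1) sequential operations, O(1) maximum path length.
  • Recurrent: O(n·d²) per layer, O(n) sequential operations, O(n) maximum path length.
  • Convolutional: O(k·n·d²) per layer, O(1) sequential operations, O(log_k(n)) maximum path length.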

5. Training

5.1 Training Data and Batching

  • Trained on the WMT 2014 English-German dataset (about 4.5 million sentence pairs, byte-pair encoded with a shared vocabulary of roughly 37,000 tokens) and the much larger English-French dataset (36 million sentences, 32,000 word-piece vocabulary); batches contained roughly 25,000 source and 25,000 target tokens.
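
As a reminder of what byte-pair encoding does, here is a toy Python sketch of its core merge step (count adjacent symbol pairs, merge the most frequent); it is not the segmentation pipeline actually used to build the paper's vocabularies:

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across a corpus of {word-as-symbol-tuple: frequency}."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(corpus, pair):
    """Replace every occurrence of the chosen pair with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: words pre-split into characters, mapped to their frequencies.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(3):  # real vocabularies are built from tens of thousands of merges
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
print(list(corpus))
```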

5.2 Hardware and Schedule

  • Used 8 NVIDIA P100 GPUs.
  • Base models trained for 100,000 steps (about 12 hours); big models for 300,000 steps (about 3.5 days).

5.3 Optimizer

  • Adam optimizer (β1 = 0.9, β2 = 0.98, ε = 10^-9) with a learning rate that increases linearly over the first 4,000 warm-up steps and then decays with the inverse square root of the step number (sketched below).
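
The published schedule is lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)) with warmup_steps = 4,000; a direct implementation:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warm-up for the first warmup_steps, then inverse-square-root decay."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The learning rate peaks at the end of warm-up and decays afterwards.
for step in (1, 1000, 4000, 10000, 100000):
    print(step, round(transformer_lr(step), 6))
```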

5.4 Regularization

  • Residual dropout (P_drop = 0.1) applied to each sub-layer's output and to the sums of the embeddings and positional encodings, plus label smoothing with ε_ls = 0.1, which hurts perplexity but improves accuracy and BLEU (sketched below).
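
One common way to implement label smoothing with ε_ls = 0.1 is to replace each one-hot target with a softened distribution; the uniform spread over the remaining vocabulary entries used below is a standard choice, not a detail spelled out in the paper:

```python
import numpy as np

def smoothed_targets(labels, vocab_size, eps=0.1):
    """Put 1 - eps on the true token and spread eps uniformly over the other entries."""
    targets = np.full((len(labels), vocab_size), eps / (vocab_size - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

# Toy example: three target tokens drawn from a six-word vocabulary.
print(smoothed_targets(np.array([2, 0, 5]), vocab_size=6))
```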

6. Results

6.1 Machine Translation

  • The big model outperforms the best previously reported models, including ensembles, by more than 2.0 BLEU on English-to-German.
  • On English-to-French, the big model establishes a new single-model state of the art.
  • Training cost, estimated in floating-point operations, is a fraction of that of the previous state-of-the-art models.

6.2 Model Variations

  • Varied the number of attention heads, key/value dimensions, model size, and regularization on English-to-German: bigger models perform better, dropout is very helpful in avoiding over-fitting, and both too few and too many attention heads hurt quality.

7. Conclusion

  • The Transformer is the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-head self-attention.
  • It trains significantly faster than recurrent or convolutional architectures and achieves new state-of-the-art results on both translation tasks.
  • Future work includes extending the model to other modalities such as images, audio, and video, investigating local, restricted attention for large inputs, and making generation less sequential.