
Transformer Architecture Overview

Jun 21, 2024

Transformer Architecture in AI

Introduction

  • Transformers power recent AI breakthroughs (e.g., ChatGPT, Vision Transformers, AlphaFold 2).
  • Versatile: Work with text, images, speech, etc., as long as data is represented as vectors.
  • This is a remastered explanation of the architecture, with emphasis on self-attention.

Data Representation

Text

  • Tokenization: A tokenizer converts the text into subwords.
  • Each subword is assigned a unique vector (randomly initialized or taken from word embeddings; see the sketch after this list).
  • Word Embeddings: Represent semantic similarity.
    • Precomputed by analyzing word co-occurrences.
    • A neural network assigns similar vectors to semantically similar words.
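
A minimal sketch of the idea in PyTorch (the toy vocabulary and sizes below are made up for illustration; real systems use a trained subword tokenizer):

```python
import torch
import torch.nn as nn

# Toy "tokenizer": a hypothetical subword vocabulary for illustration only.
vocab = {"trans": 0, "former": 1, "##s": 2, "are": 3, "great": 4}
tokens = ["trans", "former", "##s", "are", "great"]
token_ids = torch.tensor([vocab[t] for t in tokens])

# Embedding table: each subword ID gets its own learnable vector.
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
vectors = embed(token_ids)          # shape: (5, 8) -- one vector per subword
print(vectors.shape)
```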

Images

  • Images are matrices of intensity values (one per RGB channel).
  • Flattening an image into a single vector by row-wise concatenation is too inefficient for transformers.
  • Instead, split the image into patches and apply a linear neural network layer to each patch to reduce it to a vector (see the sketch after this list).
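
A rough ViT-style patch-embedding sketch in PyTorch, assuming a 224×224 RGB image and 16×16 patches (all sizes are illustrative):

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)        # (batch, channels, height, width)

patch_size, embed_dim = 16, 256
# A Conv2d with stride == kernel size splits the image into non-overlapping
# 16x16 patches and applies one linear projection per patch.
to_patches = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

patches = to_patches(image)                 # (1, 256, 14, 14)
tokens = patches.flatten(2).transpose(1, 2) # (1, 196, 256): 196 patch vectors
print(tokens.shape)
```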

Transformer Processing

  • Neural networks process the input vectors into progressively better representations across layers.
  • Task examples: predicting next token, sentiment classification.

Transformer Layer

  • Takes an input sequence of vectors and outputs the same number of vectors with the same dimensionality.
  • Special tokens (e.g., a classification token) are added for specific tasks.
  • The output of the special token is passed to a classification layer for prediction/probability assignment (sketched after this list).
  • Training: compare predictions to expected results, backpropagate loss, update parameters.
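
A sketch of the classification setup, assuming PyTorch's built-in encoder layer and illustrative sizes (not the exact model from the video):

```python
import torch
import torch.nn as nn

embed_dim, num_classes, seq_len = 64, 2, 10
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))   # learnable [CLS] vector
encoder = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
classifier = nn.Linear(embed_dim, num_classes)

x = torch.randn(1, seq_len, embed_dim)     # token vectors
x = torch.cat([cls_token, x], dim=1)       # prepend the [CLS] token
out = encoder(x)                           # same number and size of vectors
logits = classifier(out[:, 0])             # use only the [CLS] output vector
probs = logits.softmax(dim=-1)             # class probabilities
print(probs)
```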

Transformer Components

Feed Forward Network (MLP Sublayer)

  • A dense layer expands the dimension (e.g., doubles it) and applies a GeLU activation.
  • A second dense layer projects back down to the original dimension.
  • Each token is processed independently by the same MLP (sketched after this list).
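
A minimal sketch of the MLP sublayer in PyTorch (the 2× expansion factor is illustrative; 4× is also common):

```python
import torch
import torch.nn as nn

embed_dim = 64
ffn = nn.Sequential(
    nn.Linear(embed_dim, 2 * embed_dim),  # expand the dimension
    nn.GELU(),                            # non-linearity
    nn.Linear(2 * embed_dim, embed_dim),  # project back to the original size
)

x = torch.randn(1, 10, embed_dim)   # (batch, tokens, dim)
y = ffn(x)                          # applied to every token independently
print(y.shape)                      # torch.Size([1, 10, 64])
```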

Self-Attention Sublayer

  • Self-attention enables information flow within sequence.
  • Computes importance of each token relative to others.
  • Attention weights used to combine token representations.
  • Attention vs. Self-Attention: self-attention relates tokens within the same sequence; (cross-)attention relates tokens across different sequences.

Attention Computation

  • Input vectors are transformed into queries, keys, and values.
  • Each input vector is multiplied by a query, a key, and a value matrix; these matrices are initialized randomly and learned during training.
  • Attention scores are calculated via the scalar product of queries and keys, followed by a softmax.
  • Final token representation: a weighted sum of the value vectors (see the sketch after this list).
  • Multi-head self-attention: multiple sets of attention patterns.
  • Number of attention heads: a hyperparameter balancing complexity and memory usage.
  • Research on approximating/replacing attention to reduce resource consumption.
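
A compact single-head self-attention sketch in PyTorch following the steps above (dimensions are illustrative; multi-head attention repeats this with several independent projections and concatenates the results):

```python
import torch
import torch.nn as nn

embed_dim = 64
x = torch.randn(1, 10, embed_dim)          # 10 token vectors

# Randomly initialized (and normally learned) projection matrices.
W_q = nn.Linear(embed_dim, embed_dim, bias=False)
W_k = nn.Linear(embed_dim, embed_dim, bias=False)
W_v = nn.Linear(embed_dim, embed_dim, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)

# Attention scores: scaled dot product of every query with every key.
scores = Q @ K.transpose(-2, -1) / embed_dim ** 0.5   # (1, 10, 10)
weights = scores.softmax(dim=-1)                      # each row sums to 1

# New token representations: weighted sums of the value vectors.
out = weights @ V                                     # (1, 10, 64)
print(out.shape)
```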

Sequence Information

Positional Embeddings

  • Order matters: positional embeddings are added to the input embeddings.
  • Positions are encoded either with rule-based vectors (e.g., sinusoidal) or with learnable vectors (rule-based variant sketched below).
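
One rule-based option is the sinusoidal encoding from the original Transformer paper; a sketch with illustrative sizes:

```python
import torch

seq_len, embed_dim = 10, 64
pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)   # (10, 1)
i = torch.arange(0, embed_dim, 2, dtype=torch.float)          # even dimensions
angle = pos / (10000 ** (i / embed_dim))                      # (10, 32)

pe = torch.zeros(seq_len, embed_dim)
pe[:, 0::2] = torch.sin(angle)   # sine on even dimensions
pe[:, 1::2] = torch.cos(angle)   # cosine on odd dimensions

token_vectors = torch.randn(seq_len, embed_dim)
x = token_vectors + pe           # positional info added to the input embeddings
print(x.shape)
```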

Residual Connections

  • The input of the attention sublayer is added to its output; layer normalization keeps the values in a stable range (sketched after this list).
  • Facilitates training: each layer only has to transform its input incrementally.
  • Stacked transformer layers enable deeper problem-solving.
  • Prevents gradient signal loss in deep networks.
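
A sketch of the residual pattern around the attention sublayer (post-norm variant; pre-norm ordering is also common):

```python
import torch
import torch.nn as nn

embed_dim = 64
attention = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
norm = nn.LayerNorm(embed_dim)

x = torch.randn(1, 10, embed_dim)
attn_out, _ = attention(x, x, x)   # self-attention: queries, keys, values from x
x = norm(x + attn_out)             # residual connection, then normalization
print(x.shape)
```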

Training Procedures

Masked Language Modeling (BERT)

  • A classification token ([CLS]) is added, used e.g. for sentence-pair classification.
  • Randomly mask 15% of the tokens by replacing them with a [MASK] token (toy sketch after this list).
  • Objective: adapt the weights to correctly predict the masked words.
  • Well suited for training encoder transformers used for classification.
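
A toy sketch of the masking step (vocabulary size and [MASK] id are made up; BERT additionally replaces some selected tokens with random or unchanged tokens):

```python
import torch

vocab_size, mask_id = 1000, 999                          # hypothetical vocab and [MASK] id
token_ids = torch.randint(0, vocab_size - 2, (1, 20))    # stand-in for real token ids

# Pick roughly 15% of positions at random and replace them with [MASK].
mask = torch.rand(token_ids.shape) < 0.15
labels = token_ids.clone()                               # the model must predict these
corrupted = token_ids.masked_fill(mask, mask_id)

# Training: feed `corrupted` into the transformer and compute the loss only
# at the masked positions, comparing predictions against `labels`.
print(corrupted[mask], labels[mask])
```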

Predicting Next Word (GPT)

  • Decoder transformers are trained to predict the next word; a causal mask ensures each position only attends to previous tokens (sketched below).
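
A sketch of the causal mask that makes next-word prediction possible: each position may only attend to itself and earlier positions (sizes are illustrative):

```python
import torch

seq_len = 5
# Upper-triangular mask: position i may not attend to positions j > i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

scores = torch.randn(seq_len, seq_len)                   # raw attention scores
scores = scores.masked_fill(causal_mask, float("-inf"))  # block future positions
weights = scores.softmax(dim=-1)                         # future tokens get weight 0
print(weights)
# The output at position i is then used to predict token i+1.
```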

Transformers vs. RNNs

  • Transformers: Parallel processing of tokens via attention.
  • RNNs: Sequential processing, where each step depends on the previous one; training is slower.
  • The transformer’s parallelism enables training on very large datasets (e.g., large parts of the internet).

Conclusion

  • Transformers revolutionized NLP with parallel token processing.
  • Learning resources: Jay Alammar’s Illustrated Transformer, Luis Serrano’s series.
  • Thanks to supporters and viewers.