The RNN is described as a state machine with three variables: U, V, W.
Each unit in the RNN has only these three variables, which are updated every time a new word is read (e.g., "I have a cat").
Problem: the variables are updated so frequently that earlier information is lost (flyer-pasting analogy: each new flyer is pasted over the old ones and hides them).
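A minimal NumPy sketch of one RNN step (hypothetical sizes, not from the lecture): the same U, V, W are reused at every step, and the hidden state is overwritten each time a new word arrives, which is why early words fade.

```python
import numpy as np

# Hypothetical sizes: 4-dim word embeddings, 3-dim hidden state.
U = np.random.randn(3, 4)   # input  -> hidden
W = np.random.randn(3, 3)   # hidden -> hidden (recurrence)
V = np.random.randn(4, 3)   # hidden -> output

def rnn_step(x_t, h_prev):
    """One RNN step: the new hidden state completely replaces the old one."""
    h_t = np.tanh(U @ x_t + W @ h_prev)   # same U, W reused at every step
    y_t = V @ h_t                         # output for this step
    return h_t, y_t

h = np.zeros(3)
for x in np.random.randn(4, 4):           # e.g. the 4 words of "I have a cat"
    h, y = rnn_step(x, h)                 # earlier words only survive through h
```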
LSTM (Long Short-Term Memory)
Introduces gates so that earlier information is not forgotten too quickly.
Updates are less abrupt than in an RNN, preserving earlier information for longer.
"Transparent flyers" analogy: new information is layered on like transparent flyers, so the earlier layers still show through.
Allows for more sophisticated sequence prediction by maintaining a better memory of previous states.
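A sketch of the gate arithmetic inside one LSTM cell (NumPy, illustrative shapes, biases omitted), showing how the gates let the cell state change gradually instead of being overwritten wholesale.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wo, Wc):
    """One LSTM step; each gate is a sigmoid in [0, 1] that scales information flow."""
    z = np.concatenate([h_prev, x_t])        # gates look at the previous state and the new word
    f = sigmoid(Wf @ z)                      # forget gate: how much old cell state to keep
    i = sigmoid(Wi @ z)                      # input gate: how much new information to write
    o = sigmoid(Wo @ z)                      # output gate: how much of the cell state to expose
    c_t = f * c_prev + i * np.tanh(Wc @ z)   # cell state changes gradually, not abruptly
    h_t = o * np.tanh(c_t)
    return h_t, c_t

hidden, embed = 3, 4
Wf, Wi, Wo, Wc = (np.random.randn(hidden, hidden + embed) for _ in range(4))
h, c = np.zeros(hidden), np.zeros(hidden)
for x in np.random.randn(4, embed):          # e.g. the 4 words of "I have a cat"
    h, c = lstm_step(x, h, c, Wf, Wi, Wo, Wc)
```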
GRU (Gated Recurrent Unit)
An improvement over LSTM with fewer gates, leading to faster training.
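For comparison, a GRU step under the same illustrative conventions: two gates (update and reset) instead of the LSTM's three, and no separate cell state, so there is less to compute and train.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """One GRU step: fewer gates than the LSTM and a single state vector."""
    zx = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ zx)                     # update gate: blend old state with the candidate
    r = sigmoid(Wr @ zx)                     # reset gate: how much old state feeds the candidate
    h_cand = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))
    return (1 - z) * h_prev + z * h_cand     # new hidden state

hidden, embed = 3, 4
Wz, Wr, Wh = (np.random.randn(hidden, hidden + embed) for _ in range(3))
h = np.zeros(hidden)
for x in np.random.randn(4, embed):
    h = gru_step(x, h, Wz, Wr, Wh)
```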
Chronology: RNN (1986), LSTM (1997), GRU (2014).
Machine Translation with Neural Networks
RNN: Maps each word to a corresponding unit in the RNN.
LSTM: Adds directional unit-to-unit connections, allowing for sequence prediction.
Can be bi-directional (from start to end and end to start).
Problem: RNNs and LSTMs are hard to process in parallel, because each step must wait for the previous step's hidden state.
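A sketch using PyTorch's nn.LSTM (illustrative sizes, random tensors) to show the bidirectional option, and to note why recurrence resists parallelization: step t cannot start until step t-1 has produced its hidden state.

```python
import torch
import torch.nn as nn

# Bidirectional LSTM: reads the sentence start-to-end and end-to-start.
encoder = nn.LSTM(input_size=32, hidden_size=64, num_layers=1,
                  bidirectional=True, batch_first=True)

src = torch.randn(1, 10, 32)        # (batch, sentence length, embedding dim) -- made-up sizes
outputs, (h_n, c_n) = encoder(src)  # outputs: (1, 10, 128), both directions concatenated

# The recurrence itself is sequential: step t needs the hidden state from step t-1,
# so the 10 time steps cannot be computed in parallel the way a Transformer
# processes all positions at once.
```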
Introduction to Transformer
Reference to the Google paper "Attention Is All You Need".
Various articles with similar titles followed.
Transformer Structure
Consists of an Encoder and a Decoder.
Encoder: Similar in structure to BERT.
Decoder: Similar in structure to GPT (e.g., ChatGPT), scaled up significantly.
Encoder and Decoder connected through Cross-Attention.
Analogous to an electrical transformer with primary and secondary coils.
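A minimal sketch with PyTorch's nn.Transformer (illustrative sizes and random tensors), just to show the encoder/decoder split and where cross-attention sits.

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randn(1, 10, 512)   # source-sentence embeddings (encoder input)
tgt = torch.randn(1, 7, 512)    # target words generated so far (decoder input)

# The encoder self-attends over src; the decoder self-attends over tgt and then
# cross-attends to the encoder output (the "memory"), linking the two halves.
out = model(src, tgt)           # (1, 7, 512): one representation per target position
```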
Encoder Structure
Input embedding
Positional encoding to resolve word-order issues (analogy: giving each word a numbered place in the line-up).
Self-attention mechanisms.
Residual connections.
Layer normalization.
Six replicated encoder blocks.
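A sketch of one encoder block wired up from the pieces listed above (hypothetical module, PyTorch, illustrative sizes): self-attention, then a feed-forward layer, each wrapped in a residual connection plus layer normalization; the full encoder stacks six of these.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One of the six identical encoder blocks (illustrative sizes)."""
    def __init__(self, d_model=512, nhead=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention: Q, K, V all come from x
        x = self.norm1(x + attn_out)       # residual connection + layer normalization
        x = self.norm2(x + self.ff(x))     # residual connection + layer normalization
        return x

encoder = nn.Sequential(*[EncoderBlock() for _ in range(6)])  # six replicated blocks
x = torch.randn(1, 10, 512)   # word embeddings with positional encoding already added
out = encoder(x)
```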
Decoder Structure
Masked multi-head attention to predict next word without seeing future words.
Cross-attention connection from encoder output.
Residual connections and layer normalization.
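A sketch of the causal mask used by the decoder's masked self-attention (PyTorch, illustrative length): position i may only attend to positions up to i, so the model cannot peek at future words.

```python
import torch
import torch.nn as nn

tgt_len = 5
# Boolean upper-triangular mask: True above the diagonal blocks attention to future positions.
causal_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)

masked_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
tgt = torch.randn(1, tgt_len, 512)
out, _ = masked_attn(tgt, tgt, tgt, attn_mask=causal_mask)   # position i only sees positions <= i

# Cross-attention reuses the same mechanism with Q taken from the decoder and
# K, V taken from the encoder output, followed by residual connections + layer norm.
```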
Multi-Head Attention
Uses Q (query), K (key), and V (value) matrices for the attention mechanism.
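A sketch of scaled dot-product attention over Q, K, V (NumPy, single head, illustrative shapes). Multi-head attention runs several of these in parallel on learned projections of the input and concatenates the results.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # how well each query matches each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # weighted sum of the values

seq_len, d_k = 4, 8
Q = np.random.randn(seq_len, d_k)   # queries: what each position is looking for
K = np.random.randn(seq_len, d_k)   # keys:    what each position offers
V = np.random.randn(seq_len, d_k)   # values:  the content actually passed along
out = scaled_dot_product_attention(Q, K, V)   # (4, 8)
```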