Understanding Generative Pre-trained Transformers
Aug 7, 2024
Lecture Notes on Generative Pre-trained Transformers (GPT)
Introduction to GPT
GPT stands for Generative Pre-trained Transformer.
Generative: Bots that generate new text.
Pre-trained: Models learn from massive data sets, allowing for fine-tuning on specific tasks.
Transformer: A key invention in AI; a specific type of neural network.
Types of Transformer Models
Different models can process audio, text, images, etc.
Early transformer models focused on language translation (2017, Google).
Current models, like ChatGPT, are designed to predict subsequent text given context.
The model's output is a probability distribution over possible next words.
Text Generation Process
Input Processing: Text input is divided into tokens (words or character combinations), and each token is mapped to a vector representing its meaning.
Attention Mechanism: Tokens interact through an attention block to update their meanings based on context.
Feedforward Layer: Vectors are processed in parallel to further refine their meanings.
Repetition: The model alternates between attention blocks and feedforward layers until the final outputs are produced.
Output Generation: The last vector is processed to produce a probability distribution over potential next tokens (the full pipeline is sketched below).
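A minimal, hypothetical sketch of this pipeline in Python/NumPy. The names (EMBED, UNEMBED, attention_block, feedforward) and the sizes are toy placeholders standing in for the real architecture and learned weights, not GPT itself.
```python
# Toy sketch of the transformer forward pass described above. All weights are
# random placeholders, not a trained model.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50, 16          # toy sizes; GPT-3 uses ~50k tokens and 12,288 dims

EMBED = rng.normal(size=(vocab_size, d_model))    # embedding matrix: token id -> vector
UNEMBED = rng.normal(size=(d_model, vocab_size))  # maps the final vector to vocabulary scores

def attention_block(x):
    # Placeholder: real attention mixes information across token positions
    # (see the attention sketch later in these notes).
    return x + 0.1 * x.mean(axis=0, keepdims=True)

def feedforward(x):
    # Placeholder: a real MLP applies learned weight matrices with a nonlinearity.
    return np.maximum(x, 0.0)

def softmax(z):
    z = z - z.max()                   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

tokens = np.array([3, 17, 42])        # token ids for the input text
x = EMBED[tokens]                     # one vector per token

for _ in range(4):                    # alternate attention and feedforward layers
    x = attention_block(x)
    x = feedforward(x)

logits = x[-1] @ UNEMBED              # only the last vector predicts the next token
probs = softmax(logits)               # probability distribution over the vocabulary
print(probs.shape, probs.sum())       # (50,) and a sum of 1 (up to rounding)
```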
Key Components
Tokens: Basic units of input (words, word pieces, image patches, etc.).
Vectors: Mathematical representations of tokens; tokens with similar meanings have nearby vectors in high-dimensional space.
Attention Block: Determines which parts of the context are relevant and updates word meanings accordingly (a sketch of a basic attention computation follows this list).
Feedforward Layer (Multilayer Perceptron): Updates the vectors using a standard deep learning operation.
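A minimal sketch of scaled dot-product self-attention, the core computation inside an attention block. The weight matrices W_q, W_k, W_v are made-up toy parameters; real models add multiple heads, masking, and per-layer learned projections.
```python
# Toy scaled dot-product self-attention over a short sequence of token vectors.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v       # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # how much each token attends to every other token
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # context-weighted combination of values

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))       # one vector per token
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape) # (5, 8): one updated vector per token
```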
Prediction Model
The trained model predicts the next token from its input, which allows text generation by:
Providing an initial piece of text (a seed).
Repeatedly sampling from the predicted distribution and appending the sampled token to the text (see the sketch below).
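A minimal sketch of this sampling loop, assuming a hypothetical model function that returns next-token probabilities (here a random stand-in, not a trained GPT).
```python
# Toy autoregressive generation loop: predict, sample, append, repeat.
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 50

def model(tokens):
    # Stand-in for a trained transformer: returns a probability distribution
    # over the vocabulary for the next token.
    logits = rng.normal(size=vocab_size)
    e = np.exp(logits - logits.max())
    return e / e.sum()

tokens = [3, 17, 42]                              # token ids for the seed text
for _ in range(10):                               # generate ten more tokens
    probs = model(tokens)                         # distribution over possible next tokens
    next_token = rng.choice(vocab_size, p=probs)  # sample rather than always taking the argmax
    tokens.append(int(next_token))                # append and feed the longer sequence back in

print(tokens)
```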
Training and Parameters
Deep Learning: Uses data to tune model parameters rather than relying on explicit programming.
Backpropagation: The standard algorithm for computing the gradients used to adjust parameters during training; it is what makes training at scale practical (a toy training step is sketched below).
Parameters: The model's weights (e.g., GPT-3 has 175 billion parameters).
Weight Matrices: Organize the parameters into matrices that process input data through multiple layers.
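A toy sketch of one training loop on a single linear layer with a squared-error loss. The gradient is written out by hand to show what backpropagation automates for deep, multi-layer models; the sizes and data are made up.
```python
# Toy gradient-descent training of a single weight matrix.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 1))                 # the model's parameters (a tiny weight matrix)
x = rng.normal(size=(8, 4))                 # a batch of training inputs
y = rng.normal(size=(8, 1))                 # the targets the model should predict

learning_rate = 0.1
for step in range(100):
    pred = x @ W                            # forward pass
    loss = ((pred - y) ** 2).mean()         # how wrong the predictions are
    grad = 2 * x.T @ (pred - y) / len(x)    # gradient of the loss w.r.t. W
    W -= learning_rate * grad               # adjust parameters to reduce the loss

print(float(loss))                          # the loss shrinks as the parameters are tuned
```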
Embeddings and Vectors
Embedding Matrix: Converts tokens into vectors, with each word represented as a point in high-dimensional space (GPT-3's embeddings have 12,288 dimensions).
Semantic Meaning: Directions in embedding space carry meaning; words with similar meanings cluster together.
Example: Vector arithmetic can reveal linguistic relationships (e.g., king − man + woman ≈ queen); a toy version is sketched below.
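A toy sketch of the king − man + woman ≈ queen idea, using made-up 3-dimensional embeddings rather than real learned ones (real embeddings have thousands of dimensions).
```python
# Toy word-vector arithmetic with hypothetical embeddings.
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.1, 0.9, 0.0]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.9, 0.0, 1.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]     # shift along the "gender" direction
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # with these toy vectors, "queen" is the closest match
```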
Context Size and Limitations
Transformers handle a fixed context size; GPT-3 uses a context size of 2048 tokens.
Inputs longer than the context window cause earlier parts of a conversation to be dropped, so the model loses that context (see the snippet below).
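A tiny illustration of why long conversations lose early context, assuming the input is simply truncated to the most recent 2048 tokens.
```python
# Only the most recent `context_size` tokens fit into the model's fixed input window.
context_size = 2048                              # GPT-3's context length

conversation_tokens = list(range(3000))          # stand-in for a long token sequence
window = conversation_tokens[-context_size:]     # everything earlier is dropped
print(len(window), window[0])                    # 2048 952 -- the first 952 tokens are gone
```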
Output Layer and Softmax Function
Probability Distribution: The last vector is mapped to a score for every token in the vocabulary.
Softmax: Normalizes those scores into a probability distribution over possible next tokens (see the sketch below).
Logits: The raw, unnormalized scores before softmax is applied; the inputs to softmax are called logits.
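A minimal sketch of applying softmax to a handful of made-up logit values to obtain next-token probabilities.
```python
# Softmax turns raw logits into a probability distribution.
import numpy as np

def softmax(logits):
    z = logits - logits.max()        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1, -1.0])   # made-up scores for four candidate tokens
probs = softmax(logits)
print(probs, probs.sum())                  # largest logit gets the largest probability; sums to 1
```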
Summary
A good understanding of word embeddings, softmax, and matrix multiplications is key to grasping the attention mechanism.
Future chapters will delve deeper into the attention mechanism and other foundational concepts in deep learning.
Next Steps
The next chapter will focus on the attention blocks, critical for Transformer functionality.