Understanding Transformers and GPT Technology

Oct 15, 2024

Notes on Transformers, GPT, and Deep Learning

Overview of GPT

  • GPT stands for Generative Pre-trained Transformer.
    • Generative: Refers to the model's ability to generate new text.
    • Pre-trained: Indicates the model has learned from a massive amount of data before being fine-tuned for specific tasks.
    • Transformer: A specific type of neural network that is central to the current wave of AI advancements.

Goal of the Lecture

  • Visually explain the inner workings of a Transformer.
  • Follow the data flow step by step.

Types of Models Using Transformers

  • Models can convert audio to text or generate synthetic speech from text.
  • Tools like DALL-E and MidJourney generate images from text prompts and are based on Transformers.
  • The original Transformer was developed for text translation.

Transformer Functionality

  • The specific model of interest (e.g., ChatGPT) predicts the next word in a text sequence.
  • Prediction involves creating a probability distribution over possible next words.
  • Process of Text Generation (a code sketch follows this list):
    1. Input text is processed into tokens (words or fragments).
    2. Tokens are converted into vectors (numerical representations).
    3. Vectors interact and update meanings via attention blocks.
    4. Vectors undergo processing through multi-layer perceptron blocks.
    5. The attention and perceptron blocks alternate through many layers of the network.
    6. The last vector in the sequence is then used to predict the next token.
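
As a rough sketch of that loop, the code below assumes a hypothetical `model(tokens)` that returns a probability distribution over the vocabulary and a `tokenizer` with `encode`/`decode` methods; these names are placeholders, not any particular library's API.

```python
import numpy as np

def generate(model, tokenizer, prompt, n_new_tokens=20):
    """Repeatedly predict the next token and append it to the sequence.

    `model(tokens)` is assumed to return a probability distribution over the
    vocabulary; `tokenizer` converts text to token ids and back.
    """
    tokens = tokenizer.encode(prompt)                       # step 1: text -> tokens
    for _ in range(n_new_tokens):
        probs = model(tokens)                               # steps 2-5: embed, attention/MLP blocks
        next_token = np.random.choice(len(probs), p=probs)  # step 6: sample from the distribution
        tokens.append(next_token)                           # feed the prediction back in and repeat
    return tokenizer.decode(tokens)
```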

Attention Mechanism

  • Attention Block: Lets vectors communicate with one another and update their meanings based on context (a minimal sketch follows below).
  • Example: The word "model" means different things in different contexts (e.g., a machine-learning model vs. a fashion model).
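
A minimal single-head sketch of this idea in NumPy, with randomly initialized projection matrices; real GPT models use many heads plus a causal mask so tokens only attend to earlier positions.

```python
import numpy as np

def attention(X, W_q, W_k, W_v):
    """One attention head: each token vector is updated with a weighted mix of
    the other tokens' values, letting context flow into each vector's meaning."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # relevance of each token to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V                               # context-weighted update per token

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                          # 4 tokens, toy model dimension 8
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(attention(X, W_q, W_k, W_v).shape)             # (4, 4): one updated vector per token
```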

Multi-Layer Perceptron

  • Every vector passes through the same operation in parallel; unlike attention, vectors do not exchange information in this step (see the sketch below).
  • Understanding the underlying matrices is crucial for interpretation.
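
A minimal sketch of such a block, assuming a simple two-layer network with a ReLU nonlinearity (GPT-style models typically use GELU) and a residual connection.

```python
import numpy as np

def mlp_block(X, W_up, b_up, W_down, b_down):
    """Feed-forward block: the same two-layer network is applied to every
    token vector independently and in parallel."""
    hidden = np.maximum(0, X @ W_up + b_up)   # expand to a larger dimension, apply ReLU
    return X + hidden @ W_down + b_down       # project back down, with a residual connection

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                               # 4 token vectors of dimension 8
W_up, b_up = rng.normal(size=(8, 32)), np.zeros(32)       # expand 8 -> 32
W_down, b_down = rng.normal(size=(32, 8)), np.zeros(8)    # project 32 -> 8
print(mlp_block(X, W_up, b_up, W_down, b_down).shape)     # (4, 8)
```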

Data Flow and Contextual Understanding

  • The last vector in the sequence encodes the overall meaning of the input passage.
  • Transformers can handle a fixed number of tokens, known as context size.
    • Example: GPT-3 has a context size of 2048.

Embedding Vectors

  • Each token is associated with a vector representing its meaning in high-dimensional space.
  • Close vectors indicate similar meanings.
  • Embedding Matrix: Maps tokens to their respective vectors.
    • Example: In GPT-3, the embedding dimension is 12,288.
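
A small illustration of the lookup, using toy sizes (GPT-3's real embedding dimension is 12,288, and its vocabulary is roughly 50k tokens, a figure assumed here rather than stated in these notes).

```python
import numpy as np

vocab_size, d_embed = 1_000, 64                  # toy sizes for illustration
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, d_embed))   # one row (vector) per token

token_ids = [17, 542, 3]                         # hypothetical token ids for some input text
vectors = embedding_matrix[token_ids]            # the "lookup" is just row indexing
print(vectors.shape)                             # (3, 64)

def cosine_similarity(u, v):
    """Tokens with similar meanings tend to end up with nearby vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```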

Training and Weights

  • The model learns weights during training, which determine its behavior.
  • GPT-3 has 175 billion parameters organized into about 28,000 matrices.
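
As a rough sanity check on that scale, the embedding matrix alone accounts for over half a billion of those parameters, assuming GPT-3's vocabulary of about 50,257 tokens (a figure not stated in these notes) and the 12,288-dimensional embeddings above.

```python
vocab_size, d_embed = 50_257, 12_288        # assumed vocab size; embedding dim from the notes
embedding_params = vocab_size * d_embed
print(f"{embedding_params:,}")              # 617,558,016 -- roughly 0.6B of the 175B total
```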

Probability Distribution

  • The final output is a probability distribution over possible next tokens (words or fragments), produced by the softmax function.
  • Softmax normalizes outputs to create valid probabilities, ensuring they sum to 1.
  • Temperature Parameter: Scales the scores before softmax; higher temperatures make sampling more random, lower temperatures make it more deterministic (see the sketch below).
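
A minimal sketch of softmax with a temperature knob, showing how lower temperatures sharpen the distribution and higher ones flatten it.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                  # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = [2.0, 1.0, 0.1]
print(softmax(logits, temperature=1.0))   # ~[0.66, 0.24, 0.10]
print(softmax(logits, temperature=0.5))   # sharper: even more weight on the top token
print(softmax(logits, temperature=2.0))   # flatter: sampling becomes more random
```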

Final Takeaways

  • Understand the difference between weights (learned during training) and data (current input).
  • Embedding vectors start as per-token meanings and are progressively enriched with context as they flow through the network.
  • Important mathematical concepts include dot products for measuring vector similarity.

Next Steps

  • The next chapter will delve deeper into attention mechanisms and their significance in Transformers.