MIT 6.S191 Introduction Lecture Notes
Jul 17, 2024
Instructor Introduction
Instructor: Alexander Amini
Co-instructor: Ava
Course Title: MIT 6.S191
Overview: Fast-paced, one-week course covering foundational and advanced concepts in AI and deep learning
Course Evolution: Content is updated continually to keep pace with the rapidly changing field
Importance of AI and Deep Learning
Revolutionized various fields like robotics, medicine, mathematics, and physics
Solving problems previously considered unsolvable within human lifetimes
AI models have evolved to solve complex tasks beyond human performance
Course Format and Objective
Combination of technical lectures and software labs
Aim: Build foundational understanding and the ability to create state-of-the-art AI models from scratch
Final Project: Project pitch competition with prizes
Course History and Updates
Previous introduction videos went viral for showcasing AI's capabilities
Example: AI-generated content previously required expensive computational resources
Present: Simplified AI model training accessible on everyday devices (smartphones) using natural language prompts
Core Concepts To Be Covered
Intelligence: The ability to process information to inform future decision-making
Artificial Intelligence: A computer's ability to process information and inform decisions
Machine Learning: Subset of AI in which computers learn to process information from data rather than being explicitly programmed
Deep Learning: Subset of machine learning that uses neural networks to process raw data
Neural Networks and Perceptrons
Perceptrons: Basic units of neural networks
Inputs: Multi-dimensional (X1, X2, ..., Xm)
Weights: (W1, W2, ..., Wm)
Bias: W0, shifts the activation function along the input axis
Nonlinearity: An activation function g applied to the weighted sum, giving output ŷ = g(W0 + X1·W1 + ... + Xm·Wm)
Activation Functions:
Sigmoid: Squeezes outputs between 0 and 1, useful for probability
ReLU: Rectified linear unit, commonly used due to computational efficiency
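A minimal sketch (not the lecture's code) of the perceptron forward pass described above, assuming NumPy; the variable names are illustrative placeholders:

```python
import numpy as np

def sigmoid(z):
    # Squeezes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, z)

def perceptron(x, w, bias, activation=sigmoid):
    # Weighted sum of the inputs plus bias, passed through a nonlinearity
    z = np.dot(w, x) + bias
    return activation(z)

x = np.array([1.0, 2.0, 3.0])   # inputs X1..Xm
w = np.array([0.5, -0.2, 0.1])  # weights W1..Wm
print(perceptron(x, w, bias=0.3))                   # sigmoid output in (0, 1)
print(perceptron(x, w, bias=0.3, activation=relu))  # ReLU output
```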
Neural Networks Structure
Layers: Composed of neurons; an input layer, hidden layers, and an output layer
Deep Networks: Stacking multiple layers to increase model complexity
Fully Connected Layers: Each neuron is connected to every output of the previous layer
Training Neural Networks: Software libraries such as TensorFlow make implementation straightforward
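A rough sketch of how this structure looks in TensorFlow (the library used in the labs); the input size and layer widths are arbitrary placeholders:

```python
import tensorflow as tf

# A small fully connected ("dense") network:
# input layer -> two hidden layers -> output layer
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),                    # e.g. a flattened 28x28 image
    tf.keras.layers.Dense(128, activation="relu"),   # hidden layer 1
    tf.keras.layers.Dense(64, activation="relu"),    # hidden layer 2
    tf.keras.layers.Dense(10, activation="softmax")  # output layer (10 classes)
])

model.summary()
```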
Gradient Descent and Backpropagation
Gradient Descent: Optimization algorithm used to minimize the loss function
Steps: Initialize weights, compute gradient, update weights, repeat until convergence
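Those steps in miniature, as a toy example (not the lecture's code) that minimizes the one-dimensional loss L(w) = (w - 3)²:

```python
# Gradient descent on L(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w = 0.0               # 1. initialize the weight
learning_rate = 0.1

for step in range(100):
    grad = 2.0 * (w - 3.0)        # 2. compute the gradient of the loss
    w = w - learning_rate * grad  # 3. update the weight against the gradient
    if abs(grad) < 1e-6:          # 4. repeat until convergence
        break

print(w)  # approaches 3.0, the minimizer of the loss
```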
Loss Function: Measures how far predictions are from the ground truth
Cross-Entropy: Used for classification tasks
Mean Squared Error: Used for regression tasks
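Both losses are available in tf.keras; the tensors below are made-up values just to show which loss fits which task:

```python
import tensorflow as tf

# Classification: cross-entropy between true labels and predicted probabilities
y_true_cls = tf.constant([0, 1, 1])
y_prob = tf.constant([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])
ce = tf.keras.losses.SparseCategoricalCrossentropy()(y_true_cls, y_prob)

# Regression: mean squared error between true values and predictions
y_true_reg = tf.constant([2.0, 3.5, 5.0])
y_pred_reg = tf.constant([2.2, 3.0, 4.8])
mse = tf.keras.losses.MeanSquaredError()(y_true_reg, y_pred_reg)

print(float(ce), float(mse))
```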
Backpropagation: Algorithm to compute the gradient of the loss function with respect to the weights
Uses chain rule of calculus to propagate errors back through the network
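A single training step in TensorFlow ties these pieces together: tf.GradientTape records the forward pass, backpropagation computes the gradients, and the optimizer applies the gradient descent update. The model and data here are assumed placeholders, not the lecture's:

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # placeholder model
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()

x = tf.random.normal((32, 4))  # dummy mini-batch of inputs
y = tf.random.normal((32, 1))  # dummy targets

with tf.GradientTape() as tape:
    predictions = model(x)          # forward pass
    loss = loss_fn(y, predictions)  # how far off are we?

grads = tape.gradient(loss, model.trainable_variables)            # backpropagation
optimizer.apply_gradients(zip(grads, model.trainable_variables))  # gradient descent step
```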
Practical Insights on Training Neural Networks
Learning Rate: Critical hyperparameter that affects convergence speed and quality
Too high: risk of divergence
Too low: slow convergence
Adaptive learning-rate methods are recommended
Stochastic Gradient Descent (SGD): Uses mini-batches for more efficient and scalable training
Mini-batch size is commonly around 32 samples, balancing gradient accuracy and computational efficiency
Parallelization: Leveraging GPUs for faster computation, since mini-batches can be processed in parallel
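In Keras terms, these practical points usually reduce to choosing an adaptive optimizer and a mini-batch size when compiling and fitting a model; the architecture, data, and sizes below are illustrative placeholders:

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1)
])

# Adam adapts the effective learning rate per weight, easing manual tuning
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")

x = np.random.randn(1024, 16).astype("float32")  # dummy dataset
y = np.random.randn(1024, 1).astype("float32")

# batch_size=32: gradients are estimated on mini-batches rather than the full set,
# and each batch can be processed in parallel on a GPU
model.fit(x, y, batch_size=32, epochs=5)
```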
Overfitting and Regularization
Overfitting: Model performs well on training data but poorly on unseen data
Regularization Techniques:
Dropout: Randomly setting a fraction of activations to zero during training to prevent overfitting
Early Stopping: Monitor model performance on a validation set and stop training when performance worsens
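Both techniques are one-liners in Keras; the architecture, dummy data, and validation split below are assumptions for illustration only:

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),  # randomly zero 50% of activations during training
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

x = np.random.randn(1000, 20).astype("float32")  # dummy features
y = np.random.randint(0, 10, size=(1000,))       # dummy class labels

# Stop training when validation loss stops improving, keeping the best weights
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              patience=3,
                                              restore_best_weights=True)

model.fit(x, y, validation_split=0.2, epochs=50, callbacks=[early_stop])
```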
Summary and Next Steps
Key Takeaways: Understanding perceptrons, neural network structure, optimization, and regularization
Next Lecture: Deep sequence modeling using RNNs and Transformers
Break: Brief five-minute pause before next lecture by Ava