Overview
- Lecture explains neural network training by building micrograd, a minimal autograd engine.
- Demonstrates forward and backward passes, chain rule, implementing Value objects, operations, and simple neural networks.
- Compares micrograd concepts to PyTorch and shows training a small MLP with gradient descent.
Key Concepts
- Autograd / Backpropagation
- Autograd = automatic differentiation to compute gradients efficiently.
- Backpropagation = recursive application of chain rule on computation graph, from output loss back to inputs/weights.
- Derivative Intuition
- Derivative = limit as h → 0 of (f(x+h) - f(x)) / h; measures local sensitivity (slope).
- For functions of several inputs, partial derivatives measure each input's influence on the output.
- Chain Rule
- If z depends on y, which depends on x: dz/dx = (dz/dy) * (dy/dx).
- In a computation graph, the local derivative at each node is multiplied by the upstream derivative to yield that node's gradient.
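The chain rule can be checked numerically; a small sketch (the function z = (3x)^2 is an illustrative choice, not from the lecture):

```python
# Chain rule check on z = y^2 with y = 3x, so dz/dx = (dz/dy) * (dy/dx).
def y(x):          # intermediate node
    return 3.0 * x

def z(x):          # output node
    return y(x) ** 2

x0 = 2.0
dy_dx = 3.0                # local derivative of y = 3x
dz_dy = 2.0 * y(x0)        # local derivative of z = y^2, evaluated at y(x0)
analytic = dz_dy * dy_dx   # chain rule: 2 * (3*2) * 3 = 36

# Finite-difference slope (the limit definition with a small h):
h = 1e-6
numeric = (z(x0 + h) - z(x0)) / h
print(analytic, round(numeric, 2))
```

The finite-difference estimate agrees with the chain-rule product up to the O(h) truncation error, which is the same sanity check used throughout the lecture.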
Micrograd Design (Value Object)
- Purpose
- Represent scalar values that track data, gradient, operation, and children to form a computation graph.
- Stored attributes per Value
- data: scalar numeric value.
- grad: derivative of final output w.r.t. this value (initialized to 0.0).
- _prev: set/tuple of child Value nodes (operands that produced this Value).
- _op: string naming operation that created this Value (e.g., '+', '*', 'pow', 'tanh').
- _backward: function/closure that propagates out.grad into the children's .grad fields.
- Operator overloading
- Implement add, mul, pow, neg, etc., returning new Value with proper children and _op.
- Support wrapping Python numbers into Value when combining Value and numeric literal.
- Implement radd/rmul to handle literal on left (e.g., 2 * a).
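A minimal sketch of the Value attributes and operator overloading described above (forward pass only; the backward closures are covered in the next section):

```python
class Value:
    """Minimal sketch of a micrograd-style scalar node."""
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)    # operand nodes that produced this Value
        self._op = _op                 # operation name, e.g. '+' or '*'
        self._backward = lambda: None  # filled in per operation

    def __add__(self, other):
        # wrap plain Python numbers so Value + 1 works
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), '*')

    # handle a literal on the left, e.g. 2 * a or 2 + a
    __radd__ = __add__
    __rmul__ = __mul__

a = Value(2.0)
b = 2 * a + 1      # exercises __rmul__, literal wrapping, and __add__
print(b.data)      # 5.0
```

For commutative operations, aliasing `__radd__`/`__rmul__` to the forward methods is enough; non-commutative ones (subtraction, division) need their own reflected forms.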
Backpropagation Implementation
- Manual backpropagation demonstrated step-by-step on small expression graphs.
- Automatic backpropagation steps:
- Build topological ordering of nodes (post-order DFS) so children appear before parents.
- Initialize root.grad = 1.0 (derivative of output with respect to itself).
- Iterate nodes in reversed topological order and call node._backward() to propagate grads to children.
- Important implementation detail:
- Accumulate gradients with += (not assignment) because same node can contribute from multiple paths.
- Local backward closures
- For each operation, define local derivative computations inside a closure stored as _backward:
- Addition: child.grad += out.grad
- Multiplication: self.grad += other.data * out.grad; other.grad += self.data * out.grad
- Power (x^n with n constant): self.grad += n * (self.data ** (n-1)) * out.grad
- Exponentiation e^x: self.grad += out.data * out.grad (local derivative = e^x)
- Tanh: self.grad += (1 - out.data ** 2) * out.grad (local derivative = 1 - tanh(x)^2)
- Division implemented as multiplication by power -1: a / b = a * (b ** -1)
- Subtraction implemented as addition of a negation: a - b = a + (-b).
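The pieces above can be sketched end to end; a condensed, illustrative implementation in the spirit of micrograd (add, mul, and tanh only):

```python
import math

class Value:
    # condensed sketch: data, grad, children, and a _backward closure per op
    def __init__(self, data, _children=()):
        self.data, self.grad = data, 0.0
        self._prev = set(_children)
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad          # += accumulates across multiple paths
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t ** 2) * out.grad   # local derivative of tanh
        out._backward = _backward
        return out

    def backward(self):
        # post-order DFS yields a topological ordering: children before parents
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0                    # d(out)/d(out) = 1
        for node in reversed(topo):
            node._backward()

a = Value(3.0)
b = a * a + a      # a used twice: grads must accumulate, not overwrite
b.backward()
print(a.grad)      # d(a^2 + a)/da = 2a + 1 = 7.0
```

The final example exercises the accumulation rule directly: if the closures used `=` instead of `+=`, `a.grad` would come out wrong because `a` feeds the output along two paths.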
Visualization
- Graphviz (graphviz API) used to draw computation graph nodes and op-nodes for readability.
- Each Value node labeled with data and grad for inspection during examples.
Scalar vs Tensor Explanation
- Micrograd operates on scalar Values for pedagogical clarity.
- Production libraries (PyTorch, JAX) use tensors (arrays of scalars) for efficiency and parallelism.
- Mathematics is identical; tensors package many scalar operations for speed on hardware.
Example: Neuron and Activation
- Neuron model (MLP unit)
- raw activation n = sum(w_i * x_i) + b
- output o = tanh(n) (activation)
- Implemented tanh both:
- as a single primitive operation with its local backward (1 - tanh(x)^2)
- and decomposed into exponentials to show equivalence and exercise additional operations (exp, pow, div, sub).
- Backprop example: propagate through the tanh, addition, and multiplication nodes to compute gradients on inputs and weights.
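The equivalence of the tanh primitive and its exponential decomposition, and the 1 - tanh(x)^2 local derivative, can both be verified numerically (the test point is arbitrary):

```python
import math

# tanh decomposed into exponentials: tanh(x) = (e^(2x) - 1) / (e^(2x) + 1)
def tanh_via_exp(x):
    e = math.exp(2.0 * x)
    return (e - 1.0) / (e + 1.0)

x = 0.7
print(tanh_via_exp(x), math.tanh(x))   # same value either way

# local derivative 1 - tanh(x)^2 matches a finite-difference slope
h = 1e-6
numeric = (math.tanh(x + h) - math.tanh(x)) / h
analytic = 1.0 - math.tanh(x) ** 2
print(round(numeric, 5), round(analytic, 5))
```

This is the point of the decomposition exercise: either path through the graph must produce the same forward value and the same gradient.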
Neural Network Modules
- Micrograd nn structure mirrors PyTorch API:
- Module base class with zero_grad convenience.
- Neuron: holds list of weights (Value) and bias (Value); call computes forward pass.
- Layer: list of Neurons producing multiple outputs.
- MLP: sequence of Layers; supports arbitrary layer sizes.
- Parameters collection
- Each Module implements parameters() generator returning all Value parameters (weights and biases) for optimization and zeroing gradients.
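A forward-pass-only sketch of the Neuron/Layer/MLP hierarchy and the parameters() pattern (plain floats stand in for Value objects here; the layer sizes are illustrative):

```python
import math, random

random.seed(0)

class Neuron:
    def __init__(self, nin):
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = 0.0
    def __call__(self, x):
        n = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b  # raw activation
        return math.tanh(n)                                     # nonlinearity
    def parameters(self):
        return self.w + [self.b]

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]
    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs
    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]

class MLP:
    def __init__(self, nin, nouts):
        sizes = [nin] + nouts
        self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(nouts))]
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]

model = MLP(3, [4, 4, 1])        # 3 inputs, two hidden layers of 4, 1 output
y = model([2.0, 3.0, -1.0])
print(len(model.parameters()))   # (3*4 + 4) + (4*4 + 4) + (4*1 + 1) = 41
```

In the real implementation each weight and bias is a Value, so parameters() hands the optimizer exactly the nodes whose .grad fields the backward pass fills in.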
Training Loop (Gradient Descent)
- Loss example: squared error summed over examples: sum((y_pred - y_true)^2); dividing by the number of examples gives mean squared error (MSE).
- Training step:
- zero_grad() on all parameters
- forward pass (compute predictions and loss)
- loss.backward() (compute gradients)
- update parameters: p.data += -learning_rate * p.grad (negative sign to minimize loss)
- Practical notes
- Must zero grads before each backward to avoid accumulation across steps.
- Learning rate choice critical: too small → slow convergence; too large → instability or divergence.
- Typical improvements in practice: stochastic minibatching, learning-rate schedules, advanced optimizers.
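The loop shape above, sketched on a toy one-parameter model where the gradient is computed analytically in place of loss.backward() (data and hyperparameters are illustrative):

```python
# Fit y = w * x to data generated by y = 2x with squared-error loss.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w, lr = 0.0, 0.05
for step in range(100):
    grad = 0.0                            # zero_grad() equivalent
    loss = 0.0
    for x, y in zip(xs, ys):
        pred = w * x                      # forward pass
        loss += (pred - y) ** 2           # squared-error accumulation
        grad += 2.0 * (pred - y) * x      # d(loss)/dw for this example
    w += -lr * grad                       # gradient-descent update
print(round(w, 4))                        # converges toward 2.0
```

Resetting `grad = 0.0` at the top of each step mirrors zero_grad(); dropping that line reproduces the accumulation bug described below, where stale gradients inflate every update.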
PyTorch Comparison / Integration
- PyTorch tensors mirror Value semantics but operate on n-dimensional arrays and include .data, .grad, and .backward().
- Registering custom ops in PyTorch:
- Implement forward and backward (local derivative) to integrate new primitives.
- PyTorch source is large; CPU/GPU kernels implement low-level backward computations (e.g., tanh backward = grad * (1 - output^2)).
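A sketch of the same tiny neuron in PyTorch (requires torch; the specific numbers are illustrative): leaf tensors with requires_grad=True play the role of micrograd Value leaves.

```python
import math
import torch

# Leaf tensors track gradients like micrograd Values track .grad.
x = torch.tensor(0.5, requires_grad=True)
w = torch.tensor(0.8, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

o = torch.tanh(x * w + b)   # forward pass builds the graph
o.backward()                # autograd fills .grad on the leaves

# sanity check against the hand-derived chain rule: do/dx = w * (1 - tanh(n)^2)
n = 0.5 * 0.8 + 0.1
expected = 0.8 * (1 - math.tanh(n) ** 2)
print(o.item(), x.grad.item(), expected)
```

The semantics (.grad, .backward(), gradient accumulation across calls) match micrograd one-for-one; PyTorch just runs them over tensors instead of scalars.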
Example Results & Pitfalls
- Small binary classification MLP trained on toy dataset converges with iterative forward/backward/update.
- Common bug demonstrated: forgetting to zero gradients (grad accumulation across steps) alters effective step size and produces incorrect training behavior.
- Another subtle bug fixed: when same Value used multiple times (a + a), must accumulate grads instead of overwriting them.
Key Terms and Definitions
| Term | Definition |
| --- | --- |
| Autograd | Automatic computation of derivatives for code-defined computations (the backprop engine). |
| Backpropagation | Algorithm applying the chain rule in reverse through the computation graph to compute gradients. |
| Value | Micrograd's scalar wrapper storing data, grad, children, op, and backward closure. |
| Computation Graph | Directed acyclic graph of operation/Value nodes representing the forward computation. |
| Topological Sort | Ordering of nodes so children appear before parents; used to run the backward pass in the correct order. |
| Gradient | d(output) / d(variable): sensitivity of the output to small changes in the variable. |
| Loss | Scalar function measuring model performance; minimized during training. |
| Tanh / Activation | Nonlinear function applied to the raw neuron activation; local derivative = 1 - tanh(x)^2. |
| Parameter Update (SGD) | p.data += -lr * p.grad; a simple gradient-descent step. |
Action Items / Next Steps
- Study micrograd source: Value implementation, operator overloading, _backward closures, topological sort.
- Reproduce toy MLP training: implement zero_grad, forward, backward, and simple SGD updates.
- Experiment:
- Replace tanh with relu or other activations (update local backward accordingly).
- Extend Value to support small vectorized operations or batch processing.
- Implement alternative loss functions (cross-entropy) and optimizers (Adam).
- Compare micrograd behavior to PyTorch by running equivalent tiny models and verifying gradients match.
Summary
- Micrograd demonstrates core mechanics of neural network training compactly: represent scalar operations as nodes, chain local derivatives via closures, topologically order nodes, and run backward pass to get gradients.
- Training = forward pass → compute loss → backward pass → update parameters; repeat.
- All concepts generalize to tensor frameworks (PyTorch/JAX); micrograd isolates and clarifies the math and implementation fundamentals.