Micrograd and Autograd Basics

Dec 6, 2025

Overview

  • Lecture explains neural network training by building micrograd, a minimal autograd engine.
  • Demonstrates forward and backward passes, chain rule, implementing Value objects, operations, and simple neural networks.
  • Compares micrograd concepts to PyTorch and shows training a small MLP with gradient descent.

Key Concepts

  • Autograd / Backpropagation
    • Autograd = automatic differentiation to compute gradients efficiently.
    • Backpropagation = recursive application of chain rule on computation graph, from output loss back to inputs/weights.
  • Derivative Intuition
  • Derivative = limit as h → 0 of (f(x+h) - f(x)) / h; measures local sensitivity (slope).
    • For multivariate outputs, partial derivatives measure each input’s influence on output.
  • Chain Rule
    • If z depends on y, which depends on x: dz/dx = (dz/dy) * (dy/dx).
    • In the computation graph, local derivatives at each node multiply with the upstream derivative to yield gradients.
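The chain rule above can be checked numerically on a toy composition. This is a minimal sketch with an assumed example z = y² where y = 3x, so dz/dx = (dz/dy)(dy/dx) = 2y · 3:

```python
# Toy composition: z = y**2 with y = 3x, so dz/dx = 2y * 3 = 18x.
def y(x):
    return 3.0 * x

def z(x):
    return y(x) ** 2

x0, h = 2.0, 1e-6
numeric = (z(x0 + h) - z(x0)) / h   # finite-difference slope at x0
analytic = (2 * y(x0)) * 3.0        # chain rule: (dz/dy) * (dy/dx)
```

Both values agree (≈ 36 at x0 = 2), which is exactly the check micrograd-style engines rely on when validating gradients.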

Micrograd Design (Value Object)

  • Purpose
    • Represent scalar values that track data, gradient, operation, and children to form a computation graph.
  • Stored attributes per Value
    • data: scalar numeric value.
    • grad: derivative of final output w.r.t. this value (initialized to 0.0).
    • _prev: set/tuple of child Value nodes (operands that produced this Value).
    • _op: string naming operation that created this Value (e.g., '+', '*', 'pow', 'tanh').
    • _backward: function/closure that describes how to propagate out.grad into children.grads.
  • Operator overloading
    • Implement add, mul, pow, neg, etc., returning new Value with proper children and _op.
    • Support wrapping Python numbers into Value when combining Value and numeric literal.
    • Implement radd/rmul to handle literal on left (e.g., 2 * a).
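The attributes and operator-overloading points above can be sketched as a forward-only Value class (no backward logic yet; this is a simplified illustration, not micrograd's exact source):

```python
class Value:
    """Minimal sketch of micrograd's Value: builds the forward graph only."""
    def __init__(self, data, _children=(), _op=''):
        self.data = data               # scalar numeric value
        self.grad = 0.0                # d(output)/d(this), filled in by backprop
        self._prev = set(_children)    # child Values that produced this node
        self._op = _op                 # operation label, e.g. '+', '*'
        self._backward = lambda: None  # closure set by the op that created us

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)  # wrap literals
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), '*')

    # handle literal-on-the-left, e.g. 2 * a or 1 + a
    __radd__ = __add__
    __rmul__ = __mul__

a = Value(2.0)
b = 3 * a + 1   # works because __rmul__ wraps 3 and __add__ wraps 1
```

Note that `3 * a` dispatches to `a.__rmul__(3)` because `int` does not know how to multiply by a `Value`; that is why the reflected operators are needed.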

Backpropagation Implementation

  • Manual backpropagation demonstrated step-by-step on small expression graphs.
  • Automatic backpropagation steps:
    • Build topological ordering of nodes (post-order DFS) so children appear before parents.
    • Initialize root.grad = 1.0 (derivative of output with respect to itself).
    • Iterate nodes in reversed topological order and call node._backward() to propagate grads to children.
  • Important implementation detail:
    • Accumulate gradients with += (not assignment) because same node can contribute from multiple paths.
  • Local backward closures
    • For each operation, define local derivative computations inside a closure stored as _backward:
      • Addition: child.grad += out.grad
      • Multiplication: self.grad += other.data * out.grad; other.grad += self.data * out.grad
      • Power (x^n with n constant): self.grad += n * (self.data ** (n-1)) * out.grad
      • Exponentiation e^x: self.grad += out.data * out.grad (local derivative = e^x)
      • Tanh: self.grad += (1 - out.data ** 2) * out.grad (local derivative = 1 - tanh(x)^2)
      • Division implemented as multiplication by power -1: a / b = a * (b ** -1)
      • Subtraction implemented as addition with negation.
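The steps above (backward closures, topological sort, gradient accumulation with +=) can be combined into a small working engine. This is a condensed sketch covering only `+`, `*`, and `tanh`, not the full micrograd API:

```python
import math

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data, self.grad = data, 0.0
        self._prev, self._op = set(_children), _op
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad          # addition passes the upstream
            other.grad += out.grad         # gradient through unchanged
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad   # d(ab)/da = b
            other.grad += self.data * out.grad   # d(ab)/db = a
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')
        def _backward():
            self.grad += (1 - t * t) * out.grad  # d tanh(x)/dx = 1 - tanh(x)^2
        out._backward = _backward
        return out

    def backward(self):
        # post-order DFS: children land in topo before their parents
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0                    # d(out)/d(out) = 1
        for node in reversed(topo):
            node._backward()

a, b = Value(2.0), Value(-3.0)
c = a * b + a   # a feeds the graph twice: grads must accumulate via +=
c.backward()
```

Because `a` appears on two paths, dc/da = b + 1 = -2.0; an implementation that assigned grads instead of accumulating them would report -3.0 or 1.0 depending on traversal order.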

Visualization

  • Graphviz (graphviz API) used to draw computation graph nodes and op-nodes for readability.
  • Each Value node labeled with data and grad for inspection during examples.

Scalar vs Tensor Explanation

  • Micrograd operates on scalar Values for pedagogical clarity.
  • Production libraries (PyTorch, JAX) use tensors (arrays of scalars) for efficiency and parallelism.
  • Mathematics is identical; tensors package many scalar operations for speed on hardware.

Example: Neuron and Activation

  • Neuron model (MLP unit)
    • raw activation n = sum(w_i * x_i) + b
    • output o = tanh(n) (activation)
  • Implemented tanh both:
    • as a single primitive operation with its local backward (1 - tanh(x)^2)
    • and decomposed into exponentials to show equivalence and exercise additional operations (exp, pow, div, sub).
  • Backprop example: propagate through tanh, plus nodes, and multiply nodes to compute gradients on inputs and weights.
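The tanh local derivative used in that example can be verified directly. A minimal sketch with assumed toy inputs and weights (not the lecture's exact numbers), comparing the analytic gradient do/dw1 = (1 - tanh(n)²)·x1 against a finite difference:

```python
import math

# Assumed toy neuron: o = tanh(w1*x1 + w2*x2 + b)
x1, x2 = 2.0, -1.0
w1, w2, b = 0.5, -0.3, 0.1

def forward(w1):
    n = w1 * x1 + w2 * x2 + b   # raw activation
    return math.tanh(n)         # squashed output

o = forward(w1)
analytic = (1 - o * o) * x1     # chain rule: do/dn * dn/dw1

h = 1e-6
numeric = (forward(w1 + h) - forward(w1)) / h
```

The two agree to several decimal places, confirming the `1 - out.data ** 2` closure used in the engine.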

Neural Network Modules

  • Micrograd nn structure mirrors PyTorch API:
    • Module base class with zero_grad convenience.
    • Neuron: holds list of weights (Value) and bias (Value); call computes forward pass.
    • Layer: list of Neurons producing multiple outputs.
    • MLP: sequence of Layers; supports arbitrary layer sizes.
  • Parameters collection
    • Each Module implements parameters() generator returning all Value parameters (weights and biases) for optimization and zeroing gradients.
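The module hierarchy and `parameters()` collection described above can be sketched structurally. This illustration uses a bare `Param` stand-in for `Value` (data + grad only) to keep the focus on the containment and flattening logic:

```python
import random

class Param:
    """Stand-in for micrograd's Value: just data and grad."""
    def __init__(self, data):
        self.data, self.grad = data, 0.0

class Module:
    def parameters(self):
        return []
    def zero_grad(self):
        for p in self.parameters():
            p.grad = 0.0

class Neuron(Module):
    def __init__(self, nin):
        self.w = [Param(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Param(0.0)
    def parameters(self):
        return self.w + [self.b]

class Layer(Module):
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]
    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]

class MLP(Module):
    def __init__(self, nin, nouts):
        sizes = [nin] + nouts
        self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(nouts))]
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]

model = MLP(3, [4, 4, 1])   # 3 inputs -> 4 -> 4 -> 1 output
```

For this shape, `parameters()` flattens to 4·(3+1) + 4·(4+1) + 1·(4+1) = 41 Params, all reachable for zeroing and SGD updates.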

Training Loop (Gradient Descent)

  • Loss example: squared-error loss over examples: sum((y_pred - y_true)^2); dividing by the number of examples gives mean squared error (MSE).
  • Training step:
    1. zero_grad() on all parameters
    2. forward pass (compute predictions and loss)
    3. loss.backward() (compute gradients)
    4. update parameters: p.data += -learning_rate * p.grad (negative sign to minimize loss)
  • Practical notes
    • Must zero grads before each backward to avoid accumulation across steps.
    • Learning rate choice critical: too small → slow; too large → instability or divergence.
    • Typical improvements in practice: stochastic minibatching, learning-rate schedules, advanced optimizers.
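The four-step loop above can be made concrete on the simplest possible model. This sketch fits y = w·x to assumed toy data with the gradient written out by hand (no autograd), so each step is explicit:

```python
# Toy data following y = 2x; we recover w ~= 2 by gradient descent on
# the squared-error loss sum((w*x - y)^2).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0
lr = 0.05
for step in range(200):
    grad = 0.0                          # 1. zero_grad
    loss = 0.0
    for x, y in zip(xs, ys):
        pred = w * x                    # 2. forward pass
        loss += (pred - y) ** 2
        grad += 2 * (pred - y) * x      # 3. backward: d(loss)/dw by hand
    w += -lr * grad                     # 4. update against the gradient
```

With lr = 0.05 the error shrinks by a constant factor each step; raising lr much further makes the iterates overshoot and diverge, which is the instability noted above.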

PyTorch Comparison / Integration

  • PyTorch tensors mirror Value semantics but operate on n-dimensional arrays and include .data, .grad, and .backward().
  • Registering custom ops in PyTorch:
    • Implement forward and backward (local derivative) to integrate new primitives.
  • PyTorch source is large; CPU/GPU kernels implement low-level backward computations (e.g., tanh backward = grad * (1 - output^2)).

Example Results & Pitfalls

  • Small binary classification MLP trained on toy dataset converges with iterative forward/backward/update.
  • Common bug demonstrated: forgetting to zero gradients (grad accumulation across steps) alters effective step size and produces incorrect training behavior.
  • Another subtle bug fixed: when same Value used multiple times (a + a), must accumulate grads instead of overwriting them.
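The second bug can be isolated in a few lines. A deliberately simplified sketch contrasting accumulation with overwriting for b = a + a, where both children of the `+` node are the same variable and the true derivative db/da is 2:

```python
# For b = a + a, the '+' node has the same child twice; the backward
# pass must therefore *add* the upstream gradient into a.grad twice.
def backward_add(out_grad, accumulate):
    grad_a = 0.0
    for _ in range(2):          # visit the child once per incoming edge
        if accumulate:
            grad_a += out_grad  # correct: contributions sum across paths
        else:
            grad_a = out_grad   # bug: second write clobbers the first
    return grad_a

correct = backward_add(1.0, accumulate=True)
buggy = backward_add(1.0, accumulate=False)
```

The overwriting variant silently reports db/da = 1 instead of 2, which is why every `_backward` closure in the engine uses `+=`.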

Key Terms and Definitions

| Term | Definition |
| --- | --- |
| Autograd | Automatic computation of derivatives for code-defined computations (backprop engine). |
| Backpropagation | Algorithm applying the chain rule in reverse through the computation graph to compute gradients. |
| Value | Micrograd scalar wrapper storing data, grad, children, op, and backward closure. |
| Computation Graph | Directed acyclic graph of operations/Value nodes representing the forward computation. |
| Topological Sort | Ordering nodes so children appear before parents; used to run the backward pass in correct order. |
| Gradient | d(output) / d(variable): sensitivity of the output to small changes in the variable. |
| Loss | Scalar function measuring model performance; optimized during training. |
| Tanh / Activation | Nonlinear function applied to the neuron's raw activation; local derivative = 1 - tanh(x)^2. |
| Parameter Update (SGD) | p.data += -lr * p.grad; a simple gradient-descent step. |

Action Items / Next Steps

  • Study micrograd source: Value implementation, operator overloading, _backward closures, topological sort.
  • Reproduce toy MLP training: implement zero_grad, forward, backward, and simple SGD updates.
  • Experiment:
    • Replace tanh with relu or other activations (update local backward accordingly).
    • Extend Value to support small vectorized operations or batch processing.
    • Implement alternative loss functions (cross-entropy) and optimizers (Adam).
  • Compare micrograd behavior to PyTorch by running equivalent tiny models and verifying gradients match.

Summary

  • Micrograd demonstrates core mechanics of neural network training compactly: represent scalar operations as nodes, chain local derivatives via closures, topologically order nodes, and run backward pass to get gradients.
  • Training = forward pass → compute loss → backward pass → update parameters; repeat.
  • All concepts generalize to tensor frameworks (PyTorch/JAX); micrograd isolates and clarifies the math and implementation fundamentals.