
CS50 AI Quick Overview

Nov 9, 2025

Overview

These notes summarize core topics from CS50 AI with Python, covering search, adversarial games, logic and inference, probabilistic reasoning, optimization and CSPs, machine learning, reinforcement learning, neural networks, computer vision, NLP, and transformers.

AI Foundations: Problem Formulation and Search

  • Agent perceives and acts in an environment; objective is to reach a goal state with minimal path cost.
  • State space defined by states, actions(s), transition model result(s,a), initial state, goal test, and path cost.
  • Nodes track state, parent, action, and path cost; frontier stores discovered, unexpanded nodes; explored set avoids revisits.

Uninformed vs Informed Search

  • Uninformed search explores without domain knowledge; informed uses heuristics to guide exploration.
  • Heuristic h(n) estimates remaining cost; Manhattan distance sums |Δx| + |Δy| on grids without diagonals.

Search Algorithms Comparison

| Algorithm | Frontier structure | Node selection | Optimality (unit costs) | Memory use | Notes |
| --- | --- | --- | --- | --- | --- |
| DFS | Stack (LIFO) | Deepest first | Not guaranteed | Lower | Memory-efficient; risks non-optimal paths |
| BFS | Queue (FIFO) | Shallowest first | Guaranteed | Higher | Explores by layers; good when goal is near |
| GBFS | Priority queue by h(n) | Smallest h(n) | Not guaranteed | Moderate | Fast targeting; may follow misleading heuristics |
| A* | Priority queue by f(n) = g(n) + h(n) | Smallest f(n) | Guaranteed with admissible, consistent h | Higher | Balances cost-so-far and estimate-to-go |

Heuristic Properties

  • Admissible: never overestimates true remaining cost; ensures A* optimality.
  ‱ Consistent: h(n) ≀ h(n') + c(n,n'); enforces monotonicity and supports optimality.

Implementation Highlights (Python)

  • Node class stores state/parent/action; StackFrontier for DFS; QueueFrontier for BFS.
  • Maze class parses grids, defines neighbors, solves via frontier loop, reconstructs paths.
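The frontier loop these bullets describe can be sketched generically; the `Node` attributes mirror the course code, while the `neighbors` callback and `solve` signature are assumptions for illustration. Switching the pop side of a `deque` turns BFS into DFS:

```python
from collections import deque

class Node:
    """Search-tree node: tracks state, parent, and the action taken."""
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action

def solve(start, goal, neighbors, use_queue=True):
    """Generic frontier loop: popleft = FIFO queue (BFS), pop = LIFO stack (DFS)."""
    frontier = deque([Node(start)])
    explored = set()
    while frontier:
        node = frontier.popleft() if use_queue else frontier.pop()
        if node.state == goal:
            # Reconstruct the path by walking parent links back to the start.
            path = []
            while node.parent is not None:
                path.append(node.action)
                node = node.parent
            return path[::-1]
        explored.add(node.state)
        for action, state in neighbors(node.state):
            if state not in explored and all(n.state != state for n in frontier):
                frontier.append(Node(state, node, action))
    return None  # frontier exhausted: no path exists
```

The linear membership scan over the frontier keeps the sketch short; a real implementation would track frontier states in a set.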

Adversarial Search: Minimax and Alpha–Beta

  ‱ Game model includes initial state, player(s), actions(s), result(s,a), terminal(s), utility(s).
  • Minimax: Max maximizes utility; Min minimizes; terminal utilities (e.g., tic-tac-toe: X=1, draw=0, O=−1).
  • Alpha–beta pruning tracks alpha (Max best) and beta (Min best) to prune branches without affecting choice.
  • Depth-limited minimax uses evaluation functions at cutoff; stronger heuristics improve play.
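Minimax with alpha–beta pruning can be sketched over an explicit game tree (nested lists with terminal utilities at the leaves — a simplification of the player/actions/result interface above):

```python
import math

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf):
    """Minimax with alpha-beta pruning; leaves are terminal utilities."""
    if isinstance(node, (int, float)):        # terminal(s): utility reached
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)          # best option Max has found
            if alpha >= beta:                  # Min would never allow this branch
                break
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta))
        beta = min(beta, value)                # best option Min has found
        if alpha >= beta:                      # Max would never allow this branch
            break
    return value
```

For example, with Max to move on the tree `[[3, 5], [2, 9]]` (Min plays at the second level), the value is 3: Min would hold the left branch to 3 and the right branch to 2.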

Knowledge Representation and Propositional Logic

  ‱ Propositional logic uses symbols and connectives (ÂŹ, ∧, √, →, ↔) with truth tables.
  ‱ Model assigns truth values; KB is the set of known truths; entailment α ⊹ ÎČ holds when ÎČ is true in every model where α is true.
  • Model checking enumerates all models (2^n); confirms KB ⊹ query if query true in all satisfying models.
  • Knowledge engineering examples: Clue subset and house assignment puzzle encoded with symbols and constraints.
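Model checking by enumeration fits in a few lines when sentences are represented as functions from a model (a dict of truth values) to a boolean — an illustrative encoding, not the course's sentence classes:

```python
from itertools import product

def entails(symbols, kb, query):
    """KB entails query iff query holds in every model satisfying KB.
    Enumerates all 2^n truth assignments over the given symbols."""
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if kb(model) and not query(model):
            return False          # counterexample: KB true, query false
    return True
```

For instance, KB = rain ∧ (rain → wet) entails wet, but KB = rain alone does not.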

From Model Checking to Inference and Resolution

  • Inference applies rules (Modus Ponens, De Morgan’s, distributive, etc.) to derive conclusions without enumeration.
  • Theorem proving as search: apply rules to expand KB until query appears.
  • Resolution operates on CNF clauses; proof by contradiction derives empty clause to confirm entailment.
  ‱ CNF conversion removes ↔ and →, pushes negations inward, and distributes √ over ∧.

First-Order Logic (FOL) Basics

  • Extends propositional logic with constants, predicates, and quantifiers (∀, ∃) for objects and relations.
  • Examples express properties (Person(x)), relations (BelongsTo(x,y)), and quantified constraints.

Probability Fundamentals and Bayes’ Rule

  ‱ Probabilities lie in [0,1]; complement rule P(ÂŹA) = 1 − P(A); inclusion–exclusion for unions.
  • Conditional probability P(A|B)=P(A∧B)/P(B); joint via conditional P(A∧B)=P(A)P(B|A).
  • Independence: A ⟂ B iff P(A∧B)=P(A)P(B).
  • Bayes’ rule P(B|A)=[P(A|B)P(B)]/P(A) inverts conditionals using evidence.
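A worked Bayes' rule computation, expanding the evidence P(A) by total probability (the 1%/90%/10% test numbers are illustrative, not from the course):

```python
def bayes(p_b, p_a_given_b, p_a_given_not_b):
    """P(B|A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|not B)P(not B)]."""
    p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)  # total probability
    return p_a_given_b * p_b / p_a
```

With a 1% prior, 90% true-positive rate, and 10% false-positive rate, the posterior after a positive result is 0.009 / 0.108 ≈ 0.083 — a reminder that rare conditions stay unlikely even after positive evidence.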

Joint, Marginal, and Conditional Distributions

  ‱ Marginalization sums the joint over hidden variables; conditioning restricts attention to entries consistent with the evidence.
  • Compute conditionals by normalizing joint slices; use joint tables for exact calculations.

Bayesian Networks and Inference

  • DAG encodes variables with CPTs; roots unconditional, others conditional on parents.
  • Joint probability factors as product of node-wise probabilities conditioned on parents.
  • Inference by enumeration sums hidden variables; normalize results for posterior distributions.

Approximate Inference

  • Direct sampling draws from generative process to estimate probabilities.
  • Rejection sampling filters by evidence but wastes samples when evidence is rare.
  • Likelihood weighting fixes evidence, samples others, and weights samples by evidence likelihood.
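Likelihood weighting on a hypothetical two-node network Rain → Wet makes the idea concrete: the evidence Wet=true is clamped rather than sampled, and each sample carries the probability that the evidence would have occurred:

```python
import random

def likelihood_weighting(n, p_rain=0.2, p_wet_if_rain=0.9, p_wet_if_dry=0.1):
    """Estimate P(Rain | Wet=true) by weighting samples of Rain
    by the likelihood of the fixed evidence."""
    num = den = 0.0
    for _ in range(n):
        rain = random.random() < p_rain              # sample non-evidence variable
        w = p_wet_if_rain if rain else p_wet_if_dry  # weight = P(evidence | sample)
        den += w
        if rain:
            num += w
    return num / den
```

The exact posterior here is 0.18 / 0.26 ≈ 0.692; unlike rejection sampling, no sample is discarded.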

Temporal Models: Markov Chains and HMMs

  • Markov chains assume P(X_{t+1}|X_t) transitions; start distribution and transition matrix define process.
  ‱ HMMs add emissions E_t dependent on hidden states; tasks include filtering, prediction, smoothing, and Viterbi decoding.
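Sampling a Markov chain shows the Markov assumption directly: the next state depends only on the current one. The sun/rain weather chain below is the classic example; the transition numbers are illustrative:

```python
import random

def sample_chain(start_dist, transitions, steps):
    """Sample a trajectory from a Markov chain given a start
    distribution and per-state transition distributions."""
    def draw(dist):
        r, total = random.random(), 0.0
        for state, p in dist.items():
            total += p
            if r < total:
                return state
        return state          # guard against floating-point rounding

    x = draw(start_dist)
    path = [x]
    for _ in range(steps):
        x = draw(transitions[x])   # next state depends only on x
        path.append(x)
    return path
```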

Optimization and Local Search

  • Optimization seeks to maximize objective or minimize cost over state space.
  • Hill climbing iteratively moves to best neighbor; terminates at local optimum or plateau.

Local Search Methods Comparison

| Method | Goal | Move selection | Accept worse moves? | Strengths | Limitations |
| --- | --- | --- | --- | --- | --- |
| Steepest ascent | Greedy improvement | Best neighbor | No | Fast local gains | Trapped by local optima |
| Stochastic | Diversify path | Random among better | No | Simple diversification | Still local traps |
| First-choice | Speed | First better found | No | Efficient search | Same traps |
| Random-restart | Broader search | Multiple restarts | No | Finds better basins | Extra runs; no guarantees |
| Local beam | Parallel search | Keep top k states | No | Wider exploration | Premature convergence |
| Simulated annealing | Escape traps | Random neighbor | Yes (prob. exp(ΔE/T)) | Crosses valleys | Needs schedule tuning |

Simulated Annealing

  ‱ Accepts worse moves with probability exp(ΔE/T), where ΔE = value(neighbor) − value(current) is negative for a worse move; high T early enables exploration, low T late exploits.
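A minimal sketch of the acceptance rule, assuming a maximization objective and a geometric cooling schedule (the schedule and parameters are illustrative choices, not prescribed by the course):

```python
import math, random

def simulated_annealing(value, neighbor, start, t0=10.0, cooling=0.95, steps=200):
    """Hill climbing that sometimes accepts worse moves: a worse neighbor
    (delta < 0) is taken with probability exp(delta / T), so escapes are
    common while T is high and rare once T has cooled."""
    current, t = start, t0
    for _ in range(steps):
        candidate = neighbor(current)
        delta = value(candidate) - value(current)
        if delta > 0 or random.random() < math.exp(delta / t):
            current = candidate
        t *= cooling              # geometric cooling schedule
    return current
```

On a 1-D objective like −(x − 3)ÂČ with ±1 neighbors, early high-temperature steps wander, and late low-temperature steps behave like pure hill climbing toward x = 3.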

Linear Programming (LP)

  • Optimize linear objective under linear constraints; solved via Simplex or Interior Point methods.
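In practice one calls a solver such as `scipy.optimize.linprog`; for intuition, a tiny two-variable sketch exploits the fact that an LP optimum lies at a vertex of the feasible polygon, so enumerating constraint-boundary intersections suffices. This brute-force approach is illustrative only — real solvers use Simplex or interior-point methods:

```python
from itertools import combinations

def solve_2var_lp(c, constraints):
    """Maximize c.x over constraints a*x + b*y <= r, each given as (a, b, r)."""
    def intersect(c1, c2):
        (a1, b1, r1), (a2, b2, r2) = c1, c2
        det = a1 * b2 - a2 * b1
        if abs(det) < 1e-12:
            return None                       # parallel boundaries
        return ((r1 * b2 - r2 * b1) / det, (a1 * r2 - a2 * r1) / det)

    def feasible(p):
        return all(a * p[0] + b * p[1] <= r + 1e-9 for a, b, r in constraints)

    vertices = [p for c1, c2 in combinations(constraints, 2)
                if (p := intersect(c1, c2)) is not None and feasible(p)]
    return max(vertices, key=lambda p: c[0] * p[0] + c[1] * p[1])
```

Maximizing 3x + 2y subject to x + y ≀ 4, x ≀ 2, x ≄ 0, y ≄ 0 lands on the vertex (2, 2) with objective 10.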

Constraint Satisfaction Problems (CSPs)

  • Variables with domains subject to constraints; aim to find consistent assignments.
  • Node consistency enforces unary constraints; arc consistency ensures each value has supporting neighbor values.

AC-3 and Backtracking

  • Revise removes unsupported values; AC-3 enforces arc consistency via queue of arcs.
  ‱ Backtracking assigns variables recursively and backtracks on constraint violations.
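The revise/AC-3 pair can be sketched as follows, simplified to a single shared binary constraint (such as ≠ for graph coloring); real implementations look up a per-arc constraint:

```python
from collections import deque

def revise(domains, x, y, constraint):
    """Remove values of x that have no supporting value in y's domain."""
    revised = False
    for vx in set(domains[x]):
        if not any(constraint(vx, vy) for vy in domains[y]):
            domains[x].discard(vx)
            revised = True
    return revised

def ac3(domains, arcs, constraint):
    """Enforce arc consistency, re-queueing arcs into any shrunken domain."""
    queue = deque(arcs)
    while queue:
        x, y = queue.popleft()
        if revise(domains, x, y, constraint):
            if not domains[x]:
                return False              # a domain emptied: no solution
            for z, w in arcs:
                if w == x and z != y:
                    queue.append((z, x))  # x shrank: recheck arcs into x
    return True
```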

Heuristics and MAC

  • MRV selects variable with fewest legal values; Degree breaks ties by most constraints.
  • LCV orders values that prune least neighbor options.
  • Maintain arc consistency (MAC) during backtracking; propagate inferences and rollback on backtrack.

Supervised Learning: Classification and Regression

  • Classification maps inputs to categories; regression predicts continuous outputs from features.
  ‱ k-NN predicts from the k nearest neighbors; an odd k avoids ties in binary classification; sensitive to local noise.

Linear Models and Perceptron

  • Linear decision boundary uses w·x + w0; thresholding classifies points.
  • Perceptron updates weights iteratively; learning rate controls update magnitude.
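The perceptron update rule in full — on a mistake, nudge the weights toward the correct label by lr × (y − prediction) × x. The training-loop shape here is an illustrative sketch:

```python
def perceptron_train(data, lr=0.1, epochs=20):
    """data: list of (features, label) with labels 0/1.
    Returns learned weights and bias."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred                 # 0 when correct, +-1 on a mistake
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    """Threshold the linear score w.x + b."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

On linearly separable data such as logical AND, the rule provably converges; it never does on XOR, which motivates the hidden layers discussed later.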

Logistic Regression and SVM

  • Logistic regression outputs probabilities via sigmoid; smooth optimization.
  • SVM maximizes margin; kernels allow nonlinear decision boundaries.

Loss, Overfitting, and Regularization

  • Classification 0–1 loss; regression L1/L2 losses; overfitting harms generalization on new data.
  • Regularization adds λ×complexity to cost; penalizes complex models per Occam’s razor.

Evaluation and Practice

  • Holdout and k-fold cross-validation assess generalization; average performance across folds.
  • scikit-learn enables quick experiments (Perceptron, SVC, KNeighborsClassifier, train_test_split).
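scikit-learn provides `KFold` and `cross_val_score` for this; the splitting logic itself is small enough to sketch in plain Python (the fold-striping scheme below is one simple choice, not scikit-learn's exact behavior):

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) pairs: each of k folds serves
    once as the held-out test set while the rest train."""
    folds = [list(range(i, n, k)) for i in range(k)]   # stripe indices into folds
    for i, test in enumerate(folds):
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test
```

Averaging a model's score over the k test folds gives a less noisy estimate of generalization than a single holdout split.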

Reinforcement Learning and Q-Learning

  • RL learns by interacting with environment to maximize cumulative reward; modeled as MDPs.
  • Q-learning estimates Q(s,a); update Q ← Q + α[r + Îł max_{a'} Q(s',a') − Q]; α is learning rate, Îł discount.
  • Epsilon-greedy balances exploration (Δ) and exploitation (1−Δ); often decay Δ over time.
  ‱ Example: Nim learned via self-play with terminal rewards; small state spaces are tractable.
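The Q-update and epsilon-greedy policy on a hypothetical toy task — a 4-state corridor 0–1–2–3 where moving right eventually earns reward 1 at state 3 (the environment is an illustration, not from the course):

```python
import random
from collections import defaultdict

def q_learning(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.
    Update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    q = defaultdict(float)
    for _ in range(episodes):
        s = 0
        for _ in range(100):                       # cap episode length
            if random.random() < epsilon:          # explore
                a = random.choice([1, -1])
            else:                                  # exploit current estimates
                a = max([1, -1], key=lambda act: q[(s, act)])
            s2 = min(max(s + a, 0), 3)             # corridor transition
            reward = 1.0 if s2 == 3 else 0.0
            best_next = 0.0 if s2 == 3 else max(q[(s2, 1)], q[(s2, -1)])
            q[(s, a)] += alpha * (reward + gamma * best_next - q[(s, a)])
            s = s2
            if s == 3:
                break
    return q
```

Values propagate backward through the discount: Q(2, right) → 1, Q(1, right) → 0.9, Q(0, right) → 0.81.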

Unsupervised Learning: Clustering

  • Clustering groups similar data points without labels; k-means alternates assignment and center recomputation.
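The assign/recompute alternation (Lloyd's algorithm), sketched for 1-D points; initialization by sampling data points is one common choice among several:

```python
import random

def k_means(points, k, iterations=20):
    """Alternate assigning points to the nearest center and recomputing
    each center as the mean of its cluster."""
    centers = random.sample(points, k)             # init from the data
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)                  # assignment step
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]  # recomputation step
    return sorted(centers)
```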

Neural Networks: From Perceptrons to Deep Learning

  • Single-layer perceptrons compute g(w·x + b) with activations (step, sigmoid, ReLU).
  • Limitations on nonlinearly separable data motivate hidden layers and backpropagation.
  • Gradient descent variants: batch, stochastic, and mini-batch; train by minimizing loss.
  • Dropout regularizes by randomly omitting units during training to reduce overfitting.

Feed-Forward Networks and Practice

  • TensorFlow/Keras build dense networks; compile with optimizer, loss, and metrics; fit for epochs.
  • Hidden units increase capacity; validation and dropout help tune and generalize.

Convolutional Neural Networks (CNNs)

  • Convolution applies learnable kernels to extract local features; stride controls movement.
  • Pooling (max) downsamples feature maps; increases translation robustness and reduces parameters.
  • Typical pipeline: Conv → Pool → Flatten → Dense → Output; multi-stage stacks learn hierarchical features.
  • MNIST CNN example: Conv2D→MaxPool→Dense→Dropout→Softmax achieves ~98.8% test accuracy.
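The convolution and pooling steps above, sketched in plain Python on nested lists (following the deep-learning convention of cross-correlation, i.e. no kernel flip; valid padding and stride 1 are assumed):

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image; each output is the sum of
    elementwise products over the window (valid padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling: keep the largest value per window."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]
```

A 1×2 kernel [1, −1] acts as a horizontal edge detector: it responds only where adjacent pixel values differ.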

Recurrent Neural Networks (RNNs)

  • Handle sequences by maintaining hidden state over time; support one-to-many, many-to-one, many-to-many mappings.
  • LSTM variants improve memory for long dependencies; used in captioning, sentiment, and translation.

NLP: Grammars, N-grams, and Naive Bayes

  • CFGs define syntactic structures; parsers produce trees; ambiguity yields multiple valid parses.
  • N-grams model local word sequences; Markov chains generate text based on preceding context.
  • Bag-of-words represents text as unordered word counts; Naive Bayes assumes conditional independence of words.

Naive Bayes Classification

  • Compute P(label|words) ∝ P(words|label)P(label); multiply per-word conditional probabilities.
  • Additive (Laplace) smoothing adds 1 to counts to avoid zero probabilities for unseen words.
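Both bullets combine into a short classifier; log-probabilities are used to avoid underflow from multiplying many small conditionals (the document-as-word-list encoding is a simplification):

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (word_list, label). Returns label counts,
    per-label word counts, and the vocabulary."""
    labels = Counter(label for _, label in docs)
    word_counts = {lab: Counter() for lab in labels}
    vocab = set()
    for words, lab in docs:
        word_counts[lab].update(words)
        vocab.update(words)
    return labels, word_counts, vocab

def classify(words, labels, word_counts, vocab):
    """argmax over labels of log P(label) + sum log P(word|label),
    with add-one (Laplace) smoothing for unseen words."""
    total = sum(labels.values())
    best, best_lp = None, -math.inf
    for lab, n in labels.items():
        lp = math.log(n / total)                              # log prior
        denom = sum(word_counts[lab].values()) + len(vocab)   # smoothed total
        for w in words:
            lp += math.log((word_counts[lab][w] + 1) / denom)  # smoothed conditional
        if lp > best_lp:
            best, best_lp = lab, lp
    return best
```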

Word Representations and Embeddings

  • One-hot vectors are high-dimensional and sparse; fail to capture similarity.
  • Distributed embeddings (Word2Vec) learn dense vectors where similar words are close; support analogies.

Sequence-to-Sequence, Attention, and Transformers

  • RNN encoder–decoder compresses input into context; struggles with long sequences due to bottleneck.
  • Attention computes weighted sums over encoder states per output step; aligns input–output and improves accuracy.
  • Transformers process tokens in parallel using self-attention and positional encodings; stacks of attention and feed-forward layers.
  • Decoder uses masked self-attention and cross-attention over encoder outputs; variants scale to large datasets efficiently.
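The core self-attention operation is small enough to sketch directly: each query scores every key, the scores pass through a softmax, and the weights mix the values. This single-head, unbatched version in plain Python omits the learned projection matrices of a real transformer:

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(q.k / sqrt(d)) weights
    the value vectors, one output row per query."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)                         # subtract max: stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

A query nearly parallel to one key puts almost all its softmax weight on that key's value, which is the "alignment" behavior described above.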

Key Terms & Definitions

  • Agent: acting entity within an environment; seeks to maximize objective or reward.
  • State/Action/Transition: components defining dynamics; result(s,a) yields next state.
  • Frontier/Explored set: frontier holds unexpanded nodes; explored records visited states.
  • Heuristic: estimate of remaining cost; admissible and consistent properties guide optimality.
  • Minimax/Alpha–beta: adversarial decision-making and pruning optimization.
  • Knowledge Base (KB)/Entailment: set of truths; entailment holds across all satisfying models.
  • CNF/Resolution: conjunctive normal form and clause-based proof by contradiction.
  • Bayesian network/CPT: graph of dependencies with conditional tables.
  • Markov chain/HMM: temporal models with hidden states and emissions.
  • Hill climbing/Simulated annealing: local search and probabilistic escape from local optima.
  • CSP/AC-3/MRV/LCV: constraint satisfaction with arc consistency and ordering heuristics.
  • Perceptron/Logistic/SVM: linear classifiers with different decision and optimization properties.
  • Loss/Regularization/Cross-validation: error metrics, complexity penalties, and evaluation strategies.
  • Q-learning/Epsilon-greedy: model-free RL value learning and exploration policy.
  • Convolution/Pooling/CNN: image feature extraction and hierarchical vision models.
  • RNN/LSTM/Attention: sequence modeling with memory and focus mechanisms.
  • Word2Vec/Embeddings: dense word representations capturing semantic similarity.
  • Transformer/Self-attention/Positional encoding: parallel, attention-driven sequence architecture.

Action Items / Next Steps

  • Implement DFS, BFS, GBFS, and A*; verify heuristic admissibility and consistency for optimality.
  • Build minimax with alpha–beta; add depth limits and evaluation functions for complex games.
  • Practice CNF conversion and resolution proofs; encode small domains for model checking.
  • Construct Bayesian networks; perform exact and approximate inference with sampling and likelihood weighting.
  • Model temporal tasks with Markov chains and HMMs; implement filtering and Viterbi decoding.
  • Apply hill climbing variants and simulated annealing; consider LP for linear objectives and constraints.
  • Solve CSPs with backtracking, MAC (AC-3), MRV, Degree, and LCV; propagate inferences and rollback on backtrack.
  • Train classifiers (k-NN, perceptron/logistic, SVM); select losses, regularize, and use cross-validation.
  • Explore RL with Q-learning and epsilon-greedy; tune α and Δ; consider function approximation.
  • Build feed-forward networks with Keras; add dropout and validation; scale to CNNs for images and RNNs for sequences.
  • Implement Naive Bayes with Laplace smoothing; experiment with embeddings (Word2Vec) for semantic tasks.
  ‱ For seq2seq tasks, use attention-equipped models or transformers to improve scalability and accuracy.