
MIT Intro Deep Learning 2025: Convolutional Neural Networks (CNNs)

Oct 9, 2025

Overview

This lecture introduces computer vision, focusing on how deep learning—specifically convolutional neural networks (CNNs)—enables computers to process, understand, and classify images, mimicking aspects of human visual perception.

Introduction to Computer Vision

  • Vision is a primary human sense allowing understanding of objects, movement, and planning in the environment.
  • Computer vision seeks to replicate human sight using algorithms that interpret raw visual inputs.
  • Applications include facial detection, autonomous vehicles, healthcare, robotics, and accessibility.

Image Representation

  • Images are stored as arrays of numbers (pixels); grayscale images use one value per pixel, color images use three (RGB).
  • Machine learning tasks in vision are typically classification (discrete labels) or regression (continuous values).

Feature Detection & Classification

  • Classification requires models to detect features that distinguish one class from another (e.g., eyes, nose for faces).
  • Traditional machine learning relies on hand-designed features; deep learning uses neural networks to automatically learn features from data.

Fully Connected vs. Convolutional Networks

  • Fully connected networks require flattening images, losing spatial information and becoming computationally expensive for large images.
  • Convolutional neural networks (CNNs) preserve spatial structure by connecting neurons to localized patches via convolution operations.

Convolution Operation

  • Convolutions scan images with small filters (kernels) to detect patterns (e.g., edges, textures) by performing elementwise multiplication and summation.
  • Filters can be hand-designed for simple tasks but are generally learned automatically in deep learning.
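The elementwise multiply-and-sum step can be sketched in NumPy; the filter and patch values below are hand-chosen for illustration, not taken from the lecture:

```python
import numpy as np

# A simple hand-designed vertical-edge filter (illustrative values).
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

# A 3x3 image patch: bright left columns, dark right column.
patch = np.array([[9, 9, 0],
                  [9, 9, 0],
                  [9, 9, 0]])

# One convolution step: elementwise multiply, then sum.
# (CNN layers compute cross-correlation, i.e., no kernel flip.)
response = int(np.sum(kernel * patch))
```

A strong positive response signals that the patch matches the pattern the filter encodes (here, a left-to-right brightness drop).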

CNN Architecture and Operations

  • A typical CNN consists of:
    • Convolutional layers that extract local features using filters.
    • Nonlinearities (e.g., ReLU) that increase model expressivity by zeroing out negative values.
    • Pooling layers (e.g., max pooling) that downsample feature maps to enhance robustness and reduce computation.
  • Filters are learned through optimization (backpropagation) on labeled data.

Applications of CNNs

  • Classification: Predicts what object is in the image.
  • Object Detection: Identifies object class and location (bounding boxes) in an image; modern methods in the R-CNN family (e.g., Faster R-CNN) learn region proposals and classification jointly.
  • Segmentation: Assigns a class label to each pixel in an image (semantic segmentation).
  • Autonomous Control: Combines camera images and map data to predict steering commands for self-driving cars using regression on continuous control outputs.

Key Terms & Definitions

  • Pixel — The smallest element of an image: one intensity value (grayscale) or three channel values (RGB).
  • Feature — A distinctive pattern or property (e.g., edge, corner) detected in an image.
  • Convolution — An operation applying a filter across an image to extract features.
  • Filter (Kernel) — A small matrix of weights used in convolution to detect specific patterns.
  • Pooling — Reduces spatial size of feature maps, helping generalization and computation.
  • CNN (Convolutional Neural Network) — A neural network architecture specialized for processing grid-like data such as images.
  • ReLU (Rectified Linear Unit) — Nonlinear activation function setting negative values to zero.
  • Softmax Function — Converts raw output scores into probabilities that sum to one.
  • Object Detection — Task of locating and classifying objects within an image.
  • Semantic Segmentation — Assigning a label to every pixel in the image, identifying regions.
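As a minimal sketch of the softmax definition above (the class scores are illustrative):

```python
import numpy as np

def softmax(scores):
    # Subtract the max score for numerical stability, then normalize
    # the exponentials so the outputs sum to one.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

# Hypothetical raw network outputs ("logits") for three classes.
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)   # probabilities summing to 1
```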

Action Items / Next Steps

  • Review today's lab focused on facial detection and hands-on CNN implementation.
  • Prepare questions on generative deep learning for the next lecture.
  • Continue reading on convolutional neural networks, especially sections on filters and pooling.

Comprehensive Summary and Review: MIT Intro Deep Learning — Day 2 Lecture on Computer Vision and CNNs

1. Introduction to Vision and Computer Vision

  • Human Vision Importance: Vision is one of the most critical human senses, enabling us to detect and interpret emotions, facial expressions, navigate environments, manipulate objects, and plan actions.
  • Goal of Computer Vision: The aim is to give computers the ability to "see" and understand the physical world from raw visual inputs, mimicking human sight.
  • Definition of Vision in AI: Simplified as the ability to "know what is where by looking."
  • Dynamic Understanding: Vision is not just static recognition but also involves understanding dynamics — how objects move and change over time (e.g., distinguishing parked cars from moving cars).
  • Applications: Computer vision is pervasive, impacting robotics, mobile computing, healthcare (medical imaging), autonomous driving, accessibility, and more.
  • Example Applications:
    • Facial detection and micro-expression recognition.
    • Autonomous vehicles that drive on unseen roads using only visual input.
    • Medical image analysis that matches or exceeds radiologist performance on some diagnostic tasks.

2. Image Representation in Computers

  • Images as Numerical Arrays:
    • Grayscale images: 2D arrays where each pixel is a single intensity value.
    • Color images: 3D arrays with three channels (Red, Green, Blue), each pixel represented by three numbers.
  • Machine Learning Tasks:
    • Regression: Output continuous values (e.g., steering angle).
    • Classification: Output discrete class labels (e.g., identifying which US president is in an image).
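A minimal NumPy sketch of these representations (image sizes chosen arbitrarily):

```python
import numpy as np

# Grayscale image: a 2D array, one intensity value per pixel.
gray = np.zeros((28, 28), dtype=np.uint8)        # shape: H x W

# Color image: a 3D array, three channel values (R, G, B) per pixel.
rgb = np.zeros((28, 28, 3), dtype=np.uint8)      # shape: H x W x 3

gray[10, 5] = 255          # one pixel, one intensity value
rgb[10, 5] = [255, 0, 0]   # one pixel, three values (pure red)
```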

3. Feature Detection and Classification

  • Human Approach: Humans classify by detecting unique features (e.g., eyes, nose, ears for faces; doors, windows for houses).
  • Machine Learning Challenge: Traditional ML requires manual feature engineering, which is difficult and brittle.
  • Deep Learning Advantage: CNNs learn features automatically from data, removing the need for manual feature design.
  • Hierarchical Feature Learning: Features are learned in layers, from simple edges to complex shapes and objects.
  • Challenges in Vision:
    • Variations in viewpoint, scale, occlusion, lighting.
    • Need for invariance to these changes for robust classification.

4. Fully Connected Networks vs. Convolutional Neural Networks

  • Fully Connected Networks:
    • Require flattening images into 1D vectors.
    • Lose all spatial information.
    • Computationally expensive due to large number of parameters.
  • Convolutional Neural Networks (CNNs):
    • Preserve spatial structure by connecting neurons only to local patches.
    • Use convolution operations to scan filters across images.
    • Efficient parameter sharing and spatial locality.

5. The Convolution Operation

  • Concept:
    • A small filter (kernel) slides over the image.
    • At each position, elementwise multiplication between filter and image patch is performed.
    • The results are summed and passed through a nonlinearity.
  • Filters:
    • Can be hand-designed (e.g., edge detectors) or learned.
    • Detect specific patterns like edges, diagonals, crossings.
  • Mathematical Details:
    • Filters are small matrices (e.g., 3x3, 4x4).
    • The convolution output is a feature map showing where the filter pattern matches.
  • Overlap and Stride:
    • Filters slide with some stride (step size).
    • Overlapping receptive fields allow smooth detection across the image.
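The sliding operation described above can be sketched as a plain-Python cross-correlation (which is what CNN "convolution" layers actually compute); the image and filter values are illustrative:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image`: at each position, elementwise
    multiply the patch with the kernel and sum the result."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1   # output height
    ow = (iw - kw) // stride + 1   # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A vertical-edge filter applied to an image with one vertical edge.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
kernel = np.array([[-1, 1],
                   [-1, 1]], dtype=float)
fmap = conv2d(image, kernel)   # 3x3 feature map, peaking at the edge
```

The feature map is largest exactly where the filter's pattern (a dark-to-bright transition) appears in the image.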

6. CNN Architecture Components

  • Convolutional Layers:
    • Extract local features using multiple filters.
    • Output is a volume (height x width x number of filters).
  • Nonlinearities:
    • Typically ReLU (Rectified Linear Unit).
    • Threshold negative values to zero, increasing expressivity.
  • Pooling Layers:
    • Downsample feature maps to reduce spatial size.
    • Max pooling selects the maximum value in a patch.
    • Helps increase receptive field and reduce computation.
    • Alternative pooling methods (mean pooling) exist for smoother gradients.
  • Stacking Layers:
    • Multiple convolution + nonlinearity + pooling layers build hierarchical features.
    • Early layers detect edges; deeper layers detect complex shapes and objects.
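The ReLU and max-pooling steps can be sketched in NumPy; the feature-map values below are made up for illustration:

```python
import numpy as np

def relu(x):
    # Threshold negative values to zero.
    return np.maximum(x, 0)

def max_pool(x, size=2):
    # Non-overlapping max pooling: keep the max of each size x size patch.
    h, w = x.shape
    return x[:h - h % size, :w - w % size] \
        .reshape(h // size, size, w // size, size).max(axis=(1, 3))

fmap = np.array([[-1.0,  2.0,  0.5, -3.0],
                 [ 4.0, -2.0,  1.0,  0.0],
                 [ 0.0,  1.0, -1.0,  2.0],
                 [-5.0,  3.0,  0.0,  1.0]])

pooled = max_pool(relu(fmap))   # 4x4 feature map -> 2x2
```

Pooling halves each spatial dimension while keeping the strongest activation in each region, which is what makes the downsampled map robust to small shifts.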

7. Learning Filters and Training CNNs

  • Filter Learning:
    • Filters are initialized randomly.
    • Optimized via backpropagation using labeled data.
    • The network learns filters that maximize classification accuracy.
  • Visualization:
    • Early filters detect edges and simple patterns.
    • Mid-level filters detect facial features (eyes, nose, ears).
    • Deep filters detect full object parts or entire objects.
  • Importance of Data:
    • The model learns invariances and robust features from diverse training data.
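As a toy illustration of filter learning (a 1D case with made-up values, not the lecture's code): gradient descent on a mean-squared error can recover a target filter from input/output pairs, mirroring how backpropagation fits CNN filters to labeled data.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, k):
    # Valid 1D cross-correlation: out[i] = sum_j x[i+j] * k[j]
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i+len(k)], k) for i in range(n)])

x = rng.normal(size=50)                 # random input signal
true_k = np.array([1.0, -2.0, 1.0])     # the "unknown" target filter
y = conv1d(x, true_k)                   # labeled training targets

k = np.zeros(3)                         # filter initialized at zero
lr = 0.01
for _ in range(500):
    err = conv1d(x, k) - y
    # Gradient of the MSE w.r.t. filter weight j: mean of 2*err[i]*x[i+j]
    grad = np.array([2 * np.mean(err * x[j:j+len(err)]) for j in range(3)])
    k -= lr * grad                      # gradient descent step
```

After training, `k` closely matches `true_k`: the filter was learned purely from data, which is the same principle a CNN applies to 2D filters at scale.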

8. CNN Applications Beyond Classification

  • Classification:
    • Predict discrete class labels for entire images.
  • Object Detection:
    • Predict object classes and bounding box locations.
    • Challenges include variable number of objects, sizes, and positions.
    • Naive approach: sliding window with classification on each patch (computationally expensive).
    • R-CNN family (with Faster R-CNN introducing the Region Proposal Network):
      • Learn to propose regions of interest.
      • Share features between region proposal and classification.
      • Efficient and accurate end-to-end training.
  • Semantic Segmentation:
    • Assign a class label to every pixel.
    • Output is an image-sized map of class predictions.
    • Uses upsampling (transposed convolution, sometimes called deconvolution) layers to restore spatial resolution.
  • Autonomous Navigation:
    • Combine camera images and map data.
    • Predict continuous control outputs (e.g., steering angle).
    • Model trained end-to-end to learn driving behavior on unseen roads.
    • Demonstrates generalization beyond traditional map-based autonomous driving.
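The naive sliding-window approach mentioned above can be sketched as follows; `classify_patch` is a hypothetical stand-in classifier (a brightness threshold), not a real model:

```python
import numpy as np

def classify_patch(patch):
    # Stand-in for a trained classifier: call a patch an "object"
    # if its mean brightness exceeds a threshold.
    return patch.mean() > 0.5

def sliding_window_detect(image, win=2, stride=1):
    """Run the classifier on every window; return detected boxes
    as (row, col, height, width). Note the cost: one classifier
    call per window, which is why this approach is expensive."""
    boxes = []
    h, w = image.shape
    for i in range(0, h - win + 1, stride):
        for j in range(0, w - win + 1, stride):
            if classify_patch(image[i:i+win, j:j+win]):
                boxes.append((i, j, win, win))
    return boxes

image = np.zeros((4, 4))
image[1:3, 1:3] = 1.0    # a bright "object" in the center
boxes = sliding_window_detect(image)
```

Region-proposal methods avoid this exhaustive scan by learning which regions are worth classifying and sharing convolutional features between proposal and classification.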

9. Summary and Key Takeaways

  • Images are arrays of pixel values; CNNs process these arrays preserving spatial structure.
  • Convolution operations extract local features using filters that slide across images.
  • Nonlinearities and pooling layers enhance model capacity and robustness.
  • Filters are learned from data, enabling hierarchical feature extraction.
  • CNNs are versatile: used for classification, detection, segmentation, and control tasks.
  • Modern object detection integrates region proposal and classification in a single network.
  • Semantic segmentation predicts pixel-wise labels, useful for detailed scene understanding.
  • Autonomous driving models can learn to navigate new environments using visual inputs and maps.
  • The power of CNNs lies in their ability to learn complex, invariant features from raw data.

10. Additional Notes and Clarifications

  • Zero values in filters: Represent "null" or "don't care" positions, meaning those pixels do not contribute to detection.
  • Filter size and number: Choice depends on image size and task; often a trade-off between detail and computational cost.
  • Orientation invariance: Models trained only on upright faces may fail on rotated faces unless trained with diverse orientations.
  • Pooling trade-offs: Max pooling is harsh but effective; mean pooling offers smoother gradients.
  • End-to-end training: CNNs learn feature extraction and classification jointly, improving performance over separate handcrafted pipelines.

11. Next Steps and Practical Work

  • Lab Exercise: Hands-on implementation of facial detection using CNNs.
  • Further Reading: Deepen understanding of convolution, pooling, and filter design.
  • Upcoming Topics: Generative deep learning — models that generate new data rather than just classify.
