Lecture on AI Alignment and Unsupervised Learning

Jul 4, 2024

Introduction

  • The speaker was invited by UMES to give this lecture.
  • He initially struggled to pick a topic because his current projects are confidential.
  • He decided instead to share older results on unsupervised learning, obtained at OpenAI around 2016.
  • The talk focuses on foundational concepts of learning, covering both supervised and unsupervised learning.

Learning in General

  • Fundamental question: Why does learning work at all, particularly for computers?
  • Neural networks succeed because they capture statistical regularities in the data.

Supervised Learning

  • Definition: Learning with input-output pairs.
  • Mathematical Foundation: If the training set is sufficiently large relative to the model's degrees of freedom, low training error guarantees low test error (a sketch of the standard bound follows this list).
  • Historical Notes: Supervised learning is well understood and deeply rooted in statistical learning theory.
    • Includes concepts like the VC dimension.
    • The VC dimension is emphasized as the tool for handling parameters with infinite precision.
  • Critical Requirement: Training and test distributions must be the same.
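
As a minimal sketch of the kind of guarantee meant above, assuming a finite function class (for instance, a model whose parameters have finite precision), the standard Hoeffding-plus-union-bound statement reads as follows; this is textbook material, not a formula quoted from the lecture:

```latex
% With probability at least 1 - \delta over N i.i.d. training samples,
% every function f in a finite class \mathcal{F} generalizes:
\[
\Pr\Big[\ \forall f \in \mathcal{F}:\ \mathrm{TestErr}(f) \;\le\; \mathrm{TrainErr}(f)
   + \sqrt{\frac{\ln|\mathcal{F}| + \ln(1/\delta)}{2N}} \ \Big] \;\ge\; 1 - \delta
\]
```

When N is large compared with ln|F| (the degrees of freedom), the square-root term is small, so low training error implies low test error, provided the training and test distributions are the same.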

Unsupervised Learning

  • Definition: Learning without labeled input-output pairs, discovering hidden structure in data.
  • Challenges: Unlike supervised learning, it comes with no comparably precise mathematical guarantees.
  • Empirical Success: Despite the lack of guarantees, there are clear empirical successes: Boltzmann machines, autoencoders, diffusion models, and language models.
  • Key Question: Why does optimizing one objective (e.g., reconstruction error) help with other, seemingly unrelated tasks?

Distribution Matching in Unsupervised Learning

  • Concept: Matching distributions of different data sources without labels.
  • Procedure: Find function f such that distribution of f(X) matches distribution of Y.
  • Example Domains: Machine translation, speech recognition, substitution ciphers.
  • Outcome: For high-dimensional X and Y, the constraint that the distributions match carries substantial information about f (see the toy sketch after this list).
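
Below is a toy sketch of this idea under strong simplifying assumptions: X and Y are streams of characters, f is restricted to a one-to-one symbol substitution (as in the cipher example), and matching is done greedily on single-symbol frequencies. The example strings and function name are made up; real distribution matching would use much richer statistics (bigrams, trigrams, whole-sequence models).

```python
from collections import Counter

def frequency_match(x_text: str, y_text: str) -> dict:
    """Toy distribution matching: build a symbol map f so that the
    symbol-frequency distribution of f(X) resembles that of Y.
    Here f is restricted to a one-to-one substitution over observed symbols."""
    x_ranked = [s for s, _ in Counter(x_text).most_common()]
    y_ranked = [s for s, _ in Counter(y_text).most_common()]
    # Greedy rule: map the k-th most frequent symbol of X to the
    # k-th most frequent symbol of Y.
    return dict(zip(x_ranked, y_ranked))

# Hypothetical usage: x_text plays the role of enciphered text,
# y_text the role of a plain-language corpus.
cipher_text = "wklv lv d whvw phvvdjh"                      # made-up ciphertext
plain_corpus = "this is a test message of ordinary text"    # made-up corpus
f = frequency_match(cipher_text, plain_corpus)
print("".join(f.get(c, c) for c in cipher_text))
# On samples this small the match is crude; the point is only the shape of
# the procedure: the target distribution alone constrains the map f.
```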

Compression and Prediction in Learning

  • Key Idea: Compression as a form of prediction for unsupervised learning.
    • Thought Experiment: Compressing concatenated files X and Y reveals shared structure.
    • Better compression indicates better extraction of shared patterns.
  • Key Inequality: The compressed size of the concatenation of X and Y should not exceed the sum of their separately compressed sizes.
  • Algorithmic Mutual Information: The gain from joint compression (separate sizes minus joint size) measures the structure shared between X and Y (see the compressor-based sketch after this list).
    • This view generalizes ideas such as distribution matching.
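
A minimal sketch of the thought experiment using an off-the-shelf compressor (zlib), which is only a rough proxy for the idealized compressor discussed here; the contents of X and Y are made up:

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))

# Made-up stand-ins for the files X and Y; in the thought experiment they
# would be two large datasets that may share structure.
x = ("the quick brown fox jumps over the lazy dog. " * 50).encode()
y = ("a quick brown fox leaps over a sleepy dog. " * 50).encode()

c_x = compressed_size(x)
c_y = compressed_size(y)
c_xy = compressed_size(x + y)          # compress the concatenation jointly

# A decent compressor should satisfy c_xy <= c_x + c_y; the gap is a crude
# stand-in for the algorithmic mutual information (the shared structure).
shared = c_x + c_y - c_xy
print(f"C(X)={c_x}  C(Y)={c_y}  C(XY)={c_xy}  gain from joint compression={shared}")
```

On related inputs the gain from joint compression is positive, because the compressor can reuse patterns it has already seen in X when encoding Y.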

Formalizing Unsupervised Learning

  • Low-Regret Algorithm: The goal is an algorithm that uses labeled and unlabeled data as effectively as possible.
    • "Low regret" means extracting essentially all the usable information from the unlabeled data, so little is lost by not having labels for it.
  • Kolmogorov Complexity: The ultimate compressor; it is not computable, but it serves as the theoretical ideal (its definition is restated after this list).
    • It predicts and compresses a string via the shortest program that outputs it, which guarantees low regret against any other compressor.
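
For reference, the standard definition and the inequality behind the low-regret claim (textbook statements, not formulas quoted from the lecture):

```latex
% Kolmogorov complexity of a string x, with respect to a fixed universal computer U:
\[ K(x) \;=\; \min_{p}\,\{\, |p| \;:\; U(p) = x \,\} \]
% For any computable compressor C there is a constant c_C, independent of x, with
\[ K(x) \;\le\; |C(x)| + c_C, \]
% i.e. the shortest-program compressor is within an additive constant of every
% computable compressor, which is the sense in which it has low regret.
```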

Conditional Kolmogorov Complexity

  • Definition: K(Y|X) is the length of the shortest program that outputs (compresses) Y when given access to X.
  • It captures optimal unsupervised learning in principle, since it exploits everything in X that is useful for predicting Y.
  • Application: Conditioning on an entire dataset is practically challenging, but the idea is conceptually foundational.
    • The practical stand-in is to concatenate the datasets and run an ordinary compressor over them, so that patterns in X can be reused when encoding Y (see the sketch after this list).
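
A hedged sketch of that stand-in: the conditional cost of Y given X is approximated by the extra bytes needed to compress the concatenation XY compared with X alone, loosely mirroring K(Y|X) as roughly K(X,Y) - K(X) (up to logarithmic terms). The datasets and names below are made up, and zlib is only a crude proxy.

```python
import os
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))

def conditional_cost(y: bytes, x: bytes) -> int:
    """Crude proxy for K(Y | X): extra bytes needed to compress Y when it is
    appended to X, compared with compressing X alone."""
    return compressed_size(x + y) - compressed_size(x)

# Made-up datasets: y_related shares structure with x, y_unrelated does not.
x = ("natural language text about foxes and dogs. " * 100).encode()
y_related = ("more natural language text about foxes and dogs. " * 20).encode()
y_unrelated = os.urandom(1000)   # incompressible noise, unrelated to x

print("extra cost of related Y given X:  ", conditional_cost(y_related, x))
print("extra cost of unrelated Y given X:", conditional_cost(y_unrelated, x))
# The related dataset should need far fewer extra bytes, reflecting how much
# X already "knows" about it; this is the practical echo of K(Y|X).
```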

Conclusion

  • Unsupervised learning works by exploiting the shared structure between datasets, and compression gives a principled way to quantify and extract that structure.
  • Practical methods, from ordinary compressors to neural networks, can be seen as increasingly good approximations to the theoretical ideal of Kolmogorov complexity.