Lecture on AI Alignment and Unsupervised Learning at OpenAI
Introduction
Speaker was invited by UMES.
Initially struggled with topic due to confidentiality of current projects.
Decided to share old results from OpenAI (2016) on unsupervised learning.
Focused on foundational concepts: what learning is, and the contrast between supervised and unsupervised learning.
Learning in General
Fundamental question: Why does learning work at all, particularly for computers?
Neural networks and their ability to capture data regularities.
Supervised Learning
Definition: Learning with input-output pairs.
Mathematical Foundation: Supervised learning guarantees low test error whenever the training set is large enough relative to the model's degrees of freedom (a standard bound of this kind is sketched after this list).
Historical Notes: Supervised learning is well understood and deeply rooted in statistical learning theory.
Includes concepts like VC Dimension.
Emphasized that the VC dimension mainly exists to handle parameters of infinite precision; with finite-precision parameters the hypothesis class is finite, and a simple counting argument already gives a bound.
Critical Requirement: Training and test distributions must be the same.
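For concreteness, here is a standard VC-style generalization bound from statistical learning theory, stated up to constants; the specific form below is a textbook result rather than a formula quoted from the lecture.

```latex
% Standard VC-type generalization bound (up to constants), shown only to
% illustrate the kind of guarantee meant above.
% Setting: n i.i.d. training samples, hypothesis class H of VC dimension d.
% With probability at least 1 - delta, simultaneously for every h in H:
\[
\mathrm{err}_{\mathrm{test}}(h)
\;\le\;
\mathrm{err}_{\mathrm{train}}(h)
\;+\;
O\!\left(\sqrt{\frac{d\,\log(n/d) + \log(1/\delta)}{n}}\right).
\]
% If the model has p parameters stored with b bits of precision each, the class
% is finite (at most 2^{bp} functions) and a simple union bound gives a similar
% guarantee with d log(n/d) replaced by roughly b*p.
```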
Unsupervised Learning
Definition: Learning without labeled input-output pairs, discovering hidden structure in data.
Challenges: Unlike supervised learning, there are no comparably precise mathematical guarantees.
Empirical Success: Despite the lack of guarantees, methods such as Boltzmann machines, autoencoders, diffusion models, and language models succeed empirically.
Key Question: Why does optimizing one objective (e.g., reconstruction error) help with other, seemingly unrelated tasks? A toy sketch follows.
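As a toy, hedged illustration of the question (not an example from the lecture): the snippet below optimizes a purely unsupervised objective, PCA's reconstruction error, on synthetic data, then reuses the learned features for a separate classification task with only a few labels. The dataset, split sizes, and the PCA/logistic-regression pairing are arbitrary assumptions made for the sketch.

```python
# Toy sketch: optimize an unsupervised objective (PCA = minimum reconstruction
# error), then reuse the learned features for a different, supervised task.
# All data here is synthetic; the point is the mechanics, not the numbers.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=100,
                           n_informative=10, random_state=0)

# Pretend most data is unlabeled and only 50 examples carry labels.
X_unlabeled = X[:4900]
X_labeled, y_labeled = X[4900:4950], y[4900:4950]
X_test, y_test = X[4950:], y[4950:]

# Unsupervised step: fit PCA on the unlabeled pool (minimizes reconstruction error).
pca = PCA(n_components=10).fit(X_unlabeled)

# Supervised step: a small classifier on top of the unsupervised features,
# compared against the same classifier on raw inputs.
clf_pca = LogisticRegression(max_iter=1000).fit(pca.transform(X_labeled), y_labeled)
clf_raw = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

print("accuracy with unsupervised features:", clf_pca.score(pca.transform(X_test), y_test))
print("accuracy on raw inputs:             ", clf_raw.score(X_test, y_test))
```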
Distribution Matching in Unsupervised Learning
Concept: Matching distributions of different data sources without labels.
Procedure: Find function f such that distribution of f(X) matches distribution of Y.
Example Domains: Machine translation, speech recognition, substitution ciphers.
Outcome: When X and Y are high-dimensional, matching their distributions imposes enough constraints to nearly pin down f (a minimal one-dimensional sketch follows this list).
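A minimal one-dimensional sketch of the idea (an illustrative simplification, not the high-dimensional setting the lecture has in mind): if f is required to be monotone, demanding that f(X) have the same distribution as Y determines f completely via quantile matching, even though no (X, Y) pairs are ever observed. The distributions and sample sizes below are arbitrary.

```python
# Distribution matching in 1-D: recover f with f(X) ~ Y from *unpaired* samples,
# using f = F_Y^{-1} o F_X (empirical quantile matching). Toy distributions only.
import numpy as np

rng = np.random.default_rng(0)
x_samples = rng.exponential(scale=2.0, size=100_000)      # unpaired draws of X
y_samples = rng.normal(loc=5.0, scale=1.0, size=100_000)  # unpaired draws of Y

def fit_quantile_map(x_ref, y_ref, n_quantiles=1000):
    """Estimate f = F_Y^{-1} o F_X from unpaired samples via empirical quantiles."""
    qs = np.linspace(0.0, 1.0, n_quantiles)
    x_q = np.quantile(x_ref, qs)
    y_q = np.quantile(y_ref, qs)
    return lambda x: np.interp(x, x_q, y_q)

f = fit_quantile_map(x_samples, y_samples)

# Fresh draws of X pushed through f should now be distributed like Y.
fx = f(rng.exponential(scale=2.0, size=100_000))
print("mean and std of f(X):", round(fx.mean(), 2), round(fx.std(), 2))  # roughly 5.0 and 1.0
```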
Compression and Prediction in Learning
Key Idea: Compression is essentially equivalent to prediction, which makes it a useful lens on unsupervised learning.
Thought Experiment: Compressing concatenated files X and Y reveals shared structure.
Better compression indicates better extraction of shared patterns.
Inequality: For a good compressor C, |C(XY)| ≤ |C(X)| + |C(Y)| + O(1): compressing the concatenation should cost no more than compressing X and Y separately.
Algorithmic Mutual Information: The gain from joint compression, |C(X)| + |C(Y)| − |C(XY)|, measures the structure shared between X and Y (acted out below with an off-the-shelf compressor).
Generalizes concepts like distribution matching.
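The thought experiment can be acted out, crudely, with an ordinary compressor. The sketch below uses gzip purely as a stand-in for the idealized compressor, and the datasets are made-up byte strings; the gain |C(X)| + |C(Y)| − |C(XY)| should come out noticeably larger when the two inputs share structure.

```python
# Crude enactment of the joint-compression thought experiment with gzip.
# gzip's limited window makes this a rough proxy, but the arithmetic matches
# the inequality above: compressing the concatenation should cost no more than
# compressing the pieces separately, and the gap measures shared structure.
import gzip

def clen(data: bytes) -> int:
    """Length of the gzip-compressed data, in bytes."""
    return len(gzip.compress(data, compresslevel=9))

x = b"the cat sat on the mat. " * 400          # dataset X
y = b"the cat sat on the hat. " * 400          # dataset Y: shares structure with X
z = bytes(range(256)) * 40                     # dataset Z: unrelated to X

for name, other in [("Y (related)", y), ("Z (unrelated)", z)]:
    gain = clen(x) + clen(other) - clen(x + other)
    print(f"joint-compression gain with {name}: {gain} bytes")
```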
Formalizing Unsupervised Learning
Low Regret Algorithm: One that makes effective use of both the labeled and the unlabeled data.
Low regret means the algorithm extracts essentially all the benefit the unlabeled data has to offer, leaving little to regret about information that went unused.
Kolmogorov Complexity: The Kolmogorov compressor, i.e., the shortest program that outputs the data, is the ultimate compressor; it is not computable, but it serves as the theoretical ideal.
It predicts (equivalently, compresses) strings via shortest programs and thereby achieves low regret: whichever compressor one might have used instead, the shortest program does at most a constant worse, as the bound below makes precise.
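The low-regret property can be written as a one-line bound; the statement below is the standard argument from algorithmic information theory, included for concreteness rather than quoted from the lecture.

```latex
% For any computable compressor C and any string X, the shortest program for X
% (whose length is the Kolmogorov complexity K(X)) satisfies
\[
K(X) \;\le\; |C(X)| \;+\; K(C) \;+\; O(1),
\]
% because one valid program for X is: "run C's decompressor on C(X)". The regret
% relative to *any* compressor we could have chosen is therefore bounded by the
% cost of describing that compressor, a constant that does not depend on X.
```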
Conditional Kolmogorov Complexity
Definition: K(Y | X) is the length of the shortest program that produces (compresses) Y while having access to X.
Conditioning on X in this way extracts all the information in X that is useful for predicting Y, making it the theoretical ideal for unsupervised learning.
Application: Practically challenging, since real systems cannot condition on an entire dataset, but conceptually foundational.
In practice, concatenate the datasets and compress (or model) them jointly with a regular compressor; the joint quantity is a good stand-in for the conditional one, as written out below.
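Written out, with U denoting a fixed universal machine (notation assumed here rather than taken from the lecture):

```latex
% Conditional Kolmogorov complexity: the shortest program that outputs Y when
% it is additionally handed X as input.
\[
K(Y \mid X) \;=\; \min \{\, |p| \;:\; U(p, X) = Y \,\}.
\]
% Conditioning a real training run on an entire dataset X is awkward, so the
% joint complexity of the concatenation serves as the practical stand-in:
\[
K(Y \mid X) \;\le\; K(X, Y) + O(1) \;\approx\; |C(\mathrm{concat}(X, Y))|,
\]
% where C is a strong real-world compressor (or, in practice, a model trained
% on the concatenated data); the lecture's point is that little is lost by
% working with the joint quantity instead of the conditional one.
```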
Conclusion
Unsupervised learning works by exploiting the structure shared between datasets, and compression provides a principled way to expose and measure that shared structure.
Practical implementations move closer to the theoretical ideal (the Kolmogorov compressor) through better algorithms and larger neural networks.