Lecture on AI Alignment and Unsupervised Learning at OpenAI
Introduction
Speaker was invited by UMES.
Initially struggled with topic due to confidentiality of current projects.
Decided to share old results from OpenAI (2016) on unsupervised learning.
Focused on foundational concepts: what learning is, and the contrast between supervised and unsupervised learning.
Learning in General
Fundamental question: Why does learning work at all, particularly for computers?
Neural networks and their ability to capture data regularities.
Supervised Learning
Definition: Learning with input-output pairs.
Mathematical Foundation: Supervised learning guarantees low test error whenever the training set is large enough relative to the model's degrees of freedom (a standard bound of this kind is sketched after this list).
Historical Notes: Supervised learning is well understood and deeply rooted in statistical learning theory.
Includes concepts like VC Dimension.
Emphasized that the VC dimension mainly exists to handle parameters of infinite precision; with finite-precision parameters the hypothesis class is finite, and a simple counting argument already gives a bound.
Critical Requirement: Training and test distributions must be the same.
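For concreteness, here is a standard VC-style generalization bound from statistical learning theory, stated up to constants; the specific form below is a textbook result rather than a formula quoted from the lecture.

```latex
% Standard VC-type generalization bound (up to constants), shown only to
% illustrate the kind of guarantee meant above.
% Setting: n i.i.d. training samples, hypothesis class H of VC dimension d.
% With probability at least 1 - delta, simultaneously for every h in H:
\[
\mathrm{err}_{\mathrm{test}}(h)
\;\le\;
\mathrm{err}_{\mathrm{train}}(h)
\;+\;
O\!\left(\sqrt{\frac{d\,\log(n/d) + \log(1/\delta)}{n}}\right).
\]
% If the model has p parameters stored with b bits of precision each, the class
% is finite (at most 2^{bp} functions) and a simple union bound gives a similar
% guarantee with d log(n/d) replaced by roughly b*p.
```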
Unsupervised Learning
Definition: Learning without labeled input-output pairs, discovering hidden structure in data.
Challenges: Unlike supervised learning, there are no comparably precise mathematical guarantees.
Empirical Success: Despite the lack of guarantees, methods such as Boltzmann machines, autoencoders, diffusion models, and language models succeed empirically.
Key Question: Why does optimizing one objective (e.g., reconstruction error) help with other, seemingly unrelated tasks? A toy sketch follows.
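As a toy, hedged illustration of the question (not an example from the lecture): the snippet below optimizes a purely unsupervised objective, PCA's reconstruction error, on synthetic data, then reuses the learned features for a separate classification task with only a few labels. The dataset, split sizes, and the PCA/logistic-regression pairing are arbitrary assumptions made for the sketch.

```python
# Toy sketch: optimize an unsupervised objective (PCA = minimum reconstruction
# error), then reuse the learned features for a different, supervised task.
# All data here is synthetic; the point is the mechanics, not the numbers.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=100,
                           n_informative=10, random_state=0)

# Pretend most data is unlabeled and only 50 examples carry labels.
X_unlabeled = X[:4900]
X_labeled, y_labeled = X[4900:4950], y[4900:4950]
X_test, y_test = X[4950:], y[4950:]

# Unsupervised step: fit PCA on the unlabeled pool (minimizes reconstruction error).
pca = PCA(n_components=10).fit(X_unlabeled)

# Supervised step: a small classifier on top of the unsupervised features,
# compared against the same classifier on raw inputs.
clf_pca = LogisticRegression(max_iter=1000).fit(pca.transform(X_labeled), y_labeled)
clf_raw = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

print("accuracy with unsupervised features:", clf_pca.score(pca.transform(X_test), y_test))
print("accuracy on raw inputs:             ", clf_raw.score(X_test, y_test))
```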
Distribution Matching in Unsupervised Learning
Concept: Matching distributions of different data sources without labels.
Procedure: Find function f such that distribution of f(X) matches distribution of Y.
Example Domains: Machine translation, speech recognition, substitution ciphers.
Outcome: When X and Y are high-dimensional, matching their distributions imposes enough constraints to nearly pin down f (a minimal one-dimensional sketch follows this list).
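A minimal one-dimensional sketch of the idea (an illustrative simplification, not the high-dimensional setting the lecture has in mind): if f is required to be monotone, demanding that f(X) have the same distribution as Y determines f completely via quantile matching, even though no (X, Y) pairs are ever observed. The distributions and sample sizes below are arbitrary.

```python
# Distribution matching in 1-D: recover f with f(X) ~ Y from *unpaired* samples,
# using f = F_Y^{-1} o F_X (empirical quantile matching). Toy distributions only.
import numpy as np

rng = np.random.default_rng(0)
x_samples = rng.exponential(scale=2.0, size=100_000)      # unpaired draws of X
y_samples = rng.normal(loc=5.0, scale=1.0, size=100_000)  # unpaired draws of Y

def fit_quantile_map(x_ref, y_ref, n_quantiles=1000):
    """Estimate f = F_Y^{-1} o F_X from unpaired samples via empirical quantiles."""
    qs = np.linspace(0.0, 1.0, n_quantiles)
    x_q = np.quantile(x_ref, qs)
    y_q = np.quantile(y_ref, qs)
    return lambda x: np.interp(x, x_q, y_q)

f = fit_quantile_map(x_samples, y_samples)

# Fresh draws of X pushed through f should now be distributed like Y.
fx = f(rng.exponential(scale=2.0, size=100_000))
print("mean and std of f(X):", round(fx.mean(), 2), round(fx.std(), 2))  # roughly 5.0 and 1.0
```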
Compression and Prediction in Learning
Key Idea: Compression is essentially equivalent to prediction, which makes it a useful lens on unsupervised learning.
Thought Experiment: Compressing concatenated files X and Y reveals shared structure.
Better compression indicates better extraction of shared patterns.
Inequality: For a good compressor C, |C(XY)| ≤ |C(X)| + |C(Y)| + O(1): compressing the concatenation should cost no more than compressing X and Y separately.
Algorithmic Mutual Information: The gain from joint compression, |C(X)| + |C(Y)| − |C(XY)|, measures the structure shared between X and Y (acted out below with an off-the-shelf compressor).
Generalizes concepts like distribution matching.
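The thought experiment can be acted out, crudely, with an ordinary compressor. The sketch below uses gzip purely as a stand-in for the idealized compressor, and the datasets are made-up byte strings; the gain |C(X)| + |C(Y)| − |C(XY)| should come out noticeably larger when the two inputs share structure.

```python
# Crude enactment of the joint-compression thought experiment with gzip.
# gzip's limited window makes this a rough proxy, but the arithmetic matches
# the inequality above: compressing the concatenation should cost no more than
# compressing the pieces separately, and the gap measures shared structure.
import gzip

def clen(data: bytes) -> int:
    """Length of the gzip-compressed data, in bytes."""
    return len(gzip.compress(data, compresslevel=9))

x = b"the cat sat on the mat. " * 400          # dataset X
y = b"the cat sat on the hat. " * 400          # dataset Y: shares structure with X
z = bytes(range(256)) * 40                     # dataset Z: unrelated to X

for name, other in [("Y (related)", y), ("Z (unrelated)", z)]:
    gain = clen(x) + clen(other) - clen(x + other)
    print(f"joint-compression gain with {name}: {gain} bytes")
```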
Formalizing Unsupervised Learning
Low Regret Algorithm: One that makes effective use of both the labeled and the unlabeled data.
Low regret means the algorithm extracts essentially all the benefit the unlabeled data has to offer, leaving little to regret about information that went unused.
Kolmogorov Complexity: The Kolmogorov compressor, i.e., the shortest program that outputs the data, is the ultimate compressor; it is not computable, but it serves as the theoretical ideal.
It predicts (equivalently, compresses) strings via shortest programs and thereby achieves low regret: whichever compressor one might have used instead, the shortest program does at most a constant worse, as the bound below makes precise.
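The low-regret property can be written as a one-line bound; the statement below is the standard argument from algorithmic information theory, included for concreteness rather than quoted from the lecture.

```latex
% For any computable compressor C and any string X, the shortest program for X
% (whose length is the Kolmogorov complexity K(X)) satisfies
\[
K(X) \;\le\; |C(X)| \;+\; K(C) \;+\; O(1),
\]
% because one valid program for X is: "run C's decompressor on C(X)". The regret
% relative to *any* compressor we could have chosen is therefore bounded by the
% cost of describing that compressor, a constant that does not depend on X.
```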
Conditional Kolmogorov Complexity
Definition: K(Y | X) is the length of the shortest program that produces (compresses) Y while having access to X.
Conditioning on X in this way extracts all the information in X that is useful for predicting Y, making it the theoretical ideal for unsupervised learning.
Application: Practically challenging, since real systems cannot condition on an entire dataset, but conceptually foundational.
In practice, concatenate the datasets and compress (or model) them jointly with a regular compressor; the joint quantity is a good stand-in for the conditional one, as written out below.
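Written out, with U denoting a fixed universal machine (notation assumed here rather than taken from the lecture):

```latex
% Conditional Kolmogorov complexity: the shortest program that outputs Y when
% it is additionally handed X as input.
\[
K(Y \mid X) \;=\; \min \{\, |p| \;:\; U(p, X) = Y \,\}.
\]
% Conditioning a real training run on an entire dataset X is awkward, so the
% joint complexity of the concatenation serves as the practical stand-in:
\[
K(Y \mid X) \;\le\; K(X, Y) + O(1) \;\approx\; |C(\mathrm{concat}(X, Y))|,
\]
% where C is a strong real-world compressor (or, in practice, a model trained
% on the concatenated data); the lecture's point is that little is lost by
% working with the joint quantity instead of the conditional one.
```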
Conclusion
Unsupervised learning works by exploiting the structure shared between datasets, and compression provides a principled way to expose and measure that shared structure.
Practical implementations move closer to the theoretical ideal (the Kolmogorov compressor) through better algorithms and larger neural networks.