You know, when Umesh invited me to attend this event, I got really excited. I saw the speaker list and I thought, great, I'll go and I'll talk about something. And then I ran into a problem, which is that a lot of the technical work we're doing at OpenAI I actually can't talk about. So I was really racking my brain: what could I talk about? As of not too long ago I switched all my research focus to working on AI alignment, and we'll have very cool results to show there, but not just yet; that will have to be for the next talk. So what would be good for this talk? I came up with something: I'll tell you about some very old results that we had at OpenAI many years ago, back in 2016 even, which really affected the way I think about unsupervised learning, and I thought I'd share them with you. It is possible that at this point you'll find them obvious, but maybe not all of them, so there is at least a small chance that you'll find them interesting. I want to set the expectations modestly and then hopefully exceed them.

So: a theory of unsupervised learning. How can such a thing exist? Before we talk about unsupervised learning, I want to talk about learning in general. What is learning, and why does learning work at all? Why should learning work at all, and why should computers be able to learn? By now we're just used to the fact that neural networks learn, we take it for granted, but why do they? Mathematically, why should they? Why would data have regularity that our machine learning models can capture? These are not obvious questions.

One important conceptual advance that took place in machine learning many years ago, due to multiple people, was the discovery and formalization of supervised learning. It goes under the name of PAC learning, or statistical learning theory. The nice thing about supervised learning is that it gives you a precise mathematical condition under which learning must succeed. You're told that if you have some data from some data distribution, and you manage to achieve a low training loss, and your number of degrees of freedom is smaller than your training set, then you will achieve low test error, and you are guaranteed to do so. So you have a mathematical condition where you can say: if I found a function out of my function class which achieves low training error, then learning will succeed. You can say, yes, this is a very sensible mathematical thing that we can reason about, and that's why supervised learning is easy.

Then you have all these theorems, which I find simple and elegant. Maybe you've seen theorems like this; it's the sort of thing where, if you know it, you'll say, oh yeah, that thing, and if you don't know it, it's not possible to explain in 30 seconds, though it is possible in five minutes. None of this stuff is complicated. You've got your little proof which says: if you have some number of functions in your function class, the probability that your training error will be far from your test error for at least one of those functions is small, provided your training set is large enough relative to the number of functions. There is some math, and the math is simple; three lines of math, and those three lines prove all of supervised learning. That's very nice. So supervised learning is something that is, comparatively speaking, well understood. We know why it must succeed, so we can go forth and collect large supervised learning data sets and be completely certain that models will keep getting better.
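(For concreteness, here is a rough sketch of the kind of finite-class bound being alluded to. The notation is mine, not the slide's, and it assumes i.i.d. data and losses in [0, 1].)

```latex
% Hoeffding's inequality plus a union bound over a finite function class F:
\Pr\Big[\ \exists f \in \mathcal{F}:\ \big|\,\mathrm{TestErr}(f) - \mathrm{TrainErr}(f)\,\big| > \varepsilon\ \Big]
  \ \le\ 2\,|\mathcal{F}|\,e^{-2\varepsilon^{2} n}
% Equivalently, with probability at least 1 - \delta, every f in F satisfies
\mathrm{TestErr}(f)\ \le\ \mathrm{TrainErr}(f) + \sqrt{\frac{\ln|\mathcal{F}| + \ln(2/\delta)}{2n}}
% With d parameters of b bits each, |F| <= 2^{bd}, so the gap is small once the
% training set size n exceeds roughly b*d, the "degrees of freedom".
```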
And yes, I forgot to mention a very important piece of this result: the test distribution and the training distribution need to be the same. If they are the same, then your theory of supervised learning kicks in, and it works and will be successful. So conceptually it is trivial. We have an answer for why supervised learning works, why speech recognition should work, why image categorization should work: because they all reduce to supervised learning, which works, which has this mathematical guarantee. This is very nice.

Here I want to make a small side comment on VC dimension, for those of you who care about such things; there may be a small subset, so if you want to zone out for the next 30 seconds, feel free to do so. A lot of writings about statistical learning theory emphasize the VC dimension as a key component, but the main reason, in fact the only reason, the VC dimension was invented was to allow us to handle parameters with infinite precision. If you have a linear classifier, every parameter has infinite precision. But of course, in reality all our floats are finite precision, and their precision is shrinking, so the number of functions that can be implemented by a computer is actually small, and you can reduce everything back to this little formula, which gives you pretty much all the best bounds that you can get for supervised learning. I find this appealing because you have fewer lines of proof.

Okay, now let's talk about unsupervised learning. First of all, what is unsupervised learning? Supervised learning is: you've got your data that says, here, do this; here, do that. You have all this data, your training error is low, you have more training data than degrees of freedom in your function class (parameters, say, but maybe other things too, like bits), and you say, okay, your supervised learning will succeed. But what is unsupervised learning? What can you say at all about unsupervised learning? I'll say that I, at least, have not seen an exposition of unsupervised learning that I found satisfying: how to reason about it mathematically. We can reason about it intuitively, but can we reason about it mathematically?

For some context, what is the old dream of unsupervised learning? This dream has, by the way, been fulfilled, but it's fulfilled empirically. Can we go just a tiny bit beyond the empirical results? The dream is that you just look at the images, or you just look at the text, without being told what to do with them, and somehow you still discover the true hidden structure that exists in the data, and somehow it helps you. Why? Why should that happen? Should we expect it to happen? You don't have anything remotely similar to the supervised learning guarantee; the supervised learning guarantee says, get your low training error and your learning is going to be a great success. Unsupervised learning, it appears, is not this way. And people were talking about it for a long time: in the 80s the Boltzmann machine was already talking about unsupervised learning. Unsupervised learning also did not work at small scale back then,
but the old ideas were there: the denoising autoencoder, for those of you who remember it, which is like BERT, or like the diffusion model, with a tiny twist, the tiniest of twists. And the language models of old also generated cool samples for their time, but their unsupervised learning performance was not as impressive as that of today.

I want to make the case that this is confusing. Why is it confusing? How does unsupervised learning work? You say, okay, let's optimize some kind of reconstruction error, or let's optimize some kind of denoising error, or some kind of self-supervised learning error. You optimize one objective, but you care about a different objective. So doesn't that mean you have no reason to expect that you'll get any kind of good unsupervised learning results? Or rather, you do get them empirically, but the level of mystery is quite high. I claim it seems like a totally inaccessible phenomenon: you optimize one objective, but you care about another objective, and yet it helps. How can that be? Magic?

The other thing, by the way, and here I'm going to say something which is only 90% true: you could say, okay, unsupervised learning just learns the structure in the input distribution, and that helps you. But what if you're training on the uniform distribution? Then all your unsupervised learning algorithms will fail. How should you think about that? So what can we say? Do we need to make assumptions? What kind of assumptions?

I'd like to present a potential way of thinking about unsupervised learning which I personally find interesting; perhaps you'll find it interesting too. Let's find out. I want to show you one way of doing unsupervised learning which is not necessarily widely known, because it never became the dominant way of doing unsupervised learning, but it has the cool feature that, similarly to supervised learning, it has to work. What kind of mysterious unsupervised learning procedure, where you are not given labels for any of your inputs, is still guaranteed to work? Distribution matching.

So what is distribution matching? Say I've got my data: I've got X and I've got Y, two data sources, and there is no correspondence between them. Data source X, data source Y: language one and language two, or text and speech, with no correspondence between them. Now look at this criterion: find the function f such that the distribution of f(X) is similar to the distribution of Y. That is a constraint on f, and in the case of machine translation and speech recognition, for example, this constraint might be meaningful. You could say: if you tell me you have a function f such that, when you take your distribution of English sentences and apply f to them, you get something very similar to the distribution of French sentences, then you've found a true constraint about f. If the dimensionality of X and the dimensionality of Y is high enough, that's going to give you a lot of constraints; in fact, you might be able to almost fully recover f from that information. This is an example of unsupervised learning which is still guaranteed to work, in the same sense that supervised learning is guaranteed to work. Substitution ciphers, little simple encryptions, will also fit into this framework.
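(As a concrete toy version of this, here is a small sketch of my own, not from the talk: recovering a substitution cipher purely by matching unpaired distributions. The function names and the unigram-matching strategy are just illustrative assumptions.)

```python
# Toy distribution matching: find f so that the distribution of f(ciphertext)
# matches the distribution of a reference corpus, with no paired examples.
from collections import Counter
import string

def letter_order(text: str) -> list[str]:
    """Letters of `text`, sorted from most to least frequent."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    return [ch for ch, _ in counts.most_common()]

def fit_f(ciphertext: str, reference: str) -> dict[str, str]:
    """Pick f by matching marginals: map the i-th most frequent cipher letter
    to the i-th most frequent reference letter. A constraint on f learned
    without any labels or aligned pairs."""
    return dict(zip(letter_order(ciphertext), letter_order(reference)))

def apply_f(f: dict[str, str], ciphertext: str) -> str:
    return "".join(f.get(ch, ch) for ch in ciphertext.lower())

# With enough unpaired text on both sides, unigram matching alone recovers much
# of the mapping; matching bigram/trigram distributions pins down the rest.
```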
I ran into this, I independently discovered it, in 2015, and I got really fascinated by it, because I thought, wow, maybe there is something mathematically meaningful that we can say about unsupervised learning. But the thing about this setup is that it's still a little bit artificial; real machine learning setups aren't like this, and the way we like to think about unsupervised learning isn't like this either.

Okay, now I'll present the meat of what I wanted to say: a proposed way to think about unsupervised learning that puts it on par with supervised learning, that says, okay, what is it doing mathematically, and how can you be sure that your unsupervised learning is good? Compression to the rescue. It is well known (I shouldn't say obviously, it is not obvious, but it is well known) that compression is prediction: every compressor can be a predictor and vice versa; there is a one-to-one correspondence between all compressors and all predictors. However, I would argue that for the purpose of thinking about unsupervised learning, the language of compression offers some advantages. At least it did for me; perhaps it will for you too.

So consider the following thought experiment; this thought experiment is the most important slide. Say you have two data sets, X and Y: two files on your big giant hard disk. Say you have a really great compression algorithm C, which takes data in and outputs compressed objects. Say you compress X and Y jointly: you take the two data sets, concatenate them, and feed them to your compressor. What will happen? In particular, an important question is: what will a sufficiently good compressor do? My answer is, very intuitively, it will use the patterns that exist inside X to help it compress Y, and vice versa. You could make the same claim about prediction, but somehow it's more intuitive when you say it about compression; I don't know why that is, but I find it to be the case.

So that's our clue, and you can write an equation like this: if your compressor is good enough, a really great compressor, then the compression of the concatenation of your two giant files should be no worse than the separate compression of the two files. Any additional compression that was gained by concatenation is some kind of shared structure that your compressor noticed, and the better your compressor is, the more shared structure it extracts. The gap is the shared structure, the algorithmic mutual information.

That's interesting, right? You can see what I'm alluding to: Y is the data of your supervised task, X is your unsupervised data, and suddenly you have some kind of mathematical reason for the patterns in X to help with Y. Notice also how this generalizes distribution matching. If we are in the distribution matching case, where X is language one and Y is language two, and you know that there exists some simple function f that transforms one distribution into the other, then surely your compressor, if it's good, will notice that and make use of it, and maybe even internally try to recover that function. So I think that's pretty cool; we've closed the circle.
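(To make the thought experiment tangible, here is a tiny sketch of my own, not from the talk, with zlib standing in for a "sufficiently good" compressor C; the data X and Y are artificial and chosen so that the shared structure is obvious.)

```python
# Joint vs. separate compression: the gap C(X) + C(Y) - C(X ++ Y) is the
# shared structure the compressor managed to exploit.
import os
import zlib

def c(data: bytes) -> int:
    """Compressed size of `data` under our stand-in compressor C (zlib, level 9)."""
    return len(zlib.compress(data, 9))

X = os.urandom(10_000)        # X looks incompressible on its own
Y = bytearray(X)              # Y shares almost all of its structure with X
for i in range(0, len(Y), 512):
    Y[i] ^= 0xFF              # perturb a few bytes so Y is not an exact copy
Y = bytes(Y)

c_x, c_y, c_xy = c(X), c(Y), c(X + Y)
print("C(X)      =", c_x)
print("C(Y)      =", c_y)
print("C(X ++ Y) =", c_xy)
print("gap       =", c_x + c_y - c_xy)
# For truly unrelated X and Y the gap would be close to zero; here it is large,
# because the compressor uses the patterns in X to help it compress Y.
```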
So how do we formalize it? What would a formalization of unsupervised learning look like? Let's see if I can do it. (By the way, in what follows I'll be a bit sloppy and use compression and prediction interchangeably.)

Consider a machine learning algorithm A which tries to compress Y, and say it has access to X. X is file number one and Y is file number two. You want your machine learning algorithm, your compressor, to compress Y, and it can probe X as it sees fit. The goal is to compress Y as well as possible. Now we can ask ourselves: what is the regret of using this particular algorithm? You'll see what I'm getting at. Regret relative to what? If I do a good enough job, if I have low regret, it means that I've gotten all the help I could possibly get from this unlabeled data. This unlabeled data has helped me as much as possible, and I don't feel bad about it; I don't feel like I've left some prediction value on the table that someone else with a better compression algorithm could have used. That's what it means. You can go and sleep happily at night knowing that, if there is some information in the unlabeled data that could help my task, no one could have done as good a job as me; I've done the best job of benefiting from my unlabeled data.

I think that is a step towards thinking about unsupervised learning. You don't know if your unsupervised data set is actually useful. It may be super useful, it may have the answer, or it may be totally useless, it may be the uniform distribution. But if you have a low-regret unsupervised learning algorithm, you can say: whichever case it is, I know I've done my best at benefiting from my unlabeled data, and no one could have done better.

Now I want to take a detour to theory land, which is a little obscure but, I think, interesting. Kolmogorov complexity, as the ultimate compressor, gives us the ultimate low-regret algorithm, which is actually not an algorithm because it's not computable, but you'll see what I mean really quickly. First, for some context: who here is familiar with Kolmogorov complexity? Okay, about 50%. Kolmogorov complexity is the kind of thing which is easy to explain in one minute, so I'll just do it. Imagine you give me some data, and I'm going to compress it by giving you the shortest program that exists which, if you run it, outputs your data. (Yes, that is correct, you got me: it is the length of the shortest program which outputs X.)

Intuitively you can see that this compressor is quite good, because you can prove a theorem which is also really easy to prove; or rather, it's easy to feel, and once you feel it you can believe me that it's easy to prove. It basically says that if you use the Kolmogorov compressor to compress your strings, you'll have very low regret about your compression quality. The result says: if you've got your string X, your data set, your database, whatever, then the length of the shortest program which outputs X is no greater than whatever your compressor C outputs, plus a little term, which is however many characters of code you need to implement your compressor. Intuitively you can see how that makes sense: it's a simulation argument. If you tell me, hey, I've got this really great compressor C, I'm going to say, cool, does it come with a computer program? Can you give that computer program to K? K is going to run your compressor, because it runs computer programs; you just need to pay for the program length.
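(The inequality being gestured at, written in my notation: for any computable compressor C and any string X,)

```latex
K(X) \;\le\; |C(X)| + K(C) + O(1)
% |C(X)| is the length of C's output on X, and K(C) is the length of the
% program implementing C. The Kolmogorov compressor loses to any compressor C
% by at most the cost of describing C, a constant that does not grow with the
% data, which is the low-regret property.
```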
So without giving you the details, I think I've given you the feel of it: the Kolmogorov compressor can simulate computer programs, and so it can simulate other compressors. This is also why it's not computable: it is undecidable, because it feels very much at liberty to simulate all computer programs. But it is the best compressor that exists, and we were talking about good compression for unsupervised learning. Now let us generalize the Kolmogorov compressor to be allowed to use side information.

One more detail; I'll reiterate this point several times because it is important. Obviously the Kolmogorov compressor is not computable, it's undecidable, because it searches over all programs. But hey, did you know that if you run SGD over the parameters of some neural net with 100 layers, it's automatically doing a program search, over a computer with a certain amount of memory and a certain number of steps? It's kind of like a miniature Kolmogorov compressor built out of a neural net; you can feel the similarity. It's kind of magical, right? Neural networks can simulate little programs. They are little computers; they're circuits, and circuits are computing machines, and SGD searches over the programs. All of deep learning hinges on the SGD miracle: that we can actually train these computers with SGD, that it works, that we actually find those circuits from data. Therefore we can compute our miniature Kolmogorov compressor, and the simulation argument applies here as well.

By the way, I just want to mention one fact. I don't know if you've ever tried to design a better neural network architecture, but what you'd find is that it's kind of hard to find a better architecture. You say, well, let's add this connection, let's add that connection, let's modify this and that. Why is it hard? The simulation argument: your new architecture can be pretty straightforwardly simulated by the old architecture, except when it can't. Those are rare cases, and in those rare cases you get a big improvement, such as when you switch from the little RNN to the Transformer. The RNN has a bottleneck, the hidden state, so it has a hard time implementing the Transformer. However, had we found a way to engineer an RNN with a very, very large hidden state, perhaps it would be as good as a Transformer again. So this is the link: you start to see how we switch from formal land to neural network land, and you can see the similarity.

So: conditional Kolmogorov complexity as the solution to unsupervised learning. You can prove a similar theorem. Let me define what K(Y | X) is: it's the length of the shortest program which outputs Y if it's allowed to probe X. You can prove the same result, and you can see immediately that this is definitionally the solution to unsupervised learning. If you use it, you can sleep very soundly at night knowing that no one does unsupervised learning better than you; it's literally that. So this is the ultimate low-regret solution to unsupervised learning, except that it's not computable. But I do think it's a useful framework. And note that here we condition on a data set, not an example: this thing extracts all the value out of X for predicting Y. The data set, not the example.
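(Rendered in the same notation as before, and as I understand the claim, the conditional analogue is:)

```latex
K(Y \mid X) \;\le\; |C(Y \mid X)| + K(C) + O(1)
% |C(Y | X)| is the length of C's output when compressing Y while probing X.
% So conditioning the Kolmogorov compressor on the unlabeled data X is, up to
% the constant cost of describing C, at least as good as any algorithm that
% tries to exploit X, which is the low-regret property being claimed.
```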
So this is the solution to unsupervised learning: done, success. But there is one little technicality which I need to spend a bit of time on. We were talking about this conditional Kolmogorov complexity, about compressors that try to compress one thing while having access to another thing, and this is a bit unnatural in a machine learning context when you care about fitting big data sets. At least as of today, although it's changing fairly rapidly, I think it's still fair to say that there is no truly good way of conditioning on a big data set. You can fit a big data set, but you can't condition on it; not yet, not truly.

So this result says: if you care about making predictions on your supervised task Y, then using the good old-fashioned Kolmogorov compressor, which just compresses the concatenation of X and Y, is going to be just as good as using your conditional compressor. There are more details and a few subtleties to what I just said, and I'm happy to talk about them offline in the event someone is interested. But it basically says that, where the previous slide said you can use the conditional Kolmogorov compressor to solve unsupervised learning, this says you can also use your regular Kolmogorov compressor: just take all your files, concatenate them, compress them, and that's going to make great predictions on the supervised task you care about. There are some intuitions about why this is true, but proving this result is actually slightly more tricky, so I won't do it here.
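(One standard fact that makes this plausible, stated loosely and in my notation, is the Kolmogorov-Levin symmetry of information:)

```latex
K(X, Y) \;=\; K(X) + K(Y \mid X) + O\big(\log K(X, Y)\big)
% Compressing the concatenation costs roughly K(Y | X) plus a K(X) term that
% does not depend on Y, which is one way to see why, up to these additive
% terms, joint compression can serve the supervised task about as well as
% conditioning on X. The full argument in the talk is more subtle than this.
```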