Approximation in Machine Learning and Probabilistic Inference

something that I think affect us all I hope to convince you that this affects us all um we've been thinking a lot lately in my group about abstraction right a very a principle in computer science that we are all very fond of right and and I think bear with me here for a second but what I really want to talk about is how that is starting to how the the breakdown in abstraction in modern machine learning is starting to affect our scientific inferences so what what do I mean by abstraction right so we think about a model right we think about a statistical model a machine learning model something like this and we think okay there sure it's running on some underlying infrastructure but the modern Cloud architecture or Cuda or whatever is so good that we don't have to worry about that particular Hardware instantiation any of you who has ever gotten caught in Cuda driver hell knows that that's not a thing but bear with me we think that's a thing optimization right so we think that optimization we stipulate a model and then we say I'm going to optimize this model and thus optimization is an abstracted object that we call and it doesn't bring its own kind of biases to the party that's the way things are supposed to work and and so too in if we're doing a probabilistic inference Bas rule is something like this we also think about the same thing there right we just appeal to Bas rule inference is taken care of it's a great abstraction you learn about this and when you learn about it with glms or with convex optimization or what whatever that was true yeah now in modern machine learning we're realizing how much that's not true right so we've all heard a lot about stochastic optimization sometimes called the you know the naive Genius of stochastic optimization and how it confers inductive bias onto neural networks as a regularizer um there is also a reproducibility crisis in machine learning if you haven't read about it you should essentially nothing you can't reproduce any of the results that you find in papers right and the and the kind of naive notion that a GitHub repo will do it um uh even even a properly set up Docker container can't always get there so inference is what I want to talk about today the fact that inference now of course we know that when our models get big enough we have to appeal to something like an approximation approximate inference is is kind of one of the cornerstones of of of probabilistic machine learning um but that abstraction I hope to convince you today is is is leaking into our science all right why should we care about this right well let's I don't want to be too basy in today but I am going to talk about probabilistic reasoning the the the whole point of probab reasing I'll go over this I'll go over this sort of in more detail is um that we have this object and we have this this this formul this formalism um via Bay's rule to go from a prior belief a set of data to a posterior belief about the world right and as inference becomes approximate it changes that belief state right how it changes that belief state is going to have meaningful impact on the things we try to do the statements we try to make about the world if we're doing something like um uh model inference unsupervised learning learning latent variables Etc why do I care about this um I care about this a lot my group cares about this a lot as an object cares about inference as an object of scientific inquiry by which I mean there are models a lot of my group is is involved in computational Neuroscience I realize I have I have brought Neuroscience to a protein party to some extent so I'm not going to talk too much about about proteins but it's fine the um the uh there are there are well-developed biologically motivated models um uh you don't have to worry about this these are different cell types we are particularly interested in inferring something about the state of these excitatory uh uh these excitatory neurons in a in a brain system and so when we have this model we collect a bunch of data we have to do approximate inference the statement we get back ideally that is a real thing about the brain we have learned something about how the brain operates how the brain responds to a particular stimulus Etc right if what we're actually learning is some biases of our approximations that's kind of stupid right yeah all right so we've also done a bit of um I'm going to I'm going to talk about probalistic inference in kind of the classic sense we've also done a bit of this for um conditional uh uh diffusion conditional diffusion excuse me um this came up in the last thing so I just wanted to I just wanted to throw in a quick slide about this this is a hastily put together slide during the coffee break um so a quick minute on defic conditional diffusion Jen made this comment about how using guidance to do conditional sampling um feels a lot like Bas Rule and that's absolutely right um when you have an unconditioned diffusion model P ofx if you have a likelihood P of Y given X you can use that as as guidance uh essentially as this um as this gradient signal to push you towards a posterior sample um in the discussion that came up uh there was mention made of twisting and sequential Monte Carlo and the history the Deep history that this touches in um statistics statistical mechanics um Markov chman Carlo and other things I mean like essentially what all this boils down to is it should not be a surprise that there is a ton of work I mean generations and conferences Etc on propagating samples from one density to another right which in the end is what conditional diffusion is so uh John and I were talking about this a few minutes ago like diffusion is at this time where everybody gets to bring their favorite toy to the party right and one of the things that is happening there is people are saying wait wait wait a second we're propagating one density to another over a sequence of of time points or a sequence of intervals so we've got all of these tools to bring right one of those is sequential Monte Carlo where essentially the idea is instead of just injecting that one guidance gradient you are doing a uh uh via this algorithm which if you think about it as particle filtering you're you're essentially exactly correct you do a uh uh you do a a you add a grade then you do an update propagate that forward through time do a correction step and then repeat at each of your time steps right I don't I don't really need to go through this there is a um there is a a a neat paper that a student of mine uh some of you might know Brian trip he's been on the job market um Luhan wo she will be coming to the job market soon um she's an amazing student um if you want to learn more about this you should have her give a talk on that um cool I think that covers it but just to say that this is this is a really rich area right now you mentioned a lot of the the cool papers on that we talked about a lot of cool papers on this there's connections to Hamilton Monte Carlo as well this is really a rich time for this is is is is SMC is Twisted SMC the right way to do it it is a right way to do it HMC is a right way to do it that that it is it is it is worth our time to spend um uh uh uh to spend some effort trying to connect back to all the the work that has been done in this space cool all right have I kind of legitimized myself in a in the protein room now okay cool great I'll go back to other stuff now no Jen told me she was like she was like you can't continue your talk unless you say something about proteins all right um okay so problemistic machine learning I said some vague things about abstraction here let me let me make this one step more concrete right I I I'm no great fan of bay rule but but but it is it is quite it is quite it is quite elegant in in its in its Simplicity which is it says like in some in some kind of fundamental sense it's it's unsalable which is if you treat all of your unknown quantities as probability distributions then you can just appeal to the magic of Bas Rule and let it do the rest right that is like going back philosophically the the the idea of inverse probabilities so what do I mean by that I'm we're going to talk uh for reasons that we become clear about reasoning over some state of the world F okay that is a we we start with some prior belief we also are endowed with a belief about how that state gives rise to data in the world right a likelihood and observation function something like this we're going to collect a bunch of data we are then going to put that together through the magic of bay Rule and I I do mean magic in the sense of like this is a crystal ball this is an oracle right you put these two things in with a data set you get out an updated belief State about the world now why is that so important and why am I dwelling on this because notice the beauty of that is it says model Plus data gives me an updated belief about the world right that is that is kind of one ansot of AI for science right is that we can do that we bring a model it is a tool it is an instrument just like any other instrument any other scientific piece of instrumentation and if we have data we can uh uh we can put those together and learn something about the world okay um that is as true whether f is an individual variable or whether it's a function of some of of some inputs X um how do we deal with this right this is hard as soon as you learn about the magic of Bas rule you learn that we need to approximately solve it right because um Jen was taking pot shots at conjugate mod before right this is just like it's great these These are good useful toy models but then when you actually need to do something scientifically relevant you are in a non-conjugate land you're in a you're in a computationally expensive land so you need to appeal to something else oh shout out again to Hamilton monic Carlo um one way of dealing with that is to say let the abstraction let the abstraction work right I'm a I'm a computer scientist I'm a Believer you say there are probabilistic programming languages ppls like Stan and Pyro um that that that give you a structured language to write down this model and then under the hood it says I'll take care of the inference right now it doesn't actually do that right it uses Markov chain Monte Carlo type techniques um maybe increasingly variational inference so the point is it it's pretending that it's abstracted that way but it also has all these own biases going along going along okay um what else this comes up all over the place I'm writing this as simple models here there's plenty of places where we blur these lines even further right so even down to their names right like a a variational auto encoder again invariant risk minimization these are these are things that even in the writing of the papers we can't quite decide whether it's a model or an inference technique um so the the point of this talk and the point of what I'm saying here is just that as inference gets more and more approximate it is not okay to consider it separate from the model right this crystal ball is obviously not a crystal ball and we ignore it at our Peril so yeah we go to we go to bed at night knowing that mcmc ABC bi um bbvi these type of things these are not the same as this crystal ball um they're a mess right so the the the crystal ball is is not a mess it's actually a Computing machine right it's this kind of weird pixelated robot that is making assumptions and choices all of its own right and we have not accounted for them this is particularly offensive because the whole point of basy machine learning or probabilistic machine learning is to instantiate all of our unknowns as probability distributions and yet the Computing that is done in the center of this we're just like just hide it right and that has real consequences the real consequences is the updated belief the scientific inference that we make is now not a statement about model and data it is a statement about model data and some other stuff that I haven't bothered taken care of right that's the point of today right the point of today is how to rationalize approximate computation probabilistically sorry the point of my talk today not your whole day all right I will introduce an example model here um I'm going to talk about gaussian models because they are tractable um uh in in a particular sense the gaussian process model okay if you've seen a gsum process before I'm not going to do anything particularly Fancy with gum processes the reason I'm choosing gum processes because they are a sufficiently complex probabilistic model that we always have to appeal to approximate inference right you can't do anything with a gum process in the modern world it's just too expensive right however they are also a model that really boils down to just linear algebra and if there's a thing that we know how to really understand in computation it's linear algebra right all right if you don't care about gum processes cool I don't want you to think that this is a talk about gaum processes I think I want you to think this is a talk about this problem of computation in approximate inference and that gaussing processes are just a handy model to run through for today all right so gum process just notation notation slide we have XY pairs we're trying to learn a function that Maps X to Y uh uh for convenience these this will be a regression problem I don't you know it doesn't none of this really really matter um we say that f is distributed according to a ging process with some mean function mu and covariance kernel k um what the fundamental definition of that is is that any subset of the function f in other words take X Take N input points X1 toxn evaluate that function and get this Vector that Vector has a gaussian distribution has a gaussian random is a gaussian random Vector with distribution mu and K where mu and K is normal mu K where mu is the mean function evaluated at those points and K is the coari function evaluated at all pairs okay right what does this look like in practice uh in pictures we say a gan prior this is a common representation a gan process prior will look like this in other words I've seen no data so I have something like mean zero and I have some covariance around it right this is my this is my belief State about the world this is my prior right I know nothing and I have uncertainty about what little I know now I'm going to start collecting data I'm going to do that again using Bas rule right if I have this likelihood that says given some underlying function right I just add some gaussian noise to that and that gives me data I'm now going to query about test points right new points X in the world that induces a posterior that is also a gaussian so here's the linear algebra piece right gaussian conditioning you suffer through that once in your life it ends up looking like this M Star your posterior mean is Mu plus this function of the kernel this is the core object that is painful right which is a linear solve right your algebra chelki conjugate gradients all these type of things right you've got your observations you solve those against the covariance kernel you map those back out to the new data point and that influences your mean function you do the same thing with the variance and off you go like this you've observed a data point this data point happens to be around zero right so this is X this is f right we've gone from that prior belief to now when I observe one data point what happens over here where I haven't really seen data I still am back at my prior I don't know anything as I get closer I start to get more confident about what the Laten function is this is my belief about uh this is my posterior mean about what that latent function is this this dash line is the true latent function so as I collect more data I get more certain okay what am I doing right this is the process in a cartoon of scientific inquiry right I'm going about I'm going from an uninformed belief about the state of the world to I'm gathering data and I'm getting closer to the truth this dark blue line my posterior belief is getting closer to this Dash Blue Line questions happy so far okay okay let me plant this just for the future put it in a different color so you know it's just to hold on to you can also think about this problem as a linear regression type problem right a linear regression in the basis of this kernel function by which I mean if you just call this some weights beta or V right then essentially what you're doing is you're taking a kernel representation of the data you're mapping that through some weights and adding a bias that looks linear regression right and in fact if you've done kernel rid regression all this stuff this is like exactly what what is happening the the point of a reproducing colel Hilbert space um and hbert spaces in general is that they give you linear regression that you can do in a function space all right so we call those the representer weights they turn out to be a thing that we're really going to want to reason about probabilistically okay um you might also be wondering I'm going to talk a lot about inference you might also be wondering about doing model selection this because that's an equally important part how do I fit my hyperparameters of a model how do I do how do I Traverse through the possible model space as I gather more data everything I say here today is going to be applicable to that as well it's just a lot more mechanics um so this paper uh you can read about about this but otherwise it's Friday I'm not going to try to do too much math okay so we've got this idea of a gaussian process right we see that it involves an uh a linear solve right you remember linear solves Once Upon a Time they're expensive the core object computationally right is this kernel Matrix right this notice is a kernel Matrix that is number of data points by number of data points right so in any modern interesting setting we've got hundreds of thousands millions of data points something like this oh whatever maybe tens of thousands it's still computationally prohibitive to work with this Matrix you can form this Matrix well enough take a bit of memory but it's not too bad but solving against this if you remember um if you remember your linear algebra class that's a cubic that's a cubic operation in other words that is cubic in the size of the Matrix n n cubed is a big number it grows fast you can do this this is what I mean about this this this object tractable it's tractable in the uh uh latch notation sense of the word right we can write down what we think this is but it is not tractable in the I want this to finish on a um on a modern GPU in a reasonable amount of time sense okay notice that the posterior mean the posterior variance and this model selection the the the likelihood notice that all of these involve the same object which is this linear solve okay now you say okay this log determinant is also cubic um there's a lot of work in there's a lot of work resolving that to a linear solve as well right so just trust me there's only one problem here which is solving K inverse times something okay so there is a ton a ton of literature I mean it's its own field trying to come up with good approximation methods to do these type of solves cubic scaling is is is is just impossible and so in some special cases you can reduce this in most cases you can't and so arises some large number of of approximation methods one of the most well-used is conjugate gradients um must we go through conjugate gradients you can view linear solves as an optim as an optimization problem a convex optimization problem right so suppose you have this quadratic function X transpose Kad X right remember what we're trying to do is solve Kad inverse y yeah this is a quadratic plus this linear term so this is a good convex optimization problem right now what is the solution of this right well the solution of this is Kad inverse y right gradi that gradi of that okay so what have we done nothing well the nice thing that we've done here is that you can take gradients of this right you can write down a gradient of this it looks like KX plus Yus mu that is a gradient in X right you can now do gradient descent okay gradient descent as we know it works but it works very slowly right so particularly with something is poorly conditioned as a k Matrix right because these matrices are quite large and stretchy um they bounce around a lot right so this is a very slow operation the idea of conjugate gradients um which we've all run into at some point I suspect is just to take steps in a um in a different Norm right so instead of taking steps in the ukian norm uh instead of taking gr steps in the ukan norm you take steps in the Kat Norm they are Kat conjugate steps this is a um this is essentially a cookbook algorithm that you just that you just write down um as with all fast linear solvers so of the whole core point of computational linear algebra is to turn inverse problems which are expensive into forward problems of of this kind of multiplication Matrix Vector multiplication because that's way cheaper all right conjugate gradient comes out looking like this I I we don't need to go through the details of this but the point is this is just absolutely stock and trade off the shelf conjugate gradient the point I want you to see is one thing two things so sorry one is you take in a k hat and a y y minus mu right and you do a bunch of forward multiplies K Hat Time V K Hat Time s Kad Time s k Hat Time s right so this operation on a forward pass is just a forward multiplication also what I want you to notice is that here these are all linear operations right so it feels like you might be able to propagate a belief through them right rather than doing this pointwise gradient you might be able to propagate a probability distribution through them some people are lost at this point saying why are we doing this because remember what we talked about was doing approximate inference approximate computation in a model now we're looking at a procedure and recognizing that it looks like something we will be able to push probability beliefs through right this goes back a long way there's a lot lot of papers about this um this is like this is how old I am all right so back to this so we said the crystal ball is actually a Computing machine right this this this robot is not it's not bad per se but it's but but he's weird right by which I mean there there there are modeling assumptions and choices that that the conjugate gradient method or the whatever the robot the computation machine is making along the way and we have not reasoned about them probabilistically and we should if we if we want to stay within this framework this is this is not some trivial Mischief what I want to show you or the point I want to make is that the statements that we get out become as much about our approximate inference as they become about our model and data I'll repeat that in a moment because it's it's crazy approximate inference in practice so here is what I'll call the mathematical post posterior by that I mean if I had infinite compute and I could compute exactly the correct posterior according to the data that I've observed and the model now right if your model's wrong or your data is wrong or something like this I'm not talking about that this is what the right answer should be that is your true belief state in this in this kind of crystal ball scientific world when you use conjugate gradients which I just showed you in a standard way you get something that looks awfully good you say okay great I can go and operate with that it's different but I'm not too bothered by that now I use a different method subset of regressor a nicer method um don't worry about what any of these things are these are just very popular methods notice that because approximate methods violate the model they are making they end up making strong statements about inference right so here is the truth about this part of the world right here we have a lot of uncertainty about what the truth is if you're trying to make a scientific statement based on an X that exists here you don't know any you you know very little right this method just because of approximate inference thinks you're very confident about what's going on here right so that is a bias and that a poorly calibrated uncertainty that will significantly alter your statement about the world um uh variational inference stochastic variational uh gum process if you ever play with that um it has a known issue where it what it does to to improve computation is it puts inducing points it scatters them throughout space and you kind of collapse your inference onto this this this smaller set of inducing points um you can see that operating here because what happens is you also collapse your confidence you collapse your uncertainty at those points right so you've created these artificial points that are just part of your algorithm where all of a sudden you think you know something about the world you don't so how do you deal with this no before saying how you deal with it I hope you just see that this is a problem right all of these cases have the exact same model they have the exact same data the only thing that has changed is the computation that went into making inference right so think about this as the same as you're fitting some um you're fitting some diffusion model like we were just talking about right and one of these is guidance one of these is SMC one of these is HMC one of these is some other thing right that's a problem right the density that you end up with if you're trying to do something useful with it really shouldn't change that much and we treat this as this this minor inconvenience but it's Central this is a um toy problem okay okay okay okay strong statements and different statements about what you know this is true of this example model the um Gan process but it's true of all probabilistic models and it's true of lots of models we've talked about this week um I I pulled a couple density estimators diffusions Max entropy models uh BNN U Etc it's also true even if you think well I'm not doing I'm not doing probabilistic machine learning so this probably not that big a deal it's it's a little bit of a longer conversation but the same thing is going through is going through whether you whether you recognize it or not and we ignore this effect okay um let me skip this this is just in case you thought this was a silly toy problem which it is a silly toy problem it's a one-dimensional problem with 10 data points I I did this on a on a um on a much larger set of data a much more realistic set and did this across kind of all of the um standard um uh standard gaum process methods um and just shows this just shows that there's a lot of this issue continues in a large way okay uh editorializing why is this AI for science right um I said this earlier right the AI for science like the onsa is that is that data and a model can let us infer something about scientific interest something of scientific interest right we don't normally talk about it so reductively but that is fundamentally what we're trying to do and if that's not true or if there's a hole in that we should pay attention um the modern AI toolkit brings its own opinions right biases shortcoming Etc um to the story right as this as this example shows it's just weak science right where you have something that you're not controlling for that's is not the way we like to do things and and this is not just about probalistic machine learning um so let us let us understand and rationalize the biases of these okay so how to do approximate inference the right way I was struggling for a for a long time how to think about this um I thought a lot about an effective likelihood right so this computational machine that is going on here right is it changing my likelihood is it actually giving me some different observation model for what what let's say how data gives rise to um or sorry how a state of the world gives rise to data um is it changing my prior so like actually what my what is happening is my prior belief is different and what I've what I've um and this this postto uh convince me of is that what what is actually happening here is let's think about computation right what computation is is the calls to your GPU right computation is how data enters your model right it is an effective data set induced by computation what do I mean by that if you say look on disk like on my SSD I've got a million data points and then you run an inference method that touches 12 of those data points right it is apparent that you do not actually have a million data points right like your possession of those data points is cued but if you've only computed on 12 of them right what you're actually representing is a belief dat about 12 data points okay computation just think about it one step further that's what it's doing like computation is the reach out to your disk it instantiates that compute it spends some effort on GPU or on CPU and updates your model accordingly it is the it is the process by which data gets updated as a belief so really what we're going to do is we're going to think about this as exact inference not approximate inference but exact inference where you get an updated belief given an effective data set by which I mean I'm not telling you that I can give you I'm not telling you that I've solved approximate inference and now it's all exact what I'm telling you is that if you take a set of approximate steps I will give you an exact statement about what you did condition on a particular data set that is different than the data set that you have on disk right is the data set induced by the computation it seems like an alternative possibility would be like that the form of the likelihood was wrong and no matter what data set you chose it wouldn't explain answer this question sure sure sure sure yes so there is a there's a there's a question here about what is the right what is the right setting in which to a scientific question right I am I'm not addressing that what I'm saying is if you have a belief and you have a model for how that belief gives gives rise to data then then Bas rule tells you how to update your belief right if your if your prior belief is wrong or your data is wrong you're going to get a bad answer right but all of that is an exact bad answer if you like right so so model missp specification that is a that is a hugely important topic and one that one that we work on in its own way I think what you're saying is that I can account for whatever happened in the robot non Magic Ball by saying there at least exists some data set I could have observed such that if I done exact basian inference through the magic ball I would get the same result i' seen through my approximate in that's that's right that's right and so I guess I just don't is that like Pro provably true like I just because it seems to me an alternative would be that there does not exist such a data set if I don't also change the form of a likelihood for example uh those will turn out to be the same as in as in I will show you that essentially what you're doing is a transformation of this data set and so if you want to call a transformation of the data set a different likelihood sure it is it's a projected likelihood if you want to call it the same likelihood on a different data set I might talk to you further later so yeah yeah yeah yeah yeah the the the the thing that I that I that I want to emphasize is the the or the thing that I want to tweak about what you said is that you said I could imagine a data set that would have given me this exact answer that's not we're doing something a little more precise than that which is saying given the steps that you took this is the exact data set that you actually used oh like you're going to give us a constructive algorithm yeah a constructive algorith that results in essentially you have you thought you had a million data points right we went through this kind of toy example where you got a million data points and you just took 12 of them right but if you do let's say 12 conjugate gradient steps we can show you that what you end up doing is taking um a particular set of linear projections of your data set so it's it's it's like a it's a 12-dimensional projection of this million data points if you like this will be specific to conjugate gradient in this Cas not conjugate gradient this will be specific to this example model the gaussing process um there are extensions of that but the some of the tools we will use here are very and this is what this is kind of the the comment I made about linear algebra for some of the tools that we will use here are um absolutely based on our ability to propagate uncertainty through linear algebraic operations and that gets trickier when you want to do some like lat de lay allocation model or something like that right not impossible but just yeah it's it's a um I'm happy for that to be a shortcoming of this right this is this is not yeah this is not like once you do it here you can just do it anywhere kind of thing all right so effective data set right um iterative numerical methods take linear combinations of data um this is true in gaum processes this is true in lots of methods modulo what I just what I just said Jen we're going to define the set of actions taken by an approximate solver as this Matrix right so you take I actions each of those is some projection of your end data points I showed you in s very briefly in the conjugate gradient in the conjugate gradient method before um but let me uh um uh let me give you kind of other other examples of that right so in um conjugate gradients I showed you this s Matrix which is which is essentially essentially the residual as you go through and take these conjugate um conjugate steps um if you do a cheski factorization essentially you're you're you're doing this um in canonical order uh through through the data set if you do an igen value decomposition right or partial igen value decomposition essentially you're traversing the igen space of the kernel Matrix and and using projections that way um um inducing Point methods n methods subset of regressor methods all of these end up taking some um some basis according to an inducing Point that's quite a bit harder to see but just allow it um okay I'm gonna I'm going to establish that connection further later it's just um it's pretty detailed so I wanted to um give the give the punch line of it first so we talked about this ga process posterior before right and then I said look instead of having a data set y you're going to have a you're going to have a projected data set an effective data set si si transpose y okay so that because this is a linear operation you can push this through Gan inference in other words now instead of um y given f is normal under F and sigma you have S transpose y these are just gussian identities right s transpose F you project your function your your mean function down and you uh project your variance the posterior that gets induced is uh also is is is also projected it maintains the same form now with just a different coari Matrix okay this is a inverse coari call it a Precision it essentially looks like this so notice that this is a s projection of your kernel Matrix if this thing were the identity Matrix you would recover the original stuff so what you end up getting is combined uncertainty okay this is this is in many ways the the punchline of how you reason about how you reason about approximate computation before we had always talked about mathematical uncertainty right this is the this is the true object that in the limit of infinite computation I can say I have some AR priori uncertainty and what happens is based on my observations that happen at these points X right I invert this Matrix multiply that through and that reduces in this kind of sure complement way that reduces the amount of uncertainty that I have this is my posterior uncertainty my mathematical posterior but I recognize that because of taking approximate computational steps I haven't I have to add something back right which is the computation that I haven't done this is what I mean by this being constructive is that you end up with the following representation I mentioned these representer weights before this turns out to be the exactly the uncertainty on the representer weights which it says in this linear regression analog it says look if you have basing linear regression probalistic linear regression right you have some prior belief on your betas on like those linear regression coefficients what you're going to do is you're going to go ahead and up dat that through and over time as you collect more and more data you will increase your confidence so this is a this is a nice outcome right um what you get out is combined effective uncertainty that is a result that that is that is a result of both having finite data mathematical computation and finite comp uh sorry finite data right so mathematical uncertainty Plus computation computational uncertainty that is appealing in the sense of that's that's the way the world should work right you have some limited amount of data you have some limited amount of computation both of those have the effect of reducing uncertainty and the way you combine those two or the way you invest in those two will reduce your uncertainty differently and in different places I mentioned this expression data is as data does in so much as this is a workshop and it's Friday I think it's okay to kind of wander for a moment we think about data as this as this existing object right but what I've been trying to convince you with this whole discussion of your your your your your SSD versus your GPU right is that data is only data only influences your model to the extent that you use it to the extent that you consume it right and it's computation that consumes your data right and so if you don't consume a data point or if you only consume part of a data point part of the information that it has to tell you you only got part of that data point so data is as data does okay computation and data are a little bit more fungible concept than we might have thought them to be you can comp you can compute C you can't compute either of these right or rather this this this object takes all the computation to compute so you I will show you how to compute this these these you know they exist and I will go to the trouble of computing them so you can see them but in effect the algorithm that I will present will only compute this yeah but you're spot on right like this is it is it is impossible to imagine that you could compute this and this because that would be the same as Computing this right sorry if I just missed this earlier but this is a this is about the uncertainty that comes out of the gaussian process model associated with the X's that you put in what about just the point prediction value the the mean value yes yes um those are linked this is a great question thank you for for raising it those are absolutely linked through this right which is it's the uncertainty that tells us how much we should trust an observation and thus the mean depends on it so you're you're you're you're absolutely right point this out this is not just a statement about calibrated uncertainty um sure we all love calibrated uncertainty but mostly we care about mean estimation right like like we are Point estimate people at heart um so so no this this absolutely has an influence on that and and and we'll see that kind of in results okay so we looked at this before right this was approximate inference in practice here's a new way we're going to sequentially update data points so in in this example this is just the original toy example I showed with 10 data points right so here's a here's a an odd way we could look at this we could look at this as saying this is the example where I say you have a million data points you only get you only look at 12 of them what I'm going to do is I'm going to look at that and saying look I have these 10 data points but I'm going to look at them one by one and I'm going to track the computation ational uncertainty so when I have that first data point remember this is the the variance that I had right that variance is made up of this amount of mathematical uncertainty in blue right this is just the marginal variance a little easier to look at plus all the computation that I haven't done right so Jen this is your green plus blue right I just because I can calculate in this toy example I can get you the blue I can get you the green as we add more data of course the mathematical uncertainty which is conditioning on all 10 of these data points does not change but what you can see is that as you add data you are spending computation to reduce your computational uncertainty in various places your effective data set is growing you continue to do that until you have effectively conditioned all your data and you are left with exactly what we started in the first place which is the true mathematical uncertainty there's no computational uncertainty left because you've spent all the computation you needed to you can see this in you can see this in uh other methods that that was just exact inference but I can show you this now on these same 10 data points with just the number of iterations that you take through a particular method so this is a cheski factorization and you can see that as you evolve you are reducing computational uncertainty you are reducing it at a different rate in different places per different approximation Ali to the point of we not solving how to do approximate inference here we telling you if you're doing approximate inference this is the actual uncertainty that you should represent right I hope that's not unsatisfying same thing with conjugate gradients so this is what I was saying about I hope this is not unsatisfying is that remember before how we said that conjugate gradients and cheski and all these we're making different assumptions about the world all we are doing now is making clear what those assumptions are right you can see that these have different posteriors now in different computation um I mentioned that that that I'll skip this I finish at 11:30 right I'll skip this okay yep um here's the answer to your question right or I gave you the answer to your question here's a here's a here's a picture um so you might be concerned that this is just a story of of uncertainty um this is negative log likelihood right which which is some combination of mean squared error and also uncertainty right like your your your log determined envelope um so it does a much better job in uncertainty estimation this is uh ingp what I just showed you versus um kind of the the Baseline standard or the the the gold standard of of approximate inference so by having in this case less overconfidence which is a weird thing to say but let me let me be clear about what that means what we showed is that approximate inference methods because they are bringing extra biases they can only reduce your posterior more than it should be they they reduce your uncertainty more than it should be right because they are overconfident so what tracking computation does is it gets you back to a properly calibrated and theoretically provable more accurate uncertainty that uncertainty will then benefit you in generalization you can see it here in variance but it's also a mean phenomenon so this is just looking at the mean squared error because of exactly the equation I showed you it it works better as well y what's that yeah this is like the weakest protein right this is the UCI data set protein right which like I I don't know what that actually involves but it's just called protein yeah I mean if it makes you feel any better one of them is called keg so it's like we're we're like at that level right yeah okay so I think I I think I've made these points already um the result of this is if you up update your your gaan process or your model via Matrix Vector multiplication then the combined uncertainty of this algorithm is precisely the correct object to capture your belief um I didn't I didn't show the theory behind this but but I I do mean precisely the correct the correct object it is is is it is provably correct um little bit of technical details uh I mentioned this weight space view right where essentially these are the represented weights these are the the betas in that in that linear regression um if we represent uncertainty on that that is that is it turns out exactly what we're doing right um if you know the representer weights right so suppose you knew these exactly well then you're finished right because you have completed the linear solve right so that so the problem of of doing true uncertainty uh uh true posterior inference exact posterior inference is completed but you don't now suppose you're on the other end of the Spectrum which is you have no knowledge whatsoever of these these represented weights then it turns out the right prior to endow this with to endow these representative weights is this and you recover the prior from that in other words it is how much you know about these coefficients that exactly is is monitoring how far you've moved from the prior to the posterior so as you do computation if you can update your belief State about these using the computation that you've done that is exactly tracking you from prior to posterior that's a weird way to think about ging conditioning but it turns out to be the the the right thing okay a lot to get through hold on this is way easier the um so um the the what all this math comes out to is an algorithm that iteratively walks through constructively updating and creating this C Matrix that we mentioned before which is the combined uncertainty right so how it does that how it does that is by creating a belief State on these representer weights right these VI it walks through as you take your linear updates as you take your actions SII which I gave you a table for different iterative solvers um cheski conjugate gradients ET none of this in other words depends on um conjugate gradients this is a call to a conjugate gradient method right or a call to a different like pick your iterative solver of choice as long as it spits out actions you might call it a policy um connections there for another talk right um as long as it spits out these actions you will walk through this and update both your computational uncertainty excuse me not your computation you update your combined uncertainty which is in effect updating both both your posterior and your computational uncertainty I apologize for for jumping through that but that how this is how it's getting that yeah Beyond inition you yeah so I told you how when you have no belief when you have when you have no update to your representer weights you essentially have an uninformed prior and that and that Returns the prior of the gum right you haven't seen enough data yet when this is reduced to zero you have solved the problem exactly so you're finished so all of these all of these methods depend on a residual right like they depend on some Er some err term to do their um uh uh to to to query their policy so what you have is this residual which is your data minus your representer weights which is your belief about how much you've solved this problem already propagated forward right this is beta time K right that error you then that error is a linear term you are then going to take a linear in product with the action that your policy gives you so that's all linear and you can just push your variance right through that right that gives you an update right and you just walk through updating that's what I mean when I said it's all linear algebra right just ask clarification question so we talking about um the representer weight uncertainty of like probabilistic numerics like exact would go zero with computation yes so this is um yes this is absolutely an example of probalistic numeric um that that example is is made plain in that the reason I the reason I I I go at it from the other direction here is because this I think with with with all love to the probabilistic numerics community in which I'm aart this is a place where solving this problem requires you to use problemistic numerics whereas often probis numerics people think you know because I want to do this right and this is like you are required to do this yeah but but but no those connections are um there are there commented out slides in here that draw that out entirely all right we went okay okay so here's so here's this in one final picture the mathematical uncertainty of of this method oh sorry of this model and this data looks like this right the computational uncertainty depends on where we have spent compute so I've colored these data points red for whether they've been touched more by the method touched more in the SI transpose y sense right so what you can see from that and and this is just for conjugate gradients in the big box and then here's some others you can look at it's the same story over and over again but what you can see is that in places where you've hit this a lot there is lower computational uncertainty which confers lower uncertainty out here whereas here these data points which you haven't actually looked at even though you think you have right these still have high prior these have close to Prior uncertainty around them because they haven't been hit by the computational method right some other time they might have been some other random method they might have been but that's the that's the that's the coherent statement we were trying to get to oh yeah here you go here then approximate inference the way and probabilistic numeric numerical treatment of represented weights are shown to be identical yeah okay final final bit approximate inference should take should be taken into account in probabilistic machine learning today we've done that for gaussian processes absent that absent doing that computational uncertainty is ignored and I hope I've convinced you that it is surprisingly significant is changing your belief about the world in ways that you're not paying attention to data is as data does um in a very concrete sense right so data is as data does like smart is as smart does stupid is as stupid does depending where you come from you might have heard this statement before the idea is that like these are not immutable properties like data is not an immutable property of the world it is what you do with it that counts in a very concrete sense this shows mathematically that data only exists to the extent that you use it which is fun to think about um some other questions you might have what other inference settings admit admit this um this is kind of Jen's concern this is all of our concern about how specific what I've showed you is to gum processes the problem is absolutely everywhere the solution that happens to be based on a lot of linear algebra is very specific to ging processes right um how do you push this through the decision Theory right if you want to be if if you want to pay attention to this we've we've written about a paper about this using this for basing optimization um how to use this for computational Neuroscience I could have talked about that um but already um Active Learning I think this is really cool because I've shown now that data and computation are actually fungible right and this is something that science scientists actually care about right because we don't just work with imag net like there's actually a cost to collecting new data and you can now you have an active exchange space there's a there's a there's an exchange rate between collecting data and spending more computation like you can precisely answer the question should I buy a new GPU or should I get more experimental time right which is a fun thing to think about [Music] um great thank you very much sorry for taking a little too long I have questions to been asking thank you so in the weight update view of this where you're allocating resources to some data points is that giving us useful way of of prioritizing among data so I might say if I have some measurements that are very close to each other I should spend as much time on that set and I should allocate my resources to ones that are further away or if I have experimental errors and some of those error bars on are larger I should deprioritize spending resources on those points is that sort of what it GP is able to do for us yes so in some sense already the approximate inference methods are trying to do that right because they're not just like for for example that is the difference between doing gradient descent and doing conjugate gradients right is you're taking a you're taking a a step in a conjugate Norm which is hopefully saying look if I just looked at this data point which is One Direction in ukan space I don't need to look at this other data point which is another Direction in ukian space but it's actually almost the same direction in Kad conjugate space so so the reason I do that Preamble is to say like this is not exclusive to to to the it GP like this is this is everywhere but but yes you are absolutely right from the from the extent of you are seeing as you progress through where you are getting um where your combined uncertainty is largest and then you can this is this is part of the kind of active online PR you can query data in places of large uncertainty or you can spend more compute in places of large uncertainty that is um that is a tool you now have we have used that for um I mentioned this um Bas optimization problem you could imagine doing exactly that in basing optimization where you say look I'm what I'm interested in is finding regions of high utility and so I'm going to spend my updates there like I'm going to allocate compute according to that but it's absolutely the right way to be thinking about that thanks uh I was curious if this gives you tools to say given a fixed compute budget which of these different approximate inference algorithms are going to like minimize my combine uncertainty the most and if that gives you insights into like for your particular data set which which approximate inference methods to use that's a cool idea um it's certainly okay trivially speaking yes it would do that because if you could evaluate combin certain but the question is actually well wait a second you just spent number of methods times that computation so how could you actually do that in an online way choosing between them or maybe for particular problem domains that you're interested in where you collect data sets in certain ways and you know have certain types of things you could ask are are certain variational inference methods better suited for for these types of problems based on this analysis so I mentioned I mentioned [Music] here I mentioned here that we query a policy right where we select the action so one thing that you could imagine doing excuse me one thing that we have done is to say well wait a second can I differentiate through this based on my where I'm collecting my data such that instead of doing this kind of um mixed integer programming question of like which of these methods should I pick at different times I'm actually just letting the data tell me what my policy should be that is definitely a thing we can do that's the way we've approached that question so far so that's the only thing I have a coherent answer for um but it is definitely yeah it is definitely something we've been thinking about in that in that regard just clarify what you mean is the policy encompasses like all the approximation methods and it can switch between them apparently what the policy's job to do is to say I want to look at the state of the world which is typically for a policy like this it's looking at residuals right it's saying like which directions have I searched and how small is the error in that direction and it is making an update based on that like that is like that we don't normally think about an igen value decomposition is doing that but like if you do gaidal elimination like that's just like precisely what it's doing like you're walking through that coming up with new igen directions to operate on right conjugate gradients does that same thing with residuals in the conjugate direction there's no reason why you can't differentiate through that as you are differentiating through this whole procedure to different um different hyperparameters that exist in here to do exactly what what what he's asking for this is really cool you mentioned earlier about like model selection how this can also be used for that yep um just want to know if you could elaborate a bit more and and possibly also just like extending that model selection space to not just GPS but other models but like yeah um I can elaborate on the first part of that question not the second part of that question um or you know over coffee I can elaborate on the second part of that question but it is like is way past a grain of salt it's like a whole salt mine worth of worth of worth of um worth of that so where are sorry I'm looking for the model selection piece so essentially what we're trying to do here is maximize this likelihood objective so now first of all if you believe that we can propagate uncertainty through this there should be no trouble in propagating uncertainty through this right so so so so just doing that is okay the problem that comes up and the real issue with doing model selection is that remember I was talking about the effective data set is what you end up doing is naively if you just try to push it GP through this you end up doing approximate inference excuse me you do model selection on the effective data set which turns out to be a really bad idea because then all of a sudden you're almost ending up in a pathological case of what you are asking for which is you're saying find me the model that agrees most I'm going to look for data that agrees that gives me like the most certainty about that so you do a terrible job on all your other data so what you end up having to do do is coming up with a new objective function that basically that basically maintains uh Fidelity over the whole data set rather than just on the effective data set that is the means by which we extend this to model selection but then the actual operations of propagating uncertainty through this type of object is exactly the same thing and you can do that the same way with um stochastic Trace estimators which end up looking exactly like that just with different basis vectors that's the that's a two-minute explanation of how you push that through um that brings up some computational issues of its own which we have ways to deal with and there's a paper not referenced here that I can send you to give you all the specifics on that the interest time we got to wrap up here so we have time for lunch and then we'll reconvene at 1:30 to hear thanks everybody

Transcript for:Approximation in Machine Learning and Probabilistic Inference

Transcript for:
Approximation in Machine Learning and Probabilistic Inference