So, what did we talk about last time? Last time we did a bunch of LQR: we did LQR as a QP, and then we used that to bootstrap our way into the Riccati recursion, and we saw how that's a way of taking advantage of the block sparsity in the QP to solve it fast. There's way more going on there, though, so we're going to do a little more of a dive into the depths. In particular, we hinted that somehow out of that you get a feedback policy that's applicable to any initial condition, which is kind of wild. So we're going to dig into that today, see yet another way to look at LQR, another viewpoint on it, and that viewpoint explains the whole feedback-policy story.

Today's plan: first, a little bit of cleanup from last time on a couple of points. We'll talk for a quick second about the infinite-horizon LQR problem, which we hinted at last time with the K matrix asymptoting out to a constant once you iterate backward for a while; that's called the infinite-horizon solution, and we'll talk about it in a little more detail. Then, again cleaning up loose ends on LQR, we'll talk about this idea of controllability. Has anyone heard of that before, in a linear systems class? We'll talk about that a little. Then we're going to get into dynamic programming, which is a very deep subject and will give us yet another way to get at LQR; that's four ways to do LQR by the end of this, and it'll lead us into a whole bunch of other stuff that we'll continue with next time. Who's done dynamic programming before? Good number. It's very prominent in the RL world and in other areas of computer science too; that's the big one. All right, cool, let's do it.

First thing is to tie up the loose ends on this infinite-horizon story. It's the limiting case where you have a very long time horizon, and we saw how things asymptotically become constant, so let's explain that a little more. This is only a thing for time-invariant LQR problems; for time-varying problems, things obviously change all the time. For time-invariant LQR, as we saw, the K matrices converge to constant values, and we said last time that for stabilization problems, like just balancing the cart-pole or whatever, we usually just use that constant K and that's it. In fact, all the toolboxes you're probably familiar with (MATLAB, Julia, the Python control toolboxes) have a function like lqr or dlqr where you put in A, B, Q, and R and it spits out a K matrix, and that is exactly the infinite-horizon K matrix. What we're going to talk about right now is how they get that.

Remember what we were doing last time: we got this backward recursion, called the Riccati recursion, for the P's and the K's, and it looks like this. First we solve for the gain,

K_k = (R + B^T P_{k+1} B)^{-1} B^T P_{k+1} A.

This is just from last time; I'm writing it out real quick. It's a little ugly, but you will see this
enough times that hopefully it starts to stick. So that's the K part. Then for the P part, you take that K and plug it into the second equation,

P_k = Q + K_k^T R K_k + (A - B K_k)^T P_{k+1} (A - B K_k),

and you just alternate these two. Remember, this is the time-invariant case, so the A's, B's, Q's, and R's are fixed matrices with no time indices on them. We iterated this backwards a whole bunch and got constants. So in the infinite-horizon limit, what we're really saying is that P_{k+1} = P_k = some P_inf: when I do that recursion, I get back the same thing. Does that sound familiar from stuff we talked about really early in the class? Yes, that's a fixed-point iteration. It turns out the Riccati equation in the LTI case has a stable fixed point, which is the infinite-horizon value, and in fact one way to solve for the infinite-horizon K is just to do fixed-point iteration: run the Riccati recursion a whole bunch until it settles out to a constant, and there you go.

But we talked about how fixed-point iteration can be slow, so there may be other ways to do it, and in fact there are. The toolboxes that do this pretty much never do it as a pure fixed-point iteration, because that can converge super slowly in some cases. Another way is to move everything over to one side, set it equal to zero, and treat it as a root-finding problem, so you can also solve this with Newton's method. Most of the toolboxes do some variation on that; there are fancier, more specialized techniques for Riccati equations because they show up so much, but a reasonable thing to do is to run the fixed-point iteration for a while and then do a single Newton step, or a couple of Newton steps, and you'll get the answer to super high accuracy. The reason for the combination is that Newton's method has some finite basin of attraction, so without a good enough initial guess it will not converge well; running the fixed-point iteration for a little while first gets you into that basin. This is a common strategy in a lot of numerical methods, by the way: if you have some slowly converging but super stable method, you run that for a while, and then you do what's called solution polishing, where you do one or two rounds of Newton, because Newton can get you an essentially exact answer. Doing something cheap and stable first and then one or two Newton solves at the end is called solution polishing, and it's super common in numerical algorithms.

So, basically, there are toolboxes that do this for you: Julia, MATLAB, Python, all the control toolboxes. They call this dare, where dare stands for discrete algebraic Riccati equation, and that gives you the P; then dlqr gives you the K. There are functions for both of those, and any of the toolboxes will do it. We'll do some examples. So that's super useful: now you know infinite-horizon LQR. If you have a stabilization problem where you just care about balancing, or keeping your system at some fixed point, you write it down, linearize about the fixed point to get A and B, make up some costs (you can tune them a little bit), just call dlqr or whatever, and you've got your K. There you go.
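As a concrete illustration of the fixed-point story, here's a minimal sketch in Python. The discretized double-integrator A and B and the weights Q and R are made-up illustrative values, not the ones from the lecture notebook, and scipy's solve_discrete_are is standing in for whatever "dare" routine your toolbox provides.

```python
# Infinite-horizon discrete LQR two ways: Riccati fixed-point iteration,
# and the toolbox DARE solver. Values below are illustrative assumptions.
import numpy as np
from scipy.linalg import solve_discrete_are

h = 0.1                                   # time step
A = np.array([[1.0, h], [0.0, 1.0]])      # double-integrator dynamics
B = np.array([[0.5 * h**2], [h]])
Q = np.eye(2)                             # state cost (positive semi-definite)
R = 0.1 * np.eye(1)                       # control cost (positive definite)

# Fixed-point (backward Riccati) iteration: run until P stops changing.
P = Q.copy()
for _ in range(10_000):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P_new = Q + K.T @ R @ K + (A - B @ K).T @ P @ (A - B @ K)
    if np.linalg.norm(P_new - P) < 1e-12:
        break
    P = P_new

# Toolbox solution ("solution polishing" is handled for us internally).
P_dare = solve_discrete_are(A, B, Q, R)
K_inf = np.linalg.solve(R + B.T @ P_dare @ B, B.T @ P_dare @ A)
print(np.max(np.abs(P - P_dare)))   # difference should be tiny
print(K_inf)                        # the constant feedback gain, u = -K x
```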
OK, any questions about that? Yeah: if you're doing trajectory tracking, tracking some reference, then you want the time-varying LQR. You write down the whole series of A's and B's, do the full Riccati recursion, and cache everything; then when you're doing your rollouts you apply the K's at the correct times. You're going to do exactly that step on the homework, I promise. And yes, we will use the P's; it's important. It's useful not so much for that case, but it becomes super useful when you're doing MPC: as soon as we start talking about constraints and solving the optimization problems online, we're going to use the P, and we'll start getting into how that looks today a little bit. There's a bit of a story there.

OK, cool. The next story is controllability, which is again some loose ends on the LQR story. The basic thing we're trying to get at is: how do we know if LQR will work, and specifically, how do we know if it can stabilize the system? So far we've already talked about conditions on Q and R: we need Q to be positive semi-definite and R to be strictly positive definite, and those are basically to ensure convexity, so that the QP has a unique minimum and is a well-posed problem. It turns out, though, that there are also conditions on the dynamics, which in this context means the A and B matrices, and we call those controllability conditions. Basically: is it even possible? You could make up a system that has ten degrees of freedom but only one control input, and it's just going to face-plant, because you don't have enough control authority to balance it; you don't have enough knobs, not enough control inputs. We call that an uncontrollable system, and in that case LQR is not going to work. So this is about conditions on the system matrices, the system dynamics, such that we will be able to control it.

This is a general topic for linear systems: there are nice ways to compute it and be confident in it. For nonlinear systems it's a complicated story in general, but for LTI systems, linear time-invariant with fixed A and B, there's a really clean closed-form version, so let's do that right now and we'll talk a little bit about the other cases after. OK, so here's the nice, simple, closed-form answer. Here's what we're going to do: write down a little trajectory from some initial state. For any initial state x_0, a later state x_N is given by the following; I'm literally just going to iterate the dynamics. I've got

x_N = A x_{N-1} + B u_{N-1},

and if I go back another time step, I can write x_{N-1} the same way and nest it, so

x_N = A (A x_{N-2} + B u_{N-2}) + B u_{N-1}.

I can keep nesting in the dynamics from earlier time steps, and if I just do that all the way back in time, it looks like this:
x_N = A^N x_0 + A^{N-1} B u_0 + A^{N-2} B u_1 + ... + B u_{N-1}.

The whole idea here is that I want to eliminate all the intermediate states: I keep plugging in Ax + Bu recursively backwards for the x's, so I can write the whole thing purely in terms of the initial state x_0 and the controls. Does that make sense? Literally, this is me expanding out a rollout from x_0: you give me x_0 and all the u's, and I just iterate the dynamics a whole bunch. That's it for a linear system. Does that make sense, everybody?

OK, so I can now take that whole thing and squish it into a big matrix equation that looks like this:

x_N = [B  AB  A^2 B  ...  A^{N-1} B] [u_{N-1}; u_{N-2}; ...; u_0] + A^N x_0,

with all the controls stacked up into one big vector. Everyone cool with that? Now, if I look at this, it's a big, weird linear system. Specifically, what's the shape of that matrix? The left-hand side is x_N, a state vector, so it's n-dimensional, whatever that is; but the thing on the right is all the u's for all time stacked up, which is a huge vector. Huge vector on the right, little vector on the left, so what shape is the matrix? Short and fat. So what kind of problem is this, if I gave you this as a linear system? It's an underdetermined linear system, and there are an infinite number of possible solutions. So what do you usually do in a situation like this if you want a solution? A least-squares problem: we can find a least-squares solution to this. Everyone remember this? Cool. So, do you remember how you would get a least-squares solution here? You could do that, which is a brute-force, expensive way to do it, but not wrong. There's a cheaper, more standard way ("normal" being actually a hint): it's called the pseudoinverse, and it's related to the normal equations in least squares. I'll show you what this is; hopefully it's jogging some memories.

So this is a least-squares problem for the u trajectory. If I want to solve for all those u's, I write

[u_{N-1}; u_{N-2}; ...; u_0] = C^T (C C^T)^{-1} (x_N - A^N x_0),

where I'm calling that big matrix C, which is another hint. This is the least-squares trick you've probably seen at some point in your lives: all I'm doing is solving for the u's, with everything else moved over to the right-hand side. So if you give me any x_0 and any x_N for an LTI system, I can write this down as a big least-squares problem, and that's the answer. Everyone cool with that? And do we remember what the thing in the brackets is called? It's called the pseudoinverse, also known as the Moore-Penrose pseudoinverse.
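Here's a minimal sketch of that reachability least-squares solve, again with illustrative double-integrator values and an arbitrarily chosen horizon. It uses np.linalg.lstsq, which returns the minimum-norm solution of an underdetermined system without explicitly forming C C^T (more on why you want to avoid that in a second).

```python
# Minimum-norm control sequence that drives x0 to a target xN, via least squares.
# System, horizon, and states are illustrative assumptions.
import numpy as np

h = 0.1
A = np.array([[1.0, h], [0.0, 1.0]])
B = np.array([[0.5 * h**2], [h]])
n = A.shape[0]

N = 20                                   # horizon length (arbitrary choice)
x0 = np.array([1.0, 0.0])
xN = np.array([0.0, 0.0])                # target state

# Stack [B, AB, A^2 B, ..., A^(N-1) B]; the matching control stack is
# [u_{N-1}, u_{N-2}, ..., u_0].
blocks, Ak = [], np.eye(n)
for _ in range(N):
    blocks.append(Ak @ B)
    Ak = A @ Ak
C = np.hstack(blocks)                    # n x (N*m): short and fat

rhs = xN - np.linalg.matrix_power(A, N) @ x0
u_stacked, *_ = np.linalg.lstsq(C, rhs, rcond=None)  # minimum-norm solution
u = u_stacked[::-1]                      # reorder to u_0, ..., u_{N-1} (m = 1 here)

# Roll out the dynamics to verify we actually hit the target.
x = x0.copy()
for k in range(N):
    x = A @ x + B[:, 0] * u[k]
print(x)                                 # should match xN to numerical precision
```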
By the way, you should never actually compute it that way: squaring C like that and then inverting is bad for you, because it gets ill-conditioned. Does anyone know what you're supposed to do instead? The SVD is one way to do it without squaring, but there's a cheaper way; anyone know the standard approach? Yes, you can do that, and do you know what it does under the hood? A QR factorization: there's a clever way where you take the QR factorization of C (or C transpose), and then you don't have to square C at all. I think Kevin might talk about this more in his office hours; hopefully you've seen it in a linear algebra class. The trick to get the formula above is literally: C is not square, so you multiply both sides by C transpose, get something that is square and now invertible, and solve. The key idea here is that C C^T term.

So, the key question: what do I need here to make this problem solvable? Can you give me a condition on C for this thing to have a solution? Specifically, I need the thing inside the inverse, that C C^T matrix, to be full rank. What dimension is that C C^T thing? It's n by n, where n is the state dimension. So C C^T needs to be invertible, and it's an n-by-n matrix where n is the state dimension, so you can equivalently say that the big C matrix needs to be rank n. Everyone cool with that? Let's write it down: for C C^T to be invertible, we need rank(C) = n, where n = dim(x). That's sophomore linear algebra.

Now there's an extra leap that's a little less sophomore-level; maybe it's grad-level linear algebra. We just wrote this out for some generic, big trajectory: big N time steps, could be whatever, so you get this big C matrix that could be arbitrarily big. So, in principle, do I have to do this forever? How do I know I can actually do this for some reasonably sized C? Can I stop at some point, or do I have to take this out to an infinite horizon to conclude anything? It turns out you can stop after little-n time steps, and the reason why is a little subtle. To recap: I just wrote out what happens if I start at some x_0 and simulate forward some number of time steps to get to some x_N, for an arbitrary number of time steps, and I wrote down the least-squares problem for all the u's over that arbitrary length of time. The worry is: if it has to be a million time steps, that doesn't really help me, and we're talking about infinite-horizon LQR here, so does it have to be an infinite number of time steps? If so, this is useless. So: can I solve this for a finite number of time steps and still be able to say the system is controllable? That's what I'm after right now; everyone clear on that? And the answer is: you can, and it's actually a useful fact.
Let me just state it and then we can talk about it. It turns out that when I check this thing for the rank, I can stop at little-n time steps, the state dimension, in that C matrix. The reason is the following non-trivial fact. When I'm calculating this C, what's showing up in there? Powers of A times B: if I take the horizon longer, I just get higher and higher powers of A. You want to say it? Yeah, it's a power-iteration type of thing. It turns out there's a clean result called the Cayley-Hamilton theorem; has anyone heard of this before? Basically, what it says is that A^N, for large N, can be written as a linear combination of lower powers of A, up to little n, where n is the dimension of A. So it says

A^N = sum_{k=0}^{n-1} alpha_k A^k

for some coefficients alpha_k. There are a bunch of related results: another way to say it is that A^N is a matrix polynomial in A of degree at most n - 1, and yet another way is that A satisfies its own characteristic equation, where the characteristic equation is the one you solve to find the eigenvalues of a matrix. This is on Wikipedia, you can look it up; it's not an absolutely key thing. All it says for us is that for an LTI system, when we're checking this controllability thing, we can stop at little-n time steps, and that's all we need to do. Basically, if it's not controllable after n time steps, it's never going to be controllable; you only have to check up to that. We're checking rank, so what this is really saying is that adding more time steps, more columns, to our C can't increase the rank anymore. So we define C to be what we just wrote down, but for any given system it only goes up to A^{n-1} B inside, where n is the state dimension:

C = [B  AB  A^2 B  ...  A^{n-1} B].

This thing has a name if you go look at books and such: it's called the controllability matrix. Anyone seen this before? Cool, that's kind of how you get it.

Fun fact, and we're not really going to do this much, but what do you do in the LTV case? In the LTV case you basically don't have a result like this; you just have to check it over the whole trajectory. So if you're doing LTV over a finite horizon, you check the whole trajectory, and you can do it the same way. In which case, has anyone heard of a controllability Gramian before? That's what comes up in the LTV setting: you write out all the time-varying A's and B's, form the corresponding C, and that C C^T thing is the controllability Gramian. (You still shouldn't form C C^T; check the rank of C itself, but you have to do it for the whole trajectory; there's no shortcut in the LTV case.) It's just a name, and by the way, you should never actually compute it; it's a bad idea.
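Here's a minimal sketch of the controllability-matrix test. The double integrator should pass; the second system, two decoupled copies of the same state driven by one shared input, is a made-up example that should fail.

```python
# Controllability test: stack [B, AB, ..., A^(n-1) B] and check its rank.
import numpy as np

def controllability_matrix(A, B):
    n = A.shape[0]
    blocks, Ak = [], np.eye(n)
    for _ in range(n):                     # Cayley-Hamilton: n blocks suffice
        blocks.append(Ak @ B)
        Ak = A @ Ak
    return np.hstack(blocks)

def is_controllable(A, B):
    C = controllability_matrix(A, B)
    return np.linalg.matrix_rank(C) == A.shape[0]

h = 0.1
A = np.array([[1.0, h], [0.0, 1.0]])       # double integrator
B = np.array([[0.5 * h**2], [h]])
print(is_controllable(A, B))               # True: rank([B, AB]) = 2

A2 = np.eye(2)                             # two identical decoupled states...
B2 = np.array([[1.0], [1.0]])              # ...driven by one shared input
print(is_controllable(A2, B2))             # False: can't move them independently
```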
A question that comes up: can you read how controllable a system is off the magnitudes in that Gramian? You can try to say things like that, but it's iffy. It's really just a binary rank condition: either it's controllable or it's not. Looking at it in relative, magnitude-wise terms is dicey, because you can always do a change of units or something that totally changes those numbers; if I rescale my units, it changes all those values. So I'd argue the relative magnitude of that thing is not actually very informative, because it's based on your unit choices, which are arbitrary. It's really just the rank condition. And the issue with the Gramian specifically is that, because you're squaring, and especially if you're doing it over a really long horizon, it has the classic tail-wagging-the-dog problem: it gets super ill-conditioned, and the rank computation becomes almost useless from a numerical perspective. So don't do it.

OK, cool, that's that. Any other questions about this stuff? That's the end of our initial foray into LQR topics and linear-systems topics. Has anyone here taken a linear systems theory course, by the way? I know there are a couple at CMU. That's usually a good idea if you really care about controls: it's an entire semester of stuff like this, controllability, Gramians, matrix math of various flavors that are useful for this material. If you haven't done that and you like this stuff and want to spend more time with it, it might be a good idea. OK, thus ends this topic; we're going to switch gears a little bit. Any last questions about any of this?

Cool. So now we're going to talk about Bellman's principle; we're switching gears to dynamic programming. Who's heard of this before? Only a couple of people, interesting. All right, here's what this is about. This is a really core RL and optimal-control topic that shows up all over the place, and it's a very intuitive idea. The idea is that optimal control problems have an inherently sequential structure: things happen through time, and really all this is is the arrow of time. It's saying that control actions can only affect future states; they can't affect the past. It's literally just that at its heart, a pretty intuitive idea, but it has interesting mathematical repercussions. Said very simply: past control inputs only affect future states, and future control inputs can't affect past states. This is painfully obvious, and that fact is called Bellman's principle, also known as the principle of optimality. It seems so trivial that it almost shouldn't have such a grandiose name, and I think it's a little awkward that it does, but if you follow the rabbit hole a little deeper there are interesting consequences for optimal trajectories in optimal control problems, and there's literally an entire subject built on just that.

So here's the way to say this; maybe I'll move to a new page and draw a picture. Let's say I have a trajectory, and I'll just cartoon out a really trivial 1D thing: this axis is time, this one is x, here's x_0, we wiggle around for a while, and here's x_N over there. That's our trajectory. Here's what Bellman's principle is saying.
If I pick some random point on this trajectory, say halfway through, some x_k in the middle, and I write down a new trajectory optimization, a new optimal control problem, starting from x_k instead of x_0, what this says is that the answer to that sub-problem has to be the corresponding sub-trajectory of the bigger problem. Because, obviously, if I solved the shorter problem and hypothetically got some other answer that somehow had lower cost, I would have taken that path in the bigger problem too, since it would have made the bigger problem have lower cost. Does that make sense to everybody? It's very obvious, and maybe it shouldn't be as big a deal as it is, I don't know, but it is a big deal. OK, so everyone's got that. To summarize: if some other path had lower cost for the short problem, I would have taken it starting from x_k. Another way to say it: sub-trajectories of optimal trajectories have to be optimal trajectories for the appropriately defined sub-problem, starting from that state. I've now said that three different ways and drawn a picture; it might take a little longer to stick, but that's the idea.

OK, so that idea queues up this whole line of stuff called dynamic programming, which we're now going to talk about. The idea is: given that fact, and we've seen hints of this already, it kind of implies that you should start at the end and work your way backwards. Because if it's true, then if I start at the goal I can take one little step backwards and solve a really easy little problem; then I can take another step back, and since I now know the tail solution, I can solve another little problem that connects to it, and so on. It implies this backwards recursion starting at the goal, which we've seen before in that whole Riccati thing, the backward Riccati recursion, and also in the backward gradient pass in the Pontryagin stuff. So this idea of going backwards from the tail, from the goal, is where this is heading. Let's write that down: Bellman's principle suggests starting from the end of the trajectory and working backwards, and, as I said, we've already seen some hints of this from the Riccati stuff.

OK, so here's what we're going to do. The move now is to define this thing called the optimal cost-to-go. If you're a controls person, that's what it's called; if you do RL, it's called the value function. Personally I like cost-to-go better than value function, because value function seems vague and meaningless to me, whereas cost-to-go has a real meaning to it, which will be clear in a second. But call it whatever you want. Who's seen value functions before somewhere? Quick show of hands. Yeah, OK. So: cost-to-go, a.k.a. value function. We're going to call it V(x), and in general it will have a time step on it, V_k(x). The idea, in words, is that this function tells you the cost it will take you to get to the goal, starting at that x, at time k.
If I know this function and I want to know the optimal cost to get from any initial condition to the goal, I just plug that state in and it tells me. It wraps all of that up into one function I can query: give it a time and a state, and it tells you the best you can do, in terms of cost, to get to the goal from there. Makes sense, everybody? So: V_k(x) encodes the cost incurred starting from state x at time k, if we act optimally from then on. That's the definition. Now we're going to do this for the LQR problem and see how it works out. I think this is a new-page situation again; here we go.

For LQR, we start at the end, with V_N(x). Remember, in LQR we had the stage cost, which is over the x's and the u's, and then we had this terminal cost. So what do we think this cost-to-go thing is at the final time? It's just the terminal cost, because I can't do anything else; basically by construction, for LQR, the cost-to-go at the final time is wherever you are plugged into that terminal cost, because you literally can't do anything about it. If I say time's up, here's where you are, there are no moves left; it's just that cost, and there's nothing you can do about it. So, working backwards, the cost-to-go at time N for LQR is just the terminal cost:

V_N(x) = (1/2) x^T P_N x,  with  P_N = Q_N.

Here's another hint, and this is not a super big giveaway: for LQR, the cost-to-go is always a quadratic function. So what we're going to do is just write it as a quadratic function and compute it going backwards with this Bellman thing, and we'll always call the weight matrix that shows up in V_k by the name P_k. So for time N it's P_N, and P_N is defined, just by construction, to be Q_N, the terminal-cost Hessian.

OK, cool, that's the final step. Now what we're going to do is back up one step and calculate V_{N-1}. Here's how we do that:

V_{N-1}(x) = min over u of [ (1/2) x^T Q x + (1/2) u^T R u + V_N(A x + B u) ].

Call this equation (star); we're going to come back to it. The first two terms are the one-step stage cost for time N-1 (I'm doing this for the LTI case so I can skip the subscripts, but it's the same stuff if you want to put subscripts on), and then, to get the tail in there, I take the dynamics and plug them into the value function for the next time step. The reason I did this is that V_N is a function of x_N, and by plugging in the dynamics I can write this all at time N-1: it's all in terms of u_{N-1}, x_{N-1}, and V_N. That's what lets me go backwards. So now what I'm going to do is solve this min problem; we'll take a little detour and come back to (star) later.
Let's just solve that min problem right now. This thing is a quadratic, and it's all nice because we picked everything to be convex, so I can find the min by setting the gradient with respect to u equal to zero. Let's do that real quick. First I plug in the definition of V_N. The (1/2) x^T Q x term does nothing, because I'm minimizing over u; it drops out when I take the gradient, so I'll drop that term for a second (it comes back in a moment; for this part it doesn't matter). That leaves the u term, plus the cost-to-go term expanded with the dynamics:

(1/2) u^T R u + (1/2) (A x + B u)^T P_N (A x + B u).

Take the gradient of this thing with respect to u and set it equal to zero:

R u + B^T P_N (A x + B u) = 0.

Then I can rearrange and solve for u. These are the time N-1 quantities; I'm being a little inconsistent about subscripts to save writing, and so you don't have to suffer through my n's that look like k's:

u_{N-1} = -(R + B^T P_N B)^{-1} B^T P_N A x_{N-1}.

That's kind of a mess, but it should start jogging your memory from last time: the whole thing in front of the x we're going to call K_{N-1}, so u_{N-1} = -K_{N-1} x_{N-1}. It's starting to look familiar. Now we take that and plug it back into (star); we did the min, so we substitute the minimizer back in, and what you get is

V_{N-1}(x) = (1/2) x^T [ Q + K_{N-1}^T R K_{N-1} + (A - B K_{N-1})^T P_N (A - B K_{N-1}) ] x.

And what do you know: V_{N-1} is again (1/2) x^T (some matrix) x, and we call that giant matrix in the middle P_{N-1}, which should look very familiar. So now we've got V_{N-1}(x) = (1/2) x^T P_{N-1} x, and, like we've seen before, we've got this recursion for K and P: we start at time N, get K_{N-1} and P_{N-1}, then we bootstrap and keep going all the way back to time k = 0. And what do we think this thing is? It's the Riccati equation again. Last time we got the Riccati equation by looking at the block sparsity of the QP and doing back-substitution; now we're seeing that it's actually dynamic programming, that it's doing the Bellman recursion on this formulation of the problem. Last time we got these K's and P's kind of out of nowhere and we didn't know what the P's were. The K's are the feedback gains, the feedback policy that we get with dynamic programming, and the P's are the Hessians of the cost-to-go. Does that make sense, everybody? In the LQR case the cost-to-go is always a quadratic function, and it turns out P is its Hessian. Is that cool with everybody? Any questions about any of this?

Oh yeah, I'd suggest you don't sweat the details of the linear algebra too much. You should do it once, so you've gone through the gymnastics yourself, but that's like grammar. It's like stressing about English grammar: it's not the story. You should know the story, but you've got to speak the language too; linear algebra is the grammar here.
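Speaking of doing the gymnastics once, here's a minimal sketch of the backward pass we just derived, caching the gains K_k and the cost-to-go Hessians P_k, with the same illustrative double-integrator values as before. The check at the end, that the rolled-out cost equals (1/2) x_0^T P_0 x_0, is exactly the statement that P is the Hessian of the cost-to-go.

```python
# Bellman/Riccati backward recursion for finite-horizon LQR, then a forward rollout.
import numpy as np

h = 0.1
A = np.array([[1.0, h], [0.0, 1.0]])
B = np.array([[0.5 * h**2], [h]])
Q = np.eye(2)           # stage state cost
R = 0.1 * np.eye(1)     # stage control cost
QN = np.eye(2)          # terminal cost
N = 100                 # horizon

P = [None] * (N + 1)    # P[k] is the Hessian of V_k(x) = 0.5 x' P[k] x
K = [None] * N          # K[k] is the feedback gain: u_k = -K[k] x_k
P[N] = QN
for k in reversed(range(N)):
    K[k] = np.linalg.solve(R + B.T @ P[k + 1] @ B, B.T @ P[k + 1] @ A)
    Acl = A - B @ K[k]
    P[k] = Q + K[k].T @ R @ K[k] + Acl.T @ P[k + 1] @ Acl

# Forward rollout under the time-varying policy, from some initial condition.
x0 = np.array([1.0, 0.0])
x, cost = x0.copy(), 0.0
for k in range(N):
    u = -K[k] @ x
    cost += 0.5 * x @ Q @ x + 0.5 * u @ R @ u
    x = A @ x + B @ u
cost += 0.5 * x @ QN @ x
print(cost, 0.5 * x0 @ P[0] @ x0)   # these should match: V_0(x0) is the optimal cost
```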
So, to summarize that whole thing for the more general case: we did it for the special case of the LQR problem, but in general, for any problem, it looks like this. We start with

V_N(x) = l_N(x),

our terminal cost. Remember how we wrote down the problem: a stage cost, a terminal cost, in general nonlinear dynamics, arbitrary costs. So when you do this on an arbitrary problem, the cost-to-go is initialized at time N with just the terminal cost. Then we set our time index k equal to N and do this backward recursion: while k > 1,

V_{k-1}(x) = min over u in U of [ l(x, u) + V_k(f(x, u)) ],

then decrement k and keep going all the way back. I'll write the min like that to mean you minimize within the bounds on u: if you've got control limits, you put those in. So, if I have V_k, I back up one step, put in the stage cost, plug the dynamics into the cost-to-go from time k, and minimize over u; once I take the min over u, the u's are gone and it's just a function of x, and that gets me V at the previous time step.

So that's the general case. If you somehow magically know the cost-to-go for a given problem (and this is getting into where the P's are useful for other things), there's a sense in which knowing V(x) encodes everything there is to know about an optimal control problem. If you have V, you have everything, and you've essentially solved the problem. This is because, if you have V, the optimal control policy is

u(x) = argmin over u in U of [ l(x, u) + V(f(x, u)) ],

subject to the control limits. Basically, what this is saying is that if I know the cost-to-go, then instead of having to solve a monster trajectory optimization problem over some big horizon, I can literally solve a little single one-step problem over one control move, and that gives me the optimal solution.

What's the stage cost? The one-step cost, the thing inside the sum from our formulation of the whole problem from last time. The terminal cost is l_N, and it's just a function of x, because at the end we have no moves left. The stage cost is the cost you get at every time step, and the total cost is the sum of the little l's over all the time steps plus the terminal cost. Can you do discounting? Yeah, you can do that here too; it's literally just a particular choice of l, putting a decaying factor, e to the minus something, in front of the stage cost at every time step. If you want, you can define l as a function of t, or as l_k, and wrap that in; that happens in optimal control too if you want it. What about rewards? Literally put a minus sign in front of it and maximize instead; it's the same thing exactly. The one-step reward is the stage cost with the sign flipped; flip the sign and call it reward if you want. Does that make sense? Cool, anybody else?

So this is how you would find the cost-to-go; it's a recursion. And, as the sketch below shows, once you know V, you can find the optimal u at any time by solving that little one-step problem.
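Here's a minimal sketch of that one-step idea for LQR, with the same illustrative values as before: given a known cost-to-go (here the infinite-horizon one), numerically minimizing the one-step objective recovers the LQR policy. The thing being minimized is exactly the action-value function discussed next.

```python
# "If you know V, control is a one-step problem": minimize l(x,u) + V(f(x,u))
# numerically and compare against the closed-form LQR gain.
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import solve_discrete_are

h = 0.1
A = np.array([[1.0, h], [0.0, 1.0]])
B = np.array([[0.5 * h**2], [h]])
Q, R = np.eye(2), 0.1 * np.eye(1)

P = solve_discrete_are(A, B, Q, R)            # infinite-horizon cost-to-go Hessian
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def one_step_objective(u, x):
    """Stage cost plus cost-to-go at the next state (the action-value function)."""
    xn = A @ x + B @ u
    return 0.5 * x @ Q @ x + 0.5 * u @ R @ u + 0.5 * xn @ P @ xn

x = np.array([1.0, -0.5])                     # some query state
res = minimize(one_step_objective, x0=np.zeros(1), args=(x,))
print(res.x, -K @ x)                          # numerical argmin should be close to -K x
```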
OK, so for all the RL people in the room: we can equivalently state this whole dynamic programming thing in terms of another function, called the action-value function or, somewhat annoyingly, the Q function. We're not going to call it the Q function, because we're already using Q in LQR for the cost weights and it's annoying to overload notation, so we're going to call it S; I'm just picking the next letter that isn't overloaded. What this thing is: S(x, u) is literally the whole term inside that argmin, the stage cost plus the cost-to-go with the dynamics plugged in,

S(x, u) = l(x, u) + V_{k+1}(f(x, u)).

That's called the action-value function; it's usually denoted Q(x, u), but we'll use S to avoid overloading the LQR notation. Why would you want this? Someone from the RL side has probably seen this. The reason people in RL like it is that if I learn S instead of V, I don't need a dynamics model: if I only know V, I still need the dynamics, because to solve that minimization problem I have to actually plug the dynamics model into V. If I'm doing RL and I want to be model-free, with no model at all, I can directly learn this S thing instead of V, and it includes the dynamics implicitly inside it, so I don't need to learn a dynamics model separately. It kind of squishes the cost-to-go and the dynamics together into one big function. Does that make sense? So if you're doing model-free RL, you probably learn a Q function, an action-value function, instead of a value function; or you could learn the dynamics and the value function separately, or whatever. That's the why: it can avoid the need for an explicit dynamics model.

OK, so that's kind of the basic story for dynamic programming. It's a very deep topic and there's all kinds of stuff to say about it, but now we have to talk about the curse, unfortunately. That's what it's called; I didn't make that up. Unfortunately, dynamic programming is cursed. First, though, there are a bunch of cool things to say about it. All the stuff we've talked about so far, the Pontryagin stuff, the trajectory optimization stuff: so far we've mostly been doing LQR, which is globally optimal because it's a convex problem, but in general, if I do the Pontryagin stuff, all the optimization machinery we talked about, it only finds you a local optimum. Everything we talked about was first-order necessary conditions, so they're only good for a local optimum. It turns out that dynamic programming, if you can actually do it, gives you the global optimum; it's actually a sufficient condition, which is awesome. So you'd think we'd want to do this all the time, but we can't, and this is the curse. The issue is that this whole scheme is only really tractable for really simple problems. In particular, it's always tractable for LQR problems (we just did it), and it's also tractable for very low-dimensional nonlinear problems, where you can basically grid the state space and brute-force it.
The key reason is that in this backward recursion we're calculating V, and in LQR it just so happens that the function is quadratic, so to represent it we just need to keep track of one matrix, that P matrix. But for a general system, even if the cost is quadratic, with a nasty nonlinear dynamics function in there, it's a recursion: every time I go backwards, it's f of f of f of some really nonlinear, gross dynamics. So this function can be arbitrarily gross, arbitrarily nonlinear, and it's high-dimensional. If I'm doing this for a big robot, say a humanoid, this is literally like an 80-dimensional nonlinear partial differential equation; it's about the worst thing you could imagine trying to do. The key issue is that I can write the recursion down, but for high-dimensional problems I just have no way of representing V for an arbitrarily nasty nonlinear function. So you can kind of guess where this is going: I can't do it in general because I can't write down V in a reasonably compact way. We can do it for LQR, and we can do it for low-dimensional problems, and the core reason is that V stays quadratic for LQR; in general it becomes impossible to write down, even for relatively benign-seeming nonlinear problems.

The other thing to say is that even if we hypothetically could write down this V, the little one-step problem to find the controls might actually be hard. It's small, but that one-step min over u of the action-value thing will be non-convex in general, and possibly genuinely hard.

So here's the whole curse story: the cost of doing dynamic programming blows up with the dimension of the state, because we have to represent that V(x) somehow. Any way you might think of doing this for an arbitrary function, gridding the state space, representing it with some function basis, sample points, whatever, blows up combinatorially in the state dimension. It basically just doesn't work once you get past order-10 dimensions using classical function-approximation ideas, just because you have to represent V.
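To make that "grid the state space" idea concrete, here's a minimal sketch of brute-force backward DP for a one-dimensional problem, where we can sanity-check against the exact scalar Riccati answer. The grids, horizon, and system values are all illustrative assumptions, and the whole point is that this kind of tabulation blows up combinatorially as the state dimension grows.

```python
# Brute-force dynamic programming on a gridded 1-D state space, checked against
# the exact scalar Riccati solution. Everything here is an illustrative assumption.
import numpy as np

a, b, q, r = 1.0, 0.1, 1.0, 0.1            # scalar dynamics x+ = a x + b u, quadratic cost
xs = np.linspace(-2.0, 2.0, 401)           # state grid
us = np.linspace(-10.0, 10.0, 401)         # control grid
N = 50                                     # horizon

V = 0.5 * q * xs**2                        # V_N on the grid (terminal cost)
for _ in range(N):                         # backward recursion
    Vnew = np.empty_like(V)
    for i, x in enumerate(xs):
        xn = a * x + b * us                # next states for every candidate u
        Vn = np.interp(xn, xs, V)          # look up V_{k+1} off-grid by interpolation
        Vnew[i] = np.min(0.5 * q * x**2 + 0.5 * r * us**2 + Vn)
    V = Vnew

# Exact infinite-horizon Riccati solution for comparison (scalar DARE by iteration):
p = q
for _ in range(1000):
    p = q + a * p * a - (a * p * b) ** 2 / (r + b * p * b)
x_query = 1.0
print(np.interp(x_query, xs, V), 0.5 * p * x_query**2)   # gridded V vs. 0.5 p x^2: close
```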
So why do we care, then? As is maybe becoming apparent from what I've said, doing exact dynamic programming is basically hopeless, but doing approximate dynamic programming, where you use a function approximator to represent the value function (cost-to-go function) or that Q function (action-value function), can be really powerful, and it's the basis for a lot of RL. Q-learning, actor-critic, all of these things are really, at their core, doing some kind of dynamic programming with function approximation for the V, or whatever you want to call it. A ton of modern RL is based on that. The other thing that makes this actually worth talking about: remember we said that the whole Pontryagin thing is only for deterministic problems. It turns out the way to generalize to stochastic optimal control problems is via dynamic programming. The dynamic programming stuff we just wrote down all generalizes to the stochastic setting almost trivially: you basically just wrap everything we did in expectation operators. All that recursive minimization stuff, you just slap an expectation on all of it, and now you've got the stochastic version. Pontryagin doesn't generalize that way. OK, that's that story. Let's see how much time we have left; eight minutes, maybe we can do something fun.

Any questions about this last little bit? Yeah: we'll do this for real later, but basically, everywhere I'm doing these minimizations, any of these optimization problems, rather than the cost it becomes the expected cost where things are stochastic. We're going to do it for real in a few weeks. Anything else on this? Yeah, they're extremely closely related; there's lots to say about that, but this is probably not the time or place for that conversation, we can have it later. Anything else? Yeah, the Markov thing and the Bellman thing are intimately connected; in the stochastic setting you really need it, and the Bellman principle applies to Markov problems.

All right, so let's do this last thing; I think we have time for it, and it's fun. Finally, we can talk about what the Lagrange multipliers are. Everyone was wondering about this, right? This was the burning question in your minds from last time. Remember, when we did the Riccati derivation from the QP, we had these lambdas in there; they were the Lagrange multipliers on the dynamics constraints, and we ended up solving for them during the back-substitution. If you go back and look, we actually got

lambda_k = P_k x_k,

and it's hopefully obvious from this whole lecture that the P's we got in the QP version (I called them P's as foreshadowing) are the same P's we got from the dynamic programming thing: they are the Hessians of the quadratic cost-to-go. From our DP derivation we have V(x) = (1/2) x^T P x, so its gradient is P x, and the obvious conclusion is that these lambdas are the gradients of the cost-to-go. This might take a while to chew on; it's kind of deep and cool. These lambdas that show up in all the primal-dual formulations of the optimal control problem, the dynamics multipliers you get when you do trajectory optimization and solve those QPs, also called the co-states in the Pontryagin setting, are the gradients of the cost-to-go. We did it here just in the LQR case, but it turns out this is a general thing: for nonlinear problems, any kind of problem you want, those lambdas are always the gradient of the cost-to-go. Write that down; it carries over to the general setting, not just LQR. So if anyone had that burning question left over from a while ago, that's the answer. Kind of cool.

So let's play around with that a little bit. Here's some code: it's the same exact double-integrator problem we looked at last time, just to do a few of the things from today. Here's that controllability matrix: this problem has a 2D state and a 1D input, so n = 2, and the controllability matrix reduces to just [B  AB]. It should have rank two, and it does; that checks out.
Let's see what else we're going to do here. Now we're going to go back and do the QP version of all this stuff, with the same kind of problem as last time. We write out this cost-to-go function (that's what that looks like), and we solve the QP we did before: same stuff, dynamics constraints and quadratic cost, except this time we're actually going to look at the multipliers, because now we know what they are. So we solve the QP just like last time, and we also do the dynamic programming solution, a.k.a. the Riccati solution, cache all that stuff, and just look at some things.

Here's the first thing, just to show it's all the same: we plot the state solutions for both the QP and the dynamic programming solution, and they're right on top of each other; they give the same answer, no surprise there. Next we plot the P matrix, the cost-to-go Hessian from dynamic programming; we kind of saw this last time with the K's, but the P's also converge to a steady-state solution as you go backward, because it's an LTI problem, and that's that infinite-horizon solution. The costs, just to belabor this, are exactly the same, out to all the decimal places in double precision; whether you use dynamic programming or the QP, it's the same answer. Now we compute the infinite-horizon K from the control-systems toolbox, like we kind of did last time, and those match super precisely, out to nine digits or so. We can also compute the infinite-horizon P matrix from the control toolbox, with the function called dare, and compare it to ours from dynamic programming; again they match out to nine or ten decimal places. All good.

Then, and this is that whole sub-trajectories-being-optimal thing, we can actually do a rollout starting at a later time step. If we start at, say, time step 50 and roll out from there, these things have to match, like we were saying, and they do: the tail trajectory matches if I solve the problem from the halfway point, and the controls also match exactly.

Now check this out. Here's a lambda from the QP; I'm just going to pick a random one from the middle of the trajectory, time step 50 or whatever. Now I take the gradient of my cost-to-go by autodiff, and check it out: they match out to a lot of decimal places. We can also try that with the infinite-horizon P, just to show you that the infinite-horizon approximation is pretty good once you're far from the end. The other thing we can do, just to give you some intuition for what this thing actually is: remember, the cost-to-go is the optimal cost to get from some x to the goal. So the cost-to-go gradient should work like this: if I start at some random x in the middle of the trajectory and compute its cost, then perturb the state a little bit and compute that cost, and finite-difference those, that finite-difference gradient should be the cost-to-go gradient. Take two nearby initial states, solve the trajectory problem (solve the QP) for each, compute their costs, finite-difference them, and
you get a gradient from that. That's what this is: the sensitivity of the cost to little perturbations in the state. So if I go do that, I compute a rollout from some initial state using that LQR policy, and then another one from a nearby, slightly perturbed state (perturbed by something like 1e-6), compute the cost of each whole trajectory, and finite-difference them, and that finite-difference gradient again matches. So that's the same thing; that's what it is: if I take some random state, tweak it a little bit, solve the whole optimal control problem from each, and compare the two costs, that's what the cost-to-go gradient is telling me. It's telling you how the cost will change, if you solve the optimal control problem, for some little tweak in the state. Cool. All right, that's all I've got. Questions? I'll hang out a little bit.
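For reference, here's a minimal sketch of that last finite-difference check, assuming an illustrative double integrator rather than the exact problem from the lecture notebook: the finite-difference gradient of the closed-loop rollout cost with respect to the initial state should match P x_0, the gradient of the cost-to-go, which is what the lambdas are.

```python
# Check that the finite-difference sensitivity of the rollout cost matches P x0.
import numpy as np
from scipy.linalg import solve_discrete_are

h = 0.1
A = np.array([[1.0, h], [0.0, 1.0]])
B = np.array([[0.5 * h**2], [h]])
Q, R = np.eye(2), 0.1 * np.eye(1)

P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def rollout_cost(x0, N=500):
    """Total cost of closing the loop with u = -K x, starting from x0."""
    x, J = x0.copy(), 0.0
    for _ in range(N):
        u = -K @ x
        J += 0.5 * x @ Q @ x + 0.5 * u @ R @ u
        x = A @ x + B @ u
    return J + 0.5 * x @ P @ x           # tail cost (negligible for large N)

x0 = np.array([1.0, -0.5])
eps = 1e-6
grad_fd = np.array([
    (rollout_cost(x0 + eps * e) - rollout_cost(x0 - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
print(grad_fd, P @ x0)                    # finite-difference gradient vs. grad V(x0) = P x0
```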