So welcome, everybody, to this IAP class on matrix calculus. I'm Professor Alan Edelman, and the other professor, Stephen Johnson, had to go out of town, so we're going to take advantage of Zoom technology so he can give his lectures remotely. We did some testing of the audio and the video, so hopefully it'll work out just fine.

Maybe just a quick introduction — oh, good, yes, Stephen: recording. Thank you for the reminder; I even put the two reminders on the blackboard, so that was reminder number one.

Okay, so both of us are in the math department. I'm also in CSAIL, where I run the Julia Lab, and my own research involves both mathematics and computing software. Stephen, do you want to say a few words quickly about yourself?

[Stephen] Hello everyone. Some of you may remember me if you took 18.06 with me in the spring or in the fall. Those of you who took 18.06 with me will have already gotten one hour of these sixteen hours of lectures we're going to do this IAP, so some of it will be familiar. I'm also in the math department, and also in physics, and I come to matrix calculus mainly through a lot of PDE-constrained optimization. I've also worked on Julia, which will be used on the problem sets. I'm sorry I can't be there with you in person; I'll be doing it remotely for the first week and a half, and then I'll be back in person.

[Alan] Okay, so let me quickly show everybody the GitHub site that we're going to be using. (Where did it go... oh, there it is — no, that's another course. There we go.) So yes: here are the lectures, the problem sets, three units, and we assume people are familiar with linear algebra and not much more. There's no assumption that you've already used Julia before — of course if you've taken 18.06 with Professor Johnson or some other courses you may already be somewhat familiar — but the kind of Julia we're using here is basic calculator Julia, so you won't need to know much.

So let me dive right in to where matrix calculus fits. If you look at MIT's course catalog, which I've copied over here — and this is replicated in universities all over the planet — there's single-variable calculus, the first semester of calculus, 18.01, which all MIT undergraduates are required to take, where you learn how to take the derivative or the integral of a function of one variable. Then there's the second semester of the sequence, 18.02, where you learn vector calculus, or multivariate calculus: the basic definitions of a gradient or a Jacobian. That's the whole 18.02 thing, and again it's in every university around the world, and I bet everybody in this room has gone through these two classes one way or another, or learned it on the streets or something.

So it seems to me that there's actually a sequence here that's being cut off completely arbitrarily. The sequence should go scalar, vector, matrix, and on to higher-dimensional arrays: a scalar is zero-dimensional, a vector is one-dimensional,
a matrix is two-dimensional, and then of course there are higher-dimensional arrays. It just makes sense that you should be able to do calculus on any of these objects.

When you talk about programming languages, for example: every programming language has two-dimensional structures and more — arrays. If you're familiar with MATLAB, it does not have a one-dimensional array; people talk about n-by-1 column vectors and 1-by-n row vectors, but they're really just matrices. Other languages have one-dimensional arrays, and some even have zero-dimensional arrays. These are the kinds of things you find in programming languages: in Julia, for instance, the size of a matrix has two components, the size of a vector has one component, and so on.

It seems to me — I've been teaching at MIT now for, I don't even want to count the years, we're pushing thirty years for me — that when I started here, linear algebra was not what it is today. I don't know if you folks know this, but linear algebra used to be this thing to avoid: a required course, maybe, that a few people needed. But because of machine learning and statistics and lots of other reasons, linear algebra has gradually taken over a much bigger part of today's toolkit across lots and lots of areas, compared to when I started thirty years ago. Machine learning, statistics, engineering — everybody needs linear algebra. So it stands to reason that you'd want to be able to do calculus on matrices and higher-dimensional objects, especially because everybody these days is doing machine learning — I don't have to tell you; everybody at MIT is already doing it. In machine learning you need to take gradients, and you need to take gradients of complicated objects: if you want to do gradient descent, you need to be able to do this sort of thing.

But I've always been interested in matrix calculus. I always liked calculus, and I liked linear algebra, and I thought it would be fun to marry calculus and linear algebra. I used to go to the library, and then when Google came along you'd google "matrix calculus," and up until fairly recently I was rather disappointed with what I found. I found maybe three books with the title Matrix Calculus, and the moment I opened them I realized: no, this is not what I wanted at all. I wanted to be able to do the kind of calculus we did in college — 18.02 calculus — but on higher-dimensional objects.

So here's a quick question for all of you to think about, if you don't already know the answer. Suppose you have the function that squares a matrix. What should the derivative be? If you've never done this before, you might not even know what it should look like. Should it be 2X, like the scalar case? I'll tell you right now that that's not correct when X is not a scalar. And then of course you could ask: what is the derivative of the matrix inverse, or of the matrix inverse squared, and so forth? Is the derivative of X⁻¹ simply −X⁻², like the scalar rule would suggest? The answer is no.
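Just as a preview of the kind of numerical check we'll do throughout the course, here is a minimal Julia sketch of that last question. The particular matrix is an arbitrary choice, and the formula −X⁻¹ dX X⁻¹ used for comparison is the standard result, stated here without derivation:

```julia
using LinearAlgebra          # for inv

X  = [4.0 1.0; 2.0 3.0]      # some invertible matrix (arbitrary choice)
dX = 0.0001 * rand(2, 2)     # a small perturbation

dY_actual = inv(X + dX) - inv(X)      # actual change in the inverse
naive     = -inv(X)^2 * dX            # the scalar-style guess  -X⁻² dX
correct   = -inv(X) * dX * inv(X)     # the standard linearization

# `correct` agrees with `dY_actual` to leading order; `naive` does not.
```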
So the first point I want to make is that matrix calculus is more complicated than scalar and vector calculus. It's not a lot more complicated, but it's unfamiliar to everybody who hasn't studied it. You can't just say, "Oh, I know vector calculus, I know scalar calculus, it's probably just some simple generalization." My first message to you is: no, it's not. You can master it — you'll learn it during this IAP course — but it's not just a simple, obvious generalization.

As far as applications go, this is a bit of a collage that Stephen put together a year or two ago. You can look at it yourself; you're all here for some reason, so you probably already know the buzzwords: machine learning, parameter optimization, stochastic gradient descent, automatic differentiation, backpropagation. These are all reasons for doing matrix calculus, and in this collage you can see where things are happening in various parts of the web.

The part that interests some of us is the applications to physical problems, as well as machine learning and statistics. For example, in the upper left here you have the so-called topology-optimized aircraft wing. In the old days, when people engineered an airplane wing, they would pick a design and somehow figure out the aerodynamics, and if they wanted to try again they would take another design and figure out the aerodynamics again. Now, with faster computers and machine-learning techniques, you can have the computer decide: you optimize for whatever you want — minimize fuel, or minimize the metal you use, whatever it is — and let the computer figure out the shape. That becomes a big optimization problem, and it gets called topology optimization. It happens for an airplane wing, it happens for fluid dynamics — it doesn't really matter what you do. The point is that in the old days it was hard enough just to simulate the physics around something like an airplane wing; now we can put that physics inside an inner loop and wrap an optimization problem around it, and that's happening everywhere. Data science and multivariate statistics is of course another big area — the computer scientists are doing this sort of thing all over the place nowadays. So you can take a look at this video, or here's a book.

Let me say a few words about the role of autodiff, or automatic differentiation. Automatic differentiation is a very exciting technology, and when I first learned about it I was rather surprised by what it is and what it isn't. How many of you are familiar with autodiff already? A little bit — okay, maybe five or six people raised their hands, a minority of the course.

So, everybody at MIT is good at differentiating, right? You didn't get into MIT if you can't differentiate functions. You're all good at it, I have no doubt —
certainly if you came to this class. Everybody at MIT can differentiate everything: x squared, sine, cosine, arctan — maybe you have to remind yourself, but you can differentiate everything. Nowadays, though, automatic differentiation has become almost more of a compiler technology than a math course; it has become this new thing. It's not numerical differentiation, like you might see in a numerical analysis course or at the end of a calculus course, where you take a Δy over a Δx — a small difference over a small change. It's not that. And it's not like Wolfram Alpha or Mathematica, where you let the computer differentiate symbolically — it's not that either. I think that's what makes it interesting: it's neither of those two things. It's something different, and we'll show you a little bit about how it's done.

To me it's almost like what has happened with long division. How many of you actually learned to do long division? How many of you think you could still do it today if forced to? (To tell Stephen: the whole class raised their hands for the first question, and a minority for the second.) One of the good things is that maybe you come to understand division by learning long division — I'm not even sure. But calculus is, I think, becoming the same way. We teach it the old-fashioned way, and I guess that's good, because you learn it and you understand it. But — and this may be heresy here in Building 2, the home of mathematics at MIT — I would say that taking derivatives by hand may or may not be as important anymore. Certainly complicated derivatives are beyond human ability anyway, no matter how good you are at it; the things we want to differentiate these days are just too complicated.

So, what did I put over here? Yes: today's calculus courses are mostly symbolic — I think it's fair to say 18.01 is probably ninety percent a symbolic course — and today's differentiation is neither of those two things. The math is fun, and we'll learn about it in this class.

[Stephen] Wait — let me add something. Even if you're using the computer to take derivatives — you write a program and let it take the derivative for you — to use that effectively you really have to have some idea of what's going on under the hood. There are cases where it doesn't work well, and you need to know about those. You need to know something about the technology and what its capabilities are — when to use forward-mode or reverse-mode differentiation, what a vector-Jacobian product is and why it matters — in order to be an effective user of these kinds of things.

[Alan] Yes — Stephen's getting fancy already, a little faster than I would have, but he's quite right. I would tell people that it's not such a bad idea to know how an engine works if you want to drive a car — though I guess you don't really have to, and these days many cars don't even have engines. What Stephen is saying is that automatic differentiation is probably not quite as easy as driving a car: actually understanding the technology underneath will really help you use it, and sometimes it's necessary.
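To make that "neither numerical nor symbolic" point concrete, here is a minimal sketch using the ForwardDiff.jl package (our illustrative choice here; the course materials may use other tools). Forward-mode AD returns the value of the exact derivative at a point, without forming a symbolic expression and without a finite-difference step size:

```julia
using ForwardDiff    # assumes the ForwardDiff.jl package is installed

f(x) = x^3 + sin(x)

# Exact derivative value at x = 2.0, computed by propagating dual numbers
# through the program: 3*2.0^2 + cos(2.0) ≈ 11.583853...
ForwardDiff.derivative(f, 2.0)

# A finite-difference estimate, by contrast, carries truncation and roundoff error:
(f(2.0 + 1e-6) - f(2.0)) / 1e-6
```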
Okay, so let me start with something you all know, just to establish some notation — and then I think Stephen, in the second part of today's lecture, will reiterate and say a little more about the notation we'll be using. I want to emphasize the concept of linearization; I just want to get that word out, and you'll see more of it soon. The idea is that taking a derivative means taking a nonlinear function and pretending, at least locally, that it's a linear function. I want to emphasize that point of view, and of course you all know it. Looking at this expression: if you're sitting at a point (x₀, y₀), then the linear function tangent to the curve can be written

  y − y₀ = f′(x₀)(x − x₀).

If I put the (x − x₀) in the denominator, it's just saying the change in y over the change in x is the derivative. In some sense this is why we take derivatives. It's all very trivial in one dimension, but it's good to keep track of this point of view, so that when we go into higher dimensions you remember that calculus is really about pretending some complicated curved surface is locally linear. For one variable that's just a tangent line; for several variables you have planes, hyperplanes, and so forth — but they're flat. That's what a linearization is.

Here are some notations. We'll use Δ to indicate finite perturbations, which is why there's an approximation symbol here: Δy ≈ f′(x)Δx. Or you can take the infinitesimal limit. Many of you are familiar with dy/dx = f′(x), but how many of you have actually seen it written the other way — or were you told you're not allowed to do that? Everybody knows the dy/dx notation, and maybe they told you it's really not a division, or maybe they told you it is kind of a division — I don't know what your calculus teacher told you. But have you seen it written like this, where you're allowed to write dy = f′(x) dx? Almost everybody — good. We'll talk a little about what that really means.

Here's the linearization again, using f instead of y, and this one is my favorite: if y = f(x) — so in fact there's no y at all, you just have a function of x — then df = f′(x) dx. So f′ is, in the end, a linear function. In one variable it's just a constant, but it's a constant times the change in x that gives the change in f. You all know this; it's one-dimensional calculus, but it sets the stage.

[Stephen] Can I say something? The reason we don't want to put dx on the other side is that we don't want to divide by dx. One way of thinking about it: right now x is a scalar, and we can divide by numbers, but pretty soon x could be a vector or some other kind of object, and you can't divide by a vector. It's easier to generalize this notation to objects where you can multiply — where you can operate — but you can't divide.

[Alan] Yep, good. Try dividing a vector by a vector — it doesn't really mean much.
All right, so again, more trivialities just to set the stage — baby steps. Here I'm looking at the square function at a particular point, the point (3, 9). You all know the derivative is 2x, so at x = 3 the derivative is 6, and the square of 3, of course, is 9. If you do this on a computer — you don't even need a computer, really; it's easy to check with paper and pencil — and you get closer and closer to 3, as I'm doing in these four lines here, you of course get closer and closer to 9. What we're interested in is the difference. You see that if I add a little Δx to 3, I get 9 plus Δy, which is 9 + 6Δx plus some higher-order term that we don't care about. So we see Δy = f′(x₀)Δx; here, Δy ≈ 6Δx.

The thing I want you to walk away with: look at these numbers — think of this one here as the Δx, and think of this red part as the 6Δx. Again, the idea — Stephen also just said it — is that this is a linear function: it's the function that multiplies by 6. If I want to know the change in y, I take the change in x and multiply by 6. Eventually this won't just be "multiply by 6": you'll have a vector change and you'll multiply by a matrix. But going slowly: we make a little change and we multiply by 6. That's what's going on here.

I like to think of dx and dy as really small numbers on a computer. The mathematicians call them infinitesimals, and there's a lot of rigorous math that makes infinitesimals work — a great human achievement, I think, but of little value to practical computation. You can think of it any way you like, but when I'm actually working and not thinking theoretically, I like to think of dx and dy as the limit of very small numbers. When I play on the computer, I type something like 0.0001 and say that's small enough — it depends on context, of course, but that's usually what I do.
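Here is that little experiment written out as a few lines of Julia — a sketch of the computation just described, with the step sizes being arbitrary choices:

```julia
f(x) = x^2
x0   = 3.0

# Shrink Δx and compare the actual change Δy with the linearization f'(3)*Δx = 6Δx:
for dx in (0.1, 0.01, 0.001, 0.0001)
    dy = f(x0 + dx) - f(x0)
    println("Δx = ", dx, "   Δy = ", dy, "   6Δx = ", 6 * dx)
end
```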
Okay, so now we get to go where the big boys and big girls go: we're leaving the world of scalar calculus and entering this new world of matrix calculus. To get started, let me mention a little notation — just a bit of vector and matrix notation. It's handy to have an element-wise product, and I like to use the Julia notation for the element-wise product of vectors: this "dot times," which I like to call pointwise times. So [2, 3] .* [10, 11] is [20, 33]. If you use other languages — I think Python does it too, and certainly Julia — they pronounce it "broadcasting." I've always hated that word, because I don't feel it really indicates what's going on; to me "pointwise multiply" says it better — a dot is like a decimal point. But if you're used to the term broadcasting, that's fine as well. So you see we're doing element-wise multiplication, and in this little demo you'll also see some people use a dot with a circle around it (⊙) to indicate pointwise multiply.

Let me also remind you, in case linear algebra was too long ago: the trace of a big square matrix is what you get by looking at the diagonal and adding up all the numbers.

And if you go to matrixcalculus.org — oh yes, there's a story here; let me step back and tell it. One of the reasons we started teaching this class during IAP is that there was a question on the math Piazza — all the undergraduates have access to a Piazza page, not for an individual class but for the majors — and this question arose. Basically: if Y is an n-by-m matrix, X is an n-by-k matrix, and Θ is also a matrix, k-by-m, then this scalar — the trace of (Y − XΘ)ᵀ(Y − XΘ) — makes sense, and one could try to understand what it means to take its derivative with respect to Θ. The student on Piazza asked how you do it, and how you learn how to do things like this — and that really is the origin of this IAP class. Here's the answer: the answer is itself a matrix, −2Xᵀ(Y − XΘ), and you can actually get it through matrixcalculus.org. Those of you who want to can look at it — it's a nice web page, you can type in some matrices; it has its limitations, but you can see what's going on.

(Now, how do I get back to my slides... they're underneath here somewhere... never mind, I'll find another way back.) You might play with that site a little and see what it can do. It has limitations — for example, it doesn't do a good job when the answer is higher than two-dimensional — but take a look.

You might see some matrix calculus in some classes. For example — I think this is on the next slide, but I'll put it on the blackboard — you might be asked to take the gradient of xᵀx, or of xᵀAx, with respect to x. I've seen many classes at MIT do that, and what I'd say is that they do it the old-fashioned way: variable by variable by variable, as opposed to the holistic way.
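Coming back to the Piazza example for a moment: the claimed gradient −2Xᵀ(Y − XΘ) is easy to spot-check numerically, in the same finite-difference style as before. The sizes and entries below are arbitrary; this is a sanity check, not a derivation:

```julia
using LinearAlgebra                      # for tr

n, k, m = 4, 3, 2
Y, X, Θ = rand(n, m), rand(n, k), rand(k, m)

loss(Θ) = tr((Y - X*Θ)' * (Y - X*Θ))     # matrix in, scalar out
G = -2 * X' * (Y - X*Θ)                  # the claimed gradient, a k×m matrix

dΘ = 0.0001 * rand(k, m)
# Actual change vs. the linearization ⟨G, dΘ⟩ (the elementwise "dot product"):
loss(Θ + dΘ) - loss(Θ), sum(G .* dΘ)
```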
So one of the things I'd like to invite all of you to think about as we go through this course is to think of a matrix holistically, or a vector holistically. Stop thinking of a vector as a bunch of elements, or of a matrix as an m-by-n table, as you would in an early linear algebra class. We're more than our hands and our feet and our noses and our mouths — we're people — and I want you to think of matrices that way, as holistic objects. There are ways of doing matrix calculus without going down to the element level.

To make that point a little more: I still see this today, though it used to happen more. As a professor I'd walk into a classroom to give a lecture — in the old days professors used chalk and blackboards, and I guess they still do sometimes — and as I erased what was left from the previous class I would stand back, look at the entire blackboard, and realize it was just a couple of matrix multiplies written out element-wise. It filled the whole blackboard, because for whatever reason the professor didn't use matrix notation, just scalar notation with indices all over the place. One of the things you'll see is how to grow up from that point of view. Sometimes it's useful to work with indices, and sometimes it's comforting — you feel like you're getting the right answer — but in some sense it's more elegant when you don't have to, and we'll show you how to do that as well.

Okay, so let's take a look at the various types of cases of what it even means to take a derivative. 18.01 is very much the upper-left corner: a scalar goes in and a scalar comes out. That's everybody's first semester of calculus. If you go along the first row: in physics, even very simple physics, you learn that you could have your position in three dimensions as a function of time. The input is time, a scalar, and the output is a position in space — for those of you who can't see what I'm doing online, I'm moving my hand to indicate a trajectory in space — and the derivative is the velocity vector. That's the derivative of a vector with respect to a scalar: it's tangent to the curve, and its magnitude tells you how fast you're going.

Now, the next one is not the sort of thing you'd see much in previous classes, but it's perfectly reasonable to also talk about a trajectory in matrix space. You could have an m-by-n matrix that's a function of time — every element a function of time — and you can take the derivative of that thing, which will be a fixed matrix. If you can imagine m-by-n matrix space (mathematicians love to imagine high-dimensional spaces of all kinds), then the derivative is a tangent in that big high-dimensional space. So that's the first row.
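Here is a tiny sketch of that scalar-in, matrix-out case: a matrix-valued function of time whose derivative is itself a matrix. The particular A(t) — a rotation matrix — is just an illustrative choice:

```julia
A(t) = [cos(t) sin(t); -sin(t) cos(t)]   # a trajectory in 2×2 matrix space

t, h = 0.7, 1e-6
(A(t + h) - A(t)) / h     # ≈ dA/dt, a fixed 2×2 matrix
# Analytically dA/dt = [-sin(t) cos(t); -cos(t) -sin(t)], evaluated at t = 0.7.
```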
Maybe it's a good time now to go down the first column. This blue — I guess it looks more purple on the screen — is what you do in multivariate calculus, 18.02, and in machine learning everywhere: we take gradients. It's the very typical situation where you have a function with many variables going in — I'll just talk about vectors, though in fact you could have lots more structure than a vector — and a scalar coming out. In machine learning the scalar is often called the loss function. How many of you have heard the term "loss function"? Everybody — I think these are things students learn in kindergarten these days. So you're all familiar with it: many variables coming in (the vector input), the loss function (the scalar) coming out, and you want to take the gradient of that thing.

If you literally have a vector going in, then — and this is true for every scalar-valued function — the gradient always has the same shape as the input. So if your input is a vector, the gradient is a vector. It's usually denoted with the nabla symbol (in LaTeX, \nabla), pronounced "grad f" or "gradient of f," and in Julia it would just be a vector. We don't really have to talk about column vectors and row vectors — it gets confusing — but people are used to that language, so: you can call the gradient a column vector, but the derivative f′ is the transpose, a row vector. Stephen and I actually went back and forth on this a couple of years ago, and we're absolutely convinced this is the best way to do it: the gradient is a (column) vector, and f′ is a row vector.

In particular, here's the answer for xᵀx. With f(x) = xᵀx and x a vector, the gradient is ∇f = 2x, but df = 2xᵀ dx — that is, f′(x) = 2xᵀ. I hope you can start to appreciate why this is a good idea: we want f′ to be a linear operator that takes a little vector change and returns a scalar. It makes no sense to write 2x dx, where dx is a little vector change and 2x is a vector — there's no such thing as multiplying a vector by a vector like that. But if we write 2xᵀ, then when we multiply by dx, as we're doing here, out comes a scalar: I make a tiny change to x, and out comes the scalar change to f. This actually makes perfect sense — I don't think you would learn it that way in 18.02, but it leads to consistent answers, and we're going to encourage you to take that point of view.
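A two-line check of that last point — the row vector 2xᵀ acting on a small column-vector change produces the scalar change in f. The vector and the perturbation below are arbitrary random choices:

```julia
f(x) = x' * x                # f(x) = xᵀx, a scalar

x  = rand(5)                 # a point in R⁵
dx = 0.0001 * rand(5)        # a small change to x

# Actual scalar change vs. the row-vector-times-column-vector linearization 2xᵀdx:
f(x + dx) - f(x), 2 * x' * dx
```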
All right, so I've covered four of these boxes. Let me point out the color coding you might have noticed. The green is when the derivative is zero-dimensional, and it runs along the diagonal. The blue is when the answer is one-dimensional — the velocity vector and the gradient, those two boxes. The next diagonal — it's a funny diagonal, running from northeast to southwest — is when the derivative itself is a matrix. One example was at the top right, where you just have a trajectory in matrix space. Another, at the bottom left, is when your parameters come into a machine-learning algorithm as an array and you have a loss function: as I said, the gradient of a scalar function always matches the shape of the input, so the gradient of a matrix is a matrix.

And then in the middle of this three-by-three array is something I thought you would all have seen in 18.02 — whether you still remember it is another matter.

[Stephen] No, 18.02 doesn't always cover it, unfortunately. It doesn't cover Jacobians, and certainly not as linearizations. They sometimes do Jacobian matrices as determinants for integrals, but apparently they don't even always cover that.

[Alan] You mean students in 18.02 will not see a Jacobian matrix in some form or another? What has MIT come to? How many of you have actually taken 18.02 at MIT — and among those, how many of you think you at least once learned about a Jacobian matrix? You don't have to remember it. Okay, well, we've got a good set of students, because apparently they did learn Jacobian matrices. I would have thought that went without saying.

[Stephen] Very often it's only for multi-dimensional integration, so it's only determinants of Jacobians that they ever see — the Jacobian and its determinant get kind of mixed together.

[Alan] That sounds unfortunate, but thanks for the update on 18.02. I know a few years back I looked at the 18.02 notes and it was there, but I don't know what changes have been made.

In any event, just to set the stage: this is the case where you have a vector input and a vector output. You have a function of, say, three dimensions, and the output is also a vector, in three or fewer or more dimensions. If you have n dimensions going in and m dimensions going out, then the derivative, if you express it as a matrix rather than as a general linear operator, is most naturally an m-by-n matrix. And then of course we get into the interesting situations as you move down this table — the situations where that web page doesn't do a very good job — and we're going to show you how to think about what a good notation even is once you move into higher-order arrays, where the derivative is no longer best expressed with two-dimensional matrices. That's down at the bottom right.

Okay, so here are some answers to set the stage. We're going to show you how to derive these, but I think it's nice to foreshadow what you're going to see. I'm going to use this operator notation, starting with one you're familiar with: the derivative of x³ is 3x² dx. Everybody knows that, but just to reinforce what I said before: think of it as, if I make a small change to x, the change to my function will be 3x² times that small change. If I'm at x = 7 and I make a 0.01 change, then the result is going to be 3 · 7² · 0.01, approximately — and of course the approximation gets better as 0.01 gets closer to zero.
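That x = 7 example, as one line of arithmetic in Julia (just the numbers quoted above):

```julia
f(x) = x^3
x0, dx = 7.0, 0.01

# Actual change vs. the linearization 3x₀² dx = 3·49·0.01 = 1.47:
f(x0 + dx) - f(x0), 3 * x0^2 * dx      # ≈ (1.4721..., 1.47)
```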
Here's the next one — let me use the mouse again to emphasize. This one has a vector input and a scalar output, so it's that box again, and it's the same thing I wrote over here; I'm just using a prime instead of a transpose. Again, if x is a vector and dx is a small change to that vector, then you take the dot product — and I hope everybody realizes that xᵀy is the same thing as a dot product: multiply the elements pairwise and add everything up. So this is a dot product between 2x (where x is the point I'm at) and the little change dx, and that dot product is the change to xᵀx.

And here's your first matrix answer. I told you that if you square a matrix, the derivative is not just 2X. So what is it? Because matrices don't commute, the X and the dX don't commute, and you can think about it this way: should we write X dX times 2, or dX times X times 2? There's no reason to prefer one over the other, and in fact the correct answer is to add the products in both orders: if you make a small change dX to a matrix X, the change to the square is X dX + dX X. That's it. And you see now, maybe for the first time, why we need to think of the derivative as a linear operator. We could write this out as a big matrix, but it's not a good idea — it might be comfortable, but we don't recommend it. You could think of X as n² variables if X is an n-by-n matrix, and then think of the function that squares a matrix — if you want, you could flatten everything — as going from n² variables to n² variables; the derivative would then have n⁴ entries, and you could write it as an n²-by-n² matrix. But we're not recommending that. We're saying: just think of it as the linear operator dX ↦ X dX + dX X. That's the answer for the derivative of X². Obviously it reduces to 2x dx when X is a scalar, but it's much more than that.

Sometimes I open up Julia to do this, but I think these slides say it well, so just to hit home on both the matrix square and xᵀx, let me give you a numerical example to make sure this is completely clear. Consider the function xᵀx — I hope you all realize that's just the sum of the squares of the entries of x. If I'm at the point (3, 4) — a point in two-dimensional space — then my function value is: 3 squared is 9, 4 squared is 16, and 9 plus 16 is 25.
Now let's make a little change: say dx is 0.001 in the first direction and 0.002 in the second direction. You can do the arithmetic, and the answer is about 25.022. Here we're explicitly making those changes, but the whole point of calculus is to have a formula for that. The reason we don't just redo the calculation on this line every time is that the magic of calculus is that there's an exact way to do it, and this is that exact way: if you compute 2x₀ᵀ dx, you get that right answer. That's what we love about calculus — the Δf is going to be 2x₀ᵀ dx. Any questions about that so far?

Okay. Do you want to see me do this with a matrix? I don't think I put it in the slides. People are saying yes, so let's open up Julia quickly — this is a bit impromptu, but why not. I'll just do it in the REPL rather than VS Code. The plan is to demonstrate that d(X²) equals X dX + dX X. Any particular size you want? Three by three is good enough. Let me make this a little bigger — I don't even think I need any packages. All right, shout out your nine favorite integers — small ones, don't make them too big. Five, seven, negative three, one, two, three — thank you — and three more: six, seven, eight. Just to show you there's nothing up my sleeve.

Let's set Y = X², the matrix square — you can check that yourselves — and let's take dX to be 0.001 times rand(3, 3). Is that too ugly? That's fine, let's do it: a bunch of ugly small numbers. So everybody understands: I'm sitting at this matrix X and I'm moving over to X + dX, and I want to know the difference. So dY, computed numerically, is (X + dX)² − X². There's the dY. Now let's compare that with the magic matrix-calculus formula, X·dX + dX·X. This is where I always have that fear that I made a typo, but let's hit enter and see what happens — and look at that. So this is the numerical way of calculating a derivative — where I am over there minus where I started — versus this magic formula. And just to show you that the other ways you might think of doing it won't work: 2·X·dX, that's not right; 2·dX·X, that's not right either. The one that is right — let's put it back up — is clearly X dX + dX X. I haven't shown you how to derive it yet; that's coming soon.
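For reference, here is that impromptu REPL experiment written out as a script. The nine integers are the ones the audience shouted out; any matrix and any small perturbation will tell the same story:

```julia
X  = [5 7 -3; 1 2 3; 6 7 8]      # the audience's nine integers
dX = 0.001 * rand(3, 3)          # a small random perturbation

dY = (X + dX)^2 - X^2            # numerically computed change in the matrix square

formula = X*dX + dX*X            # matches dY to leading order
wrong1  = 2 * X * dX             # does not match
wrong2  = 2 * dX * X             # does not match
```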
I don't know if your minds work this way, but for me, seeing this is pretty convincing — I feel like I know what's going on; I really like looking at it this way. After all, if I were doing calculus on a computer for scalars, I would just take (x + dx)² for a scalar x, and I would get 2x. In fact, just to remind you that this really is the same thing, we can take x to be 3 and dx to be 0.001, compute the difference, and compare it with what everybody knows from ordinary calculus, 2x·dx — and to the right number of digits they match. Any questions about this? All right, so hopefully you now believe — at least experimentally, if not theoretically — that matrix calculus works.

Getting back to the slides: it's helpful to know that certain rules you learned in freshman calculus extend to matrices. You might remember the product rule. People learn it with different notations — many learn it as d(uv) = u dv + v du — but the thing I want you to know is that the product rule just works for matrices and vectors as well as for scalars. Any time you have a compatible product — a vector dot product, a matrix times a vector, anything where the sizes work out — it's perfectly okay to use the matrix rule d(AB) = dA·B + A·dB. The only thing you have to remember is that matrices don't commute: if A is on the left and B is on the right, then in every term the A factor must stay on the left and the B factor on the right. So, for example, the way I wrote the scalar rule a moment ago was giving you slightly the wrong idea — it would have been better to write d(uv) = du·v + u·dv, because that form works for matrices: I must always keep the u factor on the left and the v factor on the right. That's the only rule that makes matrices just a little bit different from scalars.

Now let's apply that to xᵀx. The product rule says d(xᵀx) = dxᵀ·x + xᵀ·dx. And here's where students get confused: I'm going to go ahead and combine them, apparently violating the rule I just stated, and say this equals 2xᵀ dx. Wait a minute — how can I combine these when I just told you that you can't reorder things? I just said the A factor must stay on the left and the B factor on the right, and here I am telling you that dxᵀx and xᵀdx are actually the same thing. Why am I allowed to do that in this one case and combine the two? Can anybody tell me? Yes, in the back — "the dot product is just a scalar, it doesn't matter what order you do it in." Exactly right.
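Incidentally, the product rule itself can be spot-checked numerically in the same style as the matrix-square demo, before we confirm the dot-product point in the REPL. Random sizes and entries; note the dA factor stays on the left and the dB factor on the right:

```julia
A,  B  = rand(3, 4), rand(4, 2)
dA, dB = 0.0001 * rand(3, 4), 0.0001 * rand(4, 2)

# Actual change in the product vs. the product-rule linearization dA·B + A·dB:
(A + dA) * (B + dB) - A * B, dA * B + A * dB
```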
Let me just show you that quickly in Julia. (Careful — there's a tiny hole in the floor where the wire comes out, and if the chair leg falls into it, I go down with it.) Let's say x is [1, 2, 3] — actually, let's write it with commas — and dx is 0.001 * rand(3). Can I call `dot` directly? Maybe not — not defined. Anyway, I don't even need it: compute x' * dx and dx' * x, and you see they're exactly the same thing. My point is that the dot product of x and dx is of course the same as the dot product of dx and x. So: A times B is not B times A, but xᵀy does equal yᵀx when x and y are vectors. Do you see the difference between the general case, where things don't commute, and the dot product, where things do commute? Students seem to get confused by that sometimes, so I wanted to say it as clearly as I know how.

So, one more time: using the product rule on xᵀx, you write d(xᵀx) = dxᵀ x + xᵀ dx, you recognize that those two dot products are the same, and you combine them into 2xᵀ dx.

Now, of course, taking the gradient of xᵀx is not hard to do the old-fashioned way — you can decide for yourselves — but on the blackboard let me quickly remind you what the old-fashioned way looks like. You write f(x) = xᵀx as the sum of the xᵢ², which is the baby way to do it and the sort of thing you'll see in a lot of classes. You learn that the gradient of f is the vector with entries ∂f/∂x₁ down to ∂f/∂xₙ, you compute ∂f/∂xᵢ = 2xᵢ, you put them all in a vector, and you say: aha, ∇f = 2x. There's nothing wrong with that — you can get good at it — but I hope you can see the advantage of doing it holistically. I don't have to use indices, I don't have to think about what the thing is made of; I just do this little trick and boom, I get the answer right away. It's like a magic trick for vectors and matrices. If you take courses where people are computing gradients, the other students are going to do it the element-by-element way, and you're going to know the secret trick for getting the gradient this way — you'll have superpowers they don't have. Of course, both ways work.

So here we have d(uᵀv), which is really the same point I was already making. Any questions?

Okay, one more little warm-up. It's sometimes good to count how many parameters are needed to express an answer — "parameters" is a vague term, but think of it as how many numbers you need, how many entries, if you're going to put something in a matrix. So let me ask you quickly: if I have a function that takes n inputs to m outputs, and I do want to express the derivative as a matrix, how many numbers do I need? m times n — right, everybody knows.
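To put a concrete face on that count, here is a small sketch using ForwardDiff.jl again (an assumed tool choice, as in the earlier sketch): a map with n = 3 inputs and m = 2 outputs has a 2-by-3 Jacobian, i.e. m·n = 6 numbers.

```julia
using ForwardDiff                         # assumes the ForwardDiff.jl package

g(x) = [x[1] * x[2], x[2] + sin(x[3])]    # a map from R³ (n = 3) to R² (m = 2)

J = ForwardDiff.jacobian(g, [1.0, 2.0, 3.0])
size(J)                                   # (2, 3): an m-by-n matrix, m·n = 6 numbers
```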
And I think this is my last slide, so I'll just say a few words about second derivatives quickly, and then maybe we take a five-minute break and Stephen will speak. I don't know how much we're going to talk about second derivatives in this class, but there's one example that comes up so often that it's probably worth mentioning. When I first learned about differentiating vectors, I always got confused between the Hessian, which is a matrix — a second-derivative matrix — and the Jacobian, which is also a matrix, but a first-derivative matrix. So let me be clear. If you have a function from Rⁿ to Rᵐ, you can express the derivative as an m-by-n matrix, and that's what we call the Jacobian: it's for functions from vectors to vectors. The other case that comes up very often is a function from vectors to scalars — something like xᵀAx, a vector transpose times a matrix times the vector, as a function of the vector x. For a vector-to-scalar function, remember, the gradient is a vector — it's one-dimensional — but the second derivative is often expressed as a matrix, and that's what we call the Hessian.

You might ask yourself: if we're trying to get away from element-wise matrix representations, what's the abstraction we'll need in order to talk about second derivatives? The answer, in advanced linear algebra classes, is called a quadratic form, and we'll see what that is — so we can get past the clunkiness of matrices, 18.06-style, and into the abstraction, which is quadratic forms.

Okay, I think that's the end of what I'm going to say today. I'll leave a chance for questions; otherwise, should we take a five-minute break, Stephen, or a little less? — Five minutes is good. — All right, I've got 12:03 on the clock back here, so let's take a five-minute break until 12:08. You can get some water, ask me some questions, read your email, whatever you do in five minutes, and then, Stephen, you can grab the screen and set yourself up.