Hi there, I'm David Dye, and welcome to the Mathematics for Machine Learning specialization. Before we get stuck in, let's set the scene. Machine learning is a set of powerful mathematical tools that enable us to represent, interpret and control the complex world around us. However, even just the word mathematics makes some people feel uneasy and unwelcome to explore the topic. The purpose of this specialization is to take you on a tour through the basic maths underlying these methods, focusing in particular on building your intuition rather than worrying too much about the details.

Thanks to the amazing machine learning community, it's actually possible to apply many powerful machine learning methods without understanding very much about the underpinning mathematics, by using open-source libraries. This is great, but problems can arise, and without some sense of the language and meaning of the relevant maths, you can struggle to work out what's gone wrong or how to fix it. The ideal outcome of this specialization is that it will give you the confidence and motivation to immediately dive into one of the hundreds of brilliant applied machine learning courses already available online, and not be intimidated by the matrix notation or the calculus. We want to open up machine learning to as many people as possible, and not just leave all the fun to computer scientists.

This first course offers an introduction to linear algebra, which is essentially a set of notational conventions and handy operations that allow you to manipulate large systems of equations conveniently. Over the next five modules we'll be focusing on building your intuition about vectors and transformations through the use of quizzes and interactive widgets, as well as occasionally asking you to fill in the gaps in some Python coding examples. In the final module, Dr Sam Cooper will bring it all together by showing you how linear algebra is at the heart of Google's famous PageRank algorithm, which is used for deciding the order of web pages in search results. Hopefully, if you find this course useful, you'll stick around for our follow-on course, where Sam and I will introduce you to multivariate calculus. Then, in our other course, Dr Marc Deisenroth will introduce principal component analysis. So welcome. We really hope that the course will be productive and useful for you, but also quite a lot of fun, and I look forward to hearing from you in the forums.

In this video we're going to look a bit more at the types of problems we might want to solve, and expose what linear algebra is and how it might help us to solve them. The first problem I might think of is one of price discovery. Say I go shopping on two occasions and I buy apples and bananas. The first time I buy two apples and three bananas, and they cost eight euros; the second time I buy, say, ten apples and one banana, and the cost is 13 euros. The a's and the b's here are the price of a single apple and a single banana, and what I'm going to have to do is solve these so-called simultaneous equations in order to discover the price of individual apples and bananas. Now, in the general case of lots of different types of items and lots of shopping trips, finding out the prices might be quite hard; it might be quite difficult to solve all of these equations by hand. So we might want a computer algorithm to do it for us in the general case. Now, this is an example of a linear algebra problem: I have some constant linear coefficients, these numbers 2, 10, 3, 1, that relate the input variables a and b to the outputs 8 and 13.
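As a quick preview of where this course is heading (we won't meet matrices and how to solve systems like this until later on), here is a minimal sketch of handing that pair of equations to a numerical solver. The use of NumPy here is my assumption; it is not prescribed by the lecture.

```python
import numpy as np

# The two shopping trips as simultaneous equations:
#    2a + 3b = 8
#   10a + 1b = 13
coefficients = np.array([[2, 3],
                         [10, 1]])   # one row per shopping trip
costs = np.array([8, 13])

prices = np.linalg.solve(coefficients, costs)
print(prices)   # [a, b] is roughly [1.11, 1.93] euros
```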
That is, if I think about a vector (a, b) that describes the prices of apples and bananas, then this gets translated into a cost once I know how many of each I want to buy, and the cost happens to be eight euros on the first trip and 13 euros on the second trip. I can write this down as a matrix problem, where the 2 3 is my first trip and the 10 1 is my second trip. These are then matrices (that's a matrix there) and these are vectors, and what we're going to do over the course of modules one to three is build up to looking at these different types of mathematical objects, understanding what they are and how to work with them, these vectors and these matrices. Then we'll come back and figure out how to solve this problem in the general case.

Another type of problem we might be interested in is fitting an equation to some data. In fact, with neural networks and machine learning, we want the computer, in effect, not only to fit the equation but to figure out what equation to use. That's a highly inexact description of what's really going on, but it gives the right sort of flavor. Let's say we have some data like this histogram here. This looks like a population with an average and some variation, some width. So another type of problem we might want to solve, as well as the apples and bananas problem, is how to find the optimum value of the parameters in the equation describing this curve, the ones that fit the data in the histogram best. That might be really handy: then, using that equation, we'd have an easy, portable description of the population we could carry around without needing all the original data, which would free us, for example, from privacy concerns. We could then plot how good the fit was in terms of the parameters, and that's what we'll look at in the next video.

In this video we've set up two problems in this first module on linear algebra: first, the apples and bananas problem of solving simultaneous equations, and secondly the optimization problem of fitting some data with an equation with some fitting parameters. These problems will go on to motivate our work right through this course on linear algebra and its partner on multivariate calculus.

So the first thing we need to do in this course on linear algebra is to get a handle on vectors, which will turn out to be really useful for us in solving those linear algebra problems, that is, the ones that are linear in their coefficients, as most fitting parameters are. We're going to first step back and look in some detail at the sort of things we're trying to do with data, and why those vectors you first learned about in high school are even relevant, and this will hopefully make all the work with vectors a lot more intuitive.

Let's go back to that simpler problem from the last video: the histogram distribution of heights of people in the population. You know there aren't many people above about two meters, say, and there aren't really very many people below 1.5 meters. Say we wanted to try fitting that distribution with an equation describing the variation of heights in the population, and let's say that equation has just two parameters: one describing the centre of the distribution, which we'll call mu, and one describing how wide it is, which we'll call sigma. So we could fit it with some curve that had two parameters, mu and sigma, and I'd use an equation like this. I'd call it f of x, some function of x, where x is the height, equal to 1 over sigma root 2 pi, times the exponential of minus (x minus mu) squared divided by two sigma squared.
So this equation only has two parameters, sigma and mu, and it looks like this, and it has an area of one, because it accounts for 100% of the people in the population as a whole. Now, don't worry too much about the form of the equation; this one is called the normal or Gaussian distribution, it's got a centre of mu and a width of sigma, and it's normalized so that its area is one.

Now, how do we go about fitting this distribution, that is, finding the mu and sigma, the best possible mu and sigma, that fit the data as well as possible? Imagine that we had guessed that the width was wider than it really is, while keeping the area at one; so we guessed a fatter, and probably a bit shorter, distribution, something like that. This one has a wider sigma but the same mu. It'd be too high at the edges here and too low in the middle. So then we could add up the differences between all of our measurements and all of our estimates (all of these places where we underestimate here and all of these places where we overestimate here), or in fact the squares of those differences, to get a measure of the goodness or badness of the fit. We'll look in detail at how we do that once we've done all the vectors work and, actually, all the calculus work.

Then we could plot how that goodness varied as we change the fitting parameters sigma and mu, and we get a plot like this. So if we had our correct value, our best possible value, for mu here and our best possible value for the width sigma here, we could then plot, for a given value of mu and sigma, what the difference was. If we were at the right values, we'd get a value of goodness where the sum of the squares of the differences was naught. If mu was too far over, if we had misestimated mu and got the distribution shifted over (the width was right but we had some wrong value of mu), then we'd get the sum of the squares of the differences, our goodness, being some higher value here. It might be the same if we went over the other side, some value there. And if we were too wide we'd get something there, or too thin, something like that; so we'd get some other value of goodness. We could imagine plotting out all of the values where we had the same value of goodness or badness for different values of mu and sigma, and then doing that for some other value of badness, and we might get a contour that looked like this, and another contour that looked like this, and so on and so forth.

Now, say we don't want to compute the value of this goodness parameter for every possible mu and sigma; we just want to do it a few times and then find our way to the best possible fit of all. Say we started off here, with some guess that had too big a mu and too small a width; we thought people were taller than they really are, and more tightly packed in their heights than they really are. Well, what we could do is say: if I make a little move in mu and sigma, does it get better or worse? And if it gets better, we'll keep moving in that direction. So we could imagine making a vector of a change in mu and a change in sigma: we could have our original mu and sigma there, and a new value, mu prime and sigma prime, and ask if that gives us a better answer.
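To make the "goodness or badness" idea concrete, here is a minimal sketch, not taken from the course materials: a function for the Gaussian above, and a sum-of-squared-differences measure of how badly a guessed (mu, sigma) matches some invented histogram data. The data values, seed and function names are my own illustration.

```python
import numpy as np

def gaussian(x, mu, sigma):
    """f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))"""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def badness(mu, sigma, heights, observed):
    """Sum of the squares of the differences between the data and the fitted curve."""
    return np.sum((observed - gaussian(heights, mu, sigma)) ** 2)

# Invented histogram of heights: bin centres (metres) and observed densities,
# generated from a "true" distribution with mu = 1.72, sigma = 0.12 plus a little noise.
rng = np.random.default_rng(42)
heights = np.linspace(1.4, 2.1, 15)
observed = gaussian(heights, 1.72, 0.12) + rng.normal(0.0, 0.05, heights.size)

print(badness(1.72, 0.12, heights, observed))  # near the true parameters: small badness
print(badness(1.85, 0.30, heights, observed))  # a poor guess: should be noticeably larger
```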
If it's better there, or if mu prime, sigma prime took us over here, we could ask whether we were better or worse there, something like that. Now, actually, if we could find what the steepest way down the hill was, then we could go down this set of contours, this sort of landscape, towards the minimum point, the point where we get the best possible fit. And what we're working with here are vectors: little moves around space. They're not moves around a physical space, they're moves around a parameter space, but it's the same thing. So if we understand vectors, and we understand how to get down hills (that sort of curviness of this value of goodness, which is calculus), then once we've got calculus and vectors, we'll be able to solve this sort of problem.

So we can see that vectors don't have to be just geometric objects in the physical world of space; they can describe directions along any sorts of axes. We can think of vectors as just being lists. If we thought of the space of all possible cars, for example (so here's a car: there's its back, there's its window, there's the front), we could write down in a vector all the things about the car. We could write down its cost in euros; its emissions performance in grams of CO2 per 100 kilometers; its NOx performance, how much it pollutes our cities and kills people due to air pollution; its Euro NCAP star rating, how good it is in a crash; its top speed; and write those all down in a list. That would be a vector, and that's more of a computer science view of vectors, whereas the spatial view is more familiar from physics. In my field, metallurgy, I could think of any alloy as being described by a vector that describes all of the components, all the compositions, of that alloy. Einstein, when he conceived relativity, conceived of time as just being another dimension, so spacetime is a four-dimensional space: three dimensions of meters and one of time in seconds. He wrote those down as a vector of x, y, z and time, which he called spacetime.

When we put it like that, it's not so crazy to think of the space of all the fitting parameters of a function, and then of vectors as being things that take us around that space. What we're trying to do, then, is find the location in that space where the badness is minimized, the goodness is maximized, and the function fits the data best. If the badness surface here was like a contour map of a landscape, we're trying to find the bottom of the hill, the lowest possible point in the landscape. To do this well, we'll want to understand how to work with vectors and then how to do calculus on those vectors in order to find gradients in these contour maps, and minima, and all those sorts of things. Then we'll be able to go and do optimizations, enabling us to work with data and do machine learning and data science.

So in this video we've revisited the problem of fitting a function to some data, in this case the distribution of heights in the population. What we've seen is that the function we fit, whatever it is, has some parameters, and we can plot how the quality of the fit, the goodness of the fit, varies as we vary those parameters. Moves around this space of fitting parameters are then just vectors in that space of fitting parameters, and therefore we want to look at and revisit vector maths, in order to be able to build on that and then do calculus and then do machine learning.
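The "make a little move in (mu, sigma) and keep it if the fit improves" idea can be sketched in a few lines. This is my own toy illustration, not the course's method: the badness function below is a stand-in bowl with a made-up minimum, and the proper way to pick the downhill direction (gradients) is exactly what the multivariate calculus course covers.

```python
import numpy as np

def badness(params):
    """Stand-in fitting error: a simple bowl whose minimum sits at mu = 1.72, sigma = 0.12."""
    mu, sigma = params
    return (mu - 1.72) ** 2 + (sigma - 0.12) ** 2

params = np.array([1.85, 0.05])            # initial guess: mu too big, sigma too small
moves = [np.array([0.01, 0.0]), np.array([-0.01, 0.0]),
         np.array([0.0, 0.01]), np.array([0.0, -0.01])]   # little moves in parameter space

for _ in range(100):
    for move in moves:
        # keep any small move that reduces the badness
        if badness(params + move) < badness(params):
            params = params + move

print(params)   # ends up close to [1.72, 0.12]
```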
So we have these things called vectors, like this guy here. What we want to do first is get an idea of what makes a vector a vector. What we'll do in this video is explore the operations we can do with vectors: the sort of things we can do with them that define what they are, and the sort of spaces they can apply to.

A vector we can think of as an object that moves us about space, like this guy here. This could be a physical space or a space of data. At school you probably thought of a vector as something that moved you around a physical space, but in computer and data science we generalize that idea to think of a vector as maybe just a list of attributes of an object. So we might think of a house, say. Here's a house, and this might have a number of attributes: we could say it was 120 square meters in floor area; it might have two bedrooms; it might have one bathroom, that would be sort of sensible; and it might be worth 150,000 euros, say. And I could write that down as the vector: 120 square meters, two bedrooms, one bathroom, 150,000 euros. So while in physics we think of a vector as a thing that moves us about space, in data science we think of this vector as a thing that describes the object of a house; we've generalized the idea of moving about space to include describing the attributes of an object.

Now, a vector is just something that obeys two rules: firstly addition, and secondly multiplication by a scalar number. We'll do this first thinking of a vector as just a geometric object starting at the origin, so something like this: we've got a vector r there. Vector addition is then when we take another vector, let's call him s, and we put s on the end of r. So that's s, and if we put s on the end of r we get a sum: that's r, then going along s, and we'll call that guy r plus s. Now, we could do this the other way around: we could do s and then r, and that would be s plus r, going along s and then along r. And we see that they actually give us the same thing, the same answer; so r plus s is equal to s plus r, and it doesn't matter which way round we do the addition.

The other thing we want to be able to do is scalar multiplication, that is, scaling vectors by a number, a number a, say, making them twice as long, or half as long, something like that. So we'd say that 3r is doing r three times: that would be 3r there, where my a was three. Or we could do a half r, which would be something like that. The only tricky bit is what we mean by a minus number: by minus r we mean going back the other way by a whole r. So we take r and go back the other way the same distance; there, that would be minus r. So in this framework, minus means go back the other way.

At this point it's convenient to define a coordinate system. Let's define space by two vectors. The first one takes us left-right and is of unit length, length one; let's call that a vector i. We'll have another vector here that goes up-down, a vector j, also a unit vector of length one. Then we just use our vector addition rules: if we wanted a vector r here, something like this, that was 3 2, by that we mean we go three i's (one i, two i's, three i's) and then two j's. So we go three i's plus two j's, and that gives us a vector r here, just from our vector sum. And what we mean by the 3 2 is: do three i's added together, a scalar multiple of three i's, and then do a scalar multiple of two j's, as a vector sum.
That's what we mean by a coordinate system: defining r as being 3 2. So then, if I have another vector s, let's say s is equal to minus one i plus two j's, that is, it takes us back one i and up two j's, so that's s, then r plus s would be: we just put s on the end of r, and r plus s is therefore that total vector. And we can just add up the components: r is three i's and s takes us back one, so three plus minus one gives us two i's; and in the j's, r takes us up two and s takes us up another two, so that's a total of four j's. So we can just add up the components when we're doing vector addition.

We can also see that, because we're doing this component by component, vector addition must be what's called associative. Formally, what this means is that if we have three vectors, r, s and another one, t, it doesn't matter whether we add r plus s and then add t, or whether we add r to s plus t. It doesn't matter where we put the bracket: we can do this addition and then that one, or we can do this addition and then that one. So, as well as it not mattering which way round we add two vectors (s plus r is equal to r plus s), we can also see that it doesn't matter what order we do the additions in if we've got three, and that's called associativity. Vector addition, when we're adding it up by components like this, will be associative.

So I've just cleared away the s's and so on, so we can talk about another issue, which is: in a coordinate system, what do we mean by multiplication by a scalar? If we want to multiply by a scalar, let's say 2, then we define this to mean that 2r is equal to 2 times the components of r: 2 times 3 for the i's and 2 times 2 for the j's, and that gives us 6 4. So 2r would be doing r and then doing another r; that would be 2r, which is the vector 6 4: go along three i's, four, five, six i's, and up four j's. Now I need to think about another question, which is minus r. So r is this; minus r is then that, which will be minus 3, minus 2. So then we see, sort of obviously, that r plus minus r is equal to 3 plus minus 3 on the i's and 2 plus minus 2 on the j's, which is equal to 0 0.
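If you want to try these rules numerically, here is a small sketch of them, my own illustration, using NumPy arrays to stand in for the geometric vectors:

```python
import numpy as np

r = np.array([3, 2])    # r = 3 i + 2 j
s = np.array([-1, 2])   # s = -1 i + 2 j
t = np.array([4, -1])   # a third vector, just for the associativity check

print(r + s)                                      # component-wise addition: [2 4]
print(np.array_equal(r + s, s + r))               # commutative: True
print(np.array_equal((r + s) + t, r + (s + t)))   # associative: True
print(2 * r)                                      # scalar multiplication: [6 4]
print(-r)                                         # [-3 -2]
print(r + (-r))                                   # back at the origin: [0 0]
```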
So if we do r and then add minus r, we end up back at the origin, and therefore we've defined what we mean by vector subtraction: vector subtraction is just addition of minus one times whatever I'm putting after the minus sign. So if we think of another vector s (we had s as minus 1 2 before, minus one i plus two j's), then minus s is: go along one on the i's and minus 2 on the j's. So for r minus s we add up the components, treating it as an addition: r minus s is this vector here. If we add up the components, it's three plus one on the i's and two plus minus two on the j's, so that gives us the vector 4 0. If r is go along three and minus s is go along one, we've got a total of four; and if r is go up two and minus s is go down two, we've ended up going up-down zero in total. So we've not only done addition by components, we've now also defined vector subtraction, as addition of a negative-one multiple of the thing we're doing the minus by; and that's vector subtraction and addition by components.

So let's come back to the house example for a moment. We said we had a house that was 120 square meters, two bedrooms, one bathroom, and 150,000 euros. If I put the units in, that's square meters, that's its number of beds, that's its number of baths, and that's the thousands of euros it's worth. Two houses is then the vector addition of those things, which, the way we're defining vector addition and scaling, is 2 times 120, 2, 1, 150, which equals 240, 4, 2, 300. So we'd say that in this scheme, the way we're defining it, two houses would be 240 square meters (that makes sense), four bedrooms, two bathrooms, and worth 300,000 euros, if I bought two identical houses next to each other. And that would be a scalar multiple, or an addition of one house to another: one house plus one house. We could keep on doing that with three houses, or differently shaped houses, or negative houses, or whatever; the way we've defined vectors will still apply to these objects, houses.

So that's vectors. We've defined two fundamental operations that vectors satisfy: addition, like r plus s here, and multiplication by a scalar, like 2r here and minus s here. We've explored the properties those imply, like associativity of addition, and what subtraction of vectors really means (r plus minus s being r minus s). And we've noticed that it can be useful to define a coordinate system in which to do our additions and scalings, like r being 3 2 here, using these fundamental vectors i and j that define the space, which we call the basis vectors, the things that define the coordinate system. We've also seen that although it's perhaps easiest to think of vector operations geometrically, we don't have to do them in a real space; we can do them with vectors that are data science lists of different types of things, like the attributes of a house. So that's vectors, and all the fundamental operations.

So in this first module of our course on linear algebra, we first looked at the problem of data: our world has so much of it, and if we could figure out how to analyze and use it, we could really solve problems in the world. And we've looked at where these courses fit in terms of helping us access the world of machine learning and data science.
Then we've moved on to look at some example problems: the problem of solving some simultaneous equations, for example to discover the price of things in the apples and bananas problem, or the problem of fitting a model equation with some fitting parameters that we want to optimize against some data. We've then said that both of these problems are going to involve vectors, and possibly some calculus, so we've started off our journey with vectors, and with defining vector addition and scalar multiplication. In the next module we'll go further, to look at some more operations with vectors, and to define what we mean by a vector space and the coordinate system of a vector space, or its basis.

In this module, what we're looking at is the sort of things we can do with vectors. First we can look at their modulus or magnitude. Then we can look at a way to combine vectors together to get a number, called the dot product, which will take us on to the scalar and vector projections. That will then take us on to looking at the vectors we use to define the space, the basis vectors, and at linear independence and linear combinations, and that will wrap up module two. This is hopefully going to be a good workout; we'll get you all set with some nice problems and exercises to try this stuff out. This is the main module on vectors; after this, in the next modules, we'll move on to matrices. So we've got stuff to do. Let's get started with the next video, on the modulus and the dot product.

So we've looked at the two main vector operations of addition and scaling by a number, and those are all the things we really need to define what we mean by a vector, the mathematical properties that a vector has. Now we can move on to define two things: the length of a vector, also called its size, and the dot product of two vectors, also called the inner, scalar or projection product. The dot product is this huge and amazing concept in linear algebra with huge numbers of implications, and we'll only be able to touch on a few parts here, but enjoy: it's one of the most beautiful parts of linear algebra.

When we defined a vector initially, say this guy r here, we did it without reference to any coordinate system. In fact, the geometric object r just has two properties: its length, and its direction, that is, pointing that way. So irrespective of the coordinate system we decide to use, we want to know how to calculate these two properties of length and direction. If the coordinate system was constructed out of two unit vectors that are orthogonal to each other, like i here and j here in 2D, then we can say that r is equal to a times i plus b times j; and when I say unit about i and j, I mean they're of length one, which people will often denote by putting a little hat over them, like this. Then from Pythagoras we can say that the length of r is given by the hypotenuse. What I mean by that is: if we draw a little triangle here, this length here is a i's, and if we write its length with these two little vertical lines, it's just of length a, because i is of length one; and this side here is b j's, and that's of length b. So this side here, from Pythagoras, is just a squared plus b squared, all square-rooted, and that's the size of r. Quite often people will write r down like this, just ignoring the i and j and writing it as a column vector, so r is equal to a b, and the size of r we write down as the square root of a squared plus b squared. Now, we've done this for two spatial directions defined by unit vectors i and j that are at right angles to each other.
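As a quick numerical check (the example values are my own, and NumPy is assumed), the same length comes out whether you apply Pythagoras by hand or use the library's norm:

```python
import numpy as np

r = np.array([3, 4])     # r = a i + b j with a = 3, b = 4

size_by_pythagoras = np.sqrt(r[0] ** 2 + r[1] ** 2)   # sqrt(a^2 + b^2)
size_by_norm = np.linalg.norm(r)                      # the library's modulus

print(size_by_pythagoras, size_by_norm)               # both 5.0
```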
But this definition of the size of a vector is more general than that: it doesn't matter if the different components of the vector are dimensions in space, like here, or even things with different physical units, like length and time and price. We still define the size of a vector through the sum of the squares of its components.

The next thing we're going to do is define the dot product, one way, among several, of multiplying, if you like, two vectors together. If we have two vectors r and s here, where r has components ri and rj (r in the i direction and r in the j direction) and s has components si and sj, then we define r dotted with s to be given by multiplying the i components together, ri times si, and adding the product of the j components, rj times sj. That is, the dot product is just a number, a scalar number like 3, given by multiplying the components of the vectors together in turn and adding those up. In this case, that would be 3 and 2 for ri and rj, and minus 1 and 2 for s; if we do that, then r dot s is equal to minus 3 plus 4, which gives us 1. So r dot s in this case is just 1.

Now we need to prove some properties of the dot product. First, it's commutative. What commutative means is that r dot s is equal to s dot r: it doesn't matter which way round we do it. And it doesn't matter because when we put these numbers in, if we interchange the r's and s's we get the same thing: multiplying minus one by three is the same as three times minus one. So it doesn't matter which way round we do the dot product; s dot r is equal to r dot s, which means it's commutative.

The second property we want to prove is that the dot product is distributive over addition, by which I mean that if I've got a third vector t, then r dotted with (s plus t) is equal to r dotted with s plus r dotted with t; I can multiply it out in that kind of way. This probably feels kind of mundane or obvious, but let's prove it in the general case. Let's say I've got some n-dimensional vector r, with components r1, r2, all the way up to rn; s is the same, with components s1, s2, up to sn; and t has components t1, t2, up to tn. Then let's multiply it out. If we take the left-hand side, r dotted with (s plus t), that's going to be equal to r1 times (s1 plus t1), then r2 times (s2 plus t2), then all the dimensions in between, and finally rn times (sn plus tn). Then we can multiply that out, so we've got r1 s1 plus r1 t1, plus r2 s2 plus r2 t2, plus all the ones in between, plus rn sn plus rn tn. Then we can collect them together: we've got the r1 s1 terms, r2 s2, all the way to rn sn, and that's of course just equal to r dotted with s; and if we collect the r-t terms together, we've got r1 t1 plus r2 t2, all the ones in between, and rn tn, and that's just r dotted with t. So we've demonstrated that this is in fact true, that you can pull out plus signs and dots in this way, which is called being distributive over addition.

The third thing we want to look at is associativity over scalar multiplication. That is, if we take a dot product r dotted with (a times s), where a is just a number, a scalar we're multiplying s by, then that is equal to a times (r dotted with s); that means it's associative over scalar multiplication.
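Here is a small numerical sanity check of those three properties, using the same r and s as above plus an invented t and scalar (my own sketch, NumPy assumed):

```python
import numpy as np

r = np.array([3, 2])
s = np.array([-1, 2])
t = np.array([2, 5])
a = 4  # any scalar

print(np.dot(r, s))                                      # 3*(-1) + 2*2 = 1
print(np.dot(r, s) == np.dot(s, r))                      # commutative: True
print(np.dot(r, s + t) == np.dot(r, s) + np.dot(r, t))   # distributive over addition: True
print(np.dot(r, a * s) == a * np.dot(r, s))              # associative over scalar multiplication: True
```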
We can prove that quite easily, just in the 2D case. If we say we've got r1 times (a s1) plus r2 times (a s2), that's the left-hand side just for a two-dimensional vector; then we can pull the a out of both of those, and we've got a times (r1 s1 plus r2 s2), which is just a times r dot s. So this is in fact true, and we've got our three properties: the dot product is commutative, we can interchange it; it's distributive over addition, which is this expression; and it's associative over scalar multiplication, we can just pull scalar numbers out.

As an aside, sometimes you'll see people in physics and engineering write vectors in bold and numbers or scalars in normal font, or underline their vectors, to easily distinguish them from things that are just scalars, whereas in maths and computer science people don't tend to do that. It's just a notation difference between different communities, and not anything fundamental to worry about.

The last thing we need to do before we can move on is draw out the link between the dot product and the length, or modulus, of a vector. If I take a vector and dot it with itself, r dotted with r, what I get is just the sum of the squares of its components: r1 times r1 plus r2 times r2, and all the others if there are others, so I get r1 squared plus r2 squared. Now that's quite interesting, because it means that if I take the dot product of a vector with itself, I get the square of its size, of its modulus: that is, it equals the square root of r1 squared plus r2 squared, all squared, which is mod r squared. So if we want the size of a vector, we can get it just by dotting the vector with itself and taking the square root, and that's really kind of neat, and hopefully quite satisfying.

Now let's take the cosine rule, which you'll probably remember vaguely from school. That said that if we had a triangle with sides a, b and c, then c squared was equal to a squared plus b squared minus 2ab times the cos of the angle between a and b, that angle theta there. We can translate that into our vector notation: if we call this vector r here and this vector s here, then this third side would be minus s plus r, that is, the vector r minus s. So we could say that c squared was the modulus of r minus s, squared, and that would be equal to the size of r squared, plus the size of s squared, minus 2 mod r mod s cos theta. Now here's the cool bit: we can multiply this out using our dot product, because we know that the size of r minus s, squared, is equal to (r minus s) dotted with itself. We can multiply that out and then compare it to the right-hand side. So (r minus s) dotted with (r minus s) is going to be equal to r dotted with r, then minus s dotted with r, then minus s dotted with r again (taking that minus s with that r), and then minus s dotted with minus s. That is, we've got the modulus of r squared here (r dotted with itself), minus twice s dotted with r, and then minus s dotted with minus s, which is the size of minus s squared, which is just the size of s squared. Then we can compare that to the right-hand side.
When we do that comparison, the mod r squareds are going to cancel and the mod s squareds are going to cancel, and we get a result: minus twice s dotted with r is equal to minus twice the modulus of r times the modulus of s times cos theta. Then we can lose the minus signs (they cancel out; just multiply through by minus one) and cancel the twos as well. So we can say that the dot product, r dot s, to put it in a more familiar form, is equal to mod r mod s cos theta.

What we've found here is that the dot product really does something quite profound: it takes the size of the two vectors (if these were both unit-length vectors, those would be one) and multiplies by cos of the angle between them. It tells us something about the extent to which the two vectors go in the same direction. Because if theta was zero, then cos theta would be one, and r dot s would just be the sizes of the two vectors multiplied together. If, on the other hand, the two vectors were at 90 degrees to each other, r like this and s like this, with the angle theta between them equal to 90 degrees, then cos 90 is zero, and r dot s is going to be some size of r times some size of s times zero. So if the two vectors are pointing at 90 degrees to each other, if they're what's called orthogonal to each other, then the dot product is going to give me zero. If they're both pointed in the same direction, say s was like that and the angle between them is naught, cos of naught is equal to one, and then r dot s is equal to mod r times mod s, just the multiplication of the two sizes. One last fun one here: if r and s were in opposite directions, say s was now going this way and the angle between them was 180 degrees, cos of 180 degrees is equal to minus 1, so then r dot s would be equal to minus the size of r times the size of s. So what the dot product really does, with this cos term, is tell us, when we get a minus sign out, that they're going in opposite directions. So there's this property of the dot product that we've derived by looking at the cosine rule: when the dot product is zero, the vectors are at 90 degrees to each other, they're orthogonal; when they go the same way, we get a positive answer; and when they're going more or less in opposite directions, we get a negative answer for the dot product.

Now, there's one last thing to talk about in this segment, which is called projection, and for that we'll need to draw a triangle. If I've got a vector r and another vector s, and I drop a little right-angled triangle down here, where this angle is 90 degrees, then I can do the following. If this angle here is theta, then cos theta is equal to, from SOHCAHTOA, the adjacent length here over the hypotenuse there, adjacent over hypotenuse; and this hypotenuse is the size of s, so that's the adjacent over the size of s. Now, if I compare that to the definition of the dot product, I can say that r dotted with s is equal to the size of r times the size of s times cos theta. But the size of s times cos theta (if I pull mod s up here, I just need to put my theta in there) is just that adjacent side, that adjacent side here in the triangle. So the adjacent side here is just kind of the shadow of s on r.
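Rearranging r dot s = mod r mod s cos theta gives a way to compute the angle between two vectors. Here is a quick sketch with the r and s used earlier (my own check, NumPy assumed):

```python
import numpy as np

r = np.array([3, 2])
s = np.array([-1, 2])

cos_theta = np.dot(r, s) / (np.linalg.norm(r) * np.linalg.norm(s))
theta = np.degrees(np.arccos(cos_theta))

print(cos_theta, theta)   # ~0.124, so the angle is just under 83 degrees

# sanity checks on the sign of the dot product
print(np.dot(np.array([1, 0]), np.array([0, 1])))    # orthogonal: 0
print(np.dot(np.array([1, 0]), np.array([-2, 0])))   # opposite directions: negative (-2)
```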
That length there is the shadow cast, as if I had a light at 90 degrees to r shining down on s, and that's called the projection. So what the dot product gives us is the projection of s onto r, times the size of r. One thing to notice here is that if s was perpendicular to r, if s was pointing this way, it would have no shadow: if theta was 90 degrees, cos theta would be naught, that shadow would be naught, and I'd get no projection. So the other way to see the dot product is that it gives us the size of r times some idea of the projection of s onto r, the shadow of s onto r.

If I divide the dot product r dot s by the length of r (just bring the mod r down here), I get mod s cos theta, that adjacent side. I get a number, because r dot s is a number and the size of r is a number, and that's called the scalar projection. That's why the dot product is also called the projection product: it takes the projection of one vector onto another; we just have to divide by the length of r. And if r happened to be a unit vector, one of the vectors we used to define the space, of length one, then that would be of length one, and r dot s would just be the scalar projection of s onto that r, that vector defining the axis or whatever it was.

Now, if I want to also encode into the projection something about r, which way r was going, I can define something called the vector projection, and that's defined to be r dot s over mod r squared (that's r dot s over r dot r, if you like, because mod r squared is equal to r dot r), multiplied by the vector r itself. The dot product is just a number, these sizes are just numbers, and r itself is a vector. So what we've done is take the scalar projection, r dot s over mod r, which is how much s goes along r, and multiplied it by r divided by its length: a vector going in the direction of r but normalized to have a length of one. So the vector projection is a number times a unit vector that goes in the direction of r. If r, say, was some number of lengths long, then r divided by its size would be a unit-length vector, like the one I've just drawn there, and the vector projection would be that number (s dot r over mod r, that adjacent side) times a unit vector going in the direction of r. So that's, if you like, the scalar projection, also encoded with something about the direction of r, just a unit vector going in the direction of r. So we've defined a scalar projection here, and a vector projection there.

So good job. This was really the core video for this week; we've done some real work here. We've found the size of a vector, and we've defined the dot or projection product. We've then found some mathematical operations we can do with the dot product: it's distributive over vector addition, associative with scalar multiplication, and commutative. We've then found that it gives us the angle between two vectors, the extent to which they go in the same direction, and that it finds the projection of one vector onto another, kind of how one vector will collapse onto another, which is what we'll explore in the next two videos. So good work. Now is a good time to pause and try some examples, to put all this together and give it all a workout and a bit of a try before we move on.
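Those two definitions translate directly into a couple of helper functions. This is a minimal sketch of my own; the function names and the use of NumPy are assumptions, not from the course:

```python
import numpy as np

def scalar_projection(s, r):
    """Length of the shadow of s on r: (r . s) / |r|."""
    return np.dot(r, s) / np.linalg.norm(r)

def vector_projection(s, r):
    """Shadow of s on r as a vector: (r . s) / (r . r) times r."""
    return (np.dot(r, s) / np.dot(r, r)) * r

r = np.array([3, 2])
s = np.array([-1, 2])

print(scalar_projection(s, r))   # ~0.277
print(vector_projection(s, r))   # ~[0.231, 0.154]
```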
So far we haven't really talked about the coordinate system of our vector space, the coordinates in which all of our vectors exist. But it turns out that in doing this thing of projecting, of taking the dot product, we're projecting our vector onto one which we might use as part of a new definition of the coordinate system. So in this video we'll look at what we mean by a coordinate system, and then we'll do a few cases of changing from one coordinate system to another.

Remember that a vector, like this guy r here, is just an object that takes us from the origin to some point in space, which could be some physical space, or some space of data, like bedrooms and thousands of euros for the house, or something like that. What we haven't really talked about so far is the coordinate system we use to describe space. We could use a coordinate system defined by these two vectors here. I'm going to give them names: we called them i and j before, but now I'm going to call them e1 and e2. I'm going to define them to be of unit length, so I'll give them a little hat meaning they're of unit length, and define them to be the vectors 1 0 and 0 1. If I had more dimensions in my space, I could have e3 hat, e4 hat, e5 hat, e-a-million hat, whatever. Here, the instruction then is that r is equal to doing a vector sum: three e1 hats plus four e2 hats, and we write it down as a little list, 3 4. So r, the 3 4 here, is the instruction to do 3 e1 hats plus 4 e2 hats.

But if you think about it, my choice of e1 hat and e2 hat here is kind of arbitrary: it depends entirely on the way I set up the coordinate system. There's no reason I couldn't have set up some coordinates at some angle to that, or even used vectors to define the axes that weren't at 90 degrees to each other and were of different lengths. I could still have described r as some sum of the vectors I used to define the space. So I could have, say, another set of vectors b: I'll call b1 here the vector 2 1, and another vector here, b2, the vector minus 2 4, defined in terms of the coordinates e. I could then describe r using those vectors b1 and b2; it's just that the numbers in r would be different. So we call the vectors we use to define the space, these guys e or these guys b, basis vectors, and the numbers I've used to define r only have any meaning when I know what the basis vectors are. So r referred to the basis vectors e is 3 4, but r referred to the basis vectors b also exists; we just don't know what the numbers are yet.

This should be kind of amazing: the vector r has some existence, in a deep sort of mathematical sense, completely independently of the coordinate system we use to describe the numbers in the list describing r. The vector that takes us from the origin to there still exists independently of the numbers used to write it down, which is kind of neat, a sort of fundamental idea.

Now, if the new basis vectors, these guys b, are at 90 degrees to each other, then it turns out the projection product has a nice application: we can use the projection, or dot product, to find the numbers for r in the new basis b, so long as we know what the b's are in terms of e. Here I've described b1, 2 1, as being two e1's plus one e2, and I've described b2 as being minus two e1's plus four e2's; and if I know b in terms of e, I'm going to be able to use the projection product to find r described in terms of the b's.
But this is a big "if": b1 and b2 have to be at 90 degrees to each other. If they're not, we end up being in big trouble and need matrices to do what's called a transformation of axes from the e to the b set of basis vectors. We'll look at matrices later, but this will help us out a lot for now. Using dot products in this special case, where the new basis vectors are orthogonal to each other, is computationally a lot faster and easier; it's just less general. But if you can arrange the new axes to be orthogonal, you should, because it makes the computations much faster and easier to do.

So you can see that if I project r down onto b1 (I look down from here and project down at 90 degrees), I get a length here for the scalar projection, and that scalar projection is the shadow of r onto b1: the number of the scalar projection describes how much of this vector I need, and the vector projection actually gives me a vector in the direction of b1 of length equal to that projection. If I now take the vector projection of r onto b2, going this way, I'm going to get a vector in the direction of b2 of length equal to that projection. And if I do a vector sum of that vector projection plus this guy's vector projection, I'll just get r. So if I can do those two vector projections and add them up as a vector sum, I'll then have r in the new basis, r_b, being the numbers in those two vector projections; and so I've found how to get from r in the e set of basis vectors to the b set of basis vectors.

Now, how do I check that these two new basis vectors are at 90 degrees to each other? I just take the dot product. We said before that cos theta was equal to the dot product of the two vectors, b1 and b2, divided by their lengths. So if b1 dot b2 is zero, then cos theta is zero, and cos theta is zero if they're at 90 degrees to each other, if they're orthogonal. So I don't even need to calculate the angle; I'll just calculate the dot product. b1 dot b2 here: I take 2 times minus 2 and add it to 1 times 4, which is minus four plus four, which is zero. So these two vectors are at ninety degrees to each other, and it's going to be safe to do the projection.

So, having talked through it, let's now do it numerically. If I want to know r described in the basis b (r is the pink one here), I take r in the basis e and dot it with b1, and for the vector projection I divide by the length of b1 squared. r in e dotted with b1 is going to be 3 times 2 plus 4 times 1, divided by the length of b1 squared, which is the sum of the squares of the components of b1, 2 squared plus 1 squared. So that gives me 6 plus 4, which is 10, divided by 5, which is 2. So this vector projection here is of length 2 times b1: that is, in terms of the original vectors e, r_e dot b1 over mod b1 squared, times b1, is 2 times the vector 2 1, which is the vector 4 2.
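Before working through the second projection by hand below, here is the whole calculation as a short sketch, my own NumPy illustration of the method just described:

```python
import numpy as np

r_e = np.array([3, 4])      # r written in the e1, e2 basis
b1 = np.array([2, 1])       # new basis vectors, written in terms of e
b2 = np.array([-2, 4])

assert np.dot(b1, b2) == 0  # only safe because b1 and b2 are orthogonal

# how much of each new basis vector we need (the numbers in the new description of r)
r_b1 = np.dot(r_e, b1) / np.dot(b1, b1)   # 10 / 5  = 2
r_b2 = np.dot(r_e, b2) / np.dot(b2, b2)   # 10 / 20 = 0.5

print(r_b1, r_b2)                         # 2.0 0.5, i.e. r_b = (2, 1/2)
print(r_b1 * b1 + r_b2 * b2)              # [3. 4.] -- reassembles the original r_e
```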
And I can now do the same projection onto b2. So I do r_e dotted with b2 and divide by the length of b2 squared. r_e dot b2 is 3 times minus 2 plus 4 times 4, divided by the length of b2 squared, which is minus 2 squared plus 4 squared. So that's: 3 times minus 2 is minus 6, plus 4 times 4 is 16, so minus 6 plus 16 is 10; divided by this length here, 2 squared is 4 and 4 squared is 16, so 20; and that's equal to a half. So this vector projection here is that number times b2: r_e dot b2 over the modulus of b2 squared, times the vector b2, which is a half times the vector b2, a half times minus 2 4, which is minus 1 2. Now if I add those two together, the vector projection 4 2 plus this vector projection minus 1 2, I've got 3 4, which is just my original vector r, 3 4, in the basis e. So in the basis of b1 and b2, r_b is going to be 2, a half: r is two times b1 plus a half times b2. Very nice. So I've converted from the e set of basis vectors to the b set of basis vectors, which is very neat, just using a couple of dot products.

So this is really handy, this is really cool: we've seen that the vector describing our data isn't tied to the axes we originally used to describe it at all; we can re-describe it using some other axes, some other basis vectors. So the basis vectors we use to describe the space of data, and choosing them carefully to help us solve our problem, will be a very important thing in linear algebra and in general. And what we've seen is that we can move the numbers in the vector we use to describe a data item from one basis to another, and we can do that change just by taking the dot or projection product, in the case where the new basis vectors are orthogonal to each other.

So we've seen that we don't just have to have basis vectors that are the normal 1 0 and 0 1, the so-called natural basis; we can have different basis vectors that redefine how we move about space. In this video we're going to define what we mean by a basis, by a vector space, and by the term linear independence, which is going to let us understand how many dimensions our vector space possesses.

First, let's define what we mean by a basis. A basis is a set of n vectors that are not linear combinations of each other, which means they're linearly independent of each other, and that span the space they describe; the space is then n-dimensional. This first criterion I can write down by saying that they're linearly independent if I can't write any of them down by taking some combination of the others. So, for example, let's say I've got a vector b1 here: by taking multiples of b1 I can get anywhere along the 1D space of a line. If I take a second vector, b2, that isn't just a multiple of b1, then I can get anywhere in the plane of this board by taking combinations of b1 and b2, some number of b1's plus some number of b2's, and this is a 2D space. Now let's take a third vector, b3. For b3 to be a valid third basis vector, it has to be impossible for me to find some numbers a1 and a2 such that b3 equals a1 b1 plus a2 b2; it has to be impossible for me to find b3 as some combination of b1's and b2's, where a1 and a2 are just numbers. If that is impossible, b3 is a third basis vector, and b3 is linearly independent. If it is possible for me to find an a1 and a2 that satisfy that sum, then b3 is linearly dependent on b1 and b2, and it lies in the same plane as b1 and b2; and if it's impossible for me to find such an a1 and a2, b3 must have some component out of the plane of the board, so I can then use b3 to give me a three-dimensional space.
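One practical way to test a set of candidate basis vectors for linear independence (not something the lecture uses, just a hedged aside, with NumPy assumed) is to stack them as the rows of a matrix and check its rank, which counts the linearly independent rows:

```python
import numpy as np

b1 = np.array([1.0, 0.0, 0.0])
b2 = np.array([1.0, 1.0, 0.0])
b3_dependent = 2 * b1 - 3 * b2              # deliberately a linear combination of b1 and b2
b3_independent = np.array([0.0, 0.0, 1.0])  # points out of the b1-b2 plane

print(np.linalg.matrix_rank(np.vstack([b1, b2, b3_dependent])))    # 2 -> not a basis of 3D space
print(np.linalg.matrix_rank(np.vstack([b1, b2, b3_independent])))  # 3 -> a valid basis of 3D space
```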
So that lets us define what we mean by the number of linearly independent basis vectors in our space. If I had a fourth basis vector, b4, that wasn't a linear combination of b1, b2 and b3, I'd have a four-dimensional space, and so on, up to as many dimensions as I like. Now, notice what my basis vectors b don't have to be: they don't have to be unit vectors, by which I mean vectors of length one, and they don't have to be orthogonal, that is, at 90 degrees to each other. But everything is going to be much easier if they are, so if at all possible you want to use orthonormal basis vector sets: at 90 degrees and of unit length.

Now let's think about what happens when we map from one basis to another. The number line of the axis of the original grid projects down onto the new grid, and potentially has different values on that grid, but the projection keeps the grid evenly spaced. Therefore any mapping we do from one set of basis vectors, from one coordinate system, to another keeps the vector space being a regularly spaced grid, where our original vector rules of addition and multiplication by a scalar still work. It doesn't warp or fold space, which is what the "linear" bit in linear algebra means. Things might be stretched or rotated or inverted, but everything remains evenly spaced and linear combinations still work. Where the new basis vectors aren't orthogonal, then to do the change from one basis to another we won't just be able to use the dot product; we'll need to use matrices instead, which we'll meet in the next module. So that's the formal definition of what we mean by a basis and by linear independence.

For example, say I have a bunch of 2D data points like this, and they all more or less lie on a straight line, something like that. I can imagine re-describing that data by mapping the points onto that line and then saying how far they are along it. So I can map this guy down onto the line (and the origin maps down there), and I can then say this data point is that far along the line; and he's that far along the line there, and he's this far away from the line. So I've got two dimensions here: how far I am along the line, and how far I am from the line, and these points are all slightly different distances from the line. There's a little bit of an argument in stats as to whether we measure the distance that way, vertically, or that way, as a projection, for the distance from the line, but it's a sort of theoretical argument. Notice that this distance from the line is effectively a measure of how noisy this data cloud is: if the points were all tight on the line, they'd all be very small distances away, and if they were all quite spread, they'd be quite big distances away. So this distance-from-the-line-ness is in fact the noise, and that's information that isn't very useful to us, so we might want to collapse it; except that the noise dimension tells me how good this line fit is. If the best-fit line was all skewed, was all wrong, I'd get much bigger numbers for the noisiness, and if the best-fit line was as good as possible, I'd get the minimum possible number for the noisiness. So that noise dimension contains information that's going to tell me how good my fit is.
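Here is a small sketch of that re-description, assuming we already have a guess at the line's direction (the points and direction below are invented for illustration, and NumPy is assumed). Each point gets a "distance along the line" coordinate and a "distance from the line" coordinate; the second one is the noise just described:

```python
import numpy as np

# a few 2D data points that lie roughly along the direction (2, 1)
points = np.array([[2.0, 1.1],
                   [4.0, 1.9],
                   [6.0, 3.2],
                   [8.0, 3.9]])

along = np.array([2.0, 1.0])
along = along / np.linalg.norm(along)      # unit vector along the line
across = np.array([-along[1], along[0]])   # unit vector at 90 degrees to the line

distance_along = points @ along    # scalar projection of each point onto the line
distance_across = points @ across  # how far each point sits off the line (the "noise")

print(distance_along)
print(distance_across)
```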
So when I'm doing data science, it tells me how good the fit of my model to my data is. And the way I've defined these two directions, along the line and away from the line, they're orthogonal to each other, so I can use the dot product to do the projection, to map the data from the x-y space onto the space of the line: along the line and away from the line. Which is what we did in the last little segment.

Now, if we're thinking about a neural network in machine learning that recognizes faces, say, maybe I'd want to make some transformation of all the pixels in a face into a new basis that describes the nose shape, the skin hue, the distance between the eyes, those sorts of things, and discard the actual pixel data. So the goal of the learning process of the neural network is going to be to somehow derive a set of basis vectors that extract the most information-rich features of the faces.

So in this video we've talked about the dimensionality of a vector space in terms of the number of independent basis vectors it has, and we've found a test for independence: a set of vectors is independent if none of them is a linear combination of the others. More importantly, we've talked about what that means in terms of mapping from one space to another, and how that's going to be useful in data science and machine learning.

Let's just take a moment to think about what we've done in this module, because you've worked quite hard, and if all this is completely new to you, you've had to think about a lot of new ideas. We've looked at vectors as objects that describe where we are in space, which could be a physical space, a space of data, or a parameter space of the parameters of a function; it doesn't really matter, it's just some space. Then we've defined vector addition, and scaling a vector by a number, making it bigger or reversing its direction. Then we've gone on to find the magnitude or modulus of a vector, the dot product, and the scalar and vector projections. We've defined the basis of a vector space, its dimension, and the ideas of linear independence and linear combinations, and we've used projections to look at one case of changing from one basis to another, the case where the new basis is orthogonal. So we've done a lot of stuff with vectors, and along the way we've done a bit of thinking about how this will apply to working with data. Hopefully that's been really useful, and you've enjoyed giving it all a workout in the exercises and activities, and I'll see you in the next modules, where we'll move on to think about the related idea of matrices.

So, back at the start of the course, we encountered the apples and bananas problem: how to find the price of things when we only have the total bill. We've looked at vectors so far, and now we're going to look at matrices. These are objects that rotate and stretch vectors, but they're also objects that let us solve these sorts of problems. So let's go back to that apples and bananas problem. I walk into a shop and I buy two apples and three bananas, and that costs eight euros; so we're saying two apples and three bananas cost eight euros. Then I go to the shop on another day and buy ten apples and one banana, and the shopkeeper charges me 13 euros. I want to discover the price for one apple and one banana, so I can decide which offers better value, or even just predict my bill. Now, you might say this is silly: what shop doesn't have sticker prices, after all? But actually, in business,
uh with complicated products and service agreements and higher purchase this sort of thing price discovery happens all the time you know think about what happens when you buy a car for instance now these are just simultaneous equations but i can write them down in another way so the way i would write this down with matrices would be as follows so i'd write it down it's this matrix what i what i'm now calling a matrix an object with numbers in 2 3 10 1 a b equals 8 13. and uh these things are call matrices this is a two by two matrix this is a a two row by one column matrix and this is another two row by one column matrix and the instruction here is to uh multiply this out in the following way so i would multiply the elements in the rows by the elements in the column so i'd multiply 2 by a plus 3 times b that's that row times that column and i'd say that equaled the top row on the right hand side and i'd do the same for the next row that row times that column is 10a plus 1b is equal to the row on the bottom on the right hand side and that looks like my two simultaneous equations but this is really handy because these things notice look like vectors so this matrix operates on this vector to give this other vector and my question is what vector transforms to give me this guy on the right now let's look at what happens if i multiply this matrix here by uh the unit basis vector the x-axis vector well when i do that when i do that multiplication i'm going to get 2 times 1 plus 3 times naught and i'm going to get 10 times 1 plus 1 times naught so i get the vector 210 so what this does is it takes the little unit vector which we called e1 hat and it transforms it to another place which is 210 which is going to be up here somewhere so that's e1 hat changed and that's equal to 2 10. now if i do that with the other basis vector if i do 2 3 10 1 multiplied by 0 1 then i'm going to get 2 times 0 plus 3 times 1 10 times 0 plus 1 times 1 i'm going to get 3 1. so the other basis vector e 2 hat gets transformed over to 3 1 which is going to be over here somewhere so that's e2 changed i'm using the prime here to indicate changed 3 1. 
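as a quick aside, here's how that multiplication looks in numpy. this little sketch isn't from the lecture and the array names are mine, but it's the same sum of rows times columns, and solving for the prices is exactly asking which vector a b the matrix transforms to 8 13:

    import numpy as np

    A = np.array([[2, 3],
                  [10, 1]])       # one row per shopping trip
    s = np.array([8, 13])         # the two bills

    print(A @ np.array([1, 0]))   # [ 2 10] -- where e1 hat ends up, the first column of A
    print(A @ np.array([0, 1]))   # [3 1]   -- where e2 hat ends up, the second column of A
    print(np.linalg.solve(A, s))  # the prices a and b (roughly 1.11 and 1.93 for these numbers)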
so what this matrix does is it moves the basis vectors in some way it transforms them it changes the space so what this matrix 2 3 10 1 does is it's a function that operates on input vectors and gives us other output vectors and the set of simultaneous equations here is asking in effect what vector i need in order to get a transform product at the position 813 in order to get an output of 813 now we can see what we mean now by the term linear algebra linear algebra is linear because it just takes input values are a and b and multiplies them by constants so everything is linear and it's algebra that is it's a notation describing mathematical objects in a system of manipulating those notations so linear algebra is a mathematical system for manipulating vectors and the space is described by vectors so this is interesting there seems to be some kind of deep connection between simultaneous equations these things called matrices and the vectors we were talking about last week and it turns out the key to solving simultaneous equation problems is appreciating how vectors are transformed by matrices which is the heart of linear algebra previously we introduced the idea of a matrix and related it to the problem of solving simultaneous equations and we show that the columns of a matrix just read us what it does to the unit vector along each axis now we'll look at different types of matrix and what they do to space and what happens if we apply one matrix transformation and then another which is term composition now because we can make any vector out of a vector sum of the scaled versions of e1 hat and e2 hat then what that means is the result of the transformation is just going to be some sum of the transform vectors which i'm calling e1 hat and e2 hat this is a bit hard to see but what it means is that the grid lines of our space stay parallel and evenly spaced they might be stretched or sheared but the origin stays where it is and there isn't any curviness to the space it doesn't get warped and that's a consequence of our scalar addition and multiplication rules for vectors so that is if i write down the matrix as capital a and the vector it's transforming as r you know whatever it was a b in our apples and bananas problem and that gives me some uh altered version we said it was 813 before but i'm going to call it r transformed or r prime then we can look at what happens if i do algebra with it so if i multiply uh r by some number just a number let's call it n and if i apply a to n r what i'm saying is that i will get n r prime and hopefully you can see if i put an n in there when i multiply it all out i'm going to get an n in there similarly if i multiply a by the vector r plus s then i will get a r plus a s so if i get that multiplication get that do the whole thing again with another vector and get another vector s and add those two that will be true so what i'm saying here is that if i can then think of these as being the original basis vectors e 1 hat and e 2 hat i'll then get some addition of e 2 and e 1 primed so if i say that's n e 1 hat plus m e 2 hat i'll get a uh n times e 1 hat plus m times a e 2 hat which is n e one prime plus m e two primed which is just uh the uh vector sum of some multiple of those so this space gets transformed e1 e and e2 get moved and then i can just add up vectors with them so that's very nice that means that all of our vector sum rules work now maybe that's a bit confusing so let's try it with an example so i've got my matrix a here for my apples and bananas 
problems of 2 3 10 1 or if we like the vector 210 and the vector 3 1 and let's try uh an example like a vector 3 2. now if i multiply that out just straightforwardly as we probably did at school i've got 2 times 3 plus 3 times 2 so that's 6 plus 6 that's 12. and i've got 10 times 3 which is 30 plus 1 times 2 to 2 so that's 32 but i could think of that as being 2 3 10 1 times 3 times 1 0 plus 2 times 0 1 that is 3 of e 1 hat and 2 of e 2 hat in a vector sum now i can take the 3 outs that's 3 times 2 3 10 1 times 1 0 plus 2 times 2 2 3 10 1 times 0 1 and this we know that this is what happens from e1 hat to get to e1 prime here so that's 3 times 210 and that's 2 times what happens to e2 hat and that goes to 3 1. so that gives us 6 plus 6 is 12 and 30 plus 2 is 32 so it really is true these rules really do work and we can think of a matrix multiplication as just being the multiplication of the vector sum of the transformed basis vectors so pause them for a moment and try that maybe with an example of your own and verify that that really does work because that's really quite deep this matrix just tells us where the basis vectors go that's the transformation it does it's not a complicated multiplying out thing we don't need to worry about the mechanics of doing the sum we can just think of it in terms of what it does to vectors in the space let's look at some special types of matrices that do simple things uh and then we'll think about how to combine them to do complicated things first let's think about a matrix that doesn't change anything uh a matrix that's just composed of the basis vectors of the space so 1 0 and 0 1. so that if i multiply it by some vector x y that's not going to change x y if i multiply it out i'm going to get 1 times x plus 0 times y 0 times x plus 1 times y x y so it doesn't change the vector x y the it's just composed of the basis vectors here and it doesn't change them and that's called therefore the identity matrix it's the matrix that does nothing and leaves everything preserved and it's called i now what if i have different numbers along the leading diagonal something like three zero zero two for instance well that's going to scale the x axis here by a factor of three it's going to go to three naught when i multiply it out and the y axis is going to scale by a multiple of 2. it's going to go from 0 1 to 0 2. 
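if you want to check that little sum on a computer, here's a minimal numpy version of it. again the names are mine, not the lecture's:

    import numpy as np

    A = np.array([[2, 3],
                  [10, 1]])
    e1, e2 = np.array([1, 0]), np.array([0, 1])

    direct = A @ np.array([3, 2])           # multiplying it out directly
    as_sum = 3 * (A @ e1) + 2 * (A @ e2)    # 3 lots of transformed e1 hat plus 2 lots of transformed e2 hat
    print(direct, as_sum)                   # both come out as [12 32]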
so i've scaled space this way by a factor of 3 and that way by a factor of 2 so my little unit square has gone to being a rectangle from being a square originally it scaled up this way three times and this way two times and of course if the scale factor here was a fraction if it was a third or something then i'd have squished space that way instead i'd have gone that way and made it thinner and taller something like that so a fraction then squishes space the next thing to think about is what happens if i've got a matrix where i scale by say -1 here on one of the axes well what that's going to do so the original axes is it's going to flip them around so if i've got one zero here and zero one being the other axis of course then that's going to scale the first one over here to being minus one zero out of the the zero two is going to scale the other axis up to zero two here so my original little cube goes from here up to here so it's changed in area by a factor of two one times two but it's also flipped over um the x-axis has gone over there now what does that mean well if i had previously an axis system where i went using my right hand for my right hand there that's my first one that's my second axis around and i went counterclockwise now i'm going the other way now i've got a left-handed coordinate system and i go clockwise so i need to get my left hand out to describe them now so i've changed the sense of the coordinate system in flipping it over now the other matrix we need to think about get another pen is minus one zero zero minus one and that inverts everything it inverts both axes it takes one zero to minus one zero and it takes 0 1 down here to 0 minus 1. so it flips everything in both coordinates and that's called an inversion another matrix i can think of is 0 1 1 0. that's kind of fun what that does is it takes uh i hat 0 1 0 here it takes it to 0 1 so it takes it to there 01 that guy goes there and it takes the other axis which was there which was also 0 1 and it takes it to 1 0. 
so what it does is it flips them around it's like i put a mirror in at 45 degrees i could have another mirror would be 0 minus 1 minus 1 0 and that would take 1 zero here and it would make it zero minus one take it down there zero minus one and it would take this axis zero one and it would make it to minus one zero so it would make it over there so that's like having a mirror plane in there and just for completeness i can think of another two mirrors and they would be minus one zero 0 1 and that flips the x axis that's like a vertical mirror that's that guy and i can have another one which would be 1 0 0 -1 and that flips the horizontal axis that flips this guy down but leaves this guy unchanged so those are all my mirrors another thing i might want to think about are shears so i wanted to keep e1 hat where it was one zero but move e2 hat over so e2 hat is zero one so i wanted to move e2 hat over to here say so i wanted to get me two primed to be equal to one one so now in which case i only just write down my matrix right i can say that uh e1 becomes itself one zero and e2 becomes one one that would be the transformation matrix for that shear it would be shearing the unit square over from being a little square to being a little parallelogram here something like that of course i could shear the x-axis as well i could do some combination of shears but that's basically how a shear would look now the last sort of shape change i can do after stretches inversion mirrors and shears is a rotation if i take e1 hat again an e2 hat if i rotate them round well e1 hat's going to go round here so e1 primed is going to be equal to 01. and e2 hat is going to go round to here an e2 primed is going to become -1 0. so that 90 degree rotation there is going to have the transformation matrix 0 1 minus 1 0. 
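here's a small numpy sketch of those archetypal matrices, with names i've made up for this sketch. note that the rotation helper below assumes positive theta means anticlockwise, which puts the minus sign on the top right; the general expression written down next measures theta the other way round, so its minus sign sits on the bottom left, but it's the same family of matrices either way:

    import numpy as np

    identity = np.array([[1, 0], [0, 1]])  # leaves every vector where it is
    stretch = np.array([[3, 0], [0, 2]])   # scales x by 3 and y by 2
    mirror45 = np.array([[0, 1], [1, 0]])  # mirror in the 45 degree line: swaps the two axes
    shear = np.array([[1, 1], [0, 1]])     # keeps e1 hat, sends e2 hat to (1, 1)

    def rotation(theta):
        # columns are where e1 hat and e2 hat land under an anticlockwise rotation by theta
        return np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])

    print(np.round(rotation(np.pi / 2)))   # e1 -> (0, 1), e2 -> (-1, 0), matching the 90 degree sketch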
and in general i can write down a rotation by an angle theta as the matrix with cos theta and sine theta on the top row and minus sine theta and cos theta on the bottom row where in this convention positive thetas go the other way round to the rotation we just drew so what we did was a rotation by minus 90 and sine of minus 90 is minus 1 which gives that entry and that's a general expression for a rotation in 2d if i wanted to do it in 3d i'd need to think about the axis i was rotating around so if i was rotating about z i would preserve all of the z's and do something like that for a 3d rotation but this isn't really a course about matrix rotations and matrix geometry and so on that would be something like crystallography this is of course a course about data science so we don't need to think too much about rotations but it is interesting that if we need to do things like transform faces if we wanted to do facial recognition we'll want to do these sorts of stretches and mirrors and shears and rotations to faces to get them all facing like that rather than facing like this or at some funny angle that we had from our camera that was looking at somebody so we do need to do this in data science on occasion and that's rotations so what we've described in this video is all the possible sorts of changes we can do with a matrix so now we need to think about how we do a rotation and then a stretch and that's the next little part now what's the point of telling you about all these different geometric transformations well if you want to do any kind of shape alteration say of all the pixels in an image of a face or something like that then you can always make that shape change out of some combination of rotations shears stretches and inverses that is if i first apply one transformation a1 to a vector r then that makes some first change then if i apply another transformation a2 to the result of that then i can perform a composition of the two transformations what i've done is i've performed first a1 and then a2 to r now maybe this isn't so obvious so let's slow down and do a concrete example let's start out with our basis vectors so we've got our e1 as being 1 0 and our e2 as being 0 1 now let's take our first transformation a1 as being a 90 degree anti-clockwise rotation so we're going to work out what happens if we rotate this by 90 degrees so e1 comes down to some transformed e1 let's call it e1 prime at 0 minus 1 so we can put zero minus one in there and our e2 rotates down here to be some transformed version of e2 which will be e2 primed at one zero so then we've got an overall a1 which describes a 90 degree rotation now let's take another matrix which also transforms our original basis vectors our original e1 and e2 and what we'll say is that it's a mirror say a vertical mirror that moves e1 to e1 prime which is going to be minus 1 0 and it's going to leave e2 where it was if i just reflect vertically so my transformation a2 is going to be minus 1 0 in the first column and it leaves the other one the same 0 1.
now let's ask ourselves what happens if i do a2 to a1 so now i'm going to reflect over the result of doing a1 so e1 prime is actually when i reflect vertically going to stay in the same place and that's going to give me e1 let's say double prime for doing it twice and e2 prime is going to reflect over here and it's going to become e2 double prime is going to be one zero so i'm going to have an overall result of doing a2 2 a1 which is going to be uh what i get e 1 prime i first write down 0 minus 1 and e 2 double prime i'm going to write down minus 1 0. now i can work out what that is actually in a matrix way without having to draw it all out by saying uh a2 to a1 is doing a2 which is -1 0 0 1. to a ones basis vectors and a uh a one sorry eight one there a one was zero minus one one zero which is just the two transform basis vectors the single primes and when i do that i just have to do this matrix to that basis vector and then this matrix to that basis vector and what that looks like if i do that is i uh do this matrix to that basis vector so i'll do that row times that column and that gives me minus one times zero plus zero times minus one that gives me zero and that one to that one so the second row uh first column 0 times 0 plus 1 times -1 gives me -1 and then i do that a-2 transformation to the second column now to the second basis vector of a1 so minus 1 times 1 uh plus zero times zero a minus one times one uh second row zero times one plus one times zero gives me zero so i do the row times the column for all the possible row and column combinations so that is in fact the same thing so we can have discovered really how to do matrix composition or matrix multiplication doing uh one matrix to another transformation matrix notice that geometrically we can show that a2 then a1 or isn't the same as doing the transformations in the other order first a1 then a2 so just watch that for a minute so this one we've done a2 and then what if we then do a1 well a1 is a 90 degree rotation so if we rotate uh e1 prime then we find that e1 double prime would be equal to our original o2 at e2 that is o1 and our e2 primed which was just staying where it was when we rotate that down then that'll come down to here that'll come down to 1o so our e2 double prime will be 1o let's look how that works out matrix wise so if we do a1 2a2 a1 was 0 1 minus 1 0 and if we do that 2 a 2 which was -101 then we've got to do that matrix multiplication and what that's going to give us is uh 0 times -1 plus 1 times 0 0 then this one times this one minus 1 times -1 plus what 0 times zero is one this row second column zero zero one one and second row second column minus one naught naught one gives me naught so that is the first column is zero one and the second column is 1 0. 
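you can check that non-commutativity numerically too. a minimal numpy sketch, with a1 and a2 as just described:

    import numpy as np

    A1 = np.array([[0, 1],
                   [-1, 0]])   # the 90 degree rotation: columns are e1 -> (0, -1) and e2 -> (1, 0)
    A2 = np.array([[-1, 0],
                   [0, 1]])    # the vertical mirror: flips e1, leaves e2 alone

    print(A2 @ A1)             # [[ 0 -1] [-1  0]] -- apply A1 first, then A2
    print(A1 @ A2)             # [[0 1] [1 0]]     -- apply A2 first, then A1: a different matrix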
so that isn't the same as doing the operations in the other sequence you see these minus signs are flipped in fact what happens what's happened is this composition rotating and then flipping over is the equivalent of reflecting the whole lot the original basis vectors in this mirror and flipping and then rotating around the effect is i've just flipped my original axes e1 e2 which is putting a mirror in here so doing the two operations in opposites in different sequences doesn't give you the same operations what we've shown is that matrix multiplication isn't commutative a2 a1 isn't the same as a1 a2 so we have to be very careful with matrix multiplication we can do them in any order meaning they're associative that is uh we could if we do a3 to a2 to a1 we can do a2 a1 and then do a3 or we can do uh a3 a2 a1 so we could do that and then that those are the same they're associative but we can't interchange the order we can't swap them around it's not commutative so this is interesting there seems to be some deep connection between simultaneous equations these things called matrices and the vectors we were talking about in the last module and it turns out the key to solving simultaneous equation problems is appreciating how vectors are transformed by matrices which is the heart of linear algebra so now we're finally going to find a way to solve the apples and bananas problem and along the way we're gonna find out about a thing called the inverse of a matrix and a method for finding it so let's go back to that apples and bananas problem we said that i walked into a shop and i bought two apples and three bananas and that that cost eight dollars or euros or whatever my currency is then i went to the shop on another day and i buy 10 apples and one banana and i got charged uh 13 euros for that so i wrote that down as a matrix times a vector and i could call that matrix a and i could call this vector r and i could call that output vector s so a operates on r and gives me s and what i want to do is i want to find out what r is what this vector a b is in order to give me the right numbers for the output vector now let's just think about this matrix a for a moment let's see if i can think of another matrix i'm going to call this matrix a to the minus 1 that when i multiply it by a gives me the identity matrix and i call this matrix here the inverse of a because what it does to a is it exactly reverses whatever a does and gives me just the identity so if i take this equation uh a r equals s if i multiply it on the left by a to the minus 1 on both sides well what i find is is that this is just going to be the identity which is the matrix that does nothing so when i multiply the identity by r i just get r so then i've got r is equal to a to the minus 1 times s the inverse matrix of a times s so if i could find the inverse of a the matrix when multiplied by a that gives me the inverse if i can find the inverse of a i can solve my problem and i can find what my a and b are and i've solved my apples and bananas problem now hold that thought about inverses for just a moment so we said the inverse of a times a was the identity matrix and we'll just part that up there for a moment now i actually don't have to go all the way to the inverse to solve the apples of my nars problem i can do it just by substitutions so let's uh look at a slightly more complicated problem and see what i mean so i'm going to take a matrix that looks like uh like this one one three one two four one one two and i'm gonna say that i have 
apples and bananas and carrots say okay carrots aren't a fruit but they're another sort of fruit-like object that we eat so i'm saying that i bought one apple one banana and three carrots and that cost me let's say 15 euros another day i bought one two and four and that cost me 21 euros and another day i bought one one and two and that cost me 13 euros now the thing to notice about a system of equations like this is that if i take one row off of another row i haven't changed anything right so i know that an apple a banana and three carrots cost 15 so if i take an apple a banana and three carrots and 15 off of the next row i haven't changed anything really about that row and the same for the third row so i can make this problem simpler by going and doing that that's a process of elimination so if i call these row one row two and row three and i take row one off of row two i'll have zero one one and if i take row 1 off of row 3 i'll have 0 0 and minus 1 i'd still have a b and c and on the right hand side i've taken row 1 off of row 2 so i've taken 15 off of 21 so i'll have 6 left and i've taken row 1 off of row 3 so i'll have 13 minus 15 which is minus 2 so ah interesting now i can see that actually minus c is equal to minus 2 so i can multiply that row through by minus 1 here i'll just change pens and i've got a solution now for c c is equal to 2 so that's interesting now i've got what's called a triangular matrix that is everything below the leading diagonal is zero and that's a very interesting sort of matrix i've reduced it to what's called echelon form where all the numbers below the leading diagonal are zero now i can do what's called back substitution i can take my answer for c here and put it back into the first two rows so if 1c equals 2 i know that 1 times c here is 2 so i can take that off there and i can do the same i can take 3 times that off the first row so if i take three times that row off the first row i'm gonna get one one zero a b c and i've taken three twos off of 15 so i'm going to get nine and if i take that third row off of the second row i'm going to get zero one zero and i've got two off of six which is four and i haven't changed the last row at all so now i know that c here is equal to two if i multiply that row by that column i've got just b is equal to four and i've got a plus b is equal to nine but i can take my answer for b back off this first row so i've got then 1 0 0 0 1 0 0 0 1 a b c and i've taken one b that is four off of the first row so that's five four two so my solution for my apples bananas and carrots problem is that apples cost five euros bananas cost four euros and carrots cost two euros and that's solved the problem for me very nice so i didn't really have to compute the inverse at all but i've only found the answer for this specific output vector so we said that this was a times r is equal to s i've only found it for this specific s if i did the inverse in the general case i could do it for any s i could find r for any s so what i've done here is what's called elimination i've taken multiples of rows and taken them off of each other to get to a triangular form where the bottom corner here is zeros and i've just got ones on the leading diagonal and then i've done back
substitution back substitution well i've gone back up of putting the numbers for c back into the first two rows the number b back into the first row and so on to get my solution to my problem and that's very nice this is actually one of the most computationally efficient ways to solve this problem and it's going to work every time and it's very simple to understand and we can do it in relatively few operations so it's really really nice and one thing to notice here is that in doing so what i've done is i've transformed a into the identity matrix here this is just the identity matrix this one's on the leading diagonals and zeros everywhere else so this is going to be a key to a way to find the inverse which is what we'll look at next now let's think about how i can apply this idea of elimination to find the inverse matrix which solves the more general problem no matter what vector i write down on the right hand side so have a three by three matrix a and it's inverse b which are multiplied together to give the identity matrix i so before we had the matrix for a was 1 1 3 1 2 4 1 1 2. and b here is is i'm going to actually introduce some notation i'm going to call b uh uh composed of elements b11 uh b12 b13 where the first digit represents the row so then i'll have b 2 1 the second digit represents the column so this would then be b 2 2 and this one would be b row 2 column 3. i'll have b 3 1 b 3 2 b 3 3. so that is that's the row and that's the column r equals i what we're saying here is that b is actually the inverse of a so that is if i multiply a by its inverse i'll get the identity and actually the uh the inverse is special because i can do it either way around i can apply it to a on the right or on the left and it'll still work because i times the inverse of a is just the inverse of a so it doesn't really matter which way round i i do it um so but i'm just doing it here so that i've got some b's to play with and i don't get confused through my a's and my beat right but for this example i'm saying b is the inverse matrix of a now notice that this first column of b here is just a vector it's a vector that describes what the b matrix the inverse of a does to space actually it's the transformation that that vector does to the uh first uh the x-axis if you like so uh and the identity matrix is just one zero zero zero one zero zero zero one right so i could write this down i could write down this a that big square guy times b one one b two one b three one and i would get just the first column of the identity matrix 1 0 0. now i could solve that by my elimination method and back substitution just in the way i did before except i'd be juggling different numbers here then i could do it again for the second column of b the second vector of b for the second vector of the identity matrix and i could do it again for the third one so i'd be doing that in some series if you like but here's the fun bit i could do it all at once so i'm going to do this process of elimination and back substitution all at once um i'm going to do it for all the columns on the right hand side simultaneously so if i take the first row here and i take it off the second row as before i'm going to get rid of this one and if i take the first row off of the third row i'm going to get rid of both this one and this one so uh if i've got put the unaltered ones down here 1 1 3. if i take that off of the second row i'm going to get 0 1 1 and if i take that off the third row i'm going to get 0 0 minus 1. 
so on the right hand side i'm going to have uh the unaltered one is 100 and i've taken that off of the second row so i've got taken one off that that's -1 1 0 and i've taken that off of the third row as well so i've got minus 1 0 1. and so now i can multiply that third row through by minus one and that's in the form i then want well i've got one's on the leading diagonal and zero is below it so when i multiply that three by minus one i get a plus there and a minus there now i can substitute that third row back into the second and first rows so i can take one of it off of the second row and i can take three of it off the first row so now the unaltered one is the bottom one one o minus one if i take one of those off of the second row then i've got zero one zero there i'm going to take that off of the second row over here so if i take one off of minus one i get minus two take zero off one i've got one take minus one off of zero i've got i'm effectively adding one then i want to do that again to the first row i wanna take three of them off to make this guy here zero so i've got then one one zero there and i want to take three of these off the first row so i take three off of one gives me minus two take 0 off there and i've got to take 3 of minus 1 off of 0 so that gives me plus 3. so we're nearly there i've just got this irritating one here i've got to take this row off of this row and i'll be home so if i take my third rows unaltered let's put some brackets in my third row is unaltered my second row is going to be unaltered 1 0 minus 1 minus 2 1 1. i take that row off of that row then that altered one gives me one zero zero take the one off there take this row off of this row minus two off of minus two um i've got to take minus 2 off of minus 2 so that gives me 0. got to take 1 off of 0 that gives me minus 1 and i've got to take 1 off of 3 which gives me 2. 
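in practice, as the next bit says, you'd just call a solver; here's a hedged numpy version of both the apples-bananas-carrots solve and the inverse we just built by hand (the function names are numpy's, the array names are mine):

    import numpy as np

    A = np.array([[1, 1, 3],
                  [1, 2, 4],
                  [1, 1, 2]])
    s = np.array([15, 21, 13])

    print(np.linalg.solve(A, s))   # [5. 4. 2.] -- apples 5, bananas 4, carrots 2, as found by elimination

    B = np.linalg.inv(A)
    print(B)                       # [[ 0. -1.  2.] [-2.  1.  1.] [ 1.  0. -1.]] -- the matrix found above
    print(A @ B)                   # the identity, up to floating point rounding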
so that's my answer so now i've got the identity matrix here for a in effect i've transformed a until it's the identity matrix i've got my b matrix which i haven't really changed and my identity matrix over here which i have changed so now i've got the identity times the b matrix is equal to this guy and the identity times something is just itself so this is in fact my answer for the inverse of a that is b so i've found a way here to find the inverse of a matrix just by doing my row elimination and then my back substitution which is really cool so what we've done is we've found an inverse matrix a to the minus 1 here and if we multiply a times a to the minus 1 we'll get the identity prove it to yourself if you like just pause for a moment and have a go at doing that for yourself and verify that that times that does in fact give you the identity matrix now in school you probably did this a different way but computationally this way is much easier particularly when it comes to higher dimensions you know 100 rows and columns or something like that there are computationally faster methods based on what's called a decomposition process and in practice what you do in any program that you write is you simply call the solver for your problem or a function something like inv of a or whatever it is and it will pick the best method by inspecting the matrix you give it and return the answer but the point here is to show how these problems are actually solved in a computer and also to observe some of the properties of these methods in different circumstances that will affect the sorts of things we want to do when solving these sorts of problems so what we've done here is figured out how to solve sets of linear equations in the general case by a procedure we can implement on a computer really easily and we've made that general by finding a general method to find the inverse of a matrix for whatever is on the right hand side of our system of equations and hopefully that's really satisfying and really nice in the final video in this module we're going to look at a property of a matrix called the determinant and what happens when the matrix doesn't have linearly independent basis vectors picking up on the basis vectors we looked at in the last module let's go back and look at a simple matrix like a naught naught d that is a and d on the leading diagonal and zeros off it what this matrix does is it scales space so if we have our original basis vectors e1 hat and e2 hat it scales them out by a factor of a in this direction to a naught and by a factor of d in this direction to naught d and we call those e1 prime and e2 prime now see what i've done to the space it was originally this size one by one and what i've done is i've scaled every vector in the space by a factor of a this way and by a factor of d this way and therefore i've scaled the space by a factor of a d all areas in the space everything's got bigger by a factor of a d and i call that the determinant of the transformation matrix now if i instead have a matrix which is the same but with a b so its columns are a naught and b d then e1 hat stays as it was it's scaled by a factor of a but e2 hat goes somewhere different e1 hat goes to a naught but e2 goes to b d somewhere like this now see what's happened to the space i've gone from something like this to something like this i've made a parallelogram but the area of the parallelogram is still a d it's its base times its perpendicular height so the determinant is still a d of this transformation
matrix now if i have a general 2 by 2 matrix a b c d with columns a c and b d then it turns the original unit square into a parallelogram like this it moves this vector to here and this vector to here then to find the area of this parallelogram i'm going to have to do a bit of maths so i'm going to do that and then we'll have a look at it in a moment so i've done the maths and it's here and you can do the geometry and puzzle it out for yourself and pause if you'd like to verify that i'm correct but the area of the parallelogram here is actually a d minus b c and i'm going to denote the operation of finding that area with vertical lines like this and call it finding the determinant of a which you probably saw in school now in school when you looked at matrices you probably saw that you could find the inverse of a matrix like a b c d by flipping the terms on the leading diagonal and taking the minus of the off diagonal terms for a 2 by 2 so let's do that and see what we get when we multiply that out the first element we get is a d minus b c interesting then this element here is a times minus b plus b times a so that's zero when we do this term we get c d minus d c which is also zero and when we do this last term we get c times minus b plus d times a so that's a d minus b c again interesting so that's the determinant so if i multiply by one over the determinant a d minus b c then these terms will become one and i'll get the identity matrix so if this was the matrix a and i think of this and this together as being a to the minus 1 then this when i pre-multiply by 1 over the determinant is in fact the inverse of a so we've proved here that that inverse we learned in school is in fact correct but the interesting thing is this determinant this amount that the matrix scales space when we do the flipping around we haven't changed its scaling of space so we need to undo that scaling and bring it back down to a scale of one and the determinant is what we need to divide the inverse matrix by in order for it to properly be an inverse now we could spend another video looking at the extension of the idea of row echelon form to find out how to find determinants in the general case computationally but this is both tricky to show and kind of pointless knowing how to do the operations by hand isn't a useful skill anymore because we just type det a into our computer and python or matlab will do it for us and from a learning perspective it doesn't add much beyond the row echelon idea which does add something which is why we went through it so i'm not going to teach you how to find determinants in the general case if you want to know look up qr decomposition and follow that through that's how you computationally go and find it or a linear algebra textbook is the other place to look now let's think about this matrix a here it transforms e1 hat and e2 hat to two points on the same line it transforms e1 hat to one one and it transforms e2 hat to 2 2 they're both points on the same line they're just a multiple of each other they're not linearly independent what this matrix in fact does is it transforms every point in space onto a line and notice that the determinant of that matrix is going to be zero if i take a d minus b c the determinant of a is naught because any area has been squashed onto a line and therefore that area is naught so however you compute it
either geometrically or computationally you get a determinant of naught so if i had a three by three matrix describing a 3d space and i had a similar situation where one of the new basis vectors was just a linear combination of the other two so they weren't linearly independent then that would mean the new space was either a plane or if there was only one independent basis vector a line like we have here in either case the volume enclosed would be zero so the determinant would be zero now let's turn back to our row echelon form and apply that idea let's take this set of simultaneous equations you can see that the third row is just the sum of the first two so row 3 is equal to row 1 plus row 2 and you can see that column 3 is just given by 2 of column 1 plus column 2 pause for a moment to verify that if you want but it really is true so this transformation matrix doesn't describe three independent basis vectors one of them is linearly dependent on the other two so this doesn't describe a new 3d space it collapses onto a 2d space so let's see what happens when i try to reduce this to row echelon form if i take off the first row from the second okay so far so good i've got that and my a b c stays the same i take 12 off 17 and i've got five if i then take the first and second rows off the third one i've now got zeros everywhere here and if i do that on the right hand side i take 12 and 17 off of 29 and i get 0 here so now i've got naught times c equals zero which is sort of true but not useful there are an infinite number of solutions for c in effect any value of c would work so now i can't solve my system of equations anymore i don't have enough information if we think about this from a solving simultaneous equations point of view my mistake was that when i went into the shop to buy apples and bananas and carrots the third time i went in i just ordered a copy of my first two orders so i didn't get any new information and i don't have enough data therefore to find out how much individual apples and bananas and carrots cost so what we've shown is that where the basis vectors describing the matrix aren't linearly independent the determinant is zero and that means i can't solve the system of simultaneous equations anymore which means i can't invert the matrix because i can't take one over the determinant either and that means i'm stuck this matrix has no inverse now there are situations where i might want to do a transformation that collapses the number of dimensions in a space but that will come at a cost another way of looking at this is that the inverse matrix lets me undo my transformation it lets me get from the new vectors back to the original vectors and if i've dumped a dimension by turning a 2d space into a line i can't undo that anymore i don't have enough information because i've lost some of it during the transformation i've lost that extra dimension so in general it's worth checking before you propose a new basis vector set and then use a matrix to transform your data vectors that this is a transformation you can undo by checking that the new basis vectors are linearly independent so what we've done in this last video in this module is look at the determinant how much we grow space we've also looked at the special case where the determinant is zero which means that the basis vectors aren't linearly independent which means the inverse doesn't exist in this first module on matrices what we've done so far is define what a
matrix is as something that transforms space we've looked at different archetypes of matrices like rotations and inverses and stretches and shears and how to combine them by doing successive transformations then we've looked at how to solve systems of linear equations by elimination and how to find inverses and then finally we've looked at determinants and linear independence so what we've done in this last video in this module is look at the determinant how much we grow space the area change we've also looked at the special case where the determinant is zero and found that that means that the basis vectors aren't linearly independent and that that means that the inverse doesn't exist so in this first module on matrices what we've done is define what a matrix is it's something that transforms space we've looked at different archetypes of matrices like rotations and inverses and stretches and shears and how to combine them by doing successive transformations matrix multiplication or composition then we've looked at how to solve systems of linear equations by elimination and how to find inverses and then finally we've looked at determinants and linear independence and next week we'll carry on and look at matrices in some more detail now there's an important other way to write matrix transformations down which is called the einstein summation convention and that writes down what the actual operations are on the elements of a matrix which is useful when you're coding or programming it also lets us see something neat about the dot product that i want to show you and it lets us deal with non-square matrices when we started we said that multiplying a matrix by a vector or with another matrix is a process of taking every element in each row in turn multiplying with corresponding element in each column in the other matrix and adding them all up and putting them in place so let's write that down just to make that concrete so i'm going to write down a matrix a here and i'm going to give it elements and a is an n by n matrix and i'm going to give it elements a11 a21 all the way down to a n1 and then a12 all the way across to a1n and then i'll have a22 here all the way across all the way down until i fill it all in i've got a n n down here so the first suffix on this matrix first suffix on all of these elements in the matrix is the row number and the second one is the column number now if i want to multiply a by another matrix b and that's also going to be an n by n matrix and that will have elements uh b 1 1 b 1 2 across to b 1 n and down to b n 1 and across to b n n um dot dot dot dot dot if i multiply those together i'm going to get another matrix which i'll call a b um and then what i'm going to do is i'm going to take a row of a multiplied by the elements of a column of b and put those in the corresponding place so let's do an example so if i want element let's say a b element 2 3 i'm going to get that by taking row 2 of a multiplied by column 3 of b so i'm going to take row 2 of a that's going to be a 2 1 a 2 2 and all the others up to a 2 n and i'm going to multiply it by column 3 of b so that's b 1 3 b 2 3 all the way to b n 3. 
and i'm going to add all those up with a dot dot dot in between so that's going to be this element row 2 column 3 of a b now in einstein's convention what you do is you say okay this is the sum over j of a i j b j k so if i add these up over all the possible js i'm going to get a 1 1 b 1 1 plus a 1 2 b 2 1 and so on and so on and that's for i equals 1 and k equals 1 and i'm then going to go around all the possible i's and k's so what einstein says is okay if i've got a repeated index i just won't bother with the sum and i'll just write that down as a i j b j k and that's equal to the product a b element i k so a b i k is equal to a i 1 b 1 k plus a i 2 b 2 k plus a i 3 b 3 k and so on and so on until you've done all the possible js and then you do that for all the possible i's and k's and that will give you your whole matrix for the product a b now this is quite nice because if you were coding you just run three loops over i j and k and then use an accumulator on the js here to find the elements of the product matrix a b so the summation convention gives you a quick way of coding up these sorts of operations now we haven't talked about this so far but now we can see that there's no reason so long as the matrices have the same number of entries in j that we can't multiply them together even if they're not the same shape so we could multiply a two by three matrix something with two rows and three columns by a three by four matrix three rows there and four columns there and when i multiply those together i'm going to go that row times that column i've got the same number of js in each case and then i'm going to be able to do that for all of the possible columns so i'm going to get something with four columns and i'm going to be able to do that for the two rows here that row times that column so i'm going to get a 2 by 4 matrix out it's going to have 1 2 3 4 1 2 3 4.
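that three-loop accumulator version is easy to write down; here's a minimal python sketch of it (the helper name matmul is mine, and numpy's einsum call at the end uses literally this index notation):

    import numpy as np

    def matmul(A, B):
        # (AB)_ik = sum over the repeated index j of A_ij * B_jk
        n, m = A.shape
        m2, p = B.shape            # we need m == m2, the shared number of j's
        AB = np.zeros((n, p))
        for i in range(n):
            for k in range(p):
                acc = 0.0          # accumulator over j
                for j in range(m):
                    acc += A[i, j] * B[j, k]
                AB[i, k] = acc
        return AB

    A = np.array([[2., 3.], [10., 1.]])
    B = np.array([[-1., 0.], [0., 1.]])
    print(matmul(A, B))                    # same as A @ B
    print(np.einsum('ij,jk->ik', A, B))    # the summation convention spelled out for numpy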
so i can multiply together these non-square matrices if i want to and i'll get in the general case some other non-square matrix are going to have the number of rows of the one on the left and the number of columns of the one on the right now uh all sorts of matrix properties that you might want inverses and so on determinants all start to get messy and mucky and you sometimes can't even compute them when you're doing this sort of thing but there are times when you want to do it and the einstein summation convention makes it very easy to see how you do it and how it's going to work as long as you've got the same number of j's you're good you can multiply them together now let's revisit the dot product uh in light of the summation convention so if we've got two vectors let's call them u and v and we'll say uh use a column vector having elements u i and v is another column vector having elements v i and when we dot them together what we're doing is we're multiplying u1 by v1 adding u2 v2 all the way up so in the summation convention that's just u i v i where we repeat over all the i's and add but this is just like writing u as a row matrix pushing you over from being a vector to being a matrix with elements u1 u2 all the way up to un and multiplying it by another matrix v1 v2 all the way up to v n that's the same that matrix multiplication is the same thing as the dot product i just push the u vector over and then my dot product is just like doing a matrix multiplication which is sort of neat so there's some uh equivalence between a matrix transformation a matrix multiplication and a dot product so let's look at that now if i take a a unit vector here let's call him u hat with components u1 and u2 and let's imagine uh what happens if i dot him with the axis vectors so if i've got an axis here e1 hat which would be 1 0 and i've got another axis here e 2 hat which will be 0 1. 
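and the same goes for non-square shapes and for the dot product as a one-row times one-column multiplication. a quick sketch, with names of my own choosing:

    import numpy as np

    A = np.ones((2, 3))               # 2 by 3
    B = np.ones((3, 4))               # 3 by 4 -- same number of j's, so we can multiply them
    print((A @ B).shape)              # (2, 4)

    u = np.array([1., 2., 3.])
    v = np.array([4., 5., 6.])
    print(np.dot(u, v))                          # 32.0 -- the ordinary dot product
    print(u.reshape(1, 3) @ v.reshape(3, 1))     # [[32.]] -- u pushed over into a row matrix, same number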
now let's think about what happens when i dot u hat with e 1 when i do the projection of u hat onto e1 so when i drop u-hat down onto the axis here when i do the projection of u-hat onto e1 i'm going to get a length here just of u1 just of the x-axis element of u-hat now what happens if i drop uh project e1 onto u hat well i'm then going to get this projection and i'm going to get a length here this projected length here now the fun thing is we can actually draw a line of symmetry here through these two projections where they cross and this little triangle and this little triangle are actually the same you can go and do a bit of geometry and prove to yourself that that's true so this length here this projection there that projection is the same length as that projection which is implied by the dot product if i dot e1 with u hat um when i do this multiplication here it's symmetric i can flip them around and i get the same answer so this shows geometrically why that's true and if we repeat this with the other axes with e2 here and any other axes there are then we'll also get the same results so this is why the projection is symmetric and the dot product is symmetric and why projection is the dot product so there's this connection between this numerical thing matrix multiplication and this geometric thing projection which is quite quite beautiful and mind-blowing really and that's why we talk about uh a matrix uh multiplication with a vector as being the projection of that vector onto the vectors composing the matrix the columns of the matrix so what we've done in this video is look at the summation convention which is a compact and computationally useful but not very visual way to write down matrix operations and that's opened up looking at funny shape matrices and that's opened up re-examining the dot product here so that's really nice we've said before that the columns of a transformation matrix are the axes of the new basis vectors of the mapping in my coordinate system we're now going to spend a little while looking at how to transform a vector from one set of basis vectors to another let's say i have two new basis vectors that describe the world of panda bear here and panda's world is orange so panda's got a first a basis vector there and then another basis vector there say and let's say in my world panda bears basis vectors are at 3 1 and a 1 1 and my uh my basis vectors here are e 1 hat equals naught 1 for sorry for e 2 hat and e 1 hat is equal to 1 naught so those are my basis vectors and the orange ones are panda's basis vectors now uh and so panda's basis factors if the first one for panda is his one zero and the first one the second one is zero one in panda's world so bears basis vectors are 3 1 and 1 1 in my blue frame that is i can write down bear's transformation matrix as 3 1 1 1. now let's take some vector i want to transform let's say that vector is in bears world is the vector uh a half of three one in bear's world so it's three over two one over two so the instruction there is do uh three over two of three one and then do one over two a half of one one in my frame if you like so in my world that's going to give me the answer of three times three over two plus one times one over two is 9 10 halves which is 5 and 1 times 3 over 2 plus 1 times a half so that's a total of 2. so that's the vector 5 2 in my world 5 2. 
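here's a numpy sketch of that change of basis. the names are mine, and the last line previews the reverse process we turn to next:

    import numpy as np

    B = np.array([[3., 1.],          # bear's basis vectors written as columns in my coordinates
                  [1., 1.]])

    v_bear = np.array([1.5, 0.5])    # the vector 3/2, 1/2 in bear's world
    v_mine = B @ v_bear
    print(v_mine)                    # [5. 2.] -- the same vector written in my world

    print(np.linalg.inv(B) @ v_mine) # [1.5 0.5] -- the inverse takes my coordinates back to bear's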
those two are the same thing so this is bear's vector and this is my vector and this transformation matrix here is bear's basis written in my coordinate system so it transforms bear's vectors into my world which is a bit of a problem because usually i'd want to translate my world into bear's world so we need to figure out how to go the other way so my next question is how do i perform that reverse process well it's going to involve the inverse so if i call bear's transformation matrix b i'm going to want b inverse b to the minus 1 and the inverse of this matrix is actually pretty easy to write down it's a half of 1 minus 1 minus 1 3 that is flip the elements on the leading diagonal put a minus on the off diagonal terms and the determinant is three minus one which is two so we divide by the determinant in other words multiply by a half so that's b to the minus one and those are my basis vectors in bear's coordinates that's my basis in bear's world so my 1 0 is going to be a half of 1 minus 1 in bear's system and my 0 1 is going to be a half of minus 1 3 in their system and we can verify that this is true if we take this guy a half of one minus one and compose it with bear's vectors one of 3 1 minus one of 1 1 gives me three minus one which is 2 and 1 minus 1 which is 0 so that's 2 0 and halving it gives you 1 0 so that really does work if i take a half of 1 minus 1 of bear's vectors i get my unit vector so that really does do the reverse thing so then if i take my vector which was 5 2 and do that sum i should get the vector in bear's basis so i've got five times a half of 1 minus 1 plus two times a half of minus 1 3 and that will give me a half of 3 1 when i multiply it all out that is 5 times 1 minus 1 times 2 gives me 3 so 3 over 2 and minus 5 plus 6 gives me 1 so 1 over 2 it all works out whichever way you do it you still get that answer so that's bear's vector again which is the vector we started out with so that's how you do the reverse process if you want to take bear's vector into my world you need bear's basis in my coordinate frame and if you want to do the reverse process you want my basis in bear's coordinate frame that's probably quite counter-intuitive so let's try another example where this time bear's world is going to be an orthonormal basis vector set so here are our basis vectors at one zero and zero one in my world in blue and bear's world is in orange and bear's world has 1 1 which i've made unit length so it's one over root two times 1 1 and minus one one again normalized by one over root two so they're those two and you can do a dot product to verify that those two are at 90 degrees to each other and they're bear's vectors 1 0 and 0 1 in bear's world so then i can write down bear's transformation matrix that transforms a vector of bear's now if i've got the vector in bear's world that's 2 1 then i can write that down and i will therefore get the vector in my world so when i multiply that out what i get is 1 over root 2 times 2 minus 1 which is 1 and then 1 times 2 plus 1 times 1 which gives me 3 so in my world the vector is as i've written down 1 over root 2 times 1 3.
so if i want to do the reverse transformation i need b to the minus 1 and b to the minus 1 is actually quite easy because this is an orthonormal basis the determinant of this matrix is one so it all becomes quite easy so i just get one over root two keep the leading terms the same flip the sign of the off diagonal terms because it's a two by two matrix so it's really simple and if you go and put that in if you say uh if i take one over root two times one minus one so i take one of those plus one of those uh divide by uh multiplied by root two i do in fact get one zero and the same for zero one it all works so then if i take uh the vector in my world this vector in my world which is one three and multiply it out then what i get is the vector in bears world so that's 1 plus 3 which is 4 1 so minus 1 plus 3 is 2 and i've got 1 over root 2 times 1 over root 2 so that's a half so in bears world this vector is 2 1 which is what we originally said so it really works now this was all prep really for the fun part which is we said before in the vectors module that we could do this just by using projections if the new basis vectors were orthogonal which these are so let's see how this works with projections so let's try it with projections what we said before was that if i take my version of the vector and dot it with a bear's axis so the first of bear's axes is that in my world then i will get the answer of the vector in bears world so that gives me uh 1 over root 2 times one over root two which is a half of one plus three which is four so that gives me two and that's going to be the first component of bears vector because it's the first of bears axes and i can do it again with the other of bear's axes so that's 1 over root 2 1 3 that's the vector in my world with the other of bear's axes which is 1 over root 2 times minus 1 1. and when i do that dot product what i'll get 1 1 over root 2's will multiply to give me a half again and i've got 1 times minus 1 plus 3 times 1 is a total of 2 which is 1 and that's bear's vector notice 2 1. 
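here's that orthonormal example in numpy, checking that two dot products are all we need. the names are mine again:

    import numpy as np

    r2 = np.sqrt(2)
    b1 = np.array([1., 1.]) / r2      # bear's first basis vector in my coordinates
    b2 = np.array([-1., 1.]) / r2     # bear's second basis vector -- unit length and orthogonal to b1

    v_mine = np.array([1., 3.]) / r2  # the vector 1 over root 2 times (1, 3) in my world

    # because b1 and b2 are orthonormal, projecting with the dot product gives bear's components directly
    print(v_mine @ b1, v_mine @ b2)   # 2.0 1.0 -- the vector 2, 1 in bear's world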
so i've used projections here to translate my vector to bears vector just using the dot product now remember with the vector product um what i'd have to do is i'd have to remember to normalize when i did the multiplication by theirs vectors i'd have to normalize by their lengths but in this case their lengths are one so it's actually really easy so we don't have to do the complicated matrix maths we can just use the dot product if bears vectors are orthogonal now there is one last thing if you try this with the example we did before with bears vectors of 3 1 and 1 1 so before we had those being bears vectors if you try the dot product with those because they're not orthogonal to each other it won't work and give it a go for yourself and verify that that it really won't work that they need to be orthogonal for this to work if you have them not being orthogonal you can still do it with the matrix transformation you just can't do it with the dot product now let's look at how we do the transformation of a vector that's already in a change basis last time we looked at bears basis bears bases had a first axis of 3 1 and a second axis of 1 1 and let's say i have a vector x y defined in bears basis and say i want to transform it by doing something like a rotation of 45 degrees but the problem is i don't know how to write a 45 degree rotation in bear's funny coordinate system i only know how to write down a 45 degree rotation in my normal 1001 system so in my system which is 1 0 01 a 45 degree rotation rotates one o up like that so it becomes if it's still a unit vector one over root two one over root two that is a normalized one one and it takes one round to minus 1 over root 2 1 over root 2. that is that 45 degrees there that 45 degrees there so i can write down the rotation in my notation let's call it r being a 45 degree rotation as being 1 over root 2 times 1 1 minus 1 1. 
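for completeness, here's a sketch of that 45 degree rotation in numpy, plus the wrapped combination b inverse r b that gets built next; the numerical result of that combination isn't read out on the board, so the final comment is just what this code prints:

    import numpy as np

    r2 = np.sqrt(2)
    R = np.array([[1., -1.],        # 45 degree anticlockwise rotation in my standard basis
                  [1.,  1.]]) / r2
    print(R @ np.array([1., 0.]))   # [0.707 0.707]  -- 1 over root 2 times (1, 1)
    print(R @ np.array([0., 1.]))   # [-0.707 0.707] -- 1 over root 2 times (-1, 1)

    B = np.array([[3., 1.],         # bear's non-orthonormal basis from before
                  [1., 1.]])
    print(np.linalg.inv(B) @ R @ B) # roughly [[-0.71 -0.71] [ 3.54  2.12]] -- the rotation in bear's coordinates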
that's what a 45 degree rotation looks like in my world. so what i need to do is first transform the vector (x, y) into my basis, and i do that by multiplying it by b. then i can apply my nice, sensible rotation r to that vector, now that it's in my basis. so when i do r times b i've got the rotated vector in my coordinate frame. but bear doesn't care about my world — he wants the result in his basis — so i then have to transform the resulting vector back into bear's basis, and i do that by applying b to the minus 1. i get b to the minus 1 by swapping the terms on the leading diagonal, taking minus the off-diagonal terms, and dividing by the determinant, which here is two, so i multiply by a half. that gives me the vector back in bear's frame. so overall what i've done is b to the minus 1 times r times b, and what that gives me is the rotation expressed in bear's coordinate system, which is really neat. now all we have to do is the sums, and when we do, r times b gives us one matrix and b to the minus 1 times r times b gives us another — i've written them down here, so pause and ponder if you want to verify them on your own. this is what a 45 degree rotation looks like in bear's coordinate system, and notice that it's completely different from the one in my standard basis. it isn't obvious — you can't just write it down out of your head, you have to do the calculation. so if you want to do some kind of transformation in some funny basis, this construction, b to the minus 1 times r times b, is going to be very useful. to step back: the point is that if we want to work in non-orthonormal coordinate systems, then the transformation matrices change too, and we have to be mindful of that. this is the sort of algebra you see all the time — the transformation matrix r wrapped around by b and b to the minus 1, which translate between my world and the world of the new basis. so thanks a lot, bear, you really helped us understand all this. what we've done in these two videos is look at how the numbers in a vector change when we change the basis, and we've thought about coordinate systems and how to do transformations in non-orthogonal coordinate systems. it's been quite hard work, but it really sets us up, for example in principal component analysis, to operate in different basis systems.
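as a quick numerical check of this recipe — my own sketch, not part of the course — here's the 45 degree rotation re-expressed in bear's (3, 1), (1, 1) basis using numpy:

```python
import numpy as np

# bear's (non-orthogonal) basis vectors, written as columns in my frame
B = np.array([[3, 1],
              [1, 1]])

# a 45 degree rotation in my standard basis:
# columns are where (1, 0) and (0, 1) end up
R = np.array([[1, -1],
              [1,  1]]) / np.sqrt(2)

# the same rotation, expressed in bear's coordinate system
R_bear = np.linalg.inv(B) @ R @ B
print(np.round(R_bear, 3))

# sanity check: rotate bear's own first basis vector.
# (1, 0) in bear's coordinates is the vector (3, 1) in mine;
# both routes below should describe the same rotated vector.
x_bear = np.array([1, 0])
print(B @ (R_bear @ x_bear))   # transform in bear's frame, then map to mine
print(R @ (B @ x_bear))        # map to mine, then rotate
```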
now, it's going to be really useful if we can make a transformation matrix whose column vectors make up a new basis, all of whose vectors are perpendicular — or, as it's called, orthogonal — to each other, and in this video we're going to look at why. but first i want to define a new operation on a matrix, called the transpose, where we interchange all the rows and columns of the matrix. i denote the transpose with a superscript t, and the ij-th element of a transpose is the element on the opposite side of the diagonal, a_ji. so if i have the matrix (1, 2; 3, 4) and i want its transpose, the one and the four stay where they are, because when i equals j nothing changes, but element 1,2 interchanges with element 2,1 — so (1, 2; 3, 4) becomes, when i transpose it, (1, 3; 2, 4). now let's imagine i have an n by n square matrix a that defines a transformation with a new basis, whose columns a1, a2, up to an are the basis vectors of the new space, and i'm going to make this a special matrix by imposing the condition that these vectors are orthogonal to each other and of unit length: ai dotted with aj is zero if i isn't equal to j (they're orthogonal), and one if i equals j (they're of unit length). now think about what happens if i multiply this matrix on the left by a transpose — it's going to be quite magical. a transpose has a1 as its first row, a2 as its next row, all the way down to an, because i've flipped every element across the leading diagonal. so when i multiply a transpose by a, the first entry is the first row times the first column, which is a1 dot a1, and that's one. the next entry along is a1 dot a2, which is zero, and so on across the row: a1 dot a3 up to a1 dot an are all zero. the second row gives a zero, then a2 dot a2 which is one, then zeros all the way across, and the same pattern continues down the matrix. what i'm gradually building up is the identity matrix. so a transpose times a gives me the identity, which means a transpose is a valid inverse of a — which is really neat. a set of unit-length basis vectors that are all perpendicular to each other is called an orthonormal basis set, and the matrix composed of them is called an orthogonal matrix. one more thing to know about an orthogonal matrix: because all the basis vectors in it are of unit length, it must scale space by a factor of one, so the determinant of an orthogonal matrix must be plus or minus one. the minus one arises if the new basis vector set flips space around — inverts it, makes it left-handed. notice that if a transpose is the inverse, i can also post-multiply a by a transpose and get the identity — it works either way round — which by the same logic means the rows of an orthogonal matrix are orthonormal as well as the columns. and we saw in the last video that the inverse is the matrix that does the reverse transformation, so the transpose of an orthogonal matrix is itself another orthogonal matrix, describing the reverse transformation — which is really neat. now remember that in the last module on vectors we said that transforming a vector to a new coordinate system was just taking the projection, the dot product, of that vector with each of the basis vectors, so long as those basis vectors were orthogonal to each other. pause and think about that for a moment in the light of everything we've now learned about matrices.
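here's a tiny numpy illustration of these properties — again my own sketch — using the 45 degree rotation from earlier as the orthogonal matrix:

```python
import numpy as np

# an orthogonal matrix: its columns are an orthonormal basis
A = np.array([[1, -1],
              [1,  1]]) / np.sqrt(2)

I = np.eye(2)
print(np.allclose(A.T @ A, I))             # True: A^T A = I
print(np.allclose(A @ A.T, I))             # True: A A^T = I, so the rows are orthonormal too
print(np.allclose(A.T, np.linalg.inv(A)))  # True: the transpose is the inverse
print(np.linalg.det(A))                    # 1.0 (would be -1 if the basis flipped space)
```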
now, in data science, what we're really saying is that wherever possible we want to use an orthonormal basis vector set when we transform our data — that is, we want our transformation matrix to be an orthogonal matrix. that means the inverse is easy to compute, the transformation is reversible because it doesn't collapse space, and the projection is just the dot product: lots of things become nice and easy. if i arrange the basis vectors in the right order then the determinant is one, which is an easy thing to check, and if it comes out as minus one i can just exchange a pair of them and the determinant becomes plus one rather than minus one. so what we've done in this video is look at the transpose, and that's led us to the most convenient basis vector set of all — the orthonormal basis vector set — which together make up the orthogonal matrix, whose inverse is its transpose. so that's really cool. we've said several times now that life is much easier if we can construct an orthonormal basis vector set, but we haven't talked about how to do it, so in this video we'll do that, starting from the assumption that we already have some linearly independent vectors that span the space we're interested in. say i have some vectors v1, v2, all the way up to vn, and there are enough of them that they span the space. let's sketch them out: a v1 here, a v2 over here, a v3 down there somewhere. they're linearly independent — if you want to check linear independence, you can write them down as the columns of a matrix and check that the determinant isn't zero; if they were linearly dependent, you'd get a zero determinant. but they aren't orthogonal to each other or of unit length, and my life would be easier if i could construct some orthonormal basis from them. there's a process for doing that, called the gram-schmidt process, which is what we're going to look at now. take, arbitrarily, the first vector in my set, v1. v1 gets to survive essentially unscathed — we just normalize it — so my eventual first basis vector e1 is v1 divided by its length. now i can think of v2 as being composed of two things: a component in the direction of e1, plus a component that's perpendicular to e1. the component in the direction of e1 is the vector projection of v2 onto e1, which is v2 dotted with e1, times the unit vector e1 (strictly i'd also divide by the length of e1, but that's one, so forget it). call the perpendicular part u2, so v2 equals (v2 dot e1) e1 plus u2, and rearranging, u2 equals v2 minus (v2 dot e1) e1. if i normalize u2, dividing by its length, i get a unit vector e2 which is at 90 degrees to e1. so that's the first part of building an orthonormal basis.
now, my third vector v3 isn't a linear combination of v1 and v2, so it isn't in the plane defined by v1 and v2, which means it's not in the plane of e1 and e2 either. i can project v3 down onto that plane, and the projection is some vector in the plane composed of e1's and e2's. so if i take v3, subtract v3 dotted with e1 times e1 (the component of v3 made up of e1's), and subtract v3 dotted with e2 times e2 (the component made up of e2's), all that's left is the perpendicular part, which we'll call u3 — the diagram gets quite messy in 3d, but u3 is perpendicular to the plane. if i normalize u3, dividing by its length, i get a unit vector e3 which is normal to the plane, and so normal to the other two. now i've got an orthonormal basis e1, e2, e3, and i can keep going through all the vn's until i've got enough orthonormal basis vectors to span the space i originally had. i've gone from a bunch of awkward, non-orthogonal, non-unit vectors to a nice set of orthogonal unit vectors: an orthonormal basis set. that's how i construct an orthonormal basis and make my life easy, so that my transformation matrices are nice, so that i can use the transpose as the inverse, and so that i can use dot-product projections for the transformations — all the things that make life very much nicer whenever i'm doing transformations or rotations or whatever it is i want to do with my vectors. what we'll do next is try this on an example, and then apply it to doing a transformation. okay, so let's put all this together — let's use our transformations knowledge and our bases knowledge to do something quite tricky, and see if we can't actually make our lives quite simple. what i want to do here is work out what a vector looks like when i reflect it in some plane. for example, the way this lightboard works: if you were looking at it directly, all the writing would appear mirrored, and what we do to make it come out okay is flip everything left-right in post-production. the example we're going to do here asks what the reflection of something — bear, say — in a mirror would look like to me, if the mirror is off at some funny angle. my first challenge is that i don't know the plane of the mirror very well, but i do know two vectors that lie in the mirror, (1, 1, 1) and (2, 0, 1), and i've got a third vector, (3, 1, -1), which is out of the plane of the mirror. so i've got vectors v1, v2 and v3: v1 and v2 are in the plane of the mirror, and v3 is out of the plane. first, let's do the gram-schmidt process and find some orthonormal vectors describing this plane and its normal. my first vector e1 is just the normalized version of v1; v1 is of length root 3 (1 squared plus 1 squared plus 1 squared, all square-rooted), so e1 is 1 over root 3 times (1, 1, 1).
that's the normalized version of v1. then we can carry on: u2 is v2 minus some number of e1's, where that number is the projection of v2 onto e1. so u2 is (2, 0, 1), minus (2, 0, 1) dotted with e1, which is 1 over root 3 times (1, 1, 1), times e1, which is 1 over root 3 times (1, 1, 1) again. the root 3's come outside as a third, and (2, 0, 1) dotted with (1, 1, 1) is 2 plus 0 plus 1, which is 3, so the whole factor just becomes one — and yes, i confess, i fixed the example to make the numbers easy. so u2 is (2, 0, 1) minus (1, 1, 1), which gives me (1, -1, 0). if i normalize u2 i get e2, which is 1 over root 2 times (1, -1, 0). then i just need u3, and i do the same again: u3 is v3, minus the projection of v3 onto e1, minus the projection of v3 onto e2. that's (3, 1, -1), minus (3, 1, -1) dotted with 1 over root 3 (1, 1, 1) — which is a number — times the unit vector e1, minus (3, 1, -1) dotted with 1 over root 2 (1, -1, 0), times e2, which is 1 over root 2 (1, -1, 0). it's quite a complicated-looking sum, but the 1 over root 3's come out again: 3 plus 1 minus 1 is 3, so that factor becomes one and leaves me with (1, 1, 1). likewise the 1 over root 2's come out: 3 minus 1 minus 0 is 2, so they cancel and that factor becomes one as well — as i said, i fixed the example to make my life easy. so i've got u3 equals (3, 1, -1) minus (1, 1, 1) minus (1, -1, 0).
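if you want to check each of these steps (and the one that follows) without doing the arithmetic by hand, here's a small gram-schmidt sketch in numpy — my own code, not the course's — applied to these three vectors:

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize a list of linearly independent vectors."""
    basis = []
    for v in vectors:
        u = v.astype(float)
        # subtract the projection onto every basis vector found so far
        for e in basis:
            u = u - (u @ e) * e
        basis.append(u / np.linalg.norm(u))
    return basis

v1 = np.array([1, 1, 1])
v2 = np.array([2, 0, 1])
v3 = np.array([3, 1, -1])

e1, e2, e3 = gram_schmidt([v1, v2, v3])
print(e1 * np.sqrt(3))   # [1.  1.  1.]   i.e. e1 = (1, 1, 1) / sqrt(3)
print(e2 * np.sqrt(2))   # [1. -1.  0.]   i.e. e2 = (1, -1, 0) / sqrt(2)
print(e3 * np.sqrt(6))   # [1.  1. -2.]   i.e. e3 = (1, 1, -2) / sqrt(6)
```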
working that through component by component: 3 minus 1 minus 1 is 1; 1 minus 1 minus minus 1 is plus 1; and minus 1 minus 1 minus 0 is minus 2. so u3 is (1, 1, -2), and normalizing that gives e3 equals 1 over root 6 times (1, 1, -2). let's just check: (1, 1, -2) is normal to (1, -1, 0), it's normal to (1, 1, 1), and those two are normal to each other, so we have an orthonormal basis set — the 1 over root 6 and the other factors make sure the vectors are all of unit length. so i can write down my new transformation matrix, which i'll call e: it's the matrix whose columns are the basis vectors e1, e2, e3, and notice that the first two columns span the plane while the third is the normal to the plane. i've redrawn everything more compactly so we can carry on. we've got our original vectors v1 and v2; e1 is the normalized version of v1; e2 is the part of v2 perpendicular to e1, normalized to unit length — so those are both in the plane — and e3 is normal to the plane: it's the bit of v3 we can't make by projecting onto e1 and e2, again of unit length. now say i've got a vector r over here — say r is (2, 3, 5) — and what i want to do is reflect r down through this plane, through where it intersects, and out the other side, to get a vector r prime. this looks really awkward: the plane is off at some funny angle, composed of these vectors, and dropping r through it directly feels like it's going to involve a lot of trigonometry. but the neat thing is that i can think of r as being composed of a vector that's in the plane — some combination of e1's and e2's — plus a vector that's normal to the plane, some number of e3's. when i reflect, the bit that's in the plane stays the same, and the bit that's some number of e3's just flips sign. so, written as a transformation matrix in the plane's basis, the reflection t sub e keeps the e1 bit the same, keeps the e2 bit the same, and turns the e3 bit from up to down: it's the matrix with 1, 1, -1 down the diagonal and zeros elsewhere — a reflection in the plane. so just by thinking about it carefully i can write down the reflection, but notice that t sub e is written in the basis of the plane, not in my basis — that's the version that's easy to define. therefore, if i can get the vector r into the plane's basis vector set, the e basis, i can do the reflection there and then put the result back into my basis, and then i'll have the complete transformation. a way of thinking about it: trying to go straight from r to r prime with a single transformation matrix in my basis is hard, but i can transform r into the basis of the plane using e to the minus 1 — e to the minus 1 is the thing that gets a vector of mine into the new basis, remember bear — then i can do my reflection transformation there, and i get r that's been transformed, r prime, in the basis of the plane.
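here's that whole pipeline in numpy — my own sketch, reusing the gram-schmidt results from above — so you can check the answer that's worked out next:

```python
import numpy as np

# orthonormal basis from the gram-schmidt step above:
# two vectors spanning the mirror plane, then the plane's normal
E = np.column_stack([
    np.array([1, 1, 1]) / np.sqrt(3),
    np.array([1, -1, 0]) / np.sqrt(2),
    np.array([1, 1, -2]) / np.sqrt(6),
])

# the reflection is trivial in the plane's basis: keep the in-plane parts,
# flip the component along the normal
T_E = np.diag([1, 1, -1])

r = np.array([2, 3, 5])

# E is orthogonal, so its inverse is just its transpose
T = E @ T_E @ E.T
r_prime = T @ r
print(r_prime)        # compare with the hand calculation that follows
print(r_prime * 3)    # the same answer scaled up by 3, to make the thirds obvious
```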
then i can read that back into my basis by multiplying by e, because e is the thing that takes a vector in the plane's basis — bear's basis, if you like — and puts it back into my basis. so i can avoid the hard thing by going around: apply e to the minus 1, then t in the e basis, then e, and having done those three operations i've done the complete transformation and i get r prime. the problem reduces to doing that matrix multiplication. i've put the logic up there so that i have space down here, and i've written down the transformation we're going to do. a couple of things to note. one is that because we carefully constructed e by the gram-schmidt process to be orthonormal, we know that e transpose is the inverse, so calculating the inverse isn't going to be a pain in the neck. the other is that, compared with the situation with bear where we were changing basis, here we go from our vector r into the plane's coordinate system, do the reflection there, and then come back to our basis — so the e and the e to the minus 1 are flipped round compared to the last video, because we're running the logic the other way. it's quite neat. the actual multiplication isn't awfully edifying — it's just arithmetic — so i've written it down here, and if you want to verify it you can pause the video. this is t sub e times the transpose of e; i take that and multiply it by e itself to get e times t sub e times e transpose, which simplifies to this matrix, and that's t. it all comes out quite nicely. then we apply that to r: t times (2, 3, 5) gives us r prime, and the answer is one third of (11, 14, 5). that's a calculation that would have been very, very difficult to do with trigonometry, but once we have a basis made of the plane of the mirror and its normal, the reflection operation becomes very easy — it's quite magical, really. so what we've done here is an example where we put everything we've learned about matrices and vectors together to do something fun: reflecting a point in space in a mirror. this might be useful, for instance, if you wanted to transform images of faces for the purposes of facial recognition — transform my face from being like that to being like that — and then use neural networks and machine learning to do the recognition part. in summary, this week we've gone out into the world with matrices, learned about constructing orthogonal bases and changing bases, and related that back to vectors and projections. it's been a lot of fun, and it sets us up for the next topic, which sam is going to lead, on eigenvalues and eigenvectors. welcome to the final module. i'm dr sam cooper from the dyson school of design engineering at imperial college london, and i'm an energy researcher who uses machine learning as a tool to help me understand lithium-ion batteries. i'll be taking over from dave for the last topic of the linear algebra course, but if you continue
with the specialization i'll also be leading the next course which is on multivariate calculus in this final module we're going to be focusing on eigen problems which will require us to draw on nearly all of the ideas that you've learned in the previous modules then to finish the course you'll be applying this tool in a coding exercise to recreate google's famous pagerank algorithm which takes your internet search results and displays them in some kind of convenient order i hope you enjoy this module and make use of the forums to let us know how you're getting on the word eigen is perhaps most usefully translated from the german as meaning characteristic so when we talk about an eigen problem we're talking about finding the characteristic properties of something but characteristic of what this module like the previous weeks will try and explain this concept of eigenness primarily through a geometric interpretation which allows us to discuss images rather than immediately getting tangled up in the maths when i first learned this topic not only did we start by grinding through formulae we also spent very little time discussing what all the maths was doing this topic is often considered by students to be quite tricky but it's my belief that once you know how to sketch these problems the rest is just algebra so as you've seen from previous weeks it's possible to express the concept of linear transformations using matrices these operations can include scalings rotations and shears often when applying these transformations we are thinking about what they might do to a specific vector however it can also be useful to think about what it might look like when they're applied to every vector in the space this can be most easily visualized by drawing a square centered at the origin and then seeing how your shape is distorted when you apply the transformation so if we apply a scaling of two in the vertical direction the square would now become a rectangle whereas if we applied a horizontal shear to this space it might look something like this now here's the key concept we are using our little square to help us visualize what is happening to many vectors but notice that some vectors end up lying along the same line that they started on whereas others do not to highlight this i'm going to draw three specific vectors onto our initial square now let's consider our vertical scaling again and think about what will happen to these three vectors as you can see the horizontal green vector is unchanged pointing in the same direction and having the same length the vertical pink vector is also still pointing in the same direction as before but its length is doubled lastly the diagonal orange vector used to be exactly 45 degrees to the axis its angle has now increased as has its length i hope you can see that actually besides the horizontal and vertical vectors any other vector's direction would have been changed by this vertical scaling so in some sense the horizontal and vertical vectors are special they are characteristic of this particular transform which is why we refer to them as eigenvectors furthermore because the horizontal vector's length was unchanged we say that it has a corresponding eigenvalue of 1 whereas the vertical eigenvector doubled in length so we say it has an eigenvalue of two so from a conceptual perspective that's about it for 2d eigenproblems we simply take a transform and we look for the vectors who are still laying along the same span as before and then we measure how much their length 
has changed. that's basically all eigenvectors and their corresponding eigenvalues are. let's look at two more classic examples to make sure we can generalize what we've learned. here's our marked-up square again, and now let's look at a pure shear — pure meaning that we aren't performing any scaling or rotation in addition, so the area is unchanged. as i hope you spotted, it's only the green horizontal vector that is still lying along its original span; all the other vectors get shifted. finally, let's look at rotation. clearly this has no eigenvectors at all, as every vector has been rotated off its original span. in this lecture we've already covered almost all of what you need to know about eigenvectors and eigenvalues. although we've only been working in two dimensions so far, the concept is exactly the same in three or more dimensions. in the rest of the module we'll look at some special cases, as well as discussing how to describe what we've observed in more mathematical terms. as we saw previously, eigenvectors are those vectors which lie along the same span both before and after applying a linear transform to a space, and eigenvalues are simply the amount that each of those vectors has been stretched in the process. in this video we're going to look at three special cases, to make sure the intuition we've built so far is robust, and then we'll try to extend the concept into three dimensions. the first example is a uniform scaling, where we scale by the same amount in each direction. as you will hopefully have spotted, not only are all three of the highlighted vectors eigenvectors — for a uniform scaling, any vector at all is an eigenvector. in the second example we look at rotation again. in the previous video we applied a small rotation and found that it had no eigenvectors; however, there is one case of non-zero pure rotation which does have at least some eigenvectors, and that's rotation by 180 degrees. as you can see, the three highlighted vectors still lie along the same spans as before, just pointing in the opposite direction. this means that, once again, every vector is an eigenvector of this transform, and they all have eigenvalue minus 1: the eigenvectors haven't changed length, but they all now point the opposite way. the third case is a combination of a horizontal shear and a vertical scaling, and it's slightly less obvious than the previous examples. just like the pure shear we saw earlier, the green horizontal vector is an eigenvector, and its eigenvalue is still one. but despite the fact that neither of the other two highlighted vectors is an eigenvector, this transformation does have two eigenvectors. i've now added the second eigenvector onto the image, and it shows that although the concept is fairly straightforward, eigenvectors aren't always easy to spot. let's now apply the inverse transform and watch our parallelogram go back to its original square, this time with that eigenvector visible — hopefully you're at least convinced that it is indeed an eigenvector, as it stays on its own span. this spotting problem is even tougher in three or more dimensions, and many of the uses of eigen theory in machine learning frame the system as having hundreds of dimensions or more, so clearly we're going to need a more robust mathematical description of the concept to allow us to proceed. before we do, let's take a look at one quick example in 3d.
clearly, scaling and shear are going to operate in much the same way in 3d as they do in 2d. however, rotation takes on a neat new meaning. as you can see from the image, although both the pink and green vectors have changed direction, the orange vector has not moved. this means the orange vector is an eigenvector, but it also gives us a physical interpretation: if we find the eigenvector of a 3d rotation, we've also found the axis of rotation. in this video we've covered a range of special cases, which i hope have prompted questions in your mind about how we're going to write down a formal definition of an eigenproblem — and that's exactly what we'll be discussing next time. see you then. hopefully you all now have a reasonable feeling for what an eigenproblem looks like geometrically, so in this video we're going to formalize the concept into an algebraic expression, which will allow us to calculate eigenvalues and eigenvectors whenever they exist. once you've understood the method, you'll be in a good position to see why you should be glad that computers can do this for you. if we consider a transformation a, what we have seen is that if it has eigenvectors at all, then these are simply the vectors which stay on the same span following the transformation. they can change length, and even point in the opposite direction entirely, but if they remain on the same span they are eigenvectors. if we call our eigenvector x, then we can write the expression a x equals lambda x, where on the left-hand side we're applying the transformation matrix a to the vector x, and on the right-hand side we're simply stretching the vector x by some scalar factor lambda. so lambda is just some number, and we're trying to find the vectors x that make the two sides equal. another way of saying this is that for our eigenvectors, applying a just scales their length, or does nothing at all — which is the same as scaling the length by a factor of 1.
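to make the equation concrete, here's a tiny numpy check — my own snippet, not the course's — using the vertical scaling from the earlier examples:

```python
import numpy as np

A = np.array([[1, 0],
              [0, 2]])            # scale by 2 in the vertical direction

# the horizontal and vertical vectors satisfy A x = lambda x ...
print(A @ np.array([1, 0]))       # [1 0]  -> eigenvalue 1
print(A @ np.array([0, 1]))       # [0 2]  -> eigenvalue 2

# ... but the diagonal vector is knocked off its span, so it isn't an eigenvector
print(A @ np.array([1, 1]))       # [1 2], no longer a multiple of [1 1]
```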
so in this equation, a is an n-dimensional transform, meaning it must be an n by n square matrix, and the eigenvector x must therefore be an n-dimensional vector. to help us find the solutions, we can rewrite the expression by putting all the terms on one side and factorizing: (a minus lambda i) times x equals zero. if you're wondering where the i term came from, it's just an n by n identity matrix — a matrix the same size as a, with ones along the leading diagonal and zeros everywhere else. we didn't need it in the first expression we wrote, because multiplying a vector by a scalar is defined; however, subtracting a scalar from a matrix is not, so the i just tidies up the maths without changing the meaning. now, for the left-hand side to equal zero, either the contents of the brackets must be zero or the vector x must be zero. we're not actually interested in the case where x is zero — that's when it has no length or direction, and it's what we call a trivial solution — so we must find when the term in the brackets is zero. referring back to the material in previous parts of the course, we can test whether a matrix operation will squash things down to zero by calculating its determinant: det of (a minus lambda i) equals zero. calculating determinants manually is a lot of work for high-dimensional matrices, so let's just try applying this to an arbitrary 2 by 2 transformation, a equals (a, b; c, d). substituting this into our eigen-finding expression gives the determinant of (a, b; c, d) minus (lambda, 0; 0, lambda). evaluating this determinant, we get what is referred to as the characteristic polynomial: lambda squared minus (a plus d) lambda plus (a d minus b c) equals zero. our eigenvalues are simply the solutions of this equation, and we can then plug those eigenvalues back into the original expression to calculate our eigenvectors. rather than continuing with the generalized form, this is a good moment to apply the method to a simple transformation whose eigen-solution we already know. take the case of a vertical scaling by a factor of two, represented by the transformation matrix a equals (1, 0; 0, 2). applying the method we just described, we take the determinant of a minus lambda i and set it to zero and solve: the determinant of (1 minus lambda, 0; 0, 2 minus lambda) is (1 minus lambda) times (2 minus lambda), which of course must equal zero, so the equation has solutions at lambda equals 1 and lambda equals 2. thinking back to our original eigen-finding formula, (a minus lambda i) x equals zero, we can now substitute these two solutions back in. in the case where lambda equals 1, we get (1 minus 1, 0; 0, 2 minus 1) times the vector (x1, x2), which is (0, 0; 0, 1) times (x1, x2), which is (0, x2) — and for that to equal zero, x2 must be zero. in the case where lambda equals 2, we get (1 minus 2, 0; 0, 2 minus 2), which is (minus 1, 0; 0, 0), times (x1, x2), which is (minus x1, 0) — and for that to equal zero, x1 must be zero.
so what do these two expressions tell us? in the case where the eigenvalue lambda equals 1, we've got an eigenvector whose x2 component must be zero, but we don't really know anything about the x1 component — and that's because any vector that points along the horizontal axis is an eigenvector of this system. so we write that at lambda equals 1 the eigenvector is (t, 0) for an arbitrary parameter t: anything along the horizontal axis, as long as it's zero in the vertical direction. similarly, for the lambda equals 2 case, the eigenvector is (0, t): as long as it doesn't move at all in the horizontal direction, any purely vertical vector lies along its original span and is therefore also an eigenvector. so now we have two eigenvalues and their two corresponding eigenvectors. let's now try the case of a rotation by 90 degrees anticlockwise, to ensure that we get the result we expect — which, if you remember, is no eigenvectors at all. the transformation matrix corresponding to a 90 degree rotation is a equals (0, -1; 1, 0), so applying the formula once again we get the determinant of (minus lambda, -1; 1, minus lambda), which works out to lambda squared plus 1 equals zero — and that has no real-numbered solutions at all, hence no real eigenvectors. we could still calculate some complex eigenvectors using imaginary numbers, but that's beyond what we need for this particular course. despite all the fun we've just been having, the truth is that you will almost certainly never have to perform this calculation by hand. furthermore, we saw that our approach required finding the roots of a polynomial of order n — the dimension of your matrix — which means the problem very quickly stops being solvable by analytical methods alone, so when a computer finds the eigen-solutions of a hundred-dimensional problem, it's forced to employ iterative numerical methods. however, i can assure you that developing a strong conceptual understanding of eigenproblems will be much more useful than being really good at calculating them by hand. in this video we translated our geometrical understanding of eigenvectors into a robust mathematical expression and validated it on a few test cases, but i hope i've also convinced you that grinding through lots of eigenproblems by hand, as is often done in engineering undergraduate degrees, is not a good investment of your time if you already understand the underlying concepts — this is what computers are for. next video we'll be referring back to the concept of changing basis, to see what magic happens when you use eigenvectors as your basis. see you then.
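for reference, here's how you'd hand both of these worked examples to numpy — my own snippet — and read off the same answers:

```python
import numpy as np

# vertical scaling by 2
vals, vecs = np.linalg.eig(np.array([[1, 0],
                                     [0, 2]]))
print(vals)     # [1. 2.]
print(vecs)     # columns [1, 0] and [0, 1]: the horizontal and vertical axes

# rotation by 90 degrees anticlockwise
vals, vecs = np.linalg.eig(np.array([[0, -1],
                                     [1,  0]]))
print(vals)     # complex pair +i and -i: no real eigenvalues, hence no real eigenvectors
```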
so, now that we know what eigenvectors are and how to calculate them, we can combine this idea with the concept of changing basis covered earlier in the course. what emerges from this synthesis is a particularly powerful tool for performing efficient matrix operations, called diagonalization. sometimes we need to apply the same matrix multiplication many times. for example, imagine that a transformation matrix t represents the change in location of a particle after a single time step, so our initial position, described by the vector v0, multiplied by t gives us the new location v1. to work out where the particle will be after two time steps, we find v2 by simply multiplying v1 by t, which is of course the same as multiplying v0 by t twice, so v2 equals t squared times v0. now imagine that we expect this same linear transformation to occur every time step, for n time steps, so vn equals t to the power of n times v0. you've already seen how much work it takes to apply a single 3d matrix multiplication, so if t tells us what happens in one second, but we'd like to know where our particle is two weeks from now, then n is going to be around 1.2 million — we'd need to multiply t by itself more than a million times, which may take quite a while. if all the terms in the matrix are zero except for those along the leading diagonal, we refer to it as a diagonal matrix, and when raising matrices to powers, diagonal matrices make things a lot easier — have a go just now to see what i mean. all you need to do is put each of the terms on the diagonal to the power of n and you've got the answer: in this case t to the n has a to the n, b to the n and c to the n on its diagonal. simple enough — but what if t is not a diagonal matrix? well, as you may have guessed, the answer comes from eigen-analysis. essentially, what we're going to do is change to a basis in which our transformation t becomes diagonal — which is what we call an eigenbasis — then easily apply our power of n to the diagonalized form, and finally transform the resulting matrix back again, giving us t to the power of n but avoiding much of the work. as we saw in the section on changing basis, each column of our transform matrix simply represents the new location of the transformed unit vectors, so to build our eigenbasis conversion matrix c we just plug in each of our eigenvectors as columns: c equals (eigenvector 1, eigenvector 2, eigenvector 3), as we're using a three-dimensional example here. don't forget that some of these may be complex — so not easy to spot using a purely geometrical approach — but they appear in the maths just like the others. applying this transform, we find ourselves in a world where multiplying by t is effectively just a pure scaling, which is another way of saying that it can now be represented by a diagonal matrix. crucially, this diagonal matrix d contains the corresponding eigenvalues of t: d has lambda 1, lambda 2 and lambda 3 on the diagonal, with zeros elsewhere. we're now very close to unleashing the power of eigen. the final link we need is this: bringing together everything we've just said, applying the transformation t is the same as converting to our eigenbasis, applying the diagonal matrix, and then converting back again, so t equals c d c inverse. that suggests t squared can be written as c d c inverse multiplied again by c d c inverse — and hopefully you've spotted that in the middle of the expression on the right-hand side we have c inverse multiplied by c, but multiplying a matrix by its inverse is the same as doing nothing at all, so we can simply remove that operation, leaving c d d c inverse, which is c d squared c inverse. we can of course generalize this to any power of t we'd like, so finally, t to the power of n equals c times d to the power of n times c inverse. we now have a method which lets us apply a transformation matrix as many times as we'd like without paying a large computational cost. this result brings together many of the ideas we've encountered so far in this course, and in the next video we'll work through a short example, just to ensure that this approach lines up with our expectations when applied to a simple case. see you then.
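here's the whole trick in a few lines of numpy — my own sketch, using an arbitrary symmetric matrix rather than anything from the lectures — comparing the diagonalization route with repeated multiplication:

```python
import numpy as np

T = np.array([[2., 1.],
              [1., 2.]])          # an arbitrary example transformation
n = 10

# eigenbasis: columns of C are the eigenvectors, D holds the eigenvalues
eigvals, C = np.linalg.eig(T)

# T^n = C D^n C^-1, and D^n is just an element-wise power of the diagonal
T_n = C @ np.diag(eigvals ** n) @ np.linalg.inv(C)

print(np.allclose(T_n, np.linalg.matrix_power(T, n)))   # True
```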
now that we've walked through the theory of eigenbases and diagonalization, let's have a go at a simple 2d example where we can see the answer graphically, so we'll know whether or not our method has worked as expected. consider the transformation matrix t equals (1, 1; 0, 2). hopefully you'd feel fairly comfortable at this point drawing the transformed square and vectors used in the previous examples: the first column is just (1, 0), so our i-hat vector is unchanged, while the second column tells us that j-hat, the second vector, moves to the point (1, 2). let's also consider the orange diagonal vector at the point (1, 1). multiplying through, (1, 1; 0, 2) times (1, 1) — thinking about rows times columns — gives 1 plus 1 and 0 plus 2, which is (2, 2). it's interesting to note that this particular transform could be decomposed into a vertical scaling by a factor of two followed by a horizontal shear of half a step. because we've chosen such a simple transformation, hopefully you've already spotted the eigenvectors and can state their eigenvalues: at lambda equals 1 the eigenvector is (1, 0), and at lambda equals 2 the eigenvector is (1, 1). now let's consider what happens to the vector (-1, 1) when we apply t: (1, 1; 0, 2) times (-1, 1) is, rows times columns, minus 1 plus 1 and 0 plus 2, which is (0, 2). and if we apply t again, we get (1, 1; 0, 2) times (0, 2), which is 0 plus 2 and 0 plus 4, so (2, 4). instead, we could start by finding t squared. t squared is this matrix multiplied by itself, and applying rows times columns we get 1 times 1 plus 1 times 0, which is 1; then 1 times 1 plus 1 times 2, which is 3; then a 0; and then 0 plus 2 times 2, which is 4 — so t squared is (1, 3; 0, 4). applying this to our vector, (1, 3; 0, 4) times (-1, 1) is minus 1 plus 3 and 0 plus 4, which of course equals (2, 4). we can now try the whole process again, using our eigenbasis approach. we've already built our conversion matrix c from the eigenvectors as columns, so c equals (1, 1; 0, 1), but we now need its inverse. because we've picked such a simple 2 by 2 example, it's possible to write the inverse down directly, by noticing that c is just a horizontal shear of one step to the right, so c inverse must be the same shear back to the left: c inverse equals (1, -1; 0, 1). it's worth noting that, despite how easy this was, i would still always feed this to the computer rather than risk making silly mistakes. we can now construct our problem: t squared equals c d squared c inverse, which in our case is (1, 1; 0, 1), multiplied by the square of our diagonal matrix, which has 1 and 2 on the diagonal, multiplied by c inverse, which is (1, -1; 0, 1).
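if you'd rather let the computer do the grinding that comes next, here's the check in numpy (my snippet, not the course's):

```python
import numpy as np

T = np.array([[1, 1],
              [0, 2]])
C = np.array([[1, 1],
              [0, 1]])            # columns are the eigenvectors (1, 0) and (1, 1)
D = np.diag([1, 2])               # the corresponding eigenvalues

T2 = C @ np.linalg.matrix_power(D, 2) @ np.linalg.inv(C)
print(T2)                          # [[1. 3.] [0. 4.]]
print(np.allclose(T2, T @ T))      # True
print(T2 @ np.array([-1, 1]))      # [2. 4.]
```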
working this through, we keep the first matrix, (1, 1; 0, 1), and work out the right-hand part first: the diagonal matrix squared is (1, 0; 0, 4), and multiplying that by (1, -1; 0, 1), rows times columns, gives 1 in the first row first column, 0 in the second row first column, minus 1 in the first row second column, and 4 in the second row second column. working it through one more step — more grinding — (1, 1; 0, 1) times (1, -1; 0, 4) gives 1 in the first row first column, 0 in the second row first column, minus 1 plus 4 which is 3 in the first row second column, and 4 in the second row second column: (1, 3; 0, 4) again. applying this to the vector (-1, 1): (1, 3; 0, 4) times (-1, 1) is, rows times columns, minus 1 plus 3 and 0 plus 4, which equals (2, 4) — which, pleasingly enough, is the same result as we found before. now, there is a sense in which, for much of mathematics, once you're sure you've really understood a concept, then because of computers you may never have to do this by hand again. however, it's still good to work through a couple of examples on your own, just to be absolutely sure you get it. there are of course many aspects of eigen theory that we haven't covered in this short video series, including non-diagonalizable matrices and complex eigenvectors, but if you're comfortable with the core topics we've discussed then you're already in great shape. in the next video we're going to look at a real-world application of eigen theory to finish off this linear algebra course. it's a particularly famous application, which requires abstracting away from the sort of geometric interpretations we've been using so far — which means you'll be taking the plunge and just trusting the maths. see you then. the final topic of this module on eigenproblems, as well as the final topic of this course as a whole, focuses on an algorithm called pagerank. this algorithm was famously published by, and named after, google founder larry page and colleagues in 1998, and was used by google to help them decide in which order to display websites returned from a search. the central assumption underpinning pagerank is that the importance of a website is related to its links to and from other websites — and somehow eigen theory comes up. this bubble diagram represents a model mini-internet, where each bubble is a webpage and each arrow from a, b, c and d represents a link on that webpage which takes you to one of the others. we're trying to build an expression that tells us, based on this network structure, which of these webpages is most relevant to the person who made the search. as such, we're going to use the concept of procrastinating pat, an imaginary person who goes on the internet and just randomly clicks links to avoid doing their work. by mapping all the possible links, we can build a model to estimate the amount of time we'd expect pat to spend on each webpage. we can describe the links present on page a as a vector, where each row is either a 1 or a 0 depending on whether there is a link to the corresponding page, and then normalize the vector by the total number of links, so that it describes a probability distribution for that page. for example, the vector of links from page a would be built from (0, 1, 1, 1), because page a has links to
sites b, c and d, but not to itself. and because there are three links on this page in total, we normalize by a factor of a third, so that the total click probability sums to one: l_a equals (0, 1/3, 1/3, 1/3). following the same logic, the link vectors of the next two sites are as shown, and finally for page d, which is connected to b and c but not a — two sites in total — we can write l_d equals (0, 1/2, 1/2, 0). we can now build our link matrix l by using each of our link vectors as a column, which, as you can see, forms a square matrix. what we're trying to represent with our matrix l is the probability of ending up on each of the pages. for example, the only way to get to a is by being at b, so you then need to know the probability of being at b, which you could have got to from either a or d. as you can see, this problem is self-referential: the ranks of all the pages depend on all the others. although we built our matrix from columns of outward links, you can see that the rows describe inward links, normalized with respect to their page of origin. we can now write an expression which summarizes the approach. we'll use the vector r to store the rank of all webpages. to calculate the rank of page a, you need to know three things about all the other pages on the internet: what's your rank, do you link to page a, and how many outgoing links do you have in total. the following expression combines these three pieces of information for webpage a only: r_a equals the sum from j equals 1 to n, where n is the number of webpages, of the entry of the link matrix relevant to webpage a at location j, multiplied by the rank at location j. this scrolls through each of our webpages, which means the rank of a is the sum of the ranks of all the pages which link to it, weighted by their specific link probability taken from the matrix l. now we want to be able to write this expression for all pages and solve them simultaneously, and thinking back to our linear algebra, we can rewrite the above expression, applied to all webpages, as a simple matrix multiplication: r equals l r. clearly we start off not knowing r, so we simply assume that all the ranks are equal and normalize them by the total number of webpages in our analysis, which in this case is four: r equals (1/4, 1/4, 1/4, 1/4). then, each time we multiply r by our matrix l, we get an updated value for r, so we can say that r_(i+1) equals l times r_i. applying this expression repeatedly means we are solving the problem iteratively: each time we do this, we update the values in r, until eventually r stops changing, so that r really does equal l times r. thinking back to the previous videos in this module, this implies that r is now an eigenvector of the matrix l, with an eigenvalue of 1.
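here's a numpy sketch of this iteration for the mini-internet above — my own code. note that the lecture only reads out the link vectors for pages a and d; for b and c i've assumed that b links to a and d and that c links only to d, which is the structure that reproduces the ranking quoted a little later (roughly 12%, 24%, 24%, 40%).

```python
import numpy as np

# columns are the outgoing-link probability vectors for pages a, b, c, d.
# the b and c columns are assumptions (see note above), not read out in the lecture.
L = np.array([
    #  a     b     c     d
    [0.0,  0.5,  0.0,  0.0],   # links INTO a
    [1/3,  0.0,  0.0,  0.5],   # links INTO b
    [1/3,  0.0,  0.0,  0.5],   # links INTO c
    [1/3,  0.5,  1.0,  0.0],   # links INTO d
])

r = np.ones(4) / 4             # start with all ranks equal

for _ in range(100):           # power method: keep multiplying by L
    r_next = L @ r
    if np.allclose(r_next, r, atol=1e-10):
        break
    r = r_next

print(np.round(r, 3))          # approximately [0.12, 0.24, 0.24, 0.4]
```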
at this point you might well be thinking that if we want to multiply r by l many times, perhaps this would be best tackled by applying the diagonalization method we saw in the last video — but don't forget, that would require us to already know all the eigenvectors, which is what we're trying to find in the first place. so, now that we have an equation, and hopefully some idea of where it came from, we can ask our computer to apply it iteratively until it converges, to find our rank vector. you can see that although it takes about 10 iterations for the numbers to settle down, the order is already established after the first iteration — however, this is just an artifact of our system being so tiny. so now we have our result, which says that as procrastinating pat randomly clicks around our network, we'd expect them to spend about 40 percent of their time on page d, but only about 12 percent of their time on page a, with 24 percent on each of pages b and c. we now have our ranking, with d at the top, a at the bottom, and b and c equal in the middle. as it turns out, although there are many approaches for efficiently calculating eigenvectors that have been developed over the years, repeatedly multiplying a randomly selected initial guess vector by your matrix — which is called the power method — is still very effective for the pagerank problem, for two reasons. firstly, although the power method will clearly only give you one eigenvector, when we know that there will be n for an n-webpage system, it turns out that because of the way we've structured our link matrix, the vector it gives you will always be the one you're looking for, with an eigenvalue of one. secondly, although this is not true for the four-webpage mini-internet, when looking at the real internet you can imagine that almost every entry in the link matrix will be zero — most pages don't connect to most other pages. this is referred to as a sparse matrix, and algorithms exist such that the multiplication can be performed very efficiently. one key aspect of the pagerank algorithm that we haven't discussed so far is the damping factor, d. this adds an additional term to our iterative formula, so r_(i+1) is now equal to d times l times r_i, plus (1 minus d) over n. d is something between zero and one, and you can think of it as one minus the probability with which procrastinating pat suddenly, randomly types in a web address, rather than clicking on a link on his current page. the effect this has on the actual calculation is about finding a compromise between the speed and the stability of the iterative convergence process. there are over 1 billion websites on the internet today, compared with just a few million when the pagerank algorithm was first published in 1998, and so the methods for search and ranking have had to evolve to maximize efficiency, although the core concept has remained unchanged for many years. this brings us to the end of our introduction to the pagerank algorithm. there are of course many details which we didn't cover in this video, but i hope this has allowed you to come away with some insight and understanding into how pagerank works, and hopefully the confidence to apply it to some larger networks yourself. this also brings us to the end of the fifth module, and to the end of this course on linear algebra for machine learning. we've covered a lot of ground in the past five modules, but i hope that we've managed to balance the speed with the level of detail to ensure that you've stayed with us throughout.
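picking up the damping factor formula from the pagerank discussion just above, the update in the earlier sketch would change as follows — again my own illustration; d = 0.85 is a commonly quoted choice, not a value given in the lecture:

```python
import numpy as np

def pagerank(L, d=0.85, tol=1e-10, max_iter=1000):
    """Power iteration with a damping factor d, as in r_(i+1) = d L r_i + (1 - d)/n."""
    n = L.shape[0]
    r = np.ones(n) / n
    for _ in range(max_iter):
        r_next = d * (L @ r) + (1 - d) / n
        if np.allclose(r_next, r, atol=tol):
            break
        r = r_next
    return r

# e.g. pagerank(L) with the link matrix L defined in the earlier sketch
```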
there is a tension at the heart of mathematics teaching in the computer age. classical teaching approaches focus on working through lots of examples by hand, without much emphasis on building intuition; however, computers now do nearly all of the calculation work for us, and it's not typical for the methods appropriate to hand calculation to be the same as those employed by a computer. this can mean that, despite doing lots of work, students can come away from a classical education missing both the detailed view of the computational methods and the high-level view of what each method is really doing. the concepts that you've been exposed to over the last five modules cover the core of linear algebra that you will need as you progress your study of machine learning, and we hope that, at the very least, when you get stuck in the future you'll know the appropriate language, so that you can quickly look up some help when you need it — which, after all, is the most important skill of a professional coder. that's it for this course on linear algebra in our specialization on mathematics for machine learning. our goal was to give you some of the underpinnings of linear algebra, in order to enable you to access neural networks, machine learning and data science courses more generally. this is intended as a course for a social scientist, engineer or physicist to develop the insight required to access other courses on machine learning tools, and to do useful work solving problems with machine learning in the real world. it's not intended as a foundation for a graduate linear algebra course, and nor have we had the time to discuss topics like the cross product that don't have applications in machine learning. we started out in the course by thinking of some data problems — fitting functions to data, and the apples and bananas price discovery problem — which we used to take us on a journey into linear algebra, first through vectors and then matrices. through it all you did exercises to try out what you'd learned, and then took those into a couple of programming exercises in python at the end. this will have given you the confidence to say that you're really getting towards an intuitive understanding of linear algebra and how to apply it in code. if you've enjoyed this course, we continue the journey in machine learning and data science in the next course in this specialization, multivariate calculus. there we look at calculus and how to find the gradients of functions of more than one variable, which will let us examine minimization and how to fit models to data. once we have minimization, linear algebra and probability, you'll have all the underpinning mathematical tools and concepts you need to then look at principal component analysis, machine learning and neural networks. sam and i, and all the team here at imperial, really hope that you've found this course valuable, useful and fun, and of course we wish you the very best for the future and look forward to seeing you again in the next course. welcome to introduction to multivariate calculus, course two of the mathematics for machine learning specialization. we'll be embarking on a whistle-stop tour of calculus, with a special focus on interactive animations and practical coding examples. i'm dr sam cooper from the dyson school of design engineering at imperial college london, and i use machine learning as a tool to help me characterize and design energy materials. i'm also an online-learning addict. to understand multivariate calculus, you need to be able to apply the linear algebra methods covered in the previous course
so if you've not completed it already you may wish to go back many important machine learning approaches such as neural networks have calculus at their very core so the aim of this course is for you to see the connection between the maths and the meaning the ideal outcome will be for you to finish this specialization and then feel confident enough to immediately dive into one of the many wonderful applied machine learning courses already available online we're going to start the course by laying down the foundations of calculus through the use of intuition building animations then we're going to generalize these concepts to multi-dimensional systems where you will discover how to navigate mountain ranges in the dark next you'll be learning exactly how calculus is used to train neural networks as well as working through how we would actually write some code to do this using the python programming language finally you will put all of your multivariate calculus skills into action with my colleague professor david dye who will explain the fundamentals of regression before asking you to write some code to fit complicated functions to real data i really hope you enjoy the course and i look forward to hearing from you in the forums welcome to module one of six on this course introducing you to various concepts in calculus that we believe you'll find useful when studying machine learning we start right from the basics but build you up fairly quickly to some interesting applications in modules five and six this week we'll be focusing on the fundamental theory of calculus as well as some handy rules to speed things up where possible i'm going to try and represent the concepts graphically so that you can see the derivations rather than just reading them this sometimes means that we'll be skimming over a few details but i'll also provide you with links to the more rigorous descriptions so that you can check them out if you're interested speaking for myself i find it generally true that if i can't draw something then i don't really get it even if i could write down all the relevant equations so for the tests i'm going to be asking you to identify solutions on graphs rather than grinding through the algebra for the most part i hope you enjoy this first module and i look forward to meeting many of you in the discussion forums before diving into calculus we should first talk briefly about what functions are and where we use them essentially a function is a relationship between some inputs and an output so for example if i had a function for modeling the distribution of temperature in this room i might input the x y and z coordinates of a specific location i'm interested in as well as the time t and then the function would return me the temperature at that specific point in space at that moment in time like so many areas of maths even if the idea is quite straightforward often the notation can make things unnecessarily confusing we'll be confronted by this again later in the course and although it's sometimes fairly arbitrary historical reasons that decide this like different people inventing different parts of maths at different times it can also be because a particular notation style is more convenient for the specific application that it was developed for however and here is where a lot of the problems seem to arise in order to use and play with the interesting applications of maths it requires you to have done quite a large amount of often quite boring groundwork mathematical language is like learning 
any other language in that respect you can't enjoy french poetry until you've learned a lot of french vocabulary and grammar including all its quirks and irregularities quite understandably some people find this off-putting my french is still terrible for example this is made worse by the fact that most people have not even realized that there is maths poetry waiting for them at the end of this algebra tunnel machine learning is a whole genre of this poetry so stick with us for the rest of the specialization and you'll be ready to appreciate it and even write some of your own back to functions we often see expressions such as f of x equals x squared plus three perhaps the only thing worth mentioning about this notation style is yes i admit it it is absurd that you should somehow just know that f brackets x means f is a function of x and not f multiplied by x sometimes this gets genuinely unclear when other bracket terms appear in your expression for example f of x equals this thing over here you can assume that g h and a are all not variables otherwise i would have to write f of x g h and a but you could only know for sure what was going on here if it was explained to you with more context for example is g a function being applied to x what about h and a over here maybe they're both just constants although you do face the same problem of missing context in any language like the famous panda who goes to a restaurant and eats shoots and leaves and they were both delicious i can only apologize on behalf of maths and hopefully console you with two facts firstly maths is at the core of my day job and i often still get confused myself and secondly i still love it and it get better at second guessing any missing context all the time just like any other language in the following exercise i'm either going to show you some data or describe a concept to you and you're going to have to select a function which you think might be used to represent it this is not intended to be a difficult exercise but what i would like you to understand is that selecting a function is the creative essence of science this process of selecting a candidate function or hypothesis to model a world is what the great geniuses of science are remembered for there then follows a potentially long and difficult process of testing this hypothesis but there will be nothing to test without that first creative step calculus is simply the study of how these functions change with respect to their input variables and it allows you to investigate and manipulate them but ultimately it's just a set of tools and by the end of this course you'll be using them yourself to model some real world data in the last video i told you that calculus is just a set of tools for describing the relationship between a function and the change in its variables and so in this video we're going to explore what this means and how it might be useful let's start by having a look at a classic example a speed versus time graph for a car the most obvious thing this graph tells us is that the car's speed is not constant because a constant speed would just be a flat horizontal line furthermore starting from zero speed at zero time this car's speed is initially increasing with time which is another way of saying that it is accelerating towards the end of the time period the car's speed is shown to be rapidly decreasing meaning that it is decelerating there's a lot of information in this graph and calculus will allow us to extract much more than just the speed as we've 
already said a horizontal line implies a constant speed and then the more the line is sloping up the greater the acceleration in fact acceleration can be defined as the local gradient of a speed time graph and clearly acceleration itself is also a function of time in our example we refer to the gradient at a single point as the local gradient and we can illustrate this concept by drawing a tangent line which is a straight line that touches the curve at a point and also has the same gradient as the curve at that point after the initial acceleration the car's speed reaches a peak and then begins to decelerate again deceleration has a negative slope by recording the slope of these tangent lines at every point we could plot an entirely new graph which would show us acceleration against time rather than speed against time before we plot this for the complex case let's think about what this acceleration time graph would look like for a car travelling at constant speed at constant speed we've got a flat horizontal line on our speed time graph then we can say that its gradient is of course zero so the acceleration time graph would also just be a horizontal line but in this case at an acceleration equal to zero going back to our more complex case let's just talk through what we should expect before we have a go at plotting it so initially the gradient is positive and fairly constant before it drops to zero at the peak it then becomes negative for a period before returning to zero let's now take a look at the graph for acceleration versus time overlaid onto the speed time graph don't forget the vertical axis for the blue line is speed and will have units of distance per time whereas the vertical axis for the orange line is acceleration and will have units of distance per time squared because they've got different units we could scale either of these two lines vertically in this plot and the meaning would still be identical however these have been scaled just to make the most use of the plot area available you can see that the points at which the acceleration function is 0 ie where it crosses the horizontal axis coincide with where the speed time graph is flat and has zero gradient as we would expect although we will be discussing the formal definition of a derivative in a later video what we've just done by eye is the essence of calculus where we took a continuous function and described its slope at every point by constructing a new function which is its derivative we can in principle plot the derivative of the acceleration function following the same procedure where we simply take the slope of the acceleration function at every point this is the rate of change of acceleration which we can also think of as being the second derivative of the speed and it's actually referred to as the jerk of the car think of the jerky motion of a car as it stops and starts now you may have never heard of this concept before but hopefully just by telling you that it's the derivative of the acceleration curve this should be all you need to know to approximately sketch the jerk also very interesting is the idea of taking our baseline speed function and trying to imagine what function this would have been the gradient of as in applying the inverse procedure to the one that we've just discussed we can refer to this as the antiderivative which for those who've done calculus before you may remember is closely related to something called the integral for the example we're discussing here it would represent the distance of
the car from its starting position this should make more sense when you consider that the change in the distance with respect to time i.e the slope of the distance time graph i.e how much distance you are covering per unit time is just the speed this analysis of slopes is all we're going to discuss in the video and even though you haven't laid out yet the formal definition of calculus you should already be able to answer lots of gradient type questions in the following exercise which will put you in a very strong position to start thinking about differentiation more formally now that we've explored the relationship between functions and their gradient we should be ready to lay out the more formal definition of a derivative all we're going to do is translate the understanding about gradients that we saw in the previous video into some mathematical notation that we can write down unfortunately this is the bit that a lot of people seem to find uncomfortable so i will do my best to make it as painless as possible we talked in the last video about horizontal lines having a gradient of zero whilst upwards and downward sloping lines having positive or negative gradients respectively we can write down a definition of this concept by taking the example of a linear function which has the same gradient everywhere so if we start by picking any two points let's say here and here we can then say that the gradient of this line is equal to the amount of that function that increases in this interval divided by the length of the interval that we're considering this description is often just to condense to the expression rise over run where rise is the increase in the vertical direction and run is the distance along the horizontal axis so we've got the rise here and the run down here if our function was sloping down and we pick points at the same horizontal locations then our run would be the same but our rise would now be negative so our gradient equals rise divided by run fairly straightforward so far but how does it relate to the more complicated functions we saw previously where the gradient is different at every point now the rise over run type gradient varies depending on where we choose our points let's pick a single point where we'd like to know the gradient which we'll say is at point x on the horizontal axis the value of our function at this point is therefore clearly just f of x we're going to use the same logic as before so we now need to pick a second point to draw our rise over run triangle we can call the horizontal distance between our two points delta x where as usual a delta is being used to express a small change in something which means our second one must be at position x plus delta x we can also write down the vertical position of our second point as a function f evaluated at our new location x plus delta x ie f of x plus delta x so we can now build an expression for the approximate gradient at our point x based on the rise of a run gradient between point x and any second point remembering that the run will be our distance delta x and our rise is just the difference in heights of the two points the last step in this process is simply to notice that for nice smooth continuous functions like the one we're showing here as delta x gets smaller the line connecting two points becomes a better and better approximation of the actual gradient at our point x we can express this concept formally by using the limit notation scheme which says that as delta x goes to zero our expression will give us 
a function for our gradient at any point we choose which we write as f dash of x or d f by dx depending on which notation scheme you prefer this is a slightly strange concept as we're not talking about delta x equals zero as dividing by zero is not defined but instead just when delta x is extremely close to zero this process of extreme rise over run is what differentiation is so when you're asked to differentiate a function this literally just means substitute the function into this expression we could fill up several videos with more robust interpretations of this infinitely small but non-zero delta x but for now just don't worry about it too much we know more than enough to continue on our journey let's now put our new derivative expression into practice and see if it works first we should try this out on a linear function once again as we know that the answer is just going to be a constant so what will the gradient of f of x equals three x plus two be so we can immediately sub this straight in and say okay f dash of x is going to equal the limit as delta x goes to 0 of so we're taking our function and subbing it in 3 times x plus delta x plus 2 minus 3 x plus 2 all divided by delta x okay and now we just work through the algebra so we can say the limit as delta x goes to zero of and so we can expand these brackets out so we're going to get three x plus three delta x plus two minus three x minus 2 all divided by delta x okay and now we can look at this and say some of the terms are going to cancel so this 3x here goes with this 3x here and this plus 2 here goes with this minus 2 here so we can keep going with this line and say well the limit and it's going to be 3 delta x all over delta x and then working down here we can say okay well the delta x's themselves are just going to cancel so you end up with the limit as delta x goes to 0 of just 3. now this thing doesn't have a delta x in it so actually the answer for this one is just the number 3.
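if you'd like to check that limit for yourself in code, here is a minimal sketch that evaluates the same rise over run limit symbolically; it assumes the sympy python library, which isn't part of the course materials, so treat it as an optional extra

```python
# a quick check of the limit-of-rise-over-run definition for f(x) = 3x + 2
# (sketch assuming the sympy library is available)
import sympy as sp

x, dx = sp.symbols('x delta_x')
f = 3*x + 2

rise_over_run = (f.subs(x, x + dx) - f) / dx
print(sp.limit(rise_over_run, dx, 0))  # prints 3, matching the hand calculation
```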
because all the delta x terms have disappeared in this case then the limit expression has no effect so we can just ignore it the gradient of our linear function is just a constant as we expected in fact i'm sure many of you will have known already that the gradient of an equation of the form f equals ax plus b is just a but it's reassuring to see this result emerging from our expression something else to take away from this simple example is that we actually differentiated two things at once a three x bit and a plus two bit we could have differentiated them separately and then added them together and still got the same result this interchangeability of the approach is called the sum rule and it's pretty handy let's now try a slightly more complicated example so if we take the function f of x equals 5 x squared all we're going to do is take this thing and put it into our differentiation expression at the top so f dash of x equals the limit as delta x goes to zero of well we've got five x plus delta x all squared minus 5 x squared all divided by delta x and all we're going to do now is work through the algebra and see what comes out the other side so equals the limit once again as delta x goes to 0 of and let's expand this thing out so we're going to think x plus delta x squared we're going to get x squared plus 2 x delta x plus delta x squared so which means we're going to get 5 x squared plus 10 x delta x plus 5 delta x squared once again minus our 5x squared here and all of that is divided by delta x and now what do we got on this top row here so we've got a pair of 5x squares so we can get rid of those because it's 5x squared and minus 5x squared over here okay and we can also see that we've got a delta x in both the top terms and also in this bottom term here so we can get rid of that with this and this over here okay so let's write that line again so we can now say it's the limit as delta x goes to zero of we've just got 10 x plus uh 5 delta x okay and we're looking at this thing so it's saying the limit as delta x goes to 0 of this expression here now only the second term has got a delta x in it so what's going to happen as our delta x goes to 0 we're going to just forget about this term it's going to become extremely small so we can write that this is equal to just 10 x so the derivative of the expression 5x squared is just 10x we can generalize the lesson from this example to a rule for handling functions with powers of x for example if we take the function f of x equals ax to the power of b and substitute it into our differentiation expression we will find that the derivative is always f dash of x equals a b x to the power of b minus one so the original power gets multiplied to the front and then the new power is just one less than it was before this is known as the power rule and you can put this into your calculus toolbox along with the sum rule that we saw earlier so you've now seen two examples in which we've applied the limit of rise over run method to differentiate two simple functions but as you can probably imagine if we wanted to differentiate a long complicated expression this process is going to become quite tedious what we're going to see in later videos are more rules like the sum rule and the power rule which will help us speed up the process however before we do that we're just going to look at some fairly magical special case functions which differentiate in an interesting way in this video we're going to run through three special case functions which give us 
interesting results when differentiated the first example we're going to work through is the function f of x equals 1 over x which you can see plotted in the corner now take a minute to notice that the gradient of this function is negative everywhere except at the point x equals zero where we can't see what it is and actually something quite interesting must be happening at this point as on the negative side the function drops down presumably towards negative infinity but then it somehow re-emerges from above on the positive side this sudden break in our otherwise smooth function is what we refer to as a discontinuity we've mentioned already that the operation divide by zero is undefined which means that this function simply doesn't have a value at the point x equals zero but what about the gradient well let's sub our function into the differentiation expression to investigate so if we take it f of x equals one over x we can say that f dash of x must equal the limit as delta x goes to zero of 1 over x plus delta x minus 1 over x all divided by delta x and we look at this thing and we say we're going to have to make this top row the numerator have a single fraction so we're gonna have to combine these two fractions here which means that we're gonna have to make the denominators the same so we'll multiply top and bottom by x here and top and bottom by x plus delta x here so we can say that the limit of x over x brackets x plus delta x minus x minus x plus delta x over x brackets x plus delta x all divided by once again delta x now looking at these two we can see that we've got an x and a minus x both with the same denominator so we can subtract these now so this cancels with this so we can say okay this thing is going to be the limit of minus delta x divided by x brackets x plus delta x all divided by delta x now at this point you can see that you've got a delta x at the very top and a delta x at the bottom so once again these two terms cancel each other out and what you're left with is the limit as delta x goes to zero of minus 1 divided by x squared we're going to open up this bracket plus x delta x and this once again is where the magic of limits comes into play so we look at this function and we say okay this term here has got a delta x in it this means that as delta x becomes very very small this term itself is going to become very small and therefore eventually become irrelevant so what we can say is as we apply our limit we can actually ignore this term here entirely and we get minus 1 divided by x squared which looks like this so as we realize just by looking this derivative function is negative everywhere and like our base function the derivative is also undefined at x equals zero the second function we're going to look at i'll start simply by explaining what it does it's a function that has the special property that the value of the function f of x is always equal to the value of its own gradient f dash of x now there is a boring function which has this property which is the function f of x equals zero because as it's a horizontal line clearly both the function and the gradient are zero everywhere but there's also a much more interesting case we're not going to work through the rigorous derivation but we can skip through some of the more interesting bits so let's start by noticing that our mystery function must always either be positive or always be negative as if it ever tried to cross the horizontal axis then both the function and the gradient would be zero and so it gets stuck 
so we'd just be at our boring function zero again the next thing to realize is that by virtue of always increasing or always decreasing it can never return to the same value again plenty of functions could fit these criteria and focusing on the positive case they all look something like this however besides the zero function there is only one function that will satisfy all our demands this is the exponential function e to the x where e is euler's number named after the 18th century mathematician the number e which is approximately 2.718 is very important for the study of calculus but more than that e like pi turns up all over mathematics and seems to be written all over the fabric of the universe as differentiating e to the x gives us e to the x clearly we can just keep differentiating this thing as many times as we'd like and nothing's going to change this self-similarity is going to come in very handy the last special case function that we're going to talk about in this video are the trigonometric functions sine and cosine you may recall that for a right angle triangle sine of angle x multiplied by the hypotenuse r gives you the length of the opposite side to the angle and the graph of sine x looks like this let's take a look at this function and see if we can work out what shape its derivative would be by i so sine x starts with a positive gradient which gently decreases until it's zero at the top of the bump and then it starts being negative again until it gets the bottom of the next bump and it turns out that the derivative of sine x is actually just cosine x now what happens when we differentiate cosine x well actually we get minus sine x differentiating a third time gives us minus cosine x and then amazingly differentiating a fourth time brings us all the way back to our original function sine x and then the pattern of course repeats this self-similarity may remind you somewhat of the exponential function we discussed above and that is because these trigonometric functions are actually just exponentials in disguise albeit quite a convincing disguise which we won't be discussing here many of the details of the functions that we've talked about in this video were skimmed over rather quickly but for the benefit of this particular course all i need you to understand is that differentiation is fundamentally quite a simple concept even when you might not be able to battle through all the algebra you're ultimately still just looking for the rise of a run gradient at each point this pragmatic approach to calculus is going to come up again when we start talking about calculating gradients with computers sometimes we really can just find an equation for the gradient as we've been doing for all of our examples so far however if instead of a nice smooth function we just have discrete data points then it can seem as if there's nothing for us to differentiate but as we shall see rise over run comes back once again to save the day now that we've defined the differentiation operation in terms of an exact mathematical formula it's become clear that even for relatively simple functions calculating derivatives can be quite tedious however mathematicians have found a variety of convenient rules that allow us to avoid working through the limit of rise over run operation whenever possible so far we've met the sum rule and the power rule in this video we will cover a convenient shortcut for differentiating the product of two functions which sensibly enough is called the product rule it is of course 
possible to derive the product rule purely algebraically but it doesn't really help you to develop much insight into what's going on for all the topics covered in this course i'd always prefer you came away with some degree of intuitive understanding rather than just watching me grind through examples so let's see if we can make any progress by drawing things imagine a rectangle where the length of one side is the function f of x and the other side is the function g of x this means that the product of these two functions must give us the rectangle's area which we can call a of x now consider that if we differentiate f of x times g of x what we're really looking for is the change in area of our rectangle as we vary x so let's see what happens to the area when we increase x by some small amount delta x a quick footnote here for the case we've shown we've picked a particularly friendly pair of functions where they both happen to increase with x however this won't necessarily always be the case but it does just make drawing things a lot easier and the conclusions would ultimately be the same we can now divide up our rectangle into four regions one of which was our original area a of x as the total edge length along the top is now f of x plus delta x this means that the width of the new region must be the difference between the new width and the original width and of course the same logic applies to the height we can now write an expression for just the extra area which we will call delta a this is the sum of the areas of the three new rectangles i want to avoid a drawn out conversation about limits but fundamentally i hope you can see that as delta x goes to zero although all of the new rectangles will shrink it's the smallest rectangle that is going to shrink the fastest this is the intuition that justifies how we can ultimately ignore the small rectangle and leave its contribution to the area out of our differential expression altogether now that we've got our expression for approximating delta a we return to our original question what is the derivative of a with respect to x so we want the limit of delta a divided by delta x ie rise over run which means we also need to divide the right hand side by delta x we are so close at this point all we need to do is slightly rearrange this equation so firstly by splitting it into two fractions and then secondly by moving f of x and g of x out of the numerators what i hope you can see now is that the first part contains the definition of the derivative of g of x and the second part contains the derivative of f of x which means that we are now ready to write down our final expression for the derivative of a with respect to x which has now just been reduced to this the derivative of a of x is just f of x times the derivative of g of x plus g of x times the derivative of f of x so we can now add the product rule to the list of tools in our calculus toolbox if we want to differentiate the product of two functions f of x and g of x we simply find the sum of f of x g dash of x and g of x f dash of x this is going to come in really handy in later videos as we start to deal with some more complicated functions see you then so far we have learned about the sum rule the power rule and the product rule in this video we will be discussing our fourth and final tool for this module which is called the chain rule following this our toolbox will then be sufficiently well stocked that we'll be ready to start tackling some heftier more interesting problems sometimes we use
functions as the inputs of other functions as you can probably imagine describing this can get a little bit confusing so what we're going to do is give each of our functions meanings so that we can hopefully keep track of what's going on consider the nested function h of p of m we have the function h which is describing how happy i am as a function of p how many pizzas i've eaten that day then the pizzas i get to eat per day is itself a function of m which is how much money i make so we're still ultimately relating money to happiness but via the concept of pizza this nested function scenario comes up a lot in science and engineering as you relate chains of concepts together and i suppose this particular example i've chosen here also gives you some insight into my priorities in life so first i'm going to give you the function relating happiness and pizza which has the following polynomial form h of p equals minus a third p squared plus p plus one over five which is easily understandable from a plot what we can see is that although without any pizza it's still possible to be happy in principle my peak happiness is with about one and a half pizzas any more pizza than this and i become less happy and then beyond about three pizzas my happiness becomes rapidly negative next is our function relating pizza and money p of m equals e to the power of m minus one which is also fairly straightforward to understand by looking at a plot if you've got no money you can't buy any pizza but the more money you have your pizza purchasing power increases exponentially as at first you can take advantage of bulk buy discounts but as you start getting really rich you can buy your own pizza oven and eventually even build your own pizza factory what we'd like to know is by considering how much money i have now how much effort should i put into making more if my aim is to be happy to work this out we're going to need to know what the rate of change of happiness is with respect to money which is of course just d dh by dm for now this relatively simple example we could just directly substitute our pizza money function into our happiness pizza function which would give us this and then differentiate this thing directly however the chain rule provides us with a more elegant approach which importantly will still work even for complicated functions where direct substitution like this may not be an option consider the derivative of h with respect to p and of p with respect to m you'll notice that in this particular notation convention where the derivatives are represented by quotients the product of these two quantities looks like it would give you the desired function dh by dm and in actual fact this is a perfectly sensible way to think about what's going on this approach is called the chain rule because in a sense we are making a chain of derivative relationships now this is certainly not what you might describe as a formal derivation but it is already good enough to enable you to make use of the chain rule effectively so let's now apply this rule to our function firstly let's differentiate our two functions which give us dh by dp equals 1 minus 2 over 3p dp by dm equals e to the power of m and then multiplying these together is simple enough however we just need to remember that if we don't want pizzas p to appear in our final expression then we just need to sub in our expression for p in terms of m and then finally by simply rearranging the terms we recover the expression that we saw at the start of the video so dh by 
dm equals one third e to the m times 5 minus 2 e to the power of m and there we have it now don't forget that although for this simple example the chain rule didn't save us a huge amount of time compared to substituting in before the differentiation don't let that fool you what's magic about the chain rule is that for some real world applications we may not have a nice analytical expression for our function but we may still have the derivatives so being able to simply combine them with the chain rule becomes very powerful indeed before we finish let's have a quick look at our money happiness function and its derivative on a graph as we can see if you're broke then it's really worthwhile making some money but the benefit of getting more especially once you have enough pizza decreases dramatically and quickly becomes negative we now have a fourth and final tool for the module and in the next video we'll be putting them all to use in an example in this video we're going to work through a nasty looking function that will require us to use all four of the time saving rules that we've learned so far to make matters worse this function isn't going to describe anything familiar so we'll just be flying blind and have to trust the maths this will hopefully give you the confidence to dive into questions in the following exercise consider the rather nasty function f of x equals sine of two x to the power of five plus three x all over e to the seven x the essence of the sum product and chain rules is all about breaking your function down into manageable pieces so the first thing to spot is that although it is currently expressed as a fraction we can rewrite f of x as a product by moving the denominator up and raising it to the power of minus one there is actually another rule specifically for dealing with fractions directly called the quotient rule but it requires memorizing an extra expression so we're not going to cover it in this course as you can always use the approach that we've used here and rewrite your fraction as a product next we'll split f of x up into the two parts of the product and work out how to differentiate each part separately ready to apply the product rule later on let's start with the first part which we can call g of x we've got the trigonometric function sine applied to a polynomial 2x to the 5 plus 3x this is a classic target for the chain rule so all we need to do is take our function and split it up into the two parts in order to apply the chain rule so we can take g of u equals sine of u and u of x equals 2 x to the power of 5 plus 3 x now we've got these two separate functions and we're going to want to differentiate each of them in order to apply the chain rule so we say okay g dash of u is going to just equal sine differentiates to cos so cos u and u dash of x is going to equal 10 x to the power of 4 plus 3. now we've got these two expressions and i'm going to use a mixed notation for convenience and i'm going to say okay well d g by d u multiplied by d u by d x is going to give us so we've got cos of u multiplied by 10 x to the power of 4 plus 3. and now we're going to want to have a final expression that doesn't include a u so we can think about these two cancelling each other out and we get d g by d x and this just equals cos of u which is just cos of 2 x to the power of 5 plus 3 x all multiplied by 10 x to the power of 4 plus 3.
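before carrying on, here is one way you could check that chain rule result in code; this is just a sketch assuming the sympy python library rather than anything from the course notebooks

```python
# checking the chain rule result for g(x) = sin(2x^5 + 3x)
# (sketch assuming the sympy library is available)
import sympy as sp

x = sp.symbols('x')
g = sp.sin(2*x**5 + 3*x)

by_hand = sp.cos(2*x**5 + 3*x) * (10*x**4 + 3)  # the derivative we just derived
by_sympy = sp.diff(g, x)                         # sympy's own differentiation

print(sp.simplify(by_hand - by_sympy))           # prints 0, so the two agree
```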
and that's it we now have an expression for the derivative of g of x and already we've made use of the chain rule the sum rule and the power rule now for the second half of our original expression which we will call h of x once again we can simply apply the chain rule after splitting our function up so we can say that h of v equals e to the v and v of x equals minus 7 x now once again we've got our two functions we just want to find the derivatives so h prime of v is just going to equal well we've seen this one before e to the v again and v prime of x is going to just be minus seven so combining these two back together and using different notation we can say that d h by d v multiplied by d v by d x is just going to give us we've got minus 7 times e to the minus 7 x and actually already all of our v's have disappeared so this is already our final expression as we now have expressions for the derivatives of both parts of our product we can just apply the product rule to generate the final answer and that's it we could rearrange and factorize this in various fancy ways or even express it in terms of the original function however there is a saying amongst coders that i think can be applied quite generally which is that premature optimization is the root of all evil which in this case means don't spend time tidying things up and rearranging them until that you're sure that you've finished making a mess i hope that you've managed to follow along with this example and will now have the confidence to apply our four rules to other problems yourself in calculus it's often the case that some initially scary looking functions like the one we worked with just now turn out to be easy to tame if you have the right tools meanwhile other seemingly simple functions occasionally turn out to be beasts but this can also be fun so happy hunting that brings us to the end of module one i hope those of you familiar with calculus from high school still managed to find this a useful refresher and for those of you new to calculus well done for sticking out to the end of the first module in the rest of this course we're going to be extending our analysis of functions to multi-variable systems and also applying calculus to some interesting data analysis problems however the basic ideas that we learnt in this module like the limit of rise over run operation will be at the core of everything we'll be doing so if you're happy so far then you're in a great position to make it through to the end see you in module two welcome to module 2 of this course on calculus as i mentioned at the end of the first module if you're happy with the concept of differentiation then in many ways everything we add from here will just be extensions of this core idea as well as some interesting applications the title of this course is multivariate calculus so you won't be surprised to hear that we are now going to generalize the concept of gradients to multi-variable systems the words multivariable and multivariate are typically used interchangeably and i will undoubtedly be using both in this module there is however a subtle difference between these two terms which is related to whether there are multiple input variables or multiple output variables or both but this distinction is not critical in the study of calculus that we're doing here so we can leave it to the statisticians to sweat over with more than one variable to play with we will now be able to use calculus to help us navigate around high dimensional spaces and hopefully by the end 
of the module we'll find some treasure see you then in the first module we started by trying to develop a strong visual intuition relating derivatives to the gradient of a function at each point we then followed this up by describing four handy rules to help speed up the process of finding derivatives however all of the examples we looked at were for systems involving a single variable so now we're going to have to see what happens when we apply the same idea to systems with many variables known as multivariate systems but before we do that we need to talk about what a variable is in the first place previously we've shown examples of problems where one of the variables is a function of the other ie y equals f of x but where it wouldn't necessarily make sense to say that x equals g of y for example a vehicle's speed is clearly a function of time as at each time the vehicle can only have one speed however we could not say that the time was a function of the vehicle's speed as there might be multiple times at which the vehicle was traveling at any given speed this is why we typically refer to the speed as a dependent variable as it depends on time and so time can be thought of as an independent variable in this particular context typically when you first learn calculus you take functions containing variables and constants and then differentiate the dependent variables such as speed with respect to independent variables like time however what gets labeled as a constant or a variable can be subtler than you might think and will require you to understand the context of the problem being described let's continue using the example of a car to see how this might come up consider the following highly simplified expression relating the force f generated by a car's engine to its mass m acceleration a aerodynamic drag d and velocity v if you're driving a car then you can change your speed and acceleration by pressing the accelerator pedal to adjust the force but the mass and the drag are fixed features of the car's design this means that we would call the force an independent variable as you control it directly but your speed and acceleration are dependent variables as they are consequences of your applied force your mass and drag coefficients are clearly constants in this context however if you are a car designer looking to design each engine size in a new fleet perhaps your market research has given you a specific acceleration and speed target so in this context your force is still the independent variable but now speed and acceleration are constants whereas the mass and drag have become variables which you can adjust by redesigning your car we refer to these slightly confusing variable design constants as parameters we often think of varying them as exploring a family of similar functions rather than describing them as variables in their own right but these are largely engineering distinctions and not the kind of thing that would worry a mathematician as you will see later in this course when fitting functions to some data it's the parameters of our fitting function which we are varying to find the best fit which means that we'll have to differentiate with respect to these parameters there is still some logic at work here but my advice to you is just don't worry about it too much the key takeaway here is you can in principle differentiate any term with respect to any other so don't get caught off guard when things you thought were constants suddenly start to vary let's now look at a different
example imagine that you wanted to manufacture a metal can so you need to understand the relationship between the various key design parameters we can start by writing a reasonable approximation for the can's empty mass m by breaking the area down into pieces so the circles on top and bottom are pi r squared each and when we unroll the body we get a rectangle where the width must be the same as the circumference of the circle ie 2 pi r and we'll just call the height h so taking these areas and multiplying them by a thickness t we get the total volume of metal in the can finally multiplying this by the metal's density rho we get its mass so we can say that the mass m is going to equal the area of these two little circles pi r squared times two of course plus the area of the rectangle in the middle so that's the circumference two pi r multiplied by the height h and both of these are going to be multiplied by the thickness t and the density rho so t rho and t rho i've written it in this long form here to make life a bit easier later on at this point what should we label as a constant or a variable well with the exception of pi which is definitely a constant for this universe it's not entirely clear we could in principle change any of the radius the height the wall thickness even the material's density so let's find the derivative of the can's mass with respect to each of them to calculate these all we do is when differentiating with respect to a certain variable simply consider all of the others to behave as constants so starting with an easy one let's try h so dm by dh is going to equal this thing 2 pi r t rho as the first term did not contain the variable h and we're treating all the other terms as constants then just like last week a constant simply differentiates to zero so we ignore it whereas for the second term it does contain h and it's just multiplied by some constants so differentiating this just leaves those constants we can see from this expression that the partial derivative with respect to h no longer even contains h which is what you would expect as the mass will vary linearly with the height when all else is kept constant notice that instead of using the normal d that we saw from doing derivatives in the last module we must use the curly partial symbol which signifies that you've differentiated a function of more than one variable let's now find the partial derivative with respect to the other variables starting with r so partial m by partial r is going to equal so there's an r squared term here which differentiates to 2r so we're going to have 4 pi r t rho and this term just contains an r so the r disappears leaving 2 pi h t rho next we're going to do the variable t and we can say dm by dt is going to equal well there's just a t in here so it's not very exciting 2 pi r squared rho plus 2 pi r h rho and finally all we've got left is rho so we can say dm by d rho is just going to equal 2 pi r squared t plus 2 pi r h t and really although this is quite a straightforward example that's basically it for partial differentiation it's no more complicated than the univariate calculus we met last module the only difference being that you have to be careful to keep track of what you're considering to be held constant when you take each derivative i hope this short introduction to multivariate calculus has helped you see that this concept is nothing to be intimidated by partial differentiation is essentially just taking a multi-dimensional problem and pretending that
it's just a standard 1d problem when we consider each variable separately i look forward to showing you the pretty amazing things that we can use this concept for later in the module see you there we've already seen how to think about partial differentiation as just a simple extension of the single variable method that we derived in the last module in this video we're going to explore some slightly more complicated partial differentiation examples and we're also going to build something called the total derivative of a function so let's dive straight into a slightly more tricky problem than we saw last time consider the function f of x y and z equals sine of x e to the power of y z squared we're now just going to work through and find the derivatives with respect to each of these three variables so let's start with x as the exponential term does not refer to x we can treat it as a constant and sine as we saw last week differentiates to cosine so we can write df by dx is just going to equal cos x e to the y z squared next we'll differentiate with respect to y in this case the sine term does not refer to y so we treat this as a constant but for the exponential term we can either apply the chain rule to it or just remember that the result of this operation for an exponential will just be to multiply the derivative of the exponent to the front and the derivative of y z squared with respect to y is just z squared so we can write d f by d y is going to equal sine of x e to the y z squared multiplied by z squared lastly we'll differentiate with respect to z once again the only z is in the exponential term so similar to the previous example we simply multiply through by the derivative of the exponent with respect to z so we just get d f by d z is going to equal sine x once again e to the y z squared and this time the derivative of this thing is 2 y z so now that we have these three partial derivatives we're going to introduce a new idea called the total derivative imagine that the variables x y and z were actually all themselves a function of a single other parameter t where x equals t minus one y equals t squared and z equals one over t and what we're looking for is the derivative of f with respect to t in this simple case we could just substitute for all our three variables directly in terms of t which after simplifying a little bit gives us sine of t minus 1 times e and then differentiate directly with respect to t to get e times cos of t minus 1 however in a more complicated scenario with many variables the expression we needed to differentiate might have become unmanageably complex and perhaps we won't have a nice analytical expression at all the alternative approach is to once again use the logic of the chain rule to solve this problem where the derivative with respect to our new variable t is the sum of the chains of the other three variables as shown in this expression since we've already got our three partial derivatives of f with respect to x y and z now we just need to find the derivatives of the three variables with respect to t and we'll have all the things we need to evaluate our expression so dx by dt is clearly 1 dy by dt is 2t and dz by dt if you remember from our 1 over t example is minus t to the power of -2 nothing hugely complicated here however when we then sub it into our total derivative expression you can probably see why i've chosen not to write this all out by hand as at first the expression is a bit of a monster however after substituting for x y and z all in terms of t and then simplifying down again we can see that
the second and third terms are the same just with opposite signs and so they will cancel each other out then kind of amazingly we arrive at the same result as we saw at the beginning of the lecture so hopefully now you'll be feeling reasonably comfortable with partial differentiation as we've already done quite a lot of it and maybe you can even see why the total derivative function might be quite handy you are now ready to have a go yourselves in the following exercises and then you'll have all the pieces that you'll need in place for us to build our partial derivatives into something really useful see you then previously we saw that we can differentiate functions of multiple variables and that it isn't much harder than the univariate calculus we met at the start of the course in this video we're going to introduce the jacobian which brings in some of the ideas from linear algebra to build these partial derivatives into something particularly useful the concept of the jacobian can be applied to a variety of different problems but in the context of getting started with optimization and machine learning there is a particular scenario that comes up a lot which is the jacobian of a single function of many variables in short if you have a function of many variables so f of x1 x2 x3 etc then the jacobian is simply a vector where each entry is the partial derivative of f with respect to each one of those variables in turn by convention we write this as a row vector rather than a column vector for reasons that will become clear later in the course let's start by looking at a nice simple function to see just how straightforward building a jacobian can be consider the function f of x y and z equals x squared y plus 3 z to build the jacobian we just find each of the partial derivatives of the function one by one so we've got d f by d x is going to equal we're just differentiating with respect to x so we get 2 x y when we treat everything else as constants d f by d y is going to equal just x squared and d f by d z is just going to be well this thing's just a constant and the z disappears so we just get the number three now bringing all of those together we just end up with a jacobian j which is just going to be 2 x y x squared 3.
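if you want to reproduce that jacobian in code, here is a minimal sketch; it assumes the sympy python library, which is an optional extra rather than part of the course materials

```python
# the jacobian row vector of f(x, y, z) = x^2 * y + 3z
# (sketch assuming the sympy library is available)
import sympy as sp

x, y, z = sp.symbols('x y z')
f = sp.Matrix([x**2 * y + 3*z])    # wrap in a 1x1 matrix so we can call .jacobian

J = f.jacobian([x, y, z])
print(J)                            # Matrix([[2*x*y, x**2, 3]])
print(J.subs({x: 0, y: 0, z: 0}))   # Matrix([[0, 0, 3]]) at the origin
```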
so what does this tell us we now have an algebraic expression for a vector which when we give it a specific x y z coordinate we'll return a vector pointing in the direction of steepest slope of this function the vector for this particular function has a constant contribution in the z direction which does not depend on the location selected for example at the point 0 0 0 we can see that our jacobian is just going to be j of 0 0 0 is just going to be well 0 0 again but the number 3 over here so we can see that our jacobian is a vector of length 3 pointing directly in the z direction some of the numerical methods that we will discuss later in this course require us to calculate jacobians in hundreds of dimensions however even for the 3d example we just solved graphical representation is already quite difficult so we are now going to drop down to two dimensions so that we can actually see what's going on here but to keep things interesting we're going to look at a particularly complicated but rather attractive function so here is the equation that we're going to plot but i'm showing it to you briefly just so that you can see that even though it is a bit of a monster i hope you would agree that just with the tools we've seen so far we really could calculate the partial derivatives of this thing but you wouldn't really learn anything new from grinding through this instead i'm going to show you the results graphically so here is this function with the x-axis horizontal and the y-axis vertical and the colors are indicating the value of z with the bright region suggesting high values and the dark regions suggesting low values because of the nature of this particular function i know that nothing interesting happens outside of the regions shown here so we can forget about it for now although this plot is fairly clear our intuition about the gradient is a bit lacking in this format so let's briefly make things 3d where the values of z are now also represented by the height so as we said at the start of the video the jacobian is simply a vector that we can calculate for each location on this plot which points in the direction of the steepest uphill slope furthermore the steeper the slope the greater the magnitude of the jacobian at that point hold the image of this 3d space in your mind as we now go back to 2d and rather than showing all of the grid points i used to plot the graph let's instead convert to a contour plot where just like a map of actual mountains we will draw lines along the regions of the same height which for us means the same value of zed this removes a lot of the clutter from the plot which is useful for the final step that i'd like to show you which will be adding lots of jacobian vectors on top of our contour plot however before i do that let's take a look at these four points and see if your intuition is now in tune by guessing which one will have the jacobian with the largest magnitude overlaying the jacobian vector field we can see that they are clearly all pointing uphill away from the low dark regions and towards the high bright regions also we see that where the contour lines are tightly packed this is where we find our largest jacobian vectors such as at point a whereas at the peaks of the mountains and at the bottom of the valleys or even out on the wide flat plains our gradients and therefore our jacobians are small my hope is that this clear two-dimensional example will give you the confidence to trust the maths when we come up against much higher dimensional problems 
later in the course see you then in this video we're going to extend the concept of the jacobian from vectors up to matrices which will allow us to describe the rates of change of a vector-valued function however before we do this we are first going to recap by applying what we've learned so far about jacobians to another simple system if we consider the function f of x and y equals e to the power of minus the sum of x squared and y squared then using our knowledge of partial differentiation it's fairly straightforward to find its jacobian vector so we're now going to do the reverse of our approach from the last time and start by looking at the vector field of the jacobians then see if we can understand how the function must look let's start by finding the jacobian at a few specific points firstly the point -1 1 which we'll highlight on our axis in pink substituting these coordinates into our jacobian expression and simplifying we can see a vector pointing directly towards the origin next if we move further out to the point 2 2 and find the jacobian we now get a much smaller vector but pointing once again directly at the origin lastly before i reveal the whole vector field let's look at what's going on at the origin itself subbing in the point 0 0 returns the zero vector suggesting that the function is flat at this point which must mean one of three things either this point is a maximum minimum or something called a saddle which we'll cover later in this module however if we now reveal the rest of the jacobian vector field it becomes clear that the origin must be the maximum of this system let's now move back to the color map representation of this function where the brighter regions represent high values of f so finally we can remove the vector field and observe the function in 3d which i hope matches up with your expectations next we're going to build a jacobian matrix which describes functions that take a vector as an input but unlike our previous examples also give a vector as the output if we consider the two functions u of x and y equals x minus 2y and v of x and y equals 3y minus 2x we can think of these as two vector spaces one containing vectors with coordinates in uv and the other with coordinates in x and y each point in x y has a corresponding location in uv so as we move around x y space we would expect our corresponding path in uv space to be quite different and so it is for those of you familiar with linear algebra you may have guessed already where this is going we can of course make separate row vector jacobians for u and v however as we are considering u and v to be components of a single vector it makes more sense to extend our jacobian by stacking these vectors as rows of a matrix like this so now that we have the structure and motivation for building a jacobian matrix for vector-valued functions let's apply this to our example functions and see what we get so we have u of x y equals x minus 2 y and we've also got v of x y equals 3 y minus 2 x we can build the jacobian from these two by simply saying well the jacobian is going to be let's open some big brackets and say well it's d u d x d u d y d v d x d v d y so let's just work through it so d u d x is just going to be one d u d y is just minus two d v d x is -2 again and dvdy is 3.
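here is the same construction sketched in code again assuming sympy just to show that stacking the two row vector jacobians really does give a constant matrix

```python
# a small sketch (again assuming sympy) of stacking the row vector jacobians of
# u(x, y) = x - 2*y and v(x, y) = 3*y - 2*x into a single jacobian matrix
import sympy as sp

x, y = sp.symbols('x y')
u = x - 2*y
v = 3*y - 2*x

# rows correspond to u and v, columns to the variables x and y
J = sp.Matrix([u, v]).jacobian(sp.Matrix([x, y]))
print(J)  # Matrix([[1, -2], [-2, 3]])
```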
our jacobian matrix no longer even contains any variables which is what we should expect when we consider that clearly both u and v are linear functions of x and y so the gradient must be constant everywhere also this matrix is just the linear transformation from x y space to u v space so if we were to apply the x y vector 2 3 we'd get the following 2 3 and that's going to equal 2 minus 6 that's minus 4 and minus 4 plus 9 that's 5. now this is all well and good but of course many of the functions that you'll be confronted with will be highly non-linear and generally much more complicated than the simple linear example we've just looked at here however often they may still be smooth which means that if we zoom in close enough we can consider each little region of space to be approximately linear and therefore by adding up all the contributions from the jacobian determinants at each point in space we can still calculate the change in the size of a region after transformation a classic example of this occurs when transforming between cartesian and polar coordinate systems so if we have a vector expressed in terms of a radius r and the angle up from the x-axis theta but we'd like them expressed in terms of x and y instead we can write the following expressions just by thinking about trigonometry now we can build the jacobian matrix and take its determinant the fact that the result is simply the radius r and not a function of theta tells us that as we move along r away from the origin small regions of space will scale as a function of r which i hope will make a lot of sense to you when we look at our little animation here that's all for this video i hope you will now be able to build jacobian vectors and matrices for yourself with confidence in the exercises and more importantly that your intuition on the meaning of this concept is starting to develop see you next time we've now seen that the jacobian describes the gradient of a multi-variable system and that if you calculate it for a scalar valued multivariable function you get a row vector pointing up the direction of greatest slope with a length proportional to the local steepness in this video i'm going to briefly introduce a kind of gradient playground to help you further develop your intuition on the jacobian which will also set you up for the following exercises in everyday language we use the word optimization to describe the process of trying to make something as good as it can be in mathematics optimization basically means the same thing as much of the research is dedicated to finding the input values to functions which correspond to either a maximum or a minimum of a system examples of mathematical optimization in action in the real world include the planning of routes through busy cities the scheduling of production in a factory or a strategy for selecting stocks when trading if we go back to the simplest function we saw in the last section and we said that we wanted to find the location of the maximum we could simply solve this system analytically by first building the jacobian and then finding the values of x and y which make it equal to zero however when the function gets a bit more complicated finding the maximum or minimum can get a bit tricky if as in this case we still have an analytical expression then we can at least still find the general expression for the jacobian but now simply setting it to zero is not only much more complicated but it is also not enough as this function has multiple locations with zero gradient if we
assume that all of the maxima and minima of this function can be seen in the region that we're plotting here then just looking at the surface plot of our function makes it very clear where the tallest peak and deepest trough are we refer to all the peaks as maxima but in this case we have a single tallest peak a which we will call the global maximum as well as several local maxima at c and e similarly we refer to all the troughs as minima and we also have a single deepest point at d which we call the global minimum as well as a local minimum at point b all fairly straightforward however there's a very important point here that's perhaps so obvious that you might have missed it imagine standing on the surface with its hills and valleys and we're trying to climb to the top of the highest peak that's no problem we just look around spot the tallest mountain and walk straight towards it but what if we were walking at night this would be much like the scenario where we didn't have a nice analytical expression for our function so we simply aren't able to plot the whole function and look around perhaps each data point is the result of a week-long simulation on a supercomputer or maybe the outcome of an actual real-world experiment these nighttime scenarios very commonly arise in optimization and can be very challenging to solve however if we're lucky we might find that using a torch we can see the jacobian vectors painted on road signs all around us and each one would say peak this way we would have to remember that although the jacobians all point uphill they don't necessarily point to the top of the tallest hill and you could find yourself walking up to one of the local maxima at c or e and even worse when you get there you'll find that all of the road signs are pointing directly at you this nighttime hill walking analogy is often used when discussing the problem of optimization however it does have some misleading features such as the fact that when you're really evaluating a function it's no problem to effectively teleport all over the map as you can try the function at many different places but there's no need to evaluate everywhere in between and the calculation costs the same essentially no matter how far apart the points are so we're not really walking instead we're going to switch to the analogy of a sandpit with an uneven base in the following exercises you're going to try to find the deepest point of a sandpit by measuring the depth at various points using a long stick this is a very deep sandpit so once you push the stick down to the bottom there's no way to move it around sideways you just have to pull it out and try somewhere else also crucially just like our night walking scenario you will have no idea what the peaks and troughs look like at the bottom of the pit because you can't see as the sand is in the way as you work through the exercise i'm hoping that you'll start to pick up on a few of the subtleties of optimization and hopefully it will leave you with a few new questions as well see you next time and have fun in the sandpit so now that you've all spent some time playing in the sandpit we can introduce an additional concept which relates to multivariate systems called the hessian in many ways the hessian can be thought of as a simple extension of the jacobian vector for the jacobian we collected together all of the first order derivatives of a function into a vector now we're going to collect all of the second order derivatives together into a matrix which for a
function of n variables would look like this however this is one of those scenarios where using an abbreviated notation style comes in really handy we saw in the previous module that we can just keep differentiating a function using the same method to find higher and higher order derivatives similarly for a partial derivative if you want to find the second derivative with respect to x1 then x2 it's as simple as just differentiating with respect to x1 assuming all the other variables are constant and then differentiating with respect to x2 again assuming all the other variables are constant as you can see from this general form our hessian matrix will be an n by n square matrix where n is the number of variables in our function f let's now take a look at a quick example it often makes life easier to find the jacobian first and then differentiate its terms again to find the hessian so for our function f of x y and z equals x squared y z so f of x y z equals x squared y z we're going to first build the jacobian for this thing which of course is going to be j equals so we've got differentiate with respect to x we get 2 x y z differentiate with respect to y we're just going to get x squared z and differentiate with respect to z x squared y now using this we can then think about differentiating again with respect to each of the variables which will then give us our hessian matrix so h is just going to equal and put a big bracket here so we want to take this thing and differentiate it with respect to x again so we're going to get 2 y z so the next term along we're going to differentiate this thing with respect to y 2 x z and again with respect to z 2 x y okay now we'll take the next row and we'll say we're going to differentiate this thing with respect to x then y and z so you get 2 x z we're going to differentiate with respect to y so we're going to get nothing we're going to differentiate this thing with respect to z and we get x squared lastly take this term make the last row differentiate with respect to x we get 2 x y and then with respect to y we get x squared and with respect to z we get nothing so one thing to notice here is that our hessian matrix is symmetrical across the leading diagonal so actually once i'd worked out the top right region i could just have written these directly in for the bottom left region this will always be true if the function's second derivatives are continuous meaning that they have no sudden step changes we could now simply pass our hessian an xyz coordinate and it will return a matrix of numbers which hopefully tells us something about that point in the space in order to visualize this we're going to have to drop down to two dimensions again consider the simple function f of x and y equals x squared plus y squared calculating the jacobian and the hessian are both fairly straightforward and hopefully you could have visualized how this function would have looked in your head however if you hadn't known what function we were dealing with and calculated the value of the jacobian at the point zero zero you'd have seen that the gradient vector was also zero but how would you know whether this thing was a maximum or a minimum at that point you could of course go and check some other point and see if it was above or below but this isn't very robust instead we can look at the hessian which in this simple case is no longer even a function of x or y its determinant is clearly just two times two minus zero times zero which is four the power of the hessian is firstly that if its determinant is positive we
know we are dealing with either a maximum or a minimum secondly we then just look at the first term which is sitting at the top left-hand corner of the hessian if this guy is also positive we know we've got a minimum as in this particular case whereas if it's negative we've got a maximum lastly slightly modifying our function to include a minus sign and recalculating our jacobian and our hessian and our hessian determinant we now see the third interesting case this time our hessian determinant is negative so we know that we're not dealing with a maximum or a minimum but clearly at this point zero zero the gradient is flat so what's going on well if you look at the animation what we've got here is a location with zero gradient but with slopes coming down towards it in one direction but up towards it in the other we call this kind of feature a saddle point and they can also cause a lot of confusion when searching for a peak in the last module of this course you're also going to see another way that the hessian can help us with optimization but for now we've simply got another tool to help you navigate the sandpit see you next time in this module we've discussed how to think about two-dimensional functions as landscapes and we've also seen that we can construct jacobian vectors which tell us both the direction and the magnitude of the gradient at each point in space last video we added one further tool to our toolbox which allowed us to double check what kind of feature we were standing on when we landed on a point with a zero gradient these concepts will all be very useful to develop your understanding of optimization problems and have also let you see why multivariate calculus is worth knowing however in this video we are going to remind ourselves about a few features of real systems which so far we've avoided firstly for many applications of optimization such as in the training of neural networks you are going to be dealing with a lot more than two dimensions potentially hundreds or thousands of dimensions this means that we can no longer draw our nice surface and climb its mountains all the same maths still applies but we now have to use our 2d intuition to guide us and enable us to trust the maths secondly as we mentioned briefly before even if you do just have a 2d problem very often you might not have a nice analytical function to describe it and calculating each point could be very expensive so even though in principle a plot could possibly be drawn you wouldn't be able to afford either the supercomputer time or perhaps the laboratory staff to fully populate this thing thirdly all the lovely functions that we've dealt with so far were smooth and well behaved however what if our function contains a sharp feature like a discontinuity this would certainly make navigating the sandpit a bit more confusing lastly there are a variety of factors that may result in a function being noisy which as i'm sure you can imagine might make our jacobian vectors pretty useless unless we were careful so this brings us nicely to the second topic in this video which is a question that i hope you've all been screaming at your screens for the past few minutes if as i said a minute ago we don't even have the function that we're trying to optimize how on earth are we supposed to build a jacobian out of the partial derivatives this is an excellent question and leads us to another massive area of research called numerical methods there are many problems which either don't have a nice explicit formula for the answer
or do have a formula but solving it directly would take until the end of time to fight back against the universe mocking us in this way we have developed a range of techniques that allow us to generate approximate solutions one particular approach that is relevant to our discussion of the jacobian actually takes us right back to the first few lectures on this course where we defined the derivative we started by using an approximation based on the rise over run calculated over a finite interval and then looked at what happens as this interval approached zero all we're doing with the finite difference method is accepting that we're not going to work out the value of the function at every single point in space so we're just going to use the points that we do have and build an approximation for the gradient based on that in the example shown here we have already calculated lots of points on this one-dimensional function but clearly that's not going to be practical for higher dimensional scenarios so all we do is take this logic one step further and say if we start from an initial location and we would like to approximate the jacobian we will simply approximate each partial derivative in turn so taking a small step in x allows us to calculate an approximate partial derivative in x and a small step in y gives our approximate partial in y two things to bear in mind here firstly how big should our little step be well this has to be a balance as if it's too big you will make a bad approximation for reasons that i hope will be obvious by this point but if it's too small then we might run into some numerical issues just remember when your computer calculates the value of the function at a point it only stores it to a certain number of significant figures so if your points are too close together your computer might not register any change at all second as we mentioned earlier what happens if your data is a bit noisy to deal with this case many different approaches have been developed but perhaps the simplest is just to calculate the gradient using a few different step sizes and take some kind of average this brings us to the end of this video so i hope you can see that once we leave the world of nice smooth functions and enter the real world of noisy data and computationally expensive functions things start to get a lot more interesting see you next time that's it for module 2 of the course i hope you're now feeling reasonably confident about multivariate calculus and able to navigate multi-dimensional hyper-sandpits of your own in future see you in the next module where we will be discussing the multivariate chain rule and how it can help you optimize the weightings in a neural network see you then welcome to module 3 of the course in the previous two modules we've seen how to differentiate functions in single and multi-variable spaces as well as hopefully developing some intuition about what's really going on when we do this we've also learned a selection of useful tools for both rapidly calculating derivatives and for navigating high dimensional spaces in this module we're going to upgrade the univariate chain rule so that it can tackle multivariate functions and then see an example of where this might come in handy see you then in the last module we saw something called the total derivative which showed us that when we have a multi-variable function such as f of x y and z but the variables x y and z were themselves each a function of some additional variable t then if we want to calculate the derivative of f
with respect to t we can use this expression which is simply the sum of the chains relating f to t through each of its three variables this allowed us to calculate the result in a piecewise manner rather than substituting everything in at the start and computers are really good at solving piecewise problems quickly what we're now going to do is generalize this concept and also simplify the notation a little if we had a function f of n variables x1 x2 x3 all the way up to xn i can write this as just f of x but now you'll notice that i've written the x in bold just to help you remember that this x represents a sequence of variables which we'll now more conveniently think of as an n dimensional vector once again each of the components of our x vector are themselves functions of some other variable t and what we'd like to know is the derivative of f with respect to t at this point we're going to need some more space to write so we'll keep our function f at the top but you'll just have to remember that each component of x is also a function of t now we are looking to once again build an expression linking f to t through the sum of the chains of each of its variables so we're going to need all the partial derivatives of f with respect to x as well as the derivatives of each component of x with respect to t once again we're going to store all of these objects in a pair of n-dimensional vectors finally we're looking to build a multi-variable chain rule expression so we are looking to find the sum of the product of each pair of terms in the same position in each vector thinking back to our linear algebra this is exactly what the dot product does but there is no need for us to write out these vectors in full so we can simply write the dot product of our two multi-variable derivative expressions and that's it we now have our generalized form of the multi-variable chain rule expressed nice and neatly so we can now update our list of tools to reflect this but actually although we've seen this in practice already in the last module it's worth pointing out that all the rest of our time-saving rules already work for multivariate problems as they are so we'll be ready to put these into practice very soon see you then last video we finished up with an expression for the multivariate chain rule in this video we're going to start by picking up one more little detail which you might have already spotted and then follow this up by adding another link to our chain if we have a function f of x where x is a vector in which each term in x depends on t which we can compactly write like this we also have our compact form of the multivariate chain rule to go with it what i hope you might have noticed last time is that our vector of partial derivatives df by dx is just the same as the jacobian vector which we saw last module except that we wrote it as a column instead of a row vector so from our knowledge of linear algebra we can say that df by dx must be the transpose of the jacobian of f the last thing to realize is that taking the dot product of two column vectors is the same operation as multiplying a row vector by a column vector so finally we can see that our old friend the jacobian offers us perhaps the most convenient representation of the multivariate chain rule next we're going to see that the chain rule still works for more than two links to start us off we'll work through a really quick univariate example where we're going to add in another function separating f from t so we can say f of x is going to equal 5 x and
we can have x of u now equal 1 minus u and u of t finally can be t squared so we've got three functions and we're separating f from t now by an extra step of course we can just sub in each step into each other and find an expression for f as a function directly of t or we can say well it's going to be 5 times whatever x is and x is 1 minus whatever u is and u is t squared so this thing goes to five minus five t squared and of course we can now directly differentiate this thing and say d f by d t equals minus 10 t or we can apply a two-step chain rule and say the following so we can have d f by d t is going to equal d f by d x times d x by d u and finally d u by d t okay now subbing in for each of our terms we can say well df by dx is just five we're going to multiply that by dx by du which is just minus one and finally du by dt is just going to be two t so once again we're going to recover the same answer which is minus 10 t so we can see that this approach works for chains of univariate functions and we could extend it out to as many intermediary functions between f and t as we'd like but what about the multivariate case well the chain rule does work here too but we do just have to pay attention to a few extra details let's start by considering the function f of x of u of t again where the function f takes the vector x as an input but this time x is a vector valued function which also takes a vector u as its input as in the last video the bold symbols indicate vectors so u is again a vector valued function and it takes the scalar t as its input ultimately we are still relating the scalar input t to a scalar output f but we're doing this via two intermediary vector valued functions x and u if once again we'd like to know the derivative of f with respect to t we can essentially write the same expression as the univariate case except that now several of our terms are in bold we've already seen that differentiating the scalar valued function f with respect to its input vector x gives us the jacobian row vector we've also seen that differentiating the vector-valued function u with respect to the scalar variable t gives us a column vector of derivatives but what about the middle term dx by du well for the function x we need to find the derivative of each of the two output variables with respect to each of the two input variables so we end up with four terms in total which as we saw in the last module can be conveniently arranged as a matrix we still refer to this object as a jacobian so we can now say that the derivative of f with respect to t is the product of the jacobian of f with the jacobian of x and the derivative vector of u notice that the dimensions of these vectors and matrices as shown here are such that this operation is possible and that the product will return a scalar just as we expected that's it for this video i hope you're now starting to see the various threads of linear algebra and multivariate calculus weave together see you in the next one in this video i'm going to introduce to you the concept of artificial neural networks in just enough detail that we will be ready to see how the multivariate chain rule is crucial for bringing them to life i think it's safe to assume that everyone watching this video will at least have heard of neural networks and you're probably also aware that they've turned out to be an extremely powerful tool when applied to a wide variety of important real world problems including image recognition and language
translation but how do they work you will often see nice diagrams like this one where the circles are our neurons and the lines are the network of connections between them this may sound a long way removed from the topics we've covered so far but fundamentally a neural network is just a mathematical function which takes a variable in and gives you another variable back where both of these variables could be vectors let's now have a look at the simplest possible case so that we can translate these diagrams into some formulae here we have a network which takes in a single scalar variable which we'll call a zero and returns another scalar a1 we can write this function down as follows a1 equals sigma of w times a0 plus b where b and w are just numbers but sigma is itself a function it's useful at this point to give each of these terms a name as it'll help you keep track of what's going on when things get a bit more complicated so our a terms are called activities w is a weight b is a bias and sigma is what we call an activation function now you might be thinking how come all the terms use a sensible letter except for sigma but the answer to this comes from the fact that it is sigma that gives neural networks their association to the brain neurons in the brain receive information from their neighbors through chemical and electrical stimulation and when the sum of all these stimulations goes beyond a certain threshold amount the neuron is suddenly activated and starts stimulating its neighbors in turn an example of a function which has this thresholding property is the hyperbolic tangent function tanh which is a nice well-behaved function with a range from -1 to 1. you may not have met tanh before but it's just the ratio of some exponential terms and nothing your calculus tools can't already handle tanh actually belongs to a family of similar functions all with this characteristic s shape called sigmoids hence why we use sigma for this term so here we are with our non-linear function that we can evaluate on a calculator and also now know what all the terms are called at the start of this video i mentioned that neural networks could for example be used for image recognition but so far our network with its two scalar parameters w and b doesn't look like it could do anything particularly interesting so what do we need to add well the short answer is more neurons so now we're just going to start building up some more complexity whilst keeping track of how the notation adapts to cope if we now add an additional neuron to our input layer we can still call this scalar output variable a1 but we will need to be able to tell the difference between the two inputs so we can call them a00 and a01 to include this new input in our equation we simply say that a1 equals sigma of the sum of these two inputs each multiplied by their own weighting plus the bias as you can see each link in our network is associated with a weight so we can even add these to our diagram adding a third node to our input layer a02 follows the same logic and we just add this weighting value to our sum however things are starting to get a bit messy so let's now generalize our expression to take n inputs for which we can just use the summation notation or even better notice that each input has a weight so we can make a vector of weights and a vector of inputs and then just take their dot product to achieve the same effect we can now have as many inputs as we want in our input vector so let's now apply the same logic to the outputs
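before we do here is a minimal numpy sketch of a single neuron with n inputs the variable names are just illustrative and tanh is standing in for the activation function sigma

```python
# a minimal sketch (numpy assumed, names illustrative) of one neuron with n inputs:
# a1 = sigma(w . a0 + b), using tanh as the activation function sigma
import numpy as np

def neuron(a0, w, b):
    # weighted sum of the input activities plus a bias, squashed by tanh
    return np.tanh(np.dot(w, a0) + b)

a0 = np.array([0.5, -1.0, 2.0])   # three input activities
w = np.array([0.1, 0.4, -0.2])    # one weight per input
b = 0.3                           # a single bias
print(neuron(a0, w, b))           # a single scalar output activity a1
```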
adding a second output neuron we'd call these two values a10 and a11 where we now have twice the number of connectors each one with its own weighting and each neuron has its own bias so we can write a pair of equations to describe this scenario with one for each of the outputs where each equation contains the same values of a0 but each has a different bias and vector of weights unsurprisingly we can again crunch these two equations down to a more compact vector form where the two outputs are each rows of a column vector meaning that we now hold our two weight vectors in a weight matrix and our two biases in a bias vector now let's have a look at what our compact equation contains in all its glory for what we call a single layer neural network with m outputs and n inputs we can fully describe the function it represents with this equation and peering inside we can see all the weights and biases at work the last piece of the puzzle is that as we saw at the very beginning neural networks often have one or several layers of neurons between the inputs and the outputs we refer to these as hidden layers and they behave in exactly the same way as we've seen so far except their outputs are now the inputs of the next layer and with that we have all the linear algebra in place for us to calculate the outputs of a simple feed-forward neural network however persuading your network to do something interesting such as image recognition then becomes a matter of teaching it all the right weights and biases which is what we're going to be looking at in the next video as we'll bring the multivariate chain rule into play see you then in the previous video we introduced the concept of neural networks and then we worked through the algebra required to describe a fully connected feed-forward network with hidden layers in this video we're going to see how the multivariate chain rule will enable us to iteratively update the values of all the weights and biases such that the network learns to classify input data based on a set of training examples when we say that we are training a network we typically mean using some kind of labeled data which are pairs of matching inputs and outputs for example if we were to build a network to recognize pictures of faces and predict if they were happy then for our training data each of the inputs might be the intensity of a single pixel from the image and this would be paired with an output which just says whether this image contains a face and whether it was a happy face the classic training method is called back propagation because it looks first at the output neurons and then works back through the network if we start by choosing a simple structure such as the one shown here with four input units three units in the hidden layer and two units in the output layer what we're trying to do is find the 18 weights and five biases that cause our network to best match the training inputs to their labels initially we will set all of our weights and biases to be a random number and so initially when we pass some data into our network what we get out will be meaningless however we can then define a cost function which is simply the sum of the squares of the differences between the desired output y and the output that our untrained network currently gives us if we were to focus on the relationship between one specific weight and the resulting cost function it might look something like this where if it's either too large or too small the cost is high but at one specific value the cost is at 
a minimum now based on our understanding of calculus if we were somehow able to work out the gradient of c with respect to the variable w at some initial point w0 then we can simply head in the opposite direction for example at the point shown on the graph the gradient is positive and therefore increasing w would also increase the cost so we should make w smaller to improve our network however at this point it's worth noting that our cost function may look something more like this wiggly curve here which has several local minima and is more complicated to navigate furthermore this plot is just considering one of our weights in isolation but what we'd really like to do is find the minimum of the multi-dimensional hyper surface much like the 2d examples we saw in the previous module so also like the previous module if we want to head downhill we will need to build the jacobian by gathering together the partial derivatives of the cost function with respect to all of the relevant variables so now that we know what we're after we just need to look again at our simple two node network and at this point we could immediately write down a chain rule expression for the partial derivative of the cost with respect to either our weight or our bias and i've highlighted the a1 term which links these two derivatives however it's often convenient to make use of an additional term which we'll call z1 that will hold our weighted activation plus bias terms this will allow us to think about differentiating the particular sigmoid function that we happen to choose separately so we must therefore include an additional link in our derivative chain we now have the two chain rule expressions we'd require to navigate the two-dimensional wb space in order to minimize the cost of this simple network for a set of training examples clearly things are going to get a little more complicated when we add more neurons but fundamentally we're still just applying the chain rule to link each of our weights and biases back to its effect on the cost ultimately allowing us to train our network in the following exercises we're going to work through how to extend what we saw for the simple case to multi-layer networks but i hope you've enjoyed already seeing calculus in action see you next time this brings us to the end of the third module on this calculus course hopefully you are now starting to see how both calculus and linear algebra are going to be important for developing a detailed understanding of machine learning techniques you already have a lot of powerful tools in your toolbox and you are nearly ready to put them into action next module we'll be looking at power series which will bring us all the way back to the rise over run approximation from module one and help us tie up a few loose ends see you then welcome to module 4 of the course in this module we are going to learn how to take a complicated function and build an approximation to it using a series of simpler functions the particular approach that we're going to be looking at is known as the taylor series and it will require all of the calculus skills you've learned so far to be put into practice both the jacobian and the hessian will be making appearances again when we consider the multivariate case which will hopefully allow you to see how all of the concepts that you've seen so far are actually linked together see you then the taylor series is one of a family of approaches used for building approximations to functions so before diving into the maths it's worth 
stopping for a minute to talk about when it might be useful to have an approximation one example that seems to stick in people's minds is to do with cooking a chicken imagine that you could write a function which describes the relationship between the mass of a chicken m and the amount of time t it should be put in the oven before it will be perfectly cooked clearly there are a lot of assumptions that we'd need to have in place for such a function to even exist for example this would only apply to a certain type of oven preheated to a certain temperature applied to a certain configuration of chicken i.e. one that's in one piece rather than chopped up furthermore the heat transfer properties of a chicken will almost certainly vary in a highly non-linear fashion as a function of the mass let's have a look at a sample function that has some of the features that we've mentioned you can imagine that on the one hand the chicken cooking time t will be quite sensitive to a lot of parameters that we might expect to vary from case to case however on the other hand if we'd like to sell a nice recipe book we're going to need to be pragmatic and simplify this monster as people tend not to want to solve complicated equations while making dinner so what do we do firstly we're going to make two assumptions one about ovens and the other about chickens so let's assume that it's reasonable to suggest that everyone who buys your cookbook will have a sufficiently similar oven such that we can assume that they behave basically the same secondly that everyone will be cooking a sufficiently similar chicken in terms of their heat transfer properties as a function of mass that you can also assume that these will be the same in each case this allows us to remove two potentially problematic terms that might themselves have been a function of many other variables but we are still left with quite a messy function the next thing we're going to do is have a look at a plot of this function now already i'm telling you about what i view as relevant information in this system as i'm not showing you timings for chickens heavier than four kilograms or chickens of negative mass and in fact what i'm going to do now is go one step further and say that since a typical chicken from a supermarket is about 1.5 kilograms let's focus here and build a nice straight line approximation of the form y equals mx plus c by using the taylor series approach which we will be learning in the following videos it is possible for me to derive a function with the same height and slope as my function at one of the points on my graph this line is a fairly reasonable approximation to the function in the region close to the point of interest but as you move further away the approximation becomes pretty poor however this cookbook is not for people either roasting giant or miniature chickens so we end up being able to write down an expression in a much more palatable format where our approximation t star is approximately equal to 50 times the mass in kilograms plus 15.
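to make the recipe for building that straight line concrete here is a tiny sketch the numbers 90 minutes and 50 minutes per kilogram are illustrative assumptions chosen to be consistent with the line above rather than values taken from the full function

```python
# a sketch of a first order (tangent line) approximation built from just the value
# and slope of the full cooking-time function at 1.5 kg
# the numbers 90 and 50 are assumptions for illustration, not from the lecture's function
m0 = 1.5            # typical supermarket chicken mass in kg
t_at_m0 = 90.0      # assumed cooking time at 1.5 kg, in minutes
slope_at_m0 = 50.0  # assumed local slope at 1.5 kg, in minutes per kg

def t_star(m):
    # tangent line approximation around m0: t* = t(m0) + t'(m0) * (m - m0)
    return t_at_m0 + slope_at_m0 * (m - m0)

print(t_star(2.0))  # 115.0, which expands to the 50 * m + 15 line above
```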
so if you'd like to roast a 2 kilogram bird leave it for about 115 minutes in the next few videos we're going to go into a lot more detail about the taylor series and also find out how to derive higher order terms i really don't know whether the famous chefs around the world are knowingly using taylor series approximations when they write their cookbooks but i like to think they are see you next time in this short video i'm just going to give you one more teaser about what the taylor series is doing before we try and write down anything like a formal definition this will allow you to have a go at some graphical questions first which is much like how we approached learning differentiation at the start of this course taylor series are also referred to as power series and this is because they are composed of coefficients in front of increasing powers of x so we can write a simple generalized expression for a power series as g of x equals a plus bx plus cx squared plus dx cubed etc potentially going off for infinitely many terms depending on what function we're considering when we calculate a taylor series in the next video we will build it up coefficient by coefficient where each term that we add improves the approximation and in many cases we will then be able to see a pattern emerging in the coefficients which thankfully saves us the trouble of calculating infinitely many terms however many of the applications of taylor series involve making use of just the first few terms of the series in the hope that this will be a good enough approximation for a certain application starting from just a single term we call these expressions the zeroth first second and third order approximations etc collectively these short sections of the series are called truncated series so let's begin by looking at some arbitrary but fairly complicated function if we start with something like this where we have a function and we're just making it up perhaps shaped like this all we're going to do is focus on one particular point on this curve so let's say this point here and then we're going to start building our function by trying to make it more and more like the point that we've chosen so as the first term of our generalized power series is just a number a and we're ignoring all the other terms for now we know that our opening approximation must just be a number that goes through the same point so we can just dive straight in and add our zeroth order approximation function to our plot it's going to be something like that clearly this hasn't done a great job at approximating the red curve so let's now go to our first order approximation this thing can also have a gradient and if we'd like to match our function at this point it should have the same gradient so we can have something a bit more like this which is supposed to be a straight line clearly our approximation has improved a little in the region around our point although there is still plenty of room for improvement we can of course move on to the second order function which as we can see is a parabola although at this point things get a little tough to draw and matching second derivatives by eye is also not easy but it might look something like this so we can say okay let's go up and we're going to come down like that hopefully without having gone into any of the details about the maths you'll now be able to match up some mystery functions to their corresponding truncated taylor series approximations in the following exercise in the next video we're going to work
through the detailed derivation of the terms but i hope this activity will help you not to lose sight of what we're trying to achieve in the end see you then in this session we're going to work through the derivation of the power series representation of functions which in short is the idea that you can use a series of increasing powers of x to re-express functions just to warm up and blow your minds we can take the function e to the x which we met earlier in the course and re-express it as the series 1 plus x plus x squared over 2 plus x cubed over 6 plus higher order terms now i hope you would agree with me that this is pretty incredible and hopefully by the end of this video you will understand how to build series like this for many other interesting functions what the taylor series method tells us is that if we know everything about the function at some point where by everything i mean the function's value its first derivative second derivative third derivative etc then we can use this information to reconstruct the function everywhere else so if i know everything about it at one place i also know everything about it everywhere however this is only true for a certain type of function that we call well-behaved which means functions that are continuous and that you can differentiate as many times as you want that said we know lots of functions like that so it turns out to be a very useful tool indeed so now we're going to try and explain this concept graphically by looking at a graph of some arbitrary function f of x which looks like this and building a sequence of gradually improving approximations so our first approximation which we can call g zero we're going to build using just one piece of information from our function f of x which will be the value of the function at our point of interest which in this case is just x equals zero as you can imagine if all we're going to build our function off is this one single value then clearly our guess function isn't going to be able to capture a very complicated shape like this and in fact all it can be is a horizontal line at the point f at zero so we can plot this first approximation and also write an expression for calculating g zero of x which is just f of zero we call this our zeroth order approximation and clearly it's not very good but also notice that as this line is flat it's actually not even a function of x we can do better so let's now find our first order approximation for this we're now going to use two pieces of information the value of the function at x equals zero but also the value of the gradient at x equals zero which we will call f dash at zero using these we can build another straight line of the form y equals mx plus c where the vertical axis intercept c is just f of zero and we substitute in f dash of 0 as the gradient of our line so we can now plot our first order approximation to the function which has the same value of the gradient and of the function f of x and we can also write down its expression g one of x is f of zero plus f dash of zero times x this thing clearly does a better job than g zero at approximating f of x near the point x equals zero but it's still not great moving quickly on to our second order approximation g two of x we're going to use three pieces of information f of zero f dash of zero and the second derivative f double dash of zero now to have a function that can make use of these three pieces of information we're going to need a quadratic equation so we can write y equals a x squared plus b x
plus c differentiating this thing twice we can just say well y prime equals 2 a x plus b and y double prime equals 2 a now what we want is for this function to be the same as f of x when we sub it in at the point x equals zero so we can say okay at x equals zero we want this thing to equal f double prime of zero and because this thing's not even a function of x clearly our coefficient a is just going to equal f double dash of 0 divided by 2. now if we look up at this equation here we also want the first derivative to be equal to the first derivative of the function at zero so we can set this equal to f dash of zero and we're subbing in zero for this x here so this term just disappears so we can now say that also b equals f dash at 0. lastly subbing in 0 here and here and setting this thing just equal to f of 0 we can say that clearly c is also just equal to f of zero c equals f at zero so we now have all three coefficients for our equation and we can say let's now go back to our graph and add our second order approximation which as it's just an x squared term will be a parabola notice that each time we update our approximation the region in which it matches up with f of x grows a little so let's now take one more step down the rabbit hole and find the third order approximation so we can write y equals a x cubed plus b x squared plus c x plus d as based on what we've already seen from the first three steps it's only the coefficient a that we need to find as b c and d will be the same as we found for g2 so let's now differentiate this thing three times y prime equals three a x squared plus two b x plus c y double prime equals six a x plus 2 b and y triple prime is just going to be 6 a so clearly we want this thing the third derivative of our approximation function to equal our function f of x when we differentiate it three times and evaluate it at the point x equals zero so we're setting this x to zero now there is no x in this bit we can say f 3 at 0 equals 6 a therefore a just equals f 3 at 0 divided by 6. finally we can add this third order approximation to our graph and write out its expression you can now see that not only has the approximation improved significantly but also that we can add higher order terms piecewise and in each case the lower order terms remain the same what we now want to build is a general expression but first let's just write down the fourth order approximation without working through it notice that the one over six coefficient in front of the cubic term was the result of having to differentiate a cubic term three times so when we differentiate x to the fourth power four times we're going to get four times three times two times one in front leading to a coefficient of 1 divided by 4 times 3 times 2 times 1 which is 1 divided by 24. we have a name for the operation 4 times 3 times 2 times 1 which is 4 factorial in fact all the terms can be thought of as having a factorial in front of them even 0 factorial which for reasons i won't go into here is in fact equal to 1.
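here is the pattern we've just derived written as a short sketch again assuming sympy each coefficient is the nth derivative evaluated at zero divided by n factorial and applying it to e to the x recovers the series from the start of the video

```python
# a small sketch (assuming sympy) of the rule we just derived: the coefficient of x**n
# is the nth derivative of f evaluated at zero, divided by n factorial
import sympy as sp

x = sp.symbols('x')
f = sp.exp(x)  # the warm up example from the start of the video

# sum the first four terms of the power series built from derivatives at zero
approx = sum(sp.diff(f, x, n).subs(x, 0) / sp.factorial(n) * x**n for n in range(4))
print(sp.expand(approx))  # x**3/6 + x**2/2 + x + 1
```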
so with this last piece of the puzzle in place we can now say that the nth term in the approximation is just the nth derivative of f evaluated at zero divided by n factorial multiplied by x to the power of n and therefore the complete power series can be written as follows so you can see it's the sum from n equals 0 to infinity of these terms although what we've written here certainly does count as a taylor series because we're specifically looking at the point x equals zero we often refer to this case as a maclaurin series so you're still going to have to wait one more video before seeing the taylor series expression in all its glory in the rest of this module we're going to be applying the concept of power series to some interesting cases as well as generalizing it to higher dimensions where rather than building approximation curves we will be constructing approximation hypersurfaces see you then i began our last session by trying to motivate you with the fact that we can re-express the function e to the x as a power series so i want to make sure that your mind is well and truly blown by showing you that when we differentiate this function term by term which isn't very difficult to do as it's just a polynomial something rather satisfying happens where just as we'd expect for the derivative of e to the x this infinitely long series remains unchanged which i think is pretty awesome so now finally you're going to get to see the taylor series fundamentally we are applying the same logic as the maclaurin series that we derived previously but whereas the maclaurin series says that if you know everything about a function at the point x equals zero then you can reconstruct everything about it everywhere the taylor series simply acknowledges that there is nothing special about the point x equals zero and so says that if you know everything about the function at any point then you can reconstruct the function anywhere so a small change but an important one let's look again at the function e to the x as well as the first four approximation functions now using our understanding of the maclaurin series we can also rewrite this in a compact summation notation such that we could if we wanted build a polynomial approximation function to an arbitrary degree of accuracy so what if now instead of expanding around the point x equals zero we wanted to start from the point x equals p well to begin with let's have a look at the first four approximation functions at this point the expressions to describe these approximations are still going to require the value of the function as well as all of its derivatives at this point but how do we now adjust our general equation to allow for an arbitrary expansion point x equals p well clearly the zeroth order term is just straightforward it's going to be a horizontal line that just uses the point f of p everywhere but what if we take a closer look at the first order approximation this should give us enough insight that we'll then be able to tackle all the other higher order terms so we're essentially looking to build a tangent to the curve at point p all straight lines are of the form y equals mx plus c so let's now work through this by putting in place all the information that's available to us so we can say okay here is some nice function that we're going to deal with and if we're looking at the point p here this is of course at f of p height so the coordinates of this point are p f of p and we want to build an approximation function that is first order so it's got
a y-axis intercept and it's got a gradient and it's going to look something like this okay the same gradient the same height of the function at that point so of course that green line is going to have the equation y equals m x plus c now we know immediately the m is the gradient of this straight line and we know the gradient of our function here is just f dash at p so we can say y equals f dash at p times x plus c so all that's left is for us to find c and to do that we're going to need to know x and y but we do know a point on our line it's the point p f of p so we just sub these in f of p equals f dash of p times p plus c rearranging for c we can just see that c is going to equal f of p minus f dash of p times p subbing the c back into our general equation we can now bring it round and say well this line has the equation y equals m which is just f dash of p times x plus c which is just f of p minus f dash of p multiplied by p okay now the last thing i'm going to do is take this and factorize this f dash of p out and that'll leave you with a nice form of this expression y equals f dash of p multiplied by there's an x of it here and a minus p of it here so we can say x minus p plus just this term here f of p what this shows us is that by building our approximation around the point p when we use our gradient term f dash of p rather than applying it directly to x we instead now apply it to x minus p which you can think of as how far are you away from p so we can now write down our first two approximation functions to f x at the point p the zeroth and the first thinking back to the maclaurin series from the previous video all we need to do to convert to taylor series is use the second derivative at p rather than zero and also replace x with x minus p but notice that the factor of one half still remains so we can now look at our concise summation notation for the maclaurin series and upgrade the whole thing to the more general taylor series form noticing of course that if we decide to set p to zero then the expressions would actually become identical so we now have our one-dimensional taylor series expression in all its glory which will allow us to conveniently re-express functions into a polynomial series in the remainder of this module we're going to play around with this result as well as extending it to higher dimensions see you then in this video we're going to put our understanding of power series to the test by looking at two interesting examples first off we're going to build a maclaurin series expansion of the cosine function cosine is the epitome of a well-behaved function as it is certainly continuous everywhere as well as being infinitely differentiable as we're building a maclaurin series we're going to want to find out everything about the function at the point x equals zero so let's start by doing some differentiation and what we get if you remember from earlier in the course is this cyclic pattern of cosines and sines positive and negative which takes us back to the cosine again after four steps if we now evaluate these derivatives at the point x equals zero we see that the cosine terms are either one or minus one and the sine terms are all zero this must mean from a power series perspective that every other term will have a zero coefficient notice that these zeros will occur whenever we differentiate an odd number of times which means that all the odd powers of x like x to the 1 x cubed x to the 5 etc will all be absent from the series the even
powers of x are all what we call even functions which means that they are all symmetrical around the vertical axis just like cosine is so we can now bring back our general expression for the maclaurin series and just start writing out the terms the icing on the cake is that at this point we just notice that we can build a neat summation notation which fully describes this series without having to write out all the terms notice that this expression doesn't even contain any reference to cosine as all that we need to know is captured by the minus 1 to the power of n which just keeps flipping from negative to positive and negative again now that we've done all the hard work we can simply ask our computer to plot this sequence of increasingly accurate maclaurin series which hopefully starting from a horizontal line at y equals 1 will line up with your expectations notice that outside of a region of fairly close to the point x equals zero the approximation explodes off and becomes useless by the time we get our 16th order approximation we've pretty much nailed the region shown in our graph here although just outside of these axes the function would also be growing hugely positive so you must always be careful when handling series approximations that you know the domain in which it's acceptable in the second example we're going to take a look at the function f of x equals 1 over x which of course looks like this it's a nice simple function but notice the discontinuity at x equals zero this is absolutely not a well-behaved function in fact it's so badly behaved that when we even try and build the zeroth order approximation we immediately run into problems because we have to perform the operation one divided by zero which is not defined and if you try to ask your computer to do this it may give you back the answer nan which stands for not a number so we're going to need to try a different angle of attack clearly we aren't going to have much luck at the point x equals zero so why not try going somewhere else anywhere else let's look at the point x equals one certainly it passes the first test of being able to evaluate the function at this point however moving away from x equals 0 means that we're now going to need to use the taylor series instead of the maclaurin series so we now need to find a few derivatives of the function and see if we can spot a pattern when we evaluate these functions at the point x equals one hopefully you'll recognize that we get a sequence of factorials emerging just as we did when deriving the power series formula in the first place so if we now substitute these values into our taylor series formula the factorial terms will cancel and all we are left with is a sum of x minus 1 to the power of n terms with alternating signs which we can simplify to this neat summation notation so once again let's now pass this formula to the computer and ask it to plot a sequence of improving approximations starting as ever with a horizontal line but this time starting at the height of the function at x equals one there are several really interesting features of this particular example which tell you interesting things about the power series more generally firstly the approximations ignore the asymptote going straight across it and furthermore the region of the function where x is less than zero is not described at all by the approximations secondly although the function is gradually improving for larger values of x you can see the tail wildly flailing around as the sign of each additional 
term flips from positive to negative and back again i hope these two examples have made it clear both how the taylor series manages to reconstruct well-behaved functions like cosine x but also why they can struggle to deal with something badly behaved like one over x as we bring this video to a close let's now watch an action replay of these two sequences of improving approximations in the next video we're going to talk briefly about what it means to linearize a function and see how this relates to the taylor series analysis that we've seen so far see you then in this session we're just going to take what we've learned about taylor series so far and reframe it to a form that will help us understand the expected error in an approximation so as we've already seen we can build up a sequence of gradually improving approximations to functions by adding higher power terms all i'm going to do now is change the notation of some of the terms so if we take our first order approximation as an example what this expression is saying to us is starting from the height f of p as you move away from p your corresponding change in height is equal to your distance away from p times the gradient of your function at p so in a sense rather than using gradient equals rise over run we are just rearranging to rise equals gradient times run you are ultimately going to use your approximation to evaluate the function near p as you must already know about it at p so i'm now going to say that our distance from p which we currently call x minus p we will now call delta p to signify that it represents a small step size away from p and we can also recast x now in terms of p saying that x is just p plus delta p finally it actually is now time to say goodbye to p as in our first order approximation everything is actually in terms of p so all i'm going to do is swap all the p's for x's as this is the form that people would usually expect to see it in i'm going to do it all at once to make it as painless as possible i haven't changed anything conceptually i've just written an x everywhere where we used to have a p and we can now of course also rewrite our taylor series expression in this new form with just x and delta x so we're in good shape to talk about approximations what i now want to know is when i use the first order approximation instead of evaluating the base function how big should i expect the error to be you can see for example that the gap between the white and green lines grows as we move along the x-axis away from the point x well one way to think about it is that we know our function can be exactly represented by this infinitely long series so although we won't be able to evaluate all of the terms we do know that the next term along i.e the first term that we ignore when we're using our first order approximation has a delta x squared in it this means that if we can say that delta x is a small number then delta x squared must be a really small number and similarly delta x cubed must be a ridiculously small number so we can now rewrite our first order approximation to include an error term which we just say is on the order of delta x squared or equally that it is second order accurate this process of taking a function and ignoring the terms above delta x is referred to as linearization and i hope it's now clear to you why that's the case we've taken some potentially very nasty function and just approximated it with a straight line the last idea that i'm going to share with you in this video is perhaps the most 
interesting as it brings us right back to our rise over run approximation that we met at the beginning of the course the green line is our first order taylor series approximation to the function at the point x which is of course also the tangent to the curve at x but let's now add another line which is the approximation to the gradient at x using our rise of a run method with a second point delta x away we use this forward difference method to help us build the definition of a derivative at the beginning of the course and we noticed that as delta x goes to zero the approximation becomes exact however what happens if delta x does not go to zero and the second point remains some finite distance away from x well your calculated gradient will now contain an error and once again we can rearrange the full taylor series to work out how big we'd expect that error to be with a bit of slightly fiddly algebra we can rearrange this expression such that the gradient term f dash of x is isolated on the left hand side because the series is infinite this is still an exact expression for the gradient to x but what we get on the right hand side is something that looks suspiciously like the rise of a run expression plus a collection of higher order terms if we notice that the first of the higher order terms has a delta x in it we know that we can now lump them all together and say that using the rise of a run method between two points with a finite separation will give us an approximation to the gradient that contains an error proportional to delta x or more simply put the forward difference method is first order accurate that's all for this video it might seem a little odd to go to all that trouble just to get an idea of the error in an approximation but it turns out to be a hugely important concept when as is typical we ask computers to solve numerically rather than analytically see you next time so far in this module i've introduced you to the concept of power series and then shown you how we can use them to approximate functions we've also seen how power series can then be used to give you some indication of the size of the error that results from using these approximations which is very useful when applying numerical methods the last idea that we're going to very briefly look at on this topic is upgrading the power series from the one-dimensional version that we've seen so far to its more general multivariate form just to recap on the notational options from the previous video we saw that we can re-express the taylor series from a form that emphasizes building the approximation of a function at a point p to a totally equivalent form that emphasizes using that function to evaluate other points there are a small distance delta x away just to make sure you can see that they're the same we can write out the first four terms one above the other to make the comparison clear in this video we are going to continue using the delta x notation as it's more compact which will come in handy when we're in higher dimensions we won't be working through the multivariate taylor series derivation in any detail as all i really want you to take away from this video is what a multi-dimensional power series will look like both as an equation and in a graph so keeping in mind our one-dimensional expression let's start by looking at the two-dimensional case where f is now a function of the two variables x and y so our truncated taylor series expressions will enable us to approximate the function at some nearby point x plus 
delta x, y plus delta y. let's look at the simple two-dimensional gaussian function which many of you may be familiar with from studying the normal distribution in statistics it's a nice well-behaved function with a simple maximum at the point zero zero now in a one-dimensional analysis our power series would give us an expression for a one-dimensional function however in a two-dimensional case our power series would of course now give us a two dimensional function which we would more intuitively refer to as a surface just as with the one-dimensional case our zeroth order approximation is simply the value of the function at that point applied everywhere which means in 2d this would just be a flat surface so a zeroth order approximation at the peak would look like this and a zeroth order approximation somewhere on the side of the bell would look a bit more like this this is fairly straightforward but now let's think about the first order approximation drawing the analogy from the 1d case again the first order approximation should have a height and a gradient so we're still expecting a straight surface but this time it can be at an angle now if we are calculating it at the peak it wouldn't be a very exciting case as it's a turning point and so the gradient is zero however let's look again at the point on the side of the slope taking the first order approximation here gives us a surface with the same height and gradient at the point finally let's look at the second order approximation for these two points we're now expecting some kind of parabolic surface however if i calculate it at the peak nothing seems to appear but that's just because we've created a parabola inside our bell curve so we need to look from underneath to even see it finally if we plot the second order approximation at the point on the side of the bell curve you can see that we end up with some kind of interesting saddle function but it's still fairly clear even by eye that the gradient and the curvature are matching up nicely at that point although this approximation is evidently not useful outside of a fairly small region around the point we're now going to take a look at how we would write expressions for these functions so if we wanted to build a taylor series expansion of the two-dimensional function f at the point x comma y and then use it to evaluate the function at the point x plus delta x comma y plus delta y our zeroth order approximation is just a flat surface with the same height as the function at our expansion point the first order approximation incorporates the gradient information in the two directions and once again we are thinking about how rise equals the gradient times the run notice here for compactness i'm using the partial symbol with a subscript to signify a derivative with respect to a certain variable if we look now at the second order approximation we can see that the coefficient of one-half still remains as per the one-dimensional case but now we have three terms all of which are second derivatives the first has been differentiated with respect to x twice the last with respect to y twice and the middle term has been differentiated with respect to each of x and y now there are of course higher order terms but we've already got more than enough to play with so let's look again at the first order term it's the sum of the products of the first derivatives with the step sizes but hopefully this will remind you of our discussion of the jacobian so we can actually re-express this as just the jacobian
multiplied by a vector containing delta x and delta y which we can write as delta bold x where the bold x signifies a vector containing those two terms finally if we look at the second order term in the same way and notice that these second derivatives can be collected into a matrix which we've previously defined as our hessian then to make the sum we need we now have to multiply our delta x vector by the hessian and then again by the transpose of the delta x vector and that's it so we now have a nice compact expression for the second order multivariate taylor series expansion which brings together so much of our calculus and linear algebra skills and makes use of the jacobian and hessian concepts which we defined earlier in the course and of course although we've been talking about the two-dimensional case in this video so far we could actually have any number of dimensions contained in our j h and delta x terms so this immediately generalizes from 2d to multi-dimensional hypersurfaces which is pretty cool. see you in the next one.
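to make that compact expression concrete, here's a minimal python sketch of the second order multivariate taylor expansion for a simple two-dimensional gaussian bell, f of x y equals e to the minus of x squared plus y squared — the particular function, expansion point and step sizes are just illustrative choices, and the derivatives are worked out by hand for this sketch rather than coming from the course materials.

```python
import numpy as np

def f(x, y):
    # a simple two-dimensional gaussian bell
    return np.exp(-(x**2 + y**2))

def jacobian(x, y):
    # first partial derivatives [df/dx, df/dy]
    g = f(x, y)
    return np.array([-2 * x * g, -2 * y * g])

def hessian(x, y):
    # matrix of second partial derivatives
    g = f(x, y)
    return np.array([
        [(4 * x**2 - 2) * g, 4 * x * y * g],
        [4 * x * y * g,      (4 * y**2 - 2) * g],
    ])

def taylor_2nd(x0, y0, dx, dy):
    # f(x0 + dx, y0 + dy) is approximately f + J . d + 0.5 * d^T H d
    d = np.array([dx, dy])
    return f(x0, y0) + jacobian(x0, y0) @ d + 0.5 * d @ hessian(x0, y0) @ d

# compare the approximation with the true value a small step away
x0, y0 = 1.0, 0.5          # an illustrative point on the side of the bell
dx, dy = 0.1, -0.05
print(taylor_2nd(x0, y0, dx, dy), f(x0 + dx, y0 + dy))
```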
this brings us to the end of the fourth module of this calculus course you now have all the tools in place to do something really useful and in the next two lectures david is going to walk you through a method for fitting functions to data which will make use of a lot of the calculus including all four time-saving rules in their multivariate form as well as putting the jacobian into action good luck and i'll talk to you again at the end of the course hi there i'm david dye and i'll be taking you through the last few modules of this course in this module we'll start to use the calculus we've done and put it together with vectors in order to start solving equations in this first video we'll look at a nice simple case where we just need to find the gradient the derivative in order to solve an equation using what's called the newton-raphson method now say we've got that distribution of heights again with a mean an average a mu and a width sigma and we want to fit an equation to that distribution so that after we've fitted it we don't have to bother about carrying around all the data points we just have a model with two parameters a mean and a width and we can do everything using just the model and that would be loads faster and simpler and would let us make predictions and so on so it'd be much much nicer but how do we find the right parameters for the model how do we find the best mu and sigma we can well what we're going to do is we're going to find some expression for how well the model fits the data and then look at how that goodness of fit varies as the fitting parameters mu and sigma vary so we're trying to solve an equation where the fitting parameters are the variables in the equation but in order to get there in the next module actually we're first going to need to do a bit more calculus so first let's look at the equation of a line say here y equals x cubed minus 2x plus 2. if we differentiate this equation we get the quadratic 3x squared minus 2 and that quadratic will have two solutions and therefore two turning points will exist one a maximum and one a minimum just as we see here now say that i don't know what the equation looks like i'm blind and i haven't got enough computer resources to graph out the values at every point or more likely in reality say the function exists in so many dimensions that i can't visualize it at all but say i only need to find the solution to the equation where y equals nought so where x cubed minus 2x plus 2 is equal to 0. we can see there's actually only one solution here on this graph but there could be more depending on the exact form of the equation i was trying to solve now say that i have an idea that i want to hunt for a solution somewhere near some initial guess at the red dot here for instance the constant's pretty small and positive so i might guess that i need a slightly negative value of x say minus 2 might be somewhere near a solution now if i can evaluate the value of the function at my guess of x equals minus two i find that the function has a value of minus two and if i ask what the gradient is at that value of x equals minus two i find that the gradient is positive and it's ten now i can extrapolate the gradient to the intercept with the x-axis which would be my first guess of the solution of the equation which i'm trying to solve where it finds that intercept with the x-axis so i can use that value of x at the intercept as a new estimate for what the solution to the equation is effectively i'm guessing that the function is a straight line and then i'm using the gradient to extrapolate to find the solution it isn't really a straight line of course but the first order approximation would be that it's a straight line and we'll use that to update our guess and then go again and evaluate so i can write down an equation for my new guess x i plus one based on my previous attempt x i as being my original guess minus the value of the function divided by its gradient so let's see how it plays out we can make a table starting with our initial guess at i equals naught and then we can find the gradient and the intercept and then use that to generate a new guess in this case minus 2 minus minus 2 divided by 10 gives us minus 2 plus 0.2 which is minus 1.8 then we can evaluate the result for that guess and find that it's just a little less than zero minus 0.23 and the gradient is 7.7 so we've gone from being out by two on y to being out by just 0.23 so in some sense we've got like 10 times better in our estimate just in our first go if we carry on then we get the next guess for x2 is minus 1.77 and that's just 0.005 away from the axis it's really close then if we have another go after just three iterations we get an answer of x equals minus 1.769 which is just 2.3 times 10 to the minus 6 from the axis so in just three iterations we pretty much solve the problem which is pretty cool this method is called the newton-raphson method and it's really pretty neat to solve an equation all we need to be able to do is evaluate it and differentiate it we don't need to graph and visualize it everywhere calculating it lots and lots of times we don't need to be able to solve it algebraically either which if we have lots of dimensions to a data set say and a big multi-variable function we're trying to fit to that data it's going to be much too expensive to try and solve it analytically or even plot it out for all the possible values of the variables.
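here's a minimal python sketch of that iteration for the cubic above, starting from the same guess of minus 2 — it should reproduce the numbers in the table, though the choice of five iterations is just an illustrative stopping point rather than a proper convergence test.

```python
def f(x):
    return x**3 - 2*x + 2      # the cubic from the example

def f_dash(x):
    return 3*x**2 - 2          # its derivative

x = -2.0                       # initial guess
for i in range(5):
    # newton-raphson update: x_{i+1} = x_i - f(x_i) / f'(x_i)
    x = x - f(x) / f_dash(x)
    print(i + 1, x, f(x))      # the guesses head quickly towards about -1.769
```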
this sort of method where we try a solution and evaluate it and then generate a new guess and then evaluate that and again and again and again is called iteration and it's a very fundamental computational approach now there are some things that can go wrong sometimes with this method so let's briefly look at those say i started off with a guess of x equals zero which evaluates to y equals two when i find the gradient for that and extrapolate it that takes me away from the solution to the other side of the turning point it gives me a new guess that x equals one when i evaluate that then i get a value for y at x equals one of one when i find the gradient and extrapolate back then my new estimate lands me back at x equals zero which is where i've begun so i have a problem i seem to have magically landed in a closed loop where my estimates just cycle back between x equals naught and x equals one and i never get close i never even go anywhere near to the solution x equals minus 1.769 there's another problem which is that if i'm close to a turning point this bottom here to a minimum or a maximum then because my gradient will be very small when i divide by the gradient in the newton-raphson equation here my next estimate will take me zapping off to some crazy value and therefore it won't converge it'll dive off somewhere so those are the problems so that's the newton-raphson method we iterate to a solution to an equation by each time making a new estimate for the solution using the gradient to extrapolate towards the solution then going again and again and again most of the time this works really well as a means to step towards the solution so what we've done in this video is look at a method for using just the gradient to step our way towards solving a problem this method is called the newton-raphson method and it's a really powerful way to solve an equation just by evaluating it and its gradient a few times it's as if you're standing on a hill in the fog and you can know your height and you can know locally what's going on around you you can know how steep the hill is but you can't see the whole landscape around you you don't know what it looks like you don't know how to get down the mountain if you like down to a nice safe place that doesn't have cliffs so what you do is you guess based on how steep the hill is locally around you which way to go so you want to go down to sea level so you go down the hill then you take that step blindfolded and when you get there you ask again what height you're at and how steep it is and then you keep making more steps down the hill until either something goes wrong and you need to go back down the other way or you get home to where you wanted to be the point is you don't need to know what the landscape looks like the function you just need an altimeter the value of the function and to be able to feel with your toe what the gradient is like locally around you what we'll do in the next video is look at how to apply this where we've got multiple variables not just x and that will involve finding the gradient vector how to go down a hill on a contour plot okay so now we've looked at the newton-raphson method which uses the gradient to iteratively solve a 1d function of x say now we'll generalize that to figure out how to do something similar with a multi-dimensional function a function of multiple variables and how to use the gradient to find the maxima and minima of such a function later on this will let us
optimize find the best fit for the parameters of a function that we're trying to fit say i've got a function like f equals x squared y here this function looks like this and what i've got here is a function that gets big and positive when x is big and y is positive and it gets negative when y is negative so if i look down the y-axis i get a projection of a straight line and if i spin to look down the x axis i get an upright parabola for y positive and one going down for y negative and the function is equal to zero along both the axes now at the bottom here in matlab i've used the surfc function to plot contours where the function has a constant value just like the contours of constant height on a topographical map so the question we want to answer in this video is how do i find the fastest or steepest way to get down this graph now previously you found out what the gradient of a function is with respect to each of its axes so you can find df dx just by differentiating the function f treating all the other variables as constants so in this case we've got a function f of x y is equal to x squared y and so df dx is equal to differentiate that treating y as a constant so that's just 2x times y and df dy is equal to well the x squared we treat as a constant and y differentiates just to one so it's just x squared now we can write down these two gradients in a vector which we call the grad of f so this guy we call grad and it's just the vector found by writing down d f by d x and d f by d y in the x and y locations of the vector and grad is like the most awesome vector ever grad is the little vector that combines linear algebra and calculus together so grad's a lot of fun so if we run along a little vector r is equal to dx zero then if we dotted that with grad f then we get df dx times dx plus zero so we'd move the amount dx along the x-axis so this is doing the right sort of thing right a dot product of grad f with a vector is going to take it looks like some amounts of df dx along the x bit of that vector and some amounts of d f d y along the y bits of that vector so it's going to tell us how much we've gone down that's sort of how grad works and the important thing to remember is that this is evaluated at a location the space has some point a b and grad f takes its values there so it's evaluated at a location so then if we want to know how much the function will change when we move along some unit vector in an arbitrary direction so we'll call that unit vector r hat and we'll give it components c and d so c squared plus d squared is equal to 1.
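before we carry on, here's a small python sketch of the pieces we've just set up — grad of f for f equals x squared y, evaluated at an illustrative location a b, and its dot product with a unit vector r hat, which is exactly the directional gradient we're about to name; the particular numbers are made up for the example.

```python
import numpy as np

def grad_f(x, y):
    # grad of f(x, y) = x**2 * y is [df/dx, df/dy] = [2xy, x**2]
    return np.array([2 * x * y, x**2])

# evaluate grad at a particular location (a, b) -- illustrative values only
a, b = 2.0, 3.0
g = grad_f(a, b)

# a unit vector r_hat = [c, d] with c**2 + d**2 = 1
r_hat = np.array([1.0, 1.0]) / np.sqrt(2)

print("grad f at (a, b):", g)
print("directional gradient, grad f . r_hat:", g @ r_hat)
print("size of grad f (the steepest possible gradient):", np.linalg.norm(g))
```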
so df is going to be df dx times c plus df dy times d so what we're doing here is we're doing grad of f dotted with r hat and we call this thing the directional gradient so another question we can ask is what's the maximum value this directional gradient can take so we've got grad f as a vector and we're dotting it with r hat and our question is what's the maximum value they can take given that r hat is a unit vector well the only thing then left when we dot them together is going to be the cos theta term because a dot b is the size of a times the size of b times cos theta and that's going to be maximized when cos theta is 1 which means theta's 0 which means r hat is parallel to grad f so in order to find the maximum value of the directional gradient we want an r hat that's actually the normalized version of grad so we can write that down we can do that calculation so we'll have grad f dotted with the normalized version of itself so grad f divided by its modulus but grad f dotted with itself is just equal to the size of grad f squared and we've got to divide by the size so the maximum value the directional gradient can take is just the size of grad f and that's therefore the steepest gradient we can possibly have the size of the steepest gradient we can possibly have is just the size of grad f the square root of the sum of the squares of the components of grad so that's really fun the other question to ask is which way does grad point this is less easy to see but grad points up the direction of steepest ascent perpendicular to the contour lines so think if you're on a mountain and it's foggy and you can only see locally around you and you want to go down the hill you don't walk around the contour you go down the hill so if the contour's like this and i'm staring down the hill down the hill is at 90 degrees to the contour line and think about the way it's defined is it up or down dfdx is positive if you're going up so actually grad f points up the hill in the steepest way and minus grad points down the hill the steepest way so grad points up the direction of steepest ascent now if we have some data science problem where we want to minimize the difference between our data values and our model fit then what we want to do is find the lowest point in the function the function is kind of the badness so we want to find the point where the badness is minimized and like newton-raphson we can use the gradient to go from some trial point down towards the solution but in newton-raphson we're trying to find the zero point here we don't know what the minimum value of the function is so we don't know how far down we need to go it's as if we're somewhere on our mountain but we don't know the altitude of the valley so what we do in what's called the gradient descent method is we take a series of little steps down the hill that is if we started at some position sn then our next position sn plus 1 is given by s n plus some little step down the hill and that little step is given by minus some amount times grad with grad evaluated at the previous position sn on the graph that's going to look like taking a little step the little blue one down the hill and then we re-evaluate to make another sn plus one and make another step and take a series of steps down the hill if we overshoot that's okay because grad will just take us back to the minimum and notice that as the gradient gets shallower as we come towards the turning point then the steps automatically get smaller because grad gets smaller so this is quite a nice method.
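here's a minimal sketch of that update rule in python — since f equals x squared y has no minimum to settle into, this sketch uses a simple bowl-shaped stand-in function, and the step-size constant and iteration count are arbitrary illustrative choices rather than anything from the course.

```python
import numpy as np

def grad_f(s):
    # gradient of a simple bowl f(x, y) = (x - 1)**2 + (y + 2)**2,
    # standing in for whatever 'badness' function we want to minimize
    x, y = s
    return np.array([2 * (x - 1), 2 * (y + 2)])

s = np.array([5.0, 5.0])       # starting position
step = 0.1                     # how aggressive each little step is

for n in range(50):
    s = s - step * grad_f(s)   # s_{n+1} = s_n - step * grad evaluated at s_n

print(s)   # should end up close to the minimum at (1, -2)
```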
there are lots of ways to enhance it but that's the main idea it's very powerful and simple the other thing to notice is that if there are multiple local minima in the landscape then we might get stuck in one of them and of course which one we find depends on our starting point we won't find the others we'll only find one at a time so in the general case there are some problems but nevertheless this is quite a neat method this just following little steps down the hill and this is probably the main way in which most numerical optimizers work that people use in the real world so what we've done in this video is we've introduced ourselves to grad the gradient vector and merged calculus and vectors together to take our first steps in vector calculus and that's allowed us to find a method for just using the gradient to step our way towards solving a problem for multi-variable cases and that method is called gradient descent so we found out about grad and that led us on to sketch out a neat method for finding the minima and maxima of multi-variable functions which we call gradient descent in this video we'll look at what happens if we want to find the minimum or maximum subject to some constraint that we want to find the maximum somewhere along a line or something like that this is called the method using lagrange multipliers so what we did in the last video is we looked at a function like this one f here this function f is x times the exponential of minus x squared plus y squared and it's one of the standard matlab examples and i've plotted below it here a contour map of the values if i now plot just the contour map i can plot out the vectors of grad f as we said these vectors are perpendicular to the contour lines and they point up the steepest gradient now let's return to the function we were looking at last time f equals x squared y which is the khan academy standard example for this problem last time we plotted grad f on both the 3d version and then dropped that down onto the corresponding contour map actually it's going to be easier if we now just work with the contour map itself so here's the contour map and what i've done is i've used the gradient function in matlab to just plot out the gradient perpendicular to the contours everywhere in the space so the low sides are at negative y down here and the high sides are at positive y up here now what happens if i want the maximum value for this function f equals x squared y but constrain it to lie on a circle where's the highest point anywhere on that circle so i have a circle of radius a where x squared plus y squared equals a squared and i want to find the minima and maxima on that path as i go around that circle now you can imagine that describing an equation for how x squared y varies as i go around a circular path is going to be a complete nightmare but that's not the question we are actually asked what we wanted to know was what's the maximum along that circle not what's the value everywhere on the circle now lagrange was one of those french mathematicians whose name is inscribed around the eiffel tower and what lagrange noticed was that when the contours just touched the path then you'll have found the maximum or minimum point that is when f is a little bit smaller it won't quite touch the path see it here in 3d and when it's a bit bigger it might cross a couple of times see here in 2d on the contours but when the contour just touches then we'll have found the minima and maxima of the function f as we go around the path our circle in this case and what lagrange noticed was that when the contour
touches the path then the vector perpendicular to the contour is in the same direction up to a minus sign as the vector that's perpendicular to the path itself so if we can find grad we can find the minimum maximum points and solve the problem if we can find grad perpendicular to the contour on both the path and the function we're away so we want to maximize the function f of x y which is equal to x squared y subject to some constraint which we're going to call the constraint equation g of x y and that's the equation of a circle x squared plus y squared and it has some value some particular value to the problem we want to solve of a squared so what we're saying is if we've got the function doing something like this and it has its grad that way and the circle has some path that's doing that and it's got its grad that way what we're doing is we're solving grad f is equal to lambda some number times grad g where lambda is lagrange's multiplier the lagrange multiplier and that's all we need to do so we just need to set up this set of equations and then solve them so if i take grad f is equal to the grad of x squared y well that's equal to if i differentiate for the df dx i've got 2xy and if i take d f d y the y just goes and i've got x squared and i'm saying that that's equal to lambda times grad g which is equal to lambda times if i differentiate g x squared plus y squared with respect to x i've got 2x if i differentiate it with respect to y i've got 2y and that's giving me two equations with three unknowns x y and lambda and i've got a third equation which is the constraint equation itself that brings in the actual value of the circle i'm particularly interested in so i've just got to solve that so if i take the first line i've got 2xy is equal to lambda times 2x i cancel the two x's and therefore i've got y is equal to lambda straight out if i take the second row the y row if you like i've got x squared is equal to lambda times two y but lambda is itself y so that's equal to 2y squared so i can say that x is equal to the square root of 2 times y and because i've done the square root i've got a plus minus here and now if i take the constraint equation itself the third one i've got x squared plus y squared is equal to a squared but x squared is equal to 2y squared so that's equal to 3y squared so then if i square root that i can say that y is equal to a divided by the square root of 3 and again because i've done this square root i've got a plus minus so those are my solutions so i can write those out and the solutions are going to be a over the square root of 3 times, if i take y as 1 then x is root 2 times that, so root 2 and 1 and i've got a over root 3 times root 2 and minus 1 for y and i've got all the possibilities a over root 3 times minus root 2 and 1 and i've got a over the square root of 3 times minus root 2 and minus 1. now if i find the values of the function f of x y as i take all of those so i've got to find x squared y for each so the first is a cubed over 3 root 3 because i'm cubing the a over root 3 bit in effect as it appears in both x and y and then x squared gives me 2 times y which is 1 so that's 2 giving 2 a cubed over 3 root 3. here i've got the same thing but y is negative so now i've got a cubed over three root three times minus two for this one y is positive and when the minus root 2 squares the minus sign is going to disappear so this is going to give me another plus solution two over 3 root 3 a cubed and here y is negative so i'm going to get minus two a cubed over 3 root 3.
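as a check on that working, here's a small sketch that sets up grad f equals lambda grad g symbolically, assuming sympy is available — note that it also returns the two points with x equal to zero (where f is zero), which drop out of the hand calculation above at the step where we cancel the factor of x.

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
a = sp.symbols('a', positive=True)

f = x**2 * y              # the function we want to maximise
g = x**2 + y**2 - a**2    # the constraint: a circle of radius a

# lagrange's condition grad f = lambda * grad g, plus the constraint itself
eqs = [sp.Eq(sp.diff(f, x), lam * sp.diff(g, x)),
       sp.Eq(sp.diff(f, y), lam * sp.diff(g, y)),
       sp.Eq(g, 0)]

for s in sp.solve(eqs, [x, y, lam], dict=True):
    # prints each stationary point and the value of f there
    print(s[x], s[y], f.subs(s))
```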
so i've got a max, a min, a max and another min the way i've written them out so i've got two positive solutions and i wanted to find the maximums so those are there and i've got the minima as well for free so let's look at what that looks like now on the graph so two of our solutions are here and two here when we switch to the 3d view we can see that the two with positive y are maxima and the two with negative y are the minima so that's really neat what we've done here is we've used our understanding of gradients to find the minima or maxima subject to some constraint equation like a straight line or a circle and very often we want to say that some of the variables in a function that we want to fit are fixed in relation to each other they have some fixed relationship like that around a circle so this is a very handy thing to be able to do it's really useful so hopefully that was good fun this week you've taken all the multivariable calculus you've done so far and looked at how to put that together with vectors and linear algebra to do vector calculus we've been looking at optimization problems together how to solve equations and optimize functions for example to optimize the fit of a function to some data we started off with the newton-raphson method where we use the gradient to estimate how far we should step to get from a current guess to a solution to our next guess and then we kept doing that until our estimate for the solution converged to an answer we were happy with we then moved on to look at how to find the gradient in the multi-variable case we defined the gradient vector grad which is perpendicular to the contour lines and has elements equal to the differential of the function along each of the axes df dx df dy and so on and it has magnitude given by the size of grad the square root of the sum of the squares of the elements we then figured out that we could use grad to go down a hill to find the minimum values of a function by taking little steps down the hill at a time and that's really neat because it means we don't need to evaluate the function everywhere and then find the minimum or solve the function using algebra as long as we can find the gradient and an initial estimate we can just take steps down the hill until we find ourselves in the bottom of the valley and this gradient descent method is probably the most powerful method for finding minima that exists finally we looked at how to solve a problem subject to a constraint we found that this reduced to equating the gradient of the function to a multiple of the gradient of the constraint taking two grads and then solving the simultaneous equations resulting from equating those two vectors that was the lagrange multipliers method so this is really neat the big picture is we've learned how to optimize functions and we've taken ourselves from the realm of doing algebra to the realm of solving problems with computers in the next module what we'll do is we'll move on to look at how to fit functions to data using what's called the least squares method so this week we're finally going to apply all this multivariate calculus to help us fit functions to data this is really handy it allows us to make sense of our data to start doing statistics testing and to start to really apply all of the stuff we've been doing into the real world if you have some big bundle of data the first thing you need to do is to clean it up there are other courses on this but the first thing you need to do is clean it up to get it into shape to start doing the maths
with that means figuring out what the sensible thing to do is with things like partially empty data entries zeros those sorts of things to figure out if you can do any dimensionality reduction by binning or grouping the data together eliminating duplicates garbage data all those sorts of things often the easiest thing to do is to tag all of your data so that you can reorder them and put them back in when you've changed the order then sort them and look for funny values and then figure out what to do with those to get rid of them all together or to replace them with something sensible and then once you've cleaned it up you can start graphing it take averages look at standard deviations and all those sorts of things of course in doing this it really helps if you're interested in the data you need it to be kind of like a person you that you want to figure out and understand until it becomes an old friend if over time as you collect more data when it starts to change you want to get to the point where you're going to notice those changes actually so you really want to get it intimate and friendly with your data once you've graphed it in a sensible way you'll often get a plot like this guy here which is a simple xy plot of some data this data seems to plot like a straight line if you know something physically about the processes involved in generating the data or if you have some hypothesis as to how the variables are related you can try and fit that model to the data alternatively you can just try fitting something sensible based on how it looks like a straight line to this data here for example now i can model my straight line y here as being a function of the i observations x i and a vector a of the fitting parameters in the case of a straight line equals y equals mx plus c then the parameters in the vector a would be the the gradient m and the intercept c of the straight line so here i've plotted the optimal straight line for this data it happens to have a gradient of 215 gigapascal's gpa and an intercept of 0.3 gpa for c i can also find the mean of x x bar and the mean of y y bar which are the geometric center of mass of that data set now in order to find the optimal value of m and c let's first define a residual r which we define as the difference between the data items y i and the predicted location of those on the line at y which would be mx plus c so r is y i minus mxi minus c then i can take a measure of the overall quality of the fit being a quantity i'll call chi squared which is the sum of the squares the residuals are i do this so that i penalize both data that are above and data that are below the line i don't want the pluses and minuses to net off against each other also i really want to badly penalize uh data items that are a long way away from the line and then i'm going to try and find the best chi-squared possible the one that's lowest i'm doing a minimization so now it's worth plotting out what chi squared is going to look like for lots of different possible values of m and c which is what i've done on this contour plot here in the middle at about 215 and near an intercept of zero i find my minimum and the further i get away from those the worse it is and note that in terms of chi-squared it's slanted there seems to be some kind of trade-off the bigger c gets the lower the optimum value of the gradient m and vice versa and if i look on this plot that's kind of obvious if i make the line steeper on the original uh fit then in order for it to fit well the intercept is going 
to have to get smaller actually i'm pivoting about the center of mass and also this trough here in the chi-squared value is going to be really quite shallow so this is actually going to be quite a tricky problem for a steepest descent algorithm it's going to be easy to get down the sides but it's going to be difficult to get along the bottom of the valley to find the actual minimum but nevertheless it looks like it's going to be quite an okay problem to solve it has one easy to spot minimum and therefore we can find it and note that to do this with any precision if i do it simply by doing lots of computations here like i've done for different m's and c's and finding the minimum that way plotting it all out finding the minimum on this graph to do this i have to do a lot of maths in matlab this contour plot took about 200 000 computations to make so even for a simple problem like this we really do want to find an algorithm that's going to let us get there a bit more efficiently now the minimum is going to be found when the gradient of chi squared is zero so if we just take the grad of chi squared with respect to the fitting parameters and set it to zero that's going to be our solution now the neat thing is that in this particular case we can actually solve this problem explicitly so that's what we're going to do in this video but then we'll go on to see how to do it by steepest descent and then we'll find a better algorithm and then we'll explore how to do this sort of problem where it's not so easy to do explicitly so if we differentiate the first row with respect to m then the first thing to worry about is all these sums over the data items i but actually it turns out that we don't need to worry about these sums because we're not differentiating the x i or the y i themselves so they'll just sit benignly in the sum you know if we differentiate mx1 plus mx2 plus mx3 with respect to m we just get x one plus x two plus x three so we don't have to worry about those sums and then it's easy right we differentiate a square it drops in power by one and we multiply by two and then the differential of the inside of the bracket with respect to m is minus x i we can then take that minus two outside of the sum altogether in fact for the second row it's easier because the differential of the inside of the bracket with respect to c is just minus one so we just get the two down from the power and the minus sign and so it all looks quite easy keeping on looking at the second row then the sum of c times the number of data items we can take out of the sum altogether and then we've just got the sum of the y i's and the sum of m times the x i's and if we divide that through by the number of data items we get our result that c is going to be y-bar minus m times x-bar with y-bar and x-bar being the averages we can carry on in that way and generate an answer for m which i'm just going to leave here i don't think there's any point in showing all the maths to you blow by blow it's a bit trickier to see and i'm not going to go through actually the maths and the derivation but you can also find estimates for the uncertainties in c and m which i've put up here which i'll call sigma c and sigma m and it's very important actually when you're doing a fit to get an idea of the uncertainties in those fitting parameters and to quote those in your fit so i'm just going to leave those here in case you need to use them.
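here's a minimal python sketch of that closed-form fit on some made-up data — c is the expression we just derived, y bar minus m times x bar, while m uses the standard closed-form least-squares expression in terms of deviations from the centre of mass; the video leaves its own version of the m formula and the uncertainty formulas on screen rather than reading them out, so treat this as the textbook result rather than a transcription of the slide.

```python
import numpy as np

# made-up (x, y) pairs standing in for the measurements in the video
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.4, 2.6, 4.3, 6.2, 8.1, 10.3])

x_bar, y_bar = x.mean(), y.mean()            # centre of mass of the data

# closed-form least-squares solution for y = m x + c
m = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar)**2)
c = y_bar - m * x_bar                        # the result derived above

r = y - (m * x + c)                          # residuals
chi_squared = np.sum(r**2)                   # the goodness-of-fit measure

print(m, c, chi_squared)
```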
so coming back to our fitted data we can plot it out again here now the amazing thing here is just how accurate this sort of fitting really is it's really cool you know we've got quite noisy data and a gradient of 215 but the uncertainty is only nine it's about five percent it's really amazing now you should always plot your fit and visually compare it just as a sanity check and we can see why this is a good idea here this is anscombe's famous quartet you can find the graph on wikipedia all these four data sets have the same chi squared, means, best fit lines and uncertainties in the fitting parameters but they're obviously very different data in the right hand two cases probably fitting a straight line is just the wrong thing to do in the bottom left if you remove the flyer data point the gradient's different and so is the intercept it's only the top left where the fit's actually doing the right thing all together and there's another subtlety actually which is if we go back and look at c the intercept we can see that the intercept depends on the gradient m which is what we said earlier when we looked at a plot of chi squared now there's a way to recast the problem which is to look at the deviations from the center of mass of the data at x bar instead and then the constant term in the fit, b now rather than c, is the location of the center of mass in y at y-bar and that constant b doesn't depend on the gradient anymore and neither therefore does its uncertainty include a term from the uncertainty in m in fact if i plot out the contour plot for chi squared when i do that i find that it then isn't slanted it's a nice sort of circular looking thing because i've removed the interaction between m and the constant term so it's a mathematically much more reasonable well-posed problem so that's the essence of regression of how to fit a line to some data and this is a totally really useful life skill whatever or almost whatever your professional job is what we'll do in the next couple of videos is look at how to do this in more complicated cases with more complicated functions and how to extend the idea of regression to those cases the main thing really that we've defined here that's important to remember is this goodness of fit estimator chi squared the sum of the squares of the deviations of the fit from the data and chi squared is going to be really useful to us going forward in this video we're going to look at how to finally learn how to fit our distribution of heights data then in the following exercise you'll actually do that for yourself and then in the next video we'll wrap up with looking at how to do this practically in matlab or python so we're looking at how to fit a function that's arbitrarily complicated compared to the simplest case of linear regression y equals mx plus c that we looked at last time of course there's intermediate possibilities between very complicated and the simplest possible but this gives you the general case before we move on to look at how to do it on the computer using tools instead of writing our own so let's say we have a function y of some variable x and that function has parameters in it ak where k goes from 1 all the way up to m so there's m of these a's so for example we could have x minus a1 squared plus a2 that would be an example of some function y this function isn't linear in a1 notice if i double a1 i don't double the function so it's a non-linear least squares we're going to do now say i want to fit the parameters ak to some data so i've got some data
observations i've got i equals 1 to n of them and i've got pairs of data y i and x i so for every x i've got a y and i have an associated uncertainty sigma i that is the more uncertain i am about the data point y i the bigger the uncertainty sigma is going to be so i can sketch that out something like this for instance as an example my y i's and my x i's are there so i've got an x i y i and each one's got an uncertainty sigma i there then i'm going to define a goodness of fit parameter chi squared as being the sum over all the data points i of the difference between the y i and the model of x i with its parameters ak and i'm going to divide all of those by sigma i squared and i'm going to take the squares of the differences so what i'm doing here is i'm penalizing each of these differences by its uncertainty sigma i squared when i make chi squared so that uncertain data points have a low weight in my sum of chi squared so that they don't affect the fit too much if we don't know what the sigmas are we could assign them all to be one right they will just drop out but if we have an idea of the measurement uncertainties this gives us a way to include it and my minimum of chi squared is of course going to be when the grad of chi squared is equal to zero now in the general case i might be able to write down an expression for grad here but i might not be able to solve it algebraically so instead i'm going to look to solve grad chi squared equals zero by steepest descent going down the contours simply by updating the vector of fitting parameters a so i've got my vector a i'm going to say that my next iteration is going to be my current iteration minus some constant times the grad of chi squared so i'm going to go down the gradient here by an amount given by the constant which is sort of a pull handle for how aggressive i want to be in going down the steepest descent and that i'm going to use to make my next guess for what the fitting parameters should be and i'm going to write them down as a vector and i'll keep doing that until i reach this criterion that the grad of chi squared is zero which means i've found the minimum or failing that until chi squared stops changing which should be the same thing right or i just give up because it's been so many iterations and i get bored and i decide that something's gone wrong so to do this grad i've got to differentiate chi squared so i've got to do d chi squared by d a k for each of the k's in turn and when i do that well the sum has nothing to do with k because it's to do with i and the sigma squared has nothing to do with k either so when i differentiate this i'll get the 2 down and then i'll have the bracket itself y i minus y of x i and the aks and then i'm going to get the differential of the bracket i get a minus sign out of that and i get d y by d a k so that's going to be my differential there the minus 2 i can just take out so when i come to update here if i wrap the minus 2 into the constant i can just make that a plus and ignore the two so i'm then going to get a current plus this thing the sum from i equals 1 to n because the minus signs will go and i'll wrap the 2 into the constant then i'll just get this y i minus y of x i and the aks divided by sigma i squared times the differential evaluated at a current because i don't know a next yet so the steepest descent formula here is just going to be that a next is equal to a current plus this sum.
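here's a sketch of that update rule in python for the example function y equals x minus a1 all squared plus a2 — the data is made up, the sigmas are set to one as suggested above, and the constant and iteration count are arbitrary choices that happen to behave well for this example; the derivatives it needs are exactly the ones worked out next.

```python
import numpy as np

def model(x, a):
    # the example from the video: y = (x - a1)**2 + a2
    return (x - a[0])**2 + a[1]

def dmodel_da(x, a):
    # partial derivatives of the model with respect to each fitting parameter
    return np.array([-2 * (x - a[0]),         # dy/da1
                     np.ones_like(x)])        # dy/da2

# made-up data generated from 'true' parameters (2, 1) with a little noise
rng = np.random.default_rng(0)
x_data = np.linspace(-1, 5, 30)
y_data = model(x_data, [2.0, 1.0]) + rng.normal(0, 0.1, x_data.size)
sigma = np.ones_like(x_data)                  # uncertainties all set to one

a = np.array([0.0, 0.0])                      # starting guess
const = 5e-4                                  # aggressiveness of the descent

for _ in range(5000):
    r = (y_data - model(x_data, a)) / sigma**2      # weighted residuals
    a = a + const * (dmodel_da(x_data, a) @ r)      # a_next = a_current + const * sum

print(a)    # should land near the 'true' values (2, 1)
```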
and i've got to be able to differentiate y with respect to a k and this is one of those formulas that looks really intimidating but really isn't when you actually try to use it for our example here if we differentiate with respect to a1 then we'll get that dy by da1 is equal to minus 2 times the bracket x minus a1 i take the 2 down and i get a minus sign when i differentiate the stuff in the bracket and when i do dy by da2 i'm just going to get 1. so it's actually really easy when we come to finally use it but the expression looks intimidating so that's the steepest descent formula for the case of fitting a non-linear function where we're trying to minimize the sum of the squares of the residuals and this is therefore called non-linear least squares fitting there are lots of more complicated methods than steepest descent for solving these sorts of problems which we'll look at a little bit next time but first i want you to try and give this a go and code it up and see how it looks for the sandpit problem that you were doing before so that's the simplest version of how to do a general fitting a finding of the minimum or least value of the sum of the squares of the residuals for a model that's non-linear in both the functions and the fitting parameters so it's called generalized nonlinear least squares fitting in this video we're going to make some final comments on the least squares regression fitting of data and we're going to look at how we really do this in anger in the world using computational tools like matlab or python or r so you've made a gradient descent least squares minimizer and then used that to solve the sample problem already there are a few comments to make before we move on in reality there are a huge number of solvers for non-linear least squares problems we can observe that if we do a taylor series expansion of chi squared then the second term the second derivative is the hessian which gives us information about the curvature or the gradient of the gradient the gradient of the jacobian and therefore we can shoot directly for where the jacobian is zero just as in newton-raphson using that second derivative now using the hessian would be faster than simply taking steps along a steepest descent algorithm effectively we'd be using the hessian to give us a guess as to the size of the step we should take in gradient descent the problem is that often the hessian isn't very stable especially far from the minimum so the levenberg-marquardt method uses steepest descent far from the minimum and then switches to using the hessian as it gets close to the minimum based on a criterion as to whether chi squared is getting better or not if it's getting better it uses the hessian and if it's in trouble it uses steepest descent there's also the gauss-newton method and the bfgs method amongst many others that either use the hessian directly or build up information about the hessian over successive iterations and depending on the convergence then different methods may be better than others robust fitting is another topic you should be aware of in case you need to look it up later if we come back to anscombe's quartet here we see that in the bottom left data set in our problem there's just that one flyer data point a truly robust fitting method would be unbothered by such a data point one approach to robust fitting minimizes the absolute deviations instead of the squares of the deviations so it doesn't weight the points that are far away from the line as strongly and that means it fits what visually looks a little bit better.
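as a taster of robust fitting, here's a sketch using scipy's least_squares minimizer, assuming scipy is available — the soft_l1 loss is just one of its built-in robust loss options, picked for illustration rather than being the specific method behind the figure discussed above, and the straight-line data with its single flyer point is made up.

```python
import numpy as np
from scipy.optimize import least_squares

# made-up straight-line data with one deliberate flyer point
x = np.array([0., 1., 2., 3., 4., 5., 6., 7.])
y = 2.0 * x + 1.0
y[6] += 15.0                      # the outlier

def residuals(p):
    m, c = p
    return y - (m * x + c)

plain = least_squares(residuals, x0=[1.0, 0.0])                   # ordinary least squares
robust = least_squares(residuals, x0=[1.0, 0.0], loss='soft_l1')  # down-weights big residuals

print(plain.x)    # pulled towards the flyer
print(robust.x)   # should stay close to m = 2, c = 1
```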
now let's turn to look at how you do this in the real world. in matlab it's easy: you simply import your data using the import data tab at the top of the screen and then flick over to the apps tab and start up the curve fitting app. there you can even symbolically define your own model; you have to pick a starting guess and it will fit your function for you, or you can use a pre-built function that already knows its jacobians and will therefore be faster and more efficient. in python it's very nearly as simple: in the scientific python scipy set of modules, the optimize module includes a least squares curve fitting minimizer, curve_fit, which is at the link below this video. accompanying this video we've left you one of the scipy examples for fitting some data so you can see how to use it. this is it, and it's amazing: the fitting really works very well and produces this nice graph here of this apparently noisy data with this crazy function, and the fitting only takes three lines, the three bold ones here, two lines to define the function and one line to do the fit. the rest of that code is all about plotting it and importing the data, or generating it in this case actually. effectively this set of routines in scipy is the modern implementation of what's called minpack, a fortran set of routines published by jorge moré in 1980 and described in the book numerical recipes, and it's absolutely astonishing how easy it is to do this stuff now. you know, when i was a student you were reduced to reading a long, difficult textbook; now you effectively just write one line in python. just as in python, in the r statistical programming language there's also a minimizer for doing non-linear least squares fitting of models to data, so if you like r for looking at data you can do all this stuff just as easily in r. so what i want you to do now is write a python code block to fit the gaussian distribution shown here. you'll need to give it a starting guess, and we give you the input data for the height distribution in the population; this is the final height distribution data set that we've been talking about schematically all the way through these two courses. what you'll find is that the mean height, b here, is about 178 centimeters, and the characteristic width of the distribution is about 11 centimeters, that's parameter c in the equation here. now it's useful at this point to think about why we need the starting guess. if we started with a guess for the mean of a hundred centimeters, the model curve wouldn't overlap with the data at all, so when we did a little move to b we'd get no change, and therefore the gradient of chi squared with respect to the mean b would be zero. the algorithm wouldn't know what direction to go in, we wouldn't get a sensible answer for the jacobian or grad, and therefore our algorithm wouldn't know where to go to find the minimum. so in doing any of this data fitting it's vital to come up with a good means for generating a starting guess; here it's easy, you pick the biggest value for instance, not coincidentally. it's also equally important to compare the fit to the data and ask if you believe the final fit. so what we've done in this video is we've finished our little discussion of using vectors and multivariate calculus together to help us do optimizations of functions and to fit data to functions, and it's turned out in the end to be really easy computationally: we can fit functions in just a few lines of code in python or matlab or r.
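for reference, here's a hedged sketch of the kind of python block the exercise is asking for, using scipy.optimize.curve_fit; the gaussian parameterization with amplitude a, mean b and width c follows the description above, but the heights array here is generated synthetically because the course's actual height data set isn't reproduced in this text.

import numpy as np
from scipy.optimize import curve_fit

# gaussian model: amplitude a, mean b, characteristic width c
def gaussian(x, a, b, c):
    return a * np.exp(-(x - b)**2 / (2.0 * c**2))

# stand-in for the course height data: synthetic heights in centimetres
rng = np.random.default_rng(1)
heights = rng.normal(178.0, 11.0, 5000)
counts, edges = np.histogram(heights, bins=40, density=True)
centres = 0.5 * (edges[:-1] + edges[1:])

# starting guess: biggest bin value, its location, and a rough width
p0 = [counts.max(), centres[counts.argmax()], 10.0]
popt, pcov = curve_fit(gaussian, centres, counts, p0=p0)
print(popt)   # roughly [peak height, 178, 11]

the starting guess p0 is built exactly the way suggested above, by picking the biggest bin and its location, so the initial curve overlaps the data and the gradients are informative.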
but now, after all this work, you understand something of how those algorithms work under the hood, and that means that hopefully you'll be much better at figuring out how to fix them when they go wrong, for instance when the jacobian doesn't make any sense. and also you've got all the maths you need to access the next course in this specialization on machine learning. that brings us to the end of this introductory course on multivariate calculus for machine learning. congratulations to you all for sticking with the course, and thank you for your contributions to the forums. we've moved very quickly, starting right at the beginning with the definition of the derivative, then exploring multi-dimensional mountain ranges, training neural networks, and finally finishing up by seeing regression in action. although we certainly didn't cover all of the details of a traditional calculus course, what i hope is that you come away with an understanding of the fundamentals and language of calculus, as well as some intuition as to where it might be usefully applied in machine learning. it's been such a pleasure to teach you, and best of luck with the next course in the mathematics for machine learning specialization. hi, i'm mark, i'm a lecturer in statistical machine learning at imperial college. in this course we're going through the mathematical foundations that we need in order to do dimensionality reduction with principal component analysis. data in real life is often high dimensional. for example, if we want to estimate the price of our house in a year's time, we can use data that helps us to do this: the type of the house, the size, the number of bedrooms and bathrooms, the value of houses in the neighborhood and when they were bought, the distance to the next train station and park, the number of crimes committed in the neighborhood, the economic climate, and so on and so forth. there are many things that influence the house price, and we collect this information in a data set that we can use to estimate the price of our house. another example is a 640 by 480 pixel color image, which is a data point in a 921,600-dimensional space, because every pixel corresponds to three dimensions, one for each color channel: red, green and blue. working with high dimensional data comes with some difficulties: it is hard to analyze, interpretation is difficult, visualization is nearly impossible, and from a practical point of view storage can be quite expensive. however, high dimensional data also has some nice properties. for example, high dimensional data is often overcomplete, that means many dimensions are redundant and can be explained by a combination of other dimensions. for example, if we took a color image with four channels, red, green, blue and a grayscale channel, then the grayscale channel can be explained by a combination of the other three channels, and the image could equally be represented by red, green and blue alone, as we can reconstruct the grayscale channel just using that information. dimensionality reduction exploits structure and correlation and allows us to work with a more compact representation of the data, ideally without losing information. we can think of dimensionality reduction as a compression technique, similar to jpeg or mp3, which are compression algorithms for images and music. let's take a small binary image of the handwritten digit 8.
we're looking at 28 by 28 pixels each of which can be either black or white therefore this image can be represented as a 784 dimensional vector however in this example the pixels are not randomly black or white there is structure they often have the same value as the neighboring pixels in this data set there are many examples of an eight they differ a little bit but they look sufficiently similar that we can identify them as eight we can use dimensionality reduction now to find a lower dimensional representation of all eights that is easier to work with than a 784 dimensional vector the lower dimensional representation of a high dimensional data point is often called a feature or a code in this course we will look at a classical algorithm for linear dimensionality reduction principal component analysis or pca the purpose of this course is to go through the necessary mathematical details to derive pca in the first module we will look at statistical representations of data for example using means and variances and we also describe how means and variances change if we linearly transform our data set in the second module we will look at the geometry in vector spaces and define how to compute distances and angles between vectors in the third module we'll use these results to project data onto lower dimensional subspaces and in the fourth module we'll derive pca as a way to reduce dimensionality of data using orthogonal projections when we work with data we often need to find compact ways to describe some of its properties it wouldn't be particularly handy or useful to give somebody the entire data set instead we can use some statistical properties to describe the data for example we can say something about the center or the average of the data or the spread or orientation in the following videos we'll be looking at exactly these kind of statistical properties and discuss mean values to describe average data points variances that make statements about the spread and covariances which characterize orientation and correlations data is often compactly described by some of its statistical properties such as the mean and the variance in this video we'll explain how to compute means of data sets the mean of a data set describes the average data point the mean does not have to be a typical data point and it also does not need to be part of the data set itself for example if we look at a set of images of digit 8 the average image looks like this it has properties of all images in the data set but it is not part of the data set itself we obtain the average 8 as follows remember that an image can be represented as a long vector in a high dimensional vector space by stacking all pixels together after transforming all images into these vectors we take all image vectors in our data set add them together and divide by the number of images in the data set this gives us the average image vector if we then reshape that vector into an image again we get the average digit in the data set here's an example with four eighths the mean of the first one is just the image itself but when we add the second image we see that the average image now contains properties from both images when we add the third image the mean image is all three images on top of each other divided by three after the fourth image we can still see characteristics of all four images in the average image when we add more images to our data set the average digit becomes a bit more blurry and if we take all eights in our data set we get this eight as 
the average image. generally, if we have a data set d consisting of data points x1 to xn, we get the mean value or expected value of this data set as follows: the expected value of d is one over the number of data points, so one over capital n, times the sum from n equals 1 to capital n of xn. so we sum up all data points in our data set and divide by the number of data points that we have. let's look at an example. i create a data set consisting of the five numbers that i get when i roll five dice: we got 1, 2, 4, 6 and 6. so let's write this down: we're going to call this d prime, which is 1, 2, 4, 6 and 6. and now the expected value or the average of this data set is the sum of all elements in the data set, so that is 1 plus 2 plus 4 plus 6 plus 6, divided by the number of elements in our data set, which is 5. so if we sum these things together and divide by 5 we get 19 over 5, or 3.8. we can clearly see that 3.8 is not part of the data set and cannot even be achieved by rolling a die, therefore it is also worthwhile keeping in mind that although the mean is the average of the data set it does not have to be a typical instance. in this video we computed the mean value of data sets as the average data point; the mean value doesn't have to be part of the data set. in the next video we'll introduce variances to describe the spread of the data around the mean value. in the last video we discussed mean values of data sets, which represent an average data point. in this video we will look at variances to describe other properties of a data set. let us have a look at two different data sets, d1 and d2. d1 is represented by the blue dots located at 1, 2, 4 and 5, and d2 is represented by the red squares at -1, 3 and 7. d1 and d2 have the same mean, which is 3, but the data points in d2 are less concentrated around the mean than the data points in d1. remember, the mean value is the data point we would expect on average, but to describe the concentration of data points around the mean value we can use the concept of the variance. the variance is used to characterize the variability or spread of data points in a data set. in one dimension, we can look at the average squared distance of a data point from the mean value of the data set. okay, let's do this for d1 and d2. d1 was 1, 2, 4 and 5, and the mean value or expected value of d1 was 3; d2 was -1, 3 and 7, with exactly the same mean value. so now we want to compute the average squared distance of d1 from the mean, and of d2 from the same mean. let's do this for d1 first: we get 1 minus 3 squared, plus 2 minus 3 squared, plus 4 minus 3 squared, plus 5 minus 3 squared, so that is the sum of the squared distances, and to get the average we divide by 4, which is the number of data points in d1. if we do the computation we get 4 plus 1 plus 1 plus 4, divided by 4, which is 10 over 4. now we do the same for d2: we get minus 1 minus 3 squared, plus 3 minus 3 squared, plus 7 minus 3 squared, and we divide by the number of data points in d2, which is 3, and we get 16 plus 0 plus 16, divided by 3, which is 32 over 3.
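as a quick numeric check of these numbers, here's a short numpy snippet; note that numpy's mean and var use the same one over n convention as the calculation above.

import numpy as np

d_prime = np.array([1, 2, 4, 6, 6])   # the dice data set from above
print(d_prime.mean())                 # 3.8

d1 = np.array([1, 2, 4, 5])
d2 = np.array([-1, 3, 7])
print(d1.var())                       # 2.5, i.e. 10 over 4
print(d2.var())                       # 10.666..., i.e. 32 over 3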
so now this number is now bigger than this number which means that the average squared distance of d2 from the mean value is bigger than the average squared distance of d1 from the from the mean value which indicates that the spread of the data is higher in d2 then in d1 so what we have done can be formalized so assuming we have a data set consisting of n data points x1 to xn then we can define the average squared distance as the following so we have x1 up to xn and then we define this to be our data set x and we define now the variance of this data set to be 1 over n times the sum of small n equals 1 to big n of x n minus mu squared where mu is the mean value of the data set x yeah so what we have done here is exactly the same as what we did before with d1 and d2 we computed an average squared distance of the data points in the data set from the mean value of the data set and now we can also make some statements about this first the variance as defined here can never be negative because we just sum up squared values and that also means we can take the square root of the variance and this is called the standard deviation the standard deviation is expressed in the same units as the mean value whereas the variance unfortunately expressed in squared units so comparing them is quite difficult therefore when we talk about spread of the data we usually look at standard deviations so in this video we looked at variances of one-dimensional data sets and in the next video we will generalize this to higher dimensions in the last video we looked at variances for one-dimensional data sets and in this video we were looking at variances for high-dimensional data sets the intuition and definition of the variance we had earlier does not really work in the same way in high dimensions and squaring vectors is not really defined assuming we have a two dimensional data set we can now compute variances in the x direction in the y direction but these variances are insufficient to fully describe what is going on in the data set in particular we only have the variation of the data in either direction independent of the other direction but we may also be interested in the relationship between the x and y variables and this is where the concept of the covariance between these components comes into play let's have a look at an example in two dimensions for this data set we can compute the variances in the y direction and the variances in the x direction which are indicated by these vertical and horizontal bars but this can be insufficient because we can look at other examples where the variances in the x and y directions are the same but the data set look very different if we look at this particular example we have a different shape of the data set but the variances in the x direction and the variance in the y direction are exactly the same and the mean values of these data sets are also identical and i can look at different sets like this one here and this one these four data sets look very different we've seen four different examples with four different properties or shapes of the data sets but the variances in x and the variances in y and the mean values are identical if we exclusively focus on the horizontal and vertical spread of the data we can't explain any correlation between x and y in this last figure we can clearly see that on average if the x value of a data point increases then on average the y value decreases so that x and y are negatively correlated this correlation can be captured by extending the 
notion of the variance to what is called the covariance of the data the covariance between x and y is defined as follows so covariance between x and y is defined as the expected value of x minus the mean in the x direction times y minus the mean in the y direction just right mu x is the expected value in the x direction and mu y is the expected value of the y coordinates for the 2d data we can therefore obtain four quantities of interest we obtain the variance of x we get the variance of y and we also have the covariance terms the covariance between x and y and the covariance between y and x so we summarize these values in this matrix called the covariance matrix with four entries and in the top left corner we have the variance in the x direction then we get the covariance term between x and y in the top right corner the covariance between y and x in the bottom left corner and the variance of y in the bottom right corner if the covariance between x and y is positive then on average the y value increases if we increase x and if the covariance between x and y is negative then the y value decreases if we increase x on average if the covariance between x and y is zero then x and y have nothing to do with each other they are uncorrelated the covariance matrix is always a symmetric positive definite matrix with the variances on the diagonal and the cross-covariances or covariances on the off-diagonals if we now look at d-dimensional data sets let's say we have a data set consisting of n vectors x 1 2 x n and every x i is in rd then we can compute the variance of this data set as 1 over n times the sum i equals 1 to capital n x i minus mu times x i minus mu transpose where mu is the mean of the data set and this is called the covariance matrix of the data and this is a d by d matrix so previously we looked at means and variances to describe statistical properties of data sets in this video we'll discuss what happens to means and variances when we linear transform the data set that means we shift it around or stretch it so let's look at a data set d which is given by these blue dots over here so we have a data set d which is -1 2 and 3. now let's compute the mean value of this data set to start with so the expected value of d is minus 1 plus 2 plus 3 divided by 3 which is 4 over 3. so i'm indicating the mean value of this data set by this blue star so now the question is what happens to this data set when we shift it so let's have a look at a shift by two to the right so we end up with these red dots now what happens to the mean value of this data set if we shift the data set the original data set to the right well the answer is the mean also shifts by 2. so let's define d prime to be the data set one plus sorry one four and five which is d plus two for every individual component in d then the expected value of d prime is 1 plus 4 plus 5 divided by 3 which is 10 over 3 and in particular we can write this in a slightly different way we can write this as 4 over 3 which comes from here plus 2 which is the shift or the offset of the original data set so we can generalize this now to general shifts we can write an expected value of d plus a where a is a constant factor is a plus the expected value of d so up to this point we know what happens to the mean of the data set when we shift the data set but what happens if we stretch the data set let's say we also stretch by a factor 2 and that means we multiply every individual component in the data set d by 2. 
so then we end up with this data set on the indicated by the red dots here and we define d double prime to be minus 2 4 and 6. and if we now compute the mean of d double prime then we end up with minus 2 plus 4 plus 6 divided by 3 which is 8 over 3 and we can rewrite that again in a convenient way which is four over three which again comes from here plus sorry times 2 which is the scaling factor so in general we can write that the expected value of alpha times d is alpha times the expected value of d putting everything together we can shift and scale our original data set and the new mean of this linear transformation is given as follows so if we compute the expected value of alpha times d plus a we get alpha times the expected value of d plus a where alpha is the scaling factor and a is the shift in this video we looked at the effect of linear transformations on the mean value of a data set in the next video we will look at exactly the same for the variances we've seen what shifting and scaling a data set does to the mean now let's have a look at the effects on the variance remember that the variance is a measure of the spread of the data what do we expect when a data set is shifted let's have a look at this data set over here we have three data points given at minus one plus two and plus three and now we are shifting the data set towards the right the variance of the data set is indicated by the blue bar on the at the bottom the shifted data set is now given by the red dots and we shift every individual data point by two so question is now what happens to the variance well the shift of the data does not really affect the relation of the data points among themselves therefore the variance does not change the variance is identical so the values of the blue data set is identical to the variance of the red data set therefore a general result is that if we have the variance of d that is exactly the same as the variance of our data set d plus a where a is an offset applied to every individual element of d let's now scale the data set and see what effect that has on the variance of the data so we're going to take exactly the same data set as before the variance is indicated by the blue bar and we're going to scale every individual data point by two so the question is what is the variance of two times d so new data set is indicated by these red dots remember that the variance is the average squared distance of the data points from the mean if we scale the data set by a factor of 2 the distance of every data point to the mean is scaled by 2 but the squared distance is scaled by 4 and the variance is therefore 4 times as big as it used to be and our next result is that the variance of alpha times d is alpha squared times the variance of d where alpha is a real number that scales every individual element in the data set d now let's have a look at high dimensional problems assume we have a data set d which is a collection of data points x1 to xn and the xi live in rp remember the variance of this data set is given by a covariance matrix if you perform a linear transformation of every data point say ax i plus b for a given matrix a and an offset vector b the question is what happens to our data set if we do this to every single data point while we get the covariance matrix of the transformed data set as follows we're getting the variance of a times d plus b is a times the variance of d times a transpose in this video we saw what effects a linear transformation of a data set has on the mean and the 
variance in particular we saw that shifting data has only an effect on the mean but not the variance where scaling the data affects both the mean and the variance we've already come to the end of this module we looked at statistics to summarize and characterize data sets means variances and covariances we looked at how these statistics change when we shift or scale the data set we'll use means and covariances in module 4 when we discuss the pca algorithm for dimensionality reduction well done it is often necessary to measure similarity between data points in the context of dimensionality reduction we are interested in finding compact representations of data that live in a lower dimensional space but which are similar to the original data one of the key concepts we will be looking at in this course is orthogonality we use orthogonal projections of data points as a way to compress data while minimizing the loss of information thinking of data points as vectors in the vector space will allow us to look at two types of similarity between data points distances of data points from each other and the angle between these two data points let's have a look at an example assume we have we're living in a two-dimensional world and we have two vectors x and y things that we will be discussing in this course are how to compute the length of individual vectors so that means for x that means the length of this line here we will be looking at angles between two vectors and we would be looking at distances between individual data points or vectors in this vector space so in this case you would be interested in the distance between x and y in order to measure angles and compute lengths and distances we need to equip the vector space with an inner product which allows us to talk about geometric properties in this vector space in the next videos we will be looking at computing lengths distances between vectors and angles between vectors in order to measure angles lengths and distances we need to equip the vector space with an inner product which allows us to talk about geometric properties in a vector space an example of an inner product that we may know already is the dot product between two vectors x and y if x and y are two vectors in rn so then the dot product is defined as uh x transpose y is the sum i equals 1 to n of x i times y i where x and y are n dimensional vectors the length of x is then defined as the square root of the dot product of x with itself so the length of x is the square root of x transpose x which we can also write as the square root of the sum of i equals 1 to n of x i squared let's have a look at an example we take a vector x as 1 2 and the vector y as 2 1 in a 2 dimensional plane then we can compute the length of vector of the vector of x as the square root of 1 squared plus 2 squared which is the square root of 5 and the length of y similarly is the square root of 5 as well if we're interested in the distance between two vectors x and y we simply compute the length of the difference vector so we generally define then the distance between x and y to be the length or the norm of x minus y which is the square root of x minus y transpose times x minus y so now let's compute the distance between our two vectors over here so the distance between these two vectors is effectively just the length of this difference so we can write this as the norm of 1 2 minus 2 1 so this is our difference vector and that is the norm of minus 1 and 1 and this is the square root of 1 plus 1 which is the 
square root of 2. so the length of this yellow segment is square root 2. the last thing we are still interested in is the angle between two vectors, and we can compute the angle also using the dot product. the cosine of the angle, let's call this angle alpha, is given by x transpose y divided by the length of x times the length of y. in our example over here we get that the cosine of the angle between x and y is x transpose times y, which is 4, divided by the length of x times the length of y, but we've done these computations already, which come from here, so it's square root 5 in each case. therefore we get that cosine alpha is 4 over 5, which corresponds to an angle of approximately 0.64 radians, and that is our angle alpha between these two vectors. in this video we looked at dot products as special cases of inner products to compute lengths of vectors, distances between vectors and angles between two vectors, and in part two of this video we will be looking at general inner products to compute exactly the same quantities. previously we looked at the dot product to compute angles, lengths and distances. in this video we will be looking at generalizations of the dot product in order to compute angles, lengths and distances. sometimes it is necessary to use an unconventional way to measure these geometric properties, and the inner product allows us to do exactly this kind of thing. an inner product is a generalization of the dot product, but with the same idea in mind: we want to express geometric properties such as lengths and angles between vectors. let's define what an inner product actually is, so let's write the definition. for any vectors x and y in a vector space v, an inner product is defined as a symmetric, positive definite, bilinear mapping: we take a mapping that takes two inputs out of this vector space, so it's a mapping from v times v to the real numbers, and we say this function is symmetric, positive definite and bilinear. let's unpack this a little bit, starting with bilinearity. bilinear means that for vectors x, y and z in this vector space and real numbers lambda, the inner product between lambda x plus z and y can be written as lambda times the inner product between x and y, plus the inner product between z and y. this is linearity in the first argument of the function, and we require linearity also for the second argument, so similarly we require that the inner product between x and lambda y plus z is lambda times the inner product between x and y, plus the inner product between x and z. this means linearity also in the second argument, and that's the reason why this is called bilinear: it means linearity in both arguments of the function. positive definite means that the inner product of x with itself is greater than or equal to zero, and equality holds if and only if x is the zero vector. and the last component that we need is symmetry: symmetric means that the inner product of x and y is the same as the inner product of y and x, so the order does not matter. let's have a look at an example in r2. if we define our inner product to be x transpose times the identity matrix times y, then we get exactly the dot product that we are very familiar with. but now let's have a look at a different example where we define our inner product to be x transpose times a times y, where a is the matrix 2 1 1 2. then we can also write our inner
product to be 2 times x1 y1, plus x2 y1, plus x1 y2, plus 2 times x2 y2, and this inner product is different from the dot product. any symmetric, positive definite matrix a in this equation defines a valid inner product. in this video we introduced the concept of an inner product, which we will use in the next videos to discuss geometric properties of vectors, such as lengths and angles. in the last video we defined inner products; now we will use inner products to compute the lengths of vectors and distances between vectors. the length of a vector is defined by the inner product using the following equation: the length of a vector x is defined as the square root of the inner product of x with itself. remember that the inner product is positive definite, which means this expression is greater than or equal to zero, therefore we can take the square root. we can now also see that the length of a vector depends on the inner product, and depending on the choice of the inner product the length of a vector can be quite different; similarly, the geometry in the vector space can be very different. the length of x is also called the norm of x. let's have a look at an example. assume we're interested in computing the length of a vector in two dimensions, and x is given as the vector 1 1, so in a diagram our vector x would be here. now we're interested in computing the length of this vector, and in order to compute the length of the vector we need to define an inner product, so why don't we start with the standard dot product. if we define the inner product of x and y to be x transpose times y, then the length of x is the square root of 1 squared plus 1 squared, which is the square root of 2. now let's have a look at a different inner product. let's define the inner product of x and y to be x transpose times the matrix 1, minus a half, minus a half, 1, times y, which we can also write as x1 y1, minus a half times x1 y2 plus x2 y1, plus x2 y2. using the definition of this inner product, the length of a vector is the square root of x1 squared, minus a half times x1 x2 plus x2 x1, plus x2 squared, and this is identical to the square root of x1 squared minus x1 x2 plus x2 squared, which will give smaller values than the dot product definition up here if the product x1 x2 is positive. if we now use this inner product to compute the length of our vector up here, we will get that the squared norm, the inner product of x with itself, is 1 plus 1 minus 1, which is 1, and therefore the norm of x is just 1. so 1 in this case would be the length of the vector using this unusual definition of an inner product, whereas the same vector would be longer had we used the dot product up here.
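here's a small numpy sketch of that comparison; the helper function inner is just an assumed convenience for evaluating x transpose times a times y, it's not part of any particular library.

import numpy as np

def inner(x, y, A):
    # inner product <x, y> defined by a symmetric positive definite matrix A
    return x @ A @ y

x = np.array([1.0, 1.0])
I = np.eye(2)                              # gives the ordinary dot product
A = np.array([[1.0, -0.5], [-0.5, 1.0]])   # the alternative inner product above

print(np.sqrt(inner(x, x, I)))   # 1.414..., the usual length sqrt(2)
print(np.sqrt(inner(x, x, A)))   # 1.0, the same vector is shorter under this inner product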
the norm we just looked at also has some nice properties, so let me write a few of those down. one of the properties is that if we take a vector and stretch it by a scalar lambda, then the norm of the stretched version is the absolute value of lambda times the norm of x. the second property is the triangle inequality, which says that the norm of x plus y is smaller than or equal to the norm of x plus the norm of y. let's have a look at an illustration. assume we have our coordinate system in two dimensions and we use the standard vectors x equals 1 0, so x would be sitting exactly here, and y equals 0 1, then x plus y is sitting here. if we use the dot product as our inner product, then the norm of x is 1, which is the same as the norm of y, and the norm of x plus y is the square root of 2. so the triangle inequality says that the norm of x plus y is smaller than or equal to the norm of x plus the norm of y, and that is also true in our case, because the square root of 2 is smaller than or equal to 2. there's another inequality i would like to mention, which is the cauchy-schwarz inequality, and that one says that the absolute value of the inner product of x with y is smaller than or equal to the product of the individual norms of the two vectors. in this video we looked at lengths of vectors using the definition of inner products, and in the next video we're going to use this to compute distances between vectors. now that we know how to compute the length of a vector, we can also compute distances between any two vectors x and y. the distance between two vectors is defined as the length of the difference vector, so we can write the distance between x and y to be the norm of x minus y, and as we know this depends on the definition of our inner product. if we use the dot product, then the distance is called the euclidean distance. let's have a look at an example. we're going to look at two vectors x and y, where x is 2 3 and y is 4 1, so let's draw this: this is x, and y is over here. in order to compute the distance between these two vectors, the first thing we need to do is have a look at the difference vector. x minus y is 2 minus 4 in the first component and 3 minus 1 in the second component, which means we get minus 2 and plus 2 as the difference vector. now we can define inner products, and let's use the dot product as our first example. if we use the dot product to compute the length of this difference vector, we get the square root of the first component squared plus the second component squared, which is 4 plus 4, so that's the square root of 8. and if we use a different inner product, say we define the inner product between x and y to be x transpose times the matrix 1, minus a half, minus a half, 1, times y, then if we use the difference vector here the result will be the square root of 12. so we can see that depending on the choice of our inner product we will get different answers for what the distance between x and y actually is. in this video we computed distances between two vectors using inner products, and we saw that depending on the inner product the distance between these two vectors can differ. previously we looked at lengths of vectors and distances between vectors; in this video we'll introduce angles as a second important geometric concept, which will allow us to define orthogonality. orthogonality is central to projections and dimensionality reduction. similar to lengths and distances, the angle between two vectors is defined through the inner product. if we have two vectors x and y and we want to determine the angle between them, we can use the following relationship: the cosine of the angle between the two vectors is given by the inner product between the two vectors divided by the norm of x times the norm of y. let us have a look at an example and compute the angle between two vectors x, which is 1 1, and y, which is 1 2. let's quickly draw this: this is x and this is y, and we're interested in the angle omega between them. if we use the dot product as the inner product, we get that the cosine of omega is x transpose y divided by the square root of x transpose x times y transpose y, which is 3 divided by the square root of 10.
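the same kind of helper can be reused to check the distance and angle examples numerically; again this is only an illustrative sketch, not course code.

import numpy as np

def inner(x, y, A):
    return x @ A @ y   # inner product defined by the matrix A

I = np.eye(2)
A = np.array([[1.0, -0.5], [-0.5, 1.0]])

# distance example from above: x = (2, 3), y = (4, 1), difference (-2, 2)
d = np.array([2.0, 3.0]) - np.array([4.0, 1.0])
print(np.sqrt(inner(d, d, I)))   # sqrt(8) with the dot product
print(np.sqrt(inner(d, d, A)))   # sqrt(12) with the other inner product

# angle example with the dot product: x = (1, 1), y = (1, 2)
x, y = np.array([1.0, 1.0]), np.array([1.0, 2.0])
cos_omega = inner(x, y, I) / np.sqrt(inner(x, x, I) * inner(y, y, I))
print(np.arccos(cos_omega))      # about 0.32 radians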
this means the angle is approximately 0.32 radians or 18 degrees intuitively the angle between two vectors tells us how similar their orientations are let's look at another example in 2d again with the dot product as the inner product we're going to look at the same vector x that we used to have before so x equals 1 1 which is this vector over here and now we choose y to be minus 1 and plus 1 and this is this vector now we're going to compute the angle between these two vectors and we see that the cosine of this angle between x and y is with the dot product x transpose times y divided by the norm of x times the norm of y and this evaluates to zero this means that omega is pi over two in in radians or if we want to say this in degrees we have 90 degrees this is an example where two vectors are orthogonal generally the inner product allows us to characterize orthogonality two vectors x and y where x and y are non-zero vectors are orthogonal if and only if the inner product is zero this also means that orthogonality is defined with respect to an inner product and vectors that are orthogonal with respect to one inner product do not have to be orthogonal with respect to another inner product let's take these two vectors that we just had where the dot product between them gave that they are orthogonal but now we're going to choose a different inner product in particular we're going to choose the inner product between x and y to be x transpose times the matrix 2 0 0 1 times y and if we choose this inner product it follows that the inner product between x and y is -1 this means that the two vectors are not orthogonal with respect to this particular inner product from a geometric point of view we can think of two orthogonal vectors as two vectors that are most dissimilar and have nothing in common besides the origin we can also find the basis of a vector space such that the basis vectors are all orthogonal to each other that means we get the inner product between bi and bj is zero if i is not the same index as j and we can also use the inner product to normalize these basis vectors that means we can make sure that every bi has length one then we call this an orthonormal basis in this video we discussed how to compute angles between vectors using inner products we also introduce the concept of orthogonality and so that vectors may be orthogonal with respect to one inner product but not necessarily if we change the inner product we will be exploiting orthogonality later on in the course if we have a vector and we want to compute the smallest difference vector to any point on a line that does not contain the vector then we will end up finding a point on the line such that the segment between the point and the original vector is orthogonal to that line in the previous videos we looked at properties of inner products to compute lengths angles and distances we focus on inner products of finite dimensional vector spaces in this video we will look at two examples of inner products of other types of vectors inner products of functions and inner products of random variables the inner products we discussed so far were defined for vectors with a finite number of entries and we can think of these vectors as discrete functions with a finite number of function values the concept of an inner product can be generalized to continuous valued functions as well and then the sum over individual components of vectors turns into an integral and the inner product between two functions is defined as follows we can write 
that the inner product between two functions u and v is the integral over an interval from a to b of u of x times v of x, dx. as with our usual inner product, we can define norms and orthogonality by looking at this inner product: if that integral evaluates to zero, the functions u and v are orthogonal. let's have a look at an example. if we choose u of x equals sine x and v of x equals cosine x, and we define f of x to be u of x times v of x, which is sine x times cosine x, then we end up with this function. we see that this function is odd, which means that f of minus x equals minus f of x. if we choose the integration limits to be minus pi and plus pi, then the integral of this product, sine of x times cosine of x, evaluates to zero, and that means that sine and cosine are orthogonal. it actually holds that if we look at a set of functions, say 1, cosine x, cosine 2x, cosine 3x and so on, then all of these functions are orthogonal to each other if we integrate from minus pi to plus pi.
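as a rough numerical check of this orthogonality, here's a short python snippet using scipy's quad integrator; the functions are just the ones from the example above.

import numpy as np
from scipy.integrate import quad

# inner product of sin and cos over [-pi, pi]: should be zero
value, err = quad(lambda x: np.sin(x) * np.cos(x), -np.pi, np.pi)
print(value)    # 0 up to numerical error

# cos(x) and cos(2x) are also orthogonal over the same interval
value2, _ = quad(lambda x: np.cos(x) * np.cos(2.0 * x), -np.pi, np.pi)
print(value2)   # 0 up to numerical error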
another example of defining an inner product between unusual types of objects is random variables, or random vectors. if we have two random variables which are uncorrelated, then we know the following relationship: the variance of x plus y is the variance of x plus the variance of y, where x and y are random variables. if you remember that variances are measured in squared units, this looks very much like the pythagorean theorem for right triangles, which states that c squared equals a squared plus b squared for triangles of this form, where this is a, this is b and this is c. let's see whether we can find a geometric interpretation of this variance relation for uncorrelated random variables. random variables can be considered vectors in a vector space, and we can define inner products to obtain geometric properties of these random variables. if we define the inner product between two random variables x and y to be the covariance between x and y, we see that the covariance is symmetric, positive definite and linear. linearity means that the covariance of lambda times x plus y with z, where x, y and z are random variables and lambda is a real number, is lambda times the covariance between x and z, plus the covariance between y and z. and the length of a random variable is the square root of the covariance of x with itself, which is the square root of the variance of x, and that is the standard deviation of the random variable x. so the standard deviation is the length of a random variable, and therefore the zero vector is a vector that has no uncertainty, that means its standard deviation is zero. if we now look at the angle between two random variables, we get the following relationship: the cosine of theta, the angle between two random variables, is by definition the inner product between the two random variables divided by the length of the first random variable times the length of the second random variable. if we write this out using the definition of our inner product, we get the covariance between x and y divided by the square root of the variance of x times the variance of y, and this evaluates to zero if and only if the covariance between x and y is zero, which is the case when x and y are uncorrelated. coming back now to our geometric interpretation, we replace a with the standard deviation of x, b with the standard deviation of y, and c with the square root of the variance of x plus the variance of y, and this is how we get our geometric interpretation of random variables. in this video we looked at inner products of rather unusual objects: functions and random variables. however, even with functions and random variables the inner product allows us to think about lengths and angles between these objects; in the case of random variables we saw that the variance of the sum of two uncorrelated random variables can be geometrically interpreted using the pythagorean theorem. this is the end of this module, congratulations. in this module we introduced the concept of an inner product, a generalization of the dot product. the inner product allows us to talk about geometric concepts such as lengths, distances and angles between vectors. one special case we discussed was orthogonality. in the next modules we will use orthogonality heavily for projections of high dimensional data onto lower dimensional subspaces, an integral part of principal component analysis. high dimensional data is often hard to analyze or visualize. however, high dimensional data quite often possesses the property that only a few dimensions contain most of the information, and most other dimensions are not essential to describe key properties of the data. when we compress or visualize high dimensional data we lose information; ideally, we find the most informative dimensions in the data and keep them when we compress the data, and ignore the irrelevant dimensions. in this course we will have a close look at orthogonal projections, which play a central role in algorithms for dimensionality reduction. in this video we'll look at orthogonal projections of vectors onto one-dimensional subspaces. let's look at an illustration. we're given a vector x in two dimensions, and x can be represented as a linear combination of the basis vectors of r2. we also have a one-dimensional subspace u with a basis vector b, which means all vectors in u can be represented as lambda times b for some lambda. now we're interested in finding the vector in u that is closest to x. when i compute the length of the difference between every vector in u and the vector x, i get the graph on the right, and it turns out that we can find the vector in u that is closest to x by an orthogonal projection of x onto u; that means the difference vector of x and its projection is orthogonal to u. overall we are looking for the orthogonal projection of x onto u, and we will denote this projection by pi u of x. the projection has two important properties. first, since pi u of x is in u, it follows that there exists a lambda in r such that pi u of x can be written as lambda times b, a multiple of the basis vector that spans u; lambda is then the coordinate of the projection with respect to the basis b of the subspace u. the second property is that the difference vector of x and its projection onto u is orthogonal to u, which means it's orthogonal to the basis vector that spans u, so the second property is that the inner product between b and the difference between pi u of x and x is zero. that's the orthogonality condition. these properties hold for any x in rd and any one-dimensional subspace u. now let's exploit these two properties to find pi u of x. i've drawn the setting once more over here: we have a two-dimensional vector x, we have a one-dimensional subspace u which is spanned by the vector b, and we're interested in finding the orthogonal projection of x onto u, which we call pi u of x, and we have two conditions for pi u of x. the first is that since pi u of x is an element of u, we can write it as a scaled version of the vector b, so there must be a lambda in r such that pi u of x is lambda times b. the second condition is the orthogonality condition, that the difference vector between x and pi u of x is orthogonal to u, which means it's orthogonal to the spanning vector b. now let's use these two properties to find pi u of x. first, we use the condition that the inner product of b with pi u of x minus x is 0, which is equivalent to the inner product of b and pi u of x, minus the inner product of b and x, being zero, where we exploited the linearity of the inner product. now we rewrite pi u of x as lambda times b, so this is equivalent to the inner product of b with lambda b, minus the inner product of b and x, being zero. we can move the lambda out again because of the linearity of the inner product, which gives lambda times the squared norm of b, minus the inner product of b and x, equals zero, and that is equivalent to lambda being the inner product of b with x divided by the squared norm of b. so now we've found lambda, which is the coordinate of our projection with respect to the basis b, and that means that our projection, using the first condition, is lambda times b, which is the inner product of b with x divided by the squared norm of b, times b. if we choose the dot product as the inner product we can rewrite this in a slightly different way: we get b transpose times x, times b, divided by the squared norm of b, and given that b transpose times x is a scalar we can just move it over here, so this is equivalent to saying that b times b transpose, divided by the squared norm of b, times x, is our projected point. if we look at this, b times b transpose divided by the squared norm of b is a matrix, and this matrix is a projection matrix that projects any point in two dimensions onto the one-dimensional subspace. if we look at the special case of b having norm 1, we get a much simpler result: if the norm of b equals 1, then lambda is b transpose times x, and pi u of x is b times b transpose times x. so we get the coordinate of the projected point with respect to the basis b just by looking at the dot product of b with x, and the projection matrix is simply given by b times b transpose. let me make a comment at the end: our projection pi u of x is still a vector in rd, however we no longer require d coordinates to represent it, we only need a single one, which is the lambda. in this video we discussed orthogonal projections of vectors onto one-dimensional subspaces. we arrived at the solution by making two observations: we must be able to represent the projected point using a multiple of the basis vector that spans the subspace, and the difference vector between the original vector and its projection is orthogonal to the subspace. in the next video we will look at an example.
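before the worked example in the next video, here's a minimal numpy sketch of the formula we just derived, using the dot product as the inner product and an arbitrary pair of vectors rather than the ones used in the example that follows.

import numpy as np

def project_1d(x, b):
    # orthogonal projection of x onto the line spanned by b (dot product as inner product)
    lam = (b @ x) / (b @ b)    # coordinate lambda = <b, x> / ||b||^2
    return lam * b

def projection_matrix(b):
    # the matrix b b^T / ||b||^2 discussed above
    return np.outer(b, b) / (b @ b)

b = np.array([3.0, 4.0])
x = np.array([2.0, 1.0])
print(project_1d(x, b))            # [1.2, 1.6], i.e. 0.4 times b
print(projection_matrix(b) @ x)    # the same point via the projection matrix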
in the last video we derived the formula for projections of vectors onto a one-dimensional subspace; in particular we arrived at this equation if we choose the dot product as the inner product. here x is a d-dimensional vector and b is the basis vector that spans the one-dimensional subspace that we want to project x onto. in this video we're going to look at an example. assume the vector b that spans our one-dimensional subspace is the vector 2 1, and the vector x we want to project onto that subspace is given by 1 2. let's quickly draw this: so x is the vector 1 2.
so x is living over here and the vector b is going to be the vector 2 1 and so this is b and b spanning our subspace means that our subspace u is going to extend along this line so now we're interested in computing the orthogonal projection of x onto u so using now this equation here we get so x transpose times b is 2 plus 2 and now we divide by the squared norm of b so the length of b is 2 squared plus 1 squared square root so overall we get the squared norm to be 5 and then times b will get us to 2 1 so overall 4 over 5 times the vector 2 1. this means that our orthogonal projection is four over five times the vector b that means if we take four fifth of this vector then we will get to our orthogonal projection so this point here is now pi u of x so the orthogonal projection of x onto u and the length of this segment is four-fifth times the length of b we've gone through orthogonal projections onto one-dimensional subspaces in the next video we're going to look at orthogonal projections onto high dimensional subspaces in the last video we learned about orthogonal projections onto one-dimensional subspaces in this video we look at the general case of orthogonal projections onto n-dimensional subspaces for this we'll exploit the same concepts that worked in the one dimensional case let's start with an illustration we're going to look at a case where we have a vector x that is living in a three-dimensional space and we define a subspace a two-dimensional subspace u which has basis vectors b1 and b2 which are for example this vector and b2 is this vector so we write u is spanned by b1 and b2 so u in this case would be the plane down here so this is u so now we are looking at the orthogonal projection of x onto u and we're going to denote this by pi u of x so that projection is going to look something like this that's the projection point so this is the orthogonal projection of x onto the subspace u so we can already make two observations the first thing is that because pi u of x is an element of u it can be represented as a linear combination of the basis vectors of u so that means we can write pi u of x is lambda 1 times b 1 plus lambda 2 times b 2 for appropriate values of lambda 1 and lambda 2. 
and the second property is that the difference vector x minus pi u of x, this vector over here, is orthogonal to u, which means it's orthogonal to all basis vectors of u. we can now use the inner product for this and write that the inner product of x minus pi u of x with b1 must be zero, and the same is true for b2. but now let's formulate our intuition for the general case, where x is a d-dimensional vector and we're looking at an m-dimensional subspace u. okay, let's derive this result. i copied our two insights up here, and i have defined two quantities: a lambda vector, which consists of all these lambda i, and a b matrix, where we just concatenate all basis vectors of our subspace u. with this definition we can also write pi u of x equals b times lambda. let's assume we use the dot product as our inner product. if we use our second property, we get that the inner product of pi u of x minus x with bi can equivalently be written as the inner product of b lambda minus x with bi, and this needs to be zero, where i just used the definition of pi u of x in here. now we can simplify this by exploiting the linearity of the inner product, and we get that the inner product of b lambda with bi, minus the inner product of x with bi, needs to be zero, and this holds for i equals 1 to m. with the dot product we can now write this in the following way: lambda transpose times b transpose times bi, minus x transpose times bi, equals 0, for i equals 1 to m. if we summarize this set of conditions, we get that lambda transpose times b transpose times b, minus x transpose times b, must be zero, where the zero here is an m-dimensional zero row vector. what we would like to do now is identify lambda. for this we right multiply the inverse of b transpose times b onto the entire equation, and then we get lambda transpose equals x transpose times b times b transpose b inverse, which also means we can write lambda as the transpose of this entire expression: we get b transpose b inverse, and this matrix is symmetric, so its transpose is the same as the original matrix, times b transpose x. so now we have identified lambda, but we also know that our projection point can be written as b times lambda, so this means we get pi u of x as b times lambda, which is b times b transpose b inverse times b transpose x. we can now identify this expression as the projection matrix, similar to the one-dimensional case, and in the special case of an orthonormal basis, b transpose times b is the identity matrix, so we get pi u of x is b times b transpose times x. the projected vector pi u of x is still a vector in rd, but we only require m coordinates, the lambda vector over here, to represent it as a linear combination of the basis vectors of the subspace u. we also effectively got the same result as in the one-dimensional case: remember, in 1d we got lambda equals b transpose x divided by b transpose b, and we got a projection point pi u of x which was b transpose x divided by b transpose b, times b. now b transpose b is expressed as matrix b transpose times matrix b, and we have the inverse matrix sitting here instead of dividing by a scalar; that's the only difference between these two results. in this video we looked at orthogonal projections of a vector onto a subspace of dimension m. we arrived at the solution by exploiting two properties: we must be able to represent the projection using a linear
combination of the basis vectors of the subspace and the difference vector between the original vector and its projection is orthogonal to the subspace in the next video we're going to look at a concrete example in the last video we derived orthogonal projections of vectors onto m-dimensional subspaces in this video we'll run through a simple example we're going to define x to be a three dimensional vector given by 2 1 1 which is over here and we define two basis vectors for our two-dimensional subspace b1 to be one two zero and b2 to be one one zero so that means u which is spanned by b1 and b2 is going to be effectively this plane and its extension so all of this is u the orthogonal projection was given as pi u of x is b times lambda and we define b now to be b1 and b2 concatenated which is 1 2 0 1 1 0 and lambda was given as b transpose b inverse times b transpose x b transpose times x is given as the vector 4 3 and b transpose b is a 2 by 2 matrix which is 5 3 3 2 now we solve for lambda as b transpose b inverse times b transpose x which means we find lambda such that b transpose b lambda equals b transpose x using gaussian elimination we arrive at lambda equals minus 1 3 and this implies our projection of x onto the space spanned by the two b vectors is minus one times b one plus three times b two which is two one zero in our diagram over here this would correspond to this vector here this is pi u of x this result makes sense because our projected point has zero as its third component and our subspace requires that the third component is always zero our projected vector is still a three-dimensional vector but we can represent it using two coordinates if we use the basis defined by b1 and b2 therefore that is the compact representation of the projection of x onto this lower dimensional subspace in this video we looked at a concrete example of the orthogonal projection of a three-dimensional vector onto a two-dimensional subspace in the next video we're going to exploit orthogonal projections and derive a dimensionality reduction algorithm called principal component analysis in this module we discussed orthogonal projections of vectors onto subspaces we exploited two things to get to the solution the projected vector can be represented as a linear combination of the basis of the subspace and the vector that connects the data point and its projection must be orthogonal to the subspace we will use orthogonal projections in the next module when we introduce the pca algorithm this was a tricky module but we made it through well done in this course we will introduce principal component analysis or pca an algorithm for linear dimensionality reduction pca has been around for about 100 years and is still one of the most commonly used techniques for data compression and visualization high dimensional data for example images often has the property that it lies on a low dimensional subspace and that many dimensions are highly correlated here's an illustration in two dimensions although the data does not quite lie on a straight line the data does not vary much in one dimension so that we can express it as if it was on a line with nearly no loss a key idea behind pca is to use orthogonal projections to find lower dimensional representations of data that retain as much information as possible similar to the example that we just looked at in the following videos we will derive pca as an algorithm that minimizes average reconstruction errors by orthogonal projections
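since pca will lean on exactly these projections, here is a small numpy sketch that reproduces the projection example from this module using the formula b times b transpose b inverse times b transpose times x; the numbers are the ones from the worked example above

```python
import numpy as np

# worked example: project x = (2, 1, 1) onto the subspace spanned by b1 and b2
x = np.array([2.0, 1.0, 1.0])
B = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [0.0, 0.0]])   # columns are b1 = (1, 2, 0) and b2 = (1, 1, 0)

# lambda solves (B^T B) lambda = B^T x
lam = np.linalg.solve(B.T @ B, B.T @ x)
proj = B @ lam               # pi_u(x) = B (B^T B)^{-1} B^T x

print(lam)                   # [-1.  3.]
print(proj)                  # [2. 1. 0.]
print(B.T @ (x - proj))      # [0. 0.]  the residual is orthogonal to b1 and b2
```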
in this video we will introduce the setting of pca and the high level idea assume we have a data set x in rd consisting of n vectors so x is a data set and we have n vectors x 1 to x n where the x i are d dimensional vectors our objective is to find a low dimensional representation of the data that is as similar to x as possible before we start let's briefly review three important concepts the first one is that every vector in rd can be represented as a linear combination of the basis vectors so let's write this down we write x n as the sum of i equals 1 to d of beta i n times b i where in the following we will assume that the bi are an orthonormal basis of rd the second one is that if we use the dot product as our inner product and b1 to bd are an orthonormal basis we can also write the beta i n as x n transpose times b i which means we can interpret beta i n to be the orthogonal projection of x n onto the one dimensional subspace spanned by the ith basis vector the third property is that if we have orthonormal basis vectors b1 to bm spanning an m dimensional subspace of rd and we define b to be the matrix that consists of these orthonormal basis vectors then the projection of x onto that subspace can be written as x tilde is b times b transpose times x that means x tilde is the orthogonal projection of x onto the subspace spanned by the m basis vectors and b transpose times x are the coordinates of x tilde with respect to the basis vectors collected in the matrix b this is also called the code so coordinates or code now let's have a look at pca the key idea in pca is to find a lower dimensional representation x n tilde of x n that can be expressed using fewer basis vectors let's say m we assume the data is centered that means the data set has mean 0 and we also assume that b1 to bd are an orthonormal basis of rd generally we can write any x n tilde in the following way x n tilde can be written as a sum i equals 1 to m of beta i n times b i plus the sum of i equals m plus 1 to d of beta i n times bi this entire thing is still living in rd so we took our general way of writing any vector in rd which comes from property one and we split the sum into two sums one is living in an m-dimensional subspace and the other one is living in a d minus m-dimensional subspace which is the orthogonal complement to this particular subspace in pca we ignore the second term so we get rid of this part and then we call the subspace that is spanned by the basis vectors b1 to bm the principal subspace so b1 to bm span the principal subspace although x n tilde is still a d dimensional vector it lives in an m dimensional subspace of r d and only m coordinates beta 1 n to beta m n are necessary to represent it so these are the coordinates of this x n tilde vector the betas are then also called the code or the coordinates of x n tilde with respect to the basis vectors b1 to bm and the setting now is as follows assuming we have data x1 to xn we want to find parameters beta i n and orthonormal basis vectors bi such that the average squared reconstruction error is minimized and we can write the average squared reconstruction error as follows we can write j the average squared reconstruction error as 1 over n times the sum n equals 1 to n of x n minus x n tilde squared let's have a look at an example we have data living in two dimensions and now we want to find a good one-dimensional subspace such that the average squared reconstruction error between the original data points and their corresponding projections is minimized
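to make this objective concrete, here is a small sketch on made-up two-dimensional data that evaluates the average squared reconstruction error for a few candidate one-dimensional subspaces; the data and the candidate directions are purely illustrative

```python
import numpy as np

rng = np.random.default_rng(0)
# made-up, centred two-dimensional data with more spread in one direction
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])
X = X - X.mean(axis=0)

def avg_reconstruction_error(X, b):
    # average squared error when projecting the rows of X onto span{b}, b a unit vector
    X_tilde = (X @ b)[:, None] * b[None, :]
    return np.mean(np.sum((X - X_tilde) ** 2, axis=1))

# evaluate J for a few candidate one-dimensional subspaces
for angle in np.deg2rad([0.0, 30.0, 60.0, 90.0]):
    b = np.array([np.cos(angle), np.sin(angle)])
    print(round(avg_reconstruction_error(X, b), 3))
```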
here i'm plotting the original data set with their corresponding projections onto one-dimensional subspaces and i'm cycling through a couple of options of subspaces and you can see that some of these projections are significantly more informative than others and in pca we are going to find the best one our approach is to compute the partial derivatives of j with respect to the parameters the parameters are the beta i n and the bis we set the partial derivatives of j with respect to these parameters to zero and solve for the optimal parameters but one observation we can already make and that observation is that the parameters only enter this loss function through x n tilde this means that in order to get our partial derivatives we need to apply the chain rule so we can write dj by d either beta i n or b i as d j by d x n tilde times d x n tilde by d either beta i n or bi and the first part we can already compute and we get d j by d x n tilde is minus 2 over n times x n minus x n tilde transpose and the other derivatives we compute in the next videos in the last video we set up the pca objective and in this video we will determine our first set of optimal parameters we make two general assumptions in the beginning the first thing is that we have centered data that means that the expected value of our data set is zero and the second one is that the basis vectors form an orthonormal basis from the previous video we carry over the following results first we can write our projected data point x n tilde as a linear combination of beta j n times b j where the bj form the orthonormal basis of our subspace our loss function is the average squared reconstruction error between our original data point and the projection and the partial derivative of our loss function with respect to x n tilde is given by this expression and now we are ready to compute the partial derivative of j with respect to the beta i n parameters as follows so dj with respect to d beta i n is the derivative of j with respect to x n tilde times the derivative of x n tilde with respect to beta i n so now we're going to have a closer look at this one here d x n tilde by d beta i n is simply given by b i for i equals 1 to m and the reason for this is that if we take the derivative with respect to one fixed beta i n then only the ith component of this sum will play a role and that's the reason why we end up simply with bi but that also means that our derivative of j with respect to beta i n is now given as dj by d beta i n is minus 2 over n times x n minus x n tilde transpose times b i where we used equation c to get the first part and we plugged in this bi over here and what we're going to do now is replace x n tilde using equation a so we end up with minus 2 over n times x n minus the sum of j equals 1 to m of beta j n times b j transpose times b i and this is given as minus 2 over n times x n transpose times bi minus beta i n times b i transpose times b i where we exploited that the bi form an orthonormal basis if we multiply b i onto both components here we end up with the sum of beta j n times b j transpose times b i and since b j transpose times b i is one if and only if i equals j and otherwise zero we end up with only one term and b i transpose times b i is one so we end up with minus 2 over n times x n transpose bi minus beta i n so now we need to set this to 0 in order to find our beta i n parameters and this is zero if and only if the beta i n parameters are given by x n transpose times b i which we are going
to define as equation d what this means is that the optimal coordinate beta i n of x n tilde with respect to our basis is the orthogonal projection of our original data point onto the ith basis vector that spans our principal subspace in this video we determined the coordinates of the lower dimensional data as the orthogonal projection of the original data onto the basis vectors that span the principal subspace in the next videos we will determine the orthonormal basis that spans that principal subspace in the last video we determined the coordinates of the optimal projection with respect to the orthonormal basis that spans our principal subspace before we go on and determine the optimal basis vectors let's rephrase our loss function first i have copied over the results that we have so far the description of our projected data point our loss function the partial derivative of our loss function with respect to our projected data point and the optimal coordinates that we found in the last video before we go on and determine the optimal basis vectors let's rephrase our loss function this will make it much easier to find our basis vectors for this let's have a closer look at the difference vector between our original data point and our projected data point we can write x n tilde is given by equation a which is the sum over j equals 1 to m of beta j n times b j if we now use the results for our optimal beta j n parameters from here we get that this is the sum over j equals 1 to m of x n transpose times b j times b j where we used equation d now we rewrite this in the following way this is just a scalar a dot product in this particular case dot products are symmetric so we can swap the order and we can also move the scalar over here so what we end up with is bj times bj transpose times xn and this we can write as the sum over j equals 1 to m of bj times bj transpose all times xn where we moved the xn out of the sum and if we look at this this is a projection matrix so this means that x n tilde is the orthogonal projection of x n onto the subspace spanned by the m basis vectors b j where j equals 1 to m similarly we can write x n as the sum j equals 1 to m of bj times bj transpose times xn plus a term that runs from m plus 1 to d of b j times b j transpose times x n so we write x n as a projection onto the principal subspace plus a projection onto the orthogonal complement and this second term is the one that is missing over here that's the reason why x n tilde is an approximation to x n so if we now look at the difference vector between x n tilde and x n what remains is exactly this term so x n minus x n tilde is the sum j equals m plus 1 to d of bj times bj transpose times xn so now we can look at this displacement vector the difference between xn and its projection and we can see that the displacement vector lies exclusively in the subspace that we ignore that means the orthogonal complement to the principal subspace let's look at an example in two dimensions we have a data set in two dimensions represented by these dots and now we are interested in projecting them onto the u1 subspace when we do this and then look at the difference vector between the original data and the projected data we get these vertical lines that means they have no x component no variation in x that means they only have a component that lives in the subspace u2 which is the orthogonal complement to u1 the subspace that we projected onto so with this illustration let's quickly rewrite this in a slightly different way i'm
going to write this as the sum of j equals m plus 1 to d of b j transpose x n times b j and we're going to call this equation e we looked at the displacement vector between xn and its orthogonal projection onto the principal subspace xn tilde and now we're going to use this to reformulate our loss function from equation b we get that our loss function is 1 over n times the sum n equals 1 to n of x n minus x n tilde squared so this is the average squared reconstruction error and now we're going to use equation e for the displacement vector here so we rewrite this using equation e as 1 over n times the sum n equals 1 to capital n and inside that squared norm we use this expression here so we get the squared norm of the sum j equals m plus 1 to d of bj transpose times xn times bj and now we're going to use the fact that the bjs form an orthonormal basis and this will greatly simplify this expression and we will get 1 over n times the sum n equals 1 to capital n times the sum j equals m plus 1 to d of b j transpose times x n squared and now we're going to multiply this out explicitly and we get 1 over n times the sum over n times the sum over j of bj transpose times x n times x n transpose times b j so this part is now identical to this part and now i'm going to rearrange the sums i'm going to move the sum over j outside so i'll have the sum over j equals m plus 1 to d of bj transpose which is independent of n times 1 over n times the sum n equals 1 to capital n of x n times x n transpose and then times the bj from here so i'm going to bracket it in this way and what we can see now if we look very carefully is that we can identify this expression as the data covariance matrix s because we assumed we have centered data so the mean of the data is zero this means we can now rewrite our loss function using the data covariance matrix and we get that our loss is the sum over j equals m plus 1 to d of bj transpose times s times bj and we can also use a slightly different interpretation by rearranging a few terms and using the trace operator so we can now also write this as the trace of the sum of j equals m plus 1 to d of bj times bj transpose times s and we can now also interpret this matrix as a projection matrix this projection matrix takes our data covariance matrix and projects it onto the orthogonal complement of the principal subspace that means we can reformulate the loss function as the variance of the data projected onto the subspace that we ignore therefore minimizing this loss is equivalent to minimizing the variance of the data that lies in the subspace that is orthogonal to the principal subspace in other words we're interested in retaining as much variance after projection as possible the reformulation of the average squared reconstruction error in terms of the data covariance gives us an easy way to find the basis vectors of the principal subspace which we'll do in the next video in the last video we found that minimizing the average squared reconstruction error is equivalent to minimizing the variance of the data when projected onto the subspace that we'll ignore in pca in this video we will exploit this insight and determine an orthonormal basis of the m-dimensional principal subspace
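before we do that, here is a quick numerical check of this reformulation on made-up two-dimensional data: for any unit vector b1 spanning the principal subspace, the average squared reconstruction error equals b2 transpose s b2, where b2 spans the direction we ignore

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.7],
                                          [0.0, 0.5]])   # made-up data
X = X - X.mean(axis=0)                  # centred data
S = (X.T @ X) / len(X)                  # data covariance matrix

theta = 0.3                             # any direction for the principal subspace
b1 = np.array([np.cos(theta), np.sin(theta)])        # spans the principal subspace
b2 = np.array([-np.sin(theta), np.cos(theta)])       # spans the ignored subspace

X_tilde = (X @ b1)[:, None] * b1[None, :]            # projections onto span{b1}
J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))      # average squared reconstruction error

print(np.isclose(J, b2 @ S @ b2))       # True: J is the variance in the ignored direction
```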
using the results from earlier over here we can write our loss function as the sum j equals m plus 1 to d of bj transpose s times bj where s is the data covariance matrix minimizing this objective requires us to find the orthonormal basis that spans the subspace that we will ignore and when we have that basis we take its orthogonal complement as the basis of the principal subspace remember that the orthogonal complement of a subspace u consists of all vectors in the original vector space that are orthogonal to every vector in u let us start with an example to determine the b vectors we start in two dimensions where we wish to find a one-dimensional subspace such that the variance of the data when projected onto that subspace is minimized so we're looking at two basis vectors b1 and b2 in r2 where b1 will be spanning the principal subspace and b2 its orthogonal complement that means the subspace that we will ignore we also have the constraint that b1 and b2 are orthonormal which means that bi transpose times bj is delta ij meaning that this dot product is 1 if i equals j and 0 otherwise in our example with our two vectors our loss function is j is b2 transpose times s times b2 with the constraint that b2 transpose times b2 is 1 to solve this optimization problem we write down the lagrangian and the lagrangian is b2 transpose s b2 plus lambda times 1 minus b2 transpose times b2 where lambda is the lagrange multiplier so now we compute the gradients of the lagrangian with respect to b2 and with respect to lambda and set them to zero so dl by d lambda is 1 minus b2 transpose times b2 and this is 0 if and only if b2 transpose times b2 is 1 so we recover our constraint now let's have a look at the partial derivative of l with respect to b2 we get 2 b2 transpose times s from the first term and minus 2 lambda b2 transpose from the second term and this needs to be 0 and that is 0 if and only if s times b2 is lambda times b2 here we end up with an eigenvalue problem b2 is an eigenvector of the data covariance matrix and the lagrange multiplier plays the role of the corresponding eigenvalue if we now go back to our loss function we can use this expression we can write j which was b2 transpose times s times b2 now we know that s times b2 can be written as lambda times b2 so we get b2 transpose times b2 times lambda and because we have an orthonormal basis we end up with lambda as our loss function therefore the average squared reconstruction error is minimized if lambda is the smallest eigenvalue of the data covariance matrix and that means we need to choose b2 as the corresponding eigenvector and that one will span the subspace that we will ignore b1 which spans the principal subspace is then the eigenvector that belongs to the largest eigenvalue of the data covariance matrix keep in mind that the eigenvectors of the covariance matrix are already orthogonal to each other because of the symmetry of the covariance matrix so if we look at a two-dimensional example if this is our data then the best projection that we can get that retains most information is the one that projects onto the subspace that is spanned by the eigenvector of the data covariance matrix which belongs to the largest eigenvalue and that is indicated by this long arrow over here
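here is a minimal numpy sketch of this two-dimensional case on made-up data; it checks that the reconstruction error equals the smallest eigenvalue of the covariance matrix and that the principal direction is the eigenvector belonging to the largest eigenvalue

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0],
                                          [0.8, 0.4]])   # made-up data
X = X - X.mean(axis=0)
S = (X.T @ X) / len(X)

vals, vecs = np.linalg.eigh(S)          # eigenvalues in ascending order
b2 = vecs[:, 0]                         # smallest eigenvalue: subspace we ignore
b1 = vecs[:, 1]                         # largest eigenvalue: principal subspace

X_tilde = (X @ b1)[:, None] * b1[None, :]
J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))

print(np.isclose(J, vals[0]))           # True: the loss is the smallest eigenvalue
print(b1, vals)                         # principal direction and the two eigenvalues
```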
now let's go to the general case if we want to find the m-dimensional principal subspace of a d-dimensional data set and we solve for the basis vectors b j where j equals m plus 1 to d so we optimize these ones we end up with the same kind of eigenvalue problem that we had earlier with the simple example we end up with s times bj equals lambda j times bj for j equals m plus 1 to d and the loss function is given by the sum of the corresponding eigenvalues so we can write j is the sum from m plus 1 to d of all lambda j also in the general case the average reconstruction error is minimized if we choose the basis vectors that span the ignored subspace to be the eigenvectors of the data covariance matrix that belong to the smallest eigenvalues this equivalently means that the principal subspace is spanned by the eigenvectors belonging to the m largest eigenvalues of the data covariance matrix this nicely aligns with properties of the covariance matrix the eigenvectors of the covariance matrix are orthogonal to each other because of symmetry and the eigenvector belonging to the largest eigenvalue points in the direction of the largest variance of the data and the variance in that direction is given by the corresponding eigenvalue similarly the eigenvector belonging to the second largest eigenvalue points in the direction of the second largest variance of the data and so on in this video we identified the orthonormal basis of the principal subspace as the eigenvectors of the data covariance matrix that are associated with the largest eigenvalues in the next video we will put all the pieces together and run through the pca algorithm in detail in this video we'll go through the individual steps of pca but before we do this let me make two comments when we derived pca we made the assumption that our data is centered that means it has mean zero this assumption is not necessary to derive pca and we would have come to the same result but subtracting the mean from the data can avoid numerical difficulties assume the values of our data are centered around 10 to the 8 then computing the data covariance matrix requires us to multiply huge numbers which results in numerical instabilities an additional second step that is normally recommended after subtracting the mean is to divide every dimension of the centered data by the corresponding standard deviation this makes the data unit free and guarantees that the variance of the data in every dimension is one but it leaves the correlations intact let's have a look at an example clearly this data spreads much more in one dimension than in the other dimension and the best projection for pca seems clear however there's a problem with this data set the two dimensions of the dataset are both distances but one is measured in centimeters and the other one in meters the one measured in centimeters naturally varies much more than the other one when we divide each dimension of the data set by the corresponding standard deviation we get rid of the units and make sure that the variance in each dimension is one when we look at the principal subspace of this normalized data set we can now see that there is actually quite a strong correlation between these two dimensions and the principal axes have changed but now let's go through pca step by step and we'll have a running example we're given a two-dimensional data set and we want to use pca to project it onto a one-dimensional subspace the first thing that we do is to subtract the mean the data is now centered next we divide by the standard deviation now the data is unit free and it has variance one along each axis which is indicated by these two arrows but keep in mind that the correlations are still intact third we compute the data covariance matrix and its eigenvalues and corresponding eigenvectors the eigenvectors are scaled by the magnitude of the
corresponding eigenvalue in this picture the longer vector spans the principal subspace let's call it u and in the last step we can project any data point x star onto the principal subspace to get this right we need to normalize x star using the mean and standard deviation of the data set that we used to compute the data covariance matrix so we're going to have a new x star and the new x star is going to be the old x star minus the mean of the data set divided by the standard deviation and we do this for every dimension in x star now we can get the projection of x star as x star tilde the projection of x star onto the principal subspace u as b times b transpose times x star where b is the matrix that contains the eigenvectors that belong to the largest eigenvalues as columns and b transpose times x star are the coordinates of the projection with respect to the basis of the principal subspace in this video we went through the steps of pca first we subtract the mean from the data to center it at zero and avoid numerical problems second we divide by the standard deviation to make the data unit free third we compute the eigenvalues and eigenvectors of the data covariance matrix and finally we can project any data point onto the principal subspace that is spanned by the eigenvectors that belong to the largest eigenvalues
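here is a minimal numpy sketch of these steps on made-up two-dimensional data, projecting onto a one-dimensional principal subspace; the function names and the data are just for illustration

```python
import numpy as np

def pca_fit(X, m):
    # steps 1-3: centre the data, make it unit free, eigendecompose the covariance
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    Z = (X - mu) / sigma
    S = (Z.T @ Z) / len(Z)
    vals, vecs = np.linalg.eigh(S)              # eigenvalues in ascending order
    B = vecs[:, ::-1][:, :m]                    # eigenvectors of the m largest eigenvalues
    return mu, sigma, B

def pca_project(x_star, mu, sigma, B):
    # step 4: normalise a new point and project it onto the principal subspace
    z = (x_star - mu) / sigma
    return B @ (B.T @ z)                        # B^T z are the coordinates (the code)

# made-up data: two correlated dimensions on very different scales
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2)) @ np.array([[100.0, 0.3],
                                          [0.0,   0.2]])

mu, sigma, B = pca_fit(X, m=1)
print(pca_project(X[0], mu, sigma, B))
```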
in the last video we've gone through the steps of the pca algorithm in order to do pca we need to compute the data covariance matrix in d dimensions the data covariance matrix is a d by d matrix if d is very high so in very high dimensions then computing the eigenvalues and eigenvectors of this matrix can be quite expensive it scales cubically in the number of rows and columns in this case the number of dimensions in this video we provide a solution to this problem for the case that we have substantially fewer data points than dimensions assume we have a data set given as x1 up to xn in rd and we assume that the data is centered so it has mean 0 then the data covariance matrix is given as s equals 1 over n times x transpose times x where we define x to be the matrix that consists of x1 transpose up to x n transpose and that is an n by d matrix we now assume that n is significantly smaller than d that means the number of data points is significantly smaller than the dimensionality of the data and then the rank of the covariance matrix is n so rank of s equals n and that also means it has d minus n eigenvalues which are zero that means that the matrix is not full rank and the rows and columns are linearly dependent in other words there are some redundancies in the next few minutes we'll exploit this and turn the d by d covariance matrix s into a full rank n by n matrix without zero eigenvalues i've just moved the covariance definition up here to have a bit more space on the board in pca we ended up with the following eigenvalue eigenvector equation we had s times b i equals lambda i times b i where b i is a basis vector of the orthogonal complement of the principal subspace now let's rewrite this equation a little bit we're now going to replace s with its definition up here so we'll get 1 over n times x transpose x this is s times b i equals lambda i times bi and now we multiply by x from the left hand side so we will get 1 over n times x times x transpose times x times b i equals lambda i times x times bi and now we have a new eigenvector eigenvalue equation lambda i is still an eigenvalue and now we have eigenvectors x times bi which we call ci of the matrix 1 over n times x times x transpose this means that 1 over n times x x transpose has the same non-zero eigenvalues as the data covariance matrix but this is now an n by n matrix so we can compute the eigenvalues and eigenvectors much quicker than for the original data covariance matrix all right so this is an n by n matrix whereas s used to be a d by d matrix so now we can compute the eigenvectors of this matrix 1 over n times x x transpose and we use this to recover the original eigenvectors which we still need in order to do pca currently we know the eigenvectors of 1 over n times x x transpose and we want to recover the eigenvectors of s if we left multiply our eigenvalue eigenvector equation with x transpose we get the following we get 1 over n times x transpose times x times x transpose times c i equals lambda i times x transpose times c i and now we find our s matrix again this is s and this also means that we recover x transpose times c i as an eigenvector of s that belongs to the eigenvalue lambda i in this video we reformulated pca such that we can efficiently run pca on data sets where the dimensionality of the data is substantially bigger than the number of data points
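here is a small numpy sketch of this reformulation on made-up data with many more dimensions than data points; it checks that an eigenvector of the small n by n matrix, mapped back through x transpose, is indeed an eigenvector of s with the same eigenvalue

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 20, 500                       # far fewer data points than dimensions
X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)               # centred data, rows are x_n transpose

S = (X.T @ X) / N                    # D x D covariance matrix (the expensive route)
K = (X @ X.T) / N                    # N x N matrix with the same non-zero eigenvalues

vals, C = np.linalg.eigh(K)          # eigenvectors c_i of the small matrix
b = X.T @ C[:, -1]                   # recover an eigenvector of S as X^T c_i
b = b / np.linalg.norm(b)            # normalise it

# check: b is an eigenvector of S with the same (largest) eigenvalue
print(np.allclose(S @ b, vals[-1] * b))   # True
```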
we derived pca from the perspective of minimizing the average squared reconstruction error however pca can also be interpreted from different perspectives in this video we'll have a brief look at some of these interpretations let's start with a recap of what we have done so far we took a high dimensional vector x and we projected it onto a lower dimensional representation z using the matrix b transpose the columns of this matrix b are the eigenvectors of the data covariance matrix that are associated with the largest eigenvalues the z values are the coordinates of our data point with respect to the basis vectors which span the principal subspace and that is also called the code of our data point once we have that low dimensional representation z we can get a higher dimensional version of it by using the matrix b again so multiplying b onto z gets us a higher dimensional version of z in the original data space we found the pca parameters such that the reconstruction error between x and the reconstruction x tilde is minimized we can also think of pca as a linear autoencoder an autoencoder encodes a data point x and tries to decode it to something similar to the same data point the mapping from the data to the code is called the encoder let's write this down so this part here is called the encoder the mapping from the code to the original data space is called the decoder if the encoder and decoder are linear mappings then we get the pca solution when we minimize the squared autoencoding loss if we replace the linear mapping of pca with a non-linear mapping we get a nonlinear autoencoder a prominent example of this is a deep autoencoder where the linear functions of the encoder and decoder are replaced with deep neural networks another interpretation of pca is related to information theory we can think of the code as a smaller compressed version of the original data point when we reconstruct our original data using the code we don't get the exact data point back but a slightly distorted or noisy version of it this means that our compression is lossy intuitively we want to maximize the correlation between the original data and the lower dimensional code more formally this is related to the mutual information we would then get the same pca solution that we discussed earlier in this course by maximizing the mutual information a core concept in information theory when we derived pca using projections we reformulated the average reconstruction error loss as minimizing the variance of the data that is projected onto the orthogonal complement of the principal subspace minimizing that variance is equivalent to maximizing the variance of the data when projected onto the principal subspace if we think of variance in the data as information contained in the data this means that pca can also be interpreted as a method that retains as much information as possible we can also look at pca from the perspective of a latent variable model we assume that an unknown lower dimensional code z generates data x and we assume that we have a linear relationship between z and x so generally we can write that x is b times z plus mu and some noise we assume that the noise is isotropic with mean zero and covariance matrix sigma squared times i we further assume that the distribution of z is a standard normal so p of z is gaussian with mean zero and covariance matrix the identity matrix we can now write down the likelihood of this model the likelihood is p of x given z and that is a gaussian distribution in x with mean b z plus mu and covariance matrix sigma squared i and we can also compute the marginal likelihood as p of x is the integral of p of x given z that is the likelihood times the distribution on z dz and that turns out to be a gaussian distribution in x with mean mu and with covariance matrix b times b transpose plus sigma squared i the parameters of this model are mu b and sigma squared and we can write them down explicitly in our model up here so the model parameters are b mu and sigma squared we can now determine the parameters of this model using maximum likelihood estimation and we will find that mu is the mean of the data and b is a matrix that contains the eigenvectors that correspond to the largest eigenvalues to get the low dimensional code of a data point we can apply bayes theorem to invert the linear relationship between z and x in particular we are going to get p of z given x as p of x given z that is the likelihood which comes from here times p of z that's the distribution that we have here divided by the marginal likelihood p of x which comes from here in this video we looked at five different perspectives of pca that lead to different objectives minimizing the squared reconstruction error minimizing the autoencoder loss maximizing the mutual information maximizing the variance of the projected data and maximizing the likelihood in a latent variable model all these different perspectives give us the same solution to the pca problem
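to make the latent variable view a bit more tangible, here is a small sketch that samples from the generative model x equals b z plus mu plus noise and checks that the sample mean and covariance approach mu and b b transpose plus sigma squared i; all parameter values are made up

```python
import numpy as np

rng = np.random.default_rng(5)
D, M, sigma = 3, 1, 0.1                        # made-up model sizes and noise level
B = rng.normal(size=(D, M))                    # made-up parameters of the model
mu = np.array([1.0, -2.0, 0.5])

# generative model: z ~ N(0, I), x = B z + mu + eps with eps ~ N(0, sigma^2 I)
Z = rng.normal(size=(100_000, M))
eps = sigma * rng.normal(size=(100_000, D))
X = Z @ B.T + mu + eps

# the marginal of x is gaussian with mean mu and covariance B B^T + sigma^2 I
print(np.round(X.mean(axis=0) - mu, 2))                              # close to zero
print(np.round(np.cov(X.T) - (B @ B.T + sigma ** 2 * np.eye(D)), 2)) # close to zero
```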
the strengths and weaknesses of the individual perspectives become more clear and important when we consider properties of real data in this module we derived pca and looked at some of its properties by minimizing the average squared error between a data point and its projection onto the principal subspace we found that the best thing to do is to project onto the subspace that is spanned by the eigenvectors that belong to the largest eigenvalues pca has a lot of similarities to other machine learning algorithms so it's good that we had a closer look at it this module was very hard but i hope you gained some new and interesting insights either through the detailed derivation or through the practical exercises well done for completing this module in this course we looked at pca a practical machine learning algorithm for dimensionality reduction and data compression we covered a lot of material to get to the point where we could derive and understand how pca works we started off with some summary statistics of data means and variances which we later used to compute the data covariance matrix in pca then we looked at inner products which allowed us to compute lengths distances and angles between vectors we used these inner products for orthogonal projections of data onto lower dimensional subspaces and in the end we put everything together for the pca algorithm i'm aware that this course was challenging at times but i hope you enjoyed it nevertheless a big thanks to everyone who made it this far and completed the course very well done