So at a very high level, what does math have to do with machine learning? Why math for machine learning, among all the things we could be talking about?

All programming involves mathematics at some level; mathematics, programming, and computers have been tied together since the inception of computer science. Machine learning in particular, though, is programming by optimization: the way we program computers to do things in machine learning is through optimization. And in order to understand optimization, to understand what it is we're optimizing and why and how that works, we need mathematics. That's what makes machine learning a much more mathematical branch of programming than something like web development or database management or the other branches of programming that are popular these days.

So in this series we're going to cover several of the ways that optimization and machine learning intersect: linear algebra, which will help us understand the objects being optimized; calculus, which will help us understand how we optimize those objects; and probability and statistics, which will help us understand what it is we're optimizing, the thing we're trying to make better.

We'll start with the section on linear algebra, which is about understanding the objects being optimized. The takeaways for today are, first, that linear algebra is important, and why it's so important; second, that linear algebra, despite its name, is actually not that much like algebra; and third, that the SVD, or singular value decomposition, is a matrix version of refactoring, a common technique in programming.

First: linear algebra is important. Why do we care about linear algebra? A first-pass definition is that linear algebra is the mathematics of arrays. An array is a block-like collection of numbers, and we usually think of those numbers as floating point numbers in a computer, or real numbers in mathematics. Lots of objects in machine learning are represented by arrays. Our data will be in arrays: our inputs and our outputs. If our inputs are images, they'll be arrays of numbers representing those images. Very often our models, which are the machine learning equivalent of programs, will be represented by arrays or by collections of arrays, and the internal computations of those models will be represented by arrays as well.

The core operation in linear algebra, the operation we use to manipulate those arrays, is matrix multiplication. A few examples of matrix multiplication in action: the dot, scalar, or inner product; correlation and covariance in statistics; linear and logistic regression from classical machine learning; principal components analysis; useful algorithms outside of what you would normally call machine learning, like the discrete Fourier transform and PageRank; and, back in machine learning, the hidden layers of neural networks, convolutions, and second-order optimization methods like Newton's method and L-BFGS. Matrix multiplication is essentially the operation at the core of every single one of these examples.

And in the end, GPUs: you may have heard of the importance of graphics processing units. GPUs are a huge part of what drove the recent explosion in machine learning. They were originally designed for video games, and they made video games faster and better, but they ended up also making machine learning faster and better, because underneath it all both graphics rendering and machine learning require linear algebra.
Matrix multiplication is also surprisingly simple, given how many things it can be used for. We have our arrays of numbers: one array on the left, another next to it, on the left side of the equals sign, and the result is another array. If I take two matrices and multiply them, I take the rows of the first matrix and the columns of the second matrix and combine them to get the entries of the output. For a given row and column, I multiply each number in that row by the corresponding number in that column, which gives me a whole bunch of products, and then I add them up, reducing them with a sum, and that gives me one entry of the output.

Mathematically: if I matrix multiply X and Y and look at the entry in the i-th row and j-th column, I take the i-th row of X and the j-th column of Y, multiply their entries pairwise, and sum them up, so (XY)[i, j] = sum over k of X[i, k] * Y[k, j]. To get all the entries of the output, I just do that for every row and every column. Matrix multiplication takes two matrices, where the first has as many columns as the second has rows, and turns them into a single output with the same number of rows as the first and the same number of columns as the second.

It's relatively simple given how many things it's capable of doing, but it's also kind of a weird and specific rule: if I just told you, "here's how we're going to combine collections of numbers," there's no obvious reason why you would do it that way.
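To make that rule concrete, here is a short sketch in NumPy (my own example, not from the talk) that writes the rule out with explicit loops and checks it against the built-in @ operator:

```python
import numpy as np

def matmul(X, Y):
    """Matrix multiplication written out the long way, to show the rule."""
    n_rows, shared = X.shape
    shared_2, n_cols = Y.shape
    assert shared == shared_2, "X needs as many columns as Y has rows"
    out = np.zeros((n_rows, n_cols))
    for i in range(n_rows):            # for every row of X ...
        for j in range(n_cols):        # ... and every column of Y,
            # multiply entries pairwise, then reduce with a sum
            out[i, j] = sum(X[i, k] * Y[k, j] for k in range(shared))
    return out

X = np.arange(6).reshape(2, 3)
Y = np.arange(12).reshape(3, 4)
print(np.allclose(matmul(X, Y), X @ Y))  # True: matches NumPy's @
```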
From here, I think a lot of linear algebra tutorials would jump straight into algebra: they would give algebraic definitions, in terms of equations, of a whole bunch of key concepts, and then start manipulating those equations and those matrices. I think that's not the right tack to take, because it doesn't give you a motivation for why we use that rule, or for how you should really think about linear algebra.

So rather than thinking of it as manipulating a bunch of equations, let's dive into the first of our major concepts: linear algebra is not like algebra. Linear algebra is very often taught like algebra. It was first discovered in a world where people were really concerned with solving equations, and it kind of looks like algebra: it's got addition, it's got multiplication, and there are rules for combining them. Lots of linear algebra classes spend their time on just those computations and manipulations (determinants, eigenvalues, things like that), and folks end up pretty confused, because the rules are complicated and reasoning with equations is hard and unintuitive. It took me a very long time to become comfortable doing that kind of algebraic manipulation with linear algebra. But that's not the only way to understand it; there are other ways to look at it that are more fruitful.

One of them is a geometric view of linear algebra, which I think is best put forward by the YouTube series "Essence of Linear Algebra" by 3Blue1Brown, formerly of Khan Academy. That view treats linear algebra as the study of linear transformations of space: rotations, reflections, and scalings of a grid. That's one interesting way of looking at it, and I recommend you check out that series; it's very beloved.

There are also other mathematical ways of understanding linear algebra that don't rely on equational reasoning. For example, the graphical linear algebra approach, which came out of the world of category theory in the last ten years, has a nice blog post series by Pawel Sobocinski. It basically views linear algebra as what happens when adding meets copying, which is an interesting alternative way of looking at it, and it uses diagrammatic reasoning, like you see on the bottom right of the slide, rather than equational reasoning. It's completely rigorous (in some ways even more rigorous than the usual way of doing linear algebra), but it involves very spatial, picture-based reasoning. If you're interested in abstract math, that's a fun direction to go.

But I don't think either of those is necessarily the right approach for folks who work with software and hardware. For them, I think the right way to think about linear algebra is that it's more like programming: matrices are our functions, shapes are the types of our data and our functions, and matrix multiplication is function composition. This approach uses ideas from programming and computer science rather than from, say, physics or abstract math.

In programming, we combine functions with matching types through function composition; in linear algebra, we combine matrices with matching shapes through matrix multiplication. When we're programming, we define one function in terms of others. If I wanted a function that checks whether a string is too long to tweet, maybe to tell a user "hey, this is too long, shorten that string up if you want to be able to tweet it," I might define it by composing two functions: first I check the length of the string, then I check whether that length is over 140. Combining those two functions gives me too_long_to_tweet as their composition. If I look at the types, len takes a string and returns an integer, the length of the string; over_140 takes an integer and returns a boolean, a true or false value. So too_long_to_tweet takes those two pieces, those two arrows, and slaps them onto each other: it now takes a string and returns a bool. The composition of the two functions takes the input type of one to the output type of the other.

We can see this if we draw functions the typical way they're drawn, at least in the American mathematical education system, when you first learn about them in school: all the inputs in a little oval on one side, and the function as an arrow pointing from each input to its output. "a" and "b" are both strings of length one. The beginning of the Declaration of Independence, "When in the course of human events, it becomes necessary...", is a string of length 8,007. 8,007 is bigger than 140, so the Declaration of Independence is too long to tweet. You can read off how these functions behave by following the arrows: we get the purple arrow representing too_long_to_tweet by following the red and blue arrows representing len and over_140.
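As a quick sketch (my own code, using the function names as I've described them), that composition might look like this in Python:

```python
def over_140(n: int) -> bool:            # int -> bool
    return n > 140

def too_long_to_tweet(s: str) -> bool:   # the composition: str -> bool
    return over_140(len(s))              # len: str -> int

print(too_long_to_tweet("a"))                          # False
print(too_long_to_tweet("When in the course " * 500))  # True
```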
And really, this is the essence of programming. If you ever do functional programming, especially pure functional programming, this is all that you do, and it's enough to do anything you can do in any language. We often do slightly different things in other programming languages, but in principle this is the core of what we're doing when we program computers: combining these things together.

In linear algebra we do much the same thing. Function application, applying a function to data, is matrix-vector multiplication. We often think of matrix-vector multiplication as an equational thing: a matrix and a vector on the left-hand side equal the thing on the right-hand side, the way you would say 2 + 2 = 4. But the view I'm trying to put forward is that you should instead think of it as M(v): the matrix M, applied as a function to the vector v, is equal to the thing on the right-hand side, which for now we just call Mv. Under that view, a matrix is a function that takes in arrays with a certain number of rows and puts out arrays with a certain number of rows. In this case the number of rows in the input vector, its length, is just k, to emphasize that it could be a bunch of different things, and the number of rows in the output is given by the number of rows in the matrix, in this case three. So just as too_long_to_tweet had a type signature saying it takes in strings and returns bools, this matrix has a type signature: it takes in arrays of shape (k, 1) and returns arrays of shape (3, 1).

Let's pick a particular matrix: this matrix X, whose rows are 1 0 0 and 0 1 0, and let's view it the way we would view a function in that ovals-and-arrows picture, with the inputs on the left and the outputs on the right. You can see that the inputs (0, 1, 2) and (0, 1, 5) both get taken to the same output. And just as an exercise: if you were to name this matrix the way you name a function in programming, what name would you give it? When we're doing math we often think of these things as variables and give them names like x and w, but when we're thinking about functions we give them descriptive names. This X takes in inputs that look like the things on the left and returns the things on the right: it strips off the last entry.

When we combine matrices through matrix multiplication, that is function composition, the combination of two functions. For example, we could take that matrix X and then apply, on the left, a matrix Y, and we would get a different function, a different matrix, that does something different to its inputs. In the example here, the first matrix, the one we already talked about, strips off the last entry, the z-axis, so we get the vector in the middle; the red arrow points from the input on the left to that output. Then we can chain that with the second matrix, indicated in blue, the Y matrix, which takes two-dimensional inputs to two-dimensional outputs and zeros out the y-axis. So we end up with a vector that lies along the x-axis, with only the first entry of the vector that went in. (This way of drawing things, as arrows in a space, is maybe a little more common for vectors than the oval picture, so I wanted to make sure we covered it.)

The combination of those two functions is itself a function, and it is in fact also a matrix, and that matrix is obtained from the two matrices by matrix multiplication. So why that specific matrix multiplication rule? It was exactly so that this would be true: so that multiplying the one matrix on the left by the other gives the matrix representing their composition. That is the reason for the matrix multiplication rule.
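Here is a small sketch of that example in NumPy. The talk doesn't spell out Y's entries, so the particular Y below (a matrix that zeros out the y-axis) is my guess at what's on the slide; the point is just that applying X and then Y matches applying the single composed matrix Y @ X:

```python
import numpy as np

X = np.array([[1, 0, 0],
              [0, 1, 0]])   # strips off the last (z) entry: length-3 in, length-2 out
Y = np.array([[1, 0],
              [0, 0]])      # zeros out the y entry: length-2 in, length-2 out

v = np.array([3, 4, 5])

print(X @ v)          # [3 4]   X applied as a function to v
print(Y @ (X @ v))    # [3 0]   apply X, then apply Y
print((Y @ X) @ v)    # [3 0]   same answer from the composed matrix
print(Y @ X)          # [[1 0 0], [0 0 0]]: the single matrix that does both steps
```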
Fundamentally, this is the approach to thinking about linear algebra that I think is most fruitful for doing machine learning, and for a lot of the other branches that linear algebra gets applied in.

But the remaining question is: why linear algebra for machine learning? We've understood a little more about linear algebra, but we haven't really tackled why it's important for machine learning yet. I said that we use arrays to represent functions and to represent data in machine learning, but why, and what kinds of functions can we represent that way? Why, basically, is linear algebra so important for optimization, and hence for machine learning?

I think the right tack for understanding this is to think about how we would represent functions as data. The fundamental requirement, if we want to program by optimization, is that we need to be able to represent programs and manipulate them: we're going to start with a program that doesn't work and then, slowly, over time, obtain a program that does. That's the process of optimization. In order to do that, we need to have the program as an object and be able to manipulate it somehow.

The normal way we represent programs as data is as strings. build_website.py is, in the end, a string of characters, and that's how humans tend to manipulate programs in general. We could also represent our programs as dictionaries, as lookup tables, which is basically the ovals-and-arrows view of functions applied to your program. So those are different ways to do it, but neither one is actually good for optimization. If we want to optimize things quantitatively, the usual way we represent programs is horrible; it's an absolutely horrible way to represent functions if we want to be able to optimize them. Just take, as an example, a very simple Python program that checks whether a number is odd and returns "odd" or "even" as a string. Pretty much any single-character change to that code would break it: it would make it do the wrong thing, or it would stop being valid Python entirely. Strings are a great way to represent programs if you want programmers, humans, to adjust their behavior, but for a computer it's really difficult to make those coordinated changes, and it's very difficult to make the process numerical and quantitative in a way that we can automate, which is the goal of machine learning.
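The slide's code isn't reproduced in the transcript, but a minimal sketch of the kind of program being described might look like the following. Notice that changing almost any single character (the 2, the ==, a quotation mark) either changes the behavior or breaks the program outright:

```python
def is_odd(n: int) -> str:
    """Check whether a number is odd, returning "odd" or "even" as a string."""
    if n % 2 == 1:
        return "odd"
    else:
        return "even"
```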
The main way we make progress is by restricting ourselves to a smaller set of functions and focusing on making those work well: looking for a more limited data type that doesn't necessarily represent all possible functions but is still powerful. If we focus on easy-to-optimize functions, we get two separate branches of machine learning. One is based on trees, and it gets you things like random forests and gradient-boosted trees, which we won't be talking about. The other direction is toward arrays: using matrix multiplication as function application, and thinking, at least at first, only about linear functions.

The benefit is that arrays are really easy to change just a tiny bit. If I make a tiny change to the entries of a matrix, I get a small change in the function that the matrix implements: the outputs only change a little bit if I change the matrix a little bit. As we'll see when we talk about calculus, that's huge; that's extremely valuable. And then of course there's what we already talked about: linear algebra can be made fast, and there are all these really useful things you can do with it. At its core, this is why linear algebra is so important for machine learning.

I want to go over a little of what I mean by "arrays represent linear functions," and now it's finally time to do a little bit of algebra. There are basically two rules for something to be a linear function. It needs to respect the first rule, f(a + b) = f(a) + f(b), a distribution-over-addition rule, and it needs to respect a distribution-over-scaling rule, f(λa) = λf(a). The second says that if I scale the input, I just scale the output; the first says that the output on the sum of two inputs is the sum of the outputs. What this allows us to do is rearrange a problem: something that comes in looking like f(a) + f(b), which would require us to compute something twice, can be swapped around into f(a + b), computing it just once.

The reason we have those two rules is that together they mean linear functions play really nicely with weighted sums. If I combine the two rules, I can use them over and over again: if I multiply by a bunch of different scalars and add a whole bunch of things together, which is what the expression in the center of the slide is doing, I can pull the function inside or outside the whole thing, just as I could pull the plus sign out from inside the function application and pull the scalar multiplication outside. Any time I take a weighted sum of things, my linear function can go on the inside or the outside. That's huge: weighted sums show up all over the place, and this basically makes linear functions really easy to reason about.

We'll see how this works by looking at how linear functions interact with the number zero. First off, zero is always sent to zero. We get this from the scaling rule: applying f to 0 is the same as applying f to λ times 0, because λ times 0 is 0, and I can then pull the λ out, which means λf(0) = f(0) for every λ. That can only be true if f(0) = 0.
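Here is a tiny numerical check of those two rules (my own example): any matrix, viewed as a function via matrix-vector multiplication, satisfies them, and it sends zero to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 4))                   # any matrix is a linear function
a, b = rng.normal(size=4), rng.normal(size=4)
lam = 2.5

print(np.allclose(M @ (a + b), M @ a + M @ b))    # f(a + b) == f(a) + f(b)
print(np.allclose(M @ (lam * a), lam * (M @ a)))  # f(lam * a) == lam * f(a)
print(M @ np.zeros(4))                            # [0. 0. 0.]: zero is sent to zero
```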
If inputs collide, we can use the same rules to determine that more things have to be sent to zero. If two inputs get sent to the same output, then their difference has to be sent to zero: if f(a) - f(b) = 0, so a and b are sent to the same output by f, then f(a - b) = 0, so the difference between a and b gets sent to zero. We get that by pulling the minus sign inside; if you're being really technical, we're pulling a factor of minus one inside with the scaling rule and then using the addition rule.

Anything that is sent to zero is called the kernel of the function, so a - b here is in the kernel. One thing that comes out of this is that you can re-check the result we already got: if I set a and b equal to each other, I get that zero is in the kernel, that zero is always sent to zero. Other things might also be in the kernel, and the kernel is made up of weighted sums, going back to that idea. If f(a) = 0 and f(b) = 0, then a + b is also sent to zero: f(a + b) is the same as f(a) + f(b), both of those are zero, so the output is zero. So if I take a whole bunch of things that are all sent to zero, and all I do is combine them with weighted sums, I'm still going to stay in this collection of things sent to zero.

Something similar goes for the inputs that are not sent to zero, the things outside the kernel, and that gets us to an important concept: the rank of the function, or the rank of the matrix. We can make new non-kernel elements by taking weighted sums of known non-kernel elements, and rank answers the question: how many of these do I need in order to make every possible thing that's not in the kernel? The kernel, the collection of things sent to zero, is not that interesting, so the rank is telling us how many things we need in order to understand all the interesting behavior of this matrix as a function, of this linear function.

That was a quick exercise to give you a flavor of how reasoning works with linear functions, and of how much you can learn about them with just a few quick linear-algebraic manipulations.
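To make the kernel and the rank concrete with the running example (my own code): the "strip off the last entry" matrix sends the whole z-axis to zero, and its rank is two.

```python
import numpy as np

# Strips off the last coordinate, so anything living entirely on the
# z-axis is sent to zero: it's in the kernel.
X = np.array([[1, 0, 0],
              [0, 1, 0]])

print(X @ np.array([0, 0, 7]))     # [0 0]  -> in the kernel
print(X @ np.array([0, 0, -2]))    # [0 0]  -> weighted sums of kernel elements stay there
print(np.linalg.matrix_rank(X))    # 2: how many inputs you need to build all of the
                                   #    interesting (non-kernel) behavior
```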
But I wanted to return to this idea of linear algebra as programming: linear algebra as less like those algebraic manipulations and more like the things we do when we program computers. The singular value decomposition is an important concept in linear algebra and in the understanding of matrices, and in my view it is the equivalent of refactoring a program.

Refactoring a program means taking a program and rewriting it without changing how it behaves: you change the internals, maybe exactly what's going on under the hood, but you don't change anything about its overall behavior; the same inputs still go to the same outputs. We often refactor programs in order to understand them better, sometimes to make them run faster; there are lots of reasons why we might refactor a program, and there are a couple of common tricks for doing it.

One is separation of concerns: saying, "this one piece of the program is doing too many things; I need to separate those out from each other." In linear algebra, the equivalent of that is eigendecomposition. It says, "this matrix is working on an n-dimensional thing, but it's really doing n one-dimensional things," and that's exactly what eigendecomposition does: it finds those n one-dimensional pieces and separates them out. That's separation of concerns.

We can also remove code that doesn't do anything: you may identify that some branch of code is effectively doing nothing, that it's dead weight. That's very close to what we do in low-rank approximation, when we try to find a matrix that has much lower rank than the original but still does about the same thing. We'll talk about low-rank approximation in a little bit.

The other thing we often do when we refactor a program is break something up into multiple functions. This is much like separation of concerns, but instead of splitting things up in parallel, we take one function that does several things in a row and decompose it into a bunch of pieces that, when composed back together, still do the same thing. Maybe I was loading data, then applying a function to the data, then writing it to disk, all in a single function; let's split that up into three functions that each do one of those pieces. That refactor, breaking one function into a pipeline of pieces, is the equivalent of the singular value decomposition. In linear algebra we have the singular value decomposition: decomposing, breaking up into pieces, undoing a composition, and it achieves the same thing as that particular type of refactoring in programming.

In fact, any function can be decomposed the way we do in the singular value decomposition, which I think is very cool: you break your function up into three pieces. I've got a diagram here, and we're going to go through an example with that is_odd function. We have this single is_odd function that does it all in a couple of lines and goes straight from integers to strings, but that function can actually be broken down into pieces. And this can be useful (refactoring can be useful) in case somebody later says, "actually, we wanted to return booleans rather than strings," or "actually, we wanted to do this mod 3 rather than mod 2."
Breaking it up into pieces can make it easier to make those changes later. If we want to break this function down, there's a natural way to do it, in my view: first do the part where we determine whether the number is even or odd, the mod 2 operation; then take that integer and turn it into a true or false value; and then do something different for true and for false, returning "odd" for true and "even" for false. That's what this breakdown achieves: the split into the mod-2, to-bool, to-string setup.

This actually represents a very generic strategy. One way of thinking about it is that mod 2 picks representatives for this function. There are really only two outputs, "odd" and "even", so we only need to know what the function does on two inputs, and then know which inputs correspond to those two classes. There's going to be one representative for odd and one representative for even, and in this first step we're asking, "do you get sent to the same value as 0, or to the same value as 1?" That simplifies the function down: after this point, when it comes to mapping inputs to the final output, we only need to do it for those two specific representatives, and the first step is just a matter of saying, "these inputs are all going to get treated the same, and those inputs are all going to get treated the same." So that's our first step: picking out representatives.

Then we have a step that's just a reversible renaming: associate each representative with its output, one to one. Zero is even (it is false that it is odd), so 0 gets mapped to False, and 1 gets mapped to True. The important thing here is that each output is targeted by one and only one representative: no mixing, no combining, no splitting, just a one-to-one renaming operation.

And then finally we need to put the result into the correct type. We've got a True or a False, but what we wanted was a string, because this is going out to a user who wants to know whether the number is odd or even. We're effectively recognizing that True and False are the outputs here, and we just need to map them onto the strings we want to show the user: we're finding a copy of the outputs of to-bool, of True and False, inside the type string.
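As a sketch (my own function names, not necessarily the slide's), that three-piece refactor of is_odd might look like this; composed together, the pieces behave exactly like the original one-shot function:

```python
def pick_representative(n: int) -> int:   # the mod-2 step: collapse every input onto 0 or 1
    return n % 2

def rename(r: int) -> bool:               # reversible, one-to-one relabeling: 0 -> False, 1 -> True
    return bool(r)

def embed_in_strings(b: bool) -> str:     # find a copy of {True, False} inside the type str
    return "odd" if b else "even"

def is_odd(n: int) -> str:                # the composition: same behavior as the original
    return embed_in_strings(rename(pick_representative(n)))

print(is_odd(8007))   # "odd"
```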
And we can apply that same breakdown to a matrix. The breakdown is generic for any function; it's called the canonical decomposition. The equational way of writing it is M = ABC, A times B times C. The function view is to say that the function M, which maps arrays of dimension n to arrays of dimension m, can be broken down into the composition of three functions: C, then B, then A. The diagram says the same thing as the equation. In this setup, C is a wide matrix, B is a square matrix, and A is a tall matrix: C has at least as many columns as it has rows, B has exactly as many columns as rows, and A has at least as many rows as columns. The typical case to keep in mind is where C is much wider than it is tall and A is much taller than it is wide.

C has r rows and n columns. n is the number of inputs, so C needs that many columns in order to take the same inputs as M, but it maps them down to only r outputs: it shrinks things down, and what it's doing is throwing out the kernel. We're picking out representatives in this step: any inputs that M sends to the same output should all map to one value, and we only worry about that one value, like how the mod 2 operation for is_odd said, "I know what I want to do with 0 and 1; let me map everything that's equivalent to 0 onto 0 and everything that's equivalent to 1 onto 1." Here we're doing the same thing, looking for inputs that get sent to the same output by M. That connects us back to the kernel, and to the idea of rank: how many things do I need in order to build all possible outputs? That's why this intermediate output has dimension r; it has the same size as the set we need to build all the non-kernel elements. At a high level, you can think of this step as throwing out everything that gets sent to zero.

The second step, B, is our reversible relabeling; that's why this matrix is square. It takes in r inputs and returns r outputs, so it's just a relabeling: it might shuffle the axes around a bit, but it's not making any serious changes, and it's always reversible. Because it's square, the inputs are the same size as the outputs, which is what makes it possible for it to be reversible; if the outputs are way bigger or way smaller than the inputs, it's not always going to be possible to reverse the operation, and so the matrices A and C are, in general, not reversible. B is our special reversible step.

Lastly, we have a tall matrix at the end, A, whose outputs can be bigger than its inputs. It effectively finds a copy of the arrays with r elements among the set of all m-element arrays. For example (if you can see my video here), if I wanted to find a copy of the two-dimensional arrays among the set of all three-dimensional arrays, there are lots of possibilities: my fingers here represent two axes, and I'm moving them around in 3D space to represent a whole bunch of different potential copies of the set of all two-dimensional arrays inside three-dimensional space, the space we live in. So this step says: these r-dimensional arrays came out of B, and I want to match them to the outputs of M, which have dimension m, so I need to inject these r-dimensional arrays into that output space. This is very similar to choosing "True corresponds to 'odd' and False corresponds to 'even'" in our is_odd example: there are many possible choices of two words out of the set of all possible strings, and we chose those in particular because that's what our function was supposed to output.

So that gives us the decomposition into three pieces. One way to think about it, if you're familiar with these ideas, is that the first function, C, is the onto piece, the surjective piece of M; the part in the middle, B, is the bijective piece, the isomorphism, the reversible piece; and A is the injective piece, the into, one-to-one part. We're breaking our function down into those three types of pieces. The function M need not have any of those properties itself, but it has pieces that each have those properties of surjectivity, bijectivity, and injectivity. That's the way of breaking down matrices in general.
If we make certain special choices, then we get the singular value decomposition. If we aim to make that middle matrix diagonal, so it only has nonzero entries where the row and column indices are equal (row one column one, row two column two, which is what the dashed line here represents), then we end up with the singular value decomposition, where U and V are unitary matrices. They don't grow or shrink anything: beyond throwing stuff out, all they do is change bases, change the axes, spin them around or reflect them. The two matrices C and A become those unitary matrices, conventionally written Vᵀ and U, and that is the singular value decomposition. The important thing about the singular value decomposition, in my view, is that it breaks a matrix down into those three pieces of surjection, bijection, and injection.

The SVD is a very special way to break up a matrix, as is maybe indicated by its relationship to this canonical decomposition, this canonical way of breaking down matrices. One reason it's special is that it can be used to calculate low-rank approximations. If you've done principal components analysis, the linear-algebra-based technique for dimensionality reduction, for pre-processing and analyzing data, that's based on the singular value decomposition. In a pure math setting it's used to classify and measure matrices: determining the norm and size of matrices, measuring the inner product of matrices, is all based on the singular value decomposition. But we're going to talk about how it's used in low-rank approximation, since that comes up in numerical linear algebra and in machine learning.
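Here is what the decomposition looks like numerically (my own toy example): NumPy hands back the three pieces, with the middle piece given as just its diagonal entries, and you can check that the two outer pieces only rotate and reflect.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 5))                   # a 3x5 matrix: length-5 arrays in, length-3 arrays out

U, s, Vt = np.linalg.svd(M, full_matrices=False)

print(Vt.shape, s.shape, U.shape)             # (3, 5) (3,) (3, 3): wide, diagonal, tall-or-square
print(np.allclose(M, U @ np.diag(s) @ Vt))    # True: the three pieces recompose M
print(np.allclose(Vt @ Vt.T, np.eye(3)))      # True: V only changes bases, no stretching
print(np.allclose(U.T @ U, np.eye(3)))        # True: so does U
print(s)                                      # the singular values: the diagonal middle piece
```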
When a matrix's rank is not close to its total dimension, it is said to have low rank. If you actually look at a matrix with low rank, that often (though not always) shows up as an obvious visual pattern: the matrix has some simple pattern to it. Just as an example, take this video from a security camera and turn it into a matrix where each column represents a single frame of the video. You can see that it's basically constant over time: there are little deviations from the pattern inside the matrix, but basically there's a simple pattern of repeating the same column over and over again, representing the fixed background of the scene.

I want to compare that to what a full-rank matrix would look like. Here's a full-rank matrix of the same size; it's more like what you'd get from the static on an old-school TV tuned between channels. Again we're taking each frame and making it a column, and looking over time along this axis, the same as in the security camera example, but now, since it's static, there isn't a simple pattern; there's no "the background stays the same all the time." A full-rank matrix doesn't have those kinds of simple patterns in it and can't be broken down in the same way.

Lots of the matrices we come across in real life, when we measure data (like pointing a security camera somewhere and recording its inputs), end up being nearly low rank. Not exactly low rank; they can't be exactly represented by a simple pattern, but nearly. The simplest low-rank pattern in this video is just the background: if I take the vector representing a single frame and repeat it over and over again, I get a matrix that is very close to the original. One thing to note here is that this matrix multiplication by a vector of all ones is acting like a repeat or a tile function; if you were to implement a function that did this, you might call it repeat or tile. That's another example of how we can think of matrices the way we think of functions in computer programs.

This is effectively an approximation to the original input: it's very close to the original matrix, and it is rank one. One way to see that it's low rank is that it can be represented by this really tiny thing over here. The reason this is useful is that we can use it to compress data: we can take something that looks like the full matrix and turn it into just the small pieces on the right-hand side, which would compress our video down quite a bit. Of course, it would also remove everything that's moving in the video, which is maybe too much compression, but this general principle of low-rank approximation is, roughly speaking, what JPEG is based on: JPEG compresses pictures by keeping only the most important components in a Fourier-style transform of the image.

What I think is more useful in this case (and this idea comes from fast.ai's numerical linear algebra course, which is a great resource) is to do foreground-background separation. If we take the original video frame and the rank-one approximation for that frame, and subtract the approximation off, we're left with only the foreground: that small bit that deviates from the background pattern, which is exactly what's of interest. So even when that approximation, that compression, is not a good compression for our purposes, it can still be useful. These low-rank approximations show up in both ways when we're doing data science and machine learning: sometimes to compress, and sometimes to pull out the interesting, uncompressible pieces.
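Here's a toy stand-in for that security-camera example (entirely made-up data, not the footage from the talk): build a rank-one "background" matrix by tiling one frame, then subtract it off to recover the foreground. With real footage you'd typically get the background from the top singular vector of the SVD rather than knowing it in advance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "video": each column is one frame of 100 pixels.
background = rng.uniform(size=(100, 1))     # a single frame, as a column vector
video = np.tile(background, (1, 300))       # repeat it across 300 frames
video[40:50, 150:160] += 0.5                # something moves through the scene

# Rank-one approximation: matrix multiplication acting as a "tile" function.
approx = background @ np.ones((1, 300))
print(np.linalg.matrix_rank(approx))        # 1

# Foreground/background separation: subtract off the low-rank part.
foreground = video - approx
print(np.abs(foreground).max())             # ~0.5: only the moving blob remains
```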
So those are our three takeaways for today. Linear algebra is important: matrix multiplication, despite its relative simplicity, is very powerful and shows up in lots of different places. Linear algebra, despite the way it's normally taught, is not actually that much like algebra; it's more like computer programming in a lot of ways, where shapes are our types, matrices are our functions, and matrix multiplication is function composition. And finally, the singular value decomposition is our matrix refactoring: refactoring, that common operation we need to do all the time when we're programming, shows up in the form of the singular value decomposition, among other ways we might refactor matrices.

Some more resources for understanding linear algebra. We aren't really going to be able to cover all the mathematics you need for machine learning in just three sessions, so I wanted to give you some pointers to more resources. What's most useful to do next depends a lot on your past history with this branch of math and on your future plans.

If you had a traumatic linear algebra class back in the day, maybe in college, and it just wasn't your favorite class and you felt like you didn't learn that much, then try Essence of Linear Algebra by 3Blue1Brown on YouTube. It's bingeable; you could fit it into an afternoon if you really wanted to, and it has really nice, slick animations. A lot of people describe it as a sort of religious experience: "oh, I finally see the light, that's why linear algebra was so confusing for so long." Highly recommended.

If you want to level up your abstract math chops, if you really want a better handle on how to think about mathematics and how to read and understand it, maybe so you can read machine learning papers a little better, then the Graphical Linear Algebra blog post series by Pawel Sobocinski is a great place to start. It's aimed at people who haven't taken a proof-based math class before, and it teaches them how to do that while also teaching some cutting-edge mathematics, which is a great combination. It's well written.

If you're more the hacker type who really wants to get your hands dirty with some actual code, then I would try Numerical Linear Algebra by fast.ai, with Rachel Thomas. It's an online class and textbook focusing on applications of linear algebra for machine learning.

And if you're the type who prefers the traditional math class experience, then check out Linear Algebra Done Right by Sheldon Axler. The textbook has been free in the past (I'm not sure if it's currently free), but the lecture series on YouTube certainly is, and it goes through a very classic, traditional math-class approach to linear algebra. If you wanted to do that, that would be the place.

Hey friends, Charles here. Thanks for watching my video. If you enjoyed it, give it a like. If you want more Weights & Biases tutorial and demo content, subscribe to our channel, and if you've got any questions, comments, or ideas for future videos, leave a comment below. We'd love to hear from you.