Transcript for:
Variational Autoencoders (VAEs)

Hey guys, in this video let's talk about variational autoencoders. I'm super excited about this, not only because a lot of you asked for this topic, but also because variational autoencoders enable a lot of cool applications. For example, by changing the latent vector slightly (we'll get to what that means in a minute) you can add a smile to Mona Lisa, or a bigger smile, or sunglasses. So it does enable a lot of cool applications. Let's look at what it means in this video, and I'll try to make it as digestible as possible. In the next video we'll take it further by writing a few lines of code, of course using Keras and Python, working on the MNIST dataset.

The autoencoder part of "variational autoencoder" should already make sense to you; we covered it in videos 85 to 90, so I'm not going to spend too much time on it, except for a quick recap. An autoencoder is, as the name suggests, something that encodes: you encode the input data into a smaller space and then decode it back to get the original data. What good is that? What we learned is that you can take a bigger image and represent it in a smaller space, a smaller latent vector, say of size five. You have five numbers completely representing the image; you feed those five numbers into your decoder and you get the original image back, which is pretty cool. A good application is sending a large file: assuming the other side already has the decoder, you only need to send the latent vectors from that point on. That works very well, but it's not the primary application of an autoencoder.

We also went through a few tricks. One application was noise reduction. For a plain autoencoder, your X is the Einstein picture and your Y is the same Einstein picture. You're saying: my input is Einstein, my output is Einstein; just update the weights until the reconstruction error is minimal, meaning you're reconstructing Einstein back. As long as the error keeps going down you keep training, and you stop when it can't get any better. Whatever latent vector remains after training is your low-dimensional representation. For denoising you do exactly the same thing, except you trick it: the input is a noisy image and the output is a nice clean image. Once trained, the latent vector and the decoder can be used to denoise your input images. That's one application.

The other application is anomaly detection. A whole bunch of data goes in, you've already trained on normal input data, and you look at the reconstruction error: anything above the reconstruction error you established during training is an anomaly. That's how you design an anomaly detector. By the way, I'm only showing the encoding and decoding phases here; these can be regular convolutional networks, or LSTMs (LSTM encoder-decoder networks), or anything else. Just look at the structure: you encode into a latent vector, which is a smaller space, and then you decode it back. In this example we're looking at the reconstruction error, because that's how you train an autoencoder: reconstruction error is your loss, and you keep improving on it.
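Since the next video builds this in Keras anyway, here is a minimal sketch of the plain autoencoder described above, assuming flattened 28x28 MNIST images and a latent vector of size five; the layer sizes, the x_train/x_test names, and the anomaly threshold are illustrative assumptions, not code from the video.

```python
# Minimal dense autoencoder sketch (hypothetical layer sizes, flattened 28x28 MNIST input).
from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 5  # the "five numbers" representing each image

# Encoder: compress 784 pixels into a small latent vector
encoder_input = keras.Input(shape=(784,))
h = layers.Dense(128, activation="relu")(encoder_input)
latent = layers.Dense(latent_dim, activation="relu")(h)
encoder = keras.Model(encoder_input, latent, name="encoder")

# Decoder: reconstruct the 784 pixels from the latent vector
decoder_input = keras.Input(shape=(latent_dim,))
h = layers.Dense(128, activation="relu")(decoder_input)
reconstruction = layers.Dense(784, activation="sigmoid")(h)
decoder = keras.Model(decoder_input, reconstruction, name="decoder")

# Autoencoder: X and Y are the same images, reconstruction error is the loss
autoencoder = keras.Model(encoder_input, decoder(encoder(encoder_input)), name="autoencoder")
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=128)

# Anomaly detection idea: flag inputs whose reconstruction error exceeds a chosen threshold
# errors = ((autoencoder.predict(x_test) - x_test) ** 2).mean(axis=1)
# anomalies = errors > threshold
```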
Another example we did is domain adaptation: we gave an Einstein image as input and Mona Lisa as output, tricking the autoencoder again, and once trained the decoder plus the latent vector gives you your Mona Lisa picture. Another example is image colorization, black and white to color. These are all examples we already looked at; if you haven't seen them, go check videos 85 to 90 on my channel.

So let's get back to our autoencoder. What is an autoencoder? You have the encoder part and the decoder part, and the decoder is basically generative: it generates data as long as we provide a latent vector. Let's say my vector is one, two, three, four. I give that vector, the decoder weights are already pre-trained, and it creates an image, or generates data in general, in this example an image. This is nothing but a generator. We covered generative adversarial networks in videos 125 and 126, and there too the G in "generative" stands for generating: once trained, it generates new images, which is also pretty cool. But the point I'm trying to make is that autoencoders and variational autoencoders are not generative adversarial networks. In a GAN you have a generator network that's trying to fool a discriminator, and a discriminator that's trying to catch whatever is fooling it; you update these two networks separately, one competing against the other, and when the generator has become good at fooling and the discriminator can no longer catch it, training has converged and you use the generator. That's a completely different discussion; I just want to make sure you understand that it is also a generative network. With variational autoencoders we're likewise trying to create a generative network that we can use to generate new data.

Now, how is a VAE any different from an autoencoder? Let's get into that. What does it take to generate new data? If you just look at the generative part (not training, just generation), it requires us to provide a vector. But how the heck do we know what values are right for that latent vector? How do we determine those values? Say I train this on a whole bunch of Mona Lisa pictures, Einsteins, Newtons, and so on. I don't know what the vectors should look like, but the computer does. If you plot the latent vector distribution, maybe all the values that give us Mona Lisa are up here, all the values that give us Newton are down there, all the values that give us Einstein are over here, all the values for Einstein smiling are somewhere else, Einstein wearing glasses somewhere else, and someone else wearing glasses somewhere else again. In fact, what if we could bring all of those together? We'll get to that in a second. So this is the latent distribution.
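To make the "decoder is a generator" point concrete, here is a small hedged sketch that reuses the hypothetical decoder from the snippet above; the latent values are simply made up, which is exactly the problem just described.

```python
# Using only the trained decoder as a generator (continuing the hypothetical model above).
import numpy as np

z = np.array([[1.0, 2.0, 3.0, 4.0, 0.5]])  # a hand-picked latent vector of size 5
generated = decoder.predict(z)              # the pre-trained decoder turns it into 784 pixels
image = generated.reshape(28, 28)           # reshape for display

# The catch: unless we know where "Mona Lisa" or "Einstein" live in latent space,
# an arbitrary z like this one will usually decode to meaningless output.
```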
If I know where these latent vectors live, the space where, say, the Newton vectors are stored, then it's fine. But if I just sample randomly, if I give some arbitrary vector, then the values mean nothing: garbage in, garbage out, you'll just get noise and nothing else. It's not going to create Mona Lisa for us. In other words, how the heck do we know where Mona Lisa's latent vectors are, or where Einstein's are? This is where variational autoencoders help us. How? What if we knew how to pick appropriate latent vectors? That is the key idea of variational autoencoders.

Now, I copied this latent-space image and should have put the source down here; I sincerely apologize if it belongs to one of you, and I'll add the attribution later. Of course we'll generate our own plot like this using our own code (a small sketch follows below). This plot represents the MNIST dataset, digits zero through nine, with one color per digit. If I pick a latent vector from one region, I'll get an image that shows a 9; if I pick a vector from another region, it probably shows me a 6, and so on. So what if we can control how this distribution comes out? What if we can constrain the latent vector values to a continuous region? That means that as I move from one point to another, changing my vector values, I get varying images. Remember the title screen of this video, with a whole bunch of weird digits morphing into each other: all of that is generated by moving the latent vector from here to somewhere else within a range. That's pretty much it. So what if we can constrain this? That is where VAEs, variational autoencoders, come into the picture; we'll understand this a little more in a second.

Maybe the arrow should point somewhere else, like into this gap right here. I just realized the digit 5 is assigned a color I can barely see; maybe that's what this region represents. But what I'm trying to say is that you see gaps between the clusters. If you sample from a gap, you'll probably get a noisy image of whatever that region represents. As long as you stay in a continuous region, you get a different variation of the same digit: maybe here the 9 is slanted and there the 9 is straight, but you're varying it. That's what "variational" refers to: varying the latent vector gives you a varying image. I hope this part makes sense. So now you know what the variational part is, what the autoencoder part is, and how it relates to constraining the latent vectors to a predefined, continuous region.

Now, what do we mean by that? Let's go one more step. I'm not going to go into a lot of math here, for two reasons: one, I'm not an expert on these topics at that level, and two, if you just want to use this and put together your own model, you should be able to do that with this type of information. So how do we define this latent variable space? First of all, instead of mapping the input to a fixed latent vector (this is very, very important), which would mean we have to know exactly what that vector is, what if we map it to a distribution? And which distribution?
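Before going on, here is a rough sketch of how a latent-space plot like the one above might be produced, assuming an encoder trained with a 2-dimensional latent vector so the points can be scattered directly; the variable names are assumptions, not the code from the video.

```python
# Sketch: visualizing a 2-D latent space for MNIST (assumes latent_dim = 2 for easy plotting).
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import mnist

(_, _), (x_test, y_test) = mnist.load_data()
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

z = encoder.predict(x_test)  # shape (n_samples, 2): one point per test image

plt.figure(figsize=(8, 6))
plt.scatter(z[:, 0], z[:, 1], c=y_test, cmap="tab10", s=2)
plt.colorbar(label="digit")
plt.xlabel("z[0]")
plt.ylabel("z[1]")
plt.show()

# Each colored cluster is one digit; vectors sampled from the gaps between clusters
# tend to decode to noisy, in-between images.
```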
If I say my distribution is a normal distribution, a Gaussian, then I can say: pick any vector within this distribution. Meaning, if I tell you this is my boundary, any values within the boundary will work and values out here won't (maybe the boundary can be a bit smaller, but hopefully you get the idea). That's the point: you force these latent variables to be normally distributed. Why? Because it's easy to define a normal distribution using just a mean and a variance. So instead of passing along the entire encoder output, what if we use a mean and a standard deviation describing this distribution? Now you only have two quantities that need to be learned, not the entire distribution, which would be impossible. That's one reason.

This is where it gets into territory that can scare you if you're not from a statistics background: we are going to quantify the distance between the distribution learned during training and the standard normal distribution. How do you quantify the distance between two distributions? There's something called KL divergence; I believe we used it as a statistical metric in one of my previous videos, but for now just think of it as a metric that quantifies this distance. Just like mean squared error tells you the difference between an actual data point and the predicted line in linear regression, KL divergence quantifies the difference between two distributions. We're not comparing data points but entire distributions (when comparing distributions you also have tools like the Student's t-test and so on), and here we'll use KL divergence. I think that should make sense even if you only have the basics of statistics.

While training, we are going to force the learned distribution to be as close as possible to the standard normal distribution by using the KL divergence as a loss function. In other words, the KL divergence is going to be one of the loss functions we minimize during training. I said one of the loss functions because you already know what the other one is. What is the other loss function for an autoencoder? Reconstruction loss; we just saw it in the anomaly detection application. So there are two terms, the KL loss and the reconstruction loss, and we minimize both.

Now, coming back: you see I put digital sunglasses on Mona Lisa, but a variational autoencoder is still basically an autoencoder, except that instead of a fixed vector we have a distribution defined by a mean and a standard deviation (or a variance, sigma squared, if you prefer). Instead of using a single latent vector as the input to the decoder, we use a distribution; this is how we constrain the space. The distribution has a mean and a standard deviation, and from it you sample a latent vector; z stands for this sampled latent vector. And z is still stochastic, still random, but we are going to pick it using the two parameters that we learn during training.
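Here is a hedged sketch of those two loss terms written as one function. The names z_mean and z_log_var are assumed outputs of a VAE encoder (predicting the log of the variance is a common convention), and the closed-form KL term below is the standard one for a diagonal Gaussian measured against the standard normal.

```python
# Sketch of the two VAE loss terms: reconstruction loss + KL divergence.
import tensorflow as tf

def vae_loss(x, x_reconstructed, z_mean, z_log_var):
    # x, x_reconstructed: (batch, 784) pixel values in [0, 1]
    # 1) Reconstruction loss: how well the decoder rebuilt the input
    bce = tf.keras.losses.binary_crossentropy(x, x_reconstructed)  # mean over pixels -> (batch,)
    reconstruction_loss = 784.0 * bce                              # rescale the mean to a sum

    # 2) KL divergence between N(z_mean, exp(z_log_var)) and the standard normal N(0, 1)
    kl_loss = -0.5 * tf.reduce_sum(
        1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1
    )

    # Total training loss = reconstruction term + KL term, averaged over the batch
    return tf.reduce_mean(reconstruction_loss + kl_loss)
```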
I think I put another slide that explains this. The key question here is: how do you run backpropagation? If you're watching this video you probably know what I mean by that: when we train a network, the weights are updated using stochastic gradient descent or a similar optimizer via the backpropagation algorithm. So how do you run backpropagation, how do you train this, if there is a sampling step in there? If the latent vector is sampled, isn't the whole point not to have a fixed vector and to define the distribution only by its mean and standard deviation? Yes, that is the case, and here is the equation that hopefully answers your question: z = mu + sigma * epsilon. Our distribution is defined by mu, the mean, and sigma, the standard deviation, which we already talked about. We multiply sigma by epsilon, which is drawn from the standard normal distribution: a fixed distribution that we randomly sample from, so it is not trained during backpropagation, and that's okay. This is the trick, and I believe it's called the reparameterization trick; it makes backpropagation possible for this kind of distribution. Let me repeat: what gets trained during the training process is the mean and the variance (or standard deviation, if you prefer). The epsilon we pick is randomly sampled, not learned, during backpropagation. I hope this makes sense; a small sketch of this trick follows at the end of the transcript.

Let me go back to the previous image before we stop this video. In the next video we are going to apply this principle and put together a variational autoencoder for the MNIST dataset. The key point is that we are going to create a constrained space of latent vectors from which we can sample, and where sampling means something; without that, we don't know what latent vector to pick, which is why a plain autoencoder is not useful as a generator or a generative model, but variational autoencoders can be very, very useful. So please watch the next video to learn how to apply this in Keras and Python on the MNIST dataset. Please do subscribe; I make this type of cool stuff because you ask for it. Subscribe and encourage me. Thank you very much.
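As promised above, here is a closing hedged sketch of the reparameterization trick as a custom Keras layer; z_mean and z_log_var are the assumed encoder outputs, and the implementation shown in the next video may differ in details.

```python
# Sketch of the reparameterization trick: z = mu + sigma * epsilon, with epsilon ~ N(0, I).
import tensorflow as tf
from tensorflow.keras import layers

class Sampling(layers.Layer):
    """Samples z from N(z_mean, exp(z_log_var)) in a way backpropagation can handle."""

    def call(self, inputs):
        z_mean, z_log_var = inputs
        # epsilon is drawn from the standard normal; it is random but NOT learned,
        # so gradients flow only through z_mean and z_log_var.
        epsilon = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon

# Usage inside a VAE encoder (z_mean and z_log_var would come from Dense layers):
# z = Sampling()([z_mean, z_log_var])
```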