Transcript for:
Understanding Convolutional Neural Networks

I will give you a very simple explanation of convolutional neural networks, without using much mathematics, so that even a high school student can understand it easily. Let's say you want the computer to recognize the handwritten digit 9. The way the computer looks at this is as a grid of numbers; here I'm using -1 and 1, while in reality it will use grayscale or RGB numbers from 0 to 255. The issue with this representation is that it is too hard-coded. If you have a little shift in the digit 9, for example here the 9 was in the middle but in this case it is on the left, the grid of numbers changes. It doesn't match our original number grid, and the computer will not be able to recognize that this is the number nine. Since it is a handwritten digit, there could be variation in how you write it, which will change the two-dimensional representation of numbers, and again you will not be able to match it with the original grid. So we use an artificial neural network for this kind of case, to handle the variety. In this deep learning series we have already looked at the artificial neural network video on handwritten digit recognition. If you have not seen that video, please make sure you see it so that your fundamentals on artificial neural networks are clear. In that video we created a one-dimensional array by flattening the two-dimensional representation of our handwritten digit, and then we built a neural network with one hidden layer and an output layer. This dense neural network will work okay for a simple image like a handwritten digit, but when you have a bigger image, like this cute looking koala, the image size is 1920 by 1080 and we have 3 as the RGB channels, one each for red, green and blue. In this case the first layer itself will have about six million neurons (1920 × 1080 × 3 ≈ 6.2 million). If you have, let's say, a hidden layer with just 4 neurons, you're already talking about 24 million weights to be calculated just between the input and the hidden layer. And remember, deep neural networks have many hidden
layers, so this can easily go into 500 million or 1 billion weights that you have to compute, and that's too much computation for your little computer. See, my rabbits are getting an electric shock because it's just too much to do. So the disadvantages of using an ANN, or artificial neural network, for image classification are: too much computation, and it treats local pixels the same as pixels far apart. If you have the koala's face in the left corner versus the right corner, it is still a koala; it doesn't matter where the face is located. So the image recognition task is centered around locality: even if the pixels are moved around, we should still be able to detect the object in the image, but with an ANN that's hard. So how does a human recognize this image so easily? Let's go into the neuroscience a little bit and see how we as humans recognize any image so easily. When we look at the koala's image, we look at the little features: these round eyes, this black prominent flat nose, these fluffy ears. We detect these features one by one. In our brain there are different sets of neurons working on these different features, and they're firing; they're saying, "Yes, I found the koala's ears," "Yes, I found the koala's nose," and so on. Then these neurons are connected to another set of neurons which will aggregate the results: if in the image you are seeing a koala's eyes, nose and ears, it means there is a koala's face in the image. Similarly, if there are a koala's hands and legs, it means there is a koala's body. And there is yet another set of neurons connected to these, which will again aggregate the results, saying that if the image has a koala's head and body, it is a koala's image. Same thing with the handwritten digit nine: there are these little edges which come together and form a loopy circle pattern, which is kind of like the head of the digit nine; in the middle you have a vertical line; at the bottom you have a diagonal line. Sometimes you don't have the diagonal line at all, but we know that whenever
there is a loopy circle pattern at the top, a vertical line in the middle, and a diagonal line at the end, that means the digit nine. So how can we make computers recognize these tiny features? We use the concept of a filter. In the case of nine we have three filters: the first one is the head, which is a loopy circle pattern; in the middle you have the vertical line; at the end you have the diagonal filter. So we take our original image and we apply a convolution operation, or filter operation. Here I have a loopy circle pattern, or head, filter. The way the convolution operation works is: you take a three by three grid from your original image and multiply the individual numbers with this filter, so this -1 is multiplied with this 1, this 1 is multiplied with this 1, and so on. In the end you get a result, and then you take the average, dividing by 9 because there are 9 numbers in total, and whatever number you get, you put it here. This output is called a feature map; by doing this convolution operation you are creating a feature map. Then you do it for the second three by three grid. Here I'm taking a stride of one; you can take a stride of two or three also. You don't need to have a three by three filter either; you can have a four by four or five by five filter. You keep on doing this over the entire number, and in the end what you get is called a feature map. Now the benefit here is that wherever you see the number one, or a number close to one, it means you have a loopy circle pattern, so this is detecting a feature. In the koala's case this would be an eye or a nose, because for a koala the eyes, nose and ears are the features. So by applying the loopy pattern detector I got this one here in my feature map. I also say the feature is activated, you know, it got activated here. For the number six it will be activated at the bottom, in this area. If you have two loopy patterns, the feature will be activated at the top and the bottom. If your number is like this, it might be activated in a different area.
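The convolution operation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the exact grids from the video: the 4x4 "image" and the 3x3 "loopy pattern" filter values are made up, but the mechanics (element-wise multiply, then divide by 9, slide with stride 1) are the same.

```python
# Minimal sketch of the convolution / filter operation on a grid of -1/1 values.
# Image and filter values are invented for illustration.

def convolve(image, kernel, stride=1):
    """Slide `kernel` over `image`, multiply element-wise, and average
    the products (divide by the number of kernel entries, e.g. 9 for 3x3)."""
    k = len(kernel)
    n = len(kernel) * len(kernel[0])
    h, w = len(image), len(image[0])
    feature_map = []
    for i in range(0, h - k + 1, stride):
        row = []
        for j in range(0, w - k + 1, stride):
            total = sum(image[i + a][j + b] * kernel[a][b]
                        for a in range(k) for b in range(k))
            row.append(total / n)
        feature_map.append(row)
    return feature_map

# A tiny 4x4 "image" whose top-left corner contains the loopy pattern
image = [
    [ 1,  1,  1, -1],
    [ 1, -1,  1, -1],
    [ 1,  1,  1, -1],
    [-1, -1, -1, -1],
]
loopy_filter = [
    [ 1,  1,  1],
    [ 1, -1,  1],
    [ 1,  1,  1],
]
fmap = convolve(image, loopy_filter)
# Where the image matches the filter exactly, every product is 1,
# so the feature map value is 9/9 = 1.0 -- the feature is "activated" there.
print(fmap[0][0])  # 1.0
```

Notice that the output is a 2x2 feature map (a 3x3 filter over a 4x4 image with stride 1), and the value closest to 1 marks where the loopy pattern was found.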
In summary, when you apply this filter, or convolution operation, you are generating a feature map that has that particular feature detected. So in a way, filters are nothing but feature detectors. For the koala's case you can have an eye detector, and when you apply the convolution operation, in the result you got these two eyes at this location. If the eyes are at a different location it will still detect them, because you are moving the filter throughout the image. The filters are location invariant, which means it doesn't matter where the eyes are in the image; these filters will detect those eyes and activate those particular regions. Here I have six eyes from three different koalas, and they are activated accordingly. Great. The hand of the koala is in this particular region, therefore when I apply the hands detector it will activate here. Now, for the number nine (I'm just moving between the number nine and the koala so that the presentation stays simple and you still get the idea): we saw that we need to apply three filters, the head, the middle part and the tail, and when you apply those you get three feature maps. So I applied three filters and I got three feature maps, and this is how these feature maps are represented: if you're reading any online article or a book, they are kind of stacked together and they almost form a 3D volume. In the koala's case, my eye, nose and ear filters will produce three different feature maps, and I can apply the convolution operation again. Let's say this time the filter is to detect the head. By the way, the filter doesn't have to be 2D; it can be three dimensional as well. So just imagine the first slice is representing eyes, the second slice is representing the nose, and the third slice is representing the ears, and by applying that filter you can say that the koala's head is in this particular region of the image. You are aggregating this result using a different filter for the head, and now this becomes a koala head detector. Similarly there could be a koala body detector, and now we
got these two new feature maps, where this feature map is saying that the koala's head is at this location and the koala's body is at this particular location. Then we flatten these numbers; see, in the end these are two-dimensional arrays of numbers, so we can flatten them, converting each 2D array into a 1D array, and then we join these two arrays together. After you join them, you can build a fully connected dense neural network for your classification. Now, why do we need this fully connected network here? Well, you can have a different image of a koala. See, my koala is sleeping; he's tired. So now his eyes and ears are at a different location. Look at his ears; see, they're here, while for the previous image the ears were in a different location, so that generates a different type of flattened array. And you all know, if you know the basics of neural networks, that neural networks are used to handle the variety in your inputs, such that they can classify that variety of inputs in a generic way. Here, the first part, where we use the convolution operation, is the feature extraction part, and the second portion, where we are using the dense neural network, is called classification, because the first part is detecting all the features (ears, nose, eyes, head, body, etc.) and the second part is responsible for classification. We also perform a ReLU operation, so this is not yet a complete convolutional neural network; there are two other components. One is ReLU: if you have seen my activation functions video in this same deep learning tutorial series, we use the ReLU (rectified linear unit) activation to bring non-linearity into our model. What it does is take your feature map and, whatever negative values are there, replace them with zero. It is that easy. And if the value is more than zero, it keeps it as it is. So just look at the values; it's pretty straightforward. ReLU helps make the model non-linear because you are picking a bunch of values and making them zero. So if you see my previous
videos in this deep learning tutorial series, you will get an idea of why it brings the non-linearity; especially see the video on activations in the same tutorial series (the link to the playlist is in the video description below), and you'll understand why ReLU makes it non-linear. But we did not address the issue of too much computation yet; my rabbits are still getting an electric shock, so do something! Because for this image size, if you are applying convolution, let's say with some padding, you're still getting the same size of image; you did not reduce the image size. Sometimes people don't use padding, so they reduce the image size, but only a little bit. So pooling is used to reduce the size; the main purpose of pooling is to reduce the dimensions so that my computer doesn't get this shock. The first pooling operation is max pooling. Here you take a window of 2x2 and you pick the maximum number from that window and put it here. So check this yellow window: 5, 1, 8, 2. What is the maximum number? 8. So put 8 here. Here, what is the maximum number? 9. So put 9 here. Similarly, the maximum number in the green window is 3, so put 3. So you take the feature map, apply your pooling, and generate a new feature map, but the new feature map is half the size. If you look at the numbers, you have reduced your 16 numbers to 4, so that's a huge saving in your computation. So how will it look for our digit nine case when you apply max pooling? Well, you can also do a stride of one. In this case we did a 2x2 window and a stride of two; a stride of two means once we are done with this window, we move two pixels forward. In this case we can do a stride of one; see, this is one stride, you get the idea, and we keep on taking the max. And this is what we get when our number is shifted: see, this is the original number, where we got this max pooling map; when the number is shifted, you get this pooling map. So still you are detecting
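The ReLU and max pooling steps just described fit in a few lines of plain Python. This is a small sketch with an invented 4x4 feature map; the first 2x2 window deliberately contains 5, 1, 8, 2, like the yellow window in the example, so the max picked is 8.

```python
# Sketch of ReLU followed by 2x2 max pooling on a feature map.
# The feature-map numbers are invented for illustration.

def relu(feature_map):
    """Replace every negative value with zero; keep positive values as they are."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2, stride=2):
    """Take the maximum of each size x size window, halving each dimension."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, h - size + 1, stride):
        pooled.append([max(feature_map[i + a][j + b]
                           for a in range(size) for b in range(size))
                       for j in range(0, w - size + 1, stride)])
    return pooled

fmap = [
    [ 5,  1, -3,  7],
    [ 8,  2,  9, -4],
    [-1,  3,  6,  0],
    [ 2, -7,  4,  1],
]
activated = relu(fmap)        # negatives become 0, positives pass through
pooled = max_pool(activated)  # 16 numbers reduced to 4
print(pooled)  # [[8, 9], [3, 6]]
```

The 4x4 map shrinks to 2x2: that is the computation saving, while the strongest activations (the detected features) survive the pooling.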
the loopy pattern at the top. So max pooling along with convolution helps you with position invariant feature detection: it doesn't matter where your eyes or ears are in the image, it will detect that feature for you. There is average pooling also; instead of the max, you just take the average. See: 5 plus 1 is 6, plus 2 is 8, plus 8 is 16, and 16 divided by 4 is 4. Max pooling is more commonly used, but sometimes people use average pooling also. So, the benefits of pooling: number one, obviously, it reduces your dimensions and computation. The second benefit is reduced overfitting, because there are fewer parameters. And the third is that the model becomes tolerant towards variation and distortion, because if there is a distortion and you're picking just the maximum number, you are capturing the main feature and filtering out the noise. So this is how our complete convolutional neural network looks: typically you will have a convolution and ReLU layer, then you will have pooling, then another convolution, ReLU and pooling; there can be n number of convolution and pooling layers, and in the end you will have a fully connected dense neural network. In this particular case the first convolution layer is detecting eyes, nose and ears. Many times you will start with little edges; you don't even start with eyes and noses, but here for simplicity I have put them. Usually you start with edges, then you go to eyes, nose and ears, then you go to head and body, and then you do the flattening. Again, anything on the left-hand side of this vertical line is feature extraction. So the main idea behind a convolutional neural network is feature extraction, because the second part is the same, a simple artificial neural network; but by doing this convolution you are detecting the features and you are also reducing the dimensions. There are three benefits of the convolution operation. The first one is connection sparsity, which reduces overfitting. Connection sparsity means not every node is connected with every other node, like
in an artificial neural network, where we call that a dense network. Here we have a filter which we move around the image, and at any time we are only talking about a local region, so we are not affecting the whole image. The second benefit is that the convolution and pooling operations combined give you location invariant feature detection, which means the koala's eye could be in the left corner, the right corner, anywhere; we will still detect it. Third is parameter sharing: once you learn the parameters for a filter, you can apply them across the entire image. The benefit of ReLU is that it introduces non-linearity, which is essential because the problems we solve with deep learning are non-linear by nature. It also speeds up training and is faster to compute; remember, with ReLU you are just doing one check, whether the number is greater than zero or not: if it is greater than zero, keep the number; if it is less than zero, make it zero. The benefit of pooling is that it reduces dimensions and computation, it reduces overfitting, and it makes the model tolerant to small distortions. How about rotation and thickness? By itself a CNN cannot handle rotation and thickness, so you need training samples which include some rotated and scaled samples, you know, some thick samples, some thin samples. And if you don't have them, you can use the data augmentation technique. What is data augmentation? Let's say for handwritten digits you take your original dataset, then you pick a few samples and you rotate them manually, or you make them larger or smaller, thicker or thinner, and you generate new samples. By doing that, you can handle rotation and scale in a convolutional neural network. Once again, here is a quick summary of what a convolutional neural network is; you can take a screenshot of this image and put it at your desk if you are trying to learn CNNs and computer vision. To summarize: you take your input image, then you apply the convolution operation and ReLU, then you apply pooling, then again
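The data augmentation idea can be sketched very simply. In practice you would use library helpers for random rotations, zooms and shifts; here is a hand-rolled toy version on a small pixel grid (the 3x3 "digit" and the `shift` helper are invented for illustration) just to show how one sample becomes several training samples.

```python
# Toy data augmentation: generate new training samples by shifting a pixel grid.
# The grid values and the shift helper are illustrative, not from a real pipeline.

def shift(grid, dy, dx, fill=0):
    """Return a copy of `grid` moved down by dy and right by dx,
    padding the vacated cells with `fill`."""
    h, w = len(grid), len(grid[0])
    out = [[fill] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            ni, nj = i + dy, j + dx
            if 0 <= ni < h and 0 <= nj < w:
                out[ni][nj] = grid[i][j]
    return out

digit = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]
# One original sample becomes three: the original plus two shifted copies
augmented = [digit, shift(digit, 0, 1), shift(digit, 1, 0)]
print(augmented[1])  # [[0, 0, 1], [0, 0, 1], [0, 0, 1]] -- stroke moved right
```

Rotations, scaling, and thickening work the same way conceptually: each transform of an existing sample is added to the training set so the network sees those variations during training.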
convolution, ReLU and pooling, and you can do this n number of times. After that, the second stage is classification, where you use a densely connected neural network. Now, a very important thing to mention here: the network will learn these filters on its own. In the previous presentation we applied those filters by hand, but this is the beauty of a convolutional neural network: it will figure out these filters automatically, and that is part of the training. So when the neural network, or the CNN, is training, because you're supplying thousands of koala images, it will use backpropagation to figure out the values in these filters; that is part of the learning, of the backpropagation. As hyperparameters, you will specify how many filters you want to have and what the size of each filter is; that's it. You do not specify the exact values within these filters; the network will learn those on its own, and that is the most fascinating part about neural networks in general. In the next few videos we will be coding with convolutional neural networks and solving a variety of computer vision problems. I hope you liked this explanation. If you don't know me, I'm Dhaval Patel; I teach data science, machine learning, Python programming and career guidance on my YouTube channel. If you are starting machine learning and are looking for very basic, beginner-level tutorials, I have a complete playlist; you can start with very basic Python and pandas knowledge and learn machine learning in a very easy-to-understand manner, and gradually in this playlist I try to cover data science and machine learning projects as well. I'm continuing my deep learning tutorial series right now, and my goal is to finish all the topics in deep learning, including convolutional neural networks, RNNs, language models and so on. So please stay tuned, watch my videos, and if you have any comments or feedback, please let me know in the video comments below.