For image data, there is a whole branch of feature engineering that deals with extracting features from images. Here's a quick intuition on feature extraction from images. To start with, let's take a simple example. In this black and white image, we can clearly see that it's the number 8. But the question is, how would a computer know that it's the number 8?
The answer lies in how a computer stores this image data, which is basically a matrix. This particular image has a size of 22 by 16, meaning there are 22 pixels along the height and 16 pixels along the width, so in total there are 352 pixels.
A computer stores a number between 0 and 255 for each of these 352 pixels, and this number is a measure of the respective pixel's brightness. As you may clearly see, brighter pixels have values approaching 255 and darker ones have values approaching 0. With this understanding, a very simple way to represent features for this image could be to append all these rows horizontally and get a 1 by 352 matrix. That covers black and white images.
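As a small hedged sketch of this idea (the array below is just a random stand-in for the 22 by 16 digit image, not the actual picture), flattening the pixel matrix into a single feature row could look like this:

```python
import numpy as np

# stand-in for the 22 x 16 grayscale image of the digit 8
gray = np.random.randint(0, 256, size=(22, 16), dtype=np.uint8)

features = gray.reshape(1, -1)   # rows appended horizontally
print(features.shape)            # (1, 352)
```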
Things would change slightly for a color image. Over here, the computer would store three pixel matrices, one each for the blue, green and red color intensities.
Again, these are on a scale of 0 to 255 depending on brightness intensities. We may replicate the previously discussed approach for black and white images here as well. For that, we simply need to prepare a new pixel matrix carrying the mean pixel values of the corresponding pixels from the three matrices, which are called channels. Further steps are exactly the same as discussed previously.
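Along the same lines, a minimal sketch for the color case, assuming a hypothetical 22 by 16 by 3 array holding the blue, green and red channels:

```python
import numpy as np

# stand-in for a 22 x 16 color image with three channels (B, G, R)
color_img = np.random.randint(0, 256, size=(22, 16, 3), dtype=np.uint8)

mean_pixels = color_img.mean(axis=2)     # average the three channels per pixel
features = mean_pixels.reshape(1, -1)    # same 1 x 352 layout as before
print(features.shape)                    # (1, 352)
```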
While this approach of feature extraction from images is simple, it is prone to a lot of problems. Consider we are comparing these two images, and we have extracted features for them using our simple approach. As we may observe, the second image has a background while the first does not. Also, the colors of the two dogs are very different.
And not just that, even the expressions of the two dogs are different. All these factors are collectively called noise. A good feature extraction approach would tend to minimize this noise, and clearly, our simple approach cannot do it, as it directly uses image pixel intensities. Not just that, our simple approach is computationally expensive, as it retains so much unnecessary information that a good feature extraction approach would otherwise drop.
Hence, we need a more systematic approach to extract features from images, an approach that retains only the crucial information while discarding the rest. For achieving this, we use the HOG feature descriptor. Simply put, HOG computes pixel-wise gradients and orientations and plots them on a histogram, hence the name Histogram of Oriented Gradients.
For this original image, this is the HOG representation. As you may observe, HOG has simplified the representation of this image drastically, retaining only the most important information. This way, HOG is able to minimize the overall noise here.
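As a quick hedged sketch, this kind of HOG visualization can be produced with scikit-image; the file name dog.jpg and the parameter values here are placeholders, not the exact settings we will use later:

```python
from skimage import io, color
from skimage.feature import hog

image = io.imread("dog.jpg")      # hypothetical sample image
gray = color.rgb2gray(image)      # HOG here works on a single channel

# descriptor is the feature vector, hog_image is the visual representation
descriptor, hog_image = hog(
    gray,
    orientations=9,
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    visualize=True,
)
```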
In the coming slides, we would go step by step into building this HOG representation for this sample image. We would try to cover the in-depth intuition without going too much into the technicalities. The first step is image pre-processing, wherein we standardize the image size. When we are building a face recognition machine learning model, we are feeding tons of images for training our model, and it is very important for us to ensure all images are of equal size.
In our case, we are resizing our images to 32 by 64. This means we have 32 pixels along the image width and 64 pixels along the image height. Next up, we would divide our resized images into 4 by 4 cells for calculating gradients and orientations, which we would discuss next. Just to highlight, it is generally recommended to resize images to 64 by 128 and keep cell sizes of 8 by 8 or 16 by 16.
But in our case, the input images themselves are very small, so we are using smaller resize and cell size values. Alright, this is the fun part. As we already know, HOG is Histogram of Oriented Gradients.
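Before moving on, here is a minimal sketch of that pre-processing step, assuming a hypothetical input file face.jpg and scikit-image for the resize:

```python
from skimage import io, color, transform

image = io.imread("face.jpg")                # hypothetical training image
gray = color.rgb2gray(image)
resized = transform.resize(gray, (64, 32))   # 64 pixels high, 32 pixels wide
# with 4 x 4 cells, this gives 16 cells along the height and 8 cells along the width
```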
In this section, we would calculate the gradients and orientations, which we would then plot on a histogram in the next section. Here we have divided our image into 4 by 4 cells. As you may see, we have 8 such cells along the width, denoted by C8, and 16 cells along the height, denoted by C16. For this first cell, let's assume this is what the pixel value matrix looks like.
As you may observe, this is 4 by 4, the way we discussed previously. The first thing we have to calculate here is the gradient in the x and y directions, which we call gx and gy. The gradient in the x direction, or gx, is the difference between the pixel value on the right of this highlighted pixel and the pixel value on the left. So, this is 89 minus 78, which equals 11. Alright, this is our gx.
Similarly, the gradient in the y direction, or gy, is the difference between the pixel value above this highlighted pixel and the pixel value below. So, this is 64 minus 56, which equals 8. This is our gy. Now, to calculate the total gradient magnitude and orientation from these gx and gy gradient components, we would use the Pythagoras theorem.
Yes, the one we studied in school. So the total gradient magnitude is the root sum of squares of gx and gy, which is 13.6. Similarly, the orientation is the inverse tangent of gy by gx, which is 36 degrees. In the same way, the HOG algorithm would compute gradients and orientations for all the pixels of this selected 4 by 4 cell.
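Just to sanity check that arithmetic, here is a tiny numpy sketch of the same calculation:

```python
import numpy as np

gx, gy = 11, 8                                   # gradients from the worked example
magnitude = np.hypot(gx, gy)                     # sqrt(11^2 + 8^2) ≈ 13.6
orientation = np.degrees(np.arctan2(gy, gx))     # inverse tangent of gy / gx ≈ 36 degrees
print(round(magnitude, 1), round(orientation))   # 13.6 36.0
```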
For the image edge pixels, HOG uses a technique called padding. Please refer to the link in the description part of this video to know more about padding. Next up, let's try to plot these gradient and orientation values on a histogram. The gradient and orientation values for our highlighted pixel are 13.6 and 36 degrees respectively.
We already know that this orientation value may vary between 0 and 180 degrees. For the histogram, HOG prepares bins of 20 degrees each, so there are nine bins in total.
Next up, HOG would start inserting gradient magnitude values into these nine bins as per the pixel orientations. For our highlighted pixel, the orientation is 36 degrees, which is closer to 40. So the major contribution of the gradient magnitude would go to the 40-degree bin and the minor contribution would go to the 20-degree bin. We use angle weights to do this. Simply put, 16 by 20 is the weight assigned to the 40-degree bin, which gets the major chunk of the gradient magnitude, and 4 by 20 is the weight assigned to the 20-degree bin, which gets the minor share. Similarly, all other gradient values are added to this histogram based on their pixel orientations for the selected cell.
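Here is a small hedged sketch of that vote splitting, assuming bins centered at 0, 20, 40, ... 160 degrees; the function name vote is made up purely for illustration:

```python
import numpy as np

def vote(magnitude, orientation, bin_width=20, n_bins=9):
    """Split one pixel's gradient magnitude between its two nearest bins."""
    hist = np.zeros(n_bins)
    lower = int(orientation // bin_width) % n_bins    # bin centered at 20 degrees for 36
    upper = (lower + 1) % n_bins                      # next bin, centered at 40 degrees
    upper_weight = (orientation - lower * bin_width) / bin_width   # 16 / 20 = 0.8
    hist[upper] += magnitude * upper_weight           # major share goes to the 40-degree bin
    hist[lower] += magnitude * (1 - upper_weight)     # minor share goes to the 20-degree bin
    return hist

print(vote(13.6, 36))   # ~2.72 in the 20-degree bin, ~10.88 in the 40-degree bin
```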
Towards the end, we get a 1 by 9 feature matrix for this cell. Then, HOG would compute such 1 by 9 feature matrices for the remaining cells. Once done, the next step is normalization. Localized gradients of an image are sensitive to the overall lighting.
We can partly reduce this variation by normalizing the gradients. For this, we make groups of cells known as blocks. Here, we have grouped 4 cells into each block.
Hence, we have a total of 7 blocks along the width and 15 blocks along the height when we shift this block one cell at a time like this. Alright, for each cell, HOG previously prepared a feature matrix of 1 by 9. Typically, a normalized matrix for a block would have the cell features appended horizontally, so for a block this matrix would be 1 by 36. Now, to get a normalized vector for this block, we need to compute this k, which is the root sum of squares of all 36 block features.
Then we divide all the features by this computed k to finally arrive at the normalized vector for this highlighted first block. In a similar way, HOG would compute normalized feature vectors for all 105 blocks by moving this block one cell at a time, both along the image width and height. Once done, HOG appends the 36 features from all 105 block-normalized vectors horizontally, giving us a 1 by 3780 dimensional image descriptor. We could validate this number during the coding part.
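As a hedged preview of that validation (again with the placeholder file face.jpg, and assuming scikit-image's hog with our 4 by 4 cells and 2 by 2 blocks), the descriptor length can be checked like this:

```python
from skimage import io, color, transform
from skimage.feature import hog

gray = color.rgb2gray(io.imread("face.jpg"))   # hypothetical input image
resized = transform.resize(gray, (64, 32))     # 64 pixels high, 32 pixels wide

descriptor = hog(
    resized,
    orientations=9,           # 9 histogram bins
    pixels_per_cell=(4, 4),   # 16 x 8 = 128 cells
    cells_per_block=(2, 2),   # 15 x 7 = 105 overlapping blocks
)
print(descriptor.shape)       # (3780,) = 105 blocks x 36 features each
```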
Well, this completes the intuition on the HOG feature descriptor. By the way, 3780 is a lot of features, which would make our model building part computationally very expensive and might lead to overfitting as well. So, we need to reduce these dimensions, for which we would use a popular dimensionality reduction technique called PCA.
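As a rough sketch of what that reduction might look like (the 200 by 3780 matrix below is random stand-in data, and the 95% variance threshold is an assumption, not necessarily the setting we will use):

```python
import numpy as np
from sklearn.decomposition import PCA

# hypothetical matrix of HOG descriptors: one 3780-dimensional row per training image
X = np.random.rand(200, 3780)

pca = PCA(n_components=0.95)     # keep enough components to explain ~95% of the variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)           # (200, k) with k much smaller than 3780
```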