All right, let's now get concrete and talk about the VGG-16 architecture. It's a very simple, straightforward architecture; you can think of it as similar to the AlexNet we talked about in the last lecture, except that we now have more layers, it's essentially a network with 16 layers. Just for reference, in the figure that I showed you in the last video, VGG-16 is located here. You can see it's relatively large, actually the second largest network in that figure, with roughly 138 million parameters. The bigger network is VGG-19, a variant with 19 layers, but as you can see, adding those three extra layers doesn't really change the performance, so we are focusing on the 16-layer version here.

In this video, I will mainly show you what the architecture looks like, and then in the next video I will show you a code implementation. Here is just the overview, taken from the paper. You can find more information about this network architecture in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition" from 2014, so it's already seven years old. But again, it's a simple architecture, easy to implement and easy to toy around with, so not a bad thing to learn about.

An advantage, like I said, is that it's relatively straightforward, which makes the coding very simple, as you will see in the next video. Essentially, it's based on just using 3×3 convolutions; you can see here that all of them are 3×3, and that keeps things very simple. The stride is 1 for the convolutions, and they use "same" convolutions, that is, the padding is chosen such that the output size matches the input size after each convolution. They then use 2×2 max pooling to reduce the size. The max pooling is not shown here; I will show it on the next slide.

One thing to notice, though, is that this network is very large in terms of the number of parameters and thus also very slow. If I go back, you can see it's a pretty large architecture. Just looking at it, if it only has 16 layers, why is it so large? Later, we will see an architecture called ResNet-34 that is much smaller even though it has 34 layers. So what makes this architecture so large? Essentially, it's the number of channels. Let's just take one of these layers: it has 512 input channels and 512 output channels. With a 3×3 convolution, a single kernel then has 3×3×512 weights, and we have 512 such kernels, so this one layer alone has 3×3×512×512 weights, roughly 2.4 million. On top of that there is one bias per output channel, another 512 parameters, but the bias doesn't really matter. This is already a really large number, and we have layers like this multiple times, so it adds up. The same goes for the fully connected layers, where we have, for example, 4096 × 4096 weights plus 4096 bias units on top. These really add up, and that is what makes the parameter count so large. So here is maybe a nicer visualization of that architecture, from this website here.
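To make that parameter count concrete, here is a small sketch in PyTorch (using PyTorch is my assumption here, the actual code video comes later) that builds one of these 512-to-512 convolutional layers and counts its parameters:

```python
import torch

# One 3x3 "same" convolution as used in VGG-16: stride 1, padding 1,
# here with 512 input channels and 512 output channels.
conv = torch.nn.Conv2d(in_channels=512, out_channels=512,
                       kernel_size=3, stride=1, padding=1)

# Each of the 512 kernels has 3*3*512 weights, plus one bias per output channel.
n_weights = 3 * 3 * 512 * 512   # 2,359,296
n_biases = 512
print(n_weights + n_biases)                          # 2359808
print(sum(p.numel() for p in conv.parameters()))     # 2359808, matches

# "Same" padding preserves the spatial size; 2x2 max pooling then halves it.
x = torch.randn(1, 512, 28, 28)
out = conv(x)
print(out.shape)                                     # torch.Size([1, 512, 28, 28])
print(torch.nn.MaxPool2d(kernel_size=2)(out).shape)  # torch.Size([1, 512, 14, 14])
```

So a single one of these 3×3 convolutions already carries about 2.4 million parameters, and VGG-16 stacks many of them on top of the very large fully connected layers.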
And yeah, you can see visually what it looks like: the basic concept behind the convolutions is to squeeze out these features. Just a brief aside: there was a question on Piazza where someone asked about the general trend, or some guidelines, for designing convolutional networks, and this is exactly it. We start with a large height and width, and then we make the height and width smaller but add channels. Each channel, at least that is what we hope will happen, will learn a different type of feature information, because each channel is essentially created by a different kernel.

So we have the input image here, a 224 × 224 image with three color channels. After the first convolution, we have 64 channels; we use the "same" convolution that we talked about in a previous video to maintain the spatial size. Then, here in red, these are the max pooling layers, the 2×2 max pooling, which reduces the size by half. Then we have yet another round of squeezing, and so forth. You can see that we are increasing the number of channels: the width in terms of the number of channels increases, while the height and width of the feature maps decrease. So you can think of it as squeezing out the information here.

At the end, we have the fully connected part, which we can actually also represent as convolutions, which I will show you later; that is why it's drawn like this. It doesn't really matter whether we implement it as fully connected layers or as convolutions, it's equivalent, as I will show you in a later video. Okay, so this is what the architecture looks like on a conceptual level. Now let me take you to the code example in the next video.
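As a small preview of that code, here is a minimal sketch of the convolutional part described above (a sketch only, again assuming PyTorch; the layer configuration follows the 3×3 same-padding pattern with 2×2 max pooling from the paper):

```python
import torch
import torch.nn as nn

# Channel configuration of the VGG-16 convolutional part;
# 'M' marks the 2x2 max pooling layers that halve height and width.
cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
       512, 512, 512, 'M', 512, 512, 512, 'M']

def make_features(cfg, in_channels=3):
    layers = []
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers.append(nn.Conv2d(in_channels, v, kernel_size=3, padding=1))
            layers.append(nn.ReLU(inplace=True))
            in_channels = v
    return nn.Sequential(*layers)

features = make_features(cfg)

# A 224x224 RGB image is squeezed down to a 7x7 feature map with 512 channels,
# which then feeds the fully connected classifier (4096 -> 4096 -> 1000 classes).
x = torch.randn(1, 3, 224, 224)
print(features(x).shape)   # torch.Size([1, 512, 7, 7])
```

Regarding the fully connected part being representable as convolutions: the first 4096-unit layer can, for instance, be written as a 7×7 convolution over the 512 × 7 × 7 feature map, and the remaining ones as 1×1 convolutions, which is why the figure draws them as convolutional blocks.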