Transcript for:
Vision Transformers Lecture

hey everyone, we'll wait a minute or so for any doubt clearance from the earlier session, then we'll get started with today's session. All right, looks like there are no doubts or queries, so let us continue with today's project: we are going to be dealing with something known as Vision Transformers. I'll quickly share my screen — kindly drop an 'S' in the chat box if my screen is visible. Kindly contact the support team for any recordings of the sessions; for anything related to the subject, I'm here to help you out. All right, thank you for the confirmation, so let us get started with today's session.

Before Vision Transformers, let us talk a bit about Transformers in general. You would definitely have heard about Transformers through GPT, or at least ChatGPT — you are familiar with that, right? GPT stands for Generative Pre-trained Transformer: a Transformer architecture that is pre-trained and generates textual data. That is basically how ChatGPT came into existence; there is a Transformer architecture working underneath it. Transformers are a really great architecture that was traditionally created for natural language processing — they were created, or architected, for NLP tasks — but a few Transformer variants also work on vision, i.e. image applications, and that is what we are going to be dealing with. The architecture has already created a tremendous transformation when it comes to text-based datasets and all kinds of text processing; I don't think I need to explain much there — we all know how great ChatGPT and the other Transformer models coming into the picture are. So for years the Transformer architecture has been utilized for language, and here we have Vision Transformers, which apply it to vision applications; that's why we address them as Vision Transformers: the Transformer architecture being utilized for vision purposes.

Below I have also added the name of the dataset we are using. The CIFAR dataset is a fairly basic dataset that is generally used for benchmarking the performance of an architecture, and we are going to use it for benchmarking our architecture as well; more about the CIFAR dataset and what it is fundamentally used for we will see a bit later. Also, in the previous session we had seen TensorFlow Datasets (TFDS), right? TFDS also has the CIFAR dataset, so we can import it directly through TensorFlow Datasets with the same load function: instead of going with IMDB you can go with CIFAR. CIFAR-10 versus CIFAR-100 we'll discuss a bit later.
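Just to make that concrete, here is a minimal sketch of what that load call would look like — assuming the same tfds.load flow we used for IMDB last session; it is an illustration, not the exact code we will write later:

```python
import tensorflow_datasets as tfds

# as_supervised=True returns (image, label) pairs instead of a feature dict
(ds_train, ds_test), ds_info = tfds.load(
    "cifar10",
    split=["train", "test"],
    as_supervised=True,
    with_info=True,
)

print(ds_info.features["label"].names)        # the 10 class names
print(ds_info.splits["train"].num_examples)   # 50000 training images
```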
Now let us talk about the architectural approach of Vision Transformers. The introduction I think I've already given you: a Vision Transformer is nothing but the Transformer architecture applied to image recognition. Initially the Transformer architecture was developed for natural language processing tasks and has achieved significant progress there.

All right, what are the key components present here? At the top we have the classification head, and inside the encoder we have the multi-head attention — that is the fundamental architecture of the Vision Transformer itself. Primarily there are two fundamental components when it comes to the Transformer architecture. One is the attention heads, which give attention to multiple aspects at once; I'll give you a few examples of how we human beings interpret things with this kind of multi-head attention. In the architecture we also have the classification head, which is the last part and is responsible for the actual classification. So attention heads are one thing, and the other thing is MLPs. MLP means multi-layer perceptron. Traditionally the architecture looks very simple — they are straightforward feed-forward neural networks. We had seen the ResNet-50 architecture, right, where we discussed the convolutions and all of that; there we had fully connected layers, and those fully connected layers are nothing but a multi-layer perceptron in terms of the architectural description. The name comes from what it does: there are multiple layers, each with different perceptions, and that's why it is called a multi-layer perceptron. More on that I'll discuss a bit later.

First let us walk through the key components of ViT, or Vision Transformers. Primarily we have the Transformer encoder, which is going to be our actual architecture. Before that we will have patch embeddings and positional embeddings, then in the middle of the flow we will have the Transformer encoder itself, and finally it goes on to the classification head. Let us discuss these one by one.

Patch embeddings: the functionality of patch embedding is to divide an image into non-overlapping patches, typically of a small pixel size. Just imagine we have a 100 by 100 image — not a standard pixel value, but just assume it. I can create patches of 10 by 10, so by splitting up the entire image, 100 patches are created. That is what patch embedding works with: we capture the local information about these patches, which serve as the fundamental building blocks.

Before going further into patch embeddings, I want some input from you. If you see a longer phrase — for example "Transformer encoder" — did you observe it letter by letter before concluding that it says "Transformer", or could you identify the word within milliseconds, at one glance? Drop your observations in the chat box. At one glance? One glance, right?
So how does our brain work here? Simply by giving attention to multiple things in a very small instant. When you see the word "Transformer", you have t, r, a, n, s, f, o, r, m, e, r as individual letters. If you were reading it the traditional way, you would need to recognize the individual characters one at a time — each character or letter has its own sound, its own pronunciation — and then combine them. Notice that small changes matter: drop the leading letters from "trans" and you end up with something like "former"; take "encoder" and swap the "en" for "de" and it becomes "decoder". So even the spatial information — which letter sits where — is available; that is essentially what positional embedding captures, and we'll break it down fundamentally a bit later.

Our brain interprets multiple such patches of information: here t, r, a, n, s are all individual patches, and at a single glance you understood all the letters present, and also which letter comes first, which comes second, which comes third. At a single glance our human brain processes that much information, and a very similar approach is what the Transformer architecture is doing. Sometimes, if I give you words with similar spellings, you may need a bit of extra attention to work out the actual word. Generally it's the first and last letters we really look at; if I flip one or two letters in between, you will hardly notice, because our brain is trained so well that it can read whole words at a single glance.

So there are multiple perceptions coming into the picture. Here I'm giving the example letter by letter, but Transformers were initially created for natural language processing, where you have an entire statement. For example: "the CIFAR dataset is used for our example project". If one little word gets added — "the CIFAR dataset is not used for our project" — that single word "not" changes the entire meaning. Sometimes you read a statement at one glance and understand it with one perception, and if you read it again you get a different perception. We human beings may hold multiple perceptions, but we strongly believe in one of them — the dominant one. Multi-layer perceptrons are something like that: there are multiple layers of perceptrons, but ultimately the information is pooled and one strong perception is taken into consideration. At a glance, whichever perception comes up most strongly is the one that gets interpreted; if you read again, maybe a different one comes up — the chances are lower, but it's possible. Those are multi-layer perceptrons. We'll break this down a little further when we are programming it: we'll define a separate class — really just a small helper function — for the multi-layer perceptron, and it will just be dense layers with dropout on them, nothing else.
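To give you a preview, a minimal sketch of that small helper — just dense layers with dropout, as described; the GELU activation here is my assumption, following the common ViT reference implementation rather than anything fixed by this session:

```python
import tensorflow as tf
from tensorflow.keras import layers

def mlp(x, hidden_units, dropout_rate):
    """Multi-layer perceptron block: just dense layers followed by dropout."""
    for units in hidden_units:
        x = layers.Dense(units, activation=tf.nn.gelu)(x)
        x = layers.Dropout(dropout_rate)(x)
    return x
```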
We'll see that in code a little later. Now, coming to the actual fundamental steps — what happens in each of these components (the perceptions and the classification fall under the later parts):

Patch embeddings first. We divide the image up into patches. If you're working on natural language processing, you generally divide the text into tokens, where each word is considered an individual token; for an image we break it down into patches instead — lots of small patches dividing up the image. We'll get a very good visualization of how these patches look; I'm not sure whether we can get to it in this particular session or the next, but for sure we will be looking at the patches. Then there is the embedding happening for these patches: we take each patch (its pixel data), flatten it, and connect the patches back to back, so a sequential pattern is created. We are flattening the patches into vectors, so some spatial information about the patches ends up in a one-dimensional representation; much more of the spatial information is handled by positional embedding, but here everything gets converted into a one-dimensional representation.

Next comes positional embedding. Vision Transformers lack feature-extraction machinery like the convolutions we use in convolutional networks, so we need to identify spatial relationships and explicitly retain the spatial information; that's why positional embedding comes into the picture. It embeds spatial information about the position of each individual patch. The patches have been converted into one-dimensional vectors, and if you feed those vectors directly to the Transformer architecture it is not going to be great at identifying spatial information — which is exactly why positional embedding is needed. Both of these — the patch embeddings and the positional embeddings — are attached together and then passed on to the Transformer encoder; when we are coding I'll point out where exactly we add them.

Coming to the Transformer encoder: the Transformer encoder is obviously the core building block of our Vision Transformer, and it is composed of multiple stacked blocks. Fundamentally there are two sub-layers in a Transformer encoder block: one is the self-attention mechanism (sometimes called multi-head attention), and the other is a feed-forward neural network, which is our multi-layer perceptron. The self-attention mechanism allows the model to capture global dependencies and relationships between the patches. At a single glance, when you see the word "Transformer", you are seeing the characters globally — t, r, a, n, s and so on — and the relationships between them: how you pronounce "Transformer" and how you interpret it. That is what the self-attention mechanism does.
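A tiny sketch of that "look at everything at once" idea — self-attention applied to a dummy sequence of patch embeddings, so every patch mixes in information from every other patch; the sizes (144 patches, 64 dimensions, 4 heads) are the values we will set later, used here purely for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers

# dummy batch: 1 image, 144 patch embeddings, 64 dimensions each
patch_embeddings = tf.random.uniform((1, 144, 64))

attention = layers.MultiHeadAttention(num_heads=4, key_dim=64)
# query = value = the same sequence -> self-attention over the patches
output = attention(patch_embeddings, patch_embeddings)

print(output.shape)   # (1, 144, 64): same shape, but each position now attends to all patches
```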
Sometimes pronunciations differ even for the same spelling. Take the simplest example, t-h-e: we say "the", and sometimes "thee". Generally "thee" is used when you are pointing to or highlighting a particular object, and "the" when you're referring to something more general; most of the time they're used almost interchangeably, and it really comes down to your perception — sometimes I say "the", sometimes "thee", while still discussing the same topic. (By the way, we are at the Transformer encoder at the moment — I said positional embedding earlier, my bad.)

So, two functionalities: one is the self-attention mechanism and the other is the feed-forward neural network, and the feed-forward neural network is nothing but the multi-layer perceptron. Self-attention grabs the relationships between the global dependencies — here the "global" information is the different patches, so we identify the relationships between patches — and then that information is viewed from multiple perceptions, which is the job of the feed-forward network. Multi-layer perceptrons are multiple layers of interconnected artificial neurons, each taking a glance at the data in a slightly different way. Another wording example: "tear" and "tear", t-e-a-r. Both have the same spelling, but based on the words around it — the context — the pronunciation changes: in some contexts it is "tear" as in teardrop, in others "tear" as in ripping. (And I think some of you are not focusing — we are still on the same topic, which is why we are on the same slide.)

Coming back to the Transformer encoder: the multi-layer perceptrons focus on these different perspectives. Very similarly, here the multi-layer perceptrons look at the individual patches to identify objects and the characteristics of those objects. If it's an animal, we'll definitely be observing the eyes, the ears, and so on; if you're classifying an image of a cat, its eyes carry much higher importance, so the perception that focuses on the eyes contributes more towards the classification. The final call, though, depends on the classification head.

OK, I'll break it down into simpler terms — although if you were not following, you probably should have asked much earlier rather than waiting until now. With words I gave a simple example: at a single glance you recognize the word, and it breaks down like this into individual characters; these can be considered as the patches on which the embedding is going to happen.
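For an actual image, that same step — cutting the image into non-overlapping patches and flattening each one into a vector — looks roughly like this. A minimal sketch with dummy data; the 72×72 image and 6×6 patch sizes are the values we will define later:

```python
import tensorflow as tf

images = tf.random.uniform((1, 72, 72, 3))   # dummy batch of one 72x72 RGB image
patch_size = 6

# extract non-overlapping 6x6 patches
patches = tf.image.extract_patches(
    images=images,
    sizes=[1, patch_size, patch_size, 1],
    strides=[1, patch_size, patch_size, 1],
    rates=[1, 1, 1, 1],
    padding="VALID",
)
print(patches.shape)   # (1, 12, 12, 108): a 12x12 grid, each patch flattened to 6*6*3 = 108 values

# reshape to (batch, num_patches, patch_dim) -> the one-dimensional sequence of patch vectors
patches = tf.reshape(patches, (1, -1, patch_size * patch_size * 3))
print(patches.shape)   # (1, 144, 108)
```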
What happens is that each patch is converted into a vector quantity — a mathematical representation, so that mathematical operations can be applied to it; for understanding, we'll keep the characters as they are here. It comes out as a one-dimensional sequence, and then the encoder starts coming into the picture. That one-dimensional sequence is what patch embedding produces, but we also need the global dependencies — in simpler terms, the positional information: that this T belongs in the first position, where this R sits, and so on. Each patch might be broken into vector values like 3.12, 2.28 and so on, signifying the T, with different values for the R; but how do we also convey which one comes where? That is taken care of by the positional embedding, which provides the global dependencies — here we are treating the individual characters as the global elements and encoding the position of, and relationship between, them. Similarly, for an image we split it into patches; each patch gets its patch embedding and is ultimately converted into a one-dimensional vector, and the information about how they are connected — after this patch, which row of pixels comes next — is carried by the positional embedding.

Then comes the Transformer encoder. The Transformer encoder takes those one-dimensional vectors with the positional embedding tagged along, and with that it gives attention to multiple things at a single glance; and while it's doing that there are also different perceptrons at work, because the Transformer encoder has two parts. When I say "single glance", that is the responsibility of the multi-attention heads, sometimes called self-attention heads — that is what glances at the entire data to extract useful information. The result is then passed on to something known as the multi-layer perceptron, which is nothing but a feed-forward neural network with dense layers, where every single neuron is connected to every neuron in the next layer. What kind of thing the multi-layer perceptron captures, I'll again illustrate with the word "tear": in "tears rolled out of his eyes" we read it as the teardrop kind of tear, whereas "he was about to tear the book" is the ripping kind, not the teardrop. So there are multiple perceptions coming into the picture, and the same sort of thing happens inside the multi-layer perceptrons. Finally, the classification head is the sole decision-making and reasoning part — it is similar to the fully connected layers in a convolutional neural network — and it takes the decision of which class to assign. That is the overall architecture. Is it clear now? Yes or no — any doubts?
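To make the encoder part concrete, here is a minimal sketch of one encoder block in Keras. The layer-normalization placement and the skip (residual) connections are my assumptions, following the usual ViT reference layout rather than anything stated so far in the session:

```python
import tensorflow as tf
from tensorflow.keras import layers

def transformer_block(encoded_patches, num_heads=4, projection_dim=64, dropout=0.1):
    """One encoder block: self-attention plus an MLP, each with a skip connection."""
    # sub-layer 1: multi-head self-attention over the patch sequence
    x1 = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
    attention_output = layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=projection_dim, dropout=dropout
    )(x1, x1)
    x2 = layers.Add()([attention_output, encoded_patches])   # skip connection

    # sub-layer 2: feed-forward network (the multi-layer perceptron)
    x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
    x3 = layers.Dense(projection_dim * 2, activation=tf.nn.gelu)(x3)
    x3 = layers.Dropout(dropout)(x3)
    x3 = layers.Dense(projection_dim, activation=tf.nn.gelu)(x3)
    x3 = layers.Dropout(dropout)(x3)
    return layers.Add()([x3, x2])                             # skip connection
```

Stacking several of these blocks, then flattening and passing the result to a dense classification head, gives the overall flow just described.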
I see only a couple of S's — what about the others? OK, I'm not getting responses from those who mentioned they are not understanding anything. All right, if you have any doubts, feel free to drop them in the chat box and we'll go further. So basically, that is what is happening in the architecture.

Now coming to the dataset: here we have the CIFAR dataset, and it's one of the most commonly used datasets for benchmarking how a model behaves. What makes CIFAR challenging is mainly its size: each image is just 32 by 32 pixels. These days we have 4K images and the like, so 32 by 32 is very small, which makes the images very blurry. Classifying such blurry data is genuinely one of the tougher situations a model has to tackle, and that's why it is used for benchmark scores. CIFAR-10 and CIFAR-100, as the names suggest, come with either 10 classes or 100 classes. For CIFAR-10 I have listed out the classes available; it is not limited to one type of object — it's a small cluster of things, a bit of animals, a bit of automobile-type objects. I didn't list it here, but CIFAR-100 has 100 classes organised into 20 superclasses — animals, objects, vehicles and so on — and under each superclass there are 5 subclasses, making 100 classes in total.

As for the amount of data, both have 60,000 images, divided into 50,000 images for training and 10,000 for testing. You can mix that split up, but I don't recommend it, mainly because the 50,000 and the 10,000 are balanced: with 10 classes, in the 50,000 training images each class has exactly 5,000 images, and in the 10,000 test images — you can guess it — 1,000 per class. That's how well balanced it is. For CIFAR-100 it works out to 500 images per class for training and 100 per class for testing or validation. ("Kindly elaborate your query" — "not getting anything" is a bit too vague a question to answer.) So there are superclasses like mammals, flowers, insects, household items, and as I was mentioning, the dataset is very well balanced: 6,000 images per class in CIFAR-10, split into 5,000 and 1,000 (and 600 per class in CIFAR-100, split 500 and 100).

All right, these are going to be the fundamental steps we'll follow. I don't think time will allow us to complete all of it today, and we have another session dedicated to this, but we'll probably at least get the data loaded. This is the entire architecture, all the steps. The first layer is the input layer; at present we have the CIFAR data at 32 by 32, but we will not work at that resolution directly — we will enhance it, essentially duplicating pixels. We have the OpenCV package, and with OpenCV we are going to resize the images — if I'm not wrong, to something like 72 by 72.
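A minimal sketch of that resize step with OpenCV; the target size follows what was just said, while the interpolation mode is my assumption — any of OpenCV's interpolation options would work:

```python
import cv2
import numpy as np

# dummy CIFAR-sized image: 32x32 RGB
img_32 = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)

# blow it up to 72x72; this only interpolates existing pixels,
# so the image stays blurry -- just bigger
img_72 = cv2.resize(img_32, (72, 72), interpolation=cv2.INTER_CUBIC)
print(img_72.shape)   # (72, 72, 3)
```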
Then data augmentation comes into the picture. Data augmentation is a step where you change the image in different ways from its raw format. A simple question: we can all recognize images of cats, right? Now what if I take that image upside down — will you still be able to recognize it as an image of a cat?

("We are not able to view the videos; we were told we would be taught from the basics, and this feels more like a student giving a presentation.") OK, I'll take that as feedback, but I don't think it is the same case for everyone. That is the primary reason we have kept these as live sessions: you could always have asked your questions in the previous three sessions, and I think we are already 70 to 80% of the way through the live classes — that is also the first thing I said at the beginning of the sessions. The fundamentals are very much required, and that's how the course has been structured. You can give your feedback, no issues, but effort is required from your side as well; this is not classroom-style training, it's more industrial.

Back to the question: we can easily recognize the first one as an image of a cat — can we recognize this one as an image of a cat? Yes or no? No — well, I'm not a great artist, so that's probably why. If it were a proper cat image, then yes, because we human beings have that interpretation: we've all seen a cat upside down at least once. Likewise, if I start writing something upside down, you can probably still read it at a single glance, or maybe it takes you a few extra seconds to interpret. This is exactly what makes things challenging for machine learning and deep learning models, or whatever algorithm you are considering. Data augmentation is the process of converting your original data into augmented data in such a way that learning becomes more difficult, so that the model becomes more robust. The augmentation is applied only to the training data, not to the test dataset — it's like making the model learn the hard way so that prediction becomes much easier.

Next we have the patches — I think we already discussed how an image gets broken up into patches — then the Transformer block, and finally the flattening and, for the classification head, a multi-layer perceptron, which at the architectural level is the same as a feed-forward network of fully connected layers; these are just the names we use. And finally the output layer, which is a dense layer with the number of classes we are classifying. So that is Vision Transformers at a high level.
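Of those steps, the augmentation pipeline is easy to sketch up front — a minimal example with Keras preprocessing layers; the particular transforms and factors here are my assumptions, not something fixed by the session:

```python
import tensorflow as tf
from tensorflow.keras import layers

data_augmentation = tf.keras.Sequential(
    [
        layers.Normalization(),      # scale pixel values (call .adapt() on the training images first)
        layers.Resizing(72, 72),     # 32x32 -> 72x72
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(factor=0.02),
        layers.RandomZoom(height_factor=0.2, width_factor=0.2),
    ],
    name="data_augmentation",
)
```

Because these are layers, they can sit at the front of the model and the random transforms are only active during training, which matches the idea of augmenting the training data but not the test data.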
So let us get started with the coding; any doubts in the meantime, feel free to drop them in the chat box. I'll just create a new folder on my desktop and launch a Jupyter Notebook in it. ("I shared the screen...") No worries, we will be covering everything before the next session — that's why we have dedicated two sessions to this project, plus one more session for working through an entire project yourselves. We are in the fourth session, and we'll have one more.

Let us quickly create an interactive notebook file. You can do any sort of project you like: for the submission you will get instructions, but for learning purposes, play around — do a lot of projects, work on different datasets and different architectures, mix and match, change the hyperparameters a bit; those are things we'll also be discussing in this project. All right, is my screen visible? Kindly drop an 'S' in the chat box just for confirmation. I've created an empty notebook. Great — thank you for the confirmation.

We are going to be using quite a few packages here. Initially we'll just pull in tensorflow as tf, and also NumPy for data handling — at the very end we definitely need to convert things to NumPy arrays. Anything else we can import on the go; at the moment I don't recall any other packages, so it's mostly TensorFlow and things under TensorFlow itself. We will also be using something known as weighted Adam (AdamW). There will be a lot of warnings popping up, so I'll import the warnings package — it's part of Python's standard library, nothing to install — and set all warnings to be ignored. Once I execute that, you can see no more warnings pop up, and they don't have any functional impact on us anyway.

Now let us go ahead and pull in the dataset. We could import TensorFlow Datasets and load it from there, or, directly, tensorflow.keras.datasets has the CIFAR data, so we'll load it from there. You can pull up the docstring or documentation to see exactly what is happening: it loads the CIFAR dataset, these are the classes, and it returns the data in a particular format, so I'll use the same names for our variables. The data is imported, but it's just pixel values; we won't really see anything until we apply a visualization technique, and for that we'd need to import a few more things. Let us print the shape to see what the data looks like: we have 50,000 rows, each image 32 by 32, and a 3 at the end — those are the RGB channels (if it were CMYK it would be four channels, but here it's three: red, green, blue). You can check the shape of the test set the same way. Next we are going to define a few hyperparameters — maybe a bit later. So: 50,000 images, where each image has 32 by 32 pixels, meaning 32 rows and 32 columns, and each pixel is represented by a hexadecimal value of RGB — red, green, blue. If you're not familiar with that, I can point you to where to learn more.
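A minimal sketch of that setup. One caveat: AdamW lives in tensorflow_addons (tfa.optimizers.AdamW) on older TensorFlow installs and in keras.optimizers.AdamW on recent ones — use whichever your environment has; everything else below is standard:

```python
import warnings
warnings.filterwarnings("ignore")   # purely cosmetic: hide the warning spam

import numpy as np
import tensorflow as tf

# CIFAR-10 ships with Keras itself
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

print(x_train.shape)   # (50000, 32, 32, 3): 50,000 images, 32x32 pixels, 3 RGB channels
print(x_test.shape)    # (10000, 32, 32, 3)
```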
If you just Google "hex color", any of those websites — a color picker, something like that — will show you the hexadecimal values. Here you can see the RGB coloring: based on the values, the color changes. Pure white is 255, 255, 255 and pure black is 0, 0, 0, and everything in between gives the other colors. For example, red is 255, 0, 0 — red has a weight of 255 and the other two channels are zero. Quite interesting; I suggest you go and check it out.

Next we have something known as hyperparameters. Hyperparameters are the parameters that tell the model how exactly it should learn, or describe the architecture itself — how many layers we have, and so on. Anything in the way of parameters that you can change is addressed as a hyperparameter. Now, I think everyone can agree that digital electronics works with only zeros and ones, and in other words any machine learning or deep learning algorithm is internally just numbers. We had seen the example where linear regression is a straight line, y = mx + c, and the m and c values are what get evaluated. ("Could you reframe your question — what do you mean by 'by default'?" OK, you can do that; I'll continue.) As I mentioned, anything you can manipulate is a hyperparameter, and since deep learning is internally all numbers, the step size with which those numbers change is one such hyperparameter: that is called the learning rate. So we define that. (And yes, it is the dataset itself — just Google "CIFAR dataset" and you can see it.)

Next we have something known as weight decay. In the simplest algorithm, y = mx + c; more generally, the final evaluation that leads to a classification is the weighted values applied to the inputs plus a bias — something like w·x + b, where b is the bias and w the weights. These weights can grow towards a kind of saturation where the model builds an understanding only of the training dataset. To put it simply: if I have data points scattered like this, a straight-line fit will come out like this, and there will be points it doesn't pass through — so there are some errors on the training data. But if instead I draw a curvy line that passes exactly through every point, the model ends up understanding only that particular dataset, and that is not going to generalize well. That's why we have weight decay. One more analogy I can give you: in your childhood we all had silly assumptions — things happen in this particular manner, things happen in that particular manner — and if you strongly believe those are universal truths, you are never going to learn the actual thing behind them; you need to unlearn things before learning new things. That is what weight decay is: fundamentally it is the weight values that get decayed, but the intuition is that you are unlearning a little before learning something new.
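As an aside, here is a minimal sketch of where the learning rate and weight decay actually plug in — the AdamW ("weighted Adam") optimizer. The specific numbers below are placeholder starting values, my assumption rather than something stated in the session, and as noted earlier AdamW is keras.optimizers.AdamW on recent TensorFlow or tfa.optimizers.AdamW from tensorflow_addons on older installs:

```python
import tensorflow as tf

learning_rate = 0.001    # placeholder starting point, to be tuned
weight_decay = 0.0001    # placeholder starting point, to be tuned

optimizer = tf.keras.optimizers.AdamW(
    learning_rate=learning_rate,
    weight_decay=weight_decay,
)
```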
That will be weight decay. Here we are also going to work in batches, so we'll define a batch size of 256. Generally we go with powers of two, and there are computational reasons behind this. If you have read the fundamentals of digital electronics, you'll have seen values expressed in binary, in powers of two — we have 32-bit operating systems and so on. As I was saying, digital systems understand only binary values, zeros and ones, which is why powers of two tend to be the more efficient choice. A random example: we have either 32-bit or 64-bit operating systems; consider 32-bit, so you address values with 32 bits. If you don't stick to powers of two — say you use 25-bit addressing — you still end up occupying the 32-bit width, and the extra bits are useless, just padded with zeros. That is the reasoning; the batch size is still up to you — here we have a reasonably optimal size, but you can change the batch size, play around, and see what happens.

The image size, as I mentioned earlier, we are going to reshape to 72. The patch size should be something that divides the image size — make sure of that — and here we are taking 6. Maybe make it 7 — no, not 7 — maybe 8, or 12, or 3: values like that. The number of patches you can work out yourself just by dividing: the number of patches is the image size divided by the patch size, squared — squared because the images are square in shape. Feel free to optimize the design and go try some other data as well.

There are a few more hyperparameters. One is the dimensionality of the projected space used in the Transformer — just consider it the input shape for our Transformer design — which we'll set to 64, again a power of two. Internally we'll have the number of attention heads as 4. The Transformer units are going to be twice the projection dimension, and since the input comes in as a single one-dimensional vector, that sets our Transformer units (I'll explain that in a second). Then for the Transformer layers we'll go with 8. And finally we have the MLP head units, which will be 2048 by 1024 — that is 2^11 by 2^10.

So, to recap the patch arithmetic: you have a square image of 72 pixels on a side; if you divide it into patches of 6 pixels each, you end up with 12 along one edge, and since the patches tile the image in both directions you also get 12 the other way, so 12 × 12 gives a total of 144 patches — that is what gets evaluated here.
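Gathering the hyperparameters just discussed into one place, as a sketch. The values mirror what was mentioned in the session; anything not stated explicitly (the exact learning rate and weight decay, and the second entry of the Transformer units, which follows the common reference layout) is an assumption to be tuned:

```python
learning_rate = 0.001          # placeholder starting point
weight_decay = 0.0001          # placeholder starting point
batch_size = 256               # power of two, as discussed

image_size = 72                # CIFAR images resized 32 -> 72
patch_size = 6                 # 6x6 pixel patches; must divide image_size
num_patches = (image_size // patch_size) ** 2   # (72 / 6)^2 = 144 patches

projection_dim = 64            # dimensionality of the projected patch embeddings
num_heads = 4                  # attention heads per encoder block
transformer_units = [projection_dim * 2, projection_dim]   # dense sizes inside each block
transformer_layers = 8         # number of stacked encoder blocks
mlp_head_units = [2048, 1024]  # dense sizes of the final classification head
```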
The projection dimension is simply the size the patches get projected to before being applied to the Transformer: we'll do the data augmentation, and from there we'll obtain the patch embeddings and positional embeddings, and that projection becomes the number of dimensions — the number of features, the same way any CSV file has its data in terms of different feature columns. Here we are telling it to use 64 dimensions. For the number of heads, we are giving four attention heads, so the model can look at the patch sequence from four perspectives at once. The Transformer units represent the sizes of the dense layers inside the encoder — just like when we define a dense layer and choose how many neurons it has, here we choose the Transformer units; they are essentially fully connected layers that project the patches within the Transformer. The number of layers we have defined as eight. And finally we have the MLP head units, which are nothing but the sizes of the dense layers for the final classifier.

Now, these particular values are already fairly well optimized. Normally you would start off with more or less random values — it won't be precise or optimal — and maybe you end up at around 50% accuracy; you increase the learning rate and maybe your accuracy goes down; you decrease the learning rate — say, add one more zero — and you might get better learning, but it will be slower: higher computational cost and a longer training time, possibly for better results. That is what working with hyperparameters is about.

Data augmentation we'll probably take up in the next session — we are actually at the end of our time, so we shall conclude the session here. Any further doubts we'll take up at the beginning of the next session. That's all for this session, thank you everyone. (I did explain what the learning rate is — kindly go through the recordings, and if you have finished the prerequisite content you would already be clear on this program and this project.) All right, thank you everyone.