Greetings, fellow learners! Before we get into this fantastic world of time series forecasting with Transformers, I have a thought-provoking question for you: where do you use historical data to make decisions in your personal life? I personally use this to track my finances. Specifically, I track how much I've spent in the last, say, three to six months in order to determine what my future spending and my budget should be, and it has served me pretty well. So flipping this over to you: are there any examples like this that you can come up with in your personal life? Please comment down below, I would love to know your thoughts.

This video is going to be divided into three passes: the what, why, and how of time series forecasting with Transformers using the Informer architecture. So let's get to it.

This is the Transformer network. It has an encoder and a decoder, and it was originally designed to solve sequence-to-sequence problems. A sequence is data with a defined ordering, like the words in a sentence or like time series data. Originally it was implemented with language translation in mind, so let's say we were translating from English to French. During the forward pass we pass the English words in parallel to the encoder, and each word is encoded into some vector representation. These vectors are then passed to the decoder along with a start token at its front, and the decoder generates the French translation one word at a time. For a deeper dive into the training and code of the Transformer architecture, you can check out the Transformers from Scratch playlist.

Now instead of language, what if we were to input time series data? Let's modify this architecture to better suit time series data. Specifically, we are going to look at the Informer architecture, but for now just at a very broad level. Once again, let's walk through the forward pass. We pass the time series data in parallel to the encoder. The encoder generates embedding vectors, which may be fewer than the input sequence length. These encoder output vectors are then passed into the decoder. To the input end of the decoder we pass some sample of the input data which we had originally passed into the encoder, and the decoder uses this information, together with the information it got from the encoder, to generate data for all timestamps at the same time. So that's the Informer flow.

Comparing the flows between the original Transformer and the Informer, we can note three main differences. First, in the original Transformer the number of encoder vectors equals the number of words, whereas in the Informer the number of encoder vectors may be less than the number of timestamps. Second, in the original Transformer we pass a single start token to the decoder, whereas in the Informer we pass in multiple vectors, typically some subset of the input that we originally passed into the encoder, now passed to the decoder as well. Third, in the original Transformer the decoder generates the output one timestamp at a time, whereas in the Informer the decoder generates the outputs for all timestamps simultaneously. So why do we have these main differences between the architectures? We're going to look at the details in the next pass.
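To make the second and third differences concrete, here is a minimal shape sketch, not the actual Informer code, contrasting the two decoder inputs. The window sizes seq_len, label_len, and pred_len and the embedding size d_model are illustrative assumptions.

```python
# Minimal shape sketch (illustrative assumptions, not the real Informer implementation).
import torch

seq_len, label_len, pred_len, d_model = 96, 48, 24, 512   # assumed sizes

# Encoder input: a window of historical time series, already embedded.
enc_in = torch.randn(1, seq_len, d_model)                  # (1, 96, 512)

# Original Transformer decoding: start from a single start token and
# generate one timestamp per forward pass, feeding each prediction back in.
dec_in_autoregressive = torch.zeros(1, 1, d_model)         # (1, 1, 512)

# Informer-style decoding: the last label_len encoder inputs act as the
# "start token", followed by zero placeholders for the pred_len targets;
# all pred_len timestamps are then predicted in a single forward pass.
start_token = enc_in[:, -label_len:, :]                    # (1, 48, 512)
placeholders = torch.zeros(1, pred_len, d_model)           # (1, 24, 512)
dec_in_generative = torch.cat([start_token, placeholders], dim=1)

print(dec_in_autoregressive.shape)  # torch.Size([1, 1, 512])
print(dec_in_generative.shape)      # torch.Size([1, 72, 512])
```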
Quiz time! Have you been paying attention? Let's quiz you to find out. Which of the following is a difference between the traditional Transformer and the Informer architecture: A, the number of vectors output from the encoder; B, the number of timestamps predicted by the decoder per inference; C, the start token passed to the decoder; or D, all of the above? Comment your answer down below and let's have a discussion. And if at this point you think I deserve it, please do consider giving this video a like, because it will help me out a lot. That's going to do it for pass one of this video, but keep paying attention, because I will be back to quiz you.

The Transformer architecture has a few issues when dealing with time series data: first, the quadratic computation of self-attention; second, the memory bottleneck in stacking layers for long inputs; and third, the speed plunge in predicting long outputs. We're going to talk about each of these, starting with the quadratic computation of self-attention. Let's explain these words. Attention involves how much focus one data point has on another data point: the larger the focus, the greater their correlation. Self-attention means that we compare all the input data points to all the same input data points and identify these correlations. So for n input time series data points, full self-attention requires on the order of n² multiplication operations, that is, quadratic in the input size. This can be very costly in terms of time for large input sequences. The Informer architecture addresses this using ProbSparse self-attention, which involves identifying a subset of active data points and only performing multiplication operations on those specific data points. All in all, with this type of attention we can reduce the number of multiplication operations from the order of n² to the order of n log n, and fewer multiplication operations means faster processing during the forward pass, which also means faster inference.

Now let's go on to the second issue, the memory bottleneck in stacking layers for long inputs. In the original Transformer architecture the encoder layers are typically stacked, and each encoder layer, as we just discussed, requires on the order of n² multiplication operations, n being the number of data points. The more layers we stack, the more memory is consumed by the resulting matrices. To reduce the memory consumed, the Informer makes use of a process known as distillation. Distillation in chemistry involves extracting some component from a mixture, and in much the same way the Informer extracts a subset of active data points from all the other input data points. In one encoder layer we perform ProbSparse self-attention; with this we have active data points, on which we continue performing operations, and passive data points that remain largely untouched. These passive data points are just occupying space, so we can remove them: we extract just the active data points using distillation and pass this subset to the next encoder layer. This way, when you stack encoder layers, the Informer architecture uses far less memory than the traditional Transformer architecture.

Now let's talk about the third issue, the speed plunge in predicting long outputs. In the original Transformer, the decoder generates the outputs one time step at a time, and for very long outputs with hundreds or thousands of time steps, as we commonly see in a lot of time series data, this can make inference a very long process. To reduce this time, we make use of generative inference, in which the outputs for all the timestamps are generated at the same time.
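Here is a simplified, self-contained sketch of the ProbSparse idea: each query is scored by how peaked its attention distribution is, only the top u ≈ c·ln(n) "active" queries (the active data points described above) get full attention, and the remaining passive queries fall back to a mean of the values. This is an assumed illustration, not the paper's implementation: for clarity it scores every query exactly and still builds the full score matrix, whereas the real Informer estimates the score from a random sample of keys to actually avoid the n² cost.

```python
# Simplified ProbSparse self-attention sketch (assumed shapes and constant c;
# the Informer paper uses a sampled approximation of this sparsity measure).
import math
import torch

def probsparse_attention(q, k, v, c=5):
    # q, k, v: (batch, n, d)
    b, n, d = q.shape
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)        # (b, n, n)

    # Sparsity measurement per query: max score minus mean score.
    # A large gap means the query focuses sharply on a few keys ("active").
    m = scores.max(dim=-1).values - scores.mean(dim=-1)    # (b, n)
    u = min(n, int(c * math.log(n)))                       # keep only ~c*ln(n) queries
    top_idx = m.topk(u, dim=-1).indices                    # (b, u) active queries

    # Lazy default for passive queries: the mean of the values.
    out = v.mean(dim=1, keepdim=True).expand(b, n, d).clone()

    # Full attention only for the active queries.
    batch_idx = torch.arange(b).unsqueeze(-1)              # (b, 1)
    active_scores = scores[batch_idx, top_idx]             # (b, u, n)
    out[batch_idx, top_idx] = active_scores.softmax(dim=-1) @ v
    return out

x = torch.randn(2, 96, 64)
print(probsparse_attention(x, x, x).shape)  # torch.Size([2, 96, 64])
```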
Quiz time! It's that time of the video again: have you been paying attention? Let's quiz you to find out. Which of the following is a key advantage of ProbSparse self-attention over full self-attention: A, it increases the number of parameters in the model; B, it reduces the computational complexity by focusing on a subset of relevant queries; C, it ensures every query attends to every key without exception; or D, it uses fixed positional encoding for all queries and keys? Comment your answer down below and let's have a discussion. That's going to do it for quiz two and pass two of this explanation, but do keep paying attention, because I will be back to quiz you.

So this is the Informer architecture. Let's walk through it and highlight some key concepts. We pass the time series data X_en in parallel to the encoder, which performs multi-head ProbSparse self-attention. Self-attention means that every data point is compared to every other data point to determine correlations, ProbSparse means that we determine a subset of active data points, and multi-head means that the self-attention process happens multiple times in parallel to identify more complex patterns in the data. The active data points are then selected through distillation; it's not labelled in this image, but it happens in these blue regions after each encoder layer. Upon distillation we reduce the space taken up by the matrices, and it becomes easier to stack the encoders, hence we see this trapezium shape instead of the normal block format we saw in the original Transformer architecture. We get some concatenated feature map that may be smaller than the input size, and this is then passed into the decoder.

To the decoder we pass some part of the input as X_token, which could be the last seven data points in this case, and then we pass a bunch of zero-padded data points which will be populated by the decoder. This information is passed into a masked multi-head ProbSparse self-attention block. That's an absolute mouthful, but most of these terms should be familiar from what we discussed in the encoder part; this block is masked to ensure that the decoder cannot cheat and look ahead when performing self-attention. We then pass this into the multi-head attention block, that is, we perform attention between the inputs from the decoder side and the results from the encoder. After passing through a fully connected layer, the decoder produces the outputs for the different timestamps all at once, and this is done through generative inference to speed up inference.

The last question we have here is: how do we actually train this model? During the forward pass, the model predicts multiple time steps simultaneously. We then take the mean squared error over these multiple time steps to generate a loss, and this loss is back-propagated through the decoder and then through the encoder. The parameters of the model update, and hence the model can learn.
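As a rough sketch of that training step, under the assumption of a stand-in model (a single linear layer rather than the full Informer) and the usual seq_len / label_len / pred_len naming, the snippet below builds the decoder input, predicts all pred_len timestamps at once, and backpropagates the MSE loss through the whole model.

```python
# Minimal training-step sketch: `model` is a stand-in for the Informer,
# and the sizes below are illustrative assumptions.
import torch
import torch.nn as nn

seq_len, label_len, pred_len, n_features = 96, 48, 24, 7
model = nn.Linear(n_features, n_features)            # stand-in for the Informer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

x_enc = torch.randn(32, seq_len, n_features)          # historical window
y_true = torch.randn(32, pred_len, n_features)        # future targets

# Decoder input: last label_len known points + zero placeholders to fill in.
x_dec = torch.cat([x_enc[:, -label_len:, :],
                   torch.zeros(32, pred_len, n_features)], dim=1)

y_pred = model(x_dec)[:, -pred_len:, :]               # all pred_len steps at once
loss = criterion(y_pred, y_true)                      # MSE over every predicted timestamp
optimizer.zero_grad()
loss.backward()                                       # gradients flow back through decoder, then encoder
optimizer.step()
print(loss.item())
```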
Quiz time! Okay, this is going to be a fun one. Let's say the input to the encoder is 10 vectors. What is the possible number of vectors that the encoder outputs: A, 10 for the original Transformer and 12 for the Informer; B, 6 for the original Transformer and 10 for the Informer; C, 6 for the original Transformer and 6 for the Informer; or D, 10 for the original Transformer and 6 for the Informer? Comment your answer down below and let's have a discussion. And at this point, if you think I deserve it, please do consider giving this video a like, because it's going to help me out a lot. That'll do it for quiz time and pass three of this explanation.

Before we go, let's generate a summary. The Transformer network was traditionally used in sequence-to-sequence modeling; however, for time series forecasting the architecture can run into some issues: the quadratic computation of self-attention, the memory bottleneck in stacking layers for long inputs, and the speed plunge in predicting long outputs. To deal with these three issues we use three techniques: ProbSparse attention, distillation, and generative inference. ProbSparse attention reduces the number of multiplication operations by identifying active data points, distillation physically isolates these active data points to reduce memory consumption, and generative inference allows the decoder to make predictions in parallel across multiple timestamps for faster inference.

And that's all we have for today. While we've only scratched the surface of the Informer architecture here, in the next playlist of videos we will actually code the Informer from scratch and dive into the architecture further. And if you want to brush up your knowledge of Transformers, and coding them from scratch too, you can check out this playlist of videos right here. Thank you all so much for watching, and if you liked this video and you think I deserve it, please do consider giving it a like. I will see you in the next one, bye-bye!