So, it is 8 a.m. in California and 5 p.m. in Amsterdam, which is where our speaker is. Hello everyone, and welcome to the tinyML Talks series; today we have a Qualcomm day on tinyML. My name is Evgeni Gousev, I am from Qualcomm AI Research in the Bay Area in California, and our guest speaker today is Dr. Marios Fournarakis, from the Qualcomm AI Research center in Amsterdam. Before we start, it is my pleasure to acknowledge the sponsors and strategic partners of this talk series: Arm, Deeplite, Edge Impulse, Emza, GreenWaves Technologies, Latent AI, HOTG, Maxim Integrated (which is now part of Analog Devices), Qeexo, Qualcomm, Reality AI, SensiML, SynSense, and Syntiant. If your company is interested in supporting this series, please contact olga@tinyml.org for more information. I have one more announcement: we are going to offer tinyML Asia to the community. This will be our second tinyML Asia event; the event last year in November was quite successful, with almost 2,000 people joining for four days of very interesting talks and discussions, and this year will follow a more or less similar format. The event will take place November 2nd to 5th, live, starting at 8:00 a.m. China Standard Time. You can see the link there, registration is already open, and thanks to the sponsors, which you see below, registration is waived for participants. So again, it's going to be an exciting week. Next, as many of you know, we started our first project, the tinyML vision challenge, which we did jointly with hackster.io. The challenge was completed about a week ago; there were almost 500 participants and 52 submissions, the committee of judges met this week to evaluate those submissions, and the winners will be announced next week. It's quite exciting to see this community working together to solve vision challenge problems, and you see all the sponsors there. Our next talk is going to be, as usual, on Tuesday, October 5.
The speaker is Professor Alessio Lomuscio from Imperial College London; he is going to be talking about verification of ML-based AI systems and applications to tinyML. And if you are interested in presenting, you can send an email to talks@tinyml.org. At this point it is my pleasure to introduce Marios Fournarakis. He is a deep learning researcher at Qualcomm AI Research in the Netherlands, where he works on power-efficient training and inference of neural networks, specifically focusing on quantization and compute-in-memory, which is gaining more and more momentum and popularity, and he is also interested in low-power AI applications in general. He completed his graduate work in machine learning at University College London, and he holds a master's in engineering degree from the University of Cambridge. So Marios, the stage is yours, and we are looking forward to your presentation.

Thank you, Evgeni. Hi everyone, and thanks so much for joining. Just a little correction: I'm not a doctor yet, just to point out, but thanks for upgrading me. So today we're going to talk about a practical guide to neural network quantization. This presentation is based on and motivated by a recently published white paper on quantization that can be found on arXiv, and it is intended as a go-to document for engineers and researchers who want to learn more about quantization and how to effectively apply it to their models. Let's start with a brief overview of what we're going to talk about today: a very brief introduction to energy-efficient machine learning and why quantization is part of it, an introduction to the fundamentals of quantization and how it is simulated in neural networks, then the two main clusters of algorithms available, post-training quantization and quantization-aware training, our proposed pipelines and how to implement them; and finally, if we have time, a brief overview of the AIMET toolkit, which is our open-source toolkit for quantization and compression.

So this talk is going to be about deep learning and neural networks. As we know, they are ubiquitous these days, but there has been a trend of exponentially increasing energy consumption over the past few years: we're seeing relatively small improvements in accuracy at the cost of a large increase in energy consumption. On the graph here you can see the trend over the past six years in the size of models in terms of parameters. Following this trend, in 2025 we can project a neural network to have about a hundred trillion parameters, which is about the capacity of the human brain if you think in terms of synapses, but about a hundred thousand times less efficient than the human brain. So we see there is a lot of scope for more efficient digital hardware. Another trend we're observing is more and more AI moving from the cloud to the phone, to the edge device. This is underpinned by privacy concerns, faster execution, and reduced communication overheads. However, it brings challenges, because these devices are thermally and power constrained, so if we want sleek and thin phones with lasting batteries, we really need to push AI to be more efficient on device. Qualcomm has been at the forefront of this research. Effectively, the issue with AI on device is that when we're executing neural networks, there's a lot of data transfer between the DDR memory and the compute cores.
This is because, during inference, we get new input data and we load the layers sequentially, which means we have to transfer weights and activations back and forth from memory. This data transfer leads to increased energy consumption and is actually the bottleneck of neural network execution from a thermal perspective. So how can we reduce this power? There are three main categories of methods. One is compression, where we prune the model while trying to keep the accuracy. Another is quantization, where we learn to reduce the bit precision of the model and its operations. And the last one is compilation, where we learn how to compile AI models so they run more efficiently on hardware. The focus of today's talk, of course, is quantization, which brings me to our white paper. It is unimaginatively called "A White Paper on Neural Network Quantization", so you can easily find it; it's currently on arXiv and, as I said, it is intended as a go-to document for everyone, from people just starting out in quantization to those who already have some background.

So what is neural network quantization? The main idea is that we start from an already trained neural network and we want to store it in a lower precision. A good analogy for newcomers is the image on the slide, where we can reduce the pixel representation and, as a result, reduce the size of the image, at the cost of a loss of resolution. In a neural network, the equivalent would be to store the weights in low precision; on top of that, we also want to perform the calculations in a lower bit width and save the activation maps in the reduced representation. This leads to reduced power consumption and latency. In detail, here are the benefits of quantization. There is an obvious memory benefit: an 8-bit representation is four times more efficient in terms of memory than 32 bits. More interestingly, it is also extremely efficient in terms of power: floating-point addition is about 30 times less power-efficient than its int8 counterpart, and floating-point multiplication is about 20 times more energy-consuming. With less memory access and simpler calculations, thanks to the fixed-point operations, we also get faster execution and a much reduced silicon area.

Now some fundamentals. Matrix multiplication is effectively the building block of every neural network inference; it takes place in both convolutional and fully connected layers, and here I'm going to go through a short example of how it is typically done in an accelerator, looking at a matrix W, an input X, and a bias b. This is a very miniature and somewhat imaginative example of a multiply-and-accumulate array in modern hardware. The elements C are the compute units that perform the scalar multiplications, the A's are the accumulators that sum the products of each row, and the bias term is frequently preloaded into the accumulators before the multiplication. So, step by step (and this is not the only way it can be configured): we preload the accumulators with the bias, then we map the matrix W onto the compute elements, loading each of its elements into one of the processing units, and then we bring in the columns of matrix X sequentially; at each cycle we compute one output, and we repeat until the full multiplication is completed. The interesting thing is how this changes once we decide to quantize these matrices.
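To make that schedule concrete, here is a minimal NumPy sketch of the multiply-and-accumulate flow just described: the bias is preloaded into the accumulators, W is mapped onto the compute elements, and the columns of X are streamed in one per cycle. The function name and shapes are illustrative only, not a model of any specific accelerator.

```python
import numpy as np

def mac_array_matmul(W, X, b):
    """Toy model of the MAC-array schedule described above (illustrative only).

    W: (rows, k) weight matrix mapped onto the compute elements.
    X: (k, cols) input matrix whose columns are streamed in one per cycle.
    b: (rows,)  bias preloaded into the accumulators.
    """
    rows, k = W.shape
    _, cols = X.shape
    out = np.empty((rows, cols), dtype=W.dtype)
    for c in range(cols):                 # one column of X streamed in per pass
        acc = b.copy()                    # accumulators preloaded with the bias
        for j in range(k):                # each compute element multiplies W[:, j] by X[j, c]
            acc += W[:, j] * X[j, c]      # partial products summed in the accumulators
        out[:, c] = acc                   # one output column completed
    return out
```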
As we saw with the image analogy, quantization comes at the cost of lost precision, but we can approximate a floating-point tensor with an integer tensor multiplied by a scale factor. The scale factor can be a power of two or remain a floating-point number. Looking at the weights, we can now express them as a scale factor, 1/255 in this example, times an int8 or uint8 matrix, and specifically we choose the scale factor such that we map the maximum value, 1, to the maximum representable uint8 number, which is 255. However, this approximation is not free: as we said, it introduces an error which we call quantization noise, and the whole field of quantization is interested in reducing that noise or making the network more robust to it.

The two main types of uniform quantization I'm going to briefly discuss are the symmetric and the asymmetric case. In the symmetric case, the zero of the floating-point grid is perfectly aligned with the fixed-point grid. In this case, the choice of signed or unsigned integers matters because of the distributions each can represent: the signed case is better suited for symmetric distributions, like the ones we see in weights, whereas the unsigned case is better for skewed distributions, like the output of a ReLU activation. The asymmetric case is more general and more flexible: it has an extra offset z that allows us to shift the fixed-point grid along the axis and target a distribution better. This z is frequently called the zero point, because it allows us to map the zero of the floating-point grid onto the integer grid without any error.

Let me repeat the example we saw before, but now with quantized values. We have the integers for the weights and for the matrix X, with scale factors of 1/255 for both, and we represent the bias with a higher bit width of 32 bits, because the accumulators are also kept at a higher bit width to avoid overflow during the summation. We first load the bias, then we bring the matrix W into the compute units and sequentially bring in the columns of X until the output is calculated. These output values are in int32 and are not in the right scale at this point; they have to be brought back to the original scale, which happens by combining the scale factors of X and W and multiplying the output by their product. But generally we want to reduce the output of this multiplication back to the low precision of eight bits, and this is why we have the next step, which is activation quantization. Here I'm just illustrating that we use the maximum value of this tensor to scale it down to the uint8 representation.

A common question is: should we be using symmetric or asymmetric quantization? To illustrate this, here is an example of how each of them works out on hardware. On the left we have symmetric quantization for the matrix multiplication Wx, and on the right we assume asymmetric quantization of both the weights W and the activations x. The first term on the right is the same integer calculation as in the symmetric case, whereas the second and third terms depend only on the quantization parameters and the matrix W: they are effectively constant, so they can be precomputed and added to the layer bias. However, the last term is data-dependent, and that is the additional overhead we incur by having asymmetric quantization of the weights: it is the weight offset z_w that creates that term. You can think of it as adding an extra input channel in a fully connected layer. So if we want to reduce this overhead during inference, it is recommended to use symmetric weights and asymmetric activations.
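To make the symmetric-versus-asymmetric argument concrete, here is the standard decomposition for one output element, written as a sketch in the notation used in the talk (s_w, s_x are scales, z_w, z_x are zero points, N is the number of accumulated products):

\[
\hat{y}_i = \sum_{j=1}^{N} s_w\bigl(W^{\mathrm{int}}_{ij} - z_w\bigr)\, s_x\bigl(x^{\mathrm{int}}_{j} - z_x\bigr)
= s_w s_x \Bigl(\underbrace{\sum_j W^{\mathrm{int}}_{ij}\, x^{\mathrm{int}}_{j}}_{\text{integer MACs}}
\;-\; \underbrace{z_x \sum_j W^{\mathrm{int}}_{ij} \;+\; N z_w z_x}_{\text{constant, fold into bias}}
\;-\; \underbrace{z_w \sum_j x^{\mathrm{int}}_{j}}_{\text{data-dependent}}\Bigr)
\]

The middle terms depend only on the weights and the quantization parameters, so they can be precomputed and folded into the bias; the last term must be computed at runtime and disappears entirely when the weights are quantized symmetrically (z_w = 0), which is exactly the recommendation above.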
So, simulated quantization. What we've seen so far is how inference is done on device. However, testing our quantized networks and their performance on device can be very expensive and time-consuming; instead, we would like to be able to simulate on-device performance using commonly available hardware. So we want to simulate these fixed-point operations with floating-point numbers on general-purpose hardware, which could be CPUs or GPUs, and the simulation is achieved by introducing simulated quantization operations, which appear as quantizers in the compute graph. The benefits of such a simulation are that it enables GPU acceleration, there is no need for dedicated kernels, and most importantly you can test a lot of quantization configurations and bit widths until you figure out the one that works best, before deploying to the device. So it really accelerates experimentation as well. On the left is a graphical representation of the MAC array we saw before, annotated with the fixed-point representation of each quantity: the weights are int8, then we have the MAC array, the accumulation, an activation function, and finally the quantization step. On the right we see how this is simulated with floating-point numbers: we add quantizer blocks between the tensors and operations where we need to reduce the bit width. In this case, we do this between the weight and the linear layer, because the weight is in floating point 32, and again after the activation function, because we have to re-quantize the output. The output is still floating point 32 in terms of the actual data type in memory, but after the quantizer its values lie on the quantization grid.

So these quantizer blocks: what operation do they actually perform? Let's assume asymmetric quantization, which is the most general case. They are defined by the quantization parameters, which are the scale factor s and the offset z (or zero point), and also the bit width b. The quantizer applies an element-wise operation that first maps the input tensor onto the integer grid, by scaling and rounding and then clipping to the limits of the grid; then we have the dequantization step, where we use the same quantization parameters to scale the tensor back to the original domain.

Here is a short toy example of how this works: we look at a bit width of 4, which means we have 16 levels we can represent, and a little 2×2 matrix. Somewhat arbitrarily I'm using a scale factor of 1/15, which is one over the maximum number of levels counting from zero, and somewhat arbitrarily using z = 8, which places zero in the middle of the grid; we'll see this choice actually has an impact on the final result. The first step, as we said, is to scale by the scale factor, then we apply the rounding operation and offset by the zero point, and this is the output. Then we apply the clipping operation, where we see that this value of 20 has been clipped to the maximum representable number, which is 15. That is followed by the dequantization step, which has the effect of clipping off quite a significant part of this value here. If we had chosen z more intelligently, we would have kept a bigger part of that value. So that was just to demonstrate the importance of choosing the parameters right.
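As a concrete version of the quantize–dequantize operation just described, here is a minimal NumPy sketch. The 2×2 matrix values are illustrative placeholders (the exact numbers from the slide are not in the transcript), but the scale, zero point, and bit width match the toy example.

```python
import numpy as np

def fake_quant(x, scale, zero_point, bits):
    """Simulated (asymmetric) quantization: quantize to the integer grid, then dequantize."""
    qmin, qmax = 0, 2 ** bits - 1
    x_int = np.clip(np.round(x / scale) + zero_point, qmin, qmax)   # scale, round, offset, clip
    return scale * (x_int - zero_point)                             # map back to the original domain

# The toy setting from the talk: 4 bits (16 levels), s = 1/15, z = 8.
x = np.array([[0.8, 0.2], [-0.3, 0.4]])   # illustrative values, not the exact slide matrix
print(fake_quant(x, scale=1 / 15, zero_point=8, bits=4))
# 0.8 maps to round(0.8 * 15) + 8 = 20, which is clipped to 15 and dequantized to (15 - 8) / 15 ≈ 0.47.
```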
What we've seen so far is called per-tensor quantization, where we specify one set of quantization parameters per tensor. This is the most commonly used form of quantization and is supported by essentially all commercial hardware. However, sometimes channels are very differently distributed within the same tensor. Looking at this case, channel 2 has a much higher dynamic range than the rest of the channels, so if we try to represent all channels within the same grid, we're going to get reduced resolution for the other channels. In this case, per-channel quantization can be very beneficial, as it can reduce the quantization noise and improve accuracy. That is why per-channel quantization of weights is becoming increasingly popular, but it is not supported by all hardware.

There are many types of layers in neural networks that are not necessarily linear layers, and how these are modelled depends greatly on the specific hardware we are using. Sometimes the mismatch between the simulated performance and the on-target performance comes down to how these layers have been quantized or simulated. So here we provide some guidance on how to simulate quantization for a few commonly used layers, starting on the left with max pooling; in this case we don't need...

Yes? — Marios, before we go to the specific implementations for the deeper layers, several questions, if you don't mind answering them. A related question from Magendra: in the int8 × int8 simulation, the result is shown as int32; unless we are accumulating a lot of int8 products with large magnitudes, that seems a waste of space, and we could do with 24 bits or so, depending on how many values are being accumulated. Correct?

Yeah, some of these accumulations can be extremely large. I understand that 32 bits might be a bit of an overkill, but I understand this choice has been made by hardware people for a reason. To be honest, I'm not close enough to hardware to have an answer on whether we could use 24 bits; I think it's just to be safe.

Another question, related to symmetric versus asymmetric: in asymmetric quantization, is z a vector? Is the zero point different for each channel?

That's a choice. Generally, in the per-tensor case, z is just a scalar for the whole tensor. In the per-channel case, depending on the hardware implementation, you can assume it is a vector, so each channel has its own zero point.

One more: will the simulated quantized inference be true to the hardware integer inference?

To be honest, I'm not entirely sure what "true" means here. There are certain simplifications: for example, we're still adding floating-point biases and keeping several operations in floating point. I think there is a good understanding of how to deal with these discrepancies to a large extent, so not everything is exactly as it would be on hardware; it's just close enough to the final performance to be acceptable.
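Picking up the per-channel discussion from those answers, here is a minimal sketch of symmetric per-channel weight quantization, where each output channel gets its own scale and the zero point is implicitly zero, matching the earlier recommendation of symmetric weights. The function name and the epsilon guard are my own, illustrative choices.

```python
import numpy as np

def quantize_per_channel_symmetric(W, bits=8, axis=0):
    """Symmetric per-channel weight quantization: one scale per output channel (illustrative sketch)."""
    qmax = 2 ** (bits - 1) - 1                                    # e.g. 127 for signed int8
    reduce_axes = tuple(i for i in range(W.ndim) if i != axis)
    max_abs = np.max(np.abs(W), axis=reduce_axes, keepdims=True)  # per-channel dynamic range
    scale = np.maximum(max_abs, 1e-8) / qmax                      # per-channel scale factors
    W_int = np.clip(np.round(W / scale), -qmax - 1, qmax)         # integer weights on the signed grid
    return W_int.astype(np.int8), scale.squeeze()
```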
Yeah, and a follow-up question from Ashwin: why is asymmetric quantization recommended over symmetric quantization for the activations?

Oh yes. The idea is that the overhead is very small. If I can quickly jump back a slide: this data-dependent term on the right only appears because of the zero offset of the weights, so if we cross out z_w, this term and this term disappear, and we are only left with the term that can be precomputed. So it is effectively very little overhead in hardware, and if you have Swish or hard-swish, these kinds of activations that have a part on the negative axis, you can get better performance by using asymmetric quantization.

And the last question before we proceed; it's a clarification, following up on the question about z as a vector. You mentioned specific hardware for per-channel quantization: does it require an additional dot product?

Well, it depends. It would if you used it for the weights, which we generally don't recommend: you would have an extra term. And it depends on whether the logic after the accumulator can take care of things at a channel level, because in the per-channel case every accumulator will be scaled with a different scale factor. Whether it can actually do the extra additions we saw will depend on the hardware. But if you stick to symmetric quantization of the weights, you don't have that problem anymore, and in the asymmetric activation case the remaining term can easily be precomputed: you can literally, as before, take that term and add it into your bias.

Well, thank you; let's proceed then. — Yeah, thanks for the questions.

So, going back to this list of layers. As I said, for max pooling the quantization operation is not required afterwards, because the input and the output are on the same grid. For average pooling, the average of a few integers is not necessarily an integer, so we have to re-quantize, but it is also fine to tie the quantizers, because the ranges of the input and the output are the same. Element-wise addition is slightly more nuanced and there is no universally accepted solution: extra care is needed if there is a mismatch in the ranges of the tensors being added, but as a safe step it's better to add a quantization step afterwards. The same is the case with concatenation, where the grids of the branches may not actually align, so it's also probably better to quantize the output. These examples demonstrate that choosing how to quantize these layers is not always straightforward and requires special attention, and sometimes the engineer might also need an understanding of the hardware.
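As a sketch of the "re-quantize after the add" recommendation, here is one common way to simulate an element-wise addition of two int8 tensors whose grids do not match: dequantize both inputs, add in higher precision, then re-quantize with an output quantizer. The helper names and the min-max range setting are illustrative assumptions, not any specific framework's API.

```python
import numpy as np

def quant_params(x, bits=8):
    """Asymmetric min-max range setting for a tensor (illustrative)."""
    qmax = 2 ** bits - 1
    scale = max((x.max() - x.min()) / qmax, 1e-8)
    zero_point = int(np.clip(np.round(-x.min() / scale), 0, qmax))
    return scale, zero_point

def add_and_requantize(a_int, a_scale, a_zp, b_int, b_scale, b_zp, out_scale, out_zp, bits=8):
    """Element-wise add of two quantized tensors with mismatched grids:
    dequantize both inputs, add, then requantize to the output grid."""
    qmax = 2 ** bits - 1
    a = a_scale * (a_int.astype(np.int32) - a_zp)                 # back to the real domain
    b = b_scale * (b_int.astype(np.int32) - b_zp)
    y_int = np.clip(np.round((a + b) / out_scale) + out_zp, 0, qmax)
    return y_int.astype(np.uint8)                                  # result on the output quantizer's grid
```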
So, choosing the quantization parameters, the s and z we talked about before, matters, because different choices induce different levels of quantization noise. This step is often called calibration in the literature. First, let me illustrate the sources of quantization error. We have these values on the real grid x, and we define the limits where we clip as q_min and q_max; this in turn defines a scale factor. We then map into the integer domain with the rounding and clipping operations, and here we see that these two values collapse onto the same grid point because they are very close, while this outlier, the value outside the limits, has been clipped to the limit of the grid. Then we have a dequantization step, and finally, if we project the original points onto this final grid, we can visualize the error. For the values inside the grid limits we have rounding error, which is capped at s/2; that is the case for a point that lies exactly between two grid points, the maximum rounding error we can incur. Then we have the clipping error, which is the distance between the grid limit and the original point. The total quantization error is the sum of the two, and there is a trade-off between them: if we widen the limits in order to reduce the clipping error, this in turn increases the scale factor, so we reduce the clipping error at the cost of increasing the rounding error. So there is a trade-off between the two contributions.

How we choose q_min and q_max, and by consequence the scale factor s, therefore really has an impact on performance, and we have a few methods for setting these values; which one to choose depends on how much complexity you can afford and on whether data is available. The easiest and most commonly used is the min-max range, where we just use the min and the max of the tensor to define the limits. Then we have the optimization-based methods, which are a bit more advanced: in this case we minimize an objective, a loss L between the unquantized and the quantized value, which in turn depends on q_min and q_max, so we look for the set of grid limits that minimizes that loss. The most commonly used loss for this is the mean squared error, but we can also use cross-entropy for the specific case of logits, which we have found works better if you want to quantize the logits. If we have no data available, a solution is to use the batch-norm-based method: we use the batch norm shift and scale, along with a spread factor alpha, to define the ranges, which assumes a roughly Gaussian distribution of the pre-activations. In this table we compare these methods for activation quantization, and we see that the optimization-based methods outperform the others, and using cross-entropy for the logits, when we want to quantize them, is definitely the best solution across the board.
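Here is a minimal sketch of what MSE-based range setting can look like in practice: a simple grid search over shrunken versions of the min-max range, keeping the clipping limits that minimize the squared error between the tensor and its fake-quantized version. The linear-shrinkage search is one common, illustrative choice; actual implementations may search the range differently.

```python
import numpy as np

def fake_quant_range(x, qmin_val, qmax_val, bits=8):
    """Fake-quantize x onto a uniform grid spanning [qmin_val, qmax_val]."""
    levels = 2 ** bits - 1
    scale = max((qmax_val - qmin_val) / levels, 1e-8)
    x_int = np.clip(np.round((x - qmin_val) / scale), 0, levels)
    return x_int * scale + qmin_val

def mse_range_setting(x, bits=8, num_candidates=100):
    """Search over candidate clipping ranges and keep the (q_min, q_max) with the lowest MSE."""
    best = (np.inf, x.min(), x.max())
    for frac in np.linspace(0.5, 1.0, num_candidates):      # candidate shrink factors of the full range
        qmin_c, qmax_c = frac * x.min(), frac * x.max()
        err = np.mean((x - fake_quant_range(x, qmin_c, qmax_c, bits)) ** 2)
        if err < best[0]:
            best = (err, qmin_c, qmax_c)
    return best[1], best[2]                                   # the limits that trade rounding vs. clipping best
```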
So, what algorithm should I choose to improve my accuracy? Range setting alone is normally not enough, and here we split the algorithms into the two main classes we observe in the literature: one is post-training quantization, or PTQ, and the other is quantization-aware training, or QAT. PTQ takes a pre-trained network and converts it to fixed point without any access to the training pipeline, so it might not require any data, or just a small calibration set without labels, and it's generally quite easy to use through an API call. The downside is that it can lead to lower accuracies at lower bit widths. That can be addressed with quantization-aware training, which generally leads to higher accuracies but has some drawbacks: it requires access to the training pipeline and labelled data, longer training times, and possibly hyperparameter tuning.

Let's first start with post-training quantization; I'm going to present our PTQ pipeline step by step. Generally, starting from a pre-trained floating-point model is recommended, or actually required, in this case. The first step is cross-layer equalization. As we said before, it is quite problematic to have unevenly distributed channels within a tensor. Here we see the second layer of MobileNetV2, which is a depthwise-separable layer: very few of the weight channels dominate the dynamic range, whereas all the others have a much smaller range. If we try to represent all of this within the same grid, we have no precision left for the smaller weights, because we have used it all on these large dynamic ranges. Cross-layer equalization tries to solve this problem by rescaling the channels of neighbouring weight layers. This is done by exploiting the scale equivariance of ReLU: scaling the input of a ReLU by a positive factor is the same as scaling its output by the same factor, which allows us to move per-channel scale factors between adjacent layers. In this small example, output channel 2 of the first layer's weights dominates the dynamic range, so we can divide that channel by a factor s2, calculated with the formula shown, and multiply the corresponding input channel of the adjacent layer by the same factor. This equalizes the ranges of the first layer with a very minimal effect on the second layer, so we get a much better utilized quantization grid for both layers.

One unfortunate side effect of this process is that, by scaling these channels, we can end up increasing the biases significantly. These increased biases can lead to uneven activation ranges and harm the quantized performance after activation quantization. To address this, we propose bias absorption, a technique in which we absorb a vector c from layer one into layer two: we subtract c here and move it into the bias vector of the adjacent layer. Looking at the weight ranges from before, after equalization we see a much better distribution of the weights across the channels, and in terms of performance, looking at MobileNetV2 with int8 quantization, by combining these two equalization techniques we go from a complete collapse in performance, essentially zero accuracy, to only a 0.8% drop from floating point; interestingly, it actually outperforms per-channel quantization in this case.

The next step is to add the quantizers; again we recommend symmetric weights and asymmetric activations. That is followed by the weight range setting, i.e. setting the quantization parameters, and based on what we have seen we recommend MSE-based methods, which generally outperform the alternatives in all cases, both for per-channel and for per-tensor quantization, looking at ResNet18 and MobileNetV2.
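Here is a minimal sketch of cross-layer equalization for a pair of layers joined by a ReLU, using the per-channel scale s_i = sqrt(r_i^(1) r_i^(2)) / r_i^(2) as I recall it from the data-free quantization paper, with r_i the dynamic range of channel i; the shapes and the epsilon guard are illustrative assumptions.

```python
import numpy as np

def cross_layer_equalize(W1, b1, W2):
    """Equalize per-channel ranges of two adjacent layers joined by a ReLU (illustrative sketch).
    W1: (out1, in1), b1: (out1,), W2: (out2, out1). ReLU(s * x) = s * ReLU(x) for s > 0 makes this exact."""
    r1 = np.max(np.abs(W1), axis=1)              # range of each output channel of layer 1
    r2 = np.max(np.abs(W2), axis=0)              # range of each corresponding input channel of layer 2
    s = np.sqrt(r1 * r2) / np.maximum(r2, 1e-8)  # per-channel equalization factors
    W1_eq = W1 / s[:, None]                      # scale layer-1 output channels down
    b1_eq = b1 / s                               # the layer-1 bias is scaled the same way
    W2_eq = W2 * s[None, :]                      # and layer-2 input channels up by the same factor
    return W1_eq, b1_eq, W2_eq
```

After this transform, both layers end up with equalized per-channel ranges (both become sqrt(r1 * r2)), which is what makes a single per-tensor grid usable for both.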
Next we have a choice, depending on whether we can proceed with calibration data or not. If data is not available, we move to a step called bias correction. In bias correction, we observe that the quantization error of the output, shown here, is generally biased: its expectation is not zero, and it depends on the statistics of the input. Here we see the distribution of this biased output quantization error, which can lead to performance degradation. The solution is to apply a technique called bias correction, and this can be done completely data-free using the batch norm parameters and an assumption of Gaussian pre-activations: using the PDF and CDF of the normal distribution, we can analytically correct for part of this error. This table shows the benefit of the method: by combining cross-layer equalization with bias correction, we essentially recover floating-point accuracy for MobileNetV2, and on the graph you can see how much smaller the biased error is after we apply this correction.

If we do have calibration data, then we recommend a much more powerful algorithm called AdaRound. Traditionally in PTQ, after we define the quantization parameters, we use the round-to-nearest operation; that is because rounding to nearest gives the lowest mean squared error on the weights. However, the question is: is this the optimal way of rounding for the final task loss? Here we see the impact on validation accuracy of rounding the first layer of ResNet18 in different ways. For flooring or ceiling we get essentially random performance; stochastic rounding is, on average, about the same as round-to-nearest, but the best draw of stochastic rounding is more than ten percent better than round-to-nearest. So the question is whether each weight should be rounded up or down, and whether there is a way of systematically finding this rounding choice. The answer is yes: we can learn to round. AdaRound does this by minimizing the local L2 loss on the output of the layer; in this case we look at the output of a linear layer. The problem is solved using soft-quantized weights, which work as follows: you round the weights down (floor), and then you have a learned value h(V) that lies between zero and one. So h(V) should move between the two neighbouring grid points and at some point land either at the next grid point, with h = 1, or stay at the floor, with h = 0. The choice for this function h is the rectified sigmoid; here is how the rectified sigmoid looks compared to a normal sigmoid. The reason is that where the output of h(V) reaches zero or one, the normal sigmoid has zero gradient, whereas the rectified sigmoid still has some gradient, so it allows for more flexible learning and lets h(V) reach these extremities. To ensure that h(V) does converge to zero or one, we also add a regularization term, which is this term here, and we use beta annealing during training.
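Here is a minimal PyTorch sketch of the AdaRound ingredients just described: the rectified sigmoid h(V), the soft-quantized weight built from floor(W/s) + h(V), and the regularizer with an annealed beta. The stretch constants and the int8 clamp limits follow the AdaRound paper as I recall them, so treat them as assumptions.

```python
import torch

GAMMA, ZETA = -0.1, 1.1   # stretch parameters of the rectified sigmoid (values as I recall from the paper)

def rectified_sigmoid(V):
    """h(V): a sigmoid stretched to (GAMMA, ZETA) and clipped to [0, 1], so it can actually reach 0 and 1."""
    return torch.clamp(torch.sigmoid(V) * (ZETA - GAMMA) + GAMMA, 0, 1)

def soft_quantized_weight(W, scale, V):
    """AdaRound's soft-quantized weight: floor(W / s) plus the learned value h(V) in [0, 1]."""
    return scale * torch.clamp(torch.floor(W / scale) + rectified_sigmoid(V), -128, 127)

def rounding_regularizer(V, beta):
    """Pushes h(V) towards 0 or 1; beta is annealed during optimization."""
    h = rectified_sigmoid(V)
    return torch.sum(1 - torch.abs(2 * h - 1) ** beta)

# Local optimization (sketch): minimize ||W x - W_soft x||^2 + lambda * rounding_regularizer(V, beta)
# over V, then commit to hard rounding with floor(W / s) + (h(V) >= 0.5) once it has converged.
```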
Quickly, some results on how AdaRound works. We're looking at 4-bit weight quantization with 8-bit activations, and we compare to the plain CLE plus bias-correction pipeline and to other bias-correction methods from Banner et al., which is actually a slightly unfair comparison because they use per-channel quantization; but we can see that in all cases AdaRound performs better, and even for InceptionV3 the difference is pretty big compared to the best of the other methods.

The last step is activation range setting: MSE range setting if we have data, and batch-norm-based range setting if we have no data. If, after following all these steps, the accuracy is still unsatisfactory, we propose a debugging stage; there is a flow chart with debugging steps in the white paper, so please consult it if you run into issues, and this should help you close the accuracy gap even further. Some results: we have a lot of models and benchmarks here, with colour coding for the drop in accuracy: green is less than 1%, orange is 1 to 1.5%, and red is more than 1.5%. We see that for 8-bit quantization PTQ performs really, really well, even for difficult models like EfficientNet-Lite and MobileNetV2, or for language models like BERT. The 4-bit results are not as good, but as we said, 4 bits is harder to quantize, and they are still pretty impressive in certain cases.

Now, quantization-aware training. The biggest difficulty in QAT is the backward pass through the quantizers: because of the round-to-nearest operation, which is a Heaviside-like step function, gradient-based training would be impossible, as the gradients are not meaningful. The solution is the straight-through estimator, which approximates the gradient of this operation with one; it is equivalent to computing the backward pass as if the forward operation were a ramp rather than a step function. One of the benefits of using the straight-through estimator is that it also allows us to learn the quantization parameters directly, rather than having to set them: through the task-loss gradients, we can find an optimal trade-off between the rounding and clipping errors.

One nuanced thing about QAT is how to treat batch normalization. Batch norm is found in essentially all computer vision models, but for faster inference we generally fold it into the previous layer. This folding combines the scaling operation with the weight and the offset operation with the bias, and you can see on the left how the compute graph is transformed for inference: we scale the weights and shift the bias, and then we effectively remove the batch norm and re-parametrize the weights and bias. For PTQ this is very obvious, there's only one way of doing it, but for QAT there is a debate in the community about whether the statistics and the parameters of batch norm should be updated during quantization-aware training. One approach, from Krishnamoorthi, suggests using a double forward pass: one forward pass for calculating the statistics and one for the actual quantized calculation. We actually find that simple static folding, in which we fold the parameters and remove the batch norm, performs better than this approach in most cases, and it's also cheaper and faster because we remove the batch norm operation from the compute graph. For per-channel quantization, this choice doesn't really matter that much anymore, because the batch norm scaling parameters can be absorbed into the per-channel scale factors, so we can even leave batch norm intact, and that generally leads to higher performance than folding it.
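Here is a minimal PyTorch sketch of a fake quantizer with a straight-through estimator, using the common detach() trick: the forward pass applies the real round-and-clamp, while the backward pass sees a gradient of one. Note that this simplest form also passes gradients through clipped values; many implementations zero those out instead.

```python
import torch

def fake_quant_ste(x, scale, zero_point, bits=8):
    """Fake quantization with a straight-through estimator.

    Forward: scale, round, offset, clamp to the integer grid, then dequantize.
    Backward: the non-differentiable part is detached, so d(out)/d(x) = 1 (the STE ramp)."""
    qmin, qmax = 0, 2 ** bits - 1
    x_int = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    x_dq = (x_int - zero_point) * scale
    return x + (x_dq - x).detach()     # forward value is x_dq; gradient flows straight through x
```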
So here is our quantization-aware training pipeline. We start from a pre-trained model, and in most cases we apply a CLE and batch-norm-folding step if we have per-tensor quantization. Just to point out: if we don't use CLE, for certain models we actually cannot recover accuracy with QAT; for MobileNetV2 it's very important to apply CLE even before QAT, because otherwise we get stuck at a very low accuracy. So it's a generally recommended step for models with depthwise-separable layers. Then we add the quantizers and do the range estimation; we generally recommend MSE, but in this case a min-max range for the weights is also generally fine, because it can be recovered during training. Very importantly, we recommend always using learnable quantization parameters. There is a detail about how you optimize them: it's recommended not to use the same schedule and optimizer as for the weights, because they have to change differently from the weights, more slowly. One solution is to use the gradient scaling from the LSQ paper, or, equivalently or even better in our experience, to use an optimizer like Adam with a different learning-rate schedule from the weights. And then we train.

Quickly, some results. I'm not going to spend much time on 8 bits, but you can see that QAT generally improves performance even in that case. Most important is 4-bit quantization, where we see it closes the accuracy gap. Just to illustrate this, here are PTQ and QAT side by side: in a lot of cases, and especially in some cases like BERT, we move from one colour band to the other, a significant improvement in accuracy, and for EfficientDet-D1 we go to about a 5% drop compared to a complete loss of accuracy. Some models are still a bit harder to quantize even with QAT, generally the ones that use EfficientNet-Lite as a backbone.

So, here is an overview of the research we're doing at Qualcomm on quantization; these are the papers we published in the last two or three years, and the last one actually only came out today on arXiv, so please go and check it if you want. I don't know, Evgeni, shall I talk about AIMET, or how are we doing with time? — I think we have several questions to answer. Okay, let's answer some questions, and then in the remaining time you can go through that.

Question from Neil: are open-source frameworks available for QAT? — Well, yes; in our case there is — sorry, I'm moving slides accidentally. QAT is supported by AIMET, which is our toolkit, and otherwise that's the answer I have for this.

Also a related question from Neil: can you take an existing trained network in floating point and use some kind of transfer learning with QAT to convert it to a quantized network, rather than having to start training from scratch? — Transfer learning... well, a common thing in QAT is that sometimes you use a teacher-student setup, which is seen as a powerful method for improving performance. But transfer learning: would that be like using a hypernetwork, having a bank of networks that have already been trained? I'm not sure; maybe I don't really grasp the idea of transfer learning in this setting. — It could be a topic for the next research paper.
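Going back to the recommendation of a separate optimizer and learning-rate schedule for the quantization parameters, here is a minimal PyTorch-style sketch of how the parameter groups might be split. The "quant" naming convention and the learning rates are hypothetical; they depend entirely on how your quantizers register their parameters and are not AIMET's API.

```python
import torch

def build_optimizers(model, weight_lr=1e-4, quant_lr=1e-5):
    """Separate optimizers for weights and (hypothetically named) quantizer parameters."""
    quant_params  = [p for n, p in model.named_parameters() if "quant" in n]       # scales / offsets
    weight_params = [p for n, p in model.named_parameters() if "quant" not in n]   # everything else
    opt_weights = torch.optim.SGD(weight_params, lr=weight_lr, momentum=0.9)
    opt_quant   = torch.optim.Adam(quant_params, lr=quant_lr)   # own optimizer and learning-rate schedule
    return opt_weights, opt_quant
```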
Yes. A question from Lord: isn't there the possibility to train the model directly in an integer format, instead of doing a post-training float-to-integer conversion? — Is that for QAT or PTQ, sorry? Well, the difficulty with integers is that then you can't use gradients anymore, so for QAT you might be looking at sampling methods or combinatorial optimization or something; it would be a different paradigm. The reason we resort to floating point, and normally keep a floating-point shadow weight, is that it allows us to use gradients — well, not real gradients, because they don't exist, but some estimate of the gradients. For binary networks the answer could be different, but in the multi-bit case that would be very difficult to optimize, just because there are so many options in the grid.

Okay, several more questions; let's see how we're doing on time. These two, okay. A question from Mayan: what are the validation and test dataset sizes used for the best stochastic rounding runs? It could just be a selection bias that does not generalize. — That was the whole validation set of ImageNet; there is no separate test set, we just pass the whole validation set through. There were on the order of a hundred stochastic rounding samples, and if you plot them, there's a big range between all of them.

A question from Thomas: have you worked with RNN models? — Someone in our group has worked with RNN models in the past; I have not, I actually haven't touched them, so my experience with RNNs is very limited. Since BERT, we've kind of stopped using RNNs a lot of the time.

And several questions from Miguel; I'll read all of them, there are four, and you decide in what order to answer. — I'm also quickly looking at them, but go ahead. — Okay, let's start with a simple one: is AIMET now available for TensorFlow 2 and PyTorch 1.8+? — Let's see what this slide says, because I am not sure about this one, sorry about that. I guess if you go to the GitHub link, that should be answered there.

Then three more from Miguel. For grouped convolution and transposed convolution, does per-channel mean per group of channels? — Yeah, I think in that case you can do it per group or per output channel, depending on what you can do on hardware; if you can map it accordingly into the accumulator, then that should be fine if you use a separate one. I see Tijmen is also responding here; as he said, that's still fine, but it's not a very common approach.

Next one from Miguel: what happens when the range doesn't have zero inside it? — Generally speaking, in most quantization techniques we always try to include zero; I haven't seen a method where you wouldn't represent zero. In the asymmetric case maybe you could avoid it, but for sparsity reasons, and for every channel, when you mix the weights with the activations you need to make sure zero can be represented without error. So generally, even in the symmetric case, you will always try to include zero in the range.
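To make the "always include zero" point concrete, here is a minimal sketch of deriving an asymmetric (scale, zero point) pair from an observed range while forcing real zero onto the grid; the function name and the example range are illustrative.

```python
import numpy as np

def asymmetric_params_with_exact_zero(x_min, x_max, bits=8):
    """Derive (scale, zero_point) from an observed range so that real 0.0 lands exactly on the grid.
    The range is first extended to include zero, then the zero point is rounded to an integer."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)     # make sure zero is inside the range
    qmax = 2 ** bits - 1
    scale = max((x_max - x_min) / qmax, 1e-8)
    zero_point = int(np.clip(np.round(-x_min / scale), 0, qmax))
    return scale, zero_point

print(asymmetric_params_with_exact_zero(0.2, 1.0))       # a range that did not originally contain zero
```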
And the last one from Miguel: since cross-layer equalization could increase errors in this case, what techniques would you recommend to recover from those errors? — Well, it depends on the form of the error. Bias correction, I guess, can take care of some of it, and bias absorption is supposed to take care of the error induced in the activations. I think these two methods are generally quite powerful for avoiding that; I don't know if there's another form of error that I'm not currently considering.

Okay, so we have, I think, five minutes left. Marios, maybe you can give us a bit of a refresher on the AIMET tool; and for those who are interested to learn more, there was a whole tutorial on AIMET given at the tinyML Summit in March, also by Qualcomm, so you can find that video and the material on the YouTube channel, and we can send you the link. But if you can give a bit of a refresher, that would be good.

Yeah, I'm not going to spend long on it. These are the links for finding AIMET, the open-source AI Model Efficiency Toolkit, which includes pretty much all the methods we talked about today. It mainly consists of two things: one is AIMET itself, which has the quantization and compression techniques, and the other is the AIMET Model Zoo, with 8-bit quantized models in both TensorFlow and PyTorch. AIMET effectively plugs into your flow and includes both quantization and compression techniques; it has data-free quantization, range-setting techniques, and quantization simulation, and it also allows fine-tuning, or QAT, on top. The Model Zoo has models in both TensorFlow and PyTorch for a very wide range of benchmarks and different models, as you can see here, and for all of them the results are within less than one percent of floating-point accuracy.

This is all open source? — Yeah, this is open source. There's a very small chance that one or two of the techniques we talked about today are not open-sourced yet, but I wouldn't want to commit to that because I'm not 100% sure. And just a reminder: everything we talked about today, in much more detail, along with the results and calculations, is in the white paper, so I recommend you go and have a look at it on arXiv, or by scanning the QR code. We welcome suggestions, feedback, and comments; it's kind of a living document for us, and we want to include insights from the community and make it better for everyone.

Awesome. — Yeah, and that's the end; thanks so much for the quite interesting questions, and I'm really sorry I wasn't able to fully answer some of them. — Two very last ones. One on AIMET: what is the output format after the optimization? And "also TensorFlow"; I'm not sure what the second one refers to, but in the diagram you showed, the optimizer: what is the format there?

Yep. So for the optimization, you save out the model with the new weights and also the quantization encodings, as they call them.
The idea is that when you apply the simulated quantization, AIMET will export the weights and also the quantization parameters that are needed for on-device inference. So an optimized model is just that, but with better weights and different quantizers.

The very last question, from Mayan again, who is interested in RNNs: for RNN quantization, do we quantize all the inputs and outputs of the different gates and the hidden-state contents? And does the quantization error propagate through the RNN? — Yeah, I think this is more a question for Tijmen, who is responding in the chat. My guess would be that everything has to be quantized, unless you can somehow map part of it away from the fixed-point accelerators and onto the CPU; otherwise, if you want to use the accelerators, you have to use quantized values.

Cool. Well, thank you, Marios, for a very interesting and really practical presentation on quantization and the tools and all the ins and outs, and thank you all for joining this tinyML Talks series. We'll have the next one again on Tuesday, and at this point I would also like to acknowledge our sponsors again; if you can advance to the sponsors slide... oh, I think we already did this one. So this is the list of sponsors, and we would like to acknowledge them one by one. Arm, the software and hardware foundation for tinyML. Deeplite from Canada, who use AI to make other AI faster, smaller, and more power-efficient. Edge Impulse, who are going to have their Imagine conference this week, by the way; they provide tinyML for all developers, and I encourage you to join the Imagine conference, which I think is also free of charge, starting tomorrow. Emza, a company in Israel: the AI in IoT, edge AI visual sensors. GreenWaves Technologies from France, enabling the next generation of sensor and hearable products to process rich data with energy efficiency. HOTG, building distributed infrastructure for tinyML applications. Latent AI, from here in the Bay Area: adaptive AI for the intelligent edge. Maxim Integrated, who just joined with ADI: enabling intelligence. Qeexo, an AutoML, automated machine-learning platform that builds tinyML solutions for the edge using sensor data. Qualcomm; I think we spoke enough about Qualcomm today, thank you Qualcomm. Reality AI: add advanced sensing to your product with AI and tinyML. SensiML, another interesting startup in this space: build smart IoT sensor devices from data. SynSense from Switzerland, who build sensing and inference hardware for ultra-low-power embedded, mobile, and edge devices. And Syntiant, from the Southern California area, building neural decision processors and software. As I said, the next presentation will be next week on Tuesday, by Professor Lomuscio, on verification of ML-based AI systems, and it starts at 8 a.m. So thank you all for joining, thank you for your questions, thank you, Marios, again for the very interesting presentation, and we'll stay in touch.