So I have a feeling that a lot of people here are really excited about machine learning, and I think that excitement is not unreasonable. Machine learning has shown its potential before: it has completely revolutionized computer vision. I don't know if there is going to be a revolution in image compression, but we have already seen that ML has made its way into existing compression methods. We have heard talks about learned methods to predict mode decisions to speed up the encoder; there are learned intra prediction methods, learned in-loop filters, learned upscaling filters. What I want to focus on today is a bit more radical: I want to talk about image compression methods that are learned end-to-end, and we have seen some examples of that in the special session this morning. I think it is really a topic worth thinking about, even though the complexity may still be quite high, although I heard about the CPU decades just a minute ago, and that makes me feel a bit more comfortable. So I am interested in just how good compression can get if we try to fully exploit the potential of these techniques.

Let me give you an example of the kind of results you can expect from learned image compression. When we apply a modern compression method to an image like this and target a very low bitrate, we are going to see some artifacts. Here is HEVC intra; the target bitrate is about 0.1 bits per pixel. If you look closely, you can actually identify many of the components that are inside HEVC intra. For example, if you take a look at the sky, you can see staircasing, which basically tells you there is some sort of block partitioning going on. If you look at some of the prominent edges in the picture, you can see ringing, which gives away that there is a frequency transform; this kind of linear transform also causes structural distortions that can sometimes completely alter the appearance of objects, like this boat up here. If you look at some of the more organic shapes, like the bushes down here, you can see that there is directional prediction going on, because it tends to force these edges into shapes that are easier to code. Another difficulty you will find in existing methods is that perceptual rate allocation can be quite hard to do; for instance, here the system does not allocate enough bits to the texture in the background, so you get a lot of flattened texture up here.

All of these coding tools reveal themselves because they have been independently developed and optimized, and then joined together in a somewhat manual process that happens at the standardization committees. In a compression method that is optimized end-to-end, artifacts like that do not really have to happen. Not only that: if you set it up the right way, an end-to-end trained method will also figure out automatically how to do the rate allocation in an optimized way, basically just based on the distortion metric you optimize it for. Here is the image coded to the same bitrate with an end-to-end trained method. You can see there is no staircasing, the edges are crisp, there is no ringing, the boat still pretty much looks like a boat even though it may lose some details, the bushes look more natural, and the method also learned to shift some of the bits into the background. Here is another close-up comparison.
Again, the end-to-end trained method produces somewhat more pleasant artifacts; they tend to keep the image structure more intact. Here is another good example of learned rate allocation: this is HEVC intra, and this is the end-to-end trained method, which pushes a lot more bits into this texture here, at the same bitrate. Another example: this is HEVC intra, and if you take a look at the rose petals, you can see that the learned reconstructions have a more natural, image-like look, even though they are clearly missing some details. One more example: take a look at the background. Up here you have a lot of directional prediction artifacts, which can make the clouds look unnatural, and you have a lot of transform artifacts here. With the end-to-end trained method it all looks more natural, although we might lose some details down here; if I go back, you can see there is a bit of detail loss, but overall you get a more balanced visual impression.

The field of learned compression is already too big to be covered in detail in less than an hour, so what I am going to do is give you an introduction to the kinds of models I am most familiar with and make some connections to other existing work. I am going to start with something fairly simple, transform coding, and explain how we can use neural networks to extend it from linear to nonlinear transforms. Then I am going to relate nonlinear transform coding to representation learning, which is a much wider concept and which we can use as a source of inspiration for better learned compression methods. Third, I am going to try to anticipate some questions you might have if you are not familiar with this field. And last, I want to update you on what happened last week at the first challenge on learned image compression; I only have about this much time left.

Okay, let's take a look at JPEG as a stand-in for a simple transform coder. We have an image, called x, which is subjected to a DCT; that gives us the transform coefficients, which I am calling y here. These are quantized, typically with a non-uniform quantizer, and once they are discrete we can entropy code and decode them, taking advantage of some statistical relationships: run-length coding of the zigzag scan, Huffman codes, and so on. I am going to skip over a couple of important details here just to keep it simple. The main point is that we end up with a set of perturbed coefficients, which I am calling y hat, and these are fed through the inverse transform, which gives us the reconstructed image x hat; then we can measure the distortion with the metric of our choice. So what ends up defining this method is the distortion on one hand and the rate on the other.
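To make that pipeline concrete, here is a minimal toy sketch of a JPEG-like transform coder for a single 8x8 block. The block size, the single quantization step size, and the "count the nonzero coefficients" stand-in for entropy coding are all illustrative simplifications of my own, not JPEG's actual tables or coder:

```python
import numpy as np
from scipy.fft import dctn, idctn

def toy_transform_code(block, step=16.0):
    """Toy JPEG-like pipeline for one 8x8 image block with values in 0..255."""
    y = dctn(block - 128.0, norm="ortho")        # analysis transform (DCT)
    y_hat = np.round(y / step) * step            # uniform quantization (JPEG uses per-coefficient steps)
    x_hat = idctn(y_hat, norm="ortho") + 128.0   # synthesis transform (inverse DCT)
    distortion = np.mean((block - x_hat) ** 2)   # squared error
    symbols = np.count_nonzero(y_hat)            # crude stand-in for the entropy coding stage
    return x_hat, distortion, symbols

block = np.random.default_rng(0).integers(0, 256, size=(8, 8)).astype(np.float64)
x_hat, mse, nonzeros = toy_transform_code(block)
print(f"MSE: {mse:.2f}, nonzero coefficients: {nonzeros}")
```

On random noise like this the DCT of course does not help much; the point is only to show where the analysis transform, the quantizer, and the synthesis transform sit in the pipeline.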
So why do we use the DCT or similar transforms? We can go all the way back to the paper proposing it in 1974. The authors there measure the rate-distortion performance of the DCT and note that it is almost as good as the KLT, but can be computed much faster. That explains why it has been so successful and why we are still using similar transforms. What is interesting here are the assumptions the authors base their analysis on. They say in the paper that this really holds only for AR(1) signals, or for Gaussian signals, but another assumption, which is not even mentioned in the paper, is that they were only looking for linear transforms. The KLT is the best linear transform, and we cannot really blame them for omitting that assumption, because in 1974 linear transforms were basically all you could do. In general, though, when we do image compression the signal is not Gaussian, it is highly non-Gaussian, and the rate-distortion optimal transform is very likely not linear.

This is where ANNs come in. Artificial neural networks are generic function approximators. They have been around for a long time, but training large ANNs has been mostly infeasible until very recently; now, with modern hardware, it has become quite easy to obtain good approximations of arbitrary input-output relationships. Generally, an ANN is a mapping from an input vector x to an output vector y, and it consists of linear transforms alternating with nonlinearities; the weights of these linear transforms make up the parameters of the network, which I am calling theta here. The crucial thing to realize is that we can use these universal function approximators to approximate the rate-distortion optimal transforms, even if those are nonlinear.

For the next few slides I want to talk about a relatively simple model, which I like to call nonlinear transform coding. It is basically what it sounds like: we just replace the linear transforms with ANNs. To keep things simple, we use a uniform quantizer and a very simple arithmetic coder that assumes all the elements are independent, so there is no context and no adaptivity; it is relatively straightforward. The problem is that this system has a lot of parameters: the arithmetic coder has parameters, in that it has to model a probability distribution from which we generate the coding tables, and the neural networks have parameters in their weights, quite a lot of them. If we do not know what the parameters are, we cannot use the system, so we need some method to actually obtain them, and that is where machine learning comes in. Fortunately, machine learning gives us all the tools to optimize the system and find good parameters, not necessarily the optimal ones, but very good ones. I am calling this the machine learning treatment, because it is a set of tools that has become really popular recently.

So what do we do? First of all, we need to know what we want to optimize; we need a loss function, and in image compression that is fairly straightforward: any lossy compression method wants to minimize its rate-distortion cost. We can write the rate simply as the entropy of the coefficient distribution, which is the expected value over all images of the negative logarithm of the coefficient distribution, and we can write the squared error also as an expectation over all images. Then, of course, we have to trade these off; for simplicity I am just going to assume we target one particular trade-off, so we have a constant lambda multiplying the distortion, we add it all up, and now we would like to minimize this. If we actually write it out in its explicit form, it becomes a rather nasty expression with all the parameters sprinkled inside of it, and now we have to find a way to minimize it.
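Written out, the loss has roughly the following shape. The notation here is my own shorthand, not necessarily what was on the slides: an analysis transform g_a with parameters phi, a synthesis transform g_s with parameters theta, a quantizer Q, and an entropy model p with parameters psi.

```latex
L(\boldsymbol\theta, \boldsymbol\phi, \boldsymbol\psi)
  = \underbrace{\mathbb{E}_{x}\!\left[-\log_2 p_{\hat y}\!\big(Q(g_a(x;\boldsymbol\phi));\boldsymbol\psi\big)\right]}_{\text{rate}}
  \;+\; \lambda \, \underbrace{\mathbb{E}_{x}\!\left[\big\| x - g_s\big(Q(g_a(x;\boldsymbol\phi));\boldsymbol\theta\big)\big\|_2^2\right]}_{\text{distortion}}
```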
One standard way of approaching this, when you have these expectations, is stochastic gradient descent. Stochastic gradient descent replaces the expectation with an approximation, which is just the mean over a small number of images; this is called a batch. So we take a batch of images and average the loss function over them, which is good because we do not have to evaluate any integrals to compute the expectation. The other thing we do is take the derivative of that expectation and pull it inside the sum. Now it looks a lot more feasible, but we still have a rather unwieldy derivative to work out, and that is where automatic differentiation frameworks like TensorFlow come in: the software lets us write these expressions down in symbolic form and computes the derivatives for us, so we do not have to work them out manually.

It might seem like we can now actually find the parameters of the system, but we have neglected one thing: there is a quantization function in here, and because the quantizer is essentially a step function, in the best case we get zero derivatives, and if we are really unlucky we get infinite derivatives. That pretty much kills gradient descent. The solution I like to use for this problem is to pretend during training that there is no quantizer and to replace it with additive uniform noise; we do this during training, and when we actually use the system we just quantize as usual. This has the interesting effect that the probability distribution of the coefficients changes. If you do uniform quantization, and say the coefficients have a Laplacian distribution, then the quantized coefficients have a distribution like this train of delta functions, because all the probability mass inside a bin gets mapped onto one value. When instead we add uniform noise, we get a distribution that looks like this: it interpolates the probability mass function of the quantized coefficients, so it is a kind of continuous relaxation. The nice property is that during training we can use a parametric model to track these continuous distributions, and when we want to do actual compression, we just evaluate that function at the integer values; that gives us the probability mass function, which in turn allows us to construct the arithmetic coding tables.
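Here is a minimal sketch of that trick in TensorFlow-style Python. The zero-mean Laplace density is just a placeholder for whatever parametric entropy model you would actually learn, and the function names are my own, not anything from a particular codebase:

```python
import tensorflow as tf

def perturb_or_quantize(y, training):
    """During training, replace rounding with additive uniform noise in [-0.5, 0.5);
    at test time, round to integers as usual."""
    if training:
        return y + tf.random.uniform(tf.shape(y), -0.5, 0.5)
    return tf.round(y)

def laplace_cdf(y, scale):
    # CDF of a zero-mean Laplace distribution; stands in for a learned density model.
    return 0.5 + 0.5 * tf.sign(y) * (1.0 - tf.exp(-tf.abs(y) / scale))

def rate_bits(y_tilde, scale=1.0):
    """Differentiable rate estimate: score each (noisy or rounded) coefficient under the
    density convolved with a unit-width box, via CDF differences at +/- 0.5."""
    p = laplace_cdf(y_tilde + 0.5, scale) - laplace_cdf(y_tilde - 0.5, scale)
    return tf.reduce_sum(-tf.math.log(p + 1e-9)) / tf.math.log(2.0)

y = tf.constant([[0.2, -1.7, 3.1]])
print(rate_bits(perturb_or_quantize(y, training=True)).numpy())   # rate estimate used during training
print(rate_bits(perturb_or_quantize(y, training=False)).numpy())  # rate the arithmetic coder would see
```

Evaluating the same CDF-difference expression at integer values is exactly what gives the probability mass function used to build the coding tables.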
Okay, here is the result of applying this to a very simple toy example. In linear transform coding it might not make much sense to compress a scalar variable, but in nonlinear transform coding it actually does. So instead of compressing images, I am compressing a Laplacian source, which is just a scalar distribution. I designed a very small neural network that takes one scalar as input, has a handful of neurons, and outputs another scalar. In this graph, the blue line is the source density, the vertical lines indicate the boundaries where, in the transform domain, one value flips over to the next integer, and the dots represent the points you get when you take an integer value in the transform domain and map it back into the data space. What you can see is that this nonlinear transform coder acts like a non-uniform quantizer, and what is very nice and encouraging is that it figures out that it should use a dead zone, and it also puts the representers near the conditional mean within each bin. These are two properties we know help when compressing a source like this, but we did not put them in there; the system learned to do that just from seeing samples of the Laplacian distribution during training.

Now we can go a step up and extend this to two dimensions. To make it more interesting, I constructed a probability density that looks like a banana, and I applied the same technique to this distribution, but constrained the transforms to be linear. When you constrain it to a linear transform, you are basically stuck with a lattice quantizer; the system still tries to perform as well as it can under this constraint, and what it figures out is to align the lattice with the principal directions of the banana. So it is doing its best to adapt to the source distribution. And this is what you get when you lift that constraint, going back to a very small neural network instead of the linear transform. Two things happen here. One is that the cost, rate plus lambda times distortion, went down from 6.9 to about 6, so it has gotten quite a bit better at compressing this source. The other is that it is a lot more flexible: it has adapted much more to the data distribution. You can see it is still lattice-like, but it is almost as if it has shrink-wrapped the banana. Now we can compare this to a rate-constrained vector quantizer that I optimized for this source distribution, which is pretty much the best you can hope for, and you can see it is actually not that much better: we go from 5.97 to 5.95. So, in conclusion, the nonlinear transform coder is not necessarily optimal, but because it is so much more flexible, it can approach the optimum much more closely.

Now let's get to images. For coding images we use quite a bit bigger neural networks, and instead of general linear transforms we use convolutions. For the results I am showing next we also use the GDN nonlinearity I discussed this morning, and I want to point out up front that we are not using a thousand layers, just three, so it is not quite as daunting as it may sound. Here I am going to show you results for this particular image at one particular target bitrate, but for three different compression methods: JPEG, JPEG 2000, and this nonlinear transform coder. Note that none of these methods is really state of the art in any way, not even the nonlinear one, which is an older model; I just want to compare the three different transforms. JPEG should look familiar: you get the typical block artifacts. Then there is JPEG 2000, which uses an orthogonal wavelet, so you get the typical ringing around object boundaries, and you get some artifacts because the wavelet is assumed to be separable, so diagonal lines are a bit harder to represent.
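To give a flavor of what such transforms look like in code, here is a rough sketch of a three-layer convolutional analysis and synthesis transform in Keras. The filter counts, kernel sizes, and strides are plausible placeholders of my own, and I use ReLUs for simplicity; the actual models replace them with GDN and inverse-GDN layers:

```python
import tensorflow as tf

def analysis_transform(num_filters=128):
    # Downsamples the image by a factor of 16 and maps it to the latent representation y.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(num_filters, 9, strides=4, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(num_filters, 5, strides=2, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(num_filters, 5, strides=2, padding="same"),  # no nonlinearity on the last layer
    ])

def synthesis_transform(num_filters=128):
    # Mirrors the analysis transform and maps the (perturbed) latents back to an image.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2DTranspose(num_filters, 5, strides=2, padding="same", activation="relu"),
        tf.keras.layers.Conv2DTranspose(num_filters, 5, strides=2, padding="same", activation="relu"),
        tf.keras.layers.Conv2DTranspose(3, 9, strides=4, padding="same"),
    ])

x = tf.random.uniform((1, 256, 256, 3))
y = analysis_transform()(x)        # latent representation, here of shape (1, 16, 16, 128)
x_hat = synthesis_transform()(y)   # reconstruction, shape (1, 256, 256, 3)
print(y.shape, x_hat.shape)
```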
This is the nonlinear transform. You can see that the image is lacking a lot of detail, but again it manages to keep a relatively natural look; it comes out looking natural, just a lot smoother than it should be. Here is a close-up. What is quite nice is how the model seems to learn something about object boundaries: if you look up here, it is not just throwing out high-frequency components, because the edges here are about as sharp as the ones over here. For some reason it has learned to simplify the shape instead, and that seems to be a property of all of these types of nonlinear transforms.

Okay, so far we have generalized linear transform coding to nonlinear transforms, and what we will see next is how we can generalize it even further, to something called representation learning. Here is a definition of representation learning: learned representations aim to make it easier to extract useful information when building classifiers or other predictors, and in the case of probabilistic models, a good representation is often one that captures the underlying explanatory factors of the observed input. As you can see, there is not really a crisp definition, but in general the goal is to find an alternative representation of the data, in this case images. If you aim to classify images, for example, you might want a representation in which the classes are nicely separated; if you want to do compression, you might want a representation that is decorrelated; and so on.

One concept that has been around for decades in representation learning is the autoencoder. It goes all the way back to the 1980s, when Geoff Hinton and others explored all kinds of variations of it, but today an autoencoder is generally understood to do dimensionality reduction. It is quite similar to transform coding, except that the transforms were always assumed to be nonlinear, and there is no entropy coding or probabilistic modeling in the latent space; instead, the goal is just to reduce dimensionality. All you do is set it up so that the transformed vector has a smaller dimensionality than the input, which means the loss function only has the distortion term. There is also somewhat different terminology: the analysis transform is called the encoder and the synthesis transform the decoder, and the transform domain is called the latent space, or sometimes the bottleneck, which makes sense if you think of it in terms of dimensionality reduction.

A more recent concept is the variational autoencoder. That is a model combining ideas from the autoencoder with variational Bayesian inference, or, if you will, it augments the autoencoder with a probabilistic setting. Going in one direction, from the latent space to the data space, you have a Bayesian generative model: you assume the latents are distributed according to some prior p(y), and the data is assumed to follow a distribution conditioned on that representation, typically on a transformation of the latents with a neural network. Going in the other direction, and this is where the innovation of the variational autoencoder comes in, you have an approximate posterior with a closed form; you are essentially imposing that the latents are modeled by a conditional distribution that is structured in the same way.
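In symbols, and using my own notation rather than anything from the slides, the two directions look roughly like this, with g_s the generative (decoder) network and g_a the inference (encoder) network; the Gaussian forms are the typical choice, not the only one:

```latex
\begin{aligned}
\text{generative model:}\quad & y \sim p(y), \qquad
  x \sim p(x \mid y) = \mathcal{N}\!\big(x;\; g_s(y;\boldsymbol\theta),\; \sigma^2 I\big)\\
\text{approximate posterior:}\quad & q(y \mid x) = \mathcal{N}\!\big(y;\; g_a(x;\boldsymbol\phi),\; \operatorname{diag}\!\big(\sigma_q^2(x)\big)\big)
\end{aligned}
```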
The difference to a traditional Bayesian generative model is that you can do relatively cheap approximate inference once you have trained the system. If this is a little too abstract, consider ICA: independent component analysis is essentially a special case of this kind of variational autoencoder. In ICA you typically have a mixture of, say, audio sources that are assumed to be independent, so you have a factorial distribution, and the generative model says you mix these sources with a mixing matrix and observe the mixture with some additive noise. Once you have figured out what your mixing matrix is, you can go in the inverse direction: you invert the matrix and recover the original sources. The variational autoencoder is like a generalization of this type of model, except that these sources do not have to correspond to any real source.

Now it turns out that the rate-distortion optimization problem of nonlinear transform coding is a very close cousin of the variational autoencoder, and you can see this by looking at the loss function. Here we have the rate-distortion loss from before, and the crucial point is that you can take the distortion, in this case the squared error, and reinterpret it as a Gaussian: you just need to add or drop some additive constants corresponding to the normalizer of the distribution, and you can write it as the negative log of a Gaussian whose mean comes from the transformed latents. With that, you have written the data as being modeled by a conditional distribution. If you write it even more generally, you see that you have two logarithmic terms, one is p(y) and the other is p(x given y), and that is basically the loss function of a Bayesian generative model. I am simplifying a bit here, because a variational autoencoder has some additional terms corresponding to the approximate posterior, but the basic intuition is that the connection between the rate-distortion loss and the variational autoencoder is that the distortion is analogous to the negative log-likelihood.
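Concretely, and again in my own shorthand, the correspondence is that minimizing a squared error weighted by lambda is the same, up to additive constants, as maximizing a Gaussian log-likelihood whose variance is tied to lambda:

```latex
\begin{aligned}
\lambda \,\| x - g_s(\hat y) \|_2^2
  &= -\log \mathcal{N}\!\big(x;\; g_s(\hat y),\; (2\lambda)^{-1} I\big) + \text{const},\\[4pt]
\text{so}\quad
\mathbb{E}\big[-\log p(\hat y)\big] + \lambda\, \mathbb{E}\big[\|x-\hat x\|_2^2\big]
  &= \mathbb{E}\big[-\log p(\hat y) - \log p(x \mid \hat y)\big] + \text{const}.
\end{aligned}
```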
We have already seen this morning that there are many other probabilistic representations, and you could use many of those, in the same way or in other ways, to construct compression models. We have seen mixtures of experts, we have seen RBMs, we have seen GANs, for instance, and a lot of these models can even be combined, so it is very interesting to look into this space and see what other kinds of models could inspire new compression methods.

Okay, so probably 95% of you are thinking: how good is learned image compression? Give me the numbers. At the moment, the state of the art in learned image compression is one of our papers, published earlier this year at ICLR. What it does is extend the variational autoencoder with a hyperprior, turning the simple variational autoencoder into a hierarchical Bayesian model. In compression lingo, we are improving the entropy model of the autoencoder by sending some side information: there is now a second bit stream, so we first send that, use it to construct the entropy coder, and then send the actual representation. Here again you can keep up the analogy between variational autoencoders and compression methods, and with this technique we get a significant performance boost. As you can see, the simple nonlinear transform coder, the purple line here with a factorized prior, is just a bit better than JPEG 2000, the green line, and with the hyperprior model we jump up quite a bit and are now almost as good as HEVC intra. This is a big step, and I hope we can keep making steps like this; it is actually quite surprising that we got this far, because the whole development took maybe two years, starting from scratch.

So far this is PSNR. What about other metrics? MS-SSIM is another popular metric, and if you evaluate that same model against HEVC intra, you see that, surprisingly, we are basically just as good as HEVC intra. It is surprising because we optimized the system for mean squared error, so it is a bit strange that we come out comparatively better against HEVC under MS-SSIM. But that leads to another interesting observation: we can take the model and optimize it directly for MS-SSIM, and that is not very hard, because all we have to do is take the loss function and replace the mean squared error part with an MS-SSIM part. We keep the model exactly the same and just switch out the loss function, and that gets us up here; as far as I know, that is the best result anyone has gotten in terms of MS-SSIM on this data set. The nice thing about this flavor of learned image compression is that we can directly optimize for any metric, as long as it is differentiable, and there is no surprise here: if we directly optimize for a metric, we should expect to do better on it.
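That loss-function swap really is just a couple of lines. Here is a hedged sketch, not the actual training code: the rate value and the images are random placeholders, and I use TensorFlow's built-in multi-scale SSIM for the perceptual term.

```python
import tensorflow as tf

def rd_loss(x, x_hat, rate_bits, lmbda, distortion="mse"):
    """Rate-distortion loss with a switchable distortion term.
    x, x_hat: image batches in [0, 1]; rate_bits: estimated bits for the latents."""
    if distortion == "mse":
        dist = tf.reduce_mean(tf.square(x - x_hat))
    else:  # "msssim": maximize MS-SSIM by minimizing 1 - MS-SSIM
        dist = 1.0 - tf.reduce_mean(tf.image.ssim_multiscale(x, x_hat, max_val=1.0))
    return rate_bits + lmbda * dist

x = tf.random.uniform((1, 256, 256, 3))
x_hat = tf.clip_by_value(x + tf.random.normal(x.shape, stddev=0.05), 0.0, 1.0)
print(rd_loss(x, x_hat, rate_bits=10000.0, lmbda=0.01).numpy())
print(rd_loss(x, x_hat, rate_bits=10000.0, lmbda=0.01, distortion="msssim").numpy())
```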
Okay, and again 95% of you are probably going to think: well, which one actually looks better? That is an excellent question. We do not have direct subjective tests on that, but we have some interesting observations. This is a reconstructed image from the model optimized for squared error, again at a fairly low bitrate, and then I can switch over to the one optimized for MS-SSIM; I hope you can see the difference on this projector. This is the same model, we just switched out the metric we optimized for. If I go back and forth a little, you should be able to see that the MS-SSIM-optimized model puts quite a bit more detail into the face and the sweater. The difference really comes down to rate allocation, and that is basically how you get rate allocation in learned compression models: you change the distortion metric, and the model tries to do better at that metric by pushing bits around. In this case, the model allowed a tone shift to happen in the background and instead spent the bits on the face and the sweater; if I go back, you will see there is a tone shift. This kind of change is actually quite good, and it works for a lot of images. However, there are also cases where it does not do what we want. In this image you should again be able to see that MS-SSIM tends to put more bits into textures: if I switch back and forth, you should see more detail in the grass with MS-SSIM. The problem is where the bits are taken away from, and you might see that it is taking bits away from the text. You cannot really blame the model for that, but for humans text is special: we want to read it, it has semantic meaning, so for us the text is really important, and in this particular image we would not want to move bits into the texture. So again, it is really nice that we can directly optimize for different distortion metrics, but now the burden is on the distortion metric: we need a metric that actually does what we want.

All right, but surely these methods must be computationally expensive? Well, first of all, don't call me Shirley, and second, yes, they are somewhat expensive; many of the published methods are relatively slow. The state-of-the-art model I just presented takes about 330 milliseconds to encode or decode a 512 by 768 pixel image on a desktop CPU. We heard this morning that there has been a lot of work on making neural networks faster, including pruning units and reducing precision, but many of these techniques do not perform as well on compression networks, so a lot more research is needed in that space. We also cannot take the easy way out and run these models in the cloud, because that would kind of defeat the purpose of compression. In general, though, we are seeing more GPUs and accelerators becoming available even in mobile devices, so we can hope that time works in our favor and that at some point we will have enough access to these optimized devices to run these things in practice. And here is some data showing that if you do have access to a GPU, encoding and decoding times can be quite reasonable. This is a paper published last year by a startup called WaveOne; they are no longer the state of the art in terms of compression efficiency, but they did a pretty good job of optimizing their system for speed: they report running their encoder or decoder in under 20 milliseconds on an image like this, on a medium-range gaming GPU, and that figure includes the entropy coding. So we have some evidence that these models can run in practice, but of course a lot more work needs to be done.

Cool, but my retrieval, classification, segmentation, and-so-on algorithm also uses ANNs; what's up with that? There are some benefits to the autoencoder-style, nonlinear transform coding setup. For instance, it is pretty easy to train a classifier or a segmentation algorithm directly on the latent space, instead of having to decode the image first. This study here compares floating-point operations directly against classification or segmentation accuracy, and it finds that you can save some computational complexity if you run your classifier directly in the compressed domain. We can take this a little further: for some applications it could be interesting not to compress the image itself, but to compress just enough data to analyze the image. For example, in image retrieval applications you might want to store only a compressed feature representation and then run the retrieval algorithm on that. In this study, we train a neural network to classify images, so you go from the data space to the class labels, but instead of going straight there, it goes through an intermediate representation, and this representation is compressed; I call this the entropy bottleneck, but essentially we are doing the same thing as in the nonlinear transform coding model. Once we are in the intermediate representation and have compressed and decompressed it, we go on to the class labels. The whole system is trained jointly, end-to-end, and this time we do not have a rate-distortion trade-off but a rate-accuracy trade-off, as sketched below.
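Here is a minimal sketch of that rate-accuracy objective; the shapes, the rate value, and the classifier head are hypothetical placeholders, and the rate term would come from the same noise-plus-entropy-model machinery as before:

```python
import tensorflow as tf

def rate_accuracy_loss(rate_bits, logits, labels, lmbda):
    """Trade off the bits spent on the compressed feature representation against
    classification performance (cross-entropy), instead of against distortion."""
    ce = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    return rate_bits + lmbda * ce

# Hypothetical values: estimated bits for a batch of compressed feature maps,
# plus 1000-way class logits predicted from those features.
logits = tf.random.normal((8, 1000))
labels = tf.random.uniform((8,), maxval=1000, dtype=tf.int32)
print(rate_accuracy_loss(rate_bits=50000.0, logits=logits, labels=labels, lmbda=1000.0).numpy())
```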
It turns out that this saves an enormous amount of space. Previously, if you wanted to store intermediate features like this, you would store them in floating-point format; for the ImageNet validation set, for example, you would need around 38 gigabytes to store all these features. If you do something simple like compressing them with gzip, you go down to 7 gigabytes, but when we allow this model to learn a specialized compressed representation for these features, you can bring it down to 850 megabytes with the same classification accuracy.

Okay: I have seen all these fabulous images synthesized by neural networks; can compression models do that too? If you have seen really convincing synthesized images, chances are they came from a generative adversarial network, and we have seen a talk about this technique this morning. A GAN is essentially another type of model where, in a nutshell, you sample a random vector and call it your latent representation, and then a generator network takes this latent representation and transforms it into an image. The way the generator is trained is that it is pitted against another network, called the discriminator: the discriminator randomly receives either an image from the training set or an image generated by the other network, and its job is to figure out which one it is. So you have two networks fighting against each other, and the ultimate goal is to reach a sort of equilibrium in which the generator has been trained to produce realistic-looking images. This is an implicit model: you can sample from it, but you cannot really evaluate the likelihood, and you cannot even go from an image to its latent representation. What we saw in the paper this morning is that we can create a hybrid kind of model, where we keep the autoencoder-style setup but use a discriminator network to try to make the reconstructed images more plausible. Note that training a model like this is not quite straightforward: these GAN losses are typically unstable, so what people do in practice is add more constraints to these models, for example an additional squared-error or MS-SSIM loss during training, plus other heuristics that make these models train in practice. But when it works, the results can be quite stunning.
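To make the hybrid setup a bit more concrete, here is a rough sketch of the kind of losses people combine in practice; the weights, the non-saturating GAN formulation, and the random placeholder inputs are generic choices of my own, not the specific recipe of any of the papers mentioned:

```python
import tensorflow as tf

def hybrid_generator_loss(x, x_hat, rate_bits, disc_logits_fake, lmbda, beta):
    """Rate-distortion loss plus a non-saturating adversarial term that rewards
    reconstructions the discriminator classifies as real."""
    mse = tf.reduce_mean(tf.square(x - x_hat))
    adv = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.ones_like(disc_logits_fake), logits=disc_logits_fake))
    return rate_bits + lmbda * mse + beta * adv

def discriminator_loss(disc_logits_real, disc_logits_fake):
    """Standard GAN discriminator loss: real images -> 1, reconstructions -> 0."""
    real = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.ones_like(disc_logits_real), logits=disc_logits_real)
    fake = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.zeros_like(disc_logits_fake), logits=disc_logits_fake)
    return tf.reduce_mean(real) + tf.reduce_mean(fake)

x = tf.random.uniform((2, 64, 64, 3))
x_hat = tf.clip_by_value(x + tf.random.normal(x.shape, stddev=0.1), 0.0, 1.0)
logits_real, logits_fake = tf.random.normal((2, 1)), tf.random.normal((2, 1))
print(hybrid_generator_loss(x, x_hat, 8000.0, logits_fake, lmbda=100.0, beta=1.0).numpy())
print(discriminator_loss(logits_real, logits_fake).numpy())
```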
I have one result here that I want to show you. This is a fairly recent paper that just appeared on arXiv; the authors used this kind of setup and applied the model to a narrower domain of images, in this case single frames of street-scene videos. This is an HEVC intra-coded version of the image at a very low bitrate, roughly 2.5 kilobytes, and here is the same image compressed with the GAN-based method. You can see right away that there is a lot more detail in this image. Let me go back and forth again: there is much more detail in the trees, much more detail in the buildings, you even get the Mercedes-Benz star down here. It looks a lot more pleasing, a lot more plausible as an actual image. But if I compare it to the original, you see that it actually looks quite different: you get a lot more detail, but it is very different from the original. The foliage in the trees changes, the texture on the buildings changes; you cannot really see it in this image, but sometimes cars change colors, all kinds of things happen. So it is a really interesting result, and maybe one key to getting these realistic-looking images is that it is easier for the model to generate realistic detail if you constrain the domain of the images: if you only ever show it street scenes, and in particular if there is always a Mercedes-Benz star down here, then it knows it has to generate that detail right there. It may be much easier to train models on narrow-domain data than on general images, and exactly how well this generalizes to general images is still being debated in the machine learning community.

So, as you can see, there are some exciting first results in this new field, but a lot of work still needs to be done. There is still a lot of room for improvement in the probabilistic models themselves, simply to get better compression. We have seen that better quality metrics are essential. A lot more work is needed to move this out of the academic regime and to integrate the methods better with hardware, as is already being done elsewhere, basically so that we can actually deploy them. We also really need to worry about reducing the storage cost of the model parameters, because these models have a lot of them. There are some first results that try to do video, there is work on progressive and scalable compression models, and we can of course explore more kinds of data: nothing here says you have to apply this to images, you can apply it to audio, to audiovisual data, whatever. There is even some work that explores other types of channels, or joint source-channel coding.

On that note, I want to tell you about a project that I think is really fun. This is a paper from a group at UC Berkeley, which they presented earlier this year at DCC. They have been looking at devices called phase-change memory. These devices have some advantages over flash: they use less power, they are faster, they last longer. The way such a device basically works is that you apply a voltage and then you measure a resistance, and the thing is that depending on what voltage you apply, you get a different resistance distribution. So this is a highly nonlinear analog channel, which means standard error-correcting codes cannot really be used, because they assume Gaussian channels and so on. But because this is a probabilistic relationship, and we are already using probabilistic models to design the compression method, we can integrate it seamlessly; it is actually even easier than doing quantization and arithmetic coding, because we do not have to deal with the discretization aspect. What they basically do is replace the quantization and the arithmetic coding with a probabilistic model of the voltage-resistance relationship. The results are not yet anything to write home about, but they have some encouraging first results showing that they can do better than JPEG combined with a standard channel code.
Okay, so if you now think you might want to try some of this out: we have just started to release source code on GitHub, and we are calling it the TensorFlow Compression library. This is a project from my team at Google. You can download it and train your own ML models; you can train your own nonlinear transform coder, or other types of machine learning models that have data compression built in. We are going to update it in the future, with some additions coming in the next couple of weeks, and it is completely open source, so if you want to contribute you are very welcome. If you want announcements whenever we release new code, there is also a Google group you can subscribe to.
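As a pointer, here is roughly what training with the library's entropy bottleneck looked like in the early releases; the exact API has changed over versions, so treat this as an illustrative sketch rather than current documentation, and `analysis` and `synthesis` stand for transforms like the ones sketched earlier:

```python
import tensorflow as tf
import tensorflow_compression as tfc

def rd_loss_with_entropy_bottleneck(x, analysis, synthesis, lmbda, training=True):
    # The EntropyBottleneck layer learns a density model for the latents; in the early
    # API it returned the (noisy or quantized) latents plus their likelihoods.
    entropy_bottleneck = tfc.EntropyBottleneck()
    y = analysis(x)
    y_tilde, likelihoods = entropy_bottleneck(y, training=training)
    x_hat = synthesis(y_tilde)
    bits = tf.reduce_sum(-tf.math.log(likelihoods)) / tf.math.log(2.0)
    mse = tf.reduce_mean(tf.square(x - x_hat))
    return bits + lmbda * mse
```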
Okay, so now some words about CLIC, our Challenge on Learned Image Compression, last week at CVPR. The challenge was organized by a mixed team of people from Google, from Twitter, and from ETH Zürich, and we all put in a lot of work to make it happen. We were also very honored to have four excellent speakers, and really glad that they all agreed to come.

So why would we organize an image compression challenge just for this? Basically, as I hope I have convinced you, we have seen some really encouraging results and we want to explore this direction further. We want to encourage both machine learning and compression researchers to work in this field: we want to introduce compression to machine learning people, and we want to introduce learned compression to compression people. In particular, we want to make sure that certain caveats of compression are taken care of: for example, that there is no information leakage from the encoder to the decoder, and that bitstreams can always be decoded. On top of that, in machine learning you typically train on fairly small images and tend to ignore things like the boundary handling of convolutions, which is not a good idea in image compression, so we need some standard way of evaluating all of these things consistently.

How did the challenge proceed? We released a set of images that can be used for training, and the participants then had a couple of months to train their models and tune them so that they hit 0.15 bits per pixel, the target bitrate for the challenge. After a few months we released a test set. We did not ask participants to send us their reconstructions, because that would have allowed them to cheat; instead we asked them to send us their image decoders and the compressed bitstreams, we ran the decoders in an isolated environment so that there was no way of cheating, and then we used the decoded images for evaluation. Here is a breakdown of the kinds of methods that were submitted: we got a lot of submissions where people used existing codecs, HEVC and so on, and then developed some post-processing on top, for example super-resolution or artifact-removal models; a second chunk of submissions were true end-to-end trained methods; and then we also had, I think, one traditional method, and one traditional method that included a couple of learned coding tools.

For the evaluation, we first computed PSNR and MS-SSIM, and then the top methods in both metrics were selected for subjective testing. There we had 17 human raters per image; the raters had five choices of score, each image was shown for two seconds, and then there was a short break followed by the scoring. Here is an interesting statistic: these are the decoding times versus the decoder sizes in megabytes, and this is where HEVC intra sits, so you can see there is still a lot of work to be done. The largest decoder was 119 megabytes, and the longest time for decoding the test set was 61 hours; that is not quite a CPU week or a CPU decade, which is good, I hope, but no doubt we have to do a lot better than this.

There were five awards. The first two, best MS-SSIM and best MOS, both went to a team that built on top of the hyperprior model I talked about earlier, so this was an end-to-end trained method. The award for best PSNR went to a traditional method with some learned coding tools inside it. The fastest method among the top five was a traditional method. And we had another award for a method that got pretty good results in both PSNR and MOS, which used HEVC intra with a learned post-processing network. In summary, we had 31 submissions to the challenge, 14 challenge-track papers, and 9 papers without a challenge submission; a number of methods were better than HEVC intra at this target bitrate, and the MOS and MS-SSIM winner ended up being an end-to-end trained method. And that is all I have. Thank you.