Transcript for:
Deep Operator Networks with George Karniadakis

Welcome, and welcome to our speaker, George Karniadakis. George is probably really Yorgos, I'm guessing. He's an applied mathematician with wide-ranging interests: they range from stochastic differential equations applied to various physics and life-science problems, computational fluid dynamics figures in there heavily, and more recently — meaning in the past decade plus — a lot of work on machine learning for scientific applications, and that, I think, will be the category that encompasses today's talk. He was an undergraduate at the National Technical University of Athens in mechanical engineering and came to MIT to get his master's and PhD, also in mechanical engineering; I think he did these under Anthony Patera and Borivoje Mikić. He then held a number of positions, including a faculty position at Princeton, and ended up where he is now, at Brown University, as a professor in applied mathematics — a fancy named professorship, but with two long names that I can't remember at the moment. We're very happy that he's going to talk to us about what he has dubbed DeepONet. Yorgos, welcome.

Yeah, thank you very much. I believe we are at the crossroads of AI right now, and if we want to be critical, I would say we are at the stagnation point. I'd like to give you an example: the recent GPT-3 from OpenAI. It's about a hundred times bigger than GPT-2; it has 175 billion — with a B — parameters; training it takes about 350 GPU-years and, in terms of money, about 5 million dollars. And according to their savvy CEO Sam Altman, who tweeted about it recently, GPT-3 still makes silly mistakes. So for true intelligence, scaling up is one avenue, but there are limits; it requires a higher level of abstraction, and such abstractions — this is my thesis — can be effectively represented by nonlinear operators, that is, nonlinear mappings from one or multiple function spaces to another.

So imagine, for example, in this picture that I have a robot and you try to endow this robot with mathematical intelligence. Would you teach this robot calculus? I asked my daughter, who is finishing up high school; she said, God, no — it's very tedious. Well, if you do that, the robot will try to do what I used to do — solve PDEs numerically and so on — to predict something, to move somewhere, to check the weather. But that will take an enormous amount of computing, so this robot has to go around with an exascale computer in its head, and that's a lot of energy, a lot of power. And you all know here — you are the people deconstructing the brain to understand it — that even a small chocolate bar provides sufficient energy for human intelligence, for all sorts of operations. So the energetics come into it as well.

Let me go to the second slide; I hope you can see it. It shows the universal approximation theorem for functions, and almost every neural-network paper published today — maybe a thousand per week; last year I checked, it was 100,000 papers on neural networks — is based on the universal approximation theory for functions. But what I want to tell you today is something else. I want to give you a higher-level approximation: universal approximation for functionals and nonlinear operators. So why is it different? Well, if you look at what we are doing today for image classification, we take an image, say from R^{d1}, and we map it to a label in R^{d2}.
When you deal with an operator, you map a function — an infinite-dimensional object — to another infinite-dimensional space, and you can have multiple functions as input and multiple functions as output. So it's a very different setup, and obviously a higher level of abstraction. What could this operator be? It could be as simple as a derivative — talking about calculus — or an integral, and I'll show you some examples. It could be a complex dynamical system; it could be an ODE, an ordinary differential equation, a partial differential equation, a stochastic differential equation, a fractional differential equation — if I have time I'll show you all of that. But it could also be a biological system we don't quite understand, where we map x(t, space) to y(t, space). It could be a social system, or a system of systems, and I'll try to give you an example of that at the end. So then: can we learn these operators with neural networks, and how do we do that?

Broadly speaking, I'd like to get to this question — can we learn operators, how do we do it, how fast can we do it — but before I get there I want to give you a couple of teasers, because I know some of the groups here are working on generalization, which is a big question. So I would like to address the generalization question; that's also what I'm trying to do with these operators: I want to extrapolate, to go outside the distribution space, and to have a small generalization error there.

So, before I get to the operators, a very brief overview of two topics I've been working on recently. One is just a standard classification problem — you can see here K categories — and I'm trying to quantify the generalization error, but in a very different way than you usually see, for example when people try to stabilize stochastic gradient descent or other methods. I will try something different here; I want to give you some introduction to it, and then hopefully you'll be interested in our paper that was published recently. This schematic shows the error, which can be broken down: this is the hypothesis space, this is the approximation error for the network size — you can make that smaller — but the big elephant, of course, is the generalization error. How can you handle that? So here there are no operators yet and no physics — just pure classification. How do we approach it? As I said, from a very different point of view: just as the title of the paper says, we try to quantify the data distribution and also the smoothness of the network, and we introduce some new concepts, if you like.

For example, we introduce the probability of the neighborhood in the training set. In panel A, if we have a datum here, I take a radius r; the probability is plotted here, and the integral under this curve, starting from r equal to zero, gives me what I call the total cover. Now, if I have two classes, red and blue, I want to introduce the self-cover: T_i is the testing set for the i-th class, mu_i is the corresponding measure, so the self-cover would be something like this — it's just notation for now. And then, correspondingly, in panel D I also introduce the mutual cover: now you can see how the red class interacts with the blue class.
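As a loose numerical reading of the cover quantities he sketches — not the paper's actual definitions — one could treat the "cover" at radius r as the fraction of test points of a class that have a training neighbor within r, integrate that curve over r, and compute it against same-class training data (self-cover) and other-class training data (mutual cover). The function names, the cutoff radius, and this particular reading are my own.

```python
import numpy as np

def coverage_curve(test_pts, train_pts, radii):
    """Fraction of test points with at least one training point within radius r,
    for each r in `radii` (one plausible reading of the 'neighborhood probability')."""
    d = np.linalg.norm(test_pts[:, None, :] - train_pts[None, :, :], axis=-1)
    nearest = d.min(axis=1)                    # distance to closest training point
    return np.array([(nearest <= r).mean() for r in radii])

def self_and_mutual_cover(test_x, test_y, train_x, train_y, cls, radii):
    """Integrate the coverage curve over r, against same-class and other-class data."""
    t = test_x[test_y == cls]
    same, other = train_x[train_y == cls], train_x[train_y != cls]
    trapz = lambda c: np.sum((c[1:] + c[:-1]) / 2 * np.diff(radii))
    return trapz(coverage_curve(t, same, radii)), trapz(coverage_curve(t, other, radii))

# toy example: two 2-D Gaussian blobs
rng = np.random.default_rng(0)
x0 = rng.normal([0, 0], 1.0, (200, 2)); x1 = rng.normal([3, 0], 1.0, (200, 2))
X = np.vstack([x0, x1]); y = np.array([0] * 200 + [1] * 200)
idx = rng.permutation(400); tr, te = idx[:300], idx[300:]
radii = np.linspace(0.0, 3.0, 50)
for c in (0, 1):
    s, m = self_and_mutual_cover(X[te], y[te], X[tr], y[tr], c, radii)
    print(f"class {c}: self-cover={s:.2f}, mutual cover={m:.2f}")
```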
I also need to see how sparse my data are: how far is label one from label two? Ideally I would like to know the true median distance, but I don't have enough data to be very precise, so I introduce an empirical distance delta_T. So that's the data distribution. Finally, I introduce something about the inverse modulus of continuity of the network f: if I have a change epsilon in the network output, I have a delta_f due to that epsilon — so delta_f here is my inverse modulus of continuity.

So what we did is prove the following theorem. First we define the cover difference, which is basically, averaged over the number of classes, the self-cover minus the mutual cover; then I define the cover complexity — if it's zero, the problem is very, very easy to predict, and so on. Here is the first theorem, which uses this assumption; in the paper we justify it. One rather strong assumption we use is that the maximum cross-entropy loss is bounded — but in the experiments we just took the average; we don't need the maximum, because that's a very strict constraint. The main result is the middle bullet: the error is bounded by a coefficient that depends on the training set and on the smoothness, times this cover complexity that we defined; note that alpha of T depends on these delta parameters and on the smoothness, and I will connect that in a moment. I wanted to keep this simple — I cannot explain everything here — but just to show you what happens in this framework: we found, surprisingly, that the error, just as the theorem predicts, grows linearly with the cover complexity for all these familiar benchmarks, as you can see here for the different cases — 10 classes, 20 classes, 100 classes. That's what the theorem provides. Then, empirically, we found that if we normalize the error by the square root of the number of classes, everything collapses onto one curve, so we have a universal master curve for all these cases. Is it universal for everything? We don't know; there are some gaps — the theory is not totally complete yet — but I think it's a new theory.

Now I want to make the connection: that was the distribution of the data, the cover complexity. We also try to connect this to the smoothness of the network. Of course it's very difficult to characterize delta_f directly, but in an inequality that we prove we can find a lower bound for this modulus of continuity in terms of the loss function and the weights, which we can compute — so this capital Delta_f is a computable quantity. What I have plotted here for this MNIST training set is the testing loss, the blue curve; you can see the minimum point and then the overfitting. Now, how does this relate to the smoothness of the network? The red curve is a measure of that smoothness — of course it has the loss in it, but the loss matters where it is big — and right where the red curve starts decaying, what we see is a loss of smoothness of the network. It is not a coincidence that the point where overfitting starts is where the smoothness of the network drops. So if you go back, you can relate this delta to the constant we talked about, which is hidden in there, and therefore one can connect those two things, namely the data distribution and the smoothness of the network.
Again, the theory is not complete, but I was hoping to show it here so that one of the smart MIT students can take it and advance it further. So that's one type of generalization — just a different approach to the same problem.

The next kind of generalization is something different; it's what Kenny mentioned earlier: I'm interested in physical laws and unsupervised learning. This is how we published this paper — we call it PINNs, physics-informed neural networks. It's being used now by many different industries, from NVIDIA, which has a parallel code based on it, to Ansys, the biggest simulation software company in the world, and so on, for physical problems. It's agnostic to the type of physics, actually, but the physics is my regularization. So what is a PINN? I wear a pin — I don't know if you can see my t-shirt — but it's a very simple thing. It's a neural network; let's say you're trying to learn u(x, t). If you have lots of data, you have a corresponding loss, the misfit of the data, and so on. But we never have enough data in science — you know it's very expensive, and it's not reproducible, so we have very little data. What we do have is conservation of mass, momentum, energy, and so on; here I show an example with a parameterized equation that has to be satisfied. By insisting on this, I get another loss, the residual of this conservation law, which I can weight into the total loss, and that way I can improvise for not having data.

I gave a talk recently at the Army installation here at Natick; they were talking about autonomy, and whether you can actually be autonomous without any physics at all. I told them no. Here's a simple example: you can learn how to solve this ODE and predict inside the training domain, but if you go outside the domain the error is huge. It turns out that with the PINN approach there is no problem, because, unsupervised, you follow exactly the trajectory you want for your vehicle, so to speak. Now, what's interesting is that if you are outside the parametric space — I have lambda as a parameter; you train on a certain parametric set and then go outside it — the errors are not as catastrophic: you see some errors, but not big ones. So it's different from going outside the domain itself. In short, it's better to PINN if you can.

And here's an example we published recently in Science, something very timely: what can you do with this type of approach? I call it hidden fluid mechanics, because it's a hidden-Markov-process type of idea. I use auxiliary data, like smoke, or thermal gradients from your breathing or from your coffee; this one is from one of our collaborators at LaVision, a German company. I don't know if you can see the movies playing, but I can tell you what the pressure and the velocity are, just using this PINN approach, combining physics and data. The only data I use is the data in the video, but then I can infer a lot of other things. Did you see the movie, Kenny? Yep, looks good. Okay.
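As a rough sketch of the composite loss just described — a data misfit plus a weighted residual of the governing equation, with the residual obtained by automatic differentiation rather than a grid — here is a minimal PyTorch version for a toy ODE du/dt = -λu. The equation, network size, optimizer, and weight are illustrative choices, not the setup from the talk.

```python
import torch

torch.manual_seed(0)
lam = 1.0                                   # toy parameter in du/dt = -lam * u

net = torch.nn.Sequential(                  # u_theta(t): small fully connected net
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1))

# a few "measurements" (synthetic here: exact solution exp(-lam t) plus noise)
t_data = torch.rand(10, 1)
u_data = torch.exp(-lam * t_data) + 0.01 * torch.randn(10, 1)

# collocation points where only the physics (the ODE residual) is enforced
t_col = torch.linspace(0.0, 2.0, 100).reshape(-1, 1).requires_grad_(True)

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
w_phys = 1.0                                # relative weight of the residual loss
for step in range(2000):
    opt.zero_grad()
    loss_data = ((net(t_data) - u_data) ** 2).mean()          # data misfit
    u = net(t_col)                                             # residual via autodiff
    du_dt = torch.autograd.grad(u, t_col, grad_outputs=torch.ones_like(u),
                                create_graph=True)[0]
    loss_phys = ((du_dt + lam * u) ** 2).mean()
    (loss_data + w_phys * loss_phys).backward()
    opt.step()
```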
And I prepared this espresso, as I said, for Tommy — Professor Poggio — I know he likes espresso, like me. Recently I was doing a project with LaVision: they took 3D Schlieren photography over an espresso cup, and we were curious to see what the maximum velocity and the pressure are — that kind of physics question. Have you compared it to Greek coffee? No, but there was a controversy, because we predicted that the maximum velocity was 0.4 meters per second, which sounds really, really fast, and they didn't believe us. So they went back and did an experiment with particle image velocimetry, and indeed they found 0.4 to 0.45, after our predictions. Anyway, it compared fairly well, also on the pressure and so on. This was just a fun project. We're doing more biomedical projects — this is a brain aneurysm from Children's Hospital data — but I'll skip that; it's the same idea. I want to go back to operators.

So what I have said so far is: I'm interested in generalization, like everyone else; there are ways to generalize; generalization errors are hard to pin down; and I want to resort to operators to make a big jump. You should now be seeing a slide that says "problem setup." Here G is the operator I'm looking for; u is a function in some compact domain, which I will define later; and we have this mapping from u to G(u), evaluated at y — G(u)(y) is the output of the operator. The setup is the following. We will train this — there's no physics now; it's all data-driven — with a lot of functions u: a first one, a second one, a third one. We observe the output at some points. Then you give me another function from that space, which I have yet to define, and I have to be able to give you G(u)(y). That establishes that I have learned the mapping between u and the output of the operator over this space of y.

Now, I went back to the literature and found this theorem, and I don't know how many of you who do theoretical machine learning have ever run into it. One of my collaborators, who was taking a machine learning course at MIT, asked the instructor, and the instructor had no idea that you can actually approximate functionals and operators. But Chen and Chen, back at Fudan University in the early 90s, developed this theory — first for functionals, and here I show the version for system identification of nonlinear operators. Basically the theorem says the following. Imagine you have a compact space V — remember the function u I showed you; that function lives in this compact space V — and you're trying to identify a nonlinear continuous operator G. Here G could be an explicit operator, an implicit operator, or a totally undescribable operator; I'll show you examples. Remember, Hornik and others at that time were developing the theory of function approximation; Chen and Chen developed this theorem, which shows that a single-layer construction — actually two neural networks, each with a single layer — can approximate arbitrarily closely this continuous operator G(u)(y). It is approximated by a branch and a trunk: notice that this is one layer for the output and one layer for the input; these are two different networks, which we call branch and trunk. This can be done for any u in the compact space V and any y in this set K2, which sits in R^d. What does this mean? I interpret it here: I can think of it as a cross product of the outputs of the trunk and the branch. So if we look at panel D, what we have is a branch network, where we take the function u and observe it at m points — we call them sensors; let's say you have m sensors where you observe the function.
Those observations feed the branch network. But we also need to say something about the output space, because we need labeled data, so I have to provide some values G(u)(y). So you see, I have p points at the output, m points where I observe the input, and n is the number of neurons, if you like, for this network. I pipe them through these two different networks, take the cross product, and that gives the output.

So let's review again. That was a single-layer construction, but what my team did recently, under a grant from DARPA, was extend it to deep neural networks: we basically replace the single-layer networks with deep ones — one network for the branch and another for the trunk — and these can now be very general networks of any type, in fact any class of functions that satisfies the classical universal approximation theorem. So the classical approximation theory carries over into our networks. Of course, this is a kind of network of networks, a composite network. I'll show it again, but first you have to define the sample space: the input space V is a compact space for the theorem, though it turns out that in practice it does not need to be — for example, I can do a Laplace transform — and in fact, as you can see here, I use Gaussian random fields to approximate my space V.

But I want to see: if you commit an error in representing the space V — because you cannot exhaust that space, right? it's an infinite space — how big is that error? So I take this ds/dx equal to some right-hand side, and I sample my u uniformly. For the special case of a Gaussian process with a squared-exponential kernel of correlation length l, I can show that this constant kappa, which depends on the space V and the number of sensors, is basically quadratic in the number of points — one over m squared — and quadratic in the correlation length. And then I can prove a theorem, for this case only, that the error of the neural-network approximation is indeed bounded above by the error with which we sample the space V: quadratic in the number of observation points for that function and quadratic in the correlation length. So that makes sense. Of course there are many different ways of representing V: the V space could be a neural network itself, it could be wavelets, radial basis functions, spectral expansions — if I have time I'll show you some of it.

So let's recap. We want to find the operator that gives a nonlinear mapping, in general, from u to G(u) — from R^d to R, let's say. Down here, panel B gives you an idea of what we have; it summarizes what I told you already. The left column is the training data: I observe one function at m points, another function at those points — I may observe ten thousand functions. Correspondingly, I have to observe the output G(u)(y), but notice that I may have a hundred points to observe the input and only two, three, or four points to observe the output — we are very spartan on the output. I'm from Crete, actually, but I use "spartan" here as an analogy for having just a little data. And what I have on the left is the input — the function u, in the space V — and the output G(u)(y). What I have in mind is a simple ODE, just to explain what we're doing.
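Here is a minimal sketch of the branch–trunk construction just described, assuming PyTorch: the branch takes u sampled at m fixed sensors, the trunk takes the query location y, and the prediction is the dot product of their p-dimensional outputs plus a bias. The widths, depths, and activation are my own illustrative guesses, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    """Unstacked DeepONet: G(u)(y) ~ <branch(u(x_1..x_m)), trunk(y)> + bias."""
    def __init__(self, m_sensors: int, dim_y: int = 1, p: int = 40, width: int = 100):
        super().__init__()
        self.branch = nn.Sequential(            # encodes the input function u
            nn.Linear(m_sensors, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, p))
        self.trunk = nn.Sequential(             # encodes the output location y
            nn.Linear(dim_y, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, p))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, u_sensors, y):
        # u_sensors: (batch, m_sensors) samples of u at fixed sensor locations
        # y:         (batch, dim_y)     query point for the output function
        b = self.branch(u_sensors)
        t = self.trunk(y)
        return (b * t).sum(dim=-1, keepdim=True) + self.bias

# one forward pass on random data, just to show the shapes
model = DeepONet(m_sensors=100)
u = torch.randn(16, 100)    # 16 input functions, each observed at 100 sensors
y = torch.rand(16, 1)       # one query location per function
print(model(u, y).shape)    # -> torch.Size([16, 1])
```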
And here's an example. I will compare different neural networks that are out there. Let's say I want to find the integral operator: I want to build a neural network that approximates it — one-dimensional here, but others have done multi-dimensional, so don't worry about the complexity; this is just a pedagogical example. I want the integral from 0 to x, where x can lie in some range: the integrand u(x) goes in, and the antiderivative s(x), which depends on x, comes out. So it's a map from u(x) to s(x), and capital G is that integral, which of course comes from the derivative definition ds/dx = u(x). How do we do this in practice? I take one function and represent it with 100 points — the simplest possible case, just to introduce the concept — I take 10,000 functions, and I observe the output s(x) at only one random point per function.

Here's a summary of what I got with my best network, the unstacked DeepONet I just showed you: the mean squared errors in training and testing are almost on top of each other, and the error goes down to 10^-5. I compared lots of networks, and the best one is of course the one I'm showing you — that's why I show it — because the generalization gap, the difference between training and testing error, is very small. A standard fully connected feed-forward network looks like that; a ResNet is similar to the FNN; I also tried sequence-to-sequence — one of our reviewers said sequence-to-sequence works well — and it does a little better than the FNN, but not as well as this unstacked network.

Now, what happens if the sample space V is poorly represented? As I said, the first violation is that V is supposed to be compact, and I already violate that by just taking a GRF. Then I fix the correlation length to 0.5, and what I have in my basket — the space V — are these functions. So you come along and say: can you integrate this function — this would be my u(x) — using a neural network? Needless to say, once the network is trained you can spit out the answer in a fraction of a second, so cost is not an issue: all the cost is amortized a priori in the pre-training. The answer is that it depends on l: if I go outside the distribution, it depends on how well I chose l. With a very small correlation length, my error, although I'm outside the distribution, is pretty small; with a correlation length of 0.5, my error is pretty big. So obviously you want to be careful with the space V — how rich the input space is matters a great deal.

Now, what if you got lazy, or you don't have enough data to train this? We pre-trained the DeepONet: in step one we use supervised learning to pre-train it, as I described. Then in step two you have two options. One: if you know any physics, any constraints, you can do what I told you about PINNs, but only for a very short time — say 10, 20, 100 iterations, not a million iterations of your SGD. Otherwise, if somebody gives you just a few data points, you can use the DeepONet as a pre-trained network and train a little further with that extra data. We've done this — I won't bother you with all the results, but they look good. For example, if my correlation length is small, I start with about a two percent error and I can improve it from there.
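To make the antiderivative example above concrete, here is a small numpy sketch of how training pairs might be generated: sample u from a Gaussian random field with a squared-exponential kernel, record it at m sensors, integrate it numerically, and keep the value at one random query point as the label. The kernel, grid, correlation length, and counts are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100                                     # number of sensors on [0, 1]
x = np.linspace(0.0, 1.0, m)
ell = 0.2                                   # correlation length of the GRF

# squared-exponential covariance and its Cholesky factor (jitter for stability)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ell ** 2)
L = np.linalg.cholesky(K + 1e-8 * np.eye(m))

def sample_pair():
    """One training triplet: (u at the sensors, query y, label s(y) = int_0^y u)."""
    u = L @ rng.standard_normal(m)                          # one GRF realization
    s = np.concatenate([[0.0], np.cumsum((u[1:] + u[:-1]) / 2 * np.diff(x))])
    j = rng.integers(m)                                     # random query location
    return u, x[j], s[j]

# e.g. 10,000 input functions, one output observation each
data = [sample_pair() for _ in range(10_000)]
U = np.stack([d[0] for d in data])           # (10000, 100) branch inputs
Y = np.array([d[1] for d in data])[:, None]  # (10000, 1)   trunk inputs
S = np.array([d[2] for d in data])[:, None]  # (10000, 1)   labels G(u)(y)
print(U.shape, Y.shape, S.shape)
```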
Now here's another example, and a big surprise to us. This one has u(t) as the single input function and two outputs, s1 and s2 — a nonlinear problem, a nonlinear operator. I show three cases with three different networks; they all have the same depth but different widths. If I take the middle one and plot the error versus the number of training data, you see one thing: the testing error drops very fast — in fact exponentially fast initially — and then it becomes algebraic, like Monte-Carlo-type sampling. Exponential convergence is great, because I'm training operators: if I can do it exponentially fast, that will be great. I'm not there yet, but one observation is that if we make the networks bigger — say from a width of 50 to 200 — the transition point moves to the right, so the exponential range gets much bigger. So again, I'm looking for someone very smart at MIT to take this and turn it into a really good network with exponential convergence in training and testing for all sizes.

We can do this for PDEs — advection-diffusion-reaction systems, which you can find in the brain; you can have biological systems. Now you have very few observation points in space-time, and you can do the same thing; again you find exponential convergence — I won't bore you with that. Same idea: with this DeepONet you learn how to solve the PDE, and once you have trained the network on an advection-diffusion system, you can change the initial conditions, boundary conditions, and so on, and solve the PDE in real time, in a fraction of a second. You know, I spent 35 years working on numerical methods for PDEs; I cannot find a method that competes with this. Not only that, the learned operator is very general — you learn the operator implicitly. Now, how do you explain the data? Here I have an example where I fed it data from this advection-diffusion system, generated with the old, boring integer calculus — which I don't like anymore; I like fractional calculus. So I use a dictionary of fractional operators and, using the learned operator, I spit out values and find a new equation that describes my data equally well. So I can explain it with integer calculus, the boring one, or with fractional calculus, the exciting one. I can do lots of different things.

Talking about fractional calculus: I like it because it's as expressive as a neural network. Let me give you an example. Say I try to learn this operator — a fractional derivative, which is actually an integral, because it has memory; it's an old idea going back to Riemann and Liouville. So: there is an integral in there — can you learn the fractional derivative? Here's the idea again: I take all sorts of functions and train the neural network to learn the fractional operator. I was trying to really push the DeepONet, so I use the known formulas, build a library, and you can do it. Here I do it specifically for what's called the Caputo derivative, which is used for time-fractional initial-value problems. The main point is that I can learn this really, really well, and the three curves here show how well it works; it all depends on the space V, the input space.
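To show the kind of labels such a network would be trained on, here is a small numpy sketch of the Caputo derivative of order alpha in (0, 1), computed with the standard L1 finite-difference scheme on a uniform grid. The test function and grid are my own, and this is only a label generator, not his DeepONet setup.

```python
import numpy as np
from math import gamma

def caputo_l1(f_vals, t, alpha):
    """Caputo fractional derivative of order alpha in (0,1) on a uniform grid t,
    via the classical L1 scheme: weighted sum of first differences of f."""
    n = len(t)
    dt = t[1] - t[0]
    k = np.arange(n)
    b = (k + 1) ** (1 - alpha) - k ** (1 - alpha)   # L1 weights b_k
    df = np.diff(f_vals)                            # f(t_{j+1}) - f(t_j)
    out = np.zeros(n)
    for i in range(1, n):
        # sum_{j=0}^{i-1} b_{i-1-j} * (f_{j+1} - f_j)
        out[i] = (df[:i] * b[i - 1::-1]).sum()
    return out * dt ** (-alpha) / gamma(2 - alpha)

# sanity check: for f(t) = t^2 the Caputo derivative is 2 t^(2-alpha) / Gamma(3-alpha)
t = np.linspace(0.0, 1.0, 201)
alpha = 0.5
approx = caputo_l1(t ** 2, t, alpha)
exact = 2 * t ** (2 - alpha) / gamma(3 - alpha)
print(np.max(np.abs(approx - exact)))   # small discretization error
```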
For example, if I represent my functions with spectral expansions, I can do a really, really good job; if I use a Gaussian random field, as before, I still get good accuracy, but 10^-3, not the 10^-6 I would like. So your space V is very important — that's what I demonstrate here — and I'm sure there are better ways to represent these spaces.

Talking about spaces and difficult operators: one of the most difficult operators to compute is the fractional Laplacian, which gives anomalous transport. I am 100 percent sure that diffusive transport in the brain is anomalous, so it will be described by a three-dimensional fractional Laplacian — but that's a different topic. Here I represent my input space V with Zernike polynomials, which are orthogonal polynomials on a disk. The reason I include this result is that some of you may be using phase-contrast microscopy, and you may know who Frits Zernike was — the Nobel laureate of 1953 who invented phase-contrast microscopy — and he was the one who introduced these Zernike polynomials. So I use them to represent my input space, and I learn the fractional Laplacian really well; and after you have learned it, instead of a few hours on your laptop, it takes about 0.01 seconds to compute it for any function — you can see that here for different functions.

You can also treat stochastic equations as operators. This is a very simple — deceptively simple — example: dy/dt = k(t) y, where k(t) is a stochastic process; it could be white noise or partially correlated. If there is a little bit of correlation, you can do a Karhunen–Loève expansion, which I do here, and then my branch and my trunk change, because now I'm in high dimensions — in a sense I'm deterministic now, because I take advantage of the colored noise — but the input is much bigger, and the trunk, on the output side, is (n+1)-dimensional: if I keep ten modes I have 11 dimensions, if I keep 20 I have 21, and so on. It's a little difficult to train, but it turns out you can recover not only the statistics of the stochastic operator but also individual trajectories: here I have 10 samples, 10 different trajectories, and DeepONet, in a split second, reaches an accuracy of 10^-5.
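Here is a small numpy sketch of the Karhunen–Loève step he mentions: form the covariance matrix of an exponentially correlated process on a grid, keep the leading eigenpairs, and use the resulting KL coefficients as the low-dimensional parameterization of each noise sample — the kind of reduced input that would go to the operator network instead of the full realization. The exponential kernel, correlation length, and number of modes are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
t = np.linspace(0.0, 1.0, n)
ell = 0.25                                  # correlation length of the colored noise

# exponential covariance C(t, s) = exp(-|t - s| / ell)
C = np.exp(-np.abs(t[:, None] - t[None, :]) / ell)

# discrete Karhunen-Loeve: eigen-decomposition of the covariance, largest first
evals, evecs = np.linalg.eigh(C)
evals, evecs = evals[::-1], evecs[:, ::-1]
n_modes = 10                                # keep the 10 leading modes
phi = evecs[:, :n_modes]                    # discrete KL eigenfunctions
lam = np.clip(evals[:n_modes], 0.0, None)

# sample one realization of k(t) directly from its truncated KL expansion
xi = rng.standard_normal(n_modes)           # independent standard normal coefficients
k_sample = phi @ (np.sqrt(lam) * xi)

# (xi, t) would be the low-dimensional inputs to the operator network,
# instead of the full n-point realization k_sample
print(k_sample.shape, xi.shape)
```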
The accuracy, as you may guess, depends essentially on the optimization, nothing else; you can get better accuracy than that. I have some math that explains why, and what the error breakdown is, but I will skip it. You can also apply this to PDEs. This is a tricky PDE: it has an exponential nonlinearity, as you can see — it's nonlinear in the stochastic domain. Again you take advantage of the KL expansion — a high-dimensional space — but you can do it. If it's white noise you have to do something different, but for colored noise — and most physical and biological processes are governed by some colored noise — you can do it with DeepONet and get all the statistics, the standard deviation, and even trajectories, as before.

I know you don't do physics, but I want to show you a really difficult case that I'm doing now with DARPA. There's a lot of interest in hypersonics recently — because of the Russians, as they say — and we were asked to provide fast ways of predicting trajectories of hypersonic vehicles. What I show here are the Euler equations, set up like a Riemann problem: you start with some discontinuities, and the Euler equations develop shocks, contact discontinuities, expansion fans — all the crazy stuff — and on top of that the air dissociates, because you're flying at Mach 8 to 10. So the question is: can you pre-train a neural network so that when you get some real data on the fly — literally on the fly — you can correct your trajectories? Here's what we do: we take not one but five neural networks, DeepONets, and we pre-train them; the idea is that with just a little bit of extra data you can then predict entirely what's going on — at literally hypersonic speeds, not supersonic; Mach 10, as I said. The parameterization of the problem is the initial conditions, shown down here: I can start with very steep initial conditions for the Riemann problem, or very shallow ones, and get all sorts of different solutions. So I don't have one solution, I have millions of solutions, and DeepONet encapsulates all of that in one pre-trained network. We did that in phase one — which, on the DARPA time scale, now means two months. Here are some results for one specific case: we got accuracy of 10^-5, starting from the initial conditions shown in blue; you can see the convergence and so on. So DeepONet can be used in many types of situations. We have many more cases, mostly physical so far, but we're moving into the biomedical domain.

I have one more slide and a conclusion. This is a sweet new concept we call DeepM&Mnet — one M stands for multiphysics, the other for multiscale — and the idea is that any complex problem can be built up from DeepONets. For example, imagine you have three coupled fields — temperature, velocity, and magnetic field — coupled through some physics, or through observations; you know what the coupling is. You train a DeepONet for each one, with the other two as the input functions — remember, DeepONet produces functions. So this is the Lego approach to multiphysics: off the shelf you take your separately trained DeepONets, you have new data, you have an overall new network — the DeepM&Mnet — you give it a little bit of data from the true multiphysics problem, and then you're basically done.
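As a schematic illustration of that Lego idea, here is a numpy sketch with three stand-in surrogates (plain lambdas, not real DeepONets), each predicting its field from the other two, iterated to a self-consistent coupled state with under-relaxed fixed-point iteration. The coupling coefficients and the relaxation factor are made up purely to show the loop, not taken from the DeepM&Mnet papers.

```python
import numpy as np

# stand-ins for three pre-trained operator surrogates G_T, G_V, G_B:
# each maps the other two fields (on a shared grid) to its own field.
# In the real DeepM&Mnet these would be frozen DeepONets.
x = np.linspace(0.0, 1.0, 64)
G_T = lambda v, b: 0.5 * v + 0.1 * b + np.sin(np.pi * x)
G_V = lambda T, b: 0.3 * T - 0.2 * b
G_B = lambda T, v: 0.1 * T + 0.4 * v

# start from crude guesses and iterate to a self-consistent coupled solution
T = np.zeros_like(x); v = np.zeros_like(x); b = np.zeros_like(x)
relax = 0.5                                  # under-relaxation for stability
for it in range(200):
    T_new = (1 - relax) * T + relax * G_T(v, b)
    v_new = (1 - relax) * v + relax * G_V(T, b)
    b_new = (1 - relax) * b + relax * G_B(T, v)
    change = max(np.abs(T_new - T).max(), np.abs(v_new - v).max(),
                 np.abs(b_new - b).max())
    T, v, b = T_new, v_new, b_new
    if change < 1e-10:
        break
print(f"converged in {it + 1} iterations")
```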
So it's pre-trained, say, 99 percent of the way, and we have done this for many, many applications. I won't bother you with more of the physics, because you may already be bored, but I just wanted to demonstrate it. From DeepM&Mnet, let me show you my current center, the biggest center on physics-informed learning machines, which I started a few years ago. MIT is participating — CSAIL; one of my co-PIs is there — Stanford is represented, and the national labs and so on. The idea is to use the types of networks I showed you — primarily PINNs, but now also DeepONets — to build new ways of approaching the modeling of complex multiphysics, multiscale problems. I think I will stop here. Thank you very much for your attention, and I'm happy to take questions.

Great, thank you very much for that wonderful talk. And, George, we have our first question. They ask: you mentioned these neural-network methods are the best you've ever seen for these problems — can you give some intuition for why they are so good compared to classical methods?

I should have qualified this. When we have some data — well, they're not as good... Okay, so back at MIT I did my thesis on the spectral element method, which is a very, very accurate method, a combination of finite elements and spectral methods. You cannot beat that accuracy, but those are very slow methods. These DeepONets you can literally use on the fly. Some of our physical applications — the hypersonics, for example — may take several days for one simulation; here we predict the right answer in a clock time of 0.01 seconds on a postdoc's old laptop. So the main thing — especially for DARPA and so on — is speed: they're not interested in 10^-16 accuracy, they're interested in reasonable accuracy, like the 10^-5 I showed, but really, really fast, and in incorporating new data quickly. So in that sense, I should have said, for such examples these methods are basically unbeatable.

From Abdul Kalaf: does this apply to convolutional neural networks? Good question — I actually forgot to mention that. In one of the DeepONet cases, the fractional Laplacian — which I called a very complex operator — if you have a nice domain, a square domain, you can treat the input as an image, and then a CNN works really, really well: you make the input an image, and it's fast and works very well. For the first part of the talk, about PINNs: in PINNs I use automatic differentiation to avoid any grids; I abandon numerical methods entirely, because I use the same technology that is used for backpropagation to evaluate the differential operators. In a CNN you don't have that — you need finite differences and so on — and then you are back to the old problem of numerical methods and their artifacts, errors, diffusion, and so on. But yes, a CNN can be used if the domain is simple. With DeepONets the domain does not have to be simple — it doesn't have to be rectangular, it doesn't have to be 2D — but in some cases you can; the answer is yes.

Great, thanks. The next one is from Zhongyi Li: can we bound the error in terms of the operator norm? It's a very good question. I didn't show that — I just wrote a report to DARPA on some of it — but you have to be very specific.
If you're talking about DeepONet, yes — the answer is that you have to be specific. You take a class of, let's say, hyperbolic problems, or conservation laws, and then you use their properties to bound the error; in fact, some of the material I skipped on the stochastic ODE was attempting exactly that. The answer is yes: you need to assume Hölder continuity, with exponent alpha less than one, and then you can prove it; you can use equivalences of norms, Gamma-convergence, and so on. So yes, but you have to go class by class — it's not one result that fits everything. Good question.

Great. The next one is from Christian Mueno: thank you for the talk; you mentioned that DeepONet gives good performance even when you move away from the compactness assumption of the theorem — could you say a little more about that?

Yes. As you probably know, most of these theorems are for compact spaces; it's very difficult to prove things for non-compact sets. But almost all our examples are for non-compact sets. I didn't show it here, but in the paper — we have a paper that will appear in Nature Machine Intelligence — I think we have 16 different cases, most of them for non-compact settings, including a Laplace transform, I think, and a Legendre transform, and so on. But it's all empirical. I don't know if you're a theory person, but it's very difficult to do proofs with non-compact sets, even for function approximation. Hopefully that answers the question, if that's what you mean; if you mean extrapolating way outside the distribution, that's a different question, but I answered the question about compactness. A lot of mathematicians actually didn't think very highly of this theorem, because they thought the compactness is so restrictive — but in practice it isn't; that's why I wanted to show it.

Thanks. The next one is from anonymous, a follow-up to the first: what allows these networks to approximate exact solutions so fast? Do we have some understanding of how the implicit prior in these networks helps them approximate the desired physical operators efficiently?

That's a really good question. I have observed this — I have another case where I fed the network with molecular dynamics data, with stochastic fluctuations, and it learned the stochastic fluctuations. So I don't really know. When I showed the theorem — a rigorous theorem, where we went from a single layer to deep neural networks — the deeper you go, the better, just as in your networks. And then how rich the space is, and how well you represent V — how the representation is done — matters a lot. That's why I showed you the example with the Caputo derivative: I had something there that I call polyfractonomials, exact functions I discovered — Jacobi polynomials with fractional exponents — and they are the best for representing this type of solution. If you do that, you gain a factor of 10, sometimes 100, in accuracy. So I don't know — I think representing the space V is very important, but as I said, I don't really have any intuition yet.
Qualitatively, I would say yes, we used good priors, but I put the emphasis on the input space V — how well I represent it — and there may be better representations out there. The other thing, in terms of training, is what I said about a balanced network: with a balanced network I can learn exponentially fast, which is a big deal, as you can tell, for these problems that need a lot of training. It was a very good question; I don't know the answer.

All right, the next one is from Eric Malik: what about arbitrary operators with DeepONet — can it learn complex, user-defined operators?

Yes, it does — that's a good question, and that's why at the beginning I put up a biological system, a social system, a system of systems: because it doesn't really care. For example, take this advection-diffusion system: I generate data from an integer-order equation, I learn the operator, and the operator is now implicit. How would you represent it if you didn't know where the data came from? If you like integer calculus, you use a dictionary of integer derivatives; I used fractional derivatives and came up with a different operator — you may remember, I found a derivative of order 1.5, a combination of second and first derivatives. So yes: fractional operators are exactly that kind of very general operator — you can have variable order, you can have distributed order — so if you don't want to be totally arbitrary, you can use fractional operators, which are extremely expressive. But yes, any operator that is nonlinear — it has to be a continuous operator — can be represented.

Great. And thanks to David for helping Marty submit his question. Marty asks: did I hear you mention the possibility of using wavelets instead of sigmoids — have you tried this option yet? And a follow-up: could you envision using Mallat's deep scattering networks as trunk networks?

On the first part — that may have been a bit loosely stated — what I meant is that wavelets would be a representation of the space V. Imagine you have dilations and so on; for multiscale problems, maybe wavelets are the best way to represent the space V, just like I use my Legendre polynomials and so on — not a replacement for the activation functions. If you're interested in activation functions, we actually have something really nice that we produced with one of the postdocs from CSAIL — Kenji is his name: these adaptive activation functions, which we call Rowdy activation functions, work really, really well for training. But I didn't mean replacing the activation functions with wavelets; I meant representing the space V. I'm not sure about the second part — is this Stéphane Mallat's method? I don't actually know the scattering networks, so I can't answer that. But it's really interesting that you target the trunk, because when we started doing this — using DeepONet as a pre-trained network and then training a little further with some data — I thought I would need to fine-tune the branch, but it turns out that in these DeepONets the trunk is actually the most sensitive part. So if there are ways to improve the trunk — fine-tune it, use better parameters, a different architecture — that's a place to target.
That's our experience so far, but next year I may say something else. Sorry I cannot answer the second part of the question.

No worries at all. The next question is: in FEM there's a lot you can do with your choice of FEM space; consequently there's a lot of work on which spaces to choose for certain types of PDEs — ergo finite element exterior calculus and so forth. It seems like you are picking a function space in some of these examples. What should I do to pick my space — ergo, how did you choose that space for the fractional Laplacian?

Good question. I didn't show the results that point to this, but it is a slightly different thing: the V space is the input space of functions and how you represent it. In terms of PINNs and how we do it, we have something called variational PINNs. Since you mention finite elements: the finite element method is a Galerkin method, where the trial space and the test space are the same. In variational PINNs, my trial space is the neural network space, which is a nonlinear approximation, but my test space can be a polynomial space — I use Legendre polynomials, or monomials, just like in finite elements. So with variational PINNs — if you google VPINNs you will see our paper — you have the best of both worlds: you have the neural-network approximation, which by now we know behaves like an adaptive finite element, and you test against nice smooth functions in subdomains, just like in finite elements, but in arbitrary subdomains. If you integrate by parts, you can transfer the non-smoothness to the smooth part, and you can do lots of nice things: you don't have to take higher derivatives — because every time I take a derivative for my physics terms I double the length of the network graph, which is not good from the training point of view. So you can choose the space; you can construct FEM-type and least-squares-type solutions here. One of my collaborators at Sandia is working on exterior calculus — since you mention it — and on what are called generalized least-squares nets. So one can mix and match in a Petrov–Galerkin type framework, perhaps; there are lots of possibilities.

This is from Ram: comparing this approach to real neurons — is the branch a dendritic branch, the trunk axonal spiking, the cross product a recurrent feedback? Then the hidden layer is simultaneously fragmenting data and categorizing it, and using categorization to guide fragmentation; the two networks compete and cooperate simultaneously. I wonder if the model can be run backwards as well.

Wow, that's a really good question. I'm not the right person to answer — I know what an axon is, I know what a synapse is, but I really have no clue; maybe I shouldn't be giving a talk to a neuroscience group today. But it could be — yes, the trunk and the branch have to work in sync; they have to be balanced, and so on. We have seen some of that, but I don't have the knowledge to draw the analogy you're presenting. On arXiv we have the DeepONet paper — it's a spartan version, but I can make the long version available to you, with all the theory and so on — and then you can have a look.
If you can come up with an analogy like that, it would actually be great, because I just don't have the intuition for it — I just do the operators — but that sounds great; if it's true, I hope it's true.

And we actually have a follow-up question from Christian on his compactness question and the Laplace transform example: do you learn the Laplace transform for functions supported on, say, the unit interval, and then apply the learned transform to functions with larger and larger support, but continue to see the same L-infinity error — is this how you would test the idea?

Yeah, kind of — yes, exactly; that's exactly how you do it. Of course, as you increase the support you have to increase the training accordingly, but yes, that's exactly right. So the GRF was just for the input space, right — basically you take a hat and pull functions from that hat — and as I was saying, you can replace it with other spaces; this is just experimenting with representations of that compact space V, which, as I said, does not actually have to be compact. There's a lot out there one could use for that. This hasn't been around for very long, but we have many examples, from physical and biological cases we have tested. On the theory side, we proved the extension to deep neural networks and there are some error bounds, but we haven't done a lot yet. The reason I chose this topic today is that it's a kind of higher-level abstraction, so I thought maybe you could take a different angle on it — and I heard some good questions — so it's wide open. I know DARPA is very interested in this; they want to start a whole new program just on DeepONets of all sorts, and they ask me the same questions: can we do this for social systems, can we do it for flocking, can we do it for our troops? Because it's agnostic to the specifics — but you do have to have data to train all this.

Ram just added — I think this is a clarification: the trunk is inhibition, the branch is excitation, and the minus sum is a difference of Gaussians?

So the branch — the excitation — is the input, the excitation force; the trunk is the output. That's correct. But these are nonlinear systems: there's nothing Gaussian here — even if the input is Gaussian, it's a nonlinear system, so nothing stays Gaussian at the output.

Great talk. Thank you, George. Thank you very much. Thank you very much, Tommy. Thank you.