Transcript for:
Collaboration Between Mathematicians and DeepMind: The Role of Machine Learning in Mathematical Discovery

Today we'll be discussing a collaboration between pure mathematicians and DeepMind, where we used machine learning to help make new discoveries in mathematics. In this talk we'll cover these results, show how machine learning was used to discover them, and paint a future where machine learning features fundamentally in the mathematical process. I'd like firstly to thank the University of Oxford, and in particular the Mathematical Institute, for the opportunity to discuss these exciting results. To structure this talk, I'll start by giving an introduction to machine learning and its application to problems in mathematics. Following this, Geordie Williamson will give an introduction to the combinatorial invariance conjecture, a central conjecture in his area of representation theory, and then describe how we were able to make progress on it using machine learning. Following that, András Juhász will give an introduction to knot theory, the second area of mathematics in which we worked, and finally Marc Lackenby will describe the conjecture relating the signature and the slope, the connection that we were able to find, and then prove, within knot theory.

But I'll start by answering the question: what is machine learning? Machine learning is a field of computer science concerned with systems that are able to learn from data. This is useful because we often want computers to perform tasks that we can't explain how to do, and I'll come back to a recurring example: recognising what's in an image. Recognising what's in an image is something I can do very easily; however, it can be hard to make instructions precise enough for a computer to replicate what I do. So instead we generate data, a large number of examples, and use algorithms that let the machine learn from this data. This philosophy has led to a number of very big breakthroughs in different areas over the last ten years, a small number of which are: understanding the content of images; vastly improving the quality of translation between different languages; just five years ago, a machine-learned system beating the world champion at the board game Go, a feat previously considered to be at least twenty years away; and finally, Science's Breakthrough of the Year for 2021, a machine-learned system called AlphaFold, which learned to predict the structure of a protein from its amino acid sequence, a fifty-year challenge in biology.

I'll describe two techniques in machine learning which we'll refer to throughout this talk: supervised learning and attribution. Supervised learning is the task of trying to find a function such that, for a large dataset of example input-output pairs, the function is a good approximation of the map from input to output. Attribution, on the other hand, asks the following: if we have such a function, which in machine learning we would often call a model, mapping from x to y, then for a particular input-output pair, which parts of the x is the model using to make its prediction of y? In the case of images, in supervised learning we might be trying to learn a function that maps from an image to a label of what's in that image; attribution would be asking which part of the image the model was using in order to predict that label. So let's look at that example in more detail: here we have a picture of a cat.
As I mentioned, we can all do this very easily: I can do it, you can do it, a small child can do it. But in order to have a machine do it, we really have to have it learn from lots of data. So what we do is show the computer many images with lots of different labels, and then train a system that, given a new image it hasn't seen before, can use the function it has learnt to tell what's in the image. If we take a model we've trained to do this and apply it to this image, we get a prediction of labels that might apply, and we can see a table here with the five most likely labels according to the model. The most likely label is "tabby cat", which we would take as correct and the one we want to use. You can see that it gives a small probability to this being an image of a bath towel, and you might see why that is, but we accept this as an image of a tabby cat. So we might ask the attribution question: which part of this image is the model using to predict that this is a tabby cat? Again, we have some expectation of what this should be; we know what we use to tell that this is a tabby cat, and in this case we can sanity-check that the model is using the same information we would. We can ask which parts of the image, which in the case of a computer means which pixels, it is using to make this prediction, and, reassuringly, we can see that the pixels of the cat are the ones telling the model that this is an image of a cat.
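To make that pair of ideas concrete, here is a minimal sketch in Python of a supervised classifier together with gradient-based attribution, assuming PyTorch is available. The tiny network, the random stand-in image, and the five hypothetical labels are all illustrative assumptions, not the model from the talk; gradient saliency is just one standard attribution technique among several.

```python
# A minimal sketch of supervised learning plus attribution, assuming PyTorch.
# The tiny CNN, the random "image" and the five labels are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 5),  # scores for five hypothetical labels ("tabby cat", ...)
)
# Supervised learning would fit `model` by minimising cross-entropy over many
# labelled images; that loop is omitted here, so pretend the model is trained.

image = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in input
scores = model(image)
probs = scores.softmax(dim=1)        # the "five most likely labels" table
top = int(scores.argmax())           # index of the predicted label

# Attribution: which pixels most influence the predicted label's score?
scores[0, top].backward()
saliency = image.grad.abs().max(dim=1).values  # per-pixel importance map
```

Plotting `saliency` as a heat map over the input is what produces pictures like the one in the talk, where the pixels of the cat light up.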
Let's move on to patterns in mathematics. Patterns, and discovering patterns, are fundamentally a part of pure mathematics. We're all familiar with the idea of proving theorems, taking some statement and then trying to prove it, which we would do in high school or at university. But a lot of research mathematics is about discovering new things: discovering new structure and new patterns that weren't known before, turning these into conjectures, and then going on to prove them. So a lot of the history of mathematics looks like generating examples of the mathematical objects mathematicians are interested in, finding patterns that can be formulated as conjectures, and ultimately proving them. Going back in time, a great example of this is prime numbers: before computers were around, people would manually calculate large numbers of primes which they could then examine, find patterns in, and turn into conjectures like the prime number theorem. One of the most famous conjectures today is the Birch and Swinnerton-Dyer conjecture. This is one of the Millennium Prize Problems, which would win you a million dollars if you were able to solve it. It was found by having a computer generate a large number of examples of objects that mathematicians are interested in; one could then look at the output, find a pattern, and turn it into a conjecture, which as yet remains unsolved. It is a fundamentally important conjecture for mathematicians and our understanding of mathematics. What we asked in this work is whether, by using machine learning, which is able to detect patterns in large amounts of data, we can find patterns that are otherwise overlooked by mathematicians, possibly in mathematical objects that are very complicated or very hard to visualise.

To give a really simple illustration of this process, let's start with a small example before we go on to how we actually used it in practice. We'll start by rediscovering Euler's formula, and one way we can phrase this is: can we tell how many edges a polyhedron has by looking only at other measurements of that polyhedron? You may already know this formula; if you don't, we'll rediscover it through this process. I should emphasise that we don't need machine learning to rediscover this relationship, which is a fairly simple one, but it gives us an illustration of how we would use these techniques in a case that is much more complicated, where we might not be able to spot the pattern without machine learning. We can generate some data for this problem: we take a number of different polyhedra and take different measurements of each, such as how many faces it has, how many vertices it has, its surface area and its volume. Then we make a table of these alongside how many edges each polyhedron has. This is a pretty small table as tables go, so, looking at it, you might be able to spot the pattern even before we finish. But if we wanted to use machine learning, we could train a supervised model to predict the edges from the quantities on the left. If we did that, we would find we could train a model that predicts the number of edges correctly essentially all of the time, which tells us that there is a pattern there that the machine learning algorithm has found. Having found that, the next question is: what is this relationship? Can we tell the mathematician something more that helps characterise it in a way they can then go and use? If we applied attribution techniques to this learned model, we would see that it was looking only at the faces and vertices to make its prediction. This is great, because we've cut down the parts of the object we're looking at quite substantially, and it may make it easier to figure out what the pattern is, since we now know that we can calculate the edges from the faces and vertices alone. If you haven't got it yet, it turns out that the number of edges is the number of faces plus the number of vertices minus two. This is a very simple form of formula, which means it's very easy to spot: if you plotted the data, or did linear regression, you would find it quite easily. But the great thing about machine learning is that it's very flexible, and able to look at large amounts of data about very complicated objects to find relationships that are really not so obvious.
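As a concrete sketch of that workflow, assuming scikit-learn is available: the handful of Platonic solids and their unit-edge measurements below are just illustrative data, and linear regression stands in for the more general supervised models the talk describes.

```python
# Sketch of the Euler-formula rediscovery, assuming scikit-learn is available.
# The five polyhedra below are the whole "dataset"; areas and volumes are for
# unit edge length and rounded to two decimals.
import numpy as np
from sklearn.linear_model import LinearRegression

# columns: faces, vertices, surface area, volume
X = np.array([
    [ 4,  4,  1.73, 0.12],   # tetrahedron
    [ 6,  8,  6.00, 1.00],   # cube
    [ 8,  6,  3.46, 0.47],   # octahedron
    [12, 20, 20.65, 7.66],   # dodecahedron
    [20, 12,  8.66, 2.18],   # icosahedron
])
y = np.array([6, 12, 12, 30, 30])  # edges

model = LinearRegression().fit(X, y)
print(np.round(model.coef_, 3), round(float(model.intercept_), 3))
# Expect coefficients close to [1, 1, 0, 0] and intercept close to -2:
# edges = faces + vertices - 2, with surface area and volume contributing
# nothing, which is the same conclusion the attribution analysis gives.
```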
To give you an example of this, I'll hand over now to Geordie Williamson, to talk about how we used these techniques to understand more about representation theory.

Hello from the other side of the world. I'm Geordie Williamson, and I'll be talking about the representation theory arm of this paper we're discussing today, which advocates machine learning as a kind of collaborator in mathematical research. Alex mentioned the task, at which machine learning is very good, of image recognition, and I think this is a very good example for understanding what a neural net does; it also motivates some of the neural nets that we used in this project. Alex mentioned the task of recognising a cat, and here's a very big cat. I just want you to notice for a second that you can instantly recognise the eyes, the nose, the whiskers, the ears, the background, the fact that it's a tiger. It's absolutely remarkable, and I think it becomes even more remarkable when you think about trying to do this on a computer. I'm going to zoom in now on the pixels just to the left of the eye of the tiger, so the eye is on the right-hand side. To a computer, each of these pixels is just a single shade of grey, encoded by a byte, in other words a number between 0 and 255. I extracted these numbers, and here they are: this is what that portion of the eye looks like to a computer. If we zoom back out again, we see that there are over 10 million pixels involved in this image. So what our visual apparatus processes instantly is 10 million greyscale values, and we don't even have to think to know that it's a tiger.

So how does a computer actually do this? The most powerful technique nowadays is known as a convolutional neural net. Think about the way our visual cortex works: light reaches our eyes, and then neural activity begins. The first few things that happen in our eyes are understood by neuroscientists, and the deeper we go back into the visual cortex, the more mysterious it becomes. This loosely motivates the architecture of a convolutional neural net. Initially there are certain functions, indicated by the arrows on the left, which perform local analysis of the image. Although we don't ascribe these kinds of words to the functions, they're responsible for things like edge detection, detecting blocks of colour, and so on. Then we take in larger and larger considerations, until we finally reach an output which is very positive if it's an image of a tiger, for example, and very negative if it's not. So here's the picture you should have: the top neural net is a vanilla neural net, where everything is connected to everything. This is powerful on some basic tasks, but most of the more sophisticated uses of neural nets involve a specific architecture, and below it I've illustrated a very simple convolutional neural net. You'll notice that at the beginning we only care about nodes that are close to each other: the neural net architecture reflects the problem.

Now I want to turn to the pure maths problem. What we worked on is something called the combinatorial invariance conjecture. This is a forty-year-old conjecture which has fascinated me since I was a PhD student. I won't be able to go into detail about the conjecture, but it basically says the following. We start off with a pair of permutations. Remember that a permutation is a way of rearranging the numbers from 1 up to, say, 10, and it's a nice exercise to see that the number of such permutations is given by 10 times 9 times 8 times 7 times 6 times 5 times 4 times 3 times 2 times 1, i.e. 10 factorial. We start off with a pair of these permutations, so there's an enormous number of possible inputs when we're looking at permutations of, say, 10 elements. To these two permutations we can associate one object called a Bruhat graph. This is easy to work out on a computer, it's generally enormous, and it's a big directed acyclic graph; in this picture on the left, all of the arrows go down.
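For those who want to experiment, here is a small sketch of a Bruhat graph, using the standard combinatorial description for the symmetric group: the vertices are permutations, with an edge from a permutation to the one obtained by a transposition whenever that increases the number of inversions (the length). The graphs in the talk are restrictions of this to an interval between two permutations; the code below, an illustrative sketch only, just builds the whole graph for small n.

```python
# A small sketch of the Bruhat graph of the symmetric group S_n:
# vertices are permutations, with a directed edge p -> q whenever q is
# obtained from p by a transposition that increases the number of
# inversions (the Coxeter length).  The talk's graphs are the restriction
# of this to an interval between two permutations.
from itertools import combinations, permutations

def length(p):
    """Number of inversions of the permutation p (its Coxeter length)."""
    return sum(p[i] > p[j] for i, j in combinations(range(len(p)), 2))

def bruhat_graph(n):
    edges = []
    for p in permutations(range(1, n + 1)):
        for i, j in combinations(range(n), 2):
            q = list(p)
            q[i], q[j] = q[j], q[i]   # swap two positions: a transposition
            q = tuple(q)
            if length(p) < length(q):
                edges.append((p, q))  # arrows go up in length
    return edges

# Already sizeable for n = 4 (72 edges on 24 vertices); at n = 10 there
# are 10! = 3,628,800 vertices, as mentioned in the talk.
print(len(bruhat_graph(4)), "edges on 24 vertices")
```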
We can also associate to this pair of permutations a Kazhdan-Lusztig polynomial. I can't explain what it is in the time available, but it's a very fundamental quantity in representation theory: it encodes incredibly detailed fine structure of important mathematical objects. I would like you to think of the Bruhat graph as being a bit like the image of the tiger, and the Kazhdan-Lusztig polynomial as being a bit like the evaluation: is it a tiger or not? What the combinatorial invariance conjecture says is that this Bruhat graph, this big directed acyclic graph, determines the Kazhdan-Lusztig polynomial. As I said before, I've been fascinated by this conjecture since I was a PhD student, but I never saw a way into the problem, and using this very loose analogy with image recognition, we thought it would be fascinating to see how neural nets do on it.

So this is what we did. We generated a dataset of about a million pairs of Bruhat graphs and Kazhdan-Lusztig polynomials, as on the previous slide. We then trained a graph neural net to predict Kazhdan-Lusztig polynomials from Bruhat graphs; this is, roughly speaking, the image-of-tiger to "tiger" association. And again, the Bruhat graph determines the neural net architecture, which is very important, and I think a very beautiful aspect of this: the graph we start off with, from which we're trying to make a prediction, determines the architecture of the thing making the prediction. Within a few days the team at DeepMind could get an accuracy of about 97%, and with more work, in certain cases, we could get the accuracy up to nearly 100%. Kind of shocking for me, as an outsider. But this is not a very useful place to be: "model X achieved 95% accuracy on problem Y" is maybe a great place to be if you want to make some money on the stock market, but if you want to understand what's going on, in the way that we do as mathematicians, it's not useful. So then we tried many things that didn't work, over a period of nine months, and I'm happy to discuss those if people are interested, and we finally settled on saliency. These are the attribution techniques that Alex talked about: basically, you can ask which vertices and edges in the graph are most important for making the prediction. This is what came out of it, and this is the thing that really surprised me. We were trying to make the prediction based on this Bruhat graph, and the DeepMind team produced the graph on the right. What this is saying is that certain edges are much more important for the prediction problem than others, and this was very robust over many slight variations in the model structure, so it really seemed to reflect a reality of the problem. Based on this, I started looking at those edges. In the picture on the left we illustrate a small example of a Bruhat graph, and the blue vertices are the ones the model is claiming are very important. When I started looking at these edges, I started to see hypercube-like structures in them, which led to a conjecture. I won't define the terms of the conjecture, but it's basically a prediction which should solve the combinatorial invariance conjecture. I find it interesting that in this conjecture the right-hand side is kind of boring for experts, whereas the left-hand side, which is what came from the saliency analysis, is very unexpected. We've checked this conjecture on over a million examples, it should solve the combinatorial invariance conjecture, and we have a proof in an important case. So, in summary: a machine learning model allowed us to see some aspect of a mathematical problem which humans hadn't seen, and this then led mathematicians to progress on the problem. I'm very excited to see whether this technique will be useful for others. Thank you.
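As an aside before moving on, here is a minimal, from-scratch sketch of the message-passing idea behind such graph neural nets, assuming PyTorch. The feature size, depth, toy graph, and the single scalar output standing in for one polynomial coefficient are all illustrative assumptions, not the architecture actually used in the project.

```python
# A toy message-passing network: the graph's edges determine where
# information flows, echoing "the graph determines the architecture".
# Sizes, depth and the scalar readout are illustrative assumptions.
import torch
import torch.nn as nn

class MessagePassing(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)  # message from (sender, receiver)
        self.upd = nn.Linear(2 * dim, dim)  # node update from (state, inbox)

    def forward(self, h, edges):
        # h: (num_nodes, dim); edges: iterable of (src, dst) pairs.
        inbox = torch.zeros_like(h)
        for s, d in edges:
            inbox[d] = inbox[d] + torch.relu(self.msg(torch.cat([h[s], h[d]])))
        return torch.relu(self.upd(torch.cat([h, inbox], dim=1)))

dim = 16
layer1, layer2 = MessagePassing(dim), MessagePassing(dim)
readout = nn.Linear(dim, 1)  # e.g. one coefficient of the polynomial

edges = [(0, 1), (0, 2), (1, 3), (2, 3)]  # a toy DAG, not a real Bruhat graph
h = torch.randn(4, dim)                   # initial node features
h = layer2(layer1(h, edges), edges)       # two rounds of message passing
prediction = readout(h.sum(dim=0))        # pool over nodes, then predict
```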
Hello, my name is András Juhász. In this video I would like to give an overview of low-dimensional topology and knot theory. An n-dimensional manifold, or n-manifold for short, is a space that locally, so in a neighbourhood of each point, looks like n-dimensional coordinate space, and we consider these up to continuous or smooth deformations. For example, you could take n-dimensional coordinate space itself, or the n-dimensional sphere. There are not that many 1-manifolds: you can have the real line, or the circle, which is equivalent to the simple closed curve in the plane shown in the lower right. Two-manifolds are also known as surfaces. If you were a tiny, short-sighted observer living on a surface, you would only see a small neighbourhood around yourself, which looks like the x-y coordinate plane; however, globally the space might have a more complicated shape. For example, you could be on the surface of a ball, or on the surface of a doughnut, which is called a torus, or on a doughnut with two holes, and so on; the number of holes is called the genus of the surface. Now, we live in space, which is a 3-manifold, and we don't know what its shape is, and if we add the time coordinate we obtain a space-time; in this picture you can see a wormhole connecting two distinct points in space. The Poincaré conjecture was a famous open conjecture originating from 1904. It stated that the only 3-manifold in which, if you take a big lasso, you can always pull it tight to a single point, is the three-dimensional sphere; this condition on the loops in the space says that the space has no holes. This was proved by Perelman about a hundred years later; he was offered the Fields Medal, and the result was named Breakthrough of the Year by Science.

Low-dimensional topology is the study of 3- and 4-manifolds. Surprisingly, manifolds in dimensions bigger than four, and manifolds up to continuous deformations, are easier to understand. For example, the generalised Poincaré conjecture, characterising the n-sphere, is known for n bigger than four; it is also known for the 4-sphere up to continuous deformation, but not up to smooth deformation, and this is the famous smooth four-dimensional Poincaré conjecture. Thurston's geometrisation conjecture, which was also proved by Perelman, and of which the Poincaré conjecture was a special case, states that you can cut every 3-manifold along embedded spheres and tori such that each of the resulting pieces carries one of eight special geometric structures, one of which is hyperbolic. In the lower left you can see a crocheted hyperbolic plane; this has the property that at each point it looks like a saddle, which means it has curvature minus one, and on the right you can see what it would look like to live inside a hyperbolic 3-manifold.

A knot is just a curve embedded in three-dimensional space, considered up to deformation, and knots play a very important role in low-dimensional topology. The simplest knot is the unknot, which is the standard circle in the plane, and we consider it up to deformation; this video shows such a deformation. If you look at the projection of a knot onto the plane, you obtain something called a knot diagram. There are two diagrams of the unknot in the middle here, the standard circle and this figure eight, and on the right you can see a very complicated diagram of the unknot. Now, knots have various applications. For example, in chemistry, people have synthesised molecules in the shapes of various knots, and they hope
these will provide potential building blocks in nanotechnology, where they could give stronger, more flexible polymers. There are also attempts to build quantum computers using knotting and braiding, and these have the advantage that they are not affected by small perturbations of the system. Knots also play an important role in biology. For example, there are certain organisms whose DNA is circular, and if this gets knotted, that is a serious problem: when the cell tries to multiply, a copy is made of the DNA, and this has to migrate into the daughter cell, but if it is knotted and linked with the original, that can't happen. So there are certain enzymes in the cell, for example recombinases, whose purpose is to unknot and unlink DNA. What a recombinase does is an operation called a band sum, where two strands of the DNA are cut and the endpoints are glued together, as shown in this video.

We can distinguish knots using certain algebraic invariants, one of which is the classical Alexander polynomial. If two knots can be deformed into each other, then the associated invariants are the same; so if they have different algebraic invariants, we know that no such deformation exists. Links are collections of knots whose components are linked together, as the name suggests, and they are particularly important because you can describe every 3- and 4-manifold using a link whose components are labelled by integers. An important four-dimensional invariant of knots is the four-ball genus: the minimal genus, so the minimal number of holes, of a surface that lives in four-dimensional half-space and whose boundary is your given knot in the 3-sphere. In particular, if a knot bounds a disc in four-dimensional half-space, we say that the knot is slice, and if we could show that a certain type of knot is not slice, then we could construct a counterexample to the smooth four-dimensional Poincaré conjecture. An easier-to-compute algebraic invariant is the signature, which is something Marc is going to talk about in his video. It is defined using a surface that lives in 3-space and whose boundary is your given knot. For example, in the lower right you can see a surface whose boundary is the trefoil knot; I've coloured its two sides red and blue, and it's an interesting exercise to show that it can be deformed into a torus with a disc removed. If you take the absolute value of the signature and divide it by two, it gives a lower bound on the four-ball genus. Let me conclude by showing you a non-trivial knot that bounds a disc in four-dimensional half-space. In order to visualise it, we use time as the fourth coordinate, so this will be a movie. Here is the non-trivial knot; it undergoes a deformation, then a band move, like the one performed by the recombinase enzyme, and we obtain a two-component unlink. We can then cap off these two unknotted components with two discs, and we obtain the disc shown at the top.
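To make the signature concrete before Marc's part: one standard way to compute it is from a Seifert matrix V of such a surface, as the signature (positive minus negative eigenvalues) of the symmetric matrix V + V^T. A short numpy sketch, using the usual Seifert matrix of the left-handed trefoil:

```python
# Knot signature from a Seifert matrix V: the signature of V + V^T,
# i.e. the number of positive minus the number of negative eigenvalues.
import numpy as np

def knot_signature(V):
    eig = np.linalg.eigvalsh(V + V.T)   # V + V^T is symmetric
    return int(np.sum(eig > 0) - np.sum(eig < 0))

# Standard Seifert matrix of the (left-handed) trefoil.
V_trefoil = np.array([[-1, 1],
                      [ 0, -1]])
print(knot_signature(V_trefoil))        # -> -2
# |sigma| / 2 = 1 is then a lower bound for the four-ball genus,
# so, in particular, the trefoil is not slice.
```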
My name is Marc Lackenby, and I'm one of the mathematicians who has been involved with this project. One of my research areas is knot theory. András, my colleague, has already told you a little bit about knots, and one of the objects that mathematicians working in knot theory study is the so-called knot invariants. These are mathematical quantities associated with a knot; they could be a number, or a polynomial, or something more sophisticated, and there are literally hundreds of them that mathematicians study. They fall broadly into three main categories: there are the so-called hyperbolic invariants, which are related to non-Euclidean geometry; there are invariants related to four dimensions; and there are invariants related to quantum theory and string theory. These are very, very different areas of mathematics, to the extent that mathematicians working on invariants in one of these areas hold conferences to which mathematicians in the other areas don't go. Our goal was to use machine learning to discover relationships between these very different fields, and our motivation for doing this was not just so that we could go to each other's conferences, but so that we could use the insights and information gleaned from one of these areas to provide new information about another.

We used machine learning to do this, and my collaborator Alex Davies has already told you a little bit about how machine learning works; I think it's best understood by means of an example. There are now algorithms that can distinguish between road signs and pedestrians, and the way they work is that you give them millions of pictures of pedestrians and road signs, saying "this is a pedestrian" and "this is a road sign", and gradually the algorithm learns which is which, to the extent that if you then give it a picture it hasn't seen before, it will be able to accurately predict whether it's a pedestrian or a road sign. It does this by giving you a number between zero and one, where zero means it definitely thinks it's a pedestrian and one means it definitely thinks it's a road sign; it might give you a number somewhere in between, say 0.5. In the abstract, what it's doing is taking the information in all the different pixels of these pictures and churning through that information to give you the required output quantity, in this case a number between nought and one. Our idea was to try to use these algorithms in the study of knot invariants. What we did was give the hyperbolic invariants of lots and lots of knots to the algorithm, and we tried to train it to learn the four-dimensional invariants: so, as an overview, the input was the hyperbolic invariants, and we were trying to train it to output the four-dimensional invariants. Now, the most important thing about this is that it is not guaranteed to work: if there were no connection between these different areas of mathematics, then no matter how clever the algorithm, it would be impossible to predict the four-dimensional invariants from the hyperbolic ones. So what you can deduce is that if the process is successful, then the four-dimensional invariants are determined by the hyperbolic ones. And, remarkably, it was: we were able to get the algorithm to predict the signature of a knot, which is a four-dimensional invariant, just using the hyperbolic invariants. For me this was really quite remarkable; I had not expected that the signature would be predictable in such a way. As mathematicians, though, we don't just want to know that one can predict the signature from the hyperbolic invariants; we would like to know how. Unfortunately, the algorithms don't exactly give you that: they're sort of black boxes, and they don't tell you how they work. But it is possible, to some extent, to get a little bit under the hood of these algorithms, and one way of doing this is using so-called saliency analysis.
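For the shape of that experiment, here is a minimal sketch assuming PyTorch: a small network trained to predict the signature from a table of hyperbolic invariants, followed by a gradient-based feature saliency, one simple way of asking the model which inputs it is using. The data here is a random placeholder, not real knot invariants, and all sizes are illustrative.

```python
# Sketch of the knot-theory experiment's shape: a small network trained to
# predict the signature from hyperbolic invariants, then per-feature
# gradient saliency.  X and y are random placeholders, not real knot data.
import torch
import torch.nn as nn

num_knots, num_invariants = 1000, 12      # placeholder sizes
X = torch.randn(num_knots, num_invariants)
y = torch.randn(num_knots, 1)             # stand-in for signatures

model = nn.Sequential(nn.Linear(num_invariants, 64), nn.ReLU(),
                      nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):                      # toy training loop
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()

# Saliency: average gradient magnitude of the prediction w.r.t. each input.
X_query = X.clone().requires_grad_(True)
model(X_query).sum().backward()
saliency_per_invariant = X_query.grad.abs().mean(dim=0)  # one bar per feature
```

Plotting `saliency_per_invariant` as a bar chart gives exactly the kind of picture described next.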
In this context, we were able to ask the algorithm: what are the main features that you're using to predict the signature? Here you have a whole load of hyperbolic invariants, and you can see that there are bars against each one. There are three bars which are particularly long, and they correspond to the three hyperbolic invariants that the machine is mostly using to predict the signature: the so-called longitudinal translation, and the real and imaginary parts of the meridional translation. So that provided some information, but what's the next step? Well, then we just did a more traditional mathematical thing: plotting some pictures. Here's a picture with the meridional translation on the x-axis and the signature on the y-axis. This is a scatter plot, so each one of these dots is a knot, and we've actually coloured each one according to its longitudinal translation, to get a little more information. This very beautiful picture clearly has some structure, so something is clearly going on; it's not just a random collection of dots in the plane. For example, you can see that the dots are mostly in the top right and the bottom left, which means that if the meridional translation is positive then, normally, so is the signature, and if it's negative then, normally, so is the signature. So there's definitely new information here that we didn't know before, but it doesn't exactly tell us how one can predict the signature from these quantities. This required a little more thought from us mathematicians, and what I did was come up with a new quantity, called the natural slope, which is a way of packaging together the relevant hyperbolic invariants. You take the longitudinal translation lambda and the meridional translation mu, divide lambda by mu, and take the real part. The reason for doing this is that if you make a plot with natural slope on the x-axis and signature on the y-axis, each dot again being a knot, you can see that the dots follow roughly a straight line. That means signature and slope are roughly linearly related, and this really does give us more insight into how the hyperbolic invariants are related to the signature.

The natural thing to do now is to formulate this as a conjecture, and so we did: if you take the signature of a knot K, minus half its slope, and take the absolute value, this should be bounded by at most a constant times the volume, which is another hyperbolic invariant. We were pretty confident of this conjecture. We spent nearly a year trying to prove it, and we examined millions of examples, all of which seemed to confirm it. But as we got more and more insight into the way these quantities, the signature and the slope, behave, we eventually realised that something like this could not in fact be true: the conjecture had to be wrong, and we were eventually able to find some counterexamples to it. This seemed a little depressing: we had spent a long time, and despite a lot of positive evidence the conjecture was wrong. But fortunately the insight we had gleaned into how the signature and the slope work was enough for us to formulate a correct theorem. We were able to prove the following: the signature minus half the slope, in absolute value, is indeed bounded by a constant times the volume times the injectivity radius raised to the power minus three, where the injectivity radius is another hyperbolic invariant.
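In code the quantities involved are short to write down. Everything numeric below is a placeholder for a single hypothetical knot, and the constant c is illustrative, not a value from the paper.

```python
# The quantities in the slope story, for one hypothetical knot.
# All numbers are placeholders, and the constant c is illustrative.

def natural_slope(longitude, meridian):
    """Re(lambda / mu) for the cusp translations lambda and mu."""
    return (longitude / meridian).real

lam = 14.5 + 0.0j        # longitudinal translation (placeholder)
mu = 0.3 + 1.1j          # meridional translation (placeholder)
slope = natural_slope(lam, mu)

sigma, volume, c = -2, 5.7, 0.3   # placeholder signature, volume, constant
# The original (ultimately false) conjecture asserted:
print(abs(sigma - slope / 2) <= c * volume)
# The proven theorem replaces the right-hand side by
# c * volume * injectivity_radius**(-3).
```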
So this shows that the signature really can be estimated in terms of these hyperbolic quantities. We were also able to prove a more refined theorem, which says that the signature is roughly half the slope, plus some correction terms relating to so-called short geodesics. And so this really was the end of the story: we started with the machine learning, which suggested some new directions we hadn't thought possible before, and we were eventually able to establish those relationships with proven theorems. What I think is potentially the most interesting part of the story is that it actually seems the machine knew all along what was going on. If we go back to the saliency plot: the first three of those bars were what we used, corresponding to the hyperbolic quantities the machine was most using to predict the signature. But if you look at the fifth quantity, that's the injectivity radius, and the fourth quantity is one related to the short geodesics, and those were actually vital quantities in forming the correct theorems. When we first looked at this plot, we thought that really the first three were controlling the signature and the rest was just noise. Actually, if we had trusted our algorithm a little bit more, and had worked out a way of getting the injectivity radius and the short geodesics into the picture, then we might have saved ourselves quite a bit of time in formulating a correct theorem.

So where next, and what are the morals to be taken from this? Well, I would say this: the mathematicians were vital to this project. The machine learning algorithms are powerful, but they're not going to put us out of a job just yet. They really do, though, have the capacity to find relationships in mathematics that are completely new to mathematicians. In my view, machine learning is not a magic bullet; there will be many mathematical problems that it won't be able to help solve. But I really do believe that it will become a tool that is more widely used by mathematicians in the future. I certainly hope so. Thank you very much for listening.

So, Alex, Geordie, András, Marc, thank you very much indeed for your presentations; that was wonderfully interesting. We're going to go now to a session of discussion and questions, where we can drill down a little further into what you were telling us. Perhaps we can start with this: could you briefly outline the history of how this collaboration came about and how it developed? It's an unusual collaboration and an unusual team, so I'd like to hear a little more about its genesis. Geordie, I think you were in at the beginning.

Yes. In 2018 I was elected a Fellow of the Royal Society, and the same year Demis Hassabis, the CEO of DeepMind, was also elected to the Royal Society. I can remember that at the meeting everyone wanted to talk to Elon Musk, who was elected the same year, but I really, really wanted to talk to Demis. We had some very brief chats about potential interactions between mathematics and artificial intelligence, and we continued via email. I suggested that the DeepMind team contact Marc, because one of the problems we were discussing was applications in knot theory, which they then did; I stayed in touch with Alex throughout the project, and joined the team later on.

So, Alex, you were in there at the beginning: how big was the team at DeepMind that was involved with this? Over time, people came
in and out, but in the end probably ten or so people from the DeepMind side were final authors on the paper, with even more helping out in different capacities throughout the time we were working together.

I see. And had the mathematical problems already been selected by the time you came in, or were you part of that process?

No, I was introduced to the process at the very beginning. Pushmeet Kohli is the head of the science team where I work at DeepMind, and he had spoken to Demis, and when I was looking for something new to work on he said, you know, maths is a really interesting area that you should check out, and we've got a great connection with this guy Geordie Williamson. So we spoke about where we could potentially start looking for a problem to work on, and this suggestion of knot theory came up. I then got in contact with Marc and András, and we ended up working together for quite a while before we settled on the problem we actually worked on.

I see. So, Marc, you came in a little later: had you understood the whole project and what was intended by it, or did this come fresh to you?

It was definitely fresh. Alex reached out to me and asked whether I was interested in doing something in collaboration with them, and I thought: great. I'd had no real experience of using machine learning before; I'd had some experience with computer proofs, but that's quite different. Coincidentally, I'd been talking to András about potentially using machine learning in low-dimensional topology, so I thought, let's get on board as well. And then we started to talk, and I think very much at the beginning both sides were trying to find the right way of working with each other; it actually took some time before we hit on the problems we ended up solving.

Yes, there were various questions that I was inspired by when the results with AlphaZero came out, and I had a number of ideas about applying reinforcement learning, maybe, to knot theory. That's what we were discussing initially with Marc, and it was a miraculous coincidence that Alex then got in touch with Marc, and everything started from there.

OK. So one thing that struck me from all the presentations was the key role played by saliency. Alex, is that something you'd anticipated would be so pivotal, or is that something that emerged during the collaboration?

No, not at all. I think we maybe came in with some preconceptions of what would be useful, due, in my case, to a lack of understanding of mathematics. I think I referred to this earlier: as non-mathematicians, you can sometimes think of mathematics as a process of theorem proving, of just trying to prove a particular statement. But the more we worked together with Marc and András, the more we could see what the process of mathematics actually was, and that it was more of an intuitive sense of working to grapple with and understand an area. After we'd done some early work, which was kind of helpful but reached answers to questions that ultimately didn't really get Marc and András excited, we took a bit of a reset and asked: what would actually be exciting for you from a mathematical perspective, rather than just using machine learning for
the sake of it? And that started us down this line, which eventually, naturally, led to using saliency, because we found something interesting and Marc and András wanted to understand it.

I see. So, Marc, András, were you familiar with this term before? Because I wasn't, and I was really surprised by how big a role it played in this whole story.

It was definitely a learning process for us as well; it was certainly completely new to me. But it makes sense how important it is: in mathematics it's not just about knowing what is true but why. Proof is all about explaining why something is true, and saliency allows you to get under the hood of the machine learning algorithms a little and explain not just that this quantity is related to that quantity, but, in a little more detail, how. So it was certainly new to me, but with hindsight it's not so surprising that it played such a role.

Certainly in the knot theory story, as I understand it, it played a key role in coming up with the correct conjecture that you were then able to prove. But initially you had a conjecture that wasn't quite correct, and I think what was striking in your presentation, Marc, was that it still seemed to be satisfied in a huge number of cases; I think you said millions of cases. Could you talk us through that? It might surprise some people that you can have an incorrect conjecture that is nevertheless true in the first million or so cases that you check.

Yes. This is one of the strange things: if you randomly sample from a huge sample set, you might not see behaviour that is at the extremes; things cluster around the mean, in some respects. It is quite striking that you can verify a conjecture in millions of cases, and thereby be convinced it's true, and it ends up not being true. It's a surprising phenomenon.

So were you shocked when it turned out not to be true?

Yes, and I was kind of annoyed as well: I had spent really quite a long time trying to prove it. So I was annoyed, but also relieved, as it explained why I hadn't been able to prove it. It was quite shocking, though, yes.

One thing I'm really curious about: you explained, Marc, in your presentation, that you had to introduce another quantity to fix up the conjecture, and that in retrospect the machine seemed to know that, because it was the next one on the saliency list. But it's not that much more salient than the one below it on that list, and so an obvious question, for a non-expert like me, is: why doesn't that one play a role, or what would it play a role in?

Well, as I said, both the fourth and the fifth quantities do play a role: there are two main theorems, and one plays a role in one and one in the other. But the sixth one was a lot less salient.

Oh, the sixth one, yes, the sigma.

Yes. The thing is that you're not dealing with quantities that are completely independent, and so the machine learning algorithm might be using one quantity when it doesn't really need to, because it's correlated with the others. That's what we initially thought was happening with the fourth, fifth, sixth and so on, that it was just basically noise, but
it turned out not to be.

Geordie, for your part it was a similar-ish story, trying to settle on the right conjecture, and what I think the machine learning told you was that some parts of the graph were more important than others. Had you at all anticipated that, or was it completely unexpected?

I think what was interesting in our work is that the work with Marc and András had already been going on, I believe, for over a year, and so they'd already tried a whole lot of stuff that didn't work, which saved us a lot of effort. What they had success with was basically a relationship-type statement, so from the get-go in our project we were looking for a relationship-type statement. And I guess the way that I look at the saliency plots in the Bruhat graph now is as something like those occasions in maths where we have some prediction from physics telling us that something is the case, and we have very little understanding of why. It's still very surprising: I looked at those saliency plots yesterday, and they still strike me in the same way. There's something about the problem encoded in that saliency plot that I now really believe is a fact of the world, which we have some partial explanation for, but I don't think we've really got to the bottom of why it is so. And, as my language should indicate, it was completely unexpected.

Amazing. So in all these stories there's clearly a major human component, in that you come to understand that something is salient, but then you've got to figure out what the right sort of shape or formula should be. Give me a sense of how important that component was: once you'd seen the saliency data, what the machine learning was telling you, how much effort did it take to figure out the right conjecture, the right statement? Marc, András, can you tell me about that?

It was a huge amount of effort. We spent probably many months, or even a year, trying to formulate the right conjectures, and it really was not clear from the plots we were looking at what the right formula was. Of course, one can try various techniques in machine learning to find formulae, but at the time that didn't seem to bear fruit.

To follow up on that: it was really quite a lot of work to get to the right conjecture. Getting to the conjecture that turned out not to be correct happened more quickly, but was still quite a lot of work. And there's no doubt that it required a certain background understanding of the whole structure that we had; the machine learning on its own was not going to be enough to get to the right conjecture.

And do you think this is something you'd be quicker at next time? Is this a skill that one can acquire, interrogating what the machine is telling you?

I hope so. It is quite difficult, partly because the quantities are all correlated, so you can't just look at the dependence on one of them: you can't keep all the other variables fixed and vary one variable to see what's happening. It's a tricky business, actually.

Geordie, was it the same in your case? Talk me through the effort of getting from the
readout from the machine to formulating the correct formula.

Yes, I absolutely agree with Marc and András. For me it was nine months or something; it was a really very difficult thing. The other thing is that I come from an extremely discrete side of the mathematical spectrum, so in my work I never look at a probability distribution or anything like that. Looking at these saliency plots, and realising that these models involve a whole lot of stochastic aspects, I spent quite a lot of time thinking that this was just hogwash and relics of the model. So one very interesting process for me was overcoming a kind of scepticism about what these models can actually pick up in a problem, and I think that was actually a psychological barrier for me. If I'd treated it as a reality of the problem earlier, I would have made progress more quickly. But one can't understate how difficult it is to get from model to theorem. Just to illustrate that: in our case we had reasonably good predictions within a few days of having a model, and the conjecture took about nine months to formulate. So it's days versus months.

So, Alex, is this usual when using machine learning in this way? Is it normal that there has to be this very considerable input from the human mind, or are the mathematical applications different from the ones you've been pursuing previously?

I'd say there isn't a "usual", because we're really the first people to try to do this kind of thing in mathematics.

But outside mathematics, you've had these great successes at DeepMind. Has the machine carried more of the load in those, or has there also been a very considerable human input?

This is one of the first projects we've done which really aims to see if we can push forward the work that expert humans are doing. If you take, for example, the work on AlphaGo, where the system was learning to play a game, that was a kind of adversarial setup, a game it was learning to play against humans, and in that instance there wasn't actually a lot of input from human experts. This kind of work, where we are deliberately trying to work together with people using machine learning, is relatively new for us as well.

OK. I think Marc said, and perhaps for the others this was between the lines, that at some stage in the process the machine wasn't right, or was pushing you in a direction that ultimately wasn't the correct one. How frequently did that happen? Marc, perhaps you're best placed to answer that.

It happened quite a lot at the outset. The patterns that we ended up using were not the only patterns we saw; there were a lot of other interesting patterns. Some of them we could see were not correct, we just knew they were not right, and that was similar to the issues we had about randomness: if you pick something randomly, you might see correlations between quantities which don't hold all the time. Some were already known, and some are still out there, interesting, and still not understood. We focused on the one we did not just because it was interesting, but because it was potentially provable, which we thought was really important in this project.

I would like to
add that it is very important to remember that these are statistical methods, and a lot of the patterns we see are true for a generic object. But, as Marc mentioned before, there might be some extreme cases lurking in the background that we are just never going to see, so finding the questions where something universal holds is quite difficult.

OK. Geordie, you said something I found really interesting, about the psychological element of this. It's well known, for those who teach students, that if the answer is in the back of the book, more students get the correct answer; in some ways it's liberating to know what's true and then try to prove it. And it's equally true that there are some mathematicians who specialise in producing beautiful second proofs of theorems, and are liberated by knowing a statement is true once somebody else has produced the first, dirty proof. Is that the role that machine learning played: did it somehow give you the confidence to try to prove the conjecture?

I think the psychology of it is definitely an interesting aspect of all of this, because one very basic thing that a machine learning model can do for mathematicians is tell you whether it's plausible that there's a simple relationship between input and output. We were working on this combinatorial invariance conjecture, and I would say that if these graph nets had failed completely to produce a realistic prediction, then the problem is probably rather difficult; the conjecture might not be false, but it's definitely not going to be easy. The fact that it was so easy for the model definitely had me kicking myself for a few months, going: come on, come on, Geordie, if linear algebra can do it then surely you can do it too. But I should also say that one of the very interesting things that happened in this project, among the many things we tried that didn't work, is that Alex and the DeepMind team tried to distil a network down to a very small network, in one case predicting the coefficient of q, and they got this network down so small that I could actually print out the two matrices involved. So the entire prediction was based on two matrices of a size I could print out; I think it was something like a four-by-twelve matrix. I printed them out, had them on my desk, and stared at them, and it was a really interesting lesson in the difficulty of interpreting even a very simple model.

So, Alex, perhaps this is a ridiculous question, but does the machine have a psychological element for you as well? Is the machine part of the team, and if so, what role does it play?

That is an interesting question. I think in this kind of work it's more fair to think of us as using tools from machine learning, and trying to figure out how they fit into the workflow of what Geordie and Marc and András do. At different points they've described it as a kind of weird collaborator, giving them the sort of information you might get from another human collaborator, but in a very different way, not as explicitly or as easily understood as from a person who is able to communicate well.

I really think it's good to think about it as an extremely poorly communicative collaborator. We've all got those, and we all know the value they bring.

Interesting. So, Alex, this is clearly very
innovative on the mathematical side: it may suggest new ways for mathematicians to work, and it has certainly suggested new ways to view problems that mathematicians have thought interesting for a long time. How novel is this on the machine learning side? Did this require you to do new machine learning, or was it more standard from your point of view?

It's a new way to do machine learning, but the techniques themselves are very straightforward. The fundamental ideas of supervised learning, or regression, and even this kind of saliency or sensitivity analysis, have been around in statistics for a hundred years. Some of the models we're using now are a lot more sophisticated than the old ones: in the work with Geordie we used graph nets, which are relatively new machine learning models developed over the last ten years, but they're still quite standard in terms of the things we use at DeepMind. The innovation was more in how we used them, and in trying to work together to find a way for them to be helpful.

OK, good. So I've been reading around, and of course lots of people are very enthused about this and excited by what you've done, but there are some people out there who are a little more sceptical. How would you address their scepticism? What could you say that might reduce it? Who's going to take that first?

Some of the criticism was aimed, for example, at the fact that in the knot theory project neural nets were maybe not necessary, but I don't think we ever claimed that they were a fundamental aspect of the project; there are various other techniques, like random forests, that could equally well be used in different scenarios. And I also think some people missed the fact that we were aiming at producing conjectures, not at theorem proving; maybe they were missing that aspect of the story, but that's a completely different story altogether.

Geordie?

I have absolutely no issue with someone being sceptical. What we did was work on two maths problems, and the same technique worked in both of them; from my point of view, that's the end of the story. How useful are these techniques? We don't know. We're not claiming that this is going to steamroll mathematics, or even steamroll five per cent of mathematics.

I'd definitely reinforce that, and I also completely understand the scepticism. My view is that this is a tool; it's not a magic bullet which will solve huge areas of mathematics, but it is an extra tool that mathematicians can use, one that is genuinely different and new from anything that's gone before.

Sorry, just to finish off with what Marc was saying: this has also just been a really, really fun thing to learn about and be involved with.

It seems like it. Alex, do you have a comment to make on this?

Yes, definitely. I agree exactly with what András, Marc and Geordie have said. I think people would be right to be at least somewhat sceptical of how useful this can be for their own mathematical research, but what I hope this serves as is some solid examples that machine learning can lead to mathematical results that mathematicians find interesting and care about. When we started this work we talked to a bunch more mathematicians to get their sense of whether machine learning could help them, whether this was going to
OK. So another thing that puzzled me — and it's perhaps hard to answer, because it's a bit counterfactual — is that in both cases the machine pushed you in a certain direction, and then it took nine months or a year to make human progress on the problem. Of course, people have made progress on problems over a period of nine months or a year without a machine. Do you think, if you'd set your mind to it and really made this your goal for nine months or a year, you might have come up with the same conjectures?

I can say, from my point of view, definitely not. That's exactly one of the things I find interesting about this project: I would not have embarked down this route had the machine not told us that there was a connection there. I can say for a fact that I personally would not have been able to get to the end without the machine starting things off. Other people may well have been able to.

For me it's been extremely eye-opening to be able to use these statistical techniques on large data sets of knots to really see what's going on. Otherwise there are just very limited knot tables and invariants, and you can maybe look at a few examples to see if your conjecture might hold or not; but to really look at millions of examples and at trends — that way you can find genuinely new and surprising things.

I totally agree with Marc and András — I never would have either. But I should say that I've been interested in this problem since I was a graduate student, so I've had plenty of time to work on it; I'm not as young as I used to be. And the other point is that if I'd forced myself to sit down and think about this problem for nine months, it's possible we would have made progress in a different direction, but this direction certainly would not have occurred to me.

Alex, were you surprised how long it took, after the machine had done its bit, for the humans to prove it?

I had no idea how easy or hard this would be, and given the limited extent to which I now understand knot theory and representation theory, I only have incredible respect for the difficulty of taking these potential conjectures and actually carrying them through to a proof. It's been a fantastic opportunity to sit as a fly on the wall and watch how difficult that was. But if I can offer a fun anecdote from my perspective: in the case with Marc and András, there were months where it seemed like no progress was happening, and then weeks when months of progress happened — something clicks, and within a week we'd gone from a weak idea of what the conjecture was about to really quite a strong idea of which direction the proof was going.

Good, OK. So clearly not all mathematical problems will be amenable to this kind of approach. Have you developed a sense, or an intuition, as to which are suitable or amenable to this style of attack?

Well, at the moment we're two from two, so we need a few more negative examples! But to be slightly more serious, one of the things we knew from the outset is that it has to be an area of maths where you can generate examples, and the examples have to be small enough to fit on a computer. That's certainly one criterion.
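[Editorial aside: to make the "generate examples, small enough to fit on a computer" criterion concrete, here is a toy sketch of our own — not data from the project — producing a large labelled data set of small mathematical objects: random graphs paired with an exactly computable invariant.]

```python
import numpy as np

rng = np.random.default_rng(1)

def random_graph(n, p):
    """Erdos-Renyi graph as a symmetric 0/1 adjacency matrix."""
    upper = rng.random((n, n)) < p
    adj = np.triu(upper, k=1)
    return (adj | adj.T).astype(float)

def spanning_tree_count(adj):
    """Kirchhoff's theorem: the number of spanning trees equals any
    cofactor of the graph Laplacian (0 for disconnected graphs)."""
    laplacian = np.diag(adj.sum(axis=1)) - adj
    return int(round(np.linalg.det(laplacian[1:, 1:])))

# Each example is tiny, so even millions of (object, invariant) pairs
# are cheap to produce at this scale.
dataset = [(g, spanning_tree_count(g))
           for g in (random_graph(8, 0.4) for _ in range(10_000))]
```

The same pattern — enumerate or sample small objects, compute invariants exactly, feed the pairs to a learner — is what makes areas with abundant computable examples good candidates for this style of work.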
But can I just check — you are two from two? There aren't cases you're not sharing, of abject failure?

Not using this style of working. As we said earlier with Marc, we investigated lots of different things before we tried this method of working, but no, there are no hidden failures there.

OK. András, do you have a sense of the kinds of problems that might be more amenable to this now? Have you developed a better understanding of the process?

Yes, definitely. As Alex said, you need large data sets — not just tens or hundreds of examples — and they should be representable in relatively simple ways. Even things like polynomials can be tricky to interpret, and there are much more complicated structures in maths; I don't know what you would do with groups or other algebraic objects. But if you have numbers or graphs, that seems to be nicely amenable to these methods — nothing too complicated.

Geordie, do you have a different perspective on that?

The obvious one is big data — large data sets. Another one is a little more subtle, and I thought of an analogy the other day to try to explain it: an attempt to explain why this doesn't go so well with mathematical proof. I'm not sure if you use Gmail, but recently Gmail has started proposing the next word of some of the sentences that you type. This is using a language model to predict the next word of the sentence, and it's doing so in a very generic way, whereas mathematical proof often has the quality of surprise: the final word of the sentence is not the word that the language model would predict. People often say that a mathematical proof has the quality of a joke. And I definitely feel that in our work we were exploring a very continuous learning landscape: small deformations of the input produced predictable variations in the output. I really imagine that we're learning some extremely high-dimensional function, but one that has some kind of smoothness; if it were an extremely jagged landscape, these techniques would not work.
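[Editorial aside: that smoothness intuition can at least be probed empirically. A hypothetical sketch — our construction, not an experiment from the project — perturbs inputs slightly and compares how much the outputs of a "smooth" and a "jagged" target function move; both functions here are stand-ins for a learned model.]

```python
import numpy as np

rng = np.random.default_rng(2)

def local_variation(f, xs, eps=1e-3, trials=100):
    """Average |f(x + delta) - f(x)| over small random perturbations:
    a crude probe of how smooth the landscape being learned is."""
    deltas = rng.normal(scale=eps, size=(trials,) + xs.shape)
    return np.mean([np.abs(f(xs + d) - f(xs)).mean() for d in deltas])

xs = rng.normal(size=(50, 4))
smooth = lambda x: np.sin(x).sum(axis=1)                 # small moves in, small moves out
jagged = lambda x: (np.floor(100 * x) % 2).sum(axis=1)   # parity-like, highly discontinuous

print(local_variation(smooth, xs))   # small
print(local_variation(jagged, xs))   # comparatively large
```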
I see — that's interesting, because in a number of representations of these surfaces that appear in machine learning, the surfaces do look very jagged. So perhaps I could ask the question in the following way: there are a lot of open mathematical questions of a very fundamental nature related to machine learning itself. Is machine learning the right way to attack those problems — that is, how good are machines at understanding themselves? Alex?

I think the answer to that is that we would need to find an expert mathematician in the areas you're referring to — I'm guessing this is optimization, or non-convex optimization, something like this. Maybe it would be a good place to have a look; the call's out there.

OK. So I suppose this is related, but in your view, what should be the next steps on this journey? András, Marc, Geordie — are you continuing with this approach? Have you got a list of problems you're going to be taking to Alex, or have you now moved on to other collaborations?

Actually, we have a number of projects that we are still working on — some of them new, some of them continuations of the older questions that we investigated — but there is nothing we can report on at this stage.

I'd just add to that: definitely, I feel this has opened up a way of working which we certainly intend to pursue. I think it's also worth saying that the next step is to get other mathematicians involved. If other mathematicians can run with these techniques, there may be some really interesting things to be done here.

And just to add to that: I totally agree, we need more people involved. The other thing I feel we desperately need is the first 50 examples of machine learning in mathematics. We were pursuing really difficult problems, pushing architectures — graph neural nets are rather complicated things; you can't just watch a YouTube tutorial and train one. But there's a whole lot of very simple things you can do in machine learning whose interface with mathematics I don't think we've explored very well — like predicting simple number-theoretic functions, these kinds of problems. So we're actually running a seminar in Sydney this semester with precisely this aim of producing good code that mathematicians can play with.
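[Editorial aside: in that spirit, a minimal experiment of the kind described might look like the following — a hypothetical illustration, not material from the Sydney seminar. It trains an off-the-shelf random forest (one of the alternative techniques mentioned earlier) to predict a simple number-theoretic property from the binary digits of n, then asks which digits the model relied on.]

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def binary_digits(n, width=16):
    """Little-endian binary digits of n, padded to a fixed width."""
    return [(n >> i) & 1 for i in range(width)]

# Toy task: predict whether n is divisible by 4 from its binary digits.
ns = np.arange(2, 20_002)
X = np.array([binary_digits(int(n)) for n in ns])
y = (ns % 4 == 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X[:15_000], y[:15_000])

# Should be essentially perfect: the label depends only on the two
# lowest-order bits, which the forest can discover from the data.
print("held-out accuracy:", model.score(X[15_000:], y[15_000:]))

# Crude attribution: the importances should concentrate on features
# 0 and 1, recovering the "reason" for the label.
print("feature importances:", model.feature_importances_.round(3))
```

The point of such toy problems is not the answer — which is known in advance — but giving mathematicians a complete, inspectable pipeline from data generation through training to attribution.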
So, I remember when I was a graduate student, computer algebra and computer calculus packages were just coming out, and at that stage they seemed extremely clunky and awkward; now, of course, we teach them to all our undergraduates. How long will it be before we're teaching this kind of machine learning to our undergraduate students in mathematics, with a view to developing conjectures and proving theorems?

Maybe I can take that. One very simple thing is that machine learning layers are already starting to get incorporated into the symbolic algebra packages you love, so I would say that in five or ten years you'll be using it without realizing it. Beyond that I'm not sure — I'll pass on to Marc and András.

On the question of when we'll be teaching these techniques to undergraduates: to be honest, I would say it should be treated with some caution. I feel it's important to train in traditional mathematical proof first — which is what you do as a graduate student: you're given problems, you work on them, you prove something, and you feel really pleased when you've proven your first theorem. I think that's a hurdle you have to jump over before you can start to use these particular techniques, so I would say: not yet in the undergraduate syllabus. Having said that, most of my undergraduate students are extremely interested in machine learning, so I think they would be very receptive to these techniques.

OK, good. So, one of my heroes, the mathematical physicist and Nobel prize winner Eugene Wigner, has a famous quote. Shown the output of a mathematical computation on a computer, he famously said: "It is nice to know that the computer understands the problem, but I would like to understand it too." What do you think he'd say to your work?

I would say it absolutely sums up the issues. Sorry — Alex, you go, if you've got something there.

I'd say that hopefully this work is a first step at addressing exactly that point. Like you were saying, the saliency — maybe surprising, but obvious in retrospect — is the really useful part of this: can we explain some amount of what the model has learned about the function we're trying to learn? There's certainly an acknowledgement in the machine learning community that we could do better in terms of developing more interpretable models, or better techniques for describing what functions have been learned. But it's good to know that even some of the more basic tools can still get across just enough insight to allow these kinds of collaborations to happen.
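[Editorial aside: one of those "basic tools" is the classical sensitivity analysis mentioned near the start of this discussion. A minimal, model-agnostic sketch — our own illustration, with a simple stand-in for a trained model — perturbs one input coordinate at a time and watches how the prediction moves.]

```python
import numpy as np

def sensitivity(f, x, eps=1e-4):
    """Finite-difference estimate of df/dx_i at a single input: the
    'basic tool' version of saliency -- bump one coordinate at a time
    and measure how much the prediction changes."""
    base = f(x)
    grads = np.zeros_like(x)
    for i in range(x.size):
        bumped = x.copy()
        bumped[i] += eps
        grads[i] = (f(bumped) - base) / eps
    return grads

# Stand-in for a trained model: the prediction depends strongly on the
# first coordinate, weakly on the second, and not at all on the third.
model = lambda x: 3.0 * x[0] + 0.1 * np.sin(x[1])
x = np.array([0.5, 1.0, -2.0])
print(sensitivity(model, x))   # roughly [3.0, 0.1*cos(1.0), 0.0]
```

Ranking inputs by sensitivities like these is what pointed the mathematicians toward the few invariants worth thinking about, out of the many fed to the model.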
Great. OK, those are my questions — is there anything I should have asked, or that you wanted to be asked, that I didn't? Any angle we've not covered?

Maybe I have a question for Alex. Imagine that you could walk around DeepMind with a model that was predicting zeros of the zeta function with 100% accuracy, say. To what extent would that be a motivating factor, within machine learning, to get better at interpreting models?

If we had a model that could predict the zeros of the zeta function, I think that would be a great motivating example for people to drop everything and see if we could figure out what the model was doing. One of the things we've historically lacked is good motivating examples for understanding what models are doing. For a lot of the problems we attack, everyone is going to be very impressed if we're able to beat the world champion at Go; everyone will be very impressed if you're able to build a system that can recognize images — and so on through the many successes machine learning has had. But there's often less focus on understanding. So the more we have examples where understanding is important, the more we can convince people in machine learning to take even more notice of this part of the field.

Perhaps I can expand on that question a bit, then, because it is a good one. If you had to pick one of the Clay Millennium Prize problems to throw your machine at — all of you — which one would it be? Is it the Riemann hypothesis, Geordie, or is there another one you think machine learning suits better? And who'd get the million dollars if you succeeded — does the machine deserve a cut?

I would go for the Hodge conjecture; that's my personal fantasy.

Any other fantasies anyone wants to share?

It's actually a good question, and I think it shows something of the difficulty of using these techniques: of those seven problems — well, one of them has been solved, but for the remaining six it's not completely clear how you would use these techniques to tackle them. That said, the Birch and Swinnerton-Dyer conjecture was itself generated by looking at data, so in some sense the sort of data-driven technique we've been using is in the spirit of that. But whether it could be used to solve the Birch and Swinnerton-Dyer conjecture — that seems less likely.

Well, it probably shows the power of the human mind. So it was with the Riemann hypothesis: Riemann did numerical computations of the first few zeros and extrapolated from there — from knowing as many as you can count on the fingers of one hand to making a prediction for all of them. That seems a little like what you've been doing, with a lot more data.

Well, that was a really interesting conversation. Thanks ever so much for sharing your insights and answering questions that were hopefully not too impertinent. I really think this will have given people a much greater understanding of what you've done, and I'm sure that many people will find it extraordinarily interesting. So thank you all very much indeed for your presentations and for this discussion.