I think it's fair to say that this year, 2020, has thrown quite a few challenges at human civilization, so it's really nice to get some positive news about truly marvelous accomplishments of engineering and science. One was SpaceX, I would argue, launching a new era of space exploration. And now, a couple of days ago, DeepMind announced that the second iteration of its AlphaFold system has, quote unquote, solved the 50-year-old grand challenge problem of protein folding. Solved here means that these computational methods were able to achieve prediction performance similar to much slower, much more expensive experimental methods like X-ray crystallography.

In 2018, which is the previous iteration of the CASP competition, AlphaFold achieved a score of 58 on the hardest class of proteins, and this year it achieved a score of 87, which is a huge improvement, and it's still 26 points better than the closest competition. So this is definitely a big leap, but it's also fair to say that the internet is full of hype about this breakthrough, so let me indulge in the fun a bit. Some of it is definitely a little bit subjective, but I think the case could be made on the life science side that this is the biggest advancement in structural biology of the past one or two decades. And in my field of artificial intelligence, I think a strong case could be made that this is one of the biggest advancements in the recent history of the field.

Of course, the competition is pretty steep, and I talk with excitement about each of these entries. Of course there's the ImageNet moment itself, or the AlexNet moment, that launched the deep learning revolution in the space of computer vision. Many people are now comparing this breakthrough of AlphaFold 2 to the ImageNet moment, but now in the life sciences field. I think the good old argument over beers about which is the biggest breakthrough comes down to the importance you place on how much direct real-world impact a breakthrough has. Of course, AlexNet was ultimately
on a toy data set, a very simplistic image classification problem, which does not have a direct application to the real world, but it did demonstrate the ability of deep neural networks to learn from a large amount of data in a supervised way. But anyway, this is probably a very long conversation over many beers. AlphaZero, with reinforcement learning and self-play, is obviously in contention for the biggest breakthrough. There are the recent breakthroughs in the application of transformers in the natural language processing space, with GPT-3 being the most recent iteration of state-of-the-art performance. There's the actual deployment of robots in the field, used by real humans, which is Tesla Autopilot: the deployment of massive fleet learning, of massive machine learning in safety-critical systems. And then there are other kinds of robots, like the Google self-driving car, the Waymo systems, that are taking a further leap of removing the human from the picture, being able to drive the car autonomously without human supervision. Smart speakers in the home: there's a lot of actual in-the-wild natural language processing that I think doesn't get enough credit from the artificial intelligence community for how much amazing stuff is there. And depending on how much value you put in engineering achievements, especially in the hardware space, Boston Dynamics with Spot, the Spot Mini robot, one could argue, is one of the great accomplishments in the artificial intelligence field, especially when you look maybe 20 or 50 years down the line, when the entire world is populated by robot dogs and the humans have gone extinct.

Anyway, I say all that for fun, but really this is one of the big breakthroughs in our field and something to truly be excited about, and I'll talk about some of the possible future impact I see from this breakthrough in just a couple of slides. Anyway, my prediction is that there will be at least one, potentially several, Nobel Prizes that will result from derivative work launched directly with
these computational methods. It's kind of exciting to think that it's also possible that we'll see a first Nobel Prize awarded where much of the work is done by a machine learning system. Of course, the Nobel Prize is awarded to the humans behind the system, but it's exciting to think that a computational approach or machine learning system will play a big role in a Nobel Prize level discovery in a field like physiology or medicine, or chemistry, or even physics.

Okay, let's talk a bit about proteins and protein folding, and why this whole space is really fascinating. First of all, there are amino acids, which are the basic building blocks of life. In eukaryotes, which is what we're talking about here with humans, there are 21 of them. Proteins are chains of amino acids and are the workhorses of living organisms, of cells, and they do all kinds of stuff, from structural to functional: they serve as catalysts for chemical reactions, they move stuff around, they do all kinds of things. So they're both the building blocks of life and the doers and movers of life. Hopefully I'm not being too poetic.

So protein folding is the fascinating process of going from the amino acid sequence to a 3D structure. There's a lot that could be said here, and there are a lot of lectures on this topic, but let me quickly say some of the more fascinating and important things that I remember from a few biology classes I took in high school and college. Okay, so first, there's a fascinating property of uniqueness: a particular sequence usually maps one-to-one to a 3D structure. Not always, but usually. To me, from an outsider's perspective, that's just weird and fascinating. The other thing to say is that the 3D structure determines the function of the protein, so one of the corollaries of that is that the underlying cause of many diseases is the misfolding of proteins.

Now back to the weirdness of the uniqueness of the folding. There are a lot of ways for a protein to fold based on the sequence of amino acids. There's, I think,
10 to the power of 80 atoms in the universe, so 10 to the power of 143 ways to fold is a lot. You can look up Levinthal's paradox, which is one of the early formulations of just how hard this problem is and why it's really weird that a protein is able to do it so quickly. As a completely irrelevant side note, I wonder how many possible chess games there are. I think I remember it being 10 to the power of 100, something like that; I think that would also necessitate removing certain kinds of infinite games. Anyway, off the top of my head, I would venture to say that the protein folding problem, just in the number of possible combinations, is much, much harder than the game of chess. But it's also much weirder. You know, they say that life imitates chess, but I think that from a biological perspective, life is way weirder than chess. Anyway, to say once again what I said before, the misfolding of proteins is the underlying cause of many diseases, and again, I'll talk about the implications of that a little bit later.

From a computational, from a machine learning, from a data set perspective, what we're looking at currently is 200 million proteins that have been mapped and 170,000 protein 3D structures, so much, much fewer, and that's our training data for the learning-based approaches to the protein folding problem. Now, the way those 3D structures were determined is through experimental methods, one of the most accurate being X-ray crystallography. I saw a University of Toronto study showing that it costs about 120,000 per protein and takes about one year to determine the 3D structure. Because it costs a lot and is very slow, you only have 170,000 3D structures determined. That's one of the big things the AlphaFold 2 system might be able to provide: at least for a large class of proteins, being able to determine the 3D structure with high enough accuracy to open up the structural biology field entirely, with several orders of magnitude more protein
3D structures to play with.

There's not currently a paper out that describes the details of the AlphaFold 2 system, but I think it's clear that it's heavily based on the AlphaFold 1 system from two years ago. So I think it's useful to look at how that system works, and then we can hypothesize, speculate about the kinds of methodological improvements in the AlphaFold 2 system.

Okay, so for the AlphaFold 1 system, there are two steps in the process. The first includes machine learning; the second does not. The first step includes a convolutional neural network that takes as input the amino acid residue sequence plus a ton of different features that their paper describes, including the multiple sequence alignment of evolutionarily related sequences. The output of the network is a distance matrix, with the rows and columns being the amino acid residues, giving a confidence distribution over the distance between each pair of amino acids in the final geometric 3D structure of the protein. Then, once you have the distance matrix, you have a non-learning-based gradient descent optimization that folds the 3D structure to match, as closely as possible, the distances between the amino acid residues specified by the distance matrix. Okay, that's it at a high level.

Now, how does AlphaFold 2 work? First of all, we don't know for sure. There's only a blog post and a little speculation here and there. But one thing is clear: there are attention mechanisms. So I think convolutional neural networks are out and transformers are in, the same kind of process that's been happening in the natural language processing space and really most of the deep learning space. It's clear that attention mechanisms are going to be taking over every aspect of machine learning. So I think the big change is ConvNets are out, transformers are in. The rest is more in the speculation space. It does seem that the MSA, the multiple sequence alignment, is part of the learning process now
as opposed to part of the feature engineering, which it was in the original system. I believe it was only a source of features; please correct me if I'm wrong on that. But it does seem like here it's now part of the learning process, and there's something iterative about it, at least in the blog post, where there's a constant passing of learned information between the sequence residue representation, which is the evolutionarily related sequence side of things, and the amino acid residue-to-residue distances that are more akin to the AlphaFold 1 system. How that iterative process works is unclear: whether it's part of one giant neural network or whether several neural networks are involved, I don't know. But it does seem that the evolutionarily related sequences are now part of the learning process, it does seem that there's some kind of iterative passing of information, and of course attention is involved in the entire picture.

Now, at least in the blog post, the term spatial graph is used, as opposed to a distance matrix or adjacency matrix. So I don't know if there are some magical tricks involved, some interesting generalization of an adjacency matrix in a spatial graph representation, or if it's simply using the term spatial graph because there is more than just pairwise distances involved in this version of the learning architecture. I think there are two lessons from the recent history of deep learning. One is that if you involve attention, if you involve transformers, you're going to get a big boost. The other lesson is that if you make as much of the problem learnable as possible, you're often going to see quite significant benefits. This is something I've definitely seen in computer vision, especially on the semantic segmentation side of things.

Okay, why is this breakthrough important? Allow this computer scientist, AI person, to wax poetic about some biology for a bit. Because the protein structure gives us the protein function, figuring out the structure for maybe
millions of proteins might allow us to learn unknown functions of genes encoded in DNA. Also, as I mentioned before, it might allow us to understand the cause of many diseases that are the result of misfolded proteins. Other applications will stem from the ability to quickly design new proteins that in some way alter the function of other proteins. So for treatments, for drugs, that means designing proteins that fix other misfolded proteins; again, those are the causes of many diseases. I read a paper that was talking about agricultural applications: being able to engineer insecticidal proteins or frost-protective coatings. Stuff I know nothing about; I read it, it's out there. Tissue regeneration through self-assembling proteins, supplements for improved health and anti-aging, and all kinds of biomaterials for textiles and just materials in general.

Now, in the long term, or the super long term, the future impact of this breakthrough might be just the advancement of end-to-end learning of really complicated problems in the life sciences. Protein folding is looking at the folding of a single protein. So there's being able to predict multi-protein interaction or protein complex formation, which, even with my limited knowledge of biology, I think is a much, much, much harder problem as far as I understand. And there's being able to incorporate the environment into the modeling of the folding of the protein, and also seeing how the function of that protein might change given the environment. All those kinds of things, incorporated into the end-to-end learning problem.

Then, taking a step even further: this is physics, biophysics, so being able to accurately do physics-based simulation of biological systems. If we think of a protein as one of the most basic biological systems, then taking a step out further and further, increasing the complexity of the biological systems, you can start to think of something crazy like being able to do accurate physics-based simulation of cells, for example, or entire
organs, or maybe one day being able to do an accurate physics-based simulation of the very over-caffeinated organ that's producing this very video. In fact, how do we know this is not a physics-based simulation of a biological system whose assigned name happens to be Lex? I guess we'll never know. And of course we can go farther out into super long-term, sci-fi kinds of ideas of biological life and artificial life, which are fascinating ideas: being able to play with simulation and prediction of organisms that are biologically based or non-biologically based. I mean, that's the exciting future of end-to-end learning systems that step outside the game-playing world of StarCraft, of chess and Go, and go into the life sciences, into real-world systems that operate in the real world. That's where Tesla Autopilot is really exciting, that's where any robots that use machine learning are really exciting, and that's where this big breakthrough in the space of structural biology is super exciting and truly, to me, as one humble human, inspiring beyond words.

Speaking of words, for me these quick videos are fun and easy to make, and I hope it's at least somewhat useful to you. If it is, I'll make more. It's fun, I enjoy it, I love it. Really quick shout-out to podcast sponsors: Vincero watches, the maker of classy, well-performing watches (I'm wearing one now), and Four Sigmatic, the maker of delicious mushroom coffee. I drink it every morning and all day, as you can probably tell from my voice now. Please check out these sponsors in the description to get a discount and to support this channel. All right, love you all, and remember, try to learn something new every day.
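
One addendum for anyone who wants to play with the second, non-learning step of the AlphaFold 1 description above: the idea of folding a 3D structure by gradient descent so that it matches a distance matrix can be sketched in a few lines of NumPy. To be clear, this is a toy sketch under my own assumptions and not DeepMind's method: it uses a single target distance per residue pair instead of a predicted distribution, it moves raw 3D coordinates directly, and the function names `pairwise_distances`, `stress`, and `fold_by_gradient_descent` are made up for illustration.

```python
import numpy as np


def pairwise_distances(coords, eps=1e-12):
    """Return the (n, n) matrix of Euclidean distances between rows of coords."""
    diff = coords[:, None, :] - coords[None, :, :]
    # eps keeps the square root (and later divisions) well behaved on the diagonal
    return np.sqrt((diff ** 2).sum(axis=-1) + eps)


def stress(coords, target):
    """Sum of squared distance errors over all pairs i < j."""
    err = pairwise_distances(coords) - target
    return float((np.triu(err, k=1) ** 2).sum())


def fold_by_gradient_descent(target, steps=2000, seed=0):
    """Toy stand-in for the non-learning step: fit 3D points to target distances.

    Assumption: `target` is one symmetric (n, n) matrix of desired distances,
    not the per-pair distance distribution the real network outputs.
    """
    n = target.shape[0]
    rng = np.random.default_rng(seed)
    coords = rng.normal(size=(n, 3))  # random starting "structure"
    lr = 1.0 / (2.0 * n)              # conservative step size for this loss
    for _ in range(steps):
        diff = coords[:, None, :] - coords[None, :, :]  # shape (n, n, 3)
        dist = pairwise_distances(coords)
        err = dist - target
        np.fill_diagonal(err, 0.0)  # a point has no error against itself
        # d(stress)/d(x_i) = 2 * sum_j err_ij * (x_i - x_j) / dist_ij
        grad = 2.0 * ((err / dist)[:, :, None] * diff).sum(axis=1)
        coords = coords - lr * grad
    return coords


if __name__ == "__main__":
    # Build target distances from a hidden random structure, then try to recover it.
    hidden = np.random.default_rng(1).normal(size=(10, 3))
    target = pairwise_distances(hidden)
    folded = fold_by_gradient_descent(target)
    print("remaining stress:", stress(folded, target))
```

The recovered points can only match the hidden structure up to rotation, translation, and mirror image, since distances alone carry no more than that, and plain gradient descent like this can get stuck in a local minimum on harder inputs.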