Transcript for:
Machine Learning in Computational Drug Discovery

Okay, so I'll begin. The topic for today is lead discovery and development, and in particular we're going to focus on machine learning in computational drug discovery.

A little bit about myself: I'm currently head of the Center of Data Mining and Biomedical Informatics, which was started back in 2011, so this year is about its 10-year anniversary. Personally, I have published 127 research articles and 17 review articles, as well as five book chapters, and all of this is made possible by an excellent team comprising several PhD students and young faculty researchers. Aside from being an associate professor and the head of the center, I am also a part-time YouTuber. In my spare time I make YouTube videos on the Data Professor YouTube channel and on a second channel called Coding Professor. I'm also a blogger on Medium. Traditionally you might think of blogging as something like a food blogger who blogs about food, but that's not the case for me: I blog about data science and bioinformatics. I show how you can analyze data step by step. You don't need any background; you can start from scratch, from a data set, and I try to write in simple terms so that anyone can use data science or bioinformatics to analyze data and make data-driven decisions.

When I first started making YouTube videos and blogging on Medium, the challenge I heard a lot from colleagues and students was that they don't know anything about data science or bioinformatics. It's such a new field, and it's very challenging when you talk to everyone and they don't have a clue what you're talking
about, and therefore I figured it would be nice if I could find a way to provide some education about this. Looking online, there were not many resources back then, two years ago, and that is what started the channel. Nowadays we have several tutorial videos, and they help bridge the gap. When researchers publish, they talk in technical jargon. You read the materials and methods, but it's very difficult to see how the work was done, because the explanation in the paper may be superficial, meaning it won't be enough for you to reproduce the work. This is an especially big hurdle for students wanting to learn about bioinformatics or data science.

From blogging and making YouTube videos, I also see that a lot of people don't know about databases such as Scopus. They rely on arXiv and bioRxiv, the preprint servers, so they rely on free resources, and I don't think they know about the various journals that we read. Therefore what we publish doesn't really reach the public. If we talk to a random data scientist working at SCB, they probably won't know that we used random forest to make predictions of bioactivity. And if you look at the data from Scopus, most papers, for example from my own research group, are not read by the general masses; maybe they're read by only a handful of researchers in the field. A paper might get some citations, but it's not getting the attention that we hope it would. But then I noticed that when, for example, I make a
YouTube video or blog an article, say the one I recently wrote about how to master scikit-learn for data science, that article received about 10,000 views in about a month, versus a research article that I would publish, which maybe received a couple hundred in a year. Not a lot of people get to see the work that you're doing, and a common thing we hear is that most papers go to the shelves. In Thai we say it goes to the shelves and is never read. As researchers, we want our research to have an impact on the general public, and if the public don't know about it, how will they be able to make use of it? So I think this is a big challenge: to do research and also to communicate about research. For this presentation, I actually recorded the first talk and uploaded it as the very first video on my YouTube channel, so if you go back and have a look, it comes from this course, which I taught over at Siriraj, in this department.

Okay, so let's have a look at disease. When you think of disease, what do you think of? If you open the Cambridge Dictionary, you will see that diseases are illnesses of people, animals, or plants. So plants can have disease too. We are familiar with animals and people having disease, but plants as well. These illnesses are caused by a malfunction, or by an external factor like infection, rather than by accident, and they have a severe effect on the health of the living organism. The malfunction, or the undesired pathophysiological condition, can be remedied by drugs. Drugs are biological or chemical entities: a drug can be a compound, and it can also be a protein or peptide. If it is a peptide or protein, we call it biological. If it's
coming from compounds, like a small molecule, we call it a chemical entity, or a small molecule, or a compound, or generally we just refer to it as a drug. Biological entities also include antibodies, and chemical entities, as I mentioned, include small molecules. Typically in a drug discovery setting you want to find a drug that is able to interact with the target protein. In a traditional, simplistic drug discovery system you have a single target protein, and you want to find a small molecule or an antibody that can interact with it and exert a modulation. The modulation could be either inhibition or activation, but for a lot of diseases we want to inhibit the function of the protein, so we try to find a molecule that can bind to the target protein.

Now, drug-target networks. If you look at the broad concept of bioinformatics, what I mentioned in the previous slide happens here in the drug-target networks. Let me try to draw an image. Can you see this? "Yes." Okay, cool. Let me draw some nodes. Each circle represents a protein. Let's say we have proteins A, B, and C, and we could say proteins A and B interact. Let me make that a different color, light pink. And we have protein D, and E, and let me add some more. So this is a biological network: each node represents a protein, and the connections are protein-protein interactions. If you look at this image, which one do you think is the most important protein? "A, because it has many connections with other proteins." Right, you see that it's the most connected, and actually that's the fact
when they're doing biological network analysis: A is at the center of the biochemical pathway. But the tricky part is that it might or might not affect the disease that you're interested in. You also have to consider F and G, B and E, D and C, and how they are involved in this particular pathway. And I, J, and K are separate, on an island, so they don't affect this pathway. Take p53, for example, which is similar to A. If you look at a lot of the papers related to p53, it is at the center as well; they call it a hub. These hub proteins are quite essential for the disease, but if you find a drug that inhibits a hub protein, it's not a good idea. Do you know why? Because there might be side effects. Therefore we have to find which of these nodes we can inhibit without side effects, and so we have to find the rate-limiting step: which protein is at the rate-limiting step? You have to analyze the role of each protein in the pathway and know how each one functions.

Proteins could be enzymes, or they could simply be receptors. Enzymes catalyze a reaction: you have the substrate, and the enzyme converts it into a product. For example, aromatase converts androstenedione to estrogen, and we know that a high level of estrogen leads to breast tumors. So what do we want? We want to reduce the level of estrogen. And how can we reduce the level of estrogen? We have to inhibit aromatase. How do we
inhibit aromatase? We need aromatase inhibitors. Aromatase is at the rate-limiting step, and therefore it is a good target protein. If you look at the pathway here, not every protein can be a target protein that we want to find a drug to inhibit. Based on this network, we have to analyze them one by one, and we have to look at the chain of effects, because everything is like a domino: if we inhibit this one, we might influence this one and this one. Let me draw some more. Let's say F interacts with other proteins of its own, and B also has other proteins that it interacts with. So we have to be careful: if we inhibit this, we might also influence this pathway, and this pathway as well. The question is how we can stop this pathway with minimal side effects, because when we encounter side effects there are downstream effects. You inhibit A, but you also cause B to malfunction, and you also cause F to malfunction. So there must be a trade-off. It's not just "inhibit A and all problems are solved." You have to look at whether it influences other proteins in the pathway as well, and if it does, it might cause side effects.

I'm going to show you in the slides that there are also other terms, such as off-target binding and polypharmacology. Briefly, off-target binding happens like this. This is the protein, and we have the compound. Ideally you want the compound to interact with the protein and cause some effect, hopefully a therapeutic effect. But it's not that simple, because your compound may also interact with another protein. Let's call this one protein A and this one protein B. Binding to protein B will have side effects; that is off-target binding. Let me label it: this is off-target binding, and this is what we
expect, the expected binding. But sometimes things go unexpectedly, and the compound might bind elsewhere and cause side effects. However, it's not always a bad thing, because it might also cause a new therapeutic effect. If that happens, we call it drug repurposing, or sometimes drug repositioning, which leads to a novel drug indication, where, in simple terms, we teach old drugs new tricks. Let me color this: this is off-target binding, and this is expected binding. So far this addresses the topics of polypharmacology and off-target binding, which, as you can see, are essentially related. The compound is supposed to bind protein A, but then it binds to protein B. In scenario A, it gives a bad effect, and you call it a side effect. In scenario B, it gives rise to a good therapeutic effect, and you could call that serendipity. Serendipity is when you find something miraculous by accident: you find a new drug by accident. A popular example is the discovery of penicillin by Alexander Fleming. The story goes that he went on a vacation or so, over a long weekend, and discovered that a petri dish had an inhibition zone. He investigated the reason behind the inhibition zone and discovered that the fungi were producing a penicillin compound, and that is how antibiotics were discovered. So this is serendipity.

Okay, let's go back. I have already explained all of this, so let's go to the next page: drug discovery. The drug discovery process is a very long process, taking about 10 to 15 years, and the failure rate is pretty high, at about 90 percent, and it's
very costly, because it takes almost 2 billion US dollars to take one single drug to the market. That means it takes 10 years of research just to figure out whether a compound will work, whether it will have side effects, what effect it will have in the human body, and its pharmacokinetic properties. All of that takes 10 to 15 years and 2 billion US dollars, because thousands of compounds will be in the clinical trials and, out of hundreds or thousands, only a few, like two or three, will pass. This is explained by the drug discovery process. Let me see if I can draw on here. Okay, nice.

It starts with target discovery. As I mentioned already, you have to analyze your biological pathway in order to identify proteins that you think might be good candidates to serve as the target. That is the first part, target discovery. Once you have identified a protein that could serve as a target, you want to find compounds that could bind to it.

I forgot to mention something about the earlier drawing. Let's say we identify protein E; we don't want A, because it might have side effects. Say we identify protein E as a good candidate to serve as the target protein. Then what do we do? We might do a knockout experiment, where we genetically engineer a mouse or E. coli to not express the particular gene that we identified, and then we see what that does to the phenotype. If the knockout has an effect on the phenotype, then we know that it is a good candidate, and we will try to find a compound that is able to modulate it. Modulate means it
could inhibit or it could activate the protein. We do that by performing discovery screening. If you do it experimentally, and actually your department also has equipment for high-throughput screening, there is high-throughput screening and also the typical low-throughput screening. How would you do it? Let me draw an image. What you typically have is a microtiter plate with many wells. With the pipette (I'm not sure I drew it properly) you put the protein reagent into all of the wells. So all of the wells here will have the protein, indicated by yellow. Then we add a different drug to each well: drug A, drug B, drug C, D, E, and so on, represented by the different colors. So in each well you have a combination of the target protein and a compound. Then we measure the readout, normally fluorescence, and we convert it into units like Ki, IC50, or percent binding. We collectively refer to this as the bioactivity. So this is screening, and it could be high-throughput or low-throughput, depending on the facility. From hundreds or even thousands of tests we get the IC50 and Ki, and the measured activity goes into a database.

Let me show you. The database is ChEMBL. This database contains the biological activity of any proteins and any compounds that you're interested in. Let's say I'm interested in aromatase. I type in aromatase, and on the menu bar here you have the targets and the compounds, so I click on the target and
you're going to see that there are two data sets available, one for Homo sapiens, human, and one for rat. You're going to see that there are over 2,000 compounds here, 2,986 compounds, so there are data on about 2,900 compounds against human aromatase, and for rat about 211. If you click on the data, you have about 5,000 records. So you have 2,900 compounds but 5,000 records, which means that each compound may have bioactivity measured in more than one unit: it could be Ki, it could be IC50, it could be percent binding. Here it says Ki, this is IC50, and this is percent binding, and you can see the unit here, percent, and this is the bioactivity value. If it's Ki or IC50, the lower the better, but if it's inhibition percent, the higher the better. However, the data for inhibition percent are not reliable, so for any analysis where we build a prediction model we use IC50 or Ki, because they are more reliable: they are constants obtained from multiple measurements.

Let me draw an image. You have inhibition from 0 to 100 percent on this axis, and 50 percent inhibition is right here. This axis is the concentration in molar. You take multiple measurements at different concentrations to get a curve, and then you want to know where 50 percent is. It corresponds to this concentration, and that concentration is the IC50. What is the value here? Let's say it's 10 micromolar. Now let's say you have another compound, measured at different points like this. The 50 percent point for this one is right here, and so the value is lower, it's smaller.
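
A minimal sketch of how an IC50 can be read off a measured curve like the one described here, assuming simple linear interpolation on a log-concentration axis rather than the curve fitting a real analysis would normally use. The function name `ic50_from_curve` and the readings are hypothetical, invented for illustration so that the 50 percent point lands at 10 micromolar as in the spoken example.

```python
import math


def ic50_from_curve(concentrations_molar, percent_inhibition):
    """Estimate IC50 by linear interpolation on a log10 concentration axis.

    Assumes concentrations are sorted ascending and that inhibition rises
    with concentration, crossing 50% inside the measured range.
    """
    points = list(zip(concentrations_molar, percent_inhibition))
    # Walk through neighbouring measurements and find the pair that brackets 50%.
    for (c_lo, y_lo), (c_hi, y_hi) in zip(points, points[1:]):
        if y_lo <= 50.0 <= y_hi and y_hi > y_lo:
            frac = (50.0 - y_lo) / (y_hi - y_lo)
            log_lo, log_hi = math.log10(c_lo), math.log10(c_hi)
            return 10 ** (log_lo + frac * (log_hi - log_lo))
    raise ValueError("curve does not cross 50% inhibition in the measured range")


# Hypothetical readings (not real assay data): five concentrations in molar
# and the percent inhibition measured at each one.
conc = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3]
inhibition = [5.0, 20.0, 50.0, 80.0, 95.0]

ic50 = ic50_from_curve(conc, inhibition)
print(f"estimated IC50: {ic50 * 1e6:.1f} micromolar")
```

The point of the sketch is only that the estimate needs several measurements along the curve, which is why a single percent-inhibition reading is treated as less reliable.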
Let's say the value for this second compound is about 100 nanomolar. Therefore the purple one is the better inhibitor, because it needs a lower concentration. Let's call this one compound A and this one compound B. The compound with the lower concentration that elicits 50 percent inhibition is the better compound. Here compound A gives 100 nanomolar, and for compound B you need 100 times more to provide the same inhibition effect. So this is how you get the IC50: it is a constant that comes from multiple measurements. But percent binding is obtained from a single measurement, not from a curve like this, so it is not so reliable. Ki and IC50 are both obtained from this kind of curve, and therefore they're more reliable.

Back to the ChEMBL database. In a typical experiment, when we do model building, we collect the data: we would download the entire set of 5,392 records into the computer. I'm going to go back to the slides so that you can see more detail in just a moment. Let me try to find the laser pointer. It's not here. Okay.

At each step you can see that it takes time: target discovery takes two to three years; discovery and screening could take maybe half a year to one year, typically using high-throughput screening together with virtual screening, and you might hear the term molecular docking; and then lead optimization. Essentially, the course is about lead discovery and development. To have a lead means that you have first identified a hit. The hits are identified from the virtual screening and the high-throughput screening in step two, which you will see at the bottom here. So hits will be identified at
this stage. Hits are compounds that provide relatively good inhibition, but they're not the best, so you want to improve on them. You take the hits and optimize them, maybe by adding functional groups to the molecule, and that is what they call lead optimization. You take a hit and make it into a lead, and then you optimize the structure of the lead. That might take a lot of synthesis: you might synthesize a library of compounds that look like the hit but with, say, a single functional group added. Let me draw a picture. But before I draw, is it clear?

"I have one question. Is the rate-limiting step the same for every biological pathway, or what would be the best rate-limiting step for the target?"

All right. The thing about the rate-limiting step is that you have to correlate it with the phenotype. I used the example of the aromatase enzyme: it is at the rate-limiting step of the conversion from androgen to estrogen. In the synthesis of estrogen, if you look at the pathway, there are several proteins involved. We have aromatase, we have 17-beta-hydroxysteroid dehydrogenase (17-beta-HSD), and there is also steroid sulfatase, which removes the sulfate group from the estrogen. Let's say there are about four to five proteins in the pathway for the synthesis of estrogen. Each enzyme contributes a certain percentage of the total estrogen produced. I'm not sure about the actual percentage, but roughly, the majority is produced by aromatase. Let's say it
accounts for 70 percent. So if you could select any enzyme in the entire pathway to serve as the target protein, and you found that aromatase accounts for about 70 percent of the production, then by inhibiting that one protein you would reduce the production of estrogen by 70 percent. But estrogen is not entirely bad; it's good, just not in high amounts. The goal is not to eliminate estrogen production totally but to prevent its overproduction. Even if we inhibit aromatase, we still have other proteins in the pathway synthesizing estrogen. If aromatase produces about 70 percent, about 30 percent is still being produced, but 30 percent is not overproduction, so that is okay. Therefore, if we find inhibitors of aromatase, we eliminate the overproduction, and overproduction is what normally leads to tumor mass accumulating in the breast. Great question. "Thank you."

Okay, I was about to draw the lead optimization. Let's see if I have other drawings. I have this one; I made this drawing a long time ago. Let me copy and duplicate it. Let's say this is a lead. I'm going to generate many alternatives, which I call analogs. For this one I'm going to add an OH, in red, at the terminal here. For another analog, instead of the OH being there, I add the OH here. And I create another analog here. So in this simple example I placed a hydroxyl group at different positions of the molecule. I don't know in advance which one will give a good result. Let's say the original gives 10 micromolar. If I make this one, it becomes one micromolar, this makes it
become 500 micromolar, and this one becomes 50,000 micromolar. These are the bioactivity values; let me make them bigger. The yellow highlight is the bioactivity. So based on analogs A, B, and C, which one do you think is the good way to continue? The original is here; this is the identified hit, and I made the hit into a lead. The analogs are here because I'm performing lead optimization; I want to optimize the lead. Can you tell me which compound: is it A, B, or C? Actually, let's call it A1. In an actual paper they normally call the original compound A and then call the analogs A1, A2, A3, and in a typical paper they might go up to A20 or A30. Can you guess which one? The lower the better: lower values mean better activity. "A1." Perfect, yes, A1, because the value decreased from 10 micromolar to one micromolar. It means that if I add an OH group here, at this position, farther away from this part, it might give a more favorable, we call it electronegativity, of the compound. It really boils down to the push-pull effect, the localization or delocalization of the electrons, because each OH here is placed at a different position: to the left, to the right, and at the bottom. Even this part could be modified as well. Even the ring here could be modified: it could be tryptophan-like, what we call an indole ring, or you could make other substitutions. This is the field of medicinal chemistry, where you modify the R group. In a typical paper they highlight part of the molecule in color, maybe like that, and they call it the R group. These are the R groups, colored red. So it's quite similar to
amino acids. Amino acids contain the alpha carbon, which has the amine group, the carboxylic acid, the R group, and the hydrogen. All amino acids share this structure, but they differ in the R group. Here it's the same R group, but added at different positions, and therefore the analogs give different activity. This ligand normally goes on to bind with the target protein. Let me find you a picture. No, this is the prediction. Right here: the molecule binds to a target protein. What we do in computational prediction is this: given the structure, we want to predict the activity, and therefore we call it bioactivity prediction. I actually made this application as a tutorial on my YouTube channel.

Let's go back to the slides. This presentation will be provided as a PDF; I'm not sure if all of you have received it already. Okay, very good. This slide is quite similar to the previous one, but more concise. You're going from a million compounds to one. You have millions of possibilities during hit generation, during the high-throughput screening. We don't know everything, so we have to screen millions of compounds. Let me go back to writing; when I teach in class I like to draw on the whiteboard, so I draw here as well. You can think of drug discovery as a funnel. You put in a million compounds during screening, and let's say we identify 10 hits, and from the 10 hits we get 10 leads, but then for the leads we have to
do optimization, lead optimization, and there we could maybe get thousands of molecules. From those thousands it comes down to maybe 10, then we do some experimental testing, and out of the ten let's say one compound gives promising activity. So it's like this in all of the experiments: you start with a large number and screen it down to only a few compounds. You have to rely on sheer quantity to make drug discovery work, because a lot of it is uncertain. We might know the mechanism of action, but not everything is known well enough to design from scratch the exact molecule that will inhibit the protein. Computational approaches provide a lot of assistance here. You can see that we perform virtual screening to get a good starting point, then we explore the chemical space, and then we perform lead optimization.

But we don't only want a compound that inhibits the target protein optimally; we also need it to have favorable pharmacokinetic properties. It should be absorbed in the stomach; it should be permeable through the gut wall and the cell membrane, and also through the blood-brain barrier if it's a neurological drug; it must be metabolically stable, so that it is metabolized properly by cytochrome P450 in the liver; it must be non-toxic; and, most important of all, it should be synthesizable, meaning it must not be too difficult for the medicinal chemist to synthesize. Achieving all of these desirable properties is a kind of multi-objective optimization. So if I ask you a question: if you want
to buy a mobile phone can you share with me what characteristic are you looking for if you buy this you know this phone what should be the characteristic can i hear from both of you whoever can start how about jack jack would you like to start i get a green piece if you want to buy a mobile phone okay so what are the requirements that you look for when you buy a phone buy a phone yeah buy a phone yes yes a cell phone yeah a cell phone or a mobile phone i mean because you know there's so many types right there's android there's ios and even with android there's so many brands right and there's so many features so many functions you know what are your requirements do you have something like a checklist or even the budget right heymar you first okay for me i think it might be user friendly and right now i use the android so i prefer user friendly and also the affordable price and i'm concerned about the camera the camera should be the good resolution that's all okay price camera those are your two important factors anything else that's all i've got is that it have to be user friendly user friendly okay so there's three okay i believe in the brand you believe in the brand okay yes okay yes so i think you're going with the apple right actually yes yes yes i believe i bought apple okay any other reason no i believe in the brand okay so let me draw an image of that okay so we have the brand we have the user friendly and then we have what um camera good camera and what else heymar price is price good price competitive price okay so we have how about the spec let's say okay another one would be the tech spec right if i draw into a node this is a node okay this is a pentagon okay let's say that for jack it's only one factor it's hard to draw with only one line okay let's see i make this visualization okay one for each of you first heymar okay so heymar is the user friendly how would you give it
let's say from a score of five how many zero to five four you give it a four okay and how about the good camera five he gave it a five how about the price uh well let's say um let's say that higher value of price is more expensive and the lower is more cheaper okay so three three okay how about the brand for me right two two okay and how about tech the specs should be five it could be five okay you know like it should be like this so this is two one two three four five one two three four five this is three right one two three four five two three four five two three four five okay so let me connect the dots this is a radar plot it's a way of visualizing maybe i'll just draw it with this thing so this radar plot is for heymar for buying a phone and how about for uh jack let me make the drawing again two three four okay so jack is a brand i think you give it a five five yes and user friendly uh three three okay good camera three and then we make it like this so this is your visualization so it's a simple you know like radar plot we call this the radar plot and let's say that we compare the two it's a nice way of uh comparing the visualization right you could even adapt this for your own oh no the same yeah so we could see that for heymar not so much emphasis on the brand but she wants more user-friendly and good camera but for jack it's more into the brand and tech spec right yeah so we can see very easily how you know each factor influences the decision okay so your decision is to buy a phone okay and the factor that influences your decision are you know brand and tech spec for jack and user friendly and camera for uh heymar and let's say that we already know the specification the filter okay so this is like the filtering criteria for each of you in buying a camera i mean for buying a mobile phone let's say that i have a thousand phones and i will try to fit it into this criteria okay and therefore from a thousand i might be able to get
i'm starting from a thousand i may be able to get let's say um 20 phones for heymar and let's say for jack maybe i could get two and the reason is because for jack you specified to be iphone you already specified the brand let's say you want only the iphone therefore you're limited to either getting a big or a small iphone right and the price could be either a big or a small iphone right the mini iphone or it would be like the bigger iphone right the 6-inch or the what they call it 4-inch iphone yeah but for uh heymar you don't specify the brand therefore it could be any brand and good camera i mean there's so many brands having good camera right and user friendly right there's many brand as well and therefore you will get maybe much more it could be even more than 20. it could be like maybe 50. okay so based on the different criteria for the filtering you will see that you will get the end result to be quite different okay and so the filtering criteria that you see here is kind of like the model it's kind of like the threshold when we analyze the data okay so the thousand of phone that you see here it could be the molecule a thousand molecule and then we get 20 molecule and your decision here will dictate how we turn 1000 into 20 how we turn a thousand into two okay so if we take an example here is that instead of it being from heymar and from jack let's say this is a target protein um part of the kinase family let's say and let's say that this is a protein part of the cytochrome p450 okay so different protein family will have different requirements right because different enzyme different protein will have different properties and therefore we have to be careful to set the threshold differently okay based on the analysis of the data therefore if we try to go back to the multi-objective optimization that we see here okay so it really depends on the protein that you're studying okay and that you
try to balance between many factors that it should bind well number one it should bind to the target protein but it should also be safe okay so like with the example of you selecting the phone right you should have a relatively good camera but it should be easy to use right so it comes with a trade-off okay so the tricky part is how do we get you know the best possible so we have to optimize it in terms of multiple parameters okay so actually i think i will skip a lot of slides it's because i've explained many in the form of the drawing okay so in the creation of a new compound where do we come up with the inspiration for a new compound if you could guess this probably from nature right we look to nature not the nature journal but we look to nature right to the environment um green tea right it has catechin um curcumin right in the food in the spice um in chili we have capsaicin right um in ginger we have what like i think they call gingerol and the ginseng right yeah there are several uh ginsenoside compounds as well and it really depends on the soil right what country it is coming from and the mineral in the soil will give rise to the different type of ginseng that you get even for the cannabis that we see right when it's grown at different altitude growing in different soil different temperature it gives rise to different composition of the um i'm not sure what it's called it's the thc or something right uh the component of the cannabis so it really depends on the environment on the soil right so all of that is nature inspired okay and so another part is once we're inspired by nature we could use it as a hit right and then we generate a lot of lead compound and then we try to optimize the compound by doing lead optimization in order for us to find a you know a promising drug and so in computer we could perform what is called compound enumeration have any of you played legos before lego you know legos right oh you know this
kind of block not time it's three dimension yes right and it might have you know four in the block so the lego you could build many things with legos right same thing with the compound as i've drawn here you add a functional group to it like oh and you add it to different position you will get a new compound right you could even reshuffle we could take this let's see we could take this and put it here put it here we could reshuffle it right and then we get a new compound so the chemical formula is the same but it's a different molecule right it contained the same number of atoms it contained the same element but it's connected differently therefore it is a different compound okay so you could generate this like a lego building block you could move it around you could connect it in different way and then you would get a new compound and so they call this the compound enumeration it's using concept from mathematics or similar to i'm not sure if you have heard of combinatorial library before a building block right let's say i have a building block let's say i have a core structure abc core structure is kind of like here let me color it for you yellow color is this core and let's see that oxygen containing is in pink and nitrogen containing is in the blue cyan color so you will see that i could move around and the property changes right so abc could be the core this could be a let me try um orange let me zoom in this is a and let's see if i take this and convert it into let me see put it back here delete this and i'll draw i'll try this one and i'll draw purple color here so this become scaffold b okay so scaffold b is this one scaffold a is the yellow one and i could draw another scaffold for c maybe it's another different scaffold okay so these are the substructures these are the scaffolds okay so when you form a drug when you design a molecule you have the scaffold along with the substructure and it's normally not
only one but it's multiple and then you could imagine that the substructure could be reorganized it could be reshuffled okay it could be you know modified to become an oh here okay and this could be a fluorine if you want okay so this is uh ligand or compound enumeration okay compound enumeration like the term enumerates like a lego building block you take a and b and c and you mix it in different combination in different connection let me call this one two and this is three this is one no no this is one this is two three but then i may change so this is three prime because i added the oh here this is three this is uh one this is two but then i move the position right and move two from here to become here so let's call this two prime so these two are the same so i call it two prime original is two i move it from here to become here so i call it two prime uh it was originally three right but i added an oh so it becomes three prime and this is oh i mean this is o so it's three so this is one i added an f here so this become one prime there you go i modify the structure right and i modify the scaffold originally i had a but i modified to be b right so you're having scaffold and substructure but then you have different components and then you're creating new molecule okay so this is compound enumeration okay it's creating a new molecule based on the same building block but you're just organizing it in different way so what i told you is the compound enumeration is actually reported in a paper by the group of jean-louis reymond so he took molecules up to 13 atoms right here and he reshuffled the atoms you know so that they will form a new molecule and he got 977 million possible molecule and if he takes 17 atoms and then he reshuffled them in different combination he gets 166 billion possible molecule so this is 10 to the 11th power right so by 13 atoms this is what it means let me show you so what i have right here in this example
is i have how many one two three four 5 6 7 8 9 10 11 12 13 14 15 16 17 okay this is perfect 17 atoms okay so if i have 17 atoms and if i connect it in different way and i use different element i could use nitrogen oxygen phosphorus sulfur then i will get about 166 billion compound right here let's go back to the next slide chemical space okay so chemical space is just a simple way to refer to all the possible compound in a universe okay so if there's 166 billion then this is the chemical space so it represents all possible compounds that could be created but then in the actual experimental databases that we see um the number is not that high right so let's go to a database like chembl let me show you this one let's go to the first page you can even follow along you could type in to google chembl so you see from chembl here that it has how many 2.1 million compounds okay so this is available to us 2.1 million compounds and there are 14 000 targets i cannot zoom in or can i yeah i can zoom in oh no okay so there are 2.1 million compounds and i mentioned to you that this is 166 billion okay so you can see that what we are able to access right now is quite limited considering that there are you know like 100 billion or more compound that could be synthesized but that are not yet synthesized so 166 billion is just the hypothetical compound okay so that is a lot of potential that we haven't yet tapped into okay and this is the drug discovery toolbox so these are all of the tools that are available for discovering a new drug so the experimental method are shown at the top part so we could do combinatorial chemistry we could synthesize a lot of compound i mean to get chemical libraries and then you know all of the available compounds are called the chemical space in terms of visualizing it and then we have high-throughput screening in order to identify which of the compound when tested against the target protein of
interest which one show promising activity and then we have property filter as i've shown to you in the example about buying a phone or using the different criteria so same here property filter computational chemistry is another field where they use quantum mechanics to investigate the effects of electrons on the molecule structure and in the field of data science machine learning well they could use machine learning in order to build prediction models particularly qsar and proteochemometrics are making use of machine learning to create prediction models that could predict the biological activity of the compound okay and then we have molecular modeling which is pretty much just visualizing the protein structure and then we have molecular dynamic and molecular docking molecular dynamic is to investigate the dynamic of the protein structure by using a special software and then we have molecular docking which is to figure out how a compound can bind to a protein okay so the database that i've shown to you is the chembl database but then there are more there are bindingdb and also pubchem okay so these databases contain information about the compound let me show you again so if i search for curcumin i will be able to see for the curcumin drug i mean compound what available data are there and if i click on the target i will be able to see what protein has been tested against the curcumin oh no okay so this one target is the curcuma longa okay not sure about this why they have it as a target as well okay interesting curcuma longa curcumin cellulose okay so they have it as the target but normally it's the compound okay so let's say that if i'm interested in the curcumin molecule i could click on the browse activity and i will get the activity here uh there's more than four thousand right four thousand record and if i look at the left part i can see the target type and normally i like to look at the single protein click on the single protein
because it's much easier to investigate and there's also the protein protein interaction that i mentioned already let's say that i look at the proteins and i can see that this particular curcumin has been tested against the protein matrix metalloprotease and it has the ic50 of 14 000 okay not so good but then i can see to the left part here the distribution of the bioactivity i can see that 265 are ic50 ki have 32 i could filter it ic50 right i click on the ic50 and i get to see the activity here i could click on this the filtering part here so that it sorts from low to high and i could see here that this compound is actually the same compound right you see the number 180239 is the same compound okay it seems to be redundant information it has 13 nanomolar the activity 13 nanomolar against this protein arachidonate 5-lipoxygenase okay so it's coming from a cell-based format and so this is the activity and so it has pretty good activity at 13 nanomolar so these are the type of information that we could use to build a model okay so this is the chembl database and i talked about the pubchem right you could search in google for pubchem this is what we have okay you can even draw the structure or you could type in the name or even covid right or aspirin they give you some example let's use covid-19 okay and so they have some compounds right 1600 compounds that have been tested against covid-19.
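A minimal sketch of how activity records like the ones browsed above might be tidied before building a model: keep one activity type, drop the redundant duplicate rows, and sort from most to least potent. The records, compound labels (CPD-A and so on) and field names are hypothetical stand-ins rather than downloaded data, and the conversion to pIC50 (negative log10 of the molar IC50) is a common convention assumed here, not something stated in the talk.

```python
import math

# Hypothetical records shaped like rows of a bioactivity table; the labels,
# field names and values are illustrative only.
records = [
    {"compound": "CPD-A", "target": "matrix metalloprotease", "type": "IC50", "value_nM": 14000.0},
    {"compound": "CPD-B", "target": "arachidonate 5-lipoxygenase", "type": "IC50", "value_nM": 13.0},
    {"compound": "CPD-B", "target": "arachidonate 5-lipoxygenase", "type": "IC50", "value_nM": 13.0},
    {"compound": "CPD-C", "target": "arachidonate 5-lipoxygenase", "type": "Ki", "value_nM": 250.0},
]


def pic50(ic50_nm):
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in molar)."""
    return 9.0 - math.log10(ic50_nm)


def prepare(rows, activity_type="IC50"):
    """Keep one activity type, drop exact duplicates, sort most potent first."""
    seen = set()
    kept = []
    for row in rows:
        if row["type"] != activity_type:
            continue  # e.g. skip Ki rows when only IC50 is wanted
        key = (row["compound"], row["target"], row["value_nM"])
        if key in seen:
            continue  # redundant copy of a record already kept
        seen.add(key)
        kept.append({**row, "pIC50": round(pic50(row["value_nM"]), 2)})
    # A lower IC50 means a more potent compound, so sort ascending.
    return sorted(kept, key=lambda row: row["value_nM"])


rows = prepare(records)
for row in rows:
    print(row["compound"], row["target"], row["value_nM"], row["pIC50"])
```

On this scale a larger pIC50 means a more potent compound, which is why the 13 nanomolar record ranks above the 14 000 nanomolar one.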
okay so these are some fda approved drug right ritonavir okay so these databases have information about the bioactivity okay so these are some of the things that computers can do i already talked about this so i think i can skip right we talked briefly about self-driving car we talked about uh supermarket that could perform analysis of what you have purchased without even going through the line and paying for it and we see here that they have style transfer meaning that they could use neural network in order to make sense of the arts and then they could learn the pattern from the art and then transfer it to an image okay so you could transfer the image style of van gogh into this uh image shown in a okay so it's transforming the small image into the big image that you see okay so it's using style transfer so these are some famous piece of art from famous painters this is an image of where deep learning has been used by google to dream so they use artificial neural network to dream and so in order to figure out what could we use computational models for in drug discovery so these are some of the examples so for sure you could use it to investigate the structure activity relationship of the chemical libraries so for your collection of compounds you want to know what structural component of the drug will give rise to a favorable inhibition of the target protein and so the prediction model will allow you to do that and another thing is to allow you to filter out the toxic drug okay like which drugs are toxic to the human body um by being able to calculate the uh properties giving rise to admet property like you know the absorption distribution metabolism excretion and toxicity okay so these are some of the ways that it could be used to answer many questions in the drug discovery field so you might have questions such as what target protein could your compound bind to and modulate and
would your compound bind non-specifically to other protein as i mentioned already right about the off-target effect and about the drug repositioning so if it could bind to another protein and if it provides favorable activity you call that you know drug repurposing or drug repositioning so it's a way of finding a new indication a new usage for the drug okay it's kind of like if you have an ebola drug and then you discovered that um it is also effective toward treating covid-19 right i think it's the remdesivir right i think it's originally for uh ebola right and then they have found that it's also treating covid as well okay so that is the desirable off-target binding okay but other off-target binding that is undesirable could be like when you're taking chlorpheniramine and you might feel a bit drowsy okay so that's why you cannot drive your car or um work any um dangerous machinery when you take the chlorpheniramine right because it affects your nervous system and so these are some of the things that computational chemistry has helped um particularly in this example scientists have won the nobel prize um back in 1998 and in 2013.
so they applied the concept of quantum mechanics in 1998 to investigate how quantum mechanics could be used for analyzing the mechanisms of compounds right it's used for investigating the computational biochemistry so they use it to investigate the what you call it the simulation of the active sites okay of the proteins like how does the catalytic triad of the enzyme work which residues are important okay this is an example overview of the broad field of computational drug discovery and you can see here at the middle part is the various approaches okay we have fragment based we have ligand based we have structure based and we have system-based drug discovery right so i mentioned earlier on by drawing the image of the protein protein interaction so that is the red part right that's the system based and i talked about how we could use machine learning to build models so that is ligand based and i talked briefly about molecular modeling and docking so that is structure based okay so let's have a look here so bioinformatics it is a field where we're using computational approach to try to make sense of biological data and so some of the biological data that we have include dna proteins lipids and carbohydrate so these are the macromolecule of life right and the important thing is how do these macromolecules interact with one another okay so that is part of the network analysis that i've shown you by the drawing before okay so the field of cheminformatics is the usage of computers informatic approach to make sense of big chemical information chemical libraries right so you could use cheminformatics to try to understand what makes a drug effective we use cheminformatics to calculate the properties of the drug these are called molecular descriptor okay so common form of molecular descriptor include molecular weight right and the lipophilicity log p right which is coming from the uh partition coefficient between water and octanol um it's kind of
like you try to mix oil and water and then you're gonna see that you get two layers right and then your drug is it lipophilic or is it hydrophilic does it solubilize in water or does it solubilize in the octanol so every drug will have different degree of lipophilicity and hydrophilicity hydrophilic we call it water soluble okay so drugs and its precursor so if you imagine a drug let's start from the bottom okay so a drug is a molecule that you know could bind to the target protein and then the lead compound is normally the one that i mentioned to you about the lead optimization process okay it's where you take a hit compound you convert it into a lead and then you try to modify the structure get a lot of analog okay and then you test which analog is favorable and out of 100 analog maybe one would be a favorable compound and then you use it as a drug right and when you use it as a drug it means that you have to perform clinical trials on it and so that would take a couple of years okay maybe five years in the various stages of clinical trial and i mentioned about the hits right the hits will be identified from a high-throughput screen so i drew a picture of a microtiter plate right so you have each plate you have the enzymes that are pipetted inside and then for each well you have different compounds okay and whichever compound give good activity you call it a hit okay and the hit will give reasonably good activity but not like very good and therefore you have to take the hits and then you have to convert it into a lead and then you perform analog generation and even before the hits right um the concept of fragment and privileged substructure um it's actually here i drew the image here so the number one two three as you saw here number one number two and number three these are the substructure okay so we call these the substructure and then there are several papers in the literature where they try to review which substructure are favorable
for the biological activity okay and there are different amines so these are amines right and these are like carboxylic acid so in different types of disease there will be review article investigating like which substructure are favorable for a particular activity okay so substructure and the privileged substructure means that a particular substructure will be important for a particular disease okay like for example uh carboxylate multi carboxylate substructure could be important for anti-cancer drugs particularly let's say for breast cancer drug another substructure for anti-cancer or anti-breast cancer would have to be the azole ring okay azole ring looks like this let me draw it azole okay sorry it's like an imidazole so an imidazole contains the term azole and normally the azole ring it interacts with the fe of the heme so aromatase is part of the cyp450 family and therefore it contains the heme and we know that the azole ring will be able to interact they call it the coordination so the azole ring from the drugs can bind via heme coordination so we call it coordination to refer to the interaction with the metal ion and so the n here can interact with the fe okay so as you see you know like designing a drug it requires a lot of information of not only the chemistry part but also the biology part shown here the biology part right which protein interact with which protein how they interact which pathway they are involved with um they're also involved with the kinetics you know this is more into like the biochemistry right the kinetics the high-throughput screening uh also it involves a lot about the organic chemistry right the structure of the compound the location of the r group okay and in terms of computer we have the generation of compounds via compound enumeration and we also in terms of the uh i think we could call this the structural biology okay where you look at the protein structure and then you try to see how they interact so i think it's better
if i show you let me show you protein data bank okay because you already have the slides and i'm showing you something that is not in the slide okay so let's say aromatase have you ever searched a protein data bank before not yet not yet okay have you heard of it no not heard of it no right okay let me show you so this is the protein data bank and it is a database containing information about protein structure okay all right so we have already searched for aromatase again the first one is the 3eqm this is like a protein id 3eqm is the pdb id it is the identification number it's kind of like a barcode okay and it's the first published structure of the aromatase back in 2009 and so back then i remember my master's degree student it was his project so when the protein structure came out that year we downloaded the structure and we perform molecular dynamics let me click on the structure and then we perform molecular docking to figure out how the drugs interact with the protein because at the time it only has the androstenedione you know the androgen okay androgen is bound inside so androgen is the male hormone and it's converted to the estrogen which is the female hormone and so at the time we don't know how the drug compounds bind to the aromatase enzyme therefore we perform docking to see how the drug interact with the um heme group of the aromatase so let me click on the 3d view for here 3d view um there is a software that you could open on your computer called pymol but i think it's beyond the time that we're going to cover in the course so if you have time and you want to use pymol um you're more than welcome to do so so we're gonna just use the online version of the protein structure viewer so this is the protein structure this is the aromatase enzyme so if you take a look at the structure it looks a lot like the other enzyme of the cytochrome p450 family okay so let me zoom in and then
you will see here in the color one moment okay so you can see this right and it will zoom in and you will see the heme group okay so you see the one with the yellow yellow is the sulfur atom from the cysteine and the cysteine coordinate with the fe which is the orange color of the heme okay i might use two fingers to move it a bit okay and then you see the heme it's interacting with the androstenedione right here you see the top compound in green and with the two red in the left and the right that is the androstenedione okay and the heme is shown here with the orange in the middle and the blue color atoms they are the nitrogen from the heme group okay so we could see how the drugs interact with the heme group of the aromatase and therefore when we design a drug when we analyze the data we are more knowledgeable into what substructure are important for the inhibition and therefore we try to find those that have the azole okay so the azole group as i mentioned already is shown here it looks like this one azole so what makes the azole it must contain a nitrogen and a ring and normally it is a five-member ring it could have one nitrogen or it could have two or it could have three but it has to be connected in a different way okay so these are the drugs let me show you so this is the protein structure database showing the pdb and then you could download the data as well you know like here you could download sequence so this is the protein sequence and then you could download the pdb and then you could visualize it on your own computer using pymol let me show you pymol pymol is here so they have a free educational version okay so you could download the free educational version okay so they support windows and also mac okay let me go back to the slide okay so this i mentioned already identifying the hits from the high-throughput screening and then you convert the hit to
the lead so the hit is given in the example here fragment hits on the left parts of the image and then we convert the hit to a lead and then the lead to more lead analog okay and you will notice that the structure becomes bigger and bigger right because the challenge is how can we add a new functionality to the molecule we have to add a new structure to it right so more or less the molecule will be grown okay so it's like we're growing a molecule from a small one to a bigger one and so if you look at the example um molecule four it has a kd of 1340 okay it's not exactly ic50 um and then they measure the ligand efficiency le right so they take the bioactivity value and then they divide it by the size of the molecule and then they get the ligand efficiency of 0.37 and in compound 5 they take 2.4 micromolar and they divide it by the number of atoms by the molecular size as well and then they get a ligand efficiency of 0.31 okay so you see that it decreased but then for the final drug the activity became so much better and therefore the ligand efficiency is 0.44 okay although the molecule becomes larger and this is for the minha the tuberculosis drug so ligand efficiency is a way to take the activity value and divide it by the molecular size okay to see whether you have um improved activity as you increase the size okay so you will see the concept here the take-home message is that when you have a hit you convert a hit to a lead right and then the lead becomes an analog the molecule will get bigger and bigger because you're trying to do the multi-objective optimization right remember that we need to have a molecule to bind to the target but it should also be safe it should cross the blood-brain barrier it should be having good solubility so in order to do that the scientists they have to add more substructure group to it and they have to add more functional group to the molecule okay like let's say that for the original
hit molecule let's say that it's not soluble so in order for it to be soluble they have to add additional substructure to the molecule and therefore it becomes soluble and when it becomes soluble it might be toxic so they have to change the toxic atom to another one in order to be less toxic and by changing it to a less toxic functional group the size of the molecule becomes bigger okay so you see the concept right that you could start out with a small molecule which gives reasonably good activity but then you want to optimize it to be better you want it to have better activity you want it to be safe so it comes at a trade-off of the size of the molecule so the molecule becomes bigger as well okay so this image also shows you that from left to right okay from the fragment 7-azaindole to the final molecule right you start out with nine heavy atoms and then you get the hit right you get 16 atoms and then you have 27 atoms and finally you get the final drug and it becomes 33 heavy atoms okay so this is in order to make it more specific to the target in order to have low off-target binding but in the example the third compound from the left the molecule plx4720 you see that it has activity against 54 other kinases okay so this is a kinase inhibitor and so if you use the compound three plx4720 you will also get a lot of off-target binding so it doesn't bind specifically to the pim1 protein but it also binds to 54 other kinases in the same family and so what they did in order to make it more specific is they added more functional groups to it right so this is the funnel that i mentioned to you but this funnel explains a different concept so it's essentially going from bottom up okay so you look at the bottom part which is the fragment you're starting from less than 500 micromolar i mean in terms of size you're starting from less than i think it might be maybe 300
dalton and then the size of the molecule will increase to kilodalton okay and so you'll see that the activity becomes better and better right it starts out with ki of about 500 micromolar and it becomes better right between 1 and 500 micromolar around one micromolar and also less than one micromolar and then down to about 10 nanomolar so in order to be a drug you need to have the ki or the ic50 be a lower number and in order to do that you see that the molecule becomes bigger right okay so this is the lipinski rule of five okay so the lipinski rule of five is very important it's like a rule of thumb in order to figure out whether the molecule that you have is drug-like or not okay so the drug-likeness rule of five was discovered by christopher lipinski so what he did was he took a data set of about 2 000 orally active drugs and then he analyzed them okay so these orally active drugs are fda approved drugs and so he analyzed all of them and then he came up with a set of rules the rule of five and so it deals with multiples of five meaning that the molecular weight should be less than 500 dalton so it means that it shouldn't be too big the lipophilicity the log p should be less than five hydrogen bond donors should be less than five and hydrogen bond acceptors should be less than ten okay so this was formulated back in 1997 and then the image that you see is based on the paper that we published i think it's 2018.
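As a minimal sketch of the rule of five described above, the check can be written in a few lines of Python. It assumes the four descriptor values have already been calculated elsewhere (for example by a cheminformatics toolkit). The `Descriptors` class, the function name and the two example molecules are hypothetical illustrations and do not come from the lecture or from the 2018 paper. The cutoffs follow the commonly quoted form of the rule, in which a value exactly at the limit is still accepted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Descriptors:
    """Precomputed properties of one molecule (hypothetical container)."""
    mol_weight: float       # molecular weight in dalton
    log_p: float            # lipophilicity (octanol-water log P)
    h_bond_donors: int      # number of hydrogen bond donors
    h_bond_acceptors: int   # number of hydrogen bond acceptors


def rule_of_five_violations(d: Descriptors) -> List[str]:
    """Return the names of the rule-of-five criteria that the molecule breaks."""
    violations = []
    if d.mol_weight > 500:
        violations.append("molecular weight > 500 Da")
    if d.log_p > 5:
        violations.append("log P > 5")
    if d.h_bond_donors > 5:
        violations.append("H-bond donors > 5")
    if d.h_bond_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    return violations


# Illustrative values only: a small drug-like molecule and an oversized one.
small_molecule = Descriptors(mol_weight=180.2, log_p=1.2,
                             h_bond_donors=1, h_bond_acceptors=4)
large_molecule = Descriptors(mol_weight=720.0, log_p=6.3,
                             h_bond_donors=6, h_bond_acceptors=12)

for name, mol in [("small", small_molecule), ("large", large_molecule)]:
    broken = rule_of_five_violations(mol)
    print(name, "violations:", broken if broken else "none")
```

The function only lists violations; how many violations to tolerate before discarding a compound is a separate judgment call.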
so if you search for it you will find it if you search for er alpha and also rsc advances okay so we published that let me show you er alpha rsc advances yeah so this paper 2018 probing the origin of estrogen receptor and it is an open access journal so you could download the paper and read i think it's one of the early figures here so we analyzed the rule of five right here um figure four okay or maybe we didn't show it in the paper okay we didn't show it in the paper oh well okay so aside from the rule of five they also have the lead-like rule of three so it is in multiples of three like less than three hundred dalton and less than three so these are the lead molecules so the rule of five is for drug-like molecules and the rule of three is for lead-like molecules because you know that when you go from a lead molecule to a drug molecule the size will increase okay so from 300 to 500 right okay chemical space i mentioned already is a way to visualize all of the possible chemical compounds that could exist in the universe and if you look at the biological space they try to visualize the biological activity and also the chemical space together and so let's say that you have compounds that are able to inhibit kinase or protease you can see the different circles in there right the gpcr is in green kinase is in the purple color the smaller one so that's the region that it encompasses in the biological space so it's just a cartoon way of visualizing how the protein families are related to one another okay so we covered a bit about fragment space so it's just the visualization of the total number of compounds in this 3d visualization and so this is like the chemical space but then we show it in terms of the three different types of molecule this is just an example okay so if you zoom in you're going to see they have the nitrogen they have the oxygen and they have the carbocycle okay so this concept i mentioned many times about how the molecule the
size will increase as we go from fragment to become lead-like and then to become a final drug so you're going to see that the molecular weight will become greater than 500 okay so that's the rule of five the lipinski rule is right here at the multiple of five the lead-like is right here at the rule of three okay and then we go from the fragment right here so you're going to see that the molecule size becomes bigger and bigger all right and so this is the concept of polypharmacology right as i mentioned about the off-target binding if you get favorable activity or if you get unfavorable activity it will give rise to either a side effect or a drug repositioning or a drug repurposing okay so this is the concept of polypharmacology okay so the concept that one drug will bind to one target protein that is a simplistic viewpoint so in reality a drug doesn't bind to one protein it could bind to multiple types of protein in the body and if it does bind to other proteins in the body it gives rise to off-target binding and therefore gives rise to side effects okay so it depends on which off-target protein it binds to and what effect that gives rise to phenotypically then it will give rise to that particular side effect kind of like if you take chlorpheniramine it has some neurological effect because it's a neuroactive drug right so aside from being an antihistamine it also probably goes through the blood-brain barrier to elicit the neurological effect that it has right and so as you see in the kinase family it's quite tricky because there are several hundred proteins in the kinase family and let's say for this compound staurosporine as you can see here it binds to many of the members in the kinase family and in order to make it specific toward a single subfamily of the kinase it means that the molecule has to be bigger because if you imagine the protein structure of the kinase
it will adopt a similar fold and to make it bind to only a single or a handful of kinases and eliminate the cross binding to the other members it means that you have to add an additional group that will be unique to that one and therefore the size will be bigger okay so this is the take-home concept that i would like all of you to have from today about the polypharmacology about the off-target effect about the side effects and about the repositioning of the drug okay so we talked about this concept many times and i've drawn an illustration for you already so we will skip this one okay so we will spend some time on qsar so qsar is the quantitative structure activity relationship and it's a way to correlate between the features from the molecule so the molecular features or the structural features of the molecule to the biological activity okay so the biological activity could be the pic50 it could be the ic50 or it could also be the ki so you will see in the example here the molecules all look similar but they're different right if you look here at this position it's a different atom and this one contains a methyl group added to this benzene ring and so each molecule that you see is substituted with a different r group at a different position okay and therefore it gives rise to different activity and so if we could simplistically describe this let me make a new equation so the biological activity is equal to a function of the molecular features so f is a function right it's kind of like you have the equation y equal to f of x right and so y is the biological activity and x is the features of the molecule the biological activity is a function of the molecular features so if you want a better activity you need to add more or different molecular features to the molecule okay so our exam will be an open book so you're allowed to have your
um i'm not sure if it is permitted or at least you could have the printed handouts of your slides right because your exam is probably online right so i think it's okay because our exam is normally essay so you have to write your understanding so i don't mind if you have your printed version of the presentation slides next to you it's okay because you're going to have to explain your understanding okay so i'm okay if you have your slides printed out next to you okay so we're going to ask something very basic you know like you're going to have to understand what they are in order for you to write about it kind of like what's the importance of this what's the difference between this and this how can you design a drug can you explain the process that is necessary okay so everything is more conceptual based on your understanding of this okay so i mentioned to you already that quantitative structure activity relationship is a way for us to investigate the relationship between the structure and the activity right so how structure gives rise to activity you can see here the numbered atoms right one two three four five six at each position you could decide to add a br a bromine or a chlorine they're both halogen atoms you could even add iodine or you know others in the same family or you could add a methoxy group or ch3 to position two or three or four right so these are used as an example okay you don't have to memorize them it's just that you could add a particular functional group to different positions and the position really depends on how it interacts with the protein in three dimensions let me see if i have pymol installed on my computer okay one moment okay let me share my screen you see my screen pymol right get pdb okay let me enter 3eqm into it and this is the aromatase and then let me show the sequence i click here yes and then i'll click on the heme i'll give it a
different color oh sorry um i have to move the zoom icon um let's see i give this a different color let me make it by element to be this color oh no it's gone okay yellow it's shown as a yellow color you see and then this is the androstenedione the androgen i'm going to show it in a different color i'll show it here in the cyan color and see if i color this according to the secondary structure by ss meaning the secondary structure okay now the color defaults to a different one let me color it again okay if i select this thing it becomes sele i could rename it you know right here to the left i would name it androstenedione and then i'll give it a color let's see what color should i give green and then i'll click the heme group and i'll call it heme to the left here heme enter and now you see that i have heme and androstenedione you know in the menu here and then i'll give it a different color i'll give it yellow okay for this one i'm using the trackpad on the macbook and i recommend you to have a three button mouse and you could use the middle mouse button the scroller to zoom in and zoom out it's a bit challenging on the trackpad here you could do panning this is panning okay you move from left to right this is called panning if you click it's rotating this is rotating this is panning and then let's see sorry i have control and mouse oh no i'm trying to figure out how this works and you could zoom in and zoom out i forgot how to zoom in already oh like this okay on the trackpad i just you know like pinch the screen okay so i'm using the trackpad of the computer so it's like pinching you know like how you use the tablet you go like this and it zooms in and zooms out zoom in zoom out okay and then you use shift to have this selection box shift and then the mouse control and mouse to
see what happens nothing happened alt and mouse it allows me to pan if i press alt on the keyboard and the mouse it will pan it and then i could rotate it and then i could pan it and then i could rotate it because it's hard to find an optimal viewpoint of the molecule okay so you can see that the heme is situated here and when we visualize it we can see how the atoms of the drug will interact so this is the substrate this is the androstenedione so i need to have the actual drug docked here so if i do molecular docking i will be able to see how they actually inhibit here and then i will be able to see how the residues around the drug interact so let's say i hide everything i show only sorry i show the heme let me hide everything i show the sticks i show the molecule so i see the heme oh no okay so i see the heme and the green is the substrate right the androstenedione and yellow is the heme group i could click on the heme group and then i could say i want to show only the residues around it within about 6 angstrom and i will show all of the residues within six angstrom i show them as lines and i'll color them by element let's make it like that okay so you can see the lines around it they are the residues around it so i show this as a stick show it as a line okay now i can see or maybe i show it all as lines okay so i think you can see the concept right of how the residues interact with the ligand yeah so sometimes we use the terms interchangeably we call it the compound we call it a drug or we could call it a ligand okay so in drug design we typically call the compound a ligand so you might hear the term protein ligand interaction okay so this is how the protein and the ligand interact and then based on the modification if you look at it three-dimensionally then you can see oh okay this functional group could interact with this part and this benzene ring
interact with this in a pi pi stacking interaction and therefore you could figure out which position you want to modify so that it will optimize the interaction with the residue okay so this is why you need to visualize and perform molecular docking and modeling okay so this step might take you an entire week you rotate the molecule play around with it try to figure out okay if you modify this to be this atom what happens actually i want to show you more but okay let me show you this particular residue this is a tryptophan i could modify this tryptophan to become something else let's see wizard mutagenesis and mutation i modify it to be an alanine apply mutate to alanine apply done oh why doesn't it happen okay i don't have the license that's why i cannot do it okay so i need to get the license so you could modify the residue into another amino acid of your choice as well okay let's go back to the slide let me share the keynote okay okay so the concept of qsar is like here explained in three lines so the first is you select a biological activity that you want to investigate you want to study and then next is to generate the description of the molecule okay normally you use something like a tool from computational chemistry to calculate the molecular descriptors to get the physicochemical description like what is the molecular weight what is the solubility okay so once you have the descriptors of the molecule or the features of the molecule then you use machine learning okay to try to correlate between the biological activity the ic50 that you see here or the chemical property right here okay this could be ic50 or it could be a melting temperature or it could be the freezing point of a molecule with the molecular descriptors so you want to correlate between these two parts the blue part here the descriptor with the activity and in order to correlate that descriptor
and activity you need machine learning okay so if you look at the illustration here you have a set of molecules and then you want to describe them in terms of their properties you could have the energy you could have the dipole moment you have the charge this is the energy of the molecular orbital so you don't have to know this in detail it's just an illustration for you so once you describe it in molecular descriptor terms then you use machine learning to correlate between the descriptors here the blue part with the green part the ic50 and your machine learning model will be able to make a prediction given the following descriptors what is the ic50 given the following descriptors what is the ic50 okay and so the most important thing about this is the interpretation of the machine learning model okay so you want to interpret the machine learning model to figure out what features are important for a good bioactivity okay let me show you more examples so this is just a simple database search of the scopus database to show that there are more and more papers being published in the field it is an illustration it is made in keynote so it's quite outdated so i have to update this visualization and so this is a typical qsar workflow it might look quite complicated it comes from the paper i showed you just a moment ago the er alpha prediction so let me explain it to you it's not that difficult the chembl database is the one i've shown to you already if you recall and i searched for aromatase as i've shown you already i collected the data which i've shown you there's about 2 900 compounds against 5 000 activities but yeah now it has 5 900.
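The descriptor-to-activity correlation described above can be sketched, under heavy simplification, as a one-descriptor linear model in plain Python. Everything here is an assumption made for illustration: the descriptor values and IC50 values are invented toy numbers (not ChEMBL data), the helper names are hypothetical, and a real QSAR model would use many descriptors and a dedicated machine learning library. The sketch converts IC50 to pIC50, fits y = f(x) by ordinary least squares, and reads the sign of the slope as a very crude form of model interpretation.

```python
import math
from typing import List, Tuple


def pic50_from_nm(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in molar)."""
    return 9.0 - math.log10(ic50_nm)


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented toy data (NOT real measurements): one descriptor value per
# molecule and a made-up IC50 in nanomolar for each molecule.
descriptor = [1.0, 2.0, 3.0, 4.0]
ic50_nm = [10_000.0, 1_000.0, 100.0, 10.0]

# x = molecular feature, y = biological activity on the pIC50 scale.
activity = [pic50_from_nm(value) for value in ic50_nm]
slope, intercept = fit_line(descriptor, activity)


def predict_pic50(x: float) -> float:
    """Predict pIC50 for a new molecule from its descriptor value."""
    return slope * x + intercept


# In this toy set a positive slope means a higher descriptor value goes
# along with a higher pIC50, i.e. a lower IC50.
print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"predicted pIC50 at descriptor 2.5: {predict_pic50(2.5):.2f}")
```

With a single descriptor the "interpretation" is just the slope; with many descriptors the same question (which features matter) is usually answered with coefficients or feature importances from the fitted model.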
so on this one that's 10 000 okay i'm not sure why this one has more um and then i selected only ic50 oh okay i know that because we have only ic50 now we have 5900 but before in 2017 before we published it we had 3 500 compounds okay and then we did a series of filtering we processed the data set we removed the salts and therefore we had a high quality data set of about 1300 compounds and then we take the processed data set to use here in the modeling process and then we perform a series of processing on the data we take the data we calculate the descriptors we perform feature selection so we remove any irrelevant descriptors we perform data splitting so this workflow is data science okay it's about building the model okay and so we perform data splitting so you don't have to know the detail of that okay so we performed 70 30 data splits and then we generated the qsar model and then we have the predicted activities okay my daughter's running okay so and then we have the prediction right and then we have to evaluate the performance r squared and q squared and then yeah and then we report that in the paper okay and so here are some of the applications of qsar and you can see that aside from drug discovery here you could also use it for regulatory usage in order to evaluate whether the compound is toxic or hazardous another is for modeling the various properties of the molecule you know like whether they have a particular melting point or a freezing point okay this is an illustration of the bioactivities that our research group has studied but this is quite outdated so we investigated various biological and chemical and protein properties and this is a comparison between qsar and proteochemometric modeling so the difference here is like this for a qsar study we have a single protein shown in blue color and we have several molecules shown as the orange triangles okay this is called qsar as i have told you about throughout
the lecture and then we have this thing called proteochemometrics so the different part here is that we added additional target proteins into it so instead of having one we have two and three and so this is ideal if you want to study the kinase family okay let's say you have 100 members in the protein kinase family you could create a single model that will be able to simultaneously make predictions for the various molecules and the various proteins okay so this is proteochemometrics okay so maybe in the exam i will ask you to compare between qsar and proteochemometrics okay and provide some examples draw some illustrations okay so something like that and you could draw this and then you could try to explain how they differ and how they are similar okay so as you can see from the illustration they're quite similar right qsar is for a single protein and proteochemometrics is for many proteins okay so this is many to one this is many to many okay and therefore if you want to investigate the off-target effects you will use proteochemometrics right if you want to figure out what compound could be repurposed in order to treat another disease then you could use proteochemometric modeling okay so in this example this is the green fluorescent protein so i think we could skip this this is just an example that i want to illustrate to you how we use qsar to investigate several types of property so we know that the jellyfish emit different colors right sometimes it could emit green color sometimes it could emit red or yellow or orange cyan color blue color and the color comes from the chromophore as you see here there are 18 different types of chromophore it's at the middle part okay and the chromophores have different structures some have the indole group some have the hydroxy coming from the phenol coming from the phenylalanine and normally tyrosine and some come from the tryptophan
right the indole group okay so it just goes to show that different chromophores give different colors and then we built a model out of that so we took the chromophore and we took the mutational information of the protein and then we built a model okay so nothing for you to memorize here just to show you that we took the protein information and the compound information and then we tried to explain the origin of the color and this is just a summary of the performance of prediction which we will skip and this is just a summary of the model performance okay so the summary here is for the gfp paper i think we could skip it here and you could read it um proteochemometrics i've mentioned already that we could use proteochemometrics to perform drug repositioning to understand how a library of compounds or ligands could bind to a library of proteins okay and therefore you could use this for personalized medicine right because every human being has a unique composition of their different variants of the cytochrome p450 and other proteins so given the protein sequence information you could generate information about the protein and then you could create a model and then you could figure out whether the compound will have any side effect against your own proteins so that's the possibility of proteochemometrics it considers the variability the heterogeneity of the protein composition okay so let's have a look at the conclusion okay so in a nutshell you know there are a lot of advantages of using these types of predictive modeling but then there are some pitfalls that you should be aware of okay so there could be the high dimensionality of the input space meaning the molecular features okay sometimes you're working with too many molecular features sometimes it's like how would you represent the molecule which molecular descriptors would you use another would be the use of machine learning algorithms how could
you use or develop interpretable models okay and another could be like what is the applicability of the model that you develop will it be usable after you develop it or will it only be good for the data set that you have built it on okay so it's the long-term usage of your model okay so you could address that by validating your model performance okay so the second conclusion here is that although there are some flaws in the field of qsar you know not all technologies are perfect and therefore if we are able to understand the limitations of the technology we will be able to figure out a way where it is optimally used okay and so i mentioned here that there's a lot of potential for qsar and proteochemometrics and in light of the explosion of you know high dimensional data and also the era of personalized medicine or we call it the omics era there's ample data that are available and the thing is how can you make use of all of these data right the omics there are so many omics available right we have genomics proteomics glycomics interactomics all of that could be integrated as features in your model and our research group has developed some software we developed several web servers and then if you want to get started the thing is you don't need any fancy equipment you just need a web browser and you can get started using google colab and for software you can use free software so you don't have to buy anything and for programming i recommend to use python or r okay and you can get started in computational drug discovery and i have several tutorials on my youtube channel which guide you from the basics with no background uh these are all of the data and resources that i've talked about during this lecture and these are just the names of the particular resources that you could have a look at okay and so that's all for today any questions i have one question people who invent a compound during the drug discovery process most
of them are a group of scientists working outside a pharmaceutical company right um so it really depends they come both from academia and from the pharmaceutical company so they can come from both sides okay after scientists discover a substance that is likely to be effective do they sell it to the pharmaceutical company no so the thing is it really depends so if let's say we research or publish about something we don't sell anything so we just publish about it and the pharmaceutical company if they read it they see it they could you know they could just buy it from sigma-aldrich and the thing about selling it it means that you need to patent it and therefore you see that many papers they don't describe the structure they don't show how the molecule looks until they get the patent right and if they get the patent then they publish it later so the thing about patents is that you need to patent it first and then publish later so if you publish you cannot get a patent out of your published work you need to do the patent of the compound and then you publish about that work okay i have one question uh i'm just thinking that for drug repurposing it can be done by a computational approach right no need to do that kind of like wet lab yeah drug repurposing you could do a lot using a computational approach so it's kind of like you're making use of available data that has already been published but the thing is the original researchers they did not investigate that activity they did not analyze it but then you're doing a post analysis and you will find something that the original paper might not have known existed and so yes we use the computational approach after the publication has been made to do the drug repurposing and after we identify some candidates then you know if you have collaborators you could confirm it experimentally as well so this one this lecture is new to me uh for
for the model for designing the model and like for us with no background knowledge of machine learning would we be able to do this kind of design of the model like you are using the kind of conceptual qsar so to design the model i think you mean like to develop the model like if let's say for example you want to use machine learning to build the model so some of the knowledge that you would need is to have some training in machine learning model building and another would be to know about like the general you know like how do you calculate the molecular features so more into the bioinformatics yeah so if you're coming from a biology background then i think it will be easier on the side of the domain knowledge that you already understand about the protein detail the compound detail but in terms of the computational part you might need to spend more effort learning about machine learning about python programming or about the use of the various tools and you might be overwhelmed because there are like hundreds and even thousands of tools that are available but the thing is which one do you use in what sequence actually we published a paper it's called maximizing computational tools for drug discovery so out of the thousands of tools that are available which one do you use it's right here it's actually summarized here and also from that review article that i mentioned we published in expert opinion on drug discovery okay so what database do you use it's here chembl pubchem bindingdb zinc gdb-17 pdb uniprot how do you curate data we developed our own in-house tool we call it biocurator and how do you calculate the descriptors right here what tool do you use for analysis of the model of the data in r you use caret or even parsnip if you use python you use scikit-learn or you could also use deep learning by tensorflow or pytorch if you want to make plots in r you use ggplot um
in python you could use matplotlib seaborn altair there are so many okay so molecular modeling you have these available docking you have autodock molecular dynamics you have gromacs namd amber okay there's so much software but don't get discouraged because the thing is to do it step by step okay to figure out which tool is suitable for your hypothesis and then master them one by one so as you can see all of these tools you know like over the course of what almost 15 years at our research group each paper will only make use of one or two tools maybe in one paper we use only chembl and scikit-learn and matplotlib another paper we might use chembl pubchem pymol autodock namd another paper we might use chembl babel and autodock namd so the thing is it depends on the paper it doesn't mean that all papers will use all of this okay thank you guys thank you for sharing yeah i have another question to all of you is um if i share this as a youtube video but then i will remove the image of your faces would that be okay it might have the audio yes that's fine yes okay