Transcript for:
NPTEL Bioinformatics Course Notes

Welcome to the NPTEL course on Bioinformatics, I am Dr Michael Gromiha from the Department of Biotechnology, IIT Madras. In this lecture I will provide an overview on Bioinformatics, different aspects of Bioinformatics and the applications of Bioinformatics and different complexities of biological systems. In this subsequent lecture, I will describe the details on various aspects. In this course, I mainly follow the books, Protein Bioinformatics written by me and published in 2010 by Elsevier and Academic Press and the book by Krane and Reymer Fundamental Concepts of Bioinformatics and published in 2006. For the general concepts of Bioinformatics I will use the Krane and Reymer book and for the applications on various aspects of proteins; such as protein sequence analysis, protein structure analysis protein structure prediction and protein folding, I use my book on Protein Bioinformatics. So, what is Bioinformatics? If you split it into two parts: bio plus informatics, so applications of the informatics on biological systems. So, I put the Bioinformatics in the central part. So, it is a field of science in which biology plays a major role right and combined with other applications from different fields of science, such as computer science, information technology and others to merge into a single discipline to analyze the biological data using statistical techniques as well as the computer algorithms. So, if you see this diagram. So, I put the Bioinformatics at the middle with all other fields, which are linked with Bioinformatics. So if you see this midst of Bioinformatics in fact, the analysis of biological systems have been carried out for the past few several decades using a small scale analysis; and in 1979 Paulien Hogeweg coined the word Bioinformatics to deal with the applications of the different science in biological systems. So, we look into the different fields of science, how the different fields contribute to the birth and the development of Bioinformatics, for example, computer science. Can you tell one example, how the computer science contribute to Bioinformatics? Machine learning can be used probably and in computer science, there are a lot of computer algorithms have been used for solving the biological problems. Correct. So, you can develop several programming and you can extract the hidden data, available in biological information. Also you can use different machine learning techniques and the algorithms, to understand and to capture the information as well as for the prediction purposes. So, can you tell one example how the mathematics or statistics is used to the development of Bioinformatics? Example would be use of statistics for an example in the use of Ramachandran plot they plot the phi and psi angles of protein. Correct. So, you can use mathematics right to derive some principles and relate the data for example, the protein sequences or how the distribution of the residues in the Ramachandran plot and so on using this mathematics and you verify the data whether you obtain any model. So, whether they are statistical significant or not you can relate with the correlation analysis you can relate the regression techniques, as well as you can see whether the data are statistically significant or not. So now, if you use the information technology; how the information technology contribute to the development of mathematics. There has been a increased computational resources over the decades, which is used to do a large scale data analysis. Correct. So, you can use for a large scale analysis, you can develop the online resources right you can see the computer storage and so on in this case that will enhance the applications of Bioinformatics to various fields; likewise physics if you talk about physics. So, you can see the concept of various types of interactions like electrostatic interactions, van der Waals interaction, hydrophobic interactions, how these interactions are important to understand the folding mechanism proteins. For example, if you take protein folding, in the unfolded state of a protein. So, it is like wobbling and you can see it is very like a random coil confirmation. When this protein folds into specific three dimensional structures, it can form a specific 3D structure. And how a protein can attain a specific 3D structures from its sequence right this can be explained by various types of interactions like disulfide bonds electrostatic interactions, van der Waals interactions. So, understand the principles governing the folding state of a protein, it requires the physical concepts. So, you can use physics to understand the mechanism of the folding in globular proteins. So, we consider all the fields; if we say life science, computing, maths, stat, information technology or physics or chemistry, which one is the some major field for the birth of Bioinformatics. Life sciences. Life sciences right because we need the data. So, without data even if you have several fields, we cannot apply it to different data right. So, we need a specific data. So, data where shall we get the data? We can get the date from biological experiments. So, if we look into this life sciences there are produced a lot of data right on a various aspects such as the macromolecule sequences for example, DNA sequence or protein sequence right and the structures, protein structures, DNA structures, complex structures and so on different expression profiles different pathways and so on. So, how the Bioinformatics is used to understand the data? So, Bioinformatics helps to acquire the data, to manage the data and to analyze the data and to understand the data right. So, Bioinformatics is the major field right to understand the concepts and understand the hidden information available by the experiments produced in life sciences. So, now what are the various aspects, how the Bioinformatics is grown? What are the various applications of Bioinformatics, how the Bioinformatics contribute to society? There are various aspects. So, I briefly I put into 5 bullet points, the first one is well-organized databases. The biologists they do the experiments and they produce data and they publish the data and the literature it is very important to collect all the information and put in the form of database. For example, if data are available scatteredly here and there right. So, it is very important to collect all the information and put in a proper form right and then you give some options to extract the data from this database. This is what the Bioinformatics do to develop plenty of databases which are well organized, are in computable form Once we have the data right, sufficient number of data, which is very essential and it is required for any analysis right otherwise you do not get any statistical significant data. So, second option is once you have the database, you can derive hypothesis, what will happen what is the relationship between some specific features as well as any function. If you know the function or if you know any specific characteristics of any biological systems, what makes these characteristics to a particular systems right? So, here we try to develop some features right and using these features you can link the function of any biological system right for example, if you take protein sequences or protein structures there are different types proteins say for example, if you get a protein structures globular proteins, membrane proteins and different types of globular proteins say alpha proteins beta proteins and so on and whether we can identify these types of proteins from a pool of sequences right. So, you can, left side you can see the sequences; because if you have sequence you know the information regarding the amino acid residues right. So, you can get the number of residues of which type. So, there are 20 different types of residues right present in these proteins, and you can relate these fields there is dominant of some specific residues for example, hydrophobic residues, then you can say that this could be a protein; likewise for any proteins having different functions or different type of diseases right. So, you can derive the hypothesis what is the basic principle for having this biological systems once you derive the hypothesis to understand these are the major factors for any specific systems right the next step is whether we are able to describe any algorithm, whether we are be able to derive any algorithm right. So, if you see the features and if you see the functions, and if you carry the relationship; here you have the features this side here you have the function, whether we can relate these features and function then what is the mathematical equation to characterize the function in terms of features right. So, we can do a function right to understand the function from the features. These when you do these we can make the algorithm once algorithmic study, then we can use it for public in this case we use web servers or online applications right with the earlier days when the internet was not fast enough. So, at that time see everyone they create the servers, they create their own algorithms, they keep themselves, it is difficult to transfer. But currently due to the advancements of these computers and biology and computational techniques, and fast internet facilities several webservers have been developed to give the applications to the others right. In our laboratory also we have developed various tools, which are widely used in the literature. So, many people use these databases and as well as the tools, try to understand any biological systems. So, the fourth one, I will little bit discuss about the virtual screening. So, for example, in the case of drug design, currently it is very popular, because there are a lot of small molecules are available in the literature and the people are affected with the several types of diseases like the cardiovascular diseases, cancer and so on and currently we are developed with the Chikungunya, Dengue, and so on. So, in all these cases to identify drugs, they try to find a target and then we see what are the functions of the particular target, what are the actions, what are the important residues right and then we try to inhibit that activity so that we can reduce these disease. So, here is one example for the structure based drug design. So, if you have a protein. So, this is a target. So, here I show a target of c Yes Kinase, because these c Yes Kinases are very important for several cellular activities right. So, there are several kinases one is the c Yes Kinase here this is very important for the colorectal cancer. So, this is an attractive target to define the inhibitor for the colorectal cancers. To do this there are various options how to derive a particular ligand to be an inhibitor right. So, it is a very large pool, how to derive it. So, in this case I will tell one example. So, finding a fish in a pond; so if you see it is a pond here. So, can you see different fishes in different ponds? So, if you want to catch a fish where will you put your net in the number 1 right number 1or number 2 if you put you will get a fish if you put your net in number 4. So, you will not get anything you only will spend much time you will not get anything right. So, if some of you will tell you that you are catching trying to catch a fish for long time. So, you try to use this one, to put that in a particular side. Then if you do it and if you get a fish then you will be very happy right because you do not have to waste your time. So, this is the case, if for any disease. So, there are different compounds for example, compound 1, compound 2, compound 3 and so on if we take compound 1 and you try this is failed. So, it is not a drug and you have compound 2 this is also not a good drug and then compound 3 is probably a drug, right there are millions of compounds, if you look into the literature right. So, there are a lot of compounds for example, if you go to a zinc database, there are 35 million compounds and in the enamine database 2.2 million compounds and in the natural compounds in the Chinese medicine 35000 compounds. So, if you want to try one by one right when you try everything, then the patient will die by that time. Second case if you try to use each compounds experimentally, if will take long time it needs large manpower and also it needs it is lots of money. So, in this case how to do it; so among these 35 million compounds if someone can reduce to 35000 compounds, then the number of experiments will be reduced by one 1000 times. So, if you instead of these 2.2 million compounds we have 100 compounds or 1000 compounds, then you can reduce enormously the search option how to do it. So, here is a solution, Bioinformatics can do it because currently we have very fast computers and we have very good techniques. So, it can assist one to searching drug target and designing drug for many millions of compounds, and with that second one is how to use hypothesis? Here I show you an example. So, here I have 5 of known values; so experimentally known. So, we take the example of number one the 0.1. So, rice 50 percent, wheat 25 percent, meat 10 percent, fruits 10 percent and vegetables 5 percent like. If you do this, then we can see that this is not controlled for example, take the food pattern and weight control and go over the second one. So, rice 30 percent, wheat 5 percent, meat 10 percent fruits 30 percent and vegetable 25 percent in this case also it is not controlled. And if you go for the third one rice 25 percent, wheat 10 percent, meat 10 percent, fruits 25 percent and vegetables 30 percent here it is controlled likewise the 5 examples. So, now, I have a test case this is a test case right, to another one consumes 20 percent, rice 10 percent wheat fruits 20 percent meat 10 percent vegetable 40 percent right. So, now, the question is here it is control or not what is the answer? It is control right it is correct. So, why it is control, how do you know this is control can you tell one example. Vegetables are high Vegetables are high means; vegetable here also are 25 percent, but it is. This is not controlled, but here vegetables is less. So, we can derive some principles you can derive some equation series statistics, right. So, show one example right we can say that if answer is control you are right. So we can see the right hand wheat is less than 35 percent and meat is less than 15 percent and vegetables more than 30 percent. So, likewise you can derive several equations right, you can properly study the initial data sets, experimental data sets, from this experimental datasets you can derive some equations right, some conditions where this will fit and apply these conditions to any set of data and then you can see whether that is control or not. So, likewise Bioinformatics can handle large amount of data and provide possible solutions. So, I will explain a little bit more about the virtual screening of the compounds, here is show one example how we use the virtual screening to understand the drug design. So, it is shown an example here is the protein right this is the c Yes Kinase in a protein. So, there are different domains. So, we can see one domain left side that is a domain SH3 domain and SH2 domain and here this is the tyrosine kinase domain and here is the catalytic domain and here is a phosphorylation site the tyrosine 416 and this is the loop; the question is, this is very is very important for the colorectal cancer, this is a target. So, it is important to identify a probable hit target for this particular c Yes Kinase enzyme. So, this is one aspect here one side we have the protein. So, you have the target to c Yes Kinase and the other side. So, how to design an inhibitor. So, they have a library of enamine library you have 2.2 million compounds; among the 2.2 million compounds, how to choose? The probable compounds which can be a lead compound for a drug. So, and if I do it for a 2.2 million compounds, it can take long time it takes lot of money because it is compound cost of 30,000 to 40,000 rupees right. So, it will do this spend time to do for all the 2.2 million compounds. So, how to do that? So, in this case you can derive some methodology, first you see whether the structure is known if the structure is known then you can use the particular structure. If the structure is not known then you may need to model this structure and then we have to stick for the activation sites, where are the activation sites they will again combine and see this pockets which are the binding sites right here is the site. So, now we see 2.2 million compounds you check the features of all the compounds and make some conditions to fit with this particular a pocket right then you can use some molecular weight, you can use the hydrogen bond donor, or the acceptors right, various options you take and then you eliminate the compounds. Finally, you can use virtual screening like docking you can do with these compounds and finally you derive some compounds. So, in 2004 the Tokyo Institute of Technology organized a competition, to identify a inhibitors for this particular target. We also contributed in that right, we it is identified about 120 compounds, and they tested 50 compounds among 120 that and we showed that 4 compounds showed inhibition and one is the probable hit compound. They continued the same in the next year 2015 right there are about 2000 compounds right they found 5 showed inhibition and 2 are hits. So, in the down side we can see this figure, I show they how the ligand interact with the protein you can see the green ones right. So, the green shows the ligand, and the surrounding ones are the protein side. So, these ligands, they have some specific interactions. You can see the hydrogen bonds or the hydrophobic interactions and the van der Waals interactions and they, because of these interactions they tightly bind to the protein and they act as an inhibitor for these particular Kinase. In the later classes I will explain about the more details on the structure based drug design. So, till now we discussed few aspects of Bioinformatics. So, what are the different aspects we discussed? Databases, one is well organized databases Bioinformatics contribute to organized databases right. So, then the second one computationally derived hypothesis when you have the data, then you can develop several function right. So, to relate the features and the functions right and once we derived the hypothesis then we can develop several algorithms right for the prediction, and then once we predict then we can make it as online applications in the form of web servers, there are several web servers for example, protein structure prediction, protein function prediction right. So, how the DNA can bend, and how the DNA can interact with the proteins and so on. Then the fourth one we discussed now regarding the virtual screening of compounds how they do the screening for the drug development. And currently if you see the Bioinformatics is widely applied in next generation sequence analysis. Now all are interested in personalized medicine, few years ago it was very expensive to sequence a genome; now it is very cheap to get a sequence right. So, everyone wants to see their own genome, you know and what are the proteins they have and what are the functions they will do, and to understand what are the probability of having any specific mutation to your protein and so on. So, in this case now currently we have a lot of data right, obtained from next generation sequences right for example, the Illumina sequencing right. So, we can due to the advancements in the sequencing techniques, now there are a lot of data available in the literature. But the question is how to analyze the data. So, how to extract information from this specific sequences right. So, there are several ways to get the sequences, they have used short reads and to get the final sequence. For example, if you some patients which are affected with a cancer or affected with the Parkinson disease and Alzheimer disease and so on; how is the genome or proteome different from healthy individuals. So, they get the data for the patients, they get the full sequence and they get the data from the individual healthy individuals and they compare. So, how are the features, what are the variations or the mutations right, where the mutations are in the protein coding regions or non coding regions. And then they relate how these mutations are or these residues are important, where they are involved in different pathways and how they are influencing the different diseases. They try to see the information and then they go with the treatment. For example, if you are affected with a cancer or any specific diseases they are treated. So, they go to the hospital, they get the patients data as well as they get the information regarding drug and the drug response and they make a database you can do it right. So, from that information you can see that if any specific variations, this specific drug will work. So, we have this information, then this will be helpful for the personalized medicine for different diseases. Likewise Bioinformatics plays a major role on different aspects in human health as well as for medicine.