Hi, and welcome to our course, Introduction to Biostatistics. This is our first lecture. In it I want to cover what statistics is and what biostatistics is, and then talk about a few important things that will serve as a guide for you, as landmarks, as we go ahead in this course. I will begin with the course material: the two main reference books that will serve you in this course.
The first book is Introduction to Probability and Statistics by Mendenhall, Beaver and Beaver. The second is Introduction to Probability and Statistics for Engineers and Scientists. Both books are available on Flipkart, Amazon and other regular websites.
So I come to the first question: what is statistics? Statistics is the science that deals with the collection, classification, analysis and interpretation of numerical facts or data. For example, we continuously hear about the US elections right now. Every now and then you see an opinion poll on the margin by which Hillary Clinton will beat Donald Trump. These polls are taken repeatedly, and still the different companies that conduct them come up with different numbers.
One might predict that Hillary Clinton will win by 30 percent of the votes, another by 20 percent. This gives you an idea that the process is extremely complicated. It is very easy to come up with a number and say that this is the difference.
But there is a very active science behind this process to make sure that the numbers coming out of it are accurate. Another example I want to talk about is weather patterns. Every year in our country the meteorological department predicts whether the annual rainfall will be normal, whether it will be good enough for this year, and whether it will beat expectations based on the past year.
We all know that India has faced a severe drought for perhaps the last two consecutive years. It has been particularly extreme in parts of Maharashtra, where I stay. In all these cases the predictions that these agencies give are taken very seriously, so that the government can make an informed decision about what steps to take. If the rainfall is adequate, what are the steps to make sure that the water is uniformly distributed? Should there be a water cut if the rain is not up to expectations? If there is excessive rain, should the dams release water, given that this might create a flood-like situation in some other place, as was recently observed in Bihar?
The last case is, for example, income by profession. When we finish class 12, we ponder which educational trajectory we want to choose. Of course, everyone wants to be an engineer, for the simple reason that there are enough industry positions open, so that after a bachelor's degree we can get a job which will secure our future. That is not true of every profession, and every profession has its challenges. For example, if you want to be a singer, then initially you will have a very minuscule income, but as you become established in your field, your income will grow.
So what is biostatistics? Biostatistics is nothing but the application of statistics to the study of living organisms: human beings, animals, or any biological process for that matter. We can take examples from any field which concerns biology, medicine or bioengineering. In biology, for example, we may want to talk about evolution. In evolution we want to see, let us say by comparing the structure of the bones, how a dinosaur eventually became a bird, and so on.
For this kind of question you want to make an exact analysis of specific structural components of, let us say, a particular part of the body and see its evolution as a function of time. In medicine, you want to predict the ability of drug resistance to arise in a given population: given that in the past the population has become resistant to particular drugs, what is the chance that it will become resistant to the new drug which a company is thinking of bringing to the market? The last case, for example, is in public health. You are all aware of the damage that the Zika virus is currently causing.
In the case of Zika virus, whenever you go to the airport you will see that people coming from affected places must have a mandatory health checkup to eliminate the possibility that a person is a carrier of the virus. These are examples where statistics has been used to make informed decisions: first to measure, then based on that measurement to come up with some analysis, and then based on the analysis to make predictions about what the corrective step should be.
If I look at the history of statistics, it actually started with trying to understand the process of gambling, which is one of the oldest professions people have participated in. This has its roots in the development of probability distributions by De Moivre and Laplace. In the 19th century, for example, you have Galton's discovery of regression and Karl Pearson's work on parametric fitting of probability distributions. Of late, statistics is of particular relevance to the fields of biology and biomedicine and to its applications in clinical science.
So why study statistics? Let us take a sample case. Let us say you are the clinical coordinator for a clinical trial.
You are enrolling patients at x locations; let us say x is 5 or 6, as the case may be. And you are pondering whether you want to add a new site for that clinical trial. To conduct a clinical trial you need approval from the institutional review board (IRB), and one of the members of the IRB asks: okay, you want to add one more location, but what is your stopping protocol? You have no idea what a stopping protocol is. Well, a stopping protocol is the set of rules which enable you either to declare your clinical trial a success at an earlier time or to terminate the trial early if you already have negative data coming in, because you collect data in a continuous manner.
So you have thresholds at various stages of data collection which enable you to label the clinical trial a success at an earlier time point, or to label it a failure so that you minimize your losses, and so on. Another example of biostatistics, let us say, is measuring cell motility.
Inside our body, cells are not sitting steady in one position. Many of the cells are continuously moving. Take the case of immune cells: they are continuously moving and scavenging for foreign particles which have come inside our body. The immune cells want to intercept these particles, so the cells need to be able to move all over the place. Now, in the case of cancer, or whatever the disease may be, you want to come up with a drug which inhibits this motion. Cancer is a disease in which certain cells which are supposed to stay in a certain organ start migrating and proliferating, which means they start dividing, and at some point they actually start invading the entire body. They eventually reach secondary tissues and disrupt their function, leading to death. So a significant amount of effort by scientists goes into finding drugs which target cell motility, the ability of the cell to move. You want to quantitatively assess whether the drug or drugs that you have developed in your lab work, that is, whether they inhibit cell motility. In your lab you have an experimental assay in which single cells move randomly under the microscope, and you are acquiring movies of them.
The question here is that you want to understand the role of a specific protein, or the effect of the drug, as the case may be. So you want to measure the cell motility and see whether there is an effect or not. Essentially, you want to measure the cell movement. The question is how we should acquire the movie when the cells are moving. Let us say a cell is jiggling in one place; it is not really moving.
Let us say that within 5 seconds the cell is pretty much static in one position, but over 5 minutes it actually moves by a distance equivalent to its own length. If you acquire an image every 5 seconds, then essentially the information you are gathering is noise, whereas if you acquire every 5 minutes, maybe that is an optimal time scale.
If you measure every 10 minutes, that may not be good enough, because the cell might have come back to its original position, and to you it seems that the cell is static. So how fast you acquire these images is an example of how statistics can help you understand what your acquisition speed should be. Let us take another example, that of a protein.
Proteins participate in various functions in our body. When protein function is perturbed, the proteins stop working, and this leads to disease processes. Now, you can imagine a protein as a long string which has loops in it. These loops hide some sites which other proteins ideally can recognize, but because the string is in a looped configuration these sites are hidden and other proteins cannot see them, which means they cannot bind, and the downstream signaling cascades are absent. Under these conditions you want to understand the statistics of protein folding: how much force do I need to exert for a particular domain of a protein to unfold? For that you can use techniques like atomic force microscopy to pull on the protein, and what you measure, essentially, is the force which is required to open up a given domain.
Now, this force is not going to be a constant; it will give you a distribution. Let us say that in one particular experiment, for a given domain of a protein, you get a value of 10 piconewtons; in the next realization of the experiment you get 12 piconewtons, and in the next one 8 piconewtons. So there are small differences, but if you acquire the data enough times, what you will probably get is something called a Gaussian distribution, or normal distribution, and the peak of that Gaussian distribution you can say is the force required to unfold that particular domain. One last example of how biostatistics can be relevant: taking the case of cancer again, a cancer consists of cells which are heterogeneous.
The different colors here are used to make out that you have cells of different types. By different types I mean that the cells may be of different sizes or have different extents of deformability: one cell is stiff, one cell is soft, and so on. You want to find a process whereby you can understand and differentiate these phenotypic differences. So you measure multiple parameters: size, stiffness, expression of some surface protein of interest, and so on.
Then for each cell type within the subpopulation you get its phenotypic characterization. You want to understand what combination of these metrics would enable you to separate the cells into different lots.
Again, statistics will help you in identifying the number of subpopulations and their properties. So what are the types of studies which fall under statistics? How do statisticians acquire data? One of the most common types is the survey, or cross-sectional study. The important point about surveys is that they collect information about the population as it is right here, right now.
For example, a particular company, let us say Airtel, might want to know the patterns of texting by students on college campuses: how many text messages do students send on average? Then Airtel can decide accordingly how many free SMSs to offer and after how many messages to start charging. If they overcharge, the student population will eventually go ahead and sign up with some other company, so you want to retain your consumer base yet maximize your profits. A survey is a critical example of how statistics might help.
Another case is a retrospective study. A retrospective study means you have a population of people who might have been exposed to something. Let us say yesterday I went to a particular restaurant and had my food, and today there is an outbreak of food poisoning. An investigative agency wants to know whether the food poisoning is occurring because of one particular food that was consumed or whether it is a random event. In that case you would want to sample people who visited that particular restaurant on that given day and compare them with other people of similar demographic background who did not go to that restaurant.
So this is an example of a retrospective study. Another type is a prospective study. For example, on college campuses part of the student population starts smoking, and you want to know after how long smoking will eventually lead to a chronic disease like lung cancer, or any other disease which has a long latency period.
Such a disease does not manifest itself in 10 or 20 days; we are talking about years, maybe 10 or 15 years. So statisticians would preferably like to track certain people over time, from when they start smoking to their current state, to see what their incidence of lung cancer is. And of course the final type, very often used particularly by biostatisticians, is the clinical trial. Clinical trials are performed after you have proved that a drug works in the lab. Before we humans start taking that medicine, two things have to be proven: one, that if I consume that particular medicine I am not going to die or suffer any extreme side effects; and two, that the medicine actually works, so that it can be put on the market. So there are lots of studies of what my study size should be in order to test whether a particular drug is working or not, and so on. In all these processes there is a critical component of sampling. In essence, you have a population which is very big. In India, for example, we have a huge population. Let us say I want to make a soap, or Patanjali wants to come up with a new toothpaste or some other product; it wants to test it on a sample. It wants to take measurements on that sample to test whether the product is good or bad, what the price of the product should be, and so on. I have to put that questionnaire in the form of a survey and ask some people. Now how many people should I ask?
Is it 10, 20, 200, 2000? Of course you cannot ask every person on this earth, or even every person in a city. You are still going to sample. Again, what is your sample? From the population you might draw multiple samples, for example from cities like Mumbai, Delhi and Madras.
There you might draw a certain kind of people to get feedback on how the product might do. For a more town-like area you might go to much smaller towns, and the information you get might be completely different. But as an exception, if you are just doing the survey on college campuses, maybe it will not be that different between a city and a town. So choosing the type of sample and the sample size represent two of the most important choices that we must make. Here is an example of how wrong choices can lead to a complete mess.
There was a magazine in the US called the Literary Digest. In 1936 it conducted a poll about the presidential election, asking who would win, Landon or Roosevelt. The magazine came up with the prediction that Landon would be victorious, which did not happen. Not only was it not a borderline case, it was a significant victory for Roosevelt. So this was a clear failure for that particular poll, and no wonder the publication actually stopped shortly afterwards. It is very important to make predictions which are reasonably accurate. We see repeatedly in election cycles in India that every major TV channel has its own prediction, and in spite of all the checks and balances there is still a reasonable amount of heterogeneity in the predictions as to who will win and by what margin. That brings us to sampling and the types of sampling.
In terms of types of sampling, the simplest is simple random sampling, which means that I want an unbiased view of the population. Let us say I want to measure the average height of the class. I just ask 5 random students; maybe I even blindfold myself, walk around randomly, pick a student and measure his or her height. That will give me an average height. Now imagine I am doing the exact same thing, but I choose only girls or only boys.
Of course my measurement is going to be skewed, because on average boys are slightly taller than girls, so the average height that I come up with is going to be erroneous. The other type is systematic sampling. Let us say you already know the population, or the sample list.
For example, let us say the class is the population and I want 10 measurements, whether of height, weight, age or something else, and I know the population size is 50. What I do is arrange the entire population by name, let us say in ascending or descending order of surname. Out of the 50 I choose number 1. Since I want 10, the spacing is 5, so I then choose number 6, then number 11, and so on. So I am still not biasing myself, but I have made the selection in a much more systematic manner.
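The selection rule just described can be sketched in a few lines of Python. This is only a minimal illustration and not part of the lecture material: the function name `systematic_sample` is hypothetical, and the optional random starting point is an assumption added here (the lecture's example simply starts at number 1).

```python
import random


def systematic_sample(population_size, sample_size, start=None):
    """Return 1-based positions picked at a fixed spacing.

    Hypothetical helper for illustration. If `start` is None, a random
    starting position within the first interval is used (an assumption
    added here; the lecture example starts at position 1).
    """
    step = population_size // sample_size  # spacing, e.g. 50 // 10 = 5
    if start is None:
        start = random.randint(1, step)
    return [start + i * step for i in range(sample_size)]


# Lecture example: class of 50 ordered by surname, 10 measurements wanted.
picked = systematic_sample(50, 10, start=1)
print(picked)  # [1, 6, 11, 16, 21, 26, 31, 36, 41, 46]
```

Leaving `start` unset picks the first student at random from the first five names, which is one common way to keep an element of chance in an otherwise fixed pattern.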
If your population size is reasonably known, then systematic sampling is very commonly used. The last one is stratified random sampling. Let us say, for example, you have a basket of balls of 3 different colors, and you know the ratio of the number of red balls to blue balls to yellow balls. Given that you know the stoichiometry of red to blue to yellow, logic would dictate that if you are doing a random sampling, the number of balls you draw of each color should be in proportion to that color's share. This is called stratified sampling. Let us go over a very simple case of simple random sampling.
Let us say you have 6 patients, we name them A, B, C, D, E and F, and we want to pick 4 of the 6 patients.
So what do I do? I want this process to be completely random. If you have A, B, C, D, E, F, there are a total of 6C4, or 15, possible combinations in which 4 different patients are selected.
ABCD is one example, ABCE is another, ABCF is another. If you write them all down you will see that there are exactly 15 combinations possible. So I want to draw one of these 15 combinations of 4 patients, and I want to ensure that the draw is completely random. What do I do? To each of the 15 combinations I assign a unique identifier. So ABCD is identifier number 1, ABCE is identifier number 2, and so on. So I have a total of 15 numbers.
Now imagine I break down the range 0 to 1 into 15 equal intervals. The size of each interval is going to be about 0.067. Then you take a table of uniform random numbers, choose any particular number and convert it into a decimal. Let us say the number we chose is 2 6 8 9 7. I can just put a decimal point in front: that is 0.26897. Now, I have broken the entire range 0 to 1 into intervals of width delta, where delta is equal to 1 by 15, which is 0.0667.
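As a rough sketch of the procedure being worked through here, the Python below enumerates the 15 combinations, numbers them in alphabetical order (ABCD = 1, ABCE = 2, and so on), and maps a number between 0 and 1 to the identifier whose interval contains it. The helper name `pick_combination` is hypothetical, and listing the combinations in alphabetical order is an assumption that is consistent with the first identifiers named in the lecture.

```python
from itertools import combinations
import math

patients = "ABCDEF"

# All ways of choosing 4 of the 6 patients, in alphabetical order.
combos = ["".join(c) for c in combinations(patients, 4)]


def pick_combination(u):
    """Map a number u in [0, 1) to (identifier, combination).

    The range 0 to 1 is split into len(combos) equal intervals;
    identifier k covers the interval [(k - 1) * delta, k * delta).
    """
    if not 0 <= u < 1:
        raise ValueError("u must lie in [0, 1)")
    identifier = math.floor(u * len(combos)) + 1
    return identifier, combos[identifier - 1]


print(len(combos))                 # 15
print(round(1 / len(combos), 4))   # 0.0667 (the interval width, delta)
print(pick_combination(0.26897))   # (5, 'ABDF')
```

In practice the number would come from a random number table, as in the lecture, or from a generator such as `random.random()`.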
Identifier 1 is the combination ABCD, identifier 2 is the combination ABCE, and so on. From the table I had chosen a number, let us say 2 6 8 9 7, and I convert it into a decimal. My delta is 0.0667. So for a number from 0 to 0.0667 I choose identifier 1, from 0.0667 to twice 0.0667 I choose identifier 2, and so on. Like this, 0.26897 should be identifier 5. Accordingly I can choose one particular combination of patients. So this is the way of doing simple random sampling. That pretty much gives you the background idea of what statistics means.
To summarize, we started off with the definition of statistics, which is essentially the science of numbers. Biostatistics is the application of statistics to the field of biology, biological processes or biological systems, and it has applications in basic biology, in medicine, in healthcare, in public health, and so on. After that we discussed some examples of how statistics might be beneficial in our understanding of biological systems.
I gave you the example of a clinical trial: you need to understand how clinical trials are designed, what the sample size of a clinical trial should be, and so on, and also be acquainted with the jargon of clinical trials. I gave you the example of cell motility, where statistics helps you to understand at what rate you should acquire movies so that your movies do not just give you noise, when something is jiggling in place, as opposed to data, when the cell is actually moving from one place to another. But you also do not want to miss too much of the information: if you sample too infrequently, then you are losing information as well. Statistics will help you to get that information.
We discussed the case of protein unfolding, how force can be used to open a domain, and how you can measure the average stability of a domain by doing experiments, pooling the data, generating a distribution and understanding what it means. Last, we talked about how you collect samples: sampling techniques such as random sampling versus biased sampling, and study types such as retrospective studies and surveys. So this is just our first lecture. From our next lecture we will start talking about how to measure numbers from an experiment: from the raw data, how do you go about generating a plot, which is very important, and how do you present that plot so that the common man or someone in the audience, be it scientific or non-scientific, can understand the gist of the data? You cannot just show the entire raw data, but rather the useful metrics from that raw data. The first step of that is to plot and convey what the data is. With that I thank you for the first lecture. I will also upload a few MCQs so you can get an idea of what we discussed and refresh your memory of this class.
Thank you.