Transcript for:
L9

Hello and welcome to today's lecture on biostatistics. We will start with a brief recap of the R language and then solve a few examples in RStudio. Just to review the basics: R is a software environment for statistical computing and data analysis. It is an open-source package, it is freely available, and it has a command-line interface. Since a command-line interface may not be convenient or easy to handle for many users, there are open-source tools that provide a graphical user interface on top of it, which is one reason it is so widely used. The best part of R is that it can produce publication-quality graphs with mathematical symbols. It is an interpreted language. This is just a screenshot of the R console.

You see, you can just enter things here. R has been widely used by statisticians. It was originally developed at the University of Auckland, it has a support base where people contribute to its further development, and it has met performance benchmarks comparable to those of GNU Octave or MATLAB.

That is the reason it is used for statistical analysis of big data as well as in business analytics. You can download R from r-project.org, and depending on your operating system, whether Windows, Unix or macOS, you can download the appropriate file and install it.

Because R has a command-line interface, people have developed software based on R that provides a GUI, and RStudio is one of them. This is a screenshot of how RStudio looks. This is the command window, where we type in numbers or whatever we need to do. This is the workspace, which stores all the data that we generate, and here you are given additional information on how to use something or what something means.

So this is just an example of how I can create a vector in R. Let us go directly to RStudio and go through all these examples so that the idea becomes clearer. This was an example I had done earlier. Before we generate a vector, note that you can do all the basic computation. I can write aa = 1, bb = 2, then aa^bb, and I get the answer; likewise bb^bb, exp(bb), and so on and so forth.

What you see, if I drag this down, is that all the numbers we are putting in are getting stored here in terms of values: it has stored the value of aa, the value of bb, and so on. So you can do the basic arithmetic. For sine of 30, as I pointed out yesterday, you have to multiply by pi by 180 in order to convert the angle into radians, and only then will you get the right value.

That is how you get 0.5. If you do just sin(30), you will get some other value. So remember that we have to convert every angle from degrees into radians in order to use the sin, cos and tan functions. You can also do the sine inverse: let us say we do asin of one half, and you get the value 0.523, which is nothing but 30 degrees expressed in radians.
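The arithmetic and the degree-to-radian conversion described above can be sketched as follows. This is a minimal illustration, not the exact session from the lecture; it uses the `<-` assignment arrow, although `=` as spoken in the lecture works the same way at the prompt, and the names `deg`, `sin_deg`, `sin_raw` and `angle` are introduced here for illustration only.

```r
# Scalar arithmetic; each assigned value shows up in the workspace
aa <- 1
bb <- 2
aa^bb        # aa to the power bb
bb^bb        # bb to the power bb
exp(bb)      # exponential of bb

# Trigonometric functions expect radians, so convert degrees first
deg <- 30
sin_deg <- sin(deg * pi / 180)   # approximately 0.5
sin_raw <- sin(deg)              # 30 is read as radians here: a different value

# The inverse sine returns radians; asin(0.5) is pi/6, about 0.5236
angle <- asin(0.5)
angle * 180 / pi                 # back in degrees, approximately 30
```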

So you have arc sine and arc cos, and you can do logs: log10(10) is 1. You can also write log(10, 10), which gives you the same value. You can put log of 10 to base 2, or you can write just log(10), which is the natural log, and you will get a different value because e is 2.7 something. So these are all the basic calculations. Now, to generate a vector, I can write cc = c(1, 2, 3).
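The logarithm calls just mentioned can be tried as below. This is a small sketch; the variable names are introduced here only so that the values can be inspected.

```r
log_base10  <- log10(10)     # base-10 logarithm of 10
log_base10b <- log(10, 10)   # same result, base given as the second argument
log_base2   <- log(10, 2)    # logarithm of 10 to base 2, about 3.32
log_natural <- log(10)       # natural logarithm (base e), about 2.30
e_value     <- exp(1)        # e itself, about 2.718

c(log_base10, log_base10b, log_base2, log_natural, e_value)
```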

As you see, the syntax is that you give the variable name, which will hold the vector, then c, which is for concatenate, and then item 1, item 2 and so on, separated by commas. If I type cc and press enter, it gives you the exact values. So this is how you create vectors. You can also do elementary vector operations. Let us say cc is this, and let us say you define bb as another vector using c.

I can do cc + bb and I get the answer. I can do cc * bb. You see that these are element-wise operations; that is why, when I do cc * bb, 1 is multiplied by 4, and so on and so forth. If I want to multiply everything by the same number, then I should do, let us say, 4 * cc. Then all the individual entries are multiplied by this factor. So for a scalar multiplication you put a prefactor outside the vector, and for vector additions and vector multiplications what you see is that they operate at the single-element level. If I go back to the PowerPoint: you can even create a vector with the same number repeated any number of times.
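The element-wise behaviour of the vector operations described above can be sketched as follows. The values of `bb` are an assumption: the lecture only indicates that its first element is 4.

```r
cc <- c(1, 2, 3)
bb <- c(4, 5, 6)   # hypothetical values; only the leading 4 comes from the lecture

vec_sum    <- cc + bb   # element-wise addition:       5  7  9
vec_prod   <- cc * bb   # element-wise multiplication: 4 10 18
scalar_mul <- 4 * cc    # every entry times 4:         4  8 12

vec_sum
vec_prod
scalar_mul
```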

The syntax is very understandable: you write rep, for repeat, then whatever the number is, 1 or a or b, and then times, how many times you want to repeat it, and you see you create a vector of the same number repeated 10 times. You can also repeat a sequence.

Let us say you have 1:3, which means 1, 2, 3, and this has to be repeated 5 times; you accordingly get this particular vector. You can also create a sequence. This is another example of a sequence, where you go from 1 to 7 in steps of 1. And you can create a repeat of a sequence.

Every time you put a bracket and put an operator outside it, it operates on the whole thing, and that is how you get a repetition of a sequence, here 5 times. So you have 7, 7, 7, 7, so on and so forth. This is just what I did earlier, element-wise operations on a vector. You can do all these calculations here. Let us go back to RStudio.
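The repetition and sequence examples from the slides can be reproduced roughly as follows. The exact calls on the slides are not visible in the transcript, so the arguments below are assumptions that match the spoken description.

```r
ten_ones <- rep(1, times = 10)          # the same number repeated 10 times
rep_123  <- rep(1:3, times = 5)         # 1 2 3 repeated 5 times
one_to_7 <- seq(1, 7, by = 1)           # from 1 to 7 in steps of 1
rep_seq  <- rep(seq(1, 7, by = 1), 5)   # the whole sequence repeated 5 times

ten_ones
rep_123
one_to_7
length(rep_seq)                         # 7 values times 5 repetitions
```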

I am back in RStudio. Now let us say I want to enter a vector interactively, as opposed to typing all the values inside c like this. If the vector is long, that is difficult to do. What I can do instead is enter the vector at the prompt.

In that case I can write dat = scan(), with open and close brackets. If I press enter, it puts a prompt here, which means I can enter anything. I separate the numbers with spaces. If I press enter, it again gives me the option of adding more numbers. You see the number 13 is written here because I have already made 12 entries.

So this will be the 13th entry. The numbering begins with 1, which is this entry. I can keep on putting in random numbers, and every time I press enter it gives me the option of putting in more entries. But if I press enter once more on an empty line, it understands that this is the end of the input, and it has created this vector with 17 items in it. I can now type dat to see its value, and this is how it is displayed. Depending on the width of the window and your font size, only the first 13 entries fit on this line; that is why the output continues from the 14th entry here.

To know the length of the vector I can use length(dat). It tells me that there are 17 entries in this vector. The syntax for entering character entries is slightly different. Let us say I again write dat = scan(), and I want to enter characters, say Monday, Tuesday, Wednesday. So I press enter two times. In this case, let us see, I think I have to write dat = scan(what = "char"). By default the scan function always expects to get real numbers. That is why you could see the error: expected a real, got M, which is a character. If I put this argument inside scan, so that what it is expecting is a character, then there is no problem.
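Because scan() normally waits for keyboard input, the sketch below passes the input through its `text` argument so that it can run non-interactively; at the console you would simply call `scan()` or `scan(what = "character")` and type the entries. The numbers are made up for illustration, and `quiet = TRUE` only suppresses the "Read N items" message.

```r
# Numeric input: scan() expects real numbers by default
dat_num <- scan(text = "12 7 9 15 3", quiet = TRUE)
length(dat_num)   # number of entries read

# Character input: tell scan() what type to expect
dat_chr <- scan(text = "Monday Tuesday Wednesday",
                what = "character", quiet = TRUE)
dat_chr

# Without the 'what' argument the same text fails with an
# "expected 'a real'" error, which try() lets us observe safely
res <- try(scan(text = "Monday Tuesday Wednesday", quiet = TRUE),
           silent = TRUE)
inherits(res, "try-error")
```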

If I now type dat here, you will see Monday, Tuesday, Wednesday as the entries. I can also add elements to a vector. Let us say I have xx = c of some numbers. I can add elements to xx by writing xx = c(xx, 6). If I type xx now, you see the 6 has been added.
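Adding elements with c() can be sketched as below; putting the new values first instead of last prepends them. The starting values of `xx` are hypothetical, since the transcript does not record them.

```r
xx <- c(1, 2, 3, 4, 5)         # hypothetical starting vector

appended  <- c(xx, 6)          # new value goes at the end
prepended <- c(6, 7, 8, xx)    # new values go at the front

appended
prepended

# c() returns a new vector; assign it back to keep the change
xx <- c(xx, 6)
length(xx)
```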

So it is possible to add numbers to a vector, and as I had shown you before, the position of xx does not matter: I could equally have written c(6, 7, 8, xx), in which case the order of the entries in the vector would be different. If I go back to the presentation: of course, these are still numbers that I can keep entering on screen, but if you have a big file, this approach does not work. Instead, you can import data in various ways.

This is an example where you import data in CSV format, which stands for comma-separated values. So let us do this particular example.

What I will do is open an Excel sheet. I have entered some numbers, and I can save them in CSV format. When you do Save As, by default the format is always Excel Workbook, but you can go down the list and choose comma-separated values, which is CSV. When you save, it prompts you with this warning, but you can say continue, and the sheet is stored as a CSV file. I can then go to RStudio and write xx = read.csv(file.choose()).

What does this mean? When you do this, it allows you to choose a particular file. If I press enter, it prompts me with this dialog; this is the latest workbook that we saved, this is the CSV file, and I can open it and it gets chosen.
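Since file.choose() opens a dialog, the sketch below writes a small CSV to a temporary file instead and reads that. The file contents are hypothetical. It also illustrates, as an assumption about what happened in the demonstration, how read.csv treats a file without a header row: by default the first row is used for the column names, and `header = FALSE` keeps every row as data.

```r
# Write a small headerless CSV to a temporary file
# (a stand-in for the file exported from the spreadsheet)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("2,4", "3,5", "4,7"), csv_path)

# Interactively you would write: xx <- read.csv(file.choose())
xx <- read.csv(csv_path)                         # first row becomes the column names
xx_nohead <- read.csv(csv_path, header = FALSE)  # all rows kept; columns named V1, V2

names(xx)          # names derived from the first row
nrow(xx)           # one row fewer than the file
names(xx_nohead)
nrow(xx_nohead)
```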

If I type xx here, you see the values that were read. Even though we did not explicitly enter column names in the CSV, column names have already been assigned, and row numbers appear on the left as well. So this is just another example of how you can bring in values. This is the most widely used way to import data, particularly when you have big data. Now let us get down to some examples of finding frequency, mean, median and so on.

Back in RStudio, let us say I again enter another set of values. Let us say xx = c of these numbers; I have entered this random array. I can find the length of the vector by writing length(xx). It has 8 entries. I can find min(xx), which is 1, max(xx), which is 6, mean(xx), which is 3.25, and median(xx), which gives me a value of 3. So these are widely useful ways of getting the descriptive statistics from your vector. One more thing I wanted to show: if you do table(xx), you get the frequency distribution.

So how many entries have a value of 1? You have only one entry of 1. It says that there are two entries of 2; let us see. These are the numbers I have entered: there are two 2s, and that is why the frequency count shows 2 under the value 2. Similarly there are two 3s, and one each of the remaining values. So this is a very useful way of getting the statistics from these examples.
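The descriptive statistics quoted above can be reproduced with the vector below. The actual entries typed in the lecture are not recorded in the transcript; these eight values are an assumption, chosen so that they give the reported length, minimum, maximum, mean, median and frequency counts.

```r
# Hypothetical vector consistent with the values reported in the lecture
xx <- c(1, 2, 3, 6, 2, 3, 4, 5)

length(xx)   # 8 entries
min(xx)      # 1
max(xx)      # 6
mean(xx)     # 3.25
median(xx)   # 3

# Frequency of each distinct value
freq <- table(xx)
freq
```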

I have shown you how to calculate the mean, median, minimum and maximum. You can also compute the variance and the standard deviation. You can use the function var to find the variance of this distribution and sd to find the standard deviation of this vector. In this particular example, where you have, counting them off, 11 entries, the variance comes out to be 32 and the standard deviation is around 5.6. So you can find the standard deviation and square it to get the variance, which gives you the same information, or you can take the square root of the variance to get the standard deviation.
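The relationship between var and sd can be checked directly. The 11-entry vector from the slide is not available in the transcript, so this sketch uses a hypothetical 8-entry vector instead; note that both functions use the sample (n - 1) denominator.

```r
xx <- c(1, 2, 3, 6, 2, 3, 4, 5)   # hypothetical data, not the slide's vector

v <- var(xx)   # sample variance, about 2.79
s <- sd(xx)    # sample standard deviation, about 1.67

# The two carry the same information
all.equal(s^2, v)
all.equal(sqrt(v), s)

# Same variance computed by hand with the n - 1 denominator
v_manual <- sum((xx - mean(xx))^2) / (length(xx) - 1)
all.equal(v_manual, v)
```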

Some more examples; I will show you how you can use these functions. Again let us go back to RStudio. Let us say I want to sort a vector. yy has 14 entries; length gives me a value of 14. I could use the earlier functions on it as well, but what I can also do is sort(yy). Then you see that it is sorted in ascending order. If you want to sort the same vector in decreasing order, the other way round, you can write sort(yy, decreasing = TRUE). In this case you get the reverse order, from the largest number to the smallest.

You can also ask for some statistics, for example how many numbers in this vector are greater than 2. So I can write xx with the condition greater than 2. There are 3, 6, 3, 4, 5: these are the 5 numbers which are greater than 2.

I can also ask at which position xx is equal to 5; it returns a value of 8. Let us see where I have the value of 5, counting off the positions. So for xx equal to 5 it returns a position of 8. Another thing I can do is summary(yy). What it gives me is the minimum, the first quartile, the median, the third quartile and the maximum. Instead of summary you can also use the function called quantile. This gives the same thing, but in terms of exact percentiles: again you see the minimum is 1, your first quartile is 2, as is showing up here, your median, which is the 50th percentile, has a value of 4, your 75th percentile is this, and the maximum is 7.
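The sorting, filtering and quantile calls above can be sketched as follows. Both vectors are hypothetical: `xx` is chosen so that the values greater than 2 and the position of 5 match what is reported, and `yy` is a 14-entry vector chosen to give the reported minimum, first quartile, median and maximum. The lecture does not show the exact filtering expression, so `xx[xx > 2]` and `which(xx == 5)` are assumptions about what was typed.

```r
xx <- c(1, 2, 3, 6, 2, 3, 4, 5)                      # hypothetical
yy <- c(4, 1, 6, 2, 7, 2, 5, 3, 1, 4, 6, 2, 7, 5)    # hypothetical, 14 entries

sort(yy)                      # ascending order
sort(yy, decreasing = TRUE)   # descending order

above_two <- xx[xx > 2]       # values greater than 2: 3 6 3 4 5
n_above   <- sum(xx > 2)      # how many there are: 5
pos_five  <- which(xx == 5)   # position of the value 5: 8

summary(yy)                   # min, quartiles, median, mean, max
q <- quantile(yy)             # 0%, 25%, 50%, 75%, 100% percentiles
q
```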

So with summary you have the median and the mean both reported, whereas when you write quantile you get only the actual percentiles. These are the functions I wanted to show; they are useful. Let us take another example. Imagine you are doing a measurement where your aim is to correlate a cell's shape with its motion: you have a cell which is moving, and you want to see whether, when it is moving, it is elongated or round.

This is the data, where you have two metrics for characterizing the cell phenotype. One is whether the cell is circular or spindle-shaped. A circular cell would look something like this, and this is your more spindle-shaped cell. So this is spindle and this is circular, and of course you have a nucleus at the center. You then look at the trajectories of these cells in the 2D plane, and based on the distance a cell moves, each of these shapes has two populations, migratory and non-migratory. The same holds here: you have migratory and non-migratory.

So this is just an example. In this particular example what we have is the distribution for, let us say, 22 such cells. For each cell you have the shape, which is either circular or spindle, and the phenotype, which you characterize as non-migratory or migratory.

So essentially both these metrics are categorical data. From this table you want to find out what the distribution of this data is. I can first generate the table of the cell type.

This is just the way to import the data into R. You can then use the table command to get the counts in the different categories. What you see is that, of the cells that are circular in shape, 2 are migratory and 9 are non-migratory. In the case of spindle-shaped cells, 8 are migratory and 2 are non-migratory. So this data conveys the point that a greater proportion of the cells which migrate are spindle-shaped, and a greater proportion of the cells which are non-migratory are circular in shape.
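A cross-tabulation like the one described can be sketched as below. The per-cell records are not in the transcript, so the two vectors are reconstructed from the four counts quoted above (which add up to 21 cells); the variable names are assumptions.

```r
# Rebuild per-cell records from the quoted counts (hypothetical ordering)
shape <- c(rep("circular", 11), rep("spindle", 10))
phenotype <- c(rep("migratory", 2), rep("non-migratory", 9),   # circular cells
               rep("migratory", 8), rep("non-migratory", 2))   # spindle cells

# Two-way table of counts: rows are shapes, columns are phenotypes
cells <- table(shape, phenotype)
cells

# Column proportions: within each phenotype, the share of each shape
props <- prop.table(cells, margin = 2)
round(props, 2)
```

With these counts, 80% of the migratory cells are spindle-shaped and about 82% of the non-migratory cells are circular, which is the pattern described above.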

This is a way of plotting this data. You can do a bar plot of this table and you get this particular distribution. Let us go back to R and use the same data, which is yy. I can write, let us say, barplot(yy), and it complains about figure margins. So the plot is not coming up here, I do not understand why. You can also do hist(yy), and the same problem comes up. We will see; these functions, boxplot and barplot, should work. Let us reduce the size a little.

Now you see it works. It was failing because the plot requires a certain amount of space. hist(yy) gives you this distribution. It conveys the message that you have a peak here, another peak here, and then one corner value here.

Let us see again if I draw the bar plot here. This is the same data plotted as a bar plot. We can also do the same thing as a box plot.

So this is the box plot, with its bars. If I go back to the presentation now: this is what we had plotted, the bar plot of the table, and you see that among the cells which are circular in shape you have a greater portion which are non-migratory and a lesser portion which are migratory.

I can also represent this data by plotting the bars side by side. This is how it would look. If I add this argument, beside = TRUE, then I generate this particular plot. So among the migratory cells, these are the circular ones, which are very few, and these are the spindle-shaped ones, which are many.
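The stacked and side-by-side versions of the bar plot can be sketched as follows. The count matrix is rebuilt from the numbers quoted in the lecture; the legend argument is an addition for readability and is not taken from the slides.

```r
# Counts from the lecture: rows are shapes, columns are phenotypes
cells <- matrix(c(2, 8, 9, 2), nrow = 2,
                dimnames = list(shape = c("circular", "spindle"),
                                phenotype = c("migratory", "non-migratory")))

# Default: one stacked bar per phenotype
barplot(cells, legend.text = TRUE)

# beside = TRUE: one group of bars per phenotype, shapes side by side
mids <- barplot(cells, beside = TRUE, legend.text = TRUE)
dim(mids)   # bar midpoints, one per shape and phenotype
```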

I can use a similar approach for general plotting in R. Let us say you have the trajectory of a cell which is moving in the xy plane, and you have imported the data. You have read this data, and it is represented as x and y. What you see here is that you first import the data and then plot data[, 2] against data[, 3]. Why are you choosing the second and the third column? The reason is that when you import, a column gets inserted which holds the row number. So let us go back to R.

So let us again read a file: zz = read.csv(file.choose()). I again read my workbook 3 and look at zz. You see that there are three columns. This first one is the default entry, which we have no control over. That is why you neglect this column and plot X2 and X4. And this is what we have done in the PowerPoint file, where we have plotted data[, 2] against data[, 3], and this is the corresponding plot in the xy plane.

What I can also do is give a title to this plot, "Migration trajectory", which appears here. I can set my labels: xlab is "x coordinate" and ylab is "y coordinate", which you enter here.

You can also play with the color of the trajectory. What you see here is that col = "red" sets the trajectory color to red. And lastly you can also put xlim; lim is for limit.

If you want to plot the data within a certain range, you have control over the x range and the y range. With that I think you have a good enough handle on how to do this. So we will stop here.
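The plotting options just listed can be combined in one call, as sketched below. The trajectory is simulated as a random walk because the imported file is not available; the data frame `track`, its column names, the axis limits and the choice of `type = "l"` (a connected line) are all assumptions made for illustration.

```r
# Hypothetical trajectory: a random walk standing in for imported tracking data
set.seed(1)
track <- data.frame(id = 1:50,
                    x = cumsum(rnorm(50)),
                    y = cumsum(rnorm(50)))

# Column 1 is the row identifier, so plot column 2 against column 3
plot(track[, 2], track[, 3],
     type = "l",                      # connect the positions with a line
     main = "Migration trajectory",   # plot title
     xlab = "x coordinate",           # axis labels
     ylab = "y coordinate",
     col  = "red",                    # trajectory color
     xlim = range(track[, 2]),        # axis limits; replace with any c(min, max)
     ylim = range(track[, 3]))
```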

Let me just go back to RStudio once more. You can have these plots as well.

We had plotted the histogram of yy. If you write hist(yy, freq = FALSE), it converts the counts to density, or relative frequency, so instead of absolute counts you get values on a relative scale. You can see that here. I can also pass other arguments for yy, change the color of the bars, and so on and so forth.
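The difference between the two histogram modes can be sketched as below. The vector `yy` is hypothetical (the lecture's 14 values are not recorded), and the color is an arbitrary choice. With `freq = FALSE` the bar heights are densities, so the bar areas add up to 1.

```r
yy <- c(4, 1, 6, 2, 7, 2, 5, 3, 1, 4, 6, 2, 7, 5)   # hypothetical 14 entries

h_count <- hist(yy)                                   # y-axis shows absolute counts
h_dens  <- hist(yy, freq = FALSE, col = "lightblue")  # y-axis shows density

sum(h_count$counts)                                   # all 14 entries are binned
total_area <- sum(h_dens$density * diff(h_dens$breaks))
total_area                                            # bar areas sum to 1
```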

With that, I hope you have been convinced of the power of R, which is open source: you can freely download it, install it and use it for your own purposes. You should get into the habit of calculating the mean, the standard deviation and all these descriptive statistics, and even of generating these plots, in R. With that I thank you for your attention, and we will meet again in the next class.