Welcome to Jamovi. Jamovi is an application for analyzing data and helping you make sense of the information around you. It's also a compelling alternative to expensive proprietary programs like SPSS, SAS, and Stata. Jamovi is based on the open-source programming language R, which is designed for working with data, and that means jamovi is free, it's open, and it's extraordinarily friendly. In fact, once you start using jamovi, you'll simply be embarrassed that you didn't start sooner.

Now, one thing about this course: it's a tools course, teaching you how to use software. It works best if you already know a little about data, because I'm not going to discuss the principles at great length. On the other hand, it's easier to learn data analysis when you have a good tool like jamovi. So either way works fine; just get started and see what you can do. Go to jamovi.org and download the program, then go to datalab.cc/tools/jamovi to download all the course files so you can follow along with what I'm doing, and start working on your own data too. And with that, let's go.

The first thing you need to do to work with jamovi is to get it installed on your computer. It's a desktop application, so you'll need either Windows, Mac, or Linux, and you install it and run it locally. Go to jamovi.org and simply click on the Download link. The beautiful thing is that it's free and open source. Click the system that you need (Windows, macOS, or Linux), download the file, double-click on it, and you'll be ready to start working in jamovi in just a few seconds.

When you open jamovi, this is the window you're going to see. If you're used to other applications like SPSS or SAS, what's a little different here is that it's a single window; that actually makes it really easy to organize and navigate. Let me show you the different parts. Over here on the left, where we have rows and columns arranged in a grid, is where your data goes: it's your data window. Right now we have three variables, just called A, B, and C; you'll of course replace those with something else when you put your data in. It's one column per variable, one row per case. Over here on the right is the output window, where we'll see the results of our analyses.

Up at the top we have two tabs. The first one, open by default, is Analyses, which gives you options for exploration. If I click on that and come to Descriptives, you'll see it opens up the menu that lets you choose what you want to do, and you get a blank version of the results over here on the right. By the way, you'll notice that the menus in jamovi are designed to resemble those of SPSS. It's not based on SPSS; jamovi actually runs on R, but it's designed to make it easy for people who are used to SPSS to migrate over. You can close the options by clicking on this arrow. We also have versions of t-tests, the analysis of variance (ANOVA), regression, frequencies, and factor analysis, and I'll go over each of these procedures in separate videos.
You also have a Data tab right here that allows you to paste in your information and do a setup of the variables, where you define what's in them: the measure type (jamovi calls quantitative variables "continuous", though you might also call them measured or scaled; the others are ordinal and nominal) and the data type (integer, decimal, or text). You can give levels, you can give names, you can compute new variables, you can add and delete variables (columns), you can filter your cases, and you can add or delete rows (cases), and I can hide this panel. Over at the far left, the three horizontal lines give you the menu for New, Open, Save, and so on, as well as your recent files.

One more that's really important is way over at the right: click on the three dots and you get a number of options, like zooming and the number format. One interesting one is the plot theme. When jamovi makes graphs, if you really want them to look like SPSS, you can get those base charts; I don't like them. I actually prefer "hadley", a reference to Hadley Wickham, the person who developed ggplot2, a very common graphics package for R users; "minimal" is also really nice. I'll keep it at the default for now. And then there's syntax mode: if you click on this, you'll actually be able to see the R syntax, using the jmv package, that duplicates any of these features, any of these analyses, in R. That's an enormous benefit, especially if you're familiar with SPSS but you're trying to learn R. So that's a general orientation to jamovi. In future videos I'll show you how each of these options works, how each of the analyses functions, and how to read some of the output you get from jamovi.

The fastest way to get started with jamovi is to explore its example data sets. To do that, come up to the little menu (the three horizontal lines) and come down to Open. Now, "This PC" means files installed on your computer, but under Examples there are four sample data sets.

The first one consists of Big 5 data: the big five personality characteristics. I'll click on that and it opens up. It has five variables on the personality characteristics (neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness), rated on 1-to-5 scales. This is a great data set for doing correlations, for doing regression, and for doing data exploration.

Another good data set, again under Open and Examples, is Tooth Growth. Now, this is kind of a funny data set, because it has to do with the length of teeth in guinea pigs. It has a single outcome variable called "len", that's for length, and then "supp" is for the supplement. This is an experiment where they gave the guinea pigs one of two different supplements to aid in their tooth growth: it's either VC, for vitamin C, or OJ, for orange juice. On top of that, they manipulated the dosage: either 500, 1000, or 2000. This allows you to look at the combined effect of these two factors on the growth of teeth in guinea pigs. It's a good way to get started with the analysis of variance.

After that, there is Bugs. This is a peculiar data set, because it has to do with how people react to insects. We have a subject number, gender (male or female), region, and education, and then these four mysterious variables: the level of disgustingness and the level of frighteningness, either low or high on both of them.
What this is, is a good example for the analysis of variance with repeated measures, or within-subjects variation, because everybody in this experiment gave responses to all four possible categories.

And the final data set, again under Examples, is the classic Anderson's Iris data, also known as Fisher's Iris data. What this is, is the measurement of the sepal length, sepal width, petal length, and petal width of three species of irises. So you have four quantitative (or continuous) variables and one nominal (or categorical) variable. This is a classic data set on how quantitative measurements can be used to place observations into categories. It's also a great way to practice breaking down explorations by groups, because these species of iris differ pretty substantially on some of these measurements. So use any of these four data sets as a way to start exploring how you can work with jamovi. You'll see how quickly and easily you can get into the data, explore it, start doing some analysis, and start getting some meaning out of the data you have in jamovi.

One of the things I truly love about jamovi is how easy it makes it to collaborate with other people. And the reason it's easy is that when you share things with people, you're sharing a single file. So for example, here's a potential analysis I might do when I'm working with somebody. I've got a data set here, I've got the variables that came with the data set, and I've got a calculated variable right here; you can see the formula. I've got a descriptive analysis (let me move this over a little bit): I've got descriptive statistics, I've got tables, I've got charts and box plots, and all of this is saved in a single file that I've called "sharing files.omv", where .omv is the jamovi file extension. When I work with somebody, I do the analysis, then I send them that single file and I tell them: download jamovi, double-click on this, and you'll open up right to where we were. They'll have all of the data, all of the calculations, all of the transformations, all of the analyses. They'll have everything in one self-contained unit, and that is one of the easiest ways to share things. You can email the file to them, or you can put it in a shared folder like Google Drive or Dropbox or Box. However you do it, the unitary nature of the files, and the fact that they contain all the work you've done, makes it extremely easy to collaborate and share work.

So jamovi makes it easy to share files by simply emailing, or putting in a shared folder, the file you've been working on, because it has the data, it has the transformations, it has the analyses. There is, however, a really interesting, more sophisticated alternative to that, called the Open Science Framework. If you go to osf.io, I can show you a little bit about how it works and how well jamovi is adapted to it. The Open Science Framework is a free service that really facilitates collaboration and reproducibility in research. You can create a free account, and one of the important things here is that it allows you to integrate a number of other services. So for instance, maybe you use Dropbox or Google Drive or Box: it can bring files in from those. Or if you're the programmer type, you might use GitHub or Amazon Web Services; again, those are integrated. Now, I personally like to upload files one by one into OSF.
That way I have a better sense of what's there, and I don't put something in a folder by accident. But this is a way to share your work, both with specific other people for collaboration and to share your results with the world.

I want to show you how this works by having you go to an example. If you go to the bit.ly shortcut address shown here, you'll see the page I created on the Open Science Framework for these jamovi tutorials; it's a shortcut, and it takes you to this page. In OSF, I've set up read-only access to my files here, so anybody can look at them: they can see what's there, but they can't change it. This is a great way of collaborating, and right now I've got the files from the first chapter of the jamovi course, and one of them is "sharing files.omv". Now, you may remember what this looks like in jamovi itself: we've got the iris example data, and I've got some charts broken down by categories, and some statistics. Let me show you what those look like in OSF. If you click on this file right here, it's going to open up in your browser. I've got it open already, so I'm just going to move over. What you have here is an exactly reproduced version of the results. We have the same table (it looks exactly the same), and if we push down a little bit, we have the same charts; it's all there. And in fact, even though people can't modify it, they can download it, open it themselves in jamovi, and start working with the data to see what they can get out of it. So this is a wonderful way of collaborating and sharing results with people, and jamovi is one of the few programs whose files OSF opens natively, which makes it so easy to share. One more thing I want to point out: because this is just a web page, not only does it work in a browser on a Mac, Windows, or a Chromebook, it even works on smartphones. You can open it on an iPhone or an Android or a tablet, and you'll get the same results, and you can explore and share the results that way. So if you want a step up in your collaboration, your sharing, and really a way of disseminating your research, OSF, the Open Science Framework, can be an excellent alternative that works extremely well with jamovi.

The base functions that come with jamovi are really powerful, and truthfully are probably enough to get you all the way through whatever it is you need to do. On the other hand, jamovi is based on R, the statistical programming language, which has about 10,000 packages available that give extra functionality to R. It turns out you can do a similar thing with jamovi through what are called modules. Modules are to jamovi what packages are to R or to Python: they give you extra functionality.

So for example, if you open up "jamovi modules.omv", you'll see that I have the standard correlation matrix right here. You can click on this, and then you'll see the commands, which come from under the Regression menu. I'm looking at the sepal length and the sepal width of the iris data set, and we've got the correlations here, the p-value, and a scatterplot down in this one corner. It's actually built for doing a matrix where you have many variables, but we get some cool stuff: we get the regression line, we get the standard errors around it, we get density plots for the two variables, and we get the correlation written up in the small corners.
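By the way, this correlation matrix is the analysis that syntax mode reveals as a single jmv function call. Here's a minimal sketch of scripting it yourself in R; the argument names follow my reading of the jmv documentation, so treat it as a starting point and check ?corrMatrix on your own install rather than taking it as the definitive call.

```r
# Correlation matrix with scatterplots, scripted with the jmv package (a sketch).
library(jmv)

corrMatrix(
  data = iris,                                # the iris example data
  vars = c("Sepal.Length", "Sepal.Width"),    # sepal length and width
  plots = TRUE,                               # scatterplot matrix with fit lines
  plotDens = TRUE,                            # density plot for each variable
  plotStats = TRUE                            # r and p written in the corners
)
```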
But there are modules that allow you to do more than that. If you come up to Modules and down to "jamovi library", what you'll find is that there are about a dozen modules currently available, and the jamovi developers are making a strong push for R developers to convert their R packages into jamovi modules. They have instructions on how to do that on their website, and once a module exists you can sideload it, which means you download it separately and then import it into jamovi, but it's best if it's available through the jamovi library. We've got one on power analysis; general analyses for linear models; model selection; meta-analysis. scatr is a great one for doing plots; TOSTER does "two one-sided tests" of equivalence; Base R replicates the statistical analyses available in R's base packages; we have mediation and moderation; Walrus is a collection of robust statistics developed by Rand Wilcox; we have survival analysis, called Death Watch; and there's even a little video game here called jamovi Arcade. All you have to do is click Install and it'll put it in there, and then it'll be listed here under Installed. Everybody gets the first one, jamovi itself; that's the default. I've added scatr, and let me show you what it looks like, and why you would want to use one of these extra packages.

This is the default version of the scatterplot. I'm going to close the options by clicking on the side, but here's the one I made using the scatr module, and you can tell a lot of things: it's a lot prettier, it's got more colors going on in it, but more importantly it allows me to break down the scatterplot by the species of the iris. And there's a really important thing here. Let's go back up here: this is basically a flat, essentially zero regression line through the data. But when we break it down by species (let me scoot this over a little bit so you can see all of it), we see that they all have a strong uphill relationship, and the red one is really strong. Then we have the density plots for each variable, broken down by the categories, on the sides. And if I click on this again, it brings up the menu from the scatr module; by the way, it goes under Exploration, to scatr, and it has two functions, a scatterplot and a Pareto chart. I'm doing the scatterplot right here, and it's a great way to get some additional insight. So that's the role of modules in general: to bring extra functionality, extra things you can do with your data, into jamovi. Currently there are about a dozen, but as jamovi grows and develops, more people who create packages for R will learn to adapt them for jamovi and give it even more power and even more possibilities.

You may be familiar with the statistical programming language R: a free, open-source language that is specifically developed for working with data, and a favorite tool of statisticians, data analysts, and data scientists everywhere. One of the great things about R is how powerful it is; you really can do anything in it. One thing you can't do, however, is use dropdown menus: you have to type out lines of code, and that makes it intimidating for a lot of people. One of the great things about jamovi, which is based on R, is that it helps provide a bridge between people used to menu-driven applications like SPSS and command-line-driven applications like R. Let me show you the connection between the two.
I have the same analysis in both a jamovi file, "jmv package.omv", and an R script, "jmv package.R". Now, you're only going to be able to open the second one if you have R installed on your computer, and hopefully RStudio too. But let me show you how jamovi connects these two applications.

Here's the analysis open in jamovi. It's the iris data, and what I have here on the right is some descriptive statistics. If you click on that, you get the commands: I have the menu (it looks a lot like an SPSS menu), I put the variables in there, I asked for certain statistics, I asked for certain plots, and if we scroll down, here is the basic analysis: density plots and box plots for each of the variables. It's really easy to do.

If I want to do the same thing in R, it's also really easy, using the jmv package, but to get there you first need to enable something called syntax mode. Come up to the top right and click on the three dots to open the menu, and right down here under Results is syntax mode. If you click on that, the output looks a little bit different: we now have a monospaced font, we have some headers, and this right here is an R command. You can right-click (or two-finger click) on it, go to Syntax, and choose Copy.

From there, you go over to the R version of this file, which I have open. I've got some header information that I put on all of my files. You need to install the jmv package (jmv is short for jamovi), and once you've got it installed, you load it with library(). I'm going to run that command, and down at the bottom it just tells us that jmv's ANOVA command now takes precedence over the one in the stats package. We also need to load the datasets package, because that's where the iris data is.

Now, a funny thing here: the jmv commands need to know where the data is, and in R you say data = and then give the name of the object in memory, which in this case is also called data. I know it looks redundant, but it makes things very flexible: as long as you come back here and save your data set into an object called data, which I'm going to do right here, everything works. You can see it's loaded over here: 150 observations with five variables. I'm going to take a look at the first few lines with the head() command, just to make sure it came in right. That looks exactly like it should.

Then we come to this command from jamovi that I've already pasted in. It's a single statement: it tells R we're using the jmv package and the descriptives function, here are the arguments that go into it, and then a few extra options. This is a single command, and when I run it, what I get are the same tables and charts I had in the jamovi output. Let me make this a little bigger over here so you can see the tables better: this is the exact same table that jamovi produced, now in a monospaced font, but set up exactly the same way. Over here on the right, under Plots, we have the same plots, now shown one at a time (this is actually the last one), and you can scroll through them and see the other charts jamovi produces. It's exactly the same as we had in the application.

And so the purpose of this package is twofold. First, maybe you're more comfortable with R: you can reproduce in R all of the commands you ran in jamovi.
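Put together, the script looks something like this. It's a sketch of the workflow from this video, with the descriptives arguments written the way I'd expect from the jmv documentation, so double-check them against ?descriptives on your own install.

```r
# Reproducing a jamovi analysis in R with the jmv package (a sketch).
install.packages("jmv")   # one-time install
library(jmv)              # load jmv; it notes its anova() takes precedence
library(datasets)         # the iris data lives here

data <- iris              # jmv calls point at an object via data =
head(data)                # sanity check: do the first few rows look right?

# Essentially what jamovi's syntax mode hands you for this analysis:
descriptives(
  data = data,
  vars = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  dens = TRUE,            # density plots for each variable
  box = TRUE              # box plots for each variable
)
```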
Second, it provides a bridge if you're familiar with SPSS and other menu-driven applications: jamovi makes it possible to set up your analysis with menus, then copy the syntax and paste it into R, and that way you can become more accustomed to R and learn to use it. In fact, because the jmv package contains pretty much everything you need to get through, for instance, an introductory statistics course, you only have to learn one package, and that makes it extremely efficient and extremely user-friendly when it comes to learning a powerful language like R.

Our next chapter is about wrangling data in jamovi. If you're not familiar with the term, you may wonder what I mean when I say wrangling, because I'm not talking about the kind of wrangling you do out on the range. Instead, I'm talking about the kind where you're taking data that's in a mess, in a jumble, and you're organizing it, getting it so you can analyze it. It goes by a few other names as well: sometimes it's called data munging, sometimes it's called data scrubbing, and it can also be called data cleaning. Really, it just refers to the general process of preparing your data for analysis. This is the nitty-gritty of data work: the part that's not so glamorous but absolutely critical to the success of a project. It also tends to be the most time-consuming part; in fact, one rule of thumb of any data project is that about 80% of your time is spent preparing the data for analysis. To help with that, we're going to talk about entering data into jamovi; importing data; defining variables, so you know what you're dealing with; computing new variables, to get things into the form you need; and filtering cases, so you can focus on subgroups and get more detailed analysis. As we go through all of this, you'll see the special functions jamovi has to help you prepare and work with your data, so you can discover the joy of clean data.

The sample data in jamovi is wonderful; it's a great way to learn the program and see how things go. On the other hand, you probably came here to analyze your own data. Theoretically, the simplest way to do that is to enter the data directly into jamovi, because you can do that. We've got something that looks like a spreadsheet over here, and when you first open it, it's got these three blank variables called A, B, and C, and you can put your own stuff in there. So for instance, you can put in an ID number, and in fact it's a good idea to change the names of these so you know what they are. I'm just going to change this one to ID, then I'll change the next one to First Name, and the third one to Age, and close this window. You want an ID number because you need to be able to find a particular case, and you need to be able to return things to the order they were in. So I'm going to put the number 1 right there, for First Name let's put Alex, and let's give an age of 34. Okay: if we were in Excel or Google Sheets, we could hit Return now and it would take us back to the first column. jamovi doesn't do that; it goes straight down, so we need to move back manually. I'm going to hit the left arrow to go back. We'll do the second one, Ava, and we'll say that she's 41, and then a third one, Fecri, and we'll say that he's 18. Anyway, you can do it this way, but it's really tedious.
It's a laborious process, and I really don't recommend entering data directly into jamovi, the same way I would never recommend entering data directly into SPSS or R or SAS or anything else like that. It's possible, but it's not a good way. Instead, a much better way is to enter your data in a spreadsheet. Google Sheets is good because it's online and several people can work on the same sheet at once. Then you can import that file, which is what I'm going to show you in the next video: the much better way to get your data into jamovi.

So, while it's possible to enter data manually into jamovi, it's an awkward process, and it's much better to save the data in a spreadsheet, from Google Sheets, Excel, or some other program, and then import it directly. Fortunately, that's easy to do in jamovi. Here I have a blank window, and I'm going to show you how to import several different spreadsheets and other formats. First, I want you to see the files themselves. What I have are four files with the same data in them, but in different formats.

This first one is an .xlsx file: it's Excel, or really Excel's Open XML format. You can see we've got an ID number, a recorded date, and a response ID, then we have five questions, Q1 through Q5, and a zip code at the end. It's a very simple data set. Now, I have the same information saved as a CSV file, which stands for comma-separated values. This is really the most common version of a spreadsheet, one that basically anything can read; it's shown a little bit differently here on my Mac when I do a Quick Look, but it's the same data. You can also save it as a text file, where tabs separate the values; I actually did this in Excel, just saved it as text, and you can see that, although it's arranged a little differently, it's the same data. And then finally, I have the same data in an SPSS .sav file. Now, depending on what you have on your computer, clicking on this might show you variable definitions in a preview; right now, it's not showing me anything.

But here's an interesting thing: jamovi can import most of these, but not all of them. Let me go back to jamovi. What I do is come over to the menu here with the three horizontal lines and hit Open, and you'll see that it's able to open several different kinds of files: jamovi files, of course, then CSV and text files, SPSS files, R data files, SAS files, and JASP files (JASP is another program, created by several of the same developers as jamovi, that looks very similar but operates a little differently). So those are the options. Now let's come up here to Browse, and I'm going to go to my desktop; it's right there. What you see now is that the .xlsx file is grayed out: I can't open a native Excel file. But that's okay, because if you're in Excel, you just save it as CSV and you're good to go.

So let's start with the CSV file. This is the most common kind of file you'd be importing. All I've got to do is double-click on it, and it opens up in jamovi, and there we have our data. Let me move this over a little bit. Perfect: it's formatted, and you can see that it even chose data types to match pretty much what's in there. Now, I'm going to show you how to do data definitions in later videos, so I'm not going to worry about that right now. But this is the CSV file.
Now, let me show you what it's like when we open the text file. Again, I'll come to Open, and because I have this folder (my desktop) open, I can just double-click to import the text file. It takes just a moment, and here you see the data looks exactly the same; there's no difference.

And then finally, we'll do the SPSS file, and you'll see with this one there is a small difference. I'm going to come here and import the .sav file, that's the SPSS data file. I click on it, and what's different here is that in SPSS you can have labels on your values. So for instance, look at the variables Q1 through Q5: in the other files they were just the numbers 1 through 5; here they have labels on them, although you should know that the numbers are still underneath. I'm going to double-click on this, and when we open it up you can see "Strongly disagree" is a 1, "Disagree" is a 2, and in fact, in jamovi you can still treat these as numeric variables, the same way you would in SPSS with a variable that has value labels. I'm going to close that now. There is one other thing: SPSS really didn't like having the date and the time in the same field, so it got split into two, but that's a trivial thing. So you can see that if you have your data in a spreadsheet, or you've got it in SPSS or SAS or some other format, it's a very quick and easy process to import it into jamovi: it reads the format, it reads the labels, and then you're good to go.

Your computer can analyze data without knowing what it is. If it has numbers, it'll work with numbers; if it has text, it'll do something with that. But for the results to mean something to you as the analyst, and maybe to your client, you have to know what things mean. And that means you need to define your variables: give them their variable types, as well as put labels on them that are going to help you as you try to make sense of things.

So for instance, here's a small data set based on the one I imported earlier. It begins with an ID number; that's helpful because you can go back and find particular cases, so you always want to start with that. Then I have three questions, Q1, Q2, Q3, that are on 1-to-5 rating scales. And then I have a question at the end that I've called "subscribed", with yes/no answers coded Y and N. Now, what we need to do is say what type each of these variables is, as well as give them labels as necessary.

Now, ID: we don't really need to worry much about this one, but by default, if jamovi sees the numbers 1 through however-many, it's going to assume that it's a continuous scale. We want to change the type, and you can do that either by double-clicking on the name of the variable here or by going up to Data and clicking Setup; either one will work. So, the very first one is ID. Technically, ID numbers are not continuous or quantitative; they're nominal, one per person. So I'm going to set it to nominal, keep the data type as integer (that's fine), and we'll leave it like that. You see now that its little icon has switched from the ruler to the three circles, which indicate different categories or buckets. For an ID, by the way, that change from continuous to nominal doesn't make much practical difference.

I'm going to go to the next variable, Q1, and this is where I actually want to give it a real name. I might say something like "like website": do they like your website? And now that shows up here. Note that this isn't the label; I've just changed the name of the variable.
I can also give it a more thorough description: "Does the user like the company's website?" Now, that label is just for our own use; it doesn't really show up anywhere else. But what I'm going to do here is change the variable type and change the names of the levels. It turns out that when you have a small number of categories, and 1, 2, 3, 4, 5 is a small number, jamovi assumes it's a categorical or nominal variable. The thing is, it doesn't really make a difference if all you're going to do is average or calculate statistics: jamovi will do that on these variables even though they're defined as nominal. But changing a variable to continuous or ordinal will affect the kinds of graphs you can make and the ways you can split the data. So it's not critical if all you're going to do is average, but if you want to do other things with it, it helps to define them.

Now, with a 1-to-5 rating scale there's a debate about what level of measurement it is. Technically, it's ordinal, because higher numbers indicate more agreement or a higher evaluation of something. On the other hand, in every field I've ever seen, people actually take a 1-to-5 rating scale and treat it as though it were continuous or quantitative, so they can average it. So I'm actually going to come here and define this one as continuous. Now, you'll see that the levels went away, because it's treating the values like, say, times: 1 second to 5 seconds doesn't have labels. That's fine; I know what it means.

So I'll go ahead to the next one and name it "like price" (do they like the price of your service?), and this one I'll set as ordinal. Now you see that the levels stayed, and that's important. And the last one here I'll just name "like product", and I'm going to leave it as nominal, but now I'm going to change these levels, or really these labels. So for instance, I click right here, and 1 is usually "Strongly disagree", 2 is "Disagree", 3 might be "Neither", 4 would be "Agree", and 5 might be "Strongly agree". A lot of people call this a Likert scale; it's actually a response scale (Likert scales have more to do with how you choose the questions than with the format in which you respond to them), but call it a Likert scale if you want. It's a 1-to-5 rating scale, and then you can see that these labels all show up down here in the data; that's the really convenient thing. And remember, the numbers are still underneath, so you can still do numerical operations on these variables.

The last one I want to show you is the text variable at the end. You see it says "nominal text", because I actually typed in the letters Y and N. You can do the same thing with these: I can click on the Y and label it "Subscribed", and the N "Not subscribed". And so now you have new names for the variables: I changed the names, I changed the level of measurement (the type of the variable) for some of them, say from continuous or quantitative to ordinal, or to nominal or categorical, and then I changed labels. This is very important in terms of preparing the data in jamovi, because it's going to make it much easier for you to interpret the analysis, and in turn to make sense of it, especially when you're collaborating with a colleague or potentially working for a client.
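If you ever move this kind of setup over to R, value labels correspond to factor labels: the codes stay underneath and the labels are for reading. A minimal sketch in base R, with a hypothetical variable standing in for the demo data:

```r
# Value labels in R are factor labels; the 1-5 codes remain recoverable.
like_product <- c(1, 4, 5, 2, 3)   # hypothetical responses

like_product_f <- factor(
  like_product,
  levels = 1:5,
  labels = c("Strongly disagree", "Disagree", "Neither",
             "Agree", "Strongly agree")
)

table(like_product_f)        # frequencies, reported by label
as.numeric(like_product_f)   # ...and the numeric codes are still underneath
```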
One of the most common transformations on data is averaging several variables. That way you can get, for instance, a scale score; it's also a great way of helping balance out the error variance of the individual scores and really getting more generalizable information. This is easy to do in jamovi.

What I'm going to do is take this data set, which has an ID and three rating-scale questions, all on 1-to-5 scales, asking people how much they liked different elements of, for instance, your website. I'm going to click on each of these first. "like website" has a little bit of descriptive text, and it's a continuous or quantitative variable that's just on a 1-to-5 scale. If we go to "like price", you see it's ordinal, and even though it's ordinal, you get to specify levels; right now it just has the 1 to 5, and that'll do. And then "like product": you see here in the data it shows text ("Neither", "Strongly disagree", and so on), and it's coded as nominal. On the other hand, it's important that it's an integer variable, even though it's nominal. And then over here we have levels, where the values 1 through 5 each have labels: 1 is "Strongly disagree", 2 is "Disagree", and so on. The reason this is important is that the numbers are still there underneath those labels, and that's what makes it possible for us to average them.

So let me close this for just a moment, and we're going to create a new variable. I'm going to double-click right here, and I can either enter a new data variable, a new computed variable, or a new transformed variable. We're going to do the middle one, a new computed variable, and it's going to ask for the name of that new variable. I'm simply going to call it "mean", and we can give it a description if we want; I'm going to put "the average of three rating scale variables".

The easiest way to build the formula is from the function window. If you're used to SPSS or to Excel, you know you've got a lot of different choices there; the range in jamovi is much smaller, but they're basically the ones you need. I'm going to scroll down a little bit to MEAN and double-click to put it in the box, and then I need to tell it what variables to include. I come over to the variable list and simply double-click on the first one, and when I do, it puts it up here, and you'll see that it puts it in back ticks: those are the back-leaning apostrophes. That's because there's a space or other non-text character in the name; an underscore or a dash might do the same thing. It's actually a nice reason not to have spaces or other special characters in your variable names, because then you don't have to deal with the back ticks; but jamovi will always add them automatically when needed. Now, when you have a range of variables that are all next to each other, in SPSS you can give the name of the first one, write TO in capitals, and give the name of the last one; in jamovi you need to specify each of them separately. So I'm going to put a comma, then click the second one, another comma, and the third one. Now I've got that, and I can close this, and you can see it automatically fills in with the mean. Even though the three variables are at three different levels of measurement, jamovi knows they all carry numerical information. Now, I said it was important that this last one was specified as integer.
Let me double-click on it again, and you'll see that if I come down here and say it's not integer but text, the mean disappears, even though those levels still remain. But if I set it back to integer, the mean reappears: jamovi is able to treat it as numeric information again. And so that's how you can average several variables in jamovi, which gets you a long way toward getting more reliable scores, averaging out some of the error variance, and building the scale scores you may need for your further analysis.

Sometimes your data comes to you on scales that don't have any inherent meaning, or that may not be familiar. A 1-to-5 or a 1-to-7 agreement scale? That's common, but the numbers have no inherent meaning. And if you're comparing income information from Nepal to Turkey to Mexico, you're going to be dealing with different currencies, and you're going to need a way of comparing relative standing. The easiest way to do that is with z-scores, or standardized scores. Now, if you've had statistics, you know that you simply take the score, subtract the mean of the variable, and divide by the standard deviation. A lot of people show you how to do that manually, and you can set it up in jamovi that way, but there's a much easier way.

Let's do it. We're going to come here, double-click on this empty variable, and choose a new computed variable. I'm going to name it "Z score", and then I'm going to use the function window and scroll down through the statistical functions to the last one on that particular list: Z. I just double-click on that, which brings up the function, and then I need to tell it what I want the z-score of. In this case, I want "mean": that's the average of those three rating-scale questions. I double-click on that, close the window, and there it is. That's all I need to do. A negative z-score indicates that somebody is below the mean, a positive z-score says they're above the mean, and the numbers themselves are in units of standard deviations; so the very first score that's highlighted is 0.4 standard deviations below the mean. Quick and easy; jamovi makes it a cinch, and that helps you get on the way to taking your variables that are on different or arbitrary scales and putting them into something that may be more meaningful and is certainly more comparable from one variable to another.
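For reference, here's what those two computed variables amount to in plain R: jamovi's MEAN() is a row-wise mean and its Z() is the usual standardization. A base-R sketch, with made-up column names standing in for the demo data:

```r
# jamovi's MEAN() and Z() computed variables, sketched in base R.
ratings <- data.frame(               # hypothetical stand-in for the demo data
  like_website = c(4, 5, 2),
  like_price   = c(3, 4, 1),
  like_product = c(5, 5, 2)
)

# MEAN(a, b, c): a row-wise mean across the three rating items
ratings$mean <- rowMeans(ratings[, c("like_website", "like_price",
                                     "like_product")])

# Z(x): (x - mean(x)) / sd(x); scale() computes exactly that
ratings$z_mean <- as.numeric(scale(ratings$mean))

ratings
```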
When you're getting your data ready for analysis, sometimes you need to apply the same transformation to many different scores: reverse coding, logarithmic transformations, or maybe converting variables into ordinal categories. I want to show you an example of using jamovi's transform feature to take z-scores and flag whether they are extreme or not, using the plus-or-minus two standard deviation criterion. I'm going to come over here, click on this blank space, and choose a new transformed variable. Now, what I'm about to show you could be done with a computed variable, but the advantage of doing it with a transformed variable is that it saves the transformation function, and then you can apply it to multiple variables at the same time if you want. The first thing we need to do is define the actual function we're going to use, the transformation function. So I'm going to come right here to Transform and click on it; right now I don't have any transforms.

I'm going to come down here to "Create new transform", which brings up another dialog from the bottom, and now I'm going to label the transformation. Again, this is something that can be applied to multiple variables, so it's not the name of a variable; it's the name of what you're going to do to variables. I'm going to call it "extreme", and for the description, "is score more than two standard deviations from mean". I'll be using the z-score, so I don't have to compute anything else. Then what you have here is a variable suffix. The idea is that if you have many different variables, like Q1, Q2, Q3 for questions one, two, three, and you're doing a logarithmic transformation on all of them, it can name each new variable with a suffix, for instance Q1_log, or log with Q1 in parentheses; there are a lot of different ways to do it. I'm just going to put "extreme" right here.

Then I set up the recode conditions. Right now it's only asking me to replace the value with something else, so I have to click "add recode condition". $source means the variable you're starting with, and you can leave that as it is, because you're going to choose the variable in the next dialog. I'm simply going to say: if $source is greater than 2, then assign the text 'Yes'; you have to put text in single quotes so it knows what it is, otherwise it's going to try to read it as a variable name. Then I add another recode condition: if it's less than negative 2, that is also extreme, so 'Yes' again. And then "else" says: otherwise, just do this; otherwise it's 'No'. Simple. So I'm going to close that, and now I've defined that transform function, and you can see that it's available: there's "extreme" right there.

I'm going to use "extreme", but jamovi wants to know what variable it is I want to transform; that's why there's the question mark there. So I'm going to come right here and choose "Z score", and once I do, you see it fills in immediately. We have the No's, and down here we have a z-score of 2.097; that's greater than 2, so it says Yes, that's extreme. And this is a function I could apply to other variables if they're on the same scale. So it makes it easy to prepare a lot of variables, sometimes using rather complex functions, quickly and in a way that's easy to set up and easy to understand.
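The recode logic itself is just a nested if/else. Here's the same rule in base R, as a self-contained sketch with hypothetical z-scores:

```r
# Flag scores more than two standard deviations from the mean ("extreme").
z_score <- c(0.4, -2.1, 2.097, 0.0, -0.4)   # hypothetical z-scores

extreme <- ifelse(z_score > 2, "Yes",        # beyond +2 SD
           ifelse(z_score < -2, "Yes",       # beyond -2 SD
                  "No"))                     # otherwise, not extreme
extreme   # "No" "Yes" "Yes" "No" "No"
```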
One of the important steps in analysis is the ability to drill down, to focus on specific cases in your data set, and jamovi allows you to do that by filtering cases. To show this, I'm opening a data set I have called "filtering cases", and I'm showing you a little bit of descriptive statistics here: the number of cases, 173, and the minimum and maximum for two of the calculated variables at the end of the data set. To get to filters, come here to the Data tab and click on Filters; the icon looks like the kind of filter you'd pour something through in a kitchen. When you click on it, you get the opportunity to enter the text of a filter. You can click in here, choose from various functions, and choose existing variables by simply clicking on them. So I can enter, for instance, ID < 10, and that will give me the first nine cases, those with ID numbers less than ten (I have a variable over here called ID). I'm going to hit Return, and now what you see is that I have a new column here called "Filter 1". You can't change the names of the filters; they're just numbered 1, 2, and so on.

I have a green check mark for all of the cases that are selected (you see them here) and a red X for all of the cases that are not selected, and their rows are grayed out. Now you can see that I have only nine cases, and we have different statistics: those are calculated only for those particular nine cases. By the way, if you want to close this panel, just hit the up arrow, like that. And there are our nine cases from our first filter.

But you probably want to do more than that, and jamovi allows you to use more than one filter, in combination or trading off with one another, and lets you build more sophisticated filter expressions. Let me come back to Filters, and I'm going to add a new filter by simply pressing the plus. (By the way, what the eye icon does is show or hide the filter column on the left. You see, here's Filter 1; if I hide it, you just don't see it anymore, even though the filter is still operative.) For this new filter I'm going to paste something in, and what I have here is a filter that draws on the calculated "mean" column: I'm selecting all cases with a mean greater than 4. If I double-click on the variable, you'll see that the mean comes from the average of these three variables, even though one of them shows text; but note that if I click on it, you can see there are numbers underneath (I'll come back to how that works in a later filter). So now I have this second filter, and you can see I only have one case selected, because I'm using both of my filters at once. If I only want one filter, I come back to Filter 1 and make it inactive by turning it off. Now I have more cases showing; coincidentally, it's still only nine cases that have means above 4.

I'm going to make another filter. First I'll turn this one off, then I'll make a third filter here, and this time I'm going to paste it in, and what I'm going to use is the Z score variable, the one over here. Now, the thing you need to know about Z score is that because its name has a dash in it, I have to surround it with back ticks; those are on the tilde key, to the left of the 1 on your keyboard. (If you insert the variable by going to fx and simply clicking on it, it will automatically put in the back ticks.) This time I'm putting in something less than 2 and greater than negative 2, but I'm getting a problem: jamovi doesn't want to do it, because it says the filter uses a column function. That's because Z score depends on the mean of the variable we're looking at, and the standard deviation, which means it has to be calculated across all of the cases in that column, including the ones the filter would remove; you can't filter on that, you get a sort of infinite regress. So I'm going to come back here and turn that filter off, and then I'll show you a way you can do it.

Now, one choice is to simply copy the values in the Z score variable and paste them into a new one; now they're no longer formulas, they're just static values, and you can use them like any other variable. But that really compromises the functionality of jamovi: you lose the ability to easily replicate the calculations, because it's no longer a formula-based approach.
A better way to do it, even though it's slightly more complicated, is to include the Z function in the filter itself. So let me come back here and click on this new filter: I now have Filter 4 right here, and I'll paste in a filter that actually has the Z formula inside it. When I hit Return, we no longer see the case whose z-score was 4.33 (that's going to be a high z-score), but you can see that all the others are within the chosen range. Now, this is nice if you want to choose the cases that are between two values, but often when you're using z-scores you want to look at the extreme cases, to identify outliers, and the intuitive thing would be to simply flip around these relational operators, turning the less-than signs into greater-than signs. But let me show you what happens when you do that. First, I'll turn this off and create a new filter, Filter 5, and paste it in; you see, this one is the same as above, except the less-thans are flipped around and replaced with greater-thans. Well, I'm going to get a problem with that one: it's excluded every single case, because of the way jamovi parses the logic of the expression. This doesn't work. Instead, what you need to do is write two separate statements joined by an "or". So I'm going to turn that one off, open it up, and paste in this other filter, and now you see it says: the z of mean is less than negative 2, or the z of mean is greater than 2. Please note that in a lot of languages, things like "and" and "or" need to be in capitals; in jamovi, they need to be lowercase. Also, if you're writing in a language like R, you can use the pipe, the vertical-line character, to signify "or"; that doesn't work here, you need to type a lowercase "or". And when I run this one, you can see that it's selecting cases that are unusually high and also some that are unusually low: we've joined the two conditions in a single filter and gotten the extreme cases.

Now, I want to show you one more example of how filters work. First, I'm going to make this one inactive, so we have all the cases again, then press plus and paste in my seventh filter. This one uses the variable "like product", this one right here; again, because there's a space (a nonstandard character) in the name, I have to put the name of the variable in back ticks. By the way, that's another good reason to never have spaces in your variable names: that way you don't have to do something like that. And I'm going to select cases that are either 4 or 5. Now, I'm using 4 because I know that 4 is "Agree"; in fact, let me just run this and click on the variable again: you'll see here that 4 is "Agree" and 5 is "Strongly agree". You can use either the value or the label, but if you use the label, you need to put it in quotes, not back ticks, because back ticks are only for the names of variables. Also, if you're familiar with programming, you know that you don't use a single equals sign, because that's usually what's called an assignment operator: it says "this variable gets this value". You have to use two equals signs together, and that means "is equivalent to". So: like product is equivalent to 4, or like product is equivalent to "Strongly agree". That gives me the 4s and the 5s, the agrees and the strongly agrees. And so that's an exploration of the many different ways filters operate in jamovi.
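If this syntax feels familiar, that's because it mirrors R's logical expressions, with the two differences we just saw: jamovi wants a lowercase "or" instead of the pipe, and back ticks around names with spaces. Here's the same set of selections in base R, on a small hypothetical data frame:

```r
# The filter expressions from this video, written as base-R logicals.
d <- data.frame(
  id           = 1:5,
  z_mean       = c(0.4, -2.1, 2.097, 0.0, -0.4),  # hypothetical z-scores
  like_product = c(4, 5, 2, 3, 4)                 # 4 = Agree, 5 = Strongly agree
)

d[d$id < 10, ]                                  # Filter 1: ID < 10
d[d$z_mean < -2 | d$z_mean > 2, ]               # extreme cases, joined by | ("or")
d[d$like_product == 4 | d$like_product == 5, ]  # == means "is equivalent to"
```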
Filters allow you to focus on particular cases, find the things that are most interesting in a particular data set, drill down to give them the attention they deserve, and hopefully get some new insights out of your data.

Our next topic in jamovi is exploring data. The reason we need to do that is that raw data is just completely overwhelming, even when it's set up in nice rows. Instead, you need to get a map; you need to know what you're dealing with and get the lay of the land so you can tell a coherent story. Now, in jamovi we have several options for exploring data. First off are regular descriptive statistics: things like the mean, the median, and the standard deviation. But even more useful than those are data visualizations: histograms and density plots; box plots and violin plots (or a dot plot, in this case combined with a violin plot); and finally standard bar plots, one of the most useful methods for visualizing data. We'll also talk about how you can export these graphs and tables to documents like Google Docs and Microsoft Word, and to presentations, so you can share your insights and get your message across.

Perhaps the quickest way to get some insight into your data is to do basic descriptive statistics, like frequencies, means, and standard deviations. Fortunately, that's really easy to do in jamovi. Now, the data set I have open here is the bugs data set, where we have a number of people rating how much they would like to get rid of bugs that are either high or low in disgustingness and high or low in frighteningness, and we have each person's gender, region, and education. All we need to do is come over here to Exploration and click Descriptives. (Now, I have the scatr module installed, so two extra entries show up in this menu; if you don't have those, that's not a problem.) Just hit Descriptives, and then we get to pick the variables we want, and jamovi will give us means and so on, or frequencies.

Now, I find it useful to start with the predictor variables, the ones you're going to use to predict the outcomes; in this case, that's these three categorical demographic variables: gender, region, and education. So I'm going to put those over here into Variables, and what we're going to get is a table that really only tells us how many cases there are: there are 91 or 92, and we're missing one or two on each of them. I don't need these other statistics, because those are for quantitative or continuous variables, so I'm just going to remove them for right now. On the other hand, since we have nominal or categorical variables, it would be nice to get a frequency table. I'm going to click this selection right here, and then it automatically expands and gives me the count (the number of people in each category) along with the percentage of the total and the cumulative percentage. And so, for instance, we can see that we've got about two-thirds women and one-third men; that we've got a lot of people from North America and almost 10% from Europe, but the other groups are pretty small; and for level of education, we have a spike at "less", and we have 15 people, or 16.5%, at "high", which I assume means high school. Anyhow, this is the first step: get some basic demographics for the things you're going to use as predictors.

Now, I also want to look at the outcome variables, which are scaled: they're measured on 0-to-10 scales. And I'm going to do the Descriptives command over again.
I'm going to hit Descriptives, and it shows up blank, and this time I'm going to take these four outcomes and put them into Variables over here. Because these are scaled or quantitative outcomes, it's going to give me a mean, a median, and so on. On the other hand, there are a few statistics I should add. Most important is the standard deviation; that's sort of a bare minimum for how spread out things are. It's also nice to have the quartiles. If you want, you can add the standard error of the mean, or the variance, and you can get skewness and kurtosis, each with its standard error, but what I have right here is usually plenty. By the way, you may notice that this table strongly resembles what you get from the command you'd use in SPSS for basic descriptive statistics. That's on purpose: jamovi is modeled to be friendly to people migrating from SPSS, and then it brings in the power of R, but it's designed to be accessible. And so now we have some descriptive statistics on the categorical demographic variables and some on the quantitative, continuous outcomes.

There is one other thing we can do here that's worth mentioning: we may want to take these outcome variables and break them down by one of our other categories. The only one that's going to work really well here is gender, male or female. The general rule of thumb with quantitative variables is that you want at least 10% of your sample in the smallest group, so for region we would probably have to combine people into something like "North America or other", and I don't know that that would make a lot of sense. But what I can do is get the statistics for these four outcome variables (how much people want to get rid of these various bugs, by how disgusting or frightening they are) and break that down by gender.

To do that, I come back up to Exploration and go to Descriptives. I pick the four outcome variables, which stand for low-disgusting/low-frightening, low-disgusting/high-frightening, and so on. Then I take our nominal, categorical variable, gender (which is text in this case, because they actually wrote out "male" or "female") and put it down here under Split by. Then I'm going to make a few changes to the table I'm going to get. Mostly, I'm going to remove some of the statistics, because when you start breaking things down by other categories, it gets really busy. So I'm just going to get the N, the mean, and the standard deviation, and that's probably sufficient for what I'm doing right now. And here you can see how it broke everything down into those categories; I can click out here to see it. So that's a quick run-through of the descriptive statistics you can get quickly and easily in jamovi: a very good first look at your data and a way to get started on understanding what you have, to help you shape, and then interpret, your analyses.
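In syntax mode, this split-by analysis again comes down to one jmv call. A sketch; the four outcome column names below are placeholders for however the bugs columns are named in your copy of the data, so substitute the real ones:

```r
# Descriptives split by a grouping variable, via the jmv package (a sketch).
library(jmv)

descriptives(
  data = bugs,                                       # the bugs data, once loaded
  vars = c("LoDLoF", "LoDHiF", "HiDLoF", "HiDHiF"),  # placeholder outcome names
  splitBy = "Gender",                                # break the table down by group
  sd = TRUE                                          # keep it to n, mean, and sd
)
```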
This is one of the example data sets. I'm going to come over to Exploration and go to Descriptives. By default it's going to try to produce statistics; I actually don't want statistics this time, so I'll preempt that and remove those options right away, so it won't produce a table. With the table gone, I'll select the four quantitative variables (measurements of the sepals and petals from samples of three species of irises) and put them in Variables. Right now it's doing nothing, because I canceled all of the statistics, but then I come to the Plots menu, and all I'm going to do is select the first option, Histogram.

Remember, a histogram is a chart, like a bell curve, that shows you how common each score is in a distribution. The width of each bar, the bin width, is arbitrary, and in a lot of other programs you can adjust it manually. You can't do that in Jamovi, but truthfully, right now that just makes your life a little simpler. What we have here is a pretty strong unimodal distribution for sepal length. We have something kind of normal, with a big spike, for sepal width. We have a strong bimodal distribution for petal length, although it might really be two separate, roughly normal distributions. And then for petal width, again a strong bimodal shape, with a really skewed distribution here on the far left. By the way, when you get a bimodal distribution, that generally tells you that more than one distribution got combined, and that makes sense here because this data set contains three different species of irises. So it makes sense to split this up and look at each species separately.

I'll run the command again: back up to Exploration, to Descriptives, select the four variables again and put them into Variables, but this time also take Species and put it under Split by. I'll cancel the statistics again (those aren't what I'm looking for right now), but I will ask for histograms. Now what we have are separate stacked histograms for each of the three species. Jamovi does a really nice thing here: it puts all of them on the same scale, so for instance on sepal length all three panels run from four centimeters up to eight centimeters. I'll click over here to close the menu. It also puts the three species in different colors, so it's really easy to tell them apart: you can see that setosa is the lowest, versicolor is kind of in the middle, and virginica is a little higher. For sepal width we've got some big differences, where setosa is now the highest and versicolor and virginica are pretty similar to each other, although we have an outlier on setosa. Then for petal length we have an enormous difference: setosa is way down at the low end and the other two are pretty close to each other. And finally, for petal width, you can see that the skewed distribution that's really close to zero belongs to the setosa irises, whereas versicolor is in the middle and virginica is the highest. So this is a great first step for a visual exploration of your data: first combine all the groups in one plot.
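In syntax mode, both of these runs (all groups combined, then split by species) come down to the hist and splitBy arguments of jmv::descriptives(). A sketch, assuming the iris measurements use R's usual column names; the statistics switches mirror the checkboxes in the Statistics panel:

    library(jmv)

    # Histograms of all four measurements, groups combined
    descriptives(
        data = iris,
        vars = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
        hist = TRUE,
        n = FALSE, missing = FALSE, mean = FALSE, median = FALSE,
        min = FALSE, max = FALSE, sd = FALSE)  # suppress the statistics table

    # The same histograms as stacked small multiples, one per species
    descriptives(
        data = iris,
        vars = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
        splitBy = "Species",
        hist = TRUE,
        n = FALSE, missing = FALSE, mean = FALSE, median = FALSE,
        min = FALSE, max = FALSE, sd = FALSE)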
Then, when you get an indication that something may be going on, split it up to see if you can drill down and get a little more explanation for peculiar results like bimodal distributions or skewness. That's what we get from these separate, stacked, small-multiple histograms, which are very easy to make in Jamovi and really a great way to start breaking down your data.

When you're looking at the distribution of a quantitative or continuous variable, a histogram is usually the first choice, and that's what I have right here from the last video: the entire variable with all three groups combined, and then, down below, the three groups separated out. That's a nice way to do it. However, an alternative that I truthfully prefer is to smooth out the choppiness a little with something called a density plot. To do that, simply click on the existing analysis; one of the nice things about Jamovi is that when you click on an analysis, it brings back the menu that produced it. So here we have Descriptives, with the four quantitative or continuous variables, all the statistics turned off (because I didn't want a table), and Histogram checked. The reason Histogram and Density are both listed below the Histograms title is that Jamovi can layer them. I'm going to do that for a moment to show the relationship, and then separate them. So I'll click Density, and it gets laid on top of the existing histograms.

What you see is the original histogram with a blobby shape over it, which is a lot like a smoothed-out histogram, and you can see a similar shape on all of these. It doesn't follow the histogram exactly, because it's averaging across, but you see the same general trends, including the strong bimodal distributions here. To make it a little clearer, I'll turn off the histograms, which leaves just the density plots, and the scale adjusts, so the patterns become a little more pronounced: roughly unimodal here (really kind of triangular, but close to normal), then strongly bimodal, and strongly bimodal again. Then we come down to the second set, the four variables split by species, and I'll do the same thing simply by clicking on the analysis. Again, all I need to do is click Density to lay the density curves on top of the existing histograms, and then turn the histograms off. I actually prefer density to histograms, because the smoothing makes the pattern a little easier to see. You can see it does the same thing where it colors the curves and stacks them by the three groups, and this is where the pattern really comes into focus.
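The density version is just a different plot flag in the same call. A sketch, under the same assumptions as before (statistics switches omitted here for brevity):

    library(jmv)

    # Density plots split by species: dens = TRUE instead of hist = TRUE
    descriptives(
        data = iris,
        vars = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
        splitBy = "Species",
        hist = FALSE,  # turn the histograms off...
        dens = TRUE)   # ...and keep just the smoothed densities

Setting both flags to TRUE layers the density curves over the histograms, which is what the GUI does when both boxes are checked.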
With the three groups split out, you can see the mounds, the piles of data, in a way that makes it clear the distributions are basically unimodal within each species, just in different locations and with slightly different skewness. Come down to petal length and the differences become very clear. I think that with the smoothness of the density plot it becomes a little easier to see the overall pattern and not get distracted by the jagged edges of the histograms. So density plots are another really helpful way to look at a quantitative or continuous variable, either for all the data at once or broken down by subgroups.

The more you work with data, the more you learn that outliers can be a really serious problem. They can throw off your analysis and your conclusions dramatically, so you want to pay special attention to the presence of outliers in your data. Probably the easiest way to do that is not with a histogram or a density plot but with a box plot. In this example, which uses the iris data and starts where we left off with the density plots, I'm going to make box plots, show how they relate to the density plots, and show the extra insight we can get from them in identifying outliers.

To do this, I'll click on my existing density-plot analysis, and when it opens I'll come down to Plots and click Box plot. Notice that this option is in a separate column, which means it produces a separate chart: the density plots won't go away, this will be in addition (though I'll remove the density plots afterward). Here come the box plots, directly underneath the density charts. In the first one there are no outliers. In this one we get a few outliers on each side, and that's not surprising, because if you remember your statistical terminology, this is basically a leptokurtic distribution, and those tend to have a lot of outliers. With petal length we don't have outliers, but that's because of the bimodal nature: the middle part of the box plot is stretched out so far. The same thing is true here at the bottom. When a box is spread out like that, it's a sign that something really unusual is going on in your data. I'll turn off the density charts in this analysis, so we have just the box plots at this point.

You can do the same thing with grouped analyses, where you split by something. I'll close this analysis for a moment and come down to our stacked density plots showing the three species. If I click on that, I can do the same thing: I'll click Box plot, and the box plots show up directly beneath each chart. This time we lose the pretty coloring, but it's easy to see what's going on: we have an outlier for virginica here, and two outliers each for setosa and virginica there. These are pretty symmetrical; they don't look bad. I'll turn off the density plots so we have just the box plots. Then we come down to petal width, and we can see something really serious going on, because the setosa distribution is so squished (so narrow) and so far away from the others. That lets us know something really important is happening, and petal length shows the same idea. So box plots are a good way of identifying outliers.
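In syntax-mode terms, the box plots are one more flag in the same descriptives call. A sketch, same assumptions as above:

    library(jmv)

    # Box plots for each measurement; omit splitBy for the ungrouped version
    descriptives(
        data = iris,
        vars = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
        splitBy = "Species",
        box = TRUE)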
We don't have any massive outliers in this data set, which tells us we're in pretty good shape. What we do have are strongly differing distributions, and if you're doing something like the analysis of variance, that's exactly what you're looking for; this just confirms our impression that the three species of iris differ substantially in some of their measurements. Box plots are a good way to check that, and again, a good way to check for the potential influence of outliers in the data.

Jamovi includes one really unusual, kind of funny-looking graph called a violin plot. It's essentially the box-plot version of the density plot: the density just gets spread out symmetrically, and you'll see why it's called a violin. I'm starting with the iris data from the example data and the box plots I created in the last video, and I'll simply click on the analysis where I have the four variables not split by anything. I come over to Plots, and the reason Violin is listed right here is that Jamovi is able to stack all three of these plot types on top of each other. You usually don't want to do that, because it gets really busy looking, but I'm going to show you the relationship between the box plot and the violin plot by adding it right here. When I do, you get a shape that looks a little like a Rorschach inkblot, and it actually corresponds to what's happening in the box plot. For instance, you can see that we've got a lot of data here in the middle, and that's where the violin plot bulges out the most. We get a funny little manta-ray-shaped one here, because of our strongly leptokurtic distribution. And down here you can see why it's called a violin plot: it comes out, goes back in, and comes out again, because of the bimodal distributions we have on these two variables.

Now I'm actually going to remove the box plot so you can see the violin plot on its own, and what we're left with is really just a set of squiggles. It feels like a projective test again, and it's a little hard for me to read, but it's an interesting variation: like a density plot, but oriented the same direction as the box plot. You can, of course, do the same thing with the subgroup analysis we have down here, where we broke the measurements down by species. I'll click on that to open the menu (you can see we have Species under Split by), click Violin, and then also click Box plot. What we have now feels like a little set of drawings of ghosts, or a chart of animal shapes (there's a nice little bat down here), but you can see the distributions for the three species. This one shows a strongly compressed range, again because the outcome variable, petal length, runs vertically here. And for petal width you can see once more that we have a very unusual distribution for setosa, while things are almost sort of normal for versicolor and virginica. So the violin plot is not a very common choice, but it's potentially an interesting one that might give you some extra insight into your data.
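The violin plots are, once again, a flag in jmv::descriptives(). A sketch, same assumptions:

    library(jmv)

    # Violin plots, optionally layered with the box plots they mirror
    descriptives(
        data = iris,
        vars = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
        splitBy = "Species",
        box = TRUE,     # set to FALSE to see the violins on their own
        violin = TRUE)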
Box plots and violin plots are a nice way of summarizing the distribution of a quantitative or continuous variable, but maybe you actually want to see the data directly, and dot plots are a way that allows you to do that. I'm starting with the violin plots I made using the iris data from the example set. I'll click on the analysis to open up the commands, come back down to Plots, and simply add Data. Now, a lot of the time you don't want things stacked, with more than one kind of plot on top of another, but the violin and the data points work well together, because the violin is an empty shape and the data sit inside it. What this shows is every data point in the data set (150 in this particular case), running from the lowest score up to the highest, and you can see how they approximately match the density of the violin plot.

This default is called jitter. Really, every dot should sit exactly on the line in the middle, but that makes it hard to tell how many points are in a particular place, because they can land on top of one another. Jittering randomly spreads them out a little to the left or to the right, so they're not usually on top of each other and it's a little easier to see the density of the distribution. You have another choice besides jittering: if you're a nice and orderly person, you can select Stacked. Instead of distributing the points randomly, stacking puts them exactly where they need to go and arranges them in mirror-image patterns as you go up the chart. This will appeal to those of you who are fastidious, and now you can see a little more clearly the distribution of points at each value, and how the violin plot mimics it. I'm going to leave the violin plot on; normally I'd avoid layering, but these two go together nicely and don't compete.

Then I'll come down to the same violin plots broken down by species, click on that analysis, click Data, and leave it on Stacked. Now, interestingly, I said stacked, but what appears looks like jitter, so I'll see if I can get Jamovi's attention by switching back and forth from Jitter to Stacked to see if anything changes. It was just a small glitch; going back and forth straightened it out. Now you can see how the data are arranged within each violin plot, and the height and width of each violin makes a little more sense. So a dot plot of this kind, either jittered or (probably better) stacked, is a little easier to read: it lets you see what's going into the box plots or the violin plots, the individual data points that are going to drive your analysis.
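The dot plot adds two more arguments, dot and dotType; I believe the two layout options are "jitter" and "stack", matching the radio buttons in the menu. A sketch, same assumptions:

    library(jmv)

    # Individual data points inside the violins
    descriptives(
        data = iris,
        vars = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
        splitBy = "Species",
        violin = TRUE,
        dot = TRUE,
        dotType = "stack")  # or "jitter" for the randomly offset version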
The last kind of chart I want to show you for exploration is actually the simplest, and truthfully it's often the most useful: the bar chart. All it does is show you the number of cases in each category. For this example I'm going back to the bugs data, and this time I'm only focusing on the gender, region, and education variables, the categorical or nominal ones. I have frequencies for each of those over here.

First, I have the frequencies for each variable without breaking them down: the number of female and male respondents, the frequencies for region (where most of the people are from North America), and the frequencies for education. I'll click on this analysis to bring up the menu, and you can see how those three got put into Variables: I turned off all the statistics and turned on the frequency table, and that's how I got this output. Now I'll make just one more little click, Bar plot. Jamovi knows to offer that automatically when you have nominal (that is, categorical) or ordinal variables, which is what I have here. So I click it, and it produces the bar charts directly below the tables I already have; you can see it now says Plots. And it's easy to see that we have about twice as many women as men, that an awful lot of people in this particular sample are from North America, and that education is spread around in kind of a peculiar way. This is a good way of looking at a categorical variable: values are so much easier to read from bars, because all they require is a relative linear judgment, as opposed to a pie chart. Jamovi doesn't even do pie charts, because they're often harder to read.

So I'll close that one and come down to the tables where I've broken things down by gender. Let me click on this one: you can see I'm getting a descriptives analysis of education split by gender, which is why we have education down the side and gender across the top; all the statistics are off and the frequency table is on. Now I click Bar plot, and what this gives is a paired or grouped bar plot, where the bars for each group are placed side by side. I'll close this window and move the output over a little so you can see the whole thing. What we have are our six levels of education, from "advanced" to "some," in alphabetical order. (There's a natural order to these things, where more education should be further to the right and less education further to the left, but alphabetical is what we get.) We also have gender indicated by color: reddish orange for female and a sort of teal for male. From this it's easy to see that we have a lot more women in each category, with the exception of "partial" over here, where it's just a couple of people each; the other categories show roughly the 2-to-1 ratio we have overall. So a bar chart, or a grouped bar chart, is a simple way of getting insight from a categorical or nominal variable: often the easiest, simplest kind of chart, and it really can be the most informative when you're trying to get some quick insights from your data.
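For completeness, here's the syntax-mode sketch of both bar-chart runs, again assuming the bugs data frame and the column names used earlier:

    library(jmv)

    # Simple bar plots of the three categorical variables
    descriptives(
        data = bugs,
        vars = c("Gender", "Region", "Education"),
        freq = TRUE,
        bar = TRUE)

    # Grouped bars: education split by gender, side by side per category
    descriptives(
        data = bugs,
        vars = "Education",
        splitBy = "Gender",
        freq = TRUE,
        bar = TRUE)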
Once you've conducted your analyses and created all these amazing visualizations, you want to be able to share them with other people, so they can get the same insight out of the data that you did. Fortunately, Jamovi makes that easy to do. In the file "exploring tables and plots" I have a table showing the measurements for the iris data set, and beneath it graphs for each measurement broken down by species, and I want to show you how to put these into Microsoft Word, PowerPoint, and Excel, and then Google Docs, Slides, and Sheets.

Let's start with the table at the top. What you do is option-click (or control-click, or a two-finger click) on the table and choose Copy. Now that it's copied, I'll go to Word, where I have a document open, come down, and paste it with Edit > Paste. What you find is that it actually looks really beautiful. It is a tiny bit funky: sometimes it has invisible columns in it, like the little in-between column right there, but it looks really good. Now, if we want to put that into PowerPoint, I'll open PowerPoint, come to this slide, and press Paste. This time it looks kind of weird, because what actually gets pasted is a rough approximation of the table; you see it doesn't even give us the last row of data, and then of course PowerPoint wants to do pretty things with it. I'm going to close that. It's a good table, but it worked better in Word: go back to Word and you can see we got all the rows of data there, whereas it did not work so well when we went into PowerPoint.

When you have a table, it's usually best to put it into a spreadsheet. I'll go to Excel and do the paste command, and now you can actually see those empty columns between each of the measurements. You'll need to move things around a little (take petal width from here and put it over here, and so on), but a spreadsheet also lets you easily rearrange and resize things exactly the way you want. So copying a table and sticking it into a spreadsheet like Excel is often the best way to work with tabular output.

There's one other option I want to mention. Back in Jamovi, you may have noticed you also have the option to Save, which gives you two formats. You can save the table as a PDF; I'll call this "table," it becomes table.pdf, and once it's saved you see it says it's exported. I'll go down to my Downloads folder and click on it: a very pretty PDF, looks great. The other option is to save it not as a PDF but as an HTML file. I'll call that "table" and it puts .html on; I save it, open the file, and it opens in a web browser looking exactly like what we had in Jamovi.

Now let me show you how this works with Google Docs, Slides, and Sheets, because it works a little differently. I'll copy the table again, go to Google Docs, come down a little in my document, and hit Paste. Unfortunately, it doesn't always work well on this computer: it's not letting me select anything, there seems to be some kind of glitch, the cursor stops blinking, and it won't let me scroll. So what I'll do is close that page and open it back up, and when I do, the table is there, I can scroll, and you can work with it. I don't know why it does that; it's a little frustrating, and obviously things need to be resized and rearranged a little, but there it is: it's in Google Docs.
If I want to put it in a Google Slides presentation, I can click on this slide and press Paste right there. It's done two things: first, it's pasted the table, but it's also pasted this other square, and truthfully I have no idea what that is; it's invisible. I'll click off to the side, come back, click the square, and get rid of it. It doesn't really matter. This one I can simply resize by dragging things over like this, so that's not bad; it works pretty well. But the best way to do this, by far, is again to go into a spreadsheet: I've got a Google Sheet here, I just press Paste, and it looks exactly like what we had in Excel, with the empty columns in between, except this time the column names are in the right place. So if you have tabular output, you can put it directly into a Word document, a Google Doc, or a presentation, but it works best in a spreadsheet, which lets you manipulate the tabular data.

Often, though, instead of a table, people are more concerned about sharing the graphics. So let's go back to Jamovi for a moment. I'll come down to my graphs (I actually have several here) and take this one, petal length. I option-click (a two-finger click on my Mac), and I could save it as either a PDF or an HTML file, like I did with the earlier ones, and it would look exactly the same. But what I want here is copy and paste, so I choose Copy. Now that it's been copied, I go back to Word, click where I want it, scroll up a little, and press Paste. It looks beautiful, and we're good to go; if you want, you can click on it and resize it by dragging. Similarly, if I want to work in PowerPoint, I can open it up, go to a new slide, and paste, and there it is (PowerPoint offers me a lot of abilities to animate it and such, which I'll dismiss). That's a great way to share your information: it's big, it's clear, it's nice. By the way, these images have transparent backgrounds, so if you put one on a slide with a different background, it'll show through that way.

Things work a little differently in Google Docs and Google Slides, so let me show you how that goes. I'll go back to my browser and try to paste in this image. I come down here and press Paste, and it says "unable to create some images," dismiss, and it acts like it's doing something, but it's not going to get anywhere, so I'll just delete that. The same thing happens in Google Slides. The easiest way around this is a small intermediate step, and actually it's a good thing to know how to do anyhow. Go back to Jamovi, right-click on the graph, and instead of Copy, choose Save, since we're headed for Google Docs or Google Slides. By default it wants to save the graph as a PDF file, and what's nice about that is that PDFs are infinitely scalable, because they're vector graphics. But that's not what I want for this one; I want an image file. So I come down to Format, and it can save it as a PNG file, which has the transparent background and is the format that's usually used.
You also have two other options, SVG and EPS, so depending on the programs you're using, you may want one of those instead. But I like the PNG, even though it has a fixed resolution, so I'll come right here and save it as graphics.png. Now I can go back to my document, come down to my Downloads folder, and just drag it in, and there it is. I can do the same thing with my slide: come to the slide, go to my folder, drag it in, and center it. And there you have it: I've now successfully imported both tables and graphics into Microsoft Word and PowerPoint, put the table into a spreadsheet, and done similar things in Google Docs, Google Slides, and Google Sheets. The way you go about it is sometimes a little different, but all of these make it possible to take the analysis you worked very hard on in Jamovi and find a way to share your insights, and what you found in your data, with other people.

In our next chapter, on t-tests in Jamovi, we're going to go from the piece to the whole. That is, we're going to start with inferential statistics, which lets us take a sample of data we have in front of us and use it to describe a population: something larger than what we observed. We use the sample to infer things about the population, and the t-test is the simplest and most direct way to get started with this. In this chapter we'll talk about three kinds of t-tests. First is the independent samples t-test, which you use to compare two group means to each other. Second is the paired samples, or repeated measures, t-test, for when you're looking for changes in scores from time one to time two in a single group of people. And third is the one sample t-test, where you take a single sample and compare it to a known population mean. Taken together, these three serve as an excellent introduction to the concept of inferential statistics: going from the specifics of the sample at hand to generalizations about the population at large.

One of the simplest inferential tests you can do is the independent samples t-test, where you compare the means of two different groups. It's very easy to do in Jamovi, and you can actually run several at a time, although they're going to be separate comparisons. To show how this works, I'm using the example data set bugs, which asks how much people want to get rid of bugs that are low disgust/low fright, low disgust/high fright, high disgust/low fright, or high disgust/high fright, rated on a scale from zero (not at all) to ten (very much). We also have information about people's level of education and the region they live in, and gender is coded as male and female. I'm going to use gender and compare the male and female respondents on these four variables.

Now, before you do a t-test, before we start doing any inferential tests, it's a good idea to take a look at the variables and see how well they meet the assumptions, because certain things, like normality and similarity of variance, are important for a t-test. So I'll come over to Exploration and Descriptives, pick my four outcome variables and put them under Variables, and then split the whole thing by gender.
It's not really the table I'm most worried about, although you can see that we have more female respondents than male, and the means are reasonably close: there's a one-point difference here, two-thirds of a point there, a half point there, so they do vary. What I really want are the plots, so I'll come over to Plots and get a density chart and box plots for each of these comparisons. What you see is that these first ones are kind of close to normal (these are the female respondents here, the male respondents here), the distributions are pretty similar, and the box plots show no outliers and actually a lot of overlap. Then here's where we start getting non-normal distributions, which is in a sense a problem, but because of the central limit theorem, and because we're really working with sampling distributions, it's not the end of the world; we can still go ahead. We've got a couple of outliers and similar distributions. Okay, so now we see a little of what's going on.

With that as context, I'll now do the actual t-test: I come over to T-Tests and choose Independent Samples T-Test. All I need to do is pick my dependent variables (you can just call them outcome variables; "dependent" is usually reserved for randomized experiments) and then the grouping variable, the thing that splits people into two groups, which here is gender. I click that over, and by default Jamovi gives us a table with the t statistic and the p-value, which is the significance test. We can see right away that there are actually no statistically significant differences between the male and female respondents on any of these. The rule of thumb is that p needs to be less than .05 to be considered statistically significant; the closest among the first comparisons is a p-value of .161, but this one down here is a lot closer, at about .06, nearly significant. That's this last comparison right here.

But there's a lot more you can get from the t-test function in Jamovi, so let me turn on a few of these options. (If you're familiar with Bayesian statistics, you can incorporate that here too, which is kind of nice.) I'm going to get the mean difference, that is, the mean of the women minus the mean of the men, and you can see they run from about .25 to about .6. I'll also get the confidence interval, which by default is 95%. If I scroll over a little, you can see the whole thing, and you can see that each interval is negative on one side and positive on the other, so it includes zero, which is consistent with the differences not being statistically significant. I'm also going to get the effect size, which in this case is Cohen's d; it tells you how many standard deviations apart the two group means are, and here they're pretty small, from close to zero up to about 0.4 standard deviations. I'm also going to get normality checks. We know the distributions are not entirely normal, meaning bell-curve shaped, and the test we have here, the Shapiro-Wilk test, lets us know that really none of them are exactly normal. We could tell that by looking at them, but again, it's really not the end of the world, because we're using the sampling distributions and not the raw distributions. Finally, I'm also going to check for equality of variances, which says that the two groups need to be spread out by approximately the same amount; that's also important for the t-test.
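In syntax mode, this whole analysis (t and p, mean difference, confidence interval, Cohen's d, the normality check, and the equality-of-variances check I just turned on) is a single jmv::ttestIS() call. A sketch, assuming the four outcome columns are named LDLF, LDHF, HDLF, and HDHF; check the actual names in your copy of the bugs data:

    library(jmv)

    ttestIS(
        data = bugs,
        vars = c("LDLF", "LDHF", "HDLF", "HDHF"),  # assumed outcome names
        group = "Gender",
        meanDiff = TRUE,    # mean difference and its standard error
        ci = TRUE,          # 95% confidence interval for the difference
        effectSize = TRUE,  # Cohen's d
        norm = TRUE,        # Shapiro-Wilk normality check
        eqv = TRUE,         # Levene's test for equality of variances
        desc = TRUE,        # group means, medians, SDs, standard errors
        plots = TRUE)       # descriptive plots with confidence intervals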
This check uses Levene's test for the equality of variances, and you can see that none of these are significant; in fact, the lowest p-value is about 0.4. So there's no significant difference in the variability of the two distributions, which is what we want. I'll also click Descriptives, which gives me the means and so on for each group: the mean, the median, the standard deviation, and the standard error, which is used in the inferential tests. And then finally the descriptive plots, which in this case give us confidence intervals for the means, and also show the median for each group. Since these correspond most closely to the inferential t-test, this is probably the best plot for judging whether there's actually a difference. The general rule of thumb is that if the confidence interval (the vertical line) for one group overlaps the mean of the other group, then they're usually not significantly different. We've got a lot of overlap here, a lot of overlap here, and these ones are pretty separate; that's the comparison that was nearly significant. Anyhow, that's how you do the independent samples t-test using a single grouping variable, in this case male versus female respondents, with several outcome variables at once. It's an excellent first step in looking at what's happening in your data through inferential statistics.

Sometimes you have data from one group of people and you're comparing them before and after some kind of intervention, or you're comparing them on two separate measurements, and you want to compare each person's score with their own score. You're looking for changes from one variable, or one time, to another, and this is when you want to use a paired samples t-test. It's very easy to do in Jamovi. For this example I'm using the same bugs data, which asks how much people want to get rid of insects ranging from low disgust/low fright up to high disgust/high fright, on a scale from zero (the lowest) to ten (the highest). I'll come up here and choose Paired Samples T-Test, that's the test we're doing, and what we have to do is specify pairs of variables.

There are a few different ways to select the pairs you want. You can click one variable, push it over here, and then simply click the second variable (maybe I'll do this one), and that sets up a pair. You can also click one variable and command-click or control-click the other, so you get both of them at the same time, and that makes a pair. Or, if you have a bunch of comparisons all in a row, you can select the whole run with a shift-click, and Jamovi will put the variables together in order: number one goes with number two, number three goes with number four, and so on. Now, what I have here is, by the way, a violation of the independence of observations. Normally you wouldn't do the analysis quite like this, because we're looking at the same comparisons in different ways; it's more like a post hoc test from an analysis of variance. But the procedure is the same, so it's a valid demonstration of the process in Jamovi.
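The jmv version is ttestPS(), where each pair is a small list with an i1 and an i2 entry. A sketch, using the same assumed column names as before (your exact pairs may differ; the structure is the point):

    library(jmv)

    ttestPS(
        data = bugs,
        pairs = list(
            list(i1 = "LDLF", i2 = "LDHF"),
            list(i1 = "HDLF", i2 = "HDHF"),
            list(i1 = "LDLF", i2 = "HDLF"),
            list(i1 = "LDLF", i2 = "HDHF")),
        meanDiff = TRUE,
        ci = TRUE,
        effectSize = TRUE,  # Cohen's d on the difference scores
        norm = TRUE,
        desc = TRUE,
        plots = TRUE)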
So what I have right here are my four pairs of variables, and you can see over here that it's run all the tests already. In fact, all four comparisons are statistically significant. In one sense that's the simple answer and we could stop right there, but fortunately Jamovi gives us some other options, and I want to take a quick look at them. Let's look, for instance, at the mean difference: how big is the difference? What that does is take the difference between the first variable and the second variable for each person (that's a difference score) and then average those differences. So here we have the differences, and you can see that most of them went down and one went up; by the way, whether a difference is positive or negative is arbitrary, simply a matter of which variable you put first and which second. We also get the standard error. I'm also going to get the confidence interval for the mean difference (I have to scroll over a little so you can see it), and the effect size, which is Cohen's d. In this case it says how many standard deviations apart the scores are on the first variable versus the second (or at time one versus time two); really, it's the mean of the difference scores divided by the standard deviation of the difference scores. And you can see these are actually pretty big: that one's almost a full standard deviation.

We can also get Descriptives, which gives us the means and medians of the variables, and a normality check, which can be important because t-tests assume normal, bell-curve-shaped distributions. On the other hand, it's not really the raw distribution that matters here but the sampling distribution, which is a slightly different thing, and even though we have some statistically significant results here, I usually don't personally worry about it very much; as long as you have a large enough sample, you're probably okay. I'll finish with the descriptive plots, which give us confidence intervals. I'll wait a second while it updates and then scroll down to the plots. What these show are the mean and the confidence interval for each of the variables in a comparison. This one is low disgust/low fright versus high disgust/high fright, the biggest difference available in the data, and you can see they're very far from each other, with no overlap at all. The square is the median, and what's funny here is that the median is actually above the confidence interval: there was a ceiling effect, people gave really high scores to that item, and the mean is lower only because there were outliers at the low end. You can see similar distributions for some of the other ones. So this is a good way of making comparisons for the same group of people on one variable versus another, where they're rating two different things, or often before and after some particular intervention.

The simplest version of the t-test is the one sample t-test, which allows you to compare the mean of a single group you have data from to a chosen, hypothesized value. It might be a national average, or it might be a theoretical value that represents the baseline against which you want to compare everything.
This is really easy to do in Jamovi. Here I'm using the example data set Big 5, which contains ratings where people evaluate themselves on five major personality characteristics (neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness), all rated on a 1-to-5 scale, and I want to show you how we can evaluate all five of these simultaneously. But before you jump into t-tests, you always want to do some exploratory analysis to see whether your data will work well with the intended analysis. So let's go first to Exploration and then to Descriptives, and I'll pick the five variables and put all of them into the variable list.

By the way, Jamovi updates frequently; in fact, between the time I started this this morning and now, it updated, and we have a few extra options available under Descriptives. Here we have the mean, the median, the minimum, and the maximum by default. I also want the standard deviation, and I may also want, for instance, a normality test, which is one of the new features: just a p-value that lets us know whether something deviates from normality. So this one is nearly non-normal, this one is far from normal, and so on and so forth. The important thing here, though, is going to be the graphics, and I have some new choices there too. I'll do density charts (like smoothed-out histograms), box plots (which are good for identifying outliers), and the new option, a Q-Q plot, which stands for quantile-quantile plot. It shows how well the observed data match up against, for instance, a normal distribution with the same mean and standard deviation as the observed data. So I'll click that as well.

The charts look a little different (they're colored differently), but here we have neuroticism, which looks basically normal, and the box plot shows it's basically symmetrical with a few high outliers. The Q-Q plot shows the information on a diagonal, and if all the dots fell exactly on the diagonal, that would tell you the data match what you'd expect from a normal distribution. We're really close here, with some small variations at the tails, which is how it normally works. Extraversion, you see, is a little skewed: we've got some low outliers, and you can see that happening here at the bottom. Openness is symmetrical but slightly funny looking; you can see how it tapers off here and here, but these are not major deviations (they're statistically significant only because this is a large sample). We have some low outliers on agreeableness, because most people want to say they're really agreeable, and conscientiousness shows the same general idea. So that's some of the important background, and nothing I saw here is really worrisome: it's confirmation that our data probably work well with the assumptions of a one sample t-test, which include normality, and we can go ahead and do our analysis.

So now I'll come over to T-Tests and click One Sample T-Test, and when I do, I can select my variables; I'm actually going to select all five simultaneously. You only want to select multiple variables if they share the same null value you're comparing them to. Now, a lot of the time that null value is going to be zero.
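Jumping ahead slightly: once you've picked a sensible test value (below I use 3, the scale midpoint, which is the choice I make in a moment), the syntax-mode equivalent is jmv::ttestOneS(). A sketch, assuming the data frame is called big5 and the columns carry the trait names:

    library(jmv)

    ttestOneS(
        data = big5,  # assumed data frame name
        vars = c("Neuroticism", "Extraversion", "Openness",
                 "Agreeableness", "Conscientiousness"),  # assumed names
        testValue = 3,      # compare every mean to the scale midpoint
        meanDiff = TRUE,
        ci = TRUE,
        effectSize = TRUE,  # Cohen's d: (mean - testValue) / SD
        norm = TRUE,        # Shapiro-Wilk
        desc = TRUE,
        plots = TRUE)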
Zero would be a silly test value in this case, though, because zero isn't even on the scale; these are rated on a 1-to-5 scale. So what I'm going to do is come down here and change the test value from 0 to 3, because that's the midpoint of a 1-to-5 scale. When I do that, you see the table over here update. It's running Student's t, the one sample t-test, and you can see that for every variable the mean is significantly different from three. This is not a surprise, because most of these scales have a positive or desirable end: people want to be agreeable, they want to be open, and so we tend to get high values. I'm also going to look at the mean difference: how far away is each variable's mean from the hypothesized value of three? You can see they're not really big differences, two-tenths of a point here, about six-tenths of a point there. We'll also get the effect size for each one, which takes the distance between the sample mean and the hypothesized population mean and divides it by the standard deviation. Because the standard deviations are not big, Cohen's d, the common effect size here, comes out pretty big: openness, for instance, is 1.7 standard deviations above the hypothesized mean of three, and that's a big effect. We can also get the confidence interval for the mean difference, which adds a few columns, and you can see they're all positive, because our means are all above three, except in the case of neuroticism, because people don't want to be neurotic. We can also get a normality check, which puts up a separate column with the Shapiro-Wilk test; these are actually the same p-values we had in the top table when we checked the box under Descriptives, so this is a more convenient way to do it.

I'm also going to turn on Descriptives, which repeats a lot of the information we had before, but sometimes it's nice to have it down here. An important thing to look at is the standard error, which is the standard deviation divided by the square root of the sample size; you can see these are tiny, tiny values, around two-hundredths of a point. And that's what leads to a very funny descriptive plot, which normally shows you a confidence interval around each mean: when the standard error is microscopic, you end up with invisible confidence intervals. In each case here, the circle shows the mean of the variable and the square shows the median, and they're super close. You can tell, for instance, that openness is a little higher, even though at this point the labels don't automatically adjust, so they're a little squished together. But this plot lets us know that, yes, everything is different from three (three would be a straight line across right here): four of the variables are above it and one, neuroticism, is below. It's a very compact and concise way of looking at several variables simultaneously, and if you're simply trying to get a first quick look at how your sample data match up with some expectation about the population, it's a very good and quick way to go.

In our last chapter, on the t-test, we looked at methods that let you compare two things at a time: two groups whose means you compare. But what if you have more than two things to compare? Maybe you even have a whole bunch of different groups, and you're trying to get insight into how those groups differ and how they affect the outcomes you're interested in.
Well, in that case, you're going to want to use the analysis of variance, also called ANOVA, and in this chapter we're going to look at a few variations on the ANOVA theme that Jamovi makes available to us. The first is the standard factorial ANOVA. This is where you look at how a factor (a categorical variable that splits people into different groups) influences your quantitative or continuous outcome variable; you may have one factor, or several factors simultaneously. We'll also look at the repeated measures analysis of variance, where you measure the same people on more than one measure or at more than one time. Then we get into some things that are genuinely sophisticated: the analysis of covariance, ANCOVA, and the multivariate analysis of covariance, MANCOVA, where you have several outcome variables and some quantitative variables entering the equation as well. It's an amazing thing that Jamovi makes these available and easy to work with. Finally, we'll look at a couple of variations on nonparametric analysis of variance, where instead of basing the analysis on means and standard deviations, we work with ranks. Taken together, these variations on the analysis of variance give you a much broader range of situations you can analyze, and a lot more potential for getting insight out of your data. So let's take a closer look at the analysis of variance.

Certain fields of research tend to use different methods. For instance, if you're doing economics research, it's very common to use linear regression; if you're doing experimental research, it's extremely common to use the analysis of variance, also called ANOVA. What it lets you do is compare the means of two or more groups, breaking them up by one or more categorical variables simultaneously. To demonstrate this in Jamovi, I'm going to use the built-in example data set Tooth Growth. It's a curious topic: it's about the growth of teeth in guinea pigs who were given one of two different supplements, either vitamin C, shown here as VC, or orange juice, written as OJ. Each comes in three different doses, 500, 1000, or 2000 (I assume milligrams), and the outcome is the length of the teeth; that's len here. What we can do is look at the combined effect of the supplement and the dose, to see how they jointly, as well as separately, affect tooth growth in these guinea pigs in this experiment.

But before we get started with the analysis of variance, we need to do a little background work. This is always an important first step: we need to look at the outcome variable, len, all by itself, and then break it down by the categories we're going to use in the analysis of variance. So I'll come over to Exploration, click Descriptives, and simply select len, for length. I just need a little bit of basic information here: we have 60 observations total, and we have a mean, median, minimum, and maximum. Great. What I really want, though, are the plots.
If the Plots section isn't open already, just click on it. I'm going to get the density chart, which is like a histogram; a box plot, which is good for showing outliers; and a Q-Q plot, which stands for quantile-quantile plot, a way of assessing how closely the observed distribution matches a theoretical bell curve, a normal distribution with the same parameters. When I scroll down, I see that len is basically unimodal. It's not exactly a normal distribution, but it's not pathologically different: it's symmetrical, and the box plot shows no outliers. The quantile-quantile plot is a little funny, kind of wavy; if the data were a perfect bell curve, every point would fall exactly on that diagonal line, but it's not too far off, and it's not going to cause any real problems.

On the other hand, I also want to break the outcome down by the groups we're going to use. So I'll close this dialog and do it again, except this time I'll take len, my outcome variable, put it there, and then split it by both supp and dose, the supplement and the dosage. That gives me a really long table breaking down the N, the number missing, and so on for each combination, and while that might be really important, we'll get a condensed version of it later. What I really want right now are the charts, so again I ask for density, box plot, and Q-Q plot. I'll scroll down and move this over just a little so we can see more of it. What I have are paneled density plots: orange juice, which conveniently is in orange, and VC, for vitamin C, on the bottom. And we see that, yeah, they're not really perfect normal distributions; this one's kind of close, the others are a little wavy. That's partly because these are also very small samples; there aren't many observations within each category. But again, they're not pathologically different. We have some outliers in the box plots: there's an outlier right there, for vitamin C at the 1000-milligram dose, and there's another one for OJ, but again, not terrible. And here are the quantile-quantile plots; we don't have many observations in each group, so the individual dots are easy to distinguish, but again, they're not horribly off. So I think we're in good shape to go ahead with the analysis of variance, because we seem to meet the core assumptions of the technique.

So what I'm going to do is come over to the ANOVA menu and click the first entry, ANOVA. What you have here is the option to do either what's called a one-way or one-factor analysis of variance, where you have a single variable, like supplement, and you look at how it affects the outcome variable, len. Or you can put in more than one categorical variable and look at the interaction between them, and because that's the most common approach in many fields, especially in experimental research, that's what I'm going to demonstrate. So I take len and put it in as the dependent variable, and then I take supp and dose (I'll shift-click to get both of them) and put them over here, where they land under Fixed Factors.
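The basic model, in syntax-mode terms, is a jmv::ANOVA() call (note the capitals; it avoids clashing with R's own anova function). A sketch using R's built-in ToothGrowth data, which has the same structure; one wrinkle is that in plain R the dose column is numeric, so it needs converting to a factor first:

    library(jmv)

    # Jamovi treats dose as nominal; in plain R, make it a factor first
    ToothGrowth$dose <- factor(ToothGrowth$dose)

    # Factorial ANOVA: both main effects plus their interaction by default
    ANOVA(
        data = ToothGrowth,
        dep = "len",
        factors = c("supp", "dose"))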
You can see the analysis of variance table appear and fill up, and we get our immediate results. See that, for instance, for supplement, if I come way over here, that's a big value for F, and the probability value used for statistical significance testing is really low: it's less than .05, the common cutoff for statistically significant findings. So we have a significant main effect, meaning the supplement makes a difference to tooth length all by itself. The same is true for dose; in fact, it makes an enormous difference: you can see that F value of 92, and p is going to be much less than .001. And then there's also a significant interaction. So we'll want to get a little more detail about all of this.

The first thing I'll do is get an effect size. The effect size generally used for the analysis of variance is called eta squared; that's the little Greek letter that looks kind of like a lowercase n, an eta. If we had a one-way analysis of variance, we could use the regular eta squared, but because we have interactions, we should use what's called partial eta squared, which looks at the unique contribution of each factor. That's what showed up right here, a column of partial eta squared values. It goes from 0 to 1 and can be interpreted as the proportion of variance in the outcome variable that can be associated with that particular factor. We get a lot for supp, about .224; for dose we get a huge amount, .773; and then we have .132 for the interaction. These don't add up to one, but they give you an idea of the relative strength of each of these effects.

Next I'll go through some of the options. You can specify the model; I'm using the very general approach, which includes the two main effects for the two categorical variables I'm using as predictors, plus their interaction. That's how I want it, so I'll leave it alone and close that. Under Assumption Checks there are homogeneity tests, because your different groups are supposed to have approximately the same amount of spread in the outcome variable. I'll click that one, and it brings up a new table: the test for homogeneity of variances, using Levene's test. The important part is the p-value at the end, which is .103. That's more than .05, so there's no statistically significant difference in spread, which is good. We'll also get a Q-Q plot of the residuals, the variability left over after we account for supp, dose, and the interaction. Again, if the residuals behaved perfectly, which is kind of what you want, the points would sit right on the diagonal, and they're really close, so it looks like we meet the assumptions of the analysis of variance. I'll close that.

You also have the option of specifying particular contrasts in your design, especially if you don't want just the omnibus main-effect, main-effect, interaction breakdown. That's a more complicated procedure, and I won't get into it except to let you know it's there and that you have several choices for how to define the contrasts. I'll close that and come to post hoc tests: when you find a significant effect, you need to determine where it is. That's easy if your nominal variable has only two levels or categories.
Now, you have the option of specifying specific contrasts in your design, especially if you don't want just this omnibus main effect, main effect, interaction setup. That's a more complicated procedure; I'm not going to get into it except to let you know it's there, and that you have several choices for how you make those contrasts. I'm going to close that and come to post-hoc tests. This is for when you find a significant effect and need to determine where it is. That's easy if you only have two levels or categories within your nominal variable: for supp, we just have orange juice and vitamin C, so if there is a difference, we know where it is. But for dosage, where there are three levels — 500, 1000, and 2000 milligrams — you don't necessarily know exactly where the differences are, and it's more complicated still with the interaction, where there are six possible groups. So what I'm going to do is select all of these and move them over to the right, and we get a big table with the possible cell-by-cell comparisons. Now, it is a lot, and you get to choose the correction. A lot of people use the Scheffé or the Bonferroni; I personally prefer the Tukey test, from John Tukey, who developed it. What we see, for instance, is that for OJ versus VC — well, there's only one comparison possible there — yes, it's significant. For the dosages, we see that all three of them are significantly different from each other. And then the interaction gets a little more complicated, because there are a lot of possible comparisons, and not every one of them is statistically significant; we have a few that aren't. But all of that is going to be a lot easier to figure out if we look at the charts, the descriptive plots. So I'm going to take the supplement and put that on the horizontal axis, take the dose and make that separate lines, and then get 95% confidence intervals. When we come down here — it sometimes takes a second for Jamovi to catch up — what you have is orange juice on the left, vitamin C on the right, and the three dosage levels: 500, 1000, and 2000 milligrams. What we see is that, yes, 500 is the lowest, with orange juice a little higher than vitamin C; 1000 is definitely higher for both, by about the same amount; but things flatten out when we get to 2000. In fact, there's no difference between orange juice and vitamin C at 2000, and there's really no difference between orange juice at 1000 and at 2000. So that lets us know where the important differences are in the data. Now, the last thing is a couple of additional options, where we can get the descriptive statistics. This mirrors, to a certain extent, the statistics we got at the very beginning when we were checking things out, but it creates a small table that crosses the two variables with each other and gives us the bare minimum: the number of observations within each cell, the mean for that cell, and the standard deviation. And now you have the data that's actually shown in this chart right here — for instance, there's the 7.98 and the 13.23 way down low. You can do some more detailed analysis there if you want.
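For reference, the Tukey comparisons and a quick version of that means plot are both one-liners in base R; a sketch, continuing from the aov() fit above:

```r
# Tukey post-hoc comparisons for every term in the fit, then a quick
# means plot with supplement on the x-axis and dose as separate lines.
TukeyHSD(fit)
with(ToothGrowth,
     interaction.plot(supp, dose, len,
                      xlab = "Supplement", ylab = "Mean tooth length"))
```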
So this is the walkthrough of the analysis of variance in Jamovi. What we did is the preliminary work — looking at the outcome variable on its own and breaking it down by the categories we were using to predict, in this case the supplement and the dosage — and then the analysis of variance with interactions, plus several of the options Jamovi gives us. It makes it very simple, and it's a good way to break up your data and get results, especially from an experimental analysis.

Any time you're analyzing data, you are making comparisons; you may be making them explicitly or implicitly. The comparison that's usually most important in research is, for instance, a control or baseline condition versus some sort of experimental, manipulated, or intervention condition. The idea is that if people are randomly assigned to one condition or another, the differences between people balance out, and you can look at the effect of just that manipulation. But a really interesting and powerful alternative to this — powerful meaning statistically powerful: easier to find the effect with fewer people — is what's called a repeated measures design. This is where everybody in your study gets to serve as their own control, their own comparison or baseline. The way you do that is by gathering data from them in more than one condition: if you have four conditions, you get data in all four conditions, and then you compare the change from one condition to another for each person. Again, a much more powerful design if your study allows for it. Now, the example data set I want to use is the bugs data set, where everybody in the study was asked to make evaluations about bugs — bugs meaning insects — in four different categories. They looked at insects that were either low on disgust and low on fright, low on disgust and high on fright, high on disgust and low on fright, or high on disgust and high on fright. And the thing they're being asked is how much they want to get rid of this bug, from zero, meaning not at all — butterflies are very pretty; nobody wants to get rid of them — up to ten for something like a really big cockroach that people very much want gone. What this means is that people make evaluations in each of these categories, so they serve as their own control. That allows for extra power, although it also introduces additional complications into the analysis. One of the nice things about Jamovi is that it has this kind of repeated measures analysis of variance built in; in other programs that functionality may not be there, or you may have to pay extra for it, but here we have it right away. Now, as I do with all the analyses, before the actual analysis you want to do some background checks. So let's take a quick look at these four outcome variables. I'm going to come here to Exploration and Descriptives, and even though we've looked at these in other videos, it's worth doing it again right here. We could get statistical summaries, but I actually find it a lot easier to use the plots. I was going to come down here to histograms, but actually I'll come down to Density and click on that. Now, unfortunately, I can't get side-by-side density plots for each of these — I would like that, and maybe it'll be available in a future version of Jamovi — but right now what I can do is look at these distributions and compare them one to the other. Here is low disgust, low fright, where you see it bumps up against ten and goes all the way down to zero. Low disgust, high fright hardly goes down to zero at all, and we get values a lot closer to ten. High disgust, low fright is again pretty close to ten. And then with high disgust, high fright, basically everybody wants to get rid of those bugs. But now, instead of looking at the distributions overall, we can look at the changes from one to another.
So what I'm going to do is come here to ANOVA, for analysis of variance, and select Repeated Measures ANOVA. Now, the way you set this up may not be immediately obvious if you haven't done it before, because you have to specify how the various measurements fall into the factors and levels you want to combine. It's asking for my repeated measures factors, and what I have is disgust and fright. So I'm going to double-click on this first entry and write the word disgust, and then I hit return and it asks me for the first level of disgust: I'm going to call it low disgust, and then high disgust. It doesn't have to be high and low — you might be doing right and left, or up and down, or something like that. Now I'm going to skip over level three and go to repeated measures factor two, which is fright, and for that I put low fright and high fright. By the way, you can see it's filling in over here and also coming down right here. What it wants me to do now is come over and find the variables that correspond to each combination of these factors — which variable has the outcome for low disgust, low fright? Fortunately, it's abbreviated: LDLF stands for low disgust, low fright. So I just drag that over; then low disgust, high fright goes right here; high disgust, low fright — HDLF — goes there; and finally high disgust, high fright goes right down here. So now it knows where the data is and how to map it onto the factors we're interested in, and it does the analysis right here. Now, if we want to, we can put in a between-subjects factor like gender as well; that makes it a lot more complicated, so I'm going to leave it out for right now. But what you see here is that disgust changed how much people want to get rid of the bugs, fright changed how much they wanted to get rid of them, and there wasn't really an interaction between them — the effects of disgust and fright didn't depend on each other. So that is our initial result. But we have a lot of options with the repeated measures analysis of variance, so I'm going to scroll up here a little bit. Again, if I had a between-subjects factor like gender or region, I could put that in here, and if I had a covariate like a person's age or their level of exposure to bugs, I could put that in as well. I'm just going to come down here and get an effect size, and for a study like this, it's good to get partial eta squared — that little letter that looks like an n is a Greek eta, the common measure of effect size for the analysis of variance. Here we can see, for instance, that this 0.123 means about 12.3 percent of the variance in a person's response can be attributed to the variation in disgust; over twice as much can be attributed to the variation in fright; and there's just 2.4% associated with the interaction, which is negligible.
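Under the hood, this is a within-subjects design. Here's a base-R sketch of the same analysis; the column names (Subject, LDLF, LDHF, HDLF, HDHF) are my assumption about how the example file labels the wide data, and rows with missing ratings would need handling first:

```r
# Repeated-measures ANOVA sketch for the bugs example: reshape the wide
# ratings to long format, recover the disgust and fright factors from
# the cell codes, then fit a within-subjects ANOVA with an Error() term.
long <- reshape(bugs, direction = "long",
                varying = c("LDLF", "LDHF", "HDLF", "HDHF"),
                v.names = "rating", timevar = "cell",
                times = c("LDLF", "LDHF", "HDLF", "HDHF"),
                idvar = "Subject")
long$disgust <- factor(ifelse(substr(long$cell, 1, 1) == "L", "low", "high"))
long$fright  <- factor(ifelse(substr(long$cell, 3, 3) == "L", "low", "high"))
fit_rm <- aov(rating ~ disgust * fright +
                Error(factor(Subject) / (disgust * fright)),
              data = long)
summary(fit_rm)  # within-subject F tests for disgust, fright, interaction
```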
Then I can come down and specify the model. I'm just going to use the regular built-in main effect, main effect, interaction model, so I'm not going to change that at all. Next, assumption checks: in the repeated measures analysis of variance, a common test is what's called sphericity, which is sort of the repeated-measures analog of the homogeneity-of-variance check. I can click it, but it's not going to be relevant here, because we only have two levels in each factor — high and low disgust, high and low fright — so sphericity is always met; it only matters when you have three or more levels. So I don't really need that. Now, in a normal analysis of variance I could also do the equality of variances test — in fact, I'm just going to click on that right here; it's Levene's test — but it's irrelevant for this one, because I don't have a between-subjects factor like gender. So we can ignore that one. I can do post-hoc tests if I want: for instance, I can take disgust and fright and put them over. It's not really necessary here, because I only have two categories within each factor, so it's obvious where the difference is if we look at the means. But if you had more than two categories within a factor, this would be important, so I want you to see how it works. You also have a choice of correction — the Scheffé and Bonferroni are available — but I prefer the Tukey test. We can scroll down and see it's calculating all sorts of things. These two rows correspond to the analysis of variance results we saw earlier, and then it looks at the specific comparisons, which makes for a very long table. What it's telling us is that several of our comparisons — these first three, and the last one — are significant, but not the middle two. That would be a lot easier to see with a means chart, but it does give you the means, so you can quickly look over the specific comparisons. Now, the last thing we can do is come to Estimated Marginal Means and look at the effects. I'm going to take disgust and drag it right there, then get a new term, take fright, and drag it down right here, and we get a means plot. I also get a marginal means table, and I can scroll down. This makes it a lot easier to see what's happening in our data. Not surprisingly, when a bug is disgusting, people want to get rid of it more than when it's not — and here are the numbers to go with that, along with the 95% confidence intervals. And when a bug is frightening, they want to get rid of it; you can actually see that the difference for frightening is much bigger than the difference for disgusting, and we get the numbers right there. So this is the collection of information you can get from the repeated measures analysis of variance in Jamovi. Some of these options aren't really necessary when you have just two groups within each factor or a simple interaction, but even with more complicated designs, it's nice to know the options are there, available to let you use the more powerful repeated measures design to find out what's going on in your data and start getting insight into your results.

Sometimes when you're looking at the differences between groups, you not only want to compare their means, but you also know that there is another variable — a quantitative variable — associated with the outcome you're interested in. If you want to put that in, one common choice is an analysis of covariance, or ANCOVA. And this is actually really easy to do in Jamovi.
For this example, I'm going to be using the iris data set from the example data sets, and I'm using it because we have four quantitative or continuous variables — sepal length and sepal width, petal length and petal width — and a categorical or nominal variable, species, with three different groups. That's a situation where you might use an analysis of variance, but we can also throw in a covariate to help predict the outcome. What I'm going to do in this particular example is look at sepal width; that's one of the four measurements of the flower of an iris. Before we get started, it's a good idea to explore what that variable looks like, so I'm going to take sepal width, put it right here under Variables, and it gives us some descriptive statistics over here. I may ask for the standard deviation also — that's helpful — but mostly I want some plots. I'm going to get a density plot, and a box plot to look for outliers, and that gives me my first take on the outcome variable, the thing I'm trying to predict. We see that we have 150 cases; the mean is 3.06 centimeters, I believe; the median is close to that; and the standard deviation is about half a centimeter. Here is the distribution of all three species put together: it's kind of cone-shaped, but it's unimodal and seems pretty well behaved. And here's the box plot: we've got a few outliers on the high end and one on the low end, but it's pretty much a well-behaved distribution. So if we're trying to model sepal width, this looks like a good variable to be working with. Now, since I'm doing an analysis of variance, I want to break it down by a categorical variable, and the one I'm going to use in this example is species. So I'm going to do this analysis over again: I'll go back to Exploration, go to Descriptives, and do sepal width again, but this time split it by species. I don't need any statistics this time, so I'm just going to cut those out; really, all I want at this point is the plots, and I'm going to do density plots broken down by group. Here you can see the three distributions shown by density plots — they're like histograms, but smoothed out — for setosa, versicolor, and virginica, the three different irises. You can see we've got a little bit of an outlier down here, and things are kind of bumpy over here. So if we're trying to explain sepal width, knowing which species an iris is will be at least a little helpful, because we can tell that setosa is at least a little different from the others. So let's do an analysis of variance, which would be a common way of looking at the differences between these. I'm going to close this, go to ANOVA, and do a basic one-factor or one-way ANOVA, where I put sepal width as the dependent variable and species as the fixed factor. I'll keep it very simple: I will get partial eta squared, then come down to descriptive plots, put species on the horizontal axis — and let's see, some descriptives — and that's enough for right now. What it's showing us is that, yes, there definitely is a difference between the three species on their sepal width. The p value here is less than .001; in fact, the F value of 49 is huge, and the partial eta squared is about .40.
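In R syntax terms, that one-way ANOVA is a one-liner; a base-R sketch using the built-in iris data:

```r
# One-way ANOVA: does mean sepal width differ across the three species?
fit_ow <- aov(Sepal.Width ~ Species, data = iris)
summary(fit_ow)  # F is about 49.2 on 2 and 147 df, p < .001
```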
So about 40% of the variance in sepal width can be associated with the three different species. And you come right here and see a really big difference when we're looking at just the means with their confidence intervals. But we might be able to do better than that by throwing in a covariate — a quantitative or continuous variable that's associated with the outcome, sepal width. Maybe that would be helpful. So what I'm going to do first is a little bit of exploration: a scatterplot using the scatr module. Now, if you don't have the scatr module installed, you get it by coming over here to Modules, selecting scatr from the jamovi library, and clicking Install; then it's available right there. So: Exploration, Scatterplot. I want sepal width as the outcome, so I'll put that on the Y axis, and let's put sepal length on the X. By the way, sepals are like petals — the sepals and the petals alternate as you go around the flower. And I'm going to put a smoother through it; it's like a local regression line that follows the pattern in the data, and we'll even put the standard error on there. You can see that there is, in fact, an association between the two variables, so maybe throwing in sepal length would be a helpful thing. In fact, if you look closely, you'll see we have a nonlinear association — it kind of jumps up here and comes down here. Remember, we have three different species, and there may be an important difference between them. So I'm going to come back to Exploration and do a scatterplot again — that way I create another chart — putting sepal width on the Y axis and sepal length on the X, with the smoother and the standard error like before. But now I'm going to throw in species as a grouping variable, and because I can, I'm also going to put density plots on the marginals — that means across the top and down the side — which shows the distribution of each variable for the three groups. When we look at this — I'll close the menu first and move it over a little — you can see that the setosa iris has a really strong, and very different, association between sepal length and sepal width, while the other two are pretty similar to each other. So we have something important to work with here. I could also fit a linear association — and in fact, because that's what's relevant to the analysis of covariance, I'll do exactly that. I'm going to repeat this analysis, but instead of using the curved smoother that follows the data, I'll use a linear, straight-line regression. So: Exploration, Scatterplot; width on the Y axis, length on the X axis, species as the group; a linear regression line with the standard error. And this time, instead of densities, I'll put box plots on the marginals so you can see what that looks like — a different way of looking at the data. I'll close this menu and bring it over a little. And now you can see that we have this really strong straight-line association here.
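Outside Jamovi, here's a quick ggplot2 sketch of the same grouped scatterplot with straight-line fits:

```r
# Grouped scatterplot with per-species linear fits, roughly what the
# scatr module draws inside Jamovi.
library(ggplot2)
ggplot(iris, aes(Sepal.Length, Sepal.Width, colour = Species)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +  # per-group linear fit + standard error
  theme_minimal()
```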
And this is where the analysis of covariance is important: not only do the three groups have separation between them — which is what you'd expect with the analysis of variance — but they have different slopes on the association, and that's going to matter for the analysis of covariance. We also have box plots for each variable, broken down by group, so we can see a few outliers and a little bit of overlap between them. So let me now set up an analysis of covariance. The way I do that is to come to ANOVA and come down to this third option, ANCOVA, which is short for analysis of covariance. I'm going to take my dependent variable — the outcome, the thing I'm trying to predict — and that's sepal width; I'll put it right here. The fixed factor is the thing that defines the distinct groups, and that's species, so I'll put it under Fixed Factors. So far this is just like the analysis of variance I did earlier, but now I can put in a covariate: another quantitative, continuous variable that's associated with the outcome and may help me better understand what's happening. In this case I'm going to use sepal length, because we saw there's an association with it. I put it here under Covariates, and you see that the table expands slightly, and we now have two inferential tests. For species we get this enormous F ratio, 94, and it says yes, it's statistically significant — there is in fact a difference between the species on sepal width. We also see that sepal length has a significant association, even with both of them in the equation together. We can get partial eta squared here to look at the effect sizes, and we see that species makes more of a difference, but they're both really big, robust findings. Now, let's look at some of the other options. We can look at Model, and I'm going to stay with the default model — though if I want to put in an interaction, I can. And actually, now that I think about it, I will put in an interaction, because those lines have different slopes. So I'm going to select the two of them — a shift-click to get both — come right here, and say give me a two-way interaction. That adds the species-by-sepal-length term, and now you can see that the interaction is also statistically significant — again, because the groups have different slopes in the chart. Now let's come down and check some of our assumptions. We can do a homogeneity of variance test, because the three groups are supposed to have approximately equal variance on the outcome variable, sepal width, and we can also get a Q-Q, or quantile-quantile, plot of the residuals. The homogeneity of variance test is not statistically significant; that's a good thing — it means we're not violating the assumption. The Q-Q plot of the residuals assesses another assumption of the analysis of covariance: that the residuals — the errors in prediction, how far off your model is — should be spread out approximately the same for each group and should be approximately normal, kind of a bell curve. If they were perfectly normal, every dot would be on this diagonal line, and you can see they're really close. So we're doing well.
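For reference, a base-R sketch of this same ANCOVA with the interaction term. One caveat: aov() reports sequential sums of squares, so the numbers won't match Jamovi's default Type 3 results exactly.

```r
# ANCOVA sketch: species differences in sepal width with sepal length
# as the covariate, plus the species-by-length interaction (the
# different slopes visible in the grouped scatterplot).
fit_anc <- aov(Sepal.Width ~ Species * Sepal.Length, data = iris)
summary(fit_anc)  # Species, Sepal.Length, and Species:Sepal.Length terms
```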
Now, it may be that you have specific contrasts you want to look at — maybe you want to compare just setosa against the other two kinds of irises. You can set that up here; I don't want to do that, so I'll close it and leave it alone. I can do post-hoc tests, though, and because I know one of the species is far from the other two, I'm going to slide species over here. I can pick the correction; I'll stay with the default, Tukey. When I come down, you see it compares setosa with versicolor and virginica, and then versicolor with virginica — three possible comparisons when you have three groups. What we find is that versicolor and virginica are not significantly different from each other — which makes sense, because they overlap a lot — but setosa is different from both of them. In fact, if we come back up here for a moment: this is setosa over here, and these are versicolor and virginica; you can see those two are very similar, and setosa is pretty different. So that result makes sense. I'll scroll back down. Then, under descriptive plots, I can get a means plot by moving species to the horizontal axis, with confidence intervals, and that takes just a second to pop out. It confirms what we already knew: sepal width varies across our three species. And under additional options we can get descriptive statistics. We got some before, but I might as well click it here, because it puts them right next to the chart — and since these are the numbers that go into the chart, it's a little easier to put it all together.

So this has been kind of an extended presentation, because I spent time going through the preparatory analyses — the univariate exploration, breaking it down by categories, the normal one-way analysis of variance, then the associations between sepal width and sepal length — building up the narrative arc of the analysis. And when we get through the whole thing, we see that there is in fact not just a difference between the three species, but the relationship between sepal length and sepal width is also different for the three species — that's what the different slopes of the regression lines told us. So it gives us some nuance, some additional insight, into even a very, very simple data set. And that's some of the power you can get from the analysis of covariance when working with your own data.

Some of the options Jamovi gives you for working with data are really kind of surprising, because they're not usually found in other statistical software, at least not in the standard installations. If you're working in SPSS and you want to do a multivariate analysis of covariance, you might have trouble: it may not be included in the base package, and you may have to buy additional components to make it work. But a sophisticated analysis like a MANCOVA is available as part of the base package in Jamovi, and that's an amazing thing. Now, MANCOVA stands for multivariate analysis of covariance. The multivariate part means you have more than one outcome variable, and you're trying to model the group differences on all of them simultaneously. And because there's a C in the middle, for covariance, you can also use a quantitative or continuous predictor.
You don't have to, though — you can do either of the above with the MANCOVA option in Jamovi. Let me show you how this works with the iris data again. All I'm going to do here is see if there are group differences on all four of these variables. I'm going to begin with the exploration, which, as I always say, is a good idea. So I'm going to come up to Exploration, go to Descriptives, take these four quantitative or continuous variables, put them here, and then break it down by species — split it by species. Now, I'm not very concerned about the statistical table at the moment, so I'll just untick all of this and close it up. What I do want is the plots, and I'm going to do the density — the smoothed histogram — broken down by the three groups, so we can see what sort of differences there are between those groups in the data. It takes a moment for Jamovi to get there. But now we have sepal length across the three species of iris flowers — setosa, versicolor, and virginica — and you can see that there are differences: things go up a little bit as we go from one species to another. The pattern switches around a little for sepal width, where setosa is the highest and the other two are really close. For petal length we have an enormous difference, with setosa way down here on the low end and the other two still distinguishable. And then for petal width we actually have this nice separation between all three of them. We could look at each of these separately — four separate analysis-of-variance runs, looking at the group differences on each one. The MANOVA, or MANCOVA, lets us do all of them at once, although I'll tell you that while that sounds like a big improvement, it's often pretty hard to interpret the results, because the math that goes into it gets much, much more sophisticated. But let's see what Jamovi can do to make this procedure at least a little more accessible. I'm going to close this, go over to ANOVA, and come down to MANCOVA, which again stands for multivariate analysis of covariance. When I click on it, the nice thing is that it's not giving me all the possible options you could get — there would be a million. The easy part is: what are the dependent or outcome variables? I'm going to pick all four of these flower measurements and put them over here. Then Factors defines the groups — that's species here. Now, if I wanted to use one of these measurements as a covariate, or if I had something else as a covariate, like the age or the height of the plant, I could put that in too. I'm not going to, because it doesn't change the output very much, and I just want to focus on what we have right here. The important thing is we really just get this one table — that's pretty much all that's given to us. In fact, we get the same analysis done four different ways when we come here to the multivariate tests. The question these four statistics are trying to answer is: are there differences on these four variables simultaneously between the three groups? And it turns out that all four versions of the math behind the multivariate analysis of variance — or covariance — are giving the same answer in this case. They're all saying that, yes, there's a huge, highly significant difference: the p value, where you're looking for something less than .05, is much less than that.
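In plain R, this same multivariate test is the manova() function; a sketch (the covariate named in the comment is hypothetical — iris doesn't include plant age):

```r
# MANOVA sketch: all four iris measurements as simultaneous outcomes,
# species as the grouping factor. Adding a covariate after Species
# (e.g. "+ plant_age") would make this a MANCOVA.
fit_mv <- manova(cbind(Sepal.Length, Sepal.Width,
                       Petal.Length, Petal.Width) ~ Species, data = iris)
summary(fit_mv, test = "Pillai")  # also "Wilks", "Hotelling-Lawley", "Roy"
summary.aov(fit_mv)               # follow-up univariate ANOVAs
```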
Jamovi also does univariate tests, which look at the variables separately, so we can look at sepal length again; this is the one-factor, one-way analysis of variance. And we can see that, yes, there are differences between the groups on each of them. Now remember, with the analysis of variance, that doesn't mean every group is different from every other one. If we come back up here, for instance, we can see that setosa is different on sepal width, but versicolor and virginica are very similar to each other — so there's a difference somewhere in the mix, even if not all three are different from each other. And that's what we get here. I'm just going to finish with one of the assumption checks: the Q-Q plot of multivariate normality. Again, this is something we're generally looking for, because a normal distribution, a bell curve, is a good thing to have. A multivariate normal distribution means that with one variable you'd have a bell curve; with two variables, it's bell-shaped in both the X and Y dimensions, and you're looking for a sort of smooth, pill-shaped cloud; and with three or four variables, you get into higher dimensions that are really hard to visualize. But we can just look at the quantiles — that's what the Q stands for. We have the chi-squared quantiles across the bottom and the squared Mahalanobis distances, which are also quantiles, up the side, and the idea is that if all the dots, which are data points, fall on the diagonal line, then you have what is basically a multivariate normal distribution. We're really close to that, so I think we've met the assumptions. Really, the options we get in Jamovi are pretty simple for what could be an extremely complex and sophisticated analysis. It's basically telling us that, yes, you have a statistically significant MANOVA — I could call it a MANCOVA, but I didn't actually use a covariate this time around — and then it follows up with univariate ANOVAs, one outcome at a time. That's enough for us to say, yes, there's something there, and if you have a theory that says what the differences should be, you might be able to check it by looking at some of the univariate analyses as well. So the nice thing, again, about Jamovi is that it makes this really sophisticated kind of question easy to answer. It gives you just enough information to get the yes or no you're looking for if you're doing hypothesis-driven research, and that's going to be a very significant first step in understanding what's going on in your data.
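As an aside, that multivariate-normality Q-Q plot is easy to reproduce by hand — here's a simplified, whole-sample sketch of the same idea:

```r
# Q-Q check of multivariate normality by hand: squared Mahalanobis
# distances against chi-squared quantiles (df = number of variables).
X  <- iris[, 1:4]
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
qqplot(qchisq(ppoints(nrow(X)), df = ncol(X)), d2,
       xlab = "Chi-squared quantiles",
       ylab = "Squared Mahalanobis distance")
abline(0, 1)  # points near this line suggest multivariate normality
```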
Every version of the analysis of variance that we've looked at so far in Jamovi has one thing in common: they all rely on population parameters, or assumptions about population parameters, like the population mean or the population variance. The one-factor and two-factor analysis of variance, the repeated measures analysis of variance, the ANCOVA — the analysis of covariance — even the multivariate analysis of covariance: all of them have this parametric assumption in common. That's really typical among analyses, but there are other options. These are called non-parametric tests; they don't make assumptions about population parameters, and traditionally they're based on ranks. And so, for the one-way analysis of variance, the non-parametric or rank-based version is called the Kruskal-Wallis test, and it's actually really easy to set up. If you're concerned about non-normal distributions, this might be a good choice. Now, I'm going to use the iris data — we've looked at it lots of times — and I'm going to look at sepal width and break it down by the three different species. But first, let's take a quick look at sepal width in a couple of different ways. I go over here to Exploration, then Descriptives, and I choose sepal width as the only variable, split by species. I'm not really concerned about this great big table, but what I do want is the plots, and there are two kinds I want to get. I want the density plots, split by species, and this time, instead of the box plot, I want the data plot — it's a dot plot, and I have a couple of options for it. One is jittered, which randomly shuffles the points from left to right; let me see if it's showing up yet — that was the jittered version that popped up there for a second. But I'm actually going to use the nice, orderly stacked version, because it makes it a little easier to tell what we're looking at. Now let's take a quick look. What you can see is that, in terms of sepal width, the iris setosa sepals are a little wider than the sepals of iris versicolor or virginica, which are pretty similar to each other. If we were doing a parametric test, we'd be looking at the means of each of these, and we've done that previously. But the Kruskal-Wallis looks at ranks, and in that case you're sort of looking at where the highest score is: this is number one, this is the second highest, the third highest, the fourth — and as you come down to here, you have to start saying, well, this might be the 17th highest, and this one might be the 125th. You take every single data point across all three groups and you rank them; I know there are 150 total, so this lowest one will be the 150th. They go from rank number 1 to rank 150, and the question is whether the ranks are evenly distributed across the three species. That's the question the Kruskal-Wallis test is trying to answer. So let's go see how to do it. It's actually really simple: just come to ANOVA, and then, under Non-Parametric, the one-way ANOVA that says Kruskal-Wallis. We click on that, and it's actually a very small dialog that only asks for two things: what's your dependent variable — I'm doing just sepal width — and what's your grouping variable — I'm using the species of the iris. And I get this tiny little table with just three numbers in it. It gives me the test statistic, which is actually a chi-squared — I know it looks like an X, but it's a Greek letter, a capital chi — with two degrees of freedom, and this is a highly significant result, meaning these aren't even remotely close to identical distributions. Now, normally when you get a significant analysis of variance, you want to do some sort of follow-up with post-hoc tests, and since we're using ranked, non-parametric data, we need to do something a little different. Jamovi gives us the option of DSCF pairwise comparisons, which stands for Dwass-Steel-Critchlow-Fligner.
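In plain R, by the way, the omnibus test is a one-liner; and while base R doesn't ship the DSCF procedure itself, pairwise Wilcoxon rank-sum tests with a Holm correction — a different procedure, to be clear — give the same kind of cell-by-cell follow-up. A sketch:

```r
# Kruskal-Wallis rank-based test of sepal width across species,
# followed by a rough base-R stand-in for jamovi's DSCF pairwise
# comparisons: pairwise Wilcoxon tests with Holm's p-value correction.
kruskal.test(Sepal.Width ~ Species, data = iris)  # chi-squared, df = 2
pairwise.wilcox.test(iris$Sepal.Width, iris$Species,
                     p.adjust.method = "holm")
```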
You can think of these pairwise comparisons as similar to the Scheffé or Bonferroni or Tukey corrections: it computes every possible comparison — with three groups, there are three — calculates a test statistic, and then the important part at the end is the p value, which tells us whether each comparison is statistically significant. Now, you can see this one's taking a while, and that's one of the things about working with ranks, especially with a lot of data: it's an iterative procedure, and it takes a while to get through all of it. But here we have it, and you can see that in all three cases the p value — the probability of this happening at random if the null hypothesis is true — is way less than the standard 5%; this one is 0.5%, and these are even lower. So they all serve to reject the null hypothesis that the groups are similar, or that the ranks are randomly distributed across the three groups. And it gives us the same general impression we got from the one-way analysis of variance earlier: the three groups are different. Obviously this one is really high and these two are down here, but even then, the ranks show us that there are still differences between these two groups, because of the way their ranks are divvied up. So the Kruskal-Wallis is a non-parametric, rank-based analog to the one-factor or one-way analysis of variance that you can use especially when you're concerned about non-normality in your data.

One final variation of the analysis of variance that's available in Jamovi is what's called the Friedman test, a non-parametric analog to the repeated measures analysis of variance. The idea here is that you have several outcomes that you've measured for a group of people, and you're looking for changes across those outcomes. I want to demonstrate how this works with the built-in example data from bugs, where people — described on a number of demographic variables — evaluate insects, rating how much they want to get rid of them, where those insects vary according to low and high disgust and low and high fright. They rate each from zero, meaning they don't want to get rid of it at all, to ten, meaning they want to get rid of it immediately. We've done this analysis before with the standard repeated measures analysis of variance, but we can also do it with the ranked version, the Friedman test, and that's an advantage if you're worried about normality in your data. Most analyses like to have bell curves, normal distributions; if you don't have that — and we actually know we have some non-normality in this data — then a non-parametric test might be an appropriate and informative choice. But let's begin by looking at the distributions of these four variables. We've looked at them before, but I'm just going to come here to Exploration, Descriptives, and pick the four outcomes — low disgust, low fright, up through high disgust, high fright — and put them under Variables. I'm not so much concerned about the statistics as about the plots we can get for each of them, so I'll just hit Density. Unfortunately, I can't get them stacked right next to each other, which would be most convenient, but we can still compare by going up and down the list. An important question is: are they normal or not? And are they on the same scale? This first one, low disgust, low fright, is not exactly normal, but it's not too far off. We've got a pretty strongly skewed distribution with low disgust, high fright; also strongly skewed with high disgust, low fright; and then really skewed with high disgust, high fright. So this is a situation where a non-parametric test based on ranks, as opposed to a parametric test that also assumes things like normality, might be a good option.
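Incidentally, the test we're about to run exists in plain R as friedman.test(); a sketch, again assuming the example file's wide column names (LDLF, LDHF, HDLF, HDHF):

```r
# Friedman test sketch for the bugs ratings: friedman.test() accepts a
# matrix with one row per subject and one column per condition. The
# column names are my assumption about the example file's labels.
ratings <- as.matrix(bugs[, c("LDLF", "LDHF", "HDLF", "HDHF")])
friedman.test(na.omit(ratings))  # drops incomplete rows; chi-squared, df = 3
```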
So let's close this, come over to the analysis of variance, and drop down to the last choice here: it's under Non-Parametric, and it's the repeated measures ANOVA, also called the Friedman test. I click on it, and there aren't too many options in this Jamovi dialog, but there's enough to get what we need out of it. What I need to do is pick the measures I'm looking at, and I simply pick all four of them — I click the first, shift-click the last, and move them all over. In this situation, I'm not breaking them down into factors like disgust and fright the way I did with the regular parametric analysis of variance; I just do it this way. Now I get this tiny little table, just three numbers, that shows us there is, in fact, a significant effect here. The first number, the 55.8, is the chi-squared — that's the distribution it uses, and that's a capital chi — with three degrees of freedom, and it deviates significantly from what we would expect at random. So we do find a significant effect here. If you want to, you can get pairwise comparisons, like the post-hoc Bonferroni or Scheffé comparisons we get with other procedures. What we have here is a comparison of each pair of variables; with four variables, there are six possible comparisons. So low disgust, low fright versus low disgust, high fright: that's statistically significant — and, in fact, all of them are statistically significant one way or another. Let's also get the descriptive statistics for these. It's a very small table, and some of it is similar to what we got up above; it's just arranged differently. And then we can get a descriptive plot, where we can choose either means or medians; since we've been talking about means, I'm going to leave it right there. It's just a little dot plot — it's not showing a confidence interval, because that works differently in a ranked situation — but even though these are means, they still let you know that all four of these are different from each other. And from the paired comparisons using the Durbin-Conover test, we can tell that all of these comparisons are significantly different from one another. So that's a very quick run-through of the Friedman test, a non-parametric, rank-based analog to the repeated measures analysis of variance. Depending on how far your data deviate from normality, it may be a good choice for analyzing and finding the hidden meaning in your data.

In our next chapter, on regression in Jamovi, we're going to look at associations — specifically through correlation and scatterplots — and we're also going to look at the ways we can use many variables to predict scores on one variable: how we can use predictors to look at an outcome or criterion variable. We'll do this primarily through variations on regression, and Jamovi gives us a great set of choices. We can do standard linear regression, one of the most powerful, flexible, and useful procedures available.
We can also do binomial logistic regression, where the outcome variable is not a quantitative or continuous score but a dichotomy — this or that — and you're trying to use a collection of variables to predict which of two categories a case will go into. Then there's multinomial logistic regression, where you have several categories in your outcome variable; this is actually a very sophisticated procedure, and it's surprising that Jamovi includes it, and includes it for free. And finally, we'll look at ordinal logistic regression, recently added to Jamovi, which again lets you use a collection of variables to predict which of several ordered categories a case falls into. Between this set — linear regression and binomial, multinomial, and ordinal logistic regression — we have an extraordinary collection of tools for getting more insight out of our data. In each case, what we're doing is using several variables to predict an outcome on one. I like to think of it as the statistical version of E Pluribus Unum, a motto of the United States meaning "out of many, one": out of many variables and many data points, one conclusion. So let's look at the ways Jamovi lets us use an entire collection of regression techniques to explore data and get useful insight out of it.

Perhaps the simplest way of looking at the association between variables is with the correlation coefficient — specifically the Pearson product-moment correlation coefficient, usually just called r. It's a great way of looking at the association between two quantitative or continuous variables, although it's actually much more flexible than that, and I want to take a moment to show you how we can do correlations, correlation matrices, and scatterplot matrices in Jamovi. To do this, I'm bringing in a data set called the state data, based on information I've compiled from different sources. There's the name of the state, its abbreviation code, and the region it's from. I looked up whether the current governor is Republican or Democrat. From a few years ago, there's a study that categorized states by their personality characteristics, classifying them as temperamental, friendly and relaxed, or traditional. Another study classified states by their Big Five personality characteristics: extraversion, agreeableness, conscientiousness, and so on. And here at the end, I went to Google and got state-by-state data on how common certain search terms are in each state. These are z-scores, so they tell you how many standard deviations above or below the national average each state is. I put in some that I felt were at least kind of relevant to the Big Five personality characteristics: some social media ones; some business ones, like entrepreneur and GDPR — that stands for the General Data Protection Regulation the European Union just put in; it's a big deal if you're in e-commerce — and then university, mortgage, volunteering, museum, scrapbooking, and my favorite one, modern dance. Let's take a look at the associations between a few of these variables, and then we'll explore some of the others as we go through this chapter on regression in Jamovi. Okay, the first thing I'm going to do is pick a small number of variables. When you go to Regression, your first choice is Correlation Matrix. Now, truthfully, you could put in a hundred variables.
You'd just end up with an absolutely gargantuan matrix and not be able to make sense of it, so I'm going to be more selective and pick just a few. Right now, I'm going to pick one of our personality characteristics, extraversion — roughly, how outgoing and social a person is, versus being introverted — and a few other things to go with it. Maybe extraverted people are on social media more, so we'll pick Facebook; they may or may not be concerned about privacy, so let's see what that looks like; and then how about volunteering — are they offering to do things for free to help people in their community? Okay, I put those in, and you see that this table builds up and fills in almost immediately. What we have are the correlation coefficients arranged in the upper right of the table; there's nothing in the lower left, because those would be mirror images — the association between extraversion and Facebook is the same as the association between Facebook and extraversion. Down the diagonal there are just these dashes, because that's each variable with itself, which is always a perfect correlation and doesn't mean anything. So what we have here are two things. First, the Pearson's r, the product-moment correlation coefficient. It goes from negative one, which indicates a perfect negative linear association, through zero, which indicates no linear relationship whatsoever, to plus one, a perfect positive linear relationship. We also have the p value, which is used for statistical hypothesis testing; we're looking for a number less than .05. Now, it's going to be easier to see what's going on if we come over here and click Flag Significant Correlations. That puts asterisks next to them, which makes them a little easier to find than comparing each number in your head. And we see, for instance, that we have three significant correlations out of six. It turns out extraversion is not associated with any of these things, which is kind of surprising. But states that search for Facebook more on Google search less for privacy, and they also search less for volunteering; and states that search more for privacy search more for volunteering. These are pretty good correlations. And again, this is at a state-by-state level, not a person-by-person level — it's about the states. By the way, the data set only includes 48 states: it doesn't include Alaska, Hawaii, or the District of Columbia, because those weren't in the study that categorized the states by personality. So this gives you something to look at. Now, one interesting option Jamovi gives you is confidence intervals. If we click on that, it expands the table and gives us a 95% interval — you can change that to something else if you want — with the upper bound and the lower bound. So, for instance, you can see that this one crosses zero, here and here, whereas the association between Facebook and volunteering in statewide search terms runs from -0.451 to -0.789. So it may be that you want confidence intervals on your correlations; they make for a busy matrix, but they're available.
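For reference, the same correlations are easy to get in plain R; a sketch, where the data frame and column names (state_data, extraversion, facebook, privacy, volunteering) are my guesses at the course file's labels:

```r
# Correlation matrix with a significance test for one pair. The data
# frame and column names are assumptions about the course file.
vars <- state_data[, c("extraversion", "facebook", "privacy", "volunteering")]
round(cor(vars), 3)                    # Pearson r for every pair
cor.test(vars$facebook, vars$privacy)  # r, p value, and 95% CI for one pair
```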
The other thing I want to mention is the plot — it's always nice to have a graph. I actually would have called this one a scatterplot matrix rather than a correlation matrix, but either name works. It's a similar arrangement of rows and columns, but this time with separate scatterplots for each of the associations. They get kind of busy, but you can see that we've got a strong negative association, a strong positive association, and some others that are pretty close to zero. It's set up a little differently here — the plots sit in the lower left rather than the upper right — but again, it's symmetrical. We can also get density charts on the diagonal, so we know what the distribution looks like for each variable: extraversion and Facebook are kind of normal, we get a peculiar shape with privacy because we have outliers, and volunteering is bimodal. And then we can also turn on statistics, which displays the actual correlation coefficients above the diagonal in the scatterplot matrix — these numbers correspond to what we have up here; there's the 0.036, and there's the same thing down here. So we have both a numerical summary of the correlations, the associations between these variables, and a graphical summary, which is generally how you want to do it anyhow. Correlations are a fabulous first step for looking at the association between variables, especially when they are quantitative or continuous variables. Again, it's more flexible than that, but that's the canonical usage of correlations in Jamovi and elsewhere.

Sometimes you want to do more with less, and if you want to do as much as possible with the smallest number of tools, then linear regression is by far your best bet. It is the general-purpose data analytic tool, and many of the other procedures we look at are actually special versions of, or variations on, linear regression. The basic idea is that you take one or more variables and use them to predict scores on a quantitative or continuous outcome variable. Let me show you how this works using the state data. Again, this has information about whether the current governor is Republican or Democrat, some psychological information about the states on the Big Five personality characteristics, and a bunch of Google search information given as z-scores. Let's try to predict one variable in particular: openness. That's an interesting one. It has to do with being open to new ideas — sometimes it's associated with art, but mostly with being willing to think about new things and try new perspectives. It's a nice thing. So let's come up to Regression and do a linear regression, and we're going to pick openness as our dependent variable. It isn't really a dependent variable, because this isn't a manipulated experiment, but it's the outcome variable, and the idea is that the outcome scores depend on the other variables we put into the model. Now, I have a small data set — only 48 cases, the contiguous United States — so I can't go hog wild and throw everything in there, or I'm going to violate some of the assumptions of regression. So I'm going to be a little bit selective. I'm not going to put everything in there; I'm going to pass up some of these things.
I'm going to leave these other Big Five personality characteristics out, because they're not supposed to be associated with each other. But I am going to come and get the Google searches — again, these tell us the state's average on each term as a proportion of its total searches, given as z-scores relative to the rest of the United States. I'm going to select all of those and put them over here under Covariates. And actually, I will take one of the categorical variables too: governor. This is a dichotomous variable in this particular data set — it only has two values, Democratic or Republican. (There actually is an independent governor right now, but he's up in Alaska, and Alaska isn't in this data set because we don't have the psychological information for it.) So we can treat it as dichotomous, and I put it here under Factors. You can see that, off to the right, the regression table has filled in immediately. We can tell a couple of things here. First, the multiple correlation, the big R, is 0.831; correlation goes from negative one to positive one, so this is a strong correlation. With multiple regression like we have here, it's more common to look at the R squared — actually, the adjusted R squared, which I'll show you in a minute. R squared can be interpreted as the proportion of variance in the outcome, openness, that's predicted by these other variables, and it's 69%, which is truly kind of huge. If we want to, we can come down here and look at the individual variables and how they contribute. But keep in mind that these coefficients — the numbers you multiply each variable by — are only valid when you take this set of variables together. They don't show how much each variable is associated on its own; it's the unique contribution within the whole set. So if I dropped one of these variables out, or added another one in, all these numbers would adjust a little bit. That's important. That said, we can tell that some of these seem to matter more than others. We're looking for p values less than .05, and we're not getting much on any of these until we get down near the bottom. Scrapbook — how much a state searches for the term scrapbook compared to other states — makes a difference: states where scrapbooking is more common have a lower level of openness, at least when taken in the context of this entire set of variables. And searches for modern dance are, in fact, significantly associated with higher levels of openness, again within the context of this set of variables. Then down here, gov is the dichotomous variable with its two values, and the nice thing about Jamovi is that it can simply take a categorical variable, throw it in, and prepare it to work properly in a regression. On the other hand, we see that this one is nowhere close to statistically significant, so we don't really need to worry about it.
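If you like seeing the syntax behind this, here's a base-R sketch of the same full model; the data frame and variable names are my assumptions about the course file's labels, not Jamovi's output:

```r
# Full linear regression sketch: predicting state-level openness from
# the governor's party plus the Google-search z-scores. All variable
# names are assumptions about the course file.
model <- lm(openness ~ governor + instagram + facebook + privacy +
              university + mortgage + volunteering + museum +
              scrapbook + modern_dance,
            data = state_data)
summary(model)  # multiple R-squared, coefficients, and their p values
```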
Back in jamovi, the second block should show up in a second. Rather than dragging all of the variables at once, I'm going to try adding them one at a time. I'll click Instagram and drag it over right here. What this is going to do is first create a model that has just the governor in it, and then see how much better the associations get when we put in these other variables. I'm dragging them over one at a time, and it takes a moment to update everything: privacy, and university, and mortgage, and volunteering, and museum, and scrapbook, and modern dance. We'll see what those are able to add; it takes a moment for the results to update.
Now we get a comparison of two different models, because I entered the variables in blocks. The first model has just the governor, and it actually is associated a little bit there. If we want the specific results, we can click Model 1, and that shows us just the governor. What's interesting is that when we have only whether the governor of a state is Democrat or Republican, we actually get a strong association, and that's very different from what we had when we entered everything at once. Now let's look at Model 2, which has the other variables in addition. By the way, with just the governor we have an R squared of 0.148. That's not bad. When we throw in the other variables, it goes all the way up to .69, and that change from Model 1 to Model 2, where the R squared went up by 0.542, is statistically significant in and of itself. But look at Model 2 again: the funny thing is that once we put in all these other variables, whether the governor is Republican or Democrat doesn't matter anymore. Its contribution has changed because of the inclusion of the other variables. This echoes the results we had previously; the variables are in a different order because I dragged them in differently, scrapbook is nearly significant, and modern dance is the only one that's significant on its own within this group.
By the way, you may have noticed we have the option of variables in blocks: block one, block two, block three. That's a good way to go if you want to look at the separate or additional predictive ability of one variable or another. One thing you do not get to do in jamovi, and some people say this is a very good thing, is stepwise variable selection. You can't throw a whole bunch of variables at it and say, look, you just sort through it all on your own and tell me what works. That's an easy way to do things, but it tends to magnify weird, flukey results in the data, and you tend to get results that don't generalize well. So that's not even an option here. You can, however, do blocks, where you choose what goes in at each point and you examine what happens; I actually prefer blocked entry like this. Next, reference levels. That's for categorical or nominal variables, and you get to decide which category is the baseline and which one gets the regression coefficient. Right now Democrat is the baseline and Republican goes on top. I can flip that around, and you can see that it switches over here; it just changes the sign of the coefficient, so it doesn't change anything big. I'll close this and come to assumption checks. There are a number of things that need to be true for the results of linear regression to be meaningful.
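As an aside, the blockwise comparison we just set up is equivalent to fitting two nested models in R and testing the change with an F test. A minimal sketch, using the same hypothetical names as before:
m1 <- lm(openness ~ gov, data = states)   # block 1: governor only
m2 <- update(m1, . ~ . + instagram + privacy + university + mortgage +
               volunteering + museum + scrapbook + modern_dance)   # block 2
summary(m1)$r.squared   # around .15 with governor alone
summary(m2)$r.squared   # around .69 once the search terms are added
anova(m1, m2)           # F test on the change between blocks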
First, autocorrelation. That's when you have data like week one, week two, week three, where the week-three results are going to be at least partially associated with week one. We're not dealing with that here. But we are going to be dealing with collinearity, where the predictor variables are associated among themselves, which is what causes the associations to change dramatically when variables appear in different combinations. If no variable were associated with any other, it wouldn't matter what order you put them in, or whether they were in there with the others or not; the results would always be consistent. But because these variables are associated, the model becomes very sensitive to how you enter them. Let's scroll down and look at the two collinearity statistics. The first is VIF, which stands for Variance Inflation Factor, and the second is tolerance. You can take the two of these as an indication of how associated each variable is with all of the others. I can tell by looking at this that we do have some collinearity, or multicollinearity, between them. It's not the end of the world, so I'm going to go past it. We can also look at the Q-Q plot of residuals; that's the quantile-quantile plot. When you do a regression, a residual is the distance between the value the model predicts for a case and its actual value. It's the error, the leftover part, and in theory the residuals should be normally distributed: not always high, not always low, spread out basically like a bell curve. If our residuals were perfectly bell-shaped, they'd all fall exactly on this line, and you can see they're really close, so that part is nice. You can also get separate residual plots, one for every variable you put in, and you can get Cook's distance, which is a way of looking at how influential each single data point is. I'm going to skip over those and come down to model fit.
We have R and R squared, and if I come back up here, you'll see where those go. Adjusted R squared is probably the better choice in this situation, especially because we have a small dataset of only 48 cases, and you can see there's a change: the R squared is .690 and the adjusted R squared is .572. The adjusted value is going to be more accurate in terms of generalizing to other datasets. There's other information we could use, like the AIC, that's the Akaike information criterion, or the BIC, the Bayesian information criterion. We can also do the overall F test. Let me click on that, and you'll see it expands the information here, adding a few more columns for the overall model. We have two models because of the two blocks of entry: when it was just the governor, we had a highly significant result, and when we added all that Google search information, we got an even stronger result.
In terms of model coefficients, one thing you can do is the omnibus F test; we've basically done that. But what I would like are the standardized estimates, where we get standardized coefficients we can compare against each other, and we can even get a 95% confidence interval for each. I'll scroll over just a little, and there are the standardized estimates for each of these variables, with confidence intervals for each of them.
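If you want to run the same checks outside jamovi, plain R has rough equivalents for each of them. Here's a sketch, assuming the model m2 from the earlier block sketch; vif() comes from the car package, and everything else is built in:
library(car)                           # for vif()
vif(m2)                                # variance inflation factors; tolerance = 1 / VIF
qqnorm(resid(m2)); qqline(resid(m2))   # Q-Q plot of the residuals
plot(cooks.distance(m2), type = "h")   # influence of each of the 48 states
AIC(m2); BIC(m2)                       # information criteria for model fit
summary(m2)$adj.r.squared              # adjusted R-squared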
Looking at those confidence intervals, you can see, for instance, that for modern dance the two ends of the confidence interval have the same sign, they're both positive, which corresponds with it being statistically significant. And finally, you can get estimated marginal means. For instance, we can come down here and put governor in one of these slots, and that produces a new chart down at the bottom. While it's working on that, I'll add another term; let's put modern dance in there. You can do this for each variable if you want. What the first chart shows us is the level of openness for a state broken down by whether it currently has a Republican or a Democrat governor, and you can see there's an enormous amount of overlap between the two. The second one is more fun: when you take something that's on a quantitative scale, you get this regression line for predicted openness, with a band that's the standard error or confidence interval for the line. You can see that as modern dance becomes more common as a search term, the predicted level of openness for the state goes up as well. You could do this for each of the other variables too; it would just produce an enormous amount of output. But the general idea is that this is the set of tools jamovi gives you for linear regression, which is probably the single most important analytical procedure you have when working with data.
In the first video on linear regression, I showed you how to set up the most basic version, which is called simultaneous entry: you take all of your predictors and put them in the model at once. What it gives you is this table right here, with the t values and the p values that let you do hypothesis tests on all of your variables, but only in the context of each other. If you select different variables, or sometimes if you enter them in a different order, you can get different results. Now I want to show you some of the options jamovi gives you for this. I'm going to come back to Regression, pick Linear Regression again, and set up a new analysis. Let me scroll down a little. The outcome variable, the dependent, is openness, so I'll pick that right here. And this is where I can start choosing which variables to include in the analysis. I picked all of these, Instagram down through modern dance, which are Google Correlate terms having to do with the relative popularity of those search terms on a state-by-state basis. And I also chose gov, which is simply whether the governor of that state was Democrat or Republican when the data was gathered. We get this table right here, but what I'm going to do is come down to Model Builder, where I have the opportunity to enter the data in blocks.
Now, a very common approach, or something a lot of people want to do, is called stepwise entry, where you simply tell the computer, here are all the possible variables; go through and pick the one with the highest individual correlation, stick that in the model, then pick the one with the highest correlation after that, and so on and so forth. The problem is that stepwise models tend to build on the idiosyncrasies of that particular dataset.
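(A quick aside: jamovi computes those estimated marginal means for you, but the emmeans package gives a comparable result in R. This is a sketch of the same idea, not jamovi's internal method, again assuming the fitted model m2 from the earlier sketch:)
library(emmeans)
emmeans(m2, ~ gov)   # mean openness by party, averaging over the covariates
emmeans(m2, ~ modern_dance,
        at = list(modern_dance = -1:4))   # predicted openness along the z-score range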
With stepwise entry, you get a model that's really well tailored to the data you have, but it doesn't generalize well, and it can actually mislead you in a lot of different ways. Jamovi doesn't even give you the option. What you do have the option for is what I personally call block regression, or block entry, or blockwise regression: you set up several models that sequentially add additional variables. So what I'm going to do here is first take all of these search terms out. Right now, block one is just whether the state has a Democrat or Republican governor. You can see right here that in terms of openness, it's doing subtraction, Republican minus Democrat, and what the result means is that states with Democrat governors tend to have higher levels of openness than the ones with Republican governors; that's why we have a negative coefficient there. But let's add a second block and see how things shift around. First, I do want to come up here and include the adjusted R squared, so I'll go to model fit, add that on, and close this. So now we have our model: simply knowing whether a state has a Republican or Democrat governor predicts almost 13% of the variance in the outcome, the openness of the state as a whole. That's kind of remarkable on its own. But I'm going to come over here and add a new block, and in block two we'll put several other variables; say we pick just these last few: museum, scrapbook, and modern dance. I'll put those into block two, just as main effects. Then I can compare the model with just governor against the model with governor and these three in it.
And that's what we have right here. The first model is just governor, with an adjusted R squared of .129; it explains about 13% of the variance in the outcome, openness. When we add the other three Google search terms, the relative popularity of those searches on a state-by-state basis, it goes up to 52%. We've now explained over 50% of the variance in the outcome. That makes some sense, because openness often includes creative interests, though it can include a lot of other things as well. And this model comparisons table tells us there's a statistically significant increase, which makes sense because it's a huge jump. Then here we have the variables; this is from Model 2. If we want to look at Model 1, we just click on it and we get the one with just governor, and you can see there's a statistically significant effect there; that's the .0007 right here. But let's go back to Model 2, and now gov is no longer statistically significant. That's because it was a spurious correlation, accounted for by the other variables. Museum doesn't seem to matter either. But we have strong and statistically significant effects for both scrapbook and modern dance, two creative interests, and we see that one's negative and one's positive. In states with higher levels of interest in scrapbook as a search term, there are lower levels of openness, and in states with higher interest in modern dance, there are higher levels of openness. Again, that's what the negative and positive coefficients mean, but remember, it's always within the context of one another. Now, if we want to, we can add a third block where we take on the rest of these variables.
I'll go from Instagram and shift-click down to volunteering, and put those in the third block as main effects only; we're not looking for interactions right now. When I do that, we can see that adding those variables improved the adjusted R squared from only .525, that's 52 and a half percent, to .572, or 57.2%. And in fact, you can see that the increase from Model 2 to Model 3 is not statistically significant; the increase has a p value of 0.180. So we didn't get anything helpful by adding those extra variables; in fact, we lost degrees of freedom by doing it. That tells us we're probably best off going with Model 2, which had substantially higher predictive ability than the model with just governor, and whose results make sense. It's that theoretical interpretation that matters. So this is one way of looking at the relative contribution of different variables. It's theory-driven, because you decide what the blocks are, you enter them, and you interpret the results on your own to see whether they make sense, how applicable they are, and whether they're something you can actually put to use. That is the goal of linear regression, and of this blocked entry: to find the information that's going to be most useful in solving the problems you're dealing with.
Before you take the results of your linear regression and go running off to market making massive changes, you do want to make sure you've dotted your i's and crossed your t's: that your data met the assumptions of the model and that it's leading you in the right direction. This is a matter of checking what are called regression diagnostics, a collection of statistical and graphical methods for looking at how well your data fit the model. I'm going to do this by coming back to the same model and data we used before, where we're using several variables to predict openness on a state-by-state basis. I'll click on this model, and I'm going to come back, not to Model Builder, although that is where I specified the blocks, but down a little to assumption checks, because this is where the most important things happen.
There are a few checks that matter especially when you have data measured repeatedly over time; if you have something like quarter one, quarter two, quarter three results, then you need to deal with autocorrelation to see how much of a carryover effect there is from one time to the next. We don't have that here, so I won't deal with it. But we do need to worry about collinearity, or multicollinearity, which is what happens when the variables you're using to predict the outcome are correlated with each other. Really, the easiest way to check that is to get a correlation matrix and look at the associations between your predictor variables, but the statistics that are specific to collinearity within linear regression also tell you important things. Coming down here, it gives us two measures in particular, and we're looking at Model 2, the one I have selected. The first is VIF, which stands for the variance inflation factor, and the second is tolerance, and there's a direct relationship between the two.
What they're both getting at, in different ways, is the association of each variable with the other predictors in the model. I can tell by looking at this that we do have some overlap between them. The actual way they're calculated gets a little complicated, and that's beyond what I'm trying to do here; mostly I want you to know that jamovi can compute them for you, and you can interpret the results in ways that make your model more robust. You can also do the Q-Q plot of residuals. I've demonstrated that elsewhere, and the idea is that the residuals, the leftovers from the predictions your model makes, need to be approximately normally distributed; they shouldn't flare out at one end of the model or the other. Now, I do see that there's one down here dipping kind of low, the rest are basically on the line, and then we come off it a little bit. It's not horrible, but you may want to do a little more of a drill-down analysis to see what's happening in this particular dataset. There are two ways to do that. One is with the residual plots, which draws separate plots for each predictor variable in the model, so it's going to take a moment to catch up on all of those. The other is what's called Cook's distance, which gives you a measure of the influence of specific cases. Because we have only 48 cases in our data, the 48 contiguous United States, that's not such a horrible idea. And here are the plots we were waiting for. We see, for instance, that with openness there's a pattern in the residuals that we don't see elsewhere, so that's an issue we might be concerned about. I'll click on Cook's distance as well; it actually shows up above the charts, and it simply gives us the mean, the median, the standard deviation, and the range of those values. So you can look at the individual scores and dig into the residual plots to see whether anything deviates really sharply from your expectations or from the assumptions of the modeling technique.
I want to finish with one other thing, and that's the estimated marginal means. This is more helpful in some situations than others, but what you can do here is get a chart that shows how these variables predict your outcome. I can take governor, which is an easy one because there are only two categories, and that makes a chart down here at the bottom. It simply gives me the mean level of openness for states with Democrat governors and the mean level for states with Republican governors, and you can see that, well, the means are a little different, this dot is a little lower than that one, but the confidence intervals overlap substantially. You can do this with other terms if you want. We can take modern dance and put that in, and it's going to look a little different, because this is a quantitative or continuous predictor, whereas governor was a nominal variable. We'll scroll down a bit, and what we get is kind of like a scatterplot, but with no data points on it.
Instead, we're looking at a regression line, with a confidence interval around it, for the association between the two: how modern dance predicts the expected levels of openness. You can see that as the relative interest in modern dance goes up, to about a level of four point something, the predicted level of openness for the state goes up as well. So this is another way of seeing how well your data meet the assumptions and fit the approach of linear regression, which, after all, is one of the most common and most powerful methods of using data to model specific outcomes, and something jamovi makes incredibly easy to do and to interpret.
One really common task in analyzing data is classifying cases into one category or another based on a number of other variables. For instance, your computer tries to decide whether a particular email is spam or not, or you're trying to decide whether a particular person is likely to buy your product or not. Those are dichotomous classifications, and a common method for analyzing or predicting dichotomous outcomes is what's called binomial, meaning two names, logistic regression. It's a form of linear regression, but adapted for placing cases into one group or the other based on the probabilities predicted from your other variables. This is easiest to simply show, so I'm going to use the state data again. It has a number of variables about personality characteristics on a statewide basis, and search terms, but the dichotomous variable in here is whether each state's current governor is Republican or Democrat.
Let's start with a tiny bit of exploration so we know what we're dealing with. I'll take governor, put it over here, and get a frequency table, and we'll also get a bar plot. At this exact moment, of the lower 48 states in the United States, about two thirds have Republican governors. So we're going to see whether we can use some of the other data in this dataset to classify states, to predict which ones have Republican governors and which have Democrat governors. The way we do that, I'll just close this, is to come to Regression and then down to logistic regression, 2 Outcomes, or binomial; again, binomial means two names, or two categories. I'll click on that. The first thing we need to do is put in our dependent variable, the outcome we're trying to predict, and that's going to be gov. I don't even have to tell jamovi which value means what, because they're written as words in the dataset, and jamovi is smart enough to tell that those are categories. Then we pick some covariates. I could pick a lot here, but let's pick just the social media ones, for fun. So I'll come down to Instagram, Facebook, and retweet. By the way, the reason it says retweet is that Google Correlate wouldn't let me search for Twitter, I don't know why, but since retweet is exclusively a Twitter word, it seemed like a good substitute. I'll put those all into covariates, and those are the three variables we'll use together to try to predict which states have Democrat governors and which have Republican governors. And you can see right here, we've got a model that's actually working pretty well.
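In plain R, this same binomial model is essentially a one-liner with glm(); the column names are again my hypothetical versions:
fit <- glm(gov ~ instagram + facebook + retweet,
           family = binomial, data = states)   # gov as a two-level factor
summary(fit)   # coefficients on the log-odds scale, with z tests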
In the jamovi table, we've got the intercept, which is significantly different from zero, and our three variables: Instagram is not statistically significant, Facebook is, and retweet is not, though it's close. But there's a lot of other information we can get through the options jamovi offers, so let's take a quick look. Model Builder is where we choose how to enter things; we could put the predictors in blocks if we wanted, and I showed how that works in the video on linear regression, but I don't feel the need here. Let's look at reference levels: one of the categories needs to be taken as the baseline, and we're predicting a change to the other category. Because there are more Republican governors, let's have Democrat as the baseline and go up from there. Under assumption checks, collinearity is worth looking at, especially because I'm pretty sure these three social media terms, as Google search terms, are related. From that we get both the VIF, the variance inflation factor, and the tolerance, and there are indications that we've got some collinearity, but nothing awful. We have a few choices on how we assess fit; I'll leave the defaults, which use the deviance and the AIC, the Akaike information criterion.
Then we go to the model coefficients. A common choice when you're dealing with binomial logistic regression is the odds ratio, and it's also nice to get a confidence interval; that adds a few columns onto this table, and you can see we've got the odds ratio right there. Now, in so many other statistics, zero is the null value and things go positive or negative from there, but the nothing's-happening value for an odds ratio is one, meaning a 1-to-1 ratio. It can go below one, though it can never get all the way down to zero, and it can go up from there without limit. What you want to look at is whether both ends of a confidence interval sit on the same side of one; for Facebook they do, which corresponds with it being significant, while for the others one end is below one and the other is above. Taken together, these give us an idea of how each variable predicts the odds of a state having a Republican governor, based on the three social media variables from Google Correlate.
If we come down a little further, we can go to estimated marginal means. Facebook was significant within the context of these three predictors, so let's take that and stick it in for the marginal means, and we get a nice curved chart to go with it. What this shows is the probability of a state having a Republican governor, going from 0 to 1, that's 0% to 100%, based on the Z score the state has on Google searches for Facebook. You can see that when states search less than other states for Facebook, they're less likely to have a Republican governor, and when they search more, they're more likely to. So this is a nice way of looking at that effect, because binomial logistic regression works on a curve, with the probability changing across the predictor. This chart shows just the one variable, though the model actually uses all of the variables together to calculate the probabilities. Another nice thing about categorization tasks like this is that you can get a classification table, so I'll click on that one right here.
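(While jamovi adds the odds-ratio columns and the classification table for you, both are easy to reproduce in R, including the cutoff adjustment we'll try in a moment; this assumes Republican is the second factor level:)
exp(coef(fit))              # odds ratios: one means no effect
exp(confint.default(fit))   # Wald 95% confidence intervals on the odds-ratio scale
p <- predict(fit, type = "response")                # predicted probability of a Republican governor
table(observed = states$gov, predicted = p > 0.5)   # default 50% cutoff
table(observed = states$gov, predicted = p > 0.7)   # stricter 70% cutoff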
The classification table tells me what the model predicts each state to have versus what it actually has. And what's interesting is that you can change the cutoff value. Right now it's only getting 40% of the Democrat states correct, while it's getting 91% of the Republican ones. Let's take a quick look at what's called a cutoff plot, which charts what are called specificity and sensitivity. You can think of sensitivity as how likely the model is to flag a state as having a Republican governor when it really does, and specificity as how likely it is to hold back when the state doesn't. In many situations these two lines, specificity going up here and sensitivity coming down, cross right at the 50% point, but these ones cross a lot closer to 0.7. So what I'm going to do is change the cutoff from 0.5 to 0.7, right here. You'll see that this changes the classification table, because now it only calls a state Republican if it has over a 70% predicted chance of being Republican. That makes sense, too, because slightly over two thirds of governors in the country as a whole are Republican. You can see the cutoff line is now a lot closer to where the two curves cross over, which is usually where you want it, because that's where you get maximum utility, and it has changed the classification table: instead of 40%, we now have 73% correct for the Democrat governors, and we've gone from 91% down to 73% for the Republicans. So now it's roughly balanced in how accurate it is for the two conditions. These are some of the methods jamovi gives you for looking at the relationship between several predictor variables, like the three social media search terms, and how they can be used to predict a dichotomous outcome like Republican or Democrat governor, or any other time you have two distinct outcomes you're trying to predict.
We're now at a point where things are going to get very complicated, or at least they could. Multinomial logistic regression is potentially a very sophisticated analysis. What you're trying to do is use several predictor variables in a regression equation to predict not two categories, but several, and although that may not sound like a big change, the processing behind it becomes exponentially more complicated. Fortunately, jamovi makes it possible to do a relatively simple multinomial logistic regression. I'm going to demonstrate this with the state data, and I want to look at this one variable, psych regions, which has to do with psychological profiles of the various states in the United States. Let's start with a quick exploration: I'll get descriptives for psych regions, put it over here, and ask for the frequency table and the bar plot. This variable is based on an analysis of people's online behavior and other characteristics, and the researchers ended up with three categories: what they called friendly and conventional, which is half the states in the sample; relaxed and creative, which is ten states, or about 21%; and temperamental and uninhibited, which is 14 of the states in their sample. And here's the bar chart where you can see them. Unfortunately, the labels don't automatically adjust in jamovi at this point, but that'll happen eventually. Now let's see how we can use other variables to predict which states fall into which categories.
To predict those categories, I'm going to come back up here to Regression and go to N Outcomes, which is multinomial logistic regression. When I click on that, the first thing I have to do is pick my dependent variable, the thing I'm trying to predict. In this case it's psych regions, so I'll put that right there. I've chosen a variable with just three categories, which keeps things simple; if we had only two, we'd do binomial, and the more categories you have, the harder it gets to deal with. In terms of covariates, I'm going to pick the five elements of the Big Five personality factors: extraversion, agreeableness, conscientiousness, neuroticism, and openness. Mind you, this is not at a person-by-person level; it's at a state-by-state level. I'll slide those over into covariates, and from there we can start looking at the model it creates. It sometimes takes a moment to run, because there's a lot of calculating going on behind it.
Let me close this so I can slide the output over for a second. We have a pretty big table here. The first part is the fit measures: we have the deviance and the AIC, and you'll see that a lot of this is very similar to what we had in binomial logistic regression. We also have the model coefficients, where it's comparing the different groups, relaxed and creative minus friendly and conventional, and looking at the various personality factors, with the p values over here. We see, for instance, that when distinguishing relaxed and creative from friendly and conventional, none of these seems to make a big difference. When we look at temperamental and uninhibited versus friendly and conventional, we do get one or two: conscientiousness appears to be an important one, and neuroticism is nearly there.
Now let's look at some of the options. Click back on this and come down to Model Builder. I could put the variables in blocks if I wanted, and there may be situations in which I'd want to do that, but with just these five personality variables, I'm not going to use blocks; I demonstrated that in linear regression, so you can take a look there. Reference level is the category I want to compare everything else to. I think I'll take temperamental and uninhibited and make that the reference level, and you can see how the order switches around over here: now it's friendly and conventional minus temperamental and uninhibited, and it just changes the sign of some of the comparisons. Although, notice down here that neuroticism becomes an extremely big factor for distinguishing between relaxed and creative and temperamental and uninhibited; we'll follow up on that. We'll come down to model fit; these are the standard choices, and we'll keep them as they are. Then model coefficients: this is where we have the estimates, those are the coefficients, but it's nice to have odds ratios and confidence intervals for those as well. When I ask for those, I get a few more columns over here. You can ignore the row for the intercept, since we don't expect anything in particular there; it's the rest of the rows we're interested in. And look, for instance, at this second one here, the bottom one: relaxed and creative minus temperamental and uninhibited.
We're looking at neuroticism, which has to do with having a lot of mood swings and getting irritable, and we have a significant association here. The odds ratio is 0.73; remember, the null value for an odds ratio is one, and if it's clearly below one or above one, you have something going on, especially if both ends of the confidence interval are on the same side of one, as we see here.
But this is all going to be a lot easier to interpret if we get the graphs, so I'm going to come down to the estimated marginal means. That'll take a moment, because I'm going to put all of the predictors in, one at a time: I'll take extraversion and put it over here, then add a new term for agreeableness, a new term for conscientiousness, a new term for neuroticism, and a new term for openness. Then I'll close this for now, and what we're waiting for are these really pretty graphs down at the bottom. What they are are lines that show the probability of membership in each of our three regions, each shown in a different color. This gray line is friendly and conventional, and this chart shows extraversion, where the mean is about 50 and it runs from 30 up to 70. As extraversion goes up, there's a much higher probability that a state is in the friendly and conventional category; when extraversion is low, the state is probably relaxed and creative; and temperamental and uninhibited is just kind of low all the way through on this one. For agreeableness, friendly and conventional is really high all the way through, and you can see the temperamental and uninhibited are low on agreeableness, so they're a little cranky; as agreeableness goes up, it becomes less likely that a state is one of the temperamental and uninhibited. Conscientiousness has this amazing crossover: as conscientiousness increases, the probability of a state being friendly and conventional increases dramatically, and we have the exact opposite, much lower levels, for temperamental and uninhibited, while it doesn't seem to figure much into relaxed and creative. Then neuroticism has this really peculiar, huge crossover: temperamental and uninhibited climbs as neuroticism gets higher, relaxed and creative drops, and there's a funny little peak in the middle for friendly and conventional. And with openness, the friendly and conventional states start out high at the low end and drop off as openness increases, while the other two pick up, so there's a tradeoff there. So this is a really neat way of seeing a visualization of the probabilities, of how each factor goes into determining whether a state falls into one of these three categories. This is probably how you'll get the most information, and the most insight into your data: through these marginal means graphs. But you also get an idea of which predictors really matter overall by looking at the calculations and the numerical summaries in the multinomial logistic regression.
One final variation of regression that we can get in jamovi, which is kind of surprising considering it's not always available in other programs, is ordinal regression, or specifically ordinal logistic regression.
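For reference, the multinomial model we just built can be sketched in R with the multinom() function from the nnet package; psych_region, the trait names, and the level name in relevel() are all hypothetical stand-ins here:
library(nnet)
states$psych_region <- relevel(states$psych_region,
                               ref = "temperamental_uninhibited")   # choose the reference category
fit <- multinom(psych_region ~ extraversion + agreeableness +
                  conscientiousness + neuroticism + openness,
                data = states)
summary(fit)    # one row of coefficients per non-reference category
exp(coef(fit))  # odds ratios relative to the reference category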
The idea with ordinal regression is that you're trying to use several variables to predict where a particular case falls in ordered categories, say from lowest to middle to highest. You could theoretically have an entire ordinal scale where you rank every case from one to however many there are, but ordinal logistic regression is going to be most useful when you have a small number of ordered categories. Another time it comes in really handy is when you have data that deviates badly from normality. I'm going to use the state data again; among other things, it has how much different states search for different things, and one that's kind of interesting, here at the very end, is searching for modern dance. Let me show you what this looks like, because then you'll understand why the procedure is necessary. If we put modern dance right here, it'll give us some statistics, but what we're really interested in is the plot, the density histogram. We have data from the 48 contiguous states, and you can see it's mostly normal, but man, we've got this big gap and then this bump way up here. I happen to know that this is Utah, where I live. Utah, it turns out, leads the national mindshare for dance, which is kind of shocking. Anyhow, this is not a normal distribution, and outliers like this can really cause problems for a regular analysis. So reworking the data, transforming it and possibly putting it into categories, might be one way of dealing with this, and ordinal logistic regression makes that possible.
What I want to do is split the variable up into categories. Let me come back to where I was for a moment, go to statistics, and ask: what are the cut points that would create four equal-sized groups? In other words, quartiles. What this says is that the minimum Z score is -1.4; the 25th percentile, the first quartile cutoff, is about -0.68; the 50th percentile is about -0.23; the 75th is about +0.45; and the maximum goes all the way up to 4.7. I think that's Utah way up there. Now, I'm going to round these off a little. What I did is I came in here and created a new variable I'm calling modern dance quartiles, and what it says is: if the Z score is less than -0.5, put the state in 1, that is, the first quartile. If modern dance is less than zero (but above -0.5), put it in the second quartile. If modern dance is less than +0.5, put it in the third quartile. And otherwise, put it in the fourth quartile. So that's a way of nesting IF statements to create the categories. Once I've done that, I can come back and do an exploration again, this time for the quartiles: I put modern dance quartiles right there, get a frequency table and a bar plot, and you can see that it's a little better behaved, because we've reined in the outliers and created four groups of approximately equal size. They're pretty close; the reason they're not exactly the same is that I intentionally chose cutoff points that seemed a little more sensible. But now we have an ordinal variable: the states are categorized into the lowest quarter, the second lowest, the second highest, and the highest quartile in terms of their relative search interest in modern dance. So I'm going to come to Regression and choose the last option here, Ordinal Outcomes.
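Those nested IF statements translate directly into nested ifelse() calls in R, or, more compactly, into cut(); the column names are again hypothetical:
states$md_quartile <- ifelse(states$modern_dance < -0.5, 1,
                      ifelse(states$modern_dance <  0.0, 2,
                      ifelse(states$modern_dance <  0.5, 3, 4)))
# equivalently:
# cut(states$modern_dance, breaks = c(-Inf, -0.5, 0, 0.5, Inf), labels = 1:4)
table(states$md_quartile)   # four roughly equal-sized ordered groups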
When I open Ordinal Outcomes up, the first thing I need to do is specify the dependent variable, the outcome we're looking at, and that's going to be modern dance quartiles; I'll come back up here, select it, and click it over. Then we put in covariates. We could put in a lot of variables, but because we have a small sample, only 48 cases, I'm going to limit it. Modern dance, because it's an art form, should be most associated with openness, because in studies of the Big Five personality characteristics, openness seems to have something to do with art. So I'll click openness and put it over here in covariates. And then, to show you how to use a factor, I'm going to take psych regions, because that includes friendly and conventional, but also relaxed and creative, which seems like it would be relevant to something like the performing arts.
When we run those, we get this table right here. It tells us about the deviance and the AIC, which are methods of assessing the fit of the overall model, and then we have the predictors. Openness, it turns out, is significantly associated with the modern dance quartiles, which is what we were hoping for: openness, being open to new ideas and possibly to esthetic experiences, seems to be associated with the relative number of searches a state makes for modern dance. We also see that the psych regions matter: relaxed and creative versus friendly and conventional makes a difference, and what's interesting is that relaxed and creative does more than friendly and conventional does, that's because this coefficient is positive, and temperamental and uninhibited also does more than friendly and conventional does.
Let's take a look at some of the options. In Model Builder, if I wanted to put things in separate blocks, I could. Then there are the reference levels. Under model fit we have some choices: the deviance, the AIC, and McFadden's R squared all seem like fine ideas, so I'll leave those. And then the model coefficients, that's this table right here; I'm going to throw one more thing onto it, the odds ratios and the confidence intervals for the odds ratios. The null value for an odds ratio is one, so if the confidence interval is consistently below one, or consistently above it, that lets us know we have something important. Let me close this so we can see more of the output, and I'll drag this over. For openness, both ends of the confidence interval for the odds ratio are above one, so you know it's significant. But the really amazing one is here: temperamental and uninhibited states versus friendly and conventional states have an odds ratio of 17.54. That's really big, and the confidence interval goes all the way up to 91. So this is a very influential difference in trying to predict a state's ranked category, the quartiles one through four, for searches for modern dance on Google. And it turns out this is actually a really easy analysis to set up.
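A comparable model in R is the proportional-odds logistic regression in MASS::polr; treat this as a sketch of the same idea rather than a line-for-line match with jamovi's output, using the recoded quartile variable from the earlier sketch:
library(MASS)
states$md_quartile <- factor(states$md_quartile, ordered = TRUE)
fit <- polr(md_quartile ~ openness + psych_region, data = states, Hess = TRUE)
summary(fit)    # coefficients plus the thresholds between quartiles
exp(coef(fit))  # odds ratios for the predictors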
We could theoretically have a lot more options available to us, but jamovi keeps it pretty simple, and this is a good way to get started with ranked information, as opposed to the outliers we had in the original distribution, using a regression model with both a quantitative or continuous predictor and a categorical predictor to find out where states fall on the ranked continuum we created.
In this section on analyzing frequencies in jamovi, we get a new way to deal with data. Specifically, we're now dealing with counting as opposed to measuring. Measuring is when you ask how long it takes to do something, or what the average score is. Counting is like counting a number of rocks: you're able to enumerate the cases and put the frequencies in as your data, and that requires a different approach. In jamovi, we have several options for analyzing frequencies. The first and simplest is what's called the binomial test. Binomial means two names, and it's for when you have two categories of outcomes, like flipping a coin and counting the number of heads and the number of tails. If you have more than two outcomes, say you have baskets of several different kinds of mushrooms, you can use the chi-squared goodness of fit test to see how things are distributed across those multiple categories. If you want to look at two categorical variables at once and examine the association, say something like the number of World Cup teams from different countries broken down by men's and women's teams (you know that in soccer the American women have done fabulously well, while the American men didn't qualify for the most recent World Cup), that's an association between two categorical variables, and you can analyze it with what's called the chi-squared test of association. You also have something called McNemar's test. This is a relatively unusual one: it's for tables where you have related data, paired observations, like data from twins, because the pair goes together and you have to take that connection into consideration. Likewise, if you measured the same person at time one and time two, the frequencies have a built-in association and you need to account for it; McNemar's test lets you do that. And finally, there's log-linear regression. It accomplishes a lot of the same things, but it uses a different mathematical approach: it's based on regression, and there are some pretty serious equations behind it, but it's often a very flexible and very powerful technique for modeling the number of cases, the observations or frequencies, within the different cells of your data. Taken together, this gives us a great range of options for analyzing counts and frequencies in jamovi.
Probably the simplest data situation is trying to analyze just two possible outcomes. You can call that dichotomous data, because it's split into two pieces; you can also call it binary. Here, we're going to look at the binomial test, which literally means two names: think of it as heads or tails, on or off, yes or no. In this example, I'm going to use the state data, and I'm simply going to look at governor, because that's the only binary variable I have in here. And let's do this first: before we do a binomial test, let's get a little bit of descriptive statistics.
I'll come over to exploration, go to descriptives, and all I need to do is take governor and put it over here into variables. It's going to show me some things I don't really need, so I might as well remove the statistics that only work for quantitative or continuous variables. What I do want is a frequency table, and a bar plot. The two of those tell me what percentages I'm dealing with in terms of governors from state to state. We have 48 states represented, because remember, this data is only for the lower 48 states in the United States, and we have 15 Democrat governors, as of about a week ago, and 33 Republicans. And here's the bar chart that shows the difference; it's easy to see it's about 2 to 1.
In a binomial test, what we're trying to tell is whether our observed proportions differ significantly from some value. In a lot of inferential tests, it's obvious that the null value is zero; in the binomial test, you have a few choices. Let me close this and come over to Frequencies, and the first thing we'll look at is 2 Outcomes, the binomial test; it's under one-sample proportion tests. When I click on that, it's actually a really simple dialog box. All I need to do is pick the variable I want; I'll take gov and move it over into this box. You can tell it's looking for variables coded in jamovi as either nominal, that is categorical, or ordinal. It's already run the test, and it tells me that these proportions are both significantly different from a null value of 50%. I can also ask for a confidence interval, which is a really nice idea in this particular case: it lets me know that the percentage of Democrat governors has a 95% confidence interval that goes from 18.7% to 46.3%, and we have the flip side of that for the Republican governors.
Now, I'm going to close this for a second so I can open it up again. That first run used a null value of 0.5, or 50% each, and maybe that's not the value we want. Right now we have 68.8% Republican governors; let's say we want to compare that to ten years ago. I don't actually know what the proportion of Republican and Democrat governors in the 48 contiguous states was ten years ago, but let's just say, for instance, it was 60%, and we want to know whether there's been a change in the last ten years. I'll keep the confidence interval, set the test value to .6, and now it's going to compare these two proportions to 60%. What we see here is a little different. In the first case, because we were comparing to 50%, and the two observed values (which have to add up to 100%) are equidistant from 50%, the p value was the same for both of them. On the other hand, when we move the null value away from 50%, in this case to .6, or 60%, the p values are very different. We find that the current observed proportion of Republican governors, 0.688, which is the same as 68.8%, does not differ significantly from 60%; on the other hand, the proportion of Democrats does. So you get a big difference depending on how you set the null value, and which value you use depends on your question. If it's something that really should be just as likely yes as no, then the default value of .5 is good.
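The same pair of tests is one line each in R, using the counts we just saw, 33 Republican governors out of 48 states:
binom.test(33, 48, p = 0.5)   # observed 68.8% vs. a null of 50%: significant
binom.test(33, 48, p = 0.6)   # observed 68.8% vs. a null of 60%: not significant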
On the other hand, if you're looking at, say, a game that has four players, then the expected value for any one player winning would be .25, and there can be other values depending on what you think is most likely to happen. But no matter how you do it, the proportion test is a very easy test to run in jamovi. It handles the simplest kind of data, a simple yes/no outcome, and it gives you both an inferential test with a probability value and a confidence interval.
When you're counting the number of cases in either of two categories, you can do the binomial test, a very simple test we've shown elsewhere. But when you have more than two categories, still in a single variable or factor, you get to use the chi-squared goodness of fit test. It's kind of a silly long name, but it's the test you use for comparing multiple categories on the same variable. Let me give you an easy demonstration using the state data, which you can download from datalab.cc. In this example, I'm going to use the variable called psych regions; it's based on psychological research that built profiles of states in the United States. We'll start with an exploration to get basic descriptives. All I'm going to do is pick psych regions, move it over to the variables, and change the statistics; we don't need the ones that are for quantitative or continuous variables. There we go. I'll get the frequency table and a bar plot, really simple stuff. Of the 48 contiguous states the researchers looked at, they classified 24 of them, exactly half, as friendly and conventional; that's our biggest bar right here. I'm sorry the labels are overlapping; I'm sure that will be fixed in a later release of jamovi. The next category is relaxed and creative, which includes ten of the 48 states, about 21%, and the last category is temperamental and uninhibited, with 14 of the 48 states, or about 29%.
So we can ask, for example, whether the states are evenly distributed across these three categories. You can tell by looking at the bars that they're obviously not all the same; the question is whether that difference is statistically significant, and that's where we do the chi-squared goodness of fit test. There's an important choice we get to make, which is how we define our null values; by default, it's going to assume equal frequencies, or equal proportions, in each category. Let's go to Frequencies and then to N Outcomes; that means more than two outcomes, an arbitrary number. And that symbol right there is chi-squared: it looks like an X, but it's actually a Greek letter called chi. Chi-squared goodness of fit; we'll click on that one. All we need to do is take our variable, psych regions, and put it into the variables. This option for counts is for when you have a summary table in your dataset, and I'll show how that works in a later video; because we have raw data, where everything is listed as one row per case, which is most common, that's the kind we're going to use here. What we get over here is a proportions test, which says that a proportion of 0.50 of the states are in the first category, 0.21 in the second, and 0.29 in the third, and then the chi-squared goodness of fit gives an inferential test: the calculated value of chi-squared is 6.5, which with two degrees of freedom gives us a p value of .039.
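In R, the same test takes the observed counts directly and reproduces these numbers; the second line previews the custom expected proportions we'll set up next:
chisq.test(c(24, 10, 14))                         # equal expected counts of 16 each: X-squared = 6.5, p = .039
chisq.test(c(24, 10, 14), p = c(.60, .25, .15))   # custom expected proportions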
The standard cutoff for p, the probability we use in significance testing, is 0.05, so this would be a statistically significant finding. But I want to show you two other things we can do. One is: what exactly is the algorithm comparing these values against? It's comparing them against expected counts, and the way it gets the expected counts right now is to take however many cases we have and split them evenly across the number of categories. Well, 48 states can be divided evenly into three categories of 16 each, and what the chi-squared test does is look at the deviations between 24 and 16, 10 and 16, and 14 and 16, and do sums based on those; that's where the value of chi-squared comes from. So if we're asking whether there are exactly the same number of states in each category, we have a result that differs significantly from that expectation.
But let me close this and do another analysis where we're not testing strict equality across the three. There are lots of situations where you expect a different number of people in each category. For example, if you're looking at the number of left-handed and right-handed people, there are more right-handed people overall, so you don't expect an even split. There's a way to specify values other than strict equality. Let me put psych regions back in here, and we'll still get expected counts, but now I'll click on this menu to get expected proportions. As you can see, by default it splits things evenly across the three, 33.3% each, but I can change that. You enter ratios here, which can be things like 2 to 1 or 3 to 1, or if you want, you can enter them as percentages. So let's have 60 here, 25 here, and 15% here; I think that adds up to 100. Those give us the values we'd expect in each of these conditions. And now you see our expected values are different. Before, the expected count was 16 in each case; here, the expected values change from one category to the next according to the values I gave it. It's still a statistically significant result; in fact, there's a slightly greater deviation than we had before. The idea is that we have one factor, which personality category a state is put into, with three options, the friendly and conventional, the relaxed and creative, or the temperamental and uninhibited, and depending on how we set up our null values, we can compare our observed frequencies, how many states are in each category, to what we'd expect if the null hypothesis of random variation were true. In both cases here, we found a statistically significant deviation, using the proportions test and the chi-squared goodness of fit test.
In jamovi, if you want to look at the association between two categorical or nominal variables, the most common choice is the chi-squared test of association, also called the chi-squared test of independence. Let me demonstrate how this works, starting with split frequencies. We'll come up to exploration and descriptives and look at psych regions; this is in the state data set that I supplied.
In Jamovi, if you want to look at the association between two categorical or nominal variables, the most common choice is the chi-squared test of association, also called the chi-squared test of independence. Let me demonstrate how this works by starting with a split frequency. Let's come up here to exploration and descriptives and look at, for instance, psych regions; this is in the state data set that I supplied. I'm going to put that here under variables, but let's split it by whether a state has a Republican or Democrat governor, because it's possible that there is some association between the political party of the governor and the state's personality as rated by the researchers.

Right now, all it's telling us is that there are 48 cases in the data. I'm going to remove the extra statistics that I don't need, and now the table is a little bit smaller. I do want a frequency table, so I'll click on that, and this is going to tell me how many states there are in each combination. And then I'll also ask for a bar plot at the same time. What we have here is that there are 20 Republican governors of states that are considered friendly and conventional, and six Democrat governors of states that are considered temperamental and uninhibited, and we can see that in the chart down here. Again, I apologize for the overlapping labels; I'm sure that'll be fixed in a later version. The blue bars are Democrats and the yellow-gold ones are Republicans. And what you see is this huge spike for friendly and conventional: the vast majority of those states have Republican governors, while the other categories appear to be somewhat split, with a few more Republicans among the temperamental and uninhibited.

But let's find out whether this difference is statistically significant, whether there is a statistical association between the personality of the state and whether the governor is Democrat or Republican. We'll use the chi-squared test of association for that. Let's come up here to frequencies, and then I come down to independent samples; what that means is that we have different groups of people, or different groups of states, in each of these categories. It's the chi-squared test of association, also called the test of independence. I'll click on that, and here's what it's going to ask me: I need to give it the rows and the columns for what's called a contingency table. That's just a table of rows and columns. So I'm going to start by putting psych regions as the rows, and I'm going to put governor as the columns; that will mirror the table I already created. I could add extra layers, but that gets really complicated and hard to interpret. What that would mean, by the way, is putting another categorical variable here, so we'd end up with a three-dimensional table, and you don't want to deal with that; it's too hard.

We have a choice of statistics. Chi-squared is just fine for what I have here. These ones, the comparative measures, including the confidence intervals, only apply when you have what's called a two-by-two table, meaning two rows and two columns. We have three rows and two columns, so those don't apply to what we're doing. We have some other choices: we can get a contingency coefficient, or phi and Cramér's V. Let's get the contingency coefficient; that's like a correlation coefficient. Here at the bottom we can do some ordinal statistics, but we have categorical data, not ordinal, so I'm going to leave that alone. And then we have some choices about what we put in the table. Right now, what we have are just the observed frequencies, the number of states that fall into each combination. We can also put in the expected frequencies, and that's important, because that's what the chi-squared test is comparing these observed values to.
So, for instance, up here in friendly and conventional, we have four states with Democratic governors, but if the data were completely randomly distributed, it would be seven and a half. On the other hand, down here we have eight states that are temperamental and uninhibited and have a Republican governor, but we would expect, if things were random, to have 9.3. The way it gets that, by the way, is by multiplying the column total by the row total and then dividing by the grand total, for each of the cells. But the important part stays the same: down here we have our chi-squared test, and we get a value of chi-squared of 4.89, which with two degrees of freedom gives a probability value of 0.087. Now, that's something that's not very likely to happen by chance, but because it's greater than the standard cutoff of 0.05, we declare that, even with this really big difference right here, these results are not statistically different from what we would expect through random variation. So even though it looks like there's a big difference, it doesn't hold up under null hypothesis testing. And that's one of the values of doing the chi-squared test: even something that looks, at an eyeball level, like a big result may tell you a different story when you do the actual numbers and get the inferential test, which is exactly what happened in this situation.
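Here's roughly what that looks like in jmv syntax; again, states, psych_regions, and governor are stand-ins for the actual names in the state data.

    library(jmv)

    # Chi-squared test of association on a rows-by-columns contingency
    # table. Each expected count is (row total * column total) / grand total.
    contTables(data = states,
               rows = "psych_regions",
               cols = "governor",
               exp = TRUE,       # show expected counts alongside observed
               contCoef = TRUE)  # contingency coefficient, like a correlation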
One of the things I really love about Jamovi is that it includes some really unusual, specialty tests and makes them really easy to use. One of these is McNemar's test, which is designed for contingency tables, but instead of independent observations, it's for repeated measures. Now, I have to admit, I've been doing research for 30 years and I've never had to do this, but it's always good to know that when you come up against this kind of question, you can get the right tool, which gives you greater power and precision in your research. So McNemar's test is for categories, and it is for frequencies in those categories, but where the measurements are repeated.

What I'm doing here is a little bit different from my other examples. In those, I've had one row per observation, and so I've had these long data sets. Right now I'm using a summary table, and it's kind of nice that Jamovi lets you do this in a number of situations; that way I can do this instead of having 85 rows of data. The data I'm using actually comes from a table that I saw in the Wikipedia article for McNemar's test, and it's about patients with Hodgkin's disease and whether they and their siblings had tonsillectomies. I don't know what the association between the two would be, but it's an interesting example, because the sibling goes with the patient, so there's a connection between the two of them.

Let me show you how we set this up. To do McNemar's test in Jamovi, we come to frequencies and we come down right here; Jamovi calls it the McNemar test (Wikipedia makes it possessive, McNemar's test). I'm going to click on that. What we need to do is simply tell it which variable has the row labels, which variable has the column labels, and then which variable has the counts, the actual frequencies themselves; that's because I'm setting this up as a summary table. So I'm going to take patient and put that in rows, and you can see how it starts to fill in the table right here. Then we take sibling and put it into columns, and it fills in the table, but it doesn't have any values yet, except to say that, in theory, there's just one of each.

But then I take N, the variable that tells you how many cases there are in each category, and I put that right here, and it fills in the table. And it does a couple of things. Number one, it shows us that there are 37 cases where the patient with Hodgkin's did not have a tonsillectomy and neither did the sibling, and 26 where they both had tonsillectomies, so there's an association in that way. We also have the actual result here: McNemar's test is based on the value of chi-squared, and it's interpreted in the same way. The value of p, the probability of an effect this size if the null hypothesis is true, is .088. That doesn't meet the standard levels of statistical significance, though it's in that direction.

I do want to show you a couple of other things that can be really helpful anytime you're dealing with a table, and especially when you have samples of different sizes. You see we have 44 Hodgkin's patients who did not have a tonsillectomy and 41 who did. Those numbers are really close to each other, so they're easy to compare. But for the siblings it's 52 and 33, and that makes the comparison a little more difficult. So what you can do is get row and column percentages. Let's do just row percentages first. What this means is that for the patients who did not have a tonsillectomy, 84% of their siblings also did not have a tonsillectomy, whereas 16% did. They add up to 100% going across, and that's true for every row; the totals row tells us that 61.2% of the siblings did not have a tonsillectomy and 38.8% did.

Independent of that, you can get column percentages. I'm going to turn off the row percentages for just a moment and get column percentages, and now everything adds up to 100% going down. This is what's nice when you have different overall frequencies. We have 52 siblings without a tonsillectomy and 33 who had one, and so we can say that, of the siblings who did not have a tonsillectomy, 71.2% were paired with patients who also did not, and 28.8% with patients who did, and we can compare that to the percentages over here. That allows us to make a sort of apples-to-apples comparison, even though the total frequencies are different, because often you're more concerned about the rates. And that's something you can do with any kind of chi-squared table: the binomial, the chi-squared goodness of fit, and the chi-squared test of association. Getting the percentages is often easier to interpret, and the same thing is true of McNemar's test, which allows you to do contingency tables for paired data.
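Because this analysis starts from a summary table, the syntax sketch needs the counts argument. Here's a hedged version in jmv, building the Wikipedia table by hand; hodgkin is just my name for the little data frame, and the two off-diagonal counts (7 and 15) follow by subtraction from the totals above.

    library(jmv)

    # McNemar's test from a summary table: one row per cell of the 2x2
    # table, plus a column of counts (the frequencies themselves)
    hodgkin <- data.frame(
      patient = c("none", "none", "tonsillectomy", "tonsillectomy"),
      sibling = c("none", "tonsillectomy", "none", "tonsillectomy"),
      n       = c(37, 7, 15, 26))

    contTablesPaired(data = hodgkin,
                     rows = "patient", cols = "sibling",
                     counts = "n",
                     pcRow = TRUE,   # row percentages
                     pcCol = TRUE)   # column percentages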
Another surprising but useful option in Jamovi for analyzing frequency data and the associations between frequencies is log-linear regression. What this is is a form of regression where, instead of trying to predict a person's score on a quantitative or continuous outcome variable, you're trying to predict the number of observations, the frequencies, within the cells of a contingency table. It's a powerful method, although it can be confusing to read the output. My goal here is simply to show you that it's easy to set this up in Jamovi. If you decide this is something you need, then you're going to want to consult other resources on how to design and how to interpret a log-linear regression; but calculating it in Jamovi is very simple. Let's do this.

Let's look at the association between two of the categorical variables in the state data: psych regions, where the 48 contiguous states of the United States are classified according to their personality characteristics, whether they're friendly and conventional, temperamental and uninhibited, or relaxed and creative; and whether that state currently has a Republican or Democrat governor. It's not the most compelling example, but it's a useful comparison. What I'll do is come here to frequencies and come down to this last option, which is log-linear regression. When I click that, it asks me for the factors. Now, please note, it's not asking which one is the predictor and which one is the outcome, because the model really is kind of symmetrical; it doesn't matter. It's just trying to put together the entire table without necessarily saying this one causes that one. So what I'm going to do is come here, get psych regions, and move that over, and then also get governor and put that over. Because I have raw data, one row per observation, I don't need to fill in the counts. And here are my initial results.

Let me scroll this over a little bit; it's kind of a big table. You can see, for instance, that we have an intercept at the top. I'll close this so we can look at the whole thing at once. What we have here is a collection of coefficients, like a regular regression, and for each we have the estimate, the actual predicted value for that coefficient; its standard error; the z score, which is the first divided by the second; and the p value that goes along with it. Now, we aren't surprised that the intercept is significantly different from zero. The two coefficients for psych regions are not significant. For Democrat versus Republican, we have a major effect, and then we have a nearly significant interaction, relaxed and creative versus friendly and conventional by Republican versus Democrat, in terms of trying to reconstitute the frequencies within the table.

Now, let me click on this and look at a couple of our other options. By default, it gives us one factor for psych regions, the other factor for governor, and the interaction, which is what we want; that gets a more nuanced model, even when we're dealing with a rather small three-by-two table. In terms of the other options, we get to pick our reference levels. We can pick, instead, temperamental and uninhibited, and that's going to switch the way the table over here is displayed. And if we want to make Republican the default value for governor, we can do that as well, and that's going to change the way some of these values are calculated over here. So now you can see, for instance, this value up here, friendly and conventional versus temperamental and uninhibited, is now statistically significant, and this value here, which compares Democrat and Republican, has changed a little bit. Overall, it's going to give you the same model; it's just going to parcel it out differently.

You have several choices in how you evaluate model fit. We'll just leave it with the standard deviance and AIC, as well as McFadden's R squared; those show up here in the table at the top. Actually, I am going to add the overall model test. That's going to give us a few more columns here, with a chi-squared test that allows us to do an inferential test for the model.
And here we have a chi-squared value, and we can see that there is a significant deviation between the observed frequencies and the expected, or predicted, values. We can also come down and get some of these other values, say a confidence interval for the estimates; that's going to add another two columns over here, in case you want a richer picture of the values you have. And then we can come down and get estimated marginal means. I find it handy to do this, because it gives us charts. I'm going to add a new term, and I'll put governor right here; I just click that in, and that's going to give us marginal means plots. This is probably the single best way to look at the results of a log-linear regression: to come down and see the counts that we get for psych regions, along with their confidence intervals, and the same for governor. There's more that you can do with this, and there are some important distinctions to be made when conducting a log-linear regression. But the amazing thing is that Jamovi includes this as an option in a free, open source, and user-friendly program. And so, depending on the nature of your data, you could do either the chi-squared test of association, or independence, or you could do something a little more sophisticated with the log-linear regression. The great thing is that Jamovi gives you the choice of one or the other.
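A minimal jmv sketch of the same model follows; I'm assuming the modelTest argument for the overall chi-squared test, so double-check the names against what your jmv version's syntax mode produces.

    library(jmv)

    # Log-linear regression on the psych-regions-by-governor table.
    # With raw data (one row per state) no counts variable is needed;
    # both factors and their interaction go into the model by default.
    logLinear(data = states,
              factors = c("psych_regions", "governor"),
              modelTest = TRUE)  # assumed name for the overall model test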
Our next topic in Jamovi is factors. The idea here is that data is a jungle: it can be confusing and it can be overwhelming. And truthfully, sometimes less is more. Specifically with data, that means going from, say, a thousand variables to maybe a dozen or two that you need to deal with. It's much easier to find order and meaning in that, and you also tend to have more reliable and stable measurements. The first procedure we'll look at in Jamovi is reliability analysis, because, you know, sometimes you want to talk about the flock, the whole, and not each individual sheep. Perhaps you want to combine your variables to get a single scale score. But to do that, you first need to make sure that you're dealing with similar variables, that they're measuring similar things, and reliability analysis, whether with Cronbach's alpha or McDonald's omega, can allow you to do that. Or, if you're trying to find the main dimensions that make up your data, like finding the main streets in a city, you can try a principal component analysis or the closely related exploratory factor analysis, which, while based on different mathematics and a different assumption about the relationship between observed variables and implicit factors, accomplishes basically the same thing. Both of these approaches are exceptionally easy to set up and interpret in Jamovi. Or maybe you already know the bins that you want the variables to go into, and in that case you can use confirmatory factor analysis, where you put together the factors and the observed variables that go into them, and you see how well the variables match up with those factors and how well your model explains the covariance in your data. Whichever technique you use, it'll help you get some clarity in your data, so you can start on the path towards insight and action.

In design you'll hear the saying "less is more," and the same thing is true, to a certain extent, in data analysis. Instead of trying to analyze 50 different variables, why not collapse them into a smaller number of more manageable variables? That's going to give you more reliable information and more stable insights. But in order to do this, you first need to establish that you can collapse the data, that there are correlations, that you're combining like with like. The most common way of doing this is with a reliability analysis, which looks at the relationships among several variables that you would like to combine. This is very common if you're working with survey data and you're asking several questions about the same general concept or construct.

I have a data set here which is based on real data on the big five personality factors. We've talked about those before, because Jamovi has a built-in data set, but that one had the summary data; it was already collapsed. This data set has the ten questions for each of the five scales: extroversion, agreeableness, openness to experience, conscientiousness, and neuroticism. So we've got 50 personality variables here. It's originally from an open source data set that has nearly 20,000 cases. In the downloads for this chapter there's a folder called Big Five, which gives the references and the notes for this data set. What I've done is randomly selected 1,000 cases that we're going to work with to demonstrate reliability, and really the scale functions, of Jamovi.

So let's start by seeing what we have here. We have a data set with an ID number, a row ID that I put in there; the age of the respondent; their reported gender; and then a whole bunch of questions that they rated themselves on. The data information that you can download contains a description of what each of these questions was, but they're all rated on a 1-to-5 response scale. The labels up here mean, for instance, that N4 is the fourth neuroticism question, N5 is the fifth, and so on; there's E for extroversion, A for agreeableness, C for conscientiousness, and O for openness, which takes us down to O10. So there are ten variables for each factor, and we want to see if we can combine them.

We're going to come up here to Factor. Now, factor is an umbrella term, because these procedures have a lot in common; they all fundamentally have to do with whether there is correlation or covariance among the variables that you're looking at. But we're going to start with this one, scale analysis, a scale meaning a questionnaire or survey where you have multiple questions designed to measure, or scale, the same thing. So we'll hit reliability analysis, and here are our options. It simply asks us what the items are, the questions that are supposed to go into the same scale, that are fundamentally measuring the same thing. Now, the nice thing is the information we have is already labeled E1 through E10; these are ten questions designed to measure extroversion as opposed to introversion. So let's just take those ten and move them over.

When we run the reliability analysis, this is our default table, and it gives us something called Cronbach's alpha. That's a lowercase Greek alpha, α, and it's like a correlation coefficient. What you're hoping for is a positive value that's close to one, maybe 0.7 or higher. We have a negative value, which is really bad; it's a very low value. This is happening because we've got some funny things going on with the data, but that's something we're going to be able to fix. First, though, let's do a couple of things really quickly.
I want to come back to the scale statistics, and I want to click mean and standard deviation. Right now we have a mean that's really close to three, and on a 1-to-5 scale that's the midpoint, so that's great. The standard deviation, there it is. I also find it really helpful to come over here to the item statistics and get Cronbach's alpha if the item is dropped. We've got ten items in this scale, and what it does for each of them is say: well, if you got rid of that one and kept the other nine, what would alpha be? And you see it bounces around, but it's always negative.

A graphic can be really handy here too. Let's click on this one, correlation heatmap. It's kind of a cute thing, especially when you have positive and negative correlations. We'll scroll down, and in a second the correlation heatmap will pop up. What happens is that when a correlation is negative it shows up as red, and when it's positive it shows up as green, and we've got this nice little checkerboard pattern. The reason that's happening is exactly what our analysis told us: it looks like five of these variables are reverse scaled. That's something you do occasionally when you're doing surveys, to make sure people are paying attention; you flip the wording around. If we were in SPSS or some other program, we would then have to go and manually reverse those variables and run this all over again. But fortunately, the package this is based on, the psych package in R, makes it really easy to handle. All we have to do is scroll down here and tell it which ones are reversed, and it tells us right here: E2, E4, E6, E8, E10. So I'm going to double-click E2, then E4, E6, E8 and E10, and look what happens over here: it recalculates the alpha for the scale. You can see, for instance, that we've got these little a's to indicate that these variables have been reverse scaled. And look what's happened: Cronbach's alpha has gone from a negative value to a positive 0.89. And you can tell that dropping any one of these variables is not going to help it; overall, this is a very tight group of variables.

So let's come down and look at the heatmap. Now it's all green; everything is correlated with everything, and that lets us know that we are now safe to average these ten variables. Instead of having ten variables that measure extroversion, we can combine them and get a single scale score, which gives a more reliable and stable measure of extroversion. And of course, this is the first of five major personality factors; there are several others, and we would simply repeat this analysis for each of them. But the general concept is the same. We pick the variables we want; it calculates Cronbach's alpha, which is like an average correlation between the items; it lets us know if any of them need to be reverse coded, which is very easy to set up in the menus; and then we can see how each item contributes, or whether it seems to be pulling away. In this case, the numbers support the whole scale, and the graphical representation supports it too. We have consistent data, we can average it, and then we have the "less is more": fewer variables, but more meaningful data to work with in our other analyses.
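In jmv syntax, the same analysis looks roughly like this; big5 is my stand-in name for the data frame, and revItems is, as I understand it, the argument that flips reverse-scaled items before computing alpha, so verify the names against your jmv version.

    library(jmv)

    # Reliability analysis for the ten extroversion items
    reliability(data = big5,
                vars = paste0("E", 1:10),    # E1 through E10
                meanScale = TRUE,            # scale mean
                sdScale = TRUE,              # scale standard deviation
                alphaItems = TRUE,           # alpha if item dropped
                corPlot = TRUE,              # correlation heatmap
                revItems = paste0("E", c(2, 4, 6, 8, 10)))  # reverse-scaled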
When you have a novel data set, either because you created your own questions and collected your own data, or because you're combining things that you don't really know have been put together before, one of the big questions is: how can you combine things, and what is the emergent structure of the data that you have? One of the most common ways of assessing this is with something called a principal component analysis. There's also the very closely related exploratory factor analysis, which we'll cover in another video. Principal component analysis works on the principle of covariance, which is closely related to correlation.

We've got a data set here on the big five personality factors; it's included in the data that you downloaded from datalab.cc, and it covers extroversion and openness and conscientiousness and so on. The question is: how well do the items go together, and what structure shows up empirically in the data? We know how they're supposed to go together, but what actually appears in the data that we have? Well, let's come up here to Factor and click on principal component analysis. What we need to do first is feed in the variables that we want; it needs ordinal or scaled or quantitative or continuous variables. So let's come here to E1 and go way down to O10. By the way, you'll notice that these all have the three little circles that indicate a nominal or categorical variable, but Jamovi is generally smart enough to know that they can really be treated as scaled variables from a 1-to-5 response scale. Just click to put them over here, and it starts giving us the results immediately, though it might take a minute to get all the way through.

What we have here is similar to correlations; there's an analogy, in that we have numbers that go from negative one to positive one, where the middle value of zero indicates no linear relationship. And it's breaking things down into several components, several factors, ways of grouping the variables based on the data. It decided that seven components seemed like the right answer; we know that there are supposed to be five, but this lets us know what shows up according to the settings we have in the analysis. What's nice is that right now it's organizing the ten extroversion items together. The fact that some of these loadings are negative and some are positive is irrelevant; it's the magnitude of the coefficient that matters. And you can see the neuroticism items, all ten, are together; same thing for agreeableness and conscientiousness. Things fall apart a little bit here with openness; I can actually tell you that, based on the research, openness is generally the least coherent, or least cohesive, of the factors, so this is not too surprising.

But we have a few options. Number one, if you know about principal component or exploratory factor analysis, you have the option of rotating the solution. That's a complex topic, but what it does is make the results a little easier to interpret. Varimax is a common one, and it's the default, but I actually prefer promax, which allows you to have what are called oblique factors, factors that can be correlated with each other; they don't have to be at right angles in the dimensional space. So I'm going to change that to promax, and you can see it's much the same; we've got promax down here now. There are a few other things I'm going to do. One is hiding the low loadings: you see how we have a lot of blank space here?
There are actually numbers in all of these cells, but most of them are really small, and what this setting does is hide the ones that are low. Let me show you: if I change the cutoff to, say, 0.5, it redoes the table and hides a lot more of the entries, though you wouldn't usually use a cutoff that high. Hiding the low values makes the table easier to read, but you can see, for instance, that it deleted this one, and we're also losing almost all the loadings that are off to the sides. So I'm going to put it back to 0.3, where it was.

What I'm also going to do is look at the assumptions. One important assumption in principal component analysis is something called sphericity, and you can think of it as sort of an analog to normality. Once that loads, we've got Bartlett's test of sphericity, and it lets us know that our data seem to differ significantly from the null hypothesis, which is something you need to take into consideration when interpreting the results of your principal component analysis.

But let's do a few other things. Let's get a component summary, which gives us some statistics for each component. I'm also going to ask for component correlations, and for a chart that's called a scree plot. It'll take a moment for those all to load. What the component statistics tell us is how much variance in the total data set can be accounted for by each of these components. There are 50 variables, so there are basically 50 units of variance to account for; the first component accounts for the most, and they drop off from there. You can also see how the components are correlated with each other; if we used what's called an orthogonal rotation, which forces things into right angles, we wouldn't have any correlations at all. And there's our test of sphericity, this one down here at the bottom.

The eigenvalues are the values that correspond, roughly, to how much of the variance each of our components accounts for. And it's called a scree plot, by the way, because scree is the rubble on the side of a cliff, and that's a little bit like what we have building up right here. We have 50 potential components, because we have 50 variables, and the real question is how many you want to keep. There are several different rules. One rule is to keep only the components with an eigenvalue, this quantity that describes how much variance each component accounts for, greater than one. What we have right here is called a parallel analysis, shown by these yellow dots; that's where it looks at a random structure and tries to see whether we have something better than random. There's also what's called the elbow test, where you look at where the curve bends. And because I happen to know that there are supposed to be five factors in this data, I'm actually going to come here to the fixed number option and enter five. When I do that, we're going to get a very different factor structure, one that should correspond very closely with what was intended, where we have ten variables on each of five factors. We'll wait a second for that to load. We've lost the reference line here, so this plot is exactly the same, just without the line we had before.
But if you come up to here, you can now see that the component structure is much clearer, all the way down to openness, where the ten items that are supposed to go together are, in fact, together. Now, this is probably one of the most important things you need to do when you're analyzing survey data, or any set of variables that might potentially be correlated with each other. A principal component analysis is going to allow you to determine the underlying structure, see what you can combine, simplify the data that you have to deal with, and hopefully get more reliable information at the same time.

Sometimes when you're analyzing data you have to draw distinctions that don't make much of a difference, and one of those distinctions is between principal component analysis and exploratory factor analysis. Now, these are based on profoundly different mathematics, and they have different philosophies about the relationship between the individual items that you got data on and the implied, or implicit, factor behind them. But the fact is they look really, really similar, and people use them to do the same things; there's so much commonality that it's sometimes hard to remember which one you're doing. I've shown you how to do principal component analysis; I'm now going to show you how to do the same kind of looking-for-structure analysis with exploratory factor analysis, and you'll see that it looks very similar.

What I have here is the big five data that's included in the folder you downloaded from datalab.cc. I'm going to come up here to Factor and go to exploratory factor analysis. Mostly it's the same: I pick all the variables that I want to use in my factor analysis, the 50 variables that people use to assess the big five personality factors, and I put them over here, and you can see it's crunching away. I'm going to change my rotation from oblimin to promax. I can also do Bartlett's test of sphericity, which is one of the assumption checks, and I can ask for a factor summary, factor correlations, and a scree plot.

Truthfully, the major difference here is that the second factor is up on top; the order is a little different. Otherwise, it came up with seven factors, where with PCA we had seven components, and they're sprinkled around a little bit differently, but it's basically the same result. This is our openness factor right here, and things are scattered a little for this last item, but if you come down here, you'll see that the factors account for approximately the same amounts of variance, the correlations are similar, and the scree plot looks nearly identical. And in fact, I'm going to do one thing here: based on the scree plot, and based on what I know about this data, I think a fixed number of factors, five (you'll see it's actually calling them components, even though we're doing factor analysis), might be a little more useful for what we're doing. Now that I've done that, you can see we have just five factors, which are correlated, and things are lining up much more neatly. But fundamentally, we're getting the same results that we had with the PCA, the principal component analysis. So take your pick, one or the other.
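Side by side in jmv syntax, the two analyses really do look almost identical. Here's a sketch, with big5 and the item names as my stand-ins; argument names like nFactorMethod and hideLoadings are taken from recent jmv versions, so verify them with syntax mode.

    library(jmv)

    # The 50 item names: E1..E10, N1..N10, A1..A10, C1..C10, O1..O10
    items <- c(paste0("E", 1:10), paste0("N", 1:10), paste0("A", 1:10),
               paste0("C", 1:10), paste0("O", 1:10))

    # Principal component analysis: promax rotation, five fixed components
    pca(data = big5, vars = items,
        rotation = "promax",
        nFactorMethod = "fixed", nFactors = 5,
        hideLoadings = 0.3,    # blank out loadings below 0.3
        screePlot = TRUE,
        bartlett = TRUE)       # Bartlett's test of sphericity

    # Exploratory factor analysis: the call is nearly the same
    efa(data = big5, vars = items,
        rotation = "promax",
        nFactorMethod = "fixed", nFactors = 5,
        screePlot = TRUE)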
I generally do principal component analysis, because I prefer the idea that it starts with the individual variables and then infers something about a factor behind them, as opposed to the idea that the factor comes first and leads to the individual variables. But fundamentally, they accomplish the same thing: they help you find which of your variables go with which others, and they give you the option of combining them to get more stable, reliable data and a smaller number of variables to deal with when you're doing your analysis. Either one will help you find the sense and the commonalities in your data and improve your analysis.

One of the most amazing things about Jamovi is that this free, simple, friendly, open source analysis package includes some incredibly high-end analyses. In fact, the most surprising one for me is the inclusion of confirmatory factor analysis, something that, for instance, isn't even possible to do in SPSS if you're on a Macintosh; and if you're doing it in other programs, it gets really, really complicated really fast. This is by far the easiest, fastest, and cheapest way to do a very sophisticated procedure. What confirmatory factor analysis, or CFA, is, is an approach where you take your data and say: we know how the data is supposed to combine; we know these ten variables should go with this factor, and these ten should go with this other factor. It allows you to specify those factors and see how well your data fit your hypothesized factor structure. It also allows you, by the way, to compare the fit of factor structures across two different samples, but that's a different topic.

Let's take a look at how to set up a confirmatory factor analysis in Jamovi. We come to Factor, and we come down here to confirmatory factor analysis. I am working with the big five data set that you can download from the sample files on datalab.cc. What I have are 50 variables, ten each for the five personality factors: extroversion, agreeableness, openness, and so on. What we have to do is tell Jamovi which variables go together and what the name of each factor should be. So, for instance, E1 through E10 come first; I need to come here and say that that factor is extroversion, so I just click on it and type that. Then I select these ten variables, and you can see that the names have highlighted over here, and I put them in. Right after that are N1 through N10, which is neuroticism, so I'm going to add a new factor, scroll down a little, and we'll see if I can spell neuroticism correctly. We select the ten variables and put them into neuroticism; there they go. We'll add another factor, because there are five in total, and this one is agreeableness; I select those variables and put them in. After that is C, for conscientiousness; I scroll down here and type conscientiousness, and I think I got that right; C1 through C10, feed those over. And then the last factor is O, for openness; come down to the bottom, type in openness, and put those ten in.

And now we've specified the important part of our confirmatory factor analysis. We've told Jamovi we've got five different factors, we've said what the names are, and we've said which variables go with which. What's interesting is that we don't even have to tell it which items are positively associated and which are negative; we don't have to flag the reverse-scaled items.
It's going to be crunching away on this for a little while, because this is a pretty mathematics-intensive option, but let's take a look at a few other things we can do. If we had specific pairs of variables containing what are called residual covariances, things that don't necessarily go into a factor but help explain some of the leftover covariance, we could specify those; what you would do is take one variable and pair it with another for the residual covariance. We don't have that, so I'm going to ignore it. Under options, we can change how we deal with missing values; we only have a very small number of missing values, so I'm just going to leave it like this. We can constrain the factor variances to equal one; that sounds good. Under estimates, we can look at a few different ways of presenting the results; I'm going to add the standardized estimate right here, and then I'll close that.

Under model fit, we have several choices. CFI stands for comparative fit index, TLI for the Tucker-Lewis index, and RMSEA for root mean square error of approximation. And this over here is a chi-squared test; we've seen that in several other places. I'm going to leave those defaults right there. Then, under additional output, we've got a couple of really nice things. One is the residual observed correlation matrix, because what Jamovi is going to do is try to reconstitute a correlation matrix based on how I said the variables went together; so I'm going to ask for that, and I'm also going to have it highlight any residual values greater than 0.1, which is to say, here's where the model is off more than in other places. And finally, I'm going to ask for a path diagram, and then we'll just wait a minute for Jamovi to finish crunching all the data and see what it has for us.

I actually paused the recording for a minute, because it takes a while to get through all this calculation, but here's what we have. We have our factor loadings, where extroversion is the first factor and the indicator is the name of each variable that we said goes into that factor. The estimates are like regression coefficients; they say, multiply this number by the variable to get the factor loading. We have standard errors, z scores, and p values, and, by the way, you can see that everything here is highly significant. And then, finally, the standardized estimate, again like a z score, is a good way of assessing the relative contribution of the variables. We have the same thing for neuroticism, agreeableness, conscientiousness, and openness; that's a lot of numbers. If we come down to the factor estimates, they parcel things up a little differently, looking at the connections between the factors, extroversion in relation to these other factors. The model fit table is an important one; this is where we have the comparative fit index and so on.

But I like these two things at the end. I'm going to close this so I can make a little more space on the screen, and open up the residual correlation matrix. I'm not going to be able to get the whole thing on screen; this is a really, really big correlation matrix. These are correlations, again, from negative one, which is a perfect negative linear association, through zero, which means no linear association, to plus one, which is a perfect positive association. And this correlation matrix is reconstituted based on the way we said the variables went together. These are residuals, which means they're compared to the actual correlations, which are computed in the background.
This is how far off our reconstituted matrix is. The important entries here are the highlighted ones, which say we're off a little bit on that particular combination of variables, and I can scroll down; you can see this thing really is kind of huge. So you can take that as an indication of how close or how far away you are. Going back to the fit indices, I can tell you from these numbers that our fit is adequate. It's kind of okay; nothing spectacular, but it looks like we're not too far off. Maybe we would want to see which variables are loading best, which ones seem to be contributing the most to the fit, but we may have a go-ahead here in terms of confirming the structure we think we have.

And then the last thing is the path diagram. It's not putting the coefficients on this; it simply shows the associations. We said that we have five major factors, extroversion, neuroticism, agreeableness, conscientiousness, and openness to experience; it shows all of them associated with each other, and each of them feeding into its own ten variables. And that, in a nutshell, is the functionality of confirmatory factor analysis, a very sophisticated procedure that a lot of programs can't do at all, and it's really a special present from the Jamovi team that makes it possible here. You're going to have to do more research on how it all works and how to interpret the results, but the fact that you have a tool that can do it, and that does it so easily, is a huge boon to researchers everywhere.
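For reference, the syntax-mode version of this model comes out roughly as follows; big5 and the item names are my stand-ins, and the stdEst and pathDiagram argument names are assumptions based on recent jmv versions, so check them against your own syntax-mode output.

    library(jmv)

    # Confirmatory factor analysis: five factors, ten indicators each.
    # The factors argument is a list of label-plus-variables pairs.
    cfa(data = big5,
        factors = list(
          list(label = "Extroversion",      vars = paste0("E", 1:10)),
          list(label = "Neuroticism",       vars = paste0("N", 1:10)),
          list(label = "Agreeableness",     vars = paste0("A", 1:10)),
          list(label = "Conscientiousness", vars = paste0("C", 1:10)),
          list(label = "Openness",          vars = paste0("O", 1:10))),
        stdEst = TRUE,        # standardized estimates (assumed name)
        pathDiagram = TRUE)   # path diagram (assumed name)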
So now we've come to the end of our introduction to Jamovi, and it's time to talk a little about what we've accomplished and what you might want to do next, your next steps. First off, here's what we've learned: installing Jamovi, sharing files, wrangling data, exploring data, visualizations, t tests, analysis of variance, regression models, analyzing frequencies, and finding factors. We've learned how to do almost everything that's necessary for the vast majority of data projects. And really, I just need to know one thing: have you learned how to use spreadsheets? Do you know how to use either Excel or Google Sheets? Because if you know how to use spreadsheets, and you've learned how to use Jamovi, then, tool-wise, maybe you're there; maybe you have what you need. I'm firmly of the belief that spreadsheets and Jamovi cover the analytic necessities for the vast majority of people who work with data.

Now, it's possible that you might need some more tools, depending on the kind of work you do, but not everybody does. If you're going to do something that is more advanced than what Jamovi does, and Jamovi has a lot of capability, then you might need to do some programming for data analysis, and that usually means either R, the statistical programming language, or Python with the collection of data packages available for it. Those two are very, very popular in data science. I like to use R a lot, but I much prefer working with Jamovi for the vast majority of my analyses.

There are other helpful tools if you're looking for something to add to your toolkit. First, there's SQL, Structured Query Language, for data access, data manipulation, and data cleaning. If you're working in an organizational setting, SQL can be really, really helpful, and truthfully, you only have to learn about 20 commands to get most of what you need out of it. Second is Tableau, a proprietary desktop application. There's Tableau Public, which is free, though all the work you do is available on the Internet for everybody to see, and there's Tableau Desktop, which, if you're a regular person or a company, is extraordinarily expensive; if you work for a nonprofit, you can get it for free. Tableau really is the tool of choice for interactive visualization, even for data scientists who are handy with code. And third is presentation software, because, remember, after you do all this analysis, you're going to have to communicate it to somebody; you have to share it with them, and knowing how to use Microsoft PowerPoint or Apple Keynote or Google Slides is going to go a very long way towards making your work professionally meaningful.

Now, you may have seen this chart before. If you're coming from the world of data science, you've seen this; it's called the data science Venn diagram, and it was created by Drew Conway several years ago. He said that data science consists of three different fields put together: at the top left is coding, working with computers; at the top right is stats, statistics and mathematics; and at the bottom is domain knowledge, because you actually have to understand the field that you're working in. When you take the three of those and combine them, you get data science, and it applies to a certain extent to what we're doing here. We've been talking about the tool; that's the red circle at the top left. We have not talked very much about the statistics, the conceptual element, nor about the domain, the application. And so, even if you have the greatest toolset in the world, there are these other areas that need your attention.

I would say probably the single most important thing is data fluency, and I'm serious about that, even just reading a bar chart and a line chart. If you know how to ask a meaningful question, and you know how to read even these very simple analyses to get something insightful and actionable out of them, you're going to have huge value in what you do. I actually believe that bar charts and line charts and scatterplots probably cover 98% of the data visualization needs for most people. The simple tools are often the most effective, and that's one of the reasons I really like Jamovi, even though it also allows you to do some complex things. And so here at DataLab, we're developing courses to teach data fluency, the concepts and the principles that the tools help you with; the tools are not a substitute for them.

And of course, separate from the principles and the tools is the ability to work in a particular setting. So much of the real work in any data project is not the actual working with the numbers; it's in talking with people to understand what you're trying to accomplish, and seeing how you can make sense of it and what to do with it. The best way to develop that is to work together with other people on real data projects to answer real questions. You can do that at your current job, you can do it with a volunteer group, you can do some online freelancing; any of these are great ways to get the experience you need. Jamovi is an excellent data tool, but you need some understanding of the statistical concepts, the data fluency, and, more than anything, the ability to apply it when working with real people in real settings. So we've been talking about a tool, and you have these other possibilities that we talk about at DataLab as well; take them together, and really, you're good to go.
You've got what you need. I'm so glad you joined me here. I'm thrilled to have shown you Jamovi, one of my favorite new tools, and I really hope you'll start finding interesting and exciting ways to use it, getting insight out of the data you're working with, and doing something amazing with it.