If I gave you the list of numbers 7.9, 8.4, and 7.8, these are just numbers. What do they mean? Well, without context, nothing. Similarly, if I have a list that says action, family, adventure, that's just a list of words. But if I tell you that the numbers were the average voter ratings on IMDB for several different movies and that the words in my list were the genres of those movies, now we have data. Data can be any form of information, but in context. The important thing to remember about data is that it's only useful when we consider its context.

So, let's look at a data set of information collected from the movie rating and review site IMDB. The first thing to notice is the structure of how we interact with data. This data set format is called structured data. Along the top row of the data set, we have our list of variables. These are the attributes of the subjects that we measured. For this data set, we measured things such as the title of the movies, the year they were released, the genre, and the user rating average. Each row, then, indicates all of the characteristics for a single movie. Looking across the row, you can see the responses for all the variables for that movie. The thing to note about data in this format is that if you choose to sort your data in some way, like by user rating, you have to be sure to sort all of your columns simultaneously, because that full row of data makes up a complete picture of the movie.

Let's talk about the different types of variables that we can have in a data set. There are two major types of variables: numerical, or quantitative, and categorical. Categorical variables are labels we apply to certain characteristics. Within categorical data, we have two levels of categorical variables: ordinal and nominal. Ordinal data is data that has a specific order to it. An example from the movies data set is the rating system: G, PG, PG-13, and R.
These are categories, but they have a definite order to them. Other examples would be what we call a Likert scale: strongly agree, agree, disagree, strongly disagree. Or classification in school: freshman, sophomore, junior, senior.

Nominal data is data that is categorical but has no defined order. For example, look at the genre. Adventure, action, family. How would we put those in order? Well, there isn't a set way, so this variable is nominal, meaning the categories differ in name only. Other examples would be something like hair color or ethnicity. A special case of nominal data is any type of data that has only two categories, such as: Is a movie animated? Yes, or no. Two-category data is always considered nominal.

For numerical data, we have two main scales of measurement: discrete and continuous. This distinction has to do with whether the data can take on any decimal values or not. Discrete data is data that cannot be sliced into smaller segments, meaning that there are gaps between the values, and, generally, we cannot have decimals. Think about the year the movie was released, or the number of votes a movie received on the IMDB site. We cannot have decimal or partial values for this data. Anytime we are counting, that is discrete data. Think: one, two, three, four. We cannot count something and have it result in a decimal value.

The other scale of data is continuous. This is anything that can be measured and can be sliced into infinitely smaller pieces. A more obvious example of continuous data is the average movie rating, which has decimal places in the data set. The budget for the movies is another example of continuous data. While this data set does not display the decimals for cents, we know these movies' budgets could be tabulated in dollars and cents, and that makes it continuous data. Other examples would be anything we can take precise measurements for, like temperature, distance, or time.
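
To make the four variable types concrete, here is a minimal Python sketch. The names `VARIABLE_TYPES`, `CLASS_ORDER`, and `responses` are hypothetical and chosen only for illustration; the column labels are assumptions about how the variables mentioned above might be named, not the actual headers of the IMDB data set.

```python
# Assumed column names mapped to (major type, level or scale),
# following the classification described in the lecture.
VARIABLE_TYPES = {
    "genre": ("categorical", "nominal"),        # no defined order
    "animated": ("categorical", "nominal"),     # two categories: yes or no
    "mpaa_rating": ("categorical", "ordinal"),  # G, PG, PG-13, R
    "year": ("numerical", "discrete"),          # no partial values
    "votes": ("numerical", "discrete"),         # a count
    "user_rating": ("numerical", "continuous"), # has decimal places
    "budget": ("numerical", "continuous"),      # dollars and cents
}

# Ordinal data carries an order that is not the alphabetical order,
# so the order has to be stated explicitly.
CLASS_ORDER = ["freshman", "sophomore", "junior", "senior"]

# Made-up responses, used only to show the difference in sorting.
responses = ["senior", "freshman", "junior", "sophomore"]

# Treating the labels as plain text sorts them alphabetically.
alphabetical = sorted(responses)

# Treating them as ordinal sorts them by position in CLASS_ORDER.
by_level = sorted(responses, key=CLASS_ORDER.index)

print(alphabetical)
print(by_level)
```

A nominal variable such as genre has no equivalent of `CLASS_ORDER`: any order we picked would be arbitrary, which is exactly what makes it nominal.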
Knowing and defining the types of variables we can come across is important because, down the road, this will let us know what types of statistical tests and analyses we can do with our data.
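
As a short illustration of the earlier point about sorting, the sketch below keeps each movie's values together in one row and sorts whole rows at once. The titles are placeholders, and pairing each rating with a genre in the order they were listed is an assumption; the lecture gives only the two lists.

```python
# Each dictionary is one row: all the variables for a single movie.
# Titles are placeholders; the rating/genre pairing is assumed.
movies = [
    {"title": "Movie A", "genre": "action", "user_rating": 7.9},
    {"title": "Movie B", "genre": "family", "user_rating": 8.4},
    {"title": "Movie C", "genre": "adventure", "user_rating": 7.8},
]

# Sorting the rows by user rating moves every column together,
# so each movie keeps its own title, genre, and rating.
by_rating = sorted(movies, key=lambda row: row["user_rating"], reverse=True)

# Sorting a single column on its own breaks the rows apart:
# the sorted ratings no longer line up with the original titles.
titles = [row["title"] for row in movies]
ratings_only = sorted((row["user_rating"] for row in movies), reverse=True)
mismatched = list(zip(titles, ratings_only))

for row in by_rating:
    print(row["title"], row["genre"], row["user_rating"])
```

In `mismatched`, the first title ends up next to a rating that belongs to a different movie, which is the mistake the lecture warns about.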