Transcript for:
Understanding Data Visualization Principles

They say a picture is worth a thousand words, but how much data is it worth? Let's say I showed you this table of data. Hmm, yes, compelling.

Whatever story I'm trying to tell you is completely hidden in what looks like just a list of numbers. These four are actually Anscombe's Quartet, special sets that have almost the same summary statistics, or metrics like the range, frequency, or mean that help us summarize the data. And with just the summaries and tables to work with, we'd probably think all four were saying the same thing. But watch what happens when we plot each set. These are four very different stories.
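To see how alike the four sets look on paper, here is a minimal Python sketch using the values Anscombe published in 1973. The helper names `pearson` and `summarize` are made up for this illustration, and the statistics it checks (mean, variance, and correlation) are the ones the quartet is usually cited for, which is an assumption about what the table in the video summarizes.

```python
from statistics import mean, variance

# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def summarize(xs, ys):
    """Summary statistics for one (x, y) data set (hypothetical helper)."""
    return {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),   # sample variance
        "var_y": variance(ys),
        "corr": pearson(xs, ys),
    }

summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, s in summaries.items():
    print(name, {key: round(value, 2) for key, value in s.items()})
```

The printed rows come out nearly identical, which is exactly why the summaries alone can't tell the four stories apart.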

What we've done is created a data visualization, which is the somewhat buzzy phrase we use to talk about how we graphically represent data. Calculations can tell us a lot, but images help us communicate or even let us spot connections, patterns, trends, and outliers we otherwise would have missed. But there's always a catch.

A few missteps and we can actually learn less from a visualization than if the author had just thrown their spreadsheet at us and run away. So today we'll learn some of the common types of data visualizations and data distortions. I'm Jessica Pucci, and welcome to Study Hall Data Literacy, presented by Arizona State University and Crash Course. Data visualizations go by many names: charts, figures, graphs, infographics, plots.

But they all pretty much stand for the same thing: a way to visualize the story data is telling. And sort of like genres of writing, there are also lots of specific types of visualizations that can be used to emphasize different parts of the story. So it's our job to decide if we should believe the story being told, and whether it fits with the visualization. With a big data set, there are lots of twists to focus on. Like, suppose we're ecologists working for the US Fish and Wildlife Service, and we've tracked the numbers and species of all the birds migrating over Florida in the last decade.

Let's wander through the gallery of possible visualizations and see how each changes the angle of the story. A line chart helps tell a story involving time, like if we wanted to see how the number of birds each year has changed over the decade. With time on the horizontal axis and the number of birds on the vertical axis, we can put some sort of symbol, like a dot, to mark the total number of birds for each year. And for the pièce de résistance, we connect the points with a line to help us see any trends.

But in a pie chart, we forget about time and instead focus on stories about proportions, or parts compared to the whole. So if we wanted to tell a story about all the types of migrating birds, the whole pie chart represents the total number of birds, while each slice shows how much each species contributes. Those sandpipers are going for the biggest piece! And in the same flavor profile as a pie chart, a bar chart also helps us tell stories with comparisons.

Like, if we wanted to study the birds by color, and see which is the most prominent, we'd label each color horizontally, and then the height of each bar would be how many birds we saw of that color. My money's on brown. Now, a histogram helps tell a story about the distribution of a variable.

But instead of putting a marker for each data point, we'd put each one into a group, or bin. Like, we might want to see how the total number of species we see each year is distributed. Some years we might see 40 different species, and some years we might see 35 or 42. To see if there's a trend, we split the range up into bins. One bin might contain all the years we saw 40 to 45 species.

One bin will contain all the years we saw 35 to 40, and so on. The bins go across the horizontal axis, and the vertical axis tracks the frequency, or how many years fall in each bin. Some bins will be less full, and some bins will be stuffed. But lined up, we can see how the number of species we see is spread out. Binning sounds straightforward, but how big we make our bins can really make a difference.

If we go too big, like one bin for 0 to 50 species, all the data will be mushed together. But if we pick groups that are too small, like separate bins for 40, 41, and 42 species, and so on, each bin will only have a few data points. And even though bars are involved, a histogram is not the same as a bar chart. Remember, in a bar chart, the vertical axis represents a value of the variable, not a frequency.
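The effect of bin size can be sketched in a few lines of Python. The yearly species counts below are invented for illustration, `bin_counts` is a hypothetical helper, and the sketch assumes each bin includes its lower edge but not its upper one, so a year with 40 species lands in the 40 to 45 bin.

```python
from collections import Counter

# Hypothetical data: number of species seen in each of ten years.
species_per_year = [35, 38, 40, 41, 42, 42, 44, 46, 47, 49]

def bin_counts(values, width):
    """Count values per bin; each bin is keyed by its lower edge
    and covers the half-open range [edge, edge + width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# One giant bin, medium bins, and one bin per species count.
for width in (50, 5, 1):
    print(f"bin width {width}: {bin_counts(species_per_year, width)}")
```

With a width of 50, every year falls into a single bin. With a width of 1, almost every bin holds a single year. A width of 5 is the only one of the three that shows where the counts cluster.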

Of course, we're also interested in how variables relate to each other. A scatterplot tells the story about a relationship, or correlation, between two variables by putting one variable on each axis and plotting the corresponding pairs of values. Like take these two species of birds. If they're competitors, they might not migrate together, and we might expect a negative correlation. Which brings us to maps, which tell stories about information across a space.

Like we can plot all the locations where a bird was reported to see where all the birds, or birders, hang out. So are these all the possible visualizations out there?

No. But they are the most common, and the foundation for the more specialized graphics you might find. Now part of being data literate is knowing how to pick a visualization that fits our data. Lots of us have tons of weather information at our fingertips, but what we see on the news or check on our built-in phone apps is rarely specific to exactly where we live. So to set up our home weather stations and choose an effective chart, let's plot the points in the Data Point.

You're really curious about two weather variables: the exact midday temperature and daily precipitation. So you put a thermometer and a rain gauge right outside your window and decide to check them once a day as you sit down to lunch. Soon your friends and family hear about your project and want to get involved with their own data tables.

Which is great, but after a month the numbers get a bit overwhelming. This is where a data visualization can help. But what kind will tell us an interesting story? First, you have to explore your data to see what stories might be living in it. And if you remember what originally motivated us to start collecting data, you won't have to arbitrarily make a bunch of plots.

Originally, you were interested in midday temperatures, and maybe you had questions about the average or how much it varied. So it sounds like you want to see the distribution. With a histogram, we can see that most days the temperature was between 65 and 75 degrees Fahrenheit.

But there were some more extreme days in there, too. Now that you know how temperature varies, you've got more questions. Is there any trend in how temperature fluctuates? Or did it get noticeably hotter or colder over time?

Aha! That's the key! Time.

A line chart focuses on time and gives us visual evidence that temperature increases over time. But don't forget, you were also interested in rain data.

Maybe there's even a relationship between temperature and rainfall, and with two variables and a potential correlation, a scatterplot might be revealing. Sadly, nothing jumps out at us in this scatterplot. With so many rainless days, it makes it hard to see any pattern. This chart was a dead end, but that's okay, at least we know. So if we follow our curiosity and original questions and build on what we find, deciding what charts to create will be much easier.

Now, with a small amount of data, we could sketch graphics by hand. But if we get all that data in one place, like a trusty spreadsheet, we can break out our favorite visualization tool like Datawrapper, Google Data Studio, or Tableau, which all have free starter versions. Or we can use tools that are completely free but take a bit more skill, like Python and R. All these tools have extensive options to help us emphasize parts of our dataset and tell a better story with our visualization. But this emphasis can be a double-edged sword.

Sometimes the design of a chart can actually obscure the story or mislead the viewer. For instance, since the two axes control most of the information displayed in a chart, changing them can distort our perception of the story. Like suppose we made a bar chart of the average temperature each person recorded in our weather experiment. If one temperature really stuck out as being extremely large, we could downplay the difference by breaking the vertical axis, which cuts out a whole section of the axis and shortens the longer bars.

Or if there wasn't much difference between all our temperatures, we could create some drama by not starting the vertical axis at zero. Then any difference between the minimum and maximum looks larger. Like we've zoomed in.
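A rough way to put a number on that zoom effect is to compare how tall two bars are drawn under different starting points for the axis. This is only a sketch: the two average temperatures are made up, and `apparent_ratio` is a hypothetical helper, not part of any charting tool.

```python
def apparent_ratio(taller, shorter, baseline=0.0):
    """Ratio of the drawn bar heights when the vertical axis starts at `baseline`."""
    if baseline >= shorter:
        raise ValueError("baseline must be below both values")
    return (taller - baseline) / (shorter - baseline)

# Hypothetical average temperatures (degrees Fahrenheit) from two people.
avg_a, avg_b = 72.0, 70.0

print(round(apparent_ratio(avg_a, avg_b), 3))        # axis starts at 0  -> 1.029
print(round(apparent_ratio(avg_a, avg_b, 68.0), 3))  # axis starts at 68 -> 2.0
```

The data never changes, but starting the axis at 68 makes one bar look twice as tall as the other instead of about 3% taller.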

Not starting at zero is a special case of changing the baseline, or adjusting the reference points we use to compare things. And it's not limited to just bar charts. Most other charts can be altered to play with our perception. Like we expect the sections in a pie chart to make up the whole, right?

So it's confusing when the slices add up to more or less than 100%. If we wanted to compare the number of sunny and rainy days in one month, we could make a pie chart. But if we add everyone else's data, too, we'd be comparing nonsense.

Suddenly, we've got way more than one month's worth of days. And even when we're focused on the image, we can still get distracted by text. Like if the title in our line chart is, LOOK AT THAT HEATWAVE, our reader's probably going to be swayed by the sensational headline and see the trend as way more extreme than it is. So even the simplest graph might not be telling the whole story, which is why it's important to look carefully at all labels to make sure we're getting a fair picture. But even if we don't mess with the chart basics, design choices can still get in the way of the data's story.

These little raindrop and sun icons are pretty cute, and we could make a bar plot with one symbol for each day stacked to make the bars. But if the symbols are badly sized, what we thought was cute actually makes it hard to compare the heights of the bars accurately. Overly complicated aesthetic choices like this are called chart junk, a phrase coined by statistician Edward Tufte in his 1983 book The Visual Display of Quantitative Information.

Chart junk is basically design overkill, like using unnecessary lines, colors, or symbols. If the aesthetic choice doesn't connect to the data somehow, it's a distraction. A common culprit is throwing in an extra dimension. 3D bar or pie charts are rarely necessary. Ask yourself, what does that extra dimension connect to in the data?

Probably nothing. Some color palettes can even hide the story altogether. Sure, a red and green palette might seem festive, but some colorblind viewers won't be able to see a difference.

Seemingly simple choices like this are actually a major factor in determining how inclusive our data story is. No matter what story we're trying to tell, we can always try to be empathetic to our viewers. So a picture really is worth a thousand, or even a million, data points.

Today we learned what stories can be told and to pay attention to our original questions when deciding what visualization to use. And we learned we still need to use our data literacy skills to investigate visuals. They help us see patterns and trends in data more easily, but they can also distort our perception of the story.

Next time, we'll talk more about the ins and outs of data collection. Thanks for watching Study Hall Data Literacy, which is produced by Arizona State University and the Crash Course team at Complexly.

If you liked this video and you want to keep learning with us here in Study Hall, be sure to subscribe. You can learn more about ASU and the videos produced by Crash Course in the links in the description. See you next time!