📊

Histograms and Data Visualization

Jul 6, 2024

Histograms: A Method for Displaying Continuous Data

Introduction

  • Why histograms?
    • Means, medians, and standard deviations don't tell the whole story.
    • Distribution shapes are important and not captured by single summaries.

What is a Histogram?

  • Definition: Displays distribution of data by charting the number or percentage of observations within predefined numerical ranges.
  • Similarity to Bar Charts: Histograms look like bar charts, but each bar covers a numerical range rather than a category, so the chart shows how the data are distributed.

Example: Age Data from 1995 Statistical Abstract (US)

  • Dataset: Percentage of each state's population over 65 years of age, for the 50 states.
  • Interesting Findings:
    • Smallest percentage: Alaska (4.6%)
    • Largest percentage: Florida (18.4%)

Steps to Create a Histogram

  • Step 1: Break data into mutually exclusive, equally sized bins.
  • Step 2: Count observations in each bin.
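  • Notation: Use brackets and parentheses to define precise ranges. For example, [4, 5) includes 4 but excludes 5, so a value on a boundary falls into exactly one bin.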
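
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the lecture: the function name histogram_counts and the sample values are made up, and the bins are assumed to be half-open intervals of the form [lower, upper).

```python
import math

def histogram_counts(values, start, width, n_bins):
    """Count values in n_bins equal-width, half-open bins.

    Bin i covers [start + i*width, start + (i+1)*width), so a value that
    sits exactly on a boundary is counted once, in the bin it opens.
    Values outside the covered range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        i = math.floor((v - start) / width)  # Step 1: find the bin
        if 0 <= i < n_bins:
            counts[i] += 1                   # Step 2: count the observation
    return counts

# Made-up values, for illustration only.
sample = [10.0, 10.4, 11.0, 11.9, 13.2]
counts = histogram_counts(sample, start=10, width=1, n_bins=4)
print(counts)  # [2, 2, 0, 1]
```

Note that 11.0 lands in [11, 12) and not in [10, 11), which is exactly what the bracket and parenthesis notation specifies.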

Example Breakdown

  • Bin Setup: Bins from 4% to 19% with 1% width.
  • Observation Counts:
    • 4-5%: 1 observation
    • 5-6%: 0 observations
    • 6-7%: 0 observations
    • 8-9%: 1 observation
    • Most observations fall within the 10-16% range.
  • Graphical Summary: The histogram shows where the values are centered and how spread out they are.
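
As a small check on the bin setup, the sketch below builds the 1%-wide bins from 4% to 19% and locates the two values quoted in the notes (Alaska at 4.6% and Florida at 18.4%). The helper find_bin is a hypothetical name, and half-open bins [lower, upper) are assumed; the full 50-state dataset is not reproduced here.

```python
# Bin edges 4, 5, ..., 19 give 15 bins of width 1 percentage point.
edges = list(range(4, 20))
bins = list(zip(edges[:-1], edges[1:]))

def find_bin(value, bins):
    """Return the (lower, upper) bin with lower <= value < upper, or None."""
    for lower, upper in bins:
        if lower <= value < upper:
            return (lower, upper)
    return None

print(len(bins))             # 15
print(find_bin(4.6, bins))   # (4, 5)    Alaska
print(find_bin(18.4, bins))  # (18, 19)  Florida
```

The smallest and largest values fall in the first and last bins, so this range of bins covers all 50 observations.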

Blood Pressure Data Example

  • Dataset: Blood pressure data from 113 men.
  • Statistics: Mean = 123.6 mmHg, Standard deviation = 12.9 mmHg.
  • Histogram Properties:
    • Bin Width: 5 mmHg; the height of each bar represents the number of men in that bin.
    • Shape: Symmetric, bell-shaped around mean and median.
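    • Bin Width Is a Choice: The 5 mmHg width is arbitrary, and other widths change the picture.
      • 20 mmHg width: Too crude; a few wide bars hide the shape.
      • 1 mmHg width: Too fine; many short bars make the shape hard to see.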
    • Percentage Representation: Vertical axis can represent percentages instead of counts.
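
The effect of bin width, and the switch from counts to percentages, can be tried out with the sketch below. The original 113 readings are not available in these notes, so the data here are simulated: 113 draws from a normal distribution with the quoted mean and standard deviation. The function name bin_counts and the text-based bars are illustrative choices only.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Simulated readings, NOT the original data: 113 draws from a normal
# distribution with mean 123.6 mmHg and standard deviation 12.9 mmHg.
readings = [random.gauss(123.6, 12.9) for _ in range(113)]

def bin_counts(values, width):
    """Return {bin_start: count} for half-open bins [start, start + width)."""
    counts = {}
    for v in values:
        start = math.floor(v / width) * width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

# Compare a crude, a moderate, and a very fine bin width.
for width in (20, 5, 1):
    counts = bin_counts(readings, width)
    print(f"width {width:>2} mmHg: {len(counts)} non-empty bins, "
          f"tallest bin holds {max(counts.values())} readings")

# The 5 mmHg histogram again, with percentages on the vertical axis.
counts_5 = bin_counts(readings, 5)
percents = {start: 100 * c / len(readings) for start, c in counts_5.items()}
for start, pct in percents.items():
    print(f"[{start}, {start + 5}): {pct:4.1f}% " + "#" * counts_5[start])
```

Switching to percentages rescales the vertical axis but leaves the shape of the histogram unchanged, since every count is divided by the same total.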

Choosing Bin Width and Number

  • General Guidance:
    • Dependent on sample size and data spread.
    • Rough rule: Number of intervals ≈ √sample size.
    • Example: 10 observations ≈ 3 bins, 50 observations ≈ 7 bins, 100 observations ≈ 10 bins.
  • Computer Selection: Statistical software typically picks a reasonable default number of bins for you, which you can override.
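
The rough square-root rule is easy to compute directly. In the sketch below, the function name sqrt_rule_bins is made up, and rounding to the nearest whole number is an assumption; the notes only say "approximately".

```python
import math

def sqrt_rule_bins(n):
    """Rough bin count: square root of the sample size, rounded to the
    nearest integer (the rounding choice is an assumption), at least 1."""
    return max(1, round(math.sqrt(n)))

for n in (10, 50, 100):
    print(n, sqrt_rule_bins(n))
# 10 3
# 50 7
# 100 10
```

This reproduces the three examples in the list above. It is only a starting point; the spread of the data and the purpose of the plot may justify a different choice.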

Conclusion

  • Histograms are useful for summarizing continuous data and understanding distribution shapes.
  • Other visualization options will be explored in the next section.