Understanding Skewness and Data Visualization

Mar 1, 2025

Lecture Notes: Media and History - Skewness and Data Distribution

Introduction

  • Discussion of seating arrangement; encouraged students to sit at the front.
  • Previous topic: Correlation.

Skewness of Data

  • Definition: Skewness measures the asymmetry of a distribution, i.e. how far the data leans to one side, leaving a longer tail on the left or the right.
  • Importance of Visualization: Skewness is easiest to see in a plot.
  • Numeric Analysis: The aim here is to quantify skewness with summary statistics instead of relying only on a plot.

Why Correct Skewness?

  • Brings the data closer to a normal (Gaussian) distribution, which helps models that expect one to be accurate.
  • Skewness violates the normality assumption that many machine learning algorithms make about their input data.

Types of Skewness

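  • Positive Skewness: The long tail is on the right; most values are concentrated on the left.
  • Negative Skewness: The long tail is on the left; most values are concentrated on the right.
  • Zero Skewness: The distribution is symmetric (for example, a normal distribution).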
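
The three cases above can be illustrated with a small sketch. The samples are synthetic and the column names are made up for this example; they are not the lecture's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
n = 10_000

# Synthetic samples, invented purely for illustration.
samples = pd.DataFrame({
    "right_tailed": rng.exponential(scale=1.0, size=n),   # long tail to the right
    "left_tailed": -rng.exponential(scale=1.0, size=n),   # mirrored: long tail to the left
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=n),  # bell-shaped, no long tail
})

skewness = samples.skew()  # one skewness value per column
print(skewness)

# In a skewed sample the mean is pulled toward the long tail,
# so comparing it with the median hints at the direction.
print(samples.mean() - samples.median())
```

The sign of each value gives the direction of the tail: positive for the right-tailed column, negative for the left-tailed one, and close to zero for the symmetric one.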

Python Code for Skewness

  • Use of Pandas to read CSV data and calculate skewness.
  • Example output and interpretation of skewness values.
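
A minimal sketch of this step, assuming a small made-up table in place of the lecture's CSV file (the column names `age` and `income` are hypothetical). The helper `describe_skew` and its 0.5 cut-off are an assumed rule of thumb for reading the values, not something stated in the lecture.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV content standing in for the lecture's data file.
# With a real file, pass its path to pd.read_csv instead of StringIO(...).
csv_text = """age,income
20,20000
30,22000
40,25000
50,30000
60,120000
"""

df = pd.read_csv(StringIO(csv_text))
skewness = df.skew()  # skewness of each numeric column


def describe_skew(value, tolerance=0.5):
    """Label a skewness value; the 0.5 cut-off is an assumed rule of thumb."""
    if value > tolerance:
        return "positive skew (long right tail)"
    if value < -tolerance:
        return "negative skew (long left tail)"
    return "roughly symmetric"


for column, value in skewness.items():
    print(f"{column}: {value:.2f} -> {describe_skew(value)}")
```

Here `age` is evenly spread around its middle value, so its skewness is zero, while the single large `income` pulls that column's tail to the right and gives a positive value.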

Transition to Data Visualization

  • Visualization: Helps in understanding data relationships and distributions.
  • Univariate vs. Multivariate: Two main categories of data visualization.

Univariate Plots

  • Types:
    • Histogram: Groups values into bins and shows the count in each; a quick way to see the distribution.
    • Density Plots: Use a smooth line or curve instead of bars (a smoothed version of the histogram).
    • Box and Whisker Plots: Show the median, quartiles, and outliers.

Histogram

  • Characteristics: Provides the count of observations in each bin.
  • Purpose: Check visually whether the data follows a Gaussian distribution or is skewed.
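
A histogram of this kind could be drawn roughly as follows. This sketch assumes Matplotlib for plotting and uses a synthetic right-skewed sample; the bin count of 20 is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=1)
# Synthetic, right-skewed sample used only for illustration.
values = rng.exponential(scale=2.0, size=1_000)

fig, ax = plt.subplots()
# ax.hist returns the count in each bin and the bin edges it used.
counts, bin_edges, _ = ax.hist(values, bins=20, edgecolor="black")
ax.set_xlabel("value")
ax.set_ylabel("number of observations")
ax.set_title("Histogram of a right-skewed sample")

print(counts)  # number of observations that fell into each bin
```

Calling `plt.show()` or `fig.savefig(...)` afterwards displays or stores the figure. For a DataFrame, `df.hist()` draws one such histogram per numeric column.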

Density Plots

  • Characteristics: A smoothed histogram; the distribution is drawn as a continuous line or curve.
  • Use: Identify skewness and the overall shape of the distribution.
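
To show what "smoothed histogram" means, here is a hand-rolled sketch of a Gaussian kernel density estimate using only NumPy. The function name `density_curve` and the bandwidth of 0.5 are assumptions made for this example, and the data is synthetic.

```python
import matplotlib.pyplot as plt
import numpy as np


def density_curve(values, grid, bandwidth):
    """Average one Gaussian bump per observation over the grid points."""
    values = np.asarray(values, dtype=float)
    z = (grid[:, None] - values[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return bumps.mean(axis=1)


rng = np.random.default_rng(seed=2)
values = rng.exponential(scale=2.0, size=1_000)  # synthetic right-skewed data

bandwidth = 0.5  # assumed smoothing width; larger values give a smoother curve
grid = np.linspace(values.min() - 3 * bandwidth, values.max() + 3 * bandwidth, 400)
density = density_curve(values, grid, bandwidth)

fig, ax = plt.subplots()
ax.hist(values, bins=30, density=True, alpha=0.4, label="histogram")
ax.plot(grid, density, label="density curve")
ax.set_xlabel("value")
ax.set_ylabel("density")
ax.legend()
```

In practice, `df.plot(kind="density")` in Pandas produces a similar curve without the manual step; it relies on SciPy being installed.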

Box and Whisker Plots

  • Purpose: Summarize data distribution, median, quartiles, and identify outliers.
  • Components:
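    • Box (25th to 75th percentile, i.e. the interquartile range)
    • Median line
    • Whiskers (extend to the smallest and largest values that are not outliers, commonly those within 1.5 times the interquartile range of the box)
    • Outliers (dots beyond the whiskers)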
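
The components listed above can be computed by hand before plotting. This sketch assumes the common 1.5 × IQR rule for the whiskers and a small made-up sample; the variable names are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

# Small made-up sample with one obviously extreme value.
data = np.array([2, 3, 4, 5, 5, 6, 7, 8, 9, 30], dtype=float)

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1                 # height of the box
lower_fence = q1 - 1.5 * iqr  # points below this count as outliers
upper_fence = q3 + 1.5 * iqr  # points above this count as outliers

# Whiskers stop at the most extreme points that are still inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"box: {q1} to {q3}, median: {median}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")

fig, ax = plt.subplots()
ax.boxplot(data, whis=1.5)  # 1.5 is also Matplotlib's default whisker factor
ax.set_ylabel("value")
```

For this sample the value 30 lies above the upper fence, so it is drawn as a separate dot and the upper whisker stops at 9 instead of the maximum.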

Multivariate Plots

  • Aim to show relationships between multiple variables.
  • Correlation Matrix Plots: Show the strength and direction of the relationship between each pair of variables.

Correlation Matrix

  • Purpose: Identify strong and weak correlations between variables through color coding (in the lecture's plot, blue for negative and yellow for positive; the exact colors depend on the colormap).
  • Python Code: Use Pandas and NumPy to calculate and visualize correlations.
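
A sketch of such a plot, assuming Matplotlib does the drawing while Pandas computes the correlations and NumPy supplies the tick positions. The data and column names are synthetic and chosen so that the expected relationships are known in advance.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)
n = 500
x = rng.normal(size=n)

# Synthetic columns with known relationships to x (illustration only).
df = pd.DataFrame({
    "x": x,
    "strong_pos": 2.0 * x + rng.normal(scale=0.5, size=n),  # rises with x
    "moderate_neg": -x + rng.normal(scale=1.0, size=n),     # falls as x rises
    "unrelated": rng.normal(size=n),                        # independent of x
})

corr = df.corr()  # pairwise Pearson correlation, values between -1 and 1
print(corr.round(2))

fig, ax = plt.subplots()
# Fix the color scale to [-1, 1] so colors are comparable across plots.
image = ax.matshow(corr.to_numpy(), vmin=-1, vmax=1)
fig.colorbar(image)
ticks = np.arange(len(corr.columns))
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(corr.columns, rotation=45, ha="left")
ax.set_yticklabels(corr.columns)
```

With Matplotlib's default colormap (viridis), values near -1 appear dark blue to purple and values near +1 appear yellow, which matches the color coding described above. The diagonal is always 1 because each variable correlates perfectly with itself.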

Summary

  • Visualization assists in understanding data distribution and relationships.
  • Univariate plots help understand the distribution of a single attribute.
  • Multivariate plots reveal relationships between multiple attributes.
  • Data visualization is crucial for data analysis and machine learning preparation.
  • Discussion to continue in the next lecture.