Understanding Skewness and Data Visualization
Mar 1, 2025
Lecture Notes: Media and History - Skewness and Data Distribution
Introduction
Discussion of seating arrangement; encouraged students to sit at the front.
Previous topic: Correlation.
Skewness of Data
Definition
: Skewness measures how asymmetric a data distribution is, that is, whether it leans toward one side (left or right) instead of being balanced around its center.
Importance of Visualization
: Skewness is easiest to see in a plot of the distribution.
Numeric Analysis
: The aim here is to measure skewness with a numerical statistic, without relying on a plot.
Why Correct Skewness?
Correcting skewness brings the data closer to a normal (Gaussian) distribution, which helps models that assume normality stay accurate.
Many machine learning algorithms assume normally distributed data; skewed data violates that assumption.
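The notes do not say which correction the lecture used. As one possible illustration, the sketch below applies a log transform, a common choice for right-skewed, strictly positive data. The synthetic data and the choice of transform are assumptions, not taken from the lecture.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (assumption: stands in for a real column).
rng = np.random.default_rng(0)
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

# A log transform compresses large values; it requires strictly positive data.
transformed = np.log(raw)

skew_before = raw.skew()
skew_after = transformed.skew()
print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
```

The skewness after the transform should be much closer to zero than before. Other transforms may suit other data better; this is only one option.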
Types of Skewness
Positive Skewness
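: The tail extends to the right. Most values cluster on the left, and a few large values stretch the distribution toward the right.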
Negative Skewness
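: The tail extends to the left. Most values cluster on the right, and a few small values stretch the distribution toward the left.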
Zero Skewness
: Perfectly symmetric distribution, as in a normal distribution.
Python Code for Skewness
Use of Pandas to read CSV data and calculate skewness.
Example output and interpretation of skewness values.
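The lecture's actual code and data file are not included in these notes. A minimal sketch of the same idea, assuming a small hypothetical CSV (the column names and values are made up for illustration), could look like this:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the lecture's data file.
csv_text = """right_tailed,left_tailed,symmetric
1,1,1
2,7,2
2,8,3
3,8,4
10,9,5
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# DataFrame.skew() returns one skewness value per numeric column.
skewness = df.skew()
print(skewness)
```

Reading the result: a positive value means a longer right tail, a negative value means a longer left tail, and a value near zero means the column is roughly symmetric.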
Transition to Data Visualization
Visualization
: Helps in understanding data relationships and distributions.
Univariate vs. Multivariate
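: The two main categories of data visualization. Univariate plots show one variable at a time; multivariate plots show relationships among several variables.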
Univariate Plots
Types
:
Histogram
: Groups values into bins and shows the count in each bin; a quick way to see the shape of a distribution.
Density Plots
: Shows the distribution as a smooth curve instead of bars, like a smoothed histogram.
Box and Whisker Plots
: Shows median, quartiles, and outliers.
Histogram
Characteristics
: Provides the count of observations in each bin.
Purpose
: Check visually whether the data follows a Gaussian distribution or is skewed.
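To make the binning concrete, here is a small sketch that computes histogram counts with NumPy and prints them as text bars. The synthetic data and the choice of 10 bins are assumptions for illustration.

```python
import numpy as np

# Synthetic, roughly Gaussian data (assumption: stands in for one attribute).
rng = np.random.default_rng(42)
values = rng.normal(loc=50.0, scale=10.0, size=500)

# np.histogram returns the count per bin and the bin edges (one more edge than bins).
counts, bin_edges = np.histogram(values, bins=10)

# Print a rough text histogram: one '#' per 5 observations.
for count, left, right in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"{left:6.1f} to {right:6.1f} | {'#' * (int(count) // 5)}")
```

For Gaussian-like data the bars should rise toward the middle bins and fall off on both sides; a skewed column would show a long run of short bars on one side. With a pandas DataFrame, `DataFrame.hist()` draws the graphical equivalent, provided matplotlib is installed.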
Density Plots
Characteristics
: A smoothed histogram drawn as a continuous line or curve.
Use
: Identify skewness and distribution.
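Density plots are commonly drawn from a kernel density estimate. The sketch below computes such a curve by hand with NumPy, assuming a Gaussian kernel and a rule-of-thumb bandwidth; the helper name `gaussian_kde_curve`, the bandwidth rule, and the synthetic data are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def gaussian_kde_curve(values, grid, bandwidth):
    """Average a Gaussian bump centered on each observation, evaluated on grid."""
    z = (grid[:, None] - values[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernel.mean(axis=1) / bandwidth

# Synthetic right-skewed data (assumption).
rng = np.random.default_rng(7)
values = rng.exponential(scale=2.0, size=400)

# Rule-of-thumb bandwidth (assumption; other choices are possible).
bandwidth = 1.06 * values.std(ddof=1) * len(values) ** (-0.2)

# Evaluate the curve slightly beyond the data range so the tails are included.
grid = np.linspace(values.min() - 3 * bandwidth, values.max() + 3 * bandwidth, 200)
density = gaussian_kde_curve(values, grid, bandwidth)

# A density curve should enclose an area of about 1 (trapezoid rule).
area = float(np.sum((density[:-1] + density[1:]) / 2 * np.diff(grid)))
peak_x = float(grid[np.argmax(density)])
print(f"area under curve: {area:.3f}, peak near x = {peak_x:.2f}")
```

For this right-skewed sample the peak sits near the low end of the range and the curve trails off to the right. In pandas, `Series.plot(kind="density")` draws a comparable curve, provided matplotlib and SciPy are installed.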
Box and Whisker Plots
Purpose
: Summarize data distribution, median, quartiles, and identify outliers.
Components
:
Box (25th to 75th percentile, the interquartile range)
Median line
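Whiskers (extend to the smallest and largest values that are not outliers, commonly those within 1.5 times the interquartile range of the box)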
Outliers (dots outside whiskers)
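The components above can be computed directly. This sketch assumes the common convention that whiskers reach the most extreme points within 1.5 times the interquartile range of the box; the sample values are made up for illustration.

```python
import numpy as np

# Small made-up sample with one unusually large value.
values = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 30], dtype=float)

# Box: 25th percentile, median, 75th percentile.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences under the 1.5 * IQR convention (assumption).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual outlier dot.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"box: {q1} to {q3}, median: {median}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"outliers: {outliers.tolist()}")
```

Here the value 30 falls above the upper fence, so it is reported as an outlier and the upper whisker stops at 10. With a pandas DataFrame, `DataFrame.boxplot()` draws the plot itself, provided matplotlib is installed.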
Multivariate Plots
Aim to show relationships between multiple variables.
Correlation Matrix Plots
: Show the strength and direction of the relationship between each pair of variables.
Correlation Matrix
Purpose
: Identify strong and weak correlations between variables using color coding (in the lecture's plot, blue for negative and yellow for positive).
Python Code
: Use Pandas and NumPy to calculate and visualize correlations.
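The lecture's code is not reproduced in these notes. A minimal sketch of the calculation step, assuming synthetic columns with made-up names, could look like this:

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): one base column and three derived columns.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "pos": 2.0 * x + rng.normal(scale=0.5, size=200),   # moves with x
    "neg": -x + rng.normal(scale=0.5, size=200),        # moves against x
    "noise": rng.normal(size=200),                      # unrelated to x
})

# DataFrame.corr() computes pairwise Pearson correlations by default.
corr = df.corr()
print(corr.round(2))
```

Values near +1 or -1 indicate a strong positive or negative relationship, and values near 0 indicate a weak one. To get the color-coded plot, the matrix can be passed to a plotting call such as matplotlib's `matshow`; the exact colors depend on the colormap in use.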
Summary
Visualization assists in understanding data distribution and relationships.
Univariate plots help understand single attribute distribution.
Multivariate plots reveal relationships between multiple attributes.
Data visualization is crucial for data analysis and machine learning preparation.
Discussion to continue in the next lecture.