MySQL Series: Data Cleaning Project
Introduction
- Focus: Data Cleaning
- Purpose: Get data into a usable format by fixing issues in raw data
- Importance: Ensures reliable data for visualizations and product integration
- Today's Goal: Import, clean, and prepare dataset for analysis
Dataset
- Topic: World Layoffs
- Source: Available on GitHub (Link provided in the description)
- Steps: Create a database, import dataset, clean data
Creating a Database
- Create new schema:
World_layoffs
- No tables initially; right-click on tables -> Table Data Import Wizard
- Import
layoffs dataset
- Create new table (
layoffs
); configure import settings
- Import 2,361 records
Data Fields
- Company: Name of company that conducted layoffs
- Location: Location of the company
- Industry: Industry category
- Total Laid Off: Number of employees laid off
- Percentage Laid Off: Percentage of employees laid off
- Date: Date of layoffs
- Stage: Company stage (e.g., Series B, Post IPO)
- Country: Country of company
- Funds Raised (Millions): Amount of funds raised
Data Cleaning Steps
Step 1: Remove Duplicates
- Essential for ensuring unique records in the dataset
- Strategy:
- Identify duplicate rows based on specific columns
- Use
row_number()
window function to identify duplicates
- Create staging table
layoffs_staging
to hold a temporary cleaned dataset
- Delete rows where
row_number > 1
to keep only unique records
Step 2: Standardize Data
- Objective: Correct inconsistencies (e.g., spellings, extra spaces)
- **Actions: **
- Trim whitespace from columns, e.g.,
company
field
- Ensure consistent values, e.g., merge cryptos (Crypto, Cryptocurrencies) into
Crypto
- Fix issues in
country
names (e.g., remove trailing periods)
Step 3: Handle Null and Blank Values
- Objective: Decide if null/blanks should be replaced or handled
- Strategy:
- Identify rows with
NULL
or blank values especially in critical columns (e.g., industry
)
- Use
join
to backfill NULL
values based on another non-NULL entry of the same company/location.
- Updated blanks to
NULL
for better handling
Step 4: Remove Unnecessary Columns/Rows
- Objective: Optimize dataset by removing non-informative elements
- **Actions: **
- Removed
row_num
column post deduplication
- Deleted rows where
total_laid_off
and percentage_laid_off
are both blank/NULL
Step 5: Convert Data Types
- Date field: Convert from
text
to date
- Use
STR_TO_DATE(column, format)
to convert text to MySQL date format
- Alter column in staging table to
DATE
type
Conclusion
- Successfully cleaned dataset, ensuring it's ready for analysis
- Followed real-world procedures: Avoid altering raw data, maintaining integrity by using staging tables
Next Steps
- Perform exploratory data analysis on the cleaned dataset
- Analyze trends and patterns
- Run complex queries to extract insights
Final Thoughts
- Data cleaning can be complex but crucial for reliable analyses
- Step-by-step process ensures thorough and systematic cleaning
- Great project to demonstrate and practice SQL skills
- Helpful for portfolio work; reflects real-world data handling practices
Thank you for watching and participating!
Call to Action
- Like, subscribe for more content on SQL and data analysis