MySQL Data Cleaning Project

Jul 4, 2024

MySQL Series: Data Cleaning Project

Introduction

  • Focus: Data Cleaning
  • Purpose: Get data into a usable format by fixing issues in raw data
  • Importance: Ensures reliable data for visualizations and product integration
  • Today's Goal: Import, clean, and prepare dataset for analysis

Dataset

  • Topic: World Layoffs
  • Source: Available on GitHub (Link provided in the description)
  • Steps: Create a database, import dataset, clean data

Creating a Database

  1. Create new schema: World_layoffs
  2. No tables initially; right-click on tables -> Table Data Import Wizard
  3. Import layoffs dataset
  4. Create new table (layoffs); configure import settings
  5. Import 2,361 records

Data Fields

  • Company: Name of company that conducted layoffs
  • Location: Location of the company
  • Industry: Industry category
  • Total Laid Off: Number of employees laid off
  • Percentage Laid Off: Percentage of employees laid off
  • Date: Date of layoffs
  • Stage: Company stage (e.g., Series B, Post IPO)
  • Country: Country of company
  • Funds Raised (Millions): Amount of funds raised

Data Cleaning Steps

Step 1: Remove Duplicates

  • Essential for ensuring unique records in the dataset
  • Strategy:
    • Identify duplicate rows based on specific columns
    • Use row_number() window function to identify duplicates
    • Create staging table layoffs_staging to hold a temporary cleaned dataset
    • Delete rows where row_number > 1 to keep only unique records

Step 2: Standardize Data

  • Objective: Correct inconsistencies (e.g., spellings, extra spaces)
  • **Actions: **
    • Trim whitespace from columns, e.g., company field
    • Ensure consistent values, e.g., merge cryptos (Crypto, Cryptocurrencies) into Crypto
    • Fix issues in country names (e.g., remove trailing periods)

Step 3: Handle Null and Blank Values

  • Objective: Decide if null/blanks should be replaced or handled
  • Strategy:
    • Identify rows with NULL or blank values especially in critical columns (e.g., industry)
      • Use join to backfill NULL values based on another non-NULL entry of the same company/location.
    • Updated blanks to NULL for better handling

Step 4: Remove Unnecessary Columns/Rows

  • Objective: Optimize dataset by removing non-informative elements
  • **Actions: **
    • Removed row_num column post deduplication
    • Deleted rows where total_laid_off and percentage_laid_off are both blank/NULL

Step 5: Convert Data Types

  • Date field: Convert from text to date
    • Use STR_TO_DATE(column, format) to convert text to MySQL date format
    • Alter column in staging table to DATE type

Conclusion

  • Successfully cleaned dataset, ensuring it's ready for analysis
  • Followed real-world procedures: Avoid altering raw data, maintaining integrity by using staging tables

Next Steps

  • Perform exploratory data analysis on the cleaned dataset
    • Analyze trends and patterns
    • Run complex queries to extract insights

Final Thoughts

  • Data cleaning can be complex but crucial for reliable analyses
  • Step-by-step process ensures thorough and systematic cleaning
  • Great project to demonstrate and practice SQL skills
  • Helpful for portfolio work; reflects real-world data handling practices

Thank you for watching and participating!

Call to Action

  • Like, subscribe for more content on SQL and data analysis