Data Analyst Portfolio Project Series - Lecture Notes

Jul 17, 2024

Data Analyst Portfolio Project Series

Overview

  • Series consisting of four projects.
    • Project 1: SQL - Data exploration and visualization setup for Tableau.
    • Project 2: Tableau - Visualization of SQL data.
    • Project 3: Advanced SQL - Focus on ETL processes and data cleaning.
    • Project 4: Python - Using pandas for data cleaning and visualization.

Project Goals

  • To provide a step-by-step guide for beginners to create a data analyst portfolio.
  • By the end of the series, participants should have a portfolio to showcase to potential employers.
  • Encouragement to understand and customize queries rather than just copying them.

Session Breakdown

Introduction

  • Begin by downloading, formatting a dataset in Excel, and importing it to SQL.
  • Focus dataset: COVID-19 data (cases, deaths, vaccinations, population).

Working with the Dataset

  1. Download: Suggestion to download data from 2020 onward for comprehensive analysis.
  2. Format: Initial formatting tasks (e.g., moving Population column to the beginning).
  3. Basic Queries: Writing queries to explore and extract data.

Key Queries and Concepts

  1. Percentage of Deaths: Calculate the percentage of deaths among total cases.

    SELECT 
        location, 
        date, 
        total_cases, 
        total_deaths, 
        (total_deaths / total_cases * 100) AS death_percentage
    FROM covid_deaths 
    WHERE continent IS NOT NULL
    ORDER BY location, date
    
  2. Population Percentage Infected: Calculate the percentage of population infected by COVID-19.

    SELECT 
        location, 
        date, 
        total_cases, 
        population, 
        (total_cases / population * 100) AS percent_population_infected 
    FROM covid_deaths 
    WHERE continent IS NOT NULL
    ORDER BY location, date
    
  3. Highest Infection Rate: Identify countries with the highest infection rates as a percentage of their population.

    SELECT 
        location, 
        MAX(total_cases) AS highest_infection_count, 
        population, 
        (MAX(total_cases) / population * 100) AS percent_population_infected 
    FROM covid_deaths 
    WHERE continent IS NOT NULL
    GROUP BY location, population 
    ORDER BY percent_population_infected DESC
    
  4. Highest Death Count per Population: Identify countries with the highest death counts as a percentage of their population.

    SELECT 
        location, 
        MAX(CAST(total_deaths AS int)) AS total_death_count 
    FROM covid_deaths 
    WHERE continent IS NOT NULL 
    GROUP BY location
    ORDER BY total_death_count DESC
    
  5. Breaking Things Down by Continent: Aggregate data to show continental stats.

    SELECT 
        continent, 
        SUM(total_cases) AS total_cases, 
        SUM(CAST(total_deaths AS int)) AS total_deaths 
    FROM covid_deaths 
    WHERE location IN ('World',...) 
    GROUP BY continent
    

Advanced SQL Techniques

  1. Using CTEs and Temp Tables
    • Demonstration of how to use Common Table Expressions (CTEs) and Temporary Tables for complex queries.
    • Example of using a CTE to calculate the rolling sum of vaccinated individuals and its percentage against the population.
    WITH pop_vs_vacc AS (
        SELECT 
            continent, 
            location, 
            date, 
            population,
            new_vaccinations,
            SUM(new_vaccinations) OVER (PARTITION BY location ORDER BY date) AS rolling_people_vaccinated
        FROM covid_deaths 
        WHERE continent IS NOT NULL 
            AND new_vaccinations IS NOT NULL
    )
    SELECT *, 
        (rolling_people_vaccinated / population * 100) AS percent_population_vaccinated 
    FROM pop_vs_vacc
    

Views

  1. Creating a SQL View for Tableau: Create persistent views for easier data visualization in Tableau.
    CREATE VIEW percent_population_vaccinated AS
    SELECT 
        continent, 
        location, 
        date, 
        population,
        new_vaccinations,
        SUM(new_vaccinations) OVER (PARTITION BY location ORDER BY date) AS rolling_people_vaccinated,
        (SUM(new_vaccinations) OVER (PARTITION BY location ORDER BY date) / population * 100) AS percent_population_vaccinated 
    FROM covid_deaths 
    WHERE continent IS NOT NULL 
        AND new_vaccinations IS NOT NULL;
    ``
    
    

Conclusion

  • Upload final SQL scripts to GitHub for portfolio.
  • Preparation for the next part of the series: using Tableau for data visualization.

Future Steps

  • Episode 2: Introduction to creating visualizations using Tableau Public.