📊

Data Analyst Portfolio Project Series - Lecture Notes

Jul 17, 2024

Data Analyst Portfolio Project Series

Overview

  • Series consisting of four projects.
    • Project 1: SQL - Data exploration and visualization setup for Tableau.
    • Project 2: Tableau - Visualization of SQL data.
    • Project 3: Advanced SQL - Focus on ETL processes and data cleaning.
    • Project 4: Python - Using pandas for data cleaning and visualization.

Project Goals

  • To provide a step-by-step guide for beginners to create a data analyst portfolio.
  • By the end of the series, participants should have a portfolio to showcase to potential employers.
  • Encouragement to understand and customize queries rather than just copying them.

Session Breakdown

Introduction

  • Begin by downloading, formatting a dataset in Excel, and importing it to SQL.
  • Focus dataset: COVID-19 data (cases, deaths, vaccinations, population).

Working with the Dataset

  1. Download: Suggestion to download data from 2020 onward for comprehensive analysis.
  2. Format: Initial formatting tasks (e.g., moving Population column to the beginning).
  3. Basic Queries: Writing queries to explore and extract data.

Key Queries and Concepts

  1. Percentage of Deaths: Calculate the percentage of deaths among total cases.

    SELECT location, date, total_cases, total_deaths, (total_deaths / total_cases * 100) AS death_percentage FROM covid_deaths WHERE continent IS NOT NULL ORDER BY location, date
  2. Population Percentage Infected: Calculate the percentage of population infected by COVID-19.

    SELECT location, date, total_cases, population, (total_cases / population * 100) AS percent_population_infected FROM covid_deaths WHERE continent IS NOT NULL ORDER BY location, date
  3. Highest Infection Rate: Identify countries with the highest infection rates as a percentage of their population.

    SELECT location, MAX(total_cases) AS highest_infection_count, population, (MAX(total_cases) / population * 100) AS percent_population_infected FROM covid_deaths WHERE continent IS NOT NULL GROUP BY location, population ORDER BY percent_population_infected DESC
  4. Highest Death Count per Population: Identify countries with the highest death counts as a percentage of their population.

    SELECT location, MAX(CAST(total_deaths AS int)) AS total_death_count FROM covid_deaths WHERE continent IS NOT NULL GROUP BY location ORDER BY total_death_count DESC
  5. Breaking Things Down by Continent: Aggregate data to show continental stats.

    SELECT continent, SUM(total_cases) AS total_cases, SUM(CAST(total_deaths AS int)) AS total_deaths FROM covid_deaths WHERE location IN ('World',...) GROUP BY continent

Advanced SQL Techniques

  1. Using CTEs and Temp Tables
    • Demonstration of how to use Common Table Expressions (CTEs) and Temporary Tables for complex queries.
    • Example of using a CTE to calculate the rolling sum of vaccinated individuals and its percentage against the population.
    WITH pop_vs_vacc AS ( SELECT continent, location, date, population, new_vaccinations, SUM(new_vaccinations) OVER (PARTITION BY location ORDER BY date) AS rolling_people_vaccinated FROM covid_deaths WHERE continent IS NOT NULL AND new_vaccinations IS NOT NULL ) SELECT *, (rolling_people_vaccinated / population * 100) AS percent_population_vaccinated FROM pop_vs_vacc

Views

  1. Creating a SQL View for Tableau: Create persistent views for easier data visualization in Tableau. CREATE VIEW percent_population_vaccinated AS SELECT continent, location, date, population, new_vaccinations, SUM(new_vaccinations) OVER (PARTITION BY location ORDER BY date) AS rolling_people_vaccinated, (SUM(new_vaccinations) OVER (PARTITION BY location ORDER BY date) / population * 100) AS percent_population_vaccinated FROM covid_deaths WHERE continent IS NOT NULL AND new_vaccinations IS NOT NULL; ``*

Conclusion

  • Upload final SQL scripts to GitHub for portfolio.
  • Preparation for the next part of the series: using Tableau for data visualization.

Future Steps

  • Episode 2: Introduction to creating visualizations using Tableau Public.