Modern Data Engineering Project with Uber Dataset

Jul 7, 2024

Introduction

  • Objective: Build a data model and transformation code for the Uber dataset, deploy the pipeline on Google Cloud using Mage, load the data into BigQuery, and build a dashboard on top of it.
  • Components: Google Cloud Storage, Compute Engine, Mage (modern data pipeline tool), BigQuery, and Looker Studio for visualization.

Prerequisites

  • A laptop with a stable internet connection.
  • Basic understanding of Python and SQL, with Jupyter Notebook installed.
  • Courses available on Python and SQL are suggested for foundational knowledge.

Google Cloud Products Used

  1. Google Cloud Storage: Stores all types of files and acts as object storage.
  2. Google Compute Engine: Analogous to EC2 in AWS; hosts the Mage instance and runs the transformation code.
  3. BigQuery: Cloud-based data warehouse for storing and analyzing data using SQL.
  4. Looker Studio: BI tool for creating dashboards.

Mage: Data Pipeline Tool

  • Description: Similar to Airflow but with more functionality and easier interaction. Provides code templates for loading, transforming, and exporting data to target locations.
  • Website: Explore Mage on their official website.

Fact and Dimension Tables

  • Fact Table: Contains quantitative measures like trip counts, revenue, etc.
  • Dimension Table: Contains descriptive attributes like product descriptions.
  • Example: An orders table (fact) with customer, product, and date tables (dimensions).
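The split can be sketched in pandas; the orders data and column names below are toy examples, not from the Uber dataset:

```python
import pandas as pd

# Hypothetical orders data for illustration only.
orders = pd.DataFrame({
    "customer": ["alice", "bob", "alice"],
    "product": ["widget", "gadget", "widget"],
    "amount": [9.99, 24.50, 9.99],
})

# Dimension table: one row per unique customer, with a surrogate primary key.
dim_customer = orders[["customer"]].drop_duplicates().reset_index(drop=True)
dim_customer["customer_id"] = dim_customer.index

# Fact table: quantitative measures plus a foreign key into the dimension.
fact_orders = orders.merge(dim_customer, on="customer")[["customer_id", "amount"]]
```

The same pattern repeats for every dimension: deduplicate the descriptive attribute, assign a surrogate key, and keep only keys and measures in the fact table.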

Uber Dataset Details

  • Includes yellow and green taxi trip records with fields like pickup/dropoff times, locations, distances, fare, etc.
  • Data Dictionary: Provides descriptions of each column in the dataset.

Data Modelling Using Lucidchart

  • Fact Table: Vendor ID, DateTime ID, Passenger Count, Trip Distance, Rate Code, etc.
  • Dimension Tables: Created for date/time, passenger count, trip distance, pickup/dropoff locations, rate code, and payment type.
  • Steps: Use Lucidchart for ERD. Each dimension table has a primary key for unique identification.

Data Transformation in Jupyter Notebook

  1. Read the Dataset: Import Pandas and read the CSV file.
  2. Convert Date Columns: Convert pickup and dropoff times to datetime format using pd.to_datetime().
  3. Create Dimension DataFrames: Drop duplicates, reset indices, extract relevant columns, and assign primary keys.
  4. Create Fact Table: Merge all dimension tables with the original data and select necessary columns.
  5. Store DataFrames: Convert the fact and dimension DataFrames into a dictionary so they can be passed between Mage blocks.
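The five steps can be sketched as follows; this is a minimal sketch assuming TLC-style column names (tpep_pickup_datetime, etc.) and toy in-memory rows in place of the real CSV read in step 1:

```python
import pandas as pd

# Step 1 (stand-in): real data would come from pd.read_csv("uber_data.csv").
df = pd.DataFrame({
    "tpep_pickup_datetime": ["2016-03-01 00:00:01", "2016-03-01 00:00:04"],
    "tpep_dropoff_datetime": ["2016-03-01 00:07:55", "2016-03-01 00:11:06"],
    "passenger_count": [1, 2],
    "trip_distance": [2.5, 2.9],
    "fare_amount": [9.0, 11.0],
})

# Step 2: convert pickup/dropoff columns to datetime.
df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
df["tpep_dropoff_datetime"] = pd.to_datetime(df["tpep_dropoff_datetime"])

# Step 3: build dimension DataFrames with surrogate primary keys.
passenger_count_dim = df[["passenger_count"]].drop_duplicates().reset_index(drop=True)
passenger_count_dim["passenger_count_id"] = passenger_count_dim.index

trip_distance_dim = df[["trip_distance"]].drop_duplicates().reset_index(drop=True)
trip_distance_dim["trip_distance_id"] = trip_distance_dim.index

# Step 4: merge the dimensions back to form the fact table.
fact_table = (
    df.merge(passenger_count_dim, on="passenger_count")
      .merge(trip_distance_dim, on="trip_distance")
      [["passenger_count_id", "trip_distance_id", "fare_amount"]]
)

# Step 5: package everything as a dictionary to hand over to Mage.
payload = {
    "passenger_count_dim": passenger_count_dim.to_dict(orient="records"),
    "trip_distance_dim": trip_distance_dim.to_dict(orient="records"),
    "fact_table": fact_table.to_dict(orient="records"),
}
```

The real notebook builds one dimension per attribute listed in the data model (date/time, rate code, payment type, locations), but each follows the same deduplicate-key-merge pattern.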

Setting Up Google Cloud Environment

  1. Create Google Cloud Account: Free $300 credit at signup.
  2. Google Cloud Storage: Create a bucket and upload the dataset. Make the dataset public to get a URL for accessing it from Mage.
  3. Google Compute Engine: Create an instance, install Python, PIP, and necessary libraries.
  4. Install and Start Mage: Use PIP to install Mage. Run Mage and access its UI.
  5. Firewall Rules: Ensure proper rules are set to allow traffic on necessary ports.

Creating and Running Data Pipeline in Mage

  1. Data Loader Block: Use the API template to fetch the dataset from Google Cloud Storage via its public URL.
  2. Transformation Block: Write transformation logic to process the data as per the dimension model and convert it to dictionary format.
  3. Data Exporter Block: Export transformed data to BigQuery. Pass each DataFrame from the transformation block into corresponding BigQuery tables.

Analyzing Data in BigQuery

  1. Create Dataset and Tables in BigQuery: Verify tables and preview data.
  2. Run SQL Queries: Join tables using common fields and run analysis queries to gain insights.
  3. Assignment: Practice queries for insights such as top pickup locations, total number of trips by passenger count, and average fare by hour.
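The assignment insights can be prototyped in pandas before translating them to BigQuery SQL; the tables, column names, and values below are illustrative stand-ins for the warehouse tables built earlier:

```python
import pandas as pd

# Toy fact and dimension tables (hypothetical columns and values).
fact = pd.DataFrame({
    "pickup_location_id": [1, 1, 2],
    "passenger_count_id": [0, 1, 0],
    "fare_amount": [9.0, 11.0, 7.5],
    "pickup_hour": [8, 8, 17],
})
passenger_count_dim = pd.DataFrame({
    "passenger_count_id": [0, 1],
    "passenger_count": [1, 2],
})

# Top pickup locations: SELECT pickup_location_id, COUNT(*) ... GROUP BY ... ORDER BY count.
top_pickups = fact["pickup_location_id"].value_counts()

# Total trips by passenger count: JOIN on the dimension's primary key, then GROUP BY.
trips_by_passengers = (
    fact.merge(passenger_count_dim, on="passenger_count_id")
        .groupby("passenger_count")["fare_amount"].count()
)

# Average fare by hour of day: GROUP BY hour with AVG(fare_amount).
avg_fare_by_hour = fact.groupby("pickup_hour")["fare_amount"].mean()
```

Each groupby maps one-to-one onto a GROUP BY query in BigQuery, and the merge corresponds to the JOIN on the shared key field.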

Creating Dashboard in Looker Studio

  1. Set Up Looker Studio: Connect to BigQuery dataset and import tables.
  2. Design Dashboard: Add filters, scorecards (metrics), maps (pickup locations), and charts (for visualization).
  3. Customize: Adjust appearance, styles, and add necessary visual elements.

Conclusion

  • End-to-End Project: Covers data extraction, transformation, loading into BigQuery, and visualization in Looker Studio.
  • Future Work: Experiment with different datasets, build new data models, and create new dashboards. Share progress on social media.

Congratulations if you reached the end of this project tutorial!