Modern Data Engineering Project with Uber Dataset
Introduction
- Objective: Create a data model and transformation code for the Uber dataset. Then deploy it on Google Cloud using Mage and load the data into BigQuery to create a dashboard.
- Components: Google Cloud Storage, Compute Engine, Mage (modern data pipeline tool), BigQuery, and Looker Studio for visualization.
Prerequisites
- A laptop with a stable internet connection.
- Basic understanding of Python and SQL, and a working Jupyter Notebook installation.
- Courses available on Python and SQL are suggested for foundational knowledge.
Google Cloud Products Used
- Google Cloud Storage: Stores all types of files and acts as object storage.
- Google Compute Engine: Analogous to EC2 in AWS; deploys Mage instance and transformation code.
- BigQuery: Cloud-based data warehouse for storing and analyzing data using SQL.
- Looker Studio: BI tool for creating dashboards.
Mage: Data Pipeline Tool
- Description: Similar to Airflow but with more functionality and easier interaction. Provides code templates for loading, transforming, and exporting data to target locations.
- Website: Explore Mage on their official website.
Fact and Dimension Tables
- Fact Table: Contains quantitative measures like counts, amounts, revenue, etc.
- Dimension Table: Contains descriptive attributes like product descriptions.
- Example: an orders table (fact) with customer, product, and date tables (dimensions).
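The split above can be sketched in pandas. This is a minimal illustration with made-up order data (the names `orders`, `customer_dim`, and `fact` are hypothetical, not from the project): the dimension holds one row per distinct descriptive value with a surrogate key, and the fact keeps the measures plus foreign keys.

```python
import pandas as pd

# Hypothetical raw orders data, just to illustrate the fact/dimension split.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer": ["Ana", "Ben", "Ana"],
    "product": ["Tea", "Coffee", "Tea"],
    "amount": [4.50, 3.00, 4.50],
})

# Dimension table: one row per distinct customer, with a surrogate key.
customer_dim = orders[["customer"]].drop_duplicates().reset_index(drop=True)
customer_dim["customer_id"] = customer_dim.index

# Fact table: quantitative measures plus foreign keys into the dimension.
fact = orders.merge(customer_dim, on="customer")[["order_id", "customer_id", "amount"]]
print(fact)
```

The same pattern (deduplicate, assign key, merge back) is what the Uber dimension tables use later in the project.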
Uber Dataset Details
- Includes yellow and green taxi trip records with fields like pickup/dropoff times, locations, distances, fare, etc.
- Data Dictionary: Provides descriptions of each column in the dataset.
Data Modelling Using Lucidchart
- Fact Table: Vendor ID, DateTime ID, Passenger Count, Trip Distance, Rate Code, etc.
- Dimension Tables: Created for date/time, passenger count, trip distance, pickup/dropoff locations, rate code, and payment type.
- Steps: Use Lucidchart for ERD. Each dimension table has a primary key for unique identification.
Data Transformation in Jupyter Notebook
- Read the Dataset: Import Pandas and read the CSV file.
- Convert Date Columns: Convert pickup and dropoff times to datetime format using pd.to_datetime().
- Create Dimension DataFrames: Drop duplicates, reset indices, extract relevant columns, and assign primary keys.
- Create Fact Table: Merge all dimension tables with the original data and select necessary columns.
- Store DataFrames: Collect the fact and dimension DataFrames into a dictionary so they can be passed between Mage blocks.
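The notebook steps above can be sketched as follows. The column names (VendorID, tpep_pickup_datetime, etc.) follow the NYC TLC trip-record schema the Uber dataset is based on; the tiny inline DataFrame stands in for the real CSV, which you would load with pd.read_csv().

```python
import pandas as pd

# Stand-in for pd.read_csv("uber_data.csv"); column names assume the
# NYC TLC trip-record schema -- adjust to your actual file.
df = pd.DataFrame({
    "VendorID": [1, 2, 1],
    "tpep_pickup_datetime": ["2016-03-01 00:00:00", "2016-03-01 00:05:00", "2016-03-01 00:00:00"],
    "tpep_dropoff_datetime": ["2016-03-01 00:07:55", "2016-03-01 00:16:06", "2016-03-01 00:07:55"],
    "passenger_count": [1, 2, 1],
    "trip_distance": [2.5, 4.1, 2.5],
    "fare_amount": [9.0, 14.5, 9.0],
    "total_amount": [10.8, 17.4, 10.8],
})

# Convert pickup/dropoff strings to proper datetimes.
df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
df["tpep_dropoff_datetime"] = pd.to_datetime(df["tpep_dropoff_datetime"])

# One dimension as an example: deduplicate, derive attributes,
# and assign a surrogate primary key from the index.
datetime_dim = (
    df[["tpep_pickup_datetime", "tpep_dropoff_datetime"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
datetime_dim["pick_hour"] = datetime_dim["tpep_pickup_datetime"].dt.hour
datetime_dim["pick_weekday"] = datetime_dim["tpep_pickup_datetime"].dt.weekday
datetime_dim["datetime_id"] = datetime_dim.index

# Fact table: merge the dimension back, keep measures and foreign keys.
fact_table = df.merge(
    datetime_dim, on=["tpep_pickup_datetime", "tpep_dropoff_datetime"]
)[["VendorID", "datetime_id", "passenger_count", "trip_distance", "fare_amount", "total_amount"]]
```

The full project repeats the dedupe-and-key pattern for each dimension (passenger count, trip distance, locations, rate code, payment type) before the final merge.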
Setting Up Google Cloud Environment
- Create Google Cloud Account: Free $300 credit at signup.
- Google Cloud Storage: Create a bucket and upload the dataset. Make the dataset public to get a URL for accessing it from Mage.
- Google Compute Engine: Create an instance, install Python, PIP, and necessary libraries.
- Install and Start Mage: Use PIP to install Mage. Run Mage and access its UI.
- Firewall Rules: Ensure proper rules are set to allow traffic on necessary ports.
Creating and Running Data Pipeline in Mage
- Data Loader Block: Use API to fetch data from Google Cloud Storage using the public URL of the dataset.
- Transformation Block: Write transformation logic to process the data as per the dimension model and convert it to dictionary format.
- Data Exporter Block: Export transformed data to BigQuery. Pass each DataFrame from the transformation block into corresponding BigQuery tables.
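A transformation block in that pipeline might look roughly like this. In Mage the function would carry the decorator from the generated block template; here it is shown as a plain function, with hypothetical table names, returning a dictionary of records so the exporter block can write each entry to its own BigQuery table.

```python
import pandas as pd

def transform(df: pd.DataFrame) -> dict:
    """Sketch of a Mage transformer: build one dimension and the fact
    table, then return everything as a dict keyed by target table name."""
    # Dimension: distinct passenger counts with a surrogate key.
    passenger_count_dim = (
        df[["passenger_count"]].drop_duplicates().reset_index(drop=True)
    )
    passenger_count_dim["passenger_count_id"] = passenger_count_dim.index

    # Fact: measures plus the foreign key into the dimension.
    fact = df.merge(passenger_count_dim, on="passenger_count")[
        ["passenger_count_id", "trip_distance", "fare_amount"]
    ]

    # Return plain dicts so the payload serializes cleanly between blocks.
    return {
        "passenger_count_dim": passenger_count_dim.to_dict(orient="records"),
        "fact_table": fact.to_dict(orient="records"),
    }
```

The exporter block then iterates over the dictionary and loads each key into the BigQuery table of the same name.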
Analyzing Data in BigQuery
- Create Dataset and Tables in BigQuery: Verify tables and preview data.
- Run SQL Queries: Join tables using common fields and run analysis queries to gain insights.
- Assignment: Practice queries for insights such as top pickup locations, total number of trips by passenger count, and average fare by hour.
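One of the assignment queries (trips by passenger count) can be illustrated with the same join-then-aggregate logic in pandas; the tiny tables and their names here are made up. The equivalent BigQuery SQL is shown in the comment.

```python
import pandas as pd

# Tiny stand-in tables; in BigQuery this is a GROUP BY over the fact
# table joined to its dimension, e.g. (table names assumed):
#   SELECT d.passenger_count, COUNT(*) AS total_trips
#   FROM fact_table f
#   JOIN passenger_count_dim d USING (passenger_count_id)
#   GROUP BY d.passenger_count;
fact = pd.DataFrame({
    "passenger_count_id": [0, 1, 0, 1, 1],
    "fare_amount": [10.0, 7.5, 12.0, 9.0, 6.5],
})
passenger_count_dim = pd.DataFrame({
    "passenger_count_id": [0, 1],
    "passenger_count": [1, 2],
})

# Total number of trips by passenger count.
trips = (
    fact.merge(passenger_count_dim, on="passenger_count_id")
        .groupby("passenger_count")
        .size()
        .rename("total_trips")
        .reset_index()
)
print(trips)
```

Top pickup locations and average fare by hour follow the same shape: join the fact table to the relevant dimension, group, and aggregate.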
Creating Dashboard in Looker Studio
- Set Up Looker Studio: Connect to BigQuery dataset and import tables.
- Design Dashboard: Add filters, scorecards (metrics), maps (pickup locations), and charts (for visualization).
- Customize: Adjust appearance, styles, and add necessary visual elements.
Conclusion
- End-to-End Project: Covers from data extraction, transformation, loading to BigQuery, and visualizing in Looker Studio.
- Future Work: Experiment with different datasets, build new models, and create new dashboards. Share progress on social media.
Congratulations if you reached the end of this project tutorial!