Comprehensive Data Analytics Project Guide

Aug 1, 2024

End-to-End Data Analytics Project

Overview

  • Project to perform end-to-end data analytics using Kaggle API, Python, Pandas, and SQL.
  • Target audience: Data engineers, data analysts, and aspiring professionals.
  • All code will be shared on GitHub.

Steps of the Project

1. Setup Kaggle API

  • Create a Kaggle Profile: Sign up on kaggle.com
  • Generate API Token:
    • Go to Settings > API section > Create New Token
    • This generates a kaggle.json file.
  • Place the kaggle.json file in the directory: C:/Users/YourUsername/.kaggle/
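  • Optional check (a minimal sketch, assuming the default token location, which is ~/.kaggle/kaggle.json on Linux/macOS and C:/Users/YourUsername/.kaggle/kaggle.json on Windows):
    from pathlib import Path

    # The Kaggle client reads its API token from ~/.kaggle/kaggle.json by default.
    token_path = Path.home() / '.kaggle' / 'kaggle.json'
    if token_path.exists():
        print(f'Token found at {token_path}')
    else:
        print('kaggle.json is missing - create one from Settings > API on kaggle.com')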

2. Download Dataset Using Kaggle API

  • Install the Kaggle library: pip install kaggle
  • Download Dataset Command:
    kaggle datasets download -d <dataset-owner>/<dataset-name>
    
  • Extract the downloaded zip file with Python's built-in zipfile module, as sketched below.
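  • Extraction sketch (assumes the archive was saved as orders.csv.zip in the working directory; the actual file name depends on the dataset):
    import zipfile

    # Unpack the downloaded Kaggle archive into the current directory.
    with zipfile.ZipFile('orders.csv.zip') as zip_ref:
        zip_ref.extractall('.')   # produces orders.csv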

3. Load Data into Pandas

  • Import Pandas Library: import pandas as pd
  • Read CSV File:
    df = pd.read_csv('orders.csv')
    
  • Inspect Data: df.head(20) to view the first 20 records.
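  • Inspection sketch (the raw column names, such as Ship Mode, are assumptions about this particular file and may differ):
    import pandas as pd

    df = pd.read_csv('orders.csv')
    print(df.head(20))                 # first 20 rows
    print(df.dtypes)                   # data type of each column
    print(df['Ship Mode'].unique())    # spot placeholder values such as 'Not Available'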

4. Data Cleaning in Pandas

  • Handling Missing Values:
    • Replace "Not Available" and "Unknown" with missing values (NaN) in the ship_mode column.
    import numpy as np

    df['ship_mode'] = df['ship_mode'].replace(['Not Available', 'Unknown'], np.nan)
    
  • Renaming Columns: Use df.rename() for individual columns, or normalize every column name to lowercase with underscores:
    df.columns = df.columns.str.lower().str.replace(' ', '_')
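
  • Alternatively, the same placeholder strings can be treated as missing while the file is read (a hedged alternative, not a required step):
    df = pd.read_csv('orders.csv', na_values=['Not Available', 'Unknown'])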
    

5. Create New Columns

  • Discount Calculation:
    df['discount'] = df['list_price'] * (df['discount_percent'] / 100)
    
  • Sale Price Calculation:
    df['sale_price'] = df['list_price'] - df['discount']
    
  • Profit Calculation:
    df['profit'] = df['sale_price'] - df['cost_price']
    
  • Dropping Unnecessary Columns:
    df.drop(['list_price', 'cost_price', 'discount_percent'], axis=1, inplace=True)
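
  • A tiny worked example to sanity-check the formulas above (the numbers are illustrative only):
    import pandas as pd

    sample = pd.DataFrame({'list_price': [100.0], 'cost_price': [70.0], 'discount_percent': [20.0]})
    sample['discount'] = sample['list_price'] * (sample['discount_percent'] / 100)   # 20.0
    sample['sale_price'] = sample['list_price'] - sample['discount']                 # 80.0
    sample['profit'] = sample['sale_price'] - sample['cost_price']                   # 10.0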
    

6. Convert Order Date to DateTime

  • Convert Data Type:
    df['order_date'] = pd.to_datetime(df['order_date'], format='%Y-%m-%d')
    

7. Load Cleaned Data into SQL Server

  • Import SQLAlchemy and connect to SQL Server:
    from sqlalchemy import create_engine
    engine = create_engine('mssql+pyodbc://<username>:<password>@<server_name>/<database_name>?driver=ODBC+Driver+17+for+SQL+Server')
    
  • Load DataFrame to SQL Table:
    df.to_sql('df_orders', con=engine, index=False, if_exists='replace')
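
  • Optional verification (a minimal sketch; df_orders is the table name used in the to_sql call above):
    import pandas as pd

    # Read a few rows back from SQL Server to confirm the load worked.
    check = pd.read_sql('SELECT TOP 5 * FROM df_orders', con=engine)
    print(check)

  • Note: if_exists='replace' lets Pandas choose the SQL column types; when exact SQL Server types matter, a common alternative is to create the table manually and load with if_exists='append'.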
    

8. Data Analysis Questions

  1. Find Top 10 Highest Revenue Generating Products:

    • SQL Query (SQL Server syntax, so TOP rather than LIMIT):
    SELECT TOP 10 product_id, SUM(sale_price) AS sales
    FROM df_orders
    GROUP BY product_id
    ORDER BY sales DESC;
    
  2. Top 5 Highest Selling Products in Each Region:

    • Use a window function such as RANK() partitioned by region; hedged query sketches for questions 2 to 5 appear after this list.
  3. Month-over-Month Growth Comparison for 2022 and 2023:

    • Aggregate sales data by month and year.
  4. Highest Sales Month for Each Category:

    • Use GROUP BY and RANK() functions to find the highest sales per category.
  5. Subcategory with Highest Profit Growth in 2023 Compared to 2022:

    • Calculate each subcategory's profit for 2022 and 2023 and find the growth percentage.
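
  • Query sketches for questions 2 to 5 (hedged, one possible approach against the df_orders table; column names such as region, category, sub_category, order_date, sale_price, and profit are assumptions about the cleaned schema, and the syntax targets SQL Server):

    -- Q2: top 5 products by sales in each region
    WITH regional_sales AS (
        SELECT region, product_id, SUM(sale_price) AS sales,
               RANK() OVER (PARTITION BY region ORDER BY SUM(sale_price) DESC) AS rn
        FROM df_orders
        GROUP BY region, product_id
    )
    SELECT region, product_id, sales
    FROM regional_sales
    WHERE rn <= 5;

    -- Q3: month-by-month sales, 2022 vs 2023
    WITH monthly AS (
        SELECT YEAR(order_date) AS order_year, MONTH(order_date) AS order_month,
               SUM(sale_price) AS sales
        FROM df_orders
        GROUP BY YEAR(order_date), MONTH(order_date)
    )
    SELECT order_month,
           SUM(CASE WHEN order_year = 2022 THEN sales ELSE 0 END) AS sales_2022,
           SUM(CASE WHEN order_year = 2023 THEN sales ELSE 0 END) AS sales_2023
    FROM monthly
    GROUP BY order_month
    ORDER BY order_month;

    -- Q4: the month with the highest sales for each category
    WITH monthly_cat AS (
        SELECT category, FORMAT(order_date, 'yyyyMM') AS order_month,
               SUM(sale_price) AS sales,
               RANK() OVER (PARTITION BY category ORDER BY SUM(sale_price) DESC) AS rn
        FROM df_orders
        GROUP BY category, FORMAT(order_date, 'yyyyMM')
    )
    SELECT category, order_month, sales
    FROM monthly_cat
    WHERE rn = 1;

    -- Q5: sub-category with the highest profit growth from 2022 to 2023
    WITH yearly AS (
        SELECT sub_category, YEAR(order_date) AS order_year, SUM(profit) AS profit
        FROM df_orders
        GROUP BY sub_category, YEAR(order_date)
    ),
    pivoted AS (
        SELECT sub_category,
               SUM(CASE WHEN order_year = 2022 THEN profit ELSE 0 END) AS profit_2022,
               SUM(CASE WHEN order_year = 2023 THEN profit ELSE 0 END) AS profit_2023
        FROM yearly
        GROUP BY sub_category
    )
    SELECT TOP 1 sub_category,
           (profit_2023 - profit_2022) * 100.0 / NULLIF(profit_2022, 0) AS growth_pct
    FROM pivoted
    ORDER BY growth_pct DESC;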

Conclusion

  • The project covered data extraction, cleaning, transformation, and loading into SQL.
  • Conducted SQL analysis to derive insights; all code is shared on GitHub.