Comprehensive Data Analytics Project Guide

Aug 1, 2024

End-to-End Data Analytics Project

Overview

  • Project to perform end-to-end data analytics using Kaggle API, Python, Pandas, and SQL.
  • Target audience: Data engineers, data analysts, and aspiring professionals.
  • All code will be shared on GitHub.

Steps of the Project

1. Setup Kaggle API

  • Create a Kaggle Profile: Sign up on kaggle.com
  • Generate API Token:
    • Go to Settings > API section > Create New Token
    • This generates a kaggle.json file.
  • Place the kaggle.json file in the directory: C:/Users/YourUsername/.kaggle/
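  • Optional check (a minimal sketch, assuming the default token location, which is ~/.kaggle/kaggle.json on Linux/macOS and C:/Users/YourUsername/.kaggle/kaggle.json on Windows):
    from pathlib import Path

    # The Kaggle client reads its API token from ~/.kaggle/kaggle.json by default.
    token_path = Path.home() / '.kaggle' / 'kaggle.json'
    if token_path.exists():
        print(f'Token found at {token_path}')
    else:
        print('kaggle.json is missing - create one from Settings > API on kaggle.com')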

2. Download Dataset Using Kaggle API

  • Install the Kaggle library: pip install kaggle
  • Download Dataset Command:
    kaggle datasets download -d <dataset-owner>/<dataset-name>
    
  • Extract the downloaded zip file with Python's built-in zipfile module, as sketched below.
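  • Extraction sketch (assumes the archive was saved as orders.csv.zip in the working directory; the actual file name depends on the dataset):
    import zipfile

    # Unpack the downloaded Kaggle archive into the current directory.
    with zipfile.ZipFile('orders.csv.zip') as zip_ref:
        zip_ref.extractall('.')   # produces orders.csv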

3. Load Data into Pandas

  • Import Pandas Library: import pandas as pd
  • Read CSV File:
    df = pd.read_csv('orders.csv')
    
  • Inspect Data: df.head(20) to view the first 20 records.
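  • Inspection sketch (the raw column names, such as Ship Mode, are assumptions about this particular file and may differ):
    import pandas as pd

    df = pd.read_csv('orders.csv')
    print(df.head(20))                 # first 20 rows
    print(df.dtypes)                   # data type of each column
    print(df['Ship Mode'].unique())    # spot placeholder values such as 'Not Available'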

4. Data Cleaning in Pandas

  • Handling Missing Values:
    • Replace "Not Available" and "Unknown" with missing values (NaN) in the ship_mode column.
    import numpy as np

    df['ship_mode'] = df['ship_mode'].replace(['Not Available', 'Unknown'], np.nan)
    
  • Renaming Columns: Use df.rename() for individual columns, or normalize every column name to lowercase with underscores:
    df.columns = df.columns.str.lower().str.replace(' ', '_')
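
  • Alternatively, the same placeholder strings can be treated as missing while the file is read (a hedged alternative, not a required step):
    df = pd.read_csv('orders.csv', na_values=['Not Available', 'Unknown'])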
    

5. Create New Columns

  • Discount Calculation:
    df['discount'] = df['list_price'] * (df['discount_percent'] / 100)
    
  • Sale Price Calculation:
    df['sale_price'] = df['list_price'] - df['discount']
    
  • Profit Calculation:
    df['profit'] = df['sale_price'] - df['cost_price']
    
  • Dropping Unnecessary Columns:
    df.drop(['list_price', 'cost_price', 'discount_percent'], axis=1, inplace=True)
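
  • A tiny worked example to sanity-check the formulas above (the numbers are illustrative only):
    import pandas as pd

    sample = pd.DataFrame({'list_price': [100.0], 'cost_price': [70.0], 'discount_percent': [20.0]})
    sample['discount'] = sample['list_price'] * (sample['discount_percent'] / 100)   # 20.0
    sample['sale_price'] = sample['list_price'] - sample['discount']                 # 80.0
    sample['profit'] = sample['sale_price'] - sample['cost_price']                   # 10.0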
    

6. Convert Order Date to DateTime

  • Convert Data Type:
    df['order_date'] = pd.to_datetime(df['order_date'], format='%Y-%m-%d')
    

7. Load Cleaned Data into SQL Server

  • Import SQLAlchemy and connect to SQL Server:
    from sqlalchemy import create_engine
    engine = create_engine('mssql+pyodbc://<username>:<password>@<server_name>/<database_name>?driver=ODBC+Driver+17+for+SQL+Server')
    
  • Load DataFrame to SQL Table:
    df.to_sql('df_orders', con=engine, index=False, if_exists='replace')
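
  • Optional verification (a minimal sketch; df_orders is the table name used in the to_sql call above):
    import pandas as pd

    # Read a few rows back from SQL Server to confirm the load worked.
    check = pd.read_sql('SELECT TOP 5 * FROM df_orders', con=engine)
    print(check)

  • Note: if_exists='replace' lets Pandas choose the SQL column types; when exact SQL Server types matter, a common alternative is to create the table manually and load with if_exists='append'.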
    

8. Data Analysis Questions

  1. Find Top 10 Highest Revenue Generating Products:

    • SQL Query (SQL Server syntax, so TOP rather than LIMIT):
    SELECT TOP 10 product_id, SUM(sale_price) AS sales
    FROM df_orders
    GROUP BY product_id
    ORDER BY sales DESC;
    
  2. Top 5 Highest Selling Products in Each Region:

    • Use a window function such as RANK() partitioned by region; hedged query sketches for questions 2 to 5 appear after this list.
  3. Month-over-Month Growth Comparison for 2022 and 2023:

    • Aggregate sales data by month and year.
  4. Highest Sales Month for Each Category:

    • Use GROUP BY and RANK() functions to find the highest sales per category.
  5. Subcategory with Highest Profit Growth in 2023 Compared to 2022:

    • Calculate each subcategory's profit for 2022 and 2023 and find the growth percentage.
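
  • Query sketches for questions 2 to 5 (hedged, one possible approach against the df_orders table; column names such as region, category, sub_category, order_date, sale_price, and profit are assumptions about the cleaned schema, and the syntax targets SQL Server):

    -- Q2: top 5 products by sales in each region
    WITH regional_sales AS (
        SELECT region, product_id, SUM(sale_price) AS sales,
               RANK() OVER (PARTITION BY region ORDER BY SUM(sale_price) DESC) AS rn
        FROM df_orders
        GROUP BY region, product_id
    )
    SELECT region, product_id, sales
    FROM regional_sales
    WHERE rn <= 5;

    -- Q3: month-by-month sales, 2022 vs 2023
    WITH monthly AS (
        SELECT YEAR(order_date) AS order_year, MONTH(order_date) AS order_month,
               SUM(sale_price) AS sales
        FROM df_orders
        GROUP BY YEAR(order_date), MONTH(order_date)
    )
    SELECT order_month,
           SUM(CASE WHEN order_year = 2022 THEN sales ELSE 0 END) AS sales_2022,
           SUM(CASE WHEN order_year = 2023 THEN sales ELSE 0 END) AS sales_2023
    FROM monthly
    GROUP BY order_month
    ORDER BY order_month;

    -- Q4: the month with the highest sales for each category
    WITH monthly_cat AS (
        SELECT category, FORMAT(order_date, 'yyyyMM') AS order_month,
               SUM(sale_price) AS sales,
               RANK() OVER (PARTITION BY category ORDER BY SUM(sale_price) DESC) AS rn
        FROM df_orders
        GROUP BY category, FORMAT(order_date, 'yyyyMM')
    )
    SELECT category, order_month, sales
    FROM monthly_cat
    WHERE rn = 1;

    -- Q5: sub-category with the highest profit growth from 2022 to 2023
    WITH yearly AS (
        SELECT sub_category, YEAR(order_date) AS order_year, SUM(profit) AS profit
        FROM df_orders
        GROUP BY sub_category, YEAR(order_date)
    ),
    pivoted AS (
        SELECT sub_category,
               SUM(CASE WHEN order_year = 2022 THEN profit ELSE 0 END) AS profit_2022,
               SUM(CASE WHEN order_year = 2023 THEN profit ELSE 0 END) AS profit_2023
        FROM yearly
        GROUP BY sub_category
    )
    SELECT TOP 1 sub_category,
           (profit_2023 - profit_2022) * 100.0 / NULLIF(profit_2022, 0) AS growth_pct
    FROM pivoted
    ORDER BY growth_pct DESC;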

Conclusion

  • The project covered data extraction, cleaning, transformation, and loading into SQL.
  • Conducted SQL analysis to derive insights; all code is shared on GitHub.