Implementing a Classification Algorithm Using Decision Tree

Jul 16, 2024

Overview

  • Objective: Implement a classification algorithm using a decision tree in Microsoft Azure ML Studio Classic.
  • Theory Primer: Assumes prior knowledge of decision trees.
  • Tools Used: Microsoft Azure ML Studio Classic, a browser-based tool that runs on any machine.

Starting with Microsoft Azure ML Studio Classic

  1. Sign in to Studio: Navigate to studio.azureml.net and sign in.
  2. Create New Experiment: Click +New, then select Blank Experiment.

Setting up the Experiment

  1. Navigate to Saved Datasets: Use the panel on the left to find Saved Datasets under Samples.
  2. Select Datasets: Drag the following onto the workspace canvas:
    • Airport Codes dataset (second in the list)
    • Flight On-time Performance dataset

Data Understanding

  1. Details of Flight On-time Performance Dataset:
    • Columns: 18
    • Rows: 54,000
    • Contents: Flight performance data from 2011
    • Variables include year, quarter, day, carrier, origin/destination airport ID, and flight times (departure, arrival, delays).
    • We are specifically interested in the delay columns (departure and arrival) and the flags indicating whether a delay exceeds 15 minutes; a quick way to inspect these columns in code is sketched after this list.
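
Azure ML Studio Classic surfaces this information through the dataset's Visualize option. For readers who prefer to check it in code, here is a minimal pandas sketch; the CSV file name is a hypothetical export, not something the Studio provides by default.

```python
import pandas as pd

# Hypothetical CSV export of the Flight On-time Performance sample dataset.
flights = pd.read_csv("flight_on_time_performance.csv")

print(flights.shape)   # (rows, columns) -- expect 18 columns
print(flights.dtypes)  # column names and inferred types
print(flights.head())  # first few records for a quick sanity check
```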

Working with Columns

Information on Critical Columns

  • Departure Delay: Number of minutes the flight departed late; a negative value indicates an early departure.
  • Arrival Delay: Same as the departure delay, but for arrivals.
  • Departure Delay 15 (DepDel15): Binary flag set to 1 if the departure delay exceeds 15 minutes, otherwise 0.
  • Arrival Delay 15 (ArrDel15): Binary flag set to 1 if the arrival delay exceeds 15 minutes, otherwise 0 (both flags are reproduced in the sketch after this list).
  • Categorical Columns: Date-related fields, carrier codes, airport identifiers, and similar.
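
Both flags can be derived directly from the delay columns. A minimal pandas sketch, assuming the columns are named departure_delay and arrival_delay as elsewhere in this walkthrough:

```python
import pandas as pd

# Illustrative frame with the two delay columns (minutes; negative = early).
flights = pd.DataFrame({"departure_delay": [-3, 22, 7],
                        "arrival_delay": [5, 31, -10]})

# 1 if the delay exceeds 15 minutes, otherwise 0 -- mirrors DepDel15 / ArrDel15.
flights["DepDel15"] = (flights["departure_delay"] > 15).astype(int)
flights["ArrDel15"] = (flights["arrival_delay"] > 15).astype(int)
print(flights)
```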

Task: Combine Data Sets

  1. Need: Combine the Airport Codes dataset with the main dataset so that airport IDs map to meaningful names and locations.
  2. Steps:
    • Use Edit Metadata on the Airport Codes dataset to rename its columns so they are suitable for merging.
    • Perform a join between the two datasets on origin_airport_id or dest_airport_id.
    • Add new columns for origin and destination details (city, state, airport name) to make the data easier to read and query; the merge sketch after these steps shows the same operation in code.
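
Outside the designer, the same rename-and-join sequence can be expressed as two pandas merges, one for the origin airport and one for the destination. The file names and the Airport Codes column names (airport_id, city, state, name) are assumptions here.

```python
import pandas as pd

flights = pd.read_csv("flight_on_time_performance.csv")  # hypothetical export
airports = pd.read_csv("airport_codes.csv")              # hypothetical export

# Prefixing the lookup columns plays the role of the Edit Metadata rename,
# so the origin and destination joins do not collide with each other.
origin = airports.add_prefix("origin_")  # airport_id -> origin_airport_id, ...
dest = airports.add_prefix("dest_")      # airport_id -> dest_airport_id, ...

# Left joins keep every flight row and attach city/state/airport name details.
flights = flights.merge(origin, on="origin_airport_id", how="left")
flights = flights.merge(dest, on="dest_airport_id", how="left")
```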

Data Transformation and Preparation

  1. Columns to Remove: Drop columns that are not required for this analysis, e.g., origin_airport_id, dest_airport_id, cancelled, and diverted.
  2. Segregate: Use Edit Metadata to mark categorical vs. numeric features.
  3. Handle Missing Values: Use Clean Missing Data on the categorical and numeric columns separately.
  4. Normalize Data: Apply Normalize Data to the numeric columns (departure_delay and arrival_delay) to ensure consistent scaling; a code sketch of these four steps follows this list.
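
A rough pandas/scikit-learn equivalent of these four steps, with illustrative column lists and simple imputation choices (mode for categorical columns, median for numeric ones, then min-max scaling of the delay columns). The Clean Missing Data and Normalize Data modules offer other strategies; this is only one plausible combination.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

flights = pd.read_csv("joined_flights.csv")  # hypothetical output of the join

# 1. Drop columns not needed for this analysis.
flights = flights.drop(columns=["origin_airport_id", "dest_airport_id",
                                "cancelled", "diverted"])

# 2. Segregate categorical vs. numeric features (these lists are assumptions).
categorical = ["carrier", "origin_city", "dest_city"]
numeric = ["departure_delay", "arrival_delay"]

# 3. Handle missing values separately for each group.
for col in categorical:
    flights[col] = flights[col].fillna(flights[col].mode().iloc[0])
flights[numeric] = flights[numeric].fillna(flights[numeric].median())

# 4. Scale the numeric delay columns to a common range.
flights[numeric] = MinMaxScaler().fit_transform(flights[numeric])
```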

Splitting the Data

  1. Initial Split: Use a Split Data module to divide the data into training (95%) and test (5%) sets.
  2. Further Split: Use a second Split Data module to partition the 95% again, so that roughly 81% of the original rows are used for training and 14% for validation (about an 85/15 split of the training portion); the same two splits are sketched in code below.
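
The two Split Data modules map onto two calls to train_test_split. The second split uses a fraction of 14/95 so that the validation set is about 14% of the original data, leaving roughly 81% for training; the file name is again a hypothetical export of the prepared data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

flights = pd.read_csv("prepared_flights.csv")  # hypothetical prepared dataset

# First split: 95% for training + validation, 5% held out as the test set.
train_val, test = train_test_split(flights, test_size=0.05, random_state=42)

# Second split: 14/95 of the remaining rows become the validation set,
# leaving roughly 81% of the original data for training.
train, val = train_test_split(train_val, test_size=14 / 95, random_state=42)
```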

Setting up the Decision Tree Model

  1. Algorithm Choice: Use the Two-Class Boosted Decision Tree module for classification.
  2. Hyperparameter Tuning:
    • Use the Tune Model Hyperparameters module to find the best configuration for the classifier.
    • Specify the evaluation metric to optimize (e.g., F-score, since the delayed/on-time classes are imbalanced); a scikit-learn analogue of this step is sketched after this list.
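
Two-Class Boosted Decision Tree and Tune Model Hyperparameters are designer modules with no code surface in Studio Classic, but a loose scikit-learn analogue is a gradient-boosted tree wrapped in a grid search that optimizes F1. The parameter grid, the file name, and the choice of ArrDel15 as the label are all assumptions for illustration.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

train = pd.read_csv("train.csv")  # hypothetical training split

# Assumed label: whether the arrival delay exceeded 15 minutes.
y_train = train["ArrDel15"]
X_train = pd.get_dummies(train.drop(columns=["ArrDel15"]))  # encode categoricals

# Search a small grid of boosting hyperparameters, scored by F1 because the
# delayed/on-time classes are imbalanced.
search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_grid={
        "n_estimators": [100, 200],
        "learning_rate": [0.05, 0.1],
        "max_depth": [2, 3, 4],
    },
    scoring="f1",
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```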

Evaluation and Model Scoring

  1. Link Modules:
    • Connect the best model output of Tune Model Hyperparameters to Train Model along with the training dataset.
    • Use Score Model to run the trained model over the validation and/or test set.
  2. Evaluate Model: Use Evaluate Model to obtain metrics such as accuracy, F1 score, precision, and recall; the same metrics can be computed in code as sketched below.
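
Evaluate Model's headline numbers can be reproduced with scikit-learn's metric functions. The small arrays below stand in for the real validation labels and scored predictions, just to show the calls.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# y_val: true labels from the validation split; y_pred: scored predictions.
y_val  = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]

print("accuracy :", accuracy_score(y_val, y_pred))
print("precision:", precision_score(y_val, y_pred))
print("recall   :", recall_score(y_val, y_pred))
print("F1       :", f1_score(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
```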

Ending Notes and Further Tasks

  • Deployment: Post-modeling steps include publishing the experiment as a web service for real-time predictions.
  • Exploratory Data Analysis (EDA): Use tools such as Power BI or KNIME for visual analysis of questions related to customer churn and similar metrics. The Power BI workflow includes importing the dataset, creating visuals, performing imputations in Azure, and using Power BI for the final visualizations and interpretations.
  • Assignments: Involve using Azure and other tools to analyze telecom datasets to predict churn and other financial indicators.

Common Questions

  1. How to handle missing values and the specifics of normalization.
  2. Detailed steps for deploying the model and testing it in real time.
  3. Specifics of using Power BI and other EDA tools.
  4. Steps to create visualizations and box plots in Power BI.
  5. Clarifications on data types and the handling of categorical vs. numeric columns.