Implementing a Classification Algorithm using Decision Trees in Azure ML Studio Classic

Introduction

Focus: Implement a classification algorithm using a decision tree in Azure ML Studio Classic
Any machine can be used; Azure ML Studio Classic runs in the web browser, so nothing needs to be installed locally
Ensure you have an Azure ML Studio account and are logged in

Access Azure ML Studio Classic: Sign in at studio.azureml.net
Create New Experiment: Click 'New' in the lower left corner and select 'Blank Experiment'
Data Source: Use two sample datasets – 'Airport Codes' and 'Flight On-Time Performance'

Airport Codes Dataset: Contains airport IDs, city, state, and airport name
Flight On-Time Performance Dataset: Includes the year, month, and day of each flight; the carrier; origin and destination airport IDs; scheduled departure and arrival times; and actual departure and arrival delays
- Key fields: departure_delay, arrival_delay, departure_d15, arrival_d15, canceled, diverted
- arrival_d15: Binary target variable indicating whether a flight arrived 15 or more minutes late
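As a rough illustration of how the target relates to the delay field, the sketch below derives the flag from a delay in minutes. It assumes the flag is set at 15 minutes or more, which is the usual on-time reporting convention; the function name and sample values are made up for this example.

```python
def arrival_d15(arrival_delay_minutes):
    """Return 1 if the flight arrived 15 or more minutes late, else 0."""
    return int(arrival_delay_minutes >= 15)

# Early, on-time, and late arrivals (illustrative values only).
labels = [arrival_d15(d) for d in (-3, 0, 14, 15, 72)]
```

Because the label is a direct function of arrival_delay, a model that sees arrival_delay as an input can reproduce the label trivially.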

Edit Metadata for Column Naming:
- Rename city, state, and name to origin_city, origin_state, and origin_airport for the origin airport IDs
- Repeat the process on a second copy of Airport Codes for the destination airport (destination_city, destination_state, destination_airport)
Join Datasets:
- Join Flight On-Time Performance with the origin-renamed Airport Codes on origin_airport_id
- Join the resulting dataset with the destination-renamed Airport Codes on destination_airport_id
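The rename-then-join steps above can be sketched in plain Python. This is only an illustration of the logic, not what Studio runs internally: the rows are invented, the key name airport_id and the destination_* column names are assumptions, and an inner join is assumed.

```python
# Toy rows; column names follow the outline, values are made up.
airports = [
    {"airport_id": 1, "city": "Seattle", "state": "WA", "name": "Seattle-Tacoma International"},
    {"airport_id": 2, "city": "Boston", "state": "MA", "name": "Logan International"},
]
flights = [
    {"origin_airport_id": 1, "destination_airport_id": 2, "arrival_d15": 0},
    {"origin_airport_id": 2, "destination_airport_id": 1, "arrival_d15": 1},
]

def renamed_lookup(rows, prefix):
    """Index airports by ID, renaming city/state/name with the given prefix."""
    return {
        r["airport_id"]: {
            f"{prefix}_city": r["city"],
            f"{prefix}_state": r["state"],
            f"{prefix}_airport": r["name"],
        }
        for r in rows
    }

def inner_join(left_rows, key, lookup):
    """Inner join: keep only left rows whose key has a match in the lookup."""
    return [{**row, **lookup[row[key]]} for row in left_rows if row[key] in lookup]

# First join adds origin details, second join adds destination details.
joined = inner_join(flights, "origin_airport_id", renamed_lookup(airports, "origin"))
joined = inner_join(joined, "destination_airport_id", renamed_lookup(airports, "destination"))
```

Renaming before joining is what keeps the two sets of city, state, and name columns from colliding in the final dataset.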

Select Columns:
- Exclude irrelevant columns like origin_airport_id, destination_airport_id, canceled, and diverted
Handle Missing Data:
- Use the Clean Missing Data module to impute missing values
  - Categorical: Replace with mode
  - Numeric: Replace with median
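A minimal sketch of the two imputation rules, assuming missing values are represented as None. The helper name impute and the sample rows are hypothetical; the carrier column stands in for any categorical attribute.

```python
from statistics import median, mode

def impute(rows, column, strategy):
    """Fill None values in one column with the mode or median of the present values."""
    present = [r[column] for r in rows if r[column] is not None]
    fill = mode(present) if strategy == "mode" else median(present)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

rows = [
    {"carrier": "AA", "departure_delay": 5.0},
    {"carrier": None, "departure_delay": None},
    {"carrier": "AA", "departure_delay": 30.0},
    {"carrier": "DL", "departure_delay": 10.0},
]
rows = impute(rows, "carrier", "mode")              # categorical: most frequent value
rows = impute(rows, "departure_delay", "median")    # numeric: middle value
```

The median is less sensitive than the mean to the long right tail that delay columns typically have.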

Edit Metadata: Mark each attribute as categorical or numeric
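Normalization: Apply MinMax normalization to numeric attributes (departure_delay and arrival_delay)
- arrival_d15 is derived from arrival_delay, so keep arrival_delay out of the model's input features to avoid target leakage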
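MinMax normalization rescales each value to (x - min) / (max - min), so the column ends up in the range 0 to 1. The sketch below shows this on made-up delay values; returning zeros for a constant column is a choice made for this example.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: avoid dividing by zero (an assumption of this sketch).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

departure_delay = [-5.0, 0.0, 15.0, 45.0]  # minutes, illustrative only
scaled = min_max_scale(departure_delay)
```

Tree-based learners split on thresholds, so they are largely insensitive to this kind of rescaling; the step matters more if the same data feeds other algorithms.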

Initial Split: 95% training, 5% test
Second Split: Further split the training set into 81% training and 19% validation (about 77% and 18% of the full dataset)
- Stratify the splits on arrival_d15 so each subset keeps the same share of delayed flights
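The two-stage split can be sketched as follows. This is a plain-Python illustration, not the Split Data module itself; the function name, the seed, and the synthetic rows (20% delayed) are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(rows, label, first_fraction, seed=0):
    """Split rows in two, cutting each class separately to preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label]].append(row)
    first, second = [], []
    for members in by_class.values():
        members = members[:]          # copy so the caller's data is not shuffled
        rng.shuffle(members)
        cut = round(len(members) * first_fraction)
        first.extend(members[:cut])
        second.extend(members[cut:])
    return first, second

# Synthetic data: every fifth flight is marked as delayed.
rows = [{"id": i, "arrival_d15": int(i % 5 == 0)} for i in range(1000)]
train_val, test = stratified_split(rows, "arrival_d15", 0.95)
train, val = stratified_split(train_val, "arrival_d15", 0.81)
```

With 1,000 rows this yields 770 training, 180 validation, and 50 test rows, each with the same 20% share of delayed flights.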

Decision Tree Model:
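- Use Two-Class Boosted Decision Tree, which builds an ensemble of trees rather than a single tree
- Configure the relevant hyperparameters (maximum number of leaves per tree, minimum number of samples per leaf node, learning rate, number of trees constructed)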
Hyperparameter Tuning:
- Use Tune Model Hyperparameters (select F-score as the metric, since the classes are imbalanced)
- Opt for the Random grid sweep mode for efficiency
Training: Use Train Model with the best hyperparameters obtained
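The idea behind a random grid sweep is to enumerate the full grid of hyperparameter combinations and then try only a random sample of them. The sketch below illustrates that idea under stated assumptions: the parameter names, candidate values, and run count are invented for this example and are not Studio defaults.

```python
import itertools
import random

# Hypothetical candidate values for the four hyperparameters named above.
grid = {
    "max_leaves": [8, 32, 128],
    "min_samples_per_leaf": [1, 10, 50],
    "learning_rate": [0.05, 0.1, 0.2, 0.4],
    "number_of_trees": [20, 100, 500],
}

def random_grid(grid, n_runs, seed=0):
    """Sample n_runs distinct combinations from the full Cartesian grid."""
    names = list(grid)
    combos = list(itertools.product(*(grid[name] for name in names)))
    rng = random.Random(seed)
    sampled = rng.sample(combos, k=min(n_runs, len(combos)))
    return [dict(zip(names, combo)) for combo in sampled]

# 10 of the 108 possible combinations are evaluated instead of all of them.
candidates = random_grid(grid, n_runs=10)
```

Each candidate would then be trained and scored on the validation set, keeping the one with the best F-score.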

Score Model: Apply the trained model to the test data to generate predictions
Evaluate Model: Check metrics like accuracy, precision, recall, F1-score, and ROC AUC
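To make the reported metrics concrete, the sketch below computes them from scratch on a handful of made-up labels and scores. A 0.5 decision threshold is assumed, and the AUC is computed by the pairwise-ranking definition, which is only practical for small samples.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = delayed)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "f1": f1}

def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 0, 1, 1, 0, 0, 0, 1]
scores = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.3, 0.7]   # illustrative scored probabilities
y_pred = [int(s >= 0.5) for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

On imbalanced data, accuracy can look high even when most delayed flights are missed, which is why precision, recall, and F1 are worth reading alongside it.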

Exploratory Data Analysis (EDA): Use tools like Power BI for detailed EDA
- Import clean data into Power BI for visualization
- Common visuals: bar charts for categorical data, scatter plots for numeric comparisons
- Identify key categorical and numeric insights, explore distributions and potential outliers
Deploy Model: Set up web service for model deployment
- Send new data and get predictions in real time (request-response) or in batches (batch execution)
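Calling the deployed service from code might look roughly like the sketch below. The URL and API key are placeholders, the column names are taken from the outline, and the JSON layout ("Inputs", "input1", "ColumnNames", "Values", "GlobalParameters") is an assumption about the request-response format; check the API help page of your own web service for the exact schema. The request is only built here, not sent.

```python
import json
import urllib.request

SCORING_URL = "https://example.invalid/score"   # placeholder, not a real endpoint
API_KEY = "<your-api-key>"                      # placeholder

def build_request(column_names, rows):
    """Build (but do not send) a POST request carrying rows to score."""
    body = {
        "Inputs": {"input1": {"ColumnNames": column_names, "Values": rows}},
        "GlobalParameters": {},
    }
    return urllib.request.Request(
        SCORING_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_KEY,
        },
        method="POST",
    )

request = build_request(
    ["carrier", "origin_city", "departure_delay"],
    [["AA", "Seattle", "12"]],
)
# To send it against a real endpoint: urllib.request.urlopen(request)
```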

A model trained as described should generalize to new data, provided its metrics on the held-out test set stay close to those on the validation set.
Perform EDA before modeling so that the dataset is thoroughly understood.