Decision Tree Classification Algorithm Implementation

Jul 11, 2024

Implementing a Classification Algorithm using Decision Trees in Azure ML Studio Classic

Introduction

  • Focus: Implement a classification algorithm using a decision tree in Azure ML Studio Classic
  • Any machine with a web browser can be used; Azure ML Studio Classic runs in the browser, so it does not depend on the local operating system
  • Ensure you have an Azure ML Studio account and are logged in

Steps to Get Started

  1. Access Azure ML Studio Classic: Sign in at studio.azureml.net
  2. Create New Experiment: Click 'New' in the lower left corner and select 'Blank Experiment'
  3. Data Source: Use two sample datasets – 'Airport Codes' and 'Flight On-Time Performance'

Exploring the Datasets

  • Airport Codes Dataset: Contains airport IDs, city, state, and airport name

  • Flight On-Time Performance Dataset: Includes details such as year, month, and day of flights, carrier, origin, and destination airport IDs, scheduled departure/arrival times, and actual departure/arrival delays

    • Key fields: departure_delay, arrival_delay, departure_d15, arrival_d15, canceled, diverted
    • arrival_d15: Binary target variable indicating whether a flight arrived 15 minutes or more late (1 = delayed, 0 = not delayed)

Data Preparation

Merging Datasets

  1. Edit Metadata for Column Naming:

    • Rename city, state, and name to origin_city, origin_state, and origin_airport for the origin airport IDs
    • Repeat the process on a second copy of Airport Codes for the destination airport columns
  2. Join Datasets:

    • Join Flight On-Time Performance with Airport Codes on origin_airport_id
    • Join the resulting dataset with Airport Codes again, this time on destination_airport_id
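
As a rough illustration of what the two joins do, the plain-Python sketch below looks up each flight's origin airport and then its destination airport. The sample rows, the airport IDs, the key name airport_id, and the helper join_airports are all made up for illustration; in Studio Classic the same result comes from two Join Data modules, not from code.

```python
# Hypothetical sample rows; the real datasets are much larger.
airports = [
    {"airport_id": 1, "city": "Atlanta", "state": "GA", "name": "Example Airport A"},
    {"airport_id": 2, "city": "New York", "state": "NY", "name": "Example Airport B"},
]
flights = [
    {"carrier": "DL", "origin_airport_id": 1, "destination_airport_id": 2, "arrival_d15": 0},
    {"carrier": "AA", "origin_airport_id": 2, "destination_airport_id": 1, "arrival_d15": 1},
]


def join_airports(flights, airports, key, prefix):
    """Inner-join flights to airports on `key`, prefixing the airport columns."""
    lookup = {a["airport_id"]: a for a in airports}
    joined = []
    for flight in flights:
        airport = lookup.get(flight[key])
        if airport is None:
            continue  # inner join: drop flights with no matching airport
        row = dict(flight)
        row[f"{prefix}_city"] = airport["city"]
        row[f"{prefix}_state"] = airport["state"]
        row[f"{prefix}_airport"] = airport["name"]
        joined.append(row)
    return joined


# First join on the origin ID, then join the result on the destination ID.
merged = join_airports(flights, airports, "origin_airport_id", "origin")
merged = join_airports(merged, airports, "destination_airport_id", "destination")
```

Renaming the airport columns with a prefix before each join is what keeps the origin and destination fields from colliding in the merged table.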

Selecting and Cleaning Data

  1. Select Columns:
    • Exclude irrelevant columns like origin_airport_id, destination_airport_id, canceled, and diverted
  2. Handle Missing Data:
    • Use the Clean Missing Data module to impute missing values
      • Categorical: Replace with mode
      • Numeric: Replace with median
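
The selection and imputation steps can be sketched in plain Python as follows. This is only an approximation of what Select Columns in Dataset and Clean Missing Data do; the sample rows and the column groupings are assumptions made for the example.

```python
from statistics import median, mode

# Hypothetical rows after the join; None marks a missing value.
rows = [
    {"carrier": "DL", "departure_delay": 5.0, "canceled": 0, "diverted": 0},
    {"carrier": None, "departure_delay": None, "canceled": 0, "diverted": 0},
    {"carrier": "DL", "departure_delay": 20.0, "canceled": 0, "diverted": 0},
    {"carrier": "AA", "departure_delay": 11.0, "canceled": 0, "diverted": 0},
]

DROP = {"canceled", "diverted"}   # columns excluded in the Select Columns step
CATEGORICAL = {"carrier"}         # imputed with the mode
NUMERIC = {"departure_delay"}     # imputed with the median

# Step 1: keep only the relevant columns.
selected = [{k: v for k, v in row.items() if k not in DROP} for row in rows]


def fill_value(column):
    """Return the replacement for missing entries in one column."""
    present = [row[column] for row in selected if row[column] is not None]
    return mode(present) if column in CATEGORICAL else median(present)


# Step 2: compute one fill value per column, then replace the gaps.
fills = {col: fill_value(col) for col in CATEGORICAL | NUMERIC}
cleaned = [
    {k: (fills[k] if v is None else v) for k, v in row.items()}
    for row in selected
]
```

The median is less sensitive than the mean to a few very long delays, which is a common reason to prefer it for skewed numeric columns.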

Feature Engineering

  • Edit Metadata: Mark each attribute as categorical or numeric so that later modules treat it correctly
  • Normalization: Apply MinMax normalization to numeric attributes (departure_delay and arrival_delay)
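
MinMax normalization rescales each value x to (x - min) / (max - min), so the column ends up in the range 0 to 1. A minimal sketch, using made-up delay values; mapping a constant column to 0.0 is a choice made for this example, not a statement about how the Normalize Data module handles that case.

```python
def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical departure delays in minutes (negative means early).
departure_delay = [-5.0, 0.0, 15.0, 45.0]
scaled = min_max_scale(departure_delay)
```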

Splitting Data

  • Initial Split: 95% training, 5% test
  • Second Split: Further split training set into 81% training and 19% validation
    • Ensure stratified split based on arrival_d15
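
Applied one after the other, the two splits leave 0.95 × 0.81 = 76.95% of the data for training, 0.95 × 0.19 = 18.05% for validation, and 5% for testing. The sketch below shows one way a stratified split can work: split each class separately so every subset keeps the same share of delayed flights. The helper stratified_split, the seeds, and the synthetic rows are assumptions for illustration; Studio Classic does this through the Split Data module's stratification setting.

```python
import random
from collections import defaultdict


def stratified_split(rows, label, fraction, seed=0):
    """Split rows into two lists, keeping `fraction` of each label class in the first."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label]].append(row)
    rng = random.Random(seed)
    first, second = [], []
    for members in by_class.values():
        members = members[:]          # copy so the caller's data is not reordered
        rng.shuffle(members)
        cut = round(len(members) * fraction)
        first.extend(members[:cut])
        second.extend(members[cut:])
    return first, second


# Hypothetical data: 10,000 flights, 20% of them delayed.
rows = [{"id": i, "arrival_d15": int(i % 5 == 0)} for i in range(10_000)]

train_val, test = stratified_split(rows, "arrival_d15", 0.95, seed=1)
train, validation = stratified_split(train_val, "arrival_d15", 0.81, seed=2)
```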

Building and Training the Model

  1. Decision Tree Model:
    • Use the Two-Class Boosted Decision Tree module, which trains an ensemble of decision trees rather than a single tree
    • Configure relevant hyperparameters (max leaves, min samples per leaf, learning rate, number of trees)
  2. Hyperparameter Tuning:
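    • Use Tune Model Hyperparameters and select F-score as the metric, since the data is imbalanced
    • Opt for the Random grid sweep mode for efficiency; it evaluates a random sample of grid points instead of every combination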
  3. Training: Use Train Model with the best hyperparameters obtained
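
The idea behind a random grid sweep can be sketched in a few lines: enumerate every combination of candidate values, score only a random sample of them, and keep the best. Everything below is hypothetical, including the parameter names, the candidate values, and especially placeholder_score, which stands in for "train a boosted tree on the training set and return its F-score on the validation set".

```python
import itertools
import random

# Hypothetical candidate values for the four hyperparameters named above.
grid = {
    "max_leaves": [8, 32, 128],
    "min_samples_per_leaf": [1, 10, 50],
    "learning_rate": [0.05, 0.1, 0.2, 0.4],
    "num_trees": [50, 100, 500],
}


def random_grid_search(grid, score_fn, n_runs, seed=0):
    """Score n_runs distinct, randomly chosen grid points and return the best."""
    names = list(grid)
    all_points = list(itertools.product(*(grid[n] for n in names)))
    rng = random.Random(seed)
    sampled = rng.sample(all_points, min(n_runs, len(all_points)))
    candidates = [dict(zip(names, point)) for point in sampled]
    best = max(candidates, key=score_fn)
    return best, score_fn(best)


def placeholder_score(params):
    """Stand-in for 'train on the training set, return F-score on validation'."""
    # Toy formula so the sketch runs on its own; it is NOT a real model score.
    return 1.0 - abs(params["learning_rate"] - 0.1) - abs(params["max_leaves"] - 32) / 1000


best_params, best_score = random_grid_search(grid, placeholder_score, n_runs=10)
```

This grid has 3 × 3 × 4 × 3 = 108 combinations, so ten runs cover under a tenth of it. That is the trade-off: far less compute than a full sweep, at the risk of missing the best combination.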

Model Evaluation

  1. Score Model: Apply the trained model to the test data to generate predictions
  2. Evaluate Model: Check metrics like accuracy, precision, recall, F1-score, and ROC AUC
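
To make the metrics concrete, the sketch below computes accuracy, precision, recall, and F1 from true and predicted labels. The ten labels are invented for the example. ROC AUC is left out because it needs the model's predicted scores rather than hard 0/1 labels.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = delayed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels and predictions for ten test flights.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

In this toy sample, a model that always predicts "not delayed" would still reach 70% accuracy while its recall would be zero, which is why F-score is the more informative metric on imbalanced data.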

Additional Insights

  1. Exploratory Data Analysis (EDA): Use tools like Power BI for detailed EDA
    • Import clean data into Power BI for visualization
    • Common visuals: bar charts for categorical data, scatter plots for numeric comparisons
    • Identify key categorical and numeric insights, explore distributions and potential outliers
  2. Deploy Model: Publish the trained model as a web service
    • Send new data and get predictions either in real time (request-response) or in batches

Conclusion

  • If trained and tuned as described, the model should generalize to new data; confirm this on the held-out test set, which is not used during training or tuning.
  • Always ensure EDA is performed to understand the dataset thoroughly before modeling.