Implementation of Classification Algorithm Using Decision Trees in Azure ML Studio Classic

Jul 16, 2024

Implement a Classification Algorithm Using Decision Trees in Azure ML Studio Classic

Overview

  • The session covers implementing a classification algorithm using decision trees.
  • We will use Azure ML Studio Classic, which runs in a web browser and so works on any machine, regardless of the operating system.
  • The practical involves using decision trees and decision tree ensembles (boosted decision trees) for binary classification.

Getting Started

  1. Sign in to Azure ML Studio Classic: Navigate to studio.azureml.net and log in.
  2. Create a New Experiment: Create a blank experiment by clicking the + New button in the lower left corner, and then selecting Blank Experiment.

Loading and Preparing Data

Datasets Used

  • Airport Codes dataset: Contains airport IDs and other details like city, state, and airport name.
  • Flight On-Time Performance dataset: Contains detailed flight data including scheduled and actual times, delays, and categorical features.

Steps

  1. Load the Airport Codes Data: Drag the airport codes dataset onto the workspace area.
  2. Load the Flight On-Time Performance Data: Drag the flight on-time performance dataset onto the workspace area.
  3. Join Datasets: Merge the airport codes data with the flight on-time performance data to add location descriptors (city, state, and airport name) for the origin and destination airport IDs.
    • First Join: Join the flight data with the airport codes to add origin airport details.
    • Second Join: Join the result with the airport codes again to add destination airport details.
  4. Select Relevant Columns: Exclude unneeded columns such as origin airport ID, destination airport ID, cancelled, and diverted.
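
The two joins and the column selection in the steps above can be sketched outside the Studio in plain Python. This is only an illustration of the logic, not what the Studio modules run internally; the column names and sample rows are made up for the example and do not reflect the real dataset schema.

```python
# Hypothetical sample rows; column names are assumptions, not the real schema.
airports = [
    {"airport_id": 1, "city": "Seattle", "state": "WA", "name": "Seattle-Tacoma International"},
    {"airport_id": 2, "city": "Boston", "state": "MA", "name": "Logan International"},
]
flights = [
    {"origin_airport_id": 1, "dest_airport_id": 2, "dep_delay": 12, "cancelled": 0, "diverted": 0},
    {"origin_airport_id": 2, "dest_airport_id": 1, "dep_delay": -3, "cancelled": 0, "diverted": 0},
]


def join_airport(rows, lookup, key, prefix):
    """Inner-join rows to the airport lookup on `key`, prefixing the added columns."""
    by_id = {a["airport_id"]: a for a in lookup}
    joined = []
    for row in rows:
        airport = by_id.get(row[key])
        if airport is None:
            continue  # inner join: drop flights with an unknown airport ID
        extra = {f"{prefix}_{col}": val for col, val in airport.items() if col != "airport_id"}
        joined.append({**row, **extra})
    return joined


# First join adds origin details, second join adds destination details.
with_origin = join_airport(flights, airports, "origin_airport_id", "origin")
with_both = join_airport(with_origin, airports, "dest_airport_id", "dest")

# Drop the columns that are no longer needed.
DROP = {"origin_airport_id", "dest_airport_id", "cancelled", "diverted"}
selected = [{k: v for k, v in row.items() if k not in DROP} for row in with_both]
```

Prefixing the joined columns keeps origin and destination descriptors apart after the same lookup table has been joined twice.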

Data Preprocessing and Feature Engineering

Segregate Categorical and Numeric Attributes

  1. Categorical Attributes: Use the Edit Metadata module to mark attributes as categorical, leaving numeric columns such as departure delay and arrival delay unchanged.
  2. Clean Missing Data: Impute categorical and numeric attributes separately using the Clean Missing Data module.
    • Categorical Imputation: Use Replace with MODE for categorical variables.
    • Numeric Imputation: Use Replace with MEDIAN for numeric variables.
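
As a rough illustration of the two imputation strategies, the sketch below fills missing entries with the mode for a categorical column and the median for a numeric one. The column values are hypothetical, and missing entries are assumed to be represented as None.

```python
from statistics import median, mode


def impute(values, strategy):
    """Replace None entries with the mode or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mode(observed) if strategy == "mode" else median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical columns with missing entries marked as None.
carrier = ["AA", "DL", None, "AA", "UA"]
dep_delay = [5, None, 20, -2, 11]

carrier_filled = impute(carrier, "mode")        # categorical: most frequent value
dep_delay_filled = impute(dep_delay, "median")  # numeric: middle value
```

The median is less sensitive than the mean to a few very long delays, which is one reason to prefer it for skewed numeric columns.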

Normalize Data

  • Normalize Numeric Attributes: Use the Normalize Data module to apply min-max normalization to the numeric columns (departure delay and arrival delay), rescaling each to the range [0, 1].
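
Min-max normalization maps each value x to (x - min) / (max - min). A minimal sketch with made-up delay values follows; the handling of a constant column is an assumption chosen for the example.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption for this sketch: a constant column maps to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical departure delays in minutes.
dep_delay = [-5, 0, 15, 45]
scaled = min_max_normalize(dep_delay)
```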

Splitting the Data

  1. Split Data: Partition the data into training (95%) and test (5%) sets using the Split Data module. Perform a stratified split on the target variable 'Arrival Delay 15 (AR D15)' so that both sets keep the same class proportions.
  2. Second Split: Further split the training data into training (90%) and validation (10%) sets.
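
The two-stage stratified split can be sketched as follows. This is an illustrative implementation, not the Split Data module's own algorithm; the `delayed` label name and the synthetic rows are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, fraction, seed=0):
    """Split rows so each label value keeps about `fraction` of its rows in the first part."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    first, second = [], []
    for group in groups.values():
        rng.shuffle(group)
        cut = round(len(group) * fraction)
        first.extend(group[:cut])
        second.extend(group[cut:])
    return first, second


# Hypothetical data: 200 delayed flights (1) and 800 on-time flights (0).
rows = [{"id": i, "delayed": 1 if i < 200 else 0} for i in range(1000)]

# First split: 95% training, 5% test.
train_full, test = stratified_split(rows, "delayed", 0.95)
# Second split: 90% training, 10% validation, taken from the training part only.
train, validation = stratified_split(train_full, "delayed", 0.90)
```

Because each class is split separately, the 20% share of delayed flights is preserved in the training, validation, and test sets.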

Model Training and Evaluation

Decision Tree Model

  1. Configure Two-Class Boosted Decision Tree: Set hyperparameters such as the maximum number of leaves per tree, the minimum number of samples per leaf node, the learning rate, and the number of trees.
  2. Tune Model Hyperparameters: Use the Tune Model Hyperparameters module to find the best combination of hyperparameters using the validation dataset.
  3. Train the Model: Use the training dataset and the best hyperparameters to train the Two-Class Boosted Decision Tree model.
  4. Score the Model: Use the trained model to generate predictions on the test dataset.
  5. Evaluate Model: Inspect classification metrics such as accuracy, precision, recall, F1 score, and ROC/AUC.
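
To make the evaluation metrics concrete, the sketch below computes accuracy, precision, recall, and F1 score from true and predicted labels. The label lists are invented for illustration; in the Studio these values are reported by the evaluation step itself.

```python
def classification_metrics(y_true, y_pred):
    """Compute binary classification metrics, treating label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = arrival delayed by 15 minutes or more, 0 = on time.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

When delayed flights are the minority class, precision and recall are usually more informative than accuracy alone.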

Web Deployment

  • Set up Web Service: Deploy the trained model as a web service, which allows predictions on new data through a web interface.
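
Once deployed, the service can also be called from code. The sketch below only builds the HTTP request and does not send it. The URL, API key, input name `input1`, and column names are placeholders, and the JSON layout is an assumption modeled on the request format commonly shown for Studio Classic web services; copy the exact schema from the deployed service's own API help page.

```python
import json
import urllib.request

# Placeholders: replace with the values shown for the deployed service.
SERVICE_URL = "https://example.invalid/execute?api-version=2.0"
API_KEY = "<your-api-key>"


def build_request(column_names, rows):
    """Build a POST request carrying the rows to score as a JSON payload."""
    payload = {
        # Assumed layout; the input name and columns depend on the experiment.
        "Inputs": {"input1": {"ColumnNames": column_names, "Values": rows}},
        "GlobalParameters": {},
    }
    return urllib.request.Request(
        SERVICE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_KEY,
        },
    )


# Hypothetical feature columns and one row to score.
request = build_request(["Carrier", "DepDelay"], [["AA", "12"]])
# Sending it would be: urllib.request.urlopen(request)
```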

Additional Notes

  • Alternative Visualization Tools: Power BI is recommended for EDA and visual inspection, but tools such as KNIME, RapidMiner, and Tableau can also be used.
  • Assignments: Apply the methods learned to the provided datasets and questions, focusing on both classification and regression analysis.

Queries

  • Email any additional queries or requests for support regarding the module or assignments.
    • Direct any questions not otherwise addressed to [Instructor's Email].

Feedback

  • Please provide feedback via the course platform.