Implementing a Classification Algorithm Using Decision Tree

Jul 16, 2024

Overview

  • Objective: Implement a classification algorithm using a decision tree in Microsoft Azure ML Studio Classic.
  • Theory Primer: Assumes prior knowledge of decision trees.
  • Tools Used: Microsoft Azure ML Studio Classic, a browser-based tool that runs on any machine.

Starting with Microsoft Azure ML Studio Classic

  1. Sign in to Studio: Navigate to studio.azureml.net and sign in.
  2. Create New Experiment: Click +New, then select Blank Experiment.

Setting up the Experiment

  1. Navigate to Saved Datasets: Use the panel on the left to find Saved Datasets under Samples.
  2. Select Datasets: Drag the following onto the workspace canvas:
    • Airport Codes dataset (second in the list)
    • Flight On-time Performance dataset

Data Understanding

  1. Details of Flight On-time Performance Dataset:
    • Columns: 18
    • Rows: 54,000
    • Contents: Flight performance data from 2011
    • Variables include year, quarter, day, carrier, origin/destination airport ID, and flight times (departure, arrival, delays).
    • We are specifically interested in the delay columns (departure and arrival) and the flags indicating whether a delay exceeds 15 minutes; a quick way to inspect these columns in code is sketched after this list.
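
Azure ML Studio Classic surfaces this information through the dataset's Visualize option. For readers who prefer to check it in code, here is a minimal pandas sketch; the CSV file name is a hypothetical export, not something the Studio provides by default.

```python
import pandas as pd

# Hypothetical CSV export of the Flight On-time Performance sample dataset.
flights = pd.read_csv("flight_on_time_performance.csv")

print(flights.shape)   # (rows, columns) -- expect 18 columns
print(flights.dtypes)  # column names and inferred types
print(flights.head())  # first few records for a quick sanity check
```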

Working with Columns

Information on Critical Columns

  • Departure Delay: Number of minutes the flight departed late; a negative value indicates an early departure.
  • Arrival Delay: Same as the departure delay, but for arrivals.
  • Departure Delay 15 (DepDel15): Binary flag set to 1 if the departure delay exceeds 15 minutes, otherwise 0.
  • Arrival Delay 15 (ArrDel15): Binary flag set to 1 if the arrival delay exceeds 15 minutes, otherwise 0 (both flags are reproduced in the sketch after this list).
  • Categorical Columns: Date-related fields, carrier codes, airport identifiers, and similar.
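
Both flags can be derived directly from the delay columns. A minimal pandas sketch, assuming the columns are named departure_delay and arrival_delay as elsewhere in this walkthrough:

```python
import pandas as pd

# Illustrative frame with the two delay columns (minutes; negative = early).
flights = pd.DataFrame({"departure_delay": [-3, 22, 7],
                        "arrival_delay": [5, 31, -10]})

# 1 if the delay exceeds 15 minutes, otherwise 0 -- mirrors DepDel15 / ArrDel15.
flights["DepDel15"] = (flights["departure_delay"] > 15).astype(int)
flights["ArrDel15"] = (flights["arrival_delay"] > 15).astype(int)
print(flights)
```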

Task: Combine Data Sets

  1. Need: Combine the Airport Codes dataset with the main dataset so that airport IDs map to meaningful names and locations.
  2. Steps:
    • Use Edit Metadata on the Airport Codes dataset to rename its columns so they are suitable for merging.
    • Perform a join between the two datasets on origin_airport_id or dest_airport_id.
    • Add new columns for origin and destination details (city, state, airport name) to make the data easier to read and query; the merge sketch after these steps shows the same operation in code.
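
Outside the designer, the same rename-and-join sequence can be expressed as two pandas merges, one for the origin airport and one for the destination. The file names and the Airport Codes column names (airport_id, city, state, name) are assumptions here.

```python
import pandas as pd

flights = pd.read_csv("flight_on_time_performance.csv")  # hypothetical export
airports = pd.read_csv("airport_codes.csv")              # hypothetical export

# Prefixing the lookup columns plays the role of the Edit Metadata rename,
# so the origin and destination joins do not collide with each other.
origin = airports.add_prefix("origin_")  # airport_id -> origin_airport_id, ...
dest = airports.add_prefix("dest_")      # airport_id -> dest_airport_id, ...

# Left joins keep every flight row and attach city/state/airport name details.
flights = flights.merge(origin, on="origin_airport_id", how="left")
flights = flights.merge(dest, on="dest_airport_id", how="left")
```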

Data Transformation and Preparation

  1. Columns to Remove: Drop columns that are not required for this analysis, e.g., origin_airport_id, dest_airport_id, cancelled, and diverted.
  2. Segregate: Use Edit Metadata to mark categorical vs. numeric features.
  3. Handle Missing Values: Use Clean Missing Data on the categorical and numeric columns separately.
  4. Normalize Data: Apply Normalize Data to the numeric columns (departure_delay and arrival_delay) to ensure consistent scaling; a code sketch of these four steps follows this list.
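
A rough pandas/scikit-learn equivalent of these four steps, with illustrative column lists and simple imputation choices (mode for categorical columns, median for numeric ones, then min-max scaling of the delay columns). The Clean Missing Data and Normalize Data modules offer other strategies; this is only one plausible combination.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

flights = pd.read_csv("joined_flights.csv")  # hypothetical output of the join

# 1. Drop columns not needed for this analysis.
flights = flights.drop(columns=["origin_airport_id", "dest_airport_id",
                                "cancelled", "diverted"])

# 2. Segregate categorical vs. numeric features (these lists are assumptions).
categorical = ["carrier", "origin_city", "dest_city"]
numeric = ["departure_delay", "arrival_delay"]

# 3. Handle missing values separately for each group.
for col in categorical:
    flights[col] = flights[col].fillna(flights[col].mode().iloc[0])
flights[numeric] = flights[numeric].fillna(flights[numeric].median())

# 4. Scale the numeric delay columns to a common range.
flights[numeric] = MinMaxScaler().fit_transform(flights[numeric])
```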

Splitting the Data

  1. Initial Split: Use a Split Data module to divide the data into training (95%) and test (5%) sets.
  2. Further Split: Use a second Split Data module to partition the 95% again, so that roughly 81% of the original rows are used for training and 14% for validation (about an 85/15 split of the training portion); the same two splits are sketched in code below.
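
The two Split Data modules map onto two calls to train_test_split. The second split uses a fraction of 14/95 so that the validation set is about 14% of the original data, leaving roughly 81% for training; the file name is again a hypothetical export of the prepared data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

flights = pd.read_csv("prepared_flights.csv")  # hypothetical prepared dataset

# First split: 95% for training + validation, 5% held out as the test set.
train_val, test = train_test_split(flights, test_size=0.05, random_state=42)

# Second split: 14/95 of the remaining rows become the validation set,
# leaving roughly 81% of the original data for training.
train, val = train_test_split(train_val, test_size=14 / 95, random_state=42)
```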

Setting up the Decision Tree Model

  1. Algorithm Choice: Use the Two-Class Boosted Decision Tree module for classification.
  2. Hyperparameter Tuning:
    • Use the Tune Model Hyperparameters module to find the best configuration for the classifier.
    • Specify the evaluation metric to optimize (e.g., F-score, since the delayed/on-time classes are imbalanced); a scikit-learn analogue of this step is sketched after this list.
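
Two-Class Boosted Decision Tree and Tune Model Hyperparameters are designer modules with no code surface in Studio Classic, but a loose scikit-learn analogue is a gradient-boosted tree wrapped in a grid search that optimizes F1. The parameter grid, the file name, and the choice of ArrDel15 as the label are all assumptions for illustration.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

train = pd.read_csv("train.csv")  # hypothetical training split

# Assumed label: whether the arrival delay exceeded 15 minutes.
y_train = train["ArrDel15"]
X_train = pd.get_dummies(train.drop(columns=["ArrDel15"]))  # encode categoricals

# Search a small grid of boosting hyperparameters, scored by F1 because the
# delayed/on-time classes are imbalanced.
search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_grid={
        "n_estimators": [100, 200],
        "learning_rate": [0.05, 0.1],
        "max_depth": [2, 3, 4],
    },
    scoring="f1",
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```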

Evaluation and Model Scoring

  1. Link Modules:
    • Connect the best model output of Tune Model Hyperparameters to Train Model along with the training dataset.
    • Use Score Model to run the trained model over the validation and/or test set.
  2. Evaluate Model: Use Evaluate Model to obtain metrics such as accuracy, F1 score, precision, and recall; the same metrics can be computed in code as sketched below.
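
Evaluate Model's headline numbers can be reproduced with scikit-learn's metric functions. The small arrays below stand in for the real validation labels and scored predictions, just to show the calls.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# y_val: true labels from the validation split; y_pred: scored predictions.
y_val  = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]

print("accuracy :", accuracy_score(y_val, y_pred))
print("precision:", precision_score(y_val, y_pred))
print("recall   :", recall_score(y_val, y_pred))
print("F1       :", f1_score(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
```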

Ending Notes and Further Tasks

  • Deployment: Post-modeling steps include publishing the experiment as a web service for real-time predictions.
  • Exploratory Data Analysis (EDA): Use tools such as Power BI or KNIME for visual analysis of questions related to customer churn and similar metrics. The Power BI workflow includes importing the dataset, creating visuals, performing imputations in Azure, and using Power BI for the final visualizations and interpretations.
  • Assignments: Involve using Azure and other tools to analyze telecom datasets to predict churn and other financial indicators.

Common Questions

  1. How to handle missing values and the specifics of normalization.
  2. Detailed steps for deploying the model and testing it in real time.
  3. Specifics of using Power BI and other EDA tools.
  4. Steps to create visualizations and box plots in Power BI.
  5. Clarifications on data types and the handling of categorical vs. numeric columns.