Lecture on Choosing the Best Machine Learning Model and Cross-Validation

Jun 22, 2024

Introduction

  • Dilemma: Which machine learning model to use for a specific problem?
  • Example: Different models can classify iris flowers (SVM, random forest, logistic regression, decision tree).
  • Objective: Use cross-validation to evaluate and compare model performance.
  • Focus: Train/test split method vs. k-fold cross-validation.

Basic Concepts

  • Model Training and Testing:
    • Train the model with a labeled dataset.
    • Test it on a separate dataset the model has not seen.
    • Compare the model's predictions with the true labels.
    • Measure accuracy.

Train/Test Split Method

  • No split: All data used for training, then the model is tested on that same data.
    • Analogy: Kid trained with 100 questions, tested on the same 100 questions.
    • Problem: Doesn't measure true performance (the questions have already been seen).
  • Train/Test Split (a minimal sketch follows this list):
    • Split the data into training (70%) and test (30%) sets.
    • Analogy: Kid trained with 70 questions, tested on 30 different questions.
    • Problem: A single random split may be unrepresentative (e.g., trained mostly on algebra questions, tested mostly on calculus).
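
A minimal sketch of both evaluations on the iris dataset mentioned in the introduction (the 70/30 split matches the notes; logistic regression is just an illustrative choice of model):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# "No split": train and test on the same data -- the score is optimistically high
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
print("same-data score:", model.score(X, y))

# Train/test split: hold back 30% of the samples that the model never sees
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("held-out score:", model.score(X_test, y_test))
```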

K-Fold Cross-Validation

  • Concept:
    • Data divided into 'k' folds (e.g., 100 samples into 5 folds of 20 each).
    • In each iteration, one fold is held out for testing and the remaining k-1 folds are used for training.
    • Average the scores of all iterations for the final performance estimate.
  • Advantages: More robust evaluation, and every sample is used for both training and testing (see the sketch below).
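
A small sketch of the folding idea on 100 dummy samples (5 folds of 20, as in the example above); the printed values are index ranges, not scores:

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(100)        # pretend these are 100 samples
kf = KFold(n_splits=5)          # 5 folds of 20 samples each

for i, (train_idx, test_idx) in enumerate(kf.split(samples), start=1):
    # in each iteration one fold is the test set, the other four form the training set
    print(f"iteration {i}: train on {len(train_idx)} samples, "
          f"test on indices {test_idx[0]}-{test_idx[-1]}")
```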

Practical Application

  • Example with Digits Dataset (a sketch follows this list):
    • Libraries: Imported the necessary scikit-learn modules (digits dataset, models).
    • Initial Steps: Split data into training and test sets.
    • Model Evaluation:
      • Logistic Regression: Initial score 0.959.
      • SVM: Performed poorly.
      • Random Forest: Best initial performance.
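
A sketch of that comparison; exact scores will differ from the lecture's (e.g., 0.959 for logistic regression) because the split is random and newer scikit-learn defaults, notably for SVC, can change the ranking:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3)

# Fit each model on the training set and score it on the held-out test set
for model in (LogisticRegression(max_iter=5000), SVC(), RandomForestClassifier()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```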

Random Variability in Train/Test Split

  • Issue: Performance changes with each split.
  • Example: Scores change every time the split is re-executed (see the sketch below).
    • This highlights the problem with evaluating on a single train/test split.
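
A quick, illustrative way to see this variability: repeat the split a few times without fixing random_state and watch the score move.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

digits = load_digits()

# Each run draws a different random split, so the reported score shifts
for run in range(3):
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.3)
    model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    print(f"run {run + 1}: {model.score(X_test, y_test):.3f}")
```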

Implementing K-Fold Cross-Validation

  • Manual Implementation (see the combined sketch after this list):
    • Imported KFold from sklearn.model_selection.
    • Divided dataset into specified number of folds.
    • Iterated through each fold for training/testing.
    • Example: Printed training/testing indices.
    • Measured scores iteratively.
  • Stratified K-Fold:
    • Ensures uniform distribution of classes in each fold.
    • Used for real dataset example (Digits).
    • Custom method get_score to evaluate different models.
    • Scores array to store results for comparison.
    • Ran logistic regression, SVM, and random forest.
    • Example Scores: Logistic Regression (best), SVM (poor), Random Forest (varies with tree number).
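
A combined sketch following the structure described above; get_score matches the helper named in the lecture, while the fold count and hyperparameters (e.g., n_estimators=40) are assumptions:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Manual KFold on a toy list: print the train/test indices for each fold
kf = KFold(n_splits=3)
for train_index, test_index in kf.split([1, 2, 3, 4, 5, 6, 7, 8, 9]):
    print(train_index, test_index)

# Helper named in the lecture: train the given model and return its score
def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# Stratified K-Fold on the digits dataset keeps class proportions uniform per fold
digits = load_digits()
skf = StratifiedKFold(n_splits=3)

scores_lr, scores_svm, scores_rf = [], [], []
for train_index, test_index in skf.split(digits.data, digits.target):
    X_train, X_test = digits.data[train_index], digits.data[test_index]
    y_train, y_test = digits.target[train_index], digits.target[test_index]
    scores_lr.append(get_score(LogisticRegression(max_iter=5000),
                               X_train, X_test, y_train, y_test))
    scores_svm.append(get_score(SVC(), X_train, X_test, y_train, y_test))
    scores_rf.append(get_score(RandomForestClassifier(n_estimators=40),
                               X_train, X_test, y_train, y_test))

print(np.mean(scores_lr), np.mean(scores_svm), np.mean(scores_rf))
```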

Simplifying with cross_val_score

  • Built-in Function:
    • cross_val_score from sklearn.model_selection simplifies k-fold process.
    • Demonstrated usage with a simple function call (see the sketch below).
    • Compared results between custom loop and cross_val_score.
  • Benefits: Less code, integrated evaluation.
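
A sketch of the same comparison with the built-in helper; cv=3 mirrors the fold count used above and is otherwise an assumption:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()

# One call replaces the manual fold loop: it splits, trains, and scores
print(cross_val_score(LogisticRegression(max_iter=5000),
                      digits.data, digits.target, cv=3))
print(cross_val_score(SVC(), digits.data, digits.target, cv=3))
print(cross_val_score(RandomForestClassifier(n_estimators=40),
                      digits.data, digits.target, cv=3))
```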

Parameter Tuning

  • Concept: Adjusting model parameters for optimization.
  • Example: Random Forest number of trees.
    • Tried different tree counts (5, 15, 40, 50, 60); see the sweep sketched below.
    • Scores improved with more trees up to an optimal point, then leveled off.
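
A sketch of that sweep, averaging each setting's cross-validation scores (cv=5 is an assumption):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()

# Try the tree counts mentioned in the notes and average the CV scores
for n_trees in (5, 15, 40, 50, 60):
    scores = cross_val_score(RandomForestClassifier(n_estimators=n_trees),
                             digits.data, digits.target, cv=5)
    print(f"{n_trees} trees: {np.mean(scores):.3f}")
```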

Practical Lesson: Model and Parameter Choices

  • Conclusion: Best model and parameters are determined through trial and error.
  • Exercise: Use the iris dataset to evaluate several models (random forest, decision tree, SVM, logistic regression) with cross_val_score and identify the best-performing one; a possible starting point is sketched below.
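
A possible starting point for the exercise (model settings are assumptions; compare average scores and pick the highest):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

models = {
    "random forest": RandomForestClassifier(n_estimators=40),
    "decision tree": DecisionTreeClassifier(),
    "svm": SVC(),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# Average the cross-validation scores and report the best-performing model
averages = {name: np.mean(cross_val_score(m, X, y, cv=5))
            for name, m in models.items()}
for name, avg in averages.items():
    print(f"{name}: {avg:.3f}")
print("best model:", max(averages, key=averages.get))
```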

Closing Remarks

  • Resources: Jupyter notebook and exercise links provided.
  • Encouragement: Subscribe for more tutorials, thumbs up for support.