Lecture on Choosing the Best Machine Learning Model and Cross-Validation

Jun 22, 2024

Introduction

  • Dilemma: Which machine learning model to use for a specific problem?
  • Example: Different models can classify iris flowers (SVM, random forest, logistic regression, decision tree).
  • Objective: Use cross-validation to evaluate and compare model performance.
  • Focus: Train/test split method vs. k-fold cross-validation.

Basic Concepts

  • Model Training and Testing:
    • Train the model with a labeled dataset.
    • Test it on a separate dataset the model has not seen.
    • Compare the model's predictions with the true labels.
    • Measure accuracy.

Train/Test Split Method

  • No split: All data used for training, then the model is tested on that same data.
    • Analogy: Kid trained with 100 questions, tested on the same 100 questions.
    • Problem: Doesn't measure true performance (the questions have already been seen).
  • Train/Test Split (a minimal sketch follows this list):
    • Split the data into training (70%) and test (30%) sets.
    • Analogy: Kid trained with 70 questions, tested on 30 different questions.
    • Problem: A single random split may be unrepresentative (e.g., trained mostly on algebra questions, tested mostly on calculus).
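
A minimal sketch of both evaluations on the iris dataset mentioned in the introduction (the 70/30 split matches the notes; logistic regression is just an illustrative choice of model):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# "No split": train and test on the same data -- the score is optimistically high
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
print("same-data score:", model.score(X, y))

# Train/test split: hold back 30% of the samples that the model never sees
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("held-out score:", model.score(X_test, y_test))
```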

K-Fold Cross-Validation

  • Concept:
    • Data divided into 'k' folds (e.g., 100 samples into 5 folds of 20 each).
    • In each iteration, one fold is held out for testing and the remaining k-1 folds are used for training.
    • Average the scores of all iterations for the final performance estimate.
  • Advantages: More robust evaluation, and every sample is used for both training and testing (see the sketch below).
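
A small sketch of the folding idea on 100 dummy samples (5 folds of 20, as in the example above); the printed values are index ranges, not scores:

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(100)        # pretend these are 100 samples
kf = KFold(n_splits=5)          # 5 folds of 20 samples each

for i, (train_idx, test_idx) in enumerate(kf.split(samples), start=1):
    # in each iteration one fold is the test set, the other four form the training set
    print(f"iteration {i}: train on {len(train_idx)} samples, "
          f"test on indices {test_idx[0]}-{test_idx[-1]}")
```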

Practical Application

  • Example with Digits Dataset (a sketch follows this list):
    • Libraries: Imported the necessary scikit-learn modules (digits dataset, models).
    • Initial Steps: Split data into training and test sets.
    • Model Evaluation:
      • Logistic Regression: Initial score 0.959.
      • SVM: Performed poorly.
      • Random Forest: Best initial performance.
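
A sketch of that comparison; exact scores will differ from the lecture's (e.g., 0.959 for logistic regression) because the split is random and newer scikit-learn defaults, notably for SVC, can change the ranking:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3)

# Fit each model on the training set and score it on the held-out test set
for model in (LogisticRegression(max_iter=5000), SVC(), RandomForestClassifier()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```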

Random Variability in Train/Test Split

  • Issue: Performance changes with each split.
  • Example: Scores change every time the split is re-executed (see the sketch below).
    • This highlights the problem with evaluating on a single train/test split.
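
A quick, illustrative way to see this variability: repeat the split a few times without fixing random_state and watch the score move.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

digits = load_digits()

# Each run draws a different random split, so the reported score shifts
for run in range(3):
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.3)
    model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    print(f"run {run + 1}: {model.score(X_test, y_test):.3f}")
```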

Implementing K-Fold Cross-Validation

  • Manual Implementation (see the combined sketch after this list):
    • Imported KFold from sklearn.model_selection.
    • Divided dataset into specified number of folds.
    • Iterated through each fold for training/testing.
    • Example: Printed training/testing indices.
    • Measured scores iteratively.
  • Stratified K-Fold:
    • Ensures uniform distribution of classes in each fold.
    • Used for real dataset example (Digits).
    • Custom method get_score to evaluate different models.
    • Scores array to store results for comparison.
    • Ran logistic regression, SVM, and random forest.
    • Example Scores: Logistic Regression (best), SVM (poor), Random Forest (varies with tree number).
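
A combined sketch following the structure described above; get_score matches the helper named in the lecture, while the fold count and hyperparameters (e.g., n_estimators=40) are assumptions:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Manual KFold on a toy list: print the train/test indices for each fold
kf = KFold(n_splits=3)
for train_index, test_index in kf.split([1, 2, 3, 4, 5, 6, 7, 8, 9]):
    print(train_index, test_index)

# Helper named in the lecture: train the given model and return its score
def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# Stratified K-Fold on the digits dataset keeps class proportions uniform per fold
digits = load_digits()
skf = StratifiedKFold(n_splits=3)

scores_lr, scores_svm, scores_rf = [], [], []
for train_index, test_index in skf.split(digits.data, digits.target):
    X_train, X_test = digits.data[train_index], digits.data[test_index]
    y_train, y_test = digits.target[train_index], digits.target[test_index]
    scores_lr.append(get_score(LogisticRegression(max_iter=5000),
                               X_train, X_test, y_train, y_test))
    scores_svm.append(get_score(SVC(), X_train, X_test, y_train, y_test))
    scores_rf.append(get_score(RandomForestClassifier(n_estimators=40),
                               X_train, X_test, y_train, y_test))

print(np.mean(scores_lr), np.mean(scores_svm), np.mean(scores_rf))
```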

Simplifying with cross_val_score

  • Built-in Function:
    • cross_val_score from sklearn.model_selection simplifies k-fold process.
    • Demonstrated usage with a simple function call (see the sketch below).
    • Compared results between custom loop and cross_val_score.
  • Benefits: Less code, integrated evaluation.
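
A sketch of the same comparison with the built-in helper; cv=3 mirrors the fold count used above and is otherwise an assumption:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()

# One call replaces the manual fold loop: it splits, trains, and scores
print(cross_val_score(LogisticRegression(max_iter=5000),
                      digits.data, digits.target, cv=3))
print(cross_val_score(SVC(), digits.data, digits.target, cv=3))
print(cross_val_score(RandomForestClassifier(n_estimators=40),
                      digits.data, digits.target, cv=3))
```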

Parameter Tuning

  • Concept: Adjusting model parameters for optimization.
  • Example: Random Forest number of trees.
    • Tried different tree counts (5, 15, 40, 50, 60); see the sweep sketched below.
    • Scores improved with more trees up to an optimal point, then leveled off.
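
A sketch of that sweep, averaging each setting's cross-validation scores (cv=5 is an assumption):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()

# Try the tree counts mentioned in the notes and average the CV scores
for n_trees in (5, 15, 40, 50, 60):
    scores = cross_val_score(RandomForestClassifier(n_estimators=n_trees),
                             digits.data, digits.target, cv=5)
    print(f"{n_trees} trees: {np.mean(scores):.3f}")
```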

Practical Lesson: Model and Parameter Choices

  • Conclusion: Best model and parameters are determined through trial and error.
  • Exercise: Use the iris dataset to evaluate several models (random forest, decision tree, SVM, logistic regression) with cross_val_score and identify the best-performing one; a possible starting point is sketched below.
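
A possible starting point for the exercise (model settings are assumptions; compare average scores and pick the highest):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

models = {
    "random forest": RandomForestClassifier(n_estimators=40),
    "decision tree": DecisionTreeClassifier(),
    "svm": SVC(),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# Average the cross-validation scores and report the best-performing model
averages = {name: np.mean(cross_val_score(m, X, y, cv=5))
            for name, m in models.items()}
for name, avg in averages.items():
    print(f"{name}: {avg:.3f}")
print("best model:", max(averages, key=averages.get))
```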

Closing Remarks

  • Resources: Jupyter notebook and exercise links provided.
  • Encouragement: Subscribe for more tutorials, thumbs up for support.