Introduction to Machine Learning Concepts

Aug 5, 2024

Machine Learning for Everyone - Lecture by Kylie Ying

Introduction

  • Kylie Ying: Physicist and engineer who has worked at MIT, CERN, and freeCodeCamp.
  • Focus: Machine Learning (ML) for beginners.
  • Topics: Supervised and unsupervised learning models, the underlying logic/math, and hands-on programming in Google Colab.

Resources

  • UCI Machine Learning Repository: Source of datasets.
  • Example dataset: Magic Gamma Telescope dataset.
  • Tools: Google Colab, NumPy, pandas, matplotlib, scikit-learn, TensorFlow.

Dataset Overview

Magic Gamma Telescope Dataset

  • Information collected by a gamma telescope camera/detector.
  • Attributes: Length, width, size, asymmetry, etc.
  • Goal: Predict if particles are gamma particles or hadrons.
  • Steps: Import the data, label the columns, transform the class labels into numerical values, and preprocess, as sketched below.
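
A minimal sketch of these steps in Python with pandas. The column names follow the UCI documentation for this dataset; the file name magic04.data and its path are assumptions about the download:

```python
import pandas as pd

# Feature names from the UCI page for the MAGIC gamma telescope dataset
cols = ["fLength", "fWidth", "fSize", "fConc", "fConc1",
        "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

# The file has no header row, so we supply the column names ourselves
df = pd.read_csv("magic04.data", names=cols)

# Transform the labels: 'g' (gamma) -> 1, 'h' (hadron) -> 0
df["class"] = (df["class"] == "g").astype(int)
print(df.head())
```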

Supervised Learning

Overview

  • Machine Learning: A subfield of computer science focused on algorithms that let computers learn from data without being explicitly programmed.
  • AI vs. ML vs. Data Science: AI broadly aims to simulate human intelligence and behavior; ML makes predictions by learning from data; Data Science finds patterns and draws insights from data.
  • Supervised Learning: Uses labeled inputs to train models.
  • Tasks: Classification (binary, multi-class) and Regression.

Key Concepts

  • Features: Input attributes used to predict labels.
  • Classification: Predicting discrete classes (binary or multi-class).
  • Regression: Predicting continuous values.
  • Training, validation, and testing datasets: Split the data so models are fit on the training set, tuned on the validation set, and evaluated once on the held-out test set (see the split sketch after this list).
  • Models: K-Nearest Neighbors (KNN), Naive Bayes, Logistic Regression, Support Vector Machines (SVM), Neural Networks.
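
One common way to make the three-way split, assuming the df produced by the loading sketch earlier; the 60/20/20 proportions are an illustrative choice, not a rule:

```python
import numpy as np

# Shuffle, then cut into ~60% train, ~20% validation, ~20% test
train, valid, test = np.split(df.sample(frac=1, random_state=42),
                              [int(0.6 * len(df)), int(0.8 * len(df))])

# Separate features (X) from labels (y) in each split
X_train, y_train = train.drop(columns="class").values, train["class"].values
X_valid, y_valid = valid.drop(columns="class").values, valid["class"].values
X_test,  y_test  = test.drop(columns="class").values,  test["class"].values
```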

Models and Examples

K-Nearest Neighbors (KNN)

  • Concept: Classify data points based on the majority label of nearest neighbors.
  • Distance metric: Euclidean distance (the straight-line distance between two feature vectors).
  • Implementation: KNeighborsClassifier from sklearn.
  • Evaluation: Accuracy, Precision, Recall, F1 score.
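
A minimal sketch with scikit-learn, reusing the X_train/y_test arrays from the split sketch above; k=5 is simply the library default, not a tuned value:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Each test point is labeled by a majority vote of its 5 nearest
# training points under Euclidean distance (the default metric)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Prints accuracy plus per-class precision, recall, and F1
print(classification_report(y_test, knn.predict(X_test)))
```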

Naive Bayes

  • Based on Bayes' theorem and conditional probability.
  • Assumption: Features are conditionally independent given the class (the "naive" assumption).
  • Implementation: GaussianNB from sklearn.
  • Evaluation: Same metrics as for KNN.
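
A minimal sketch, again assuming the arrays from the earlier split:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# GaussianNB models each feature as a per-class Gaussian and combines
# the likelihoods via Bayes' theorem under the independence assumption
nb = GaussianNB()
nb.fit(X_train, y_train)
print(classification_report(y_test, nb.predict(X_test)))
```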

Logistic Regression

  • Concept: Models the probability of class membership with the logistic (sigmoid) function, which maps any real number into (0, 1).
  • Implementation: LogisticRegression from sklearn.
  • Evaluation: Same metrics, plus visualization of the decision boundary.
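
A minimal sketch under the same assumptions as the previous models:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Learns weights w, b so that P(y=1|x) = 1 / (1 + exp(-(w.x + b)))
lr = LogisticRegression(max_iter=1000)  # raise max_iter if it fails to converge
lr.fit(X_train, y_train)
print(classification_report(y_test, lr.predict(X_test)))
```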

Support Vector Machines (SVM)

  • Concept: Finds hyperplane that best separates classes with maximum margin.
  • Implementation: SVC from sklearn.
  • Evaluation: Often performs well when classes are separable, but is sensitive to outliers and unscaled features.
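
A minimal sketch; because SVMs are scale-sensitive, the features are standardized first (StandardScaler is an illustrative choice, and SVC uses an RBF kernel by default):

```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

# Standardize features to zero mean and unit variance
scaler = StandardScaler().fit(X_train)

svm = SVC()  # RBF kernel by default
svm.fit(scaler.transform(X_train), y_train)
print(classification_report(y_test, svm.predict(scaler.transform(X_test))))
```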

Neural Networks

  • Structure: Input layer, hidden layers (neurons), output layer.
  • Training: Backpropagation, adjusting weights to minimize loss.
  • Implementation: TensorFlow for defining and training models.
  • Evaluation: Compare performance with simpler models.
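
A sketch of a small feed-forward network in TensorFlow/Keras; the lecture uses TensorFlow, but the layer sizes, epochs, and batch size here are illustrative assumptions:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(32, activation="relu"),    # hidden layer 1
    tf.keras.layers.Dense(32, activation="relu"),    # hidden layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(class = 1)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Backpropagation adjusts the weights to reduce the loss; the
# validation set is watched for signs of overfitting
model.fit(X_train, y_train, validation_data=(X_valid, y_valid),
          epochs=20, batch_size=32)
```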

Regression

Concept

  • Predicting continuous values using input features.
  • Linear Regression: Fit a line that minimizes the residuals (differences between observed and predicted values), typically by least squares.
  • Assumptions: Linearity, Independence, Normality, Homoscedasticity.

Evaluation Metrics

  • Mean Absolute Error (MAE)
  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)
  • R-squared (coefficient of determination)
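
All four metrics are one-liners in scikit-learn; the numbers below are toy values for illustration only:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])  # toy targets
y_pred = np.array([2.5, 5.5, 7.0, 11.0])  # toy predictions

mae  = mean_absolute_error(y_true, y_pred)  # average absolute error
mse  = mean_squared_error(y_true, y_pred)   # average squared error
rmse = np.sqrt(mse)                         # back in the units of y
r2   = r2_score(y_true, y_pred)             # fraction of variance explained
print(mae, mse, rmse, r2)
```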

Example: Bike Sharing Dataset

  • Steps: Import data, preprocess, split into training/validation/testing sets.
  • Models: Simple and multiple linear regression, neural networks for regression.
  • Visualization: Scatter plots, regression lines.
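
A sketch of simple (one-feature) linear regression with a scatter plot and fitted line. It assumes arrays x (one column, e.g., temperature) and y (bike count) have already been extracted and split from the bike sharing data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# x: shape (n_samples, 1), y: shape (n_samples,) -- assumed prepared earlier
reg = LinearRegression()
reg.fit(x, y)
print("slope:", reg.coef_[0], "intercept:", reg.intercept_)

# Scatter plot of the data with the fitted regression line
plt.scatter(x, y, s=5, alpha=0.3)
grid = np.linspace(x.min(), x.max(), 100).reshape(-1, 1)
plt.plot(grid, reg.predict(grid), color="red")
plt.xlabel("feature (e.g., temperature)")
plt.ylabel("bike count")
plt.show()
```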

Unsupervised Learning

Overview

  • Definition: Learning from unlabeled data to find patterns or structure.
  • Types: Clustering (e.g., K-Means) and Dimensionality Reduction (e.g., PCA).

Models and Examples

K-Means Clustering

  • Concept: Partition data into K clusters based on feature similarity.
  • Steps: Initialize centroids, assign points to nearest centroid, recompute centroids, iterate until stable.
  • Evaluation: Visualize clustering results.
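
A minimal sketch with scikit-learn; X is assumed to be an (n_samples, n_features) array, and k=3 is an illustrative choice:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Lloyd's algorithm: assign points to the nearest centroid, then
# recompute centroids, repeating until the assignments stabilize
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Visualize two of the features, colored by assigned cluster
plt.scatter(X[:, 0], X[:, 1], c=labels, s=10)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker="x", color="red")
plt.show()
```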

Principal Component Analysis (PCA)

  • Concept: Reduce dimensionality by projecting data onto principal components with the largest variance.
  • Steps: Compute principal components, transform data to lower dimensions.
  • Evaluation: Visualize data in reduced dimensions.
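
A minimal sketch, assuming the same X as above:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project X onto the two directions of largest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("variance explained:", pca.explained_variance_ratio_)

plt.scatter(X_2d[:, 0], X_2d[:, 1], s=10)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```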

Example: Seeds Dataset

  • Data: Geometric attributes of wheat kernels.
  • Goals: Apply K-Means clustering and PCA to identify different wheat varieties.
  • Visualization: Compare original classes with clustered results.
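
Putting both together on the seeds data. The column names follow the UCI documentation, while the file name seeds_dataset.txt and the whitespace separator are assumptions about the download:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

cols = ["area", "perimeter", "compactness", "length",
        "width", "asymmetry", "groove", "class"]
df = pd.read_csv("seeds_dataset.txt", names=cols, sep=r"\s+")

X = df.drop(columns="class").values
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Project to 2-D with PCA and compare true varieties vs. found clusters
X_2d = PCA(n_components=2).fit_transform(X)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X_2d[:, 0], X_2d[:, 1], c=df["class"], s=10)
axes[0].set_title("true varieties")
axes[1].scatter(X_2d[:, 0], X_2d[:, 1], c=clusters, s=10)
axes[1].set_title("k-means clusters")
plt.show()
```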

Conclusion

  • Summary of supervised (classification and regression) and unsupervised learning models.
  • Encouragement to explore further and experiment with other models and datasets.
  • Invitation to share feedback and keep learning collaboratively.