Introduction to Machine Learning Concepts

Sep 26, 2024

Machine Learning for Everyone - Lecture Notes

Introduction to Kylie Ying

  • Physicist and engineer who has worked at notable places such as MIT, CERN, and freeCodeCamp.
  • The course is aimed at teaching machine learning to beginners.

Overview of Machine Learning

  • Machine learning (ML) is a sub-domain of computer science that focuses on algorithms allowing computers to learn from data without explicit programming.
  • Difference between AI, ML, and Data Science:
    • AI: Enabling computers to perform human-like tasks.
    • ML: Subset of AI focused on making predictions using data.
    • Data Science: Field that finds patterns and insights in data.

Types of Machine Learning

  1. Supervised Learning:
    • Uses labeled inputs to train models.
    • Example: Predicting classes based on features (e.g., pictures of animals).
  2. Unsupervised Learning:
    • Uses unlabeled data to find patterns or groupings.
    • Example: Clustering data points based on similarities.
  3. Reinforcement Learning:
    • An agent learns in an interactive environment through rewards and penalties.

Supervised Learning

  • Key Concepts:
    • Features: Inputs used by the model to make predictions.
    • Labels: Output to predict based on features.
  • Types of Supervised Tasks:
    • Classification: Predicting discrete classes (e.g., spam/not spam).
      • Binary Classification: Two classes (e.g., cat vs not cat).
      • Multi-Class Classification: More than two classes (e.g., cat, dog, lizard).
    • Regression: Predicting continuous values (e.g., price of a house).

Dataset Example: MAGIC Gamma Telescope Dataset

  • Uses properties of the light recorded by a gamma-ray telescope to predict which type of particle (gamma or hadron) produced it.
  • Process:
    1. Download the dataset from the UCI Machine Learning Repository.
    2. Set up a Google Colab notebook.
    3. Import libraries: NumPy, Pandas, Matplotlib, etc.
    4. Read and prepare the dataset (e.g., assign column names).
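
A minimal loading sketch (assuming the UCI file is saved as magic04.data; the column names are taken from the dataset description):

```python
import pandas as pd

# Feature names from the UCI MAGIC Gamma Telescope dataset description
cols = ["fLength", "fWidth", "fSize", "fConc", "fConc1",
        "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

# The raw file has no header row, so the column names are supplied here
df = pd.read_csv("magic04.data", names=cols)
```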

Data Preparation and Exploration

  • Data Exploration Techniques:
    • Viewing the first few entries of the dataset (using .head()).
    • Checking unique labels in the dataset.
    • Converting the gamma/hadron class labels from letters to numbers.
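
A sketch of these exploration steps, assuming the DataFrame df from above and that the class column holds 'g' (gamma) and 'h' (hadron):

```python
# Inspect the first few rows and the distinct class labels
print(df.head())
print(df["class"].unique())   # expected: ['g' 'h']

# Convert the letter labels to numbers: gamma -> 1, hadron -> 0
df["class"] = (df["class"] == "g").astype(int)
```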

Supervised Learning Implementation Steps

  1. Data Loading:
    • Load the data into a DataFrame.
  2. Data Cleaning:
    • Assign column names.
    • Convert categorical labels into numeric values.
  3. Feature Selection:
    • Select relevant features for the model.
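
The notes don't show the code for this step; one way to separate the features from the label (and create the train/test split implied by the later evaluation step) is:

```python
from sklearn.model_selection import train_test_split

# Features are every column except "class"; the label is the "class" column
X = df.drop("class", axis=1).values
y = df["class"].values

# Hold out a test set for evaluation (the split ratio is illustrative)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```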

Example Implementation in Google Colab

  • Import necessary libraries.
  • Load the data using Pandas.
  • Clean the data and handle missing values.
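
A small sketch of the missing-value check mentioned above (the UCI description lists no missing values for this dataset, so dropna() is defensive here):

```python
# Count missing values per column, then drop any incomplete rows
print(df.isnull().sum())
df = df.dropna()
```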

Machine Learning Models Implemented

  1. K-Nearest Neighbors (KNN):
    • Classifies a point based on the majority class of its nearest neighbors.
    • Distance function used: Euclidean distance.
  2. Naive Bayes:
    • Classifies data using Bayes' rule, combining prior probabilities with feature likelihoods.
    • Assumes independence of features.
  3. Logistic Regression:
    • Predicts the probability of class membership using a logistic (sigmoid) function.
    • Models the relationship between dependent and independent variables.
  4. Support Vector Machines (SVM):
    • Finds the hyperplane that best separates the classes in high-dimensional space.
  5. Neural Networks:
    • Models with layers of interconnected nodes (neurons).
    • Train on labeled data to predict outputs.
    • Use backpropagation to adjust weights based on the loss.
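
The notes list these models without code; a minimal sketch of fitting the first four with their scikit-learn implementations (using the X_train/X_test split from earlier) might look like:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),          # Euclidean distance by default
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```

For the neural network, a small fully connected Keras model is one option (layer sizes and epochs are illustrative, not taken from the notes):

```python
import tensorflow as tf

nn = tf.keras.Sequential([
    tf.keras.Input(shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # probability of the gamma class
])
nn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
nn.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2, verbose=0)
```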

Evaluation of Machine Learning Models

  • Performance metrics include:
    • For classification: Accuracy, Precision, Recall, F1 Score.
    • For regression: Mean Absolute Error (MAE), Mean Squared Error (MSE), R-Squared (R²).
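
For the classification metrics, scikit-learn's classification_report prints precision, recall, F1, and accuracy in one call (using the KNN model from the sketch above):

```python
from sklearn.metrics import classification_report

y_pred = models["KNN"].predict(X_test)
print(classification_report(y_test, y_pred))
```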

Unsupervised Learning - K-Means Clustering

  • K-means clustering algorithm:
    1. Choose number of clusters (k).
    2. Randomly assign initial centroids.
    3. Assign points to nearest centroid.
    4. Recalculate centroids based on assignments.
    5. Repeat steps 3–4 until convergence (assignments stop changing).
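
A minimal sketch with scikit-learn's KMeans, assuming an unlabeled feature matrix X:

```python
from sklearn.cluster import KMeans

# Choose k, then let the algorithm iterate assignment/recalculation until convergence
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)    # cluster index for each point
centroids = kmeans.cluster_centers_       # final centroid coordinates
```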

Unsupervised Learning - Principal Component Analysis (PCA)

  • Reduces the dimensionality of a dataset while preserving as much variance as possible:
    • Projects the data into a lower-dimensional space.
    • Finds the directions (principal components) with the greatest variance.
    • Enables visualization of high-dimensional data.
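
A sketch of PCA with scikit-learn, projecting the features onto the two directions of greatest variance for plotting:

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # fraction of variance kept by each component

plt.scatter(X_2d[:, 0], X_2d[:, 1], s=5)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```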

Conclusion

  • Summarized the key concepts and implementations in supervised and unsupervised learning.
  • Encouraged community engagement for further learning.