Machine Learning Algorithms Explained

Jul 2, 2024


Introduction

  • Algorithm: A set of commands a computer follows to perform calculations or problem-solving operations.
  • Definition: A finite set of well-defined instructions, executed in a specific order, to accomplish a task.

Supervised Learning Algorithms

Linear Regression

  • Use: Models the relationship between a continuous target variable and one or more independent variables by fitting a linear equation to the data.
  • Goal: Minimize the sum of squared residuals, i.e., the vertical distances between the data points and the regression line.
  • Example: A scatter plot of data points with the fitted regression line (see the sketch below).
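
A minimal sketch of ordinary least squares, assuming scikit-learn and a synthetic dataset (the slope of 3.0 and intercept of 2.0 are illustrative, not from the article):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Synthetic data: y is roughly linear in x, plus Gaussian noise
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(100, 1))
    y = 3.0 * X[:, 0] + 2.0 + rng.normal(0.0, 1.0, size=100)

    # fit() minimizes the sum of squared residuals (ordinary least squares)
    model = LinearRegression().fit(X, y)
    print(model.coef_, model.intercept_)  # close to the true 3.0 and 2.0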

Support Vector Machine (SVM)

  • Use: Mainly for classification, also suitable for regression.
  • Decision Boundary: Critical; drawn by maximizing the margin, i.e., the distance to the nearest points of each class (the support vectors).
  • Dimensionality: Works in n-dimensional space, decision boundary can be a line, plane, or hyperplane.
  • Effectiveness: Good for cases where the number of dimensions exceeds the number of samples.
  • Efficiency: Memory efficient because only the support vectors are stored, but training time grows quickly as the dataset gets larger.
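
A minimal classification sketch, assuming scikit-learn's SVC with an RBF kernel on synthetic data; the kernel choice and C value are illustrative:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Synthetic binary classification problem
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Scaling matters: the margin is measured as a distance in feature space
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on held-out data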

Naïve Bayes

  • Use: Classification tasks.
  • Assumptions: Features are conditionally independent of one another given the class (the "naïve" assumption).
  • Base: Bayes' theorem.
  • Efficiency: Very fast, but accuracy can suffer when the independence assumption does not hold; see the sketch below.
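
A minimal sketch, assuming scikit-learn's GaussianNB (which models each feature as a per-class Gaussian) and the Iris dataset as an illustrative example:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Each feature is treated as independent given the class label
    clf = GaussianNB().fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on held-out data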

Logistic Regression

  • Use: Binary classification problems.
  • Function: Uses the logistic (sigmoid) function to map a linear score to a value between 0 and 1.
  • Output: Probability values that are thresholded to produce a class label (e.g., spam detection, ad clicks).
  • Advantages: Simple, effective.
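
A minimal sketch, assuming scikit-learn's LogisticRegression on synthetic data, showing that the sigmoid output is a probability:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=300, n_features=4, random_state=0)
    clf = LogisticRegression().fit(X, y)

    # The sigmoid maps the linear score to a probability in (0, 1)
    print(clf.predict_proba(X[:3]))  # each row sums to 1; column 1 is P(y = 1)
    print(clf.predict(X[:3]))        # labels from thresholding at 0.5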

K-Nearest Neighbors (KNN)

  • Use: Classification and regression tasks.
  • Principle: Predicts by majority vote of the K nearest neighbors for classification, or by their mean value for regression.
  • Challenges: Choosing an optimal K value (explored in the sketch below); sensitive to outliers and to the scale of the features.
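
A minimal sketch of the K-selection problem, assuming scikit-learn and the Iris dataset; the candidate K values are arbitrary:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # Cross-validate a few candidate K values; the best K is data-dependent
    for k in (1, 5, 15):
        score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
        print(f"K={k}: mean accuracy {score:.3f}")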

Decision Trees

  • Process: Iteratively asks questions to partition data.
  • Goal: Increase predictive power by making each node purer, i.e., more dominated by a single class.
  • Overfitting: A risk if the tree becomes too specific to the training data; limiting depth helps, as in the sketch below.
  • Advantages: Does not require normalization or scaling.
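
A minimal sketch, assuming scikit-learn's DecisionTreeClassifier; the max_depth=3 cap is an illustrative way to limit overfitting:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # max_depth bounds how specific the splits can get, guarding against overfitting
    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on held-out data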

Random Forest

  • Type: Ensemble of decision trees using bagging.
  • Method: Parallel estimators (majority vote for classification, mean value for regression).
  • Advantages: Higher accuracy, reduces overfitting.
  • Challenges: Works best when the individual trees are uncorrelated; bootstrap sampling and random feature subsets help achieve this.
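
A minimal sketch, assuming scikit-learn's RandomForestClassifier on synthetic data; 100 trees is an illustrative default:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Each tree trains on a bootstrap sample with random feature subsets,
    # which decorrelates the trees before their votes are aggregated
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on held-out data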

Gradient Boosted Decision Trees (GBDT)

  • Type: Ensemble algorithm using boosting.
  • Method: Adds trees sequentially, each one fitted to the residual errors of the ensemble built so far.
  • Advantages: High efficiency, accurate predictions, handles mixed feature types.
  • Challenges: Requires careful hyperparameter tuning.
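
A minimal regression sketch, assuming scikit-learn's GradientBoostingRegressor; the learning_rate, tree depth, and tree count shown are illustrative hyperparameters of the kind that need tuning:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Each new tree fits the residual errors of the current ensemble;
    # learning_rate and n_estimators are typically tuned together
    model = GradientBoostingRegressor(
        n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
    ).fit(X_train, y_train)
    print(model.score(X_test, y_test))  # R^2 on held-out data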

Unsupervised Learning Algorithms

K-Means Clustering

  • Use: Partition data into K clusters based on similarity.
  • Process: Iterative: choose initial centroids, assign each point to its nearest centroid, recompute the centroids, and repeat until assignments stabilize.
  • Advantages: Fast, easy to interpret.
  • Challenges: The number of clusters (K) must be chosen in advance.
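
A minimal sketch, assuming scikit-learn's KMeans on synthetic blobs; K=3 is chosen to match how the data was generated:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Three well-separated synthetic blobs
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # K must be fixed up front; here K=3 matches the generating process
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)  # final centroid positions
    print(km.labels_[:10])      # cluster assignments of the first ten points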

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

  • Use: Finding arbitrarily shaped clusters and detecting outliers.
  • Parameters: eps (the neighborhood radius) and MinPts (the minimum number of points required to form a dense region).
  • Classification: Core points, border points, outliers.
  • Advantages: Does not require predefined number of clusters, robust to outliers.
  • Challenges: Choosing a good eps can be difficult, particularly when density varies across the dataset.
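
A minimal sketch, assuming scikit-learn's DBSCAN on the two-moons dataset; the eps and min_samples values are illustrative:

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # Two interleaving half-moons: non-convex clusters that K-Means handles poorly
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    # eps is the neighborhood radius; min_samples corresponds to MinPts
    db = DBSCAN(eps=0.2, min_samples=5).fit(X)
    print(set(db.labels_))  # cluster ids; the label -1 marks outliers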

Principal Component Analysis (PCA)

  • Type: Dimensionality reduction algorithm.
  • Use: Derives new, uncorrelated features (principal components) while retaining as much of the original information as possible.
  • Advantages: Retains most of the variance in the data using far fewer features.
  • Order: Principal components are ordered by the amount of variance they explain.
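
A minimal sketch, assuming scikit-learn's PCA on the Iris dataset, reducing four features to two components:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)

    # Project the 4 original features onto the 2 highest-variance components
    pca = PCA(n_components=2).fit(X)
    print(pca.explained_variance_ratio_)  # components ordered by explained variance
    X_2d = pca.transform(X)
    print(X_2d.shape)  # (150, 2)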