Machine Learning Algorithms Explained

Jul 2, 2024


Introduction

  • Algorithm: A set of commands a computer follows to perform calculations or problem-solving operations.
  • Definition: A finite set of well-defined instructions, executed in a specific order, to accomplish a task.

Supervised Learning Algorithms

Linear Regression

  • Use: Models the relationship between a continuous target variable and one or more independent variables by fitting a linear equation to the data.
  • Goal: Minimize the sum of squared residuals, i.e., the vertical distances between the data points and the regression line.
  • Example: A scatter plot of data points with the fitted regression line (see the sketch below).
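
A minimal sketch of ordinary least squares, assuming scikit-learn and a synthetic dataset (the slope of 3.0 and intercept of 2.0 are illustrative, not from the article):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Synthetic data: y is roughly linear in x, plus Gaussian noise
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(100, 1))
    y = 3.0 * X[:, 0] + 2.0 + rng.normal(0.0, 1.0, size=100)

    # fit() minimizes the sum of squared residuals (ordinary least squares)
    model = LinearRegression().fit(X, y)
    print(model.coef_, model.intercept_)  # close to the true 3.0 and 2.0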

Support Vector Machine (SVM)

  • Use: Mainly for classification, also suitable for regression.
  • Decision Boundary: Critical; drawn by maximizing the margin, i.e., the distance to the nearest points of each class (the support vectors).
  • Dimensionality: Works in n-dimensional space, decision boundary can be a line, plane, or hyperplane.
  • Effectiveness: Good for cases where the number of dimensions exceeds the number of samples.
  • Efficiency: Memory efficient because only the support vectors are stored, but training time grows quickly as the dataset gets larger.
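
A minimal classification sketch, assuming scikit-learn's SVC with an RBF kernel on synthetic data; the kernel choice and C value are illustrative:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Synthetic binary classification problem
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Scaling matters: the margin is measured as a distance in feature space
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on held-out data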

Naïve Bayes

  • Use: Classification tasks.
  • Assumptions: Features are conditionally independent of one another given the class (the "naïve" assumption).
  • Base: Bayes' theorem.
  • Efficiency: Very fast, but accuracy can suffer when the independence assumption does not hold; see the sketch below.
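
A minimal sketch, assuming scikit-learn's GaussianNB (which models each feature as a per-class Gaussian) and the Iris dataset as an illustrative example:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Each feature is treated as independent given the class label
    clf = GaussianNB().fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on held-out data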

Logistic Regression

  • Use: Binary classification problems.
  • Function: Uses the logistic (sigmoid) function to map a linear score to a value between 0 and 1.
  • Output: Probability values that are thresholded to produce a class label (e.g., spam detection, ad clicks).
  • Advantages: Simple, effective.
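
A minimal sketch, assuming scikit-learn's LogisticRegression on synthetic data, showing that the sigmoid output is a probability:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=300, n_features=4, random_state=0)
    clf = LogisticRegression().fit(X, y)

    # The sigmoid maps the linear score to a probability in (0, 1)
    print(clf.predict_proba(X[:3]))  # each row sums to 1; column 1 is P(y = 1)
    print(clf.predict(X[:3]))        # labels from thresholding at 0.5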

K-Nearest Neighbors (KNN)

  • Use: Classification and regression tasks.
  • Principle: Predicts by majority vote of the K nearest neighbors for classification, or by their mean value for regression.
  • Challenges: Choosing an optimal K value (explored in the sketch below); sensitive to outliers and to the scale of the features.
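
A minimal sketch of the K-selection problem, assuming scikit-learn and the Iris dataset; the candidate K values are arbitrary:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # Cross-validate a few candidate K values; the best K is data-dependent
    for k in (1, 5, 15):
        score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
        print(f"K={k}: mean accuracy {score:.3f}")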

Decision Trees

  • Process: Iteratively asks questions to partition data.
  • Goal: Increase predictive power by making each node purer, i.e., more dominated by a single class.
  • Overfitting: A risk if the tree becomes too specific to the training data; limiting depth helps, as in the sketch below.
  • Advantages: Does not require normalization or scaling.
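
A minimal sketch, assuming scikit-learn's DecisionTreeClassifier; the max_depth=3 cap is an illustrative way to limit overfitting:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # max_depth bounds how specific the splits can get, guarding against overfitting
    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on held-out data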

Random Forest

  • Type: Ensemble of decision trees using bagging.
  • Method: Parallel estimators (majority vote for classification, mean value for regression).
  • Advantages: Higher accuracy, reduces overfitting.
  • Challenges: Works best when the individual trees are uncorrelated; bootstrap sampling and random feature subsets help achieve this.
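
A minimal sketch, assuming scikit-learn's RandomForestClassifier on synthetic data; 100 trees is an illustrative default:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Each tree trains on a bootstrap sample with random feature subsets,
    # which decorrelates the trees before their votes are aggregated
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on held-out data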

Gradient Boosted Decision Trees (GBDT)

  • Type: Ensemble algorithm using boosting.
  • Method: Adds trees sequentially, each one fitted to the residual errors of the ensemble built so far.
  • Advantages: High efficiency, accurate predictions, handles mixed feature types.
  • Challenges: Requires careful hyperparameter tuning.
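
A minimal regression sketch, assuming scikit-learn's GradientBoostingRegressor; the learning_rate, tree depth, and tree count shown are illustrative hyperparameters of the kind that need tuning:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Each new tree fits the residual errors of the current ensemble;
    # learning_rate and n_estimators are typically tuned together
    model = GradientBoostingRegressor(
        n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
    ).fit(X_train, y_train)
    print(model.score(X_test, y_test))  # R^2 on held-out data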

Unsupervised Learning Algorithms

K-Means Clustering

  • Use: Partition data into K clusters based on similarity.
  • Process: Iterative: choose initial centroids, assign each point to its nearest centroid, recompute the centroids, and repeat until assignments stabilize.
  • Advantages: Fast, easy to interpret.
  • Challenges: The number of clusters (K) must be chosen in advance.
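
A minimal sketch, assuming scikit-learn's KMeans on synthetic blobs; K=3 is chosen to match how the data was generated:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Three well-separated synthetic blobs
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # K must be fixed up front; here K=3 matches the generating process
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)  # final centroid positions
    print(km.labels_[:10])      # cluster assignments of the first ten points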

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

  • Use: Finding arbitrarily shaped clusters and detecting outliers.
  • Parameters: eps (the neighborhood radius) and MinPts (the minimum number of points required to form a dense region).
  • Classification: Core points, border points, outliers.
  • Advantages: Does not require predefined number of clusters, robust to outliers.
  • Challenges: Choosing a good eps can be difficult, particularly when density varies across the dataset.
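
A minimal sketch, assuming scikit-learn's DBSCAN on the two-moons dataset; the eps and min_samples values are illustrative:

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # Two interleaving half-moons: non-convex clusters that K-Means handles poorly
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    # eps is the neighborhood radius; min_samples corresponds to MinPts
    db = DBSCAN(eps=0.2, min_samples=5).fit(X)
    print(set(db.labels_))  # cluster ids; the label -1 marks outliers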

Principal Component Analysis (PCA)

  • Type: Dimensionality reduction algorithm.
  • Use: Derives new, uncorrelated features (principal components) while retaining as much of the original information as possible.
  • Advantages: Retains most of the variance in the data using far fewer features.
  • Order: Principal components are ordered by the amount of variance they explain.
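
A minimal sketch, assuming scikit-learn's PCA on the Iris dataset, reducing four features to two components:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)

    # Project the 4 original features onto the 2 highest-variance components
    pca = PCA(n_components=2).fit(X)
    print(pca.explained_variance_ratio_)  # components ordered by explained variance
    X_2d = pca.transform(X)
    print(X_2d.shape)  # (150, 2)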