Data Science, Machine Learning, and Data Analysis Lecture Notes

Jul 20, 2024

Key Topics Covered

  • Supervised Learning: Classification and Regression
  • Unsupervised Learning: Clustering and Dimensionality Reduction
  • Semi-supervised Learning
  • Supervised Learning vs. Unsupervised Learning
    • Classification examples: email spam detection
    • Regression examples: predicting continuous values
  • Mathematical Formulas and Concepts
    • Linear regression: y = mx + b
    • Probability Basics
    • Nearest Neighbors
  • Machine Learning Issues
    • Underfitting and Overfitting
    • Validation and Regularization
  • Ensemble Methods
    • Bagging, Decision Trees, Random Forests
  • Support Vector Machines (SVM)
  • Naive Bayes and Probabilistic Models
    • Posterior and Likelihood Probabilities
  • Data Processing with NumPy and Pandas
    • Data manipulation, indexing
    • Splitting datasets, scaling variables
  • Feature Scaling
    • Normalization
    • Standardization (Z-Score)
  • Data Visualization
    • Outliers
  • Performance Metrics
    • Accuracy
    • Precision
    • Recall
    • F1 Score
    • Mean Absolute Error (regression)
    • Mean Squared Error (regression)
  • Dimensionality Reduction
    • Principal Component Analysis (PCA)

Supervised Learning

  • Classification
    • Examples include email spam detection.
    • Features (X): number of words, keywords.
    • Labels (y): spam or not spam.
  • Regression
    • Examples include predicting numeric values (e.g., prices).
    • Linear regression formula: y = mx + b.
    • Concepts: independent variables (X), dependent variable (y).
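A minimal sketch of fitting y = mx + b by least squares; the data below is invented and lies exactly on y = 2x + 1, so the fit should recover those coefficients:

```python
# Least-squares fit of y = m*x + b on invented toy data.
import numpy as np

def fit_line(x, y):
    """Return slope m and intercept b minimizing squared error."""
    m, b = np.polyfit(x, y, deg=1)
    return m, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])  # exactly y = 2x + 1
m, b = fit_line(x, y)
print(m, b)  # -> approximately 2.0 and 1.0
```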

Unsupervised Learning

  • Clustering
  • Dimensionality Reduction: Simplifying datasets while retaining important information.
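Clustering can be illustrated with a bare-bones k-means (Lloyd's algorithm) on 1-D points; the data and starting centers are invented for the example:

```python
# Bare-bones 1-D k-means: alternate assignment and centroid update.
import numpy as np

def kmeans_1d(points, centers, n_iter=10):
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(n_iter):
        # assign each point to its nearest center
        labels = np.argmin(np.abs(points[:, None] - centers[None, :]), axis=1)
        # move each center to the mean of its assigned points
        centers = np.array([points[labels == k].mean() for k in range(len(centers))])
    return centers, labels

centers, labels = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0])
print(centers, labels)  # two cluster centers near 1.0 and 9.0
```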

Important Concepts in Detail

  • Linear Regression
    • Simple linear equation y = mx + b.
  • Probability Basics
    • Posterior probability P(A|B)
    • Likelihood P(B|A)
    • Prior probability P(A)

Nearest Neighbors

  • Calculating distances between data points.
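Nearest-neighbor search reduces to computing distances and taking the minimum; a sketch with Euclidean distance (the points and query are invented toy data):

```python
# 1-nearest-neighbor lookup with Euclidean distance on toy points.
import numpy as np

def nearest_neighbor(query, points):
    """Return the index of the point closest to `query`."""
    dists = np.linalg.norm(np.asarray(points) - np.asarray(query), axis=1)
    return int(np.argmin(dists))

points = [[0.0, 0.0], [5.0, 5.0], [1.0, 1.0]]
idx = nearest_neighbor([0.9, 1.1], points)
print(idx)  # -> 2, i.e. the point [1.0, 1.0]
```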

Common Problems in Machine Learning

  • Underfitting: Model is too simple.
  • Overfitting: Model is too complex.
  • Regularization: Technique to prevent overfitting.
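Regularization can be illustrated with ridge regression, which adds an L2 penalty to least squares; this sketch uses the closed form w = (XᵀX + λI)⁻¹Xᵀy, with invented data and λ:

```python
# Ridge regression (L2 regularization) via its closed-form solution.
import numpy as np

def ridge_fit(X, y, lam):
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias column + feature
y = np.array([3.0, 5.0, 7.0])                        # exactly y = 2x + 1
w_plain = ridge_fit(X, y, lam=0.0)  # ordinary least squares
w_ridge = ridge_fit(X, y, lam=1.0)  # the penalty shrinks the weights
print(w_plain, w_ridge)
```

With λ = 0 the fit recovers [1.0, 2.0] exactly; with λ = 1 the weight vector has a smaller norm, which is the shrinkage effect that combats overfitting.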

Ensemble Methods

  • Bagging (bootstrap aggregating): Train models on bootstrap samples and combine their predictions.
    • Bootstrap samples are drawn with replacement.
  • Random Forests: Ensembles of decision trees trained on bootstrap samples.
  • Support Vector Machines (SVM), covered alongside: Classify by finding the hyperplane that separates the classes with the maximum margin.
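The bagging idea above (bootstrap samples, aggregated predictions) can be sketched with a toy "model" that just predicts the mean of its sample:

```python
# Bagging sketch: draw bootstrap samples (with replacement), let each
# "model" predict its sample mean, then aggregate by averaging.
import numpy as np

rng = np.random.default_rng(0)
data = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

def bootstrap_sample(data, rng):
    idx = rng.integers(0, len(data), size=len(data))  # with replacement
    return data[idx]

predictions = [bootstrap_sample(data, rng).mean() for _ in range(200)]
bagged = float(np.mean(predictions))  # aggregated ensemble prediction
print(bagged)  # close to the true mean, 6.0
```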

Naive Bayes Classifier

  • Probability Formulas
    • Posterior probability: P(A|B) = (P(B|A) * P(A)) / P(B)
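Plugging numbers into the formula above makes it concrete; the spam prior and word likelihoods below are invented for illustration, not from the lecture:

```python
# Worked Bayes' rule example with invented spam-filter numbers.
def posterior(prior, likelihood, likelihood_other):
    evidence = likelihood * prior + likelihood_other * (1 - prior)  # P(B)
    return likelihood * prior / evidence  # P(A|B) = (P(B|A) * P(A)) / P(B)

# Assume P(spam) = 0.2, P(word|spam) = 0.6, P(word|not spam) = 0.05.
p = posterior(prior=0.2, likelihood=0.6, likelihood_other=0.05)
print(round(p, 2))  # -> 0.75
```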

Data Processing

  • NumPy and Pandas
    • Creating arrays, data manipulation.
    • Example: np.array(), pd.DataFrame().
    • Handling missing data, replacing values.
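A small illustrative snippet covering np.array(), pd.DataFrame(), and replacing a missing value; the column name and values are made up:

```python
# Create an array and a DataFrame, then fill a missing value.
import numpy as np
import pandas as pd

arr = np.array([1.0, 2.0, 3.0])                       # np.array()
df = pd.DataFrame({"price": [10.0, np.nan, 30.0]})    # pd.DataFrame()
df["price"] = df["price"].fillna(df["price"].mean())  # replace missing value
print(df["price"].tolist())  # -> [10.0, 20.0, 30.0]
```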

Feature Scaling

  • Normalization: Rescaling to [0, 1].
    • Example formula: (X - min) / (max - min).
  • Standardization (Z-Score)
    • Formula: (X - mean) / standard deviation.
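Both scaling formulas applied to an invented feature vector:

```python
# Normalization to [0, 1] and z-score standardization of a toy feature.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

normalized = (x - x.min()) / (x.max() - x.min())  # rescaled to [0, 1]
standardized = (x - x.mean()) / x.std()           # z-score (population std)
print(normalized)                                 # values between 0 and 1
print(standardized.mean(), standardized.std())    # -> about 0.0 and 1.0
```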

Data Visualization

  • Outliers: Detecting and managing outliers.
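One common rule for flagging outliers (not necessarily the one used in the lecture) is the 1.5 × IQR fence; the data below is invented, with one obvious outlier:

```python
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
import numpy as np

def iqr_outliers(x):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return x[(x < low) | (x > high)]

data = np.array([10.0, 11.0, 12.0, 11.5, 10.5, 50.0])
print(iqr_outliers(data))  # -> [50.]
```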

Performance Metrics

  • Accuracy: Correct predictions / Total predictions.
  • Precision: True Positives / (True Positives + False Positives).
  • Recall: True Positives / (True Positives + False Negatives).
  • F1 Score: 2 * (Precision * Recall) / (Precision + Recall).
  • Mean Absolute Error (MAE), a regression metric: Average of absolute differences between predicted and actual values.
  • Mean Squared Error (MSE), a regression metric: Average of squared differences between predicted and actual values.
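The metrics above computed from hypothetical numbers (8 true positives, 2 false positives, 1 false negative, 9 true negatives for classification, plus a tiny made-up regression example):

```python
# Classification metrics from confusion counts, regression metrics
# from toy predictions; all numbers are invented for illustration.
tp, fp, fn, tn = 8, 2, 1, 9

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.85
precision = tp / (tp + fp)                          # 0.8
recall = tp / (tp + fn)                             # 8/9
f1 = 2 * precision * recall / (precision + recall)  # 16/19

y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 8.0]
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, precision, round(recall, 3), round(f1, 3), mae, round(mse, 3))
```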

Dimensionality Reduction

  • Principal Component Analysis (PCA)
    • Transforming correlated features into uncorrelated components.
    • Explained variance by components.
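A PCA sketch via eigendecomposition of the covariance matrix, on invented 2-D data that lies nearly on a line (in practice a library routine such as scikit-learn's PCA would be used):

```python
# PCA from scratch: center, take the covariance matrix's eigenvectors,
# sort by eigenvalue, and project onto the principal directions.
import numpy as np

X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9]])
Xc = X - X.mean(axis=0)                     # center the data
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]           # sort descending
components = eigvecs[:, order]              # principal directions
explained = eigvals[order] / eigvals.sum()  # explained variance ratio
scores = Xc @ components                    # uncorrelated coordinates
print(explained)  # the first component carries nearly all the variance
```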