Data Science and Machine Learning Lecture

Jun 30, 2024

Overview

  • Data science and machine learning: often described as the hottest job of the 21st century
  • Average salary: $120,000 per year (according to LinkedIn)
  • Among top five jobs globally (according to LinkedIn)

Entry Requirements

  • A good command of statistics is essential
    • Basis for data science concepts
    • Enables predictions, e.g., tornadoes in New York, stock market crashes
    • Not magic, just statistics

Course Introduction

  • Comprehensive course by Dr. Abhinanda Sarkar
    • PhD in Statistics from Stanford University; taught at MIT
    • Research staff member at IBM; led various functions at General Electric
    • Co-founded Omix Labs
    • Uploaded by Great Learning as part of its Business Analytics and Business Intelligence course
    • Available on YouTube for a limited time

Agenda

  1. Difference between statistics and machine learning
  2. Types of analytics: Descriptive, Predictive, and Prescriptive
  3. Types of data available
  4. Correlation and Covariance
  5. Probability and Bayes Theorem
  6. Probability distributions: Binomial and Poisson

Key Concepts

Statistics vs. Machine Learning

  • Statistics: Formulate the problem first, then collect data to answer it
  • Machine Learning: Start from the available data, then infer patterns from it
  • The two fields therefore differ in how they approach problem resolution
  • These differences can create employment challenges for statisticians

Applying Statistics in Business Problems

  • Example: Watch company sales decline
    • Identify who buys watches
    • Compare sales data year-on-year, region-wise, age-wise
    • Understand sales segments and declines
    • Use data analysis to derive insights
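
The segment comparison described above can be sketched in a few lines of Python. The records, region names, age groups, and unit counts below are invented purely for illustration and are not from the lecture; the helper names `totals_by` and `year_on_year_change` are hypothetical as well.

```python
from collections import defaultdict

# Hypothetical sales records: (year, region, age_group, units_sold).
# All figures are invented for illustration only.
SALES = [
    (2022, "North", "18-30", 120),
    (2022, "North", "31-50", 200),
    (2022, "South", "18-30", 150),
    (2022, "South", "31-50", 180),
    (2023, "North", "18-30", 70),
    (2023, "North", "31-50", 195),
    (2023, "South", "18-30", 90),
    (2023, "South", "31-50", 185),
]


def totals_by(records, key_index):
    """Sum units per (segment, year); key_index selects the segment column."""
    totals = defaultdict(int)
    for record in records:
        year, units = record[0], record[3]
        totals[(record[key_index], year)] += units
    return dict(totals)


def year_on_year_change(records, key_index, prev_year, curr_year):
    """Percentage change in units for each segment between two years."""
    totals = totals_by(records, key_index)
    segments = sorted({segment for segment, _ in totals})
    changes = {}
    for segment in segments:
        prev = totals.get((segment, prev_year), 0)
        curr = totals.get((segment, curr_year), 0)
        if prev:  # skip segments that have no baseline year
            changes[segment] = 100.0 * (curr - prev) / prev
    return changes


by_region = year_on_year_change(SALES, 1, 2022, 2023)
by_age = year_on_year_change(SALES, 2, 2022, 2023)
print("Change by region (%):", by_region)
print("Change by age group (%):", by_age)
```

With this made-up data, both regions decline by a similar amount, while the age breakdown shows the drop is concentrated in the 18-30 group. That is the kind of insight the segment-wise comparison is meant to surface.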

Descriptive, Predictive, and Prescriptive Analytics

  • Descriptive: Describe data (e.g., sales trends)
  • Predictive: Forecast future events based on current data
  • Prescriptive: Suggest actions based on predictions
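
A minimal sketch of how the three kinds of analytics differ in practice, assuming a made-up yearly sales series. The numbers, the straight-line forecast, and the 800-unit threshold in the hypothetical `recommend` function are illustrative assumptions, not methods taken from the lecture.

```python
# Hypothetical yearly unit sales, invented for illustration.
years = [2019, 2020, 2021, 2022, 2023]
units = [1000, 950, 900, 850, 800]

# Descriptive: summarize what has already happened.
average_units = sum(units) / len(units)
total_change = units[-1] - units[0]


# Predictive: fit a least-squares line and extrapolate one year ahead.
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(years, units)
forecast_2024 = slope * 2024 + intercept


# Prescriptive: turn the forecast into a suggested action.
# The threshold is an arbitrary assumption chosen for this example.
def recommend(forecast, threshold=800):
    if forecast < threshold:
        return "investigate declining segments and adjust pricing or marketing"
    return "maintain current strategy"


action = recommend(forecast_2024)
print("Descriptive: average", average_units, "change", total_change)
print("Predictive: 2024 forecast", forecast_2024)
print("Prescriptive:", action)
```

A real prescriptive step would usually weigh costs and constraints rather than apply a single threshold; the rule here only marks where that decision logic would sit.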

Descriptive Analytics

  • Summarizing data using mean, median, standard deviation, and other measures
  • Visualization techniques: histograms, box plots, scatter plots
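
As a rough illustration of these summary measures, the sketch below uses Python's standard `statistics` module on a small invented sample containing one outlier. The data values are assumptions made for the example.

```python
import statistics

# Hypothetical daily sales counts with one outlier (90), invented for illustration.
data = [12, 15, 11, 14, 90, 13, 15, 12]

mean = statistics.mean(data)
median = statistics.median(data)
stdev = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)

# Quartiles are the values a box plot is drawn from.
q1, q2, q3 = statistics.quantiles(data, n=4)

print("mean:", mean)
print("median:", median)
print("sample standard deviation:", round(stdev, 2))
print("quartiles:", q1, q2, q3)
```

The single outlier pulls the mean well above the median, which is one reason to report both and to look at a histogram or box plot before trusting either number alone.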

Probability

  • Understanding randomness and uncertainty
  • Calculating probabilities
  • Concepts such as independent events and mutually exclusive events
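
The difference between independent and mutually exclusive events can be checked by enumerating the outcomes of two fair dice. This is a small sketch; the specific events chosen are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Exact probability of an event, given as a predicate over outcomes."""
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))


def first_is_six(outcome):
    return outcome[0] == 6


def second_is_six(outcome):
    return outcome[1] == 6


def sum_is_two(outcome):
    return outcome[0] + outcome[1] == 2  # only (1, 1)


# Independent events: P(A and B) equals P(A) * P(B).
p_both_six = prob(lambda o: first_is_six(o) and second_is_six(o))
independent = p_both_six == prob(first_is_six) * prob(second_is_six)

# Mutually exclusive events: P(A and B) is 0, so P(A or B) equals P(A) + P(B).
p_six_and_two = prob(lambda o: first_is_six(o) and sum_is_two(o))
p_six_or_two = prob(lambda o: first_is_six(o) or sum_is_two(o))

print("independent:", independent)
print("P(first is six and sum is two):", p_six_and_two)
print("P(first is six or sum is two):", p_six_or_two)
```

Mutually exclusive events with nonzero probability are never independent: knowing that one occurred tells you the other did not.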

Bayes Theorem

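  • Crucial for updating probabilities based on new evidence
  • Practical example: Email spam detection
  • Formula:

    P(A|B) = [P(B|A) * P(A)] / P(B), with P(B) > 0

  • P(A) is the prior, P(B|A) is the likelihood, and P(A|B) is the posterior probability of A after observing B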
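
To make the spam example concrete, here is a small sketch that applies the formula to a single word. The probabilities (20% of mail is spam, the word appears in 60% of spam and 5% of legitimate mail) are invented for illustration, and the function name `posterior_spam` is hypothetical.

```python
def posterior_spam(p_spam, p_word_given_spam, p_word_given_ham):
    """P(spam | word) via Bayes' theorem.

    P(word) is expanded with the law of total probability:
    P(word) = P(word | spam) * P(spam) + P(word | ham) * P(ham).
    """
    p_ham = 1.0 - p_spam
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word


# Assumed, illustrative numbers only.
p = posterior_spam(p_spam=0.2, p_word_given_spam=0.6, p_word_given_ham=0.05)
print("P(spam | word) =", round(p, 4))
```

Under these assumed numbers, seeing the word raises the probability of spam from the prior of 0.2 to a posterior of 0.75.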
    

Probability Distributions

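  • Binomial Distribution
    • Discrete probability distribution of the number of successes in a sequence of n independent experiments, each with the same success probability p
  • Poisson Distribution
    • Discrete probability distribution of the number of events occurring in a fixed interval of time or space, assuming events occur independently at a constant average rate
  • Normal Distribution
    • Continuous probability distribution that is symmetric, bell-shaped, and fully defined by its mean and standard deviation
    • Central Limit Theorem: the mean of a large number of independent, identically distributed observations with finite variance is approximately normally distributed, which is why the normal distribution is so common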
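
The sketch below writes out the binomial and Poisson probability mass functions and runs a small simulation suggestive of the Central Limit Theorem. The parameter values, the sample sizes, and the choice of a uniform source distribution are assumptions made for the example.

```python
import math
import random
import statistics


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(exactly k events in an interval when the average count is lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


print("P(3 heads in 10 fair flips):", binomial_pmf(3, 10, 0.5))
print("P(0 events when 2 are expected):", round(poisson_pmf(0, 2.0), 4))

# Central Limit Theorem sketch: means of samples from a uniform (non-normal)
# distribution cluster around 0.5 with spread close to sqrt((1/12) / 30).
random.seed(0)  # fixed seed so the run is repeatable
sample_size = 30
sample_means = [
    statistics.fmean([random.random() for _ in range(sample_size)])
    for _ in range(2000)
]
mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)
theoretical_spread = math.sqrt((1 / 12) / sample_size)

print("mean of sample means:", round(mean_of_means, 3))
print("spread of sample means:", round(spread_of_means, 4))
print("theoretical spread:", round(theoretical_spread, 4))
```

Plotting `sample_means` as a histogram would show the bell shape, even though each individual draw comes from a flat distribution.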

Practical Applications

  • Business: Sales forecasting, quality control, risk management
  • Healthcare: Diagnostic tests, disease spread modeling
  • Technology: Machine learning models, A/B testing

Conclusion

  • Statistics forms the backbone of data science and machine learning
  • Crucial to understanding data and making informed decisions based on it