Data Science Statistics Lecture Notes

Jul 11, 2024

Lecture Notes: Data Science Statistics Basics

Topics Covered

  1. Introduction to Data Science Statistics

    • Differences between Descriptive and Inferential Statistics
  2. Descriptive Statistics Topics

    • Measures of Central Tendency
    • Measures of Dispersion
    • Data Summarizing Tools
      • Histograms
      • Box Plot (Box-and-Whisker Plot)
  3. Detailed Breakdown of Descriptive Statistics

    • Histograms: Understanding PDF, CDF, and creation techniques.
    • Probability & Permutations:
      • Importance in Data Science
      • Mean, Median, Mode
      • Variance, Standard Deviation
    • Distributions:
      • Gaussian (Normal)
      • Log-Normal
      • Binomial
      • Bernoulli
      • Pareto (Power Law)
      • Standard Normal
  4. Techniques in Standard Normal Distribution

    • Transformation
    • Standardization
    • QQ Plot
    • Determining Normality of a Distribution
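Standardization rescales values to z-scores, z = (x - mean) / standard deviation, so the result has mean 0 and standard deviation 1. A minimal sketch using only the Python standard library; the `standardize` helper and the height values are made up for illustration:

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]

heights = [150, 160, 170, 180, 190]  # hypothetical heights in cm
z_scores = standardize(heights)

# After standardization the data has mean 0 and standard deviation 1.
print([round(z, 3) for z in z_scores])
print(round(mean(z_scores), 6), round(pstdev(z_scores), 6))
```

Standardizing changes the location and scale of the data, not its shape, so a skewed distribution stays skewed.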
  5. Inferential Statistics

    • Hypothesis Testing: Null and Alternative Hypotheses
      • P-Values
      • Confidence Intervals
      • Z-Test, T-Test, Chi-Square Test
      • ANOVA (F-Test)
    • Importance of Hypothesis Testing
      • Defining the P-Value and using Z and T tables
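As a rough illustration of a confidence interval, the sketch below computes an interval for a mean when the population standard deviation is assumed known, so the critical value comes from the standard normal distribution (the value a Z table would give). The function name and all numbers are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(sample_mean, pop_sd, n, confidence=0.95):
    """Interval for the mean, assuming a known population standard deviation."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_crit * pop_sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical numbers: 100 students, sample mean 103, known sd 15.
low, high = mean_confidence_interval(sample_mean=103, pop_sd=15, n=100)
print(round(low, 2), round(high, 2))
```

When the population standard deviation is unknown and the sample is small, the critical value would come from the t distribution instead.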
  6. Introduction to Statistics

    • Definitions: Collection, Organization, and Analysis of Data
    • Importance in Decision Making
    • Data: Facts/Information that can be Measured
  7. Types of Statistics: Descriptive vs. Inferential

    • Descriptive Statistics: Organizing and Summarizing Data
      • Example: Average age of students
    • Inferential Statistics: Making Conclusions from Data
      • Example: Comparing the average IQ of a class with that of the whole college
  8. Population and Sample Concepts

    • Population: Complete data set
    • Sample: Subset of the population
    • Sampling Techniques
      • Simple Random Sampling
      • Stratified Sampling
      • Systematic Sampling
      • Convenience Sampling
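The four sampling techniques can be sketched with the standard `random` module. The population of 100 numbered IDs and the odd/even strata are invented purely for illustration; real strata would be meaningful groups such as age bands:

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical member IDs 1..100

# Simple random sampling: every member has an equal chance of selection.
simple = random.sample(population, 10)

# Systematic sampling: every k-th member after a random starting point.
k = len(population) // 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified sampling: split into groups (strata), then sample each group.
strata = {
    "odd": [x for x in population if x % 2 == 1],
    "even": [x for x in population if x % 2 == 0],
}
stratified = [x for group in strata.values() for x in random.sample(group, 5)]

# Convenience sampling: take whoever is easiest to reach (here, the first 10).
convenience = population[:10]
```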
  9. Variables

    • Types
      • Quantitative: Can be measured numerically (e.g., Age, Height, Weight)
      • Qualitative (Categorical): Based on characteristics (e.g., Gender, Blood Group)
    • Quantitative Variables
      • Discrete: Countable whole-number values (e.g., Number of Bank Accounts)
      • Continuous: Any value within a range (e.g., Height, Weight)
  10. Variable Measurement Levels

    • Nominal: Categorical (e.g., Color, Gender)
    • Ordinal: Rank-Ordered (e.g., Ranks)
    • Interval: Order and Differences Matter, No Natural Zero (e.g., Temperature in Fahrenheit)
    • Ratio: Interval with a Natural Zero (e.g., Height, Weight)
  11. Frequency Distribution

    • Organizing data into a frequency table
    • Cumulative Frequency
      • Example: Counting flower types
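A frequency table with cumulative frequency can be built in a few lines. The flower list below is made up; the cumulative column is a running total of the counts:

```python
from collections import Counter
from itertools import accumulate

flowers = ["rose", "lily", "rose", "tulip", "lily", "rose"]  # made-up data

freq = Counter(flowers)                    # frequency of each flower type
labels = sorted(freq)                      # fixed order for the table
counts = [freq[label] for label in labels]
cumulative = list(accumulate(counts))      # running total of the counts

print(f"{'type':<8}{'freq':>6}{'cum.':>6}")
for label, count, total in zip(labels, counts, cumulative):
    print(f"{label:<8}{count:>6}{total:>6}")
```

The last cumulative value always equals the total number of observations.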
  12. Bar Graph vs. Histogram

    • Bar Graph: Discrete variables
    • Histogram: Continuous variables
  13. Probability and Sampling Techniques

    • Determining the likelihood of events
    • Various Sampling Techniques
      • Simple Random Sampling
      • Stratified Sampling
      • Systematic Sampling
      • Convenience Sampling
  14. Measures of Central Tendency: Mean, Median, Mode

    • How to compute each measure and how outliers affect it
    • When to use each measure
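To see how an outlier affects each measure, the sketch below compares mean, median, and mode before and after adding one extreme value. The ages are invented for illustration:

```python
from statistics import mean, median, mode

ages = [21, 22, 22, 23, 24]        # made-up student ages
ages_with_outlier = ages + [90]    # one extreme value added

for label, data in [("original", ages), ("with outlier", ages_with_outlier)]:
    print(label, round(mean(data), 2), median(data), mode(data))
```

The mean moves from 22.4 to about 33.67, while the median only shifts from 22 to 22.5 and the mode stays at 22. This is why the median is usually preferred for skewed data or data with outliers.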
  15. Measures of Dispersion: Variance and Standard Deviation

    • Concept and importance of spread
    • Calculations involving Population and Sample variance
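Population variance divides the sum of squared deviations by n, while sample variance divides by n - 1. A small sketch with made-up data, computing both by hand and checking them against the standard `statistics` module:

```python
from statistics import pvariance, variance, pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

n = len(data)
mu = sum(data) / n
squared_devs = sum((x - mu) ** 2 for x in data)

pop_var = squared_devs / n            # population variance: divide by n
sample_var = squared_devs / (n - 1)   # sample variance: divide by n - 1

print(pop_var, pvariance(data))       # both give 4
print(sample_var, variance(data))     # both give 32/7, about 4.571
print(pstdev(data), stdev(data))      # standard deviation = sqrt(variance)
```

The sample variance is always slightly larger than the population variance for the same numbers, and the gap shrinks as n grows.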
  16. Percentiles and Quartiles for Outlier Detection

    • Examples of outlier detection
    • Calculation of Percentiles, Interquartile Range (IQR), and Fences
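A common outlier rule uses the fences Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, where IQR = Q3 - Q1. The sketch below uses made-up data and `statistics.quantiles` with the "inclusive" method; other quartile conventions exist and can give slightly different values:

```python
from statistics import quantiles

data = [10, 12, 13, 15, 16, 18, 19, 21, 50]  # made-up data with one extreme value

# Quartiles; the "inclusive" method is one of several conventions.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr)               # 13.0 19.0 6.0
print(lower_fence, upper_fence)  # 4.0 28.0
print(outliers)                  # [50]
```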
  17. Hypothesis Testing and Types of Errors

    • Type 1 and Type 2 Errors
    • One-Tailed and Two-Tailed Tests
  18. Confusion Matrix

    • True Positive, True Negative, False Positive, False Negative
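The four cells of a confusion matrix can be counted directly from actual and predicted labels. The labels below are made up, with 1 as the positive class. A false positive corresponds to a Type 1 error and a false negative to a Type 2 error:

```python
actual    = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up true labels (1 = positive)
predicted = [1, 0, 0, 1, 0, 1, 1, 0]  # made-up model predictions

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # true positives
tn = sum(a == 0 and p == 0 for a, p in pairs)  # true negatives
fp = sum(a == 0 and p == 1 for a, p in pairs)  # false positives (Type 1 error)
fn = sum(a == 1 and p == 0 for a, p in pairs)  # false negatives (Type 2 error)

print("            pred=1  pred=0")
print(f"actual=1    {tp:>6}  {fn:>6}")
print(f"actual=0    {fp:>6}  {tn:>6}")
```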
  19. Distributions and Tests with Examples

    • Binomial, Bernoulli, Poisson, and more
    • Central Limit Theorem
    • Practical examples and calculations
    • Z Test, T Test, Chi-Square Test, and ANOVA Test
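As one worked example, here is a minimal sketch of a one-sample, two-tailed Z test. It assumes the population standard deviation is known, and the function name and all numbers are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, pop_mean, pop_sd, n):
    """Two-tailed Z test for a mean with known population standard deviation."""
    z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))
    # Two-tailed p-value: probability of a result at least this extreme
    # in either direction if the null hypothesis is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Null hypothesis: the population mean is 100 (hypothetical numbers).
z, p_value = one_sample_z_test(sample_mean=103, pop_mean=100, pop_sd=15, n=100)
alpha = 0.05
print(round(z, 2), round(p_value, 4))
print("reject H0" if p_value < alpha else "fail to reject H0")
```

With these numbers z is 2.0 and the p-value is about 0.0455, so the null hypothesis is rejected at the 5% significance level but would not be at 1%.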
  20. Covariance and Correlation

    • Definitions and formulas
    • Real-world examples
    • Calculating covariance and correlation using Python
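A minimal sketch of sample covariance and Pearson correlation in plain Python. The helper names and the hours/scores data are made up for illustration; libraries such as NumPy or Pandas provide equivalents:

```python
from math import sqrt

def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson_correlation(xs, ys):
    """Covariance rescaled by both standard deviations; always in [-1, 1]."""
    var_x = sample_covariance(xs, xs)
    var_y = sample_covariance(ys, ys)
    return sample_covariance(xs, ys) / sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]         # made-up hours studied
scores = [50, 55, 65, 70, 80]   # made-up exam scores

cov = sample_covariance(hours, scores)
r = pearson_correlation(hours, scores)
print(cov, round(r, 4))
```

Covariance (18.75 here) depends on the units of both variables, while correlation (about 0.9934) is unit-free, which makes it easier to compare across pairs of variables.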
  21. Tools and Implementations in Python

    • Google Colab for Z Test, T Test, Chi-Square Test
    • Libraries: Pandas, Seaborn, NumPy

Notes and Observations

  • Important Definitions and Formulas: Keep track of essential formulas, definitions, and their practical usage.
  • Statistical Tests: Understand the context and when to use which type of tests (Z, T, Chi-Square, ANOVA).
  • Visualization Techniques: Learn different methods to visualize data like histograms, bar charts, box plots, QQ plots, etc.
  • Practical Coding Examples: Practicing code is essential to understand distributions, tests, and transformations.
  • Outlier Analysis: Important techniques for detecting and handling outliers using IQR and Standard Deviation.
  • Sampling Techniques: Understand types of sampling methods for better data collection.