Understanding Multivariate Normal Distribution

Aug 10, 2024

Multivariate Normal Distribution Tutorial

Overview

  • This is the first tutorial on the multivariate normal distribution and clustering algorithms.
  • The multivariate normal distribution is fully determined by two parameters: the mean vector mu and the covariance matrix sigma.

Key Concepts

Probability Density Function

  • The probability density at a point x is written as:
    $P(X \mid \text{params})$
    where the parameters are
    • mu, the mean vector, which lives in k-dimensional space
    • sigma, the covariance matrix, which has size k x k
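
For reference, the standard textbook form of this density for a k-dimensional point x is written out below. The symbols $\mu$ and $\Sigma$ stand for mu and sigma above; the notes themselves do not spell the formula out, so treat this as the usual definition rather than a quotation from the tutorial.

$$p(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^k \, |\Sigma|}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$

Here $|\Sigma|$ is the determinant of the covariance matrix and $\Sigma^{-1}$ is its inverse.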

Covariance

  • The sample covariance between two variables X and Y, estimated from n paired observations, is: $$Cov(X,Y) = \frac{\sum_{i=1}^{n}(x_i - E[X])(y_i - E[Y])}{n - 1}$$
    where E[X] and E[Y] are the expected values, estimated in practice by the sample means.
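
A small sketch of the sample covariance, assuming NumPy and two made-up arrays `x` and `y` (the values are illustrative only). It computes the sum of products of deviations divided by n - 1 and compares the result with `np.cov`, which uses the same n - 1 normalization by default.

```python
import numpy as np

# Two small illustrative samples (made-up values, not from the tutorial).
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

n = x.shape[0]

# Sample covariance: sum of products of deviations from the means,
# divided by n - 1.
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the full 2 x 2 covariance matrix;
# entry [0, 1] is the covariance between x and y.
cov_np = np.cov(x, y)[0, 1]

print(cov_xy, cov_np)
```

Both values should agree up to floating-point rounding.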

Implementation Steps

  1. Import Libraries

    • Import numpy and matplotlib for calculations and plotting.
  2. Generate Toy Dataset

    • Create a multivariate normal distribution with:
      • Mean mu = (0, 0)
      • Covariance sigma = [[1, 0.5], [0.5, 1]]
    • Generate 100 data points.
    • Check the shape of x; it should be (100, 2).
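
Steps 1 and 2 might look like the following sketch. The seeded generator `rng` and the names `mu_true` and `sigma_true` are assumptions made for this example; the notes do not say how the samples were drawn.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed is an assumption).
rng = np.random.default_rng(0)

# Parameters of the toy distribution, as listed in the notes.
mu_true = np.array([0.0, 0.0])
sigma_true = np.array([[1.0, 0.5],
                       [0.5, 1.0]])

# Draw 100 samples; each row of x is one 2-D data point.
x = rng.multivariate_normal(mu_true, sigma_true, size=100)

print(x.shape)  # (100, 2)
```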
  3. Define Class

    • Create a class MultivariateNormal.
    • Initialize parameters mu and sigma to None.
  4. Fit Method

    • Define fit(self, x) method to calculate mu and sigma.
    • Reshape x into column vectors if the later matrix operations need that layout.
    • Set mu to the mean of x taken over the sample axis.
    • Calculate covariance matrix sigma:
      • Subtract mu from x.
      • Use numpy functions to compute the covariance matrix from the centered data, normalizing by n - 1.
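
Steps 3 and 4 can be sketched together as below. This version assumes x is stored with one sample per row, shape (n, k), which is a layout choice for this example; the tutorial may instead work with column vectors. The toy data at the bottom repeats the step 2 setup so the block runs on its own.

```python
import numpy as np


class MultivariateNormal:
    """Minimal sketch of the class outlined in steps 3 and 4."""

    def __init__(self):
        # Parameters are unknown until fit() is called.
        self.mu = None
        self.sigma = None

    def fit(self, x):
        # Assumed layout: one sample per row, shape (n, k).
        x = np.asarray(x, dtype=float)
        n = x.shape[0]

        # Mean over the sample axis gives a vector of length k.
        self.mu = x.mean(axis=0)

        # Center the data, then form the (k, k) covariance matrix,
        # normalized by n - 1.
        centered = x - self.mu
        self.sigma = centered.T @ centered / (n - 1)
        return self


# Toy data as in step 2 (the seed is an assumption).
rng = np.random.default_rng(0)
x = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=100)

model = MultivariateNormal().fit(x)
print(model.mu)
print(model.sigma)
```

The fitted `sigma` should match `np.cov(x, rowvar=False)`, which is a convenient cross-check.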
  5. Probability Calculation

    • Define a prob(self, x) method that evaluates the probability density at x:
      • Compute the normalization factor and the exponent of the probability density function from mu and sigma.
      • Use numpy.einsum to vectorize the computation over all points.
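
One way the density evaluation might be written is sketched below as a standalone function. The name `mvn_pdf` and the row-per-point layout are assumptions for this example; in the tutorial the same logic would sit inside `prob(self, x)` and read `self.mu` and `self.sigma`. Here `np.einsum` evaluates the quadratic form (x - mu)^T sigma^{-1} (x - mu) for every row in one call.

```python
import numpy as np


def mvn_pdf(x, mu, sigma):
    """Evaluate the multivariate normal density at each row of x."""
    # Accept a single point or an (n, k) array of points.
    x = np.atleast_2d(np.asarray(x, dtype=float))
    k = mu.shape[0]

    diff = x - mu
    sigma_inv = np.linalg.inv(sigma)

    # Quadratic form (x - mu)^T sigma^{-1} (x - mu), one value per row.
    quad = np.einsum("ni,ij,nj->n", diff, sigma_inv, diff)

    # Normalization factor 1 / sqrt((2 pi)^k det(sigma)).
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** k * np.linalg.det(sigma))
    return norm * np.exp(-0.5 * quad)


mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])

# The density is largest at the mean, where the exponent is zero.
density_at_mean = mvn_pdf(mu, mu, sigma)
print(density_at_mean)
```

Inverting sigma directly is fine for a small, well-conditioned matrix like this one; for larger or nearly singular matrices a solver-based approach is usually preferred.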
  6. Plotting the Distribution

    • Generate a mesh grid of points at which to evaluate the density.
    • Flatten the grid into a list of points for prob, then reshape the resulting densities back to the grid shape for plotting.
    • Create contour plots of the probability density function.
    • Overlay the original data points on the contour plot.
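
The plotting step might be sketched as follows. The grid range of -4 to 4, the 100 x 100 resolution, the helper name `mvn_pdf`, and the styling choices are all assumptions for this example. The fit uses `np.cov` as a shortcut for the n - 1 normalized covariance so that the block runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt


def mvn_pdf(points, mu, sigma):
    """Multivariate normal density at each row of points."""
    k = mu.shape[0]
    diff = points - mu
    # Quadratic form for every grid point at once.
    quad = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(sigma), diff)
    norm = np.sqrt((2.0 * np.pi) ** k * np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm


# Toy data as in step 2 (the seed is an assumption).
rng = np.random.default_rng(0)
x = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=100)

# Fit: sample mean and covariance normalized by n - 1.
mu = x.mean(axis=0)
sigma = np.cov(x, rowvar=False)

# Mesh grid covering the region where the data lies.
gx, gy = np.meshgrid(np.linspace(-4, 4, 100), np.linspace(-4, 4, 100))

# Flatten the grid to an (n_points, 2) array, evaluate the density,
# then reshape back to the grid shape for contour plotting.
grid = np.column_stack([gx.ravel(), gy.ravel()])
density = mvn_pdf(grid, mu, sigma).reshape(gx.shape)

fig, ax = plt.subplots()
ax.contourf(gx, gy, density, levels=20)
# Overlay the original samples on top of the density contours.
ax.scatter(x[:, 0], x[:, 1], s=10, color="white", edgecolors="black")
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_title("Fitted density with data points")
```

Call `plt.show()` to display the figure interactively, or `fig.savefig(...)` to write it to a file.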

Conclusion

  • The tutorial implements a procedure for fitting a multivariate Gaussian: estimating mu and sigma from data and evaluating the resulting density.
  • The contour plot shows that the fitted density matches the generated data well.
  • Questions and comments are welcome for further clarification.