L13

Sep 20, 2024

Lecture Notes on Correlation, Regression, and Data Fitting

Overview

  • Brief review of correlation and regression.
  • Discussion on fitting data, interpolation, and extrapolation.

Correlation

  • Bivariate Data: A scatter plot of the paired observations (x_i, y_i) gives a first visual estimate of the correlation between variables x and y.
  • Correlation Coefficient (r or ρ):
    • Defined as ( r = \frac{S_{xy}}{S_x \times S_y} ), where ( S_x ) and ( S_y ) are the sample standard deviations of x and y.
    • ( S_{xy} ) is the sample covariance, given by ( S_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1} ).
    • Bounded between -1 (perfect negative correlation) and 1 (perfect positive correlation).
    • Example: For x = y, r = +1; for x = -y, r = -1.
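    • If the data points show no linear trend, r is close to 0. Because r measures only linear association, a strong nonlinear relationship can still give r near 0.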
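
The definition above can be checked numerically. The following is a minimal sketch, assuming the n-1 (sample) convention used in the covariance formula; the function name `sample_correlation` is chosen here for illustration and is not from the lecture.

```python
import math

def sample_correlation(xs, ys):
    """Return r = S_xy / (S_x * S_y) using the n-1 (sample) convention."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample covariance S_xy
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    # Sample standard deviations S_x and S_y
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)

xs = [1, 2, 3, 4, 5]
r_pos = sample_correlation(xs, xs)                # y = x, so r should be +1
r_neg = sample_correlation(xs, [-x for x in xs])  # y = -x, so r should be -1
print(f"r_pos = {r_pos:.3f}, r_neg = {r_neg:.3f}")
```

The two calls reproduce the limiting cases from the notes: r = +1 for y = x and r = -1 for y = -x.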

Regression

  • Linear Regression: Estimating y given x or vice versa.
    • Objective: Approximate data points by a curve, often linear.
    • Equation: ( y = a + bx ).
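    • Finding ( a ) and ( b ) (least-squares fit of y on x):
      • The fitted line passes through the point of means: ( \bar{y} = a + b\bar{x} ), so ( a = \bar{y} - b\bar{x} ).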
      • ( b = \frac{S_{xy}}{S_x^2} = \rho \times \frac{S_y}{S_x} )
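    • Example calculation with data points (x, y): (1, 1), (2, 2), (3, 2), (4, 2), (5, 3).
      • Here ( \bar{x} = 3 ), ( \bar{y} = 2 ), ( S_{xy} = 4/4 = 1 ), and ( S_x^2 = 10/4 = 2.5 ).
      • So ( b = 1/2.5 = 0.4 ) and ( a = 2 - 0.4 \times 3 = 0.8 ), giving ( y = 0.8 + 0.4x ).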
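
The slope and intercept formulas can be applied directly to the five example points. This is a minimal sketch in plain Python; `fit_line` is an illustrative name, and the code assumes a least-squares fit of y on x with the n-1 convention from the notes.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x, with b = S_xy / S_x^2 and a = y_bar - b*x_bar."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_xx = sum((x - x_bar) ** 2 for x in xs) / (n - 1)  # S_x^2, the sample variance of x
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

# Data points from the lecture example
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 2, 2, 3]
a, b = fit_line(xs, ys)
print(f"a = {a:.2f}, b = {b:.2f}")  # prints "a = 0.80, b = 0.40"
```

For these points the fitted line is y = 0.8 + 0.4x.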

Goodness of Fit

  • R-squared (R²): Measure of goodness of fit.
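    • Formula: ( R^2 = 1 - \frac{\sum e_i^2}{\sum (y_i - \bar{y})^2} ), where ( e_i = y_i - (a + b x_i) ) is the residual of point i.
    • R² is the fraction of the variance in y that the fitted line explains; for simple linear regression, R² equals r².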
    • Perfect fit: R² = 1; Poor fit: R² close to 0.
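
As an illustration of the R² formula, the sketch below evaluates it for the five example points from the Regression section, using a = 0.8 and b = 0.4, which are the least-squares values for those points. The function name `r_squared` is an assumption made for this example.

```python
def r_squared(xs, ys, a, b):
    """R^2 = 1 - sum(e_i^2) / sum((y_i - y_bar)^2) for the line y = a + b*x."""
    y_bar = sum(ys) / len(ys)
    # Residual sum of squares: e_i = y_i - (a + b*x_i)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Total sum of squares about the mean of y
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("all y values are equal, so R^2 is undefined")
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5]
ys = [1, 2, 2, 2, 3]
r2 = r_squared(xs, ys, a=0.8, b=0.4)
print(f"R^2 = {r2:.2f}")  # prints "R^2 = 0.80"
```

Here the residuals sum to 0.4 in squared form and the total sum of squares is 2, so R² = 1 - 0.4/2 = 0.8.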

Interpolation and Extrapolation

  • Interpolation: Estimating a value within the range of known data points.
    • Example: Estimating protein concentration using a standard curve in biology.
  • Extrapolation: Estimating a value outside the known data range.
    • Example: Measuring ligand-receptor binding strength via extrapolation of force measurements.
    • Challenges: Extrapolation assumes the fitted curve keeps the same form beyond the measured range; if that assumption is wrong, the estimate can be badly off.
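
The distinction can be made explicit in code by checking whether a query point lies inside the range of the known x values. This is a minimal sketch, assuming a fitted line y = a + bx; it reuses the line y = 0.8 + 0.4x and the x values 1 to 5 from the regression example rather than real biological data, and the helper name `predict` is hypothetical.

```python
def predict(a, b, x, x_known):
    """Return (y_hat, kind), where kind says whether x is inside the known data range."""
    inside = min(x_known) <= x <= max(x_known)
    kind = "interpolation" if inside else "extrapolation"
    return a + b * x, kind

x_known = [1, 2, 3, 4, 5]  # x values from the regression example
a, b = 0.8, 0.4            # least-squares line for that example

y_in, kind_in = predict(a, b, 2.5, x_known)     # inside [1, 5]
y_out, kind_out = predict(a, b, 10.0, x_known)  # outside [1, 5]
print(f"x = 2.5  -> y = {y_in:.1f} ({kind_in})")
print(f"x = 10.0 -> y = {y_out:.1f} ({kind_out})")
```

Both predictions come from the same formula, but only the first is supported by nearby data; the second relies entirely on the assumption that the linear trend continues.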

Biological Example

  • Interpolation used in protein concentration estimation for Western blot analysis.
  • Extrapolation used for estimating binding strength in molecular interactions.
    • Example: Varying ligand concentrations and measuring unbinding forces.

Conclusion

  • Linear regression gives a simple least-squares fit when the relationship between x and y is approximately linear.
  • Interpolation is generally reliable; extrapolation can be error-prone if assumptions about data trends are incorrect.
  • Choosing an appropriate function for the curve fit is important: a poorly chosen form weakens interpolation and, even more so, extrapolation.