Simple Linear Regression Example in Python

Jul 11, 2024

Introduction

  • Tools needed: scikit-learn, pandas, Quandl
  • Installation: pip install scikit-learn, pip install quandl, pip install pandas

Basic Concepts

  • Regression: Used to model continuous data and find the best-fit line.
  • Equation of Line: y = mx + b, where the slope m and the intercept b are determined from the data.
  • Application Example: Stock prices.
  • Data Type: Continuous data (e.g., stock prices over months).
  • Difference from Classification: Classification assigns discrete labels (categories) to data points instead of predicting a continuous value.
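
As a rough illustration of how m and b in y = mx + b can be determined, here is a minimal least-squares sketch in plain Python. The function name best_fit_line and the sample points are made up for this example; scikit-learn performs the equivalent fit internally.

```python
from statistics import mean

def best_fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope: co-variation of x and y divided by the variation of x.
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    m = numerator / denominator
    # Intercept: the line must pass through the point of means.
    b = y_bar - m * x_bar
    return m, b

# Made-up points that lie exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m, b = best_fit_line(xs, ys)
print(m, b)
```

With real stock data the points will not sit on a line, so the fit only gives the line with the smallest squared error.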

Features and Labels

  • Supervised Machine Learning: Involves features (attributes) and labels (outcomes).
  • Meaningful Features: Features should have a plausible relationship to the label; columns that carry no useful information make the model less effective.
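
A small sketch of what the feature/label split can look like in pandas. The DataFrame and its column names (open, volume, close) are invented for illustration and are not the tutorial's dataset.

```python
import pandas as pd

# Made-up rows: two feature columns and one label column (hypothetical names).
data = pd.DataFrame({
    "open": [10.0, 10.5, 11.0],
    "volume": [1000, 1200, 900],
    "close": [10.4, 10.9, 11.3],
})

X = data.drop(columns=["close"])  # features: the inputs the model sees
y = data["close"]                 # label: the outcome the model should predict

print(X.shape, y.shape)
```

By convention, X holds one row per sample and one column per feature, and y holds one label per row.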

Implementation Steps

Import Required Libraries

import pandas as pd
import quandl

Fetching Data

df = quandl.get("WIKI/GOOGL")
print(df.head())
  • Quandl Dataset: Use the wiki dataset for Google stock (WIKI/GOOGL).
  • Account: Not mandatory, but a free account (API key) allows more requests.

Understanding Features

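  • Columns: Open, High, Low, Close, Volume, plus adjusted versions of each (Adj. Open, Adj. High, Adj. Low, Adj. Close, Adj. Volume).
  • Adjusted: Corrected for events such as stock splits, so values stay comparable over time.
  • Feature Relationships: Relationships between columns can say more than the raw values (e.g., the High-Low spread as a measure of volatility).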
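
To see why adjusted prices matter, consider a hypothetical 2-for-1 split. The numbers below are invented, and the adjustment shown (dividing the pre-split price by the split ratio) is a simplified sketch of the idea rather than the exact procedure used to build the dataset.

```python
# Hypothetical 2-for-1 split between day 1 and day 2 (all numbers made up).
split_ratio = 2.0
raw_close_day1 = 100.0  # closing price before the split
raw_close_day2 = 51.0   # closing price after the split

# Using raw prices, the split looks like a crash of roughly 49%.
raw_change = (raw_close_day2 - raw_close_day1) / raw_close_day1 * 100.0

# Adjusting the pre-split price puts both days on the same scale.
adj_close_day1 = raw_close_day1 / split_ratio
adj_change = (raw_close_day2 - adj_close_day1) / adj_close_day1 * 100.0

print(raw_change, adj_change)
```

On the adjusted scale the same two days show a gain of about 2%, which is the movement a model should actually learn from.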

Selecting Meaningful Features

columns = ['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']
df = df[columns]
  • Selected Columns: Adjusted Open, High, Low, Close, and Volume.

Creating New Features

df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Low'] * 100.0
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0
  • HL_PCT: Measures daily volatility: (High - Low) / Low * 100.
  • PCT_change: Measures daily movement: (Close - Open) / Open * 100.
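
The two formulas can be checked by hand on a single day of made-up prices. The helper names hl_pct and pct_change below are hypothetical and exist only for this worked example.

```python
def hl_pct(high, low):
    """Daily high-low range as a percentage of the low."""
    return (high - low) / low * 100.0

def pct_change(open_price, close_price):
    """Daily open-to-close move as a percentage of the open."""
    return (close_price - open_price) / open_price * 100.0

# One invented trading day
day_open, day_high, day_low, day_close = 101.0, 105.0, 100.0, 103.0

volatility = hl_pct(day_high, day_low)       # (105 - 100) / 100 * 100
movement = pct_change(day_open, day_close)   # (103 - 101) / 101 * 100

print(round(volatility, 2), round(movement, 2))
```

Note the parentheses around the subtraction: without them, the division is evaluated first and the result is no longer a percentage.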

Final Dataframe

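import pandas as pd
import quandl

df = quandl.get("WIKI/GOOGL")
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]

# daily high-low range as a percentage of the low
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Low'] * 100.0
# daily percent change - you can use a different logic
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0

# keep only the columns used for modeling
final_df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
print(final_df.head())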

Considerations for Next Steps

  • Features vs. Labels: Features help predict labels.
  • Future Prediction: Decide if Adj. Close will be a feature or a label.
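
One possible way to treat Adj. Close as a label is to pair each row with the closing price a few days later. This is only a sketch under assumptions: the prices are made up, and the names forecast_out and label are hypothetical choices for this example.

```python
import pandas as pd

# Made-up closing prices for five consecutive days.
prices = pd.DataFrame({"Adj. Close": [100.0, 101.0, 103.0, 102.0, 105.0]})

forecast_out = 2  # hypothetical horizon: predict the close two rows ahead

# shift(-n) moves future values up, so each row's label is the close n rows later.
prices["label"] = prices["Adj. Close"].shift(-forecast_out)

# The last n rows have no future value yet, so they cannot be used for training.
labeled = prices.dropna()

print(labeled)
```

In this arrangement the current Adj. Close stays available as a feature, while the shifted copy serves as the label.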

Conclusion

  • Think about the relationship between features and labels.
  • Next steps involve making predictions with the clean data.
  • For questions or comments, further video instructions are available.