Full Data Science Process

May 30, 2024

Introduction

  • Objective: Walk through the full data science process by completing an end-to-end data analysis.
  • Focus Areas:
    • Exploration
    • Pre-processing
    • Feature Engineering
    • Model Training
    • Model Evaluation
    • Advanced Topics (Hyperparameter tuning)
  • Audience: Beginners in data science and machine learning, with some advanced concepts introduced.

Prerequisites

  1. IPython Notebook Environment: Jupyter Notebook, Jupyter Lab, or IPython notebooks in VS Code/PyCharm.
  2. Basic Data Science Libraries:
    • numpy
    • pandas
    • matplotlib
    • seaborn
    • scikit-learn

Dataset

  • Source: California Housing Prices dataset from Kaggle.
  • Features: longitude and latitude coordinates, housing median age, total rooms, total bedrooms, population, households, median income, and ocean proximity; median house value is the target variable.
  • Task: Regression to predict the median house value from the other features.

Steps

1. Setup Environment

  • Ensure you have an IPython notebook environment set up.
  • Install necessary libraries using pip.
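  • For example, all of the prerequisite libraries can be installed with a single pip command (run in a terminal, or in a notebook cell prefixed with !):
    pip install numpy pandas matplotlib seaborn scikit-learn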

2. Load and Explore Data

  • Load Dataset:
    import pandas as pd
    data = pd.read_csv('housing.csv')
    data.head()
    
  • Basic Exploration:
    data.info()
    data.describe()
    
  • Handle Missing Values:
    data.dropna(inplace=True)
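
  • Inspect Missing Values (optional): before dropping rows, it can help to check how much data is actually missing; in this dataset the gaps are typically confined to total_bedrooms.
    data.isnull().sum()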
    

3. Split Data

  • Define Features (X) and Target (y):
    X = data.drop('median_house_value', axis=1)
    y = data['median_house_value']
    
  • Train-Test Split:
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
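
  • Sanity Check (optional): a quick look at the shapes confirms the 80/20 split, and random_state=42 keeps the split reproducible across runs.
    X_train.shape, X_test.shape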
    

4. Data Pre-processing

  • Log Transformation for Skewed Features:
    import numpy as np
    # log1p compresses the long right tails of these count-like features;
    # remember to apply the same transformation to X_test before evaluation
    X_train['total_rooms'] = np.log1p(X_train['total_rooms'])
    X_train['total_bedrooms'] = np.log1p(X_train['total_bedrooms'])
    X_train['population'] = np.log1p(X_train['population'])
    X_train['households'] = np.log1p(X_train['households'])
    
  • One-Hot Encoding for Categorical Features:
    X_train = pd.get_dummies(X_train, columns=['ocean_proximity'])
    X_test = pd.get_dummies(X_test, columns=['ocean_proximity'])
    
    Ensure both training and testing data end up with the same dummy columns (see the alignment sketch at the end of this step).
  • Combine Updated Training Data:
    train_data = X_train.join(y_train)
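
  • Aligning Dummy Columns (sketch): if a category of ocean_proximity happens to be missing from the test split, get_dummies will produce fewer columns there. One minimal way to keep both frames consistent is to reindex X_test against the training columns:
    X_test = X_test.reindex(columns=X_train.columns, fill_value=0)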
    

5. Feature Engineering

  • Create New Features (the test set will need the same columns later; see the note after this step):
    train_data['bedroom_ratio'] = train_data['total_bedrooms'] / train_data['total_rooms']
    train_data['household_rooms'] = train_data['total_rooms'] / train_data['households']
    
  • Correlation Analysis:
    import seaborn as sns
    sns.heatmap(train_data.corr(), annot=True)
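
  • Test-Set Features (sketch): the new ratios were added to train_data only. Assuming they should also be used for modelling, X_train can be rebuilt from train_data and the same columns added to X_test before evaluation:
    X_train = train_data.drop('median_house_value', axis=1)
    y_train = train_data['median_house_value']
    X_test['bedroom_ratio'] = X_test['total_bedrooms'] / X_test['total_rooms']
    X_test['household_rooms'] = X_test['total_rooms'] / X_test['households']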
    

6. Model Training

  • Linear Regression:
    from sklearn.linear_model import LinearRegression
    model = LinearRegression()
    model.fit(X_train, y_train)
    model.score(X_test, y_test)  # R² on the test set (~0.668, see Results)
    
  • Random Forest Regressor:
    from sklearn.ensemble import RandomForestRegressor
    forest = RandomForestRegressor()
    forest.fit(X_train, y_train)
    forest.score(X_test, y_test)
    

7. Model Evaluation

  • Evaluate both models on the held-out test set; for regressors, .score() returns R², and other metrics can be computed as in the sketch below.
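  • Example (sketch): a minimal evaluation loop, assuming X_test has received the same pre-processing and engineered features as the training data; mean_squared_error and r2_score come from sklearn.metrics.
    import numpy as np
    from sklearn.metrics import mean_squared_error, r2_score

    for name, m in [('Linear Regression', model), ('Random Forest', forest)]:
        preds = m.predict(X_test)
        print(name, 'R2:', r2_score(y_test, preds),
              'RMSE:', np.sqrt(mean_squared_error(y_test, preds)))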

8. Hyperparameter Tuning

  • Grid Search CV:
    from sklearn.model_selection import GridSearchCV
    param_grid = {
        'n_estimators': [100, 200, 300],
        'max_features': ['sqrt', 'log2', None],  # 'auto' has been removed from recent scikit-learn releases
        'min_samples_split': [2, 5, 10],
        'max_depth': [None, 10, 20, 30]
    }
    grid_search = GridSearchCV(estimator=forest, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
    grid_search.fit(X_train, y_train)
    best_forest = grid_search.best_estimator_
    best_forest.score(X_test, y_test)
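
  • Inspecting the Search: the settings the grid search selected and its cross-validated score can be read off the fitted object (the score is negative because scoring='neg_mean_squared_error').
    grid_search.best_params_
    grid_search.best_score_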
    

9. Results

  • Linear Regression Score: ~0.668
  • Random Forest Regressor Score: ~0.81 (default settings)
  • After Hyperparameter Tuning: Varied Results

Conclusion

  • Explored the data, pre-processed it, engineered new features, trained multiple models, and performed hyperparameter tuning.
  • Linear models are simpler but may not perform as well as ensemble methods.
  • Hyperparameter tuning can help optimize models, though results may vary.

Additional Resources

  • Check other videos on Jupyter Notebooks, Jupyter Lab, and Google Colab from the channel.

End of the Lecture.