Full Data Science Process

May 30, 2024

Introduction

  • Objective: Walk through the full data science process by completing an end-to-end data analysis.
  • Focus Areas:
    • Exploration
    • Pre-processing
    • Feature Engineering
    • Model Training
    • Model Evaluation
    • Advanced Topics (Hyperparameter tuning)
  • Audience: Beginners in data science and machine learning, with some advanced concepts introduced.

Prerequisites

  1. IPython Notebook Environment: Jupyter Notebook, Jupyter Lab, or IPython notebooks in VS Code/PyCharm.
  2. Basic Data Science Libraries:
    • numpy
    • pandas
    • matplotlib
    • seaborn
    • scikit-learn

Dataset

  • Source: California Housing Prices dataset from Kaggle.
  • Features: longitude and latitude coordinates, housing median age, total rooms, total bedrooms, population, households, median income, and ocean proximity; median house value is the target variable.
  • Task: Regression to predict the median house value from the other features.

Steps

1. Setup Environment

  • Ensure you have an IPython notebook environment set up.
  • Install necessary libraries using pip.
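  • For example, all of the prerequisite libraries can be installed with a single pip command (run in a terminal, or in a notebook cell prefixed with !):
    pip install numpy pandas matplotlib seaborn scikit-learn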

2. Load and Explore Data

  • Load Dataset:
    import pandas as pd
    data = pd.read_csv('housing.csv')
    data.head()
    
  • Basic Exploration:
    data.info()
    data.describe()
    
  • Handle Missing Values:
    data.dropna(inplace=True)
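
  • Inspect Missing Values (optional): before dropping rows, it can help to check how much data is actually missing; in this dataset the gaps are typically confined to total_bedrooms.
    data.isnull().sum()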
    

3. Split Data

  • Define Features (X) and Target (y):
    X = data.drop('median_house_value', axis=1)
    y = data['median_house_value']
    
  • Train-Test Split:
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
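
  • Sanity Check (optional): a quick look at the shapes confirms the 80/20 split, and random_state=42 keeps the split reproducible across runs.
    X_train.shape, X_test.shape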
    

4. Data Pre-processing

  • Log Transformation for Skewed Features:
    import numpy as np
    # log1p compresses the long right tails of these count-like features;
    # remember to apply the same transformation to X_test before evaluation
    X_train['total_rooms'] = np.log1p(X_train['total_rooms'])
    X_train['total_bedrooms'] = np.log1p(X_train['total_bedrooms'])
    X_train['population'] = np.log1p(X_train['population'])
    X_train['households'] = np.log1p(X_train['households'])
    
  • One-Hot Encoding for Categorical Features:
    X_train = pd.get_dummies(X_train, columns=['ocean_proximity'])
    X_test = pd.get_dummies(X_test, columns=['ocean_proximity'])
    
    Ensure both training and testing data end up with the same dummy columns (see the alignment sketch at the end of this step).
  • Combine Updated Training Data:
    train_data = X_train.join(y_train)
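
  • Aligning Dummy Columns (sketch): if a category of ocean_proximity happens to be missing from the test split, get_dummies will produce fewer columns there. One minimal way to keep both frames consistent is to reindex X_test against the training columns:
    X_test = X_test.reindex(columns=X_train.columns, fill_value=0)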
    

5. Feature Engineering

  • Create New Features (the test set will need the same columns later; see the note after this step):
    train_data['bedroom_ratio'] = train_data['total_bedrooms'] / train_data['total_rooms']
    train_data['household_rooms'] = train_data['total_rooms'] / train_data['households']
    
  • Correlation Analysis:
    import seaborn as sns
    sns.heatmap(train_data.corr(), annot=True)
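
  • Test-Set Features (sketch): the new ratios were added to train_data only. Assuming they should also be used for modelling, X_train can be rebuilt from train_data and the same columns added to X_test before evaluation:
    X_train = train_data.drop('median_house_value', axis=1)
    y_train = train_data['median_house_value']
    X_test['bedroom_ratio'] = X_test['total_bedrooms'] / X_test['total_rooms']
    X_test['household_rooms'] = X_test['total_rooms'] / X_test['households']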
    

6. Model Training

  • Linear Regression:
    from sklearn.linear_model import LinearRegression
    model = LinearRegression()
    model.fit(X_train, y_train)
    model.score(X_test, y_test)  # R² on the test set (~0.668, see Results)
    
  • Random Forest Regressor:
    from sklearn.ensemble import RandomForestRegressor
    forest = RandomForestRegressor()
    forest.fit(X_train, y_train)
    forest.score(X_test, y_test)
    

7. Model Evaluation

  • Evaluate both models on the held-out test set; for regressors, .score() returns R², and other metrics can be computed as in the sketch below.
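  • Example (sketch): a minimal evaluation loop, assuming X_test has received the same pre-processing and engineered features as the training data; mean_squared_error and r2_score come from sklearn.metrics.
    import numpy as np
    from sklearn.metrics import mean_squared_error, r2_score

    for name, m in [('Linear Regression', model), ('Random Forest', forest)]:
        preds = m.predict(X_test)
        print(name, 'R2:', r2_score(y_test, preds),
              'RMSE:', np.sqrt(mean_squared_error(y_test, preds)))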

8. Hyperparameter Tuning

  • Grid Search CV:
    from sklearn.model_selection import GridSearchCV
    param_grid = {
        'n_estimators': [100, 200, 300],
        'max_features': ['sqrt', 'log2', None],  # 'auto' has been removed from recent scikit-learn releases
        'min_samples_split': [2, 5, 10],
        'max_depth': [None, 10, 20, 30]
    }
    grid_search = GridSearchCV(estimator=forest, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
    grid_search.fit(X_train, y_train)
    best_forest = grid_search.best_estimator_
    best_forest.score(X_test, y_test)
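
  • Inspecting the Search: the settings the grid search selected and its cross-validated score can be read off the fitted object (the score is negative because scoring='neg_mean_squared_error').
    grid_search.best_params_
    grid_search.best_score_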
    

9. Results

  • Linear Regression Score: ~0.668
  • Random Forest Regressor Score: ~0.81 (default settings)
  • After Hyperparameter Tuning: Varied Results

Conclusion

  • Explored the data, pre-processed it, engineered new features, trained multiple models, and performed hyperparameter tuning.
  • Linear models are simpler but may not perform as well as ensemble methods.
  • Hyperparameter tuning can help optimize models, though results may vary.

Additional Resources

  • Check other videos on Jupyter Notebooks, Jupyter Lab, and Google Colab from the channel.

End of the Lecture.