Building a Machine Learning System to Predict Medical Insurance Costs
Introduction
- Presenter: Siddharthan
- Project Video: 11th project in machine learning series
- Objective: Build a system to predict medical insurance costs using machine learning
- Programming Language: Python
- Platform Used: Google Colaboratory (Colab)
- Data Source: Kaggle
Workflow Overview
- Problem Statement
- Workflow Explanation
- Data Collection
- Data Analysis
- Data Preprocessing
- Data Splitting
- Model Training
- Model Evaluation
- Building Predictive System
Problem Statement
- Task: Predict medical insurance cost for individuals using provided data.
- Dataset Requirements: Insurance cost data with parameters like health issues, gender, etc.
- Role: Data Scientist/Machine Learning Expert to build the predictive system.
Workflow Details
1. Data Collection
- Step: Collect insurance cost data.
- Data Includes: Age, sex, BMI, children, smoker status, region, insurance charges.
2. Data Analysis
- Purpose: Understand the data and extract meaningful insights from it.
- Steps:
- Analyze data structure.
- Use plots to visualize data.
3. Data Preprocessing
- Purpose: Prepare data for machine learning model.
- Steps:
- Handle missing values.
- Encode categorical features.
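The two preprocessing steps above can be sketched on a small hand-made table. The rows below are hypothetical stand-ins for the Kaggle data, and the integer codes mirror the mapping used later in these notes. Treat it as an illustration under those assumptions rather than the presenter's exact code.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from insurance.csv.
sample = pd.DataFrame({
    "age": [19, 33, 45],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northwest", "southeast"],
})

# Step 1: count missing values per column (all zeros for this sample).
missing_per_column = sample.isnull().sum()

# Step 2: map each category label to an integer code.
encoded = sample.assign(
    sex=sample["sex"].map({"male": 0, "female": 1}),
    smoker=sample["smoker"].map({"yes": 0, "no": 1}),
    region=sample["region"].map(
        {"southeast": 0, "southwest": 1, "northeast": 2, "northwest": 3}
    ),
)

print(missing_per_column)
print(encoded)
```

Using map on one column at a time leaves the original table untouched and turns any unexpected label into NaN, which makes typos in the category names easy to spot.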
4. Data Splitting
- Purpose: Split data into training and testing datasets.
- Steps:
- Use the train_test_split function from sklearn to split the data.
- Typical split: 80% training, 20% testing.
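To see what an 80/20 split does to the row counts, here is a minimal sketch on randomly generated data. The arrays are hypothetical (100 rows, 6 feature columns) and only stand in for the real features and charges.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 6 feature columns, one target value per row.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = rng.normal(size=100)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
# so the same rows land in each set on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

print(X_train.shape, X_test.shape)
```

With 100 rows this leaves 80 rows for training and 20 for testing.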
5. Model Training
- Model: Linear Regression
- Steps:
- Initialize the model.
- Train using training data (X_train, y_train).
6. Model Evaluation
- Purpose: Verify performance of the model.
- Steps:
- Predict on training and testing data.
- Calculate R-squared value for both sets to measure performance.
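R-squared compares the model's squared errors with those of a baseline that always predicts the mean: R² = 1 - SS_res / SS_tot. The sketch below computes it by hand and checks the result against sklearn's r2_score. The values are made up for illustration.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true values and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# sklearn's implementation should agree with the manual calculation.
r2_sklearn = metrics.r2_score(y_true, y_pred)

print(r2_manual, r2_sklearn)
```

A value close to 1 means the predictions explain most of the variation in the target. Similar scores on the training and testing sets suggest the model is not overfitting.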
7. Building Predictive System
- Objective: Predict insurance cost with new input data.
- Steps:
- Input data transformation.
- Use trained model to predict costs.
Detailed Implementation
Dependencies
- Libraries: numpy, pandas, matplotlib, seaborn, sklearn.
- Load in Python:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
Data Collection and Analysis
- Load Data
insurance_dataset = pd.read_csv('insurance.csv')
- Print First 5 Rows
insurance_dataset.head()
- Data Info
insurance_dataset.info()
- Check for Missing Values
insurance_dataset.isnull().sum()
- Statistical Measures
insurance_dataset.describe()
- Data Visualization
- Age Distribution
plt.figure(figsize=(6, 6))
sns.histplot(insurance_dataset['age'], kde=True)
plt.title('Age Distribution')
plt.show()
- Other Visualizations: Sex Distribution, BMI Distribution, Children Count, Smoker Count, Region Distribution, Charges Distribution
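For the categorical columns in that list (sex, smoker, region), a count plot is one reasonable choice. The sketch below assumes seaborn's countplot and a few hypothetical rows. The notes do not show which plot type was used for these columns.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical rows standing in for insurance_dataset.
sample = pd.DataFrame({"smoker": ["yes", "no", "no", "no", "yes", "no"]})

# One bar per category, with bar height equal to the number of rows.
plt.figure(figsize=(6, 6))
ax = sns.countplot(x="smoker", data=sample)
ax.set_title("Smoker Count")

# The same counts as numbers, useful for checking the plot.
smoker_counts = sample["smoker"].value_counts()
print(smoker_counts)
```

The numeric columns (bmi, charges) can reuse the histplot pattern shown for age.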
Data Preprocessing
- Encoding Categorical Features
# Replace each category label with an integer code, in place
insurance_dataset.replace({
    'sex': {'male': 0, 'female': 1},
    'smoker': {'yes': 0, 'no': 1},
    'region': {'southeast': 0, 'southwest': 1, 'northeast': 2, 'northwest': 3},
}, inplace=True)
Splitting Features and Target
- Features and Target Variables
X = insurance_dataset.drop(columns='charges')
y = insurance_dataset['charges']
- Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
Model Training
- Initialize and Train Model
regressor = LinearRegression()
regressor.fit(X_train, y_train)
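To make clear what fit learns, the sketch below trains LinearRegression on hypothetical noise-free data generated from a known formula. It then shows that a prediction is just the intercept plus the weighted sum of the features. The names X_demo, y_demo, and demo_model are illustrative and not part of the project code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated from y = 3*x1 + 2*x2 + 5.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + 5

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

# The fitted weights and intercept match the generating formula.
print(demo_model.coef_, demo_model.intercept_)

# A prediction is the intercept plus the weighted sum of the features.
x_new = np.array([[1.0, 2.0]])
manual = demo_model.intercept_ + x_new @ demo_model.coef_
print(manual, demo_model.predict(x_new))
```

On the insurance data the same attributes (regressor.coef_ and regressor.intercept_) show how much each encoded feature shifts the predicted charge.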
Model Evaluation
- Predict & Evaluate Training Data
train_data_prediction = regressor.predict(X_train)
r2_train = metrics.r2_score(y_train, train_data_prediction)
print(r2_train)
- Predict & Evaluate Testing Data
test_data_prediction = regressor.predict(X_test)
r2_test = metrics.r2_score(y_test, test_data_prediction)
print(r2_test)
Building Predictive System
- Prediction Example
# Sample input in feature order (age, sex, bmi, children, smoker, region),
# already encoded: sex 1 = female, smoker 1 = no, region 0 = southeast
input_data = (40, 1, 25.8, 2, 1, 0)
input_data_numpy = np.asarray(input_data)
input_data_reshaped = input_data_numpy.reshape(1, -1)
prediction = regressor.predict(input_data_reshaped)
print('The insurance cost is USD', prediction[0])
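The prediction step can also be wrapped in a small helper that accepts raw labels and applies the same integer codes used during preprocessing. The following is a sketch under stated assumptions: predict_cost, FEATURES, and the training rows are hypothetical, and the model is fitted on made-up numbers only so that the example runs on its own.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["age", "sex", "bmi", "children", "smoker", "region"]
SEX = {"male": 0, "female": 1}
SMOKER = {"yes": 0, "no": 1}
REGION = {"southeast": 0, "southwest": 1, "northeast": 2, "northwest": 3}

# Hypothetical, already-encoded training rows (not taken from insurance.csv).
train = pd.DataFrame(
    [
        [25, 0, 22.0, 0, 1, 0, 3000.0],
        [40, 1, 25.8, 2, 1, 0, 7000.0],
        [52, 0, 30.1, 1, 0, 2, 24000.0],
        [33, 1, 28.4, 3, 1, 3, 6000.0],
        [61, 1, 35.0, 0, 0, 1, 30000.0],
        [19, 0, 24.3, 0, 1, 1, 2000.0],
        [47, 1, 27.5, 2, 1, 2, 9000.0],
        [29, 0, 31.2, 1, 0, 3, 18000.0],
    ],
    columns=FEATURES + ["charges"],
)
demo_model = LinearRegression().fit(train[FEATURES], train["charges"])


def predict_cost(model, age, sex, bmi, children, smoker, region):
    """Encode raw inputs with the training codes, then predict one cost."""
    row = pd.DataFrame(
        [[age, SEX[sex], bmi, children, SMOKER[smoker], REGION[region]]],
        columns=FEATURES,  # same column names and order as in training
    )
    return float(model.predict(row)[0])


cost = predict_cost(demo_model, 40, "female", 25.8, 2, "no", "southeast")
print("The insurance cost is USD", round(cost, 2))
```

Passing a one-row DataFrame with the training column names keeps the feature order explicit and avoids the feature-name warning that scikit-learn emits when a model fitted on a DataFrame receives a bare NumPy array.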
Conclusion
- Summary: Successfully built and evaluated a linear regression model to predict medical insurance costs.
- Next Steps: Practice the code, try other regression models to see whether they improve accuracy, and engage with the community for feedback and questions.
Note: Review machine learning basics, model evaluation techniques, and Python data structures for deeper understanding. Recommended modules include Python basics and model training and evaluation techniques from the course.