Data Cleaning with Pandas

Overview

Focus on cleaning data using Pandas.
Tasks include:
- Dropping rows and columns
- Creating new columns/series
- Manipulating strings
- Changing data types
- Debugging issues

Data Source

Data pulled from ESPN's cricket information site.
Will analyze top cricket players with the highest batting averages in Test matches.
Cannot directly download as CSV, but can use Excel to import data from the web.

Importing Data to Excel

Open a new Excel tab.
Navigate to the "Data" tab and select "From Web".
Input the ESPN URL.
Choose the appropriate table (Table 0) and load it into the spreadsheet.
Save as CSV (e.g., cricket_test_match_data.csv).

Importing Data to Jupyter Notebook

Upload the CSV file to Jupyter Notebook.
Import Pandas:
```
import pandas as pd
```

Read CSV file:

df = pd.read_csv('path/to/cricket_test_match_data.csv')

Data Frame Overview

Rename columns for clarity using df.rename() method:
- Example:
  - No to Not Out
  - HS to Highest Innings Score
  - BF to Balls Faced
  - SR to Strike Rate

Handling Missing Values

Check for null values:
```
df.isnull().any()
```

Fill missing values with 0:

df['Balls Faced'].fillna(0, inplace=True)
df['Strike Rate'].fillna(0, inplace=True)

Dropping Duplicates

Check for duplicates:
```
df.duplicated()
```
Drop duplicates:
```
df = df.drop_duplicates()
```

Creating New Columns

Split the "Span" column into "Rookie Year" and "Final Year":

df['Rookie Year'] = df['Span'].str.split('-').str[0]
df['Final Year'] = df['Span'].str.split('-').str[1]

Drop the original "Span" column:
```
df = df.drop(columns=['Span'])
```

Extracting Player Names and Countries

Split the "Player" column:
- Assign names and countries to new columns.

Changing Data Types

Change data types using astype() method:

Convert columns to appropriate types (int, float):

df['Matches'] = df['Matches'].astype(int)
df['Batting Strike Rate'] = df['Batting Strike Rate'].astype(float)

Calculating Career Length

Create a new column for career length:

df['Career Length'] = df['Final Year'] - df['Rookie Year']

Analyzing Data

Calculate average career length:

avg_career_length = df['Career Length'].mean()

Calculate average batting strike rate for players with career lengths over 10 years:
```
avg_strike_rate = df[df['Career Length'] > 10]['Batting Strike Rate'].mean()
```

Count players who debuted before 1960:

count_before_1960 = df[df['Rookie Year'] < 1960].count()

Grouping Data

Find max highest innings score by country:

df.groupby('Country')['Highest Innings Score'].max()

Calculate averages for 50s and 100s by country.

Conclusion

The session covered essential data cleaning techniques using Pandas.
Recommended to explore further resources on Pandas and Python.

Data Cleaning Techniques in Pandas