Data Cleaning Techniques in Pandas

Aug 29, 2024

Data Cleaning with Pandas

Overview

  • Focus on cleaning data using Pandas.
  • Tasks include:
    • Dropping rows and columns
    • Creating new columns/series
    • Manipulating strings
    • Changing data types
    • Debugging issues

Data Source

  • Data pulled from ESPN's cricket information site.
  • Will analyze top cricket players with the highest batting averages in Test matches.
  • Cannot directly download as CSV, but can use Excel to import data from the web.

Importing Data to Excel

  1. Open a new Excel tab.
  2. Navigate to the "Data" tab and select "From Web".
  3. Input the ESPN URL.
  4. Choose the appropriate table (Table 0) and load it into the spreadsheet.
  5. Save as CSV (e.g., cricket_test_match_data.csv).

Importing Data to Jupyter Notebook

  • Upload the CSV file to Jupyter Notebook.
  • Import Pandas:
    import pandas as pd
    
  • Read CSV file:
    df = pd.read_csv('path/to/cricket_test_match_data.csv')
    

Data Frame Overview

  • Rename columns for clarity using df.rename() method:
    • Example:
      • No to Not Out
      • HS to Highest Innings Score
      • BF to Balls Faced
      • SR to Strike Rate

Handling Missing Values

  • Check for null values:
    df.isnull().any()
    
  • Fill missing values with 0:
    df['Balls Faced'].fillna(0, inplace=True)
    df['Strike Rate'].fillna(0, inplace=True)
    

Dropping Duplicates

  • Check for duplicates:
    df.duplicated()
    
  • Drop duplicates:
    df = df.drop_duplicates()
    

Creating New Columns

  • Split the "Span" column into "Rookie Year" and "Final Year":
    df['Rookie Year'] = df['Span'].str.split('-').str[0]
    df['Final Year'] = df['Span'].str.split('-').str[1]
    
  • Drop the original "Span" column:
    df = df.drop(columns=['Span'])
    

Extracting Player Names and Countries

  • Split the "Player" column:
    • Assign names and countries to new columns.

Changing Data Types

  • Change data types using astype() method:
    • Convert columns to appropriate types (int, float):
    df['Matches'] = df['Matches'].astype(int)
    df['Batting Strike Rate'] = df['Batting Strike Rate'].astype(float)
    

Calculating Career Length

  • Create a new column for career length:
    df['Career Length'] = df['Final Year'] - df['Rookie Year']
    

Analyzing Data

  • Calculate average career length:
    avg_career_length = df['Career Length'].mean()
    
  • Calculate average batting strike rate for players with career lengths over 10 years:
    avg_strike_rate = df[df['Career Length'] > 10]['Batting Strike Rate'].mean()
    
  • Count players who debuted before 1960:
    count_before_1960 = df[df['Rookie Year'] < 1960].count()
    

Grouping Data

  • Find max highest innings score by country:
    df.groupby('Country')['Highest Innings Score'].max()
    
  • Calculate averages for 50s and 100s by country.

Conclusion

  • The session covered essential data cleaning techniques using Pandas.
  • Recommended to explore further resources on Pandas and Python.