Complete Python Pandas Tutorial Walkthrough

Jul 15, 2024

Introduction

  • Purpose: An updated pandas tutorial, five years after the original, covering new features, tools, and lessons learned since.
  • Audience: Suitable for both beginners and experienced pandas users.
  • Goal: Provide concrete action items for applying pandas skills to real-world data sets.

Getting Started with Pandas

Tools and Environment Setup

  • Online: Use pandas in the browser via Google Colab (colab.research.google.com).
  • Local: Clone the repo from the corresponding video link for use with Visual Studio Code, PyCharm, or Jupyter Lab.
  • Virtual Environment: Create and activate a virtual environment (macOS/Linux commands shown):
    python3 -m venv tutorial
    source tutorial/bin/activate
    
  • Install Libraries: Install required libraries:
    pip3 install -r requirements.txt
    
  • Set Up Code Editor: Open the folder in Visual Studio Code, create a Jupyter notebook (.ipynb), and import pandas:
    import pandas as pd
    

Understanding DataFrames

Creating DataFrames

  • DataFrame Basics: Main structure in pandas for tabular data.
  • Basic Structure: Create DataFrame with dummy data:
    df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['A','B','C'])
    df.head()  # Display up to the first 5 rows (all 3 rows here)
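
A DataFrame can also be built from a dictionary of columns, which is often easier to read than a list of rows. The sketch below reuses the dummy data above; the names df_rows, df_cols, and df_labeled and the index labels are illustrative only.

```python
import pandas as pd

# The dummy data from above, built row by row and then column by column.
df_rows = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['A', 'B', 'C'])
df_cols = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]})

# Both constructions produce the same frame.
print(df_rows.equals(df_cols))

# A custom index can replace the default 0, 1, 2 row labels.
df_labeled = pd.DataFrame(
    {'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]},
    index=['x', 'y', 'z'],  # hypothetical labels
)
print(df_labeled)
```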
    

Accessing Data

  • Viewing Data:
    • df.head(): First 5 rows
    • df.tail(): Last 5 rows
    • df.columns: Show columns
    • df.index: Show index
    • df.info(): Basic information about the DataFrame
    • df.describe(): Descriptive statistics
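
The inspection calls listed above can be tried on a small frame. This is a minimal sketch with made-up data; head and tail accept an optional row count.

```python
import pandas as pd

# Ten rows of illustrative data, so head() and tail() return different rows.
df = pd.DataFrame({'A': range(1, 11), 'B': [x * 0.5 for x in range(1, 11)]})

print(df.head(3))        # first 3 rows instead of the default 5
print(df.tail(2))        # last 2 rows
print(list(df.columns))  # column labels
print(df.index)          # row labels (a RangeIndex by default)
df.info()                # prints dtypes and non-null counts

summary = df.describe()  # count, mean, std, min, quartiles, max per numeric column
print(summary.loc['mean'])
```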

Loading Data

  • Read CSV:
    coffee = pd.read_csv('data/warmup/coffee.csv')
    coffee.head()
    
  • Read Other Formats:
    results = pd.read_parquet('data/results.parquet')
    olympics = pd.read_excel('data/olympics.xlsx', sheet_name='results')
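
To experiment with read_csv without the tutorial's data files, an in-memory string can stand in for a file. The column names and values below are hypothetical and are not the contents of coffee.csv.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """Day,Coffee Type,Units Sold
Monday,Espresso,25
Monday,Latte,15
Tuesday,Espresso,30
"""

coffee_demo = pd.read_csv(io.StringIO(csv_text))
print(coffee_demo.dtypes)

# usecols limits parsing to the named columns.
subset = pd.read_csv(io.StringIO(csv_text), usecols=['Day', 'Units Sold'])

# Write the frame back out and reload it to confirm the round trip.
buffer = io.StringIO()
coffee_demo.to_csv(buffer, index=False)
reloaded = pd.read_csv(io.StringIO(buffer.getvalue()))
```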
    

Selecting Data

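  • loc and iloc: loc selects by label, iloc selects by integer position.
    • df.loc[rows, columns]
    • df.iloc[rows, columns]
    df.loc[0]  # Row with index label 0 (the first row under the default index)
    df.iloc[0, 2]  # First row, third column, by position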
    
  • Conditions:
    df[df['A'] > 5]
    df[(df['A'] > 5) & (df['B'] == 8)]
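
The difference between loc and iloc is easier to see when the index is not the default 0, 1, 2. A small sketch, assuming string row labels chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=['A', 'B', 'C'],
    index=['x', 'y', 'z'],  # hypothetical labels that make loc vs. iloc visible
)

by_label = df.loc['y', 'B']   # label-based: row 'y', column 'B'
by_position = df.iloc[1, 1]   # position-based: second row, second column

# loc slices include the end label; iloc slices exclude the end position.
loc_slice = df.loc['x':'y', ['A', 'C']]
iloc_slice = df.iloc[0:2, [0, 2]]

# A boolean condition can be combined with a column selection inside loc.
filtered = df.loc[(df['A'] > 1) & (df['B'] == 8), ['A', 'B']]
print(filtered)
```

Each condition needs its own parentheses because & and | bind more tightly than comparison operators in Python.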
    

Manipulating DataFrames

Adding/Removing Columns

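  • Adding Columns:
    import numpy as np  # needed for np.where
    df['New_Col'] = value  # value: a scalar, or a list/Series matching the row count
    df['Conditional_Col'] = np.where(df['A'] > 10, 'High', 'Low')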
    
  • Dropping Columns:
    df.drop(columns=['A'], inplace=True)
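
The column operations above can be run end to end on a small frame. This sketch uses illustrative data and column names, and it shows drop without inplace=True, which returns a new frame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 12, 7], 'B': [2, 5, 8]})

df['Constant'] = 0                                    # scalar broadcast to every row
df['Sum'] = df['A'] + df['B']                         # element-wise arithmetic
df['Level'] = np.where(df['A'] > 10, 'High', 'Low')   # conditional column

# Without inplace=True, drop and rename return new frames.
trimmed = df.drop(columns=['Constant'])
renamed = trimmed.rename(columns={'Sum': 'Total'})
print(renamed)
```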
    

Merging and Concatenating

  • Merging DataFrames: Join data on a key.
    merged_df = pd.merge(df1, df2, left_on='key1', right_on='key2', how='inner')
    
  • Concatenating DataFrames:
    concatenated_df = pd.concat([df1, df2], axis=0)
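
The effect of the how argument is clearest on two small frames with partially overlapping keys. The frames and column names below (athletes, results, athlete_id, id) are hypothetical.

```python
import pandas as pd

athletes = pd.DataFrame({'athlete_id': [1, 2, 3], 'name': ['Ana', 'Ben', 'Cho']})
results = pd.DataFrame({'id': [1, 1, 3, 4], 'medal': ['Gold', 'Silver', 'Bronze', 'Gold']})

# Inner join: keeps only keys present in both frames (ids 1 and 3).
inner = pd.merge(athletes, results, left_on='athlete_id', right_on='id', how='inner')

# Left join: keeps every athlete; unmatched rows get NaN in the right-hand columns.
left = pd.merge(athletes, results, left_on='athlete_id', right_on='id', how='left')

# Concatenation stacks rows; ignore_index=True renumbers the combined index.
stacked = pd.concat([athletes, athletes], axis=0, ignore_index=True)

print(inner)
print(left)
```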
    

Handling Missing Data

  • Identify Missing Data:
    df.isna().sum()
    
  • Fill Missing Data:
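    df.fillna(value, inplace=True)  # value: a replacement scalar or a per-column dict
    df.interpolate(inplace=True)  # Linear interpolation; intended for numeric columns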
    
  • Drop Missing Data:
    df.dropna(subset=['column_name'], inplace=True)
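
The three strategies above give different results on the same gaps. This is a minimal sketch with made-up numeric data; the column names units and price are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'units': [10.0, np.nan, 30.0, np.nan, 50.0],
    'price': [1.0, 2.0, np.nan, 4.0, 5.0],
})

missing_counts = df.isna().sum()                       # missing values per column

filled_constant = df.fillna(0)                         # replace every gap with 0
filled_mean = df['units'].fillna(df['units'].mean())   # replace gaps with the column mean
interpolated = df['units'].interpolate()               # linear fill between neighbors

# Drop only the rows where 'price' is missing.
dropped = df.dropna(subset=['price'])

print(missing_counts)
print(interpolated)
```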
    

Grouping and Aggregating

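  • Group By: Split rows into groups by a column's values, then aggregate each group.
    df.groupby('column').sum(numeric_only=True)  # Sum the numeric columns within each group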
    
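  • Pivot Table: Reshape data into a grid, with one column's values as rows and another's as columns (cells are averaged by default).
    pd.pivot_table(df, values='sales', index=['month'], columns=['product'])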
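
Grouping and pivoting can be compared on one small frame. The data below is invented for illustration; aggfunc='sum' is passed explicitly because pivot_table averages by default.

```python
import pandas as pd

sales = pd.DataFrame({
    'month': ['Jan', 'Jan', 'Feb', 'Feb', 'Feb'],
    'product': ['Latte', 'Espresso', 'Latte', 'Espresso', 'Latte'],
    'sales': [10, 20, 30, 40, 50],
})

# One aggregate per group.
totals = sales.groupby('month')['sales'].sum()

# Several named aggregates at once.
summary = sales.groupby('product').agg(
    total=('sales', 'sum'),
    average=('sales', 'mean'),
)

# Months as rows, products as columns, summed sales in the cells.
table = pd.pivot_table(sales, values='sales', index='month', columns='product', aggfunc='sum')

print(totals)
print(summary)
print(table)
```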
    

Advanced Functions

Shift and Rank

  • Shift Data:
    df['Prev_Day'] = df['Sales'].shift(1)
    df['Change'] = df['Sales'] - df['Prev_Day']
    
  • Rank Data:
    df['Rank'] = df['Values'].rank(ascending=False)
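
A short run on made-up sales figures shows what shift and rank produce, including how ties are ranked. The Pct_Change and Dense_Rank columns are additions for illustration and are not part of the notes above.

```python
import pandas as pd

df = pd.DataFrame({'Sales': [100, 120, 90, 120]})

# shift(1) moves values down one row, so the first row has no previous value (NaN).
df['Prev_Day'] = df['Sales'].shift(1)
df['Change'] = df['Sales'] - df['Prev_Day']
df['Pct_Change'] = df['Sales'].pct_change()  # relative change from the previous row

# By default, tied values share the average of the ranks they would occupy.
df['Rank'] = df['Sales'].rank(ascending=False)

# method='dense' gives ties the same rank and leaves no gaps afterwards.
df['Dense_Rank'] = df['Sales'].rank(ascending=False, method='dense')

print(df)
```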
    

Performance Optimizations with PyArrow

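  • DataFrame Backend: Load data with PyArrow-backed dtypes, which can speed up loading and reduce memory use (requires pandas 2.0+ and the pyarrow package).
    results_arrow = pd.read_parquet('data/results.parquet', dtype_backend='pyarrow')
    coffee_arrow = pd.read_csv('data/warmup/coffee.csv', engine='pyarrow', dtype_backend='pyarrow')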
    

AI and Machine Learning Integrations

  • GitHub Copilot and ChatGPT: Help with code suggestions and automating routine tasks.
    # Example of code ChatGPT can generate from a plain-English request:
    filter_df = df[(df['BornRegion'] == 'New Hampshire') | (df['BornCity'] == 'San Francisco')]
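
Generated code is worth checking on a tiny frame where the expected answer is obvious. The sketch below runs the filter above on invented rows; only the BornRegion and BornCity column names come from the example, and the frame name bios is hypothetical.

```python
import pandas as pd

# Invented rows for checking the filter by eye.
bios = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'BornRegion': ['New Hampshire', 'California', 'Texas', 'California'],
    'BornCity': ['Concord', 'San Francisco', 'Austin', 'Los Angeles'],
})

# Keep rows matching either condition; | is the element-wise OR.
filter_df = bios[(bios['BornRegion'] == 'New Hampshire') | (bios['BornCity'] == 'San Francisco')]

print(filter_df)
```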
    

Conclusion

  • Applications: Apply learned pandas skills to various datasets.
  • Further Learning: Check related tutorials for data cleaning, pandas puzzles, and real-world data analysis.
  • Community Feedback: Encourage sharing of improvement tips in comments.

Happy coding with pandas!