Complete Python Pandas Tutorial

Jul 12, 2024

Introduction

  • Purpose: Tutorial on Python Pandas for both beginners and experienced users.
  • Context: Covers recent updates to the pandas library, including new tools and methods.
  • Goals:
    • Basics of working with tabular data.
    • Advanced manipulation techniques.

Getting Started

  • Environment Setup
    • Use Google Colab for online coding.
    • Local setups: Visual Studio Code, PyCharm, Jupyter Lab.
    • Clone repository: git clone <repo_link>.
    • Create and activate virtual environment.
    • Install requirements: pip install -r requirements.txt.

Basic Pandas Operations

Importing Pandas

import pandas as pd

Creating DataFrames

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

Viewing DataFrames

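  • df.head(): First 5 rows by default; df.head(n) returns the first n.
  • df.tail(): Last 5 rows by default; df.tail(n) returns the last n.
  • df.columns: Column names.
  • df.index: Row labels (the index).
  • df.info(): Summary of column dtypes, non-null counts, and memory usage.
  • df.describe(): Descriptive statistics (count, mean, std, min, quartiles, max) for numeric columns.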
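
As a minimal sketch of the inspection methods above, the following builds a small frame (the values are made up for illustration) and calls each one.

```python
import pandas as pd

# Small illustrative frame; the values are invented for this sketch.
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6],
                   'B': [4.0, 5.5, 6.0, 7.5, 8.0, 9.5]})

first_rows = df.head()            # first 5 rows by default
last_two = df.tail(2)             # last 2 rows
column_names = list(df.columns)   # ['A', 'B']
stats = df.describe()             # summary statistics per numeric column

print(first_rows)
print(column_names)
df.info()                         # prints dtypes, non-null counts, memory usage
```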

Loading Data

  • CSV: pd.read_csv('file.csv')
  • Parquet: pd.read_parquet('file.parquet') (requires pyarrow or fastparquet)
  • Excel: pd.read_excel('file.xlsx', sheet_name='Sheet1') (requires openpyxl for .xlsx files)
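
The readers accept optional arguments that control what gets loaded. The sketch below uses an in-memory string as a stand-in for 'file.csv' (the contents and column names are assumptions for illustration) so it runs without any file on disk.

```python
import io
import pandas as pd

# In-memory stand-in for 'file.csv'; contents are invented for illustration.
csv_text = "date,city,revenue\n2024-01-01,Boston,100\n2024-01-02,Denver,150\n"

loaded = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=['date'],          # convert the date column to datetime
    usecols=['date', 'revenue'],   # load only the columns that are needed
)
print(loaded.dtypes)
```

With a real file, pass the path instead of the io.StringIO object.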

Accessing Data

Rows and Columns

  • Using iloc: Position-based access (integer row and column positions, starting at 0).
  • Using loc: Label-based access (index labels and column names).
# iloc example
df.iloc[0, 1] # First row, second column

# loc example
df.loc[0, 'A'] # First row, column 'A'
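
The difference between the two accessors is easier to see when the index is not the default 0, 1, 2. A small sketch, assuming a frame with string row labels (the labels 'x', 'y', 'z' are made up for this example):

```python
import pandas as pd

# String row labels make positions and labels visibly different.
df_labeled = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]},
                          index=['x', 'y', 'z'])

by_position = df_labeled.iloc[0, 1]      # first row, second column
by_label = df_labeled.loc['x', 'B']      # row labeled 'x', column 'B'

pos_slice = df_labeled.iloc[0:2]         # positions 0 and 1 (end excluded)
label_slice = df_labeled.loc['x':'y']    # labels 'x' through 'y' (end included)
```

Note that iloc slices exclude the end position, while loc slices include the end label.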

Filtering Data

  • Basic Filtering: df[df['A'] > 2]
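  • Multiple Conditions: df[(df['A'] > 2) & (df['B'] < 5)] (wrap each condition in parentheses; use & for AND, | for OR, ~ for NOT)
  • String Operations:
    • df[df['name'].str.contains('John', na=False)] (na=False treats missing names as non-matches)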
    • df[df['name'].str.startswith('A', na=False)]
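
The filters above can be sketched end to end on a small frame. The names and numbers below are invented for illustration, and one name is deliberately missing to show why na=False matters.

```python
import pandas as pd

# Illustrative data; one name is missing on purpose.
people = pd.DataFrame({
    'name': ['John Smith', 'Alice Wong', None, 'Andrew Johnson'],
    'A': [1, 3, 5, 4],
    'B': [4, 2, 6, 3],
})

over_two = people[people['A'] > 2]                       # single condition
both = people[(people['A'] > 2) & (people['B'] < 5)]     # combined with AND

# na=False makes rows with a missing name evaluate to False
johns = people[people['name'].str.contains('John', na=False)]
starts_a = people[people['name'].str.startswith('A', na=False)]
```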

Adding and Removing Columns

  • Adding Columns: df['C'] = df['A'] + df['B']
  • Removing Columns: df = df.drop(columns=['C']) (drop returns a new DataFrame, so assign the result)
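
A short sketch of both operations, showing which calls modify the frame and which return a new one. The 'ratio' column is a hypothetical name added for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Assignment adds the column to df in place.
df['C'] = df['A'] + df['B']

# assign() returns a new frame and leaves df untouched.
with_ratio = df.assign(ratio=df['A'] / df['B'])

# drop() also returns a new frame; df still has column 'C'.
trimmed = df.drop(columns=['C'])
```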

Handling Missing Data

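  • Detect Missing Data: df.isna().sum() (count of missing values per column)
  • Fill Missing Data: df = df.fillna(value) (returns a new DataFrame; value can be a scalar or a per-column dict)
  • Drop Missing Data: df = df.dropna() (returns a new DataFrame without rows that contain any missing value)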
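
The three steps above, sketched on a tiny frame with gaps. The fill choices here (column mean for 'A', zero for 'B') are assumptions for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                    'B': [np.nan, 5.0, 6.0]})

missing_counts = raw.isna().sum()     # missing values per column

# Per-column fill values passed as a dict; raw itself is not modified.
filled = raw.fillna({'A': raw['A'].mean(), 'B': 0})

# Keep only rows with no missing values.
complete_rows = raw.dropna()
```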

Aggregating Data

GroupBy

  • Example:
gb = df.groupby('column').sum()
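
A slightly fuller sketch of grouping, assuming a hypothetical sales table with 'region', 'units', and 'price' columns. It shows a single-column sum and named aggregations via agg().

```python
import pandas as pd

# Hypothetical sales data for illustration.
sales = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West'],
    'units': [10, 20, 30, 40],
    'price': [1.5, 2.0, 2.5, 3.0],
})

# Sum one column per group.
totals = sales.groupby('region')['units'].sum()

# Several aggregations at once, each with an explicit output name.
summary = sales.groupby('region').agg(
    total_units=('units', 'sum'),
    avg_price=('price', 'mean'),
)
print(summary)
```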

Pivot Table

  • Example:
pivot = df.pivot_table(index='A', columns='B', values='C', aggfunc='sum')
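
To make the call above concrete, here is a sketch on invented data where 'A' and 'B' are categories and 'C' is the value being summed. The fill_value argument is an optional addition that replaces empty cells with 0.

```python
import pandas as pd

# Invented data: A and B are categories, C is the value to aggregate.
orders = pd.DataFrame({
    'A': ['x', 'x', 'y', 'y', 'x'],
    'B': ['p', 'q', 'p', 'q', 'p'],
    'C': [1, 2, 3, 4, 5],
})

# One row per value of A, one column per value of B, cells hold the sum of C.
pivot = orders.pivot_table(index='A', columns='B', values='C',
                           aggfunc='sum', fill_value=0)
print(pivot)
```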

Merging and Concatenating

Merge

merged_df = pd.merge(df1, df2, left_on='key1', right_on='key2', how='inner')
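
A runnable sketch of the merge above, with two small invented frames. It also shows a left join with indicator=True, which adds a '_merge' column recording where each row came from.

```python
import pandas as pd

# Two invented frames whose key columns have different names.
df1 = pd.DataFrame({'key1': [1, 2, 3], 'name': ['a', 'b', 'c']})
df2 = pd.DataFrame({'key2': [2, 3, 4], 'score': [20, 30, 40]})

# Inner join keeps only keys present in both frames.
inner = pd.merge(df1, df2, left_on='key1', right_on='key2', how='inner')

# Left join keeps every row of df1; unmatched rows get NaN for df2 columns.
left = pd.merge(df1, df2, left_on='key1', right_on='key2',
                how='left', indicator=True)
print(left)
```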

Concatenation

concat_df = pd.concat([df1, df2], axis=0)
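
A sketch of how the axis and ignore_index arguments change the result, using two invented single-column frames:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

# axis=0 stacks rows; the original index labels are kept, so they repeat.
stacked = pd.concat([df1, df2], axis=0)

# ignore_index=True renumbers the rows 0..n-1.
renumbered = pd.concat([df1, df2], axis=0, ignore_index=True)

# axis=1 places frames side by side, aligned on the index.
side_by_side = pd.concat([df1, df2.rename(columns={'A': 'B'})], axis=1)
```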

Advanced DataFrame Operations

Shifting and Rolling

  • Shift for Comparing Periods:
df['previous_day'] = df['Revenue'].shift(1)
  • Rolling Calculation:
df['3_day_avg'] = df['Revenue'].rolling(window=3).mean()
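
Both calls can be sketched on a short daily series. The revenue figures and dates are invented, and the 'change' column is a hypothetical addition showing what the shifted values are typically used for.

```python
import pandas as pd

# Four days of invented revenue figures.
daily = pd.DataFrame(
    {'Revenue': [100, 120, 90, 150]},
    index=pd.date_range('2024-01-01', periods=4, freq='D'),
)

# shift(1) moves values down one row, so each row sees the previous day.
daily['previous_day'] = daily['Revenue'].shift(1)
daily['change'] = daily['Revenue'] - daily['previous_day']

# The first two rows are NaN because a full 3-row window is not yet available.
daily['3_day_avg'] = daily['Revenue'].rolling(window=3).mean()
print(daily)
```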

Rank and Cumulative Sum

df['rank'] = df['height'].rank()
df['cumulative_sum'] = df['units_sold'].cumsum()
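
A sketch with invented values that includes a tie, since ties are where rank() behavior tends to surprise. The 'rank_desc' column is a hypothetical addition showing the ascending and method arguments.

```python
import pandas as pd

# Invented data; two rows share the same height.
stats = pd.DataFrame({'height': [180, 165, 180, 172],
                      'units_sold': [5, 3, 8, 2]})

# Default: ascending, tied values share the average of their ranks.
stats['rank'] = stats['height'].rank()

# Tallest first, tied values share the lowest rank in the group.
stats['rank_desc'] = stats['height'].rank(ascending=False, method='min')

# Running total down the rows.
stats['cumulative_sum'] = stats['units_sold'].cumsum()
print(stats)
```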

Pandas 2.x New Features

PyArrow Backend

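  • Optimizations: Better handling of strings and booleans, and more efficient memory use.
  • Usage: Enable it when reading data, or convert an existing DataFrame (requires the pyarrow package).
df = pd.read_csv('file.csv', dtype_backend='pyarrow')
df = df.convert_dtypes(dtype_backend='pyarrow')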
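
A minimal sketch of opting in to Arrow-backed dtypes, assuming pandas 2.x. The CSV text is invented, and the import guard is there only because pyarrow is an optional dependency that may not be installed.

```python
import io
import pandas as pd

# Invented CSV text with a string, a boolean, and a numeric column.
csv_text = "name,active,score\nAda,True,1.5\nGrace,False,\n"

try:
    import pyarrow  # noqa: F401  (optional dependency)
    # dtype_backend is accepted by the pandas readers in 2.x.
    arrow_df = pd.read_csv(io.StringIO(csv_text), dtype_backend='pyarrow')
except ImportError:
    arrow_df = None  # pyarrow is not installed; skip the demonstration

if arrow_df is not None:
    print(arrow_df.dtypes)   # columns use Arrow-backed dtypes
```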

AI Integrations

  • Copilot and ChatGPT: Efficiency in coding, troubleshooting, and debugging.
  • Examples:
    • Code generation for plots, queries, etc.
    • Providing explanations and definitions.

Recommendations

  • Practice: Use real datasets, explore more complex operations.
  • Learn from Others: Watch tutorials, read blogs, solve puzzles.
  • Stay Updated: Continuous learning given the evolving nature of pandas and data science in general.