Complete Python Pandas Tutorial
Introduction
- Purpose: Tutorial on Python Pandas for both beginners and experienced users.
- Context: Updates to pandas library, new tools, new methods.
- Goals:
- Basics of working with tabular data.
- Advanced manipulation techniques.
Getting Started
- Environment Setup
- Use Google Colab for online coding.
- Local setups: Visual Studio Code, PyCharm, Jupyter Lab.
- Clone repository:
git clone <repo_link>
- Create and activate virtual environment.
- Install requirements:
pip install -r requirements.txt
Basic Pandas Operations
Importing Pandas
import pandas as pd
Creating DataFrames
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
Viewing DataFrames
- df.head(): First 5 rows.
- df.tail(): Last 5 rows.
- df.columns: Column names.
- df.index: Row labels (the index).
- df.info(): Summary of column dtypes, non-null counts, and memory usage.
- df.describe(): Descriptive statistics for numeric columns.
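As a rough sketch of the inspection calls above, the snippet below reuses the toy frame from "Creating DataFrames". The variable names and the use of df.shape are illustrative additions, not part of the original notes.

```python
import pandas as pd

# Same toy data as in "Creating DataFrames".
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

first_rows = df.head(2)          # head/tail accept a row count; the default is 5
column_names = list(df.columns)  # ['A', 'B']
row_count, col_count = df.shape  # (rows, columns)
stats = df.describe()            # count, mean, std, min, quartiles, max

print(first_rows)
print(column_names, row_count, col_count)
print(stats.loc['mean'])
```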
Loading Data
- CSV:
pd.read_csv('file.csv')
- Parquet:
pd.read_parquet('file.parquet')
- Excel:
pd.read_excel('file.xlsx', sheet_name='Sheet1')
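To try the CSV reader without any files on disk, a minimal sketch can feed it an in-memory buffer instead of 'file.csv'. The column names and values here are made up for illustration. Note that read_parquet and read_excel rely on optional packages (for example pyarrow or fastparquet for Parquet, openpyxl for .xlsx), so they are left out of this sketch.

```python
import io
import pandas as pd

# In-memory stand-in for 'file.csv' (hypothetical contents).
csv_text = "name,units_sold\nAnn,3\nBob,5\n"
df = pd.read_csv(io.StringIO(csv_text))

# Write the frame back out and read it again; index=False avoids
# adding the row index as an extra column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
df_again = pd.read_csv(buffer)

print(df_again)
```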
Accessing Data
Rows and Columns
- Using iloc: Integer position-based access.
- Using loc: Label-based access.
# iloc example
df.iloc[0, 1] # First row, second column
# loc example
df.loc[0, 'A'] # Row labeled 0 (the first row under the default index), column 'A'
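The difference between the two accessors is easier to see when the index is not the default 0, 1, 2. The sketch below assumes a hypothetical string index ('x', 'y', 'z') to make that visible.

```python
import pandas as pd

# Hypothetical string index so positions and labels no longer coincide.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

by_position = df.iloc[0, 1]    # row position 0, column position 1
by_label = df.loc['x', 'B']    # row label 'x', column label 'B'

# Slicing differs: iloc excludes the end position, loc includes the end label.
iloc_slice = df.iloc[0:2]      # rows 'x' and 'y'
loc_slice = df.loc['x':'z']    # rows 'x', 'y', and 'z'

print(by_position, by_label)
print(len(iloc_slice), len(loc_slice))
```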
Filtering Data
- Basic Filtering:
df[df['A'] > 2]
- Multiple Conditions:
df[(df['A'] > 2) & (df['B'] < 5)]
- String Operations:
df[df['name'].str.contains('John', na=False)]
df[df['name'].str.startswith('A', na=False)]
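The filters above can be sketched on a small frame. The data is invented for illustration, and one name is deliberately missing to show why na=False matters: without it, the string methods return a missing value for that row, which cannot be used as a boolean mask.

```python
import pandas as pd

# Hypothetical data; the None entry simulates a missing name.
df = pd.DataFrame({
    'name': ['John Smith', 'Alice', None, 'Adam Johnson'],
    'A': [1, 3, 5, 4],
    'B': [9, 4, 2, 7],
})

big_a = df[df['A'] > 2]
# Each condition needs its own parentheses; use & (and) or | (or).
both = df[(df['A'] > 2) & (df['B'] < 5)]
# na=False treats missing names as "no match".
johns = df[df['name'].str.contains('John', na=False)]
starts_a = df[df['name'].str.startswith('A', na=False)]

print(both)
print(johns)
```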
Adding and Removing Columns
- Adding Columns:
df['C'] = df['A'] + df['B']
- Removing Columns:
df = df.drop(columns=['C']) # drop returns a new DataFrame, so assign the result
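A short sketch of adding and then dropping a column. The name trimmed is illustrative; the point is that drop leaves the original frame untouched unless the result is assigned.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Column arithmetic is element-wise.
df['C'] = df['A'] + df['B']

# drop returns a new DataFrame; df itself still has column 'C'.
trimmed = df.drop(columns=['C'])

print(list(df.columns))
print(list(trimmed.columns))
```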
Handling Missing Data
- Detect Missing Data:
df.isna().sum()
- Fill Missing Data:
df.fillna(value)
- Drop Missing Data:
df.dropna()
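The three calls above can be tried on a tiny frame with gaps. The values and the fill value 0 are arbitrary choices for illustration; what to fill with depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values.
df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [np.nan, np.nan, 6.0]})

missing_per_column = df.isna().sum()   # count of missing values per column
filled = df.fillna(0)                  # replace every missing value with 0
complete_rows = df.dropna()            # keep only rows with no missing values

print(missing_per_column)
print(complete_rows)
```

Both fillna and dropna return new frames, so df still contains its missing values afterwards.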
Aggregating Data
GroupBy
gb = df.groupby('column').sum()
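As a sketch of grouping, the example below uses a hypothetical region column and sums units_sold per group. Selecting the column before aggregating keeps non-numeric columns out of the sum; the agg call with named outputs is one optional way to compute several statistics at once.

```python
import pandas as pd

# Hypothetical sales data.
df = pd.DataFrame({
    'region': ['east', 'west', 'east', 'west'],
    'units_sold': [10, 20, 30, 40],
})

# One aggregate per group.
totals = df.groupby('region')['units_sold'].sum()

# Several named aggregates per group.
summary = df.groupby('region').agg(
    total=('units_sold', 'sum'),
    average=('units_sold', 'mean'),
)

print(totals)
print(summary)
```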
Pivot Table
pivot = df.pivot_table(index='A', columns='B', values='C', aggfunc='sum')
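To see what the pivot_table call produces, here is a minimal sketch with invented values for columns A, B, and C. Each unique value of A becomes a row, each unique value of B becomes a column, and the cells hold the sum of C for that combination.

```python
import pandas as pd

# Hypothetical data using the column names from the call above.
df = pd.DataFrame({
    'A': ['x', 'x', 'y', 'y', 'y'],
    'B': ['p', 'q', 'p', 'p', 'q'],
    'C': [1, 2, 3, 4, 5],
})

pivot = df.pivot_table(index='A', columns='B', values='C', aggfunc='sum')

print(pivot)
```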
Merging and Concatenating
Merge
merged_df = pd.merge(df1, df2, left_on='key1', right_on='key2', how='inner')
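The effect of the how argument can be sketched with two small frames. The data and the extra columns (name, height) are illustrative assumptions; only the key column names come from the call above.

```python
import pandas as pd

# Hypothetical frames whose key columns have different names.
df1 = pd.DataFrame({'key1': [1, 2, 3], 'name': ['Ann', 'Bob', 'Cy']})
df2 = pd.DataFrame({'key2': [2, 3, 4], 'height': [170, 180, 190]})

# inner: keep only keys present in both frames.
inner = pd.merge(df1, df2, left_on='key1', right_on='key2', how='inner')

# left: keep every row of df1; unmatched rows get missing values.
left = pd.merge(df1, df2, left_on='key1', right_on='key2', how='left')

print(inner)
print(left)
```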
Concatenation
concat_df = pd.concat([df1, df2], axis=0)
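A small sketch of row-wise concatenation. By default the original index values are kept, which produces duplicates; passing ignore_index=True (an optional argument not used in the line above) renumbers the rows.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

# axis=0 stacks rows; the index values 0, 1 repeat.
stacked = pd.concat([df1, df2], axis=0)

# ignore_index=True assigns a fresh 0..n-1 index.
renumbered = pd.concat([df1, df2], axis=0, ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```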
Advanced DataFrame Operations
Shifting and Rolling
- Shift for Comparing Periods:
df['previous_day'] = df['Revenue'].shift(1)
- Rolling Window Average:
df['3_day_avg'] = df['Revenue'].rolling(window=3).mean()
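A worked sketch with four days of invented revenue figures shows what shift and rolling produce. The change column is an illustrative addition showing a typical use of the shifted values.

```python
import pandas as pd

# Hypothetical daily revenue, one row per day.
df = pd.DataFrame({'Revenue': [100, 120, 90, 150]})

# shift(1) moves values down one row; the first row has no predecessor.
df['previous_day'] = df['Revenue'].shift(1)
df['change'] = df['Revenue'] - df['previous_day']

# A 3-row window needs three values, so the first two results are missing.
df['3_day_avg'] = df['Revenue'].rolling(window=3).mean()

print(df)
```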
Rank and Cumulative Sum
df['rank'] = df['height'].rank()
df['cumulative_sum'] = df['units_sold'].cumsum()
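The sketch below runs both calls on invented data with a tie in height. By default rank sorts ascending and gives tied values the average of their ranks; the rank_desc column is an illustrative variant using the ascending and method parameters.

```python
import pandas as pd

# Hypothetical data; two rows share the same height.
df = pd.DataFrame({'height': [180, 165, 180, 172], 'units_sold': [5, 3, 8, 1]})

# Default: ascending, ties receive the average rank.
df['rank'] = df['height'].rank()

# Tallest first, ties receive the lowest rank in the tied group.
df['rank_desc'] = df['height'].rank(ascending=False, method='min')

# Running total down the column.
df['cumulative_sum'] = df['units_sold'].cumsum()

print(df)
```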
Pandas 2.x New Features
PyArrow Backend
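- Optimizations: Arrow-backed dtypes handle strings and booleans better and use memory more efficiently.
- Usage: Pass dtype_backend='pyarrow' to a reader such as pd.read_csv (requires the pyarrow package).
df = pd.read_csv('file.csv', dtype_backend='pyarrow')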
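One way to opt into Arrow-backed dtypes, sketched under the assumption of pandas 2.0 or later with the optional pyarrow package installed, is the dtype_backend parameter on readers such as pd.read_csv, or DataFrame.convert_dtypes for a frame that already exists. The CSV contents are made up, and the guard simply skips the example when pyarrow is not available.

```python
import importlib.util
import io
import pandas as pd

# Hypothetical CSV contents with a string column and a boolean column.
csv_text = "name,active\nAnn,True\nBob,False\n"

# pyarrow is an optional dependency, so check for it first.
has_pyarrow = importlib.util.find_spec("pyarrow") is not None

arrow_dtypes = None
converted_dtypes = None
if has_pyarrow:
    # Readers accept dtype_backend; columns come back with Arrow dtypes.
    df_arrow = pd.read_csv(io.StringIO(csv_text), dtype_backend="pyarrow")
    arrow_dtypes = df_arrow.dtypes.astype(str).to_dict()

    # An existing NumPy-backed frame can be converted after the fact.
    df_numpy = pd.DataFrame({'A': [1, 2, 3]})
    converted = df_numpy.convert_dtypes(dtype_backend="pyarrow")
    converted_dtypes = converted.dtypes.astype(str).to_dict()

print(arrow_dtypes)
print(converted_dtypes)
```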
AI Integrations
- Copilot and ChatGPT: Efficiency in coding, troubleshooting, and debugging.
- Examples:
- Code generation for plots, queries, etc.
- Providing explanations and definitions.
Recommendations
- Practice: Use real datasets, explore more complex operations.
- Learn from Others: Watch tutorials, read blogs, solve puzzles.
- Stay Updated: Continuous learning given the evolving nature of pandas and data science in general.