Overview
This lecture reviewed data filtering, probability calculations, and merging datasets using pandas in Python, focusing on practical techniques for data analysis and preparation.
Preparing and Inspecting Data
- Always make a personal copy of shared notebooks to save your work (File > Save a copy in Drive).
- The dataset used is e-commerce customer behavior, containing demographics, purchase patterns, and customer satisfaction.
- Always document dataset columns for clarity, especially for projects and reports.
- Import necessary packages (e.g., pandas, matplotlib, seaborn) at the start.
- Load data using
pd.read_csv(URL) and assign it to a clearly named DataFrame variable.
- Check for missing data using
df.isna().sum() and review column data types with df.dtypes.
Exploring Data: Unique Values & Logical Operators
- Use
.unique() to find unique values in a column (helps catch typos or categorize data).
- Logical operators in Python: use
== (equals), and, or, and not for conditions.
- In pandas, use
& (and), | (or), and ~ (not) with parentheses for multi-condition filters.
Filtering and Calculating Proportion (Probability)
- Filter DataFrames by conditions, e.g.,
df[df['gender'] == 'Male'].
- Calculate proportions ("probabilities" in samples) as
len(filtered_df) / len(df).
- Use filter conditions for event-based probabilities, such as customer satisfaction ratings.
- Format decimals in f-strings with
:.2f for clarity in outputs.
Multiple Conditions in Filtering
- Combine multiple filter conditions with
& or |, enclosing each condition in parentheses.
- Example:
df[(df['gender'] == 'Male') & (df['satisfaction'] == 5)].
- Separate filter conditions into variables for readability.
- Proportion calculations can use any comparison operator (
==, >, <, etc.).
Merging DataFrames
- Merge two DataFrames using
pd.merge(left, right, on='key', how='type').
- Merge types:
'inner': rows with matching keys in both DataFrames (intersection, like AND).
'outer': all rows from both DataFrames (union, like OR), fill missing with NA.
'left'/'right': all rows from left/right, matched with other, unmatched filled with NA.
- Use unique identifiers (primary keys) for accurate merging.
- After merging, remove redundant columns (e.g., with
.drop()) and rename as needed with .rename().
Key Terms & Definitions
- DataFrame — a 2D labeled data structure used in pandas.
- Filter/Condition — a logical expression returning True/False for each row in a DataFrame.
- Proportion — ratio of filtered rows to total rows; called "probability" when referencing sample data.
- Merge (Join) — combining DataFrames based on shared key columns.
- Primary Key — a unique identifier for records in data tables.
Action Items / Next Steps
- Practice filtering DataFrames using single and multiple conditions.
- Try calculating proportions for different customer satisfaction levels.
- Practice merging two DataFrames on a key and cleaning up the result.
- Document your dataset columns in your project notebooks.
- Review previous notebooks if you need code examples.