Data Filtering and Merging Techniques

Overview

This lecture reviewed data filtering, probability calculations, and merging datasets using pandas in Python, focusing on practical techniques for data analysis and preparation.

Preparing and Inspecting Data

Always make a personal copy of shared notebooks to save your work (File > Save a copy in Drive).
The dataset used is e-commerce customer behavior, containing demographics, purchase patterns, and customer satisfaction.
Always document dataset columns for clarity, especially for projects and reports.
Import necessary packages (e.g., pandas, matplotlib, seaborn) at the start.
Load data using pd.read_csv(URL) and assign it to a clearly named DataFrame variable.
Check for missing data using df.isna().sum() and review column data types with df.dtypes.

Exploring Data: Unique Values & Logical Operators

Use .unique() to find unique values in a column (helps catch typos or categorize data).
Logical operators in Python: use == (equals), and, or, and not for conditions.
In pandas, use & (and), | (or), and ~ (not) with parentheses for multi-condition filters.

Filtering and Calculating Proportion (Probability)

Filter DataFrames by conditions, e.g., df[df['gender'] == 'Male'].
Calculate proportions ("probabilities" in samples) as len(filtered_df) / len(df).
Use filter conditions for event-based probabilities, such as customer satisfaction ratings.
Format decimals in f-strings with :.2f for clarity in outputs.

Multiple Conditions in Filtering

Combine multiple filter conditions with & or |, enclosing each condition in parentheses.
Example: df[(df['gender'] == 'Male') & (df['satisfaction'] == 5)].
Separate filter conditions into variables for readability.
Proportion calculations can use any comparison operator (==, >, <, etc.).

Merging DataFrames

Merge two DataFrames using pd.merge(left, right, on='key', how='type').
Merge types:
- 'inner': rows with matching keys in both DataFrames (intersection, like AND).
- 'outer': all rows from both DataFrames (union, like OR), fill missing with NA.
- 'left'/'right': all rows from left/right, matched with other, unmatched filled with NA.
Use unique identifiers (primary keys) for accurate merging.
After merging, remove redundant columns (e.g., with .drop()) and rename as needed with .rename().

Key Terms & Definitions

DataFrame — a 2D labeled data structure used in pandas.
Filter/Condition — a logical expression returning True/False for each row in a DataFrame.
Proportion — ratio of filtered rows to total rows; called "probability" when referencing sample data.
Merge (Join) — combining DataFrames based on shared key columns.
Primary Key — a unique identifier for records in data tables.

Action Items / Next Steps

Practice filtering DataFrames using single and multiple conditions.
Try calculating proportions for different customer satisfaction levels.
Practice merging two DataFrames on a key and cleaning up the result.
Document your dataset columns in your project notebooks.
Review previous notebooks if you need code examples.