Overview
This lecture introduces core data science concepts and workflows in Python: environment setup, data wrangling with Pandas and NumPy, data cleaning, visualization, and an introduction to machine learning.
Python & Anaconda Setup
- Anaconda bundles Python and key data science tools like Jupyter Notebook and Pandas for easy installation.
- Download Anaconda for your operating system, run the installer, and follow prompts to complete setup.
- Use Anaconda Navigator to manage environments and libraries, and to launch Jupyter Notebook or Jupyter Lab.
Jupyter Notebook Basics
- Jupyter Notebook lets you write, run, and share live Python code, visualizations, and text.
- Cells can be in code, markdown, or raw NBConvert mode; use code cells for code and markdown cells for formatted text.
- Use keyboard shortcuts (e.g., Shift+Enter to run, A/B to add cells, M/Y to change cell types).
- Save work regularly; notebooks autosave but manual checkpoints are possible.
Python Fundamentals
- Use `print()` to display messages; comments start with `#`.
- Strings are text data and have methods like `.upper()`, `.lower()`, `.count()`, `.replace()`.
- Variables store data using the `=` operator; use double quotes for strings containing apostrophes.
- Lists store ordered, mutable collections of items; use zero-based indexing and can be sliced or modified.
- Use methods like `.append()`, `.insert()`, `.remove()`, `.pop()`, `.sort()`, `.copy()` for list management.
- Dictionaries store key-value pairs (insertion-ordered since Python 3.7); access and modify them with methods like `.keys()`, `.values()`, `.items()`, `.update()`, `.copy()`, `.pop()`.
- Control flow uses `if`, `elif`, `else` for conditions and `for` loops for iteration.
- Functions are defined with `def` and return values via `return`.
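A minimal sketch tying the fundamentals above together; the names and values are invented for illustration.

```python
# Strings: double quotes allow an apostrophe inside; methods return new strings.
greeting = "it's a nice day"

# Lists: ordered, mutable, zero-indexed.
scores = [3, 1, 2]
scores.append(4)
scores.sort()  # sorts in place -> [1, 2, 3, 4]

# Dictionaries: key-value pairs.
person = {"name": "Ada", "role": "analyst"}
person.update({"role": "data scientist"})

# Control flow + a function.
def describe(p):
    """Return a sentence built from a person dictionary."""
    if "name" in p:
        return f"{p['name']} is a {p['role']}"
    return "unknown"

summary = describe(person)
print(greeting.upper())  # string method: IT'S A NICE DAY
```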
Working with Modules & Libraries
- Import modules (e.g., `import os`) to use their functions (e.g., file management).
- Install additional libraries through Anaconda Navigator or `pip`.
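A quick sketch of the `os` example; the `data/config.csv` path is hypothetical and only shows how paths are built portably.

```python
import os

# List the entries in the current working directory.
entries = os.listdir(".")

# Build a path in an OS-independent way (hypothetical path, for illustration).
config_path = os.path.join("data", "config.csv")
```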
Pandas for Data Analysis
- Pandas excels in data wrangling, cleaning, and analysis, comparable to Excel but with more power.
- Core data structures: Series (1D) and DataFrame (2D).
Creating & Loading DataFrames
- Create DataFrames from arrays, lists, dictionaries, or CSV files using `pd.DataFrame()` or `pd.read_csv()`.
- CSV files should be in the same directory as your notebook for easy importing.
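A minimal sketch of building a DataFrame from a dictionary; the city data is made up. Loading a CSV that sits next to the notebook would look the same with `pd.read_csv("file.csv")`.

```python
import pandas as pd

# Dictionary keys become column names; list values become the rows.
df = pd.DataFrame({
    "city": ["Berlin", "Paris", "Rome"],
    "population_m": [3.6, 2.1, 2.8],
})
```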
Exploring DataFrames
- Use `.head()`, `.tail()` to view sample rows.
- Attributes like `.shape`, `.columns`, `.dtypes` provide structural info.
- Methods like `.info()` and `.describe()` summarize structure and statistics; `.round()` rounds numeric values for readability.
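A short sketch of these exploration tools on a toy DataFrame (values invented):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.234, 5.678, 9.012], "b": ["x", "y", "z"]})

first_rows = df.head(2)   # first n rows (default 5)
shape = df.shape          # (rows, columns) tuple
rounded = df.round(1)     # round numeric columns to 1 decimal place
df.info()                 # prints column names, dtypes, non-null counts
```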
Selecting & Modifying Data
- Select columns: `df['column']` or `df[['col1','col2']]`.
- Add new columns by assignment.
- Use `.assign()`, `.insert()` for more controlled additions.
- Perform column/row operations with built-in or custom functions using `.apply()` or `.map()`.
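The selection and modification steps above can be sketched as follows (toy price/quantity data, invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

prices = df["price"]                   # single column -> Series
subset = df[["price", "qty"]]          # list of columns -> DataFrame
df["total"] = df["price"] * df["qty"]  # new column by assignment

# Element-wise transformation with .map() and a custom function.
df["label"] = df["qty"].map(lambda q: "bulk" if q > 1 else "single")
```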
Filtering & Sorting Data
- Filter with boolean indexing: `df[df['col'] > value]`.
- Sort rows by column values using `.sort_values()`.
- Set and sort indexes with `.set_index()` and `.sort_index()`.
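A compact sketch of filtering and sorting on invented exam-score data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["c", "a", "b"], "score": [70, 90, 80]})

high = df[df["score"] > 75]                  # boolean indexing keeps matching rows
by_score = df.sort_values("score", ascending=False)
indexed = df.set_index("name").sort_index()  # label index, sorted alphabetically
```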
Handling Missing Values
- Identify missing values with `.isnull()` and handle them via `.dropna()`, `.fillna()`.
- Choose to drop or fill missing data depending on analysis goals.
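Both strategies can be sketched side by side on a tiny DataFrame with injected `NaN` values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

mask = df.isnull()     # True where values are missing
dropped = df.dropna()  # drop every row containing any NaN
filled = df.fillna(0)  # or: replace NaN with a constant instead
```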
Data Cleaning Techniques
- Use string methods to clean and format text columns (`.str.strip()`, `.str.upper()`, `.str.replace()`).
- Flag duplicates with `.duplicated()` and remove them with `.drop_duplicates()`.
- List unique values with `.unique()` and count them with `.nunique()`.
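A small sketch of the cleaning steps above, using a made-up column with inconsistent casing and whitespace:

```python
import pandas as pd

df = pd.DataFrame({"city": ["  berlin", "Berlin ", "paris"]})

# Normalize whitespace and case so duplicates become detectable.
df["city"] = df["city"].str.strip().str.upper()

dupes = df.duplicated()       # flags the second "BERLIN"
clean = df.drop_duplicates()
n_cities = clean["city"].nunique()
```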
Grouping & Aggregating Data
- Group data by categories using `.groupby()` and aggregate with functions like `sum()`, `mean()`, `count()`, or `.agg()`.
- Filter groups based on aggregate conditions using `.filter()`.
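A sketch of grouping, aggregating, and group-level filtering on invented sales data:

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["A", "A", "B", "B", "B"],
    "sales": [10, 20, 5, 5, 5],
})

totals = df.groupby("dept")["sales"].sum()                  # one number per group
stats = df.groupby("dept")["sales"].agg(["mean", "count"])  # several aggregates at once

# .filter() keeps or drops entire groups based on an aggregate condition.
big = df.groupby("dept").filter(lambda g: g["sales"].sum() > 16)
```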
Merging & Concatenating Data
- Stack DataFrames with `pd.concat()` or join them on shared key columns with `pd.merge()`.
- Understand different join types: inner, outer (full), left, right, and exclusive joins.
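The join behavior can be sketched on two toy tables that overlap on an `id` column (data invented):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [20, 30, 40]})

stacked = pd.concat([left, left])                    # rows stacked one under the other
inner = pd.merge(left, right, on="id")               # only ids present in both (2, 3)
outer = pd.merge(left, right, on="id", how="outer")  # all ids, NaN where missing
```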
Data Visualization
- Use the DataFrame `.plot()` method for line plots, bar charts, histograms, pie charts, boxplots, and scatter plots.
- Customize with arguments like `kind`, `title`, `xlabel`, `ylabel`, `figsize`.
- For interactive plots, use libraries like Plotly with Cufflinks.
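A sketch of `.plot()` with the customization arguments listed above; the sales figures are invented, and the `Agg` backend is used only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import pandas as pd

df = pd.DataFrame({"year": [2020, 2021, 2022], "sales": [100, 150, 130]})

# .plot() dispatches on `kind`; here a labelled bar chart sized 6x4 inches.
ax = df.plot(kind="bar", x="year", y="sales",
             title="Sales by year", xlabel="Year", ylabel="Sales",
             figsize=(6, 4))
```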
Introduction to Machine Learning
- Machine learning automates pattern recognition and prediction.
- Use scikit-learn (`sklearn`) for classic algorithms like linear regression, logistic regression, decision trees, SVM, and Naive Bayes.
- Prepare data: split into train/test sets using `train_test_split`.
- Convert text to numeric features with `CountVectorizer` or `TfidfVectorizer`.
- Fit models and use metrics like accuracy, confusion matrix, F1-score, and classification report for evaluation.
- Use grid search (`GridSearchCV`) to tune model hyperparameters.
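A minimal end-to-end sketch of the pipeline above; the eight toy sentences and their sentiment labels are made up purely for illustration, so the resulting scores carry no real meaning.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

texts = ["great product", "loved it", "terrible quality", "awful experience",
         "really great", "truly awful", "loved the quality", "terrible product"]
labels = [1, 1, 0, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative (invented)

# Split, vectorize, fit, predict, evaluate.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

vec = CountVectorizer()        # bag-of-words features
model = LogisticRegression()
model.fit(vec.fit_transform(X_train), y_train)

preds = model.predict(vec.transform(X_test))
acc = accuracy_score(y_test, preds)
```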
Key Terms & Definitions
- DataFrame — two-dimensional labeled data structure in pandas.
- Series — one-dimensional labeled array in pandas.
- Indexing — selecting elements from lists, arrays, or DataFrames using integer or label positions.
- Aggregation — summarizing data via functions like sum, mean, or count.
- Boolean Indexing — filtering data using logical conditions.
- Regular Expression — pattern syntax for string searching or text processing.
- Imbalanced Data — unequal representation of classes in a dataset.
- Precision/Recall/F1 — metrics for evaluating classification models' performance.
- Machine Learning Model — algorithm trained to recognize patterns and make predictions.
Action Items / Next Steps
- Download required datasets and place CSV files in your Jupyter Notebook directory.
- Practice importing, cleaning, and analyzing sample datasets using Pandas.
- Try building a small machine learning model using scikit-learn and evaluate it with accuracy and F1-score.
- Complete any assigned exercises or projects and review the provided cheat sheet for reference.