Overview
This lecture introduces core data science concepts and workflows in Python: environment setup, data wrangling with Pandas and NumPy, data cleaning, visualization, and an introduction to machine learning.
Python & Anaconda Setup
- Anaconda bundles Python and key data science tools like Jupyter Notebook and Pandas for easy installation.
- Download Anaconda for your operating system, run the installer, and follow prompts to complete setup.
- Use Anaconda Navigator to manage environments and libraries, and to launch Jupyter Notebook or Jupyter Lab.
Jupyter Notebook Basics
- Jupyter Notebook lets you write, run, and share live Python code, visualizations, and text.
- Cells can be in code, markdown, or raw NBConvert mode; use code cells for code and markdown cells for formatted text.
- Use keyboard shortcuts (e.g., Shift+Enter to run, A/B to add cells, M/Y to change cell types).
- Save work regularly; notebooks autosave but manual checkpoints are possible.
Python Fundamentals
- Use `print()` to display messages; comments start with `#`.
- Strings are text data and have methods like `.upper()`, `.lower()`, `.count()`, `.replace()`.
- Variables store data using the `=` operator; use double quotes for strings containing apostrophes.
- Lists store ordered, mutable collections of items; use zero-based indexing and can be sliced or modified.
- Use methods like `.append()`, `.insert()`, `.remove()`, `.pop()`, `.sort()`, `.copy()` for list management.
- Dictionaries store key-value pairs (insertion-ordered since Python 3.7); access and modify them with methods like `.keys()`, `.values()`, `.items()`, `.update()`, `.copy()`, `.pop()`.
- Control flow uses `if`, `elif`, `else` for conditions and `for` loops for iteration.
- Functions are defined with `def` and return values via `return`.
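A minimal sketch tying the fundamentals above together; the names and values are invented for illustration.

```python
# Strings: double quotes allow an apostrophe inside; methods return new strings.
greeting = "it's a nice day"

# Lists: ordered, mutable, zero-indexed.
scores = [3, 1, 2]
scores.append(4)
scores.sort()  # sorts in place -> [1, 2, 3, 4]

# Dictionaries: key-value pairs.
person = {"name": "Ada", "role": "analyst"}
person.update({"role": "data scientist"})

# Control flow + a function.
def describe(p):
    """Return a sentence built from a person dictionary."""
    if "name" in p:
        return f"{p['name']} is a {p['role']}"
    return "unknown"

summary = describe(person)
print(greeting.upper())  # string method: IT'S A NICE DAY
```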
Working with Modules & Libraries
- Import modules (e.g., `import os`) to use their functions (e.g., file management).
- Install additional libraries through Anaconda Navigator or `pip`.
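A quick sketch of the `os` example; the `data/config.csv` path is hypothetical and only shows how paths are built portably.

```python
import os

# List the entries in the current working directory.
entries = os.listdir(".")

# Build a path in an OS-independent way (hypothetical path, for illustration).
config_path = os.path.join("data", "config.csv")
```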
Pandas for Data Analysis
- Pandas excels in data wrangling, cleaning, and analysis, comparable to Excel but with more power.
- Core data structures: Series (1D) and DataFrame (2D).
Creating & Loading DataFrames
- Create DataFrames from arrays, lists, dictionaries, or CSV files using `pd.DataFrame()` or `pd.read_csv()`.
- CSV files should be in the same directory as your notebook for easy importing.
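A minimal sketch of building a DataFrame from a dictionary; the city data is made up. Loading a CSV that sits next to the notebook would look the same with `pd.read_csv("file.csv")`.

```python
import pandas as pd

# Dictionary keys become column names; list values become the rows.
df = pd.DataFrame({
    "city": ["Berlin", "Paris", "Rome"],
    "population_m": [3.6, 2.1, 2.8],
})
```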
Exploring DataFrames
- Use `.head()`, `.tail()` to view sample rows.
- Attributes like `.shape`, `.columns`, `.dtypes` provide structural info.
- Methods like `.info()` and `.describe()` summarize structure and statistics; `.round()` rounds numeric values for readability.
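A short sketch of these exploration tools on a toy DataFrame (values invented):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.234, 5.678, 9.012], "b": ["x", "y", "z"]})

first_rows = df.head(2)   # first n rows (default 5)
shape = df.shape          # (rows, columns) tuple
rounded = df.round(1)     # round numeric columns to 1 decimal place
df.info()                 # prints column names, dtypes, non-null counts
```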
Selecting & Modifying Data
- Select columns: `df['column']` or `df[['col1','col2']]`.
- Add new columns by assignment.
- Use `.assign()`, `.insert()` for more controlled additions.
- Perform column/row operations with built-in or custom functions using `.apply()` or `.map()`.
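The selection and modification steps above can be sketched as follows (toy price/quantity data, invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

prices = df["price"]                   # single column -> Series
subset = df[["price", "qty"]]          # list of columns -> DataFrame
df["total"] = df["price"] * df["qty"]  # new column by assignment

# Element-wise transformation with .map() and a custom function.
df["label"] = df["qty"].map(lambda q: "bulk" if q > 1 else "single")
```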
Filtering & Sorting Data
- Filter with boolean indexing: `df[df['col'] > value]`.
- Sort rows by column values using `.sort_values()`.
- Set and sort indexes with `.set_index()` and `.sort_index()`.
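A compact sketch of filtering and sorting on invented exam-score data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["c", "a", "b"], "score": [70, 90, 80]})

high = df[df["score"] > 75]                  # boolean indexing keeps matching rows
by_score = df.sort_values("score", ascending=False)
indexed = df.set_index("name").sort_index()  # label index, sorted alphabetically
```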
Handling Missing Values
- Identify missing values with `.isnull()` and handle them via `.dropna()`, `.fillna()`.
- Choose to drop or fill missing data depending on analysis goals.
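Both strategies can be sketched side by side on a tiny DataFrame with injected `NaN` values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

mask = df.isnull()     # True where values are missing
dropped = df.dropna()  # drop every row containing any NaN
filled = df.fillna(0)  # or: replace NaN with a constant instead
```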
Data Cleaning Techniques
- Use string methods to clean and format text columns (`.str.strip()`, `.str.upper()`, `.str.replace()`).
- Flag duplicates with `.duplicated()` and remove them with `.drop_duplicates()`.
- List unique values with `.unique()` and count them with `.nunique()`.
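A small sketch of the cleaning steps above, using a made-up column with inconsistent casing and whitespace:

```python
import pandas as pd

df = pd.DataFrame({"city": ["  berlin", "Berlin ", "paris"]})

# Normalize whitespace and case so duplicates become detectable.
df["city"] = df["city"].str.strip().str.upper()

dupes = df.duplicated()       # flags the second "BERLIN"
clean = df.drop_duplicates()
n_cities = clean["city"].nunique()
```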
Grouping & Aggregating Data
- Group data by categories using `.groupby()` and aggregate with functions like `sum()`, `mean()`, `count()`, or `.agg()`.
- Filter groups based on aggregate conditions using `.filter()`.
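A sketch of grouping, aggregating, and group-level filtering on invented sales data:

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["A", "A", "B", "B", "B"],
    "sales": [10, 20, 5, 5, 5],
})

totals = df.groupby("dept")["sales"].sum()                  # one number per group
stats = df.groupby("dept")["sales"].agg(["mean", "count"])  # several aggregates at once

# .filter() keeps or drops entire groups based on an aggregate condition.
big = df.groupby("dept").filter(lambda g: g["sales"].sum() > 16)
```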
Merging & Concatenating Data
- Stack DataFrames with `pd.concat()` or join them on shared key columns with `pd.merge()`.
- Understand different join types: inner, outer (full), left, right, and exclusive joins.
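The join behavior can be sketched on two toy tables that overlap on an `id` column (data invented):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [20, 30, 40]})

stacked = pd.concat([left, left])                    # rows stacked one under the other
inner = pd.merge(left, right, on="id")               # only ids present in both (2, 3)
outer = pd.merge(left, right, on="id", how="outer")  # all ids, NaN where missing
```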
Data Visualization
- Use the DataFrame `.plot()` method for line plots, bar charts, histograms, pie charts, boxplots, and scatter plots.
- Customize with arguments like `kind`, `title`, `xlabel`, `ylabel`, `figsize`.
- For interactive plots, use libraries like Plotly with Cufflinks.
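A sketch of `.plot()` with the customization arguments listed above; the sales figures are invented, and the `Agg` backend is used only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import pandas as pd

df = pd.DataFrame({"year": [2020, 2021, 2022], "sales": [100, 150, 130]})

# .plot() dispatches on `kind`; here a labelled bar chart sized 6x4 inches.
ax = df.plot(kind="bar", x="year", y="sales",
             title="Sales by year", xlabel="Year", ylabel="Sales",
             figsize=(6, 4))
```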
Introduction to Machine Learning
- Machine learning automates pattern recognition and prediction.
- Use scikit-learn (`sklearn`) for classic algorithms like linear regression, logistic regression, decision trees, SVM, and Naive Bayes.
- Prepare data: split into train/test sets using `train_test_split`.
- Convert text to numeric features with `CountVectorizer` or `TfidfVectorizer`.
- Fit models and use metrics like accuracy, confusion matrix, F1-score, and classification report for evaluation.
- Use grid search (`GridSearchCV`) to tune model hyperparameters.
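A minimal end-to-end sketch of the pipeline above; the eight toy sentences and their sentiment labels are made up purely for illustration, so the resulting scores carry no real meaning.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

texts = ["great product", "loved it", "terrible quality", "awful experience",
         "really great", "truly awful", "loved the quality", "terrible product"]
labels = [1, 1, 0, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative (invented)

# Split, vectorize, fit, predict, evaluate.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

vec = CountVectorizer()        # bag-of-words features
model = LogisticRegression()
model.fit(vec.fit_transform(X_train), y_train)

preds = model.predict(vec.transform(X_test))
acc = accuracy_score(y_test, preds)
```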
Key Terms & Definitions
- DataFrame — two-dimensional labeled data structure in pandas.
- Series — one-dimensional labeled array in pandas.
- Indexing — selecting elements from lists, arrays, or DataFrames using integer or label positions.
- Aggregation — summarizing data via functions like sum, mean, or count.
- Boolean Indexing — filtering data using logical conditions.
- Regular Expression — pattern syntax for string searching or text processing.
- Imbalanced Data — unequal representation of classes in a dataset.
- Precision/Recall/F1 — metrics for evaluating classification models' performance.
- Machine Learning Model — algorithm trained to recognize patterns and make predictions.
Action Items / Next Steps
- Download required datasets and place CSV files in your Jupyter Notebook directory.
- Practice importing, cleaning, and analyzing sample datasets using Pandas.
- Try building a small machine learning model using scikit-learn and evaluate it with accuracy and F1-score.
- Complete any assigned exercises or projects and review the provided cheat sheet for reference.