Introduction to Data Wrangling and Analysis Using R

Instructor: Dr. Emily Nortman

Overview

Walkthrough of the Data Wrangling chapter
Tasks: Selecting, filtering, cleaning, and processing data
Tools: R and RStudio
Required Packages: tidyverse, medicaldata

Setting Up the Project

Create New Project:
- File -> New Project -> New Directory -> New Project
- Name it: intro_wrangling_analysis
- Save in appropriate folder
Create R Markdown Document:
- File -> New File -> R Markdown
- Name it: Chapter 2
- Save in project directory

Loading Data and Packages

Load tidyverse and medicaldata packages
Data sets used: opt and polyps
Inspect the data sets with data() and the ? help operator
Delete the placeholder content from the default R Markdown template
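
The loading steps above might look like the following in the first code chunk. This is a minimal sketch that assumes both packages are already installed; glimpse() is one of several ways to inspect a data set and is not prescribed by these notes.

```r
# Load the packages listed under Required Packages
library(tidyverse)
library(medicaldata)

# List the data sets that ship with medicaldata
data(package = "medicaldata")

# Open the help pages for the two data sets used in this walkthrough
?opt
?polyps

# Quick structural check: column names, types, and first values
glimpse(opt)
glimpse(polyps)
```

Checking the real column names at this point is worthwhile, because every later step refers to columns by name.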

Data Wrangling Using `tidyverse`

Selecting Variables (Columns)

Function: select

Syntax:

new_data <- original_data %>% select(column1, column2)

Example:

opt_select <- opt %>% select(Clinic, Age, Education)

Renaming Columns (select keeps only the columns listed; use rename to keep all other columns):

opt_rename <- opt %>% select(newName = oldName)

Selecting by Column Number:

opt_select <- opt %>% select(2, 4, 3, 10)

Deselecting Columns:

opt_deselect <- opt %>% select(-columnName)
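
To see the select variants side by side, here is a small sketch on a made-up table. The tibble `patients` and its values are hypothetical stand-ins, not rows from opt.

```r
library(tidyverse)

# Made-up stand-in for opt; values are illustrative only
patients <- tibble(
  PID       = c(1, 2, 3),
  Clinic    = c("NY", "KY", "MN"),
  Age       = c(28, 31, 24),
  Education = c("HS", "College", "HS")
)

# select() with new = old renames, but keeps only the columns listed
renamed_only <- patients %>% select(Site = Clinic)
names(renamed_only)   # "Site"

# rename() changes the name and keeps every other column
renamed_all <- patients %>% rename(Site = Clinic)
names(renamed_all)    # "PID" "Site" "Age" "Education"

# Select by position (columns 1 and 3)
by_position <- patients %>% select(1, 3)
names(by_position)    # "PID" "Age"

# Drop one column by name
without_education <- patients %>% select(-Education)
names(without_education)   # "PID" "Clinic" "Age"
```

Selecting by name is usually safer than selecting by position, since positions shift whenever columns are added or removed.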

Filtering Observations (Rows)

Function: filter

Syntax:

filtered_data <- original_data %>% filter(condition)

Examples:

filtered_data <- opt %>% filter(Clinic == "NY")
bmi_diabetes <- opt %>% filter(BMI >= 30 & Diabetes == "Yes")
bmi_bgs <- opt %>% filter(Diabetes == "Yes" | BMI >= 30)

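Common Mistakes: confusing & (AND) with | (OR), combining conditions in the wrong order (use parentheses when mixing & and |), and mismatched text case (string comparisons are case-sensitive, so "yes" does not match "Yes")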
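
The difference between & and |, and the effect of text case, can be checked on a made-up table. The tibble `patients` below is hypothetical and its values are not taken from opt.

```r
library(tidyverse)

# Made-up rows for illustration only
patients <- tibble(
  PID      = 1:4,
  BMI      = c(32, 24, 35, 22),
  Diabetes = c("Yes", "Yes", "No", "No")
)

# AND: both conditions must hold (only PID 1)
both <- patients %>% filter(BMI >= 30 & Diabetes == "Yes")

# OR: at least one condition holds (PID 1, 2, and 3)
either <- patients %>% filter(BMI >= 30 | Diabetes == "Yes")

# Case matters: "yes" matches nothing when the data say "Yes"
wrong_case <- patients %>% filter(Diabetes == "yes")

nrow(both)        # 1
nrow(either)      # 3
nrow(wrong_case)  # 0
```

Counting rows after each filter is a quick sanity check that the condition does what was intended.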

Arranging/Sorting Data

Function: arrange

Syntax:

sorted_data <- original_data %>% arrange(column1, column2)

Example:

arranged_data <- polyps %>% arrange(Sex, Baseline)
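
A short sketch of sort order on a made-up table follows. The tibble `toy` is hypothetical, and desc() is shown as an optional extra for descending order.

```r
library(tidyverse)

# Made-up rows for illustration; not taken from polyps
toy <- tibble(
  Sex      = c("male", "female", "male", "female"),
  Baseline = c(20, 7, 5, 34)
)

# Sort by Sex first, then by Baseline within each Sex (ascending)
ascending <- toy %>% arrange(Sex, Baseline)
ascending$Baseline    # 7 34 5 20

# desc() reverses the order for that one column only
descending <- toy %>% arrange(Sex, desc(Baseline))
descending$Baseline   # 34 7 20 5
```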

Creating and Modifying Columns

Function: mutate

Syntax:

new_data <- original_data %>% mutate(new_column = expression)

Example:

polyps2 <- polyps %>% mutate(treatment1 = Baseline - three_months)

Using case_when for Conditional Mutation:

polyps3 <- polyps2 %>% mutate(Improvement = case_when(
   Total > 0 ~ "Decline",
   Total == 0 ~ "No Change",
   Total < 0 ~ "Improvement"
))
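
The two steps (compute a difference, then label it) can be tried on a made-up table. The tibble `toy`, the column `change`, and the labels are hypothetical choices for this sketch and are not the labels used above.

```r
library(tidyverse)

# Made-up counts for illustration; not taken from polyps
toy <- tibble(
  Baseline     = c(20, 7, 5),
  three_months = c(12, 7, 9)
)

toy2 <- toy %>%
  mutate(
    # Positive change means the count went down after three months
    change = Baseline - three_months,
    # Conditions are checked top to bottom; the first match wins
    direction = case_when(
      change > 0  ~ "Fewer polyps",
      change == 0 ~ "No change",
      change < 0  ~ "More polyps"
    )
  )

toy2$change      # 8 0 -4
toy2$direction   # "Fewer polyps" "No change" "More polyps"
```

Rows that match none of the conditions, including rows where the tested value is NA, receive NA in the new column.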

Handling Missing Data

Finding Missing Data:

polyps %>% summarize(missing_baseline = sum(is.na(Baseline)), ...)

Dropping Rows with NAs:

polyps6 <- polyps %>% drop_na(column_name)

Ignoring NAs in Calculations:

summary_data <- polyps %>% summarize(mean_value = mean(column_name, na.rm = TRUE))

Replacing NAs:

polyps7 <- polyps %>% mutate(column_name = replace_na(column_name, 0))
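
The four approaches give different answers on the same data, which a made-up table makes visible. The tibble `toy` is hypothetical and contains one deliberately missing value.

```r
library(tidyverse)

# Made-up data with one missing Baseline value
toy <- tibble(
  id       = 1:4,
  Baseline = c(20, NA, 5, 35)
)

# Count the missing values in one column
n_missing <- toy %>% summarize(missing_baseline = sum(is.na(Baseline)))
n_missing$missing_baseline          # 1

# A single NA makes the mean NA unless na.rm = TRUE is set
mean(toy$Baseline)                  # NA
mean(toy$Baseline, na.rm = TRUE)    # 20

# Drop rows where Baseline is missing
complete_rows <- toy %>% drop_na(Baseline)
nrow(complete_rows)                 # 3

# Replace NA with 0; note that this changes the mean
filled <- toy %>% mutate(Baseline = replace_na(Baseline, 0))
mean(filled$Baseline)               # 15
```

Replacing NA with 0 treats an unknown value as a real zero, so it is only appropriate when zero is what the missing entry actually means.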

Summarizing Data

Function: summarize

Syntax:

summary_data <- original_data %>% summarize(mean_col = mean(column_name), ...)

Example:

polyps_summary <- polyps %>% summarize(mean_baseline = mean(Baseline), ...)
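
Several statistics can be computed in a single summarize call. The sketch below uses a made-up tibble `toy`; the choice of statistics is illustrative only.

```r
library(tidyverse)

# Made-up values for illustration; not taken from polyps
toy <- tibble(Baseline = c(10, 20, 30, 40))

# Each argument becomes one column in a one-row result
toy_summary <- toy %>%
  summarize(
    n_rows          = n(),
    mean_baseline   = mean(Baseline),
    median_baseline = median(Baseline),
    sd_baseline     = sd(Baseline)
  )

toy_summary$n_rows          # 4
toy_summary$mean_baseline   # 25
```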

Grouped Operations

Functions: group_by and summarize

Syntax:

grouped_data <- original_data %>% group_by(group_column) %>% summarize(mean_col = mean(value_column), ...)

Example:

grouped_summary <- polyps %>% group_by(Sex) %>% summarize(mean_total = mean(Total))
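
A grouped summary returns one row per group. The sketch below uses a made-up tibble `toy` with hypothetical values, and adds n() so that the group sizes appear next to the means.

```r
library(tidyverse)

# Made-up rows for illustration; not taken from polyps
toy <- tibble(
  Sex   = c("female", "female", "male", "male", "male"),
  Total = c(4, 6, 10, 20, 30)
)

# One output row per level of Sex
by_sex <- toy %>%
  group_by(Sex) %>%
  summarize(
    n_rows     = n(),
    mean_total = mean(Total)
  )

by_sex$Sex          # "female" "male"
by_sex$n_rows       # 2 3
by_sex$mean_total   # 5 20
```

Reporting the group size alongside the mean helps spot groups that are too small to interpret.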

Rounding Numbers

Problem with Default Rounding: R's round() sends values that end in exactly 5 to the nearest even digit (round half to even), so round(2.5) returns 2, not 3

Defining a Round-Half-Up Function:

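round2 <- function(x, digits) {
    posneg <- sign(x)            # remember the sign of each value
    z <- abs(x)*10^digits        # shift the target digit to the ones place
    z <- z + 0.5                 # add one half so that halves cross to the next integer
    z <- trunc(z)                # drop the fractional part
    z <- z/10^digits             # shift back to the original scale
    z*posneg                     # restore the sign
}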

Usage:

rounded_data <- original_data %>% mutate(new_column = round2(column_name, 1))
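
The difference between the two rounding rules shows up only at exact halves. The sketch below repeats the round2 definition from these notes so that it runs on its own; the test values are arbitrary.

```r
# Same definition as above, repeated so this chunk is self-contained
round2 <- function(x, digits) {
  posneg <- sign(x)
  z <- abs(x) * 10^digits
  z <- z + 0.5
  z <- trunc(z)
  z <- z / 10^digits
  z * posneg
}

# Base R rounds halves to the nearest even digit
round(0.5)        # 0
round(2.5)        # 2

# round2 rounds halves away from zero
round2(0.5, 0)    # 1
round2(2.5, 0)    # 3
round2(-2.5, 0)   # -3
```

Many decimal fractions cannot be stored exactly as floating-point numbers, so a value that looks like an exact half on screen may not be one internally; results for such values can differ from what hand rounding would suggest.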

Conclusion

Importance of understanding your data
Essential to perform sanity checks
Practice with familiar datasets
Extend learning to other problems using the tidyverse functions
Reach out to discussion forums for help
