Intro to Data Wrangling and Analysis Using R

Jun 21, 2024

Instructor: Dr. Emily Nortman

Overview

  • Walkthrough of the Data Wrangling chapter
  • Tasks: Selecting, filtering, cleaning, and processing data
  • Tools: R and RStudio
  • Required Packages: tidyverse, medicaldata

Setting Up the Project

  1. Create New Project:

    • File -> New Project -> New Directory -> New Project
    • Name it: intro_wrangling_analysis
    • Save in appropriate folder
  2. Create R Markdown Document:

    • File -> New File -> R Markdown
    • Name it: Chapter 2
    • Save in project directory

Loading Data and Packages

  • Load tidyverse and medicaldata packages
  • Data sets used: opt and polyps
  • Check data with data() and the ? help operator (for example, ?opt)
  • Clear out the default R Markdown template content before adding your own code
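
The loading steps above might look like the following in the first code chunk. This is a minimal sketch, assuming both packages are already installed. Column names in R are case-sensitive, so it is worth printing them before writing code against them.

```r
# Load the two packages named in the overview.
library(tidyverse)    # attaches dplyr, tidyr, ggplot2, and related packages
library(medicaldata)  # provides the opt and polyps data sets

# List the data sets that ship with medicaldata.
data(package = "medicaldata")

# Inspect the structure and the exact column names before using them.
glimpse(opt)
names(polyps)

# Open the help page for a data set (same as typing ?opt at the console).
help("opt", package = "medicaldata")
```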

Data Wrangling Using tidyverse

Selecting Variables (Columns)

  • Function: select
  • Syntax:
    new_data <- original_data %>% select(column1, column2)
    
  • Example:
    opt_select <- opt %>% select(Clinic, Age, Education)
    
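  • Renaming Columns:
    opt_rename <- opt %>% rename(newName = oldName)   # keeps all other columns
    opt_rename <- opt %>% select(newName = oldName)   # keeps only the renamed column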
    
  • Selecting by Column Number:
    opt_select <- opt %>% select(2, 4, 3, 10)
    
  • Deselecting Columns:
    opt_deselect <- opt %>% select(-columnName)
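
The select patterns above can be tried on a small table first. The sketch below uses a hypothetical tibble called patients with made-up values standing in for opt, so the result of each call is easy to check by eye.

```r
library(dplyr)

# Hypothetical toy data standing in for opt; the values are made up.
patients <- tibble(
  PID       = c(1, 2, 3),
  Clinic    = c("NY", "KY", "MN"),
  Age       = c(28, 35, 22),
  Education = c("HS", "College", "HS")
)

# Keep two columns, in the order listed.
by_name <- patients %>% select(Clinic, Age)

# select() can rename, but it drops every column that is not listed.
renamed_subset <- patients %>% select(age_years = Age)

# rename() changes the name and keeps all other columns.
renamed_all <- patients %>% rename(age_years = Age)

# Select by position (columns 2 and 4), then drop a column by name.
by_position <- patients %>% select(2, 4)
without_pid <- patients %>% select(-PID)
```

Selecting by position is fragile if the column order ever changes, so selecting by name is usually the safer habit.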
    

Filtering Observations (Rows)

  • Function: filter
  • Syntax:
    filtered_data <- original_data %>% filter(condition)
    
  • Examples:
    filtered_data <- opt %>% filter(Clinic == "NY")
    bmi_diabetes <- opt %>% filter(BMI >= 30 & Diabetes == "Yes")
    bmi_bgs <- opt %>% filter(Diabetes == "Yes" | BMI >= 30)
    
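  • Common Mistakes: confusing & (AND, every condition must hold) with | (OR, any condition is enough); forgetting that & is evaluated before |, so mixed conditions need parentheses; and mismatched text case ("ny" does not match "NY")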
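
The three filter mistakes listed above are easiest to see on a tiny table. The following sketch uses a hypothetical tibble with made-up values; the row counts in the comments follow from those four rows only.

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration.
patients <- tibble(
  Clinic   = c("NY", "ny", "KY", "MN"),
  BMI      = c(32, 27, 31, 24),
  Diabetes = c("Yes", "Yes", "No", "No")
)

# AND: both conditions must hold (1 row here).
both <- patients %>% filter(BMI >= 30 & Diabetes == "Yes")

# OR: either condition is enough (3 rows here).
either <- patients %>% filter(BMI >= 30 | Diabetes == "Yes")

# Text matching is case-sensitive: "ny" is not "NY".
exact_ny    <- patients %>% filter(Clinic == "NY")           # 1 row
any_case_ny <- patients %>% filter(toupper(Clinic) == "NY")  # 2 rows

# & is evaluated before |, so parentheses change the result.
no_parens   <- patients %>%
  filter(Clinic == "KY" | Clinic == "NY" & Diabetes == "Yes")    # 2 rows
with_parens <- patients %>%
  filter((Clinic == "KY" | Clinic == "NY") & Diabetes == "Yes")  # 1 row
```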

Arranging/Sorting Data

  • Function: arrange
  • Syntax:
    sorted_data <- original_data %>% arrange(column1, column2)
    
  • Example:
    arranged_data <- polyps %>% arrange(Sex, Baseline)
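
arrange sorts in ascending order by default, and later columns only break ties in earlier ones. A minimal sketch on hypothetical data (the trial tibble and its values are made up), including desc() for descending order:

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration.
trial <- tibble(
  Sex      = c("male", "female", "female", "male"),
  Baseline = c(12, 30, 7, 45)
)

# Sort by Sex first, then by Baseline within each Sex (both ascending).
sorted_asc <- trial %>% arrange(Sex, Baseline)

# Wrap a column in desc() to sort it from largest to smallest.
sorted_desc <- trial %>% arrange(Sex, desc(Baseline))
```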
    

Creating and Modifying Columns

  • Function: mutate
  • Syntax:
    new_data <- original_data %>% mutate(new_column = expression)
    
  • Example:
    polyps2 <- polyps %>% mutate(treatment1 = Baseline - three_months)
    
  • Using case_when for Conditional Mutation (Total must already be a numeric column in polyps2):
    polyps3 <- polyps2 %>% mutate(Improvement = case_when(
       Total > 0 ~ "Decline",
       Total == 0 ~ "No Change",
       Total < 0 ~ "Improvement"
    ))
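
mutate and case_when can also be combined in a single call. The sketch below is run on hypothetical data; the column name change and its definition (three-month count minus baseline count, so a negative value means fewer polyps) are assumptions made for illustration, not the definition of Total used in the lecture.

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration.
trial <- tibble(
  Baseline     = c(10, 20, 15),
  three_months = c(4, 20, 25)
)

trial_change <- trial %>%
  mutate(
    # Assumed definition: negative means fewer polyps than at baseline.
    change = three_months - Baseline,
    # Conditions are checked top to bottom; the first match wins.
    status = case_when(
      change < 0  ~ "Improvement",
      change == 0 ~ "No Change",
      change > 0  ~ "Decline"
    )
  )
```

Rows where no condition matches, including rows where change is NA, receive NA in the new column.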
    

Handling Missing Data

  • Finding Missing Data:
    polyps %>% summarize(missing_baseline = sum(is.na(Baseline)), ...)
    
  • Dropping Rows with NAs:
    polyps6 <- polyps %>% drop_na(column_name)
    
  • Ignoring NAs in Calculations:
    summary_data <- polyps %>% summarize(mean_value = mean(column_name, na.rm = TRUE))
    
  • Replacing NAs:
    polyps7 <- polyps %>% mutate(column_name = replace_na(column_name, 0))
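
The four missing-data patterns above can be compared side by side on a small table. This sketch uses hypothetical data with made-up values; drop_na and replace_na come from tidyr, which the tidyverse attaches.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with some missing values.
trial <- tibble(
  Baseline     = c(10, NA, 15, 20),
  three_months = c(4, 8, NA, NA)
)

# Count the NAs in each column.
na_counts <- trial %>%
  summarize(
    missing_baseline     = sum(is.na(Baseline)),
    missing_three_months = sum(is.na(three_months))
  )

# Drop only the rows where Baseline is missing.
complete_baseline <- trial %>% drop_na(Baseline)

# Skip NAs inside a calculation without removing any rows.
mean_baseline <- trial %>%
  summarize(mean_value = mean(Baseline, na.rm = TRUE))

# Replace NAs with a fixed value.
filled <- trial %>% mutate(three_months = replace_na(three_months, 0))
```

Replacing NAs with 0 changes later means and totals, so it is only appropriate when a missing value really does mean zero.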
    

Summarizing Data

  • Function: summarize
  • Syntax:
    summary_data <- original_data %>% summarize(mean_col = mean(column_name), ...)
    
  • Example:
    polyps_summary <- polyps %>% summarize(mean_baseline = mean(Baseline), ...)
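
summarize collapses the whole table to one row, and a single NA makes mean() return NA unless na.rm = TRUE is set. A minimal sketch on hypothetical data, with a few extra summary columns added for illustration:

```r
library(dplyr)

# Hypothetical toy data with one missing value.
trial <- tibble(Baseline = c(10, 20, NA, 30))

# Without na.rm, the missing value propagates and the mean is NA.
with_na <- trial %>% summarize(mean_baseline = mean(Baseline))

# With na.rm = TRUE, the NA is skipped; counts show how many values were used.
without_na <- trial %>%
  summarize(
    n_rows        = n(),
    n_observed    = sum(!is.na(Baseline)),
    mean_baseline = mean(Baseline, na.rm = TRUE),
    sd_baseline   = sd(Baseline, na.rm = TRUE)
  )
```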
    

Grouped Operations

  • Functions: group_by and summarize
  • Syntax:
    grouped_data <- original_data %>% group_by(column_name) %>% summarize(mean_col = mean(column_name), ...)
    
  • Example:
    grouped_summary <- polyps %>% group_by(Sex) %>% summarize(mean_total = mean(Total))
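
With group_by in front, summarize returns one row per group instead of one row overall. The sketch below uses hypothetical data with made-up values; the n column and the .groups = "drop" argument are additions for illustration.

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration.
trial <- tibble(
  Sex   = c("female", "female", "male", "male", "male"),
  Total = c(10, 20, 5, 15, NA)
)

grouped_summary <- trial %>%
  group_by(Sex) %>%
  summarize(
    n          = n(),                       # rows in each group
    mean_total = mean(Total, na.rm = TRUE), # group mean, skipping NAs
    .groups    = "drop"                     # return an ungrouped tibble
  )
```

Reporting the group size next to each mean is a cheap sanity check that the groups contain what you expect.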
    

Rounding Numbers

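  • Problem with Default Rounding: R's round() sends a trailing 5 to the nearest even digit (round half to even), so round(0.5) is 0 and round(2.5) is 2
  • Defining a Round-Half-Up Function (round2):
    round2 <- function(x, digits = 0) {
        posneg <- sign(x)            # remember the sign of each value
        z <- abs(x) * 10^digits      # shift the target digit to the ones place
        z <- z + 0.5                 # push halves up to the next integer
        z <- trunc(z)                # drop the fractional part
        z <- z / 10^digits           # shift back to the original scale
        z * posneg                   # restore the sign
    }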
    
  • Usage:
    rounded_data <- original_data %>% mutate(new_column = round2(column_name, 1))
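
The difference between the two rounding rules shows up only on exact halves. The sketch below repeats the round2 definition from these notes so that it runs on its own (the default digits = 0 is an addition), then compares it with base round() on values that binary floating point represents exactly.

```r
# round2 as defined in these notes: halves are rounded away from zero.
round2 <- function(x, digits = 0) {
  posneg <- sign(x)
  z <- abs(x) * 10^digits
  z <- z + 0.5
  z <- trunc(z)
  z <- z / 10^digits
  z * posneg
}

halves <- c(0.5, 1.5, 2.5)

base_result <- round(halves)   # 0 2 2 (half to even)
half_up     <- round2(halves)  # 1 2 3 (half away from zero)

# The sign is handled separately, so negative halves move away from zero.
negative_half <- round2(-2.5)  # -3
```

Decimal values such as 0.15 are not stored exactly in binary, so results on such inputs can still differ from hand calculation under either rule.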
    

Conclusion

  • Importance of understanding your data
  • Essential to perform sanity checks
  • Practice with familiar datasets
  • Extend learning to other problems using the tidyverse functions
  • Reach out to discussion forums for help