Introduction to R for Data Science

Sep 6, 2024

R Introduction Lecture Notes

Course Overview

  • Instructor: Barton Poulson
  • Goal: Introduce R and its applications in data science.
  • Importance of R:
    • Ranked first in a survey of data mining experts on the software they use for data science, about 50% ahead of Python.

Reasons to Learn R

  • Cost-effective: Free and open-source, unlike many alternatives that cost thousands annually.
  • Optimized for Vector Operations: Operates on entire rows, columns, or tables at once, without explicit loops.
  • Strong Community Support: Access to examples, assistance, and continuous development.
  • Extensive Package Availability: Over 9,000 contributed packages enhance R's functionality.
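Vector operations can be illustrated with a minimal sketch. The heights_cm values are hypothetical; iris is the dataset that ships with R and is used later in these notes.

```r
# Vectorized arithmetic: one expression operates on every element
heights_cm <- c(150, 165, 180)   # hypothetical values
heights_m <- heights_cm / 100    # no loop needed

# The same idea applies to whole columns of a data frame
petal_area <- iris$Petal.Length * iris$Petal.Width

# Equivalent explicit loop, shown only for comparison
petal_area_loop <- numeric(nrow(iris))
for (i in seq_len(nrow(iris))) {
  petal_area_loop[i] <- iris$Petal.Length[i] * iris$Petal.Width[i]
}
```

Both approaches give the same numbers; the vectorized form is shorter and is the usual idiom in R.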

Getting Started with R

Installing R

  1. Go to r-project.org.
  2. Click on "Download R" to access CRAN.
  3. Choose and download the appropriate installer for your operating system (Mac, Windows, Linux).
  4. Install R by following standard installation procedures.

Course Files

  • Download course files from the provided link and unzip them.
  • Place the folder containing .R scripts and data files on your desktop.

Using R Environment

Basic R Interface

  • Script Window: Where you write your code.
  • Console Window: Displays output and results.
  • Commented Lines: Begin with # and are not executed.

Running Code Example

  • Load datasets and explore functionality using the well-known Iris dataset:
    • View first six lines using head(iris).
    • Get summary statistics with summary(iris).
    • Create basic plots with plot() commands.
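The commands above can be run as a short script. Lines starting with # are comments and are skipped by R.

```r
# iris ships with R in the datasets package, so nothing needs importing
first_rows <- head(iris)   # first six rows
first_rows

summary(iris)              # summary statistics for every column

plot(iris)                 # scatterplot matrix of all variables
```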

Introduction to RStudio

  • RStudio: A user-friendly interface that organizes R's functionality better than the basic R interface.
  • Setting up RStudio: Download from rstudio.com and install.
  • RStudio Interface: Consists of script, console, environment, and plots panes for easy navigation and management.

Working with R Packages

Importance of Packages

  • Packages are collections of code that extend R's capabilities.
  • Two types of packages:
    • Base Packages: Installed with R. A core set (e.g., base, stats, graphics) loads at startup; the others must be loaded with library().
    • Contributed Packages: Need to be downloaded and loaded separately.

Finding and Installing Packages

  • Use CRAN, Crantastic, or GitHub to find packages.
  • Commonly used packages include:
    • dplyr: Data manipulation.
    • tidyr: Data tidying (reshaping data into a tidy layout).
    • ggplot2: Data visualization.

Loading Packages in R

  • To install a package, use install.packages("package_name"). This is needed only once per R installation.
  • Use library(package_name) to load an installed package. This is needed in every new session.
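A minimal sketch of the install-and-load workflow. The install line is commented out because it needs an internet connection; splines is used for the loading step only because it ships with R, so the example runs without downloading anything.

```r
# One-time install of a contributed package (uncomment to run; needs internet)
# install.packages("dplyr")

# splines is installed with R but is not attached at startup
"package:splines" %in% search()   # is it on the search path yet?

library(splines)                  # load it for this session

loaded <- "package:splines" %in% search()
loaded                            # TRUE once library() has run
```

The same library() call works for a contributed package such as dplyr once it has been installed.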

Basic Graphics in R

Creating Basic Plots

  • Use plot() for scatterplots of quantitative data.
  • Create bar charts for categorical data using barplot().
  • Use histograms for quantitative data with hist() commands.
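The three plot types can be sketched with the built-in iris data. The choice of variables here is only illustrative.

```r
# Scatterplot of two quantitative variables
plot(iris$Sepal.Length, iris$Petal.Length,
     xlab = "Sepal length", ylab = "Petal length")

# Bar chart of a categorical variable: count the categories first
species_counts <- table(iris$Species)
barplot(species_counts, main = "Cases per species")

# Histogram of one quantitative variable
hist(iris$Sepal.Width, main = "Sepal width", xlab = "cm")
```

Note that barplot() expects counts (or other heights), so the categorical variable goes through table() first.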

Visualizing Data

  • Create multiple graphs for comparative analysis (small multiples).
  • Overlaid plots can provide richer data visualization.
  • Ensure clarity and avoid clutter in graphs.
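One possible way to build small multiples and an overlaid plot in base graphics, assuming the iris data. par(mfrow = ...) splits the plotting area into panels, and lines() draws on top of an existing plot.

```r
# Small multiples: one histogram per species, stacked in three rows
old_par <- par(mfrow = c(3, 1))
for (sp in levels(iris$Species)) {
  hist(iris$Petal.Width[iris$Species == sp],
       xlim = c(0, 3), breaks = 9,   # shared x-axis makes panels comparable
       main = paste("Petal width:", sp), xlab = "")
}
par(old_par)   # restore the single-panel layout

# Overlaid plot: histogram on a density scale plus a kernel density curve
hist(iris$Sepal.Width, freq = FALSE, main = "Sepal width", xlab = "cm")
lines(density(iris$Sepal.Width), lwd = 2)
```

Keeping the same x-axis limits in every panel is what makes the comparison across groups readable.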

Basic Statistics in R

Summary Statistics

  • Use summary() to get basic statistics for datasets.
  • Use describe() from the psych package for more detailed statistics.
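A short sketch of both functions on the iris data. The psych call is guarded so the script still runs if that package is not installed.

```r
# summary() works on a single variable or a whole data frame
sl_summary <- summary(iris$Sepal.Length)   # min, quartiles, median, mean, max
sl_summary
summary(iris)

# describe() adds n, sd, skew, kurtosis, and more (contributed package)
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(iris))
}
```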

Selecting Cases

  • Filter datasets by categories or values using conditional indexing.
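Conditional indexing can be sketched as follows, again with iris. The cutoff values are arbitrary and chosen only for illustration.

```r
# Select by category: rows where Species is versicolor
versicolor <- iris[iris$Species == "versicolor", ]

# Select by value: rows with short petals
short_petals <- iris[iris$Petal.Length < 2, ]

# Combine conditions with & (and) or | (or)
both <- iris[iris$Species == "virginica" & iris$Petal.Length < 5.5, ]

nrow(versicolor)
```

The comma inside the brackets matters: the condition before it selects rows, and leaving the part after it empty keeps all columns.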

Importing Data in R

Data Formats

  • R can import various formats:
    • CSV, TXT, XLSX, JSON.
  • The rio package is recommended for ease of import.

Importing Example

  1. Use import("file_path.csv") or import("file_path.txt") for CSV/TXT files; rio infers the format from the file extension.
  2. Use import("file_path.xlsx") for Excel files.
  3. Explore the imported data using View(data_frame).
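A self-contained sketch of the import steps. It writes a small temporary CSV first so that no course file is needed; csv_path and dat are illustrative names, and the base R read.csv() fallback is an addition for machines where rio is not installed.

```r
# Create a small CSV in a temporary location
csv_path <- tempfile(fileext = ".csv")
write.csv(head(iris, 10), csv_path, row.names = FALSE)

if (requireNamespace("rio", quietly = TRUE)) {
  dat <- rio::import(csv_path)   # format inferred from the .csv extension
} else {
  dat <- read.csv(csv_path)      # base R fallback
}

head(dat)
# View(dat)   # opens a spreadsheet-style viewer in an interactive session
```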

Modeling Data in R

Hierarchical Clustering

  • Group data points based on similarity.
  • Use functions like dist() and hclust() for clustering.
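A minimal clustering sketch, assuming the built-in mtcars data and three arbitrarily chosen variables. Scaling first is a common choice so that no single variable dominates the distances.

```r
# Standardize a few numeric variables
cars <- scale(mtcars[, c("mpg", "hp", "wt")])

d  <- dist(cars)    # Euclidean distances between all pairs of cars
hc <- hclust(d)     # agglomerative clustering (complete linkage by default)

plot(hc)            # dendrogram

# Cut the tree into a chosen number of groups (3 is an arbitrary choice)
groups <- cutree(hc, k = 3)
table(groups)
```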

Principal Component Analysis (PCA)

  • Reduces dimensionality by transforming to fewer components.
  • Implemented using the prcomp() function.
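A short PCA sketch on the four numeric iris columns. Centering and scaling are assumptions made here, not settings taken from the lecture.

```r
# Principal components of the four measurement columns
pc <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

summary(pc)   # standard deviation and proportion of variance per component

# Share of total variance carried by each component
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
var_explained

plot(pc)      # scree plot of component variances
biplot(pc)    # cases and variable loadings on the first two components
```

If the first one or two components carry most of the variance, the data can be summarized in fewer dimensions with little loss.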

Regression Analysis

  • Predict an outcome variable using multiple predictor variables.
  • Use lm() to create linear models.
  • Explore different regression techniques (stepwise, lasso, etc.).
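A regression sketch using the built-in mtcars data, with the outcome and predictors chosen only for illustration. The stepwise part uses step() from base R; lasso requires a contributed package and is not shown.

```r
# Predict fuel economy from weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)   # coefficients, standard errors, R-squared
coef(fit)

# Backward stepwise selection by AIC, starting from all predictors
full    <- lm(mpg ~ ., data = mtcars)
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)   # predictors that remain after selection
```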

Conclusion and Next Steps

  • Explore additional resources and courses at datalab.cc.
  • Consider comparing R with Python for broader programming skills.
  • Engage with visualization concepts to enhance data presentation.
  • Explore machine learning methods for advanced data processing.