FastQC Report Interpretation for RNA Data

Jul 29, 2025

Overview

This lecture explains how to interpret FastQC reports for RNA sequencing data, covering quality metrics and common issues to help guide preprocessing steps.

FastQC Report Basics

  • FastQC uses green (pass), yellow (warning), and red (fail) indicators to summarize QC results.
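  • Warnings can often be tolerated, but each failed check should be investigated and addressed before further analysis. FastQC judges every file against a random, diverse library, so a flag should be read in the context of what is expected for RNA data.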
  • Basic statistics section includes file name, type, total reads, total bases, GC content, and sequence length range.
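The pass/warning/fail indicators are also written to the summary.txt file inside each FastQC output archive, one tab-separated line per module (status, module name, input file). As a minimal sketch, the snippet below tallies those statuses and lists the modules to look at first. The file content and the sample_R1.fastq.gz name are made-up illustrations, and the helper names are hypothetical.

```python
from collections import Counter

# Made-up excerpt in the tab-separated layout of FastQC's summary.txt:
# status, module name, input file name.
SUMMARY_TEXT = """PASS\tBasic Statistics\tsample_R1.fastq.gz
PASS\tPer base sequence quality\tsample_R1.fastq.gz
WARN\tPer base sequence content\tsample_R1.fastq.gz
FAIL\tSequence Duplication Levels\tsample_R1.fastq.gz
PASS\tAdapter Content\tsample_R1.fastq.gz
"""


def parse_summary(text):
    """Return a list of (status, module) pairs from summary.txt content."""
    results = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        status, module, _filename = line.split("\t")
        results.append((status, module))
    return results


def modules_needing_attention(results):
    """Return flagged modules, FAIL entries first, then WARN entries."""
    order = {"FAIL": 0, "WARN": 1}
    flagged = [r for r in results if r[0] in order]
    return sorted(flagged, key=lambda r: order[r[0]])


results = parse_summary(SUMMARY_TEXT)
counts = Counter(status for status, _ in results)
print(dict(counts))
for status, module in modules_needing_attention(results):
    print(f"{status}: {module}")
```

To use it on a real report, read the text of summary.txt from the unzipped FastQC output and pass it to parse_summary.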

Quality Metrics Interpretation

  • Per base sequence quality shows quality scores for each position across all reads using box plots.
  • Median value (red line), interquartile range (yellow box), and whiskers (10th/90th percentiles) help visualize base quality distribution.
  • High-quality data has most bases with scores above 30; boxes that extend into the red background region of the plot (scores below about 20) indicate poor-quality base calls.
  • Per sequence quality scores summarizes the average quality per read; most reads should fall in the high-quality (upper) range.
  • Reads with a low average quality (10–20) are problematic and should be flagged for trimming or filtering during preprocessing.
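To make the score thresholds above concrete, the sketch below converts Phred scores to error probabilities using Q = -10 · log10(P) and computes the average quality of a single read. It assumes the common Phred+33 ASCII encoding of FASTQ quality strings; the function names and the example quality strings are illustrative only.

```python
def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def mean_read_quality(quality_string, offset=33):
    """Arithmetic mean of the Phred scores in one read."""
    scores = decode_qualities(quality_string, offset)
    return sum(scores) / len(scores)


for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_probability(q):.4f}")

# 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
print(mean_read_quality("IIII"))
print(mean_read_quality("II55"))
```

A score of 30 therefore means roughly one wrong call per 1,000 bases, while a score of 20 means about one per 100, which is why reads averaging 10–20 stand out.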

Sequence Content and Distribution

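  • Per base sequence content checks the proportion of A, T, G, C at each position; in a random library the four lines stay roughly flat and parallel, with proportions that reflect the overall base composition of the sample.
  • Unequal base content at read ends often results from adapter contamination; in RNA libraries, bias over roughly the first dozen bases is also commonly caused by random hexamer priming.
  • Per sequence GC content should follow a roughly normal distribution centered on the expected GC content of the organism (not necessarily 50%); a shifted peak, extra peaks, or an unusually broad shape may indicate contamination.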
  • Per base N content shows the percentage of ambiguous calls (N) at each position; more than 5% at any position is a warning, more than 20% is a fail.
  • Sequence length distribution visualizes the range of read lengths; a tight range at expected length is ideal.
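The per-position checks in this section can be illustrated with a small calculation. The sketch below tabulates base percentages at each position for a handful of made-up reads, applies the 5% and 20% N thresholds quoted above, and computes the GC percentage of a single read. It is a toy illustration with hypothetical function names, not a reimplementation of FastQC.

```python
from collections import Counter


def per_base_content(reads):
    """Percent of A, C, G, T and N at each position across the reads."""
    max_len = max(len(r) for r in reads)
    table = []
    for pos in range(max_len):
        # Only reads long enough to cover this position contribute.
        column = Counter(r[pos] for r in reads if len(r) > pos)
        total = sum(column.values())
        table.append({b: 100 * column.get(b, 0) / total for b in "ACGTN"})
    return table


def n_content_status(table, warn=5.0, fail=20.0):
    """Classify the worst per-position N percentage."""
    worst = max(row["N"] for row in table)
    if worst > fail:
        return "FAIL"
    if worst > warn:
        return "WARN"
    return "PASS"


def gc_percent(read):
    """Percent of G and C bases in one read."""
    return 100 * (read.count("G") + read.count("C")) / len(read)


toy_reads = ["ACGT", "ACGT", "AGGT", "ACNT"]  # made-up reads
table = per_base_content(toy_reads)
for pos, row in enumerate(table, start=1):
    print(pos, row)
print(n_content_status(table))
print(gc_percent("ACGT"))
```

With only four reads, a single N at the third position already amounts to 25% at that position, so the toy data is classified as a fail; real libraries contain far more reads per position.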

Duplication and Overrepresentation

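  • Sequence duplication levels indicate repeated reads, often from PCR amplification; excessively duplicated reads may need to be removed, although in RNA data highly expressed transcripts also produce genuine duplicates.
  • Overrepresented sequences are individual sequences that account for an unusually large share of all reads (FastQC lists any sequence above 0.1% of the total); they typically come from adapters or contaminants and should be removed.
  • Duplication describes how often reads are repeated across the library as a whole; overrepresentation singles out the specific sequences responsible for a large share of the reads.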
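The difference between the two measures can be sketched by counting identical reads. The example below is a simplification on made-up reads with hypothetical function names: it counts exact full-length matches over all reads, whereas FastQC itself works on a sample of the file, so its numbers will differ.

```python
from collections import Counter


def duplication_percent(reads):
    """Percent of reads that are extra copies of an already-seen sequence."""
    distinct = len(set(reads))
    return 100 * (len(reads) - distinct) / len(reads)


def overrepresented(reads, threshold_percent=0.1):
    """Sequences whose share of all reads exceeds threshold_percent."""
    total = len(reads)
    hits = []
    for seq, n in Counter(reads).most_common():
        share = 100 * n / total
        if share > threshold_percent:
            hits.append((seq, n, share))
    return hits


# Made-up library: one sequence dominates, one is moderately repeated.
toy_reads = ["AAAA"] * 6 + ["CCCC"] * 3 + ["GGGG"]

print(duplication_percent(toy_reads))
# A 25% cutoff is used here only because the toy library is tiny.
for seq, n, share in overrepresented(toy_reads, threshold_percent=25):
    print(seq, n, share)
```

The duplication figure summarizes the whole library in one number, while the overrepresented list names the individual sequences worth checking against known adapters or contaminants.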

Adapter Content

  • Adapter content section detects non-biological sequence contamination (adapters); an adapter present in more than 5% of reads is a warning, in more than 10% a fail.
  • Adapters should be removed before downstream analysis.
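As a rough illustration of the thresholds above, the sketch below measures the share of reads containing an adapter and clips each read at the adapter start. It assumes the commonly used Illumina universal adapter prefix AGATCGGAAGAGC and exact matching only; the reads and function names are made up. Dedicated trimming tools also handle mismatches and partial adapters at read ends, which this toy version does not.

```python
# Assumed adapter: the widely used Illumina universal adapter prefix.
ILLUMINA_UNIVERSAL_ADAPTER = "AGATCGGAAGAGC"


def adapter_percent(reads, adapter):
    """Percent of reads containing an exact copy of the adapter."""
    hits = sum(1 for r in reads if adapter in r)
    return 100 * hits / len(reads)


def adapter_status(percent, warn=5.0, fail=10.0):
    """Apply the warning and fail thresholds to an adapter percentage."""
    if percent > fail:
        return "FAIL"
    if percent > warn:
        return "WARN"
    return "PASS"


def trim_adapter(read, adapter):
    """Remove the adapter and everything after it (exact match only)."""
    idx = read.find(adapter)
    return read if idx == -1 else read[:idx]


# Made-up library: 2 of 10 reads run into the adapter.
toy_reads = (
    ["ACGTACGT" + ILLUMINA_UNIVERSAL_ADAPTER + "AAAA"] * 2
    + ["ACGTACGTACGTACGTACGT"] * 8
)

percent = adapter_percent(toy_reads, ILLUMINA_UNIVERSAL_ADAPTER)
print(percent, adapter_status(percent))
trimmed = [trim_adapter(r, ILLUMINA_UNIVERSAL_ADAPTER) for r in toy_reads]
print(trimmed[0])
```

After trimming, rerunning the same percentage check on the trimmed reads should report no remaining adapter.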

Key Terms & Definitions

  • FastQC — a tool for assessing the quality of high-throughput sequence data.
  • Quality score (Phred score) — numerical measure of base calling accuracy.
  • Adapter — artificial sequence added during library preparation for sequencing.
  • GC content — percentage of guanine and cytosine nucleotides in a sequence.
  • N (ambiguous base) — base that could not be confidently determined by the sequencer.

Action Items / Next Steps

  • Watch earlier videos in the series (Parts 4, 5, and 10) for background and practical FastQC usage.
  • Review your FastQC report and note any failed or warning sections for preprocessing.
  • Prepare for the next video, which will cover how to preprocess and clean your sequencing data.