Overview
This lecture explains how to interpret FastQC reports for RNA sequencing data, covering quality metrics and common issues to help guide preprocessing steps.
FastQC Report Basics
- FastQC uses green (pass), yellow (warning), and red (fail) indicators to summarize QC results.
- Warnings can often be ignored, but failed checks should be addressed before further analysis.
- Basic statistics section includes file name, type, total reads, total bases, GC content, and sequence length range.
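As a rough illustration of triaging the pass/warning/fail indicators, the sketch below groups module names by status. It assumes the tab-separated `summary.txt` file that FastQC writes inside its output archive (status, module name, file name); the function name `parse_fastqc_summary` and the example content are hypothetical, not taken from a real report.

```python
from collections import defaultdict


def parse_fastqc_summary(text):
    """Group FastQC module names by status (PASS, WARN, FAIL).

    Assumes each line is: STATUS <tab> module name <tab> file name.
    """
    by_status = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        status, module, _filename = line.split("\t", 2)
        by_status[status].append(module)
    return dict(by_status)


# Illustrative content only, not from a real report
example = (
    "PASS\tBasic Statistics\tsample.fastq\n"
    "WARN\tPer base sequence content\tsample.fastq\n"
    "FAIL\tAdapter Content\tsample.fastq\n"
)
result = parse_fastqc_summary(example)
print("Address first:", result.get("FAIL", []))
print("Review:", result.get("WARN", []))
```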
Quality Metrics Interpretation
- Per base sequence quality shows quality scores for each position across all reads using box plots.
- Median value (red line), interquartile range (yellow box), and whiskers (10th/90th percentiles) help visualize base quality distribution.
- In high-quality data most bases score above 30; boxes that drop into the red region of the plot (scores below 20) indicate poor-quality positions.
- Per sequence quality scores shows the distribution of mean quality per read; most reads should cluster at the high end of the quality scale.
- Reads with low average quality (10–20) are problematic and need attention.
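To make the quality scores above concrete, here is a minimal sketch that decodes a FASTQ quality string and computes a read's mean quality. It assumes Phred+33 encoding (the usual encoding for current Illumina data) and non-empty quality strings; the function names and the cutoff of 20 used to flag reads are illustrative choices, not FastQC's own code.

```python
def phred_to_error_prob(q):
    """Convert a Phred score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def mean_quality(quality_string, offset=33):
    """Mean Phred score of one read, the quantity summarized per sequence."""
    scores = decode_qualities(quality_string, offset)
    return sum(scores) / len(scores)


def is_low_quality(quality_string, cutoff=20):
    """Flag a read whose mean quality falls below the (assumed) cutoff."""
    return mean_quality(quality_string) < cutoff


# 'I' encodes Q40 and '5' encodes Q20 under Phred+33
print(mean_quality("IIII"))         # 40.0
print(is_low_quality("5555++++"))   # True: mean of Q20 and Q10 bases is 15
print(phred_to_error_prob(30))      # about 0.001, i.e. 1 error in 1,000
```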
Sequence Content and Distribution
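- Per base sequence content plots the proportion of A, T, G, and C at each position; in a random library the four lines run roughly parallel, with A close to T and G close to C.
- Unequal base content at read ends often results from adapter contamination; in RNA-seq, a bias over the first 10 or so bases is also common and is usually attributed to random hexamer priming.
- Per sequence GC content should form a single, roughly normal peak centered on the organism's expected GC content (about 50% is a rule of thumb, but the value is species-specific); a shifted or multi-peaked distribution may indicate contamination.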
- Per base N content shows ambiguous calls (N); >5% is a warning, >20% is a fail.
- Sequence length distribution visualizes the range of read lengths; a tight range at expected length is ideal.
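The per-position checks in this section can be sketched in a few lines. The example below computes base fractions at each position, applies the 5% and 20% N-content thresholds quoted above, and reports GC content and read lengths. All function names are hypothetical and the reads are made up for illustration; FastQC's own implementation differs in detail (for example, it bins positions in long reads).

```python
from collections import Counter


def per_base_fractions(reads):
    """Return one dict per position mapping each base to its fraction."""
    max_len = max(len(r) for r in reads)
    fractions = []
    for pos in range(max_len):
        # Only reads long enough to cover this position contribute
        counts = Counter(r[pos] for r in reads if len(r) > pos)
        total = sum(counts.values())
        fractions.append({b: counts.get(b, 0) / total for b in "ACGTN"})
    return fractions


def n_content_flag(n_fraction):
    """Apply the thresholds from the notes: >5% warn, >20% fail."""
    if n_fraction > 0.20:
        return "fail"
    if n_fraction > 0.05:
        return "warn"
    return "pass"


def gc_percent(read):
    """Percentage of G and C bases in a single read."""
    return 100 * (read.count("G") + read.count("C")) / len(read)


def length_distribution(reads):
    """Count how many reads have each length."""
    return Counter(len(r) for r in reads)


reads = ["ACGT", "ACGA", "NCGT", "ACGT"]  # made-up reads
fractions = per_base_fractions(reads)
for pos, frac in enumerate(fractions, start=1):
    print(pos, frac, n_content_flag(frac["N"]))
print([gc_percent(r) for r in reads])
print(length_distribution(reads))
```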
Duplication and Overrepresentation
- Sequence duplication levels shows how often identical reads recur, often as a result of PCR amplification; excessive duplicates may need to be removed, although in RNA-seq highly expressed transcripts also produce genuine duplicates.
- Overrepresented sequences lists individual sequences that account for more than 0.1% of all reads, typically adapters or contaminants, which should be removed.
- Duplication summarizes how repetitive the library is as a whole; overrepresentation singles out the specific sequences responsible.
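A simplified way to see the two views side by side is to count identical reads. The sketch below reports the percentage of reads that are duplicates and lists sequences above a chosen fraction of the library. The 0.1% default follows the cutoff given in FastQC's documentation for the overrepresented sequences module; the function name is hypothetical, and FastQC itself samples and truncates reads rather than counting every full read as done here.

```python
from collections import Counter


def duplication_summary(reads, overrep_threshold=0.001):
    """Return (percent of reads that are duplicates, overrepresented seqs).

    overrepresented maps each sequence whose share of the library exceeds
    overrep_threshold to that share.
    """
    counts = Counter(reads)
    total = len(reads)
    # Every copy beyond the first of each distinct sequence is a duplicate
    percent_duplicated = 100 * (total - len(counts)) / total
    overrepresented = {
        seq: n / total
        for seq, n in counts.items()
        if n / total > overrep_threshold
    }
    return percent_duplicated, overrepresented


# Made-up library: one sequence dominates. A high threshold is used here
# only because the toy library has just ten reads.
reads = ["AAAA"] * 6 + ["CCCC", "GGGG", "TTTT", "ACGT"]
percent_dup, overrep = duplication_summary(reads, overrep_threshold=0.5)
print(percent_dup)  # 50.0
print(overrep)      # {'AAAA': 0.6}
```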
Adapter Content
- Adapter content section detects non-biological sequence contamination (adapters); >5% is a warning, >10% is a fail.
- Adapters should be removed before downstream analysis.
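The thresholds above can be illustrated with a rough adapter screen. This sketch counts the fraction of reads containing an adapter sequence and applies the 5% and 10% cutoffs from the notes. The adapter string is the commonly cited start of the Illumina universal adapter and is an assumption here; substitute the sequence for your own library kit. The function names are hypothetical, and FastQC reports adapter content cumulatively by read position rather than as a single fraction.

```python
# Assumed adapter: commonly cited prefix of the Illumina universal adapter
ILLUMINA_UNIVERSAL_ADAPTER = "AGATCGGAAGAGC"


def adapter_fraction(reads, adapter=ILLUMINA_UNIVERSAL_ADAPTER):
    """Fraction of reads that contain the adapter as an exact substring."""
    hits = sum(1 for read in reads if adapter in read)
    return hits / len(reads)


def adapter_flag(fraction):
    """Apply the thresholds from the notes: >5% warn, >10% fail."""
    if fraction > 0.10:
        return "fail"
    if fraction > 0.05:
        return "warn"
    return "pass"


# Made-up reads: 3 of 20 carry the adapter at their 3' end
clean = ["ACGTACGTACGTACGTACGT"] * 17
contaminated = ["ACGTACG" + ILLUMINA_UNIVERSAL_ADAPTER] * 3
reads = clean + contaminated

fraction = adapter_fraction(reads)
print(fraction, adapter_flag(fraction))
```

Exact substring matching misses adapters with sequencing errors or partial adapters at the very end of a read, which is one reason dedicated trimming tools are used for the actual removal.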
Key Terms & Definitions
- FastQC — a tool for assessing the quality of high-throughput sequence data.
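- Quality score (Phred score) — numerical measure of base calling accuracy, defined as Q = -10 × log10(P), where P is the probability that the base call is wrong; Q30 means a 1 in 1,000 chance of error.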
- Adapter — artificial sequence added during library preparation for sequencing.
- GC content — percentage of guanine and cytosine nucleotides in a sequence.
- N (ambiguous base) — base that could not be confidently determined by the sequencer.
Action Items / Next Steps
- Watch earlier videos in the series (Parts 4, 5, and 10) for background and practical FastQC usage.
- Review your FastQC report and note any failed or warning sections for preprocessing.
- Prepare for the next video, which will cover how to preprocess and clean your sequencing data.