De novo sequencing and assembly is typically
applied to organisms where no reference genome is available or the available reference is
of poor quality. Genomes that have not been sequenced before
must be assembled via a de novo approach following sequencing. This assembly can then be used for additional
analyses and as the basis for future resequencing projects.

Resequencing is typically performed when a
reference genome sequence is available. Sequencing reads are aligned back to the reference
to determine the genomic location that each read best matches. Resequencing is often applied to explore genetic
variation in individuals, families and populations, particularly with respect to human genetic
disease. Requirements for sequencing depth in these
studies are governed by the variant type of interest, the disease model and the size of
the regions of interest. Resequencing can reveal single nucleotide
polymorphisms, small insertions or deletions, structural variants, and copy number variation. Naturally, the design of a particular study
depends on the biological hypothesis in question, and different sequencing strategies are used
for population studies compared with those for studies of Mendelian disease or of somatic
mutations in cancer. Furthermore, targeted resequencing approaches
allow a trade-off between sequencing breadth and sample numbers: for the same cost, more
samples can be sequenced to the same depth but over a smaller genomic region.

Here, we discuss the merits of whole-genome
sequencing (WGS) relative to targeted resequencing approaches, including whole-exome sequencing (WES), in the context
of these different variant types and disease models.

High-depth WGS is the 'gold standard' for
DNA resequencing because it can interrogate all variant types, including single nucleotide variants (SNVs), small insertions and deletions (indels),
structural variants and copy number variants (CNVs), in both the minority (1.2%) of the human genome that encodes proteins
and the remaining majority of non-coding sequences. WES is focused on the detection of SNVs and
indels in protein-coding genes and on other functional elements such as microRNA sequences;
consequently, it omits regulatory regions such as promoters and enhancers. Although costs vary depending on the sequence
capture solution, WES can be an order of magnitude less expensive than WGS to achieve an approximately
equivalent breadth of coverage of protein-coding exons. These reduced costs offer the potential to
greatly increase sample numbers, which is a key factor for many studies. However, WES has various limitations that
are discussed below.

Early genome resequencing studies focused
specifically on the two most common classes of sequence variation: SNVs and
small indels. The first human genome that was sequenced
using Illumina short-read technology showed that, although almost all homozygous SNVs
are detected at a 15× average depth, an average depth of 33× is required to detect the same
proportion of heterozygous SNVs. Consequently, an average depth that exceeds
30× rapidly became the de facto standard.

Although read quality is mostly governed by
sequencing technology, the uniformity of depth of coverage can also be affected by sample
preparation. GC bias introduced during PCR amplification of DNA
has been identified as a major source of variation in coverage. Eliminating PCR amplification
improves coverage of high-GC regions of the genome and yields fewer duplicate reads. In WES, differences in the hybridization efficiency
of sequence capture probes, which are possibly again attributable to GC content variation,
can result in target regions that have little or no coverage. Uniformity of coverage will also be influenced
by repetitive or low-complexity sequences, which either restrict bait design or lead
to off-target capture. Furthermore, unlike WGS, WES still routinely
uses PCR amplification, which must be carefully optimized to reduce GC bias. As a result of increased variation in coverage,
a greater average read depth is required to achieve the same breadth of coverage as that
from WGS, and an 80× average depth is required to cover 89.6–96.8% of target bases. All WES kits are prone to reference bias,
which arises from capture probes that match the reference sequence and thus tend to preferentially
enrich the reference allele at heterozygous sites; such bias can produce false-negative
SNV calls.

CNVs can be detected from WGS and WES data
using methods that analyse depth of coverage. These methods pile up aligned reads against
genomic coordinates, then calculate read counts in windows to provide the average depth across
a region. Copy number changes can then be inferred from
variation in average depth across genomic regions. In WGS, reasonable specificity can be obtained
with an average depth of as little as 0.1×. However, sensitivity, break-point detection
and absolute copy number estimation all improve with increasing read depth. Regardless of average read depth, depth-of-coverage
methods are vulnerable to false-positive calls caused by local variations
in coverage, even after correction for both GC bias and 'mappability', and cross-sample
calling is required to reduce this effect.

In contrast to the high depth that is required
to accurately call SNVs and indels in individual genomes, population genomics studies benefit
from a trade-off between sample numbers and sequencing depth, in which many genomes are
sequenced at low depth (for example, 400 samples at 4×) and their variants are called jointly
across all samples. Variant calls on individual low-depth genomes
have a high false-positive rate, but this is mitigated by combining information across
samples. This approach provides good power to detect
common variants at a fraction of the cost of deep sequencing. Indeed, even ultra-low-coverage sequencing
(that is, sequencing at 0.1–0.5×) captures almost as much common variation (that is,
variants with >1% allele frequency) as single-nucleotide polymorphism (SNP) arrays. Conversely, reliable identification of variants
in either highly aneuploid genomes or heterogeneous cell populations, such as those from tumours,
requires a greater depth of coverage than is needed for normal tissue. Targeted enrichment and ultra-deep sequencing
(that is, sequencing at 1,000×) of limited regions of interest can be used to study clonal
evolution in cancer samples, in which specific variants are present in <1% of the cell population.

The identification of disease-causing de novo
or recessive variants is often best served by sequencing parent–child trios. In this case, it is recommended that the same
depth of sequencing is obtained for each of the family members in order to minimize false-positive
calls in the proband and false-negative calls in the parents.
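
The link between depth and heterozygous-variant detection, and the reason for matching depth across a trio, can be illustrated with a small calculation. The sketch below is not taken from any of the studies discussed here. It assumes an idealized model in which read depth at a site follows a Poisson distribution around the average depth, each read at a heterozygous site carries the variant allele with probability 0.5, and a variant is called when at least three supporting reads are seen. The function names, the three-read threshold and the Poisson assumption are all illustrative choices. Real coverage is more variable than Poisson, so these numbers are likely to be optimistic.

```python
import math


def poisson_tail(min_count: int, mean: float) -> float:
    """Return P(X >= min_count) for X ~ Poisson(mean)."""
    if mean < 0:
        raise ValueError("mean must be non-negative")
    term = math.exp(-mean)  # P(X = 0)
    below = 0.0
    for i in range(min_count):
        below += term
        term *= mean / (i + 1)  # P(X = i + 1) derived from P(X = i)
    return max(0.0, 1.0 - below)


def detection_probability(mean_depth: float,
                          min_alt_reads: int = 3,
                          alt_fraction: float = 0.5) -> float:
    """Chance of seeing enough variant reads at one site.

    Assumption: Poisson depth, so variant-read counts are Poisson with
    mean = mean_depth * alt_fraction (0.5 for heterozygous sites,
    1.0 for homozygous sites). The 3-read threshold is hypothetical.
    """
    return poisson_tail(min_alt_reads, mean_depth * alt_fraction)


def apparent_de_novo_rate(child_depth: float,
                          parent_depth: float,
                          min_alt_reads: int = 3) -> float:
    """Chance that an inherited heterozygous variant looks de novo.

    This happens when the variant is detected in the child but missed
    in the parent who carries it.
    """
    seen_in_child = detection_probability(child_depth, min_alt_reads)
    missed_in_parent = 1.0 - detection_probability(parent_depth, min_alt_reads)
    return seen_in_child * missed_in_parent


if __name__ == "__main__":
    for depth in (4, 15, 30):
        het = detection_probability(depth)
        hom = detection_probability(depth, alt_fraction=1.0)
        print(f"{depth:>3}x  heterozygous={het:.4f}  homozygous={hom:.4f}")
    for parent_depth in (10, 30):
        rate = apparent_de_novo_rate(30, parent_depth)
        print(f"child 30x, parent {parent_depth}x: apparent de novo rate={rate:.5f}")
```

Under these assumptions, homozygous sites reach a given detection probability at roughly half the depth that heterozygous sites need, and sequencing a parent at lower depth than the child raises the chance that an inherited variant is mistaken for a de novo one.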