High Throughput Validation of NGS Structural Variant Calls Using High Definition Mapping
Structural variants (SVs) affect more of the average human genome than SNVs and indels combined and frequently play critical roles in human disease. However, SVs are often difficult to detect in Illumina data sets because of limitations in read length and systematic error. Longer-read approaches such as PacBio sequencing provide information over a larger scale but often still fall short of accurately characterizing large-scale genomic changes. While recent years have seen a rapid expansion in the number of short- and long-read data sets, as well as SV calling methods, validation of SV calls remains time-consuming and error-prone. Here we demonstrate the utility of map data for assessing the performance of different sequencing technologies and callers.
Whole-genome maps constructed from electronic detection of long molecules can be used to validate SV calls genome-wide with high sensitivity and specificity across a wide range of sizes. The Nabsys HD-Mapping™ platform utilizes electronic detection of single DNA molecules to achieve read lengths >100 kilobase pairs with resolution better than 100 base pairs. These data allow analysis of SVs ranging from a few hundred base pairs in size to chromosomal rearrangements, with sensitivity, accuracy, scalability, and speed of detection superior to existing optical methods. These high-resolution mapping data are used as input for SV-Verify™, a software package that provides an efficient, robust pipeline for the systematic and automated evaluation of putative SVs.
We describe SV-Verify training using reference material from the NIST Genome in a Bottle project and show the resulting sensitivity and specificity. We then present results obtained by applying SV-Verify to several thousand putative deletions in the human genome GM24385. To investigate the phenomenon of technology bias in SV calling, we used SV-Verify to confirm or refute putative deletion calls made from Illumina or PacBio data by a large number of variant callers across various size ranges.
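As an illustration of the sensitivity and specificity figures referred to above, the following minimal sketch shows how the two quantities could be computed from a set of benchmark-labeled calls and a validator's confirm/refute verdicts. The `Call` record and the `sensitivity_specificity` function are hypothetical names introduced for this example, not part of SV-Verify, and the sample values are made up purely to exercise the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One putative SV call (hypothetical record, for illustration only)."""
    truth_is_real: bool  # label from a benchmark set: is the variant real?
    verified: bool       # validator verdict: True = confirmed, False = refuted


def sensitivity_specificity(calls):
    """Return (sensitivity, specificity) of the verdicts against the labels."""
    tp = sum(c.truth_is_real and c.verified for c in calls)          # real, confirmed
    fn = sum(c.truth_is_real and not c.verified for c in calls)      # real, refuted
    tn = sum(not c.truth_is_real and not c.verified for c in calls)  # false, refuted
    fp = sum(not c.truth_is_real and c.verified for c in calls)      # false, confirmed
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up example: 4 real calls (3 confirmed) and 5 false calls (4 refuted).
example = (
    [Call(True, True)] * 3
    + [Call(True, False)]
    + [Call(False, False)] * 4
    + [Call(False, True)]
)
sens, spec = sensitivity_specificity(example)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Here sensitivity is the fraction of real variants that the validator confirms, and specificity is the fraction of false calls that it refutes.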