Analysis of Complex Microbial Samples Using High Definition Mapping - AGBT 2019
February 27, 2019
Jay M. Sage, Barrett Bready, Anthony P. Catalano, Jennifer R. Davis, Michael D. Kaiser, John S. Oliver
Complex microbial communities play a critical role in a wide variety of biological systems in the environment and throughout the human body. Characterization of these communities has historically been limited to one or a small number of known genetic markers for species such as 16S rRNA genes. While the advent of inexpensive shotgun sequencing has enabled a more accurate measure of biodiversity than marker typing, short read lengths prevent accurate analysis of related strains within a mixture, as well as consistent characterization of large-scale structural variation that can distinguish highly related strains and significantly impact pathogenicity.
To address these issues, we have applied the Nabsys HD-Mapping™ platform to strain-level identification of microbial strains in the context of complex mixtures. HD-Mapping employs fully electronic detection of tagged single DNA molecules, hundreds of kilobases in length, at a resolution superior to existing optical mapping approaches. This combination of long read lengths and high information density means that individual HD-Mapping reads tend to be much more specific to the genomes from which they derive than do NGS reads. As a result, differences between closely related strains of the same species become clear with minimal bioinformatics work.
Here we describe strain-level characterization of the ZymoBIOMICS Microbial Community Standard using Nabsys HD-Mapping. DNA was extracted using a standard kit-based isolation procedure, and single-molecule reads derived from the mixture were mapped to the NCBI database of all ~10,500 completed bacterial references, including ~1,700 references for species present in the mixture. Through analysis of unique read mapping characteristics, the correct reference was identified for each of the 8 bacterial strains present in the mixture, and relative strain quantitation was obtained. In addition, we show that strain-level detection of the 8 bacterial strains is unaffected by the presence of 20% human DNA co-extracted with the mixture.
Correcting Errors in PacBio and ONT Assemblies using High-Definition Mapping - AGBT 2019
February 27, 2019
John S. Oliver, Barrett Bready, Anthony P. Catalano, Jennifer R. Davis, Michael D. Kaiser, Jay M. Sage
De novo whole genome assemblies based on short-read sequencing data are often incomplete and highly fragmented. The development of long-read, single-molecule technologies, like those produced by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), was driven by the need for longer read lengths to span repeat regions and complex events. While significant improvements in assembly have been observed with the application of these technologies, both have high per-read error rates resulting in frequent assembly errors and are unable to achieve sufficient read length to observe all genomic structural changes. Scaffolding methods, such as optical mapping or Hi-C methodologies, have been used in combination with sequencing technologies for assembly improvement, but suffer from inherent resolution limitations and high cost. Complete, accurate and cost-effective genome assembly continues to be a problem, even for small microbial genomes.
To provide the necessary long-range information while maintaining sufficient resolution to complement sequencing technologies, Nabsys has developed the HD-Mapping™ platform to construct high-resolution whole genome maps. By analyzing reads that are hundreds of kilobases in length, electronic detection preserves long-range information while simultaneously achieving unparalleled resolution and accuracy. Single-molecule reads have high resolution and low false-negative and false-positive error rates, resulting in high information content per read. To assess data quality, we present a de novo assembled map containing a known large tandem repeat and show concordance with the well-established reference.
To demonstrate the need for the long-range information provided by Nabsys HD maps for accurate assembly, we show examples of assembly errors generated with PacBio and ONT data from large and small genomes. Errors observed include collapsed repeats, false duplications/insertions, chimeric joins, and incorrect circularization junctions. Alignments between Nabsys assembled maps and sequence assemblies are presented to highlight regions of discrepancy.
High Throughput Validation of NGS Structural Variant Calls Using High Definition Mapping - AGBT 2018
February 14, 2018
Structural variants (SVs) affect more of the average human genome than SNVs and indels combined and frequently play critical roles in human disease. However, SVs are often difficult to detect in Illumina data sets due to limitations in read length and systematic error. Longer-read approaches such as PacBio sequencing provide information over a larger scale but often still fall short of accurately characterizing large-scale genomic changes. While recent years have seen a rapid expansion in the number of short- and long-read data sets, as well as SV calling algorithms, approaches for validating SV calls remain time-consuming and error-prone. Here we demonstrate the utility of map data to qualify the performance of different sequencing technologies and callers.
Whole genome maps constructed from electronic detection of long molecules can be used effectively to facilitate the genome-wide validation of SV calls with high sensitivity and specificity across a wide range of sizes. The Nabsys HD-Mapping™ platform utilizes electronic detection of single DNA molecules to achieve read lengths >100 kilobase pairs with resolution better than 100 base pairs. These data allow analysis of SVs ranging from a few hundred base pairs in size to chromosomal rearrangements, with sensitivity, accuracy, scalability, and speed of detection superior to existing optical methods. These high-resolution mapping data are used as input for SV-Verify™, a software package which provides an efficient, robust pipeline for the systematic and automated evaluation of putative SVs.
We describe SV-Verify training using reference material from the NIST Genome in a Bottle project and show the resulting sensitivity and specificity. We then present results obtained by applying SV-Verify to several thousand putative deletions in the human genome GM24385. To investigate the phenomenon of underlying technology bias in SV calling, we utilized SV-Verify to confirm or refute putative deletion calls made using Illumina or PacBio data by a large number of variant callers in various size ranges.
SPAdes-Directed Genome Assembly Using Short Reads and High-Definition Maps - AGBT 2018
February 14, 2018
The increased use of next generation sequencing has led to the generation of many genome assemblies; however, they are often incomplete and not contiguous. While long-read technologies have improved assembly contiguity, they are more expensive and still lack sufficient read length to resolve complex structures. Optical mapping technologies have been used to scaffold and improve assemblies, but the inherent limitation on resolution and precision limits utility, especially in combination with short-read data. To provide the necessary long-range information while maintaining sufficient resolution to complement next generation sequencing technologies, Nabsys has developed its HD-Mapping™ platform to construct electronic whole genome maps. By analyzing reads that are hundreds of kilobases in length, electronic detection preserves long-range information while simultaneously achieving better than 300 bp single molecule resolution.
Consensus maps or de novo map assemblies, constructed from Nabsys data, accurately represent the distances between tags affixed to single molecules. For a well-characterized microbial strain, we compare de novo assembly consensus interval sizes to interval sizes determined from the reference sequence, showing R-squared of better than 0.9999 (interval size ≥300 bp). Due to the stochastic nature of single-molecule false negatives and false positives, the consensus contains 0 false positives and 0 false negatives, for intervals >500 bp.
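The interval comparison described above can be illustrated with a minimal sketch. It assumes the consensus map and the reference have already been aligned tag-for-tag, so that intervals pair up one-to-one. The tag positions, the helper names `intervals` and `r_squared`, and the `MIN_INTERVAL` constant are hypothetical and are not taken from Nabsys software or data; only the 300 bp cutoff comes from the abstract.

```python
def intervals(tag_positions):
    """Return distances (bp) between consecutive tags."""
    positions = sorted(tag_positions)
    return [b - a for a, b in zip(positions, positions[1:])]


def r_squared(reference, measured):
    """R-squared of a least-squares line fit of measured against reference."""
    n = len(reference)
    mean_x = sum(reference) / n
    mean_y = sum(measured) / n
    sxx = sum((x - mean_x) ** 2 for x in reference)
    syy = sum((y - mean_y) ** 2 for y in measured)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(reference, measured))
    # For a simple linear fit, R-squared equals the squared correlation.
    return (sxy * sxy) / (sxx * syy)


# Hypothetical tag positions (bp); not real data.
reference_tags = [0, 1200, 1650, 5400, 9800, 10350, 18000]
consensus_tags = [0, 1195, 1652, 5410, 9795, 10340, 18010]

MIN_INTERVAL = 300  # bp; compare only intervals at or above this size

pairs = [(r, m)
         for r, m in zip(intervals(reference_tags), intervals(consensus_tags))
         if r >= MIN_INTERVAL]
ref_sizes = [r for r, _ in pairs]
map_sizes = [m for _, m in pairs]

r2 = r_squared(ref_sizes, map_sizes)
print(f"{len(pairs)} intervals compared, R-squared = {r2:.6f}")
```

Filtering on the reference interval size rather than the measured size keeps the selection independent of measurement noise; that is a design choice of this sketch, not a documented detail of the analysis.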
Nabsys electronic single-molecule reads have been integrated with short-read data in a hybrid assembler to improve assembly completeness. We will describe results from hybrid assembly of Nabsys HD maps and Illumina short-read sequence data using the SPAdes assembler. We demonstrate that hybrid assembly of short-reads and Nabsys electronic maps significantly improves the contiguity and quality of genome assemblies.
Metagenome Quantitation and Assembly of Complex Microbial Samples Using High Definition Mapping - AGBT 2018
February 14, 2018
The role of the human microbiome in disease is increasingly well established. The characterization of microbe populations has historically been limited to known genetic markers of strains such as 16S rRNA genes. More recently, shotgun sequencing has been employed in an attempt to provide a more accurate measure of biodiversity. While whole metagenome sequencing is more comprehensive than single marker typing, short read lengths prevent accurate quantitation of related strains within a mixture, as well as consistent characterization of large-scale structural variation that can significantly impact pathogenicity.
To address these issues, we have applied the Nabsys HD-Mapping™ platform to the identification and quantitation of microbial strains in the context of defined complex samples. HD-Mapping employs fully electronic detection of tagged single DNA molecules, hundreds of kilobases in length, at a resolution superior to existing optical mapping approaches. On the length scales of the HD-Mapping reads, there are frequently differences, even between closely related strains of the same species. This means that individual reads tend to be much more specific to the genomes from which they derive than do NGS reads. This specificity allows for strain identification in complex mixtures with minimal bioinformatics work.
Nabsys single-molecule reads derived from a complex mixture were mapped to a database of microbial references resulting in identification of strains within the sample. Titrating the relative abundance of a single bacterial genome in a complex sample resulted in quantitation that was linear over three orders of magnitude. In addition, assembly of the mixed reads into map-level contigs allowed for the identification and structural comparison of distinct but related strains contained within the same sample.
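One common way to check that a titration is linear over several orders of magnitude is to fit a line in log-log space and confirm that the slope is close to 1. The sketch below illustrates that check only; the expected and observed fractions are hypothetical values chosen to span three orders of magnitude, and `loglog_fit` is an illustrative helper, not part of any Nabsys tool or the method used in the abstract.

```python
import math


def loglog_fit(expected, observed):
    """Least-squares fit of log10(observed) against log10(expected).

    Returns (slope, intercept). A slope near 1 indicates that the
    observed abundance scales proportionally with the expected one.
    """
    xs = [math.log10(e) for e in expected]
    ys = [math.log10(o) for o in observed]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical titration: fraction of the sample contributed by one genome.
expected_fraction = [0.001, 0.01, 0.1, 1.0]
observed_fraction = [0.0011, 0.0095, 0.105, 0.98]

slope, intercept = loglog_fit(expected_fraction, observed_fraction)
print(f"log-log slope = {slope:.3f}, intercept = {intercept:.3f}")
```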
Assembly of Dense Electronic Maps to Analyze Structural Diversity in Bordetella pertussis Outbreak Strains - AGBT 2018
February 14, 2018
Despite increased administration of pertussis-containing vaccines, whooping cough cases in the United States and developing countries continue to rise. Recent publication of complete B. pertussis genomes from multiple outbreak events has revealed striking variation at the structural level between modern strain genomes and references used in vaccine development. Characterization of these structural variations through whole genome electronic mapping provides a comprehensive, high-resolution view of these changes that holds the potential for uncovering mechanisms of pathogenicity and vaccine evasion.
Nabsys has developed HD-Mapping™, a platform for the construction of electronic whole genome maps. The major advantages of electronic detection over optical methods are higher sensitivity, accuracy, scalability, and speed of detection, as well as greatly reduced cost. By analyzing single-molecule reads that are hundreds of kilobases in length, electronic detection preserves long-range information while simultaneously achieving unparalleled resolution and accuracy with low false-negative and false-positive error rates, resulting in high information content per read. This allows the use of high-density nicking enzymes to generate complete and contiguous de novo assemblies.
Here we present the utilization of Nabsys HD-Mapping to generate de novo assembled HD maps of B. pertussis for the characterization of modern epidemic strain structural variation. As an illustration of the precision and accuracy of Nabsys HD-Mapping we compare our de novo assembled HD maps to completed references. The assembled maps highlight the structural diversity present between strains within a single outbreak. Furthermore, we demonstrate the ability to resolve a complex, nested repeat structure that spans hundreds of kilobases that was previously unresolved using PacBio, Illumina, and optical mapping data. The generation of dense, highly accurate whole genome electronic maps of pathogenic strains, such as B. pertussis, enables a level of structural analysis unavailable using existing sequencing and mapping technologies.
Machine Learning Applied to Single-Molecule Electronic DNA Mapping for Structural Variant Verification in Human Genomes - ASHG 2017
November 17, 2017
The importance of structural variation in human disease and the difficulty of detecting structural variants larger than 50 base pairs have led to the development of several long-read sequencing technologies and optical mapping platforms. Frequently, multiple technologies and ad hoc methods are required to obtain a consensus regarding the location, size and nature of a structural variant, with no single approach able to reliably bridge the gap of variant sizes between those readily detected using NGS technologies and the largest rearrangements observed with optical mapping. Often, structural variants larger than 10 kilobases are not detected.
To address this unmet need, we have developed a new software package, SV-Verify™, which utilizes data collected with the Nabsys High Definition Mapping (HD-Mapping™) system, to perform hypothesis-based verification of putative deletions. We demonstrate that whole genome maps, constructed from data generated by electronic detection of tagged DNA, hundreds of kilobases in length, can be used effectively to facilitate calling of structural variants ranging in size from 300 base pairs to hundreds of kilobase pairs. SV-Verify implements hypothesis-based verification of putative structural variants using supervised machine learning. Machine learning is realized using a set of support vector machines, capable of concurrently testing several thousand independent hypotheses. We describe support vector machine training, utilizing 1089 deletions and 4637 negative controls from a well-characterized human genome. Plots delineating the specificity versus sensitivity of each of the support vector machines will be presented. We subsequently applied the trained classifiers to another human genome, evaluating > 5000 putative deletions, demonstrating high sensitivity and specificity for deletions from 300 base pairs to hundreds of kilobases. Over 78% of deletions called by three or more technologies were confirmed by SV-Verify.
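For readers unfamiliar with specificity-versus-sensitivity plots, the following minimal sketch shows how one point on such a curve is computed from classifier decision scores at a chosen threshold. The scores, labels, and thresholds are hypothetical; the sketch does not reproduce SV-Verify or its support vector machines, and `sensitivity_specificity` is an illustrative helper name.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Compute (sensitivity, specificity) for a score threshold.

    scores: classifier decision values, higher means "more likely a deletion"
    labels: True for known deletions, False for negative controls
    """
    tp = fn = tn = fp = 0
    for score, is_deletion in zip(scores, labels):
        called = score >= threshold
        if is_deletion and called:
            tp += 1
        elif is_deletion:
            fn += 1
        elif called:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical decision scores: 4 known deletions, 6 negative controls.
labels = [True, True, True, True,
          False, False, False, False, False, False]
scores = [2.1, 0.8, 1.5, -0.2,
          -1.3, -0.7, 0.1, -2.0, -0.4, -1.1]

# Sweeping the threshold traces out the specificity-versus-sensitivity curve.
for threshold in (0.0, 0.5):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold={threshold:+.1f} "
          f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Raising the threshold trades sensitivity for specificity, which is why a curve over many thresholds gives a fuller picture than a single operating point.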
De Novo Assembly of High Density Electronic Maps Reveals Structural Diversity in Bordetella pertussis - SFAF 2017
May 17, 2017
The generation of dense, highly accurate whole genome electronic maps of pathogenic strains such as B. pertussis enables a level of structural analysis unavailable by existing sequencing and mapping technologies. The precision and accuracy of Nabsys HD-Mapping™ simultaneously allows for distinction between highly related strains and a clear understanding of the nature of structural variation that can modulate virulence and vaccine avoidance characteristics of pathogenic bacterial strains.
We would like to thank Margaret Williams, PhD (Centers for Disease Control and Prevention, Atlanta, GA) for providing DNA samples and strain reference sequence files.
First Long-Range Non-Optical Maps of Human Genomes and Their Applications - AGBT 2017
February 15, 2017
Despite advances in next generation sequencing technologies, limitations imposed by short read length combined with sequence features such as polymorphism and copy number repeats present challenges for genomic assembly and analysis. Longer range techniques such as optical mapping utilize long reads to provide information over a larger scale, but lack resolution and throughput.
To bridge the gap, Nabsys has employed its HD-Mapping™ platform to construct the first long-range, electronic maps of whole human genomes. The major advantages of electronic sensing are higher sensitivity, accuracy, scalability, and speed of detection. Single-molecule events translocate through the detector at velocities above 1 megabase pair per second. By analyzing reads that are hundreds of kilobases in length, electronic detection preserves long-range information while simultaneously achieving unparalleled resolution. Single-molecule reads have high resolution and low false-negative and false-positive error rates, resulting in high information content per read.
Using its electronic HD-Mapping platform, Nabsys has constructed whole human genome maps characterized by high depth of coverage as well as coverage of greater than 99% of the reference genome. We present the first application of electronic whole human genome maps, including examples of structural variants and tandem repeats in a human reference genome, as well as analysis of large-scale chromosomal rearrangements in a breast cancer genome, MCF-7.
Characterization of Lysinibacillus fusiformis strain S4C11: In vitro, in planta, and in silico analyses reveal a plant-beneficial microbe
Despite sharing many of the traits that have allowed the genus Bacillus to gain recognition for its agricultural relevance, the genus Lysinibacillus is not as well-known and studied. The present study employs in vitro, in vivo, in planta, and in silico approaches to characterize Lysinibacillus fusiformis strain S4C11, isolated from the roots of an apple tree in northern Italy.
A robust benchmark for detection of germline large deletions and insertions
June 15, 2020
New technologies and analysis methods are enabling genomic structural variants (SVs) to be detected with ever-increasing accuracy, resolution and comprehensiveness. To help translate these methods to routine research and clinical practice, we developed a sequence-resolved benchmark set for identification of both false-negative and false-positive germline large insertions and deletions. To create this benchmark for a broadly consented son in a Personal Genome Project trio with broadly available cells and DNA, the Genome in a Bottle Consortium integrated 19 sequence-resolved variant calling methods from diverse technologies. The final benchmark set contains 12,745 isolated, sequence-resolved insertion (7,281) and deletion (5,464) calls ≥50 base pairs (bp). The Tier 1 benchmark regions, for which any extra calls are putative false positives, cover 2.51 Gbp and 5,262 insertions and 4,095 deletions supported by ≥1 diploid assembly. We demonstrate that the benchmark set reliably identifies false negatives and false positives in high-quality SV callsets from short-, linked- and long-read sequencing and optical mapping.
Screening and Genomic Characterization of Filamentous Hemagglutinin-Deficient Bordetella pertussis
Despite high vaccine coverage, pertussis cases in the United States have increased over the last decade. Growing evidence suggests that disease resurgence results, in part, from genetic divergence of circulating strain populations away from vaccine references. The United States employs acellular vaccines exclusively, and current Bordetella pertussis isolates are predominantly deficient in at least one immunogen, pertactin (Prn). First detected in the United States retrospectively in a 1994 isolate, the rapid spread of Prn deficiency is likely vaccine driven, raising concerns about whether other acellular vaccine immunogens experience similar pressures, as further antigenic changes could potentially threaten vaccine efficacy. We developed an electrochemiluminescent antibody capture assay to monitor the production of the acellular vaccine immunogen filamentous hemagglutinin (Fha). Screening 722 U.S. surveillance isolates collected from 2010 to 2016 identified two that were both Prn and Fha deficient. Three additional Fha-deficient laboratory strains were also identified from a historic collection of 65 isolates dating back to 1935. Whole-genome sequencing of deficient isolates revealed putative, underlying genetic changes. Only four isolates harbored mutations to known genes involved in Fha production, highlighting the complexity of its regulation. The chromosomes of two Fha-deficient isolates included unexpected structural variation that did not appear to influence Fha production. Furthermore, insertion sequence disruption of fhaB was also detected in a previously identified pertussis toxin-deficient isolate that still produced normal levels of Fha. These results demonstrate the genetic potential for additional vaccine immunogen deficiency and underscore the importance of continued surveillance of circulating B. pertussis evolution in response to vaccine pressure.
Automated Structural Variant Verification In Human Genomes Using Single-Molecule Electronic DNA Mapping
May 22, 2017
The importance of structural variation in human disease and the difficulty of detecting structural variants larger than 50 base pairs have led to the development of several long-read sequencing technologies and optical mapping platforms. Frequently, multiple technologies and ad hoc methods are required to obtain a consensus regarding the location, size and nature of a structural variant, with no approach able to reliably bridge the gap of variant sizes between the domain of short-read approaches and the largest rearrangements observed with optical mapping.
To address this unmet need, we have developed a new software package, SV-Verify™, which utilizes data collected with the Nabsys High Definition Mapping (HD-Mapping™) system, to perform hypothesis-based verification of putative deletions. We demonstrate that whole genome maps, constructed from electronic detection of tagged DNA, hundreds of kilobases in length, can be used effectively to facilitate calling of structural variants ranging in size from 300 base pairs to hundreds of kilobase pairs. SV-Verify implements hypothesis-based verification of putative structural variants using a set of support vector machines and is capable of concurrently testing several thousand independent hypotheses. We describe support vector machine training, utilizing a well-characterized human genome, and application of the resulting classifiers to another human genome, demonstrating high sensitivity and specificity for deletions ≥ 300 base pairs.
High-Definition Electronic Genome Maps From Single Molecule Data
May 18, 2017
With the advent of routine short-read genome sequencing has come a growing recognition of the importance of long-range, structural information in applications ranging from sequence assembly to the detection of structural variation. Here we describe the Nabsys solid-state detector capable of detecting tags on single molecules of DNA 100s of kilobases in length as they translocate through the detector at a velocity greater than 1 megabase pair per second. Sequence-specific tags are detected with a high signal-to-noise ratio. The physical distance between tags is determined after accounting for viscous drag-induced intramolecular velocity fluctuations. The accurate measurement of the physical distance between tags on each molecule and the ability of the detector to resolve distances between tags of less than 300 base-pairs enables the construction of high-density genome maps.