SBIR-STTR Award

Long Read Based Sequencing Software for the Comprehensive Analysis of Clinical Samples
Award last edited on: 9/21/2022

Sponsored Program
SBIR
Awarding Agency
NIH : NIGMS
Total Award Amount
$1,500,000
Award Phase
2
Solicitation Topic Code
859
Principal Investigator
Timothy J Durfee

Company Information

DNA Star Inc (AKA: DNAstar Inc)

3801 Regent Street
Madison, WI 53705
   (608) 258-7420
   info@dnastar.com
   www.dnastar.com
Location: Single
Congr. District: 02
County: Dane

Phase I

Contract Number: 1R44GM137643-01
Start Date: 4/1/2020    Completed: 3/31/2022
Phase I year
2020
Phase I Amount
$750,000
The high cost and complexity of the analysis of whole genome resequencing remain prohibitive for most clinical applications. Targeted resequencing allows regions of interest to be enriched from a genomic DNA sample and sequenced to high depth allowing cost-effective identification of important variants. In combination with next-generation sequencing (NGS), the approach has been exploited to tremendous effect in identifying candidate genes and variants for an array of diseases and traits from cohorts and populations as well as individual clinical samples. However, the short read nature of NGS technologies severely limits their potential to characterize, for example, compound heterozygotes due to the lack of long range connectivity needed for haplotype phasing and structural variants (SV). Those limitations can be overcome with long read data from Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT). Moreover, new targeting methods tailored toward long read sequencing are being developed such that a comprehensive analysis of key regions in an individual’s genome will soon be within reach. However, an integrated software solution that is easy enough for clinical researchers to efficiently use is sorely lacking. The overall goal of this Direct to Phase II proposal is to develop commercial-grade software that produces a comprehensive catalog of annotated haplotype phased variants from clinical sequencing data and presents them to clinical researchers through a single easy-to-use application with both analytical and genome browsing capabilities, GenVision Ultra. The proposal focuses on augmenting our highly extensible XNG assembly pipeline with tools necessary for fully automated detection and annotation of all classes of variants from haplotype phased sequences. Novel adaptations to core XNG components will partition reads matching the reference from those likely representing a SV for parallel processing (Aim 1).
Matching reads will be aligned to the reference using XNG while the putative SV-containing reads will be de novo assembled and annotated using our long read assembler (LRA). Reference-based alignments will be phased using a novel Bayesian classifier to produce two haplotype sequences prior to SNV/small indel calling and annotation (Aim 2). Short read polishing of the entire assembly will be available on demand. Complete small variant and SV profiles as well as the underlying assembly data will be accessible to the end user in GenVision Ultra. In addition, the application will have discrete filtering and statistical tools with which to identify genes and/or variants of interest in an individual sample or across a cohort/population (Aim 3). To ensure that the software meets the clinical sequencing market needs, Arkana Laboratories has agreed to provide ONT and Illumina sequence data from highly curated HapMap control samples processed with their kidney disease gene panels. Those real-world data sets together with expert interpretation and feedback by Arkana researchers provide an ideal environment in which to develop an outstanding software solution for this critical market (Aim 4).
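The abstract does not describe how the Bayesian classifier used for phasing works. As an illustration only, and not the XNG or GenVision Ultra implementation, a minimal sketch of one way to assign long reads to two haplotypes with Bayes' rule is given below. It assumes a set of known heterozygous sites, a uniform per-base error rate, and that haplotype B carries the opposite allele to haplotype A at every site. All names (`haplotype_posterior`, `assign_reads`, `error_rate`, `threshold`) are hypothetical.

```python
import math


def haplotype_posterior(read_alleles, hap_a, error_rate=0.05, prior_a=0.5):
    """Posterior probability that a read derives from haplotype A.

    read_alleles: dict of site position -> observed allele (0 = ref, 1 = alt).
    hap_a: dict of site position -> allele carried by haplotype A.
           Assumption: haplotype B carries the opposite allele at each site.
    error_rate: assumed uniform probability that an observed allele is wrong.
    """
    log_a = math.log(prior_a)
    log_b = math.log(1.0 - prior_a)
    for pos, allele in read_alleles.items():
        if pos not in hap_a:
            continue  # not a known heterozygous site, so it carries no phase signal
        matches_a = allele == hap_a[pos]
        log_a += math.log(1.0 - error_rate if matches_a else error_rate)
        log_b += math.log(error_rate if matches_a else 1.0 - error_rate)
    # Normalise in log space to avoid underflow on reads spanning many sites.
    top = max(log_a, log_b)
    weight_a = math.exp(log_a - top)
    weight_b = math.exp(log_b - top)
    return weight_a / (weight_a + weight_b)


def assign_reads(reads, hap_a, threshold=0.9, error_rate=0.05):
    """Bin reads into haplotype A, haplotype B, or unassigned."""
    bins = {"A": [], "B": [], "unassigned": []}
    for name, alleles in reads.items():
        p = haplotype_posterior(alleles, hap_a, error_rate=error_rate)
        if p >= threshold:
            bins["A"].append(name)
        elif p <= 1.0 - threshold:
            bins["B"].append(name)
        else:
            bins["unassigned"].append(name)
    return bins


# Toy data: three heterozygous sites and four reads (made up for illustration).
hap_a = {100: 0, 250: 1, 900: 1}
reads = {
    "r1": {100: 0, 250: 1, 900: 1},  # agrees with A at every site
    "r2": {100: 1, 250: 0},          # agrees with B at every site
    "r3": {100: 0, 250: 0},          # one vote each way, ambiguous
    "r4": {5000: 1},                 # covers no informative site
}
bins = assign_reads(reads, hap_a)
```

In this toy example, reads that agree consistently with one haplotype are binned to it, while reads with conflicting or no informative sites stay unassigned. A production phaser would also have to infer the haplotypes themselves and model platform-specific error profiles.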

Public Health Relevance Statement:
Long read sequencing technologies can decipher DNA molecules one thousand times longer than “next generation” sequencing machines. This has tremendous implications for clinical sequencing by giving researchers and clinicians access to previously opaque aspects of an individual’s genome. In this project, we will develop the software needed by clinical sequencing labs to exploit this remarkable advance in the pursuit of improved disease prevention, detection and treatment.

Project Terms:
analytical tool; annotation system; base; Biological Sciences; Candidate Disease Gene; Catalogs; Clinical; clinical application; clinical sequencing; cohort; Collaborations; Computer software; Computers; contig; cost; cost effective; Data; Data Set; Detection; Development; Disease; disorder prevention; DNA; DNA Resequencing; Ensure; Environment; Evaluation; Feedback; Gene Family; Genes; genetic variant; Genome; genome browser; genome sequencing; genome-wide; Genomic DNA; Goals; Haplotypes; Heterozygote; Hour; improved; Individual; insertion/deletion mutation; instrument; interest; Kidney Diseases; Laboratories; Length; Methods; nanopore; Nature; next generation sequencing; novel; parallel processing; Phase; Polishes; Population; Privacy; Process; prototype; Provider; Pseudogenes; reference genome; Research Personnel; Running; Sampling; Series; software development; Statistical Methods; Stream; Structure; Targeted Resequencing; Technology; Time; tool; trait; Variant; Visualization; whole genome

Phase II

Contract Number: 5R44GM137643-02
Start Date: 4/1/2020    Completed: 3/31/2023
Phase II year
2021
Phase II Amount
$750,000
The high cost and complexity of the analysis of whole genome resequencing remain prohibitive for most clinical applications. Targeted resequencing allows regions of interest to be enriched from a genomic DNA sample and sequenced to high depth allowing cost-effective identification of important variants. In combination with next-generation sequencing (NGS), the approach has been exploited to tremendous effect in identifying candidate genes and variants for an array of diseases and traits from cohorts and populations as well as individual clinical samples. However, the short read nature of NGS technologies severely limits their potential to characterize, for example, compound heterozygotes due to the lack of long range connectivity needed for haplotype phasing and structural variants (SV). Those limitations can be overcome with long read data from Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT). Moreover, new targeting methods tailored toward long read sequencing are being developed such that a comprehensive analysis of key regions in an individual's genome will soon be within reach. However, an integrated software solution that is easy enough for clinical researchers to efficiently use is sorely lacking. The overall goal of this Direct to Phase II proposal is to develop commercial-grade software that produces a comprehensive catalog of annotated haplotype phased variants from clinical sequencing data and presents them to clinical researchers through a single easy-to-use application with both analytical and genome browsing capabilities, GenVision Ultra. The proposal focuses on augmenting our highly extensible XNG assembly pipeline with tools necessary for fully automated detection and annotation of all classes of variants from haplotype phased sequences. Novel adaptations to core XNG components will partition reads matching the reference from those likely representing a SV for parallel processing (Aim 1).
Matching reads will be aligned to the reference using XNG while the putative SV-containing reads will be de novo assembled and annotated using our long read assembler (LRA). Reference-based alignments will be phased using a novel Bayesian classifier to produce two haplotype sequences prior to SNV/small indel calling and annotation (Aim 2). Short read polishing of the entire assembly will be available on demand. Complete small variant and SV profiles as well as the underlying assembly data will be accessible to the end user in GenVision Ultra. In addition, the application will have discrete filtering and statistical tools with which to identify genes and/or variants of interest in an individual sample or across a cohort/population (Aim 3). To ensure that the software meets the clinical sequencing market needs, Arkana Laboratories has agreed to provide ONT and Illumina sequence data from highly curated HapMap control samples processed with their kidney disease gene panels. Those real-world data sets together with expert interpretation and feedback by Arkana researchers provide an ideal environment in which to develop an outstanding software solution for this critical market (Aim 4).

Public Health Relevance Statement:
Long read sequencing technologies can decipher DNA molecules one thousand times longer than "next generation" sequencing machines. This has tremendous implications for clinical sequencing by giving researchers and clinicians access to previously opaque aspects of an individual's genome. In this project, we will develop the software needed by clinical sequencing labs to exploit this remarkable advance in the pursuit of improved disease prevention, detection and treatment.

Project Terms: