The high cost and analytical complexity of whole-genome resequencing remain prohibitive for most clinical applications. Targeted resequencing allows regions of interest to be enriched from a genomic DNA sample and sequenced to high depth, allowing cost-effective identification of important variants. In combination with next-generation sequencing (NGS), the approach has been exploited to tremendous effect in identifying candidate genes and variants for an array of diseases and traits from cohorts and populations as well as individual clinical samples. However, the short-read nature of NGS technologies severely limits their ability to characterize features such as compound heterozygotes and structural variants (SVs), because short reads lack the long-range connectivity needed for haplotype phasing and SV detection. Those limitations can be overcome with long-read data from Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT). Moreover, new targeting methods tailored toward long-read sequencing are being developed such that a comprehensive analysis of key regions in an individual's genome will soon be within reach. However, an integrated software solution that is easy enough for clinical researchers to use efficiently is sorely lacking. The overall goal of this Direct-to-Phase II proposal is to develop commercial-grade software, GenVision Ultra, that produces a comprehensive catalog of annotated, haplotype-phased variants from clinical sequencing data and presents them to clinical researchers through a single easy-to-use application with both analytical and genome-browsing capabilities. The proposal focuses on augmenting our highly extensible XNG assembly pipeline with the tools necessary for fully automated detection and annotation of all classes of variants from haplotype-phased sequences. Novel adaptations to core XNG components will partition reads matching the reference from those likely representing an SV for parallel processing (Aim 1).
Matching reads will be aligned to the reference using XNG, while the putative SV-containing reads will be assembled de novo and annotated using our long-read assembler (LRA). Reference-based alignments will be phased using a novel Bayesian classifier to produce two haplotype sequences prior to SNV/small indel calling and annotation (Aim 2). Short-read polishing of the entire assembly will be available on demand. Complete small-variant and SV profiles, as well as the underlying assembly data, will be accessible to the end user in GenVision Ultra. In addition, the application will have discrete filtering and statistical tools with which to identify genes and/or variants of interest in an individual sample or across a cohort or population (Aim 3). To ensure that the software meets the needs of the clinical sequencing market, Arkana Laboratories has agreed to provide ONT and Illumina sequence data from highly curated HapMap control samples processed with their kidney disease gene panels. Those real-world data sets, together with expert interpretation and feedback from Arkana researchers, provide an ideal environment in which to develop an outstanding software solution for this critical market (Aim 4).
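As an illustration only of the general idea behind Bayesian read phasing (Aim 2), the sketch below assigns a read to one of two candidate haplotypes from the alleles it carries at heterozygous sites. It is not the classifier proposed here, whose design is not described in this summary. The function name, the per-base error rate, the equal prior, and the dictionary-based inputs are all hypothetical, and the error model is deliberately simplified to independent, biallelic sites.

```python
import math


def haplotype_posterior(read_alleles, hap_a, hap_b, error_rate=0.05, prior_a=0.5):
    """Return P(read came from haplotype A | alleles observed on the read).

    Hypothetical helper for illustration. Inputs are dicts mapping a site
    position to a base. Sites are treated as independent, and a base that
    disagrees with a haplotype is assumed to be a sequencing error.
    """
    log_a = math.log(prior_a)
    log_b = math.log(1.0 - prior_a)
    for site, allele in read_alleles.items():
        if site not in hap_a or site not in hap_b:
            continue  # site is not covered by the candidate haplotypes
        log_a += math.log(1.0 - error_rate if allele == hap_a[site] else error_rate)
        log_b += math.log(1.0 - error_rate if allele == hap_b[site] else error_rate)
    # Normalize in log space to avoid underflow on reads spanning many sites.
    top = max(log_a, log_b)
    weight_a = math.exp(log_a - top)
    weight_b = math.exp(log_b - top)
    return weight_a / (weight_a + weight_b)


def assign_read(read_alleles, hap_a, hap_b, threshold=0.9):
    """Label a read 'A', 'B', or 'unassigned' using an assumed posterior cutoff."""
    p_a = haplotype_posterior(read_alleles, hap_a, hap_b)
    if p_a >= threshold:
        return "A"
    if p_a <= 1.0 - threshold:
        return "B"
    return "unassigned"


# Toy haplotypes: sites 100 and 250 are heterozygous, site 400 is homozygous.
hap_a = {100: "A", 250: "G", 400: "T"}
hap_b = {100: "C", 250: "T", 400: "T"}

print(assign_read({100: "A", 250: "G"}, hap_a, hap_b))
print(assign_read({100: "C", 250: "T"}, hap_a, hap_b))
print(assign_read({400: "T"}, hap_a, hap_b))
```

In this toy setup a read covering only the homozygous site carries no phasing information and stays unassigned, which is why long reads spanning several heterozygous sites are the informative ones.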
Public Health Relevance Statement: Long-read sequencing technologies can decipher DNA molecules one thousand times longer than those read by next-generation sequencing machines. This has tremendous implications for clinical sequencing, giving researchers and clinicians access to previously opaque aspects of an individual's genome. In this project, we will develop the software needed by clinical sequencing labs to exploit this remarkable advance in the pursuit of improved disease prevention, detection, and treatment.
Project Terms: analytical tool; annotation system; base; Biological Sciences; Candidate Disease Gene; Catalogs; Clinical; clinical application; clinical sequencing; cohort; Collaborations; Computer software; Computers; contig; cost; cost effective; Data; Data Set; Detection; Development; Disease; disorder prevention; DNA; DNA Resequencing; Ensure; Environment; Evaluation; Feedback; Gene Family; Genes; genetic variant; Genome; genome browser; genome sequencing; genome-wide; Genomic DNA; Goals; Haplotypes; Heterozygote; Hour; improved; Individual; insertion/deletion mutation; instrument; interest; Kidney Diseases; Laboratories; Length; Methods; nanopore; Nature; next generation sequencing; novel; parallel processing; Phase; Polishes; Population; Privacy; Process; prototype; Provider; Pseudogenes; reference genome; Research Personnel; Running; Sampling; Series; software development; Statistical Methods; Stream; Structure; Targeted Resequencing; Technology; Time; tool; trait; Variant; Visualization; whole genome