Phase II Amount
$1,500,002
We are entering a new era of personal genomics in which an individual's genome sequence will be used to identify disease susceptibility, improve diagnosis and better treat illnesses, and will also be combined across cohorts and populations to identify new biomarkers and causal mutations underlying any phenotype. Mapping short-read next-generation sequencing (NGS) data onto a reference genome (resequencing) has been tremendously successful in identifying genetic variation in a new genome. However, the inherent lack of long-range connectivity, together with reference-induced biases, makes obtaining complete haplotype-phased genomes exceedingly difficult. Emerging long-read technologies are beginning to address this critical shortcoming by direct de novo assembly of an individual's genome. However, initial de novo assemblies typically consist of many thousands of unordered contigs that require extensive post-assembly processing to produce finished sequences that can be effectively mined for genetic content and variation. Thus, there is an urgent need for integrated, scalable post-assembly software that 1) automatically organizes, joins and phases the initial contigs into complete haplotype sequences, 2) supports optional NGS and/or manual polishing and 3) provides initial automated annotation of those sequences. Currently, such software does not exist, and users must instead cobble together a confusing array of difficult-to-use, task-specific open-source programs. DNASTAR's post-assembly editing program, SeqMan Pro (SMP), has a proven history in finishing bacterial-sized genomes, although it currently lacks the scalability and some of the functionality needed to tackle human-genome-sized problems. The primary goal of this Fast Track proposal is to create a fully scalable version of SMP for the automated finishing and annotation of de novo assembled large eukaryotic genomes, while also providing a manual editing platform when needed.
During Phase I, we will develop two key prototypes: 1) a new assembly file format, eBAM, which is interconvertible with the BAM format but is also editable like our SQD files and 2) a rapid reference-assisted contig scaffolding tool adapted from our proprietary Disk Sort Alignment (DSA) algorithm. With that foundation, we will complete the transformation of SMP in Phase II by: 1) refining the eBAM format for optimal editing performance, 2) building a new 64-bit version of the SMP editing engine that incorporates the additional functionality necessary for post-assembly finishing of large eukaryotic genomes, including automated DSA-based scaffolding and phase-aware gap filling, contig joining and haplotype refinement, 3) creating a new DSA-based genome aligner for rapidly aligning a finished sequence to an annotated reference genome and 4) adding a new feature transfer and analysis module. Together, the aligner and the feature transfer module will permit initial annotation of the finished genome along with a catalog of variants and their impact in both native and reference coordinates. Including the reference coordinates allows variants in the new genome to be easily associated with the wealth of information available through the numerous online knowledge base resources.
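To illustrate what reporting a variant in both native and reference coordinates involves, the following is a minimal Python sketch. It assumes the genome alignment has already been reduced to a sorted list of gap-free aligned blocks. It is not the DSA-based aligner or the feature transfer module described above; the names Block and native_to_ref are hypothetical and exist only for this example.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    """One gap-free aligned segment (hypothetical structure for illustration)."""
    native_start: int  # 0-based start in the finished (native) sequence
    ref_start: int     # 0-based start in the reference sequence
    length: int        # number of aligned bases in the segment


def native_to_ref(blocks: List[Block], pos: int) -> Optional[int]:
    """Map a 0-based native position to a reference position.

    Assumes blocks are non-overlapping, on the same strand, and sorted by
    native_start. Returns None when the position falls outside every block,
    for example inside sequence that is absent from the reference.
    """
    starts = [b.native_start for b in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    block = blocks[i]
    offset = pos - block.native_start
    if offset >= block.length:
        return None  # pos lies in the gap after this block
    return block.ref_start + offset


# Two aligned blocks separated by 10 native bases with no reference counterpart.
blocks = [Block(0, 100, 50), Block(60, 150, 40)]
print(native_to_ref(blocks, 10))  # inside the first block
print(native_to_ref(blocks, 55))  # inside the unaligned native segment
```

A variant whose native position maps to None has no direct reference coordinate, so in this sketch it could only be reported in native coordinates.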
Public Health Relevance Statement: New advances in DNA sequencing are making it possible to construct the entire genome sequence of every person. Gleaning the genetic variation unique to each individual from these sequences has tremendous implications for personalized medicine, including understanding the causes of and cures for human disease. In this project, we will develop the finishing software needed to determine the complete genome sequence of any individual, as well as its genetic content and variation.
NIH Spending Category: Bioengineering; Genetics; Human Genome; Networking and Information Technology R&D (NITRD)
Project Terms: Address; Algorithms; Alleles; Automated Annotation; Awareness; Bacterial Genome; base; Base Sequence; Biological Markers; Cataloging; Catalogs; causal variant; Chromosome 12; cohort; Complement; Complex; Computer software; Computers; Consensus Sequence; contig; cost; Data; design; Diagnosis; Diploidy; Disease susceptibility; DNA Resequencing; DNA sequencing; experience; file format; Foundations; Generations; Genes; Genetic; Genetic Variation; Genome; genome annotation; Genomics; Glean; Goals; graphical user interface; Haplotypes; Hour; human disease; Human Genome; Imagery; improved; Individual; knowledge base; Manuals; Maps; next generation sequencing; open source; Performance; personalized medicine; Persons; Phase; Phenotype; Polishes; Population; programs; Proteins; prototype; Recording of previous events; reference genome; Resources; Running; scaffold; success; Technology; tool; Variant; whole genome; Writing