SBIR-STTR Award

Scalable Post-Assembly Editing Software for Finishing and Annotating Personal Genomes
Award last edited on: 7/28/2020

Sponsored Program
SBIR
Awarding Agency
NIH : NIGMS
Total Award Amount
$1,649,983
Award Phase
2
Solicitation Topic Code
-----

Principal Investigator
Timothy J Durfee

Company Information

DNA Star Inc (AKA: DNAstar Inc)

3801 Regent Street
Madison, WI 53705
(608) 258-7420
info@dnastar.com
www.dnastar.com
Location: Single
Congr. District: 02
County: Dane

Phase I

Contract Number: 1R44GM128518-01A1
Start Date: 9/1/2018    Completed: 2/28/2019
Phase I year
2018
Phase I Amount
$149,981
We are entering a new era of personal genomics where an individual's genome sequence will be used to identify disease susceptibility, improve diagnosis, and better treat illnesses, as well as be combined across cohorts and populations to identify new biomarkers and causal mutations underlying any phenotype. Despite the tremendous success of mapping short-read next-generation sequencing (NGS) data onto a reference genome (resequencing) in identifying genetic variation in a new genome, the inherent lack of long-range connectivity, together with reference-induced biases, makes obtaining complete haplotype-phased genomes exceedingly difficult. Emerging long-read technologies are beginning to address this critical shortcoming by direct de novo assembly of an individual's genome. However, initial de novo assemblies typically consist of many thousands of unordered contigs that require extensive post-assembly processing to produce finished sequences that can be effectively mined for genetic content and variation. Thus, there is an urgent need for integrated, scalable post-assembly software that 1) automatically organizes, joins, and phases the initial contigs into complete haplotype sequences, 2) supports optional NGS and/or manual polishing, and 3) provides initial automated annotation of those sequences. Currently, such software does not exist, and users must instead cobble together a confusing array of difficult-to-use, task-specific pieces of open-source programs. DNASTAR's post-assembly editing program, SeqMan Pro (SMP), has a proven history in finishing bacterial-sized genomes, although it currently lacks the scalability and some of the functionality needed to tackle human-genome-sized problems. The primary goal of this Fast Track proposal is to create a fully scalable version of SMP for the automated finishing and annotation of de novo assembled large eukaryotic genomes while also providing a manual editing platform when needed.
During Phase I, we will develop two key prototypes: 1) a new assembly file format, eBAM, which is interconvertible with the BAM format but is also editable like our SQD files, and 2) a rapid reference-assisted contig scaffolding tool adapted from our proprietary Disk Sort Alignment (DSA) algorithm. With that foundation, we will complete the transformation of SMP in Phase II by: 1) refining the eBAM format for optimal editing performance, 2) building a new 64-bit version of the SMP editing engine that incorporates the additional functionality necessary for post-assembly finishing of large eukaryotic genomes, including automated DSA-based scaffolding and phase-aware gap filling, contig joining, and haplotype refinement, 3) creating a new DSA-based genome aligner for rapidly aligning a finished sequence to an annotated reference genome, and 4) adding a new feature transfer and analysis module which, together with the aligner, will permit initial annotation of the finished genome along with a cataloging of variants and their impact in both native and reference coordinates. Inclusion of the reference coordinates allows variants in the new genome to be easily associated with the wealth of information available through the numerous online knowledge base resources.

Project Terms:
Address; Algorithms; Alleles; Automated Annotation; Awareness; Bacterial Genome; base; Base Sequence; Biological Markers; Cataloging; Catalogs; Chromosomes, Human, Pair 12; cohort; Complement; Complex; Computer software; Computers; Consensus Sequence; cost; Data; design; Diagnosis; Diploidy; Disease susceptibility; DNA Resequencing; DNA sequencing; experience; file format; Foundations; Generations; Genes; Genetic; Genetic Variation; Genome; genome annotation; Genomics; Glean; Goals; graphical user interface; Haplotypes; Hour; human disease; Human Genome; Imagery; improved; Individual; knowledge base; Manuals; Maps; Mutation; next generation sequencing; open source; Performance; personalized medicine; Persons; Phase; Phenotype; Polishes; Population; programs; Proteins; prototype; Recording of previous events; reference genome; Resources; Running; scaffold; success; Technology; tool; Variant; whole genome; Writing

Phase II

Contract Number: 4R44GM128518-02
Start Date: 9/1/2018    Completed: 2/28/2021
Phase II year
2019
Phase II Amount
$1,500,002

Public Health Relevance Statement:
New advances in DNA sequencing are making it possible to construct the entire genome sequence of every person. Gleaning the genetic variation unique to each individual from these sequences has tremendous implications for personalized medicine, including understanding the causes of and cures for human disease. In this project, we will develop the finishing software needed to determine the complete genome sequence of any individual, as well as its genetic content and variation.

NIH Spending Category:
Bioengineering; Genetics; Human Genome; Networking and Information Technology R&D (NITRD)

Project Terms:
Address; Algorithms; Alleles; Automated Annotation; Awareness; Bacterial Genome; base; Base Sequence; Biological Markers; Cataloging; Catalogs; causal variant; Chromosome 12; cohort; Complement; Complex; Computer software; Computers; Consensus Sequence; contig; cost; Data; design; Diagnosis; Diploidy; Disease susceptibility; DNA Resequencing; DNA sequencing; experience; file format; Foundations; Generations; Genes; Genetic; Genetic Variation; Genome; genome annotation; Genomics; Glean; Goals; graphical user interface; Haplotypes; Hour; human disease; Human Genome; Imagery; improved; Individual; knowledge base; Manuals; Maps; next generation sequencing; open source; Performance; personalized medicine; Persons; Phase; Phenotype; Polishes; Population; programs; Proteins; prototype; Recording of previous events; reference genome; Resources; Running; scaffold; success; Technology; tool; Variant; whole genome; Writing