SBIR-STTR Award

BioHDF - Open Binary File Standards for Bioinformatics
Award last edited on: 5/3/19

Sponsored Program
STTR
Awarding Agency
NIH : NCHGR
Total Award Amount
$1,308,142
Award Phase
2
Solicitation Topic Code
-----

Principal Investigator
Todd M Smith

Company Information

Geospiza Inc

100 West Harrison North Tower Suite 330
Seattle, WA 98119
   (206) 633-4403
   info@geospiza.com
   www.geospiza.com

Research Institution

The HDF Group

Phase I

Contract Number: 1R41HG003792-01
Start Date: 3/17/09    Completed: 2/28/11
Phase I year
2005
Phase I Amount
$142,775
Geospiza Inc. and the National Center for Supercomputing Applications (NCSA) are creating a standards-based software framework around NCSA's Hierarchical Data Format (HDF5). The envisioned framework will integrate algorithms important in DNA and protein sequence analysis to create scalable, high-throughput software systems. These systems will be accessed through new graphical user interfaces (GUIs) that give researchers new views of their data to finish sequencing projects in large-scale genome sequencing, microbial genome sequencing, viral epidemiology, polymorphism detection, phylogenetic analysis, multi-locus sequence typing, confirmatory sequencing, and EST analysis. In our vision, algorithms will either be integrated into the system to read and write HDF5 project files directly, or they will communicate with project files via filter programs that produce standardized XML-formatted data. Through this model, a scalable solution will support different applications of DNA sequencing, fulfilling the many needs and requirements expressed by the medical research community now and into the future. As the first step in this process, we will define requirements for editing and versioning data in DNA sequencing; research and propose data models for the computational phases of DNA sequencing and for annotating DNA sequence data using existing standards; create a prototype application for DNA sequencing-based SNP discovery; and engage the bioinformatics community for BioHDF adoption. In the past ten years the cost of sequencing DNA has dropped more than 1,000-fold, and the amount of raw sequence data entering our national repositories is doubling every 12 months. DNA sequencing is fundamental to biological research activities such as genomics, systems biology, and clinical medicine.
Proposals are being sought to decrease sequencing costs by two orders of magnitude through technology refinements, with an ultimate vision of developing technology to sequence human genome equivalents for $1,000 each. The amount of data that will be produced through these endeavors is unimaginable. However, the $1,000 genome will not advance medical research unless we integrate all phases of the DNA sequencing process and treat the creation, management, finishing, analysis, and sharing of the data as common goals.

Phase II

Contract Number: 2R42HG003792-02A1
Start Date: 3/17/09    Completed: 2/28/11
Phase II year
2009
(last award dollars: 2010)
Phase II Amount
$1,165,367

The first wave of Next Generation ("Next Gen") sequencing technologies combines molecular resolution with extremely high throughput to dramatically reduce sequencing costs and increase assay sensitivity and specificity. These technologies will provide large numbers of laboratories with "Genome Center" levels of throughput to make discoveries and develop new assays never before imagined. However, widespread adoption of Next Gen will be hindered because current bioinformatics programs do not scale; they are inefficient in data storage, processing, and memory utilization. The most popular programs typically copy and recopy data to new files many times during processing, require that all data be maintained in random access memory (RAM) when running, and cannot incrementally process data. To overcome these issues, fundamental changes in data management and processing are needed. Geospiza and The HDF Group are collaborating to develop portable, scalable bioinformatics technologies based on HDF5 (Hierarchical Data Format, http://www.hdfgroup.org). We call these extensible domain-specific data technologies "BioHDF." BioHDF will implement a data model that supports primary DNA sequence information (reads, quality values, and metadata) and results from sequence assembly and variation detection algorithms. BioHDF will extend HDF5 data structures and library routines with new features (indexes, additional compression, and graph layouts) to support the high-performance data storage and computation requirements of Next Gen sequencing. BioHDF will include APIs, software tools, and a viewer based on HDFView to enable its use in the bioinformatics and research communities.
Using BioHDF, researchers will be able to perform whole genome shotgun sequencing (WGS), "tag and count" experiments (EST analysis, promoter mapping, DNA methylation, functional mapping), and variation analysis; they will also be able to export datasets in formats accepted by the key databases to publish their work. As a programming environment, BioHDF can be easily extended to accept data from new data collection platforms and to format data for interchange with many databases. Core BioHDF tools will be delivered to the research community as an open-source technology. Geospiza will use BioHDF in its Finch line of products to deliver software systems and applications to support clinical research, diagnostics, and other relevant activities that rely on genetic data.
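
The abstract above contrasts programs that must hold all data in RAM with programs that process data incrementally. The following is a minimal sketch of the incremental style only, assuming FASTQ-style text input with four lines per record and Phred+33 quality encoding. It is not BioHDF code and does not use HDF5; the names Read, iter_reads, and summarize are hypothetical and chosen for illustration.

```python
import io
from typing import Dict, Iterator, List, NamedTuple, TextIO


class Read(NamedTuple):
    """One sequencing read: identifier, bases, per-base quality scores."""
    name: str
    bases: str
    qualities: List[int]


def iter_reads(handle: TextIO, phred_offset: int = 33) -> Iterator[Read]:
    """Yield reads one at a time from FASTQ-style text (assumed 4 lines per record)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        if not header.strip():
            continue  # skip blank lines between records
        bases = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored in this sketch
        quality_line = handle.readline().rstrip("\n")
        qualities = [ord(ch) - phred_offset for ch in quality_line]
        yield Read(header.strip().lstrip("@"), bases, qualities)


def summarize(reads: Iterator[Read]) -> Dict[str, float]:
    """Keep only running totals, so memory use does not grow with input size."""
    count = total_bases = total_quality = 0
    for read in reads:
        count += 1
        total_bases += len(read.bases)
        total_quality += sum(read.qualities)
    mean_quality = total_quality / total_bases if total_bases else 0.0
    return {"reads": count, "bases": total_bases, "mean_quality": mean_quality}


# Tiny in-memory sample standing in for a large file on disk.
SAMPLE = "@r1\nACGT\n+\nIIII\n@r2\nGG\n+\n5I\n"
summary = summarize(iter_reads(io.StringIO(SAMPLE)))
print(summary)
```

Because iter_reads is a generator, only one record is held in memory at a time; the same loop would work on an open file handle of any size.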

Public Health Relevance:
The overall goal of the BioHDF Phase II project is to make it possible for medical research and clinical communities to take full advantage of the latest DNA sequencing platforms in their efforts to improve public health. Geospiza and The HDF Group will build on their expertise in Laboratory Information Management Systems and in high-volume, high-complexity scientific data management systems to create and deliver bioinformatics software systems that can handle the massive amounts of data produced by the latest sequencing instruments. The integrated systems will keep track of collected samples, sequence data, DNA tests, and other laboratory records and biological data associated with the entire sequencing and analysis process, and will make it easy for clinicians to use the technology to do their work.

Project Terms:
API; Address; Adoption; Algorithms; Analysis, Data; Assay; Basic Research; Basic Science; Bio-Informatics; Bioassay; Bioinformatics; Biologic Assays; Biological; Biological Assay; Clinical; Clinical Research; Clinical Study; Communities; Computer Programs; Computer Software Tools; Computer software; Consensus; DNA; DNA Methylation; DNA Sequence; Data; Data Analyses; Data Banks; Data Bases; Data Collection; Data Set; Data Storage and Retrieval; Databank, Electronic; Databanks; Database, Electronic; Databases; Dataset; Deoxyribonucleic Acid; Detection; Development; Diagnostic; ESTs; Environment; Expressed Sequence Tags; Finches; Future Generations; Genetic; Genome; Goals; Graph; Heart; Infrastructure; Investigators; Laboratories; Libraries; MAPI; Management Information Systems; Maps; Medical Research; Memory; Methods; Molecular; Performance; Phase; Process; Programs (PT); Programs [Publication Type]; Promoter; Promoters (Genetics); Promotor; Promotor (Genetics); Public Health; Publishing; Reading; Records; Research; Research Infrastructure; Research Personnel; Researchers; Resolution; Running; SEQ-AN; STTR; Sampling; Science; Secure; Sensitivity and Specificity; Sequence Analyses; Sequence Analysis; Small Business Technology Transfer Research; Software; Software Tools; Solutions; Structure; System; System, LOINC Axis 4; Technology; Testing; Time; Tools, Software; Variant; Variation; Whole-Genome Shotgun Sequencing; Work; alkaline protease inhibitor; base; clinical data repository; clinical data warehouse; computer program/software; computerized data processing; cost; data format; data management; data modeling; data processing; data repository; data retrieval; data storage; experience; experiment; experimental research; experimental study; file format; improved; indexing; instrument; memory process; microbial alkaline proteinase inhibitor; next generation; open source; programs; public health medicine (field); public health relevance; relational database; research study; signal processing; software systems; tool; usability