Hepatitis infection is a widespread and persistent crisis, both in the U.S. and worldwide. Chronic infection with the hepatitis B virus (HBV) is a major cause of end-stage liver disease, with staggering long-term healthcare costs. Thus, HBV research and innovation have been deemed national priorities in the U.S., as evidenced by calls to action from the Institute of Medicine and the Department of Health and Human Services. An intra-host HBV infection comprises a genetically diverse population of variants (quasispecies) that is an important determinant of pathogenesis and treatment outcome. Mapping the quasispecies is required; however, map construction is difficult owing to HBV's complex genome structure, variant divergence from reference genomes, and a lack of accurate tools. Current de novo assembly algorithms intended for viral genome assembly produce an inadequate single linear representation of a viral population. Algorithms meant for diploid genome assembly are overtaxed by virology data, produce unnecessarily complex output, and are computationally expensive. For this Phase I project, GATACA, LLC proposes to develop the Assembly Tool to accurately map intra-host HBV strains from short read data. Using novel iterative clustering and priority merging steps, the tool will assemble the reads into multiple interconnected consensus sequences (contigs) as a map of global haplotypes. The contig sets will provide valuable reference data backbones for subsequent analyses. The tool will improve inter-host comparisons, which depend on accurate HBV quasispecies parameters. The Assembly Tool will be integrated into existing software developed by GATACA. Specific Aims for Phase I are: (1) Develop, test, and prototype a de novo algorithm based on novel iterative clustering and priority merging steps, and represent global HBV variation as interconnected graphs.
(2) Develop a validation algorithm that generates simulated HBV data, incorporates patient-derived HBV data, and benchmarks the performance of the Assembly Tool against that of other viral genome assemblers. In Phase II, GATACA will develop HepBbase, a commercial web-based platform that will provide data management and allow users to "plug-and-play" familiar analysis tools alongside HBV-specific functions. An assembly pipeline will be developed in Phase II to automate the labor-intensive steps of developing HBV draft genomes. GATACA will begin HepBbase commercialization efforts in Q2 2017 during Phase II development. Potential customers are HBV virologists in all research-based disciplines, who lack adequate, centralized, user-friendly software for HBV data management and analysis. Discoveries made with our tool, based on assembled patient reference genomes, will also inform clinicians. As the first virus-specific bioinformatics platform with large-scale capacity, HepBbase will eliminate bottlenecks and facilitate collaboration.
Public Health Relevance Statement: Chronic infection with hepatitis B virus (HBV) is associated with serious health concerns (such as cancer, cirrhosis, and liver failure) and escalating healthcare costs. Deep sequencing of viral infections using next generation sequencing (NGS) methods yields extensive data about the genetic complexity of HBV and other viral infections. Newer sequencing technologies are able to produce longer, usable sequences, but are years away from reaching the maturity required for widespread adoption in virology. Meanwhile, NGS continues to yield new insights into heterologous infections and can identify resistance-associated variants (RAVs), even when the resistance requires a long-range interaction and occurs in rare variants. Current analysis tools struggle to provide an accurate intra-host variant map; they detect only point mutations in limited viral regions. Virologists require tools for aggregating the NGS fragments into accurate variant strains to advance research in the therapeutic and clinical fields, and to guide clinical management of patients. In this Phase I SBIR, GATACA, LLC proposes to develop an innovative short read assembler to bridge the de novo single-consensus assembly approach and the haplotype inference problem, which is especially challenging in HBV intra-host populations.
Project Terms: Address; Adoption; Algorithms; base; Benchmarking; Bioinformatics; Biological Markers; Chronic; Chronic Disease; Cirrhosis; Clinical; Clinical Management; cluster merger; Collaborations; commercial application; commercialization; Complex; Computer software; computing resources; Consensus; Consensus Sequence; Data; data management; Data Set; deep sequencing; Development; Diploidy; Discipline; Frequencies (time pattern); Genetic; genetic analysis; genetic variant; Genome; Genomics; Goals; Graph; Haplotypes; Health; Health Care Costs; Hepatitis; Hepatitis B; Hepatitis B Virus; improved; Industry; Infection; innovation; insight; Institute of Medicine (U.S.); interest; Internet; Knowledge; Liver diseases; Liver Failure; Malignant Neoplasms; Maps; Marketing; Masks; Methods; Mining; Minor; next generation sequencing; novel; Online Systems; Outcome; outcome prediction; Output; pathogen; Pathogenesis; Patients; Performance; Phase; Play; Point Mutation; Population; Population Heterogeneity; programs; prototype; public health relevance; rare variant; Reading; Reading Frames; reconstruction; reference genome; Research; Resistance; Risk; Scheme; Scientist; Small Business Innovation Research Grant; software development; Staging; Structure; Taxes; Technology; Testing; Therapeutic; tool; Treatment outcome; United States Dept. of Health and Human Services; user-friendly; Validation; Variant; Vertebral column; Viral; Viral Genome; virology; Virus; Virus Diseases; Work