SBIR-STTR Award

Development of a Joint Machine Learning/De Novo Assembly System for Resolving Viral Quasispecies
Award last edited on: 1/29/2021

Sponsored Program
SBIR
Awarding Agency
NIH : NIAID
Total Award Amount
$267,225
Award Phase
1
Solicitation Topic Code
855
Principal Investigator
Johanna C Craig

Company Information

Gataca LLC

180 Orchard Hill Lane
Newport, VA 24128
   (540) 544-3033
   research@gatacallc.com
   www.gatacallc.com
Location: Single
Congr. District: 09
County: Giles

Phase I

Contract Number: 1R43AI152894-01
Start Date: 4/1/2020    Completed: 3/31/2021
Phase I year
2020
Phase I Amount
$267,225
Hepatitis B virus (HBV) establishes chronic infections in more than 250 million people worldwide; chronicity is on the rise, and approximately one-third of the world’s population (2 billion) has serologic evidence of exposure. HBV coinfection with HCV and HIV is a hidden consequence of the substance use disorder epidemic. Viral populations have extremely high sequence diversity and evolve rapidly, which explains the vaccine failure rates and viral resistance to existing therapies and makes discovering lasting therapies extremely challenging.

Next Generation Sequencing (NGS) is the method of choice to assess the intra-host virus population, termed a “quasispecies”. Although NGS yields a large set of short DNA sequencing reads that represent the virions in the quasispecies, current computational technologies are limited in their analysis capabilities, resulting in particularly low resolution of complex HBV genomic structures. Another challenge is assembling NGS reads representing short fragments of the host genome into full strains (haplotypes) without knowledge of their true occurrence in the samples.

To meet these challenges, GATACA is developing pathogen-specific bioinformatics software, GAT-ML (GATACA Assembly Tool – machine learning [ML]), to support treatment discovery and improve infection control. Its purpose-built algorithm uses novel ML methodologies adapted for genome assembly, which will allow GAT-ML to reconstruct complete viral haplotypes and populations by learning the ‘language’ of the sequences. Tailored initially for HBV samples, GAT and its new ML system will be integrated for feasibility testing in this Phase I with the following Specific Aims:
Specific Aim 1. Build a joint learning system. Train and test natural language processing (NLP) methods on HBV genetic variation.
Specific Aim 2. Implement and test the machine learning methods in GAT (GAT-ML).

We anticipate a working tool for characterizing HBV haplotypes, validated with multi-sourced datasets, and extensive testing and benchmarking of offline and integrated methods.
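
As a rough illustration of what treating sequences as a ‘language’ can mean, the sketch below splits reads into overlapping k-mers ("words") and counts them into a vocabulary, a common first step before applying NLP-style models to DNA. It is not GAT-ML's actual method, which the abstract does not describe; the function names, the choice of k = 3, and the toy reads are all assumptions made for illustration.

```python
from collections import Counter


def kmer_tokens(read: str, k: int = 3) -> list[str]:
    """Split one read into overlapping k-mers (hypothetical helper)."""
    read = read.upper()
    # A read shorter than k yields no tokens.
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_vocabulary(reads: list[str], k: int = 3) -> Counter:
    """Count k-mer occurrences across all reads (hypothetical helper)."""
    counts: Counter = Counter()
    for read in reads:
        counts.update(kmer_tokens(read, k))
    return counts


# Toy reads for illustration only; these are not real HBV data.
reads = ["ACGTAC", "CGTACG"]
vocab = build_vocabulary(reads, k=3)
```

Each read becomes a "sentence" of k-mer tokens, and the resulting counts (or token sequences) could then feed an embedding or language model. The value of k trades vocabulary size against context per token and would need tuning on real data.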

Public Health Relevance Statement:
The proposed project will develop and increase the capabilities of our novel computational tool, GAT, to help researchers identify the full spectrum of genetic features of a viral population—such as emergence and persistence of resistance or baseline polymorphisms regardless of their frequencies—and translate these findings to the development of new or improved antiviral drugs and other applications requiring high analytic sensitivity. GAT will particularly benefit researchers working in preclinical stages of drug development who require rapid, sensitive, and reliable results to inform decisions about which targets to advance to clinical trial testing.

Project Terms:
Adoption; Algorithm Design; Algorithms; Antiviral Agents; base; Benchmarking; Bioinformatics; Chronic; Chronic Hepatitis; chronic infection; Classification; Clinical Trials; co-infection; commercialization; Complex; Computer software; computerized tools; contig; Data; Data Set; design; Development; Dimensions; DNA sequencing; DNA Structure; drug development; Epidemic; Failure; Frequencies; Genetic; Genetic Polymorphism; Genetic Variation; Genome; Genomics; Genotype; Haplotypes; Healthcare; Hepatitis B; Hepatitis B Virus; HIV; HIV/HCV; improved; Infection Control; insertion/deletion mutation; Joints; Knowledge; Language; Language Development; Learning; Link; Liver diseases; Machine Learning; machine learning algorithm; machine learning method; Metagenomics; Methodology; Methods; Modeling; Molecular; multiple data sources; Mutation; Natural Language Processing; neural network; next generation sequencing; novel; Outcome; pathogen; Pattern; Performance; Phase; Population; Population Analysis; pre-clinical; Privatization; Research Personnel; Resistance; Resolution; Sampling; Semantics; Serological; Serotyping; Source; Speed; structural genomics; Substance Use Disorder; Supervision; syntax; System; Techniques; Technology; Testing; tool; Training; Translating; Trust; Vaccines; Validation; Variant; Viral; Viral hepatitis; viral resistance; Virion; Virus

Phase II

Contract Number: ----------
Start Date: 00/00/00    Completed: 00/00/00
Phase II year
----
Phase II Amount
----