Hepatitis B virus (HBV) establishes chronic infection in more than 250 million people worldwide; chronic infection is on the rise, and approximately one-third of the world's population (2 billion) has serologic evidence of exposure. HBV coinfection with HCV and HIV is a hidden consequence of the substance use disorder epidemic. Viral populations have extremely high sequence diversity and evolve rapidly, which explains vaccine failure rates and viral resistance to existing therapies and makes discovering lasting therapies extremely challenging. Next-generation sequencing (NGS) is the method of choice for assessing the intra-host virus population, termed a quasispecies. Although NGS yields a large set of short DNA sequencing reads that represent the virions in the quasispecies, current computational tools are limited in their analytical capabilities, resulting in particularly low resolution of complex HBV genomic structures. Another challenge is assembling NGS reads, each representing a short fragment of the viral genome, into full strains (haplotypes) without knowledge of their true occurrence in the samples. To meet these challenges, GATACA is developing pathogen-specific bioinformatics software, GAT-ML (GATACA Assembly Tool machine learning [ML]), to support treatment discovery and improve infection control. Its purpose-built algorithm uses novel ML methodologies, adapted and modified to assist genome assembly, that will allow GAT-ML to reconstruct complete viral haplotypes and populations by learning the language of the sequences. Tailored initially for HBV samples, GAT and its new ML system will be integrated for feasibility testing in this Phase I project with the following Specific Aims:
Specific Aim 1. Build a joint learning system. Train and test natural language processing (NLP) methods on HBV genetic variation.
Specific Aim 2. Implement and test the machine learning methods in GAT (GAT-ML).
We anticipate a working tool for characterizing HBV haplotypes, validated with multi-sourced datasets, and extensive testing and benchmarking of offline and integrated methods.
Public Health Relevance Statement: The proposed project will develop and increase the capabilities of our novel computational tool, GAT, to help researchers identify the full spectrum of genetic features of a viral population (such as emergence and persistence of resistance or baseline polymorphisms, regardless of their frequencies) and translate these findings to the development of new or improved antiviral drugs and other applications requiring high analytic sensitivity. GAT will particularly benefit researchers working in preclinical stages of drug development who require rapid, sensitive, and reliable results to inform decisions about which targets to advance to clinical trial testing.
Project Terms: Adoption; Algorithm Design; Algorithms; Antiviral Agents; base; Benchmarking; Bioinformatics; Chronic; Chronic Hepatitis; chronic infection; Classification; Clinical Trials; co-infection; commercialization; Complex; Computer software; computerized tools; contig; Data; Data Set; design; Development; Dimensions; DNA sequencing; DNA Structure; drug development; Epidemic; Failure; Frequencies; Genetic; Genetic Polymorphism; Genetic Variation; Genome; Genomics; Genotype; Haplotypes; Healthcare; Hepatitis B; Hepatitis B Virus; HIV; HIV/HCV; improved; Infection Control; insertion/deletion mutation; Joints; Knowledge; Language; Language Development; Learning; Link; Liver diseases; Machine Learning; machine learning algorithm; machine learning method; Metagenomics; Methodology; Methods; Modeling; Molecular; multiple data sources; Mutation; Natural Language Processing; neural network; next generation sequencing; novel; Outcome; pathogen; Pattern; Performance; Phase; Population; Population Analysis; pre-clinical; Privatization; Research Personnel; Resistance; Resolution; Sampling; Semantics; Serological; Serotyping; Source; Speed; structural genomics; Substance Use Disorder; Supervision; syntax; System; Techniques; Technology; Testing; tool; Training; Translating; Trust; Vaccines; Validation; Variant; Viral; Viral hepatitis; viral resistance; Virion; Virus