Modern DNA sequencing technologies have transformed our ability to interrogate human genomes in a single experiment, thereby eliminating the inherent blind spots of gene panels and whole exome sequencing (WES). Furthermore, recent improvements in speed and economy are driving the cost of whole genome sequencing (WGS) down to that of WES; therefore, we foresee a transition over the next two years to WGS as the de facto test for human disease research and diagnosis in academic labs, hospitals, and both biotechnology and pharmaceutical companies. Indeed, conservative estimates project that 20 million human genomes will be sequenced in the next decade. However, the transition to research and diagnostics driven by WGS presents a substantial data processing burden: a single WGS sample represents at least 100 gigabytes of data, and converting the raw data into a comprehensive set of genetic variants requires an intricate, rapidly changing, and computationally onerous workflow. Based on our history of developing innovative computational methods for genomic research and motivated by the acute need for advanced, scalable computing platforms, the applicant team founded Base2 Genomics (Base2). Base2 has created an innovative platform for WGS data processing, quality control, variant detection and prioritization, and data visualization using Amazon Web Services (AWS) cloud computing. Developed in close collaboration with AWS engineers, the Base2 platform has as its fundamental strengths its speed, cost, capacity for parallelization, and, most importantly, its ability to accurately identify all forms of genetic variation, whereas most other commercial offerings focus solely on the forms of variation that are easiest to discover (SNPs and INDELs). We argue that, in order to maximize the research, diagnostic, and pharmacogenetic utility of WGS, it is imperative to create a complete catalog of all variation in each sequenced genome.
In this proposal, we will further improve our technologies with the following aims:
Aim 1. Develop proprietary technologies for prioritizing and annotating copy-number and structural variation (SV) via population-scale databases. We have developed STIX (STructural variant IndeX), a proprietary compression algorithm and database for efficiently profiling evidence for SV among thousands of human genomes. We propose to leverage this innovation to create unique, proprietary STIX databases and an associated SV annotation engine to facilitate accurate prioritization of SV in customer WGS cohorts.
Aim 2. Create a secure, high-performance customer data submission portal. We will develop a customer data submission portal that maximizes efficiency and security while allowing customers to upload data and invoke processing through the Base2 platform.
Project Terms: Acceleration; Acute; Algorithmic Software; Automation; Automobile Driving; base; Biotechnology; Businesses; Catalogs; Cloud Computing; cohort; Collaborations; computerized data processing; Computing Methodologies; cost; Data; Data Compression; Data Set; data submission; data visualization; Databases; Detection; Diagnosis; Diagnostic; Diagnostics Research; disease phenotype; DNA sequencing; Engineering; Ensure; Environment; Ewings sarcoma; exome sequencing; experimental study; Family; Frequencies; Funding; Future; Genes; Genetic; Genetic Polymorphism; genetic variant; Genetic Variation; Genome; genome analysis; genome sequencing; genomic platform; Genomics; Genotype; gigabyte; Grant; Growth; Health Insurance Portability and Accountability Act; Hospitals; human disease; Human Genetics; Human Genome; improved; indexing; Individual; Informatics; innovation; innovative technologies; insertion/deletion mutation; Knowledge; Mendelian disorder; Modernization; parallelization; Patient Data Privacy; Performance; Pharmacogenetics; Pharmacologic Substance; Population; Process; programs; Provider; Quality Control; Rare Diseases; Recording of previous events; Research; Resources; Retinal blind spot; Sampling; Science; Secure; Security; Seeds; sequencing platform; Short Tandem Repeat; Single Nucleotide Polymorphism; Speed; Technology; terabyte; Testing; Time; Utah; Variant; web services; whole genome; Work;