SBIR-STTR Award

Micropublications for Automating Genome Sequence Variant Interpretation from Medical Literature
Award last edited on: 5/21/2023

Sponsored Program
SBIR
Awarding Agency
NIH : NHGRI
Total Award Amount
$1,862,994
Award Phase
2
Solicitation Topic Code
172
Principal Investigator
Mark Julin Kiel

Company Information

Genomenon Inc

3135 South State Street Suite 350 BR
Ann Arbor, MI 48108
   (734) 794-3075
   hello@genomenon.com
   www.genomenon.com
Location: Single
Congr. District: 12
County: Washtenaw

Phase I

Contract Number: 1R43HG010446-01A1
Start Date: 5/1/2019    Completed: 4/30/2020
Phase I year
2019
Phase I Amount
$152,946
Accurate and efficient interpretation of genomic variants for clinical decision making is predicated on ready access to useful information in the medical literature. The sheer number of potentially relevant articles that must be examined during this curation process poses a major challenge to the accuracy and reproducibility of clinical variant interpretation, because the process is time-consuming and its results are highly user-dependent. To this end, we have developed the Mastermind Genomic Search Engine - a commercial database that automatically organizes disease, gene and variant information from the medical literature by systematically indexing millions of scientific articles. In direct comparison to manually developed databases of genetic variants, we have achieved greater than 97% concordance and accurately identified >50% more variants with an average of 3-fold more references, demonstrating the effectiveness of our automated approach. Currently, Mastermind is used by over 1800 variant scientists in 25 different countries to more quickly curate literature for genetic variants in clinical settings. In response to feedback from ClinGen curators and others, the present proposal seeks to create a framework to facilitate literature curation and clinical variant interpretation activities within Mastermind by 1) prioritizing relevant references and external database entries containing content meaningful to variant classification guidelines, 2) assembling this information into a “micropublication” text format with codified data fields including population frequencies, computational predictions, reference citations and relevant sentence fragments with conclusive content, 3) allowing users to manually review, alter and augment pre-populated entries and 4) providing a platform to share and continuously update this information with other variant scientists in the Mastermind community and elsewhere.
Developing tools that allow for collaborative curation in real time at the point of interaction with source material (i.e. individual articles) will mitigate reproducibility challenges plaguing other large-scale crowd-sourced projects. In contrast to genetic variant databases of user-submitted classification information and associated data, the present proposal seeks to create enhanced curation tools to fill such variant databases with more accurate and reproducible data, in a way that would promote dramatic scaling of variant curation activities, including those undertaken by groups like ClinVar. To test this approach, we will work with industry partners engaged in variant curation activities to 1) determine the requisite data fields, 2) integrate external database information, 3) test the accuracy and relevance of results and the overall efficacy of the approach using hundreds of manually curated genetic variants, and 4) solicit and incorporate feedback from our development partners to iterate and refine the software features for the greatest effect. Within the $4B genome sequencing software market, there is significant commercial potential and scientific merit in bringing more automated techniques of data analysis to large-scale genome sequencing variant interpretation as described in this proposal.
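
The abstract above describes a "micropublication" with codified data fields (population frequencies, computational predictions, reference citations, sentence fragments) that users can review and augment. As an illustration only, one way such a record might be modeled is sketched below. The class names, field names, and the `add_evidence` helper are all hypothetical and are not taken from Mastermind or from the proposal; the values in the usage example are placeholders, not real data.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Evidence:
    """One cited reference plus the sentence fragment taken from it (hypothetical)."""
    citation: str           # placeholder identifier for the cited article
    sentence_fragment: str  # excerpt carrying the conclusive content


@dataclass
class Micropublication:
    """Hypothetical per-variant record; every field name here is an assumption."""
    gene: str
    variant: str
    population_frequencies: Dict[str, float] = field(default_factory=dict)
    computational_predictions: Dict[str, str] = field(default_factory=dict)
    evidence: List[Evidence] = field(default_factory=list)
    reviewers: List[str] = field(default_factory=list)

    def add_evidence(self, citation: str, fragment: str, reviewer: str) -> None:
        """Append a manually reviewed evidence item and record who added it."""
        self.evidence.append(Evidence(citation, fragment))
        if reviewer not in self.reviewers:
            self.reviewers.append(reviewer)


# Usage with placeholder values only.
record = Micropublication(gene="GENE1", variant="p.X1Y")
record.population_frequencies["example_population_db"] = 0.0001
record.computational_predictions["example_predictor"] = "deleterious"
record.add_evidence("REF-0001", "placeholder sentence fragment", reviewer="curator_a")
record.add_evidence("REF-0002", "another placeholder fragment", reviewer="curator_a")
```

Keeping evidence items as separate objects, rather than free text, is one way a pre-populated entry could stay editable and shareable between curators, which is the goal items 3) and 4) of the abstract describe.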

Public Health Relevance Statement:
PROJECT NARRATIVE Successful completion of the present project will contribute to the public health mission of the NIH by promoting more widespread adoption of genome sequencing by making the interpretation of genome variant data more accurate, reproducible and cost-effective in clinical and research laboratories. The community of users that can benefit from this work includes geneticists, oncologists, pathologists, researchers and patients.

Project Terms:
Address; Adoption; Algorithms; Authorship; Blinded; Categories; Classification; Clinic; Clinical; clinical decision-making; clinical practice; Clinical Research; Collaborations; Communities; Computer software; Consultations; Consumption; cost effective; Country; crowdsourcing; Data; Data Aggregation; Data Analyses; Data Set; Databases; design; Development; Disease; Effectiveness; Ensure; Feedback; Frequencies; Future; Genetic Databases; genetic variant; Genome; genome sequencing; genome-wide; Genomics; Goals; Gold; Guidelines; indexing; Individual; industry partner; information classification; Knowledge; Laboratory Research; Light; Literature; literature citation; Machine Learning; Manuals; Medical; Methods; Mission; Modification; novel; Oncologist; Pathogenicity; Pathologist; Patients; peer; Phase; Population; Population Database; preservation; Process; Public Health; Published Comment; Publishing; PubMed; Recommendation; Reproducibility; Research Personnel; response; Scientist; search engine; Semantics; Source; standardize guidelines; success; Techniques; Testing; Text; Time; tool; Training; United States National Institutes of Health; Update; Variant; Work

Phase II

Contract Number: 2R44HG010446-02
Start Date: 5/1/2019    Completed: 8/31/2023
Phase II year
2021
Phase II Amount
$1,710,048
Accurate and efficient interpretation of genomic variants for clinical decision making is predicated on ready access to and extraction of information from the medical literature. The sheer number of potentially relevant articles that must be examined during this process poses a significant challenge in ensuring the accuracy and reproducibility of clinical interpretation as it is time-consuming, error-prone, and highly user-dependent. To this end, we have developed the Mastermind Genomic Search Engine - a commercial database that automatically organizes disease, gene and variant information from the medical literature by systematically indexing millions of scientific articles. Mastermind is used by over 9,100 variant scientists in more than 100 different countries to more quickly interpret genetic variants in clinical settings. In Phase I of this project, we developed and tested a micropublication platform within Mastermind that assembles literature curation along with population frequency data, computational predictions of pathogenicity, and automated ACMG/AMP classifications that improves the speed of variant interpretation by more than 70% and increases the sensitivity of these results by 2-20x. The present proposal seeks to build on the success of Phase I by 1) integrating the micropublication platform into Mastermind with migration of collaborative features for community-based evaluation of variant interpretations; 2) optimizing and improving automated variant interpretation/prioritization of articles and implementing a rigorous quality assurance process; and 3) using these improvements to curate all evidence in all variants in all genes comprising the entire human genome, beginning with the clinical exome.
Integration of the pre-curated genome data in the micropublication platform will result in Mastermind Enterprise, allowing for immediate and accurate genome-wide variant interpretations with collaborative curation in real-time at the point of interaction with source material (i.e. individual references). This work will mitigate reproducibility challenges plaguing other large-scale crowd-sourced projects, including those undertaken by groups like NIH's ClinVar and QIAGEN's HGMD. In addition, our novel approach will not suffer from poor sensitivity as it relies on a comprehensive source of medical literature pre-annotated based on genetic content. This work will permit dramatic scaling of variant interpretation activities and allow for complete and accurate curation of the entire human genome within 2 years - a feat that could not be completed utilizing current manual methods for variant interpretation. Mastermind Enterprise will be revolutionary in the genomics industry and will represent a natural next step to build on the achievements provided by the Human Genome and the reduced cost of next-generation sequencing. It will substantially improve diagnostic rates and accuracy in the clinic, especially in rare disease, where a lack of genetic evidence often results in severely delayed and inaccurate diagnoses. Additionally, it will allow the pharmaceutical industry to develop more successful targeted therapies and to design more inclusive clinical trials as well as to more reliably identify patients who would benefit from therapeutic intervention.

Public Health Relevance Statement:
PROJECT NARRATIVE Successful completion of the present project will contribute to the public health mission of the NIH by promoting more widespread adoption of genome sequencing by making the interpretation of genome variant data more accurate, reproducible and cost-effective in clinical and research laboratories. The community of users that can benefit from this work includes geneticists, oncologists, pathologists, researchers and patients.

Project Terms: