Accurate and efficient interpretation of genomic variants for clinical decision making is predicated on ready access to useful information in the medical literature. The sheer number of potentially relevant articles that must be examined during curation poses a major challenge to the accuracy and reproducibility of clinical variant interpretation, because the process is time-consuming and its results are highly user-dependent. To address this challenge, we have developed the Mastermind Genomic Search Engine, a commercial database that automatically organizes disease, gene and variant information from the medical literature by systematically indexing millions of scientific articles. In direct comparison to manually developed databases of genetic variants, we have achieved greater than 97% concordance and accurately identified more than 50% more variants, with an average of 3-fold more references, demonstrating the effectiveness of our automated approach. Currently, Mastermind is used by over 1800 variant scientists in 25 different countries to curate literature for genetic variants more quickly in clinical settings. In response to feedback from ClinGen curators and others, the present proposal seeks to create a framework to facilitate literature curation and clinical variant interpretation activities within Mastermind by 1) prioritizing relevant references and external database entries containing content meaningful to variant classification guidelines, 2) assembling this information into a micropublication text format with codified data fields including population frequencies, computational predictions, reference citations and relevant sentence fragments with conclusive content, 3) allowing users to manually review, alter and augment pre-populated entries, and 4) providing a platform to share and continuously update this information with other variant scientists in the Mastermind community and elsewhere.
Developing tools that allow for collaborative curation in real time at the point of interaction with source material (i.e., individual articles) will mitigate the reproducibility challenges plaguing other large-scale crowd-sourced projects. In contrast to genetic variant databases of user-submitted classification information and associated data, the present proposal seeks to create enhanced curation tools that fill such variant databases with more accurate and reproducible data, in a way that would promote dramatic scaling of variant curation activities, including those undertaken by groups like ClinVar. To test this approach, we will work with industry partners engaged in variant curation activities to 1) determine the requisite data fields, 2) integrate external database information, 3) test the accuracy and relevance of results and the overall efficacy of the approach using hundreds of manually curated genetic variants, and 4) solicit and incorporate feedback from our development partners to iterate and refine the software features for the greatest effect. Within the $4B genome sequencing software market, there is significant commercial potential and scientific merit in bringing more automated data analysis techniques to large-scale genome sequencing variant interpretation, as described in this proposal.
Public Health Relevance Statement: PROJECT NARRATIVE Successful completion of the present project will contribute to the public health mission of the NIH by making the interpretation of genome variant data more accurate, reproducible and cost-effective in clinical and research laboratories, thereby promoting more widespread adoption of genome sequencing. The community of users that can benefit from this work includes geneticists, oncologists, pathologists, researchers and patients.
Project Terms: Address; Adoption; Algorithms; Authorship; Blinded; Categories; Classification; Clinic; Clinical; clinical decision-making; clinical practice; Clinical Research; Collaborations; Communities; Computer software; Consultations; Consumption; cost effective; Country; crowdsourcing; Data; Data Aggregation; Data Analyses; Data Set; Databases; design; Development; Disease; Effectiveness; Ensure; Feedback; Frequencies; Future; Genetic Databases; genetic variant; Genome; genome sequencing; genome-wide; Genomics; Goals; Gold; Guidelines; indexing; Individual; industry partner; information classification; Knowledge; Laboratory Research; Light; Literature; literature citation; Machine Learning; Manuals; Medical; Methods; Mission; Modification; novel; Oncologist; Pathogenicity; Pathologist; Patients; peer; Phase; Population; Population Database; preservation; Process; Public Health; Published Comment; Publishing; PubMed; Recommendation; Reproducibility; Research Personnel; response; Scientist; search engine; Semantics; Source; standardize guidelines; success; Techniques; Testing; Text; Time; tool; Training; United States National Institutes of Health; Update; Variant; Work