To evaluate how a drug candidate affects cells, researchers often study how the abundance or behavior of a specific set of proteins is changed by treatment with each compound. However, it is not currently possible to test the effect of every possible drug compound (>500,000) on every human protein (~20,000) in hundreds of different types of cells. Even the most advanced protein analysis systems available today could only measure and process a tiny fraction of these combinations in a feasible timeframe. One method of measuring the abundance of all the proteins in a cell sample is mass spectrometry, but available instruments can only analyze several samples per day. To increase the throughput of these mass spectrometry experiments, in Aim 1 of the proposed project we will develop a machine learning algorithm that will reconstruct the peptide composition of a large number of samples from measurements of a smaller number of mixtures of those samples. This technology, called "compressed sensing," was developed for digital imaging to reduce (compress) the file size of an image. Importantly, it can also "decompress" a low amount of collected information to reconstruct an image with surprisingly high detail. Similarly, we will develop a compressed sensing algorithm to extract the individual protein profiles from mixtures of multiple combined samples. Initially, this approach will analyze 1,000 samples from 250 measurements of mixtures of those samples, providing a 4-fold increase in speed. Ultimately, with a much higher number of samples, it may allow a 100-fold increase in samples analyzed. To accelerate interpretation of this type of data for drug discovery, we will create a machine learning algorithm to simplify complex patterns of interactions between test compounds and the proteins within various types of cells. Previously acquired data will be modeled to learn the effects of individual compounds on various proteins. 
By learning from a large number of these data sets that describe interactions between specific compounds and proteins, in many different cell types, the model will be able to predict the effect of untested compounds on proteins within various types of cells. In addition, it will be able to indicate which experiments would be most useful to perform in the future, to obtain information on classes of compounds or proteins that are lacking in the current data sets. The combination of these two techniques has the potential to greatly accelerate development of novel drugs by providing a potentially huge increase in protein abundance measurements, along with a powerful method to predict how drugs will alter the expression of proteins in cells.
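The mixture-and-reconstruct idea described for Aim 1 can be illustrated with a toy simulation. The sketch below is not the proposed algorithm. It swaps in orthogonal matching pursuit, a generic textbook compressed sensing solver, and rests on assumptions that do not come from the abstract: a single peptide whose abundance change is nonzero in only a few of the 1,000 samples (sparsity), idealized Gaussian mixing weights (real pooled samples would have non-negative weights), and noise-free measurements. The names `omp` and `simulate` and all parameter values other than 1,000 and 250 are hypothetical.

```python
import numpy as np


def omp(A, y, k):
    """Orthogonal matching pursuit: greedily estimate a k-sparse x from y = A @ x."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Select the sample (column) most correlated with the unexplained signal.
        j = int(np.argmax(np.abs(A.T @ residual)))
        support.append(j)
        # Refit all selected columns jointly by least squares.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat = np.zeros(A.shape[1])
    x_hat[support] = coef
    return x_hat


def simulate(n_samples=1000, n_mixtures=250, n_changed=8, seed=0):
    rng = np.random.default_rng(seed)
    # ASSUMPTION: the peptide's abundance change is nonzero in only a few samples.
    x_true = np.zeros(n_samples)
    changed = rng.choice(n_samples, size=n_changed, replace=False)
    x_true[changed] = rng.uniform(1.0, 5.0, size=n_changed)
    # ASSUMPTION: idealized Gaussian mixing weights; each row is one mixture.
    A = rng.standard_normal((n_mixtures, n_samples)) / np.sqrt(n_mixtures)
    # Noise-free measurement of the peptide in each mixture.
    y = A @ x_true
    return x_true, omp(A, y, n_changed)


x_true, x_hat = simulate()
rel_error = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"samples per measurement: {1000 / 250:.0f}, relative error: {rel_error:.2e}")
```

Under these assumptions, 250 mixture measurements are enough to recover the 1,000-sample profile, which is the 4-fold reduction in instrument runs mentioned above. Measurement noise, non-negative pooling weights, and denser signals would all require a more careful solver and design.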
Public Health Relevance Statement: Current mass spectrometry methods are too inefficient for large-scale proteomic analysis of drug effects in cells. We will develop machine learning methods to enable high-throughput mass spectrometry analysis of complex protein samples, identify the effects of drugs on proteome localization in various cell types, and predict the most informative experiments to gain new information on the effects of drugs.
Project Terms: Computer software; Software; Mass Spectrum Analysis; Mass Photometry/Spectrum Analysis; Mass Spectrometry; Mass Spectroscopy; Mass Spectrum; Mass Spectrum Analyses; Systems Analysis; Systems Analyses; Talus; Astragalus; Astragalus Bone; Technology; Testing; Time; X-Ray Computed Tomography; CAT scan; CT X Ray; CT Xray; CT imaging; CT scan; Computed Tomography; Tomodensitometry; X-Ray CAT Scan; X-Ray Computerized Tomography; Xray CAT scan; Xray Computed Tomography; Xray computerized tomography; catscan; computed axial tomography; computer tomography; computerized axial tomography; computerized tomography; non-contrast CT; noncontrast CT; noncontrast computed tomography; Measures; Data Set; HLA-DR Associated Protein II; IGAAD; Inhibitor of GZMA-Activated DNase; Phosphatase 2A Inhibitor I2PP2A; SET Translocation Inhibitor-2 of Protein Phosphatase-2A; Template Activating Factor I Beta; Set protein; improved; Phase; Training; Individual; Licensing; Measurement; Therapeutic; fluid; liquid; Liquid substance; instrument; Knowledge; programs; Complex; Event; cell type; Pattern; Techniques; 3-Dimensional; 3-D; 3D; three dimensional; Services; digital imaging; computer imaging; chromatin protein; Speed; Proteome; Code; Coding System; genetic regulatory protein; Regulatory Protein; regulatory gene product; Modeling; Sampling; drug development; Proteomics; Intervention; Intervention Strategies; interventional strategy; Genomics; drug discovery; Bio-Informatics; Bioinformatics; Molecular Interaction; Binding; protein expression; protein complex; Pharmaceutical Agent; Pharmaceuticals; Pharmacological Substance; pharmaceutical; Pharmacologic Substance; Image Compression; Data; Protein Analysis; Small Business Innovation Research Grant; SBIR; Small Business Innovation Research; Process; Development; developmental; Image; imaging; cost; designing; design; blind; Consumption; open source; new drug treatments; new drugs; new pharmacological therapeutic; new therapeutics; 
new therapy; next generation therapeutics; novel drug treatments; novel drugs; novel pharmaco-therapeutic; novel pharmacological therapeutic; novel therapy; novel therapeutics; commercialization; drug candidate; Drug Targeting; learning activity; learning method; learning strategies; learning strategy; genomic data-set; genomic dataset; genomic data; experiment; experimental research; experiments; experimental study; DNA seq; DNAseq; DNA sequencing; deep learning; machine learned algorithm; machine learning based algorithm; machine learning algorithm; machine learning based method; machine learning methodologies; machine learning method; screening services; Acceleration; Affect; Algorithms; Behavior; Biological Assay; Assay; Bioassay; Biologic Assays; Malignant Neoplasms; Cancers; Malignant Tumor; malignancy; neoplasm/cancer; Cells; Cell Body; Chromatin; Communities; Diabetes Mellitus; diabetes; Disease; Disorder; Drug Compounding; Drug Preparation; Pharmaceutical Preparations; Drugs; Medication; Pharmaceutic Preparations; drug/agent; Future; Regulator Genes; Transcriptional Regulatory Elements; regulatory gene; trans acting element; Human; Modern Man; Laboratories; Lead; Pb element; heavy metal Pb; heavy metal lead; Learning; Methods; United States National Institutes of Health; NIH; National Institutes of Health; Peptides; Physiology; Physiological Processes; Organism-Level Process; Organismal Process; Physiologic Processes; Proteins; Research; Research Personnel; Investigators; Researchers; Running; Signal Transduction; Cell Communication and Signaling; Cell Signaling; Intracellular Communication and Signaling; Signal Transduction Systems; Signaling; biological signal trans