There are a number of methods available for locating protein-coding regions in the human genome, but because these methods are species specific, none applies universally (and successfully) to locating protein-coding regions in most other eukaryotic genomes. Such a general method would be particularly useful to the agricultural, biomedical, and pharmaceutical industries. We have developed an expert rule-set-based statistical method for locating protein-coding regions in prokaryotic genomes. The method is self-training, identifies coding regions accurately and rapidly, and has been successfully applied to a number of prokaryotes, so it lends itself to generalization. We believe it can be extended to identify coding regions in eukaryotic genomes. The major hurdle in applying it to eukaryotic genomes is the treatment of introns, which we plan to address by developing statistical rule-set-based intron-splicing techniques and exon rule-set-based procedures for reconstructing protein-coding regions. The research will address the following problems: (1) the extended intron spectrum of eukaryotic organisms; (2) algorithmic splicing of introns to enhance the statistical signal; (3) generation of statistical exon-locating rule sets for eukaryotes (analogous to those previously found for prokaryotes); (4) extension of the method to include promoters and other binding-site motifs (as has been done for start and stop codons in prokaryotes).
Commercial Applications and Other Benefits as described by the awardee: The Company's pharmaceutical and biotech customers are requesting complementary sequence analysis software for use with any animal or plant genome. These commercial researchers therefore need rapid, self-training genomic analysis software that works on any organism they choose. Such tools do not presently exist.