This project develops an information extraction system that demonstrates higher recall than current systems while aiming to preserve their levels of precision. Our recall-enhancing algorithms use more linguistic and world knowledge than most current systems. Four crucial avenues of work will lead to the improvement of recall: disambiguating input text terms through ontological semantic processing; processing reference; processing non-literal language; and assigning semantic features to new, unattested word and phrase occurrences. All of these activities rely on a unique battery of resources and processes developed by or available to Onyx. These include an ontological world model, a fact database, a comprehensive NLP lexicon of English, and an onomasticon, or lexicon of proper names. In addition, we use special routines for resolving reference, for processing non-literal language through controlled constraint relaxation, and for treating unattested inputs using expectations recorded in the ontology, in the fact database, and in special orthographic, morphological and syntactic rules. Architecturally, we will combine this variety of approaches and processes in a single system. Unlike most current systems, ours will not only be geared toward information extraction for a given set of templates (and therefore, typically, toward work in a single domain) but will also include facilities for modifying templates and for defining new templates for new types of questions and, orthogonally, new domains. Thus, our product will be the first general-purpose, configurable information extraction system, one that will work in multiple domains and with multiple text genres. Additional resources and linguistic expertise for this project are supplied by consultants at New Mexico State University's Computing Research Laboratory, a premier academic R&D institution.
Keywords: Text Extraction, Onomastica, Semantics, Anaphoric Reference, Natural Language Processing, Synta