teXtagger: Automated XML Tagging of Natural Language Text
Joseph Naft

Naftware Incorporated

1830 Edgewood Road
Baltimore, MD 21234
(410) 663-0767
We propose to build teXtagger, a software tool that automatically applies XML tags to natural language text according to an XML definition, eliminating much of the manual tagging effort otherwise required to annotate the text. Where a knowledge base is defined in XML, teXtagger will be the natural tool for rapidly building large knowledge bases from raw text. Many organizations are creating XML definitions to enable information sharing for collaborative efforts across departments or even entire industries. TeXtagger will support automated reformatting of natural language text into XML-defined intelligent schemas, for texts ranging from scientific and engineering papers to spacecraft maintenance manuals. To do so, Naftware will create new information extraction methods for unrestricted text that are portable across domains, a significant innovation that fulfills an important need for NASA and for American society. Subtopic 24.02 calls for "information technology enabling comprehensive sharing of project-related information and data, which supports intelligent organization, access, and presentation of the information." This is an excellent description of teXtagger. By automatically transforming raw text into intelligent XML-tagged documents, teXtagger will create major new opportunities for sharing information and for organizing, accessing, and presenting it intelligently, turning mountains of text into effective, interactive knowledge repositories.
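
To make the idea of tagging raw text against an XML definition concrete, the following is a minimal sketch only, and not the teXtagger method, which is to be developed under this proposal. It assumes a deliberately naive approach in which each XML element name is paired with a regular expression; the names RULES and tag_text, the element names "date" and "part_number", and the default root element "document" are all hypothetical and chosen purely for illustration.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical rule set: XML element name -> regular expression.
# A real system would derive its targets from a DTD or Schema and use
# NLP methods rather than hand-written patterns (assumption for this sketch).
RULES = {
    "date": r"\b\d{4}-\d{2}-\d{2}\b",
    "part_number": r"\bPN-\d+\b",
}


def tag_text(text, rules, root="document"):
    """Wrap every span matching a rule in an element named after that rule."""
    # One alternation with a named group per element, so a single
    # left-to-right scan yields non-overlapping matches.
    combined = re.compile(
        "|".join(f"(?P<{name}>{pattern})" for name, pattern in rules.items())
    )
    pieces = []
    pos = 0
    for match in combined.finditer(text):
        # Untagged text between matches is escaped so the output stays well-formed.
        pieces.append(escape(text[pos:match.start()]))
        name = match.lastgroup  # name of the rule that matched
        pieces.append(f"<{name}>{escape(match.group())}</{name}>")
        pos = match.end()
    pieces.append(escape(text[pos:]))
    return f"<{root}>{''.join(pieces)}</{root}>"


if __name__ == "__main__":
    print(tag_text("Replace PN-4471 before 2001-03-15 & log it.", RULES))
```

Pattern matching of this kind breaks down quickly on unrestricted text, which is the gap the proposed cross-domain information extraction methods are meant to fill. The sketch also requires that element names be valid Python group names and that the patterns contain no capturing groups of their own.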

A robust natural language processing (NLP) system for automatically annotating text documents (including the text on web pages) with XML-defined tags has tremendous commercial potential. Recent corporate announcements indicate that XML technologies are not going to suffer from competing, proprietary standards. All major players appear to be accepting W3C standards and participating in their creation. Many industry-specific groups have teams at work creating cross-company, industry-wide XML DTDs and Schemas to enable more intelligent, web-based business-to-business and consumer-to-business applications. Two major factors limit further enhancement of the Web. The first is the need for greater bandwidth to the home, which cable and DSL companies are investing heavily to meet. The second is the Web's lack of semantics, its lack of intelligence. XML promises to help remedy this second limitation. We expect XML to be widely adopted and aggressively incorporated into both government and commercial Web applications. The next hurdle will be applying XML tags to the installed base of millions of web pages, and extending that tagging to millions of other texts in order to bring them online. This will be a massive, ongoing task for many years to come, and it creates the market pull for teXtagger.

