Phase II year
2015
(last award dollars: 2017)
The broader impact/commercial potential of this Small Business Innovation Research (SBIR) Phase II project will be more effective and efficient use of unstructured data. Limiting analysis to structured data ignores the massive amount of information in reports, memos, articles, and other written documents. Workers require information on past work and ongoing projects, best practices, current events, and competitor and customer activities. However, companies with thousands of workers have millions of documents. It is prohibitively expensive to index them manually so that they can be found, analyzed, and acted on. Moreover, overloaded workers do not have the time or training required to take on the task. The problem is exacerbated by mergers and acquisitions. In addition, over the years, companies commonly accumulate large numbers of duplicate and out-of-date documents that workers do not take the time to rationalize and delete. The result is inflated storage costs and reduced productivity as workers struggle to find relevant, up-to-date information. Inconsistent information governance also puts organizations at risk: litigation risk (retaining documents without legal or business value), safety risk (using out-of-date process safety management procedures), and operational risk (not leveraging best practices and lessons learned across the enterprise and beyond).
This Small Business Innovation Research (SBIR) Phase II project addresses the problem that text documents, especially those internal to an organization, are very difficult to locate and analyze unless they are classified and tagged. But manual classification and tagging are too expensive and inconsistent for large collections. Large companies store many millions of documents, and there is even more relevant information on the Web.
The objective of the proposed research is to provide software assistants that classify documents into pre-specified categories and add tags that describe what each document is about and the entities named in it (e.g., oilfields). The assistants identify relevant documents and help people learn of new developments by sending alerts when new documents of interest appear on the Web or in the company's computers. The primary technical result will be a suite of software assistants that companies can adopt singly or as an ensemble to help manage information sustainably. These assistants build upon the proposed research to develop and integrate novel approaches to unsupervised machine learning, concept identification, and ontology construction. They will enable companies to overcome major problems, including information overload, finding relevant, up-to-date information, analyzing unstructured information, and identifying unneeded documents for elimination.