News Article

Integration of Information from Heterogeneous Sources
Date: Jan 01, 2010
Source: DARPA Success Stories

Featured firm in this article: BCL Technologies of San Jose, CA



BCL Technologies developed fuzzy logic mathematical methods to convert previously unreadable text and figures (in financial tables, graphics, and other tabular forms) into relational databases, thus reducing the cost of manual data reentry.

BCL Technologies provides automated solutions that help companies reduce costs and improve productivity for a wide variety of costly manual processes in document creation, conversion, and extraction. BCL's easyConverter captures formatted information from PDF files and translates it into a rich text format (RTF) or DOC document.

Technical Challenge Addressed:
Although advances in optical character recognition (OCR) have greatly improved data conversion from paper into machine-readable form, converting data in table format remains difficult. Many printed, faxed, or Portable Document Format (PDF) documents contain tables that are not machine readable, so expensive manual data reentry is required to convert each table into a relational database or other computer-readable form. Automated extraction and conversion of information from tables or other tabular forms must overcome several challenges. Table, column, and row boundaries must be determined, and sub-tables, headers, footers, and captions must be recognized. Data in the table cells must then be isolated from the boundaries, extracted, classified, and converted to machine-readable form through OCR.
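
As an illustration of the boundary-detection and cell-isolation steps described above, the following Python sketch groups recognized words into rows and columns using only their positions on the page. It is not BCL's method: the Word type, the tolerance values, and the sample coordinates are all assumptions chosen for the example, and a real system would also have to handle headers, captions, and sub-tables.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A recognized word with its position (hypothetical structure)."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline


def group_rows(words, y_tol=3.0):
    """Group words whose baselines lie within y_tol of the row's first word."""
    rows = []
    for w in sorted(words, key=lambda w: w.y):
        if rows and abs(rows[-1][0].y - w.y) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    return rows


def column_boundaries(words, min_gap=10.0):
    """Return x positions at the middle of vertical gaps at least min_gap wide."""
    spans = sorted((w.x0, w.x1) for w in words)
    if not spans:
        return []
    boundaries = []
    right = spans[0][1]  # right-most edge covered so far
    for x0, x1 in spans[1:]:
        if x0 - right >= min_gap:
            boundaries.append((right + x0) / 2)
        right = max(right, x1)
    return boundaries


def to_grid(words, y_tol=3.0, min_gap=10.0):
    """Assign every word to a (row, column) cell and join the cell text."""
    bounds = column_boundaries(words, min_gap)
    grid = []
    for row in group_rows(words, y_tol):
        cells = [[] for _ in range(len(bounds) + 1)]
        for w in sorted(row, key=lambda w: w.x0):
            col = sum(1 for b in bounds if w.x0 > b)
            cells[col].append(w.text)
        grid.append([" ".join(c) for c in cells])
    return grid


# Invented sample data: a three-row, three-column financial table.
SAMPLE_WORDS = [
    Word("Item", 10, 40, 100), Word("Q1", 100, 115, 100), Word("Q2", 160, 175, 100),
    Word("Net", 10, 30, 120), Word("sales", 33, 60, 120),
    Word("1,200", 95, 120, 120), Word("1,350", 155, 180, 120),
    Word("Costs", 10, 45, 140), Word("800", 100, 120, 140), Word("900", 160, 180, 140),
]
```

Calling to_grid(SAMPLE_WORDS) returns one list of cell strings per table row. Fixed thresholds such as min_gap are brittle on real documents, which is one reason a graded, rule-based approach can be attractive for this problem.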

Technology Description:
BCL's solutions are based on expertise in neural networks, fuzzy logic, document analysis, information retrieval, database technologies and natural language query processing.

BCL selected the following techniques for the DARPA Phase II project, which addressed extraction of information from printed pages: fuzzy logic reasoning for pattern recognition to detect table, row, and column boundaries; classification; extraction of information from the table cells; and storage of that information in a relational database accessible through Structured Query Language (SQL).
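
The article does not give BCL's fuzzy rules or database schema, so the Python sketch below only illustrates the general idea under stated assumptions. A candidate column boundary receives a degree of membership between 0 and 1 instead of a yes/no decision, and extracted cells are written to a relational table that can be queried with SQL. The membership breakpoints, the function names, and the cells table layout are all hypothetical.

```python
import sqlite3


def ramp(x, lo, hi):
    """Piecewise-linear membership: 0 at or below lo, 1 at or above hi."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def boundary_degree(gap_width, rows_with_gap, total_rows):
    """Degree (0 to 1) to which a vertical gap looks like a column boundary."""
    # "The gap is wide": breakpoints of 4 and 12 units are assumed values.
    wide = ramp(gap_width, 4.0, 12.0)
    # "Most rows share the gap": breakpoints of 50% and 90% are assumed values.
    consistent = ramp(rows_with_gap / total_rows, 0.5, 0.9)
    # Fuzzy AND, taken here as the minimum of the two memberships.
    return min(wide, consistent)


def store_cells(conn, grid):
    """Write a grid of cell strings to a hypothetical 'cells' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cells "
        "(row_idx INTEGER, col_idx INTEGER, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO cells VALUES (?, ?, ?)",
        [(r, c, v) for r, row in enumerate(grid) for c, v in enumerate(row)],
    )
    conn.commit()
```

For example, after store_cells(sqlite3.connect(":memory:"), grid), a query such as SELECT value FROM cells WHERE row_idx = 1 retrieves one table row. Taking the minimum is only one common choice for fuzzy AND; other combination operators would serve equally well in a sketch like this.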

OCR errors limited data conversion from tables, so BCL developed alternative extraction methods that replicate and use the table structure originally employed in the document, including structure obtained from PDF engines. These alternative methods allowed machine-readable information to be generated from tables and other structured document elements, such as bibliographies, graphics, and financial charts, which are widely used in documents.

The Phase II project demonstration was successful, and BCL continued subsequent SBIR work for DARPA and the U.S. Army, developing technology for management of heterogeneous data and information. BCL was awarded a series of patents in the area of heterogeneous data management, with two of the patents being directly related to the information extraction work initiated at DARPA. Several more patents are pending.

Lessons Learned & Best Practices:
Recognize that both unforeseen technical obstacles and new opportunities may occur during SBIR projects, so be adaptable and seek approval to shift technical emphasis when a new opportunity offers potential advantages over the existing project direction.

Target a specific market in Phase II to facilitate early commercialization.
Consider using a licensing approach, and explore collaboration with an established company.

Economic Impact:
Through the DARPA SBIR project, BCL gained the technical expertise, insight, and credibility to win and successfully perform additional SBIR projects from DARPA and the U.S. Army. The commercial licensing revenue from this technology subsequently allowed BCL to weather two recessions and continue on its growth path. The company began with the initial DARPA SBIR work and, on the strength of its commercial business, has grown to 20 people.

Applications:
To commercialize its patented software, PDF Tools, BCL first targeted the financial sector, where it could focus on extraction and conversion of financial information in tabular form without using OCR. The company then licensed the software as an add-on plug-in to widely used commercial software packages specializing in format conversions, for extraction and conversion of information in tables. The system is currently used by the U.S. Army and was licensed by Adobe Systems to respond to the Japanese government mandate that financial information in PDF format be converted to machine-readable form.

Partnering & Collaboration:
BCL received several follow-on SBIRs in heterogeneous data and information management. These included a DARPA FaxAssist project to automatically extract the recipient's name from a fax and route it by e-mail, and two Army projects to make documents in various formats accessible on electronic devices and to provide a knowledge and information management platform.

BCL has partnered with over two dozen financial publishers to automate their data extraction and conversion processes. For example, BCL and RR Donnelley developed the Donnelley Jade software for conversion of PDF files to formats used by the U.S. Securities and Exchange Commission's (SEC) Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system. Donnelley's Jade manager reported that "We have reduced our turnaround times for SEC filings of non-proprietary (CFS and ProFile) by 75 percent." BCL's SEC Publisher is used by many companies for SEC filings with similar results.

Adobe licensed BCL's table extraction technology for the Acrobat Table/Formatted Text plug-in, versions 4.0 and 5.0. Another licensee, MS Wireless, used PDF Tools to extract sales order information from PDF files and compare it with data in its point-of-sale database to verify sales, detect fraud, and pay commissions. This resulted in annual labor cost savings of over $500,000 and reduced processing time from 1-2 weeks to 2 days.
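
The kind of comparison described for MS Wireless can be pictured with a short Python sketch. The article gives no detail on the actual data model, so the order identifiers, the amounts, and the reconcile function are invented for illustration. Each extracted order is checked against the point-of-sale record with the same identifier.

```python
from decimal import Decimal


def reconcile(extracted, pos_records):
    """Compare extracted order amounts with point-of-sale amounts.

    Both arguments map a hypothetical order id to a Decimal amount.
    Returns three lists of order ids: verified, mismatched, missing.
    """
    verified, mismatched, missing = [], [], []
    for order_id, amount in extracted.items():
        if order_id not in pos_records:
            missing.append(order_id)      # no matching sale on record
        elif pos_records[order_id] == amount:
            verified.append(order_id)     # amounts agree
        else:
            mismatched.append(order_id)   # amounts differ; needs review
    return verified, mismatched, missing
```

Amounts are held as Decimal values so that currency comparisons are exact. Orders in the mismatched or missing lists would be the ones a reviewer examines before commissions are paid.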