In this project, Fetch Technologies will design, prototype and evaluate a new approach to transforming and normalizing data from multiple heterogeneous sources. In previous work, we developed and successfully commercialized a system for creating transformation pipelines. In a transformation pipeline, a new source (with its own unique schema) can be dropped into the pipeline, and as long as the source's data schema satisfies some very general constraints on the type of data present, the pipeline will successfully normalize data from that source. Our objective is to design the next generation of this system, called AutoTrans, which will minimize the human effort necessary to build a robust transformation pipeline. In particular, through the use of machine learning techniques, the AutoTrans system will make it easier and more automatic to configure and modify a series of transformations. It will also provide actionable results even when the existing set of recognizers and mappings is incomplete. Finally, the system will be able to represent and reason about the correctness/fidelity of the transformed data.
Benefit: The aim of this project is to create a transformation system that minimizes the human effort necessary to aggregate data from multiple heterogeneous systems. Currently, integrating information from multiple domains and applications is technically challenging. Existing transformation design systems are difficult to use because the transformations generally have to be designed by knowledgeable programmers. These transformations are often one-to-one mappings, which must be modified or redesigned whenever a new data source needs to be integrated. Our approach represents an advance for data aggregation problems because it allows one to implement a data pipeline that can normalize data from a wide variety of sources without reprogramming. The new AutoTrans technology represents the next generation of this approach. It will markedly decrease the human time and skill level required to develop and maintain these powerful pipelines. This in turn will produce a qualitative difference in how broadly this technology can be applied in commercial and military systems.
Keywords: Information Integration, Machine Learning, Data Cleaning, Information Extraction