Phase II Amount
$1,000,000
The broader impact/commercial potential of this Small Business Innovation Research (SBIR) project is to increase the quality of consumer products and services by scaling the data sharing economy. Data sharing and reuse increase revenue streams and allow companies to realize an 8-15x return on investment in improved products and services. Despite these benefits, growth in data sharing is stymied because approximately 60% of companies are unable to access needed data, and only about 30% have taken first steps towards data sharing to realize its potential returns. The major challenges to widespread sharing are "messy" data and the legal issues surrounding ownership and utilization. Some organizations have addressed these challenges internally with feature stores, which support data processing pipelines that extract insights. While the current generation of feature stores works well for a single organization, their centralized design assumes a level of trust absent across multiple organizations. This proposed work will develop a feature store that supports the trust requirements of inter-organizational workflows, enabling organizations to realize the significant returns from data sharing.
This SBIR Phase II project proposes to implement a decentralized feature store (DeFS) by solving two primary technical challenges. First, a DeFS requires fast, reliable, and provably correct data storage. While layer-1 blockchain storage is provably correct, it provides neither the required write speed nor long-term availability if miners abandon it. This project's proposed approach will meet the reliability requirement by signing data on a virtual blockchain that can span and migrate between layer-1 blockchains. The continuity of the virtual blockchain allows it to survive as layer-1 chains fail and to improve as new chains come online. Moreover, the virtual blockchain meets speed requirements by creating virtual blocks quickly on a private blockchain and periodically migrating to layer-1.
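The virtual blockchain idea described above can be illustrated with a minimal sketch. This is not the project's design, which the abstract does not specify. It assumes that virtual blocks are hash-linked on a private chain and that only the head hash is periodically recorded on whichever layer-1 chain is currently in use. The names `VirtualChain`, `anchor_every`, and `layer1` are hypothetical, the layer-1 chains are simulated by a plain list, and signatures are omitted for brevity.

```python
import hashlib
import json
from dataclasses import dataclass, field


def digest(payload: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the payload."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


@dataclass
class VirtualChain:
    anchor_every: int = 3                        # hypothetical checkpoint interval
    blocks: list = field(default_factory=list)   # fast, private virtual blocks
    anchors: list = field(default_factory=list)  # stand-in for layer-1 records

    def append(self, data: str, layer1: str = "chain-A") -> dict:
        """Add a virtual block; every anchor_every blocks, record its hash on layer-1."""
        prev = self.blocks[-1]["hash"] if self.blocks else "0" * 64
        body = {"index": len(self.blocks), "prev": prev, "data": data}
        block = {**body, "hash": digest(body)}
        self.blocks.append(block)
        if len(self.blocks) % self.anchor_every == 0:
            # Only the head hash is published; the layer-1 chain holding it may change.
            self.anchors.append(
                {"layer1": layer1, "index": block["index"], "hash": block["hash"]}
            )
        return block

    def verify(self) -> bool:
        """Check the hash links between blocks and every layer-1 anchor."""
        prev = "0" * 64
        for i, block in enumerate(self.blocks):
            body = {"index": block["index"], "prev": block["prev"], "data": block["data"]}
            if block["index"] != i or block["prev"] != prev or block["hash"] != digest(body):
                return False
            prev = block["hash"]
        return all(self.blocks[a["index"]]["hash"] == a["hash"] for a in self.anchors)
```

In this sketch, calling `append` with a different `layer1` value partway through models migration: the hash links between virtual blocks are unaffected, so `verify` still checks one continuous history even though its anchors sit on different layer-1 chains.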
Second, a DeFS requires integrating multiple data sources and pipeline code without leaking either. While interactive proof techniques satisfy privacy requirements, they do not scale to large data sets. This project will support data processing pipelines, such as data wrangling and model fitting, by developing a scalable and secure computing framework. The framework preserves the privacy of both data and pipeline code, which are evaluated in virtualized trusted execution environments under the control of a trustless distributed protocol built around a virtual blockchain commitment scheme.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.