
Colloquium - Cheney

Language-Based Foundations for Data Provenance
Cornell University

Bioinformatics and other disciplines now rely heavily on "curated" databases that are built up through the manual effort of expert scientists. Curation yields higher-quality results than any fully automatic technique, but it is labor-intensive and costly. Because judgments about the quality of the data ultimately rest on the choices made by the database curators, it is crucial to maintain adequate provenance records showing the database's history. Currently, however, provenance is not well supported by the databases and other systems that curators use. Instead, it is maintained by manual curator effort or by ad hoc systems that do little to ensure that the provenance record is correct, complete, and useful.

In my view, the key unsolved problem in this area is developing clear high-level specifications and correctness guarantees for provenance-tracking techniques, which would justify their inclusion in general-purpose systems. My approach is to adapt techniques from programming languages and semantics to the setting of databases. I will present two provenance-tracking techniques that provide strong guarantees. The first is based on the intuition that provenance should indicate where data in the output of a query or update "comes from" in the input. The second is based on the idea that provenance should highlight all parts of the input that "explain" a part of the output.
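
To make the two intuitions concrete, here is a minimal illustrative sketch in Python. It is not the formalism presented in the talk; the `Cell` type, the `load` and `query` helpers, and the sample `proteins` table are all hypothetical names invented for this example. Each input cell is annotated with its own location. A value that a query copies unchanged keeps that location as its "where" annotation, while every output cell also accumulates the set of input locations that influenced it, including those read by the filter condition.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

# Hypothetical location type: (table name, row index, column name).
Loc = Tuple[str, int, str]


@dataclass(frozen=True)
class Cell:
    value: object
    where: Optional[Loc]   # input location this value was copied from, if any
    deps: FrozenSet[Loc]   # input locations that influenced this value


def load(table_name, rows):
    """Annotate every input cell with its own location."""
    return [
        {
            col: Cell(val, (table_name, i, col), frozenset({(table_name, i, col)}))
            for col, val in row.items()
        }
        for i, row in enumerate(rows)
    ]


def query(table):
    """Hypothetical query: keep rows with mass > 50; return name and doubled mass."""
    out = []
    for row in table:
        mass = row["mass"]
        if mass.value > 50:
            # The row is present only because of the mass test, so every
            # output cell in it depends on the mass cell as well.
            test_deps = mass.deps
            name = row["name"]
            out.append({
                # Copied value: the "where" annotation is preserved.
                "name": Cell(name.value, name.where, name.deps | test_deps),
                # Computed value: not copied from anywhere, so no "where".
                "double_mass": Cell(mass.value * 2, None, mass.deps | test_deps),
            })
    return out


proteins = load("proteins", [
    {"name": "A", "mass": 40},
    {"name": "B", "mass": 60},
])
result = query(proteins)
for row in result:
    for col, cell in row.items():
        print(col, cell.value, cell.where, sorted(cell.deps))
```

In this sketch the surviving `name` cell reports that it was copied from row 1 of `proteins`, while its dependency set also includes the `mass` cell of that row, because changing that mass could remove the row from the output. The computed `double_mass` cell has no "where" annotation but still records its dependency on the input mass.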

Hosted by Amer Diwan.

Department of Computer Science
University of Colorado Boulder
Boulder, CO 80309-0430 USA
May 5, 2012 (14:13)