Provenance for Computational Reproducibility and Beyond

Speaker: Juliana Freire (New York University, USA)

Abstract

The need to reproduce and verify experiments is not new in science. While result verification is crucial for science to be self-correcting, improving on these results helps science to move forward. Revisiting and reusing past results – or as Newton once said, “standing on the shoulders of giants” – is a common practice that leads to practical progress. The ability to reproduce computational experiments brings a range of benefits to science. Notably, it enables reviewers to test the outcomes presented in papers; allows new methods to be objectively compared against methods presented in reproducible publications; and lets researchers build directly on previous work. Last but not least, recent studies indicate that reproducibility increases impact, visibility, and research quality, and helps defeat self-deception.

Although reproducibility is standard in the natural sciences and in mathematics, where results are accompanied by formal proofs, it has not been widely applied to results backed by computational experiments. Scientific papers published in conferences and journals often include tables, plots and beautiful pictures that summarize the obtained results, but only loosely describe the steps taken to derive them. Not only can the methods and implementation be complex, but their configuration may require setting many parameters. Consequently, reproducing the results from scratch is both time-consuming and error-prone, and sometimes impossible. This has led to a credibility crisis in many scientific domains. In this talk, we discuss the importance of maintaining detailed provenance (also referred to as lineage and pedigree) for both data and computations, and present methods and systems for capturing, managing and using provenance for reproducibility. We also explore benefits of provenance that go beyond reproducibility and present emerging applications that leverage provenance to support reflective reasoning, collaborative data exploration and visualization, and teaching.

This work was supported in part by the National Science Foundation, a Google Faculty Research award, the Moore-Sloan Data Science Environment at NYU, IBM Faculty Awards, the NYU School of Engineering, and the Center for Urban Science and Progress.

About the speaker: Juliana Freire is a Professor of Computer Science and Engineering and Data Science at New York University. She holds an appointment at the Courant Institute of Mathematical Sciences and is a faculty member at the NYU Center for Urban Science and Progress and at the NYU Center for Data Science, where she is also the Director of Graduate Studies. Her recent research has focused on big-data analysis and visualization, large-scale information integration, provenance management, and computational reproducibility. Prof. Freire is an active member of the database and Web research communities, with over 150 technical papers, several open-source systems, and 11 U.S. patents. She is an ACM Fellow and a recipient of an NSF CAREER award, two IBM Faculty awards, and a Google Faculty Research award. She has chaired or co-chaired several workshops and conferences, and has served as a program committee member for over 70 events. Her research has been funded by grants from the National Science Foundation, DARPA, the Department of Energy, the National Institutes of Health, the Sloan Foundation, the Gordon and Betty Moore Foundation, the W. M. Keck Foundation, Google, Amazon, the University of Utah, New York University, Microsoft Research, Yahoo! and IBM.