Chris Jermaine, MCDB: The Monte Carlo Database System

Analysts working with large data sets often use statistical models to "guess" at unknown, inaccurate, or missing information associated with the data stored in a database. For example, an analyst for a manufacturer may wish to know, "What would my profits have been if I'd increased my margins by 5% last year?" The answer to this question naturally depends upon the extent to which the higher prices would have affected each customer's demand, which is undoubtedly guessed via the application of some statistical model. In this talk, I'll describe MCDB, which is a prototype database system that is designed for just such a scenario. MCDB allows an analyst to attach arbitrary stochastic models to the database data in order to "guess" the values for unknown or inaccurate data, such as each customer's unseen demand function. These stochastic models are used to produce multiple possible database instances in Monte Carlo fashion (a.k.a. "possible worlds"), and the underlying database query is run over each instance. In this way, fine-grained stochastic models become first-class citizens within the database. This is in contrast to the "classical" paradigm, where high-level summary data are first extracted from the database, then taken as input into a separate statistical model which is then used for subsequent analysis.