Chris Jermaine's Talk at the DB Seminar

Title: "Large Scale Bayesian Machine Learning with the SimSQL System"
 

Abstract:

SimSQL is a parallel database system developed at Rice that supports a version of SQL with a few key extensions. For example, SimSQL allows users to utilize (and also define) sampling distributions that
can be used to stochastically generate database tables. These stochastic tables can be defined
recursively, so that an older version of a stochastic table can be used to parameterize a newer version.
Taken together, these extensions make it easy to use SimSQL to simulate database-valued Markov
chains (that is, chains whose state at each time tick is embodied by a relational database).  There
are many potential uses of this capability, one being that SimSQL can be used to perform distributed
Markov Chain Monte Carlo (or "MCMC") simulations over very large data sets. MCMC is the standard
inference method for Bayesian machine learning.

In this talk, I will describe SimSQL's SQL dialect and give examples showing how easy it is to use
SimSQL to write Bayesian inference codes that are small and implicitly parallel. I will also describe
some of the key implementation methods utilized by SimSQL to compile and execute parallel MCMC codes over large data sets.
 

Bio: Chris Jermaine is an associate professor of computer science at Rice University in Houston, Texas.
Chris' research is at the intersection of data management and applied statistics. He is the recipient of an Alfred P. Sloan Foundation Research Fellowship, a National Science Foundation CAREER award, and a SIGMOD Best Paper Award. In his spare time, he enjoys outdoor activities such as hiking, climbing, and whitewater boating. In one particular exploit, Chris and his wife floated a whitewater raft (home-made from scratch using a sewing machine, glue, and plastic) over 100 miles down the Nizina River (and beyond) in Alaska.