Scaling research, prototyping and visualization

Princeton

Machine Learning

Computer scientist Diana Cai is a graduate student at Princeton University doing research in machine learning:

"I work in statistical machine learning, focusing on the area of probabilistic modeling, which is a powerful framework of methods for extracting interpretable latent structure and used in many scientific and engineering applications, such as recommendation systems, network and community analysis, text and time series analysis, and online decision making. Due to the complexity of models for many modern data problems, coupled with the large data set sizes and dimensionality, fitting these probabilistic models usually requires approximate Bayesian inference methods, such as Markov Chain Monte Carlo (MCMC) and variational inference. Scaling these methods to modern data problems is of great interest. Furthermore, I am interested in developing robust and interpretable Bayesian inference methods as well as understanding the theoretical properties of these methods.

When developing new probabilistic models, the complexities of a new model are such that I often have to derive new inference updates from scratch and implement them directly in code, rather than relying on existing packages. Thus, I’ll directly implement, say, an MCMC or variational inference algorithm in Julia.

Some of my work is more theoretical, which involves mostly simulations on synthetic data. For instance, I often work with Bayesian nonparametric models, in which the number of parameters in the model grows as the number of observed data points grows. Thus, many of my simulations involve generating a large number of samples in order to simulate an infinite dataset and studying asymptotic behaviors in the data or model as the number of data points grows.

For all of these types of research projects, I rely primarily on the Distributions.jl and StatsBase.jl Julia packages, visualization packages (e.g., PyPlot and Seaborn), Optim.jl for numerical maximum likelihood estimation, and automatic differentiation and quadrature packages. I also frequently use data-handling packages such as DataFrames and CSV. For certain modeling problems, I’ll also use Stan.jl, a Julia wrapper for Stan that allows automated inference in many probabilistic models.

Before Julia, I used Python for my research. Originally, one of my main motivations for switching was speed, and how easy it is to get that speed. My senior thesis in college focused on developing and implementing large-scale streaming variational inference for online changepoint detection modeling, and it was important to run this algorithm on large, high-dimensional time series. For some complex problems, I found it useful to implement my code with loops first, but I would then spend substantial effort optimizing it by vectorizing or rewriting parts in Cython.

In Julia, I have the flexibility to begin prototyping as I might have in Python, and when I want even better performance, I can quickly check and fix any types that are leading to performance issues. I also prefer Julia when it comes to performing matrix computations and numerical methods.

Overall, the ease of quick prototyping along with the ease of writing fast code makes Julia appealing for statistics and machine learning research.

The speed and simplicity of Julia are big advantages. Being able to quickly run (and rerun) experiments is useful, as is not having to rely on packages to speed up my code. Syntactically, I also like that I can easily go from a function that I’ve derived on paper to a function that implements it, because the code looks so much like math. I’ve found that this leads to fewer bugs in my code and makes it easier to debug overall. Because of the speed, I can reasonably run a lot more iterations of my MCMC algorithms.

I used Julia for my master’s thesis, which involved developing a latent variable model to analyze patterns in large-scale, sparse relational graph data, and I’ve continued to use Julia throughout the research for my PhD. Recently, I’ve been working on Bayesian sketching methods for count data. One popular sketching method is the count-min sketch, a successful and widely deployed algorithm for approximately counting items in a data stream. In our recent NeurIPS 2018 paper, we developed a method in which, assuming that the token stream arises from a particular family, we can compute Bayesian estimates of the count. This allows us to quantify the uncertainty in the count, rather than simply returning a point estimate. A potential application of this framework is in complex probabilistic models for counts, which would ultimately involve some approximate and scalable Bayesian inference algorithm, such as MCMC or variational inference, applied to a large-scale streaming application.

What’s more, I love the Julia user community, which is something I haven’t been involved with when it comes to other programming languages. In the future, I’d love to incorporate some of my own work into Julia software packages and interact more with the Julia community."
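
For readers unfamiliar with the count-min sketch mentioned in the quote, the following is a minimal Python sketch of the classical data structure only. It returns the usual point estimate and does not implement the Bayesian estimates from the NeurIPS 2018 paper. The class name `CountMinSketch`, the `width` and `depth` parameters, and the use of salted BLAKE2b as the per-row hash are all illustrative assumptions, not details taken from the source.

```python
import hashlib


class CountMinSketch:
    """Approximate item counter backed by a depth x width table of integers."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, item: str, row: int) -> int:
        # One hash function per row, obtained by salting BLAKE2b with the
        # row index (an illustrative choice, not prescribed by the source).
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._bucket(item, row)] += count

    def estimate(self, item: str) -> int:
        # Hash collisions can only inflate a counter, so the minimum over
        # rows is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._bucket(item, row)]
            for row in range(self.depth)
        )


# Hypothetical token stream: "a" appears 5 times, "b" 3 times, "c" once.
stream = ["a"] * 5 + ["b"] * 3 + ["c"]
sketch = CountMinSketch(width=64, depth=4)
for token in stream:
    sketch.add(token)

print(sketch.estimate("a"))  # never below the true count of 5
```

The estimate can overshoot the true count when items collide in every row, but it can never undershoot it. That one-sided error is what a single point estimate hides, and it is the kind of uncertainty the Bayesian treatment described in the quote aims to quantify.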
