Julia Computing has recently received funding from the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative. One of the main components of this project is to improve the statistics and data science functionality available in the Julia ecosystem. I thought it would be useful to set out our current plans in this regard. These may change as the work develops, but they represent what we think are the most potentially useful contributions to the community.
First, I wish to acknowledge the excellent work that has been done so far. The breadth of functionality already available in Julia’s statistical ecosystem is remarkable for such a young language. In particular, packages such as Distributions.jl, MixedModels.jl and Gadfly.jl are genuinely among the most cutting-edge software in their respective domains. The overall aim of this work is to build a robust, flexible and high-performance platform to allow such excellent work to continue to develop.
At the moment, Julia’s data manipulation functionality leaves a lot to be desired. DataFrames.jl, though widely used and quite functional, suffers from some notable performance problems, as outlined recently by John Myles White. Interfaces to external databases and other sources have been somewhat ad hoc and inconsistent. There are several aims to this part of the project:
The current array-indexing approach to DataFrames, though extremely flexible, requires the user to reason about the underlying data structures and types. The most successful data manipulation frameworks, such as Microsoft’s LINQ, Hadley Wickham’s dplyr, and of course SQL, abstract away these lower-level details, freeing the user from needing to think about them, though necessarily limiting the types of operations that can be performed.
The aim is to develop a consistent, high-level data manipulation interface. The exact form and design of this is still up for debate, but this part is one of the highest priorities.
Complementing the front-end work is the need to interface with various backends, building on the excellent work that has already been done. Obvious candidates are DataFrames.jl, relational databases (SQLite, PostgreSQL, and a general ODBC interface), as well as various NoSQL and columnar stores (SFrames, Hive, Google BigTable, Amazon SimpleDB/DynamoDB), and common file formats such as CSV.
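To make the front-end/backend split concrete, here is a minimal sketch of one declarative query description executed against two different backends. It is written in Python purely for illustration and is not a proposal for the eventual Julia design, which is still open. The names `Query`, `run_in_memory` and `run_sqlite` are hypothetical, and the query form (select some columns where one column is at least a threshold) is deliberately tiny.

```python
import sqlite3
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Query:
    """A backend-independent description of what to compute (hypothetical)."""
    table: str
    columns: Sequence[str]
    min_column: str
    min_value: float


def run_in_memory(q: Query, tables: Dict[str, List[dict]]) -> List[tuple]:
    # Backend 1: plain Python rows held in memory.
    return [
        tuple(row[c] for c in q.columns)
        for row in tables[q.table]
        if row[q.min_column] >= q.min_value
    ]


def run_sqlite(q: Query, conn: sqlite3.Connection) -> List[tuple]:
    # Backend 2: translate the same description to SQL.
    # Identifiers are assumed to be trusted in this sketch; only the
    # threshold value is passed as a bound parameter.
    cols = ", ".join(q.columns)
    sql = f"SELECT {cols} FROM {q.table} WHERE {q.min_column} >= ?"
    return conn.execute(sql, (q.min_value,)).fetchall()
```

The user writes the `Query` once and never touches the storage layout; the cost, as noted above, is that only operations the description can express are available.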
There are several potential improvements to Julia itself that will help with these problems, but also have the potential to be useful more generally:
Improve performance of Union types, which occur when the result type of a function is not fully predictable from its input types. For example, we could generate explicit branches when only a small number of types can potentially occur.
Type specialization inside “hot loops”, where the types of values entering the loop might be unpredictable but once the loop is entered, they remain consistent thereafter.
Improve performance of anonymous functions and function arguments.
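The first two ideas can be illustrated by analogy. The sketch below, in Python purely for illustration, branches once over the small set of element types that can occur and then runs a loop specialized to that type. It is not how Julia's compiler works or will work; the function names are hypothetical, and it assumes that the elements keep the type of the first element once the loop has started.

```python
from typing import Sequence, Union

Number = Union[int, float]


def sum_ints(values: Sequence[int]) -> int:
    # Specialized kernel: every element is assumed to be an int.
    total = 0
    for v in values:
        total += v
    return total


def sum_floats(values: Sequence[float]) -> float:
    # Specialized kernel: every element is assumed to be a float.
    total = 0.0
    for v in values:
        total += v
    return total


def sum_generic(values: Sequence[Number]) -> Number:
    # Fallback kernel: makes no assumption about the element type.
    total: Number = 0
    for v in values:
        total = total + v
    return total


def select_kernel(sample: object):
    # The explicit branch: one test per type that can occur.
    if type(sample) is int:
        return sum_ints
    if type(sample) is float:
        return sum_floats
    return sum_generic


def specialized_sum(values: Sequence[Number]) -> Number:
    # Decide once, before entering the hot loop, which kernel to run.
    if not values:
        return 0
    return select_kernel(values[0])(values)
```

The point of the design is that the type check is paid once per call rather than once per iteration.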
Linear and generalized linear models such as least squares and logistic regression are the workhorses of statistical analysis. Although GLM.jl provides basic functionality, it is somewhat sparse in its features and limited in scope. The particular aims of this project include the following:
The initial aim will be to improve the interface, notably by developing graphical output and diagnostics, along with further functionality such as model comparison and model selection.
The current implementation requires that the data be stored in a DataFrame, which necessarily limits the data to being stored in memory. Building on the data manipulation work, we aim to support disparate data sources directly, avoiding the need to construct intermediate Arrays or DataFrames.
Julia’s current GLM algorithm relies on a dense QR-based iteratively re-weighted least squares algorithm, similar to that used by R. Although numerically robust, this approach limits the size of models that can practically be fit. Complementing the above work, we intend to support a more flexible choice of algorithms, such as QR, Cholesky, stochastic gradient descent, MCMC techniques (for example via Lora.jl or Stan.jl), and variational methods for Bayesian models. As part of this work, we intend to support sparse, distributed, and out-of-core computation.
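For readers unfamiliar with iteratively re-weighted least squares, here is a minimal sketch for a logistic regression with an intercept and a single predictor. It is written in Python for illustration and is not the GLM.jl implementation: it solves the 2x2 weighted normal equations directly rather than using QR, and the names `irls_logistic` and `sample_rows` are hypothetical. Each iteration makes one pass over the rows and keeps only a handful of running sums, which hints at how the same computation could be fed from a data source that does not fit in memory.

```python
import math
from typing import Callable, Iterable, Tuple

Row = Tuple[float, float]  # (x, y) with y equal to 0 or 1


def irls_logistic(
    rows: Callable[[], Iterable[Row]],
    max_iter: int = 25,
    tol: float = 1e-10,
) -> Tuple[float, float]:
    """Fit logit(p) = b0 + b1 * x; `rows` returns a fresh iterator per pass."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Running sums for X'WX (symmetric 2x2) and X'Wz.
        s00 = s01 = s11 = t0 = t1 = 0.0
        for x, y in rows():
            eta = b0 + b1 * x                   # linear predictor
            mu = 1.0 / (1.0 + math.exp(-eta))   # fitted probability
            w = mu * (1.0 - mu)                 # working weight
            z = eta + (y - mu) / w              # working response
            s00 += w
            s01 += w * x
            s11 += w * x * x
            t0 += w * z
            t1 += w * x * z
        # Solve the 2x2 weighted least squares system in closed form.
        det = s00 * s11 - s01 * s01
        new_b0 = (s11 * t0 - s01 * t1) / det
        new_b1 = (s00 * t1 - s01 * t0) / det
        converged = max(abs(new_b0 - b0), abs(new_b1 - b1)) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


def sample_rows() -> Iterable[Row]:
    # Small made-up data set, streamed one row at a time.
    xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    ys = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0]
    return zip(xs, ys)
```

Calling `irls_logistic(sample_rows)` returns the fitted intercept and slope. Swapping the closed-form solve for QR, Cholesky or a stochastic update changes only the step after the accumulation loop, which is the kind of flexibility described above.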
Many large-scale problems require richer models than strictly fit within the GLM framework, such as non-parametric effects, repeated measures, or high-dimensional variable selection. One of the long-term aims is to design a high-performance plug-in architecture supporting alternative model factors and enabling features such as mixed effects models, generalized additive models and Lasso models to be fit within the same framework. This will require careful design and algorithmic work, but has the potential to be very powerful.
Packages such as StatsBase.jl, MLBase.jl, Distributions.jl and HypothesisTests.jl provide much of the basic statistics functionality for Julia. Although these packages are now well established and are widely used, they have suffered somewhat from lack of attention and slow response to issues and pull requests. In particular, we are looking to improve things such as: