
Big Data Analytics with OnlineStats.jl

16 Nov 2017 | Josh Day

OnlineStats is a package for computing statistics and models via online algorithms. It is designed for big data and naturally handles out-of-core processing, parallel/distributed computing, and streaming data. JuliaDB fully integrates OnlineStats to provide analytics on large persistent datasets. Future posts will dive into this integration; this post serves as a light introduction to OnlineStats.

What are Online Algorithms?

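Online algorithms accept input one observation at a time. Consider the mean of n data points:

$$\bar{y}_n = \frac{1}{n} \sum_{i=1}^{n} y_i$$

When a single observation is added, the mean could be recalculated from scratch (offline):

$$\bar{y}_{n+1} = \frac{1}{n+1} \sum_{i=1}^{n+1} y_i$$

Or we could use only the current estimate and the new observation (online):

$$\bar{y}_{n+1} = \left(1 - \frac{1}{n+1}\right) \bar{y}_n + \frac{1}{n+1} y_{n+1}$$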

A big advantage of online algorithms is that data does not need to be revisited when new observations are added. The dataset therefore does not need to be fixed in size or small enough to fit in computer memory. The disadvantage is that not every statistic can be calculated exactly, as the mean can. Whenever an exact solution is impossible, OnlineStats relies on state-of-the-art stochastic approximation algorithms.
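
To make the online update of the mean concrete, here is a minimal sketch in plain Julia. The `RunningMean` type and `update!` function are hypothetical names introduced for illustration; they are not part of OnlineStats, whose own implementation may differ.

```julia
# Hypothetical helper type for illustration; not part of OnlineStats.
mutable struct RunningMean
    n::Int          # number of observations seen so far
    value::Float64  # current estimate of the mean
end

RunningMean() = RunningMean(0, 0.0)

# Online update: uses only the current estimate and the new observation.
function update!(m::RunningMean, y::Real)
    m.n += 1
    m.value += (y - m.value) / m.n
    return m
end

y = [2.0, 4.0, 6.0, 8.0]

# Feed the data one observation at a time.
m = RunningMean()
for yi in y
    update!(m, yi)
end

# Offline version: recompute from scratch over the full dataset.
offline = sum(y) / length(y)

println((m.value, offline))
```

The running estimate only ever stores the count and the current mean, so the loop could just as well read from a file or a stream that never fits in memory.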

OnlineStats Basics

The statistics/models of OnlineStats are subtypes of OnlineStat:

using OnlineStats, Plots

# Each OnlineStat is a type
o = IHistogram(100)  
o2 = Sum()

# OnlineStats are grouped together in a Series
s = Series(o, o2)

# Updating the Series updates the grouped OnlineStats
y = randexp(100_000)

# fit!(s, y) translates to:
for yi in y
    fit!(s, yi)
end

plot(o)

Working with Series of Different Inputs

A Series groups together any number of OnlineStats which share a common input. The input (single observation) of an OnlineStat can be a scalar (e.g. Variance), a vector (e.g. CovMatrix), or a vector/scalar pair (e.g. LinReg).

The Series constructor optionally accepts data to fit! right away.

  • Scalar-input Series
julia> Series(randn(100), Mean(), Variance())
 Series{0} with EqualWeight
  ├── nobs = 100
  ├── Mean(0.0899071)
  └── Variance(0.952008)
  • Vector-input Series
    • The MV type can turn a scalar-input OnlineStat into a vector-input version.
julia> Series(randn(100, 2), CovMatrix(2), MV(2, Mean()))
 Series{1} with EqualWeight
  ├── nobs = 100
  ├── CovMatrix([0.916472 0.089655; 0.089655 0.984442])
  └── MV{Mean}(0.17287277199330608, -0.12199728546589127)
  • Vector/Scalar-input Series
    • The Vector holds predictor variables and the Scalar is a response.
julia> Series((randn(100, 3), randn(100)), LinReg(3))
 Series{(1, 0)} with EqualWeight
  ├── nobs = 100
  └── LinReg: β(0.0) = [-0.0486756 -0.0437766 -0.160813]
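
As an illustration of what a vector/scalar input looks like from the algorithm's point of view, the sketch below fits a least-squares regression one (x, y) pair at a time by accumulating the sums of x*x' and x*y. This is an assumption-laden toy, not a description of how LinReg is implemented; `RunningLinReg`, `update!`, and `coefficients` are hypothetical names.

```julia
# Hypothetical type for illustration; not OnlineStats' LinReg.
mutable struct RunningLinReg
    n::Int                # number of (x, y) pairs seen
    xtx::Matrix{Float64}  # running sum of x * x'
    xty::Vector{Float64}  # running sum of x * y
end

RunningLinReg(p::Integer) = RunningLinReg(0, zeros(p, p), zeros(p))

# One observation is a predictor vector x and a scalar response y.
function update!(o::RunningLinReg, x::AbstractVector, y::Real)
    o.n += 1
    o.xtx .+= x * x'
    o.xty .+= x .* y
    return o
end

# Solve the normal equations using the accumulated sums.
coefficients(o::RunningLinReg) = o.xtx \ o.xty

# Noise-free data generated from known coefficients.
beta = [1.0, -2.0, 0.5]
X = [1.0  0.0 0.0;
     0.0  1.0 0.0;
     0.0  0.0 1.0;
     1.0  1.0 1.0;
     2.0 -1.0 0.5]
y = X * beta

o = RunningLinReg(3)
for i in 1:size(X, 1)
    update!(o, X[i, :], y[i])
end

println(coefficients(o))
```

Because the data here contain no noise, the recovered coefficients match `beta` up to floating-point error. The state stored is p-by-p regardless of how many observations are seen.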

Working with Series and Individual OnlineStats

  • value returns the stat’s value
julia> o = Mean()
Mean(0.0)

julia> value(o)
0.0
  • value on a Series maps value to the stats
julia> s = Series(Mean(), Variance())
 Series{0} with EqualWeight
  ├── nobs = 0
  ├── Mean(0.0)
  └── Variance(-0.0)

julia> value(s)
(0.0, -0.0)
  • stats returns a tuple of stats
julia> m, v = stats(s)
(Mean(0.0), Variance(-0.0))

(Embarrassingly) Parallel Computation

At first glance, it may appear that a Series must be fit serially, but OnlineStats provides merge/merge! methods for combining two Series into one. This is how JuliaDB is able to use OnlineStats in a distributed fashion. Below is a simple (not actually parallel) example of merging.

s1 = Series(Mean(), Variance())
s2 = Series(Mean(), Variance())
s3 = Series(Mean(), Variance())

fit!(s1, randn(1000))
fit!(s2, randn(1000))
fit!(s3, randn(1000))

merge!(s1, s2)
merge!(s1, s3)
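
To show why merging is possible at all, here is a minimal plain-Julia sketch of combining the mean and variance of two separately processed chunks, using the well-known pairwise update for the sum of squared deviations. The `Moments` type and the `update!`, `merge_into!`, and `variance` functions are hypothetical names for illustration; the internals of merge! in OnlineStats may differ.

```julia
# Hypothetical type for illustration; not part of OnlineStats.
mutable struct Moments
    n::Int         # number of observations
    mean::Float64  # running mean
    m2::Float64    # running sum of squared deviations from the mean
end

Moments() = Moments(0, 0.0, 0.0)

# Single-observation update (Welford's algorithm).
function update!(s::Moments, y::Real)
    s.n += 1
    delta = y - s.mean
    s.mean += delta / s.n
    s.m2 += delta * (y - s.mean)
    return s
end

# Convenience method: update with every element of a vector.
function update!(s::Moments, ys::AbstractVector)
    for y in ys
        update!(s, y)
    end
    return s
end

# Fold the state of b into a without revisiting any data.
function merge_into!(a::Moments, b::Moments)
    n = a.n + b.n
    n == 0 && return a
    delta = b.mean - a.mean
    a.m2 += b.m2 + delta^2 * a.n * b.n / n
    a.mean += delta * b.n / n
    a.n = n
    return a
end

# Sample variance from the accumulated state.
variance(s::Moments) = s.m2 / (s.n - 1)

chunk1 = [1.0, 2.0, 3.0, 4.0]
chunk2 = [10.0, 20.0]

# Each chunk could be processed on a different worker.
a = update!(Moments(), chunk1)
b = update!(Moments(), chunk2)
merge_into!(a, b)

# Reference: a single pass over all of the data.
whole = update!(Moments(), vcat(chunk1, chunk2))

println((a.mean, variance(a)))
println((whole.mean, variance(whole)))
```

The merged state agrees with the single-pass state up to floating-point error, which is what lets each worker summarize its own partition and ship back only a few numbers.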

Resources

This is a small sample of OnlineStats functionality. For more information, stay tuned for future posts or check out the OnlineStats GitHub repo and documentation.
