Julia is a modern, easy-to-use, high-performance programming language. It provides a sophisticated compiler, parallel execution, numerical accuracy, and an extensive library of fast mathematical functions. It’s a thriving open-source project with over 500 contributors from around the world and over 1,000 community-built packages. It’s used by a number of universities for teaching and research, by businesses in fields as diverse as robotics, artificial intelligence, finance, and e-commerce, and by institutions such as the U.S. Federal Aviation Administration and the Bank of England.
The age of data
Breakthroughs in data capture, genome sequencing, medical imaging, and other fields of biological research, coupled with the ubiquity of cheap digital storage, have paved the way for massive amounts of biological data. It’s estimated that by the year 2025, between 2 and 40 exabytes of human genomic data alone will be collected every year (Stephens et al., 2015). Unfortunately, most mainstream software can’t process data on that scale efficiently, which leaves these troves of data underutilized.
Julia provides a wide variety of facilities for using and processing data effectively. It can efficiently store data structures in memory for quick access, but when datasets are too large to fit into memory, Julia can employ memory mapping of data files stored on disk. This allows for fast and efficient processing, even when memory is limited.
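As a minimal sketch of the memory-mapping approach described above, the following uses the `Mmap` standard library. The temporary file and the small vector are stand-ins for a large on-disk dataset and are assumptions made purely for illustration.

```julia
using Mmap

# Write a vector of Float64 values to a temporary file.
# (Illustrative stand-in for a data file too large to load into RAM.)
path, io = mktemp()
data = collect(1.0:1000.0)
write(io, data)
close(io)

# Map the file into memory. Elements are paged in from disk on access
# instead of being read into RAM all at once.
mapped_total = open(path) do f
    v = Mmap.mmap(f, Vector{Float64}, (length(data),))
    sum(v)
end

println(mapped_total)
```

The mapped array behaves like an ordinary `Vector{Float64}`, so existing functions such as `sum` work on it unchanged.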
Data are often messy and complex. Julia doesn’t take a one-size-fits-all approach to data structures; instead, it provides a sophisticated yet easy to use system, where users can employ whichever structure most efficiently and sensibly stores their data. Users are not forced to choose from among a strict limited set of data types. When no existing data type fits the bill, users can create their own types and define any set of operations for them. This kind of extensibility and flexibility is at the heart of Julia.
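To illustrate what creating a type and defining operations for it can look like, here is a small sketch. The `GenomicInterval` type and the `overlaps` function are hypothetical names invented for this example, not part of any package.

```julia
# Hypothetical type for illustration: a closed interval on a chromosome.
struct GenomicInterval
    chrom::String
    start::Int
    stop::Int
end

# Extend a built-in generic function so that `length` works on the new type.
Base.length(iv::GenomicInterval) = iv.stop - iv.start + 1

# Define a new operation: do two intervals share at least one position?
overlaps(a::GenomicInterval, b::GenomicInterval) =
    a.chrom == b.chrom && a.start <= b.stop && b.start <= a.stop

a = GenomicInterval("chr1", 100, 200)
b = GenomicInterval("chr1", 150, 300)
c = GenomicInterval("chr2", 150, 300)

println(length(a))
println(overlaps(a, b))
println(overlaps(a, c))
```

User-defined types like this one take part in multiple dispatch in the same way as built-in types.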
Genome sequencing produces massive quantities of data – the human genome consists of over 3 billion nucleotides. However, it can be stored as just a few thousand runs using run-length encoding. In Julia, this functionality is provided by RLEVectors.jl, a package created with the help of pharmaceutical scientists that uses run-length encoding to store vectors in a memory-efficient manner. In benchmarks, RLEVectors.jl is shown to be 1,000 to 65,000 times faster than similar functionality from the R Bioconductor package (Haverty, 2015), as can be seen in this comparative graphic:
[Figure: benchmark comparison of RLEVectors.jl with the equivalent R Bioconductor functionality]
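To make the idea of run-length encoding concrete, here is a minimal sketch in plain Julia. It does not use or reproduce the RLEVectors.jl API. The function names `rle_encode` and `rle_decode` and the sample signal are assumptions made for this illustration.

```julia
# Compress a vector into parallel vectors of run values and run lengths.
function rle_encode(v::AbstractVector{T}) where {T}
    vals = T[]
    lens = Int[]
    for x in v
        if !isempty(vals) && vals[end] == x
            lens[end] += 1          # extend the current run
        else
            push!(vals, x)          # start a new run
            push!(lens, 1)
        end
    end
    return vals, lens
end

# Expand run values and run lengths back into the full vector.
function rle_decode(vals::AbstractVector{T}, lens::AbstractVector{<:Integer}) where {T}
    out = T[]
    for (x, n) in zip(vals, lens)
        append!(out, fill(x, n))
    end
    return out
end

# Illustrative data: a piecewise-constant signal, such as copy number along a chromosome.
signal = [2, 2, 2, 2, 3, 3, 2, 2, 2, 1]
vals, lens = rle_encode(signal)
println(vals)
println(lens)
```

The longer and fewer the runs, the greater the savings: ten elements are stored here as four runs.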
As the scale of data increases, so must the scale of computation. Many problems in the life sciences lend themselves particularly well to parallel processing, such as the analysis of single nucleotide polymorphisms in genome-wide association studies and simulating disease outbreaks in epidemiological models based on individuals. Julia was built with effortless parallelism in mind, be it on a single multicore machine, a supercomputing cluster, or in the cloud.
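As a sketch of single-machine parallelism, the following uses `Threads.@threads` to process each SNP of a small genotype matrix independently. The matrix is made-up illustrative data, and the number of threads actually used depends on how Julia was started (for example with the `--threads` option).

```julia
using Base.Threads

# Hypothetical genotype matrix: rows are individuals, columns are SNPs,
# entries count copies of the minor allele (0, 1, or 2).
genotypes = [0 1 2;
             1 1 0;
             2 0 0;
             1 2 0]

n_ind, n_snp = size(genotypes)
freq = zeros(n_snp)

# Each SNP is independent, so columns can be processed on separate threads.
# Every iteration writes only to its own slot of `freq`, so there is no data race.
@threads for j in 1:n_snp
    freq[j] = sum(view(genotypes, :, j)) / (2 * n_ind)
end

println(freq)
```

The same loop runs correctly on a single thread. For clusters, the `Distributed` standard library offers analogous tools across processes.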
Regardless of where you run Julia, well-written Julia code is fast, even in serial. In benchmarks, its performance approaches, and in some cases beats, that of C and Fortran, the current de facto languages for performance-critical applications. And because Julia compiles code just in time rather than ahead of time, there’s no separate build step to wait for before you can run your code. This makes it easy to rapidly prototype and iterate on ideas.
Recently, a Julia package called Gillespie.jl was published in the Journal of Open Source Software. It implements Gillespie’s direct method for stochastic simulations, which is widely used in fields such as systems biology and epidemiology, in pure Julia with no parallelism. In benchmarks, it’s shown to be over 500 times faster than the equivalent package for R, and over 600 times faster than hand-written R code for the same tasks (Frost, 2016). Amazingly, no special optimization tricks were used to achieve this huge gain in performance; Gillespie.jl is fast simply by virtue of being built on Julia.
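For readers unfamiliar with Gillespie’s direct method, here is a minimal sketch for a simple SIR epidemic, written in plain Julia. It is not the Gillespie.jl API. The function name `sir_direct`, the model, and the parameter values are all assumptions chosen for illustration.

```julia
using Random

# Gillespie's direct method for an SIR model with two reactions:
#   infection: S + I -> 2I  at rate infection_rate * S * I / N
#   recovery:  I -> R       at rate recovery_rate * I
function sir_direct(susceptible::Int, infected::Int, recovered::Int,
                    infection_rate::Float64, recovery_rate::Float64, tmax::Float64;
                    rng::AbstractRNG = Random.default_rng())
    n = susceptible + infected + recovered
    t = 0.0
    times = [t]
    states = [(susceptible, infected, recovered)]
    while t < tmax && infected > 0
        a_inf = infection_rate * susceptible * infected / n
        a_rec = recovery_rate * infected
        a_tot = a_inf + a_rec
        # Waiting time to the next event is exponential with rate a_tot.
        t += randexp(rng) / a_tot
        t > tmax && break
        # Pick which reaction fires, with probability proportional to its rate.
        if rand(rng) * a_tot < a_inf
            susceptible -= 1
            infected += 1
        else
            infected -= 1
            recovered += 1
        end
        push!(times, t)
        push!(states, (susceptible, infected, recovered))
    end
    return times, states
end

# Illustrative parameters; the seed makes the run reproducible.
times, states = sir_direct(990, 10, 0, 0.5, 0.1, 50.0; rng = MersenneTwister(1))
println(length(times), " recorded states; final state: ", states[end])
```

Each pass through the loop draws one waiting time and one reaction, which is the core of the direct method. The trajectory is recorded only at event times.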
Venturing into the wild
Inertia can be a major factor when adopting a new technology. You shouldn’t have to change your entire workflow to incorporate something new. Julia provides easy interoperability with many database systems, file formats, and even other programming languages, such as R and Python. If you need to interface Julia with something, chances are it’s either been done or it’s straightforward to do yourself. Julia has a robust foreign function interface, so anything that can be called from C can also be called from Julia.
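As a small illustration of the foreign function interface, the following calls the C standard library’s `strlen` directly with `ccall`. It assumes a Unix-like system where `strlen` can be resolved in the running process; on other platforms the library may need to be named explicitly.

```julia
# Call C's `strlen` with no wrapper or glue code.
# Arguments: function name, return type, tuple of argument types, then the arguments.
n = ccall(:strlen, Csize_t, (Cstring,), "GATTACA")

println(n)
```

Julia converts the `String` to a NUL-terminated C string for the duration of the call.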
We believe that Julia is not just the language of the future, but also the language of now. It’s a modern solution for modern problems, with the ability to adapt to new challenges with ease. That’s why we feel it’s the right choice for the life sciences industry and research.
While Julia has already proven to be quite effective in this area, there is still more to be done. In particular, Julia needs to expand its general biostatistics toolkit to include methodologies such as Cox proportional hazards regression and Kaplan-Meier estimated survival. Methods common in epidemiology, such as generalized estimating equations, should also be implemented.
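To give a sense of how compact such methods can be in Julia, here is a minimal sketch of the Kaplan-Meier estimator. The survival estimate at each event time is the running product of (1 - d/n), where d is the number of events at that time and n is the number still at risk just before it. The function name `kaplan_meier` and the sample data are assumptions for illustration, and a production implementation would also need confidence intervals and input validation.

```julia
# Kaplan-Meier survival estimate.
# `times` holds follow-up times; `observed[i]` is true for an event, false if censored.
function kaplan_meier(times::AbstractVector{<:Real}, observed::AbstractVector{Bool})
    order = sortperm(times)
    t_sorted = times[order]
    obs_sorted = observed[order]
    n_at_risk = length(times)
    surv = 1.0
    event_times = eltype(times)[]
    survival = Float64[]
    i = 1
    while i <= length(t_sorted)
        t = t_sorted[i]
        events = 0
        removed = 0
        # Group tied times so events and censorings at the same time are handled together.
        while i <= length(t_sorted) && t_sorted[i] == t
            events += obs_sorted[i]
            removed += 1
            i += 1
        end
        if events > 0
            surv *= 1 - events / n_at_risk
            push!(event_times, t)
            push!(survival, surv)
        end
        n_at_risk -= removed
    end
    return event_times, survival
end

# Made-up follow-up data: six subjects, two of them censored.
followup = [1, 2, 2, 3, 4, 5]
event_seen = [true, true, false, true, false, true]
event_times, survival = kaplan_meier(followup, event_seen)
println(event_times)
println(survival)
```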
Julia’s compliance with 21 CFR Part 11 should be documented to show that it’s ready to take on the rigorous needs of clinical trials. Also crucial for clinical trials is the ability to summarize data into production-quality tables, listings, and figures, and save them in common formats such as RTF and PDF. Anyone who has created an adverse events table for a clinical trial has depended on such functionality from a software package; having this functionality available in Julia will be critical for driving adoption of Julia for clinical trials reporting.
The Julia community is doing amazing things. We hope you’ll join us.
Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, et al. (2015) Big Data: Astronomical or Genomical? PLoS Biol 13(7): e1002195. doi:10.1371/journal.pbio.1002195
Haverty, Peter. 2015. “RLEVectors.jl: Run length encoded vectors for julia, inspired by BioConductor.” https://github.com/phaverty/RLEVectors.jl.
Frost, Simon D.W. 2016. “Gillespie.jl: Stochastic Simulation in Julia.” https://github.com/sdwfrost/Gillespie.jl.