Most modern computers possess more than one CPU, and several computers can be combined in a cluster. Harnessing the power of multiple CPUs allows many computations to be completed more quickly. Two major factors influence performance: the speed of the CPUs themselves and the speed of their access to memory. In a cluster, a given CPU has fastest access to the RAM within its own computer, or node. Perhaps more surprisingly, similar issues arise on a typical multicore laptop, due to the difference in speed between main memory and the cache. Consequently, a good multiprocessing environment should allow control over the “ownership” of a chunk of memory by a particular CPU.
Julia provides built-in primitives for both shared-memory and distributed parallelism. Julia’s native multi-threading allows the user to harness the power of a multi-core laptop, while its remote-call and remote-fetch primitives allow work to be distributed across many processes in a cluster. In addition to these built-in primitives, a number of packages in the Julia ecosystem enable efficient parallel processing.
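As a minimal sketch of the remote-call and remote-fetch primitives (using the syntax of recent Julia versions, where they live in the `Distributed` standard library):

```julia
using Distributed

addprocs(2)                      # launch two local worker processes

# remotecall starts evaluation on a worker and immediately returns a Future
f = remotecall(sum, workers()[1], 1:100)
result = fetch(f)                # block until the value is available (5050)

# remotecall_fetch combines the call and the fetch in one step
x = remotecall_fetch(+, workers()[2], 20, 22)   # 42
```

The same primitives work unchanged whether the workers are local processes or processes on other machines in a cluster.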
There are two ways to think about parallelism in Julia: distributed and multi-threaded. While Julia’s built-in primitives are sufficient for large-scale parallel deployments, a number of packages provide convenience features. ClusterManagers.jl provides interfaces to job queue systems commonly used on compute clusters, such as Sun Grid Engine and Slurm. DistributedArrays.jl provides a convenient array interface to data distributed across a cluster. This combines the memory resources of multiple machines, allowing the use of arrays too large to fit on one machine. Each process operates on the part of the array it owns, providing a ready answer to the question of how a program should be divided among machines.
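A brief sketch of the DistributedArrays.jl interface (an external package that must be installed separately, e.g. via the package manager):

```julia
using Distributed
addprocs(4)                        # four local workers stand in for cluster nodes
@everywhere using DistributedArrays

A = dzeros(100, 100)               # a 100x100 array of zeros, split across workers
B = distribute(collect(1:1000))    # distribute an existing local array

# Each worker holds only the chunk it owns; localpart retrieves that chunk
chunk = @fetchfrom workers()[1] localpart(B)
```

Operations on the distributed array dispatch to the owning processes, so the programmer works with one logical array while the data stays partitioned across machines.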
For some legacy applications, users prefer not to rethink their parallel model and wish to continue using MPI-style parallelism. For these users, MPI.jl provides a thin wrapper around MPI, exposing the familiar message-passing routines.
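A sketch of what this looks like with MPI.jl (an external package; the script is launched under an MPI runtime, e.g. `mpiexec -n 4 julia script.jl`, and the exact argument order of some calls varies between package versions):

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)         # this process's rank, 0-based
nprocs = MPI.Comm_size(comm)       # total number of ranks

# Each rank contributes its id; rank 0 receives the sum
total = MPI.Reduce(rank, +, 0, comm)
if rank == 0
    println("sum of ranks across ", nprocs, " processes = ", total)
end

MPI.Finalize()
```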
Multi-threading has been an experimental feature of Julia since v0.5. It usually takes the form of parallel loops; there are also primitives for locks and atomics that allow users to synchronize their code.
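A minimal sketch of a parallel loop and an atomic update (Julia must be started with multiple threads, e.g. by setting `JULIA_NUM_THREADS` before launch):

```julia
using Base.Threads

# Iterations of the loop body are divided among the available threads
a = zeros(Int, 1000)
@threads for i in 1:1000
    a[i] = i                       # independent writes need no synchronization
end

# Shared mutable state needs an atomic (or a lock) to avoid a data race
acc = Atomic{Int}(0)
@threads for i in 1:1000
    atomic_add!(acc, i)
end
# acc[] == 500500 regardless of the number of threads
```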
Julia’s parallel primitives are simple yet powerful, and they have been shown to scale to thousands of nodes and to process terabytes of data. Both sets of primitives came together in a project conducted at Lawrence Berkeley National Labs (LBNL) to analyze the Sloan Digital Sky Survey (SDSS), a 55-terabyte dataset of nearly 5 million images of outer space, using a model developed to catalogue the universe called Celeste. The team includes researchers from Julia Computing, JuliaLabs@MIT, Intel, UC Berkeley, LBNL, and the National Energy Research Scientific Computing Center (NERSC).
The team used the Cori Cray XC40 supercomputer at NERSC. Cori Phase I (also known as the “Cori Data Partition”) has 1,630 compute nodes, each containing two Intel Xeon E5-2698v3 processors (16 cores each) running at 2.3 GHz and 128 GB of DDR4 memory, for a total of 3,260 processors and 52,160 cores. Nodes are linked through a Cray Aries high-speed “dragonfly” topology interconnect. Datasets used in these experiments were staged on Cori’s 30 PB Lustre file system, which has an aggregate bandwidth exceeding 700 GB/s. The team scaled Julia’s native parallelism to 8,192 Xeon cores, increased the speed of image processing by 225x, and analyzed three orders of magnitude more SDSS images than any prior analysis, demonstrating definitively that Julia’s native parallelism is supercomputer-ready.