Most modern computers possess more than one CPU, and several computers can be combined together in a cluster. Harnessing the power of multiple CPUs allows many computations to be completed more quickly. There are two major factors that influence performance: the speed of the CPUs themselves and the speed of their access to memory. In a cluster, a given CPU will have fastest access to the RAM within the same computer or node. Perhaps more surprisingly, similar issues are relevant on a typical multicore laptop, due to differences in the speed of main memory and the cache. Consequently, a good multiprocessing environment should allow control over the “ownership” of a chunk of memory by a particular CPU.
Julia has several built-in primitives for parallel computing at every level: vectorization (SIMD), multithreading, and distributed computing.
Julia’s native multi-threading allows the user to harness the power of a multi-core laptop while its remote-call and remote-fetch primitives allow distribution of work across many processes in a cluster. In addition to these in-built primitives, a number of packages in the Julia ecosystem allow for efficient parallel processing.
Auto-vectorization in Julia
Modern Intel chips provide a range of instruction set extensions. Among these are the various revisions of the Streaming SIMD Extension (SSE) and several generations of Advanced Vector Extensions (available with their latest processor families). These extensions provide Single Instruction Multiple Data (SIMD)-style programming, providing significant speed up for code amenable to such a programming style. Julia’s powerful LLVM-based compiler can automatically generate highly efficient machine code for base functions and user-written functions alike, on any architecture like the SIMD Hardware (supported by LLVM), making the user worry less about writing specialized code for each of these architectures. One additional benefit of relying on the compiler for performance rather than hand-coding hot loops in assembly is that it is significantly more future proof. Whenever a next generation instruction set architecture comes out, the user’s Julia code automatically gets faster.
Multi-threading in Julia usually takes the form of parallel loops. There are also primitives for locks and atomics that allow users to synchronize their code.Julia’s parallel primitives are simple yet powerful. They are shown to scale to thousands of nodes and process terabytes of data.
While Julia’s in-built primitives are sufficient for large scale parallel deployments, a number of packages exist for convenience features. ClusterManagers.jl provides interfaces to a number of different job queue systems commonly used on compute clusters such as Sun Grid Engine and Slurm. DistributedArrays.jl provides a convenient array interface to data distributed across a cluster. This combines the memory resources of multiple machines, allowing use of arrays too large to fit on one machine. Each process operates on the part of the array it owns, providing a ready answer to the question of how a program should be divided among machines.
For some legacy applications, users prefer not to rethink their parallel model and wish to continue to using MPI style parallelism. For these users, MPI.jl provides a thin wrapper around MPI that allows users to employ message passing routines in the style of MPI.
Celeste is a fully generative hierarchical model that uses statistical inference to mathematically locate and characterize light sources in the sky. This model allows astronomers to identify promising galaxies for spectrograph targeting, define galaxies for further exploration, and help understand dark energy, dark matter, and the geometry of the universe.
Using Julia’s native parallel computing capabilities, the Celeste research team processed 55 terabytes of visual data and classified 188 million astronomical objects in just 15 minutes, resulting in the first comprehensive catalog of all visible objects from the Sloan Digital Sky Survey. This is one of the largest problems in mathematical optimization ever solved.
The Celeste project used 9,300 Knights Landing (KNL) nodes on the NERSC Cori Phase II supercomputer to execute 1.3 million threads on 650,000 KNL cores, joining the rarified list of applications to exceed 1 petaflop per second performance, making Julia the only dynamic high-level language to have ever achieved a feat like this.