Supercomputer Accelerators

Julia’s speed, expressiveness, flexibility, and scalability make it a natural choice for programming graphics processing units (GPUs) and the other accelerators found in a growing number of supercomputers.

Scientific problems are traditionally solved with powerful clusters of homogeneous CPUs connected in a variety of network topologies. However, the number of supercomputers that employ accelerators has been steadily rising: according to the Top500 list, 109 of the world’s 500 most powerful supercomputers employ accelerators.

The accelerators employed in practice are mostly graphics processing units (GPUs), Xeon Phis, and FPGAs. These accelerators use many-core architectures that can exploit both coarse- and fine-grained parallelism. The traditional problem with GPUs and other accelerators, however, has been the difficulty of programming them. To this end, NVIDIA Corporation designed the now-pervasive Compute Unified Device Architecture (CUDA) to provide a C-like interface for scientific and general-purpose programming — a considerable improvement over earlier frameworks such as DirectX or OpenGL, which required advanced skills in graphics programming. Even so, CUDA still ranks low on the productivity curve, with programmers having to fine-tune their applications for different devices and algorithms. In this context, interactive programming on the GPU provides tremendous benefits to scientists and programmers who wish not only to prototype their applications, but to deploy them with little or no code change.

Julia on GPUs

Interactive GPU Programming

Julia has always been well regarded for programming multicore CPUs and large parallel computing systems. Recent developments make it a competitive and suitable choice for GPU computing as well, offering programmers the ability to code interactively on the GPU.

The performance possibilities of GPUs can be democratized by providing more high-level tools that are easy to use by a large community of applied mathematicians and machine learning programmers.

The Julia package ecosystem contains quite a few GPU-related packages and wrapper libraries, targeting different levels of abstraction as shown in the image above.

At the highest abstraction level, domain-specific packages like MXNet.jl and TensorFlow.jl can transparently use the GPUs in a system. More generic development is possible with ArrayFire.jl, and for specialized CUDA implementation of a linear algebra or deep neural network algorithm, there are vendor-specific packages like cuBLAS.jl or cuDNN.jl. All these packages are essentially wrappers around native libraries, making use of Julia’s foreign function interfaces (FFI) to call into the library’s API with minimal overhead.
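To illustrate the FFI pattern these wrapper packages rely on, here is a minimal sketch of a direct `ccall` into the CUDA runtime library. This is not code from any of the packages above, just an example of the mechanism; it assumes `libcudart` is installed and on the library search path.

```julia
# Query the number of CUDA devices by calling the CUDA runtime directly.
# cudaGetDeviceCount has the C signature: cudaError_t cudaGetDeviceCount(int*)
count = Ref{Cint}(0)
status = ccall((:cudaGetDeviceCount, "libcudart"), Cint, (Ptr{Cint},), count)
status == 0 || error("CUDA runtime returned error code $status")
println("CUDA devices available: ", count[])
```

Because `ccall` compiles down to a plain C function call, wrappers built this way add essentially no overhead over calling the library from C.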

Native GPU Programming (CUDAnative.jl)

The CUDAnative.jl package adds native GPU programming capabilities to the Julia programming language. Used together with CUDAdrv.jl or CUDArt.jl, which interface with the CUDA driver and runtime libraries respectively, it lets users do low-level CUDA development in Julia without an external language or compiler.

Julia lets the user generate specialized code for compiling the kernel function to GPU assembly, upload it to the driver, and prepare the execution environment. Combined with Julia’s just-in-time (JIT) compiler, this results in a very efficient kernel launch sequence, avoiding runtime overhead typically associated with dynamic languages.
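The flow above can be sketched with a small vector-addition kernel, written in the style of the early CUDAnative.jl examples (newer releases use keyword arguments to `@cuda` instead of the tuple launch configuration shown here). It requires a CUDA-capable GPU.

```julia
using CUDAdrv, CUDAnative

# The kernel: an ordinary Julia function, compiled to GPU assembly by @cuda.
function kernel_vadd(a, b, c)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    c[i] = a[i] + b[i]
    return nothing
end

a = rand(Float32, 1024)
b = rand(Float32, 1024)
d_a = CuArray(a)          # upload inputs to the device
d_b = CuArray(b)
d_c = similar(d_a)        # allocate device memory for the result

# Launch with 4 blocks of 256 threads; the kernel is JIT-compiled on first use.
@cuda (4, 256) kernel_vadd(d_a, d_b, d_c)

c = Array(d_c)            # copy the result back to the host
@assert c ≈ a .+ b
```

The first call compiles and caches the kernel; subsequent launches reuse the compiled code, which is what makes the launch sequence so efficient.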

Tim Besard, the original author of CUDAnative.jl, introduces the package for programming NVIDIA GPUs at JuliaCon 2017.

CUDAnative.jl avoids multiple independent language implementations, which often result in subtly different semantics. This means Julia can be used to write GPU functions just as it is used for CPU code, bringing features like dynamically-typed multimethods, metaprogramming, and arbitrary types and objects to the world of GPU programming.
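As a hedged sketch of what that buys you: a single generic kernel definition can be launched on arrays of different element types, and the compiler specializes it for each type, just as it would for CPU code. This example assumes a CUDA-capable GPU and the tuple-style `@cuda` launch syntax of early CUDAnative.jl releases.

```julia
using CUDAdrv, CUDAnative

# One generic kernel; a specialized GPU version is compiled per element type.
function scale!(xs, factor)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(xs)
        xs[i] *= factor
    end
    return nothing
end

for T in (Float32, Int32)
    d_xs = CuArray(T[1, 2, 3, 4])
    @cuda (1, 4) scale!(d_xs, T(2))   # specializes scale! for this T
    @assert Array(d_xs) == T[2, 4, 6, 8]
end
```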

The chart above compares CUDAnative.jl performance with CUDA C++ for 10 benchmarks. CUDAnative.jl provides a 30%+ performance improvement compared with CUDA C++ for the nn benchmark and is comparable (+/- 7%) for the other nine benchmarks tested.

High-level GPU Programming (CuArrays.jl)

Julia’s combination of dynamic language semantics, a specializing JIT compiler and first-class metaprogramming makes it possible to create really powerful high-level abstractions that are hard to realize in other dynamic languages, and downright impossible with statically compiled code. Thanks to CUDAnative.jl, it is now possible to create such abstractions for GPU programming, without sacrificing performance.

For example, the CuArrays.jl package combines the performance of cuBLAS.jl with the flexibility of CUDAnative.jl to offer CUDA-accelerated arrays that behave just like any other array in Julia. It builds on Julia’s support for higher-order functions, which are automatically specialized.
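In practice this means ordinary array code, including broadcasting with user-defined functions, runs on the GPU with no kernel written by hand. A brief sketch (requires a CUDA-capable GPU):

```julia
using CuArrays

a = CuArray(rand(Float32, 1024))
b = CuArray(rand(Float32, 1024))

c = a .+ b                    # broadcast compiles to a fused GPU kernel
d = map(x -> x^2 + 1f0, a)    # higher-order functions specialize on the GPU too
s = sum(a .* b)               # reduction executed on the device

@assert Array(c) ≈ Array(a) .+ Array(b)
```

Vendor libraries are used where they win: matrix multiplication of two `CuArray`s, for instance, dispatches to cuBLAS, while arbitrary element-wise code goes through CUDAnative.jl-compiled kernels.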

This post was also published on the NVIDIA blog.
