Julia’s speed, functionality, flexibility and capacity make it the first choice for programming graphics processing units (GPUs) and for supercomputers that employ accelerators.
Scientific problems are traditionally solved with powerful clusters of homogeneous CPUs connected in a variety of network topologies. However, the number of supercomputers that employ accelerators has steadily been on the rise. According to the Top500 list of the world’s most powerful supercomputers, 109 of the world’s top 500 supercomputers employ accelerators.
The accelerators employed in practice are mostly graphics processing units (GPUs), Xeon Phis and FPGAs. These accelerators take advantage of many-core architectures, which can be used to exploit both coarse-grained and fine-grained parallelism. However, the traditional problem with GPUs and other accelerators has been the difficulty of programming them. To address this, NVIDIA Corporation designed the now pervasive Compute Unified Device Architecture (CUDA), which provides a C-like interface for scientific and general-purpose programming. This was a considerable improvement over previous frameworks such as DirectX or OpenGL, which required advanced skills in graphics programming. Even so, CUDA still ranks low in terms of productivity, because programmers have to fine-tune their applications for different devices and algorithms. In this context, interactive programming on the GPU would provide tremendous benefits to scientists and programmers who wish not only to prototype their applications, but also to deploy them with little or no code changes.
Julia offers programmers the ability to code interactively on the GPU. Several libraries are wrapped in Julia, giving Julia users access to accelerated BLAS (by Nick Henderson, Katharine Hyatt, et al.), FFTs (by Tim Holy et al.), sparse routines and solvers (both by Katharine Hyatt), and deep learning (by denizyuret). With a combination of these packages, programmers can interactively develop custom GPU kernels. One example of a custom kernel is the Conjugate Gradient, which has been benchmarked below.
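For readers unfamiliar with the method, here is a minimal sketch of Conjugate Gradient in generic Julia. It is a plain CPU version that uses only the standard library, not the benchmarked GPU kernel; the function name `cg_solve` and its keyword arguments are illustrative assumptions. It is written against `AbstractMatrix` and `AbstractVector`, so in principle the same code could run on any array type that supports these operations.

```julia
using LinearAlgebra

# Hypothetical helper (not from any package): solve A*x = b for a
# symmetric positive-definite matrix A with the Conjugate Gradient method.
function cg_solve(A::AbstractMatrix, b::AbstractVector;
                  tol=1e-10, maxiter=10 * length(b))
    x = zero(b)          # initial guess x = 0
    r = b - A * x        # initial residual
    p = copy(r)          # initial search direction
    rs_old = dot(r, r)   # squared norm of the residual
    for _ in 1:maxiter
        sqrt(rs_old) < tol && break      # stop once the residual is small
        Ap = A * p
        alpha = rs_old / dot(p, Ap)      # step length along p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = dot(r, r)
        p = r + (rs_new / rs_old) * p    # next A-conjugate direction
        rs_old = rs_new
    end
    return x
end

# Small symmetric positive-definite test system
A = [4.0 1.0; 1.0 3.0]
b = [1.0, 2.0]
x = cg_solve(A, b)
```

Each iteration needs only a matrix-vector product, dot products and vector updates, which is why the method maps well onto accelerated BLAS and sparse routines.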
However, one might argue that low-level wrapper libraries do not increase programmer productivity, as they involve working with obscure function interfaces. In that case, it would be ideal to have a clean interface for arrays on the GPU, together with a convenient standard library that operates on these arrays. Each operation would in turn be tuned for the device in question to achieve great performance. The folks over at ArrayFire have put together a high-quality open-source library for working on scientific problems with GPUs.
ArrayFire.jl is a set of Julia bindings to the ArrayFire library. It is designed to mimic the Julia standard library in its versatility and ease of use, providing an easy yet powerful array interface that points to locations in GPU memory.
Julia’s multiple dispatch and generic programming capabilities make it possible for users to write natural mathematical code and transparently leverage GPUs for performance. This is done by defining a type AFArray as a subtype of AbstractArray. An AFArray acts as an interface to an array in device memory. A set of functions is imported from Base Julia and dispatched on the new AFArray type. Users can therefore write code in Julia that runs on the CPU and port it to run on the GPU with very few code changes. In addition to functions that mimic Julia’s standard library, ArrayFire.jl provides powerful functions for image processing and computer vision, amongst others.
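The dispatch pattern described here can be illustrated with a small, self-contained sketch. The type `DeviceArray` below is a hypothetical stand-in, not ArrayFire.jl's actual `AFArray`: its `data` field is an ordinary host array playing the role of device memory, and the specialised methods only mark where a real binding would call into the accelerator library.

```julia
# Hypothetical stand-in for a device-backed array type (NOT ArrayFire.jl's AFArray).
# `data` is a normal host Array that plays the role of device memory here.
struct DeviceArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
end

# Minimal AbstractArray interface: size and scalar indexing
Base.size(a::DeviceArray) = size(a.data)
Base.getindex(a::DeviceArray, i::Int...) = a.data[i...]

# Functions from Base, specialised for the new type. A real binding would
# forward these calls to the accelerator library instead of the host array.
Base.:+(a::DeviceArray, b::DeviceArray) = DeviceArray(a.data .+ b.data)
Base.sum(a::DeviceArray) = sum(a.data)

# Generic user code: written once, with no mention of where the data lives
total_of_sum(x, y) = sum(x + y)

cpu = total_of_sum([1.0, 2.0], [3.0, 4.0])                            # Base methods
dev = total_of_sum(DeviceArray([1.0, 2.0]), DeviceArray([3.0, 4.0]))  # DeviceArray methods
```

The user-facing function `total_of_sum` is identical in both calls; only the argument type changes, and multiple dispatch selects the matching methods.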
The benefits of accelerated code can be seen in real-world applications. Consider the following image segmentation demo on satellite footage of Hurricane Katrina. Image segmentation is an important step in weather forecasting, and has to be performed on many high-definition images on a daily basis. In such a use case, interactive GPU programming allows the application designer to leverage powerful graphics processing on the GPU with little or no code changes from the original prototype. The application uses the K-means algorithm, which can easily be expressed in Julia and accelerated by ArrayFire.jl. It initializes some random clusters and then reassigns the clusters according to Manhattan distances.
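The K-means procedure described above can be sketched in a few lines of plain Julia. This is a CPU-only illustration rather than the demo's actual code: the function name `kmeans_manhattan`, its arguments, the fixed iteration count and the choice to update each centre with the mean of its members are all assumptions made for the example.

```julia
using Random

# Hypothetical helper: cluster the columns of X (features × points) into k groups.
function kmeans_manhattan(X::AbstractMatrix, k::Integer;
                          iters=20, rng::AbstractRNG=Random.default_rng())
    n = size(X, 2)
    # Random initialisation: pick k distinct data points as starting centres
    centers = float.(X[:, randperm(rng, n)[1:k]])
    labels = zeros(Int, n)
    for _ in 1:iters
        # Assignment step: nearest centre under the Manhattan (L1) distance
        for j in 1:n
            dists = [sum(abs.(X[:, j] .- centers[:, c])) for c in 1:k]
            labels[j] = argmin(dists)
        end
        # Update step (assumed): move each centre to the mean of its members
        for c in 1:k
            members = findall(==(c), labels)
            isempty(members) && continue   # keep the centre of an empty cluster
            centers[:, c] = vec(sum(X[:, members], dims=2)) ./ length(members)
        end
    end
    return labels, centers
end

# Two well-separated groups of one-dimensional "pixel intensities"
X = [0.0 0.1 0.2 10.0 10.1 10.2]
labels, centers = kmeans_manhattan(X, 2; rng=MersenneTwister(1))
```

For an image, each column of `X` would hold the feature vector of one pixel, and the resulting labels define the segmentation.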
The package CUDAnative.jl by Tim Besard et al. provides support for compiling and executing native Julia kernels on CUDA hardware. This allows Julia users to write pure Julia code that compiles all the way down to PTX IR and executes directly on CUDA-enabled GPUs.