The Sloan Digital Sky Survey contains nearly 5 million telescopic images of 12 megabytes each – a dataset of 55 terabytes.
To analyze this massive dataset, researchers at UC Berkeley and Lawrence Berkeley National Laboratory created a new code named Celeste. Celeste is a fully generative hierarchical model that uses statistical inference to mathematically locate and characterize light sources in the sky. The model allows astronomers to identify promising galaxies for spectrograph targeting and define galaxies for further exploration, and it helps them understand dark energy, dark matter and the geometry of the universe.
The research team developed a new parallel computing method that leverages the Julia programming language and the 8,192 Intel Xeon processors in the National Energy Research Scientific Computing Center (NERSC) Cori supercomputer at Berkeley Lab.
What was the result?
A 225x increase in the speed of image analysis. This enabled the processing of more than 20,000 images, or 250 gigabytes, an increase of three orders of magnitude compared with previous iterations.
“Astronomical surveys are the primary source of data about the Universe beyond our solar system,” said Jeff Regier, a postdoctoral fellow in the UC Berkeley Department of Electrical Engineering and Computer Sciences who has been instrumental in the development of Celeste. “Through Bayesian statistics, Celeste combines what we already know about stars and galaxies from previous surveys and from physics theories, with what can be learned from new data. Its output is a highly accurate catalog of galaxies’ locations, shapes and colors. Such catalogs let astronomers test hypotheses about the origin of the universe, as well as about the nature of dark matter and dark energy.”
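To make the idea of combining prior knowledge with new data concrete, here is a minimal Python sketch of a Bayesian update. It is a toy illustration only, not Celeste's model: it assumes a single unknown brightness value with a Gaussian prior and Gaussian measurement noise of known variance, and the function name and all numbers are hypothetical.

```python
def gaussian_posterior(prior_mean, prior_var, observations, noise_var):
    """Conjugate Bayesian update for a Gaussian mean with known noise variance.

    Toy illustration only; this is not how Celeste itself is implemented.
    """
    n = len(observations)
    if n == 0:
        # No new data: the posterior is just the prior.
        return prior_mean, prior_var
    # Precisions (inverse variances) of the prior and of the data add up.
    precision = 1.0 / prior_var + n / noise_var
    post_var = 1.0 / precision
    # The posterior mean is a precision-weighted blend of prior and data.
    post_mean = post_var * (prior_mean / prior_var + sum(observations) / noise_var)
    return post_mean, post_var


# Hypothetical values: a prior belief from an earlier survey, then three new measurements.
prior_mean, prior_var = 10.0, 4.0
observations = [12.0, 11.0, 13.0]
noise_var = 1.0

post_mean, post_var = gaussian_posterior(prior_mean, prior_var, observations, noise_var)
print(f"posterior mean = {post_mean:.3f}, posterior variance = {post_var:.3f}")
```

In this sketch the posterior mean lands between the prior belief and the average of the new measurements, and the posterior variance is smaller than the prior variance, which is the sense in which a catalog entry can carry an uncertainty that shrinks as more data arrive.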
“It is exactly to enable such cutting-edge machine-learning algorithms on massive data that we designed the Julia language,” said Viral Shah, CEO of Julia Computing. “Researchers can now focus on problem solving rather than programming.”
NERSC provided the extensive computing resources the team needed to apply such a complex algorithm to so much data and assisted with many aspects of designing a program to run at scale, including load balancing and interprocess communication, Regier noted.
“Practically all the significant code that runs on supercomputers is written in C/C++ and Fortran, for good reason: efficiency is critically important,” said Pradeep Dubey, Intel Fellow and Director of the Parallel Computing Lab at Intel. “With Celeste, we are closer to bringing Julia into the conversation because we’ve demonstrated excellent efficiency using hybrid parallelism – not just processes, but threads as well – something that’s still impossible to do with Python or R.”
Alan Edelman, co-creator of the Julia language and professor of applied mathematics at MIT, said, “The JuliaLabs group at MIT is thrilled and impressed with this advancement in the use of Julia for High Performance Computing. The dream of ‘ease of use’ and (‘and’ not ‘or!’) ‘high performance’ is becoming a reality.”
The Celeste project is at the cutting edge of scientific big data analysis along multiple fronts, added Prabhat, NERSC Data and Analytics Services Group Lead and principal investigator for the MANTISSA project. “From a scientific perspective, it is one of the first codes that can conduct inference across multiple imaging surveys and create a unified catalog with uncertainties,” he said. “From a methods perspective, it is the first demonstration of large scale variational inference applied to hundreds of gigabytes of scientific data. From a software perspective, I believe it is one of the largest applications of the Julia language to a significant problem: we have integrated the DTree scheduler and utilized MPI-3 one-sided communication primitives.”
This implementation of Celeste also demonstrated good weak and strong scaling properties on 256 nodes of the Cori Phase I system, Prabhat added. The group’s next step will be to apply Celeste to the entire SDSS imaging dataset, followed by a joint SDSS + DECaLS analysis on Cori Phase II.
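For readers unfamiliar with the terms, strong scaling asks how much faster a fixed-size problem runs as nodes are added, while weak scaling asks whether run time stays flat when the problem size grows in step with the node count. The Python sketch below shows one common way to express both as efficiencies. The function names and the timings are hypothetical and chosen only for illustration; they are not measurements from Celeste or Cori.

```python
def strong_scaling_efficiency(t_base, n_base, t_n, n):
    """Efficiency for a fixed total problem size run on n nodes instead of n_base.

    1.0 means run time dropped in exact proportion to the added nodes.
    """
    return (t_base * n_base) / (t_n * n)


def weak_scaling_efficiency(t_base, t_n):
    """Efficiency when the work per node is held constant as nodes are added.

    1.0 means run time did not grow as the problem and the machine grew together.
    """
    return t_base / t_n


# Hypothetical timings in seconds (illustrative only, not Celeste results).
strong = strong_scaling_efficiency(t_base=1000.0, n_base=1, t_n=5.0, n=256)
weak = weak_scaling_efficiency(t_base=100.0, t_n=125.0)

print(f"strong scaling efficiency: {strong:.3f}")
print(f"weak scaling efficiency:   {weak:.3f}")
```

Values close to 1.0 on either measure are what is usually meant by good scaling.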