Here is the latest from Julia Computing
BL
G

# Finding Correlations in Data with Julia and NAG

20 Oct 2016 | Reid Atcheson and Andy Greenwell

The Numerical Algorithms Group (NAG) provides expertise in numerical engineering, by delivering high-quality computational software, consulting services and high performance computing services. For over four decades NAG have collaborated with world-leading researchers in academia and industry to create powerful, reliable and flexible software which is relied on by tens of thousands of individual users, as well as numerous independent software vendors. As a not-for-profit company, NAG reinvests surpluses into the research and development of its products, services, staff and its collaborations.

The NAG Library is the world’s largest collection of robust, documented, tested and maintained numerical and statistical algorithms. It is callable from many different programming languages and environments such as .NET, C++, Java®, MATLAB®, Python® and Visual Basic®. Some users of the Julia language already call the NAG Library into their applications. This blog demonstrates that the NAG Library is also callable directly from Julia, using one of the routines from the NAG Library as an example.

This blog post is also available on the NAG website.

# Introduction to Correlation Matrices

Correlations provide helpful information when working with possibly mutually related data sources, such as the value of stocks, options, precious metals or others. In the case of three or more data sources this produces a matrix known as a “correlation matrix” which has a special structure, shown in Julia code below:

Correlation (or, to give its full name, “Pearson’s product-moment correlation coefficient”), is a measure of the linear relationship between two random variables. One of the features that makes it particularly useful is that it is invariant to location-scale transformations of the variables, giving values on the interval [-1,1]: 1 being perfect positive linear relationship, 0 being no linear relationship (though, it should be emphasised, not necessarily independent), and -1 being a perfect negative relationship.

A correlation matrix is then the set of pairwise correlations between multiple random variables, where the (i,j)th element is the correlation between the ith and the jth variable. Mathematically, a correlation matrix must satsify the following properties:

• it must be symmetric,

• the diagonal elements must be 1, and

• it must be positive semidefinite, which is equivalent to all eigenvalues being non-negative.

## Estimating Correlation Matrices

Of course, we never get to observe the correlation matrix directly, instead it must be estimated from data. Given a matrix Y, where each row is a joint observation of the variables, the empirical correlation matrix can be computed using the cor function in Julia.

The resulting matrix should satisfy the three properties above, though some small deviations may arise due to numerical error in extreme cases:

The structure of correlations produces a matrix that should be symmetric, have unit diagonal and be positive semidefinite. In applications however it can happen that approximate correlations lead to accordingly approximate correlation matrices which are symmetric and have unit diagonal, but also have nontrivial negative eigenvalues, meaning it is not positive semidefinite.

Larger deviations from positive definiteness may arise when the correlations are computed from incomplete data: for example, two stock symbols may not have price data on the exact same days.

The available approaches to handling the situation of missing data can be broken down into two broad categories; model the missing values and use that model to generate a full set of data, hence guaranteeing a semi-definitive correlation matrix; or calculate an approximate correlation matrix from the available data and adjust it to ensure that it is definite.

The first of these approaches is beyond the scope of this article and we only deal with the second: calculating an approximate correlation matrix from existing data, adjusting if required. This approach has the advantage of being applicable when an approximate correlation matrix is indefinite for a reason other than missing data, for example, due to rounding during its calculation.

In the example given below, based on real financial data, it was not possible to calculate a full correlation matrix due to missing data. Pairwise correlations between the stocks were calculated using as much data as possible, which yielded a matrix with a negative eigenvalue.

It is useful therefore to correlate incomplete data and then clean the resulting correlation matrix by finding the “nearest” correlation matrix, and then use the nearest correlation matrix as a proxy for the original correlation matrix in workflows which require those properties.

# Nearest Correlation Matrix

To fix the problem of indefiniteness in an approximation correlation matrix, one can define an optimization problem which fixes it appropriately without reducing the original value of the true correlations. One could for example simply find the closest matrix in the Frobenius norm to an input matrix, but subject to the constraints that the output is a correlation matrix. In practice this may not be an acceptable solution, so there are alternative formulations which enforce important properties of the approximate correlation matrix input. NAG has implemented many of these different approaches and through a partnership with Julia Computing a Julia interface has been developed to access these nearest correlation routines in the NAG Library. The next section will first detail the routines and following this there is an explanation of how they can be accessed through Julia.

## NAG Routines

The first and simplest routine in the NAG Library for nearest correlation matrices solves an optimization problem and accepts an arbitrary matrix as input, this is g02aa, which solves the following optimization problem

where X is a valid correlation matrix. An algorithm devised in 1 uses alternating projection to solve this optimization problem, which has the virtue of being simple to implement, but tends to converge slowly. Newton’s algorithm can have rapid convergence and was investigated in 2 along with suitable preconditioners for the linear systems it yields. This algorithm has been implemented by NAG

Some common situations may make this simple approach less appealing however. For example suppose only a few data pairs have problematic missing data, then only a small sub-block of the approximate correlation matrix needs to be fixed and a likely-small change added to the rest of the matrix to preserve the properties it had to begin with. Thus targeting of a sub-block can be formulated as follows:

this was solved by Higham et al. in 3 and is implemented by NAG as routine g02an. A similar routine “g02ap” which will be released with the NAG Library Mark 26 can keep arbitrary entries of a matrix fixed, rather than a full sub-block, which affords the user more flexibility.

If fixing a sub-block or fixed entries of the input matrix does not model well the additional information a user may have, then one can also weight fixed rows and columns in the optimization, formulated as follows:

where W is an input diagonal matrix of weights. The NAG routine g02ab incorporates this weighted formulation.

Alternatively if weighting whole rows and columns is not adequate one can optionally weight only specified entries with the routine g02aj, (note this is a very expensive formulation).

Finally one may know that the output should have a k-factor structure, which means that the off-diagonal entries of the output can be low-rank approximated. The NAG Library also incorporates this additional information through the reformulation

This optimization problem is solved in the NAG routine g02ae and produces a nearest correlation matrix which respects k-factor structure specified by one.

In summary, the NAG Library contains the following nearest correlation routines:

• g02aa Basic problem using Newton’s method.
• g02ab Incorporates row/column weights.
• g02aj Incorporates entrywise weights.
• g02an Incorporates fixed submatrix.
• g02ap Incorporates fixed entries.
• g02ae Incorporates k-factor structure specified by user.

As a final remark it is worth noting that g02ab, g02aj, and g02ap can also optionally include a bound on the lowest eigenvalue. This might be desirable for example to clamp the smallest eigenvalue to zero rather than allowing a possible roundoff to be negative. Bounding the smallest eigenvalue can also help in enforcing a positive definite output, rather than only positive semidefinite.

What follows will detail a NAG partnership with Julia Computing which makes these routines much easier to utilize from a Julia environment.

# Julia Interface

Julia’s ccall function provides an easy to use, “no boilerplate,” interface for calling out to shared C and Fortran libraries, such as the NAG C Library and NAG Fortran Library. The ccall function allows for calling into compiled shared libraries directly without the need to write any “glue” code, perform low-level code generation to create function interfaces or even recompile a binary. Full documentation on the use of ccall can be found in the “Calling C and Fortran Code” section of the Julia Language documentation.

For this project, we developed a set of Julia wrappers to the above set of nearest correlation matrix routines. Development of these wrappers consisted of the following components:

• a Julia NAGdemo module containing all of the Julia code for the interface. The module includes all Julia type definitions, const values, function definitions and export definitions for the Julia/NAG interface.
• a constant string (LIBNAGC_NAG) pointing to the location of the NAG shared library distributed with a NAG C Library installation.
• a function nag_licence_query/nag_license_query that tests execution of the simple NAG routine a00acc as a method of performing proper license checkout for the NAG Library.
• translation of entries in the nag_types.h header file included with the NAG C Library into corresponding Julia objects.
• definition of proper error handling routines that can capture, store and print any errors in Julia that occur within the NAG C Library routines.
• definition of Julia wrapper functions to NAG’s nearest correlation matrix routines.

Below we will discuss each of these items in turn.

## Defining a NAGdemo module

In this post we define a simple NAGdemo module, included in full at the bottom of the page. NAGdemo is a Julia module much like any other Julia module, providing a standard mechanism for the organization of Julia code, as well as namespace resolution for all functions defined as part of the module when called from external Julia code. The list of exported functions in the module enumerate those functions that do not need to be called with full namespace resolution when the NAGdemo module is loaded into a Julia session via a using NAGdemo statement.

## Defining a constant for the location of the NAG shared library

The constant string LIBNAGC_NAG contains the location of the NAG shared library which is a required argument for all uses of ccall that will call into the NAG C Library from Julia.

The nag_licence_query \ nag_license_query function provides our first, and simplest, example of calling into the NAG C Library from Julia using ccall.

This particular use of ccall takes 3 arguments:

• a tuple (:a00acc, LIBNAGC_NAG) containing a symbol for the name of the exported function to be accessed, :a00acc, and the const string, LIBNAGC_NAG, pointing to the location of the NAG shared library. The NAG routine :a00acc outputs information about the version of the NAG C Library in use.
• the output type, Cint, of the NAG C routine being called. In the case of :a00acc, the output type is a C integer. The routine will return 0 if a licence is acquired and the function executes successfully, or 1 if the routine was unsuccessful.
• a tuple of the types for all input arguments, In this particular case, :a00acc does not take any input arguments, so this tuple argument is empty.

## Translating the NAG header file

The header file nag_types.h is a large file containing many C enums, typedefs and pre-processor macros. As the contents of this header file are widely used throughout the NAG C Library when referring to the input and output types of most functions, the definitions included in this header, or at least the atomic types and values that are represented by these definitions, need to be present in Julia when wrapping routines via ccall.

While there are packages within the Julia ecosystem, such as Clang.jl, that can assist in automation of wrapping C libraries in certain situations, a manual translation of the necessary portions of nag_types.h file into corresponding Julia code is not difficult, and is sufficient for this exercise. Each of the enum entries within nag_types.h are fundamentally integer constant values and this Nearest Correlation Matrix demonstration only requires one of those enum values:

NagInt is an alias for an appropriately sized integer value on Windows or Unix platforms, as needed for use of the NAG C Library in each operating system environment for appropriate 32-bit or 64-bit flavours as necessary.

## Handling errors from NAG routines

The following block of Julia code defines a NagError type that inherits from Julia’s abstract Exception type, as well as various functions for handling and storing any errors returned by any NAG routine as Julia exceptions:

The call to reset_nag_error() on load of the NAGdemo module sets our custom error_handler routine to collect and handle any errors thrown by NAG routines, as well as throw those errors as Julia exceptions. The last_nag_error() function resets the stored list of NagError values.

## Creating Julia wrappers

For each of the NAG routines for which a Julia function API is to be created we start with inspection of the appropriate NAG function signature to determine the datatypes of all arguments.

With a ccall invocation defined we can subsequently make a determination of which arguments to a NAG routine can be considered as purely input arguments, which arguments could be considered as purely outputs and which arguments could be considered both input and output arguments. This classification can help in the design of a user-facing Julia API.

Pure input arguments are those arguments which are passed to the NAG routine but are not modified in any way. Pure output arguments are those arguments for which memory is allocated prior to calling the NAG routine, but for which the initial values in that memory location have no importance the memory is only needed for returning an output value. Arguments that are both inputs and outputs are those arguments for which the initial value is both used by the NAG routine and modified before being returned.

Understanding the context of each argument in the NAG C API also helps to determine which of the C arguments might be able to have default values assigned via Julia keyword arguments, which of the C arguments might be able to infer their values from other input arguments at the Julia level and which of the C arguments might not need to be present at all within a user-facing Julia API.

For each of the nearest correlation routines considered in this exercise, the routine returns an output of type Void.

For the g02aac routine, its C API is the following:

Constructing a ccall command for this API signature requires mapping the type of each input argument from C type to the corresponding Julia data type. The following table provides that mapping:

argument C data type Julia data type
order Nag_OrderType NagInt
g double[] Ptr{Float64}
pdg Integer NagInt
n Integer NagInt
errtol double Float64
maxits Integer NagInt
maxit Integer NagInt
x double[] Ptr{Float64}
pdx Integer NagInt
iter Integer* Ptr{NagInt}
feval Integer* Ptr{NagInt}
nrmgrd double* Ptr{Float64}
fail NagError* Ptr{Void}

With each of these variables defined in the current Julia scope a pure ccall into NAG’s g02aac routine would look like the following block of code:

As stated previously, for the design of our initial Julia API, we need to make a few decisions about which of the arguments to ccall should be Julia inputs and which values should be returned as outputs.

In this proof of concept project, we made the following decisions for a first level Julia wrapper to g02aac named nag_nearest_correlation!:

• the scalar arguments being passed to ccall by value (e.g. order, pdg, n, errtol, maxits, maxit, pdx) will be arguments to nag_nearest_correlation!
• all array arguments (e.g. g and x) will be arguments to nag_nearest_correlation!. Argument g is actually a pure input argument, as its value is not modified, while argument x could be considered a pure output argument as its memory is only used to store the output correlation matrix. We will return to this point below.
• the scalar arguments being passed by reference into the ccall (e.g. iter, feval, nrmgrd) are being used to track and report internal behaviour of the NAG routine. These pointers to scalar values will be created within the Julia routine and their values returned as Julia integer or floating point values as needed.

The initial implementation of the nag_nearest_correlation! Julia function subsequently took the following form:

Using the same data defined above, we can call this routine as follows:

Now if we decide that we would like to provide an interface that does not require one to manage the memory allocation for x (and also only provides the nearest correlation matrix x as output), we can define the following method nag_nearest_correlation that allocates the memory for x internally and then calls nag_nearest_correlation!.

But the values we have assigned for errtol, maxits and maxit are essentially default values for these arguments. Consequently, we can assign these values in keyword arguments in our nag_nearest_correlation function signature:

Furthermore, the values for arguments pdg, n and pdx can be deduced from the size of the input array g and can also be removed from the function signature for nag_nearest_correlation as follows:

Calling this method now involves only the following code:

Having seen the entire process for construction of a user-friendly wrapper for the nag_nearest_correlation function below are the full contents of our NAGdemo module:

## Summary

By using Julia’s features for calling external libraries we have provided easy interfaces into the NAG nearest correlation matrix routines. This example shows that the numerical computing capabilities of Julia can be easily extended to incorporate pre-existing industry grade libraries. For users of Julia, NAG integration allows the possibility of including widely utilized, specialized libraries developed outside of the Julia ecosystem, and for users of the NAG Library it means their existing workflows depending on NAG need not be disrupted by a switch to Julia.

In summary: the NAG Library, which is the worlds’ largest collection of robust and commercially maintained numerical and statistical routines and Julia, which is a high-productivity, high-performance programming language is a winning combination.

If you are interested in a more extensive integration of Julia and the NAG Library then please contact Julia Computing at [email protected] and NAG at [email protected].

Java® is a registered trademark of Oracle Corporation

MATLAB® is a registered trademark of The MathWorks, Inc.

Python® is a registered trademark of the Python Software Foundation

Visual Basic® is a registered trademark of Microsoft Corporation

## References

1. N. J. Higham. “Computing the nearest correlation matrix - A problem from finance”. IMA J. Numer. Anal., 22(3):329–343, 2002. doi: 10.1093/imanum/22.3.329

2. R. Borsdorf and N. J. Higham. “A preconditioned (Newton) algorithm for the nearest correlation matrix”. IMA J. of Numer. Anal., 30(1):94–107, 2010. doi: 10.1093/imanum/drn085

3. N. J. Higham, N. Strabić, and V. Šego, “Restoring definiteness via shrinking, with an application to correlation matrices with a fixed block”. SIAM Rev., 58(2), 245–263, 2016. doi:10.1137/140996112