
Analyzing LendingClub Data Using JuliaDB

22 August 2017 | Venkat Subramaniam and Shashi Gowda

JuliaDB is our take on a package for writing succinct, expressive and fast data processing pipelines. It includes tools to load data from CSV files, index them, iterate over row or column subsets, perform relational queries and save intermediate results.
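To give a flavor up front, a minimal pipeline might look like the sketch below. Both loadfiles and filter are used in earnest later in this post; the "Fully Paid" status value is illustrative.

using JuliaDB

# load every CSV file under "data" into one table
t = loadfiles("data")

# keep only the fully repaid loans, using an ordinary Julia predicate
repaid = filter(r -> r.loan_status == "Fully Paid", t)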

In this post, we will work with a particular dataset published by LendingClub. Our goal is to train a model that classifies loans as good or bad according to their likelihood of being repaid in a timely manner. The post also serves as a case study of using JuliaDB in a non-trivial application; it owes its choice of dataset and methodology to a blog post by Microsoft's R Server Tiger team.

The data set

This example will use datasets provided by LendingClub, the world’s largest online marketplace for connecting borrowers and investors. On their website, they provide publicly available, detailed datasets that contain anonymous data regarding all loans that have been issued through their system, including the current loan status and latest payment information.

The first step, then, is to download some files from this page. For the purposes of this post, we used the files covering 2007 through 2016 (8 files in total).

Place the files in a folder named data.

The files contain lines with copyright info and comments; these can be removed using this neat sed command, courtesy of Doug Bates:

cd data/
sed -i -n -e '/^\"/p' *.csv

The code below is to be executed from the same directory that contains the data directory.

Loading the data

To load the data in parallel, we first add some worker processes:

addprocs(4)

will add 4 worker processes.
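One detail worth noting (our addition, not from the original post): issuing using JuliaDB after addprocs makes the package available on every worker, so a minimal session looks like this:

addprocs(4)     # 4 worker processes
using JuliaDB   # loaded after addprocs, so all workers see it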

Then a call to loadfiles will load the files into a single table:

julia> @time data = loadfiles("data", escapechar='"', usecache=false);

Metadata for 0 / 8 files can be loaded from cache.
Reading 8 csv files totalling 995.147 MiB...
 57.671541 seconds (475.41 k allocations: 28.874 MiB, 0.03% gc time)

JuliaDB figures out the correct types of the columns and loads the table in a completely type-stable way. You can inspect the element type of the table by running eltype(data); we will spare you the bookkeeping-heavy eyesore by not showing the output here.

Feature selection

To assess which features to use in our prediction model, we can see density plots of numeric quantities in the dataset for good loans versus bad ones. Let's start by partitioning the data into good loans and bad loans:

const bad_statuses = ("Late (16-30 days)","Late (31-120 days)","Default","Charged Off")
good_loans = filter(x->!(x.loan_status in bad_statuses), data)
bad_loans = filter(x->x.loan_status in bad_statuses, data);
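As a quick sanity check, one can compare the sizes of the two partitions (a sketch; the counts depend on which files you downloaded):

# how did the rows split between good and bad?
println(length(good_loans), " good loans, ", length(bad_loans), " bad loans")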

One can use the table's eltype to introspect the field types and extract only the numeric fields.

fields = fieldnames(eltype(data))
ftypes = eltype(data).parameters

numeric_cols = [fields[i] for i = 1:length(fields) if (ftypes[i] <: Number ||
            ftypes[i] <: Nullable{<:Number}) && !(fields[i] in [:id, :member_id, :dti_joint])];

Here, we filtered the field names based on the type of each field, excluding a few unwanted columns such as the IDs. The filtered list of names is in numeric_cols:

86-element Array{Symbol,1}:
 :loan_amnt
 :funded_amnt
 :funded_amnt_inv
 ⋮
 :total_bal_ex_mort
 :total_bc_limit
 :total_il_high_credit_limit

Plotting using Gadfly

using Gadfly
import NullableArrays: dropnull

# select just the numeric columns from each partition
# (values(t, cols) is used the same way in input_matrix below)
good_numbers = values(good_loans, numeric_cols)
bad_numbers = values(bad_loans, numeric_cols)

# density plot for bad and good loans, one per numeric column
plots = Gadfly.Plot[]

for (name, g, b) in zip(numeric_cols, columns(good_numbers), columns(bad_numbers))
    g = dropnull(g)
    b = dropnull(b)
    p = plot(layer(x=g, Geom.density, Theme(default_color=colorant"green")),
             layer(x=b, Geom.density, Theme(default_color=colorant"red")),
              Guide.title(string(name)), Guide.ylabel("density"))
    push!(plots, p)
end

Next we render and lay out the plots in a 9x10 grid:

grid = reshape(vcat(render.(plots), fill(Compose.context(), 90-length(plots))), 9,10);
draw(SVG(24inch, 24inch), gridstack(grid))

Looking at these plots, we can pick out the features whose distributions differ visibly between good and bad loans; those make promising predictors.

Training

To validate our model, we need to segregate the data into a training dataset and a testing dataset. We will use 75% and 25% of the data, randomly chosen, as these respective subsets:

perm = randperm(length(data))
train_till = round(Int, length(data) * 3/4)

training_subidx = sort!(perm[1:train_till])
testing_subidx = sort!(perm[train_till+1:end]);

training_subset = data[training_subidx]
testing_subset = data[testing_subidx]
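If you want the split to be reproducible across runs, you can seed the global RNG before calling randperm (our addition; srand is the seeding function in this version of Julia):

srand(42)   # any fixed seed works; 42 is illustrative
perm = randperm(length(data))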

In order to make sure that our analysis is as close as possible to that conducted by Microsoft, we'll select the same set of predictor variables that they did:

numeric_features = [:revol_util, :int_rate, :mths_since_last_record,
                    :annual_inc_joint, :total_rec_prncp, :all_util]

Further, we need to create a vector of labels marking each loan as good or bad. As in the filters above, we test the loan_status field:

labels = collect(map(x -> x.loan_status in bad_statuses, data))

Constructing the feature matrix

import NullableArrays: dropnull

# positions of the rows in which every field is non-null
findnonnulls(xs::Columns) = find(x -> !any(map(isnull, x)), xs)

function input_matrix(table, fields)
    float_features = collect(values(table, fields))
    tmp = findnonnulls(float_features)      # positions where all fields are non-null
    nzidxs = collect(keys(table, 1)[tmp])   # corresponding row indices in the table
    # feature matrix (one column per field) and the matching label vector
    reduce(hcat, map(dropnull, columns(float_features[tmp]))), labels[nzidxs]
end

The predictive model

The random forest model requires two arrays: a vector of labels and the corresponding feature matrix. The input_matrix function above builds both: it extracts the columns for our selected features from the distributed JuliaDB table, gathers them to the master process, drops the rows with null fields, concatenates the remaining column vectors into the feature matrix, and picks out the matching subset of the labels vector.

Having done this, we can now call the build_forest function from the DecisionTree.jl package.

using DecisionTree

training_matrix, train_labels = input_matrix(training_subset, numeric_features)

# build_forest(labels, features, n_subfeatures, n_trees, partial_sampling, max_depth)
model = build_forest(train_labels, training_matrix, 3, 10, 0.8, 6)

#     0.512169 seconds (7.67 k allocations: 440.766 KiB)
#   Out[98]:
#   Ensemble of Decision Trees
#   Trees:      10
#   Avg Leaves: 41.8
#   Avg Depth:  8.0

Should we want to reuse the model at a later time, we can serialize it to disk.

f = open("  loanmodel.jls", "w")
serialize(f, model)
close(f)
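To reuse the model in a later session, read it back with deserialize, the counterpart of serialize in Base:

f = open("loanmodel.jls")
model = deserialize(f)
close(f)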

We can now test our model on the rest of the data. To do this, we generate predictions by applying the apply_forest function to every row of the test feature matrix with mapslices.

features_test, test_labels = input_matrix(testing_subset, numeric_features)

@time predictions = mapslices(features_test, 2) do fs
    DecisionTree.apply_forest(model, fs)
end;

# =>  0.234474 seconds (36.23 k allocations: 1.964 MiB)

With our set of predictions, we construct a ROC curve using the ROC.jl package and calculate the area under the curve to find a single measure of how predictive our trained model is on the dataset.

# Receiver Operating Characteristics curve
using ROC
curve = roc(convert(Vector{Float64}, predictions[:]), convert(BitArray{1}, test_labels))

# An ROC plot in Gadfly with data calculated using ROC.jl

plot(layer(x=curve.FPR, y=curve.TPR, Geom.line),
     layer(x = linspace(0.0,1.0,101), y = linspace(0.0,1.0,101),
       Geom.line, Theme(default_color=colorant"red")), Guide.title("ROC"),
       Guide.xlabel("False Positive Rate"),Guide.ylabel("True Positive Rate"))

The plot shows the ROC curve together with the diagonal reference line drawn in red.

The area under the curve is:

# Area Under Curve
AUC(curve)

# => 0.7291976225854385

We have shown how to use JuliaDB to create a model predicting the quality of a loan, combining data access and sophisticated analysis in a single environment.
