The very first task in any data analysis workflow is simply reading the
data, and this absolutely must be done quickly and efficiently so the
more interesting work can begin. Across many industries and domains, the
CSV file format is king for storing and sharing tabular data. Loading
CSVs fast and robustly is crucial, and it must scale well across a wide
variety of file sizes, data types, and shapes. This post compares the
performance for reading 8 different real-world datasets across three
different CSV parsers: R’s fread, Pandas’ read_csv, and Julia’s CSV.jl.
Each was chosen as the “best in class” CSV parser in R, Python, and
Julia, respectively.
All three tools have robust support for loading a wide variety of data
types with potentially missing values, but only fread (R) and CSV.jl
(Julia) support multithreading; Pandas only supports single-threaded CSV
loading. Julia’s CSV.jl is further unique in that it is the only tool
that is fully implemented in its higher-level language rather than being
implemented in C and wrapped from R / Python. (Pandas does have a
slightly more capable Python-native parser, but it is significantly
slower, and nearly all uses of read_csv default to the C engine.) As
such, the CSV.jl benchmarks here not only represent the speed of loading
data in Julia, but are also indicative of the sort of performance that’s
possible in subsequent Julia code.
The following benchmarks show that Julia’s CSV.jl is 1.5 to 5 times
faster than Pandas even on a single core; with multithreading enabled,
it is as fast as or faster than R’s fread. Timings were collected with a
dedicated benchmarking tool in each language, including timeit for
Python.
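To make the Python side of the measurement concrete, here is a minimal sketch of timing a CSV load with timeit. The helper names (`load_with_csv_module`, `best_load_time`, `write_sample_csv`) are hypothetical, the standard-library csv module stands in for a real parser such as read_csv so the sketch runs without third-party packages, and taking the best of several runs is an assumption rather than the statistic this post necessarily reports.

```python
import csv
import os
import tempfile
import timeit


def load_with_csv_module(path):
    """Read every row of the CSV at `path` into a list of lists."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


def best_load_time(loader, path, repeat=5):
    """Return the fastest of `repeat` single-run timings, in seconds."""
    timings = timeit.repeat(lambda: loader(path), number=1, repeat=repeat)
    return min(timings)


def write_sample_csv(path, rows=1000, cols=20):
    """Write a small float-valued CSV to time against (illustrative data only)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([f"col{i}" for i in range(cols)])
        for r in range(rows):
            writer.writerow([r + i / 10 for i in range(cols)])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.csv")
        write_sample_csv(sample)
        seconds = best_load_time(load_with_csv_module, sample)
        print(f"best of 5: {seconds * 1000:.2f} ms")
```

Any callable that takes a path can be passed as `loader`, so the same harness could time a different parser on the same file.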
Let’s start with some homogeneous datasets, i.e., datasets which have
the same kind of data in all columns. The datasets in this section,
apart from the stock price dataset, are derived from this benchmark
site. The performance metric
is the time taken to load a dataset as the number of threads is
increased from 1 to 20. Since Pandas does not support multithreading,
its single-threaded speed is reported for all thread counts.
Performance on Homogeneous Datasets:
Uniform Float dataset: The first dataset contains float values
arranged in 1 million rows and 20 columns. Pandas takes 232 milliseconds
to load this file. Single-threaded data.table is 1.6 times faster than
CSV.jl. With multithreading, CSV.jl is at its best, more than double the
speed of data.table. CSV.jl is 1.5 times faster than Pandas without
multithreading, and about 11 times faster with it.
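As a quick worked example, the ratios above can be turned back into approximate load times. This sketch assumes "N times faster" means the baseline time divided by N; the function name `implied_time_ms` is hypothetical, and the results are implied estimates, not measurements from this post.

```python
def implied_time_ms(baseline_ms, speedup):
    """Time implied for a tool that is `speedup` times faster than the baseline."""
    return baseline_ms / speedup


PANDAS_MS = 232  # Pandas load time reported for the uniform float dataset

single = implied_time_ms(PANDAS_MS, 1.5)  # CSV.jl on one thread
multi = implied_time_ms(PANDAS_MS, 11)    # CSV.jl with multithreading

print(f"CSV.jl single-threaded: ~{single:.0f} ms")  # ~155 ms
print(f"CSV.jl multithreaded:   ~{multi:.0f} ms")   # ~21 ms
```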
Uniform String dataset (I): This dataset contains string values in
all columns and has 1 million rows and 20 columns. Pandas takes 546
milliseconds to load the file. With R, adding threads doesn’t seem to
lead to any performance gain. Single threaded CSV.jl is 2.5 times faster
than data.table. At 10 threads, it is about 14 times faster than
Uniform String dataset (II): The dimensions of this dataset are the
same as those of the one above. However, every column has missing values
as well. Pandas takes 300 milliseconds. Without threading, CSV.jl is 1.2
times faster than R, and with threading, it is about 5 times faster.
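For readers who want a file of a similar shape to experiment with, here is a minimal sketch that writes a CSV of random strings with some empty cells. It is a synthetic stand-in, not the dataset used in this post; the function name `write_string_csv`, the 8-character strings, and the `missing_rate` parameter are all assumptions made for illustration.

```python
import csv
import random
import string


def write_string_csv(path, rows, cols, missing_rate=0.1, seed=0):
    """Write a rows x cols CSV of random strings; some cells are left empty.

    Returns the number of empty (missing) cells written.
    """
    rng = random.Random(seed)  # seeded so the file is reproducible
    missing = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([f"col{i}" for i in range(cols)])
        for _ in range(rows):
            record = []
            for _ in range(cols):
                if rng.random() < missing_rate:
                    record.append("")  # empty field stands in for a missing value
                    missing += 1
                else:
                    record.append("".join(rng.choices(string.ascii_lowercase, k=8)))
            writer.writerow(record)
    return missing
```

Calling `write_string_csv("strings.csv", rows=1_000_000, cols=20)` would match the dimensions described above; a smaller `rows` value is enough for a quick trial.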
Apple stock prices:
This dataset contains 50 million rows and 5 columns, and is 2.5GB. Each
row holds the open, high, low, and close prices for AAPL stock. The four
columns with prices are float values, and there is a date column.
Single-threaded CSV.jl is about 1.5 times faster than R’s fread from
data.table. With multithreading, CSV.jl is about 22 times faster! Pandas’
read_csv takes 34 seconds to read the file, which is slower than both R
and Julia.
Mixed dataset: This dataset has 10k rows and 200 columns. The
columns contain String, Float, DateTime, and missing values. Pandas
takes about 400 milliseconds to load this dataset. Without threading,
CSV.jl is 2 times faster than R, and is about 10 times faster with 10
threads.
Mortgage risk dataset:
Now, let’s look at a wider dataset. This mortgage risk dataset
from Kaggle is a mixed-type dataset, with 356k rows and 2190 columns.
The columns are heterogeneous and have values of types String, Int,
Float, Missing. Pandas takes 119 seconds to read in this dataset.
Single-threaded fread is about twice as fast as CSV.jl. However, with
more threads, Julia is as fast as or slightly faster than R.
Wide dataset: This is a considerably wider dataset with 1000 rows
and 20,000 columns. The dataset contains string and Int values. Pandas
takes 7.3 seconds to read the dataset. In this case, single-threaded
data.table is about 5 times faster than CSV.jl. With more threads,
CSV.jl is competitive with data.table. Increasing the number of threads
doesn’t seem to result in any performance gain in the case of data.table.
Fannie Mae Acquisition dataset: This dataset can be downloaded from
the Fannie Mae site.
The dataset has 4 million rows and 25 columns, with values of types Int,
String, Float, Missing.
Single-threaded data.table is 1.25 times faster than CSV.jl. However,
the performance of CSV.jl keeps improving with more threads: CSV.jl gets
about 4 times faster with multithreading.
Across all eight datasets, Julia’s CSV.jl is always faster than Pandas,
and with multi-threading it is competitive with R’s data.table.
System Info: The specs of the system on which the benchmarking was
performed are listed below.
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.4 LTS
$ uname -a
Linux antarctic 5.6.0-custom+ #1 SMP Mon Apr 6 00:47:33 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
$ lscpu
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model name: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
CPU MHz: 800.225
CPU max MHz: 3000.0000
CPU min MHz: 800.0000
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 14080K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
$ free -h
              total        used        free      shared  buff/cache   available
Mem:            62G        3.3G        6.3G        352K         52G         58G
Swap:           59G        3.2G         56G
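The CPU figures above can be combined to recover the machine's core layout, which helps put the 1 to 20 thread range in context. This is a small arithmetic sketch; the function name `topology` is hypothetical, and the link between the physical core count and the thread range used in the benchmarks is an observation, not something the post states.

```python
def topology(online_cpus, threads_per_core, cores_per_socket):
    """Derive physical core and socket counts from lscpu-style figures."""
    physical_cores = online_cpus // threads_per_core
    sockets = physical_cores // cores_per_socket
    return physical_cores, sockets


# Values read from the listing above: CPUs 0-39 online, 2 threads per core,
# 10 cores per socket.
cores, sockets = topology(online_cpus=40, threads_per_core=2, cores_per_socket=10)
print(f"{cores} physical cores across {sockets} sockets")
```

The result, 20 physical cores across 2 sockets, is consistent with the 2 NUMA nodes reported and lines up with the maximum of 20 threads used in the benchmarks.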
© 2016 - 2020 Julia Computing, Inc. All Rights Reserved.