How to speed up data import in R

Have you ever wondered what is the fastest way to load binary R files (e.g., .RData) into R? Well, me neither, but now the need arose. Here are some benchmarks of different approaches.

Let’s first generate some test data:

N <- 4600000
m <- data.frame(matrix(ncol=150, nrow=N, data=0))

Why exactly 4,600,000 rows, you might wonder? Long story short, I have a real-world dataset of about the same dimensions.

It does not really matter whether the data frame is filled with zeros or random decimal numbers, because it takes the same amount of memory in both cases (although the compressed file sizes can differ vastly). The data frame takes more than 5 GB of memory (5.14 GiB, i.e. 5.52 GB):

object.size(m)
# 5520016192 bytes
 
library(pryr)
object_size(m)
# 5.52 GB

Now that we have the data, how long does it take to write it to disk? Here we will test six different ways to write data: the out-of-the-box 1) write.table() and 2) save() or saveRDS(), and the packages 3) data.table, 4) feather, 5) fst and 6) readr. Let's load the libraries:

library(feather)
library(fst)
library(data.table)
library(readr)

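All of these packages are available from CRAN; if they are not installed yet, something like the following one-liner should do (a convenience note only, not part of the benchmark itself):

install.packages(c("pryr", "feather", "fst", "data.table", "readr"))
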
Function system.time() measures the time used for a computation, in this case either reading or writing the data to disk. It prints three different time elements (in seconds): user time is the CPU time used by the R process itself, system time is the CPU time used by the operating system on behalf of the process, and elapsed time is the real (wall-clock) time from the start of the call to its completion. From the user's point of view the most important measure is the elapsed time, so the benchmark comparisons below are based on it.

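The timings reported below were each repeated three times and averaged (see the notes after the results). A minimal sketch of one way to collect such replicated timings, assuming the expression to be timed is wrapped in a function; the helper name time_elapsed() is mine, not something used in the benchmarks:

time_elapsed <- function(f, reps = 3) {
  # Run f() 'reps' times and return the mean of the elapsed times
  mean(replicate(reps, system.time(f())["elapsed"]))
}

# Example: average elapsed time of three uncompressed save() calls
# time_elapsed(function() save(m, file = "m4.RData", compress = FALSE))
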
The timings for writing the data are as follows:

 
> system.time(write.table(m, "m.txt", col.names=T, row.names=F, sep=";", quote=F))
   user  system elapsed 
 809.78    5.34  816.52 
 
> system.time(fwrite(m, "md.txt", col.names=T, row.names=F, sep=";", quote=F))
   user  system elapsed 
  14.22    2.19    4.41 
 
> system.time(save(m, file="m1.RData"))
   user  system elapsed 
  60.70    0.51   61.29 
 
> system.time(save(m, file="m2.RData", compress="bzip2"))
   user  system elapsed 
  70.02    0.02   70.05 
 
> system.time(save(m, file="m3.RData", compress="xz"))
   user  system elapsed 
 384.50    0.25  386.17 
 
> system.time(save(m, file="m4.RData", compress=F))
   user  system elapsed 
   6.92    3.99   26.28 
 
> system.time(saveRDS(m, file="m5.RData", compress=F))
   user  system elapsed 
   8.20    3.75   51.83 
 
> system.time(write_feather(m, "m.ftr"))
   user  system elapsed 
   2.62    3.40   56.17 
 
> system.time(write.fst(m, "m.fst"))
   user  system elapsed 
   0.10    5.89   42.95 
 
> system.time(write_delim(m, "m.rdr", delim = ";"))
   user  system elapsed 
  82.59    3.61   87.87

And similarly the timings for reading the data are as follows:

 
> system.time(m<-read.table("m.txt", header=T, sep=";"))
   user  system elapsed 
 200.25   16.05  223.50 
 
> system.time(m<-fread("md.txt", header=T, sep=";"))
   user  system elapsed 
  23.83    1.09   25.06 
 
> system.time(load(file="m1.RData"))
   user  system elapsed 
  20.45    1.21   21.84 
 
> system.time(load(file="m2.RData"))
   user  system elapsed 
  23.89    3.38   27.34 
 
> system.time(load(file="m3.RData"))
   user  system elapsed 
  10.95    1.28   12.25 
 
> system.time(load(file="m4.RData"))
   user  system elapsed 
   7.89    3.44   16.89 
 
> system.time(readRDS(file="m5.RData"))
   user  system elapsed 
   6.86    3.21   17.93 
 
> system.time(read_feather("m.ftr"))
   user  system elapsed 
   1.39   13.47   99.28 
 
> system.time(m<-read.fst("m.fst"))
   user  system elapsed 
   0.06    3.84   13.69 
 
> system.time(m<-read_csv2("m.txt"))
   user  system elapsed 
  33.22    2.59   38.71

The timings were repeated three times, and the values above are averages of the replicates. I have an SSD with a read speed of about 500 MB/s and a write speed of about 300 MB/s. Thus you would expect the theoretical minimum for writing this example file to be about 17.1 seconds, and for reading about 10.3 seconds. True enough, the results seem to (mostly) obey this physical restriction. One notable exception is fwrite(), which appears to have taken only 4.41 seconds, but that is because fwrite() was using four threads for the task (4*4.41 s = 17.64 s). You might also notice that its user time, 14.22 seconds, is larger than the elapsed time; that too is due to the multiple threads.

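For reference, the back-of-the-envelope arithmetic behind those limits, using the roughly 5.14 GB in-memory size as a stand-in for the uncompressed file size:

size_mb <- 5.14 * 1000   # approximate uncompressed size in megabytes
size_mb / 300            # minimum write time at 300 MB/s: ~17.1 s
size_mb / 500            # minimum read time at 500 MB/s:  ~10.3 s
4 * 4.41                 # fwrite() elapsed time times 4 threads: ~17.6 s
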
Based on the benchmarks, the fastest way to write the data is either fwrite() from the data.table package or save() without compression from base R. The fastest reading time was for the xz-compressed .RData file; the runners-up were fst and the uncompressed .RData (read with the base-R load()). So, if consistently fast reading and writing of a data frame is desired, the save() and load() functions are actually pretty decent choices. Just remember not to use compression when saving the file to disk. The downside is that the uncompressed data files take up a lot of storage space. The smallest total reading plus writing time was achieved with fread() and fwrite(); their CSV file size also sits roughly midway between the uncompressed and compressed formats, taking about 1.28 GB.

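If you want to verify the file sizes yourself, they can be compared directly from R; a small sketch using the file names written in the benchmarks above:

files <- c("m.txt", "md.txt", "m1.RData", "m2.RData", "m3.RData",
           "m4.RData", "m5.RData", "m.ftr", "m.fst", "m.rdr")
round(file.size(files) / 1024^3, 2)   # file sizes on disk, in GiB
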
How do other languages for data analysis perform? I also tested Python and Julia, both with the CSV file only. Python is faster than base R, taking only 72.7 seconds compared to R's 223.5 seconds. The following script was run in an IPython shell (Python 3.6.1) using the '%cpaste' magic:

import pandas as pd
import time
start_time = time.clock()
m = pd.read_csv('m.txt', sep=';')
stop_time = time.clock()
# start_time 4.2666630257808847e-07
# stop_time 72.65627378664637

stop_time - start_time
# Out[7]: 72.65627335998006

There are other ways to make data IO faster in Python, too. For example, the CSV could be parsed just once and then stored as an HDF5 file; reading an HDF5 file is much faster than reading a CSV file. As an alternative, NumPy's loadtxt might outperform read_csv from pandas.

Julia was outperformed by Python but not by base R's read.table(). The following script was run at the Julia command prompt (version 0.5.2.2):

using CSV
input_file = "c:/users/lenovo/desktop/m.txt"
start_time = now()
data_frame = CSV.read(input_file, delim=";")
stop_time = now()
#2017-06-18T21:14:29.621
#2017-06-18T21:17:44.695
stop_time - start_time
#160751 milliseconds

Julia took about 160.8 seconds to import the CSV file into a data frame structure.

All in all, R does not do so badly compared to Python or Julia, because there is a large variety of packages to choose from, each with its own pros and cons. Personally, I like to use as little storage space as possible, but time-critical applications might call for other solutions.

