Need FoR Speed EP1

Replace readLines with data.table::fread(sep=NULL) for reading any text file.

Imaduddin Haetami http://artidata.io/blog/about.html
05-21-2019

Preamble

I recently encountered a 16GB CSV file. However, the file does not have the same number of commas "," on each line, because whoever exported the data did not account for the commas that can occur inside the address columns. The usual methods of reading CSV files in R, such as utils::read.csv, readr::read_csv, and data.table::fread with default settings, either throw an error or produce an incomplete read. Hence, I had no choice but to use readLines and fix the errors myself.
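
To make the breakage concrete, here is a minimal, hypothetical sketch (the file name, columns, and values are made up for illustration): an unquoted address field containing commas makes one row appear to have more fields than the header, while reading whole lines with sep=NULL sidesteps the parsing entirely.


writeLines(c("id,address,city",
             "1,12 Main Road,Singapore",
             "2,Blk 5, #03-21, Clementi,Singapore"),  # extra commas inside the address
           "broken.csv")

try(read.csv("broken.csv"))                   # typically errors or misaligns the fields
data.table::fread("broken.csv", sep = NULL, header = FALSE)  # one intact line per row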

Unfortunately, I had to run readLines several times: several files had such errors, and sometimes I overwrote an object I had already read. Although I work on my office HPC with a 16-core Intel Xeon Gold 6144 processor, a single readLines pass still takes me around 10 minutes. So slow~

Problem Replication

Let me replicate the problem and objectively compare the performance. Our data looks very similar to the one below:


# generate 5 million strings, each 100 random characters drawn from letters and ","
len <- 5000000
set.seed(240193)
char <- sapply(1:len,
               function(x)
                 paste(sample(c(LETTERS, ","), 100, replace = TRUE),
                       collapse = ""))

The first few elements are:


head(char)

[1] "WAG,UUXAPZOYOBQHFSHBNHKLYYZMOQFOTPJNS,WIDYDTFDXBGZLHWQWFHFHM,BSYENLVK,QIDFSAPMHXRVWWKYOUCELQOJGEEQYY"
[2] "ZDUIBXQCTSZBNIX,THOQMYNHTXOOHNBRSQOYLXWTWDKCNMOI,YSGUWGQBEQOZB,KQFFVPMNRE,XFFYHURKMYVGDLKLMGHRLDBOBU"
[3] "MOXICB,YXCUGPGDUWCZVCXWVHAUGKOLPRALVNQXBAYENZWNNRT,LVEHXVZ,ZRLOATCDPOODTO,ENWJWEXECCQOXGVHNQBUQ,VJSP"
[4] "DZOQIFWDWSKOOW,CASGYCAJEKYFGSDFRGZRZJCOHNMMOQERLESUDB,IHUAAACYWGGVH,,AXBHKMJJULXZNSHXFZAUXGZF,CZHCJI"
[5] "PMEHVZNYCOOIEJUVTDDIOIYLUBVTO,QLORLXYWIUTUMJNZBFYZV,JIOGLLMDLGJYNLFXYRNLNYQU,HTVXWGTPJATESUWGV,YMMTQ"
[6] "AXCTBUCKDLBOFOVPAAOVLYKEEOXRI,FRNRPTDYBLTMVMPJSXNNIFBCZAZPKWRANSMVDHITXOB,QPKGYZNTWRPJX,EKIKHVCLDOXV"

This is a character vector with 5 × 10^6 elements, each consisting of 100 random letters and commas. Then, we write it to the hard drive as a text file:


writeLines(char,"char.txt")

The size of the written file in bytes (each of the 5,000,000 lines takes 101 bytes: 100 characters plus a newline):


file.size("char.txt")

[1] 5.05e+08

Objective Comparison

Now, we compare the two techniques:


(t1 <- system.time(ReadLines <- readLines("char.txt")))

   user  system elapsed 
  7.689   0.172   7.861 

library(data.table)
setDTthreads(2)
(t2 <- system.time(Fread <- fread("char.txt", sep = NULL, header = FALSE)[, V1]))

   user  system elapsed 
  4.164   0.072   2.881 

Checking equality of the 2 objects:


identical(ReadLines,Fread)

[1] TRUE

Hence, the second method is about 2.73 times as fast as the first. In other words, you can save about 63.4% of your time by adopting the second method.
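
For reference, both figures can be reproduced directly from the elapsed times stored in t1 and t2 above:


t1["elapsed"] / t2["elapsed"]              # ~2.73 times as fast
(1 - t2["elapsed"] / t1["elapsed"]) * 100  # ~63.4% of the time saved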

I believe the reason for such speed is the built-in parallelization of data.table::fread.
It manages to utilize multiple cores of my CPU:


getDTthreads()

[1] 2

Here is further reading on the parallelization of data.table::fread. The efficiently written algorithm behind fread is also mentioned as a reason for its speed. In my experience, adding cores generally increases the reading speed of fread. However, your hard drive's read speed may become the bottleneck and slow down the process.
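
If you want to check the effect of the thread count on your own machine, a rough sketch could look like the following (the thread counts are just examples; results depend on your CPU and disk):


library(data.table)
for (n in c(1, 2, 4)) {
  setDTthreads(n)                                     # limit fread to n threads
  t <- system.time(fread("char.txt", sep = NULL, header = FALSE))
  cat(n, "thread(s): elapsed", round(t["elapsed"], 3), "seconds\n")
}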

Thank you for reading!

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Haetami (2019, May 21). artidata: Need FoR Speed EP1. Retrieved from http://artidata.io/blog/posts/2019-03-26-need-for-speed-ep1/

BibTeX citation

@misc{haetami2019need,
  author = {Haetami, Imaduddin},
  title = {artidata: Need FoR Speed EP1},
  url = {http://artidata.io/blog/posts/2019-03-26-need-for-speed-ep1/},
  year = {2019}
}