Parsing CSV files is CPU bound: a C++ test case
In Parsing text files is CPU bound, I claimed that I had a C++ test case proving that parsing CSV files could be CPU bound. By CPU bound, I mean that the overhead of taking each line, finding out where the commas are, and storing the copies of the fields into an array, dominates the running time.
How do I test this theory? I read the file twice. Once, I just read each line and report the time elapsed. Then, I read each line and process them and report the time elapsed. If the two times are similar, the process is I/O bound, if the second time is much larger, the process is CPU bound.
I get this result on a 2 GB file (numbers updated on Dec. 19, 2008):
$ ./parsecsv ./netflix.csv
without parsing: 26.55
with parsing: 95.99
Hence, parsing dominates the running time. At least in this case. At least with my C++ code.
Before you start arguing with me, please go download my reproducible test case. All you need is the GNU GCC compiler. I tested out two machines, with two different versions of GCC.
Disclaimer: I do not claim that this is professional benchmarking. I do not claim that writing software where CSV parsing is strong I/O is not possible, or even easy.
Reference: This quest started out from a post by Matt Casters where he reported that you could parse a CSV file faster using two CPU cores instead of just one.
Montreal, Canada 
Facebook
Friendfeed
LinkedIn
SlideShare
Delicious
You are most certainly right if you think that memory allocation is guilty. However, I specifically defined “parsing” as “copying the fields into new arrays”.
There is no question that if I just read the bytes and do nothing with them, it is not going to end up being CPU bound, but that is not very realistic of a real application, is it? Copying the fields and storing them into some array appears to me to me a basic operation.
Comment by Daniel Lemire — 18/12/2008 @ 18:34
Major performance problems of the code:
0. getline does one heap allocation and copy for every line.
1. tokenize does heap allocation and copy *per value* via token string.
2. vector of strings also do an *extra* heap allocation and copy to copy construct *each element*.
3. Shorter and complete I/O bound (for > then physical memory files) code can be written, using a custom allocator (~10 lines reusable code) that does only amortized 1 heap allocation and memcpy *per file*. Remember that you’re using C++ not Java
The solution is left as an exercise for the readers
Comment by vicaya — 18/12/2008 @ 20:32
1. tokenize does heap allocation and copy *per value* via token string.
Heap allocation is avoided in latest version.
2. vector of strings also do an *extra* heap allocation and copy to copy construct *each element*.
Fixed this in latest version.
3. Shorter and complete I/O bound (for > then physical memory files) code can be written, using a custom allocator (~10 lines reusable code) that does only amortized 1 heap allocation and memcpy *per file*. Remember that you’re using C++ not Java
The solution is left as an exercise for the readers
I wrote in my blog post:
«I do not claim that writing software where CSV parsing is strong I/O is not possible, or even easy.»
Comment by Daniel Lemire — 18/12/2008 @ 21:46