(These results were updated.)

In Parsing text files is CPU bound, I claimed that I had a C++ test case proving that parsing CSV files could be CPU bound. By CPU bound, I mean that the overhead of taking each line, finding out where the commas are, and storing the copies of the fields into an array, dominates the running time.

How do I test this theory? I read the file twice. Once, I just read each line and report the time elapsed. Then, I read each line and process them and report the time elapsed. If the two times are similar, the process is I/O bound, if the second time is much larger, the process is CPU bound.

I get this result on a 2 GB file (numbers updated on Dec. 19, 2008):


$ ./parsecsv ./netflix.csv
without parsing: 26.55
with parsing: 95.99

Hence, parsing dominates the running time. At least in this case. At least with my C++ code.

Before you start arguing with me, please go download my reproducible test case. All you need is the GNU GCC compiler. I tested out two machines, with two different versions of GCC.

Disclaimer: I do not claim that this is professional benchmarking. I do not claim that writing software where CSV parsing is strong I/O is not possible, or even easy.

Reference: This quest started out from a post by Matt Casters where he reported that you could parse a CSV file faster using two CPU cores instead of just one.

3 Comments »

  1. You are most certainly right if you think that memory allocation is guilty. However, I specifically defined “parsing” as “copying the fields into new arrays”.

    There is no question that if I just read the bytes and do nothing with them, it is not going to end up being CPU bound, but that is not very realistic of a real application, is it? Copying the fields and storing them into some array appears to me to me a basic operation.

    Comment by Daniel Lemire — 18/12/2008 @ 18:34

  2. Major performance problems of the code:

    0. getline does one heap allocation and copy for every line.

    1. tokenize does heap allocation and copy *per value* via token string.

    2. vector of strings also do an *extra* heap allocation and copy to copy construct *each element*.

    3. Shorter and complete I/O bound (for > then physical memory files) code can be written, using a custom allocator (~10 lines reusable code) that does only amortized 1 heap allocation and memcpy *per file*. Remember that you’re using C++ not Java :) The solution is left as an exercise for the readers :)

    Comment by vicaya — 18/12/2008 @ 20:32

  3. 1. tokenize does heap allocation and copy *per value* via token string.

    Heap allocation is avoided in latest version.

    2. vector of strings also do an *extra* heap allocation and copy to copy construct *each element*.

    Fixed this in latest version.

    3. Shorter and complete I/O bound (for > then physical memory files) code can be written, using a custom allocator (~10 lines reusable code) that does only amortized 1 heap allocation and memcpy *per file*. Remember that you’re using C++ not Java :) The solution is left as an exercise for the readers :)

    I wrote in my blog post:

    «I do not claim that writing software where CSV parsing is strong I/O is not possible, or even easy.»

    Comment by Daniel Lemire — 18/12/2008 @ 21:46

Leave a comment

Warning: When entering a long comment, please ensure that you make copy of your text prior to submitting it. If the server should fail or if you hit a bug, you might lose your work. I am not responsible for your lost effort.

To spammers: I carefully review every single post and make sure that spam gets deleted. You are wasting your time if you are manually entering spam using this form. Read my terms of use to see what I consider to be abusive.

Example: duo plus septem is '9'. The numbers are expressed in latin numerals but you should give your answers using ordinary digits.

 

« Blog's main page

Powered by WordPress