<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Parsing CSV files is CPU bound: a C++ test case</title>
	<atom:link href="http://lemire.me/blog/archives/2008/12/16/parsing-csv-files-is-cpu-bound-a-c-test-case/feed/" rel="self" type="application/rss+xml" />
	<link>http://lemire.me/blog/archives/2008/12/16/parsing-csv-files-is-cpu-bound-a-c-test-case/</link>
	<description>Computer Scientist and Open Scholar: Databases, Information Retrieval, Business Intelligence.</description>
	<lastBuildDate>Wed, 23 May 2012 19:07:50 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
	<item>
		<title>By: Daniel Lemire</title>
		<link>http://lemire.me/blog/archives/2008/12/16/parsing-csv-files-is-cpu-bound-a-c-test-case/comment-page-1/#comment-50357</link>
		<dc:creator>Daniel Lemire</dc:creator>
		<pubDate>Fri, 19 Dec 2008 02:46:21 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1634#comment-50357</guid>
		<description>&lt;i&gt;1. tokenize does heap allocation and copy *per value* via token string.&lt;/i&gt;

Heap allocation is avoided in the latest version.

&lt;i&gt;2. vector of strings also does an *extra* heap allocation and copy to copy construct *each element*.&lt;/i&gt;

I fixed this in the latest version.

&lt;i&gt;3. Shorter, completely I/O-bound code (for files larger than physical memory) can be written, using a custom allocator (~10 lines of reusable code) that performs only one amortized heap allocation and memcpy *per file*. Remember that you’re using C++ not Java :) The solution is left as an exercise for the readers :)&lt;/i&gt;


I wrote in my blog post:

«I do not claim that writing software where CSV parsing is strongly I/O bound is not possible, or even easy.»</description>
		<content:encoded><![CDATA[<p><i>1. tokenize does heap allocation and copy *per value* via token string.</i></p>
<p>Heap allocation is avoided in the latest version.</p>
<p><i>2. vector of strings also does an *extra* heap allocation and copy to copy construct *each element*.</i></p>
<p>I fixed this in the latest version.</p>
<p><i>3. Shorter, completely I/O-bound code (for files larger than physical memory) can be written, using a custom allocator (~10 lines of reusable code) that performs only one amortized heap allocation and memcpy *per file*. Remember that you’re using C++ not Java <img src='http://lemire.me/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  The solution is left as an exercise for the readers <img src='http://lemire.me/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </i></p>
<p>I wrote in my blog post:</p>
<p>«I do not claim that writing software where CSV parsing is strongly I/O bound is not possible, or even easy.»</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: vicaya</title>
		<link>http://lemire.me/blog/archives/2008/12/16/parsing-csv-files-is-cpu-bound-a-c-test-case/comment-page-1/#comment-50356</link>
		<dc:creator>vicaya</dc:creator>
		<pubDate>Fri, 19 Dec 2008 01:32:53 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1634#comment-50356</guid>
		<description>Major performance problems of the code:

0. getline does one heap allocation and copy for every line.

1. tokenize does heap allocation and copy *per value* via token string.

2. vector of strings also does an *extra* heap allocation and copy to copy construct *each element*.

3. Shorter, completely I/O-bound code (for files larger than physical memory) can be written, using a custom allocator (~10 lines of reusable code) that performs only one amortized heap allocation and memcpy *per file*. Remember that you&#039;re using C++ not Java :) The solution is left as an exercise for the readers :)</description>
		<content:encoded><![CDATA[<p>Major performance problems of the code:</p>
<p>0. getline does one heap allocation and copy for every line.</p>
<p>1. tokenize does heap allocation and copy *per value* via token string.</p>
<p>2. vector of strings also does an *extra* heap allocation and copy to copy construct *each element*.</p>
<p>3. Shorter, completely I/O-bound code (for files larger than physical memory) can be written, using a custom allocator (~10 lines of reusable code) that performs only one amortized heap allocation and memcpy *per file*. Remember that you&#8217;re using C++ not Java <img src='http://lemire.me/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  The solution is left as an exercise for the readers <img src='http://lemire.me/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Lemire</title>
		<link>http://lemire.me/blog/archives/2008/12/16/parsing-csv-files-is-cpu-bound-a-c-test-case/comment-page-1/#comment-50355</link>
		<dc:creator>Daniel Lemire</dc:creator>
		<pubDate>Thu, 18 Dec 2008 23:34:06 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1634#comment-50355</guid>
		<description>You are most certainly right if you think that memory allocation is the culprit. However, I specifically defined &quot;parsing&quot; as &quot;copying the fields into new arrays&quot;.

There is no question that if I just read the bytes and do nothing with them, the program is not going to end up being CPU bound, but that is hardly representative of a real application, is it? Copying the fields and storing them in some array appears to me to be a basic operation.</description>
		<content:encoded><![CDATA[<p>You are most certainly right if you think that memory allocation is the culprit. However, I specifically defined &#8220;parsing&#8221; as &#8220;copying the fields into new arrays&#8221;.</p>
<p>There is no question that if I just read the bytes and do nothing with them, the program is not going to end up being CPU bound, but that is hardly representative of a real application, is it? Copying the fields and storing them in some array appears to me to be a basic operation.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

