Where to get your ebooks?

If you read my blog, you probably like to read in general. Thus, if you don’t own an ebook device, you will soon. The choice is growing: the Amazon Kindle, the Sony Reader, the Apple iPad,… I bought a kindle because my wife won’t let me fill the house with books. And I hate to throw away perfectly good paper books.

Amazon has most of the market for now. Yet, using the kindle store—on the kindle—is painful. Moreover, Amazon ebooks are protected by Digital Right Management (DRM). Amazon sells you crippled ebooks that can stop working if you copy them too often. There are often better alternatives elsewhere.

And, in Canada, there is a two-dollar surcharge for every wireless download using the Kindle. Since most ebooks are 0.5MB or less, the wireless costs 4$ per megabyte! This is insulting! Moreover, if you buy a book by mistake—which is annoying common—Amazon will reimburse the cost of the book itself, but not the fee for the wireless download.

Thankfully, you can grab books compatible with the kindle (in Mobipocket format) elsewhere. Then you can drop the file on the kindle using the USB port.

  • You can get nearly 2000 of the great French classic for free on ebookgratuits. This include a large fraction of the work of Honoré de Balzac.
  • Project Gutenberg offers 30,000 free e-books in various languages (mostly English).
  • WebScription sells DRM-free ebooks in various format. Most books fall into the scifi, young adults and fantasy genres.

I am currently reading You’re Not Fooling Anyone When You Take Your Laptop to a Coffee Shop by Scalzi. I bought it at WebScription for six dollars. It is a compilation of Scalzi’s blog posts on his life as a writer. I am fascinated by how much it ressembles my own life. Well… Except for the fact that I don’t get paid when I publish a paper. Maybe I should put together a compilation of posts about my silly work life. Would anyone buy it for six dollars?

I am also reading Halting State by Stross which I bought on Amazon for ten dollars. I haven’t yet gotten into the mood of the novel.

Further reading:

Actual programming with HTML and CSS (without javascript)

I usually stick with academic or research issues, but today, I wanted to have some fun. Geek fun.

While W3C describes Cascading Style Sheets (CSS) as a mechanism for adding style (e.g. fonts, colors, spacing) to Web documents, it is also a bona fide programming language. In fact, it is one of the most widespread declarative languages.

But how powerful is CSS? It is not Turing complete, so there is no hope of programming full applications.  However, CSS may be surprisingly powerful.

Suppose you are given a plain HTML table and you want to add counters and colors to it. Starting with this table:

<table>
<tr><th>City</th><th>Color</th></tr>
<tr><td>Montreal</td><td>Red</td></tr>
<tr><td>Toronto</td><td>Blue</td></tr>
<tr><td>Vancouver</td><td>Yellow</td></tr>
</table>

We want to get the following result using a pure CSS approach:

We need counters and a way to color the second line differently.

Solution: Add the following lines to your CSS style sheet:

tr{counter-increment: mycounter}
table {counter-reset: mycounter -1}
td:first-child:before {content: counter(mycounter)". " }
tr:nth-child(2n+2) td {background-color: #ccc;}
tr:nth-child(2n+3) td {background-color: #ddd;}

My best blog posts (2009)

As year 2009 comes to an end, I selected a few of my best blog posts.

Database, compression and column stores:

Hashing (contrarian blog posts):

Academia and Research:

Entropy-efficient Computing

Microprocessors and storage devices are subject to the second law of thermodynamics: using them turn usable energy (oil, hydrogen) into unusable energy (heat). Data centers are already limited by their power usage and heat production. Moreover, many new devices need to operate for a long time with little power: (1) smart phones (2) powerful computing devices inserted into our bodies (3) robots shipped in space.

Our approach to entropic efficiency remains crude. We improve power supply. We shut down disks and CPUs when they are idle. Deeper questions arise however:

  • Except maybe for embarrassingly parallel problems (such as serving web pages), parallel computing trades entropic efficiency for short running times. If entropic efficiency is your goal, will you stay away from non-trivial parallelism?
  • We analyze algorithms by considering their running time. For example, shuffling an array takes time O(n)  whereas sorting it takes time O(n log n). Yet, unlike sorting, I expect that shuffling an array should be possible without any entropy cost (heat generation)!

Suppose I give you a problem, and I ask you to solve it using as little entropy as possible. How would you go about it?

Further reading:

Run-length encoding (part 3)

In Run-length encoding (part 1), I presented the various run-length encoding formats. In part 2, I discussed the coding of the counters. In this third part, I want to discuss the ordering of the elements.

Indeed, the compression efficiency of run-length encoding depends on the ordering. For example, the sequence aaabbb is far more compressible than the sequence ababab.

Reordering sequences

Text, images, and sound are stored on disk as strings. Intuitively, we may think that strings cannot be reordered without losing information. But this intuition is wrong!

You can reorder strings without losing any information, as long as the reordering is invertible. The Burrows–Wheeler transform—invented in 1994—is an invertible transform which tends to generate streams of repeated characters. It is used by the open source bzip2 software. This invertible reordering trick works so well that bzip2 is “within 10% to 15% of the best available techniques, whilst being around twice as fast at compression and six times faster at decompression.”

Ordering sets

Sometimes we want to compress sets of elements. For example, the ordering of the rows in a relational database is semantically irrelevant. Similarly, in a phone directory, we are not concerned about the order of the entries. Of course, we are trained to expect entries to be sorted lexicographically starting from the last names:

Last name First name
Abeck John
Becket Patricia
Smith John
Tucker Patricia

Such sequences of tuples can be compressed column-wise: compress each column with run-length encoding. Reordering the rows so as to maximize compression is NP-hard by reduction from the Hamiltonian path problem. Moreover, it can be reduced to an instance of the traveling salesman problem.

Fortunately, a simple and efficient heuristic is available. Reorder the columns in increasing cardinality: put the columns having the fewest number of distinct values first. Next, sort the table lexicographically.

Applying this heuristic, our previous table becomes:

First name Last name
John Abeck
John Smith
Patricia Becket
Patricia Tucker

Further reading: Daniel Lemire, Owen Kaser, Reordering Columns for Smaller Indexes (working paper).

Run-length encoding (part 2)

(This is a follow-up to my previous blog post.)

Any run-length encoding requires you to store the number of repetitions. In my example, AAABBBBBZWWK becomes 3A-5B-1Z-2W-1K, we must store 5 counters (3,5,1,2,1) and 5 characters.

Storing counters using a fixed number of bits

Most programming languages represent integers using a fixed number of bits in binary format. For example, Java represents integers (int) using 32 bits. However, in my example (3A-5B-1Z-2W-1K) storing the counters using 32 bits and the characters using 8 bits means that I use 25 bytes which is more than twice the length of the original string (AAABBBBBZWWK).

Thus, we have a simple optimization problem: determine the best number of bits.  In practice, it might be better to store the data in a byte-aligned way. That is, you should be using 8, 16, 32 or 64 bits. Indeed, reading numbers represented using an arbitrary number of bits may involve a CPU processing overhead.

If you use too few bits, some long runs will have to count as several small runs. If you use too many bits, you are wasting storage. Unfortunately, determining on a case-by-case basis the best number of bits requires multiple scans of the data. It also implies added software complexity.

But you don’t have to use the binary format!

You can still use  a fixed number of bits for your counters, but with quantized codes instead of the binary format. For example, using 3 bits, you could only allow the counter values 1,2,16, 24, 32,128,256,1024. In this example, the sequence of bits 000 is interpreted as the value 1, the sequence of bits 001 as the value 2, the sequence 010 as 16, and so on. Determining the best codes implies that you must scan the data, compute the histogram of the counters, and then apply some optimization algorithm (such as dynamic programming). The decoding speed might be slight slower as you need to look-up the codes from a table.

Using variable-length counters for optimal compression

If you are willing to sacrifice coding and decoding speed, then you can represent the counters using universal codes. Thus, instead of using a fixed number of bits and optimizing the representation (as in the quantized coding idea), you seek an optimal variable-length representation of the counters. With this added freedom, you can be much more efficient.

The illustrate the idea behind variable-length codes, we consider Gamma codes: the code 1 is 1, the code 01 is 2, the code 001 is 3, the code 0001 is 4, and so on. Thus, we use x bits to represent the number x.

Fortunately, we can do much better than Gamma codes and represent the number x using roughly 2 log x bits with delta codes. Firstly, write x as x=2N +(x modulo 2N) where N is the logarithm. Then we can represent the number N-1 as a Gamma code using N-1 bits, and then store (x modulo 2N) in binary format (using N-1 bits). Thus, we can represent all numbers up to 2N-1 using 2N-2 bits.

Unfortunately, the current breed of microprocessors are not kind to variable-length representations so the added compression is at the expense decoding speed.

References and further reading: Holloway et al., How to Barter Bits for Chronons, 2007. See also the slides of my recent talk Compressing column-oriented indexes.

Next Page »

18 queries. 0.453 seconds. Valid XHTML

Powered by WordPress

Subscribe to this blog in a reader or by Email.