Learning and the Social Web: A Call for Papers (new deadline)

We have extended the deadline of our call for papers on Learning and the Social Web (http://socialwebjetwi.info/). It is a special issue for the Journal of Emerging Technologies in Web Intelligence (JETWI). You now have until December 20th to submit your papers.

tag cloud

Suggested topics:

  • Web 3.0
  • Ambient and Ubiquitous Learning
  • Telepresence
  • Web 2.0 and Social intelligence
  • Web Mining
  • Web Services and Semantic Web
  • Human-Web Interaction
  • Web Agents and Agent-based Systems
  • Knowledge Management, Networks, and Communities
  • Ontology Engineering
  • Personalization Techniques
  • Semantic Web
  • Web based Support Systems
  • Web Services, Services Discovery & Composition
  • Enterprise Mashup

Run-length encoding (part 2)

(This is a follow-up to my previous blog post, there is also a follow-up: part 3.)

Any run-length encoding requires you to store the number of repetitions. In my example, AAABBBBBZWWK becomes 3A-5B-1Z-2W-1K, we must store 5 counters (3,5,1,2,1) and 5 characters.

Storing counters using a fixed number of bits

Most programming languages represent integers using a fixed number of bits in binary format. For example, Java represents integers (int) using 32 bits. However, in my example (3A-5B-1Z-2W-1K) storing the counters using 32 bits and the characters using 8 bits means that I use 25 bytes which is more than twice the length of the original string (AAABBBBBZWWK).

Thus, we have a simple optimization problem: determine the best number of bits.  In practice, it might be better to store the data in a byte-aligned way. That is, you should be using 8, 16, 32 or 64 bits. Indeed, reading numbers represented using an arbitrary number of bits may involve a CPU processing overhead.

If you use too few bits, some long runs will have to count as several small runs. If you use too many bits, you are wasting storage. Unfortunately, determining on a case-by-case basis the best number of bits requires multiple scans of the data. It also implies added software complexity.

But you don’t have to use the binary format!

You can still use  a fixed number of bits for your counters, but with quantized codes instead of the binary format. For example, using 3 bits, you could only allow the counter values 1,2,16, 24, 32,128,256,1024. In this example, the sequence of bits 000 is interpreted as the value 1, the sequence of bits 001 as the value 2, the sequence 010 as 16, and so on. Determining the best codes implies that you must scan the data, compute the histogram of the counters, and then apply some optimization algorithm (such as dynamic programming). The decoding speed might be slight slower as you need to look-up the codes from a table.

Using variable-length counters for optimal compression

If you are willing to sacrifice coding and decoding speed, then you can represent the counters using universal codes. Thus, instead of using a fixed number of bits and optimizing the representation (as in the quantized coding idea), you seek an optimal variable-length representation of the counters. With this added freedom, you can be much more efficient.

The illustrate the idea behind variable-length codes, we consider Gamma codes: the code 1 is 1, the code 01 is 2, the code 001 is 3, the code 0001 is 4, and so on. Thus, we use x bits to represent the number x.

Fortunately, we can do much better than Gamma codes and represent the number x using roughly 2 log x bits with delta codes. Firstly, write x as x=2N +(x modulo 2N) where N is the logarithm. Then we can represent the number N-1 as a Gamma code using N-1 bits, and then store (x modulo 2N) in binary format (using N-1 bits). Thus, we can represent all numbers up to 2N-1 using 2N-2 bits.

Unfortunately, the current breed of microprocessors are not kind to variable-length representations so the added compression is at the expense decoding speed.

Continue with part 3.

References and further reading: Holloway et al., How to Barter Bits for Chronons, 2007. See also the slides of my recent talk Compressing column-oriented indexes.

Run-length encoding (part I)

(This is part 1, there is also a part 2 and a part 3.)

Run-length encoding (RLE) is probably the most important and fundamental string compression technique. Countless multimedia formats and protocols use one form or RLE compression or another.

RLE is also deceptively simple. It represents repeated values as a counter and a character. Thus, the string AAABBBBBZWWK becomes 3A-5B-1Z-2W-1K.

If that is all there was to RLE, then the wikipedia page on run-length encoding would be just fine. Yet, I think it needs help.

Why do we use RLE?

  • You can read and write RLE data in one pass, using almost no memory.
  • Given a vector V compressed with RLE, you can apply any scalar function f to its component in time O(|V |) where |V | is the compressed size of the vector.
  • Given two vectors V and V‘ compressed with RLE, you can do arithmetic (e.g. V+V‘) in time O(|V |+|V’|).

(Some RLE formats have worse complexity bounds.)

Any downsides to RLE?

  • Random access is slower. Sometimes, only sequential read (from the beginning) is possible. Updating an RLE-compressed array can be difficult.
  • You need long runs of identical values.
  • Some RLE formats negatively affect CPU vectorization. Thus, if the compression rates are modest, it could actually take longer to process an RLE-compressed array.

What is the RLE format?

There is no unique RLE format. How you use the RLE idea depends on your goals such as (1) maximize the compression rate (2) maximize the processing speed.

Here are some common variations:

  • Instead of using a counter for each run of characters, you only add a counter after a value has been repeated twice. For example, the string AAABBBBBZWWK becomes AA1-BB3-Z-WW-K. Thus, if many characters are not repeated, you will rarely use an unnecessary counter.
  • You can use a single bit to decide whether a counter is used. For example, the string AAABBBBZWWK becomes A-True-3, B-True-5, Z-False, W-True-2, K-False. Again, this may avoid many unnecessary counters if values are often not repeated.
  • Instead of a counter, you may store the location of the run in the array. For example, the string AAABBBBBZWWK becomes 1A-4B-9Z-10W-11K. The benefit of this approach is to allow random access in logarithmic time using binary search. However, it is also incompatible with some techniques to avoid unnecessary counters. So, it is a compression-speed trade-off. For even more speed, you can store both the location of the run and its length (thus avoiding a subtraction).
  • To help vectorization, you can group the characters into blocks of k characters. For example, using blocks of two characters, the string AAABBBBBZWWK becomes 1AA-1AB-2BB-1ZW-1WK. Again, this is a compression-speed trade-off because there will be fewer runs to compress after grouping the characters into long blocks.

Continue reading to part 2.

Some References (to my own work):

More database compression means more speed? Right?

Current practical database compression techniques stress speed over compression:

In a comment to my previous blog postRasmus Pagh asks more or less this question:

Given that we have more and more CPU cores—and generally more powerful CPUs—shouldn’t we compress the data more aggressively?

As the CPUs get more powerful, shouldn’t all database become I/O bound? That is, the main difficulty is to bring enough data into the CPUs?

Apparently not.

  • As we have more CPU cores, we also have more bandwidth to bring data to the the cores. Otherwise, CPU cores would be constantly data-starved in most multimedia and business applications.
  • Multicore CPUs are not the only game in town: RAM and storage have also been revolutionized. In 2009, it is not unpractical to run entirely database applications from RAM. After all, many business databases fit in 16 GB for RAM. And even when we do not have enough RAM, we have SSDs.
  • Lightweight compression techniques often favor vectorization which is also getting more and more important and powerful.

Hence, for most database applications fast decompression remains preferable to aggressive compression.

Which should you pick: a bitmap index or a B-tree?

Morteza Zaker sent me pointer to their work comparing bitmap indexes and B-trees in the Oracle database. They examine the folklore surrounding bitmap indexes—which are often thought to be mostly useful over low cardinality columns (columns having few distinct values, such as gender). Their conclusion is that the Bitmap index is the conclusive choice for data warehouse design for columns with high or low cardinality. Their claim is backed by experiments using columns with millions of distinct values. This confirms the observation made in my earlier post: The mythical bitmap index.

They make an interesting distinction between full cardinality columns, columns where each value appears only once, and high cardinality columns where at least a few values are often repeated (think about storing last names). A bitmap index is inadequate for a full cardinality column because there is no compression possible, and this is probably how the folklore around bitmap indexes came about. Yet, unlike transaction processing, data warehousing is usually not concerned with full cardinality columns.

Reference: Zaker et al., An Adequate Design for Large Data Warehouse Systems: Bitmap Index Versus B-Tree Index, IJCC 2 (2), 2008.

Further reading: Trading compression for speed with vectorization, To improve your indexes: sort your tables! and my research paper Sorting improves word-aligned bitmap indexes.

Procrastination can be your friend

Procrastination can be a serious problem leading to job loss, high anxiety and even significant psychological disability and dysfunction (according to wikipedia). To avoid excessive procrastination, most researchers grow a sense of professional urgency.

Most people rely on extrinsic pressures. In Computer Science—for example—we often organize our work around a few conferences with fixed submission deadlines. Others will compete with fellow researchers for funding or a specific research goal.

I prefer intrinsic motivation. I avoid fixed deadlines and competition. I choose problems that I want to solve, for myself. In this intrinsic mode, you find that you cannot get yourself to work on some problems. For example, you ran some experiments or derived some results, but you cannot begin or finish writing the paper. After all, there is no deadline, no competitor to beat, so why bother if you do not feel the urge? In some instances, you are merely lazy. But what happens if you make good progress in other, similar projects? Simply put, procrastination becomes an indicator for uninteresting problem.

Whenever I procrastinate too much, I start to question the importance of the work. Often, I fund that it is better to drop the work entirely.

Further reading: Structured Procrastination and Procrastination and Perfectionism by John Perry.

Next Page »

18 queries. 0.405 seconds. Valid XHTML

Powered by WordPress

Subscribe to this blog in a reader or by Email.