External-memory shuffling in linear time?

You can sort large files while using little memory. The Unix sort tool is a widely available implementation of this idea. Files are written to disk sequentially, without random access. Thus, you can also sort variable-length records, such as lines of text.

What about shuffling? Using the Fisher-Yates algorithm also known as Knuth algorithm, you can shuffle large files while using almost no memory. But you need random access to your files. Thus it is not applicable to variable-length records. And indeed, the Unix sort command cannot shuffle. (It has a random-sort option, but it is not a shuffle. Meanwhile, the shuf command runs in RAM.)

A solution: Tag each record with a random number. Pick random numbers from a very large set so that the probability that any two lines have the same random number is small. Then use external-memory sorting. You can implement something similar as a single line in Unix.

A better solution? Shuffling is possible in linear time O(n). Sorting is a harder problem (in O(n log n)). Thus, using a sort algorithm for shuffling—as we just did—is inelegant. Can we shuffle in linear time without random access with variable-length records?

Maybe we could try something concrete? Consider this algorithm:

  • Read the original file in small blocks. Shuffle each block in RAM. Write them to temporary files. View each shuffled block as a stack of records.
  • Select a non-empty block at random. Pick and remove the record on top of the stack. Append it to the result set. Repeat. (The correct probability assignment for each block is the number of records left in the block divided by the total number of records left.)

(As a variation on this algorithm, you can merge the blocks two-by-two.)

Unfortunately, I doubt this algorithm can run in linear time.

Your challenge: Consider variable-length records. Prove or disprove that we can implement an external-memory shuffle in linear time. Alternatively, come up with an algorithm faster than the sorting-based one.

Update: Preston L. Bannister proposed an algorithm which solves the problem to my satisfaction. The same algorithm was described by P. Sanders in Random Permutations on Distributed, External and Hierarchical Memory (Information Processing Letters, 1998).

Reference: This is a follow-up to my blog post External-Memory Shuffles?

Which is fastest: integer addition or XOR?

The bitwise exclusive or (e.g., 1110 XOR 1001 = 0111) looks simpler to compute than integer addition (e.g., 2 + 9 = 11). Some research articles claim that XOR is faster. It appears to be Computer Science folklore. But is it true?

Which line runs faster? (The symbol “^” is the XOR.)

for(int k = 0; k < N; ++k) sum+= k;

for(int k = 0; k < N; ++k) sum^= k;

My result: In C++ and Java, both run at the same speed (within 1%).

Disclaimer: I’d be delighted if you could prove me wrong. Please provide Java or C++ source code.

Language, Mathematics and Programming

Even if you have extensive training in Mathematics, the average Mathematics paper is undistinguishable from the ramblings of a madman. Many of these papers seek to solve narrow problems. And yet, we respect Mathematicians.

Software programming is a form of communication, usually between human beings and machines. While different in style, programming is a subset of the language of Mathematics. If you dig into the average source code, it is undistinguishable from ramblings, even if you are an expert developer.

Yet, we denigrate programming. Many will even deny that it is a Mathematical language. But Mathematics and Programming are not so different:

Mathematics Programming
Building on the previous research papers requires you to dig through endless piles of boring, badly written research papers. Maintaining millions of lines of codes written by various people over the years is difficult, boring, error-prone.
Inventing new theorems or new mathematical theories requires much creativity. Coming up with the next best iPhone application requires much creativity.
For most people, mastering even part of Mathematics requires a decade or more. Please read Teach yourself programming in ten years by Peter Norvig.
The language of Mathematics has directly contributed to technological progress. Electricity, engines, nuclear power, space travel all required extensive use of Mathematics. Google changed the world through the brilliance of its software engineers. The open source revolution has changed how people think about collaboration.
Some Mathematicians are widely recognized as being extremely smart. Some famous people have done a fair share of difficult and technical programming : Donald Knuth and TeX, Tim Berners-Lee and the Web, Linus Tovarlds and Linux.

Why is programming getting so little respect?

  • The intense commercialization of programming has commoditized it. As Paul Graham might say : painters where initially “portrait takers”. It is only when painting lost its commercial function that it became recognized as a noble art. However, just like painters always used their free time to create great art, the best programmers are open sourcing beautiful code all the time.
  • The study  of programming itself remains rather informal. You can get degrees in Computer Science, Computing Engineering or Software Engineering, but there is no degree in Programming. Programming is taught in universities, but generally only in the first few courses of a degree. Yet, there are degrees in Communication, Fine Art, Architecture, Music or Dance. While a degree in Computer Science or Software Engineer can make you a better programmer, the fact remains that your professors are not expert practitioners.

How can we fix this? I have this secret dream of setting up the equivalent of “Creative Writing” program, but for programmers. Call it “Creative Programming”. Basically, students would come together to write great code. Yes, such code might be useful commercially, but that would be a secondary consideration. The pursuit of greatness would be the only goal that matters. It would treat programming as a bona fide language. It would attract the best programmers as guest lecturers. Would this ever work out? I do not know.

I am sure that many will point out that my secret dream is impractical. Beauty should not come first : we want cheap, reliable, maintainable code. We also want programmers to be replaceable, inexpensive and practical. However, human beings can both pursue greatness while being practical. Compromise is possible.

Let me conclude by quoting Donald Knuth:

(…) computer programming is an art, because it applies accumulated knowledge to the world, because it requires skill and ingenuity, and especially because it produces objects of beauty. A programmer who subconsciously views himself as an artist will enjoy what he does and will do it better.

Further reading: The best software developers are great at Mathematics? and Is programming “technical”?

Who the heck got Universities into the email business?

My current employer, UQAM, refuses to allow email forwarding. Students would rather forward their emails to their existing GMail accounts, for example. And the IT Department (the SITEL) agrees that it would have several benefits. However, they refuse to allow it for the following reasons:

  • Email forwarding may create infinite email loops. These may disrupt services and require human intervention.
  • Invalid or failing remote servers may saturate the local servers as they are unable to forward the emails.
  • Professors and management send confidential information by email. Yet, without full control of the email service, the University cannot ensure the needed confidentiality.
  • With email forwarding, it may be impossible to ensure and prove that an email was received and read. Thus, homework assignments, administrative inquiries or security advisories may never reach the students, or we may be unable to prove that they reach the students because of email forwarding.
  • As a Canadian University, email forwarding puts us at risk that the emails may transit on American servers, where the Canadian law on privacy is not applicable.
  • Email forwarding may put students at risk if remote accounts are stolen or lost.

Can you help me debunk or mitigate these arguments? I know that some of these arguments are bogus, but I am looking for solid references. (Not that I expect to change their mind.)

A larger issue: shouldn’t universities stick with research and teaching? I understand that we must have networks, cables, computers, firewalls, but do we need to provide our students with email services?

Update: Turns out that our IT people encourage students who want forwarding to GMail (say) to use the POP3 protocol. It is unclear to me how email forwarding can be a dangerous practice whereas POP3 “forwarding” can be safe.

Is programming “technical”?

According to student evaluations, most of my students appreciate short programming assignments. Yet, every year, some students think that programming is below them or unimportant.

Maybe I should start my courses with this theorem:

Theorem. If you understand an idea, you can implement it in software.

There is no denying that programming requires a lot of technical knowledge. Most programmers do technical jobs, involving testing, building or refactoring code. But programming is ultimately a communication form. And it is as noble as Mathematics or English. Let us compare:

  • Writers are considered sexy and non-technical people. Yet, grammar and spelling are technical. Moreover, most writers earn a living by writing ads for boring products. Some of them make a living with grand novels, but fewer than you think.
  • Physicists are great thinkers. Yet, their mathematical derivations are often mind-numbing and technical. Many physicists spend years running extremely technical experiments. And when they don’t, they program extremely complex (and technical) simulations.

For some reason, being a writer is somehow considered more prestigious than being a programmer. If you ask me, Linus Torvalds is every bit as cool J. K. Rowling. And I’d rather have a lunch date with Linus.

Most common questions about recommender systems…

I get ten to fifteen questions a week on recommender systems from entrepreneurs and engineers. Sometimes, I help people find their way in the literature. On occasion—for a consulting fee—I get my hands dirty and evaluate, design or code specific algorithms.  But mostly, I answer the same questions again and again:

1. How much data do I need?

Given your data, you can use cross-validation or A/B testing to measure objectively the effectiveness of a recommender system.

2. We have this system in place, how do we know whether it is sane?

See previous question.

3. My online recommender system is slow!

Laziness is your friend: don’t recompute the recommendations each time you have new data.

4. My customers don’t like the recommendations!

  • Keep expectations in check: recommending products is difficult and even human beings have trouble doing it,
  • Explain the recommendations: nobody trusts a black box,
  • Allow your users to freely explore your data and products in convenient and exciting ways.

5. Which algorithm is best?

You should start with simple algorithms: it worked well enough for Amazon. To do better, a mix of different algorithms is probably best. You can combine them using ensemble methods.

Next Page »

18 queries. 0.359 seconds. Valid XHTML

Powered by WordPress

Subscribe to this blog in a reader or by Email.