External-memory shuffling in linear time?
You can sort large files while using little memory. The Unix sort tool is a widely available implementation of this idea. Files are written to disk sequentially, without random access. Thus, you can also sort variable-length records, such as lines of text.
What about shuffling? Using the Fisher-Yates algorithm also known as Knuth algorithm, you can shuffle large files while using almost no memory. But you need random access to your files. Thus it is not applicable to variable-length records. And indeed, the Unix sort command cannot shuffle. (It has a random-sort option, but it is not a shuffle. Meanwhile, the shuf command runs in RAM.)
A solution: Tag each record with a random number. Pick random numbers from a very large set so that the probability that any two lines have the same random number is small. Then use external-memory sorting. You can implement something similar as a single line in Unix.
A better solution? Shuffling is possible in linear time O(n). Sorting is a harder problem (in O(n log n)). Thus, using a sort algorithm for shuffling—as we just did—is inelegant. Can we shuffle in linear time without random access with variable-length records?
Maybe we could try something concrete? Consider this algorithm:
- Read the original file in small blocks. Shuffle each block in RAM. Write them to temporary files. View each shuffled block as a stack of records.
- Select a non-empty block at random. Pick and remove the record on top of the stack. Append it to the result set. Repeat. (The correct probability assignment for each block is the number of records left in the block divided by the total number of records left.)
(As a variation on this algorithm, you can merge the blocks two-by-two.)
Unfortunately, I doubt this algorithm can run in linear time.
Your challenge: Consider variable-length records. Prove or disprove that we can implement an external-memory shuffle in linear time. Alternatively, come up with an algorithm faster than the sorting-based one.
Update: Preston L. Bannister proposed an algorithm which solves the problem to my satisfaction. The same algorithm was described by P. Sanders in Random Permutations on Distributed, External and Hierarchical Memory (Information Processing Letters, 1998).
Reference: This is a follow-up to my blog post External-Memory Shuffles?
Montreal, Canada 
Facebook
Friendfeed
LinkedIn
SlideShare
Delicious