You can sort large files while using little memory. The Unix sort tool is a widely available implementation of this idea. Files are written to disk sequentially, without random access. Thus, you can also sort variable-length records, such as lines of text.

What about shuffling? Using the Fisher-Yates algorithm also known as Knuth algorithm, you can shuffle large files while using almost no memory. But you need random access to your files. Thus it is not applicable to variable-length records. And indeed, the Unix sort command cannot shuffle. (It has a random-sort option, but it is not a shuffle. Meanwhile, the shuf command runs in RAM.)

A solution: Tag each record with a random number. Pick random numbers from a very large set so that the probability that any two lines have the same random number is small. Then use external-memory sorting. You can implement something similar as a single line in Unix.

A better solution? Shuffling is possible in linear time O(n). Sorting is a harder problem (in O(n log n)). Thus, using a sort algorithm for shuffling—as we just did—is inelegant. Can we shuffle in linear time without random access with variable-length records?

Maybe we could try something concrete? Consider this algorithm:

  • Read the original file in small blocks. Shuffle each block in RAM. Write them to temporary files. View each shuffled block as a stack of records.
  • Select a non-empty block at random. Pick and remove the record on top of the stack. Append it to the result set. Repeat. (The correct probability assignment for each block is the number of records left in the block divided by the total number of records left.)

(As a variation on this algorithm, you can merge the blocks two-by-two.)

Unfortunately, I doubt this algorithm can run in linear time.

Your challenge: Consider variable-length records. Prove or disprove that we can implement an external-memory shuffle in linear time. Alternatively, come up with an algorithm faster than the sorting-based one.

Update: Preston L. Bannister proposed an algorithm which solves the problem to my satisfaction. The same algorithm was described by P. Sanders in Random Permutations on Distributed, External and Hierarchical Memory (Information Processing Letters, 1998).

Reference: This is a follow-up to my blog post External-Memory Shuffles?

10 Comments »

  1. First question – what is the aim of this exercise?

    You probably do not need a completely-random shuffle over the entire bulk of the file. This makes a large difference in the appropriate algorithms. What do you need to get meaningful results from your current exercise?

    Comment by Preston L. Bannister — 15/3/2010 @ 14:17

  2. @Bannister

    I’d be fine with quasi-shuffling, as long as it has nice properties.

    My practical motivation is this: I’m annoyed that no standard Unix tool does random line shuffling over very large files.

    My second motivation is that I’m annoyed people consider the shuffling problem “solved” by assuming that you have fixed-length records.

    And finally, I think you cannot just “shuffle” locally. Think about a sorted file. If you shuffle it only locally, it will still be “almost sorted” (no line starting with the letter ‘z’ will appear in the first few lines).

    Comment by Daniel Lemire — 15/3/2010 @ 14:56

  3. Fair enough. Note that I was thinking more along the lines of classic sort-merge algorithm, except shuffling rather than sorting. Runtime could be effectively linear and equal to a read/write of all the file data, twice (with minimal seeks). Theoretically there would be a non-linear component, but with a very much smaller constant multiplier.

    If local shuffling were sufficient, the runtime could be reduced to a single read/write of the input.

    Yes, this is the pragmatist talking to the theoretician. :)

    The reason for the lack of a generic optimal random file-shuffle is that the underlying reason for wanting a shuffle is not, um … random. The base requirement changes which algorithms are most suitable.

    Comment by Preston L. Bannister — 15/3/2010 @ 15:30

  4. @Bannister

    The algorithm I propose is close to a classical merge-sort algorithm. It fails to be linear (I think) because picking blocks at random, with uneven probabilities, is hard.

    Maybe this can be fixed.

    Comment by Daniel Lemire — 15/3/2010 @ 15:37

  5. Still not sure about your base need, but try this:

    1. Create N files as output buckets.
    2. For each line and use a hash to choose the output bucket.
    3. For each line written to the final output, read one line from a random bucket.
    4. Measure and adjust.

    The hash could be a random number generator. Note you do not want:
    * Too many buckets (OS limits on number of open files that can be efficiently handled).
    * Too-small buckets (inefficient small I/O).
    * Hashing that clusters with negative effect on your base purpose.

    With careful use of buffering you can make this run at full disk read/write rates with minimal CPU usage.

    Comment by Preston L. Bannister — 15/3/2010 @ 15:52

  6. (Way do I always notice needed edits right AFTER posting??)

    Comment by Preston L. Bannister — 15/3/2010 @ 15:53

  7. Using a random-keyed sort to implement a shuffle is well known to give rise to biased distributions, so that method won’t work without some heroic effort.

    If you can identify a record boundary by scanning backwards (always true for normal Unix text files), it’s simple to use Fisher-Yates shuffle with external storage and random access, in just the same way as you’d implement binary search. I’ve a suspicion that the time complexity isn’t theoretically linear, though.

    Comment by Bryan O'Sullivan — 15/3/2010 @ 17:16

  8. It looks to me like it isn’t possible. The survey paper “External Memory Algorithms and Data Structures: Dealing with Massive Data” by Jeffrey Scott Vitter says that permuting has the same IO complexity as sorting for non-trivial block sizes (its in section 5.5 if you are interested). That seems to imply that a true linear IO complexity external shuffle isn’t possible even with fixed size records.

    Comment by J.D. Park — 15/3/2010 @ 18:02

  9. You could create an “index file” for variable-length records where each fixed-size record has 2 fields: offset and length. This fixed-length file could be shuffled and then used to re-read the original file.

    While perhaps not as elegant as Preston L. Bannister’s solution, it would be pretty simple to implement.

    Comment by Craig Kelly — 24/3/2010 @ 22:52

  10. @Craig Yes, it would be simple to implement, but it *might* be quite a bit slower. Even with fixed-length records, the Fisher-Yates algorithm might just be fundamentally slow when used with external memory. Indeed, how do you leverage your fast internal memory? Bannister’s algorithm is nice because a good part of the work is done in RAM. You don’t write all over the disk, all the time.

    Comment by Daniel Lemire — 25/3/2010 @ 6:34

Leave a comment

Warning: When entering a long comment, please ensure that you make copy of your text prior to submitting it. If the server should fail or if you hit a bug, you might lose your work. I am not responsible for your lost effort.

To spammers: I carefully review every single post and make sure that spam gets deleted. You are wasting your time if you are manually entering spam using this form. Read my terms of use to see what I consider to be abusive.

Example: duo plus septem is '9'. The numbers are expressed in latin numerals but you should give your answers using ordinary digits.

 

« Blog's main page

Powered by WordPress