Problem statement: You have a list of items that you want to randomize.
I’ve found myself in this situation many times. If the language you’re working in has a shuffle or randomize function, you’re set. However, there are plenty of languages that don’t provide built-in support for such a function, leaving you on your own. The first time I was faced with this problem, I wrote a shuffle algorithm that looked something like this:
    import random

    def incorrect_shuffle(items):
        for i in range(len(items)):
            # Biased: the swap partner is drawn from the whole list every time.
            randomIndex = random.randint(0, len(items) - 1)
            temp = items[randomIndex]
            items[randomIndex] = items[i]
            items[i] = temp
        return items
The above algorithm swaps every element in the list with another randomly-chosen element in the list. But there are three problems with this algorithm:
- It’s biased.
- It’s biased.
- It’s biased.
OK, so there’s really only one problem, but it’s a big one! This topic has been covered before, most notably by Jeff Atwood in his The Danger of Naïveté post. I’m writing this post to re-emphasize the importance of this topic, especially since I’ve made the mistake of implementing an incorrect shuffle in the past.
How do we know that the above algorithm is biased? On the surface it seems reasonable, and it certainly does some shuffling of the items.
There are two ways to see that the above algorithm is incorrect. The first is theoretical. A list of N items has N factorial (N!) possible orderings. Consider a list with three elements: A, B, and C. There are 3! = 6 ways to order these elements:
ABC ACB BAC BCA CAB CBA
However, the above incorrect algorithm does not have N! equally likely outcomes. Each of the N iterations picks its swap index at random from all N positions, so the algorithm can run in N^N equally likely ways. Every one of those runs ends in one of the N! orderings. For N > 2, N^N is not evenly divisible by N!, so the runs cannot be spread evenly across the orderings, and some of the final list orderings must be more common than others. That is the bias. With three items, for example, 3^3 = 27 runs have to be split among 6 orderings, and 27 is not a multiple of 6.
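The counting argument can be checked directly for a small list. The sketch below is not from the original post: it walks through every possible sequence of swap indices the incorrect algorithm could pick and tallies where each one ends up. The function name `naive_outcomes` is a hypothetical name chosen for illustration.

```python
from collections import Counter
from itertools import product


def naive_outcomes(items):
    """Count the final ordering for every possible run of the naive shuffle."""
    n = len(items)
    counts = Counter()
    # One run = one swap index per iteration, each drawn from all n positions,
    # so there are n**n runs in total.
    for choices in product(range(n), repeat=n):
        shuffled = list(items)
        for i, j in enumerate(choices):
            shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
        counts["".join(shuffled)] += 1
    return counts


counts = naive_outcomes("ABC")
for ordering in sorted(counts):
    print(ordering, counts[ordering])
```

For three items, the 27 runs land on ABC, CAB, and CBA four times each and on ACB, BAC, and BCA five times each, so the second group is slightly favored.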
The second way to observe the incorrectness is empirical, i.e., by running some examples. Let’s try to randomize our three-element [A,B,C] list with the above biased algorithm, and compare those results to randomizing the list with the Fisher-Yates implementation shown below:
    import random

    def fisher_yates_shuffle(items):
        for i in range(len(items)):
            # Only positions i..end are eligible; earlier positions are final.
            randomIndex = random.randint(i, len(items) - 1)
            temp = items[randomIndex]
            items[randomIndex] = items[i]
            items[i] = temp
        return items
If we shuffle our [A,B,C] list 1,000,000 times with each algorithm, we end up with the following distribution for each of the six possible list orderings:
The results show that the biased shuffle produces certain orderings more often than others. The correct Fisher-Yates algorithm produces each outcome with equal likelihood. We can repeat this experiment for a list with four elements. Here are the results:
The numbers 1-24 each represent one of the 24 possible orderings of a list with four elements. The Fisher-Yates shuffle produces each final ordering with equal likelihood. The incorrect algorithm does not.
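For anyone who wants to rerun the experiment, here is a rough sketch of how it could be set up. It is not the script used to produce the results in this post. The names `naive_shuffle` and `tally`, the seed, and the trial count are assumptions made for illustration, and the trial count is deliberately smaller than 1,000,000 so that it runs quickly.

```python
import random
from collections import Counter


def naive_shuffle(items):
    for i in range(len(items)):
        j = random.randint(0, len(items) - 1)  # biased: always the full range
        items[i], items[j] = items[j], items[i]
    return items


def fisher_yates_shuffle(items):
    for i in range(len(items)):
        j = random.randint(i, len(items) - 1)  # only positions not yet fixed
        items[i], items[j] = items[j], items[i]
    return items


def tally(shuffle, letters, trials):
    """Shuffle a fresh copy of `letters` `trials` times and count each ordering."""
    counts = Counter()
    for _ in range(trials):
        counts["".join(shuffle(list(letters)))] += 1
    return counts


random.seed(12345)  # arbitrary seed, only so that reruns match
TRIALS = 60_000     # assumption: far fewer than the 1,000,000 used above
naive_counts = tally(naive_shuffle, "ABC", TRIALS)
fair_counts = tally(fisher_yates_shuffle, "ABC", TRIALS)

# Columns: ordering, naive count, Fisher-Yates count.
for ordering in sorted(fair_counts):
    print(ordering, naive_counts[ordering], fair_counts[ordering])
```

Changing "ABC" to "ABCD" repeats the comparison for the four-element case. The exact counts depend on the seed and the number of trials, but the Fisher-Yates column should stay close to flat while the naive column should not.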
The first time I saw this I was quite surprised. What makes this problem especially interesting is that the difference between the two algorithms is essentially one character. The incorrect algorithm has the line:
randomIndex = random.randint(0, len(items) - 1)
The Fisher-Yates algorithm uses the following line instead:
randomIndex = random.randint(i, len(items) - 1)
The Fisher-Yates shuffle algorithm (also called the Knuth shuffle) walks a list of items and swaps each item with one chosen at random from the items that have not been placed yet (possibly itself). With each iteration, the range of swappable items shrinks. The algorithm starts at index zero (it can also walk the list in reverse) and chooses an item from all N indices at random. This selection freezes the 0th element in the shuffled list. The next iteration moves to index 1 and chooses an item from the remaining N - 1 indices. This repeats until the entire list is walked.
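The shrinking range is what makes the counts work out: N choices, then N - 1, and so on down to 1, which gives exactly N! possible runs. The sketch below, with the hypothetical name `fisher_yates_outcomes`, enumerates every run for a four-element list to check that each ordering is reached by exactly one of them. It is an illustration of the counting, not a proof for general N.

```python
from collections import Counter
from itertools import product


def fisher_yates_outcomes(items):
    """Count the final ordering for every possible run of Fisher-Yates."""
    n = len(items)
    counts = Counter()
    # Iteration i may pick any index from i to n - 1, so the ranges shrink:
    # n choices, then n - 1, ..., then 1, for n! runs in total.
    ranges = [range(i, n) for i in range(n)]
    for choices in product(*ranges):
        shuffled = list(items)
        for i, j in enumerate(choices):
            shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
        counts["".join(shuffled)] += 1
    return counts


fy_counts = fisher_yates_outcomes("ABCD")
print(len(fy_counts), set(fy_counts.values()))  # prints: 24 {1}
```

Because every run is equally likely and every ordering is produced by exactly one run, each ordering is equally likely.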
On the surface, using something similar to the incorrect shuffle algorithm might not seem like a big deal. However, the shuffling bias grows as the number of list items grows, since N^N grows faster than N!. The Fisher-Yates algorithm is a good one to have in your pocket. It comes in handy more often than you would think.
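To put numbers on the N^N versus N! comparison, the short sketch below prints both values for small N, along with the remainder of dividing one by the other. The helper name `runs_vs_orderings` is hypothetical, and the ratio in the last column is only a rough indication of how lopsided the naive shuffle can get, not a precise measure of bias.

```python
import math


def runs_vs_orderings(n):
    """Return (n**n, n!, remainder of n**n divided by n!)."""
    runs = n ** n
    orderings = math.factorial(n)
    return runs, orderings, runs % orderings


# Columns: N, N^N, N!, remainder, ratio N^N / N!
for n in range(2, 9):
    runs, orderings, remainder = runs_vs_orderings(n)
    print(n, runs, orderings, remainder, round(runs / orderings, 1))
```

A nonzero remainder means the runs cannot be divided evenly among the orderings. The only row with a zero remainder is N = 2, which is the one case where the naive shuffle happens to be fair.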