Andrew Kensler's permute()
A function for stateless, constant-time pseudorandom-order array iteration

That subtitle is a bit of a mouthful, but I’ve been wanting to talk about Andrew Kensler’s permute() function for a while, and I wanted to make it clear right off the bat why it’s useful. permute() sees some use in the graphics community, especially in offline rendering, and the underlying techniques are more widely known in cryptography*, but it’s applicable to much more than graphics and cryptography.

Let’s first talk about the problem it solves. You want to iterate over \(k\) items of an array of length \(n\), in random order, where \(k \leq n\). Those \(k\) items cannot contain duplicates from the original array (they must each correspond to a unique index). Or, you want to get a random sample of \(k\) items without replacement.

When I say “random”, this is obviously pseudorandom, but there are varying degrees of pseudorandomness. This is unquestionably not cryptographically secure; that’s not its intended use. I can’t say that I know enough about pseudorandomness to say how statistically indistinguishable from random this is, or for which applications it will be random enough. It’s good enough for computing rendering integrals, and I’d be comfortable using it for a lot of casual statistical analysis.

Update May 27, 2021: I saw what looks to be a nice improvement to the randomness of this, which also extends it to work with 64-bit integers. I recommend taking a look after reading this post!

Anyway, the canonical way to randomly iterate over an array is to first shuffle it with the Fisher-Yates shuffle, and then iterate over the shuffled array.

*I found this Stack Overflow answer from 2009 that gives the exact same idea, so I should be clear that Andrew Kensler didn't really invent the method I'll talk about here, but I think he popularized the use of it in graphics and certainly created the particular hash function I'll show at the end.

The Fisher-Yates shuffle (and its limitations)

The shuffle works something like this:

  1. For the first item in the array, randomly swap it with any item (including itself). Whatever is now in the first position is “fixed”.
  2. For the second item in the array, randomly swap it with any of the non-fixed items (i.e. after the first item). The second item is now “fixed”.
  3. For the third item, swap it with any of the non-fixed items (i.e. after the second item).
  4. Etc. until you get to the last item.

The C++ code looks something like this:

template<typename T>
void fisher_yates_shuffle(std::vector<T> *vec) {
  int n = vec->size();
  for (int i = 0; i < n-1; i++) {
    int swap_idx = uniform_rand_int(i, n-1);
    std::swap((*vec)[i], (*vec)[swap_idx]);
  }
}

uniform_rand_int(min, max) is inclusive of both min and max, i.e. \(\textnormal{swap\_idx} \in \{i, i+1, \dots, n-1\}\).
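I don’t define uniform_rand_int anywhere in this post, so here is one way you might write it with C++’s <random> (a minimal sketch; the generator and seeding choices are just placeholders):

#include <cstdint>
#include <random>

// One possible definition of the uniform_rand_int helper used in these
// snippets: returns an integer uniformly distributed in [min, max],
// inclusive of both ends.
uint32_t uniform_rand_int(uint32_t min, uint32_t max) {
  static thread_local std::mt19937 rng{std::random_device{}()};
  return std::uniform_int_distribution<uint32_t>(min, max)(rng);
}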

Obviously the shuffle manipulates the array in place, which may not be desirable for a bunch of reasons. In practice you might call it with an array of indices or an array of pointers.

There are a bunch of situations where the Fisher-Yates shuffle doesn’t work very well, because you need to store the shuffled array.

  1. The rendering case, which is what motivated Kensler. This is when you’re iterating over a LOT of arrays simultaneously, like tens of thousands or millions on one computer. Or maybe iterating over one array in a million different random orders. In other words, rather than fully iterating over one array, then the next array, etc., you want to get the i’th shuffled value of a million arrays, then the (i+1)’th shuffled value of those arrays, etc.
  2. The case where you want to draw a “small” random sample (without replacement, i.e. no duplicates) from a huge array. Think: you want to get 10,000 unique random items from an “array” that’s a billion items long, like a database.
  3. You want to compute something, like an average, over a really large random sample from an even huger array, neither of which can fit in memory, and it’s really important that you don’t have duplicate values.

In the latter two cases there’s obviously a lot more to think about, like the cost of just reading incoherent values from the large array. There are other good ways to solve #2 and #3 too, especially if you’re doing distributed computing. But a lot of the techniques I know share the problem that you need to scan over the entire array at least once. If you have constant-time access into the array, Kensler’s method is linear only in the size of your random sample, never in the size of the original array.

In other words, if you want to iterate over \(k\) random, unique items from an array of length \(n\), then the Fisher-Yates shuffle is \(O(n)\) time and \(O(n)\) memory. As I’ll show in a second, this can be reduced to \(O(k)\) time and \(O(k)\) memory. But permute() is \(O(k)\) time and \(O(1)\) memory, the best you could possibly do. Not only that, but with permute() you can arbitrarily query random indices in constant time, which means you can use it in parallel across multiple threads or computers.

Incremental Fisher-Yates shuffle

I think it’s worth mentioning you can extend the Fisher-Yates shuffle to handle #2. You use a sparse map to keep track of swapped indices rather than putting them into the array. When you swap the next index, you check your map to see if the original or swapped locations in the array have already been swapped. If so, you use the values from the map, otherwise you just use the original indices, and then you update the map. Here’s a code snippet to show what that looks like. This will select \(k\) random unique integers from the values \(\{0,1,\dots,n-1\}\).

std::vector<int> get_k_indices(int n, int k) {
  std::vector<int> indices(k);
  std::unordered_map<int, int> index_map;
  for (int i = 0; i < k; i++) {
    int swap_idx = uniform_rand_int(i, n-1);

    // Get the new index and the previous. If they've been swapped before
    // they'll be in the map.
    auto swap_val_it = index_map.find(swap_idx);
    auto prev_val_it = index_map.find(i);
    int swap_val = (swap_val_it == index_map.end())
        ? swap_idx
        : swap_val_it->second;
    int prev_val = (prev_val_it == index_map.end())
        ? i
        : prev_val_it->second;

    indices[i] = swap_val;
    index_map[swap_idx] = prev_val;
    // We don't need to record index i itself in the map;
    // it will never be checked again.
  }
  return indices;
}

But still, even this “incremental” Fisher-Yates shuffle requires \(O(k)\) memory, which makes it unsuitable for case #1 or case #3.

Update: I believe this is equivalent to, but less elegant than, Floyd's algorithm for random sampling without replacement. So if you want to use that (like if you're in case #2 and really care about good randomness), implement Floyd's algorithm, rather than the way I've framed it here.
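For reference, here’s what Floyd’s algorithm looks like (a sketch in the style of the snippets above, reusing the uniform_rand_int helper; note it returns an unordered sample, not a random ordering):

#include <unordered_set>

// Floyd's algorithm: sample k unique integers from {0, ..., n-1}.
std::unordered_set<int> floyd_sample(int n, int k) {
  std::unordered_set<int> sample;
  for (int j = n - k; j < n; j++) {
    int t = uniform_rand_int(0, j);
    // If t was already sampled, insert j instead. j can't be in the
    // sample yet, since this is the first iteration that can pick it.
    if (!sample.insert(t).second) sample.insert(j);
  }
  return sample;
}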

permute()

Kensler introduced permute() in his paper Correlated Multi-Jittered Sampling. I didn’t quite understand how it worked when I first read it, so I also wanted to write this blog post to break it down and hopefully help others appreciate it. However, if you just want the code, skip to the bottom!

The function acts as a “stateless” pseudorandom iterator over an array. It takes in a seed value for a permutation, the length of the array, and the index you’d like to map to a different index. It looks like this:

uint32_t permute(uint32_t idx, uint32_t len, uint32_t seed) {
  // Somehow return a randomly shuffled index in the interval [0, len-1].
  ...
}

and if you wanted to use it, say to get 10 random unique values from a vector, you might do:

  uint32_t seed = uniform_rand_int(0, UINT_MAX);
  for (int i = 0; i < 10; i++) {
    int rand_idx = permute(i, array.size(), seed);
    ...  // Do something with array[rand_idx].
  }

So how do we construct this magic function?

An invertible hash function

The first thing we need is a hash function that is invertible for a given power-of-two sized domain. What does that mean? For a given number of bits \(n\), every value from \(0\) to \(2^n-1\) will be hashed to a single unique value in the same range. It’s a one-to-one and onto mapping, i.e. a bijective function. Let’s give a super simple example: adding 1. Let’s say our domain is 2-bit integers. If we add one to a 2-bit number, we’ll get a unique 2-bit number.

\[f(x) = (x + 1) \mod 4\]

In binary:

\[f(00) = 01\\ f(01) = 10\\ f(10) = 11\\ f(11) = 00\\\]

Adding is, in general, an invertible hashing operation. Another invertible operation in binary is multiplying by an odd number, let’s say 3: \(f(x) = (3x) \mod 4\), with:

\[f(00) = 00\\ f(01) = 11\\ f(10) = 10\\ f(11) = 01\\\]

A non-invertible operation is multiplying by two or any even number:

\[f(00) = 00\\ f(01) = 10\\ \textcolor{red}{f(10) = 00}\\ \textcolor{red}{f(11) = 10}\\\]

so we can’t do that.
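By the way, you can easily brute-force check claims like these on a small domain. Here’s a quick sketch (my own, not code from Kensler’s paper):

#include <cstdint>
#include <cstdio>

// Brute-force check: restricted to the 2-bit domain, does f hit every
// value exactly once (i.e. is it a bijection)?
bool is_bijective_2bit(uint32_t (*f)(uint32_t)) {
  bool seen[4] = {};
  for (uint32_t x = 0; x < 4; x++) {
    uint32_t y = f(x) & 0x3;  // Equivalent to f(x) mod 4.
    if (seen[y]) return false;
    seen[y] = true;
  }
  return true;
}

int main() {
  printf("%d\n", is_bijective_2bit([](uint32_t x) { return x + 1; }));  // Prints 1.
  printf("%d\n", is_bijective_2bit([](uint32_t x) { return x * 3; }));  // Prints 1.
  printf("%d\n", is_bijective_2bit([](uint32_t x) { return x * 2; }));  // Prints 0.
}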

Kensler outlines more invertible operations in power-of-two domains, especially XORs and combinations with a seed value, where different seed values give different permutations. Ultimately, we get some invertible hash function:

uint32_t hash(uint32_t idx, uint32_t seed) {
  ...  // do some hashing of idx.
  return idx;
}
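To make that concrete, here’s a toy example of a seeded invertible hash (emphatically not Kensler’s hash, just a composition of invertible operations, so the whole thing stays a bijection on 32-bit integers):

#include <cstdint>

// A toy invertible hash. Each step is individually invertible mod 2^32,
// so the composition is too, and the seed selects the permutation.
uint32_t toy_hash(uint32_t idx, uint32_t seed) {
  idx ^= seed;         // XOR with a constant is invertible.
  idx *= 0x9e3779b9u;  // Multiplying by an odd constant is invertible.
  idx += seed >> 16;   // Adding a constant is invertible.
  idx ^= idx >> 13;    // An xorshift is invertible too.
  return idx;
}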

Cycle-walking

Already, we can see that this hash will allow us to randomly iterate over an array whose length is a power of two. But what about arrays of other lengths? Kensler uses a technique from cryptography called “cycle walking”, which was introduced in “Ciphers with Arbitrary Finite Domains” by Black and Rogaway. Cycle-walking is a fancy name, but the idea is simple: we just repeat the hash until the value is less than the length of the array! So we get a permute() function like this:

uint32_t permute(uint32_t idx, uint32_t len, uint32_t seed) {
  do {
    idx = hash(idx, seed);
  } while (idx >= len);
  return idx;
}

Obviously this can be really slow since it could take a lot of hashes to get to where we want, but we’ll get to that in a second. First, why does this work at all? This is where the bijective mapping comes in. We start with an index that is less than the length of the array and we hash it until we get another value less than the length of the array. Let’s visualize it like this:

\[idx \rightarrow \mathbf{hash(idx)} \rightarrow \mathbf{hash^2(idx)} \rightarrow hash^3(idx)\]

where \(hash^3(x)\) is \(hash(hash(hash(x)))\) or, if you prefer function composition notation, \((hash \circ hash \circ hash)(x)\). In this case, I’ve bolded the first two hashes to indicate that those values are greater than or equal to the length of the array. The last hash is our final permuted value.

Because our hash function is invertible, we can invert the final hashed value (note that we don’t need to actually implement the inverse; this is just theoretical):

\[hash^{-1}(hash^3(idx)) = \mathbf{hash^2(idx)}\]

By inverting the final value, which was less than the length of the array, we again get a value greater than or equal to the length of the array. So if we ran this loop in reverse until we got a value less than the length of the array, we would get back the original index.

\[idx \leftarrow \mathbf{hash(idx)} \leftarrow \mathbf{hash^2(idx)} \leftarrow hash^3(idx)\]

In other words, our loop is also an invertible hash function. But rather than mapping within a power-of-two sized domain, it maps within the domain of the array’s length, which is exactly what we’ve been trying to achieve.

Making it (average) constant-time

We need to make this faster, specifically we don’t want a lot of loop iterations. Let’s say that we have a really good hash function such that there’s roughly a 50% probability that any bit (within our power-of-two domain) will be 1 after hashing. Now, we modify our hash function to specify the power-of-two domain. We do this by taking in a mask that limits the bits which get hashed. Here’s an example (but a bad hash function).

uint32_t hash(uint32_t idx, uint32_t mask, uint32_t seed) {
  idx ^= seed;
  return (idx * 3) & mask;
}

If our hash domain is the two-bit (unsigned) integers, then our mask would be 11 in binary, or 0x3. If our domain is the 32-bit integers, it would be 0xFFFFFFFF.

Now, in our permute() function, we’ll use a mask with all the bits set up to the most significant bit of the highest possible index (that is, len - 1). Here’s a really inefficient (\(O(\log n)\)) implementation for clarity:

uint32_t get_mask(uint32_t len) {
  if (len == 1) return 0;
  uint32_t mask = 1;
  while (mask < (len-1)) mask |= (mask << 1);
  return mask;
}

However, a lot of processors (e.g. x86 processors) have a single instruction that counts the number of leading zeros of an integer. In GCC you can use this via the intrinsic __builtin_clz(x), which compiles to a constant-time operation even on processors that don’t have the instruction. So if you use GCC you could write:

uint32_t get_mask(uint32_t len) {
  if (len == 1) return 0;
  uint32_t full_mask = -1;
  // This subtly assumes that "unsigned int" == uint32_t.
  return (full_mask >> __builtin_clz(len-1));
}

This may not be super portable; here is how Kensler does it with bit arithmetic.

uint32_t get_mask(uint32_t len) {
  uint32_t mask = len-1;
  mask |= mask >> 1;
  mask |= mask >> 2;
  mask |= mask >> 4;
  mask |= mask >> 8;
  mask |= mask >> 16;
  return mask;
}
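As an aside, if you have C++20 available, std::bit_width from the <bit> header gives yet another portable way to compute the same mask. A quick sketch (get_mask_cpp20 is my name for it, nothing standard):

#include <bit>
#include <cstdint>

// Matches Kensler's get_mask for any len >= 1: bit_width(len - 1) is the
// number of bits needed to represent the highest index, so shift a full
// mask down to exactly that many bits.
uint32_t get_mask_cpp20(uint32_t len) {
  if (len <= 1) return 0;
  return 0xFFFFFFFFu >> (32 - std::bit_width(len - 1));
}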

With this mask, our permute() function looks something like:

uint32_t permute(uint32_t idx, uint32_t len, uint32_t seed) {
  uint32_t mask = get_mask(len);
  do {
    idx = hash(idx, mask, seed);
  } while (idx >= len);
  return idx;
}

Because of this mask, we know that our hashing domain is less than twice the size of the array, i.e. len is more than half the domain. Since our hash function produces roughly uniform values over that domain, each iteration lands below len with probability greater than 1/2, so on average the loop takes fewer than two iterations. We’ve got our fast random iterator!

We still need to make a good invertible hash function. As I mentioned, Kensler outlines a bunch of invertible operations, and he uses hill-climbing to find a good hash function. I won’t go into that in more detail, but here is the final code using his hash. I’ve adapted Kensler’s code to the way I’ve structured this blog post:

uint32_t get_mask(uint32_t len) {
  uint32_t mask = len-1;
  mask |= mask >> 1;
  mask |= mask >> 2;
  mask |= mask >> 4;
  mask |= mask >> 8;
  mask |= mask >> 16;
  return mask;
}

uint32_t hash(uint32_t idx, uint32_t mask, uint32_t seed) {
  idx ^= seed; idx *= 0xe170893d;
  idx ^= seed >> 16;
  idx ^= (idx & mask) >> 4;
  idx ^= seed >> 8; idx *= 0x0929eb3f;
  idx ^= seed >> 23;
  idx ^= (idx & mask) >> 1; idx *= 1 | seed >> 27;
  idx *= 0x6935fa69;
  idx ^= (idx & mask) >> 11; idx *= 0x74dcb303;
  idx ^= (idx & mask) >> 2; idx *= 0x9e501cc3;
  idx ^= (idx & mask) >> 2; idx *= 0xc860a3df;
  idx &= mask;
  idx ^= idx >> 5;
  return idx;
}

uint32_t permute(uint32_t idx, uint32_t len, uint32_t seed) {
  uint32_t mask = get_mask(len);
  do {
    idx = hash(idx, mask, seed);
  } while (idx >= len);
  return (idx + seed) % len;
}

Okay, there’s one additional thing at the end, the (idx + seed) % len, but that won’t break anything and adds a little extra randomness.
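If you want to sanity-check all of this, it’s easy to verify that permute() really does produce a permutation: for a given seed, walking idx over [0, len) should hit every index exactly once. A quick test sketch, assuming the final code above:

#include <cstdint>
#include <cstdio>
#include <vector>

// Checks that permute(0..len-1, len, seed) visits each index once.
bool is_valid_permutation(uint32_t len, uint32_t seed) {
  std::vector<bool> seen(len, false);
  for (uint32_t i = 0; i < len; i++) {
    uint32_t j = permute(i, len, seed);
    if (j >= len || seen[j]) return false;
    seen[j] = true;
  }
  return true;
}

int main() {
  for (uint32_t seed = 0; seed < 100; seed++) {
    if (!is_valid_permutation(10000, seed)) {
      printf("seed %u failed!\n", seed);
      return 1;
    }
  }
  printf("100 seeds, all valid permutations\n");
  return 0;
}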

I hope that you find this as elegant and useful as I have! One final note: Kensler mentions that this hash only works well up to domains of about \(2^{27}\). For larger domains you could maybe use two calls of the permute function, although if it’s not random enough, that could exacerbate correlations. Overall I really just wanted to present the ideas here.

March 22, 2021