Monoidal Hashing


1. Introduction

Recently I had this problem – how do I efficiently hash an immutable rope data-structure as it is being constructed?

sealed trait Rope extends Product with Serializable {
    def hashCode: Int

    override def equals(obj: Any): Boolean = obj match {
        case that: Rope => this.toString == that.toString
        case _ => false
    }
}
final case class Combine(left: Rope, right: Rope) extends Rope {
    override val hashCode = hashCombine(left.hashCode, right.hashCode)
    override def toString = left.toString + right.toString
}
final case class Leaf(value: String) extends Rope {
    override val hashCode = hashLeaf(value)
    override def toString = value
}

def hashCombine(left: Int, right: Int): Int = ???
def hashLeaf(value: String): Int = ???

Note that since ropes with equal string representations are considered equal, and hashCode must be compatible with equality in the sense that a == b implies a.hashCode == b.hashCode, hashCombine must be associative and hashLeaf must form a monoid homomorphism, with hashLeaf("") being the identity with respect to hashCombine. Why? Let's unpack.

  • Equal ropes must have equal hashes.
  • Ropes are equal when the concatenation of the underlying string fragments is the same.
  • Hence, ropes whose underlying string fragments concatenate to the same string should have the same hash codes.
  • Combine(Leaf(""), x) == x, therefore hashCombine(hashLeaf(""), x.hashCode) == x.hashCode.
  • Combine(x, Leaf("")) == x, therefore hashCombine(x.hashCode, hashLeaf("")) == x.hashCode.
  • Combine(Combine(x, y), z) == Combine(x, Combine(y, z)), therefore hashCombine(hashCombine(x.hashCode, y.hashCode), z.hashCode) == hashCombine(x.hashCode, hashCombine(y.hashCode, z.hashCode)).
  • Leaf(x + y) == Combine(Leaf(x), Leaf(y)), therefore hashLeaf(x + y) == hashCombine(hashLeaf(x), hashLeaf(y)).

2. Re-discovering `SL_2`

Once I realized that hashLeaf must be a monoid homomorphism, I started searching for papers or articles on the topic of "monoidal hashes" and "monoid homomorphism hashing", but all I could find was an algorithm called `SL_2`:

  • The original paper introducing the algorithm.
  • A video explaining some additional motivation and the algorithm.
  • An MSc thesis discussing the hash function and some attacks against it.
  • Some improvements to make it more resilient against cryptographic attacks.

It is rather complex, and while there is a highly optimized version of it, I did not want to use a cryptographic-grade hashing algorithm just to hash a data-structure into an Int. I had to find something simpler.

3. Designing a hash function

I set out to explore my options in a newly created Jupyter notebook. First, I defined my monoidal hash MHash:

from secrets import randbits

class MHash:
    bits: int                                   # The number of bits in our hash output.
    empty: int                                  # Identity element w.r.t. `combine`.
    def combine(self, left, right) -> int: ...  # Combines two hashes, associative.
    def lift(self, byte) -> int: ...            # Monoid homomorphism from a single byte.

    def random(self):
        return randbits(self.bits)

    def compute(self, data):
        r = self.empty
        for b in data:
            r = self.combine(r, self.lift(b))
        return r

And some simple tests for it:

def test_properties(m):
    # Test associativity.
    for _ in range(100):
        h1, h2, h3 = randbits(m.bits), randbits(m.bits), randbits(m.bits)
        assert m.combine(m.combine(h1, h2), h3) == m.combine(h1, m.combine(h2, h3))
    ...

Now I could implement a really simplistic "hash function" and test it:

class MHash0(MHash):
    bits  = 32
    mask  = 2 ** bits - 1
    empty = 1
    def combine(self, x, y): return (x * y) & self.mask
    def lift(self, byte):    return byte

test_properties(MHash0())
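
As a quick sanity check (my own addition, not part of the original notebook), the monoid structure means that hashing a byte string must not depend on where we split it:

m = MHash0()
data = b"hello world"
# Splitting the input at any point and combining the partial hashes
# must give the same result as hashing the whole thing at once.
assert m.combine(m.compute(data[:4]), m.compute(data[4:])) == m.compute(data)
assert m.combine(m.compute(data[:8]), m.compute(data[8:])) == m.compute(data)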

4. Measuring hash quality

Quoting Wikipedia:

A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability.

Furthermore, a good hash function must satisfy a so-called avalanche criterion: if the input changes slightly (for example, by flipping a single bit), the output changes significantly (e.g., half the output bits flip).

Both are pretty easy to test:

  1. Generate a random input sequence.
  2. Flip some bits, record how the output changes.
  3. Compute how many bits of the output are effectively used – given a random input, how much uncertainty do I have about the resulting hash in bits? I called this metric "Output Entropy"; the closer it is to 32 bits, the better.
  4. Compute how many bits of the output are effectively controlled by the input bits – given a random input and a random bit flip, how much uncertainty do I have about the resulting hash in bits? I called this metric "Avalanche Entropy"; the closer it is to 32 bits, the better.
  5. Compute how many bits of the output change when a single input bit flips. I called this metric "Mean Avalanche Size"; the closer it is to 16 bits, the better. (One of these measurements is sketched in code right after this list.)
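
To make these metrics concrete, here is a small sketch of the "Mean Avalanche Size" estimator. It is my own reconstruction rather than the notebook's exact code, and the function name and sample sizes are arbitrary; the two entropy metrics can be estimated with the same kind of sampling plus an entropy estimate over the observed outputs.

from random import getrandbits, randrange

def mean_avalanche_size(m, trials=10_000, length=16):
    total = 0
    for _ in range(trials):
        data = [getrandbits(8) for _ in range(length)]   # random input sequence
        bit = randrange(length * 8)                      # pick one input bit...
        flipped = list(data)
        flipped[bit // 8] ^= 1 << (bit % 8)              # ...and flip it
        total += bin(m.compute(data) ^ m.compute(flipped)).count("1")
    return total / trials                                # ideally close to m.bits / 2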

The simplistic algorithm MHash0 from above gave me:

Monoid Homomorphism  = True
Non-commutative      = False
Output Entropy       = 28.637 bits
Avalanche Entropy    = 22.791 bits
Mean Avalanche Size  = 10.134 bits

Notice that I also tested for non-commutativity, for obvious reasons: concatenation is not commutative, and a commutative combine would force, say, "ab" and "ba" to always collide. The non-commutativity requirement gave me pause for a bit. I knew of only one way to make a non-commutative monoid: use matrix multiplication.

I tried that:

class MHash2(MHash):
    bits  = 32
    mask  = 2 ** bits - 1
    empty = (1 << 24) + (0x00 << 16) + (0x00 << 8) + (1 << 0)

    def combine(self, x, y):
        assert 0 <= x < 2**self.bits
        assert 0 <= y < 2**self.bits
        x00, x01, x10, x11 = (x >> 24) & 0xFF, (x >> 16) & 0xFF, (x >> 8) & 0xFF, (x >> 0) & 0xFF
        y00, y01, y10, y11 = (y >> 24) & 0xFF, (y >> 16) & 0xFF, (y >> 8) & 0xFF, (y >> 0) & 0xFF

        mul = lambda x, y: x * y
        add = lambda x, y: x + y
        z00 = add(mul(x00, y00), mul(x01, y10)) & 0xFF
        z01 = add(mul(x00, y01), mul(x01, y11)) & 0xFF
        z10 = add(mul(x10, y00), mul(x11, y10)) & 0xFF
        z11 = add(mul(x10, y01), mul(x11, y11)) & 0xFF

        return (z00 << 24) + (z01 << 16) + (z10 << 8) + (z11 << 0)

    # Just mixes the 256 values of the input byte up,
    # "diffusing" them randomly among 2**32 output
    # values. This operation is bijective.
    def lift(self, x):
        x = (x * 0xfffffd + 0xe6546b64) & self.mask
        x = x ^ (x >> 16)
        x = (x * 1729) & self.mask
        x = x ^ (x >> 10)
        x = (x * 1729) & self.mask
        x = x ^ (x >> 16)
        return x

But I was not entirely satisfied with the results:

Monoid Homomorphism  = True
Non-commutative      = True
Output Entropy       = 30.957 bits
Avalanche Entropy    = 30.724 bits
Mean Avalanche Size  = 13.884 bits

A careful choice of the lift function is necessary for good results here. Perhaps a carefully chosen translation table for each byte value would make it work even better...

I decided to pursue a different route. Searching for "non-commutative integer monoid" on Google and Math Overflow eventually led me to this simple monoid:

(a, b) x (c, d) = (a + b * c, b * d)

I could not believe at first that it was a non-commutative monoid, but it passed all my tests.
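
Why does it work? Reading (a, b) as the affine map t -> a + b * t, the operation above is just function composition, which is associative but not commutative. Here is a tiny check of the monoid laws (my own, not from the notebook):

from random import randrange

def op(p, q):
    (a, b), (c, d) = p, q
    return (a + b * c, b * d)

identity = (0, 1)
for _ in range(1000):
    x, y, z = [(randrange(100), randrange(100)) for _ in range(3)]
    assert op(op(x, y), z) == op(x, op(y, z))        # associative
    assert op(identity, x) == op(x, identity) == x   # (0, 1) is the identity
assert op((0, 2), (1, 1)) != op((1, 1), (0, 2))      # but not commutative

Packing the two components into the 16-bit halves of a 32-bit integer gives: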

class MHash4(MHash):
    bits  = 32
    mask  = 2 ** bits - 1
    empty = 1

    def combine(self, x, y):
        x, x1 = (x >> 16) & 0xFFFF, (x >> 0) & 0xFFFF
        y, y1 = (y >> 16) & 0xFFFF, (y >> 0) & 0xFFFF
        z, z1 = (x + x1 * y) & 0xFFFF, (x1 * y1) & 0xFFFF
        return (z << 16) + z1
    ...

Unfortunately, it was still lousy:

Monoid Homomorphism  = True
Non-commutative      = True
Output Entropy       = 30.678 bits
Avalanche Entropy    = 29.438 bits
Mean Avalanche Size  = 13.607 bits

Well, if you think about it, multiplication modulo a power of two is not a particularly good idea. Factors of 2 will eventually accumulate, zeroing the whole thing out.
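
Here is a toy illustration of that (mine): modulo 2**16, sixteen even factors are already enough to collapse a running product to zero, regardless of their odd parts.

prod = 1
for factor in [2, 6, 10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50, 54, 58, 62]:
    prod = (prod * factor) & 0xFFFF   # each factor contributes at least one power of two
assert prod == 0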

Suddenly, I realized that any monoid with annihilators will not do. For example,

(0, 0) x (c, d) = (0 + 0 * c, 0 * d) = (0, 0)

Which means that if we find a string that hashes to (0, 0), we can add arbitrary suffixes to it to get easy hash collisions.
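
Here is a concrete check (again my own, not the notebook's) with MHash4: once the low ("b") component of a hash is zero, combining it with anything returns the hash unchanged, so every suffix collides.

from random import getrandbits

m = MHash4()
h = 0x12340000                       # a hash whose low 16-bit component is zero
for _ in range(1000):
    # (a, 0) x (c, d) = (a, 0): the hash never changes again.
    assert m.combine(h, getrandbits(32)) == h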

We could keep the second component always odd, ensuring that no annihilators will ever arise, but we lose that one bit of output entirely.

class MHash6(MHash):
    ...

    def random(self):
        while True:
            x = randbits(32)
            if (x & 1) != 1: continue
            return x

    def combine(self, x, y):
        x, x1 = (x >> 16) & 0xFFFF, (x >> 0) & 0xFFFF
        y, y1 = (y >> 16) & 0xFFFF, (y >> 0) & 0xFFFF
        z, z1 = (x + x1 * y) & 0xFFFF, (x1 * y1) & 0xFFFF
        return (z << 16) + z1

    def lift(self, x):
        ...
        if (x & 1) == 0: x = x ^ 1
        return x

This already helps:

Monoid Homomorphism  = True
Non-commutative      = True
Output Entropy       = 31.641 bits
Avalanche Entropy    = 30.962 bits
Mean Avalanche Size  = 15.508 bits

What if, instead of keeping the second component always odd, we multiply the second components by 2 and add 1 right before our monoid operation, and shift the result back afterwards?

class MHash7(MHash):
    ...

    def combine(self, x, y):
        x, x1 = (x >> 16) & 0xFFFF, x & 0xFFFF
        y, y1 = (y >> 16) & 0xFFFF, y & 0xFFFF
        x1 = (x1 << 1) + 1
        y1 = (y1 << 1) + 1
        z, z1 = (x + x1 * y) & 0xFFFF, (x1 * y1) & 0x1FFFF
        z1 = z1 >> 1
        return (z << 16) + z1

    ...

Surprisingly, this hash function does really well:

Monoid Homomorphism  = True
Non-commutative      = True
Output Entropy       = 32.000 bits
Avalanche Entropy    = 31.942 bits
Mean Avalanche Size  = 15.956 bits

It has no obvious annihilators, and although its avalanche diagram has some obvious structure (after all, we are just adding and multiplying integers), it still has amazing statistics.

I did some experiments with the general linear group as well (computing matrix elements mod 251 instead of mod 256 – a very useful trick that, because 251 is prime, ensures the matrices don't end up degenerate), but even though the resulting hash function has better statistics,

Monoid Homomorphism  = True
Non-commutative      = True
Output Entropy       = 31.999 bits
Avalanche Entropy    = 31.999 bits
Mean Avalanche Size  = 15.997 bits

it is much more complex:

class MHash5(MHash):
    bits  = 32
    mask  = 2 ** bits - 1
    empty = (1 << 24) + (0x00 << 16) + (0x00 << 8) + (1 << 0)

    table = ...

    def combine(self, x, y):
        x00, x01, x10, x11 = (x >> 24) & 0xFF, (x >> 16) & 0xFF, (x >> 8) & 0xFF, (x >> 0) & 0xFF
        y00, y01, y10, y11 = (y >> 24) & 0xFF, (y >> 16) & 0xFF, (y >> 8) & 0xFF, (y >> 0) & 0xFF

        mul = lambda x, y: x * y
        add = lambda x, y: x + y

        z00 = (add(mul(x00, y00), mul(x01, y10)) % 251) & 0xFF
        z01 = (add(mul(x00, y01), mul(x01, y11)) % 251) & 0xFF
        z10 = (add(mul(x10, y00), mul(x11, y10)) % 251) & 0xFF
        z11 = (add(mul(x10, y01), mul(x11, y11)) % 251) & 0xFF

        return (z00 << 24) + (z01 << 16) + (z10 << 8) + (z11 << 0)

    def lift(self, x):
        # All elements of the table have det != 0.
        return self.table[x]
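
The table itself is elided above; as a rough sketch (mine, not the author's actual values), one could fill it with a random invertible 2x2 matrix over GF(251) for each byte value; the nonzero-determinant condition mentioned in the comment is exactly invertibility, because 251 is prime:

from random import randrange

def make_table():
    table = []
    for _ in range(256):
        while True:
            a, b, c, d = (randrange(251) for _ in range(4))
            if (a * d - b * c) % 251 != 0:   # nonzero det mod a prime => invertible
                table.append((a << 24) | (b << 16) | (c << 8) | d)
                break
    return table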