TLDR: to compute the hash of a custom class that contains a few properties with defined `hash` methods, use this:

```
- (NSUInteger)hash {
    NSUInteger hash = 17;
    hash = hash * 37 + self.prop1.hash;
    hash = hash * 37 + self.prop2.hash;
    return hash;
}
```

## Hashing

`NSDictionary` is a collection that we use every day. Most of the time we use a native type as a key, something like `NSDictionary<NSString *, MyClass *>`, and everything just works. But sometimes it’s necessary to use a custom object as a key. For example, I needed to store an array of events per date range: `NSDictionary<MyDateRange *, NSArray *>`, where `MyDateRange` is:

```
@interface MyDateRange : NSObject <NSCopying>
@property (readwrite) NSDate *start;
@property (readwrite) NSDate *end;
@end
```

When using a custom `NSObject` as a key in a dictionary, it’s required to implement the `isEqual:`, `copyWithZone:` and `hash` methods. `isEqual:` and `copyWithZone:` are easy: we simply compare/copy the `NSDate` properties. `hash` is not: we need to combine the hashes of 2 `NSDate`s.
In this article I want to discuss:

- The properties that a good hash function must possess.
- Existing implementations from a few well-known libraries.
- Performance with artificial test-data.

## 1. Properties of a good hash-combine function

`NSDictionary` allows fast access to an object by key no matter how many elements there are. To do this, the `hash` method is used. `hash` must be equal for objects that are equal and should differ for objects that are not equal.
If the `hash` of different objects is equal, it’s called a collision, and `NSDictionary` will call `isEqual:` on all of them to find the proper key. More on this topic by NSHipster and Mike Ash.

That’s what the documentation says. Unfortunately, it’s not that simple in the real world. Internally, the hash table stores keys in so-called buckets. The number of buckets is somewhat bigger than the number of keys, as we can see in the CF-Foundation sources.

| Number of keys | Number of buckets (actual NSDictionary size) |
|---|---|
| 0 | 0 |
| 3 | 3 |
| 6 | 7 |
| 11 | 13 |
| 19 | 23 |
| 32 | 41 |
| 52 | 71 |
| 85 | 127 |
| 118 | 191 |
| 155 | 251 |
| 237 | 383 |
| 390 | 631 |
| 672 | 1087 |

and so on …

We can see a few facts: the dictionary is mostly filled to about 60%, and the actual size is always a prime number.
There is a blog post that covers this in more detail.
In short: each key must be put into a single bucket, so we need to calculate the index of a bucket based on `hash`. Since `NSUIntegerMax`, the maximum possible `hash` value, is much bigger than the number of buckets, the hash has to be reduced to the range `[0..number_of_buckets)` somehow. The way it’s done is simple: a plain modulo operation, `hash % number_of_buckets`, is used.
The more keys share the same `hash` modulo the bucket count, the more collisions we get. When a collision happens, the colliding key occupies the next empty bucket. This can lead to keys grouping in one area, which is really bad, as I’ll cover in the next section.

In the next section we will review popular methods to combine hashes.

## 2. Existing implementations

#### Naive Hash Building

The first thing that comes to mind is to `+` or `^` the 2 dates:

```
- (NSUInteger)hash {
    return self.start.hash + self.end.hash;
}
```

Well, `+` or `^` of 2 integers really is a single instruction, and everything looks good. Except it’s not: it produces too many collisions, as I’ll show later.

#### C++ Boost Hash Combine

Boost is a well-known, high-quality and efficient C++ library. It also contains hash combining. The relevant function is rather simple:

```
template <typename SizeT>
inline void hash_combine_impl(SizeT& seed, SizeT value)
{
    seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
```

Boost uses a sort of Shift-Add-XOR hash function.

#### Mike Ash

There is an article from Mike Ash on this topic as well. He suggests using rotate and xor:

```
#define NSUINT_BIT (CHAR_BIT * sizeof(NSUInteger))
#define NSUINTROTATE(val, howmuch) ((((NSUInteger)val) << howmuch) | (((NSUInteger)val) >> (NSUINT_BIT - howmuch)))

- (NSUInteger)hash
{
    return NSUINTROTATE(self.start.hash, NSUINT_BIT / 2) ^ self.end.hash;
}
```

#### Apache.Commons

In Java there is Apache.Commons.HashCodeBuilder. It uses a simple prime-number multiplication algorithm; in Objective-C it looks like this:

```
- (NSUInteger)hash {
    NSUInteger hash = 17;
    hash = hash * 37 + self.start.hash;
    hash = hash * 37 + self.end.hash;
    return hash;
}
```

#### CFHashBytes

Most of the Foundation objects override the `hash` method. Look:

```
[NSNumber numberWithInt:1].hash = 2654435761
[NSNumber numberWithInt:2].hash = 5308871522
[[NSDate dateWithTimeIntervalSince1970:0] hash] = 2596853616923779200
[[NSDate dateWithTimeIntervalSince1970:1] hash] = 2596853614269343439
[@"hello_hello_hello_hello" hash] = 12392780783853204868
[@"hello_hello_hello_hello1" hash] = 4243784906404336054
```

A change in 1 bit produces a very different result. This is all done to reduce the number of collisions.

Starting with iOS 10, there is an `NSDateInterval` class that does just what we need.
Taken from the Swift sources:

```
public var hashValue: Int {
    var buf: (UInt, UInt) = (UInt(start.timeIntervalSinceReferenceDate), UInt(end.timeIntervalSinceReferenceDate))
    return withUnsafeMutablePointer(to: &buf) {
        $0.withMemoryRebound(to: UInt8.self, capacity: 2 * MemoryLayout<UInt>.size / MemoryLayout<UInt8>.size) {
            return Int(bitPattern: CFHashBytes($0, CFIndex(MemoryLayout<UInt>.size * 2)))
        }
    }
}
```

Under the hood, it uses the `CFHashBytes` function. Unfortunately, it’s a private function in CoreFoundation. We can still find it in the open-source Apple CoreFoundation. This implementation uses the ELF hash algorithm.

#### FNV

FNV stands for Fowler/Noll/Vo. It was developed for hashing strings, but can be generalized to hash any byte sequence. It is also used in the Roslyn .NET compiler.

```
void FNVHashUpdate(NSUInteger *h, NSUInteger value) {
    NSUInteger magic = 1099511628211lu; // the 64-bit FNV prime
    NSUInteger hash = *h;
    unsigned char *p = (unsigned char *)&value;
    for (int i = 0; i < sizeof(NSUInteger); i++) {
        hash = (hash ^ p[i]) * magic;
    }
    *h = hash;
}
```
```

#### One-at-a-Time Hash

Made by Bob Jenkins, it’s the default hash function in Perl. I’ve no idea how he came up with this, but it works really well.

```
void OATHashUpdate(NSUInteger *h, NSUInteger value) {
    unsigned char *p = (unsigned char *)&value;
    NSUInteger hash = *h;
    for (int i = 0; i < sizeof(NSUInteger); i++) {
        hash += p[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    *h = hash;
}

void OATHashFinalize(NSUInteger *h) {
    NSUInteger hash = *h;
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    *h = hash;
}

- (NSUInteger)hash {
    NSUInteger hash = 0;
    OATHashUpdate(&hash, self.start.hash);
    OATHashUpdate(&hash, self.end.hash);
    OATHashFinalize(&hash);
    return hash;
}
```

#### JenkinsHashCombine

The most recent hash function from Bob Jenkins is called SpookyHash. The implementation is too long to inline here. See for yourself.

## 3. Performance on artificial test-data

With all these implementations in hand, let’s run a quick test to see which one is the fastest. I picked simple, generic data: composed objects that contain 2 random integer values:

```
Num_of_samples = 10_000_000
arc4random(num_of_samples)
```

Those objects are put into `NSDictionary` as keys, and we are interested in the duration of this operation.

See the test project for more details.

The values are in seconds; the smaller, the better.

`xor` didn’t complete in any reasonable time.

Why are some functions better than others? Because of collisions. Let’s verify this using a key distribution plot.
Since we know the number of buckets, we can simulate the construction of `NSDictionary`. For each `hash` we find an empty index, and for each index we visit, we increment its visit count. A plot is better than words. We can’t plot that much data, so I’ll use 500 samples.

The number of buckets for holding 500 keys is 1087 (according to the table of sizes above) - that’s the X axis on the plot.
The Y axis is the number of visits to each bucket, which is the actual number of `isEqual:` calls.
A good hash function should be evenly distributed over the 0..1087 range.

We can see that `sum` is slightly grouped around the center, which makes sense for a sum of 2 numbers.
`xor` has its maximum at 510 - that’s the random-number upper bound (500) plus collisions. `xor` only uses half of the available interval, which leads to huge spikes. Because each collision occupies the next empty bucket, the number of collisions grows non-linearly.
The other functions don’t have visible peaks and are distributed evenly.

Basically all but the trivial functions have similar performance. `sum` and `xor` have too many collisions, which leads to poor overall performance. Using advanced hash functions such as `fnv`, `oat`, `elf` and `jenkins` doesn’t really make sense on typical input. Most of the time, the functions from Mike Ash, Boost and Apache Commons are sufficient. I’ll be using the Apache version in my current project.

Big thanks to @FalconSer for reading drafts and for the suggestions.

Happy hashing!