Is it necessary to assume a distribution of keys for hashing to obtain constant access time guarantees?

Blog

Directory

Submitted by srchvrs on Sat, 02/27/2016 - 12:23

This post duplicates my Quora answer. The question is regarding distribution assumptions necessary to obtain constant time access to a hash table. Feel free to vote and comment there.

In the ideal world, you would assume that there is a distribution of keys and this would allow you to analyze an average case behavior. However, in the real world, this is extremely hard to do for two reasons: (1) it's not clear which distribution to use, (2) your math will be crazy complex if doable at all.

So, as Daniel Tunkelang noted, we make a Simple Uniform Hashing Assumption (SUHA) and blindly assume that things are going to be alright: That is, the keys will be miraculously distributed evenly among buckets. This is not the only simplifying assumption about hash functions, there are several others. For example, in the analysis of locality sensitive hashing, we assume that a hash function is randomly and independently reselected for each pair of data points (see my notes here: Does Locality Sensitive Hashing (LSH) analysis have a fatal flaw?)

Note, however, that the SUHA assumption doesn't allow you to tell anything about the worst case complexity (as noted by Mark Gritter). It allows you to establish an average-case complexity (guarantee). If you want to optimize for the worst case, you may opt to use a perfect hash function. For a prespecified set of values (from a static universe of possible hash keys), the perfect hash function always hashes different elements into different buckets. In other words, the perfect hash function is collision-free.

One catch here is that the hash function is not specified for keys outside a given domain. For example, you can hash perfectly integers from 0 to 1000, but you won't know how to deal with 1001. A Cuckoo hashing doesn't have this limitation, while still allowing to answer queries in O(1) worst case time.

Sounds good, eehh? Well, actually both the cuckoo and perfect hashing share a common disadvantage: indexing is a much more expensive procedure compared to the classic hashing scheme. AFAIK, there are only probabilistic guarantees of a success. In practice, I think it is very unlikely that you won't be able to create a hash table, but it may take you quite a while to do so.

Obviously, there are better and worse hash functions. With better hash functions, keys are distributed among buckets more or less uniformly. This is not necessarily true if your hash function is bad. In his famous book, Donald Knuth considers hash function testing in detail. Are bad functions completely useless? In my experience, this is not necessarily so (but better hash functions do lead to substantially better performance).

While even bad hash functions may be ok for many practical purposes, an adversarial selection of keys and their insertion order may cause a real performance problem. For many hash functions, a hacker who knows your hash function may select such a sequence of keys that would result in a nearly O(N) insertion time (N is the number of entries).

One solution to this problem (not yet adopted in all the mainstream languages) is randomized hashing: One may use the same hash function, but some hashing parameter will be selected randomly for each hash table that you create in your program (see, e.g., Use random hashing if you care about security? )

srchvrs's blog

You are here