Are a form of maps common in compilers for symbol tables (tables storing variable's names and address) and operating systems for registry of environment variables (variables defining the shell options).
Bucket Arrays // where the items are stored.
Hash Functions (algorithm for storing items in the hash table) // How we locate the items.
An array A with size N
If keys are integers then insert will put entry e in array rank k, A[k] = e, where entry = (key, value). We could then recover the value given k, with get(k) by accessing A[k].
Any bucket with array cell not assigned an element should contain a special NO_SUCH_KEY object. Naturally when the bucket array is first created with no keys assigned all cells contain NO_SUCH_KEY
The map methods get(k) and insert(k,e) should take O(1).
There are problems:
What if the keys are not unique?
What if the keys are not integers?
If there are many keys or if N is too small then a collision can occur. We have to take care of this. If the keys are not integers then we need a function form keys to integers: a hash function.
A hash function maps general keys to a finite range of integers.
Hash function: general keys --> finite set of integers
The hash function has two parts:
Hash code, maps general keys to integers.
Hash code: general keys --> Whole range of integers (Not really.)
Compression map, maps the infinite set of integers to a finite range of integers.
Compression map: Whole range of integers --> finite range of integers
We like for the hash code to avoid collisions, but this is not always possible. Collisions are when distinctly different keys map to the same hash values. Why is this possible? Collisions are unavoidable because:
finite set of integers even if we use int
algorithm may get too complicated otherwise
We require that if keys are equal then the hash code should return the same integer value. If hash code function is h, then:
if k1 = k2 then h(k1) = h(k2)
For efficiency of the hash table we would like for the hash code to spread out the mapping uniformly across the integers.
Casting to Integer:
This works great for int and char, but not very well for doubles or Strings. Why? Chopping would give the same value for the hashcode.
Summing components:
static int hashCode(long i) { return (int)((i >> 32) + (int) i):} // good for long
In general Sum(xk) for all k where xk are the components.
Good for longs mixes significant an insignificant parts, collisions still occur. Not good for Strings. Why? Hint programmers like to use similar variable names.
Polynomial hash codes:
Assume the key is of the form of a
Object with multiple components, (x0, x1,
x2, ..., xk-1), for
example a String, the individual objects characters. The
polynomial:
x0ak-1 + x1ak-2 + ... + xk-2a + xk-1
where a is a constant of the hash
code, mixes all the components. Horner's form is more
efficient to use:
xk-1 + a(xk-2 + a(xk-3 + ... + a(x2 + a(x1 + ax0)) ... ))
This is a very common hash code. The author has checked for the number of collisions for 50,000 English words as a function of the constants a. They have found that a = 33, 37, and 41 have less then 7 collisions. Java uses a = 31, which I suspect also gives low collision rates.
Cyclic shift hash codes:
A variant of the polynomial hash code but replaces multiplication of a with cyclic shifts. Again mixes the components.
We need to compress the hash value so we can find the proper bucket in a finite array.
The Division Method:
h(k) = |k| mod N
N, the size of the array, should be a prime number so that it spreads out the generated codes for typical sequence of integers. Consider N=10 and sequence 10, 20 , 30, ... or 5, 10, 15, 20, ...
We want to different integers two have probability 1/N for
generating a bucket value.
The Mad Method:
Multiply-Add-Divide
h(k) = |ak + b| mod N