Hashing

Back to Computer-Science-Concepts

Converting some given data to a numeric value, using a hash function. Hash functions are deterministic and should be fast. Hashing has several important applicaitons, including the HashTable data structure and for security in communication protocols.

Map (Dictionary)

The C++ map data type implements the idea of dictionaries.

map<string, int> m;
string token;
while(cin >> token) { // count number of times each word appears
    m[token]++;
}
m.erase("some_key"); // removes key "some_key" from the map

Hash Functions

Converts some data to an index in the range [0..K-1], K is number of slots.

Important Characteristics of Hash Functions:

  1. Deterministic based on key (always get same answer from same input)
  2. Uniform spread over buckets/slots.
  3. Cheap to compute.
  4. Supports a variable range (as K scales)

Hashing does not require unique keys, however same keys will map to same place causng collision.

Hashing doesn't work well for ordered operations (like print in alphabetical order) since similar keys should map to wildly different locations.

Example: Human names have poor distribution (lots of S's, few X's)

Hash Table Structure

Consists of vector, hash function, collision strategy.

Can insert, delete, lookup in amortized O(1) constant time.

Closed Hashing

Also known as open addressing, uses linear probing (adding one and trying again):

Issues:

Open Hashing

Each slot has a bucket of elements. Hash Table is a list-of-lists, called open hashing with chaining.

With K buckets and N records:

Hashing Algorithms

Perfect hashing is an injective function that maps every input to a distinct output. It is a minimal perfect hashing function if it can map n keys to n distinct hash values.

With information about the data or its distribution, one can design a huristic function that performs better for the special case of data. For example if the data is mostly sequential, then taking an integer representation of the data and modding it will yield a very good distribution.

With a universal hashing algorithm, the collision chance should be exactly 1/n for n buckets, regardless of key similarity. Will perform worse than a perfect hashing algorithm catered for a specific set of data.

Multiplicative Hashing

The hash index is computed as the hash value multiplied by some large value. Works well for the same reason linear congruences generate seemingly random numbers.

hash = some_prime
K = 33; // or 31 or 27
word.each do |character|
    hash = (hash * K) + character
end

Cryptographic Hash Function

Properties of cryptographic hashing functions:

MD5, SHA0 and SHA1 examples of "broken" algorithms.

Evaluating hash functions