Converting some given data to a numeric value, using a hash function. Hash functions are deterministic and should be fast. Hashing has several important applicaitons, including the HashTable data structure and for security in communication protocols.
The C++ map
data type implements the idea of dictionaries.
map<T1, T2> m;
where T1 is the key and T2 is the value.<
function (strings, numbers, pointers)map<string, int> m;
string token;
while(cin >> token) { // count number of times each word appears
m[token]++;
}
m.erase("some_key"); // removes key "some_key" from the map
Converts some data to an index in the range [0..K-1], K is number of slots.
Important Characteristics of Hash Functions:
Hashing does not require unique keys, however same keys will map to same place causng collision.
Hashing doesn't work well for ordered operations (like print in alphabetical order) since similar keys should map to wildly different locations.
Example: Human names have poor distribution (lots of S's, few X's)
Consists of vector, hash function, collision strategy.
Can insert, delete, lookup in amortized O(1) constant time.
Also known as open addressing, uses linear probing (adding one and trying again):
EMPTY
statusEMPTY
, ZOMBIE
, ACTIVE
)ACTIVE
Issues:
insert
approaches O(N)lookup
and remove
approaches O(N)Each slot has a bucket of elements. Hash Table is a list-of-lists, called open hashing with chaining.
With K buckets and N records:
Perfect hashing is an injective function that maps every input to a distinct output. It is a minimal perfect hashing function if it can map n keys to n distinct hash values.
With information about the data or its distribution, one can design a huristic function that performs better for the special case of data. For example if the data is mostly sequential, then taking an integer representation of the data and modding it will yield a very good distribution.
With a universal hashing algorithm, the collision chance should be exactly 1/n for n buckets, regardless of key similarity. Will perform worse than a perfect hashing algorithm catered for a specific set of data.
The hash index is computed as the hash value multiplied by some large value. Works well for the same reason linear congruences generate seemingly random numbers.
hash = some_prime
K = 33; // or 31 or 27
word.each do |character|
hash = (hash * K) + character
end
Properties of cryptographic hashing functions:
MD5, SHA0 and SHA1 examples of "broken" algorithms.