CS 228 meeting -*- Outline -*-

* hashing (HR 12.6)

   Suppose the poison control center has a *very* slow computer,
   or that the process of comparing a typed name to the name of a
   poison is very slow (because it corrects for spelling mistakes, etc.)

   This brings up the question: can we do better than O(log N)?

   In cases where there are only a few possible keys, e.g., digits 0-9,
   there is no problem, just use an array indexed by the key.
   But in general that takes too much space, and it's not obvious
   how to use this for the poison control center.

------------------------------------------
          HASHING (HR 12.6)

THE PROBLEM

 Can we search in O(1) time?

THE SOLUTION IDEA

 Idea: use a "hash function" to map
          key -> index
 then extract information in O(1) time
 using the index from a "hash table".

 Picture:

------------------------------------------
   Draw an array of poison records, of size say 1009, with
   ... some poison names and treatments (arsenic, etc.)
   Be sure it's NOT sorted!!!
   See fig 12.12 for an example

** hash functions (p. 565)

------------------------------------------
            HASH FUNCTIONS

Problem: Map a poison name to 0..TAB_SIZE-1.
  Want to "spread out" the results.

Solutions?

 1. hash1(s) == 0

 2. hash2(s) == int(s[0]) % TAB_SIZE

 3.

------------------------------------------
   1. doesn't spread them out at all, but is actually useful in
      incremental development of a hash program (for debugging).
   2. can only map to 26 (or 52 if you count capitals) different things.
   ...
   3. hash3(s) == SumCharsMod(s, TAB_SIZE)

------------------------------------------
     A HASH FUNCTION FOR STRINGS

// SumCharsMod.h
#include "String.h"
extern int SumCharsMod(const String & s,
                       int n);
  // PRE: n > 0
  // POST: FCTVAL = the sum of the
  //   character codes in s, modulo n.
// SumCharsMod.C
#include "SumCharsMod.h"
int SumCharsMod(const String & s, int n)
{
  int s_size = Length(s);
  int hash_val = 0;
  // INV: hash_val is the sum of
  //   the char codes s[0..i-1] mod n
  //   && i <= s_size
  for (int i = 0; i < s_size; i++) {
    hash_val = (hash_val + int(s[i])) % n;
  }
  return hash_val;
}
------------------------------------------
   Q: why isn't   hash_val += int(s[i]) % n;   correct?

   Notes:
     experimental results show that this kind of hash function is good
     it's best if TAB_SIZE is a prime,
       might choose 1009, or some other large prime.
     I use a cast, although it's not strictly necessary, because I
       don't like to remember that chars coerce to ints...

** collisions (p. 567 ff.)

------------------------------------------
              COLLISIONS

def: a *collision* occurs when a hash
     function maps two keys to the
     same index.

Example:
  SumCharsMod(String("arsenic"), 301)
  = SumCharsMod(String("iron"), 301)
  = 139

  SumCharsMod(String("barbitone"), 301)
  = SumCharsMod(String("copper"), 301)
  = 47

  SumCharsMod(String("meprobamate"), 301)
  = SumCharsMod(String("thallium"), 301)
  = 262

  SumCharsMod(String(
        "phenobarbitone"), 1009)
  = SumCharsMod(String(
        "cholinesterase inhibitor"), 1009)
  = 479
------------------------------------------
   def: two keys that hash to the same value are called *synonyms*

   def: a *perfect hashing function* for a set of data is one
        that has no collisions.

   These are hard to find.  So in general, we live with collisions.

*** collision resolution

------------------------------------------
    COLLISION RESOLUTION OVERVIEW

linear probing: if a collision is detected,
   look at the next index,
   modulo table size

 variations:
   quadratic probing: +1, -1, +4, -4, +9..
   random probing: add random number
   rehashing: use another hash function

chaining: list of synonyms for each index
------------------------------------------

*** linear probing

------------------------------------------
            LINEAR PROBING

      name          treatment
   |-----------------------------|
 0 | gasoline    | don't light ..|
   |-------------|---------------|
 1 | arsenic     | see a doctor  |
   |-------------|---------------|
 2 | iodine      | apply victum..|
   |-------------|---------------|
 3 |             |               |
   |-------------|---------------|
 4 |             |               |
   |-------------|---------------|
 5 |             |               |
   |-------------|---------------|
 6 | caffine     | induce vomit..|
   |-------------|---------------|
 7 |             |               |
   |-------------|---------------|
 8 | kelp        | drink seawat..|
   |-------------|---------------|
 9 |             |               |
   |-----------------------------|
------------------------------------------
   Suppose the table size is 10, and we want to add
   ovaltine (drink milk) to the table.
   But its hash for a table of size 10 is 6,
   so it's a synonym for caffine.

   Linear probing: does a linear search for an unused element
      (note, have to have some way of telling which are unused)
      so in this case, ovaltine goes in slot 7.

   Now suppose we want to add nicotine (don't smoke) to the table.
   Its hash is 7, and slots 7 and 8 are taken,
   so we eventually put it in 9.

   If we try to add old lace (buy new lace) to the table,
   it hashes to 6, and slots 6 through 9 and 0 through 2 are taken,
   so we have to put it in 3.
   Note the wrap-around.

------------------------------------------
         PROBLEM: CLUSTERING

def: *primary clustering* results from
     keys that hash to adjacent indexes.

     example: gasoline, arsenic and iodine.

def: *secondary clustering* results when
     different clusters merge to form
     a larger cluster.

     examples: ovaltine, nicotine
------------------------------------------
   The secondary clustering really hurts performance

   Q: what's an easy way to reduce the problems of clustering
      without writing new code?
      make sure the table doesn't fill up (keep it sparse)

   Also problematic:
      when the table is full, have to watch for infinite loops
         on insertion and searching
      deletion is difficult, as can't break up clusters

*** chaining

   better than the variations on linear probing;
   it's what I usually use.

   Idea: store a linked list at each index,
         this avoids all secondary clustering!

------------------------------------------
              CHAINING

               name     tre  link
   |----|
 0 |  *---> [gasoline|do..|  ]
   |----|
 1 |  *---> [arsenic |se..|  ]
   |----|
 2 |  *---> [iodine  |ap..|  ]
   |----|
 3 |    |
   |----|
 4 |    |
   |----|
 5 |    |
   |----|
 6 |  *---> [caffine |in..|  ]
   |----|
 7 |    |
   |----|
 8 |  *---> [kelp    |dr..|  ]
   |----|
 9 |    |
   |----|
------------------------------------------
   Note, the way to tell what's unused is by a null pointer.

   Suppose the table size is 10, and we want to add
   ovaltine (drink milk) to the table.
   But its hash for a table of size 10 is 6,
   so it's a synonym for caffine.

   chaining: add ovaltine to the head of the list at element 6.

   Now suppose we want to add nicotine (don't smoke) to the table.
   Its hash is 7, so start a new list with it.

   If we try to add old lace (buy new lace) to the table,
   it hashes to 6, so add it to the same list as caffine.

------------------------------------------
        CAVEATS WITH CHAINING

What is the worst case search time?

How does that happen?

------------------------------------------
   ... O(N),
   ... when all keys are in the same bucket
   so the hash function is more important

------------------------------------------
        EFFICIENCY OF HASHING

Factors:
  - quality of hash function (for the data)
  - collision resolution technique
  - availability of space

def: the *load factor* of a hash table is
     n/s, where n = number of entries
            and s = size of table
------------------------------------------
   Q: what is the load factor for a full table with linear probing?
   Q: can a table have a load factor of more than 1 with chaining?

   show figure 12.17 in the book

** comparison (p. 572)

------------------------------------------
         SEARCHING COMPARISON

            |linear   binary   chained
            |search   search   hash table
            |=============================
special     | none    sorted   hash
requirements|                  function
            |-----------------------------
worst case  |
time cost   |
            |-----------------------------
average case|
time cost   |
------------------------------------------
   Q: what's the time cost measured in?
      comparisons (String::operator < or operator ==)

   Q: If we had to sort the data by key anyway, which should we use?

   Q: Is a chained hash table best for the poison control center?
      maybe not, as in the worst case it might be really bad.
      might be better to use binary search
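
   A minimal sketch of the chaining scheme from the CHAINING slide,
   for trying the ovaltine / nicotine / old lace example by hand.
   Assumptions (none of this is from HR): it uses std::string instead
   of the course's String class, the PoisonTable class and its member
   names are made up for illustration, Insert assumes the name is not
   already in the table, and character codes are taken to be ASCII.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Same arithmetic as SumCharsMod in the notes, but on std::string
// (an assumption; the course code takes a String).
// PRE: n > 0
int SumCharsMod(const std::string& s, int n) {
    int hash_val = 0;
    for (unsigned char c : s) {
        hash_val = (hash_val + int(c)) % n;
    }
    return hash_val;
}

// Hypothetical chained hash table: one list of
// (name, treatment) pairs per table index.
class PoisonTable {
public:
    // PRE: size > 0
    explicit PoisonTable(int size) : buckets_(size), count_(0) {}

    // Adds the entry at the head of its chain.
    // Assumes name is not already present.
    void Insert(const std::string& name, const std::string& treatment) {
        buckets_[Index(name)].emplace_front(name, treatment);
        ++count_;
    }

    // Returns a pointer to the treatment, or nullptr if name is absent.
    // Cost is one comparison per entry in the chain that is searched.
    const std::string* Find(const std::string& name) const {
        for (const auto& entry : buckets_[Index(name)]) {
            if (entry.first == name) return &entry.second;
        }
        return nullptr;
    }

    // Number of synonyms stored at one index.
    std::size_t ChainLength(int index) const {
        const auto& chain = buckets_[index];
        return static_cast<std::size_t>(
            std::distance(chain.begin(), chain.end()));
    }

    // n/s, where n = number of entries and s = size of table.
    double LoadFactor() const {
        return double(count_) / double(buckets_.size());
    }

private:
    int Index(const std::string& name) const {
        return SumCharsMod(name, int(buckets_.size()));
    }

    std::vector<std::forward_list<std::pair<std::string, std::string>>> buckets_;
    int count_;
};
```

   With a table of size 10, inserting caffine, then ovaltine, then
   old lace puts all three in the chain at index 6 (newest first),
   while nicotine starts its own chain at index 7.  Nothing spills
   into neighboring slots, but Find on index 6 now costs up to three
   comparisons, which is the worst case from the CAVEATS slide in
   miniature.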