CS 228 meeting -*- Outline -*-

* hashing (HR 12.6)

	Suppose the poison control center has a *very* slow computer,
	or, because of typing mistakes, that the process of comparing
	a typed name to the name of a poison is very slow
		(corrects for spelling mistakes, etc.)

	This brings up the question: can we do better than O(log N)?

	In cases where there are only a few possible keys, e.g., digits 0-9,
		there is no problem: just use an array indexed by the key.
	But for a large key space that takes too much space, and it's not
		obvious how to index an array by poison name.

------------------------------------------
         HASHING (HR 12.6)
            THE PROBLEM

Can we search in O(1) time?


	    THE SOLUTION IDEA

Idea: use a "hash function" to map
	key -> index
then extract information in O(1) time
using the index from a "hash table".

Picture:


------------------------------------------
	Draw an array of poison records, of size say 1009, with ...
		some poison names and treatments (arsenic, etc.)
		Be sure it's NOT sorted!!!

	See fig 12.12 for an example

** hash functions (p. 565)

------------------------------------------
          HASH FUNCTIONS

Problem:

   Map a poison name to 0..TAB_SIZE-1.
   Want to "spread out" the results.


Solutions?

  1. hash1(s) == 0

  2. hash2(s) == int(s[0]) % TAB_SIZE

  3.
------------------------------------------
	1. doesn't spread them out at all, but is actually useful
		in incremental development of a hash program (for debugging).

	2. can only map to 26 (or 52 if capitals count) different indexes.

	... 3. hash3(s) == SumCharsMod(s, TAB_SIZE)
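
	A minimal sketch of the three candidates, for comparison.  It assumes
	std::string in place of the course String class, and the names hash1,
	hash2, hash3 and the tab_size parameter are illustrative only.

```cpp
#include <cassert>
#include <string>

// Candidate 1: every key maps to index 0 (useful only as a debugging stub).
int hash1(const std::string & /* s */, int /* tab_size */)
{
  return 0;
}

// Candidate 2: code of the first character, modulo the table size.
// Mapping the empty string to 0 is an assumption made for this sketch.
int hash2(const std::string & s, int tab_size)
{
  assert(tab_size > 0);
  if (s.empty()) {
    return 0;
  }
  return static_cast<unsigned char>(s[0]) % tab_size;
}

// Candidate 3: sum of all character codes, modulo the table size.
int hash3(const std::string & s, int tab_size)
{
  assert(tab_size > 0);
  int sum = 0;
  for (unsigned char c : s) {
    sum = (sum + c) % tab_size;
  }
  return sum;
}
```

	With this sketch, "arsenic" and "antimony" get the same index from
	hash2 because both start with 'a', while hash3 tells them apart.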

------------------------------------------
          A HASH FUNCTION FOR STRINGS

// SumCharsMod.h
#include "String.h"

extern int SumCharsMod(const String & s,
                       int n);
  // PRE: n > 0
  // POST: FCTVAL = the sum of the
  // character codes in s, modulo n.


// SumCharsMod.C
#include "SumCharsMod.h"

int SumCharsMod(const String & s, int n)
{
  int s_size = Length(s);
  int hash_val = 0;

  // INV: hash_val is the sum of
  // the char codes s[0..i-1] mod n
  // && i <= s_size
  for (int i = 0; i < s_size; i++) {
    hash_val = (hash_val + int(s[i])) % n;
  }
  return hash_val;
}
------------------------------------------

	Q: why isn't
		hash_val += int(s[i]) % n;
	 correct?
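
	One way to explore the question is to run both forms side by side.
	This is only a sketch: it uses std::string instead of the course
	String class, and the names SumCharsModGood and SumCharsModBad are
	made up for the comparison.

```cpp
#include <cassert>
#include <string>

// Reduces after every addition, so the running value stays in 0..n-1.
int SumCharsModGood(const std::string & s, int n)
{
  assert(n > 0);
  int hash_val = 0;
  for (unsigned char c : s) {
    hash_val = (hash_val + c) % n;
  }
  return hash_val;
}

// The variant from the question: % binds tighter than +=, so only the
// character code is reduced and the running sum is never brought back
// into the range 0..n-1.
int SumCharsModBad(const std::string & s, int n)
{
  assert(n > 0);
  int hash_val = 0;
  for (unsigned char c : s) {
    hash_val += c % n;
  }
  return hash_val;
}
```

	For a key such as "iron" with n = 301, the second form returns the
	plain sum of the character codes, which is not a valid index into
	a table of 301 elements.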

	Notes: experimental results show that this kind of hash function
			is good
		it's best if TAB_SIZE is a prime,
			might choose 1009, or some other large prime.

		I use a cast, although it's not strictly necessary,
		because I don't like to remember that chars coerce to ints...

** collisions (p. 567 ff.)

------------------------------------------
           COLLISIONS

def: a *collision* is two keys that a hash
  function maps to the same index.

Example:
  SumCharsMod(String("arsenic"), 301)
  = SumCharsMod(String("iron"), 301)
  = 139

  SumCharsMod(String("barbitone"), 301)
  = SumCharsMod(String("copper"), 301)
  = 47

  SumCharsMod(String("meprobamate"), 301)
  = SumCharsMod(String("thallium"), 301)
  = 262

  SumCharsMod(String(
        "phenobarbitone"), 1009)
  = SumCharsMod(String(
        "cholinesterase inhibitor"), 1009)
  = 479

------------------------------------------

	def: two keys that hash to the same value are called *synonyms*

	def: a *perfect hashing function* for a set of data
		is one that has no collisions.
	These are hard to find.

	So in general, we live with collisions.
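
	The collision examples can be checked mechanically.  The sketch below
	assumes std::string in place of the course String class and ASCII
	character codes; AreSynonyms is a hypothetical helper, not part of
	the course code.

```cpp
#include <cassert>
#include <string>

// Same idea as SumCharsMod on the slide, written against std::string.
int SumCharsMod(const std::string & s, int n)
{
  assert(n > 0);
  int hash_val = 0;
  for (unsigned char c : s) {
    hash_val = (hash_val + c) % n;
  }
  return hash_val;
}

// True when the two keys collide in a table of size n.
bool AreSynonyms(const std::string & a, const std::string & b, int n)
{
  return SumCharsMod(a, n) == SumCharsMod(b, n);
}
```

	Any two keys with the same letters in a different order are synonyms
	under this function, since addition ignores order.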

*** collision resolution
------------------------------------------
         COLLISION RESOLUTION
	      OVERVIEW

linear probing:
   on a collision, look at the next
   index, modulo table size

variations:
   quadratic probing: +1, -1, +4, -4, +9..
   random probing: add random number
   rehashing: use another hash function

chaining:
   list of synonyms for each index
------------------------------------------

*** linear probing
------------------------------------------
         LINEAR PROBING

	name         treatment
     |-----------------------------|
   0 | gasoline    | don't light ..|
     |-------------|---------------|
   1 | arsenic     | see a doctor  |
     |-------------|---------------|
   2 | iodine      | apply victum..|
     |-------------|---------------|
   3 |             |               |
     |-------------|---------------|
   4 |             |               |
     |-------------|---------------|
   5 |             |               |
     |-------------|---------------|
   6 | caffine     | induce vomit..|
     |-------------|---------------|
   7 |             |               |
     |-------------|---------------|
   8 | kelp        | drink seawat..|
     |-------------|---------------|
   9 |             |               |
     |-----------------------------|


------------------------------------------
	Suppose the table size is 10, and we want to add
		ovaltine (drink milk)
	to the table.  But its hash for a table of size 10 is 6,
	so it's a synonym for caffine.

	Linear probing: does a linear search for an unused element
		(note, have to have some way of telling which are unused)
	so in this case, ovaltine goes in slot 7.

	Now suppose we want to add
		nicotine (don't smoke)
	to the table.  Its hash is 7, but 7 (ovaltine) and 8 (kelp) are
	taken, so we eventually put it in 9.

	If we try to add
		old lace (buy new lace)
	to the table, it hashes to 6, but 6-9 and 0-2 are all taken,
	so we have to put it in 3.  Note the wrap-around.
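
	The insertions above can be traced with a small sketch.  It assumes
	std::string and std::vector in place of the course classes, a "used"
	flag as the way of telling which elements are unused, and no check
	for duplicate names; Slot and InsertLinear are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Slot {
  bool used = false;        // marks whether this element holds a record
  std::string name;
  std::string treatment;
};

int SumCharsMod(const std::string & s, int n)
{
  assert(n > 0);
  int hash_val = 0;
  for (unsigned char c : s) {
    hash_val = (hash_val + c) % n;
  }
  return hash_val;
}

// Stores the record in the first unused element at or after its hash
// index, wrapping around at the end of the table.
// Returns the index used, or -1 if every element is already in use.
int InsertLinear(std::vector<Slot> & table,
                 const std::string & name,
                 const std::string & treatment)
{
  const int size = static_cast<int>(table.size());
  if (size == 0) {
    return -1;
  }
  const int start = SumCharsMod(name, size);
  for (int probes = 0; probes < size; probes++) {
    const int i = (start + probes) % size;
    if (!table[i].used) {
      table[i].used = true;
      table[i].name = name;
      table[i].treatment = treatment;
      return i;
    }
  }
  return -1;
}
```

	Inserting gasoline, arsenic, iodine, caffine and kelp into a table of
	size 10, then ovaltine, nicotine and old lace, follows the same path
	as the walk-through.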

------------------------------------------
	PROBLEM: CLUSTERING

def: *primary clustering* results from
     keys that hash to adjacent indexes.

    example: gasoline, arsenic and iodine.

def: *secondary clustering* results when
    different clusters merge to form
    a larger cluster.

    examples: ovaltine, nicotine
------------------------------------------

	The secondary clustering really hurts performance
	Q: what's an easy way to reduce the problems of clustering
		without writing new code?
		make sure the table doesn't fill up (keep it sparse)

	Also problematic:
		when the table is full, have to watch for infinite loops
			on insertion and searching

		deletion is difficult: emptying a slot in the middle of a cluster
			would cut off searches for the keys stored after it
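
	One way to guard against the infinite loop is to count probes and
	give up after one full pass.  This is a sketch under assumptions:
	std::string and std::vector stand in for the course classes, a
	"used" flag marks occupied elements, and Slot and FindLinear are
	hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Slot {
  bool used = false;        // marks whether this element holds a record
  std::string name;
  std::string treatment;
};

int SumCharsMod(const std::string & s, int n)
{
  assert(n > 0);
  int hash_val = 0;
  for (unsigned char c : s) {
    hash_val = (hash_val + c) % n;
  }
  return hash_val;
}

// Returns the index holding name, or -1 if it is not in the table.
// The search stops at the first unused element, or after probing every
// element once, so it terminates even when the table is full.
int FindLinear(const std::vector<Slot> & table, const std::string & name)
{
  const int size = static_cast<int>(table.size());
  if (size == 0) {
    return -1;
  }
  const int start = SumCharsMod(name, size);
  for (int probes = 0; probes < size; probes++) {
    const int i = (start + probes) % size;
    if (!table[i].used) {
      return -1;            // an unused element ends the cluster
    }
    if (table[i].name == name) {
      return i;
    }
  }
  return -1;                // full pass without a match
}
```

	The early return at an unused element is also why deletion is
	awkward: clearing the flag on a slot inside a cluster would make this
	search stop before reaching keys stored further along.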

*** chaining
	better than the variations on linear probing
	what I usually use.

	Idea: store a linked list at each index,
		this avoids all secondary clustering!

------------------------------------------
              CHAINING

	    name     tre link
  |----|
0 |  *---> [gasoline|do..|  ]
  |----|
1 |  *---> [arsenic |se..|  ]
  |----|
2 |  *---> [iodine  |ap..|  ]
  |----|
3 |    |
  |----|
4 |    |
  |----|
5 |    |
  |----|
6 |  *---> [caffine |in..|  ]
  |----|
7 |    |
  |----|
8 |  *---> [kelp    |dr..|  ]
  |----|
9 |    |
  |----|


------------------------------------------
	Note: an unused element is marked by a null pointer.

	Suppose the table size is 10, and we want to add
		ovaltine (drink milk)
	to the table.  But its hash for a table of size 10 is 6,
	so it's a synonym for caffine.

	chaining: add ovaltine to the head of the list at element 6.

	Now suppose we want to add
		nicotine (don't smoke)
	to the table.  Its hash is 7, so start a new list with it.

	If we try to add
		old lace (buy new lace)
	to the table, it hashes to 6, so add it to same list as caffine.
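
	A minimal sketch of chaining, assuming std::forward_list for the
	linked lists and std::string for the keys (neither is the course
	code).  An empty list plays the role of the null pointer here, and
	Poison, ChainedTable, InsertChained and FindChained are hypothetical
	names.

```cpp
#include <cassert>
#include <forward_list>
#include <string>
#include <vector>

struct Poison {
  std::string name;
  std::string treatment;
};

// One list of synonyms per table index.
typedef std::vector<std::forward_list<Poison> > ChainedTable;

int SumCharsMod(const std::string & s, int n)
{
  assert(n > 0);
  int hash_val = 0;
  for (unsigned char c : s) {
    hash_val = (hash_val + c) % n;
  }
  return hash_val;
}

// Adds the record at the head of the list for its hash index.
void InsertChained(ChainedTable & table,
                   const std::string & name,
                   const std::string & treatment)
{
  assert(!table.empty());
  const int i = SumCharsMod(name, static_cast<int>(table.size()));
  table[i].push_front(Poison{name, treatment});
}

// Returns a pointer to the record for name, or a null pointer if absent.
// Only the one list selected by the hash index is searched.
const Poison * FindChained(const ChainedTable & table,
                           const std::string & name)
{
  if (table.empty()) {
    return nullptr;
  }
  const int i = SumCharsMod(name, static_cast<int>(table.size()));
  for (const Poison & p : table[i]) {
    if (p.name == name) {
      return &p;
    }
  }
  return nullptr;
}
```

	Adding at the head keeps insertion at constant cost no matter how
	long the list for that index has grown.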

------------------------------------------
            CAVEATS WITH CHAINING

What is the worst case search time?


How does that happen?


------------------------------------------
	... O(N),
	... when all in the same bucket

	so hash function is more important
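
	The effect of the hash function can be seen by counting comparisons.
	The sketch below is an illustration under assumptions: it uses
	std::string and std::forward_list rather than the course classes,
	and HashFn, HashZero and ComparisonsToFind are hypothetical names.

```cpp
#include <cassert>
#include <forward_list>
#include <string>
#include <vector>

typedef int (*HashFn)(const std::string &, int);

// The degenerate hash function: every key lands in bucket 0.
int HashZero(const std::string & /* s */, int /* n */)
{
  return 0;
}

int SumCharsMod(const std::string & s, int n)
{
  assert(n > 0);
  int hash_val = 0;
  for (unsigned char c : s) {
    hash_val = (hash_val + c) % n;
  }
  return hash_val;
}

// Builds a chained table from keys (each added at the head of its list)
// and returns how many key comparisons a search for target performs.
int ComparisonsToFind(const std::vector<std::string> & keys,
                      const std::string & target,
                      int tab_size,
                      HashFn hash)
{
  assert(tab_size > 0);
  std::vector<std::forward_list<std::string> > table(tab_size);
  for (const std::string & k : keys) {
    table[hash(k, tab_size)].push_front(k);
  }
  int comparisons = 0;
  for (const std::string & k : table[hash(target, tab_size)]) {
    comparisons++;
    if (k == target) {
      break;
    }
  }
  return comparisons;
}
```

	With HashZero, finding the first key inserted means walking the whole
	list, so the cost grows with the number of entries; with SumCharsMod
	and keys that do not collide, one comparison is enough.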

------------------------------------------
         EFFICIENCY OF HASHING

Factors:
 - quality of hash function (for the data)
 - collision resolution technique
 - availability of space

def: the *load factor* of a hash table
  is n/s, where n = number of entries
            and s = size of table
------------------------------------------
	Q: what is the load factor for a full table with linear probing?
	Q: can a table have a load factor of more than 1 with chaining?
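
	The definition of load factor is simple enough to compute directly.
	A sketch, with LoadFactor as a hypothetical helper name:

```cpp
#include <cassert>

// Load factor n/s: n = number of entries, s = size of the table.
double LoadFactor(int entries, int table_size)
{
  assert(entries >= 0);
  assert(table_size > 0);
  return static_cast<double>(entries) / table_size;
}
```

	Trying it with 10 entries in a table of size 10, and with 15 entries
	chained into a table of size 10, is one way to check answers to the
	two questions.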


	show figure 12.17 in the book

** comparison (p. 572)

------------------------------------------
         SEARCHING COMPARISON

            |linear   binary    chained
            |search   search    hash table
            |=============================
special     | none    sorted    hash
requirements|                   function
	    |-----------------------------
worst case  |
time cost   |
	    |-----------------------------
average case|
time cost   |
------------------------------------------
	Q: what's the time cost measured in?
		comparisons (String::operator < or operator ==)

	Q: If we had to sort the data by key anyway, which should we use?

	Q: Is a chained hash table best for the poison control center?
		maybe not, as in the worst case it might be really bad.
		might be better to use binary search