hash

HASH.M: flexible and fast hashing functions.

note: if you have no clue what hashing is, you might want to read the
short explanation at the end of this doc first.

why use these hashing functions?
--------------------------------

simple:   one does not need to know anything about how hashing works to
          make good use of this module.
flexible: the functions do not make _any_ assumptions about the nature
          of the data, the size of it, or anything else. you may
          therefore use it to hash strings, graphical data, etc.
fast:     not only does this module use a high performance hashing
          function, but the most time-consuming function, hash_find(),
          has also been written in optimised inline assembly.

the object
----------

OBJECT hashtable
  hashtable(tablesize)
  end()
  find(ptr,len)
  add(link,hashvalue,ptr,len)
  iterate(do_proc)
  calc_hash_spread()
ENDOBJECT

methods
-------

hashtable(tablesize)

    This constructor initialises an empty hashtable. tablesize is the size
    of the hashtable; some useful values are predefined:

                      tablesize    size in kb
        HASH_NORMAL   211*         1
        HASH_MEDIUM   941          4
        HASH_HEAVY    3911         16
        HASH_HEAVIER  16267        64

    (*) the tablesizes above are primes close to 2 to the power
        of 8,10,12,14,16 (for performance reasons)

    The larger the table, the faster the search.
    instead of these constants, you can also use your own size,
    if you wish. the size should however be between 2 and 65535,
    and a prime for best performance.

    example:

        DEF t:PTR TO hashtable
        NEW t.hashtable(HASH_HEAVY)

    may raise: "MEM"

end()

    frees the table. memory is also deallocated automatically at the end
    of the program. example:

        END t

hashlink,hashvalue:=t.find(data,len)

    Tries to find data with length len in the hash table. When data is not
    present in the table, hashlink will be NIL. You'll need to keep
    hashvalue only if you wish to call add() next, and hashlink only if
    you wish to access the data associated with whatever you hashed
    (see below).

add(link:PTR TO hashlink,hashvalue,data,len)

    adds data to the table. you'll need to call find() before this one,
    not only to obtain the hashvalue this function needs, but also to be
    sure data isn't already in the table. link must be a subclass of
    'hashlink':

        OBJECT hashlink
        ENDOBJECT     /* SIZEOF=12 */

    the hashtable uses this object to store data in the table. if you wish
    to associate data with whatever you're hashing, make a subclass:

        OBJECT mylink OF hashlink
          type:INT      -> extra field
        ENDOBJECT

    these objects are also the ones returned by find() if data is already
    present in the table. example:

        t.add(NEW linkptr,hashv,ptr,len)

    ptr and len are the same ones you passed to find(). note that add()
    does not copy the data 'ptr' points to, so that data must stay valid
    for as long as the link is in the table.

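The find-then-add protocol can be modelled outside E. The Python sketch
below is only an illustration, not the module's implementation: the class
ToyTable, its placeholder hash (sum of bytes MOD tablesize) and the
dict-based links are all assumptions made for the example. It shows why
find() hands back the hashvalue: add() can reuse it instead of hashing the
data a second time.

```python
# Toy model of the find()/add() protocol (hypothetical; not hash.m itself).
class ToyTable:
    def __init__(self, tablesize):
        # one chain (a Python list) per table entry
        self.chains = [[] for _ in range(tablesize)]

    def find(self, data):
        """Return (link, hashvalue); link is None when data is absent."""
        hashvalue = sum(data) % len(self.chains)   # placeholder hash function
        for link in self.chains[hashvalue]:
            if link["data"] == data:
                return link, hashvalue
        return None, hashvalue

    def add(self, link, hashvalue, data):
        """Hang link in the chain selected by a previous find()."""
        link["data"] = data            # keeps a reference, makes no copy
        self.chains[hashvalue].append(link)


def declare(table, name, type_):
    """Insert name unless already present; return (link, was_new)."""
    link, hashvalue = table.find(name)
    if link is not None:
        return link, False
    link = {"type": type_}             # plays the role of 'mylink'
    table.add(link, hashvalue, name)
    return link, True
```

Calling declare(table, b"count", 1) twice returns the same link both
times; only the first call adds anything to the table.
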
total,num:=t.iterate(proc)

    walks through the whole hashtable, calling proc for every link.
    proc must be something like:

        PROC it(link:PTR TO hashlink,depth)

    depth is how deep the link hangs in its chain, i.e. at least 1.
    iterate() returns the sum of all return values from the calls to
    proc, and the number of links visited.

calc_hash_spread()

    makes use of iterate() to calculate the average number of StrCmp()s
    needed to find some data in this table. see hashtest.e for example use.

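How iterate() and a spread calculation fit together can be sketched in
Python. This is a model under stated assumptions, not the code of hash.m:
the table is represented as a plain list of chains, and the spread is
assumed to be the average chain depth, obtained by passing iterate() a proc
that simply returns the depth it is given.

```python
# Hypothetical model of iterate() and of a spread calculation built on it.
def iterate(chains, proc):
    """Call proc(link, depth) for every link; depth starts at 1 per chain.

    Returns (total, num): the sum of proc's return values and the
    number of links visited.
    """
    total = 0
    num = 0
    for chain in chains:
        for depth, link in enumerate(chain, start=1):
            total += proc(link, depth)
            num += 1
    return total, num


def average_compares(chains):
    """Average chain depth, i.e. compares needed to reach a stored link."""
    total, num = iterate(chains, lambda link, depth: depth)
    return total / num if num else 0.0
```

For the chains [["a","b","c"], [], ["d"]] the depths are 1, 2, 3 and 1, so
average_compares() gives 7/4 = 1.75.
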
example use
-----------

Imagine we're programming a compiler, like EC. To keep things simple,
we only do global variables. first, set up the table:

    DEF t:PTR TO hashtable, identinfo:PTR TO mylink
    NEW t.hashtable(HASH_MEDIUM)      -> see above

now the following piece of code would be entered when we've read
an identifier. in 'ident' and 'len' we have the identifier,
and 'status' says whether we are currently in the scope of a DEF or not:

    -> see if ident is already there
    identinfo,hashvalue:=t.find(ident,len:=EstrLen(ident))
    IF identinfo
      IF status=DEFINITION THEN Raise('double declaration!')
      -> do something with identinfo.type here, for example
    ELSE
      IF status=USE THEN Raise('unknown variable!')
      t.add(NEW identinfo,hashvalue,ident,len)
      -> assign here to identinfo.type, for example
    ENDIF

for those who don't know: what is hashing?
------------------------------------------

Assume the task of having to store identifiers like the ones used
for variables in E, and later finding them again (this could be any
kind of data, but for now just strings). We could put all these strings
in a linked list, and when a new identifier arrives, simply walk down
the list and see if it's there. While this may seem a nice and easy
way of doing it, this becomes unacceptably slow when storing, say,
2000 identifiers: we would, on average, perform 1000 StrCmp()s just
to find one identifier!

then comes hashing, which is a simple and fast technique to dramatically
reduce search times:

take a table of n entries (say, n=256), and take a function which for
every ident can compute a number 0..n-1 (this is called the hashvalue,
and such a function might be: add all ascii values MOD n). when you
now put the idents in n linked lists hanging from the n entries in the
table, you only have to compute the hashvalue to theoretically rule
out 255 of every 256 possibilities. searchtime for our example becomes
2000/256/2 => about 4 StrCmp()s, without any costly computations.
and imagine what happens if you take a larger n, say 4096: then for most
compilations, you reduce to 1 StrCmp() (!).

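The scheme just described (n chains, hashvalue = sum of ascii values MOD n)
is small enough to try out. The following Python sketch is an illustration
only: the identifiers "var0".."var1999" are made up, and the additive hash
is the simple example function from the text, not the one this module
uses. A table with n=1 behaves like the single linked list we started with.

```python
# Sketch of the scheme described above: n chains, additive hash (assumed).
def additive_hash(ident, n):
    """Add all character codes, MOD n."""
    return sum(ident.encode("ascii")) % n


def build_table(idents, n):
    """Distribute idents over n chains according to their hashvalue."""
    table = [[] for _ in range(n)]
    for ident in idents:
        table[additive_hash(ident, n)].append(ident)
    return table


def compares_to_find(table, ident):
    """Number of string compares needed to find ident (it must be present)."""
    chain = table[additive_hash(ident, len(table))]
    return chain.index(ident) + 1


idents = ["var%d" % i for i in range(2000)]   # made-up identifiers
as_list = build_table(idents, 1)              # n=1: one plain list
hashed = build_table(idents, 256)

avg_list = sum(compares_to_find(as_list, s) for s in idents) / len(idents)
avg_hash = sum(compares_to_find(hashed, s) for s in idents) / len(idents)
print(avg_list, avg_hash)
```

The average for the single list is the roughly 1000 compares mentioned
earlier. The hashed average depends on how evenly these particular
identifiers spread over the 256 entries, so it need not reach the ideal
2000/256/2.
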
This is the ideal situation; strangely enough, in reality lots of
strings end up in the same entry. To make a good distribution of strings
over entries, this module uses several techniques, among them a better
hashing function, primes for n, and others. While in reality you may
end up doing 8 StrCmp()s instead of 4 as in the ideal case, this is
still a huge improvement over the 1000 we started with.
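
Why a prime n helps can be seen with a small experiment. The Python sketch
below uses assumed example data (raw hash values that are all multiples of
4, as aligned addresses would be) and says nothing about the hashing
function actually used by this module; it only compares a power of two
with the prime 211 (HASH_NORMAL) as table size.

```python
# Illustration with assumed example data: why a prime table size helps.
def buckets_used(hashvalues, n):
    """How many of the n table entries receive at least one key."""
    return len({h % n for h in hashvalues})


def longest_chain(hashvalues, n):
    """Length of the fullest chain when the values are reduced MOD n."""
    counts = {}
    for h in hashvalues:
        counts[h % n] = counts.get(h % n, 0) + 1
    return max(counts.values())


# 1000 raw hash values, all multiples of 4
raw = list(range(0, 4000, 4))

print(buckets_used(raw, 256), longest_chain(raw, 256))   # -> 64 16
print(buckets_used(raw, 211), longest_chain(raw, 211))   # -> 211 5
```

With n=256 the regular pattern in the values leaves three quarters of the
table empty; with the prime 211 every entry is used and the chains are
about three times shorter.
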