>> Making a perfect hash requires 7 additional tables: 3 independent
>> hashes of the string, 3 tables to convert the hash values into
>> summands, and one table to convert the sums into pointers to the
>> strings. (There is a trick that can reduce this to 5 tables, but it
>> makes the first hashing round slower.)
>
> Any references for these claims?
>
> gperf needs only one additional table. It doesn't need several
> "hashing rounds", only one very short hash function. Did you
> actually look at it (or comparable programs)?
First, let me clarify some terms. There are two properties of hash
functions involved here. The first is that every keyword hashes to a
different value. This is called "collision-free" hashing, or
"single-probe" hashing. The second is that the hash function maps the N
keywords into a set of exactly N things (such as N pointers, or the
integers from 0 to N-1). This is called "minimal" hashing. If the hash
function is both collision-free and minimal, then it is called "perfect."
Apparently, perfect hashing is so hard to achieve in practice that the terms
have become debased, and people are now calling a hash "perfect" if it
possesses only the first of the two properties.
When I made my statement about 5 or 7 tables, I had in mind my own
implementation of perfect hashing (in the stricter sense). I invented
this many years ago, possibly as early as 1980, but discovered
immediately (the next day, in fact) that I was 2 or 3 years too late. It
had already been published, I think in CACM; however, I do not have the
reference. It is the only method I know of that achieves (true) perfect
hashing with large and difficult keyword sets.
I looked up the gperf program, and found that it satisfies the first
property most of the time, failing on large sets of keywords, and the
second property only rarely, for small keyword sets. So, from my
perspective, it does not achieve perfect hashing. Also, it appears to
use 3 tables, not 1.
In any case, we are getting off the subject, which was how to store a
set of varying-length strings with minimal wasted space. The method that
I suggested, using the Pascal Macro Compiler, wastes no space at all. I
mentioned that an additional advantage is that its pointer table or
index table could be arranged to provide O(log N) time searching without
using any additional tables or space.
The fact that this is not the fastest search method is beyond the scope of
this discussion.
If a practical method for searching the table in minimal time is
required, then a self-expanding hash function would be a more suitable
solution, since these can be built at runtime in linear time, with a
moderate amount of wasted space.
Frank Rubin