>> Making a perfect hash requires 7 additional tables, 3 independent hashes of
>> the string, 3 tables to convert the hash values into summands, and one table to
>> convert the sums into pointers to the strings.  (There is a trick that can
>> reduce this to 5 tables, but it makes the first hashing round slower.)
>
> Any references for these claims?
>
> gperf needs only one additional table. It doesn't need several
> "hashing rounds", only one very short hash function. Did you
> actually look at it (or comparable programs)?
 
 
First, let me clarify some terms.  There are two properties of hash functions involved here.  The first is that every keyword hashes to a different value.  This is called "collision-free" hashing, or "single-probe" hashing.  The second is that the hash function maps the N keywords into a set of N things (such as N pointers, or the integers from 0 to N-1).  This is called "minimal" hashing.  If the hash function is both collision-free and minimal, then it is called "perfect."
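
To make the distinction concrete, here is a small invented example in C (the keyword set and hash functions are chosen purely for illustration):

    #include <stdio.h>

    /* Hypothetical 3-keyword set, for illustration only. */
    static const char *kw[3] = { "if", "then", "else" };

    /* Collision-free but NOT minimal: maps the 3 keywords to the
       distinct values 5, 6, 1 inside the larger range 0..9. */
    static int h_cf(const char *s) { return s[0] % 10; }

    /* Perfect (collision-free AND minimal): maps the 3 keywords
       onto exactly 0, 1, 2. */
    static int h_perfect(const char *s) { return (s[0] + s[1]) % 3; }

    int main(void) {
        for (int i = 0; i < 3; i++)
            printf("%-5s  h_cf=%d  h_perfect=%d\n",
                   kw[i], h_cf(kw[i]), h_perfect(kw[i]));
        return 0;
    }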
 
Apparently, perfect hashing is so hard to achieve in practice that the terms have become debased, and people are now calling a hash "perfect" if it possesses only the first of the two properties.
 
When I made my statement about 5 or 7 tables, I had in mind my own implementation of perfect hashing (in the stricter sense).  I invented this many years ago, possibly as early as 1980, but immediately (the next day, in fact) discovered that I was 2 or 3 years too late.  It had already been published, I think in CACM.  However, I do not have the reference.  It is the only method I know of that achieves (true) perfect hashing with large and difficult keyword sets.
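
For concreteness, a lookup under the 7-table scheme quoted above might look like the sketch below.  The three hash functions would themselves be table-driven (which is why they count toward the 7 tables); all sizes and table contents here are placeholders that the construction phase would have to fill in:

    #include <string.h>

    #define HRANGE 64    /* range of each hash; placeholder value */
    #define SRANGE 768   /* upper bound on the sums; placeholder  */

    /* The three independent (table-driven) hashes of the string. */
    extern unsigned h1(const char *s), h2(const char *s), h3(const char *s);

    /* Three tables converting hash values into summands. */
    extern unsigned char summand1[HRANGE], summand2[HRANGE],
                         summand3[HRANGE];

    /* One table converting sums into pointers to the strings. */
    extern const char *ptrtab[SRANGE];

    const char *lookup(const char *s)
    {
        unsigned sum = summand1[h1(s)] + summand2[h2(s)] + summand3[h3(s)];
        const char *p = ptrtab[sum];
        /* One final comparison rejects non-keywords. */
        return (p && strcmp(p, s) == 0) ? p : 0;
    }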
 
I looked up the gperf program, and found that it satisfies the first property most of the time (it can fail for large keyword sets), and the second property only rarely, for small keyword sets.  So, from my perspective, it does not achieve perfect hashing.  Also, it appears to use 3 tables, not 1.
 
In any case, we are getting off the subject, which was how to store a set of varying-length strings with minimal wasted space.  The method that I suggested, using the Pascal Macro Compiler, uses no wasted space at all.  I mentioned that an additional advantage is that its pointer table or index table could be arranged to provide O(log N) time searching without using any additional tables or space. 
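
As a sketch of that layout (with made-up strings): the strings are packed end to end in one pool with no padding, and the index table is sorted in the strings' own order, so the same table that locates the strings also supports O(log N) binary search:

    #include <string.h>

    static const char pool[] = "apple\0banana\0cherry\0date";
    static const unsigned short idx[] = { 0, 6, 13, 20 };  /* sorted */
    #define NSTR 4

    /* O(log N) search using only the pool and the index table. */
    static const char *find(const char *key)
    {
        int lo = 0, hi = NSTR - 1;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            int c = strcmp(key, pool + idx[mid]);
            if (c == 0) return pool + idx[mid];
            if (c < 0) hi = mid - 1;
            else       lo = mid + 1;
        }
        return 0;  /* not found */
    }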
 
The fact that this is not the fastest search method is beyond the scope of this discussion. 
 
If a practical method for searching the table in minimal time is required, then a self-expanding hash table would be a more suitable solution, since one can be built at run time in linear time, with moderate wasted space.
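
As a minimal sketch of what I mean (assuming open addressing with doubling; error handling omitted):

    #include <stdlib.h>

    typedef struct { const char **slot; size_t cap, used; } Table;

    static size_t hash(const char *s) {
        size_t h = 5381;                  /* simple string hash */
        while (*s) h = h * 33 + (unsigned char)*s++;
        return h;
    }

    static void insert(Table *t, const char *s);

    static void grow(Table *t) {          /* double and rehash */
        Table big = { calloc(t->cap * 2, sizeof(char *)), t->cap * 2, 0 };
        for (size_t i = 0; i < t->cap; i++)
            if (t->slot[i]) insert(&big, t->slot[i]);
        free(t->slot);
        *t = big;
    }

    static void insert(Table *t, const char *s) {
        if (t->used * 4 >= t->cap * 3) grow(t);   /* keep load < 3/4 */
        size_t i = hash(s) % t->cap;
        while (t->slot[i]) i = (i + 1) % t->cap;  /* linear probing */
        t->slot[i] = s;
        t->used++;
    }

    /* usage:  Table t = { calloc(8, sizeof(char *)), 8, 0 };
               insert(&t, "keyword");  */

Each insertion is amortized constant time, so the whole table is built in linear time, and the wasted space is bounded by the load-factor threshold.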
 
Frank Rubin