Contestcen@aol.com wrote:
> Making a perfect hash requires 7 additional tables: 3 independent hashes
> of the string, 3 tables to convert the hash values into summands, and one
> table to convert the sums into pointers to the strings. (There is a trick
> that can reduce this to 5 tables, but it makes the first hashing round
> slower.)
Any references for these claims?
gperf needs only one additional table. It doesn't need several "hashing rounds", only one very short hash function. Did you actually look at it (or comparable programs)?
> First, let me clarify some terms. There are two properties of hash
> functions involved here. The first is that every keyword hashes to a
> different value. This is called "collision-free" hashing, or
> "single-probe" hashing. The second property is that the hash function
> for the N keywords maps into a set of N things (such as N pointers, or
> the integers from 0 to N-1). This is called "minimal" hashing. If the
> hash function is both collision-free and minimal, then it is called
> "perfect."
According to other definitions I know (see e.g. Wikipedia), the first one is called "perfect hashing", the second one is called "minimal perfect hashing".
The former is really more important. If you get a few unused slots because you can't achieve minimal perfect hashing, that's probably still much better than your solution for minimal perfect hashing, both in space (if it needs 5 or 7 tables as you say, plus the accompanying code) and in time, obviously.
> I looked up the gperf program, and found that it satisfies the first
> property most of the time, except for large sets of keywords, and the
> second property rarely, only for small keyword sets. So, from my
> perspective, it does not achieve perfect hashing. Also, it appears to
> use 3 tables, not 1.
The one I know produces one hash table, and of course, the array with the strings, so at most 2 tables if you count that way.
> In any case, we are getting off the subject, which was how to store a
> set of varying-length strings with minimal wasted space.
Yawn!
BTW, this is apparently your chosen subject, which no one else here seems to be interested in. The original subject was about runtime storage.
Also, most people here, including the Chief and me, are more interested in efficient access than in saving a small constant number of bytes. The penalties, e.g., for unaligned access, even on systems where it doesn't crash right away, might more than cancel out the savings, through larger code size alone.
> I mentioned that an additional advantage is that its pointer table or
> index table could be arranged to provide O(log N) time searching
> without using any additional tables or space.
And I mentioned that O(log N) is a rather bad bound for strings known at compile time.
> If a practical method for searching the table in minimal time is
> required, then a self-expanding hash table would be a more suitable
> solution, since these can be built during runtime in linear time, with
> moderate wasted space.
For runtime, yes. And one of the issues is how to allocate space for the strings in a non-wasteful way, which was the original question and has long been answered.
For compile-time, you probably want a compile-time built hash table. Of course, this is probably out of the scope of a macro processor, so sorry, your macro processor is just not relevant here.
I understand that you want to advertise it, but it's really getting too much. It might be better if you find suitable applications to refer to -- this means problems that people are really having (not only yourself, apparently; which excludes this topic), and which concern GPC if you post on this list (this excludes certain problems another compiler has with longer strings).
Frank