Making a perfect hash requires 7 additional tables: 3 independent hashes of the string, 3 tables to convert the hash values into summands, and one table to convert the sums into pointers to the strings. (There is a trick that can reduce this to 5 tables, but it makes the first hashing round slower.)
Any references for these claims?
gperf needs only one additional table. It doesn't need several "hashing rounds", only one very short hash function. Did you actually look at it (or comparable programs)?
First, let me clarify some terms. There are two properties of hash functions involved here. The first is that every keyword hashes to a different value. This is called "collision-free" hashing, or "single-probe" hashing. The second is that the hash function maps the N keywords into a set of exactly N things (such as N pointers, or the integers from 0 to N-1). This is called "minimal" hashing. If the hash function is both collision-free and minimal, then it is called "perfect."
Apparently, perfect hashing is so hard to achieve in practice that the terms have become debased, and people are now calling a hash "perfect" if it possesses only the first of the two properties.
When I made my statement about 5 or 7 tables, I had in mind my own implementation of perfect hashing (in the stricter sense). I invented this many years ago, possibly as early as 1980, but immediately (the next day, in fact) discovered that I was 2 or 3 years too late. It had already been published, I think in CACM. However, I do not have the reference. It is the only method I know to achieve (true) perfect hashing with large and difficult keyword sets.
I looked up the gperf program, and found that it satisfies the first property most of the time, except for large sets of keywords, and the second property rarely, only for small keyword sets. So, from my perspective, it does not achieve perfect hashing. Also, it appears to use 3 tables, not 1.
In any case, we are getting off the subject, which was how to store a set of varying-length strings with minimal wasted space. The method that I suggested, using the Pascal Macro Compiler, uses no wasted space at all. I mentioned that an additional advantage is that its pointer table or index table could be arranged to provide O(log N) time searching without using any additional tables or space.
The fact that this is not the fastest search method is beyond the scope of this discussion.
If a practical method for searching the table in minimal time is required, then a self-expanding hash table would be a more suitable solution, since one can be built at runtime in linear time, with moderate wasted space.
Frank Rubin
On Tue, Jul 19, 2005 at 02:47:42AM -0400, Contestcen@aol.com wrote: [...]
In any case, we are getting off the subject,
Finally, a word of wisdom...
which was how to store a set of varying-length strings with minimal wasted space.
... except that the subject of this mailing list is not "how to store something somewhere", but "the GNU Pascal Compiler", in case you didn't notice. Discussion of the Pascal Macro Compiler belongs to the Pascal Macro Compiler mailing list.
Another thing which you apparently didn't notice, although three people already asked you to do so: please, turn off HTML in your mail.
Emil Jerabek
Emil Jerabek wrote:
Another thing which you apparently didn't notice, although three people already asked you to do so: please, turn off HTML in your mail.
BTW, I told him in private mail several times, and he'd always reply that nobody else than me has ever complained ...
To avoid conspiracy theories -- no, I haven't contacted Emil or someone else by private mail and asked them to tell him to turn off HTML. ;-)
I'm sure many other readers dislike it as well, but haven't bothered to complain -- and please don't do it now either, this is not an AOL "me too" forum. ;-)
If he still thinks that HTML is accepted here unless a hundred people tell him so, I guess I'll just have to ask Anja to set up stricter filtering rules ...
Frank
Frank Heckenbach wrote:
Emil Jerabek wrote:
Another thing which you apparently didn't notice, although three people already asked you to do so: please, turn off HTML in your mail.
BTW, I told him in private mail several times, and he'd always reply that nobody else than me has ever complained ...
To avoid conspiracy theories -- no, I haven't contacted Emil or someone else by private mail and asked them to tell him to turn off HTML. ;-)
I'm sure many other readers dislike it as well, but haven't bothered to complain -- and please don't do it now either, this is not an AOL "me too" forum. ;-)
If he still thinks that HTML is accepted here unless a hundred people tell him so, I guess I'll just have to ask Anja to set up stricter filtering rules ...
I would encourage that, but please don't just dump HTML messages, bounce them with the specific reason. That might improve the breed.
Contestcen@aol.com wrote:
Making a perfect hash requires 7 additional tables: 3 independent hashes of the string, 3 tables to convert the hash values into summands, and one table to convert the sums into pointers to the strings. (There is a trick that can reduce this to 5 tables, but it makes the first hashing round slower.)
Any references for these claims?
gperf needs only one additional table. It doesn't need several "hashing rounds", only one very short hash function. Did you actually look at it (or comparable programs)?
First, let me clarify some terms. There are two properties of hash functions involved here. The first is that every keyword hashes to a different value. This is called "collision-free" hashing, or "single-probe" hashing. The second is that the hash function maps the N keywords into a set of exactly N things (such as N pointers, or the integers from 0 to N-1). This is called "minimal" hashing. If the hash function is both collision-free and minimal, then it is called "perfect."
According to other definitions I know (see e.g. Wikipedia), the first one is called "perfect hashing", the second one is called "minimal perfect hashing".
The former is really the more important one. If you get a few unused slots because you can't achieve minimal perfect hashing, that's probably still much better than your solution for minimal perfect hashing, both in space (if it needs 5 or 7 tables as you say, plus the accompanying code) and, obviously, in time.
I looked up the gperf program, and found that it satisfies the first property most of the time, except for large sets of keywords, and the second property rarely, only for small keyword sets. So, from my perspective, it does not achieve perfect hashing. Also, it appears to use 3 tables, not 1.
The one I know produces one hash table, and of course, the array with the strings, so at most 2 tables if you count so.
In any case, we are getting off the subject, which was how to store a set of varying-length strings with minimal wasted space.
Yawn!
BTW, this apparently is your chosen subject, which no one else here seems to be interested in. The original subject was about runtime storage.
Also, most people here, including the Chief and me, are more interested in efficient access than saving a small constant number of bytes. The penalties, e.g., for unaligned access, even on systems where it doesn't crash right away, might more than make up for the savings, by larger code alone.
I mentioned that an additional advantage is that its pointer table or index table could be arranged to provide O(log N) time searching without using any additional tables or space.
And I mentioned that this is a rather bad value for compile-time known strings.
If a practical method for searching the table in minimal time is required, then a self-expanding hash table would be a more suitable solution, since one can be built at runtime in linear time, with moderate wasted space.
For runtime, yes. And one of the issues is how to allocate space for the strings in a non-wasteful way, which was the original question and has long been answered.
For compile-time, you probably want a compile-time built hash table. Of course, this is probably out of the scope of a macro processor, so sorry, your macro processor is just not relevant here.
I understand that you want to advertise it, but it's really getting too much. It might be better if you find suitable applications to refer to -- that means problems that people are really having (not only you, apparently -- which excludes this topic), and which concern GPC if you post on this list (which excludes certain problems another compiler has with longer strings).
Frank