On 06 Aug 2006, at 00:40, Frank Heckenbach wrote:
Jonas Maebe wrote:
[snip]
What I think is strange is the byte order you get with current (GPC and apparently also FPC) sets on big-endian machines, e.g. for a 0-based set on a 32-bit machine:
Byte   Elements
 0     24..31
 1     16..23
 2      8..15
 3      0..7
 4     56..63
 5     48..55
etc.
This is because the bytes within a word are big-endian, but we always treat the sequence of words as little-endian (lowest elements first), so we actually get a mixed endianness when considering set elements.
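To make the layout concrete, here is an illustrative C sketch (not GPC's or FPC's actual code; all names are made up) of the word-based representation being described: element n lives in word n/32, bit n%32 of that word. Because the operations only ever touch whole words, the same code works on both endiannesses, and on a big-endian host memory byte 0 of word 0 ends up holding bits 24..31, exactly as in the table above.

```c
#include <stdint.h>

/* Illustrative sketch of a word-based set (elements 0..63).
 * Names are hypothetical, not taken from GPC or FPC. */
#define SET_WORDS 2

typedef uint32_t SmallSet[SET_WORDS];

/* Insert element n: pure word arithmetic, endian-agnostic. */
static void set_include(SmallSet s, unsigned n)
{
    s[n / 32] |= (uint32_t)1 << (n % 32);
}

/* Test membership of element n. */
static int set_in(const SmallSet s, unsigned n)
{
    return (s[n / 32] >> (n % 32)) & 1;
}
```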
I disagree that for sets this is strange, if only for the reason that apparently both the GPC and FPC implementors did it that way independently, and without any hacking or ugly workarounds.
On the contrary, this representation is extremely convenient because it allows one to implement all set routines in an endian-agnostic way (both in the rtl/rts and in the compiler), and the only thing needed to load values from one endianness architecture on another is byte-swapping.
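The byte-swapping mentioned here really is all there is to it under this representation; a minimal illustrative sketch (not either compiler's code) would be a per-word swap applied when loading a set written on the other endianness:

```c
#include <stdint.h>
#include <stddef.h>

/* Reverse the bytes of one 32-bit word. */
static uint32_t bswap32(uint32_t w)
{
    return (w >> 24) | ((w >> 8) & 0x0000FF00u)
         | ((w << 8) & 0x00FF0000u) | (w << 24);
}

/* Convert a whole set between endiannesses: since the word layout is
 * shared, swapping each word's bytes is the only conversion needed. */
static void swap_set(uint32_t *words, size_t n)
{
    for (size_t i = 0; i < n; i++)
        words[i] = bswap32(words[i]);
}
```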
At a very low level it may seem strange once you start to think about it (I also thought it was strange the first time I realised this), but does that matter if it's easy to work with at a high level (both for the compiler implementors and for end users, no less)?
Indeed, it would also reverse the order of elements in a byte (perhaps that's what you meant above), so the set element 1, e.g., wouldn't correspond to the "bit #1", i.e. value 2, anymore, but instead "bit #6", i.e. value $40.
This is indeed what I was referring to.
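Concretely, the two byte-wise bit orders being discussed come down to which end of the byte element 0 occupies. An illustrative sketch (hypothetical helper names, not from either compiler):

```c
#include <stdint.h>

/* Little-endian-style numbering: element e of a byte is bit e,
 * so element 1 has value 2. */
static uint8_t mask_le(unsigned e) { return (uint8_t)(1u << (e % 8)); }

/* Big-endian-style numbering: the lowest-indexed element sits in the
 * highest-valued bit, so element 1 is bit #6, i.e. value $40. */
static uint8_t mask_be(unsigned e) { return (uint8_t)(0x80u >> (e % 8)); }
```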
BTW, I actually used to wonder about the latter when I previously saw it on its own (WRT packed arrays, IIRC). Why put the lowest-indexed values in the highest-valued bits? But when looking at it in the context of words, as above, it makes sense. (Because if you don't, i.e. if you want to have the same byte order and bit order as on little-endian, you get strange shifts, like 24 for element 0, 25 for 1, ..., 31 for 7, 16 for 8, 17 for 9, ... This is awkward for single elements already, but gets really ugly when you need bit masks for ranges etc.)
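The "strange shifts" in that parenthesis follow from keeping little-endian bit order inside bytes while the machine stores words big-endian; an illustrative helper (not from either compiler) makes the pattern explicit:

```c
/* For element e (0..31) of a word on a big-endian machine, compute its
 * bit position within the word when bytes keep little-endian bit order:
 * memory byte b of the word holds word bits (3-b)*8 .. (3-b)*8+7. */
static unsigned awkward_shift(unsigned e)
{
    unsigned byte = e / 8;              /* which memory byte of the word */
    return (3 - byte) * 8 + e % 8;      /* e=0 -> 24, e=7 -> 31, e=8 -> 16 */
}
```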
As long as you treat the set as an "array" of components of the same size which is used when inserting/testing/deleting elements, your range code can remain generic (as e.g. GPC's SetIncludeRange demonstrates). Of course, to make sets binary compatible between little and big endian archs you have to use bytes as opposed to words when doing these operations, and therefore you'd have to do the same with the range code.
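A byte-wise range insertion in this spirit might look like the following sketch (this is in the style the text attributes to GPC's SetIncludeRange, but it is not GPC's code; it assumes the big-endian-within-byte numbering, element e -> bit 0x80 >> (e % 8)):

```c
#include <stdint.h>
#include <string.h>

/* Include all elements lo..hi (inclusive) in a byte-array set. */
static void include_range(uint8_t *set, unsigned lo, unsigned hi)
{
    unsigned b0 = lo / 8, b1 = hi / 8;
    uint8_t m0 = (uint8_t)(0xFFu >> (lo % 8));       /* elements lo%8..7   */
    uint8_t m1 = (uint8_t)(0xFFu << (7 - hi % 8));   /* elements 0..hi%8   */
    if (b0 == b1) {                     /* range within a single byte */
        set[b0] |= (uint8_t)(m0 & m1);
        return;
    }
    set[b0] |= m0;                      /* partial first byte  */
    memset(set + b0 + 1, 0xFF, b1 - b0 - 1);  /* full middle bytes */
    set[b1] |= m1;                      /* partial last byte   */
}
```

Because this works on bytes rather than words, the same bit patterns come out on both endiannesses, which is exactly the binary-compatibility point made above.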
For adding/subtracting/... it wouldn't make a difference, of course; they could keep working on words.
Bit swapping when porting from big to little endian is a lot less obvious than byte swapping, though. We'd at least have to provide routines to do this (maybe GPC has them already, but FPC doesn't).
I don't think we have such a routine, but it wouldn't be hard to provide one (probably table-based for efficiency).
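Such a table-based routine is short; here is an illustrative sketch (hypothetical, not an existing GPC or FPC routine) that reverses the bit order within each byte via a 256-entry lookup table:

```c
#include <stdint.h>

/* Reverse the bits of one byte using a lazily built 256-entry table
 * (in a real rtl/rts the table would likely be precomputed). */
static uint8_t rev8(uint8_t b)
{
    static uint8_t table[256];
    static int ready = 0;
    if (!ready) {
        for (int i = 0; i < 256; i++) {
            uint8_t r = 0;
            for (int j = 0; j < 8; j++)
                if (i & (1 << j))
                    r |= (uint8_t)(0x80 >> j);
            table[i] = r;
        }
        ready = 1;
    }
    return table[b];
}
```

Applied to every byte of a set (after the byte swap discussed earlier), this would convert between the two per-byte bit orders; note rev8 is its own inverse.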
Indeed.
Making people extend their set base types so the sets are a multiple of 8 bytes on both 32 and 64 bit archs seems awkward: it may mess up bit-packed records elsewhere as well, and for enums it may amount to adding a bunch of dummy enum elements (which doesn't look nice either).
You seem to be very concerned with sets in binary file formats, judging from these and previous concerns.
I am indeed. These sorts of issues can waste a lot of programmer time when going from one platform to another, time that could be spent on much more useful things.
I admit that's not such a big concern of mine (in particular, in contrast to runtime efficiency, or proper gdb support, of course). Actually I haven't seen many file formats containing sets. That's probably because many languages don't have sets, so file formats that should be accessible from different languages probably use bit-fields or something like this instead.
It would at least be a problem for FPC itself. Our ppu files (comparable to a combination of GPC's .gpd and .gpi files) contain a ton of sets. Flags describing properties of functions/procedures, variables, types, ... are all sets in FPC, and all are stored that way in the .ppu files.
If sets have different sizes on 32 and 64 bit archs, the compiler would need hacks in the ppu writing and reading routines if we want to keep the ability to cross-compile usable ppu files from 32 bit to 64 bit platforms (and vice versa). Similar for ppudump, which "disassembles" ppu files.
Sets may indeed be rare in data files shared among a lot of different programming environments, but they are quite common in data files used by a single Pascal program (simply because there may be no need to make them easily accessible from other languages, as with e.g. our ppu files, and because sets are a first-class datatype in Pascal).
m2_is_long_set checks if all the fields of the record are consecutive sets (i.e. sets of consecutive range types). I don't really understand why this is useful though, nor do I see at first sight what m2_print_long_set does so differently compared to the M2 print code for TYPE_CODE_SET.
I don't currently have the time to check the code carefully, but since you speak of a record, does the compiler divide large sets internally into record fields (so the debugger has to reassemble them to look as one set), or something like this?
That's what it looks like, yes. The code I posted came from gdb, not gcc though. I don't know Modula-2, and I know next to nothing of gcc internals, so I don't feel like diving into that.
(Otherwise I wonder when a programmer would use "consecutive sets" manually.)
I'm also pretty sure the compiler does it, but as I said I have no idea why it does this since at first sight I can't determine a difference in binary layout compared to the conventional TYPE_CODE_SET way, or what gdb does differently for the two (apart from the fact that the routine for printing these record-sets is a lot more complex than the regular one because it has to hop from field to field).
Jonas