Waldek Hebisch wrote:
Recently Frank wrote that subranges are stored as integers to get better performance. So I decided to run a silly little benchmark to verify my suspicions. I used variations of the following program:
program p1;

type
  foo = record
    a : 0..127;
    b : -2000..2000
  end;

var
  t : array [1..1000000] of foo;
  i, j, s, sum : integer;

begin
  {writeln(Sizeof(foo));}
  for i := 1 to 1000000 do
  begin
    t[i].a := 20;
    t[i].b := 1000
  end;
  sum := 0;
  for j := 1 to 100 do
  begin
    s := 0;
    for i := 1 to 1000000 do
      s := s + t[i].b;
    sum := sum + s div 13678
  end;
  writeln(sum)
end.
I decided to compare gpc (latest alpha with gcc-3.2.1, options -O3), p2c (with gcc-3.2.1, options -O3) and fpc-1.0.6 (options -Sd -O3 -Op3). All timings are real time on an otherwise idle Duron-650. An 'x' in the fpc column means that fpc rejected the program.
        gpc      p2c      fpc
p1      3.116s   1.802s   2.550s
p2      2.178s   1.802s   x
p3      2.953s   2.073s   2.888s
p4      2.959s   2.072s   x
p5      3.116s   2.852s   3.365s

p11     1.239s   0.467s   1.652s
p12     1.394s   0.467s   x
p13     2.165s   0.930s   2.397s
p14     2.165s   0.930s   x
p15     1.239s   0.897s   1.720s

p21     0.469s   0.467s   1.084s
p22     0.623s   0.467s   x
p23     1.703s   1.007s   1.932s
p24     1.702s   1.007s   x
p25     0.470s   0.916s   1.026s
p1 is the program above; p2 has packed subranges as components of the record; p3 uses a packed record; p4 uses a packed record with packed subranges; in p5 the record has integer components. The series p11 to p15 uses an array of 1000 elements, so it fits entirely into the level 1 cache. The series p21 to p25 has the inner for loop replaced by an equivalent repeat-until loop (otherwise identical to p11 .. p15).
Now, some conclusions:
- When the working set does not fit into the cache, smaller records give significantly better speed.
- The other compilers give their fastest code precisely when subranges are used.
- When working in the cache, aligned subword access seems to be as fast as fullword access, while unaligned access and bitfield access (used by p2c for packed records) are slower.
- The gcc backend is doing something weird to the loops: C for loops give good (IMHO not optimal) code, but it fails to optimize the Pascal for loop (that is why I also timed the variant with a repeat-until loop).
You may say that this is only a single benchmark on a single machine. But I think that many programs spend a lot of time in similar loops, so it is representative. I admit that I wrote the benchmark to show my point, but I based it on properties shared (IMHO) by most modern machines (especially cheaper ones): main memory much slower than the CPU, slow unaligned access, fast subword access.
I'm always quite skeptical of such statements. First, a single program can never be representative. Benchmarkers spend some effort on getting representative sets of programs (and often enough, they're quite off), so claiming a single program is representative is a little much IMHO.
More precisely, your program is characterized by a loop that contains only one array load and a trivial computation. I think many real programs of such a kind do somewhat more complex computations and also array stores. But as I said, one would need a set of programs with different behaviour to be really representative.
Secondly, properties of CPUs are not subject to opinion; they are one way or the other. I don't have the hard facts handy, maybe you do, but without them it's mostly speculation. I can imagine that subword read access is rather fast (`shift' and `and' can be done CPU-internally, maybe even combined if supported), but subword stores (which your program didn't test) not so much (they require a memory load and a store).
Despite all this, I suppose the numbers are quite realistic. I have the following explanations:
- First, my previous comments were more about single values (and maybe small arrays) than large arrays. I.e., I think that for plain subrange variables, it's more efficient to store them in word size.
Now, it seems natural to do the same for arrays. And in fact I think it's necessary, otherwise `var' parameters would break (at least if they're implemented via pointers, but anything else I can imagine would be less efficient).
I am aware of cache size issues, so I'm not surprised that large unpacked arrays give bad performance. (Then it doesn't matter whether the fields are integers or subranges, as the numbers for p1 and p5 confirm. The same seems to be true for FPC, though its number for p5 looks strange -- maybe a measuring mistake?)
p2c seems to make subranges shorter by default. I expect this to cause performance penalties when, e.g., using a subrange for a counter and array index (which is the proper thing to do in Pascal).
- For arrays, there are (at least) 3 reasonable packing options:
- No packing at all => yields the best single-element performance, at the expense of size (and of performance in large cases because of cache issues). That's p1 and p5 in GPC.
- Smallest possible packing. That's p3 and p4 (equivalent, I think, which the numbers confirm) in GPC. This has bad performance because GPC always (AFAIK) uses bitfield access. It might be worthwhile to avoid bitfield access when the offset and size are in fact at byte boundaries (not here, since smallest possible packing means `a' should really have 7 bits). Such packing may be a necessary choice for memory requirements in huge cases; I remember someone some time ago who needed an array of 24-bit integers because 16 bits had insufficient range and 32-bit values wouldn't fit in memory. It should be understood that this is not best for performance in general.
- Smallest possible packing without bitfields. I think a normal record/array of packed items should yield this in GPC (i.e., p2). It should be (and is) fastest in the first series, not so in the second (because cache issues don't matter much). I'm surprised about the third one -- did you also use 1000 elements there (then it's ok, like the second one), or large arrays again (then it would appear to contradict the cache problems)?
- I know that the implementation of `for' loops is a little hairy. I found a bug with them last year (I don't remember the details now, but could look them up if wanted) -- I suppose it was a backend problem, and I reported it to the GCC list and never got a response, so I kludged it (see FOR_BUG_OK in parse.y). I didn't check yet if the bug is still there in gcc-3, but according to a quick check the kludge does not seem to be responsible for the main difference.
Somehow, the backend's loop optimizations don't recognize GPC's `for' loops optimally. Maybe GPC's handling of `for' loops could be changed (but again, it's hairy, so watch out ...), or it should be improved in the backend, I'm not sure now ...
Another problem seems to be that i and s are global variables (if you make them local, you'll see a big difference in some cases). The global variables are (sometimes?) loaded and stored within each loop iteration, so I suppose for some reason the backend doesn't recognize that it's enough to load/store them once before/after the loop. With p2c, they become local variables of `main', which might explain a lot. While not optimal (in GPC), it doesn't seem so tragic, since a real program wouldn't use global variables for inner counters (would it? ;-).
Coming to gpc: it is probably too late to change the method to store subranges, [...]
No, why?
Frank