Recently Frank wrote that subranges are stored as integers to get better performance. So I decided to run a silly little benchmark to verify my suspicions. I used variations of the following program:
program p1;

type
  foo = record
    a : 0..127;
    b : -2000..2000;
  end;

var
  t : array [1..1000000] of foo;
  i, j, s, sum : integer;

begin
  {writeln(Sizeof(foo));}
  for i := 1 to 1000000 do
  begin
    t[i].a := 20;
    t[i].b := 1000;
  end;
  sum := 0;
  for j := 1 to 100 do
  begin
    s := 0;
    for i := 1 to 1000000 do
    begin
      s := s + t[i].b;
    end;
    sum := sum + s div 13678;
  end;
  writeln(sum)
end.
I decided to compare gpc (latest alpha with gcc-3.2.1, options -O3), p2c (with gcc-3.2.1, options -O3) and fpc-1.0.6 (options -Sd -O3 -Op3). All timings are real time on an otherwise idle Duron-650. An 'x' in the fpc column means that fpc rejected the program.
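For reference, I compiled roughly like this (a sketch; the exact p2c driver and link step, including the runtime library name, may differ on other setups):

  gpc -O3 -o p1 p1.pas
  p2c p1.pas && gcc -O3 -o p1 p1.c -lp2c
  fpc -Sd -O3 -Op3 p1.pas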
        gpc       p2c       fpc
p1      3.116s    1.802s    2.550s
p2      2.178s    1.802s    x
p3      2.953s    2.073s    2.888s
p4      2.959s    2.072s    x
p5      3.116s    2.852s    3.365s

p11     1.239s    0.467s    1.652s
p12     1.394s    0.467s    x
p13     2.165s    0.930s    2.397s
p14     2.165s    0.930s    x
p15     1.239s    0.897s    1.720s

p21     0.469s    0.467s    1.084s
p22     0.623s    0.467s    x
p23     1.703s    1.007s    1.932s
p24     1.702s    1.007s    x
p25     0.470s    0.916s    1.026s
p1 is the program above; p2 has packed subranges as components of the record; p3 uses a packed record; p4 uses a packed record with packed subranges; in p5 the record has integer components (see the declaration sketch below). The series p11 to p15 uses an array of 1000 elements, so it fits entirely into the level 1 cache. The series p21 to p25 has the inner for loop replaced by an equivalent repeat until loop, but is otherwise identical to p11..p15.
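For concreteness, here is a sketch of the p3 and p5 record declarations, reconstructed from the descriptions above (the names foo3 and foo5 are just for illustration; p2 and p4 instead mark the subrange components themselves as packed):

  type
    { p3: packed record -- p2c maps this to C bit-fields }
    foo3 = packed record
      a : 0..127;
      b : -2000..2000;
    end;

    { p5: plain fullword integer components }
    foo5 = record
      a : integer;
      b : integer;
    end;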
Now, some conclusions:
1) When the working set does not fit into the cache, smaller records give significantly better speed.
2) The other compilers produce their fastest code precisely when subranges are used.
3) When working within the cache, aligned subword access seems to be as fast as fullword access, while unaligned access and bitfield access (used by p2c for packed records) are slower.
4) The gcc backend is doing something weird to the loops: it gives good (IMHO not optimal) code for C for loops, but it fails to optimize the Pascal for loop (that is why I also timed the variant with the repeat until loop; see the sketch below).
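To make point 4 concrete, here are the two loop shapes, stripped down to a minimal self-contained program (names and the initialization are illustrative; in the benchmark the loop body sums t[i].b as shown above):

  program loops;
  const
    n = 1000;  { the p1x/p2x series use a 1000-element array }
  var
    t : array [1..n] of integer;
    i, s : integer;
  begin
    for i := 1 to n do t[i] := 1000;

    { for variant, as in p11..p15 }
    s := 0;
    for i := 1 to n do
      s := s + t[i];

    { equivalent repeat variant, as in p21..p25 }
    s := 0;
    i := 1;
    repeat
      s := s + t[i];
      i := i + 1
    until i > n;

    writeln(s)
  end.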
You may say that this is only a single benchmark on a single machine. But I think that many programs spend a lot of time in similar loops, so it is representative. I admit that I wrote the benchmark to show my point, but I based it on properties shared (IMHO) by most modern machines (especially cheaper ones): main memory much slower than the CPU, slow unaligned access, fast subword access.
Also, it does not measure the performance of using subranges as indices and the like -- I believe that reading from memory is crucial, since stores are much less frequent than loads, and the compiler can use integers in computations anyway; only the loads and stores are done differently.
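To illustrate what I mean (a sketch of the point, not a claim about gpc's actual code generation; the names are made up):

  program widen;
  var
    x : -2000..2000;  { may be stored as, e.g., a 16-bit halfword }
    y : integer;
  begin
    x := 1234;         { only this store depends on the storage format }
    y := x * 3 + 7;    { the load of x may need sign extension, but the
                         arithmetic itself is done on full integers }
    writeln(y)
  end.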
Coming to gpc: it is probably too late to change the way subranges are stored, but I think it is fair to say that the decision was made in the past and we are stuck with it, rather than to claim that it gives faster code. Also, the loop behavior is rather embarrassing (well, I can guess why it works the way it does, but fixing it may be harder).