Recently Frank wrote that subranges are stored as integers to
get better performance. So I decided to run a little silly
benchmark to verify my suspicions. I used variations of the
following program:
program p1;
type
  foo = record
    a : 0..127;
    b : -2000..2000;
  end;
var
  t : array [1..1000000] of foo;
  i, j, s, sum : integer;
begin
  {writeln(Sizeof(foo));}
  for i := 1 to 1000000 do
    begin
      t[i].a := 20;
      t[i].b := 1000;
    end;
  sum := 0;
  for j := 1 to 100 do
    begin
      s := 0;
      for i := 1 to 1000000 do
        begin
          s := s + t[i].b;
        end;
      sum := sum + s div 13678;
    end;
  writeln(sum)
end.
I decided to compare gpc (latest alpha with gcc-3.2.1, options -O3),
p2c (with gcc-3.2.1, options -O3), and fpc-1.0.6 (options -Sd -O3 -Op3).
All timings are real time on an otherwise idle Duron-650. An 'x' in
the fpc column means that fpc rejected the program.
        gpc      p2c      fpc
p1      3.116s   1.802s   2.550s
p2      2.178s   1.802s   x
p3      2.953s   2.073s   2.888s
p4      2.959s   2.072s   x
p5      3.116s   2.852s   3.365s

p11     1.239s   0.467s   1.652s
p12     1.394s   0.467s   x
p13     2.165s   0.930s   2.397s
p14     2.165s   0.930s   x
p15     1.239s   0.897s   1.720s

p21     0.469s   0.467s   1.084s
p22     0.623s   0.467s   x
p23     1.703s   1.007s   1.932s
p24     1.702s   1.007s   x
p25     0.470s   0.916s   1.026s
p1 is the program above; p2 has packed subranges as components of the
record; p3 uses a packed record; p4 uses a packed record with packed
subranges; in p5 the record has integer components (the variants are
sketched below).
The series p11 to p15 uses an array of 1000 elements, so it fits
entirely into the level 1 cache. The series p21 to p25 has the inner
for loop replaced by an equivalent repeat until loop, but is otherwise
identical to p11 .. p15.
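For reference, here is a rough sketch of how the record declarations
differ between the variants, reconstructed from the descriptions above
(the packed-subrange form is, as far as I know, a GPC extension, which
would also explain why fpc rejects p2 and p4; the exact syntax shown
for it is my assumption):

  type
    { p3: packed record, plain subrange components }
    foo3 = packed record
      a : 0..127;
      b : -2000..2000;
    end;

    { p5: plain record with full integer components }
    foo5 = record
      a : integer;
      b : integer;
    end;

    { p2: packed subranges as components (p4 additionally makes the
      record itself packed); `packed 0..127' is my guess at the GPC
      extension syntax }
    foo2 = record
      a : packed 0..127;
      b : packed -2000..2000;
    end;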
Now, some conclusions:
1) When the working set does not fit into the cache, smaller records
   give significantly better speed.
2) The other compilers give their fastest code precisely when using
   subranges.
3) When working in the cache, aligned subword access seems to be as
   fast as fullword access, while unaligned access or bitfield access
   (used by p2c for packed records) is slower.
4) The gcc backend is doing something weird to the loops: C for loops
   give good (IMHO not optimal) code, but it fails to optimize the
   Pascal for loop (that is why I also timed the variant with a repeat
   until loop; see the sketch below).
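For clarity, this is roughly the only change made in the p21 .. p25
series, shown here for the 1000-element array:

  { inner loop as in p11 .. p15 }
  for i := 1 to 1000 do
    begin
      s := s + t[i].b;
    end;

  { equivalent repeat until loop used in p21 .. p25; the array is
    never empty, so the rewrite is safe }
  i := 1;
  repeat
    s := s + t[i].b;
    i := i + 1
  until i > 1000;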
You may say that this is only a single benchmark on a single machine.
But I think that many programs spend a lot of time in similar loops,
so it is representative. I admit that I wrote the benchmark to show
my point, but I based it on properties shared (IMHO) by most modern
machines (especially cheaper ones): main memory much slower than the
CPU, slow unaligned access, fast subword access.
Also, it does not measure the performance of using subranges as indices
and the like -- I believe that reading from memory is crucial, as stores
are much less frequent than loads and the compiler can use integers
in computations anyway; only the loads and stores are done differently
(illustrated below).
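To illustrate that last point (a sketch of the idea, not of any
compiler's actual output): with a and b declared as subranges, the
arithmetic itself can still be done in full integer registers; only
the load widens the subword value and only the store narrows it back:

  program widen;
  type
    foo = record
      a : 0..127;
      b : -2000..2000;
    end;
  var
    r : foo;
    x : integer;
  begin
    r.a := 20;
    r.b := 1000;
    x := r.a + r.b;  { two subword loads, widened to integer; the
                       addition runs at full integer width }
    r.b := x;        { one subword store, narrowed back to the size
                       of the field }
    writeln(x)
  end.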
Coming to gpc: it is probably too late to change the method used to
store subranges, but I think it is fair to say that the decision was
made in the past and we are stuck with it, rather than to claim that
it gives faster code.
Also, the loop behavior is rather embarrassing (well, I can guess why
it works that way, but fixing it may be harder).
--
Waldek Hebisch
hebisch(a)math.uni.wroc.pl or hebisch(a)hera.math.uni.wroc.pl