Recently Frank wrote that subranges are stored as integers to get better performance. So I decided to run a silly little benchmark to verify my suspicions. I used variations of the following program:
program p1;

type
  foo = record
    a : 0..127;
    b : -2000..2000;
  end;

var
  t : array [1..1000000] of foo;
  i, j, s, sum : integer;

begin
  {writeln(Sizeof(foo));}
  for i := 1 to 1000000 do
  begin
    t[i].a := 20;
    t[i].b := 1000;
  end;
  sum := 0;
  for j := 1 to 100 do
  begin
    s := 0;
    for i := 1 to 1000000 do
    begin
      s := s + t[i].b;
    end;
    sum := sum + s div 13678;
  end;
  writeln(sum)
end.
I decided to compare gpc (latest alpha with gcc-3.2.1, options -O3), p2c (with gcc-3.2.1, options -O3) and fpc-1.0.6 (options -Sd -O3 -Op3). All timings are real time on an otherwise idle Duron-650. An 'x' in the fpc column means that fpc rejected the program.
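For reference, I compiled roughly like this (a sketch; the exact p2c driver and link step, including the runtime library name, may differ on other setups):

  gpc -O3 -o p1 p1.pas
  p2c p1.pas && gcc -O3 -o p1 p1.c -lp2c
  fpc -Sd -O3 -Op3 p1.pas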
        gpc       p2c       fpc
p1      3.116s    1.802s    2.550s
p2      2.178s    1.802s    x
p3      2.953s    2.073s    2.888s
p4      2.959s    2.072s    x
p5      3.116s    2.852s    3.365s

p11     1.239s    0.467s    1.652s
p12     1.394s    0.467s    x
p13     2.165s    0.930s    2.397s
p14     2.165s    0.930s    x
p15     1.239s    0.897s    1.720s

p21     0.469s    0.467s    1.084s
p22     0.623s    0.467s    x
p23     1.703s    1.007s    1.932s
p24     1.702s    1.007s    x
p25     0.470s    0.916s    1.026s
p1 is the program above; p2 has packed subranges as components of the record; p3 uses a packed record; p4 uses a packed record with packed subranges; in p5 the record has integer components (see the declaration sketch below). The series p11 to p15 uses an array of 1000 elements, so it fits entirely into the level 1 cache. The series p21 to p25 has the inner for loop replaced by an equivalent repeat until loop, but is otherwise identical to p11..p15.
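For concreteness, here is a sketch of the p3 and p5 record declarations, reconstructed from the descriptions above (the names foo3 and foo5 are just for illustration; p2 and p4 instead mark the subrange components themselves as packed):

  type
    { p3: packed record -- p2c maps this to C bit-fields }
    foo3 = packed record
      a : 0..127;
      b : -2000..2000;
    end;

    { p5: plain fullword integer components }
    foo5 = record
      a : integer;
      b : integer;
    end;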
Now, some conclusions:
1) When the working set does not fit into the cache, smaller records give significantly better speed.
2) The other compilers produce their fastest code precisely when subranges are used.
3) When working within the cache, aligned subword access seems to be as fast as fullword access, while unaligned access and bitfield access (used by p2c for packed records) are slower.
4) The gcc backend is doing something weird to the loops: it gives good (IMHO not optimal) code for C for loops, but it fails to optimize the Pascal for loop (that is why I also timed the variant with the repeat until loop; see the sketch below).
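To make point 4 concrete, here are the two loop shapes, stripped down to a minimal self-contained program (names and the initialization are illustrative; in the benchmark the loop body sums t[i].b as shown above):

  program loops;
  const
    n = 1000;  { the p1x/p2x series use a 1000-element array }
  var
    t : array [1..n] of integer;
    i, s : integer;
  begin
    for i := 1 to n do t[i] := 1000;

    { for variant, as in p11..p15 }
    s := 0;
    for i := 1 to n do
      s := s + t[i];

    { equivalent repeat variant, as in p21..p25 }
    s := 0;
    i := 1;
    repeat
      s := s + t[i];
      i := i + 1
    until i > n;

    writeln(s)
  end.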
You may say that this is only a single benchmark on a single machine. But I think that many programs spend a lot of time in similar loops, so it is representative. I admit that I wrote the benchmark to show my point, but I based it on properties shared (IMHO) by most modern machines (especially cheaper ones): main memory much slower than the CPU, slow unaligned access, fast subword access.
Also, it does not measure the performance of using subranges as indices and the like -- I believe that reading from memory is crucial, since stores are much less frequent than loads, and the compiler can use integers in computations anyway; only the loads and stores are done differently.
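To illustrate what I mean (a sketch of the point, not a claim about gpc's actual code generation; the names are made up):

  program widen;
  var
    x : -2000..2000;  { may be stored as, e.g., a 16-bit halfword }
    y : integer;
  begin
    x := 1234;         { only this store depends on the storage format }
    y := x * 3 + 7;    { the load of x may need sign extension, but the
                         arithmetic itself is done on full integers }
    writeln(y)
  end.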
Coming to gpc: it is probably too late to change the way subranges are stored, but I think it is fair to say that the decision was made in the past and we are stuck with it, rather than to claim that it gives faster code. Also, the loop behavior is rather embarrassing (well, I can guess why it works the way it does, but fixing it may be harder).