Waldek Hebisch wrote:
Recently Frank wrote that subranges are stored as integers to get better performance. So I decided to run a silly little benchmark to verify my suspicions. I used variations of the following program:
program p1;

type
  foo = record
    a : 0..127;
    b : -2000..2000
  end;

var
  t : array [1..1000000] of foo;
  i, j, s, sum : integer;

begin
  {writeln(Sizeof(foo));}
  for i := 1 to 1000000 do
  begin
    t[i].a := 20;
    t[i].b := 1000
  end;
  sum := 0;
  for j := 1 to 100 do
  begin
    s := 0;
    for i := 1 to 1000000 do
      s := s + t[i].b;
    sum := sum + s div 13678
  end;
  writeln(sum)
end.
I decided to compare gpc (latest alpha with gcc-3.2.1, options -O3), p2c (with gcc-3.2.1, options -O3) and fpc-1.0.6 (options -Sd -O3 -Op3). All timings are real time on an otherwise idle Duron-650. An 'x' in the fpc column means that fpc rejected the program.
        gpc      p2c      fpc
p1      3.116s   1.802s   2.550s
p2      2.178s   1.802s   x
p3      2.953s   2.073s   2.888s
p4      2.959s   2.072s   x
p5      3.116s   2.852s   3.365s

p11     1.239s   0.467s   1.652s
p12     1.394s   0.467s   x
p13     2.165s   0.930s   2.397s
p14     2.165s   0.930s   x
p15     1.239s   0.897s   1.720s

p21     0.469s   0.467s   1.084s
p22     0.623s   0.467s   x
p23     1.703s   1.007s   1.932s
p24     1.702s   1.007s   x
p25     0.470s   0.916s   1.026s
p1 is the program above; p2 has packed subranges as components of the record; p3 uses a packed record; p4 uses a packed record with packed subranges; in p5 the record has integer components. The series p11 to p15 uses an array of 1000 elements, so it fits entirely into the level 1 cache. The series p21 to p25 has the inner for loop replaced by an equivalent repeat-until loop (otherwise identical to p11 .. p15).
Now, some conclusions:
- When the working set does not fit into the cache, smaller records give significantly better speed.
- The other compilers give their fastest code precisely when subranges are used.
- When working in the cache, aligned subword access seems to be as fast as fullword access, while unaligned access and bitfield access (used by p2c for packed records) are slower.
- The gcc backend is doing something weird to the loops: C for loops give good (IMHO not optimal) code, but it fails to optimize the Pascal for loop (that is why I also timed the variant with a repeat-until loop).
You may say that this is only a single benchmark on a single machine. But I think that many programs spend a lot of time in similar loops, so it is representative. I admit that I wrote the benchmark to show my point, but I based it on properties shared (IMHO) by most modern machines (especially cheaper ones): main memory much slower than the CPU, slow unaligned access, fast subword access.
I'm always quite skeptical of such statements. First, a single program can never be representative. Benchmarkers spend some effort on getting representative sets of programs (and often enough, they're quite off), so claiming a single program is representative is a little much IMHO.
More precisely, your program is characterized by a loop that contains only one array load and a trivial computation. I think many real programs of such a kind do somewhat more complex computations and also array stores. But as I said, one would need a set of programs with different behaviour to be really representative.
Secondly, properties of CPUs are not subject to opinion; they are one way or the other. I don't have the hard facts handy, maybe you do, but without them it's mostly speculation. I can imagine that subword read access is rather fast (`shift' and `and' can be done CPU-internally, maybe even combined if supported), but subword stores (which your program didn't test) not so much (they require a memory load and a store).
Despite all this, I suppose the numbers are quite realistic. I have the following explanations:
- First, my previous comments were more about single values (and maybe small arrays) than large arrays. I.e., I think that for plain subrange variables, it's more efficient to store them in word size.
Now, it seems natural to do the same for arrays. And in fact I think it's necessary, otherwise `var' parameters would break (at least if they're implemented via pointers, but anything else I can imagine would be less efficient).
I am aware of cache size issues, so I'm not surprised that large unpacked arrays give bad performance. (Then it doesn't matter whether the fields are integers or subranges, as the numbers for p1 and p5 confirm. The same seems to be true for FPC, though its number for p5 looks strange -- maybe a measuring mistake?)
p2c seems to make subranges shorter by default. I expect this to cause performance penalties when, e.g., using a subrange for a counter and array index (which is the proper thing to do in Pascal).
- For arrays, there are (at least) 3 reasonable packing options:
- No packing at all => yields the best single-element performance, at the expense of size (and of performance in large cases because of cache issues). That's p1 and p5 in GPC.
- Smallest possible packing. That's p3 and p4 (equivalent, I think, which the numbers confirm) in GPC. This has bad performance because GPC always (AFAIK) uses bitfield access. It might be worthwhile to avoid bitfield access when the offset and size are in fact at byte boundaries (not here, since smallest possible packing means `a' should really have 7 bits). Such packing may be a necessary choice for memory requirements in huge cases; I remember someone some time ago who needed an array of 24-bit integers because 16 bits had insufficient range and 32-bit values wouldn't fit in memory. It should be understood that this is not best for performance in general.
- Smallest possible packing without bitfields. I think a normal record/array of packed items should yield this in GPC (i.e., p2). It should be (and is) fastest in the first series, not so in the second (because cache issues don't matter much). I'm surprised about the third one -- did you also use 1000 elements there (then it's ok, like the second one), or large arrays again (then it would appear to contradict the cache problems)?
- I know that the implementation of `for' loops is a little hairy. I found a bug with them last year (I don't remember the details now, but could look them up if wanted) -- I suppose it was a backend problem, and I reported it to the GCC list and never got a response, so I kludged it (see FOR_BUG_OK in parse.y). I didn't check yet if the bug is still there in gcc-3, but according to a quick check the kludge does not seem to be responsible for the main difference.
Somehow, the backend's loop optimizations don't recognize GPC's `for' loops optimally. Maybe GPC's handling of `for' loops could be changed (but again, it's hairy, so watch out ...), or it should be improved in the backend, I'm not sure now ...
Another problem seems to be that i and s are global variables (if you make them local, you'll see a big difference in some cases). The global variables are (sometimes?) loaded and stored within each loop iteration, so I suppose for some reason the backend doesn't recognize that it's enough to load/store them once before/after the loop. With p2c, they become local variables of `main', which might explain a lot. While not optimal (in GPC), it doesn't seem so tragic, since a real program wouldn't use global variables for inner counters (would it? ;-).
Coming to gpc: it is probably too late to change the method to store subranges, [...]
No, why?
Frank