Dear GPC Users,
The GPC manual says I can turn off range checking with {$R-} or with the command-line option --no-range-checking. I tried both, but the compiler gives me an error: unrecognised directive. I'm using GPC for MacOSX compiled with GCC 3.3.
Can I turn off range checking?
I translated a few thousand lines of Code Warrior Pascal into GPC Pascal. As I did so, I took the opportunity to make use of GPC's wonderful dynamic schemata types, so I removed all my ugly pointer arithmetic, which I had previously used to access dynamically allocated blocks of memory.
My translated code runs at half the speed of my original code. I don't think it is the dynamic schemata, but I thought I would turn off the range checking just to see.
What other default compiler options might slow down the code? Should I turn off debugging information?
Yours, Kevan Hashemi
Kevan Hashemi wrote:
Dear GPC Users,
The GPC manual says I can turn off range checking with {$R-} or with the command-line option --no-range-checking. I tried both, but the compiler gives me an error: unrecognised directive. I'm using GPC for MacOSX compiled with GCC 3.3.
The version on my website is based on gpc-20030507, which doesn't have range-checking. Updated compiler binaries will be available there soon. The latest gpc alpha partially implements range-checking.
Can I turn off range checking?
I translated a few thousand lines of Code Warrior Pascal into GPC Pascal. As I did so, I took the opportunity to make use of GPC's wonderful dynamic schemata types, so I removed all my ugly pointer arithmetic, which I had previously used to access dynamically allocated blocks of memory.
My translated code runs at half the speed of my original code. I don't think it is the dynamic schemata, but I thought I would turn off the range checking just to see.
I can hardly believe it's the generated code, but nothing is impossible ... If you send me the source code, I will have a look at it (but I have very limited time available this month).
What other default compiler options might slow down the code? Should I turn off debugging information?
Try -O2 or -O3 for code optimization. Turning off debugging information only affects the size, not the speed. Be aware that, in general, bad field alignments may greatly decrease speed on the PowerPC (e.g. doubles aligned on 4-byte boundaries rather than on 8-byte boundaries).
Yours, Kevan Hashemi
Regards,
Adriaan van Os
Kevan Hashemi wrote:
The GPC manual says I can turn off range checking with {$R-} or with the command-line option --no-range-checking. I tried both, but the compiler gives me an error: unrecognised directive. I'm using GPC for MacOSX compiled with GCC 3.3.
Can I turn off range checking?
The latest GPC release (20030830 -- try `gpc -v' to see the version; the GCC version doesn't mean much here) supports range-checking only in a few cases. General range-checking will be added in the next release.
If your version doesn't recognize `{$R-}', it's probably older and doesn't do range-checking at all, so you don't need to worry. When you upgrade, just add the option again; it should be recognized then.
Frank
Dear Frank and Adriaan,
From Frank: If your version doesn't recognize `{$R-}', it's probably older and doesn't do range-checking at all, so you don't need to worry.
Thank you, and I have determined that the indexing performed for the dynamic schemata is as fast as my own pointer arithmetic. With the schemata instead of pointer arithmetic, my code is now half as long, and far more readable. Converting my code to GPC has been a real pleasure, and I continue to discover new and delightful features.
From Adriaan: I can hardly believe it's the generated code, but nothing is impossible ... If you send me the source code, I will have a look at it (but I have very limited time available this month).
Your offer is most generous. I have done some work to try and make your investigation simpler.
I am comparing the CW Pascal compiler issued about five years ago with the GPC compiler I installed two months ago. The CW code runs in MacOS 9. The GPC code runs in MacOS X from the UNIX terminal.
Here is my GPC test program:
program test;
var a:real; m,n,p:integer;
begin
  a:=1234.567;
  writeln('starting loop...');
  for m:=0 to 1000 do
    for n:=0 to 1000 do
      begin
        for p:=1 to 100 do
          begin
            { loop statement here }
          end;
      end;
  writeln('done.');
end.
In CW the code looks the same, but the variables are longreals and longints so that they match the GPC sizes. As you can see, the loop statement gets executed 100,000,000 times. I measure the execution time by counting seconds in my head between the prints to the console. I'm using an 800 MHz iBook. Here are my results:
  loop statement   CW time (s)   GPC time (s)
  none                  1              3
  a:=a*a/a              6             13
  a:=p                  3              6
  a:=round(a)           8             40
  a:=sin(p)            15             40
Most things take two or three times as long with GPC. I expect code running on MacOS X (UNIX) to be slower than code on MacOS 9 because MacOS X is re-entrant, is subject to pre-emptive multitasking, and provides protected memory. I am more interested in the fact that the GPC round() function takes four times as long as the GPC implementation of a:=a*a/a, while the CW round() function takes about the same time as the CW implementation of a:=a*a/a.
As you know, rounding a number with platform-independent mathematical functions is slow. The CW round() probably uses the PowerPC real number format to abbreviate the rounding process. Perhaps GPC uses a platform-independent implementation.
I have always used round() with sinusoidal look-up tables in my Fourier transforms. The above results suggest that I gained very little by doing so. Nevertheless, I also use round() to obtain display coordinates from real-valued graphs, and these routines are running five times slower than before.
Yours, Kevan Hashemi
On Saturday 18 October 2003 19:34, Kevan Hashemi wrote:
Most things take two or three times as long with GPC. I expect code running
Are you enabling any optimizations while compiling? GPC is built on the GCC backend and in general produces fairly good code. However, to enjoy this you *have* to ask the compiler to optimize. Leaving out debug info helps as well.
Basically you can use all the standard GCC optimization switches (see man/info gcc, under gcc invocation). In particular, a minimal set of switches might look like: -O2 -march=??? [-fomit-frame-pointer]. You might try -Os instead of -O2, or -O3.
-O2 enables a bunch of "safe" optimizations, and may be used in almost any situation.
-O3 additionally enables some loop unrolling, which generally makes code larger and sometimes faster, but sometimes slower.
-Os optimizes for size (instead of speed), which might improve cache hits, which in turn sometimes provides massive speedups.
-fomit-frame-pointer frees one register, which is most useful on the x86 architecture; however, it makes debugging impossible.
A few more options to look into (not as safe): -ffast-math disables some checks and features; in particular, "nan" is no longer defined, which has broken a few (mostly scientific) programs in my experience, although surprisingly not that many.
-fexpensive-optimizations is somewhat safer, although it may (additionally) increase your compilation time, and it is hard to squeeze anything on top of -O2 anyway.
-mfpmath=sse may be a life saver on a Pentium 3/4 or Athlon (or break things), but is probably not that useful outside x86.
Plus alignment options like -fforce-addr or -falign-functions=4/8 might gain you a few percent here and there.
Kind of a disclaimer: I am writing this from memory, so I might have misspelled some of the flags (and some of them were renamed in the transition from gcc-2.95.x to gcc-3.x), but they should be pretty close. Look in the man/info pages for more detailed descriptions and exact spellings.
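Concretely, the switches above combine on the GPC command line just as they do with GCC. A few illustrative invocations (a sketch, not tested against any particular GPC release; note that on PowerPC the CPU-selection flag is -mcpu rather than -march, and `myprog.pas` is a placeholder file name):

```shell
# Safe default: general optimizations, no debug info
gpc -O2 -o myprog myprog.pas

# More aggressive: loop unrolling plus CPU-specific scheduling
# (7400 = G4; -fomit-frame-pointer mainly helps on register-starved
# x86 and makes debugging harder)
gpc -O3 -mcpu=7400 -fomit-frame-pointer -o myprog myprog.pas

# Optimize for size instead; sometimes faster via better cache use
gpc -Os -o myprog myprog.pas
```

As George says, check the GCC documentation for the exact spelling on your compiler version.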
George
Dear George, Frank, Adriaan, and Gale,
George wrote:
Are you enabling any optimizations while compiling?
No, I was not. Thank you for your description of the optimization options. Gale has filled out my table with the effect of -O3. I tried -O2 and -O3 on my Fourier analysis routine, and obtained the following results (ms is for milliseconds).
  none   60 ms   (measured by GetTimeStamp, 100 repetitions)
  -O2    50 ms
  -O3    50 ms
The GPC code calculates sin() directly with the sin() function. My old CW code looks up the sin() value in a table with the help of round(). For the old code I get:
CW 30 ms (measured by second hand and 1000 repetitions)
I have a 30% increase in execution time, but I get platform-independent code and exact results from the transform.
Frank wrote (referring to features of MacOS X):
Provided the hardware supports these features (I suppose it does, though I don't have a Mac myself) and there's no heavy concurrent load (which would cause context switches etc.), this should hardly matter.
You're right. I ran my CW test code on MacOS 9 and then within OS X's MacOS 9 emulator. The 10^8 sin() loop takes the same amount of time on both operating systems (about sixteen seconds).
Frank wrote:
Maybe using `Trunc' (and adjusting the tables accordingly, or using `Trunc (x + 0.5)' if you know the argument will always be non-negative) is an option.
Using -O2 and the unix "time" facility, I get 10^8 round() in 22.3 seconds, and 10^8 trunc() in 21.1 seconds.
Adriaan wrote:
For-loops themselves are suspect in GPC, see the thread "A littlle benchmark" in the GPC mailing list archives:
Looking at Gale's results, it appears that the -O3 optimization takes care of the slow for-loop problem.
Adriaan wrote:
By the way, you can also use 'TickCount' (which returns a count in 1/60 seconds) or 'Microseconds'.
Yes, but in this case I liked the separation of the timer and the thing to be timed. Also, the code is the same for both compilers (except for longint and longreal), which I thought would save you some mental effort.
Adriaan wrote:
If you look in the fp.pas unit provided with the ported GPCPInterfaces, you will find there a wealth of mathematical routines.
The fp.pas routines are tempting, but I am trying to write multi-platform GPC code. My impression is that I will have to re-implement the fp.pas routines when I port my code to Linux and UNIX.
Gale wrote:
  loop statement   CW time     GPC time    GPC -O3 time
                   (scaled s)  (scaled s)  (scaled s)
  none                1.12        2.12         0.82
  a:=a*a/a            6.06        6.40         5.36
  a:=p                3.18        3.67         2.49
  a:=round(a)         8.54       26.54        21.83
  a:=round_fp(a)      5.79        7.33         6.13
  a:=sin(p)          13.85       16.45        14.87
Your CW times are the same as mine. I turned on -O3 and went through the tests. I confirmed your times using the command-line "time" facility. When I repeat the "time" measurements, they vary by about 5%, and are about 10% faster than yours.
I was left wondering why my GPC no-optimization results of Saturday were longer than yours, so I turned off -O3 and re-built. Today, 10^8 executions of a:=a*a/a took 7 seconds, but on Saturday they took 13 seconds. I unplugged my power adaptor and tried again: back to 13 seconds. While I was running off my battery on Saturday, MacOS X was switching the CPU to half-speed to save energy, but OS9 was not. Thus the CW code had an advantage.
So, thank you for repeating my tests. I now agree with your results. By the way, I turned on and turned off all the CW optimizations, as I have done before, and noticed no significant difference in performance.
Gale wrote:
You might want to look into using one of the several floating point rounding functions declared in Apple's Universal Interfaces fp.p(.pas) unit.
I'm going to call sin() instead of using a sinusoidal look-up table. I am amazed at the efficiency of the sin() function. Consider the following GPC results. Recall that p is an integer index, climbing from 1 to 100. I set b:=12.345 before the loop, so it is constant throughout.
  loop statement   GPC time (s)   optimization
  none                  0.64      -O3
  a:=sin(p)            17         -O3
  a:=sin(b)             0.92      -O3 (clever optimizer, knows b is constant)
  a:=sin(b)             0.89      -O2 (clever again)
  a:=sin(b)            16         none
  a:=sin(a+b)          16         -O2
It takes only 150 ns to calculate the sin of a real number, which is 120 clock cycles, and a little over twice the time it takes to calculate a*a/a. The fastest round() function we have takes about 50 clock cycles, and we need another ten or twenty to access a look-up table. Using sin() is only 30% slower.
My second use of round() is in drawing lines. But in this case, it is easy to implement an incremental rounding function that adds or subtracts one from an integer as its real-valued cousin rises or falls.
Gale wrote:
Now, getting to the relatively large GPC difference with a:=round(a). After looking at the differences in generated code between the compilers (assembly code and algorithms implemented), I think the main difference is that GPC supports rounding to 64 bit integers and CW Pascal only supports rounding to 32 bit integers.
Thank you for the explanation.
In short: GPC with -O3 and fp.pas routines is about 10% faster than CW. If you want to avoid using fp.pas, then you will still be faster than CW if you keep the GPC round() and trunc() out of your most heavily-used loops.
Yours, Kevan Hashemi
Kevan Hashemi wrote:
In short: GPC with -O3 and fp.pas routines is about 10% faster than CW. If you want to avoid using fp.pas, then you will still be faster than CW if you keep the GPC round() and trunc() out of your most heavily-used loops.
I would be quite curious to know if there is any difference in speed and precision if you also pass the --fast-math option (possibly dropping IEEE conformance).
Regards,
Adriaan van Os
Kevan Hashemi wrote:
My second use of round() is in drawing lines. But in this case, it is easy to implement an incremental rounding function that adds or subtracts one from an integer as its real-valued cousin rises or falls.
Sounds quite like the standard Bresenham algorithm which works completely with integers.
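For reference, the standard integer-only Bresenham algorithm Frank mentions can be sketched in C as follows (a generic all-octant version; writing pixels into caller-supplied arrays is just for illustration, not part of the classic algorithm):

```c
#include <stdlib.h>

/* Classic integer Bresenham line from (x0,y0) to (x1,y1).
   Records each plotted pixel in xs[]/ys[] and returns the count.
   The caller must size the arrays for the longest expected line. */
int bresenham(int x0, int y0, int x1, int y1, int *xs, int *ys) {
    int dx = abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;  /* combined error term */
    int n = 0;
    for (;;) {
        xs[n] = x0; ys[n] = y0; n++;
        if (x0 == x1 && y0 == y1) break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }  /* step in x */
        if (e2 <= dx) { err += dx; y0 += sy; }  /* step in y */
    }
    return n;
}
```

Only additions, comparisons, and shifts by one appear in the loop, which is why the algorithm is the usual benchmark for integer-only line drawing.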
Gale wrote:
Now, getting to the relatively large GPC difference with a:=round(a). After looking at the differences in generated code between the compilers (assembly code and algorithms implemented), I think the main difference is that GPC supports rounding to 64 bit integers and CW Pascal only supports rounding to 32 bit integers.
BTW, if you can rebuild GPC, you can check if this is the main issue if you replace `long_long_integer_type_node' by `integer_type_node' in p/predef.c after `p_Round'. This won't be a real solution since we don't want to give up on 64 bit `Round' in general, but it would help determine the influence.
Frank
Frank Heckenbach wrote:
Kevan Hashemi wrote:
My second use of round() is in drawing lines. But in this case, it is easy to implement an incremental rounding function that adds or subtracts one from an integer as its real-valued cousin rises or falls.
Sounds quite like the standard Bresenham algorithm which works completely with integers.
Yes, but it quite depends on the application whether the Bresenham line drawing algorithm is precise enough. If the "source" coordinates are floating point, you can round the start- and end-coordinates to integer values and then draw a line with the Bresenham algorithm. However, some intermediate pixels will invariably be one pixel off, because the rounding of the start- and end-coordinates also influences the "weight" and rounding of intermediate pixels.
AutoCAD, for example, does this wrong and thus draws ugly lines. It becomes even worse when circles are approximated by polygons.
I have been thinking about an improved Bresenham line drawing algorithm that eliminates this problem, but I never had the time to write one. Maybe it already exists; it would be interesting.
Regards,
Adriaan van Os
Dear Adriaan,
I have been thinking about an improved Bresenham line drawing algorithm that eliminates this problem, but I never had the time to write one. Maybe it already exists; it would be interesting.
I don't think this routine of mine is anything to brag about, but I'm curious to see what you think of it.
Integer arithmetic does not go much faster than real-number arithmetic (see our tests), and integer->real translation is fast, so I don't see any point in staying in integer-only math. Here's my line-drawer. The lines get clipped to a rectangle called analysis_bounds.
type {for 2-D integer geometry}
  ij_point_type=record i,j:integer; end;
  ij_line_type=record a,b:ij_point_type; end;
  ij_rectangle_type=record top,left,bottom,right:integer; end;
type {for 2-D real geometry}
  xy_point_type=record x,y:real; end;
  xy_line_type=record a,b:xy_point_type; end;
procedure draw_overlay_line(image_ptr:image_ptr_type;
  line:ij_line_type;color:overlay_pixel_type);
const rough_step_size=0.8;{pixels}
var
  num_steps,step_num:integer;
  p,q,step:xy_point_type;
  s:ij_point_type;
  image_rect:ij_rectangle_type;
  outside:boolean;
begin
  if not valid_image_ptr(image_ptr) then exit;
  if not valid_analysis_bounds(image_ptr) then exit;
  ij_clip_path(line,outside,image_ptr^.analysis_bounds);
  if outside then exit;
  if not ij_in_rectangle(line.a,image_ptr^.analysis_bounds) then exit;
  if not ij_in_rectangle(line.b,image_ptr^.analysis_bounds) then exit;

  with line,image_ptr^ do
    begin
      overlay[a.j,a.i]:=color;
      overlay[b.j,b.i]:=color;
      p.x:=a.i; p.y:=a.j;
      q.x:=b.i; q.y:=b.j;
      s:=a;
    end;

  if xy_separation(p,q)<rough_step_size then
    num_steps:=0
  else
    num_steps:=round(xy_separation(p,q)/rough_step_size);
  step:=xy_scale(xy_difference(q,p),1/(num_steps+1));

  for step_num:=1 to num_steps do
    begin
      p:=xy_sum(p,step);
      if p.x-s.i>0.5 then inc(s.i)
      else if p.x-s.i<-0.5 then dec(s.i);
      if p.y-s.j>0.5 then inc(s.j)
      else if p.y-s.j<-0.5 then dec(s.j);
      image_ptr^.overlay[s.j,s.i]:=color;
    end;
end;
It is almost twice as fast as using round() to get s from p, and my 800 MHz iBook draws a 400-pixel line in 90 microseconds (-O3). The line clipping and checks at the top of the routine take up less than 10% of the execution time. As you can see, the rough step size of 0.8 makes sure we don't skip pixels. For 45-degree lines, we lose some efficiency. We could make the rough step size just under sqrt(2).
Yours, Kevan
Adriaan van Os wrote:
Frank Heckenbach wrote:
Kevan Hashemi wrote:
My second use of round() is in drawing lines. But in this case, it is easy to implement an incremental rounding function that adds or subtracts one from an integer as its real-valued cousin rises or falls.
Sounds quite like the standard Bresenham algorithm which works completely with integers.
Yes, but it quite depends on the application whether the Bresenham line drawing algorithm is precise enough. If the "source" coordinates are floating point, you can round the start- and end-coordinates to integer values and then draw a line with the Bresenham algorithm. However, some intermediate pixels will invariably be one pixel off, because the rounding of the start- and end-coordinates also influences the "weight" and rounding of intermediate pixels.
AutoCAD, for example, does this wrong and thus draws ugly lines. It becomes even worse when circles are approximated by polygons.
It would be better to use the Bresenham algorithm for circles, of course. :-)
I have been thinking about an improved Bresenham line drawing algorithm that eliminates this problem, but I never had the time to write one. Maybe it already exists; it would be interesting.
Bresenham effectively uses rationals internally. I think you could round to them (i.e., multiply by the denominator, then round to an integer). There are some tricky parts (such as finding a good denominator -- with integer coordinates it's obvious, with floating-point coordinates it's more difficult, but I think it should be possible). I haven't implemented that myself, though ...
Frank
Kevan Hashemi wrote:
Most things take two or three times as long with GPC. I expect code running on MacOS X (UNIX) to be slower than code on MacOS 9 because MacOS X is re-entrant, is subject to pre-emptive multitasking, and provides protected memory.
Provided the hardware supports these features (I suppose it does, though I don't have a Mac myself) and there's no heavy concurrent load (which would cause context switches etc.), this should hardly matter.
I am more interested in the fact that the GPC round() function takes four times as long as the GPC implementation of a:=a*a/a, while the CW round() function takes about the same time as the CW implementation of a:=a*a/a.
As you know, rounding a number with platform-independent mathematical functions is slow. The CW round() probably uses the PowerPC real number format to abbreviate the rounding process. Perhaps GPC uses a platform-independent implementation.
It does -- and in addition it's harmed by the fact that Pascal's rounding does not match either of the 4 IEEE rounding modes, so it's implemented as "Trunc (if x >= 0.0 then x + 0.5 else x - 0.5)", according to the standard's description:
: From the expression x that shall be of real-type, this function
: shall return a result of integer-type. If x is positive or zero,
: round(x) shall be equivalent to trunc(x+0.5); otherwise, round(x)
: shall be equivalent to trunc(x-0.5). It shall be an error if such a
: value does not exist.
I.e., round to nearest and half-integers towards +/-infinity. It does appear strange, maybe an unintentional feature, and I don't know which other Pascal compilers actually implement this ...
I have always used round() with sinusoidal look-up tables in my fourier transforms. The above results suggest that I gained very little by doing so.
Maybe using `Trunc' (and adjusting the tables accordingly, or using `Trunc (x + 0.5)' if you know the argument will always be non-negative) is an option.
Frank
Kevan Hashemi wrote:
Here is my GPC test program:
program test;
var a:real; m,n,p:integer;
begin
  a:=1234.567;
  writeln('starting loop...');
  for m:=0 to 1000 do
    for n:=0 to 1000 do
      begin
        for p:=1 to 100 do
          begin
            { loop statement here }
          end;
      end;
  writeln('done.');
end.
In CW the code looks the same, but the variables are longreals and longints so that they match the GPC sizes. As you can see, the loop statement gets executed 100,000,000 times.
For-loops themselves are suspect in GPC, see the thread "A littlle benchmark" in the GPC mailing list archives:
<http://www.gnu-pascal.de/crystal/gpc/en/mail7480.html?pos=23012761#23012761>
<http://www.gnu-pascal.de/crystal/gpc/en/mail7471.html?pos=22983274#22983274>
Frank Heckenbach writes there:
Somehow, the backend's loop optimizations don't recognize GPC's `for' loops optimally. Maybe GPC's handling of `for' loops could be changed (but again, it's hairy, so watch out ...), or it should be improved in the backend, I'm not sure now ...
If Frank doesn't know, I certainly don't know either. You may want to try a repeat-until loop instead.
I measure the execution time by counting seconds in my head between the prints to the console.
By the way, you can also use 'TickCount' (which returns a count in 1/60 seconds) or 'Microseconds'. Both are in the ported GPCPInterfaces (available from my website). They require linking with the 'Carbon framework'.
I'm using an 800 MHz iBook. Here are my results:
  loop statement   CW time (s)   GPC time (s)
  none                  1              3
  a:=a*a/a              6             13
  a:=p                  3              6
  a:=round(a)           8             40
  a:=sin(p)            15             40
Most things take two or three times as long with GPC. I expect code running on MacOS X (UNIX) to be slower than code on MacOS 9 because MacOS X is re-entrant, is subject to pre-emptive multitasking, and provides protected memory.
Not much slower on Mac OS X, I guess, except in using QuickDraw and the like.
I am more interested in the fact that the GPC round() function takes four times as long as the GPC implementation of a:=a*a/a, while the CW round() function takes about the same time as the CW implementation of a:=a*a/a.
As you know, rounding a number with platform-independent mathematical functions is slow. The CW round() probably uses the PowerPC real number format to abbreviate the rounding process. Perhaps GPC uses a platform-independent implementation.
I have always used round() with sinusoidal look-up tables in my fourier transforms. The above results suggest that I gained very little by doing so. Nevertheless, I also use round() to obtain display coordinates from real-valued graphs, and these routines are running five times slower than before.
If you look in the fp.pas unit provided with the ported GPCPInterfaces, you will find there a wealth of mathematical routines, e.g.
  rint       Rounds its argument to an integral value in floating point
             format, honoring the current rounding direction.

  nearbyint  Differs from rint only in that it does not raise the inexact
             exception. It is the nearbyint function recommended by the
             IEEE floating-point standard 854.

  rinttol    Rounds its argument to the nearest long int using the current
             rounding direction. NOTE: if the rounded value is outside
             the range of long int, then the result is undefined.

  round      Rounds the argument to the nearest integral value in floating
             point format similar to the Fortran "anint" function. That is:
             add half to the magnitude and chop.

  roundtol   Similar to the Fortran function nint or to the Pascal round.
             NOTE: if the rounded value is outside the range of long int,
             then the result is undefined.

  trunc      Computes the integral value, in floating format, nearest to
             but no larger in magnitude than its argument. NOTE: on 68K
             compilers when using -elems881, trunc must return an int.
You can use these routines with GPC instead of the built-in GPC runtime routines, to see if that makes any difference. Apple has always meticulously followed the IEEE standards (see the foreword of Professor W. Kahan in the Apple Numerics manual, second edition). To use them, you need to link with the Carbon framework.
I haven't checked if this automatically links in the requested routine for functions with the same name, e.g. 'sin'. If not, you may experiment with:
* linking order of included libraries on the command line
* renaming declarations in GPCPInterfaces or inserting the declarations
  directly in your source code
Please let us know if this works!
As a sidebar, I want to mention Motorola's Libmoto for the PowerPC processor
<http://e-www.motorola.com/webapp/sps/site/prod_summary.jsp?code=LIBMOTO>
For benchmarks, go to http://developer.apple.com/ and search for "libmoto". However, the library is inaccurate in edge-case conditions. I used it for a while in my CAD software, but later dropped it. Besides, I don't know of a Mac OS X port, and Motorola no longer supports the "product", which is a standard problem for much software that is a "product" ...
Regards,
Adriaan van Os
P.S. A separate post on optimization issues on Mac OS X will follow.
Adriaan van Os wrote:
I measure the execution time by counting seconds in my head between the prints to the console.
By the way, you can also use 'TickCount' (which returns a count in 1/60 seconds) or 'Microseconds'. Both are in the ported GPCPInterfaces (available from my website). They require linking with the 'Carbon framework'.
Or just run `time myprogram myarguments' on the command-line (if `time' exists on Mac OS X, but I guess it does).
Frank
Kevan Hashemi wrote:
Dear Frank and Adriaan,
From Frank: If your version doesn't recognize `{$R-}', it's probably older and doesn't do range-checking at all, so you don't need to worry.
Thank you, and I have determined that the indexing performed for the dynamic schemata is as fast as my own pointer arithmetic. With the schemata instead of pointer arithmetic, my code is now half as long, and far more readable. Converting my code to GPC has been a real pleasure, and I continue to discover new and delightful features.
From Adriaan: I can hardly believe it's the generated code, but nothing is impossible ... If you send me the source code, I will have a look at it (but I have very limited time available this month).
Your offer is most generous. I have done some work to try and make your investigation simpler.
I am comparing the CW Pascal compiler issued about five years ago with the GPC compiler I installed two months ago. The CW code runs in MacOS 9. The GPC code runs in MacOS X from the UNIX terminal.
Here is my GPC test program:
program test;
var a:real; m,n,p:integer;
begin
  a:=1234.567;
  writeln('starting loop...');
  for m:=0 to 1000 do
    for n:=0 to 1000 do
      begin
        for p:=1 to 100 do
          begin
            { loop statement here }
          end;
      end;
  writeln('done.');
end.
In CW the code looks the same, but the variables are longreals and longints so that they match the GPC sizes. As you can see, the loop statement gets executed 100,000,000 times. I measure the execution time by counting seconds in my head between the prints to the console. I'm using an 800 MHz iBook. Here are my results:
  loop statement   CW time (s)   GPC time (s)
  none                  1              3
  a:=a*a/a              6             13
  a:=p                  3              6
  a:=round(a)           8             40
  a:=sin(p)            15             40
Correction:
  loop statement   CW time (s)   GPC time (s)
  a:=sin(p)            15            20 (not 40)
Below are timing results from testing on a 500 MHz G4, with the results linearly scaled to 800 MHz for comparison purposes. I used calls to the Mac OS Microseconds system routine immediately before and after the loops to gather the timing data. For CW Pascal, level 4 global optimization with peephole optimization and 7400 PPC instruction scheduling was used (in other words, maximal optimizations). For GPC, I used a plain vanilla automake with no optimization argument for the first column and a -O3 optimization argument for the second column.
For CW Pascal testing, I used the CW Pro 7 Pascal update compiler for compiling and tested running cooperative multitasking Mac OS 9.1. For GPC testing, I used Adriaan's gpc-3.3d6 version (built using gpc version 20030507, based on gcc-3.3) and tested running preemptive multitasking Mac OS X 10.2.6.
The "a:=round_fp(a)" line is a new test which doesn't require the float point to integer and integer to floating point conversions. For this test, I used the Mac OS system round function (from the fp unit) which takes a double floating point argument and returns a rounded double floating point result. (Note for Mac OS users who may not be aware of it, it is possible to use both the Pascal language integer returning round and the Mac OS system double returning round in the same code. Although the means to do so are different in CW Pascal and GPC, both support language features which provid for this possibility.)
  loop statement   CW time     GPC time    GPC -O3 time
                   (scaled s)  (scaled s)  (scaled s)
  none                1.12        2.12         0.82
  a:=a*a/a            6.06        6.40         5.36
  a:=p                3.18        3.67         2.49
  a:=round(a)         8.54       26.54        21.83
  a:=round_fp(a)      5.79        7.33         6.13
  a:=sin(p)          13.85       16.45        14.87
Overall, at least with my testing configuration and with the exception of a:=round(a), there isn't much of a performance difference between GPC and CW Pascal. Unoptimized GPC code is slightly slower than fully optimized CW Pascal code, and optimized GPC code is slightly faster than CW Pascal code for non-library-call-related code. (The library call code, round_fp and sin, ends up using Apple-supplied code for both compilers, so the differences are really due to differences in Apple's library call mechanics and code implementation between Mac OS 9.1 and Mac OS X 10.2.6.) In light of the 100,000,000 iteration count and the "artificial" pattern of the test code, I think for a more realistic code pattern mixture the speed differences between optimized CW Pascal and optimized GPC will be noise level for most code.
Now, getting to the relatively large GPC difference with a:=round(a). After looking at the differences in generated code between the compilers (assembly code and algorithms implemented), I think the main difference is that GPC supports rounding to 64 bit integers and CW Pascal only supports rounding to 32 bit integers. On a 32 bit PPC CPU, converting between 64 bit integer and floating point formats requires quite a few more instructions than 32 bit integer format conversions. Another "penalty" with the PPC CPU is that the data has to go through memory, since there is no direct connection between integer registers and floating point registers, and with double the data you're hitting, if not exceeding, the limits of the load/store unit capabilities.
As a general PPC performance rule of thumb, it is always a "win" to minimize the number of floating point to/from integer format conversions since the conversion is relatively expensive. With a:=round(a), two conversions are performed (64 bit integers for GPC and 32 bit integers for CW Pascal). The round function involves a floating point to integer conversion and the result is then reconverted back to floating point for storing in variable 'a'. (The back-to-back conversions can't be optimized away due to the possibility of integer overflow in the floating point to integer conversion.)
Most things take two or three times as long with GPC. I expect code running on MacOS X (UNIX) to be slower than code on MacOS 9 because MacOS X is re-entrant, is subject to pre-emptive multitasking, and provides protected memory. I am more interested in the fact that the GPC round() function takes four times as long as the GPC implementation of a:=a*a/a, while the CW round() function takes about the same time as the CW implementation of a:=a*a/a.
As you know, rounding a number with platform-independent mathematical functions is slow. The CW round() probably uses the PowerPC real number format to abbreviate the rounding process. Perhaps GPC uses a platform-independent implementation.
Actually, although the algorithms are different, both CW and GPC use platform-independent round() implementations. As mentioned above, I think the major factor in the performance difference is that GPC's round returns 64-bit integer results whereas CW Pascal's only returns 32-bit results. (CW implements round as a library routine written in ISO C, which by eyeball check appears to be a less efficient algorithm than GPC's.)
I have always used round() with sinusoidal look-up tables in my Fourier transforms. The above results suggest that I gained very little by doing so. Nevertheless, I also use round() to obtain display coordinates from real-valued graphs, and those routines are now running five times slower than before.
You might want to look into using one of the several floating-point rounding functions declared in Apple's Universal Interfaces fp.p(.pas) unit. Depending upon your needs, one of those routines may yield better performance for both CW Pascal and GPC, as can be seen in the a:=round(a) versus a:=round_fp(a) timing results above. If platform independence is a concern, I'll note that fp.p(.pas) is mostly just an Apple repackaging of the math.h required by the latest ISO C standard, which GPC can handle easily as long as the target platform has a fairly up-to-date GPC/gcc.
Gale Paeper gpaeper@empirenet.com
Gale Paeper wrote:
Now, getting to the relatively large GPC difference with a:=round(a). After looking at the differences in generated code between the compilers (assembly code and algorithms implemented), I think the main difference is that GPC supports rounding to 64 bit integers and CW Pascal only supports rounding to 32 bit integers.
Yes, that's what GPC does. And since it's generally not obvious from the context when 32 bit are sufficient, and we don't want to give up the 64 bit range, the best we could do is probably to detect some special cases to optimize (such as assignment of the result to a 32 bit integer -- which doesn't apply in this test, anyway) ...
As a general PPC performance rule of thumb, it is always a "win" to minimize the number of floating point to/from integer format conversions since the conversion is relatively expensive. With a:=round(a), two conversions are performed (64 bit integers for GPC and 32 bit integers for CW Pascal). The round function involves a floating point to integer conversion and the result is then reconverted back to floating point for storing in variable 'a'. (The back-to-back conversions can't be optimized away due to the possibility of integer overflow in the floating point to integer conversion.)
If you don't require overflow detection and accept "undefined behaviour" there, it might be possible to optimize it away. (Not that I plan to do that any time soon, but it could be a possibility -- if it has any practical relevance ...)
Frank
Correction:
loop statement    CW time (s)   GPC time (s)
a:=sin(p)         15            20 (not 40)
Yours, Kevan Hashemi