Kevan Hashemi wrote:
Dear Frank and Adriaan,
From Frank: If your version doesn't recognize `{$R-}', it's probably older and doesn't do range-checking at all, so you don't need to worry.
Thank you, and I have determined that the indexing performed for the dynamic schemata is as fast as my own pointer arithmetic. With the schemata instead of pointer arithmetic, my code is now half as long and far more readable. Converting my code to GPC has been a real pleasure, and I continue to discover new and delightful features.
From Adriaan: I can hardly believe it's the generated code, but nothing is impossible ... If you send me the source code, I will have a look at it (but I have very limited time available this month).
Your offer is most generous. I have done some work to try and make your investigation simpler.
I am comparing the CW Pascal compiler issued about five years ago with the GPC compiler I installed two months ago. The CW code runs in MacOS 9. The GPC code runs in MacOS X from the UNIX terminal.
Here is my GPC test program:
program test;

var
  a: real;
  m, n, p: integer;

begin
  a := 1234.567;
  writeln('starting loop...');
  for m := 0 to 1000 do
    for n := 0 to 1000 do
      begin
        for p := 1 to 100 do
          begin
            { loop statement here }
          end;
      end;
  writeln('done.');
end.
In CW the code looks the same, but the variables are longreals and longints so that they match the GPC sizes. As you can see, the loop statement gets executed roughly 100,000,000 times (1001 x 1001 x 100 = 100,200,100, since the two outer loops run from 0 to 1000 inclusive). I measure the execution time by counting seconds in my head between the prints to the console. I'm using an 800 MHz iBook. Here are my results:
loop           CW time   GPC time
statement        (s)       (s)

none              1         3
a:=a*a/a          6        13
a:=p              3         6
a:=round(a)       8        40
a:=sin(p)        15        40
Correction:
loop           CW time   GPC time
statement        (s)       (s)

a:=sin(p)        15        20 (not 40)
Below are timing results from testing on a 500 MHz G4, with the results linearly scaled to 800 MHz for comparison purposes. I called the Mac OS Microseconds system routine immediately before and after the loops to gather the timing data. For CW Pascal, I used level 4 global optimization with peephole optimization and 7400 PPC instruction scheduling (in other words, maximal optimizations). For GPC, I used a plain vanilla automake with no optimization argument for the first GPC column and a -O3 optimization argument for the second GPC column.
For CW Pascal testing, I compiled with the CW Pro 7 Pascal update compiler and ran the tests under cooperative multitasking Mac OS 9.1. For GPC testing, I used Adriaan's gpc-3.3d6 version (built using gpc version 20030507, based on gcc-3.3) and ran the tests under preemptive multitasking Mac OS X 10.2.6.
The "a:=round_fp(a)" line is a new test which doesn't require the floating point to integer and integer to floating point conversions. For this test, I used the Mac OS system round function (from the fp unit), which takes a double floating point argument and returns a rounded double floating point result. (Note for Mac OS users who may not be aware of it: it is possible to use both the Pascal language integer-returning round and the Mac OS system double-returning round in the same code. Although the means to do so are different in CW Pascal and GPC, both support language features which provide for this possibility.)
loop              CW time      GPC time     GPC -O3 time
statement         (scaled s)   (scaled s)   (scaled s)

none                1.12         2.12         0.82
a:=a*a/a            6.06         6.40         5.36
a:=p                3.18         3.67         2.49
a:=round(a)         8.54        26.54        21.83
a:=round_fp(a)      5.79         7.33         6.13
a:=sin(p)          13.85        16.45        14.87
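For anyone who wants to rescale their own measurements the same way, here is a minimal sketch of the linear scaling. It assumes run time is inversely proportional to clock frequency, which is only an approximation; the names ScaledSeconds, MeasuredMHz and TargetMHz are made up for this illustration, and the 10.0 s input is not one of the measured values.

```pascal
program ScaleTiming;

{ Sketch: rescale a time measured at one clock rate to another clock rate,
  assuming run time is inversely proportional to clock frequency. }

const
  MeasuredMHz = 500.0;  { clock rate of the machine used for measuring }
  TargetMHz   = 800.0;  { clock rate the results are scaled to }

{ Hypothetical helper: returns the time expected at TargetMHz. }
function ScaledSeconds(measured: Real): Real;
begin
  ScaledSeconds := measured * MeasuredMHz / TargetMHz;
end;

begin
  { 10.0 s is an illustrative input, not a figure from the tests }
  WriteLn(ScaledSeconds(10.0):1:2);  { prints 6.25 }
end.
```

Going the other way, a scaled figure multiplied by 800/500 gives back the time actually measured on the 500 MHz machine.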
Overall, at least with my testing configuration and with the exception of a:=round(a), there isn't much of a performance difference between GPC and CW Pascal. Unoptimized GPC code is slightly slower than fully optimized CW Pascal code, and optimized GPC code is slightly faster than CW Pascal code for non-library call related code. (The library call code, round_fp and sin, ends up using Apple supplied code for both compilers, so the differences are really due to differences in Apple's library call mechanics and code implementation between Mac OS 9.1 and Mac OS X 10.2.6.) In light of the 100,000,000 iteration count and the "artificial" pattern of the test code, I think that for a more realistic code pattern mixture the speed differences between optimized CW Pascal and optimized GPC will be noise level for most code.
Now, getting to the relatively large GPC difference with a:=round(a). After looking at the differences in generated code between the compilers (assembly code and algorithms implemented), I think the main difference is that GPC supports rounding to 64 bit integers and CW Pascal only supports rounding to 32 bit integers. On a 32 bit PPC CPU, converting between 64 bit integer and floating point formats requires quite a few more instructions than 32 bit integer format conversions. Another "penalty" with the PPC CPU is that the data has to go through memory, since there is no direct connection between integer registers and floating point registers; with double the data, you're hitting, if not exceeding, the limits of the load/store unit's capabilities.
As a general PPC performance rule of thumb, it is always a "win" to minimize the number of floating point to/from integer format conversions since the conversion is relatively expensive. With a:=round(a), two conversions are performed (using 64 bit integers for GPC and 32 bit integers for CW Pascal). The round function involves a floating point to integer conversion, and the result is then converted back to floating point for storing in variable 'a'. (The back-to-back conversions can't be optimized away due to the possibility of integer overflow in the floating point to integer conversion.)
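To make the two conversions visible, here is a small sketch that writes a:=round(a) as two separate steps. The temporary 'i' is hypothetical; a compiler keeps the intermediate integer hidden and may give it a different size than the Integer type used here.

```pascal
program RoundTrip;

{ Sketch: a := round(a) written out as its two separate conversions. }

var
  a: Real;
  i: Integer;  { hypothetical temporary for the intermediate integer }

begin
  a := 1234.567;
  i := Round(a);  { conversion 1: floating point -> integer }
  a := i;         { conversion 2: integer -> floating point }
  WriteLn(a:1:1);
end.
```

A rounding routine that takes and returns a floating point value, as in the a:=round_fp(a) test, avoids both steps.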
Most things take two or three times as long with GPC. I expect code running on MacOS X (UNIX) to be slower than code on MacOS 9 because MacOS X is re-entrant, is subject to pre-emptive multitasking, and provides protected memory. I am more interested in the fact that the GPC round() function takes four times as long as the GPC implementation of a:=a*a/a, while the CW round() function takes about the same time as the CW implementation of a:=a*a/a.
As you know, rounding a number with platform-independent mathematical functions is slow. The CW round() probably uses the PowerPC real number format to abbreviate the rounding process. Perhaps GPC uses a platform-independent implementation.
Actually, although the algorithms are different, both CW and GPC use platform-independent round() implementations. As mentioned above, I think the major factor in the performance difference is that GPC's round returns 64 bit integer results whereas CW Pascal only returns 32 bit results. (CW implements round as a library routine written in ISO C which, by eyeball check, looks to be a less efficient algorithm than GPC's algorithm.)
I have always used round() with sinusoidal look-up tables in my Fourier transforms. The above results suggest that I gained very little by doing so. Nevertheless, I also use round() to obtain display coordinates from real-valued graphs, and these routines are running five times slower than before.
You might want to look into using one of the several floating point rounding functions declared in Apple's Universal Interfaces fp.p(.pas) unit. Depending upon your needs, one of the routines may yield better performance for both CW Pascal and GPC, as can be seen in the a:=round(a) versus a:=round_fp(a) line timing results above. If platform independence is a concern, I'll note that fp.p(.pas) is mostly just an Apple repackaging of the math.h required by the latest ISO C standard, which can be easily handled with GPC as long as the platform target has a fairly up-to-date GPC/gcc.
Gale Paeper gpaeper@empirenet.com