Dear George, Frank, Adriaan, and Gale,
George wrote:
Are you enabling any optimizations while compiling?
No, I was not. Thank you for your description of the optimization options. Gale has filled out my table with the effect of -O3. I tried -O2 and -O3 on my Fourier analysis routine and obtained the following results (ms stands for milliseconds).
none   60 ms   (measured by GetTimeStamp, 100 repetitions)
-O2    50 ms
-O3    50 ms
The GPC code calculates sin() directly with the sin() function. My old CW code looks up the sin() value in a table with the help of round(). For the old code I get:
CW 30 ms (measured by second hand and 1000 repetitions)
I have a 30% increase in execution time, but I get platform-independent code and exact results from the transform.
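For readers unfamiliar with the two approaches, here is a minimal sketch in Python (not the original Pascal). The table size and the function names are assumptions made for illustration; they do not come from the CW or GPC code.

```python
import math

TABLE_SIZE = 1024  # hypothetical resolution; the real table size is not stated

# One full period of sin(), sampled at TABLE_SIZE evenly spaced angles.
SIN_TABLE = [math.sin(2.0 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]


def sin_lookup(x):
    """Approximate sin(x) by rounding x to the nearest table entry."""
    index = round(x * TABLE_SIZE / (2.0 * math.pi)) % TABLE_SIZE
    return SIN_TABLE[index]


def sin_direct(x):
    """Call the library sine directly, as the GPC version does."""
    return math.sin(x)
```

The lookup is limited by the table spacing: its error can be as large as pi/TABLE_SIZE, whereas the direct call is accurate to the precision of the library routine.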
Frank wrote (referring to features of MacOS X):
Provided the hardware supports these features (I suppose it does, though I don't have a Mac myself) and there's no heavy concurrent load (which would cause context switches etc.), this should hardly matter.
You're right. I ran my CW test code on MacOS 9 and then within OS X's MacOS 9 emulator. The 10^8 sin() loop takes the same amount of time on both operating systems (about sixteen seconds).
Frank wrote:
Maybe using `Trunc' (and adjusting the tables accordingly, or using `Trunc (x + 0.5)' if you know the argument will always be non-negative) is an option.
Using -O2 and the Unix "time" facility, I find that 10^8 calls to round() take 22.3 seconds, and 10^8 calls to trunc() take 21.1 seconds.
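To make Frank's suggestion concrete, here is a small Python sketch (not Pascal). It assumes the standard Pascal definition of Round, which rounds halves away from zero; the function names are hypothetical.

```python
import math


def pascal_round(x):
    """Nearest integer, halves away from zero (assumed Pascal Round behaviour)."""
    return math.trunc(x + 0.5) if x >= 0 else math.trunc(x - 0.5)


def round_via_trunc(x):
    """Frank's shortcut: valid only when x is known to be non-negative."""
    return math.trunc(x + 0.5)
```

The two functions agree for every non-negative argument. For a negative argument such as -1.3 the shortcut gives a different answer, which is why the non-negative condition matters.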
Adriaan wrote:
For-loops themselves are suspect in GPC, see the thread "A littlle benchmark" in the GPC mailing list archives:
Looking at Gale's results, it appears that the -O3 optimization takes care of the slow for-loop problem.
Adriaan wrote:
By the way, you can also use 'TickCount' (which returns a count in 1/60 seconds) or 'Microseconds'.
Yes, but in this case I liked the separation of the timer and the thing to be timed. Also, the code is the same for both compilers (except for longint and longreal), which I thought would save you some mental effort.
Adriaan wrote:
If you look in the fp.pas unit provided with the ported GPCPInterfaces, you will find there a wealth of mathematical routines.
The fp.pas routines are tempting, but I am trying to write multi-platform GPC code. My impression is that I will have to re-implement the fp.pas routines when I port my code to Linux and UNIX.
Gale wrote:
loop              CW time      GPC time     GPC -O3 time
statement         (Scaled s)   (Scaled s)   (Scaled s)

none               1.12         2.12         0.82
a:=a*a/a           6.06         6.40         5.36
a:=p               3.18         3.67         2.49
a:=round(a)        8.54        26.54        21.83
a:=round_fp(a)     5.79         7.33         6.13
a:=sin(p)         13.85        16.45        14.87
Your CW times are the same as mine. I turned on -O3 and went through the tests. I confirmed your times using the command-line "time" facility. When I repeat the "time" measurements, they vary by about 5%, and are about 10% faster than yours.
I was left wondering why my GPC no-optimization results of Saturday were longer than yours, so I turned off -O3 and re-built. Today, 10^8 executions of a:=a*a/a took 7 seconds, but on Saturday they took 13 seconds. I unplugged my power adaptor and tried again: back to 13 seconds. While I was running off my battery on Saturday, MacOS X was switching the CPU to half-speed to save energy, but OS9 was not. Thus the CW code had an advantage.
So, thank you for repeating my tests. I now agree with your results. By the way, I turned on and turned off all the CW optimizations, as I have done before, and noticed no significant difference in performance.
Gale wrote:
You might want to look into using one of the several floating point rounding functions declared in Apple's Universal Interfaces fp.p(.pas) unit.
I'm going to call sin() instead of using a sinusoidal look-up table. I am amazed at the efficiency of the sin() function. Consider the following GPC results. Recall that p is an integer index, climbing from 1 to 100. I set b:=12.345 before the loop, so it is constant throughout.
loop           GPC
statement      (s)     Optimization

none           0.64    -O3
a:=sin(p)      17      -O3
a:=sin(b)      0.92    -O3 (clever optimizer, knows b is constant)
a:=sin(b)      0.89    -O2 (clever again)
a:=sin(b)      16      none
a:=sin(a+b)    16      -O2
It takes only 150 ns to calculate the sine of a real number, which is 120 clock cycles, and a little over twice the time it takes to calculate a*a/a. The fastest round() function we have takes about 50 clock cycles, and we need another ten or twenty to access a look-up table. Using sin() is only 30% slower.
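The arithmetic behind the 150 ns figure can be checked with a few lines of Python. This sketch assumes the loops above ran 10^8 times, as in the earlier tests, and it subtracts the -O3 empty-loop time from an -O2 measurement, so it is only approximate. The clock frequency is inferred from the 120-cycle figure and is not stated anywhere in this message.

```python
ITERATIONS = 10**8        # assumed repetition count, as in the earlier tests

loop_only_s = 0.64        # empty loop, -O3
sin_loop_s = 16.0         # a:=sin(a+b), -O2

# Time per sin() call after removing the loop overhead, in nanoseconds.
sin_time_ns = (sin_loop_s - loop_only_s) / ITERATIONS * 1e9

# Clock frequency implied by "150 ns is 120 clock cycles", in MHz.
implied_clock_mhz = 120 / 150 * 1000

print(f"sin() takes about {sin_time_ns:.1f} ns per call")
print(f"implied clock is about {implied_clock_mhz:.0f} MHz")
```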
My second use of round() is in drawing lines. But in this case, it is easy to implement an incremental rounding function that adds or subtracts one from an integer as its real-valued cousin rises or falls.
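One possible form of such an incremental rounding function, sketched in Python (not Pascal). The function name and the half-unit threshold are assumptions of this sketch, not the code actually used.

```python
import math


def track_rounded(values):
    """Follow a sequence of real values with an integer that moves by +1 or -1.

    Only the first value is rounded in the ordinary way. After that the
    integer is stepped whenever the real value drifts more than half a
    unit away from it.
    """
    result = []
    n = None
    for v in values:
        if n is None:
            n = math.floor(v + 0.5)   # one ordinary rounding to start
        else:
            while v - n > 0.5:        # real value has risen past the midpoint
                n += 1
            while n - v > 0.5:        # real value has fallen past the midpoint
                n -= 1
        result.append(n)
    return result
```

For a line whose slope is at most one pixel per step, each loop body runs at most once per point, so the cost per point is a couple of comparisons and an occasional increment.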
Gale wrote:
Now, getting to the relatively large GPC difference with a:=round(a). After looking at the differences in generated code between the compilers (assembly code and algorithms implemented), I think the main difference is that GPC supports rounding to 64 bit integers and CW Pascal only supports rounding to 32 bit integers.
Thank you for the explanation.
In short: GPC with -O3 and fp.pas routines is about 10% faster than CW. If you want to avoid using fp.pas, then you will still be faster than CW if you keep the GPC round() and trunc() out of your most heavily-used loops.
Yours, Kevan Hashemi