Peter N Lewis wrote:
One thing that stood out in the profile was that compute_checksum, a two-line function called once for each loaded gpi, is responsible for 12% of the total time. It scans through every byte of the gpi (in this case including my 27 MB GPCMacOSAll gpi, once for each of the 250 units in my project), which adds up to around 43 seconds of the 6-minute rebuild time.
static gpi_int
compute_checksum_original (unsigned char *buf, gpi_int size)
{
  gpi_int sum = 0, n = 0;

  for (n = 0; n < size; n++)
    sum += n * buf[n];
  return sum;
}
It appears that gpi_int is a 64-bit int on my system.
I timed a few improvements:
time  sum                   function
42.1  46473731586140096     compute_checksum_original
35.0  46473731586140096     compute_checksum_unrolled
18.8  8551919141529536      compute_checksum_native
12.2  -2963523148067389     compute_checksum_shift
 7.8  -852473248            compute_checksum_add
(time is roughly the number of seconds gpc spends just calling compute_checksum on GPCMacOSAll.gpi during my complete rebuild.)
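compute_checksum_unrolled returns the same sum as the original, so it is a loop transformation rather than a new checksum. Its source is not quoted here; the following is a minimal sketch of what a 4-way unrolled loop could look like, assuming gpi_int is a 64-bit `long long`. The name compute_checksum_unrolled_sketch is hypothetical.

```c
#include <stddef.h>

/* Assumption: gpi_int is a 64-bit signed integer, as on the systems above. */
typedef long long gpi_int;

/* Reference version, as quoted above. */
static gpi_int
compute_checksum_original (const unsigned char *buf, gpi_int size)
{
  gpi_int sum = 0, n;

  for (n = 0; n < size; n++)
    sum += n * buf[n];
  return sum;
}

/* Hypothetical 4-way unrolled variant: same result, fewer loop-control
   instructions per byte.  Not the code from the test program. */
static gpi_int
compute_checksum_unrolled_sketch (const unsigned char *buf, gpi_int size)
{
  gpi_int sum = 0, n = 0;

  for (; n + 4 <= size; n += 4)
    sum += n * buf[n] + (n + 1) * buf[n + 1]
           + (n + 2) * buf[n + 2] + (n + 3) * buf[n + 3];
  for (; n < size; n++)         /* leftover 0..3 bytes */
    sum += n * buf[n];
  return sum;
}
```

Unrolling only trims loop overhead; each byte still costs a multiply and an add, which fits the modest gain (42.1 s to 35.0 s) in the timings above.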
I ran a small test that also tries a few other checksums. Remarks:
1) I do not know if we need a checksum at all.
2) Loading interfaces should not be a bottleneck: we should be able to compile many modules in a single run, loading each interface only once for the whole run.
3) On 32-bit AMD, when computing a 64-bit checksum the bottleneck is the lack of registers, while the fastest checksum is probably limited by DRAM speed.
4) I am slightly surprised by the Mac results: a G5 Mac can (and should) use 64-bit arithmetic even for 32-bit applications, and the Mac also has many registers.
5) On 64-bit machines the current checksum seems to be reasonably fast.
6) The current checksum can be computed using only additions (compute_checksum_ladd), and on AMD 64 it is the second.
7) On 32-bit machines the current checksum can be computed using mostly 32-bit operations (compute_checksum_short and compute_checksum_sadd).
8) If we want a checksum but do not care which one we use, then summing 32-bit words (compute_checksum_lladd) may be a good solution.
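Remark 6 says the current checksum can be computed with only additions. One way to see this: the sum over n of n*buf[n] equals the sum, for k = 1 .. size-1, of the suffix sums buf[k] + ... + buf[size-1], because byte m lands in exactly m of those suffixes. The sketch below uses that identity; it assumes a 64-bit gpi_int, is not the compute_checksum_ladd from the attached program, and its name is hypothetical.

```c
/* Assumption: gpi_int is a 64-bit signed integer. */
typedef long long gpi_int;

/* Hypothetical addition-only form of sum(n * buf[n]).
   Walking backwards, `suffix` holds buf[n] + buf[n+1] + ... + buf[size-1];
   adding it once for every n >= 1 counts each buf[m] exactly m times. */
static gpi_int
compute_checksum_ladd_sketch (const unsigned char *buf, gpi_int size)
{
  gpi_int sum = 0, suffix = 0, n;

  for (n = size - 1; n >= 1; n--)
    {
      suffix += buf[n];
      sum += suffix;
    }
  return sum;
}
```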
I have attached the modified test program.
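Remark 7 can be illustrated by splitting the buffer into blocks. Writing n = base + i for a byte at offset i in a block starting at base, n*buf[n] = base*buf[n] + i*buf[n], so each block needs only a 32-bit byte sum and a 32-bit weighted sum, plus one 64-bit multiply-add to fold them in. The block length of 4096 is an assumption chosen so the weighted sum cannot overflow 32 bits (255 * 4096 * 4095 / 2 = 2138572800 < 2^32). This is a hypothetical sketch, not the code of compute_checksum_short or compute_checksum_sadd.

```c
#include <stdint.h>

/* Assumption: gpi_int is a 64-bit signed integer. */
typedef long long gpi_int;

/* Largest weighted block sum: 255 * (0 + 1 + ... + 4095) = 2138572800,
   which still fits in a uint32_t. */
#define CHECKSUM_BLOCK 4096

/* Hypothetical sketch: sum(n * buf[n]) using 32-bit adds and multiplies
   inside each block and one 64-bit multiply-add per block. */
static gpi_int
compute_checksum_blocked_sketch (const unsigned char *buf, gpi_int size)
{
  uint64_t sum = 0;
  gpi_int base;

  for (base = 0; base < size; base += CHECKSUM_BLOCK)
    {
      gpi_int left = size - base;
      uint32_t len = left < CHECKSUM_BLOCK ? (uint32_t) left : CHECKSUM_BLOCK;
      uint32_t bytes = 0, weighted = 0, i;

      for (i = 0; i < len; i++)
        {
          bytes += buf[base + i];          /* plain byte sum of the block */
          weighted += i * buf[base + i];   /* offset-weighted sum */
        }
      /* (base + i) * b = base * b + i * b, summed over the block. */
      sum += (uint64_t) base * bytes + weighted;
    }
  return (gpi_int) sum;
}
```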
32-bit AMD, 1.250 GHz, gpi_int = long long
sizeof: 8
time     sum                   function
527408   46473731586140096     compute_checksum_original
101484   46473731586140096     compute_checksum_short
365382   46473731586140096     compute_checksum_ladd
120329   46473731586140096     compute_checksum_sadd
 62523   -169197487615280      compute_checksum_lladd
503241   46473731586140096     compute_checksum_unrolled
140273   8551919141529536      compute_checksum_native
155221   -2963523148067389     compute_checksum_shift
 74935   -852473248            compute_checksum_add
AMD, 1.250 GHz, gpi_int = long
sizeof: 4
time     sum                   function
103675   -694933568            compute_checksum_original
100753   -694933568            compute_checksum_short
119555   -694933568            compute_checksum_ladd
120073   -694933568            compute_checksum_sadd
 62365   -1545956656           compute_checksum_lladd
103780   -694933568            compute_checksum_unrolled
104034   -694933568            compute_checksum_native
 78179   -8794685              compute_checksum_shift
 72357   -852473248            compute_checksum_add
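Comparing the two 1.250 GHz AMD tables, the gpi_int = long result for the original checksum is the 64-bit result reduced modulo 2^32: 46473731586140096 mod 2^32 = 3600033728, which reads as -694933568 as a signed 32-bit value. The sketch below reproduces that relationship with unsigned accumulators, where wraparound is well defined in C (with a signed 32-bit gpi_int the overflow is technically undefined behaviour, even though it wraps in practice). The function names are hypothetical.

```c
#include <stdint.h>

/* Same checksum, sum(n * buf[n]), accumulated in 64 bits. */
static uint64_t
checksum_u64 (const unsigned char *buf, uint64_t size)
{
  uint64_t sum = 0, n;

  for (n = 0; n < size; n++)
    sum += n * buf[n];
  return sum;
}

/* Same checksum accumulated in 32 bits; unsigned arithmetic wraps
   modulo 2^32, so the result is the low half of checksum_u64. */
static uint32_t
checksum_u32 (const unsigned char *buf, uint32_t size)
{
  uint32_t sum = 0, n;

  for (n = 0; n < size; n++)
    sum += n * buf[n];
  return sum;
}
```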
AMD Athlon(tm) 64 Processor 3000+ (1.8 GHz)
sizeof: 8
time     sum                   function
 48315   46473731586140096     compute_checksum_original
 76613   46473731586140096     compute_checksum_short
 34075   46473731586140096     compute_checksum_ladd
 47050   46473731586140096     compute_checksum_sadd
 12772   -169197487615280      compute_checksum_lladd
 57156   46473731586140096     compute_checksum_unrolled
 61925   8551919141529536      compute_checksum_native
 42195   -2963523148067389     compute_checksum_shift
 28440   -852473248            compute_checksum_add
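Remark 8 suggests simply summing the file as 32-bit words. A minimal sketch, assuming a 64-bit accumulator and adding the last 1 to 3 bytes individually, follows. It is not guaranteed to produce the same values as compute_checksum_lladd in the attached program, and because it reads native-endian words its result differs between little- and big-endian hosts. The function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: gpi_int is a 64-bit signed integer. */
typedef long long gpi_int;

/* Hypothetical word-sum checksum: add the buffer as 32-bit words,
   then add any trailing bytes one at a time.  memcpy keeps the loads
   safe for unaligned buffers. */
static gpi_int
compute_checksum_words_sketch (const unsigned char *buf, gpi_int size)
{
  uint64_t sum = 0;
  gpi_int n = 0;

  for (; n + 4 <= size; n += 4)
    {
      uint32_t w;
      memcpy (&w, buf + n, sizeof w);   /* native byte order */
      sum += w;
    }
  for (; n < size; n++)                 /* trailing 0..3 bytes */
    sum += buf[n];
  return (gpi_int) sum;
}
```

This needs a quarter as many loop iterations as the byte-wise checksum and no multiplications, one reason such a sum can get close to the DRAM-speed limit mentioned in remark 3. Unlike sum(n*buf[n]), though, it does not detect aligned 32-bit words that have been swapped.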