The random generator is in both cases the Mersenne Twister (routine mt19937 from the authors' site). And I had forgot the Fortran program. Anyway you are right: the pseudorandom generator takes the most time in the program. But here is another example, this time both in Pascal and in Fortran:
------------------------------------------------ program matrix; uses GPC, SysUtils;
const size = 30;
type tMatrix = array[0..size, 0..size] of longint;
VAR
time1, time2 : longCard; dummy, j, big, cpu1, cpu2 : integer;
procedure mkmatrix( rows, cols : integer; var mx : tMatrix); var R, C : integer; count : longint; begin Dec(rows); Dec(cols); count := 1; for R := 0 to rows do begin for C := 0 to cols do begin mx[R, C] := count; Inc(count); end; end; End;
procedure mmult(rows, cols : integer; m1, m2 : tMatrix; var mm : tMatrix ); var i, j, k : integer; val: longint; begin Dec(rows); Dec(cols); For i := 0 To rows do begin For j := 0 To cols do begin val := 0; For k := 0 To cols do begin Inc(val, m1[i, k] * m2[k, j]); end; mm[i, j] := val; end; end; End;
var NUM, I : integer; M1, M2, MM : tMatrix;
begin
writeln ('# MATRIX MULTIPLICATION: GPC'); time1 := GetMicroSecondTime;
if ParamCount = 0 then NUM := 1 else NUM := StrToInt(ParamStr(1));
if NUM < 1 then NUM := 1;
mkmatrix(size, size, M1); mkmatrix(size, size, M2);
for I := 0 To NUM do begin mmult(size, size, M1, M2, MM); end; WriteLn( IntToStr(MM[0, 0]) + ' ' + IntToStr(MM[2, 3]) + ' ' + IntToStr(MM[3, 2]) + ' ' + IntToStr(MM[4, 4]));
dummy := 0;
time2 := GetMicroSecondTime; writeln('GPC: elapsed time: ', ( (time2-time1) * 0.000001):7:2, ' seconds, ' );
end.
------------------------------------------------
program perfor2for
integer i character*20 argum integer size parameter ( size = 30 ) integer m1 ( 0:size, 0:size), m2 ( 0:size, 0:size), & mm ( 0:size, 0:size), mx ( 0:size, 0:size ) real etime real elapsed(2) real total
write (*,*) '# MATRIX MULTIPLICATION: IFORT'
call getarg ( 1, argum ) read ( argum, * ) num write (*,*) '# size = ', size
if ( num .le. 1 ) num =1
call mkmatrix ( size, size, m1 ) call mkmatrix ( size, size, m2 )
write (*,*)'------------------------------------------------------'
write (*,*) m1 (0,0), ' ', m1(2,3), ' ', m1(3,2), ' ', m1(4,4) write (*,*) m2 (0,0), ' ', m2(2,3), ' ', m2(3,2), ' ', m2(4,4)
write (*,*)'------------------------------------------------------'
do i = 1, num call mmult ( size, size, m1, m2, mm ) enddo
write (*,*) mm (0,0), ' ', mm(2,3), ' ', mm(3,2), ' ', mm(4,4)
total = etime ( elapsed ) print *, 'Elapsed time =', total, ' secondi '! , ' user=', elapsed(1), ' system=', elapsed(2)
stop end
subroutine mkmatrix ( rows, cols, mx ); integer r, c, rows, cols include 'size.h' integer mx ( 0:size, 0:size ), count
rows = rows - 1 cols = cols - 1 count = 1
write (*,*) '# rows = ', rows, ' cols = ', cols
do r = 0, rows do c = 0, cols mx ( r, c ) = count count = count + 1 enddo enddo
return end
subroutine mmult ( rows, cols, m1, m2, mm )
integer r, c, rows, cols, i, j, k include 'size.h' integer m1( 0:size, 0:size ), m2( 0:size, 0:size ), mm ( 0:size, 0:size ), count, val
rows = rows - 1 cols = cols - 1
do i = 0, rows do j = 0, cols val = 0 do k = 0, cols val = val + m1(i,k)*m2(k, j) enddo mm(i, j) = val enddo enddo
return end
------------------------------------------------------------------
here the difference is marked. Perhaps by exhanging columns with rows one can make Pascal execution faster. I will try. Thanks, best regards
Silvio a Beccara
Silvio a Beccara wrote:
The random generator is in both cases the Mersenne Twister (routine mt19937 from the authors' site). And I had forgot the Fortran program. Anyway you are right: the pseudorandom generator takes the most time in the program. But here is another example, this time both in Pascal and in Fortran:
In Pascal program you have:
type tMatrix = array[0..size, 0..size] of longint;
In Fortran:
integer m1 ( 0:size, 0:size), m2 ( 0:size, 0:size),
AFAIK Fortran integers on x86 are 32 bit, but GPC longint is 64 bit and highier precision has it cost. Also, Fortran passes array by reference, but you passed arguments to 'mmult' by value:
procedure mmult(rows, cols : integer; m1, m2 : tMatrix; var mm : tMatrix );
to pass by reference you can use 'const' attribute: procedure mmult(rows, cols : integer; const m1, m2 : tMatrix; var mm : tMatrix );
I have modified your program to use integer instead of longint and to use 'const' attribute (as above). Also tried a two versions of GPC with different options: original modified gpc-20041017+gcc-3.3.5 -O2 -march=athlon-xp 0.035884 0.008142 gpc-20041017+gcc-3.3.5 -O2 -march=i686 0.033786 0.008811 gpc-20030830+gcc-3.3.2 -O2 -march=athlon-xp 0.037098 0.013156 gpc-20020510+gcc-2.95.3 -O2 -march=i386 0.039530 0.025574 gpc-20020510+gcc-2.95.3 -O2 -march=i686 0.037567 0.026358
As you can see with new gpc modified version is 4 times faster then original. Main gain comes from reduced precision but optimizing for correct processor gives 10% and 'const' attribute another 10%. The fastest version takes 6.28 clocks per inner loop iteration which still looks too high for me. But it seem hard to get better speed without significantly changing program.
By the way, if what you want is matrix multiplicatin than using Atlas library may be a solution: Atlas is hand optimized and IMHO hard to beat. IIRC Atlas is floating point only, but converting to integer to doubles, doing floating point matrix multiply and converting back to integers is likely to be faster then direct integer matrix multiply (integer and floating point arithmetic are of similar speed, but floating point registers are separate from integer registers, so floating point program effectively have more registers to use).
Dear Waldek,
thanks very much for your suggestions. As for the optimisation switches, -march=athlon-xp doesn't seem to work on my system (Mandrake 9.1, gpc version 2.1 (20020510), based on 2.95.3 20010315 (release)). Do I have to compile or install GPC with particular options?
If I use doubles instead of integers, do you think that the performances will again enjoy this enhancement?
Thanks, best regards
Silvio a Beccara
| Silvio a Beccara wrote: | > The random generator is in both cases the Mersenne Twister (routine | > mt19937 from the authors' site). And I had forgot the Fortran program. | > Anyway you are right: the pseudorandom generator takes the most time in | > the program. But here is another example, this time both in Pascal and | > in Fortran: | | In Pascal program you have: | > type tMatrix = array[0..size, 0..size] of longint; | | In Fortran: | > integer m1 ( 0:size, 0:size), m2 ( 0:size, 0:size), | | AFAIK Fortran integers on x86 are 32 bit, but GPC longint is 64 bit | and highier precision has it cost. Also, Fortran passes array by reference, | | but you passed arguments to 'mmult' by value: | > procedure mmult(rows, cols : integer; m1, m2 : tMatrix; var mm : tMatrix | > ); | | to pass by reference you can use 'const' attribute: | procedure mmult(rows, cols : integer; const m1, m2 : tMatrix; var mm : | tMatrix ); | | I have modified your program to use integer instead of longint and | to use 'const' attribute (as above). Also tried a two versions of | GPC with different options: | original modified | gpc-20041017+gcc-3.3.5 | -O2 -march=athlon-xp 0.035884 0.008142 | gpc-20041017+gcc-3.3.5 | -O2 -march=i686 0.033786 0.008811 | gpc-20030830+gcc-3.3.2 | -O2 -march=athlon-xp 0.037098 0.013156 | gpc-20020510+gcc-2.95.3 | -O2 -march=i386 0.039530 0.025574 | gpc-20020510+gcc-2.95.3 | -O2 -march=i686 0.037567 0.026358 | | | As you can see with new gpc modified version is 4 times faster | then original. Main gain comes from reduced precision but | optimizing for correct processor gives 10% and 'const' attribute | another 10%. The fastest version takes 6.28 clocks per inner | loop iteration which still looks too high for me. But it seem | hard to get better speed without significantly changing | program. | | By the way, if what you want is matrix multiplicatin than using | Atlas library may be a solution: Atlas is hand optimized and | IMHO hard to beat. IIRC Atlas is floating point only, but | converting to integer to doubles, doing floating point matrix | multiply and converting back to integers is likely to be | faster then direct integer matrix multiply (integer and floating | point arithmetic are of similar speed, but floating point | registers are separate from integer registers, so floating point | program effectively have more registers to use).
On Wed, Oct 27, 2004 at 09:00:13AM +0200, Silvio a Beccara wrote:
Dear Waldek,
thanks very much for your suggestions. As for the optimisation switches, -march=athlon-xp doesn't seem to work on my system (Mandrake 9.1, gpc version 2.1 (20020510), based on 2.95.3 20010315 (release)). Do I have to compile or install GPC with particular options?
If I am not mistaken, optimization for Athlon XP was introduced in GCC 3.x, so you need to compile GPC against a newer backend. The same probably holds for Pentium 4, which you mentioned in another mail (which makes me wonder a little bit: is your machine an Athlon XP or P4? It makes no sense to optimize for one CPU and then run the program on a different one.)
Emil Jerabek
Silvio a Beccara wrote:
thanks very much for your suggestions. As for the optimisation switches, -march=athlon-xp doesn't seem to work on my system (Mandrake 9.1, gpc version 2.1 (20020510), based on 2.95.3 20010315 (release)). Do I have to compile or install GPC with particular options?
-march option is handled by the backend. Athlon XP and Pentium 4 were released after the release of gcc-2.95, so to use such options you need new backend (like 3.3.5). And if you have Pentium 4 you should use -march=pentium4.
If I use doubles instead of integers, do you think that the performances will again enjoy this enhancement?
Later:
I don't know whether the i686 switch versus the pentium4 switch in the GPC compilation can make a difference of a factor 2, but I am doubtful.
I you look at my numbers you see that I get 3 times speedup by using new version of the compiler. Note that your tests are tiny, on such test you can expect large variation in results.
If you ask about general trends: IIRC 3.3.x backend is on average 10% faster then 2.95.3. Note that average is computed over a family of C programs (SPEC 2000 benchmark). On particular program result can be much different then average. Intel compiler is about 10% faster then GCC when generating code for Pentium 4, when generating code for AMD processors GCC and Intel compiler give equal speed (within mesurement error). Pentium 4 for best performance requires different code then other processors, so 10-20% speedup is typical. Disclaimer: this is opinion based on published benchmarks.
Concerening languages: in principle Pascal can give better code then C (due to more strict typing), but with GPC 'equivalent' Pascal code is likely to run as fast or slightly slower then C code. For matrix computation Fortran should give you faster code then C or Pascal. However assuming that code is well writen and averaging over larger colletion of programs differences should be small, probably less then 10%.
Going to practical recommendations: 10% speedup just by changing compilers may be quite attractive. Similar speedup by changing language is much less attractive. On particular programs you may see much larger difference. Such cases are worth reporting, at least in same cases it is possible to fix GPC to generate faster code.
Silvio a Beccara a écrit:
But here is another example, this time both in Pascal and in Fortran:
<snip>
subroutine mmult ( rows, cols, m1, m2, mm )
integer r, c, rows, cols, i, j, k include 'size.h' integer m1( 0:size, 0:size ), m2( 0:size, 0:size ), mm ( 0:size, 0:size ), count, val
rows = rows - 1 cols = cols - 1
probably an untested last minute "improvement" ? in fortran all parameters are passed by reference, not by value, so the global rows and cols are decreased at each enter of the subroutines, particularly if you use num >1 to have more meaningfull timings
do i = 0, rows do j = 0, cols val = 0 do k = 0, cols val = val + m1(i,k)*m2(k, j) enddo mm(i, j) = val enddo enddo
return end
Anyway making the changes indicated by Walter in another mail, plus correcting this mistake, after checking that pascal and fortran programs give identical results, using a pentium4, djgpp in a W98 dos box gives (-march=pentium4 -O3) for 10000 iterations 0.88 seconds for fortan 0.93 seconds for pascal i.e. one clock tick difference One can analyze in more detail the reason of this difference (we have seen in this list a comparison between C and pascal where the issue was a different treatment of global variables) but the conclusion is already clear: no significant difference between execution times between languages, when they can have identical constructs like here. The reason to choose a computer language is not there. Why do you expect anything else ? computer science is not magics
Maurice
| > rows = rows - 1 | > cols = cols - 1 | | probably an untested last minute "improvement" ?
No, a simple plain mistake. After correcting it, and checking that the two compilers give the same numbers, this is what I get:
# MATRIX MULTIPLICATION: GPC, 1000 TIMES 270165 1061760 1453695 1856025 GPC: elapsed time: 0.12264 seconds
# MATRIX MULTIPLICATION 1000 TIMES: IFORT 270165 1061760 1453695 1856025 Elapsed time: = 5.9999999E-02 seconds
using GPC with the following command line (only the i686 works on my system, I don't know why)
gpc -O2 -march=i686 perfor2-gpc.pas -o perfor2-gpc
using the Intel fortran compiler, version 8.0, with the following command line:
ifort -O3 -axW -ipo -tpp7 -w -extend_source perfor2-for.f -o perfor2-for
I don't know whether the i686 switch versus the pentium4 switch in the GPC compilation can make a difference of a factor 2, but I am doubtful. Actually, you don't say which f77 compiler you used, but I reckon it must be the g77, in which case the switches can be the same.
Silvio