\ MM.FTH - Forth Floating point matrix multiply benchmark 0 [IF] ====================================================== This file is maintained by: Stephen Pelc MicroProcessor Engineering 133 Hill Lane Southampton SO15 5AF England tel: +44 (0)23 8063 1441 fax: +44 (0)23 8033 9691 email: stephen@mpeforth.com The Forth MM benchmark was contributed by Marcel Hendrix, derived from a C benchmark by Mark Smotherman. The code may be freely redistributed. If you modify this code, and want the changes to be incorporated in future releases, please send the changes to Stephen Pelc. The code is used to test MPE's VFX code generator and optimiser. Except for legacy reasons, no attempt is made to maintain this code for non-optimising and non-ANS systems. The code is maintained to test ANS Forth systems. It does not take advantage of system specific extensions, except for one (see below). However, it uses the utility file from the Forth Scientific Library, which is intended to be hacked for specific systems as much as you can. Consequently, when you contribute results, please also contribute the FSL utility file as well. Change Notes ============ 20060901 Added other systems, refactored .ELAPSED to remove display effects. ************************************* On 2.8GHz P4, 512Mb DDR266 RAM, XPpro ************************************* SwiftForth 3.0.1 09-Aug-2006 500x500 mm - normal algorithm 3.219 secs. 500x500 mm - temporary variable in loop 9.703 secs. 500x500 mm - unrolled inner loop, factor of 4 9.875 secs. 500x500 mm - unrolled inner loop, factor of 8 9.828 secs. 500x500 mm - unrolled inner loop, factor of 16 9.891 secs. 500x500 mm - pointers used to access matrices 3.984 secs. 500x500 mm - pointers used, unrolled by 4 3.969 secs. 500x500 mm - transposed B matrix 8.750 secs. 500x500 mm - interchanged inner loops 8.766 secs. 500x500 mm - blocking, step size of 20 11.250 secs. 500x500 mm - Robert's algorithm 2.641 secs. 500x500 mm - T. Maeno's algorithm, subarray 20x20 3.625 secs. 500x500 mm - Generic Maeno, subarray 20x20 2.985 secs. 500x500 mm - D. Warner's algorithm, subarray 20x20 9.313 secs. ========================================================= Total using no extensions and using no hackery 97.799 secs. ok iForth version 2.1.1662, generated 14:10:06, November 27, 2005. 500x500 mm - normal algorithm 1.672 secs. 500x500 mm - temporary variable in loop 2.735 secs. 500x500 mm - unrolled inner loop, factor of 4 2.641 secs. 500x500 mm - unrolled inner loop, factor of 8 2.625 secs. 500x500 mm - unrolled inner loop, factor of 16 2.625 secs. 500x500 mm - pointers used to access matrices 1.734 secs. 500x500 mm - pointers used, unrolled by 4 1.594 secs. 500x500 mm - transposed B matrix 1.140 secs. 500x500 mm - interchanged inner loops 2.203 secs. 500x500 mm - blocking, step size of 20 2.375 secs. 500x500 mm - Robert's algorithm 0.594 secs. 500x500 mm - T. Maeno's algorithm, subarray 20x20 0.563 secs. 500x500 mm - Generic Maeno, subarray 20x20 0.829 secs. 500x500 mm - D. Warner's algorithm, subarray 20x20 1.297 secs. ========================================================= Total using no extensions and using no hackery 24.627 secs. ok VFX Forth Version: 4.00 [build 2325] Build date: 31 August 2006 500x500 mm - normal algorithm 1.630 secs. 500x500 mm - temporary variable in loop 2.702 secs. 500x500 mm - unrolled inner loop, factor of 4 2.601 secs. 500x500 mm - unrolled inner loop, factor of 8 2.610 secs. 500x500 mm - unrolled inner loop, factor of 16 2.603 secs. 500x500 mm - pointers used to access matrices 2.235 secs. 500x500 mm - pointers used, unrolled by 4 1.599 secs. 500x500 mm - transposed B matrix 1.150 secs. 500x500 mm - interchanged inner loops 2.169 secs. 500x500 mm - blocking, step size of 20 2.337 secs. 500x500 mm - Robert's algorithm 0.610 secs. 500x500 mm - T. Maeno's algorithm, subarray 20x20 0.563 secs. 500x500 mm - Generic Maeno, subarray 20x20 0.619 secs. 500x500 mm - D. Warner's algorithm, subarray 20x20 1.265 secs. ========================================================= Total using no extensions and using no hackery 24.693 secs. ok =============================================================== [THEN] \ ************************************************ \ Select system to be tested, set FORTHSYSTEM \ to value of selected target. \ Set SPECIFICS false to avoid system dependencies. \ Set SPECIFICS true to show off implementation tricks. \ Set HACKING false to use the base source code. \ Set HACKING true to optimise the source code. \ ************************************************ 1 constant VfxForth4 \ MPE VFX Forth v4.x 2 constant VfxForth3 \ MPE VFX Forth v3.x 3 constant Pfw22 \ MPE ProForth 2.2 3 constant SwiftForth20 \ FI SwiftForth 2.0 5 constant SwiftForth15 \ FI SwiftForth 1.5 6 constant Win32Forth \ Win32Forth 4.2 7 constant BigForth \ BigForth 11 July 1999 8 constant BigForth-Linux \ BigForth 11 July 1999 9 constant iForth \ iForth 1.12 5 Aug 2001 10 constant iForth20 \ iForth 2.0 8 June 2002 11 constant SwiftForth22 \ FI SwiftForth 2.2.2.9 12 constant SwiftForth30 \ FI SwiftForth 3.0.1 \ Select system to test VfxForth4 constant ForthSystem \ VfxForth3 constant ForthSystem \ iForth20 constant ForthSystem \ SwiftForth30 constant ForthSystem \ Win32Forth constant ForthSystem false constant specifics \ true to use system dependent code false constant hacking \ true to use "guru" level code that \ makes assumptions of an optimising compiler. true constant ANSSystem \ Some Forth 83 systems cannot compile \ all the test examples without carnal \ knowledge, especially if the compiler \ checks control structures. : .specifics \ -- ; display trick state ." using" specifics 0= if ." no" then ." extensions" ; : .hacking \ -- ; display hack state ." using" hacking 0= if ." no" then ." hackery" ; : .testcond \ -- ; display test conditions .specifics ." and" .hacking ; \ *********************************** \ VFX Forth for Windows v4.x harness \ ********************************** VfxForth4 ForthSystem = [IF] [defined] +idata [if] +idata \ enable P4 data options variable zzz \ preallocate first IDATA buffer [then] #0 LoopAlignment \ #32 loopAlignment true constant ndp? \ -- flag ; true if NDP stack version mc" %vfxpath%\Lib" setmacro LibDir mc" %vfxpath%\Lib\FSL\library" setmacro FslDir mc" %vfxpath%\Lib" setmacro NdpDir ndp? [if] S" %NdpDir%\Ndp387" INCLUDED \ -1 to FPsin? \ slower [else] S" %NdpDir%\Hfp387" INCLUDED [then] char . dp-char ! \ select ANS number conversion char . fp-char ! -short-branches \ disable short forward branches S" %FslDir%\Vfx4Util" INCLUDED \ FSL harness for VFX Forth 4.x : HTAB out @ - spaces ; \ n -- ; step to position n : DEC. ( n -- ) BASE @ >R DECIMAL . R> BASE ! ; extern: DWORD PASCAL GetTickCount( void ); (( : COUNTER \ -- ms GetTickCount ; )) [UNDEFINED] m*/ [IF] [-sin : m*/ \ d1 n2 +n3 -- dquot \ *G The result dquot=(d1*n2)/n3. The intermediate value d1*n2 \ ** is triple-precision. In an ANS Forth standard program n3 \ ** can only be a positive signed number and a negative value \ ** for n3 generates an ambiguous condition, which may cause \ ** an error on some implementations. >r \ -- d1 n2 ; R: -- n3 s>d >r abs \ -- d1 |n2| ; R: -- n3 sign(n2) -rot \ -- |n2| d1 ; R: -- n3 sign(n2) s>d r> xor \ -- |n2| d1 d1h*sign(n2) ; R: -- n3 r> swap >r >r \ -- |n2| d1 ; R: -- d1h*sign(n2) n3 dabs rot \ -- |d1| |n2| ; R: -- d1h*sign(n2) n3 tuck um* 2swap um* \ -- d1h*n2 d1l*n2 ; R: -- d1h*sign(n2) n3 swap >r 0 d+ r> -rot \ -- t ; R: -- d1h*sign(n2) n3 r@ um/mod -rot r> um/mod nip swap r> IF dnegate THEN ; sin] [THEN] Extern: BOOL PASCAL QueryPerformanceCounter( void * int64 ); Extern: BOOL PASCAL QueryPerformanceFrequency( void * int64 ); : Counter \ -- ms \ *G Return a ticker count in milliseconds. \ *E seconds = count / freq \ ** ms = (count * 1000) / freq \ *P Note that we assume that frequency can be expressed as \ ** a positive 32 bit number. { | count[ 2 cells ] freq[ 2 cells ] -- ms } count[ QueryPerformanceCounter drop freq[ QueryPerformanceFrequency drop count[ 2@ swap \ count #1000 freq[ @ m*/ drop \ ms ; : DFVARIABLE #16 buffer: ; : cr cr emptyidle ; [THEN] \ *********************************** \ VFX Forth for Windows v3.7+ harness \ *********************************** VfxForth3 ForthSystem = [IF] [defined] +idata [if] +idata \ enable P4 data options variable zzz \ preallocate first IDATA buffer [then] true constant ndp? \ -- flag ; true if NDP stack version c" C:\Products\PfwVfx.dev\WindowsBox\Sources\Lib" setmacro LibDir c" C:\Products\PfwVfx.dev\WindowsBox\Sources\Lib\FSL\library" setmacro FslDir c" C:\Products\PfwVfx.dev\WindowsBox\Sources\Lib" setmacro NdpDir ndp? [if] S" %NdpDir%\Ndp387" INCLUDED [else] S" %NdpDir%\Hfp387" INCLUDED [then] char . dp-char ! \ select ANS number conversion char . fp-char ! -short-branches \ disable short forward branches S" %FslDir%\VfxUtil" INCLUDED \ FSL harness for VFX Forth 3.x : HTAB out @ - spaces ; \ n -- ; step to position n : DEC. ( n -- ) BASE @ >R DECIMAL . R> BASE ! ; extern: DWORD PASCAL GetTickCount( void ); : COUNTER \ -- ms GetTickCount ; : DFVARIABLE #16 buffer: ; [THEN] \ ****************************** \ iForth 2.0 for Windows harness \ ****************************** iForth20 ForthSystem = [IF] NEEDS -miscutil NEEDS -dynlink 0 VALUE 'counter S" kernel32.dll" LIBRARY-OPEN THROW ( dll) S" GetTickCount" ROT LIBRARY-FIND THROW TO 'counter : counter ( -- ms ) 0 'counter FOREIGN ; : defined \ caddr -- flag find nip ; \ include c:\dfwforth\examples\fsl\fsl_util.frt include c:\dfwforth\include\fsl_util.frt : & ; IMMEDIATE [THEN] \ ********************** \ SwiftForth 2.0 harness \ ********************** SwiftForth20 ForthSystem = [IF] include C:\MyApps\SwiftForth20\Lib\Options\fpmath.f : f<> F= 0= ; include C:\MyApps\SwiftForth20\Lib\FSLib\Library\fsl-util.f : HTAB get-xy drop - spaces ; \ n -- ; step to position n : DEC. ( n -- ) BASE @ >R DECIMAL . R> BASE ! ; [THEN] \ ********************** \ SwiftForth 2.2 harness \ ********************** SwiftForth22 ForthSystem = [IF] \ FPCONFIG.F should be in the BENCHMRK folder include C:\MyApps\SwiftForth2229\Lib\Options\fpmath.f : f<> F= 0= ; include C:\MyApps\SwiftForth2229\Unsupported\FSLib\Library\fsl-util.f : HTAB get-xy drop - spaces ; \ n -- ; step to position n : DEC. ( n -- ) BASE @ >R DECIMAL . R> BASE ! ; [THEN] \ ********************** \ SwiftForth 3.0 harness \ ********************** SwiftForth30 ForthSystem = [IF] \ FPCONFIG.F should be in the BENCHMRK folder include C:\MyApps\Forth200x\SwiftForth300\SwiftForth\Lib\Options\fpmath.f : f<> F= 0= ; include C:\MyApps\Forth200x\SwiftForth300\SwiftForth\Unsupported\FSLib\Library\fsl-util.f : HTAB get-xy drop - spaces ; \ n -- ; step to position n : DEC. ( n -- ) BASE @ >R DECIMAL . R> BASE ! ; [THEN] \ ****************** \ Win32Forth harness \ ****************** Win32Forth ForthSystem = [IF] : COUNTER \ -- ms Call GetTickCount ; : >pos \ n -- ; step to position n getxy drop - spaces ; : HTAB #TAB ; \ n -- ; step to position n : M/ \ d n1 -- quot fm/mod nip ; : buffer: \ n -- ; -- addr create here over allot swap erase ; : 2- \ n -- n-2 2 - ; : [o/n] \ -- ; stop optimiser treating * DROP etc as no code ; immediate : SendMessage \ h m w l -- res swap 2swap swap \ Win32Forth uses reverse order Call SendMessage ; : GetTickCount \ -- ms Call GetTickCount ; [THEN] \ ******************** \ Start of common code \ ******************** 500 CONSTANT N \ -- n ; number of iterations to use \ ****************** \ Timing and display \ ****************** 50 CONSTANT TCOL 0 VALUE _time_ VARIABLE TotalTime : TIMER-RESET ( -- ) COUNTER TO _time_ ; : #? ( d -- d ) 2DUP OR 0= IF BL HOLD ELSE # THEN ; : .secs \ ms -- 0 <# BL HOLD # # # [char] . HOLD # #? #? #> TYPE ." secs." ; : .ELAPSED ( -- ) COUNTER _time_ - dup TotalTime +! \ SFP001 TCOL HTAB .secs ; : .algo \ caddr u -- \ Display size and algorithm from string CR N 0 .R ." x" N 0 .R SPACE TYPE ; \ ***** \ TOOLS \ ***** C" DF@+" DEFINED 0= [IF] : DF@+ ( addr -- addr' ) ( F: -- r ) DUP DF@ DFLOAT+ ; [THEN] C" DF+!" DEFINED 0= [IF] : DF+! ( addr -- ) ( F: r -- ) DUP DF@ F+ DF! ; [THEN] C" DF!+" DEFINED 0= [IF] : DF!+ ( addr -- addr' ) ( F: r -- ) DUP DF! DFLOAT+ ; [THEN] C" DF+!+" DEFINED 0= [IF] : DF+!+ ( addr -- addr' ) ( F: r -- ) DUP DF@ F+ DF!+ ; [THEN] C" *DSUM" DEFINED 0= [IF] : *DSUM ( addr1 addr2 count -- addr1' addr2' ) ( F: -- n ) 0e 0 ?DO SWAP DF@+ SWAP DF@+ F* F+ LOOP ; : *DSUML ( addr1 addr2 count stride2 -- addr1' addr2' ) ( F: -- r ) LOCALS| stride2 | 0e 0 ?DO SWAP DF@+ SWAP DUP DF@ stride2 + F* F+ LOOP ; [THEN] CHAR x CONSTANT 'x' CHAR n CONSTANT 'n' CHAR v CONSTANT 'v' CHAR u CONSTANT 'u' CHAR p CONSTANT 'p' CHAR t CONSTANT 't' CHAR i CONSTANT 'i' CHAR b CONSTANT 'b' CHAR m CONSTANT 'm' CHAR r CONSTANT 'r' CHAR w CONSTANT 'w' CHAR z CONSTANT 'z' \ ===================================================================================== DOUBLE DMATRIX a{{ DOUBLE DMATRIX b{{ DOUBLE DMATRIX c{{ DOUBLE DMATRIX d{{ DOUBLE DMATRIX bt{{ 0 [IF] ================================================ \ Useful test bits : .a{{ \ -- ; display A{{ matrix cr ." A{{" N 0 ?DO cr N 0 ?DO a{{ J I }} DF@ F. LOOP LOOP ; : .b{{ \ -- ; display B{{ matrix cr ." B{{" N 0 ?DO cr N 0 ?DO b{{ J I }} DF@ F. LOOP LOOP ; : .c{{ \ -- ; display B{{ matrix cr ." C{{" N 0 ?DO cr N 0 ?DO c{{ J I }} DF@ F. LOOP LOOP ; : .bt{{ \ -- ; display B{{ matrix cr ." BT{{" N 0 ?DO cr N 0 ?DO bt{{ J I }} DF@ F. LOOP LOOP ; [THEN] : SET-COEFFICIENTS ( -- ) \ Set coefficients so that result matrix should have row entries \ equal to (1/2)*n*(n-1)*i in row i N 0 ?DO N 0 ?DO J S>F FDUP b{{ J I }} DF! a{{ J I }} DF! LOOP LOOP ; : FLUSH-CACHE ( -- ) N 0 ?DO N 0 ?DO 0e d{{ J I }} DF! LOOP LOOP ; FVARIABLE row_sum FVARIABLE sum : CHECK-RESULT ( -- ) 0e row_sum F! N N 1- * 2/ S>F sum F! N 0 ?DO I S>F sum F@ F* row_sum F! N 0 ?DO a{{ J I }} DF@ J S>F F<> IF CR ." error in result entry a{{ " J DEC. I DEC. ." }}: " a{{ J I }} DF@ F. ." <> " J S>F F. UNLOOP UNLOOP EXIT THEN b{{ J I }} DF@ J S>F F<> IF CR ." error in result entry b{{ " J DEC. I DEC. ." }}: " b{{ J I }} DF@ F. ." <> " J S>F F. UNLOOP UNLOOP EXIT THEN c{{ J I }} DF@ row_sum F@ F<> IF CR ." error in result entry c{{ " J DEC. I DEC. ." }}: " c{{ J I }} DF@ F. ." <> " row_sum F@ F. UNLOOP UNLOOP EXIT THEN LOOP LOOP ; : NORMAL() ( -- ) s" mm - normal algorithm" .algo TIMER-RESET N 0 ?DO N 0 ?DO a{{ J 0 }} b{{ 0 I }} N N DFLOATS *DSUML 2DROP c{{ J I }} DF! LOOP LOOP .ELAPSED ; : TNSQ() ( -- ) 0 LOCALS| K | s" mm - temporary variable in loop" .algo TIMER-RESET N 0 ?DO I TO K N 0 ?DO a{{ J 0 }} DF@ b{{ 0 I }} DF@ F* N 1 ?DO a{{ K I }} DF@ b{{ I J }} DF@ F* F+ LOOP c{{ J I }} DF! LOOP LOOP .ELAPSED ; : UNROLL4() ( -- ) 0 0 LOCALS| K S | s" mm - unrolled inner loop, factor of 4" .algo TIMER-RESET N 0 ?DO I TO K N 0 ?DO 0e 0 TO S N 3 - 0 ?DO I TO S a{{ K I }} DF@ b{{ I J }} DF@ F* F+ a{{ K I 1+ }} DF@ b{{ I 1+ J }} DF@ F* F+ a{{ K I 2+ }} DF@ b{{ I 2+ J }} DF@ F* F+ a{{ K I 3 + }} DF@ b{{ I 3 + J }} DF@ F* F+ 4 +LOOP N S 4 + ?DO a{{ K I }} DF@ b{{ I J }} DF@ F* F+ LOOP c{{ J I }} DF! LOOP LOOP .ELAPSED ; : UNROLL8() ( -- ) 0 0 LOCALS| K S | s" mm - unrolled inner loop, factor of 8" .algo TIMER-RESET N 0 ?DO I TO K N 0 ?DO 0e 0 TO S N 7 - 0 ?DO I TO S a{{ K I }} DF@ b{{ I J }} DF@ F* F+ a{{ K I 1+ }} DF@ b{{ I 1+ J }} DF@ F* F+ a{{ K I 2+ }} DF@ b{{ I 2+ J }} DF@ F* F+ a{{ K I 3 + }} DF@ b{{ I 3 + J }} DF@ F* F+ a{{ K I 4 + }} DF@ b{{ I 4 + J }} DF@ F* F+ a{{ K I 5 + }} DF@ b{{ I 5 + J }} DF@ F* F+ a{{ K I 6 + }} DF@ b{{ I 6 + J }} DF@ F* F+ a{{ K I 7 + }} DF@ b{{ I 7 + J }} DF@ F* F+ 8 +LOOP N S 8 + ?DO a{{ K I }} DF@ b{{ I J }} DF@ F* F+ LOOP c{{ J I }} DF! LOOP LOOP .ELAPSED ; : UNROLL16() ( -- ) 0 0 LOCALS| K S | s" mm - unrolled inner loop, factor of 16" .algo TIMER-RESET N 0 ?DO I TO K N 0 ?DO 0e 0 TO S N 15 - 0 ?DO I TO S a{{ K I }} DF@ b{{ I J }} DF@ F* F+ a{{ K I 1+ }} DF@ b{{ I 1+ J }} DF@ F* F+ a{{ K I 2+ }} DF@ b{{ I 2+ J }} DF@ F* F+ a{{ K I 3 + }} DF@ b{{ I 3 + J }} DF@ F* F+ a{{ K I 4 + }} DF@ b{{ I 4 + J }} DF@ F* F+ a{{ K I 5 + }} DF@ b{{ I 5 + J }} DF@ F* F+ a{{ K I 6 + }} DF@ b{{ I 6 + J }} DF@ F* F+ a{{ K I 7 + }} DF@ b{{ I 7 + J }} DF@ F* F+ a{{ K I 8 + }} DF@ b{{ I 8 + J }} DF@ F* F+ a{{ K I 9 + }} DF@ b{{ I 9 + J }} DF@ F* F+ a{{ K I 10 + }} DF@ b{{ I 10 + J }} DF@ F* F+ a{{ K I 11 + }} DF@ b{{ I 11 + J }} DF@ F* F+ a{{ K I 12 + }} DF@ b{{ I 12 + J }} DF@ F* F+ a{{ K I 13 + }} DF@ b{{ I 13 + J }} DF@ F* F+ a{{ K I 14 + }} DF@ b{{ I 14 + J }} DF@ F* F+ a{{ K I 15 + }} DF@ b{{ I 15 + J }} DF@ F* F+ 16 +LOOP N S 16 + ?DO a{{ K I }} DF@ b{{ I J }} DF@ F* F+ LOOP c{{ J I }} DF! LOOP LOOP .ELAPSED ; : UNROLL ( n -- ) CASE 4 OF UNROLL4() ENDOF 8 OF UNROLL8() ENDOF 16 OF UNROLL16() ENDOF CR ." mm - unrolled inner loop, factor of " DUP DEC. ." not implemented" ENDCASE ; specifics [if] VfxForth3 ForthSystem = [if] -fasti [then] [then] : PNSQ4() ( -- ) 0 LOCALS| S | s" mm - pointers used, unrolled by 4" .algo TIMER-RESET N 0 ?DO N 0 ?DO 0e a{{ J 0 }} b{{ 0 I }} 0 TO S N 3 - 0 ?DO I TO S SWAP DF@+ SWAP DUP DF@ F* F+ N DFLOATS + SWAP DF@+ SWAP DUP DF@ F* F+ N DFLOATS + SWAP DF@+ SWAP DUP DF@ F* F+ N DFLOATS + SWAP DF@+ SWAP DUP DF@ F* F+ N DFLOATS + 4 +LOOP N S 4 + ?DO SWAP DF@+ SWAP DUP DF@ F* F+ N DFLOATS + LOOP c{{ J I }} DF! 2DROP LOOP LOOP .ELAPSED ; : PNSQ() ( n -- ) DUP 4 = IF DROP PNSQ4() EXIT THEN s" mm - pointers used to access matrices" .algo ?DUP IF ." , unroll factor of " DEC. ." not allowed" EXIT THEN TIMER-RESET N 0 ?DO N 0 ?DO 0e a{{ J 0 }} b{{ 0 I }} N 0 ?DO SWAP DF@+ SWAP DUP DF@ N DFLOATS + F* F+ LOOP c{{ J I }} DF! 2DROP LOOP LOOP .ELAPSED ; specifics [if] VfxForth3 ForthSystem = [if] +fasti [then] [then] : TRANSPOSE() ( -- ) 0 LOCALS| K | s" mm - transposed B matrix" .algo TIMER-RESET N 0 ?DO N 0 ?DO b{{ J I }} DF@ bt{{ I J }} DF! LOOP LOOP N 0 ?DO I TO K N 0 ?DO a{{ J 0 }} DF@ bt{{ I 0 }} DF@ F* N 1 ?DO a{{ K I }} DF@ bt{{ J I }} DF@ F* F+ LOOP c{{ J I }} DF! LOOP LOOP .ELAPSED ; \ from Monica Lam ASPLOS-IV paper : REG_LOOPS() ( -- ) 0 LOCALS| K | s" mm - interchanged inner loops" .algo TIMER-RESET N 0 ?DO N 0 ?DO 0e c{{ J I }} DF! LOOP LOOP N 0 ?DO I TO K N 0 ?DO a{{ J I }} DF@ N 0 ?DO FDUP b{{ J I }} DF@ F* c{{ K I }} DF+! LOOP FDROP LOOP LOOP .ELAPSED ; : TILING() ( step -- ) s" mm - blocking, step size of " .algo DUP DEC. DUP 4 N 1+ WITHIN 0= IF drop ." is unreasonable" EXIT THEN TIMER-RESET 0 0 0 LOCALS| K kk jj step | N 0 ?DO N 0 ?DO 0e c{{ J I }} DF! LOOP LOOP N 0 ?DO I TO kk N 0 ?DO I TO jj N 0 ?DO I TO K kk step + N MIN kk ?DO a{{ J I }} DF@ jj step + N MIN jj ?DO FDUP b{{ J I }} DF@ F* c{{ K I }} DF+! LOOP FDROP LOOP LOOP step +LOOP step +LOOP .ELAPSED ; \ ********************************************/ \ * Contributed by Robert Debath 26 Nov 1995 */ \ * rdebath@cix.compulink.co.uk */ \ ********************************************/ : ROBERT() ( -- ) s" mm - Robert's algorithm" .algo TIMER-RESET N 0 ?DO N 0 ?DO b{{ J I }} DF@ bt{{ I J }} DF! LOOP LOOP N 0 ?DO N 0 ?DO a{{ J 0 }} bt{{ I 0 }} N *DSUM 2DROP c{{ J I }} DF! LOOP LOOP .ELAPSED ; 0 [IF] =========================================================================== * Matrix Multiply by Dan Warner, Dept. of Mathematics, Clemson University * * mmbu2.f multiplies matrices a and b * a and b are n by n matrices * nb is the blocking parameter. * the tuning guide indicates nb = 50 is reasonable for the * ibm model 530 hence 25 should be reasonable for the 320 * since the 320 has 32k rather than 64k of cache. * Inner loops unrolled to depth of 2 * The loop functions without clean up code at the end only * if the unrolling occurs to a depth k which divides into n * in this case n must be divisible by 2. * The blocking parameter nb must divide into n if the * multiply is to succeed without clean up code at the end. * * converted to c by Mark Smotherman * note that nb must also be divisible by 2 => cannot use 25, so use 20 =========================================================================== [THEN] DFVARIABLE s10 DFVARIABLE s00 DFVARIABLE s01 DFVARIABLE s11 : WARNER() ( nb -- ) 0 0 0 0 LOCALS| K ii jj kk nb | s" mm - D. Warner's algorithm, subarray " .algo nb 0 .R 'x' EMIT nb 0 .R N nb MOD N 2 MOD OR IF cr ." the matrix size " N DEC. ." must be divisible both by the block size " nb DEC. ." and 2." EXIT THEN nb 2 MOD IF ." block size must be evenly divisible by 2" EXIT THEN TIMER-RESET N 0 ?DO I TO ii N 0 ?DO I TO jj nb ii + ii ?DO nb jj + jj ?DO 0e c{{ J I }} DF! LOOP LOOP N 0 ?DO I TO kk nb ii + ii ?DO I TO K nb jj + jj ?DO c{{ J I }} DF@ s00 DF! c{{ J I 1+ }} DF@ s01 DF! c{{ J 1+ I }} DF@ s10 DF! c{{ J 1+ I 1+ }} DF@ s11 DF! nb kk + kk ?DO a{{ K I }} DF@ b{{ I J }} DF@ F* s00 DF+! a{{ K I }} DF@ b{{ I J 1+ }} DF@ F* s01 DF+! a{{ K 1+ I }} DF@ b{{ I J }} DF@ F* s10 DF+! a{{ K 1+ I }} DF@ b{{ I J 1+ }} DF@ F* s11 DF+! LOOP s00 DF@ c{{ J I }} DF! s01 DF@ c{{ J I 1+ }} DF! s10 DF@ c{{ J 1+ I }} DF! s11 DF@ c{{ J 1+ I 1+ }} DF! 2 +LOOP 2 +LOOP nb +LOOP nb +LOOP nb +LOOP .ELAPSED ; 0 [IF] =========================================================================== Matrix Multiply tuned for SS-10/30; * Maeno Toshinori * Tokyo Institute of Technology * * Using gcc-2.4.1 (-O2), this program ends in 12 seconds on SS-10/30. * * in original algorithm - sub-area for cache tiling * #define L 20 * #define L2 20 * three 20x20 matrices reside in cache; two may be enough =========================================================================== [THEN] DFVARIABLE t0 DFVARIABLE t1 DFVARIABLE t2 DFVARIABLE t3 DFVARIABLE t4 DFVARIABLE t5 DFVARIABLE t6 DFVARIABLE t7 : MAENO() ( nb -- ) 0 0 0 0 0 LOCALS| K it kt i2 kk lparm | s" mm - T. Maeno's algorithm, subarray " .algo lparm 0 .R 'x' EMIT lparm 0 .R N lparm MOD N 4 MOD OR IF cr ." the matrix size " N DEC. ." must be divisible both by the block size " lparm DEC. ." and 4." EXIT THEN lparm 4 MOD IF cr ." block size must be evenly divisible by 4" EXIT THEN TIMER-RESET N 0 ?DO N 0 ?DO 0e c{{ J I }} DF! LOOP LOOP N 0 ?DO I TO i2 N 0 ?DO I TO kk i2 lparm + TO it kk lparm + TO kt N 0 ?DO I TO K it i2 ?DO 0e t0 DF! 0e t1 DF! 0e t2 DF! 0e t3 DF! 0e t4 DF! 0e t5 DF! 0e t6 DF! 0e t7 DF! kt kk ?DO a{{ J I }} DF@ FDUP b{{ I K }} DUP DF@+ F* t0 DF+! FDUP DF@+ F* t1 DF+! FDUP DF@+ F* t2 DF+! DF@ F* t3 DF+! a{{ J 1+ I }} DF@ FDUP DF@+ F* t4 DF+! FDUP DF@+ F* t5 DF+! FDUP DF@+ F* t6 DF+! DF@ F* t7 DF+! LOOP t0 DF@ c{{ I J }} DF+!+ t1 DF@ DF+!+ t2 DF@ DF+!+ t3 DF@ DF+! t4 DF@ c{{ I 1+ J }} DF+!+ t5 DF@ DF+!+ t6 DF@ DF+!+ t7 DF@ DF+! 2 +LOOP 4 +LOOP lparm +LOOP lparm +LOOP .ELAPSED ; : MM_MAENO \ pip1 pip2 pop3 nb -- \ Takes pointers to two FSL input arrays, an FSL output array and a \ block size nb. 0 0 0 0 0 LOCALS| K it kt i2 kk lparm pop3{{ pip2{{ pip1{{ | N 0 ?DO N 0 ?DO 0e pop3{{ J I }} DF! LOOP LOOP N 0 ?DO I TO i2 N 0 ?DO I TO kk i2 lparm + TO it kk lparm + TO kt N 0 ?DO I TO K it i2 ?DO 0e t0 DF! 0e t1 DF! 0e t2 DF! 0e t3 DF! 0e t4 DF! 0e t5 DF! 0e t6 DF! 0e t7 DF! kt kk ?DO pip1{{ J I }} DF@ FDUP pip2{{ I K }} DUP DF@+ F* t0 DF+! FDUP DF@+ F* t1 DF+! FDUP DF@+ F* t2 DF+! DF@ F* t3 DF+! pip1{{ J 1+ I }} DF@ FDUP DF@+ F* t4 DF+! FDUP DF@+ F* t5 DF+! FDUP DF@+ F* t6 DF+! DF@ F* t7 DF+! LOOP t0 DF@ pop3{{ I J }} DF+!+ t1 DF@ DF+!+ t2 DF@ DF+!+ t3 DF@ DF+! t4 DF@ pop3{{ I 1+ J }} DF+!+ t5 DF@ DF+!+ t6 DF@ DF+!+ t7 DF@ DF+! 2 +LOOP 4 +LOOP lparm +LOOP lparm +LOOP ; : MAENO2() \ nb -- s" mm - Generic Maeno, subarray " .algo dup 0 .R 'x' EMIT dup 0 .R N over MOD N 4 MOD OR IF cr ." the matrix size " N DEC. ." must be divisible both by the block size " DEC. ." and 4." EXIT THEN dup 4 MOD IF drop cr ." block size must be evenly divisible by 4" EXIT THEN TIMER-RESET >r a{{ b{{ c{{ r> MM_MAENO .ELAPSED ; : MM ( char n -- ) DEPTH 0= ABORT" no algorithm chosen" DEPTH 2 < IF 0 THEN LOCALS| ur | & a{{ N N }}malloc malloc-fail? & b{{ N N }}malloc malloc-fail? OR & bt{{ N N }}malloc malloc-fail? OR & c{{ N N }}malloc malloc-fail? OR & d{{ N N }}malloc malloc-fail? OR ABORT" MM :: out of core" SET-COEFFICIENTS FLUSH-CACHE CASE 'n' OF NORMAL() ENDOF 'v' OF TNSQ() ENDOF 'u' OF ur UNROLL ENDOF 'p' OF ur PNSQ() ENDOF 't' OF TRANSPOSE() ENDOF 'i' OF REG_LOOPS() ENDOF 'b' OF ur TILING() ENDOF 'm' OF ur MAENO() ENDOF 'z' OF ur MAENO2() ENDOF 'r' OF ROBERT() ENDOF 'w' OF ur WARNER() ENDOF CR ." `" DUP EMIT ." ' is an invalid algorithm" ENDCASE CHECK-RESULT & d{{ }}free & c{{ }}free & bt{{ }}free & b{{ }}free & a{{ }}free key? drop \ Permits o/p update on some systems ; : ALL-TESTS ( -- ) page 0 TotalTime ! 'n' mm 'v' mm 'u' 4 mm 'u' 8 mm 'u' 16 mm 'p' mm 'p' 4 mm 't' mm 'i' mm 'b' 20 mm 'r' mm 'm' 20 mm 'z' 20 mm 'w' 20 mm cr TCOL 7 + 0 ?DO ." =" LOOP cr ." Total" .testcond TCOL HTAB TotalTime @ .secs ; : .ABOUT CR ." Try: 'n' mm -- normal" CR ." 'v' mm -- with a temporary variable in the inner loop" CR ." 'u' n mm -- with unrolled (by n) inner loop, n = {4,8,16}" CR ." 'p' mm -- using pointers instead of array notation" CR ." 'p' 4 mm -- using pointers instead of array notation, unrolled by 4 [new]" CR ." 't' mm -- with transposed b matrix" CR ." 'i' mm -- with switched inner loops" CR ." 'b' n mm -- using blocking by n, 4 < n < " N DEC. CR ." 'r' mm -- using Robert's algorithm" CR ." 'r' 8 mm -- using Robert's algorithm unrolled by 8" CR ." 'm' n mm -- using Maeno's algorithm with blocking factor n" CR ." 'z' n mm -- using Maeno's algorithm with blocking factor n - generic form" CR ." 'w' n mm -- using Warner's algorithm with blocking factor n" CR CR ." ALL-TESTS -- test all algorithms" ; .ABOUT ( * End of Source * )