\ MM.FTH - Forth Floating point matrix multiply benchmark

0 [IF] ======================================================
This file is maintained by:
  Stephen Pelc
  MicroProcessor Engineering
  133 Hill Lane
  Southampton SO15 5AF
  England
  tel:   +44 (0)23 8063 1441
  email: stephen@mpeforth.com

The Forth MM benchmark was contributed by Marcel Hendrix, derived
from a C benchmark by Mark Smotherman. The code may be freely
redistributed. If you modify this code, and want the changes to
be incorporated in future releases, please send the changes to
Stephen Pelc.

The code is used to test MPE's VFX code generator and optimiser.
Except for legacy reasons, no attempt is made to maintain this
code for non-optimising and non-ANS systems.

The code is maintained to test ANS Forth systems. It does not take
advantage of system specific extensions, except for one (see below).
However, it uses the utility file from the Forth Scientific Library,
which is intended to be hacked for specific systems as much as you
can. Consequently, when you contribute results, please also contribute
the FSL utility file as well.

Change Notes
============
20170515 Changed directory separators for VFX 4.x layer.
20060901 Added other systems, refactored .ELAPSED to remove display
         effects.


*************************************
On 2.8GHz P4, 512Mb DDR266 RAM, XPpro
*************************************

SwiftForth 3.0.1  09-Aug-2006
500x500 mm - normal algorithm                       3.219 secs.
500x500 mm - temporary variable in loop             9.703 secs.
500x500 mm - unrolled inner loop, factor of 4       9.875 secs.
500x500 mm - unrolled inner loop, factor of 8       9.828 secs.
500x500 mm - unrolled inner loop, factor of 16      9.891 secs.
500x500 mm - pointers used to access matrices       3.984 secs.
500x500 mm - pointers used, unrolled by 4           3.969 secs.
500x500 mm - transposed B matrix                    8.750 secs.
500x500 mm - interchanged inner loops               8.766 secs.
500x500 mm - blocking, step size of 20             11.250 secs.
500x500 mm - Robert's algorithm                     2.641 secs.
500x500 mm - T. Maeno's algorithm, subarray 20x20   3.625 secs.
500x500 mm - Generic Maeno, subarray 20x20          2.985 secs.
500x500 mm - D. Warner's algorithm, subarray 20x20  9.313 secs.
=========================================================
Total using no extensions and using no hackery     97.799 secs. ok

iForth version 2.1.1662, generated 14:10:06, November 27, 2005.
500x500 mm - normal algorithm                       1.672 secs.
500x500 mm - temporary variable in loop             2.735 secs.
500x500 mm - unrolled inner loop, factor of 4       2.641 secs.
500x500 mm - unrolled inner loop, factor of 8       2.625 secs.
500x500 mm - unrolled inner loop, factor of 16      2.625 secs.
500x500 mm - pointers used to access matrices       1.734 secs.
500x500 mm - pointers used, unrolled by 4           1.594 secs.
500x500 mm - transposed B matrix                    1.140 secs.
500x500 mm - interchanged inner loops               2.203 secs.
500x500 mm - blocking, step size of 20              2.375 secs.
500x500 mm - Robert's algorithm                     0.594 secs.
500x500 mm - T. Maeno's algorithm, subarray 20x20   0.563 secs.
500x500 mm - Generic Maeno, subarray 20x20          0.829 secs.
500x500 mm - D. Warner's algorithm, subarray 20x20  1.297 secs.
=========================================================
Total using no extensions and using no hackery     24.627 secs. ok

VFX Forth Version: 4.00 [build 2325] Build date: 31 August 2006
500x500 mm - normal algorithm                       1.630 secs.
500x500 mm - temporary variable in loop             2.702 secs.
500x500 mm - unrolled inner loop, factor of 4       2.601 secs.
500x500 mm - unrolled inner loop, factor of 8       2.610 secs.
500x500 mm - unrolled inner loop, factor of 16      2.603 secs.
500x500 mm - pointers used to access matrices       2.235 secs.
500x500 mm - pointers used, unrolled by 4           1.599 secs.
500x500 mm - transposed B matrix                    1.150 secs.
500x500 mm - interchanged inner loops               2.169 secs.
500x500 mm - blocking, step size of 20              2.337 secs.
500x500 mm - Robert's algorithm                     0.610 secs.
500x500 mm - T. Maeno's algorithm, subarray 20x20   0.563 secs.
500x500 mm - Generic Maeno, subarray 20x20          0.619 secs.
500x500 mm - D. Warner's algorithm, subarray 20x20  1.265 secs.
=========================================================
Total using no extensions and using no hackery     24.693 secs. ok


===============================================================
[THEN]


\ ************************************************
\ Select system to be tested, set FORTHSYSTEM
\ to value of selected target.
\ Set SPECIFICS false to avoid system dependencies.
\ Set SPECIFICS true to show off implementation tricks.
\ Set HACKING false to use the base source code.
\ Set HACKING true to optimise the source code.
\ ************************************************

1  constant VfxForth4		\ MPE VFX Forth v4.x
2  constant VfxForth3		\ MPE VFX Forth v3.x
3  constant Pfw22		\ MPE ProForth 2.2
3  constant SwiftForth20	\ FI SwiftForth 2.0
5  constant SwiftForth15	\ FI SwiftForth 1.5
6  constant Win32Forth		\ Win32Forth 4.2
7  constant BigForth		\ BigForth 11 July 1999
8  constant BigForth-Linux	\ BigForth 11 July 1999
9  constant iForth		\ iForth 1.12 5 Aug 2001
10 constant iForth20		\ iForth 2.0 8 June 2002
11 constant SwiftForth22	\ FI SwiftForth 2.2.2.9
12 constant SwiftForth30	\ FI SwiftForth 3.0.1

\ Select system to test
  VfxForth4 constant ForthSystem
\ VfxForth3 constant ForthSystem
\ iForth20 constant ForthSystem
\ SwiftForth30 constant ForthSystem
\ Win32Forth constant ForthSystem

 false constant specifics	\ true to use system dependent code
 false constant hacking		\ true to use "guru" level code that
				\ makes assumptions of an optimising compiler.
 true constant ANSSystem	\ Some Forth 83 systems cannot compile
				\ all the test examples without carnal
				\ knowledge, especially if the compiler
				\ checks control structures.

: .specifics	\ -- ; display trick state
  ."  using"  specifics 0=
  if  ."  no"  then
  ."  extensions"
;

: .hacking	\ -- ; display hack state
  ."  using"  hacking 0=
  if  ."  no"  then
  ."  hackery"
;

: .testcond	\ -- ; display test conditions
  .specifics ."  and" .hacking
;


\ ***********************************
\ VFX Forth for Windows v4.x harness
\ **********************************

VfxForth4 ForthSystem = [IF]

[defined] +idata [if]
  +idata	\ enable P4 data options
  variable zzz	\ preallocate first IDATA buffer
[then]

\ 32 to sinthreshold
\ 64 to sinthreshold		\ best
\ 128 to sinthreshold		\ default
\ 256 to sinthreshold

\ #8 LoopAlignment		\ best=default
\ #32 loopAlignment

true constant ndp?	\ -- flag ; true if NDP stack version

mc" %vfxpath%/Lib"
  setmacro LibDir
mc" %vfxpath%/Lib/FSL/library"
  setmacro FslDir
mc" %vfxpath%/Lib/x86"
  setmacro NdpDir

ndp? [if]
  S" %NdpDir%/Ndp387" INCLUDED
\  -1 to FPsin?		\ slower
[else]
  S" %NdpDir%/Hfp387" INCLUDED
[then]

  char . dp-char !			\ select ANS number conversion
  char . fp-char !

-short-branches				\ disable short forward branches
S" %FslDir%/Vfx4Util" INCLUDED		\ FSL harness for VFX Forth 4.x

: HTAB  out @ - spaces ; \ n -- ; step to position n
: DEC.  ( n -- ) BASE @ >R DECIMAL . R> BASE ! ;

extern: DWORD PASCAL GetTickCount( void );

((
: COUNTER 	\ -- ms
  GetTickCount ;
))
[UNDEFINED] m*/ [IF]
[-sin
: m*/		\ d1 n2 +n3 -- dquot
\ *G The result dquot=(d1*n2)/n3. The intermediate value d1*n2
\ ** is triple-precision. In an ANS Forth standard program n3
\ ** can only be a positive signed number and a negative value
\ ** for n3 generates an ambiguous condition, which may cause
\ ** an error on some implementations.
  >r					\ -- d1 n2 ; R: -- n3
  s>d >r abs				\ -- d1 |n2| ; R: -- n3 sign(n2)
  -rot					\ -- |n2| d1 ; R: -- n3 sign(n2)
  s>d r> xor				\ -- |n2| d1 d1h*sign(n2) ; R: -- n3
  r> swap >r >r				\ -- |n2| d1 ; R: -- d1h*sign(n2) n3
  dabs rot				\ -- |d1| |n2| ; R: -- d1h*sign(n2) n3
  tuck um* 2swap um*			\ -- d1h*n2 d1l*n2 ; R: -- d1h*sign(n2) n3
  swap >r  0 d+ r> -rot			\ -- t ; R: -- d1h*sign(n2) n3
  r@ um/mod -rot r> um/mod nip swap
  r> IF dnegate THEN
;
sin]
[THEN]

Extern: BOOL PASCAL QueryPerformanceCounter( void * int64 );
Extern: BOOL PASCAL QueryPerformanceFrequency( void * int64 );

: Counter	\ -- ms
\ *G Return a ticker count in milliseconds.
\ *E seconds = count / freq
\ ** ms      = (count * 1000) / freq
\ *P Note that we assume that frequency can be expressed as
\ ** a positive 32 bit number.
  { | count[ 2 cells ] freq[ 2 cells ] -- ms }
  count[ QueryPerformanceCounter drop
  freq[ QueryPerformanceFrequency drop
  count[ 2@ swap			\ count
  #1000 freq[ @ m*/ drop		\ ms
;

: DFVARIABLE	#16 buffer:  ;

: cr  cr emptyidle  ;

[THEN]


\ ***********************************
\ VFX Forth for Windows v3.7+ harness
\ ***********************************

VfxForth3 ForthSystem = [IF]

[defined] +idata [if]
  +idata	\ enable P4 data options
  variable zzz	\ preallocate first IDATA buffer
[then]

true constant ndp?	\ -- flag ; true if NDP stack version

c" C:\Products\VfxForth.dev\Sources\Lib"
  setmacro LibDir
c" C:\Products\VfxForth.dev\Sources\Lib\FSL\library"
  setmacro FslDir
c" C:\Products\VfxForth.dev\Sources\Lib"
  setmacro NdpDir

ndp? [if]
  S" %NdpDir%\Ndp387" INCLUDED
[else]
  S" %NdpDir%\Hfp387" INCLUDED
[then]

  char . dp-char !			\ select ANS number conversion
  char . fp-char !

-short-branches				\ disable short forward branches
S" %FslDir%\VfxUtil" INCLUDED		\ FSL harness for VFX Forth 3.x

: HTAB  out @ - spaces ; \ n -- ; step to position n
: DEC.  ( n -- ) BASE @ >R DECIMAL . R> BASE ! ;

extern: DWORD PASCAL GetTickCount( void );

: COUNTER 	\ -- ms
  GetTickCount ;

: DFVARIABLE	#16 buffer:  ;

[THEN]


\ ******************************
\ iForth 2.0 for Windows harness
\ ******************************

iForth20 ForthSystem = [IF]

NEEDS -miscutil
NEEDS -dynlink
  0 VALUE 'counter
  S" kernel32.dll" LIBRARY-OPEN THROW ( dll)
  S" GetTickCount" ROT LIBRARY-FIND THROW TO 'counter

: counter ( -- ms ) 0 'counter FOREIGN ;

: defined	\ caddr -- flag
  find nip
;

\ include c:\dfwforth\examples\fsl\fsl_util.frt
include c:\dfwforth\include\fsl_util.frt

: &  ; IMMEDIATE

[THEN]


\ **********************
\ SwiftForth 2.0 harness
\ **********************

SwiftForth20 ForthSystem = [IF]

include C:\MyApps\SwiftForth20\Lib\Options\fpmath.f
: f<>   F= 0=  ;

include C:\MyApps\SwiftForth20\Lib\FSLib\Library\fsl-util.f

: HTAB    get-xy drop - spaces  ; \ n -- ; step to position n
: DEC.  ( n -- ) BASE @ >R DECIMAL . R> BASE ! ;

[THEN]


\ **********************
\ SwiftForth 2.2 harness
\ **********************

SwiftForth22 ForthSystem = [IF]

\ FPCONFIG.F should be in the BENCHMRK folder
include C:\MyApps\SwiftForth2229\Lib\Options\fpmath.f
: f<>   F= 0=  ;

include C:\MyApps\SwiftForth2229\Unsupported\FSLib\Library\fsl-util.f

: HTAB    get-xy drop - spaces  ; \ n -- ; step to position n
: DEC.  ( n -- ) BASE @ >R DECIMAL . R> BASE ! ;

[THEN]


\ **********************
\ SwiftForth 3.0 harness
\ **********************

SwiftForth30 ForthSystem = [IF]

\ FPCONFIG.F should be in the BENCHMRK folder
include C:\MyApps\Forth200x\SwiftForth300\SwiftForth\Lib\Options\fpmath.f
: f<>   F= 0=  ;

include C:\MyApps\Forth200x\SwiftForth300\SwiftForth\Unsupported\FSLib\Library\fsl-util.f

: HTAB    get-xy drop - spaces  ; \ n -- ; step to position n
: DEC.  ( n -- ) BASE @ >R DECIMAL . R> BASE ! ;

[THEN]


\ ******************
\ Win32Forth harness
\ ******************

Win32Forth ForthSystem = [IF]

: COUNTER 	\ -- ms
  Call GetTickCount ;

: >pos          \ n -- ; step to position n
  getxy drop - spaces
;

: HTAB  #TAB ; \ n -- ; step to position n

: M/            \ d n1 -- quot
  fm/mod nip
;

: buffer:	\ n -- ; -- addr
  create
    here  over allot  swap erase
;

: 2-		\ n -- n-2
  2 -
;

: [o/n]		\ -- ; stop optimiser treating * DROP etc as no code
; immediate

: SendMessage	\ h m w l -- res
  swap 2swap swap		\ Win32Forth uses reverse order
  Call SendMessage
;

: GetTickCount	\ -- ms
  Call GetTickCount
;

[THEN]


\ ********************
\ Start of common code
\ ********************

500 CONSTANT N	\ -- n ; number of iterations to use


\ ******************
\ Timing and display
\ ******************

50 CONSTANT TCOL

0 VALUE _time_
VARIABLE TotalTime

: TIMER-RESET ( -- )
  COUNTER TO _time_  ;

: #? ( d -- d )
  2DUP OR 0=
  IF  BL HOLD  ELSE  #  THEN ;

: .secs		\ ms --
  0 <#  BL HOLD # # # [char] . HOLD # #? #?  #> TYPE
  ." secs."
;

: .ELAPSED ( -- )
  COUNTER _time_ - dup TotalTime +!	\ SFP001
  TCOL HTAB .secs
;

: .algo		\ caddr u --
\ Display size and algorithm from string
  CR N 0 .R  ." x"  N 0 .R  SPACE TYPE
;


\ *****
\ TOOLS
\ *****

C" DF@+"  DEFINED 0= [IF] : DF@+  ( addr -- addr' ) ( F: -- r ) DUP DF@ DFLOAT+ ; [THEN]
C" DF+!"  DEFINED 0= [IF] : DF+!  ( addr -- )       ( F: r -- ) DUP DF@ F+ DF!  ; [THEN]
C" DF!+"  DEFINED 0= [IF] : DF!+  ( addr -- addr' ) ( F: r -- ) DUP DF! DFLOAT+ ; [THEN]
C" DF+!+" DEFINED 0= [IF] : DF+!+ ( addr -- addr' ) ( F: r -- ) DUP DF@ F+ DF!+ ; [THEN]

C" *DSUM" DEFINED 0= [IF]

: *DSUM  ( addr1 addr2 count -- addr1' addr2' ) ( F: -- n )
  0e 0 ?DO  SWAP DF@+ SWAP DF@+ F* F+  LOOP ;

: *DSUML ( addr1 addr2 count stride2 -- addr1' addr2' ) ( F: -- r )
  LOCALS| stride2 |
  0e 0 ?DO
    SWAP DF@+
    SWAP DUP DF@ stride2 +
    F* F+
  LOOP ;

[THEN]

CHAR x CONSTANT 'x'
CHAR n CONSTANT	'n'
CHAR v CONSTANT	'v'
CHAR u CONSTANT	'u'
CHAR p CONSTANT	'p'
CHAR t CONSTANT	't'
CHAR i CONSTANT	'i'
CHAR b CONSTANT	'b'
CHAR m CONSTANT	'm'
CHAR r CONSTANT	'r'
CHAR w CONSTANT	'w'
CHAR z CONSTANT 'z'


\ =====================================================================================

DOUBLE DMATRIX  a{{
DOUBLE DMATRIX  b{{
DOUBLE DMATRIX  c{{
DOUBLE DMATRIX  d{{
DOUBLE DMATRIX bt{{

0 [IF] ================================================
\ Useful test bits

: .a{{		\ -- ; display A{{ matrix
  cr ." A{{"
  N 0 ?DO
    cr  N 0 ?DO
      a{{ J I }} DF@ F.
    LOOP
  LOOP
;

: .b{{		\ -- ; display B{{ matrix
  cr ." B{{"
  N 0 ?DO
    cr  N 0 ?DO
      b{{ J I }} DF@ F.
    LOOP
  LOOP
;

: .c{{		\ -- ; display B{{ matrix
  cr ." C{{"
  N 0 ?DO
    cr  N 0 ?DO
      c{{ J I }} DF@ F.
    LOOP
  LOOP
;

: .bt{{		\ -- ; display B{{ matrix
  cr ." BT{{"
  N 0 ?DO
    cr  N 0 ?DO
      bt{{ J I }} DF@ F.
    LOOP
  LOOP
;

[THEN]

: SET-COEFFICIENTS ( -- )
\ Set coefficients so that result matrix should have row entries
\ equal to (1/2)*n*(n-1)*i in row i
  N 0 ?DO
    N 0 ?DO
      J S>F FDUP b{{ J I }} DF!  a{{ J I }} DF!
    LOOP
  LOOP ;

: FLUSH-CACHE ( -- )
  N 0 ?DO
    N 0 ?DO
      0e d{{ J I }} DF!
    LOOP
  LOOP ;

FVARIABLE row_sum
FVARIABLE sum

: CHECK-RESULT ( -- )
  0e row_sum F!
  N N 1- * 2/ S>F sum F!
  N 0 ?DO
    I S>F sum F@ F* row_sum F!
    N 0 ?DO
      a{{ J I }} DF@  J S>F  F<> IF
        CR ." error in result entry a{{ " J DEC. I DEC. ." }}: "
        a{{ J I }} DF@ F. ." <> "  J S>F F.
        UNLOOP UNLOOP EXIT
      THEN
      b{{ J I }} DF@ J S>F  F<> IF
        CR ." error in result entry b{{ " J DEC. I DEC. ." }}: "
        b{{ J I }} DF@ F. ." <> "  J S>F F.
        UNLOOP UNLOOP EXIT
      THEN
      c{{ J I }} DF@ row_sum F@ F<> IF
        CR ." error in result entry c{{ " J DEC. I DEC. ." }}: "
        c{{ J I }} DF@ F. ." <> " row_sum F@ F.
        UNLOOP UNLOOP EXIT
      THEN
    LOOP
  LOOP  ;

: NORMAL() ( -- )
  s" mm - normal algorithm" .algo
  TIMER-RESET
  N 0 ?DO
    N 0 ?DO
      a{{ J 0 }} b{{ 0 I }} N N DFLOATS *DSUML 2DROP  c{{ J I }} DF!
    LOOP
  LOOP
  .ELAPSED  ;

: TNSQ() ( -- )
  0 LOCALS| K |
  s" mm - temporary variable in loop" .algo
  TIMER-RESET
  N 0 ?DO  I TO K
    N 0 ?DO
      a{{ J 0 }} DF@  b{{ 0 I }} DF@ F*
      N 1 ?DO
        a{{ K I }} DF@  b{{ I J }} DF@ F* F+
      LOOP
      c{{ J I }} DF!
    LOOP
  LOOP
  .ELAPSED ;

: UNROLL4() ( -- )
  0 0 LOCALS| K S |
  s" mm - unrolled inner loop, factor of 4" .algo
  TIMER-RESET
  N 0 ?DO  I TO K
    N 0 ?DO  0e  0 TO S
      N 3 - 0 ?DO  I TO S
        a{{ K I     }} DF@  b{{ I     J }} DF@ F* F+
        a{{ K I 1+  }} DF@  b{{ I 1+  J }} DF@ F* F+
        a{{ K I 2+  }} DF@  b{{ I 2+  J }} DF@ F* F+
        a{{ K I 3 + }} DF@  b{{ I 3 + J }} DF@ F* F+
      4 +LOOP
      N  S 4 + ?DO
        a{{ K I     }} DF@  b{{ I     J }} DF@ F* F+
      LOOP
      c{{ J I }} DF!
    LOOP
  LOOP
  .ELAPSED ;

: UNROLL8() ( -- )
  0 0 LOCALS| K S |
  s" mm - unrolled inner loop, factor of 8" .algo
  TIMER-RESET
  N 0 ?DO  I TO K
    N 0 ?DO  0e  0 TO S
      N 7 - 0 ?DO  I TO S
        a{{ K I     }} DF@  b{{ I     J }} DF@ F* F+
        a{{ K I 1+  }} DF@  b{{ I 1+  J }} DF@ F* F+
        a{{ K I 2+  }} DF@  b{{ I 2+  J }} DF@ F* F+
        a{{ K I 3 + }} DF@  b{{ I 3 + J }} DF@ F* F+
        a{{ K I 4 + }} DF@  b{{ I 4 + J }} DF@ F* F+
        a{{ K I 5 + }} DF@  b{{ I 5 + J }} DF@ F* F+
        a{{ K I 6 + }} DF@  b{{ I 6 + J }} DF@ F* F+
        a{{ K I 7 + }} DF@  b{{ I 7 + J }} DF@ F* F+
      8 +LOOP
      N  S 8 + ?DO
        a{{ K I     }} DF@  b{{ I     J }} DF@ F* F+
      LOOP
      c{{ J I }} DF!
    LOOP
  LOOP
  .ELAPSED ;

: UNROLL16() ( -- )
  0 0 LOCALS| K S |
  s" mm - unrolled inner loop, factor of 16" .algo
  TIMER-RESET
  N 0 ?DO  I TO K
    N 0 ?DO  0e 0 TO S
      N 15 - 0 ?DO  I TO S
        a{{ K I      }} DF@  b{{ I      J }} DF@ F* F+
        a{{ K I  1+  }} DF@  b{{ I  1+  J }} DF@ F* F+
        a{{ K I  2+  }} DF@  b{{ I  2+  J }} DF@ F* F+
        a{{ K I  3 + }} DF@  b{{ I  3 + J }} DF@ F* F+
        a{{ K I  4 + }} DF@  b{{ I  4 + J }} DF@ F* F+
        a{{ K I  5 + }} DF@  b{{ I  5 + J }} DF@ F* F+
        a{{ K I  6 + }} DF@  b{{ I  6 + J }} DF@ F* F+
        a{{ K I  7 + }} DF@  b{{ I  7 + J }} DF@ F* F+
        a{{ K I  8 + }} DF@  b{{ I  8 + J }} DF@ F* F+
        a{{ K I  9 + }} DF@  b{{ I  9 + J }} DF@ F* F+
        a{{ K I 10 + }} DF@  b{{ I 10 + J }} DF@ F* F+
        a{{ K I 11 + }} DF@  b{{ I 11 + J }} DF@ F* F+
        a{{ K I 12 + }} DF@  b{{ I 12 + J }} DF@ F* F+
        a{{ K I 13 + }} DF@  b{{ I 13 + J }} DF@ F* F+
        a{{ K I 14 + }} DF@  b{{ I 14 + J }} DF@ F* F+
        a{{ K I 15 + }} DF@  b{{ I 15 + J }} DF@ F* F+
      16 +LOOP
      N S  16 + ?DO
        a{{ K I      }} DF@  b{{ I      J }} DF@ F* F+
      LOOP
      c{{ J I }} DF!
    LOOP
  LOOP
  .ELAPSED ;

: UNROLL ( n -- )
  CASE
     4 OF UNROLL4()  ENDOF
     8 OF UNROLL8()  ENDOF
    16 OF UNROLL16() ENDOF
          CR ." mm - unrolled inner loop, factor of " DUP DEC. ." not implemented"
  ENDCASE ;

specifics [if]
  VfxForth3 ForthSystem = [if]  -fasti  [then]
[then]

: PNSQ4() ( -- )
  0 LOCALS| S |
  s" mm - pointers used, unrolled by 4" .algo
  TIMER-RESET
  N 0 ?DO
    N 0 ?DO
      0e  a{{ J 0 }}  b{{ 0 I }}  0 TO S
      N 3 -   0 ?DO  I TO S
        SWAP DF@+ SWAP DUP DF@ F* F+ N DFLOATS +
        SWAP DF@+ SWAP DUP DF@ F* F+ N DFLOATS +
        SWAP DF@+ SWAP DUP DF@ F* F+ N DFLOATS +
        SWAP DF@+ SWAP DUP DF@ F* F+ N DFLOATS +
      4 +LOOP
      N   S 4 + ?DO
        SWAP DF@+ SWAP DUP DF@ F* F+ N DFLOATS +
      LOOP
      c{{ J I }} DF!  2DROP
    LOOP
  LOOP
  .ELAPSED ;

: PNSQ() ( n -- )
  DUP 4 =
  IF  DROP PNSQ4() EXIT  THEN
  s" mm - pointers used to access matrices" .algo
  ?DUP IF
    ." , unroll factor of " DEC. ." not allowed" EXIT
  THEN
  TIMER-RESET
  N 0 ?DO
    N 0 ?DO
      0e  a{{ J 0 }}  b{{ 0 I }}
      N 0 ?DO
        SWAP DF@+ SWAP
        DUP DF@ N DFLOATS +
        F* F+
      LOOP
      c{{ J I }} DF!  2DROP
    LOOP
  LOOP
  .ELAPSED ;

specifics [if]
  VfxForth3 ForthSystem = [if]  +fasti  [then]
[then]


: TRANSPOSE() ( -- )
  0 LOCALS| K |
  s" mm - transposed B matrix" .algo
  TIMER-RESET
  N 0 ?DO
    N 0 ?DO
      b{{ J I }} DF@  bt{{ I J }} DF!
    LOOP
  LOOP
  N 0 ?DO  I TO K
    N 0 ?DO
      a{{ J 0 }} DF@  bt{{ I 0 }} DF@ F*
      N 1 ?DO
        a{{ K I }} DF@  bt{{ J I }} DF@ F* F+
      LOOP
      c{{ J I }} DF!
    LOOP
  LOOP
  .ELAPSED ;

\ from Monica Lam ASPLOS-IV paper
: REG_LOOPS() ( -- )
  0 LOCALS| K |
  s" mm - interchanged inner loops" .algo
  TIMER-RESET
  N 0 ?DO
    N 0 ?DO
      0e c{{ J I }} DF!
    LOOP
  LOOP
  N 0 ?DO  I TO K
    N 0 ?DO
      a{{ J I }} DF@
      N 0 ?DO
        FDUP b{{ J I }} DF@ F* c{{ K I }} DF+!
      LOOP
      FDROP
    LOOP
  LOOP
  .ELAPSED ;

: TILING() ( step -- )
  s" mm - blocking, step size of " .algo  DUP DEC.
  DUP 4 N 1+ WITHIN 0= IF
    drop ." is unreasonable"  EXIT
  THEN
  TIMER-RESET
  0 0 0 LOCALS| K kk jj step |
  N 0 ?DO  N 0 ?DO  0e c{{ J I }} DF!  LOOP LOOP
  N 0 ?DO  I TO kk
    N 0 ?DO  I TO jj
      N 0 ?DO   I TO K
        kk step + N MIN  kk ?DO
          a{{ J I }} DF@
	  jj step + N MIN  jj ?DO
            FDUP b{{ J I }} DF@ F*  c{{ K I }} DF+!
	  LOOP
          FDROP
        LOOP
      LOOP
    step +LOOP
  step +LOOP
  .ELAPSED
;


\ ********************************************/
\ * Contributed by Robert Debath 26 Nov 1995 */
\ * rdebath@cix.compulink.co.uk              */
\ ********************************************/

: ROBERT() ( -- )
  s" mm - Robert's algorithm" .algo
  TIMER-RESET
  N 0 ?DO  N 0 ?DO
    b{{ J I }} DF@  bt{{ I J }} DF!
  LOOP LOOP
  N 0 ?DO  N 0 ?DO
    a{{ J 0 }} bt{{ I 0 }} N *DSUM 2DROP c{{ J I }} DF!
  LOOP LOOP
  .ELAPSED
;


0 [IF] ===========================================================================
 * Matrix Multiply by Dan Warner, Dept. of Mathematics, Clemson University
 *
 *    mmbu2.f multiplies matrices a and b
 *    a and b are n by n matrices
 *    nb is the blocking parameter.
 *    the tuning guide indicates nb = 50 is reasonable for the
 *    ibm model 530 hence 25 should be reasonable for the 320
 *    since the 320 has 32k rather than 64k of cache.
 *    Inner loops unrolled to depth of 2
 *    The loop functions without clean up code at the end only
 *    if the unrolling occurs to a depth k which divides into n
 *    in this case n must be divisible by 2.
 *    The blocking parameter nb must divide into n if the
 *    multiply is to succeed without clean up code at the end.
 *
 * converted to c by Mark Smotherman
 * note that nb must also be divisible by 2 => cannot use 25, so use 20
=========================================================================== [THEN]

DFVARIABLE s10
DFVARIABLE s00
DFVARIABLE s01
DFVARIABLE s11

: WARNER() ( nb -- )
  0 0 0 0 LOCALS| K ii jj kk nb |
  s" mm - D. Warner's algorithm, subarray " .algo  nb 0 .R  'x' EMIT  nb 0 .R
  N nb MOD  N 2 MOD OR IF
    cr ."   the matrix size " N DEC.
    ." must be divisible both by the block size " nb DEC. ." and 2."
    EXIT
  THEN
  nb 2 MOD IF
    ."   block size must be evenly divisible by 2"
    EXIT
  THEN
  TIMER-RESET
  N 0 ?DO  I TO ii
    N 0 ?DO  I TO jj
      nb ii + ii ?DO
        nb jj + jj ?DO
          0e c{{ J I }} DF!
        LOOP
      LOOP
      N 0 ?DO  I TO kk
        nb ii + ii ?DO  I TO K
          nb jj + jj ?DO
	    c{{ J     I    }} DF@ s00 DF!
            c{{ J     I 1+ }} DF@ s01 DF!
            c{{ J 1+  I    }} DF@ s10 DF!
            c{{ J 1+  I 1+ }} DF@ s11 DF!
            nb kk + kk ?DO
	      a{{ K    I }} DF@  b{{ I J    }} DF@ F* s00 DF+!
              a{{ K    I }} DF@  b{{ I J 1+ }} DF@ F* s01 DF+!
              a{{ K 1+ I }} DF@  b{{ I J    }} DF@ F* s10 DF+!
              a{{ K 1+ I }} DF@  b{{ I J 1+ }} DF@ F* s11 DF+!
            LOOP
            s00 DF@ c{{ J     I    }} DF!
            s01 DF@ c{{ J     I 1+ }} DF!
            s10 DF@ c{{ J 1+  I    }} DF!
            s11 DF@ c{{ J 1+  I 1+ }} DF!
          2 +LOOP
        2 +LOOP
      nb +LOOP
    nb +LOOP
  nb +LOOP
  .ELAPSED ;


0 [IF] ===========================================================================
Matrix Multiply tuned for SS-10/30;
 *                      Maeno Toshinori
 *                      Tokyo Institute of Technology
 *
 * Using gcc-2.4.1 (-O2), this program ends in 12 seconds on SS-10/30.
 *
 * in original algorithm - sub-area for cache tiling
 * #define      L       20
 * #define      L2      20
 * three 20x20 matrices reside in cache; two may be enough
=========================================================================== [THEN]

DFVARIABLE t0
DFVARIABLE t1
DFVARIABLE t2
DFVARIABLE t3
DFVARIABLE t4
DFVARIABLE t5
DFVARIABLE t6
DFVARIABLE t7

: MAENO() ( nb -- )
  0 0 0 0 0 LOCALS| K it kt i2 kk lparm |
  s" mm - T. Maeno's algorithm, subarray " .algo  lparm 0 .R 'x' EMIT lparm 0 .R
  N lparm MOD  N 4 MOD OR IF
    cr ."   the matrix size " N DEC.
    ." must be divisible both by the block size " lparm DEC. ." and 4."
    EXIT
  THEN
  lparm 4 MOD IF
    cr ."   block size must be evenly divisible by 4"
    EXIT
  THEN
  TIMER-RESET
  N 0 ?DO N 0 ?DO  0e c{{ J I }} DF!  LOOP LOOP
  N 0 ?DO  I TO i2
    N 0 ?DO  I TO kk
      i2 lparm + TO it
      kk lparm + TO kt
      N 0 ?DO  I TO K
        it i2 ?DO
          0e t0 DF!  0e t1 DF!  0e t2 DF!  0e t3 DF!
          0e t4 DF!  0e t5 DF!  0e t6 DF!  0e t7 DF!
          kt  kk ?DO
            a{{ J I }} DF@
            FDUP b{{ I K }} DUP DF@+ F* t0 DF+!
            FDUP                DF@+ F* t1 DF+!
            FDUP                DF@+ F* t2 DF+!
                                DF@  F* t3 DF+!
            a{{ J 1+ I }}  DF@
            FDUP                DF@+ F* t4 DF+!
            FDUP                DF@+ F* t5 DF+!
            FDUP                DF@+ F* t6 DF+!
                                DF@  F* t7 DF+!
          LOOP
          t0 DF@ c{{ I    J }}  DF+!+
          t1 DF@                DF+!+
          t2 DF@                DF+!+
          t3 DF@                DF+!
          t4 DF@ c{{ I 1+ J }}  DF+!+
          t5 DF@                DF+!+
          t6 DF@                DF+!+
          t7 DF@                DF+!
        2 +LOOP
      4 +LOOP
    lparm +LOOP
  lparm +LOOP
  .ELAPSED ;

: MM_MAENO	\ pip1 pip2 pop3 nb --
\ Takes pointers to two FSL input arrays, an FSL output array and a
\ block size nb.
  0 0 0 0 0 LOCALS| K it kt i2 kk lparm pop3{{ pip2{{ pip1{{ |
  N 0 ?DO N 0 ?DO  0e pop3{{ J I }} DF!  LOOP LOOP
  N 0 ?DO  I TO i2
    N 0 ?DO  I TO kk
      i2 lparm + TO it
      kk lparm + TO kt
      N 0 ?DO  I TO K
        it i2 ?DO
          0e t0 DF!  0e t1 DF!  0e t2 DF!  0e t3 DF!
          0e t4 DF!  0e t5 DF!  0e t6 DF!  0e t7 DF!
          kt  kk ?DO
            pip1{{ J I }} DF@
            FDUP pip2{{ I K }}  DUP DF@+ F* t0 DF+!
            FDUP                DF@+ F* t1 DF+!
            FDUP                DF@+ F* t2 DF+!
                                DF@  F* t3 DF+!
            pip1{{ J 1+ I }}    DF@
            FDUP                DF@+ F* t4 DF+!
            FDUP                DF@+ F* t5 DF+!
            FDUP                DF@+ F* t6 DF+!
                                DF@  F* t7 DF+!
          LOOP
          t0 DF@ pop3{{ I    J }}  DF+!+
          t1 DF@                DF+!+
          t2 DF@                DF+!+
          t3 DF@                DF+!
          t4 DF@ pop3{{ I 1+ J }}  DF+!+
          t5 DF@                DF+!+
          t6 DF@                DF+!+
          t7 DF@                DF+!
        2 +LOOP
      4 +LOOP
    lparm +LOOP
  lparm +LOOP
;

: MAENO2()	\ nb --
  s" mm - Generic Maeno, subarray " .algo  dup 0 .R  'x' EMIT  dup 0 .R
  N over MOD  N 4 MOD OR IF
    cr ."   the matrix size " N DEC.
    ." must be divisible both by the block size " DEC. ." and 4."
    EXIT
  THEN
  dup 4 MOD IF
    drop
    cr ."   block size must be evenly divisible by 4"
    EXIT
  THEN
  TIMER-RESET
  >r a{{ b{{ c{{ r> MM_MAENO
  .ELAPSED ;

: MM ( char n -- )
  DEPTH 0= ABORT" no algorithm chosen"
  DEPTH 2 < IF 0 THEN
  LOCALS| ur |
  &  a{{ N N }}malloc malloc-fail?
  &  b{{ N N }}malloc malloc-fail? OR
  & bt{{ N N }}malloc malloc-fail? OR
  &  c{{ N N }}malloc malloc-fail? OR
  &  d{{ N N }}malloc malloc-fail? OR ABORT" MM :: out of core"
  SET-COEFFICIENTS
  FLUSH-CACHE
  CASE
    'n' OF NORMAL()    ENDOF
    'v' OF TNSQ()      ENDOF
    'u' OF ur UNROLL   ENDOF
    'p' OF ur PNSQ()   ENDOF
    't' OF TRANSPOSE() ENDOF
    'i' OF REG_LOOPS() ENDOF
    'b' OF ur TILING() ENDOF
    'm' OF ur MAENO()  ENDOF
    'z' OF ur MAENO2() ENDOF
    'r' OF ROBERT()    ENDOF
    'w' OF ur WARNER() ENDOF
           CR ." `" DUP EMIT ." ' is an invalid algorithm"
  ENDCASE
  CHECK-RESULT
  &  d{{ }}free
  &  c{{ }}free
  & bt{{ }}free
  &  b{{ }}free
  &  a{{ }}free
  key? drop			\ Permits o/p update on some systems
;

: ALL-TESTS ( -- )
  page
  0 TotalTime !
  'n'    mm    'v'    mm    'u'  4 mm    'u'  8 mm
  'u' 16 mm    'p'    mm    'p'  4 mm    't'    mm
  'i'    mm    'b' 20 mm    'r'    mm
  'm' 20 mm    'z' 20 mm    'w' 20 mm
  cr  TCOL 7 + 0
  ?DO  ." ="  LOOP
  cr ." Total" .testcond TCOL HTAB  TotalTime @ .secs
;

: .ABOUT
  CR ." Try: 'n'    mm  -- normal"
  CR ."      'v'    mm  -- with a temporary variable in the inner loop"
  CR ."      'u'  n mm  -- with unrolled (by n) inner loop, n = {4,8,16}"
  CR ."      'p'    mm  -- using pointers instead of array notation"
  CR ."      'p'  4 mm  -- using pointers instead of array notation, unrolled by 4 [new]"
  CR ."      't'    mm  -- with transposed b matrix"
  CR ."      'i'    mm  -- with switched inner loops"
  CR ."      'b'  n mm  -- using blocking by n, 4 < n < " N DEC.
  CR ."      'r'    mm  -- using Robert's algorithm"
  CR ."      'r'  8 mm  -- using Robert's algorithm unrolled by 8"
  CR ."      'm'  n mm  -- using Maeno's algorithm with blocking factor n"
  CR ."      'z'  n mm  -- using Maeno's algorithm with blocking factor n - generic form"
  CR ."      'w'  n mm  -- using Warner's algorithm with blocking factor n"
  CR
  CR ." ALL-TESTS       -- test all algorithms" ;

.ABOUT

( * End of Source * )