 +----------------------------------------------+
 |  One Example for Speed-Optimizing FPU-Code   |
 +----------------------------------------------+


 Here is an example for optimizing the V3D::Add-function:

	void V3D::Add (V3D &a, V3D &b, V3D &c) {
		c.x = a.x + b.x;
		c.y = a.y + b.y;
		c.z = a.z + b.z;
	}

 As you see, this is one of the simplest functions one could choose.
 But even this primitive code isn't optimized perfectly by modern
 compilers (I used Watcom 11.0 and VC++). So please don't rely on
 your compiler blindly, and check the produced code (debugger,
 wdis.exe, etc.) in time-critical cases.


 +----------+
 | THE TEST |
 +----------+


 I) C++ Compiler
    ------------

   1) Watcom C++ 11.0:
      ----------------

     call: wpp386 /5r /fp5 /fpi87 /zp8 /mf /otexanhk FpuOpt.cpp

     code: void near V3D::Add( V3D near &, V3D near & ) {
               fld       dword ptr [edx]
               fadd      dword ptr [ebx]
               fstp      dword ptr [eax]
               fld       dword ptr 0x4[edx]
               fadd      dword ptr 0x4[ebx]
               fstp      dword ptr 0x4[eax]
               fld       dword ptr 0x8[edx]
               fadd      dword ptr 0x8[ebx]
               fstp      dword ptr 0x8[eax]
               ret
           }

   2) Visual C++:   (I don't own a newer version, so I used VC 5.0
      -----------    at my university for testing)

     call: cl /c /Ox /G5 /Gr /Zp8 FpuOpt.cpp

     code: ?Add@V3D@@QAIXAAV1@0@Z {
               mov       eax,0x4[esp]
               fld       dword ptr [eax]
               fadd      dword ptr [edx]
               fstp      dword ptr [ecx]
               fld       dword ptr 0x4[edx]
               fadd      dword ptr 0x4[eax]
               fstp      dword ptr 0x4[ecx]
               fld       dword ptr 0x8[edx]
               fadd      dword ptr 0x8[eax]
               fstp      dword ptr 0x8[ecx]
               ret       0x00000004
           }

   As you see, they both produce almost the same result. Both compilers
   pass the parameters in registers, with the exception of the one stack
   parameter in the VC++-code.

   Analyzing the FPU-code with VTune reveals the following (you can also
   use a simple cycle tester like the one in Agner Fog's great article on
   P5-optimizing):

    Source                     Clocks Penalties and Warnings
    fld      DWORD PTR [eax]   1
    fadd     DWORD PTR [edx]   1+2
    fstp     DWORD PTR [ebx]   5      FP_Dep_st(0):2, fst_Dep:1
    fld      DWORD PTR [eax+4] 1
    fadd     DWORD PTR [edx+4] 1+2
    fstp     DWORD PTR [ebx+4] 5      FP_Dep_st(0):2, fst_Dep:1
    fld      DWORD PTR [eax+8] 1
    fadd     DWORD PTR [edx+8] 1+2
    fstp     DWORD PTR [ebx+8] 5      FP_Dep_st(0):2, fst_Dep:1

    above code                  21 cycles
    + inits+call               ~24 cycles (~25 for the VC++-version)
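
   If you have neither VTune nor a cycle tester at hand, a rough timing loop
   is better than nothing. The following is only a sketch: it assumes V3D is
   a plain struct of three floats and that Add is a static member, and the
   helper time_add_ns is a made-up name. It measures wall-clock time with
   std::chrono, not Pentium cycles, and the loop overhead is included, so
   use it to compare variants, not to reproduce the numbers above.

```cpp
#include <chrono>

// Assumed layout: three packed floats, matching the offsets 0/4/8 in the listings.
struct V3D {
    float x, y, z;

    // Assumption: Add is a static member; the article only shows its body.
    static void Add(V3D &a, V3D &b, V3D &c) {
        c.x = a.x + b.x;
        c.y = a.y + b.y;
        c.z = a.z + b.z;
    }
};

// Hypothetical helper: average wall-clock nanoseconds per Add call.
// The last result is written to 'out' so the caller can check it.
double time_add_ns(long iterations, V3D &out) {
    V3D a{1.0f, 2.0f, 3.0f};
    V3D b{0.5f, 0.25f, 0.125f};

    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        V3D::Add(a, b, out);
        a.x = out.z;  // feed a result back so the loop is not trivially removable
    }
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

   Call it with a large iteration count (a few million) and compare the
   result between different versions of Add built with the same options.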


   comment:
   ~~~~~~~
   As you see, both compilers make a big !!MISTAKE!! :
   The fstp's have to wait for the preceding fadd's to finish. This stall
   takes a full 3 cycles away each time. Everybody knows about this, so why
   does the compiler generate this kind of code ?
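
   To make the cost of the stall explicit, here is a small sketch that just
   adds up the VTune numbers from the listing. The struct and function names
   are made up for this illustration. The base cost of 2 cycles for an
   unstalled fstp is read off the later listings, where an fstp without
   penalties is shown with 2 clocks.

```cpp
// Per-instruction costs of one fld/fadd/fstp group, in Pentium cycles,
// as reported in the VTune listing of the compiler output.
struct GroupCost {
    int fld;        // fld  dword ptr [mem]
    int fadd;       // fadd dword ptr [mem], issue cycle only
    int fstp_base;  // fstp without any stall
    int fp_dep;     // FP_Dep_st(0): waiting for the fadd result
    int fst_dep;    // fst_Dep: extra wait before the store
};

constexpr GroupCost kCompilerGroup{1, 1, 2, 2, 1};

// Cycles for one x/y/z component, including the stalls.
constexpr int group_cycles(const GroupCost &g) {
    return g.fld + g.fadd + g.fstp_base + g.fp_dep + g.fst_dep;
}

// Cycles of that group which are pure waiting.
constexpr int group_stalls(const GroupCost &g) {
    return g.fp_dep + g.fst_dep;
}

constexpr int kComponents = 3;  // x, y, z

constexpr int compiler_total_cycles() {
    return kComponents * group_cycles(kCompilerGroup);
}

constexpr int compiler_stall_cycles() {
    return kComponents * group_stalls(kCompilerGroup);
}
```

   With the numbers above this gives 21 cycles in total, 9 of which are
   spent waiting in the three fstp's.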



 II) First simple optimization (eax = &a, ebx = &b, ecx = &c)
     -------------------------

     Source                     Clocks Penalties and Warnings
     fld      DWORD PTR [eax]   2      Exp_AGI_U_Pen:1
     fadd     DWORD PTR [ebx]   1+2
     fld      DWORD PTR [eax+4] 1
     fadd     DWORD PTR [ebx+4] 1+2
     fld      DWORD PTR [eax+8] 1
     fadd     DWORD PTR [ebx+8] 1+2
     fstp     DWORD PTR [ecx+8] 5      FP_Dep_st(0):2, fst_Dep:1
     fstp     DWORD PTR [ecx+4] 2
     fstp     DWORD PTR [ecx]   2

     above code                  16 cycles (15 if no AGI stall at the first fld)
     + inits (base reg load)     ~18 cycles

   comment:
   ~~~~~~~
   The fstp's have been moved to the end of the code, so only the first one
   causes a stall. This simple change saves 5 cycles (6 without the AGI
   stall). The AGI stall at the first fld is caused by loading register eax
   in the preceding instruction (you can avoid this, too).
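
   The same idea can be written down in the C++ source: do all three
   additions first and store afterwards. This is only a sketch under
   assumptions (V3D as a plain struct of three floats, a made-up member name
   AddStoresLast). I have not checked whether Watcom 11.0 or VC 5.0 really
   emit the fstp's at the end for this source, so look at the disassembly
   before relying on it.

```cpp
// Assumed layout: three packed floats, matching the offsets 0/4/8 in the listings.
struct V3D {
    float x, y, z;

    // Hypothetical variant of Add: all sums are computed into locals first,
    // then written out. Because nothing is stored before the last load,
    // the result is also correct when c refers to the same object as a or b.
    static void AddStoresLast(const V3D &a, const V3D &b, V3D &c) {
        const float sx = a.x + b.x;
        const float sy = a.y + b.y;
        const float sz = a.z + b.z;
        c.x = sx;
        c.y = sy;
        c.z = sz;
    }
};
```

   Keeping the sums in locals tells the compiler that the stores do not have
   to happen between the additions; whether it uses that freedom is up to
   the compiler.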



 III) Second optimization (eax = &a, ebx = &b, ecx = &c)
      -------------------

     Source                     Clocks Penalties and Warnings
     fld      DWORD PTR [eax]   2      Exp_AGI_U_Pen:1
     fadd     DWORD PTR [ebx]   1+2
     fld      DWORD PTR [eax+4] 1
     fadd     DWORD PTR [ebx+4] 1+2
     fxch     st(1)
     fstp     DWORD PTR [ecx]   3      fst_Dep:1
     fld      DWORD PTR [eax+8] 1
     fadd     DWORD PTR [ebx+8] 1+2
     fxch     st(1)
     fstp     DWORD PTR [ecx+4] 2
     fstp     DWORD PTR [ecx+8] 3      fst_Dep:1

     above code                  15 (14) cycles
     + inits (base reg load)     ~17 cycles


 IV) Perfect (if you beat me, you get a beer) (eax = &a, ebx = &b, ecx = &c)
     ----------------------------------------

     This time no VTune output, but a more detailed description
     (sx = ax+bx, where ax is the x-component of vector a):

     command                  cycles  fpu-stack  comment

     fld  dword ptr [eax+0]   1       ax
     fadd dword ptr [ebx+0]   2-4     sx
     fld  dword ptr [eax+4]   3       ay sx      fld runs parallel to the fadd,
                                                 this is called overlapping
     fadd dword ptr [ebx+4]   4-6     sy sx
     fxch                     4       sx sy      fxch is the only pairing fpu-
                                                 command (in the v-pipe)

     fld  dword ptr [eax+8]   5       az sx sy
     fadd dword ptr [ebx+8]   6-8     sz sx sy
     fxch                     6       sx sz sy
     fstp dword ptr [ecx+0]   7-8     sz sy      fstp takes 2 cycles, no overlap
     fstp dword ptr [ecx+8]   10-11   sy         this fstp has to wait 1 cycle
                                                 (stall) because it can only
                                                 store values that were ready
                                                 in the preceding cycle !
                                                 And sz was ready in cycle 9
     fstp dword ptr [ecx+4]   12-13   empty


     above code:                   13 cycles
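
     With all the fxch's it is easy to store a value into the wrong
     component. The following sketch replays the instruction order of
     section IV on a toy model of the FPU register stack and returns what
     ends up in c. FpuStack and replay_section_iv are made-up names for this
     illustration; the model tracks values only, not cycles or pairing.

```cpp
#include <array>
#include <cassert>
#include <utility>
#include <vector>

// Toy model of the x87 register stack: back() of the vector is st(0).
class FpuStack {
public:
    void fld(float v) { regs_.push_back(v); }   // push a memory operand
    void fadd(float v) {                        // st(0) += memory operand
        assert(!regs_.empty());
        regs_.back() += v;
    }
    void fxch() {                               // fxch st(1): swap st(0) and st(1)
        assert(regs_.size() >= 2);
        std::swap(regs_[regs_.size() - 1], regs_[regs_.size() - 2]);
    }
    float fstp() {                              // store st(0) and pop it
        assert(!regs_.empty());
        float v = regs_.back();
        regs_.pop_back();
        return v;
    }
    bool empty() const { return regs_.empty(); }

private:
    std::vector<float> regs_;
};

// Replays the instruction order of section IV; vectors are {x, y, z}.
std::array<float, 3> replay_section_iv(const std::array<float, 3> &a,
                                       const std::array<float, 3> &b) {
    std::array<float, 3> c{};
    FpuStack st;
    st.fld(a[0]);  st.fadd(b[0]);   // sx
    st.fld(a[1]);  st.fadd(b[1]);   // sy sx
    st.fxch();                      // sx sy
    st.fld(a[2]);  st.fadd(b[2]);   // sz sx sy
    st.fxch();                      // sx sz sy
    c[0] = st.fstp();               // fstp [ecx+0], leaves sz sy
    c[2] = st.fstp();               // fstp [ecx+8], leaves sy
    c[1] = st.fstp();               // fstp [ecx+4], stack empty
    assert(st.empty());
    return c;
}
```

     For any a and b the returned array should hold a.x+b.x, a.y+b.y and
     a.z+b.z in that order, and the stack is empty at the end, as in the
     fpu-stack column above.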


 +---------+
 | RESULTS |
 +---------+

 Don't trust your compiler ! Always take a look at the generated code
 ( wdis.exe is a great tool for this task ).

 Even for this easy example both compilers totally failed (I've not tested
 DJGPP, but I don't think its code will be faster than Watcom's).

 Compiler : hand-optimized = 21:13 cycles, so the hand-optimized code needs
 ~38% fewer cycles (it runs about 1.6 times as fast).

 And don't forget this is just one example; there are many, many more. In
 some cases hand-optimized code is only 10-20% faster, but sometimes you can
 speed the code up by whole factors (2-3 is not rare).
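
 "Faster" can mean fewer cycles or a higher speed factor, and the two numbers
 differ. Here is a small sketch that computes both from the cycle counts of
 the four versions above (21, 16, 15 and 13). The constant and function names
 are made up for this illustration.

```cpp
// Cycle counts of the four versions, taken from the listings above.
constexpr int kCompilerCycles   = 21;  // I)   compiler output
constexpr int kStoresLastCycles = 16;  // II)  fstp's moved to the end
constexpr int kFxchCycles       = 15;  // III) with fxch
constexpr int kPerfectCycles    = 13;  // IV)  fully overlapped

// Percentage of cycles saved when going from 'before' to 'after'.
constexpr double cycles_saved_percent(int before, int after) {
    return 100.0 * (before - after) / before;
}

// Speed factor: how many times as fast 'after' runs compared to 'before'.
constexpr double speedup(int before, int after) {
    return static_cast<double>(before) / after;
}
```

 For 21 against 13 cycles this gives about 38% fewer cycles, which is the
 same as a speed factor of about 1.6.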


 +---------------------------------+
 | SOME GENERAL WORDS ON THE TOPIC |
 +---------------------------------+

 There are more and more badly optimized demos ! I can imagine two main
 reasons for this. First, there are many coders with an unconditional
 belief in their compiler's optimization quality, and second, some seem
 to rely on the power of their PII 400 MHz (which they got from Mom and Dad
 last Xmas). One of the BIGGEST mistakes programmers make is saying: "We
 don't need assembly language any more, because the compilers produce almost
 optimal code". Yes, compiler code can come close to perfectly hand-optimized
 assembly in some cases, but ONLY if the source is pre-optimized well enough
 to enable the compiler to generate fast code. And some coders don't even
 know how to optimize C++ source, not to mention assembly language. Some of
 them seem to forget that the compiler is just a front-end to the assembler.













