The program below measures the number of clock cycles a piece of code
takes. It executes the code to test 10 times and stores the 10 clock counts
in RESULTLIST. It can be used in both 16-bit and 32-bit mode on the
PPlain and PMMX:

ITER    EQU     10              ; number of iterations
OVERHEAD EQU    15              ; 15 for PPlain, 17 for PMMX

RDTSC   MACRO                   ; define RDTSC instruction
        DB      0FH,31H
ENDM

.DATA                           ; data segment
ALIGN   4
COUNTER DD      0               ; loop counter
TICS    DD      0               ; temporary storage of clock
RESULTLIST  DD  ITER DUP (0)    ; list of test results

.CODE                           ; code segment
BEGIN:  MOV     [COUNTER],0     ; reset loop counter
TESTLOOP:                       ; test loop
;****************   Do any initializations here:    ************************
        FINIT
;****************   End of initializations          ************************
        RDTSC                   ; read clock counter
        MOV     [TICS],EAX      ; save count
        CLD                     ; non-pairable filler
REPT    8
        NOP                     ; eight NOPs to avoid shadowing effect
ENDM

;****************   Put instructions to test here:  ************************
        FLDPI                   ; this is only an example
        FSQRT
        RCR     EBX,10
        FSTP    ST
;********************* End of instructions to test  ************************

        CLC                     ; non-pairable filler with shadow
        RDTSC                   ; read counter again
        SUB     EAX,[TICS]      ; compute difference
        SUB     EAX,OVERHEAD    ; subtract the clock cycles used by fillers etc
        MOV     EDX,[COUNTER]   ; loop counter
        MOV     [RESULTLIST][EDX],EAX   ; store result in table
        ADD     EDX,TYPE RESULTLIST     ; increment counter
        MOV     [COUNTER],EDX           ; store counter
        CMP     EDX,ITER * (TYPE RESULTLIST)
        JB      TESTLOOP                ; repeat ITER times

; insert here code to read out the values in RESULTLIST


The 'filler' instructions before and after the piece of code to test are
included in order to get consistent results on the PPlain.
The CLD is a non-pairable instruction which has been inserted to
make sure the pairing is the same the first time as the subsequent times.
The eight NOP instructions are inserted to prevent any prefixes in the code
to test from being decoded in the shadow of the preceding instructions on
the PPlain. Single-byte instructions are used here to obtain the same pairing
the first time as the subsequent times. The CLC after the code to test is
a non-pairable instruction which has a shadow under which the 0FH prefix
of the RDTSC can be decoded so that it is independent of any shadowing
effect from the code to test on the PPlain.

On the PMMX you may want to insert XOR EAX,EAX / CPUID before the
instructions to test if you want the FIFO instruction buffer to be
empty, or some time-consuming instruction (e.g. CLI or AAD) if you
want the FIFO buffer to be full.
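
As a sketch, the first of these could be placed at the top of the test
section as shown below. The CPUID macro is an assumption, needed only if
your assembler does not know the CPUID mnemonic (opcode 0FH,0A2H). Note that
CPUID modifies EAX, EBX, ECX and EDX, and that the clock cycles used by the
inserted instructions are counted together with the code to test, so they
have to be measured separately (with no code to test) and subtracted:

CPUID   MACRO                   ; define CPUID instruction (assumption:
        DB      0FH,0A2H        ;  the assembler lacks the mnemonic)
ENDM

;****************   Put instructions to test here:  ************************
        XOR     EAX,EAX         ; select CPUID function 0
        CPUID                   ; serialize to empty the FIFO buffer
        FLDPI                   ; the code to test follows from here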

On the PPro and PII you have to put in a serializing instruction like
CPUID before and after each RDTSC to prevent it from executing in parallel
with anything else. (CPUID is a serializing instruction which means that
it flushes the pipeline and waits for all pending operations to finish
before proceeding. This is useful for testing purposes. CPUID has no
shadow under which prefixes of subsequent instructions can decode.)
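
A minimal sketch of how the measuring part of the program above might be
changed for the PPro and PII. This is an illustration rather than a tested
program: the fillers for the PPlain are left out, TICS is reused to hold the
difference while the last CPUID executes (CPUID modifies EAX, EBX, ECX and
EDX), and the value of OVERHEAD is assumed to be found anew by running the
program with no instructions to test:

; CPUID can be defined as DB 0FH,0A2H in the same way as RDTSC
; if the assembler does not know the mnemonic
        XOR     EAX,EAX
        CPUID                   ; serialize before reading the clock
        RDTSC                   ; read clock counter
        MOV     [TICS],EAX      ; save count
        XOR     EAX,EAX
        CPUID                   ; serialize before the code to test

; put instructions to test here

        XOR     EAX,EAX
        CPUID                   ; wait for the code to test to finish
        RDTSC                   ; read counter again
        SUB     EAX,[TICS]      ; compute difference
        MOV     [TICS],EAX      ; save difference, CPUID destroys EAX
        XOR     EAX,EAX
        CPUID                   ; serialize after reading the clock
        MOV     EAX,[TICS]      ; get difference back
        SUB     EAX,OVERHEAD    ; subtract overhead (must be re-measured)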

The RDTSC instruction cannot execute in virtual mode on the PPlain and
PMMX, so if you are running DOS programs you must run in real mode. (Press
F8 while booting and select 'safe mode command prompt only' or 'bypass
startup files').

The Pentium processors have special performance monitor counters which can
count events such as cache misses, misalignments, AGI stalls, etc. Details
about how to use the performance monitor counters are not covered by this
manual but can be found in the MMX technology developer's manual.
