From:	CSBVAX::MRGATE!NED%YMIR.BITNET@CUNYVM.CUNY.EDU@SMTP 17-JAN-1988 01:18
To:	ARISIA::EVERHART
Subj:	Microcode problem on 8600 processors


Received: from CUNYVM.CUNY.EDU by KL.SRI.COM with TCP; Sat 16 Jan 88 21:36:54-PST
Received: from YMIR.BITNET by CUNYVM.CUNY.EDU ; Sun, 17 Jan 88 00:36:44 EST
Date: Sat, 16 Jan 88 21:35 PST
From: Ned Freed <NED%YMIR.BITNET@CUNYVM.CUNY.EDU>
Subject: Microcode problem on 8600 processors
To: info-vax@kl.sri.com
X-VMS-To: IN%"info-vax@kl.sri.com"

Recently I found a microcode bug on our VAX 8600. When we reported it to DEC
they provided us with a microcode update that fixed the problem immediately.
This bug is not especially esoteric, as the following example MACRO program
shows:

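        .entry start,^m<r2,r3>          ; entry mask saves R2 and R3

        movl    first,-(sp)             ; push first operand
        movl    second,r0               ; second operand into R0
        mulf2   r0,(sp)                 ; (sp) = (sp) * R0, underflows
        movl    (sp)+,r0                ; pop result into R0 (exit status)
        ret

first:  .float  1.0e-14
second: .float  1.0e-27

        .end    start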

The operation performed by the program is quite simple. One small floating
point value is pushed onto the stack and another is loaded into R0. The two
values are then multiplied, with the result stored on the stack. The multiply
underflows, so the result should be 0. This result is then popped off the
stack into R0 and the program returns. Since R0 supplies the exit status, the
net result should be a "NONAME-W-NOMSG, Message number 000000" message reported
as the program exits. And yes indeed, the program does just this on the
VAX-11/750, the uVAX-II and the VAX 8700.
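
For readers who want to check the arithmetic, here is a minimal Python sketch
of why the multiply underflows. It assumes the smallest positive F_floating
magnitude is 2**-128 (about 2.9e-39) and that underflow yields 0, as described
above. The function name is hypothetical, and the product is computed in IEEE
double, so F_floating rounding and overflow are not modeled.

```python
# Smallest positive F_floating magnitude (assumed: 2**-128, about 2.9e-39).
F_FLOATING_MIN = 2.0 ** -128


def f_floating_multiply(a, b):
    """Return (result, underflowed) for a rough model of an F_floating multiply.

    The product is formed in IEEE double, which has far more exponent range,
    and is then checked against the F_floating lower limit. Precision loss
    and overflow are deliberately ignored in this sketch.
    """
    product = a * b
    if product != 0.0 and abs(product) < F_FLOATING_MIN:
        return 0.0, True   # underflow: result forced to zero
    return product, False


if __name__ == "__main__":
    # The two constants from the MACRO program: 1e-14 * 1e-27 is about 1e-41.
    print(f_floating_multiply(1.0e-14, 1.0e-27))
```

The product is roughly 1e-41, well below the assumed limit, which is why the
value popped into R0 should be 0.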

But not on our 8600. On our 8600 the program returns a status value of 1, not
0! If you carefully single-step the program in the debugger the reason becomes
clear: for some reason the "mulf2 r0,(sp)" instruction DECREMENTS the stack
pointer by 4. Thus the following "movl (sp)+,r0" picks up some random value
off the stack, which turns out to be a 1.
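
The effect can be illustrated with a toy Python model of the three stack
operations. This is only a sketch under stated assumptions: the initial stack
address, the placeholder operand bits and the leftover longword of 1 are all
hypothetical, and it assumes the product is stored at the original (sp)
before the spurious decrement, which the debugger session does not establish.
The severity table reflects the low three bits of a VMS status value.

```python
# Severity encoded in the low three bits of a VMS condition value.
SEVERITY = {0: "warning", 1: "success", 2: "error",
            3: "informational", 4: "severe error"}


def severity_of(status):
    """Decode the severity field (low 3 bits) of a status longword."""
    return SEVERITY.get(status & 7, "reserved")


def run_program(spurious_decrement, leftover=1):
    """Model the three stack operations; return (r0, final_sp)."""
    sp = 0x1000                    # hypothetical initial stack pointer
    memory = {sp - 8: leftover}    # hypothetical stale longword on the stack

    sp -= 4                        # movl first,-(sp)
    memory[sp] = 0x12345678        # placeholder for the bits of 1.0e-14

    memory[sp] = 0                 # mulf2 r0,(sp): underflow stores zero
    if spurious_decrement:
        sp -= 4                    # the misbehaviour seen on the 8600

    r0 = memory[sp]                # movl (sp)+,r0
    sp += 4
    return r0, sp


if __name__ == "__main__":
    for buggy in (False, True):
        r0, sp = run_program(buggy)
        print(buggy, r0, severity_of(r0), hex(sp))
```

In this model the correct machine exits with status 0 (warning severity,
hence the NOMSG message), while the extra decrement makes the pop read the
stale longword and exit with status 1, which VMS treats as success.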

This problem has been verified on two different 8600s at different sites, both
under DEC maintenance, so don't assume that YOUR microcode is up to date. Our
8600 recently had a whole slew of hardware problems and almost every part of it
was replaced and checked, but the microcode was not updated until I found this
problem.

Here are a few additional technical points:

(1) Almost ANY change to the program will cause the problem to go away. For
    example, everything works fine if you add a "nop" just after the "mulf2", or
    remove the "movl (sp)+,r0", or do almost anything else.

(2) The problem appears not to be sensitive to the floating point values
    involved; anything that causes an underflow will cause the problem. The
    program works properly if the multiply does not underflow.

(3) The problem does not appear to exist when using floating point types other
    than F_floating.

(4) Despite the clear indications that the error has something to do with
    the handling of floating underflow in the 8600 pipeline, the error still
    manifests itself when single-stepping in the debugger. I find this
    especially strange.

(5) I have not tried this program on an 8650, and I would be very interested
    to find out if this problem exists on that CPU. In fact, I would appreciate
    receiving reports of the results people get when they run this program on
    their systems, regardless of what type of CPU they have.

This microcode error has been plaguing our local software for more than two
years, causing a whole series of access violations and divide-by-zero errors. I
have been looking for the cause off and on for quite a while, but it just
didn't occur to me that a microcode bug could be to blame!

I am somewhat upset that DEC knew about this problem and didn't see fit to
distribute a fix for it. It is quite conceivable that this problem could
manifest itself in such a way that a program would report no obvious errors but
would return erroneous results.

                                Ned Freed
                                ned@ymir.bitnet