From:    CSBVAX::MRGATE!NED%YMIR.BITNET@CUNYVM.CUNY.EDU@SMTP 17-JAN-1988 01:18
To:      ARISIA::EVERHART
Subj:    Microcode problem on 8600 processors

Received: from CUNYVM.CUNY.EDU by KL.SRI.COM with TCP; Sat 16 Jan 88 21:36:54-PST
Received: from YMIR.BITNET by CUNYVM.CUNY.EDU ; Sun, 17 Jan 88 00:36:44 EST
Date: Sat, 16 Jan 88 21:35 PST
From: Ned Freed
Subject: Microcode problem on 8600 processors
To: info-vax@kl.sri.com
X-VMS-To: IN%"info-vax@kl.sri.com"

Recently I found a microcode bug on our VAX 8600. When we reported it to
DEC they provided us with a microcode update that fixed the problem
immediately. This bug is not especially esoteric, as the following
example MACRO program shows:

        .entry  start,^m<>
        movl    first,-(sp)
        movl    second,r0
        mulf2   r0,(sp)
        movl    (sp)+,r0
        ret
first:  .float  1.0e-14
second: .float  1.0e-27
        .end    start

The operation performed by the program is quite simple. One small
floating point value is loaded onto the stack and another is loaded into
R0. These two values are then multiplied with the result stored on the
stack. This operation will underflow so the result should be 0. This
result is then popped off the stack into R0 and the program returns. The
net result should then be a "NONAME-W-NOMSG, Message number 000000"
message reported as the program exits.

And yes indeed, the program does just this on the VAX-11/750, the
uVAX-II and the VAX 8700. But not on our 8600.

On our 8600 the program returns a status value of 1 and not 0! If you
carefully single step the program in the debugger the reason for this
will become clear -- for some reason the "mulf2 r0,(sp)" instruction
DECREMENTS the stack pointer by 4. Thus you end up picking up some
random value off the stack that turns out to be a 1.

This problem has been verified on two different 8600s at different
sites, both under DEC maintenance, so don't assume that YOUR microcode
is up to date.
Our 8600 recently had a whole slew of hardware problems and almost every
part of it was replaced and checked, but the microcode was not updated
until I found this problem.

Here are a few additional technical points:

(1) Almost ANY change to the program will cause the problem to go away.
    For example, everything works fine if you add a "nop" just after
    the "mulf2", or remove the "movl (sp)+,r0", or do almost anything
    else.

(2) The problem appears not to be sensitive to the floating point values
    involved; anything that causes an underflow will cause the problem.
    The program works properly if the multiply does not underflow.

(3) The problem does not appear to exist when using floating point types
    other than F_floating.

(4) Despite the clear indications that the error has something to do
    with the handling of floating underflow in the 8600 pipeline, the
    error does manifest itself even when single stepping in the
    debugger. I think this is especially strange.

(5) I have not tried this program on an 8650, and I would be very
    interested to find out if this problem exists on that CPU. In fact,
    I would appreciate receiving reports of the results people get when
    they run this program on their systems, regardless of what type of
    CPU they have.

This hardware error has been plaguing our local software for more than
two years, causing a whole series of access violations and divide by
zero errors. I have been looking for the cause off and on for quite a
while, but it just didn't occur to me that a microcode bug could be to
blame!

I am somewhat upset that DEC knew about this problem and didn't see fit
to distribute a fix for it. It is quite conceivable that this problem
could manifest itself in such a way that a program would report no
obvious errors but would return erroneous results.

                                Ned Freed
                                ned@ymir.bitnet
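
Point (2) above says the bug shows up only when the F_floating multiply underflows. For anyone who wants to try other operand pairs, here is a rough sketch (not part of the original report) that estimates whether a product would fall below the F_floating range. The names F_MIN and would_underflow are made up for illustration, and Python's IEEE double arithmetic is used only to compare magnitudes, not to reproduce VAX rounding.

```python
# Sketch only: F_MIN and would_underflow are illustrative names, not part of
# any DEC tool. Python floats are IEEE doubles and are used here just to
# compare magnitudes against the F_floating range.

# Smallest positive F_floating magnitude: fraction 0.5 with biased exponent 1
# (excess-128), i.e. 0.5 * 2**(1 - 128) = 2**-128, roughly 2.9e-39.
F_MIN = 2.0 ** -128


def would_underflow(a: float, b: float) -> bool:
    """Return True if a * b is nonzero but too small for F_floating."""
    product = a * b
    return product != 0.0 and abs(product) < F_MIN


if __name__ == "__main__":
    # The two constants from the test program; the product is about 1e-41.
    print(would_underflow(1.0e-14, 1.0e-27))
    # A pair whose product (about 1e-30) is comfortably in range.
    print(would_underflow(1.0e-14, 1.0e-16))
```

A pair for which this reports an underflow should be a candidate for triggering the problem when substituted for "first" and "second" in the MACRO program; a pair that stays in range should behave normally, going by point (2).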