Is 3DNow! really better than the PII FPU? Will KNI blow 3DNow! away? Why are the FPUs of
alternative CPUs weaker? What is the best solution, both on paper and in the real
world? That is what this article is about. FPU performance is more hyped than ever: a K6-2
300 promises a potential peak performance of 1.2 GigaFLOPS, and a Katmai 500 will output 2
GigaFLOPS, if you believe AMD and Intel.
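Those vendor peak numbers are nothing more than operations per clock multiplied by clock speed. A quick sanity check, assuming (as the marketing does) 4 single-precision operations per clock for both 3DNow! and KNI:

```python
# Peak FLOPS = FP operations per clock * clock frequency.
# The 4 ops/clock figure is an assumption taken from the vendors'
# own peak claims, not a measured number.
def peak_gflops(ops_per_clock, mhz):
    return ops_per_clock * mhz / 1000.0

print(peak_gflops(4, 300))  # K6-2 300   -> 1.2 GigaFLOPS
print(peak_gflops(4, 500))  # Katmai 500 -> 2.0 GigaFLOPS
```

Real applications, as we will see, get nowhere near these peaks.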
About 10 years ago, nobody was interested in FPU performance; a fast 386 was all we wanted.
Nowadays, a CPU without a blazing fast FPU is considered almost worthless. Just a few years
ago - spring '96 - I remember that all the French magazines were excited about the 6x86. "The
Cyrix 6x86-166+ is state of the art, the end of the Intel kingdom," they said. A few months
later, benchmarks with Quake became popular and the 6x86 star was not shining anymore. Intel
was the way to go. 3D games became real FPU performance hogs because they didn't use sprites
anymore: they were calculating 3D objects in a three-dimensional space, as opposed to the
friendlier sprites (simply arrays of integers) of the past.
In spring '97 the K6 came out. Some people hurried to the store to buy a K6, only to return
it even faster and get a Pentium MMX instead. Verdict: Quake runs too slowly, so the K6
FPU is weak. VIVA INTEL. Or not? A gigantic discussion began on the Internet between hardware
gurus. Some benchmarks show that the K6 FPU is as fast as the PMMX's and almost as fast as
the PII's, but most benchmarks point to Intel's pipelined FPU as the king of floating-point
performance. Look at the benchmarks below.
            SiSoft Sandra FPU    ZD FPUmark
K6-233      132                  740
PMMX-233    130                  900
K6-2 300    167                  980
PII-300     160                  1540
Notice how the SiSoft Sandra FPU whetstone gives a slight advantage to the K6 and K6-2 FPUs,
while the ZD FPUmark claims that the PII FPU is 57% faster than the K6-2's, and the PMMX FPU
about 20% faster than the K6's, clock for clock. We will try to explain that.
It is clear that a good CPU nowadays needs a fast FPU. The 3D games market has exploded, even
real-time strategy games are beginning to use 3D, and people like to be creative with 3D
rendering software like Bryce, Truespace, and Simply 3D.
What is the best answer for all those FPU-hungry applications? A pipelined FPU, 3DNow!, KNI
or something else? Hardcore gamers and creative 3D animators: what do they need? Which FPU is
the answer to their prayers? Let's find out. Strap yourself in for a roller coaster ride of
CPU architecture, marketing hype and astonishing creativity.
Different companies, different strategies.
To raise FPU performance, semiconductor manufacturers developed different strategies:
Raise the frequency - more MHz. Pretty simple, huh?
Pipeline the FPU.
Lower the latencies of the FPU.
A SIMD FPU (Single Instruction, Multiple Data).
A parallel pipelined FPU (surprise, surprise).
The first is so simple we won't discuss it. The second was Intel's strategy for the
Pentium/PII FPU; the third was AMD's for the K6 FPU. The fourth is the path AMD chose with
the K6-2 and 3DNow!, and now Intel will do the same with KNI and the Katmai. And the fifth?
We discovered that one at the Microprocessor Forum in San Jose, California (K7). But first,
what is a pipelined FPU? And what about a lower latency FPU?
The Pipelined FPU, the Pentium/PII-FPU
What follows is a simplification, to clarify the FPU discussion for everybody who is
interested. We won't be 100% accurate, to avoid excessive confusion.
You might recall from your mathematics course that not all calculations can be done in one
step. Adding and subtracting numbers can be done in one step, but a complex algorithm is
required to find the square root of a random number, unless you know all square roots by
heart (just kidding). So it takes several steps to compute a square root, and the same is
true for a simpler calculation such as dividing one number by another. The same holds when an
FPU is crunching those decimal numbers: in fact, even simple calculations like addition or
subtraction require at least 2 clock cycles.
A pipeline is like an assembly line. You have several stages, and the result of one pipeline
stage is passed on to the next until you have the final result. The intermediate result of
each step is used to calculate the next step's result when you are, for example, dividing or
taking a square root. But the story doesn't end there. You will notice the big advantage of a
pipelined FPU when similar FP calculations are done one after another. See figure 1.
Figure 1: a 3-stage pipeline that is kept filled

          Stage 1      Stage 2      Stage 3
Clock 1   Operand 13   Operand 12   Operand 11
Clock 2   Operand 14   Operand 13   Operand 12
Clock 3   Operand 15   Operand 14   Operand 13
Notice that each clock, an operand (a number being calculated on) enters stage 3, so after
each clock a final result comes out. In other words, in the ideal situation - every stage of
the pipeline is always filled - the FPU can deliver one result each clock cycle, regardless
of how complex the calculation is. This translates very roughly into 1 FLOP per Hz of CPU
clock. Almost, because if we throw only 3 operands at the FPU, we see that the average is
only 0.6 FLOPs per clock: three results in five clocks. See figure 2.

Figure 2: the same 3-stage pipeline fed only 3 operands

          Stage 1      Stage 2      Stage 3
Clock 1   Operand 1    -            -
Clock 2   Operand 2    Operand 1    -
Clock 3   Operand 3    Operand 2    Operand 1
Clock 4   -            Operand 3    Operand 2
Clock 5   -            -            Operand 3
You see? To take full advantage of a pipelined FPU, a programmer must try to group similar
FP calculations together! The programmer has to optimize or "schedule" the code.
If you issue your FP calculations ungrouped or in the wrong order, the pipelined FPU won't be
much help. In that case, an FP calculation that normally takes 3 clock cycles on the same
FPU without pipelining will also take 3 clocks on a pipelined FPU. If you fire 1000 of the
same FPU calculations one after another, you will approach a peak performance of 1 FLOP per
Hz. The better the pipeline is designed, the closer you will get to that 1 FLOP.
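The arithmetic behind the figures can be sketched in a few lines. This is a toy model, not a real FPU: a 3-stage pipeline that accepts one new independent operand per clock, so the total time is the fill time plus one clock per remaining result.

```python
# Toy model of an n-stage pipelined FPU fed independent operations:
# a new operand enters stage 1 every clock, the first result appears
# after `stages` clocks, and one more result appears each clock after.
def clocks_needed(stages, n_ops):
    return stages + n_ops - 1

def throughput(stages, n_ops):
    # average results per clock
    return n_ops / clocks_needed(stages, n_ops)

print(throughput(3, 3))      # short burst: 3 results in 5 clocks = 0.6
print(throughput(3, 1000))   # long stream: approaches 1 result per clock
```

A short burst of 3 operands averages 0.6 results per clock, while a long stream of 1000 gets within a fraction of a percent of the ideal 1 per clock, which is exactly why long runs of similar, independent FP calculations are what a pipelined FPU loves.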
The PII also has two separate units: one fully pipelined for addition and subtraction, and
the other partially pipelined for all the other operations. To take advantage of this
parallelism a programmer must carefully "schedule" the FPU instructions - one addition, then
one multiplication, for example. This dual FPU unit does not exist in the PMMX. So the PII-
300's super-pipelined FPU has a potential peak performance of 300 MFLOPS. A PMMX-300 with a
simpler FPU would not get as close to those 300 MFLOPS as a PII does.
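The payoff of interleaving can be shown with another toy model. The numbers here are assumptions for illustration, not real PII timings: the add unit accepts a new op every clock (fully pipelined), the other unit only every 2 clocks (partially pipelined), and one instruction can be issued per clock.

```python
# Toy issue model of two FP units working in parallel.
# Assumed repeat rates: add unit accepts an op every clock,
# the "everything else" unit only every 2 clocks.
def issue_clocks(schedule):
    interval = {"add": 1, "mul": 2}
    free = {"add": 0, "mul": 0}   # clock at which each unit can accept again
    t = 0
    for op in schedule:
        t = max(t, free[op])      # stall until the needed unit is ready
        free[op] = t + interval[op]
        t += 1                    # one issue slot per clock
    return t                      # clock after the last op was issued

grouped     = ["add"] * 4 + ["mul"] * 4   # all adds, then all muls
interleaved = ["add", "mul"] * 4          # scheduled alternately
print(issue_clocks(grouped), issue_clocks(interleaved))
```

Under these assumptions the grouped version needs 11 clocks to issue while the interleaved one needs only 8: alternating adds and multiplies hides the slower unit's stalls behind the fast adder.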
Why doesn't everybody go for the Pipelined FPU? Are there disadvantages?
After reading the impressive specifications of the PII, you have every right to ask yourself
why AMD, Cyrix and others don't have a pipelined FPU.
Well, two reasons:
Most compilers in '96/'97 could not optimize the FPU code they generated well enough for the
PPro/PII. (A compiler generates machine code from a program written in a higher-level
language like C++, Pascal, etc.) So if a programmer wanted very fast FPU performance on the
PII, hand optimization was required. The guys from id, for example, boosted performance in
Quake and Quake II that way. Coding in assembler is very time consuming and hard to learn and
master, but since Intel is by far the market leader and 3D game developers want the best
quality and performance, they did the job.
To execute one simple calculation like an add, multiply or subtract, the PII's pipelined FPU
takes 3 to 5 clocks. We'll see that an FPU can do better than that, and those 3 instructions
are used very often.
The low latency FPU, the K6-FPU
Now, the solution from AMD. Remember that FP calculations like square root and divide can't
be done in one step - or, since we are talking about CPUs here, in one clock. Take the
example we used when explaining the pipelined FPU: it takes 3 clock cycles to calculate the
final result. Instead of spending 3 clock cycles to get the result, let us modify the CPU so
it can produce that final result in 2 clock cycles. We are lowering the latencies. The most
important FPU instructions (add, subtract, multiply) need 2 clock cycles on the K6 FPU, while
they take 3-5 on the PII. So do you now understand what AMD means by a low latency FPU?
Why doesn't AMD develop a low-latency but pipelined FPU? Maybe some day a genius will
invent such an FPU, but it is very difficult. You want proof? Well, Intel had to raise
latencies in order to build a good pipelined FPU. It is not easy to pipeline and lower
latencies (fewer clock cycles per instruction, remember?) at the same time.
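The trade-off can be made concrete with a last toy model, assuming a K6-style FPU (latency 2, unpipelined) against a PII-style one (latency 3, pipelined). When every operation depends on the previous result, the pipeline can't overlap anything and only latency counts; when the operations are independent, throughput counts.

```python
# Dependent chain: each op waits for the previous result,
# so pipelined or not, time = n_ops * latency.
def dependent_chain(n_ops, latency):
    return n_ops * latency

# Independent stream on a pipelined FPU: one op enters per clock.
def pipelined_stream(n_ops, latency):
    return n_ops + latency - 1

# Independent stream on an unpipelined FPU: one op per `latency` clocks.
def unpipelined_stream(n_ops, latency):
    return n_ops * latency

# K6-style: latency 2, unpipelined.  PII-style: latency 3, pipelined.
print(dependent_chain(1000, 2), dependent_chain(1000, 3))      # 2000 vs 3000 clocks
print(unpipelined_stream(1000, 2), pipelined_stream(1000, 3))  # 2000 vs 1002 clocks
```

On dependent chains the low-latency K6 wins (2000 vs 3000 clocks); on independent streams the pipelined PII wins by nearly a factor of two (1002 vs 2000 clocks). Which benchmark you run decides which FPU looks faster.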
Now we can figure out why we got those contradictory results from the two benchmarks. SiSoft
Sandra is not optimized for the pipelined FPU, so the K6 gets a slight advantage because it
takes fewer clock cycles per FP calculation. The Ziff-Davis synthetic floating-point
benchmark takes full advantage of the dual pipelined FPU.
So, what about Cyrix?
Even if programmers didn't optimize for the PII, the Cyrix FPU would still be slower. Cyrix
has somewhat ignored FPU performance: the most important FPU instructions (add, subtract and
multiply, remember?) take 4 clock cycles to complete. That is much worse than the K6, and a
bit worse than the PII as far as latencies are concerned. In addition, the Cyrix FPU is
unpipelined. Yet the biggest problem lies with the 6x86/M2 clock speed. Any floating-point
unit requires significantly more time for even very simple operations than the
...