Is 3DNow! really better than the PII FPU? Will KNI blow 3DNow! away? Why are the FPUs of
alternative CPUs weaker? What is the best solution, both on paper and in the real
world? That is what this article is about. FPU performance is more hyped than ever: a K6-2
300 promises a potential peak performance of 1.2 GigaFLOPS, and a Katmai 500 will output 2
GigaFLOPS, if you believe AMD and Intel.
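Those vendor peak numbers are nothing more than operations per clock multiplied by clock speed. A quick sanity check, assuming (as the marketing does) 4 single-precision operations per clock for both 3DNow! and KNI:

```python
# Peak FLOPS = FP operations per clock * clock frequency.
# The 4 ops/clock figure is an assumption taken from the vendors'
# own peak claims, not a measured number.
def peak_gflops(ops_per_clock, mhz):
    return ops_per_clock * mhz / 1000.0

print(peak_gflops(4, 300))  # K6-2 300   -> 1.2 GigaFLOPS
print(peak_gflops(4, 500))  # Katmai 500 -> 2.0 GigaFLOPS
```

Real applications, as we will see, get nowhere near these peaks.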
About 10 years ago, nobody was interested in FPU performance; a fast 386 was all we wanted.
Nowadays, a CPU without a blazing fast FPU is considered almost worthless. Just a few years
ago - spring '96 - I remember that all the French magazines were excited about the 6x86. "The
Cyrix 6x86-166+ is state of the art, the end of the Intel kingdom," they said. A few months
later, benchmarks with Quake became popular and the 6x86 star was not shining anymore. Intel
was the way to go. 3D games became real FPU performance hogs because they didn't use sprites
anymore: they were calculating 3D objects in a three-dimensional space, as opposed to the
friendlier sprites (simply arrays of integers) of the past.
In spring '97 the K6 came out. Some people hurried to the store to buy a K6, only to return
it even faster and get a Pentium MMX instead. Verdict: Quake runs too slowly, so the K6
FPU is weak. VIVA INTEL. Or not? A gigantic discussion began on the Internet between hardware
gurus. Some benchmarks show that the K6 FPU is as fast as the PMMX's and almost as fast as
the PII's, but most benchmarks point to Intel's pipelined FPU as the king of floating-point
performance. Look at the benchmarks below.
            SiSoft Sandra FPU    ZD FPUmark
K6-233      132                  740
PMMX-233    130                  900
K6-2 300    167                  980
PII-300     160                  1540
Notice how the SiSoft Sandra FPU whetstone gives a slight advantage to the K6 and K6-2 FPUs,
while the ZD FPUmark claims that the PII FPU is 57% faster than the K6-2's, and the PMMX FPU
about 20% faster than the K6's, clock for clock. We will try to explain that.
It is clear that a good CPU nowadays needs a fast FPU. The 3D games market has exploded, even
real-time strategy games are beginning to use 3D, and people like to be creative with 3D
rendering software like Bryce, Truespace, and Simply 3D.
What is the best answer for all those FPU-hungry applications? A pipelined FPU, 3DNow!, KNI
or something else? Hardcore gamers and creative 3D animators: what do they need? Which FPU is
the answer to their prayers? Let's find out. Strap yourself in for a roller coaster ride of
CPU architecture, marketing hype and astonishing creativity.
Different companies, different strategies.
To raise FPU performance, semiconductor manufacturers developed different strategies:
Raise the frequency - more MHz. Pretty simple, huh?
Pipeline the FPU.
Lower the latencies of the FPU.
A SIMD FPU (Single Instruction, Multiple Data).
A parallel pipelined FPU (surprise, surprise).
The first is so simple we won't discuss it. The second was Intel's strategy for the
Pentium/PII FPU; the third was AMD's for the K6 FPU. The fourth is the path AMD chose with
the K6-2 and 3DNow!, and now Intel will do the same with KNI and the Katmai. And the fifth?
We discovered that one at the Microprocessor Forum in San Jose, California (K7). But first,
what is a pipelined FPU? And what about a lower latency FPU?
The Pipelined FPU, the Pentium/PII-FPU
What follows is a simplification, to clarify the FPU discussion for everybody who is
interested. We won't be 100% accurate, to avoid excessive confusion.
You might recall from your mathematics course that not all calculations can be done in one
step. Adding and subtracting numbers can be done in one step, but a complex algorithm is
required to find the square root of a random number, unless you know all square roots by
heart (just kidding). So it takes several steps to compute a square root, and the same is
true for a simpler calculation such as dividing one number by another. The same holds when an
FPU is crunching those decimal numbers: in fact, even simple calculations like addition or
subtraction require at least 2 clock cycles.
A pipeline is like an assembly line. You have several stages, and the result of one pipeline
stage is passed on to the next until you have the final result. The intermediate result of
each step is used to calculate the next step's result when you are, for example, dividing or
taking a square root. But the story doesn't end there. You will notice the big advantage of a
pipelined FPU when similar FP calculations are done one after another. See figure 1.
Figure 1: a 3-stage pipeline that is kept filled

          Stage 1      Stage 2      Stage 3
Clock 1   Operand 13   Operand 12   Operand 11
Clock 2   Operand 14   Operand 13   Operand 12
Clock 3   Operand 15   Operand 14   Operand 13
Notice that each clock, an operand (a number being calculated on) enters stage 3, so after
each clock a final result comes out. In other words, in the ideal situation - every stage of
the pipeline is always filled - the FPU can deliver one result each clock cycle, regardless
of how complex the calculation is. This translates very roughly into 1 FLOP per Hz of CPU
clock. Almost, because if we throw only 3 operands at the FPU, we see that the average is
only 0.6 FLOPs per clock: three results in five clocks. See figure 2.

Figure 2: the same 3-stage pipeline fed only 3 operands

          Stage 1      Stage 2      Stage 3
Clock 1   Operand 1    -            -
Clock 2   Operand 2    Operand 1    -
Clock 3   Operand 3    Operand 2    Operand 1
Clock 4   -            Operand 3    Operand 2
Clock 5   -            -            Operand 3
You see? To take full advantage of a pipelined FPU, a programmer must try to group similar
FP calculations together! The programmer has to optimize or "schedule" the code.
If you issue your FP calculations ungrouped or in the wrong order, the pipelined FPU won't be
much help. In that case, an FP calculation that normally takes 3 clock cycles on the same
FPU without pipelining will also take 3 clocks on a pipelined FPU. If you fire 1000 of the
same FPU calculations one after another, you will approach a peak performance of 1 FLOP per
Hz. The better the pipeline is designed, the closer you will get to that 1 FLOP.
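The arithmetic behind the figures can be sketched in a few lines. This is a toy model, not a real FPU: a 3-stage pipeline that accepts one new independent operand per clock, so the total time is the fill time plus one clock per remaining result.

```python
# Toy model of an n-stage pipelined FPU fed independent operations:
# a new operand enters stage 1 every clock, the first result appears
# after `stages` clocks, and one more result appears each clock after.
def clocks_needed(stages, n_ops):
    return stages + n_ops - 1

def throughput(stages, n_ops):
    # average results per clock
    return n_ops / clocks_needed(stages, n_ops)

print(throughput(3, 3))      # short burst: 3 results in 5 clocks = 0.6
print(throughput(3, 1000))   # long stream: approaches 1 result per clock
```

A short burst of 3 operands averages 0.6 results per clock, while a long stream of 1000 gets within a fraction of a percent of the ideal 1 per clock, which is exactly why long runs of similar, independent FP calculations are what a pipelined FPU loves.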
The PII also has two separate units: one fully pipelined for addition and subtraction, and
the other partially pipelined for all the other operations. To take advantage of this
parallelism a programmer must carefully "schedule" the FPU instructions - one addition, then
one multiplication, for example. This dual FPU unit does not exist in the PMMX. So the PII-
300's super-pipelined FPU has a potential peak performance of 300 MFLOPS. A PMMX-300 with a
simpler FPU would not get as close to those 300 MFLOPS as a PII does.
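The payoff of interleaving can be shown with another toy model. The numbers here are assumptions for illustration, not real PII timings: the add unit accepts a new op every clock (fully pipelined), the other unit only every 2 clocks (partially pipelined), and one instruction can be issued per clock.

```python
# Toy issue model of two FP units working in parallel.
# Assumed repeat rates: add unit accepts an op every clock,
# the "everything else" unit only every 2 clocks.
def issue_clocks(schedule):
    interval = {"add": 1, "mul": 2}
    free = {"add": 0, "mul": 0}   # clock at which each unit can accept again
    t = 0
    for op in schedule:
        t = max(t, free[op])      # stall until the needed unit is ready
        free[op] = t + interval[op]
        t += 1                    # one issue slot per clock
    return t                      # clock after the last op was issued

grouped     = ["add"] * 4 + ["mul"] * 4   # all adds, then all muls
interleaved = ["add", "mul"] * 4          # scheduled alternately
print(issue_clocks(grouped), issue_clocks(interleaved))
```

Under these assumptions the grouped version needs 11 clocks to issue while the interleaved one needs only 8: alternating adds and multiplies hides the slower unit's stalls behind the fast adder.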
Why doesn't everybody go for the Pipelined FPU? Are there disadvantages?
After reading the impressive specifications of the PII, you have every right to ask yourself
why AMD, Cyrix and others don't have a pipelined FPU.
Well, two reasons:
Most compilers in '96/'97 could not optimize the FPU code they generated well enough for the
PPro/PII. (A compiler generates machine code from a program written in a higher-level
language like C++, Pascal, etc.) So if a programmer wanted very fast FPU performance on the
PII, hand optimization was required. The guys from id, for example, boosted performance in
Quake and Quake II that way. Coding in assembler is very time consuming and hard to learn and
master, but since Intel is by far the market leader and 3D game developers want the best
quality and performance, they did the job.
To execute one simple calculation like an add, multiply or subtract, the PII's pipelined FPU
takes 3 to 5 clocks. We'll see that an FPU can do better than that, and those 3 instructions
are used very often.
The low latency FPU, the K6-FPU
Now, the solution from AMD. Remember that FP calculations like square root and divide can't
be done in one step - or, since we are talking about CPUs here, in one clock. Take the
example we used when explaining the pipelined FPU: it takes 3 clock cycles to calculate the
final result. Instead of spending 3 clock cycles to get the result, let us modify the CPU so
it can produce that final result in 2 clock cycles. We are lowering the latencies. The most
important FPU instructions (add, subtract, multiply) need 2 clock cycles on the K6 FPU, while
they take 3-5 on the PII. So do you now understand what AMD means by a low latency FPU?
Why doesn't AMD develop a low-latency but pipelined FPU? Maybe some day a genius will
invent such an FPU, but it is very difficult. You want proof? Well, Intel had to raise
latencies in order to build a good pipelined FPU. It is not easy to pipeline and lower
latencies (fewer clock cycles per instruction, remember?) at the same time.
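The trade-off can be made concrete with a last toy model, assuming a K6-style FPU (latency 2, unpipelined) against a PII-style one (latency 3, pipelined). When every operation depends on the previous result, the pipeline can't overlap anything and only latency counts; when the operations are independent, throughput counts.

```python
# Dependent chain: each op waits for the previous result,
# so pipelined or not, time = n_ops * latency.
def dependent_chain(n_ops, latency):
    return n_ops * latency

# Independent stream on a pipelined FPU: one op enters per clock.
def pipelined_stream(n_ops, latency):
    return n_ops + latency - 1

# Independent stream on an unpipelined FPU: one op per `latency` clocks.
def unpipelined_stream(n_ops, latency):
    return n_ops * latency

# K6-style: latency 2, unpipelined.  PII-style: latency 3, pipelined.
print(dependent_chain(1000, 2), dependent_chain(1000, 3))      # 2000 vs 3000 clocks
print(unpipelined_stream(1000, 2), pipelined_stream(1000, 3))  # 2000 vs 1002 clocks
```

On dependent chains the low-latency K6 wins (2000 vs 3000 clocks); on independent streams the pipelined PII wins by nearly a factor of two (1002 vs 2000 clocks). Which benchmark you run decides which FPU looks faster.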
Now we can figure out why we got those contradictory results from the two benchmarks. SiSoft
Sandra is not optimized for the pipelined FPU, so the K6 gets a slight advantage because it
takes fewer clock cycles per FP calculation. The Ziff-Davis synthetic floating-point
benchmark takes full advantage of the dual pipelined FPU.
So, what about Cyrix?
Even if programmers didn't optimize for the PII, the Cyrix FPU would still be slower. Cyrix
has somewhat ignored FPU performance: the most important FPU instructions (add, subtract and
multiply, remember?) take 4 clock cycles to complete. That is much worse than the K6, and a
bit worse than the PII as far as latencies are concerned. In addition, the Cyrix FPU is
unpipelined. Yet the biggest problem lies with the 6x86/M2 clock speed. Any floating-point
unit requires significantly more time for even very simple operations than the
...