
Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)

I'm curious about one thing, and I don't know if it's been discussed before.

AMD states that Zen4 has 3x FP Pipes and Zen5 has 4x FP Pipes.

Could it be that:

Zen4 has 3x 256-bit and only 2x256-bit for AVX512, resulting in 1x512-bit.

Zen5 has 4x 256-bit and therefore 4x256-bit for AVX512, resulting in 2x512-bit.
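The guess above boils down to simple arithmetic: if each pipe is 256 bits wide and pipes have to be paired up for one 512-bit operation, the number of 512-bit operations per cycle is the pipe count divided by two, rounded down. A minimal sketch, assuming that pairing model (it is only the hypothesis in this post, not something AMD states, and `ops_512_per_cycle` is a made-up helper):

```python
def ops_512_per_cycle(num_pipes: int, pipe_width_bits: int = 256) -> int:
    """512-bit ops per cycle if narrower pipes must be ganged together.

    Assumption (the hypothesis in this post, not an AMD statement):
    pipes are combined in whole groups, so a leftover pipe goes unused.
    """
    pipes_per_op = max(1, 512 // pipe_width_bits)
    return num_pipes // pipes_per_op


print("3 pipes (Zen4 slide figure):", ops_512_per_cycle(3))
print("4 pipes (Zen5 slide figure):", ops_512_per_cycle(4))
```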


 
AMD states that Zen4 has 3x FP Pipes and Zen5 has 4x FP Pipes.
AMD loves to have typos in official materials.

Zen 4 has 4 256b FP execution units, 2 of which can do FMA.

Zen 5 DT/Server has 4 512b FP execution units, 2 of which can do FMA. (this includes dense versions)

Zen 5 Mobile (Strix Point/Krackan) has 4 256b FP execution units, 2 of which can do FMA. (this includes dense versions).

There are subtle differences about the shuffles, you can read Mystical's Teardown and / or official instruction tables / Agner Fog's core characterisations to stitch this info together.
 
I know this version very well, but I have a question about FP Pipes. I was even starting to wonder if it wasn't a bug.

So what do 3x FP Pipes in Zen4 and 4x FP Pipes in Zen5 mean?

Edit:
It's possible that Zen4 has 3 FP Pipes, but can only combine 2 FP Pipes into a single 512-bit AVX512.

Zen5 has 4 FP Pipes with the ability to combine all 4 FP Pipes into 2x512-bit AVX512.

Remember how Agner Fog kept saying that Skylake already had a 5-wide decoder? No Intel diagrams confirmed this. They even showed it was still 4-wide. Although it must be admitted that Agner himself claimed that lab tests showed 4-wide. It was even assumed that SunnyCove also had a 5-wide decoder, when in reality, it physically has 4-wide.

EDIT2:
Which AMD materials clearly show/state that Zen4 has a physical 4x256 bit? And which AMD materials confirm that Zen5 has a physical 4x512 bit?

Just because Zen5 can execute two 512-bit operations of one type and two 512-bit operations of another type, but only 2x512 at a time, doesn't mean it has a physical FP width of 4x512 bits.

Similarly, just because Zen4 can execute 2x256 bits of one type and 2x256 bits of another type, but only 2x256 at a time, doesn't mean it has a physical FP width of 4x256 bits.

EDIT3:
Agner Fog
Vector instructions and floating point instructions can execute at a rate of two vector additions, two vector multiplications, and two vector read or write instructions simultaneously per clock cycle. All vector units have full 512 bits capabilities except for memory writes. A 512-bit vector write instruction is executed as two 256-bit writes.

Integer memory operations can execute at a rate of four reads per clock cycle or two reads and two writes. Floating point and vector memory operations can execute at the rate of two reads or writes per clock cycle, except for 512-bit writes.
Assuming AMD is correct on this slide, it stands to reason that even if the three FP pipelines in Zen4 were 256-bit, they wouldn't be able to be combined into more than 1x512.

Zen5, on the other hand, by gaining four FP pipelines, can combine all four into 2x512.
 
Which AMD materials clearly show/state that Zen4 has a physical 4x256 bit?
Software Optimization Guide for the AMD Zen4 Microarchitecture, publication number 57647

Figure 6 below shows a basic diagram of the floating point unit and how it interfaces with the other units in the processor. Notice that there are four execution pipes (0 through 3) that can execute an operation every cycle. (...)
Because the data paths are 256 bits wide, the scheduler uses two consecutive cycles to issue a 512-bit operation.

And which AMD materials confirm that Zen5 has a physical 4x512 bit?
Software Optimization Guide for the AMD Zen5 Microarchitecture, publication number 58455

Notice that there are four execution pipes (0 through 3) that can execute an operation every cycle. (...) The floating-point unit supports AVX-512 with full 512-bit data paths and operations.

if you want to see pictures, google both docs.
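To make the difference between the two guides concrete, here is a rough sketch of how many cycles a single pipe spends issuing a stream of 512-bit operations. It assumes only what the quotes say (a 256-bit data path takes two consecutive cycles per 512-bit op, a 512-bit data path takes one) and ignores latency, dependencies, and scheduling; `issue_cycles` is a made-up helper, not anything from the guides.

```python
import math


def issue_cycles(num_ops: int, op_width_bits: int, datapath_bits: int) -> int:
    """Cycles one pipe spends issuing num_ops operations back to back."""
    # A 512-bit op on a 256-bit data path is issued in two passes.
    passes_per_op = math.ceil(op_width_bits / datapath_bits)
    return num_ops * passes_per_op


# 8 independent 512-bit ops on a single pipe
zen4 = issue_cycles(8, 512, 256)  # 256-bit data path (Zen4 guide)
zen5 = issue_cycles(8, 512, 512)  # 512-bit data path (Zen5 guide)
print("Zen4 cycles:", zen4, "Zen5 cycles:", zen5)
```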

Just because Zen5 can execute two 512-bit operations of one type and two 512-bit operations of another type, but only 2x512 at a time, doesn't mean it has a physical FP width of 4x512 bits.

Similarly, just because Zen4 can execute 2x256 bits of one type and 2x256 bits of another type, but only 2x256 at a time, doesn't mean it has a physical FP width of 4x256 bits.

Your misunderstanding stems from the fact that only two of the pipes are capable of FMA, not that only two can execute at a time. So you can do 2 FMAs per cycle, but in the same cycle you can also do 2 ADDs or whatever else does not sit on the same pipes that FMA does.
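As a back-of-the-envelope illustration of that point, the sketch below adds up peak FP32 operations per cycle when two pipes do FMA and the other two do ADD in the same cycle. The pipe counts and widths are the ones given in this thread; the 2 FMA + 2 ADD mix and counting an FMA as two operations are assumptions for illustration, and real code rarely gets near this peak.

```python
def peak_fp32_flops_per_cycle(datapath_bits: int,
                              fma_pipes: int = 2,
                              add_pipes: int = 2) -> int:
    """Peak FP32 ops/cycle with FMA and ADD pipes all busy at once."""
    lanes = datapath_bits // 32        # FP32 lanes per pipe per cycle
    fma_flops = fma_pipes * lanes * 2  # FMA counted as multiply + add
    add_flops = add_pipes * lanes      # plain ADD is one op per lane
    return fma_flops + add_flops


print("256b pipes (Zen 4, Zen 5 mobile):", peak_fp32_flops_per_cycle(256))
print("512b pipes (Zen 5 DT/server):", peak_fp32_flops_per_cycle(512))
```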
 
Physical vs. Logical Analysis:

Logically: The architecture can accept a 512-bit µop operation and "execute" it in each of the four FP pipelines → from an ISA perspective, each pipeline is a "full 512-bit" pipeline.

Physically: Manufacturers often implement such wide data paths as combinations of narrower units, e.g., two 256-bit or four 128-bit paths.

Arguments: Transistor cost, power consumption, and die size increase significantly with true 4 x 512-bit FP units.

The FP memory limit of 1/cycle, despite having two pipelines, suggests that the 512-bit FP ports are not fully independent for some operations.

Conclusion:

Each of the four FP ports can logically execute a 512-bit operation per cycle.

There's no guarantee that each port physically has its own independent 512-bit hardware. These might be shared bandwidth or combinations of narrower units that together support 512-bit.

Is it possible to execute 2xFADD(512b) + 2xFMA(512b) at the same time in Zen5? Or maybe FADD and FMA share the same registers, e.g., 4x256bit, and it is not possible to execute 2x512+2x512. Is only 1xFADD 512 + 1xFMA 512 possible?
 
I know that. I just wanted to know what AMD meant when they mentioned on the slide that Zen4 has 3xFP Pipes. Maybe it just looked better for marketing purposes, and since most people won't notice anyway, what difference does it make? :smile:
 
So it's possible that these 3FP Pipes for Zen4 and 4FP Pipes for Zen5 are a typo?
You can tell the real AMD marketing materials from the fakes by the fact that the authentic AMD materials are always completely full of errors.

Like, there are theories that AMD uses intentional typos to catch sources of leaks (by giving everyone versions with different typos), because the errors are so common. I think they are just really sloppy and don't care.
 
I'm just going to copy paste my reddit comment here lol but:
When taking the geometric mean of 555 benchmarks in total, the AMD EPYC 9745 was delivering about 90% the performance of its 128 core EPYC 9755 sibling. Considering that the EPYC 9745 has just 80% the TDP rating of the EPYC 9755, that's a nice showing for the EPYC 9745 with its dense cores.
Is it just me, or is this not a nice showing? Perf doesn't scale linearly with power, and 80% the power draw for 90% the perf sounds achievable for a product on the same node.

Even for the more embarrassingly multithreaded benchmarks, such as Blender, you see the 9745 getting 20-25% better perf/watt, however a good chunk of that can also easily be attributed to simply consuming way less power and thus getting better perf/watt from that.

For example, assuming each core in the 9745 is maybe getting ~2 watts, while each core in the 9755 is getting ~3 watts, that alone would provide an almost 30% perf/watt advantage to the 9745... if it was using the exact same N4P CCDs as the 9755.

It seems to me as if, per core, Zen 5C on N3E in Turin Dense is not any more efficient than Zen 5 on N4P in regular Turin.

Ofc, the only way to really confirm this would be to have someone test the two chips at the same power draw.

From: https://www.phoronix.com/review/amd-epyc-9745-9755/9
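The arithmetic behind this argument can be sketched as follows. It uses only the figures quoted above (90% of the performance at 80% of the TDP, and the guessed ~2 W vs ~3 W per core) and assumes a simple power law, perf ~ power^alpha. That model is purely illustrative and not something Phoronix or AMD states, and it treats TDP as actual power draw.

```python
import math


def perf_per_watt_gain(perf_ratio: float, power_ratio: float) -> float:
    """Relative perf/watt of chip A vs chip B (1.0 means equal)."""
    return perf_ratio / power_ratio


def implied_exponent(perf_ratio: float, power_ratio: float) -> float:
    """alpha in perf ~ power**alpha that would explain the two ratios."""
    return math.log(perf_ratio) / math.log(power_ratio)


# 9745 vs 9755 from the geomean: 90% of the perf at 80% of the TDP
gain = perf_per_watt_gain(0.90, 0.80)
alpha = implied_exponent(0.90, 0.80)
print(f"perf/watt gain: {gain:.3f}x, implied alpha: {alpha:.2f}")

# Identical silicon run at 2 W vs 3 W per core, using that same alpha
same_node_gain = (2 / 3) ** (alpha - 1)
print(f"same-CCD perf/watt gain at 2 W vs 3 W: {same_node_gain:.2f}x")
```

Under this toy model, a sub-linear exponent alone gives the lower-power part a perf/watt lead on identical silicon, which is the point being made here.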
 
The two products serve two completely different markets. One is pushing maximum per-thread performance achievable with the socket power budget. The other is trying to wring every last drop of performance out of lower power budgets, especially for legacy platforms that are limited to lower TDPs. It's that simple. Scaling down the 9755 likely has some other issues that we just aren't seeing.
 
Nope the 400W is just at the high end of power consumption. Dropping to 320 W TDP has a 2% impact on performance but saves power. The dense cores are being pushed to the edge of their VF curve, while the classic ones are not.

More typical usage, +50% power for +10% performance.
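Put as perf/watt, the two data points quoted here look like this. A small sketch using only the numbers in this post (a 2% loss going from 400 W to 320 W, and +50% power for +10% performance); the function name is made up and TDP is treated as actual power draw.

```python
def perf_per_watt_change(perf_ratio: float, power_ratio: float) -> float:
    """Fractional change in perf/watt, e.g. 0.25 means +25%."""
    return perf_ratio / power_ratio - 1.0


# 400 W -> 320 W with a 2% performance loss
drop_tdp = perf_per_watt_change(0.98, 320 / 400)
# +50% power for +10% performance
push_power = perf_per_watt_change(1.10, 1.50)

print(f"400 W -> 320 W: {drop_tdp:+.1%} perf/watt")
print(f"+50% power:     {push_power:+.1%} perf/watt")
```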
 
Scaling down the 9755 likely has some other issues that we just aren't seeing.
Like what? We know that you can scale Zen 5 standard cores to even shy of 1 watt per core if need be.
The two products serve two completely different markets. One is pushing maximum per-thread performance achievable with the socket power budget. The other is trying to wring every last drop of performance out of lower power budgets, especially for legacy platforms that are limited to lower TDPs. It's that simple.
The problem is that it seems that if the 9755 were limited to the 9745's TDP, you would get very similar perf.
 
I'm just going to copy paste my reddit comment here lol but:

Is it just me, or is this not a nice showing? Perf doesn't scale linearly with power, and 80% the power draw for 90% the perf sounds achievable for a product on the same node.

Even for the more embarrassingly multithreaded benchmarks, such as Blender, you see the 9745 getting 20-25% better perf/watt, however a good chunk of that can also easily be attributed to simply consuming way less power and thus getting better perf/watt from that.

For example, assuming each core in the 9745 is maybe getting ~2 watts, while each core in the 9755 is getting ~3 watts, that alone would provide an almost 30% perf/watt advantage to the 9745... if it was using the exact same N4P CCDs as the 9755.

It seems to me as if, per core, Zen 5C on N3E in Turin Dense is not any more efficient than Zen 5 on N4P in regular Turin.

Ofc, the only way to really confirm this would be to have someone test the two chips at the same power draw.

From: https://www.phoronix.com/review/amd-epyc-9745-9755/9
it has half the cache and also a lot fewer CCDs in case you've ever wondered.
The whole point's that it is cheap. Venice is Not Cheap. good luck!
 
it has half the cache and also a lot fewer CCDs in case you've ever wondered
Half the cache would be the explanation that's believable: PPC is getting negatively affected too much in some workloads, and the frequency/power benefit isn't enough to compensate.
If there were some nT workloads that phoronix tested that didn't hit the L3 much at all, or actually just an average core clock reporting for the two cpus tested, that would be helpful as well.
Fewer CCDs seems like an advantage though, since you worry less about interconnect power. Unless you think mem bandwidth per core gets reduced, if the dense CCDs don't have GMI-wide and use all the IF links available from the IOD?
 
Nope the 400W is just at the high end of power consumption. Dropping to 320 W TDP has a 2% impact on performance but saves power. The dense cores are being pushed to the edge of their VF curve, while the classic ones are not.

More typical usage, +50% power for +10% performance.
AFAICT, several of these benchmarks have poor performance-over-power scaling on the 9745 because of its halved level 3 cache.

(Edit: Oops, I replied before reading the last few responses.)
 
Everything would be fine if it weren't for the crazy prices of RAM and SSDs.:mask:


Wouldn't be an AMD marketing item without at least one mistake. Calling the 9950x3D both Ryzen 7 and Ryzen 9 in the same video!

Anyway, they've pushed this for content creation and left out gaming because it doesn't really help there I expect. I'm surprised they launched this at all. Looking forward to digging up some of my old comments when the members enthusiastic about this release for its gaming performance are ultimately disappointed.
 