Discussion RDNA 5 / UDNA (CDNA Next) speculation

adroc_thurston · Apr 2, 2025

Kronos1996 said:
Why jump straight to something that complex and expensive

They're good at it.

Kronos1996 said:
Assuming RDNA 5 uses chiplets

It doesn't. They don't have the manpower for that anymore.

Kronos1996 said:
3D stack the GCD on the MCD and you have a low-end product. No expensive/complex interposer or bridge packaging needed. Just two dies like Zen X3D. Then combine two and three stacks for the mid-range and high-end products. Those would require some sort of interposer or active bridge. The product stack would look like this:

That's just N4C with even more tapeouts.
No bueno.

GTracing · Apr 2, 2025

Kronos1996 said:
I went back and looked at the Navi 4C designs. Why jump straight to something that complex and expensive? Assuming RDNA 5 uses chiplets, why not go for a simpler layout? This is what I’m thinking:

GCD - 40 CU (80-100mm2)
MCD - 128 bit bus + 32mb Infinity Cache (~70mm2)

3D stack the GCD on the MCD and you have a low-end product. No expensive/complex interposer or bridge packaging needed. Just two dies like Zen X3D. Then combine two and three stacks for the mid-range and high-end products. Those would require some sort of interposer or active bridge. The product stack would look like this:

Peasants - 40 CU, 128 bit bus, 32 mb cache, 2 dies (150-170mm2)

Plebeians - 80 CU, 256 bit bus, 64 mb cache, 4 dies (300-340mm2)

Whales - 120 CU, 384 bit bus, 96 mb cache, 6 dies (450-510mm2)

That’s pretty conservative in terms of silicon and covers most of the product stack with two dies and less advanced packaging than what they were working on. Assuming N2X, the low-end version might approach a 9070 XT while the high-end would be at least twice as fast. They could maybe fit 60 CU but it’s hard to say without knowing how much bigger RDNA 5 CU’s might be or if they use N3P vs N2X.
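For anyone who wants to sanity-check the stack arithmetic in the quoted post, here is a quick sketch. The die areas and per-stack specs are the figures from the post itself; the function and variable names are made up for illustration, and the totals ignore any extra area for die-to-die interfaces or bridges.

```python
# Die areas in mm2, taken from the quoted post (GCD given as a range).
GCD_AREA = (80, 100)  # 40 CU graphics die
MCD_AREA = 70         # 128-bit bus + 32 MB Infinity Cache


def stack_totals(n_stacks):
    """Totals for n_stacks GCD-on-MCD pairs (hypothetical helper)."""
    lo = n_stacks * (GCD_AREA[0] + MCD_AREA)
    hi = n_stacks * (GCD_AREA[1] + MCD_AREA)
    return {
        "cus": 40 * n_stacks,
        "bus_bits": 128 * n_stacks,
        "cache_mb": 32 * n_stacks,
        "dies": 2 * n_stacks,
        "area_mm2": (lo, hi),
    }


for name, n in [("Peasants", 1), ("Plebeians", 2), ("Whales", 3)]:
    print(name, stack_totals(n))
```

The totals line up with the 150-170, 300-340 and 450-510mm2 ranges listed in the post.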

There are a few things that make it more complicated than that. Just to cover the three biggest:

Firstly, there are things like the media engine, display engine, and command processor that you don't need or want multiples of.

Secondly, in that arrangement the display PHYs would be in the top die in a stack. I don't think that's possible.

Thirdly, each CU needs access to all the memory. You would need interconnects with crazy amounts of bandwidth, which would almost certainly be expensive/complex.

Kronos1996 · Apr 2, 2025

adroc_thurston said:
They're good at it.

It doesn't. They don't have the manpower for that anymore.

That's just N4C with even more tapeouts.
No bueno.

What happened to their manpower?

gdansk · Apr 2, 2025

Kronos1996 said:
What happened to their manpower?

AI.

adroc_thurston · Apr 2, 2025

Kronos1996 said:
What happened to their manpower?

GPGPU slave mines, yearly.
Most unfortunate.

GTracing said:
which would almost certainly be expensive/complex.

ah no SoIC-X is cheap d2d bumps: the packaging.

Kronos1996 · Apr 2, 2025

gdansk said:
AI.

Poor bastards. Unfortunately, consumer and server GPU’s have diverged to the point where they have little in common. Sharing dies between product stacks isn’t possible in the way it is on the CPU side. I do wonder about sharing package designs though.

Navi 4C is pretty similar to MI300. If they designed both to use the same packaging, that may help with R&D costs. Having the option to use HBM MCD’s from Instinct on professional RDNA cards would be especially advantageous.

adroc_thurston · Apr 2, 2025

Kronos1996 said:
Navi 4C is pretty similar to MI300

oh hell no, it was much more complex.

Kronos1996 said:
Having the option to use HBM MCD’s

doesn't have dat, the IMC is always on AID.

Kronos1996 · Apr 2, 2025

adroc_thurston said:
GPGPU slave mines, yearly.
Most unfortunate.

ah no SoIC-X is cheap d2d bumps: the packaging.

I mean you gotta go where the money is. I get it but it’s still sad.

Kronos1996 · Apr 2, 2025

adroc_thurston said:
oh hell no, it was much more complex.

doesn't have dat, the IMC is always on AID.

MI400 then. Just theorizing about why they’d use such a complex package for consumer products. Perhaps they’re trying to reuse a package design made for server? Yeah it might cost more but the R&D was already paid for.

Would also open up some options for die sharing in certain markets. Slapping RDNA GCD’s on top of CDNA AID’s (or whatever stupid name they call them now) would make one helluva workstation product. HBM would be a decisive advantage in the AI war.

adroc_thurston · Apr 2, 2025

Kronos1996 said:
Just theorizing about why they’d use such a complex package for consumer products.

a) sometimes you gotta win.
b) they have somewhat of a fetish for advanced packaging. Remember, they tried to force HBM into client.

Kronos1996 said:
Slapping RDNA GCD’s on top of CDNA AID’s (or whatever stupid name they call them now) would make one helluva workstation product. HBM would be a decisive advantage in the AI war.

none of that is product level viable.
Chiplets in gfx are a "win more" option.

branch_suggestion · Apr 2, 2025

adroc_thurston said:
It doesn't. They don't have the manpower for that anymore.

But a big simple SoIC stack doesn't require as much manpower as N4C would've.
May the merciful above burst the sinful AI bubble soon.

basix · Apr 3, 2025

The "simplest" Multi-Chip approach is what Nvidia does with B100/200: Just fuse two GPUs together. That makes a ~300...350mm2 GPU a little bit bigger (you need to have an additional chip-to-chip interface) but does not add other cost. Smaller dies in the same portfolio are not affected either. The big GPU, which uses two of the chiplets (600...700mm2 total), would have doubled media engines etc., but for a Halo part that is not a huge problem. You can also put those to use for prosumers etc. (e.g. GB202 has more video encoders/decoders compared to the smaller Blackwell GPUs).

Packaging cost should also not increase by too much. I think RDNA3-like organic "Infinity Fanout Links" are enough for a few TByte/s of bandwidth. For VRAM and Infinity Cache bandwidth, surely enough. The question is what happens with the L2$. I assume that this one needs to be some sort of "private" for each chiplet. I do not see how you can route much, if not all, L2$ traffic over the chip-to-chip interface and keep the cost reasonable.

adroc_thurston · Apr 3, 2025

basix said:
I do not see how you can route much, if not all, L2$ traffic over the chip-to-chip interface and keep the cost reasonable.

Well you don't.
Everything NV since A100 is NUCA with hella caveats.

branch_suggestion · Apr 3, 2025

basix said:
The "simplest" Multi-Chip approach is what Nvidia does with B100/200: Just fuse two GPUs together.

Easy for compute, graphics is a whole other story due to the serial nature of APIs.
Also the bigger you go, the demands on interconnect bandwidth go up exponentially.

basix said:
That makes a ~300...350mm2 GPU a little bit bigger (you need to have an additional chip-to-chip interface) but does not add other cost. Smaller dies in the same portfolio are not affected either. The big GPU, which uses two of the chiplets (600...700mm2 total), would have doubled media engines etc., but for a Halo part that is not a huge problem. You can also put those to use for prosumers etc. (e.g. GB202 has more video encoders/decoders compared to the smaller Blackwell GPUs).

Thing is you need a big, expensive and high demand substrate, and you could just build a big enough monolithic part.

basix said:
Packaging cost should also not increase by too much. I think RDNA3-like organic "Infinity Fanout Links" are enough for a few TByte/s of bandwidth.

Not between compute engines, you need CoWoS-L for enough bandwidth to sync L3+memory ~5TB/s bidirectional. L2, forget about it, you would need to have 2 GPUs work together as one through all sorts of tricks.
CoWoS is a nonstarter for client, same reasons as HBM.
Much easier to build one big compute engine and stack it on top of cache and memory PHYs.

basix said:
For VRAM and Infinity Cache bandwidth, surely enough. The question is what happens with the L2$. I assume that this one needs to be some sort of "private" for each chiplet. I do not see how you can route much, if not all, L2$ traffic over the chip-to-chip interface and keep the cost reasonable.

Well N4C had L2 private to each SED, even with SoIC-X having coherent L2 across multiple chiplets is a very tall ask.
So once again the best idea without going beyond retscale base dimensions is a simple 3D stack, frontend+compute+L2 up top, L3+memory below.
And I guess a MID or two connected to the base with fanouts for IO. Thing is expensive enough as is.

Kronos1996 · Apr 3, 2025

I understand they gotta prioritize server but abandoning chiplet consumer GPU’s may be foolish. I assumed the plan was to share GPU chiplets with laptop. Future laptop chips could then share CPU and GPU dies with desktop. The only custom die would be an IOD of some sort. Probably with the GPU chiplet 3D stacked on top. They could iterate a lot faster and wouldn’t have to redo as much work every year. Just make a new IOD each time.

If OEM’s still insist on an annual cadence, maybe they do a mid-gen refresh chiplet for the CPU and GPU. Zen 6+ and RDNA 5.5 for example. They could also use those on desktop for an annual release cadence. With the introduction of Halo, AMD is perfectly positioned to dominate this new premium segment. That would be easier if they could share GPU chiplets with desktop though.

adroc_thurston · Apr 3, 2025

Kronos1996 said:
I assumed the plan was to share GPU chiplets with laptop

No.
GPU chiplets are a win more gimmick.

Kronos1996 said:
Future laptop chips could then share CPU and GPU dies with desktop.

Lmao

Vikv1918 · Apr 3, 2025

Kronos1996 said:
I understand they gotta prioritize server but abandoning chiplet consumer GPU’s may be foolish. I assumed the plan was to share GPU chiplets with laptop. Future laptop chips could then share CPU and GPU dies with desktop. The only custom die would be an IOD of some sort. Probably with the GPU chiplet 3D stacked on top. They could iterate a lot faster and wouldn’t have to redo as much work every year. Just make a new IOD each time.

If OEM’s still insist on an annual cadence, maybe they do a mid-gen refresh chiplet for the CPU and GPU. Zen 6+ and RDNA 5.5 for example. They could also use those on desktop for an annual release cadence. With the introduction of Halo, AMD is perfectly positioned to dominate this new premium segment. That would be easier if they could share GPU chiplets with desktop though.

Maybe AMD is saving chiplets as a "break glass in case of emergency" plan if Nvidia gets too competitive. Right now, they don't need it as their monolithic architecture is good enough. In laptop especially they have zero ambition and don't care that they have 0.01% marketshare, so laptops don't matter enough to design their GPU around it. They've already secured PS6 and nextbox so they have guaranteed revenue to chug along at least for the next 5-7 years.

adroc_thurston · Apr 3, 2025

Vikv1918 said:
Maybe AMD is saving chiplets as a "break glass in case of emergency" plan if Nvidia gets too competitive.

Lmao that's an interesting way to say "they don't have the manpower for that".

Vikv1918 said:
In laptop especially they have zero ambition and don't care that they have 0.01% marketshare

They have a lot of ambition but they also screwed up.

UsedTweaker · Apr 3, 2025

The only way GPU chiplets work today is with ultra high bandwidth packaging, aka the stuff AI looks likely to keep booked up straight through the RDNA5 ship date (at TSMC anyway). The RDNA3 chiplet strategy wouldn't offer much: the PHYs and SRAM on N48 are <20% of the die, so hiving off a 70mm2 chiplet doesn't save much $$. Plus that adds latency going out to big cache/main memory, and GPUs aren't as good at hiding latency as you'd expect.

You'd need packaging that could split the logic of the next chip in half or something to matter much. Two 150mm2 dies increase yield noticeably over one 300mm2 die.
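To put a rough number on the yield point, here is a minimal sketch using the simple Poisson yield model, Y = exp(-A * D0). The defect density of 0.1 defects/cm2 is an assumed illustrative value, not a published figure for any node, and the model ignores packaging yield, the extra area of a die-to-die interface, and partially defective dies that can be salvaged as cut-down SKUs.

```python
import math

D0 = 0.1  # ASSUMED defect density in defects per cm2 (illustrative only)


def poisson_yield(area_mm2, d0_per_cm2=D0):
    """Fraction of defect-free dies under the Poisson yield model."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-area_cm2 * d0_per_cm2)


mono = poisson_yield(300)     # one 300mm2 monolithic die
chiplet = poisson_yield(150)  # each 150mm2 die, tested on its own

print(f"300mm2 die yield: {mono:.1%}")
print(f"150mm2 die yield: {chiplet:.1%}")
```

The gain comes from testing the small dies individually before packaging: a defect scraps 150mm2 of silicon instead of 300mm2. If you just paired up two untested 150mm2 dies, the chance both are good would be the same as for the 300mm2 die in this model.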

Maybe the packaging could be something cheap though. Moving GDDR to on package over soldered would give a latency boost of a couple 10s of ms. That matters, especially in raytracing, and you could plausibly swing having AIBs pay for that today and keep up the financial trickery of pretending you're only selling the dies so your profit margins look bigger.

adroc_thurston · Apr 3, 2025

UsedTweaker said:
Plus that adds latency going out to big cache/main memory and GPUs aren't as good at hiding latency as you'd expect.

'latency' add is negligible and GPUs are very much good at hiding latency.
that's why they exist.

UsedTweaker said:
Maybe the packaging could be something cheap though. Moving GDDR to on package over soldered would give a latency boost of a couple 10s of ms.

MoP does not do anything about latency.
That's not how DRAM works.

Kronos1996 · Apr 3, 2025

UsedTweaker said:
The only way GPU chiplets work today is with ultra high bandwidth packaging, aka the stuff AI looks likely to keep booked up straight through the RDNA5 ship date (at TSMC anyway). The RDNA3 chiplet strategy wouldn't offer much: the PHYs and SRAM on N48 are <20% of the die, so hiving off a 70mm2 chiplet doesn't save much $$. Plus that adds latency going out to big cache/main memory, and GPUs aren't as good at hiding latency as you'd expect.

You'd need packaging that could split the logic of the next chip in half or something to matter much. Two 150mm2 dies increase yield noticeably over one 300mm2 die.

Maybe the packaging could be something cheap though. Moving GDDR to on package over soldered would give a latency boost of a couple 10s of ms. That matters, especially in raytracing, and you could plausibly swing having AIBs pay for that today and keep up the financial trickery of pretending you're only selling the dies so your profit margins look bigger.

I mean Intel has separate GPU tiles on their mobile SoC’s now. Other issues aside, EMIB seems quite affordable now that it’s at volume. I don’t think it’s far-fetched AMD follows suit before long. Those super complex SoC’s are quite difficult to make apparently. Separating more things out of the main die should make design easier.

adroc_thurston · Apr 4, 2025

Kronos1996 said:
EMIB seems quite affordable now that it’s at volume

MTL/ARL/yaddayadda are not EMIB.

Kronos1996 said:
I don’t think it’s far-fetched AMD follows suit before long

you already saw the stxH. that's the future of AMD client.

SolidQ · May 4, 2025

Guys are slow, but there is no patent about the Traversal Engine (the most important thing), which we have in this topic

AMD's Next-Gen UDNA 5 Gaming GPUs Could Potentially Bridge The Ray-Tracing Performance Gap With NVIDIA, Indicates Extensive Patent Filings

AMD has big plans for its UDNA 5 GPUs, as a set of patent filings over the last two years indicate that the firm plans to upscale its RT game.

wccftech.com

marees · May 7, 2025

MI400 series development remains on track to launch next year.

https://twitter.com/x/status/1919943266663071879

soresu · May 7, 2025

SolidQ said:
Guys are slow, but there is no patent about the Traversal Engine (the most important thing), which we have in this topic

AMD's Next-Gen UDNA 5 Gaming GPUs Could Potentially Bridge The Ray-Tracing Performance Gap With NVIDIA, Indicates Extensive Patent Filings

AMD has big plans for its UDNA 5 GPUs, as a set of patent filings over the last two years indicate that the firm plans to upscale its RT game.

wccftech.com

Similarly, with Sony's Project Amethyst, AMD plans to develop its advanced path tracing solutions, potentially leveraging neural rendering (AI) to compete with NVIDIA’s resource-intensive ReSTIR technology. Ultimately, this "RT race" will benefit the end consumer, who'll likely find better performance and features with next-gen GPUs.

They also seem to not understand that ReSTIR is entirely usable on any RT accelerated GPU, as the specifics of the technique are explicitly laid out in research papers, along with demonstration code - not proprietary RTX code that no one else can use performantly due to closed source, a la GameWorks™️.

RTX Remix includes an implementation of ReSTIR (as does Q2 RTX and likely Portal RTX too), but it's not the only one out there.

It's best to think of ReSTIR in a similar context to deferred lighting.

i.e. a rendering optimisation technique that benefits all, bringing up the performance baseline - but for RT/PT instead of raster gfx in the case of deferred lighting.

This is what I hate about nVidia.

Even when they actually fund or do work that is open/usable to all, their proprietary branding language like RTX in official PR has clouded the topic so drastically that the average gamer walking into a PC shop to buy a whole system, or ordering one online, hasn't a chance of sifting through it confidently.
