
Discussion RDNA 5 / UDNA (CDNA Next) speculation

Page 117
Looks like this extends the concept from the 2020 Shared L1 paper to register files.
The shared-L1 paper's Work Distribution Crossbar implementation (section 5.3) provided a 140% perf uplift for P-GEMM. That's a much larger uplift than 16X private L1, or than the theoretical perf of zero-latency replication over a slow mesh interconnect.

Making L2 bigger doesn't help, because "performance is limited by the L2 reply bandwidth bottleneck [49, 73, 74]. Such a bottleneck is relieved with Shared++ and DynEB as the shared L1 organization utilizes the remote cores as an additional source of bandwidth."
Put differently, it's basically free performance without exploding cache sizes for sharing-friendly workloads. Plus, section 5.5 shows progressive IPC scaling with more CUs, because remote-core bandwidth scales accordingly.
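To see why remote cores act as extra bandwidth, here's a toy model (the numbers below are illustrative, my own, not figures from the paper):

```python
# Toy L1/L2 bandwidth model. With private L1s, shared data is replicated
# and every miss serializes on L2 replies, so throughput is capped by L2
# reply bandwidth. With a shared L1 over a crossbar, hits in remote
# cores' L1 slices add their bandwidth on top.

def effective_bw(n_cores, l1_bw_per_core, l2_reply_bw, remote_hit_rate):
    """Upper-bound read bandwidth seen by the cores (GB/s)."""
    # Private L1s: the L2 reply path is the ceiling.
    private = l2_reply_bw
    # Shared L1: remote_hit_rate of requests are served out of other
    # cores' L1 slices; only the remainder falls through to L2.
    shared = (remote_hit_rate * n_cores * l1_bw_per_core
              + (1 - remote_hit_rate) * l2_reply_bw)
    return private, shared

priv, shared = effective_bw(n_cores=32, l1_bw_per_core=100,
                            l2_reply_bw=1000, remote_hit_rate=0.5)
print(priv, shared)  # 1000 2100.0
```

Note the shared term scales with n_cores, which is exactly the progressive IPC scaling behavior from section 5.5.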

For other ML, the mediocre mesh baseline already gives huge gains for U-Nets such as AN, RN and SN. With a NoC + higher clocks the gains would be larger; for example, the decoupled-L1 follow-up paper beefed things up with a NoC + higher clocks: 8X vs ~2X for AN, and 5X vs 4X for SN.
If AMD are smart, they'll also use the Work Distribution Crossbar here for inter-SE communication to bypass the L2 completely. It's good for workload redistribution (load balancing) too; work graphs and many other workloads would benefit.
There's way lower bandwidth pressure on the L2, since the distributed L0 acts as a quasi-L2. Considering everything is SE-local, scratchpad spill-over to L2 is probably handled too. Tiny L2s for ATx make more sense if this is true, likely in a more passive (LLC-like) role than the existing L2. They might get away with a slower, denser cache.

NVIDIA have had shared DSMEM since Hopper on the DC side, and since Blackwell on the consumer side. Probably over a ring bus within each GPC.
Beefing up that ring bus and letting the GPU send the instruction and transfer register data over it, in addition to LDS data, would enhance the VGPRs with remote-core bandwidth and bypass the existing cache/memory bottlenecks discussed previously. That addresses the memory-wall problem on three fronts.

Unconfirmed, but we'll know in ~3 months whether CDNA5 exceeds the Blackwell baseline for cache/memory. Regardless, these kinds of insane perf figures from simulations are too good to ignore. I hope RDNA 5 goes for all three (L0s, LDS and VGPRs), because it's basically free IPC.
 
The ML perf and feature-set delta between GFX12 and GFX13 is too big. Like Adroc has already said, FSR Diamond will be RDNA 5+ exclusive.
Hunyh also said "natively optimized for Project Helix". Translation: FSR porting ain't gonna happen.
Best case scenario: exclusive to RDNA5+ cards
Worst case scenario: exclusive to Helix family
 
Worst case scenario: exclusive to Helix family
Helix uses off-the-shelf AT2.
AMD should really make FSR4 work on RDNA3 and FSR5 on RDNA4. Even running very slowly is much better than not running at all; it just gives a much better impression, even if RDNA3/4 owners might not like the performance hit.
The install base is so tiny they don't have to port anything.
 
It looks like there is sufficient capacity for bdie (SF4X?)
It seems the Samsung Foundry team is firmly convinced that AMD won't be giving them any orders
 
It looks like there is sufficient capacity for bdie (SF4X?)
It seems the Samsung Foundry team is firmly convinced that AMD won't be giving them any orders
sf4 was supposed to be sonoma valley (mendocino replacement), right?

i hope one or two low end zen 7 socs (especially the 15 watt grimlock point 4) find their way to samsung 2nm
 
I suspect it was cancelled, as it seems to have been replaced by Shockwave and Bumblebee
both soundwave & bumblebee on tsmc rather than samsung

plus in leaked amd roadmaps bumblebee doesn’t replace mendocino but sits a tier above that

mendocino lives until 2029 / zen 7 grimlock point 4
 
On these AMD uses Samsung rumors:

Whenever AMD executives are asked about their collaboration with TSMC, they speak very highly of their partnership with TSMC.

Whenever Lisa Su is asked whether AMD will have its chips manufactured by Intel or Samsung, or is considering doing so, she offers a few polite platitudes and sings the praises of the excellent working relationship with TSMC.

i hope one or two low end zen 7 socs (especially the 15 watt grimlock point 4) find their way to samsung 2nm
Why would AMD do that?

Is it because Samsung’s wafers are supposed to be so cheap? Why do you think they’re so cheap?

AMD was prepared to pay GF for every wafer it had manufactured by TSMC using the 7 nm process. I wonder why?

A nice video; it gives a sense that the manufacture of semiconductor chips involves far more than just processes and fabs:
 
It seems like even the 4nm FinFET process doesn't have many customers. They still seem to think that C-BaseDIE is out of the question.
 
"According to an industry source, Samsung and AMD are reportedly discussing a joint announcement regarding their partnership. The source noted that this move is expected to reaffirm the strategic significance of Samsung's semiconductor division."
 
I'm not sure if I should say this here, but the Korean memory guys will likely be shocked by the news of Micron supplying HBM4.
 
I'm not sure if I should say this here, but the Korean memory guys will likely be shocked by the news of Micron supplying HBM4.
If by "Korean memory guys" you mean SK Hynix and Samsung, then they probably weren't shocked. I think SK Hynix and Samsung are very well positioned to assess Micron's capabilities.

If you mean the guys who have been spreading the word for months that Micron can't deliver... Who takes these guys seriously?
 
Wonder if consoles with RDNA5 will have enough AI power for AI slop like DLSS5
Hope they ignore this BS. Even if it becomes flawless, I don't want every game to look like the same photorealistic game (UE5 bland look 2.0). Per-game training is possible but ain't gonna happen.

Do we have any idea about roughly what kind of GEMM throughput RDNA 5 will achieve compared to RDNA 4 and 50 series? Not talking about on paper TFLOPs here but actual perf differences assuming same number of CUs/SMs and clocks.
 
Wonder if consoles with RDNA5 will have enough AI power for AI slop like DLSS5
"Power" is only a problem if they want to offer something similar.
I suspect FSR Diamond will be somewhere between DLSS 4.5 and 5:
More like 4.5 for stuff like faces (no AI "upgrades" that make characters look different), but with some RT-related stuff similar to DLSS 5, so they can offer lighting/shading quality equivalent to full RT/PT at a fraction of the performance cost.

Do we have any idea about roughly what kind of GEMM throughput RDNA 5 will achieve compared to RDNA 4 and 50 series? Not talking about on paper TFLOPs here but actual perf differences assuming same number of CUs/SMs and clocks.
I'm not sure we even have on-paper TFLOP numbers yet (at least not for lower-precision GEMM formats), and the new CUs have likely seen some significant changes to the VGPRs and/or caches as well, so I doubt anyone can give even a rough estimate beyond "higher than RDNA4".

I'd expect substantial improvements in any case, since RDNA5 is AMD's first architecture whose dev cycle fell square into the AI craze, and I doubt MI4xx would've gotten as many early orders if the GFX1250/13xx CU design wasn't a substantial improvement for some AI workloads.
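For anyone who wants to play with the matched-CU/clock comparison above once real numbers leak, the back-of-envelope math is just this (the FLOPs/CU/clk figure below is a hypothetical placeholder, not a leak):

```python
def peak_tflops(n_cus, clock_ghz, flops_per_cu_per_clk):
    """Peak dense matrix throughput in TFLOPS."""
    # GFLOPS = CUs * GHz * FLOPs per CU per clock; /1000 gives TFLOPS.
    return n_cus * clock_ghz * flops_per_cu_per_clk / 1000.0

# 64 CUs at 2.5 GHz with a hypothetical 1024 FP16 FLOPs/CU/clk
result = peak_tflops(64, 2.5, 1024)
print(result)  # 163.84 TFLOPS
```

Swap in whatever per-CU matrix rate the real parts turn out to have; at fixed CU count and clocks, the whole comparison collapses to that one number.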
 
I'm not an expert, but I wanna share this so more knowledgeable people here can talk about it. 😉

Techpowerup article -> AMD "RDNA 5" to Heavily Boost Shader Performance in Games with New Dual-Issue Pipeline.

Source -> RDNA 5 may allow dual issues to work in more cases, making it easier to achieve peak FP32 performance.
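The uplift from pairing more instructions is easy to model. A quick sketch (my own toy model, not anything from AMD): if only a fraction of FP32 instructions can actually be co-issued, the achieved fraction of dual-issue peak is:

```python
def achieved_fp32_fraction(pairable):
    """Fraction of peak dual-issue FP32 throughput actually achieved."""
    # Normalize to 1 instruction: the pairable fraction issues two per
    # cycle (half a cycle each), the rest issue one per cycle.
    cycles = pairable / 2 + (1 - pairable)
    rate = 1.0 / cycles  # instructions per cycle, capped at 2 by construction
    return rate / 2      # as a fraction of the 2-per-cycle peak

print(f"{achieved_fp32_fraction(0.3):.2f}")  # 0.59, lots of pairing restrictions
print(f"{achieved_fp32_fraction(0.9):.2f}")  # 0.91, restrictions mostly lifted
```

So "dual issue works in more cases" moves the pairable fraction up, which is exactly what "easier to achieve peak FP32" means here.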


 