
Discussion: RDNA 5 / UDNA (CDNA Next) speculation

The newly announced M5 series supports 4× FP16.
I know RDNA 5 also supports this, so is INT8 twice the rate of FP16?

RDNA 5 <- UDNA -> CDNA Next (CDNA 4/5?)
 
AMD's been filing patents for a data reuse cache and a global VGPR-to-VGPR network, but I'm not sure it's going to solve the bottleneck. Not confirmed yet.
The data reuse cache patent seems to be more tailored towards CPUs.

Very interesting is claim 3:
- 4 cycles = L1-Cache (48kByte on Zen 5)
- 2 cycles = Data Re-Use Cache (maybe 16kByte?)
3. The method of claim 2, wherein the loading the data from the load response is performed in four cycles from the level of the cache system and the loading the data from the data reuse cache is performed in two cycles.
Halved cache latency is very impressive!

This is realized as follows:
The data re-use cache sits very close to the execution units and before the Load and Store system.
11. The processor unit of claim 10, wherein the data reuse cache is physically located closer to the execution unit on an integrated circuit than the load-store unit or the cache system.

Looking at the Zen 5 core diagram, this Data Re-Use Cache must therefore sit inside the Execution Engine block or between the Load/Store Units and the Execution Engine.
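Purely as an illustration, here's a minimal sketch of the lookup order those claims describe, assuming the 2-/4-cycle figures from claim 3. The class name, capacity, and eviction policy below are my placeholders, not anything from the patent:

```python
# Toy model of a load path with a small data reuse cache sitting in front of
# the L1/load-store path. Latencies follow claim 3 (2 vs 4 cycles); the
# capacity and eviction policy are guesses, not taken from the patent.
from collections import OrderedDict

REUSE_LAT_CYCLES = 2    # claim 3: hit in the data reuse cache
L1_LAT_CYCLES = 4       # claim 3: load served from the cache system (L1)
REUSE_ENTRIES = 256     # placeholder: 256 x 64 B lines ~= the speculated 16 KB

class LoadPath:
    def __init__(self, l1):
        self.l1 = l1                # dict standing in for the L1 / load-store path
        self.reuse = OrderedDict()  # tiny cache sitting next to the execution units

    def load(self, addr):
        if addr in self.reuse:                  # hit close to the ALUs: 2 cycles
            self.reuse.move_to_end(addr)
            return self.reuse[addr], REUSE_LAT_CYCLES
        value = self.l1[addr]                   # miss: go through load/store to L1
        self.reuse[addr] = value                # keep the line around for re-use
        if len(self.reuse) > REUSE_ENTRIES:
            self.reuse.popitem(last=False)      # evict the oldest entry (LRU-ish)
        return value, L1_LAT_CYCLES
```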
 
The data reuse cache patent seems to be more tailored towards CPUs.

Yeah that's an L0d for (probably) Zen7
 
The data reuse cache patent seems to be more tailored towards CPUs.
It does mention parallel processor four times. I just assumed it would be used given all the weird design choices relating to register renaming, improved queue priority, OoO etc...

Here's the patent for how they avoid flooding that data reuse cache:
 
TBH I was more interested in the other patent:

Remote tensor cores can directly access another register file (bypassing L1 and L2) through a dedicated network. Based on the description this doesn't look like a regular register file, and the patent says scratchpad multiple times, so it could be approaching a TMEM-esque design. It even allows softmax (post-processing) to tap into this network and operate on register file data directly instead of doing expensive requests through L1 and L2. I won't speculate as to how fast this interconnect could be, but it says it's higher than L2 data bandwidth.

Looks like this extends the concept from the 2020 Shared L1 paper to register files. Because register files have a massive bandwidth advantage and extremely low latency, and are more tightly coupled to execution, the impact on matmul should be more significant.

Saw that the VGPR in PS5 Pro has 1.67 TB/s. So that's 6.68 TB/s per WGP. C&C measured L0 bandwidth at 210 GB/s per WGP, so ~32× more BW. Even if it uses a smaller, tensor-core-private VGPR, the bandwidth advantage would still be >10×.
If we compare the shared L1 paper and this across a 9-WGP shader engine (the PS6 design) with RDNA 2 cachemem (just a placeholder), that amounts to a total remote-core bandwidth of 1.68 TB/s for L0 vs 53.44 TB/s for the VRF. Absolutely crazy.
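Back-of-the-envelope for those numbers (a sketch only; I'm assuming the 1.67 TB/s PS5 Pro VGPR figure is per SIMD with 4 SIMDs per WGP, which reproduces the 6.68 TB/s above, and the 210 GB/s L0 figure is C&C's measurement):

```python
# Rough bandwidth comparison using the figures quoted above (TB/s).
vgpr_bw_per_simd = 1.67                  # PS5 Pro VGPR figure (assumed per SIMD)
vgpr_bw_per_wgp = 4 * vgpr_bw_per_simd   # 4 SIMDs per WGP -> 6.68 TB/s
l0_bw_per_wgp = 0.210                    # C&C's measured L0 bandwidth per WGP

print(f"VGPR vs L0, per WGP: ~{vgpr_bw_per_wgp / l0_bw_per_wgp:.0f}x")  # ~32x

# Remote bandwidth across a 9-WGP shader engine: each WGP can reach the other 8.
remote_wgps = 9 - 1
print(f"remote L0 total:  {remote_wgps * l0_bw_per_wgp:.2f} TB/s")    # 1.68 TB/s
print(f"remote VRF total: {remote_wgps * vgpr_bw_per_wgp:.2f} TB/s")  # 53.44 TB/s
```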

Also saw this one:

This is related to the stuff I shared 1.5 months ago about mixing ML data formats (e.g. INT4, INT8, BF8, MXFP8, FP4, etc.).

If RDNA 5 uses this, allows softmax to operate on register data, and extends PS5 Pro's WGP takeover mode to the entire SE, then that would be huge. Add DSMEM, Shared L0, and other changes and it's even more impressive.

Unfortunately that likely means FSR5 won't be able to run on RDNA 4 cards.
 
It does mention parallel processor four times. I just assumed it would be used given all the weird design choices relating to register renaming, improved queue priority, OoO etc...

Here's the patent for how they avoid flooding that data reuse cache:
It might get used on GPUs. But the claims are very specific given the 2-/4-cycle cache latencies; GPU caches are not in that latency region. At least claims 1-9 are not applicable to GPUs. The other claims, 10-20, might be used for both CPUs and GPUs, but are probably still more beneficial for CPUs, because the invention is primarily about reducing latency below the L1 cache. GPUs are much more latency tolerant, so the benefit of such an L0d is lower.

I think shared caches etc. like you mention with the other patents will yield more performance uplift than an L0d.

Unfortunately that likely means FSR5 won't be able to run on RDNA 4 cards.
It will run. But probably slower or as a tuned down version (like the FSR4 INT8 version on PS5 Pro).
 
It will run. But probably slower or as a tuned down version (like the FSR4 INT8 version on PS5 Pro).
I wouldn't be so sure. If it's very different even compared to the current 50 series (due to architectural changes), it'll run horribly on RDNA 4, worse than FP8 emulation, or AMD will have to spend time making a separate RDNA 4 version.

The last option is very unlikely considering FSR 4.1 looks like a sizeable improvement in terms of image clarity in motion, especially for foliage. Much crisper and less ghosting in KCD2, and that's not the only game where this is the case. If AMD thinks this is good enough, they'll abandon RDNA 4 when FSR 5 happens.

I think shared caches etc. like you mention with the other patents will yield more performance uplift than an L0d.
100%. If only I knew how GPGPU-sim worked. Then we could get somewhat reasonable estimates before launch.

Sounds like the register file patent is a big deal, but I'm not qualified to assess what kind of impact it could make if it's in RDNA 5.
 
Regarding price, AMD will have to charge a premium because Radeon sells less than Geforce. Lowering the price won’t help much. Gamers are trained “Geforce” and AMD knows it.

If gamers get AT0, it is likely going to be priced higher than the 5090.

…and that is an “if”. A whole lot of companies are putting projects on hold due to the memory/storage insanity.
 
If we take into account the rest of gfx13's patent-derived changes for RT traversal, ignoring everything else, then I can see only one scenario: one where RDNA 5 dominates. In ray traversal throughput (at the same raster) it should be able to destroy ALL existing generations, including current NVIDIA ones. In addition, factoring in the rest of the potential architectural changes should only extend the lead.
- I'll probably need to redo the October analysis because it isn't complete at all, since many new patents have surfaced in the last 5 months.

As for the RT shading portion, many things will likely change, such as cachemem and scheduling, to name a few. In addition, it seems like SWC is a dedicated block, unlike NVIDIA's SER (which repurposes existing HW). As a result, in existing PT games (DXR 1.2) RDNA 5 shouldn't have any problems winning at iso-raster. When it comes to a potential work graphs pivot for DXR 1.3, AMD and everyone else has untapped fine wine, but RDNA 5 much more so, for reasons explained too many times to repeat.

Remember that RT is just one aspect of an architecture, and we need to consider all the potential changes coming with the next-gen Radeon. With this in mind, architecturally it looks like RDNA 5 could be their most impressive product family since the R300 series (9700 Pro). I'm talking about architectural sophistication and superiority over existing products here, not that they're going to take the performance crown.

As for the 60 series, I have no idea, and I wonder what kind of HW changes it introduces to counter this. Another Ada rebrand or Turing resource spam (Ampere) definitely isn't going to cut it, and it would be asinine for NVIDIA to keep bolting things onto Ampere for the third time. Hopefully we see a major architectural change on NVIDIA's side.

It will be interesting to see if any of the PT stuff will even matter long term considering neural shading is on track to replace most if not all of the shading pipeline.

I just want to say I appreciate these posts.
 
Regarding price, AMD will have to charge a premium because Radeon sells less than Geforce. Lowering the price won’t help much. Gamers are trained “Geforce” and AMD knows it.

If gamers get AT0, it is likely going to be priced higher than the 5090.

…and that is an “if”. A whole lot of companies are putting projects on hold due to the memory/storage insanity.
The only way I would buy an RDNA 5 AT0 is if it meets or beats my RTX 5090 for the same or less money. Also, the numbers from the "leak" that Tom from Moore's Law Is Dead keeps pushing are questionable: that AT0, with 154 Compute Units and 24GB of GDDR7 VRAM on the 3nm process and running at ONLY 380W, will beat the 5090 and be a contender with the RTX 6090. I find that difficult to believe given the TBP and AMD not being really power efficient relative to performance. Additionally, the RTX 5090 has 170 Streaming Multiprocessors, which are roughly equivalent to Compute Units, so the RTX 5090 is already above what AMD's RDNA 5 AT0 is putting out when it comes to CUs vs SMs. I know the die shrink should bring an IPC boost, but it all depends on whether an increase from previous generations will translate to the upcoming generation.

Regardless, there were very few RTX 5090s sold compared to less expensive models, which likely shows that the appetite for higher-end GPUs is limited. AMD may take that into consideration and hold off on a flagship GPU for the discrete consumer market. AMD also likely saw how successful a strong upper-mid-range GPU like the RX 9070 XT has been against the increasingly expensive and rare upper-mid NVIDIA GPUs like the RTX 5070 Ti and RTX 5080. So, AMD may stick with targeting the mid-range and gain more ground there rather than expending capital on taking on the high end of the GPU market just yet.
 
Additionally, the RTX 5090 has 170 Streaming Multiprocessors, which are roughly equivalent to Compute Units,
Yeah? The 64-CU N48 goes alright vs a 70-SM GB203 chop. Might there be another factor here that helps it? Maybe a factor that being a node ahead of GB202 could help maximize.
AMD not being really power efficient relative to performance
An N48 chop is usually near the top of the charts for power efficiency, and notably VSync efficiency, while being hampered by obsolescent memory with a higher pJ/bit transmit cost (poor RTX 5050, it's the least efficient Blackwell so far).

I don't know why I decided to respond to a 1-post poster. The new theme makes it harder to discern. But we can't categorize RDNA 5 configurations based on RDNA 4 simplifications and MLID's bogus TBP. But don't worry, it's not shipping outside of DC. We won't have to buy Radeon.
 
Yeah? The 64-CU N48 goes alright vs a 70-SM GB203 chop. Might there be another factor here that helps it? Maybe a factor that being a node ahead of GB202 could help maximize.
The node ahead could be a factor, but I still don't think it will bring power consumption down to the level of 380W while producing better overall performance than the GB202.


I don't know why I decided to respond to a 1-post poster. The new theme makes it harder to discern. But we can't categorize RDNA 5 configurations based on RDNA 4 simplifications and MLID's bogus TBP. But don't worry, it's not shipping outside of DC. We won't have to buy Radeon.
I just joined because I want to question Kepler_L2, who is the purveyor of the propaganda that MLID is peddling. Suffice it to say, I have an axe to grind with how much misinformation seems to spew from the AMD side - it's the equivalent of the "RTX 4090-level performance for $549". I just think that the rumor mills are what has led to a lot of discontent among gamers who get fed false hope - like FSR 4 being ported to RDNA 3 - by MLID and others in the TechTuber world who get people's hopes up.

I wouldn't be opposed to buying a Radeon GPU if it actually did what they are claiming it would do and were competitive with or above an RTX 5090 for less money, but these guys tend to have to do damage control when things don't turn out as they've previously stated (e.g. MLID on the 9950X3D2, and Kepler_L2 on the PS6 RDNA 5 claims).

It does make me wonder if RDNA 5 AT0 is going to be exclusively for enterprise clients and data centers as you alluded to.
 
That's another bad assumption. It will sell for more than an RTX 5090 if they've caught the customer they hunted. But not to us. There's no way you build a chip that big on the hope of selling it to DLSS enthusiasts.
You're probably right. If it were released to gamers, it would almost be legally actionable (possibly as a breach of fiduciary duty to shareholders) if AMD gave a more powerful, or even equally powerful, GPU to consumers for less money. If NVIDIA cannot hold onto the enthusiast/high-end GPU consumer, then something definitely went awry at NVIDIA.
 
Additionally, the RTX 5090 has 170 Streaming Multiprocessors, which are roughly equivalent to Compute Units, so the RTX 5090 is already above what AMD's RDNA 5 AT0 is putting out when it comes to CUs vs SMs.
The 5090, like the 4090, is hitting a massive scaling wall. It's evident if you take TPU's averaged native 4K raster numbers and compare them against game-clock-adjusted TFLOPS.
Vs a 5070 Ti and 5080 it's -26%; vs a 5060 Ti 16GB it's -35%. It functions like a GPU with 126 and 110.5 SMs respectively.
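The "effective SM" figures are just 170 SMs scaled by those per-TFLOP efficiency deficits; a quick sketch of the arithmetic:

```python
# Scale the 5090's 170 SMs by its per-TFLOP scaling deficit vs smaller chips.
SM_5090 = 170
for rival, deficit in [("5070 Ti / 5080", 0.26), ("5060 Ti 16GB", 0.35)]:
    print(f"vs {rival}: behaves like ~{SM_5090 * (1 - deficit):.1f} SMs")
# -> ~125.8 and ~110.5 SMs, i.e. the 126 / 110.5 figures above
```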

If they fix core scaling on RDNA 5 then game over for 5090.
 
The 5090, like the 4090, is hitting a massive scaling wall. It's evident if you take TPU's averaged native 4K raster numbers and compare them against game-clock-adjusted TFLOPS.
Vs a 5070 Ti and 5080 it's -26%; vs a 5060 Ti 16GB it's -35%. It functions like a GPU with 126 and 110.5 SMs respectively.

If they fix core scaling on RDNA 5 then game over for 5090.
True, based on TPU numbers a 5090 is only like 30% faster than a 4090, and most people were expecting it to be much faster.
 
It's open season for press grilling:

It does not matter whether they're also mentioning the flawed Steam survey results or some other fallacy that may come up in the near future; the end result is the same: the press can and will erode whatever mind share Radeon may still have with consumers.

But hey, good thing AMD did not release FSR4 for RDNA3 (or even some RDNA2), it may have been poorly received by the press. /s
 
It's open season for press grilling:

It does not matter whether they're also mentioning the flawed Steam survey results or some other fallacy that may come up in the near future; the end result is the same: the press can and will erode whatever mind share Radeon may still have with consumers.

But hey, good thing AMD did not release FSR4 for RDNA3 (or even some RDNA2), it may have been poorly received by the press. /s
The sad part is AMD actually has solid double-digit share in DIY desktop, and sales data reflects that.
It's just that NV has nearly 100% share in all other discrete markets; remember that ProViz counts toward JPR, and NV had by far their biggest quarter ever in that segment, which means they are selling plenty of GB202.
Medusa Premium/Halo are for all purposes the same as a discrete AIB card but will only count toward overall graphics share as an iGPU.

Steam HW study needs to be paused as it leads to all sorts of flawed perceptions.
Either survey everyone quarterly/yearly and fix all submissions to the proper device (Radeon Graphics, smh), or don't do it at all.
 
Speaking of that, every Project Helix is effectively an AIB PC sale for AMD too.
The definition of a capable GPU has to include big integrated devices for overall graphics share, and there AMD and Intel are not awfully far behind NV when all PCs are included.
 
Steam HW study needs to be paused as it leads to all sorts of flawed perceptions
While I think that Steam's HW Survey needs to be revised (a beta version of this HW self-reporting is currently available), I find it odd that neither AMD nor NVIDIA will give us hard sales numbers. I think it might be because it is hard to actually track where the GPUs ultimately land once they are sold to AIB partners. Surveys can only be so definitive. I know both sides go to the data sets that most favor their narrative. I will say that AMD did a good job with the mid-range with the 9000 Series. I wonder if they will revise their business strategy for RDNA 5 and scrap the AT0 flagship GPU for gaming to remain focused on the mid-range. I assume that would only happen if RAM demand and prices remain high.
 