
Discussion: Nvidia Blackwell in Q1-2025

I'm not sure using the same base resolution and measuring FPS is the best way to compare upscalers.

There should be some effort to reach ISO quality regardless of base resolution, and only then should they measure average FPS and/or frametimes. I'm not saying this is the case, but DLSS 4.5 with a 1080p base for a 4K output could show identical IQ to DLSS 4 with a 1440p base for the same 4K output, while performing better.
Nonsense, they're selling this as an IQ improvement for the existing DLSS install base.
If you have to lower your preset to maintain performance (and thus sacrifice IQ), then the update is a meme.
 
When you read the docs: the new presets L and M are indeed much heavier.

And from the performance numbers, including memory usage, they seem to have switched to an FP8-based DNN (at least in some parts). Turing and Ampere performance falls apart because they're stuck with FP16.
Model M seems to trade execution speed for a tiny bit of memory overhead. Let's see what the quality differences will be.

And hey: Creating a selling point for Rubin with much higher Tensor throughput sounds like business strategy 😉
 
I'm worried about what I read about sharpening; I hope it's optional, or that there are presets for those of us who don't like any visible trace of that kind of effect.
 
Even the RTX 5060 drops FPS.
2-3% overhead. Nice try, NVIDIA. Not everyone has a 5090.

I'm worried about what I read about sharpening; I hope it's optional, or that there are presets for those of us who don't like any visible trace of that kind of effect.
The new models are tuned for Performance and Ultra Performance only. Use the older models for everything else.
But hey, it can make 720p UP usable:
 
I am very impressed by the quality at 10% & 20% scaling in KCD2 (384 x 216 / 768 x 432 input to 4K output). That's 100x / 25x fewer rendered pixels compared to 4K, and far fewer than UP's 9x.
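For reference, a quick Python check of those pixel counts (a minimal sketch; the resolutions are the ones quoted above):

```python
# Verify the rendered-pixel ratios for 4K output (3840 x 2160).
out_pixels = 3840 * 2160
for w, h in [(384, 216), (768, 432)]:   # 10% and 20% scaling
    print(f"{w}x{h}: {out_pixels // (w * h)}x fewer rendered pixels")
# 384x216 -> 100x fewer; 768x432 -> 25x fewer (UP at 33% scaling is 9x fewer)
```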
 
When you read the docs: the new presets L and M are indeed much heavier.

And from the performance numbers, including memory usage, they seem to have switched to an FP8-based DNN (at least in some parts). Turing and Ampere performance falls apart because they're stuck with FP16.
Model M seems to trade execution speed for a tiny bit of memory overhead. Let's see what the quality differences will be.

And hey: Creating a selling point for Rubin with much higher Tensor throughput sounds like business strategy 😉
My guess is that it's 100% FP8 training and inference, like DLSS4 RR: https://research.nvidia.com/labs/adlr/DLSS4/
Crazy how the DLSS4 RR joint upscaler + ray denoising overhead is ~40-50% for Preset L and roughly double for Preset M: https://github.com/NVIDIA/DLSS/blob/main/doc/DLSS-RR Integration Guide.pdf

Not apples to apples because it replaces NRD. Maybe that's why we haven't seen DLSS 4.5 RR yet.
Anyone think we'll get DLSS 4.5 RR at all? Maybe not till DLSS5?
 
HUB DLSS4 Preset K vs DLSS 4.5 Preset M:

HUB DLSS4 Preset K vs DLSS 4.5 Preset M vs FSR4:

DF DLSS4 Preset K vs DLSS 4.5 Preset M:
https://www.youtube.com/watch?v=x8CcD13eS6I

TL;DW:
- DLSS 4 for ray-traced games, RR, and the 20/30 series.
- DLSS 4.5 for the 40/50 series and non-RT games.
- Surprisingly, FSR4 is in some instances closer to DLSS 4.5 than DLSS 4 is; overall it's still behind and significantly softer than both.
- DLSS 4.5 is very sharp and deblurs TAA; it breaks (oversharpens) in games even with the game's sharpening slider disabled, especially at Balanced or higher with Preset M.
- Ultra Performance testing and Preset L are waiting for another day, but I've seen testing on YT suggesting DLSS 4.5 Preset M at Performance and Preset L at Ultra Performance are very close. Depends on the game.
 
My guess is that it's 100% FP8 training and inference, like DLSS4 RR: https://research.nvidia.com/labs/adlr/DLSS4/
Crazy how the DLSS4 RR joint upscaler + ray denoising overhead is ~40-50% for Preset L and roughly double for Preset M: https://github.com/NVIDIA/DLSS/blob/main/doc/DLSS-RR Integration Guide.pdf

Not apples to apples because it replaces NRD. Maybe that's why we haven't seen DLSS 4.5 RR yet.
Anyone think we'll get DLSS 4.5 RR at all? Maybe not till DLSS5?
Yeah, I agree. FP8 for RR was already introduced with DLSS 4.0. That is also the reason why DLSS 4 RR runs worse on Ampere and Turing. We are now seeing the very same for Preset M/L.

I assume we won't see an RR update until DLSS 5, when they probably switch to FP4.
 
Yeah, I agree. FP8 for RR was already introduced with DLSS 4.0. That is also the reason why DLSS 4 RR runs worse on Ampere and Turing. We are now seeing the very same for Preset M/L.

I assume we won't see an RR update until DLSS 5, when they probably switch to FP4.
My conclusion as well, but plain FP4 is sheit. They need NVFP4: ~10% perf regression but less quantization error.

For DLSS5 I'm thinking NVFP4 + the Preset K -> Preset M ms scaling = +60% overhead over Preset M, so roughly ~15-20% more demanding than Preset L on the 50 series at 4K.

The RR model is trickier, but it'll probably be extremely demanding to run: ~3x the perf cost of DLSS4 RR on the 50 series. So in general either brute force (the trend so far) or some clever redesign (unlikely). The 60 series needs a massive NVFP4 gain plus some architectural cache/memory tweaks to run this new model more efficiently.
 
Sure, NVFP4 will be used, if my "FP4" was too imprecise 😉

About DLSS 5 and Rubin, I made a longer post on 3DCenter (German):

Here's the translated version via Google:
Not only the smaller cards would benefit. All Rubin cards would benefit. 😉


Here's a calculation example for DLSS 5:
  • The number of parameters is increased by 1.33x.
  • The approach taken is that of Preset L, which uses 1.5x more processing power than Preset M without any additional parameters. This allows more detail to be extracted (see, for example, Daniel Owen's upscaling comparisons on YouTube). 1.5 * 1.33 = 2x more processing power required for upscaling.
  • FG takes a similar approach: 2x parameters / compute requirements.
  • DLSS 5 uses FP4, resulting in 2x throughput (on Blackwell and Rubin).
  • In short, DLSS 5 on Blackwell would run exactly as fast as DLSS 4.5 with Preset M. Cards older than Blackwell would be slower.

Then Rubin with 2x Tensor Throughput per SM and TMA & TMEM:
  • 2x FLOPS = 2x Performance
  • TMA & TMEM = 1.5x better scaling with FLOPS (my assumption without any solid basis)
  • A total of 3x net throughput at DNN

Now, a performance comparison of DLSS 4.5 and DLSS 5 on Rubin:
  • DLSS 4.5 is ~1.15x slower than DLSS 3 with the CNN model.
  • We assume the CNN costs zero computing power so that the calculation can be performed.
  • With 3x more net throughput, it would only be 1 + (1.15 - 1) / 3 = 1.05x slower than the CNN model (slightly faster than DLSS 4 with Preset K).
  • Ideally, DLSS 5 on Rubin would be almost as fast as DLSS 3 CNN on older cards with a similar base performance level (e.g. 6060 Ti vs. 5070), since DLSS 3 also requires some computing power.
  • Compared to similarly fast Blackwell cards, you gain +10% performance.

And now it gets even more interesting when you use FG:
  • DLSS 4 FG 2x scaled on average at approximately 1.75x.
  • Rubin would scale by 2 - (2 - 1.75) / 3 = 1.92x. Compared to Blackwell, that's +10% performance (base/input FPS) and smoothness (output FPS).
  • With 4x or even 6x/8x MFG, this would then multiply.
  • 4x MFG at 1.75^2 = 3.06 vs. 1.92^2 = 3.69, which would correspond to +20% performance (base/input FPS) and smoothness (output FPS).
  • 8x MFG at 1.75^3 = 5.36 vs. 1.92^3 = 7.08, which would correspond to +32% performance (base/input FPS) and smoothness (output FPS).

Result:
  • A higher net Tensor throughput in Rubin could lead to an estimated 1.1x higher performance compared to Blackwell when using DLSS upscaling.
  • Including 2x FG, that would increase to 1.21x.
  • Including 4x MFG, that would increase to 1.33x.
  • Including 8x MFG, that would increase to 1.46x.
  • This could allow Rubin to noticeably differentiate itself from Blackwell even beyond basic performance.
  • Improved Tensor Core performance could therefore be quite advantageous from a gaming perspective. This is also true for Nvidia, as it provides a noticeable advantage over Blackwell, making Rubin more attractive to gamers.
  • And for GEMM in general this is of course also interesting (any local execution of DNNs such as Stable Diffusion etc. and, looking further into the future, "Neural Rendering").

Edit:
  • Ray reconstruction is another option. If the same approach were used there (double the compute requirement and a switch to FP4), Blackwell would again be just as fast as with DLSS 4.
  • Rubin could also gain performance here with its 3x net throughput: something between 1.1x and 1.2x. (The arithmetic above is reproduced in the sketch below.)
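To keep the numbers straight, here's a minimal Python sketch reproducing the arithmetic above. Every input (the 1.33x parameter growth, Preset L's 1.5x cost over Preset M, FP4's 2x throughput, Rubin's 2x Tensor FLOPS, and the 1.5x TMA/TMEM scaling guess) is an assumption from the post, not measured data:

```python
# Back-of-the-envelope DLSS 5 / Rubin scaling, using only the post's assumptions.

# DLSS 5 compute vs. DLSS 4.5 Preset M
params_growth   = 1.33   # assumed parameter increase
preset_l_factor = 1.5    # Preset L costs ~1.5x Preset M
fp4_speedup     = 2.0    # FP4 doubles throughput on Blackwell/Rubin
dlss5_compute   = params_growth * preset_l_factor   # ~2x more processing power
blackwell_cost  = dlss5_compute / fp4_speedup       # ~1x -> as fast as 4.5 Preset M

# Rubin net throughput vs. Blackwell
net_throughput = 2.0 * 1.5   # 2x FLOPS * assumed 1.5x TMA/TMEM scaling = 3x

# Upscaler slowdown relative to the "free" DLSS 3 CNN baseline
dlss45_slowdown = 1.15
rubin_slowdown  = 1 + (dlss45_slowdown - 1) / net_throughput   # ~1.05x

# Frame-generation scaling (2x FG measured at ~1.75x on Blackwell)
fg_blackwell = 1.75
fg_rubin     = 2 - (2 - fg_blackwell) / net_throughput         # ~1.92x

for n, label in [(1, "2x FG"), (2, "4x MFG"), (3, "8x MFG")]:
    bw, rb = fg_blackwell ** n, fg_rubin ** n
    print(f"{label}: {bw:.2f}x vs {rb:.2f}x -> +{(rb / bw - 1) * 100:.0f}%")
# Prints +10% / +20% / +31%; the post rounds 1.9167 to 1.92, giving +32% at 8x.
```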

A follow-up post then discussed the implications for chip area of doubling the width of the Tensor Cores per SM:
Quote from AffenJack:
I also expect TMEM, but I don't think we'll see a doubling of TCs at the moment; rather a doubling/multiplying of speed at 1.58-bit, with FP4 as a new format. But there's only limited space, and for consumers I initially see more of a focus on increased use of NVFP4 for speed improvements.
I only see a doubling of TCs if they generally double the FLOPS per SM to combat the scaling problems GB202 has. It will be interesting to see how Nvidia tackles the wall their architecture has run into.
Rubin will be introduced in N3P, where logic density is the primary increase compared to N4P. Additional Tensor FLOPS primarily require logic transistors.

It appears GR202 will remain at 192 SMs. Therefore, performance gains can only be achieved through IPC and clock speed. Improving DLSS scaling could, in principle, be categorized as an IPC improvement, but only indirectly and when DLSS is used. However, when DLSS is used, it simply appears as increased performance to the user.

Consider this: How much larger does the chip need to be for a 10% performance increase with DLSS SR, or a 33% increase with DLSS and 4x MFG?

Here's the old Turing post: https://www.reddit.com/r/nvidia/comm..._07/?rdt=47655
Back then, Tensor Cores were estimated to add approximately 13% to the cost per SM, or 6% to the total chip area. Since then, a substantial L2 cache has been added and the SMs have grown larger (more shared memory, doubled FP32, stronger RT cores), while the Tensor FLOPS per FP32 FLOP have halved; the main difference is the addition of new data types. A very rough estimate suggests that today's Tensor Cores occupy about the same share of the chip as Turing's (~+5%).

Doubling them would add another ~5%, which would be 100% worthwhile when using DLSS, and N3P scales very well with logic. Now TMEM is being added, which will require more area; let's say +5% (e.g. only 128kB of TMEM instead of 256kB as with B200). The net increase in chip area is +10% for a 10% performance gain with DLSS SR. However, once FG and Neural Rendering come into play, the gain will be significantly greater. So the net result is clear: the performance increase outweighs the additional chip area required.
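To make the accounting explicit, a tiny sketch of that area estimate (all percentages are the post's rough guesses, not die measurements):

```python
# Rough chip-area accounting for doubled Tensor Cores, per the post's guesses.
turing_tc_share = 0.06   # Turing-era estimate: TCs ~6% of total chip area
tc_today        = 0.05   # today's TCs: roughly the same share (~5%)
tc_doubling     = 0.05   # doubling them adds another ~5%
tmem_cost       = 0.05   # assumed 128kB TMEM per SM: ~5%

net_area_increase = tc_doubling + tmem_cost   # +10% chip area
dlss_sr_gain      = 0.10                      # ~+10% performance with DLSS SR
print(f"~+{net_area_increase:.0%} area for ~+{dlss_sr_gain:.0%} DLSS SR perf")
```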

And then there's the use of these chips for things like RTX Pro, where additional Tensor Core performance is highly desirable. Therefore, increased Tensor Core throughput would fit well into Nvidia's overall strategy.

Edit:
TMEM might even be cheaper, but 10% seems rather low to me. https://arxiv.org/pdf/2512.02189
Quote:
The 256KB TMEM per SM (10% of SM memory) [...]
 
Sure, NVFP4 will be used, if my "FP4" was too imprecise 😉

About DLSS 5 and Rubin, I made a longer post on 3DCenter (German):

Here's the translated version via Google:


A follow-up post then discussed the implications for chip area of doubling the width of the Tensor Cores per SM:
I hope your prediction turns out to be true, and not my pessimistic assumption that they'll 5x compute again with DLSS5 SR.

Doubling the size of a systolic array costs less than 2x the area. NVIDIA already did this with Ampere, going from the old tensor core to the new one (8/SM vs 4/SM), and said it reduced area and power consumption IIRC. Also, aren't the new lower-bit datatypes basically just free performance?
2x tensor core throughput only puts them in line with ATx IIRC, with a risk of being disrupted by a possible novel cache/memory design.
Maybe they'll go 4x?

Also 1.58-bit DLSS sounds interesting but prob not practical xD
 
DLSS 4 added 5x compute and DLSS 4.5 added another 4x. That is 20x vs. DLSS 3 with the CNN, which is kinda astounding.

And you can see that they are hesitant to add more compute requirements; that's why we see Presets M and L. Framerate performance begins to tank.
Many reviewers test GPU performance with DLSS/FSR enabled by default. If DLSS 5 adds too much compute overhead, Nvidia cards could look weaker.
I think Preset L is a DLSS 5 precursor. They want to add more compute to improve quality, but that is not viable without NVFP4 (and potentially Rubin's higher Tensor Core throughput).

One thing I forgot about Rubin:
They could add the Transformer Engine (inferencing sparsity compression?) from datacenter B300. With that you get 1.5x inferencing FLOPS for "free".

Regarding TC size:
Smaller data types could be nearly free if you can do ALU re-use etc., but I am not sure whether that is done that way for FP. For INT such ALU re-use is easy; for FP it is more difficult.
If we get 4x TC throughput for gaming cards, I would take it (I also use my card for local DNNs). But maybe they'll do market segmentation and add a salvage path: Gaming = 2x; RTX Pro = 4x.

1.58-bit might be a thing as well, sure, but at the earliest for DLSS 6, and only if Rubin supports that datatype. Looking at the history from DLSS 3 to 4 / 4.5, we can see that Nvidia keeps the N-1 generation "alive": everything is FP8 now. They could have switched to NVFP4, but Lovelace does not support it.

Anyway, looking at DLSS 4.5 Preset L results at 10/20% resolution scaling, we are entering the endgame of temporal upsampling. DLSS 5 will improve quality again, and after that the perceived quality gains will likely be very minor. So keeping performance up instead of adding even more compute requirements might be the more sensible path forward. I mean, it will get crazy how few pixels are rendered vs. displayed (see the sketch after this list):
  • 20% / 25% SR modes (I expect UP will not stay the max for DLSS 5)
  • 6x / 8x MFG
  • Extreme case where I could still imagine decent / usable results: 4K / 20% scaling / 8x MFG --> only 1 of 200 pixels actually gets rendered
    • Today we look at 1 of 36 pixels (UP + 4x MFG), but UP image quality was sub-par before DLSS 4.5, so it was realistically closer to 1 of 16 pixels
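A minimal sketch of that rendered-vs-displayed pixel accounting (the scale factors and MFG multipliers are from the list above):

```python
# Displayed pixels per rendered pixel: the resolution scale applies to both
# axes, and only 1 in `mfg` output frames is actually rendered.
def displayed_per_rendered(scale: float, mfg: int) -> float:
    return mfg / (scale ** 2)

print(round(displayed_per_rendered(0.20, 8)))   # 200 -> 1 of 200 pixels (extreme case)
print(round(displayed_per_rendered(1 / 3, 4)))  # 36  -> 1 of 36 pixels (UP + 4x MFG today)
```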
 
Looks like the bad RT game quality is not DLSS 4.5's fault but the in-game denoisers'.

Impressive how well Presets M and L resolve lighting without denoising. Super-res so good that it behaves like denoising.
Edit: It is denoising. See later replies.
 
I hope we get one or two new things based on that:
- DLSS automatically disables in-game denoisers (e.g. based on whitelisting, using existing game-engine console commands). At least for UE5 this would already be quite helpful.
- Or at least a tool with a switch to disable in-game denoisers
 
Looks like the bad RT game quality is not DLSS 4.5's fault but the in-game denoisers'.

Impressive how well Presets M and L resolve lighting without denoising. Super-res so good that it behaves like denoising.
It is actually denoising. Look at the blurry, unstable vs. perfect, stable reflections at 7:20 and 8:15; that strongly suggests RR. At 7:58 the lighting and textures for the entire paved area change from blurry and unstable to sharp and stable.
If that's not enough, the DLSSTweaks dev claims the new presets are called rrlite and rrlite_folded in the nvngx_dlss DLL here, and it looks very similar to RR in Code Vein II:

Guru3D thread where it was first mentioned: https://forums.guru3d.com/threads/i...tsr-and-mods-etc.439761/page-247#post-6388116
 