
Discussion: Nvidia Blackwell in Q1-2025

I'm not sure using the same base resolution and measuring FPS is the best way to compare upscalers.

There should be some effort to reach ISO quality regardless of base resolution, and only then should they measure average FPS and/or frametimes. I'm not saying this is the case, but DLSS 4.5 with a 1080p base for a 4K output could show identical IQ to DLSS 4 with a 1440p base for the same 4K output, while performing better.
Nonsense, they're selling this as an IQ improvement for the existing DLSS install base.
If you have to lower your preset to maintain performance (and thus sacrifice IQ), then the update is a meme.
 
When you read the docs: the new presets L and M are indeed much heavier.

And from the performance numbers, including memory usage, they seem to have switched to an FP8-based DNN (at least in some parts). Turing and Ampere performance falls apart because they're stuck with FP16.
Model M seems to trade execution speed for a tiny bit of memory overhead. Let's see what the quality differences will be.

And hey: Creating a selling point for Rubin with much higher Tensor throughput sounds like business strategy 😉
 
I'm worried about what I read about sharpening; I hope it's optional, or that there are presets for those of us who don't like any visible trace of that kind of effect.
 
Even the RTX 5060 drops FPS.
2-3% overhead. Nice try, NVIDIA. Not everyone has a 5090.

I'm worried about what I read about sharpening; I hope it's optional, or that there are presets for those of us who don't like any visible trace of that kind of effect.
The new models are tuned for Performance and Ultra Performance only. Use the older models for everything else.
But hey, it can make 720p UP usable:
 
I am very impressed by the quality at 10% & 20% scaling in KCD2 (384 x 216 / 768 x 432 input to 4K output). That's 100x / 25x fewer rendered pixels compared to 4K, and far fewer than UP's 9x.
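For reference, a quick Python check of those pixel counts (a minimal sketch; the resolutions are the ones quoted above):

```python
# Verify the rendered-pixel ratios for 4K output (3840 x 2160).
out_pixels = 3840 * 2160
for w, h in [(384, 216), (768, 432)]:   # 10% and 20% scaling
    print(f"{w}x{h}: {out_pixels // (w * h)}x fewer rendered pixels")
# 384x216 -> 100x fewer; 768x432 -> 25x fewer (UP at 33% scaling is 9x fewer)
```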
 
When you read the docs: the new presets L and M are indeed much heavier.

And from the performance numbers, including memory usage, they seem to have switched to an FP8-based DNN (at least in some parts). Turing and Ampere performance falls apart because they're stuck with FP16.
Model M seems to trade execution speed for a tiny bit of memory overhead. Let's see what the quality differences will be.

And hey: Creating a selling point for Rubin with much higher Tensor throughput sounds like business strategy 😉
My guess is that it's 100% FP8 training and inference, like DLSS4 RR: https://research.nvidia.com/labs/adlr/DLSS4/
Crazy how the DLSS4 RR joint upscaler + ray denoising overhead is ~40-50% for Preset L and roughly double for Preset M: https://github.com/NVIDIA/DLSS/blob/main/doc/DLSS-RR Integration Guide.pdf

Not apples to apples because it replaces NRD. Maybe that's why we haven't seen DLSS 4.5 RR yet.
Anyone think we'll get DLSS 4.5 RR at all? Maybe not till DLSS5?
 
HUB DLSS4 Preset K vs DLSS 4.5 Preset M:

HUB DLSS4 Preset K vs DLSS 4.5 Preset M vs FSR4:

DF DLSS4 Preset K vs DLSS 4.5 Preset M:
https://www.youtube.com/watch?v=x8CcD13eS6I

TL;DW:
- DLSS 4 for ray-traced games, RR, and the 20/30 series.
- DLSS 4.5 for the 40/50 series and non-RT games.
- Surprisingly, FSR4 is in some instances closer to DLSS 4.5 than DLSS 4 is; overall it's still behind and significantly softer than both.
- DLSS 4.5 is very sharp and deblurs TAA; it breaks (oversharpens) in games even with the game's sharpening slider disabled, especially at Balanced or higher with Preset M.
- Ultra Performance testing and Preset L are waiting for another day, but I've seen testing on YT suggesting DLSS 4.5 Preset M at Performance and Preset L at Ultra Performance are very close. Depends on the game.
 
My guess is that it's 100% FP8 training and inference, like DLSS4 RR: https://research.nvidia.com/labs/adlr/DLSS4/
Crazy how the DLSS4 RR joint upscaler + ray denoising overhead is ~40-50% for Preset L and roughly double for Preset M: https://github.com/NVIDIA/DLSS/blob/main/doc/DLSS-RR Integration Guide.pdf

Not apples to apples because it replaces NRD. Maybe that's why we haven't seen DLSS 4.5 RR yet.
Anyone think we'll get DLSS 4.5 RR at all? Maybe not till DLSS5?
Yeah, I agree. FP8 for RR was already introduced with DLSS 4.0. That is also the reason why DLSS 4 RR runs worse on Ampere and Turing. We are now seeing the very same for Preset M/L.

I assume we won't see an RR update until DLSS 5, when they probably switch to FP4.
 
Yeah, I agree. FP8 for RR was already introduced with DLSS 4.0. That is also the reason why DLSS 4 RR runs worse on Ampere and Turing. We are now seeing the very same for Preset M/L.

I assume we won't see an RR update until DLSS 5, when they probably switch to FP4.
My conclusion as well, but plain FP4 is sheit. They need NVFP4: ~10% perf regression but less quantization error.

For DLSS5 I'm thinking NVFP4 + the Preset K -> Preset M ms scaling = +60% overhead over Preset M, so roughly ~15-20% more demanding than Preset L on the 50 series at 4K.

The RR model is trickier, but it'll probably be extremely demanding to run: ~3x the perf cost of DLSS4 RR on the 50 series. So in general either brute force (the trend so far) or some clever redesign (unlikely). The 60 series needs a massive NVFP4 gain plus some architectural cache/memory tweaks to run this new model more efficiently.
 
Sure, NVFP4 will be used, if my "FP4" was too imprecise 😉

About DLSS 5 and Rubin, I made a longer post on 3DCenter (German):

Here's the translated version via Google:
Not only the smaller cards would benefit. All Rubin cards would benefit. 😉


Here's a calculation example for DLSS 5:
  • The number of parameters is increased by 1.33x.
  • The approach taken is that of Preset L, which uses 1.5x more processing power than Preset M without any additional parameters. This allows more detail to be extracted (see, for example, Daniel Owen's upscaling comparisons on YouTube). 1.5 * 1.33 = 2x more processing power required for upscaling.
  • FG takes a similar approach: 2x parameters / compute requirements.
  • DLSS 5 uses FP4, resulting in 2x throughput (on Blackwell and Rubin).
  • In short, DLSS 5 on Blackwell would run exactly as fast as DLSS 4.5 with Preset M. Cards older than Blackwell would be slower.

Then Rubin with 2x Tensor Throughput per SM and TMA & TMEM:
  • 2x FLOPS = 2x Performance
  • TMA & TMEM = 1.5x better scaling with FLOPS (my assumption without any solid basis)
  • A total of 3x net throughput at DNN

Now, a performance comparison of DLSS 4.5 and DLSS 5 on Rubin:
  • DLSS 4.5 is ~1.15x slower than DLSS 3 with the CNN model.
  • We assume the CNN costs zero computing power so that the calculation can be performed.
  • With 3x more net throughput, it would only be 1 + (1.15 - 1) / 3 = 1.05x slower than the CNN model (slightly faster than DLSS 4 with Preset K).
  • Ideally, DLSS 5 on Rubin would be almost as fast as DLSS 3 CNN on older cards with a similar base performance level (e.g. 6060 Ti vs. 5070), since DLSS 3 also requires some computing power.
  • Compared to similarly fast Blackwell cards, you gain +10% performance.

And now it gets even more interesting when you use FG:
  • DLSS 4 FG 2x scaled on average at approximately 1.75x.
  • Rubin would scale by 2 - (2 - 1.75) / 3 = 1.92x. Compared to Blackwell, that's +10% performance (base/input FPS) and smoothness (output FPS).
  • With 4x or even 6x/8x MFG, this would then multiply.
  • 4x MFG at 1.75^2 = 3.06 vs. 1.92^2 = 3.69, which would correspond to +20% performance (base/input FPS) and smoothness (output FPS).
  • 8x MFG at 1.75^3 = 5.36 vs. 1.92^3 = 7.08, which would correspond to +32% performance (base/input FPS) and smoothness (output FPS).

Result:
  • A higher net Tensor throughput in Rubin could lead to an estimated 1.1x higher performance compared to Blackwell when using DLSS upscaling.
  • Including 2x FG, that would increase to 1.21x.
  • Including 4x MFG, that would increase to 1.33x.
  • Including 8x MFG, that would increase to 1.46x.
  • This could allow Rubin to noticeably differentiate itself from Blackwell even beyond basic performance.
  • Improved Tensor Core performance could therefore be quite advantageous from a gaming perspective. This is also true for Nvidia, as it provides a noticeable advantage over Blackwell, making Rubin more attractive to gamers.
  • And for GEMM in general this is of course also interesting (any local execution of DNNs such as Stable Diffusion etc. and, looking further into the future, "Neural Rendering").

Edit:
  • Ray reconstruction is another option. If the same approach were used there (double the compute requirement and a switch to FP4), Blackwell would again be just as fast as with DLSS 4.
  • Rubin could also gain performance here with its 3x net throughput: something between 1.1x and 1.2x. (The arithmetic above is reproduced in the sketch below.)
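To keep the numbers straight, here's a minimal Python sketch reproducing the arithmetic above. Every input (the 1.33x parameter growth, Preset L's 1.5x cost over Preset M, FP4's 2x throughput, Rubin's 2x Tensor FLOPS, and the 1.5x TMA/TMEM scaling guess) is an assumption from the post, not measured data:

```python
# Back-of-the-envelope DLSS 5 / Rubin scaling, using only the post's assumptions.

# DLSS 5 compute vs. DLSS 4.5 Preset M
params_growth   = 1.33   # assumed parameter increase
preset_l_factor = 1.5    # Preset L costs ~1.5x Preset M
fp4_speedup     = 2.0    # FP4 doubles throughput on Blackwell/Rubin
dlss5_compute   = params_growth * preset_l_factor   # ~2x more processing power
blackwell_cost  = dlss5_compute / fp4_speedup       # ~1x -> as fast as 4.5 Preset M

# Rubin net throughput vs. Blackwell
net_throughput = 2.0 * 1.5   # 2x FLOPS * assumed 1.5x TMA/TMEM scaling = 3x

# Upscaler slowdown relative to the "free" DLSS 3 CNN baseline
dlss45_slowdown = 1.15
rubin_slowdown  = 1 + (dlss45_slowdown - 1) / net_throughput   # ~1.05x

# Frame-generation scaling (2x FG measured at ~1.75x on Blackwell)
fg_blackwell = 1.75
fg_rubin     = 2 - (2 - fg_blackwell) / net_throughput         # ~1.92x

for n, label in [(1, "2x FG"), (2, "4x MFG"), (3, "8x MFG")]:
    bw, rb = fg_blackwell ** n, fg_rubin ** n
    print(f"{label}: {bw:.2f}x vs {rb:.2f}x -> +{(rb / bw - 1) * 100:.0f}%")
# Prints +10% / +20% / +31%; the post rounds 1.9167 to 1.92, giving +32% at 8x.
```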

A follow-up post then discussed the implications for chip area of doubling the width of the Tensor Cores per SM:
Quote from AffenJack:
I also expect TMEM, but I don't think we'll see a doubling of TCs at the moment; rather a doubling/multiplying of speed at 1.58-bit, with FP4 as a new format. But there's only limited space, and for consumers I initially see more of a focus on increased use of NVFP4 for speed improvements.
I only see a doubling of TCs if they generally double the FLOPS per SM to combat the scaling problems GB202 has. It will be interesting to see how Nvidia tackles the wall their architecture has run into.
Rubin will be introduced in N3P, where logic density is the primary increase compared to N4P. Additional Tensor FLOPS primarily require logic transistors.

It appears GR202 will remain at 192 SMs. Therefore, performance gains can only be achieved through IPC and clock speed. Improving DLSS scaling could, in principle, be categorized as an IPC improvement, but only indirectly and when DLSS is used. However, when DLSS is used, it simply appears as increased performance to the user.

Consider this: How much larger does the chip need to be for a 10% performance increase with DLSS SR, or a 33% increase with DLSS and 4x MFG?

Here's the old Turing post: https://www.reddit.com/r/nvidia/comm..._07/?rdt=47655
Back then, Tensor Cores were estimated to add approximately 13% to the cost per SM, or 6% to the total chip area. Since then, a substantial L2 cache has been added and the SMs have grown larger (more shared memory, doubled FP32, stronger RT cores), while the Tensor FLOPS per FP32 FLOP have halved; the main difference is the addition of new data types. A very rough estimate suggests that today's Tensor Cores occupy about the same share of the chip as Turing's (~+5%).

Doubling them would add another ~5%, which would be 100% worthwhile when using DLSS, and N3P scales very well with logic. Now TMEM is being added, which will require more area; let's say +5% (e.g. only 128kB of TMEM instead of 256kB as with B200). The net increase in chip area is +10% for a 10% performance gain with DLSS SR. However, once FG and Neural Rendering come into play, the gain will be significantly greater. So the net result is clear: the performance increase outweighs the additional chip area required.
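To make the accounting explicit, a tiny sketch of that area estimate (all percentages are the post's rough guesses, not die measurements):

```python
# Rough chip-area accounting for doubled Tensor Cores, per the post's guesses.
turing_tc_share = 0.06   # Turing-era estimate: TCs ~6% of total chip area
tc_today        = 0.05   # today's TCs: roughly the same share (~5%)
tc_doubling     = 0.05   # doubling them adds another ~5%
tmem_cost       = 0.05   # assumed 128kB TMEM per SM: ~5%

net_area_increase = tc_doubling + tmem_cost   # +10% chip area
dlss_sr_gain      = 0.10                      # ~+10% performance with DLSS SR
print(f"~+{net_area_increase:.0%} area for ~+{dlss_sr_gain:.0%} DLSS SR perf")
```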

And then there's the use of these chips for things like RTX Pro, where additional Tensor Core performance is highly desirable. Therefore, increased Tensor Core throughput would fit well into Nvidia's overall strategy.

Edit:
TMEM might even be cheaper, but 10% seems rather low to me. https://arxiv.org/pdf/2512.02189
Quote:
The 256KB TMEM per SM (10% of SM memory) [...]
 
Sure, NVFP4 will be used, if my "FP4" was too imprecise 😉

About DLSS 5 and Rubin, I made a longer post on 3DCenter (German):

Here's the translated version via Google:


A follow-up post then discussed the implications for chip area of doubling the width of the Tensor Cores per SM:
I hope your prediction turns out to be true, and not my pessimistic assumption that they'll 5x compute again with DLSS5 SR.

Doubling the size of a systolic array costs less than 2x the area. NVIDIA already did this with Ampere, going from the old tensor core to the new one (8/SM vs 4/SM), and said it reduced area and power consumption IIRC. Also, aren't the new lower-bit datatypes basically just free performance?
2x tensor core throughput only puts them in line with ATx IIRC, with a risk of being disrupted by a possible novel cache/memory design.
Maybe they'll go 4x?

Also 1.58-bit DLSS sounds interesting but prob not practical xD
 
DLSS 4 added 5x compute and DLSS 4.5 added another 4x. That is 20x vs. DLSS 3 with the CNN, which is kinda astounding.

And you can see that they are hesitant to add more compute requirements; that's why we see Presets M and L. Framerate performance begins to tank.
Many reviewers test GPU performance with DLSS/FSR enabled by default. If DLSS 5 adds too much compute overhead, Nvidia cards could look weaker.
I think Preset L is a DLSS 5 precursor. They want to add more compute to improve quality, but that is not viable without NVFP4 (and potentially Rubin's higher Tensor Core throughput).

One thing I forgot about Rubin:
They could add the Transformer Engine (inferencing sparsity compression?) from datacenter B300. With that you get 1.5x inferencing FLOPS for "free".

Regarding TC size:
Smaller data types could be nearly free if you can do ALU re-use etc., but I am not sure whether that is done that way for FP. For INT such ALU re-use is easy; for FP it is more difficult.
If we get 4x TC throughput for gaming cards, I would take it (I also use my card for local DNNs). But maybe they'll do market segmentation and add a salvage path: Gaming = 2x; RTX Pro = 4x.

1.58-bit might be a thing as well, sure, but at the earliest for DLSS 6, and only if Rubin supports that datatype. Looking at the history from DLSS 3 to 4 / 4.5, we can see that Nvidia keeps the N-1 generation "alive": everything is FP8 now. They could have switched to NVFP4, but Lovelace does not support it.

Anyway, looking at DLSS 4.5 Preset L results at 10/20% resolution scaling, we are entering the endgame of temporal upsampling. DLSS 5 will improve quality again, and after that the perceived quality gains will likely be very minor. So keeping performance up instead of adding even more compute requirements might be the more sensible path forward. I mean, it will get crazy how few pixels are rendered vs. displayed (see the sketch after this list):
  • 20% / 25% SR modes (I expect UP will not stay the max for DLSS 5)
  • 6x / 8x MFG
  • Extreme case where I could still imagine decent / usable results: 4K / 20% scaling / 8x MFG --> only 1 of 200 pixels actually gets rendered
    • Today we look at 1 of 36 pixels (UP + 4x MFG), but UP image quality was sub-par before DLSS 4.5, so it was realistically closer to 1 of 16 pixels
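A minimal sketch of that rendered-vs-displayed pixel accounting (the scale factors and MFG multipliers are from the list above):

```python
# Displayed pixels per rendered pixel: the resolution scale applies to both
# axes, and only 1 in `mfg` output frames is actually rendered.
def displayed_per_rendered(scale: float, mfg: int) -> float:
    return mfg / (scale ** 2)

print(round(displayed_per_rendered(0.20, 8)))   # 200 -> 1 of 200 pixels (extreme case)
print(round(displayed_per_rendered(1 / 3, 4)))  # 36  -> 1 of 36 pixels (UP + 4x MFG today)
```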
 
Looks like the bad RT game quality is not DLSS 4.5's fault but the in-game denoisers'.

Impressive how well Presets M and L resolve lighting without denoising. Super-res so good that it behaves like denoising.
Edit: It is denoising. See later replies.
 
I hope we get one or two new things based on that:
- DLSS automatically disables in-game denoisers (e.g. based on whitelisting, using existing game-engine console commands). At least for UE5 this would already be quite helpful.
- Or at least a tool with a switch to disable in-game denoisers
 
Looks like the bad RT game quality is not DLSS 4.5's fault but the in-game denoisers'.

Impressive how well Presets M and L resolve lighting without denoising. Super-res so good that it behaves like denoising.
It is actually denoising. Look at the blurry, unstable vs. perfect, stable reflections at 7:20 and 8:15; that strongly suggests RR. At 7:58 the lighting and textures for the entire paved area change from blurry and unstable to sharp and stable.
If that's not enough, the DLSSTweaks dev claims the new presets are called rrlite and rrlite_folded in the nvngx_dlss DLL here, and it looks very similar to RR in Code Vein II:

Guru3D thread where it was first mentioned: https://forums.guru3d.com/threads/i...tsr-and-mods-etc.439761/page-247#post-6388116
 