Quote from AffenJack
I also expect TMEM, but I don't think we'll see a doubling of TCs at the moment. A new format like 1.58-bit could double or multiply throughput versus FP4, but die space is limited, and for consumers I initially expect more of a focus on wider use of NVFP4 for speed gains.
I only see a doubling of TCs if they generally double the FLOPS per SM to combat the scaling problems GB202 has. It will be interesting to see how Nvidia tackles the wall its architecture has run into.
Rubin will be built on N3P, whose main gain over N4P is logic density, and additional Tensor FLOPS primarily require logic transistors.
It appears GR202 will stay at 192 SMs, so performance gains can only come from IPC and clock speed. Improved DLSS scaling could in principle be counted as an IPC improvement, if only indirectly; to the user, it simply shows up as higher performance whenever DLSS is active.
Consider this: How much larger does the chip need to be for a 10% performance increase with DLSS SR, or a 33% increase with DLSS and 4x MFG?
Here's the old Turing post:
https://www.reddit.com/r/nvidia/comm..._07/?rdt=47655
Back then, Tensor Cores were estimated to add roughly 13% to the cost of an SM, or about 6% of total chip area. Since then a large L2 cache has been added and the SMs have grown (more shared memory, doubled FP32, stronger RT cores), while Tensor FLOPS per FP32 FLOP have halved; the main change is the addition of new data types. A very rough estimate therefore puts today's Tensor Cores at about the same share of die area as Turing's (~5%).

Doubling them would add another ~5%, which would be 100% worthwhile when DLSS is used, and N3P scales very well with logic. Now TMEM is being added on top, which will also need area; say another ~5% (e.g. only 128 kB of TMEM instead of the 256 kB in B200). The net increase in chip area is then +10% for a 10% performance gain with DLSS SR alone; once FG and Neural Rendering come into play, the gain is significantly larger. So the trade-off is clear: the performance increase outweighs the additional chip area.
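The back-of-envelope area math above can be sketched like this. All percentages are the rough, speculative estimates from this post, not measured figures:

```python
# Speculative die-area accounting for a hypothetical GR202-class chip.
# Every number below is a rough guess from the discussion, not a datasheet value.

tensor_core_share = 0.05   # today's Tensor Cores, about the same share as Turing's
tc_doubling_cost  = 0.05   # doubling Tensor Core throughput adds another ~5%
tmem_cost         = 0.05   # assumed cost of 128 kB TMEM per SM (half of B200's 256 kB)

# Only the *additions* count; the existing Tensor Cores are already on the die.
extra_area = tc_doubling_cost + tmem_cost
print(f"Extra die area: +{extra_area:.0%}")

dlss_sr_gain  = 0.10   # assumed perf gain with DLSS Super Resolution
dlss_mfg_gain = 0.33   # assumed perf gain with DLSS + 4x MFG

# Perf-gain-to-area ratio: above 1.0, the extra silicon pays for itself.
print(f"DLSS SR: {dlss_sr_gain / extra_area:.2f}x gain per unit of extra area")
print(f"4x MFG:  {dlss_mfg_gain / extra_area:.2f}x gain per unit of extra area")
```

With DLSS SR alone the ratio is only break-even (1.0x), which is why the FG / Neural Rendering cases are what make the trade-off clearly worthwhile.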
And then there's the use of these chips for things like RTX Pro, where additional Tensor Core performance is highly desirable. Therefore, increased Tensor Core throughput would fit well into Nvidia's overall strategy.
Edit:
TMEM might even be cheaper than I assumed, though 10% seems rather low to me.
https://arxiv.org/pdf/2512.02189
Quote:
The 256KB TMEM per SM (10% of SM memory) [...]