Zen 5's FPU isn't four identical units. So Unified Core's FPU doesn't need to be 4x Skymont's, since Skymont has four identical FMA units. Lion Cove is similar to Zen 5 in this regard.
Zen 5:
-FMA
-FMA
-FADD
-FADD
Lion Cove:
-FMA
-FMA
-FADD
-FADD
Skymont:
-FMA
-FMA
-FMA
-FMA
So if you 4x Skymont, it would be way fatter than Zen 5's AVX implementation. Doubling the width of all four units to 256-bit would already make it significantly beefier than Lion Cove's FPU. Even just doubling it might be enough for all applications, not just PC.
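To put rough numbers on that comparison, here is a minimal sketch. The pipe counts come from the lists above, and the 128-bit Skymont width is implied by "doubling ... to 256-bit". The 512-bit width for Zen 5 and the 256-bit width for Lion Cove are my assumptions for illustration, and `fma_bits_per_cycle` is a made-up helper, not anything from a vendor tool.

```python
# Pipe counts come from the lists in the post. The per-pipe widths for
# Zen 5 and Lion Cove are ASSUMPTIONS for illustration only.
CORES = {
    "Zen 5":     {"fma": 2, "fadd": 2, "width_bits": 512},  # width assumed
    "Lion Cove": {"fma": 2, "fadd": 2, "width_bits": 256},  # width assumed
    "Skymont":   {"fma": 4, "fadd": 0, "width_bits": 128},  # implied by the post
}


def fma_bits_per_cycle(name, width_scale=1):
    """Peak FMA datapath bits per cycle, with every pipe widened by width_scale."""
    core = CORES[name]
    return core["fma"] * core["width_bits"] * width_scale


rows = [
    ("Zen 5", fma_bits_per_cycle("Zen 5")),
    ("Lion Cove", fma_bits_per_cycle("Lion Cove")),
    ("Skymont", fma_bits_per_cycle("Skymont")),
    ("Skymont, 2x width", fma_bits_per_cycle("Skymont", 2)),
    ("Skymont, 4x width", fma_bits_per_cycle("Skymont", 4)),
]
for label, bits in rows:
    print(f"{label:<20}{bits:>6} FMA bits/cycle")
```

Under those assumed widths, a Skymont with doubled pipes would match Zen 5's peak FMA width, and a 4x version would be twice that. This only counts FMA pipes and ignores the separate FADD pipes, clocks, and load/store bandwidth.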
Caches are misleading in terms of transistor count and area. Not only are they super-dense and error-resistant due to their extreme redundancy, they also lower the overall power of the SoC, which makes them worth adding, and that isn't always true of extra logic. So this isn't necessarily an accurate analysis either. There's a saying: "before you expand your uarch, consider if it'll be worth adding the equivalent amount in caches instead". When they used to break down the area, transistor count and power of the large server cores years ago at Hot Chips, L3 caches would take up 60% of the space but use 10% of the total power or less. And that ignores the elephant in the room: a cache hit means one less access to DRAM, so overall you LOWER power. I don't know where people got the idea that caches are power or area inefficient. Adding cache is the "dumb" way of adding performance.
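The 60% area / 10% power split can be turned into a rough power-density comparison. This is only a sketch of the arithmetic: it assumes both shares refer to the same die and lumps everything that isn't L3 together as "logic", and `power_density_ratio` is a hypothetical helper name.

```python
def power_density_ratio(cache_area_share, cache_power_share):
    """How many times more power per unit area the non-cache part draws.

    Both arguments are fractions of the whole die (0..1). Everything that
    is not cache is lumped together as "logic", which is an assumption.
    """
    cache_density = cache_power_share / cache_area_share
    logic_density = (1 - cache_power_share) / (1 - cache_area_share)
    return logic_density / cache_density


# Figures from the post: L3 takes 60% of the area and 10% of the power.
ratio = power_density_ratio(0.60, 0.10)
print(f"non-cache area draws about {ratio:.1f}x the power per unit area of the L3")
```

Since the post says "10% of the power or less", the real gap would be at least this large under the same assumptions. It still leaves out the DRAM accesses the cache avoids.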
Also, the decision makers who put in AMX were working under a different Intel, and they might have wanted to push it into all areas to "grab them all", whereas a 4 year delay makes it look like a waste. When comparing chips like Clearwater Forest, you must also consider that there's a high chance the project fell short of its original expectations, which is typical when delays happen. Delays happen not to "make it better" but because they misfired and needed extra time to fix things.
Look at projects that went really well: they were ahead of schedule and expectations. Core 2, for example, was delivered weeks ahead of its original schedule. The impact of a delay is not something we can ever put on a sheet of paper and compare. We can't "ISO" such things. Nvidia's NV30 got delayed and also disappointed. Intel's Knights Landing was 9 months delayed and, compared to plans, used 10% more power and performed 10% worse. So a hypothetical CWF last year could have had 15% better perf/watt while arriving more than 6 months earlier. The impact of a delay on not only the schedule but also the performance of the product is profound: both products are in new categories, where it's a make or break difference. Xeon Phi might have had a longer lifespan and a far greater impact if it had arrived 9 months earlier with higher perf and lower power. The same goes for Clearwater Forest arriving 3-5 months ago and performing 10-15% better. That would also have had a big impact.
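For a sense of scale, the Knights Landing numbers above (10% more power, 10% less performance than planned) compound when expressed as perf/watt. A minimal sketch of that arithmetic, with `perf_per_watt_vs_plan` as a hypothetical helper; the 15% figure for a hypothetical CWF is the post's own estimate and is not derived here.

```python
def perf_per_watt_vs_plan(perf_change, power_change):
    """Shipped perf/watt relative to plan (1.0 means exactly on plan).

    perf_change and power_change are fractional deltas versus the plan,
    e.g. -0.10 for 10% less performance, +0.10 for 10% more power.
    """
    return (1 + perf_change) / (1 + power_change)


# Knights Landing figures from the post: 10% less perf, 10% more power.
shipped = perf_per_watt_vs_plan(-0.10, +0.10)
plan_advantage = 1 / shipped - 1

print(f"shipped perf/watt is {shipped:.1%} of plan")
print(f"the planned part would have been {plan_advantage:.1%} better in perf/watt")
```

So a miss of 10% on each axis is roughly an 18% perf/watt shortfall, or put the other way, the planned part would have been about 22% better than what shipped.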
I think that waste may come from the same team that caused the bloat in the P-cores. There's a different one in charge now.
Honestly, it looks like the P-cores have a bunch of extra space to grow because of how big the E-core clusters are getting lol.
There's also the problem of P-cores having to be used in DC, but I still think area is the least important factor in PPA, and Intel especially seems to show very little concern for area, adding massive core-private caches and AMX extensions per core.