ToTTenTranz
Golden Member
The fact that people are kind of glossing over the fact they were using two 5090's is pretty insane as well. This slop will never run worth a damn on anything less than a 5090, if that.
This isn't necessarily true.
They're probably at the stage where they're running the freshly trained model with its original parameter count and weight precision. For the end-user release they'll likely distill the model through a teacher-student method and reduce the weight precision, e.g. from FP16 to NVFP4, for much higher performance on Blackwell GPUs while keeping >95% of the original model's accuracy. And if this is a dense model, there's a chance they'll gain a lot of performance again by moving to a mixture-of-experts architecture.
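To make the quantization step concrete, here's a toy sketch of symmetric 4-bit quantization of a weight tensor. This is plain integer quantization for illustration, not NVIDIA's actual NVFP4 format (which is a block-scaled floating-point format); the function names are made up for the example:

```python
import numpy as np

def quantize_int4(w):
    # Map floats onto the symmetric int4 range [-7, 7] with one scale
    # per tensor (real schemes use per-block or per-channel scales).
    scale = np.abs(w).max() / 7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights for inspection.
    return q.astype(np.float32) * scale

w = np.random.randn(8, 8).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize(q, s)

# Round-to-nearest keeps the per-weight error within half a quantization step.
print(np.abs(w - w_hat).max() <= s / 2 + 1e-6)
```

The point is that each weight shrinks from 16 bits to 4, at the cost of a bounded rounding error; distillation and quantization-aware fine-tuning are then used to claw back most of the lost accuracy.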
It'll be a bit like the 20B-parameter LLMs at FP4 getting ~98% of the accuracy of 200B FP16 ones despite taking up 1/40th the memory and running ~20x faster on the same chip.
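The 1/40th figure is just the two effects multiplied, which a quick back-of-the-envelope check confirms (weights only; this ignores activations, KV cache, etc.):

```python
# Bytes per weight at each precision
fp16_bytes = 2.0   # 16 bits
fp4_bytes = 0.5    # 4 bits

big = 200e9 * fp16_bytes   # 200B params in FP16 -> 400 GB of weights
small = 20e9 * fp4_bytes   # 20B params in FP4   -> 10 GB of weights

print(big / 1e9, small / 1e9, big / small)  # 400.0 10.0 40.0
```

10x fewer parameters times 4x fewer bits per parameter gives the 40x size reduction.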