That's indeed a good blog, and I noticed the stuff about occupancy. Isn't having no barriers supposed to be a major win, especially for divergent workloads? As soon as the producer is finished, there's no more waiting around for the last SIMD32 thread to complete?
I've prob spent too much time looking at this stuff xD This article introduces it:
Our primer on GPU Work Graphs introduces this exciting new paradigm for graphics developers, which enables a live shader kernel to dispatch new workloads on-demand without needing to circle back around to the CPU first.
gpuopen.com
So the big wins are:
#1 API flexibility
#2 Cache/memory efficiency boost
#3 No more fixed pipeline limitations where one weak link can stall the entire execution
#4 Very much the same as #3, but due to additional factors
The SIGGRAPH 2025 PDF notes are a gold mine:
Page 253 illustrates why running multiple material shaders at once within a SIMD unit is a terrible idea. But by having a material node for each unique material, this can be avoided altogether, ensuring only coherent material shaders are ever run.
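To make that concrete, here's a toy model of my own (the wave size, material count, and cost model are all made up, not from the notes): a divergent SIMD wave pays roughly one serial pass per distinct material shader it holds, while rays binned per material node pay close to one pass per wave.

```python
import random

WAVE = 32          # SIMD32 lanes per wave
N_MATERIALS = 8    # hypothetical number of unique material shaders
N_RAYS = 32 * 256  # 256 waves' worth of rays

random.seed(0)
rays = [random.randrange(N_MATERIALS) for _ in range(N_RAYS)]

def wave_passes(ray_materials):
    """Toy cost: each wave serially executes one pass per distinct material."""
    total = 0
    for i in range(0, len(ray_materials), WAVE):
        total += len(set(ray_materials[i:i + WAVE]))
    return total

divergent = wave_passes(rays)         # rays in arrival order
coherent = wave_passes(sorted(rays))  # rays binned per material "node"
print(divergent, coherent)  # coherent stays near the 256-wave minimum
```

With random arrival order, nearly every wave touches almost all 8 materials, so the divergent cost approaches 8x the coherent one; sorting by material first collapses it to roughly one pass per wave.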
From a hardware angle, to properly benefit from this we would need to defer the any-hit shader evaluations, since a global (SE/GPC-level) payload sort is required to ensure there are plenty of any-hit requests to choose from.
This is to ensure only one material node at a time is being executed for the material shaders. That might sound like a serialization bottleneck, but it's misleading: producer results can accumulate via the global payload sorter until they can be sent to the consumers, with no expensive writes to global memory. This should allow the GPU to achieve very high occupancy.
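Here's a sketch of what I mean by accumulation (the `PayloadSorter` class, the `push`/`flush` API, and the wave-granularity launch policy are all hypothetical, just to make the idea concrete): producers keep pushing hit payloads into per-material bins, and a consumer wave is launched only once a full coherent batch exists.

```python
from collections import defaultdict, deque

WAVE = 32  # dispatch granularity: one full SIMD32 wave per material

class PayloadSorter:
    """Toy model of a global payload sorter: producers push hit payloads,
    and a consumer is launched only once a full, single-material wave
    has accumulated (so consumers always run coherently)."""
    def __init__(self):
        self.bins = defaultdict(deque)   # material id -> pending payloads
        self.launched = []               # (material id, batch) consumer launches

    def push(self, material, payload):
        q = self.bins[material]
        q.append(payload)
        if len(q) == WAVE:               # enough coherent work: launch a wave
            self.launched.append((material, [q.popleft() for _ in range(WAVE)]))

    def flush(self):
        """End of frame: launch whatever partial waves remain."""
        for material, q in self.bins.items():
            if q:
                self.launched.append((material, list(q)))
                q.clear()

sorter = PayloadSorter()
for i in range(100):
    sorter.push(i % 3, f"hit-{i}")       # producers emit hits for 3 materials
sorter.flush()
print(len(sorter.launched))  # 3 full waves + 3 end-of-frame partials = 6
```

The point of the toy: no consumer ever sees a mixed-material batch, and nothing in this path touches global memory in the model, which is the occupancy argument above.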
Because only one (or a few) material nodes at a time are executed on the compute units, we can achieve very high occupancy. If there's more than one node, a simple partial sort should be enough to ensure coherent shader execution. Thus extremely high coherence could be expected, likely eclipsing what SER in DXR 1.2 currently affords. So no more thread divergence, and no more low occupancy either. Since path tracing time is largely shading, not traversal, the potential improvement here is significant.
By moving the entire RT pipeline into a work graph, not just the compute shader passes, additional efficiencies can be exploited, yielding a pipeline that's even more coherent and has even higher occupancy, thus eliminating most if not all bubbles, barriers, and empty launches.
Regular gaming workloads can benefit too, but this will also unlock new possibilities for GPU-driven procedural content generation, complex on-GPU systems (AI and physics), neural shading, and, as already mentioned, ray and path tracing.
For a HW architecture that is hard coded (you know which one) to match this capability across the entire stack, which includes building a robust cache/memory foundation, I suspect we could see massive benefits, in the best case mirroring or even exceeding the occupancy/coherence of the pixel shader pass in Chips and Cheese's coverage. Going from 30-45% occupancy/coherence to ~90% is a 2-3x improvement, which is a big deal. I know the math in Chips and Cheese's SER article includes the traversal step as well, but even if these benefits only applied to shading it would still be a complete game changer.
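Since I'm asking for a sanity check anyway, here's my own back-of-envelope on those numbers: the 2-3x is just proportions, and Amdahl's law covers the "only shading benefits" caveat (the 80% shading share below is my assumption, not a figure from the article):

```python
def overall_speedup(shading_fraction, shading_speedup):
    """Amdahl's law: only the shading portion of the frame gets faster."""
    return 1.0 / ((1.0 - shading_fraction) + shading_fraction / shading_speedup)

# Going from 30-45% occupancy/coherence to ~90% is a 2-3x factor on shading:
low, high, target = 0.30, 0.45, 0.90
print(target / high, target / low)  # 2.0 3.0

# If shading is ~80% of the PT frame (my assumption) and gets 3x faster:
print(round(overall_speedup(0.80, 3.0), 2))  # 2.14
```

So even under the conservative reading where traversal sees no benefit at all, a 3x shading win would still land around 2x overall, which matches the "still a game changer" intuition.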
All this sounds too good to be true so can anyone please provide a sanity check or shoot it down if it's misleading?