Reply to OS choosing performance state poorly for GPU use case
I've done a ton of experimenting with this, and from the outside it looks like the heuristic macOS uses to decide whether the GPU needs to be clocked up is something like "is any GPU command buffer fully saturated?" The percentage of the GPU's total parallelism in use does not seem to matter: if there is computation only as wide as a single warp, but that warp is saturated, the GPU will clock up.

In general, this means that any computational process that hands work back and forth between non-overlapping phases on the CPU and GPU is unlikely to get clocked up appropriately, because while the CPU is doing work the GPU is idle (and vice versa), indicating to this heuristic that a higher clock rate is not needed. Admittedly this is an odd situation, but in the realm of Audio Unit plugins it is actually the default situation if you are trying to use the GPU for audio computation: you need to compute little bits of audio as quickly as possible and hand them off to the host application (GarageBand, etc.) for processing that is typically done on the CPU.

The workaround for this is horrible, but extremely effective: simply spin one warp of a GPU threadgroup (the minimum unit of power wastage) in a busy loop 100% of the time the plugin is running, to signal to the OS that it needs to clock up the GPU (see the sketch at the end of this post). I implemented this, and it works perfectly, albeit wastefully. I describe the performance gains here: https://anukari.com/blog/devlog/waste-makes-haste

I tried many other approaches, including simply keeping a deeper queue of the "real" work I am doing on the GPU. But that queue had to be blocked with MTLSharedEvents when there was no work to do, which defeated the benefit of having a deep queue: the load average was still not high enough for the OS to clock the GPU up.

My suggestion to Apple would be to allow apps to signal that they are GPU latency-sensitive and need higher clocks to meet user needs. This would be less wasteful than spinning a GPU core, and it would also allow the OS to prompt the user for permission, etc.
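For concreteness, here is a rough Swift + Metal sketch of the shape of that workaround. This is illustrative, not the actual Anukari implementation: the names are invented, and a production version would probably bound each spin and re-enqueue it periodically rather than rely on a single unbounded kernel.

```swift
import Metal

// Spin-kernel workaround (illustrative): keep one SIMD-group busy so the
// OS sees a saturated command buffer and raises the GPU clocks.
let spinSource = """
#include <metal_stdlib>
using namespace metal;

kernel void spin(device atomic_uint *stop [[buffer(0)]]) {
    // Busy-wait until the host sets *stop to a nonzero value.
    while (atomic_load_explicit(stop, memory_order_relaxed) == 0) { }
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: spinSource, options: nil)
let pipeline = try! device.makeComputePipelineState(
    function: library.makeFunction(name: "spin")!)
let queue = device.makeCommandQueue()!

// Shared flag the host flips when the plugin is deactivated.
let stopFlag = device.makeBuffer(length: 4, options: .storageModeShared)!
stopFlag.contents().storeBytes(of: UInt32(0), as: UInt32.self)

let cmd = queue.makeCommandBuffer()!
let enc = cmd.makeComputeCommandEncoder()!
enc.setComputePipelineState(pipeline)
enc.setBuffer(stopFlag, offset: 0, index: 0)
// One threadgroup, one SIMD-group wide: the minimum unit of power wastage.
enc.dispatchThreadgroups(MTLSize(width: 1, height: 1, depth: 1),
                         threadsPerThreadgroup: MTLSize(width: 32, height: 1, depth: 1))
enc.endEncoding()
cmd.commit()

// ... later, when the plugin stops processing:
stopFlag.contents().storeBytes(of: UInt32(1), as: UInt32.self)
```

The key point is only that the spinning SIMD-group stays saturated the whole time the plugin runs; what it computes is irrelevant.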
Topic: Graphics & Games SubTopic: Metal
Nov ’24
Reply to OS choosing performance state poorly for GPU use case
Thanks for your reply! For real-time audio, the latency from user input to an audible change is quite important, and when running as an AU plugin inside GarageBand, for example, processing is done in roughly 3 ms chunks to minimize that latency. So I can't trigger useful work more than that far in advance.

It's an interesting and useful question, though: there may be a way to restructure the work so that I queue 30 ms of kernels, where any kernel more than 3 ms out just busy-waits on a spinlock until its work arrives (see the sketch below). That would have the added advantage of alleviating the latency overhead of scheduling a kernel for each 3 ms chunk of work.

Do you know something specific about this 30 ms number? Is Apple looking at something like the GPU's load average for power throttling?
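To make that restructuring concrete, here is a hypothetical Swift sketch (all names invented, and I haven't verified that pre-queued spin-waiting kernels actually register as load for the OS heuristic): each of ~10 pre-queued kernels spin-waits on a per-chunk "ready" flag until the audio callback delivers its input.

```swift
import Metal

// Hypothetical deep-queue variant: enqueue ~10 chunks (~30 ms) of kernels
// ahead of time; each kernel spins until the CPU marks its input ready.
let processChunkSource = """
#include <metal_stdlib>
using namespace metal;

kernel void process_chunk(device atomic_uint *ready [[buffer(0)]],
                          device float       *audio [[buffer(1)]],
                          uint                tid   [[thread_position_in_grid]]) {
    // Spin until the CPU has delivered this chunk's input. This keeps the
    // queue deep without doing speculative work on stale data.
    while (atomic_load_explicit(ready, memory_order_relaxed) == 0) { }

    // ... real per-sample audio processing would go here ...
    audio[tid] *= 0.5; // placeholder
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: processChunkSource, options: nil)
let pipeline = try! device.makeComputePipelineState(
    function: library.makeFunction(name: "process_chunk")!)
let queue = device.makeCommandQueue()!

let chunkCount = 10          // ~10 x 3 ms = ~30 ms queued ahead
let samplesPerChunk = 128
var readyFlags: [MTLBuffer] = []

for _ in 0..<chunkCount {
    let ready = device.makeBuffer(length: 4, options: .storageModeShared)!
    ready.contents().storeBytes(of: UInt32(0), as: UInt32.self)
    readyFlags.append(ready)

    let audio = device.makeBuffer(length: samplesPerChunk * 4,
                                  options: .storageModeShared)!

    let cmd = queue.makeCommandBuffer()!
    let enc = cmd.makeComputeCommandEncoder()!
    enc.setComputePipelineState(pipeline)
    enc.setBuffer(ready, offset: 0, index: 0)
    enc.setBuffer(audio, offset: 0, index: 1)
    enc.dispatchThreads(MTLSize(width: samplesPerChunk, height: 1, depth: 1),
                        threadsPerThreadgroup: MTLSize(width: 32, height: 1, depth: 1))
    enc.endEncoding()
    cmd.commit()
}

// In the audio callback, as each 3 ms chunk of input arrives: copy it into
// the matching audio buffer, then release that chunk's kernel:
readyFlags[0].contents().storeBytes(of: UInt32(1), as: UInt32.self)
```

The open question is whether a kernel that is merely spinning counts as "saturated" for the clock heuristic; based on my first post above, it probably would.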
Topic: Graphics & Games SubTopic: Metal
Nov ’24