I have a complex CAS loop with branches that essentially implements a mutex, and I'm porting it from CUDA to Metal. I'm looking for the Metal equivalent of CUDA's __threadfence():
=> docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions
Unfortunately, metal::threadgroup_barrier(metal::mem_flags::mem_device) also implies execution synchronization, so it has to be reached ("seen") by all threads or the kernel deadlocks.
I need to have one store to buffer A happen before another store to buffer B.
The atomic memory order options come down to just one: "relaxed".
How can I get this ordering guarantee?
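
To make the requirement concrete, here is a minimal sketch of the ordering I'm after, written as host-side C++ rather than shader code. It is only an analogy: std::atomic_thread_fence stands in for __threadfence(), and buffer_a, buffer_b, writer, reader and run_once are hypothetical names made up for this illustration, not part of my kernel or of any Metal API.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-ins for the two device buffers.
int buffer_a = 0;               // payload, must be written first
std::atomic<int> buffer_b{0};   // flag, must be written second

// Writer: the store to A has to be visible before the store to B.
void writer() {
    buffer_a = 42;                                        // store to buffer A
    std::atomic_thread_fence(std::memory_order_release);  // plays the role of __threadfence()
    buffer_b.store(1, std::memory_order_relaxed);         // store to buffer B
}

// Reader: once it observes B, it must also observe A.
int reader() {
    while (buffer_b.load(std::memory_order_relaxed) == 0) {
        std::this_thread::yield();                        // spin until the flag is set
    }
    std::atomic_thread_fence(std::memory_order_acquire);  // pairs with the writer's fence
    return buffer_a;
}

// Runs one writer and one reader; returns what the reader saw in buffer A.
int run_once() {
    buffer_a = 0;
    buffer_b.store(0, std::memory_order_relaxed);
    int seen = -1;
    std::thread r([&] { seen = reader(); });
    std::thread w(writer);
    w.join();
    r.join();
    return seen;
}
```

With the two fences in place, run_once() returns 42 because the reader can only get past the flag after the payload store is visible. Note that the atomic accesses themselves are all relaxed; the fence alone carries the ordering, which is the part I can't find in Metal without also getting an execution barrier.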