Memory ordering

I have a complex CAS loop with branches that essentially implements a mutex, and I'm porting it from CUDA to Metal. I'm looking for the equivalent of CUDA's __threadfence().

=> docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions

Unfortunately, metal::threadgroup_barrier(metal::mem_flags::mem_device) also implies execution synchronization: it has to be reached by all threads in the threadgroup, otherwise the kernel deadlocks.

I need to have one store to buffer A happen before another store to buffer B.
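To make the requirement concrete, here is a host-side C++ analogy, not Metal code. It is only a sketch of the ordering I am after: `buffer_a`, `buffer_b`, `writer`, `reader` and `run_once` are hypothetical names, and the release/acquire fences stand in roughly for the job __threadfence() does in the CUDA version.

```cpp
#include <atomic>
#include <thread>

// Hypothetical stand-ins for the two device buffers in the question.
static int buffer_a = 0;               // payload, written with a plain store
static std::atomic<int> buffer_b{0};   // flag, written with a relaxed atomic store

// Writer: store to A, then a fence, then store to B.
// The release fence keeps the store to A from being reordered after the store to B.
void writer() {
    buffer_a = 42;
    std::atomic_thread_fence(std::memory_order_release);
    buffer_b.store(1, std::memory_order_relaxed);
}

// Reader: spin until B is set, then a fence, then read A.
// The acquire fence pairs with the writer's release fence.
int reader() {
    while (buffer_b.load(std::memory_order_relaxed) == 0) {
        // busy-wait; no barrier that other threads would have to reach
    }
    std::atomic_thread_fence(std::memory_order_acquire);
    return buffer_a;
}

// Runs one writer/reader pair and returns the value the reader saw in A.
int run_once() {
    buffer_a = 0;
    buffer_b.store(0, std::memory_order_relaxed);
    int seen = -1;
    std::thread r([&seen] { seen = reader(); });
    std::thread w(writer);
    w.join();
    r.join();
    return seen;
}
```

Once the reader has observed the store to B, it must also observe the earlier store to A. The fence orders memory only; unlike a barrier, no other thread has to execute it.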

Only one atomic memory order option is available: "relaxed".

How can I get this ordering guarantee?



What does this undocumented builtin do?

__metal_atomic_fence(int, int, int) -> air.atomic.fence(i32 0, i32 0, i32 0)


Hi, did you solve that problem?
