I posted this question to StackOverflow. Perhaps it is better suited here where Apple developers are more likely to see it.
I was looking through the sample project linked from the page "Selecting Device Objects for Compute Processing" in the Metal documentation (https://developer.apple.com/documentation/metal/gpu_selection_in_macos/selecting_device_objects_for_compute_processing). There, I noticed a clever use of threadgroup memory that I hope to adopt in my own particle simulator. Before I do so, however, I need to understand a particular aspect of threadgroup memory and what the developers are doing in this scenario.
The code contains a segment like so:
```metal
// In AAPLKernels.metal

// Parameter of the kernel
threadgroup float4* sharedPosition [[threadgroup(0)]]

// Body
...
// For each particle / body
for(i = 0; i < params.numBodies; i += numThreadsInGroup)
{
    // Because sharedPosition uses the threadgroup address space, 'numThreadsInGroup' elements
    // of sharedPosition will be initialized at once (not just one element at lid as it
    // may look like)
    sharedPosition[threadInGroup] = oldPosition[sourcePosition];

    j = 0;
    while(j < numThreadsInGroup)
    {
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
    } // while

    sourcePosition += numThreadsInGroup;
} // for
```
In particular, I found the comment just before the assignment to sharedPosition (the one starting with "Because...") confusing. I haven't read anywhere that threadgroup memory writes happen simultaneously across all threads in a threadgroup; in fact, I thought a barrier would be needed before reading from the shared memory pool again to avoid undefined behavior, since *each* thread subsequently reads from the entire pool of threadgroup memory after the assignment (the assignment being a write, of course). Why is a barrier unnecessary here?
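For concreteness, here is a sketch of the pattern I would have expected instead, reusing the same names as the sample (`sharedPosition`, `threadInGroup`, `numThreadsInGroup`, etc.); `threadgroup_barrier(mem_flags::mem_threadgroup)` is the Metal Shading Language call for synchronizing threadgroup memory:

```metal
// What I expected: synchronize around the shared write (sketch only).
sharedPosition[threadInGroup] = oldPosition[sourcePosition];

// Wait until every thread in the group has published its element
// before any thread starts reading the whole pool.
threadgroup_barrier(mem_flags::mem_threadgroup);

j = 0;
while(j < numThreadsInGroup)
{
    acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
}

// And again before the next outer-loop iteration overwrites sharedPosition.
threadgroup_barrier(mem_flags::mem_threadgroup);

sourcePosition += numThreadsInGroup;
```

The sample code has neither barrier, which is exactly what I am trying to understand.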