Hi - I'm trying to implement a Barnes-Hut N-body simulation in Metal. The code requires construction of a tree, and the CUDA implementation uses locks to allow insertion of new nodes into the tree.
I've tried using an array of atomic ints in a test case, but this doesn't seem to work:
kernel void binning_compute_function(
        device MyArgument *arg1 [[ buffer(0) ]],
        constant float *ranarr [[ buffer(1) ]],
        device volatile atomic_int *flagArr [[ buffer(2) ]],
        device int *bins [[ buffer(3) ]],
        uint index [[ thread_position_in_grid ]]) {
    int ibin = (ranarr[index] * arg1->nbins);
    for (int i = 0; i < 100000000; i++) {
        // Try to take the lock: swap the flag for this bin from 0 to 1.
        // (Weak CAS may fail spuriously, which is fine inside a retry loop.)
        int expected = 0;
        if (atomic_compare_exchange_weak_explicit(&flagArr[ibin], &expected, 1,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
            // Lock acquired: do the critical section, then release the lock.
            bins[ibin] += 1;
            atomic_store_explicit(&flagArr[ibin], 0, memory_order_relaxed);
            break;
        }
    }
}
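As an alternative I've been wondering whether, for simple binning at least, the lock could be dropped entirely by making the bins themselves atomic, so each thread just does one atomic increment. A sketch only (assuming the same MyArgument struct and buffer layout as above):

```metal
#include <metal_stdlib>
using namespace metal;

kernel void binning_lockfree(
        device MyArgument *arg1 [[ buffer(0) ]],
        constant float *ranarr [[ buffer(1) ]],
        device atomic_int *bins [[ buffer(2) ]],
        uint index [[ thread_position_in_grid ]]) {
    int ibin = (ranarr[index] * arg1->nbins);
    // No lock or spin loop needed: the increment itself is atomic,
    // so concurrent threads hitting the same bin cannot lose updates.
    atomic_fetch_add_explicit(&bins[ibin], 1, memory_order_relaxed);
}
```

This only works when the critical section is a single atomic-friendly update, of course - tree insertion as in Barnes-Hut needs more than that.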
Any other suggestions? The alternative is to use the CPU for this, but it seems a shame to miss out on the processing power of the GPU.
Thank you,
Colin
Hi - not sure this is strictly a Metal issue, but I'm having problems with a memory leak.
When I create a buffer to use with the GPU, then bind the contents so that I can access the data, the created buffer seems to stay in memory even once the variables intensityPointer and intensityBuff are no longer in scope.
I tried calling deallocate on the pointer, but this caused an error too.
Is there a standard way of managing such memory, or accessing the buffer in a different way which will allow the memory to be released when no longer used?
Thank you,
Colin
let intensityBuff = myGPUData.device?.makeBuffer(length: MemoryLayout<Float>.stride * Int(myStars.nstars * myStars.npatch * myStars.npatch), options: .storageModeShared)
// capacity is a count of Float elements, not a byte count
let intensityPointer = intensityBuff?.contents().bindMemory(to: Float.self, capacity: Int(myStars.nstars * myStars.npatch * myStars.npatch))
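One pattern I have seen suggested is to wrap the buffer's creation and use in an autoreleasepool, so that any autoreleased Metal objects are actually freed when the block ends, rather than deallocating anything by hand (the pointer from contents() is owned by the buffer and must never be deallocated). A sketch only - the function, device and count parameters here are placeholders:

```swift
import Metal

func fillIntensities(device: MTLDevice, count: Int) {
    autoreleasepool {
        // The buffer is created, used, and becomes eligible for release
        // when the pool drains at the end of this block.
        guard let buff = device.makeBuffer(length: MemoryLayout<Float>.stride * count,
                                           options: .storageModeShared) else { return }
        let ptr = buff.contents().bindMemory(to: Float.self, capacity: count)
        for i in 0..<count { ptr[i] = 0 }
        // Do not deallocate ptr: the MTLBuffer owns that memory.
    }
}
```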
Hi - I've been developing code to determine star shapes using the GPU.
The code passes a patch of intensity values near the star and then uses a Levenberg-Marquardt algorithm to get a least squares fit.
I can get the code to work well for up to five stars at a time provided I pass an array of size 15x15. However, if I pass more than 5 stars, I get a GPU timeout error.
If I change the size of the array to 16x16, or 14x14 then even one star will cause a GPU timeout error.
The Levenberg-Marquardt algorithm does use lots of if statements - about 10 per loop, and the loop is executed multiple times. Is there a limit to the number of if statements allowed?
I use one threadgroup with a single thread per star.
Is there any way to debug this to work out what causes the problem? In one run I did get a slightly different error:
Stack Overflow Exception. Please check the [MTLComputePipelineDescriptor maxCallStackDepth] setting.
The subroutine does call routines which call others, so that may be an issue - I think the maximum call depth is 2.
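Given that error, the setting it refers to can be raised when building the pipeline state, along these lines (a sketch only - the function name "fit_star_shapes" and the depth value are placeholders for my actual kernel):

```swift
import Metal

func makeFitPipeline(device: MTLDevice, library: MTLLibrary) throws -> MTLComputePipelineState {
    let desc = MTLComputePipelineDescriptor()
    desc.computeFunction = library.makeFunction(name: "fit_star_shapes")
    // Allow headroom for the nested subroutine calls in the kernel.
    desc.maxCallStackDepth = 4
    return try device.makeComputePipelineState(descriptor: desc,
                                               options: [],
                                               reflection: nil)
}
```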
Any other thoughts gratefully received.
Colin
Hi - I'm trying to work out whether it is better to use a single Metal function in a compute pipeline, or whether to split the function into multiple parts.
For context, my code involves tracing NRAYS rays for each of NSTARS star locations through a reflecting telescope and calculating the resulting star shapes as they hit the detector.
I can trace all of the rays for each star in a single Metal compute function (one thread per ray, so NRAYS * NSTARS threads) - but then, for each star, I need to work out the average location of the rays as they hit the detector. I could try making NRAYS equal to maxTotalThreadsPerThreadgroup, and then use threadgroup_barrier to ensure that all rays for a star have been traced before averaging.
Alternatively, I could break the code into several parts, though still all in one command encoder. That way I can vary the number of rays as I wish (I may need more than 1024 rays to get a good star shape).
However, this is my first experience of programming GPGPU code and don't yet have any feel for the relative timings for each method. If the first method was going to be a lot faster, I could accept the constraint of not being able to vary the number of rays.
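To make the first option concrete, the per-star averaging I have in mind would be a threadgroup reduction along these lines (a sketch only - Ray, trace_ray and the buffer layout are placeholders, and it assumes one threadgroup per star with NRAYS <= 1024 threads):

```metal
#include <metal_stdlib>
using namespace metal;

kernel void trace_and_average(
        device const Ray *rays        [[ buffer(0) ]],
        device float2 *starCentroids  [[ buffer(1) ]],
        uint tid   [[ thread_index_in_threadgroup ]],
        uint star  [[ threadgroup_position_in_grid ]],
        uint nrays [[ threads_per_threadgroup ]]) {
    threadgroup float2 hits[1024];   // one slot per ray for this star
    // Each thread traces one ray of this star to the detector.
    hits[tid] = trace_ray(rays[star * nrays + tid]);
    // Wait until every ray for this star has been traced.
    threadgroup_barrier(mem_flags::mem_threadgroup);
    // Thread 0 averages the hit positions (a serial loop for clarity;
    // a tree-style parallel reduction would be faster).
    if (tid == 0) {
        float2 sum = float2(0.0);
        for (uint i = 0; i < nrays; i++) sum += hits[i];
        starCentroids[star] = sum / float(nrays);
    }
}
```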
Any advice gratefully received.
Thank you!
Colin