The answer is usually no. The basic idea behind modern GPUs is that they hide memory latency with a lot of parallelism: each core is designed to execute several threadgroups at the same time. Some hardware (I'm not sure about Apple Silicon, but I know several CUDA devices that behave this way) wants several threadgroups per core AND several simdgroups/warps per threadgroup to reach peak efficiency.
My understanding is that you proved this yourself: you got 250 GFLOPS when asking for 32 kB of threadgroup memory and 400 GFLOPS when asking for 8 kB. That memory request translates into "only one threadgroup may run on each core" vs. "up to 4 threadgroups may run on each core".
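For context, here is a minimal sketch of what that request looks like on the host side, assuming the kernel receives the memory as a [[threadgroup(0)]] argument; the encoder, pipeline, and dispatch sizes are placeholders, not anything from your post:

```swift
import Metal

// A sketch of how the threadgroup memory request is made when encoding a
// compute dispatch. The fewer bytes each threadgroup asks for, the more
// threadgroups the driver can keep resident on a core at once.
func encodeKernel(encoder: MTLComputeCommandEncoder,
                  pipeline: MTLComputePipelineState,
                  threadgroupMemoryBytes: Int,
                  threadgroupsPerGrid: MTLSize,
                  threadsPerThreadgroup: MTLSize) {
    encoder.setComputePipelineState(pipeline)
    // This is the "memory request" referred to above.
    encoder.setThreadgroupMemoryLength(threadgroupMemoryBytes, index: 0)
    encoder.dispatchThreadgroups(threadgroupsPerGrid,
                                 threadsPerThreadgroup: threadsPerThreadgroup)
}
```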
If I were you, I would immediately try 4 kB and 16 kB as well, i.e. up to 8 and up to 2 threadgroups per core, and check the performance.
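As a back-of-the-envelope check, here is the occupancy arithmetic those numbers imply; the 32 kB per-core budget is an assumption derived from your own figures, not something queried from Metal:

```swift
// Rough estimate: each core has ~32 kB of threadgroup memory to divide
// among resident threadgroups, so occupancy is budget / request.
let perCoreThreadgroupMemory = 32 * 1024
for requestKB in [4, 8, 16, 32] {
    let maxResidentThreadgroups = perCoreThreadgroupMemory / (requestKB * 1024)
    print("\(requestKB) kB per threadgroup -> up to \(maxResidentThreadgroups) threadgroups per core")
}
```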