
Reply to Processing AVCaptureVideoDataOutput video stream with appleLog and HLG_BT2020 AVCaptureColorSpace input
Hello! I don't have experience with this particular situation. However, I have spent a lot of time working with GPUs and video (either from a live camera or from a decoder), so this is just a guess; please take it with a grain of salt.

Apple Metal does support textures with more than 8 bits per channel, for example MTLPixelFormatRGB10A2Unorm or MTLPixelFormatBGRA10_XR. However, I don't think that Metal supports typical video (YUV) formats with more than 8 bits per channel, such as an H.265 stream using the Main10 profile with 10-bit depth. Hence (I am guessing) the short path from a high-quality video frame to a Metal texture via MTKTextureLoader is probably not for you.

What I would do instead is get access to the real data in memory and write a Metal kernel performing manual color loading and conversion, for example from HLG_BT2020 directly into MTLPixelFormatRGB10A2Unorm. Some six years have passed since I worked with raw video and Metal, but I seem to remember that you can go the CGImage -> CVPixelBuffer -> raw data pointer route to get something that a Metal kernel (or shader, for that matter) can read, and then write or render to a texture the usual way.
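To make that concrete, here is a minimal sketch of such a conversion kernel in Metal Shading Language. All names are made up; it assumes a 10-bit biplanar, video-range source with samples packed in the high bits of 16-bit words (as in kCVPixelFormatType_420YpCbCr10BiPlanarVideoRange) — the strides, the bit packing, and the matrix constants are all things to verify against your actual buffers, and HLG linearisation is left out:

#include <metal_stdlib>
using namespace metal;

// Hypothetical sketch: convert a 10-bit biplanar YCbCr frame
// (BT.2020, video range) into an RGB texture. The values written out
// are still HLG-encoded; apply the HLG inverse OETF if you need
// linear light.
kernel void ycbcr10ToRGB(device const ushort *yPlane    [[buffer(0)]],
                         device const ushort *cbcrPlane [[buffer(1)]],
                         constant uint &yStride         [[buffer(2)]], // in ushorts
                         constant uint &cStride         [[buffer(3)]], // in ushorts
                         texture2d<float, access::write> outRGB [[texture(0)]],
                         uint2 gid [[thread_position_in_grid]])
{
    if (gid.x >= outRGB.get_width() || gid.y >= outRGB.get_height())
        return;

    // 10-bit samples assumed left-justified in 16-bit words -- verify!
    float yv  = float(yPlane[gid.y * yStride + gid.x] >> 6);
    uint2 c   = gid / 2; // 4:2:0 chroma subsampling
    float cbv = float(cbcrPlane[c.y * cStride + c.x * 2 + 0] >> 6);
    float crv = float(cbcrPlane[c.y * cStride + c.x * 2 + 1] >> 6);

    // Video-range expansion: Y in [64, 940], CbCr in [64, 960].
    float y  = (yv  -  64.0) / 876.0;
    float cb = (cbv - 512.0) / 896.0;
    float cr = (crv - 512.0) / 896.0;

    // BT.2020 YCbCr -> RGB (Kr = 0.2627, Kb = 0.0593).
    float3 rgb = float3(y + 1.4746 * cr,
                        y - 0.16455 * cb - 0.57135 * cr,
                        y + 1.8814 * cb);

    outRGB.write(float4(saturate(rgb), 1.0), gid);
}

On the host side, the two plane pointers could come straight from CVPixelBufferGetBaseAddressOfPlane after locking the buffer with CVPixelBufferLockBaseAddress, which matches the raw-data-pointer route described above.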
Topic: Media Technologies, SubTopic: Streaming
Apr ’25
Reply to Threadgroup memory for fragment shader
Feeling a bit like I'm in an echo chamber, but anyway: I was able to "hack" it and perform a pretty efficient reduce over tile memory from the fragment shader in the following way.

[[simdgroup_index_in_threadgroup]] is not available, but I computed my own out of [[pixel_position_in_tile]]. The Metal spec talks a bit about how simdgroups are laid out when processing the tile, and the layout can also be checked by writing thread indices to a memory buffer. Having the simdgroup index, one can perform a simdgroup-wide reduce, then write each simdgroup's result to a temporary buffer (which needs enough space for all the simdgroups in the tile; in my case 32x32 / 32 = 32). An extra compute kernel is then dispatched to read the temporary buffer and perform the final reduction of the 32 values of each tile into the final tile value.

Overall, this is still a win, because I can use the data that is already loaded into tile memory and avoid the need to re-read it for the reduction. BUT it is a pity that it doesn't work as it should. The Metal specification, version 3.2, mentions the usage of threadgroup memory in fragment functions several times. For example, on page 86 it says "See the Metal Feature Set Tables to learn which GPUs support threadgroup space arguments for fragment shaders". Yet I couldn't find such an entry in the Feature Set Tables. This is an omission/bug in what is otherwise an excellent feature of Metal and Apple GPUs. I hope it can be rectified some day...
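For reference, here is roughly what the two stages look like in Metal Shading Language. This is a sketch with made-up names; in particular, the row-major simdgroup layout over the 32x32 tile is an assumption you should verify by dumping thread indices, as described above:

#include <metal_stdlib>
using namespace metal;

constant uint TILE_W = 32;
constant uint TILE_H = 32;
constant uint SIMD_SIZE = 32;
constant uint PARTIALS_PER_TILE = TILE_W * TILE_H / SIMD_SIZE; // 32

// Stage 1 helper: a home-made replacement for the unavailable
// [[simdgroup_index_in_threadgroup]], assuming threads walk the tile
// row-major in simdgroup-sized runs.
static uint simdgroupIndexInTile(ushort2 pos)
{
    return (uint(pos.y) * TILE_W + uint(pos.x)) / SIMD_SIZE;
}

// Used inside the tile/fragment stage roughly like this:
//   float v = ...;          // value taken from tile memory
//   float s = simd_sum(v);  // simdgroup-wide reduce
//   if (simd_is_first())
//       partials[tileId * PARTIALS_PER_TILE + simdgroupIndexInTile(pos)] = s;

// Stage 2: the extra compute kernel that folds the 32 partial results of
// each tile into one value. Dispatch with one 32-thread threadgroup per tile.
kernel void finalizeTileReduce(device const float *partials [[buffer(0)]],
                               device float *tileResults    [[buffer(1)]],
                               uint tile [[threadgroup_position_in_grid]],
                               uint lane [[thread_index_in_threadgroup]])
{
    float total = simd_sum(partials[tile * PARTIALS_PER_TILE + lane]);
    if (lane == 0)
        tileResults[tile] = total;
}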
Topic: Graphics & Games, SubTopic: Metal
Apr ’25
Reply to Why slower with larger threadgroup memory?
This is normal. The thing is, every GPU core has only so much threadgroup memory; in the case of newer Apple Silicon GPUs, that's 32 KB. Now, if you are asking that a threadgroup be allocated 32 KB, you are simply telling the GPU "this one threadgroup is going to use the whole threadgroup memory available on the core". In effect, you are limiting the number of concurrently executing threadgroups to one. Smaller allocations will increase that number: with 16 KB, for example, the core will be able to run two independent threadgroups, with 8 KB four, and so on.

Of course, the performance of your code also depends on the size of the threadgroup memory it can use. So optimisation in this case would be about finding the sweet spot where your code can still use a reasonable amount of threadgroup memory in every threadgroup, but at the same time the GPU can run enough threadgroups concurrently.
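As an illustration (a sketch, not your actual kernel): if the scratch space is declared as a threadgroup argument, its size is chosen from the host with setThreadgroupMemoryLength(_:index:), so you can sweep 4/8/16/32 KB over the same shader and measure:

#include <metal_stdlib>
using namespace metal;

// Sketch: a plain tree reduction whose scratch size is set by the host via
// setThreadgroupMemoryLength(_:index:). Asking for 32 KB pins one resident
// threadgroup per core on a newer Apple Silicon GPU; 8 KB allows four.
kernel void blockSum(device const float *input  [[buffer(0)]],
                     device float *partialSums  [[buffer(1)]],
                     threadgroup float *scratch [[threadgroup(0)]], // host-sized
                     uint lid    [[thread_index_in_threadgroup]],
                     uint gid    [[thread_position_in_grid]],
                     uint tg     [[threadgroup_position_in_grid]],
                     uint tgSize [[threads_per_threadgroup]])
{
    scratch[lid] = input[gid];
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // classic tree reduce over the threadgroup
    for (uint stride = tgSize / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }
    if (lid == 0)
        partialSums[tg] = scratch[0];
}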
Apr ’25
Reply to Why slower with larger threadgroup memory?
The answer would be "usually, no". The basic idea underlying modern GPUs is that they try to get around memory latency by having A LOT of parallelism. The cores are usually designed to execute several threadgroups at the same time. Some hardware (not sure about Apple Silicon, but I have known several CUDA machines like this) wants you to have several threadgroups AND several simdgroups/warps in each threadgroup to reach peak efficiency.

My understanding is that you even proved this yourself: you got 250 GFLOPS when asking for 32 KB and 400 GFLOPS when asking for 8 KB of threadgroup memory. That memory request translates into "I want only one threadgroup to run on each core" vs "I am OK with up to 4 threadgroups running on each core". If I were you, I would immediately try with 4 KB and 16 KB, i.e. with up to 8 and up to 2 threadgroups, then check performance; the arithmetic is spelled out below.
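The arithmetic behind those numbers, written out (32 KB per core is an assumption for newer Apple Silicon; register pressure or thread-count limits can cap occupancy earlier):

// Resident threadgroups per core = total threadgroup memory / allocation.
constexpr unsigned kTGMemPerCore = 32 * 1024;

constexpr unsigned residentThreadgroups(unsigned allocBytes)
{
    return kTGMemPerCore / allocBytes;
}

static_assert(residentThreadgroups(32 * 1024) == 1, "32 KB -> 1 threadgroup");
static_assert(residentThreadgroups(16 * 1024) == 2, "16 KB -> 2 threadgroups");
static_assert(residentThreadgroups( 8 * 1024) == 4, " 8 KB -> 4 threadgroups");
static_assert(residentThreadgroups( 4 * 1024) == 8, " 4 KB -> 8 threadgroups");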
Apr ’25