Performance Optimization for Large-Kernel Image Processing

Question

Created Jun ’26

This post is from the WWDC26 Metal Q&A.

I am processing large images where each output pixel depends on a large neighborhood of surrounding pixels. As a result, the shader performs a very high number of texture sampling operations, which appears to cause cache misses and becomes a performance bottleneck.

Since neighboring threads often process adjacent pixels, many of the sampled pixels overlap between threads. Although each thread operates on a slightly different output pixel, a large portion of the texture accesses are effectively identical.

Does Metal provide mechanisms that allow neighboring threads to share or synchronize intermediate results in order to reduce redundant texture fetches?
Are there recommended approaches for exploiting data reuse across threads, for example through threadgroup memory or other Metal-specific features?
In this type of workload, how effective is texture gathering (gather) for reducing sampling overhead, especially when only the RGB channels of an RGBA texture are required?
Would using gather generally improve cache utilization and performance in this scenario?
When using gather, what is the preferred way to handle texture borders and edge conditions without introducing per-thread branching (e.g., explicit if statements)?

Any recommendations for optimizing large-radius neighborhood operations in Metal would be greatly appreciated.

Answer 1

Graphics and Games Engineer OP

Apple

Jun ’26

Recommended

Classically, threadgroup memory has been used to cache tiles of input so that the same data is not read multiple times across threads, and you can apply this technique to reduce the number of texture reads. However, simd_shuffle_and_fill_down/up was introduced with A15 Bionic (see Discover advances in Metal for A15 Bionic at about 16m44s) to help permute data across threads without needing to allocate, write, and read threadgroup memory.

In Metal 4.1, read() can take a sampler to handle edge conditions, and there are multi-pixel block_read() methods for reading larger blocks of texture data (see https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf).

As for gather(), the operation still needs to populate the cache lines containing the r, g, b, and a channel data, even if you only need r, g, and b.

Before Metal 4.1, how you handle borders and edge conditions depends on how you access the texture: with texture reads you need to predicate manually, whereas with texture sampling you can configure the sampler's addressing.
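To make the tile-caching idea concrete, here is a rough CPU-side sketch in C++. It is not Metal shader code and not from the original answer; it only models the access pattern. The names `Image`, `fetch`, `boxSumNaive`, and `boxSumTiled` are hypothetical, a plain box sum stands in for whatever large-radius filter you actually run, and the local buffer stands in for threadgroup memory. Borders are handled by clamping coordinates rather than branching, which is analogous to a clamp-to-edge sampler address mode.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU model of a single-channel texture that counts fetches.
struct Image {
    int width, height;
    std::vector<float> data;
    mutable std::size_t reads = 0;  // number of simulated texture fetches

    Image(int w, int h)
        : width(w), height(h), data(static_cast<std::size_t>(w) * h, 0.0f) {}

    // Clamp-to-edge fetch: out-of-range coordinates are clamped, so callers
    // need no explicit bounds checks.
    float fetch(int x, int y) const {
        ++reads;
        x = std::min(std::max(x, 0), width - 1);
        y = std::min(std::max(y, 0), height - 1);
        return data[y * width + x];
    }
};

// Naive version: every output pixel fetches its full (2r+1) x (2r+1) window.
std::vector<float> boxSumNaive(const Image& src, int r) {
    assert(r >= 0);
    std::vector<float> out(src.data.size());
    for (int y = 0; y < src.height; ++y)
        for (int x = 0; x < src.width; ++x) {
            float sum = 0.0f;
            for (int dy = -r; dy <= r; ++dy)
                for (int dx = -r; dx <= r; ++dx)
                    sum += src.fetch(x + dx, y + dy);
            out[y * src.width + x] = sum;
        }
    return out;
}

// Tiled version: each tile x tile block of outputs (one "threadgroup") first
// loads its padded input region once, then filters from the local copy.
std::vector<float> boxSumTiled(const Image& src, int r, int tile) {
    assert(r >= 0 && tile > 0);
    const int padded = tile + 2 * r;
    std::vector<float> out(src.data.size());
    std::vector<float> local(static_cast<std::size_t>(padded) * padded);
    for (int ty = 0; ty < src.height; ty += tile)
        for (int tx = 0; tx < src.width; tx += tile) {
            // Load phase: one fetch per texel of the padded tile.
            for (int ly = 0; ly < padded; ++ly)
                for (int lx = 0; lx < padded; ++lx)
                    local[ly * padded + lx] =
                        src.fetch(tx + lx - r, ty + ly - r);
            // On a GPU, a threadgroup barrier would separate the two phases.
            // Compute phase: all neighborhood reads hit the local buffer.
            const int yEnd = std::min(ty + tile, src.height);
            const int xEnd = std::min(tx + tile, src.width);
            for (int y = ty; y < yEnd; ++y)
                for (int x = tx; x < xEnd; ++x) {
                    float sum = 0.0f;
                    for (int dy = 0; dy <= 2 * r; ++dy)
                        for (int dx = 0; dx <= 2 * r; ++dx)
                            sum += local[(y - ty + dy) * padded + (x - tx + dx)];
                    out[y * src.width + x] = sum;
                }
        }
    return out;
}
```

For a 64×64 image with radius 4 and 16×16 tiles, the naive path issues 64 × 64 × 81 = 331,776 fetches, while the tiled path issues 16 × 24 × 24 = 9,216, a 36× reduction in this model. The ratio shrinks as the radius grows relative to the tile size, because the padded border becomes a larger share of each tile, so tile dimensions are worth tuning against the threadgroup memory available on your target GPU.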