I am processing large images where each output pixel depends on a large neighborhood of surrounding pixels. As a result, the shader performs a very high number of texture sampling operations per pixel, which appears to cause texture cache misses and has become the performance bottleneck.
Since neighboring threads usually process adjacent pixels, their sampling neighborhoods overlap heavily. Although each thread computes a different output pixel, most of the texels it reads are also read by the threads next to it.
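To put a rough number on that overlap, here is a small C++ sketch. It is plain CPU-side arithmetic, not Metal code, and the 16×16 threadgroup size, the square neighborhood shape, and the radius of 8 are assumptions chosen only for illustration; `naiveFetches` and `uniqueTexels` are hypothetical helper names.

```cpp
// Texel fetches issued by a tile x tile group of threads when every thread
// reads its own full (2*radius + 1)^2 neighborhood independently.
constexpr long naiveFetches(long tile, long radius) {
    return tile * tile * (2 * radius + 1) * (2 * radius + 1);
}

// Distinct texels the same group actually touches: the tile itself plus a
// halo of `radius` texels on every side.
constexpr long uniqueTexels(long tile, long radius) {
    return (tile + 2 * radius) * (tile + 2 * radius);
}

// Assumed example: 16x16 threads, radius 8 (a 17x17 neighborhood).
static_assert(naiveFetches(16, 8) == 73984, "256 threads * 289 texels each");
static_assert(uniqueTexels(16, 8) == 1024, "32 * 32 texel footprint");
```

Under these assumptions the group issues 73,984 fetches for only 1,024 distinct texels, so each texel is requested about 72 times on average. That ratio is the redundancy the questions below are trying to eliminate.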
- Does Metal provide mechanisms that allow neighboring threads to share or synchronize intermediate results in order to reduce redundant texture fetches?
- Are there recommended approaches for exploiting data reuse across threads, for example through threadgroup memory or other Metal-specific features?
- In this type of workload, how effective is the texture gather operation (gather) at reducing sampling overhead, especially when only the RGB channels of an RGBA texture are required?
- Would using gather generally improve cache utilization and performance in this scenario?
- When using gather, what is the preferred way to handle texture borders and edge conditions without introducing per-thread branching (e.g., explicit if statements)?
Any recommendations for optimizing large-radius neighborhood operations in Metal would be greatly appreciated.