Hi everyone,
I’ve been developing a custom, end-to-end 3D rendering engine called Crescent from scratch using C++20 and Metal-cpp (targeting macOS and visionOS). My primary goal is to build a zero-bottleneck, GPU-driven pipeline that maximizes the potential of Apple Silicon’s Unified Memory and TBDR architecture.
While the fundamental systems are stable, I am looking for architectural feedback from Metal framework engineers regarding specific synchronization and latency challenges.
Current Core Implementations:
GPU-Driven Instance Culling: High-performance occlusion culling using a Hierarchical Z-Buffer (HZB) approach via Compute Shaders.
Clustered Forward Shading: Support for high-count dynamic lights through view-space clustering.
Temporal Stability: Custom TAA with history rejection and Motion Blur resolve.
Asset Infrastructure: Robust GUID-based scene serialization and a JSON-driven ECS hierarchy.
The Architectural Challenge:
I am currently seeing slight synchronization overhead when generating the HZB mip-chain. On Apple Silicon, I am evaluating the cost of encoder transitions versus cache-friendly barriers.
const bool canBuildHzb = m_depthTexture && m_hzbInitPipeline && m_hzbDownsamplePipeline && !m_hzbMipViews.empty();
if (canBuildHzb) {
    // Pass 1: reduce the scene depth into HZB mip 0.
    MTL::ComputeCommandEncoder* hzbInit = commandBuffer->computeCommandEncoder();
    hzbInit->setComputePipelineState(m_hzbInitPipeline);
    hzbInit->setTexture(m_depthTexture, 0);
    hzbInit->setTexture(m_hzbMipViews[0], 1);
    if (m_pointClampSampler) {
        hzbInit->setSamplerState(m_pointClampSampler, 0);
    } else if (m_linearClampSampler) {
        hzbInit->setSamplerState(m_linearClampSampler, 0);
    }
    const uint32_t hzbWidth = m_hzbMipViews[0]->width();
    const uint32_t hzbHeight = m_hzbMipViews[0]->height();
    const uint32_t threads = 8;
    MTL::Size tgSize = MTL::Size(threads, threads, 1);
    // dispatchThreads supports non-uniform threadgroups on Apple GPUs, so the
    // grid can be the exact texture size; no manual round-up is needed.
    hzbInit->dispatchThreads(MTL::Size(hzbWidth, hzbHeight, 1), tgSize);
    hzbInit->endEncoding();

    // Passes 2..N: downsample each mip from the previous one.
    // Each iteration currently pays for its own encoder begin/end.
    for (size_t mip = 1; mip < m_hzbMipViews.size(); ++mip) {
        MTL::Texture* src = m_hzbMipViews[mip - 1];
        MTL::Texture* dst = m_hzbMipViews[mip];
        if (!src || !dst) {
            continue;
        }
        MTL::ComputeCommandEncoder* downEncoder = commandBuffer->computeCommandEncoder();
        downEncoder->setComputePipelineState(m_hzbDownsamplePipeline);
        downEncoder->setTexture(src, 0);
        downEncoder->setTexture(dst, 1);
        downEncoder->dispatchThreads(MTL::Size(dst->width(), dst->height(), 1), tgSize);
        downEncoder->endEncoding();
    }

    // Finally, run instance culling against the completed HZB.
    if (m_instanceCullHzbPipeline) {
        dispatchInstanceCulling(m_instanceCullHzbPipeline, true);
    }
}
My Questions:
Encoder Synchronization: Would you recommend moving this loop into a single ComputeCommandEncoder with a texture-scope memory barrier (memoryBarrier(MTL::BarrierScopeTextures)) between dispatches to maintain L2 cache residency, or is the overhead of separate encoders negligible for depth downsampling on TBDR? (A sketch of the single-encoder variant follows these questions.)
visionOS Bindless Latency: For stereo rendering on visionOS, what are the best practices for managing MTL4ArgumentTable updates at 90Hz+? I want to ensure that updating bindless resources for each eye doesn't introduce unnecessary CPU-to-GPU latency.
Memory Management: Are there specific hints for Memoryless textures that could be applied to intermediate HZB levels to save bandwidth during this process?
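To make the first question concrete, the single-encoder variant I am weighing would look roughly like this (a sketch only, against the same members as above; the barrier scope is my assumption, and the mip-0 init pass is elided):

// One encoder for the whole mip chain; a texture-scope memory barrier
// between dispatches replaces the per-mip encoder boundaries.
MTL::ComputeCommandEncoder* enc = commandBuffer->computeCommandEncoder();
enc->setComputePipelineState(m_hzbDownsamplePipeline);
for (size_t mip = 1; mip < m_hzbMipViews.size(); ++mip) {
    enc->setTexture(m_hzbMipViews[mip - 1], 0);
    enc->setTexture(m_hzbMipViews[mip], 1);
    enc->dispatchThreads(MTL::Size(m_hzbMipViews[mip]->width(),
                                   m_hzbMipViews[mip]->height(), 1),
                         MTL::Size(8, 8, 1));
    // Make this mip's writes visible to the next dispatch's reads.
    enc->memoryBarrier(MTL::BarrierScopeTextures);
}
enc->endEncoding();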
I’ve attached a screenshot of a scene rendered with the engine (PBR, SSR, and IBL).
Hi. I'm a 3D designer, using Blender for most of my work. At the most recent Blender conference, the developers discussed utilizing Open Shading Language (OSL) in the latest versions, which allows designers to write custom shaders for their workflows.
At the moment, only NVIDIA OptiX GPUs can utilize this language for rendering (from what I understand), but the Blender developers stated they are waiting on other GPU manufacturers to implement this feature as well. I'm not sure if there are any licensing issues here, but would this be something Apple could implement in Metal to make their hardware more attractive to the 3D design community?
Any help or knowledge on this topic would be greatly appreciated.
Unable to find intelgpu_kbl_gt2r0 slice or a compatible one in binary archive 'file:///System/Library/PrivateFrameworks/IconRendering.framework/Resources/binary.metallib'
available slices: applegpu_g13g, applegpu_g13s, applegpu_g13d, applegpu_g14g, applegpu_g14s, applegpu_g14d, applegpu_g15g, applegpu_g15s, applegpu_g15d, applegpu_g16g, applegpu_g16s, applegpu_g17g, applegpu_g15g, applegpu_g15s, applegpu_g15d, applegpu_g16s
Is this related to the performance of applications on Intel Macs in macOS 26.2?
Hello Apple Developers and users. I am writing to ask for help with performance settings for my MacBook. I recently installed macOS Tahoe 26.2, and it has been very glitchy and laggy, both when gaming and when just using my Mac normally. I have tried using an FPS unlocker and downloading Metal 4, but the FPS unlocker hasn't worked at all; I am still stuck at the normal 60 FPS and need some advice. Thank you. Kind regards, Zachary
I'm currently learning Metal. While reading the reference, I came across a strange description.
Page 78 in Version 4 Reference (2025-10-25) says:
It is legal to call the following set_indices functions to set the indices if the position in the index buffer is valid and if the position in the index buffer is a multiple of 2 (uchar2 overload) or 2 (uchar4 overload). The index I needs to be in the range [0, max_indices).
void set_indices(uint I, uchar2 v);
void set_indices(uint I, uchar4 v);
However, it seems that the uchar4 overload should require a multiple of 4.
Furthermore, there is no explanation of what these methods actually do. I believe they set two or four consecutive indices at once, but there is no mention of that here.
I would like to know if the above understanding is correct.
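To make my understanding concrete, here is a minimal mesh shader sketch of how I believe the overloads behave (the quad geometry and the counts are my own example, not from the reference):

#include <metal_stdlib>
using namespace metal;

struct MeshVertex {
    float4 position [[position]];
};

// 4 vertices, 2 triangles => max_indices = 3 * 2 = 6.
using QuadMesh = metal::mesh<MeshVertex, void, 4, 2, metal::topology::triangle>;

[[mesh]] void quad_mesh(QuadMesh m,
                        uint tid [[thread_index_in_threadgroup]])
{
    if (tid != 0) return;
    m.set_primitive_count(2);

    MeshVertex v;
    v.position = float4(-1, -1, 0, 1); m.set_vertex(0, v);
    v.position = float4( 1, -1, 0, 1); m.set_vertex(1, v);
    v.position = float4( 1,  1, 0, 1); m.set_vertex(2, v);
    v.position = float4(-1,  1, 0, 1); m.set_vertex(3, v);

    // uchar4 overload: writes indices I..I+3 at once, so (as I read it)
    // I should be a multiple of 4, despite the reference saying "2".
    m.set_indices(0, uchar4(0, 1, 2, 2));
    // uchar2 overload: writes indices I..I+1; I must be a multiple of 2.
    m.set_indices(4, uchar2(3, 0));
}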
The sample code just draws a triangle and samples a texture.
Both samples draw the triangle and sample the texture as expected, and there are no error messages in the terminal.
The sample using a constexpr sampler captures and replays fine.
The sample using an argument table to bind an MTLSamplerState crashes during Metal capture and replay in Xcode.
Here is the sample code.
Sample Code
Test Environment:
M1 Pro
macOS 26.3 (25D125)
Xcode Version 26.2 (17C52)
Feedback ID: FB22031701
Setup: MSAA rendering using a memoryless texture as the color attachment (render_image) and a "normal" texture as the resolve attachment (resolve_image). MTL_DEBUG_LAYER / API validation is enabled for this.
When trying to add the memoryless texture to a residency set, I get the following error:
-[MTLDebugResidencySet validateResource:], line 114: error 'residency sets do not support memoryless resources.
Which is as expected and identical to Metal 3.
However, if I don't add it to the residency set, I then get the following error when committing to the command queue:
-[MTL4DebugCommandQueue commit:count:options:], line 67: error 'Commit With Options Validation
Attachment texture (Label: render_image) used in command buffer (at index 0) is not added to any residency set on the command buffer or command queue.
So which way around is actually correct in Metal 4?
Either way, this makes the use of memoryless textures/attachments impossible right now when validation is enabled.
FWIW: when disabling all validation, either way seems to work just fine.
Tested on: M1 Max, macOS 26.3, Xcode 26.2 & 26.4b2
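For reference, the texture setup in question is roughly the following (a metal-cpp-style sketch; device, width, height, and the creation of the resolve texture are placeholders):

// Memoryless MSAA color attachment ("render_image"); lives only in tile memory.
MTL::TextureDescriptor* desc = MTL::TextureDescriptor::texture2DDescriptor(
    MTL::PixelFormatBGRA8Unorm, width, height, false);
desc->setTextureType(MTL::TextureType2DMultisample);
desc->setSampleCount(4);
desc->setUsage(MTL::TextureUsageRenderTarget);
desc->setStorageMode(MTL::StorageModeMemoryless);
MTL::Texture* renderImage = device->newTexture(desc);

MTL::RenderPassDescriptor* pass = MTL::RenderPassDescriptor::renderPassDescriptor();
MTL::RenderPassColorAttachmentDescriptor* color = pass->colorAttachments()->object(0);
color->setTexture(renderImage);          // memoryless MSAA target
color->setResolveTexture(resolveImage);  // ordinary "resolve_image" texture
color->setLoadAction(MTL::LoadActionClear);
color->setStoreAction(MTL::StoreActionMultisampleResolve);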
I've been using Metal compute shaders for lattice quantum chromodynamics simulations and wanted to share the experience in case others are doing scientific computing on Metal.
The workload involves SU(2) matrix operations on 4D lattice grids — lots of 2x2 and 3x3 complex matrix multiplies, reductions over lattice sites, and nearest-neighbor stencil operations. The implementation bridges a C++ scientific framework (Grid) to Metal via Objective-C++ .mm files, with MSL kernels compiled into .metallib archives during the build.
Things that work well:
Shared memory on M-series eliminates the CPU↔GPU copy overhead that dominates in CUDA workflows
The .metallib compilation integrates cleanly with autotools builds using xcrun
Float4 packing for SU(2) matrices maps naturally to MSL vector types (sketched below)
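To illustrate that packing: an SU(2) element can be stored as a unit quaternion in a single float4, and matrix multiplication becomes the Hamilton product. A minimal MSL sketch (the sign convention and the function name are mine):

#include <metal_stdlib>
using namespace metal;

// SU(2) element packed as a unit quaternion q = (w, x, y, z), i.e.
// U = w*I - i*(x*sigma_1 + y*sigma_2 + z*sigma_3) in one common convention.
// SU(2) matrix multiplication then maps to the Hamilton product.
inline float4 su2_mul(float4 a, float4 b)
{
    return float4(
        a.x * b.x - a.y * b.y - a.z * b.z - a.w * b.w,  // scalar part
        a.x * b.y + a.y * b.x + a.z * b.w - a.w * b.z,
        a.x * b.z - a.y * b.w + a.z * b.x + a.w * b.y,
        a.x * b.w + a.y * b.z - a.z * b.y + a.w * b.x);
}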
Things I'm still figuring out:
Optimal threadgroup sizes for stencil operations on 4D grids
Whether to use MTLHeap for gauge field storage or stick with individual buffers
Best practices for double precision — some measurements need float64 but Metal's double support varies by hardware
The application is measuring chromofield flux distributions between static quarks, ultimately targeting multi-quark systems. Production runs are on MacBook Pro M-series and Mac Studio.
Code: https://github.com/ThinkOffApp/multiquark-lattice-qcd
Hello everyone! I am trying to wrap a ViewModifier inside a Swift Package that bundles a metal shader file to be used in the modifier. Everything works as expected in the Preview, in the Simulator and on a real device for iOS. It also works in Preview and in the Simulator for tvOS but not on a real AppleTV. I have tried this on a 4th generation Apple TV running tvOS 26.3 using Xcode 26.2.0.
The metallib is processed and exists in the bundle. Xcode logs the following:
Compiler failed to build request
precondition failure: pipeline error: custom_effect-fg2a5cia7fmha4: error: unresolved visible function reference: custom_fn
Reason: visible function not loaded
(The same three lines repeat five more times.)
Contents of Package.swift:
import PackageDescription

let package = Package(
    name: "Test",
    platforms: [
        .iOS(.v17),
        .tvOS(.v17)
    ],
    products: [
        .library(
            name: "Test",
            targets: [
                "Test"
            ]
        )
    ],
    targets: [
        .target(
            name: "Test",
            resources: [
                .process("Shaders")
            ]
        ),
        .testTarget(
            name: "TestTests",
            dependencies: [
                "Test"
            ]
        )
    ]
)
Content of my metal file:
#include <metal_stdlib>
using namespace metal;

[[ stitchable ]] float2 complexWave(float2 position, float time, float2 size, float speed, float strength, float frequency) {
    float2 normalizedPosition = position / size;
    float moveAmount = time * speed;
    position.x += sin((normalizedPosition.x + moveAmount) * frequency) * strength;
    position.y += cos((normalizedPosition.y + moveAmount) * frequency) * strength;
    return position;
}
And my ViewModifier:
import MetalKit
import SwiftUI

extension ShaderFunction {
    static let complexWave: ShaderFunction = {
        ShaderFunction(
            library: .bundle(.module),
            name: "complexWave"
        )
    }()
}

extension Shader {
    static func complexWave(arguments: [Shader.Argument]) -> Shader {
        Shader(function: .complexWave, arguments: arguments)
    }
}

struct WaveModifier: ViewModifier {
    let start: Date = .now

    func body(content: Content) -> some View {
        TimelineView(.animation) { context in
            let delta = context.date.timeIntervalSince(start)
            content
                .visualEffect { view, proxy in
                    view.distortionEffect(
                        .complexWave(
                            arguments: [
                                .float(delta),
                                .float2(proxy.size),
                                .float(0.5),
                                .float(8),
                                .float(10)
                            ]
                        ),
                        maxSampleOffset: .zero
                    )
                }
        }
        .onAppear {
            let paths = Bundle.module.paths(forResourcesOfType: "metallib", inDirectory: nil)
            print(paths)
        }
    }
}

extension View {
    public func wave() -> some View {
        modifier(WaveModifier())
    }
}

#Preview {
    Image(systemName: "cart")
        .wave()
}
Any help is appreciated.
I think if your buffer is less than 4 KB it's recommended to use setVertexBytes. The question I have is: can I keep hammering on setVertexBytes as the primary method to issue multiple draw calls within a render pass, and rely on Metal to figure out how to orphan and replace the backing buffer?
A lot of the primitives I am drawing are less than 4 KB, and the process of wiring down larger segments of memory as individual buffers for each draw call seems to be a negative.
And it's just simpler to copy, submit, and forget about buffer synchronization.
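To make the pattern concrete, this is roughly how the draws are issued (a metal-cpp sketch; Primitive and its fields are placeholders for my own types):

struct DrawUniforms {
    simd::float4x4 mvp;
    simd::float4   color;
};  // well under the 4 KB limit

for (const Primitive& prim : primitives) {
    DrawUniforms u = { prim.mvp, prim.color };
    // Metal copies the bytes into the command stream, so there is no
    // MTLBuffer to allocate, orphan, or synchronize per draw.
    encoder->setVertexBytes(&u, sizeof(u), /*index*/ 1);
    encoder->setVertexBuffer(prim.vertexBuffer, /*offset*/ 0, /*index*/ 0);
    encoder->drawPrimitives(MTL::PrimitiveTypeTriangle,
                            NS::UInteger(0), prim.vertexCount);
}

If Metal handles the orphaning internally, this avoids maintaining a per-draw buffer ring entirely.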
I'm new to graphics and game design, and I just wanted to know whether a compute pipeline can be as efficient as a render pipeline for rasterization, with an explanation of how and why. Also, is it possible to manually perform rasterization (manipulating individual pixel data in a Metal texture yourself) but do it with a render pipeline?
I’m trying to use MTL4FXTemporalDenoisedScaler, and I’m seeing a crash during initialization even with a very simple sample app.
I created a minimal sample here:
https://github.com/tatsuya-ogawa/MetalFXInitExample
The exception is:
NSException: "-[AGXG16XFamilyHeap baseObject]: unrecognized selector sent to instance ..."
What I found is:
• This works:
descriptor.makeTemporalDenoisedScaler(device: device)
• This crashes:
descriptor.makeTemporalDenoisedScaler(device: device, compiler: metal4Compiler)
So the issue seems to happen only with the Metal4FX version.
For testing, I’m using an iPhone 15 Pro.
According to the Metal Feature Set Tables, MetalFX denoised upscaling should be supported on Apple9 and later, so I believe the device itself should meet the requirements.
Reference:
https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf
Has anyone seen this before, or knows what might be causing it?
I’d appreciate any advice.
Thanks.
In my project I need to do the following:
1. At runtime, create a Metal dynamic library from source.
2. At runtime, create a Metal executable library from source and link it with the previously created dynamic library.
3. Create a compute pipeline using the two libraries created above.
But I get the following error at the third step:
Error Domain=AGXMetalG15X_M1 Code=2 "Undefined symbols:
_Z5noisev, referenced from: OnTheFlyKernel
" UserInfo={NSLocalizedDescription=Undefined symbols:
_Z5noisev, referenced from: OnTheFlyKernel
}
import Foundation
import Metal

class MetalShaderCompiler {
    let device = MTLCreateSystemDefaultDevice()!
    var pipeline: MTLComputePipelineState!

    func compileDylib() -> MTLDynamicLibrary {
        let source = """
        #include <metal_stdlib>
        using namespace metal;

        half3 noise() {
            return half3(1, 0, 1);
        }
        """

        let option = MTLCompileOptions()
        option.libraryType = .dynamic
        option.installName = "@executable_path/libFoundation.metallib"

        let library = try! device.makeLibrary(source: source, options: option)
        let dylib = try! device.makeDynamicLibrary(library: library)
        return dylib
    }

    func compileExlib(dylib: MTLDynamicLibrary) -> MTLLibrary {
        let source = """
        #include <metal_stdlib>
        using namespace metal;

        extern half3 noise();

        kernel void OnTheFlyKernel(texture2d<half, access::read> src [[texture(0)]],
                                   texture2d<half, access::write> dst [[texture(1)]],
                                   ushort2 gid [[thread_position_in_grid]]) {
            half4 rgba = src.read(gid);
            rgba.rgb += noise();
            dst.write(rgba, gid);
        }
        """

        let option = MTLCompileOptions()
        option.libraryType = .executable
        option.libraries = [dylib]

        let library = try! self.device.makeLibrary(source: source, options: option)
        return library
    }

    func runtime() {
        let dylib = self.compileDylib()
        let exlib = self.compileExlib(dylib: dylib)

        let pipelineDescriptor = MTLComputePipelineDescriptor()
        pipelineDescriptor.computeFunction = exlib.makeFunction(name: "OnTheFlyKernel")
        pipelineDescriptor.preloadedLibraries = [dylib]

        pipeline = try! device.makeComputePipelineState(descriptor: pipelineDescriptor, options: .bindingInfo, reflection: nil)
    }
}
I am building a MacOS desktop app (https://anukari.com) that is using Metal compute to do real-time audio/DSP processing, as I have a problem that is highly parallelizable and too computationally expensive for the CPU.
However, it seems that with the way I am using the GPU, the OS never increases the power/performance state, even when my app is fully compute-limited. Because this is a real-time audio synthesis application, not being able to take advantage of the full clock speeds the GPU is capable of is a huge problem: the app can't keep up with real-time.
I discovered this issue while profiling the app using Instrument's Metal tracing (and Game tracing) modes. In the profiling configuration under "Metal Application" there is a drop-down to select the "Performance State." If I run the application under Instruments with Performance State set to Maximum, it runs amazingly well, and all my problems go away.
For comparison, when I run the app on its own, outside of Instruments, the expensive GPU computation it's doing takes around 2x as long to complete, meaning that the app performs half as well.
I've done a ton of work to micro-optimize my Metal compute code, based on every scrap of information from the WWDC videos, etc. A problem I'm running into is that I think that the more efficient I make my code, the less it signals to the OS that I want high GPU clock speeds!
I think part of why the OS is confused is that in most use cases, my computation can be done using only a small number of Metal threadgroups. I'm guessing that the OS heuristics see that only a small fraction of the GPU is saturated and fail to scale up the power/clock state.
I'm not sure what to do here; I'm in a bit of a bind. One possibility is that I intentionally schedule busy work -- spin threadgroups just to waste energy and signal to the OS that I need higher clock speeds. This is obviously a really bad idea, but it might work.
Is there any other (better) way for my app to signal to the OS that it is doing real-time latency-sensitive computation on the GPU and needs the clock speeds to be scaled up?
Note that game mode is not really an option, as my app also runs as an AU plugin inside hosts like Garageband, so it can't be made fullscreen, etc.
In this video, tile fragment shading is recommended for image processing. In this example, the unpack function takes two arguments, one of which is RasterizerData. As I understand it, this is the data passed to us from the previous stage (Vertex) of the graphics pipeline.
However, the properties of MTLTileRenderPipelineDescriptor do not include an option for specifying a Vertex function. Therefore, in this render pass, a mix of commands is used: first, a draw command is executed to obtain UV coordinates, and then threads are dispatched.
My question is: without using a draw command, only dispatch, how can I get pixel coordinates in the fragment tile function? For the kernel tile function, everything is clear.
typedef struct
{
    float4 OPTexture       [[ color(0) ]];
    float4 IntermediateTex [[ color(1) ]];
} FragmentIO;

fragment FragmentIO Unpack(RasterizerData in [[ stage_in ]],
                           texture2d<float, access::sample> srcImageTexture [[texture(0)]])
{
    FragmentIO out;
    // ...
    // Run necessary per-pixel operations
    out.OPTexture = float4(0.0);        // assign computed value here
    out.IntermediateTex = float4(0.0);  // assign computed value here
    return out;
}
I'm implementing optimized matmul on metal: https://github.com/crynux-ai/metal-matmul/blob/main/metal/1_shared_mem.metal
I notice that performance is significantly different with different threadgroup memory set in
[computeEncoder setThreadgroupMemoryLength]
All other lines are exactly the same; the only difference is this parameter.
Matmul performance is roughly 250 GFLOPS if I set 32768 (the max bytes allowed on this M1 Max),
but 400 GFLOPS if I set 8192.
Why does this happen? How can I optimize it?
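My current guess is occupancy: threadgroup memory is carved out of per-core on-chip memory, so requesting the 32 KB maximum may cap how many threadgroups can be resident per GPU core, while 8 KB leaves room for several and hides memory latency better. A quick sanity check (a metal-cpp-style sketch; pipeline, device, and computeEncoder are the objects from the code above):

// Compare what the kernel statically declares against what is requested
// dynamically; the sum is what limits per-core threadgroup residency.
NS::UInteger staticLen = pipeline->staticThreadgroupMemoryLength();
NS::UInteger deviceMax = device->maxThreadgroupMemoryLength();
printf("static threadgroup mem: %lu, device max: %lu\n",
       (unsigned long)staticLen, (unsigned long)deviceMax);

// Dynamic allocation at threadgroup argument index 0, as in the code above.
computeEncoder->setThreadgroupMemoryLength(8192, 0);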
Hello,
Thank you for attending today’s Metal & game technologies group lab at WWDC25!
We were delighted to answer many questions from developers and energized by the community engagement.
We hope you enjoyed it and welcome your feedback.
We invite you to carry on the conversation here, particularly if your question appeared in Slido and we were unable to answer it during the lab.
If your question received feedback, let us know if you need clarification.
You may want to ask your question again in a different lab, e.g., visionOS tomorrow.
(We realize that this can be confusing when frameworks interoperate.)
We have a lot to learn from each other so let’s get to Q&A and make the best of WWDC25! 😃
Looking forward to your questions posted in new threads.
I'm a newbie at Vulkan and Xcode.
I have my project on github https://github.com/flocela/OrangeSpider/
Whenever I run, two windows open instead of only one.
I added testing, which means I have an OrangeSpider.xctestplan in the OrangeSpider/TestsOrangeSpider/ folder.
This is my first time adding testing to an Xcode project, so I think this may be where the problem is.
I also get this error message:
ViewBridge to RemoteViewService Terminated: Error Domain=com.apple.ViewBridge Code=18 "(null)" UserInfo={com.apple.ViewBridge.error.hint=this process disconnected remote view controller -- benign unless unexpected, com.apple.ViewBridge.error.description=NSViewBridgeErrorCanceled}
Hi,
What's the best way to handle drastic changes in scene characteristics with the new MTLFXTemporalDenoisedScaler?
Let's say a visible object in the scene radically changes its material properties. I can modify the albedo and roughness textures accordingly, but I suspect the history will be corrupted: blending visual information between the new frame and the previous ones might be nonsense.
I guess the problem is the same when objects appear or disappear instantly.
Does the upscaler manage these events for us (by lowering blending), or should we use the reactive mask or the denoise strength mask to handle them?
Description:
In the official visionOS 26 Hover Effect sample code project, I encountered an issue where the event.trackingAreaIdentifier returned by onSpatialEvent does not reset as expected.
Steps to Reproduce:
1. Select an object with trackingAreaID = 6 in the sample app.
2. Look at a blank space (outside any tracking area) and perform a pinch gesture.
Expected Behavior:
The event.trackingAreaIdentifier should return 0 when interacting with a non-tracking area.
Actual Behavior:
The event.trackingAreaIdentifier still returns 6, even after restarting the app or killing the process. This persists regardless of where the pinch gesture is performed.