Render advanced 3D graphics and perform data-parallel computations using graphics processors using Metal.

Metal Documentation

Posts under Metal subtopic

Post

Replies

Boosts

Views

Created

Unable to profile Metal app on M2 Ultra (profiling works on M3 Pro)
On MacBook Pro M3 14" I can profile the Metal App performance by running it, then clicking on the M icon and choosing profile after replay. On Mac Studio M2 Ultra I cannot: the profiler starts and crashes. I have tried everything including reinstalling the OS, Xcode, the Metal SDK, you name it. The app uses the Metal 4 API. The content of the replayer errorinfo report is shown at the end. Any ideas what is going on here and/or what else I can do do root cause this and fix it? FWIW, it was worse on 26.1 (Xcode just reported Metal 4 profiling not available). In 26.2 Xcode attempts to profile and invariably crashes. === Error summary: === 1x DYErrorDomain (512) - guest app crashed (512) 1x com.apple.gputools.MTLReplayer (100) - Abort trap: 6 === First Error === Domain: DYErrorDomain Error code: 512 Description: guest app crashed (512) GTErrorKeyPID: 26913 GTErrorKeyProcessName: GPUToolsReplayService GTErrorKeyCrashDate: 2026-01-09 19:22:52 +0000 === Underlying Error #1 === Domain: com.apple.gputools.MTLReplayer Error code: 100 Description: Abort trap: 6 Call stack: 0 GPUToolsReplay 0x0000000249c25850 MakeNSError + 284 1 GPUToolsReplay 0x0000000249c26428 HandleCrashSignal + 252 2 libsystem_platform.dylib 0x00000001856c7744 _sigtramp + 56 3 libsystem_pthread.dylib 0x00000001856bd888 pthread_kill + 296 4 libsystem_c.dylib 0x00000001855c2850 abort + 124 5 libsystem_c.dylib 0x00000001855c1a84 err + 0 6 IOGPU 0x00000001a9ea60a8 -[IOGPUMetal4CommandQueue _commit:count:commitFeedback:].cold.1 + 0 7 IOGPU 0x00000001a9ea0df8 __77-[IOGPUMetal4CommandQueue commitFillArgs:count:args:argsSize:commitFeedback:]_block_invoke + 0 8 IOGPU 0x00000001a9ea1004 -[IOGPUMetal4CommandQueue _commit:count:commitFeedback:] + 148 9 AGXMetalG14X 0x00000001158a2c98 -[AGXG14XFamilyCommandQueue_mtlnext noMergeCommit:count:options:commitFeedback:error:] + 116 10 AGXMetalG14X 0x0000000115a45c14 +[AGXG14XFamilyRenderContext_mtlnext mergeRenderEncoders:count:options:commitFeedback:queue:error:] + 4740 11 AGXMetalG14X 0x00000001158a2b34 -[AGXG14XFamilyCommandQueue_mtlnext commit:count:options:] + 96 12 GPUToolsReplay 0x0000000249bf0644 GTMTLReplayController_defaultDispatchFunction_noPinning + 2744 13 GPUToolsReplay 0x0000000249befb10 GTMTLReplayController_defaultDispatchFunction + 1368 14 GPUToolsReplay 0x0000000249b7a61c _ZL16DispatchFunctionP21GTMTLReplayControllerPK11GTTraceFuncRb + 476 15 GPUToolsReplay 0x0000000249b8603c ___ZN35GTUSCSamplingStreamingManagerHelper19StreamFrameTimeDataEv_block_invoke + 456 16 Foundation 0x0000000186f6c878 __NSBLOCKOPERATION_IS_CALLING_OUT_TO_A_BLOCK__ + 24 17 Foundation 0x0000000186f6c740 -[NSBlockOperation main] + 96 18 Foundation 0x0000000186f6c6d8 __NSOPERATION_IS_INVOKING_MAIN__ + 16 19 Foundation 0x0000000186f6c308 -[NSOperation start] + 640 20 Foundation 0x0000000186f6c080 __NSOPERATIONQUEUE_IS_STARTING_AN_OPERATION__ + 16 21 Foundation 0x0000000186f6bf70 __NSOQSchedule_f + 164 22 libdispatch.dylib 0x00000001855104d0 _dispatch_block_async_invoke2 + 148 23 libdispatch.dylib 0x000000018551aad4 _dispatch_client_callout + 16 24 libdispatch.dylib 0x00000001855056e4 _dispatch_continuation_pop + 596 25 libdispatch.dylib 0x0000000185504d58 _dispatch_async_redirect_invoke + 580 26 libdispatch.dylib 0x0000000185512fc8 _dispatch_root_queue_drain + 364 27 libdispatch.dylib 0x0000000185513784 _dispatch_worker_thread2 + 180 28 libsystem_pthread.dylib 0x00000001856b9e10 _pthread_wqthread + 232 29 libsystem_pthread.dylib 0x00000001856b8b9c start_wqthread + 8 Replayer breadcrumbs: [ ] GTErrorKeyProcessSignal: SIGABRT === Setup === Capture device: star.localdomain (Mac14,14) - macOS 26.2 (25C56) - 0BA10D1D-D340-5F2E-934B-536675AF9BA1 Metal version: 370.64.2 Supported graphics APIs: Metal device: Apple M2 Ultra Supported GPU families: Apple1 Apple2 Apple3 Apple4 Apple5 Apple6 Apple7 Apple8 Mac1 Mac2 Common1 Common2 Common3 Metal3 Metal4 Replay device: star (Mac14,14) - macOS 26.2 (25C56) - 0BA10D1D-D340-5F2E-934B-536675AF9BA1 Metal version: 370.64.2 Supported graphics APIs: Metal device: Apple M2 Ultra Supported GPU families: Apple1 Apple2 Apple3 Apple4 Apple5 Apple6 Apple7 Apple8 Mac1 Mac2 Common1 Common2 Common3 Metal3 Metal4 Host: Mac14,14 - macOS 26.2 (25C56) Tool: Xcode (17C52) Known SDKs:
1
0
342
3w
# [CRITICAL] Metal RHI Memory Leak - Resource exhaustion vulnerability (CWE-400) - Bug Report
[CRITICAL] Metal API Memory Leak - Heap Memory Never Released to OS (CWE-400) Security Classification This issue constitutes a resource exhaustion vulnerability (CWE-400): Aspect Details Type Uncontrolled Resource Consumption CWE CWE-400 Vector Local (any Metal application) Impact System instability, denial of service User Control None - no mitigation available Recovery Requires application restart Summary Metal heap allocations are never released back to macOS, even when the memory is entirely unused. This causes continuous, unbounded memory growth until system instability or crash. The issue affects any application using Metal API heap allocation. This was discovered in Unreal Engine 5, but reproduces in a completely blank UE5 project with zero application code - confirming this is Metal framework behavior, not application-level. Environment OS: macOS Tahoe 26.2 Hardware: Apple Silicon M4 Max (also reproduced on M1, M2, M3) API: Metal Reproduction Steps Run any Metal application that allocates and deallocates GPU buffers via Metal heaps Open Activity Monitor and observe the application's memory usage Let the application run idle (no user interaction required) Observe memory growing continuously at ~1-2 MB per second Memory never plateaus or stabilizes Eventually system becomes unstable For testing: Any Unreal Engine 5.4+ project on macOS will reproduce this. Even a blank project with no gameplay code exhibits the leak. (Tested on UE 5.7.1) Observed Behavior Memory Analysis Using Unreal's memreport -full command, two reports taken 86 seconds apart: Metric Report 1 (183s) Report 2 (269s) Delta Process Physical 4373.64 MB 4463.39 MB +89.75 MB Metal Heap Buffer 7168 MB 8192 MB +1024 MB Unused Heap 3453 MB 4477 MB +1024 MB Object Count 73,840 73,840 0 (no change) Key Finding Metal Heap grew by exactly 1 GB while "Unused Heap" also grew by 1 GB. This demonstrates: Metal is allocating new heap blocks in ~1 GB increments Previously allocated heap memory becomes "unused" but is never released The unused memory accumulates indefinitely No application-level objects are leaking (count remains constant) Memory Growth Pattern Continuous growth while idle (no user interaction) Growth rate: approximately 1-2 MB per second No plateau or stabilization occurs Metal allocates new 1 GB heap blocks rather than reusing freed space Eventually leads to system instability and crash What is NOT Causing This We verified the following are NOT the source: Application objects - Object count remains constant Application code - Blank project with no code reproduces the issue Texture streaming - Disabling texture streaming had no effect CPU garbage collection - Running GC has no effect (this is GPU memory) Mitigations Attempted (None Worked) setPurgeableState Setting resources to purgeable state before release: [buffer setPurgeableState:MTLPurgeableStateEmpty]; Result: Metal ignores this hint and does not reclaim heap memory. Avoiding Heap Pooling Forcing individual buffer allocations instead of heap-based pooling. Result: Leak persists - Metal still manages underlying allocations. Aggressive Buffer Compaction Attempting to compact/defragment buffers within heaps every frame. Result: Only moves data between existing heaps. Does NOT release heaps back to OS. Reducing Pool Sizes Minimizing all buffer pool sizes to force more frequent reuse. Result: Slightly slows the leak rate but does not stop it. Root Cause Analysis How Metal Heap Allocation Appears to Work Metal allocates GPU heap blocks in large chunks (~1 GB observed) Application requests buffers from these heaps When application releases buffers, memory becomes "unused" within the heap Metal does NOT release heap blocks back to macOS, even when entirely unused When fragmentation prevents reuse, Metal allocates new heap blocks Result: Continuous memory growth with no upper bound The Core Problem There appears to be no Metal API to force heap memory release. The only way to reclaim this memory is to destroy the Metal device entirely, which requires restarting the application. Expected Behavior Metal should: Release unused heaps - When a heap block is entirely unused, release it back to macOS Respect purgeable hints - Honor setPurgeableState calls from applications Compact allocations - Defragment heap allocations to reduce fragmentation Provide control APIs - Allow applications to request heap compaction or release Enforce limits - Have configurable maximum heap memory consumption Security Implications Local Denial of Service - Any Metal application can exhaust system memory, causing instability affecting all running applications Memory Pressure Attack - Forces other applications to swap to disk, degrading system-wide performance No Upper Bound - Memory consumption continues until system failure Unmitigable - End users have no way to prevent or limit the leak Affects All Metal Apps - Any application using Metal heaps is potentially affected Impact Applications become unstable after extended use System-wide performance degrades as memory pressure increases Users must periodically restart applications Developers cannot work around this at the application level Long-running applications (games, creative tools, servers) are particularly affected Request Investigate Metal heap memory management behavior Implement heap release when blocks become entirely unused Honor setPurgeableState hints from applications Consider providing an API for applications to request heap compaction Document any intended behavior or workarounds Additional Notes This issue has been observed across multiple Unreal Engine versions (5.4, 5.7) and multiple Apple Silicon generations (M1 through M4). The behavior is consistent and reproducible. The Unreal Engine team has implemented various CVars to attempt mitigation (rhi.Metal.HeapBufferBytesToCompact, rhi.Metal.ResourcePurgeInPool, etc.) but none successfully address the issue because the root cause is at the Metal framework level. Tested: January 2026 Platform: macOS Tahoe 26.2, Apple Silicon (M1/M2/M3/M4)
5
2
907
3w
Memory leak when no draw calls issued to encoder
I noticed that when the render command encoder adds no draw calls an apps memory usage seems to grow unboundedly. Using a super simple MTKView-based drawing with the following delegate (code at end). If I add the simplest of draw calls, e.g., a single vertex, the app's memory usage is normal, around 100-ish MBs. I am attaching a couple screenshot, one from Xcode and one from Instruments. What's going on here? Is this an illegal program? If yes, why does it not crash, such as if the encode or command buffer weren't ended. Or is there some race condition at play here due to the lack of draws? class Renderer: NSObject, MTKViewDelegate { var device: MTLDevice var commandQueue: MTL4CommandQueue var commandBuffer: MTL4CommandBuffer var allocator: MTL4CommandAllocator override init() { guard let d = MTLCreateSystemDefaultDevice(), let queue = d.makeMTL4CommandQueue(), let cmdBuffer = d.makeCommandBuffer(), let alloc = d.makeCommandAllocator() else { fatalError("unable to create metal 4 objects") } self.device = d self.commandQueue = queue self.commandBuffer = cmdBuffer self.allocator = alloc super.init() } func mtkView(_ view: MTKView, drawableSizeWillChange size: CGSize) {} func draw(in view: MTKView) { guard let drawable = view.currentDrawable else { return } commandBuffer.beginCommandBuffer(allocator: allocator) guard let descriptor = view.currentMTL4RenderPassDescriptor, let encoder = commandBuffer.makeRenderCommandEncoder( descriptor: descriptor ) else { fatalError("unable to create encoder") } encoder.endEncoding() commandBuffer.endCommandBuffer() commandQueue.waitForDrawable(drawable) commandQueue.commit([commandBuffer]) commandQueue.signalDrawable(drawable) drawable.present() } }
3
0
374
3w
Metal 4: Proper usage of requestResidency() with unique per-frame textures at 120fps
Hello, I have some confusion regarding ResidencySet. Specifically, about the requestResidency() function: how often should we call it? I have a captureOutput(_:didOutput:from:) method that is triggered at 60 or 120 fps. Inside this method, I am calling the following code every frame: computeResidencySet.removeAllAllocations() сomputeResidencySet.addAllocation(TextureA) computeResidencySet.addAllocation(TextureB) computeResidencySet.addAllocation(TextureC) computeResidencySet.commit() computeResidencySet.requestResidency() // Should we call it every frame? Please keep in mind that TextureA, TextureB, and TextureC are unique for each call (new instances are provided on every frame)."
1
0
465
1w
MetalFX FrameInterpolator assertion: `Color texture width mismatch from descriptor` even when all texture sizes match
I am integrating MetalFX FrameInterpolator into a custom Unity RenderGraph–based render pipeline (C++ native plugin + C# render passes), and I am hitting the following assertion at runtime: /MetalFXDebugError.h:29: failed assertion `Color texture width mismatch from descriptor' What makes this confusing is that all input/output textures have the correct width and height, and they exactly match the values specified in the MTLFXFrameInterpolatorDescriptor. Setup Input resolution: 1024 x 512 Output resolution: 2048 x 1024 MTLFXTemporalScaler is created first and then passed into MTLFXFrameInterpolator The TemporalScaler and FrameInterpolator descriptors use the same input/output sizes and formats All Metal textures: Have no parentTexture Are 2D textures Match the descriptor sizes exactly (verified via logging) Texture bindings at encode time frameInterpolator.colorTexture = mtlTexColor; // 1024 x 512 frameInterpolator.prevColorTexture = mtlTexPrevColor; // 1024 x 512 frameInterpolator.motionTexture = mtlTexMotion; // 1024 x 512 frameInterpolator.depthTexture = mtlTexDepth; // 1024 x 512 frameInterpolator.uiTexture = mtlTexUI; // 2048 x 1024 frameInterpolator.outputTexture = mtlTexOutput; // 2048 x 1024 All widths/heights are logged and match: Color : 1024 x 512 (input) PrevColor : 1024 x 512 (input) Motion : 1024 x 512 (input) Depth : 1024 x 512 (input) UI : 2048 x 1024 (output) Output : 2048 x 1024 (output) The TemporalScaler works correctly on its own. The assertion only occurs when using FrameInterpolator. Important detail about colorTexture Originally, colorTexture was copied from BuiltinRenderTextureType.CurrentActive. After reading that this might violate MetalFX semantics, I changed the pipeline so that: colorTexture now comes from a dedicated private RenderGraph texture It is not the backbuffer It is not a drawable It is not used as a final output It is created before UI rendering Despite this, the assertion still occurs. Question Can uiTexture for MTLFXFrameInterpolator legally come from a texture copied from BuiltinRenderTextureType.CurrentActive? More generally: Are there additional hidden constraints on colorTexture / prevColorTexture (such as Metal usage, storageMode, aliasing, or hazard tracking) that could cause this assertion, even when sizes match? Does FrameInterpolator require colorTexture and prevColorTexture to be created in a very specific way (e.g. non-aliased, ShaderRead usage, identical Metal resource properties)? Any clarification on the exact semantic requirements for colorTexture, prevColorTexture, or uiTexture in MetalFX FrameInterpolator would be greatly appreciated.
4
0
523
1w
Xcode Metal Trace
Code is download from apple official metal4 sample [https://developer.apple.com/documentation/metal/drawing-a-triangle-with-metal-4?language=objc] enable metal gpu trace in macOS schema and trace a frame in Xcode. Xcode may show segment fault on App from some 'GTTrace' function when click trace button. When replay a .gputrace file, Xcode may crash , throw an internal error or a XPC error. The example code using old metal-renderer can trace without any problem and everything works fine. Test Environment: Xcode Version 26.2 (17C52) macOS 26.2 (25C56) M1 Pro 16GB A2442
2
0
404
5d
Optimizing HZB Mip-Chain Generation and Bindless Argument Tables in a Custom Metal Engine
Hi everyone, I’ve been developing a custom, end-to-end 3D rendering engine called Crescent from scratch using C++20 and Metal-cpp (targeting macOS and visionOS). My primary goal is to build a zero-bottleneck, GPU-driven pipeline that maximizes the potential of Apple Silicon’s Unified Memory and TBDR architecture. While the fundamental systems are stable, I am looking for architectural feedback from Metal framework engineers regarding specific synchronization and latency challenges. Current Core Implementations: GPU-Driven Instance Culling: High-performance occlusion culling using a Hierarchical Z-Buffer (HZB) approach via Compute Shaders. Clustered Forward Shading: Support for high-count dynamic lights through view-space clustering. Temporal Stability: Custom TAA with history rejection and Motion Blur resolve. Asset Infrastructure: Robust GUID-based scene serialization and a JSON-driven ECS hierarchy. The Architectural Challenge: I am currently seeing slight synchronization overhead when generating the HZB mip-chain. On Apple Silicon, I am evaluating the cost of encoder transitions versus cache-friendly barriers. && m_hzbInitPipeline && m_hzbDownsamplePipeline && !m_hzbMipViews.empty(); if (canBuildHzb) { MTL::ComputeCommandEncoder* hzbInit = commandBuffer->computeCommandEncoder(); hzbInit->setComputePipelineState(m_hzbInitPipeline); hzbInit->setTexture(m_depthTexture, 0); hzbInit->setTexture(m_hzbMipViews[0], 1); if (m_pointClampSampler) { hzbInit->setSamplerState(m_pointClampSampler, 0); } else if (m_linearClampSampler) { hzbInit->setSamplerState(m_linearClampSampler, 0); } const uint32_t hzbWidth = m_hzbMipViews[0]->width(); const uint32_t hzbHeight = m_hzbMipViews[0]->height(); const uint32_t threads = 8; MTL::Size tgSize = MTL::Size(threads, threads, 1); MTL::Size gridSize = MTL::Size((hzbWidth + threads - 1) / threads * threads, (hzbHeight + threads - 1) / threads * threads, 1); hzbInit->dispatchThreads(gridSize, tgSize); hzbInit->endEncoding(); for (size_t mip = 1; mip < m_hzbMipViews.size(); ++mip) { MTL::Texture* src = m_hzbMipViews[mip - 1]; MTL::Texture* dst = m_hzbMipViews[mip]; if (!src || !dst) { continue; } MTL::ComputeCommandEncoder* downEncoder = commandBuffer->computeCommandEncoder(); downEncoder->setComputePipelineState(m_hzbDownsamplePipeline); downEncoder->setTexture(src, 0); downEncoder->setTexture(dst, 1); const uint32_t mipWidth = dst->width(); const uint32_t mipHeight = dst->height(); MTL::Size downGrid = MTL::Size((mipWidth + threads - 1) / threads * threads, (mipHeight + threads - 1) / threads * threads, 1); downEncoder->dispatchThreads(downGrid, tgSize); downEncoder->endEncoding(); } if (m_instanceCullHzbPipeline) { dispatchInstanceCulling(m_instanceCullHzbPipeline, true); } } My Questions: Encoder Synchronization: Would you recommend moving this loop into a single ComputeCommandEncoder using MTLBarrier between dispatches to maintain L2 cache residency, or is the overhead of separate encoders negligible for depth-downsampling on TBDR? visionOS Bindless Latency: For stereo rendering on visionOS, what are the best practices for managing MTL4ArgumentTable updates at 90Hz+? I want to ensure that updating bindless resources for each eye doesn't introduce unnecessary CPU-to-GPU latency. Memory Management: Are there specific hints for Memoryless textures that could be applied to intermediate HZB levels to save bandwidth during this process? I’ve attached a screenshot of a scene rendered with the engine (PBR, SSR, and IBL).
0
0
120
23h