Render advanced 3D graphics and perform data-parallel computations using graphics processors using Metal.

Metal Documentation

Posts under Metal subtopic

Post

Replies

Boosts

Views

Activity

Metal recommendedMaxWorkingSetSize vs actual RAM on iPhone (LLM load fails)
Context I’m deploying large language models on iPhone using llama.cpp. A new iPhone Air (12 GB RAM) reports a Metal MTLDevice.recommendedMaxWorkingSetSize of 8,192 MB, and my attempt to load Llama-2-13B Q4_K (~7.32 GB weights) fails during model initialization. Environment Device: iPhone Air (12 GB RAM) iOS: 26 Xcode: 26.0.1 Build: Metal backend enabled llama.cpp App runs on device (not Simulator) What I’m seeing MTLCreateSystemDefaultDevice().recommendedMaxWorkingSetSize == 8192 MiB Loading Llama-2-13B Q4_K (7.32 GB) fails to complete. Logs indicate memory pressure / allocation issues consistent with the 8 GB working-set guidance. Smaller models (e.g., 7B/8B with similar quantization) load and run (8B Q4_K provide around 9 tokens/second decoding speed). Questions Is 8,192 MB an expected recommendedMaxWorkingSetSize on a 12 GB iPhone? What values should I expect on other 2025 devices including iPhone 17 (8 GB RAM) and iPhone 17 Pro (12 GB RAM) Is it strictly enforced by Metal allocations (heaps/buffers), or advisory for best performance/eviction behavior? Can a process practically exceed this for long-lived buffers without immediate Jetsam risk? Any guidance for LLM scenarios near the limit?
0
0
598
Oct ’25
MPSMatrixRandom SEGFAULTs when ran in an async context
The following minimal snippet SEGFAULTS with SDK 26.0 and 26.1. Won't crash if I remove async from the enclosing function signature - but it's impractical in a real project. import Metal import MetalPerformanceShaders let SEED = UInt64(0x0) typealias T = Float16 /* Why ran in async context? Because global GPU object, and async makeMTLFunction, and async makeMTLComputePipelineState. Nevertheless, can trigger the bug without using global @MainActor let myGPU = MyGPU() */ @main struct CMDLine { static func main() async { let ptr = UnsafeMutablePointer<T>.allocate(capacity: 0) async let future: Void = randomFillOnGPU(ptr, count: 0) print("Main thread is playing around") await future print("Successfully reached the end.") } static func randomFillOnGPU(_ buf: UnsafeMutablePointer<T>, count destbufcount: Int) async { // let (device, queue) = await (myGPU.device, myGPU.commandqueue) let myGPU = MyGPU() let (device, queue) = (myGPU.device, myGPU.commandqueue) // Init MTLBuffer, async let makeFunction, makeComputePipelineState, etc. let tempDataType = MPSDataType.uInt32 let randfiller = MPSMatrixRandomMTGP32(device: device, destinationDataType: tempDataType, seed: Int(bitPattern:UInt(SEED))) print("randomFillOnGPU: successfully created MPSMatrixRandom.") // try await computePipelineState // ^ Crashes before this could return // Or in this minimal case, after randomFillOnGPU() returns // make encoder, set pso, dispatch, commit... } } actor MyGPU { let device : MTLDevice let commandqueue : MTLCommandQueue init() { guard let dev: MTLDevice = MPSGetPreferredDevice(.skipRemovable), let cq = dev.makeCommandQueue(), dev.supportsFamily(.apple6) || dev.supportsFamily(.mac2) else { print("Unable to get Metal Device! Exiting"); exit(EX_UNAVAILABLE) } print("Selected device: \(String(format: "%llX", dev.registryID))") self.device = dev self.commandqueue = cq print("myGPU: initialization complete.") } } See FB20916929. Apparently objc autorelease pool is releasing the wrong address during context switch (across suspension points). I wonder why such obvious case has not been caught before.
0
0
167
Nov ’25
Cannot load .mtlpackage to MTLLibrary
After watching WWDC 2025 session "Combine Metal 4 machine learning and graphics", I have decided to give it a shot to integrate the latest MTL4MachineLearningCommandEncoder to my existing render pipeline. After a lot of trial and errors, I managed to set up the pipeline and have the app compiled. However, I am now stuck on creating a MTLLibrary with .mtlpackage. Here is the code I have to create a MTLLibrary according the WWDC session https://developer.apple.com/videos/play/wwdc2025/262/?time=550: let coreMLFilePath = bundle.path(forResource: "my_model", ofType: "mtlpackage")! let coreMLURL = URL(string: coreMLFilePath)! do { metalDevice.makeLibrary(URL: coreMLURL) } catch { print("error: \(error)") } With the above code, I am getting error: Error Domain=MTLLibraryErrorDomain Code=1 "Invalid metal package" UserInfo={NSLocalizedDescription=Invalid metal package} What is the correct way to create a MTLLibrary with .mtlpackage? Do I see this error because the .mtlpackage I am using is incorrect? How should I go with debugging this? I'd really appreciate if I could get some help on this as I have been stuck with it for some time now. Thanks in advance!
0
0
252
Nov ’25
MetalFX for Unity 2022.3.62f3?
Hi, I’m testing Unity’s Spaceship HDRP demo on iPhone 17 Pro Max and iPad Pro M4 (iOS 26.1). Everything renders correctly, and my custom MetalFX Spatial plugin initializes successfully — it briefly reports active scaling (e.g. 1434×660 → 2868×1320 at 50% scaling), then reverts to native rendering a few frames later. Setup: Xcode 16.1 (targeting iOS 18) Unity 2022.3.62f3 (HDRP) Metal backend Dynamic Resolution enabled in HDRP assets and cameras Relevant Xcode console excerpt: [MetalFXPlugin] MetalFX_Enable(True) called. [SpaceshipOptions] MetalFX enabled with HDRP dynamic resolution integration. [SpaceshipOptions] Disabled TAA for MetalFX Spatial. [SpaceshipOptions] Created runtime RenderTexture: 1434x660 [MetalFX] Spatial scaler created (1434x660 → 2868x1320). [MetalFX] Processed frame with scaler. [MetalFXPlugin] Sent RenderTexture (1434x660) to MetalFX. Output target 2868x1320. [SpaceshipOptions] MetalFX target set: 1434x660 [SpaceshipOptions] Camera targetTexture cleared after MetalFX handoff. It looks like HDRP clears the camera’s target texture right after MetalFX submits the frame, which causes it to revert to native rendering. Is there a recommended way to persist or rebind the MetalFX output texture when using HDRP on iOS? Unity doesn’t appear to support MetalFX in the Editor either: Thanks!
0
0
262
Nov ’25
Deterministic RNG behaviour across Mac M1 CPU and Metal GPU – BigCrush pass & structural diagnostics
Hello, I am currently working on a research project under ENINCA Consulting, focused on advanced diagnostic tools for pseudorandom number generators (structural metrics, multi-seed stability, cross-architecture reproducibility, and complementary indicators to TestU01). To validate this diagnostic framework, I prototyped a small non-linear 64-bit PRNG (not as a goal in itself, but simply as a vehicle to test the methodology). During these evaluations, I observed something interesting on Apple Silicon (Mac M1): • bit-exact reproducibility between M1 ARM CPU and M1 Metal GPU, • full BigCrush pass on both CPU and Metal backends, • excellent p-values, • stable behaviour across multiple seeds and runs. This was not the intended objective, the goal was mainly to validate the diagnostic concepts, but these results raised some questions about deterministic compute behaviour in Metal. My question: Is there any official guidance on achieving (or expecting) deterministic RNG or compute behaviour across CPU ↔ Metal GPU on Apple Silicon? More specifically: • Are deterministic compute kernels expected or guaranteed on Metal for scientific workloads? • Are there recommended patterns or best practices to ensure reproducibility across GPU generations (M1 → M2 → M3 → M4)? • Are there known Metal features that can introduce non-determinism? I am not sharing the internal recurrence (this work is proprietary), but I can discuss the high-level diagnostic observations if helpful. Thank you for any insight, very interested in how the Metal engineering team views deterministic compute patterns on Apple Silicon. Pascal ENINCA Consulting
0
0
283
Nov ’25
Deterministic RNG behaviour across Mac M1 CPU and Metal GPU – BigCrush pass & structural diagnostics
Hello, I am currently working on a research project under ENINCA Consulting, focused on advanced diagnostic tools for pseudorandom number generators (structural metrics, multi-seed stability, cross-architecture reproducibility, and complementary indicators to TestU01). To validate this diagnostic framework, I prototyped a small non-linear 64-bit PRNG (not as a goal in itself, but simply as a vehicle to test the methodology). During these evaluations, I observed something interesting on Apple Silicon (Mac M1): • bit-exact reproducibility between M1 ARM CPU and M1 Metal GPU, • full BigCrush pass on both CPU and Metal backends, • excellent p-values, • stable behaviour across multiple seeds and runs. This was not the intended objective, the goal was mainly to validate the diagnostic concepts, but these results raised some questions about deterministic compute behaviour in Metal. My question: Is there any official guidance on achieving (or expecting) deterministic RNG or compute behaviour across CPU ↔ Metal GPU on Apple Silicon? More specifically: • Are deterministic compute kernels expected or guaranteed on Metal for scientific workloads? • Are there recommended patterns or best practices to ensure reproducibility across GPU generations (M1 → M2 → M3 → M4)? • Are there known Metal features that can introduce non-determinism? I am not sharing the internal recurrence (this work is proprietary), but I can discuss the high-level diagnostic observations if helpful. Thank you for any insight, very interested in how the Metal engineering team views deterministic compute patterns on Apple Silicon. Pascal ENINCA Consulting
0
0
211
Nov ’25
Error: "CoreImage Metal library does not contain function"
Hey I'm using the CIDepthBlurEffect Core Image Filter in my app. It seems to work ok but I get these errors in the console when calling the class. CoreImage Metal library does not contain function for name: sparserendering_xhlrb_scan CoreImage Metal library does not contain function for name: sparserendering_xhlrb_diffuse CoreImage Metal library does not contain function for name: sparserendering_xhlrb_copy_back CoreImage Metal library does not contain function for name: plain_or_sRGB_copy Am I missing some sort of import to gain these Metal functions? I am using my own custom shaders but I assume you'd be able to use them along side the built in ones.
0
0
537
Dec ’25
Hover effects w/ Compositor Services w/ PSVR2 controllers
Hi, I would like clarification on whether the new hover effects feature introduced in vision os 26 supported pinch gestures through the psvr 2 controllers. In your sample application, I was not able to confirm that this was working. Only pinch clicking with my hands worked. Pulling the trigger on the controller whilst looking at a 3d object did not activate the hover effect spatial event in the sample application. (The object is showing the highlight though) This is inconsistent with hover effect behavior with psvr2 controllers on swift ui views, where the trigger press does count as a button click. The sample I used was this one: https://developer.apple.com/documentation/compositorservices/rendering_hover_effects_in_metal_immersive_apps
0
0
453
Jan ’26
Optimizing HZB Mip-Chain Generation and Bindless Argument Tables in a Custom Metal Engine
Hi everyone, I’ve been developing a custom, end-to-end 3D rendering engine called Crescent from scratch using C++20 and Metal-cpp (targeting macOS and visionOS). My primary goal is to build a zero-bottleneck, GPU-driven pipeline that maximizes the potential of Apple Silicon’s Unified Memory and TBDR architecture. While the fundamental systems are stable, I am looking for architectural feedback from Metal framework engineers regarding specific synchronization and latency challenges. Current Core Implementations: GPU-Driven Instance Culling: High-performance occlusion culling using a Hierarchical Z-Buffer (HZB) approach via Compute Shaders. Clustered Forward Shading: Support for high-count dynamic lights through view-space clustering. Temporal Stability: Custom TAA with history rejection and Motion Blur resolve. Asset Infrastructure: Robust GUID-based scene serialization and a JSON-driven ECS hierarchy. The Architectural Challenge: I am currently seeing slight synchronization overhead when generating the HZB mip-chain. On Apple Silicon, I am evaluating the cost of encoder transitions versus cache-friendly barriers. && m_hzbInitPipeline && m_hzbDownsamplePipeline && !m_hzbMipViews.empty(); if (canBuildHzb) { MTL::ComputeCommandEncoder* hzbInit = commandBuffer->computeCommandEncoder(); hzbInit->setComputePipelineState(m_hzbInitPipeline); hzbInit->setTexture(m_depthTexture, 0); hzbInit->setTexture(m_hzbMipViews[0], 1); if (m_pointClampSampler) { hzbInit->setSamplerState(m_pointClampSampler, 0); } else if (m_linearClampSampler) { hzbInit->setSamplerState(m_linearClampSampler, 0); } const uint32_t hzbWidth = m_hzbMipViews[0]->width(); const uint32_t hzbHeight = m_hzbMipViews[0]->height(); const uint32_t threads = 8; MTL::Size tgSize = MTL::Size(threads, threads, 1); MTL::Size gridSize = MTL::Size((hzbWidth + threads - 1) / threads * threads, (hzbHeight + threads - 1) / threads * threads, 1); hzbInit->dispatchThreads(gridSize, tgSize); hzbInit->endEncoding(); for (size_t mip = 1; mip < m_hzbMipViews.size(); ++mip) { MTL::Texture* src = m_hzbMipViews[mip - 1]; MTL::Texture* dst = m_hzbMipViews[mip]; if (!src || !dst) { continue; } MTL::ComputeCommandEncoder* downEncoder = commandBuffer->computeCommandEncoder(); downEncoder->setComputePipelineState(m_hzbDownsamplePipeline); downEncoder->setTexture(src, 0); downEncoder->setTexture(dst, 1); const uint32_t mipWidth = dst->width(); const uint32_t mipHeight = dst->height(); MTL::Size downGrid = MTL::Size((mipWidth + threads - 1) / threads * threads, (mipHeight + threads - 1) / threads * threads, 1); downEncoder->dispatchThreads(downGrid, tgSize); downEncoder->endEncoding(); } if (m_instanceCullHzbPipeline) { dispatchInstanceCulling(m_instanceCullHzbPipeline, true); } } My Questions: Encoder Synchronization: Would you recommend moving this loop into a single ComputeCommandEncoder using MTLBarrier between dispatches to maintain L2 cache residency, or is the overhead of separate encoders negligible for depth-downsampling on TBDR? visionOS Bindless Latency: For stereo rendering on visionOS, what are the best practices for managing MTL4ArgumentTable updates at 90Hz+? I want to ensure that updating bindless resources for each eye doesn't introduce unnecessary CPU-to-GPU latency. Memory Management: Are there specific hints for Memoryless textures that could be applied to intermediate HZB levels to save bandwidth during this process? I’ve attached a screenshot of a scene rendered with the engine (PBR, SSR, and IBL).
0
0
430
Feb ’26
Open Shading Language (OSL) in Metal
Hi. I'm a 3D designer, using Blender for most of my work. The most recent Blender conference discussed utilizing the Open Shading Language (OSL) in their latest versions, which allows designers to write custom shaders for their workflows. At the moment, only Nvidia Optix GPU's can utilize this language for rendering (from what I understand), but Blender developers stated they are waiting on other GPU manufacturers to implement this feature as well. I'm not sure if there are any licensing issues here, but would this be something Apple could implement in Metal to make their hardware more attractive to the 3D design community? Any help or knowledge on this topic would be greatly appreciated.
0
0
259
Feb ’26
Terminal Codes
Hello Apple Developers and users I am writing this message reguarding some help on some performance codes/settings I can use for my Macbook since I recently downloaded the MacOs Tahoe 26.2 and its been very glitchy and laggy with gaming and just using my mac normally I have tried using a FPS unlocker and downloading Metal 4 the FPS unlocker hasent worked at all I am still stuck on the normal 60 FPS and need some advice/help. Thank you. Kind regards Zachary
0
0
165
Feb ’26
The description of set_indices in the MSL reference seems incorrect.
I'm currently learning Metal. While reading the reference, I came across a strange description. Page 78 in Version 4 Reference (2025-10-25) says: It is legal to call the following set_indices functions to set the indices if the position in the index buffer is valid and if the position in the index buffer is a multiple of 2 (uchar2 overload) or 2 (uchar4 overload). The index I needs to be in the range [0, max_indices). void set_indices(uint I, uchar2 v); void set_indices(uint I, uchar4 v); However, it seems that the uchar4 overload should be multiple of 4. Furthermore, there is no explanation of what these methods actually do. I believe it involves setting two to four consecutive indices at once, but there is no mention of that here. I would like to know if the above understanding is correct.
0
0
104
4w
Xcode Metal Capture crash when using MTLSamplerState
The sample code just draw a triangle and sample texture. both sample code can draw a correct triangle and sample texture as expected. there are no error message from terminal. Sample code using constexpr Sampler can capture and replay well. Sample code using a argumentTable to bind a MTLSamplerState was crashed when using Metal capture and replay on Xcode. Here are sample codes. Sample Code Test Environment: M1 Pro MacOS 26.3 (25D125) Xcode Version 26.2 (17C52) Feedback ID: FB22031701
0
0
100
3w
Metal 4 (validation / debug layer): residency set requirement mismatch for memoryless attachments
Setup: MSAA rendering using a memoryless texture as the color attachment (render_image) and a "normal" texture as the resolve attachment (resolve_image). MTL_DEBUG_LAYER / API validation is enabled for this. When trying to add the memoryless texture to a residency set, I get the following error: -[MTLDebugResidencySet validateResource:], line 114: error 'residency sets do not support memoryless resources. Which is as expected and identical to Metal 3. However, if I don't add it to the residency set, I then get the following error when committing to the command queue: -[MTL4DebugCommandQueue commit:count:options:], line 67: error 'Commit With Options Validation Attachment texture (Label: render_image) used in command buffer (at index 0) is not added to any residency set on the command buffer or command queue. So which way around is actually correct in Metal 4? Either way, this makes the use of memoryless textures/attachments impossible right now when validation is enabled. FWIW: when disabling all validation, either way seems to work just fine. Tested on: M1 Max, macOS 26.3, Xcode 26.2 & 26.4b2
0
0
60
3w
Using Metal compute for scientific simulation (lattice QCD gauge theory)
I've been using Metal compute shaders for lattice quantum chromodynamics simulations and wanted to share the experience in case others are doing scientific computing on Metal. The workload involves SU(2) matrix operations on 4D lattice grids — lots of 2x2 and 3x3 complex matrix multiplies, reductions over lattice sites, and nearest-neighbor stencil operations. The implementation bridges a C++ scientific framework (Grid) to Metal via Objective-C++ .mm files, with MSL kernels compiled into .metallib archives during the build. Things that work well: Shared memory on M-series eliminates the CPU↔GPU copy overhead that dominates in CUDA workflows The .metallib compilation integrates cleanly with autotools builds using xcrun Float4 packing for SU(2) matrices maps naturally to MSL vector types Things I'm still figuring out: Optimal threadgroup sizes for stencil operations on 4D grids Whether to use MTLHeap for gauge field storage or stick with individual buffers Best practices for double precision — some measurements need float64 but Metal's double support varies by hardware The application is measuring chromofield flux distributions between static quarks, ultimately targeting multi-quark systems. Production runs are on MacBook Pro M-series and Mac Studio. Code: https://github.com/ThinkOffApp/multiquark-lattice-qcd
0
0
105
3w
Metal Shader inside Swift Package not found?
Hello everyone! I am trying to wrap a ViewModifier inside a Swift Package that bundles a metal shader file to be used in the modifier. Everything works as expected in the Preview, in the Simulator and on a real device for iOS. It also works in Preview and in the Simulator for tvOS but not on a real AppleTV. I have tried this on a 4th generation Apple TV running tvOS 26.3 using Xcode 26.2.0. Xcode logs the following: The metallib is processed and exists in the bundle. Compiler failed to build request precondition failure: pipeline error: custom_effect-fg2a5cia7fmha4: error: unresolved visible function reference: custom_fn Reason: visible function not loaded Compiler failed to build request precondition failure: pipeline error: custom_effect-fg2a5cia7fmha4: error: unresolved visible function reference: custom_fn Reason: visible function not loaded Compiler failed to build request precondition failure: pipeline error: custom_effect-fg2a5cia7fmha4: error: unresolved visible function reference: custom_fn Reason: visible function not loaded Compiler failed to build request precondition failure: pipeline error: custom_effect-fg2a5cia7fmha4: error: unresolved visible function reference: custom_fn Reason: visible function not loaded Compiler failed to build request precondition failure: pipeline error: custom_effect-fg2a5cia7fmha4: error: unresolved visible function reference: custom_fn Reason: visible function not loaded Compiler failed to build request precondition failure: pipeline error: custom_effect-fg2a5cia7fmha4: error: unresolved visible function reference: custom_fn Reason: visible function not loaded Contents of Package.swift: import PackageDescription let package = Package( name: "Test", platforms: [ .iOS(.v17), .tvOS(.v17) ], products: [ .library( name: "Test", targets: [ "Test" ] ) ], targets: [ .target( name: "Test", resources: [ .process("Shaders") ] ), .testTarget( name: "TestTests", dependencies: [ "Test" ] ) ] ) Content of my metal file: #include <metal_stdlib> using namespace metal; [[ stitchable ]] float2 complexWave(float2 position, float time, float2 size, float speed, float strength, float frequency) { float2 normalizedPosition = position / size; float moveAmount = time * speed; position.x += sin((normalizedPosition.x + moveAmount) * frequency) * strength; position.y += cos((normalizedPosition.y + moveAmount) * frequency) * strength; return position; } And my ViewModifier: import MetalKit import SwiftUI extension ShaderFunction { static let complexWave: ShaderFunction = { ShaderFunction( library: .bundle(.module), name: "complexWave" ) }() } extension Shader { static func complexWave(arguments: [Shader.Argument]) -> Shader { Shader(function: .complexWave, arguments: arguments) } } struct WaveModifier: ViewModifier { let start: Date = .now func body(content: Content) -> some View { TimelineView(.animation) { context in let delta = context.date.timeIntervalSince(start) content .visualEffect { view, proxy in view.distortionEffect( .complexWave( arguments: [ .float(delta), .float2(proxy.size), .float(0.5), .float(8), .float(10) ] ), maxSampleOffset: .zero ) } } .onAppear { let paths = Bundle.module.paths(forResourcesOfType: "metallib", inDirectory: nil) print(paths) } } } extension View { public func wave() -> some View { modifier(WaveModifier()) } } #Preview { Image(systemName: "cart") .wave() } Any help is appreciated.
0
0
328
2w
Can a compute pipeline be as efficient as a render pipeline for rasterization?
I'm new to graphics and game design and I just wanted to know if a compute pipeline could be as efficient as a render pipeline for rasterization and an explanation on how and why. Also is it possible to manually perform rasterization with a render pipeline as in manipulate individual pixel data in a metal texture yourself but do it with a render pipeline?
0
0
161
1w
MTL4FXTemporalDenoisedScaler initialization
I’m trying to use MTL4FXTemporalDenoisedScaler, and I’m seeing a crash during initialization even with a very simple sample app. I created a minimal sample here: https://github.com/tatsuya-ogawa/MetalFXInitExample The exception is: NSException: "-[AGXG16XFamilyHeap baseObject]: unrecognized selector sent to instance ..." What I found is: • This works: descriptor.makeTemporalDenoisedScaler(device: device) • This crashes: descriptor.makeTemporalDenoisedScaler(device: device, compiler: metal4Compiler) So the issue seems to happen only with the Metal4FX version. For testing, I’m using an iPhone 15 Pro. According to the Metal Feature Set Tables, MetalFX denoised upscaling should be supported on Apple9 and later, so I believe the device itself should meet the requirements. Reference: https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf Has anyone seen this before, or knows what might be causing it? I’d appreciate any advice. Thanks.
0
0
28
16h