Metal

Use Metal to render advanced 3D graphics and perform data-parallel computations on graphics processors.

Metal Documentation


Metal rendering application is not releasing resources

I am developing a Metal-based ray tracing rendering application that runs heavy GPU kernels. I sometimes force-quit the application, and afterwards it no longer appears in Activity Monitor, but WindowServer is using 97% of the GPU and the Mac gets hotter and hotter. If I kill WindowServer and log in again, the problem persists. The only fix is to restart the Mac. I have checked for zombie processes and there are none. I am 3-4 months into Mac development (I have used many rendering APIs before on Windows and Linux, and they release resources automatically unless the driver is badly broken), and I believe that whether an application is force-quit or exits gracefully, it should release its resources. I may be missing some knowledge. Does anybody have an idea? I have added graceful exit code at every corner, but once a kernel gets into an infinite loop the cleanup cannot happen. On Windows there are driver reload mechanisms to recover when the GPU is stuck. Is there a similar system on macOS?

Graphics & Games Metal Metal

778

macOS 27 beta: ProMotion refresh cadence is unstable, causing constant scroll judder

FB24091347 On macOS 27.0 beta (26A5388g), MacBook Pro M4 Pro, the built-in ProMotion display never settles on a stable refresh cadence. Scrolling in SwiftUI judders constantly. The same app binary was smooth on macOS 26, and is smooth on a 120 Hz ProMotion iPad. I captured two 60-second Instruments traces — same app, same scene, same scrolling, no external display — changing only the display's refresh-rate setting. On ProMotion the vsync interval standard deviation is 4.093 ms across six different cadences, mostly flip-flopping between 120 Hz and 60 Hz. Forced to a fixed 60 Hz it drops to 0.391 ms with a single cadence. The app presented an identical 59 fps median in both runs — frame production is perfectly steady, the display just holds each frame for an unpredictable length of time. That's what makes this nasty: it's invisible to every frame-rate metric, so it looks like the app got slow when nothing about the app changed. I spent most of a day profiling my own code before realising the app was never the problem. Workaround: force the built-in display to 60 Hz. Worth noting, because it complicates the picture: attaching a 60 Hz Studio Display makes the built-in smooth, but the Studio itself then judders — despite its own vsync cadence measuring perfectly stable. So refresh rate alone isn't the whole story, and there may be a second mechanism. The clean, reproducible, single-variable result is the ProMotion vs forced-60 Hz comparison on the built-in panel. If you can reproduce this on an M-series MacBook Pro on 27 beta, please file a duplicate referencing FB24091347.
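The interval statistics described in this post can be approximated offline. What follows is a minimal sketch, assuming the vsync timestamps have been exported from an Instruments trace as a list of milliseconds. The function names, the table of nominal cadences, the 1 ms matching tolerance, and the synthetic timestamps are all illustrative assumptions and do not come from the original report.

```python
import statistics
from collections import Counter

# Nominal frame intervals in ms for a few common cadences (illustrative assumption).
NOMINAL_INTERVALS_MS = {rate: 1000 / rate for rate in (120, 60, 48, 40, 30, 24)}


def vsync_intervals(timestamps_ms):
    """Return the deltas between consecutive vsync timestamps."""
    return [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]


def classify_cadence(interval_ms, tolerance_ms=1.0):
    """Map an interval to the nearest nominal refresh rate, or None if nothing is close."""
    rate, nominal = min(
        NOMINAL_INTERVALS_MS.items(), key=lambda kv: abs(kv[1] - interval_ms)
    )
    return rate if abs(nominal - interval_ms) <= tolerance_ms else None


def summarize(timestamps_ms):
    """Population standard deviation of the intervals plus a count per cadence."""
    intervals = vsync_intervals(timestamps_ms)
    cadences = Counter(classify_cadence(i) for i in intervals)
    return {"stdev_ms": statistics.pstdev(intervals), "cadences": dict(cadences)}


def make_timestamps(rates_hz):
    """Build synthetic timestamps where each frame is held for 1000 / rate ms."""
    t, out = 0.0, [0.0]
    for rate in rates_hz:
        t += 1000 / rate
        out.append(t)
    return out


# Synthetic stand-ins for the two traces: a fixed 60 Hz display and a
# display that flip-flops between 120 Hz and 60 Hz.
steady = summarize(make_timestamps([60] * 8))
flipflop = summarize(make_timestamps([120, 60] * 4))
print(steady)
print(flipflop)
```

In this synthetic case a strict 120/60 alternation yields a standard deviation of about 4.17 ms, the same order of magnitude as the 4.093 ms reported for ProMotion, while the fixed 60 Hz series stays at essentially zero with a single cadence.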

Graphics & Games General macOS Core Animation SwiftUI Metal

145

Core Image kernel sampling broken in iOS27 DB4

I've noticed that my camera app is returning blank images on developer beta 4. After some investigation, I found an issue with Core Image custom kernels where the texture sampler returns NaN / 0 floats. I have a reproducible demo here: https://github.com/alexfoxy/ci-metal-shader-bug Feedback ticket here: https://feedbackassistant.apple.com/feedback/23895753

Graphics & Games Metal Core Image Metal Metal Performance Shaders

1.9k

Residency set memory not freed if process performs no GPU operation

Feedback report: FB23959296

If a process creates a residency set, calls requestResidency, endResidency, and then releases the residency set without ever having done any GPU operations, the memory from the residency set is not freed.

Workaround: if the application runs any GPU operation (even an operation not involving the residency set) at any point in its lifecycle (before/while/after creating/releasing the residency set), the memory is freed properly.

This was observed in the context of an application that makes an AI model resident in GPU-accessible memory. If the user unloads the model without running any prompts, the memory is not freed. The model occupies ~16GB of RAM so a lot of memory is being leaked.

Reproduction:

Store the repro.m and workaround.m files from below.
Run the following commands (repro.m demonstrates the bug; workaround.m demonstrates the workaround):

    $ clang -framework Foundation -framework Metal -o repro repro.m
    $ ./repro
    Footprint at start: 0.00 GB
    Footprint after buffer allocation: 4.30 GB
    Footprint 5s after teardown: 4.30 GB
    $ clang -framework Foundation -framework Metal -o workaround workaround.m
    $ ./workaround
    Footprint at start: 0.00 GB
    Footprint after buffer allocation: 4.37 GB
    Footprint 5s after teardown: 0.01 GB

Expected behavior: Footprint 5s after teardown should be ~0 GB, i.e., the memory is freed.
Observed behavior: Footprint 5s after teardown is 4.30 GB, i.e., the memory is not freed.

Versions:

XCode: 26.6 (17F113)
Clang: 21.0.0 (clang-2100.1.1.101, arm64-apple-darwin25.5.0)
macOS: 26.5.2 (25F84)

Files:

repro.m:

    // Build: clang -framework Foundation -framework Metal -o repro repro.m
    #import <Metal/Metal.h>
    #include <mach/mach.h>

    // Returns the physical memory footprint of the process.
    static double footprint_gb(void) {
        task_vm_info_data_t info;
        mach_msg_type_number_t n = TASK_VM_INFO_COUNT;
        task_info(mach_task_self(), TASK_VM_INFO, (task_info_t)&info, &n);
        return (double)info.phys_footprint / 1e9;
    }

    int main(int argc, char ** argv) {
        printf("Footprint at start: %5.2f GB\n", footprint_gb());
        @autoreleasepool {
            id<MTLDevice> dev = MTLCreateSystemDefaultDevice();

            // Allocate ~4GB of memory.
            const size_t size = 4ULL << 30;
            id<MTLBuffer> buf = [dev newBufferWithLength:size options:MTLResourceStorageModeShared];
            memset(buf.contents, 0xab, size); // fault the pages in
            printf("Footprint after buffer allocation: %5.2f GB\n", footprint_gb());

            MTLResidencySetDescriptor * desc = [[MTLResidencySetDescriptor alloc] init];
            id<MTLResidencySet> rset = [dev newResidencySetWithDescriptor:desc error:nil];
            [desc release];

            [rset addAllocation:buf];
            [rset commit];
            [rset requestResidency];
            [rset endResidency];
            [rset removeAllAllocations];
            [rset commit];

            [rset release];
            [buf release];
            [dev release];
        }
        sleep(5);
        printf("Footprint 5s after teardown: %5.2f GB\n", footprint_gb());
        return 0;
    }

workaround.m:

    // Build: clang -framework Foundation -framework Metal -o workaround workaround.m
    #import <Metal/Metal.h>
    #include <mach/mach.h>

    // Returns the physical memory footprint of the process.
    static double footprint_gb(void) {
        task_vm_info_data_t info;
        mach_msg_type_number_t n = TASK_VM_INFO_COUNT;
        task_info(mach_task_self(), TASK_VM_INFO, (task_info_t)&info, &n);
        return (double)info.phys_footprint / 1e9;
    }

    // Performing any work on the GPU ensures the memory from the residency set will be released.
    static void do_dummy_work(id<MTLDevice> dev, id<MTLCommandQueue> queue) {
        @autoreleasepool {
            id<MTLBuffer> tmp = [dev newBufferWithLength:1 options:MTLResourceStorageModeShared];
            id<MTLCommandBuffer> cb = [queue commandBuffer];
            id<MTLBlitCommandEncoder> enc = [cb blitCommandEncoder];
            [enc fillBuffer:tmp range:NSMakeRange(0, 1) value:0];
            [enc endEncoding];
            [cb commit];
            [tmp release];
        }
    }

    int main(int argc, char ** argv) {
        printf("Footprint at start: %5.2f GB\n", footprint_gb());
        @autoreleasepool {
            id<MTLDevice> dev = MTLCreateSystemDefaultDevice();
            id<MTLCommandQueue> queue = [dev newCommandQueue];

            // Workaround that ensures the memory will be released.
            // It also works if we call this after the residency set release or at any point in between.
            do_dummy_work(dev, queue);

            // Allocate ~4GB of memory.
            const size_t size = 4ULL << 30;
            id<MTLBuffer> buf = [dev newBufferWithLength:size options:MTLResourceStorageModeShared];
            memset(buf.contents, 0xab, size); // fault the pages in
            printf("Footprint after buffer allocation: %5.2f GB\n", footprint_gb());

            MTLResidencySetDescriptor * desc = [[MTLResidencySetDescriptor alloc] init];
            id<MTLResidencySet> rset = [dev newResidencySetWithDescriptor:desc error:nil];
            [desc release];

            [rset addAllocation:buf];
            [rset commit];
            [rset requestResidency];
            [rset endResidency];
            [rset removeAllAllocations];
            [rset commit];

            [rset release];
            [buf release];
            [queue release];
            [dev release];
        }
        sleep(5);
        printf("Footprint 5s after teardown: %5.2f GB\n", footprint_gb());
        return 0;
    }
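A note on reading the numbers in this report: the buffer is sized in binary units (4ULL << 30 bytes) while footprint_gb() divides by 1e9, so the buffer alone accounts for roughly 4.29 decimal GB. The sketch below reproduces that arithmetic and applies a simple freed/not-freed check to the logged values. The helper is_freed and its 0.05 GB tolerance are hypothetical, chosen only for illustration.

```python
BUFFER_BYTES = 4 << 30  # same value as `4ULL << 30` in repro.m


def footprint_gb(num_bytes):
    """Mirror the conversion in repro.m: bytes to decimal gigabytes."""
    return num_bytes / 1e9


def footprint_gib(num_bytes):
    """Bytes to binary gibibytes, for comparison."""
    return num_bytes / (1 << 30)


def is_freed(start_gb, after_teardown_gb, tolerance_gb=0.05):
    """Hypothetical check: treat the memory as freed if the footprint is back near the start."""
    return (after_teardown_gb - start_gb) <= tolerance_gb


buffer_gb = footprint_gb(BUFFER_BYTES)
buffer_gib = footprint_gib(BUFFER_BYTES)
print(f"buffer alone: {buffer_gb:5.2f} GB ({buffer_gib:.2f} GiB)")

# Values copied from the logs in the report.
print("repro freed:     ", is_freed(0.00, 4.30))
print("workaround freed:", is_freed(0.00, 0.01))
```

The small gap between 4.29 GB and the logged 4.30 GB or 4.37 GB is presumably other process memory; the report does not break it down.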

Graphics & Games Metal Metal

683

Xcode MTL Validation Crashes App

I don't know the terminology around this very well, but I was trying to test my Mac Catalyst app on macOS Sequoia, and the app kept crashing, apparently due to MTL validation. I was trying to debug why using a menu (as in File, Edit, View, etc.) would crash. The stack looked roughly like this:

    6 -[MTLDebugComputeCommandEncoder setBuffer:offset:attributeStride:atIndex:]  MetalTools
    5 _CF_forwarding_prep_0  CoreFoundation
    4 ___forwarding___  CoreFoundation
    3 -[NSObject doesNotRecognizeSelector:]  CoreFoundation
    2 objc_exception_throw  libobjc.A.dylib
    1 __cxa_throw  b
    0 _Unwind_RaiseException  libunwind.dylib

Both Claude and Gemini indicated that there was no flaw in my code, but rather that Xcode was responsible. Sure enough, unchecking the MTL validation checkbox in Xcode stopped the crash from happening.

Developer Tools & Services Xcode Metal MetalKit Xcode

232

Xcode Cloud 26b7 Metal Compilation Failure

I've been getting intermittent failures on Xcode Cloud when compiling my app on multiple platforms, because it fails to compile a Metal shader:

    The Metal Toolchain was not installed and could not compile the Metal source files. Download the Metal Toolchain from Xcode > Settings > Components and try again.

Sometimes if I re-run it, it works fine. Then I'll run it again, and it will fail. If you tell me to file a feedback, please tell me what information would be useful and actionable, because this is all I have.

Developer Tools & Services Xcode Cloud Metal Xcode Cloud

1.9k

iPad Pro M4 (11-inch) – Persistent Gaming Performance Issues Across Multiple iPadOS Versions

Hello everyone,

I am posting this to determine whether other iPad Pro M4 users are experiencing the same issue.

Device: iPad Pro 11-inch (M4), original Apple charger
Tested on multiple iPadOS versions, stable and beta, including 26.2, 26.3, 26.4, and 26.5.2
Games tested: BGMI, PUBG Mobile Global, Call of Duty: Mobile, Fortnite

Issue: Despite using one of Apple's most powerful tablets, I continue to experience gaming performance problems. The issues include:

FPS drops during long gaming sessions.
Frame pacing inconsistencies.
Reduced responsiveness during intense fights.
Inconsistent hit registration and spray accuracy after extended play.
Performance sometimes changes when gaming while charging with the original Apple charger.

I have tested multiple iPadOS versions and multiple game updates over several months, but the issue has never been completely resolved. Interestingly, iPadOS feels more consistent for me than some previous versions, but the overall gaming experience is still not what I would expect from the M4 hardware. I have also noticed that many other iPad Pro M4 users have reported similar concerns on Reddit, Apple Communities, and other gaming forums.

Questions:

Are other iPad Pro M4 users experiencing the same FPS drops and gameplay inconsistencies?
Has anyone found a reliable solution?
Is Apple aware of these gaming performance issues on the M4 iPad Pro?
Is this an iPadOS optimization issue, a GPU scheduling issue, or something related to game optimization?

I hope Apple and game developers investigate this further, because the M4 hardware should be capable of delivering a consistently excellent gaming experience. Thank you.

Graphics & Games General Graphics and Games Metal Metal Performance Shaders Performance

299

Metal Shader Converter thread safety

Hello Apple! We've got offline shader compilation from HLSL -> Metallib using DXC -> SPIR-V -> metal.exe. This works okay for the most part, but it requires the creation of intermediate files to pass to/from the metal.exe process and we've had some issues with metal.exe sometimes not launching (probably our fault). Then we noticed Metal Shader Converter (MSC) exists and has a DLL - this looks way better since there's no need to launch processes or store intermediate files. However, upon trying to replace metal.exe with it I quickly ran into rampant heap corruption. I was surprised because the docs claim this: Each thread in your program needs to create its own instance of IRCompiler to avoid race conditions. But once I start calling IRCompilerAllocCompileAndLink in parallel all hell breaks loose, whether or not each thread has its own IRCompiler. I figured I must be doing something wrong, so I removed my attempt and compiled DXC locally with the MSC integration and encountered the exact same heap corruption. So I'm inclined to think the library isn't actually thread safe, but I'm wondering if there's something I'm missing? I tried all 3 versions of MSC just in case it was a problem with 3.0, but I got the same result each time. The only way to make it work was to surround compilation with a mutex, which makes its use pointless in our case.

Graphics & Games Metal Graphics and Games Metal Shader Converter Metal

282

About the neural network engine I built with Swift and Metal

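I'm 18 years old. I have no machine learning background, never went to university, didn't even attend high school, and have no mentor. A few days ago I was staring blankly at a sheet of paper and suddenly thought: why do computer neural networks have to be 2D? Could they mimic biology? Why must everything be computed on a flat plane? With multiple planes, wouldn't that multiply things? What would happen if you imagined six sheets of paper as a Rubik's cube, with each of the six faces carrying neurons and the eight body diagonals becoming new communication channels? I really enjoy tinkering with this kind of thing, so I immediately drew up a detailed plan and, with the help of AI tools, wrote my first kernel. It crashed. I thought it over again, shared my goal with friends in a QQ group, and wrote it again. It crashed again. Dozens of times in a row. No PyTorch, no TensorFlow, no CUDA. Only Swift and Metal. My graphics card is an AMD Vega 64 and I installed no supporting frameworks, because I wanted to understand how things work at the lowest level. This is CubeNN.

## The following is a detailed explanation written by AI. The content and architecture have changed too much for me to explain it all clearly here in one go.

What it is

A neural network engine that uses Rubik's cube geometry as its compute architecture.

Standard Transformer: lays the data out in a row, and every element looks at every other at O(n²) cost
CubeNN: distributes the data over 14 faces and looks only where it needs to

6 standard faces → block-sparse attention (coarse global view + fine local view)
8 X-face diagonals → cross-face information bridges (no attention, they only pass information along)

Each round: compute on the 6 faces → project onto the 8 X faces → upsample and refine → fuse back into the 6 faces

The key piece is Cube Cascade, a tree + chain cascaded inference scheme:

Tree stage: 1 cube spawns 8 → 8 spawn 64 → 73 cubes explore in parallel, running simultaneously on the GPU, and the best path is selected
Chain stage: the best leaf is refined to unlimited depth; it converges in 3-5 steps, with variance improving by ~7%

How it's implemented

Pure Swift + Metal. Zero dependencies. Zero frameworks.

    // This is roughly what the code looks like
    import Metal
    import Foundation

    let device = MTLCreateSystemDefaultDevice()!
    let library = try! device.makeLibrary(filepath: "cube_nn.metallib")
    // ...12 GPU kernels, 12,000 dispatches

Key technical decisions:

Single command buffer: all kernel dispatches for the 73 cubes of the whole tree stage are packed into one command buffer, with 0 CPU-GPU synchronizations
Pipeline state caching: encoding time dropped from 1022 ms to 42 ms
Buffer offsets: the 14 faces of all 73 cubes are stored in one contiguous buffer, and the kernel receives the offset via buffer(15)
FP16: half precision gives a 21% speedup when N ≥ 64

Performance

## Tested, but the numbers may be inaccurate because of device differences; for reference only.

AMD Radeon RX Vega 64 (a 2017 graphics card, 14 nm, 295 W):

    Scale    Neurons   Cubes       Time
    N=32     6,144     73 (tree)   435 ms
    N=64     24,576    21 (tree)   817 ms
    N=128    98,304    1           116 ms

N=32: fully connected attention needs 201M FLOP per layer → CubeNN block-sparse needs 370K FLOP (544× reduction)
N=128: fully connected would need 32 GB of VRAM (which physically doesn't exist) → CubeNN uses 192 KB
N=256: fully connected needs 2.2T FLOP → CubeNN needs 52M FLOP (42,300× reduction)

Code size: 161 KB, compared with PyTorch's 800 MB.

What I went through

The hardest part of this project wasn't writing kernels. It was finding a path through repeated trial and error with nobody to tell me whether it could be done.

The first time I tried to run 73 cubes, the GPU simply hung. It took 3 days to trace this to too many stacked command buffers. After switching to a single-encoder design, I hit SIGILL: Metal doesn't allow makeBuffer(length: 0), and with B=0 a zero-length buffer was being created. I wanted to use threadgroup memory for kernel fusion, but data couldn't be read across threadgroups, which is how I learned that LDS is per-group. FP16 at N=64 required hand-written float↔half conversion functions, because the Float16 type is marked unavailable on macOS 11.

Every crash taught me one low-level detail of Metal. Nobody taught me, but Metal's error messages were the best teacher.

Why post on the Apple Developer Forums

Because this project was born for the Apple ecosystem. From start to finish CubeNN uses only two things: Swift and Metal. It can run on any Apple silicon Mac without porting (the API is compatible). If some of the kernels could be mapped to the Neural Engine in the future, efficiency would multiply several times over.

I'd like to ask Apple's Metal engineers and the Core ML team:

Is there a better way to schedule GPU work? Performance is still lacking (for a perfectionist like me), and the code may have become a bit messy after all the changes.
Would you be interested in evaluating this architecture on M4? All I have is a Vega 64. How would CubeNN perform on an M4 GPU + ANE?

Source code

    ├── run.swift                # unified CLI, parameterized by N/B/depth
    ├── src/
    │   ├── cube_nn.metal        # FP16 kernel
    │   └── cube_nn_fp32.metal   # FP32 kernel
    └── benchmarks/              # measured data

If you've read this far, thank you. An outsider got here on nothing but Metal and an obsessive idea so pure it borders on delusion. I don't know very much. If this architecture has any value, I'd like to make it better. Any suggestions, criticism, or guidance are very welcome.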
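The figures quoted in this post can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check, not CubeNN code. It assumes that each of the 6 standard faces holds an N × N grid of neurons, which is inferred from the table rather than stated by the author, and that the 73-cube tree is a root plus two levels of 8-way branching.

```python
def neurons(n, faces=6):
    """Neuron count assuming each of the 6 standard faces holds an N x N grid."""
    return faces * n * n


def tree_cubes(branching=8, depth=2):
    """Total cubes in a tree: 1 root, then `branching` children per cube per level."""
    return sum(branching ** level for level in range(depth + 1))


def reduction(dense_flop, sparse_flop):
    """How many times fewer FLOP the sparse variant needs."""
    return dense_flop / sparse_flop


neuron_counts = {n: neurons(n) for n in (32, 64, 128)}
cubes = tree_cubes()
reduction_n32 = reduction(201e6, 370e3)   # 201M FLOP vs 370K FLOP
reduction_n256 = reduction(2.2e12, 52e6)  # 2.2T FLOP vs 52M FLOP

print(neuron_counts)
print(cubes)
print(f"{reduction_n32:.0f}x, {reduction_n256:.0f}x")
```

Under those assumptions the neuron counts (6,144, 24,576, 98,304) and the 73-cube total match the post exactly. The ratios come out at roughly 543× and 42,308×, in line with the quoted 544× and 42,300× given that the inputs are rounded.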

Machine Learning & AI Core ML Swift Metal

422

MDLAsset loads texture in usdz file loaded with wrong colorspace

I have a very basic usdz file from this repo. I call loadTextures() after loading the usdz via MDLAsset. Inspecting the MDLTexture object, I can tell it is assigning a colorspace of linear RGB instead of sRGB, although the image file in the usdz is sRGB. This causes the textures to ultimately render as oversaturated. In the code I later convert the MDLTexture to an MTLTexture via MTKTextureLoader, but if I set the sRGB option it seems to be ignored. This significantly impacts the usefulness of Model I/O if it can't load a simple usdz texture correctly. Am I missing something? Thanks!

Graphics & Games Metal Graphics and Games Metal MetalKit USDZ

1.4k

Jul ’26

Background GPU Access availability

I would love to use Background GPU Access to do some video processing in the background. However, the documentation of BGContinuedProcessingTaskRequest.Resources.gpu clearly states: "Not all devices support background GPU use. For more information, see Performing long-running tasks on iOS and iPadOS." Is there a list available of currently released devices that do (or don't) support background GPU usage? That would help us understand what part of our user base can use this feature (and what hardware we need to test this on as developers). For example, it seems that it isn't supported on an iPad Pro M1 with the current iOS 26 beta. The simulators also don't seem to support the background GPU resource. So it would be great to understand what hardware is capable of using this feature!

Graphics & Games Metal Metal Background Tasks MetalFX

1.7k

Jul ’26

Linker trying to link Metal toolchain for every object file on Catalyst

When building our project for Mac Catalyst with Xcode 26.2, we get this warning almost a hundred times, once for every object file:

    directory not found for option '-L/var/run/com.apple.security.cryptexd/mnt/com.apple.MobileAsset.MetalToolchain-v17.3.48.0.UZtKea/Metal.xctoolchain/usr/lib/swift/maccatalyst'

Somehow, every Link <FileName>.o build step got the following parameter, regardless of whether the target contained Metal files:

    -L/var/run/com.apple.security.cryptexd/mnt/com.apple.MobileAsset.MetalToolchain-v17.3.48.0.UZtKea/Metal.xctoolchain/usr/lib/swift/maccatalyst

The toolchain is mounted at this point, but the directory usr/lib/swift/maccatalyst doesn't exist. When building the project for iOS, the option doesn't exist and the warning is not shown. We already checked the build settings, but we couldn't find a reason why the linker is trying to link against the toolchain here. Even for targets that do contain Metal files, we get the following linker warning:

    search path '/var/run/com.apple.security.cryptexd/mnt/com.apple.MobileAsset.MetalToolchain-v17.3.48.0.UZtKea/Metal.xctoolchain/usr/lib/swift/maccatalyst' not found

Is this a known issue? Is there a way to get rid of these warnings?

Developer Tools & Services Xcode Metal Xcode Mac Catalyst Linker

1.3k

Jun ’26

_FusedMatMul with [BiasAdd, Relu] produces incorrect results in graph mode on Metal GPU

When running a tf.function-traced graph on the Metal GPU, any operation that combines MatMul → BiasAdd → Relu (the fused pattern emitted by tf.keras.layers.Dense(activation='relu')) produces numerically incorrect output, with errors on the order of tens of units, not floating-point noise. Eager mode on the same Metal GPU is correct. Graph mode forced to CPU (tf.config.set_visible_devices([], 'GPU')) is also correct. The bug is deterministic and data-independent (it reproduces with random weights). The three-op combination of MatMul + BiasAdd + Relu triggers the error. Specifically: relu(tf.nn.bias_add(tf.matmul(x, W), b)) in graph mode on Metal is wrong, while relu(tf.matmul(x, W) + b) (using AddV2 instead of BiasAdd) is correct. Removing the Relu also makes the result correct: tf.nn.bias_add(tf.matmul(x, W), b) without a following Relu produces correct output at every shape tested. This points to the Metal plugin's fused _FusedMatMul kernel with fused_ops=[BiasAdd, Relu] as the culprit. Disabling the TF core grappler remapping pass (tf.config.optimizer.set_experimental_options({'remapping': False})) does not fix the issue, confirming that the fusion decision is made inside the Metal plugin's own kernel selection, below the TF core graph optimizer. The bug reproduces across all shapes tested (batch 4–200, inner dimension K 512–8192, output 128–2048) and is not specific to any particular weight values.
A minimal reproducer:

    import tensorflow as tf
    import numpy as np

    # Any shape works; larger K makes the error more obvious
    M, K, N = 64, 2048, 1024
    W = tf.Variable(tf.random.normal([K, N]))
    b = tf.Variable(tf.random.normal([N]))
    x = tf.random.normal([M, K])

    @tf.function
    def graph_fused(x):
        return tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b))

    @tf.function
    def graph_safe(x):
        return tf.nn.relu(tf.matmul(x, W) + b)  # AddV2 instead of BiasAdd

    eager_ref = tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b))  # eager = correct
    fused_out = graph_fused(x)  # Metal graph mode = WRONG
    safe_out = graph_safe(x)    # Metal graph mode = correct

    print(f"eager vs graph_fused (BiasAdd): {tf.reduce_max(tf.abs(eager_ref - fused_out)).numpy():.1f}")
    # ^ typically 30–80+ (WRONG)
    print(f"eager vs graph_safe (AddV2): {tf.reduce_max(tf.abs(eager_ref - safe_out)).numpy():.2e}")
    # ^ typically ~1e-5 (correct)

Environment: TensorFlow 2.18.1, Keras 3.11.2, tensorflow-metal (latest as of 2026-05-26), Apple Silicon Mac.

Impact: This breaks any Keras model that uses Dense(activation='relu') when called inside a tf.function or via SavedModel serving on the Metal GPU. Eager-mode inference is unaffected.
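To judge a suspect result without relying on any TensorFlow code path, one option is a framework-free reference for relu(x @ W + b). The sketch below uses only the Python standard library on small shapes; the function names, the shapes, and the deliberately broken variant that drops the bias are illustrative assumptions and are not part of the original report. A framework output could be fed in after converting it to nested lists.

```python
import math
import random


def dense_relu_reference(x, W, b):
    """relu(x @ W + b) on nested lists, with dot products accumulated by math.fsum."""
    inner = len(W)
    return [
        [
            max(0.0, math.fsum(row[k] * W[k][j] for k in range(inner)) + b[j])
            for j in range(len(b))
        ]
        for row in x
    ]


def max_abs_error(candidate, reference):
    """Largest element-wise absolute difference between two nested lists."""
    return max(
        abs(c - r)
        for c_row, r_row in zip(candidate, reference)
        for c, r in zip(c_row, r_row)
    )


random.seed(0)
M, K, N = 4, 64, 8  # small stand-in shapes so the pure-Python loops stay fast
x = [[random.gauss(0, 1) for _ in range(K)] for _ in range(M)]
W = [[random.gauss(0, 1) for _ in range(N)] for _ in range(K)]
b = [random.gauss(0, 1) for _ in range(N)]

reference = dense_relu_reference(x, W, b)

# Same math with naive left-to-right summation: differs only by rounding noise.
naive = [
    [max(0.0, sum(row[k] * W[k][j] for k in range(K)) + b[j]) for j in range(N)]
    for row in x
]

# Deliberately wrong variant (bias dropped) to show what a real error looks like.
broken = [
    [max(0.0, sum(row[k] * W[k][j] for k in range(K))) for j in range(N)]
    for row in x
]

noise = max_abs_error(naive, reference)
real_error = max_abs_error(broken, reference)
print(f"rounding noise: {noise:.2e}, dropped bias: {real_error:.2f}")
```

The point of the comparison is the gap in magnitude: a different summation order shifts results by rounding noise only, whereas a genuinely wrong computation differs by orders of magnitude more.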

Machine Learning & AI Core ML Metal tensorflow-metal

1.4k

May ’26

Metal GPU Driver Crash on M5 Pro + macOS 26.5 — kIOGPUCommandBufferCallbackErrorOutOfMemory with <2GB working sets

Summary

The Metal driver AGXMetalG17X 351.2 on macOS 26.5 (25F71) for the M5 Pro chip crashes with kIOGPUCommandBufferCallbackErrorOutOfMemory (00000008) when running LLM inference workloads with working sets as small as ~1.5GB, despite 24GB of unified memory being available and Apple Diagnostics confirming the hardware is fully functional. This affects multiple tools: MLX, llama.cpp (Metal backend), and native apps using Metal for inference.

System

    Component    Value
    Model        MacBook Pro (Mac17,9)
    Chip         Apple M5 Pro (applegpu_g17s)
    GPU Cores    16
    RAM          24 GB LPDDR5
    macOS        26.5 (25F71)
    Metal        Metal 4
    GPU Driver   AGXMetalG17X 351.2
    Xcode        26.5 (17F42)

Reproduction

MLX (Python)

    pip install mlx mlx-lm
    python -m mlx_lm.generate \
        --model mlx-community/Qwen2.5-3B-Instruct-4bit \
        --max-tokens 10 \
        --prompt "Hello"

Expected: Normal text generation
Actual: Crash with:

    libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)

llama.cpp

    brew install llama.cpp
    llama-cli --model model.gguf --prompt "Hello" --n-predict 20 --n-gpu-layers 99

Expected: Fast GPU generation
Actual: Process hangs indefinitely

Test Results

    Tool        Model               Peak Memory   Result
    MLX         Qwen2.5-0.5B-4bit   0.36 GB       ✅ Works
    MLX         Qwen2.5-1.5B-4bit   0.98 GB       ✅ Works
    MLX         Qwen3-1.7B-4bit     1.01 GB       ✅ Works
    MLX         Qwen2.5-3B-4bit     ~1.5 GB       ❌ Metal OOM crash
    MLX         Qwen3-4B-4bit       ~2.1 GB       ❌ Metal OOM crash
    MLX         Qwen3-8B-4bit       ~4.5 GB       ❌ Metal OOM crash
    llama.cpp   Qwen2.5-0.5B GGUF   ~0.5 GB       ❌ Hangs with GPU
    llama.cpp   Qwen2.5-0.5B GGUF   ~0.5 GB       ✅ Works with CPU only

Key Evidence

Hardware is healthy: Apple Diagnostics passed all tests
Basic Metal works: matmul, array ops work fine
CPU inference works: llama.cpp with -ngl 0 runs correctly
The error is NOT about actual memory exhaustion: kIOGPUCommandBufferCallbackErrorOutOfMemory means the kernel rejects the Metal memory commit, not that physical memory is full. The system reports 17.76GB available for Metal working set.

Crash Log Extract

    Thread 31 Crashed:
    0 libsystem_kernel.dylib   __pthread_kill + 8
    1 libsystem_pthread.dylib  pthread_kill + 296
    2 libsystem_c.dylib        abort + 148
    3 Metal                    MTLReportFailure.cold.1 + 48
    4 Metal                    MTLReportFailure + 576
    5 Metal                    -[_MTLCommandBuffer addCompletedHandler:] + 104
    ...
    Exception Type: EXC_CRASH (SIGABRT)
    Termination Reason: Namespace SIGNAL, Code 6, Abort trap: 6

Related Issues

ml-explore/mlx#3586: Metal compiler regression on macOS 26.5
ml-explore/mlx#3534: M5 float32 precision issue
ml-explore/mlx#3568: M5 random divergence
ml-explore/mlx#3539: Metal residency OOM (M4 Max)

Request

Please investigate the AGXMetalG17X driver for M5 Pro on macOS 26.5. The driver appears to incorrectly reject Metal memory commits for LLM inference workloads, even when the working set is well within the system's reported limits (1.5GB requested vs 17.76GB available). Happy to provide full crash logs, sysdiagnose archives, or run additional tests.

Graphics & Games Metal Metal macOS Apple Silicon metal-cpp

607

May ’26

MetalToolchain and auto updates...

Hello, I can understand why you no longer ship the MetalToolchain with the default Xcode installation, given its relatively low usage and large download size. That said, every time Xcode runs an auto update, it wipes the MetalToolchain and breaks my local development build. It would be nice if the updates were smart enough to honor the fact that I have already run "xcodebuild -downloadComponent MetalToolchain" and include the toolchain in the update, rather than deleting the module. Thanks, Chris

Developer Tools & Services Xcode Metal MetalKit Metal Performance Shaders

402

May ’26

Inexplicable Metal crash ever since iOS 26.5 beta 4

Hi all, I'm working on updating my audio visualizer app. I'm adding new visualizers based on Metal 4 compute shaders. They worked in iOS 26.4 and iOS 26.5 up until beta 3. However, after that, the visualizers started crashing the phone and forcing a restart. On the latest version of iOS 26.5, the crash is still there. I submitted feedback, but haven't heard anything back just yet. I was wondering if others have faced this same issue, and if there are any workarounds. Here is my repo if you want to look at the code (forgive me if it's sloppy, I'm quite new to graphics programming and Metal): https://github.com/aabagdi/VisualMan/tree/main Thank you!

Graphics & Games Metal Metal MetalKit

1.8k

May ’26

XPC Communication between Editor app and user-compiled code

Hello! I'm trying to implement an editor app (macOS) that allows the user to write code, which will be compiled and executed, showing the result in the editor window. Imagine it like SwiftUI previews, but with the graphic output created with Metal, not SwiftUI. I found that IOSurface can be used to share that kind of data over XPC, so I would not have to rely on the private NSRemoteView. However, I'm unsure whether it is possible at all for my editor app to connect to an XPC service that was NOT bundled with it (but compiled by it at runtime). I succeeded in launching an XPC service defined as:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
    <plist version="1.0">
    <dict>
        <key>Label</key>
        <string>com.myteam.myproject.service</string>
        <key>MachServices</key>
        <dict>
            <key>com.myteam.myproject.service</key>
            <true/>
        </dict>
        <key>Program</key>
        <string>/Path/to/service/run_my_service.sh</string>
    </dict>
    </plist>

But the call to

    let connection = NSXPCConnection(machServiceName: "com.myteam.myproject.service")
    let proxy = connection.remoteObjectProxyWithErrorHandler { error in
        continuation.resume(throwing: error)
    } as? MyServiceProtocol

fails with "The connection to service named com.myteam.myproject.service was invalidated: Connection init failed at lookup with error 3 - No such process."

I have added

    <key>com.apple.security.temporary-exception.mach-lookup.global-name</key>
    <array>
        <string>com.myteam.myproject.service</string>
    </array>

to my entitlements. Since the tutorials I followed are quite old, I'm wondering if support for something like this was dropped at some point. Thanks for any advice!

App & System Services Processes & Concurrency Metal macOS XPC

1.2k

May ’26

Possibilities of Overclocking Apple Silicon

I've been testing Apple Silicon devices in their desktop configurations, on the Mac Studio and the now retired Mac Pro, and it seems like they're greatly bottlenecked by their clock speeds. For reference, here are my testing results.

Testing Results:

Mac Studio M2 Max • 32 GB RAM • 30-core GPU • 1 TB Storage
CPU Utilization • 60% • 20 W
CPU Temperature • 47ºC
GPU Utilization • 100% • 20 W
GPU Temperature • 55ºC
Fan Speed • 50%
Workload Duration • 2 hrs

Another point is that the clock speed of the M2 Max's CPU is 3.5 GHz and that of the GPU is 1.44 GHz at max performance, which the Mac Studio has no trouble sustaining. My question is: how do I push those clock speeds higher? 1.44 GHz at 55ºC suggests extensive headroom. I'm sure there are internal tools for testing the upper limits of the silicon, but it makes no sense that the clocks would be set so low when the Mac Studio is in no danger of overheating. Is there any way to push the performance of my Mac Studio?

FB22713867 - Possibilities of Overclocking Apple Silicon

Graphics & Games General Metal Performance Apple Silicon

513

May ’26

Metal, Vulkan, OpenGL & Godot

Greetings! I'm preparing to publish an app on the App Store. It's a 2D audio app made in Godot, already published on Google Play.

As we know, OpenGL has been considered deprecated since iOS 12 (2018). However, given the current state of the Metal and Vulkan integrations in Godot, and wanting to bring the best possible experience to iOS, I'm not sure which API to use as the primary option. As well as Metal and even Vulkan work in Godot, the fact of the matter is that each API has its strong and weak points.

Metal: Native on iOS, fully compliant and supported. However, it has two weak points. First, an initial compilation freeze of 5+ seconds. Second, a performance hit: although negligible for the end user, the app uses 25% more CPU (on my iPhone 12). Battery drain?

Vulkan: In Godot the path is Vulkan > MoltenVK > Metal. It is a more complex translation layer, but interestingly it gives slightly better performance than Metal. Initial compilation doesn't cause a freeze, because it is lazy/delayed and performed while the app is starting. It uses 25% less CPU than Metal and gives a slightly more stable frame rate (iPhone 12). However, given the extra complexity, it could be more prone to errors or compatibility problems, which are known and have been reported on older iOS devices (iPads come to mind). Right?

OpenGL: No initial compilation needed. Maximum performance, no CPU munching. Universally supported (in theory?), and it works perfectly on my iPhone 12 with iOS 26.3 and 26.4.2. All in all, it gives the best performance and user experience.

And that's pretty much the situation! Since the graphics API of choice will directly affect the user experience, which one is best? This will be the first app I publish on the App Store, so as you can imagine I want to comply with Apple as much as possible and bring iOS users the best possible experience. However, each of the APIs seems to have a negative aspect:

Metal: 5-second compilation freeze
Vulkan: Compatibility problems?
OpenGL: "Deprecated"

In practical terms, right now, OpenGL gives the best performance and the best user experience. So what to do? The Android version is published on Google Play in OpenGL Compatibility mode and works perfectly. Even though OpenGL has been deprecated on iOS for 7+ years, it has survived all along, with no announced removal date from Apple. It seems to work perfectly and be fully operational up to the latest iOS 26 version, right? Maybe Apple is maintaining it for stability and compatibility reasons, even if they're no longer actively developing it? But the "deprecated" label sounds alarming, as if support could drop any day. So what would be the best choice in this situation?

Will an app built primarily for OpenGL (with a Metal fallback) be rejected right away on the App Store?

On the other hand, Vulkan (via MoltenVK) could be a middle-ground solution: second-best performance, no compilation freeze. But the compatibility aspect is important, and while considerable improvements have been made in Godot's implementation, the current status or possible outcome is harder to assess. Both Metal and OpenGL seem like safer options in that sense.

Graphics & Games General Metal OpenGL

1.4k

Apr ’26

LowLevelInstanceData & animation

AppleOS 26 introduces LowLevelInstanceData, which can reduce CPU draw calls significantly through instancing. However, I have had trouble animating each individual instance. Because I wanted low-level control, I'm using a custom system and LowLevelInstanceData.replace(using:) to update the transforms each frame. The update closure itself is extremely efficient (Instruments reports nearly no cost), but I noticed extremely high run loop time, reaching around 20 ms. Time Profiler shows that the CPU is blocked by kernel.release.t6401. I think this is caused by synchronization between the CPU and GPU. However, since I am already using an MTLCommandBuffer to coordinate it, I don't understand why I am still seeing such high CPU time.

Spatial Computing General Metal RealityKit

1.1k

Apr ’26