Metal GPU Driver Crash on M5 Pro + macOS 26.5 — kIOGPUCommandBufferCallbackErrorOutOfMemory with <2GB working sets

Question

iSwon OP

Created May ’26

Summary

Under the Metal driver AGXMetalG17X 351.2 on macOS 26.5 (25F71), LLM inference workloads on the M5 Pro fail with kIOGPUCommandBufferCallbackErrorOutOfMemory (00000008) at working sets as small as ~1.5 GB, even though the machine has 24 GB of unified memory and Apple Diagnostics reports no hardware faults.

Multiple tools are affected: MLX (crash), llama.cpp with the Metal backend (hang), and native apps using Metal for inference.

System

Component	Value

Model	MacBook Pro (Mac17,9)
Chip	Apple M5 Pro (applegpu_g17s)
GPU Cores	16
RAM	24 GB LPDDR5
macOS	26.5 (25F71)
Metal	Metal 4
GPU Driver	AGXMetalG17X 351.2
Xcode	26.5 (17F42)

Reproduction

MLX (Python)

pip install mlx mlx-lm
python -m mlx_lm.generate \
  --model mlx-community/Qwen2.5-3B-Instruct-4bit \
  --max-tokens 10 \
  --prompt "Hello"

Expected: Normal text generation
Actual: Crash with:

libc++abi: terminating due to uncaught exception of type std::runtime_error: 
[METAL] Command buffer execution failed: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)

llama.cpp

brew install llama.cpp
llama-cli --model model.gguf --prompt "Hello" --n-predict 20 --n-gpu-layers 99

Expected: Fast GPU generation
Actual: Process hangs indefinitely

Test Results

Tool	Model	Peak Memory	Result

MLX	Qwen2.5-0.5B-4bit	0.36 GB	✅ Works
MLX	Qwen2.5-1.5B-4bit	0.98 GB	✅ Works
MLX	Qwen3-1.7B-4bit	1.01 GB	✅ Works
MLX	Qwen2.5-3B-4bit	~1.5 GB	❌ Metal OOM crash
MLX	Qwen3-4B-4bit	~2.1 GB	❌ Metal OOM crash
MLX	Qwen3-8B-4bit	~4.5 GB	❌ Metal OOM crash
llama.cpp	Qwen2.5-0.5B GGUF	~0.5 GB	❌ Hangs with GPU
llama.cpp	Qwen2.5-0.5B GGUF	~0.5 GB	✅ Works with CPU only
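
The failures in the table are a process abort (MLX) and an indefinite hang (llama.cpp), so a matrix like this is easier to re-run with a harness that isolates each case in a child process. The following is a minimal sketch under stated assumptions, not the script used to produce the table: `run_case` and `mlx_generate_cmd` are hypothetical helper names, and the 120-second timeout is an arbitrary choice. The MLX command line is the one from the Reproduction section.

```python
import signal
import subprocess
import sys


def run_case(cmd, timeout_s=120.0):
    """Run one reproduction command in a child process and classify the outcome.

    Returns "ok", "hang" (no exit within timeout_s), "signal:<NAME>" when the
    child was killed by a signal (an abort() shows up as SIGABRT), or
    "exit:<code>" for an ordinary non-zero exit status.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before re-raising.
        return "hang"
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # On POSIX a negative return code is the negated signal number.
        try:
            return "signal:" + signal.Signals(-proc.returncode).name
        except ValueError:
            return "signal:%d" % -proc.returncode
    return "exit:%d" % proc.returncode


def mlx_generate_cmd(model):
    """Build the mlx_lm.generate command from the Reproduction section."""
    return [
        sys.executable, "-m", "mlx_lm.generate",
        "--model", model,
        "--max-tokens", "10",
        "--prompt", "Hello",
    ]
```

For example, `run_case(mlx_generate_cmd("mlx-community/Qwen2.5-3B-Instruct-4bit"))` would classify the MLX reproduction, and passing the `llama-cli` argument list with a finite `timeout_s` would turn the hang into a recorded result instead of a stuck terminal.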

Key Evidence

Hardware is healthy — Apple Diagnostics passed all tests
Basic Metal works — matmul, array ops work fine
CPU inference works — llama.cpp with -ngl 0 runs correctly
The error does not appear to reflect actual memory exhaustion: kIOGPUCommandBufferCallbackErrorOutOfMemory indicates that the kernel rejected the Metal memory commit, while the system reports 17.76 GB available for the Metal working set, so physical memory is not full.
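
To make the headroom concrete, here is a small sketch that compares each run's peak memory with the reported working-set limit. The helper names are hypothetical. `query_mlx_limit_gb` assumes the installed MLX build exposes `mx.metal.device_info()` with a `max_recommended_working_set_size` entry and falls back to `None` otherwise. The report does not say how the 17.76 GB figure was obtained, so treating it as that value is also an assumption.

```python
def working_set_fraction(peak_gb, limit_gb):
    """Fraction of the working-set limit used by a run's peak memory."""
    if limit_gb <= 0:
        raise ValueError("limit_gb must be positive")
    return peak_gb / limit_gb


def query_mlx_limit_gb():
    """Return MLX's recommended working-set size in GB, or None if unavailable.

    Assumption: mx.metal.device_info() exists in the installed MLX version and
    reports sizes in bytes; 1 GB is taken as 10**9 bytes.
    """
    try:
        import mlx.core as mx
        info = mx.metal.device_info()
        return info["max_recommended_working_set_size"] / 1e9
    except Exception:
        return None


# Figures copied from the Test Results table and the Key Evidence list.
REPORTED_LIMIT_GB = 17.76
RUNS = [
    ("Qwen3-1.7B-4bit", 1.01, "works"),
    ("Qwen2.5-3B-4bit", 1.5, "Metal OOM crash"),
    ("Qwen3-8B-4bit", 4.5, "Metal OOM crash"),
]


def summarize(runs, limit_gb):
    """Format one line per run showing its share of the working-set limit."""
    return [
        "%s: %.2f GB = %.1f%% of limit (%s)"
        % (name, peak, 100 * working_set_fraction(peak, limit_gb), result)
        for name, peak, result in runs
    ]


if __name__ == "__main__":
    limit = query_mlx_limit_gb() or REPORTED_LIMIT_GB
    for line in summarize(RUNS, limit):
        print(line)
```

With the figures from this report, the smallest failing run (~1.5 GB) uses roughly 8% of the 17.76 GB limit, and the largest working run (1.01 GB) roughly 6%.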

Crash Log Extract

Thread 31 Crashed:
0   libsystem_kernel.dylib    __pthread_kill + 8
1   libsystem_pthread.dylib   pthread_kill + 296
2   libsystem_c.dylib         abort + 148
3   Metal                     MTLReportFailure.cold.1 + 48
4   Metal                     MTLReportFailure + 576
5   Metal                     -[_MTLCommandBuffer addCompletedHandler:] + 104
...
Exception Type: EXC_CRASH (SIGABRT)
Termination Reason: Namespace SIGNAL, Code 6, Abort trap: 6

Related Issues

ml-explore/mlx#3586 — Metal compiler regression on macOS 26.5
ml-explore/mlx#3534 — M5 float32 precision issue
ml-explore/mlx#3568 — M5 random divergence
ml-explore/mlx#3539 — Metal residency OOM (M4 Max)

Request

Please investigate the AGXMetalG17X driver for M5 Pro on macOS 26.5. The driver appears to incorrectly reject Metal memory commits for LLM inference workloads, even when the working set is well within the system's reported limits (1.5GB requested vs 17.76GB available).

Happy to provide full crash logs, sysdiagnose archives, or run additional tests.