Core ML

Integrate machine learning models into your app using Core ML.


MLModelAsset(specification:blobMapping:) with mlprogram model: correct predictions but drastically slower inference than compiled .mlmodelc path

I'm distributing an encrypted .mlpackage to my app and want to load it entirely in memory without ever writing decrypted weights to disk. I tried MLModelAsset(specification:blobMapping:) as the path to achieve this, but ran into a significant inference performance gap compared to the compiled code path.

What I'm trying to do

The encrypted .enc file is a serialized FileWrapper of the full .mlpackage, sealed with AES-GCM. At runtime I decrypt it in memory, deserialize the FileWrapper, extract the spec and weight blob, and load via MLModelAsset:

    static func loadEncryptedPackage(url: URL, configuration: MLModelConfiguration) async throws -> MLModel {
        // AES-GCM decryption → decryptedData (full serialized .mlpackage)
        guard let wrapper = FileWrapper(serializedRepresentation: decryptedData) else { throw ... }
        guard let (specWrapper, specParent) = findSpecWrapper(in: wrapper),
              let spec = specWrapper.regularFileContents else { throw ... }

        var blobs: [URL: Data] = [:]
        collectBlobs(in: specParent, relativePath: "", excluding: specWrapper, into: &blobs)
        // keys built as URL(fileURLWithPath: rel), e.g. "weights/weight.bin"

        let asset = try MLModelAsset(specification: spec, blobMapping: blobs)
        let model = try await MLModel.load(asset: asset, configuration: configuration)

        // See observation #3 below: must retain these for the model's lifetime
        objc_setAssociatedObject(model, &retentionKey, Retainer(spec: spec, blobs: blobs), .OBJC_ASSOCIATION_RETAIN)
        return model
    }

What I observed

1. Predictions are accurate. The blobs are found, weights are applied, and the model produces correct results.

2. Inference is drastically slower than the compiled code path. The same model loaded via MLModel.compileModel(at:) + MLModel.load(contentsOf:) runs inference much faster on the same device with the same MLModelConfiguration (computeUnits = .all). With MLModelAsset the slowdown is consistent across every prediction call, not just the first one.

3. The spec and blob Data objects must stay alive for the model's lifetime. Without retaining them via objc_setAssociatedObject, inference produces NaN outputs or crashes. This suggests Core ML holds a reference back into those Data buffers beyond the load() call, rather than copying them into its own memory during loading.

4. Using the exact blob URI from the spec as the blobMapping key triggers a compilation error. The spec (inspected via strings on the .mlmodel protobuf) stores blob references as @model_path/weights/weight.bin. When I key the blobMapping with URL(string: "@model_path/weights/weight.bin"), MLModel.load(asset:) throws:

    compiler error: Encountered an error while compiling a model: validator error: The in-memory ML Program must not have a blob file reference but found a reference to mem://weights/weight.bin.

   With other key formats (e.g. URL(fileURLWithPath: "weights/weight.bin")), this error does not appear. The model loads and predictions are accurate, but inference is slow as in observation #2.

The working alternative (which I want to avoid)

Decrypting to a temporary directory, calling MLModel.compileModel(at:), loading from the compiled .mlmodelc, then deleting the temp files produces fast inference. Same model, same device, same configuration. The only differences are the compilation step and the fact that decrypted weights touch disk, which I want to avoid for security reasons.

Questions

1. Is MLModelAsset(specification:blobMapping:) expected to produce inference performance equivalent to loading from a compiled .mlmodelc? If not, is the performance gap fundamental to the API or something that can be addressed?

2. Is there any supported way to load an mlprogram model with external weight blobs entirely in memory and achieve inference performance comparable to the compiled code path, i.e. without writing decrypted model data to disk at any point?

3. The validator error "in-memory ML Program must not have a blob file reference" is a hard block when Core ML successfully resolves the blobs and attempts mlprogram compilation. Is this an intended constraint, and does it mean MLModelAsset(specification:blobMapping:) is not the right API for this use case?

Machine Learning & AI Core ML Core ML

Encrypted CoreML model load failed with error : Failed to set up decrypt context for /path/to/xxx.mlmodelc. error:-42905"

Hi there. We use Core ML for image processing, and we have many models, each producing a different effect. However, this error sometimes occurs on some devices. We have no idea when or how it is triggered, but once it happens, loading keeps failing on that device. At that point, either of the following resolves the issue temporarily:

- Uninstall and reinstall the app. It works fine for a while, but the same error soon occurs again.
- Reboot the device. It works fine for longer than after a reinstall.

I tried repeatedly loading and unloading the model in a loop, but I couldn't reproduce the issue. If the model loads successfully the first time the application starts up, it continues to load successfully for the remainder of the application's lifecycle. If I repeatedly launch the app, load the model, and then force-close the app, the problem eventually occurs after a few hundred iterations.

The following are two similar questions:

- https://developer.apple.com/forums/thread/678599
- https://developer.apple.com/forums/thread/740731

A previous reply stated that "it typically means there is a leak in the system," but since the app has already been terminated, the resources it was using should have been automatically reclaimed by the system. This issue has been a problem for a long time, and it does appear to be a system-level bug. This uncertainty and unreliability can lead to user churn.

Machine Learning & AI Core ML Core ML

106

iPhone app memory limit seems capped to 6GB

Hi all :) I tried to raise this in the group lab and was pointed here.

I'm seeing a flat per-app memory ceiling of about 6 GB on iPhone, even on devices with more physical RAM and with com.apple.developer.kernel.increased-memory-limit. Measured with os_proc_available_memory() plus task_vm_info.phys_footprint, the total process budget stays around 6144 MB on both:

- iPhone 16 Pro Max, 8 GB RAM
- iPhone 17 Pro Max, 12 GB RAM

This came up while running Gemma 4 multimodal support in mlx-swift-lm (PR #343). The model loads at about 4.4 GB resident, leaving roughly 1.7 GB for inference/prefill. Reducing a GPU buffer cache from 512 MB to 64 MB recovered enough headroom to avoid jetsam and allowed a full image + video + audio multimodal test to complete, so the measurement seems to reflect a real per-process limit rather than free system memory.

I re-measured the ceiling on the 12 GB phone with these capabilities:

- increased-memory-limit only: ~6144 MB
- increased-memory-limit + extended-virtual-addressing: ~6144 MB, no change
- increased-memory-limit + increased-debugging-memory-limit: ~6656 MB

I have also observed that 12 GB iPad devices expose more memory to an app than 12 GB iPhone devices, but I didn't measure this specifically and no longer have the device to hand.

- Is the ~6 GB per-process tier on Pro iPhones expected, even with increased-memory-limit?
- Is there any supported way for a shipping app to access more of the available RAM on 12 GB iPhone models?

FB23183521
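The budget figure in this post is the sum of two readings: the remaining allowance and the current footprint. A minimal sketch of that arithmetic follows. The function name and the two byte counts are hypothetical stand-ins for the values returned by os_proc_available_memory() and task_vm_info.phys_footprint, picked only so the sum lands on the ~6144 MB quoted above; they are not measurements.

```python
MIB = 1024 * 1024

def process_budget_mib(available_bytes: int, footprint_bytes: int) -> float:
    """Estimate the per-process ceiling as remaining allowance plus current footprint."""
    return (available_bytes + footprint_bytes) / MIB

# Hypothetical readings, chosen only to illustrate the arithmetic.
available = 1740 * MIB   # stand-in for os_proc_available_memory()
footprint = 4404 * MIB   # stand-in for task_vm_info.phys_footprint

budget = process_budget_mib(available, footprint)
print(f"estimated budget: {budget:.0f} MiB ({budget / 1024:.2f} GiB)")
```

Sampling both values at the same moment matters, since the footprint can change between the two calls.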

Machine Learning & AI Core ML iOS Entitlements ML Compute Core ML

129

About the neural network engine I built with Swift and Metal

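I'm 18 this year. I have no machine learning background, never went to university, didn't even attend high school, and have no mentor. A few days ago I was staring blankly at a sheet of paper and suddenly thought: why do computer neural networks have to be 2D? Could they imitate biology? Why must the computation happen on a flat plane? With multiple planes, wouldn't the capacity multiply? What if six sheets of paper were imagined as a Rubik's cube, with each of the six faces carrying neurons and the eight body diagonals becoming new communication channels?

I really enjoy tinkering with this kind of thing, so I immediately drew up a detailed plan and wrote my first kernel with the help of AI tools. It crashed. I rethought it, shared my goal with friends in a QQ group, and wrote it again. It crashed again. Dozens of times in a row. No PyTorch, no TensorFlow, no CUDA. Only Swift and Metal. My computer's GPU is an AMD Vega 64, and I installed no frameworks to help, because I wanted to understand how things run at the lowest level. This is CubeNN.

(The following detailed explanation was written by AI. The content and architecture have changed too much for me to explain everything here in one go.)

What it is

A neural network engine that uses Rubik's cube geometry as its compute architecture.

- Standard Transformer: lays the data out in a row, where every element attends to every other at O(n²) cost.
- CubeNN: distributes the data across 14 faces and attends only where it needs to.
  - 6 standard faces → block-sparse attention (coarse global view + fine local view)
  - 8 diagonal X faces → cross-face information bridges (no attention, they only pass information along)

Each round: compute on the 6 faces → project onto the 8 X faces → upsample and refine → fuse back into the 6 faces.

The most important part is Cube Cascade, a tree + chain cascaded inference scheme:

- Tree stage: 1 cube spawns 8 → 8 spawn 64 → 73 cubes explore in parallel, running simultaneously on the GPU, and the best path is selected.
- Chain stage: the best leaf is refined to unlimited depth. It converges in 3-5 steps, with variance improved by ~7%.

How it's implemented

Pure Swift + Metal. Zero dependencies. Zero frameworks.

    // This is roughly what the code looks like
    import Metal
    import Foundation

    let device = MTLCreateSystemDefaultDevice()!
    let library = try! device.makeLibrary(filepath: "cube_nn.metallib")
    // ...12 GPU kernels, 12,000 dispatches

Key technical decisions:

- Single command buffer: every kernel dispatch for all 73 cubes in the tree stage is packed into one command buffer, with 0 CPU-GPU synchronizations.
- Pipeline state caching: encoding time dropped from 1022 ms to 42 ms.
- Buffer offsets: the 14 faces of all 73 cubes are stored in one contiguous buffer, and kernels receive their offsets through buffer(15).
- FP16: half precision gives a 21% speedup when N≥64.

Performance

(Tested, but the numbers may be inaccurate because of device differences. For reference only.)

AMD Radeon RX Vega 64 (2017 graphics card, 14 nm, 295 W):

    Scale    Neurons   Cubes       Time
    N=32     6,144     73 (tree)   435 ms
    N=64     24,576    21 (tree)   817 ms
    N=128    98,304    1           116 ms

- N=32: fully connected attention costs 201M FLOP per layer → CubeNN block-sparse costs 370K FLOP (544× reduction)
- N=128: fully connected needs 32 GB of VRAM (which physically doesn't exist) → CubeNN uses 192 KB
- N=256: fully connected needs 2.2T FLOP → CubeNN needs 52M FLOP (42,300× reduction)

Code size: 161 KB, compared with 800 MB for PyTorch.

What I went through

The hardest part of this project wasn't writing kernels. It was finding a path through repeated trial and error with nobody to tell me whether it could be done at all.

The first time I tried to run 73 cubes, the GPU simply hung. It took 3 days to trace that to too many stacked command buffers. After switching to a single-encoder design, I hit SIGILL: Metal does not allow makeBuffer(length: 0), and a zero-length buffer was being created when B=0. I wanted to use threadgroup memory for kernel fusion, but data couldn't be read across threadgroups, and only then did I understand that LDS is per-group. FP16 at N=64 required hand-written float↔half conversion functions, because the Float16 type is marked unavailable on macOS 11.

Every crash taught me a low-level detail of Metal. Nobody taught me, but Metal's error messages were the best teacher.

Why I'm posting on the Apple Developer Forums

Because this project was born for the Apple ecosystem. From start to finish CubeNN uses only two things: Swift and Metal. It can run on any Apple Silicon Mac without porting (the API is compatible). If some kernels could be mapped to the Neural Engine in the future, efficiency would multiply several times over.

I'd like to ask Apple's Metal engineers and the Core ML team:

- Is there a better way to schedule GPU work? Performance is still lacking (for a perfectionist like me), and my changes may have left things a bit messy.
- Would anyone be interested in evaluating this architecture on M4? All I have is a Vega 64. What would running CubeNN on an M4 GPU + ANE look like?

Source code

    ├── run.swift                 # unified CLI, parameterized by N/B/depth
    ├── src/
    │   ├── cube_nn.metal         # FP16 kernel
    │   └── cube_nn_fp32.metal    # FP32 kernel
    └── benchmarks/               # measured data

If you've read this far, thank you. An outsider got here with Metal and an obsessive idea so pure it borders on delusion. I don't know very much, but if this architecture has any value, I want to make it better. Any suggestions, criticism, or guidance are very welcome.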
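The counts quoted in this post can be cross-checked with a few lines of arithmetic. The sketch below is not CubeNN code. It assumes the tree stage is a full 8-ary tree of depth 2, and that each of the 6 standard faces holds an N × N grid of neurons. The second assumption is inferred only from the fact that it reproduces the table's neuron counts. The function names are made up for illustration.

```python
def tree_cubes(branching: int = 8, depth: int = 2) -> int:
    """Total cubes in a tree where every cube spawns `branching` children per level."""
    return sum(branching ** level for level in range(depth + 1))

def neurons(n: int, faces: int = 6) -> int:
    """Neuron count, assuming each of the 6 standard faces holds an n x n grid."""
    return faces * n * n

print("cubes in tree stage:", tree_cubes())          # 1 + 8 + 64
for n in (32, 64, 128):
    print(f"N={n}: {neurons(n)} neurons")

# Ratio of the two rounded FLOP figures quoted for N=32.
print("FLOP ratio at N=32:", round(201e6 / 370e3))
```

The rounded FLOP figures give a ratio of about 543, in line with the quoted 544× once rounding of the inputs is taken into account.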

Machine Learning & AI Core ML Swift Metal

131

Core ML RIP?

No mention of Core ML at WWDC26... Shall we assume it was replaced by Core AI? What about Adapters?

Machine Learning & AI Core ML

191

Silent FP16 Overflow in coremltools: 5 Numerical Failures Affecting ANE Inference (With Fixes)

Hi everyone,

With the announcements at WWDC26 regarding Core AI and "automatic stable decompositions," it is clear that managing mathematical stability in constrained FP16 environments is a major priority for the ecosystem. To support developers maintaining existing models that cannot migrate to the newest architectures overnight, I have published a research paper and an open-source static analysis tool documenting 5 silent numerical failures in the standard coremltools pipeline.

Because the Apple Neural Engine (ANE) executes inference in FP16, the maximum representable value is 65,504 ($\approx \exp(11.09)$). Inputs exceeding these tight bounds cause silent overflows to infinity or collapses to zero without warnings.

Deployed Operations Currently Affected

- softplus (YOLOv5/v8): Outputs silently collapse to 0.0 at $x > 10.4$ on ANE.
- logsumexp (Attention mechanisms): Overflows at $x > 7.63$ for 32 channels. For vocabulary-sized reductions, the threshold drops below $5$.
- log_softmax (Classifiers like BERT, GPT, ViT): Softmax probabilities underflow to 0, causing $\log(0) = -\infty$.
- logcumsumexp (CTC decoders): Overflows at $x > 11.09$.
- mish (YOLO variants): Inherits the softplus overflow limits.

The Immediate Safety Net: Algebraically Equivalent Reformulations

We can bypass these hardware limits entirely by rewriting the operations into mathematically stable forms. For example, rewriting softplus as:

$$\max(x, 0) + \log(1 + \exp(-|x|))$$

Because $-|x| \le 0$, $\exp(-|x|)$ always lies in $(0, 1]$. The exponential therefore cannot overflow in any precision, yielding bit-identical outputs for all valid inputs. While PyTorch AMP traditionally classifies these operations as FP32-only, the ANE has no such fallback, which makes stable decomposition mandatory.

Tools & Patches Deployed Today

- The Paper: "Silent Numerical Failures in On-Device ML Converters: A Systematic Audit of FP16 Overflow in Apple Neural Engine Deployment." (Complete vulnerability census, discrepancy pattern analysis, formal proofs, and quantitative evaluation.)
- The Tool (ane-fp16-lint): A CLI that scans .mlpackage files and flags FP16-unsafe operations before you push to production. It detects nine patterns and provides stable alternatives for each.
- The Fixes: We have submitted three Pull Requests to the official apple/coremltools repository implementing these stable decompositions, which are currently under review by Apple's Core ML team.

While Core AI introduces great automated stability for new architectures like the 20B AFM 3 Core Advanced, millions of deployed production models still need an immediate safety net.

Full technical paper, proofs, and the linting tool are available on GitHub: github.com/apple-f16-overflow-audit (Note: Replace with your direct, clean GitHub repository link, avoiding social media redirects so the forum filters do not auto-flag the post)

Looking forward to hearing if anyone else has run into these unexpected discrepancy patterns in production!
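The softplus rewrite described in this post can be tried on a CPU with nothing but the Python standard library. This is a rough sketch, not the post's tool or patch. It emulates FP16 by rounding every intermediate through the half-precision `e` format of `struct`, and the helper names (`fp16`, `softplus_naive`, `softplus_stable`) are made up for illustration. It only shows the naive form overflowing once exp(x) exceeds 65,504. It says nothing about what the ANE does with that overflow.

```python
import math
import struct

def fp16(x: float) -> float:
    """Round a Python float to IEEE half precision; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def softplus_naive(x: float) -> float:
    # log(1 + exp(x)), with every intermediate rounded to FP16.
    e = fp16(math.exp(x))
    return fp16(math.log(fp16(1.0 + e)))

def softplus_stable(x: float) -> float:
    # max(x, 0) + log(1 + exp(-|x|)); exp(-|x|) always lies in (0, 1].
    e = fp16(math.exp(-abs(x)))
    return fp16(max(x, 0.0) + fp16(math.log1p(e)))

# Inputs are kept small enough that math.exp itself stays finite in double precision.
for x in (0.0, 5.0, 12.0):
    print(x, softplus_naive(x), softplus_stable(x))
```

At x = 12 the naive form returns infinity because exp(12) is about 162,755, while the stable form returns a finite value close to x.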

Machine Learning & AI Core ML

LLM inference on Apple Silicon: why do some MoE architectures outperform dense models despite similar parameter counts?

We're doing some local LLM inference experiments on Apple Silicon and have observed something that seems counterintuitive. In several cases, MoE models significantly outperform dense models despite having similar total parameter counts.

Examples (simplified):

- Dense model: ~30B parameters
- MoE model: ~30B total parameters, ~3B active parameters

On Apple Silicon, the MoE model consistently achieves higher decode throughput even after accounting for routing overhead.

A few hypotheses we're considering:

- Active parameter count appears to matter more than total parameter count for decode throughput.
- Memory traffic may dominate M=1 autoregressive decode, making sparse activation more important than expected.
- Expert matrix geometry might matter as much as parameter count. Some MoE designs appear to produce GPU-friendly GEMV shapes while others do not.
- Quantization layout and memory alignment seem to have surprisingly large effects on practical throughput.

What I'm curious about is whether others have observed similar behavior on Apple Silicon specifically. Has anyone profiled decode throughput across:

- dense models
- large-expert MoE
- many-small-expert MoE

and identified which hardware characteristics are actually driving the difference? I'm particularly interested in observations from Metal profiling rather than benchmark leaderboards.
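The memory-traffic hypothesis can be turned into a back-of-the-envelope bound: if M=1 decode is limited by reading each active weight once per token, throughput is at most bandwidth divided by bytes read per token. The sketch below assumes a 400 GB/s memory bandwidth and 4-bit weights (0.5 bytes per parameter). Both numbers are illustrative assumptions, not measurements from this post, and the model ignores KV-cache traffic, routing, and compute.

```python
def decode_tokens_per_second(active_params: float,
                             bytes_per_param: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode rate if every active weight is read once per token."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token

BANDWIDTH = 400e9      # assumed memory bandwidth in bytes/s (hypothetical)
BYTES_PER_PARAM = 0.5  # assumed 4-bit quantized weights

dense = decode_tokens_per_second(30e9, BYTES_PER_PARAM, BANDWIDTH)
moe = decode_tokens_per_second(3e9, BYTES_PER_PARAM, BANDWIDTH)
print(f"dense ~{dense:.1f} tok/s, MoE ~{moe:.1f} tok/s, ratio {moe / dense:.1f}x")
```

Under these assumptions the bound scales inversely with active parameters, so a 10× reduction in active weights gives a 10× higher ceiling regardless of total parameter count. Measured gaps smaller than that would point to the other factors listed above.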

Machine Learning & AI Core ML

Apple GPU forward progress guarantees for persistent-thread synchronization?

We're doing some research on Apple Silicon inference runtimes and trying to understand the practical synchronization boundary of Apple GPUs. We are not asking about threadgroup barriers (those are documented), but about device-scope synchronization patterns built from atomics.

What we've observed:

- Device-scope atomics are available.
- It is possible to build global counters and persistent-thread style coordination structures.
- However, we cannot find any documented guarantee regarding threadgroup co-residency, global forward progress, or occupancy-bounded synchronization safety.

In our experiments, synchronization schemes that rely on all threadgroups eventually making progress can become unreliable, while strictly local producer/consumer handoff patterns appear much more robust.

Questions:

1. Does Metal provide any documented forward-progress guarantees across threadgroups beyond what is explicitly stated in the Metal specification?
2. Is there any recommended pattern for implementing long-lived producer/consumer GPU pipelines without relying on global synchronization assumptions?
3. For Apple GPUs specifically, should developers assume that occupancy-bounded global synchronization is unsupported unless explicitly provided by the API?

We are not looking for undocumented implementation details, only for guidance on what assumptions are safe for production systems. Thanks.
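The co-residency concern can be illustrated with a toy model. The sketch below is plain Python, not Metal, and it is not a description of how any Apple GPU schedules work. It assumes a scheduler with a fixed number of resident slots that never preempts a threadgroup spinning in a global barrier. The function name and parameters are hypothetical.

```python
def spin_barrier_completes(num_groups: int, resident_slots: int,
                           max_steps: int = 1000) -> bool:
    """Toy scheduler: a threadgroup holds its slot until it exits the barrier."""
    arrived = 0            # global atomic counter (modeled as an int)
    waiting = 0            # groups currently spinning inside the barrier
    pending = num_groups   # groups that have not been scheduled yet
    for _ in range(max_steps):
        # Fill free slots with pending groups; each arrival bumps the counter.
        while pending > 0 and waiting < resident_slots:
            pending -= 1
            waiting += 1
            arrived += 1
        if arrived == num_groups:
            return True    # every group arrived, so all spinners can exit
        # Otherwise the resident groups keep spinning and never free a slot.
    return False

print(spin_barrier_completes(num_groups=8, resident_slots=8))
print(spin_barrier_completes(num_groups=8, resident_slots=4))
```

In this model the barrier finishes only when every group fits on the device at once. With fewer slots than groups, the resident groups wait forever for groups that can never start. That is the failure mode a forward-progress guarantee would rule out.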

Machine Learning & AI Core ML

Resolving co channel interference VOIP

Subject: Inquiry Regarding Architectural Overhead and Buffer Access in the Push to Talk Framework for Real-Time Core ML Blind Source Separation

Dear Apple Engineering Team,

We are currently developing an Apple-native communication platform that utilizes the Push to Talk framework alongside Core ML to handle real-time, on-device audio processing. We are working to resolve the issue of single-channel, co-channel interference (overlapping voice streams) directly on the edge.

Our current challenge lies in the pipeline latency and background lifecycle constraints when intercepting incoming audio buffers. To cleanly separate overlapping voices before they hit the audio output mixer, we need to process the raw PCM data immediately upon arrival.

Could you please provide guidance on the following architectural questions:

- Low-Latency Buffer Interception: What is the recommended design pattern within the PTChannelManagerDelegate flow to pass raw incoming audio buffers directly to a Core ML model running on the Apple Neural Engine (ANE) before the system routes them to AVAudioEngine for playback?
- Background Thread Management: Given the strict background execution boundaries enforced by the Push to Talk framework, how can we best optimize thread scheduling to ensure our speech separation model completes its execution without triggering an OS background processing timeout or process termination?
- Dynamic UI Manifestation: Once a combined audio stream is separated into two clean, distinct voice vectors on-device, what is the best approach for registering multiple PTParticipant states simultaneously so that the native system UI (like the Dynamic Island) accurately reflects both speakers?

Thank you for your time, insights, and continued support of developer innovation within the iOS and iPadOS ecosystems.

Best regards,
Ken Zakreski
Founder, Marine Link Pro

Machine Learning & AI Core ML

111

_FusedMatMul with [BiasAdd, Relu] produces incorrect results in graph mode on Metal GPU

When running a tf.function-traced graph on the Metal GPU, any operation that combines MatMul → BiasAdd → Relu (the fused pattern emitted by tf.keras.layers.Dense(activation='relu')) produces numerically incorrect output, with errors on the order of tens of units rather than floating-point noise. Eager mode on the same Metal GPU is correct. Graph mode forced to CPU (tf.config.set_visible_devices([], 'GPU')) is also correct. The bug is deterministic and data-independent (it reproduces with random weights).

Only the three-op combination of MatMul + BiasAdd + Relu triggers the error. Specifically:

- relu(tf.nn.bias_add(tf.matmul(x, W), b)) in graph mode on Metal is wrong, while relu(tf.matmul(x, W) + b) (using AddV2 instead of BiasAdd) is correct.
- Removing the Relu also makes the result correct: tf.nn.bias_add(tf.matmul(x, W), b) without a following Relu produces correct output at every shape tested.

This points to the Metal plugin's fused _FusedMatMul kernel with fused_ops=[BiasAdd, Relu] as the culprit. Disabling the TF core grappler remapping pass (tf.config.optimizer.set_experimental_options({'remapping': False})) does not fix the issue, confirming that the fusion decision is made inside the Metal plugin's own kernel selection, below the TF core graph optimizer.

The bug reproduces across all shapes tested (batch 4–200, inner dimension K 512–8192, output 128–2048) and is not specific to any particular weight values.

A minimal reproducer:

    import tensorflow as tf
    import numpy as np

    # Any shape works; larger K makes the error more obvious
    M, K, N = 64, 2048, 1024
    W = tf.Variable(tf.random.normal([K, N]))
    b = tf.Variable(tf.random.normal([N]))
    x = tf.random.normal([M, K])

    @tf.function
    def graph_fused(x):
        return tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b))

    @tf.function
    def graph_safe(x):
        return tf.nn.relu(tf.matmul(x, W) + b)  # AddV2 instead of BiasAdd

    eager_ref = tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b))  # eager = correct
    fused_out = graph_fused(x)  # Metal graph mode = WRONG
    safe_out = graph_safe(x)    # Metal graph mode = correct

    print(f"eager vs graph_fused (BiasAdd): {tf.reduce_max(tf.abs(eager_ref - fused_out)).numpy():.1f}")
    # ^ typically 30–80+ (WRONG)
    print(f"eager vs graph_safe (AddV2): {tf.reduce_max(tf.abs(eager_ref - safe_out)).numpy():.2e}")
    # ^ typically ~1e-5 (correct)

Environment: TensorFlow 2.18.1, Keras 3.11.2, tensorflow-metal (latest as of 2026-05-26), Apple Silicon Mac.

Impact: This breaks any Keras model that uses Dense(activation='relu') when called inside a tf.function or via SavedModel serving on the Metal GPU. Eager-mode inference is unaffected.

Machine Learning & AI Core ML Metal tensorflow-metal

1.1k

May ’26

When will mps support fp8 dtypes?

https://github.com/pytorch/pytorch/issues/132624 This issue about unsupported fp8 dtypes has been open for 2 years. Does MLX have any plans to support them?

Machine Learning & AI Core ML ML Compute

719

May ’26

Does loading multiple functions that share model weights multiply memory use?

Hi, I have a multifunction model where the functions share the same model weights, and for latency reasons I keep multiple functions loaded at the same time. According to what Codex found, this multiplies RAM usage: if the model's weights take 2 GB, loading two functions that share the underlying weights still doubles RAM usage to 4 GB (it seems to be something like neural wired memory). Does anyone know more about this?

Machine Learning & AI Core ML

1.2k

May ’26

CoreML model load failed with this error : Failed to set up decrypt context for /private/var/mobile/Containers/Data/Application/ACB94507-F8DE-494B-8499-B0CF75FC3B55/Library/Caches/temp.m/xxx.mlmodelc. error:-42905"

Hi there. We use a Core ML model for image processing, and because loading the model takes a long time (~10 sec), we preload it at app launch. On some devices, however, loading fails with this error. We download the Core ML model from a server and then load it from local storage. The loading code is typical:

    MLModel.load(contentsOf: compliedUrl, configuration: config)

Once this error happens, loading keeps failing until we restart the device.

(+) In this thread, I saw that it is related to some "limitation of decrypt session": https://developer.apple.com/forums/thread/707622. However, it also happens in in-house TestFlight builds that are used by fewer than 5 people. Why does this happen?

Machine Learning & AI Core ML Core ML

2.6k

May ’26

CoreML model cache causes fake hard drive memory usage

Hi, I experiment by creating and compiling a lot of Core ML models, and this causes a lot of disk usage. When I try to delete everything (I searched the disk for possible Core ML cache directories), the disk space is not actually freed. This is a picture of my disk usage according to what is shown in Settings > General > Storage and in the Disk Utility app. I am running macOS 15.7.5.

Machine Learning & AI Core ML

1.6k

May ’26

Does using Vision API offline to label a custom dataset for Core ML training violate DPLA?

Hello everyone,

I am currently developing a smart camera app for iOS that recommends optimal zoom and exposure values on-device using a custom Core ML model. I am still waiting for an official response from Apple Support, but I wanted to ask the community if anyone has experience with a similar workflow regarding App Review and the DPLA.

Here is my training methodology:

1. I gathered my own proprietary dataset of original landscape photos.
2. I generated multiple variants of these photos with different zoom and exposure settings offline on my Mac.
3. I used the CalculateImageAestheticsScoresRequest (Vision framework) via a local macOS command-line tool to evaluate and score each variant.
4. Based on those scores, I labeled the "best" zoom and exposure parameters for each original photo.
5. I used this labeled dataset to train my own independent neural network using PyTorch, and then converted it to a Core ML model to ship inside my app.

Since the app uses my own custom model on-device and does not send any user data to a server, the privacy aspect is clear. However, I am curious whether using the output of Apple's Vision API strictly offline to label my own dataset could be interpreted as "reverse engineering" or a violation of the Developer Program License Agreement (DPLA).

Has anyone successfully shipped an app using a similar knowledge distillation or automated dataset labeling approach with Apple's APIs? Did you face any pushback during App Review? Any insights or shared experiences would be greatly appreciated!

Machine Learning & AI Core ML App Review Vision Machine Learning Core ML

638

Apr ’26

MPS SDPA Attention Kernel Regression on A14-class (M1) in macOS 26.3.1 — Works on A15+ (M2+)

Summary

Since macOS 26, our Core ML / MPS inference pipeline produces incorrect results on Mac mini M1 (Macmini9,1, A14-class SoC). The same model and code run correctly on M2 and newer (A15-class and up). The regression appears to be in the Scaled Dot-Product Attention (SDPA) kernel path in the MPS backend.

Environment

- Affected: Mac mini M1, Macmini9,1 (A14-class)
- Not affected: M2 and newer (A15-class and up)
- Last known good: macOS Sequoia
- First broken: macOS 26 (Tahoe) ?
- Confirmed broken on: macOS 26.3.1
- Framework: Core ML + MPS backend
- Language: C++ (via CoreML C++ API)

Description

We ship an audio processing application (VoiceAssist by NoiseWorks) that runs a deep learning model (based on the Demucs architecture) via Core ML with the MPS compute unit. On macOS Sequoia this works correctly on all Apple Silicon Macs including M1. After updating to macOS 26 (Tahoe), inference on M1 Macs fails, either producing garbage output or crashing. The same binary, same .mlpackage, and same inputs work correctly on M2+.

Our Apple contact has suggested the root cause is a regression in the A14-specific MPS SDPA attention kernel, which may have broken when the Metal/MPS stack was updated in macOS 26. The model makes heavy use of attention layers, and the failure correlates precisely with the SDPA path being exercised on A14 hardware.

Steps to Reproduce

1. Load a Core ML model that uses Scaled Dot-Product Attention (e.g. a transformer or attention-based audio model)
2. Run inference with MLComputeUnits::cpuAndGPU (MPS active)
3. Run on Mac mini M1 (Macmini9,1) with macOS 26.3.1
4. Compare output to the same model running on M2 / macOS Sequoia

Expected: Correct inference output, consistent with M2+ and macOS Sequoia behavior

Actual: Incorrect / corrupted output (or crash), only on A14-class hardware running macOS 26+

Workaround

Forcing MLComputeUnits::cpuOnly bypasses MPS entirely and produces correct output on M1, confirming the issue is in the MPS compute path. This is not acceptable as a shipping workaround due to performance impact.

Additional Notes

- The failure is hardware-specific (A14 only) and OS-specific (macOS 26+), pointing to a kernel-level regression rather than a model or app bug
- We first became aware of this through a customer report
- Happy to provide a symbolicated crash log if helpful

This text was summarized by AI and human verified.

Machine Learning & AI Core ML Metal Performance Shaders

516

Apr ’26

CoreML MLE5ProgramLibrary AOT recompilation hangs/crashes on iOS 26.4 — C++ exception in espresso IR compiler bypasses Swift error handling

Area: CoreML / Machine Learning

Describe the issue:

On iOS 26.4, calling MLModel(contentsOf:configuration:) to load an .mlpackage model hangs indefinitely and eventually kills the app via watchdog. The same model loads and runs inference successfully in under 1 second on iOS 26.3.1.

The hang occurs inside eort_eo_compiler_compile_from_ir_program (espresso) during on-device AOT recompilation triggered by MLE5ProgramLibraryOnDeviceAOTCompilationImpl createProgramLibraryHandleWithRespecialization:error:. A C++ exception (__cxa_throw) is thrown inside libBNNS.dylib during the exception unwind, which then hangs inside __cxxabiv1::dyn_cast_slow and __class_type_info::search_below_dst. Swift's try/catch does not catch this: the exception originates in C++ and the process hangs rather than terminating cleanly.

Setting config.computeUnits = .cpuOnly does not resolve the issue. MLE5ProgramLibrary initialises as shared infrastructure regardless of compute units.

Steps to reproduce:

1. Create an app with an .mlpackage CoreML model using the MLE5/espresso backend
2. Call MLModel(contentsOf: modelURL, configuration: config) at runtime
3. Run on a device on iOS 26.3.1: loads successfully in <1 second
4. Update device to iOS 26.4: hangs indefinitely, app killed by watchdog after 60–745 seconds

Expected behaviour: Model loads successfully, or throws a catchable Swift error on failure.

Actual behaviour: Process hangs in MLE5ProgramLibrary.lazyInitQueue. App killed by watchdog. No Swift error thrown.

Full stack trace at point of hang:

    Thread 1 Queue: com.apple.coreml.MLE5ProgramLibrary.lazyInitQueue (serial)
    frame 0: __cxxabiv1::__class_type_info::search_below_dst libc++abi.dylib
    frame 1: __cxxabiv1::(anonymous namespace)::dyn_cast_slow libc++abi.dylib
    frame 2: ___lldb_unnamed_symbol_23ab44dd4 libBNNS.dylib
    frame 23: eort_eo_compiler_compile_from_ir_program espresso
    frame 24: -[MLE5ProgramLibraryOnDeviceAOTCompilationImpl createProgramLibraryHandleWithRespecialization:error:] CoreML
    frame 25: -[MLE5ProgramLibrary _programLibraryHandleWithForceRespecialization:error:] CoreML
    frame 26: __44-[MLE5ProgramLibrary prepareAndReturnError:]_block_invoke CoreML
    frame 27: _dispatch_client_callout libdispatch.dylib
    frame 28: _dispatch_lane_barrier_sync_invoke_and_complete libdispatch.dylib
    frame 29: -[MLE5ProgramLibrary prepareAndReturnError:] CoreML
    frame 30: -[MLE5Engine initWithContainer:configuration:error:] CoreML
    frame 31: +[MLE5Engine loadModelFromCompiledArchive:modelVersionInfo:compilerVersionInfo:configuration:error:] CoreML
    frame 32: +[MLLoader _loadModelWithClass:fromArchive:modelVersionInfo:compilerVersionInfo:configuration:error:] CoreML
    frame 45: +[MLModel modelWithContentsOfURL:configuration:error:] CoreML
    frame 46: @nonobjc MLModel.__allocating_init(contentsOf:configuration:) GKPersonalV2
    frame 47: MDNA_GaitEncoder_v1_3.__allocating_init(contentsOf:configuration:)
    frame 48: MDNA_GaitEncoder_v1_3.__allocating_init(configuration:)
    frame 50: GaitModelInference.loadModel()
    frame 51: GaitModelInference.init()

iOS version: Reproduced on iOS 26.4. Works correctly on iOS 26.3.1.
Xcode version: 26.2
Device: iPhone (model used in testing)
Model format: .mlpackage

Machine Learning & AI Core ML ML Compute

Apr ’26

Sharing a Swift port of Gemma 4 for mlx-swift-lm — feedback welcome

Hi all,

I've been working on a pure-Swift port of Google's Gemma 4 text decoder that plugs into mlx-swift-lm as a sidecar model registration. Sharing it here in case anyone else hit the same wall I did, and to get feedback from the MLX team and the community before I propose anything upstream.

Repo: https://github.com/yejingyang8963-byte/Swift-gemma4-core

Why

As of mlx-swift-lm 2.31.x, Gemma 4 isn't supported out of the box. The obvious workaround, reusing the Gemma 3 text implementation with a patched config, fails at weight load because Gemma 4 differs from Gemma 3 in several structural places. The chat-template path through swift-jinja 1.x also silently corrupts the prompt, so the model loads but generates incoherent text.

What's in the package

- A from-scratch Swift implementation of the Gemma 4 decoder (Configuration, Layers, Attention, MLP, RoPE, DecoderLayer)
- Per-Layer Embedding (PLE) support: the shared embedding table that feeds every decoder layer through a gated MLP as a third residual
- KV sharing across the back half of the decoder, threaded through the forward pass via a donor table with a single global rope offset
- A custom Gemma4ProportionalRoPE class for the partial-rotation rope type that initializeRope doesn't currently recognize
- A chat-template bypass that builds the prompt as a literal string with the correct turn markers and encodes via tokenizer.encode(text:), matching Python mlx-lm's apply_chat_template byte-for-byte

Measured on iPhone (A-series, 7.4 GB RAM)

- Model: mlx-community/gemma-4-e2b-it-4bit
- Warm load: ~6 s
- Memory after load: 341–392 MB
- Time to first token (end-to-end, 333-token system prompt): 2.82 s
- Generation throughput: 12–14 tok/s

What I'd love feedback on

- Is the sidecar registration pattern the right way to extend mlx-swift-lm with new model families, or is there a more idiomatic path I missed?
- The chat-template bypass works but feels like a workaround. Is the right long-term fix in swift-jinja, in the tokenizer, or somewhere else entirely?
- Is anyone running into the same PLE / KV-sharing issues on other Gemma-family checkpoints? I'd like to make sure the implementation generalizes beyond E2B before tagging a 0.2.0.

Happy to open a PR against mlx-swift-lm if the maintainers think any of this belongs upstream. Thanks for reading.

Machine Learning & AI Core ML

485

Apr ’26

CoreML GPU NaN bug with fused QKV attention on macOS Tahoe

Problem: CoreML produces NaN on GPU (works fine on CPU) when running transformer attention with fused QKV projection on macOS 26.2.

Root cause: The common::fuse_transpose_matmul optimization pass triggers a Metal kernel bug when sliced tensors feed into matmul(transpose_y=True).

Workaround:

    import coremltools as ct

    # Convert with the default pipeline minus the pass that triggers the bug
    pipeline = ct.PassPipeline.DEFAULT
    pipeline.remove_passes(['common::fuse_transpose_matmul'])
    mlmodel = ct.convert(model, ..., pass_pipeline=pipeline)

Minimal repro: https://github.com/imperatormk/coreml-birefnet/blob/main/apple_bug_repro.py

Affected: Any ViT/Swin/transformer with fused QKV attention (BiRefNet, etc.)

Has anyone else hit this? Filed FB report too.

Machine Learning & AI Core ML

716

Apr ’26

Memory stride warning when loading CoreML models on ANE

When I do an uncached load of a CoreML model on the ANE, I receive this warning in the Xcode console:

    Type of hiddenStates in function main's I/O contains unknown strides. Using unknown strides for MIL tensor buffers with unknown shapes is not recommended in E5ML. Please use row_alignment_in_bytes property instead. Refer to https://e5-ml.apple.com/more-info/memory-layouts.html for more information.

However, the web link does not seem to be working. Where can I find more information about this, and how can I fix it?

Machine Learning & AI Core ML

900

Mar ’26
