Post

Replies

Boosts

Views

Activity

KV-Cache MLState Not Updating During Prefill Stage in Core ML LLM Inference
Hello, I'm running a large language model (LLM) in Core ML that uses a key-value cache (KV-cache) to store past attention states. The model was converted from PyTorch using coremltools and deployed on-device with Swift. The KV-cache is exposed via MLState and is used across inference steps for efficient autoregressive generation. During the prefill stage — where a prompt of multiple tokens is passed to the model in a single batch to initialize the KV-cache — I’ve noticed that some entries in the KV-cache are not updated after the inference. Specifically: Here are a few details about the setup: The MLState returned by the model is identical to the input state (often empty or zero-initialized) for some tokens in the batch. The issue only happens during the prefill stage (i.e., first call over multiple tokens). During decoding (single-token generation), the KV-cache updates normally. The model is invoked using MLModel.prediction(from:using:options:) for each batch. I’ve confirmed: The prompt tokens are non-repetitive and not masked. The model spec has MLState inputs/outputs correctly configured for KV-cache tensors. Each token is processed in a loop with the correct positional encodings. Questions: Is there any known behavior in Core ML that could prevent MLState from updating during batched or prefill inference? Could this be caused by internal optimizations such as lazy execution, static masking, or zero-value short-circuiting? How can I confirm that each token in the batch is contributing to the KV-cache during prefill? Any insights from the Core ML or LLM deployment community would be much appreciated.
1
0
107
May ’25