On-device model capabilities, limits, and versioning

Question

gromgrom OP

Created Jun ’26

Replies 2

Boosts 0

Participants 3

This post is from the WWDC26 Foundation Models Q&A.

What is the context window of the on-device model (AFM 3 Core Advanced and the 3B Core), and how should developers handle prompts that exceed it — automatic truncation, error, or developer-managed chunking?
For guided/structured generation into typed Swift values, what are the limits on schema complexity (nesting depth, enums, arrays, optionals), and what is the failure mode when the model cannot satisfy the schema?
How deterministic and reliable is on-device tool calling under the Tool protocol — are there guarantees on argument validity, and a recommended pattern for validating/repairing tool arguments before execution?
For the new image input: what are the constraints on resolution, image count per prompt, and formats, and does passing images change which device tiers or which model (on-device vs PCC) services the request?
Since the on-device model ships and updates with the OS, how should developers detect the active model version at runtime and guard against behavioral drift between OS releases? Is there a pinning or capability-query API?
What are the realistic latency and concurrency expectations on supported hardware, and is there a supported way to run multiple sessions or background inference without thermal/throttling penalties?

Answered by Frameworks Engineer in 892955022

Answer 1

divyaravi11992 OP

Jun ’26

Accepted Answer

Sharing what's confirmed from the WWDC26 ML labs, sub-question by sub-question — and flagging where Apple's direct answer is needed.

Context window + overflow: On-device is 4096 tokens, shared between input and output. PCC is 32K. Overflow handling is developer-managed, not automatic — there's no silent truncation you can rely on. The framework added token-counting APIs (check context size, count tokens before sending) and the response reports input/output/cached token usage. Use those to chunk or summarize proactively.
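As a rough illustration of developer-managed chunking, here is a minimal sketch in plain Swift. It does not call the framework's token-counting APIs. `estimateTokens` is a hypothetical stand-in that assumes about 4 characters per token, and the 1024-token reserve for instructions and output is an arbitrary example value.

```swift
import Foundation

/// Rough token estimate. ASSUMPTION: about 4 characters per token, a common
/// heuristic for English text, not the framework's real tokenizer.
func estimateTokens(_ text: String) -> Int {
    max(1, (text.count + 3) / 4)
}

/// Splits `text` into chunks whose estimated size stays within `budget`,
/// breaking on paragraph boundaries.
func chunk(_ text: String, budget: Int) -> [String] {
    var chunks: [String] = []
    var current = ""
    for paragraph in text.components(separatedBy: "\n\n") where !paragraph.isEmpty {
        let candidate = current.isEmpty ? paragraph : current + "\n\n" + paragraph
        if estimateTokens(candidate) <= budget {
            current = candidate
        } else {
            if !current.isEmpty { chunks.append(current) }
            // A single paragraph larger than the budget is kept whole here
            // and would need further splitting in real code.
            current = paragraph
        }
    }
    if !current.isEmpty { chunks.append(current) }
    return chunks
}

// The window is shared between input and output, so reserve part of it.
let contextWindow = 4096
let reservedForInstructionsAndOutput = 1024   // hypothetical split
let inputBudget = contextWindow - reservedForInstructionsAndOutput
```

In a real app the estimate would be replaced by the framework's own token counts, and each chunk would be summarized or processed in its own request.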

Guided generation / schema complexity: There's no published hard limit on nesting depth or enum/array/optional counts. The practical guidance from the labs was to keep schemas as flat and simple as the use case allows, since deeply nested structures are harder for the model to satisfy reliably. The failure mode is a generation error rather than silent malformed output. Best practice: validate the decoded result and be ready to retry. The exact complexity ceiling is something only the Foundation Models team can give definitively.
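The "validate the decoded result and be ready to retry" advice could look something like the sketch below. `Recipe`, `isPlausible`, and `generateValidated` are hypothetical names, and `generate` is a closure standing in for a guided-generation request. It is synchronous here only to keep the example short; a real request would be async.

```swift
/// Hypothetical flat output type; in the framework this would be the typed
/// value produced by guided generation.
struct Recipe {
    var title: String
    var servings: Int
    var steps: [String]
}

enum ValidationFailure: Error {
    case exhaustedAttempts(lastError: Error?)
}

/// Calls `generate` until `isValid` accepts the result or attempts run out.
/// `generate` stands in for a guided-generation request and may throw.
func generateValidated<T>(
    maxAttempts: Int,
    generate: () throws -> T,
    isValid: (T) -> Bool
) throws -> T {
    var lastError: Error?
    for _ in 0..<maxAttempts {
        do {
            let value = try generate()
            if isValid(value) { return value }
        } catch {
            lastError = error   // generation error: try again
        }
    }
    throw ValidationFailure.exhaustedAttempts(lastError: lastError)
}

/// Semantic checks that a schema alone does not express.
func isPlausible(_ recipe: Recipe) -> Bool {
    !recipe.title.isEmpty && (1...24).contains(recipe.servings) && !recipe.steps.isEmpty
}
```

Capping the attempts matters on-device, since each retry costs latency and battery.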

Tool calling reliability: Tool calling is not guaranteed deterministic — the model can produce arguments that don't validate. The recommended pattern is to treat the tool boundary as a validation gate: check arguments inside your Tool implementation before acting, and either throw (letting the model retry) or repair to the nearest valid value. Don't trust raw tool arguments.
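A minimal sketch of such a validation gate, in plain Swift. Whatever the framework guarantees about the shape of the arguments, the values can still fall outside what the app accepts. The thermostat tool, `knownRooms`, and `allowedRange` are hypothetical, and the function would be called at the top of a tool's implementation rather than being part of the Tool protocol itself.

```swift
import Foundation

/// Hypothetical arguments for a thermostat tool. A schema can fix the shape
/// (a String and an Int); it cannot know the app's business rules.
struct SetTemperatureArguments {
    var room: String
    var celsius: Int
}

enum ToolArgumentError: Error, Equatable {
    case unknownRoom(String)
}

let knownRooms: Set<String> = ["kitchen", "bedroom", "office"]   // hypothetical
let allowedRange = 10...30                                        // hypothetical

/// Returns arguments that are safe to execute, or throws so the caller can
/// surface the problem instead of acting on bad input.
func validated(_ args: SetTemperatureArguments) throws -> SetTemperatureArguments {
    let room = args.room.trimmingCharacters(in: .whitespaces).lowercased()
    guard knownRooms.contains(room) else {
        throw ToolArgumentError.unknownRoom(args.room)   // not repairable
    }
    // Repair: clamp an out-of-range value to the nearest valid one.
    let celsius = min(max(args.celsius, allowedRange.lowerBound), allowedRange.upperBound)
    return SetTemperatureArguments(room: room, celsius: celsius)
}
```

Repairing is reasonable for values with an obvious nearest neighbor, such as a clamped number. For identifiers like the room name, throwing is usually safer than guessing.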

Image input constraints: Passing images can change which tier services the request — worth testing whether your image prompts stay on-device or escalate to PCC, since that affects latency and the context budget. Exact resolution/count/format limits weren't given numerically in the labs, so that's one to confirm with the team.

Version detection / drift: This is the hard one. Since the model ships and updates with the OS, drift between releases is real, and no model-pinning API was mentioned. The recommended mitigation is the Evaluations framework — build an eval set of your core cases and run it against each OS beta to catch drift before users do. Evaluation-as-regression-testing is currently the answer, not pinning.
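The shape of an eval set used as a regression test can be sketched in a few lines. This is not the Evaluations framework API, which isn't shown in this thread. `EvalCase`, `EvalReport`, and `runEvals` are hypothetical, and `model` is a closure standing in for a real session call.

```swift
import Foundation

/// One regression case: a prompt plus a check on the model's output.
struct EvalCase {
    let name: String
    let prompt: String
    let passes: (String) -> Bool
}

struct EvalReport {
    let failed: [String]
    let passRate: Double
}

/// Runs every case through `model`, a closure standing in for a real
/// session call, and reports which cases failed.
func runEvals(_ cases: [EvalCase], model: (String) -> String) -> EvalReport {
    let failed = cases.filter { !$0.passes(model($0.prompt)) }.map(\.name)
    let passRate = cases.isEmpty
        ? 1.0
        : Double(cases.count - failed.count) / Double(cases.count)
    return EvalReport(failed: failed, passRate: passRate)
}

// Example cases with property-style checks rather than exact string matches,
// since wording is expected to shift between model versions.
let coreCases = [
    EvalCase(name: "greets", prompt: "Say hello",
             passes: { $0.lowercased().contains("hello") }),
    EvalCase(name: "single word", prompt: "Answer in one word",
             passes: { !$0.contains(" ") }),
]
```

Running the same cases on each OS beta and comparing `passRate` against the previous release gives an early signal of drift.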

Latency / concurrency / background: From the labs — background inference works but can be rate-limited when the system is busy (a distinct error you can catch and retry). On Mac you're not rate-limited in the foreground. Keep work chunked so the system can pause/resume around thermal pressure. Realistic latency and safe multi-session concurrency limits are hardware-dependent and best confirmed directly.
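For the catch-and-retry part, a sketch with exponential backoff might look like this. `InferenceError` is a hypothetical placeholder, since the framework's actual rate-limit error isn't named in this thread, and the delay values are arbitrary.

```swift
import Foundation

/// Hypothetical error cases; the framework's actual error type for a
/// rate-limited request is not shown in this thread.
enum InferenceError: Error {
    case rateLimited
    case other
}

/// Exponential backoff delay in seconds, capped at `cap`.
func backoffDelay(attempt: Int, base: Double = 0.5, cap: Double = 8) -> Double {
    min(cap, base * pow(2, Double(attempt)))
}

/// Retries `operation` only when it reports rate limiting. Any other error,
/// including cancellation during the sleep, propagates to the caller.
func withRateLimitRetry<T>(
    maxAttempts: Int = 4,
    operation: () async throws -> T
) async throws -> T {
    for attempt in 0..<maxAttempts {
        do {
            return try await operation()
        } catch InferenceError.rateLimited where attempt < maxAttempts - 1 {
            let seconds = backoffDelay(attempt: attempt)
            try await Task.sleep(nanoseconds: UInt64(seconds * 1_000_000_000))
        }
    }
    throw InferenceError.rateLimited
}
```

Because `Task.sleep` throws when the task is canceled, a background task that the system cancels stops retrying instead of waking up again.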

Net: context window, overflow, drift mitigation, and tool validation have clear working patterns today. The exact numeric limits (schema depth, image specs, concurrency ceilings) are worth getting from the Foundation Models team directly rather than inferring.

— Divya Ravi, Senior iOS Engineer

Answer 2

Frameworks Engineer OP

Apple

Jun ’26

Recommended

The on-device model has a 4K context window. To avoid exhausting it, use the following techniques:
- Profile how many tokens your prompts use, both with Instruments and with the usage property on LanguageModelSession.Response.
- As described in this WWDC video, use Dynamic Instructions, coupled with tools from the Foundation Models Utilities package, to compact the transcript of your session.
- Consider delegating sub-tasks to new sessions.
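To make the idea of compacting a transcript concrete, here is a standalone sketch in plain Swift. It does not use Dynamic Instructions or the Foundation Models Utilities package. `Turn`, `compact`, and the 4-characters-per-token estimate are all assumptions for illustration, and `summarize` stands in for whatever produces the summary (for example, a separate short-lived session).

```swift
/// Hypothetical transcript entry; the framework's own transcript type is richer.
struct Turn {
    let role: String   // "user" or "assistant"
    let text: String
}

/// ASSUMPTION: about 4 characters per token, not the real tokenizer.
func estimatedTokens(_ turn: Turn) -> Int {
    max(1, (turn.text.count + 3) / 4)
}

/// Keeps the most recent turns that fit in `budget` and replaces everything
/// older with a single summary turn produced by `summarize`.
/// The summary's own size is not counted against the budget in this sketch.
func compact(_ turns: [Turn], budget: Int, summarize: ([Turn]) -> String) -> [Turn] {
    var kept: [Turn] = []
    var used = 0
    for turn in turns.reversed() {
        let cost = estimatedTokens(turn)
        if used + cost > budget { break }
        kept.insert(turn, at: 0)
        used += cost
    }
    let dropped = Array(turns.dropLast(kept.count))
    guard !dropped.isEmpty else { return kept }
    return [Turn(role: "assistant", text: summarize(dropped))] + kept
}
```

Keeping recent turns verbatim and summarizing only the older ones preserves the detail the model most likely needs for the next response.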
Guided generation broadly supports schemas as long as they can be represented in JSON. However, consider that the smaller the schema, the easier it will be for a model to reason about it. If a schema somehow cannot be satisfied, the framework will produce an error.
Guided generation ensures that tool arguments will always follow the defined arguments schema.
There are no set restrictions on image size, though the framework might resize an image before providing it to the model. You may pass as many images as you'd like, as long as they fit within the context window. A broad variety of image formats is supported.
- Passing an image does not change the model servicing the request.
There is currently no pinning or model version retrieval API. As the model gets updated across OS updates, we recommend using the Evaluations framework to catch regressions.
The OS limits the number of concurrent requests to the model. While the framework won't return an error if the model is already in use, there might be a delay before the OS relays your request to the model.
- On iOS, we might limit access to the model from background tasks. We recommend designing your code around the assumption that excessive background requests to the model might be throttled or canceled.
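One way to design around throttled or canceled background requests is to make batch work resumable. The sketch below is plain Swift with hypothetical names (`StepResult`, `processBatch`). `step` stands in for one model request and reports whether it completed or was throttled.

```swift
/// Outcome of one unit of background work.
enum StepResult {
    case done
    case throttled
}

/// Processes `items` starting at `checkpoint` and stops at the first
/// throttled step, returning the index to resume from later.
func processBatch<Item>(
    _ items: [Item],
    from checkpoint: Int,
    step: (Item) -> StepResult
) -> Int {
    var index = checkpoint
    while index < items.count {
        if step(items[index]) == .throttled {
            return index   // this item is retried on the next run
        }
        index += 1
    }
    return index   // equals items.count when everything finished
}
```

Persisting the returned index between runs means a throttled or canceled task loses at most one item of work.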