Sharing what's confirmed from the WWDC26 ML labs, sub-question by sub-question — and flagging where Apple's direct answer is needed.
Context window + overflow: The on-device model has a 4096-token context window, shared between input and output. Private Cloud Compute (PCC) has 32K. Overflow handling is developer-managed, not automatic: there's no silent truncation you can rely on. The framework added token-counting APIs (check context size, count tokens before sending), and the response reports input/output/cached token usage. Use those to chunk or summarize proactively, before a request exceeds the window.
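A minimal sketch of the proactive chunking idea, in plain Swift. The token counter here is a hypothetical stand-in (a crude whitespace word count) for the framework's token-counting API, whose exact name and signature I'm not quoting; the 1024-token output reserve is also just an assumption to tune.

```swift
// Hypothetical stand-in for the framework's token counter. A whitespace word
// count will NOT match real tokenization; swap in the real API.
func approximateTokenCount(_ text: String) -> Int {
    text.split(whereSeparator: { $0.isWhitespace }).count
}

/// Greedily packs paragraphs into chunks that stay within `budget` tokens.
/// A single paragraph larger than the budget becomes its own oversized chunk,
/// so the caller can split or summarize it further.
func chunk(_ paragraphs: [String], budget: Int, count: (String) -> Int) -> [[String]] {
    var chunks: [[String]] = []
    var current: [String] = []
    var used = 0
    for paragraph in paragraphs {
        let cost = count(paragraph)
        // Close the current chunk if this paragraph would push it over budget.
        if !current.isEmpty && used + cost > budget {
            chunks.append(current)
            current = []
            used = 0
        }
        current.append(paragraph)
        used += cost
    }
    if !current.isEmpty { chunks.append(current) }
    return chunks
}

// The 4096-token window is shared with the output, so reserve room for it.
let contextSize = 4096
let reservedForOutput = 1024   // assumption: tune per use case
let inputBudget = contextSize - reservedForOutput
```

Each chunk can then be sent in its own request, or summarized first and the summaries combined, depending on the task.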
Guided generation / schema complexity: There's no published hard limit on nesting depth or enum/array/optional counts. The practical guidance from the labs was to keep schemas as flat and simple as the use case allows, since deeply nested structures are harder for the model to satisfy reliably. The failure mode is a generation error rather than silent malformed output. Best practice: validate the decoded result and be ready to retry. The exact complexity ceiling is something only the Foundation Models team can give definitively.
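The validate-and-retry practice can be sketched as a small generic helper. Everything here is illustrative: `generate` is a placeholder closure for whatever guided-generation call you make, and `generateValidated` and `ValidationError` are names I made up, not framework API.

```swift
struct ValidationError: Error {
    let reason: String
}

/// Calls `generate` up to `maxAttempts` times and returns the first result
/// that passes `validate`. `validate` returns a failure reason, or nil if valid.
func generateValidated<T>(
    maxAttempts: Int = 3,
    generate: () async throws -> T,
    validate: (T) -> String?
) async throws -> T {
    var lastError: Error = ValidationError(reason: "no attempts made")
    for _ in 0..<max(0, maxAttempts) {
        do {
            let value = try await generate()
            guard let reason = validate(value) else { return value }
            lastError = ValidationError(reason: reason)
        } catch {
            // Generation error: remember it and try again.
            lastError = error
        }
    }
    throw lastError
}
```

This retries on any thrown error to keep the sketch short. In real code you'd likely rethrow errors where a retry can't help (cancellation, or a prompt that simply doesn't fit the context window).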
Tool calling reliability: Tool calling is not guaranteed deterministic — the model can produce arguments that don't validate. The recommended pattern is to treat the tool boundary as a validation gate: check arguments inside your Tool implementation before acting, and either throw (letting the model retry) or repair to the nearest valid value. Don't trust raw tool arguments.
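A sketch of the validation gate in plain Swift, outside any framework types. `TimerArguments`, the 1...180 range, and `ToolArgumentError` are hypothetical; in a real Tool implementation this check would run on the decoded arguments before any side effect.

```swift
import Foundation

/// Hypothetical arguments for a "set timer" tool, as produced by the model.
struct TimerArguments {
    var minutes: Int
    var label: String
}

enum ToolArgumentError: Error, Equatable {
    case emptyLabel
}

/// Validation gate: repair values that have an obvious nearest valid value,
/// and throw for the ones that don't.
func validated(_ args: TimerArguments) throws -> TimerArguments {
    var out = args
    // Repair: clamp the duration into the supported range (assumed 1...180).
    out.minutes = min(max(args.minutes, 1), 180)
    // Reject: a blank label has no sensible default, so throw and let the
    // caller surface the error (the model may retry with new arguments).
    out.label = args.label.trimmingCharacters(in: .whitespacesAndNewlines)
    guard !out.label.isEmpty else { throw ToolArgumentError.emptyLabel }
    return out
}
```

Whether to repair or throw is a per-field call: clamping a number is usually safe, while guessing an identifier or a recipient is not.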
Image input constraints: Passing images can change which tier serves the request. It's worth testing whether your image prompts stay on-device or escalate to PCC, since that affects latency and the context budget. Exact resolution/count/format limits weren't given numerically in the labs, so that's one to confirm with the team.
Version detection / drift: This is the hard one. Since the model ships and updates with the OS, drift between releases is real, and no model-pinning API was mentioned. The recommended mitigation is the Evaluations framework — build an eval set of your core cases and run it against each OS beta to catch drift before users do. Evaluation-as-regression-testing is currently the answer, not pinning.
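To make the evaluation-as-regression-testing idea concrete, here is a framework-independent sketch. It does not use the Evaluations framework's API (I'm not quoting that from memory); `EvalCase`, `EvalReport`, and `runEvals` are hypothetical names, and `model` is a placeholder closure for your session call.

```swift
struct EvalCase {
    let name: String
    let prompt: String
    let passes: (String) -> Bool   // assertion on the model's output
}

struct EvalReport {
    let failed: [String]
    let total: Int
    var passRate: Double {
        total == 0 ? 1 : Double(total - failed.count) / Double(total)
    }
}

/// Runs every case against `model` and records which ones fail.
func runEvals(
    _ cases: [EvalCase],
    model: (String) async throws -> String
) async -> EvalReport {
    var failed: [String] = []
    for evalCase in cases {
        // A thrown error counts as a failure for that case.
        let output = try? await model(evalCase.prompt)
        if output.map(evalCase.passes) != true {
            failed.append(evalCase.name)
        }
    }
    return EvalReport(failed: failed, total: cases.count)
}
```

Recording the pass rate per OS build gives you a baseline, so a drop on a new beta points at the specific cases that drifted.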
Latency / concurrency / background: From the labs: background inference works but can be rate-limited when the system is busy (this surfaces as a distinct error you can catch and retry). On Mac, foreground requests aren't rate-limited. Keep work chunked so the system can pause/resume around thermal pressure. Realistic latency and safe multi-session concurrency limits are hardware-dependent and best confirmed directly.
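The catch-and-retry pattern for rate limiting might look like the sketch below. `InferenceError.rateLimited` is a hypothetical stand-in for the framework's distinct rate-limit error, and the attempt count and delays are assumptions, not lab guidance.

```swift
/// Hypothetical error standing in for the framework's rate-limit error.
enum InferenceError: Error {
    case rateLimited
}

/// Retries `operation` with exponential backoff, but only when it fails with
/// the rate-limit error. Any other error propagates immediately.
func withRateLimitRetry<T>(
    maxAttempts: Int = 4,
    baseDelay: Duration = .seconds(1),
    operation: () async throws -> T
) async throws -> T {
    var delay = baseDelay
    var attempt = 1
    while true {
        do {
            return try await operation()
        } catch InferenceError.rateLimited where attempt < maxAttempts {
            // Task.sleep also throws if the task is cancelled, ending the loop.
            try await Task.sleep(for: delay)
            delay = delay * 2
            attempt += 1
        }
    }
}
```

Once the attempts are used up, the rate-limit error itself is thrown to the caller, which is a reasonable point to defer the work until the app is in the foreground.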
Net: context window, overflow, drift mitigation, and tool validation have clear working patterns today. The exact numeric limits (schema depth, image specs, concurrency ceilings) are worth getting from the Foundation Models team directly rather than inferring.
— Divya Ravi, Senior iOS Engineer