The Foundation Models framework is adding built-in OCR and barcode reader tools this year. If we implement a custom backend using the Language Model Protocol, can we return complex multimodal objects (such as bounding boxes or segmentation masks) to the agentic flow, or is the protocol currently limited to text-based responses? For the 'Phone a Friend' pattern, is there a standard way to pass 'privacy-preserving embeddings' instead of raw text when calling a third-party model, so as to maintain a higher level of user data protection?
This post is from the WWDC26 Foundation Models Q&A.