On Protocol Extensibility & Multi-Modal Data

Question

Created Jun ’26

This post is from the WWDC26 Foundation Models Q&A.

The Foundation Models framework is adding built-in OCR and barcode reader tools this year. If we implement a custom backend using the Language Model Protocol, can we return complex multi-modal objects (like bounding boxes or segmentation masks) to the agentic flow, or is the protocol currently limited to text-based responses?

For the 'Phone a Friend' pattern, is there a standard way to pass 'privacy-preserving embeddings' instead of raw text when calling a third-party model, to better protect user data?


Answer 1

Frameworks Engineer OP

Apple

Jun ’26

Recommended

Yes, absolutely! You can use a CustomSegment to return any content that the framework doesn't fully define yet.

Additionally, there is a SKILL.md file in the Foundation Models Utilities that can help you build a LanguageModel implementation.
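To illustrate the CustomSegment idea for the bounding-box case in the question, here is a minimal, self-contained sketch. It does not use the Foundation Models API. BoundingBox, DetectedRegions, ResponseSegment, makeRegionsSegment, decodeRegions and the "detected-regions" tag are all hypothetical stand-ins, because the answer does not describe what CustomSegment actually looks like. The sketch assumes a custom segment can carry a type tag plus an opaque payload. The backend encodes structured results into that payload, and a consumer decodes them back.

```swift
import Foundation

// HYPOTHETICAL types: they only mirror the idea of a custom segment
// that carries structured, non-text data. They are NOT the real
// Foundation Models API (CustomSegment's actual shape is not shown here).

/// A normalized rectangle (0...1 in image coordinates) for one detected region.
struct BoundingBox: Codable, Equatable {
    var x: Double
    var y: Double
    var width: Double
    var height: Double
}

/// Hypothetical structured result a custom backend might return, e.g. from OCR.
struct DetectedRegions: Codable, Equatable {
    var label: String
    var boxes: [BoundingBox]
}

/// Hypothetical response segment: plain text, or a tagged opaque payload.
enum ResponseSegment {
    case text(String)
    case custom(kind: String, payload: Data)
}

/// Backend side: encode structured results as JSON inside a custom segment.
func makeRegionsSegment(_ regions: DetectedRegions) throws -> ResponseSegment {
    let data = try JSONEncoder().encode(regions)
    return .custom(kind: "detected-regions", payload: data)
}

/// Consumer side: decode the payload only if the segment has the expected tag.
func decodeRegions(from segment: ResponseSegment) throws -> DetectedRegions? {
    guard case let .custom(kind, payload) = segment, kind == "detected-regions" else {
        return nil
    }
    return try JSONDecoder().decode(DetectedRegions.self, from: payload)
}

// Usage: mix a text segment with a structured one in a single response.
let regions = DetectedRegions(
    label: "barcode",
    boxes: [BoundingBox(x: 0.1, y: 0.2, width: 0.3, height: 0.05)]
)
let regionsSegment = try makeRegionsSegment(regions)
let segments: [ResponseSegment] = [.text("Found one barcode."), regionsSegment]

for segment in segments {
    switch segment {
    case .text(let message):
        print("text:", message)
    case .custom(let kind, _):
        if let decoded = try decodeRegions(from: segment) {
            print("custom \(kind):", decoded.boxes.count, "box(es)")
        }
    }
}
```

Tagging the payload and encoding it as JSON keeps the segment type generic. A segmentation mask could travel the same way under a different tag. Consumers that don't recognize a tag can simply skip it.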