I am trying to use the new SpeechAnalyzer framework in my Mac app, and am running into an issue for some languages.
When I call AssetInstallationRequest.downloadAndInstall() for some languages, it throws an error:
Error Domain=SFSpeechErrorDomain Code=1 "transcription.ar asset not found after attempted download."
The ".ar" appears to be the language code, which in this case was Arabic.
When I call AssetInventory.status(forModules:) before attempting the download, it returns a status of "downloading" (perhaps from an earlier attempt?). If this language were completely unsupported, I would expect a status of "unsupported", so I'm not sure what's going on here.
For other languages (Polish, for example) SpeechTranscriber.supportedLocale(equivalentTo:) is returning nil, so that seems like a clearly unsupported language. But I can't tell if the languages I'm trying, like Arabic, are supported and something is going wrong, or if this error represents something I can work around.
Here's the relevant section of code. The error is thrown from downloadAndInstall(), so I never even get as far as setting up the SpeechAnalyzer itself.
private func setUpAnalyzer() async throws {
    guard let sourceLanguage else {
        throw Error.languageNotSpecified
    }
    guard let locale = await SpeechTranscriber.supportedLocale(equivalentTo: Locale(identifier: sourceLanguage.rawValue)) else {
        throw Error.unsupportedLanguage
    }
    let transcriber = SpeechTranscriber(locale: locale, preset: .progressiveTranscription)
    self.transcriber = transcriber
    let reservedLocales = await AssetInventory.reservedLocales
    if !reservedLocales.contains(locale) && reservedLocales.count == AssetInventory.maximumReservedLocales {
        if let oldest = reservedLocales.last {
            await AssetInventory.release(reservedLocale: oldest)
        }
    }
    do {
        let status = await AssetInventory.status(forModules: [transcriber])
        print("status: \(status)")
        if let installationRequest = try await AssetInventory.assetInstallationRequest(supporting: [transcriber]) {
            try await installationRequest.downloadAndInstall()
        }
    }
    ...
Speech
Recognize spoken words in recorded or live audio using Speech.
Posts under Speech tag
54 Posts
I am trying to use the SpeechDetector module in the Speech framework along with SpeechTranscriber, and it is giving me an error:
Cannot convert value of type 'SpeechDetector' to expected element type 'Array.ArrayLiteralElement' (aka 'any SpeechModule')
Below is how I am using it:
let speechDetector = Speech.SpeechDetector()
let transcriber = SpeechTranscriber(locale: Locale.current,
                                    transcriptionOptions: [],
                                    reportingOptions: [.volatileResults],
                                    attributeOptions: [.audioTimeRange])
speechAnalyzer = try SpeechAnalyzer(modules: [transcriber, speechDetector])
Hello,
I am testing the sample project provided here: Bringing advanced speech-to-text capabilities to your app.
On both macOS 26.0 beta and iOS 26.0 beta, the app crashes immediately on launch with a dyld "Symbol not found" error related to FoundationModels.framework.
I suspect the sample was tested primarily on newer Apple silicon devices, as I am seeing consistent crashes on an Intel MacBook and on an older iPhone.
I would appreciate any insight, confirmation, or guidance on whether this is a known limitation or if there is a workaround. Is it planned to be resolved soon?
Environment
macOS:
Device: MacBook Pro (Intel)
Processor: 2 GHz Quad-Core Intel Core i5
Graphics: Intel Iris Plus Graphics 1536 MB
Memory: 16 GB 3733 MHz LPDDR4X
OS: macOS Tahoe Version 26.0 Beta (25A5338b)
iOS:
Device: iPhone 11
Model Number: MHDD3HN/A
OS: iOS 26.0
Xcode:
Version: 26.0 beta 3 (17A5276g)
Crash (macOS)
Abort signal received. Excerpt from crash dump:
dyld`__abort_with_payload:
0x7ff80e3ad4a0 <+0>: movl $0x2000209, %eax
0x7ff80e3ad4a5 <+5>: movq %rcx, %r10
0x7ff80e3ad4a8 <+8>: syscall
-> 0x7ff80e3ad4aa <+10>: jae 0x7ff80e3ad4b4
Console:
dyld[9819]: Symbol not found: _$s16FoundationModels20LanguageModelSessionC5model10guardrails5tools12instructionsAcA06SystemcD0C_AC10GuardrailsVSayAA4Tool_pGAA12InstructionsVSgtcfC
Referenced from: /Users/userx/Library/Developer/Xcode/DerivedData/SwiftTranscriptionSampleApp-*/Build/Products/Debug/SwiftTranscriptionSampleApp.app/Contents/MacOS/SwiftTranscriptionSampleApp.debug.dylib
Expected in: /System/Library/Frameworks/FoundationModels.framework/Versions/A/FoundationModels
Crash (iOS)
Abort signal received. Excerpt from crash dump:
dyld`__abort_with_payload:
0x18f22b4b0 <+0>: mov x16, #0x209
0x18f22b4b4 <+4>: svc #0x80
-> 0x18f22b4b8 <+8>: b.lo 0x18f22b4d8
Console
dyld[2080]: Symbol not found: _$s16FoundationModels20LanguageModelSessionC5model10guardrails5tools12instructionsAcA06SystemcD0C_AC10GuardrailsVSayAA4Tool_pGAA12InstructionsVSgtcfC
Referenced from: /private/var/containers/Bundle/Application/.../SwiftTranscriptionSampleApp.app/SwiftTranscriptionSampleApp.debug.dylib
Expected in: /System/Library/Frameworks/FoundationModels.framework/FoundationModels
Questions
Is this crash expected on Intel Macs and older iPhone models with the beta SDKs?
Is there an official statement on whether all macOS 26.x releases support Intel, or whether that support lasts only until macOS 26.1?
Any suggested workarounds for testing this sample project on current hardware?
Is this a known limitation for the 26.0 beta, and if so, should we expect a fix in 26.0 or only in subsequent releases?
Attaching screenshots for reference.
Thank you in advance.
We are a research team conducting a study that collects subjects' SensorKit speech data, and we've encountered some questions we couldn't resolve ourselves or by consulting the online SensorKit documentation:
Microphone Activation: In general, how is the microphone being turned on to capture a speech session? And how was each session determined to be an independent session?
Negative Values: In the speech classification data, there are entries where some of the start and end values are negative (see screenshot below). How should we interpret and handle these values? Is it safe to filter them out?
Duplicated sessions: From the same screenshot you can see there are multiple session identifiers linked to the same subject with the same timestamp - what does this represent?
More Negative Values: The same question applies to the average pause duration in the speech recognition data. What does the -1 mean, and should we remove those entries as well?
(Note that the subject IDs were removed from these screenshots for privacy purposes, but each screenshot is from a single subject.)
We greatly appreciate your time and help.
Using the official SwiftTranscriptionSampleApp from WWDC 2025, speech transcription takes 14+ seconds from audio input to first result, making it unusable for real-time applications.
Environment
iOS: 26.0 Beta
Xcode: Beta 5
Device: iPhone 16 Pro
Sample App: Official Apple SwiftTranscriptionSampleApp from WWDC 2025
Configuration Tested
Locale: en-US (properly allocated with AssetInventory.allocate(locale:)) and es-ES
Setup: All optimizations applied (preheating, high priority, model retention)
I started testing in my own app, intending to replace the SFSpeech API and add speech detection. After a long fight with the documentation (which is quite terrible, TBH), I tested the sample (https://developer.apple.com/documentation/speech/bringing-advanced-speech-to-text-capabilities-to-your-app) and saw the same results.
I added some logs to check the specific time:
🎙️ [20:30:41.532] ✅ Analyzer started successfully - ready to receive audio!
🎙️ [20:30:41.532] Listening for transcription results...
🎙️ [20:30:56.342] 🚀 FIRST TRANSCRIPTION RESULT after 14.810s: 'Hello' (isFinal: false)
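For anyone who wants to reproduce the measurement, the delay can be derived from the two bracketed timestamps in the log. This is only a minimal sketch: the helper names `secondsSinceMidnight` and `latency` are hypothetical, and it assumes both stamps use the HH:mm:ss.SSS form shown above and fall on the same day.

```swift
import Foundation

/// Converts a wall-clock stamp of the form "HH:mm:ss.SSS" into seconds since midnight.
/// Returns nil unless the string has exactly three colon-separated numeric fields.
func secondsSinceMidnight(_ stamp: String) -> Double? {
    let fields = stamp.split(separator: ":").map { Double(String($0)) }
    guard fields.count == 3,
          let hours = fields[0], let minutes = fields[1], let seconds = fields[2] else {
        return nil
    }
    return hours * 3600 + minutes * 60 + seconds
}

/// Elapsed seconds between two stamps taken on the same day.
func latency(from start: String, to end: String) -> Double? {
    guard let s = secondsSinceMidnight(start), let e = secondsSinceMidnight(end) else {
        return nil
    }
    return e - s
}

// Timestamps copied from the "Analyzer started" and "FIRST TRANSCRIPTION RESULT" log lines.
if let delay = latency(from: "20:30:41.532", to: "20:30:56.342") {
    print(String(format: "first result after %.3fs", delay))
}
```

The result agrees with the 14.810s figure in the log, so the delay is measured from the moment the analyzer reports it is ready, not from the first audio buffer.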
Questions
Is this the expected performance on iOS 26 Beta? The old SFSpeech API is far faster.
Are there additional optimization steps for SpeechTranscriber?
Should we expect significant performance improvements in later betas?
I started playing with transcription of audio files on macOS today, using the latest beta of Xcode and the latest beta of Tahoe. Transcription itself works really well, but for some reason the majority of the results contain no audioTimeRange. I got 22 single-word results with time ranges, spread across a 53-minute file.
Is there something I can do to improve this? To my understanding, I have followed sample code and instructions very closely, but the SwiftTranscriptionSampleApp and other examples I've seen lead me to believe I should be getting a lot more time ranges than I actually do.
I have Xcode 16, have set every minimum deployment target to 17.5, and am using import Speech.
Nevertheless, Xcode can't find it.
At ChatGPT's urging I tried going back to Xcode 15.3, but that won't work with Sequoia.
Am I misunderstanding something?
Here's how I am trying to use it:
if templateItems.isEmpty {
    templateItems = dbControl?.getAllItems(templateName: templateName) ?? []
    items = templateItems.compactMap { $0.itemName?.components(separatedBy: " ") }.flatMap { $0 }
    let phrases = extractContextualWords(from: templateItems)
    Task {
        do {
            // 1. Get your items and extract words
            templateItems = dbControl?.getAllItems(templateName: templateName) ?? []
            let phrases = extractContextualWords(from: templateItems)
            // 2. Build the custom model and export it
            let modelURL = try await buildCustomLanguageModel(from: phrases)
            // 3. Prepare the model (STATIC method)
            try await SFSpeechRecognizer.prepareCustomLanguageModel(at: modelURL)
            // ✅ Ready to use in recognition request
            print("✅ Model prepared at: \(modelURL)")
            // Save modelURL to use in Step 5 (speech recognition)
            // e.g., self.savedModelURL = modelURL
        } catch {
            print("❌ Error preparing model: \(error)")
        }
    }
}
So,
I've been wondering how fast an offline STT -> ML Prompt -> TTS roundtrip would be.
Interestingly, for many tests, the SpeechTranscriber (STT) takes the bulk of the time, compared to generating a FoundationModel response and creating the Audio using TTS.
E.g.
InteractionStatistics:
- listeningStarted: 21:24:23 4480 2423
- timeTillFirstAboveNoiseFloor: 01.794
- timeTillLastNoiseAboveFloor: 02.383
- timeTillFirstSpeechDetected: 02.399
- timeTillTranscriptFinalized: 04.510
- timeTillFirstMLModelResponse: 04.938
- timeTillMLModelResponse: 05.379
- timeTillTTSStarted: 04.962
- timeTillTTSFinished: 11.016
- speechLength: 06.054
- timeToResponse: 02.578
- transcript: This is a test.
- mlModelResponse: Sure! I'm ready to help with your test. What do you need help with?
Here, between my audio input ending and the text-to-speech starting to play (using AVSpeechUtterance), the total response time was 2.5s.
Of that time, it took the SpeechAnalyzer 2.1s to get the transcript finalized, FoundationModel only took 0.4s to respond (and TTS started playing nearly instantly).
I'm already using reportingOptions: [.volatileResults, .fastResults] so it's probably as fast as possible right now?
I'm just surprised the STT takes so much longer compared to the other parts (all being CoreML based, aren't they?)
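To make the breakdown explicit, here is a small sketch of the arithmetic behind those figures. The `InteractionTimings` type and its property names are hypothetical. It assumes that timeToResponse runs from timeTillLastNoiseAboveFloor to timeTillTTSStarted; with the rounded values from the statistics this gives 2.579s, which is close to the reported 02.578.

```swift
import Foundation

/// Offsets in seconds from `listeningStarted`, copied from the statistics above.
struct InteractionTimings {
    let lastNoiseAboveFloor: Double
    let transcriptFinalized: Double
    let firstModelResponse: Double
    let ttsStarted: Double

    /// Time from the end of audible input until the transcript is finalized.
    var transcriptionDelay: Double { transcriptFinalized - lastNoiseAboveFloor }
    /// Time from the finalized transcript until the model's first response.
    var modelDelay: Double { firstModelResponse - transcriptFinalized }
    /// Time from the model's first response until speech synthesis starts.
    var ttsDelay: Double { ttsStarted - firstModelResponse }
    /// Time from the end of audible input until speech synthesis starts (assumed definition).
    var timeToResponse: Double { ttsStarted - lastNoiseAboveFloor }
}

let timings = InteractionTimings(
    lastNoiseAboveFloor: 2.383,
    transcriptFinalized: 4.510,
    firstModelResponse: 4.938,
    ttsStarted: 4.962
)

print(String(format: "STT: %.3fs", timings.transcriptionDelay))
print(String(format: "Model: %.3fs", timings.modelDelay))
print(String(format: "TTS start: %.3fs", timings.ttsDelay))
print(String(format: "Total: %.3fs", timings.timeToResponse))
```

With these numbers the transcription step accounts for roughly 2.1s of the total, the model for about 0.4s, and the synthesizer start-up for a few hundredths of a second.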
So experimenting with the new SpeechTranscriber, if I do:
let transcriber = SpeechTranscriber(
    locale: locale,
    transcriptionOptions: [],
    reportingOptions: [.volatileResults],
    attributeOptions: [.audioTimeRange]
)
only the final result has audio time ranges, not the volatile results.
Is this a performance consideration? If there is no performance problem, it would be nice to have the option to also get speech time ranges for volatile responses.
I'm not presenting the volatile text in the UI at all. I was just trying to keep statistics about the noise level during speech and non-speech, so that I can determine when the level has stayed under the noise floor for a while.
The goal here was to finalize the recording automatically when the noise level indicates that the user has finished speaking.
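The end-of-speech rule described here can be sketched independently of the Speech framework. The example below is an illustration only: `LevelSample`, `endOfSpeechTime`, the decibel values, the noise floor, and the hold time are all hypothetical, and in a real app the samples would come from whatever level metering is already in place.

```swift
import Foundation

/// A single level measurement: time in seconds and level in decibels.
struct LevelSample {
    let time: Double
    let decibels: Double
}

/// Returns the time at which the level has stayed at or below `noiseFloor` for at
/// least `holdTime` seconds after some louder input was seen, or nil if that never happens.
func endOfSpeechTime(samples: [LevelSample], noiseFloor: Double, holdTime: Double) -> Double? {
    var heardSpeech = false
    var quietSince: Double? = nil
    for sample in samples {
        if sample.decibels > noiseFloor {
            // Louder than the floor: treat as speech and restart the quiet timer.
            heardSpeech = true
            quietSince = nil
        } else if heardSpeech {
            // Quiet after speech: remember when the quiet stretch began.
            let start = quietSince ?? sample.time
            quietSince = start
            if sample.time - start >= holdTime {
                return sample.time
            }
        }
    }
    return nil
}

// Hypothetical measurements: silence, two loud samples, then sustained quiet.
let samples = [
    LevelSample(time: 0.0, decibels: -60),
    LevelSample(time: 0.5, decibels: -20),
    LevelSample(time: 1.0, decibels: -25),
    LevelSample(time: 1.5, decibels: -60),
    LevelSample(time: 2.0, decibels: -60),
    LevelSample(time: 2.5, decibels: -60),
]
let stopTime = endOfSpeechTime(samples: samples, noiseFloor: -50, holdTime: 1.0)
print(stopTime.map { "finalize at \($0)s" } ?? "keep listening")
```

Requiring louder input before the quiet timer starts keeps the rule from firing during the silence that precedes the first word.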
I tried to combine a speech detector and a speech transcriber in one analyzer,
but the speech detector is not a speech module. Please help me.
In iOS 26, AVSpeechSynthesizer reads Mandarin text with Cantonese pronunciation.
No matter how I set the language or change my phone's system settings, it doesn't work.
let utterance = AVSpeechUtterance(string: "你好啊")
//let voice = AVSpeechSynthesisVoice(language: "zh-CN") // does not work
let voice = AVSpeechSynthesisVoice(language: "zh-Hans") // does not work either
utterance.voice = voice
let synth = AVSpeechSynthesizer()
synth.speak(utterance)
I am using the sample app from:
https://developer.apple.com/videos/play/wwdc2025/277/?time=763
I installed this on an iPhone 15 Pro with iOS 26 beta 1. I was able to get good transcription with it. The app did crash sometimes when transcribing, and I was going to post here with the details. I then installed iOS beta 2 and uninstalled the sample app. Now every time I try to run the sample app on the 15 Pro I get this message:
SpeechAnalyzer: Input loop ending with error: Error Domain=SFSpeechErrorDomain Code=10 "Cannot use modules with unallocated locales [en_US (fixed en_US)]" UserInfo={NSLocalizedDescription=Cannot use modules with unallocated locales [en_US (fixed en_US)]}
I can't continue our work towards using SpeechAnalyzer with this error.
I have set breakpoints on all the catch handlers, and none of them catches this error. My phone region is "United States".
I'm experimenting with the new SpeechTranscriber in macOS/iOS 26, transcribing speech from a prerecorded mp4 file. Speed and quality are amazing!
I've told the transcriber to include time indexes. Each run is always exactly one word, which can be very useful. When I look at the indexes the end of one run is always identical to the start of the next run, even if there's a pause.
I'd like to identify pauses, perhaps to generate something like phrases for subtitling. With each run of text going into the next I can't do this, other than using punctuation - which might be rather rough.
Any suggestions on detecting pauses, or getting that kind of metadata from the transcriber?
Here's a short sample, showing each run with the start, end, and characters in the run:
105.9 --> 107.04 I
107.04 --> 107.16 think
107.16 --> 108.0 more
108.0 --> 108.42 lighting
108.42 --> 108.6 is
108.6 --> 108.72 definitely
108.72 --> 109.2 needed,
109.2 --> 109.92 downtown.
109.98 --> 110.4 My
110.4 --> 110.52 only
110.52 --> 110.7 question
110.7 --> 111.06 is,
111.06 --> 111.48 poll
111.48 --> 111.78 five,
111.78 --> 111.84 that
111.84 --> 112.08 you're
112.08 --> 112.38 increasing
112.38 --> 112.5 the
112.5 --> 113.34 50,000?
113.4 --> 113.58 Where
113.58 --> 113.88 exactly
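One way to work with the gaps that do appear in this sample (for example between 109.92 and 109.98) is to start a new phrase whenever the next run begins noticeably later than the previous one ended. The sketch below is plain Swift and does not touch the Speech API; `Run`, `groupIntoPhrases`, and the 0.05s threshold are hypothetical choices for illustration. It can only find gaps the transcriber actually reports, so pauses that are folded into a run's own time range stay invisible.

```swift
import Foundation

/// One transcribed run with its start and end times in seconds.
struct Run {
    let start: Double
    let end: Double
    let text: String
}

/// Groups consecutive runs into phrases, starting a new phrase whenever the gap
/// between one run's end and the next run's start is at least `minimumGap` seconds.
func groupIntoPhrases(_ runs: [Run], minimumGap: Double) -> [String] {
    var phrases: [[String]] = []
    var previousEnd: Double? = nil
    for run in runs {
        if let end = previousEnd, run.start - end < minimumGap, !phrases.isEmpty {
            // No meaningful gap: continue the current phrase.
            phrases[phrases.count - 1].append(run.text)
        } else {
            // First run, or a gap was found: start a new phrase.
            phrases.append([run.text])
        }
        previousEnd = run.end
    }
    return phrases.map { $0.joined(separator: " ") }
}

// A few runs copied from the sample above.
let runs = [
    Run(start: 108.72, end: 109.2, text: "needed,"),
    Run(start: 109.2, end: 109.92, text: "downtown."),
    Run(start: 109.98, end: 110.4, text: "My"),
    Run(start: 110.4, end: 110.52, text: "only"),
    Run(start: 110.52, end: 110.7, text: "question"),
]
let phrases = groupIntoPhrases(runs, minimumGap: 0.05)
for phrase in phrases {
    print(phrase)
}
```

In this sample the only non-zero gaps are 0.06s and both follow sentence-ending punctuation, so the threshold would need tuning against real recordings before it is useful for subtitling.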
During testing the “Bringing advanced speech-to-text capabilities to your app” sample app demonstrating the use of iOS 26 SpeechAnalyzer, I noticed that the language model for the English locale was presumably already downloaded. Upon checking the documentation of AssetInventory, I found out that indeed, the language model can be preinstalled on the system.
Can someone from the dev team share more info about what assets are preinstalled by the system? For example, can we safely assume that the English language model will almost certainly be already preinstalled by the OS if the phone has the English locale?
When I create a SFSpeechRecognizer object, I find SFLocalSpeechRecognitionClient remains in memory and never gets released.
You can create a demo with a single UIButton whose touch action is
SFSpeechRecognizer(locale: Locale(identifier: "zh_CN"))
Hello, I’ve followed all the steps you recommended and confirmed that the entitlement is correctly added in Xcode, but the provisioning profile still fails. I believe the issue is that my App ID com.echo.eyes.app is missing the com.apple.developer.speech-recognition entitlement on Apple’s end.
Could you please manually add this entitlement to my App ID, or guide me on how to get it attached? I’ve already added it locally and confirmed the error in Xcode is due to it not being in the provisioning profile.
I am building an iOS app with the App ID: com.echo.eyes.app
I have a paid Apple Developer membership and have followed all correct procedures, including:
Adding com.apple.developer.speech-recognition manually to the App.entitlements file
Setting Info.plist keys for microphone and speech permissions
Assigning my Apple Developer Team to the project
Setting App/App.entitlements under Code Signing Entitlements
Despite all this, Xcode automatic signing fails, and I receive the error:
Provisioning profile 'iOS Team Provisioning Profile: com.echo.eyes.app' doesn't include the com.apple.developer.speech-recognition entitlement.
I am unable to add the entitlement via the Capabilities section, and no method I try will allow provisioning to succeed.
Please update this App ID to include the required entitlement in the provisioning profile. This issue is preventing all voice recognition functionality.
Thank you.
Subject: Assistance Needed with Enabling Speech Recognition Entitlement for iOS App
Hi everyone,
I’m seeking guidance regarding the Speech Recognition entitlement for my iOS app, which uses Capacitor. We submitted a request to Apple Developer Support four days ago, but have not yet received a response.
🧩 Summary of the issue:
Our app uses the Capacitor speech recognition plugin (@capacitor-community/speech-recognition) to listen for native voice input on iOS.
We have added both of the required keys in Info.plist:
NSSpeechRecognitionUsageDescription
NSMicrophoneUsageDescription
We previously had a duplicate microphone key, which caused the system to silently skip the permission request. After removing the duplicate, we did briefly see the microphone permission prompt appear.
However, in our most recent builds, the app launches without any prompts, even on a fresh install. The plugin reports:
available = true
permissionStatus = granted
Despite this, no speech input is ever received, and the listener returns nothing.
We believe the app is functioning correctly at a code level (plugin loads, no errors, correct Info.plist), but suspect the missing Speech Recognition entitlement is blocking actual access to the speech system.
🔎 What we need help with:
How can we confirm whether the Speech Recognition entitlement is enabled for our App ID?
If it’s not enabled, is there a way to escalate or re-submit the request? Our app is currently stuck until this entitlement is granted.
Thank you for your time and any guidance you can offer!
When a new application runs on iOS 18.4 simulator and tries to access the Speech Framework, prompting a request for authorisation to use Speech Recognition, the application will crash if the user clicks allow. Same issue in the visionOS 2.4 simulator.
Using Swift 6. Report Identifier: FB17686186
/// Checks speech recognition availability and requests necessary permissions.
@MainActor
func checkAvailabilityAndPermissions() async {
    logger.debug("Checking speech recognition availability and permissions...")
    // 1. Verify that the speechRecognizer instance exists
    guard let recognizer = speechRecognizer else {
        logger.error("Speech recognizer is nil - speech recognition won't be available.")
        reportError(.configurationError(description: "Speech recognizer could not be created."), context: "checkAvailabilityAndPermissions")
        self.isAvailable = false
        return
    }
    // 2. Check recognizer availability (might change at runtime)
    if !recognizer.isAvailable {
        logger.error("Speech recognizer is not available for the current locale.")
        reportError(.configurationError(description: "Speech recognizer not available."), context: "checkAvailabilityAndPermissions")
        self.isAvailable = false
        return
    }
    logger.trace("Speech recognizer exists and is available.")
    // 3. Request Speech Recognition Authorization
    // IMPORTANT: Add `NSSpeechRecognitionUsageDescription` to Info.plist
    let speechAuthStatus = SFSpeechRecognizer.authorizationStatus()
    logger.debug("Current Speech Recognition authorization status: \(speechAuthStatus.rawValue)")
    if speechAuthStatus == .notDetermined {
        logger.info("Requesting speech recognition authorization...")
        // Use structured concurrency to wait for permission result
        let authStatus = await withCheckedContinuation { continuation in
            SFSpeechRecognizer.requestAuthorization { status in
                continuation.resume(returning: status)
            }
        }
        logger.debug("Received authorization status: \(authStatus.rawValue)")
        // Now handle the authorization result
        let speechAuthorized = (authStatus == .authorized)
        handleAuthorizationStatus(status: authStatus, type: "Speech Recognition")
        // If speech is granted, now check microphone
        if speechAuthorized {
            await checkMicrophonePermission()
        }
    } else {
        // Already determined, just handle it
        let speechAuthorized = (speechAuthStatus == .authorized)
        handleAuthorizationStatus(status: speechAuthStatus, type: "Speech Recognition")
        // If speech is already authorized, check microphone
        if speechAuthorized {
            await checkMicrophonePermission()
        }
    }
}
I'm working in Swift/SwiftUI, running Xcode 16.3 on macOS 15.4, and I've seen this both in the iOS simulator and in a macOS app run from Xcode. I've also seen this behaviour with 3 different audio files.
Nothing in the documentation says that the speechRecognitionMetadata property on an SFSpeechRecognitionResult will be nil until isFinal, but that's the behaviour I'm seeing.
I've stripped my class down to the following:
    private var isAuthed = false
    // I call this in a .task {} in my SwiftUI View
    public func requestSpeechRecognizerPermission() {
        SFSpeechRecognizer.requestAuthorization { authStatus in
            Task {
                self.isAuthed = authStatus == .authorized
            }
        }
    }
    public func transcribe(from url: URL) {
        guard isAuthed else { return }
        let locale = Locale(identifier: "en-US")
        let recognizer = SFSpeechRecognizer(locale: locale)
        let recognitionRequest = SFSpeechURLRecognitionRequest(url: url)
        // the behaviour occurs whether I set this to true or not, I recently set
        // it to true to see if it made a difference
        recognizer?.supportsOnDeviceRecognition = true
        recognitionRequest.shouldReportPartialResults = true
        recognitionRequest.addsPunctuation = true
        recognizer?.recognitionTask(with: recognitionRequest) { (result, error) in
            guard let result else { return }
            if result.isFinal {
                // speechRecognitionMetadata is not nil
            } else {
                // speechRecognitionMetadata is nil
            }
        }
    }
}
Further, and this isn't documented either, the SFTranscriptionSegment values don't have correct timestamp and duration values until isFinal. The values aren't all zero, but they don't align with the timing in the audio and they change to accurate values when isFinal is true.
The transcription otherwise "works", in that I get transcription text before isFinal and if I wait for isFinal the segments are correct and speechRecognitionMetadata is filled with values.
The context here is that I'm trying to generate a transcription whose spoken sections I can highlight as the audio plays, and I suspect I'm trying to use the Speech framework in a way it was not designed for. My concept works if I pre-process the audio (i.e. run it through until isFinal and save the results I need to JSON), but doing even a rougher version of it 'on the fly', which requires segments to have the right timestamp/duration before isFinal, is perhaps impossible?
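For reference, the pre-processing approach can be sketched as follows. This is an illustration, not the code from my app: `TimedSegment` and the helper functions are hypothetical, with fields chosen to mirror the substring, timestamp, and duration values that SFTranscriptionSegment exposes once isFinal is true.

```swift
import Foundation

/// The per-segment values saved after the final result arrives.
struct TimedSegment: Codable, Equatable {
    let text: String
    let timestamp: Double  // seconds from the start of the audio
    let duration: Double   // seconds
}

/// Serializes the segments to JSON so they can be stored next to the audio file.
func encodeSegments(_ segments: [TimedSegment]) throws -> Data {
    try JSONEncoder().encode(segments)
}

/// Restores segments from previously saved JSON.
func decodeSegments(_ data: Data) throws -> [TimedSegment] {
    try JSONDecoder().decode([TimedSegment].self, from: data)
}

/// Index of the segment being spoken at `time`, or nil during silence.
func activeSegmentIndex(in segments: [TimedSegment], at time: Double) -> Int? {
    segments.firstIndex { time >= $0.timestamp && time < $0.timestamp + $0.duration }
}

// Hypothetical values standing in for the segments of a final result.
let saved = [
    TimedSegment(text: "Hello", timestamp: 0.0, duration: 0.5),
    TimedSegment(text: "world", timestamp: 0.5, duration: 0.5),
]

do {
    let data = try encodeSegments(saved)
    let restored = try decodeSegments(data)
    if let index = activeSegmentIndex(in: restored, at: 0.75) {
        print("highlight: \(restored[index].text)")
    }
} catch {
    print("could not round-trip segments: \(error)")
}
```

During playback, the current player time would be passed to `activeSegmentIndex` on each UI update to decide which word to highlight.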