SpeechTranscriber/SpeechAnalyzer being relatively slow compared to FoundationModels and TTS

So,

I've been wondering how fast an offline STT -> ML Prompt -> TTS roundtrip would be.
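
For context, the roundtrip looks roughly like this. It's a simplified sketch rather than my exact code; the SpeechAnalyzer and FoundationModels calls are written the way I understand them from the docs, so treat the exact signatures as assumptions:

        import Speech
        import FoundationModels
        import AVFoundation

        // 1. STT: stream microphone buffers into a SpeechTranscriber.
        let transcriber = SpeechTranscriber(
            locale: Locale(identifier: "en-US"),
            transcriptionOptions: [],
            reportingOptions: [.volatileResults, .fastResults],
            attributeOptions: []
        )
        let analyzer = SpeechAnalyzer(modules: [transcriber])
        let (inputSequence, inputBuilder) = AsyncStream.makeStream(of: AnalyzerInput.self)
        try await analyzer.start(inputSequence: inputSequence)
        // ... an audio engine tap feeds inputBuilder.yield(AnalyzerInput(buffer: pcmBuffer)) ...

        // Collect the finalized transcript.
        var transcript = ""
        for try await result in transcriber.results where result.isFinal {
            transcript += String(result.text.characters)
        }

        // 2. Prompt the on-device model via FoundationModels.
        let session = LanguageModelSession()
        let response = try await session.respond(to: transcript)

        // 3. TTS: speak the response with AVSpeechSynthesizer.
        let synthesizer = AVSpeechSynthesizer()
        synthesizer.speak(AVSpeechUtterance(string: response.content))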

Interestingly, in many tests the SpeechTranscriber (STT) takes the bulk of the time, compared to generating the FoundationModels response and creating the audio via TTS.

E.g.

        InteractionStatistics:
        - listeningStarted:             21:24:23 4480 2423
        - timeTillFirstAboveNoiseFloor: 01.794
        - timeTillLastNoiseAboveFloor:  02.383
        - timeTillFirstSpeechDetected:  02.399
        - timeTillTranscriptFinalized:  04.510
        - timeTillFirstMLModelResponse: 04.938
        - timeTillMLModelResponse:      05.379
        - timeTillTTSStarted:           04.962
        - timeTillTTSFinished:          11.016
        - speechLength:                 06.054
        - timeToResponse:               02.578
        - transcript:                   This is a test.
        - mlModelResponse:              Sure! I'm ready to help with your test. What do you need help with?

Here, between my audio input ending and the text-to-speech output starting to play (using AVSpeechUtterance), the total response time was 2.5s. Of that, it took the SpeechAnalyzer 2.1s to finalize the transcript, while FoundationModels took only 0.4s to respond (and TTS started playing nearly instantly).

I'm already using reportingOptions: [.volatileResults, .fastResults], so it's probably as fast as possible right now? I'm just surprised that the STT takes so much longer than the other parts (they're all Core ML based, aren't they?).
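
Concretely, I'm consuming the results stream along these lines (again a sketch; showPartial is just a hypothetical UI helper of mine):

        for try await result in transcriber.results {
            if result.isFinal {
                transcript += String(result.text.characters)  // finalized text, goes to the model
            } else {
                showPartial(String(result.text.characters))   // volatile text, may still be revised
            }
        }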

We've added some advice on improving performance to our documentation, at https://developer.apple.com/documentation/speech/speechanalyzer#Improve-responsiveness.

The prepareToAnalyze method may be useful to preheat the analyzer and get the transcription started a bit sooner.
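
For example, something along these lines (a sketch; please verify the exact signatures against your SDK):

        // Preheat the analyzer while the UI is being set up, before any audio
        // arrives, so asset and model loading doesn't count against the first
        // utterance.
        let format = await SpeechAnalyzer.bestAvailableAudioFormat(compatibleWith: [transcriber])
        try await analyzer.prepareToAnalyze(in: format)

        // Later, when the user actually starts speaking:
        try await analyzer.start(inputSequence: inputSequence)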
