SpeechTranscriber/SpeechAnalyzer being relatively slow compared to FoundationModel and TTS

Question

Bersaelor OP

Created 1w

So,

I've been wondering how fast an offline STT -> ML Prompt -> TTS round trip would be.

Interestingly, in many tests the SpeechTranscriber (STT) takes the bulk of the time, more than generating a FoundationModel response and creating the audio using TTS.

E.g.

        InteractionStatistics:
        - listeningStarted:             21:24:23 4480 2423
        - timeTillFirstAboveNoiseFloor: 01.794
        - timeTillLastNoiseAboveFloor:  02.383
        - timeTillFirstSpeechDetected:  02.399
        - timeTillTranscriptFinalized:  04.510
        - timeTillFirstMLModelResponse: 04.938
        - timeTillMLModelResponse:      05.379
        - timeTillTTSStarted:           04.962
        - timeTillTTSFinished:          11.016
        - speechLength:                 06.054
        - timeToResponse:               02.578
        - transcript:                   This is a test.
        - mlModelResponse:              Sure! I'm ready to help with your test. What do you need help with?

Here, between my audio input ending and the text-to-speech starting to play (using AVSpeechUtterance), the total response time was about 2.6s (timeToResponse: 02.578). Of that time, the SpeechAnalyzer took 2.1s to finalize the transcript, while FoundationModel took only 0.4s to produce its first response (and TTS started playing nearly instantly).
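For reference, the per-stage numbers can be recomputed from the log above. This is only a small arithmetic sketch: the `LatencyBreakdown` type and the variable names are made up for illustration, and the values are copied from the InteractionStatistics output (seconds since listeningStarted).

```swift
import Foundation

// Timings copied from the InteractionStatistics log, in seconds since listeningStarted.
let lastNoiseAboveFloor = 2.383   // timeTillLastNoiseAboveFloor
let transcriptFinalized = 4.510   // timeTillTranscriptFinalized
let firstModelResponse = 4.938    // timeTillFirstMLModelResponse
let ttsStarted = 4.962            // timeTillTTSStarted

// Hypothetical helper type: latency of each stage after the speaker stops talking.
struct LatencyBreakdown {
    let stt: Double    // end of speech -> transcript finalized
    let model: Double  // transcript finalized -> first model output
    let tts: Double    // first model output -> TTS playback starts
    var total: Double { stt + model + tts }
}

let breakdown = LatencyBreakdown(
    stt: transcriptFinalized - lastNoiseAboveFloor,
    model: firstModelResponse - transcriptFinalized,
    tts: ttsStarted - firstModelResponse
)

// Print each stage and its share of the total response time.
let stages: [(String, Double)] = [
    ("STT", breakdown.stt),
    ("Model", breakdown.model),
    ("TTS", breakdown.tts)
]
for (name, seconds) in stages {
    let share = seconds / breakdown.total * 100
    print(String(format: "%@: %.3f s (%.0f%%)", name, seconds, share))
}
print(String(format: "Total: %.3f s", breakdown.total))
```

With these values, STT comes out at roughly 2.13s, the model at roughly 0.43s, and TTS start-up at roughly 0.02s, so transcript finalization accounts for a bit over 80% of the response time.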

I'm already using reportingOptions: [.volatileResults, .fastResults], so is it probably as fast as possible right now? I'm just surprised the STT takes so much longer than the other parts (all being Core ML based, aren't they?)

Answer 1

Engineer OP

Apple

1w

We've added some advice on improving performance to our documentation, at https://developer.apple.com/documentation/speech/speechanalyzer#Improve-responsiveness.

The prepareToAnalyze method may be useful to preheat the analyzer and get the transcription started a bit sooner.
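As a rough illustration of that suggestion, the sketch below creates the transcriber with the reporting options mentioned in the question and calls prepareToAnalyze before any audio arrives. It assumes the speech assets for the locale are already installed, and `makePreheatedAnalyzer` is a hypothetical helper name; the exact initializer and method signatures should be checked against the linked documentation.

```swift
import AVFoundation
import Speech

// Sketch only: needs an SDK that ships SpeechAnalyzer (the OS 26 releases).
// `makePreheatedAnalyzer` is a hypothetical helper, not an Apple API.
@available(iOS 26.0, macOS 26.0, *)
func makePreheatedAnalyzer(locale: Locale) async throws -> (SpeechAnalyzer, SpeechTranscriber) {
    // Same reporting options as in the question.
    let transcriber = SpeechTranscriber(
        locale: locale,
        transcriptionOptions: [],
        reportingOptions: [.volatileResults, .fastResults],
        attributeOptions: []
    )
    let analyzer = SpeechAnalyzer(modules: [transcriber])

    // Ask for an audio format the transcriber can consume directly.
    let format = await SpeechAnalyzer.bestAvailableAudioFormat(compatibleWith: [transcriber])

    // Preheat: load resources now, before the user starts speaking.
    try await analyzer.prepareToAnalyze(in: format)
    return (analyzer, transcriber)
}
```

The idea would be to call this once, for example when the listening screen appears, so the setup cost is paid before the first utterance rather than during it.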