Conclusion:
This can't be done with SFSpeechURLRecognitionRequest(url:).
You must use SFSpeechAudioBufferRecognitionRequest() instead.
The solution is to read the audio into a buffer and then either (a) shift the entire audio block left after every recognition, trimming off the speech segment that was just recognized, or (b) feed SFSpeechAudioBufferRecognitionRequest 60-second snippets of audio.
Also, because progress reporting didn't work, I used the running count to work out the position currently being recognized relative to the length of the audio, and used that to determine the progress.
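As a rough sketch of the 60-second snippet approach described above (not the exact code used here): the helper names readWindow and recognizeWindow are hypothetical, and calling endAudio() immediately after appending the buffer is an assumption about how the request gets finished. The AVFoundation and Speech calls themselves are standard.

```swift
import AVFoundation
import Speech

/// Reads up to `windowSeconds` of audio starting at `startSeconds` into one PCM buffer.
/// Returns nil when `startSeconds` is at or past the end of the file.
func readWindow(from file: AVAudioFile,
                startSeconds: Double,
                windowSeconds: Double = 60) throws -> AVAudioPCMBuffer? {
    let format = file.processingFormat
    let startFrame = AVAudioFramePosition(startSeconds * format.sampleRate)
    guard startFrame < file.length else { return nil }

    // Clamp the window to whatever audio is left in the file.
    let wanted = AVAudioFramePosition(windowSeconds * format.sampleRate)
    let frameCount = AVAudioFrameCount(min(wanted, file.length - startFrame))
    guard let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: frameCount) else {
        return nil
    }

    file.framePosition = startFrame          // seek to the start of the window
    try file.read(into: buffer, frameCount: frameCount)
    return buffer
}

/// Submits one buffer as its own recognition request (hypothetical helper).
/// Keep the returned task so the request can be cancelled after a timeout.
func recognizeWindow(_ buffer: AVAudioPCMBuffer,
                     with recognizer: SFSpeechRecognizer,
                     completion: @escaping (SFSpeechRecognitionResult?, Error?) -> Void)
    -> SFSpeechRecognitionTask {
    let request = SFSpeechAudioBufferRecognitionRequest()
    request.shouldReportPartialResults = false
    request.append(buffer)
    request.endAudio()                        // assumption: no more audio for this request
    return recognizer.recognitionTask(with: request) { result, error in
        if let error = error {
            completion(nil, error)
        } else if let result = result, result.isFinal {
            completion(result, nil)
        }
    }
}
```

Each call to readWindow takes the current running position as startSeconds, so the same file can be walked one snippet at a time.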
Requirement
You must keep a rolling count of where your segments were found, so that you can keep track of your position in the audio.
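A minimal sketch of that rolling count, and of deriving progress from it, might look like the following. The type RecognitionPosition and its members are hypothetical names chosen for illustration; times are in seconds.

```swift
import Foundation

/// Tracks where the current snippet starts within the whole recording.
struct RecognitionPosition {
    /// Length of the whole audio, in seconds.
    let totalDuration: TimeInterval
    /// Running count: where the current snippet starts in the whole audio.
    private(set) var offset: TimeInterval = 0

    init(totalDuration: TimeInterval) {
        self.totalDuration = totalDuration
    }

    /// Converts a time reported relative to the current snippet
    /// into a time in the full audio.
    func absoluteTime(ofSegmentAt timestamp: TimeInterval) -> TimeInterval {
        return offset + timestamp
    }

    /// Moves the running count forward, never past the end of the audio.
    mutating func advance(by seconds: TimeInterval) {
        offset = min(totalDuration, offset + max(0, seconds))
    }

    /// Fraction of the audio processed so far, from 0 to 1.
    var progress: Double {
        return totalDuration > 0 ? offset / totalDuration : 1
    }
}
```

In practice the value passed to advance(by:) would typically be the end of the last recognized segment (timestamp plus duration of an SFTranscriptionSegment), and absoluteTime(ofSegmentAt:) maps each segment back onto the original recording.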
Caveats
If the request is short (say, 30 seconds), recognition will not proceed, so a short segment must be padded with silence, and the padding must be accounted for in your position tracking.
If the audio buffer contains blocks of more than 1 minute of non-speech (silence, music, unintelligible speech), you must wait for a timeout and then advance 60 seconds; otherwise you will just time out and not get any further recognition data. I have not been able to determine how to shorten the timeout, which appears to be 22 seconds.
Example: If you have audio with a two-minute stretch of non-speech, you will need to wait 22 seconds for the first timeout, cancel the request, advance to the next position, append the audio, and wait on the new recognition request, which takes another 22 seconds to time out before you can advance again. So if the audio contains many stretches of non-speech, this process works but is problematic in terms of processing time. Granted, a 22-second timeout is better than a 1-minute timeout.
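The two caveats above can be sketched as plain bookkeeping, independent of the Speech framework. Everything here is illustrative: the names PaddedSnippet, padWithSilence, SnippetOutcome, advanceAmount, and estimatedTimeoutWait are hypothetical, samples are assumed to be mono Float values, the 22-second timeout is the observed value from above rather than a documented constant, and the wait estimate assumes the non-speech stretch lines up with snippet boundaries.

```swift
import Foundation

/// A snippet after padding: the samples to submit plus how many frames are real audio.
struct PaddedSnippet {
    let samples: [Float]
    let realFrameCount: Int
    var paddingFrameCount: Int { return samples.count - realFrameCount }
}

/// Pads mono samples with trailing zeros (digital silence) up to `targetFrameCount`.
/// Snippets that are already long enough are returned unchanged.
func padWithSilence(_ samples: [Float], toFrameCount targetFrameCount: Int) -> PaddedSnippet {
    let missing = max(0, targetFrameCount - samples.count)
    return PaddedSnippet(samples: samples + [Float](repeating: 0, count: missing),
                         realFrameCount: samples.count)
}

/// Outcome of one recognition request on a snippet.
enum SnippetOutcome {
    /// End of the last recognized segment, in seconds from the snippet start.
    case recognized(lastSegmentEnd: TimeInterval)
    /// No recognition data arrived before the timeout.
    case timedOut
}

/// How far to move the running position after a snippet.
/// `realDuration` is the snippet length in seconds, excluding any padding.
func advanceAmount(after outcome: SnippetOutcome, realDuration: TimeInterval) -> TimeInterval {
    switch outcome {
    case .recognized(let lastSegmentEnd):
        // Never advance into the padded silence.
        return min(lastSegmentEnd, realDuration)
    case .timedOut:
        // Nothing recognized: skip the whole snippet.
        return realDuration
    }
}

/// Rough time spent waiting on timeouts for one stretch of non-speech.
func estimatedTimeoutWait(nonSpeechDuration: TimeInterval,
                          windowLength: TimeInterval = 60,
                          timeout: TimeInterval = 22) -> TimeInterval {
    let timeouts = (nonSpeechDuration / windowLength).rounded(.down)
    return timeouts * timeout
}
```

For the two-minute example, estimatedTimeoutWait(nonSpeechDuration: 120) gives 44 seconds, i.e. two 22-second timeouts.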
I am still tuning this process but it does work.