I have a voice app that is both playing and recording audio. I have enabled voice processing and am setting AVAudioSession.Category to .playAndRecord and AVAudioSession.Mode to .voiceChat.
When the experience first launches, we play a greeting. The first few hundred milliseconds of that greeting are being captured by the inputNode before AEC seems to start working. Is there any way to get AEC working the entire time? For now we've had to disable recording while we're playing audio, but would prefer to both play and record simultaneously.
Here are some code snippets:
public init(denoiseModelPath: URL? = nil) {
    noiseReducer = denoiseModelPath.flatMap { NoiseReducer(modelPath: $0) }
    recorderNode = engine.inputNode
    speakerNode = engine.outputNode
    mainMixerNode = engine.mainMixerNode
    engine.attach(audioPlayer)
    engine.connect(
        audioPlayer,
        to: mainMixerNode,
        format: nil
    )
    playbackFormat = mainMixerNode.outputFormat(forBus: 0)
}

public func setupAudioSession() async throws(AudioError) {
    do {
        let audioSession = AVAudioSession.sharedInstance()
        try audioSession.setCategory(
            .playAndRecord,
            mode: .voiceChat,
            policy: .default,
            options: [
                .defaultToSpeaker,
                .allowBluetoothHFP,
            ]
        )
        try audioSession.setActive(true)
    } catch {
        throw .audioSessionSetupFailed(error)
    }
    do {
        try recorderNode.setVoiceProcessingEnabled(true)
        try speakerNode.setVoiceProcessingEnabled(true)
    } catch {
        throw .enableVoiceProcessingFailed(error)
    }
}
The echo cancellation algorithm requires a short convergence period before it can achieve effective suppression (typically >20 dB). Under normal conditions, this convergence should complete within approximately 200ms, though certain edge cases may result in a slightly longer ramp-up time.
One approach to mitigate the impact during this initial phase is to gradually ramp up your playback audio volume at the start of the greeting. By starting at a lower volume and increasing it over the first ~200ms, you give the AEC algorithm time to converge before the full audio signal is present, which should significantly reduce the amount of echo captured by the input node.
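As a rough illustration of the ramp described above, here is a minimal sketch of a linear fade-in applied to raw float samples. The helper names (fadeInGain, applyFadeIn), the linear curve, and the 0.2 s default are assumptions chosen for illustration, not part of any Apple API; the right duration and curve shape may need tuning on real devices.

```swift
/// Linear fade-in gain (0.0 ... 1.0) for the sample at `sampleIndex`.
/// `rampDuration` is in seconds; the 0.2 s default is an assumption
/// based on the ~200 ms convergence figure mentioned above.
func fadeInGain(sampleIndex: Int,
                sampleRate: Double,
                rampDuration: Double = 0.2) -> Float {
    let rampSamples = rampDuration * sampleRate
    guard rampSamples > 0 else { return 1 }
    let progress = Double(sampleIndex) / rampSamples
    // Clamp so samples after the ramp play at full volume.
    return Float(min(max(progress, 0), 1))
}

/// Scales `samples` in place by the fade-in gain.
/// `startIndex` is the position of `samples[0]` within the whole greeting,
/// so the ramp continues smoothly if the greeting is split across buffers.
func applyFadeIn(to samples: inout [Float],
                 startIndex: Int = 0,
                 sampleRate: Double,
                 rampDuration: Double = 0.2) {
    for i in samples.indices {
        samples[i] *= fadeInGain(sampleIndex: startIndex + i,
                                 sampleRate: sampleRate,
                                 rampDuration: rampDuration)
    }
}
```

In an AVAudioEngine setup like the one in the question, the same per-sample gain could presumably be applied to each channel of the greeting's AVAudioPCMBuffer (via floatChannelData) before the buffer is scheduled on the player node, so only the first ~200 ms of the greeting is attenuated.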