Delay in Microphone Input When Talking While Receiving Audio in PTT Framework (Full Duplex Mode)

Context:

I am currently developing an app using the Push-to-Talk (PTT) framework. I have reviewed both the PTT framework documentation and the CallKit demo project to better understand how to properly manage audio session activation and AVAudioEngine setup.

I am not activating the audio session manually. The audio session configuration is handled in the incomingPushResult or didBeginTransmitting callbacks from the PTChannelManagerDelegate.

I am using a single AVAudioEngine instance for both input and playback. The engine is started in the didActivate callback from the PTChannelManagerDelegate. When I receive a push in full duplex mode, I set the active participant to the user who is speaking.
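For concreteness, the delegate flow just described looks roughly like this (an abridged sketch: other required PTChannelManagerDelegate methods are omitted, and `ChannelController`, `audioEngine`, and `participant(for:)` are illustrative names, not framework API):

```swift
import AVFoundation
import PushToTalk

extension ChannelController: PTChannelManagerDelegate {

    // The system activates the session; start the prepared engine here.
    func channelManager(_ channelManager: PTChannelManager,
                        didActivate audioSession: AVAudioSession) {
        try? audioEngine.start()
    }

    func channelManager(_ channelManager: PTChannelManager,
                        didDeactivate audioSession: AVAudioSession) {
        audioEngine.stop()
    }

    // Full duplex: report who is speaking by returning them as the
    // active remote participant for the incoming push.
    func incomingPushResult(channelManager: PTChannelManager,
                            channelUUID: UUID,
                            pushPayload: [String: Any]) -> PTPushResult {
        let speaker = participant(for: pushPayload) // hypothetical lookup
        return .activeRemoteParticipant(speaker)
    }
}
```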


Issue

When I attempt to talk while the other participant is already speaking, my input tap on the input node takes a few seconds to return valid PCM audio data. Initially, it returns an empty PCM audio block.

Details:

  • The audio session is already active and configured with .playAndRecord.
  • The input tap is already installed when the engine is started.
  • When I talk from a neutral state (no one is speaking), the system plays the standard "microphone activation" tone, which covers this initial delay. However, this does not happen when I am already receiving audio.

Assumptions / Current Setup

  • Because the audio session is active in play and record, I assumed that microphone input would be available immediately, even while receiving audio.
  • However, there is a delay before valid input is delivered to the tap, and it occurs only when switching from a receive state to simultaneously talking.

Questions

  1. Is this expected behavior when using the PTT framework in full duplex mode with a shared AVAudioEngine?
  2. Should I be restarting or reconfiguring the engine or audio session when beginning to talk while receiving audio?
  3. Is there a recommended pattern for managing microphone readiness in this scenario to avoid the initial empty PCM buffer?
  4. Would using separate engines for input and output improve responsiveness?

I would like to confirm the correct approach to handling simultaneous talk and receive in full duplex mode with the PTT framework and AVAudioEngine. Specifically, I need guidance on ensuring the microphone is ready to capture audio immediately, without the delay seen in my current implementation.


Relevant Code Snippets

Engine Setup

func setup() {
    let input = audioEngine.inputNode
    do {
        try input.setVoiceProcessingEnabled(true)
    } catch {
        print("Could not enable voice processing \(error)")
        return
    }

    input.isVoiceProcessingAGCEnabled = false

    let output = audioEngine.outputNode
    let mainMixer = audioEngine.mainMixerNode

    audioEngine.connect(pttPlayerNode, to: mainMixer, format: outputFormat)
    audioEngine.connect(beepNode, to: mainMixer, format: outputFormat)
    audioEngine.connect(mainMixer, to: output, format: outputFormat)

    // Initialize converters
    converter = AVAudioConverter(from: inputFormat, to: outputFormat)!
    f32ToInt16Converter = AVAudioConverter(from: outputFormat, to: inputFormat)!

    audioEngine.prepare()
}

Input Tap Installation

func installTap() {
    guard AudioHandler.shared.checkMicrophonePermission() else {
        print("Microphone not granted for recording")
        return
    }

    guard !isInputTapped else {
        print("[AudioEngine] Input is already tapped!")
        return
    }

    let input = audioEngine.inputNode
    let microphoneFormat = input.inputFormat(forBus: 0)
    let microphoneDownsampler = AVAudioConverter(from: microphoneFormat, to: outputFormat)!
    let desiredFormat = outputFormat
    let inputFramesNeeded = AVAudioFrameCount((Double(OpusCodec.DECODED_PACKET_NUM_SAMPLES) * microphoneFormat.sampleRate) / desiredFormat.sampleRate)
    input.installTap(onBus: 0, bufferSize: inputFramesNeeded, format: microphoneFormat) { [weak self] buffer, _ in
        guard let self = self else { return }
        // Output buffer: 1920 frames at 16kHz
        guard let outputBuffer = AVAudioPCMBuffer(pcmFormat: desiredFormat, frameCapacity: AVAudioFrameCount(OpusCodec.DECODED_PACKET_NUM_SAMPLES)) else { return }
        outputBuffer.frameLength = outputBuffer.frameCapacity

        let inputBlock: AVAudioConverterInputBlock = { inNumPackets, outStatus in
            outStatus.pointee = .haveData
            return buffer
        }

        var error: NSError?
        let converterResult = microphoneDownsampler.convert(to: outputBuffer, error: &error, withInputFrom: inputBlock)

        if converterResult != .haveData {
            DebugLogger.shared.print("Downsample error \(converterResult)")
        } else {
            self.handleDownsampledBuffer(outputBuffer)
        }
    }
    isInputTapped = true
}
Answered by DTS Engineer in 852071022

When I talk from a neutral state (no one is speaking), the system plays the standard "microphone activation" tone, which covers this initial delay. However, this does not happen when I am already receiving audio.

Can you file a bug about the second (no tone) case and post the bug number back here? That's not what I expected and may be a bug.

Because the audio session is active in play and record, I assumed that microphone input would be available immediately, even while receiving audio.

That assumption is incorrect. It shouldn't be "long", but there will be a delay. What's actually going on here is callservicesd "releasing" audio input to your app, which does cause a short delay. I believe the delay is roughly the same as unmuting a CallKit call.

One thing to understand here is that, just like CallKit*, the PTT audio session is NOT actually a standard PlayAndRecord session. It can do things that the standard PlayAndRecord cannot (for example, it CANNOT be interrupted by other PlayAndRecord sessions) but it's also being manipulated by "external" controls in ways that other sessions are not.

*As background context, the PTT session is implemented and managed by the same "infrastructure" CallKit uses, which is why you see similar functionality.

  1. Is this expected behavior when using the PTT framework in full duplex mode with a shared AVAudioEngine?

Yes. The time you're describing sounds like it's on the "long" side, but the basic behavior is normal.

  2. Should I be restarting or reconfiguring the engine or audio session when beginning to talk while receiving audio?

No. Once you go active, don't mess with your audio session.

  3. Is there a recommended pattern for managing microphone readiness in this scenario to avoid the initial empty PCM buffer?

I'm not sure you want to get rid of it. It's been a while since I've played with this, but I don't think the system will ever send you "real" audio that's actually a zeroed buffer, as there's always SOME amount of audio "noise" the system will pick up. I think it's reasonable to just ignore zeroed buffers, but you might also be able to use those buffers to trigger (or better, "prime") your own audio tone telling the user when they can speak.
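One way to act on those zeroed buffers is to treat a capture buffer whose peak magnitude stays below a tiny threshold as "not yet live", and use the first non-silent buffer as the signal that the mic is ready (for example, to trigger the ready tone suggested above). This is a sketch over plain Float arrays; in the tap you would feed it the samples from `buffer.floatChannelData`. The function name and threshold are illustrative choices, not framework API:

```swift
// Returns true when the buffer is effectively all zeros. A zeroed buffer
// has peak 0, while real room noise sits comfortably above this threshold.
func isEffectivelySilent(_ samples: [Float], threshold: Float = 1e-6) -> Bool {
    guard let peak = samples.map({ abs($0) }).max() else { return true }
    return peak < threshold
}
```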

  4. Would using separate engines for input and output improve responsiveness?

No, I would not expect that to matter. However, the audio system is sufficiently complex that I also wouldn't be surprised if there were a weird configuration where it did.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware


Thank you for the detailed reply. I've submitted a bug report as requested: FB19421676, "Push-to-Talk Framework: Microphone activation tone does not play when sending while audio session is active in full duplex mode."

Thanks to the context you provided regarding how the PTT framework functions, I was able to identify the cause of the transmission delay I was experiencing. It turns out that isVoiceProcessingInputMuted was set to true when starting a transmission, and only reverted to false once audio output stopped. This was the source of the delay between initiating transmission and receiving valid microphone input.

By manually setting isVoiceProcessingInputMuted to false on the input node at the start of transmission, I was able to eliminate this delay and begin receiving microphone samples immediately.
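The workaround described above can be sketched as follows (assuming `audioEngine` is the shared AVAudioEngine from the setup code, and noting that isVoiceProcessingInputMuted requires iOS 17 or later):

```swift
import AVFoundation

// Called when the user begins transmitting. If voice processing has muted
// the input while remote audio is playing, clear the mute explicitly so
// capture starts immediately instead of waiting for playback to stop.
func beginTransmission(on audioEngine: AVAudioEngine) {
    let input = audioEngine.inputNode
    if input.isVoiceProcessingInputMuted {
        input.isVoiceProcessingInputMuted = false
    }
}
```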

I'm still relatively new to Swift and iOS audio development, and I was wondering if there are any sample projects or best practices that demonstrate integrating audio with the Push-to-Talk framework. Having a reference implementation would help me avoid common pitfalls and improve how I manage audio routing and session state.

Thanks again for your help!
