Thanks for the tips — both already in place. The test app calls finishTasksAndInvalidate() after each run and waits for urlSession(_:didBecomeInvalidWithError:) before proceeding; that's what made the qlog files appear reliably. And yes, I pulled everything into a standalone macOS app — that's where all the measurements below were taken.
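For anyone following along, the invalidate-and-wait pattern looks roughly like this (a sketch; the delegate class and the continuation wrapper are my own naming, not from the test app):

```swift
import Foundation

// Sketch: after each measurement run, tear the session down and wait for
// urlSession(_:didBecomeInvalidWithError:) before starting the next run,
// so the qlog file for this connection gets flushed to disk.
final class RunDelegate: NSObject, URLSessionDelegate {
    var onInvalidated: ((Error?) -> Void)?

    func urlSession(_ session: URLSession, didBecomeInvalidWithError error: Error?) {
        onInvalidated?(error)
    }
}

func tearDown(_ session: URLSession, delegate: RunDelegate) async {
    await withCheckedContinuation { (cont: CheckedContinuation<Void, Never>) in
        delegate.onInvalidated = { _ in cont.resume() }
        // Lets in-flight tasks finish, then invalidates the session.
        session.finishTasksAndInvalidate()
    }
}
```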
I've been systematically comparing configurations. The key variable is request size: our /api/launch POST has ~18 KB of headers (cookies, auth tokens, experiment flags) plus a small body. With URLSessionConfiguration.default, measuring requestStart→requestEnd from URLSessionTaskMetrics:
H3 · large headers · modern engine · with prewarm: 230–542 ms
H3 · large headers · modern engine · no prewarm: 34–57 ms
H3 · large headers · classic engine · with prewarm: 24–26 ms
H3 · small headers · any config: 0 ms
H2 · any config: 0 ms (as I understand it, H2 hands its packets off to TCP, so the implementation can't tell when the last packet actually went out, unlike QUIC)
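For reference, the stall numbers above are the requestStart→requestEnd gap from the task's transaction metrics, collected roughly like this (the delegate class name is mine):

```swift
import Foundation

// Sketch: collect URLSessionTaskMetrics and report how long the request
// (headers + body) took to leave the client, per transaction.
final class MetricsDelegate: NSObject, URLSessionTaskDelegate {
    func urlSession(_ session: URLSession,
                    task: URLSessionTask,
                    didFinishCollecting metrics: URLSessionTaskMetrics) {
        for tx in metrics.transactionMetrics {
            guard let start = tx.requestStartDate,
                  let end = tx.requestEndDate else { continue }
            let ms = end.timeIntervalSince(start) * 1000
            print("\(tx.networkProtocolName ?? "?"): requestStart→requestEnd = \(Int(ms)) ms")
        }
    }
}
```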
The stall is H3-specific and header-size-dependent. The interesting part: preceding the large POST with a small HEAD request (connection prewarm) worsens the stall compared to skipping prewarm entirely.
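To be concrete about what I mean by prewarm (the function signature and header plumbing here are placeholders, not the real app's code):

```swift
import Foundation

// Sketch: "prewarm" = issue a cheap HEAD against the same host first, so the
// QUIC handshake is already done when the large POST goes out. Per the
// numbers above, this makes the stall worse, not better.
func launchWithPrewarm(session: URLSession, url: URL, body: Data,
                       headers: [String: String]) async throws {
    var head = URLRequest(url: url)
    head.httpMethod = "HEAD"
    _ = try await session.data(for: head)   // connection is now established

    var post = URLRequest(url: url)
    post.httpMethod = "POST"
    for (k, v) in headers { post.setValue(v, forHTTPHeaderField: k) }  // ~18 KB of headers
    post.httpBody = body
    _ = try await session.data(for: post)
}
```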
No stream_data_blocked or data_blocked frames appear in the qlogs — consistent with congestion control (not flow control) being the bottleneck. The modern engine shows ~10× worse stall than classic under the same conditions, which is the part I can't fully explain yet.
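For completeness, the "no blocked frames" claim comes from a crude substring scan of the qlog files rather than a real parser; something like this is enough for a yes/no answer:

```swift
import Foundation

// Sketch: count flow-control frames in a qlog file by scanning for the
// quoted frame-type names. The surrounding quotes keep "data_blocked"
// from matching inside "stream_data_blocked".
func blockedFrameCounts(qlogPath: String) throws -> (streamDataBlocked: Int, dataBlocked: Int) {
    let text = try String(contentsOfFile: qlogPath, encoding: .utf8)
    let stream = text.components(separatedBy: "\"stream_data_blocked\"").count - 1
    let data = text.components(separatedBy: "\"data_blocked\"").count - 1
    return (stream, data)
}
```

Both counts come back zero on every run, which is what points me at congestion control rather than flow control.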
That said, I may not have isolated my original problem yet, because I saw it on the classic engine as well. I'll run another experiment in the real app to confirm.