Post

Replies

Boosts

Views

Activity

Reply to iCloud Drive silent upload deadlock caused by stale HTTP/3 session in nsurlsessiond (FB22476701)
Follow-up — post-submission PCAP analysis of the filtered QUIC capture from the 2026-04-21 reproduction (attached to FB22476701, third update). Analysed with tshark 4.6.4 for QUIC long-header parsing; Initial + Handshake packets are decryptable per RFC 9001 §5.2 (keys derived from DCID). 1-RTT STREAM payloads remain opaque — nsurlsessiond does not export SSLKEYLOGFILE. The 102 s deadlock window is quantifiable on the wire: Last pre-gap packet: 20:09:51.802 (Client→Server, protected, port 51943) First post-gap packet: 20:11:34.625 (Client→Server, fresh Initial on new port 50715) 102.823 s with zero packets in either direction — no client PING, no keepalive, no probe, no CONNECTION_CLOSE The same missing-wakeup pattern was already documented earlier the same evening via sample on cloudd during a reactive-capture repro: the main thread, CFStream.LegacyThread and NSURLConnectionLoader all at 8627/8627 samples blocked in mach_msg2_trap. The user-space thread stacks and the UDP flow tell the same story from two angles — daemon is alive, consuming no CPU, simply not being notified that the QUIC session has died. Three tshark-level observations worth recording: All seven TLS handshakes in the capture completed cleanly (7 × ClientHello + ServerHello, ALPN=h3, RTT 7.6–12.2 ms, median ~9 ms). No CONNECTION_CLOSE frames visible in any Initial or Handshake packet (these are decryptable — if present they would show). Session terminations happen in the 1-RTT data phase, not during TLS establishment. Server is demonstrably healthy: after the 102 s gap, the fresh client Initial is answered by a full server Initial + Handshake in 7.9 ms. The endpoint is neither down nor rate-limited — the silence is purely client-side. Per-session byte counts confirm the body transfer never progresses meaningfully: largest pre-gap session 85 KB C→S, all three pre-gap sessions combined 110 KB, while cloudd log shows CKAsset body attempts with requestBodyBytes of 2.5–3.3 MB each, all returning err=T, requestDuration=-1.000. On the client-side gap specifically: RFC 9000 §10.1 permits silent idle close server-side ("the connection is silently closed and its state is discarded"), which a GCS load balancer may legitimately do. But §10.1.1 and §10.1.2 of the same RFC explicitly describe the client sending PING or other ack-eliciting frames for liveness testing, particularly when an endpoint is "expecting response data but does not have or is unable to send application data" (§10.1.2) — precisely cloudd's state during the deadlock. None of that fires in the 102 s window. The original report's fix recommendations #1 (aggressive stale-session invalidation when the QUIC pool detects an unresponsive session) and #2 (explicit pool-invalidation API surfaced to CloudKit / cloudd) remain precisely targeted. The client-side lack of any idle-phase probing or upper-layer failure propagation is the actionable engineering gap. All of this is on FB22476701 now (fourth update, 2026-04-21 evening) — this post is a community-side summary for anyone hitting the same pattern.
1d
Reply to iCloud Drive silent upload deadlock caused by stale HTTP/3 session in nsurlsessiond (FB22476701)
Quick update — the prediction from the 2026-04-18 post panned out. As of 2026-04-21, h3/GCS is back on this account/region for CKAsset body PUTs. Same Mac (macOS 26.4.1, build 25E253), same iCloud Drive account, no macOS update since 2026-04-18. The deadlock reproduced three times within 50 minutes today: Run 1 (natural, ambient work): 4 files stuck 10–30 min, signature matches FB22476701 — 94 × protocol=h3, 10 × requestDuration=-1.000 in the 30-min log window, pool=com.apple.cloudkit.BackgroundConnectionPool Runs 2 + 3 (scripted bulk-copy test set, 90 s windows): same signature, pool-poisoning within 90 s of each upload batch New forensic data point captured this time — sample on the user cloudd during the hang shows NSURLConnectionLoader + CFStream.LegacyThread both 100 % blocked in mach_msg2_trap for the full 10 s capture window. No CPU spin, no lock contention; the daemon is alive and consuming no CPU, simply not being notified that the QUIC session has died. Missing-wakeup pattern, not a mutex deadlock. All of this is now on FB22476701 (third update) with filtered QUIC PCAP, thread samples, router-log link-layer control, and the full 30-min pre-kill log. Keeping the primary channel on FB per Quinn's earlier direction.
1d
Reply to iCloud Drive silent upload deadlock caused by stale HTTP/3 session in nsurlsessiond (FB22476701)
Quick update — tried to reproduce the deadlock again today, 2026-04-18, same Mac (macOS 26.4.1, build 25E253), same iCloud Drive account. No reproduction. The change is server-side. Comparing 2026-04-11 → 2026-04-18: Upload host: gcs-eu-00002.content-storage-upload.googleapis.com (GCS) → eu-irl-00001.s3.dualstack.eu-west-1.amazonaws.com (AWS S3 Ireland) Protocol: h3 (QUIC) → http/1.1 Large CKAsset body (~20–26 MB): 6 × err=T with requestDuration=-1.000 → single err=F in 1.4–3.0 s This routing change is specific to the CKAsset body-upload path — other cloudd flows (e.g. metadata requests to gateway.icloud.com) continue as before. The underlying CFNetwork BackgroundConnectionPool stale-QUIC-session behavior is still in the stack — no macOS update has shipped in between (still on 26.4.1 build 25E253). It's just not being exercised, because h3 isn't negotiated to the new endpoint. If h3 returns on the CKAsset-upload path (AWS enabling it on this endpoint, or Apple routing back to an h3-capable endpoint), the deadlock would likely return.
4d
Reply to iCloud Drive silent upload deadlock caused by stale HTTP/3 session in nsurlsessiond (FB22476701)
Thanks for the detailed write-up and the constructive suggestions. One correction worth flagging for anyone finding this thread later: on macOS 26.4.1, disabling QUIC via defaults write … CFNetworkHTTP3Enabled -bool false does not suppress HTTP/3 for nsurlsessiond's BackgroundConnectionPool. I tested this as part of ruling out root-cause hypotheses — the deadlock still reproduces with h3 as the negotiated protocol. So unfortunately there is currently no user-facing off-switch for QUIC on the iCloud Drive path; the only recovery that works is killing the user-level cloudd and nsurlsessiond processes. On packet capture and pre/post state snapshots — good call, both gaps in my evidence so far. I'll add them to the collection the next time the pool poisons and post a follow-up if anything diagnostic turns up.
1w
Reply to iCloud Drive silent upload deadlock caused by stale HTTP/3 session in nsurlsessiond (FB22476701)
Follow-up — post-submission PCAP analysis of the filtered QUIC capture from the 2026-04-21 reproduction (attached to FB22476701, third update). Analysed with tshark 4.6.4 for QUIC long-header parsing; Initial + Handshake packets are decryptable per RFC 9001 §5.2 (keys derived from DCID). 1-RTT STREAM payloads remain opaque — nsurlsessiond does not export SSLKEYLOGFILE. The 102 s deadlock window is quantifiable on the wire: Last pre-gap packet: 20:09:51.802 (Client→Server, protected, port 51943) First post-gap packet: 20:11:34.625 (Client→Server, fresh Initial on new port 50715) 102.823 s with zero packets in either direction — no client PING, no keepalive, no probe, no CONNECTION_CLOSE The same missing-wakeup pattern was already documented earlier the same evening via sample on cloudd during a reactive-capture repro: the main thread, CFStream.LegacyThread and NSURLConnectionLoader all at 8627/8627 samples blocked in mach_msg2_trap. The user-space thread stacks and the UDP flow tell the same story from two angles — daemon is alive, consuming no CPU, simply not being notified that the QUIC session has died. Three tshark-level observations worth recording: All seven TLS handshakes in the capture completed cleanly (7 × ClientHello + ServerHello, ALPN=h3, RTT 7.6–12.2 ms, median ~9 ms). No CONNECTION_CLOSE frames visible in any Initial or Handshake packet (these are decryptable — if present they would show). Session terminations happen in the 1-RTT data phase, not during TLS establishment. Server is demonstrably healthy: after the 102 s gap, the fresh client Initial is answered by a full server Initial + Handshake in 7.9 ms. The endpoint is neither down nor rate-limited — the silence is purely client-side. Per-session byte counts confirm the body transfer never progresses meaningfully: largest pre-gap session 85 KB C→S, all three pre-gap sessions combined 110 KB, while cloudd log shows CKAsset body attempts with requestBodyBytes of 2.5–3.3 MB each, all returning err=T, requestDuration=-1.000. On the client-side gap specifically: RFC 9000 §10.1 permits silent idle close server-side ("the connection is silently closed and its state is discarded"), which a GCS load balancer may legitimately do. But §10.1.1 and §10.1.2 of the same RFC explicitly describe the client sending PING or other ack-eliciting frames for liveness testing, particularly when an endpoint is "expecting response data but does not have or is unable to send application data" (§10.1.2) — precisely cloudd's state during the deadlock. None of that fires in the 102 s window. The original report's fix recommendations #1 (aggressive stale-session invalidation when the QUIC pool detects an unresponsive session) and #2 (explicit pool-invalidation API surfaced to CloudKit / cloudd) remain precisely targeted. The client-side lack of any idle-phase probing or upper-layer failure propagation is the actionable engineering gap. All of this is on FB22476701 now (fourth update, 2026-04-21 evening) — this post is a community-side summary for anyone hitting the same pattern.
Replies
Boosts
Views
Activity
1d
Reply to iCloud Drive silent upload deadlock caused by stale HTTP/3 session in nsurlsessiond (FB22476701)
Quick update — the prediction from the 2026-04-18 post panned out. As of 2026-04-21, h3/GCS is back on this account/region for CKAsset body PUTs. Same Mac (macOS 26.4.1, build 25E253), same iCloud Drive account, no macOS update since 2026-04-18. The deadlock reproduced three times within 50 minutes today: Run 1 (natural, ambient work): 4 files stuck 10–30 min, signature matches FB22476701 — 94 × protocol=h3, 10 × requestDuration=-1.000 in the 30-min log window, pool=com.apple.cloudkit.BackgroundConnectionPool Runs 2 + 3 (scripted bulk-copy test set, 90 s windows): same signature, pool-poisoning within 90 s of each upload batch New forensic data point captured this time — sample on the user cloudd during the hang shows NSURLConnectionLoader + CFStream.LegacyThread both 100 % blocked in mach_msg2_trap for the full 10 s capture window. No CPU spin, no lock contention; the daemon is alive and consuming no CPU, simply not being notified that the QUIC session has died. Missing-wakeup pattern, not a mutex deadlock. All of this is now on FB22476701 (third update) with filtered QUIC PCAP, thread samples, router-log link-layer control, and the full 30-min pre-kill log. Keeping the primary channel on FB per Quinn's earlier direction.
Replies
Boosts
Views
Activity
1d
Reply to iCloud Drive silent upload deadlock caused by stale HTTP/3 session in nsurlsessiond (FB22476701)
Quick update — tried to reproduce the deadlock again today, 2026-04-18, same Mac (macOS 26.4.1, build 25E253), same iCloud Drive account. No reproduction. The change is server-side. Comparing 2026-04-11 → 2026-04-18: Upload host: gcs-eu-00002.content-storage-upload.googleapis.com (GCS) → eu-irl-00001.s3.dualstack.eu-west-1.amazonaws.com (AWS S3 Ireland) Protocol: h3 (QUIC) → http/1.1 Large CKAsset body (~20–26 MB): 6 × err=T with requestDuration=-1.000 → single err=F in 1.4–3.0 s This routing change is specific to the CKAsset body-upload path — other cloudd flows (e.g. metadata requests to gateway.icloud.com) continue as before. The underlying CFNetwork BackgroundConnectionPool stale-QUIC-session behavior is still in the stack — no macOS update has shipped in between (still on 26.4.1 build 25E253). It's just not being exercised, because h3 isn't negotiated to the new endpoint. If h3 returns on the CKAsset-upload path (AWS enabling it on this endpoint, or Apple routing back to an h3-capable endpoint), the deadlock would likely return.
Replies
Boosts
Views
Activity
4d
Reply to iCloud Drive silent upload deadlock caused by stale HTTP/3 session in nsurlsessiond (FB22476701)
Thanks for the detailed write-up and the constructive suggestions. One correction worth flagging for anyone finding this thread later: on macOS 26.4.1, disabling QUIC via defaults write … CFNetworkHTTP3Enabled -bool false does not suppress HTTP/3 for nsurlsessiond's BackgroundConnectionPool. I tested this as part of ruling out root-cause hypotheses — the deadlock still reproduces with h3 as the negotiated protocol. So unfortunately there is currently no user-facing off-switch for QUIC on the iCloud Drive path; the only recovery that works is killing the user-level cloudd and nsurlsessiond processes. On packet capture and pre/post state snapshots — good call, both gaps in my evidence so far. I'll add them to the collection the next time the pool poisons and post a follow-up if anything diagnostic turns up.
Replies
Boosts
Views
Activity
1w