The environment:
macOS Ventura 13.3.1 on a Mac mini (2018 model)
A content filter extension (content-filter-provider), configured with filterSockets = YES, filterPackets = NO
A DNS proxy extension (dns-proxy)
An Ethernet network and two Wi-Fi access points
OpenVPN
The exact conditions to reproduce the issue are obscure. So far, we managed to narrow it down to:
Connect to the Ethernet network
Connect to Wi-Fi access point A, and wait a couple minutes
Connect to Wi-Fi access point B (in our tests, this is a mobile hotspot; not sure if relevant)
Establish an OpenVPN tunnel
Put machine to sleep, and wait 15 minutes
Disable Wi-Fi access point B, and wait 3 hours
Wake up the machine
The result is a complete loss of network connectivity: the machine can't even renew the DHCP lease
The content filter logs new flows without blocking or pausing them. If a flow is missing some address information, it peeks 1 incoming byte or 1 outgoing byte, whichever comes first ([NEFilterNewFlowVerdict filterDataVerdictWithFilterInbound:YES peekInboundBytes:1 filterOutbound:YES peekOutboundBytes:1]), in the hope to get more complete flow metadata once data is being exchanged. Most importantly, if a flow looks like a DNS request over UDP, the content filter sniffs all incoming data, asking for at most 10000 bytes at a time ([NEFilterNewFlowVerdict filterDataVerdictWithFilterInbound:YES peekInboundBytes:10000 filterOutbound:YES peekOutboundBytes:1]). The DNS answer sniffer has already caused conflict with the DNS proxy in the past, but we fixed it.
I don't know the DNS proxy in as much detail, but I know that it has both blacklisting and proxying features (it proxies requests to a CloudFlare API, I believe), so it's a little more complex than a simple pass-through like the content filter
The last event logged by the content filter is, in fact, a DNS query whose answer is being sniffed. Then it stops logging anything, coinciding with the system-wide loss of connectivity (verified from the logs of other applications and services)
Spindump shows that the content filter is idle, stuck in NetworkExtension internals, which in turn are stuck in an event wait (lck_mtx_sleep). Also of note, mDNSResponder and nesessionmanager are both stuck in a kernel mode call to soflow_db_apply, as a result of closeing a socket. In all cases, by "stuck", I mean that the thread was found to be executing that code in 1000 samples out of 1000, with no use of CPU time
An anonymized subset of the log is attached: spindump-abridged.txt
Are we doing anything obviously wrong? Any gotchas that we may have tripped onto in our extension implementations? Or is it possibly a system bug that we should open a radar for? Thank you
Selecting any option will automatically load the page