Post

Replies

Boosts

Views

Activity

Reply to Loss of connectivity with DNS proxy and content filter extensions running at the same time
Apologies for the confusion, @meaton, but it wasn't a question, I was anticipating a question. We know for a fact that the content filter extension is involved, because as soon as it's gone (uninstalling, disabling or killing all work), network connectivity is regained (with some residual issues, though, e.g. Chrome works fine but Safari doesn't) Yes, we are trying to isolate the issue by making further tests with some of the variables changed. In fact, I believe we're currently performing a test with the NEFilterDataProvider alone I'm not sure I understand what you mean "start with the APIs that you are using". Do you want me to give you an overview of the APIs used by the network extensions, or are you referring to the tests you suggested to perform, where we test one API at the time? Finally, I believe that I have successfully decoded the private kernel APIs in the stack backtraces of the stuck threads. My best guess for the network extension thread is: 1000 -[NEFilterSocketFlow createDataReply:controlSocket:direction:verdict:context:] + 372 (NetworkExtension + 395450) [0x7ff8183728ba] 1000 <patched truncated backtrace> 1000 write + 10 (libsystem_kernel.dylib + 17242) [0x7ff80781a35a] *1000 hndl_unix_scall64 + 22 (kernel + 1265046) [0xffffff8000410d96] *1000 unix_syscall64 + 480 (kernel + 7880736) [0xffffff8000a60020] *1000 write + 289 (kernel + 6863105) [0xffffff8000967901] *1000 writev_internal ??? (kernel + 6862327) [0xffffff80009675f7] *1000 fp_writev ??? (kernel + 6861353) [0xffffff8000967229] *1000 soo_write ??? (kernel + 6883640) [0xffffff800096c938] *1000 sosend + 2519 (kernel + 7064071) [0xffffff8000998a07] *1000 ctl_send ??? (kernel + 6536106) [0xffffff8000917baa] *1000 cfil_ctl_send ??? (kernel + 4997925) [0xffffff80007a0325] *1000 cfil_action_data_pass ??? (kernel + 5007513) [0xffffff80007a2899] *1000 cfil_acquire_sockbuf ??? (kernel + 5015481) [0xffffff80007a47b9] *1000 _sleep ??? (kernel + 6729239) [0xffffff8000946e17] *1000 lck_mtx_sleep + 101 (kernel + 1753861) [0xffffff8000488305] It looks like the thread is taking the socket referenced by the NEFilterSocketFlow and attempting to acquire the lock of its send or receive buffer, but the lock was acquired by another thread and never released (for hours/days) For the mDNSResponder thread stuck in close, I guessed: 1000 _dispatch_source_cancel_callout + 153 (libdispatch.dylib + 98497) [0x7ff8076ca0c1] 1000 _dispatch_continuation_pop + 463 (libdispatch.dylib + 23397) [0x7ff8076b7b65] 1000 _dispatch_client_callout + 8 (libdispatch.dylib + 12339) [0x7ff8076b5033] 1000 <patched truncated backtrace> 1000 close + 10 (libsystem_kernel.dylib + 8134) [0x7ff807817fc6] *1000 hndl_unix_scall64 + 22 (kernel + 1265046) [0xffffff8000410d96] *1000 unix_syscall64 + 480 (kernel + 7880736) [0xffffff8000a60020] *1000 sys_close ??? (kernel + 6451958) [0xffffff80009032f6] *1000 soo_close ??? (kernel + 6448665) [0xffffff8000902619] *1000 soclose + 169 (kernel + 7055673) [0xffffff8000996939] *1000 soclose_locked + 226 (kernel + 7052818) [0xffffff8000995e12] *1000 soflow_db_apply + 146 (kernel + 7307298) [0xffffff80009d4022] *1000 cfil_sock_udp_is_closed ??? (kernel + 5047815) [0xffffff80007ac607] *1000 cfil_acquire_sockbuf ??? (kernel + 5015481) [0xffffff80007a47b9] *1000 _sleep ??? (kernel + 6729239) [0xffffff8000946e17] *1000 lck_mtx_sleep + 101 (kernel + 1753861) [0xffffff8000488305] I'm less confident about the nesessionmanager thread, but it could be: 1000 _dispatch_call_block_and_release + 12 (libdispatch.dylib + 7569) [0x7ff8076b3d91] 1000 ??? (nesessionmanager + 220610) [0x10ccb3dc2] 1000 ??? (nesessionmanager + 73682) [0x10cc8ffd2] 1000 _objc_rootDealloc + 62 (libobjc.A.dylib + 40383) [0x7ff8074c3dbf] 1000 objc_destructInstance + 99 (libobjc.A.dylib + 40497) [0x7ff8074c3e31] 1000 object_cxxDestructFromClass(objc_object*, objc_class*) + 83 (libobjc.A.dylib + 68187) [0x7ff8074caa5b] 1000 <patched truncated backtrace> 1000 close + 10 (libsystem_kernel.dylib + 8134) [0x7ff807817fc6] *1000 hndl_unix_scall64 + 22 (kernel + 1265046) [0xffffff8000410d96] *1000 unix_syscall64 + 480 (kernel + 7880736) [0xffffff8000a60020] *1000 sys_close ??? (kernel + 6451958) [0xffffff80009032f6] *1000 soo_close ??? (kernel + 6448665) [0xffffff8000902619] *1000 soclose + 169 (kernel + 7055673) [0xffffff8000996939] *1000 soclose_locked + 1381 (kernel + 7053973) [0xffffff8000996295] *1000 ??? (kernel + 6534839) [0xffffff80009176b7] *1000 ??? (kernel + 5586252) [0xffffff800082fd4c] *1000 sbflush + 88 (kernel + 7105400) [0xffffff80009a2b78] *1000 soflow_db_apply + 146 (kernel + 7307298) [0xffffff80009d4022] *1000 cfil_sock_udp_buf_update_per_flow ??? (kernel + 5050092) [0xffffff80007aceec] *1000 cfil_acquire_sockbuf ??? (kernel + 5015481) [0xffffff80007a47b9] *1000 _sleep ??? (kernel + 6729239) [0xffffff8000946e17] *1000 lck_mtx_sleep + 101 (kernel + 1753861) [0xffffff8000488305] In all cases, there seems to be a call to cfil_acquire_sockbuf, stuck in a permanent wait. Short of a kernel debugger, which I've never used, I don't know how I could get a better idea of what is holding the lock on the socket buffer and why Any idea is welcome
Topic: App & System Services SubTopic: Core OS Tags:
Jun ’23
Reply to Loss of connectivity with DNS proxy and content filter extensions running at the same time
@meaton I can already tell that Simple Firewall isn't going to trigger the issue: in its handleNewFlow method, it can only return allow, drop and pause verdicts. I strongly suspect that this issue is related to filter data verdicts, as we already fixed (worked around, really) a similar issue in the past that disappeared when we stopped using filter data verdicts We'll work to minimize the test case and reproduction steps, and open a feedback report, as we've received confirmation via a WWDC Q&A that Apple is already aware of issues with DNS proxy and content filter
Topic: App & System Services SubTopic: Core OS Tags:
Jun ’23
Reply to Loss of connectivity with DNS proxy and content filter extensions running at the same time
@meaton I have more information. As a last resort, we tried adding a mechanism to kill the system extension when the issue occurs, as it appeared to fix the issue in previous tests. It doesn't fix the issue, but it does change its root cause. Elsewhere in our application, we periodically submit a NSURLSessionTask for an HTTP POST request. When connectivity is lost, the request fails with the following error: Error Domain=NSURLErrorDomain Code=-1001 "The request timed out." UserInfo={_kCFStreamErrorCodeKey=-2103, _NSURLErrorFailingURLSessionTaskErrorKey=LocalUploadTask <03BFA491-37B2-419C-B69A-C3217558760B>.<1>, _NSURLErrorRelatedURLSessionTaskErrorKey=( "LocalUploadTask <03BFA491-37B2-419C-B69A-C3217558760B>.<1>" ), NSLocalizedDescription=The request timed out., NSErrorFailingURLStringKey=<URL>, NSErrorFailingURLKey=<URL>, _kCFStreamErrorDomainKey=4} Our watchdog mechanism notices that the network extension isn't being notified of new network flows anymore (we have a separate mechanism that periodically starts a new UDP flow, to ensure a constant stream of calls to -[NEFilterDataProvider handleNewFlow:]. If we stop receiving calls to that method, it must be some kind of system issue/bug), and kills the network extension process with a KILL signal (don't judge us, we really need this to work). This does not restore connectivity; in fact, the aforementioned NSURLSessionTask starts failing with a different error: Error Domain=NSURLErrorDomain Code=-1009 "The Internet connection appears to be offline." UserInfo={_kCFStreamErrorCodeKey=50, NSUnderlyingError=0x60000355c180 {Error Domain=kCFErrorDomainCFNetwork Code=-1009 "(null)" UserInfo={_NSURLErrorNWPathKey=unsatisfied (Path was denied by NECP policy), interface: en1[802.11], ipv4, dns, _kCFStreamErrorCodeKey=50, _kCFStreamErrorDomainKey=1}}, _NSURLErrorFailingURLSessionTaskErrorKey=LocalUploadTask <F7A60E10-A65C-4D1F-9954-8A0325B2CB88>.<1>, _NSURLErrorRelatedURLSessionTaskErrorKey=( "LocalUploadTask <F7A60E10-A65C-4D1F-9954-8A0325B2CB88>.<1>" ), NSLocalizedDescription=The Internet connection appears to be offline., NSErrorFailingURLStringKey=<URL>, NSErrorFailingURLKey=<URL>, _kCFStreamErrorDomainKey=1} Note that the failure is in DNS specifically (interface: en1[802.11], ipv4, dns). Needless to say, we don't explicitly invoke any NECP policies. This is how we create and submit the POST request: void sendHttpRequest(NSURL *url, const char *jwt, NSString *userAgent, NSData *data, void (^completionHandler)(NSData *_Nullable data, NSURLResponse *_Nullable response, NSError *_Nullable error)) { NSURLSessionConfiguration *const configuration = [NSURLSessionConfiguration defaultSessionConfiguration]; configuration.TLSMinimumSupportedProtocolVersion = tls_protocol_version_TLSv12; configuration.timeoutIntervalForResource = 30; configuration.requestCachePolicy = NSURLRequestReloadIgnoringLocalAndRemoteCacheData; configuration.HTTPCookieStorage = nil; configuration.URLCredentialStorage = nil; configuration.URLCache = nil; NSURLSession *const session = [NSURLSession sessionWithConfiguration:configuration]; NSMutableURLRequest *const request = [NSMutableURLRequest requestWithURL:url]; [request setValue:@"application/json" forHTTPHeaderField:@"Content-Type"]; [request setValue:[NSString stringWithFormat:@"Bearer %s", jwt] forHTTPHeaderField:@"Authorization"]; [request setValue:userAgent forHTTPHeaderField:@"User-Agent"]; request.HTTPMethod = @"POST"; NSURLSessionTask *const task = [session uploadTaskWithRequest:request fromData:data completionHandler:completionHandler]; [task resume]; }
Topic: App & System Services SubTopic: Core OS Tags:
Jul ’23
Reply to Loss of connectivity with DNS proxy and content filter extensions running at the same time
Spindump shows why killing the process didn't work: Process: ... [2231] (suspended) (zombie) UUID: ... Shared Cache: None Architecture: x86_64 Parent: launchd [1] UID: 0 Footprint: 4096 B Time Since Fork: 143146s Num samples: 1001 (1-1001) Note: Suspended for 1001 samples Note: Terminated (zombie) for 1001 samples Thread 0x10187 1001 samples (1-1001) priority 31 (base 31) 1001 <truncated backtrace> *1001 hndl_unix_scall64 + 22 (kernel + 1265046) [0xffffff8000410d96] *1001 unix_syscall64 + 480 (kernel + 7880736) [0xffffff8000a60020] *1001 write + 289 (kernel + 6863105) [0xffffff8000967901] *1001 ??? (kernel + 6862327) [0xffffff80009675f7] *1001 ??? (kernel + 6861353) [0xffffff8000967229] *1001 ??? (kernel + 6883640) [0xffffff800096c938] *1001 sosend + 2519 (kernel + 7064071) [0xffffff8000998a07] *1001 ??? (kernel + 6536106) [0xffffff8000917baa] *1001 ??? (kernel + 4997925) [0xffffff80007a0325] *1001 ??? (kernel + 5007513) [0xffffff80007a2899] *1001 ??? (kernel + 5015481) [0xffffff80007a47b9] *1001 ??? (kernel + 6729239) [0xffffff8000946e17] *1001 lck_mtx_sleep + 101 (kernel + 1753861) [0xffffff8000488305] *1001 thread_block_reason + 192 (kernel + 1828864) [0xffffff800049a800] *1001 ??? (kernel + 1837097) [0xffffff800049c829] *1001 machine_switch_context + 215 (kernel + 2961047) [0xffffff80005aee97] (suspended) Binary Images: 0x109935000 - ??? ??? ... *0xffffff80002dc000 - 0xffffff8000cdbfff kernel (8796.101.5) <CF2A42DA-3F7C-30C6-9433-6F2076FF1F94> /System/Library/Kernels/kernel The process was killed, but the stuck thread is stuck in a kernel mode wait that can't be interrupted, so the process remains in a zombified state, still exhibiting symptoms of the kernel mode deadlock.
Topic: App & System Services SubTopic: Core OS Tags:
Jul ’23