Reply to System-wide deadlock in removexattr from revisiond / APFS
I think we finally found a reliable way to reproduce the issue: a stress binary that mixes random file creation, APFS transparent compression, and adding and removing extended attributes, all spread across a large number of threads. A kernel deadlock (similar to the "real" one we have) usually happens after 15-20 minutes of running this stress binary, without any Endpoint Security client registered and without root privileges: it reproduces on a fresh, clean VM. We reproduce it on macOS 15 and on macOS 26 up to, but not including, 26.4: on 26.4 the tool ran for hours without any repro. So I believe the issue was actually fixed in 26.4. We are continuing our tests to confirm that. We can attach this tool to the feedback / DTS ticket if needed, but if the bug is fixed, I guess the Apple kernel team figured out a way to reproduce it themselves.
Topic: App & System Services SubTopic: Core OS Tags:
Apr ’26
Reply to System-wide deadlock in removexattr from revisiond / APFS
Lots of interesting things to check in your answer, thanks! That said, I have one last question (on my side) regarding Endpoint Security. In the past, when I had to investigate deadlocks where ES was involved, we always saw symbols like EndpointSecurityEventManager::es_something and/or EndpointSecurityEventManager::sendSomething, along with something showing that they were waiting for an answer, before being killed by the ES kext itself. (That was for auth events, but I guess it can also happen with notify events when the queue is full and ES is not willing to drop the new events. I think someone from the ES team explained to us some years ago that ES can try to slow down incoming events to give the client a chance to drain the queue a bit…) But here, I don't see anything like that. When EndpointSecurityEventManager appears in the spindump, it's only because it is blocked, like everyone else, on one of these two locks. Is there a known typical scenario where ES can cause this kind of two-lock interlocking without leaving a clear smoking gun?
Topic: App & System Services SubTopic: Core OS Tags:
Mar ’26
Reply to System-wide deadlock in removexattr from revisiond / APFS
"All of the APFS locks tend to be held for very short periods of time, so it's not unusual for work to pile up very quickly. More to the point, all of those other threads are (mostly) irrelevant to the issue. I'd actually be looking for any other reference to compression/decompression or xattrs."
If they are held for a very short amount of time, shouldn't we rarely see other threads waiting for them? That's what I would expect, at least. Yet here we can see that all the other threads are waiting for the whole spindump duration (Num samples: 940 (1-940) / IORWLockWrite & IORWLockRead → 940). I know this counts the number of times the sampler saw these symbols when it sampled the processes (i.e. it doesn't mean this code was running between samples), but I would be surprised if these exact same stacks recurred by chance precisely when each sample was taken: they have most likely been sitting there the whole time.
"Yes, you do. It's defined in IOLocks.h, which maps it to lck_rw_lock_exclusive. However, I wouldn't expect that to lead you anywhere useful."
Yep, I noticed that, but since we see IORWLock... in the stacks, I concluded that IOLOCKS_INLINE wasn't set and that the real IORWLock... functions are used (#define is a preprocessor macro, so there is no reason for these functions to appear in the stack if IOLOCKS_INLINE is set).
"No, not really."
But then I don't understand why all the threads report that the revisiond thread owns the lock, while the revisiond thread's stack seems to say that it wasn't able to acquire it (and so is suspended)… Or, as you said, the blame logic is just wrong, everyone is pointing at this revisiond thread by mistake, and revisiond is simply blocked on someone else, like everyone else. I'll let the OP answer the other points.
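The sampling argument in this reply can be made concrete with a rough back-of-the-envelope sketch. It assumes (and this is only an assumption, since spindump samples at a fixed interval) that each sample is an independent observation, and that a thread spends some fixed fraction of its time blocked on the lock. The function name and the candidate fractions are hypothetical; only the sample count of 940 comes from the spindump quoted above.

```python
def prob_all_samples_blocked(blocked_fraction: float, num_samples: int) -> float:
    """Probability that every one of num_samples independent samples
    catches the thread blocked, if it is blocked blocked_fraction of the time."""
    if not 0.0 <= blocked_fraction <= 1.0:
        raise ValueError("blocked_fraction must be between 0 and 1")
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    return blocked_fraction ** num_samples


NUM_SAMPLES = 940  # "Num samples: 940 (1-940)" from the spindump

if __name__ == "__main__":
    # Hypothetical fractions of time spent waiting on the lock.
    for fraction in (0.5, 0.9, 0.99, 0.999):
        p = prob_all_samples_blocked(fraction, NUM_SAMPLES)
        print(f"blocked {fraction:.1%} of the time -> P(940 of 940 samples) = {p:.3g}")
```

Under that independence assumption, even a thread that is blocked 99% of the time would be very unlikely to appear blocked in all 940 samples, which fits the reading that these threads were stuck for the whole capture rather than repeatedly hitting a short-lived lock.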
Topic: App & System Services SubTopic: Core OS Tags:
Mar ’26
Reply to System-wide deadlock in removexattr from revisiond / APFS
The IORWLockWrite stack seems to end in machine_switch_context, i.e. the lock is owned by another thread, so the current thread is suspended (yielding to another one) while it waits for the lock to become available again. But that is inconsistent with all the other threads reporting "blocked by krwlock for writing owned by revisiond [426] thread 0xc0616d" (it can't be the owner and not the owner at the same time…). Is it possible for machine_switch_context to be called even if the thread did get ownership of the lock? In what kind of scenario? The stack doesn't seem to say, and we don't have the source code of IORWLockWrite. It's as if something suspended the revisiond thread in the kernel while it was executing IORWLockWrite, and this "something" is now unable to resume it because it is itself blocked (on this same lock?). But then that doesn't line up with the machine_switch_context symbol in the stack.
Topic: App & System Services SubTopic: Core OS Tags:
Mar ’26
Reply to System-wide deadlock in removexattr from revisiond / APFS
Thank you for the confirmation, and thank you for the help! We will try the different solutions you pointed out to work around this on macOS < 26.4.
Topic: App & System Services SubTopic: Core OS Tags:
Apr ’26
Reply to Bluetooth in Daemon program Not Working on Big Sur.
Eddie Hua, just for information: we have the same problem. We opened a feedback ticket as requested by Gualtier Malde (FB9297838).
Topic: App & System Services SubTopic: Core OS Tags:
Jul ’21