My EndpointSecurity Client process is kicked by OS on Mac sleep/wake cycle

Hi, I develop an ES client applying rule-engine evaluating ES events (mostly File-system events).

It is a bit non-standard not being deployed as a System-Extension, but rather as a global daemon.

On some Macs, I sometimes see "crash reports" for the ES process, all sharing

Termination Reason: Namespace ENDPOINTSECURITY, Code 2 EndpointSecurity client terminated because it failed to respond to a message before its deadline

All of these happen not while normal Mac usage, but rather right at Mac wakeup time after sleep.

My guess is, some ES_AUTH events (with deadline) arrive when Mac goes to sleep, and somehow my high-priority dispatch_queue handling them is "put to sleep" mid processing them, so when the Mac wakes up - event handling continues long after the deadline passed, and MacOS decides to kick the process.

Questions:

  • What is the recommended behavior with ES vs Sleep/Wake cycles? (we're not an antivirus, and we don't care much to clear events or go "blind" for such time)
  • Can I specify somewhere in the info.plist of my bundle (this is built like an App) that my process should't be put to sleep, or that the OS should sleep it only when it becomes idle, or some other way tells the OS it is "ready for sleep" ?
  • If not -- How do I observe the scenario so I can suspend my event handling IN TIME and resume on wake?

Thanks!

Answered by DTS Engineer in 877261022

It is a bit non-standard not being deployed as a System-Extension, but rather as a global daemon.

This is actually pretty common.

So, let me start here:

My guess is, some ES_AUTH events (with deadline) arrive when Mac goes to sleep, and somehow my high-priority dispatch_queue handling them is "put to sleep" mid-processing them, so when the Mac wakes up - event handling continues long after the deadline passed, and MacOS decides to kick the process.

No, I don't think that's what's going on, at least not in the basic case. The system that "feeds" events into your ES client as a kernel extension feeding events into user space and it will have kept your ES client active long enough that we "cleared" all syscall activity through your client before we allowed the system to sleep. There are lots of places where things can go wrong, but I don't think the core event delivery system is the issue.

That leads back to here:

Hi, I develop an ES client applying rule-engine evaluating ES events (mostly File-system events).

What does that engine actually "do"? In particular:

  • Is it running inside your daemon or in a helper process?

  • What if ANY system APIs (file system related or not) does it call in the process of evaluating its rules?

  • What logic have you put in place to ensure that you're processing requests quickly and, if necessary, short-circuiting your normal processing to ensure you complete every request?

I suspect what's actually going on here is that some part of the processing you're doing ends up calling into a system daemon which either isn't yet awake (because you're processing an event during very early wake) or already asleep (because you're very late in the sleep process).

One thing to be aware of about the termination here:

On some Macs, I sometimes see "crash reports" for the ES process, all sharing Termination Reason: Namespace ENDPOINTSECURITY, Code 2 EndpointSecurity client terminated because it failed to respond to a message before its deadline

First off, if possible, please post the crash logs as I'd like to see what they actually show.

Next, I'm not sure how much you can "trust" the timing data of the log itself. I'd need to look at the code in more detail, but I suspect the termination logic works something like this:

  1. The ES system decides you've missed a deadline.
  2. The ES system "cuts" your daemon out of the approval loop, allowing "normal" system operation to proceed.
  3. The ES system sends the termination request to kill your app.
  4. The request is processed, killing your app.

The problem here is that if you're late enough in the sleep process, it's entirely possible (fairly likely even) that the system will actually go to sleep somewhere after #2 bug before #3/#4. In that case, the log will show that you terminated after wake (because that's what happened) even though the termination decision (and the problem you need to fix) actually happened before sleep.

Unfortunately, this may also render the crash log itself fairly useless, as your daemon may have proceeded past the failure point without realizing anything went wrong. If you haven't looked at this already, this is where doing your own "manual" deadline tracking can be helpful— in production, you can use it to prevent catastrophic failure, but in debugging, you can also use it to force a crash closer to the point "real" failure occurred.

Finally, you can also use a much shorter deadline in your own tracker. MANY ES failures actually start as significant slowdowns in event processing but only terminate when the right circumstances make the slow down big enough. Those earlier slowdowns are exactly the same underlying failure, but are often easier to follow and debug because "less" is going on.

Can I specify somewhere in the info.plist of my bundle (this is built like an App)?

As a minor side note, good, that is an excellent choice. At this point, I'm basically convinced that ALL daemons should be inside an app bundle. At a minimum, it lets you make the component more "attractive" (custom icon, localized names, better "Get Info") and there's a good chance it will prevent problems (now and in the future).

That leads to here:

That my process should't be put to sleep, or that the OS should sleep it only when it becomes idle, or some other way tells the OS it is "ready for sleep" ?

If not -- How do I observe the scenario so I can suspend my event handling IN TIME and resume on wake?

Unfortunately, this is the wrong way to think about this (and most ES issues). ES clients are event processing "engines" and they fail when defects in that event processing logic prevent events from flowing properly. The trap here is focusing on the direct cause ("the system when to sleep") instead of trying to figure out the underlying defect ("what disrupted engine processing"). Focusing on specific causes can easily lead to ongoing series of increasingly weird edge cases as the system finds new and interesting ways to trip over the underlying defect.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

It is a bit non-standard not being deployed as a System-Extension, but rather as a global daemon.

This is actually pretty common.

So, let me start here:

My guess is, some ES_AUTH events (with deadline) arrive when Mac goes to sleep, and somehow my high-priority dispatch_queue handling them is "put to sleep" mid-processing them, so when the Mac wakes up - event handling continues long after the deadline passed, and MacOS decides to kick the process.

No, I don't think that's what's going on, at least not in the basic case. The system that "feeds" events into your ES client as a kernel extension feeding events into user space and it will have kept your ES client active long enough that we "cleared" all syscall activity through your client before we allowed the system to sleep. There are lots of places where things can go wrong, but I don't think the core event delivery system is the issue.

That leads back to here:

Hi, I develop an ES client applying rule-engine evaluating ES events (mostly File-system events).

What does that engine actually "do"? In particular:

  • Is it running inside your daemon or in a helper process?

  • What if ANY system APIs (file system related or not) does it call in the process of evaluating its rules?

  • What logic have you put in place to ensure that you're processing requests quickly and, if necessary, short-circuiting your normal processing to ensure you complete every request?

I suspect what's actually going on here is that some part of the processing you're doing ends up calling into a system daemon which either isn't yet awake (because you're processing an event during very early wake) or already asleep (because you're very late in the sleep process).

One thing to be aware of about the termination here:

On some Macs, I sometimes see "crash reports" for the ES process, all sharing Termination Reason: Namespace ENDPOINTSECURITY, Code 2 EndpointSecurity client terminated because it failed to respond to a message before its deadline

First off, if possible, please post the crash logs as I'd like to see what they actually show.

Next, I'm not sure how much you can "trust" the timing data of the log itself. I'd need to look at the code in more detail, but I suspect the termination logic works something like this:

  1. The ES system decides you've missed a deadline.
  2. The ES system "cuts" your daemon out of the approval loop, allowing "normal" system operation to proceed.
  3. The ES system sends the termination request to kill your app.
  4. The request is processed, killing your app.

The problem here is that if you're late enough in the sleep process, it's entirely possible (fairly likely even) that the system will actually go to sleep somewhere after #2 bug before #3/#4. In that case, the log will show that you terminated after wake (because that's what happened) even though the termination decision (and the problem you need to fix) actually happened before sleep.

Unfortunately, this may also render the crash log itself fairly useless, as your daemon may have proceeded past the failure point without realizing anything went wrong. If you haven't looked at this already, this is where doing your own "manual" deadline tracking can be helpful— in production, you can use it to prevent catastrophic failure, but in debugging, you can also use it to force a crash closer to the point "real" failure occurred.

Finally, you can also use a much shorter deadline in your own tracker. MANY ES failures actually start as significant slowdowns in event processing but only terminate when the right circumstances make the slow down big enough. Those earlier slowdowns are exactly the same underlying failure, but are often easier to follow and debug because "less" is going on.

Can I specify somewhere in the info.plist of my bundle (this is built like an App)?

As a minor side note, good, that is an excellent choice. At this point, I'm basically convinced that ALL daemons should be inside an app bundle. At a minimum, it lets you make the component more "attractive" (custom icon, localized names, better "Get Info") and there's a good chance it will prevent problems (now and in the future).

That leads to here:

That my process should't be put to sleep, or that the OS should sleep it only when it becomes idle, or some other way tells the OS it is "ready for sleep" ?

If not -- How do I observe the scenario so I can suspend my event handling IN TIME and resume on wake?

Unfortunately, this is the wrong way to think about this (and most ES issues). ES clients are event processing "engines" and they fail when defects in that event processing logic prevent events from flowing properly. The trap here is focusing on the direct cause ("the system when to sleep") instead of trying to figure out the underlying defect ("what disrupted engine processing"). Focusing on specific causes can easily lead to ongoing series of increasingly weird edge cases as the system finds new and interesting ways to trip over the underlying defect.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thank you so much for the elaborate and helpful reply. I will try to address each point, to try and see how to "close in" on my problem. I will also attach a sample crash-report.

  1. My "rule engine" is very fast, based on NSPredicates over ObjC custom classes (No core-data etc.). Rule evaluation is completely internal to my daemon, no API calls and no external calls.

  2. There IS , however a stage prior to rule evaluation, where I Parse ES messages into my custom ObjC class, and in that stage there are few API calls one of which has given me grief in the past. I need to decide whether a file-system object (from the ES message) is a "Document", in my own semantics. A document is any "normal file", or -- "package" directories. If the file-system object exists and is a directory, and it has a file-name extension, I use something like:

_isDocument = [[NSWorkspace sharedWorkspace] isFilePackageAtPath:_filePath ];

If it is non-existing (to be created etc.) Then I use this heuristics and API:

{  // non-existing directory object about to be created...
NSArray<UTType *> *types = [UTType typesWithTag:_filenameExtension tagClass:UTTagClassFilenameExtension conformingToType:UTTypePackage];
switch (types.count) {
  case 0: _isDocument = NO;   break; // no package type with this filename extension known to LaunchServices.
  case 1: _isDocument = ! [[types firstObject].identifier hasPrefix:@"dyn."];    break;
  default:_isDocument = YES;  break;
}

Both APIs boil down to LaunchServices calls, that MAY be extenal, and which notoriously "hang" for long periods sometimes. However - these things almost NEVER happen to my ES daemon in normal use (no such crash-reports from 100K+ deployments) but around Mac sleep cycles, this DOES happen.

  1. I DO NOT currently employ an active mechanism to "short-circuit" my evaluation of events. What I have instead is:
  • Lots of "white-listing" (large list of "muted" binaries and processes for ES. Events I never receive.
  • If I receive an event with <50 milliseconds deadline, I "give up" and authorize immediately.
  • I subscribe to small set (8) of file-system events, that usually come with 30-60 sec deadlines. mostly open/create. Never read/write. Of course I see much smaller deadlines, but I NEVER saw in my logs a "give up" because of tight deadline.
  • My rules-engine has been tested with at least million events per sec, and it is indeed very fast.

I will be happy to share with you parts of my code, but I cannot post publicly - this is corporate proprietary. Please let me know any private e-mail or other method I can send you relevant parts.

Thank you so much!

Splitting for length, part 1...

I will also attach a sample crash report.

So, let me start there. All three logs have threads blocked in dispatch_sync, with the worst case having 35 threads blocked in this code chunk waiting for another thread to clear the block:

6   libdispatch.dylib             	       0x7ff80f50468e _dispatch_sync_f_slow + 175 
7   com.proofpoint.itm.prevention 	       0x10a102a81
8   com.proofpoint.itm.prevention 	       0x10a0ff5cb
9   com.proofpoint.itm.prevention 	       0x10a0ff482
10  com.proofpoint.itm.prevention 	       0x10a0c5716
11  libdispatch.dylib             	       0x7ff80f4f76b9 _dispatch_call_block_and_release + 12

The other two logs are that extreme, but both of them also show blocks in dispatch_sync and one of them also shows the same contentions (though not as extreme). I have some general comments below, but the critical next step here is to fully symbolicate these logs and figure out what your client is actually "doing". You're clearly bottlenecking on "something" and figuring out what that is is obviously an important first step.

I obviously don't know the context of that block, but in general, my experience has been that dispatch_sync usage is a fundamental anti-pattern. It has some limited utility as a quasi-lock API[1], but when used as such, it MUST be treated as a simple construct used to protect a particular resource, NOT on the execution context you actually "do" work on. Putting that other way, if you wouldn't put a mutex or an os_unfair_lock around something, then you shouldn't be using dispatch_sync.

Outside of that very limited role, it leads to the kind of issue you're seeing in those logs— a whole bunch of threads twiddling their thumbs waiting on another thread.

That leads to my larger point, which is that in my experience, NSOperationQueue is a better API "foundation" to build on the GCD, even in an ES client[2]. It provides basic functionality (for example, cancellation and a base class to build work objects on) that GCD lacks and, most critically here, it allows you to restrict the volume of active work. I had an extensive conversation about this on this thread, but I'd particularly highlight this post as an example of how GCD can ACTIVELY make code MUCH slower.

[1] Basically, it provides a lock you can easily defer work to (using dispatch_async), orders work, and which you can't forget to unlock. That isn't necessarily what you want (there's a reason why os_unfair_lock is our preferred locking API), but it can be useful

[2] I could actually make strong arguments for both sides of this choice, but choosing to build on Objective-C means you're already invalidating most of the best "pro-GCD" arguments.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Splitting for length, part 2...

Moving on to specifics:

I need to decide whether a file-system object (from the ES message) is a "Document", in my own semantics.

As a general comment, you should be aware that relying on data read from the file system opens you up to a variety of TOCTU attacks. That may not be avoidable, but the risk is quite real.

A document is any "normal file", or -- "package" directories. If the file-system object exists and is a directory, and it has a file-name extension, I use something like:

How is your ES client being run? Assuming you're running as a daemon, then you should not be calling NSWorkspace, as AppKit is definitely NOT daemon safe. In this particular case, it's easy to replace as isFilePackageAtPath is actually just a trivial wrapper around NSURL.getResourceValue:forKey: which is checking NSURLIsPackageKey. However, I'd strongly recommend taking a look at any other usage of NSWorkspace.

If it is non-existing (to be created etc.) Then I use this heuristic and API:

Have you implemented any of your own caching for this data? The system is also doing its own caching, but the worst case here does mean a call into LS. That's likely avoidable, since file name extensions aren't THAT variable.

Both APIs boil down to LaunchServices calls, that MAY be external, and which notoriously "hang" for long periods sometimes.

Yes, that makes sense. System sleep isn't necessarily an "instantaneous" process and, for obvious security reasons, your ES client needs to be one of the last components to sleep. If it ends up waiting on the wrong thing, then bad things are going to happen.

Next, I have a comment on these two points:

However - these things almost NEVER happen to my ES daemon in normal use (no such crash-reports from 100K+ deployments) but around Mac sleep cycles, this DOES happen.

...

  • My rules-engine has been tested with at least a million events per sec, and it is indeed very fast.

The problem here is that basically "all" ES problems start with the description "My ES client works fine except for <insert edge case>". That's not necessarily due to any inherent success or quality, but simply because the ES client and the broader system actually spend most of their time doing "basically the same thing" under "basically the same conditions".

Validating a few other details:

  • Lots of "white-listing" (large list of "muted" binaries and processes for ES. Events I never receive.

Are you returning "cache=true" on your auth responses? There's a LOT of work duplication in many apps, so allowing the result to be cached and greatly reduce event volume.

-If I receive an event with <50 milliseconds deadline, I "give up" and authorize immediately.

One thing to keep in mind here is that the deadline itself "started" at a constant time, so a shrinking deadline indicates that some kind of backlog is delaying events into your client. I'd probably consider increasing that value for general “safety," but I'd also be looking at changes in its value as a way to understand my clients’ own activity.

I subscribe to a small set (8) of file-system events.

Do you subscribe to notify as well as auth events? If so, then I'd recommend using separate clients for the two event types, as described here.

I will be happy to share with you parts of my code, but I cannot post publicly - this is corporate proprietary. Please let me know any private e-mail or other method I can send you relevant parts.

If there's something you'd like me to take a look at privately, you can file a Code-level support request on this page. Make sure you include a link back to your post in the request, and you can also mention my name to streamline routing.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

My EndpointSecurity Client process is kicked by OS on Mac sleep/wake cycle
 
 
Q