Zsh kills Python process with plenty of available VM

On a MacBook Pro (16 GB of RAM, 500 GB SSD, macOS Sequoia 15.7.1, M3 chip), I am running some Python 3 code in a conda environment that requires lots of RAM. Sure enough, once physical memory is almost exhausted, swapfiles of about 1 GB each start being created, which I can see in /System/Volumes/VM. This folder has about 470 GB of available space at the start of the process (I can see this through Get Info). However, once about 40 or so swapfiles have been created, for a total of about 40 GB of virtual memory occupied (and thus still plenty of available space in VM), zsh kills the Python process responsible for the RAM usage (notably, it does not kill another Python process using only about 100 MB of RAM). The message received is "zsh: killed" in the tmux pane where the process's logging is printed.

All the documentation I was able to consult says that macOS is designed to use up to all available storage on the startup disk for swapping when physical RAM is not enough (the startup disk is the one in question, since I have only one disk, and the available space mentioned above reflects this). Then why is the process killed long before the swap area is exhausted? In contrast, the same process on a Linux machine (a basic Python venv there) just keeps swapping, and never gets killed until the swap area is exhausted.

One last note: I do not have administrator rights on this device, so I could not run dmesg to retrieve more precise information; I can only check with df -h how the swap area increases little by little. My employer's IT team confirmed that they do not interfere with memory usage on managed profiles, so macOS is just doing its thing.
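For what it's worth, swap usage can also be read without admin rights via `sysctl vm.swapusage`. A small sketch that reads it from Python (the parsing helper and its name are my own illustration; the assumed output format is the usual `vm.swapusage: total = … used = … free = …` line):

```python
import re
import subprocess
import sys

def parse_swapusage(line: str) -> dict:
    """Parse `sysctl vm.swapusage` output into byte counts, e.g.
    'vm.swapusage: total = 2048.00M  used = 1017.56M  free = 1030.44M  (encrypted)'
    """
    units = {"K": 2**10, "M": 2**20, "G": 2**30}
    return {name: int(float(num) * units[unit])
            for name, num, unit in re.findall(r"(\w+) = ([\d.]+)([KMG])", line)}

if __name__ == "__main__" and sys.platform == "darwin":
    out = subprocess.run(["sysctl", "vm.swapusage"],
                         capture_output=True, text=True).stdout
    print(parse_swapusage(out))
```

Polling this in a loop gives the same picture as watching the swapfiles appear in /System/Volumes/VM.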

Thanks for any insight you can share on this issue. Is it a known bug (perhaps with conda/Python environments), or is it expected behaviour? Is there a way to keep the process from being killed?

Answered by DTS Engineer in 874800022

Accepted Answer

I see, thank you for pointing this out. So it is not a percentage, but an actual number of pages. Could you expand a little on how to interpret <overcommit pages> in your previous answer?

So, stepping back for a moment, the basic issue here is deciding "when should the kernel stop just blindly backing memory". It COULD (and, historically, did) just limit that to total available storage; however, in practice, that just means the machine grinds itself into a useless state without actually "failing". So, what macOS does is artificially limit the VM system to ensure that the machine remains always in a functional state.

The next question then becomes "how to implement that limit". There are lots of places you COULD limit the VM system, but the problem is that the VM system is complicated enough that many obvious metrics don't really work. For example, purgable memory[1] means that simply counting dirty pages doesn't necessarily "work" - a process could have a very large number of dirty pages, but if they're all purgable, they shouldn't really "count", since they'll never be written to disk. Similarly, memory compression means that there can be a very large difference between the size of memory and the size that's actually written to disk.

[1] Purgable is a Mach memory configuration which tells the VM system that the pages should be discarded instead of swapped; clients then lock/unlock the pages they actively work with.

All of those issues mean that the check ends up being entangled with the memory compression system. More specifically, I think the actual limit here is "how much memory the compression system will swap to disk". You could set it to "none", at which point you basically end up with how iOS works. Memory compression still occurs, but we terminate processes instead of swapping data out.

In any case, all of this basically means that setting that to a bigger number means we'll swap more data to disk.

How does one find the available range?

I don't think there is any specific range as such. The ultimate upper limit would be available storage, but that's already inherently dynamic (because the rest of the system can be eating storage), so the kernel already has to deal with that anyway.

What does it mean to overcommit pages?

As general terminology, overcommit just refers to the fact that the VM system is handing out more memory than it actually "has". In this particular case, I think it's just "borrowing" the word to mean how much memory will the compression system use beyond its normal range of physical memory... which translates to how much memory it will swap to disk.

Ideally, I would try to get as close as possible to a memory overcommitment scenario. Would this correspond to an "infinite" number of overcommitted pages?

To be clear, you're already overcommitting— that's how a machine with 16 GB of RAM is running a process that's using 40 GB of memory. You want to overcommit more.

Also, to be clear, I think you also need to think through what "infinite" here actually means. In real-world usage, infinite overcommit just means you're enabling swap death. There are limited cases where increasing memory usage won't cause that, but all of those cases are inherently somewhat broken. Case in point, my test tool above (on its own) won't really cause swap death— it consumes memory and completely ignores it, which allows the VM system to stream it to disk... and then ignore it too. The problem is real apps don't really work that way— the point of allocating memory is to "use it".

Is there a way to enter "infinite" in this parameter?

I don't think so. As a practical matter, this boot arg mostly exists to let the kernel team experiment with different scenarios, so "infinite" isn't really all that useful or necessary. If you really wanted to test that scenario, you'd just pass in a number larger than available storage.

Or there is a maximum number, which can change from machine to machine?

I don't think so. I believe this is just one constraint among many, so if you pass in a "large enough" number then those other constraints (like available storage) will determine what actually happens. You can easily see the reverse of this today— if you fill up your drive enough, you'll quickly see that the system won't let you use 40 GB of memory.

If I am interpreting correctly the direction in which this parameter has to move in order to get the desired behaviour, I need to retrieve this number, not just compute it roughly via the known 4 KB size of a page and the capacity of the disk.

FYI, the page size today is actually 16KB, not 4KB.
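With 16 KB pages, the rough computation mentioned above is simple arithmetic. A quick sketch, assuming the limit really is interpreted as a count of pages (as discussed earlier in this thread):

```python
PAGE_SIZE = 16 * 1024  # 16 KB pages on Apple silicon

def pages_for(target_bytes: int, page_size: int = PAGE_SIZE) -> int:
    """Number of pages needed to cover target_bytes (ceiling division)."""
    return -(-target_bytes // page_size)

# A 100 GiB swap budget, for example:
print(pages_for(100 * 2**30))  # 6553600 pages
```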

I don't see why you'd need to be that specific. Honestly, I'd probably just pick a number and then use my test tool to see what happens. The main risks here are:

  1. A really small number rendering the machine unusable, due to a lack of "usable" memory. I don't think this is actually possible, but it’s easy to avoid by just picking a really big number.

  2. A big number creating increased risk of swap death due to excessive overcommit.

Both of those risks are "real", however, they're also relatively easy to control for. Just minimize what you actually "do" until you figure out how the boot arg has altered the system’s behavior.

Say the maximum number is 1200 pages. From the documentation,

First off, just to be clear, this is well outside the "documented" system. It isn't really secret (after all, the code is open source), but I don't want to give you the impression that this is something I'm really recommending. Notably, this isn't something I would ever change on another person’s machine or in some kind of broad deployment. It WILL create problems that would otherwise not occur.

This is also why my answers below are somewhat vague— if you're not comfortable testing and experimenting with this yourself, then I'm not sure this is something you should be messing around with.

I am supposed to boot into recovery mode, disable SIP, run sudo nvram boot-args="vm_compression_limit=1200", and then restart to make the changes effective.

Haven't tried it, but sure, that sounds right.

Do I need to keep SIP disabled, or can I re-enable it after the changes take effect?

I don't know, that's something you'd need to test yourself.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Just adding a quick follow-up, in case you have some other ideas: I tested with many values of vm_compression_limit, from 0 up to 10^12, but the behaviour when the VMM kills a process is not affected: when the swap size reaches approximately 44GB, the process gets killed.

I had a chance to play with this today and I think you're just setting it wrong. You need to set this as a boot argument, so the nvram command looks like this:

nvram boot-args="debug=<existing value> vm_compression_limit=4000000000"

With that configuration, I got to ~130GB of memory usage and ~100GB of swap before the process was terminated. You can use "nvram -p" to print the full list of firmware variables, find your existing "boot-args" value, then insert it back into the command above to preserve it.
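Since boot-args is a single space-separated string of key=value pairs, preserving an existing value while adding a new pair is just string manipulation. A small sketch (the helper name is my own; run the resulting nvram command manually, with SIP disabled):

```python
def merge_boot_args(existing: str, key: str, value: str) -> str:
    """Merge key=value into a space-separated boot-args string,
    replacing any previous setting of the same key."""
    pairs = [p for p in existing.split() if not p.startswith(key + "=")]
    pairs.append(f"{key}={value}")
    return " ".join(pairs)

# e.g. with an existing value of "debug=0x104c0c":
print(merge_boot_args("debug=0x104c0c", "vm_compression_limit", "4000000000"))
# -> debug=0x104c0c vm_compression_limit=4000000000
# then: sudo nvram boot-args="debug=0x104c0c vm_compression_limit=4000000000"
```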

Note: It's also possible that boot-args doesn't exist yet; alternatively, you can choose to simply overwrite the value.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thank you, that's interesting, do you also have Sequoia on your machine? When I first ran nvram -p, it showed no boot-args existing on my machine. So I simply ran

sudo nvram boot-arg="vm_compression_limit=4000000000"

without the debug part, which would be unnecessary if there is nothing to override. When subsequently trying different values, I would remove the boot-arg previously set with

sudo nvram -d boot-args

and then rerun the previous command with another value. Each time, double-checking with nvram -p showed the boot-arg set with the correct value, so I guess doing it this way is equivalent to using debug. Or is debug compulsory? It does not appear in the documentation (https://ss64.com/mac/nvram.html). I could give it a try, but what should I put as the value for debug if no boot-args are shown by nvram -p?

and then rerun the previous command with another value. Each time, double-checking with nvram -p showed the boot-arg set with the correct value, so I guess doing it this way is equivalent to using debug. Or is debug compulsory?

No, it's not, sorry for the confusion. The "boot-args" value is structured as a space-separated series of "<key>=<value>" pairs, and I happened to have "debug" set, as it's the common configuration for kernel development. You can just drop it.

Thank you, that's interesting, do you also have Sequoia on your machine? When I first ran nvram -p, it showed no boot-args existing on my machine.

I do not, but yes, I believe "boot-args" would not be defined in the default configuration. I don't actually have a machine at hand that doesn't have "debug" set, which is why I didn't think of that.

However, note that you dropped the "s" from boot-args in the command you posted above. That is a problem, as nvram is basically maintaining a key/value store, so it will happily add "boot-arg" alongside "boot-args"... which the rest of the system will then ignore.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Apologies, the missing 's' was a typo; I double-checked in my terminal's history, and when typing there I was correctly using 'args'. Could it be that Sequoia has a bug preventing the boot-arg from being effective on the VMM's behaviour?

Could it be that Sequoia has a bug preventing the boot-arg from being effective on the VMM's behaviour?

No. I just tested this on an older iMac running Sequoia and I was able to get to ~94 GB. That was the point I ended the test as the machine was entering swap death and it had already taken 10m to get to that point.

However...

...effective on the VMM's behaviour?

Are you testing this in a VM? If so, then even though I'm sure the "nvram" command itself will work, you may need to specify boot arguments through the VM's configuration.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thank you so much for trying. By VMM I meant the Virtual Memory Manager; I am doing this on an actual MacBook Pro, not a VM (virtual machine). Then the source of the issue must be either that this Mac, which is managed by the company I work for, somehow does not implement boot-args properly due to their management software, even with SIP disabled (they did receive an alert when I disabled it); or that Python itself is somehow responsible for being capped so early. The Python script I used is quite simple, so that per se should not be the problem: just appending arrays of 200 MB of random numbers to a list.
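For reference, a minimal sketch of the kind of test script described (not the exact code used; the function name and parameters are mine):

```python
import os
import sys

def consume_memory(chunk_mb=200, max_chunks=None):
    """Keep appending chunks of random bytes to a list until terminated.
    os.urandom output is effectively incompressible, so the memory
    compressor cannot absorb it and every page must really hit swap."""
    chunks = []
    while max_chunks is None or len(chunks) < max_chunks:
        chunks.append(os.urandom(chunk_mb * 1024 * 1024))
        print(f"allocated ~{len(chunks) * chunk_mb / 1024:.1f} GB")
    return chunks

if __name__ == "__main__" and "--run" in sys.argv:
    consume_memory()  # runs until the VM system kills the process
```

Using random (incompressible) data is deliberate: compressible data would mostly be absorbed by the memory compressor, and the swapfiles would grow far more slowly.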

A third option could be that I am doing something wrong. Just to double-check: I disable SIP from recovery mode (I read somewhere online that this needs to be done; do you do it also? I was not able to change boot-args with SIP on). Then I log into normal mode, with SIP kept off, and set the boot-args with

sudo nvram boot-args="vm_compression_limit=4000000000"

Then I check that

nvram -p | grep boot-args

reflects the changes. Once the changes display correctly, the machine should behave according to them, and I run the aforementioned test Python script. Do you do anything different (besides the checks, of course, and the test script/language used)?

I have read that to make the boot-args changes permanent, one needs to reboot the machine. However, I do not need them to be permanent for now, as I am just testing; and as far as the current session is concerned, they should already be reflected in the VMM's behaviour during the test, without having to reboot (by the way, just to be safe, I did also try rebooting, but the behaviour was the same).

Soon I will also get my hands on an older Intel Mac which is not managed by the company, and I will be able to check whether the boot-args change the VMM behaviour correctly on that one. If so, then it must be some hidden block on the managed device that the IT person I spoke to is not aware of, as they said that whatever restrictions they put on the machine should not interfere with its proper behaviour, and I suppose honouring tunable boot-args is proper behaviour.

For comparison, this command on my machine:

nvram -p | grep boot-args

Returns:

boot-args	debug=0x104c0c vm_compression_limit=4000000000

The "debug" entry is not required; that just happens to be how my machine is configured.

Once the changes display correctly, the machine should behave according to these, and I run the test Python script aforementioned. Do you do anything different (besides the checks, of course, and the test script/language used)?

Have you not been rebooting after you set the value? It looks like I wasn't entirely clear on this, but the "boot-args" values are quite literally "the arguments passed into the kernel when it boots up". Many of them, including this one, are used to set critical constants that are then used to define the system's wider behavior, often in ways that simply cannot be dynamically modified. At a purely technical level, I'm not sure ANY of them actually apply dynamically; in general, the system's relationship to pre-boot state (like firmware variables) is that they're pushed "into" the system as part of the initial bootup process and then... basically ignored.

Note that this is largely intentional, as it means that the system doesn't really need any complicated mechanism for managing dynamic changes; the only time the values can be modified (while the system is fully up and running) is also a time when the entire system is basically "ignoring" all those values.

and as far as the current session is concerned, they should already be reflected by the VMM behaviour during the test, without having to reboot.

No, that is definitely NOT the case. You need to reboot for the changes to have any effect. Along with the general issues above, allowing this limit to dynamically change would introduce weird edge cases (for example, suddenly being forced to terminate large portions of user space because the limit shrank) without any real benefit.

Soon I will also get my hands on an older Intel Mac which is not managed by the company, and I will be able to check if the boot-args change the VMM behaviour correctly on that one.

FYI, I haven't actually tested this on any Intel Mac, though I believe this boot argument is the same.

If that happens, then it must be some hidden block the managed device has, which the IT person I spoke to is not aware of, as they said whatever restrictions they put on the machine, it should not interfere with its proper behaviour, and I suppose reflecting tunable boot-args is proper behaviour.

This isn't really something for which you could create a "block" like you're describing. That is, the ONLY mechanism that could prevent boot-args from being applied is to prevent them from being set at all (which SIP does). The thing to understand here is that at the point the kernel is processing boot-args, basically "none" of the "normal" system actually "works". Standard resources like the file system or network aren't "available" because NONE of the infrastructure those resources require... actually exists. That dynamic is why firmware variables (and the nvram tool to modify them) exist in the first place, as firmware variables are the ONLY input source available during early boot.

All of that means that any kind of setting/modification to early boot behavior must go through those variables as there simply isn't anywhere else to "put" that configuration data.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
