Zsh kills Python process with plenty of available VM

On a MacBook Pro (16 GB of RAM, 500 GB SSD, macOS Sequoia 15.7.1, M3 chip), I am running some python3 code in a conda environment that requires lots of RAM. Sure enough, once physical memory is almost exhausted, swap files of about 1 GB each start being created, which I can see in /System/Volumes/VM. This folder has about 470 GB of available space at the start of the process (I can see this through Get Info). However, once about 40 or so swap files are created, for a total of about 40 GB of virtual memory occupied (and thus still plenty of available space in VM), zsh kills the Python process responsible for the RAM usage (notably, it does not kill another Python process using only about 100 MB of RAM). The message received is "zsh: killed" in the tmux pane where the logging of the process is printed.

All the documentation I was able to consult says that macOS is designed to use up to all available storage on the startup disk (which is the one I am using, since I have only one disk, and the available space aforementioned reflects this) for swapping when physical RAM is not enough. Then why is the process killed long before the swapping area is exhausted? In contrast, the same process on a Linux machine (basic Python venv here) just keeps swapping, and never gets killed until the swap area is exhausted.

One last note: I do not have administrator rights on this device, so I could not run dmesg to retrieve more precise information; I can only check with df -h how the swap area increases little by little. My employer's IT team confirmed that they do not mess with memory usage on managed profiles, so macOS is just doing its thing.

Thanks for any insight you can share on this issue. Is it a known bug (perhaps with conda/Python environments) or is it expected behaviour? Is there a way to keep the process from being killed?

All the documentation I was able to consult says that macOS is designed to use up to all available storage on the startup disk (which is the one I am using since I have only one disk, and the available space aforementioned reflects this) for swapping when physical RAM is not enough.

Sure, that's what the system will do. Strictly speaking, it will actually start warning the user and then automatically terminating processes as it approaches "full", but it will basically use "all" available storage.

However...

Then why is the process killed long before the swapping area is exhausted?

...the fact that the system is willing to use "all" available storage doesn't mean that it should let any random process do that. Every process on the system has its own memory limit (both address space and used pages) enforced by the kernel. I'm not sure what the default limit is...

once about 40 or so...

...however, 40 GB doesn't seem like a terrible default. Keep in mind that the point of the default isn't simply to prevent the drive from filling up, but is really about enforcing "reasonable" behavior. Most processes never get anywhere CLOSE to using 40 GB of memory, so in practice, this limit is a lot closer to "how much memory will the system let a broken process pointlessly leak". From that perspective, 40 GB is extremely generous.

In terms of determining the exact size, os_proc_available_memory() will tell you how far from the limit you actually are and is much easier to use than task_info(). I think getrlimit()/setrlimit() (see the man page for more info) would also work, though raising the limit requires superuser privileges.
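
Note that these are in-process C APIs, not shell commands. As a rough sketch only (RLIMIT_AS and RLIMIT_DATA are just example resources, and error handling is trimmed), a small tool that prints its own limits could look like this:

    // Minimal sketch: print this process's address-space and data-segment
    // rlimits. Each process can only query/adjust its own limits this way;
    // os_proc_available_memory() (from <os/proc.h>) is the more direct
    // "remaining budget" query where that API is available.
    #include <stdio.h>
    #include <sys/resource.h>

    static void show(const char *name, int resource) {
        struct rlimit rl;
        if (getrlimit(resource, &rl) == 0) {
            printf("%s: cur=%llu max=%llu\n", name,
                   (unsigned long long)rl.rlim_cur,
                   (unsigned long long)rl.rlim_max);
        }
    }

    int main(void) {
        show("RLIMIT_AS", RLIMIT_AS);      // total address space
        show("RLIMIT_DATA", RLIMIT_DATA);  // data segment / heap
        return 0;
    }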

Thanks for any insight you can share on this issue. Is it a known bug (perhaps with conda/Python environments) or is it expected behaviour?

It is very much expected behaviour.

In contrast, the same process on a Linux machine (basic Python venv here) just keeps swapping, and never gets killed until the swap area is exhausted.

Yes. Well, everyone has made choices they're not proud of.

Is there a way to keep the process from being killed?

The limit itself is raisable. Have you tried using ulimit in the shell? Aside from that, I'm not sure whether mapped files[1] are tracked through the same limit, so you might be able to map a 50 GB file even though the VM system wouldn't let you allocate 40 GB.

[1] In practice, mapped I/O is why hitting this limit isn't common. Most applications that want to interact with large amounts of RAM also have some interest in preserving whatever it is they're manipulating.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thank you so much for your reply; now I have a picture of what is going on. Could you also share how to use these functions? The only documentation I could find does not have examples. Say I have, among others, this process running, labelled python3 with PID 33238. I tried writing os_proc_available_memory() in my terminal (bash shell), and all I get is a > prompt awaiting input. Same with getrlimit and setrlimit. I also tried os_proc_available_memory(33238) and so on, but I get error messages. The documentation keeps mentioning 'the current process', but there are many; how do I run these functions relative to a specific ongoing process?

So, I got some more time to look at this today and need to change my answer a bit. Let me actually jump back to here:

All the documentation I was able to consult says that macOS is designed to use up to all available storage on the startup disk (which is the one I am using since I have only one disk and the available space aforementioned reflects this) for swapping, when physical RAM is not enough.

Did you actually see this in our documentation and, if so, where? If any of our "modern" documentation actually says that we do this, then that's something I'd like to clean up and correct.

So, let me go back to the idea here:

macOS is designed to use up to all available storage on the startup disk

That's the "classic" UNIX machine-wide swap and, historically, it's how Mac OS X originally worked. However, the downside of this approach is "swap death" in one of two forms:

  1. "Soft", meaning the system has slowed down to the point that the user is now no longer willing to use the machine, even though it technically still "works".

  2. "Hard", meaning the system's outstanding swap "debt" has become so high that the system can no longer make forward progress, as the time required to manipulate the VM system swamps all other activity.

The distinction I'm drawing here is about recoverability: the "soft" state is recoverable, as the machine is still functional enough that the user can shut down or terminate work, returning the system to normal. The "hard" state is not recoverable, as the system itself has become so buried in VM activity that it can't resolve the issue.

Historically, the slow performance of spinning disks meant that this issue was largely self-regulating, as the machine would become slower and slower in a relatively "linear" way and the user would then push the machine as much as they could tolerate.

That basic concept is well understood, but what's NOT commonly understood is how the dynamics of those failures have changed as hardware and software have evolved.

In particular, two major things have changed over time:

  1. Increasing CPU power meant that it became feasible to compress VM without obvious performance impact, allowing us to store and retrieve higher volumes of physical memory.

  2. SSDs dramatically improved I/O performance, particularly random I/O, allowing the system to "jump around" on physical media in a way it really can't on spinning disks.

Those both dramatically increase the benefit of VM, but they also create new scenarios. Notably:

  • SSDs are fundamentally "consumable" devices, which eventually run out of write cycles. Allowing unbounded swap file usage to destroy hardware is obviously not acceptable.

  • Compression can slow or delay the freeing of memory/swap, since freeing memory can itself require additional memory: the system has to decompress swapped data so that it can dispose of what's being freed, then recompress the data it still needs.

  • The combination of compression and SSD performance makes it possible for the machine to swing VERY suddenly from operating normally into swap death with very little notice or warning.

Expanding on that last point, sequences like this become possible:

  • The user uses a high memory application, which builds up significant memory use.

  • The user backgrounds that application and moves on to other work for an extended period of time. As a result, "all" of that application’s memory is compressed and streamed out to disk.

  • The user switches back to the app and immediately starts trying to interact with "all" of its memory.

Under very high load, that sudden usage swing can actually overload the entire machine, as it's trying to simultaneously "flip" the entire state of the system. Critically, the difference here isn't that it can't happen on a spinning disk (it can); it's that the slower performance of a spinning disk meant it was far less likely to happen "suddenly".

So, let's go back to here:

however, once about 40 or so swap files are created, for a total of about 40GB of virtual memory occupied

What's actually going on here is that the system has artificially capped the amount of swap space it's willing to use. The cap is derived from total RAM (more RAM, higher limit), and I suspect you have a 16 GB machine, as I hit exactly the same limit on my machine. However, the limit isn't tied to the amount of memory you allocate; it's tied to the amount of swap actually being used. In my test app, I hit the limit at ~45 GB when using rand() to fill memory, but was able to go to ~88 GB when I filled with "1"s. Better compression meant more memory.
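
A fill-memory test of that sort can be as simple as the following sketch (the 256 MB chunk size and loop structure here are arbitrary choices for illustration): filling with rand() produces essentially incompressible pages, while filling with a constant compresses extremely well. You can watch the swap files accumulate in /System/Volumes/VM (or with df -h) while it runs.

    // Allocate and touch memory until the system steps in. Run with any
    // argument to fill with a constant (highly compressible) instead of
    // rand() (essentially incompressible).
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define CHUNK (256ull * 1024 * 1024)       // 256 MB per allocation

    int main(int argc, char **argv) {
        int compressible = (argc > 1);
        for (unsigned long long total = 0; ; total += CHUNK) {
            unsigned char *p = malloc(CHUNK);
            if (p == NULL) { perror("malloc"); break; }
            if (compressible) {
                memset(p, 1, CHUNK);           // compresses extremely well
            } else {
                for (size_t i = 0; i < CHUNK; i++) p[i] = (unsigned char)rand();
            }
            printf("touched %llu GB\n", (total + CHUNK) >> 30);
        }
        return 0;
    }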

That then leads back to here:

Is there a way to keep the process from being killed?

No, not when working with VM-allocated memory. The system’s swap file limits are basically fixed, based on total RAM.

However, if you really want to exceed this limitation, the most direct approach would be to allocate 1 or more files, mmap those files into memory, then use that mapped memory instead of system-allocated memory. That bypasses the limit entirely, since the limit is specifically about the system swap file usage, not memory usage. However, be aware that this approach does have the same NAND lifetime issues.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

I see, thank you for the explanation. Yes, my machine has 16 GB of RAM and I read about the VMM at https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/AboutMemory.html#//apple_ref/doc/uid/20001880-BCICIHAB

Is there a guide for macOS on the steps you describe at the end, that is on how to allocate a swapfile, mmap that swapfile into memory, then use that mapped memory instead of system-allocated memory? I am familiar with dd from /dev/zero, etc., and the usual declaration of a swapfile by appending to /etc/fstab, but that is on Linux, and perhaps that does not work on macOS...

That's the "classic" UNIX machine-wide swap

Well, I’d argue that the classic Unix swap involved a swap partition. None of this fancy, newfangled anonymous pager stuff (-:

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

Is there a guide for macOS on the steps you describe at the end, that is on how to allocate a swapfile,

Just to be clear, you're not actually creating a "swapfile" as such. You're mapping a file into memory, which means that, assuming the mapping is MAP_SHARED, the VM system will then use that file as the backing store for that memory range. That gives you basically the "same" behavior as swap-backed system memory, but it isn't ACTUALLY the same as using true "swap" (for example, you'll be writing the data directly "back" to the file, so there won't be any compression or VM-level encryption).

mmap that swapfile into memory,

mmap is a standard Unix API, which means we don't really provide specific documentation for it; however, there is an old code snippet here showing how it works. One thing I will note is that some of the recommendations there aren't really relevant anymore, particularly any recommendation about limiting mapping size. Those concerns were driven by the limited 32-bit address space, but that's not an issue with 64-bit addressing. In any case, here is a rundown of what's involved:

  • Create a file on disk that's as large as you want it to be. Note that you'll want it to be a multiple of 16 KB (the page size) so that you get a one-to-one mapping between VM pages and your file.

  • open the file.

  • pass the file into mmap (note that you'll need to pass in "MAP_SHARED" to create a mapping that writes to disk).

...and the pointer returned by mmap will be the start address of the file you mapped, which you can then use however you want. The man page for mmap does a decent job of covering the details.
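
Put together, a minimal sketch of those steps (placeholder file name, arbitrary 1 GB size) might look like this:

    // Create a fixed-size backing file, map it MAP_SHARED, and use the
    // mapping as ordinary memory. Dirty pages are written back to the file
    // by the VM system rather than to the system swap files.
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        const size_t size = 1ull << 30;        // 1 GB, a multiple of 16 KB
        int fd = open("backing.bin", O_RDWR | O_CREAT, 0600);
        if (fd < 0) { perror("open"); return 1; }
        if (ftruncate(fd, (off_t)size) != 0) { perror("ftruncate"); return 1; }

        // MAP_SHARED makes the file the backing store for this address range.
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        close(fd);                             // the mapping remains valid

        memset(p, 0x42, size);                 // use it like any other memory

        munmap(p, size);
        return 0;
    }

If you don't need the contents to persist, you can unlink the file as soon as it's mapped; the mapping stays usable and the storage is reclaimed once it goes away.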

then use that mapped memory instead of system-allocated memory?

So, what mmap returns to your app is basically just "memory" that you can use like any other memory. Technically, you could build a malloc replacement that used mmap as its backing store; however, the kind of code that hits this limitation tends to be working with fairly large allocations anyway, since it's hard to get to 40+ GB if you're only allocating a few KB at a time. Because of that, it's typically easiest to rearchitect those large allocations around mmap while continuing to use malloc for "normal" work.
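
As an illustration of that split, a hypothetical helper along the lines below (the name map_big_buffer and the temporary-file location are purely illustrative) gives the handful of huge buffers a file-backed mapping while everything else keeps using malloc:

    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    // Returns a zero-filled, file-backed buffer of 'size' bytes, or NULL.
    static void *map_big_buffer(size_t size) {
        char path[] = "/tmp/bigbuf.XXXXXX";
        int fd = mkstemp(path);                // unique temporary backing file
        if (fd < 0) return NULL;
        unlink(path);                          // storage reclaimed automatically
        if (ftruncate(fd, (off_t)size) != 0) { close(fd); return NULL; }
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return (p == MAP_FAILED) ? NULL : p;
    }

    int main(void) {
        double *matrix = map_big_buffer(50ull << 30);  // e.g. one 50 GB array
        char *scratch = malloc(4096);                  // small allocations stay on malloc
        if (matrix != NULL) matrix[0] = 1.0;           // pages are materialized on demand
        free(scratch);
        return 0;
    }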

Well, I’d argue that the classic Unix swap involved a swap partition. None of this fancy, newfangled anonymous pager stuff (-:

I am but a young sapling beneath your ancient oak.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Apologies if I misunderstand things completely; I am no developer, so memory management is completely foreign to me. Are the steps you describe, involving creating a file and mmap'ing it, to be performed in the shell, with the result then applied to a process already running (say, some Python code), or are they supposed to be performed before the Python code is run? In the first case, how do I assign this pointer to the process already running? I assume this will be extra virtual memory on top of what the VMM has already used for the process. In the second case, does the pointer have to be used within the Python script (that is, do I need to modify my Python script in a certain way to tap into this file for virtual memory)? Or, once I have it, do I have to launch the process in a certain way so that it uses it? Feel free to refer me to some resource that explains how to map a file to a process; all I could find, man pages included, was not helpful and does not go into enough detail for my background. There is no mention of the process's PID, so I am quite confused about how to let the process know that it can use this file via the pointer. Further, once this is set up, will the process still use the available physical RAM, or will it only use the file?

One important clarification for my use case, in the eventuality that the script has to be modified to use mmap: the Python process that needs to use the mapped file calls a Gurobi routine, and only this routine is the memory-heavy part. However, this is proprietary software (written in C) which I cannot modify, nor do I have access to its source. I simply call it through a Python API with one command: model.optimize(). Thus I fear that, in this case, mmap is not an option, as I do not find any mention of it in the Gurobi documentation.
