Thanks for the superb answer, Kevin!
I'll just touch on the memory issue:
I confess, I haven't actually looked into this. What's the limit you’re actually hitting and how much memory do you think/need/want?
We mostly use IOMallocAligned(), and then manage chunks within the kext mostly using the vmem and kmem caching originated in OpenSolaris, adapted to a system where there is a parent memory system using alloc/free style calls rather than a very large range of physical pages. (This was the starting point for zfsonlinux, too). We would like as much as a knowledgeable user expecting to do large amounts of cacheable I/O might want to dedicate. Ideally, it would be fully automatic and dynamically reactive to the amount of otherwise-unused memory on the system and the amount of cache-serviced I/O actually observed.
In the case of APFS inside a zvol for example, consumers of files in the APFS filesystem will make use of the UBC just fine. This performs surprisingly well in two ways: firstly, ARC and other ZFS caching hides a lot of drive-seek latencies from APFS; and secondly, because APFS caches data in the UBC, the number of reads into the ARC for "double-cached" data largely vanishes, and consequently the ARC copy is more likely to stay off the frequently-read part of the cache and thus more eligible for replacement.
What ARC caches in APFS-in-zvol is morally equivalent to sets of disk blocks (from the APFS perspective) -- there is no knowledge of individual APFS files or other objects.
However, for datasets (by far the most common use of ZFS everywhere) UBC is not engaged, and I think that's the focus of lundman's comments on UBC. Here what is cached is ranges of a "DMU" object, which is roughly equivalent to ranges of an ordinary file (or file-like directory or other metadata).
Multiple read(2)s of a part of a file in a dataset are served by caches within the kext, and we jump through a couple of additional hoops when a file is mmap()ed. Writes (including msync()s) are also cached (after compression, checksumming, and other processing) and aggregated.
(Experimental plumbing between the UBC and the ARC using UPLs mostly ended up having poor performance, although that was many years ago. One current blocker that has arisen since then is that the modern ARC typically retains mostly compressed data, with short-lived caching of uncompressed data. (The data is almost always stored compressed on the underlying media, and what's cached is what's read off the media.) Extra copies of uncompressed data in the UBC, or decompressing data in the kext to hand to the UBC to generate and keep a compressed copy of its own, seem a little wasteful.)
Because we starve all the other kernel clients of RAM if we go over some threshold (e.g. 60 GiB on a 512 GiB M3 Mac Studio), we have some defensive capping well below that.
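As a rough illustration of what "defensive capping" means arithmetically, here is a minimal sketch. It is not the kext's actual policy: the function names and the idea of expressing the margin as a percentage of the observed danger threshold are assumptions made up for this example.

```c
#include <stdint.h>

#define GIB (UINT64_C(1) << 30)

/* Hypothetical policy: cap the cache at margin_percent of the footprint at
 * which other kernel clients were observed to starve. Computed in two parts
 * so the multiplication cannot overflow and no precision is lost. */
uint64_t cache_cap_bytes(uint64_t danger_threshold, unsigned margin_percent)
{
    if (margin_percent > 100)
        margin_percent = 100;
    return (danger_threshold / 100) * margin_percent
         + (danger_threshold % 100) * margin_percent / 100;
}

/* Clamp whatever target the cache would like to the defensive cap. */
uint64_t clamp_cache_target(uint64_t requested, uint64_t cap)
{
    return requested < cap ? requested : cap;
}
```

With the 60 GiB example threshold and an assumed 50% margin, a request for 100 GiB would be clamped to 30 GiB, while a request for 8 GiB would pass through unchanged.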
We do give memory back via IOFreeAligned() if the ARC shrinks because of user intervention or automatically if memory pressure is detected. Since the project goes back to (and still runs on) old Intel Mac minis, where page table shootdowns and older kernel-internal memory management made bursts of many small frees expensive, we mostly exchange fairly large chunks of memory with the kernel. Because of the nature of a hierarchy of slab allocators, that risks pinning more memory in the kext than we might want, or through another lens, makes us less reactive to memory pressure (or a manual reduction in footprint) than I'd like.
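To make the pinning problem concrete, here is a minimal userspace sketch, not the kext's code. It assumes a single-level arena with made-up sizes (1 MiB chunks, 4 KiB objects), uses aligned_alloc()/free() as stand-ins for IOMallocAligned()/IOFreeAligned(), and all the arena_* names are hypothetical. A chunk can only go back to the parent once every object carved from it has been freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sizes, chosen only for illustration. */
#define CHUNK_SIZE     ((size_t)1 << 20)   /* unit exchanged with the parent */
#define OBJ_SIZE       ((size_t)4096)      /* unit handed to cache clients */
#define OBJS_PER_CHUNK (CHUNK_SIZE / OBJ_SIZE)
#define MAX_CHUNKS     8

struct chunk {
    unsigned char *base;                   /* NULL when the slot is unused */
    size_t live;                           /* objects currently handed out */
    unsigned char used[OBJS_PER_CHUNK];    /* per-object in-use flags */
};

struct arena {
    struct chunk chunks[MAX_CHUNKS];
};

void arena_init(struct arena *a)
{
    for (size_t i = 0; i < MAX_CHUNKS; i++) {
        a->chunks[i].base = NULL;
        a->chunks[i].live = 0;
    }
}

/* Hand out one object, importing a new chunk from the parent if needed. */
void *arena_alloc(struct arena *a)
{
    struct chunk *empty = NULL;
    for (size_t i = 0; i < MAX_CHUNKS; i++) {
        struct chunk *c = &a->chunks[i];
        if (c->base == NULL) {
            if (empty == NULL)
                empty = c;                 /* remember a slot for importing */
            continue;
        }
        if (c->live == OBJS_PER_CHUNK)
            continue;                      /* chunk is full */
        for (size_t j = 0; j < OBJS_PER_CHUNK; j++) {
            if (!c->used[j]) {
                c->used[j] = 1;
                c->live++;
                return c->base + j * OBJ_SIZE;
            }
        }
    }
    if (empty == NULL)
        return NULL;                       /* arena is at its chunk limit */
    /* Stand-in for IOMallocAligned(): one large aligned allocation. */
    empty->base = aligned_alloc(OBJ_SIZE, CHUNK_SIZE);
    if (empty->base == NULL)
        return NULL;
    for (size_t j = 0; j < OBJS_PER_CHUNK; j++)
        empty->used[j] = 0;
    empty->used[0] = 1;
    empty->live = 1;
    return empty->base;
}

/* Release one object; the chunk returns to the parent only when empty. */
void arena_free(struct arena *a, void *p)
{
    for (size_t i = 0; i < MAX_CHUNKS; i++) {
        struct chunk *c = &a->chunks[i];
        if (c->base == NULL)
            continue;
        uintptr_t off = (uintptr_t)p - (uintptr_t)c->base;
        if (off >= CHUNK_SIZE)
            continue;                      /* p is not inside this chunk */
        size_t j = off / OBJ_SIZE;
        assert(off % OBJ_SIZE == 0 && c->used[j]);
        c->used[j] = 0;
        if (--c->live == 0) {
            free(c->base);                 /* stand-in for IOFreeAligned() */
            c->base = NULL;
        }
        return;
    }
    assert(!"pointer does not belong to this arena");
}

/* Bytes the arena holds from the parent, whether in use or not. */
size_t arena_held_bytes(const struct arena *a)
{
    size_t n = 0;
    for (size_t i = 0; i < MAX_CHUNKS; i++)
        if (a->chunks[i].base != NULL)
            n += CHUNK_SIZE;
    return n;
}

/* Bytes actually handed out to clients. */
size_t arena_live_bytes(const struct arena *a)
{
    size_t n = 0;
    for (size_t i = 0; i < MAX_CHUNKS; i++)
        if (a->chunks[i].base != NULL)
            n += a->chunks[i].live * OBJ_SIZE;
    return n;
}
```

Fill two chunks, then free everything except one object in each: arena_live_bytes() drops to 8 KiB, but arena_held_bytes() stays at 2 MiB until those last two objects go. Smaller chunks would shrink that gap at the cost of more frequent calls into the parent allocator, which is exactly the trade-off described above.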
Maybe on newer Apple Silicon machines with newer kernels with Apple's many interesting memory management innovations, allocating/freeing smaller chunks of memory might be less noticeable. Additionally, if plumbing between UBC and the kext's caches were "easier", we could hold on to less memory.
("Easier" is mostly about correctness in plumbing between existing code and UPLs given the restricted interface for the latter and the need to handle files which are potentially mmap()ed and having read()/write() done on them simultaneously. Think of a file on which some application is doing ordinary read()/write() when mds and friends come along to index it.)
Finally, for completeness, there are minor uses of other IO alloc/free calls like IOMallocType() too, and apparently two corner cases where kmalloc() is used.
Sean Doran
Not coauthored by any intelligence and it probably shows