Reclaiming cached data from an `enumerateDirectory` call

If I'm in an enumerateDirectory call, I can very quickly fill in the fileID, parentID, and (maybe) the type attributes based on the directory entry I have loaded. That is, I can quickly fill in anything that is contained in the dirent structure in dirent.h, plus the parentID.

However, if any other attributes are requested (say, flags), or if the file system doesn't store the filetype in the directory entry, then I need to do additional I/O and load an inode. If I have to load an inode, I might keep a reference to it and assume that I can clean it up later whenever there is a matching call to reclaimItem. But in the enumerateDirectory call, I never provide an FSItem to the system!

By observation, I see that normally, a call to enumerateDirectory of this nature is followed up by a lookupItem call for every single fetched item, and then assumedly the system can later reclaim it if need be. At least, I tried various ways of listing directories, and each way I tried showed this behavior. If that's the case, then I can rely on a later reclaimItem call telling me when to clean up this cached data from memory.

Is this guaranteed, however? I don't see a mention of this in the documentation, so I'm not sure if I can rely on this.

Or, do I need to handle a case where, if I do additional I/O after enumerateDirectory, I might need to figure out when cached data should be cleaned up to avoid a "leak?" (Using the term "leak" loosely here, since in theory looking up the file later would make it reclaimable, but perhaps that might not happen.)


By observation, I see that normally, a call to enumerateDirectory of this nature is followed up by a lookupItem call for every single fetched item, and then assumedly the system can later reclaim it if need be.

Really? I haven't specifically tested it, but my expectation would be that iterating the directory with "getattrlistbulk" and a minimal attribute set would avoid any additional lookupItem call.

However...

If that's the case, then I can rely on a later reclaimItem call telling me when to clean up this cached data from memory.

What kind of filesystem are you working with? For a block storage file system, caching doesn't really work in those terms, as the block data you’re reading from disk doesn’t really align with the filesystem’s own structures. That's particularly true on macOS where your I/O is page oriented, so the minimum cache unit is 16KB (one page).

On the other hand, in a network file system, not only is caching the primary tool you have to improve performance but the entire architecture is basically one giant exercise in cache management.

Last but not least, in a filter file system the problem you have is that caching ANYTHING becomes dangerous[1] - you could cache the data from enumerateDirectory, but there's no guarantee that it hasn't already changed when lookupItem is called.

[1] This is also true in a network file system, but user expectations and usage patterns tend to reduce these issues.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

my expectation would be that iterating the directory with "getattrlistbulk" and a minimal attribute set would avoid any additional lookupItem call

Oh, I think I was a little unclear in what I wrote. What I was trying to say in "a call to enumerateDirectory of this nature" was that if I call enumerateDirectory with a non-minimal attribute set (i.e. include attributes that need more I/O to fetch) then I see that behavior. But you're right in that iterating over a directory with minimal attributes doesn't generally have additional lookupItem calls in my tests.

What kind of filesystem are you working with?

It's a block device file system.

I think maybe "cached data" wasn't the best word for me to use... I'm referring to references I'm keeping to instances of my subclass of FSItem. I have a helper function implemented on my subclass to get the "non-minimal" attributes, so what I'm doing in my enumerateDirectory implementation is checking if I have a reference for the item already, and if so, I simply use that and its helper method. If not, I create a new instance of the FSItem, then proceed. (If there are only minimal attributes that don't need any more I/O requested, then I don't do anything with these FSItems.)

My question is more so, if I'm creating a new FSItem in enumerateDirectory like that (for the purpose of getting those "non-minimal" attributes), can I assume a lookupItem call is going to follow, and thus giving the system a chance to later reclaim that FSItem? Or should I be throwing away my reference to the FSItem if I created it inside enumerateDirectory? It seems like it would be a bit wasteful if I threw away an FSItem made in the enumerateDirectory call if I know there's going to be a lookupItem call shortly after asking for it, although I don't see a guarantee that that would be the case.

That's particularly true on macOS where your I/O is page oriented, so the minimum cache unit is 16KB (one page).

Oh? Does this mean it's not good to pass lengths smaller than 16KB to e.g. metadataRead, and what its docs are referring to when they say "This method doesn’t support partial reading of metadata?" It's often the case that I've been using 4KB, and things seem to run fine, although right now I only have reading implemented, so I don't know if that causes a problem later if I implement write.

Part 1...

Disclaimer: Some of the information here may already be obvious or well understood to you. If it is, then thank you for your patience as I use this as an excuse to push out background information that other developers may find useful.

Oh, I think I was a little unclear in what I wrote. What I was trying to say in "a call to enumerateDirectory of this nature" was that if I call enumerateDirectory with a non-minimal attribute set (i.e. include attributes that need more I/O to fetch), then I see that behavior. But you're right in that iterating over a directory with minimal attributes doesn't generally have additional lookupItem calls in my tests.

Again, you have to be careful of the API you're using. The "classic" Unix directory iteration pattern[1] is to read the directory (readdir → enumerateDirectory) and then retrieve metadata (stat → lookupItemNamed).

There are two problems with that:

  1. At a basic level, it generates a lot of syscalls, each of which nibble away at performance.

  2. From the file system side, it throws away some very easy and obvious performance wins.

Expanding on that last point, the vast majority of block file systems all:

  • Store most of a file’s metadata in a single data structure.

  • Organize those records such that the records for objects within the same directory are often physically close to each other on disk.

  • Use a storage allocation scheme ("allocation block size") which guarantees that data will be read in large enough chunks that many records will invariably be retrieved at once.

Putting all those points into concrete terms, when a file system retrieves the name of any given object, it invariably has a bunch of other information "at hand" (size, dates, etc.), since all of that information was in the same record. Similarly, the blocks it read to retrieve the data about one file are VERY likely to include the data about other files in the same directory, since they were already close to each other on disk.

All of that is why APIs like getattrlistbulk(), readdir_r(), and getdirentriesattr() exist: they let the file system return the data for a bunch of files "at once", taking advantage of the fact that the file system often has much of that data already. One interesting detail about that process: the reason getattrlistbulk takes a list of attributes isn't to optimize the retrieval process on the file system side. As I talked about above, there often isn't anything to "optimize", since all the data is in a single record that the file system can't partially retrieve. What it's actually doing there is reducing what it has to RETURN to user space so that it can pack more replies into the same syscall.
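The difference between the two patterns is easy to see from user space. As a rough analogy (Python rather than the C APIs above, and not FSKit itself), os.scandir() surfaces the type information the file system already had "at hand" in each directory entry, while the listdir()-plus-stat() pattern pays an extra syscall per file:

```python
import os
import tempfile

# Build a small test directory: a few files and a subdirectory.
root = tempfile.mkdtemp()
for name in ("a.txt", "b.txt", "c.txt"):
    open(os.path.join(root, name), "w").close()
os.mkdir(os.path.join(root, "sub"))

# Classic pattern: readdir-equivalent listing, then one stat() per entry.
slow = {name: os.path.isdir(os.path.join(root, name))
        for name in os.listdir(root)}

# Bulk-friendly pattern: scandir() exposes the d_type information the
# file system already had in the directory entry, so on most file
# systems is_dir() needs no extra stat() call.
fast = {entry.name: entry.is_dir(follow_symlinks=False)
        for entry in os.scandir(root)}

assert slow == fast
print(sorted(name for name, is_dir in fast.items() if is_dir))  # ['sub']
```

Same answer either way; the second form simply avoids throwing away data the file system already retrieved, which is exactly the win the bulk APIs are built around.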

With all that background, let me jump back to here:

What I was trying to say in "a call to enumerateDirectory of this nature" was that if I call enumerateDirectory with a non-minimal attribute set (i.e. include attributes that need more I/O to fetch)

The thing to be aware of here is that different APIs will generate different behavior in ways that aren't entirely obvious/predictable. Notably, NSFileManager's URL enumerator does "bulk" enumeration while its older path enumerator ends up calling stat() on every file. I have a forum post here that covers this and includes some performance comparisons.

Switching over to the FSKit side, that's why "enumerateDirectory" has the design it has: you're using FSDirectoryEntryPacker to fill up a buffer which will then be returned to the kernel (once it's full or you run out of entries).

[1] As a side note, while writing this up I noticed that "Building a passthrough file system" not only uses the (slow) readdir/stat pattern, but appears to leak the directory descriptor and has a serious issue with its telldir() usage (r.175523886). Anyone actually implementing a passthrough file system should probably NOT use that implementation as a direct model.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Accepted Answer

Part 2...

My question is more so, if I'm creating a new FSItem in enumerateDirectory like that (for the purpose of getting those "non-minimal" attributes), can I assume a lookupItem call is going to follow, and thus giving the system a chance to later reclaim that FSItem?

I don't think that's a safe assumption and, in practice, I think you're very likely to see lots of cases where a lookup ISN'T generated. I don't think a basic "ls" will generate a lookup call and I'd expect/hope the Finder would avoid it at least some of the time.

Or should I be throwing away my reference to the FSItem if I created it inside enumerateDirectory? It seems like it would be a bit wasteful if I threw away an FSItem made in the enumerateDirectory call if I know there's going to be a lookupItem call shortly after asking for it, although I don't see a guarantee that that would be the case.

I think this all depends on how you want to manage this process. Most file system implementations route all object requests through some kind of "bottleneck" (like "lookup") to help ensure that they provide a consistent "view" of every object across multiple clients. That bottleneck either provides the cached object it already has or creates a new object (adding it to the cache) if it doesn't have one, ensuring that there is never more than one object for any given file system object.

When that object goes away is entirely up to you, but it's very typical for the caching system to intentionally hold on to otherwise unused/unneeded references just in case they're needed again "shortly".
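That get-or-create bottleneck can be sketched in a few lines. This is a conceptual Python sketch, not FSKit API: ItemCache, load_item, and reclaim are invented names standing in for your FSItem subclass, its on-disk loader, and the reclaimItem hook:

```python
import threading

class ItemCache:
    """Get-or-create bottleneck: at most one live object per file ID."""

    def __init__(self, load_item):
        self._load_item = load_item   # callback that does the real I/O
        self._items = {}              # file_id -> cached object
        self._lock = threading.Lock()

    def lookup(self, file_id):
        # Every request for an object funnels through here, so all
        # clients see the same instance for the same file system object.
        with self._lock:
            item = self._items.get(file_id)
            if item is None:
                item = self._load_item(file_id)
                self._items[file_id] = item
            return item

    def reclaim(self, file_id):
        # Called when the system says the object is no longer needed.
        with self._lock:
            self._items.pop(file_id, None)

# Usage: the loader is a stand-in for "read the inode from disk".
cache = ItemCache(load_item=lambda fid: {"id": fid})
a = cache.lookup(42)
b = cache.lookup(42)
assert a is b                       # one object per file system object
cache.reclaim(42)
assert cache.lookup(42) is not a    # recreated after reclaim
```

A real implementation would layer retention policy on top of reclaim(), e.g. keeping recently released objects around for a while, since (as above) they're often wanted again shortly.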

Reorganizing things a bit for clarity:

...what its docs are referring to when they say "This method doesn’t support partial reading of metadata?"

This is a case of making something sound a lot more complicated than it actually is. Let me start with the "readInto" documentation:

(1)

length
A maximum number of bytes to read. The completion handler receives a parameter with the actual number of bytes read.

...

(2)

For the read to succeed, requests must conform to any transfer requirements of the underlying resource. Disk drives typically require sector (physicalBlockSize) addressed operations of one or more sector-aligned offsets.

Both of those requirements are straightforward side effects of readInto/writeInto's underlying implementation. Starting with #2, the I/O is actually going to the raw dev node ("/dev/rdiskXXs"), which will fail any I/O that doesn't match the hardware requirements the device presents in IOKit. For #1, the actual APIs being called are pread/pwrite, which:
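The alignment requirement in #2 is mechanical: the request has to start on a sector boundary and cover whole sectors. A small illustrative sketch of the arithmetic (aligned_span is an invented helper; the real sector size comes from the device):

```python
def aligned_span(offset: int, length: int, sector_size: int) -> tuple[int, int]:
    """Expand an arbitrary byte range to the smallest sector-aligned
    range containing it, suitable for raw-device I/O."""
    if sector_size <= 0 or length <= 0:
        raise ValueError("sector_size and length must be positive")
    start = (offset // sector_size) * sector_size             # round down
    end = -(-(offset + length) // sector_size) * sector_size  # round up
    return start, end - start

# A 100-byte read at offset 1000 on a 512-byte-sector device must be
# widened to a whole-sector request before it can go to the raw node.
print(aligned_span(1000, 100, 512))  # (512, 1024)
```

Submitting the unaligned (1000, 100) request directly would be rejected, just as the readInto documentation describes.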

pread:

Upon successful completion, read(), readv(), and pread() return the number of bytes actually read and placed in the buffer. The system guarantees to read the number of bytes requested if the descriptor references a normal file that has that many bytes left before the end-of-file, but in no other case.

pwrite:

... pointer associated with fildes, see lseek(2). Upon return from write(), the pointer is incremented by the number of bytes which were written.

Note that the behavior above isn't necessarily "useful" or even really "normal". I think the main way you'll get a "short" read is if your read request extends past the end of the disk, but I'm not sure why that would ever be helpful/useful. pread/pwrite work this way because of other I/O contexts (like socket I/O), and FSKit is simply exposing this behavior; it's easier than creating a new read/write implementation.
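The short-read behavior is easy to demonstrate with a regular file standing in for the device, since Python's os.pread is a thin wrapper over pread(2):

```python
import os
import tempfile

# A 1000-byte file stands in for a small block device.
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * 1000)

n_inside = len(os.pread(fd, 100, 0))    # wholly inside the file
n_short = len(os.pread(fd, 100, 950))   # extends past the end
n_past = len(os.pread(fd, 100, 1000))   # starts at the end

os.close(fd)
os.unlink(path)

# A read inside the file returns everything asked for; one running past
# the end comes back short; one at/after the end returns nothing.
print(n_inside, n_short, n_past)  # 100 50 0
```

This is the same "actual number of bytes read" that readInto's completion handler reports, and (per the metadata discussion below) exactly the partial I/O that the metadata methods refuse to perform.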

With all that context, all this means:

"This method doesn’t support partial reading of metadata"

...is that the metadata I/O methods WON'T do #1. The UBC (Universal Buffer Cache) expects I/O requests to be "sensible", so it fails the I/O requests that would cause pread/pwrite to do partial I/O. This is arguably the more reasonable behavior, but the documentation should probably have said this in a much more straightforward way (r.175533440).

Oh? Does this mean it's not good to pass lengths smaller than 16KB to e.g. metadataRead,

No, or at least not exactly. First off, as background context, I have a post on this here which is worth reading.

As I explained there, metadataRead is the direct equivalent of buf_meta_bread. As such, it should generally not be used for multipage I/O. However, the nature of the UBC also means that there isn't a big difference between a 4KB read and a 16KB read.

Finally, one follow-up point here:

although right now I only have reading implemented, so I don't know if that causes a problem later if I implement write.

As a block storage file system, one thing you should be looking at/planning is to transition to FSVolumeKernelOffloadedIOOperations and away from FSVolumeReadWriteOperations. Routing bulk I/O through your extension puts a hard ceiling on your overall performance, so I wouldn't waste too much time trying to optimize the direct I/O protocol.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Some of the information here may already be obvious or well understood to you.

Thank you for taking the time to write all this down. That context is very helpful, especially as I'm not super familiar with the lower level Unix APIs related to this.

I don't think that's a safe assumption and, in practice, I think you're very likely to see lots of cases where a lookup ISN'T generated. I don't think a basic "ls" will generate a lookup call and I'd expect/hope the Finder would avoid it at least some of the time.

The second expectation doesn't seem to be what happens in reality, at least on macOS 15.7.5. I've been testing with a sample file system that has a directory with 10,000 items. When I do a time ls /Volumes/MyFS/dir, I see that a lookupItem call was done for every single item (actually, 2 for each item, e.g. one for file1.txt and another for ._file1.txt, which doesn't exist). And if I open /Volumes/MyFS in the Finder, it immediately goes wild making lookupItem calls to the thousands of items in /Volumes/MyFS/dir before I even open dir (although I do have the "calculate all sizes" view option enabled in the Finder; when I turn that off, it at least waits until I open dir).

Edit: OK, actually, it seems like my shell has added extra stuff to a "default" ls call... it's probably oh-my-zsh or something. When I run it in a plain bash shell, a regular ls doesn't make all those lookup calls. But it also doesn't ask for any of the heavy attributes that lead to my codepath that can instantiate the FSItem object, so that specific case doesn't lead to a leak in my code.

These operations (the ls and opening the big folder in the Finder) are really slow, and while a lot of it seems to simply be FSKit overhead (FB21069313), I'm currently trying to optimize my own code to avoid some of the bottleneck on my side.

I guess, writing this down, it really looks like the "FSKit overhead" might be the tons of lookupItem calls... so if that's not intended behavior, then yeah, sounds like my assumption wouldn't work and I should be trying to figure out when to prune these objects.

That bottleneck either provides the cached object it already has or creates a new object (adding it to the cache) if it doesn't have one, ensuring that there is never more than one object for any given file system object.

Yeah, I've currently got something like that. I'm currently relying on reclaimItem calls to know when to release my hold on the object from the bottleneck, but I was recently reading my code again and realized that I might be never reclaiming it if the system never actually receives the item! Hence I came to the forums to ask this, to see if this is actually a problem or not.

First off, as background context, I have a post on this here which is worth reading.

Oh hey, that's an answer to one of my previous questions :)

As a block storage file system, one thing you should be looking at/planning is to transition to FSVolumeKernelOffloadedIOOperations and away from FSVolumeReadWriteOperations

Got it. I do have that implemented, although right now I'm rejecting any request that contains the write flag. I also don't have any of the supporting write functionality (creating files, changing attributes, etc.) done, which is probably going to be the hard part with that.
