Update to my previous statement.
That did not fix the problem. The problems just kind of go away on their own for a couple of days after updating the apps, and since it never happens when the debugger is attached, we thought we had solved it.
We eventually tracked it down to "thread pool exhaustion". By calling too many synchronous disc access functions within concurrent Task / async contexts, Swift concurrency apparently ran out of threads in its pool and (almost) deadlocked itself. (It would all work again if we went into our search page and tapped into the search field, but of course users didn't think to try that.) The concurrent disc accesses had all been fine, of course, with threads and dispatch queues, so that definitely caught us unaware. Frankly, it feels like a design flaw in async / await.
In many places, multiple file accesses still needed to be synchronous, but we found a higher-level context where we could insert async continuations, which resume once the disc access calls have completed on dispatch queues. Apparently, we can run as many of these concurrently as we need, as long as we create dispatch queues for them; they just can't run directly in a Task / async context without a continuation.
For the moment, all the Image Cache disc access calls use one serial dispatch queue, which should be fine. Cache misses already resulted in network fetches, which are async, so all the call sites into the image service were async anyway. That worked fine since our Image Cache only had 5 or 7 functions where it accessed the disc.
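For anyone who wants to see the shape of that fix, here is a minimal sketch, not our actual code. The type name `SerialDiscReader`, the queue label, and the `read(_:)` function are all made up for illustration. The blocking read runs on a serial dispatch queue, and the async caller is suspended on a continuation instead of tying up one of the concurrency pool's threads.

```swift
import Foundation

// Hypothetical wrapper; the names are illustrative and not from our code base.
struct SerialDiscReader {
    // One serial queue for every blocking read, so the synchronous file I/O
    // never runs on a thread that belongs to Swift concurrency.
    private let queue = DispatchQueue(label: "image-cache.disc-access")

    func read(_ url: URL) async throws -> Data {
        try await withCheckedThrowingContinuation { continuation in
            queue.async {
                // The continuation is resumed exactly once, on either path.
                do {
                    continuation.resume(returning: try Data(contentsOf: url))
                } catch {
                    continuation.resume(throwing: error)
                }
            }
        }
    }
}
```

A call site would then be something like `let data = try await SerialDiscReader().read(fileURL)`, assuming the caller was already async, as ours were.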
Users with these new changes have confirmed for over a week they have not gotten the aforementioned stuck-loading issues.
Now we get to re-write our data-race safety for background uploading, which has 123 places that access the disc and requires delegate APIs to be synchronous. We're stuck with multiple delegate methods that each report a piece of the state, so we have to write them into separate files. These separate files then have to be read together to make one cohesive state to decide what to do next. So just putting "async" on the low-level disc access functions does not maintain data-race safety, since actors are reentrant: they silently give up on enforcing single access whenever one of their methods awaits an asynchronous function (which I discovered much to my dismay when building AppAttest on top of actors a couple of years ago, and then discovered - you can't). If you want data-race safety, you need DispatchQueues. Should be fun. And by "fun" I mean a wide-awake panic for weeks on end.
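To show what I mean about actors, here is a small sketch. Everything in it is hypothetical (`UploadState`, `writeTwoPieces`, `readState`, `demonstrateReentrancy`), and a stored continuation stands in for an async disc call so the interleaving is repeatable. A second actor method runs while the first one is still suspended in the middle of its work.

```swift
import Foundation

// Hypothetical actor, only to illustrate reentrancy; not our real upload code.
actor UploadState {
    private var log: [String] = []
    private var waiter: CheckedContinuation<Void, Never>?

    var isSuspended: Bool { waiter != nil }

    // Reads like one atomic step, but the await in the middle is a suspension point.
    func writeTwoPieces() async {
        log.append("piece 1")
        await withCheckedContinuation { waiter = $0 }  // stands in for an async disc call
        log.append("piece 2")
    }

    // The actor is free to run this while writeTwoPieces() is suspended.
    func readState() -> [String] {
        let seen = log
        waiter?.resume()   // let the suspended writer finish
        waiter = nil
        return seen
    }
}

func demonstrateReentrancy() async -> [String] {
    let state = UploadState()
    let writer = Task { await state.writeTwoPieces() }
    // Wait until the writer has reached its suspension point.
    while !(await state.isSuspended) { await Task.yield() }
    let seen = await state.readState()  // runs in the middle of writeTwoPieces()
    await writer.value
    return seen  // only "piece 1": the reader saw a half-written state
}
```

The reader gets in between the two writes even though both methods are on the same actor. With a serial dispatch queue and synchronous blocks, the second block could not start until the first had finished.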
I did try getting the sysdiagnose, but there were too many files over too long a period of time for me to make sense of. In fact, I couldn't even confirm that doing the actions to create the sysdiagnose worked. There were too many files in the folder to tell if there were new ones, or what the new ones would be. Someone at Apple decided the file names would truncate instead of wrap, and that means I can't read the time stamps on any file names.
I also have not yet tested the disc use profile.