Reply to Diagnosing iOS disc contention impacting networking?
Well, I had written a great reply, but the dev forums deleted it when I sent it because my daily log-in had expired, unbeknownst to me. And of course the latest saved draft dated from before the log-in expired, a few hours earlier, so it wasn't what I ended up saying. You'll be happy to know DispatchQueues are going to figure prominently in my solution, but it is going to be a lot of painstaking work on our part. At a minimum, the Xcode analyzer should flag any synchronous disc access called from a Task or async/await context and warn devs about the dangers of thread pool exhaustion.
Topic: App & System Services SubTopic: Core OS Tags:
Mar ’25
Reply to Diagnosing iOS disc contention impacting networking?
Update to my previous statement: that did not fix the problem. The problems just kind of go away on their own for a couple of days after updating the apps, and since it never happens when the debugger is attached, we thought we had solved it.

We eventually tracked it down to "thread pool exhaustion". By calling too many synchronous disc access functions within concurrent Task / async contexts, async apparently ran out of threads and (almost) deadlocked itself. (It would all work again if we went into our search page and tapped into the search field, but of course users didn't think to try that.) The same concurrent disc accesses had always been fine with threads and dispatch queues, so that definitely caught us unaware. Frankly, it feels like a design flaw in async/await.

In many places, multiple file accesses still needed to be synchronous, but we found a higher-level context in which to insert async continuations, which resume after the disc access calls have completed on dispatch queues. Apparently we can do as many of these as we need concurrently as long as we create dispatch queues for them; they just can't be in a Task / async context without a continuation. For the moment, all the Image Cache disc access calls use one serial dispatch queue, which should be fine. Since cache misses resulted in network fetches, which were async anyway, all the call sites into the image service were already async. That worked fine since our Image Cache only had 5 or 7 functions where it accessed the disc. Users with these new changes have confirmed for over a week that they have not gotten the aforementioned stuck-loading issues.

Now we get to rewrite our data-race safety for background uploading, which has 123 places that access the disc and requires delegate APIs to be synchronous. We're stuck with multiple delegate methods that report pieces of the state, so we have to write them into separate files. These separate files then have to be read together to make one cohesive state to decide what to do next. So just putting "async" on the low-level disc access functions does not maintain data-race safety, since actors silently give up on enforcing single access if you call an asynchronous function from within their methods (which I discovered much to my dismay when building AppAttest on top of actors a couple of years ago, and then discovered: you can't). If you want data-race safety, you need DispatchQueues. Should be fun. And by "fun" I mean a wide-awake panic for weeks on end.

I did try getting the sysdiagnose, but there were too many files over too long a period of time for me to make sense of. In fact, I couldn't even confirm that the actions to trigger the sysdiagnose had worked. There were too many files in the folder to tell whether there were new ones, or what the new ones would be. Someone at Apple decided the file names would truncate instead of wrap, which means I can't read the timestamps in any of the file names. I also have not yet tested the disc use profile.
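For anyone following along, here is a minimal sketch of the continuation approach described in this reply. It is not our actual code: the names `diskQueue` and `readFile(at:)` are made up for illustration. The blocking read runs on a serial dispatch queue, and the awaiting task is suspended rather than occupying one of the concurrency pool's threads.

```swift
import Foundation

// Hypothetical names throughout; this is an illustration, not the app's code.
// One serial queue owns every blocking file read, so reads run one at a time.
let diskQueue = DispatchQueue(label: "example.image-cache.disk")

/// Reads a file from an async context without blocking a Swift concurrency
/// thread: the blocking call runs on `diskQueue`, and the calling task stays
/// suspended until the continuation is resumed.
func readFile(at url: URL) async throws -> Data {
    try await withCheckedThrowingContinuation { continuation in
        diskQueue.async {
            // Resume exactly once, with either the file contents or the error.
            let result = Result(catching: { try Data(contentsOf: url) })
            continuation.resume(with: result)
        }
    }
}
```

Call sites that were already async just write `let data = try await readFile(at: url)`. Because the queue is serial, this mirrors the single-queue setup described for the Image Cache; a concurrent queue or several queues would be a different trade-off.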
Topic: App & System Services SubTopic: Core OS Tags:
Mar ’25
Reply to Diagnosing iOS disc contention impacting networking?
Yes, I was taking screenshots using the hardware buttons (the sleep/wake button and a volume button). As long as the app appeared to be stuck loading, the screenshot image would show the correct clock time, but when I open the EXIF data, it has the timestamp of the exact second when all the network requests finally resumed, some minutes later. Another of our developers said that while his app was stuck, the screenshots did not appear in his Photos app until either the app got unstuck or he force-quit the app. The images are also saved out of order: each has a unique sequential number in the file name, but sometimes screenshots showing minute "58" have a later number than screenshots showing minute "59". Our only interaction with the photos library would be when the user goes to create new content, where we use a PHPickerViewController to select media items, but for the bugs I'm describing, the user has not gotten that far in the app. Thanks for the lead on the disc space profile, I'll give it a shot.
Topic: App & System Services SubTopic: Core OS Tags:
Feb ’25
Reply to Diagnosing iOS disc contention impacting networking?
How I eventually alleviated the symptoms (shocker: it had nothing to do with the disc):

I had believed the MainActor was not getting blocked, because I was still able to use the UI and the watchdog timer was not killing the app after minutes. However, in several places we had code for a Task that would await an asynchronous fetch from the network and then await MainActor.run to send the changes to the data model for the UI. I changed all of those await MainActor.run calls to Task { @MainActor in ... }. The theory was that the containing Task may have been trying to keep using the network thread to run on the MainActor, preventing the network thread from going back to work on other URLSession tasks, and that by creating a sub-Task I would work around that particular issue.

I also found one spot where the Task that awaited the network data did not even use await MainActor.run when calling navigationController?.pushViewController, but instead relied on the Task implicitly inheriting the MainActor from the calling code, which was probably a programmer error. It also init'd a UIViewController subclass without explicitly calling await MainActor.run, then pushed that onto the nav stack. So I put both the VC init call and the push explicitly into a Task { @MainActor in ... }. The theory being that these functions somehow required explicit redirection to the MainActor and stalled out, never getting it, blocking the network thread from continuing, but somehow not blocking other code that was explicitly running on the main actor.

After making these two changes, and after only a couple of days of testing, people who had been hitting the issue frequently are no longer hitting it. So I conclude that one or both of these issues prevented the task / thread associations at the system level from resuming and completing work. The measurements of disc access events taking too long must also have been a result of task / thread contention stalling the start of the actual disc reads, and the EXIF timestamps in the screenshots being delayed until the apps resumed working correctly must also have come from system-level task / thread contention. The fact that an app's "threading" bug can cause a system-wide task scheduling stall for minutes on end is disturbing, however. I thought Apple moved away from that when they introduced preemptive multitasking in Mac OS X? But perhaps it's got more to do with the screenshot code being run as a background process that isn't given priority somehow.

As to why the OS is budgeting GBs of disc data against us, we still haven't solved that one. I'm aware of the SFSafariViewController DataStore cache clearing; that's not it. I'm aware we have some code that isn't managing to turn off the URLCache, but that's not it either. And I actually wrote code to enumerate all the directories in our various app group containers and get their file sizes, and I haven't been able to get FileManager to account for even GBs of data on disc. The image cache code was working to keep itself down to the size we had set.

As to why the app would not exhibit the main actor / network thread bug when the user had deleted the app and reinstalled, it may have been some kind of race condition in the timing of fetching images vs. checking for them in the disc cache, or it may have been related to the data on disc that we can't identify.
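To make the await MainActor.run change described in this reply concrete, here is a rough sketch of the two shapes. `FeedModel`, `fetchItems()`, and both function names are hypothetical stand-ins, not our real code. The only difference I can state with confidence is that the first form suspends the fetching task until the main actor has run the update, while the second hands the update to a new unstructured task and returns immediately. Whether that explains the stall is still just my theory.

```swift
import Foundation

// Hypothetical stand-in for the UI-facing data model.
@MainActor
final class FeedModel {
    var items: [String] = []
}

// Hypothetical stand-in for an asynchronous network fetch.
func fetchItems() async throws -> [String] {
    ["a", "b"]
}

// Before: the fetching task waits here until the main actor has applied the update.
func refreshAwaitingMainActor(_ model: FeedModel) async throws {
    let items = try await fetchItems()
    await MainActor.run { model.items = items }
}

// After: the update is scheduled as a separate main-actor task,
// and the fetching task does not wait for it to run.
func refreshWithMainActorTask(_ model: FeedModel) async throws {
    let items = try await fetchItems()
    Task { @MainActor in model.items = items }
}
```

One consequence of the second form: when `refreshWithMainActorTask` returns, the model may not have been updated yet, so any code that runs afterwards cannot assume the new items are in place.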
Topic: App & System Services SubTopic: Core OS Tags:
Feb ’25
Reply to Errors codes for invalid resumeData with URLSession UploadTask?
Eskimo, thanks for your answers. There are several follow-up questions I wanted to ask after doing some testing, especially around the idea that the resumeData becomes invalid when I restart the request, for cases where our app crashes after resuming an upload and thus has only the previously saved resume data and not a fresh version.

Unfortunately, when I went to test some of these scenarios, I discovered that as of last week, Cloudflare's proxy service no longer lets the 104 intermediate response through, and thus I'm never getting any resumeData. We've been working with our Cloudflare representative last week and this week, but don't yet have an answer on why that happened, or whether they could allow the 104s again. (They never fully worked, because Cloudflare always timed out the connection about a minute after the 104s came through, but we were cancelling the upload as soon as we got the 104 so we could get some resume data and save it on disk, and then restarting from resumeData immediately, and that was working fine for months during feature development.)

Do you know of any proxy services which do support Apple's implemented variant of the tusd protocol (i.e. with the intermediate 104 response)?

Can Apple somehow influence Cloudflare to support Apple's implemented variant of the tusd protocol? It's been released on iOS devices for over a year. I'm happy to go sit on the sidewalk across the street from their headquarters holding up a sign that says "Support 104s" if it will help.

Is there a way to force the resume manually if the device can't get the 104 intermediate response from the server? We're doing PUTs, not POSTs, and we already have a unique identifier, so from our perspective we could provide the Location header value pre-calculated when we start the upload initially, but I don't know how to synthesize resumeData with that value. It looks like you've carefully neutered NSKeyedArchiver's ability to read the data blob, which I could maybe fix, but it also looks like you may even have encrypted the original request headers in the data blob, so I'm assuming we wouldn't be able to install the correct encryption keys to provide you with a synthesized resumeData?

Is there a way to get resume data if the 104 response is final, and not intermediate? In our testing the answer seemed to be 'no'.
Oct ’24