Reply to Lock Contention in APFS/Kernel?
F_FULLFSYNC is basically "always" going to be relatively slow. The API operates on a file handle, but flushing the volume to a coherent state is a broader operation, and the final I/O command is a device command. All of that makes it a fairly "heavy" operation.

Gotcha, thanks for the background! On that note, I did my research—to our mutual peril—and found a comment (https://news.ycombinator.com/item?id=25204202) that cited sqlite's codebase, noting that "fdatasync() on HFS+ doesn't yet flush the file size if it changed correctly". Some folks on the postgres mailing list also appear to be a bit uncertain about the current situation: https://www.postgresql.org/message-id/flat/CA%2BhUKGLv-kvrtA5QEL%3Dp%3DdYK0p9gsMXJaVhUFu%2BA-KyFrFi%3D2g%40mail.gmail.com#fe6d1c5665a381687842758fd5b245d4.

I'm a newbie when it comes to understanding C code, but looking over kern_aio.c, I see two seemingly contradictory comments. On lines 658–659:

* NOTE - we do not support op O_DSYNC at this point since we do not support the
* fdatasync() call.

Later, on lines 721–727, above aio_return:

/*
 * aio_return - return the return status associated with the async IO
 * request referred to by uap->aiocbp. The return status is the value
 * that would be returned by corresponding IO request (read, write,
 * fdatasync, or sync). This is where we release kernel resources
 * held for async IO call associated with the given aiocb pointer.
 */

So I guess I'm wondering: is it... fine to now use fdatasync on APFS? Because if it is now fine (as per sqlite's understanding via the Hacker News comment...), then I think there's a bunch of software that might be relying on outdated documentation/advice, since:

- man fsync on macOS 26.1 refers to a drive's "platters". To the best of my knowledge, my MacBook Pro does not have any platters!
- As of 2022, it appears that Apple's patched version of SQLite uses F_BARRIERFSYNC. The wording of the documentation, at least for iOS, suggests that the file data size would be synced to disk.
- Foundation's FileHandle (which is, I think, equivalent to Rust's std::fs::File?) uses a plain fsync(_fd), not an F_FULLFSYNC like Rust's (and Go's, for that matter!) standard libraries do.

On the APFS side, the specific concern I have here is about the performance dip at high core count. That's partly because of the immediate issue and mostly because our core count has been increasing, and we need to be watching for these concurrency bottlenecks.

Understood. My friend André—who produced the graph on his M3 Ultra Mac Studio—theorized this past weekend that part of the observed performance degradation was due to the interconnects between the M3 Maxes, but this was idle speculation over brunch. He, to your point, also noted that the base Apple Silicon chips went from 8 cores to 12 cores in the span of 5 years, with roughly 20% year-over-year performance improvements! That'll certainly stress design assumptions!

Expanding on that last point, there's a real danger in these comparisons that comes from assuming that both implementations are directly "equivalent", so any performance divergence comes from correctable issues in the other platform. That's certainly true some of the time, but it's definitely NOT true here.

Understood! If the answer from your end is "you're hitting a pathological edge case on APFS doing something it wasn't really designed for", then that's fine!
It'd be nice if I could have my cake and eat it too à la ext4, but as you mentioned previously, there's a latency/throughput tradeoff here, and APFS is firmly on the side of "latency". (And just to wave my credentials/sympathy for APFS' position: I spent four years of my life working on a latency-sensitive reimplementation of the Rust compiler, so I get how fundamental these design tradeoffs are!)

Asking a question that probably should have been asked earlier... why? Why are you doing this at all? Unless you're applying some external force/factor (basically, cutting power to the drive), I think all of these sync calls are really just slowing you down. That's true of ALL platforms, not just macOS. If your code is doing it as part of its own internal logic then that's fine, but if this is actually part of your testing infrastructure then I'm not sure you shouldn't just "stop".

No, that's a great point to clarify! The fsyncs are not part of the tests—at least, not directly—but rather part of the core logic of jj itself. The change was introduced in this commit and has reduced the frequency of people reporting data corruption issues. (The "op_store" in the linked commit can be thought of as a "write-ahead log, but for the files you're keeping under version control". We could probably restore from the op log in the event of data corruption, now that I think of it...)
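For anyone else reading along, here's a minimal, untested Rust sketch of the three sync strengths being discussed in this thread. It is not from the jj codebase; it assumes the libc crate exposes F_BARRIERFSYNC and F_FULLFSYNC on Apple targets, and the "middle ground" framing of F_BARRIERFSYNC is my reading of the SQLite patch and Apple's docs mentioned above:

use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

// fsync(2): pushes data to the drive, but the drive may keep it in its
// volatile cache. This is what Foundation's FileHandle appears to use.
fn sync_os_caches(f: &File) -> io::Result<()> {
    if unsafe { libc::fsync(f.as_raw_fd()) } == 0 { Ok(()) } else { Err(io::Error::last_os_error()) }
}

// F_BARRIERFSYNC (Apple-only): orders I/O around a barrier without forcing
// a full device cache flush; reportedly what Apple's patched SQLite uses.
fn sync_with_barrier(f: &File) -> io::Result<()> {
    if unsafe { libc::fcntl(f.as_raw_fd(), libc::F_BARRIERFSYNC) } == 0 { Ok(()) } else { Err(io::Error::last_os_error()) }
}

// F_FULLFSYNC (Apple-only): asks the device to flush its hardware cache.
// This is what Rust's (and Go's) standard libraries issue on Apple targets.
fn sync_full_flush(f: &File) -> io::Result<()> {
    if unsafe { libc::fcntl(f.as_raw_fd(), libc::F_FULLFSYNC) } == 0 { Ok(()) } else { Err(io::Error::last_os_error()) }
}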
Topic: App & System Services SubTopic: Core OS Tags:
3w
Reply to Lock Contention in APFS/Kernel?
Based on the general description, I suspect it's F_FULLFSYNC. F_FULLFSYNC asks the target device to flush its hardware cache, which both takes longer and has "broader" consequences. That doesn't make it "wrong" (the right choice is "the right choice"), but it does have performance consequences.

Yup, it is calling F_FULLFSYNC. Tracing through the code, jj starts by calling std::fs::File::sync_data. This method then calls sync_data on its private inner file handle which, in turn, calls libc::fcntl(fd, libc::F_FULLFSYNC), but only for target triple vendors that are Apple. On a bunch of Unixes, std will instead call fdatasync, which does not appear to be implemented on Darwin, which, fair enough.

However, one note to set "expectations". [...] Everything else we hold off shipping until the next major release to ensure that those changes have as much testing time as possible. That's the best way to protect our users' data, but it does mean that it can often look like file system issues take a particularly long time to address.

No worries at all! I don't expect this to be fixed any time soon, and I know I'm asking for potentially a very large change if y'all are able to reproduce this properly.

Have you looked at using separate volumes to further separate the data sets? Perhaps using a DiskImage with multiple APFS volumes as the target?

We haven't, but that's certainly an option to explore, especially since it doesn't require the end-user to approve/run anything! So far, we've found success by having developers on macOS run:

sudo mkdir -p /Volumes/RAMDisk
sudo chmod a+wx /Volumes/RAMDisk
sudo mount_tmpfs -e /Volumes/RAMDisk

...and running tests as env TMPDIR=/Volumes/RAMDisk cargo nextest run, but that doesn't Just Work™ out of the box. On an M3 Ultra, this brings our test runtimes back to around 42.9s when the test runner is given 12 cores (above 12 cores, test runtimes increase, as you can see in the lower, blue graph in my initial post). As most of us are on M{2, 3, 4} Pros instead of the Max or Ultra variants, this works out surprisingly nicely.

So, the next step here would be to file a bug that includes a system trace of the issue, a sysdiagnose taken immediately after you finish your test run, and the reproduction steps above. Please post the bug number back here once it's filed.

Can do! That'll probably be a tomorrow morning thing.
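To make the dispatch described above concrete, here's a rough, hedged sketch of the shape of that code path. This is my approximation, not a copy of std's internals; the cfg gate mirrors the "Apple vendor" check mentioned above, and it assumes the libc crate:

use std::io;
use std::os::unix::io::RawFd;

#[cfg(target_vendor = "apple")]
fn datasync(fd: RawFd) -> io::Result<()> {
    // On Apple targets, the data sync is upgraded to a full device cache flush.
    if unsafe { libc::fcntl(fd, libc::F_FULLFSYNC) } == 0 {
        Ok(())
    } else {
        Err(io::Error::last_os_error())
    }
}

#[cfg(all(unix, not(target_vendor = "apple")))]
fn datasync(fd: RawFd) -> io::Result<()> {
    // Elsewhere, fdatasync(2) flushes the file data (plus whatever metadata is
    // needed to read it back) without necessarily flushing the rest.
    if unsafe { libc::fdatasync(fd) } == 0 {
        Ok(())
    } else {
        Err(io::Error::last_os_error())
    }
}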
Topic: App & System Services SubTopic: Core OS Tags:
Sep ’25
Reply to Lock Contention in APFS/Kernel?
Thanks y'all! Ed, I'll try running the system trace tomorrow morning (I'm in New York City, so it's the end of the day for me).

Kevin:

Everything found was largely addressed in macOS 10.14 (Mojave).

Ah! Good to know that my information is kinda out of date :)

Unfortunately, comparisons between different systems are extremely tricky due to "fundamental design differences." One of macOS's fundamental design choices is that its goal is to be "responsive", NOT "fast." In other words, the system's ability to quickly react to the user's actions is MORE important than the total time it takes to complete any given operation. In the I/O system, that means restricting I/O memory commitments more than Linux does, both to reserve memory for other purposes and to reduce the chance that the system will have to stall waiting for memory to be flushed to disk.

Understood! Makes a lot of sense given the common, respective usages of Linux vs. macOS!

Are you calling "fsync" or are you doing fcntl(F_FULLFSYNC)? The second call will be significantly slower, but if you need the data to actually reach the disk, then you need to use "F_FULLFSYNC" (see man fsync(2) and fcntl(2) for more details).

I'm not sure if we're using fsync or fcntl(F_FULLFSYNC), but I'll follow up and give an answer tomorrow.

If all of these tests are interacting with the same dataset, then that obviously exacerbates the issue.

The tests shouldn't be interacting with the same logical data set; we're just using tmp as an on-disk location.

however, the most useful thing you can do is provide tests we can run ourselves.

It's not too bad:

1. I'm guessing y'all have clang/the Xcode command line tools installed, but if you don't, you need those!
2. Install Rust. I recommend using https://rustup.rs/, which will install the Rust compiler and Cargo at ~/.rustup/ and ~/.cargo. It will also add those paths to your $PATH.
3. Clone jj: git clone https://github.com/jj-vcs/jj.git
4. Install the per-process test runner we use, cargo-nextest, by running cargo install cargo-nextest --locked. cargo-nextest should now be in your $PATH, installed at ~/.cargo/bin/cargo-nextest.
5. Inside the cloned jj repo, run cargo nextest run --no-fail-fast.

The above should be sufficient to replicate the slowness we've observed, but let me know if you run into any issues.
Topic: App & System Services SubTopic: Core OS Tags:
Sep ’25
Reply to Using Processor Trace on Non-Xcode Built Binary
Thanks Quinn!

Well, that's an annoying cascade failure. I copied it from you who copied it from Instruments. But, yeah, I shouldn't have let that slip past )-: Thanks for the heads up.

No worries! I should've noticed that; it's my first time seeing a typo in a diagnostic like this, so it's not something I expect to see.

The best way to get these in front of the folks who can actually enact change is to file them in Feedback Assistant. See Bug Reporting: How and Why? for more on that.

Thanks for the information, and sorry for not being familiar with the process of reporting bugs to Apple! For the bugs, here are my feedback items:

- The Instruments OOM: FB18583028.
- The "adhoc-signed binaries shouldn't need an additional entitlement just to be profiled" feedback item: FB18543729.

I submitted both under "Developer Tools & Resources", but I can see an argument for the latter being filed under something closer to "kernel/core OS". Lemme know if you'd like me to file the entitlement feedback there!

Except for the Typo in Diagnostics issue. There's no need for you to file an additional bug about that, unless you want to be notified of the fix.

I didn't submit a feedback item for the typo; I think your bug report is sufficient :). I'll just keep an eye on the changelog of future Xcode releases!
Jul ’25
Reply to Using Processor Trace on Non-Xcode Built Binary
Thanks so much for the response! clang and rustc (by design on Rust's part, to be clear!) are sufficiently similar that it was pretty easy to translate between C++ and Rust!

Your tips/suggestions almost worked for me, except that the binary would be SIGKILL'd immediately after launch. I did some rubber-duck debugging using Claude, and it—rather impressively!—pointed out in https://claude.ai/share/5a4ca3ca-9e98-4e2a-b9ae-71b49c6983cf that the entitlement I needed to use was com.apple.security.get-task-allow, not com.apple.security-get-task-allow. Instruments' diagnostic contained a typo! Once I fixed this typo, I was able to use the "Processor Trace" instrument via xctrace. Of course, since this is beta software, I hit a few bugs, which I'll cover at the end of this post.

Apple silicon code must be signed, so the linker automatically applies an ad-hoc signature. You can see this if you dump the hello tool before re-signing it: [dump redacted] If you're going to re-sign the binary anyway, you can disable linker signing with the -no_adhoc_codesign linker option.

I think the Rust compiler sets the adhoc signature somewhere by default, so while I agree it's kinda wasteful to replace it later, it's also not the worst.

but it's not appropriate for a product that you want to ship to a wide range of users

Yeah, I figured as such. I'm only really using these adhoc binaries as part of my local development workflow.

The Bugs!

Anyways! I promised a few bug reports, here they are!

Typo in Diagnostics

Instruments and xctrace have a typo in the entitlement named in their diagnostics: they both suggest com.apple.security-get-task-allow instead of com.apple.security.get-task-allow. I spent my morning scratching my head over this. See below for the typo:

❯ xctrace record --template 'Processor Trace' --target-stdout - --launch -- target/dev-rel/deps/hir_ty-f1dbf1b1d36575fe --exact tests::incremental::add_struct_invalidates_trait_solve
Starting recording with the Processor Trace template. Launching process: hir_ty-f1dbf1b1d36575fe.
Ctrl-C to stop the recording
Run issues were detected (trace is still ready to be viewed):
* [Error] Processor Trace cannot profile this process without proper permissions.
* [Error] Recovery Suggestion: Either: 1. Add the 'com.apple.security-get-task-allow' entitlement to your binary entitlements file, or 2. Make sure the build setting CODE_SIGN_INJECT_BASE_ENTITLEMENTS = YES is enabled when building with Xcode.
Recording failed with errors. Saving output file...
Output file saved as: Launch_hir_ty-f1dbf1b1d36575fe_2025-07-02_13.39.40_EBCB3760.trace

OOMs in Instruments

I had to use xctrace record --template 'Processor Trace' --target-stdout - --launch -- hir_ty-f1dbf1b1d36575fe --exact tests::incremental::add_struct_invalidates_trait_solve from my shell instead of launching from Instruments directly, as Instruments ended up using something like 130 GB of RAM, forcing me to restart my Mac. I only have a paltry 48GB!

An Entitlement for Profiling Using Hardware Profiling Features Feels Strange

I would personally expect that a linker-signed, adhoc binary would imply com.apple.security.get-task-allow. Some additional, assorted thoughts:

- rustc tends to use the system's C compilers to indirectly drive the linker, which means that if there's some feature that the linker should be doing automatically on said platform, the Rust compiler does it.
- Given that the Rust compiler produces adhoc-signed binaries by default and I can debug those binaries using lldb without any additional entitlements, I'd expect the same of hardware-supported CPU execution tracing (modulo restarting my Mac), especially if the CPU execution tracing in M4 processors is anything similar to Intel's Processor Trace: https://lldb.llvm.org/use/intel_pt.html.
- A friend pointed out that your phrasing of "That is, the signature applied by the linker" should have clued me into the fact that there's some linker magic happening with ld_prime. Cards on the table, I think there should be a little more magic happening :D. The new profiling functionality is downright magical. It just works with non-Swift/C/Objective-C languages, and I feel like it's a shame that end-users need to become deeply familiar with codesigning to use this incredible set of tooling. Heck, I bought the Mac I'm writing this post on two days ago in order to use this feature!

Anyways, thanks so much for your help and let me know if I can provide any additional detail!
Jul ’25
Reply to Can FSEvents include Snapshots of the Changed Files?
Kevin: thank you so much for the extremely detailed response about the different APIs available!

[quote='791817022, DTS Engineer, /thread/757382?answerId=791817022#791817022'] Keep in mind that in many cases these APIs are actually best used to complement each other, not as direct replacements. For example, an app might use FSEvents to determine that a large hierarchy "has changed" while the app was not running, then rely on kqueue or Dispatch for realtime monitoring while it is running. [/quote]

Interesting, gotcha! Here's a related question: since rust-analyzer is only interested in file change events while the IDE is running (we take deliberate steps to avoid any external, serializable state for a bunch of reasons that I'll elide for now, but can get into later!), does it still make sense to do the layering you describe, or can we reasonably rely on the real-time approaches?

For context, I'd group the two types of file events we get as "human-driven" (where a user might create a new file or whatever) or "machine-driven" (where the user switched branches or is rebasing, so we'd see a lot of file change events in quick succession). We ran into the most issues with the latter, most commonly in the form of stale diagnostics when fed file change events by VS Code.

That is a great question that I'm not (quite) ready to answer yet, but I wanted to reply with what I already had. I'll have more to say about this in the next day or two.

Thanks again! I also realized that I didn't clarify the current state particularly well: by switching to using FSEvents via the Notify Rust library, rust-analyzer's reliability during rebases went from "guaranteed to be broken" to "basically works every time". I'm mostly asking whether the last few percentage points of reliability are possible, and whether there are any potential footguns we'd inadvertently introduce for macOS users by relying on FSEvents directly (e.g., would there be any gotchas with this approach on an NFS-based file system like Eden? Are there any potential negative interactions with FSKit-based virtual file systems?)
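For concreteness, here's a minimal sketch of what the FSEvents-backed watching looks like through the Notify crate; this assumes notify's v6-era API and is an illustration rather than rust-analyzer's actual integration:

use std::path::Path;
use std::time::Duration;

use notify::{recommended_watcher, Event, RecursiveMode, Result, Watcher};

fn main() -> Result<()> {
    // On macOS, recommended_watcher() builds an FSEvents-backed watcher.
    let mut watcher = recommended_watcher(|res: Result<Event>| match res {
        Ok(event) => println!("change: {:?}", event),
        Err(err) => eprintln!("watch error: {:?}", err),
    })?;

    // Watch the whole checkout; a rebase shows up as a burst of these events.
    watcher.watch(Path::new("."), RecursiveMode::Recursive)?;

    // Keep the process alive long enough to observe something.
    std::thread::sleep(Duration::from_secs(60));
    Ok(())
}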
Topic: App & System Services SubTopic: Core OS Tags:
Jun ’24