Optimizing SCSI HBA Constraints and Alignment for DriverKit on Apple Silicon

Hi Kevin,

I'm starting this new thread to focus on alignment optimization and recalibrating our HBA constraints.

Following up on your suggestion about UserReportHBAConstraints and alignment optimization, here are our current DEXT settings:

Via UserReportHBAConstraints():

  • kIOMaximumSegmentCountRead/WriteKey: 129
  • kIOMaximumSegmentByteCountRead/WriteKey: 65,536 (64 KB)
  • kIOMinimumSegmentAlignmentByteCountKey: 4 bytes
  • kIOMaximumSegmentAddressableBitCountKey: 32
  • kIOMinimumHBADataAlignmentMaskKey: 0

Via SetProperties() (additional injection):

  • kIOMaximumByteCountRead/WriteKey: 524,288 (512 KB)
  • kIOMaximumBlockCountRead/WriteKey: 1,024

We inherited the segment count (129) and max I/O length (512 KB) from our legacy KEXT, which were originally calculated based on a 4 KB segment size (Max I/O 512 KB / 4 KB + 1 = 129). The current alignment value of 4 was essentially a placeholder, as the legacy hardware didn't enforce strict page-level alignment.
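
As a rough check of that arithmetic, here is a minimal sketch. The function name is hypothetical, and reading the "+ 1" as headroom for a buffer that does not start on a segment boundary is an assumption, not something stated in the thread.

```cpp
#include <cstdint>

// Worst-case scatter/gather segment count for one transfer.
// Assumption: the extra +1 covers a buffer that starts partway through a
// segment-sized unit and therefore spills into one additional segment.
constexpr uint64_t worstCaseSegmentCount(uint64_t maxTransferBytes,
                                         uint64_t segmentBytes) {
    return (maxTransferBytes + segmentBytes - 1) / segmentBytes + 1;
}

// 512 KB maximum I/O with 4 KB segments: the legacy value of 129.
static_assert(worstCaseSegmentCount(512 * 1024, 4 * 1024) == 129,
              "512 KB / 4 KB + 1");
// The same 512 KB limit with 16 KB segments.
static_assert(worstCaseSegmentCount(512 * 1024, 16 * 1024) == 33,
              "512 KB / 16 KB + 1");
```

The same formula gives 33 segments once the segment size is raised to 16 KB while the 512 KB transfer limit stays fixed.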

Given that our testing is on Apple Silicon, we are considering increasing kIOMinimumSegmentAlignmentByteCountKey to 16,384 (16 KB) to match the native page size. However, I have two specific questions regarding this:

  1. Stripe Size vs. Page Size: Our RAID stripe size is typically larger than 16 KB (e.g., 64 KB or 128 KB). Should we be aligning the system to the RAID stripe size for hardware efficiency, or is it more critical to stick to the 16 KB page size to optimize the IOMMU/DART mapping overhead in DriverKit?

  2. Recalibration: If we increase the alignment to 16 KB, should we also adjust the kIOMaximumSegmentByteCount to match (i.e., 16 KB), or is it better to keep it at 64 KB to allow fewer, larger segments per I/O?

We suspect that the 38% gain we saw in 4 KB Random Reads might improve even further if we fix this alignment bottleneck. Looking forward to your thoughts.

Best regards,

Charles

> Given that our testing is on Apple Silicon, we are considering increasing kIOMinimumSegmentAlignmentByteCountKey to 16,384 (16 KB) to match the native page size. However, I have two specific questions regarding this:

First off, my disclaimer here is that I don't actually "know" what the right answer here is. I have some general intuition about what I think will work, however:

  1. It's entirely possible I've missed some detail that will render my suggestions invalid.

  2. Even if my general guidance is valid, it's very possible/likely that the specific details of a given hardware implementation would invalidate that guidance.

...which means the right answer here really comes down to testing.

> Stripe Size vs. Page Size: Our RAID stripe size is typically larger than 16 KB (e.g., 64 KB or 128 KB). Should we be aligning the system to the RAID stripe size for hardware efficiency, or is it more critical to stick to the 16 KB page size to optimize the IOMMU/DART mapping overhead in DriverKit?

So, I think there are two different issues here:

  1. In practice, I'd expect most of your transfers are already 16 KB multiples, so increasing to the page size is basically "formalizing" the existing behavior. I'm not sure you'll actually get a huge performance boost from increasing this size (since your I/Os were already that large), but I think it might "tidy up" the system’s general behavior.

  2. When increasing to your stripe size, the key issue here is RAID 5 and this:

> Since 4K and 16K random writes are smaller than the RAID stripe size, they often trigger Read-Modify-Write (RMW) cycles,

My expectation is that those RMW cycles are a huge performance drain, probably larger than any other single factor. The big question here is whether the performance benefit of eliminating the extra read outweighs the extra memory cost... and the answer is that I don't know, you'll need to test. My guess is that the extra memory will be relatively marginal, but there's no way to know without testing.
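
To make the RMW condition concrete, here is a minimal sketch of a full-stripe check. All names are hypothetical. It assumes the quoted "stripe size" is the per-disk stripe unit and that a RAID 5 set of N disks carries N - 1 data units per stripe; the real controller's geometry may differ.

```cpp
#include <cstdint>

// Hypothetical description of a RAID 5 set.
struct Raid5Geometry {
    uint64_t stripeUnitBytes;  // bytes written to one disk per stripe
    uint32_t diskCount;        // total disks, including the parity share
};

// Data bytes in one full stripe: one unit per disk, minus one unit of parity.
constexpr uint64_t fullStripeBytes(Raid5Geometry g) {
    return g.diskCount < 2 ? 0 : g.stripeUnitBytes * (g.diskCount - 1);
}

// True when a write covers whole stripes only, so parity can be computed
// from the new data alone. Anything else needs extra reads first.
constexpr bool isFullStripeWrite(Raid5Geometry g, uint64_t offsetBytes,
                                 uint64_t lengthBytes) {
    return fullStripeBytes(g) != 0 && lengthBytes != 0 &&
           offsetBytes % fullStripeBytes(g) == 0 &&
           lengthBytes % fullStripeBytes(g) == 0;
}
```

Under those assumptions, a 4-disk set with 64 KB units has a 192 KB full stripe, so a 4 KB or 16 KB random write can never satisfy `isFullStripeWrite`.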

> Recalibration: If we increase the alignment to 16 KB, should we also adjust the kIOMaximumSegmentByteCount to match (i.e., 16 KB), or is it better to keep it at 64 KB to allow fewer, larger segments per I/O?

I think you want kIOMaximumSegmentByteCount to be as large as possible.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Hi Kevin,

Thank you for the detailed explanation regarding page size alignment. We followed your advice and recalibrated our HBA constraints during the Start() sequence to match the native 16KB page size of Apple Silicon. Our specific settings are as follows:

  • kIOMaximumSegmentByteCount: 16,384 (16 KB)
  • kIOMaximumSegmentCount: 33 (maintaining our 512 KB hardware limit)
  • Mode: Bundled Mode enabled (256 slots)

Testing on our new RAID 5 hardware (4 HDDs) showed a dramatic reduction in system overhead by aligning segments to the 16KB page boundaries and leveraging the batching capabilities of Bundled Mode.

Key findings from our benchmarks (Bundled vs. Legacy):

  • Random Reads (16K - 64K): We observed a massive performance leap. At 16K (single-page I/O), Bundled Mode outperformed Legacy by 162%. At 32K and 64K, the gains increased to between 220% and 288%.
  • Mixed Workloads (70% Read / 30% Write): Across various block sizes, Bundled Mode consistently delivered a throughput increase ranging from 40% to 290%.
  • System Responsiveness: The system feels significantly more responsive during concurrent small-file or metadata-heavy operations. This confirms that reducing context switches and simplifying IOMMU (DART) mapping complexity is indeed the key to performance on the Apple Silicon platform.

One interesting observation: in 4K Random Read tests, Legacy Mode occasionally showed slightly higher numbers, which we suspect is due to specific cache-hit behaviors. However, as soon as the I/O size reached or exceeded the 16KB page size, Bundled Mode became the undisputed winner.

Your insight was spot on: at the DriverKit level, page size alignment is more critical than RAID stripe alignment. By streamlining the DART mapping process, we’ve effectively unblocked the I/O pipeline.

Best regards,

Charles

> Testing on our new RAID 5 hardware (4 HDDs) showed a dramatic reduction in system overhead by aligning segments to the 16KB page boundaries and leveraging the batching capabilities of Bundled Mode.

Interesting. That's actually much larger than I expected, as my intuition here was that the system would have ended up effectively aligning you at 16KB, even though you allowed 4KB alignment (nothing obligates it to "force" a smaller alignment).

Have you actually compared the typical I/O sizes you're getting from the system in the two different cases?

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Hi Kevin,

Thank you for the deep dive into DART (IOMMU) mapping overhead. To address your question regarding the "typical I/O sizes" the system dispatches, we performed an A/B comparison to observe the behavior of the macOS Storage Stack under different kIOMaximumSegmentByteCount configurations.

Testing Methodology:

We instrumented UserProcessBundledParallelTasks_Impl to log the exact fRequestedTransferCount for every incoming task. While the documentation refers to this field as a "count," the values received (e.g., 16,384) confirm it represents the total byte count. To stimulate the kernel's merging logic, we used fio with a mixed workload ranging from 4KB to 64KB (bsrange=4k-64k).
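
For readers who want to collect a similar distribution, here is a minimal sketch of a power-of-two size histogram. This is not the logging code used in the test above; the type and function names are hypothetical, and the value passed to `record` is assumed to be the per-task byte count (fRequestedTransferCount in this thread).

```cpp
#include <cstdint>

// Bucket i counts transfers with 2^i <= bytes < 2^(i+1); 0 and 1 land in bucket 0.
constexpr unsigned kBucketCount = 32;

// Index of the highest set bit, clamped to the last bucket.
constexpr unsigned bucketIndexFor(uint64_t transferBytes) {
    unsigned index = 0;
    while (transferBytes > 1 && index + 1 < kBucketCount) {
        transferBytes >>= 1;
        ++index;
    }
    return index;
}

// Fixed-size counters, so recording a transfer never allocates.
struct TransferHistogram {
    uint64_t counts[kBucketCount] = {};

    void record(uint64_t transferBytes) {
        counts[bucketIndexFor(transferBytes)] += 1;
    }
};
```

With this bucketing, 4 KB transfers land in bucket 12, 16 KB in bucket 14, and 512 KB in bucket 19.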

The following distribution was captured during a 10-second high-load test:

I/O Size (Bytes) | Case A (16K Limit) | Case A % | Case B (64K Limit) | Case B %
-----------------|--------------------|----------|--------------------|---------
16,384 (16K)     | 3,174              | 81.2%    | 2,906              | 82.9%
524,288 (512K)   | 526                | 13.5%    | 430                | 12.3%
262,144 (256K)   | 210                | 5.3%     | 167                | 4.8%
4,096 (4K)       | 0                  | 0%       | 2                  | 0.06%

(Note: The 512K and 256K requests are large composite tasks consisting of multiple segments, respecting our 512KB hardware limit set via HBA constraints.)
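
The percentages can be re-derived from the raw counts with a short sketch like the one below. The counts are copied from the table; the type and function names are hypothetical. It also weights each row by bytes, which is simply a second view of the same log and not an additional measurement.

```cpp
#include <cstddef>
#include <cstdint>

struct SizeRow {
    uint64_t transferBytes;  // size of each request in this row
    uint64_t requestCount;   // how many requests of that size were logged
};

// Request counts copied from the table in this post.
constexpr SizeRow kCaseA[] = {{16384, 3174}, {524288, 526}, {262144, 210}, {4096, 0}};
constexpr SizeRow kCaseB[] = {{16384, 2906}, {524288, 430}, {262144, 167}, {4096, 2}};

template <std::size_t N>
constexpr uint64_t totalRequests(const SizeRow (&rows)[N]) {
    uint64_t total = 0;
    for (std::size_t i = 0; i < N; ++i) total += rows[i].requestCount;
    return total;
}

template <std::size_t N>
constexpr uint64_t totalBytes(const SizeRow (&rows)[N]) {
    uint64_t total = 0;
    for (std::size_t i = 0; i < N; ++i)
        total += rows[i].transferBytes * rows[i].requestCount;
    return total;
}

// Share of `part` in `whole`, as a percentage.
constexpr double percent(uint64_t part, uint64_t whole) {
    return whole == 0 ? 0.0
                      : 100.0 * static_cast<double>(part) / static_cast<double>(whole);
}
```

For example, `percent(kCaseA[0].requestCount, totalRequests(kCaseA))` gives the request share of 16 KB I/O in Case A, while dividing `kCaseA[0].transferBytes * kCaseA[0].requestCount` by `totalBytes(kCaseA)` gives its share of the bytes moved.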

Observations and Conclusions:

  1. Validation of Intuition: The data perfectly aligns with your intuition. Even in Case B, where we allowed up to 64KB segments, over 82% of the requests remained exactly 16KB. macOS on Apple Silicon clearly has an inherent preference for 16KB alignment, likely matching the native page size.
  2. Merging Efficiency: Tiny 4KB fragments were almost entirely non-existent. Even though fio issued 4KB requests, the Storage Stack merged them into 16KB blocks before they reached the DEXT.
  3. The 2.8x Performance Secret: This explains why the 16KB segment limit provided such a massive boost. Since 16KB is already the "natural" size the system prefers to send, explicitly locking kIOMaximumSegmentByteCount to 16KB removes the cost of mapping segments that cross page boundaries through the DART, and it does so without forcing the system to send smaller or more fragmented I/O.
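
To make the page-crossing point concrete, here is a minimal sketch that counts how many 16 KB pages a single segment spans. It is only address arithmetic with hypothetical names; it does not model what the DART actually does or what a mapping costs.

```cpp
#include <cstdint>

// Native page size on Apple Silicon, as discussed in this thread.
constexpr uint64_t kPageBytes = 16 * 1024;

// Number of pages that the byte range [address, address + length) overlaps.
constexpr uint64_t pagesTouched(uint64_t address, uint64_t length) {
    return length == 0
               ? 0
               : (address + length - 1) / kPageBytes - address / kPageBytes + 1;
}

// A 16 KB segment that starts on a page boundary stays inside one page.
static_assert(pagesTouched(0, 16 * 1024) == 1, "aligned 16 KB segment");
// The same segment starting 4 KB into a page straddles two pages.
static_assert(pagesTouched(4 * 1024, 16 * 1024) == 2, "unaligned 16 KB segment");
```

A page-aligned 64 KB segment spans four pages by the same arithmetic, and five if it starts partway into a page.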

By setting the driver's constraints to match the kernel's native behavior, we’ve effectively put the I/O path on the "fast track" by simplifying the mapping logic.

Best regards,

Charles
