I'm modifying less than 1 MB of a 256 MB managed buffer and calling didModifyRange for the changed regions, but according to Metal System Trace the whole buffer is copied to the GPU (SDMA0 channel, "Page On 268435456 bytes"), which takes 13 ms.
I'm making lots of small modifications (~4k) per frame. I tried coalescing them into a single call to didModifyRange (covering ~66 MB), and the entire buffer is still copied. However, when I call didModifyRange for only the first byte, the amount of data copied is small.
So I'm wondering: why does didModifyRange seem to be inefficient for many small updates to a big buffer?
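
For readers who want to reproduce this, here is a minimal sketch of the kind of range coalescing described above. It is not the exact code used: the helper name `coalesce`, its `gap` parameter, and the sample offsets are hypothetical, and the buffer itself is only referenced in a comment so the snippet runs without a GPU.

```swift
// Hypothetical helper: merge dirty byte ranges that overlap or lie
// within `gap` bytes of each other, so fewer ranges are reported.
func coalesce(_ ranges: [Range<Int>], gap: Int = 0) -> [Range<Int>] {
    // Drop empty ranges and sort by start offset.
    let sorted = ranges
        .filter { !$0.isEmpty }
        .sorted { $0.lowerBound < $1.lowerBound }

    var merged: [Range<Int>] = []
    for r in sorted {
        if let last = merged.last, r.lowerBound <= last.upperBound + gap {
            // Extend the previous range instead of starting a new one.
            let upper = max(last.upperBound, r.upperBound)
            merged[merged.count - 1] = last.lowerBound..<upper
        } else {
            merged.append(r)
        }
    }
    return merged
}

// Illustrative offsets only (assumed, not taken from the real workload).
let dirty: [Range<Int>] = [0..<4096, 4096..<8192, 1_000_000..<1_004_096]
let merged = coalesce(dirty)

// With a real managed MTLBuffer named `buffer` (assumed), each merged
// range would then be reported after the CPU writes:
// for r in merged { buffer.didModifyRange(r) }
```

Passing a very large `gap` collapses everything into one range spanning the first to the last modified byte, which corresponds to the single ~66 MB call mentioned above.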