I've been using Metal compute shaders for lattice quantum chromodynamics simulations and wanted to share the experience in case others are doing scientific computing on Metal.
The workload involves SU(2) matrix operations on 4D lattice grids — lots of 2x2 and 3x3 complex matrix multiplies, reductions over lattice sites, and nearest-neighbor stencil operations. The implementation bridges a C++ scientific framework (Grid) to Metal via Objective-C++ .mm files, with MSL kernels compiled into .metallib archives during the build.
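For readers unfamiliar with the data layout, the stencil part boils down to linearizing 4D site coordinates and looking up periodic nearest neighbors. A minimal host-side sketch of that bookkeeping — generic, not Grid's actual indexing scheme; `Lattice4D` and its method names are illustrative:

```cpp
#include <array>
#include <cstdint>

// Hypothetical 4D lattice geometry (not Grid's real layout): sites are
// linearized in (x, y, z, t) order with x running fastest.
struct Lattice4D {
    std::array<int, 4> dims; // {Lx, Ly, Lz, Lt}

    // Linear index of a site from its 4D coordinate.
    int64_t index(std::array<int, 4> c) const {
        int64_t idx = 0;
        for (int mu = 3; mu >= 0; --mu)
            idx = idx * dims[mu] + c[mu];
        return idx;
    }

    // Coordinate of the nearest neighbor in direction mu (sign = +1 or -1),
    // with periodic boundary conditions, as a stencil kernel would need.
    std::array<int, 4> neighbor(std::array<int, 4> c, int mu, int sign) const {
        c[mu] = (c[mu] + sign + dims[mu]) % dims[mu];
        return c;
    }
};
```

The same arithmetic ports directly into an MSL kernel body, with the linear index derived from the thread position in the grid.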
Things that work well:
- Unified memory on M-series eliminates the CPU↔GPU copy overhead that dominates discrete-GPU CUDA workflows
- The .metallib compilation integrates cleanly with autotools builds using xcrun
- `float4` packing for SU(2) matrices maps naturally onto MSL vector types
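On the `float4` point: an SU(2) element [[a, b], [-b*, a*]] is fixed by two complex numbers, i.e. four reals, so one matrix fits exactly in an MSL `float4`, and the product stays in the same form. A host-side C++ sketch of the packed multiply (the `SU2` type and function names are mine, not from the post):

```cpp
#include <complex>
#include <cmath>

// An SU(2) matrix [[a, b], [-conj(b), conj(a)]] is fully determined by
// two complex numbers, i.e. four reals -- exactly one MSL float4.
struct SU2 {
    std::complex<float> a, b; // packs as (a.re, a.im, b.re, b.im)
};

// The product of two SU(2) matrices is again of SU(2) form, so only
// a' and b' need computing: 4 complex multiplies instead of 8.
inline SU2 mul(const SU2& u, const SU2& v) {
    return { u.a * v.a - u.b * std::conj(v.b),
             u.a * v.b + u.b * std::conj(v.a) };
}

// det U = |a|^2 + |b|^2; equals 1 for a properly normalized element.
inline float det(const SU2& u) {
    return std::norm(u.a) + std::norm(u.b);
}
```

In MSL the same thing is written against `float4` components, but the algebra is identical.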
Things I'm still figuring out:
- Optimal threadgroup sizes for stencil operations on 4D grids
- Whether to use MTLHeap for gauge field storage or stick with individual buffers
- Best practices for double precision: some measurements need float64, but the Metal Shading Language has no native double type in shaders, so float64 work means software emulation or falling back to the CPU
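On the double-precision point: one common workaround when the shader language lacks a native `double` is float-float ("double-single") arithmetic, carrying a value as an unevaluated sum of two floats for roughly twice the mantissa bits. A sketch of the core building block, Knuth's TwoSum (names here are illustrative, not from any particular library):

```cpp
#include <cmath>

// Float-float value: the unevaluated sum hi + lo, |lo| <= ulp(hi)/2.
struct ff { float hi, lo; };

// Knuth's TwoSum: the sum of two floats plus its exact rounding error.
inline ff two_sum(float x, float y) {
    float s   = x + y;
    float bb  = s - x;
    float err = (x - (s - bb)) + (y - bb);
    return {s, err};
}

// Add a plain float to a float-float value, renormalizing the result.
inline ff ff_add(ff a, float y) {
    ff s = two_sum(a.hi, y);
    s.lo += a.lo;
    return two_sum(s.hi, s.lo);
}
```

Two caveats: this only works if the compiler preserves the rounding behaviour, which on Metal likely means disabling fast math in the compile options (e.g. `MTLCompileOptions.fastMathEnabled = NO`), and it extends precision without extending the float exponent range.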
The application is measuring chromofield flux distributions between static quarks, ultimately targeting multi-quark systems. Production runs are on MacBook Pro M-series and Mac Studio.