Thanks for getting back to me! It appears that upgrading to Xcode 26.1 has fixed the issue, and the headers are now detected correctly.
By the way, I noticed that there are many discrepancies between the documentation and the shipped APIs. I suppose you guys are aware of this and working on a fix? We also really need a Performance Primitives tuning guide. The API is very flexible, and finding settings that actually perform well can be challenging. For example, I have yet to find a tile size for which using a multi-simdgroup execution scope does not result in a performance regression. Also, what about bfloat?