Amazing - markdown - good to know. Thanks!
> I didn't write the OpenCL version. I'm pretty sure it isn't doing 8x the amount of work. I assume one of the many glue functions is doing the math somewhere.

It's not doing 8x the amount of work overall; it's doing 8x the amount of work in a single kernel invocation. Specifically, it processes 8 floats per kernel execution, whereas the Metal kernel processes a single float. That means, for example, that when you submit work in OpenCL with clEnqueueNDRangeKernel, your work dimension is 8x smaller than in Metal. That is far fewer kernel invocations, so you are adding per-invocation overhead to the Metal code. Metal is never going to win under those circumstances. To make them equivalent, you would need to process 8 x/y pairs per thread in Metal. Maybe an Apple engineer can chime in here.
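To illustrate, here is a minimal sketch of what an "8 pairs per thread" Metal kernel could look like. The buffer names, the placeholder multiply, and the batch size are all assumptions for illustration; the real kernel would do whatever the OpenCL version does per element. The key point is that the dispatch grid shrinks to 1/8 of the element count.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical example: each thread handles 8 x/y pairs, matching the
// per-invocation workload of the OpenCL kernel. Buffer names and the
// multiply are placeholders.
kernel void process8(device const float *x      [[buffer(0)]],
                     device const float *y      [[buffer(1)]],
                     device float       *result [[buffer(2)]],
                     constant uint      &count  [[buffer(3)]],
                     uint tid [[thread_position_in_grid]])
{
    // Dispatch (count + 7) / 8 threads instead of count threads.
    uint base = tid * 8;
    for (uint i = 0; i < 8; ++i) {
        uint idx = base + i;
        if (idx < count) {
            result[idx] = x[idx] * y[idx];   // placeholder per-pair math
        }
    }
}
```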
> Also, as I mentioned before, my SIMD and 3D experience predates Metal, OpenCL, and OpenGL. Most of the discussions involving Metal assume that someone is 1) writing a game, 2) has used Metal on iOS, 3) has used OpenGL or OpenCL, DirectX, etc. I don't meet any of those assumptions. For example, the word "global" doesn't appear in Apple's Metal Shading Language Specification. I know what it means in other contexts, but not here. I assume you mean it is the same as "device". For example, I do know what "shading" means in a 3D context, but I'm also pretty sure that I will never actually perform that operation. I have very specific goals for this project.

Agreed. The Metal documentation assumes you already know things, partly because there is so much documentation out there on CUDA especially, and on OpenCL to a lesser degree. Once you get the big-picture design principles of Metal as a system, and what's different about how you run things, it's largely the same inside the actual kernels. Many of the standard optimization tricks that are extensively discussed in CUDA land still apply (e.g. limiting the number of kernel invocations, using shared memory, watching how you access memory). Apple calls things by different names (because they "think different", perhaps), but conceptually they are largely the same.

So with memory: yes, global is device, const is constant, shared is threadgroup. How you use them always depends on the architecture to some degree, but they are mostly similar in concept. Even within device memory there are options (private, managed, etc.); these all have equivalents in OpenCL and can make a huge difference depending on what you are doing. You just have to figure out the mapping between OpenCL and Metal to use them efficiently.
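For what it's worth, here is a rough sketch (not from any particular project, just an illustration) of how those address spaces show up in a Metal compute kernel: `device` where OpenCL would use `__global`, `constant` for `__constant`, and `threadgroup` for `__local`/shared memory. The reduction assumes a power-of-two threadgroup size and is only there to give the qualifiers something to do.

```metal
#include <metal_stdlib>
using namespace metal;

// OpenCL __global   -> Metal device
// OpenCL __constant -> Metal constant
// OpenCL __local    -> Metal threadgroup
kernel void reduceSum(device const float *input    [[buffer(0)]],      // "global" memory
                      constant uint      &count    [[buffer(1)]],      // read-only constant memory
                      device float       *partials [[buffer(2)]],
                      threadgroup float  *scratch  [[threadgroup(0)]], // "shared"/local memory
                      uint tid    [[thread_position_in_threadgroup]],
                      uint gid    [[thread_position_in_grid]],
                      uint tgid   [[threadgroup_position_in_grid]],
                      uint tgSize [[threads_per_threadgroup]])
{
    // Stage one value per thread into threadgroup (shared) memory.
    scratch[tid] = (gid < count) ? input[gid] : 0.0f;
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Simple tree reduction within the threadgroup (assumes tgSize is a power of two).
    for (uint stride = tgSize / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            scratch[tid] += scratch[tid + stride];
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }
    if (tid == 0) {
        partials[tgid] = scratch[0];
    }
}
```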
I only do compute with Metal, coming from OpenCL. I have found that float precision is identical to OpenCL (and equivalent to CPU float code), but you have to watch how you compile and write your code if you need highly precise float results. If you need double, you are indeed out of luck.
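One concrete thing to watch here: Metal's shader compiler enables fast math by default, which relaxes float accuracy. You can disable that through the compile options, or reach for the `precise::` math variants inside the kernel. A tiny sketch of the latter (buffer names are made up):

```metal
#include <metal_stdlib>
using namespace metal;

// Illustration only: comparing the default fast-math transcendental with
// its precise:: counterpart.
kernel void sinDiff(device const float *input  [[buffer(0)]],
                    device float       *output [[buffer(1)]],
                    uint gid [[thread_position_in_grid]])
{
    float f = fast::sin(input[gid]);     // approximate, the fast-math default
    float p = precise::sin(input[gid]);  // closer to IEEE behavior, slower
    output[gid] = p - f;                 // usually tiny, but not always zero
}
```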
After you do the simple things (like increasing the amount of work per kernel and using threadgroups appropriately), that's when I typically turn to GPU frame capture. For me the biggest help there is with register pressure inside a kernel, which is hard to eyeball.