Have you looked at the performance information from Capture GPU Frame?
No. I'm a total GPU newbie. My knowledge of SIMD computing dates from before GPUs were popular.
Unfortunately, I don't think you get the fine-grained stats on Intel/Discrete that are available on A10X and newer SoCs. If you expect the code to live into the Apple Silicon era, exploring performance on a modern iPad might not be a waste of time.
This code might never actually be deployed on Intel. Apple Silicon, on iOS or macOS, is all I care about. I will definitely look at GPU performance when I have this running on iOS. But for now, it is still a proof of concept to just get the logic implemented on Metal. I'm using the OpenCL code as a path of least resistance to port to Metal.
Also, would OpenCL be using both CPU and GPU while Metal is only using GPU?
I don't think so. This code can target either the CPU or the GPU, but I think it has to be one or the other, not both at once.
I think it is just a side effect of using this older 2014 machine. I think OpenCL is better optimized for it than it is for the 2017. On the newer machine, Metal does much better than it does on the 2014. OpenCL still wins 90% of the time, but sometimes Metal wins on the 2017.