I had to perform a 2D FFT on very large tensors, so I did some research on vDSP's FFT routines and found out the following:
There is no batch function for the 2D FFT (nothing like fftm for 1D). To run fft2 over my whole tensor, I had to put it in a loop and batch it manually, advancing my pointers across the tensor for each call, which was obviously pretty slow. I couldn't make a single fft2 call over the whole tensor because the log2N parameter would be too large.
I found a trick: do an fftm, transpose the tensor so that the columns become rows and are therefore contiguous in memory, then do another fftm. This was the fastest approach I could find, even though the transpose costs some time too.
Basically, I followed all the tips I found in the documentation to get the best performance from vDSP: using a stride of 1 as much as possible (that's why I transposed my tensor between the two fftm calls) and allocating 16-byte-aligned memory with posix_memalign.
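For the alignment part, the allocation looks roughly like the sketch below. The helper names alloc_aligned_floats and is_aligned_16 are hypothetical, chosen just for this example; posix_memalign itself is the standard POSIX call, and memory obtained from it is released with free.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* expose posix_memalign under strict C modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate `count` floats on a 16-byte boundary.
   Returns NULL on failure; release the buffer with free(). */
float *alloc_aligned_floats(size_t count) {
    void *p = NULL;
    /* posix_memalign returns 0 on success and an error number otherwise. */
    if (posix_memalign(&p, 16, count * sizeof(float)) != 0)
        return NULL;
    return p;
}

/* Returns 1 if the address is a multiple of 16, 0 otherwise. */
int is_aligned_16(const void *p) {
    return ((uintptr_t)p % 16) == 0;
}
```

A buffer obtained this way can then be handed to the FFT routines as the real or imaginary plane of the data.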
However, I am a beginner developer and could definitely have missed something that made my vDSP FFT too slow. The fact is that my current results didn't match the performance of GPU frameworks such as cuFFT in CUDA. That is why I thought a highly optimized Metal FFT might exist in MPS.