Have you looked at the generated assembly?
How are you measuring the performance?
My last attempt at writing NEON code was also disappointing. Looking at the generated code, it was clear that my NEON version had significant overhead before and after the NEON instructions, moving the data to and from the vector registers. The non-NEON version, on the other hand, was better optimised, and the compiler actually managed to use some vector instructions in it.
NEON registers are 128 bits wide, so with your 64-bit elements the best possible speedup is 2X, while with 32-bit elements (e.g. floats) the best possible is 4X, and with bytes it is 16X. Based on my experience, I wouldn't try to use NEON to get a maximum improvement of only 2X, but I probably would consider it to get a 16X improvement. 4X is borderline.
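Those ceilings are just the number of elements that fit in one vector register. A minimal sketch of the arithmetic in C, assuming 128-bit NEON registers; `max_simd_speedup` is a made-up helper name for illustration, not part of any NEON API:

```c
#include <assert.h>

/* Width of a NEON quad-word register in bits (assumption stated above). */
enum { NEON_REGISTER_BITS = 128 };

/* Hypothetical helper: how many elements of the given width fit in one
   128-bit register. This is the theoretical upper bound on the speedup
   over processing one element at a time; real code will get less because
   of load/store and setup overhead. */
unsigned max_simd_speedup(unsigned element_bits)
{
    /* Only element widths that divide the register width make sense. */
    assert(element_bits > 0 && NEON_REGISTER_BITS % element_bits == 0);
    return NEON_REGISTER_BITS / element_bits;
}
```

Calling it with 64, 32 and 8 gives 2, 4 and 16 respectively. It is only an upper bound; the overhead of getting data in and out of the vector registers comes off that.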
Is this code for a board game?