I am on tensorflow-macos 2.9 and tensorflow-metal 0.5, on an M2 Max with 96 GB of memory.
I ran into this issue while training a Hugging Face DistilBERT model on my own dataset. My batch size is only 128 (smaller than the 512 you reported, though the impact probably depends on the model).
I suspect this is a memory issue (or memory mismanagement/misalignment caused by framework bugs). I will try reducing the batch size and see if that helps.
Even if it does, that would be quite a disappointment, since I got 96 GB specifically to push the batch size up in my local environment.
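For anyone who wants to run the same batch-size experiment, here is a minimal sketch in plain Python. It assumes the failure shows up as a Python exception; if the Metal plugin kills the process outright, this will not catch it. `find_working_batch_size` and `fake_train_step` are hypothetical names of my own, not part of TensorFlow, and `fake_train_step` is only a stand-in for one real training step (for example a single `model.fit` call on a small slice of the data).

```python
from typing import Callable, Optional


def find_working_batch_size(
    train_step: Callable[[int], None],
    start: int = 128,
    minimum: int = 1,
) -> Optional[int]:
    """Halve the batch size until train_step(batch_size) stops raising.

    Returns the first batch size that works, or None if none does.
    """
    minimum = max(minimum, 1)  # guard against looping forever at 0
    batch_size = start
    while batch_size >= minimum:
        try:
            train_step(batch_size)
            return batch_size
        except Exception as exc:  # broad on purpose: the error type is unknown
            print(f"batch_size={batch_size} failed: {exc!r}")
            batch_size //= 2
    return None


def fake_train_step(batch_size: int) -> None:
    """Stand-in for a real training step; the limit of 32 is made up."""
    if batch_size > 32:
        raise RuntimeError("simulated failure at large batch size")


if __name__ == "__main__":
    print(find_working_batch_size(fake_train_step, start=128))
```

To use it for real, replace `fake_train_step` with a function that runs one short training step at the given batch size. Halving from the current value finds a working size in a handful of tries.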