I've also had this issue on this network on TF 2.9.1 (both tensorflow-macos and built from source), tensorflow-metal 0.4.1 and 0.5.0. It's mitigable by restarting the python script every 10-30 epochs (usually corresponding to 20 mins of time), transferring weights and then resuming training, but sometimes it would hang right off the bat. Eventually I had to switch to CPU training, and I can also concur that it seems to be faster for some reason.
Hardware info: Mac Studio w/ M1 Max, 10 CPU cores 24 GPU cores, 32GB RAM
Software info: macOS 12.4, Python 3.9.13 (homebrew)
Topic:
Machine Learning & AI
SubTopic:
General
Tags: