I'm seeing something similar when training the SwinTransformerV2Tiny_ns model from https://github.com/leondgarse/keras_cv_attention_models. After roughly 4075 training steps it reliably stops using the GPU: GPU memory usage and utilization both drop off, and CPU usage stays low as well. You can see steps/sec absolutely tank in the training logs:
FastEstimator-Train: step: 3975; ce: 1.2872236; model_lr: 0.00022985446; steps/sec: 4.19;
FastEstimator-Train: step: 4000; ce: 1.3085787; model_lr: 0.00022958055; steps/sec: 4.2;
FastEstimator-Train: step: 4025; ce: 1.3924551; model_lr: 0.00022930496; steps/sec: 4.19;
FastEstimator-Train: step: 4050; ce: 1.4702798; model_lr: 0.0002290277; steps/sec: 4.16;
FastEstimator-Train: step: 4075; ce: 1.2734954; model_lr: 0.00022874876; steps/sec: 0.05;
[Screenshot] GPU memory utilization over time: roughly 30% during training, then it cuts out entirely. The first dip is an evaluation step during training; training then resumes before the cutoff.
[Screenshot] GPU utilization over time: roughly 100% during training, then it stalls out, with the same pattern of an evaluation dip, resumed training, and then the cutoff.
Once the GPU gives up, the terminal no longer responds to attempts to kill the training with Ctrl-C.
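In case it helps narrow down where the process is stuck, one way to get a stack trace out of a hang that ignores Ctrl-C is Python's built-in faulthandler. This is a generic debugging sketch, not anything FastEstimator-specific; the signal choice and timeout are just examples:

```python
import faulthandler
import signal

# Dump the Python stack of every thread when the process receives
# SIGUSR1 (e.g. `kill -USR1 <pid>` from another terminal). This works
# even when the main thread is blocked inside a native/CUDA call and
# no longer reacts to Ctrl-C (SIGINT).
faulthandler.register(signal.SIGUSR1, all_threads=True)

# As a fallback, dump all thread stacks to stderr after 10 minutes
# without exiting, in case the hang happens while nobody is watching.
faulthandler.dump_traceback_later(600, exit=False)
```

Putting this at the top of the training script and then sending SIGUSR1 once steps/sec drops should show whether the hang is inside TensorFlow, the data pipeline, or somewhere else.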