Reply to Low performance on matrix multiplication
Here is a further experiment with an interesting result:

import tensorflow as tf
from tqdm import tqdm

def foo(x, y):
    z = x * y
    return z

if __name__ == '__main__':
    z0 = None
    x = tf.random.uniform((1024 * 12, 1024 * 12), dtype=tf.float32)
    y = tf.random.uniform((1024 * 12, 1024 * 12), dtype=tf.float32)
    for i in tqdm(range(1000000)):
        zz = foo(x, y)
        x += 1
        y += 1
        # Accumulate the result so the computation cannot be optimized away.
        if z0 is None:
            z0 = zz
        else:
            z0 += zz

This experiment avoids caching as well as the extra cost of creating new tensors inside the loop. The M1 Max scores 61.9 it/s and the RTX 3090 scores 160.2 it/s, which shows the potential of the M1 Max (38.6% of the performance of an RTX 3090). Interestingly, in this experiment the M1 Max consumes only roughly 50 watts in total. I'm trying to find the performance bottleneck of the M1 Max with deep learning models such as Transformers, since they currently run very slowly on M1-series chips; I will update here when I find something.
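For reference, here is a minimal sketch (not the code I ran above, just an illustration) of how raw matrix-multiplication throughput could be timed with explicit synchronization, so that asynchronous dispatch does not distort the numbers. The device string '/GPU:0', the step count, and the FLOP estimate are my assumptions, not part of the experiment above:

import time
import tensorflow as tf

N = 1024 * 12      # matrix size, same as in the experiment above
STEPS = 100        # assumed number of timed iterations

with tf.device('/GPU:0'):                  # assumes a single visible GPU
    a = tf.random.uniform((N, N), dtype=tf.float32)
    b = tf.random.uniform((N, N), dtype=tf.float32)

    # Warm-up run so first-call overhead is not included in the timing.
    _ = tf.linalg.matmul(a, b).numpy()

    acc = tf.constant(0.0)
    start = time.perf_counter()
    for _ in range(STEPS):
        c = tf.linalg.matmul(a, b)
        acc += tf.reduce_sum(c)   # data dependency keeps every matmul in the measurement
    acc.numpy()                   # pulling the scalar to the host forces the GPU to finish
    elapsed = time.perf_counter() - start

flops = 2 * N ** 3 * STEPS        # one multiply-add counted as 2 FLOPs
print(f"{STEPS / elapsed:.1f} it/s, ~{flops / elapsed / 1e12:.2f} TFLOP/s")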
Topic: Graphics & Games SubTopic: General
Nov ’21
Reply to Low performance on matrix multiplication
Thanks for pointing out the issue. The reason I create a new random tensor on each loop iteration is to avoid potential "caching" of the same calculation on two fixed variables, so that the result reflects performance in a realistic scenario (loading different data during the loop for training or inference).

I also ran the experiment with your attached code. The M1 Max scores 103.10 it/s and the RTX 3090 scores 234.82 it/s, so the M1 Max reaches 43.9% of the RTX 3090's performance. But I think this figure is inflated by some internal caching: when I train real deep learning models on the M1 Max, throughput is roughly 1/6 of an RTX 3090, which is consistent with my earlier result (an example would be training a QA model from the Hugging Face TensorFlow examples).

The interesting part is the wattage. On an RTX 3090, GPU utilisation and power consumption are about the same for deep learning and for gaming, but on the M1 Max they differ a lot (power consumption is much lower for deep learning than for gaming), which suggests the GPU cores of the M1 Max might not be fully utilized for deep learning. I hope you can find the issue and improve the performance of TensorFlow on M1 chips.
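As a side note, here is a small sketch of the check I would use to confirm that ops actually land on the Metal GPU rather than silently falling back to the CPU (which would also be consistent with the low wattage). It only relies on standard TensorFlow device-placement logging; the matrix size is just an illustrative value:

import tensorflow as tf

# List the devices TensorFlow can see; with tensorflow-metal installed,
# the M1 Max should appear as a GPU device.
print(tf.config.list_physical_devices())

# Log where each op runs; ops placed on the CPU would explain the low GPU wattage.
tf.debugging.set_log_device_placement(True)

x = tf.random.uniform((4096, 4096), dtype=tf.float32)
y = tf.random.uniform((4096, 4096), dtype=tf.float32)
z = tf.linalg.matmul(x, y)   # the log should report device:GPU:0 for MatMul
print(z.shape)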
Topic: Graphics & Games SubTopic: General
Nov ’21