Low performance for calculation of dense layers

Hi, I ran a new experiment that may indicate a low-performance issue when using a Dense layer on the M1 Max (this is a follow-up to my previous question).

import tensorflow as tf
from tensorflow.keras import Model, layers
import numpy as np
from tqdm import tqdm

class NeuralNet(Model):
    # Set layers.
    def __init__(self):
        super(NeuralNet, self).__init__()
        # First fully-connected hidden layer.
        self.fc1 = layers.Dense(8192 * 8 * 2, activation=tf.nn.relu)

    # Set forward pass.
    def call(self, x):
        return self.fc1(x)

# Build neural network model.
neural_net = NeuralNet()
batch_size = 1024
x = np.random.rand(batch_size, 256)
for _ in tqdm(range(10000000)):
    neural_net(x)

The above code runs at 17.06 it/s on the M1 Max and 168.04 it/s on a Zotac RTX 3090. GPU utilisation is 100% on both machines; power draw is 44.5 W on the M1 Max versus 340 W on the RTX 3090. The M1 Max is much slower than it should be: it reaches only about 10% of the RTX 3090's performance, when roughly 30% would be expected.
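The "roughly 30%" expectation can be sanity-checked with a quick calculation. This sketch assumes the vendors' published peak FP32 figures (about 10.4 TFLOPS for the 32-core M1 Max GPU and about 35.6 TFLOPS for the RTX 3090); those numbers are not from this thread.

```python
# Back-of-envelope check of the expected performance gap, assuming
# peak FP32 throughput of ~10.4 TFLOPS (M1 Max, 32-core GPU) and
# ~35.6 TFLOPS (RTX 3090) from the vendors' public spec sheets.
m1_max_tflops = 10.4
rtx_3090_tflops = 35.6
expected_fraction = m1_max_tflops / rtx_3090_tflops   # ~0.29

# Ratio actually measured in the benchmark above.
measured_fraction = 17.06 / 168.04                     # ~0.10

print(f"expected M1 Max / RTX 3090 ratio: {expected_fraction:.2f}")
print(f"measured ratio:                   {measured_fraction:.2f}")
```

So on raw FP32 throughput alone the M1 Max should land near 29% of the RTX 3090, about three times the measured 10%.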

Here is a detailed performance comparison of the RTX 3090 and the M1 Max across different batch sizes, which shows the RTX 3090 is roughly 10 times faster than the M1 Max, and even faster at larger batch sizes:

Note that the batch sizes in the experiments above are already large. Please reproduce the experiments and fix the problem. Thanks.

Thank you for sharing your observations. We are investigating this. Please file a request through Feedback Assistant and post the ID here.

I submitted the issue; the Feedback ID is FB9803715. Thanks for investigating. I plan to redo the experiments using a plain matrix multiplication instead of a Dense layer, to see whether the issue is specific to Dense layers, and will update here.
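The planned follow-up could look something like the sketch below: benchmark the raw matmul that the Dense layer computes (256 → 131072), bypassing Keras entirely, to see whether the slowdown is specific to layers.Dense. The iteration count and timing approach here are assumptions, not details from the thread.

```python
import time
import numpy as np
import tensorflow as tf

batch_size = 1024
# Same shapes the Dense(8192 * 8 * 2) layer uses on a (1024, 256) input.
x = tf.constant(np.random.rand(batch_size, 256), dtype=tf.float32)
w = tf.constant(np.random.rand(256, 8192 * 8 * 2), dtype=tf.float32)

@tf.function
def matmul_step(a, b):
    # The same GEMM the Dense layer performs, minus bias add and ReLU.
    return tf.matmul(a, b)

iterations = 100
matmul_step(x, w)  # warm-up call so tracing/compilation isn't timed
start = time.time()
for _ in range(iterations):
    y = matmul_step(x, w)
_ = y.numpy()  # force execution to complete before stopping the clock
print(f"{iterations / (time.time() - start):.2f} it/s, output shape {y.shape}")
```

If this raw tf.matmul reaches a similar fraction of the RTX 3090's speed, the problem is in the GEMM path itself rather than in the Dense layer.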
