Post

Replies

Boosts

Views

Created

_FusedMatMul with [BiasAdd, Relu] produces incorrect results in graph mode on Metal GPU
When running a tf.function-traced graph on the Metal GPU, any operation that combines MatMul → BiasAdd → Relu (the fused pattern emitted by tf.keras.layers.Dense(activation='relu')) produces numerically incorrect output — errors on the order of tens of units, not floating-point noise. Eager mode on the same Metal GPU is correct. Graph mode forced to CPU (tf.config.set_visible_devices([], 'GPU')) is also correct. The bug is deterministic and data-independent (reproduces with random weights). the three-op combination of MatMul + BiasAdd + Relu trigger the error. Specifically: relu(tf.nn.bias_add(tf.matmul(x, W), b)) in graph mode on Metal is wrong, while relu(tf.matmul(x, W) + b) (using AddV2 instead of BiasAdd) is correct. Removing the Relu also makes the result correct — tf.nn.bias_add(tf.matmul(x, W), b) without a following Relu produces correct output at every shape tested. This points to the Metal plugin's fused _FusedMatMul kernel with fused_ops=[BiasAdd, Relu] as the culprit. Disabling the TF core grappler remapping pass (tf.config.optimizer.set_experimental_options({'remapping': False})) does not fix the issue, confirming that the fusion decision is made inside the Metal plugin's own kernel selection, below the TF core graph optimizer. The bug reproduces across all shapes tested (batch 4–200, inner dimension K 512–8192, output 128–2048) and is not specific to any particular weight values. A minimal reproducer: import tensorflow as tf import numpy as np # Any shape works; larger K makes the error more obvious M, K, N = 64, 2048, 1024 W = tf.Variable(tf.random.normal([K, N])) b = tf.Variable(tf.random.normal([N])) x = tf.random.normal([M, K]) @tf.function def graph_fused(x): return tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b)) @tf.function def graph_safe(x): return tf.nn.relu(tf.matmul(x, W) + b) # AddV2 instead of BiasAdd eager_ref = tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b)) # eager = correct fused_out = graph_fused(x) # Metal graph mode = WRONG safe_out = graph_safe(x) # Metal graph mode = correct print(f"eager vs graph_fused (BiasAdd): {tf.reduce_max(tf.abs(eager_ref - fused_out)).numpy():.1f}") # ^ typically 30–80+ (WRONG) print(f"eager vs graph_safe (AddV2): {tf.reduce_max(tf.abs(eager_ref - safe_out)).numpy():.2e}") # ^ typically ~1e-5 (correct) Environment: TensorFlow 2.18.1, Keras 3.11.2, tensorflow-metal (latest as of 2026-05-26), Apple Silicon Mac. Impact: This breaks any Keras model that uses Dense(activation='relu') when called inside a tf.function or via SavedModel serving on the Metal GPU. Eager-mode inference is unaffected.
0
0
120
1d
_FusedMatMul with [BiasAdd, Relu] produces incorrect results in graph mode on Metal GPU
When running a tf.function-traced graph on the Metal GPU, any operation that combines MatMul → BiasAdd → Relu (the fused pattern emitted by tf.keras.layers.Dense(activation='relu')) produces numerically incorrect output — errors on the order of tens of units, not floating-point noise. Eager mode on the same Metal GPU is correct. Graph mode forced to CPU (tf.config.set_visible_devices([], 'GPU')) is also correct. The bug is deterministic and data-independent (reproduces with random weights). the three-op combination of MatMul + BiasAdd + Relu trigger the error. Specifically: relu(tf.nn.bias_add(tf.matmul(x, W), b)) in graph mode on Metal is wrong, while relu(tf.matmul(x, W) + b) (using AddV2 instead of BiasAdd) is correct. Removing the Relu also makes the result correct — tf.nn.bias_add(tf.matmul(x, W), b) without a following Relu produces correct output at every shape tested. This points to the Metal plugin's fused _FusedMatMul kernel with fused_ops=[BiasAdd, Relu] as the culprit. Disabling the TF core grappler remapping pass (tf.config.optimizer.set_experimental_options({'remapping': False})) does not fix the issue, confirming that the fusion decision is made inside the Metal plugin's own kernel selection, below the TF core graph optimizer. The bug reproduces across all shapes tested (batch 4–200, inner dimension K 512–8192, output 128–2048) and is not specific to any particular weight values. A minimal reproducer: import tensorflow as tf import numpy as np # Any shape works; larger K makes the error more obvious M, K, N = 64, 2048, 1024 W = tf.Variable(tf.random.normal([K, N])) b = tf.Variable(tf.random.normal([N])) x = tf.random.normal([M, K]) @tf.function def graph_fused(x): return tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b)) @tf.function def graph_safe(x): return tf.nn.relu(tf.matmul(x, W) + b) # AddV2 instead of BiasAdd eager_ref = tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b)) # eager = correct fused_out = graph_fused(x) # Metal graph mode = WRONG safe_out = graph_safe(x) # Metal graph mode = correct print(f"eager vs graph_fused (BiasAdd): {tf.reduce_max(tf.abs(eager_ref - fused_out)).numpy():.1f}") # ^ typically 30–80+ (WRONG) print(f"eager vs graph_safe (AddV2): {tf.reduce_max(tf.abs(eager_ref - safe_out)).numpy():.2e}") # ^ typically ~1e-5 (correct) Environment: TensorFlow 2.18.1, Keras 3.11.2, tensorflow-metal (latest as of 2026-05-26), Apple Silicon Mac. Impact: This breaks any Keras model that uses Dense(activation='relu') when called inside a tf.function or via SavedModel serving on the Metal GPU. Eager-mode inference is unaffected.
Replies
0
Boosts
0
Views
120
Activity
1d