When running a tf.function-traced graph on the Metal GPU, any operation that combines MatMul → BiasAdd → Relu (the fused pattern emitted by tf.keras.layers.Dense(activation='relu')) produces numerically incorrect output — errors on the order of tens of units, not floating-point noise. Eager mode on the same Metal GPU is correct. Graph mode forced to CPU (tf.config.set_visible_devices([], 'GPU')) is also correct. The bug is deterministic and data-independent (reproduces with random weights).
the three-op combination of MatMul + BiasAdd + Relu trigger the error. Specifically: relu(tf.nn.bias_add(tf.matmul(x, W), b)) in graph mode on Metal is wrong, while relu(tf.matmul(x, W) + b) (using AddV2 instead of BiasAdd) is correct. Removing the Relu also makes the result correct — tf.nn.bias_add(tf.matmul(x, W), b) without a following Relu produces correct output at every shape tested. This points to the Metal plugin's fused _FusedMatMul kernel with fused_ops=[BiasAdd, Relu] as the culprit. Disabling the TF core grappler remapping pass (tf.config.optimizer.set_experimental_options({'remapping': False})) does not fix the issue, confirming that the fusion decision is made inside the Metal plugin's own kernel selection, below the TF core graph optimizer.
The bug reproduces across all shapes tested (batch 4–200, inner dimension K 512–8192, output 128–2048) and is not specific to any particular weight values. A minimal reproducer:
import tensorflow as tf
import numpy as np
# Any shape works; larger K makes the error more obvious
M, K, N = 64, 2048, 1024
W = tf.Variable(tf.random.normal([K, N]))
b = tf.Variable(tf.random.normal([N]))
x = tf.random.normal([M, K])
@tf.function
def graph_fused(x):
return tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b))
@tf.function
def graph_safe(x):
return tf.nn.relu(tf.matmul(x, W) + b) # AddV2 instead of BiasAdd
eager_ref = tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b)) # eager = correct
fused_out = graph_fused(x) # Metal graph mode = WRONG
safe_out = graph_safe(x) # Metal graph mode = correct
print(f"eager vs graph_fused (BiasAdd): {tf.reduce_max(tf.abs(eager_ref - fused_out)).numpy():.1f}")
# ^ typically 30–80+ (WRONG)
print(f"eager vs graph_safe (AddV2): {tf.reduce_max(tf.abs(eager_ref - safe_out)).numpy():.2e}")
# ^ typically ~1e-5 (correct)
Environment: TensorFlow 2.18.1, Keras 3.11.2, tensorflow-metal (latest as of 2026-05-26), Apple Silicon Mac.
Impact: This breaks any Keras model that uses Dense(activation='relu') when called inside a tf.function or via SavedModel serving on the Metal GPU. Eager-mode inference is unaffected.
0
0
120