I believe I am encountering a bug in the GPU (MPS) backend of Core ML.
There appears to be an invalid conversion of a slice_by_index + gather pair that results in indexing the wrong values during GPU execution.
The following Python program, using the coremltools library, illustrates the issue:
import tempfile

import numpy as np
import torch

import coremltools as ct
from coremltools.converters.mil import Builder as mb
from coremltools.converters.mil.mil import types

dB = 20480
shapeI = (2, dB)
shapeB = (dB, 22)

@mb.program(input_specs=[mb.TensorSpec(shape=shapeI, dtype=types.int32),
                         mb.TensorSpec(shape=shapeB)])
def prog(i, b):
    lslice = mb.slice_by_index(x=i, begin=[0, 0], end=[1, dB], end_mask=[False, True],
                               squeeze_mask=[True, False], name='slice_left')
    rslice = mb.slice_by_index(x=i, begin=[1, 0], end=[2, dB], end_mask=[False, True],
                               squeeze_mask=[True, False], name='slice_right')
    ldata = mb.gather(x=b, indices=lslice)
    rdata = mb.gather(x=b, indices=rslice)  # actual bug in optimization of gather+slice
    x = mb.add(x=ldata, y=rdata)
    # dummy ops to make a bigger graph so it runs on GPU
    for _ in range(7):
        x = mb.mul(x=x, y=2.)
        x = mb.mul(x=x, y=.5)
    x = mb.mul(x=x, y=1., name='result')
    return x
input_types = [
    ct.TensorType(name="i", shape=shapeI, dtype=np.int32),
    ct.TensorType(name="b", shape=shapeB, dtype=np.float32),
]
with tempfile.TemporaryDirectory() as tmpdirname:
    model_cpu = ct.convert(prog,
                           inputs=input_types,
                           compute_precision=ct.precision.FLOAT32,
                           compute_units=ct.ComputeUnit.CPU_ONLY,
                           package_dir=tmpdirname + 'model_cpu.mlpackage')
    model_gpu = ct.convert(prog,
                           inputs=input_types,
                           compute_precision=ct.precision.FLOAT32,
                           compute_units=ct.ComputeUnit.CPU_AND_GPU,
                           package_dir=tmpdirname + 'model_gpu.mlpackage')

    inputs = {
        "i": torch.randint(0, shapeB[0], shapeI, dtype=torch.int32),
        "b": torch.rand(shapeB, dtype=torch.float32),
    }
    cpu_output = model_cpu.predict(inputs)
    gpu_output = model_gpu.predict(inputs)

    # equivalent to prog
    expected = inputs["b"][inputs["i"][0]] + inputs["b"][inputs["i"][1]]
    # what actually happens on GPU
    actual = inputs["b"][inputs["i"][0]] + inputs["b"][inputs["i"][0]]

    print(f"diff expected vs cpu: {np.sum(np.absolute(expected - cpu_output['result']))}")
    print(f"diff expected vs gpu: {np.sum(np.absolute(expected - gpu_output['result']))}")
    print(f"diff actual vs gpu: {np.sum(np.absolute(actual - gpu_output['result']))}")
The issue seems to occur in the slice_right + gather operations when executed on GPU: the wrong rows of input "i" are selected.
The program outputs:
diff expected vs cpu: 0.0
diff expected vs gpu: 150104.015625
diff actual vs gpu: 0.0
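For clarity, here is a plain-NumPy sketch (no Core ML involved, toy sizes) of what the two gather ops are supposed to compute, next to the incorrect result I observe on GPU, where the slice_right indices appear to be replaced by the slice_left ones:

```python
import numpy as np

# toy table and indices for illustration
b = np.arange(12, dtype=np.float32).reshape(4, 3)   # table being gathered, like "b"
i = np.array([[0, 1, 2],                            # first row  ~ slice_left
              [3, 2, 1]], dtype=np.int32)           # second row ~ slice_right

# what the model should compute: one gather per half of i, then add
expected = b[i[0]] + b[i[1]]

# what the GPU path appears to compute: slice_right replaced by slice_left
buggy = b[i[0]] + b[i[0]]

print(np.abs(expected - buggy).sum())  # → 45.0, non-zero whenever i[0] != i[1]
```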
This behavior was observed on a MacBook Pro 14-inch 2023 (M2 Pro) running macOS 14.7, using coremltools 8.0b2 with Python 3.9.19.
Hi,
I am trying to take advantage of my device's ANE. I have created a model from torch using coremltools and adapted it until the Xcode model performance preview indicates it will run on my device's ANE.
When I profile my integration in my app, I can see from the com.apple.ane logs that the model has been loaded on the device:
Timestamp Type Process Category Message
00:00.905.087 Debug MLBench (11135) client doLoadModel:options:qos:error:: model[0x2804500c0] : success=1 : progamHandle=10 000 241 581 886: intermediateBufferHandle=10 000 242 143 532 : queueDepth=32 :err=
But when I call predict on my model, the ANE model is unloaded and the prediction runs on the CPU:
Timestamp Type Process Category Message
00:00.996.015 Debug MLBench (11135) client doUnloadModel:options:qos:error:: model[0x2804500c0]=_ANEModel: { modelURL=file:///var/mobile/Containers/Data/Application/0A9F356B-B8C7-4B86-90A5-6812EF48CC94/tmp/math_custom_trans_decoder_seg_0DB63A47-E84E-4887-A606-BC9986B2C662.mlmodelc/ : key={"isegment":0,"inputs":{"extras":{"shape":[2,1,1,1,1]},"memory":{"shape":[128,5,1,1,1]},"proj_key_seg_in":{"shape":[128,39,1,1,1]},"state_in_k":{"shape":[32,1,1,20,2]},"tgt":{"shape":[5,1,1,1,1]},"state_in_v":{"shape":[32,1,1,20,2]},"pos_enc":{"shape":[128,1,1,1,1]}},"outputs":{"attn_seg":{"shape":[1,5,1,4,1]},"state_out_v":{"shape":[32,2,1,20,2]},"output":{"shape":[292,5,1,1,1]},"state_out_k":{"shape":[32,2,1,20,2]},"extras_tmp":{"shape":[2,1,1,1,1]},"proj_key_seg_in_tmp":{"shape":[128,39,1,1,1]},"attn":{"shape":[1,1,1,5,2]},"proj_key_seg":{"shape":[128,1,1,20,1]}}} : string_id=0x70ac000000015257 : program=_ANEProgramForEvaluation: { programHandle=10000241581886 : intermediateBufferHandle=10000242143532 : queueDepth=32 } : state=3 : programHandle=10000241581886 : intermediateBufferHandle=10000242143532 : queueDepth=32 : attr=... : perfStatsMask=0}
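For what it's worth, this is a small helper I use to pull just the ANE load/unload events out of an exported log text, so the relevant lines are easier to spot (the sample lines below are abbreviated versions of the logs above):

```python
# keywords that identify ANE model lifecycle calls in com.apple.ane logs
INTERESTING = ("doLoadModel", "doUnloadModel")

def ane_events(lines):
    """Yield only the log lines that mention ANE model load/unload calls."""
    for line in lines:
        if any(key in line for key in INTERESTING):
            yield line

# abbreviated sample lines from the logs shown above
sample = [
    "00:00.905.087 Debug MLBench client doLoadModel:options:qos:error:: success=1",
    "00:00.950.000 Debug MLBench client some other message",
    "00:00.996.015 Debug MLBench client doUnloadModel:options:qos:error:: state=3",
]
for event in ane_events(sample):
    print(event)
```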
I don't see any obvious error messages in com.apple.ane, com.apple.coreml, or com.apple.espresso.
Where/what should I look for to understand what is going on, and in particular why the ANE model was unloaded?
Thank you