Post

Replies

Boosts

Views

Activity

LLM inference on Apple Silicon: why do some MoE architectures outperform dense models despite similar parameter counts?
We're doing some local LLM inference experiments on Apple Silicon and have observed something that seems counterintuitive. In several cases, MoE models significantly outperform dense models despite having similar total parameter counts.

Examples (simplified):
- Dense model: ~30B parameters
- MoE model: ~30B total parameters, ~3B active parameters

On Apple Silicon, the MoE model consistently achieves higher decode throughput even after accounting for routing overhead.

A few hypotheses we're considering:
- Active parameter count appears to matter more than total parameter count for decode throughput.
- Memory traffic may dominate M=1 autoregressive decode, making sparse activation more important than expected.
- Expert matrix geometry might matter as much as parameter count. Some MoE designs appear to produce GPU-friendly GEMV shapes while others do not.
- Quantization layout and memory alignment seem to have surprisingly large effects on practical throughput.

What I'm curious about is whether others have observed similar behavior on Apple Silicon specifically. Has anyone profiled decode throughput across:
- dense models
- large-expert MoE
- many-small-expert MoE

and identified which hardware characteristics are actually driving the difference? I'm particularly interested in observations from Metal profiling rather than benchmark leaderboards.
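To make the memory-traffic hypothesis concrete, here is a rough back-of-the-envelope sketch in Python. It assumes decode is purely bandwidth-bound and that each active weight is read exactly once per token. The 400 GB/s bandwidth and 4-bit quantization are placeholder assumptions, not measurements, and the helper name `decode_tokens_per_second` is made up for illustration. KV-cache reads, attention, routing, and activation traffic are all ignored, so the result is an upper bound, not a prediction.

```python
def decode_tokens_per_second(active_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode tokens/s if every active weight is read once per token.

    Hypothetical helper: assumes decode is limited only by weight traffic.
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Assumed values for illustration only (not measured on any specific chip).
BANDWIDTH_GB_S = 400.0
BITS_PER_WEIGHT = 4

dense = decode_tokens_per_second(30e9, BITS_PER_WEIGHT, BANDWIDTH_GB_S)  # ~30B active
moe = decode_tokens_per_second(3e9, BITS_PER_WEIGHT, BANDWIDTH_GB_S)     # ~3B active

print(f"dense bound: {dense:.1f} tok/s")
print(f"MoE bound:   {moe:.1f} tok/s")
print(f"ratio:       {moe / dense:.1f}x")
```

Under these assumptions the bound scales inversely with active parameters, so a 10x gap in active parameters gives a 10x gap in the ceiling. Measured gaps would presumably be smaller once routing, attention, and expert matrix shapes come into play, which is exactly the part we'd like profiling data on.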
0
0
90
2w
Apple GPU forward progress guarantees for persistent-thread synchronization?
We're doing some research on Apple Silicon inference runtimes and trying to understand the practical synchronization boundary of Apple GPUs. We are not asking about threadgroup barriers (those are documented), but about device-scope synchronization patterns built from atomics.

What we've observed:
- Device-scope atomics are available.
- It is possible to build global counters and persistent-thread style coordination structures.

However, we cannot find any documented guarantee regarding:
- threadgroup co-residency,
- global forward progress,
- occupancy-bounded synchronization safety.

In our experiments, synchronization schemes that rely on all threadgroups making progress eventually can become unreliable, while strictly local producer/consumer handoff patterns appear much more robust.

Questions:
- Does Metal provide any documented forward-progress guarantees across threadgroups beyond what is explicitly stated in the Metal specification?
- Is there any recommended pattern for implementing long-lived producer/consumer GPU pipelines without relying on global synchronization assumptions?
- For Apple GPUs specifically, should developers assume that occupancy-bounded global synchronization is unsupported unless explicitly provided by the API?

We are not looking for undocumented implementation details, only for guidance on what assumptions are safe for production systems. Thanks.
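To illustrate the concern, here is a toy CPU-side model in Python (not Metal code, and not a description of how Apple's scheduler actually works). It assumes a fixed number of resident slots, that a threadgroup holds its slot until it retires, that a spinning threadgroup is never preempted, and that threadgroups are dispatched in index order. All of those are assumptions of the model, not documented behavior, and the names `run`, `barrier_deps`, and `handoff_deps` are made up for the sketch.

```python
from collections import deque


def run(num_groups, resident_slots, deps):
    """Return True if every group retires, False if the model stalls.

    A group "publishes" (e.g. bumps an atomic) when it becomes resident and
    retires once every group in deps(g) has published. Spinning groups keep
    their slot, so nothing new can be scheduled behind them.
    """
    pending = deque(range(num_groups))
    resident, published, retired = set(), set(), set()
    while True:
        progressed = False
        # Fill free slots in dispatch order (assumed: index order).
        while pending and len(resident) < resident_slots:
            g = pending.popleft()
            resident.add(g)
            published.add(g)
            progressed = True
        # Retire every resident group whose dependencies have published.
        for g in sorted(resident):
            if deps(g) <= published:
                resident.remove(g)
                retired.add(g)
                progressed = True
        if len(retired) == num_groups:
            return True
        if not progressed:
            return False  # all resident groups spin forever


def barrier_deps(num_groups):
    # Global barrier: each group waits for every group to arrive.
    return lambda g: set(range(num_groups))


def handoff_deps():
    # Local handoff: each group waits only for its predecessor.
    return lambda g: {g - 1} if g > 0 else set()


print(run(8, 4, barrier_deps(8)))  # barrier with more groups than slots
print(run(4, 4, barrier_deps(4)))  # barrier that fits in residency
print(run(8, 1, handoff_deps()))   # local handoff with a single slot
```

In this model the global barrier only completes when every threadgroup fits in residency at once, while the local handoff completes even with one slot. The handoff result still depends on the dispatch-order assumption, which is part of what we are asking about.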
0
0
89
2w