Sharing a Swift port of Gemma 4 for mlx-swift-lm — feedback welcome
Hi all, I've been working on a pure-Swift port of Google's Gemma 4 text decoder that plugs into mlx-swift-lm as a sidecar model registration. I'm sharing it here in case anyone else has hit the same wall I did, and to get feedback from the MLX team and the community before I propose anything upstream.

Repo: https://github.com/yejingyang8963-byte/Swift-gemma4-core

Why

As of mlx-swift-lm 2.31.x, Gemma 4 isn't supported out of the box. The obvious workaround, reusing the Gemma 3 text implementation with a patched config, fails at weight load because Gemma 4 differs from Gemma 3 in several structural places. The chat-template path through swift-jinja 1.x also silently corrupts the prompt, so the model loads but generates incoherent text.

What's in the package

- A from-scratch Swift implementation of the Gemma 4 decoder (Configuration, Layers, Attention, MLP, RoPE, DecoderLayer)
- Per-Layer Embedding (PLE) support: a shared embedding table that feeds every decoder layer through a gated MLP as a third residual
- KV sharing across the back half of the decoder, threaded through the forward pass via a donor table with a single global rope offset
- A custom Gemma4ProportionalRoPE class for the partial-rotation rope type that initializeRope doesn't currently recognize
- A chat-template bypass that builds the prompt as a literal string with the correct turn markers and encodes via tokenizer.encode(text:), matching Python mlx-lm's apply_chat_template byte-for-byte

Measured on iPhone (A-series, 7.4 GB RAM)

- Model: mlx-community/gemma-4-e2b-it-4bit
- Warm load: ~6 s
- Memory after load: 341–392 MB
- Time to first token (end-to-end, with a 333-token system prompt): 2.82 s
- Generation throughput: 12–14 tok/s

What I'd love feedback on

- Is the sidecar registration pattern the right way to extend mlx-swift-lm with new model families, or is there a more idiomatic path I missed?
- The chat-template bypass works but feels like a workaround.
Is the right long-term fix in swift-jinja, in the tokenizer, or somewhere else entirely? And is anyone running into the same PLE / KV-sharing issues on other Gemma-family checkpoints? I'd like to make sure the implementation generalizes beyond E2B before tagging 0.2.0.

Happy to open a PR against mlx-swift-lm if the maintainers think any of this belongs upstream. Thanks for reading.
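For anyone curious what the PLE residual path looks like in spirit, here is a minimal pure-Swift sketch: each decoder layer adds a gated projection of a per-layer embedding vector as a third residual, alongside the usual attention and MLP residuals. The function name, shapes, and the sigmoid gating here are illustrative assumptions, not the exact code in the repo.

```swift
import Foundation

// Sketch of the Per-Layer Embedding (PLE) third residual: the hidden state
// picks up a gated contribution from a per-layer embedding vector.
// `gateWeight` stands in for the gated-MLP projection; all names are
// hypothetical simplifications of the real layer.
func addPLEResidual(hidden: [Double], pleEmbedding: [Double], gateWeight: [Double]) -> [Double] {
    precondition(hidden.count == pleEmbedding.count && hidden.count == gateWeight.count)
    var out = hidden
    for i in hidden.indices {
        // Sigmoid gate over the (scalar-projected) embedding component.
        let gate = 1.0 / (1.0 + exp(-gateWeight[i] * pleEmbedding[i]))
        // Third residual added on top of the attention and MLP residuals.
        out[i] = hidden[i] + gate * pleEmbedding[i]
    }
    return out
}
```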
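The donor-table idea for KV sharing can be sketched as a simple layer-index mapping: layers in the back half of the stack read the KV cache written by an earlier donor layer instead of computing their own. The specific mapping below (every back-half layer borrowing from the last self-computing layer) is an assumption for illustration; the actual Gemma 4 table lives in the repo.

```swift
// Sketch of a KV-sharing donor table. Entry i is the layer whose KV cache
// layer i should read. Front-half layers compute their own KV (donor == self);
// back-half layers borrow from the last self-computing layer. This mapping is
// an illustrative assumption, not necessarily the exact Gemma 4 scheme.
func kvDonorTable(layerCount: Int) -> [Int] {
    let firstShared = layerCount / 2      // back half shares KV
    let donor = firstShared - 1           // last layer that computes its own KV
    return (0..<layerCount).map { $0 < firstShared ? $0 : donor }
}
```

In the forward pass, a layer whose table entry differs from its own index skips the K/V projections entirely, which is where the memory and latency savings come from; the single global rope offset keeps the borrowed keys positionally consistent.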
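For the partial-rotation rope type, the core idea is that only the first rotaryDim dimensions of each head get the rotary transform and the rest pass through untouched. A minimal sketch on plain arrays (the real Gemma4ProportionalRoPE operates on MLX tensors; the 10,000 base and pairing scheme here are standard-RoPE assumptions):

```swift
import Foundation

// Sketch of partial-rotation RoPE: rotate only the first `rotaryDim`
// dimensions, pairing (i, i + rotaryDim/2); leave the rest unchanged.
func partialRoPE(_ x: [Double], position: Int, rotaryDim: Int, base: Double = 10_000) -> [Double] {
    precondition(rotaryDim % 2 == 0 && rotaryDim <= x.count)
    var out = x
    let half = rotaryDim / 2
    for i in 0..<half {
        // Standard RoPE frequency for the dimension pair (i, i + half).
        let theta = Double(position) / pow(base, Double(2 * i) / Double(rotaryDim))
        let (c, s) = (cos(theta), sin(theta))
        let (a, b) = (x[i], x[i + half])
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    }
    // Dimensions rotaryDim..<x.count pass through — the "partial" part.
    return out
}
```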
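And to make the chat-template bypass concrete, here is roughly the shape of the prompt builder: assemble the turn markers by hand as a literal string, then encode with tokenizer.encode(text:). The markers below follow Gemma 3's convention (<start_of_turn>/<end_of_turn>); that Gemma 4 uses the same ones is an assumption you should verify against the checkpoint's tokenizer config.

```swift
// Sketch of the chat-template bypass. `Turn` and `buildGemmaPrompt` are
// illustrative names; the turn markers are assumed to match Gemma 3's.
struct Turn {
    let role: String      // "user", "model", ...
    let content: String
}

func buildGemmaPrompt(_ turns: [Turn], addGenerationPrompt: Bool = true) -> String {
    var prompt = "<bos>"
    for turn in turns {
        prompt += "<start_of_turn>\(turn.role)\n\(turn.content)<end_of_turn>\n"
    }
    if addGenerationPrompt {
        // Open the model turn so generation continues from here.
        prompt += "<start_of_turn>model\n"
    }
    return prompt
}
```

The resulting string goes straight through tokenizer.encode(text:), sidestepping the swift-jinja template path entirely; comparing the token IDs against Python mlx-lm's apply_chat_template output is how I'd validate the byte-for-byte claim on any new checkpoint.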
Replies: 0
Boosts: 0
Views: 22
Activity: 10h