Hi,
I'm building ANF (Autonomous Native Forge), a cloud-free, four-agent
autonomous software production pipeline that runs on local hardware
with local LLM inference. No middleware; pure native Node.js.
It currently runs on an NVIDIA Blackwell GB10 with vLLM + DeepSeek-R1-32B,
and I'm now porting it to Apple Silicon.
Three technical questions:
1. How production-ready is mlx-lm's OpenAI-compatible API server
   for long-context generation (32K tokens)?
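For context on the server question, this is roughly how I plan to stand it up on the Mac. A sketch only: the model repo is a placeholder for whatever 4-bit DeepSeek-R1 distill conversion I end up using, and the flags should be checked against `mlx_lm.server --help` for the installed version.

```shell
# Sketch: launch mlx-lm's OpenAI-compatible server locally.
# Model path is a placeholder; verify flags with `mlx_lm.server --help`.
python -m mlx_lm.server \
  --model mlx-community/DeepSeek-R1-Distill-Qwen-32B-4bit \
  --port 8080
```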
2. What's the recommended approach to KV cache management
   under the Unified Memory architecture? Are there specific flags
   or configurations for an M4 Ultra?
3. MLX vs. GGUF (llama.cpp) for a multi-agent pipeline
   where four agents call the inference endpoint concurrently:
   which handles parallel requests better on Apple Silicon?
GitHub: github.com/trgysvc/AutonomousNativeForge
Any guidance appreciated.