Hi,
I'm building ANF (Autonomous Native Forge): a cloud-free, four-agent autonomous software production pipeline that runs on local hardware with local LLM inference. No middleware; pure native Node.js.
Currently running on NVIDIA Blackwell GB10 with vLLM + DeepSeek-R1-32B. Now porting to Apple Silicon.
Three technical questions:
- How production-ready is mlx-lm's OpenAI-compatible API server for long-context generation (32K tokens)?
- What's the recommended approach to KV cache management under the Unified Memory architecture? Are there specific flags or configurations for the M4 Ultra?
- MLX vs. GGUF (llama.cpp) for a multi-agent pipeline where 4 agents call the inference endpoint concurrently: which handles parallel requests better on Apple Silicon?
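For context on the third question, the agents dispatch requests roughly like this (a minimal sketch, not the actual ANF code: the endpoint URL, model name, and the injectable `post` transport are placeholders so the example runs without a live server):

```python
# Sketch: 4 agents hitting one OpenAI-compatible endpoint concurrently.
# ENDPOINT/MODEL are placeholders; in the real pipeline call_llm would
# POST to the local inference server's /v1/chat/completions route.
import json
from concurrent.futures import ThreadPoolExecutor
from urllib import request

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # placeholder
MODEL = "deepseek-r1-distill-qwen-32b"                  # placeholder

def call_llm(prompt: str, post=None) -> str:
    """Send one chat-completion request. `post` is injectable so the
    sketch can be exercised with a stub instead of a live server."""
    payload = {"model": MODEL,
               "messages": [{"role": "user", "content": prompt}]}
    if post is None:
        def post(url, body):  # real HTTP transport
            req = request.Request(url, data=body.encode(),
                                  headers={"Content-Type": "application/json"})
            with request.urlopen(req) as resp:
                return resp.read().decode()
    raw = post(ENDPOINT, json.dumps(payload))
    return json.loads(raw)["choices"][0]["message"]["content"]

def run_agents(prompts, post=None):
    # 4 worker threads == 4 agents issuing requests in parallel
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda p: call_llm(p, post), prompts))

if __name__ == "__main__":
    # Stub transport: echoes the prompt back as the model's reply
    def fake_post(url, body):
        p = json.loads(body)["messages"][0]["content"]
        return json.dumps(
            {"choices": [{"message": {"content": "ok: " + p}}]})

    print(run_agents(["plan", "code", "test", "review"], post=fake_post))
    # -> ['ok: plan', 'ok: code', 'ok: test', 'ok: review']
```

The question is essentially whether the MLX or llama.cpp server side handles this fan-in (batching/queueing of 4 simultaneous requests) more gracefully.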
GitHub: github.com/trgysvc/AutonomousNativeForge
Any guidance appreciated.