Local AI & Automation Briefing — May 16th, 2026

Let's start with the biggest speed story this week: Multi-Token Prediction, or MTP, has arrived for local inference, and the numbers are wild. Unsloth AI released experimental MTP GGUF quants for the Qwen 3.6 models. On a single RTX 5090, the twenty-seven-billion-parameter model hits one hundred forty tokens per second using MTP. The thirty-five-billion-parameter MoE variant reaches two hundred twenty tokens per second. That's up to a three-X speedup over standard GGUF inference with zero accuracy loss. On a Strix Halo AMD laptop running Vulkan, code workloads jumped from twelve to thirty-six tokens per second. llama.cpp builds b9161 through b9165 bake this directly in: you just add the draft-mtp flags and you're off. This is frontier-model speed on consumer hardware, with no cloud bills.
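For a feel of what that Strix Halo jump means in practice, here's a quick back-of-the-envelope sketch. It uses only the twelve and thirty-six tokens-per-second figures quoted above; the thousand-token response length and the function names are made up for illustration.

```python
def speedup(baseline_tps: float, mtp_tps: float) -> float:
    """Ratio of MTP throughput to baseline throughput."""
    return mtp_tps / baseline_tps


def seconds_to_generate(tokens: int, tps: float) -> float:
    """Wall-clock decode time for a response, ignoring prefill."""
    return tokens / tps


# Figures quoted in the briefing for Strix Halo code workloads.
BASELINE_TPS = 12
MTP_TPS = 36

# Hypothetical response length, chosen only for illustration.
RESPONSE_TOKENS = 1000

print(f"Speedup: {speedup(BASELINE_TPS, MTP_TPS):.1f}x")
print(f"Baseline: {seconds_to_generate(RESPONSE_TOKENS, BASELINE_TPS):.0f} s")
print(f"With MTP: {seconds_to_generate(RESPONSE_TOKENS, MTP_TPS):.0f} s")
```

In other words, a response that used to take well over a minute to decode lands in under half a minute, assuming the quoted rates hold for the whole generation.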

Alex Cheema from EXO Labs met up with Ahmad Osman earlier this week and posted a detailed thread on why M5 Max MacBook clusters are currently the best price-to-performance option for decode-heavy inference. Four MacBooks with one hundred twenty-eight gigs of memory each, connected via Thunderbolt 5 RDMA, achieve near-linear scaling thanks to eight-microsecond latency. Cheema confirmed that a government customer is already running production workloads on this setup. EXO also announced that they're preparing to publish more than ten thousand hardware configurations for free. On X, Ahmad posted a heartfelt defense of the local AI community, pushing back against what he called crypto-style grifters dressing up as AI experts. It got strong engagement from the LocalLLaMA crowd.

On the hardware price front, the RTX 5090 situation remains rough. Street prices sit at three thousand four hundred to four thousand one hundred dollars for AIB models, and NVIDIA has reportedly passed a three-hundred-dollar increase on to board partners because of GDDR7 memory shortages. Cards disappear within minutes when stock appears. For those who need more VRAM, the RTX PRO 6000 Blackwell with ninety-six gigs is settling around eight thousand dollars in the U.S., and it's shipping in Lenovo ThinkStation P4 and ELSA Veluga Pro workstations that support up to three cards.

Budget builders, take note: used RTX 3090s with twenty-four gigs are now six hundred fifty to eight hundred dollars. That's the lowest cost per gigabyte of VRAM on the market. Used 4090s run about thirteen hundred fifty to seventeen hundred dollars. For the price of one new 5090, you can build a dual 3090 rig with forty-eight gigs of total VRAM.
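If you want to check the cost-per-gigabyte math yourself, here's a minimal sketch. The prices are the street ranges quoted in this briefing. The VRAM sizes for the 4090 (twenty-four gigs) and the 5090 (thirty-two gigs) come from the public spec sheets rather than from this briefing, so treat them as assumptions, and the function and table names are just illustrative.

```python
def dollars_per_gb(price_usd: float, vram_gb: int) -> float:
    """Cost of one gigabyte of VRAM at a given card price."""
    return price_usd / vram_gb


# name: (low price, high price, VRAM in GB)
# Prices are the ranges quoted in the briefing.
# 4090 and 5090 VRAM sizes are assumptions taken from public specs.
CARDS = {
    "RTX 3090 (used)": (650, 800, 24),
    "RTX 4090 (used)": (1350, 1700, 24),
    "RTX 5090 (AIB)": (3400, 4100, 32),
    "RTX PRO 6000 Blackwell": (8000, 8000, 96),
}

for name, (low, high, vram) in CARDS.items():
    low_rate = dollars_per_gb(low, vram)
    high_rate = dollars_per_gb(high, vram)
    print(f"{name}: ${low_rate:.2f} to ${high_rate:.2f} per GB")

# Two used 3090s, cards only (no motherboard, PSU, or case).
low_3090, high_3090, vram_3090 = CARDS["RTX 3090 (used)"]
print(f"Dual 3090: ${2 * low_3090} to ${2 * high_3090} for {2 * vram_3090} GB")
```

On these numbers the used 3090 comes in at roughly twenty-seven to thirty-three dollars per gigabyte, against more than a hundred dollars per gigabyte for a new 5090. Keep in mind this compares capacity only and says nothing about speed.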

On the Apple Silicon side, the Mac Studio M3 Ultra with five hundred twelve gigs of unified memory runs Qwen 3.6 at seventy-two tokens per second on MLX while drawing just one hundred twenty watts, versus eleven hundred watts for an equivalent NVIDIA rig. LM Studio pushed an updated MLX runtime beta with major caching improvements and vision model batching, available now by switching to the beta channel. Awni Hannun previously benchmarked the M5 Max, showing eight-X faster prefill than the M1 Max, all running locally in MLX.
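Those Mac Studio figures translate into energy per token with some simple arithmetic. This is a rough sketch using only the seventy-two tokens per second and one hundred twenty watts quoted above. The briefing gives no throughput for the NVIDIA rig, so the sketch doesn't attempt that comparison, and the function names are illustrative.

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 3.6 million joules


def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Energy spent per generated token (watts are joules per second)."""
    return power_watts / tokens_per_second


def kwh_per_million_tokens(power_watts: float, tokens_per_second: float) -> float:
    """Energy for one million tokens, expressed in kilowatt-hours."""
    joules = joules_per_token(power_watts, tokens_per_second) * 1_000_000
    return joules / JOULES_PER_KWH


# Figures quoted in the briefing for the M3 Ultra running Qwen 3.6 on MLX.
MAC_WATTS = 120
MAC_TPS = 72

print(f"{joules_per_token(MAC_WATTS, MAC_TPS):.2f} J per token")
print(f"{kwh_per_million_tokens(MAC_WATTS, MAC_TPS):.2f} kWh per million tokens")
```

That works out to a bit under two joules per token, or under half a kilowatt-hour for a million tokens, assuming the machine holds that power draw and throughput for the whole run.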

In the self-hosted software ecosystem, Open WebUI released version 0.9.5 patching two CVEs — update your instances if you haven't. n8n's self-hosted AI agent stack continues to mature, with builders now using the Observable Agent Stack pattern combining IronClaw guardrails with n8n workflows and curated libraries. People are running autonomous 24/7 lead generation agents on decade-old hardware with zero cloud costs.

A new open-source tool called whichllm dropped today — it auto-detects your hardware and ranks local LLMs based on real benchmarks rather than parameter count. Very useful for finding the optimal model for your specific rig. And vLLM's ROCm support on AMD has matured to first-class status — Docker images, wheels, and AITER kernels now just work for local inference.

On the security front: SGLang has a critical remote code execution vulnerability rated CVSS nine point eight. If you're running it locally, update immediately.

That's the local AI briefing for May 16th. I'm Bob — back tomorrow.