Local AI & Automation Briefing — May 25, 2026

The biggest story in local inference this week is MTP — Multi-Token Prediction — finally landing in llama.cpp. MTP drafts multiple future tokens and verifies them together, cutting the repeated VRAM weight loads that bottleneck every generation step. It has roughly eighty percent draft acceptance with zero accuracy loss and only about one gigabyte of extra VRAM overhead. The speedups are dramatic. Qwen three-point-six twenty-seven-B on a single RTX 5090 with MTP is up to seven times faster. On dual RTX 2080 Tis running Q6 — yes, those decade-old cards — it hits over one hundred thirty tokens per second, beating vLLM on the same hardware. On a single A10G, throughput jumps from twenty-five to forty-five tokens per second. If you run local models, you need to update llama.cpp right now.

The Qwen three-point-six MoE family has become the dominant model for consumer GPUs this week. The thirty-five-B A-three-B variant — that is thirty-five billion total parameters but only about three billion active per token — delivers quality competitive with much larger dense models while fitting comfortably on a single RTX 3090 at Q4 quality. On that twenty-four gig card, you get two thousand to twenty-five hundred tokens per second input processing and thirty to fifty tokens per second generation with two hundred fifty-six K context. On an M5 Pro with forty-eight gigs of unified memory, a six-bit quant runs at about seventy tokens per second via MLX. This MoE architecture is the sweet spot for consumer VRAM.

EXO Labs has officially released EXO one-point-oh under Apache two-point-oh. This is the framework for true model parallelism across heterogeneous clusters of Macs and other hardware — automatic device discovery, cluster formation, and an OpenAI-compatible API. A cluster of two to four high-memory Macs with sixty-four to one hundred ninety-two gigs of unified memory each can comfortably run seventy-B to four-hundred-five-B models entirely locally. Their stated goal: benchmark every major model against every quantization level on every hardware configuration.

Unsloth Studio also launched this week — a web UI for training and running open-source LLMs entirely on your own machine. Two times faster training with seventy percent less VRAM. Five hundred supported models. Five hundred K context on a single eighty-gig GPU. And Auto MTP that automatically picks the best speculative decoding settings for your hardware. One demo showed a four-bit Qwen model searching seventy-plus websites from a single prompt on just twenty gigs of RAM.

On the quantization front, NVFP4 — four-bit normal float — is delivering roughly twice the decode speed of GGUF on Apple Silicon via MLX. The normal float format has better statistical properties than integer quantization, meaning less quality loss at the same bit width. Combined with MLX's existing advantages in unified memory bandwidth, Apple Silicon is now gaining inference performance faster than consumer NVIDIA GPUs.

Now for hardware prices. The RTX 5090 street price remains in the three thousand to thirty-five hundred dollar range — well above the two-thousand-dollar MSRP — with scalping still active but inventory improving slightly. The RTX PRO 6000 Blackwell with ninety-six gigs is still hard to find under seven thousand dollars. RTX 3090 twenty-four-gig cards remain the budget king on the used market at roughly six to eight hundred dollars each. A four-by-5090 build, including motherboard, PSU, and cooling, runs seventeen to twenty-two thousand dollars depending on sourcing. And power math at fifteen cents per kilowatt hour: a four-GPU rig pulling sixteen hundred watts continuous costs about a hundred seventy dollars per month. Solar plus battery is increasingly attractive for 24-7 inference rigs.

Finally, the n8n plus Home Assistant combo is being called the private AI operating system for your house. n8n handles the intelligence layer with AI agent nodes, tool calling, memory, and branching logic, while Home Assistant provides over three thousand device integrations as tools for those agents — all self-hosted, no cloud lock-in.

That is your Local AI and Automation Briefing for May twenty-fifth, twenty twenty-six.