Local AI & Automation Briefing — Wednesday, May 27th, 2026

MTP SPECULATIVE DECODING LANDS IN LLAMA.CPP — GAME CHANGER FOR CONSUMER GPUS

The biggest news for local inference this week is Multi-Token Prediction landing in llama.cpp mainline. This isn't just a minor optimization. On a dual RTX 5090 setup, MTP pushed a dense 27-billion-parameter Qwen model from 51 tokens per second to 117, a 2.3x speedup with zero accuracy loss and only about a gigabyte of extra VRAM. LM Studio users are seeing roughly 63% gains out of the box. Unsloth has already shipped MTP-optimized GGUF quants for Qwen 3.6, reporting 1.8x faster generation. The key insight: dense models are memory-bandwidth bound, so each generated token normally costs a full read of the weights. MTP drafts several tokens ahead and the model verifies them in a single pass, which spreads that read across multiple tokens. Combined with llama.cpp's new Router Mode, which lets you serve multiple models from one endpoint, the local inference stack is getting dangerously close to server-grade throughput on consumer hardware.
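For a rough feel of why drafting ahead helps, here is a minimal sketch using the standard expected-acceptance estimate from the speculative decoding literature. The acceptance rates and draft length below are illustrative assumptions, not values measured on llama.cpp, and the model treats each drafted token as accepted independently, which real workloads only approximate.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the full model.

    Assumption: each drafted token is accepted independently with
    probability accept_rate, and every pass yields at least one token.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if accept_rate == 1.0:
        # Every draft accepted: draft_len tokens plus the bonus token.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def reported_speedup(baseline_tps: float, accelerated_tps: float) -> float:
    """Ratio of two measured throughputs in tokens per second."""
    return accelerated_tps / baseline_tps


if __name__ == "__main__":
    # Figures quoted in the briefing: 51 tok/s before, 117 tok/s after.
    print(f"reported speedup: {reported_speedup(51, 117):.2f}x")
    # Hypothetical acceptance rates with a hypothetical draft length of 3.
    for rate in (0.6, 0.7, 0.8):
        tokens = expected_tokens_per_pass(rate, draft_len=3)
        print(f"accept_rate={rate}: about {tokens:.2f} tokens per pass")
```

The tokens-per-pass figure is an upper bound on speedup, since drafting and verification are not free in practice.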

EXLLAMA V3 FIGHTS BACK, VLLM 0.21 SHIPS

The local inference engine wars are heating up. ExLlamaV3 received major updates this month, with users reporting 20 to 40 percent faster generation on RTX 40-series cards versus llama.cpp. Meanwhile, vLLM 0.21 dropped with KV cache offload plus heterogeneous memory architecture support, speculative decoding with thinking budget, Blackwell-specific optimizations, and DeepSeek V4 pipeline parallelism. The rough performance hierarchy on NVIDIA remains: SGLang at the top, then vLLM, then ExLlamaV3, then llama.cpp — though llama.cpp still leads in quantization flexibility and sheer model compatibility.

MAC STUDIO M3 ULTRA: DISCONTINUED, SCALPED, STILL WINNING

Apple's quiet discontinuation of high-RAM M3 Ultra configs has turned the used market into a feeding frenzy. A 256-gigabyte M3 Ultra that launched at six thousand two hundred dollars is now reselling for eleven thousand — a 77% markup. The 512-gigabyte model has gone from eleven thousand to nearly sixteen thousand. The reason? MLX on M3 Ultra handles frontier MoE models at 30 to 35 tokens per second while sipping just 120 watts. That's 6.7 times more energy-efficient than a quad RTX PRO 6000 workstation for comparable throughput. Alex Cheema from EXO Labs is polling the community on best models for 128-gigabyte local devices, and NVFP4 quantization on the DGX Spark is emerging as the sweet spot.
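The resale and efficiency figures above can be checked with a few lines of arithmetic. This is a small sketch using only the numbers quoted in the briefing; the "implied wattage" line assumes the two systems really do deliver equal throughput, which the briefing states only loosely.

```python
def markup_pct(launch_price: float, resale_price: float) -> float:
    """Percent increase from launch price to resale price."""
    return (resale_price - launch_price) / launch_price * 100.0


def tokens_per_joule(tokens_per_second: float, watts: float) -> float:
    """Energy efficiency: tokens generated per joule (1 W = 1 J/s)."""
    return tokens_per_second / watts


if __name__ == "__main__":
    print(f"256 GB markup: {markup_pct(6_200, 11_000):.0f}%")
    # "Nearly sixteen thousand" is treated as an upper bound here.
    print(f"512 GB markup: up to {markup_pct(11_000, 16_000):.0f}%")
    for tps in (30, 35):
        print(f"{tps} tok/s at 120 W: {tokens_per_joule(tps, 120):.3f} tok/J")
    # Assumption: equal throughput, so a 6.7x efficiency gap is a 6.7x power gap.
    print(f"implied comparison wattage: {6.7 * 120:.0f} W")
```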

Speaking of MLX, independent developers have custom-quantized DeepSeek V4 Flash for the 512-gigabyte M3 Ultra using a mixed scheme: Q4-K for the expert weights and Q8 for everything else. It hits 29 to 32 tokens per second at 512-token context and still holds 24 tokens per second at 128-thousand-token context, all within about 200 gigabytes of RAM. Aggregate batch throughput reaches 98 tokens per second at batch size 16.
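To put the batch and long-context numbers in perspective, here is a short sketch that derives per-sequence speed and long-context retention from the quoted figures. The function names are illustrative and nothing here is measured independently.

```python
def per_sequence_tps(aggregate_tps: float, batch_size: int) -> float:
    """Tokens per second each sequence sees when a batch shares the GPU."""
    return aggregate_tps / batch_size


def retention(long_ctx_tps: float, short_ctx_tps: float) -> float:
    """Fraction of short-context speed kept at long context."""
    return long_ctx_tps / short_ctx_tps


if __name__ == "__main__":
    print(f"per-sequence speed at batch 16: {per_sequence_tps(98, 16):.2f} tok/s")
    # The single-stream range quoted in the briefing is 29 to 32 tok/s.
    for single in (29, 32):
        print(f"aggregate gain over {single} tok/s: {98 / single:.1f}x")
        print(f"retention at 128K context: {retention(24, single):.0%}")
```

Batching trades individual latency for total throughput: each of the 16 streams runs slower than a lone stream, but the machine as a whole produces roughly three times as many tokens.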

HARDWARE PRICE CHECK

RTX 5090 street prices are holding steady in the three thousand six hundred to four thousand dollar range: the ASUS ROG Astral at 3,900, the MSI Suprim Liquid at 4,000, and the Gigabyte AORUS at 3,600. These cards are now widely available. The scalping era is over, but the premium over the mythical 1,999-dollar MSRP remains, at roughly 80 to 100 percent. A quad 5090 build still runs about 17 to 20 thousand dollars including motherboard, PSU, and cooling.
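Here is a quick sketch of the premium over MSRP and what a quad build leaves for everything besides the GPUs. It uses only the prices quoted above; pairing the low build total with the cheapest card and the high total with the priciest is an assumption made for illustration.

```python
MSRP_USD = 1_999  # MSRP figure quoted in the briefing


def premium_pct(street_price: float, msrp: float = MSRP_USD) -> float:
    """Percent premium of a street price over MSRP."""
    return (street_price - msrp) / msrp * 100.0


def platform_budget(build_total: float, card_price: float, num_cards: int = 4) -> float:
    """Money left for motherboard, PSU, and cooling after buying the GPUs."""
    return build_total - card_price * num_cards


if __name__ == "__main__":
    listings = [
        ("Gigabyte AORUS", 3_600),
        ("ASUS ROG Astral", 3_900),
        ("MSI Suprim Liquid", 4_000),
    ]
    for name, price in listings:
        print(f"{name}: {premium_pct(price):.0f}% over MSRP")
    print(f"low-end build leaves {platform_budget(17_000, 3_600)} dollars for the platform")
    print(f"high-end build leaves {platform_budget(20_000, 4_000)} dollars for the platform")
```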

The value play remains the used RTX 3090. Working units are listed on Facebook Marketplace for up to 750 dollars, with most going for five to seven hundred. That's 24 gigabytes of VRAM for under a thousand bucks. A pair of them, 48 gigabytes total, is still the sweet spot for running 70-billion-parameter models at Q4, since the weights alone come to roughly 40 gigabytes and won't fit on a single card. Used RTX 4090s are popping up at 800 to 1,000 dollars locally.
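To check how many 24-gigabyte cards a given model needs, here is a back-of-the-envelope sketch. The 4.5 bits per weight matches a Q4_0-style layout (32 four-bit weights plus a 16-bit scale per block); other Q4 variants run slightly higher. The 4 GB overhead for KV cache and buffers is a placeholder assumption that varies with context length.

```python
import math


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in gigabytes (10^9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def cards_needed(params_billion: float, bits_per_weight: float,
                 vram_gb_per_card: float, overhead_gb: float) -> int:
    """Minimum number of identical cards to hold weights plus overhead.

    Assumption: the model splits evenly across cards with no per-card waste.
    """
    total_gb = weight_gb(params_billion, bits_per_weight) + overhead_gb
    return math.ceil(total_gb / vram_gb_per_card)


if __name__ == "__main__":
    size = weight_gb(70, 4.5)
    print(f"70B at 4.5 bits per weight: about {size:.1f} GB of weights")
    print(f"RTX 3090s needed: {cards_needed(70, 4.5, 24, overhead_gb=4)}")
```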

RTX PRO 6000 Blackwell with 96 gigabytes: still no confirmed street pricing or independent benchmarks to report. The closest alternative with known pricing is the RTX 6000 Ada at 6,800 dollars with 48 gigabytes.

AI AUTOMATION TOOLING

The n8n plus Home Assistant self-hosted AI agent stack continues to gain traction. Users are running agentic workflows where n8n handles complex reasoning and planning, calling into Home Assistant's API for sensor data and device control — all local, no cloud. A German maker recently demonstrated an n8n agent that reads existing HA automations, finds errors, and will soon create new automations from voice messages. Unsloth AI also shipped a major update allowing users to run GPT-style and Claude-style models locally with built-in API connections, code execution, web search, and image generation — essentially a fully local alternative to hosted AI platforms.
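For a sense of what "calling into Home Assistant's API" looks like on the wire, here is a minimal sketch of the two REST calls an agent typically needs: reading an entity's state and invoking a service. It is written in Python for illustration; inside n8n the same requests would be configured in a workflow node rather than coded by hand. The base URL, token, and entity IDs are placeholders, not values from the briefing.

```python
import json
import urllib.request


def _headers(token: str) -> dict:
    # Home Assistant's REST API expects a long-lived access token as a Bearer token.
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}


def build_state_request(base_url: str, token: str, entity_id: str) -> urllib.request.Request:
    """GET /api/states/<entity_id> returns the entity's current state."""
    return urllib.request.Request(
        f"{base_url}/api/states/{entity_id}",
        headers=_headers(token),
        method="GET",
    )


def build_service_request(base_url: str, token: str, domain: str,
                          service: str, entity_id: str) -> urllib.request.Request:
    """POST /api/services/<domain>/<service> triggers a device action."""
    body = json.dumps({"entity_id": entity_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/services/{domain}/{service}",
        data=body,
        headers=_headers(token),
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0):
    """Send a prepared request and decode the JSON reply (needs a reachable HA instance)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    base = "http://homeassistant.local:8123"  # placeholder address
    token = "YOUR_LONG_LIVED_TOKEN"           # placeholder token
    read = build_state_request(base, token, "sensor.living_room_temperature")
    act = build_service_request(base, token, "light", "turn_on", "light.kitchen")
    print(read.get_method(), read.full_url)
    print(act.get_method(), act.full_url, act.data.decode("utf-8"))
```

Keeping request construction separate from sending makes it easy to log or review what the agent intends to do before it touches a real device.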

QUICK HITS

Atomic Bot launched an iOS app for its Hermes Agent, enabling remote control of self-hosted AI agents through Tailscale, Cloudflare Tunnel, or ngrok. Panthalassa demonstrated a wave-powered floating AI datacenter — GPUs in buoys, powered by ocean motion. Ahmad Osman continues building his open-source AI lab after raising over 30 million dollars, focusing on reproducible local infrastructure. And 0xSero made the observation of the week: "Agent code is like a psychedelic trip — it all makes sense until you start zooming in."

That's your Local AI and Automation Briefing for Wednesday, May 27th. I'm Bob.

—

Sources: X/Twitter posts from @TheAhmadOsman, @0xSero, @alexocheema, @exolabs, @UnslothAI, @cocktailpeanut, @awnihannun; vLLM release notes; llama.cpp GitHub; MLX community benchmarks. Hardware prices from RTX50Drops, Facebook Marketplace, and community reports. Automation tooling notes from maker community X posts.