AI Daily Briefing — May 26, 2026
Episode 2: Local AI & Automation

---

This week's biggest local AI story is speed. llama.cpp added Multi-Token Prediction (MTP) support, and the reported results are dramatic. Qwen3.6 27B on an A10G GPU jumped from 25 to 48 tokens per second, nearly doubling throughput. On Vulkan hardware, the gains are even bigger: Qwen3.6 35B more than doubled, from 42 to 89 tokens per second. Paired with Unsloth's QAT GGUF quants, models in the 20 to 35 billion parameter range now feel like daily drivers on consumer GPUs. If you're running local inference, updating llama.cpp and grabbing an MTP-enabled GGUF is the single highest-impact move you can make this week.
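For anyone who wants to sanity-check speedup claims like these, the arithmetic is a one-liner. The sketch below uses only the token rates quoted in this briefing; the function name `speedup` is just an illustrative helper, not part of llama.cpp.

```python
def speedup(before_tps, after_tps):
    """Return (ratio, percent_gain) between two decode rates in tokens/second."""
    ratio = after_tps / before_tps
    return ratio, (ratio - 1.0) * 100.0


# Before/after rates as quoted in this briefing (community reports, not our own measurements).
runs = {
    "Qwen3.6 27B, A10G": (25, 48),
    "Qwen3.6 35B, Vulkan": (42, 89),
}

for name, (before, after) in runs.items():
    ratio, pct = speedup(before, after)
    print(f"{name}: {ratio:.2f}x ({pct:.0f}% faster)")
```

The same helper works for your own before/after numbers once you have updated llama.cpp and rerun a benchmark on identical prompts and settings.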

EXO Labs dropped their biggest release yet. EXO 1.0 is now open-source under Apache 2.0. Alex Cheema's team is mining roughly a million local AI inference samples to build a performance database covering common hardware, model and quantization combinations, peak decode speeds, and dollar-per-gigabyte value. They're also seeing strong adoption of Apple Silicon clusters: users are linking Mac Studio machines with M1 Max chips over jumbo-frame networking for distributed inference. Distributed local inference is moving from experiment to production.

Xinference 2.9.0 shipped with DeepSeek V4 Flash and Pro support, backed by vLLM 0.21.0. The self-hosted stack is converging: Open WebUI as your frontend, Ollama or LocalAI serving models behind it, n8n for agent workflows, and Pinokio for one-click installs. The community sentiment is clear: self-hosted AI is graduating from hobby to infrastructure.

Now let's talk hardware and money. RTX 5090 street prices are holding at $3,600 to $3,900 for high-end models like the ASUS ROG Astral and MSI Gaming Trio. That's nearly double the $2,000 MSRP. DRAM shortages are keeping supply tight, and AI demand isn't helping. The RTX PRO 6000 Blackwell with 96 gigs of VRAM jumped from €9,000 to €12,000 — roughly $10,000 to $13,000 US — in just two days across vendors. For comparison, a maxed-out Mac Studio with M4 Max and 128 gigs of unified memory runs about $4,300 to $5,000. Different trade-offs: the Mac gives you bigger models at lower power, NVIDIA gives you raw throughput.
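To put those prices on a common scale, here is a rough dollars-per-gigabyte comparison. It is a sketch under stated assumptions: prices are the ones quoted in this briefing, the RTX PRO 6000 uses the post-jump US dollar estimate, and the 32 GB VRAM figure for the RTX 5090 is not from the briefing and is assumed here. GPU VRAM and Apple unified memory are not equivalent, so read the result as a capacity comparison, not a performance one.

```python
def dollars_per_gb(price_usd, memory_gb):
    """Price divided by on-board memory, in USD per gigabyte."""
    return price_usd / memory_gb


def premium_over_msrp(street_usd, msrp_usd):
    """Street price as a multiple of MSRP."""
    return street_usd / msrp_usd


# name: (low price USD, high price USD, memory in GB)
options = {
    "RTX 5090": (3600, 3900, 32),               # 32 GB is an assumption, not stated in the briefing
    "RTX PRO 6000 Blackwell": (13000, 13000, 96),  # post-jump estimate from the briefing
    "Mac Studio M4 Max": (4300, 5000, 128),     # unified memory, shared with the CPU
}

for name, (low, high, gb) in options.items():
    print(f"{name}: ${dollars_per_gb(low, gb):.0f} to ${dollars_per_gb(high, gb):.0f} per GB")

# RTX 5090 street price against the $2,000 MSRP quoted above.
print(f"RTX 5090 premium: {premium_over_msrp(3600, 2000):.2f}x to {premium_over_msrp(3900, 2000):.2f}x")
```

Swap in your local prices to rerun the comparison; the ranking can shift quickly when street prices move as fast as they did this week.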

On the infrastructure side, Proxmox GPU passthrough has hit a new level of automation. People are now deploying GPU VMs with a single Ansible plus NixOS command, running parallel inference streams inside Docker containers. The winning pattern for homelabs: Proxmox host, GPU passthrough VM, Docker inside the VM — isolation plus convenience.

Cooling remains critical for sustained inference. The community is trending toward undervolting and power-limiting cards as the first optimization, then adding good airflow or AIO liquid cooling. Some extreme builds are experimenting with immersion cooling in dielectric fluid tanks. The simple math: every watt saved is heat you don't have to remove.
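The "every watt saved" math can be made concrete with a small sketch. The wattage and throughput numbers below are hypothetical placeholders chosen for illustration, not measurements from any card; plug in readings from your own hardware.

```python
def energy_kwh(watts, hours):
    """Energy used (and heat released) by a steady load, in kilowatt-hours."""
    return watts * hours / 1000.0


def tokens_per_joule(tokens_per_second, watts):
    """Inference efficiency: tokens generated per joule of electrical energy."""
    return tokens_per_second / watts


# Hypothetical placeholder values, NOT measured data.
stock_watts, stock_tps = 450, 50.0
limited_watts, limited_tps = 350, 47.0

saved_watts = stock_watts - limited_watts
print(f"Heat avoided over 24 hours: {energy_kwh(saved_watts, 24):.1f} kWh")
print(f"Stock efficiency:   {tokens_per_joule(stock_tps, stock_watts):.3f} tokens/J")
print(f"Limited efficiency: {tokens_per_joule(limited_tps, limited_watts):.3f} tokens/J")
```

If a power limit costs only a few percent of throughput, tokens per joule goes up while the cooling load goes down, which is why the community treats it as the first optimization.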

On the automation front, n8n is emerging as the go-to for self-hosted AI agents, with LangChain nodes and MCP support enabling complex multi-step workflows. The trend is toward long-running autonomous agents on dedicated local hardware — mini PCs, old laptops, Mac Minis — running 24/7. Cloud fatigue plus energy constraints on centralized data centers are making local-first the default strategy for many serious users.

The bottom line: local AI just got faster, cheaper, and more practical. llama.cpp MTP is a real breakthrough. EXO 1.0 opens distributed inference to everyone. And the hardware market, while still expensive, is stabilizing into clear build paths for every budget.

---

Sources: llama.cpp MTP benchmarks from X user reports (@KayvonJafar, @wataryoichi); EXO Labs from @exolabs and @alexocheema; Xinference from @Xorbitsio; GPU pricing from @RTX50Drops and @0xLcrgs; Proxmox automation from @riywo; cooling discussion from X community; self-hosted ecosystem overview from X threads and web sources, May 25-26, 2026.