AI DAILY BRIEFING — Episode: Local AI & Automation
Date: May 23, 2026
Duration: ~4-5 minutes spoken

---

Welcome to the Local AI and Automation briefing for Saturday, May 23rd. I'm Bob. This is your weekly roundup of what's happening in self-hosted inference, open-source tools, and the hardware that powers it all.

---

Let's start with the big news in local inference speed: llama.cpp has landed Multi-Token Prediction, or MTP, in its mainline build, and the results are dramatic. On a Qwen3.6 27B model at Q8 quality, a single GPU jumped from 22 tokens per second to 42. That's nearly double, with no hardware upgrade. Users on dual RTX 2080 Ti cards are reporting over 60 tokens per second, now beating vLLM in some configurations. Speculative decoding tuning in recent builds has pushed decode throughput up to seven times higher on RTX 5090 setups. If you're running llama.cpp, update now: this is the kind of free speed boost that comes along once a year.
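
For the show notes, here is a minimal sketch of the arithmetic behind "nearly double", using only the 22 and 42 tokens-per-second figures quoted above. The function names and the 1,000-token response length are illustrative assumptions, not part of any llama.cpp tooling.

```python
def speedup(before_tps: float, after_tps: float) -> float:
    """Ratio of new to old throughput, both in tokens per second."""
    if before_tps <= 0:
        raise ValueError("before_tps must be positive")
    return after_tps / before_tps


def seconds_for(tokens: int, tps: float) -> float:
    """Wall-clock seconds to decode `tokens` at a steady `tps`."""
    return tokens / tps


# Figures quoted in the briefing: 22 -> 42 tokens per second.
print(f"{speedup(22, 42):.2f}x")  # 1.91x

# Assumed 1,000-token response, purely for illustration.
print(f"{seconds_for(1000, 22):.1f}s before, {seconds_for(1000, 42):.1f}s after")
```

Real decode speed varies with prompt length and sampling settings, so treat the steady-rate assumption as a rough guide only.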

On the Apple Silicon front, Ollama version 0.19 switched its Mac backend to MLX, delivering roughly 57% better prefill and 93% better decode, which is close to a 2x improvement on decode. This means Ollama on Mac is now effectively running MLX natively. An M5 Max was benchmarked running Qwen3.6 27B at a blistering 50 tokens per second with MTPLX optimizations, up from 17 without, which works out to nearly a 3x jump. The Mac Studio M3 Ultra, with its configurable 512 gigs of unified memory, remains the champion for running truly large models: we're talking 600 billion parameter models fully in RAM, and up to 1.6 trillion parameters quantized. No other single consumer device comes close.
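
For the show notes, a rough sketch of why quantization decides what fits in 512 gigs. It counts weight storage only and assumes 1 GB is a billion bytes; KV cache, activations, and operating system overhead are ignored, so real limits are tighter. The function names are illustrative.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def max_bits_per_weight(params_billions: float, memory_gb: float) -> float:
    """Largest average bits per weight that fits in `memory_gb`, weights only."""
    return memory_gb * 8 / params_billions


# A 600B model at 8 bits per weight needs about 600 GB, more than 512 GB.
print(weight_memory_gb(600, 8))                 # 600.0

# So the figures quoted above imply quantized weights:
print(round(max_bits_per_weight(600, 512), 2))   # 6.83
print(round(max_bits_per_weight(1600, 512), 2))  # 2.56
```

Under these assumptions, a 600 billion parameter model fits only below roughly 7 bits per weight, and 1.6 trillion parameters only at around 2.5 bits or less.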

EXO Labs teased something interesting this week: a new feature that would let you browse available models and working configurations like a website, seeing how fast and popular each model is on any hardware. Quote: "What if hosting a model was as simple as browsing a website and picking a working configuration?" Coming soon, according to their X post. This could dramatically lower the barrier for newcomers to distributed inference.

On the self-hosted automation side, the n8n plus Home Assistant plus Ollama stack continues to solidify as the standard for privacy-respecting smart homes. Home Assistant deliberately avoids shipping its own AI features because of the real-world risks — turning off your heater by mistake has consequences. The community solution is n8n as the intelligence layer, with local LLMs handling natural language understanding and decision-making. Use cases include voice-first smart home control, predictive energy automation, and multi-step agent workflows with confirmation gates before physical actions. n8n's 2026 builds include native AI Agent nodes that make this much easier to wire up than even six months ago.
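
For the show notes, a minimal sketch of the "confirmation gate" idea in plain Python. It is not n8n's or Home Assistant's actual API. The domain list, class, and function names are hypothetical, and the `confirm` and `execute` callbacks stand in for whatever prompt and service call a real workflow would use.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical list of device domains treated as physical, hence gated.
PHYSICAL_DOMAINS = {"climate", "lock", "cover", "switch"}


@dataclass
class ProposedAction:
    """An action an LLM has suggested, before anything is executed."""
    domain: str     # e.g. "climate"
    service: str    # e.g. "turn_off"
    entity_id: str  # e.g. "climate.living_room" (made-up entity)


def needs_confirmation(action: ProposedAction) -> bool:
    """Physical actions need a human yes; everything else passes through."""
    return action.domain in PHYSICAL_DOMAINS


def run_with_gate(
    action: ProposedAction,
    confirm: Callable[[ProposedAction], bool],
    execute: Callable[[ProposedAction], None],
) -> str:
    """Execute the action only if it is ungated or the user confirms it."""
    if needs_confirmation(action) and not confirm(action):
        return "rejected"
    execute(action)
    return "executed"


# Stubbed demo: the user declines, so the heater is left alone.
executed = []
heater_off = ProposedAction("climate", "turn_off", "climate.living_room")
print(run_with_gate(heater_off, lambda a: False, executed.append))  # rejected
print(len(executed))  # 0
```

The design choice is that the model only proposes actions. Anything touching a physical device passes through a separate yes-or-no step that the model cannot skip.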

Ollama continues its dominance: 52 million monthly downloads in Q1 2026. Open WebUI remains the go-to frontend, stable and polished. The ecosystem around Ollama — LibreChat, AnythingLLM, Dify — is all healthy, and the stack is increasingly the default recommendation for anyone asking "how do I run ChatGPT at home?"

Now for the hardware price check, updated for May 23rd.

The RTX 5090 is settling into a street price range of $2,950 to $4,000. The MSI Gaming Trio OC at Best Buy sits at $3,600. The ASUS ROG Astral in both black and white is pushing $3,900 to $4,000. The Zotac variant on Amazon is at $3,700. The Ventus 3X at $2,950 is your entry point if you can catch stock. Bottom line: the 5090 is now widely available, but you're paying $1,000 to $2,000 over the original MSRP.

The RTX PRO 6000 Blackwell with 96 gigs of GDDR7 is running $8,900 to $10,000 street, up from around $8,000 at launch. Micro Center bumped it to $9,999 recently. This is the card you want if you need to run 100 billion-plus parameter models without offloading. Triton kernel benchmarks show near-peak memory bandwidth — almost 1,800 gigabytes per second.

For budget builds, the used RTX 3090 at $500 to $700 remains the undisputed value king. 24 gigs of VRAM, 20 to 40 tokens per second on 70B models. Two of them give you 48 gigs total and handle most quantized frontier models comfortably. The RTX 4090 used market sits around $1,500 to $2,500 — a significant premium for faster inference.

Multi-GPU math: a four-by-RTX-5090 build with motherboard, PSU, and cooling lands between $17,000 and $27,000 total. Power consumption for a rig like that running 24/7 at, say, 15 cents per kilowatt-hour runs about $130 to $260 per month — not trivial, and a reason solar-plus-battery setups are getting attention from serious homelabbers.
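
For the show notes, a small sketch of the electricity math, using the 15 cents per kilowatt-hour rate quoted above. The 30-day month and the 1,200 and 2,400 watt average draws are assumptions, back-derived from the $130 to $260 range rather than measured.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, running 24/7


def monthly_cost_usd(avg_watts: float, usd_per_kwh: float = 0.15) -> float:
    """Monthly electricity cost for a rig drawing `avg_watts` on average."""
    kwh = avg_watts / 1000 * HOURS_PER_MONTH
    return kwh * usd_per_kwh


def implied_avg_watts(monthly_usd: float, usd_per_kwh: float = 0.15) -> float:
    """Average draw in watts that would produce a given monthly bill."""
    return monthly_usd / usd_per_kwh / HOURS_PER_MONTH * 1000


# Assumed average draws of 1.2 kW and 2.4 kW for the whole rig.
print(f"${monthly_cost_usd(1200):.2f}")  # $129.60
print(f"${monthly_cost_usd(2400):.2f}")  # $259.20
```

Under these assumptions, the quoted range corresponds to an average draw of roughly 1.2 to 2.4 kilowatts. Plug in your own rate and measured wattage to get a figure for your setup.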

Quick shout-out to Pinokio, where cocktailpeanut has been showing off Stable Audio 3 running on CPU — turning Windows startup sounds and Discord notifications into full club anthems with audio inpainting. And he noted that a lot of recent Pinokio launchers are being built by people who have never coded before, just using AI agents. That's the democratization we like to see.

That's your local AI and automation briefing for May 23rd. Update llama.cpp, keep an eye on EXO Labs, and may your GPUs run cool. I'm Bob — back next week.

---
Sources: X posts from @exolabs, @cocktailpeanut, @TheAhmadOsman, @0xSero, @UnslothAI; llama.cpp GitHub PRs #22673, #23198; r/LocalLLaMA; GPU stock trackers @GPUStockAlerts, @RTX50Drops; Ollama release notes; MLX benchmarks via @ExecuteAuto, @SylonZero, @_ARahim_; n8n community.