Local AI & Automation Briefing — May 17th, 2026

This week, the local AI world didn't get new model releases — it got something better: inference software that makes your existing hardware faster, for free.

The big story is llama.cpp shipping Multi-Token Prediction support. The merge landed in builds b9180 through b9186, and the results are immediate. On a humble RTX 4060 with 8GB of VRAM, throughput jumped from forty to over fifty-one tokens per second. That's a twenty-eight percent speedup with no new hardware. Unsloth already has a Qwen 3.6 27B MTP GGUF on Hugging Face purpose-built for this. The llama.cpp repo also went through a massive restructuring, with the web UI component renamed to just UI, and build b9186 brings KleidiAI acceleration for Apple Silicon. Based on official GitHub releases and community benchmarks on X.
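For anyone who wants to check that percentage against their own before-and-after numbers, here is a minimal sketch of the arithmetic. The forty tokens per second baseline comes from the benchmark above; the 51.2 figure is an assumed stand-in for "over fifty-one", and the function name is purely illustrative.

```python
def speedup_percent(baseline_tps: float, new_tps: float) -> float:
    """Relative throughput gain, expressed as a percentage of the baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return (new_tps - baseline_tps) / baseline_tps * 100.0


def seconds_for_tokens(tokens: int, tps: float) -> float:
    """Wall-clock time to generate a reply of the given length."""
    return tokens / tps


# 40 tok/s is the quoted baseline; 51.2 tok/s is an ASSUMED value for "over fifty-one".
gain = speedup_percent(40.0, 51.2)
print(f"speedup: {gain:.0f}%")
print(f"1000-token reply: {seconds_for_tokens(1000, 40.0):.1f}s before, "
      f"{seconds_for_tokens(1000, 51.2):.1f}s after")
```

Plug in your own tokens-per-second readings from before and after the update to see what the change is worth on your card.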

vLLM dropped version zero-point-twenty-one-point-zero on May fifteenth. Three hundred sixty-seven commits from over two hundred contributors. The headliner is KV offloading with a Hybrid Memory Allocator: run larger models with less VRAM. They also shipped speculative decoding with thinking budgets for reasoning models, a TOKENSPEED MLA backend for Blackwell GPUs targeting DeepSeek R1 and Kimi K2.5 at low latency, and new model support including Cohere Eagle and Moondream 3. The big caution flag: this release requires a C++20 compiler and drops Transformers v4 support — so plan your upgrades carefully. Based on official vLLM GitHub releases.

SGLang shipped version zero-point-five-point-twelve on May sixteenth with thirty-five new contributors. DeepSeek V4 support merged at launch with extensive optimizations: ShadowRadix prefix caching, W4A8 and W4A4 MegaMoE kernels, HiCache with SSD offloading, and Speculative Decoding V2. One user benchmarked Nemotron 3 Super 120B at nearly fifteen tokens per second on a single GB10 with FP4 — that's a 120 billion parameter model on hardware that costs a few thousand dollars. Based on the official SGLang announcement on X.

On the self-hosted apps front, Ollama has been busy, and it also gave us a scare. Version zero-point-twenty-four brought native Codex App integration: launch it with ollama launch codex-app and you get a built-in browser for annotating live pages, review mode, and git worktree support. Recommended models include Kimi K2.6, GLM 5.1, and Gemma 4 31B. On the security side, CVE-2026-7482, nicknamed Bleeding Llama, scored a CVSS nine-point-one. Unauthenticated remote attackers can exfiltrate process memory, including API keys and prompts, via a crafted GGUF model. If you run Ollama, upgrade to version zero-point-seventeen-point-one or later immediately, and never expose it directly to the internet. Open WebUI also patched two CVEs, so update to version zero-point-nine-point-five or newer. Based on official Ollama and Open WebUI releases.
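If you want to turn that Ollama advice into a quick self-check, here is a rough sketch. It compares a version string against the zero-point-seventeen-point-one minimum quoted above and tests whether an OLLAMA_HOST-style value stays on the local machine. The helper names are hypothetical, the host check is deliberately conservative, and pre-release suffixes are ignored, so treat it as a starting point rather than a security tool.

```python
import re

MIN_PATCHED = (0, 17, 1)  # minimum version quoted in this briefing


def parse_version(text: str) -> tuple[int, int, int]:
    """Pull the first X.Y or X.Y.Z number out of a string such as CLI output."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    major, minor, patch = match.group(1), match.group(2), match.group(3) or "0"
    return (int(major), int(minor), int(patch))


def is_patched(version_text: str, minimum: tuple[int, int, int] = MIN_PATCHED) -> bool:
    """True if the version is at or above the minimum (suffixes like -rc1 are ignored)."""
    return parse_version(version_text) >= minimum


def host_is_loopback(ollama_host: str) -> bool:
    """True if an OLLAMA_HOST-style value points at the local machine only."""
    value = ollama_host.strip()
    if not value:
        return True  # unset: Ollama's default bind address is 127.0.0.1:11434
    value = value.split("://", 1)[-1]  # drop an optional scheme
    if value.startswith("["):  # bracketed IPv6, e.g. [::1]:11434
        host = value[1:].split("]", 1)[0]
    else:
        host = value.rsplit(":", 1)[0] if ":" in value else value
    # Conservative: anything outside this short list is treated as exposed.
    return host in {"127.0.0.1", "localhost", "::1"}


print(is_patched("0.17.0"))               # below the quoted minimum
print(is_patched("0.24.0"))               # at or above the quoted minimum
print(host_is_loopback("0.0.0.0:11434"))  # listens on every interface
```

Feed it the output of ollama --version and the value of your OLLAMA_HOST environment variable. A loopback result only tells you what Ollama binds to, so a reverse proxy or port forward in front of it can still expose the service.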

On X, Ahmad Osman posted a thought-provoking observation: continual learning has already been solved, it just requires model weights running locally on your own hardware, so big labs avoid the topic. His point, that local open-source AI will inevitably win, drew over two hundred fifty likes and eight thousand views. Cocktailpeanut also posted new content, likely related to Pinokio and local AI apps, though X's anti-scraping measures blocked access to the actual post.

Now, hardware and prices. The RTX 5090 situation is slowly improving but still painful. Founders Edition cards appear in periodic drops at the two thousand dollar MSRP, but AIB partner cards like the ASUS ROG Astral run three thousand nine hundred dollars and up. Supply remains tight: the Steam survey shows only zero-point-four-one percent adoption. The RTX PRO 6000 Blackwell with ninety-six gigabytes is getting more expensive. Micro Center recently raised its price by thirteen hundred dollars, to ten thousand dollars. On the Apple side, the Mac Studio M3 Ultra with five hundred twelve gigabytes has been quietly discontinued due to global DRAM shortages, so the maximum new config is two hundred fifty-six gigabytes. Used five hundred twelve gigabyte units command a forty-three percent premium over original retail, selling around fifteen thousand seven hundred dollars.

For budget builders, the RTX 3090 remains the king at six hundred fifty to eight hundred fifty dollars used. Twenty-four gigabytes of VRAM runs quantized twenty to thirty-four billion parameter models comfortably. A full budget build, meaning GPU, Ryzen 5 CPU, 64GB of DDR5, everything, costs around fifteen hundred to two thousand dollars. For a four-by RTX 5090 rig, the realistic range is eleven and a half to twenty thousand dollars including motherboard, PSU, cooling, and case. At full load, that system draws two and a half to four thousand watts. At the US average of sixteen cents per kilowatt-hour, running flat out around the clock, that's roughly two hundred ninety to four hundred sixty dollars monthly in electricity.
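Here is the electricity math as a small sketch, so you can swap in your own rate and workload. It takes the two and a half to four thousand watt draw and the sixteen cent rate quoted above. The 30-day month and the duty_cycle parameter are assumptions added for illustration.

```python
HOURS_PER_MONTH = 24 * 30  # ASSUMPTION: a 30-day month


def monthly_cost_usd(watts: float, usd_per_kwh: float, duty_cycle: float = 1.0) -> float:
    """Electricity cost for one month at a given average draw.

    duty_cycle is the fraction of the month spent at that draw
    (1.0 means full load around the clock).
    """
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    kwh = watts / 1000.0 * HOURS_PER_MONTH * duty_cycle
    return kwh * usd_per_kwh


low = monthly_cost_usd(2500, 0.16)   # 1800 kWh per month
high = monthly_cost_usd(4000, 0.16)  # 2880 kWh per month
print(f"full load, 24/7: ${low:.0f} to ${high:.0f} per month")

# Hypothetical lighter workload: full load for a quarter of the time.
print(f"25% duty cycle:  ${monthly_cost_usd(4000, 0.16, 0.25):.0f} per month")
```

Most home rigs spend a lot of time idle, so the duty cycle is the number that moves your bill the most.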

The power equation is driving interest in solar. A five kilowatt solar array with fifteen kilowatt-hours of battery storage costs twelve to eighteen thousand dollars after incentives, generates enough to offset a multi-GPU rig, and pays back in four to seven years, or faster in high-rate states like California. With solar costing as little as three to seven cents per kilowatt-hour versus sixteen cents from the grid, the math is getting hard to ignore.
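To run the solar numbers for your own situation, a simple payback calculation is a reasonable first pass. The sketch below uses the system cost, payback window, and sixteen cent grid rate quoted above, and works out how many grid kilowatt-hours the array would need to displace each year to hit a given payback. The function names are illustrative, and simple payback ignores financing, panel degradation, and future rate changes.

```python
def simple_payback_years(system_cost_usd: float, annual_savings_usd: float) -> float:
    """Years for avoided grid purchases to add up to the system cost."""
    if annual_savings_usd <= 0:
        raise ValueError("annual_savings_usd must be positive")
    return system_cost_usd / annual_savings_usd


def kwh_needed_per_year(system_cost_usd: float, payback_years: float,
                        grid_usd_per_kwh: float) -> float:
    """Grid kWh the array must displace each year to reach a target payback."""
    if payback_years <= 0 or grid_usd_per_kwh <= 0:
        raise ValueError("payback_years and grid_usd_per_kwh must be positive")
    return system_cost_usd / (payback_years * grid_usd_per_kwh)


GRID_RATE = 0.16  # US average quoted in this briefing, in dollars per kWh

for cost in (12_000, 18_000):
    for years in (4, 7):
        need = kwh_needed_per_year(cost, years, GRID_RATE)
        print(f"${cost:,} paid back in {years} years needs about "
              f"{need:,.0f} kWh per year offset")
```

Compare the kilowatt-hours it prints against your installer's production estimate and your own electricity rate, since both vary widely by location.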

That's the local AI briefing for May seventeenth. Update your inference engines, patch Ollama, and keep an eye on those 3090 bargains. I'm Bob — back tomorrow.