AI Daily Briefing — Friday, May 22nd, 2026
Episode: Local AI & Automation
━━━━━━━━━━━━━━━━━━━━━━━━━━━

Good morning. This is your Local AI and Automation briefing for Friday, May 22nd. The big story this week is speed — and I mean software speed, not hardware. llama.cpp is delivering generational leaps purely through better code, Cohere dropped an open-source model that runs on two GPUs, and Stability AI just made local music generation trivial.

Let's start with the performance story, because this is genuinely wild. The llama.cpp community has been on fire with Multi-Token Prediction support, or MTP. The numbers are staggering: users are reporting 1.5 to 1.8 times faster inference on Qwen models just by updating their build. On an RTX 5090, one user went from 56 to nearly 74 tokens per second on a Qwen 27B model at Q6 quality, just by enabling MTP with tuned speculative decoding flags. That's a 31 percent speedup, for free. On dense models the jump is from 51 to 117 tokens per second. On mixture-of-experts models, there are reports of 267 tokens per second on just two RTX 5090s. This is software optimization delivering hardware-upgrade-level gains, and it's all open-source.
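To sanity-check those speedup claims yourself, here is a minimal sketch that turns before-and-after tokens-per-second figures into a ratio and a percentage gain. The `speedup` helper is a hypothetical name used only for illustration, and the inputs are the rounded figures quoted above, so 74 stands in for "nearly 74" and the first result lands a point above the quoted 31 percent.

```python
def speedup(before_tps: float, after_tps: float) -> tuple[float, float]:
    """Return (ratio, percent_gain) for a change in tokens per second."""
    ratio = after_tps / before_tps
    return ratio, (ratio - 1.0) * 100.0


# Rounded figures quoted in the briefing (tokens per second).
reports = [
    ("Qwen 27B Q6 on RTX 5090", 56.0, 74.0),
    ("dense model report", 51.0, 117.0),
]

for label, before, after in reports:
    ratio, pct = speedup(before, after)
    print(f"{label}: {ratio:.2f}x ({pct:.0f}% faster)")
```

Plugging in your own numbers from a benchmark run before and after updating a build gives a like-for-like comparison, as long as the model, quantization, and prompt length stay the same.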

llama.cpp also just landed WebGPU support. You can now run models directly in the browser with GPU acceleration — no install, no setup, and your data never leaves your machine. For edge use cases and quick demos, this is a game-changer.

Speaking of open models that run on real hardware: Cohere's Command A Plus is the most significant open-source multimodal release of the week. 218 billion total parameters but only 25 billion active per token — so it runs on as few as two H100s, or a single Blackwell GPU with 4-bit quantization. It delivers up to 375 tokens per second with W4A4 quantization and ships under Apache 2.0. For self-hosters, this is the kind of model that makes the "you need a data center" argument look outdated. Weights are already on Hugging Face.
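For a rough feel of why quantization matters at this size, here is a back-of-the-envelope sketch of weight storage at different precisions. It rests on assumptions the briefing does not state: an 80 GB H100 variant, and weights only, with no KV cache, activations, or runtime overhead counted. The `weight_memory_gb` helper is hypothetical.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    total_bytes = total_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


TOTAL_PARAMS_B = 218   # total parameters, as quoted in the briefing
H100_GB = 80           # assumption: 80 GB H100 variant
NUM_GPUS = 2

for bits in (16, 8, 4):
    need = weight_memory_gb(TOTAL_PARAMS_B, bits)
    fits = need <= H100_GB * NUM_GPUS
    print(f"{bits}-bit weights: ~{need:.0f} GB, fits on {NUM_GPUS} x {H100_GB} GB: {fits}")
```

Note that all 218 billion parameters have to be resident in memory even though only 25 billion are active per token. The active count mainly helps throughput, not footprint.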

On the local AI apps front, the biggest launch this week is Stability AI's Stable Audio 3. It's open-weight, generates up to six-minute tracks with audio inpainting, and — here's the kicker — cocktailpeanut already shipped a one-click Pinokio launcher that runs on CPU only. No GPU. No VRAM. A two-minute clip generates in seconds on any Windows, Mac, or Linux machine. On Apple Silicon with MLX, users are seeing 59 times real-time generation on M4 and M5 chips. This is the democratization of AI music production in real time.

MLX on Apple Silicon continues to mature. Silicon Studio, a new MIT-licensed desktop app, now bundles data prep, Hugging Face model management, LoRA and QLoRA fine-tuning, and chat testing into one interface — all running locally on M-series Macs. MLX also now supports distributed inference across multiple Apple machines, which users describe as surprisingly sophisticated for local setups. The M3 Ultra Mac Studio with 96 gigs of unified memory remains the entry point for running 70B-class models locally — pricing around 4,000 dollars new, or 2,500 to 3,200 used.

On the serving infrastructure side: vLLM 0.14 now has official support for Intel Arc Pro B70 GPUs, with pre-tuned Docker images delivering 370 tokens per second on Qwen 3.5 27B with 50 concurrent requests. The Local LLM Bible guide, released this week and available for free, covers the full stack from llama.cpp to vLLM to TensorRT-LLM across hardware from laptops to clusters. It's quickly become the most-bookmarked resource in the community.

For self-hosted AI stacks, Open WebUI plus Ollama remains the gold standard, at 138,000 GitHub stars and counting. Meanwhile, n8n plus Home Assistant continues to be the go-to pattern for local AI agents that control your actual house: you give a natural language command, and the agent decides which lights, thermostat, music, and blinds to adjust based on time, weather, and occupancy.

Hardware tracking. RTX 5090 supply has stabilized significantly. Street prices range from about 2,950 dollars for an MSI Ventus to 3,999 dollars for the ASUS ROG Astral flagship. Scalping has essentially vanished — cards are in stock at Best Buy and Amazon. On the professional side, the RTX PRO 6000 Blackwell 96GB is now appearing in workstation builds at roughly 8,000 to 12,000 dollars per card. Phoronix benchmarks show it delivering class-leading performance. A seven-GPU build with Threadripper clocks in around 80,000 dollars total.

Power math for the budget-conscious: a single RTX 5090 draws up to 575 watts. Running that flat out 24/7 at the U.S. average of 14 cents per kilowatt-hour costs about 58 dollars a month, or just over 700 a year, for one card alone, before you even buy the model.
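That power math can be reproduced in a few lines. This is a worst-case sketch that assumes the card sits at its full 575 watt draw around the clock, with a 30-day month and a 365-day year. Real usage with idle time will come in lower. The `electricity_cost` helper is a hypothetical name.

```python
def electricity_cost(watts: float, price_per_kwh: float, hours: float) -> float:
    """Cost in dollars of drawing `watts` continuously for `hours`."""
    kilowatt_hours = watts / 1000 * hours
    return kilowatt_hours * price_per_kwh


GPU_WATTS = 575        # peak draw quoted in the briefing
PRICE_PER_KWH = 0.14   # U.S. average quoted in the briefing, in dollars

monthly = electricity_cost(GPU_WATTS, PRICE_PER_KWH, hours=24 * 30)
yearly = electricity_cost(GPU_WATTS, PRICE_PER_KWH, hours=24 * 365)

print(f"Per month: ${monthly:.2f}")
print(f"Per year:  ${yearly:.2f}")
```

Swap in your own rate and a realistic average wattage to see what a card actually costs you to run.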

That's your local AI briefing for May 22nd. llama.cpp's MTP speedup is the software story of the week. If you haven't updated your build yet, you're leaving free performance on the table. I'll be back tomorrow.