AI Daily Briefing: Local AI & Automation — May 18, 2026
Episode 2: Local AI Deployment & Automation

Welcome to the Local AI and Automation briefing for Monday, May eighteenth. Here's what's happening in the world of self-hosted inference, consumer GPUs, and AI automation.

Let's start with inference engines, because this week it's all about speculative decoding. The big three engines are converging on Multi-Token Prediction, or MTP, as the primary performance lever. llama.cpp merged MTP support this week, alongside Vulkan optimizations for better GPU efficiency across platforms. vLLM zero-point-twenty-one delivers roughly two-point-one times the token throughput with MTP enabled; one project hit one hundred thirty-six tokens per second on dual RTX 5090s, up from sixty-two. And SGLang version zero-point-five-point-twelve dropped with thirty-five new contributors, adding a TokenSpeed MLA attention backend specifically for Blackwell GPUs and FP4 quantization support for DeepSeek and GLM-5 models. If you're running local inference, enabling speculative decoding is now likely the single biggest speedup you can apply.
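For anyone who wants to see why drafting and verifying saves time, here is a minimal Python sketch of greedy speculative decoding. It is a toy under stated assumptions: `target_model` and `draft_model` are hypothetical stand-in functions, not any engine's API, and a separate draft function is used here, whereas MTP-style setups typically draft with extra prediction heads on the main model. The accept-or-correct logic is the part that carries over.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_model(ctx: List[int]) -> int:
    # Hypothetical "big" model: a deterministic toy rule, not a real LLM.
    return (ctx[-1] * 3 + 1) % 17


def draft_model(ctx: List[int]) -> int:
    # Hypothetical cheap drafter: agrees with the target except when the
    # context length is a multiple of 3, where it is deliberately wrong.
    guess = target_model(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 17


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    # Baseline: one target call per generated token.
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1. The drafter proposes k tokens in a row.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks every proposed position. A real engine scores
        #    all k positions in a single batched forward pass, so this whole
        #    verification step is counted as one pass.
        target_passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                accepted.append(expected)  # keep the target's own token, drop the rest
                break
        else:
            # Every draft token matched, so the target's next token comes for free.
            accepted.append(target(tokens + proposal))
        tokens.extend(accepted)
    return tokens[:len(prompt) + n_new], target_passes


out, passes = speculative_decode(target_model, draft_model, [1], n_new=20)
print(out == greedy_decode(target_model, [1], 20), passes)
```

The output matches plain greedy decoding token for token, but the target runs far fewer passes than tokens generated. How much that helps in practice depends on how often the drafter agrees with the target.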

On the model side, Unsloth AI released Multi-Token Prediction GGUFs for the Qwen three-point-six series, covering the twenty-seven-b and thirty-five-b models. Their two-bit quantized versions can run complex agentic tasks with thirty-plus tool calls on just twelve to thirteen gigabytes of RAM. Meanwhile, the open-weights gpt-oss twenty-b MoE model dropped with TurboQuant three-bit quantization and MLX optimizations, achieving sixty to eighty tokens per second on sixteen-gig M-series Macs with a one-hundred-thirty-one-K context window. That's a fully offline, zero-recurring-cost frontier experience on a laptop. Hugging Face now hosts over one hundred seventy-six thousand GGUF models, with nearly ten thousand new uploads every month.

In the world of extreme compression, 0xSero has been pushing the boundaries of MoE pruning. Using REAP, a one-shot technique from Cerebras, they're stripping up to thirty-five percent of expert parameters from DeepSeek-scale models with minimal quality loss. The result: one-hundred-eighty-nine-billion-parameter models fitting in roughly ninety-four gigabytes at four-bit precision, served through their new vLLM Studio, which features hot-reloading and agent-friendly feedback loops. Their latest work explores re-pruning based on real usage data, essentially teaching the model which experts it actually needs.
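The ninety-four-gigabyte figure is easy to sanity-check. Here is a minimal sketch of the arithmetic, assuming decimal gigabytes and counting weights only; the helper name `weight_memory_gb` is made up for illustration, and real deployments also need room for the KV cache, activations, and quantization metadata.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB (weights only, no KV cache or overhead)."""
    # params_billion * 1e9 weights * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billion * bits_per_weight / 8


print(weight_memory_gb(189, 4))   # 94.5
print(weight_memory_gb(189, 16))  # 378.0, the same model at 16-bit for comparison
```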

Hardware corner. The RTX 5090 is now widely available but painfully expensive: three thousand five hundred to four thousand one hundred dollars for new units, roughly double the original two-thousand-dollar MSRP. The RTX PRO 6000 Blackwell with ninety-six gigs of GDDR7 just jumped from eighty-seven hundred to nearly ten thousand dollars at retail. For budget builders, used RTX 3090s remain the sweet spot at six hundred to nine hundred dollars each, with twenty-four gigs of VRAM: enough to run aggressively quantized seventy-b models on one card, or four-bit versions across a pair, at usable speeds. Two used 3090s often beat a single 4090 for both cost and total VRAM.
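One rough way to compare these cards is dollars per gigabyte of VRAM. The sketch below uses only the prices and capacities quoted in this briefing, rounds the RTX PRO 6000 to ten thousand dollars, and ignores speed, power, and platform costs, so treat it as a first-pass filter rather than a buying guide. The helper name `dollars_per_gb` is made up for illustration.

```python
def dollars_per_gb(price_usd: float, vram_gb: float) -> float:
    """Purchase price divided by VRAM capacity."""
    return price_usd / vram_gb


# (price in USD, VRAM in GB), taken from the figures quoted in the briefing.
cards = {
    "used RTX 3090 (low end)": (600, 24),
    "used RTX 3090 (high end)": (900, 24),
    "RTX PRO 6000 Blackwell": (10_000, 96),
}

for name, (price, vram) in cards.items():
    print(f"{name}: ${dollars_per_gb(price, vram):.2f} per GB")
# used RTX 3090 (low end): $25.00 per GB
# used RTX 3090 (high end): $37.50 per GB
# RTX PRO 6000 Blackwell: $104.17 per GB
```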

On the Apple Silicon side, the M3 Ultra Mac Studio with five hundred twelve gigabytes of unified memory runs around ten thousand dollars. One detailed comparison showed it delivering thirty-five tokens per second on a large model versus forty-seven from a forty-five-thousand-dollar quad RTX PRO 6000 workstation, but with six-point-seven times better energy efficiency per token and a three-year total cost of ownership that is forty thousand dollars lower. At roughly one hundred twenty watts under inference load versus over one thousand watts for the NVIDIA rig, the power math is becoming hard to ignore.
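To make the power math concrete, here is a small sketch that turns watts and tokens per second into joules per token. It uses the figures quoted above and takes exactly one thousand watts as a floor for the NVIDIA rig, which is an assumption since the comparison only says "over one thousand". On that floor the efficiency gap comes out near six-point-two times; the quoted six-point-seven would line up with a draw of roughly one thousand eighty watts.

```python
def joules_per_token(watts: float, tokens_per_second: float) -> float:
    """Energy spent per generated token: power divided by throughput."""
    return watts / tokens_per_second


mac = joules_per_token(120, 35)      # about 3.4 J per token
nvidia = joules_per_token(1000, 47)  # about 21.3 J per token, using 1000 W as a floor
ratio = nvidia / mac
print(f"{mac:.2f} J/token vs {nvidia:.2f} J/token, ratio {ratio:.1f}x")

# NVIDIA power draw that would reproduce the quoted 6.7x efficiency gap.
implied_watts = 6.7 * mac * 47
print(f"implied NVIDIA draw: {implied_watts:.0f} W")
```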

Speaking of power, the self-hosted community is getting creative. Solar-plus-battery AI rigs are moving from concept to reality — Google's Energy Park demoed direct DC solar feeding liquid-cooled AI racks, saving twelve to fifteen percent on conversion losses. And in the almost-sci-fi category, Nvidia-backed Starcloud already has an H100 GPU in orbit with a second satellite coming later this year — using vacuum radiative cooling and twenty-four-seven solar power.

In automation tools, the n8n plus Home Assistant combination is emerging as the go-to self-hosted AI automation stack. n8n's visual workflow builder now has native AI agent nodes, vector memory, and MCP support — meaning tools like Claude and Cursor can directly control your home automations. Paired with Home Assistant for the physical execution layer, you get a fully local, zero-recurring-cost agent framework. On the security front, Open WebUI published two new CVEs this week — if you're running an instance, update to version zero-point-nine-point-five or later immediately.
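Under the hood, the physical execution step comes down to a call against Home Assistant's REST API, whichever tool issues it. Here is a minimal sketch that builds such a request using only the Python standard library. The host name, token, and entity ID are placeholders, and the helper `build_service_call` is made up for illustration; the request is constructed but not sent.

```python
import json
import urllib.request


def build_service_call(base_url: str, token: str, domain: str,
                       service: str, data: dict) -> urllib.request.Request:
    """Build a POST to Home Assistant's /api/services/<domain>/<service> endpoint."""
    url = f"{base_url.rstrip('/')}/api/services/{domain}/{service}"
    return urllib.request.Request(
        url,
        data=json.dumps(data).encode("utf-8"),
        headers={
            # A long-lived access token created in the Home Assistant user profile.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values: swap in your own host, token, and entity.
req = build_service_call(
    "http://homeassistant.local:8123",
    "YOUR_LONG_LIVED_TOKEN",
    "light",
    "turn_on",
    {"entity_id": "light.living_room"},
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

Keeping the request builder separate from the send step makes it easy to log or review what an agent is about to do before it touches a real device.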

And a quick shout-out to Pinokio, which now offers one-click installs for MLX-native video generation via Phosphene two-point-zero and Qwen3 TTS — both running entirely offline on Apple Silicon. Local AI is no longer just about chatbots — it's becoming a full creative suite.

That's the Local AI briefing for May eighteenth. Build locally, think globally. I'm Bob — back tomorrow.