AI DAILY BRIEFING — EPISODE 2: LOCAL AI & AUTOMATION
Date: May 24, 2026

---

Welcome to the Local AI and Automation briefing for Sunday, May 24th. I'm Bob, and this is your weekly update on running AI on your own hardware.

Let's start with the wildest demo of the week. 0xSero posted a video of DeepSeek-v4-flash, a 284-billion-parameter model, autonomously SSHing into homelab servers, rewriting code using the REAP framework, downloading models and datasets, and running tests — all in a single 400,000-token session. He's also announced he'll be speaking at the REAP event about this integration. It's a preview of where agentic local AI is heading: models that don't just answer questions, but actually operate infrastructure.

EXO Labs had a busy week. They announced they're building a platform that lets you browse models, see real-world benchmarks, and pick a working configuration for any hardware setup — as simple as browsing a website. They're benchmarking every model at every quantization level across every price point. Their weekly community Discord education sessions also start today, May 24th, for anyone wanting to learn distributed inference on Apple Silicon clusters.

Meanwhile, Unsloth AI has been shipping at speed. Their latest Qwen3.6 Multi-Token Prediction GGUFs are delivering roughly double the speed of standard GGUF quantizations. The 27-billion-parameter model runs on just 18 gigabytes of RAM and hits 160 tokens per second. The 35B-A3B variant reaches 240 tokens per second. They also officially joined the PyTorch Ecosystem, cementing their position as the go-to quantization toolkit. Another benchmark from the community shows llama.cpp on dual RTX 3090 cards hitting 99.5 tokens per second on a 35B MTP model — nearly twice as fast as LM Studio on the same hardware.

On the Apple Silicon front, MLX continues to shine. The Mac Studio M3 Ultra with 512 gigabytes of unified memory remains the premium local inference machine at around ten to thirteen thousand dollars. It handles 200-billion-parameter models at usable speeds with 819 gigabytes per second of memory bandwidth. For lighter workloads, the 96-gigabyte M3 Ultra at around four thousand dollars runs quantized 70B models comfortably. Awni Hannun from Anthropic praised PyTorch's new ExecuTorch MLX delegate for bringing MLX performance to PyTorch workflows on Apple Silicon, another step toward making the Mac a first-class inference platform.
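For the show notes, here's a rough way to sanity-check "usable speeds." Token generation on large models tends to be limited by memory bandwidth, because each new token has to read the active weights from memory about once. The sketch below turns that rule of thumb into a ceiling. The 4-bit quantization and the dense-model assumption are ours, not from any benchmark mentioned in the episode, and real throughput will land below this number.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params_billion * bits_per_weight / 8


def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_billion: float,
                         bits_per_weight: float) -> float:
    """Upper bound on tokens per second if decoding is purely bandwidth-bound.

    Ignores KV-cache reads, compute time, and framework overhead, so treat
    the result as a ceiling rather than a prediction.
    """
    return bandwidth_gb_s / weights_gb(active_params_billion, bits_per_weight)


# 819 GB/s is the M3 Ultra figure quoted in the episode.
# 200B dense at 4 bits per weight is an assumed configuration.
ceiling = decode_ceiling_tok_s(819, 200, 4)
print(f"Bandwidth-bound ceiling: {ceiling:.1f} tokens/s")
```

A mixture-of-experts model only reads its active parameters per token, so passing the active count instead of the total gives a much higher ceiling for the same memory footprint.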

Now let's talk hardware prices — and they're not pretty. The RTX 5090 officially has a two-thousand-dollar MSRP, but street prices tell a different story. We're seeing listings from three thousand seven hundred fifty to four thousand seven hundred fifty dollars at major retailers. The ASUS ROG Astral variant is going for nearly thirty-nine hundred at Best Buy, and even open-box units on Amazon are hitting thirty-nine hundred. Availability is sporadic, with restock alerts firing and inventory vanishing within minutes.

The RTX PRO 6000 Blackwell workstation card with 96 gigabytes of GDDR7 is even more extreme: eight to ten thousand dollars, with reports of a thirty percent price hike in the last two weeks driven by AI demand. That single card can hold models of around 230 billion parameters, but only when heavily quantized to roughly 3 bits per weight; unquantized 16-bit weights at that size would need about 460 gigabytes. Even so, it's a compelling one-card solution despite the eye-watering price.
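For the show notes, a back-of-envelope check on what actually fits in 96 gigabytes. This is a minimal sketch that counts weight memory only and adds an assumed 10 percent allowance for KV cache and runtime buffers. That allowance is our guess, not a measured figure, and long contexts can need far more.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, vram_gb: float,
         overhead_fraction: float = 0.10) -> bool:
    """True if weights plus an assumed overhead allowance fit in vram_gb."""
    needed = weight_memory_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed <= vram_gb


# 230B parameters on a single 96 GB card, at several precisions.
for bits in (16, 8, 4, 3):
    size = weight_memory_gb(230, bits)
    verdict = "fits" if fits(230, bits, 96) else "does not fit"
    print(f"{bits:>2}-bit: {size:6.1f} GB of weights, {verdict} in 96 GB")
```

The takeaway is that precision matters more than the headline parameter count: the same model swings from hundreds of gigabytes to under a hundred depending on bits per weight.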

On the budget side, the used market is where the value lives. RTX 4090s are going for fifteen hundred to two thousand dollars used, while the venerable RTX 3090 remains the budget champion at five to eight hundred dollars, delivering roughly seventy to eighty percent of a 4090's speed at roughly a third of the price. For a multi-GPU build, two used 3090s give you 48 gigabytes of VRAM for around ten to sixteen hundred dollars, which is hard to beat for quantized 70B-class models.
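For the show notes, here's a quick dollars-per-gigabyte comparison using the price ranges quoted in this episode. It's a rough sketch: the 24 and 32 gigabyte capacities for the 4090 and 5090 come from public spec sheets rather than the episode, and using the midpoint of each range is our simplification.

```python
# name: (vram_gb, low_usd, high_usd). Prices are the ranges quoted in the episode.
CARDS = {
    "RTX 3090 (used)": (24, 500, 800),
    "RTX 4090 (used)": (24, 1500, 2000),
    "RTX 5090 (street)": (32, 3750, 4750),
    "RTX PRO 6000 Blackwell": (96, 8000, 10000),
}


def usd_per_gb(vram_gb: int, low_usd: int, high_usd: int) -> float:
    """Midpoint price divided by VRAM capacity."""
    return (low_usd + high_usd) / 2 / vram_gb


def build_cost(card: str, count: int) -> tuple[int, int]:
    """Low and high total price for `count` cards of the same model."""
    _, low, high = CARDS[card]
    return count * low, count * high


# Rank cards from cheapest to most expensive per gigabyte of VRAM.
ranked = sorted(CARDS, key=lambda name: usd_per_gb(*CARDS[name]))
for name in ranked:
    print(f"{name:<24} ${usd_per_gb(*CARDS[name]):6.0f} per GB")

low, high = build_cost("RTX 3090 (used)", 2)
print(f"Two used 3090s: ${low} to ${high} for 48 GB")
```

Dollars per gigabyte says nothing about speed, power draw, or the hassle of splitting a model across cards, so it's one input to a build decision, not the whole answer.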

For automation, the n8n plus Home Assistant stack is emerging as the leading self-hosted AI agent framework. Users are building agents that read Home Assistant entities, detect automation errors, and even create new automations from voice commands — all running locally with no cloud dependency. It's a practical pattern: n8n handles the agent logic and LLM calls while Home Assistant manages the physical smart home layer.
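For the show notes, a minimal sketch of the "read entities, spot broken automations" step, written as plain Python against Home Assistant's REST API. The episode doesn't say how users wire this into n8n, so treat it as an illustration of the idea rather than anyone's actual workflow. The base URL and token are placeholders, and treating an `unavailable` state as the error signal is our assumption.

```python
import json
import urllib.request


def fetch_states(base_url: str, token: str) -> list[dict]:
    """Fetch all entity states from Home Assistant's /api/states endpoint.

    `token` is a long-lived access token created in the Home Assistant UI.
    """
    request = urllib.request.Request(
        f"{base_url}/api/states",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def unavailable_automations(states: list[dict]) -> list[str]:
    """Return entity IDs of automations whose state is 'unavailable'."""
    return [
        s["entity_id"]
        for s in states
        if s["entity_id"].startswith("automation.") and s["state"] == "unavailable"
    ]


# Example usage (placeholder URL and token, not run here):
# states = fetch_states("http://homeassistant.local:8123", "YOUR_TOKEN")
# print(unavailable_automations(states))
```

The filtered list is the kind of compact summary you'd hand to a local LLM as context, instead of dumping every entity in the house into the prompt.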

On a lighter note, cocktailpeanut has been having way too much fun with Stable Audio 3 on Pinokio. He's been turning Discord notification sounds into club anthems, water drops into Gesaffelstein-style sets, and generating two-minute jazz piano solos in just four seconds, all running locally, even on CPU with no VRAM. It's a reminder that local AI isn't just about serious infrastructure; it's also about creative tools that anyone can run on their own machine.

That's the Local AI and Automation briefing for May 24th. Check the show notes for all sources and links. I'm Bob — see you next time.