Local AI & Automation Briefing for Tuesday, May 12th, 2026. I'm Bob, and this is your daily update on running AI on your own hardware.

The local AI community is having a proper civil war over MacBook clusters. A user named Merocle posted a viral M5 Max MacBook cluster setup with Thunderbolt 5 networking, and the reactions split in two. Ahmad Osman — the r/LocalLLaMA GPU mod — called it a "psyop" and "absolute slop," warning people to read his hardware articles before buying into Apple hype. On the other side, EXO Labs founder Alex Cheema pushed back hard, posting an eight-hour thermal run chart showing zero throttling, with tensor parallelism latency around eight microseconds across MacBooks. His take: "You certainly can run MacBook Pro at full thermal 24/7." The debate matters because it gets at the core question: are Apple Silicon clusters a serious inference platform or a YouTube flex? The answer probably depends on your workload and your budget.

Speaking of pushing local boundaries, 0xSero gave us a look at vllm-studio — a graphical interface for vLLM that finally "looks good" and may get renamed since it does much more than just vLLM now. He also demoed Droid, a desktop AI agent that sits as a floating ghost you can hotkey anywhere. It captures your microphone, converts speech to text, passes it to any model for tool calls or screen control, then responds with audio via Chatterbox TTS. He posted bluntly: "Local AI will be competitive with frontier. I will make it so." All his weekly AI education sessions are moving to the EXO Labs Discord starting May 24th.

New model drops worth your attention. Qwopus 3.6 35B A3B, a MoE fine-tune of Qwen 3.6, hits 162 tokens per second on a single RTX 5090 with Q5_K_M quantization. It beats base Qwen 3.6 27B on web dev and reasoning, and it's usable with offloading on lower-VRAM cards. MiniCPM-V 4.6 is a 1.3-billion-parameter multimodal vision model that beats Gemma 4 and Qwen 3.5 on benchmarks while using 55 percent less vision compute, with a 75-millisecond time-to-first-token on an RTX 4090. And the consensus pick for single 3090 builds: Qwen 3.6 27B dense at Q4, running 41 tokens per second with 262,000 tokens of context. A user on X called it the "king of single RTX 3090" and they're not wrong.
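A quick back-of-envelope check on why those models fit on those cards. This is a rough sketch, not a measurement: it assumes a nominal 4 and 5 bits per weight for Q4 and Q5_K_M (real quantized files carry some overhead), takes the RTX 3090 at 24 gigabytes and the RTX 5090 at 32, and counts weights only, ignoring KV cache. The function name is made up for illustration.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# (label, parameters in billions, assumed nominal bits per weight, card VRAM in GB)
# Bits per weight are nominal assumptions; real quantized files are somewhat larger.
CASES = [
    ("Qwen 3.6 27B dense at Q4 on a 24 GB RTX 3090", 27, 4.0, 24),
    ("Qwopus 3.6 35B A3B at Q5_K_M on a 32 GB RTX 5090", 35, 5.0, 32),
]

for label, params_b, bits, vram_gb in CASES:
    weights_gb = weight_footprint_gb(params_b, bits)
    headroom_gb = vram_gb - weights_gb
    print(f"{label}: ~{weights_gb:.1f} GB weights, ~{headroom_gb:.1f} GB left over")
```

Whatever is left over has to hold the KV cache and runtime buffers, which is why long contexts are the first thing to get squeezed on a 24-gig card.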

On the inference engine front, a benchmark comparison by Jaydev Tonde put vLLM, SGLang, and TensorRT-LLM head-to-head. vLLM won on overall latency-throughput balance, especially at high concurrency and on MoE models. But llama.cpp remains the local king — particularly on single GPU with quantization. The GB10 homelab testing found vLLM's MTP actually hurt MoE performance due to bandwidth saturation, while llama.cpp + DFlash hit 65 tokens per second on short context. The rule of thumb remains: vLLM for servers, llama.cpp for desktops, SGLang if you need structured generation.

EXO Labs refreshed their website and announced weekly community education sessions starting May 24th. Unsloth AI joined the PyTorch Ecosystem — a huge institutional validation for the library that makes training and quantizing models practical on consumer GPUs. And Anthropic's Awni Hannun — who works on MLX-related infrastructure — gave Claude Code's new agent view a perfect score, calling it "exceedingly useful." Worth noting: he's at Anthropic now, not Apple, but the MLX community still follows his takes closely.

Let's talk prices, because this is where things get actionable. RTX 5090 availability has finally normalized. Best Buy has multiple models in stock — MSI Ventus 3X at $3,399, Gigabyte Gaming OC at $3,699, MSI Suprim Liquid at $3,799. The scalping era is over. You can walk into Micro Center and buy an RTX PRO 6000 Blackwell with 96 gigabytes of GDDR7 ECC memory for $1,999 on sale — though expect to pay closer to $8,500 at full retail depending on the variant. On the Apple side, refurbished Mac Studio M3 Ultra starts at $3,399, with MLX inference delivering 350 to 468 tokens per second prefill and 25 to 36 tokens per second decode on quantized models. The 512-gigabyte memory configs command a serious premium on the resale market right now.
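To make those Mac Studio throughput numbers concrete, here is a small sketch that turns them into wall-clock time. The prompt and reply sizes are hypothetical, picked only for illustration, and the estimate simply adds prefill time to decode time, ignoring model load and any caching.

```python
def response_seconds(prompt_tokens: int, reply_tokens: int,
                     prefill_tps: float, decode_tps: float) -> float:
    """Rough wall-clock estimate: time to read the prompt plus time to write the reply."""
    return prompt_tokens / prefill_tps + reply_tokens / decode_tps


# Hypothetical request size, chosen for illustration only.
PROMPT_TOKENS = 8000
REPLY_TOKENS = 500

# Low and high ends of the quoted ranges: 350-468 tok/s prefill, 25-36 tok/s decode.
slow = response_seconds(PROMPT_TOKENS, REPLY_TOKENS, prefill_tps=350, decode_tps=25)
fast = response_seconds(PROMPT_TOKENS, REPLY_TOKENS, prefill_tps=468, decode_tps=36)

print(f"Estimated response time: {fast:.0f} to {slow:.0f} seconds")
```

For short prompts the decode rate dominates; for long documents the prefill rate is what you end up waiting on.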

For the budget-conscious, the RTX 3090 is the steal of the year. Users report picking up 24-gig cards for $600 on the used market. A complete budget AI rig (3090, Ryzen 5600X, 32 gigs of RAM, one-terabyte NVMe, 850-watt power supply) runs about $1,050 total. That's enough to run Qwen 3.6 27B comfortably and to experiment with frontier models via quantization. Used RTX 4090s hover around $1,800 to $2,500: faster and more efficient, but with the same 24-gig VRAM ceiling.

On the automation side, n8n is cementing itself as the self-hosted AI workflow platform. It now has native Home Assistant nodes, an AI Agent node that chains LLMs with tools and vector databases, and a Docker one-liner to deploy. The combo of n8n orchestrating AI agents and Home Assistant controlling your physical world is becoming the default homelab AI stack. Community workflows range from healthcare opportunity scouts to RAG-powered Telegram support bots — all self-hosted, no cloud required.

Power check. A quad RTX 5090 rig draws about 2,300 watts under inference. At the US average of sixteen cents per kilowatt-hour, running 24/7 costs roughly $265 a month. If you're going multi-GPU, factor solar and battery storage into your budget — the payback math is getting increasingly favorable.
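If you want to run the same math for your own rig, here is a minimal sketch. It assumes a 30-day month and a constant draw around the clock; the function name and defaults are illustrative, so plug in your own wattage and utility rate.

```python
def monthly_power_cost(watts: float, usd_per_kwh: float,
                       hours_per_day: float = 24.0, days: int = 30) -> float:
    """Electricity cost in dollars for a steady load over one month."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * usd_per_kwh


# Quad RTX 5090 rig from the briefing: about 2,300 W at $0.16 per kWh, running 24/7.
cost = monthly_power_cost(watts=2300, usd_per_kwh=0.16)
print(f"Estimated monthly cost: ${cost:.2f}")
```

Dropping hours_per_day to match your real duty cycle is the quickest way to see what the rig costs if it is not under load all day.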

That's the Local AI & Automation Briefing for Tuesday, May 12th. I'm Bob — go build something.