AI Daily Briefing — May 20, 2026
Episode 2: Local AI & Automation

Let's talk about what's happening on the ground — not in cloud data centers, but in homelabs, on desks, and under desks. The local AI scene had a busy 24 hours.

The big story is Multi-Token Prediction landing in llama.cpp mainline. This is a genuine speed breakthrough. On Qwen 3.6 27B, a single GPU goes from 22 tokens per second to 42 — nearly double. Stack it with ngram-mod speculation and you hit 56 tokens per second. Community forks like ik_llama.cpp are pushing even further, hitting 73 tokens per second on an RTX 3090 with IQ4_KS quantization. Same hardware, no extra model files, no fork required anymore — MTP is upstream now.
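To put those figures side by side, here is a small sketch that expresses each one as a multiple of the 22 tokens-per-second baseline. The throughput numbers are the ones quoted above; the labels and the function name are just illustrative. Keep in mind the 73 figure comes from a different setup (ik_llama.cpp on an RTX 3090 with IQ4_KS), so its ratio against this baseline is only a rough comparison.

```python
# Throughput figures quoted in the briefing, in tokens per second.
BASELINE_TPS = 22.0

reported_tps = {
    "baseline": 22.0,
    "MTP": 42.0,
    "MTP + ngram-mod": 56.0,
    # Different fork, GPU, and quantization: rough comparison only.
    "ik_llama.cpp (RTX 3090, IQ4_KS)": 73.0,
}


def speedup(tps: float, baseline: float = BASELINE_TPS) -> float:
    """Return throughput as a multiple of the baseline."""
    return tps / baseline


# Round to two decimals for readability.
speedups = {name: round(speedup(tps), 2) for name, tps in reported_tps.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

MTP alone works out to about 1.9 times the baseline, which is where "nearly double" comes from.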

Unsloth AI shipped their MTP GGUF version of Qwen 3.6 in 4-bit. It runs on just 20 gigabytes of RAM and can search over 70 websites from a single prompt using auto MTP plus speculative decoding. Unsloth Studio now automatically picks the best settings for your device — Mac, CPU, or GPU.

Speaking of Mac, Ollama's MLX integration story matured. Ollama 0.19 switched to MLX as the inference engine on Apple Silicon, delivering 57 percent faster prefill and 93 percent faster decode. That's close to double the decode throughput. On Mac, Ollama and MLX are no longer separate choices. Ollama is MLX now.
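A quick sketch of how "percent faster" maps to a throughput multiplier, assuming the quoted percentages describe increases in tokens per second (the briefing does not say how they were measured). The function names are illustrative.

```python
def percent_faster_to_multiplier(percent_faster: float) -> float:
    """Convert 'N percent faster' into a throughput multiplier."""
    return 1.0 + percent_faster / 100.0


def time_fraction(multiplier: float) -> float:
    """Fraction of the original wall-clock time needed at the new speed."""
    return 1.0 / multiplier


# Percentages quoted in the briefing.
prefill_multiplier = percent_faster_to_multiplier(57)
decode_multiplier = percent_faster_to_multiplier(93)

print(f"prefill: {prefill_multiplier:.2f}x throughput")
print(f"decode:  {decode_multiplier:.2f}x throughput")
print(f"decode time vs before: {time_fraction(decode_multiplier):.0%}")
```

Under that assumption, decode lands at 1.93 times the old rate, so generating the same response takes a little over half the time it used to.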

PyTorch announced the ExecuTorch MLX delegate, bridging the PyTorch ecosystem directly to Apple Silicon Metal GPUs. Export PyTorch models — LLMs, speech, MoE — and run them quantized on Mac hardware. This opens a much wider funnel for local inference on Apple's hardware.

In the inference server space, vLLM and SGLang continue their arms race. Recent 7-GPU mixed Blackwell and Ada benchmarks show vLLM winning on long-context prefill, while SGLang matches it on pure Blackwell setups but crashes when mixing GPU architectures. If you're building a mixed-generation GPU inference rig, vLLM is the safer bet.

And now the hardware. Let's talk numbers.

RTX 5090 street pricing is settling between thirty-six hundred and forty-one hundred dollars, with stock now consistently available at Best Buy and Walmart. The MSI Suprim Liquid SOC is thirty-eight hundred, the ASUS ROG Astral is thirty-nine hundred. Availability is better than launch, but premium AIB cards still command a markup.

The RTX PRO 6000 Blackwell with 96 gigs of GDDR7 is now shipping through PNY and European resellers at roughly ten and a half to eleven thousand dollars. That's the workstation card to watch for large model inference if you need more than 32 gigs of VRAM on a single card.

On the value side, Ahmad Osman posted what might be the best reminder of the week: two used RTX 3090s at 700 to 900 dollars each — that's under two grand for 48 gigs of VRAM — paired with Qwen 3.6 27B and a self-hosted SearXNG gives you what he calls "Opus 4.5 at home." He's not wrong. Used 3090s remain the price-to-performance king for local AI.

Used 4090s are running fourteen to eighteen hundred, making them compelling for efficiency but harder to stack in multi-GPU builds on a budget.
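The value argument is easier to see as dollars per gigabyte of VRAM. This is a rough sketch using the price ranges quoted in this briefing. The VRAM sizes for the 4090 (24 GB) and 5090 (32 GB) are the cards' standard specs rather than figures from the pricing above, and the class and field names are illustrative. It ignores power draw, motherboard and PSU costs, and the overhead of splitting a model across two cards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuOption:
    name: str
    vram_gb: int
    price_low: float   # low end of the quoted price range, USD
    price_high: float  # high end of the quoted price range, USD

    def dollars_per_gb(self) -> tuple:
        """Return (low, high) cost per GB of VRAM."""
        return (self.price_low / self.vram_gb, self.price_high / self.vram_gb)


options = [
    GpuOption("2x used RTX 3090", 48, 1400, 1800),
    GpuOption("1x used RTX 4090", 24, 1400, 1800),
    GpuOption("RTX 5090", 32, 3600, 4100),
    GpuOption("RTX PRO 6000 Blackwell", 96, 10500, 11000),
]

# Rank by the low end of the cost-per-GB range, cheapest first.
ranked = sorted(options, key=lambda o: o.dollars_per_gb()[0])

for opt in ranked:
    low, high = opt.dollars_per_gb()
    print(f"{opt.name}: ${low:.0f} to ${high:.0f} per GB")
```

On these numbers the pair of 3090s comes in at roughly half the per-gigabyte cost of a used 4090, and well under a third of either new Blackwell card.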

Mac Studio M3 Ultra with up to 256 gigs of unified memory is shipping but still lacks public MLX inference benchmarks. Extrapolating from M3 Max numbers, we're probably looking at sixty to ninety-plus tokens per second on 70B-class 4-bit models, with massive context windows thanks to that unified pool.

A practical note on monetization: Vast.ai reports verified 4090 rigs can earn 700 to over fourteen hundred dollars per month renting idle GPU time. For anyone hesitating on a multi-GPU build, the rental math may close the gap faster than you think.
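Here is a minimal sketch of that rental math. The 700 and 1,400 dollar monthly figures are the ones quoted above. The 6,000 dollar rig cost and the 100 dollar monthly power bill are hypothetical placeholders, since the briefing does not say how many GPUs a "rig" contains or what it costs to run. It also assumes constant earnings, which real rental utilization will not give you.

```python
def payback_months(hardware_cost: float,
                   monthly_income: float,
                   monthly_costs: float = 0.0) -> float:
    """Months of rental income needed to recover the hardware cost."""
    net = monthly_income - monthly_costs
    if net <= 0:
        raise ValueError("net monthly income must be positive to break even")
    return hardware_cost / net


# Hypothetical inputs: replace with your own build cost and power bill.
RIG_COST = 6000.0
POWER_COST_PER_MONTH = 100.0

# Earnings range quoted in the briefing.
worst_case = payback_months(RIG_COST, 700.0, POWER_COST_PER_MONTH)
best_case = payback_months(RIG_COST, 1400.0, POWER_COST_PER_MONTH)

print(f"payback: {best_case:.1f} to {worst_case:.1f} months")
```

Swap in your own numbers before drawing conclusions. The result is very sensitive to the power cost and to how many hours the rig actually gets rented.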

That's your local AI briefing for May 20. The key takeaway: llama.cpp MTP is here, Mac inference just got unified under MLX, and two used 3090s still beat almost everything else on value.