Local AI & Automation Briefing — Thursday, May 28th, 2026

LLAMA.CPP ON RTX 5090: 273 TOKENS PER SECOND AND A ROCM BREAKTHROUGH

A community developer has posted detailed llama.cpp benchmarks on the RTX 5090, reporting 9,300 tokens per second for prompt processing and 273 tokens per second for text generation, all with a six-line CUDA 12.8 header patch. The full pipeline, including speed and quality benchmarks, is open-sourced with a live dashboard. On the AMD side, a full ROCm llama.cpp setup is now confirmed working on the RX 7800 XT with ROCm 7.2.3 on Ubuntu 24.04, delivering 549 tokens per second prompt processing and 23 tokens per second generation on Qwen3 27B at Q3_K_M. It is one of the most detailed community reports so far of a consumer AMD GPU running local LLMs end to end, and a big deal for people who don't want to pay the NVIDIA tax.
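To put those throughput figures in practical terms, here is a rough sketch of how prompt-processing and generation speeds combine into a response time. The function name and the 8,000-token prompt with a 500-token reply are illustrative assumptions, not part of the posted benchmarks, and the sketch assumes constant throughput, which real runs only approximate.

```python
def estimate_latency_seconds(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """Rough end-to-end time: prefill time plus decode time.

    pp_tps is prompt-processing speed, tg_tps is generation speed,
    both in tokens per second.
    """
    if pp_tps <= 0 or tg_tps <= 0:
        raise ValueError("throughput must be positive")
    prefill = prompt_tokens / pp_tps
    decode = output_tokens / tg_tps
    return prefill + decode


# Throughput figures quoted in the briefing; the workload
# (8,000 tokens in, 500 tokens out) is hypothetical.
rtx_5090 = estimate_latency_seconds(8000, 500, pp_tps=9300, tg_tps=273)
rx_7800_xt = estimate_latency_seconds(8000, 500, pp_tps=549, tg_tps=23)

print(f"RTX 5090:   {rtx_5090:.1f} s")
print(f"RX 7800 XT: {rx_7800_xt:.1f} s")
```

The two results are not a like-for-like comparison, since the briefing only names the model and quantization for the AMD run.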

But if you're on NVIDIA, upgrade immediately. CUDA 13.3 was just released, fixing incorrect outputs that plagued llama.cpp on CUDA 13.2. Don't skip this one.
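One way to check which toolkit a machine is on is to read the release number out of `nvcc --version`. The sketch below parses that text and flags the affected release; the helper names and the sample string are illustrative, and the affected set contains only 13.2 because that is the only release the briefing names.

```python
import re

# Assumption: 13.2 is the only affected release, as stated in the briefing.
AFFECTED_RELEASES = {(13, 2)}


def parse_cuda_release(nvcc_output):
    """Extract (major, minor) from `nvcc --version` text, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def needs_upgrade(nvcc_output):
    """True when the detected toolkit release is in the affected set."""
    return parse_cuda_release(nvcc_output) in AFFECTED_RELEASES


# Illustrative text in the shape nvcc prints; the build suffix is made up.
sample = "Cuda compilation tools, release 13.2, V13.2.0"
print(needs_upgrade(sample))
```

In practice the text would come from something like `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`.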

SERVING ENGINE WARS: VLLM 0.22 AND SGLANG 0.5.12 LAND

The server-grade inference engines also had big cycles. vLLM 0.22 shipped with 367 commits from over 200 contributors. Highlights include formal deprecation of Transformers v4, a C++20 build requirement, KV cache offload with hybrid memory allocation, speculative decoding with a thinking budget for reasoning models, and a new TOKENSPEED MLA backend for Blackwell GPUs. There are significant performance gains too: FlashInfer top-k sampling is now the default, AllPool dot forward is 51 percent faster, and NumPy zero-copy embedding serialization is in.

SGLang released 0.5.12.post1 with critical DeepSeek V4 stability fixes: it resolves garbled text on B200 and B300 single-token decode, restores GQA compressor accuracy on GSM8K from 82 percent to 96 percent, and eliminates 20 to 40 second cold-bucket stalls via pre-warming. If you're running DeepSeek V4 on SGLang, treat this patch as effectively mandatory.

The rough hierarchy for local NVIDIA serving remains: SGLang at the top for bleeding-edge models, then vLLM for broad ecosystem support, then llama.cpp for quantization flexibility and local-first ease — with ExLlamaV3 notably quiet this cycle.

MLX ECOSYSTEM KEEPS EXPANDING, CONIFER LAUNCHES JUNE 1ST

Apple's MLX framework continues to attract developers. Cider now offers custom W4A8 and W8A8 acceleration kernels for faster prefill. MLX-Swift provides native inference on iPhone and iPad, often beating llama.cpp on those platforms. LiteRT-LM is emerging as an on-device runtime with impressive numbers on the iPhone 17 Pro. On the Mac side, M4 Max shows 20 to 50 percent gains over M2 Ultra on 70-billion-parameter quantized models.

A new tool called Conifer is launching June 1st — it abstracts away quantization, memory management, scheduling, and storage for local inference, claims to beat llama.cpp on some benchmarks, and promises to "just work." Worth watching.

HARDWARE PRICE CHECK

RTX 5090 street prices remain in the 3,600 to 4,000 dollar range. Cards are widely available now, with the scalping era over, but still at a massive premium over the mythical 1,999 dollar MSRP. ASUS ROG Astral at 3,900, MSI Suprim Liquid at 4,000, Gigabyte AORUS at 3,600.

The value play remains the used RTX 3090 at 700 to 750 dollars on Facebook Marketplace. At under a thousand bucks, 24 gigabytes of VRAM is still the sweet spot, though a 70-billion-parameter model at Q4_K_M runs to roughly 40 gigabytes of weights, so it takes two cards or partial CPU offload. Used RTX 4090s are popping up at 800 to 1,000 dollars locally.
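A back-of-the-envelope sketch of how a quantized 70-billion-parameter model compares with a 24-gigabyte card. The bits-per-weight figure for Q4_K_M is an approximation assumed for illustration, and the estimate covers weights only, not KV cache or runtime overhead.

```python
import math


def weight_size_gb(params_billion, bits_per_weight):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return params_billion * bits_per_weight / 8


# Assumption: Q4_K_M averages roughly 4.85 bits per weight; the exact
# figure varies by model architecture.
Q4_K_M_BPW = 4.85
CARD_VRAM_GB = 24

size_70b = weight_size_gb(70, Q4_K_M_BPW)
cards_needed = math.ceil(size_70b / CARD_VRAM_GB)

print(f"70B at Q4_K_M: about {size_70b:.1f} GB of weights")
print(f"24 GB cards needed for weights alone: {cards_needed}")
```

The same function gives a quick comparison for other sizes, for example `weight_size_gb(7, 16)` for a 7-billion-parameter model at 16-bit precision.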

Mac Studio M3 Ultra high-RAM configs continue to sell at steep premiums on the used market, with 256-gigabyte models going for 11,000 dollars, a 77 percent markup over the original 6,200. The 512-gigabyte model is climbing toward 16,000.
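The markup figure is simple arithmetic, sketched here so the same check can be run on the other prices in this segment. The function name is illustrative; the prices are the ones quoted in the briefing.

```python
def markup_percent(original_price, current_price):
    """Percentage increase of current_price over original_price."""
    return (current_price - original_price) / original_price * 100


# Mac Studio M3 Ultra 256 GB: original 6,200, used-market 11,000.
print(f"Mac Studio 256 GB: {markup_percent(6200, 11000):.0f} percent")

# RTX 5090: 1,999 MSRP against the low and high ends of street pricing.
print(f"RTX 5090 low end:  {markup_percent(1999, 3600):.0f} percent")
print(f"RTX 5090 high end: {markup_percent(1999, 4000):.0f} percent")
```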

RTX PRO 6000 Blackwell with 96 gigabytes remains vaporware — no confirmed pricing, no benchmarks. The closest shipping alternative is the RTX 6000 Ada at 6,800 dollars with 48 gigabytes.

SELF-HOSTED STACK AND QUICK HITS

The standard local AI stack has crystallized: Ollama as the foundation engine, Open WebUI as the frontend with RAG and multi-user support, and n8n for visual agent workflows — all in Docker, all free. AnythingLLM is the most discussed document and RAG tool this month, pairing with Ollama for fully local knowledge base applications.
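For a sense of how the pieces in that stack talk to each other, here is a minimal sketch of a client calling Ollama's local HTTP API, which is the same kind of call a frontend or workflow tool makes. It assumes Ollama is listening on its default port 11434; the helper names and the default model name are placeholders, so substitute a model you have actually pulled.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if the container maps another port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="llama3.1"):
    """Build a non-streaming generate request for a local Ollama server.

    The default model name is a placeholder, not a recommendation.
    """
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama3.1"):
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as response:
        return json.loads(response.read())["response"]
```

With the server running, `generate("Summarize this briefing in one sentence.")` would return the model's reply as a string. Setting `stream` to false keeps the example simple by returning one JSON object instead of a stream of chunks.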

Tencent launched WorkBuddy globally with a skills gallery of over 100 expert agents. Atomic Bot shipped an iOS app for its Hermes Agent, enabling Tailscale-secured remote control of self-hosted AI agents. And community consensus on quantization: a 70-billion-parameter model at Q4_K_M now beats a full-precision 7-billion-parameter model on real tasks, while using far less memory than the same 70-billion-parameter model would at full precision. The conversation has shifted from "biggest model" to "right quant level."

That's your Local AI and Automation Briefing for Thursday, May 28th. I'm Bob.

—

Sources: X/Twitter posts from @witcheer, @NeoAIForecast, @venelin_valkov, @RazzReport, @lmsysorg, @PrizmalAi, @locbuilds, @mudler_it, @TheEcomNomad; vLLM release notes; SGLang release notes; llama.cpp GitHub; MLX community on Hugging Face; Facebook Marketplace and RTX50Drops for hardware pricing; maker community X posts.