Welcome to the AI Daily Briefing, Local Edition for Saturday, May 30th, 2026. I'm Bob, and here's what matters for running AI on your own hardware.

The biggest news for local AI this week: llama.cpp now has a real website. Georgi Gerganov launched llama dot app with a single-line cross-platform installer that gives you a unified llama CLI entrypoint. Any GGUF models you've already downloaded through the Hugging Face cache are automatically available. No more hunting for the right binary or compiling from source. The installer sets up model serving, agent integration, and retains full access to all the advanced flags power users rely on. This is the biggest usability leap for llama.cpp since the project started.

MTP speculative decoding is now in mainline llama.cpp, and the speedups are real. Users are reporting going from twenty-two to forty-two tokens per second on Qwen three point six models. Combined with the new website, running powerful local models has never been faster or easier. LM Studio and Unsloth have already integrated MTP support.

vLLM version zero point twenty-two shipped with DeepSeek V4 as a first-class package and async expert-parallel load balancing enabled by default, delivering one point five to three times throughput gains on GB200 hardware. Also new: a Rust-based BPE tokenizer called fastokens for faster long-context tokenization.

NVIDIA's NVFP4 quantization format is reshaping what fits on a single GPU. This four-bit floating point format delivers roughly three times memory reduction with near-zero accuracy loss, optimized exclusively for Blackwell hardware. NVIDIA has been shipping official NVFP4 checkpoints: Qwen three point six MoE at thirty-five billion total with only three billion active fits on a single Blackwell card. Gemma four twenty-six billion active four billion NVFP4 with multimodal vision support. GLM five point one. Some benchmarks show four-bit actually outperforming FP8 on AIME twenty twenty-five. For Blackwell GPU owners, NVFP4 is the new standard.

Liquid AI's LFM two point five is purpose-built for edge devices. Eight billion total parameters, only one billion active, one hundred twenty-eight thousand context, trained on thirty-eight trillion tokens. It runs on an entry-level laptop and has day-one support for llama.cpp, MLX, vLLM, and SGLang. The instruction following jumped from seventy-nine to ninety-two percent on IFEval. On an M five Max it hits two hundred fifty-three tokens per second. If you're building a local agent that needs reliable tool calling on consumer hardware, this model was literally designed for that use case.

Step three point seven Flash from Stepfun is another massive open-weight release this week. One hundred ninety-eight billion parameters with eleven billion active, vision-language capable, four hundred tokens per second, Apache two point zero license. People are running it on Mac Studio M four Max, DGX Spark, and AMD Ryzen AI Max Plus hardware. Community quants include ROCm FP4 for AMD GPUs and APEX GGUF variants.

Stanford's Hazy Research lab released OpenJarvis, a local-first personal AI agent framework. It's fully local by default via Ollama, promising up to eight hundred times lower API costs and four times lower latency compared to cloud agents. It ships with ready-to-use agents for morning briefings, web and document research, and local Python coding. It's open source and the model stays on your machine.

Ollama version zero point thirty RC is in testing with two big changes: a native MLX backend for Apple Silicon delivering roughly two times inference speedups, and direct llama.cpp integration replacing the older GGML architecture. Set OLLAMA underscore USE underscore MLX equals one and use dash MLX tagged models to try it.

Unsloth launched Unsloth Studio, a simplified web interface for dataset creation and fine-tuning. It includes an OpenAI-compatible API endpoint at localhost port eight eight eight eight, so you can fine-tune and then immediately query your model like it's ChatGPT.

The self-hosted tools stack of Ollama, Open WebUI, n8n, and Pinokio continues to dominate in twenty twenty-six for individuals wanting a completely private, zero-cost AI platform. A new educational project called Tiny-vLLM hit Hacker News with one hundred thirty-seven points. It's a from-scratch implementation of an LLM inference engine in C++ and CUDA, complete with FlashAttention, PagedAttention, continuous batching, and KV cache.

Hardware price check. RTX fifty ninety: around two thousand dollars at retail if you can find one, though international prices are spiking — some Korean listings hit over five thousand dollars. RTX PRO six thousand Blackwell: now nine thousand nine hundred ninety-nine dollars, up from eighty-six ninety-nine. Mac Studio M three Ultra: around eight thousand for two hundred fifty-six gigs, thirteen thousand for five hundred twelve gigs, though Apple appears to be limiting high-RAM configs as M five rumors circulate. Used RTX forty ninety: still holding at fifteen hundred to twenty-three hundred dollars. The GPU market remains brutal for anyone building a local rig.

And a community note: at zero x Sero has released a new Qwen three point six twenty-eight B REAP fine-tune on Hugging Face worth checking out. Meanwhile, the r slash LocalLLaMA subreddit has migrated to a new Discord home.

That's the local edition for Saturday, May thirtieth. Build something cool this weekend, and I'll catch you tomorrow.