Local AI & Automation Briefing for Wednesday, May thirteenth, twenty twenty-six. I'm Bob, and this is your daily update on running AI on your own hardware.

Big news from Unsloth AI. They've released GGUF quantizations of Qwen three point six with Multi-Token Prediction, or MTP, for both the twenty-seven-billion dense and thirty-five-billion A-three-B mixture-of-experts variants. MTP predicts multiple tokens per forward pass instead of one, roughly doubling throughput on code generation and structured data tasks. These GGUFs only work with a specific llama.cpp pull request branch, number twenty-two six seven three, but once that's set up, you get substantially faster inference on the same GPU. The thirty-five-billion A-three-B model with Q-five-K-M quantization hits one hundred sixty-two tokens per second on a single RTX five-oh-ninety. The twenty-seven-billion dense model fits comfortably on an RTX four-oh-ninety or three-oh-ninety at Q-four. Links are on Unsloth's Hugging Face page; search for Qwen three point six MTP GGUF.
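For anyone who wants to sanity-check that "roughly doubling" figure, here's a minimal Python sketch. It assumes MTP behaves like draft-and-verify decoding, where each extra predicted token is accepted independently with a fixed probability. That acceptance model, the function name, and the sample numbers are illustrative assumptions, not measurements from Unsloth or llama.cpp.

```python
def expected_tokens_per_pass(accept_prob: float, draft_tokens: int) -> float:
    """Expected tokens emitted per forward pass under a simple acceptance model.

    One token always comes from the main prediction. Each of the
    `draft_tokens` extra predictions only counts if every earlier extra
    prediction was accepted too, so the total is the geometric sum
    1 + p + p^2 + ... + p^k.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    if draft_tokens < 0:
        raise ValueError("draft_tokens must be zero or more")
    return sum(accept_prob ** i for i in range(draft_tokens + 1))


if __name__ == "__main__":
    # Hypothetical settings: two extra tokens, 80 percent acceptance each.
    print(expected_tokens_per_pass(accept_prob=0.8, draft_tokens=2))
```

This ignores the extra compute the prediction heads cost, so real-world speedups would land below the number it prints.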

DeepSeek V-four Flash continues making waves. This is a one-point-six-trillion-parameter mixture-of-experts model with open weights. Community members are running it locally on everything from Jetson Thor to DGX Spark, with Q-two quants delivering five to fourteen tokens per second on modest hardware. The model hits ninety-three point five percent on LiveCodeBench and eighty point six on SWE-bench — frontier-class numbers from an open-weights model you can run at home. Specialized inference engines like Dwarf Star four are being built specifically for this architecture. The quantization resistance on DeepSeek V-four is remarkable — quality holds up well even at extreme compression levels.

Nous Research's Hermes Agent hit one hundred forty five thousand GitHub stars, but the local AI angle is what matters here. The new Computer Use driver — built on CUA's open-source driver — lets any vision-capable model control your Mac desktop in the background. You install it with "hermes computer-use install" and then run "hermes dash t computer_use chat." The agent clicks, types, and scrolls while you keep full control of your keyboard and mouse. It's MIT-licensed, self-improving through skill loops, and runs fully offline with local models. Cocktailpeanut — the creator of Pinokio — confirmed on X that Pinokio now installs Hermes the exact same way you'd install it manually, sharing the same dot-hermes folder. His take: "Pinokio is not doing anything weird to package or wrap these apps — it's an automation engine." This matters because Pinokio's one-click installer has become the default on-ramp for non-technical users getting into local AI.

The Mac cluster civil war continues. Alex Cheema of EXO Labs posted extensive technical arguments defending MacBook clusters for distributed inference. He calculated that tensor parallelism latency sits at roughly eight microseconds across Thunderbolt five, and argued that memory bandwidth is additive when you cluster Macs — "there's no magical force that stops you moving the memory around once you start clustering them." He offered to run specific tests on M-three Ultras for skeptics. On the other side, Ahmad Osman — the r slash LocalLLaMA GPU mod — posted simply that "open source AI is going to win" after a dinner discussion, maintaining his well-known skepticism of Apple Silicon for serious inference. The debate has settled into a productive place: Mac clusters work for specific workloads if you understand the bandwidth math, but NVIDIA still dominates raw throughput.
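If you want to play with the bandwidth math yourself, here's a rough Python sketch. It assumes decode speed is limited only by how fast each machine streams its share of the weights, plus a fixed synchronization cost per token. The eight-microsecond latency is the figure quoted in the briefing; the weight size, per-node bandwidth, and sync count are hypothetical placeholders, and the model ignores compute time, KV cache reads, and the size of the data exchanged.

```python
def decode_tokens_per_second(
    active_weight_gb: float,
    nodes: int,
    node_bandwidth_gb_s: float,
    sync_latency_us: float,
    syncs_per_token: int,
) -> float:
    """Optimistic decode speed for tensor parallelism across `nodes` machines."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    # Each node streams only its shard of the weights for every token.
    read_seconds = (active_weight_gb / nodes) / node_bandwidth_gb_s
    # A single machine needs no cross-machine synchronization.
    sync_seconds = 0.0 if nodes == 1 else syncs_per_token * sync_latency_us * 1e-6
    return 1.0 / (read_seconds + sync_seconds)


if __name__ == "__main__":
    # Hypothetical workload: 40 GB of weights read per token, 800 GB/s per node,
    # 128 synchronization points per token at 8 microseconds each.
    for n in (1, 2, 4):
        rate = decode_tokens_per_second(40.0, n, 800.0, 8.0, 128)
        print(f"{n} node(s): {rate:.1f} tokens/s")
```

Under these assumptions the speedup from adding machines is close to linear but never quite reaches it, because the sync cost stays fixed while the read time shrinks.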

EXO Labs refreshed their website at exolabs dot net with clearer messaging about their mission — frontier AI on local hardware. They've added a Discord link and announced weekly community education sessions starting May twenty-fourth, aligned with zero-x-Sero's AI education program which is also moving to the EXO Labs Discord. Sero also teased an interview with Theo and Ben Davis dropping this week, calling them "the same in person as they are online."

Let's talk hardware prices, because this is where the rubber meets the road. RTX five-oh-ninety availability has stabilized completely. Best Buy has consistent stock: MSI Ventus three-X at three thousand four hundred dollars, ASUS TUF Gaming at three thousand four hundred forty, Gigabyte Gaming OC at three thousand five hundred eighty-nine, MSI Suprim Liquid at three thousand eight hundred, ASUS ROG Astral at three thousand nine hundred. The scalping era is officially over. You can buy a five-oh-ninety at retail today. A Japanese stock alert account reported fifty thousand MSI Ventus units hitting distribution — supply is genuinely flowing. The Gigabyte AORUS RTX five-oh-ninety Infinity edition is also now listed as available.

For professional builds, the RTX PRO six thousand Blackwell with ninety-six gigabytes of GDDR-seven ECC memory is hitting channels. Pricing varies wildly by variant: the passive server-cooled RTX six thousand D with eighty-four gigabytes goes for about six thousand three hundred dollars in China, below its MSRP equivalent. Full workstation variants run closer to nine thousand. A recent X post noted that two M-five MacBooks cost roughly the same as one RTX PRO six thousand, with the same total memory, but the NVIDIA card is significantly faster.

On the Apple side, refurbished Mac Studio M-three Ultra starts at about eighteen hundred fifty dollars for the twenty-eight-core GPU variant, going up to around three thousand eight hundred for the sixty-core. The five-hundred-twelve-gigabyte unified memory configurations command a serious premium on the resale market — expect fifteen thousand dollars or more. But for inference, the sweet spot is the ninety-six gig or one-twenty-eight gig configs at sub-five-thousand refurbished.

For the budget-conscious: the RTX three-oh-ninety is the value king. Users report snagging twenty-four-gig cards for eight hundred to a thousand dollars on the used market, sometimes as low as six hundred. A complete budget AI rig with a used three-oh-ninety, Ryzen five six hundred X, thirty-two gigs of RAM, one terabyte NVMe, and an eight hundred fifty watt power supply runs about a thousand dollars total if you find the card near that six-hundred-dollar low end. That's enough to run Qwen three point six twenty-seven B at Q-four comfortably and experiment with DeepSeek V-four via heavy quantization. Used RTX four-oh-nineties hover around eighteen hundred to twenty-five hundred: faster silicon, same twenty-four gig VRAM ceiling.
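To check whether a model fits on a twenty-four-gig card, a back-of-the-envelope Python sketch like this one can help. The four and a half bits per weight for a Q-four-class quant and the four gigabytes of overhead for context and buffers are assumed round numbers, not published figures, and the function names are made up for illustration.

```python
def quantized_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal gigabytes."""
    # billions of parameters * bits per weight / 8 bits per byte = gigabytes
    return params_billions * bits_per_weight / 8.0


def fits_in_vram(
    params_billions: float,
    bits_per_weight: float,
    vram_gb: float,
    overhead_gb: float,
) -> bool:
    """True if the weights plus an assumed overhead fit in the given VRAM."""
    return quantized_weight_gb(params_billions, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # Assumed values: 4.5 bits per weight, 4 GB reserved for context and buffers.
    size = quantized_weight_gb(27, 4.5)
    print(f"27B at ~4.5 bits/weight: {size:.1f} GB of weights")
    print("Fits in 24 GB:", fits_in_vram(27, 4.5, vram_gb=24, overhead_gb=4))
```

Longer contexts need more overhead than this assumes, so treat the result as a first pass rather than a guarantee.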

Power math: a quad RTX five-oh-ninety rig draws roughly two thousand three hundred watts under inference load. Running twenty-four seven at the US average of sixteen cents per kilowatt-hour, that's about one thousand six hundred fifty kilowatt-hours and roughly two hundred sixty-five dollars a month in electricity. Multi-GPU builders should factor solar and battery storage into their budgets; the payback math is getting more favorable as GPU power draw climbs.
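Here's that power arithmetic as a small Python sketch, so you can plug in your own wattage and electricity rate. The thirty-day month and the function name are assumptions for illustration; the wattage and price are the figures from the briefing.

```python
def monthly_power_cost(
    watts: float,
    price_per_kwh: float,
    hours_per_day: float = 24.0,
    days: int = 30,
) -> float:
    """Electricity cost in dollars for a rig drawing `watts` continuously."""
    # Convert watts to kilowatts, then multiply by hours of use to get kWh.
    kwh = watts / 1000.0 * hours_per_day * days
    return kwh * price_per_kwh


if __name__ == "__main__":
    # 2,300 W around the clock at 16 cents per kWh, over an assumed 30-day month.
    cost = monthly_power_cost(watts=2300, price_per_kwh=0.16)
    print(f"${cost:.2f} per month")  # 1,656 kWh * $0.16 = $264.96
```

Lowering `hours_per_day` shows how much an idle-aware setup would save if the rig only runs under load part of the day.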

On the tooling front, Needle, a twenty-six-million-parameter model distilled from Gemini for tool calling, is worth paying attention to. At that size, it's small enough to run on a CPU or alongside your main model without competing for VRAM. It picked up three hundred forty-six points on Hacker News. It reflects a trend toward specialized micro-models for specific agent tasks rather than one giant model doing everything.

DuckDB's new Quack protocol is also relevant for local AI builders — it turns the embedded analytical database into a network service, which means your local AI agents can query structured data without loading everything into Python memory first.

That's your local AI and automation briefing for Wednesday, May thirteenth. I'm Bob — back tomorrow with more hardware prices, model drops, and self-hosted news.