
Local Models on Hardware Built for Agents

June 1, 2026 · 6 min read

On May 31, NVIDIA put a desktop on the roadmap that can hold a trillion-parameter model in memory and run it under your own roof. The DGX Station for Windows is not a graphics card with ambitions. It is built for the thing we actually do all day: run agents.

What was announced

The DGX Station for Windows pairs a GB300 Grace Blackwell Ultra superchip with a 72-core Grace CPU and up to 748GB of coherent memory shared across CPU and GPU. NVIDIA rates it at 20 petaflops of FP4 compute and says it can serve models up to a trillion parameters locally. An onboard ConnectX-8 SuperNIC handles networking at up to 800Gb/s. It ships in Q4 2026 from ASUS, Dell, GIGABYTE, HP, MSI, and Supermicro.

The same day, NVIDIA leaned into the smaller end too, with DGX Spark and RTX-class machines aimed at running agents on the desk instead of in someone else's datacenter. A Spark-class desktop pairs about a petaflop of AI compute with 128GB of unified memory. The whole roadmap, top to bottom, is now pointing at the workload we build for: agents that act, not chatbots that answer.

748GB of coherent memory changes the math. A model that needed a rack now fits next to the monitor, and never leaves the building.

Why coherent memory is the headline, not the petaflops

Everyone quotes the FP4 number. The detail that changes what you can deploy is the unified memory pool. Agentic work is memory-hungry in a way that chat is not: long context windows, multiple models resident at once, retrieval indexes, tool state. On a normal workstation you spend your time evicting one thing to load another. When the CPU and GPU address the same 748GB, a large model and its working set just stay put. That is the difference between a demo and a box an agent runs on for eight hours.
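The arithmetic behind that claim is easy to check on the back of an envelope. The sketch below is ours, not NVIDIA's. It counts weight bytes only, uses decimal gigabytes, and ignores quantization scale factors and runtime overhead, so treat the numbers as rough lower bounds. The function names and the example working set (a second 70B model, 100GB for KV cache and retrieval indexes) are illustrative assumptions.

```python
GB = 10**9  # decimal gigabytes, the way memory capacity is usually quoted


def weights_gb(params: float, bits_per_param: float) -> float:
    """Approximate size of model weights in GB (weights only, no overhead)."""
    return params * bits_per_param / 8 / GB


def budget(total_gb: float, components: dict) -> dict:
    """Sum the resident components and report what is left of the pool."""
    used = sum(components.values())
    return {"used_gb": used, "free_gb": total_gb - used, "fits": used <= total_gb}


if __name__ == "__main__":
    pool_gb = 748  # coherent memory figure quoted in the article

    # Hypothetical working set: every entry here is an assumption for illustration.
    working_set = {
        "1T-parameter model at 4 bits": weights_gb(1e12, 4),
        "70B-parameter helper at 8 bits": weights_gb(70e9, 8),
        "KV cache and retrieval indexes": 100.0,
    }

    for name, size in working_set.items():
        print(f"{name}: {size:.0f} GB")
    print(budget(pool_gb, working_set))
```

Under these assumptions a trillion parameters at 4 bits comes to about 500GB, which leaves roughly 248GB of the pool for everything else. The same model at 16 bits would need about 2,000GB and would not fit at all, which is why the FP4 figure and the memory figure have to be read together.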

The case for keeping the model in the building

We have written before about on-prem inference for law firms, where sending a prompt to a public endpoint can count as third-party disclosure. Hardware like this widens who gets that option. A firm that couldn't justify a server room can put a trillion-parameter-capable machine on a desk and keep every token inside its own walls. No data residency memo. No no-train clause to negotiate. The model is on the machine, and the machine is yours.

That story isn't only for regulated industries. A warehouse running OCR on bills of lading, a manufacturer scoring defects off the line, a marketing shop that doesn't want client briefs training someone else's model. All of them get the same thing: inference that doesn't depend on a vendor's uptime, pricing, or willingness to keep a model version alive.

What it doesn't fix

A faster box does not give you a working agent. The hard parts stay hard. You still need the orchestration that decides which model handles which step, the identity and audit layer so the agent acts with least privilege, the integration work to wire it into the tools people actually use, and the evaluation to know it's right. We've said this about orchestration versus chatbot sprawl: capability without a control plane is just faster sprawl. Local hardware moves the inference; someone still has to build the system around it.
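To make "control plane" concrete, here is a deliberately small sketch of two of those pieces: a routing table that decides which model handles which step, and a per-step tool allowlist with an audit trail. Every name in it (the step kinds, the model labels, the tools) is hypothetical and chosen for illustration. It is not a description of any particular product, and a real system would need authentication, persistence, and evaluation on top.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    kind: str                   # what the agent is trying to do, e.g. "extract"
    tool: Optional[str] = None  # tool the step wants to call, if any


# Hypothetical routing table: step kind -> model label.
ROUTES = {
    "extract": "small-local-model",
    "plan": "large-local-model",
}

# Hypothetical least-privilege policy: each step kind may use only these tools.
ALLOWED_TOOLS = {
    "extract": {"ocr"},
    "plan": set(),
}


def dispatch(step: Step, audit_log: list) -> str:
    """Return the model for a step, refusing tools outside the allowlist."""
    model = ROUTES.get(step.kind)
    if model is None:
        raise ValueError(f"no route for step kind {step.kind!r}")

    allowed = ALLOWED_TOOLS.get(step.kind, set())
    if step.tool is not None and step.tool not in allowed:
        # Record the denial before raising so the attempt is still auditable.
        audit_log.append((step.kind, step.tool, "denied"))
        raise PermissionError(f"{step.tool!r} not permitted for {step.kind!r}")

    audit_log.append((step.kind, step.tool, "allowed"))
    return model


if __name__ == "__main__":
    log = []
    print(dispatch(Step("extract", "ocr"), log))
    try:
        dispatch(Step("plan", "send_email"), log)
    except PermissionError as err:
        print("blocked:", err)
    print(log)
```

Nothing in this sketch cares where the model runs. That is the point: the routing and permission logic is the same whether the label resolves to a hosted endpoint or a machine in the office.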

And Q4 2026 is a roadmap date. Coherent-memory specs on a slide become a deployment when the drivers, the model formats, and the tooling all line up — which is the work, not the keynote.

What we're doing about it

Foundation already builds and runs agents for clients today. When these machines ship, the orchestration, governance, and integration work we do ports onto them, with the model moving from a hosted endpoint to a box in the client's office. For anyone who has been told their data rules them out of useful AI, that constraint is starting to expire.

If you're weighing local versus hosted inference for a workload that can't leave the building, let’s talk — we build the system that runs on either.
