> The blog post implies that it currently requires 96GB of VRAM. Has anyone test...

conradkay · 2026-05-15T07:03:28 1778828608

It'd be way slower since you'd be doing that work every token

zozbot234 · 2026-05-15T07:08:21 1778828901

True (with 64GB RAM it'd have to fetch 20% of its active experts from disk already, about 650MB/tok at 2-bit quant - and that percentage rises quickly as you lower RAM further); my question is just a more practical one about whether it runs at all, how bad the slowdown is, and to what extent you might be able to get some of that decode throughput back by running multiple (slower) agent sessions in parallel under a single Dwarf Star 4 server.

computably · 2026-05-15T19:43:01 1778874181

Storage is multiple orders of magnitude slower than RAM. Pretty sure it'd be more like 10s/tok than anything reasonable.

zozbot234 · 2026-05-15T20:01:13 1778875273

Active params for this model are 13B, which takes about 6.5GB at full native quantization, or perhaps 3.25GB at the 2-bit quant that's being provided here. That should take significantly less than 10s to fetch on Mac storage, especially given that some fraction of the model weights would be cached in RAM. Sounds like something worth testing out if it can be made to work out of the box with DS4.
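
For anyone who wants to check the numbers in this thread, here is a rough back-of-envelope sketch. It assumes the 6.5GB figure means about 4 bits per weight (inferred from 13B params, not confirmed anywhere), and the 3 GB/s sustained read speed is a placeholder I picked for illustration, not a measured figure for any Mac. It also ignores random-access overhead and compute time, so treat the results as a lower bound on the fetch cost only.

```python
# Back-of-envelope estimate of per-token weight fetch cost from disk.
# All inputs below are assumptions for illustration, not measurements.

ACTIVE_PARAMS = 13e9          # active parameters per token (from the thread)
ASSUMED_DISK_BPS = 3e9        # hypothetical sustained SSD read speed, bytes/s


def active_bytes(active_params: float, bits_per_weight: float) -> float:
    """Bytes of weights touched per token at a given quantization."""
    return active_params * bits_per_weight / 8


def fetch_seconds_per_token(
    active_params: float,
    bits_per_weight: float,
    miss_fraction: float,
    disk_bytes_per_s: float,
) -> float:
    """Time spent reading uncached active weights for one token.

    miss_fraction is the share of active weights NOT already in RAM.
    """
    missed = active_bytes(active_params, bits_per_weight) * miss_fraction
    return missed / disk_bytes_per_s


if __name__ == "__main__":
    for bits in (4, 2):  # 4-bit is the assumed "native" quantization
        gb = active_bytes(ACTIVE_PARAMS, bits) / 1e9
        print(f"{bits}-bit: {gb:.2f} GB of active weights per token")

    for miss in (1.0, 0.2):  # nothing cached vs. the 20% miss case upthread
        secs = fetch_seconds_per_token(ACTIVE_PARAMS, 2, miss, ASSUMED_DISK_BPS)
        print(f"2-bit, {miss:.0%} from disk: ~{secs:.2f} s/tok fetch time")
```

With those placeholder inputs, even the nothing-cached case at 2-bit comes out at around a second per token for the reads alone, and the 20% miss case matches the 650MB/tok figure mentioned earlier. Real numbers could be noticeably worse if expert reads are small and scattered rather than sequential.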