
Isn't the fact the P40 has horrible fp16 performance a deal breaker for local setups?


You probably won't be running fp16 anything locally. We typically run Q5 or Q6 quants to maximize the size of the model and context length we can run with the VRAM we have available. The quality loss is negligible at Q6.
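Rough back-of-envelope on why (bits-per-weight figures are approximate and vary by quant layout, so treat this as a sketch):

    # Approximate VRAM needed for the weights alone (ignores KV cache and overhead).
    def weight_gib(params_billion, bits_per_weight):
        return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

    for name, bpw in [("fp16", 16), ("Q8_0", 8.5), ("Q6_K", 6.6), ("Q5_K_M", 5.7)]:
        print(f"70B at {name}: ~{weight_gib(70, bpw):.0f} GiB")
    # fp16 ~130 GiB vs Q6_K ~54 GiB -- the quant is what lets a 70B fit on a few 24 GB cards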


But the inference doesn't necessarily run at the quant precision.


As far as I understand it does if you quantize the K/V cache as well (the context). And that's pretty standard now because it can increase the maximum context size a lot.
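To put rough numbers on it (assuming a Llama-3.1-70B-like shape: 80 layers, 8 KV heads, head dim 128; q8_0 stores roughly 8.5 bits per element, so figures are approximate):

    # Back-of-envelope KV cache size for a GQA model.
    def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem):
        return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

    ctx = 32768
    print("fp16:", round(kv_cache_gib(80, 8, 128, ctx, 2.0), 1), "GiB")     # ~10 GiB
    print("q8_0:", round(kv_cache_gib(80, 8, 128, ctx, 1.0625), 1), "GiB")  # ~5.3 GiB

That halved cache is several gigabytes you can spend on longer context instead.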


It is available in most inference engines, but I wouldn't call it standard practice, as it can degrade quality tremendously.


Even at q8_0? I thought the impact was minor, just as with the models themselves. But very interested to hear.

And q8_0 already halves the memory usage compared to fp16.

One of the ollama devs called the quality impact negligible at q8_0: https://smcleod.net/2024/12/bringing-k/v-context-quantisatio...

But perhaps quantizing the KV cache does not scale as gracefully as quantizing the model itself?


It depends heavily on the model and on how the context is used. A model like command-r, for instance, is practically unaffected by it, but Qwen will go nuts. Likewise, tasks that depend heavily on the context, like translation or evaluation, are more affected than, say, code generation or creative output.


Qwen is a little fussy about the sampler settings, but it does run well quantized. If you were getting infinite repetition loops, try dropping the top_p a bit. I think Qwen likes lower temps too.
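Something like this is the kind of tweak I mean (llama-cpp-python, with a hypothetical model filename and made-up values as a starting point, not official Qwen settings):

    from llama_cpp import Llama

    llm = Llama(model_path="qwen2.5-32b-instruct-q5_k_m.gguf", n_ctx=8192)  # hypothetical file
    out = llm(
        "Translate to French: The cat sat on the mat.",
        temperature=0.6,      # a bit lower than the common 0.8 default
        top_p=0.8,            # dropping top_p helps against repetition loops
        repeat_penalty=1.05,
        max_tokens=256,
    )
    print(out["choices"][0]["text"])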


We are talking about dynamically quantizing the KV cache, not the model weights.


I run the KV cache at Q8 even on that model. Is it not working well for you?


Interesting. I didn't know that. I thought it was basically 'free' space saving. Would you know how llama3.1 fares by any chance?


Is it even cheaper in $/GB than a used Vega 56 (8 GB HBM2)? There are mining boards with a bunch of x1 slots that could probably run half a dozen of them for the same 48 GB.
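Back-of-envelope with made-up used prices (plug in whatever your local market actually shows):

    # Hypothetical prices, purely to illustrate the comparison.
    cards = {
        "P40 (24 GB)":    {"price_usd": 300, "vram_gb": 24},
        "Vega 56 (8 GB)": {"price_usd": 60,  "vram_gb": 8},
    }
    for name, c in cards.items():
        print(f"{name}: ${c['price_usd'] / c['vram_gb']:.2f}/GB, "
              f"{48 // c['vram_gb']} card(s) needed for 48 GB")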


AFAIK this doesn't really work well for interactive use, because LLM inference is sequential: each token has to pass through all of the cards, one at a time, which means a lot of PCIe traffic and therefore latency. Better than nothing, but it's only really useful if you can batch requests and keep every GPU busy all the time, rather than just one at a time.
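A toy sketch of the data flow (not how llama.cpp actually implements it, just the shape of layer-split inference):

    # Each generated token's activations hop GPU -> GPU over PCIe;
    # at batch size 1 only one card is busy at any moment.
    import torch

    devices = [f"cuda:{i}" for i in range(torch.cuda.device_count())] or ["cpu"]
    blocks = [torch.nn.Linear(4096, 4096).to(d) for d in devices]  # stand-ins for groups of layers

    x = torch.randn(1, 4096).to(devices[0])  # batch of 1
    for dev, block in zip(devices, blocks):
        x = block(x.to(dev))  # PCIe hop, then compute; the other GPUs sit idle
    # ...and this whole pass repeats once per output token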


Clearly I wasn't aware that a DNN behaves like one big mesh by default. Makes sense that it's going to be bottlenecked by the tightest link. Thanks...


Would take a bunch of time just to load the model...
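Back-of-envelope, assuming those mining-board slots are PCIe 3.0 x1 (~1 GB/s usable, approximate):

    model_gb = 24            # a card filled with 24 GB of weights
    pcie3_x1_gb_per_s = 0.985
    print(f"~{model_gb / pcie3_x1_gb_per_s:.0f} s per card just to push the weights over the bus")
    # ~24 s per card; an x16 slot would be roughly 16x faster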


Best as I can tell, most of the disadvantages relate to larger batches, and for home use you’re likely running a batch of 1 anyway.



