Hacker News

It depends heavily on the model and on how the context is used. A model like command-r, for instance, is practically unaffected by it, but Qwen will go nuts. Likewise, tasks highly dependent on context, like translation or evaluation, will be more impacted than, say, code generation or creative output.


Qwen is a little fussy about the sampler settings, but it does run well quantized. If you were getting infinite repetition loops, try dropping top_p a bit. I think Qwen likes lower temps too.
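For context, dropping top_p shrinks the nucleus of candidate tokens: the sampler keeps only the smallest set of tokens whose cumulative probability reaches top_p, which can cut off the long tail that feeds repetition loops. A minimal sketch of that filtering step (a toy standalone function, not any particular runtime's implementation):

```python
def top_p_filter(probs, top_p=0.8):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize so the kept mass sums to 1."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Toy next-token distribution over 5 token ids:
probs = [0.55, 0.25, 0.12, 0.05, 0.03]
print(top_p_filter(probs, top_p=0.8))   # keeps token ids 0 and 1
print(top_p_filter(probs, top_p=0.95))  # keeps token ids 0, 1, 2, and 3
```

At top_p=0.8 only the two most likely tokens survive, while 0.95 admits the tail tokens as well; lowering temperature has a related effect by sharpening the distribution before this cutoff is applied.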


We are talking about dynamically quantizing the KV cache, not the model weights.


I run the KV cache at Q8 even on that model. Is it not working well for you?
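If you're running this through llama.cpp, KV cache quantization is set per tensor type via the `--cache-type-k` / `--cache-type-v` flags (model path and context size here are placeholders):

```shell
# Run with the KV cache quantized to Q8_0 instead of the default F16.
# ./model.gguf and -c 8192 are illustrative values, not a recommendation.
./llama-server -m ./model.gguf -c 8192 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

Q8 K/V is generally considered close to lossless in practice, which is why lower cache precisions (e.g. q4_0) are where model-dependent degradation tends to show up.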


Interesting. I didn't know that. I thought it was basically a 'free' space saving. Do you happen to know how Llama 3.1 fares?



