Hacker News

It depends heavily on the model and on how the context is used. A model like command-r, for instance, is practically unaffected by it, but Qwen will go nuts. Likewise, tasks highly dependent on context, like translation or evaluation, will be more impacted than, say, code generation or creative output.


Qwen is a little fussy about the sampler settings, but it does run well quantized. If you were getting infinite repetition loops, try dropping top_p a bit. I think Qwen likes lower temps too.
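For context, dropping top_p shrinks the nucleus of candidate tokens: the sampler keeps only the smallest set of tokens whose cumulative probability reaches top_p, which can cut off the long tail that feeds repetition loops. A minimal sketch of that filtering step (a toy standalone function, not any particular runtime's implementation):

```python
def top_p_filter(probs, top_p=0.8):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize so the kept mass sums to 1."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Toy next-token distribution over 5 token ids:
probs = [0.55, 0.25, 0.12, 0.05, 0.03]
print(top_p_filter(probs, top_p=0.8))   # keeps token ids 0 and 1
print(top_p_filter(probs, top_p=0.95))  # keeps token ids 0, 1, 2, and 3
```

At top_p=0.8 only the two most likely tokens survive, while 0.95 admits the tail tokens as well; lowering temperature has a related effect by sharpening the distribution before this cutoff is applied.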


We are talking about dynamically quantizing the KV cache, not the model weights.


I run the KV cache at Q8 even on that model. Is it not working well for you?
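If you're running this through llama.cpp, KV cache quantization is set per tensor type via the `--cache-type-k` / `--cache-type-v` flags (model path and context size here are placeholders):

```shell
# Run with the KV cache quantized to Q8_0 instead of the default F16.
# ./model.gguf and -c 8192 are illustrative values, not a recommendation.
./llama-server -m ./model.gguf -c 8192 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

Q8 K/V is generally considered close to lossless in practice, which is why lower cache precisions (e.g. q4_0) are where model-dependent degradation tends to show up.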


Interesting. I didn't know that. I thought it was basically a 'free' space saving. Do you happen to know how Llama 3.1 fares?



