
A great thread with the type of info you're looking for lives here: https://github.com/ggerganov/whisper.cpp/issues/89

But you can likely find similar threads for the llama.cpp benchmark here: https://github.com/ggerganov/llama.cpp/tree/master/examples/...

These are good examples because the llama.cpp and whisper.cpp benchmarks take full advantage of Apple hardware, but also of non-Apple hardware via GPU support, AVX support, etc.

It’s been true for a while now that the memory bandwidth of modern Apple systems, in tandem with the neural cores and GPU, has made them very competitive with Nvidia for local inference and even basic training.
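To see why memory bandwidth matters so much here: single-stream LLM decoding is typically memory-bound, because every generated token has to stream the full weight set from memory. That gives a simple back-of-envelope ceiling on decode speed. A minimal sketch (the bandwidth and model-size figures below are rough, illustrative assumptions, not measurements):

```python
def peak_tokens_per_sec(mem_bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on decode throughput for memory-bound inference:
    each token requires reading all weights once, so the bandwidth divided
    by the weight footprint bounds tokens per second."""
    return mem_bandwidth_gb_s / model_size_gb

# Illustrative, assumed figures: an Apple chip with ~800 GB/s unified
# memory vs. a discrete GPU with ~1000 GB/s, running a ~4 GB
# 4-bit-quantized 7B model. Real throughput will be lower than this bound.
apple_bound = peak_tokens_per_sec(800.0, 4.0)
gpu_bound = peak_tokens_per_sec(1000.0, 4.0)
print(f"Apple-class bound: ~{apple_bound:.0f} tok/s")
print(f"Discrete-GPU bound: ~{gpu_bound:.0f} tok/s")
```

The point of the sketch is that the two classes of hardware land in the same ballpark once decoding is bandwidth-limited, which is why the benchmark threads above show Apple machines holding their own.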



I guess I'm mostly lamenting how unscientific these discussions are in general, on HN and elsewhere (outside of specific GitHub repositories). Every community is filled with anecdotes, or with numbers that omit the settings, model, and runtime details that would let anyone actually compare them to something.

Still, thanks for the links :)


In fairness, meaningful comparison has become more difficult now than ever before. You'd need at least:

* hardware spec

* inference engine

* specific model - differences in the tokenizer will make models faster or slower at an equivalent parameter count

* quantization used - and you need to be aware of hardware specific optimizations for particular quants

* kv cache settings

* input context size

* output token count

This is probably not a complete list either.
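One way to make reports comparable would be to record every variable in the list above alongside the measured throughput. A minimal sketch of what such a record might look like (all field names and example values here are hypothetical, not taken from any of the linked repos):

```python
from dataclasses import dataclass, asdict

@dataclass
class BenchReport:
    # One record per benchmark run, covering the variables listed above.
    hardware: str          # hardware spec, e.g. "M2 Max 64GB"
    engine: str            # inference engine + version/commit
    model: str             # exact model name
    quant: str             # quantization used, e.g. "Q4_K_M"
    kv_cache: str          # kv cache settings, e.g. "f16"
    prompt_tokens: int     # input context size
    output_tokens: int     # output token count
    tokens_per_sec: float  # measured decode speed

    def summary(self) -> str:
        """Render the run as a single comparable line."""
        return " | ".join(f"{k}={v}" for k, v in asdict(self).items())

# Hypothetical example run:
run = BenchReport("M2 Max 64GB", "llama.cpp b1234", "Llama-2-7B",
                  "Q4_K_M", "f16", 512, 128, 55.0)
print(run.summary())
```

Two numbers posted with a record like this can be compared; two bare "tok/s" figures mostly cannot.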





