
A great thread with the type of info you're looking for lives here: https://github.com/ggerganov/whisper.cpp/issues/89

But you can likely find similar threads for the llama.cpp benchmark here: https://github.com/ggerganov/llama.cpp/tree/master/examples/...

These are good examples because the llama.cpp and whisper.cpp benchmarks take full advantage of Apple hardware, but also of non-Apple hardware via GPU support, AVX support, etc.

It’s been true for a while now that the memory bandwidth of modern Apple systems, in tandem with the neural cores and GPU, has made them very competitive with Nvidia for local inference and even basic training.
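To see why memory bandwidth matters so much here: single-stream LLM decoding is typically memory-bound, because every generated token has to stream the full weight set from memory. That gives a simple back-of-envelope ceiling on decode speed. A minimal sketch (the bandwidth and model-size figures below are rough, illustrative assumptions, not measurements):

```python
def peak_tokens_per_sec(mem_bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on decode throughput for memory-bound inference:
    each token requires reading all weights once, so the bandwidth divided
    by the weight footprint bounds tokens per second."""
    return mem_bandwidth_gb_s / model_size_gb

# Illustrative, assumed figures: an Apple chip with ~800 GB/s unified
# memory vs. a discrete GPU with ~1000 GB/s, running a ~4 GB
# 4-bit-quantized 7B model. Real throughput will be lower than this bound.
apple_bound = peak_tokens_per_sec(800.0, 4.0)
gpu_bound = peak_tokens_per_sec(1000.0, 4.0)
print(f"Apple-class bound: ~{apple_bound:.0f} tok/s")
print(f"Discrete-GPU bound: ~{gpu_bound:.0f} tok/s")
```

The point of the sketch is that the two classes of hardware land in the same ballpark once decoding is bandwidth-limited, which is why the benchmark threads above show Apple machines holding their own.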



I guess I'm mostly lamenting how unscientific these discussions are in general, on HN and elsewhere (outside of specific GitHub repositories). Every community is filled with anecdotes, or with numbers that omit the settings, model, and runtime details that would let anyone actually compare them to something.

Still, thanks for the links :)


In fairness, meaningful comparison has become more difficult now than ever before. You'd need at least:

* hardware spec

* inference engine

* specific model - differences in the tokenizer will make models faster or slower at an equivalent parameter count

* quantization used - and you need to be aware of hardware specific optimizations for particular quants

* kv cache settings

* input context size

* output token count

This is probably not a complete list either.
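One way to make reports comparable would be to record every variable in the list above alongside the measured throughput. A minimal sketch of what such a record might look like (all field names and example values here are hypothetical, not taken from any of the linked repos):

```python
from dataclasses import dataclass, asdict

@dataclass
class BenchReport:
    # One record per benchmark run, covering the variables listed above.
    hardware: str          # hardware spec, e.g. "M2 Max 64GB"
    engine: str            # inference engine + version/commit
    model: str             # exact model name
    quant: str             # quantization used, e.g. "Q4_K_M"
    kv_cache: str          # kv cache settings, e.g. "f16"
    prompt_tokens: int     # input context size
    output_tokens: int     # output token count
    tokens_per_sec: float  # measured decode speed

    def summary(self) -> str:
        """Render the run as a single comparable line."""
        return " | ".join(f"{k}={v}" for k, v in asdict(self).items())

# Hypothetical example run:
run = BenchReport("M2 Max 64GB", "llama.cpp b1234", "Llama-2-7B",
                  "Q4_K_M", "f16", 512, 128, 55.0)
print(run.summary())
```

Two numbers posted with a record like this can be compared; two bare "tok/s" figures mostly cannot.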





