Isn’t 4 tps good enough for local use by a single user, which is the point of a ...

IanCal · on Feb 12, 2025

4 tokens per second is pretty slow. That's 5-10s for a comment the length of yours (and R1 specifically likes to output a lot of tokens). It's 10-20x slower than many top-end models, which are available cheaply. Even high-cost versions of R1 (at more than twice the price of Sonnet) are $7/million tokens. For $2K you get about 285 million tokens. You'd have to run the box at full whack for over two years (at 4 tps) to hit that spending, and that ignores electricity prices. Sonnet 3.5 is half that price, and with other R1 providers you could probably hit about a billion tokens for $2K. Gemini Flash 2 is over 100 tokens per second, and $2K gets you something like 5+B tokens (more really, but I'm taking an easy estimate over the more expensive part).
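As a rough check of the break-even arithmetic above, here is a minimal sketch. It only uses the figures quoted in the comment ($2K budget, $7/million tokens, 4 tps); the function names are made up for illustration, and it ignores electricity, as the comment does.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def tokens_for_budget(budget_usd, usd_per_million_tokens):
    """How many hosted tokens a budget buys at a flat per-million price."""
    return budget_usd / usd_per_million_tokens * 1_000_000


def years_to_generate(tokens, tokens_per_second):
    """How long a local box running non-stop takes to emit that many tokens."""
    return tokens / tokens_per_second / SECONDS_PER_YEAR


# Figures taken from the comment: $2K box, $7/M tokens hosted R1, 4 tps locally.
tokens = tokens_for_budget(2000, 7.0)
years = years_to_generate(tokens, 4)

print(f"{tokens / 1e6:.0f}M tokens, {years:.2f} years at 4 tps")
```

Swapping in a different price or local speed shows how sensitive the "over two years" figure is to both.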

If there are things you cannot send to a random party, you might want to look at hosted versions with agreements (if it's a code issue and you're fine with GitHub, then Azure is probably fine too).

Outside of that, if you really need to run it locally then sure, but these are the kinds of workloads that really benefit from being able to get high usage on GPUs for short periods of time.

JKCalhoun · on Feb 11, 2025

It is for me. I'm happy to switch over to another task (and maybe that task is refilling my coffee) and come back when the answer is fully formed.

ErikBjare · on Feb 11, 2025

I tend to get impatient at less than 10 tok/s: if the answer is 600 tok (normal for me), that's a minute.
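For comparison, a quick sketch of the wait for a 600-token answer at the speeds mentioned in this thread (4, 10, and 100 tok/s). The helper name is hypothetical, and it assumes a constant generation rate with no prompt-processing delay.

```python
def wait_seconds(answer_tokens, tokens_per_second):
    """Time to stream a full answer at a constant generation rate."""
    return answer_tokens / tokens_per_second


# Speeds quoted in the thread: local R1, the impatience threshold, a fast hosted model.
for tps in (4, 10, 100):
    print(f"{tps:>3} tok/s: {wait_seconds(600, tps):.0f} s for a 600-token answer")
```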