The thing is, though: the locally hosted models on such hardware are cute as toys, and they do write funny jokes and, importantly, handle private tasks I would never consider passing to non-self-hosted models, but they pale in comparison to the models accessible over APIs (Claude 3.5 Sonnet, OpenAI, etc.).
If I could run DeepSeek-R1 671B locally without breaking the bank, I would. But for now, opex > capex at the consumer level.
I have a similar setup running at about 1.5 tokens/second, and it's perfectly usable for the sorts of difficult tasks one needs a frontier model like this for - give it a prompt and come back an hour or two later. You interact with it like e-mailing a coworker. If I need an answer back in seconds, it's probably not a very complicated question, and a much smaller model will do.
I get where you’re coming from, but the problem with LLMs is that you very regularly need a lot of back-and-forth with them to tease out the information you’re looking for. A more apt analogy might be a coworker that you have to follow up with three or four times, at an hour per. Not so appealing anymore. Doubly so when you have to stand up $2k+ of hardware for the privilege. If I’m paying good money to host something locally, I want decent performance.
> If I’m paying good money to host something locally
The thing is, however, that at $2k one is not paying good money; one is paying close to the least amount possible. TFA is specifically about building a machine on a budget, and as such it cuts corners to save costs, e.g. by buying older cards.
Just because $2k is not a negligible amount in itself doesn't automatically make it adequate for the purpose. Look, for example, at the tinyboxes in the $15k, $25k, and $40k price ranges:
Agreed. Furthermore, for some tasks, like large-context code assistance, I want really fast responses. I haven't found a UX I'm happy with yet, but for anything I care about I'd want very fast token responses: small blocks of code which instantly autocomplete, basically.
4 tokens per second is pretty slow. That's 5-10s for a comment the length of yours (and R1 specifically likes to output a lot of tokens), and 10-20x slower than many top-end models, which are available cheaply. Even high-cost versions of R1 (at more than twice the price of Sonnet) are $7/million tokens, so for $2k you get 285 million tokens. You'd have to run the box at full whack for over two years (at 4 tps) to hit that spending, and that ignores electricity prices. Sonnet 3.5 is half that price, and with other R1 providers you could probably get about a billion tokens for $2k. Gemini Flash 2 is over 100 tokens per second, and $2k gets you something like 5B+ tokens (more, really, but I'm taking an easy estimate over the more expensive part).
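For what it's worth, the back-of-the-envelope numbers above check out; a quick sketch (using the prices quoted in the comment, which may be out of date):

```python
# Sanity check of the cost comparison above.
# Prices are the ones quoted in the comment, not current list prices.
budget_usd = 2000.0
r1_api_price = 7.0   # $/million tokens, a high-cost R1 provider
local_tps = 4.0      # tokens/second on the $2k local box

api_tokens_m = budget_usd / r1_api_price  # millions of tokens for the budget
print(f"{api_tokens_m:.0f}M tokens for ${budget_usd:.0f}")

seconds = api_tokens_m * 1e6 / local_tps
years = seconds / (3600 * 24 * 365)
print(f"{years:.1f} years at {local_tps:.0f} tok/s to generate that many locally")
```

So the break-even point on the API spend really is a couple of years of the box running flat out, before electricity.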
If there are things you cannot send to a random third party, you might want to look at hosted versions with agreements (if it's code and you're fine with GitHub, then Azure is probably fine too).
Outside of that, if you really need to then sure, but these are the kinds of things that really benefit from being able to get high usage on GPUs for short periods of time.
I agree with elorant. Indirectly, some YouTubers have ended up demonstrating that it's difficult to run the best models for less than $7k, even though NVIDIA hardware is very efficient.
In the future, I expect this to change, because models will become far more efficient. At this pace, maybe even six months can make a difference.
Not to mention even simpler things, like wanting to tag all of your local notes based on their content: basically a bash loop you can run indefinitely, where speed doesn't matter much as long as it eventually finishes.
Additionally, if all you're doing is simple tagging and classification, you can probably get away with a significantly smaller model (sub-14B parameters), like Mistral 7B or Qwen 14B.
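That kind of tagging loop is only a few lines. A minimal sketch, assuming a local Ollama server is running with a small model pulled; the model name, notes path, and prompt are illustrative, not from the thread:

```python
# Tag every local note with a small self-hosted model via Ollama's
# /api/generate endpoint. Slow is fine: it just has to finish eventually.
import json
import pathlib
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen2.5:14b"  # any sub-14B model would do here

def suggest_tags(text: str) -> str:
    payload = json.dumps({
        "model": MODEL,
        "prompt": "Suggest 3 short topic tags, comma-separated, "
                  "for this note:\n\n" + text[:4000],
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"].strip()

# Churn through every note; rerun whenever, since it's idempotent enough.
for note in pathlib.Path("~/notes").expanduser().glob("**/*.md"):
    print(f"{note.name}: {suggest_tags(note.read_text(errors='ignore'))}")
```

Wrap it in a cron job or a `while true` shell loop and throughput stops mattering, which is exactly the point being made above.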
What I'd like to know is how well those dual-Epyc machines run the 1.58 bit dynamic quant model. It really does seem to be almost as good as the full Q8.
This is not because the models are better.
These services have unknown and opaque levels of shadow prompting[1] to tweak the behavior.
The subject article even mentions "tweaking their outputs to the liking of whoever pays the most".
The more I play with LLMs locally, the more I realize how much of the prompting going on under the covers is shaping the results from the big tech services.
There seems to be a LOT of work going on to optimize the 1.58-bit option in terms of hardware and add-ons. I get the feeling that someone from Unsloth is going to have a genuine breakthrough shortly, and the rig/compute costs are going to plummet. Hope I'm not being naïve or over-confident.