Hacker News

The thing is though... the locally hosted models on such hardware are cute as toys, and sure, they write funny jokes and, importantly, perform private tasks that I would never consider passing to non-self-hosted models, but they pale in comparison to the models accessible over APIs (Claude 3.5 Sonnet, OpenAI, etc.). If I could run deepseek-r1-671b locally without breaking the bank, I would. But, for now, opex > capex at the consumer level.


200+ comments, https://news.ycombinator.com/item?id=42897205

> This runs the 671B model in Q4 quantization at 3.5-4.25 TPS for $2K on a single socket Epyc server motherboard using 512GB of RAM.
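As a rough sanity check on the 512GB figure (my own back-of-envelope, assuming ~4.5 bits per weight, typical for a Q4_K-style quant, not a number from the article):

```python
params = 671e9           # DeepSeek R1 parameter count
bits_per_weight = 4.5    # assumed average for a Q4_K-style quantization
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~377 GB
```

That leaves roughly 130GB of the 512GB for KV cache and OS overhead, which is why this fits on a single-socket board at all.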


"Runs" is an overstatement, though. At 4 tokens/second you can't use it in production.


I have a similar setup running at about 1.5 tokens/second, and it's perfectly usable for the sorts of difficult tasks one needs a frontier model like this for - give it a prompt and come back an hour or two later. You interact with it like e-mailing a coworker. If I need an answer back in seconds, it's probably not a very complicated question, and a much smaller model will do.


I get where you're coming from, but the problem with LLMs is that you very regularly need a lot of back-and-forth with them to tease out the information you're looking for. A more apt analogy might be a coworker you have to follow up with three or four times, at an hour per round trip. Not so appealing anymore. Doubly so when you have to stand up $2k+ of hardware for the privilege. If I'm paying good money to host something locally, I want decent performance.


> If I’m paying good money to host something locally

The thing is, however, that at $2k one is not paying good money; one is paying close to the least amount possible. TFA is specifically about building a machine on a budget, and as such it cuts corners to save costs, e.g. by buying older cards.

Just because $2k is not a negligible amount in itself doesn't automatically make it adequate for the purpose. Look, for example, at the $15k, $25k, and $40k tinyboxes:

https://tinygrad.org/#tinybox

It's like buying a $2k used car and expecting it to perform as well as a $40k one.


Agreed. Furthermore, for some tasks, like large-context code assistants, I want really fast responses. I haven't found a UX I'm happy with yet, but for anything I care about I'd want very fast token responses. Small blocks of code that autocomplete instantly, basically.


> give it a prompt and come back an hour or two later.

This is the problem.

If your use case is getting a small handful of non-urgent responses per day then it's not a problem. That's not how most people use LLMs, though.


Isn’t 4 tps good enough for local use by a single user, which is the point of a personal AI computer?


4 tokens per second is pretty slow. That's 5-10s for a comment the length of yours (and R1 specifically likes to output a lot of tokens), and 10-20x slower than many top-end models, which are available cheaply.

Even high-cost hosted versions of R1 (at more than twice the price of Sonnet) are $7/million tokens. For $2K you get 285 million tokens, so you'd have to run the box at full tilt for over two years (at 4 tps) to reach that spend, and that ignores electricity prices. Sonnet 3.5 is half that price, and with other R1 providers you could probably get about a billion tokens for $2k. Gemini Flash 2 runs at over 100 tokens per second, and $2k gets you something like 5+ billion tokens (more, really, but I'm taking an easy estimate over the more expensive part).
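The break-even arithmetic above can be checked directly (same figures as the comment: $2k hardware, $7/M-token hosted pricing, 4 tps local throughput):

```python
budget = 2000                 # $ hardware cost from the article
price_per_m_tokens = 7.0      # $/1M tokens, high-cost hosted R1 figure
tps = 4                       # local box throughput

tokens_bought = budget / price_per_m_tokens * 1e6   # tokens $2k buys hosted
tokens_per_day = tps * 3600 * 24                    # local output per day
days_to_break_even = tokens_bought / tokens_per_day

print(f"{tokens_bought/1e6:.0f}M tokens hosted; "
      f"{days_to_break_even:.0f} days (~{days_to_break_even/365:.1f} years) to match locally")
```

That comes out to roughly 286M tokens and about 2.3 years of running flat out, consistent with the "over two years" claim, before counting electricity.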

If there are things you cannot send to a random third party, you might want to look at hosted versions with data-handling agreements (if it's a code concern and you're fine with GitHub, then Azure is probably fine too).

Outside of that, if you really need local then sure, but these are the kinds of workloads that really benefit from being able to get high utilization out of GPUs for short periods of time.


It is for me. I'm happy to switch over to another task (and maybe that task is refilling my coffee) and come back when the answer is fully formed.


I tend to get impatient below 10 tok/s: if the answer is 600 tokens (normal for me), that's a minute.


I agree with elorant. Indirectly, some YouTubers have ended up demonstrating that it's difficult to run the best models for less than $7k, even though NVIDIA hardware is very efficient.

In the future, I expect this to not be the case, because models will be far more efficient. At this pace, maybe even 6 months can make a difference.


Some LLM use cases are async, e.g. agents, "deep research" clones.


Not to mention even simpler things, like wanting to tag all of your local notes based on their content: basically a bash loop you can run indefinitely, where speed doesn't matter much as long as it eventually finishes.
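A sketch of what that loop's glue code might look like (hypothetical: `ask_model` stands in for whatever local runner you use, e.g. a call out to llama.cpp or Ollama; the prompt wording and `parse_tags` helper are my own, not from the thread):

```python
import re

def parse_tags(model_output: str, max_tags: int = 5) -> list[str]:
    """Normalize a model's free-form reply ("Tags: Foo, bar baz, ...")
    into a clean lowercase tag list."""
    text = model_output.split(":", 1)[-1]   # drop a leading "Tags:" if present
    tags = [re.sub(r"[^a-z0-9-]", "", t.strip().lower().replace(" ", "-"))
            for t in text.split(",")]
    return [t for t in tags if t][:max_tags]

def tag_note(note_text: str, ask_model) -> list[str]:
    """ask_model(prompt) -> str is the slow part; everything else is glue,
    so 4 tok/s is fine when the loop runs overnight."""
    prompt = f"List up to 5 comma-separated topic tags for this note:\n\n{note_text}"
    return parse_tags(ask_model(prompt))
```

The point is that the only latency-sensitive piece is the model call itself, and nothing here cares whether it takes two seconds or two minutes per note.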


Additionally, if all you're doing is simple tagging and classification, you can probably get away with a significantly smaller model (sub-14B parameters), like Mistral 7B or Qwen 14B.


What I'd like to know is how well those dual-Epyc machines run the 1.58-bit dynamic quant model. It really does seem to be almost as good as the full Q8.


I tried that: ~1.5 to 3 tokens/sec.


Ouch, thanks. That's about what I get now on a single-CPU box with 128 GB of RAM plus a 4090. I was hoping for a major speedup.


Peak performance is achieved at ~21 cores. The bottleneck, without any special configs, is RAM-to-CPU bandwidth.

Let me know if you find some config that really leverages more cores!
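The bandwidth-bound ceiling is easy to estimate (my back-of-envelope, with assumed numbers: R1 activates ~37B parameters per token as an MoE, ~4.5 bits/weight for a Q4 quant, and 8-channel DDR4-3200 peaks around 204 GB/s):

```python
active_params = 37e9       # R1's active MoE parameters per token (assumed)
bits_per_weight = 4.5      # Q4_K-style quantization (assumed)
bandwidth = 204e9          # 8-channel DDR4-3200 theoretical peak, bytes/s

bytes_per_token = active_params * bits_per_weight / 8   # weights read per token
ceiling_tps = bandwidth / bytes_per_token
print(f"~{ceiling_tps:.1f} tokens/s ceiling")  # ~9.8
```

A theoretical ceiling near 10 tps, with real-world effective bandwidth well below peak, is consistent with the 3.5-4.25 tps people report, and explains why adding cores past ~21 doesn't help.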


This is not because the models are better. These services apply unknown and opaque levels of shadow prompting[1] to tweak the behavior. The subject article even mentions "tweaking their outputs to the liking of whoever pays the most". The more I play with LLMs locally, the more I realize how much prompting under the covers is shaping the results from the big tech services.

1 https://www.techpolicy.press/shining-a-light-on-shadow-promp...


The 1.58-bit DeepSeek R1 dynamic quant model from Unsloth is no joke. It just needs a lot of RAM and some patience.
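The RAM requirement follows from the quant's average bit width (rough arithmetic; the dynamic quant actually mixes bit widths across layers, so 1.58 is only an average):

```python
params = 671e9
avg_bits = 1.58            # average bits/weight of the dynamic quant
size_gb = params * avg_bits / 8 / 1e9
print(f"~{size_gb:.0f} GB")  # ~133 GB
```

So the whole model lands in the low hundreds of GB, which is why it's a RAM-and-patience problem rather than a GPU-farm problem.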


There seems to be a LOT of work going on to optimize the 1.58-bit option in terms of hardware and add-ons. I get the feeling that someone from Unsloth is going to have a genuine breakthrough shortly, and the rig/compute costs are going to plummet. Hope I'm not being naïve or over-confident.


Huh? Toys? You can run DeepSeek 70B on a 36GB RAM MacBook Pro. You can run Phi-4, Qwen2.5, or Llama 3.3. They work great for coding tasks.


Yeah, but as one of the replies points out, the resulting tokens/second would be unusable in production environments.


What? I literally use it at work to write code.


I think they're talking about using it to power inference for self-hosted, user-facing applications.


ahhhhh yes ok. Totally agree here.



