The thing is, though: the locally hosted models on such hardware are cute as toys, and they do write funny jokes and, importantly, handle private tasks I would never consider passing to non-self-hosted models, but they pale in comparison to the models accessible over APIs (Claude 3.5 Sonnet, OpenAI, etc.).
If I could run DeepSeek-R1 671B locally without breaking the bank, I would. But for now, opex > capex at the consumer level.
I have a similar setup running at about 1.5 tokens/second, and it's perfectly usable for the sorts of difficult tasks one needs a frontier model like this for - give it a prompt and come back an hour or two later. You interact with it like e-mailing a coworker. If I need an answer back in seconds, it's probably not a very complicated question, and a much smaller model will do.
I get where you’re coming from, but the problem with LLMs is that you very regularly need a lot of back-and-forth with them to tease out the information you’re looking for. A more apt analogy might be a coworker that you have to follow up with three or four times, at an hour per. Not so appealing anymore. Doubly so when you have to stand up $2k+ of hardware for the privilege. If I’m paying good money to host something locally, I want decent performance.
> If I’m paying good money to host something locally
The thing is, however, that at $2k one is not paying good money; one is paying close to the least amount possible. TFA is specifically about building a machine on a budget, and as such it cuts corners to save costs, e.g. by buying older cards.
Just because $2k is not a negligible amount in itself doesn't automatically make it adequate for the purpose. Look, for example, at the tinyboxes in the $15k, $25k, and $40k price ranges:
Agreed. Furthermore, for some tasks, like large-context code assistance, I want really fast responses. I haven't found a UX I'm happy with yet, but for anything I care about I'd want very fast token responses: small blocks of code which instantly autocomplete, basically.
4 tokens per second is pretty slow. That's 5-10s for a comment the length of yours (and R1 specifically likes to output a lot of tokens), and 10-20x slower than many top-end models, which are available cheaply. Even high-cost versions of R1 (at more than twice the price of Sonnet) are $7/million tokens, so for $2k you get 285 million tokens. You'd have to run the box at full whack for over two years (at 4 tps) to hit that spending, and that ignores electricity prices. Sonnet 3.5 is half that price, and with other R1 providers you could probably get about a billion tokens for $2k. Gemini Flash 2 is over 100 tokens per second, and $2k gets you something like 5B+ tokens (more, really, but I'm taking an easy estimate over the more expensive part).
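For what it's worth, the back-of-the-envelope numbers above check out; a quick sketch (using the prices quoted in the comment, which may be out of date):

```python
# Sanity check of the cost comparison above.
# Prices are the ones quoted in the comment, not current list prices.
budget_usd = 2000.0
r1_api_price = 7.0   # $/million tokens, a high-cost R1 provider
local_tps = 4.0      # tokens/second on the $2k local box

api_tokens_m = budget_usd / r1_api_price  # millions of tokens for the budget
print(f"{api_tokens_m:.0f}M tokens for ${budget_usd:.0f}")

seconds = api_tokens_m * 1e6 / local_tps
years = seconds / (3600 * 24 * 365)
print(f"{years:.1f} years at {local_tps:.0f} tok/s to generate that many locally")
```

So the break-even point on the API spend really is a couple of years of the box running flat out, before electricity.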
If there are things you cannot send to a random third party, you might want to look at hosted versions with agreements (if it's code and you're fine with GitHub, then Azure is probably fine too).
Outside of that, if you really need to then sure, but these are the kinds of things that really benefit from being able to get high usage on GPUs for short periods of time.
I agree with elorant. Indirectly, some YouTubers have ended up demonstrating that it's difficult to run the best models for less than $7k, even though NVIDIA hardware is very efficient.
In the future, I expect this to change, because models will become far more efficient. At this pace, maybe even six months can make a difference.
Not to mention even simpler things, like wanting to tag all of your local notes based on their content: basically a bash loop you can run indefinitely, where speed doesn't matter much as long as it eventually finishes.
Additionally, if all you're doing is simple tagging and classification, you can probably get away with a significantly smaller model (sub-14B parameters), like Mistral 7B or Qwen 14B.
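That kind of tagging loop is only a few lines. A minimal sketch, assuming a local Ollama server is running with a small model pulled; the model name, notes path, and prompt are illustrative, not from the thread:

```python
# Tag every local note with a small self-hosted model via Ollama's
# /api/generate endpoint. Slow is fine: it just has to finish eventually.
import json
import pathlib
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen2.5:14b"  # any sub-14B model would do here

def suggest_tags(text: str) -> str:
    payload = json.dumps({
        "model": MODEL,
        "prompt": "Suggest 3 short topic tags, comma-separated, "
                  "for this note:\n\n" + text[:4000],
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"].strip()

# Churn through every note; rerun whenever, since it's idempotent enough.
for note in pathlib.Path("~/notes").expanduser().glob("**/*.md"):
    print(f"{note.name}: {suggest_tags(note.read_text(errors='ignore'))}")
```

Wrap it in a cron job or a `while true` shell loop and throughput stops mattering, which is exactly the point being made above.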
What I'd like to know is how well those dual-Epyc machines run the 1.58 bit dynamic quant model. It really does seem to be almost as good as the full Q8.
This is not because the models are better.
These services have unknown and opaque levels of shadow prompting[1] to tweak the behavior.
The subject article even mentions "tweaking their outputs to the liking of whoever pays the most".
The more I play with LLMs locally, the more I realize how much of the prompting going on under the covers is shaping the results from the big tech services.
There seems to be a LOT of work going on to optimize the 1.58-bit option in terms of hardware and add-ons. I get the feeling that someone from Unsloth is going to have a genuine breakthrough shortly, and the rig/compute costs are going to plummet. Hope I'm not being naïve or over-confident.