Hacker News | manmal's comments

> Today, in ~5 minutes I can do a literature review that would have taken me easily 10+ hours five years ago.

And it will not yield the same outcome you would have had. The taste you apply when clicking links and pre-filtering as you research is no longer exercised if you outsource this. I’m guilty of this myself, but let’s not kid ourselves.

I’ve had GPT Pro think for 40 minutes about the ideal reverse osmosis setup for my home. It came up with something that could have supported 10 houses and cost 20k, even though I had told it all about my water consumers and that it should research their peak usage. It simply failed to observe that you can buffer water in a tank.

There’s a reason they now let you steer GPT Pro as it goes.


I don't claim using AI is the same as doing it yourself. My point is that AI capabilities are much more extensive than "fancy search". By giving a metric and an example I hoped to make that point without getting into hair-splitting.

I wouldn’t call that hair-splitting. I’m saying it’s not a real literature review, just an even fancier search.

Words hint at concept space, which is messy and interconnected. I think a charitable reading can understand the difference between "powerful search, kind of like Google as of 2020, or LexisNexis" and LLM-AI chatbot interfaces... I would hope. But I've been developing software since the 1980s, so I can't speak for the newer generations who might not have a four-decade view. I was at meetups in San Francisco around 2018 where people were excited to find multimodal reasoning in early proto-language models. There have been qualitatively noticeable historical shifts. We don't have to agree on the exact labels used, but what LLMs enable is different enough from e.g. ElasticSearch of 2020 to call out.

You might be surprised how well 5.3-codex follows your instructions. When it hits a wall with your request, it usually emits the final turn and says it can’t do it.

Why is it not useful? Input token pricing is the same for 4.7. The same prompt costs roughly 30% more now, for input.

The idea is that smarter models might use fewer turns to accomplish the same task - reducing the overall token usage

Though, from my limited testing, the new model is far more token hungry overall


Well, you’ll need the same prompt for input tokens?

Only the first one. Ideally now there is no second prompt.

Are you aware that every tool call produces output which also counts as input to the LLM?

Are you aware that a lot of model tool calls are useless and a smarter model could avoid those?

Are you aware that output tokens are priced 5x higher than input tokens?


> a lot of model tool calls are useless

That’s just wrong. File reads, searches, and compiler output are the top input token consumers in my workflow. None of them can be removed, and they make up the majority of my input tokens. That’s also why labs are trying to make 1M-token input work, and why compaction is so important to get right.

Regarding output: yes, but that wasn’t the topic in this thread. It’s just easier to argue from input tokens that the price has gone up. I have a hunch the price for output will go up similarly, but can’t prove it. The jury’s out IMO: https://news.ycombinator.com/item?id=47816960
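
To make the two sides of this argument concrete, here is a rough back-of-the-envelope sketch in Python. It is not a real pricing model: the function name, the token counts, and the relative prices are all made up for illustration, and it ignores prompt caching and compaction. The only figures taken from the thread are the "roughly 30% more" input cost for the same prompt and the "5x" output-to-input price ratio. It assumes every turn re-sends the full context, so tool output from earlier turns is billed again as input.

```python
def session_cost(turns, prompt_tokens, tool_output_tokens, reply_tokens,
                 input_price=1.0, output_price=5.0, token_inflation=1.0):
    """Relative cost of a hypothetical agent session.

    Assumptions (illustrative only): each turn re-sends the whole context
    as input, each turn adds one model reply and one tool result to the
    context, and there is no caching or compaction. token_inflation scales
    every token count, e.g. 1.3 for "the same text is ~30% more tokens".
    """
    context = prompt_tokens * token_inflation
    input_total = 0.0
    output_total = 0.0
    for _ in range(turns):
        input_total += context  # the full context is billed as input
        output_total += reply_tokens * token_inflation
        # the reply and the tool result both join the context for the next turn
        context += (reply_tokens + tool_output_tokens) * token_inflation
    return input_total * input_price + output_total * output_price


# Hypothetical workload: 2000-token prompt, 3000 tokens of tool output and
# 500 tokens of model output per turn.
workload = dict(prompt_tokens=2000, tool_output_tokens=3000, reply_tokens=500)

old_model = session_cost(turns=20, **workload)
new_same_turns = session_cost(turns=20, token_inflation=1.3, **workload)
new_fewer_turns = session_cost(turns=12, token_inflation=1.3, **workload)

print(f"same turns:  {new_same_turns / old_model:.2f}x the old cost")
print(f"fewer turns: {new_fewer_turns / old_model:.2f}x the old cost")
```

Under these made-up numbers, both positions hold at once: with the same number of turns the session costs 30% more, and with enough turns saved it costs less, because accumulated tool output dominates the input bill. Which case applies in practice depends on how many turns a smarter model actually saves, which this sketch cannot tell you.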


This has no bearing on my comment. The point is that a better model avoids dozens of prompts and tool calls by making fewer CORRECT tool calls, with the user needing no more prompts.

I’m surprised this is even a question; obviously a better prompter has the same properties and it’s not in dispute?


That's valid, but it's also worth knowing it's only one part of the puzzle. The submission title doesn't say "input".

Common sense isn’t a language pattern. I doubt this will ever work w/ LLMs.

The models that we are paying to generate tokens are already not really just LLMs, as anyone studying language models ten years ago (or someone who describes them as "next token predictors") would understand them. Doing a bunch of reinforcement learning so that a model performs better at ssh'ing into my server and debugging my app is already really stretching the definition of "language pattern".

I think when we do get AI that can perform as well as a human at functionally all tasks, they will be multi-paradigm systems; some components will not resemble anything in any commercial system today, but one component will be recognizably LLM-like, and act as an essential communication layer.


Why don’t you do the planning yourself? It’s very likely to be a better plan.

Why don’t you switch to codex? The grass is greener here. Do use 5.3-codex though, 5.4 is not for coding, despite what many say.

Anthropic in general is miles ahead in “getting work done”, and it’s not just me on the team saying so. There are a lot of paper cuts to work through before we can be truly provider-agnostic.

I did try out codex before claude went to shit, and it was good, even uniquely good in some ways, but it wasn’t good enough to choose over claude. When claude was bad again it absolutely would have been better, but it’s only in hindsight that I should have moved over temporarily.


You can also rent a cloud GPU which is relatively affordable.

Or an autoresearch minimizing render times.

Those generated ADRs are pure crap, full of unnecessary hedges and superficial solutions that don’t survive scrutiny longer than 10 seconds. I do generate ADR skeleton drafts because I hate empty pages, but I need to add the substance or they are not helpful at all.

What we are doing is probably not in training data, maybe that’s why.


They should pick a lane because it’s not very believable if you put these things into defense systems and in the next minute claim that humanity is existentially threatened. Either you’re lying, or ruthless, or stupid.
