> Today, in ~5 minutes I can do a literature review that would have taken me easily 10+ hours five years ago.
And it will not yield the same outcome you would have had. Your own taste in clicking links and pre-filtering as you do your research is no longer applied if you outsource this. I'm guilty of this myself. But let's not kid ourselves.
I've had GPT Pro think for 40 minutes about the ideal reverse osmosis setup for my home. It came up with something that would have been able to support 10 houses and cost $20k, even though I told it exactly what my water consumers are and asked it to research their peak usage. It just failed to observe that you can buffer water in a tank.
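To make the miss concrete, here's the back-of-envelope arithmetic it skipped. All numbers below are made up for illustration, but the shape of the argument holds: with a buffer tank, the unit only has to match average demand, not peak demand.

```python
# Hypothetical numbers: why a buffer tank changes RO sizing entirely.
daily_use_l = 400        # assumed household RO consumption, litres/day
peak_draw_lpm = 10       # assumed worst-case simultaneous draw, litres/min

# Sizing the membrane for peak demand directly -> absurdly oversized unit.
peak_sized_capacity = peak_draw_lpm * 60 * 24
print(f"peak-sized unit: {peak_sized_capacity:,} L/day")   # 14,400 L/day

# Sizing for average demand instead, with a tank absorbing the peaks.
tank_l = 100             # assumed buffer tank, litres
print(f"tank-buffered unit: {daily_use_l} L/day + {tank_l} L tank")
```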
There's a reason they let you steer GPT-Pro as it goes, now.
I don't claim using AI is the same as doing it yourself. My point is that AI capabilities are much more extensive than "fancy search". By giving a metric and an example I hoped to make that point without getting into hair-splitting.
Words hint at concept space, which is messy and interconnected. I think a charitable reading can understand the difference between "powerful search, kind of like Google as of 2020, or LexisNexis" and LLM-AI chatbot interfaces... I would hope. But I've been developing software since the 1980s, so I can't speak for the newer generations who might not have a quadruple-decade view. I was in meetups in San Francisco around 2018 where people were excited to find multimodal reasoning in early proto-language models. There have been qualitatively noticeable historical shifts. We don't have to agree on the exact labels used, but what LLMs enable is different enough from e.g. ElasticSearch of 2020 to call out.
You might be surprised how well 5.3-codex follows your instructions. When it hits a wall with your request, it usually emits the final turn and says it can’t do it.
That's just wrong. File reads, searches, and compiler output are the top input-token consumers in my workflow. None of them can be removed, and they make up the majority of my input tokens. That's also why labs are trying to make 1M-token input work, and why compaction is so important to get right.
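A rough sketch of why those consumers dominate (all numbers are assumptions): every agent turn re-sends the whole history, so each file read or search result gets billed as input again on every subsequent turn. That compounding is exactly what compaction attacks.

```python
# Toy model of an agentic session: the full context is re-sent each turn,
# so every tool result (file read, grep, compiler log) is paid for repeatedly.
turns = 20
tokens_per_tool_result = 2_000   # assumed average size of one tool output
context = 500                    # assumed system + user prompt

total_input = 0
for _ in range(turns):
    total_input += context               # whole history billed as input
    context += tokens_per_tool_result    # new tool output appended to history

print(f"cumulative input tokens over {turns} turns: {total_input:,}")  # 390,000
# With compaction, old tool outputs get summarized and this growth flattens.
```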
Regarding output - yes, but that wasn't the topic in this thread. It's just easier to show with input tokens that the price has gone up. I have a hunch the price for output will go up similarly, but can't prove it. The jury's out IMO: https://news.ycombinator.com/item?id=47816960
This has no bearing on my comment. The point is that a better model avoids dozens of prompts and tool calls by making fewer, CORRECT tool calls, without the user needing any more prompts.
I'm surprised this is even a question; a better prompter obviously has the same properties, and that's not in dispute.
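A toy cost comparison makes the point (all prices and call counts below are hypothetical): the pricier model that gets its tool calls right the first time can still be cheaper end to end.

```python
def session_cost(price_per_mtok: float, calls: int, tokens_per_call: int) -> float:
    """Rough end-to-end cost of an agent session, in dollars."""
    return price_per_mtok * calls * tokens_per_call / 1_000_000

# Cheaper model, but it flails: many retries and wrong tool calls.
flaky = session_cost(price_per_mtok=3, calls=40, tokens_per_call=8_000)
# Pricier model that makes fewer, correct tool calls.
sharp = session_cost(price_per_mtok=15, calls=6, tokens_per_call=8_000)

print(f"flaky model: ${flaky:.2f}")   # $0.96
print(f"sharp model: ${sharp:.2f}")   # $0.72
```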
The models that we are paying to generate tokens are already not really just LLMs, as anyone studying language models ten years ago (or someone who describes them as "next token predictors") would understand them. Doing a bunch of reinforcement learning so that a model performs better at ssh'ing into my server and debugging my app is already realllly stretching the definition of "language pattern".
I think when we do get AI that can perform as well as a human at functionally all tasks, they will be multi-paradigm systems; some components will not resemble anything in any commercial system today, but one component will be recognizably LLM-like, and act as an essential communication layer.
Anthropic in general is miles ahead in "getting work done", and it's not just me on the team who thinks so. There are a lot of paper cuts to work through to be truly provider-generic.
I did try out Codex before Claude went to shit, and it was good, even uniquely good in some ways, but it wasn't good enough to choose over Claude. Sure, when Claude went bad again Codex would have been the better pick, but it's only in hindsight that I can say I should have moved over temporarily.
Those generated ADRs are pure crap, full of unnecessary hedges and superficial solutions that don't survive more than 10 seconds of scrutiny. I do generate ADR skeleton drafts because I hate empty pages, but I need to add the substance myself or they're not helpful at all.
What we are doing is probably not in training data, maybe that’s why.
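For concreteness, a minimal sketch of that skeleton-only workflow. The filename scheme, headings, and helper are my own invention, not any standard ADR tooling; the point is that the structure is mechanical and the reasoning stays with the human.

```python
# Sketch: generate the ADR structure mechanically, leave the substance blank.
import os
from datetime import date

ADR_TEMPLATE = """# ADR {number}: {title}

Date: {today}
Status: Proposed

## Context
<!-- human: the forces actually at play -->

## Decision
<!-- human: the decision and its real trade-offs -->

## Consequences
<!-- human: what becomes easier, what becomes harder -->
"""

def write_skeleton(number: int, title: str) -> str:
    slug = title.lower().replace(" ", "-")
    path = f"docs/adr/{number:04d}-{slug}.md"
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(ADR_TEMPLATE.format(number=number, title=title, today=date.today()))
    return path

print(write_skeleton(12, "Route writes through a queue"))
```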
They should pick a lane, because it's not very believable to put these things into defense systems one minute and claim the next that humanity is existentially threatened. Either you're lying, ruthless, or stupid.