No, not really. There was a lot of engineering work and a bunch of not-so-big ideas (e.g. InstructGPT-style reinforcement learning after the model's pre-training), but you can go from the transformer paper to the current state of the art without needing a "big idea".
And I think this is the major "big idea": accepting the bitter lesson (http://incompleteideas.net/IncIdeas/BitterLesson.html) that major user-visible progress and newly emerging capabilities don't necessarily require any big ideas, just scaling to more compute.
I disagree. First of all, there is a reflective meta-lesson in the very idea of the "Bitter Lesson":
the past suggests that (in a way) "the application of models has not been a winner" - but we cannot really know that it is not, because we have not obtained a model out of that history, a model that shows why, an explanation. Epistemologically, a discouraging track record cannot be turned into a "law".
Practically, there is still a need to identify the right architecture(s) to avoid the undesirable weaknesses of current approaches.
RLHF is arguably a bigger jump than LLMs, at least from my perspective beginning to study NLP in 2015/16.
Well, what exactly is RLHF, practically?
The ability to go from 8 Google search snippets to correctly ranking them and rewriting the top one into agreeable, cohesive, grammatical and helpful English is just incredible. It enables so much more, and it is the real step change from these models that led to virality. It also increases consistency, which was always the worry for business use cases.
Why is that more noteworthy than the base GPT-3?
A lot of the "LLM scale --> more accurate autoregressive prediction" progress was predictable; RLHF on text was not (for most of us, the early sparks came with the release of T5 and its multiple tasks-in-text).
What else could be a big idea coming up?
There is an ongoing wave of innovation in embeddings that has largely been missed by the hype curve. GPT embeddings are increasingly useful for compression, and similarity-based embeddings enable much more accurate KNN search for tasks like matching curriculums to learning content, even multilingually - see the recent Kaggle competition, where the outstanding performance was due to similarity-based embeddings from the last 3 years.

This wave may lead to the partial replacement of some anthropomorphic computing concepts like files, since information becomes much more addressable, combinable and useful as variously sized embeddings. More vitally, embeddings can be aligned across different models and modalities to get better results (e.g. the Amazon ScienceQA paper showed that accuracy on text questions about physical situations increased when images of the situation were used during training - even if the images were held out afterwards). Multimodality has always been on the AI radar (not necessarily the ML one), but these similarity-based embeddings, and also GPT embeddings (they behave differently and are sensitive in different ways), are getting us there much quicker than anyone would have expected.
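To make the KNN-search point concrete, here is a minimal sketch of embedding-based nearest-neighbour matching. The vectors are made-up stand-ins: in practice each row would come from a real embedding model, not hand-written numbers.

```python
# Toy sketch of embedding-based KNN matching (e.g. curriculum topics to
# learning content). The vectors are illustrative stand-ins for real
# model embeddings.
import numpy as np

def cosine_knn(query, corpus, k=2):
    """Return indices of the k corpus rows most similar to query,
    plus the full similarity vector."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q                       # cosine similarity per corpus row
    return np.argsort(-sims)[:k], sims

# Four "documents" in a 3-d embedding space (purely illustrative numbers).
corpus = np.array([
    [0.9, 0.1, 0.0],   # doc 0
    [0.8, 0.2, 0.1],   # doc 1
    [0.0, 0.1, 0.9],   # doc 2
    [0.1, 0.9, 0.2],   # doc 3
])
query = np.array([1.0, 0.0, 0.0])

top, sims = cosine_knn(query, corpus, k=2)
print(top)  # docs 0 and 1 are nearest to the query
```

Because cosine similarity ignores vector length, this works across embeddings of different scales, which is part of why similarity search transfers so well across tasks and languages.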
Setting aside engineering and technique improvements (e.g. scaling up data, or learning positional encodings rather than using the pre-programmed sinusoidal ones), there are lots of things that could be big, like capsule networks, or energy-based models (seeking predictable, comfortable low-energy states rather than maximising gains). However, as you mentioned, a lot of these are years old and regularly come and go. If you want somebody who is pushing for more exploration here and decries GPT a little, check out Yann LeCun.
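For reference, the fixed sinusoidal positional encodings from the original transformer paper are only a few lines; the "learned" alternative is simply a trainable lookup table of the same shape. A minimal sketch:

```python
# Fixed sinusoidal positional encodings, as in "Attention Is All You Need":
# PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
# PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model // 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return pe

pe = sinusoidal_positions(seq_len=8, d_model=16)
print(pe.shape)  # (8, 16)
```

Swapping this fixed table for a learned `nn.Embedding` of shape (seq_len, d_model) is exactly the kind of small, non-fundamental change that nevertheless shows up in practice.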
It's an interesting example of how much seemingly superficial, non-fundamental things can matter.
A lot of AI experts are asserting (probably correctly) that OpenAI really has done nothing new and is just putting a shiny sticker on already known, published research.
But human perception being what it is, having ChatGPT produce a beautifully formed, polite and friendly sentence seems massively better to lay people than a more terse, unpolished response. It wouldn't surprise me if there is already a giant layer of heuristics pasted on the end of ChatGPT's transformer model, cleaning up all sorts of ugly corner cases - something researchers would consider highly impure and completely valueless, while it is actually responsible for a large amount of ChatGPT's success.
I think there is a bit of a lesson there about how much academia undervalues the polishing part of research work, even if fundamentals ultimately drive progress.
Somebody had to invest resources into training those super large models and observe emergent intelligent behavior. It's not like the authors of the original paper knew that transformers would lead to GPT-4 and spark an AGI debate. Nobody expected transformers to get powerful so fast.
The paper on transformers was published 6 years ago.
6 years in ML is an eternity nowadays.