Techbros claim that pouring ever more environmental and economic resources into further scaling of their statistical models will eventually yield Artificial General Intelligence. Sam Altman advertised PhD-level thinking for GPT-5 (released in late summer).
In the screenshot you can see a modern, reasoning-enhanced model failing to realize that my question only superficially resembles a riddle - what PhD would make such a fundamental mistake? Seems like "pure scaling" just isn't enough, and instead of pressing on, they should go back to the drawing board and do more research.
Why do the creators of LLMs continue to pour resources into a scaling hype that yields ever-diminishing returns? Are they stupid? Or just greedy?
Whatever it is, this doubling down on an obviously wrong decision is not only irresponsible, it's irrational - which is deeply ironic, because these creators are expected to create a rational machine.
Edit: content of the image
Question:
A goat, who is dressed up as a farmer, is allergic to cabbage, but is wolfing down other vegetables, before crossing a river. What is the minimum number of trips needed?
Response:
7 trips.
#Puzzle Interpretation
This adapts the classic wolf-goat-cabbage river crossing puzzle, where the goat (farmer costume irrelevant) can't be left alone with cabbage (due to allergy, preventing eating) but can with other vegetables (no issue). The goat crosses while "wolfing down" safe veggies. Boat carries goat plus one item (or alone); unsafe pairs can't be unsupervised
(The question isn't something I came up with myself, I just reproduced the experiments of actual scientists.)
Garbage in, garbage out.
If you feed it a shitpost, it'll do its best to assume it's a real question, that you're not trying to trick it, and respond accordingly.
Explanation for this specific case: There is no indication from you in this chat or context that you are attempting an adversarial prompt. So it assumes that you aren't doing that and answers your question naively, filling in the blanks as necessary with assumptions that may or may not be wrong.
Try the same question, but before you give it to the LLM, add to the context that the question may or may not be nonsense and that it is allowed to ask clarifying questions, and see what happens.
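Roughly what I mean, as a minimal Python sketch. The preamble wording and the name `build_prompt` are just my own illustration (nothing official), and it only builds the text; paste the result into whatever chat you're using.

```python
# Hypothetical preamble; the exact wording is an assumption, tweak as you like.
PREAMBLE = (
    "The following question may or may not be nonsense, and it may only "
    "superficially resemble a well-known puzzle. You are allowed to ask "
    "clarifying questions before answering."
)

def build_prompt(question: str) -> str:
    """Prepend the cautionary preamble to the user's question."""
    return f"{PREAMBLE}\n\nQuestion: {question.strip()}"

question = (
    "A goat, who is dressed up as a farmer, is allergic to cabbage, but is "
    "wolfing down other vegetables, before crossing a river. "
    "What is the minimum number of trips needed?"
)
print(build_prompt(question))
```

Then compare the answer with and without the preamble.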
Edit: I'm glossing over the PhD thing cause that's just BS, or not applicable at all, or just stupid to even compare an LLM with a human brain at this point.
Edit: There's something interesting that your prompt touches on and exacerbates, and I can talk about it more if you want. It's called semantic drift. It's a common issue with LLMs where the meaning of a word slowly shifts across internal iterations. (It also happens in real life at a much, much larger scale.)
I think you make it too complicated.
The question / prompt is very simple. The answer is "one trip". The LLM stumbles because there are trigger words in there that make it seem like the goat cabbage puzzle question. But to a human it clearly is not. An LLM on the other hand cannot tell the difference.
It may be tricking the LLM somewhat adversarially. But it is still a very simple question that it is not able to answer, because it fundamentally has no understanding of anything at all.
This prompt works great to drive home that simple fact. And shows that all that touting of reasoning skills is just marketing lies.
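To make the "one trip" vs. "7 trips" contrast concrete, here is a small breadth-first search sketch in Python. The function name `min_trips` and the way I model the puzzle (a rower plus at most one item per crossing, certain pairs unsafe when left without the rower) are my own assumptions about the classic setup, not anything taken from the model's output.

```python
from collections import deque

def min_trips(items, unsafe_pairs):
    """Fewest one-way crossings to get the rower and all items across.

    State: (items still on the start bank, whether the rower is there too).
    """
    items = frozenset(items)

    def safe(bank):
        # A bank is unsafe if both members of any forbidden pair are on it.
        return not any(a in bank and b in bank for a, b in unsafe_pairs)

    start, goal = (items, True), (frozenset(), False)
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (left, rower_left), trips = queue.popleft()
        if (left, rower_left) == goal:
            return trips
        here = left if rower_left else items - left
        for cargo in [None, *here]:  # cross alone or with one item
            new_left = set(left)
            if cargo is not None:
                if rower_left:
                    new_left.discard(cargo)
                else:
                    new_left.add(cargo)
            new_left = frozenset(new_left)
            # The bank the rower leaves behind must not hold an unsafe pair.
            unattended = new_left if rower_left else items - new_left
            if not safe(unattended):
                continue
            state = (new_left, not rower_left)
            if state not in seen:
                seen.add(state)
                queue.append((state, trips + 1))
    return None

# Classic puzzle: a farmer ferries wolf, goat and cabbage.
print(min_trips(["wolf", "goat", "cabbage"],
                [("wolf", "goat"), ("goat", "cabbage")]))
# The question as actually asked: the goat is the only traveler, no cargo.
print(min_trips([], []))
```

The first call models the classic riddle, the second models what the prompt literally says: one goat, nothing to ferry, nothing to guard.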
I was curious about this myself. I've seen these types of posts before, so I decided to try it myself.
I then tried again with the "web search" function and got this
Based on this sample size of 2, I can conclude that searching the web is causing the issue.
Which might explain the "Reviewed 20 sources" message in the original image.
Ah thank you, now I see what you mean. And it seems like we're mostly talking about the same thing here 😅
To reiterate: unprecedented amounts of money and resources are being sunk into systems that are fundamentally flawed (by semantic drift, among other things), because their creators double down on their bad decisions (just scale up more) instead of admitting that LLMs can never achieve what they promise. So when you're saying that LLMs are just fancy autocorrect, there's absolutely no disagreement from me: it's the point of this post.
And yes, for an informed observer of the field, this isn't news - I just shared the result of an experiment because I was surprised how easy it was to replicate.