Indeed - just like word processor vs. typewriter, calculator vs. slide rule, spreadsheet vs. calculator, etc., etc., etc.
Execution Outcomes of LLM Generated Research Ideas
A Stanford paper – main finding: LLM ideas result in worse projects than human ideas.
Calculator vs slide rule
This prompted a good observation drawing a parallel with AI.
At school I did exams using a slide rule (yes, I am that old). One of the things you always did was a quick ‘rounded up’ mental calculation to get the magnitude of the answer, so you could quickly tell if you had made a mistake using the slide rule.
Calculators came along and people punched the keys and believed the answer without checking it, where a mis-press could result in a silly answer.
AI will do summaries (and I do find them useful), but you need to understand the material and do a ‘rounded up’ check that it is not rubbish. The danger is people start treating AI like calculators and believe without question what it churns out.
This danger applies to watching the news, reading social media, YouTube – and even papers from Stanford, etc. – as well.
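For the numeric case, the ‘rounded up’ check described above is easy to make concrete; a toy sketch (the factor-of-three tolerance is an arbitrary illustrative choice, not anything from the original posts):

```python
# Toy version of the slide-rule habit: a rough, rounded estimate tells you
# whether the precise answer is at least in the right ballpark.
# The factor-of-three tolerance below is an arbitrary illustrative choice.
def sanity_check(precise: float, rough_estimate: float, factor: float = 3.0) -> bool:
    """Return True if the precise result is within `factor` of the rough estimate."""
    return rough_estimate / factor <= abs(precise) <= rough_estimate * factor

# 3.2 * 487 is roughly 3 * 500 = 1500, so 1558.4 passes;
# a slipped decimal point (155.84) does not.
print(sanity_check(3.2 * 487, 1500))  # True
print(sanity_check(155.84, 1500))     # False
```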
In my limited experiments, results vary widely, from “grad student with decent study skills” to “what did this thing even read?” If the document in question is something you need to actually understand – part of a literature review, say – using AI may turn out to be a false economy.
I think in the world of AI it may be helpful to re-imagine what a table of contents is.
To me, an AI summary which is packed with hyperlinks to specific pages is a modern table of contents - much more useful than a traditional one.
I script my “summaries” to include an executive summary, detailed summary, a chronological table of issues, and a bullet-point list of key issues. Even a list of “possible inconsistencies.” All with hyperlinks to specific pages.
I suggest that is a modern and really useful table of contents.
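I don't know the exact script referred to above, but as a rough sketch, a prompt asking for that kind of structured, page-cited summary might look something like this (the section wording and the helper function are illustrative placeholders, not the actual setup):

```python
# Sketch of a prompt for the kind of structured, page-linked summary described
# above. The section headings follow the ones mentioned in the post; the exact
# wording and the helper function are placeholders, not the poster's script.
SUMMARY_PROMPT = """Summarize the attached document. Produce, in this order:
1. Executive summary (one short paragraph).
2. Detailed summary.
3. Chronological table of issues (date | issue | page).
4. Bullet-point list of key issues.
5. List of possible inconsistencies.
For every point, cite the page it comes from, e.g. (p. 12), so each entry
can be turned into a hyperlink back to that page."""

def build_summary_request(document_text: str) -> list[dict]:
    """Assemble a chat-style request for any OpenAI-compatible client."""
    return [
        {"role": "system", "content": SUMMARY_PROMPT},
        {"role": "user", "content": document_text},
    ]
```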
You are also expending far more time and energy in creating a highly personalized and customized process.
I wonder if that is really the world we are in?
Pitches to potential LLM investors (vast sums sought, returns obscure, moats elusive) do urgently encourage that impression, but the very architecture of these systems is turning out to constrain them to fragmented Potemkin outputs, with only an illusion of understanding (see the papers above, not least Apple’s).
They can’t even play chess.
For the foreseeable, we appear to be in the world of little more than AL – Artificial Language.
(or, to allow for the legally contested visual outputs, ATS – Artificial Token Streams)
As with any technology there are lots of people trying to use it for suboptimal purposes. Lots of those tend to make it to mainstream media.
There are also some solid-gold winning use cases for AI going on right now in all sorts of professions. Many of them do not get as much publicity, because if they are a means to a commercial end, why spill the beans on the secret sauce?
Conceivably – though I’m not sure how we would know.
I hope, for the investors, that economic value is found and translated into sustainable revenue, because the current reality is one of rapidly burning cash that was invested before the structural limitations of the LLM + reasoning model (and the difficulty of constructing any moat) began to fully emerge in experiment and competition.
Many of these products will fall away in shake-out.
When I refer to winning use cases, I mean among end users.
Those are not necessarily public companies producing quarterly reports. Lots of winners in AI are much smaller, privately held businesses.
As for which tech startups will survive - who knows. I would not place any bets on a specific AI company being around in X years or turning profitable. What I would bet on, however, is that the AI industry overall will continue to thrive.
I think it’s necessary to differentiate between “AI” generally and “LLMs” specifically. Many of the most successful “AI” applications involve tools other than LLMs. And the major LLM players are all burning enormous piles of money with no apparent path to profitability.
I think the general public’s hype and misunderstanding of the LLM may well wind up the same way IBM Watson did 20 years ago…. We’ve seen this bubble before, and we’ve even watched the promises rise and fall…. People quickly forget that Watson would do great on customer chats for a day or a week and then would start spitting out confidential stuff and making promises it was not supposed to make…. There’s a reason Watson isn’t running every doctor’s office and customer help desk…. It’s the same reason people today are finding out that even very large frontier models cannot even run a vending machine stand for a month….
Meanwhile, speaking as someone who never forgot how much compute it took to transform an expert system, who played with WordNet, and who thought about what neural nets might do someday: the LLM is grand if you understand what it is supposed to be doing. Hell, I’m even more excited about the diffusion models, so-called dLLMs, now, as they take a different method of transformation and can do fill-in-the-blank exceptionally well.
Anthropic itself admits in its internal-use docs that Claude Code gets it wrong 67% of the time…. But 33% of the time it does really well. There’s something to learn here about this massive neural network and whether the winning subnetwork was activated when you sent the chat….
Meanwhile, I am able to run a qwen3-235B-A22 model at home with a giant context window…. Having DEVONthink able to chat directly with the LM Studio instance running in the house has been a huge aid to me. I’ve been able to let it peruse old material and see whether there are themes in my old works that are prescient of current ones, etc…. Quite an incredible research assistant for me, and very helpful.
Does it replace a researcher? No…. But it is incredibly valuable and helpful because I use it for what the tool is meant to do.
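For anyone curious about that kind of local hookup: DEVONthink's integration is built in, but talking to a local LM Studio instance from a script goes through its OpenAI-compatible API. A minimal sketch, assuming the default port, with the model name and the notes file as placeholders:

```python
# Minimal sketch of querying a local LM Studio server over its
# OpenAI-compatible API. Assumptions: LM Studio is serving on its default
# port (1234), and the model name matches whatever model is actually loaded.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's default local endpoint
    api_key="lm-studio",                  # any non-empty string; it's all local
)

# "old_notes.txt" is a placeholder for whatever old material you export.
notes = open("old_notes.txt", encoding="utf-8").read()

response = client.chat.completions.create(
    model="qwen3-235b-a22b",  # placeholder; use the identifier LM Studio shows
    messages=[
        {"role": "system", "content": "You are a research assistant."},
        {"role": "user", "content": "What recurring themes in these notes are prescient of current work?\n\n" + notes},
    ],
)
print(response.choices[0].message.content)
```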
Meanwhile, nobody should be surprised that an Atari 2600 set to “beginner” level was able to smack Claude and ChatGPT in chess…. And it wasn’t even fair…. The Atari simply OWNED the models…. Why? Because an LLM cannot reason in the sense it would take to play chess. The AI hype out there is people being led to believe that they can…. (Just like they did with IBM Watson)…
So I am glad we had this hype cycle…. Without it, I’d never have so many large and capable LLMs available to run at home for free that can do what LLMs do and help me so much!
I’ve often found that the attempted shortcut (whether in research or in code-drafting) is mysteriously paid for, again and again, either in wasted time or in what turn out later to have been false trails.
Engendering false confidence seems to be the central cost. Lawyers reprimanded (or worse) by judges for having cited plausible-sounding but non-existent cases inevitably come to mind.
If you catch a glimpse of the problem, and then circle round (sometimes again and again) in attempts to fix it, the clock ticks well into injury time.
Not the fault of these interesting experimental systems, but closely related to the way they have been over-sold to investors.
Many of us probably remember this early case:
… a document attached to his affidavit indicates he asked the generative AI-powered ChatGPT if one of the six cases the judge has called bogus was real and the chatbot responded that it was.
Additionally, he asked ChatGPT if the other cases provided were fake. The chatbot responded that they were also real and “can be found in reputable legal databases such as LexisNexis and Westlaw.”
and it doesn’t seem to have stopped:
Two More Cases Where Lawyers Face Judicial Wrath for Fake Citations – May 2025
The law has a way of surfacing these things, but the pattern occurs in any domain – the bill for false confidence comes in sooner or later.
“Calculators came along and people punched the keys and believed the answer without checking it, where a mis-press could result in a silly answer.”
Having been at school when calculators came along, I use a calculator exactly as you use a slide rule: they are both convenience tools for calculation, and both should be used with an idea of the scale of the expected result.
And there’s a difference between:
- operator error in a determinate system, and
- pretending that a stochastic linguistic trick is performing thought or research.
Bluffing depends on mimicking the patter, but not on knowledge of the world.
It produces no degree of reliability – just an impression of plausibility.
That’s not my text. It’s a quote from saltlane to which I replied.
Experimental systems that cost so much to run that the “researchers” are effectively forced to productize them, “reliable” or not.
I think part of the problem with user expectations is that we’re used to computers being determinate systems. Even when they are wrong, they are wrong in predictable ways. User error, or the model was wrong to begin with, or the model isn’t complex enough to deal with real world reality, or something like that. After years of working with computers, we understand those kinds of failures.
LLMs are not determinate systems, and in fact can’t be. They work as well as they do because they incorporate randomness. That’s exactly what makes their (many) flaws difficult or impossible to fix. But meanwhile the humans are conditioned to expect (and therefore assume) accuracy, even though an LLM has no concept of what “accuracy” even is.
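To make the "incorporate randomness" point concrete: at each step the model produces a probability distribution over possible next tokens and the decoder samples from it (usually with a temperature setting), so the same prompt can yield different outputs on different runs. A toy illustration with an invented vocabulary and invented scores:

```python
# Toy illustration of why LLM output is non-deterministic: the model emits a
# probability distribution over next tokens, and the decoder *samples* from it.
# The vocabulary and the scores here are invented for illustration.
import math
import random

vocab = ["accurate", "plausible", "wrong", "uncertain"]
logits = [2.0, 1.6, 0.4, 0.1]  # hypothetical model scores for the next token

def sample_next_token(logits: list[float], temperature: float = 0.8) -> str:
    """Softmax with temperature, then a random draw from the distribution."""
    scaled = [score / temperature for score in logits]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(vocab, weights=probs, k=1)[0]

# The same "prompt" (same scores) gives different continuations on different runs.
print([sample_next_token(logits) for _ in range(5)])
```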