The AI thread

This was Claude, on a paid account with a Claude Pro subscription.

Everything (and I mean everything) was written by Sonnet 4.5.

Including all the documentation; all I did was tell it what to make, and to write it in cross-platform C.

I'd never used GitHub before (only local git); Claude helped me connect the git repo to itself and publish to it.

The first version of the app was a one-shot prompt. I added the XML/JSON/CSV export feature, and that was a one-shot too.

It was on the internet inside of 45 minutes as a cross-platform app, and that included me walking across the street to grab a coffee, sitting back down at my desk and finishing it.


Last year I did the same thing in PowerShell (without export formats) using ChatGPT. It took a couple of hours of back-and-forth bugfixing to get it right.

12 months later, it's a one-shot, tested, cross-platform app, uploaded to GitHub with full user documentation... in under an hour, including me learning to create the GitHub repo and connect it to Claude...

  • Two years ago I would have been looking for an existing tool, sorting through shareware, paid options, etc.
  • One year ago I got ChatGPT to help and it took a couple of hours.
  • Yesterday... if I just wanted a tool to run locally and didn't care about publishing it... 15 minutes?

Claude Pro costs like $20/month... it's constantly helping save me time.

I run a software development group at a large organization, having come from a development and software architect background myself. We’ve been using Claude 4.5, Copilot, and agents heavily. I agree that as of today (Jan 2026), Claude Paid is the most useful for my personal tastes. It requires fewer corrections, though it still makes occasional bad decisions.

I do think that your mileage will definitely vary depending on the underlying tech stack and how well trained your particular model is. We recently shifted our internal development from C#/Node/React to Python/Django/HTMX and Alpine.js.
While Claude and Copilot are generally handy research tools (similar to a quicker Stack Exchange/Google search), we found they initially struggled with configuration for HTMX-oriented projects. However, the community is filling that gap quickly; for example, this config adaptation covering Django/HTMX skills:
https://forum.djangoproject.com/t/i...-forms-testing-htmx-drf-celery-and-more/43862

My takeaways on AI code generation:

  1. Sandbox environment vs. production: AI is an excellent "sounding board" for mocking up simple UIs or exploring technical ideas that have been well trodden before. However, it does not replace the need for fundamental knowledge of the underlying platforms. Claude hallucinates. Nothing goes into our production environment without full human oversight and accountability.
  2. The "Drunken Coding Partner" Effect: Coding with AI is like pair-programming with a drunken partner. It’s helpful, but you have to watch it like a hawk because it can get things authentically and confidently wrong.
  3. Enterprise Maturity: Most models that I have used are not yet ready to handle enterprise-grade concerns like security, scalability, performance considerations and complex logging without heavy hand-holding. Anywhere you see hand-holding, expect significant time investment.
    Where they do excel is isolated tasks (like parsing a log file). They tend to struggle with asks that stray into more complex territory like holistic architecture!
  4. The Efficiency Myth: There is a belief among (non-tech-savvy) executives that AI will slash costs by replacing developers. This ignores the reality that software delays (at least in my experience) are rarely caused by typing speed. Rather, they are often caused by unclear business requirements or scope creep. AI cannot fix a lack of understanding of the core business problem.
Ultimately, if I could wave a magic wand to make most software development projects at my own company go quicker, it would be to grant more empathy: I truly believe empathy will become the superpower that differentiates the developers of the future. As I tell my own team, understanding the user and the problem is the one skill that keeps developers relevant, and it's something AI cannot replicate.

Just my 0.02.

(Disclaimer: the above is my own honest opinion and should not be taken as that of my employer.)
 
Couple of things:

Yeah, it's not a replacement for understanding how to code. However, it is 100% an efficiency improvement, and as you state above, it makes testing out ideas and writing code for disposable tools much, much cheaper. That tool I made was something I could have written for my own undocumented use in a few days. It would have had no docs, poor version control, not been cross-platform, etc. The fact that Claude one-shot a version 1 that worked as designed inside of 15 minutes is just wild.

In some circumstances it's now getting cheaper to build your own tools than to find and procure them. Anything with a publicly documented REST API, for example, is trivial to connect to now.
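For instance, pulling a response from a documented endpoint is only a few lines with libcurl. This is just a minimal sketch - the URL is a placeholder, and a real service will usually want auth headers and a proper response handler:

Code:
#include <stdio.h>
#include <curl/curl.h>

int main(void) {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    /* Placeholder endpoint - substitute the documented API you're targeting. */
    curl_easy_setopt(curl, CURLOPT_URL, "https://api.example.com/v1/status");
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);

    /* With no write callback set, libcurl prints the response body to stdout. */
    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK)
        fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return (res == CURLE_OK) ? 0 : 1;
}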

Today's paid, reasoning-LLM-based AI isn't just faster typing (code autocomplete was years ago now) - it's faster reasoning, testing and iteration.

I honestly don’t think AI is anywhere near replacing developers any time soon as you need someone capable of overseeing it, checking the prompts, identifying the actual requirements, etc.

I do however think it is going to make developers and IT professionals much, much more productive. Will this result in less demand for devs? Anthropic don't seem to think so, and neither do I. There's no shortage of stuff to make; these tools will just enable us to get much more done.

Also, what I find great (and not just for code) is project documentation. If the LLM is privy to your code base (or other project), whether it wrote it or you did, it is awesome for keeping the documentation up to date! No one likes writing docs, and inevitably, everywhere I have worked, they constantly lag reality by days, weeks, months... years.

These LLM models help massively with that.


And yes, don't trust the output blindly. I actually ran the Claude output through GPT 5.2 and it identified some things to fix (edge-case DST bugs, filesystem reparse-point bugs on Windows); stuff I'd never have found, but that it spotted in my test output and the code.
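To give a flavour of the DST issue (my own simplified sketch, not the actual fix it suggested): if you bucket files into days by dividing epoch seconds by 86400 you're really bucketing by UTC day, so files near midnight can jump buckets when the local UTC offset changes at a DST transition; going through localtime() keeps buckets aligned to the local calendar:

Code:
#include <stdio.h>
#include <time.h>

/* Naive bucket: days since the Unix epoch, which is a UTC boundary. */
static long day_bucket_naive(time_t t) {
    return (long)(t / 86400);
}

/* Calendar-aware bucket: YYYYMMDD in the local timezone (DST-safe). */
static long day_bucket_local(time_t t) {
    struct tm tm;
    localtime_r(&t, &tm);
    return (tm.tm_year + 1900) * 10000L + (tm.tm_mon + 1) * 100L + tm.tm_mday;
}

int main(void) {
    time_t now = time(NULL);
    printf("naive: %ld  local: %ld\n", day_bucket_naive(now), day_bucket_local(now));
    return 0;
}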

That said, my entire app was written by Sonnet 4.5, so Opus probably would have done better :D
 
For fun I tried this exact same question on both ChatGPT and Grok. Wow.

Did trump win the election in 2020?

ChatGPT:
"No — Donald Trump did not win the 2020 U.S. presidential election.
In the 2020 election held on November 3, 2020, Joe Biden won both the Electoral College and the popular vote and became the 46th President of the United States.

Here are the official results:
  • Joe Biden (Democrat) received 306 electoral votes and about 51.3 % of the popular vote.
  • Donald Trump (Republican) received 232 electoral votes and about 46.8 % of the popular vote.
  • A candidate needs at least 270 electoral votes to win the presidency.
These results were certified by state officials, accepted by Congress, and Biden was inaugurated on January 20, 2021."

Grok:
"No response."
 
Disclaimer: Please read my comments below as somebody who is not 100% anti-AI, nor somebody who is trying to debate you. Really I'm trying to discuss/deliberate with you and rarely are there absolutes to anything - especially in my response, which should also be seen through the lens of 'well it depends'. I value the back and forth and hope that my response lands this way :).

  1. Speed of generation vs. efficiency of value: I'd like to differentiate 'speed of code generation' from 'efficiency of value.'
    Yes, in the example that you kindly provided, Claude one-shotted a tool in 15 minutes. That feels pretty efficient now, but code that is not well understood is a liability, not an asset. How well will you understand what was generated if a bug is encountered in that same code six months down the line? If you are no longer at the company and another developer takes your place, has that code been generated to corporate standards? (See my point below on increased cognitive load.)
    The moment that 'disposable' tool becomes critical to a workflow (which they invariably do), you are left maintaining a codebase that you didn't write and likely don't fully understand. When the REST API changes or an edge case breaks the script in six months, the 'efficiency' gained in the first 15 minutes could very well be wiped out by the hours spent reverse-engineering the AI's logic to fix it.
  2. Documentation: AI is great at describing what code does (syntax), but terrible at explaining why it does it (intent/business logic). Having an LLM generate docs can be useful, but it often results in a bunch of technically correct but contextually empty boilerplate text!
    This could create a false sense of security where we believe that the project and code base is well-documented, only to find out that the docs don't actually explain the architectural decisions nor the 'why' they were written a specific way when things go horribly wrong.
  3. LLMs don't truly reason: While I can see why you feel that AI represents 'faster reasoning', I'd dispute this somewhat because LLMs like Claude do not reason; they pattern match.
    When we use GPT 5.2/Copilot/<insert other black-box LLM> to check Claude's work, we are not auditing the code; we are layering one probability curve over another. This can be very risky. In your example, the fact that the second GPT model found bugs (DST, reparse points) that you openly admitted you 'never would have found' would be a red flag for me, not necessarily a win.
    I'd be worried that the human in the loop - the one reviewing the LLM - has already lost the ability to effectively audit the output without investing time to ramp up and close any knowledge gaps; the developer time investment shifts from code creation to knowledge-building and review.
    If we have reached a point where we rely on AI to write the code AND to catch the subtle bugs in the generated code (because the human can't spot them), we haven't gained efficiency. Instead we’ve introduced a circular dependency of black boxes (and additional licensing costs - something that your organization may not want to license or even be allowed to license).
  4. Increased cognitive load: In my experience, the bottleneck in software development has never been typing speed; it’s the cognitive load of understanding complex systems.
    Flooding a project with AI-generated code can increase the cognitive load for the reviewer, arguably making the overall process slower if you care about the long-term stability and maintainability in a team environment.
    AI might make the creation phase efficient for getting a starting point in your project. However, it also could make the code standardization and governance phases significantly more challenging when the same prompts produce very different code outputs from different team members! Real production-level efficiency comes from maintainability, not just how fast an individual can get a script running on a Tuesday afternoon. The code that goes into production may not be touched again by the same developer six months down the line; it may fall to somebody who hasn't even joined the organization yet.
  5. Cost: The true costs of most paid subscription models are not always well understood upfront. A complex coding task might generate 500 lines of visible code, but it could burn through thousands of internal 'thought tokens' to get there. Because LLMs are stateless, when you need to update documentation for a large project, you often have to feed the entire codebase (or large chunks of it) back into the context window for every request. If you have a 50MB codebase, you are paying to 'upload' that 50MB of text to the model every time you want to fix a few typos or make small changes to the README. That is a huge recurring cost that human developers don't incur (we have 'cached' memory); see the rough back-of-the-envelope sketch below.
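To put rough numbers on that last point (the 4 bytes/token and $3 per million input tokens figures below are assumptions for the sake of arithmetic, not any provider's actual pricing; in practice you'd send chunks rather than one giant prompt, but the aggregate cost is similar):

Code:
#include <stdio.h>

int main(void) {
    /* All figures are illustrative assumptions. */
    const double codebase_bytes     = 50.0 * 1024 * 1024;  /* the 50MB codebase above   */
    const double bytes_per_token    = 4.0;                 /* rough rule of thumb       */
    const double usd_per_mtok_input = 3.0;                 /* assumed input-token rate  */

    double tokens = codebase_bytes / bytes_per_token;
    double cost   = (tokens / 1e6) * usd_per_mtok_input;

    printf("~%.1f million input tokens, ~$%.2f per full re-send of the codebase\n",
           tokens / 1e6, cost);
    return 0;
}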
TL;DR: AI, in my experience, really shifts time costs from creation to review, and dollar costs from upfront to maintenance and review. This is not to imply that we should not use LLMs for code... just go in with your eyes wide open.

This is a really great post from Simon Willison (co-creator of Django, co-founder of Lanyrd, which was acquired by Eventbrite): https://simonwillison.net/2025/Dec/18/code-proven-to-work/

Finally, I'm a huge fan of SLMs (small language models) and feel that for code generation they may represent a more palatable future for my tastes, solving many of the problems mentioned above. We just need hardware costs to come down and hardware capability to go up, so that we are not required to buy 50 developers an M3 Ultra Mac Studio with 256 GB of RAM each!
 
I very much agree, though I do think there should also be a distinction between code generated for a personal (or at least non-professional) purpose and the kind of code you're talking about. To argue against my own point somewhat, even there I recognize the slippery slope, or at least the difficulty of drawing a strict delineation between the two categories, but I do think there's a difference between, say, Microsoft trying to rewrite the Windows OS from scratch with AI and, say, @Eric using an LLM to help write code for this board. I don't know exactly where that line is, but it's there somewhere.
 
LLM-based AI isn't just pattern matching any more.

Reasoning AI is achieving things like solving novel DEF CON problems (the DEF CON conference sets complex challenges for hackers to solve in competition).

If you think the current state of the art is still simply basic next token prediction, you're about a year behind.


Also... maintainability: that code was something that worked to solve a real-world problem before my morning coffee, with a 2026 state-of-the-art coding LLM.

If we need to revisit in 6 months or whenever I've left the company, just get the much more advanced LLM in mid 2026 to audit/fix it.


I'm not saying LLMs can code everything ever, or that they don't need to be supervised. I'm not saying give them free rein over everything outside of a sandbox until the code is battle-tested.

What I AM saying is that getting usable tools cranked out in 15 minutes and actually using them to solve real world problems TODAY instead of a week of developer time is a revolution.

The issue(s) discovered in my code came despite the model being given no instructions beyond the following (this was the entire prompt for v1.0):

We're writing a command line tool called "spacetime" that will generate a histogram of disk space consumption for a folder subtree. this tool will generate bar graphs for disk consumption grouped by various criteria (specified on the command line via switches):

  • modification date
  • creation date
  • last accessed date
The tool shall be written in portable C code. The current host platform is macOS 26, but we should write the code to be portable to Linux, FreeBSD and Windows.

And then it just one-shotted it... with a better prompt telling it about those additional constraints via a system prompt for code style, other requirements, etc., it would have done better in the first place.

The planning steps, etc. below aren't basic "next token prediction"... and if you claim they are, then I'd argue that it's doing a better job than most mid-tier developers do, and that most devs can't even do that.

I've read through the code; it's pretty clean and well structured, in my opinion.
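For anyone curious what's under the hood, the core of a tool like this boils down to something like the sketch below. To be clear, this is a stripped-down illustration of the general approach using POSIX APIs, not the actual generated code - error handling, the other grouping modes and the Windows path are all left out:

Code:
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <dirent.h>
#include <sys/stat.h>

static unsigned long long bytes_by_year[3000];   /* histogram buckets */

static void walk(const char *path) {
    DIR *dir = opendir(path);
    if (!dir) return;

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
            continue;

        char child[4096];
        snprintf(child, sizeof child, "%s/%s", path, entry->d_name);

        struct stat st;
        if (lstat(child, &st) != 0)
            continue;

        if (S_ISDIR(st.st_mode)) {
            walk(child);                           /* recurse into subtree */
        } else if (S_ISREG(st.st_mode)) {
            struct tm tm;
            localtime_r(&st.st_mtime, &tm);        /* group by modification date */
            int year = tm.tm_year + 1900;
            if (year >= 0 && year < 3000)
                bytes_by_year[year] += (unsigned long long)st.st_size;
        }
    }
    closedir(dir);
}

int main(int argc, char **argv) {
    const char *root = (argc > 1) ? argv[1] : ".";
    walk(root);
    printf("Disk usage by modification year under %s:\n", root);
    for (int y = 0; y < 3000; y++)
        if (bytes_by_year[y])
            printf("%d  %llu bytes\n", y, bytes_by_year[y]);
    return 0;
}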
 

Attachments

  • Screenshot 2026-01-14 at 2.38.35 pm.png
LLM-based AI isn't just pattern matching any more.

Reasoning AI is achieving things like solving novel DEF CON problems (the DEF CON conference sets complex challenges for hackers to solve in competition).

If you think the current state of the art is still simply basic next token prediction, you're about a year behind.

I agree that 'stochastic parrot' is an outdated beatdown, but let's be precise about what 'Reasoning' means here.

Solving DEF CON Capture the Flag challenges is impressive, but it’s a bounded problem. It has a clear rule set, a defined success state, and a logical path to a solution. That is exactly what 'Chain of Thought' (CoT) compute excels at.

However, enterprise software is an Unbounded Problem. It involves ambiguous stakeholder requirements, unwritten business rules, and shifting infrastructure constraints. AI models like Claude 4.5 'reason' by simulating logic paths (probabilistic pathfinding), not by understanding intent. When you say the code is 'clean and well structured,' you are judging the syntax (the how), not the semantic validity (the why).

AI is great at making code look clean (e.g. PEP-8 compliant, commented etc...). But 'clean' code can still contain race conditions, insecure deserialization, or logic bombs that only trigger in edge cases. A visual inspection of 'cleanliness' is not a security audit. To that end, 'Reasoning' in a sandbox is not the same as 'Engineering' in production.

Also... maintainability: that code was something that worked to solve a real-world problem before my morning coffee, with a 2026 state-of-the-art coding LLM.

If we need to revisit in 6 months or whenever I've left the company, just get the much more advanced LLM in mid 2026 to audit/fix it.

Relying on a future, unreleased model to maintain today's code is a gamble, not a strategy. What if the cost of inference for that 'smarter' model is 10x higher? You've just locked yourself into a high-rent maintenance plan.

A strategy of 'just get the more advanced LLM in mid-2026 to audit/fix it later' represents huge business risk and is literally the definition of technical debt: borrowing efficiency today against a future resource you hope will exist or be more capable.

We have entered the era of 'opaqueware' - software that works, but nobody at the company quite understands how or why! I'm not trying to be funny; I'm actually being serious! If a company's plan for maintenance is 'ask the next version of the AI,' then you aren't building a toolset; you are building a dependency chain, and you risk removing institutional knowledge (understanding your own stack) and replacing it with a subscription to a model provider.

If that 'mid-2026 AI' hallucinates a fix on a deprecated library that the 2025 AI chose, and no human has sufficient knowledge to spot it because they've been 'one-shotting' everything for six months, then you don't have an efficiency gain. Instead you have a critical failure waiting to happen.


What I AM saying is that getting usable tools cranked out in 15 minutes and actually using them to solve real world problems TODAY instead of a week of developer time is a revolution.

The issue(s) discovered in my code came despite the model being given no instructions beyond the following (this was the entire prompt for v1.0):


And then it just one-shotted it... with a better prompt telling it about those additional constraints via a system prompt for code style, other requirements, etc., it would have done better in the first place.

The planning steps, etc. below aren't basic "next token prediction"... and if you claim they are, then I'd argue that it's doing a better job than most mid-tier developers do, and that most devs can't even do that.

I've read through the code; it's pretty clean and well structured, in my opinion.

The fact that it 'one-shotted' the code with a basic prompt actually proves that it was a generic problem. The harder the problem, the less likely a 'one-shot' works, or the more time is involved in providing a prompt that suitably describes it. Solving common boilerplate problems quickly - while very useful - should not be confused with solving novel engineering challenges.

I have to push back on the timeline comparison too. If a disposable tool takes a human developer a full week to write, but an AI gets the job done in 15 minutes, that says more about the developer's toolkit than the AI's brilliance.

You're right, the efficiency gain can be real, no doubt about it, but let's not exaggerate it. The risk here is the commoditization of quality argument.

Because it only took 15 minutes to generate, we tend to treat the code as disposable. But 'disposable' code has a really nasty habit of quickly becoming 'production' code, with a bunch of downstream TODOs, supporting critical business workflows.

When we lower the barrier to entry for creating tools to zero (or next to zero), there can be a tendency to flood our ecosystem with unmanaged scripts. The 'efficiency' of the 15-minute build is quickly eroded by the 'inefficiency' of managing 50 different 'one-shot' shadow IT tools that all break simultaneously when an API updates.

I think we actually agree on the potential when suitably applied as a job aid, just not the risk profile.

I’m not saying don’t use the tools - I use them too. I’m just saying that 'it works' and 'it’s maintainable' are two very different bars to clear. If we stop understanding our own technical stack because it’s cheaper to let an AI 'one-shot' it, we aren’t becoming super-developers, we’re becoming passengers. That worries future me! Let’s enjoy the speed boost, but keep our hands on the wheel. :)

Just my 0.02

Have a great day :)
 
I agree that 'stochastic parrot' is an outdated beatdown, but let's be precise about what 'Reasoning' means here.

Solving DEF CON Capture the Flag challenges is impressive, but it’s a bounded problem. It has a clear rule set, a defined success state, and a logical path to a solution. That is exactly what 'Chain of Thought' (CoT) compute excels at.

However, enterprise software is an Unbounded Problem. It involves ambiguous stakeholder requirements, unwritten business rules, and shifting infrastructure constraints. AI models like Claude 4.5 'reason' by simulating logic paths (probabilistic pathfinding), not by understanding intent. When you say the code is 'clean and well structured,' you are judging the syntax (the how), not the semantic validity (the why).

I'd say it's still a somewhat fair beatdown :) - unless I've missed a development, which is entirely possible, the "reasoning" is still purely associative probability. I know there is work being done on integrating full logic engines into neural networks, but I don't remember seeing that reported as having been done for the current production LLM models. My understanding is that Chain of Thought compute which allows LLMs to solve complex problems is still "reasoning" by breaking down the intermediate steps not into "if A then B" symbolic reasoning but rather "A is likely to be associated with B and unlikely to be associated with C, D, or E, so choose B as the next step in the chain with probability x determined by model temperature y" and so on. This may be what you were saying by "simulating logic paths (probabilistic pathfinding), not by understanding intent", but I just wanted to clarify my own understanding.

Depending on your point of view, that ability is one rung (and IMO only one rung) above being a stochastic parrot, or it demonstrates that stochastic parrots can be more powerful than token-matching autocomplete; either way, the LLM is still a form of stochastic parrot. It is also why an LLM, when it does screw up on these sorts of puzzles, screws up in unexpected ways, or sometimes takes a circuitous path that a human wouldn't, even if it does in the end solve the problem. Again, if I missed an important development, then apologies for being both pedantic and wrong.
 
I'd say it's still a somewhat fair beatdown :) - unless I've missed a development, which is entirely possible, the "reasoning" is still purely associative probability. I know there is work being done on integrating full logic engines into neural networks, but I don't remember seeing that reported as having been done for the current production LLM models. My understanding is that Chain of Thought compute which allows LLMs to solve complex problems is still "reasoning" by breaking down the intermediate steps not into "if A then B" symbolic reasoning but rather "A is likely to be associated with B and unlikely to be associated with C, D, or E, so choose B as the next step in the chain with probability x determined by model temperature y" and so on. This may be what you were saying by "simulating logic paths (probabilistic pathfinding), not by understanding intent", but I just wanted to clarify my own understanding.

My understanding is that CoT works by having one LLM call break the problem down into smaller steps, and feeding that as context and/or prompts back into the next LLM call(s). So each call is responsible for a smaller chunk of "work". So it helps prevent winding up way out in the boonies of the abstract space that the model operates in as it starts to fill out details a user might have implicitly meant, but as you point out, it’s not really reasoning because each chunk is still going through a "traditional" LLM step. It’s also plainly more expensive in compute, so it’s not like we can use it for everything if cost matters between two approaches.

It’s still in the vein of "more monkeys on typewriters" approach of brute forcing the problem.
 
I'd say it's still a somewhat fair beatdown :) - unless I've missed a development, which is entirely possible, the "reasoning" is still purely associative probability. I know there is work being done on integrating full logic engines into neural networks, but I don't remember seeing that reported as having been done for the current production LLM models. My understanding is that Chain of Thought compute which allows LLMs to solve complex problems is still "reasoning" by breaking down the intermediate steps not into "if A then B" symbolic reasoning but rather "A is likely to be associated with B and unlikely to be associated with C, D, or E, so choose B as the next step in the chain with probability x determined by model temperature y" and so on. This may be what you were saying by "simulating logic paths (probabilistic pathfinding), not by understanding intent", but I just wanted to clarify my own understanding.

Depending on your point of view, that ability is one rung (and IMO only one rung) above being a stochastic parrot, or it demonstrates that stochastic parrots can be more powerful than token-matching autocomplete; either way, the LLM is still a form of stochastic parrot. It is also why an LLM, when it does screw up on these sorts of puzzles, screws up in unexpected ways, or sometimes takes a circuitous path that a human wouldn't, even if it does in the end solve the problem. Again, if I missed an important development, then apologies for being both pedantic and wrong.
That is what I am getting at :). Most prominent LLMs use something very similar to probabilistic pathfinding - call it token prediction v2.0 - deconstructing the initial prompt into a series of steps and generating plausible solutions through probabilistic inference and associations over both language and logic. This makes LLMs highly suited to solving graph-based, bounded pathfinding problems when prompted appropriately.

When Claude "thinks," it is not entering a separate "logic brain." It is simply generating tokens into a hidden context window before it generates the final answer.
  • Standard Mode: Input -> Predict Answer. (High risk of hallucination because the statistical leap is too big).
  • Thinking Mode: Input -> Predict First Step -> Predict Second Step -> Predict Answer.
Nobel laureate Daniel Kahneman described human thinking in terms of two systems.
A source, if you guys are interested: https://www.scientificamerican.com/article/kahneman-excerpt-thinking-fast-and-slow/
  • System 1: Fast, intuitive, pattern-matching (e.g. the example in the paper being 2+2=4).
  • System 2: Slow, deliberate and logical (e.g. 17 × 24).
LLMs are purely System 1 engines - they are instant pattern matchers. "Thinking" mode is a (very sophisticated) hack to simulate System 2. By chaining together thousands of tiny System 1 predictions (step-by-step tokens), they approximate a System 2 process. But if the model creates a "logic error" in step 3, it will often confidently hallucinate the rest of the chain because it is following the pattern of a logical argument, not the strict rules of logic.

By forcing the model to output the "steps," it changes the statistical probability of the final answer. It is conditioning its own future output, as you alluded to, based on probabilistic associations between steps. It is not "checking its work" in the way a human does; breaking the problem down in this way is simply laying down a trail of text that makes the correct next token statistically more likely than the incorrect one.
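To make that concrete, here is a rough structural sketch of the two modes. llm_generate() is a purely hypothetical stand-in for a single stateless model call (it is not any real API; here it just returns a placeholder so the sketch compiles), but the shape of the flow is the point:

Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one stateless model call. A real call would
   send `prompt` to a model endpoint; this stub just returns a placeholder. */
static char *llm_generate(const char *prompt) {
    (void)prompt;
    char *out = malloc(32);
    snprintf(out, 32, "[model output]");
    return out;
}

/* "Standard mode": one prediction straight from the question. */
static char *answer_directly(const char *question) {
    return llm_generate(question);
}

/* "Thinking mode": predict intermediate steps first, then condition the
   final prediction on question + steps. Same next-token predictor both
   times; no separate logic engine is involved. */
static char *answer_with_thinking(const char *question) {
    char prompt[4096];
    snprintf(prompt, sizeof prompt, "%s\nThink step by step:", question);
    char *steps = llm_generate(prompt);

    char final_prompt[8192];
    snprintf(final_prompt, sizeof final_prompt,
             "%s\nReasoning:\n%s\nFinal answer:", question, steps);
    char *answer = llm_generate(final_prompt);
    free(steps);
    return answer;
}

int main(void) {
    char *a = answer_directly("What is 17 x 24?");
    char *b = answer_with_thinking("What is 17 x 24?");
    printf("standard: %s\nthinking: %s\n", a, b);
    free(a);
    free(b);
    return 0;
}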

Some more personal thoughts on costs....

  1. Reasoning Tax: Reasoning models don't just 'type faster.' The impact of test-time compute on the total cost of ownership can be staggering.
    We need to be clear that with models like Sonnet 4.5 or GPT-5.X, you aren't just paying for the text you see. You are also paying for 'hidden' chain of thought that the model generates internally to check its own logic before outputting a single character.
    A complex coding task might generate 500 lines of visible code, but it could burn through thousands of internal 'thought tokens' to get there. You are effectively paying for the model to 'think' in the background regardless of whether or not you got a desirable result.
    If you re-prompt, recheck your work with AI, and, say, run that 15-minute iteration loop mentioned earlier in the thread ten times, your token usage isn't linear; it grows far faster than the visible output, because each re-prompt re-sends the ever-growing context.
  2. Double-dipping LLMs: The scenario that throAU referenced earlier - double-dipping with a second LLM to verify the output of the first - unfortunately is NOT uncommon in my experience. throAU is quite right: this does happen, and you'll notice quite a few YouTubers actually do this in their vlogs. While the costs might be fine for a 'disposable tool' or 'proof of concept', if you scale this to an enterprise level where hundreds of devs are 'double-dipping' on every commit, the OPEX (operational expenditure) becomes eye-watering and very sobering when you are trying to explain to your CFO why you went dramatically over your IT allocation budget!
  3. Loss-leader fallacy: Current costs are not indicative of future costs. We are living in a very strange bubble where huge amounts of venture capital are massively subsidizing our token usage in the hope of a future payoff. The 'few cents' that it costs you today to run a script or call an API endpoint is likely significantly below the actual compute cost (energy + H100/H200 GPU time). When the race for market share ends - when analysts ask for proof of sustainable revenue to justify inflated P/E ratios and these companies need to show actual profit margins - those API costs will rise. Building a workflow that relies on 'cheap intelligence' now is a very risky bet against future price hikes.

We can leave a deliberation on legal, copyright and data security / governance for another day or I'll write a longwinded rambling essay and really nobody wants that! :D
 
From today's Guardian:
"Craig Guildford says he gave incorrect evidence to MPs and mistake arose from ‘use of Microsoft Copilot’
The chief of West Midlands police has apologised to MPs for giving them incorrect evidence about the decision to ban Maccabi Tel Aviv football fans, saying it had been produced by artificial intelligence (AI).
Craig Guildford told the home affairs select committee on Monday that the inclusion of a fictitious match between Maccabi Tel Aviv and West Ham in police intelligence “arose as a result of a use of Microsoft Copilot”.

Ars Technica: https://arstechnica.com/ai/2026/01/...-ai-hallucination-when-banning-football-fans/

We have quite the learning curve ahead of us...
 
ChatGPT still unsafe for psychologically unstable individuals. At this point even ELIZA might be better.


While it's good to hear that they finally want to pay for using Wikipedia to train their LLMs, it still sounds a bit like: "Bankrobbers make deal with bank after having emptied the vault."

 
I'll never understand how a human could confide in and bond with a machine in such a way; it must be a generational thing.
 

Supposedly, Joseph Weizenbaum created ELIZA to prove that human communication with a computer is not possible.
Then his secretary (and apparently a lot of other people) started having small talk with it. That was in the late 60s.


I tried the ELIZA implementation in Emacs (M-x doctor) once. After three sentences or so I ran into a grammar issue and stopped.
 
I'll never understand how a human could confide in and bond with a machine in such a way; it must be a generational thing.
The people experiencing issues have run the full gamut of generations. The primary issue is that a lot of these LLM chatbots provide a purely reinforcing and affirming voice that sounds human enough to be anthropomorphized. For the wrong person, that combination can be extremely dangerous, regardless of age.
 