The Hidden Gap Between AI’s Advice and Its Choices

When AI Knows What “Good” Looks Like but Still Chooses Badly

By Paul DiMaggio | May 5, 2026 | 5 min read

AI systems are getting scarily good at talking like experts. They can explain what makes a security vulnerability dangerous, a stock attractive, or a medical decision sound. But there’s a quieter, more important question: can they actually act on that knowledge when the decision is complex and the stakes are high?

In a recent experiment, security researcher Marcus Hutchins built an AI‑assisted system to find serious software flaws. The punchline isn’t that AI can now “press god mode” on hacking. It’s that the model could eloquently describe what a dangerous bug looks like - but struggled to consistently pick the most dangerous one when it had to choose.

That gap - between describing good decisions and reliably making them - is the real limitation we need to talk about.


The security story, without the jargon

Hutchins’ setup was simple to describe:

  • Feed lots of software components into an AI.
  • Ask it to flag which ones look dangerous and explain why.
  • Then ask: “If you were an attacker, which one should you go after first?”

The first steps worked impressively well. The model produced detailed write‑ups, identified real vulnerabilities, and gave reasonable‑looking scores for how exploitable they might be.

The failure showed up at the decision point. When forced to pick a “best” target from a pile of candidates, the AI often chose flashy‑sounding bugs over the one that gave almost total control. It could describe why that particular bug was terrifying, but when it came time to rank everything, it did not behave like someone who truly believes their own explanation.

Hutchins had to step in and effectively teach the system how to think about the choice: which capabilities matter more than others, how different properties combine, and why some boring‑sounding issues are actually the most powerful. Only after he encoded that reasoning into the framework around the model did it start picking the right target.
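
To make "encoding that reasoning into the framework" concrete, here is a minimal sketch in Python. It is not Hutchins’ actual system: the capability names, the weights, the authentication discount, and the sample findings are all hypothetical, chosen only to show how an explicit rule can outrank a flashy-sounding bug.

```python
from dataclasses import dataclass

# Hypothetical weights: how much control each capability gives an attacker.
# These names and numbers are illustrative assumptions, not real scoring rules.
CAPABILITY_WEIGHTS = {
    "remote_code_execution": 10.0,
    "privilege_escalation": 6.0,
    "information_disclosure": 3.0,
    "denial_of_service": 1.0,
}


@dataclass(frozen=True)
class Finding:
    name: str
    capabilities: frozenset
    needs_authentication: bool
    sounds_exciting: float  # 0..1, how dramatic the write-up reads (never scored)


def score(finding: Finding) -> float:
    """Score a finding by what it lets an attacker do, not by how it reads."""
    base = sum(CAPABILITY_WEIGHTS.get(c, 0.0) for c in finding.capabilities)
    # Properties combine: a bug reachable without credentials keeps its full
    # value, while one that needs a login is discounted (assumed factor).
    reach = 0.5 if finding.needs_authentication else 1.0
    return base * reach


def pick_target(findings):
    """Return the highest-scoring finding under the explicit rules."""
    return max(findings, key=score)


FINDINGS = [
    Finding("dramatic crash in image parser",
            frozenset({"denial_of_service", "information_disclosure"}),
            needs_authentication=False, sounds_exciting=0.9),
    Finding("boring-sounding config endpoint",
            frozenset({"remote_code_execution", "privilege_escalation"}),
            needs_authentication=False, sounds_exciting=0.2),
    Finding("authenticated log leak",
            frozenset({"information_disclosure"}),
            needs_authentication=True, sounds_exciting=0.5),
]

if __name__ == "__main__":
    print(pick_target(FINDINGS).name)
```

The point of the design is that `sounds_exciting` is recorded but never enters the score, so the choice depends only on rules a human wrote down and can audit.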

The lesson: the weakness wasn’t in the model’s vocabulary; it was in its ability to apply that vocabulary consistently in a messy trade‑off.


This isn’t just about hacking

The same pattern appears any time we ask AI to make hard choices in the real world.

Picking the “best” stock

Ask a model what makes a good long‑term investment and it will give you a solid list: valuation, cash flow, risk, diversification, time horizon.

Ask it to write up Company A vs. Company B and you might get something that looks a lot like a research note.

But now say: “Here are 30 plausible stocks. I’m putting most of my savings into one of them. Which should I choose?”

You’ve moved from explanation to decision. The model now has to weigh noisy, conflicting signals, reason about risk, and deal with uncertainty. In practice, LLMs tend to fall back on shallow patterns - well‑known names, exciting narratives, language that correlates with “strong buy” - rather than the disciplined, probabilistic reasoning that good investors use.

Again: no problem describing what a good stock looks like; real trouble reliably acting on that description when the choice is nontrivial.

Complex decisions in healthcare

In medicine, large language models can summarize guidelines, list possible diagnoses, and outline treatment options in clear, professional prose.

However, studies show that when cases are complex - multiple conditions, conflicting risks, unusual presentations - models often miss edge cases, overlook interactions, or sound overconfident where a good clinician would be cautious.

Doctors don’t just know “the right answer” from a textbook. They know when to bend the rules, when to say “this is weird,” and when to get a second opinion. LLMs are much weaker at that kind of flexible, situational judgment. They can perform well on standard cases but falter on exactly the messy scenarios where human expertise matters most.

Strategic and risk decisions in business

Ask an AI “What should we consider before entering a new market?” and you’ll get a respectable checklist: regulation, competition, supply chains, culture, politics.

Ask it to pick between three concrete expansion plans, each with different risks, timing, and dependencies, and you’re back in Hutchins’ world: the model can talk about the right factors but may not reliably weigh them. It doesn’t really know how your company handles risk, how much pain you can tolerate, or which tail risks you absolutely cannot afford.

It can produce convincing narratives; that doesn’t mean it’s making sound bets.


Why this keeps happening

Under the hood, today’s LLMs are pattern machines. They’re extraordinarily good at imitating how experts talk about decisions because they’ve ingested so much expert‑like text.

But robust decisions require more than talk:

  • A stable way to weigh conflicting criteria.
  • The ability to recognize when a case is unusual.
  • A sense of uncertainty and the discipline to say “I don’t know.”

Current models don’t do these things reliably. They can simultaneously:

  • Produce an excellent explanation of what “good” looks like in a domain.
  • Fail to behave in line with that explanation when asked to choose between real options.

That’s exactly what Hutchins saw with vulnerabilities, and it is exactly what we should expect in finance, healthcare, and strategic planning if we hand over high‑stakes decisions too quickly.
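
One way to make this gap measurable is to compare what a model picks against the scores it assigned itself. The sketch below assumes you have already collected, for each trial, the model’s stated scores and its final choice; the function names and the data shape are hypothetical, not part of any existing tool.

```python
def choice_matches_own_scores(stated_scores, chosen, tolerance=0.0):
    """True if the chosen option is within `tolerance` of the top stated score."""
    if chosen not in stated_scores:
        raise KeyError(f"chosen option {chosen!r} was never scored")
    best = max(stated_scores.values())
    return stated_scores[chosen] >= best - tolerance


def inconsistency_rate(trials):
    """Fraction of trials where the pick contradicts the model's own scores.

    Each trial is a (stated_scores, chosen) pair, where stated_scores maps
    option name -> score the model gave when asked to explain, and chosen is
    the option it picked when asked to decide.
    """
    trials = list(trials)
    if not trials:
        raise ValueError("need at least one trial")
    misses = sum(
        1 for scores, chosen in trials
        if not choice_matches_own_scores(scores, chosen)
    )
    return misses / len(trials)


# Illustrative, made-up trials: in the second one the model explains that
# option "a" is stronger, then picks "b" anyway.
EXAMPLE_TRIALS = [
    ({"a": 9.0, "b": 4.0}, "a"),
    ({"a": 9.0, "b": 4.0}, "b"),
]

if __name__ == "__main__":
    print(inconsistency_rate(EXAMPLE_TRIALS))
```

A rate well above zero suggests the explanation and the choice are not coming from the same reasoning, which is the pattern described above. It does not say which of the two is wrong.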


So how should we actually use AI?

The lesson isn’t “don’t use AI.” It’s “use it in the right layer of the stack.”

The most effective pattern looks like this:

  • Humans define what matters and how trade‑offs should be made.
  • We build frameworks and guardrails that encode that reasoning: scoring rules, constraints, escalation paths.
  • We let the AI do what it’s great at: exploring options, summarizing evidence, spotting patterns, generating scenarios.

In other words, treat AI as a powerful analysis and storytelling engine inside a human‑designed decision system - not as the decision‑maker itself.
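
As a rough sketch of that layering, the Python below keeps the weights, hard constraints, and escalation rule in human-written code, and treats the per-criterion estimates as the only part an AI might supply. Every name and number here (the `Option` and `Decision` types, the criteria, the `min_margin` threshold, the three plans) is a hypothetical illustration, not a recommended configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Option:
    name: str
    criteria: dict  # criterion -> estimate in 0..1, e.g. drawn from AI analysis


@dataclass(frozen=True)
class Decision:
    chosen: Optional[str]
    escalate: bool
    reason: str


def weighted_score(option, weights):
    """Combine estimates using weights that humans agreed on in advance."""
    return sum(w * option.criteria.get(k, 0.0) for k, w in weights.items())


def decide(options, weights, constraints, min_margin=0.1):
    """Pick an option, or escalate to a person when the rules can't settle it."""
    # Hard constraints come first: an option that breaks one is never chosen.
    allowed = [o for o in options if all(rule(o) for rule in constraints)]
    if not allowed:
        return Decision(None, True, "no option satisfies the hard constraints")
    ranked = sorted(allowed, key=lambda o: weighted_score(o, weights), reverse=True)
    if len(ranked) > 1:
        margin = weighted_score(ranked[0], weights) - weighted_score(ranked[1], weights)
        if margin < min_margin:
            return Decision(None, True, "top options are too close to call")
    return Decision(ranked[0].name, False, "clear winner under the agreed weights")


# Hypothetical expansion plans with made-up estimates.
WEIGHTS = {"expected_return": 0.5, "regulatory_safety": 0.3, "speed": 0.2}
CONSTRAINTS = [lambda o: o.criteria.get("tail_risk", 1.0) <= 0.2]
PLANS = [
    Option("plan_a", {"expected_return": 0.9, "regulatory_safety": 0.4,
                      "speed": 0.8, "tail_risk": 0.6}),
    Option("plan_b", {"expected_return": 0.6, "regulatory_safety": 0.8,
                      "speed": 0.5, "tail_risk": 0.1}),
    Option("plan_c", {"expected_return": 0.5, "regulatory_safety": 0.6,
                      "speed": 0.4, "tail_risk": 0.1}),
]

if __name__ == "__main__":
    print(decide(PLANS, WEIGHTS, CONSTRAINTS))
```

In this made-up example the most exciting plan, `plan_a`, is excluded by the tail-risk constraint before any scoring happens. A missing `tail_risk` estimate is treated as worst case, so an option is never allowed through by omission.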

Hutchins’ experiment is a concrete reminder: if you give an LLM a pile of complex options and ask it to pick “the best,” you might get something that sounds expert. That doesn’t mean it is.