When Not to Trust AI Advice: 5 Decisions You Should Never Outsource to a Chatbot

Here’s a number that should stop you mid-scroll: one out of every 277 biomedical papers published in early 2026 contains at least one reference that doesn’t exist. Not a misquoted finding. Not an outdated citation. An entirely fabricated source, conjured by AI, now sitting in the peer-reviewed scientific record. That fact alone should make you pause — and it raises an uncomfortable question about when not to trust AI advice.

The research, from Columbia University and published in The Lancet, audited 2.5 million PubMed-indexed papers and found 4,046 fabricated references across 2,810 papers. The fabrication rate went from roughly four fake citations per 10,000 papers in 2023 to 57 per 10,000 by early 2026, a roughly fourteenfold jump that coincided with mass adoption of AI writing tools. And here’s the quieter, more troubling detail: at the time of the audit, 98.4% of those papers had received no retraction, no correction, no publisher action whatsoever. We’ve explored this gap before, in a story about what happens when AI gives the wrong answer and a real expert is the only fix.
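For readers who want to check the arithmetic behind those figures, here is a minimal Python sketch. It uses only the numbers quoted above. The function names are illustrative and not taken from the study, and the 2023 rate is the rounded "roughly four" from the text, so the growth factor is approximate.

```python
def growth_factor(rate_before: float, rate_after: float) -> float:
    """How many times larger the later rate is than the earlier one."""
    return rate_after / rate_before


def per_10k(one_in_n: int) -> float:
    """Convert a "one in N" figure into a count per 10,000."""
    return 10_000 / one_in_n


# Figures quoted in the text above (the 2023 rate is rounded there).
fabricated_refs = 4_046
affected_papers = 2_810
no_action_share = 0.984

print(round(growth_factor(4, 57), 2))               # 14.25x increase in the citation rate
print(round(fabricated_refs / affected_papers, 2))  # 1.44 fake references per affected paper
print(round(affected_papers * no_action_share))     # 2765 papers with no publisher action
print(round(per_10k(277), 1))                       # 36.1 affected papers per 10,000
```

One caution when reading the output: the last line counts affected papers, while the 57-per-10,000 figure counts fake citations, so the two are not directly comparable.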

That’s the gap this post is about. Not whether AI is useful — it is. But whether we’ve internalized how often it’s confidently wrong, and what that means for the decisions where being wrong has real consequences.

The raw numbers nobody’s putting on a billboard

Let’s get the data on the table.

In a 2025 peer-reviewed study in Communications Medicine, large language models hallucinated in 64.1% of long-form clinical case summaries when no mitigation prompts were used. Even the best-performing model, GPT-4o, fabricated clinical details 53% of the time without structured prompting. With careful mitigation, researchers brought the overall rate down to 43.1%, which is better, but still means more than four in ten summaries contained invented medical information.
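It helps to separate two ways of describing that improvement: the drop in percentage points and the drop relative to where the rate started. The short sketch below works both out from the two rates quoted above. The function names are hypothetical and chosen only for illustration.

```python
def absolute_drop(before_pct: float, after_pct: float) -> float:
    """Drop in percentage points between two rates."""
    return before_pct - after_pct


def relative_drop(before_pct: float, after_pct: float) -> float:
    """Drop as a percentage of the starting rate."""
    return (before_pct - after_pct) / before_pct * 100


unmitigated = 64.1  # hallucination rate without mitigation prompts, from the text
mitigated = 43.1    # rate with careful mitigation, from the text

print(round(absolute_drop(unmitigated, mitigated), 1))  # 21.0 percentage points
print(round(relative_drop(unmitigated, mitigated), 1))  # 32.8 percent fewer than before
```

In other words, mitigation removed about a third of the hallucinated summaries, and left the other two-thirds in place.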

The legal world is running parallel. Stanford’s HAI and RegLab found that purpose-built legal AI tools — not general-purpose chatbots, but professional-grade research systems — hallucinate on 17–34% of challenging queries. Westlaw’s AI-Assisted Research came in at roughly 34%; Lexis+ AI at 17%. For context, the lawyers using these tools aren’t hobbyists. They’re professionals who paid for something better than ChatGPT.

The consequences are arriving in real courtrooms. In just the first quarter of 2026, U.S. courts imposed more than $145,000 in AI-hallucination sanctions. A single case in Oregon’s District Court resulted in a roughly $110,000 penalty after attorneys filed 23 fabricated citations across three federal filings. In Ohio, a federal judge called the conduct the most egregious Rule 11 violations he had ever observed and referred two attorneys for disciplinary action.

Researcher Damien Charlotin, who maintains a global database of AI hallucination court cases, has now catalogued more than 1,350 incidents, and recently recorded ten cases from ten different courts on a single day.

The ECRI warning that should change how you search

In January 2026, ECRI, one of the most respected voices in healthcare safety, named misuse of AI chatbots the #1 health technology hazard for the year ahead. Their concern isn’t hypothetical. More than 40 million people use ChatGPT daily for health-related questions, and the medical guidance they receive comes without any regulatory oversight or verification mechanism.

This isn’t about AI being malicious. It’s about a structural mismatch: the technology is designed to generate plausible-sounding text, not to verify facts. When you ask it a medical, legal, or financial question, it produces an answer that looks right — sometimes remarkably right — but it has no mechanism for knowing whether it is right. We’ve written about this exact architecture problem before, in our piece on why AI masquerading as human is a line we can’t cross.

The fabricated references in biomedical literature tell the same story from a different angle. As Columbia’s research team put it: “A medical professional or clinical guideline developer has no way of knowing that the evidence they are relying on does not exist.” And as they also noted, “The contamination of over 4,000 fabricated references does not go away when the AI gets better.” Those fake citations are now permanently embedded in the scientific record, waiting to be cited by future papers, future guidelines, future clinical decisions.

The 5 decisions you should never outsource to a chatbot

This isn’t a call to abandon AI. It’s a framework for knowing when to pause and find a real human. Here are five categories where the hallucination data says the risk-reward calculation doesn’t add up.

1. Medical decisions — especially anything involving symptoms or treatment

When AI hallucinates on 64% of clinical case summaries without mitigation, and still invents details in roughly 43% of them even with careful prompting, there’s no version of “let me ask ChatGPT about this lump” that qualifies as informed decision-making. The ECRI warning exists precisely because millions of people are doing exactly that. A real doctor or nurse practitioner brings something no current model can: accountability for what they tell you, and a license that depends on getting it right.

2. Legal decisions — filing, contracts, compliance

Stanford’s data shows that even paid legal AI tools are wrong one-sixth to one-third of the time on complex queries. The $145,000 in Q1 2026 sanctions is just what courts are catching. According to Charlotin’s database, 35+ state bar associations have now issued AI verification guidance. If the people who passed the bar exam can’t safely rely on these tools, neither can you.

3. Financial decisions — tax strategy, investment structure, debt planning

Financial advice shares the same structural problem: plausible-sounding answers with no verification mechanism. A confident wrong answer about a retirement withdrawal or business deduction can compound into real financial damage, and unlike a human advisor, the AI carries no liability for the outcome.

4. Immigration and government filing decisions

Immigration law carries the same procedural-accuracy stakes as civil litigation, where the hallucination data is best documented. AI invents citations, misstates procedures, and fabricates precedents. In immigration contexts, a single incorrect filing can mean months of delay, denial, or worse. These systems are complex and unforgiving. An AI that hallucinates up to 34% of the time on challenging legal queries is not a reliable guide through them.

5. Mental health and significant life decisions

This is where the “hallucination rate” framing breaks down, because the harm isn’t about factual errors. It’s about what’s absent. AI chatbots can reflect back what you tell them, but they cannot perceive tone, hesitation, or what you’re not saying. They cannot exercise clinical judgment or recognize when someone needs crisis intervention. As we’ve written before, AI therapists fall short precisely because they lack the perceptual and relational depth of a real conversation. A real conversation with a trained human operates on an entirely different plane of perception and responsibility.

A framework for when not to trust AI advice

The pattern across all five categories is the same: the moment a decision carries significant, irreversible consequences, the AI output stops being a solution and becomes, at best, a starting point for research you verify yourself.

This doesn’t mean you need to book an expensive consultation for every question. It means recognizing that some domains demand a human in the loop — someone who can be asked follow-up questions, who carries real accountability, and who can say “I don’t know, but here’s how we can find out.”

That principle applies far beyond the high-stakes professional categories above. It’s the same reason a live video call beats a tutorial video when your WiFi setup isn’t matching the instructions on screen, or why a struggling student needs a tutor who can spot the exact moment their understanding wavers. Having someone who can see your screen, hear your question, and take responsibility for the answer isn’t a luxury — it’s the foundation of real help.

For moments like that, whether it’s a tech problem you can’t troubleshoot alone, a tutoring session where a student needs more than an answer key, or a skill you need another human to walk you through, having someone on a live video call changes the equation. You’re not accepting a plausible answer. You’re getting a real one, from someone who can respond to what’s actually happening. Of course, you can also be that expert human in the loop, using the skills you have to help others make informed decisions.

The AI hallucination numbers are going to improve. Structured prompting already cuts the clinical rate from 64% to 43%. Better models will bring the legal hallucination rate down from 34%. But the deeper issue — that these systems are designed for plausibility, not truth — isn’t a bug waiting for a patch. It’s the architecture. And for certain decisions, that architecture will never be enough.