Chatbot Doctors Without Regulators: The Accountability Gap in Global Health AI

In a cinder-block clinic in one of Rwanda's rural districts, a community health worker unlocks her phone, opens a chat window, and types a question that, two years ago, she would have been forced to answer alone. A child has a fever that has not broken in three days. The nearest doctor is hours away by road, and the road, in April, is mostly mud. She describes the symptoms in Kinyarwanda, then in English, then in the awkward hybrid that her training has taught her the machine prefers. A few seconds later, the model replies. It is confident. It suggests a differential diagnosis, a likely cause, a set of next steps. The worker reads it twice. Then she makes a decision.

Multiply that scene by thousands. Multiply it again by the 101 community health workers who, in a study published in Nature Health on 6 February 2026, submitted 5,609 real clinical questions across four Rwandan districts to five different large language models. Multiply it by the 58 physicians in Pakistan who, in a parallel randomised controlled trial published in the same issue, were handed GPT-4o and twenty hours of training in how to argue with it, and whose diagnostic reasoning scores then jumped from 43 per cent using conventional resources to 71 per cent with the chatbot in the loop. By the researchers' own account, the large language models did not merely match the local clinicians. They beat them. Across every metric the team measured, the models won.

This is the story that spread through the health-technology press in February like a minor religious revelation. Cheap AI chatbots, the headlines said, are transforming medical diagnosis in places where the alternative is often no diagnosis at all. It was presented as a vindication. Years of hand-wringing about bias, hallucination, and the hype cycle, and finally here was evidence: in the clinics the world forgot, in the districts where a stethoscope is a luxury and a paediatrician is a fable, the chatbot is helping. Not perfectly. But helping. And helping, the argument went, is the only honest baseline when the competing product is nothing.

It is a persuasive story. It is also, if you stop and turn it over in your hand, a deeply uncomfortable one. Because four days after those Rwanda and Pakistan findings appeared, the University of Oxford published a different study in Nature Medicine, led by Andrew Bean, a doctoral researcher at the Oxford Internet Institute, that looked at what happens when the same class of model is handed to nearly 1,300 lay users and asked to help with the same basic task: figuring out what might be wrong and deciding where to go for care. In controlled benchmark tests, the chatbots identified relevant medical conditions around 94.9 per cent of the time and made the right call on disposition (whether a patient should stay home, see a GP, or go to A&E) in roughly 56.3 per cent of cases. Then the researchers let actual humans use the tools. The accuracy collapsed. Participants using an LLM identified at least one relevant condition in at most 34.5 per cent of cases, worse than the 47.0 per cent achieved by a control group left to its own devices with search engines and intuition. Only around 43 per cent of users made the correct disposition decision after consulting the model.

In the Oxford study, the bot offered one person with a suspected migraine the sensible advice to lie down in a dark room. Another person describing the same scenario was told to head immediately to an emergency department. Same condition. Same model. Different words, different outcomes, different versions of reality. Rebecca Payne, a GP and clinical senior lecturer at Bangor University who served as the study's clinical lead, told the British Medical Association's magazine The Doctor that the results were, in a word, disturbing. Bean, the lead author, described a two-way communication breakdown: people did not know what to tell the model, and the model did not know what to ask.

So here is the shape of the problem. Put in the hands of a trained community health worker in rural Rwanda, or a doctor in Karachi with twenty hours of prompting practice under her belt, a general-purpose AI chatbot apparently provides a genuine, measurable uplift. Put in the hands of an unsupervised patient in Oxford, or Bristol, or Manchester, the same class of tool causes users to perform worse than they would have with a search engine. These are not contradictory findings. They are consistent findings. They are telling us that the value of an AI diagnostic tool depends almost entirely on the sophistication of the person holding it, the quality of the supervision around it, and the alternatives it is being compared against. And they are telling us that the populations with the least access to trained clinicians are the ones most likely to end up relying on these tools without any of those supports in place.

The Baseline Problem

The hardest thing to argue with, in the case for chatbot medicine in low-resource settings, is the counterfactual. What is the alternative? In Rwanda, the density of physicians is roughly one doctor per ten thousand people, and for obstetricians and paediatricians the figures are an order of magnitude worse. Community health workers, often women with a few months of formal training, handle the first, second, and sometimes only point of contact between a sick person and the idea of medicine. In Pakistan, the health-workforce picture is uneven in a different way: urban specialists cluster in the big private hospitals, while vast rural districts operate with a skeleton of overworked generalists. If you are the parent of a feverish child in either country, the chain of escalation is short and the brakes are few. The question of whether a chatbot's advice is good enough is a luxury question, one that presumes you had a choice in the first place.

Set against that reality, the Rwanda findings are striking. The models evaluated (Gemini-2, GPT-4o, o3-mini, DeepSeek R1, and Meditron-70B) were scored across eleven metrics by expert reviewers against the kinds of questions community health workers actually ask. Gemini-2 and GPT-4o both averaged above 4.48 out of 5. All five models significantly outperformed the local clinicians against whom they were compared. That is not a throwaway result. It is a claim, peer-reviewed and published in one of the most scrutinised venues in medical science, that the best frontier models are now more useful than some of the humans they might one day replace, at least for the narrow slice of tasks they were measured on.

And yet. The phrase “at least for the narrow slice of tasks they were measured on” is where the whole argument starts to creak. Diagnostic reasoning in a benchmarked question-and-answer format is not the same thing as diagnostic reasoning in a room with a crying toddler, a frightened mother, a thermometer that may or may not be reliable, and a supply chain that may or may not have the drug the chatbot recommends. The Pakistan study, to its credit, was a randomised controlled trial with real clinicians handling real-looking cases, and it built in twenty hours of training on how to use the AI safely and critically. The physicians who used GPT-4o did better than those who did not, by a wide margin. But a secondary analysis noted that doctors still outperformed the model in 31 per cent of cases, typically those involving contextual “red flags”, the kinds of signs that only a human who has seen a thousand patients knows to take seriously. That residual 31 per cent is not a rounding error. It is the catalogue of cases where the chatbot is wrong and the doctor is right.

The uncomfortable question is what happens when you strip the twenty hours of training, the verified clinical context, the peer-review loop, and the research supervision, and you are left with the chatbot and the patient. The Oxford study is, in effect, a simulation of that stripped-down reality. It suggests that in the absence of the supports the Rwanda and Pakistan trials provided, the same tools degrade from diagnostic ally to confident misinformant. And it suggests that the degradation is worst precisely at the moment of highest stakes: deciding whether something is an emergency.

Who Pays for the Errors

Every health technology has a theory of accountability. When a drug fails, the regulator is supposed to catch it, the manufacturer is supposed to pay for the harm, the doctor is supposed to have exercised judgment in prescribing it, and the patient is supposed to be protected. The arrangement is imperfect, but it is at least legible. You can point at who is meant to carry the burden of an error.

AI diagnosis in under-resourced clinics does not yet have a theory of accountability. It has, at best, a set of competing rhetorical gestures. The model developer gestures toward the disclaimer in the terms of service that says the output is not medical advice. The clinic manager, if there is a clinic manager, gestures toward the fact that the health worker made the final call. The funder, often an NGO or a philanthropic arm of a wealthy-world foundation, gestures toward the pilot nature of the project and the counterfactual of no care at all. The regulator, in many of the countries where these tools are being deployed, is either absent, under-resourced, or, in the most honest assessment, unable to audit models whose weights live on servers in another hemisphere. The patient, in whose body the error is ultimately expressed, is left carrying a risk she did not choose and cannot price.

Compare this with the theory of accountability that wealthy-world health systems have evolved for their own medical AI deployments. The US Food and Drug Administration maintains a list of AI/ML-enabled medical devices that have been through some form of regulatory clearance. The European Union's AI Act, whose obligations phase in through 2025 and 2026, classifies clinical decision support tools as high-risk systems subject to post-market monitoring, human-oversight requirements, and documentation obligations. The UK's Medicines and Healthcare products Regulatory Agency has spent years building a Software and AI as a Medical Device programme. These regimes are not perfect, and a general-purpose chatbot like ChatGPT or Gemini is not licensed as a medical device anywhere: the whole point of a general-purpose model is that it evades that classification. But there is at least a framework, and an expectation that someone in a suit will eventually be called to account if things go badly wrong.

In the rural districts of Rwanda or the secondary hospitals of Sindh, there is no equivalent framework. There is nothing meaningful in place to tell a community health worker whether the model she is consulting was last updated yesterday or last year, whether it was fine-tuned on data relevant to her patient population, whether the version she is typing her questions into has been quietly deprecated by the provider, whether the sycophancy tuning that makes it so pleasant to argue with is also making it less likely to push back when she is about to make a mistake. The World Health Organization's January 2024 guidance on large multi-modal models in health, updated in March 2025, runs to more than forty recommendations, many of them sensible. But guidance is not regulation, and the WHO has neither the authority nor the enforcement mechanism to hold a model provider in California accountable for an outcome in a clinic in Nyagatare.

This asymmetry is what the language of “digital colonialism” is trying, sometimes clumsily, to name. The underlying critique was developed by the scholars Nick Couldry and Ulises Mejias, whose 2019 book on data colonialism described an extractive dynamic in which data, users, and risk flow from the global South while capital, intellectual property, and control remain in the global North; the vocabulary has since spread through global-health and governance discourse. At a UN briefing in 2024, the Senegalese AI expert Seydina Moussa Ndiaye warned that the continent risks a new form of colonisation by foreign companies that feed on African data without involving local actors in governance. You do not have to accept the full vocabulary of the critique to notice that something in the structure is badly off. When the tool is built in one place, deployed in another, regulated in neither, and breaks in a third, the burden of the break falls by default on whoever is physically closest to it. That, in almost every case, is the patient.

The Pharmaceutical Shadow

There is a particular history that hovers over this conversation, and pretending it does not is a form of intellectual cowardice. From the 1980s onwards, pharmaceutical companies based in the global North began conducting an increasing share of their clinical trials in low- and middle-income countries, often citing faster recruitment, lower costs, and less demanding regulatory environments as advantages. Some of those trials were conducted with genuine scientific rigour and produced treatments that benefited the populations who participated. Others did not.

The case that sits most heavily in the medical-ethics literature is Pfizer's 1996 trial of the experimental antibiotic trovafloxacin, marketed as Trovan, during a meningococcal meningitis outbreak in Kano, Nigeria. Pfizer enrolled roughly 200 children: 100 received Trovan, 100 received the existing standard of care, ceftriaxone. Eleven of the children died. Others were left with paralysis, deafness, liver failure. A secret Nigerian government report later concluded that Pfizer had conducted an illegal trial of an unregistered drug, and that crucial elements of informed consent and ethical oversight were either missing or falsified. The hospital's medical director stated that the letter granting ethical approval was a fabrication and that no ethics committee existed at the institution at the time. In 2009, after years of litigation, Pfizer agreed to a settlement of around 75 million US dollars with the Kano state government. The case is still taught in medical-ethics seminars as a textbook illustration of what happens when the protections meant to govern research on human subjects exist only as paperwork.

The analogy between Trovan and the current deployment of general-purpose AI in under-resourced clinics is imperfect. The Rwanda and Pakistan studies did not run experimental treatments on vulnerable populations without consent; they tested whether these tools might be useful to frontline workers, with expert review, peer publication, and clinician consent built into the protocols. The builders of the foundation models, meanwhile, are not pharmaceutical companies pushing a specific drug at a specific dose; they are providing a general-purpose tool whose medical use is an emergent application rather than a designed one. To equate the two cases directly would be lazy.

But the structural parallel is harder to dismiss. Both cases involve a technology developed with the global North in mind and deployed at scale in the global South while still being validated, in countries whose regulatory architecture is not equipped to audit it, among populations whose bodies become the site of validation and who have neither the information nor the institutional power to negotiate the terms. Both rely on a counterfactual argument: without the intervention, people would die. Both raise the same uncomfortable question about whose risk it is to take.

The Rwanda and Pakistan researchers would, I think, be the first to insist that their work is not a Trovan analogue. They are right to insist on it. But the global deployment of foundation models for diagnostic support is not, in practice, constrained to peer-reviewed research programmes. For every carefully designed Nature Health study, there are an unknown number of informal deployments: an NGO that bolts GPT into a WhatsApp triage line, a start-up that licenses a fine-tuned model to a chain of rural clinics, a district health authority that quietly rolls out a chatbot to its community health worker cadre because the phones were already there and the subscription was cheap. The published studies are the visible tip. The iceberg underneath is what ought to worry us.

The Reddit Evidence

Some of the best real-time reporting on the edges of this iceberg is happening not in medical journals but on Reddit. Subreddits like r/medicine and r/AskDocs, which verify credentials for physician posters, have become an accidental sentinel network for AI harms: places where doctors and patients alike surface the cases in which a chatbot has given advice that turned out to be dangerous, missed a red flag, or confabulated a reassuring explanation for a symptom that should have sent someone to hospital. The evidence on Reddit is anecdotal and unsystematic by nature. It is also, because the posters are often trained clinicians describing what they are seeing in their own practices, unusually valuable.

A 2025 study in a health informatics journal examined endometriosis questions posted to r/AskDocs, comparing answers from verified physicians with answers generated by ChatGPT. On measures like clarity, empathy, and the selection of “most pertinent” response, the chatbot beat the humans in the majority of cases. At the same time, a non-negligible proportion of the chatbot answers were flagged by expert reviewers as potentially dangerous. Other research has found that AI systems under-triaged emergency cases in more than half of tested scenarios, in one example failing to direct a patient with symptoms consistent with diabetic ketoacidosis and impending respiratory failure to the emergency department. Moderators of the medical subreddits have also documented the ingenuity with which users circumvent the safety rails of consumer chatbots: tricks involving framing medical images as part of a film script, or asking for a “hypothetical” differential diagnosis, or loading the prompt with enough fictive cover that the model forgets it is supposed to decline.

What the Reddit corpus captures, in a way that peer-reviewed studies struggle to, is the texture of chatbot medicine as it is actually practised by the unsupervised end user. It is the register of the late-night query, the frightened self-diagnoser, the patient who has been dismissed by one too many GPs and is now turning to an AI because the AI, unlike the receptionist, will listen for as long as it takes. It is also the register in which the Oxford findings become legible: the two-way communication breakdown, the wild swings in advice depending on how a symptom is described, the mix of good and bad information that the user has no way to separate. If the Nature Health studies are the controlled experiment, Reddit is the uncontrolled one. The uncontrolled one has millions of participants, no consent process, and no investigator taking notes.

One of the eeriest findings in the Reddit corpus is how readily the chatbots adapt to whatever framing the user provides. Ask about migraine symptoms in the confident voice of someone who wants reassurance, and you will be told to lie down in a dark room. Ask in the anxious voice of someone who has been Googling brain tumours for an hour, and you may be told to head for the emergency department. Neither answer is exactly wrong. Both answers depend on information about the user, not the disease. The model is treating the conversation as a social exchange in which its job is to match the emotional register of the person on the other side. In a clinic, that might be called bedside manner. Coming from an unsupervised chatbot to a user with no training in clinical reasoning, it is called something considerably worse.

The Wealthy World's Alibi

The argument that frames AI diagnosis in the global South as an advance because it beats the baseline of nothing is true. It is also, I would argue, incomplete in a way that flatters the people doing the deploying. The counterfactual of “no care at all” does a lot of moral work in this debate. It reframes what would otherwise be understood as under-validated technology aimed at a vulnerable population into a charitable intervention. It converts the question “is this good enough?” into the different, easier question “is this better than nothing?”. It allows developers, funders, and policymakers in high-income countries to feel that they are doing something constructive without having to confront the deeper fact that the shortage of human clinicians in Rwanda and Pakistan is not a natural disaster. It is the result of a global labour market that has for decades drained trained doctors and nurses from low-income countries into the hospitals of Europe, North America, and the Gulf states. It is the result of public-health underfunding, of structural adjustment programmes, of brain drain actively subsidised by the recruitment pipelines of richer countries. The absence of a doctor in that Rwandan clinic is not an act of God. It is an act of policy, and much of that policy was written in capitals that also happen to host the major AI labs now offering the chatbot as a solution.

None of this is an argument against the Rwanda and Pakistan deployments as such. The community health workers who participated in those studies are not better off because a Western commentator is worried about their position in a global labour market. They are better off, if the data is to be believed, because the chatbot helped them give better answers to patients who needed answers. That is a real good, and refusing to count it because it is entangled with a larger injustice is its own kind of bad faith. But the existence of the real good does not cancel the larger injustice. It coexists with it. The wealthy world gets to sell itself a story in which it is closing the gap in global health through the deployment of frontier AI, while quietly continuing to benefit from the structural forces that made the gap what it is.

That asymmetry is what a new form of medical inequality looks like. It is not the crude inequality of having care versus not having care. It is the subtler inequality of having care that is under-regulated, under-validated, and structured so that the costs of its failures flow in one direction and the benefits of its successes flow in another. It is care delivered by a system whose architects and whose accountable parties live in a different jurisdiction from the people whose bodies supply the test data. It is the same logic that structured the pharmaceutical trials of the 1990s, updated for a world in which the drug is software and the side effects are bad advice.

Holding the Contradiction

None of the serious people in this story are villains. The researchers who ran the Rwanda and Pakistan studies believe, with good reason, that AI tools can extend basic diagnostic capacity to populations systematically underserved for generations. They are probably right. The Oxford team is not arguing that chatbots should be banned from clinical use; they are arguing that benchmark tests without humans in the loop underestimate the failure modes that actually matter. They are probably right too. The WHO's 2024 and 2025 guidance on large multi-modal models tries to hold the genuine promise and the genuine risk in the same frame. It is also, like most WHO guidance, advisory rather than binding.

Both things are real at once. It is real that in a rural clinic where the counterfactual is silence, a chatbot giving useful advice 80 per cent of the time is a revolution. It is also real that an unvalidated chatbot deployed at scale across populations who lack the institutional power to audit it or seek redress creates a risk with no historical precedent and no settled framework of accountability. The Rwandan community health worker who consults a model to help diagnose a feverish child is, on the evidence, improving her care. The same model, used the same way by a frightened patient in Birmingham the next morning, leads to worse decisions than she would have made with a search engine. These are not two stories. They are one story, viewed from two angles.

In January 2024, when the WHO published its first major guidance on large multi-modal models in health, it urged governments and technology companies to ensure that the deployment of these tools did not widen existing health inequities. Two years on, the Nature Health and Nature Medicine studies together are giving us a map of what that widening might actually look like. It does not look like withholding the technology from the poor. It looks, instead, like deploying the technology to the poor under one set of conditions and to the rich under another, and allowing the differences between those conditions to do the work of quiet structural harm. The rich get the chatbot plus the regulator. The poor get the chatbot plus a hope that someone, somewhere, is watching the aggregate outcomes carefully enough to notice if something is going wrong.

Back in the Rwandan clinic, the community health worker puts down her phone. The child is still feverish, but she has a plan now. Whether the plan is the right one depends on a chain of assumptions she cannot directly verify: that the model she consulted was the model she thought she was consulting, that the fine-tuning was appropriate for her context, that the training data did not carry some invisible bias against children who look like the one on her lap, that the confidence in the model's reply reflects an actual epistemic state rather than the trained conversational habit of a system that has learned to sound sure. She does not know any of that. She is not meant to know it. Somewhere, in principle, there is meant to be a grown-up who knows it on her behalf.

Who, in this system, is that grown-up? Who is meant to be watching, with authority, with enforcement powers, with the mandate to pull the plug when the signal goes bad? The developer in Menlo Park? The regulator in Kigali? The ministry in Islamabad? The WHO in Geneva? The researchers who ran the Nature Health studies and who have already gone on to the next project? The philanthropic funder who paid for the initial pilot and whose annual report, next year, will list it as a success? Each of these actors can give a coherent account of what they are doing and why. None of them can give a coherent account of who is holding the whole thing together.

That is the shape the new medical inequality takes. Not the old, blunt kind where the poor get nothing and the rich get everything, though there is still plenty of that. A different kind, more modern, more subtle, and in some ways more dangerous for being so easy to mistake for progress. The poor get the tool, and the rich get the framework within which the tool is allowed to exist. The poor carry the risk of the errors. The rich carry the intellectual property and the option, should they need it, of pulling the plug. Whether this counts as an advance depends, in the end, on whether you believe a bad system with a good heart is closer to the right answer than a slow system with a functioning memory of what it is for.

So here is the question, sharpened. If the answer in Rwanda is that the chatbot helps, and the answer in Oxford is that the chatbot harms, and the answer in both places is that almost nobody in a position of authority can tell you with any precision who is responsible if it goes wrong, then what, exactly, have we built? A bridge, or a gap with a very convincing surface?

References

  1. Simms, C. (2026, February 6). Cheap AI chatbots transform medical diagnoses in places with limited care. Nature. https://www.nature.com/articles/d41586-026-00345-x
  2. Large language models for frontline healthcare support in low-resource settings. (2026). Nature Health, 1(2). https://www.nature.com/articles/s44360-025-00038-1
  3. University of Oxford. (2026, February 10). New study warns of risks in AI chatbots giving medical advice. https://www.ox.ac.uk/news/2026-02-10-new-study-warns-risks-ai-chatbots-giving-medical-advice
  4. Bean, A., et al. (2026). Clinical knowledge in LLMs does not translate to human interactions. Nature Medicine.
  5. The Doctor (British Medical Association). Bot-ched advice: disturbing results in AI study. https://thedoctor.bma.org.uk/articles/health-society/bot-ched-advice-disturbing-results-in-ai-study/
  6. VentureBeat. Just add humans: Oxford medical study underscores the missing link in chatbot testing. https://venturebeat.com/ai/just-add-humans-oxford-medical-study-underscores-the-missing-link-in-chatbot-testing
  7. World Health Organization. (2024, January 18). WHO releases AI ethics and governance guidance for large multi-modal models. https://www.who.int/news/item/18-01-2024-who-releases-ai-ethics-and-governance-guidance-for-large-multi-modal-models
  8. World Health Organization. (2024). Ethics and governance of artificial intelligence for health: Guidance on large multi-modal models. https://www.who.int/publications/i/item/9789240084759
  9. Abdullahi v. Pfizer, Inc. Wikipedia. https://en.wikipedia.org/wiki/Abdullahi_v._Pfizer,_Inc.
  10. BMJ / PMC. Pfizer accused of testing new drug without ethical approval. https://pmc.ncbi.nlm.nih.gov/articles/PMC1119465/
  11. BMJ / PMC. Secret report surfaces showing that Pfizer was at fault in Nigerian drug tests. https://pmc.ncbi.nlm.nih.gov/articles/PMC1471980/
  12. Brookings. What do Pfizer's 1996 drug trials in Nigeria teach us about vaccine hesitancy? https://www.brookings.edu/articles/what-do-pfizers-1996-drug-trials-in-nigeria-teach-us-about-vaccine-hesitancy/
  13. Couldry, N., & Mejias, U. A. (2019). The costs of connection: How data is colonizing human life and appropriating it for capitalism. Stanford University Press.
  14. UN News. (2024, January). AI expert warns of digital colonisation in Africa. https://news.un.org/en/story/2024/01/1144342
  15. Tech Policy Press. Lessons from Nigeria and Kenya on digital colonialism in AI health messaging. https://www.techpolicy.press/lessons-from-nigeria-and-kenya-on-digital-colonialism-in-ai-health-messaging/
  16. PMC. Colonialism in the new digital health agenda. https://pmc.ncbi.nlm.nih.gov/articles/PMC10900325/
  17. Comparing ChatGPT and physicians' answers to endometriosis questions on Reddit: a blind expert evaluation. International Journal of Medical Informatics. https://www.sciencedirect.com/science/article/pii/S1386505625002515
  18. MIT Technology Review. (2025, July 21). AI companies have stopped warning you that their chatbots aren't doctors. https://www.technologyreview.com/2025/07/21/1120522/ai-companies-have-stopped-warning-you-that-their-chatbots-arent-doctors/
  19. NPR. (2026, March 11). ChatGPT is not always reliable on medical advice, new research suggests. https://www.npr.org/2026/03/11/nx-s1-5744035/chatgpt-might-give-you-bad-medical-advice-studies-warn
  20. Nteasee: understanding needs in AI for health in Africa. (2024). arXiv. https://arxiv.org/html/2409.12197v4

Tim Green

UK-based Systems Theorist & Independent Technology Writer

Tim explores the intersections of artificial intelligence, decentralised cognition, and posthuman ethics. His work, published at smarterarticles.co.uk, challenges dominant narratives of technological progress while proposing interdisciplinary frameworks for collective intelligence and digital stewardship.

His writing has been featured on Ground News and shared by independent researchers across both academic and technological communities.

ORCID: 0009-0002-0156-9795
Email: tim@smarterarticles.co.uk
