Brilliant on Paper, Blind in Practice: Why AI Systems Fail Us

The promotional materials are breathtaking. Artificial intelligence systems that can analyse medical scans with superhuman precision, autonomous vehicles that navigate complex urban environments, and vision-language models that understand images with the fluency of a seasoned art critic. The benchmark scores are equally impressive: 94% accuracy here, state-of-the-art performance there, human-level capabilities across dozens of standardised tests.
Then reality intrudes. A robotaxi in San Francisco fails to recognise a pedestrian trapped beneath its chassis and drags her twenty feet before stopping. An image recognition system confidently labels photographs of Black individuals as gorillas. A frontier AI model, asked to count the triangles in a simple geometric image, produces answers that would embarrass a primary school student. These are not edge cases or adversarial attacks designed to break the system. They represent the routine failure modes of technologies marketed as transformative advances in machine intelligence.
The disconnect between marketed performance and actual user experience has become one of the defining tensions of the artificial intelligence era. It raises uncomfortable questions about how we measure machine intelligence, what incentives shape the development and promotion of AI systems, and whether the public has been sold a vision of technological capability that fundamentally misrepresents what these systems can and cannot do. Understanding this gap requires examining the architecture of how AI competence is assessed, the economics that drive development priorities, and the cognitive science of what these systems actually understand about the world they purport to perceive.
The Benchmark Mirage
To understand why AI systems that excel on standardised tests can fail so spectacularly in practice, one must first examine how performance is measured. The Stanford AI Index Report 2025 documented a striking phenomenon: many benchmarks that researchers use to evaluate AI capabilities have become “saturated,” meaning systems score so high that the tests are no longer useful for distinguishing between models. This saturation has occurred across domains including general knowledge, reasoning about images, mathematics, and coding. The Visual Question Answering Challenge, for instance, now sees top-performing models achieving 84.3% accuracy, while the human baseline sits at approximately 80%.
The problem runs deeper than simple test exhaustion. Research conducted by MIT's Computer Science and Artificial Intelligence Laboratory revealed that “traditionally, object recognition datasets have been skewed towards less-complex images, a practice that has led to an inflation in model performance metrics, not truly reflective of a model's robustness or its ability to tackle complex visual tasks.” The researchers developed a new metric, “minimum viewing time,” which quantifies how difficult an image is to recognise by measuring how long a person needs to view it before making a correct identification. When MIT researchers developed ObjectNet, a dataset of images collected from real-life settings rather than curated repositories, they found substantial performance gaps between laboratory conditions and authentic deployment scenarios.
This discrepancy reflects a phenomenon that economists have studied for decades: Goodhart's Law, which states that when a measure becomes a target, it ceases to be a good measure. A detailed 68-page analysis from researchers at Cohere, Stanford, MIT, and the Allen Institute for AI documented systematic distortions in how companies approach AI evaluation. The researchers found that major technology firms including Meta, OpenAI, Google, and Amazon were able to “privately pit many model versions in the Arena and then only publish the best results.” This practice creates a misleading picture of consistent high performance rather than the variable and context-dependent capabilities that characterise real AI systems.
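To see why selective publication distorts the picture, consider a toy simulation, invented purely for illustration rather than drawn from the study above; the `TRUE_SKILL`, `NOISE`, and `one_eval` names are assumptions. Even when every private variant has the same underlying capability, publishing only the best of many noisy evaluation runs systematically inflates the reported score.

```python
import random

# Toy illustration (not from the cited study): a lab privately evaluates many
# variants of the same model and publishes only the best score.
random.seed(0)

TRUE_SKILL = 0.70            # hypothetical "true" accuracy of the model family
NOISE = 0.03                 # assumed run-to-run variation on the benchmark
N_PRIVATE_VARIANTS = 20      # variants tested privately before publication

def one_eval() -> float:
    """One benchmark run: true skill plus evaluation noise (an assumption)."""
    return TRUE_SKILL + random.gauss(0, NOISE)

single_public_run = one_eval()
best_of_private_runs = max(one_eval() for _ in range(N_PRIVATE_VARIANTS))

print(f"single public run:       {single_public_run:.3f}")
print(f"best of {N_PRIVATE_VARIANTS} private runs:  {best_of_private_runs:.3f}")
# The maximum of many noisy draws sits well above the true skill, even though
# no individual variant is genuinely better than any other.
```

The same selection effect operates whenever leaderboard submissions can be retried privately and filtered before anything is published.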
The problem of data contamination compounds these issues. When testing GPT-4 on benchmark problems from Codeforces in 2023, researchers found the model could regularly solve problems classified as easy, provided they had been added before September 2021. For problems added later, GPT-4 could not solve a single question correctly. The implication is stark: the model had memorised questions and answers from its training data rather than developing genuine problem-solving capabilities. As one research team observed, the “AI industry has turned benchmarks into targets, and now those benchmarks are failing us.”
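The contamination check itself is straightforward to express. The sketch below is a schematic reconstruction, not the researchers' code; the problem records, the `CUTOFF` date, and the `solved_ids` set are illustrative assumptions. The idea is simply to split benchmark items by whether they predate the model's training cutoff and compare solve rates.

```python
from datetime import date

# Sketch (details invented) of a contamination check: split benchmark problems
# by whether they appeared before or after the model's training cutoff, then
# compare solve rates. A sharp drop after the cutoff points to memorisation
# rather than genuine problem-solving ability.

CUTOFF = date(2021, 9, 1)   # GPT-4's reported training-data cutoff at the time

def solve_rate(problems, solved_ids):
    if not problems:
        return 0.0
    return sum(p["id"] in solved_ids for p in problems) / len(problems)

def contamination_report(problems, solved_ids):
    before = [p for p in problems if p["added"] < CUTOFF]
    after  = [p for p in problems if p["added"] >= CUTOFF]
    return solve_rate(before, solved_ids), solve_rate(after, solved_ids)

# Toy data for illustration only.
problems = [
    {"id": "A", "added": date(2020, 5, 1)},
    {"id": "B", "added": date(2021, 3, 1)},
    {"id": "C", "added": date(2022, 1, 1)},
    {"id": "D", "added": date(2023, 6, 1)},
]
print(contamination_report(problems, solved_ids={"A", "B"}))   # (1.0, 0.0)
```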
The consequence of this gaming dynamic extends beyond misleading metrics. It shapes the entire trajectory of AI development, directing research effort toward whatever narrow capabilities will boost leaderboard positions rather than toward the robust, generalisable intelligence that practical applications require.
Counting Failures and Compositional Collapse
Perhaps nothing illustrates the gap between benchmark performance and real-world competence more clearly than the simple task of counting objects in an image. Research published in 2025 introduced VLMCountBench, a benchmark testing vision-language models on counting tasks using only basic geometric shapes such as triangles and circles. The findings were revealing: while these sophisticated AI systems could count reliably when only one shape type was present, they exhibited substantial failures when multiple shape types were combined. This phenomenon, termed “compositional counting failure,” suggests that these systems lack the discrete object representations that make counting trivial for humans.
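A minimal sketch of how such a probe can be constructed, assuming nothing about the real VLMCountBench implementation: generate synthetic scenes of labelled shapes, ask a model to count one shape type, and compare accuracy on single-type versus mixed-type scenes. The `ask_model` function is a hypothetical stand-in for a real vision-language model call, wired to mimic the reported failure pattern so the harness runs end to end.

```python
import random

def make_scene(types, n, rng):
    """A 'scene' is just a list of shape labels standing in for a rendered image."""
    return [rng.choice(types) for _ in range(n)]

def ground_truth(scene, target):
    return sum(1 for s in scene if s == target)

def ask_model(scene, target, rng):
    """Toy stand-in for a real VLM query (hypothetical): counts correctly when
    only one shape type is present, but drifts when types are mixed, mimicking
    the compositional counting failure described above."""
    true = ground_truth(scene, target)
    if len(set(scene)) == 1:
        return true
    return max(0, true + rng.choice([-2, -1, 1, 2]))

def accuracy(scenes, target, rng):
    return sum(ask_model(s, target, rng) == ground_truth(s, target)
               for s in scenes) / len(scenes)

rng = random.Random(0)
single = [make_scene(["triangle"], rng.randint(3, 9), rng) for _ in range(200)]
mixed  = [make_scene(["triangle", "circle", "square"], rng.randint(3, 9), rng)
          for _ in range(200)]

print(f"single-type accuracy: {accuracy(single, 'triangle', rng):.0%}")
print(f"mixed-type accuracy:  {accuracy(mixed, 'triangle', rng):.0%}")
```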
This limitation has significant implications for practical applications. A study using Bongard problems, visual puzzles that test pattern recognition and abstraction, found that humans achieved an 84% success rate on average, while the best-performing vision-language model, GPT-4o, managed only 17%. The researchers noted that “even elementary concepts that may seem trivial to humans, such as simple spirals, pose significant challenges” for these systems. They observed that “most models misinterpreted or failed to count correctly, suggesting challenges in AI's visual counting capabilities.”
Text-to-image generation systems demonstrate similar limitations. Research on the T2ICountBench benchmark revealed that “all state-of-the-art diffusion models fail to generate the correct number of objects, with accuracy dropping significantly as the number of objects increases.” When asked to generate an image of ten oranges, these systems frequently produce either substantially more or fewer items than requested. The failure is not occasional or marginal but systematic and predictable. As one research paper noted, “depicting a specific number of objects in the image with text conditioning often fails to capture the exact quantity of details.”
These counting failures point to a more fundamental issue in how current AI architectures process visual information. Unlike human cognition, which appears to involve discrete object representations and symbolic reasoning about quantities, large vision-language models operate on statistical patterns learned from training data. They can recognise that images containing many objects of a certain type tend to have particular visual characteristics, but they lack what researchers call robust “world models” that would allow them to track individual objects and their properties reliably.
The practical implications extend far beyond academic curiosity. Consider an AI system deployed to monitor inventory in a warehouse, assess damage after a natural disaster, or count cells in a medical sample. Systematic failures in numerical accuracy would render such applications unreliable at best and dangerous at worst.
The Architectural Divide
The question of whether these failures represent fundamental limitations of current AI architectures or merely training deficiencies remains actively debated. Gary Marcus, professor emeritus of psychology and neural science at New York University and author of the 2024 book “Taming Silicon Valley: How We Can Ensure That AI Works for Us,” has argued consistently that neural networks face inherent constraints in tasks requiring abstraction and symbolic reasoning.
Marcus has pointed to a problem he first demonstrated in 1998: neural networks trained on even numbers could generalise to some new even numbers, but when tested on odd numbers, they would systematically fail. He concluded that “these tools are good at interpolating functions, but not very good at extrapolating functions.” This distinction between interpolation within known patterns and extrapolation to genuinely novel situations lies at the heart of the benchmark-reality gap.
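Marcus's even-number experiment is easy to reproduce in spirit. The sketch below is a toy re-creation based only on the description above, not his original setup; the architecture, sizes, and hyperparameters are arbitrary assumptions. A small network is trained to copy 8-bit binary numbers but sees only even numbers, so the lowest bit is 0 in every training example and the network never learns to reproduce it.

```python
import numpy as np

# Toy re-creation (from the description only, not Marcus's original code):
# train a small network to reproduce 8-bit binary numbers, showing it only
# even numbers, then test on odd ones.

rng = np.random.default_rng(0)
N_BITS = 8

def to_bits(n):
    return np.array([(n >> i) & 1 for i in range(N_BITS)], dtype=float)

X = np.stack([to_bits(n) for n in range(0, 256, 2)])   # even numbers only
Y = X.copy()                                           # identity task: copy the input

W1 = rng.normal(0.0, 0.3, (N_BITS, 32)); b1 = np.zeros(32)
W2 = rng.normal(0.0, 0.3, (32, N_BITS)); b2 = np.zeros(N_BITS)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):                                  # plain full-batch gradient descent
    H = np.tanh(X @ W1 + b1)
    P = sigmoid(H @ W2 + b2)
    G = (P - Y) / len(X)                               # cross-entropy gradient at the outputs
    dW2, db2 = H.T @ G, G.sum(0)
    dH = (G @ W2.T) * (1.0 - H ** 2)
    dW1, db1 = X.T @ dH, dH.sum(0)
    W1 -= dW1; b1 -= db1; W2 -= dW2; b2 -= db2         # learning rate 1.0

def predict(n):
    h = np.tanh(to_bits(n) @ W1 + b1)
    bits = (sigmoid(h @ W2 + b2) > 0.5).astype(int)
    return int(sum(int(b) << i for i, b in enumerate(bits)))

for n in (3, 7, 101):              # odd numbers, never seen during training
    print(n, "->", predict(n))     # the lowest bit comes back 0, so every output is even
```

Because the low-bit output unit has only ever seen a target of 0, odd test inputs are pulled back into the even range the network knows: interpolation within the training distribution rather than extrapolation beyond it.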
Marcus characterises current large language models as systems that “work at the extensional level, but they don't work at the intentional level. They are not getting the abstract meaning of anything.” The chess-playing failures of models like ChatGPT, which Marcus has documented attempting illegal moves such as having a Queen jump over a knight, illustrate how systems can “approximate the game of chess, but can't play it reliably because it never induces a proper world model of the board and the rules.” He has emphasised that these systems “still fail at abstraction, at reasoning, at keeping track of properties of individuals. I first wrote about hallucinations in 2001.”
Research on transformer architectures, the technical foundation underlying most modern AI systems, has identified specific limitations in spatial reasoning. A 2024 paper titled “On Limitations of the Transformer Architecture” described a “fundamental incompatibility with the Transformer architecture for certain problems, suggesting that some issues should not be expected to be solvable in practice indefinitely.” The researchers documented that “when prompts involve spatial information, transformer-based systems appear to have problems with composition.” Even in simple cases that require composing pieces of information, state-of-the-art models return incorrect answers.
The limitations extend to visual processing as well. Research has found that “ViT learns long-range dependencies via self-attention between image patches to understand global context, but the patch-based positional encoding mechanism may miss relevant local spatial information and usually cannot attain the performance of CNNs on small-scale datasets.” This architectural limitation has been highlighted particularly in radiology applications where critical findings are often minute and contained within small spatial locations.
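A few lines of array manipulation make the patch-based framing concrete. This is a simplified sketch of standard ViT-style patchification (the linear projection and position embeddings are omitted, and the 224-pixel image with 16-pixel patches is just the common configuration, used here as an assumption), showing how little of any one token a small finding occupies.

```python
import numpy as np

# Simplified ViT-style patchification: a 224x224 image becomes a 14x14 grid of
# 16x16 patches, each flattened into one token. A finding a few pixels across
# occupies a tiny fraction of a single token, one intuition for why small,
# localised detail can be harder to preserve than in convolutional feature maps.

image = np.random.rand(224, 224, 3)        # random noise standing in for an input image
P = 16                                     # patch size in pixels

H, W, C = image.shape
patches = image.reshape(H // P, P, W // P, P, C).swapaxes(1, 2)   # (14, 14, 16, 16, 3)
tokens = patches.reshape((H // P) * (W // P), P * P * C)

print(tokens.shape)                        # (196, 768): 196 tokens of 768 values each
print(f"a 5x5-pixel finding covers {5*5 / (P*P):.1%} of one patch token")
```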
Melanie Mitchell, professor at the Santa Fe Institute whose research focuses on conceptual abstraction and analogy-making in artificial intelligence, has offered a complementary perspective. Her recent work includes a 2025 paper titled “Do AI models perform human-like abstract reasoning across modalities?” which examines whether these systems engage in genuine reasoning or sophisticated pattern matching. Mitchell has argued that “there's a lot of evidence that LLMs aren't reasoning abstractly or robustly, and often over-rely on memorised patterns in their training data, leading to errors on 'out of distribution' problems.”
Mitchell identifies a crucial gap in current AI systems: the absence of “rich internal models of the world.” As she notes, “a tenet of modern cognitive science is that humans are not simply conditioned-reflex machines; instead, we have inside our heads abstracted models of the physical and social worlds that reflect the causes of events rather than merely correlations among them.” Current AI systems, despite their impressive performance on narrow benchmarks, appear to lack this causal understanding.
An alternative view holds that these limitations may be primarily a consequence of training data rather than architectural constraints. Some researchers hypothesise that “the limited spatial reasoning ability of current VLMs is not due to a fundamental limitation of their architecture, but rather is a limitation in common datasets available at scale on which such models are trained.” This perspective suggests that co-training multimodal models on synthetic spatial data could potentially address current weaknesses. Additionally, researchers note that “VLMs' limited spatial reasoning capability may be due to the lack of 3D spatial knowledge in training data.”
When Failures Cause Harm
The gap between benchmark performance and real-world capability becomes consequential when AI systems are deployed in high-stakes domains. The case of autonomous vehicles provides particularly sobering examples. According to data compiled by researchers at Craft Law Firm, between 2021 and 2024, there were 3,979 incidents involving autonomous vehicles in the United States, resulting in 496 reported injuries and 83 fatalities. The Stanford AI Index Report 2025 noted that the AI Incidents Database recorded 233 incidents in 2024, a 56.4% increase compared to 2023, marking a record high.
In May 2025, Waymo recalled over 1,200 robotaxis following disclosure of a software flaw that made vehicles prone to colliding with certain stationary objects, specifically “thin or suspended barriers like chains, gates, and even utility poles.” These objects, which human drivers would navigate around without difficulty, apparently fell outside the patterns the perception system had learned to recognise. Investigation revealed failures in the system's ability to properly classify and respond to stationary objects under certain lighting and weather conditions. As of April 2024, Tesla's Autopilot system had been involved in at least 13 fatal crashes according to NHTSA data, with Tesla's Full Self-Driving system facing fresh regulatory scrutiny in January 2025.
The 2018 Uber fatal accident in Tempe, Arizona, illustrated similar limitations. The vehicle's sensors detected a pedestrian, but the AI system failed to classify her as a human, and the safety driver, distracted by a mobile device, did not intervene in time. As researchers have noted, “these incidents reveal a fundamental problem with current AI systems: they excel at pattern recognition in controlled environments but struggle with edge cases that human drivers handle instinctively.” The misclassification exposed a critical weakness in object recognition, particularly in low-light conditions and complex environments.
A particularly disturbing incident involved General Motors' Cruise robotaxi in San Francisco, where the vehicle struck a pedestrian who had been thrown into its path by another vehicle, then dragged her twenty feet before stopping. The car's AI systems failed to recognise that a human being was trapped underneath the vehicle. When the system detected an “obstacle,” it continued to move, causing additional severe injuries.
These cases highlight how AI systems that perform admirably on standardised perception benchmarks can fail catastrophically when encountering situations not well-represented in their training data. The gap between laboratory performance and deployment reality is not merely academic; it translates directly into physical harm.
The Gorilla Problem That Never Went Away
One of the most persistent examples of AI visual recognition failure dates to 2015, when a Black software developer tweeted that Google Photos had labelled images of him and a friend as “gorillas.” The incident exposed how image recognition algorithms trained on biased data can produce racist outputs. Google's response was revealing: rather than solving the underlying technical problem, the company blocked the words “gorilla,” “chimpanzee,” “monkey,” and related terms from the system entirely.
Nearly a decade later, that temporary fix remains in place. Because those labels are blocked, the service cannot return results for searches such as “gorilla,” “chimp,” “chimpanzee,” or “monkey.” Despite enormous advances in AI technology since 2015, Google Photos still refuses to label images of gorillas, a tacit acknowledgement that the fundamental problem has not been solved, only circumvented. The workaround creates a peculiar situation in which one of the world's most advanced image recognition systems cannot identify one of the most recognisable animals on Earth. As one analysis noted, “Apple learned from Google's mistake and simply copied their fix.”
The underlying issue extends beyond a single company's product. Research has consistently documented that commercial facial recognition and analysis systems perform far worse on darker-skinned individuals, particularly women. In one widely cited audit, three commercial systems from Microsoft, IBM, and Megvii misclassified darker-skinned women nearly 35% of the time while achieving near-perfect accuracy, above 99%, on lighter-skinned men.
These biases have real consequences. Cases such as Ousmane Bah, a teenager wrongly accused of theft at an Apple Store because of faulty face recognition, and Amara K. Majeed, wrongly accused of participating in the 2019 Sri Lanka bombings after her face was misidentified, demonstrate how AI failures disproportionately harm marginalised communities. The technology industry's approach of deploying these systems despite known limitations and then addressing failures reactively raises serious questions about accountability and the distribution of risk.
The Marketing Reality Gap
The discrepancy between how AI capabilities are marketed and how they perform in practice reflects a broader tension in the technology industry. A global study led by Professor Nicole Gillespie at Melbourne Business School, surveying over 48,000 people across 47 countries between November 2024 and January 2025, found that although 66% of respondents already use AI with some regularity, fewer than half (46%) are willing to trust it. Notably, this represents a decline in trust compared with surveys conducted before ChatGPT's release in 2022: people have become less trusting and more worried about AI as adoption has increased.
Consumer distrust is growing: 63% of consumers globally do not trust AI with their data, up from 44% in 2024. In the United Kingdom, the situation is even starker, with 76% of shoppers feeling uneasy about AI handling their information. Research from the Nuremberg Institute for Market Decisions showed that only 21% of respondents trust AI companies and their promises, and only 20% trust AI itself. These findings reveal “a notable gap between general awareness of AI in marketing and a deeper understanding or trust in its application.”
Emily Bender, professor of linguistics at the University of Washington and one of the authors of the influential 2021 “stochastic parrots” paper, has been a prominent voice challenging AI hype. Bender was recognised in TIME Magazine's first 100 Most Influential People in Artificial Intelligence and is co-author, with Alex Hanna, of the 2025 book “The AI Con: How to Fight Big Tech's Hype and Create the Future We Want.” She has argued that “so much of what we read about language technology and other things that get called AI makes the technology sound magical. It makes it sound like it can do these impossible things, and that makes it that much easier for someone to sell a system that is supposedly objective but really just reproduces systems of oppression.”
The practical implications of this marketing-reality gap are significant. A McKinsey global survey in early 2024 found that 65% of respondents said their organisations use generative AI in some capacity, nearly double the share from ten months prior. However, despite widespread experimentation, “comprehensive integration of generative AI into core business operations remains limited.” A 2024 Deloitte study noted that “organisational change only happens so fast” despite rapid AI advances, meaning many companies are deliberately testing in limited areas before scaling up.
The gap is particularly striking in mental health applications. Despite claims that AI is replacing therapists, only 21% of the 41% of adults who sought mental health support in the past six months turned to AI, representing only 9% of the total population. The disconnect between hype and actual behaviour underscores how marketing narratives can diverge sharply from lived reality.
Hallucinations and Multimodal Failures
The problem of AI systems generating plausible but incorrect outputs, commonly termed “hallucinations,” extends beyond text into visual domains. Research published in 2024 documented that multimodal large language models “often generate outputs that are inconsistent with the visual content, a challenge known as hallucination, which poses substantial obstacles to their practical deployment and raises concerns regarding their reliability in real-world applications.”
Object hallucination represents a particularly problematic failure mode, occurring when models identify objects that do not exist in an image. Researchers have developed increasingly sophisticated benchmarks to evaluate these failures. ChartHal, a benchmark featuring a taxonomy of hallucination scenarios in chart understanding, demonstrated that “state-of-the-art LVLMs suffer from severe hallucinations” when interpreting visual data.
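Most object-hallucination benchmarks boil down to a comparison between what a model mentions and what annotators say is present. The function below is a simplified, CHAIR-style rate, offered as an assumption about the general recipe rather than the exact scoring used by ChartHal or the other benchmarks named here; the toy annotations are invented.

```python
# Simplified object-hallucination score in the spirit of CHAIR-style metrics
# (not the exact formulation of any benchmark named above): how often do
# objects mentioned in a model's description fail to appear in the image's
# ground-truth annotations?

def hallucination_rate(examples: list[dict]) -> float:
    """examples: [{"mentioned": set_of_objects, "ground_truth": set_of_objects}, ...]"""
    mentioned = hallucinated = 0
    for ex in examples:
        for obj in ex["mentioned"]:
            mentioned += 1
            if obj not in ex["ground_truth"]:
                hallucinated += 1
    return hallucinated / mentioned if mentioned else 0.0

# Toy annotations for illustration only.
examples = [
    {"mentioned": {"dog", "frisbee", "bench"}, "ground_truth": {"dog", "frisbee"}},
    {"mentioned": {"car", "person"},           "ground_truth": {"car", "person", "tree"}},
]
print(f"object hallucination rate: {hallucination_rate(examples):.0%}")  # 20%
```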
The VHTest benchmark introduced in 2024 comprises 1,200 diverse visual hallucination instances across eight modes. Medical imaging presents particular risks: the MediHall Score benchmark was developed specifically to assess hallucinations in medical contexts through a hierarchical scoring system. When AI systems hallucinate in clinical settings, the consequences can be life-threatening.
Mitigation efforts have shown some promise. One recent framework operating entirely with frozen, pretrained vision-language models and requiring no gradient updates “reduces hallucination rates by 9.8 percentage points compared to the baseline, while improving object existence accuracy by 4.7 points on adversarial splits.” Research by Yu et al. (2023) explored human error detection to mitigate hallucinations, successfully reducing them by 44.6% while maintaining competitive performance.
However, Gary Marcus has argued that there is “no principled solution to hallucinations in systems that traffic only in the statistics of language without explicit representation of facts and explicit tools to reason over those facts.” This perspective suggests that hallucinations are not bugs to be fixed but fundamental characteristics of current architectural approaches. He advocates for neurosymbolic AI, which would combine neural networks with symbolic reasoning, drawing an analogy to Daniel Kahneman's System 1 and System 2 thinking.
The ARC Challenge and the Limits of Pattern Matching
Francois Chollet, the creator of Keras, an open-source deep learning library adopted by over 2.5 million developers, introduced the Abstraction and Reasoning Corpus (ARC) in 2019 as a benchmark designed to measure fluid intelligence rather than narrow task performance. ARC's public set comprises 800 puzzle-like tasks posed as grid-based visual reasoning problems. These tasks, trivial for humans but challenging for machines, typically provide only a small number of example input-output pairs, usually around three.
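Each public ARC task is distributed as a small JSON object of paired grids, which makes the structure easy to show. The toy task and candidate rule below are invented for illustration and are far simpler than real ARC problems; the point is that a solver must find a transformation that reproduces every training pair exactly and then apply it to the test input.

```python
# Structure of an ARC-style task: a few "train" input/output grid pairs
# (integers 0-9 encode colours) and one or more "test" inputs for which the
# solver must produce the output grid. This particular task is a toy example.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
        {"input": [[0, 0], [3, 3]], "output": [[3, 3], [0, 0]]},
    ],
    "test": [{"input": [[4, 0], [0, 4]]}],
}

def candidate_rule(grid):
    """A guessed transformation: flip the grid vertically (reverse the rows)."""
    return grid[::-1]

# A rule only counts if it reproduces every training pair exactly.
fits = all(candidate_rule(p["input"]) == p["output"] for p in task["train"])
if fits:
    print(candidate_rule(task["test"][0]["input"]))   # predicted test output
```

Real ARC solvers cannot hard-code a rule in this way; each task demands inferring a different, previously unseen transformation from only a handful of examples, which is precisely what current systems find difficult.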
What makes ARC distinctive is its focus on measuring the ability to “generalise from limited examples, interpret symbolic meaning, and flexibly apply rules in varying contexts.” Unlike benchmarks that can be saturated through extensive training on similar problems, ARC tests precisely the kind of novel reasoning that current AI systems struggle to perform. The benchmark “requires the test taker to deduce underlying rules through abstraction, inference, and prior knowledge rather than brute-force or extensive training.”
From its introduction in 2019 until late 2024, ARC remained essentially unsolved by AI systems, maintaining its reputation as one of the toughest benchmarks available for general intelligence. The ARC Prize competition, co-founded by Mike Knoop and Francois Chollet, saw 1,430 teams submit 17,789 entries in 2024. The state-of-the-art score on the ARC private evaluation set increased from 33% to 55.5% during the competition period, propelled by techniques including deep learning-guided program synthesis and test-time training. More than $125,000 in prizes were awarded across top papers and top scores.
While this represents meaningful progress, it remains far below human performance and the 85% threshold set for the $500,000 grand prize. The persistent difficulty of ARC highlights a crucial distinction: current AI systems excel at tasks that can be solved through pattern recognition and interpolation within training distributions but struggle with the kind of abstract reasoning that humans perform effortlessly.
Trust Erosion and the Normalisation of Failure
Research on human-AI interaction has documented asymmetric trust dynamics: building trust in AI takes more time compared to building trust in humans, but when AI encounters problems, trust loss occurs more rapidly. Studies have found that simpler tasks show greater degradation of trust following errors, suggesting that failures on tasks perceived as easy may be particularly damaging to user confidence.
This pattern reflects what researchers term “perfect automation schema,” the tendency for users to expect flawless performance from AI systems and interpret any deviation as evidence of fundamental inadequacy rather than normal performance variation. The marketing of AI as approaching or exceeding human capabilities may inadvertently amplify this effect by setting unrealistic expectations.
Findings on error timing are mixed: some studies report that initial errors damage trust development more than later ones, while others find that trust drops most after late mistakes. One explanation is that early mistakes allow people to adjust their expectations over time, whereas trust damaged at a later stage proves more difficult to repair. Research has found that “explanations that combine causal attribution (explaining why the error occurred) with boundary specification (identifying system limitations) prove most effective for competence-based trust repair.”
The normalisation of AI failures presents a concerning trajectory. If users come to expect that AI systems will periodically produce nonsensical or harmful outputs, they may either develop excessive caution that undermines legitimate use cases or, alternatively, become desensitised to failures in ways that increase risk. Neither outcome serves the goal of beneficial AI deployment.
Measuring Intelligence or Measuring Training
The fundamental question underlying these failures concerns what benchmarks actually measure. The dramatic improvement in AI performance on new benchmarks shortly after their introduction, documented by the Stanford AI Index, suggests that current systems are exceptionally effective at optimising for whatever metrics researchers define. In 2023, AI systems could solve just 4.4% of coding problems on SWE-bench. By 2024, this figure had jumped to 71.7%. Performance on MMMU and GPQA saw gains of 18.8 and 48.9 percentage points respectively.
This pattern of rapid benchmark saturation has led some researchers to question whether improvements reflect genuine capability gains or increasingly sophisticated ways of matching test distributions. The Stanford report noted that despite strong benchmark performance, “AI models excel at tasks like International Mathematical Olympiad problems but still struggle with complex reasoning benchmarks like PlanBench. They often fail to reliably solve logic tasks even when provably correct solutions exist.”
The narrowing performance gaps between models further complicate the picture. According to the AI Index, the Elo score difference between the top and tenth-ranked model on the Chatbot Arena Leaderboard was 11.9% in 2023. By early 2025, this gap had narrowed to just 5.4%. Similarly, the difference between the top two models shrank from 4.9% in 2023 to just 0.7% in 2024.
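To put shrinking gaps of this kind in perspective, the standard Elo formula converts a rating difference into an expected head-to-head win rate. The gaps below are illustrative round numbers, not figures from the Arena, and treating Arena scores as plain Elo ratings is itself a simplifying assumption.

```python
# Under the standard Elo model, the expected win rate for a rating gap d is
# 1 / (1 + 10 ** (-d / 400)). Small gaps correspond to near coin-flip matchups.

def expected_win_rate(elo_gap: float) -> float:
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400.0))

for gap in (200, 100, 20, 10):   # illustrative gaps, not Arena data
    print(f"Elo gap {gap:>3}: stronger model expected to win {expected_win_rate(gap):.1%}")
# Roughly: gap 200 -> 76%, gap 100 -> 64%, gap 20 -> 53%, gap 10 -> 51%
```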
The implications for AI development are significant. If benchmarks are increasingly unreliable guides to real-world performance, the incentive structure for AI research may be misaligned with the goal of building genuinely capable systems. Companies optimising for benchmark rankings may invest disproportionately in test-taking capabilities at the expense of robustness and reliability in deployment.
Francois Chollet has framed this concern explicitly, arguing that ARC-style tasks test “the ability to generalise from limited examples, interpret symbolic meaning, and flexibly apply rules in varying contexts” rather than the ability to recognise patterns encountered during training. The distinction matters profoundly for understanding what current AI systems can and cannot do.
Reshaping Expectations and Rebuilding Trust
Addressing the gap between marketed performance and actual capability will require changes at multiple levels. Researchers have begun developing dynamic benchmarks that are regularly updated to prevent data contamination. LiveBench, for example, is updated with new questions monthly, many from recently published sources, ensuring that performance cannot simply reflect memorisation of training data. This approach represents “a close cousin of the private benchmark” that keeps benchmarks fresh without worrying about contamination.
Greater transparency about the conditions under which AI systems perform well or poorly would help users develop appropriate expectations. OpenAI's documentation acknowledges that their models struggle with “tasks requiring precise spatial localisation, such as identifying chess positions” and “may generate incorrect descriptions or captions in certain scenarios.” Such candour, while not universal in the industry, represents a step toward more honest communication about system limitations.
The AI Incidents Database, maintained by the Partnership on AI, and the AIAAIC Repository provide systematic tracking of AI failures. The AIAAIC logged 375 entries for 2024, roughly ten times the 2016 total: incidents declined to 187 from the previous year, while issues surged to 188, the highest number yet recorded. Accuracy and reliability, along with safety, topped the list of incident categories. OpenAI, Tesla, Google, and Meta account for the highest number of AI-related incidents in the repository.
Academic researchers have proposed that evaluation frameworks should move beyond narrow task performance to assess broader capabilities including robustness to distribution shift, calibration of confidence, and graceful degradation when facing unfamiliar inputs. Melanie Mitchell has argued that “AI systems ace benchmarks yet stumble in the real world, and it's time to rethink how we probe intelligence in machines.”
Mitchell maintains that “just scaling up these same kinds of models will not solve these problems. Some new approach has to be created, as there are basic capabilities that current architectures and training methods aren't going to overcome.” She notes that current models “are not learning from their mistakes in any long-term sense. They can't carry learning from one session to another. They also have no 'episodic memory,' unlike humans who learn from experiences, mistakes, and successes.”
The gap between benchmark performance and real-world capability is not simply a technical problem awaiting a technical solution. It reflects deeper questions about how we define and measure intelligence, what incentives shape technology development, and how honest we are prepared to be about the limitations of systems we deploy in consequential domains. The answers to these questions will shape not only the trajectory of AI development but also the degree to which public trust in these technologies can be maintained or rebuilt.
For now, the most prudent stance may be one of calibrated scepticism: appreciating what AI systems can genuinely accomplish while remaining clear-eyed about what they cannot. The benchmark scores may be impressive, but the measure of a technology's value lies not in how it performs in controlled conditions but in how it serves us in the messy, unpredictable complexity of actual use.
References and Sources
- Stanford Human-Centered AI. (2025). “The 2025 AI Index Report: Technical Performance.” https://hai.stanford.edu/ai-index/2025-ai-index-report/technical-performance
- Stanford HAI. (2025). “AI Benchmarks Hit Saturation.” https://hai.stanford.edu/news/ai-benchmarks-hit-saturation
- MIT News. (2023). “Image recognition accuracy: An unseen challenge confounding today's AI.” https://news.mit.edu/2023/image-recognition-accuracy-minimum-viewing-time-metric-1215
- Collinear AI. (2024). “Gaming the System: Goodhart's Law Exemplified in AI Leaderboard Controversy.” https://blog.collinear.ai/p/gaming-the-system-goodharts-law-exemplified-in-ai-leaderboard-controversy
- Marcus, G. (2025). “Generative AI's crippling and widespread failure to induce robust models of the world.” https://garymarcus.substack.com/p/generative-ais-crippling-and-widespread
- Marcus, G. (2024). “Taming Silicon Valley: How We Can Ensure That AI Works for Us.” MIT Press.
- Mitchell, M. (2025). “AI's challenge of understanding the world.” Science. https://www.science.org/doi/10.1126/science.adm8175
- Mitchell, M. (2025). “The LLM Reasoning Debate Heats Up.” https://aiguide.substack.com/p/the-llm-reasoning-debate-heats-up
- AIAAIC Repository. (2025). “AI, algorithmic, and automation incidents.” https://www.aiaaic.org/aiaaic-repository
- AI Incident Database. (2025). Partnership on AI. https://incidentdatabase.ai/
- Craft Law Firm. (2024). “Data Analysis: Self-Driving Car Accidents [2019-2024].” https://www.craftlawfirm.com/autonomous-vehicle-accidents-2019-2024-crash-data/
- Responsible AI Labs. (2025). “AI Safety Incidents of 2024: Lessons from Real-World Failures.” https://responsibleailabs.ai/knowledge-hub/articles/ai-safety-incidents-2024
- AlgorithmWatch. (2020). “Google apologizes after its Vision AI produced racist results.” https://algorithmwatch.org/en/google-vision-racism/
- KPMG/University of Melbourne. (2025). “Trust, attitudes and use of artificial intelligence: A global study 2025.” https://kpmg.com/xx/en/our-insights/ai-and-technology/trust-attitudes-and-use-of-ai.html
- Nuremberg Institute for Market Decisions. (2024). “Consumer attitudes toward AI-generated marketing content.” https://www.nim.org/en/publications/detail/transparency-without-trust
- Bender, E. et al. (2021). “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” ACM Conference on Fairness, Accountability, and Transparency.
- Chollet, F. (2019). “On the Measure of Intelligence.” arXiv:1911.01547. https://arxiv.org/abs/1911.01547
- ARC Prize Foundation. (2024). “ARC Prize 2024: Technical Report.” https://arcprize.org/media/arc-prize-2024-technical-report.pdf
- The Debrief. (2024). “AI's Puzzle-Solving Limitations: Vision-Language Models Struggle with Human-Like Pattern Recognition.” https://thedebrief.org/29289-2-vision-language-models/
- arXiv. (2024). “On Limitations of the Transformer Architecture.” https://arxiv.org/html/2402.08164v1
- arXiv. (2024). “Hallucination of Multimodal Large Language Models: A Survey.” https://arxiv.org/abs/2404.18930
- arXiv. (2025). “Your Vision-Language Model Can't Even Count to 20.” https://arxiv.org/abs/2510.04401v1
- Frontiers in Psychology. (2024). “Developing trustworthy artificial intelligence: insights from research on interpersonal, human-automation, and human-AI trust.” https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2024.1382693/full
- OpenAI. (2024). “GPT-4o System Card.” https://cdn.openai.com/gpt-4o-system-card.pdf
- Deloitte. (2024). “Earning trust as gen AI takes hold: 2024 Connected Consumer Survey.” https://www.deloitte.com/us/en/insights/industry/telecommunications/connectivity-mobile-trends-survey/2024.html
- IEEE Spectrum. (2025). “The State of AI 2025: 12 Eye-Opening Graphs.” https://spectrum.ieee.org/ai-index-2025

Tim Green, UK-based Systems Theorist & Independent Technology Writer
Tim explores the intersections of artificial intelligence, decentralised cognition, and posthuman ethics. His work, published at smarterarticles.co.uk, challenges dominant narratives of technological progress while proposing interdisciplinary frameworks for collective intelligence and digital stewardship.
His writing has been featured on Ground News and shared by independent researchers across both academic and technological communities.
ORCID: 0009-0002-0156-9795 | Email: tim@smarterarticles.co.uk