The Transparency Trap: When Explaining AI Moderation Helps Bad Actors

Every second, an unfathomable volume of content floods the world's largest social media platforms. TikTok videos, Instagram Reels, YouTube Shorts, Facebook posts, and Threads updates compete for attention in an endless cascade of human expression. Behind the scenes, artificial intelligence systems work tirelessly to sort the acceptable from the harmful, the benign from the dangerous. In the first three months of 2025, TikTok reported that over 99% of content violating its community guidelines was removed before anyone reported it, with more than 90% taken down before gaining any views. The vast majority of these removals (94%) occurred within 24 hours, and automated moderation technologies handled over 87% of all video removals.
These numbers represent a staggering achievement in automated content governance. They also represent a profound challenge: how do you explain billions of algorithmic decisions to regulators, users, and internal governance teams without revealing the very heuristics that bad actors could exploit to evade detection?
This is the glass box problem of modern content moderation. Regulators demand transparency. Users expect fair treatment. Internal governance teams require audit trails. Yet revealing too much about how these systems work creates an instruction manual for those determined to spread harm. As the European Union's Digital Services Act and AI Act reshape the regulatory landscape, platforms find themselves navigating an unprecedented tension between accountability and security.
The stakes could not be higher. Get the balance wrong in favour of opacity, and platforms face regulatory penalties reaching 6% of global revenue, plus the erosion of public trust. Get it wrong in favour of transparency, and every published detection method becomes an evasion playbook. Finding the narrow path between these failure modes has become the defining challenge for platform trust and safety teams worldwide.
When Error Rates Become Headlines
The pressure for explainable AI in content moderation has never been greater. In December 2024, Nick Clegg, Meta's president of global affairs, acknowledged publicly that the company's moderation “error rates are still too high” and pledged to “improve the precision and accuracy with which we act on our rules.” He stated: “We know that when enforcing our policies, our error rates are still too high, which gets in the way of the free expression that we set out to enable. Too often, harmless content gets taken down, or restricted, and too many people get penalized unfairly.”
This admission reflects a broader industry reckoning. Meta's own Oversight Board has warned that moderation errors risk the “excessive removal of political speech.” The company publicly apologised after its systems suppressed photos of Donald Trump surviving an attempted assassination. Of the more than 100 decisions the Oversight Board reviewed, it overturned approximately 80% of Meta's original calls, suggesting systematic issues with how automated systems make and explain their choices.
The statistics paint a picture of massive scale with meaningful error margins. Reddit reported that, of the content removed by moderators between January and June 2024, approximately 72% was taken down by automated systems. Meta reported that automated systems removed 90% of violent and graphic content on Instagram in the European Union between April and September 2024. Yet these impressive automation rates come with acknowledged shortcomings in accuracy and explainability.
When billions of decisions occur daily, even a small percentage error rate translates to millions of individual cases where users receive no meaningful explanation for why their content disappeared. This is where the technical challenge of explainability becomes a governance imperative. The global content moderation solutions market, valued at 8.53 billion dollars in 2024, is projected to grow at a compound annual growth rate of 13.10% through 2034, reflecting the immense investment platforms are making in these systems.
Understanding the Toolbox: SHAP, LIME, and Attention Visualisation
At the heart of explainable AI for content classification lie several key technical approaches, each with distinct strengths and limitations for short-form user-generated content. Understanding these tools matters because the choice of explainability method shapes what platforms can tell users, regulators, and their own governance teams about why decisions were made.
SHAP: The Game Theory Approach
SHapley Additive exPlanations, or SHAP, represents one of the most robust approaches to model interpretability. Developed by Scott Lundberg and Su-In Lee in 2017, SHAP builds on Lloyd Shapley's 1953 game theory concept to assign each feature an importance value for a particular prediction. The fundamental insight is elegant: treat model features as “players” in a collaborative game, working together to determine each predicted value.
SHAP offers both global and local explanations, making it particularly valuable for content moderation. A global explanation might reveal that certain visual patterns or text sequences consistently trigger removal decisions across millions of pieces of content. A local explanation can tell a specific user exactly which elements of their post contributed to its removal. Unlike traditional feature importance measures that only indicate which features are generally important, SHAP shows exactly how each feature contributes to every single prediction a model makes.
For tree-based models commonly used in initial content screening, TreeSHAP offers particular advantages. This specialised algorithm computes SHAP values for ensemble models such as random forests and gradient boosted trees in polynomial time, dramatically reducing the computational complexity. Research has demonstrated that Fast TreeSHAP can compute explanations up to three times faster, while GPU-accelerated implementations (GPUTreeShap) deliver speedups of up to 19 times over standard multi-core CPU implementations.
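As a concrete illustration, the sketch below uses the open-source shap library with a toy gradient-boosted classifier trained on synthetic data; the feature meanings in the comments are assumptions for illustration, not any platform's real moderation signals.

```python
# Minimal TreeSHAP sketch: synthetic data stands in for real moderation features.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                 # e.g. toxicity score, link count, account age, report rate
y = (X[:, 0] + 0.5 * X[:, 1] > 1).astype(int)  # synthetic "violates policy" label

model = GradientBoostingClassifier().fit(X, y)

explainer = shap.TreeExplainer(model)          # polynomial-time SHAP for tree ensembles
shap_values = explainer.shap_values(X[:5])     # local attributions for five items

# Each row, added to the base value, recovers the model's margin for that item,
# showing exactly how much each feature pushed the decision towards removal.
print(shap_values)
```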
However, applying SHAP to the transformer-based models that power modern content classification presents greater computational challenges. When processing billions of items daily, generating individual SHAP explanations for deep learning models remains prohibitive at scale, requiring platforms to make strategic choices about which decisions warrant full explainability analysis.
LIME: Local Interpretable Explanations
Local Interpretable Model-agnostic Explanations, or LIME, takes a different approach. Rather than calculating feature importance through game-theoretic principles, LIME creates a local surrogate model, fitting a simpler, interpretable model (typically linear) to explain individual predictions.
The appeal of LIME lies in its model-agnostic nature: it can explain predictions from any machine learning system without requiring access to its internal workings. For platforms running diverse classification systems across text, images, and video, this flexibility proves valuable.
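A minimal sketch of this workflow, assuming the open-source lime package and a throwaway scikit-learn text pipeline standing in for a production classifier:

```python
# Minimal LIME sketch: a toy text pipeline stands in for a real moderation model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from lime.lime_text import LimeTextExplainer

texts = ["buy followers now", "lovely holiday photos", "cheap followers here", "dinner with friends"]
labels = [1, 0, 1, 0]                          # 1 = violates a hypothetical spam policy

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

explainer = LimeTextExplainer(class_names=["allowed", "spam"])
explanation = explainer.explain_instance(
    "get cheap followers now",
    pipeline.predict_proba,                    # LIME only needs a prediction function
    num_features=3,
)
print(explanation.as_list())                   # word-level weights from the local surrogate
```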
However, LIME carries significant limitations for content moderation. The method is inherently local, unable to provide the global insights that governance teams need to understand systematic patterns in moderation decisions. More critically, if the underlying model relies on nonlinear relationships between features and outcomes, the linear surrogate cannot represent them, so they disappear from the explanation. For the nuanced, context-dependent decisions that characterise effective content moderation, this limitation matters.
Attention Visualisation: Looking Inside Transformers
The transformer architecture underlying most modern language and vision models offers another window into decision-making through attention weights. Tools like BertViz, developed for visualising attention in transformer models, can show how these systems allocate focus across input elements. BertViz provides multiple views for analysis: a head view visualising attention for one or more attention heads, a model view offering a bird's-eye perspective across all layers and heads, and a neuron view examining individual components in query and key vectors.
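A minimal sketch of the head view, assuming a notebook environment, the bertviz and transformers packages, and a generic pretrained BERT checkpoint rather than a moderation-specific model:

```python
# Minimal BertViz sketch: visualise attention for one sentence in a notebook.
from transformers import AutoTokenizer, AutoModel
from bertviz import head_view

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

inputs = tokenizer("This post was flagged for review", return_tensors="pt")
outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(outputs.attentions, tokens)   # interactive per-layer, per-head attention view
```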
Yet research has increasingly questioned whether attention weights truly explain model behaviour. In their influential 2019 paper “Attention is not Explanation,” Sarthak Jain and Byron Wallace performed extensive experiments across NLP tasks, finding that learned attention weights are frequently uncorrelated with gradient-based measures of feature importance. They demonstrated that very different attention distributions can yield equivalent predictions. Their conclusion was stark: “standard attention modules do not provide meaningful explanations and should not be treated as though they do.”
This presents a fundamental challenge for content moderation transparency. If attention visualisation does not reliably explain why a model made a particular decision, offering it as an explanation may be misleading. The appearance of transparency without substance serves no one's interests.
The Regulatory Landscape: DSA and EU AI Act
Europe has emerged as the global leader in mandating content moderation transparency. The Digital Services Act, fully in force since February 2024, and the AI Act (Regulation EU 2024/1689), which entered into force on 1 August 2024, together create unprecedented requirements for explainability and audit trails. The AI Act represents the first-ever comprehensive legal framework on AI worldwide. These regulations transform theoretical discussions about transparency into concrete compliance obligations with substantial penalties for failure.
Digital Services Act: Statements of Reasons and the Transparency Database
The DSA's centrepiece for content moderation accountability is the “statement of reasons” requirement. Whenever a platform removes or restricts access to content, it must inform users and explain the reasoning behind each decision. Very Large Online Platforms must submit these statements to the DSA Transparency Database, which makes them publicly available in near-real-time.
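To make the obligation concrete, the record below sketches the kind of information a statement of reasons carries. The field names and values are hypothetical illustrations, not the Transparency Database's actual schema.

```python
# Hypothetical statement-of-reasons record (illustrative fields only,
# not the official DSA Transparency Database schema).
statement_of_reasons = {
    "decision_id": "sor-2025-000123",
    "platform": "ExampleVideo",
    "content_type": "video",
    "restriction": "removal",                      # e.g. removal, demotion, age-gating
    "ground": "incompatible_with_terms",           # legal ground vs. terms of service
    "policy_category": "hate_speech",
    "facts_and_circumstances": "Automated classifier flagged slur targeting a protected group.",
    "automated_detection": True,
    "automated_decision": True,
    "redress_options": ["internal_appeal", "out_of_court_settlement", "judicial_redress"],
    "decided_at": "2025-03-14T09:26:00Z",
}
```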
Starting from 17 February 2024, all providers of intermediary services must publish annual reports on their content moderation practices, including the number of orders received from authorities, measures comprising their content moderation practices, the number of pieces of content taken down, and critically, the accuracy and rate of error of their automated content moderation systems.
However, early analysis reveals significant concerns about data quality. Research examining the database has uncovered issues with incomplete reporting, vague categorisation, and unreliable data. As one study noted: “Transparency mechanisms like the DSA-TDB are only as valuable as the quality of the data they provide. If platforms systematically underuse informative fields, rely on too generic classifications, or submit records that defy plausibility, then the promise of meaningful oversight is undermined.”
EU AI Act: Technical Documentation for High-Risk Systems
The AI Act establishes a risk-based framework classifying AI systems into four categories: unacceptable, high, limited, and minimal risk. While content moderation AI may fall into different categories depending on specific applications, the documentation requirements for high-risk systems set benchmarks that forward-thinking platforms are already adopting.
High-risk AI systems require technical documentation before market release, kept continuously up to date. This documentation must demonstrate compliance with regulatory requirements and provide authorities with clear, comprehensive information for compliance assessment. The required elements include detailed descriptions of system architecture, algorithms used, data sources, data governance practices, and measures for managing risks and ensuring accuracy, robustness, and cybersecurity.
Critically, high-risk AI systems must allow for automatic recording of events (logs) over their lifetime, creating an inherent audit trail. The timeline for compliance creates urgency. Prohibited AI practices and AI literacy obligations entered application from 2 February 2025. Governance rules for general-purpose AI models became applicable on 2 August 2025. Rules for high-risk AI systems embedded in regulated products have an extended transition period until 2 August 2027.
Enforcement with Teeth
The stakes for non-compliance are substantial. Non-compliance with the Digital Services Act can attract penalties of up to 6% of a company's annual turnover in the European Union. In 2024, the Commission launched investigations into TikTok and X for failing to meet transparency and child protection standards. On 24 October 2025, the EU Commission published an assessment finding that Meta and TikTok may have breached transparency rules under the DSA, signalling increased regulatory scrutiny not just for content hosted but for transparency, data accessibility for researchers, and user-friendliness of rights mechanisms.
Building Audit Trails for Governance
Creating effective audit trails for content moderation requires addressing multiple audiences with different needs: internal governance teams seeking to understand systematic patterns, regulators demanding compliance evidence, and users wanting explanations for specific decisions. Each audience requires different information at different levels of detail, making audit trail design a fundamentally architectural challenge.
Internal Governance: Pattern Recognition and Error Analysis
For internal teams, audit trails must enable identification of systematic errors before they become public controversies. This requires logging not just final decisions but the full decision pathway: which models were consulted, what scores they produced, what thresholds were applied, whether human review occurred, and what the final outcome was.
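A minimal sketch of such a decision-pathway record, expressed as a Python dataclass; the field names are illustrative assumptions rather than any platform's logging schema.

```python
# Illustrative decision-pathway log entry for an internal audit trail.
from __future__ import annotations
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ModerationDecisionLog:
    content_id: str
    model_versions: dict[str, str]        # e.g. {"text_classifier": "v4.2.1"}
    model_scores: dict[str, float]        # raw scores from each model consulted
    thresholds_applied: dict[str, float]  # thresholds in force at decision time
    policy_category: str
    human_reviewed: bool
    final_action: str                     # e.g. "remove", "restrict", "no_action"
    appeal_outcome: str | None = None
    decided_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

entry = ModerationDecisionLog(
    content_id="vid-789",
    model_versions={"text_classifier": "v4.2.1"},
    model_scores={"text_classifier": 0.93},
    thresholds_applied={"text_classifier": 0.85},
    policy_category="harassment",
    human_reviewed=False,
    final_action="remove",
)
```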
Clegg's December 2024 acknowledgement that Meta “overdid it a bit” during COVID-19 content moderation reflects the kind of retrospective analysis that comprehensive audit trails enable. “We had very stringent rules removing very large volumes of content through the pandemic,” he explained. “No one during the pandemic knew how the pandemic was going to unfold, so this really is wisdom in hindsight.”
The ability to conduct such hindsight analysis depends entirely on having logged sufficient information. Model version tracking becomes essential when identifying whether a specific model update correlated with increased error rates. Threshold tracking reveals whether policy changes translated correctly into technical implementations.
Model Cards and Documentation Standards
The concept of model cards, first proposed in 2019 by data scientists including Margaret Mitchell and Timnit Gebru, provides a framework for documenting AI systems analogous to nutrition labels for food products. Model cards document how a model performs across use cases, data distributions, and social contexts.
For content moderation, model cards should capture intended use cases and out-of-scope applications, expected users and contexts, performance across different demographic groups, training data characteristics, known limitations, and ethical considerations.
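A sketch of what such a card might capture for a hypothetical short-form classifier, expressed as a plain dictionary; every name and figure below is invented for illustration.

```python
# Illustrative model card for a hypothetical short-form text moderation classifier.
model_card = {
    "model_name": "example-shortform-toxicity-v3",
    "intended_use": "Flag probable harassment in short text posts for human review.",
    "out_of_scope": ["medical advice detection", "long-form document analysis"],
    "expected_users": ["trust and safety reviewers", "policy teams"],
    "performance_by_group": {
        "overall": {"precision": 0.91, "recall": 0.84},
        "non_english_posts": {"precision": 0.78, "recall": 0.69},   # known weaker performance
    },
    "training_data": "Public and licensed short-form posts, 2021-2024, with human policy labels.",
    "known_limitations": ["reclaimed slurs", "coded language", "sarcasm and satire"],
    "ethical_considerations": "Higher false-positive rates observed for dialectal English.",
}
```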
NVIDIA has extended this concept with Model Card++, incorporating additional information about bias mitigation, explainability, privacy, safety, and security. The AI Transparency Atlas framework assigns particular weight to safety-critical disclosures: Safety Evaluation (25%), Critical Risk (20%), and Model Data (15%) together account for 60% of the total score. Research evaluating documentation practices found that while leading providers like xAI, Microsoft, and Anthropic achieve approximately 80% compliance, many smaller providers fall below 50%, with categories like Interpretability and Safety Evaluation remaining poorly documented.
Regulatory Compliance: Demonstrating Due Diligence
Meeting regulatory requirements extends beyond simply logging decisions. The DSA requires platforms to demonstrate that their moderation systems are effective and fair. This means being able to show auditors the methodology used to measure accuracy, the error rates for different content categories and user populations, and evidence that human oversight exists for consequential decisions.
The Appeals Centre Europe, certified in October 2024 as the first out-of-court dispute settlement body under the DSA, provides early evidence of how external review will function. Users pay a nominal fee of five euros (refunded if they win) while platforms pay approximately 100 euros per case. Its initial transparency report showed that, of the 1,500 disputes ruled upon, over three-quarters of the platforms' original decisions were overturned. This reversal rate suggests significant room for improvement in both decision quality and documentation.
The Adversarial Tension: Transparency Versus Security
Here lies the central paradox of explainable content moderation: every detail revealed about how systems detect harmful content becomes a potential roadmap for evading detection. This tension is not theoretical; it represents a daily operational reality for platform trust and safety teams. Balancing these competing imperatives requires understanding both the nature of adversarial threats and the strategies available for managing disclosure.
The Exploitation Problem
Research has documented how bad actors can exploit AI vulnerabilities. Generative Adversarial Networks can manipulate images so that they appear unchanged to humans while the underlying numerical features shift enough for classifiers to interpret them entirely differently. Researchers have demonstrated effective adversarial techniques even against black-box networks, where attackers have no specific knowledge of the model or its training data.
Text-based adversarial attacks present particular challenges for short-form content moderation. Researchers have developed attacks at character, word, sentence, and multi-level perturbation units. These attacks exploit the discrete nature of text, where subtle substitutions can evade detection while remaining comprehensible to human readers. The ACM Computing Surveys published a comprehensive survey of adversarial defences and robustness in NLP, cataloguing attack methods ranging from simple character substitutions to sophisticated semantic-preserving perturbations.
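As a toy illustration of the mechanism, the sketch below applies a homoglyph substitution against the same kind of throwaway classifier used earlier; it demonstrates why character-level perturbations matter, not any real platform's vulnerability.

```python
# Toy character-level perturbation: homoglyph substitution against a throwaway classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["buy followers now", "lovely holiday photos", "cheap followers here", "dinner with friends"]
labels = [1, 0, 1, 0]
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

original = "cheap followers now"
perturbed = original.replace("o", "\u03bf")   # Latin 'o' swapped for Greek omicron

# The text looks unchanged to a human reader, but the altered tokens no longer match
# the training vocabulary, so the classifier's violation score can drop sharply.
print(pipeline.predict_proba([original])[0][1])
print(pipeline.predict_proba([perturbed])[0][1])
```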
Industry professionals have explicitly noted this tension. Describing AI moderation decisions in too much detail could reveal “commercially sensitive” information or provide “a way for bad actors to exploit the service.” YouTube noted that automated enforcement remains necessary due to content volume and speed, adding that it continues improving detection accuracy “especially as generative AI tools contribute to increased volumes of low-quality or misleading content.”
The Arms Race Reality
Content moderation has become an arms race between detection systems and evasion techniques. Malicious actors can intentionally manipulate content to bypass AI filters, “creating content that appears innocuous to humans but is harmful or violates policies.” Adversarial attacks can undermine AI model effectiveness, requiring constant vigilance and adaptation.
This reality shapes how platforms approach explainability. While regulators may demand detailed explanations of decision criteria, providing such explanations publicly would compromise system effectiveness. The result is a careful balancing act: offering enough transparency to satisfy legitimate oversight while maintaining sufficient opacity to preserve security.
Strategies for Managing Disclosure
Several strategies have emerged for managing this tension.
Tiered transparency provides different levels of detail to different audiences. General users might receive categorical explanations (“this content was removed for violating our hate speech policy”) while regulators receive more detailed information under confidentiality agreements. Internal governance teams access full technical details.
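A minimal sketch of how a single decision record might be rendered at different levels of detail; the tiers and fields are assumptions for illustration.

```python
# Illustrative tiered rendering of one moderation decision for different audiences.
def render_explanation(decision: dict, audience: str) -> dict:
    if audience == "user":
        # Categorical explanation only: enough to understand and appeal the decision.
        return {"action": decision["final_action"], "policy": decision["policy_category"]}
    if audience == "regulator":
        # Adds context under confidentiality, but not feature-level detection criteria.
        return {
            "action": decision["final_action"],
            "policy": decision["policy_category"],
            "automated": not decision["human_reviewed"],
            "model_versions": decision["model_versions"],
        }
    # Internal governance: the full decision pathway, including raw scores and thresholds.
    return decision

decision = {
    "final_action": "remove",
    "policy_category": "hate_speech",
    "human_reviewed": False,
    "model_versions": {"text_classifier": "v4.2.1"},
    "model_scores": {"text_classifier": 0.93},
    "thresholds_applied": {"text_classifier": 0.85},
}
print(render_explanation(decision, "user"))
```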
Delayed disclosure publishes detailed information about detection methods only after those methods have been superseded. This provides historical transparency while protecting current operations.
Aggregate reporting shares statistics about moderation performance without revealing specific detection criteria. Platforms can demonstrate error rates, appeal success rates, and category distributions without exposing exploitable details.
Adversarial testing proactively challenges moderation systems with known evasion techniques, documenting robustness without revealing techniques systems cannot yet detect.
Microsoft's approach to AI moderation in gaming illustrates principle-based governance: grounding decisions in fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability. These principles guide development without specifying technical details that could be exploited.
Platform Practices: Lessons from the Front Lines
The practical implementation of explainability and audit trails varies significantly across major platforms, offering lessons for the broader industry.
TikTok: Automation at Scale
TikTok's transparency reports reveal the most aggressive automation in the industry. In the second half of 2024, the accuracy rate for automated moderation technologies was 99.1%. Over 96% of content removed through automated technology was taken down before receiving any views. Over 80% of violative video removals occurred through automated technology, with over 98% removed within 24 hours.
This automation intensity creates both opportunities and challenges for explainability. High automation enables consistent logging. However, research analysing TikTok's contributions to the DSA Transparency Database found a considerable discrepancy: TikTok's own transparency report stated that 45% of non-ad content was removed automatically, whereas the database put the figure at 95%. Such inconsistencies undermine the very transparency that audit trails are meant to provide.
YouTube: The Human Review Question
YouTube faces persistent questions about human review in its moderation process. The company states that appeals are manually reviewed, yet creators have reported receiving rejection notices within minutes of submitting appeals, contradicting claims of human involvement.
YouTube's Transparency Report tracks whether removals were first flagged by automation or humans, with the majority of takedowns starting with automated flagging. In response to one terminated creator with 650,000 subscribers whose appeal was rejected in five minutes, YouTube maintained it has “not identified any widespread issues” while acknowledging “a handful” of incorrect terminations.
The introduction of a “second chances” pilot programme in October 2025, allowing some terminated creators to request new channels one year after termination, represents an acknowledgement that current appeal systems may be insufficient. This programme excludes creators terminated for copyright infringement and those who violated Creator Responsibility policies.
Meta: The Oversight Board Experiment
Meta's creation of the Oversight Board represents the most ambitious external accountability mechanism in the industry. The Board reviewed 115 cases by April 2024, finding that Meta was “twice as likely to be wrong as right” in its original decisions. The consistently high overturn rate (approximately 80% of decisions) indicates systematic gaps in moderation accuracy that internal processes failed to catch.
In 2024, Meta confirmed another round of funding, with a contribution of 30 million dollars to ensure the Board's operations through 2027. The Board officially began covering cases related to Threads in May 2024, expanding its oversight remit.
The Oversight Board Trust's establishment of Appeals Centre Europe extends this external review model beyond Meta. Now handling disputes from Facebook, TikTok, and YouTube users in the EU, its early results (three-quarters of original decisions overturned) mirror the Oversight Board's experience, suggesting industry-wide challenges with moderation accuracy.
The Human Element: Reviewers and Explanations
Explainability serves not just external stakeholders but also the human reviewers who form the last line of defence in content moderation systems. These workers must understand AI recommendations to make informed decisions, particularly for borderline cases that automated systems flag but cannot confidently resolve. The quality of explanations provided to reviewers directly affects the quality of their decisions.
Cognitive Load and Decision Support
The sheer volume of content requiring review creates cognitive challenges. When AI provides recommendations, the explanation accompanying that recommendation shapes how reviewers engage with it. Overly complex explanations may be ignored; overly simple ones may not provide sufficient context for informed decision-making.
Research on user perception of attention visualisations found that while transformer models could classify documents accurately, attention weights were not perceived as particularly helpful for explaining predictions. Crucially, this perception varied significantly depending on how attention was visualised. The implication for content moderation is clear: the same underlying explanation, presented differently, may have dramatically different effects on reviewer understanding and decision quality.
Large Language Models and Dynamic Explanation Systems
Large language models present both opportunities and challenges for explainable content moderation. Their ability to generate natural language explanations offers a new paradigm for communicating decisions to users, potentially transforming the relationship between platforms and the people whose content they moderate.
As research published in Artificial Intelligence Review has noted, LLMs have the potential to better understand contexts and nuances through pretraining on diverse sources. For content moderation, this could mean explanations that are “dynamic and interactive, including not only the reasons for violating community rules but also recommendations for modification.”
This dialogic approach could transform user experience, moving from punitive removal notices to educational interactions that promote discourse quality. An LLM-based system might not just remove content but explain specifically which phrase or image element violated guidelines and suggest alternative expressions.
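A sketch of what such an explanation request might look like, using the OpenAI Python client as a stand-in; the model name, prompt wording, and policy excerpt are all assumptions rather than any platform's implementation.

```python
# Sketch of an LLM-generated explanation; client usage, model name, and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

post = "Example post text that was flagged."
policy_excerpt = "Do not target individuals with insults based on protected characteristics."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": (
            "You explain moderation decisions. Identify which phrase appears to violate the "
            "policy excerpt, explain why in plain language, and suggest a compliant rewording. "
            "Do not reveal detection thresholds or internal scoring."
        )},
        {"role": "user", "content": f"Policy: {policy_excerpt}\n\nPost: {post}"},
    ],
)
print(response.choices[0].message.content)
```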
However, the same capabilities that enable nuanced explanations also enable sophisticated evasion. If users can query systems about why content was removed and receive detailed responses, they can systematically probe for gaps in detection. The emergence of LLM-based moderation thus intensifies rather than resolves the transparency paradox. Platforms deploying these systems must design interaction patterns that provide genuine value to good-faith users while limiting the information extractable by adversaries.
Operational Principles for Platform Teams
For platform teams navigating the explainability imperative, several principles emerge from current research and regulatory requirements.
Design for multiple audiences. Different stakeholders need different levels of detail. Build systems that can generate tiered explanations, from simple category labels for users to detailed technical documentation for regulators under confidentiality.
Log comprehensively. Audit trails should capture the full decision pathway, not just outcomes. Include model versions, confidence scores, threshold applications, human review involvement, and appeal outcomes.
Test adversarially. Before publishing any explanation methodology, test whether that information could enable evasion. Run adversarial challenges covering known manipulation techniques.
Validate explanations empirically. Ensure that explanations actually reflect decision drivers. If attention weights do not predict behaviour changes, do not offer them as explanations.
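One simple empirical check is a deletion test: remove the words an explanation ranks as most influential and confirm the model's score actually moves. A minimal sketch, reusing the illustrative pipeline from earlier:

```python
# Deletion-style faithfulness check: do the explanation's top words actually drive the score?
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from lime.lime_text import LimeTextExplainer

texts = ["buy followers now", "lovely holiday photos", "cheap followers here", "dinner with friends"]
labels = [1, 0, 1, 0]
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

sample = "get cheap followers now"
explainer = LimeTextExplainer(class_names=["allowed", "spam"])
explanation = explainer.explain_instance(sample, pipeline.predict_proba, num_features=2)

top_words = [word for word, weight in explanation.as_list() if weight > 0]
ablated = " ".join(w for w in sample.split() if w not in top_words)

before = pipeline.predict_proba([sample])[0][1]
after = pipeline.predict_proba([ablated])[0][1]
# If the explanation is faithful, removing its top-ranked words should lower the violation score.
print(before, after)
```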
Prepare for regulatory evolution. The DSA and AI Act represent the current state of regulation, not the final word. Build flexible systems that can accommodate additional requirements as regulatory frameworks mature.
Invest in human oversight. Automation enables scale but creates accountability gaps. Maintain meaningful human review for consequential decisions and ensure reviewers can understand and act upon AI recommendations.
Navigating the Transparency Paradox
The quest for explainable content moderation at scale represents one of the defining challenges of our digital age. Billions of daily decisions shape what humanity can see, share, and discuss online. The systems making these decisions operate at speeds and scales that preclude traditional human oversight, yet their consequences for free expression, public safety, and democratic discourse demand accountability.
The tools exist: SHAP, LIME, attention visualisation, and emerging LLM-based explanation systems offer genuine capabilities for illuminating algorithmic decision-making. The regulatory frameworks have arrived: the DSA and AI Act establish clear requirements and meaningful penalties. The platforms are adapting: transparency reports, oversight boards, and appeal centres demonstrate genuine investment in accountability.
Yet fundamental tensions remain unresolved. Every explanation risks becoming an evasion guide. Every audit trail creates computational overhead. Every transparency requirement conflicts with operational security. The organisations that navigate these tensions most effectively will shape the future of online discourse.
The glass box problem may never be fully solved. But the ongoing effort to make content moderation more explainable, auditable, and accountable represents an essential commitment to the principle that algorithmic power should be subject to human understanding and democratic oversight. For platforms, regulators, and users alike, the goal is not perfect transparency but rather transparency sufficient to enable meaningful accountability. Finding that balance, and maintaining it as technology and threats evolve, will define the character of our shared digital future.
References and Sources
TikTok Transparency Center. “Community Guidelines Enforcement Report, Q1 2025.” https://www.tiktok.com/transparency/en/community-guidelines-enforcement-2025-1
Meta Transparency Center. “Integrity Reports, Fourth Quarter 2024.” https://transparency.meta.com/integrity-reports-q4-2024
European Commission. “AI Act: Regulatory Framework for AI.” Digital Strategy, 2024. https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai
European Commission. “How the Digital Services Act enhances transparency online.” https://digital-strategy.ec.europa.eu/en/policies/dsa-brings-transparency
Lundberg, Scott M. and Su-In Lee. “A Unified Approach to Interpreting Model Predictions.” arXiv:1705.07874, 2017. https://arxiv.org/abs/1705.07874
Salih, A. et al. “A Perspective on Explainable Artificial Intelligence Methods: SHAP and LIME.” Advanced Intelligent Systems, 2025. https://advanced.onlinelibrary.wiley.com/doi/10.1002/aisy.202400304
Oversight Board. “2024 Annual Report Highlights Board's Impact in the Year of Elections.” https://www.oversightboard.com/news/2024-annual-report-highlights-boards-impact-in-the-year-of-elections/
Oversight Board. “From Bold Experiment to Essential Institution.” December 2025. https://www.oversightboard.com/news/from-bold-experiment-to-essential-institution/
Chefer, Hila et al. “Transformer Interpretability Beyond Attention Visualization.” CVPR 2021. https://openaccess.thecvf.com/content/CVPR2021/papers/Chefer_Transformer_Interpretability_Beyond_Attention_Visualization_CVPR_2021_paper.pdf
Vig, Jesse. “BertViz: Visualize Attention in NLP Models.” GitHub. https://github.com/jessevig/bertviz
European Commission. “DSA Transparency Database.” https://transparency.dsa.ec.europa.eu/
Holistic AI. “The EU's Digital Services Act: The Need for Independent Third-Party AI Audits.” https://www.holisticai.com/blog/eu-digital-services-act
EU Artificial Intelligence Act. “Article 11: Technical Documentation.” https://artificialintelligenceact.eu/article/11/
EU Artificial Intelligence Act. “Annex IV: Technical Documentation.” https://artificialintelligenceact.eu/annex/4/
Mitchell, Margaret et al. “Model Cards for Model Reporting.” 2019. Referenced in IAPP analysis: https://iapp.org/news/a/5-things-to-know-about-ai-model-cards
NVIDIA Developer Blog. “Enhancing AI Transparency and Ethical Considerations with Model Card++.” https://developer.nvidia.com/blog/enhancing-ai-transparency-and-ethical-considerations-with-model-card/
TechPolicy Press. “Oversight Board Trust Launches EU Out-of-Court Dispute Settlement Service.” October 2024. https://www.techpolicy.press/oversight-board-launches-eu-outofcourt-dispute-settlement-service/
TechPolicy Press. “What We Can Learn from the First Digital Services Act Out-of-Court Dispute Settlements?” https://www.techpolicy.press/what-we-can-learn-from-the-first-digital-services-act-outofcourt-dispute-settlements/
Checkstep. “Emerging Threats in AI Content Moderation: Deep Learning and Contextual Analysis.” https://www.checkstep.com/emerging-threats-in-ai-content-moderation-deep-learning-and-contextual-analysis
Microsoft Developer. “Enhancing Safety Moderation with AI: A Deep Dive.” October 2024. https://developer.microsoft.com/en-us/games/articles/2024/10/enhancing-safety-moderation-with-ai-deep-dive/
Reclaim the Net. “Meta's Nick Clegg Admits Excessive Censorship and High Error Rates in Content Moderation.” December 2024. https://reclaimthenet.org/metas-nick-clegg-admits-high-content-moderation-errors
YouTube Transparency Report. “Community Guidelines Enforcement.” https://transparencyreport.google.com/youtube-policy/appeals
Creator Handbook. “YouTube addresses AI moderation concerns after reporting 12 million channel terminations in 2025.” https://www.creatorhandbook.net/youtube-addresses-ai-moderation-concerns-after-reporting-12-million-channel-terminations-in-2025/
TechCrunch. “EC finds Meta and TikTok breached transparency rules under DSA.” October 2025. https://techcrunch.com/2025/10/24/ec-finds-meta-and-tiktok-breached-transparency-rules-under-dsa/
arXiv. “A Year of the DSA Transparency Database: What it (Does Not) Reveal About Platform Moderation During the 2024 European Parliament Election.” https://arxiv.org/html/2504.06976v1
Springer Link. “Content moderation by LLM: from accuracy to legitimacy.” Artificial Intelligence Review, 2025. https://link.springer.com/article/10.1007/s10462-025-11328-1
ACM Digital Library. “A Survey of Adversarial Defenses and Robustness in NLP.” ACM Computing Surveys, 2023. https://dl.acm.org/doi/10.1145/3593042
Deloitte UK. “EU Digital Services Act: Are you ready for audit?” https://www.deloitte.com/uk/en/services/audit/blogs/eu-digital-services-act-are-you-ready-for-audit.html
Jain, Sarthak and Byron C. Wallace. “Attention is not Explanation.” Proceedings of NAACL-HLT 2019. https://aclanthology.org/N19-1357/
Yang, Jilei. “Fast TreeSHAP: Accelerating SHAP Value Computation for Trees.” arXiv:2109.09847. https://arxiv.org/abs/2109.09847
Mordor Intelligence. “Content Moderation Market Size 2030 & Industry Statistics.” https://www.mordorintelligence.com/industry-reports/content-moderation-market

Tim Green, UK-based Systems Theorist and Independent Technology Writer
Tim explores the intersections of artificial intelligence, decentralised cognition, and posthuman ethics. His work, published at smarterarticles.co.uk, challenges dominant narratives of technological progress while proposing interdisciplinary frameworks for collective intelligence and digital stewardship.
His writing has been featured on Ground News and shared by independent researchers across both academic and technological communities.
ORCID: 0009-0002-0156-9795
Email: tim@smarterarticles.co.uk