SmarterArticles


The news business has survived many existential threats. Television didn't kill radio. The internet didn't kill newspapers, though it came close. But what happens when artificial intelligence doesn't just compete with journalism but consumes it whole, digests it, and spits out bite-sized summaries without sending a single reader, or penny, back to the source?

This isn't a hypothetical future. It's happening now, and the numbers are brutal.

When Google rolled out AI Overviews to all US users in May 2024, the impact was immediate and devastating. Travel blog The Planet D shut down after its traffic plummeted 90%. Learning platform Chegg reported a 49% decline in non-subscriber traffic between January 2024 and January 2025. The average click-through rate for the number one result on AI Overview keywords dropped from 7.3% in March 2024 to just 2.6% in March 2025. That's not a decline. That's a collapse.

Zero-click searches, where users get their answers without ever leaving Google, increased from 56% to 69% between May 2024 and May 2025, according to Similarweb data. CNN's website traffic dropped approximately 30% from a year earlier. Industry analysts estimate that AI Overviews could cost publishers $2 billion in annual advertising revenue.

But the traffic drain is only half the story. Behind the scenes, AI companies have been systematically scraping, copying, and ingesting journalistic content to train their models, often without permission, payment, or acknowledgement. This creates a perverse feedback loop: AI companies extract the knowledge created by journalists, repackage it through their models, capture the traffic and revenue that would have funded more journalism, and leave news organisations struggling to survive while simultaneously demanding access to more content to improve their systems.

The question isn't whether this is happening. The question is whether we're watching the construction of a new information extraction economy that fundamentally alters who controls, profits from, and ultimately produces the truth.

The Scraping Economy

In November 2023, the News Media Alliance, representing nearly 2,000 outlets in the US, submitted a 77-page white paper to the United States Copyright Office. Their findings were stark: developers of generative artificial intelligence systems, including OpenAI and Google, had copied and used news, magazine, and digital media content to train their bots without authorisation. The outputs of these AI chatbots brought them into direct competition with news outlets through “narrative answers to search queries,” eliminating the need for consumers to visit news sources.

The economics are lopsided to the point of absurdity. Cloudflare found that OpenAI scraped a news site 250 times for every one referral page view it sent that site. For every reader OpenAI sends back to the original source, it has taken 250 pieces of content. It's the digital equivalent of a restaurant critic eating 250 meals and writing one review that mentions where they ate.

Research from 2024 and 2025 shows click-through rate reductions ranging from 34% to 46% when AI summaries appear on search results pages. Some publishers reported click-through rates dropping by as much as 89%. The News Media Alliance put it bluntly: “Without web traffic, news and media organisations lose subscription and advertising revenue, and cannot continue to fund the quality work that both AI companies and consumers rely on.”

This comes at a particularly brutal time for journalism. By the end of 2024, the United States had lost a third of its newspapers and almost two-thirds of its newspaper journalists since 2005. Newspaper advertising revenue collapsed from $48 billion in 2004 to $8 billion in 2020, an 82% decrease. Despite a 43% rise in traffic to the top 46 news sites over the past decade, their revenues declined 56%.

Core copyright industries contribute $2.09 trillion to US GDP, employing 11.6 million workers. The News Media Alliance has called for recognition that unauthorised use of copyrighted content to train AI constitutes infringement.

But here's where it gets complicated. Some publishers are making deals.

The Devil's Bargain

In December 2023, The New York Times sued OpenAI and Microsoft for copyright infringement, accusing them of using millions of articles to train their AI models without consent or compensation. As of early 2025, The Times had spent $10.8 million in its legal battle with OpenAI.

Yet in May 2025, The New York Times agreed to license its editorial content to Amazon to train the tech giant's AI platforms, marking the first time The Times agreed to a generative AI-focused licensing arrangement. The deal is worth $20 million to $25 million annually. According to a former NYT executive, The Times was signalling to other AI companies: “We're open to being at the table, if you're willing to come to the table.”

The Times isn't alone. Many publishers have signed licensing deals with OpenAI, including Condé Nast, Time magazine, The Atlantic, Axel Springer, The Financial Times, and Vox Media. News Corp signed a licensing deal with OpenAI in May 2024 covering The Wall Street Journal, New York Post, and Barron's.

Perplexity AI, after facing plagiarism accusations from Forbes and Wired in 2024, debuted a revenue-sharing model for publishers. But News Corp still sued Perplexity, accusing the company of infringing on its copyrighted content by copying and summarising large quantities of articles without permission.

These deals create a two-tier system. Major publishers with expensive legal teams can negotiate licensing agreements. Smaller publications, local news outlets, and independent journalists get their content scraped anyway but lack the resources to fight back or demand payment. The infrastructure of truth becomes something only the wealthy can afford to defend.

The Honour System Breaks Down

For decades, the internet operated on an honour system called robots.txt. Publishers could include a simple text file on their websites telling automated crawlers which parts of the site not to scrape. It wasn't enforceable law. It was a gentleman's agreement.
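In practice, the agreement takes the form of a short plain-text file at the site root. A minimal sketch of what a publisher blocking AI training crawlers might publish — the user-agent tokens shown are the ones the respective companies have publicly documented, and compliance remains entirely voluntary:

```text
# robots.txt — disallow AI training crawlers, leave search indexing alone

User-agent: GPTBot            # OpenAI's training crawler
Disallow: /

User-agent: Google-Extended   # token controlling use of content in Google AI training
Disallow: /

User-agent: CCBot             # Common Crawl, widely used in AI training datasets
Disallow: /

User-agent: PerplexityBot
Disallow: /

# Everything else, including ordinary search crawlers, is unaffected
User-agent: *
Allow: /
```

Nothing in this file is enforced by any technical mechanism; it simply asks well-behaved bots to stay away.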

Nearly 80% of top news organisations in the US were blocking OpenAI's web crawlers at the end of 2023, while 36% were blocking Google's artificial intelligence crawler. The number of AI bots that publishers attempted to block via robots.txt quadrupled between January 2024 and January 2025.

But the honour system is breaking down.

TollBit's report detected 436 million AI bot scrapes in Q1 2025, up 46% from Q4 2024. The percentage of AI bot scrapes that bypassed robots.txt surged from 3.3% in Q4 2024 to 12.9% by the end of Q1 2025. Recent updates to major AI companies' terms of service state that their AI bots can act on behalf of user requests, effectively meaning they can ignore robots.txt when being used for retrieval-augmented generation.
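Detecting this activity on the publisher side starts with scanning server logs for the crawlers' published user-agent tokens. A minimal sketch — the log lines are illustrative, and, crucially, this approach only catches bots that identify themselves honestly:

```python
# Published user-agent tokens for major AI crawlers (real, documented strings).
AI_BOT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")

def classify_request(log_line):
    """Return the AI bot token found in a raw access-log line, or None."""
    for token in AI_BOT_TOKENS:
        if token in log_line:
            return token
    return None

def tally_bots(log_lines):
    """Count requests per declared AI crawler across a batch of log lines."""
    counts = {}
    for line in log_lines:
        token = classify_request(line)
        if token:
            counts[token] = counts.get(token, 0) + 1
    return counts
```

A bot that impersonates an ordinary browser, as Perplexity was accused of doing, sails straight past a check like this, which is why publishers have turned to infrastructure-level blocking.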

The Perplexity case illustrates the problem. Wired found evidence of Perplexity plagiarising Wired stories, reporting that an IP address “almost certainly linked to Perplexity” visited its parent company's websites more than 800 times in a three-month span. Ironically, Perplexity plagiarised the very article that called out the startup for scraping its web content.

Cloudflare claimed that Perplexity didn't just violate robots.txt protocols but also broke Web Application Firewall rules which specifically blocked Perplexity's official bots. When websites blocked Perplexity's official crawlers, the company allegedly used a generic browser that impersonated Google Chrome on macOS, and used multiple unofficial IP addresses to bypass robots.txt rules.

Forbes accused Perplexity of plagiarism for republishing its original reporting on former Google CEO Eric Schmidt without citing the story directly, finding a plagiarised version within Perplexity AI's Pages tool with no reference to the media outlet besides a small “F” logo at the bottom of the page.

In response, Cloudflare became the first major internet infrastructure provider to block all AI scrapers accessing content by default, backed by more than a dozen major news and media publishers including the Associated Press, The Atlantic, BuzzFeed, Condé Nast, Dotdash Meredith, Fortune, Gannett, The Independent, and Time.

The technological arms race has begun. Publishers deploy more sophisticated blocking. AI companies find new ways around the blocks. And in the middle, the fundamental question remains: should accessing journalistic content for AI training require explicit consent, or should it be freely available unless someone actively objects and has the technical capacity to enforce that objection?

The Opt-In Opt-Out Debate

The European Union has been grappling with this question directly. The EU AI Act currently operates under an “opt-out” system where rightholders may reserve their rights to prevent text and data mining for commercial purposes. Providers of general-purpose AI models need to obtain authorisation from rightholders if they want to carry out text and data mining when rights have been expressly reserved.

But there's growing momentum toward changing this system. A July 2025 European Parliament study on generative AI and copyright concluded that an opt-in model would more fairly protect authors' rights and rebalance negotiation power, ensuring active consent and potential compensation. The study found that rightholders often lack the technical means or awareness to enforce opt-outs, and the existing system is ill-suited to the realities of AI training.

The United Kingdom has taken a different approach. In December 2024, the UK Government launched a consultation proposing a new exception allowing materials to be used for commercial purposes unless the content creator has “opted-out.” Critics, including the BBC, argue this risks undermining creators' rights and control over their work.

During parliamentary debate, the House of Commons removed provisions on AI transparency which had been added by the Lords. The Lords redrafted and reinstated the amendments, but the Commons rejected them again on 22 May 2025.

The opt-in versus opt-out debate isn't merely technical. It's about where we place the burden of enforcement. An opt-out system assumes AI companies can take content unless told otherwise, placing the burden on publishers to actively protect their rights. An opt-in system assumes publishers have control over their content unless they explicitly grant permission, placing the burden on AI companies to seek consent.

For large publishers with legal and technical resources, the difference may be manageable. For smaller outlets, local news organisations, freelance journalists, and news organisations in the developing world, the opt-out model creates an impossible enforcement burden. They lack the technical infrastructure to monitor scraping, the legal resources to pursue violations, and the market power to negotiate fair terms.

Innovation Versus Preservation

The debate is often framed as “innovation versus preservation.” AI companies argue that restricting access to training data will stifle innovation and harm the public interest. Publishers argue that protecting copyright is necessary to preserve the economic viability of journalism and maintain the quality information ecosystem that democracy requires.

This framing is convenient for AI companies because it makes them the champions of progress and publishers the defenders of an outdated status quo. But it obscures deeper questions about power, infrastructure, and the nature of knowledge creation.

Innovation and preservation aren't opposites. Journalism is itself an innovative enterprise. Investigative reporting that uncovers government corruption is innovation. Data journalism that reveals hidden patterns is innovation. Foreign correspondents risking their lives to document war crimes are engaged in the most vital form of truth-seeking innovation our society produces.

What we're really debating is who gets to profit from that innovation. If AI companies can extract the knowledge produced by journalists, repackage it, and capture the economic value without compensating the original creators, we haven't chosen innovation over preservation. We've chosen extraction over creation.

A 2025 study published in Digital Journalism argued that media organisations' dependence on AI companies poses challenges to media freedom, particularly through loss of control over the values embedded in AI tools they use to inform the public. Reporters Without Borders' World Press Freedom Index found that the global state of press freedom has reached an unprecedented low point. Over 60% of global media outlets expressed concern over AI scraping their content without compensation.

Consider what happens when the infrastructure of information becomes concentrated in a handful of AI companies. These companies don't just distribute news. They determine what constitutes an adequate answer to a question. They decide which sources to cite and which to ignore. They summarise complex reporting into bite-sized paragraphs, stripping away nuance, context, and the very uncertainty that characterises honest journalism.

Google's AI Overviews don't just show you what others have written. They present synthetic answers with an air of authority, as if the question has been definitively answered rather than reported on by journalists with varying levels of access, expertise, and bias. This isn't neutral infrastructure. It's editorial judgement, exercised by algorithms optimised for engagement rather than truth, and controlled by companies accountable primarily to shareholders rather than the public.

Who Owns the Infrastructure of Truth?

This brings us to the deepest question: who owns the infrastructure of truth itself?

For most of modern history, the answer was relatively clear. Journalists and news organisations owned the means of producing truth. They employed reporters, paid for investigations, took legal risks, and published findings. Distribution was controlled by whoever owned the printing presses, broadcast licences, or later, web servers. But production and distribution, while distinct, remained largely aligned.

AI fundamentally separates production from distribution, and arguably introduces a third layer: synthesis. Journalists produce the original reporting. AI companies synthesise that reporting into new forms. And increasingly, AI companies also control distribution through search, chatbots, and AI-powered interfaces.

This isn't just vertical integration. It's a wholesale reorganisation of the information supply chain that places AI companies at the centre, with journalists reduced to raw material suppliers in an extraction economy they neither control nor profit from adequately.

The parallel to natural resource extraction is uncomfortably apt. For centuries, colonial powers extracted raw materials from colonised territories, processed them in industrial centres, and sold finished goods back to those same territories at marked-up prices. The value accrued not to those who produced the raw materials but to those who controlled the processing and distribution infrastructure.

Replace “raw materials” with “original reporting” and “industrial centres” with “AI model training” and the analogy holds. News organisations produce expensive, labour-intensive journalism. AI companies scrape that journalism, process it through their models, and sell access to the synthesised knowledge. The value accrues not to those who produced the original reporting but to those who control the AI infrastructure.

Local news organisations in the US bore the brunt of economic disruption and increasingly tied themselves to platform companies like Facebook and Google. Those very companies are now major players in AI development, exacerbating the challenges and deepening the dependencies. Google's adoption of AI-based summarisation in its search engine results is likely to further upend the economic foundation for journalism.

The collapse of the mainstream news media's financial model may represent a threat to democracy, creating vast news deserts and the opportunity for ill-intentioned players to fill the void with misinformation. One study published by NewsGuard in May 2024 tallied nearly 1,300 AI-generated news sites across 16 languages, many churning out viral misinformation.

What emerges from this landscape is a paradox. At the very moment when AI makes it easier than ever to access and synthesise information, the economic model that produces trustworthy information is collapsing. AI companies need journalism to train their models and provide current information. But their extraction of that journalism undermines the business model that produces it. The snake is eating its own tail.

The Democracy Question

Democracy requires more than free speech. It requires the structural conditions that make truth-seeking possible. You need journalists who can afford to spend months on an investigation. You need news organisations that can fund foreign bureaus, hire fact-checkers, and employ editors with institutional knowledge. You need legal protections for whistleblowers and reporters. You need economic models that reward accuracy over clickbait.

These structural conditions have been eroding for decades. Newspaper revenues declined by nearly 28% between 2002 and 2010, and by a further 34% between 2010 and 2020, according to US Census Bureau data. Newspaper publishers collected about $22.1 billion in revenue in 2020, less than half the amount they collected in 2002.

AI doesn't create these problems. But it accelerates them by removing the final economic pillar many publishers were relying on: web traffic. If AI Overviews, chatbots, and synthetic search results can answer users' questions without sending them to the original sources, what incentive remains for anyone to fund expensive original reporting?

Some argue that AI could help journalism by making reporting more efficient and reducing costs. But efficiency gains don't solve the core problem. If all journalism becomes more efficient but generates less revenue, we still end up with less journalism. The question isn't whether AI can help journalists work faster. It's whether the AI economy creates sustainable funding models for the journalism we need.

The European Parliament's study advocating for opt-in consent isn't just about copyright. It's about maintaining the structural conditions necessary for independent journalism to exist. If publishers can't control how their content is used or negotiate fair compensation, the economic foundation for journalism collapses further. And once that foundation is gone, no amount of AI efficiency gains will rebuild it.

This is why framing the debate as innovation versus preservation misses the point. The real choice is between an AI economy that sustains journalism as a vital democratic institution and one that extracts value from journalism while undermining its viability.

The Transparency Illusion

The EU AI Act's requirement that providers publicly disclose detailed summaries of content used for AI model training sounds promising. Transparency is good, right? But disclosure without accountability is just performance.

Knowing that OpenAI trained GPT-4 on millions of news articles doesn't help publishers if they can't refuse consent or demand compensation. Knowing which crawlers visited your website doesn't prevent them from coming back. Transparency creates the illusion of control without providing actual leverage.

What would accountability look like? It would require enforcement mechanisms with real consequences. It would mean AI companies face meaningful penalties for scraping content without permission. It would give publishers legal standing to sue for damages. It would create regulatory frameworks that put the burden of compliance on AI companies rather than on publishers to police thousands of bots.

The UK parliamentary debate over AI transparency provisions illustrates the challenge. The House of Lords added amendments requiring AI companies to disclose their web crawlers and data sources. The House of Commons rejected these amendments twice. Why? Because transparency creates costs and constraints for AI companies that the government was unwilling to impose in the name of fostering innovation.

But transparency without teeth doesn't protect publishers. It just creates a paper trail of their exploitation.

Future Possibilities

We're at a genuine crossroads. The choices made in the next few years will determine whether journalism survives as an independent, adequately funded profession or becomes an unpaid raw material supplier for AI companies.

One possible future: comprehensive licensing frameworks where AI companies pay for the journalism they use, similar to how music streaming services pay royalties. The deals between major publishers and OpenAI, Google, and Amazon could expand to cover the entire industry, with collective licensing organisations negotiating on behalf of smaller publishers.

But this future requires addressing the power imbalance. Small publishers need collective bargaining power. Licensing fees need to be substantial enough to replace lost traffic revenue. And enforcement needs to be strong enough to prevent AI companies from simply scraping content from publishers too small to fight back.

Another possible future: regulatory frameworks that mandate opt-in consent for commercial AI training, as the European Parliament study recommends. AI companies would need explicit permission to use copyrighted content, shifting the burden from publishers protecting their rights to AI companies seeking permission. This creates stronger protections for journalism but could slow AI development and raise costs.

A third possible future: the current extraction economy continues until journalism collapses under the economic pressure. AI companies keep scraping, traffic keeps declining, revenues keep falling, and newsrooms keep shrinking. We're left with a handful of elite publications serving wealthy subscribers, AI-generated content farms producing misinformation, and vast news deserts where local journalism once existed.

The question is which future we choose, and who gets to make that choice. Right now, AI companies are making it by default through their technical and economic power. Regulators are making it through action or inaction. Publishers are making it through licensing deals that may or may not preserve their long-term viability.

What's largely missing is democratic deliberation about what kind of information ecosystem we want and need. Do we want a world where truth-seeking is concentrated in the hands of those who control the algorithms? Do we want journalism to survive as an independent profession, or are we comfortable with it becoming a semi-volunteer activity sustained by wealthy benefactors?

Markets optimise for efficiency and profit, not for the structural conditions democracy requires. If we leave these decisions entirely to AI companies and publishers negotiating bilateral deals, we'll get an outcome that serves their interests, not necessarily the public's.

The Algorithm Age and the Future of Truth

When The New York Times sued OpenAI in December 2023, it wasn't just protecting its copyright. It was asserting that journalism has value beyond its immediate market price. That the work of investigating, verifying, contextualising, and publishing information deserves recognition and compensation. That truth-seeking isn't free.

The outcome of that lawsuit, and the hundreds of similar conflicts playing out globally, will help determine who controls truth in the algorithm age. Will it be the journalists who investigate, the publishers who fund that investigation, or the AI companies who synthesise and redistribute their findings?

Control over truth has always been contested. Governments censor. Corporations spin. Platforms algorithmically promote and demote. What's different now is that AI doesn't just distribute truth or suppress it. It synthesises new forms of information that blend facts from multiple sources, stripped of context, attribution, and sometimes accuracy.

When you ask ChatGPT or Google's AI Overview a question about climate change, foreign policy, or public health, you're not getting journalism. You're getting a statistical model's best guess at what a plausible answer looks like, based on patterns it found in journalistic content. Sometimes that answer is accurate. Sometimes it's subtly wrong. Sometimes it's dangerously misleading. But it's always presented with an air of authority that obscures its synthetic nature.

This matters because trust in information depends partly on understanding its source. When I read a Reuters article, I'm evaluating it based on Reuters' reputation, the reporter's expertise, the sources cited, and the editorial standards I know Reuters maintains. When I get an AI-generated summary, I'm trusting an algorithmic process I don't understand, controlled by a company whose primary obligation is to shareholders, trained on data that may or may not include that Reuters article, and optimised for plausibility rather than truth.

The infrastructure of truth is being rebuilt around us, and most people don't realise it's happening. We've replaced human editorial judgement with algorithmic synthesis. We've traded the messy, imperfect, but ultimately accountable process of journalism for the smooth, confident, but fundamentally opaque process of AI generation.

And we're doing this at precisely the moment when we need trustworthy journalism most. Climate change, pandemic response, democratic backsliding, technological disruption, economic inequality: these challenges require the kind of sustained, expert, well-resourced investigative reporting that's becoming economically unviable.

The cruel irony is that AI companies are undermining the very information ecosystem they depend on. They need high-quality journalism to train their models and keep their outputs accurate and current. But by extracting that journalism without adequately compensating its producers, they're destroying the economic model that creates it.

What replaces professional journalism in this scenario? AI-generated content farms, partisan outlets masquerading as news, press releases repackaged as reporting, and the occasional well-funded investigative outfit serving elite audiences. That's not an information ecosystem that serves democracy. It's an information wasteland punctuated by oases available only to those who can afford them.

What Needs to Happen

The first step is recognising that this isn't inevitable. The current trajectory, where AI companies extract journalistic content without adequate compensation, is the result of choices, not technological necessity. Different choices would produce different outcomes.

Regulatory frameworks matter. The European Union's move toward stronger opt-in requirements represents one path. The UK's consultation on copyright and AI represents another. These aren't just technical policy debates. They're decisions about whether journalism survives as an economically viable profession.

Collective action matters. Individual publishers negotiating with OpenAI or Google have limited leverage. Collective licensing frameworks, where organisations negotiate on behalf of many publishers, could rebalance power. Cloudflare's decision to block AI scrapers by default, backed by major publishers, shows what coordinated action can achieve.

Legal precedent matters. The New York Times lawsuit against OpenAI will help determine whether using copyrighted content to train AI models constitutes fair use or infringement. That decision will ripple through the industry, either empowering publishers to demand licensing fees or giving AI companies legal cover to scrape freely.

Public awareness matters. Most people don't know this battle is happening. They use AI chatbots and search features without realising the economic pressure these tools place on journalism. Democratic deliberation requires an informed public.

What we're fighting over isn't really innovation versus preservation. It's not technology versus tradition. It's a more fundamental question: does knowledge creation deserve to be compensated? If journalists spend months investigating corruption, if news organisations invest in foreign bureaus and fact-checking teams, if local reporters cover city council meetings nobody else attends, should they be paid for that work?

The market, left to itself, seems to be answering no. AI companies can extract that knowledge, repackage it, and capture its economic value without paying the creators. Publishers can't stop them through technical means alone. Legal protections are unclear and under-enforced.

That's why this requires democratic intervention. Not to stop AI development, but to ensure it doesn't cannibalise the information ecosystem democracy requires. To create frameworks where both journalism and AI can thrive, where innovation doesn't come at the cost of truth-seeking, where the infrastructure of knowledge serves the public rather than concentrating power in a few algorithmic platforms.

The algorithm age has arrived. The question is whether it will be an age where truth becomes the property of whoever controls the most sophisticated models, or whether we'll find ways to preserve, fund, and protect the messy, expensive, irreplaceable work of journalism.

We're deciding now. The decisions we make in courtrooms, parliaments, regulatory agencies, and licensing negotiations over the next few years will determine whether our children grow up in a world with independent journalism or one where all information flows through algorithmic intermediaries accountable primarily to their shareholders.

That's not a future that arrives by accident. It's a future we choose, through action or inaction. And the choice, ultimately, is ours.


Sources and References

  1. Similarweb (2024-2025). Data on zero-click searches and Google AI Overviews impact.
  2. TollBit (2025). Q1 2025 Report on AI bot scraping statistics and robots.txt bypass rates.
  3. News Media Alliance (2023). White paper submitted to United States Copyright Office on AI scraping of journalistic content.
  4. Cloudflare (2024-2025). Data on OpenAI scraping ratios and Perplexity AI bypassing allegations.
  5. U.S. Census Bureau (2002-2020). Newspaper publishing revenue data.
  6. Bureau of Labor Statistics (2006-present). Newsroom employment statistics.
  7. GroupM (2024). Projected newspaper advertising revenue analysis.
  8. European Parliament (July 2025). Study on generative AI and copyright: opt-in model recommendations.
  9. UK Government (December 2024). Consultation on copyright and AI opt-out model.
  10. UK Information Commissioner's Office (25 February 2025). Response to UK Government AI and copyright consultation.
  11. Reporters Without Borders (2024). World Press Freedom Index and report on AI scraping concerns.
  12. Forum on Information and Democracy (February 2024). Report on AI regulation and democratic values.
  13. NewsGuard (May 2024). Study on AI-generated news sites across 16 languages.
  14. Digital Journalism (2025). “The AI turn in journalism: Disruption, adaptation, and democratic futures.” Dodds, T., Zamith, R., & Lewis, S.C.
  15. CNN Business (2023). “AI Chatbots are scraping news reporting and copyrighted content, News Media Alliance says.”
  16. NPR (2025). “Online news publishers face 'extinction-level event' from Google's AI-powered search.”
  17. Digiday (2024-2025). Multiple reports on publisher traffic impacts, AI licensing deals, and industry trends.
  18. TechCrunch (2024-2025). Coverage of Perplexity AI plagiarism allegations and publisher licensing deals.
  19. Wired (2024). Investigation of Perplexity AI bypassing robots.txt protocol.
  20. Forbes (2024). Coverage of plagiarism concerns regarding Perplexity AI Pages feature.
  21. The Hollywood Reporter (2025). Report on New York Times legal costs in OpenAI lawsuit.
  22. Press Gazette (2024-2025). Coverage of publisher responses to AI scraping and licensing deals.
  23. Digital Content Next (2025). Survey data on Google AI Overviews impact on publisher traffic.
  24. Nieman Journalism Lab (2024-2025). Coverage of AI's impact on journalism and publisher strategies.

Tim Green

UK-based Systems Theorist & Independent Technology Writer

Tim explores the intersections of artificial intelligence, decentralised cognition, and posthuman ethics. His work, published at smarterarticles.co.uk, challenges dominant narratives of technological progress while proposing interdisciplinary frameworks for collective intelligence and digital stewardship.

His writing has been featured on Ground News and shared by independent researchers across both academic and technological communities.

ORCID: 0009-0002-0156-9795 Email: tim@smarterarticles.co.uk

Discuss...

#HumanInTheLoop #MediaEthics #AITraining #TruthProtection

The internet's vast expanse of public data has become the new gold rush territory for artificial intelligence. Yet unlike the Wild West prospectors of old, today's data miners face a peculiar challenge: how to extract value whilst maintaining moral authority. As AI systems grow increasingly sophisticated and data-hungry, companies in the web scraping industry are discovering that ethical frameworks aren't just regulatory necessities—they're becoming powerful competitive advantages. Through strategic coalition-building and proactive standard-setting, a new model is emerging that could fundamentally reshape how we think about data ownership, AI training, and digital responsibility.

The Infrastructure Behind Modern Data Collection

The web scraping industry operates at a scale that defies easy comprehension. Modern data collection services maintain vast networks of proxy servers across the globe, creating what amounts to digital nervous systems capable of gathering web data at unprecedented velocity and volume. This infrastructure represents more than mere technical capability—it's the foundation upon which modern AI systems are built.

The industry's approach extends far beyond traditional web scraping. Contemporary data collection services leverage machine learning algorithms to navigate increasingly sophisticated anti-bot defences, whilst simultaneously ensuring compliance with website terms of service and local regulations. This technological sophistication allows them to process millions of requests daily, transforming the chaotic landscape of public web data into structured, usable datasets.

Yet scale alone doesn't guarantee success in today's market. The sheer volume of data that modern collection services can access has created new categories of responsibility. When infrastructure can theoretically scrape entire websites within hours, the question isn't whether companies can—it's whether they should. This realisation has driven the industry to position ethics not as a constraint on operations, but as a core differentiator in an increasingly crowded marketplace.

The technical architecture that enables such massive data collection also creates unique opportunities for implementing ethical safeguards at scale. Leading companies have integrated compliance checks directly into their scraping workflows, automatically flagging potential violations before they occur. This proactive approach represents a significant departure from the reactive compliance models that have traditionally dominated the industry.
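As a concrete illustration, a pre-flight compliance check of this kind can be built on Python's standard `urllib.robotparser`; the helper below is a minimal sketch under that assumption, not any particular vendor's implementation, and the bot name is invented:

```python
from urllib.robotparser import RobotFileParser

def build_policy_checker(robots_txt: str, user_agent: str):
    """Parse a site's robots.txt once and return a reusable pre-flight check."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def is_allowed(path: str) -> bool:
        # Flag disallowed paths before any request is ever issued.
        return parser.can_fetch(user_agent, path)

    return is_allowed

# A hypothetical site that closes /private/ to all crawlers.
robots = """User-agent: *
Disallow: /private/
"""
check = build_policy_checker(robots, "ExampleBot/1.0")
```

In a production workflow the same check would run against the live robots.txt of each target domain, with disallowed URLs flagged before the request queue is even built.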

The Rise of Industry Self-Regulation

In 2024, the web scraping industry witnessed the formation of the Ethical Web Data Collection Initiative (EWDCI), a move that signals something more ambitious than traditional industry collaboration. Rather than simply responding to existing regulations, the EWDCI represents an attempt to shape the very definition of ethical data collection before governments and courts establish their own frameworks.

The initiative brings together companies across the data ecosystem, from collection specialists to AI developers and academic researchers. This broad coalition suggests a recognition that the ethical challenges of data collection can't be solved by individual companies operating in isolation. Instead, the industry appears to be moving towards a model of collective self-regulation, where shared standards create both accountability and competitive protection.

The timing of the EWDCI's formation is particularly significant. As artificial intelligence capabilities continue to expand rapidly, the legal and regulatory landscape struggles to keep pace. By establishing industry-led ethical frameworks now, companies are positioning themselves to influence future regulations rather than merely react to them. This proactive stance could prove invaluable as governments worldwide grapple with how to regulate AI development and data usage.

The initiative also serves a crucial public relations function. As concerns about AI bias, privacy violations, and data misuse continue to mount, companies that can demonstrate genuine commitment to ethical practices gain significant advantages in public trust and customer acquisition. The EWDCI provides a platform for members to showcase their ethical credentials whilst working collectively to address industry-wide challenges.

However, the success of such initiatives ultimately depends on their ability to create meaningful change rather than simply providing cover for business as usual. The EWDCI will need to demonstrate concrete impacts on industry practices to maintain credibility with both regulators and the public.

ESG Integration in the Data Economy

The web scraping industry has made a deliberate choice to integrate ethical data practices into broader Environmental, Social, and Governance (ESG) strategies, aligning with Global Reporting Initiative (GRI) standards. This integration represents more than corporate window dressing—it signals a fundamental shift in how data companies view their role in the broader economy.

By framing ethical data collection as an ESG issue, companies connect their practices to the broader movement towards sustainable and responsible business operations. This positioning appeals to investors increasingly focused on ESG criteria, whilst also demonstrating to customers and partners that ethical considerations are embedded in core business strategy rather than treated as an afterthought.

Recent industry impact reports explicitly link data collection practices to broader social responsibility goals. This approach reflects a growing recognition that data companies can't separate their technical capabilities from their social impact. As AI systems trained on web data increasingly influence everything from hiring decisions to criminal justice outcomes, the ethical implications of data collection practices become impossible to ignore.

The ESG framework also provides companies with a structured approach to measuring and reporting on their ethical progress. Rather than making vague commitments to “responsible data use,” they can point to specific metrics and improvements aligned with internationally recognised standards. This measurability makes their ethical claims more credible whilst providing clear benchmarks for continued improvement.

The integration of ethics into ESG reporting also serves a defensive function. As regulatory scrutiny of data practices increases globally, companies that can demonstrate proactive ethical frameworks and measurable progress are likely to face less aggressive regulatory intervention. This positioning could prove particularly valuable as the European Union continues to expand its digital regulations beyond GDPR.

Innovation and Intellectual Property Challenges

The web scraping industry has accumulated substantial intellectual property portfolios related to data collection and processing technologies, creating competitive advantages whilst raising important questions about how intellectual property rights interact with ethical data practices.

Industry patents cover everything from advanced proxy rotation techniques to AI-powered data extraction algorithms. This intellectual property serves multiple functions: protecting competitive advantages, creating potential revenue streams through licensing, and establishing credentials as genuine innovators rather than mere service providers.

Yet patents in the data collection space also create potential ethical dilemmas. When fundamental techniques for accessing public web data are locked behind patent protections, smaller companies and researchers may find themselves unable to compete or conduct important research. This dynamic could potentially concentrate power among a small number of large data companies, undermining the democratic potential of open web data.

The industry appears to be navigating this tension by focusing patent strategies on genuinely innovative techniques rather than attempting to patent basic web scraping concepts. AI-driven scraping assistants, for example, represent novel approaches to automated data collection that arguably deserve patent protection. This selective approach suggests an awareness of the broader implications of intellectual property in the data space.

Innovation focus also extends to developing tools that make ethical data collection more accessible to smaller players. By creating standardised APIs and automated compliance tools, larger companies are potentially democratising access to sophisticated data collection capabilities whilst ensuring those capabilities are used responsibly.

AI as Driver and Tool

The relationship between artificial intelligence and data collection has become increasingly symbiotic. AI systems require vast amounts of training data, driving unprecedented demand for web scraping services. Simultaneously, AI technologies are revolutionising how data collection itself is performed, enabling more sophisticated and efficient extraction techniques.

Leading companies have positioned themselves at the centre of this convergence. AI-driven scraping assistants can adapt to changing website structures in real-time, automatically adjusting extraction parameters to maintain data quality. This adaptive capability is crucial as websites deploy increasingly sophisticated anti-scraping measures, creating an ongoing technological arms race.
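The adaptive behaviour described is typically driven by learned models; the sketch below substitutes a far simpler stand-in, an ordered list of fallback extraction patterns, purely to illustrate the idea of degrading gracefully when page markup changes. The patterns and the target field are invented for the example:

```python
import re

# Illustrative fallback strategies: each attempts to locate an article
# headline in raw HTML, from most specific markup to least specific.
STRATEGIES = [
    re.compile(r'<h1[^>]*class="headline"[^>]*>(.*?)</h1>', re.S),
    re.compile(r'<h1[^>]*>(.*?)</h1>', re.S),
    re.compile(r'<title>(.*?)</title>', re.S),
]

def extract_headline(html: str):
    """Try each strategy in order, falling back as the page structure shifts."""
    for pattern in STRATEGIES:
        match = pattern.search(html)
        if match:
            return match.group(1).strip()
    return None
```

An adaptive assistant replaces the fixed list with learned selectors, but the contract is the same: when the preferred extraction path breaks, fall back rather than fail.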

The scale of modern AI training requirements has fundamentally changed the data collection landscape. Where traditional web scraping might have focused on specific datasets for particular business purposes, AI training demands comprehensive, diverse data across multiple domains and languages. This shift has driven companies to develop infrastructure capable of collecting data at internet scale.

However, the AI revolution also intensifies ethical concerns about data collection. When scraped data is used to train AI systems that could influence millions of people's lives, the stakes of ethical data collection become dramatically higher. A biased or incomplete dataset doesn't just affect one company's business intelligence—it could perpetuate discrimination or misinformation at societal scale.

This realisation has driven the development of AI-powered tools for identifying and addressing potential bias in collected datasets. By using machine learning to analyse data quality and representativeness, companies are attempting to ensure that their services contribute to more equitable AI development rather than amplifying existing biases.

The Democratisation Paradox

The rise of large-scale data collection services creates a fascinating paradox around AI democratisation. On one hand, these services make sophisticated data collection capabilities available to smaller companies and researchers who couldn't afford to build such infrastructure themselves. This accessibility could potentially level the playing field in AI development.

On the other hand, the concentration of data collection capabilities among a small number of large providers could create new forms of gatekeeping. If access to high-quality training data becomes dependent on relationships with major data brokers, smaller players might find themselves increasingly disadvantaged despite the theoretical availability of these services.

Industry leaders appear aware of this tension and have made efforts to address it through their pricing models and service offerings. By providing scalable solutions that can accommodate everything from academic research projects to enterprise AI training, they're attempting to ensure that access to data doesn't become a barrier to innovation.

Participation in initiatives like the EWDCI also reflects a recognition that industry consolidation must be balanced with continued innovation and competition. By establishing shared ethical standards, major players can compete on quality and service rather than racing to the bottom on ethical considerations.

However, the long-term implications of this market structure remain unclear. As AI systems become more sophisticated and data requirements continue to grow, the barriers to entry in data collection may increase, potentially limiting the diversity of voices and perspectives in AI development.

Global Regulatory Convergence

The regulatory landscape for data collection and AI development is evolving rapidly across multiple jurisdictions. The European Union's GDPR was just the beginning of a broader global movement towards stronger data protection regulations. Countries from California to China are implementing their own frameworks, creating a complex patchwork of requirements that data collection companies must navigate.

This regulatory complexity has made proactive ethical frameworks increasingly valuable as business tools. Rather than attempting to comply with dozens of different regulatory regimes reactively, companies that establish comprehensive ethical standards can often satisfy multiple jurisdictions simultaneously whilst reducing compliance costs.

The approach of embedding ethical considerations into core business processes positions companies well for this regulatory environment. By treating ethics as a design principle rather than a compliance afterthought, they can adapt more quickly to new requirements whilst maintaining operational efficiency.

The global nature of web data collection also creates unique jurisdictional challenges. When data is collected from websites hosted in one country, processed through servers in another, and used by AI systems in a third, determining which regulations apply becomes genuinely complex. This complexity has driven companies towards adopting the highest common denominator approach—implementing privacy and ethical protections that would satisfy the most stringent regulatory requirements globally.

The convergence of regulatory approaches across different jurisdictions also suggests that ethical data practices are becoming a fundamental requirement for international business rather than a competitive advantage. Companies that fail to establish robust ethical frameworks may find themselves excluded from major markets as regulations continue to tighten.

The Economics of Ethical Data

The business case for ethical data collection has evolved significantly as the market has matured. Initially, ethical considerations were often viewed as costly constraints on business operations. However, the industry is demonstrating that ethical practices can actually create economic value through multiple channels.

Premium pricing represents one obvious economic benefit. Customers increasingly value data providers who can guarantee ethical collection methods and compliance with relevant regulations. This willingness to pay for ethical assurance allows companies to command higher prices than competitors who compete purely on cost.

Risk mitigation provides another significant economic benefit. Companies that purchase data from providers with questionable ethical practices face potential legal liability, reputational damage, and regulatory sanctions. By investing in robust ethical frameworks, data providers can offer their customers protection from these risks, creating additional value beyond the data itself.

Market access represents a third economic advantage. As major technology companies implement their own ethical sourcing requirements, data providers who can't demonstrate compliance may find themselves excluded from lucrative contracts. Proactive approaches to ethics position companies to benefit as these requirements become more widespread.

The long-term economics of ethical data collection also benefit from reduced regulatory risk. Companies that establish strong ethical practices early are less likely to face expensive regulatory interventions or forced business model changes as regulations evolve. This predictability allows for more confident long-term planning and investment.

However, the economic benefits of ethical data collection depend on market recognition and reward for these practices. If customers continue to prioritise cost over ethical considerations, companies investing in ethical frameworks may find themselves at a competitive disadvantage. The success of ethical business models ultimately depends on the market's willingness to value ethical practices appropriately.

Technical Implementation of Ethics

Translating ethical principles into technical reality requires sophisticated systems and processes. The industry has developed automated compliance checking systems that can evaluate website terms of service, assess robots.txt files, and identify potential privacy concerns in real-time. This technical infrastructure allows implementation of ethical guidelines at the scale and speed required for modern data collection operations.

AI-driven scraping assistants incorporate ethical considerations directly into their decision-making algorithms. Rather than simply optimising for data extraction efficiency, these systems balance performance against compliance requirements, automatically adjusting their behaviour to respect website policies and user privacy.

Rate limiting and respectful crawling practices are built into technical infrastructure at the protocol level. Systems automatically distribute requests across proxy networks to avoid overwhelming target websites, whilst respecting crawl delays and other technical restrictions. This approach demonstrates how ethical considerations can be embedded in the fundamental architecture of data collection systems.
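A per-domain crawl-delay enforcer of the kind described can be sketched in a few lines; the class name and the injectable clock are illustrative choices made for testability, not a real library API:

```python
import time
from collections import defaultdict

class DomainRateLimiter:
    """Enforce a minimum delay between successive requests to one domain."""

    def __init__(self, crawl_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.crawl_delay = crawl_delay
        self.clock = clock  # injectable so behaviour can be tested without waiting
        self.sleep = sleep
        self.last_request = defaultdict(lambda: float("-inf"))

    def wait(self, domain):
        """Block until the domain's crawl delay has elapsed; return seconds paused."""
        elapsed = self.clock() - self.last_request[domain]
        pause = max(0.0, self.crawl_delay - elapsed)
        if pause > 0:
            self.sleep(pause)
        self.last_request[domain] = self.clock()
        return pause
```

Distributing requests across a proxy pool sits on top of a limiter like this, but the minimum-delay contract per target domain stays the same.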

Data anonymisation and privacy protection techniques are applied automatically during the collection process. Personal identifiers are stripped from collected data streams, and sensitive information is flagged for additional review before being included in customer datasets. This proactive approach to privacy protection reduces the risk of inadvertent violations whilst ensuring data utility is maintained.
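As a toy illustration of identifier stripping during collection, the patterns below catch only obvious e-mail addresses and phone-like strings; production anonymisation combines many more detectors plus review of flagged items, and these regexes are assumptions for the sketch:

```python
import re

# Illustrative patterns only; real systems layer many detection methods.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace e-mail addresses and phone-like strings with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text
```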

The technical implementation of ethical guidelines also includes comprehensive logging and audit capabilities. Every data collection operation is recorded with sufficient detail to demonstrate compliance with relevant regulations and ethical standards. This audit trail provides both legal protection and the foundation for continuous improvement of ethical practices.
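At its simplest, such an audit trail is an append-only log of structured records, one JSON line per collection operation; the record fields here are invented for illustration rather than drawn from any real system:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class CrawlAuditRecord:
    """One immutable entry in the collection audit trail."""
    timestamp: str
    url: str
    robots_allowed: bool
    status: str

def to_audit_line(record: CrawlAuditRecord) -> str:
    """Serialise a record as one JSON line for an append-only log file."""
    return json.dumps(asdict(record), sort_keys=True)

line = to_audit_line(CrawlAuditRecord(
    timestamp="2024-06-01T12:00:00Z",
    url="https://example.com/page",
    robots_allowed=True,
    status="fetched",
))
```

Because each line is self-describing and never rewritten, the log can later be replayed to demonstrate compliance or to spot where a policy check should have fired.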

Industry Transformation and Future Models

The data collection industry is undergoing fundamental transformation as ethical considerations become central to business strategy rather than peripheral concerns. Traditional models based purely on technical capability and cost competition are giving way to more sophisticated approaches that integrate ethics, compliance, and social responsibility.

The formation of industry coalitions like the EWDCI and the Dataset Providers Alliance represents a recognition that individual companies can't solve ethical challenges in isolation. These collaborative approaches suggest that the industry is moving towards shared standards and mutual accountability mechanisms that could fundamentally change competitive dynamics.

New business models are emerging that explicitly monetise ethical value. Companies are beginning to charge premium prices for “ethically sourced” data, creating market incentives for responsible practices. This trend could drive a race to the top in ethical standards rather than the race to the bottom that has traditionally characterised technology markets.

The integration of ethical considerations into corporate governance and reporting structures suggests that these changes are more than temporary marketing tactics. Companies are making institutional commitments to ethical practices that would be difficult and expensive to reverse, indicating genuine transformation rather than superficial adaptation.

However, the success of these new models depends on continued market demand for ethical practices and regulatory pressure to maintain high standards. If economic pressures intensify or regulatory attention shifts elsewhere, the industry could potentially revert to less ethical practices unless these new approaches prove genuinely superior in business terms.

The Measurement Challenge

One of the most significant challenges facing the ethical data movement is developing reliable methods for measuring and comparing ethical practices across different companies and approaches. Unlike technical performance metrics, ethical considerations often involve subjective judgements and trade-offs that resist simple quantification.

The industry has attempted to address this challenge by aligning ethical reporting with established ESG frameworks and GRI standards. This approach provides external credibility and comparability whilst ensuring that ethical claims can be independently verified. However, the application of general ESG frameworks to the specific challenges of data collection remains an evolving art rather than an exact science.

Industry initiatives are working to develop more specific metrics and benchmarks for ethical data collection practices. These efforts could eventually create standardised reporting requirements that allow customers and regulators to make informed comparisons between different providers. However, the development of such standards requires careful balance between specificity and flexibility to accommodate different business models and use cases.

The measurement challenge is complicated by the global nature of data collection operations. Practices that are considered ethical in one jurisdiction may be problematic in another, making universal standards difficult to establish. Companies operating internationally must navigate these differences whilst maintaining consistent ethical standards across their operations.

External verification and certification programmes are beginning to emerge as potential solutions to the measurement challenge. Third-party auditors could potentially provide independent assessment of companies' ethical practices, similar to existing financial or environmental auditing services. However, the development of expertise and standards for such auditing remains in early stages.

Technological Arms Race and Ethical Implications

The ongoing technological competition between data collectors and website operators creates complex ethical dynamics. As websites deploy increasingly sophisticated anti-scraping measures, data collection companies respond with more advanced circumvention techniques. This arms race raises questions about the boundaries of ethical data collection and the rights of website operators to control access to their content.

Leading companies' approach to this challenge emphasises transparency and communication with website operators. Rather than simply attempting to circumvent all technical restrictions, they advocate for clear policies and dialogue about acceptable data collection practices. This approach recognises that sustainable data collection requires some level of cooperation rather than purely adversarial relationships.

The development of AI-powered scraping tools also raises new ethical questions about the automation of decision-making in data collection. When AI systems make real-time decisions about what data to collect and how to collect it, ensuring ethical compliance becomes more complex. These systems must be trained not just for technical effectiveness but also for ethical behaviour.

The scale and speed of modern data collection create additional ethical challenges. When systems can extract massive amounts of data in very short timeframes, the potential for unintended consequences increases dramatically. The industry has implemented various safeguards to prevent accidental overloading of target websites, but continues to grapple with these challenges.

The global nature of web data collection also complicates the technological arms race. Techniques that are legal and ethical in one jurisdiction may violate laws or norms in others, creating complex compliance challenges for companies operating internationally.

Future Implications and Market Evolution

The industry model of proactive ethical standard-setting and coalition-building could represent the beginning of a broader transformation in how technology companies approach regulation and social responsibility. Rather than waiting for governments to impose restrictions, forward-thinking companies are attempting to shape the regulatory environment through voluntary initiatives and industry self-regulation.

This approach could prove particularly valuable in rapidly evolving technology sectors where traditional regulatory processes struggle to keep pace with innovation. By establishing ethical frameworks ahead of formal regulation, companies can potentially avoid more restrictive government interventions whilst maintaining public trust and social license to operate.

The success of ethical data collection as a business model could also influence other technology sectors facing similar challenges around AI, privacy, and social responsibility. If companies can demonstrate that ethical practices create genuine competitive advantages, other industries may adopt similar approaches to proactive standard-setting and collaborative governance.

However, the long-term viability of industry self-regulation remains uncertain. Without external enforcement mechanisms, voluntary ethical frameworks may prove insufficient to address serious violations or prevent races to the bottom during economic downturns. The ultimate test of initiatives like the EWDCI will be their ability to maintain high standards even when compliance becomes economically challenging.

The global expansion of AI capabilities and applications will likely increase pressure on data collection companies to demonstrate ethical practices. As AI systems become more influential in society, the ethical implications of training data quality and collection methods will face greater scrutiny from both regulators and the public.

Conclusion: The New Data Social Contract

The emergence of ethical data collection models represents more than a business strategy—it signals the beginning of a new social contract around data collection and AI development. This contract recognises that the immense power of modern data collection technologies comes with corresponding responsibilities to society, users, and the broader digital ecosystem.

The traditional approach of treating data collection as a purely technical challenge, subject only to legal compliance requirements, is proving inadequate for the AI era. The scale, speed, and societal impact of modern AI systems demand more sophisticated approaches that integrate ethical considerations into the fundamental design of data collection infrastructure.

Industry initiatives like the EWDCI represent experiments in collaborative governance that could reshape how technology sectors address complex social challenges. By bringing together diverse stakeholders to develop shared standards, these initiatives attempt to create accountability mechanisms that go beyond individual corporate policies or government regulations.

The economic viability of ethical data collection will ultimately determine whether these new approaches become standard practice or remain niche strategies. Early indicators suggest that markets are beginning to reward ethical practices, but the long-term sustainability of this trend depends on continued customer demand and regulatory support.

As artificial intelligence continues to reshape society, the companies that control access to training data will wield enormous influence over the direction of technological development. The emerging ethical data collection model suggests one path towards ensuring that this influence is exercised responsibly, but the ultimate success of such approaches will depend on broader social and economic forces that extend far beyond any individual company or industry initiative.

The stakes of this transformation extend beyond business success to fundamental questions about how democratic societies govern emerging technologies. The data collection industry's embrace of proactive ethical frameworks could provide a template for other technology sectors grappling with similar challenges, potentially offering an alternative to the adversarial relationships that often characterise technology regulation.

Whether ethical data collection models prove sustainable and scalable remains to be seen, but their emergence signals a recognition that the future of AI development depends not just on technical capabilities but on the social trust and legitimacy that enable those capabilities to be deployed responsibly. In an era where data truly is the new oil, companies are discovering that ethical extraction practices aren't just morally defensible—they may be economically essential.


References and Further Information

Primary Sources:

– Oxylabs 2024 Impact Report: Focus on Ethical Data Collection and ESG Integration
– Ethical Web Data Collection Initiative (EWDCI) founding documents and principles
– Global Reporting Initiative (GRI) standards for ESG reporting
– Dataset Providers Alliance documentation and industry collaboration materials

Industry Analysis:

– “Is Open Source the Best Path Towards AI Democratization?” Medium analysis on data licensing challenges
– LinkedIn professional discussions on AI ethics and data collection standards
– Industry reports on the convergence of ESG investing and technology sector responsibility

Regulatory and Legal Framework:

– European Union General Data Protection Regulation (GDPR) and its implications for data collection
– California Consumer Privacy Act (CCPA) and state-level data protection trends
– International regulatory developments in AI governance and data protection

Technical and Academic Sources:

– Research on automated compliance systems for web data collection
– Academic studies on bias detection and mitigation in large-scale datasets
– Technical documentation on proxy networks and distributed data collection infrastructure

Further Reading:

– Analysis of industry self-regulation models in technology sectors
– Studies on the economic value of ethical business practices in data-driven industries
– Research on the intersection of intellectual property rights and open data initiatives
– Examination of collaborative governance models in emerging technology regulation


Tim Green

UK-based Systems Theorist & Independent Technology Writer

Tim explores the intersections of artificial intelligence, decentralised cognition, and posthuman ethics. His work, published at smarterarticles.co.uk, challenges dominant narratives of technological progress while proposing interdisciplinary frameworks for collective intelligence and digital stewardship.

His writing has been featured on Ground News and shared by independent researchers across both academic and technological communities.

ORCID: 0009-0002-0156-9795 Email: tim@smarterarticles.co.uk

Discuss...

#HumanInTheLoop #DataEthics #AITraining #DigitalResponsibility