
The Current State of Data Scraping on the Web:
AI Bots Are Not Welcome in 2025




1. Introduction

Back in mid-2022, automated scraping of website data was a rather niche problem, well known only in some industries such as e-commerce, where competitors’ bots scraped prices and discounts to gain an edge by offering customers more attractive deals. The situation changed radically after OpenAI launched ChatGPT in November 2022. Today, the massive proliferation of unauthorized data scraping by AI companies and their suppliers continuously dominates global media headlines.

As evidenced by the recent US Senate Judiciary Committee hearing “AI Industry's Mass Ingestion of Copyrighted Works”, the unauthorized exploitation of copyrighted and proprietary creative works – including all kinds of texts, images, compositions and videos – by AI corporations to train their Large Language Models (LLMs) is an emerging economic, legal and social problem that may cause long-lasting and irreparable harm to millions of people.

According to the Database of AI Litigation, maintained by the Ethical Tech Initiative at the George Washington University, there are currently over 250 lawsuits pending in the US alone against major AI vendors for, among other things, copyright infringement, unwarranted data scraping and even the exploitation of pirated content for AI training purposes. Some experts suggest that exploiting pirated content for AI training is not only illegal but may be criminally punishable under some circumstances.

In July, Cloudflare – a leading cybersecurity company whose technical solutions are estimated to protect over 20% of global websites, including the largest websites in most countries – announced that AI bots would be blocked by default to prevent unwarranted data scraping from the websites it protects. Interestingly, in early August, Perplexity, one of the leading AI startups, was accused of scraping data from websites that expressly block AI bots by obscuring the provenance of its bots to circumvent anti-scraping controls.

Ultimately, website owners and authors of creative content today have no viable option to protect the fruits of their intellectual labor other than deploying a set of security and technical controls to ban automated bot traffic, eventually changing how the modern Internet works. This research explores how various industries protect their creative content and other intellectual property from unwarranted exploitation by AI corporations and their clandestine suppliers.


2. Methodology

In response to the above-mentioned trends and events, ImmuniWeb has recently updated its free online Website Security Test to verify whether a website is adequately protected from unauthorized data-scraping bots, including the bots of AI corporations. We used the Website Security Test to conduct this research, so its results may be replicated or expanded by other researchers.

For the purpose of this research, lists of leading financial institutions, universities, newspapers and magazines, law firms, academic journals and academic databases were used; each list is named in the corresponding subsection of Section 3.

In total, we analyzed 1,807 websites belonging to the above-mentioned entities. During the analysis, we used the following methods to test whether a website blocks AI and other data-scraping bots:

  • Web server’s response to a User Agent of a known AI bot
  • Web server’s response to a User Agent of unknown bots *
  • Web server’s response to automated crawling evidencing a non-human behavior
  • Presence of instructions in the “robots.txt” file telling AI bots not to crawl website content
  • Presence of meta tags instructing AI bots not to crawl the content of website pages
  • Presence of a WAF or another server-side security mechanism that blocks bots
  • Use of anti-bot protection solutions such as Cloudflare that block AI bots

* excluding so-called good bots, such as Googlebot
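The “robots.txt” check from the list above can be illustrated with a minimal Python sketch built on the standard library's `urllib.robotparser`. This is not ImmuniWeb's implementation: the sample file content, the helper name `banned_ai_bots` and the crawler tokens are assumptions chosen for illustration, and the real test also covers User Agent probing, behavioral checks, meta tags and WAF detection, none of which are shown here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

# Illustrative crawler tokens (an assumption); check each vendor's
# documentation for the names its bots currently announce.
AI_BOT_TOKENS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Amazonbot"]


def banned_ai_bots(robots_txt: str, tokens: list[str], path: str = "/") -> list[str]:
    """Return the tokens that the given robots.txt forbids from fetching `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [token for token in tokens if not parser.can_fetch(token, path)]


banned = banned_ai_bots(SAMPLE_ROBOTS_TXT, AI_BOT_TOKENS)
print(banned)
```

A directive of this kind only restrains bots that choose to honor “robots.txt”, which is why the methodology combines it with server-side checks.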

The results of the research are briefly summarized below.


3. Results

3.1 Forbes: World's Best Banks 2025

43% of the websites (143 out of 329) on Forbes’s list of the World's Best Banks 2025 block AI bots and crawlers using server-side security mechanisms or network controls.

Among the websites that block AI bots, some also include directives in their “robots.txt” files instructing specific AI bots not to crawl their web content (those that say they follow the guidelines in the “robots.txt” file):

Bot Name                      Explicitly Banned
Copilot by Microsoft          27.3%
Claude by Anthropic           13.3%
Apple Intelligence by Apple   9.1%
GPTBot by OpenAI              7.7%
AmazonBot by Amazon           6.3%
Meta AI by Meta               6.3%
Perplexity by Perplexity AI   4.9%
Gemini by Google              1.4%

3.2 Shanghai Ranking: The Academic Ranking of World Universities (ARWU) 2025

36% of the websites (93 out of 255) on the Shanghai Ranking of World Universities 2025 (ARWU) block AI bots and crawlers using server-side security mechanisms or network controls.

Among the websites that block AI bots, some also include directives in their “robots.txt” files instructing specific AI bots not to crawl their web content (those that say they follow the guidelines in the “robots.txt” file):

Bot Name                      Explicitly Banned
Claude by Anthropic           26.9%
Copilot by Microsoft          22.6%
GPTBot by OpenAI              19.4%
AmazonBot by Amazon           15.1%
Perplexity by Perplexity AI   7.5%
Apple Intelligence by Apple   6.5%
Meta AI by Meta               6.5%
Gemini by Google              2.2%

3.3 Encyclopedia Britannica: World Newspapers and Magazines

83% of the websites (81 out of 98) on the Encyclopedia Britannica’s list of World Newspapers and Magazines block AI bots and crawlers using server-side security mechanisms or network controls.

Among the websites that block AI bots, some also include directives in their “robots.txt” files instructing specific AI bots not to crawl their web content (those that say they follow the guidelines in the “robots.txt” file):

Bot Name                      Explicitly Banned
GPTBot by OpenAI              61.7%
Claude by Anthropic           59.3%
Perplexity by Perplexity AI   56.8%
Gemini by Google              53.1%
AmazonBot by Amazon           45.7%
Apple Intelligence by Apple   44.4%
Meta AI by Meta               43.2%
Copilot by Microsoft          21%

3.4 Forbes: World’s Best Management Consulting Firms 2025

52% of the websites (100 out of 191) on Forbes’s list of the World’s Best Management Consulting Firms 2025 block AI bots and crawlers using server-side security mechanisms or network controls.

Among the websites that block AI bots, some also include directives in their “robots.txt” files instructing specific AI bots not to crawl their web content (those that say they follow the guidelines in the “robots.txt” file):

Bot Name                      Explicitly Banned
Copilot by Microsoft          54%
Claude by Anthropic           27%
GPTBot by OpenAI              26%
AmazonBot by Amazon           15%
Perplexity by Perplexity AI   15%
Apple Intelligence by Apple   12%
Meta AI by Meta               11%
Gemini by Google              9%

3.5 Legal 500: Top Law Firms in the United States 2025

64% of the websites (119 out of 186) on the Legal 500 list of the Top Law Firms in the United States 2025 block AI bots and crawlers using server-side security mechanisms or network controls.

Among the websites that block AI bots, some also include directives in their “robots.txt” files instructing specific AI bots not to crawl their web content (those that say they follow the guidelines in the “robots.txt” file):

Bot Name                      Explicitly Banned
Copilot by Microsoft          59.7%
Claude by Anthropic           18.5%
AmazonBot by Amazon           15.1%
GPTBot by OpenAI              10.9%
Apple Intelligence by Apple   7.6%
Meta AI by Meta               5.9%
Perplexity by Perplexity AI   3.4%
Gemini by Google              1.7%

3.6 Legal 500: Top Law Firms in France 2025

38% of the websites (75 out of 196) on the Legal 500 list of the Top Law Firms in France 2025 block AI bots and crawlers using server-side security mechanisms or network controls.

Among the websites that block AI bots, some also include directives in their “robots.txt” files instructing specific AI bots not to crawl their web content (those that say they follow the guidelines in the “robots.txt” file):

Bot Name                      Explicitly Banned
Copilot by Microsoft          28%
Claude by Anthropic           24%
AmazonBot by Amazon           21.3%
GPTBot by OpenAI              14.7%
Meta AI by Meta               12%
Apple Intelligence by Apple   9.3%
Gemini by Google              1.3%
Perplexity by Perplexity AI   1.3%

3.7 Legal 500: Top Law Firms in England 2025

63% of the websites (304 out of 481) on the Legal 500 list of the Top Law Firms in England 2025 block AI bots and crawlers using server-side security mechanisms or network controls.

Among the websites that block AI bots, some also include directives in their “robots.txt” files instructing specific AI bots not to crawl their web content (those that say they follow the guidelines in the “robots.txt” file):

Bot Name                      Explicitly Banned
Copilot by Microsoft          34.2%
Claude by Anthropic           28.6%
GPTBot by OpenAI              16.8%
AmazonBot by Amazon           14.5%
Meta AI by Meta               9.9%
Perplexity by Perplexity AI   7.2%
Apple Intelligence by Apple   6.6%
Gemini by Google              4.3%

3.8 Top Academic Journals

74% of the websites (25 out of 34) on the list of the top academic journals block AI bots and crawlers using server-side security mechanisms or network controls.

Among the websites that block AI bots, some also include directives in their “robots.txt” files instructing specific AI bots not to crawl their web content (those that say they follow the guidelines in the “robots.txt” file):

Bot Name                      Explicitly Banned
AmazonBot by Amazon           32%
GPTBot by OpenAI              32%
Claude by Anthropic           28%
Perplexity by Perplexity AI   28%
Apple Intelligence by Apple   24%
Meta AI by Meta               24%
Gemini by Google              12%
Copilot by Microsoft          8%

3.9 Top Academic Research Databases

73% of the websites (27 out of 37) on the list of the top academic research databases block AI bots and crawlers using server-side security mechanisms or network controls.

Among the websites that block AI bots, some also include directives in their “robots.txt” files instructing specific AI bots not to crawl their web content (those that say they follow the guidelines in the “robots.txt” file):

Bot Name                      Explicitly Banned
GPTBot by OpenAI              48.1%
AmazonBot by Amazon           37%
Claude by Anthropic           37%
Gemini by Google              29.6%
Meta AI by Meta               25.9%
Copilot by Microsoft          25.9%
Apple Intelligence by Apple   22.2%
Perplexity by Perplexity AI   22.2%

4. Aggregated Results and Infographics

The table below compares how aggressively and proactively different industries ban AI bots from accessing creative content on their websites:

List of Entities                                                          Ban AI Bots
Encyclopedia Britannica: World Newspapers and Magazines                   83%
Top Academic Journals                                                     74%
Top Academic Research Databases                                           73%
Legal 500: Top Law Firms in the United States 2025                        64%
Legal 500: Top Law Firms in England 2025                                  63%
Forbes: World’s Best Management Consulting Firms 2025                     52%
Forbes: World's Best Banks 2025                                           43%
Legal 500: Top Law Firms in France 2025                                   38%
Shanghai Ranking: The Academic Ranking of World Universities 2025 (ARWU)  36%

Diagram 1: Percentage of organizations from the research that block AI bots
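The percentages in this comparison follow directly from the counts reported in Section 3. A short Python sketch, with the counts copied from the subsections above and an illustrative helper name `block_rate`, recomputes and ranks them. Rounding to the nearest whole percent is an assumption about how the published figures were derived, although it reproduces every value in the table.

```python
# (websites blocking AI bots, websites analyzed) per list, copied from Section 3.
COUNTS = {
    "Forbes: World's Best Banks 2025": (143, 329),
    "Shanghai Ranking: ARWU 2025": (93, 255),
    "Encyclopedia Britannica: World Newspapers and Magazines": (81, 98),
    "Forbes: World's Best Management Consulting Firms 2025": (100, 191),
    "Legal 500: Top Law Firms in the United States 2025": (119, 186),
    "Legal 500: Top Law Firms in France 2025": (75, 196),
    "Legal 500: Top Law Firms in England 2025": (304, 481),
    "Top Academic Journals": (25, 34),
    "Top Academic Research Databases": (27, 37),
}


def block_rate(blocking: int, total: int) -> int:
    """Share of websites that block AI bots, as a whole-number percentage."""
    return round(100 * blocking / total)


rates = {name: block_rate(*counts) for name, counts in COUNTS.items()}
total_sites = sum(total for _, total in COUNTS.values())

# Print the lists from most to least restrictive.
for name, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{rate:>3}%  {name}")
print(f"Websites analyzed in total: {total_sites}")
```

The total printed at the end matches the 1,807 websites stated in the Methodology section.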

Below is a table that compares which bots are explicitly banned by the entities used in this research, evidencing particular concerns over some AI companies and their data collection or use practices:

List of Bots                  Explicitly Banned
Copilot by Microsoft          34.7%
Claude by Anthropic           27.2%
GPTBot by OpenAI              20.8%
AmazonBot by Amazon           17.7%
Meta AI by Meta               12.4%
Apple Intelligence by Apple   11.9%
Perplexity by Perplexity AI   11.9%
Gemini by Google              8.6%

Diagram 2: Percentage of organizations from the research that explicitly ban specific AI bots
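The aggregated figures appear to be pooled over all websites that block AI bots, rather than a simple average of the nine per-list percentages. The Python sketch below tests that reading for two bots. It rests on two assumptions: that the per-list percentages are shares of the blocking websites in each list, and that the underlying site counts can be recovered by rounding. The helper name `aggregate_pct` is illustrative.

```python
# Websites that block AI bots, in the order of Sections 3.1 to 3.9.
BLOCKING = [143, 93, 81, 100, 119, 75, 304, 25, 27]

# Per-list share (%) of blocking websites that explicitly ban each bot, same order.
COPILOT_PCT = [27.3, 22.6, 21.0, 54.0, 59.7, 28.0, 34.2, 8.0, 25.9]
GEMINI_PCT = [1.4, 2.2, 53.1, 9.0, 1.7, 1.3, 4.3, 12.0, 29.6]


def aggregate_pct(blocking: list[int], pcts: list[float]) -> float:
    """Pool per-list percentages into one overall percentage.

    Site counts are reconstructed by rounding (an assumption), summed,
    and divided by the total number of blocking websites.
    """
    counts = [round(n * p / 100) for n, p in zip(blocking, pcts)]
    return round(100 * sum(counts) / sum(blocking), 1)


copilot_total = aggregate_pct(BLOCKING, COPILOT_PCT)
gemini_total = aggregate_pct(BLOCKING, GEMINI_PCT)
print(copilot_total, gemini_total)
```

Under these assumptions the sketch returns 34.7 for Copilot and 8.6 for Gemini, matching the table above.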


5. Observations and Conclusion

It is important to consider the following observations and possible future developments:

  • Enacted legislation – including the EU AI Act – provides little to no protection for authors of creative content, eventually creating a huge emerging market for anti-bot solutions and data scraping protection services.
  • Pending copyright infringement lawsuits around the globe are unlikely to change the surging trend of technical self-defense against bots, regardless of whether courts eventually rule in favor of the plaintiffs or the AI corporations.
  • Instead of relying on unclear and uncertain protection under the enacted copyright law, authors now update the Terms of Service of their websites to expressly prohibit data scraping and any use of their content for AI training purposes, relying on a breach-of-contract claim under well-established and time-tested contract law in case of infringement.
  • Many AI corporations not expressly mentioned in this research – including ByteDance and DeepSeek from China – conceal their data collection practices, making it impossible to protect websites from their disguised crawlers by providing instructions in the “robots.txt” file, and eventually requiring user behavior analytics and other techniques to block their stealth bots.
  • Many AI corporations use clandestine external entities and offshore companies to outsource and eventually obfuscate their massive data scraping programs, flatly denying their involvement in any illicit or unethical data scraping activities.
  • Data suppliers of AI companies may start impersonating the data scraping bots of well-known AI corporations, such as Meta or OpenAI, scraping web data with the latter’s User Agents and ultimately framing the impersonated megacorporations.
  • As per ImmuniWeb’s proprietary honeypot data, there has been a spike in automated web traffic from countries like Iran and China since January 2025, possibly evidencing that data scraping activities take place from these remote jurisdictions to avoid legal action or prosecution in the US and Europe.
  • A ballooning number of companies use GenAI to create synthetic web content to rank higher in Google and authorize AI bots to crawl this GenAI-created content without restrictions. However, the use of synthetic data for AI training is not just virtually useless but may be harmful to the so-called intelligence of LLMs.
  • There is a strong chance that the current business model of many AI corporations – based on the massive misappropriation of proprietary data belonging to third parties without permission and without duly paying for it – will disappear in the next few years and even push some AI vendors out of business.
  • Emerging content licensing agreements between some AI corporations and groups of victimized authors will probably not last long given that authors are frequently and flagrantly underpaid, being roped into such deals by “get the pennies or nothing” offers of aggressive law firms that represent AI corporations.
  • Payment of a fair and reasonable price for third-party creative content on a regular basis – to maintain the accuracy of LLMs’ intelligence – may be cost-prohibitive for many AI corporations because their own prices for chatbots and other AI solutions would then skyrocket, eventually making human labor a more cost-efficient option.

Dr. Ilia Kolochenko, Chief Architect & CEO at ImmuniWeb, concludes: “While largest AI corporations pay law firms many millions to defend or settle the mushrooming copyright infringement lawsuits, AI fatigue and disillusionment are rapidly mounting across almost all industries and sectors of economy. Despite billions invested in AI, we are still very far from creating Artificial General Intelligence (AGI) that was among the key fundraising promises of some AI companies. Furthermore, whoever prevails in the now-pending copyright battles in courts on both sides of the Atlantic, many AI corporations are inevitably poised to face serious challenges in the near future and will likely be compelled to change their current business model. Recent revelations about the massive and deliberate exploitation of pirated content for LLM training by AI companies – are just the tip of the iceberg of unfolding exposure of the systemic misconduct.

Today, both in Europe and the US, copyright owners and authors of creative content are left without a sound protection of their intellectual property under the enacted copyright law, which urgently requires major overhaul to better reflect modern realities. Moreover, even emerging AI legislation, such as the EU AI Act, is simply inefficient and ineffective to protect fruits of intellectual labor from massive misappropriation by AI corporations. Worst, both foreign AI companies and offshore suppliers of Western AI corporations tend to knowingly ignore Western legislation, while their behavior will quite unlikely change in the near future.

Ultimately, authors and copyright owners pragmatically decided to defend their intellectual property themselves by erecting formidable technical fences and security barriers, making unauthorized data scraping prohibitively expensive or technically impossible. Finally, a spike of breach-of-contract lawsuits is likely coming for violation of websites’ novel Terms of Service, but this time, the culprits will probably have to pay (both their human lawyers – that the former so vigorously promised to replace with AI – and copyright owners around the globe).”

The Website Security Test used in this research is part of ImmuniWeb’s award-winning Community Edition, which currently runs over 100,000 daily security scans in over 100 countries. Statistical data from the Community Edition has been utilized in the Verizon Data Breach Investigations Report (DBIR), to which ImmuniWeb is a Contributor, as well as in the strategic partnerships that ImmuniWeb has with various NGOs and international organizations, including the UN ITU.


Use and distribution: you are welcome to utilize the above-mentioned content for non-commercial purposes if you make a clear attribution to ImmuniWeb, with a backlink to this page when practical. In case of doubt, please contact us.
