The Current State of Data Scraping on the Web:
AI Bots Are Not Welcome in 2025

Introduction
Methodology
Results
Aggregated Results and Infographics
Observations and Conclusion

1. Introduction

Back in mid-2022, automated scraping of website data was a rather niche problem, well known only in industries such as ecommerce, where competitors’ bots scraped prices or discounts to gain an edge by offering more attractive deals to their customers. The situation changed radically after OpenAI launched ChatGPT in November 2022. Today, the massive proliferation of unauthorized data scraping by AI companies and their suppliers continuously dominates global media headlines.

As evidenced by a recent hearing “AI Industry's Mass Ingestion of Copyrighted Works” at the US Senate Judiciary Committee, unauthorized exploitation of copyrighted and proprietary creative works – including all kinds of texts, images, compositions and videos – by AI corporations for the training of their Large Language Models (LLMs) is an emerging economic, legal and social problem that may cause long-lasting and irreparable harm to millions of people.

According to the Database of AI Litigation, maintained by the Ethical Tech Initiative at the George Washington University, as of today, there are over 250 pending lawsuits in the US alone against major AI vendors for, among other things, copyright infringement, unwarranted data scraping and even exploitation of pirated content for AI training purposes. Some experts suggest that exploitation of pirated content for AI training is not only illegal but may be criminally punishable under some circumstances.

In July, Cloudflare – a leading cybersecurity company whose technical solutions are estimated to protect over 20% of global websites, including the largest websites in most countries – announced that AI bots would be blocked by default to prevent unwarranted data scraping from websites protected by Cloudflare. Interestingly, in early August, Perplexity, one of the leading AI startups, was accused of scraping data from websites that expressly block AI bots by obscuring the provenance of its bots to circumvent the anti-scraping controls.

Ultimately, website owners and authors of creative content today have no viable option to protect the fruits of their intellectual labor other than deploying a set of security and technical controls to ban automated traffic from bots, eventually changing how the modern Internet works. This research explores how various industries protect their creative content and other intellectual property from unwarranted exploitation by AI corporations and their clandestine suppliers.

2. Methodology

In response to the above-mentioned trends and events, ImmuniWeb has recently updated its free online Website Security Test to verify whether a website is adequately protected from unauthorized data-scraping bots, including bots of AI corporations. We used the Website Security Test to conduct this research, so its results may be replicated or expanded by other researchers.

For the purpose of this research, the following lists of leading financial institutions, universities, newspapers and magazines, management consulting firms, law firms, academic journals and academic databases were used:

Forbes: World's Best Banks 2025 (329 entities)
Shanghai Ranking: The Academic Ranking of World Universities 2025 (255 entities)
Encyclopedia Britannica: World Newspapers and Magazines (98 entities)
Forbes: World’s Best Management Consulting Firms 2025 (191 entities)
Legal 500: Top Law Firms in the United States 2025 (186 entities)
Legal 500: Top Law Firms in France 2025 (196 entities)
Legal 500: Top Law Firms in England 2025 (481 entities)
Top Academic Journals in the World (34 entities)
Top Academic Research Databases (37 entities)

In total, we analyzed 1,807 websites belonging to the above-mentioned entities. During the analysis, we used the following methods to test whether a website blocks AI and other data-scraping bots:

Web server’s response to a User Agent of a known AI bot
Web server’s response to a User Agent of unknown bots *
Web server’s response to automated crawling evidencing a non-human behavior
Presence of instructions in the “robots.txt” file telling AI bots not to crawl website content
Presence of meta tags instructing AI bots not to crawl the content of website pages
Presence of a WAF or another server-side security mechanism that blocks bots
Use of anti-bot protection solutions such as Cloudflare that block AI bots

* excluding the so-called good bots, like Googlebot
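The “robots.txt” check from the list above can be sketched with Python's standard `urllib.robotparser` module. This is only a minimal illustration, not the implementation of the Website Security Test: the user-agent tokens in `AI_BOT_TOKENS`, the sample file content and the `example.com` address are assumptions chosen for demonstration, since the research does not list the exact tokens it tested.

```python
from urllib.robotparser import RobotFileParser

# Illustrative user-agent tokens (assumption: the research does not
# state which exact tokens were tested).
AI_BOT_TOKENS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Amazonbot"]

# Hypothetical robots.txt content used only for this demonstration.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def disallowed_ai_bots(robots_txt, tokens=AI_BOT_TOKENS,
                       site="https://example.com"):
    """Return the tokens that robots.txt forbids from fetching the home page."""
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines,
    # so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return [t for t in tokens if not parser.can_fetch(t, f"{site}/")]


banned = disallowed_ai_bots(SAMPLE_ROBOTS_TXT)
print(banned)
```

Note that a blanket `User-agent: *` rule with `Disallow: /` would also cause every token to be reported, so a stricter check for explicit bans would need to inspect the per-bot groups themselves. Such directives are advisory only: they have an effect solely on bots that choose to honor them.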

The results of the research are briefly summarized below.

3. Results

3.1 Forbes: World's Best Banks 2025

43% of the websites (143 out of 329) from the Forbes list of the World's Best Banks 2025 block AI bots and crawlers by server-side security mechanisms or network controls.

Among the websites that block AI bots, some also include directives in their “robots.txt” files instructing specific AI bots not to crawl their web content (the bots listed are those whose operators say that they follow the guidelines from the “robots.txt” file):

Bot Name	Explicitly Banned
Copilot by Microsoft	27.3%
Claude by Anthropic	13.3%
Apple Intelligence by Apple	9.1%
GPTBot by OpenAI	7.7%
AmazonBot by Amazon	6.3%
Meta AI by Meta	6.3%
Perplexity by Perplexity AI	4.9%
Gemini by Google	1.4%

3.2 Shanghai Ranking: The Academic Ranking of World Universities (ARWU) 2025

36% of the websites (93 out of 255) from the Shanghai Ranking of the World Universities 2025 (ARWU) block AI bots and crawlers by server-side security mechanisms or network controls.

Bot Name	Explicitly Banned
Claude by Anthropic	26.9%
Copilot by Microsoft	22.6%
GPTBot by OpenAI	19.4%
AmazonBot by Amazon	15.1%
Perplexity by Perplexity AI	7.5%
Apple Intelligence by Apple	6.5%
Meta AI by Meta	6.5%
Gemini by Google	2.2%

3.3 Encyclopedia Britannica: World Newspapers and Magazines

83% of the websites (81 out of 98) from the Encyclopedia Britannica’s list of the World Newspapers and Magazines block AI bots and crawlers by server-side security mechanisms or network controls.

Bot Name	Explicitly Banned
GPTBot by OpenAI	61.7%
Claude by Anthropic	59.3%
Perplexity by Perplexity AI	56.8%
Gemini by Google	53.1%
AmazonBot by Amazon	45.7%
Apple Intelligence by Apple	44.4%
Meta AI by Meta	43.2%
Copilot by Microsoft	21%

3.4 Forbes: World’s Best Management Consulting Firms 2025

52% of the websites (100 out of 191) from the Forbes list of the World’s Best Management Consulting Firms 2025 block AI bots and crawlers by server-side security mechanisms or network controls.

Bot Name	Explicitly Banned
Copilot by Microsoft	54%
Claude by Anthropic	27%
GPTBot by OpenAI	26%
AmazonBot by Amazon	15%
Perplexity by Perplexity AI	15%
Apple Intelligence by Apple	12%
Meta AI by Meta	11%
Gemini by Google	9%

3.5 Legal 500: Top Law Firms in the United States 2025

64% of the websites (119 out of 186) from the Legal 500 list of the Top Law Firms in the United States 2025 block AI bots and crawlers by server-side security mechanisms or network controls.

Bot Name	Explicitly Banned
Copilot by Microsoft	59.7%
Claude by Anthropic	18.5%
AmazonBot by Amazon	15.1%
GPTBot by OpenAI	10.9%
Apple Intelligence by Apple	7.6%
Meta AI by Meta	5.9%
Perplexity by Perplexity AI	3.4%
Gemini by Google	1.7%

3.6 Legal 500: Top Law Firms in France 2025

38% of the websites (75 out of 196) from the Legal 500 list of the Top Law Firms in France 2025 block AI bots and crawlers by server-side security mechanisms or network controls.

Bot Name	Explicitly Banned
Copilot by Microsoft	28%
Claude by Anthropic	24%
AmazonBot by Amazon	21.3%
GPTBot by OpenAI	14.7%
Meta AI by Meta	12%
Apple Intelligence by Apple	9.3%
Gemini by Google	1.3%
Perplexity by Perplexity AI	1.3%

3.7 Legal 500: Top Law Firms in England 2025

63% of the websites (304 out of 481) from the Legal 500 list of the Top Law Firms in England 2025 block AI bots and crawlers by server-side security mechanisms or network controls.

Bot Name	Explicitly Banned
Copilot by Microsoft	34.2%
Claude by Anthropic	28.6%
GPTBot by OpenAI	16.8%
AmazonBot by Amazon	14.5%
Meta AI by Meta	9.9%
Perplexity by Perplexity AI	7.2%
Apple Intelligence by Apple	6.6%
Gemini by Google	4.3%

3.8 Top Academic Journals

74% of the websites (25 out of 34) from the list of the top academic journals block AI bots and crawlers by server-side security mechanisms or network controls.

Bot Name	Explicitly Banned
AmazonBot by Amazon	32%
GPTBot by OpenAI	32%
Claude by Anthropic	28%
Perplexity by Perplexity AI	28%
Apple Intelligence by Apple	24%
Meta AI by Meta	24%
Gemini by Google	12%
Copilot by Microsoft	8%

3.9 Top Academic Research Databases

73% of the websites (27 out of 37) from the list of the top academic research databases block AI bots and crawlers by server-side security mechanisms or network controls.

Bot Name	Explicitly Banned
GPTBot by OpenAI	48.1%
AmazonBot by Amazon	37%
Claude by Anthropic	37%
Gemini by Google	29.6%
Meta AI by Meta	25.9%
Copilot by Microsoft	25.9%
Apple Intelligence by Apple	22.2%
Perplexity by Perplexity AI	22.2%

4. Aggregated Results and Infographics

The table below compares how aggressively and proactively each industry bans AI bots from accessing creative content on its websites:

List of Entities	Ban AI Bots
Encyclopedia Britannica: World Newspapers and Magazines	83%
Top Academic Journals in the World	74%
Top Academic Research Databases	73%
Legal 500: Top Law Firms in the United States 2025	64%
Legal 500: Top Law Firms in England 2025	63%
Forbes: World’s Best Management Consulting Firms 2025	52%
Forbes: World's Best Banks 2025	43%
Legal 500: Top Law Firms in France 2025	38%
Shanghai Ranking: The Academic Ranking of World Universities 2025 (ARWU)	36%

Diagram 1: Percentage of organizations from the research that block AI bots
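The percentages above can be re-derived from the raw counts reported in sections 3.1 to 3.9. The short sketch below recomputes each rounded rate and also pools the counts across all nine lists. The pooled figure is not stated in the research itself. It is a simple derived value, shown here only as a sanity check on the arithmetic, and the shortened list names are labels chosen for this example.

```python
# (websites blocking AI bots, websites analyzed) per list, from sections 3.1-3.9.
RESULTS = {
    "World Newspapers and Magazines": (81, 98),
    "Top Academic Journals": (25, 34),
    "Top Academic Research Databases": (27, 37),
    "Top Law Firms in the United States": (119, 186),
    "Top Law Firms in England": (304, 481),
    "World's Best Management Consulting Firms": (100, 191),
    "World's Best Banks": (143, 329),
    "Top Law Firms in France": (75, 196),
    "Academic Ranking of World Universities": (93, 255),
}

# Per-list blocking rate, rounded to a whole percent as in the table.
per_list = {
    name: round(100 * blocking / total)
    for name, (blocking, total) in RESULTS.items()
}

# Pooled rate over all analyzed websites (derived, not from the source).
total_blocking = sum(blocking for blocking, _ in RESULTS.values())
total_sites = sum(total for _, total in RESULTS.values())
pooled_rate = round(100 * total_blocking / total_sites, 1)

for name, rate in per_list.items():
    print(f"{name}: {rate}%")
print(f"Pooled: {total_blocking} of {total_sites} websites ({pooled_rate}%)")
```

The total of 1,807 websites matches the figure given in the methodology, and the per-list rates reproduce the table when rounded to whole percents.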

Below is a table that compares which bots are explicitly banned by the entities used in this research, evidencing particular concerns over some AI companies and their data collection or use practices:

List of Bots	Explicitly Banned
Copilot by Microsoft	34.7%
Claude by Anthropic	27.2%
GPTBot by OpenAI	20.8%
AmazonBot by Amazon	17.7%
Meta AI by Meta	12.4%
Apple Intelligence by Apple	11.9%
Perplexity by Perplexity AI	11.9%
Gemini by Google	8.6%

Diagram 2: Percentage of organizations from the research that explicitly ban specific AI bots
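The aggregated bot figures appear to be pooled over the websites that block AI bots. The sketch below tests that reading: it assumes each per-list percentage in sections 3.1 to 3.9 is a share of that list's blocking websites (as the wording in section 3.1 suggests), converts the percentages back into approximate site counts, and pools them. The reconstruction of counts by rounding is an assumption of this example, not a method described in the research.

```python
# Websites blocking AI bots per list, in the order of sections 3.1-3.9:
# banks, universities, newspapers, consulting, US law, France law,
# England law, academic journals, academic databases.
BLOCKING = [143, 93, 81, 100, 119, 75, 304, 25, 27]

# Share of blocking websites that explicitly ban each bot, same order.
BANNED_PCT = {
    "Copilot by Microsoft": [27.3, 22.6, 21, 54, 59.7, 28, 34.2, 8, 25.9],
    "Claude by Anthropic": [13.3, 26.9, 59.3, 27, 18.5, 24, 28.6, 28, 37],
    "GPTBot by OpenAI": [7.7, 19.4, 61.7, 26, 10.9, 14.7, 16.8, 32, 48.1],
    "AmazonBot by Amazon": [6.3, 15.1, 45.7, 15, 15.1, 21.3, 14.5, 32, 37],
    "Meta AI by Meta": [6.3, 6.5, 43.2, 11, 5.9, 12, 9.9, 24, 25.9],
    "Apple Intelligence by Apple": [9.1, 6.5, 44.4, 12, 7.6, 9.3, 6.6, 24, 22.2],
    "Perplexity by Perplexity AI": [4.9, 7.5, 56.8, 15, 3.4, 1.3, 7.2, 28, 22.2],
    "Gemini by Google": [1.4, 2.2, 53.1, 9, 1.7, 1.3, 4.3, 12, 29.6],
}


def pooled_ban_rate(percentages, blocking=BLOCKING):
    """Rebuild per-list site counts from percentages, then pool them."""
    counts = [round(p / 100 * n) for p, n in zip(percentages, blocking)]
    return round(100 * sum(counts) / sum(blocking), 1)


pooled = {bot: pooled_ban_rate(pcts) for bot, pcts in BANNED_PCT.items()}
for bot, rate in pooled.items():
    print(f"{bot}: {rate}%")
```

Under this assumption, the pooled values reproduce all eight figures in the table above, which suggests the aggregate is weighted by the number of blocking websites in each list rather than being a plain average of the nine per-list percentages.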

5. Observations and Conclusion

It is important to consider the following observations and possible future developments:

Enacted legislation – including the EU AI Act – provides little to no protection for authors of creative content, eventually creating a huge emerging market for anti-bot solutions and data scraping protection services.
Pending copyright infringement lawsuits around the globe are unlikely to change the surging trend of technical self-defense from bots, regardless of whether courts eventually rule in favor of the plaintiffs or AI corporations.
Instead of relying on unclear and uncertain protection under the enacted copyright law, authors now update the Terms of Service of their websites to expressly prohibit data scraping and any use of their content for AI training purposes, relying on breach-of-contract claims under the well-established and time-tested contract law in case of infringement.
Many AI corporations not expressly mentioned in this research, including ByteDance and DeepSeek from China, conceal their data collection practices, making it impossible to protect websites from their disguised crawlers by providing instructions in the “robots.txt” file, eventually requiring user behavior analytics and other techniques to block their stealth bots.
Many AI corporations use clandestine external entities and offshore companies to outsource and eventually obfuscate their massive data scraping programs, flatly denying their involvement in any illicit or unethical data scraping activities.
Data suppliers of AI companies may start impersonating data scraping bots of well-known AI corporations, such as Meta or OpenAI, to scrape web data by using User Agents of the latter, ultimately framing the impersonated megacorporations.
As per ImmuniWeb’s proprietary honeypot data, since January 2025, there has been a spike of automated web traffic from countries like Iran and China, possibly evidencing that data scraping activities take place from these remote jurisdictions to avoid legal actions or prosecution in the US and Europe.
A ballooning number of companies use GenAI to create synthetic web content to rank higher in Google, authorizing AI bots to crawl their GenAI-created content without restrictions. However, the use of synthetic data for AI training is not just virtually useless but may be harmful for the so-called intelligence of LLMs.
There is a strong chance that the current business model of many AI corporations – based on the massive misappropriation of proprietary data belonging to third parties without permission and without duly paying for it – will disappear in the next few years and even push some AI vendors out of business.
Emerging content licensing agreements between some AI corporations and groups of victimized authors will probably not last long given that authors are frequently and flagrantly underpaid, being roped into such deals by “get the pennies or nothing” offers of aggressive law firms that represent AI corporations.
Payment of a fair and reasonable price for third-party creative content on a regular basis – to maintain the accuracy of an LLM’s intelligence – may be cost-prohibitive for many AI corporations because their own prices for chatbots and other AI solutions will then skyrocket, eventually making human labor a more cost-efficient option.
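Several observations above note that disguised crawlers cannot be stopped by “robots.txt” or User Agent checks alone and call for behavior-based techniques. As a rough illustration of one such signal, the sketch below flags a client that sends more requests within a sliding time window than a human visitor plausibly would. The class name, thresholds and client identifiers are hypothetical, and real anti-bot products combine many more signals; this is not a description of any vendor's method.

```python
from collections import defaultdict, deque


class RequestRateHeuristic:
    """Flags a client once it exceeds max_requests within window_seconds."""

    def __init__(self, max_requests=30, window_seconds=10.0):
        # Hypothetical defaults; suitable thresholds depend on the website.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # client id -> request timestamps

    def observe(self, client_id, timestamp):
        """Record one request; return True if the client now looks automated."""
        recent = self._recent[client_id]
        recent.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) > self.max_requests


heuristic = RequestRateHeuristic(max_requests=3, window_seconds=1.0)

# A burst of five requests 0.1 seconds apart from one client.
burst_flags = [heuristic.observe("203.0.113.7", t * 0.1) for t in range(5)]

# Five requests spaced two seconds apart from another client.
slow_flags = [heuristic.observe("198.51.100.9", t * 2.0) for t in range(5)]

print(burst_flags)
print(slow_flags)
```

A request-rate signal of this kind does not depend on what the client claims to be, which is why it can still apply when a scraper spoofs the User Agent of a browser or of another company's bot.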

Dr. Ilia Kolochenko, Chief Architect & CEO at ImmuniWeb, concludes: “While the largest AI corporations pay law firms many millions to defend or settle the mushrooming copyright infringement lawsuits, AI fatigue and disillusionment are rapidly mounting across almost all industries and sectors of the economy. Despite billions invested in AI, we are still very far from creating Artificial General Intelligence (AGI) that was among the key fundraising promises of some AI companies. Furthermore, whoever prevails in the now-pending copyright battles in courts on both sides of the Atlantic, many AI corporations are inevitably poised to face serious challenges in the near future and will likely be compelled to change their current business model. Recent revelations about the massive and deliberate exploitation of pirated content for LLM training by AI companies are just the tip of the iceberg of the unfolding exposure of systemic misconduct.

Today, both in Europe and the US, copyright owners and authors of creative content are left without sound protection of their intellectual property under the enacted copyright law, which urgently requires a major overhaul to better reflect modern realities. Moreover, even emerging AI legislation, such as the EU AI Act, is simply inefficient and ineffective at protecting the fruits of intellectual labor from massive misappropriation by AI corporations. Worse, both foreign AI companies and offshore suppliers of Western AI corporations tend to knowingly ignore Western legislation, and their behavior is quite unlikely to change in the near future.

Ultimately, authors and copyright owners have pragmatically decided to defend their intellectual property themselves by erecting formidable technical fences and security barriers, making unauthorized data scraping prohibitively expensive or technically impossible. Finally, a spike of breach-of-contract lawsuits is likely coming for violation of websites’ novel Terms of Service, but this time, the culprits will probably have to pay (both their human lawyers – whom they so vigorously promised to replace with AI – and copyright owners around the globe).”

The Website Security Test used in this research is part of ImmuniWeb’s award-winning Community Edition, which currently runs over 100,000 daily security scans in over 100 countries. Statistical data from the Community Edition has been utilized in the Verizon Data Breach Investigations Report (DBIR), to which ImmuniWeb is a Contributor, as well as in strategic partnerships that ImmuniWeb has with various NGOs and international organizations including the UN ITU.

To test your website’s protection from AI bots, click on this link.

Use and distribution: you are welcome to utilize the above-mentioned content for non-commercial purposes if you make a clear attribution to ImmuniWeb, with a backlink to this page when practical. In case of doubt, please contact us.
