The Current State of Data Scraping on the Web:
AI Bots Are Not Welcome in 2025

Table of Contents
- Introduction
- Methodology
- Results
- Forbes: World's Best Banks 2025
- Shanghai Ranking: The Academic Ranking of World Universities (ARWU) 2025
- Encyclopedia Britannica: World Newspapers and Magazines
- Forbes: World’s Best Management Consulting Firms 2025
- Legal 500: Top Law Firms in the United States 2025
- Legal 500: Top Law Firms in France 2025
- Legal 500: Top Law Firms in England 2025
- Top Academic Journals
- Top Academic Research Databases
- Aggregated Results and Infographics
- Observations and Conclusion
1. Introduction
Back in mid-2022, automated scraping of data from websites was a rather niche problem, well known only in some industries such as e-commerce, where competitors’ bots scraped web data such as prices or discounts to gain an edge by offering more attractive deals to their customers. The situation changed radically after the launch of ChatGPT by OpenAI in November 2022. Today, the massive proliferation of unauthorized data scraping by AI companies and their suppliers continuously dominates global media headlines.
As evidenced by a recent US Senate Judiciary Committee hearing, “AI Industry's Mass Ingestion of Copyrighted Works”, the unauthorized exploitation of copyrighted and proprietary creative works – including all kinds of texts, images, compositions and videos – by AI corporations for the training of their Large Language Models (LLMs) is an emerging economic, legal and social problem that may cause long-lasting and irreparable harm to millions of people.
According to the Database of AI Litigation, maintained by the Ethical Tech Initiative at the George Washington University, as of today, there are over 250 pending lawsuits in the US alone against major AI vendors for, among other things, copyright infringement, unwarranted data scraping and even exploitation of pirated content for AI training purposes. Some experts suggest that exploitation of pirated content for AI training is not only illegal but may be criminally punishable under some circumstances.
In July, Cloudflare – a leading cybersecurity company whose technical solutions are estimated to protect over 20% of global websites, including the largest websites in most countries – announced that AI bots would be blocked by default to prevent unwarranted data scraping from websites protected by Cloudflare. Interestingly, in early August, Perplexity, one of the leading AI startups, was accused of scraping data from websites that expressly block AI bots by obscuring the provenance of its bots to circumvent the anti-scraping controls.
Ultimately, website owners and authors of creative content today have no viable option to protect the fruits of their intellectual labor other than deploying a set of security and technical controls to ban automated traffic from bots, eventually changing how the modern Internet works. This research explores how various industries protect their creative content and other intellectual property from unwarranted exploitation by AI corporations and their clandestine suppliers.
2. Methodology
In response to the above-mentioned trends and events, ImmuniWeb has recently updated its free online Website Security Test to verify whether a website is adequately protected from unauthorized data-scraping bots, including the bots of AI corporations. We used the Website Security Test to conduct this research, so its results may be replicated or expanded by other researchers.
For the purpose of this research, the following lists of leading financial institutions, universities, newspapers and magazines, law firms, academic journals and academic databases were used:
- Forbes: World's Best Banks 2025 (329 entities)
- Shanghai Ranking: The Academic Ranking of World Universities 2025 (255 entities)
- Encyclopedia Britannica: World Newspapers and Magazines (98 entities)
- Forbes: World’s Best Management Consulting Firms 2025 (191 entities)
- Legal 500: Top Law Firms in the United States 2025 (186 entities)
- Legal 500: Top Law Firms in France 2025 (196 entities)
- Legal 500: Top Law Firms in England 2025 (481 entities)
- Top Academic Journals in the World (34 entities)
- Top Academic Research Databases (37 entities)
In total, we analyzed 1,807 websites belonging to the above-mentioned entities. During the analysis, we used the following methods to test whether a website blocks AI and other data-scraping bots:
- Web server’s response to a User Agent of a known AI bot
- Web server’s response to a User Agent of unknown bots *
- Web server’s response to automated crawling evidencing a non-human behavior
- Presence of instructions in the “robots.txt” file telling AI bots not to crawl website content
- Presence of meta tags instructing AI bots not to crawl the content of website pages
- Presence of a WAF or another server-side security mechanism that blocks bots
- Use of anti-bot protection solutions such as Cloudflare that block AI bots
* excluding the so-called good bots, such as Googlebot
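The first check in the list above, comparing how a server answers a browser and a known AI bot, might be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the Website Security Test: the User-Agent strings, the set of "blocking" status codes and the function names are all hypothetical.

```python
from urllib import error, request

# Assumed values for illustration only; they are not taken from the research.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0 Safari/537.36"
)
AI_BOT_UA = "GPTBot/1.0"
BLOCK_STATUSES = frozenset({401, 403, 406, 429, 503})


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server sends for the given User-Agent."""
    req = request.Request(url, headers={"User-Agent": user_agent})
    try:
        with request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except error.HTTPError as exc:
        # 4xx/5xx answers are raised as HTTPError; only the code matters here.
        return exc.code


def bot_is_blocked(browser_status: int, bot_status: int) -> bool:
    """True when a browser gets the page but the bot User-Agent is refused."""
    browser_ok = 200 <= browser_status < 300
    return browser_ok and bot_status in BLOCK_STATUSES
```

A caller could combine the two helpers as `bot_is_blocked(fetch_status(url, BROWSER_UA), fetch_status(url, AI_BOT_UA))`. Requiring a successful browser response first avoids counting a website that is simply down, or that refuses everyone, as one that blocks AI bots.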
The results of the research are briefly summarized below.
3. Results
3.1 Forbes: World's Best Banks 2025
43% of the websites (143 out of 329) from Forbes’s list of the World's Best Banks 2025 block AI bots and crawlers using server-side security mechanisms or network controls.
Among the websites that block AI bots, some also use their “robots.txt” files to instruct specific AI bots not to crawl their web content (bots whose operators say that they follow the “robots.txt” guidelines):
| Bot Name | Explicitly Banned |
|---|---|
| Copilot by Microsoft | 27.3% |
| Claude by Anthropic | 13.3% |
| Apple Intelligence by Apple | 9.1% |
| GPTBot by OpenAI | 7.7% |
| AmazonBot by Amazon | 6.3% |
| Meta AI by Meta | 6.3% |
| Perplexity by Perplexity AI | 4.9% |
| Gemini by Google | 1.4% |
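To illustrate how explicit “robots.txt” bans like those counted in the table above could be detected, here is a minimal sketch using Python's standard `urllib.robotparser`. The crawler tokens and the sample file are assumptions made for the example; the research does not state which tokens were checked, and vendors may change theirs.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens assumed for illustration; verify them against vendor docs.
AI_BOT_TOKENS = (
    "GPTBot",
    "ClaudeBot",
    "PerplexityBot",
    "Amazonbot",
    "Google-Extended",
)

# Hypothetical robots.txt content, not taken from any website in the research.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def explicitly_banned(robots_txt: str, tokens=AI_BOT_TOKENS) -> list[str]:
    """Return the tokens that are not allowed to fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [token for token in tokens if not parser.can_fetch(token, "/")]


print(explicitly_banned(SAMPLE_ROBOTS_TXT))
```

Note that this simple check also reports a bot as banned when the file disallows everything for `User-agent: *`, so a stricter notion of an "explicit" ban would need to inspect which rule group matched.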
3.2 Shanghai Ranking: The Academic Ranking of World Universities (ARWU) 2025
36% of the websites (93 out of 255) from the Shanghai Ranking of the World Universities 2025 (ARWU) block AI bots and crawlers using server-side security mechanisms or network controls.
Among the websites that block AI bots, some also use their “robots.txt” files to instruct specific AI bots not to crawl their web content (bots whose operators say that they follow the “robots.txt” guidelines):
| Bot Name | Explicitly Banned |
|---|---|
| Claude by Anthropic | 26.9% |
| Copilot by Microsoft | 22.6% |
| GPTBot by OpenAI | 19.4% |
| AmazonBot by Amazon | 15.1% |
| Perplexity by Perplexity AI | 7.5% |
| Apple Intelligence by Apple | 6.5% |
| Meta AI by Meta | 6.5% |
| Gemini by Google | 2.2% |
3.3 Encyclopedia Britannica: World Newspapers and Magazines
83% of the websites (81 out of 98) from Encyclopedia Britannica’s list of World Newspapers and Magazines block AI bots and crawlers using server-side security mechanisms or network controls.
Among the websites that block AI bots, some also use their “robots.txt” files to instruct specific AI bots not to crawl their web content (bots whose operators say that they follow the “robots.txt” guidelines):
| Bot Name | Explicitly Banned |
|---|---|
| GPTBot by OpenAI | 61.7% |
| Claude by Anthropic | 59.3% |
| Perplexity by Perplexity AI | 56.8% |
| Gemini by Google | 53.1% |
| AmazonBot by Amazon | 45.7% |
| Apple Intelligence by Apple | 44.4% |
| Meta AI by Meta | 43.2% |
| Copilot by Microsoft | 21% |
3.4 Forbes: World’s Best Management Consulting Firms 2025
52% of the websites (100 out of 191) from Forbes’s list of the World’s Best Management Consulting Firms 2025 block AI bots and crawlers using server-side security mechanisms or network controls.
Among the websites that block AI bots, some also use their “robots.txt” files to instruct specific AI bots not to crawl their web content (bots whose operators say that they follow the “robots.txt” guidelines):
| Bot Name | Explicitly Banned |
|---|---|
| Copilot by Microsoft | 54% |
| Claude by Anthropic | 27% |
| GPTBot by OpenAI | 26% |
| AmazonBot by Amazon | 15% |
| Perplexity by Perplexity AI | 15% |
| Apple Intelligence by Apple | 12% |
| Meta AI by Meta | 11% |
| Gemini by Google | 9% |
3.5 Legal 500: Top Law Firms in the United States 2025
64% of the websites (119 out of 186) from the Legal 500 list of the Top Law Firms in the United States 2025 block AI bots and crawlers using server-side security mechanisms or network controls.
Among the websites that block AI bots, some also use their “robots.txt” files to instruct specific AI bots not to crawl their web content (bots whose operators say that they follow the “robots.txt” guidelines):
| Bot Name | Explicitly Banned |
|---|---|
| Copilot by Microsoft | 59.7% |
| Claude by Anthropic | 18.5% |
| AmazonBot by Amazon | 15.1% |
| GPTBot by OpenAI | 10.9% |
| Apple Intelligence by Apple | 7.6% |
| Meta AI by Meta | 5.9% |
| Perplexity by Perplexity AI | 3.4% |
| Gemini by Google | 1.7% |
3.6 Legal 500: Top Law Firms in France 2025
38% of the websites (75 out of 196) from the Legal 500 list of the Top Law Firms in France 2025 block AI bots and crawlers using server-side security mechanisms or network controls.
Among the websites that block AI bots, some also use their “robots.txt” files to instruct specific AI bots not to crawl their web content (bots whose operators say that they follow the “robots.txt” guidelines):
| Bot Name | Explicitly Banned |
|---|---|
| Copilot by Microsoft | 28% |
| Claude by Anthropic | 24% |
| AmazonBot by Amazon | 21.3% |
| GPTBot by OpenAI | 14.7% |
| Meta AI by Meta | 12% |
| Apple Intelligence by Apple | 9.3% |
| Gemini by Google | 1.3% |
| Perplexity by Perplexity AI | 1.3% |
3.7 Legal 500: Top Law Firms in England 2025
63% of the websites (304 out of 481) from the Legal 500 list of the Top Law Firms in England 2025 block AI bots and crawlers using server-side security mechanisms or network controls.
Among the websites that block AI bots, some also use their “robots.txt” files to instruct specific AI bots not to crawl their web content (bots whose operators say that they follow the “robots.txt” guidelines):
| Bot Name | Explicitly Banned |
|---|---|
| Copilot by Microsoft | 34.2% |
| Claude by Anthropic | 28.6% |
| GPTBot by OpenAI | 16.8% |
| AmazonBot by Amazon | 14.5% |
| Meta AI by Meta | 9.9% |
| Perplexity by Perplexity AI | 7.2% |
| Apple Intelligence by Apple | 6.6% |
| Gemini by Google | 4.3% |
3.8 Top Academic Journals
74% of the websites (25 out of 34) from the list of the top academic journals block AI bots and crawlers using server-side security mechanisms or network controls.
Among the websites that block AI bots, some also use their “robots.txt” files to instruct specific AI bots not to crawl their web content (bots whose operators say that they follow the “robots.txt” guidelines):
| Bot Name | Explicitly Banned |
|---|---|
| AmazonBot by Amazon | 32% |
| GPTBot by OpenAI | 32% |
| Claude by Anthropic | 28% |
| Perplexity by Perplexity AI | 28% |
| Apple Intelligence by Apple | 24% |
| Meta AI by Meta | 24% |
| Gemini by Google | 12% |
| Copilot by Microsoft | 8% |
3.9 Top Academic Research Databases
73% of the websites (27 out of 37) from the list of the top academic research databases block AI bots and crawlers using server-side security mechanisms or network controls.
Among the websites that block AI bots, some also use their “robots.txt” files to instruct specific AI bots not to crawl their web content (bots whose operators say that they follow the “robots.txt” guidelines):
| Bot Name | Explicitly Banned |
|---|---|
| GPTBot by OpenAI | 48.1% |
| AmazonBot by Amazon | 37% |
| Claude by Anthropic | 37% |
| Gemini by Google | 29.6% |
| Meta AI by Meta | 25.9% |
| Copilot by Microsoft | 25.9% |
| Apple Intelligence by Apple | 22.2% |
| Perplexity by Perplexity AI | 22.2% |
4. Aggregated Results and Infographics
The table below compares how aggressively and proactively each industry bans AI bots from accessing creative content on its websites:
| List of Entities | Ban AI Bots |
|---|---|
| Encyclopedia Britannica: World Newspapers and Magazines | 83% |
| Top Academic Journals in the World | 74% |
| Top Academic Research Databases | 73% |
| Legal 500: Top Law Firms in the United States 2025 | 64% |
| Legal 500: Top Law Firms in England 2025 | 63% |
| Forbes: World’s Best Management Consulting Firms 2025 | 52% |
| Forbes: World's Best Banks 2025 | 43% |
| Legal 500: Top Law Firms in France 2025 | 38% |
| Shanghai Ranking: The Academic Ranking of World Universities 2025 (ARWU) | 36% |

Diagram 1: Percentage of organizations from the research that block AI bots
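The percentages in the table above can be recomputed from the website counts reported in sections 3.1 to 3.9, as the short sketch below shows. The overall share across all 1,807 websites is derived here for illustration and is not a figure reported by the research.

```python
# (websites blocking AI bots, websites tested) as reported in sections 3.1-3.9.
RESULTS = {
    "Encyclopedia Britannica: World Newspapers and Magazines": (81, 98),
    "Top Academic Journals in the World": (25, 34),
    "Top Academic Research Databases": (27, 37),
    "Legal 500: Top Law Firms in the United States 2025": (119, 186),
    "Legal 500: Top Law Firms in England 2025": (304, 481),
    "Forbes: World's Best Management Consulting Firms 2025": (100, 191),
    "Forbes: World's Best Banks 2025": (143, 329),
    "Legal 500: Top Law Firms in France 2025": (75, 196),
    "Shanghai Ranking: ARWU 2025": (93, 255),
}


def share(blocking: int, total: int) -> float:
    """Percentage of tested websites that block AI bots."""
    return 100 * blocking / total


# Per-list shares, rounded to whole percent as in the table.
per_list = {name: round(share(b, t)) for name, (b, t) in RESULTS.items()}

# Derived overall figure (an addition for illustration, not from the research).
total_blocking = sum(b for b, _ in RESULTS.values())
total_sites = sum(t for _, t in RESULTS.values())
overall = round(share(total_blocking, total_sites), 1)

for name, pct in per_list.items():
    print(f"{pct:>3}%  {name}")
print(f"Overall: {total_blocking} of {total_sites} websites ({overall}%)")
```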
Below is a table that compares which bots are explicitly banned by the entities used in this research, evidencing particular concerns over some AI companies and their data collection or use practices:
| List of Bots | Explicitly Banned |
|---|---|
| Copilot by Microsoft | 34.7% |
| Claude by Anthropic | 27.2% |
| GPTBot by OpenAI | 20.8% |
| AmazonBot by Amazon | 17.7% |
| Meta AI by Meta | 12.4% |
| Apple Intelligence by Apple | 11.9% |
| Perplexity by Perplexity AI | 11.9% |
| Gemini by Google | 8.6% |

Diagram 2: Percentage of organizations from the research that explicitly ban specific AI bots
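The research does not state how the aggregated figures were computed, but they appear consistent with a site-weighted average over all websites that block AI bots. The sketch below checks this for Copilot; the per-list website counts are reconstructed from the rounded percentages in sections 3.1 to 3.9, which is an assumption rather than data published by the research.

```python
# (list, websites blocking AI bots, % of those explicitly banning Copilot),
# taken from the tables in sections 3.1-3.9.
COPILOT_ROWS = [
    ("Banks", 143, 27.3),
    ("Universities", 93, 22.6),
    ("Newspapers and magazines", 81, 21.0),
    ("Consulting firms", 100, 54.0),
    ("Law firms, United States", 119, 59.7),
    ("Law firms, France", 75, 28.0),
    ("Law firms, England", 304, 34.2),
    ("Academic journals", 25, 8.0),
    ("Academic databases", 27, 25.9),
]

# Reconstruct the number of banning websites per list (an estimate, because
# the published percentages are rounded to one decimal place).
estimated_bans = [round(sites * pct / 100) for _, sites, pct in COPILOT_ROWS]

total_bans = sum(estimated_bans)
total_blocking_sites = sum(sites for _, sites, _ in COPILOT_ROWS)
copilot_share = round(100 * total_bans / total_blocking_sites, 1)

print(f"{total_bans} of {total_blocking_sites} websites: {copilot_share}%")
```

Under this reading, every percentage in the table is relative to the websites that block AI bots, not to all 1,807 websites tested.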
5. Observations and Conclusion
It is important to consider the following observations and possible future developments:
- Enacted legislation – including the EU AI Act – provides little to no protection for authors of creative content, eventually creating a huge emerging market for anti-bot solutions and data scraping protection services.
- Pending copyright infringement lawsuits around the globe are unlikely to change the surging trend of technical self-defense from bots, regardless of whether courts eventually rule in favor of the plaintiffs or AI corporations.
- Instead of relying on unclear and uncertain protection under the enacted copyright law, authors now update the Terms of Service of their websites to expressly prohibit data scraping and any use of their content for AI training purposes, relying on a breach-of-contract claim under well-established and time-tested contract law in case of infringement.
- Many AI corporations not expressly mentioned in this research – including ByteDance and DeepSeek from China – conceal their data collection practices, making it impossible to protect websites from their disguised crawlers by providing instructions in a “robots.txt” file, and eventually requiring user behavior analytics and other techniques to block their stealth bots.
- Many AI corporations use clandestine external entities and offshore companies to outsource and eventually obfuscate their massive data scraping programs, flatly denying their involvement in any illicit or unethical data scraping activities.
- Data suppliers of AI companies may start impersonating data scraping bots of well-known AI corporations, such as Meta or OpenAI, to scrape web data by using User Agents of the latter, ultimately framing the impersonated megacorporations.
- As per ImmuniWeb’s proprietary honeypot data, since January 2025 there has been a spike of automated web traffic from countries like Iran and China, possibly evidencing that data scraping activities take place from these remote jurisdictions to avoid legal actions or prosecution in the US and Europe.
- A ballooning number of companies use GenAI to create synthetic web content to rank higher in Google, authorizing AI bots to crawl their GenAI-created content without restrictions. However, the use of synthetic data for AI training is not just virtually useless but may be harmful to the so-called intelligence of LLMs.
- There is a strong chance that the current business model of many AI corporations – based on the massive misappropriation of proprietary data belonging to third parties without permission and without duly paying for it – will disappear in the next few years and even push some AI vendors out of business.
- Emerging content licensing agreements between some AI corporations and groups of victimized authors will probably not last long given that authors are frequently and flagrantly underpaid, being roped into such deals by “get the pennies or nothing” offers of aggressive law firms that represent AI corporations.
- Payment of a fair and reasonable price for third-party creative content on a regular basis – to maintain accuracy of LLM’s intelligence – may be cost prohibitive for many AI corporations because their own prices for chatbots and other AI solutions will then skyrocket, eventually making human labor a more cost-efficient option.
Dr. Ilia Kolochenko, Chief Architect & CEO at ImmuniWeb, concludes: “While the largest AI corporations pay law firms many millions to defend or settle the mushrooming copyright infringement lawsuits, AI fatigue and disillusionment are rapidly mounting across almost all industries and sectors of the economy. Despite billions invested in AI, we are still very far from creating Artificial General Intelligence (AGI) that was among the key fundraising promises of some AI companies. Furthermore, whoever prevails in the now-pending copyright battles in courts on both sides of the Atlantic, many AI corporations are inevitably poised to face serious challenges in the near future and will likely be compelled to change their current business model. Recent revelations about the massive and deliberate exploitation of pirated content for LLM training by AI companies are just the tip of the iceberg of unfolding exposure of the systemic misconduct.
Today, both in Europe and the US, copyright owners and authors of creative content are left without sound protection of their intellectual property under the enacted copyright law, which urgently requires a major overhaul to better reflect modern realities. Moreover, even emerging AI legislation, such as the EU AI Act, is simply inefficient and ineffective at protecting the fruits of intellectual labor from massive misappropriation by AI corporations. Worse, both foreign AI companies and offshore suppliers of Western AI corporations tend to knowingly ignore Western legislation, and their behavior is quite unlikely to change in the near future.
Ultimately, authors and copyright owners pragmatically decided to defend their intellectual property themselves by erecting formidable technical fences and security barriers, making unauthorized data scraping prohibitively expensive or technically impossible. Finally, a spike of breach-of-contract lawsuits is likely coming for violation of websites’ novel Terms of Service, but this time, the culprits will probably have to pay (both their human lawyers – that the former so vigorously promised to replace with AI – and copyright owners around the globe).”
The Website Security Test used in this research is part of ImmuniWeb’s award-winning Community Edition, which currently runs over 100,000 daily security scans in over 100 countries. Statistical data from the Community Edition has been utilized in the Verizon Data Breach Investigations Report (DBIR), to which ImmuniWeb is a Contributor, as well as in strategic partnerships that ImmuniWeb has with various NGOs and international organizations including the UN ITU.
Use and distribution: you are welcome to utilize the above-mentioned content for non-commercial purposes if you make a clear attribution to ImmuniWeb, with a backlink to this page when practical. In case of doubt, please contact us.