Blocked Web Browser User Agents
If you were automatically directed to this page when you tried to view another page, you're using a Web browser (or other software) that sends one of the following text strings as the "User-Agent":
- 80legs.com
- AhrefsBot
- Amazonbot
- AspiegelBot
- AwarioBot
- Barkrowler
- BLEXBot
- Bytespider
- BuckyOHare / hypefactors.com
- centuryb.o.t9@gmail.com
- ChatGLM-Spider
- ClaudeBot
- dataforseo.com
- discoveryengine.com
- domaincrawler.com
- domainreanimator.com
- DTS Agent
- Evensi
- GeedoBot / GeedoProductSearch
- GPTBot
- FriendlyCrawler
- gsa-crawler (M2-AMCWPFAKDA6AS, T2-B9E742J9WQSAB, and S5-KRWBRM63Y6JJT)
- heritrix (when followed by an invalid URL like “+http://example@gmail.com”)
- Goodzer
- ImagesiftBot
- LinkFeatureBot
- Mail.RU_Bot
- MauiBot
- MegaIndex.ru
- Missigua Locator
- MJ12bot
- Morfeus F*cking Scanner
- musobot
- OpenLinkProfiler.org
- opensiteexplorer.org
- panscient.com
- RainBot
- riddler.io
- Screaming Frog SEO Spider
- SEMrushBot
- seostar.co
- serpstatbot
- SputnikBot
- terrykyleseoagency.com
- The Knowledge AI
- Timpibot
- Turnitin
- VelenPublicWebCrawler
- WBSearchBot
- WPSpider
- xovibot.net
These connections have been identified as "abusive" by our technical staff.
If you're a legitimate user (that is, if you're a normal human being who has been redirected to this page), please contact us and mention that you're being “blocked based on the HTTP User-Agent when connecting from IP address 3.135.209.107”.
On the rest of this page:
- Can a site owner override this restriction?
- What do you mean by "abusive"?
- Why is Amazonbot included?
- Why is “MJ12bot” included?
- Why are other bots included?
- Is there a way I can block just some of these bots?
Can a site owner override this restriction?
If you’re the owner of a site hosted with us, and you want to allow connections that send one of the above “User-Agent” strings to reach your site anyway, create an empty file named .tigertech-dont-block-user-agents at the top level of your site. Note that the filename begins with a dot, and that dot must be included.
Doing this is not recommended, because it may expose your site to connections that cause high load, excess CPU usage, or outages.
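For example, if you have command-line (SSH) access to your account, creating the file might look like the sketch below; it assumes you have already changed into your site’s top-level directory, and you can just as easily upload an empty file with this name using FTP or a file manager:

```
# Run from the top-level directory of your site.
touch .tigertech-dont-block-user-agents
```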
What do you mean by "abusive"?
We should first emphasize that we don’t go out of our way looking for bots to block. We only do it if we see them actually causing real problems on our servers — in particular, causing resource usage that noticeably slows down a site for human visitors, or that would lead to extra usage fees for our customers.
The main reason a bot is considered abusive is that it attempts to load every page on a site without spreading the load over a reasonable time period, and without automatically slowing down when it detects script-based pages that load slowly.
A well-written search engine spider/robot should spread its page requests over an extended period. For example, if it needs to load 1,000 pages from a site, it could load them over 24 hours, not within a single hour.
It should also detect how long it takes to load a page, then "sleep" for at least ten times that period before loading a similar URL. This ensures that if a site uses script-based pages that consume large amounts of CPU time, the spider/robot won't increase the total site load by more than 10%.
In addition, a robot should never open more than one simultaneous connection to a particular site.
Finally, we also consider user agents abusive if they repeatedly try to index URLs that return 404 errors, 301 redirects, and so on.
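As a rough illustration of these guidelines (this is not the code of any particular bot), a polite crawler loop in Python might look like the sketch below: it fetches URLs one at a time over a single connection, times each request, and then sleeps for at least ten times as long as the request took. The third-party requests library and the min_delay floor are assumptions for the example, not requirements stated on this page.

```python
import time
import requests  # assumes the third-party "requests" library is installed


def polite_crawl(urls, min_delay=5.0):
    """Fetch URLs one at a time, sleeping at least 10x each response time."""
    session = requests.Session()  # a single, reused connection
    for url in urls:
        start = time.monotonic()
        try:
            response = session.get(url, timeout=30)
        except requests.RequestException:
            continue  # skip unreachable URLs rather than retrying immediately
        elapsed = time.monotonic() - start
        # ... process response.text here ...
        # Sleeping for >= 10x the response time keeps this crawler's share
        # of the site's total load at roughly 10% or less.
        time.sleep(max(10 * elapsed, min_delay))
```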
Why is Amazonbot included?
Amazonbot, despite coming from a large company, is currently the main cause of high-CPU incidents we see on sites. It doesn’t respect crawl-delay requests, and it often loads the same URL tens or hundreds of times a day. It’s common for it to load thousands of URLs on a site in a short period, then repeat the same requests for days on end, causing a site’s CPU usage to increase a hundredfold or more.
Since the only purpose of Amazonbot appears to be “enabling Alexa to answer even more questions for customers”, it’s not reasonable for it to slow sites down and cause their owners to pay extra CPU fees. We’ll continue to monitor it, but for now, the way it works isn’t acceptable.
Why is “MJ12bot” included?
MJ12bot claims to be a project to “spider the Web for the purpose of building a search engine”. The company that makes it asks volunteers to install the indexing software on their own computers, using the volunteers’ own bandwidth and CPU resources instead of the company’s.
The idea of a community-run search engine sounds great — however, the MJ12bot authors have not operated a search engine for many years. Instead, they use the information that people are generating to sell SEO services on a different site.
The MJ12bot software often also requests malformed URLs that generate “404 not found” errors, increasing CPU usage on WordPress sites.
Because of this, and because the MJ12bot software is often one of the primary causes of site slowdowns and CPU overage fees for our customers, we’ve blocked it from sites we host.
If you want to allow MJ12bot to index your site anyway, you can use the trick described above: create an empty file named .tigertech-dont-block-user-agents at the top level of your site, which will bypass the restriction.
Why are other bots included?
Several other bots listed are also run by companies that sell SEO services or the like, including:
- AhrefsBot
- Barkrowler
- BuckyOHare / hypefactors.com
- dataforseo.com
- domainreanimator.com
- MegaIndex.ru
- OpenLinkProfiler.org
- opensiteexplorer.org
- panscient.com
- Screaming Frog SEO Spider
- SEMrushBot
- seostar.co
- serpstatbot
- terrykyleseoagency.com
- xovibot.net
These services are used by a tiny fraction (if any) of our customers, but the costs and slowdowns caused by these bots affect everyone.
“Data mining for profit” bots are fundamentally different from search engine indexers like Googlebot. Search engine bots may send future visitors to a site, which benefits the site owner, so it’s reasonable to allow those bots to consume site resources in exchange. Search engines also make money off the data, but it’s a symbiotic relationship in which both parties get something.
But most data mining bots don’t provide any benefit to the site owner at all. It’s a parasitic relationship, not symbiotic. If anything, the average request from this kind of bot harms the site owner, because the data is used to benefit their competitors.
We don’t go out of our way looking for parasitic bots, but when one of them causes such abnormal resource consumption that it would lead to overage fees for our customers or affect the speed of a site, we block it. It’s not reasonable for our customers to incur expenses for something that won’t benefit them.
If you’re a customer of ours and you want to allow these bots to index your site despite this, you can use the trick described above: create an empty file named .tigertech-dont-block-user-agents at the top level of your site, which will bypass the restriction.
Is there a way I can block just some of these bots?
In terms of our blocking, there’s no way to allow just one bot from this list while continuing to block the others. Instead, you can add the .tigertech-dont-block-user-agents file to allow them all, then block the individual bots you don’t want using .htaccess entries.
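For example, a minimal .htaccess sketch that blocks just MJ12bot and SEMrushBot by User-Agent might look like the following; it assumes your site runs on Apache with mod_rewrite enabled, and the bot names in the pattern are placeholders you would adjust yourself:

```
# Return "403 Forbidden" to any request whose User-Agent matches the pattern.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (MJ12bot|SemrushBot) [NC]
RewriteRule .* - [F,L]
```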
If you use WordPress to manage your site, you can do it this way:
- Add the .tigertech-dont-block-user-agents file to disable the global user-agent blocking (we’ll be glad to do this if you wish);
- Then, if you still want to block some of these, use a WordPress plugin like User Agent Blocker to control which bots you allow and block.
That way you can control it in fine detail from within WordPress.