How can a commercial grade web robot be so badly written?
Alex Schroeder was checking the status of web requests,
and it made me wonder about the stats on my own server.
One quick script later and I had some numbers:
Status of requests for boston.conman.org so far this month
|Status ||result ||requests ||percent|
|Total ||- ||64542 ||100.01|
|200 || |
|206 || |
|301 || |
|304 || |
|400 || |
|401 || |
|404 || |
|405 || |
|410 || |
|500 || |
I'll have to check the INTERNAL_ERRORs and look into those 12,
but the rest seem okay. I was curious to see what I didn't have that was being requested,
when I noticed that the MJ12Bot was producing the majority of those requests.
Most of the traffic around here is from bots.
Lots and lots of bots.
Top agents requesting pages
|requests ||percentage ||user agent|
|47721 ||74 ||Total (out of 64542)|
|16952 ||26 ||The Knowledge AI|
|9159 ||14 ||Mozilla/5.0 (compatible; SemrushBot/3~bl; +http://www.semrush.com/bot.html)|
|5633 ||9 ||Mozilla/5.0 (compatible; VelenPublicWebCrawler/1.0; +https://velen.io)|
|4272 ||7 ||Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)|
|4046 ||6 ||Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)|
|3170 ||5 ||Mozilla/5.0 (compatible; Go-http-client/1.1; +email@example.com)|
|2146 ||3 ||Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)|
|1197 ||2 ||Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, firstname.lastname@example.org)|
|1146 ||2 ||istellabot/t.1.13|
But it's been that way for years now.
C'est la vie.
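For anyone wanting to run the same kind of tally on their own server, here is a minimal sketch of such a script. It is not the script used for the tables above. It assumes the log is in Apache "combined" format, and the names `LOG_LINE`, `tally` and `report` are made up for illustration.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format. The post does not say which
# log format or tool produced its tables.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def tally(lines):
    """Count requests per status code and per user agent."""
    statuses = Counter()
    agents = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip anything that is not a combined-format entry
        statuses[match.group("status")] += 1
        agents[match.group("agent")] += 1
    return statuses, agents

def report(counter, limit=None):
    """Return (key, count, percent) rows, most frequent first."""
    total = sum(counter.values())
    return [
        (key, count, round(100 * count / total, 2))
        for key, count in counter.most_common(limit)
    ]
```

Feeding it an open log file (`tally(open(path))`, with `path` being wherever your server writes its access log) and passing each counter to `report` gives rows in the same shape as the tables here. Because each percentage is rounded separately, the column can add up to something like 100.01.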
So I started looking closer at MJ12Bot and the requests it was generating, and they were odd:
And so on.
As they describe it:
Why do you keep crawling 404 or 301 pages?
We have a long memory and want to ensure that temporary errors, website down
pages or other temporary changes to sites do not cause irreparable changes
to your site profile when they shouldn't. Also if there are still links to
these pages they will continue to be found and followed. Google have
published a statement since they are also asked this question, their reason
is of course the same as ours and their answer can be found here:
Google 404 policy.
But those requests?
They have a real issue with their bot.
Looking over the requests,
I see that they're pages I've linked to,
but for whatever reason,
their bot is making requests for remote pages on my server.
Those %22 parts are encoded double quotes.
It's as if their bot saw
<A HREF="http://www.thomasedison.com"> and treated it as not only a link on my server,
but escaped the quotes when making the request!
Pssst! MJ12Bot! Quotes are optional! Both
<A HREF="http://www.thomasedison.com"> and
<A HREF=http://www.thomasedison.com> are equivalent!
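To make the point concrete, here is a small sketch using only Python's standard library. It shows that a real HTML parser hands back the same `href` value whether or not the attribute is quoted. The last few lines imitate what I'm guessing the bot does, which is keep the quotes as part of the value and then percent-encode them. The names `LinkExtractor` and `extract_links` are mine, and the "mangled" step is a guess at the failure, not MJ12Bot's actual code.

```python
from html.parser import HTMLParser
from urllib.parse import quote, urljoin

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and strips
        # any quotes surrounding the attribute value.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html, base):
    """Return the links in html, resolved against the page's own URL."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return [urljoin(base, href) for href in parser.links]

BASE = "http://boston.conman.org/"
quoted = extract_links('<A HREF="http://www.thomasedison.com">', BASE)
unquoted = extract_links('<A HREF=http://www.thomasedison.com>', BASE)
# Both are ['http://www.thomasedison.com']: an absolute URL, not a path here.

# Guess at the bug: the quotes are kept as part of the value, then escaped.
raw_value = '"http://www.thomasedison.com"'
mangled = quote(raw_value, safe=":/")
# mangled == '%22http://www.thomasedison.com%22'
```

Once the quotes are part of the value, it no longer starts with a scheme, so a crawler would treat it as a relative link and ask my server for it.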
I sent them the following email:
I've read your page on the mj12 bot, and I don't necessarily mind the 404s
your bot generates, but I think there's a problem with your bot making
totally bogus requests, such as:
I'm not a proxy server, so requesting a URL will not work, and even if I
was a proxy server, the request itself is malformed so badly that I have to
conclude your programmers are incompetent and don't care.
Could you at the very least fix your robot so it makes proper requests?
I then received a canned reply saying that they have
received my email and are looking into it.
But I did a bit more investigation, and the results aren't pretty:
Requests and results for MJ12Bot
|Status ||result ||number ||percentage|
|Total ||- ||2164 ||100.00|
|200 || |
|301 || |
|404 || |
So not only are they responsible for 83% of the bad requests I've seen,
but nearly 77% of the requests they make are bad!
Just amazing programmers they have!