The Boston Diaries' Journal

Below are the 6 most recent journal entries recorded in The Boston Diaries' InsaneJournal:

    Thursday, July 11th, 2019
    1:24 am
    Some more observations about the MJ12Bot

    I received another reply from MJ12Bot about their badly written bot, and it just said the person responsible for handling enquiries was out of the office for the day and I should expect a response tomorrow. We shall see. In the meantime, I decided to check some of the other bots hitting my site and see how well they fare, request-wise. And I'm using the logs from last month for this, so these results are for 30 days of traffic.

    Top 10 bots hitting The Boston Diaries
    requests percentage user agent
    167235 70 Total (out of 239641)
    46334 19 The Knowledge AI
    38097 16 Mozilla/5.0 (compatible; SemrushBot/3~bl; +
    17130 7 Mozilla/5.0 (compatible; BLEXBot/1.0; +
    15928 7 Mozilla/5.0 (compatible; AhrefsBot/6.1; +
    12358 5 Mozilla/5.0 (compatible; bingbot/2.0; +
    8929 4 Mozilla/5.0 (compatible;; +
    8908 4 Gigabot
    7872 3 Mozilla/5.0 (compatible; MJ12bot/v1.4.8;
    6942 3 Barkrowler/0.9 (+
    4737 2 istellabot/t.1.13

    So let's see some results:

    Results of bot queries
    Bot 200 % 301 % 304 % 400 % 403 % 404 % 410 % 500 % Total %
    The Knowledge AI 42676 92.1 3352 7.2 0 0.0 127 0.3 4 0.0 170 0.4 5 0.0 0 0.0 46334 100.0
    SemrushBot/3~bl 36088 94.7 1873 4.9 0 0.0 110 0.3 0 0.0 21 0.1 5 0.0 0 0.0 38097 100.0
    BLEXBot/1.0 16633 97.1 208 1.2 124 0.7 114 0.7 0 0.0 46 0.3 5 0.0 0 0.0 17130 100.0
    AhrefsBot/6.1 15840 99.4 78 0.5 0 0.0 4 0.0 0 0.0 5 0.0 0 0.0 1 0.0 15928 99.9
    bingbot/2.0 12304 99.6 35 0.3 0 0.0 6 0.0 0 0.0 3 0.0 5 0.0 0 0.0 12353 99.9
    8412 94.2 456 5.1 0 0.0 24 0.3 0 0.0 36 0.4 1 0.0 0 0.0 8929 100.0
    Gigabot 8428 94.6 448 5.0 0 0.0 23 0.3 0 0.0 7 0.1 2 0.0 0 0.0 8908 100.0
    MJ12bot/v1.4.8 2015 25.6 175 2.2 0 0.0 2 0.0 0 0.0 5680 72.2 0 0.0 0 0.0 7872 100.0
    Barkrowler/0.9 6604 95.1 300 4.3 0 0.0 10 0.1 0 0.0 28 0.4 0 0.0 0 0.0 6942 99.9
    istellabot/t.1.13 4705 99.3 28 0.6 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 4 0.1 4737 100.0

    Percentage-wise, of the top 10 bots hitting my blog (and in fact, these are the top ten clients hitting my blog), MJ12Bot is just bad, with 72% of its requests being bad. It's hard to say what the second worst one is, but I'll have to give it to The Knowledge AI bot (and my search-fu is failing me in finding anything about this one). Percentage-wise, it's about on par with the others, but some of its requests are also rather odd:

    • /%22
    • /%22https:/
    • /%22http:/
    • /%22https:/
    • /%22https:/
    • /%22https:/

    It appears to be a problem similar to MJ12Bot's, but one that doesn't happen nearly as often.

    Now, this isn't to say I don't have some legitimate not found (404) results. I did come across some actual valid 404 results on my own blog:

    • /2004/08/18/
    • /2012/08/10/HREF
    • /2013/01/02/menamena
    • /2013/02/01/HREF
    • /2014/05/04/HREF
    • /2015/02/10/B000FBJCJE
    • /2015/07/10/

    Some are typos, some are placeholders for links I forgot to add. And those I can fix. I just wish someone would fix MJ12Bot. Not because it's bogging down my site with unwanted traffic, but because it's just bad at what it does.

    Tuesday, July 9th, 2019
    11:08 pm
    How can a commercial grade web robot be so badly written?

    Alex Schroeder was checking the status of web requests, and it made me wonder about the stats on my own server. One quick script later and I had some numbers:

    Status of requests for so far this month
    Status result requests percent
    Total - 64542 100.01
    200 OKAY 53457 82.83
    206 PARTIAL_CONTENT 12 0.02
    301 MOVE_PERM 2421 3.75
    304 NOT_MODIFIED 6185 9.58
    400 BAD_REQUEST 101 0.16
    401 UNAUTHORIZED 147 0.23
    404 NOT_FOUND 2000 3.10
    405 METHOD_NOT_ALLOWED 41 0.06
    410 GONE 5 0.01
    500 INTERNAL_ERROR 173 0.27

    I'll have to look into the INTERNAL_ERRORs and those 12 PARTIAL_CONTENT responses, but the rest seem okay. I was curious to see what I didn't have that was being requested, when I noticed that the MJ12Bot was producing the majority of NOT_FOUND responses.
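
    A tally like the one above could come from a script along these lines. This is only a sketch, not the script actually used here: it assumes the server writes Apache "combined" format logs, and the sample lines, addresses and "ExampleBot" agent are made up for illustration.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format, i.e.
#   host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def tally(lines):
    """Count responses per status code and per user agent."""
    by_status = Counter()
    by_agent = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip anything that is not a combined-format entry
        by_status[match.group("status")] += 1
        by_agent[match.group("agent")] += 1
    return by_status, by_agent

def report(counter):
    """Return (key, count, percent) rows, largest count first."""
    total = sum(counter.values())
    return [(key, n, round(100.0 * n / total, 2))
            for key, n in counter.most_common()]

# Made-up sample entries; real input would be the server's access log.
sample = [
    '192.0.2.1 - - [09/Jul/2019:23:08:00 -0400] "GET / HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '192.0.2.2 - - [09/Jul/2019:23:08:01 -0400] "GET /%22 HTTP/1.1" 404 312 "-" "ExampleBot/1.0"',
    '192.0.2.3 - - [09/Jul/2019:23:08:02 -0400] "GET /about HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]

by_status, by_agent = tally(sample)
for row in report(by_status):
    print(row)
```

    Running `report(by_agent)` instead gives the per-agent counts, which is how the bot-heavy traffic shows up.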

    Yes, sadly, most of the traffic around here is from bots. Lots and lots of bots.

    Top agents requesting pages
    requests percentage user agent
    47721 74 Total (out of 64542)
    16952 26 The Knowledge AI
    9159 14 Mozilla/5.0 (compatible; SemrushBot/3~bl; +
    5633 9 Mozilla/5.0 (compatible; VelenPublicWebCrawler/1.0; +
    4272 7 Mozilla/5.0 (compatible; AhrefsBot/6.1; +
    4046 6 Mozilla/5.0 (compatible; bingbot/2.0; +
    3170 5 Mozilla/5.0 (compatible; Go-http-client/1.1;
    2146 3 Mozilla/5.0 (compatible; MJ12bot/v1.4.8;
    1197 2 Mozilla/5.0 (compatible; DotBot/1.1;,
    1146 2 istellabot/t.1.13

    But it's been that way for years now. C'est la vie.

    So I started looking closer at MJ12Bot and the requests it was generating, and ... they were odd:

    • //%22
    • //%22
    • //%22/2018/08/24.1/%22
    • //%22

    And so on. As they describe it:

    Why do you keep crawling 404 or 301 pages?

    We have a long memory and want to ensure that temporary errors, website down pages or other temporary changes to sites do not cause irreparable changes to your site profile when they shouldn't. Also if there are still links to these pages they will continue to be found and followed. Google have published a statement since they are also asked this question, their reason is of course the same as ours and their answer can be found here: Google 404 policy.

    But those requests? They have a real issue with their bot. Looking over the requests, I see that they're pages I've linked to, but for whatever reason, their bot is making requests for remote pages on my server. Worse yet, they're quoted! The %22 parts? That's an encoded double quote. It's as if their bot saw <A HREF=""> and not only treated it as a link on my server, but escaped the quotes when making the request!

    Pssst! MJ12Bot! Quotes are optional! Both <A HREF=""> and <A HREF=> are equivalent!
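
    To make the %22 business concrete, here is a small sketch of how such a request path could arise. It is a guess at the failure mode, not MJ12Bot's actual code: the function names and the example.com link are hypothetical, and the only library calls are `quote` and `unquote` from Python's `urllib.parse`.

```python
from urllib.parse import quote, unquote

def buggy_request_path(raw_attribute):
    """Mimic the suspected bug: keep the surrounding quotes, treat the
    whole attribute text as a path on the linking site, and
    percent-encode whatever is not URL-safe."""
    return "/" + quote(raw_attribute, safe="/:")

def intended_url(raw_attribute):
    """What a crawler should follow: the attribute value minus its quotes."""
    return raw_attribute.strip("\"'")

# Hypothetical attribute text as it appears in the HTML, quotes included.
raw = '"https://example.com/page"'

print(buggy_request_path(raw))  # a local path beginning with /%22
print(intended_url(raw))        # the remote URL the link really points at
print(unquote("%22"))           # %22 decodes to a double quote
```

    The quote characters only delimit the attribute value, so a crawler has to drop them before resolving the link.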


    Annoyed, I sent them the following email:

    Sean Conner <>
    Your robot is making bogus requests to my webserver
    Tue, 9 Jul 2019 17:49:02 -0400

    I've read your page on the mj12 bot, and I don't necessarily mind the 404s your bot generates, but I think there's a problem with your bot making totally bogus requests, such as:


    I'm not a proxy server, so requesting a URL will not work, and even if I was a proxy server, the request itself is malformed so badly that I have to conclude your programmers are incompetent and don't care.

    Could you at the very least fix your robot so it makes proper requests?

    I then received a canned reply saying that they have, in fact, received my email and are looking into it.


    But I did a bit more investigation, and the results aren't pretty:

    Requests and results for MJ12Bot
    Status result number percentage
    Total - 2164 100.00
    200 OKAY 505 23.34
    301 MOVE_PERM 4 0.18
    404 NOT_FOUND 1655 76.48

    So not only are they responsible for 83% of the NOT_FOUND responses I've seen, but nearly 77% of the requests they make are bad!
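
    Those two percentages come straight from the tables above, taking "bad" to mean NOT_FOUND. A quick check of the arithmetic (the variable names are just for illustration):

```python
mj12_404 = 1655     # NOT_FOUND responses served to MJ12Bot
mj12_total = 2164   # all requests made by MJ12Bot
all_404 = 2000      # NOT_FOUND responses served to every client

share_of_all_404 = 100.0 * mj12_404 / all_404
share_of_own_requests = 100.0 * mj12_404 / mj12_total

print(f"{share_of_all_404:.2f}% of all NOT_FOUND responses")
print(f"{share_of_own_requests:.2f}% of MJ12Bot's own requests")
```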

    Just amazing programmers they have!

    Thursday, July 4th, 2019
    11:47 pm
    T'was the night after fireworks, and all through the land, I can only hope, that no one lost a hand

    It's that time of the year again when people spend vast amounts of time and money shooting off fireworks. As of now, it no longer sounds like a war zone and the smell of black powder has drifted onward. So I hope everyone had a safe Fourth of July and that this:

    [Fireworks that exploded at ground level, with a caption that says, Amateurs: There's a Reason Professionals Exist. I still think this is one of my best photographs and I'm amazed that not only did I survive, but it came out as well as it did.]

    never happened to you.

    11:47 pm
    Will someone please rescue me from this Chinese fortune cookie factory?

    Tonight's fortune cookie is amusing in the way that only fortune cookies can be.

    [A fortune cookie says "Actions speak louder than talks." And Ahhhhhnold will kill you with quips.]

    The other fortune cookie fortune was not nearly as interesting.

    11:47 pm
    Those deployment blues

    My department at The Corporation had a deployment this morning (2:00 am). These deployments don't happen that often (the last one happened in January of this year; last year we had a total of four deployments) but usually there are no problems afterwards.

    This time we weren't so lucky.

    It wasn't a problem with our code, but with a vendor our customer, The Monopolistic Phone Company, uses. The vendor in question wasn't forwarding some critical information we were sending back to The Monopolistic Phone Company. We didn't notice this initially since our testing just happened to use the other vendor The Monopolistic Phone Company uses. So while it technically wasn't our problem, getting that particular vendor to even look at a problem, much less solve it, is a multi-month and multi-money problem, so practically it is our problem.

    The base problem is that one vendor who shall remain nameless is supposed to forward all SIP headers that start with a common prefix, but they have a limit to the number of non-standard SIP headers they'll forward and we've exceeded said limit. Apparently, a new feature we added, plus moving some existing data to its own header, bumped the number of headers past this limit. The fix was easy (just put the existing data we moved back in the old header while keeping it in the new header) but there was a bit of concern about installing it into production.
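
    A toy model may make the failure and the fix easier to see. Everything in it is an assumption made for illustration: the "X-Example-" prefix, the limit of 3, the header names and values, and the idea that the vendor drops headers in arrival order once the cap is hit. None of it reflects the real vendor or the real headers.

```python
PREFIX = "X-Example-"   # hypothetical common prefix
LIMIT = 3               # hypothetical cap on non-standard headers

def forward_headers(headers, prefix=PREFIX, limit=LIMIT):
    """Pass standard headers through, but only the first `limit`
    headers whose name starts with `prefix`."""
    forwarded = []
    custom_seen = 0
    for name, value in headers:
        if name.startswith(prefix):
            custom_seen += 1
            if custom_seen > limit:
                continue  # silently dropped once the cap is exceeded
        forwarded.append((name, value))
    return forwarded

# The release: a new feature header, plus data moved to its own header.
release = [
    ("Via", "SIP/2.0/UDP host.example"),
    ("X-Example-Info", "a=1"),
    ("X-Example-Route", "r=2"),
    ("X-Example-NewFeature", "on"),
    ("X-Example-Moved", "b=2"),      # fourth custom header, over the cap
]

# The fix: the moved data also goes back in the old header.
fixed = [
    ("Via", "SIP/2.0/UDP host.example"),
    ("X-Example-Info", "a=1;b=2"),
    ("X-Example-Route", "r=2"),
    ("X-Example-NewFeature", "on"),
    ("X-Example-Moved", "b=2"),
]

print(dict(forward_headers(release)))  # b=2 never arrives
print(dict(forward_headers(fixed)))    # b=2 arrives via the old header
```

    In this model the new header is still dropped after the fix, but the receiver gets the data through the header that stays under the cap.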

    You see, because our customer is The Monopolistic Phone Company, and they have regulatory issues with respect to reliability to contend with, there's a whole process involved with deployment. Just for starters, we have to give them a 10-business day notice of any changes, which they can veto ...

    Oh, and have I mentioned the very scary SLAs we have with them? Where vast amounts of money start flowing to The Monopolistic Phone Company for violations of said SLAs? So you can see why it takes a significant amount of time to get deployed, and why we have so few.

    Fortunately, we're given a number of emergency deployments we can use and thus, we used one of them today.

    All told, from initial bug fix to re-deployment took a total of three hours. That is the fastest deployment I've seen of our department's code.

    11:46 pm
    Notes about a broken menu system

    I am partaking of a local quick, consumable, gustatory establishment whereupon I spied a problematic carte du jour:

    [Of the three screens, one is a noticeable blue: "One of these things is not like the others / One of these things just doesn't belong / Can you tell which thing is not like the others / By the time I finish my song?"]

    Methinks the local proprietor requires consultation with the originating equipment manufacturer to resolve the current conundrum.
