Microsoft Sucks Bandwidth

August 26th, 2005 in Rants. Weblog

I’ve seen a trend in my server logs this month (day, week, month averages):

1199    10199   32944   msnbot/1.0 (+http://search.msn.com/msnbot.htm)

Microsoft’s MSN search crawler is requesting pages and resources about 1,200 times a day on my site, every day. That’s about four times as many requests as there are resources on my site, and roughly a quarter of my current request volume (32k of 120k raw hits a month). Compared to all other bots combined, MSNBot sucks ten times more resources five times more frequently.
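Per-agent counts like the ones above can be pulled from an access log with a few lines of Python. This is a minimal sketch, not the method used for the numbers in this post: it assumes the Apache/NCSA "combined" log format (user agent as the last quoted field), and the function names and sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: "combined" log format, where the user agent is the last
# double-quoted field on each line. Other formats need a different pattern.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')


def count_by_user_agent(lines):
    """Return a Counter mapping each user-agent string to its request count."""
    counts = Counter()
    for line in lines:
        match = USER_AGENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def share_of_traffic(counts, needle):
    """Fraction of counted requests whose user agent contains `needle`."""
    total = sum(counts.values())
    hits = sum(n for agent, n in counts.items() if needle.lower() in agent.lower())
    return hits / total if total else 0.0


# Hypothetical sample lines (documentation IP addresses, invented requests).
sample = [
    '192.0.2.1 - - [26/Aug/2005:07:00:00 -0700] "GET / HTTP/1.1" 200 512 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '192.0.2.1 - - [26/Aug/2005:07:00:05 -0700] "GET /logo.png HTTP/1.1" 200 2048 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '192.0.2.2 - - [26/Aug/2005:07:01:00 -0700] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.3 - - [26/Aug/2005:07:02:00 -0700] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

counts = count_by_user_agent(sample)
msn_share = share_of_traffic(counts, "msnbot")
```

On a real log, pass an open file object instead of `sample`. For comparison, the monthly figures quoted above (32,944 of roughly 120,000 raw hits) work out to about 27%.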

What the hell are they doing? And why do their search results still suck?

If my results are at all consistent with other sites, it would appear that MSNBot sucks at least 15 times more bandwidth than all other web crawlers combined. I guess it’s no surprise that Microsoft sucks more than most. I only wonder if there is a point to crawling sites so frequently.

7 Responses to “Microsoft Sucks Bandwidth”

  1. Steven Fisher says:
    August 26th, 2005 at 7:58 am

    Is that based strictly on the user agent? If so, it is probably spammers crawling with the same user agent string.

  2. mx says:
    August 26th, 2005 at 8:46 am

    From what I can see the hits are all from a single group of IPs that reverse-map to Microsoft (sampling a few daily logs). I’ve read elsewhere that the MSNBot is aggressive, but only this month has it eclipsed GoogleBot on my site. It may be possible that the MSNBot is buggy, or that someone has hijacked some machines in their IP range (though that seems unlikely).

    I would exclude it from crawling my site, except that the exposure is probably not a bad thing. It just seems that they’re being a bit greedy (to the point of insanity) with their crawling.

  3. mx says:
    August 26th, 2005 at 8:47 am

    I can find quite a few complaints from Google too: .

  4. Steven Fisher says:
    August 26th, 2005 at 9:00 am

    Well, you know me well enough to know I’m not a Microsoft apologist. Still, I’d like to believe that they could get something like a spider correct.

    The evidence seems to be against it, though. Thanks for the link. Just to be safe, I’ve added a line to robots.txt excluding msnbot… hopefully their spider works at least that well by now.

  5. mx says:
    August 26th, 2005 at 9:25 am

    It also looks like MSNBot isn’t caching things very well. If it’s re-grabbing all the pages on a daily basis, then they’re not looking at the last-modified header. It also looks like the bot is re-reading resources (images) that are on each page. Looks like a very beta-version, or maybe they just don’t care.

  6. ChipCuccio.US says:
    August 31st, 2005 at 3:27 pm

    No more MSNbot

    Today I added the following rule/exclusion to my robots.txt:
        User-agent: msnbot
        Disallow: /
    Why? Because MSN’s spiders request too many resources, too frequently. MSN spiders don’t play nice, and I’ve had my eye on them...
    
  7. Matt Cutts: Gadgets, Google, and SEO » Crawl caching proxy says:
    April 23rd, 2006 at 1:07 pm

    As part of the Bigdaddy infrastructure switchover, Google has been working on frameworks for smarter crawling, improved canonicalization, and better indexing. On the smarter crawling front, one of the things we’ve been working on is bandwidth reduction. For example, the pre-Bigdaddy webcrawl Googlebot with user-agent “Googlebot/2.1 (+http://www.google.com/bot.html)” would sometimes allow gzipped encoding. The newer Bigdaddy Googlebots with user-agent “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)” are much more likely to support gzip encoding. That reduces Googlebot’s bandwidth usage for site owners and webmasters. From my conversations with the crawl/index team, it sounds like there’s a lot of head-room for webmasters to reduce their bandwidth by turning on gzip encoding.
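To get a rough feel for the gzip point in the last comment, the sketch below measures how much smaller a page body becomes when compressed with Python's standard `gzip` module. The helper name `gzip_savings` and the sample page are assumptions made for illustration. The sample is highly repetitive, so it compresses far better than a typical real page would.

```python
import gzip


def gzip_savings(body: bytes) -> float:
    """Return the fraction of bytes saved by gzip-compressing `body`."""
    if not body:
        return 0.0
    compressed = gzip.compress(body)
    return 1 - len(compressed) / len(body)


# Hypothetical page: repetitive markup standing in for a real blog page.
page = (
    "<html><body>"
    + "<p>Microsoft Sucks Bandwidth</p>\n" * 200
    + "</body></html>"
).encode("utf-8")

print(f"original: {len(page)} bytes, saved: {gzip_savings(page):.0%}")
```

This only estimates the size difference. Whether a crawler actually receives the compressed version depends on it sending an `Accept-Encoding: gzip` request header and on the server being configured to honor it.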

 
