HTTP Compression and the Google AdSense Crawler Bot

Posted: February 22nd, 2008
Filed under: Web Speed and Performance

FACT: HTTP Compression really improves Web serving.

FACT: Big sites like Google and Yahoo! use compression.

UNFORTUNATE FACT: Some services do not handle compression correctly and may break… unless you have a smart compression engine!

This underutilized technology transparently reduces the size of all text-based content served from a Web site or Web service, speeding up transmission across the Web, reducing bandwidth expenses, and freeing up server capacity to handle more requests. Compression deployments are accelerating among business sites, and Google.com has been compressing responses for a long time (see this real-time report: http://www.port80software.com/tools/compresscheck?url=www.google.com).
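To get a rough feel for the savings on text-based content, here is a minimal Python sketch (the sample markup is made up for illustration; real ratios depend on your pages):

```python
import gzip

# A repetitive HTML snippet stands in for a typical text-based response;
# markup compresses very well because tags and attributes repeat constantly.
html = ("<div class='item'><a href='/page'>A typical nav link</a></div>\n" * 200).encode("utf-8")

compressed = gzip.compress(html)

print(f"original:   {len(html)} bytes")
print(f"compressed: {len(compressed)} bytes")
print(f"savings:    {100 * (1 - len(compressed) / len(html)):.0f}%")
```

On highly repetitive markup like this the reduction is dramatic; on real-world HTML, savings in the 60-80% range are common.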

Google’s Googlebot, the Web crawler that indexes sites to form the basis of search results, also likes to see compressed content. At a search engine conference a few years back, search guru Danny Sullivan spent some time on this point: Google only indexes so much of a page, so if you send Googlebot compressed content (which it asks for via the “Accept-Encoding: gzip, deflate“ header in its requests), you can theoretically get more content indexed. You also save bandwidth on that request from Googlebot, as well as on requests from IE, Firefox, and other browsers and search bots that support HTTP compression. Very cool.

It is ironic then, given Google’s knowledge and use of HTTP compression, that Google’s AdSense program, which sells contextual advertising on third-party sites, uses technology that is not compatible with HTTP compression. One of Port80 Software’s httpZip compression customers recently received this email from Google’s AdSense team explaining why the customer’s contextual ad site was not getting indexed by the AdSense crawler bot (which goes by a user-agent name starting with “mediapartners-google”; a user agent is the Web client’s name, usually a browser or bot). Here is part of the email from a Google AdSense rep to our client:

“I’ve reviewed your site and have determined that our crawler is having difficulty accessing your URL. Specifically, your webserver is sending our crawler HTML in a compressed format, which our crawler is unable to process.

We recommend that you speak with your web administrator to ensure your system does not send our crawler compressed data. You can determine our crawler by looking for user agents starting with ‘Mediapartners-Google’.

Additionally, please be aware that after you have turned off the encoding, it may be 1 or 2 weeks before the changes are reflected in our index. Until then, we may display less relevant or non-paying public service ads. You should expect your ad relevance to increase over time.”

So, the AdSense crawler bot does not like HTTP compression. But the real question is: why is it asking for it? To receive a compressed response from any Web server, a user agent must first send that “Accept-Encoding: gzip, deflate” header in its request… if the AdSense bot cannot deal with compressed content, the bot should not request it. That makes sense, right?
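The negotiation described above boils down to a simple server-side check. This is a hypothetical sketch of the logic, not httpZip’s actual code:

```python
def choose_encoding(request_headers):
    """Pick a response encoding based on the request's Accept-Encoding header.

    A server should only compress when the client has explicitly advertised
    support; a client that cannot decompress simply omits the header.
    """
    accept = request_headers.get("Accept-Encoding", "").lower()
    for encoding in ("gzip", "deflate"):
        if encoding in accept:
            return encoding
    return None  # the client never asked for compression, so send it uncompressed

# A browser that supports compression gets gzip:
print(choose_encoding({"Accept-Encoding": "gzip, deflate"}))  # gzip
# A client that sends no Accept-Encoding header gets an uncompressed response:
print(choose_encoding({}))  # None
```

The AdSense bot’s bug is that it sends the header and then cannot process the compressed response it asked for.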

It looks like Google AdSense is asking clients not to compress responses to their bot, rather than fixing the decompression bug (an educated guess) in the bot’s code. So here is the fix for now: if you run a Web server, participate in the AdSense program on the serving side (you host Google AdSense ads on your own site), and still want to use compression for all other Web visitors, you must make an exception for any request with a user agent starting with “mediapartners-google”.
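In pseudocode terms, the workaround adds one extra check before compressing. The sketch below is a hypothetical illustration of the exclusion logic, not the actual httpZip implementation:

```python
import gzip

# User agents we must never compress for, even if they ask for it.
EXCLUDED_UA_PREFIXES = ("mediapartners-google",)  # Google's AdSense crawler bot

def should_compress(request_headers):
    """Compress only when the client asks for it AND is not an excluded bot."""
    accepts = "gzip" in request_headers.get("Accept-Encoding", "").lower()
    user_agent = request_headers.get("User-Agent", "").lower()
    return accepts and not user_agent.startswith(EXCLUDED_UA_PREFIXES)

def build_response(body, request_headers):
    """Return (body, extra_response_headers), honoring the exclusion."""
    if should_compress(body and request_headers):
        return gzip.compress(body), {"Content-Encoding": "gzip"}
    return body, {}
```

Matching on the prefix (rather than the full user-agent string) catches every version of the bot, which is exactly what the httpZip search string does below.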

Unfortunately, you cannot do this on Microsoft IIS 4 or 5 servers (NT or 2000) without a third-party compression tool like httpZip from Port80 Software, which can add a compression exclusion for a user agent. On IIS 6 (Windows 2003), you can use httpZip or ZipEnable to add such an exclusion. We will be adding a default exception for this bot in a minor version upgrade of both products soon, but here is how to add the exception for the AdSense bot with httpZip and ZipEnable today.

Excluding Google’s AdSense Bot from IIS Compression with httpZip:

– Install the free httpZip trial from www.httpzip.com/try.

– Once installed, confirm compression is working fine (http://www.port80software.com/products/httpzip/evaluation).

– Open the httpZip Settings Manager.

– On the Compression tab, select “New” to add a new Browser Exception for a MIME type. In the Add Browser Exception dialog, enter a name for the browser (like “AdSense Bot”) in the text box labeled “Browser Name.” Next, enter the search string used to identify the browser in the text box labeled “Search String” (use “mediapartners-google” to catch all versions of the bot; this short form acts as a wildcard across specific bot versions), then click OK. Please note: you will have to add this exception for each MIME type the bot requests, which should include “text/html”, “text/css”, “text/javascript”, and “application/x-javascript”, and probably a few more, based on what you are serving and want to get indexed.

Picking a MIME (text/html) to Exclude the AdSense Bot from compression

Setting up the AdSense Bot Exception for text/html MIME

– Apply your settings in the httpZip Settings Manager. Repeat the process for the other MIME types you want indexed (FYI, text/html should cover most dynamic content output from ASP, ASP.NET, CFM, PHP, JSP, etc. files).

– You can use Wfetch, a free tool in the IIS 6 Resource Kit (http://support.microsoft.com/kb/840671), to verify that responses will not be compressed when requested by the AdSense bot. Just add the “accept-encoding: gzip, deflate” header and a user agent starting with “mediapartners-google” to a request in Wfetch; with the new httpZip exclusion in place, the server’s response should contain no “content-encoding: gzip” or “content-encoding: deflate” header, which means it was not compressed.

– All other requests from capable browsers and bots will still be compressed, and you can rest assured you are no longer tripping up the Google AdSense bot. Remember, it may take a few weeks for the AdSense bot to reindex your site correctly.
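The Wfetch check in the steps above can also be scripted. The sketch below is a hypothetical helper that only inspects a set of already-captured response headers; it does not perform the HTTP request itself:

```python
def is_compressed(response_headers):
    """True if a response carries a compression Content-Encoding header."""
    encoding = response_headers.get("Content-Encoding", "").lower()
    return encoding in ("gzip", "deflate")

# After sending a request with "Accept-Encoding: gzip, deflate" and a
# "Mediapartners-Google" user agent, the exclusion is working only if the
# captured response headers make is_compressed(...) return False.
print(is_compressed({"Content-Encoding": "gzip"}))          # True  (exclusion NOT working)
print(is_compressed({"Content-Type": "text/html"}))          # False (exclusion working)
```

Pair this with any client that lets you set request headers (Wfetch, or a short urllib script) to confirm the bot gets uncompressed responses while normal browsers still get gzip.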

You can add an exclusion for AdSense bot requests on IIS 6 with ZipEnable by following the instructions above and adding the exclusion directly in ZipEnable; here is the documentation for that process (http://www.port80software.com/products/zipenable/docs#adv_set_browser). You will also want to use something like Wfetch that lets you alter your request headers, so you can spoof the user agent and confirm you get no compression when the user agent starts with “mediapartners-google”. Note that ZipEnable handles wildcarding in the search string a bit differently than httpZip: in ZipEnable, use “mediapartners-google*”.

We hope this helps clear up any confusion on Google AdSense and HTTP compression – please contact us for help here and for other tips on IIS performance boosts!

Best regards,
Port80 Software

One Comment on “HTTP Compression and the Google AdSense Crawler Bot”

  • I haven’t tried this yet, but I will as soon as I have time. I also have an important site running on Apache, and I wonder how to manage it there.

    Thanks for the explanation, useful for those like me without an unmetered host plan 😛

    Posted by: Oil Paintings at 8:59 am on July 18th, 2008