An Example of Why You SHOULD NOT Let Google Crawl Your JavaScript Code

by SEO Mofo on Mar 22nd, 2011

in Advanced SEO, Google Sucks

In my article, titled Advanced SEO for Affiliate Marketing Links, and the follow-up article, titled Hey, Matt Cutts, I’m using JavaScript to hide links from Google, cool?, I discuss an SEO strategy that some people consider to be gray hat or even black hat SEO. The basic concept involves using JavaScript functions to create links, placing the JavaScript code in an external file, and then using robots.txt to block googlebot from accessing it. The end result: only your users can see your JavaScript links; Google sees plain text.
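
A minimal sketch of that setup, under stated assumptions: the file path `/js/links.js`, the `LINK_TARGETS` map, the `activateLinks` function, and the example URL are all hypothetical and do not come from the original articles. The matching robots.txt rule is shown as a comment.

```javascript
// Contents of a hypothetical external file, /js/links.js.
// A matching robots.txt (also an assumption) would contain:
//   User-agent: googlebot
//   Disallow: /js/links.js

// Hypothetical map from element IDs to destination URLs. The HTML itself
// only contains plain text, e.g. <span id="aff-1">Widget Store</span>.
var LINK_TARGETS = {
  "aff-1": "https://example.com/?ref=123"
};

// Replace each listed plain-text element with an <a> element that keeps
// the same text. `doc` is passed in so the function can be exercised
// without a browser.
function activateLinks(doc, targets) {
  var count = 0;
  for (var id in targets) {
    if (!Object.prototype.hasOwnProperty.call(targets, id)) continue;
    var el = doc.getElementById(id);
    if (!el || !el.parentNode) continue; // ID not present on this page
    var a = doc.createElement("a");
    a.setAttribute("href", targets[id]);
    a.textContent = el.textContent;
    el.parentNode.replaceChild(a, el);
    count++;
  }
  return count;
}

// In a browser, run once the page has loaded.
if (typeof document !== "undefined") {
  document.addEventListener("DOMContentLoaded", function () {
    activateLinks(document, LINK_TARGETS);
  });
}
```

A crawler that never fetches `/js/links.js` only ever sees the plain-text spans, while a visitor's browser runs the script and gets working links.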

In those articles, I discussed how to use this technique in a way that improves the user experience and prevents the passing of PageRank through paid links (as is required by the Google Webmaster Guidelines). One of the things I heard from the naysayers of this technique was that I shouldn’t block googlebot from viewing my JavaScript code, because Google is smart enough to “figure things out” for itself.

Also, in the past I’ve asked Matt Cutts if there’s any reason why I shouldn’t Disallow googlebot from crawling external JavaScript files. In his response, he advises people NOT to block Google and says the cost in bandwidth required to serve JavaScript files to Google is insignificant.

The following example shows that both arguments (Google understands JavaScript and it doesn’t cost you anything) are flawed and confirms my recommendation to Disallow googlebot from reading your JavaScript code (regardless of what the code actually does).

The code example below is from Google Instant Previews Experiment #01 – When is the Screen Captured? It is not from an external .js file–it is defined in the page’s <head> section. In other words, this is one of the few times I let Google see some JavaScript code…and you can see for yourself just how well Google has figured it out.

	// Show preview image number `int` inside its matching "updateNN" element.
	function showImage(int) {
		// Zero-pad single-digit numbers so they match the two-digit IDs and filenames.
		int = ((int < 10) ? "0" + int : int);
		var parentID = "update" + int;
		var updatePs = document.getElementById("updates").getElementsByTagName("p");
		var image = document.createElement('img');
		var imgID = "image" + int;
		var imgURI = "/img/google-instant-preview-" + int + ".png";
		var imgALT = "Google Instant Preview #" + int;

		// Remove any preview image already attached to one of the update paragraphs.
		for (var i = 0; i < updatePs.length; i++) {
			var imgObj = updatePs[i].getElementsByTagName("img");
			if (imgObj[0]!=null){
				imgObj[0].parentNode.removeChild(imgObj[0]);
			}
		}

		// Build the new image element and attach it to the target element.
		image.setAttribute("id", imgID);
		image.setAttribute("class", "preview-image");
		image.setAttribute("src", imgURI);
		image.setAttribute("width", "302");
		image.setAttribute("height", "585");
		document.getElementById(parentID).appendChild(image);
	}

The script lets me easily update that post by adding images to the rollover, using simple CSS classes/ids. But that doesn’t matter; what matters is that Google has pulled an arbitrary string from the code and is treating it like a link URL.

In other words, Google isn’t curiously testing the string to see if it’s a URL–no, Google is boldly declaring: This is definitely a URL, and I’m definitely counting it as a link, and therefore you definitely have a broken link on this page.
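
To illustrate how this kind of false positive can arise, here is a minimal sketch of a naive extractor that treats any quoted string starting with a slash as a URL path. It is purely a hypothetical illustration: the function name and the regular expression are assumptions, and nothing here reflects how Google's crawler actually works.

```javascript
// Hypothetical naive extractor: treats every quoted string literal that
// begins with "/" as if it were a complete URL path.
function extractPathLikeStrings(source) {
  var pattern = /(["'])(\/[^"'\s]*)\1/g;
  var found = [];
  var match;
  while ((match = pattern.exec(source)) !== null) {
    found.push(match[2]);
  }
  return found;
}

// One line from showImage(): the string is only a fragment, because the
// image number and ".png" are appended at run time.
var line = 'var imgURI = "/img/google-instant-preview-" + int + ".png";';

console.log(extractPathLikeStrings(line));
// → [ '/img/google-instant-preview-' ]
```

A crawler that requested such a fragment as if it were a real path would get a 404, even though no visitor could ever follow that "link".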

[Screenshot: Google Webmaster Tools showing broken JavaScript link]

Bottom line: Google sucks at understanding JavaScript, and there’s a real possibility that its reckless misinterpretation of your script will end up causing damage to your website’s rankings, its crawl rate, and/or its depth of indexation.


iDCx March 22, 2011 at 6:17 am

Great read, this one! Thanks for that. It is indeed a great way to get around the dynamic URLs created by affiliates; I often find myself dropping the session ID and then redirecting the juice to the homepage of the site.

Either way, thanks for this. I sent it round the office and it was well received by some of the dev guys who fully understand it!

Cheers again

iDCx

André Scholten March 22, 2011 at 8:42 am

Right, do you really think Google would be so stupid that these kind of 404 links could hurt rankings? I think it’s just an experiment. You can ignore the messages in the GWT.

SEO mofo March 22, 2011 at 10:25 pm

“Right, do you really think Google would be so stupid that these kind of 404 links could hurt rankings?”

Um…yeah…that’s pretty much the whole point of this article: to show how stupid Google is.

If *you* really think Google cares about false positives or collateral damage, then you haven’t been doing SEO long enough. And that’s assuming anyone over there even notices the bugs in their algorithms or acknowledges the edge cases that would get screwed by them–which isn’t always true. You’re right about one thing though: Google is just one big sloppy, never-ending experiment.

Adam March 22, 2011 at 6:41 pm

@Andre: Yes I do think that. Nice tip Darren

Andy March 23, 2011 at 3:33 am

Great read, I’ve been following your discussion on this one for a while… but I have one question:

Why is the “Google Sucks” link styled differently to all the others on your page? :P

SEO mofo March 23, 2011 at 7:36 am

Andy March 24, 2011 at 3:28 am

Can’t seem to find how to reply to your post… but thanks for the link… I love it! Hahaha

André Scholten March 24, 2011 at 5:33 am

“If *you* really think Google cares about false positives or collateral damage, then you haven’t been doing SEO long enough.”
I’ve done SEO long enough to know what is an algorithmic bug and what is not.
The way I see it: there’s a programmer at Google who decided to have a try at crawling JavaScript. He created a function that tries to read URLs from the script to see if they can reach pages that aren’t linked from normal HTML links. The programmer knows that not all “/bla” strings in JavaScript are URLs, therefore he must have built in something like: “try to crawl this, but as we don’t know it’s a real URL, don’t hurt rankings”.

I could be wrong, but so could you ;)

Norfi March 24, 2011 at 6:34 am

My friend, I have great admiration for you. I have read all your blogs; I don’t know if you are the best, but I think you are a great SEO. I have learned a lot from your articles and they have helped me a lot. You are a rare species in today’s world, where it is not very common for people to share their knowledge completely openly and selflessly.

Norfi March 24, 2011 at 6:44 am

Sorry, please delete the previous comment; I sent it untranslated.
My friend, I have great admiration for you. I have read all your blogs; I don’t know if you are the best, but I think you’re a great SEO. I have learned a lot from your articles and they have helped me a lot. You are a rare species in today’s world, where it is not very common for people to share their knowledge completely openly and selflessly.

Henry - Laboratorio March 30, 2011 at 3:05 pm

Regards,

Google reports one of my pages as a 404 error, which is strange because that page exists on my server. Could it be because of JavaScript problems?

Pardon my ignorance, but the error appeared a few days ago and I do not know if it will affect my rankings …

Thanks …

pepe April 26, 2011 at 3:46 pm

So, could we endorse the same for ‘User-agent: *’, or just googlebot?

Pavlicko May 31, 2011 at 10:43 am

Darren,

You make some excellent points, just keep in mind that using robots.txt to block the JavaScript file from Google won’t stop them from indexing crap if you’ve already linked to the .js file elsewhere within the code. Better to use the X-Robots-Tag header instead and leave the .js file out of robots.txt; otherwise Google won’t follow the HTTP directives from the X-Robots-Tag anyway.

Here’s the link but I’m sure you already know about it anyway.
http://code.google.com/web/controlcrawlindex/docs/robots_meta_tag.html

SEO mofo May 31, 2011 at 10:56 pm

Pavlicko,

I’m going to be honest…nothing you just said makes any sense to me. The point I was trying to make in this post is:

Google sucks at understanding JavaScript, so don’t let Google read your JavaScript code. You can prevent Google from reading it by putting it in an external .js file and disallowing googlebot from that file via robots.txt.

Regardless of where the .js file is linked from, and regardless of how it is linked, googlebot must respect the robots exclusion protocol and thus never request the .js file from my server.

Googlebot should never see an x-robots HTTP header in my .js file’s server response, because that would mean Google has already requested and received the .js file.

Are we saying the same thing, just 2 different ways?
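
The ordering argument in this reply can be sketched as a small check. This is a simplified robots.txt parser that handles only `User-agent` and `Disallow` lines with plain prefix matching (no wildcards, no `Allow`); the function names and the `/js/links.js` path are hypothetical. A compliant crawler consults robots.txt before requesting a file, so a disallowed file is never fetched and its response headers are never seen.

```javascript
// Simplified robots.txt parser: returns a map of lowercased user-agent
// tokens to their Disallow path prefixes. Only User-agent and Disallow
// lines are handled.
function parseRobots(text) {
  var groups = {};
  var agents = [];
  var inRules = false;
  text.split(/\r?\n/).forEach(function (raw) {
    var line = raw.replace(/#.*$/, "").trim(); // strip comments
    var sep = line.indexOf(":");
    if (sep === -1) return;
    var field = line.slice(0, sep).trim().toLowerCase();
    var value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      // A User-agent line that follows rules starts a new group.
      if (inRules) { agents = []; inRules = false; }
      var agent = value.toLowerCase();
      agents.push(agent);
      if (!groups[agent]) groups[agent] = [];
    } else if (field === "disallow") {
      inRules = true;
      if (value !== "") {
        agents.forEach(function (a) { groups[a].push(value); });
      }
    }
  });
  return groups;
}

// A crawler uses the group that names it specifically, falling back to "*".
function mayFetch(groups, agent, path) {
  var rules = groups[agent.toLowerCase()] || groups["*"] || [];
  return !rules.some(function (prefix) { return path.indexOf(prefix) === 0; });
}

var robots = [
  "User-agent: googlebot",
  "Disallow: /js/links.js"
].join("\n");

var groups = parseRobots(robots);
console.log(mayFetch(groups, "googlebot", "/js/links.js")); // → false
console.log(mayFetch(groups, "googlebot", "/index.html"));  // → true
console.log(mayFetch(groups, "otherbot", "/js/links.js"));  // → true
```

With `User-agent: *` in place of `User-agent: googlebot`, the same rule would apply to every compliant crawler that has no group of its own.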

SEO Expert June 6, 2011 at 1:53 pm

Oh, I never knew that Google’s crawlers don’t index JavaScript pages. It’s a good new thing I learned today, thanks for sharing the info.

isma reformas sevilla June 27, 2011 at 3:02 pm

I share SEO Expert’s opinion; I didn’t know either that JavaScript affected Google’s search engines. Thanks and regards.

phrench August 15, 2011 at 8:54 pm

Although I have disallowed /redirect/ in my robots.txt, the site performance tool in GWT shows me (very slow) loading times for /redirect/afflink

/redirect/afflink is actually /redirect.php?x=afflink (mod rewrite) and redirects to some external affiliate link.

Now the question is why does the site performance tool check the loading time of /redirect/afflink although I forbid Gbot to visit that directory and all links are nofollowed?

SEO mofo August 15, 2011 at 9:21 pm

It’s because Google doesn’t use googlebot to measure page speed; it uses information from the web browsers of actual visitors to your site–primarily from users who have the Google Toolbar installed.

Benjo October 29, 2011 at 3:45 pm

I have an ecommerce site (modernrugs.co.uk) which Google has recently taken a dislike to and I think Javascript might be to blame. Google has indexed and cached my website in the serps. But a few months ago the rankings dropped and I noticed Google would not display the text content from my website homepage in the serps. Basically if you copy any sentence of text from my homepage and paste it into google with inverted commas around it, my site doesn’t appear in the serps as it should do. Yahoo doesn’t have any problem with it, just Google (I wish it was the other way around :P).

I thought it might be a duplicate content issue as a competitor had taken large chunks of text from my homepage and put it on their site, so I re-wrote all the homepage content but the problem remains. So now I’ve got a suspicion that the problem may be because of javascript. Google has no problems with any of the other pages on my site, it is only the homepage (which does contain quite a bit of javascript, more than my other pages). Would you recommend putting ALL the homepage javascript in an external .js file and disallowing googlebot with robots.txt?

SEO mofo October 29, 2011 at 11:14 pm

Hi Benjo,

I ran a few simple tests on your site, and something’s definitely amiss. The most notable anomaly is the fact that your home page isn’t returned atop the results of a site: search.

Personally, I wouldn’t suspect JavaScript as the main cause of the problem. The reason being: Google engineers know that their crawler/indexer sucks at reading JavaScript, so they probably wouldn’t base any major algorithmic decisions (e.g., penalties) on it. In other words, it looks like Google has made some pretty extreme assumptions about your home page–the kind that generally require very strong signals (which JavaScript is NOT).

If I were you, the first thing I would address is this site that has scraped all your home page content. See if the site provides info to opt out, or email the site owner directly and ask them to remove your site from their database. If all else fails, submit a DMCA take-down request. That site is hosted in the U.S., so if they don’t comply with your DMCA notice, you can go to their hosting provider and have the entire server shut down.

Once that site is out of the picture, see if your home page recovers. If it doesn’t, try externalizing/blocking your JavaScript. If it does, send me a rug. ;)

Ben October 30, 2011 at 5:28 am

Thanks SEO Mofo, I had noticed that site before but as it was linking back to my site I didn’t think it was an issue. I’ll try and get in touch with them asap! If it works I’ll be back for your address to send you your free rug ;)

SEO mofo October 31, 2011 at 3:49 am

Since you mentioned that the site is linking to yours, I realized that it links to modern-rugs.co.uk instead of modernrugs.co.uk. So I did a few more tests and found some additional “suspects” you might want to investigate.

First I ran a site: search on modern-rugs.co.uk and saw that Google has 20+ URLs indexed, despite the fact that they redirect to modernrugs.co.uk.

It’s a bit unusual for Google to keep stale URLs indexed for this long, so I checked the server response for one of the indexed URLs, to see what kind of redirects are being used.

The results indicate that there’s a problem with your URL rewrites/redirects. Here is the chain of redirects that occurred:

blog.modern-rugs.co.uk/pages/rug_trends_26225.cfm

blog.modernrugs.co.uk//pages/rug_trends_26225.cfm

www.modernrugs.co.uk/blog

www.modernrugs.co.uk/blog/
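
A chain like this can be reproduced by following redirects one hop at a time. The sketch below assumes a runtime whose `fetch` exposes the `Location` header when called with `redirect: "manual"` (as Node.js 18+ does; browsers generally do not). The function name `traceRedirects` and the example URL are hypothetical.

```javascript
// Follow redirects manually and record every URL visited.
// `fetchFn` is injectable so the logic can be tested without a network.
async function traceRedirects(startUrl, fetchFn, maxHops) {
  fetchFn = fetchFn || fetch;
  maxHops = maxHops || 10;
  var chain = [];
  var url = startUrl;
  for (var hop = 0; hop <= maxHops; hop++) {
    var res = await fetchFn(url, { method: "HEAD", redirect: "manual" });
    chain.push({ url: url, status: res.status });
    var location = res.headers.get("location");
    // Stop on any non-redirect status or a redirect without a target.
    if (res.status < 300 || res.status >= 400 || !location) break;
    // Location may be relative, so resolve it against the current URL.
    url = new URL(location, url).href;
  }
  return chain;
}

// Example (performs real network requests):
// traceRedirects("http://example.com/old-page")
//   .then(function (chain) { console.log(chain); });
```

Each entry in the returned array records one hop, so a chain with more than two entries, or one containing a malformed path such as a double slash, is easy to spot.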

It probably doesn’t help that your site currently links to the hyphenated version, so you should probably update it to point to the non-hyphenated version. The link is in your right sidebar, in the “Latest Blog” box, at the bottom (it says “view all blogs”).

These problems aren’t really a big deal by themselves, but in aggregate they suggest the possibility that you might have changed domains/subdomains in the past and lost your Google stats (e.g., PageRank, trust, etc.) in the process, due to failed redirects. It’s also possible that Google is reading these problems as some kind of spam signal or something…who knows.

And one last thing I’ll mention is…I noticed your server is setting an unusually-high number of cookies–some of which don’t expire for like 30 years. I don’t have any reason to believe that this is hurting your site in Google, but just from a web developer’s perspective…it’s something I would investigate further if it was my site–especially if the other ideas fail to get results.

ben dale October 31, 2011 at 4:32 pm

Thanks again, really appreciate your advice. I will get those links changed as you suggest and will let you know how it goes :) This has been so frustrating!

Yasir Yar Khan January 11, 2012 at 6:09 am

Thanks for sharing :)

I need to check this Google misinterpretation for my websites too.

Isma-Chistes February 17, 2012 at 9:00 am

I do not understand how Google can rank two pages that are completely equal in structure so unevenly.
