Google Turned My Plain Text URL Into a Link

by SEO Mofo on Dec 8th, 2009

in Experiments

[Image: Google eating text and crapping links]

If you want to stay current with all the incredible information this site has to offer, you can subscribe to my feed [1], or set your web browser’s home page to http://www.seomofo.com [2].

[1] Example of a link.
[2] Example of a plain text URL.

One of the SEO theories I’ve heard is that if Google finds a plain text URL in your content, it might interpret that URL as some kind of signal. Does it count as a link to the URL? Does it contribute to the URL’s authority? Does it count towards anything at all? Probably not. Matt Cutts addressed this theory in one of his Webmaster videos and essentially killed it.

It sounds to me like plain text URLs aren’t counted for anything…but you can never be too confident about these sorts of things. Sometimes these theories end up being partially true, simply because Google engineers have overlooked rare cases that indirectly validate the theory. I came across a particular case today that shows at least one instance of Google treating a text URL like a link. Mind you, what I found is more like a bug in Google’s caching logic, and not an intentional use of text URLs as a ranking signal. In other words, the information in this post may be interesting, but it’s virtually useless. (Unlike most SEO bloggers, I hate publishing useless shit, but I’ll make an exception for myself this time.)

Earlier today, I viewed the text-only cached copy of my amazing article about monitoring Google Sidewiki comments through RSS. My goal was to see how Google was caching my social media widgets that I insert at the bottom of every blog post. If you don’t know what I’m talking about, scroll to the end of this article and take a look at my sweet widget collection. And while you’re down there, you should probably try out each of the widgets and make sure they’re functioning properly. Also, according to some recent studies I just made up, promoting my site has been shown to improve your site’s rankings and increase traffic!

Anyway…back to my story. So I was viewing Google’s cached text version of my page, because I wanted to see if Google was executing the widget scripts that are remotely hosted on their respective social network domains. Why would I care about that? Well…I’ve run into problems before where Google has executed locally-hosted scripts, which led to a whole bunch of JavaScript-driven links being counted as regular HTML links. (FYI: you CAN lose PageRank through JavaScript links.)

So I viewed the source code of Google’s cached page and compared it to the original source code. I was glad to see Google was NOT caching any content from the JavaScript-driven widgets. Instead, it just stripped out the <script> elements. But then I noticed something very odd: Google had also stripped out the <textarea> tags that I use for my link widget. Take a look…you’ll notice that lines 2 and 4 are gone:

Original Code

<div class="post_widget_mofo" id="copy_link_mofo">
	<textarea onclick="this.focus();this.select();" rows="5" cols="24">
		<a href="http://www.seomofo.com/orm/monitor-sidewiki-comments.html">How to Monitor Google Sidewiki Comments</a>
	</textarea>
	<p style="color:#999; font-size:10px; line-height:1.2em; text-align:center; margin:0;">Link</p>
</div>

Google’s Cached Code

<div class="post_widget_mofo" id="copy_link_mofo">
	<a href="http://www.seomofo.com/orm/monitor-sidewiki-comments.html">How to Monitor Google Sidewiki Comments</a>
	<p style="color:#999; font-size:10px; line-height:1.2em; text-align:center; margin:0;">Link</p>
</div>

When text and HTML code are enclosed in <textarea> tags, web browsers will not render the HTML; they will treat all of it like plain text. This is why my link widget contains un-encoded HTML but is displayed in the browser as plain text:

[Link widget: a text box showing the raw <a> markup from the original code above, labeled “Link”]


…but obviously if Google strips the <textarea> tags, then all that’s left is the HTML code, and now Google shows a link for what was supposed to be plain text:

[Image: Original Widgets (social media widgets)]

[Image: Cached Widgets]
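If you want to see the effect without waiting for a crawl, here is a minimal Python sketch of the kind of naive tag stripping that would produce the cached code shown earlier. This is only my guess at the behavior, not Google’s actual code, and `strip_textarea_tags` is a hypothetical helper name.

```python
import re

# Matches an opening <textarea ...> tag or a closing </textarea> tag.
# Hypothetical model only; Google's real pipeline is unknown.
TEXTAREA_TAG = re.compile(r"</?textarea\b[^>]*>", re.IGNORECASE)


def strip_textarea_tags(markup: str) -> str:
    """Remove the <textarea> tags but leave their contents untouched."""
    return TEXTAREA_TAG.sub("", markup)


original = (
    '<textarea onclick="this.focus();this.select();" rows="5" cols="24">'
    '<a href="http://www.seomofo.com/orm/monitor-sidewiki-comments.html">'
    'How to Monitor Google Sidewiki Comments</a>'
    '</textarea>'
)

cached = strip_textarea_tags(original)
print(cached)  # the <a> element is no longer wrapped, so it is live markup
```

Once the wrapper is gone, nothing marks the `<a>` element as text, so anything that renders the result will show a clickable link.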

To be honest, I didn’t even know <textarea> elements would show un-encoded HTML as text. I’m usually very conscious about encoding HTML brackets (using &lt; instead of < and &gt; instead of >), but I didn’t catch this one because it was rendering the way I wanted it to.
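For anyone who hasn’t done this before, here is a small sketch of what encoding the brackets looks like. It uses Python’s standard `html` module purely for illustration; a WordPress (PHP) function would presumably use something like `htmlspecialchars` instead.

```python
import html

raw = (
    '<a href="http://www.seomofo.com/orm/monitor-sidewiki-comments.html">'
    'How to Monitor Google Sidewiki Comments</a>'
)

# html.escape replaces &, < and > with entities; by default it also
# encodes double and single quotes.
encoded = html.escape(raw)
print(encoded)

# Decoding the entities gives back the original markup, so nothing is lost.
assert html.unescape(encoded) == raw
```

The encoded string contains no literal `<` or `>`, so it displays as text whether or not the surrounding `<textarea>` tags survive.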

I have updated my WordPress function to use encoded HTML brackets in my widget, but I still want to play around with this bug, so I’m going to set up several <textarea> elements with various code snippets in them…and wait for Google to cache them. I’ll update this post in a few days with the results…but it will probably still be virtually worthless information.


[Live <textarea> test widgets appeared here, labeled Exhibit A, Exhibit B, Exhibit C, and Exhibit D]



Exhibit A: this is basically a repeat of what I’ve already seen. I’m including this to verify that Google is still behaving the way I observed already.

Exhibit B: the onclick attribute is removed from the <textarea> element. Is this attribute what caused Google to strip out these tags?

Exhibit C: the link HTML has been replaced with an image tag. Can I smuggle an image into Google’s text-only cache by wrapping it in <textarea> tags?

Exhibit D: the link HTML has been replaced with a simple JavaScript function. Can I execute JavaScript from within Google’s text-only cache by wrapping it in <textarea> tags?


The Results Are In!

Wow, that was fast. According to Google’s cache date, this post was crawled at 19:57:08 GMT. I posted it at 9:11 am, Pacific time…so that would be 17:11 GMT…which means it took Google about 2 hours and 46 minutes to find my page and crawl it. I don’t know when it appeared cached, but it couldn’t have been more than 4 hours from the crawl time (I noticed it was cached when I checked at about 23:45 GMT).
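If you want to double-check my time zone math, here is a quick sketch. It assumes Pacific Standard Time (UTC-8, since this is December), assumes everything happened on the post date of Dec 8th, 2009, and treats the unknown seconds of the posting time as :00.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Pacific Standard Time (UTC-8) applies in December.
PST = timezone(timedelta(hours=-8), "PST")

# Assumption: all three events fall on the post date; posting seconds unknown.
posted = datetime(2009, 12, 8, 9, 11, 0, tzinfo=PST)
crawled = datetime(2009, 12, 8, 19, 57, 8, tzinfo=timezone.utc)
noticed = datetime(2009, 12, 8, 23, 45, 0, tzinfo=timezone.utc)

posted_gmt = posted.astimezone(timezone.utc)
time_to_crawl = crawled - posted      # posting -> crawl
time_to_notice = noticed - crawled    # crawl -> first time I saw it cached

print(posted_gmt.strftime("%H:%M GMT"))
print(time_to_crawl)
print(time_to_notice)
```

The last value is only an upper bound on how long the cached copy took to appear, since the page may have been cached well before I checked.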

So here are the results! First off, here is the cached code (you don’t have to analyze it too much…I’ll interpret it for you below):

<div id="exhibit_a" style="width:25%; float:left;">
	<p class="mofo_label">Exhibit A</p>
	<a href="http://www.seomofo.com/">World&#8217;s Greatest SEO</a>
</div>
<div id="exhibit_b" style="width:25%; float:left;">
	<p class="mofo_label">Exhibit B</p>
	<a href="http://www.seomofo.com/">World&#8217;s Greatest SEO</a>
</div>
<div id="exhibit_c" style="width:25%; float:left;">
	<p class="mofo_label">Exhibit C</p>
	SEO Mofo
</div>
<div id="exhibit_d" style="width:25%; float:left;">
	<p class="mofo_label">Exhibit D</p>
	
</div>

Here’s what it looks like rendered:

[Image: textarea code samples rendered in Google’s cached text]

And finally…here’s what it all means:

Exhibit A: as you can see, Google turned our HTML text into a link by removing the <textarea> tags. This is just more of what we already saw.

Exhibit B: this has also been stripped of its <textarea> tags, which suggests that the onclick attribute in the <textarea> tag was NOT the reason why Google removes them.

Exhibit C: this has also been stripped, but so has my <img> tag, leaving nothing but the alt text in its place. This suggests that Google parses the content nested in the <textarea> element as ordinary HTML and applies the same text-only rules to it that it applies to the rest of the page. My attempt to smuggle in an image failed.

Exhibit D: same result as Exhibit C, but this was a <script> element, so it didn’t have any alt text to leave behind, just a hole where the script used to be. If the <img> trick failed, then it makes sense that this one would too.
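Here is a toy Python model that reproduces the behavior observed in the four exhibits. To be clear, this is my reconstruction from the cached output, not Google’s code; `text_only_view` is a hypothetical function, and the exhibit snippets are stand-ins (the image file name and script body are made up).

```python
import re


def text_only_view(markup: str) -> str:
    """Toy model of the observed cache behavior (not Google's real code)."""
    # 1. Drop <textarea> tags, exposing whatever was inside them.
    markup = re.sub(r"</?textarea\b[^>]*>", "", markup, flags=re.IGNORECASE)
    # 2. Remove <script> elements entirely, contents included.
    markup = re.sub(r"<script\b[^>]*>.*?</script>", "", markup,
                    flags=re.IGNORECASE | re.DOTALL)

    # 3. Replace each <img> tag with its alt text (or nothing without alt).
    def alt_text(match):
        alt = re.search(r'\balt="([^"]*)"', match.group(0), flags=re.IGNORECASE)
        return alt.group(1) if alt else ""

    return re.sub(r"<img\b[^>]*>", alt_text, markup, flags=re.IGNORECASE)


# Hypothetical stand-ins for the exhibit widgets; the real snippets may differ.
exhibit_b = ('<textarea rows="5" cols="24">'
             '<a href="http://www.seomofo.com/">World&#8217;s Greatest SEO</a>'
             '</textarea>')
exhibit_c = ('<textarea rows="5" cols="24">'
             '<img src="logo.png" alt="SEO Mofo" /></textarea>')
exhibit_d = ('<textarea rows="5" cols="24">'
             '<script type="text/javascript">alert("hi");</script></textarea>')

for sample in (exhibit_b, exhibit_c, exhibit_d):
    print(repr(text_only_view(sample)))
```

The ordering matters in this model: the `<textarea>` wrapper is removed first, so the image and script rules see the nested content as if it were regular page markup, which matches what the cache showed.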

Conclusion

The conclusion here is that Google’s engineers have overlooked a minor detail. When Google processes a page and prepares the text-only content, it fails to recognize that the content of a <textarea> element should be treated like text and not like HTML code.† If Google is going to strip out the <textarea> tags, it should also make sure the newly exposed content is encoded properly so it doesn’t render as HTML.

† The W3C HTML 4.01 specification says: “The TEXTAREA element creates a multi-line text input control. User agents should use the contents of this element as the initial value of the control and should render this text initially.”
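As a rough illustration of the fix suggested in the conclusion, here is a sketch that removes the <textarea> tags but entity-encodes whatever was inside them. It is one possible approach under my own assumptions; `flatten_textareas` is a hypothetical name, and a real implementation would use a proper HTML parser rather than a regular expression.

```python
import html
import re

# Captures the body of each <textarea> element (sketch only; a regex is
# not a robust way to parse arbitrary HTML).
TEXTAREA = re.compile(r"<textarea\b[^>]*>(.*?)</textarea>",
                      re.IGNORECASE | re.DOTALL)


def flatten_textareas(page: str) -> str:
    """Drop <textarea> tags but keep their contents displaying as text."""
    def encode_body(match):
        # Entities inside a textarea are decoded by browsers, so decode
        # first, then re-encode &, < and > for safe display as text.
        text = html.unescape(match.group(1))
        return html.escape(text, quote=False)

    return TEXTAREA.sub(encode_body, page)


page = ('<div><textarea rows="5" cols="24">'
        '<a href="http://www.seomofo.com/">SEO Mofo</a>'
        '</textarea></div>')

print(flatten_textareas(page))
```

With this approach the surrounding <div> stays live markup, while the former textarea content comes out with `&lt;` and `&gt;` in place of its brackets, so it still reads as plain text.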

Any Takeaways?

Not much to take away here. The only things I can think of are:

  1. If you’re trying to comment-spam a site that doesn’t allow links, maybe it allows links wrapped in <textarea> tags? Ha. Somehow I doubt any CMS programmers would make a special exception like this…but hey, it’s possible.
  2. Every once in a while, view a couple of your pages’ cached text versions. Often it will reveal content that Google is caching that you don’t want cached, for example: “This site requires Flash…blah blah blah…please click here to go to the Flash download page. If you get lost, just follow my PageRank.”

Sweet Widget Collection

Go ahead…you can touch them.