Don’t trust Google to tell you the source
There’s a certain irony in the whole trustrank and authority shaboogle. Google’s algo, it seems, puts so much trust in certain domains that it actually hoodwinks its surfers into believing that other domains are more relevant than the original source documents themselves.
Michael Gray at Graywolf draws attention to a factor that anyone who works in SEO these days is either painfully aware of or ecstatically pleased about. If you have enough ‘authority and trust’ then, given a little effort, you can rank for just about anything.
Michael’s piece is particularly interesting as it illustrates clearly that Google has difficulty identifying where a particular piece originated, even when the article itself contains plenty of clues as to its origin. I guess it was encouraging for Google that, on that occasion, they managed to get the Forbes result to appear before the MSNBC result. Yet I suspect this wasn’t universal, and that it would very much depend upon who else was referencing the article on any given day.
Whilst Michael’s piece focuses on the duplicate content issue, it got me thinking and kinda pushed me off on a tangent about the whole attribution thing. I was intrigued, so I decided to dig a little deeper and see if the same rules applied. I suspected they didn’t, but wanted to see in any case.
Google and document source attribution
Google tells us that one of the ways we can tell them that we are the originating source is by embedding a link back to our pieces within our articles.
Syndicate carefully: If you syndicate your content on other sites, Google will always show the version we think is most appropriate for users in each given search, which may or may not be the version you’d prefer. However, it is helpful to ensure that each site on which your content is syndicated includes a link back to your original article. You can also ask those who use your syndicated material to block the version on their sites with robots.txt.
This supposedly helps them to determine the document source, allowing them to give a more precise SERP attribution. Besides the fact that this clearly isn’t working, I had to laugh at the line ‘You can also ask those who use your syndicated material to block the version on their sites with robots.txt.’ Come on guys, seriously, who has the time to go around to multiple sources and ask them to do that? Most re-distributors distribute this stuff precisely because of the inadequacies that this article discusses! Why would they do that? Anyhow, I digress. Getting back on track.
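For what it’s worth, here is roughly what that robots.txt block would look like on a re-distributor’s side, checked with Python’s standard `urllib.robotparser`. It is only a sketch: the `partner.example` domain and the `/syndicated/` path are made up for illustration, and it says nothing about how Google itself treats the rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a syndication partner could publish to keep
# crawlers away from the copies it carries (the path is an assumption).
ROBOTS_TXT = """\
User-agent: *
Disallow: /syndicated/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The syndicated copy is blocked, the partner's own pages are not.
copy_allowed = parser.can_fetch(
    "Googlebot", "https://partner.example/syndicated/weak-dollar.html"
)
own_allowed = parser.can_fetch(
    "Googlebot", "https://partner.example/news/own-story.html"
)
print(copy_allowed, own_allowed)
```

Two lines of config, in other words. The hard part is persuading anyone to add them.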
I wanted to look at other examples from Forbes to see how Google generally handled other long tail SERPs. I dug out the first post title I came across from a recent archive at Forbes and entered it in the Google search box: ‘How to travel well on a weak dollar’. The results showed the original Forbes article at position #5 and the MSNBC version at position #1.
What is particularly interesting about this SERP is that, despite the MSNBC version being littered with references back to Forbes, and despite the timestamp on the original clearly showing that Forbes published the article first, Google still decides that MSNBC should be the first place users visit for this particular piece. If that wasn’t telling enough, the SERP also showed 3 other separate domains, all carrying the same syndicated content, and Google decided that they too were more relevant than the source.
How can this be so? How can 4 other domains be more relevant for an article than the original source? As Michael pointed out, it’s a clear case of authority and trust pipping everything else at the post, especially when the syndication points have very well referenced link or user profiles.
Some people are lazy
My view is that in this case, what may be happening is that the domains which reference the article at Forbes are themselves being re-syndicated and picked up by other bloggers and networks, who in turn are looking for related stuff to write about and reference. The factors that influence their own domain trust and authority dictate that they then appear higher than the source for the given piece. Lots of bloggers are lazy when looking for a topic to write about and will often link back to the first port of call, rather than the originating piece. If they are using the likes of Google to research their writing, then by pushing down the original, Google is actually contributing to this phenomenon and exacerbating the issue.
Publishers being screwed by domain authority
This clearly has massive ramifications for publishers the world over, especially those of a niche variety with limited comparative authority. Say you write a good piece and are fortunate enough for it to be picked up on and referenced, perhaps via your RSS feed. The bottom line would appear to be that even if you take every step to signal that you and you alone are the source, given enough repetition by domains with a bigger trust score than your own, your piece will be pushed further down the SERPs. In other words, from a search perspective at least, Google will tell the world that other domains that carry your stuff are more relevant than your very own.
Of course, it isn’t easy
To be fair to Google, I guess that it is difficult to determine precisely the point in time that a document is published. The challenges are legion: other sources can fake the last-modified header, and dynamically driven content fed via some mod_rewritten script will usually output a different timestamp for each document refresh. A domain that is spidered more frequently than one with lesser authority might show content in SERPs before the originator, again due to its authority and trust score determining that its content has more value, even if it isn’t the content originator! Yet at the same time, it surely can’t be so difficult to say that if the copies of a piece all point back to one document carrying the same content, then that document is the source.
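To make that last point concrete, here is a minimal sketch of the ‘follow the link backs’ rule. It assumes the duplicates have already been grouped and their outbound links extracted. The function name, the data layout and the URLs are all hypothetical, and nothing here reflects how Google actually does it.

```python
from collections import Counter

def guess_source(copies):
    """Guess the original among documents carrying the same content.

    `copies` maps each document URL to the set of URLs it links to.
    The copy that the other copies link back to most often wins.
    Returns None when there are no link backs or the vote is tied.
    """
    votes = Counter()
    for url, outlinks in copies.items():
        for target in outlinks:
            # Only count links that point at another copy of the piece.
            if target in copies and target != url:
                votes[target] += 1
    ranked = votes.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

# Hypothetical example: three syndicated copies credit the original.
copies = {
    "https://original.example/weak-dollar": set(),
    "https://bignews.example/travel/weak-dollar": {"https://original.example/weak-dollar"},
    "https://portal.example/story/123": {"https://original.example/weak-dollar"},
    "https://blog.example/reposts/9": {
        "https://bignews.example/travel/weak-dollar",
        "https://original.example/weak-dollar",
    },
}
source = guess_source(copies)
print(source)
```

Note that the lazy blogger who links to both the big news site and the original doesn’t change the outcome here, because the original still collects the most link backs. Ignore those links and rank on domain authority alone, and you get the SERP described above.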
Perhaps on-page textual analysis is still so underdeveloped that the other ‘noise’ on a page makes it difficult to determine exact duplication. By noise I mean surrounding navigation text, imagery, page layout and other content.
A page might carry an article from a 3rd party source and might supplement it with lots of other stuff that isn’t to be found on the originating page. You can imagine how tough it might be to work all this out.
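One well-known way around the noise problem is to compare overlapping word sequences (‘shingles’) rather than whole pages. The sketch below is an illustration only, with made-up sample text. It measures how much of an article turns up inside a page (containment) alongside the stricter whole-page overlap (Jaccard). I’m not claiming this is what any search engine runs.

```python
import re

def shingles(text, k=4):
    """Return the set of k-word sequences found in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def containment(article, page, k=4):
    """Fraction of the article's shingles that also appear in the page."""
    a = shingles(article, k)
    if not a:
        return 0.0
    return len(a & shingles(page, k)) / len(a)

def jaccard(doc_a, doc_b, k=4):
    """Shared shingles divided by all shingles across both documents."""
    a, b = shingles(doc_a, k), shingles(doc_b, k)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

article = ("the weak dollar makes travel abroad more expensive "
           "for american tourists this summer")
# The same article wrapped in navigation and promo text (the 'noise').
page = "Home News Travel " + article + " Related stories Subscribe now"
unrelated = "cheap flights and hotel deals for your next city break in europe"

print(containment(article, page))       # every article shingle is in the page
print(jaccard(article, page))           # lower, because the noise adds shingles
print(containment(article, unrelated))  # nothing in common
```

Containment ignores whatever else sits around the article, which is exactly the extra stuff that drags the whole-page comparison down.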
Reducing the impact
If anything, it shows us that if we do decide to distribute our content and want to protect ourselves from scenarios like this, our options are limited. We can add some branding to our post titles, cut off the article length or add extra page titles to our feed. Beyond that, until Google gets its act together, there is very little we can do other than hit and hope that we don’t fall too far down the SERP.
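As a rough sketch of the branding and truncation idea, the hypothetical helper below builds a feed entry with a branded title, a cut-down excerpt and a link back to the full piece. The function name, the field names and the 50-word default are all assumptions for illustration, not a recipe that is known to change rankings.

```python
def feed_excerpt(title, body, url, brand, max_words=50):
    """Build a feed entry that carries branding and only part of the article."""
    words = body.split()
    excerpt = " ".join(words[:max_words])
    if len(words) > max_words:
        excerpt += " ..."
    return {
        # Branded title, so every re-published copy names the source.
        "title": f"{title} | {brand}",
        # Truncated text plus an explicit pointer to the original.
        "description": f"{excerpt} Read the full article at {url}",
        "link": url,
    }

item = feed_excerpt(
    "Sample post",
    "one two three four five",
    "https://example.com/sample-post",
    "MySite",
    max_words=3,
)
print(item["title"])
print(item["description"])
```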