David over at science text alerted me to a story I had missed about Google: some strange accounts of ‘indexed’ websites that don’t exist installing malware and viruses on Google users’ machines.
Some searches (very specific phrases, and I won’t list any of them right now – Google knows which they are) return results with a large number of .cn (Chinese) sites.
The .cn sites are often scraped content from legitimate U.S. websites.
The legitimate sites are being ranked below the scammed .cn sites for these competitive keywords.
Just another hijack story?
Nothing so new there; we’ve all read accounts of scraper sites outranking ‘legitimate’ sites for their own content, often by use of a 302 ‘hijack’. It’s pretty easy to scrape content and slap a few ads around it here and there, and in fairness to the engines it’s not the easiest thing to eliminate, especially in a world of RSS and syndicated content.
I’m going to be a little lazy and surmise that those clever so-and-sos use a little common sense and hook up with the various ping services that blogs like WordPress use when publishing new content. This would, you’d think, give them a good way of establishing who published what first, where and when. Grab the timestamp, put it into a database and Bob’s your uncle. This way, any duplicate content that followed wouldn’t be classed as the original source and would be ranked beneath the original.
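To make the timestamp idea concrete, here is a minimal sketch in Python. It is purely my own illustration and not anything Google is known to do: the `FirstSeenRegistry` class, its method names and the example URLs are all made up, and it assumes each ping arrives with the full text of the post and a trustworthy timestamp.

```python
import hashlib
from datetime import datetime, timezone


class FirstSeenRegistry:
    """Hypothetical store remembering who published a piece of text first."""

    def __init__(self):
        # Maps a content fingerprint to (url, timestamp) of the earliest ping.
        self._first_seen = {}

    @staticmethod
    def fingerprint(text):
        # Lowercase and collapse whitespace so trivial edits do not change the hash.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def record_ping(self, url, text, timestamp):
        # Keep only the earliest publisher seen for this fingerprint.
        key = self.fingerprint(text)
        current = self._first_seen.get(key)
        if current is None or timestamp < current[1]:
            self._first_seen[key] = (url, timestamp)

    def original_source(self, text):
        # Return the URL of the earliest known publisher, or None if unseen.
        entry = self._first_seen.get(self.fingerprint(text))
        return entry[0] if entry else None


registry = FirstSeenRegistry()
post = "How to bake a better loaf"
registry.record_ping("http://blog.example/loaf", post,
                     datetime(2007, 11, 1, 9, 0, tzinfo=timezone.utc))
registry.record_ping("http://scraper.example.cn/loaf", post,
                     datetime(2007, 11, 1, 9, 30, tzinfo=timezone.utc))
```

Asking `registry.original_source(post)` then returns the blog URL, because its ping came first. An exact hash is the crudest possible fingerprint; a scraper who rewrites a few words would slip straight past it, so anything real would need fuzzier matching.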
Don’t trust the authorities…
OK, so not every website out there has a ping script installed, so perhaps the above scenario points to a problem within the Google ranking machine, with its reliance on and trust in link data and authority scores. If site A has a higher trust level than site B and decides to use content from site B, and site A is indexed more frequently than site B (because of its higher authority score), then there is a very real chance that Google will decide that the rightful owner of the content is site A and not the original publisher, site B.
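A toy model shows how the crawl schedule alone could produce this. It rests on one big assumption of mine: that ownership goes to whichever site the crawler happens to see the content on first. Nobody outside Google knows whether that is how it really works, and the function names and figures below are invented for illustration.

```python
import math


def first_crawl_at_or_after(publish_hour, crawl_interval_hours):
    """First crawl time >= publish_hour, with crawls at 0, interval, 2*interval..."""
    return math.ceil(publish_hour / crawl_interval_hours) * crawl_interval_hours


def attributed_owner(sites):
    """Pick the site whose copy the crawler encounters first.

    `sites` maps a site name to (publish_hour, crawl_interval_hours).
    """
    return min(sites, key=lambda name: first_crawl_at_or_after(*sites[name]))


sites = {
    # Site B wrote the piece at hour 0.5 but is only crawled once a day.
    "site_b": (0.5, 24),
    # Site A copied it at hour 2 and is crawled every hour.
    "site_a": (2.0, 1),
}
```

Here the crawler first meets the content on site A at hour 2 and does not reach site B until hour 24, so `attributed_owner(sites)` picks `site_a` even though site B published first.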
Google advises people who syndicate content to embed a link within it, so that Googlebot sees a link back to the original source and handles it correctly.
Syndicate carefully: If you syndicate your content on other sites, Google will always show the version we think is most appropriate for users in each given search, which may or may not be the version you’d prefer. However, it is helpful to ensure that each site on which your content is syndicated includes a link back to your original article. You can also ask those who use your syndicated material to block the version on their sites with robots.txt.
Whilst this may well work fine and dandy for people who are behaving themselves, it’s clearly inadequate for those who are not.
It doesn’t take too much effort to strip an href out of a piece of HTML. Web scripting languages come complete with all manner of string functions that let a person do all sorts of imaginative things with text or HTML. With the link back removed, a person using someone else’s content can rank higher up in a SERP and deprive the rightful owner of both kudos and traffic.
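Just how little effort is easy to show. The snippet below is a rough Python sketch using a regular expression; `strip_links` is a name I have made up, and a regex like this only copes with simple, well-formed anchors rather than any HTML you might throw at it.

```python
import re

# Matches an opening <a ...> tag, its inner text, and the closing </a>.
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def strip_links(html):
    """Replace every simple anchor element with just its inner text."""
    return ANCHOR_RE.sub(r"\1", html)


snippet = 'Read <a href="http://example.com/original">the original</a> here.'
cleaned = strip_links(snippet)
```

After this runs, `cleaned` is the same sentence with the attribution link gone, which is exactly why a link back only helps when the republisher is playing fair.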
Has this aspect of their systems contributed to this problem? Is it solely attributable to this particular flaw in their algo? I doubt it, but at the end of it all it sure looks like it’s contributing.
List my non existent domain please
The amazing thing about the story from the site calling itself googlewatchdog is that someone appears to have fooled Googlebot completely, getting it to list domain names that do not even exist.
The .cn sites don’t appear to be hosted ANYWHERE. They are simply redirected domain names. How they got ranked in Google in such a short period of time for fairly competitive keywords is a mystery. Google’s index even shows legitimate content for the .cn sites.
It appears that the faked sites are redirecting the Googlebot to a location where content can be indexed, while at the same time recognizing normal users and redirecting them to a site that includes the malware mentioned earlier. This is an obvious violation of Google’s guidelines, but the spammers have found ways to circumvent the rule and hide it from the Googlebot.
These sites are numbering in the millions for many different keywords and phrases, and appear to be developed on an automated basis. Because of privacy laws, it’s hard to track down who owns the domain names – Google has the power to do so, but there has been about exactly zero information from Google about the problem so far, and even many SEO experts and webmasters are not picking up on it.
I’m sure that this has made quite a few people sit up and think, “Hmm, how mad is that? How did they do that, then?” People can spoof user agents and redirect people or bots all over the shop. They can cloak content, and have in the past confused the Google technology into believing that an indexed page resided at one place when in fact it resided elsewhere. This became commonly known as the 302 hijack, a phenomenon that Google stayed silent on for some considerable time, refusing to concede its existence. There were literally hundreds upon hundreds of posts at places like WebmasterWorld and the busier webmaster and SEO forums from people complaining about how their content had been replaced by other domains using it as some kind of bait and switch tool.
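For what it’s worth, the user agent trick is easy enough to test for in principle: ask for the same URL once as Googlebot and once as an ordinary browser, and see whether you end up in the same place. The Python below is only a sketch of that comparison. `looks_cloaked` is a hypothetical helper, the `resolve` callable (URL and user agent in, final URL after redirects out) is something you would have to supply yourself, and `fake_resolve` is a stand-in so the example runs without touching the network.

```python
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def looks_cloaked(url, resolve):
    """True if a crawler and a browser end up at different final URLs.

    `resolve(url, user_agent)` is a caller-supplied function returning the
    final URL after redirects; it is an assumption of this sketch.
    """
    return resolve(url, GOOGLEBOT_UA) != resolve(url, BROWSER_UA)


def fake_resolve(url, user_agent):
    # Stand-in for a real HTTP client: mimics a site that shows the crawler
    # indexable content and sends everyone else somewhere nasty.
    if "Googlebot" in user_agent:
        return "http://content.example/page"
    return "http://malware.example/landing"


suspicious = looks_cloaked("http://spam.example.cn/", fake_resolve)
```

A check this simple would miss anyone cloaking on the crawler’s IP address rather than its user agent string, since a spoofed request does not come from Google’s network.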
Yet this one seems different. Very different indeed in that somehow they’ve managed to get around all the accepted safeguards causing Google to output stuff that was at best inaccurate and at worst decidedly harmful to the recipient computer.
There is of course always the possibility that the people concerned are unaware of an errant piece of scumware that is simply hijacking their browsers and taking over the Google SERP. David’s piece quotes Dr Jenny Oliver:
“I can’t remember what I put in to search with,” she told me, “as I was idly surfing last night, my Mac was suddenly very busy for several seconds as if installing a program.” She rebooted very quickly after that, but her net connection seemed to have become ominously slow.
Yet this was after she had clicked and not before. “Perhaps she was already infected” is a chorus I hear from behind, yet David does go on to say that he too saw it with his own eyes on his own PC. seroundtable also provide a screenshot and a little more background, and it seems that the spam team are aware of the issue too.
If it is true, then it’s a big step up from the conventional means of manipulating the Google index. To get into the results for such well-known keywords is a bit of a blackhat coup de force and of course a huge headache for the Google technology team too.
How long before this is plugged? God knows. It’s fair to conclude that we are very unlikely to hear Google say, “Yeah, our index isn’t impregnable, spammers can get right on in and do what they like.” It’ll either be bluntly denied or dismissed as some kind of browser hijack. We will no doubt see… Interesting nonetheless 🙂