Today, I’d like to share an observation I made after analysing new back links acquired from guest blogging on Search Engine Journal and getting promoted to the main blog at SEOmoz. It’s really interesting how the more popular, high authority domains get copied (scraped) so frequently by other sites that have pagerank or are sometimes even functioning companies in their own right.
Blogs get scraped
Ever since the introduction of WordPress plugins such as Wp-O-Matic, scrapers have become a fact of life. Blogs get scraped, particularly the larger, more successful and regularly updated sites. Take this popular post on Search Engine Journal for example – there are 78 instances of the <title> and first line of the opening paragraph according to Google’s index. If you take a look at any blog in the Adage Power 150 or Technorati’s Most Popular, you’ll be sure to find their posts duplicated hundreds if not thousands of times elsewhere.
Do links in duplicated content pages still pass value?
In my opinion, yes they do. There doesn’t even seem to be a limit to the number of times you can duplicate a page across unique domains to pass link value. You’d expect (or hope) that pages triggering the duplicate content filter at Google would have the value of their outbound links nullified, but I don’t see this happening in many cases. It’s not up to me to out specific examples of this; we’ve all seen it happening. If you haven’t, I’d suggest finding a high competition market and analysing the backlinks to a few domains. If you start seeing links from sites like articleblast.com, goarticles.com and articlesbase.com, just run an exact match query in Google for some of the text you find and you’ll turn up the duplicate articles and their inbound links.
Case study: Scraped post at SEOmoz
I decided to take a look at my post (titled “SEOmoz Tools – Top Pages on Domain Kick Ass”) published on SEOmoz a few weeks back. At the base of the article, there is a link back to my site using the anchor “SEO Consultant in London”. It’s not a particularly competitive phrase (nor is there much traffic) but, nonetheless, it’s a valid term and one for which Builtvisible ranked third until a week or so ago. The article was scraped by at least 21 other domains. I gathered that data using an “intitle” query on exact match for the post title, combined with a randomly chosen sentence from the content, also on exact match.
How do you find scraped content?
My favourite way is just to use a search engine. In this example, I have used an “intitle” operator and a section of text that could only have appeared in the article in question.
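If you want to build that kind of query for a batch of posts, here is a minimal sketch in Python. The function names are my own for illustration, the sentence is a placeholder you would swap for a line from your own post, and the search URL format is an assumption about how you would open the query in a browser.

```python
from urllib.parse import quote_plus


def build_scrape_query(title, sentence):
    """Combine an exact-match intitle: clause with an exact-match sentence."""
    # Remove double quotes so they cannot end the exact-match phrases early.
    clean_title = title.replace('"', "").strip()
    clean_sentence = sentence.replace('"', "").strip()
    return f'intitle:"{clean_title}" "{clean_sentence}"'


def google_search_url(query):
    """URL-encode the query so it can be pasted into a browser (assumed URL format)."""
    return "https://www.google.com/search?q=" + quote_plus(query)


query = build_scrape_query(
    "SEOmoz Tools – Top Pages on Domain Kick Ass",
    "a placeholder sentence copied from your own post",  # hypothetical text
)
print(query)
print(google_search_url(query))
```

Pick a sentence that could only have come from your article; a generic phrase will return unrelated pages.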
You could use Copyscape to do the same thing, though I have found the results to be less useful and not as fresh as the main search engines. You’ll end up going to Google in the long run. Whether you’re familiar with online anti-plagiarism tools or not, it’s worth checking your own site. You might be (unpleasantly) surprised.
To answer my question: “Could scraper sites pass any value?” I needed to collect some data. For each of the scraped articles, I collected the following information:
– URL and Domain Pagerank
– SEOmoz Domain MozRank and Domain MozTrust
– Comments on the article (How the original has been scraped and played back to the user on the new page)
– The search engine used to find the article (Yahoo or Google)
You can download my raw data from this URL. (Office 2007 Excel).
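If you would rather keep this kind of data in code than in a spreadsheet, the fields above could be held in a simple record. This is only a sketch: the class name, field names and sample rows are hypothetical (the rows are made-up placeholders, not figures from my spreadsheet), and a missing metric is stored as None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScrapedCopy:
    url: str
    url_pagerank: Optional[int]       # None when the page has no Pagerank
    domain_pagerank: Optional[int]    # None when the domain has no Pagerank
    domain_mozrank: Optional[float]   # None when not in the Linkscape index
    domain_moztrust: Optional[float]
    comments: str                     # how the original was scraped and replayed
    found_via: str                    # "Google" or "Yahoo"


def count_with_metric(copies, field_name):
    """Count the records where the named metric is present (not None)."""
    return sum(1 for copy in copies if getattr(copy, field_name) is not None)


# Placeholder rows for illustration only.
sample = [
    ScrapedCopy("http://scraper-one.example/post", None, 3, 4.1, 4.5,
                "full copy, links intact", "Google"),
    ScrapedCopy("http://scraper-two.example/post", None, None, None, None,
                "first paragraph only", "Yahoo"),
]

with_pr = count_with_metric(sample, "domain_pagerank")
print(f"{with_pr} of {len(sample)} domains have Pagerank")
```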
Common forms of scraping
The most typical form of scraping was to directly copy the original post HTML and present the content back to the audience of the scraper site. In many cases, the original links to SEOmoz.org had been removed and replaced with links to the host domain. One site had taken a copy of the page and nofollowed all of the external article links. Frequently, the scrapers were citing a Google feed proxy URL as the “original” source of the content. The remaining pages were displaying only the first paragraph of the page content and linking back to the original with either a followed or a nofollowed link.
Though all forms of scraping are quite annoying if you’re a site owner, the worst instances (IMO) are when the original links in the article are replaced with internal links elsewhere on the scraper site. No value whatsoever is passed back to the original author, nor to the sources the original author cited as valuable. I did find that specific domains were being removed rather than all external links, i.e. “seomoz.org” was replaced where “seogadget.com” was not.
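To check a scraped copy for these patterns without reading the markup by hand, something like the following sketch could help. It uses only Python’s standard library. The function and class names, the sample HTML and the “scraper.example” domain are all hypothetical, and it treats relative links as internal to the scraper site.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect (href, is_nofollow) pairs for every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        rel_values = (attr.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_values))


def classify_links(html, scraper_domain, watched_domains):
    """Count followed/nofollowed links to each watched domain, plus internal links."""
    parser = LinkCollector()
    parser.feed(html)
    report = {d: {"followed": 0, "nofollowed": 0} for d in watched_domains}
    internal = 0
    for href, nofollow in parser.links:
        host = (urlparse(href).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host == "" or host == scraper_domain:
            internal += 1  # relative links and links to the scraper itself
        elif host in report:
            report[host]["nofollowed" if nofollow else "followed"] += 1
    return report, internal


# Made-up markup standing in for a scraped copy of a post.
sample_html = (
    '<p><a href="http://www.seomoz.org/blog">source</a> '
    '<a rel="nofollow" href="http://seogadget.com/">author</a> '
    '<a href="/about">about</a> '
    '<a href="http://scraper.example/tools">tools</a></p>'
)
report, internal = classify_links(
    sample_html, "scraper.example", ["seomoz.org", "seogadget.com"]
)
print(report, internal)
```

A watched domain with zero links in the report, alongside a high internal count, suggests that the scraper has swapped your links for its own.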
Though none of the scraped URLs had yet been awarded Pagerank, 17 of the 21 scraping domains found had a Google Pagerank between PR1 and PR6.
SEOmoz Domain MozRank and Domain MozTrust
16 of the 21 sites found had MozRank and MozTrust, and the highest values among them were quite high (6.03 DmR and 6.24 DmT). These values are higher than those of our site, which has a DmR of 4.39 and a DmT of 5.28. None of the scraped page URLs were in the Linkscape index, so none had their own metrics available.
Most of the site domains included in the sample data have Pagerank, MozRank and MozTrust. Some of them are in fact perfectly “authoritative” sites in the eyes of search engines such as Google and backlink value analysers such as Linkscape, which would imply they are capable of passing link value. I’m not saying scraping is good, but I am making a comment on their ability to pass value. There are a number of different methods of scraping and problems can be introduced during the scrape process such as bad HTML parsing, linking to RSS feeds and linking out to 404 error pages. That said, for the most part, links back to sources referenced in the posts tend to be left untouched, which (during this test) included the footer text left in the base of my articles. Authoritative domains pass value as search engines index new pages on those domains. Taking that fact into account, it is fair to assume that the scraped sites identified in this test will pass value via the outbound links in the scraped content. I’m still watching a few pages which have links from recently published, scraped posts to test this conclusion further.
My recommendation to anyone thinking of posting on a 3rd party blog is, given the likelihood of the target site being heavily scraped, think very carefully about your content’s outbound links, especially in the footer of the article. Use a sign off, referencing your site and the most important pages on your own blog. In my case, I use a footer link like this:
Finally, if you’re thinking of targeting a blog with an offer of a guest post, be sure to read Josh Klien’s “How to Guest Post to Promote Your Blog” and Darren Rowse’s advice on “How to be a Good Guest Blogger” to get yourself positioned in the right way when you’re authoring your content.