Identifying dead links for more effective link analysis

Link analysis projects can be difficult and time-consuming, especially for websites with larger link profiles, which is why optimising your workflow and improving the integrity of your data are so important.

In the last couple of years, I’ve been lucky enough to work with websites that have tens of thousands of domains pointing at them. With time restrictions in place, it was crucial to limit the size of the dataset so that the recommendations being given were as actionable as possible. One of the easiest ways of doing this is to remove the links that are no longer live.

The two biggest third-party link explorers (Majestic and Ahrefs) have both fresh and historic databases, which can help to reduce the number of URLs processed from the outset. Yet despite their best efforts, a fairly large proportion of the links they report will no longer be live.

As Majestic’s FAQ succinctly puts it: “there is no API for the internet”. It’s necessary to crawl the web to discover new pages, and this takes time. In fact, Ahrefs and Majestic have several trillion backlinks (that’s 1,000,000,000,000+) that need to be recrawled on a regular basis to keep their databases fresh and relevant.

In this post, I’ll cover a few different techniques you can use to identify dead links before you begin your analysis. In my experience, this can reduce the size of your dataset by 30-60%.

URL Profiler:

One of the easiest and most accessible tools to filter out links is URL Profiler. The cost of a license is relatively low and it is capable of handling large amounts of data.

To identify dead links, load the URL list, enter the domain that you want to analyse as the domain to check against within the ‘Link Analysis’ section, and click “Run Profiler”.

[Screenshot: URL Profiler]

Once the crawl is complete, the results can be viewed in the standard export, specifically in the Link Status column.

If you encounter a large number of server errors, you might want to consider slowing down your crawl or changing the user-agent to Googlebot.
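If you ever script a similar check yourself, the same two levers apply. The snippet below is a minimal Python sketch, assuming only the standard library. The user-agent string, the one-second delay and the helper names are my own illustrative choices, not anything URL Profiler exposes.

```python
import time
import urllib.error
import urllib.request

# Assumption: an illustrative user-agent string; swap in whichever one you need.
USER_AGENT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_request(url, user_agent=USER_AGENT):
    """Create a request that sends the chosen user-agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses are raised as exceptions
    except OSError:
        return None  # DNS failures, refused connections, timeouts


def crawl_politely(urls, delay_seconds=1.0, fetch=fetch_status):
    """Fetch each URL in turn, pausing between requests to ease server load."""
    results = {}
    for url in urls:
        results[url] = fetch(url)
        time.sleep(delay_seconds)
    return results
```

Raising `delay_seconds` is the scripted equivalent of slowing the crawl down; a run that returns mostly 5xx codes is usually the cue to do so.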

Kerboo:

Another excellent option, and the tool we typically use internally for link auditing, is Kerboo. Kerboo is a specialist tool for backlink analysis and has real utility whether you are auditing existing links or prospecting for new opportunities.

Kerboo automatically merges link profile data, which can be imported via the APIs of Majestic, Ahrefs and Search Console on an ongoing basis, or manually uploaded as a list of URLs. As it is cloud based, Kerboo won’t hog all of your PC’s processing power during a crawl, which on large profiles can take days to complete.

[Screenshot: Kerboo]

In the above example, 70% of the complete link profile (gathered from Majestic, Ahrefs, Search Console and Moz) was no longer live.

One of the advantages of Kerboo over other tools is that it allows scanning for multiple domains. This can come in handy if your website went through a domain migration in the past and you would like to scan for links pointing to the legacy version as well.
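To illustrate the idea of checking a page against several domains at once, here is a rough Python sketch. It is not how Kerboo works internally; it simply parses a page's HTML with the standard library and reports which of a list of watched domains the page links to. The domain names and the sample page are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_to_any(html, domains):
    """Return the set of watched domains that the page links to."""
    collector = HrefCollector()
    collector.feed(html)
    found = set()
    for href in collector.hrefs:
        host = (urlparse(href).hostname or "").lower()
        for domain in domains:
            # Match the bare domain and any subdomain of it.
            if host == domain or host.endswith("." + domain):
                found.add(domain)
    return found


# Hypothetical page that still links to the legacy domain only.
page = '<p><a href="https://www.old-brand.example/page">old</a> <a href="/about">about</a></p>'
print(links_to_any(page, ["new-brand.example", "old-brand.example"]))
```

Matching on the parsed hostname rather than on the raw HTML avoids false positives from domains that merely end in the same string.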

Screaming Frog:

Identifying pages whose links are no longer live is also possible with the Swiss army knife of SEO, Screaming Frog. Using its custom extraction function and a relatively simple RegEx rule, you can establish whether a crawled page contains a link to the site under review.

The RegEx rule below, once inserted into the extraction field within Screaming Frog, will list any page that features a valid link pointing to builtvisible.com. It ignores links with incorrect HTML syntax and also matches links pointing to subdomains (“www.”, “testing.”, etc.).

Our example:

(?i)a.{1,}\s*href=\s*("|)\s*(http:\/\/|https:\/\/|)([a-z0-9]{1,}+\.|)builtvisible\.com
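Before committing to a long crawl, it can be worth sanity-checking the rule against a few snippets. The sketch below does this with Python's `re` module rather than Screaming Frog itself, so treat it as an approximation: the `{1,}+` quantifier is written as `+` because Python only accepts possessive quantifiers from version 3.11, and the sample HTML is made up for illustration.

```python
import re

# Portable adaptation of the rule: "{1,}+" is written as "+" because
# possessive quantifiers are only accepted by Python 3.11 and later.
LINK_RULE = re.compile(
    r'(?i)a.{1,}\s*href=\s*("|)\s*(http:\/\/|https:\/\/|)([a-z0-9]+\.|)builtvisible\.com'
)

# Hypothetical snippets standing in for crawled pages.
SAMPLES = {
    "direct link": '<a href="https://builtvisible.com/blog/">Builtvisible</a>',
    "subdomain link": '<a class="ref" href="http://www.builtvisible.com">site</a>',
    "unlinked mention": "<p>We read builtvisible.com every week.</p>",
    "other site": '<a href="https://example.com/">Example</a>',
}


def has_link(html):
    """Report whether the rule finds a link to the domain in the HTML."""
    return LINK_RULE.search(html) is not None


for label, html in SAMPLES.items():
    print(f"{label}: {has_link(html)}")
```

The two linked snippets should come back as matches, while the unlinked mention and the link to another site should not. Note that the rule only recognises double-quoted or unquoted `href` values, so single-quoted links would be missed.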

[Screenshot: Screaming Frog custom extraction]

Regex template:

(?i)a.{1,}\s*href=\s*("|)\s*(http:\/\/|https:\/\/|)([a-z0-9]{1,}+\.|)[Example Domain]\.[Domain TLD]

Please note: each “.” in the domain and its TLD must be escaped by placing a “\” in front of it, so that it matches a literal full stop rather than any character.
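If you audit several sites, filling in the template by hand gets tedious. As a small sketch, the Python helper below builds the rule from a plain domain and lets `re.escape` handle the escaping. The function name `build_link_rule` is my own, and the resulting string is meant to be pasted into Screaming Frog rather than compiled in Python.

```python
import re


def build_link_rule(domain):
    """Fill the template with an escaped domain such as 'example.co.uk'."""
    escaped = re.escape(domain)  # turns each "." into "\."
    return (
        r'(?i)a.{1,}\s*href=\s*("|)\s*(http:\/\/|https:\/\/|)'
        r'([a-z0-9]{1,}+\.|)' + escaped
    )


rule = build_link_rule("example.co.uk")
print(rule)
```

`re.escape` also puts a backslash in front of hyphens, which is harmless in the finished rule.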

[Screenshot: Screaming Frog regex]

To make sure you get the most accurate results possible, it’s also recommended that the crawls run with the following settings:

After Screaming Frog has completed its crawl, the results should look like this within the extraction tab:

[Screenshot: Screaming Frog extraction tab]

If the “Link Status Check 1” field is populated, it means that a link pointing to the defined domain has been spotted.
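Once the extraction results are exported, splitting the pages into live and dead takes only a few lines. This is a minimal Python sketch: the “Address” column name and the sample rows are assumptions for illustration, so check them against your own export before relying on it.

```python
import csv
import io

# Assumption: a tiny stand-in for a custom extraction export; the "Address"
# column name and the sample rows are illustrative.
SAMPLE_EXPORT = """Address,Link Status Check 1
https://blog.example/post-1,a href=https://builtvisible.com
https://news.example/story,
https://forum.example/thread,a rel=nofollow href=https://www.builtvisible.com
"""


def split_by_link_status(rows, column="Link Status Check 1"):
    """Separate pages where the extraction found a link from those where it did not."""
    live, dead = [], []
    for row in rows:
        target = live if (row.get(column) or "").strip() else dead
        target.append(row["Address"])
    return live, dead


live, dead = split_by_link_status(csv.DictReader(io.StringIO(SAMPLE_EXPORT)))
print(f"{len(live)} live, {len(dead)} dead")
```

With a real export, replace the `io.StringIO` wrapper with an opened CSV file.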

It’s important to keep in mind that, unlike the previous tools, Screaming Frog is quite memory intensive when using memory storage mode. If you are scanning several thousand links, it’s worth switching to database storage mode, which writes crawl data to disk (HDD/SSD) instead. This reduces the load on system memory in exchange for some crawl speed.

Taking things further

Although this post has primarily focused on reviewing the status of a link for auditing purposes, there are many other uses for this approach. As an example, when trying to recover lost link equity, knowing which links are live can drastically reduce your list of recommended redirects.
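As a rough illustration of the redirect example, the Python sketch below counts live links per broken target URL and keeps only the targets that still have at least one. The rows and field layout are hypothetical; in practice they would come from whichever tool you used for the status check.

```python
from collections import Counter

# Hypothetical audit rows: (linking page, broken target URL, link still live?)
BACKLINKS = [
    ("https://blog.example/a", "/old-guide", True),
    ("https://news.example/b", "/old-guide", True),
    ("https://forum.example/c", "/old-tool", False),
    ("https://wiki.example/d", "/old-report", True),
]


def redirect_candidates(backlinks):
    """Count live links per broken target and rank targets by that count."""
    live_counts = Counter(target for _, target, is_live in backlinks if is_live)
    return live_counts.most_common()


for target, count in redirect_candidates(BACKLINKS):
    print(f"{target}: {count} live link(s)")
```

Targets whose only links are dead, such as `/old-tool` here, drop off the list entirely.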

Regardless of what you use this for, I hope the techniques discussed in this post save you some time – so you can concentrate on analysing the links that matter.

Comments:

  • Take a look on Clusteric.com :) Nice software from Poland, also available in English version. Very useful to prepare Onsite and Offsite analysis.

  • Screaming frog and that custom regex extraction to do this is the business. Thanks for sharing. Another cool way of using SF and this technique is to grab all brand mentions from google news to check for unlinked brand mentions at scale! A little off topic, but popped to mind when reading this.

  • Another pro for Screaming Frog. I just love this software more and more!

  • Very interesting post on Link Analysis.

    Awesome tips and step by step explanation. Really appreciate the way you have written and explained.

    I am really gonna apply this for the future. Worth reading it.

    Thanks for sharing it with us.

    Good work..!!

