Deepcrawl – The Crawler of Choice for LARGE Websites

We were approached by Matt at DeepCrawl.co.uk to review their relatively young but capable cloud-based site crawl platform. When we first received the request, I was a little unsure how useful the tool would be compared to well-known & comprehensive tools like Screaming Frog and IIS SEO Toolkit, both of which I’m a massive fan of.

To quickly assess what DeepCrawl was up against, I pulled together some high-level pros and cons of SF & IIS to set the scene:

Screaming Frog

Pros:

  • The ‘all-in-one’ SEO tool for quick and in-depth site crawls, starting from a particular page or a list upload.
  • The SF team continually release new updates, and new feature requests are turned around fast.
  • Low annual license cost.
  • Accessible to both Windows- and Mac-based users.

Cons:

  • Memory allocation can be a problem for larger sites
  • Limited access to source data without running a new custom filter via a new site crawl

IIS SEO Toolkit

Pros:

  • All source code & header information for URLs crawled is downloaded to your local machine, with an extremely powerful built-in query interface that allows you to manipulate this data to identify custom error types. Queries can also be saved and reused for other crawl reports at any time.
  • Completely free to use

Cons:

  • Limited ongoing support / development of new features
  • Only accessible to Windows-based users
  • No crawl from list feature

As much as I love both of these tools, they share the same critical drawback: scale. For larger site crawls, SF can burn through its memory allocation fast, and IIS Toolkit becomes unresponsive beyond a certain point. Even if you do manage to export to .csv, the files are so cumbersome that trying to manipulate the data in any form leads to heartache.

I’m ready for a divorce at this point, so let’s take a closer look at setting up a campaign in deepcrawl.co.uk…

Getting started with DeepCrawl

[Image: getting-started-deepcrawl]

When setting up a new crawl, if you’ve used something like IIS or SF before you’ll quickly become familiar with the environment, as there are noticeable similarities between each of the crawlers. All of the typical settings like crawl depth, max URLs, crawl rate etc. can be found here; however, there are some interesting unique features, including:

  • The ability to set the user-agent and IP address without the need for proxies. This includes dynamic & static IPs, location-specific IPs (US, Germany, France), and something called ‘stealth crawl’ that randomises the user-agent, IP address and the delay between requests (there’s a rough sketch of the idea after this list).
  • Set up a crawl on a test site either via custom DNS entries, or a test domain with authentication.
  • The option to adjust pre-set error fields, e.g. max HTML size, max title length and minimum content-to-HTML ratio, amongst others.
  • Crawl scheduling that can run once, hourly, daily, weekly, fortnightly or monthly with a follow up error summary PDF straight to your inbox.
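
To make the ‘stealth crawl’ idea a little more concrete, here’s a minimal sketch of what randomising the user-agent and the delay between requests looks like in a hand-rolled Python crawler. This is purely illustrative and isn’t DeepCrawl’s implementation; the user-agent strings, URLs and delay range are all made up.

    import random
    import time

    import requests

    # Hypothetical pool of user-agents to rotate through (values are illustrative only).
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) ExampleCrawler/1.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8) ExampleCrawler/1.0",
    ]

    def stealthy_fetch(urls):
        """Fetch each URL with a random user-agent and a random pause between requests."""
        for url in urls:
            headers = {"User-Agent": random.choice(USER_AGENTS)}
            response = requests.get(url, headers=headers, timeout=10)
            print(url, response.status_code)
            # Vary the delay so requests don't arrive at a predictable rate.
            time.sleep(random.uniform(1.0, 5.0))

    stealthy_fetch(["http://example.com/", "http://example.com/about"])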

One particularly powerful feature, also found within the crawl settings, is the ability to compare past reports. Imagine crawling a test environment and then comparing it to the production site following go-live to spot outstanding or new issues – super useful for site migrations!
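
If you wanted to approximate that kind of comparison by hand, the sketch below diffs two crawl exports to surface URLs that now return problems. It assumes each crawl has been exported to a CSV with url and status_code columns – the file names and column names here are placeholders, not DeepCrawl’s actual export format.

    import csv

    def load_crawl(path):
        """Read a crawl export into a dict mapping URL to status code."""
        with open(path) as f:
            return {row["url"]: row["status_code"] for row in csv.DictReader(f)}

    test_crawl = load_crawl("test_environment_crawl.csv")  # hypothetical file names
    live_crawl = load_crawl("production_crawl.csv")

    # URLs that no longer return a 2XX on the live site but did (or weren't present) in the test crawl.
    new_issues = {
        url: status
        for url, status in live_crawl.items()
        if not status.startswith("2") and test_crawl.get(url, "200").startswith("2")
    }

    for url, status in sorted(new_issues.items()):
        print(status, url)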

Reviewing site errors

Running a crawl for a site with over half a million URLs took ~48 hours to complete, after which we were notified and presented with the following dashboard:

[Image: deepcrawl-dashboard]
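
For a rough sense of throughput, half a million URLs in roughly 48 hours works out to around 500,000 ÷ (48 × 3,600) ≈ 3 URLs per second, assuming the crawl ran at a fairly steady rate throughout.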

Every issue identified can be investigated at a deeper level within 4 main tabs located at the top of the page:

  1. Indexation – An outline of all of the accessibility errors encountered whilst crawling, with the option to segment and export reports by error type.
  2. Content – This segment analyses on-page content errors such as missing page titles, descriptions, duplicate body content, content size, missing H1 tags etc.
  3. Validation – This section hones in on internal ‘link’ or ‘URL’ activity, i.e. links resulting in 4XX, 5XX or redirection errors, as well as types of redirect, meta directives and canonicalisation (there’s a short sketch after this list showing what these status checks look like).
  4. Site Explorer – Very similar to Bing’s WMT index explorer, but allows you to break down each directory by architecture, site speed, crawl efficiency and linking to allow for further prioritisation.
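
As a quick illustration of the kind of thing the Validation section flags, here’s a small sketch that follows a link and reports any redirect hops plus the final status code, using the Python requests library. The URL is a placeholder, and this is just a way of eyeballing a handful of links by hand, not a replacement for the full crawl report.

    import requests

    def check_link(url):
        """Follow a URL and report any redirect hops plus the final status code."""
        response = requests.get(url, allow_redirects=True, timeout=10)
        for hop in response.history:
            # Each hop is an intermediate 3XX response on the way to the final URL.
            print(hop.status_code, "redirect:", hop.url, "->", hop.headers.get("Location"))
        print(response.status_code, "final:", response.url)

    check_link("http://example.com/old-page")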

Helping you communicate & resolve errors faster…

This is where DeepCrawl really comes into its own.

Once you select an error type from any one of the tabs, on the right-hand side of the screen you’ll see an ‘add issue’ tab that, when clicked, opens up the following dialogue box:

[Image: issue-list]

Add an issue description, priority rating and actions, and assign team members to each task; these then appear within an ‘all issues’ overview dashboard, like so:

[Image: all-issues-deepcrawl]

This is such a useful collaborative way to monitor and prioritise errors. Once issues are marked as ‘fixed’, the site can be re-crawled and compared to the previous report to ensure they have actually been resolved.

In summary

I’m still very much getting used to some of the functionality within deepcrawl.co.uk, but first impressions are good.

The biggest advantage that DeepCrawl has over similar tools like Screaming Frog & IIS Toolkit is the sheer number of URLs that can be crawled and manipulated within the platform itself. As the tool runs in the cloud, there are no memory or timeout errors, whilst the tool also ensures you only download what you need to evaluate and resolve specific issues encountered at any one time.

The fact that DeepCrawl goes some way towards helping you prioritise & communicate these errors to your development team is a valuable asset that the other tools can’t compete with.




3 thoughts on “Deepcrawl – The Crawler of Choice for LARGE Websites”

  1. The site migration feature (recrawl from a previous list of crawled URLs) – very useful. I think this is a keeper.

  2. Ragil Pembayun says:

    The memory allocation issue is something which has been lingering over our agency for quite a while and, to be honest, it’s got to a point where we’re ready to look for an alternative. In your words, ready for a divorce!

    A very useful write-up guys! Thanks for the heads up :)

  3. On Dan’s behalf, pleasure! I’m just working through the technical changes to SEOgadget that the tool identified. We have a lot of internal redirects (my bad :D)
