Crawling LARGE Websites?
Use DeepCrawl

3rd December 2013

We were approached by Matt at DeepCrawl to review their relatively young but capable cloud-based site crawl platform. When we first received the request, I was a little unsure how useful this tool would be compared to well-known and comprehensive tools like Screaming Frog and the IIS SEO Toolkit, both of which I’m a massive fan of.

To quickly assess what DeepCrawl was up against, I pulled together some high level pros and cons of SF & IIS to set the scene:

Screaming Frog


Pros:

  • The ‘all in one’ SEO tool for quick and in-depth site crawls, starting from a particular page or a list upload.
  • The SF team continually release new updates, and new feature requests are turned around fast.
  • Low annual license cost.
  • Accessible to both Windows and Mac users.

Cons:

  • Memory allocation can be a problem for larger sites.
  • Limited access to source data without running a new custom filter via a new site crawl.

IIS SEO Toolkit


Pros:

  • All source code and header information for crawled URLs is downloaded to your local machine, with an extremely powerful built-in query interface that lets you manipulate this data to identify custom error types. Queries can also be saved and reused for other crawl reports at any time.
  • Completely free to use.

Cons:

  • Limited ongoing support / development of new features.
  • Only accessible to Windows users.
  • No crawl-from-list feature.

As much as I love both of these tools, they share the same critical drawback: scale. On larger site crawls, Screaming Frog’s memory allocation can burn out fast, and the IIS Toolkit becomes unresponsive beyond a certain point. Even if you do manage to export to .csv, the files are so cumbersome that trying to manipulate the data in any form leads to heartache.

I’m ready for a divorce at this point, so let’s take a closer look at setting up a campaign in DeepCrawl…

Getting started with DeepCrawl


When setting up a new crawl, anyone who has used something like IIS or SF before will quickly feel at home, with noticeable similarities between each of the crawlers. All of the typical settings like crawl depth, max URLs and crawl rate can be found here. There are, however, some interesting unique features, including:

  • The ability to set the user-agent and IP address without the need for proxies. This includes dynamic and static IPs, location-specific IPs (US, Germany, France), and something called ‘stealth crawl’ that randomises the user-agent, IP address and the delay between requests.
  • Set up a crawl on a test site either via custom DNS entries, or a test domain with authentication.
  • The option to adjust pre-set error fields i.e. max HTML size, max title length, minimum content to HTML ratio amongst others.
  • Crawl scheduling that can run once, hourly, daily, weekly, fortnightly or monthly with a follow up error summary PDF straight to your inbox.
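The ‘stealth crawl’ behaviour described above – a rotating user-agent plus a randomised delay between requests – can be sketched in a few lines of Python. This is a generic illustration, not DeepCrawl’s actual implementation; the user-agent strings and delay range are invented for the example:

```python
import random
import time

# Illustrative user-agent pool -- stand-ins, not DeepCrawl's real values.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9) AppleWebKit/537.36",
    "Mozilla/5.0 (compatible; ExampleBot/1.0)",
]

def stealth_params(min_delay=1.0, max_delay=5.0):
    """Pick the request headers and inter-request delay for the next
    fetch, randomising both as a 'stealth crawl' would."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    delay = random.uniform(min_delay, max_delay)
    return headers, delay

def polite_fetch(fetch, url):
    """Wait a randomised interval, then fetch the URL with a random
    user-agent. `fetch` is whatever HTTP function you use."""
    headers, delay = stealth_params()
    time.sleep(delay)
    return fetch(url, headers=headers)
```

The point of randomising both values is that each request looks slightly different to the target server, rather than arriving as a steady, identical drumbeat.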

One particularly powerful feature, also found within the crawl settings, is the ability to compare past reports. Imagine crawling a test environment and comparing it to the production site after go-live to catch outstanding or new issues – super useful for site migrations!

Reviewing site errors

Running a crawl for a site with over half a million URLs took ~48 hours to complete, after which we were notified and presented with the following dashboard:


Every issue identified can be investigated at a deeper level within 4 main tabs located at the top of the page:

  1. Indexation – An outline of all of the accessibility errors encountered while crawling, with the option to segment and export reports by error type.
  2. Content – This segment analyses on-page content errors such as missing page titles and descriptions, duplicate body content, content size, missing H1 tags etc.
  3. Validation – This section homes in on internal ‘link’ or ‘URL’ activity, i.e. links resulting in 4XX, 5XX or redirection errors, as well as redirect types, meta directives and canonicalisation.
  4. Site Explorer – Very similar to Bing’s WMT index explorer, but it allows you to break down each directory by architecture, site speed, crawl efficiency and linking for further prioritisation.
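As a rough illustration of the kind of bucketing the Indexation and Validation tabs perform, here is a small Python sketch that groups crawled URLs by HTTP status class. The sample data and bucket names are invented for the example – DeepCrawl’s own categories are richer than this:

```python
from collections import Counter

def classify_status(code):
    """Bucket an HTTP status code the way a segmented error report might."""
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "4xx error"
    if 500 <= code < 600:
        return "5xx error"
    return "other"

def summarise(crawl_results):
    """crawl_results: iterable of (url, status_code) pairs.
    Returns a count of URLs per bucket, i.e. a segmented error summary."""
    return Counter(classify_status(code) for _, code in crawl_results)

# Invented sample crawl data:
sample = [("/", 200), ("/old-page", 301), ("/missing", 404), ("/api", 500)]
print(summarise(sample))
```

Segmenting by bucket first, then drilling into the URLs within each bucket, is exactly the workflow the tabs above support at scale.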

Helping you communicate & resolve errors faster…

This is where DeepCrawl really comes into its own.

Once you select an error type from any one of the tabs, you’ll see an ‘add issue’ tab at the right-hand side of the screen which, when clicked, opens the following dialogue box:


Add an issue description, priority rating and actions, and assign team members to each task, and the issue will then appear within an ‘all issues’ overview dashboard, like so:


This is such a useful, collaborative way to monitor and prioritise errors. Once an issue is marked as ‘fixed’, the site can be re-crawled and compared to the previous report to ensure the issue really has been resolved.

In summary

I’m still very much getting used to some of the functionality within DeepCrawl, but first impressions are good.

The biggest advantage DeepCrawl has over similar tools like Screaming Frog and the IIS Toolkit is the sheer number of URLs that can be crawled and manipulated within the platform itself. Because the tool runs in the cloud, there are no memory or timeout errors, and you only ever download the data you need to evaluate and resolve the specific issues at hand.

The fact that DeepCrawl goes some way towards helping you prioritise and communicate these errors to your development team is a valuable asset that the other tools can’t compete with.
