We love the Screaming Frog spider at Agency51: it makes site audits flexible, powerful and very customisable, which is very helpful when you need to drill down into the architecture and setup of a website, especially for SEO purposes. In this post, we'll share some of the tool's advanced features, specifically Custom Search and Custom Extraction, which can help with everything from removing mentions of old brands to finding poorly conceived product descriptions.
Why customise a crawl?
Sometimes we may need to extract very specific information from a page or set of pages that the default spider doesn't include as standard. The custom features allow us to do this at a very granular level!
To access the custom features, navigate to Configuration > Custom:
We then have two sections, Search and Extraction, which are mostly self-explanatory. These can be set up before a crawl is run, but not changed during it.
You are allowed up to 10 custom searches, each of which will search the HTML source of the page for the specified text string; we'll get to some examples shortly.
Custom Extraction
As with the search function, we can extract up to 10 separate types of data from the pages included in our crawl. Unlike with the search, we have to enable each extractor individually and choose which type of scraping method to use (CSSPath, XPath or Regex). We usually find that XPath works best, although it does depend on the page in question and what type of data is needed.
The data extraction process
Before we get started with extraction, we need to know how to teach the spider to get the data we want out of the page. In most cases, this involves navigating to the type of page you wish to extract data from, right-clicking the part you're interested in and selecting 'Inspect' in Chrome, then right-clicking the highlighted element and copying its XPath. In the example below, we're going to extract the time/datestamp from the BBC website.
Once this is copied, we go over to Screaming Frog, paste in the XPath and select 'Extract Text'. Once the crawler has run, the extracted result can be seen below, and it corresponds with the date displayed on the page.
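To make the 'Extract Text' step concrete, here is a minimal sketch of the same idea in Python using the standard library, run against a hypothetical fragment of BBC-style HTML (the real XPath would come from Chrome's Inspect panel):

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; on a real crawl the XPath is copied from Chrome.
page = ET.fromstring("""
<html><body>
  <div id="main">
    <time datetime="2016-05-12">12 May 2016</time>
  </div>
</body></html>""")

# ElementTree supports a subset of XPath; './/time' is equivalent to the
# '//time' expression you would paste into the Screaming Frog extractor field.
stamp = page.find(".//time")
print(stamp.text)  # 12 May 2016
```

The spider does essentially the same thing on every page in the crawl and writes the matched text into its own column in the results.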
Important: if you're extracting multiple data types across a large number of pages, make sure you reduce the thread count (Configuration > Speed), otherwise it's very easy to overload the server (particularly on smaller websites), which no one wants!
Now that we’ve got the technical tutorial out of the way, here are a number of ways to use this for research, data mining and auditing.
Custom Search Examples
Checking for Schema implementation
Schema is code that is inserted into your website which allows Google to return more information to users during searches. For example, in the search below, the bottom result has included an image and a rating in addition to the standard information.
Checking page types at random using the Structured Data Testing Tool is fine for smaller domains, but for larger sites it may be necessary to check that schema is implemented in bulk. Webmaster Tools is usually pretty good for this, but it can often miss pages, necessitating a more detailed examination. Depending on the specific structure and markup of microdata on the site, it may be necessary to alter the searches below; checking a few different page types should help establish how schema data is formatted in the code. For example, to check for Organisation, Product or Review schema, the below may suffice:
However, schema can sometimes be implemented differently, e.g.:
A little investigation should reveal the right format to use for the search boxes.
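As an illustration of why the format matters, the two snippets below (both made up) mark up the same Product schema as microdata and as JSON-LD. A Screaming Frog custom search is essentially a substring (or regex) match on the HTML, so a search string written for one format will miss the other:

```python
# Hypothetical pages: the same Product schema in two different formats.
microdata_page = '<div itemscope itemtype="http://schema.org/Product">...</div>'
jsonld_page = '<script type="application/ld+json">{"@type": "Product"}</script>'

# A custom search for the microdata form...
needle = 'itemtype="http://schema.org/Product"'
print(needle in microdata_page)  # True
print(needle in jsonld_page)     # False - a JSON-LD site needs '"@type": "Product"' instead
```

This is exactly the investigation step: view source on a couple of representative pages first, then write the search strings to match what's actually there.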
Finding pages missing UA codes
It's not uncommon for pages to be missing UA codes for one reason or another; luckily, this is easy to check with Screaming Frog. Simply get hold of your UA code or GTM container ID, and set up a 'Does Not Contain' search filter with the ID to create a list of pages missing the tag.
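The 'Does Not Contain' filter reduces to a simple check, sketched below with an invented UA code and two made-up pages:

```python
# Hypothetical tracking ID and crawled pages (URL -> HTML source).
ua_code = "UA-1234567-1"
pages = {
    "/": '<script>ga("create", "UA-1234567-1", "auto");</script>',
    "/about": "<html><body>No analytics snippet here</body></html>",
}

# 'Does Not Contain': keep every page whose HTML lacks the tracking ID.
missing = [url for url, html_src in pages.items() if ua_code not in html_src]
print(missing)  # ['/about']
```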
Finding iframes

iframes can be a problem for SEO, as search engines sometimes have difficulty extracting the content from them, so setting up a search for 'iframe' will help to diagnose any pages with this issue. It's worth noting that Google Tag Manager and other tracking scripts often use iframes, in which case this won't be a problem for SEO.
Looking for spam or hacked pages
If you suspect that your site may have been hacked (for example, you've received a notification through Webmaster Tools), adding the suspected spam keywords to the search boxes can be very helpful in uncovering exactly which pages need attention.
Finding pages with incorrect brand mentions
Other, more esoteric possibilities exist. One of our clients had quite a specific branding problem: they had migrated domains and also changed brand names. This left a lot of pages carrying the old brand name, in a couple of different permutations. By using a regular expression search with lookahead operators, we were able to find out which pages contained the mentions so the content could be edited.
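The client's exact expression isn't shown here, but a negative lookahead works along these lines (the brand name and domain below are invented for illustration): match the old brand name, except where it's part of a domain that still needs to stay in place.

```python
import re

# Hypothetical: flag "OldBrand" mentions, but skip the old email/domain
# ("OldBrand.co.uk"), which must remain live during the migration.
pattern = re.compile(r"OldBrand(?!\.co\.uk)")

pages = {
    "/team": "Welcome to OldBrand, your trusted partner.",
    "/contact": "Email us at info@OldBrand.co.uk",
}

flagged = [url for url, text in pages.items() if pattern.search(text)]
print(flagged)  # ['/team']
```

Screaming Frog's custom search accepts regular expressions directly, so a pattern like this can be pasted straight into the search box.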
Custom Extraction Examples
Listing, or counting Heading tags
By default, Screaming Frog will list the H1 and H2 tags it finds, but there may be instances where H3 to H6 headings need to be checked as well, for example to assist with restructuring a website's information architecture. We can also count the number of headings on a page with the following two extractors (we'll do the same for H4 as well):
Using the BBC as an example again, here are the configuration and the result respectively; we need to use the Function Value option this time to get the count:
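The extractor pair amounts to listing matches for `//h3` with Extract Text and evaluating `count(//h3)` with Function Value. A rough stdlib equivalent, on a made-up page fragment:

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment standing in for the BBC example.
page = ET.fromstring("""
<html><body>
  <h3>Latest news</h3>
  <h3>Most read</h3>
  <h4>Related stories</h4>
</body></html>""")

# '//h3' with Extract Text -> one column entry per heading.
h3_texts = [h.text for h in page.findall(".//h3")]
print(h3_texts)  # ['Latest news', 'Most read']

# 'count(//h3)' with Function Value -> a single number per page.
print(len(page.findall(".//h3")))  # 2
```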
Looking for pages using relative rather than absolute links
Although this seems like the kind of debate that might one day start a civil war between web developers and SEOs, in reality Google is usually fine with relative links, as long as they are applied consistently and correctly across the whole site. Relative links on internal pages (without the full document path) are generally a lot easier to code, and arguably load fractionally faster when clicked, although from an SEO point of view having full links in the code is generally preferred. Either way, it can sometimes be helpful to find instances of relative links on a site, in which case a crawl using a regular expression similar to the one below may be of use:
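One possible pattern (illustrative only, and worth tuning to the markup of the site being crawled) captures `href` values that don't start with a protocol, a protocol-relative `//`, a fragment or a `mailto:` link:

```python
import re

# Capture href values that are relative, i.e. not absolute/protocol-relative.
relative_href = re.compile(r'href="(?!https?://|//|#|mailto:)([^"]+)"')

html_src = '''
<a href="https://example.com/full">absolute</a>
<a href="/products/widget">root-relative</a>
<a href="widget.html">document-relative</a>
'''

print(relative_href.findall(html_src))  # ['/products/widget', 'widget.html']
```

Pasted into a custom extractor in Regex mode, an expression like this would return the relative link targets found on each crawled page.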
Product descriptions and prices
We've saved the best for last: being able to scrape product descriptions by page can be very valuable for eCommerce, for example to find:
- Empty descriptions
- Short descriptions
- Overly long descriptions
- Descriptions which need proofing/editing
This can also be helpful for competitor analysis, or to compare competitor prices to your own or your client’s.
For a client site of ours, the Xpath of the code looked like this:
As with our tutorial above, it's usually just a matter of navigating to the block of code that contains the description or price element, although if it's split into several parts, several extraction points may be needed.
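Once the descriptions and prices are extracted, flagging problem pages is straightforward. The class names, product copy and word-count threshold below are all invented for illustration; the real XPath depends on the client's templates:

```python
import xml.etree.ElementTree as ET

# Hypothetical product page fragment with made-up class names.
page = ET.fromstring("""
<html><body>
  <div class="product-description">Hand-finished oak table, seats six.</div>
  <span class="price">&#163;249.00</span>
</body></html>""")

desc = page.find('.//div[@class="product-description"]')
price = page.find('.//span[@class="price"]')
text = (desc.text or "").strip()

MIN_WORDS = 20  # arbitrary quality threshold for "short" descriptions
if not text:
    print("empty description")
elif len(text.split()) < MIN_WORDS:
    print("short description:", text, price.text)
```

In practice we export the extracted columns from Screaming Frog and run checks like this over the whole crawl, which makes empty, short or duplicated descriptions easy to surface at scale.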
We hope you found this post useful. If you have any questions about anything covered here, please get in touch with us!
This blog was written by Ben Henderson, Technical SEO Manager for Agency51.
Opt in to receive future blogs and white papers from Agency51 and receive a free SEO audit of your company website.
If you wish to discuss your digital marketing strategy with us then just call 01904 215151 or email firstname.lastname@example.org