Data scraping is the “art” of retrieving elements from the source code of a page to be used in your analysis. This can be a very powerful strategy in the ongoing work to improve the quality of your website.
Oncrawl lets you automatically scrape data during your crawls. In this article, we will analyze some examples of situations where scraping can support your SEO.
Non-regression monitoring
It is well known that unplanned regressions are part of SEOs’ worst nightmares.
These can stem from a change in wording, in templates or in internal linking, or from any change outside of your control which can end up having an impact on your SEO.
For example, finding that your traffic has dropped and realizing that this is because you lost a rich snippet (with a direct impact on the CTR) isn’t much fun for an SEO. Especially when they discover that it could have been quickly detected with non-regression crawls.
And that’s where some smart scraping rules, along with regular crawls of your website, can be your closest allies when it comes to detecting unexpected changes.
Rather than struggling with unplanned website changes, you can also use scraping rules to identify elements which shouldn’t be present, or which should have been added to your source code or content. This process allows you to make sure that your latest modifications have been correctly implemented across your entire site.
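To make this concrete, here is a minimal sketch (in Python, outside of Oncrawl) of the kind of check a non-regression scraping rule can automate: verifying that the JSON-LD block behind a rich snippet is still present on a set of pages. The URLs and the pattern are hypothetical and would need to be adapted to your own templates.

```python
# Illustrative sketch only: not Oncrawl's API, just the kind of check a
# scraping rule can automate during a crawl. The URL list is hypothetical.
import re
import requests

URLS = [
    "https://www.example.com/product/123",
    "https://www.example.com/product/456",
]

# Regression check: every product page should still expose JSON-LD structured
# data (the markup that powers rich snippets in search results).
JSON_LD_PATTERN = re.compile(
    r'<script[^>]+type="application/ld\+json"', re.IGNORECASE
)

for url in URLS:
    html = requests.get(url, timeout=10).text
    if not JSON_LD_PATTERN.search(html):
        print(f"Possible regression: no JSON-LD block found on {url}")
```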
Visualize the relevance of your SEO strategy
Oncrawl’s cross-analysis allows you to visualize the potential impact of scraped elements on your SEO.
For example, you could check whether your strategy aiming to increase the number of comments on your product pages has had an impact on their traffic and/or their rankings.
In the same vein, by scraping prices on your product pages (along with other elements such as reviews), you can redirect your SEO strategy to give a boost to the products which are both popular with your clients (with a high review score) and which match your average order value.
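As a rough illustration, the sketch below assumes you have already exported the scraped price and review score for each product page; it simply segments out the products that are well reviewed and priced close to your average order value. The field names and thresholds are placeholders, not part of Oncrawl.

```python
# Hedged sketch: assumes scraped fields (price and average review score per
# URL) have already been exported from your crawl. Field names and the
# average-order-value threshold are hypothetical.
AVERAGE_ORDER_VALUE = 80.0  # your store's AOV, in the same currency as prices

scraped_pages = [
    {"url": "/product/123", "price": 75.0, "review_score": 4.6},
    {"url": "/product/456", "price": 320.0, "review_score": 4.8},
    {"url": "/product/789", "price": 82.0, "review_score": 3.1},
]

# Products worth boosting: well reviewed AND priced near the average order value.
to_boost = [
    page for page in scraped_pages
    if page["review_score"] >= 4.5
    and abs(page["price"] - AVERAGE_ORDER_VALUE) <= 0.25 * AVERAGE_ORDER_VALUE
]

for page in to_boost:
    print(f'{page["url"]}: score {page["review_score"]} at {page["price"]}')
```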
You can apply this principle to any type of website and, for example, filter by the publication date of articles, by author, or by any other scraped element.
Scraping rules, cross-analyzed with a segmentation based on the data retrieved by scraping, allow you to monitor the impact of all of your projects that modify your content or source code. This is one strategy to determine whether or not you’re on the right track.
Content monitoring
Your website has reached a size where content review becomes complicated: you might have multiple writers regularly publishing articles, or external vendors who are able to add their products to your marketplace… There are numerous scenarios that might lead you to automate content review.
Scraping isn’t only useful for retrieving content. With Oncrawl, you can also verify whether an element exists, count the number of elements on a page, check the length of an element, and more.
Using this method, you can ensure that a product description has been written, that an article meets the minimum required word count, that each blog post contains at least one link to a specific category…
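Here is a minimal sketch of those three checks, written in Python with lxml rather than as Oncrawl scraping rules; the selectors, class names and word-count threshold are assumptions to adapt to your own markup.

```python
# Minimal sketch of the kinds of checks a scraping rule can automate; the
# selectors and thresholds below are hypothetical and should match your templates.
from lxml import html

MIN_WORD_COUNT = 300

def audit_article(page_html: str) -> list[str]:
    problems = []
    tree = html.fromstring(page_html)

    # 1. The description / body block must exist.
    body = tree.xpath('//div[@class="article-body"]')
    if not body:
        problems.append("missing article body")
        return problems

    # 2. The article must meet the minimum word count.
    word_count = len(body[0].text_content().split())
    if word_count < MIN_WORD_COUNT:
        problems.append(f"only {word_count} words (minimum {MIN_WORD_COUNT})")

    # 3. Each post must link to at least one category page.
    if not tree.xpath('//a[contains(@href, "/category/")]'):
        problems.append("no link to a category page")

    return problems

# Example usage on a page that is missing its body block.
print(audit_article("<html><body><p>Too short.</p></body></html>"))
```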
Web infrastructure monitoring
If your site infrastructure uses load balancing, it’s not always simple to ensure that every server delivers exactly the same code or even responds in the same way.
You should start by adding an element to the source code which will allow you to detect which server rendered the HTML of a page. Then, you can scrape this element: it will now be much easier to determine whether one or more servers aren’t up to date or are defective.
Integrating a meta tag or a simple HTML comment would make it very easy to discover, for example, that a given server consistently responds with 404 errors.
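For example (and the format of the marker is purely an assumption), each server could stamp the pages it renders with an HTML comment, which a regex-based scraping rule can then extract:

```python
# Sketch only: the comment format is an assumption. The idea is to have each
# server stamp the HTML it renders, then extract that stamp with a scraping
# rule (here, a regex) so errors can be traced back to a specific machine.
import re

rendered_html = """
<!doctype html>
<!-- rendered-by: web-03 -->
<html><head><title>Example</title></head><body>...</body></html>
"""

SERVER_PATTERN = re.compile(r"<!-- rendered-by: (?P<server>[\w-]+) -->")

match = SERVER_PATTERN.search(rendered_html)
server = match.group("server") if match else "unknown"
print(f"Page rendered by: {server}")
```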
Unfortunately, this type of phenomenon isn’t unusual. It can lead to issues where bots encounter false errors on pages, or even a lack of response when the robots.txt file is requested.
This last case can trigger a massive crawl of pages which shouldn’t be discovered.
In the same way, you can ensure that your pages are correctly served by your cache by adding a meta tag or comment to be retrieved during a crawl.
This process can allow you to understand high page load times that were previously impossible to explain.
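One possible way to put this into practice, assuming a hypothetical cache marker embedded in each page and load times collected during the crawl, would be to flag pages that bypass the cache and load slowly:

```python
# Hedged sketch: assumes each page embeds a cache marker such as
# <meta name="cache-status" content="HIT"> that is scraped during the crawl,
# and that load times are available. Names, values and thresholds are hypothetical.
crawled_pages = [
    {"url": "/pricing", "cache_status": "HIT", "load_time_ms": 180},
    {"url": "/blog/seo-tips", "cache_status": "MISS", "load_time_ms": 2400},
]

# Flag pages that bypass the cache and load slowly: a likely explanation
# for otherwise mysterious load-time spikes.
for page in crawled_pages:
    if page["cache_status"] != "HIT" and page["load_time_ms"] > 1000:
        print(f'{page["url"]}: cache {page["cache_status"]}, {page["load_time_ms"]} ms')
```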
TL;DR: Why scrape data?
In a nutshell, the ability to scrape content or source code during your crawls is not a strategy to be overlooked. It can provide a lot of additional information about your website, its activity and its status.