This blog post explains how to use landing page hashes to optimize automatically created pages for search engines. Automated SEO is on the rise. But did you know that Google stops indexing, and eventually even crawling, your pages if they become too similar to each other? At a recent meetup of Growth Hackers of Vienna (GHV on Facebook), search engine expert Franz Enzenhofer explained how landing page hashes can be used to ensure the uniqueness of all pages. I have put these insights together for you in the following article.
Why do I need SEO Quality Control?
First things first. Before we look into what landing page hashes are and why and how you should use them, I would like to start by thinking about why quality control is essential. When Google crawls and indexes pages on the web, its highest concern is to deliver good value to its users. As the crawler is “only” a machine, its ability to judge a page’s quality is restricted and relies on certain standardized indicators. One of them is a page’s uniqueness: if a website’s pages and content are repeated over and over, it certainly does not deliver additional value to users. While this is typically not an issue for manually created pages, it definitely can be severe for:
- Platforms of all kinds
- Large e-commerce sites
For example, this issue – also referred to as “self-spamming” – occurs when one shop category largely coincides with another, so their corresponding result pages overlap as well. If Google recognizes this, it first stops indexing the pages in question or, worse, refuses to crawl them at all – which often defeats the very SEO purpose these pages were created for in the first place.
What are Landing Page Hashes?
Now let’s look into landing page hashes. In general, a hash function takes some input and calculates a fixed-size hash value from it. In the case of landing page hashes, this input is essentially the content of a given page, or a unique ID for each item on the page. Built this way – over the set of item IDs rather than the raw page text – the hash does not depend on how the individual items are sorted on each page, which matches how Google’s crawler judges similarity.
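To make this concrete, here is a minimal sketch of such an order-independent hash in Python (the article mentions PHP and Node.js backends; Python is used here only for brevity, and the `simhash` name and SKU-style item IDs are illustrative assumptions, not part of any particular platform):

```python
import hashlib

def simhash(item_ids, bits=64):
    """Compute a 64-bit SimHash over a page's item IDs.

    The result depends only on WHICH items appear, not on their order,
    because each item simply casts per-bit votes that are summed up.
    """
    votes = [0] * bits
    for item in item_ids:
        # Hash each item ID to a 64-bit integer (MD5 truncated; any stable hash works).
        h = int(hashlib.md5(item.encode("utf-8")).hexdigest()[:16], 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # Each output bit is the sign of the accumulated vote.
    return sum(1 << i for i in range(bits) if votes[i] > 0)

page_a = ["sku-1001", "sku-1002", "sku-1003"]
page_b = ["sku-1003", "sku-1001", "sku-1002"]  # same items, different order
assert simhash(page_a) == simhash(page_b)      # order does not change the hash
```

Because the per-item votes are summed, shuffling the items on a page leaves the hash untouched, which is exactly the property we want here.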
How do you use this knowledge?
If the number of pages to compare is limited – for example category result or list pages – you can resort to out-of-the-box tools that calculate page similarities. For instance, if you are concerned about two shop categories overlapping, you can use Small SEO Tool’s Page Comparison: simply copy and paste the URLs you want to compare and receive a similarity score. If the score is high, you can readjust your categories to reduce the overlap and thus improve SEO.
If the number of your pages is too large for these tools to handle, you (or your developer) will have to set up a hash comparison manually in your website’s backend. There, comparing landing page hashes represents a relatively simple, quick and scalable way to automatically check how similar each of your pages is to all the others – which would otherwise take an eternity for large platforms consisting of hundreds or thousands of pages.
This can be done the following way:
- Assign each page element (for example products in a web shop) a unique ID and save this ID to a variable.
- Use these variables as input to a locality-sensitive hashing function. Only locality-sensitive hashing preserves your pages’ similarity in a meaningful way – conventional hashes produce completely different values even for tiny differences in the input. The concrete hashing function will depend on the programming language used in your backend (for example PHP or Node.js).
- Compare the hash of each page with all the other pages’ hashes, for example using Levenshtein distance (or, since the hashes have a fixed length, Hamming distance), which gives a good measure of similarity.
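The three steps above can be sketched end to end in Python (a PHP or Node.js port is straightforward). The comparison here uses Hamming distance – counting differing bits – which plays the same role as Levenshtein distance for fixed-length binary hashes; the category URLs and SKUs are made up for illustration:

```python
import hashlib
from itertools import combinations

def simhash(item_ids, bits=64):
    # Locality-sensitive hash: pages sharing most items get nearby hashes.
    votes = [0] * bits
    for item in item_ids:
        h = int(hashlib.md5(item.encode("utf-8")).hexdigest()[:16], 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    # Number of differing bits between two fixed-length hashes.
    return bin(a ^ b).count("1")

# Step 1: each page's elements, identified by unique IDs (hypothetical data).
pages = {
    "/category/shoes":    ["sku-1", "sku-2", "sku-3", "sku-4"],
    "/category/sneakers": ["sku-1", "sku-2", "sku-3", "sku-5"],  # heavy overlap
    "/category/hats":     ["sku-9", "sku-10", "sku-11", "sku-12"],
}

# Step 2: one locality-sensitive hash per page.
hashes = {url: simhash(ids) for url, ids in pages.items()}

# Step 3: pairwise comparison – small distances indicate near-duplicate pages.
for (u1, h1), (u2, h2) in combinations(hashes.items(), 2):
    print(f"{u1} vs {u2}: {hamming(h1, h2)} differing bits")
```

In a real backend you would run this over all pages (or, at larger scale, bucket the hashes first to avoid the full pairwise comparison) and flag any pair whose distance falls below a threshold you choose.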
How to leverage these insights:
Equipped with the insights of your pages’ similarity, you can:
- Rework your whole website’s hierarchy and SEO strategy to produce more unique pages
- Adjust your list pages and (shop) categories in order to produce less overlap with others
- Generate an XML sitemap that only lets Google index the more relevant pages
For example, you can adapt the script that generates new XML sitemaps – typically run overnight – to also produce a daily report on identical and nearly identical pages. If several pages are very similar, communicate only the most relevant one to the search engines via the sitemap, or mark the less relevant ones “noindex”.
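That sitemap step can be sketched as follows in Python. Everything here is a hypothetical example – the URLs, the hash values, and the distance threshold are all assumptions you would tune for your own site. Pages are assumed to be ordered by relevance, and only the first page of each near-duplicate cluster makes it into the sitemap:

```python
SIMILARITY_THRESHOLD = 3  # max differing bits to treat two pages as near-duplicates

def hamming(a, b):
    # Number of differing bits between two fixed-length hashes.
    return bin(a ^ b).count("1")

def split_pages(page_hashes):
    """page_hashes: dict mapping URL -> precomputed 64-bit hash, ordered by
    relevance (most relevant first). Returns (indexed_urls, noindex_urls)."""
    indexed, noindex = [], []
    for url, h in page_hashes.items():
        if any(hamming(h, page_hashes[kept]) <= SIMILARITY_THRESHOLD for kept in indexed):
            noindex.append(url)   # near-duplicate of an already-kept page
        else:
            indexed.append(url)   # unique enough – goes into the sitemap
    return indexed, noindex

def sitemap_xml(urls):
    entries = "\n".join(f"  <url><loc>{u}</loc></url>" for u in urls)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{entries}\n</urlset>")

# Hypothetical pages with precomputed hashes, most relevant first.
pages = {
    "https://example.com/shoes":    0b1010_1100,
    "https://example.com/sneakers": 0b1010_1101,  # 1 bit away from /shoes
    "https://example.com/hats":     0b0101_0011,
}
indexed, noindex = split_pages(pages)
print(sitemap_xml(indexed))   # only /shoes and /hats appear in the sitemap
print("noindex:", noindex)    # /sneakers gets the "noindex" treatment
```

The daily report mentioned above would then simply be the `noindex` list, written out wherever your monitoring lives.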
For more details on using landing page hashes for SEO, take a look at Franz Enzenhofer’s article on Medium.com.