After the release of our new website, we looked at the feedback (or errors) we were getting from the search engines' webmaster tools and began to realize that all was not well in search engine land. The most interesting discovery was that the index contained entries for pages and resources that no longer existed on our website, and we were being penalized for it. Some were just old links, but some were strangely formatted or 'hacking' URLs.
The Situation
After looking closely at the issues being reported by the webmaster tools, we concluded:
- Search engines remember everything, and I mean every page that has ever existed on your website, past and present.
- They consider URLs that differ only by case (capital letters) to be different URLs.
- You need to tell the search indexer exactly what is going on with your content, because it won't draw intelligent conclusions on its own (for example, that you have moved a page).
The result?
The list of pages the search engines had indexed for our website was a rather bizarre mish-mash of our current website, our previous website, and everything that had existed before. On top of that, some pages were listed multiple times because references to them within our website differed only by case.
The upshot was that we were being 'marked down' for:
- Having indexed content that could now not be found.
- Having links to 'duplicate' pages whose content was identical but whose URLs differed only by case.
- (We suspect) having old indexed pages that, even though missing, were counted as duplicates of pages that had effectively been moved.
Note: Search engines don't like duplicate content, because they assume you have plagiarized it or are trying to boost a page. This is most bizarre when the site you 'plagiarized' is your own.
So by doing a bit of cleanup (which was exactly what we planned), we had killed, or at least badly injured, our SEO.
Managing the Situation
We realized that we needed to manage our responses (for the search engines) to deal with the changes we had made. After a bit of head scratching and plotting, we came up with the following strategy:
- Make all URLs lowercase. That one was easy.
- For every incoming page request, work out the request URL's 'perfect' form and, if the perfect form does not exactly match the request URL, add a canonical entry to the page header pointing at the perfect form. Obviously, the 'perfect' URL has to work!
- Gather the URLs of all incoming requests in a unique table. This is slightly different from a log, as it gave us one single definitive list of how the outside world saw us (the first three points are sketched in code after this list).
- Use the unique incoming URL list to:
- Return a 'Permanently Removed' result (HTTP 410 Gone) to the requester for resources we had no intention of carrying forward (mostly images).
- Return a 'Permanently Moved' result (HTTP 301 Moved Permanently), with the new URL, for resources that had been relocated.
- Return a 'Not Found' result (HTTP 404) for resources we knew nothing about.
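As a rough illustration of the first three points, here is a minimal sketch in TypeScript using Express-style middleware. It is not our actual implementation: `canonicalUrl`, `seenUrls`, and `example.com` are made-up names, and the 'perfect form' rules will depend on how your own URLs are structured.

```typescript
import express, { Request, Response, NextFunction } from "express";

const app = express();

// Unique table of every URL the outside world has requested.
// A Set stands in for what would really be a database table keyed on the URL.
const seenUrls = new Set<string>();

// Reduce a request path to its 'perfect' form: lowercase, no trailing slash.
// (Assumed rules - adjust to your own URL structure.)
function canonicalUrl(path: string): string {
  const clean = path.toLowerCase().replace(/\/+$/, "") || "/";
  return `https://example.com${clean}`;
}

app.use((req: Request, res: Response, next: NextFunction) => {
  // 1. Record the incoming URL exactly as requested.
  seenUrls.add(req.originalUrl);

  // 2. If the requested URL is not already in its perfect form, make the
  //    perfect form available so the page template can emit
  //    <link rel="canonical" href="..."> in its header.
  const perfect = canonicalUrl(req.path);
  if (perfect !== `https://example.com${req.path}`) {
    res.locals.canonical = perfect;
  }
  next();
});
```

The template then emits the canonical link only when `res.locals.canonical` is set, so pages that are already requested by their perfect URL are left alone.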
The last two steps were iterative, but they enabled us to return a logical response to every old (or weird) request we got; some of the old links went back years. A sketch of this lookup-and-respond step follows.
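The lookup itself can be as simple as a map from legacy URL to action, consulted before the normal routing. Again, this is only a sketch: the `legacyUrls` table and its entries are invented for illustration, whereas the real table was built iteratively from the unique incoming URL list.

```typescript
import express, { Request, Response, NextFunction } from "express";

const app = express();

// Each legacy URL maps to one of the three responses described above.
type LegacyAction =
  | { kind: "gone" }              // removed for good  -> HTTP 410
  | { kind: "moved"; to: string } // relocated         -> HTTP 301 + new URL
  | { kind: "notFound" };         // never really ours -> HTTP 404

// Invented example entries; the real ones came from the unique URL list.
const legacyUrls = new Map<string, LegacyAction>([
  ["/images/old-banner.gif", { kind: "gone" }],
  ["/services.aspx", { kind: "moved", to: "/services" }],
  ["/wp-login.php", { kind: "notFound" }],
]);

app.use((req: Request, res: Response, next: NextFunction) => {
  const action = legacyUrls.get(req.path.toLowerCase());
  if (!action) return next(); // a live page, handled by the normal routes

  switch (action.kind) {
    case "gone":
      return res.status(410).send("Gone");
    case "moved":
      return res.redirect(301, action.to);
    case "notFound":
      return res.status(404).send("Not Found");
  }
});
```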
After all that, we could let the search engines know: if something had been moved, where it now was; if something looked like a duplicate, where the original was; and if something had been removed, that it was gone.
Conclusion
The SEO work described here relates only to our recent website migration; our SEO strategy obviously encompasses more, but that is another article entirely. There are a few points worth making about this work:
- It worked.
- It takes a while for the search engines to catch up. The 'error' list is shrinking but not yet gone...
- We are still doing it ;-).
It's also worth mentioning that the unique list of URLs representing how the world sees our website is actually quite valuable in itself. We have plans to use it to dynamically generate a sitemap and power our IndexNow integration.
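As a rough sketch of how that might look (placeholder host, key, and URLs; the endpoint and JSON shape follow the published IndexNow protocol):

```typescript
// Build a minimal sitemap from the list of live URLs...
const liveUrls = [
  "https://example.com/",
  "https://example.com/services",
];

const sitemap =
  `<?xml version="1.0" encoding="UTF-8"?>\n` +
  `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
  liveUrls.map((u) => `  <url><loc>${u}</loc></url>`).join("\n") +
  `\n</urlset>`;

// ...and tell IndexNow-aware search engines about the same URLs.
async function notifyIndexNow(urls: string[]): Promise<void> {
  await fetch("https://api.indexnow.org/indexnow", {
    method: "POST",
    headers: { "Content-Type": "application/json; charset=utf-8" },
    body: JSON.stringify({
      host: "example.com",
      key: "replace-with-your-indexnow-key", // placeholder key
      keyLocation: "https://example.com/replace-with-your-indexnow-key.txt",
      urlList: urls,
    }),
  });
}
```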