XML Sitemap 101 – Starting from the Beginning
What is it? An XML sitemap is a very basic technical document that exists entirely to tell search engines which pages from your site you’d like them to include in their search results. In other words, it is a long list of every page you want Google to know about. XML is short for Extensible Markup Language, which matters only if you must know what it means: it’s a technical specification for a text document. The document is exceedingly dull, but it is as valuable as it is boring when it comes to getting your site fully indexed by search engines.
The XML sitemap lists the page URLs you decide to include, plus a few optional technical details such as priority and change frequency. It is never seen by visitors (or by most site owners). But because Google asks sites for this dull document, we create it and submit it. There are two ways to submit it to search engines: 1) by listing its URL in the robots.txt file and 2) by submitting it within Google Search Console (formerly Webmaster Tools).
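To make that structure concrete, here is a minimal Python sketch of what such a file contains. The function names, the example.com URLs and the page dictionaries are assumptions made up for illustration. The element names (urlset, url, loc, lastmod, changefreq, priority) and the namespace come from the sitemap protocol. In practice your CMS or plug-in generates this for you.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML for a list of page dicts (hypothetical input shape).

    Each dict needs 'loc' and may carry 'lastmod', 'changefreq', 'priority'.
    """
    ET.register_namespace("", SITEMAP_NS)  # emit a default xmlns, no prefix
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, f"{{{SITEMAP_NS}}}{tag}").text = str(page[tag])
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def robots_txt_line(sitemap_url):
    """Return the robots.txt line that points crawlers at the sitemap."""
    return f"Sitemap: {sitemap_url}"


if __name__ == "__main__":
    print(build_sitemap([
        {"loc": "https://www.example.com/", "changefreq": "daily", "priority": 1.0},
        {"loc": "https://www.example.com/blog/", "changefreq": "weekly"},
    ]))
    print(robots_txt_line("https://www.example.com/sitemap.xml"))
```

The robots.txt line covers the first submission method. The second, submitting through Search Console, is done by hand in the Search Console interface.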
This is not a document most people will ever edit or update. It is created and handled by software, and it updates dynamically when pages are added to (or removed from) the site. Above is the Yoast SEO plug-in interface for WordPress, which is highly recommended if you run a WordPress site. Many other content management systems (CMSs) create XML sitemaps without any involvement from site managers. There are also third-party tools to help you create an XML sitemap if your CMS (or its plug-ins) doesn’t do it automatically. Product managers who need to write XML sitemap specifications for their software engineers or development teams can simply reference the sitemap protocol at sitemaps.org, which is supported by the search engines. The technical specs are fully detailed there.
What Should Be Included in my XML Sitemap?
When I find sitemaps missing and recommend that clients create new ones, the engineers for large sites often assume that every page on the site must be included. This is not so. Only pages important to site hierarchy (category, sub-category, article, product detail, services and blog pages) and pages that matter to your search traffic should be listed in the sitemap. Now that you know what should be included, let’s pay some attention to what doesn’t go in that document. The list of what doesn’t belong is much longer than the list of what does. Here’s a sample list of things to KEEP OUT of your XML sitemaps:
- Any page blocked by a robots.txt Disallow rule or carrying a page-level noindex
- Any page that will redirect to a different page
- Any page with appended parameters such as category sort order parameters
- Any empty system-generated pages without text content (variable map pages)
- Any page that results in a 404 error (this can cause Google to reject the entire sitemap)
- Any repetitive faceted navigation links that simply change page presentation like colors, sizes, price range, etc.
One more item for that list surprises many people: static pages such as your FAQ or Terms of Service. Static content isn’t necessary in XML sitemaps. Remember that we’re giving search engines a list of pages we care about on the site. Even site owners don’t much care about their static content and include it only to meet legal or business requirements, not because it is something people will search for. We’re only listing pages that matter for search here. Those static pages are usually linked site-wide from the footer or top navigation, so they are almost always indexed by virtue of those links. No additional search engine help is needed for your FAQ or Terms of Service pages.
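The keep-out rules above can be sketched as a simple filter. This is an illustration only: the page record (its 'url', 'status', 'noindex', 'robots_disallowed', 'is_static' and 'word_count' fields) is a hypothetical shape for data you would collect from your own crawl or CMS, not the output of any particular tool.

```python
from urllib.parse import urlsplit


def belongs_in_sitemap(page):
    """Return True if a crawled page record should be listed in the XML sitemap.

    'page' is a hypothetical dict with keys: url, status, noindex,
    robots_disallowed, is_static, word_count.
    """
    if urlsplit(page["url"]).query:
        return False  # appended parameters: sort orders, facets, filters
    if page.get("status") != 200:
        return False  # redirects (3xx) and errors such as 404
    if page.get("noindex") or page.get("robots_disallowed"):
        return False  # blocked from indexing or crawling
    if page.get("is_static"):
        return False  # FAQ, Terms of Service and similar footer pages
    if page.get("word_count", 0) == 0:
        return False  # empty system-generated pages
    return True


if __name__ == "__main__":
    crawl = [
        {"url": "https://www.example.com/shoes/", "status": 200, "word_count": 350},
        {"url": "https://www.example.com/shoes/?sort=price", "status": 200, "word_count": 350},
        {"url": "https://www.example.com/old-page", "status": 301, "word_count": 0},
    ]
    print([p["url"] for p in crawl if belongs_in_sitemap(p)])
```

Treating any query string as a reason to exclude a URL is a deliberately blunt rule for the sketch. A site whose important pages rely on parameters would need a more careful test.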
One of the first things I check when doing SEO audits for consulting clients is the XML sitemap. I look at its structure, the sitemap errors listed in Google Search Console, the ‘change frequency’ element and the date the file was last updated. In the worst cases, those audits turn up static sitemaps generated many months or even years ago, along with Google Search Console errors showing that Google has rejected most or all of the XML sitemaps. Sometimes the sitemaps have not even been submitted within Google Search Console, and the robots.txt file has no pointer to the sitemap index.
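Two of those audit checks are easy to sketch in Python: whether robots.txt points to a sitemap, and how recent the newest lastmod date in a sitemap file is. The function names are assumptions for illustration, and the sketch works on text you have already downloaded. It does not touch Search Console, whose error reports still need a human to review them.

```python
from datetime import date
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_pointers(robots_txt):
    """Return every URL listed on a 'Sitemap:' line of a robots.txt file."""
    pointers = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            pointers.append(value.strip())
    return pointers


def newest_lastmod(sitemap_xml):
    """Return the most recent <lastmod> date in a sitemap, or None if absent."""
    root = ET.fromstring(sitemap_xml)
    dates = [
        date.fromisoformat(el.text.strip()[:10])  # keep only YYYY-MM-DD
        for el in root.findall(".//sm:lastmod", NS)
        if el.text and el.text.strip()
    ]
    return max(dates, default=None)


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /cart/\nSitemap: https://www.example.com/sitemap_index.xml\n"
    print(sitemap_pointers(robots))
```

An empty list from sitemap_pointers, or a newest_lastmod value that is many months old on an actively published site, is the kind of warning sign described above.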
Software engineers, developers and other technical team members pay very little attention to XML sitemaps, and in many cases the sitemaps are almost never seen or even understood by anyone but SEOs. As a result they are often non-existent, broken, outdated or full of errors. So I’ll probably be speaking mostly here to those who have heard about their importance and utility, but who don’t actively monitor their health or check whether they are fully implemented and maintained.
XML sitemaps matter to SEO. They must be current, must be submitted to search engines and must be dynamically updated as new content is added to the site. Errors must be fixed immediately when discovered. A staff member should be assigned the task of regularly reviewing Google Search Console to verify the health of your sitemaps. See one view of that Search Console XML Sitemap page below:
There is a special-purpose XML sitemap you should use if you have a video library you’d like Google to index. Video sitemaps differ in several technical details: they incorporate a fairly long list of attributes for each video, including title, description, ratings, thumbnail image, view count, duration (length), and price if the video is for sale. This detailed sitemap type can lead to enhanced search results when used with video schema on your site, so any site with a video library should consider separate video sitemaps. There are also image sitemaps and news sitemaps, which differ from the basic XML sitemap because of their unique use cases. Every variant of XML sitemap offers added value for indexing unique content, and some offer enhanced search results.
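As a rough illustration of how a video entry differs from a plain one, the Python sketch below builds a single url element with a nested video block. The function name and sample values are hypothetical. The tag names and the video namespace follow Google’s video sitemap extension as I understand it, and the supported tag list has changed over time, so check Google’s current documentation before relying on any of them.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Namespace of Google's video sitemap extension (verify against current docs).
VIDEO_NS = "http://www.google.com/schemas/sitemap-video/1.1"


def video_url_entry(page_url, video):
    """Return one <url> element (as text) describing a page and its video.

    'video' is a hypothetical dict; only the keys present are written out.
    """
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("video", VIDEO_NS)
    url = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
    vid = ET.SubElement(url, f"{{{VIDEO_NS}}}video")
    for tag in ("thumbnail_loc", "title", "description", "content_loc", "duration"):
        if tag in video:
            ET.SubElement(vid, f"{{{VIDEO_NS}}}{tag}").text = str(video[tag])
    return ET.tostring(url, encoding="unicode")


if __name__ == "__main__":
    print(video_url_entry(
        "https://www.example.com/videos/intro",
        {
            "thumbnail_loc": "https://www.example.com/thumbs/intro.jpg",
            "title": "Intro to XML sitemaps",
            "description": "A short walkthrough.",
            "content_loc": "https://www.example.com/media/intro.mp4",
            "duration": 90,  # seconds
        },
    ))
```

In a real file these entries would sit inside a urlset element, just as ordinary page entries do.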
Create, submit, monitor, maintain and understand XML sitemaps to get all your pages indexed. You should also have an HTML sitemap on your site. It is handled differently, in that bots crawl the individual links within that sitemap rather than “retrieving” the entire file as they do with an XML sitemap. This is a reference document for search engine spiders only and should not be treated as a user-focused page. See more on HTML sitemaps here.