A sitemap is the most convenient way to tell a search engine what a website’s structure looks like and what URLs it contains. The problem is that some pages may be nested too deep or otherwise unreachable to a search engine. Building a sitemap is a reliable way to keep the search engine aware of every subpage, even if it is located deep in the site structure or its URL is available only through a JavaScript call.

Not so long ago, webmasters built sitemaps in HTML. A “sitemap” link was usually located at the bottom of the page and led to a document strewn with links, looking like a big table of contents. It usually wasn’t of much use to visitors; it was meant to be consumed by search engines.

Then a new concept came into being: search engines began to accept sitemaps in XML format. It was a good idea for two reasons: submitting a sitemap was a clear signal to the search engine where to look for URLs, and users no longer had to bother with odd-looking sitemap pages.

But not everything went smoothly. Each search engine accepted XML sitemaps only in its own format, which meant that a webmaster who wanted to submit a sitemap to several search engines had to prepare several separate files. Then the Sitemap protocol was developed and the major search engines adopted it. The Sitemap protocol was a successful attempt to unify sitemap formats, and it is now widely used.

A sitemap is an XML document. In its basic form it looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://lukaszwrobel.pl/</loc>
  </url>
  <url>
    <loc>http://lukaszwrobel.pl/about-me</loc>
  </url>
  <url>
    <loc>http://lukaszwrobel.pl/get-in-touch</loc>
  </url>
  <url>
    <loc>http://lukaszwrobel.pl/blog/math-parser-part-3-implementation</loc>
  </url>

  ...

</urlset>

Basically, it’s just a list of url tags enclosed in a urlset tag. A sitemap also has some more sophisticated capabilities: it can state the last modification date, change frequency and priority of each URL. A more elaborate example looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://lukaszwrobel.pl/</loc>
    <lastmod>2008-11-29</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://lukaszwrobel.pl/about-me</loc>
    <lastmod>2008-11-25</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.6</priority>
  </url>
  <url>
    <loc>http://lukaszwrobel.pl/get-in-touch</loc>
    <lastmod>2008-11-25</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>http://lukaszwrobel.pl/blog/math-parser-part-3-implementation</loc>
    <lastmod>2008-11-08</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.7</priority>
  </url>

  ...

</urlset>

A few additional tags have been used:

The <lastmod> tag indicates the time of last modification. You can use either the full W3C Datetime format or just a date in YYYY-MM-DD format.
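
For example, a full W3C Datetime timestamp including the time and a timezone offset would look like this (the date itself is illustrative):

<lastmod>2008-11-29T18:30:02+01:00</lastmod>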

<changefreq> indicates how frequently the page is likely to change. You have to choose one of a few fixed values: always, hourly, daily, weekly, monthly, yearly, never. Two remarks are worth making here. First, use the always value for pages that change every time they are accessed. Second, use never only to mark URLs containing archived content.

<priority> determines the URL’s relevance compared to other pages on your site. It ranges from 0.0 to 1.0; 0.5 is the default value.

Remember that these values are only hints: search engines may use the information provided in an arbitrary way.

You can either create and maintain a sitemap manually or use existing software. Manually editing and updating a sitemap could turn into a pain in the ass, believe me. Using specialized software is a much better idea. You can try this generator out; it served me well. You simply feed it your website URL and let it crawl, then save the sitemap to disk and, if necessary (it usually is), give it a finishing touch. Generators are usually able to do much more, e.g. report 404 errors and duplicated content.

Once created, a sitemap can be submitted to a search engine. It is good practice to update it every time the website grows. That’s where dynamic languages come in: a sitemap can be regenerated automatically. Read how to generate a sitemap in Rails.
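
The linked article covers Rails; as an illustration of the general idea, here is a minimal Python sketch that builds the XML on the fly. The get_pages() helper and its contents are hypothetical stand-ins for whatever data source (database, routing table) your site actually has.

from datetime import date
from xml.sax.saxutils import escape

def get_pages():
    # Hypothetical data source; a real application would query
    # its database or routing table instead.
    return [
        {"loc": "http://lukaszwrobel.pl/", "lastmod": date(2008, 11, 29)},
        {"loc": "http://lukaszwrobel.pl/about-me", "lastmod": date(2008, 11, 25)},
    ]

def generate_sitemap(pages):
    # Build the urlset document line by line, escaping each URL
    # so it is valid XML.
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for page in pages:
        lines.append('  <url>')
        lines.append('    <loc>%s</loc>' % escape(page["loc"]))
        if page.get("lastmod"):
            # date.isoformat() yields the YYYY-MM-DD form accepted by <lastmod>.
            lines.append('    <lastmod>%s</lastmod>' % page["lastmod"].isoformat())
        lines.append('  </url>')
    lines.append('</urlset>')
    return '\n'.join(lines)

if __name__ == "__main__":
    print(generate_sitemap(get_pages()))

Served from a dynamic route (or run from a cron job that rewrites the file), a script like this keeps the sitemap in step with the site without any manual editing.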
