A crawler is responsible for “crawling” through your website’s content and storing that content in a search index. In this article, we’ll explain how to set up a crawler so that it crawls the content you want and stores it the way you want.
For details on how the crawling process works, check out “How does your crawler work?”.
To begin setting up a crawler, head to Settings -> Crawlers and either click on an existing crawler (like the “Default” crawler that we automatically created for you) or click “New” in the top right to start a new crawler.
The first section in crawler settings is the crawler “boundaries.” Here you can set the areas of your website(s) that will be crawled and made searchable. Specify included areas, excluded areas, and general crawling behavior. The crawler will stay within the set boundaries.
Included in the crawl
Websites: Add one or more websites to crawl. The crawler will use the URLs you add as starting points for the crawl. During the crawl, any discovered links that are within these websites will be crawled. Any links outside of these websites will be ignored.
Page exceptions (under ‘More options’): Add page URLs that are allowed even though they fall outside of the websites set above. If the crawler discovers these URLs during the crawl, it will index them, but it will not crawl any links found on those pages. Partial URLs are allowed. For example, adding '/archive/' will allow all URLs that contain that string, e.g. yourwebsite.com/archive/forms.
Sitemaps (under ‘More options’): Add XML sitemaps for the crawler to follow. Note, the crawler will automatically look for a sitemap in the root of included websites (i.e. yourwebsite.com/sitemap.xml).
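If you do add sitemaps here, they should be ordinary XML sitemaps. As a point of reference, a minimal sitemap in the standard sitemaps.org format looks like the following (the URLs and date are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <!-- one <url> entry per page the crawler should visit -->
      <url>
        <loc>https://yourwebsite.com/docs/getting-started</loc>
        <lastmod>2024-01-15</lastmod>
      </url>
      <url>
        <loc>https://yourwebsite.com/docs/faq</loc>
      </url>
    </urlset>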
Excluded from the crawl
Pages: Add page URLs that the crawler should ignore. Partial URLs are allowed. For example, adding '/docs/' will ignore all URLs that contain that string, e.g. yourwebsite.com/docs/images. These common undesirable URL strings are added by default when you create a new crawler:
URL regex (under ‘More options’): Add regular expressions to match page URLs that the crawler should ignore.
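The regular expressions are matched against page URLs. For example, patterns like the ones below (purely illustrative, not defaults) would exclude paginated listing URLs and print-friendly copies of pages:

    .*\?page=[0-9]+$
    .*/print/.*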
Crawl files: If enabled, the crawler will index files found in your website(s). The crawler can index these common file types: .txt, .xml, .pdf, .doc, .docx, .xls, .xlsx, .ppt, .pptx, or any files served as an octet-stream (binary data file).
Crawl sitemaps: If enabled, the crawler will process and crawl XML sitemaps. The crawler will automatically check for a sitemap.xml file in the root of your website(s) and will also crawl additional sitemaps provided in the “Included in the crawl” section.
Respect noindex tags and robots.txt rules: If enabled, the crawler will look for a robots.txt file in the root of your website(s) and adhere to any allow/disallow rules in the file. The crawler will also ignore pages with noindex meta tags in their markup. To target the Cludo crawler in robots.txt rules or a noindex tag, you can use the user agent shorthand “cludo”.
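For example, a robots.txt rule that keeps the Cludo crawler out of one section of your site, and a standard noindex meta tag, could look like the following (the /internal/ path is only a placeholder):

    # robots.txt, served at the root of your website (yourwebsite.com/robots.txt)
    User-agent: cludo
    Disallow: /internal/

    <!-- noindex meta tag in the <head> of a page that should not be indexed -->
    <meta name="robots" content="noindex">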
Respect canonical tags: If enabled, the crawler will process canonical tags found in your pages and ignore pages where the URL in the canonical tag does not match the URL of the current page. In the case of a mismatch, the crawler will attempt to index the URL found in the canonical tag.
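For example, if the page at yourwebsite.com/products?sort=price contained the canonical tag below, the crawler would skip that URL and attempt to index yourwebsite.com/products instead (the URLs are placeholders):

    <link rel="canonical" href="https://yourwebsite.com/products">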
The second section in crawler settings is the crawler “structure.” Here you can specify the structure of your pages so the crawler can effectively store their content. For example, what language should the crawler expect? Should the crawler find page titles automatically or in a specific location? Note, you can generally use default settings for a reliable, simple search experience.
Language: Set the language in which the content of your website(s) is written. The crawler processes the text of your website in a way that is specific to the language, so it’s important to set the language accurately.
Content type: Set one of two options for content type. The 'Web pages' type allows inexact search terms to return results; for example, a search for 'raining' might return results that contain 'rainy'. This should be used in most situations. The 'People directory' type should be used when search results must strictly match the search term, as in a directory of names.
The crawler will organize and store your page content into “page fields.” Once page fields are stored, they can be displayed in search results, used to boost pages, and more.
For detailed information on how page fields work and how to customize them, check out the article: Page and file fields.
Much like page fields, the crawler will also store file content into “file fields.” The rules for file fields are slightly different than those of page fields.
For detailed information on how file fields work and how to customize them, check out the article: Page and file fields.
Once you are satisfied with crawler settings, you can test how the crawler will apply those settings using a test crawl. This is a good way to confirm that the crawler will index your content properly before you commit to saving your changes and starting a crawl.
Enter a URL of a page or file into the test crawl input to test the indexability, field values, and available links for that page.
Indexability means the crawler can crawl and store a page without encountering any errors or breaking any of the rules in the crawler settings. If a page is indexable, you’ll see a success message:
If a page is not indexable, you’ll see an error message with a reason for the error:
There are many reasons why a page might not be indexable, but some of the most common reasons are:
- The page was not found by the crawler (404 error).
- The page’s URL is outside of the websites included in the crawl.
- The page has a canonical tag that leads to another page.
The field values section gives a preview of the values that will be given to each field for the URL you are testing. This is useful for checking that the field sources you have configured in the crawler settings are being applied properly.
Available links on page (under ‘More details’)
The test crawler will also return a list of the links that were found on the page and whether those links are “crawlable”. A link is crawlable when it is within the scope of the crawl. Please note, a crawlable link may still not be indexable: the crawler may follow the link and be unable to index the page for various reasons (e.g. if the page returns a 404).