A crawler, or web spider, is the mechanism for indexing a website. Indexing means capturing and storing specific content. So when Cludo crawls your website(s), we go through your pages and store information about them. Then, when a search is performed, the search engine uses the information the crawler has indexed to provide results.
A website can be crawled with one of two types of crawls:
- Smart crawl
- Full crawl
A smart crawl is a method of crawling where the crawler looks for a domain's XML sitemap and indexes the content listed in it. An XML sitemap is a document that helps our crawler(s), as well as other web crawlers like Google's, find and index information on your website. XML sitemaps can also include information about when a page was last modified, how frequently it is updated, and how important it is compared to other pages. A smart crawl updates the index based on any changes found between the time of the last index and the latest version of the sitemap. By default, we use a smart crawl to index your site; if no sitemap is found, we’ll fall back to a full crawl. To find your XML sitemap, try entering WWW.YOURDOMAIN.COM/sitemap.xml in your browser's address bar. Read more about XML sitemaps.
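To make the sitemap-based update more concrete, here is a minimal Python sketch of how a crawler could compare a sitemap's last-modified dates against the time each page was last indexed. It is an illustration only, not Cludo's actual implementation: the function names (`parse_sitemap`, `pages_to_reindex`), the sample sitemap, and the rule "re-index when the page is new, has no date, or was modified after it was last indexed" are all assumptions.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_lastmod(text):
    """Parse a <lastmod> value such as 2024-03-01; treat naive dates as UTC."""
    if not text:
        return None
    parsed = datetime.fromisoformat(text.strip())
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def parse_sitemap(xml_text):
    """Return {url: lastmod or None} for every <url> entry in the sitemap."""
    root = ET.fromstring(xml_text)
    entries = {}
    for node in root.findall("sm:url", NS):
        loc = node.findtext("sm:loc", default="", namespaces=NS).strip()
        if loc:
            entries[loc] = parse_lastmod(node.findtext("sm:lastmod", namespaces=NS))
    return entries


def pages_to_reindex(sitemap, indexed_at):
    """Hypothetical rule: re-index pages that are new, undated, or modified since last indexed."""
    stale = []
    for url, lastmod in sitemap.items():
        last_indexed = indexed_at.get(url)
        if last_indexed is None or lastmod is None or lastmod > last_indexed:
            stale.append(url)
    return sorted(stale)


# Made-up sitemap and index state, for illustration only.
SAMPLE = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-03-01</lastmod></url>
  <url><loc>https://example.com/pricing</loc><lastmod>2024-01-10</lastmod></url>
  <url><loc>https://example.com/blog/new-post</loc><lastmod>2024-03-05</lastmod></url>
</urlset>
"""
last_crawl = datetime(2024, 2, 1, tzinfo=timezone.utc)
indexed_at = {
    "https://example.com/": last_crawl,
    "https://example.com/pricing": last_crawl,
}

print(pages_to_reindex(parse_sitemap(SAMPLE), indexed_at))
# ['https://example.com/', 'https://example.com/blog/new-post']
```

In this sample, the home page is picked up because it changed after the last crawl, the blog post because it has never been indexed, and the pricing page is skipped because it has not changed.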
In a full crawl, the crawler starts at a base URL and follows the links on each page it visits. As long as a link points to a page within the base domain(s) configured for the crawler, the crawler keeps crawling and indexing pages as it finds them.
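The link-following behavior of a full crawl can be sketched as a breadth-first traversal. The Python example below is a simplified illustration under stated assumptions, not the crawler's real code: `full_crawl`, `LinkExtractor`, and the in-memory `SITE` dictionary (which stands in for real HTTP requests) are hypothetical names invented for this sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def full_crawl(base_url, fetch, allowed_domains):
    """Start at base_url and follow links that stay within allowed_domains."""
    seen = {base_url}
    queue = deque([base_url])
    indexed = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved
            continue
        indexed.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).hostname in allowed_domains and link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed


# A tiny in-memory "website" used instead of real network requests.
SITE = {
    "https://example.com/": '<a href="/about">About</a> <a href="https://other.example/">External</a>',
    "https://example.com/about": '<a href="/">Home</a> <a href="/contact#form">Contact</a>',
    "https://example.com/contact": "<p>Contact us</p>",
    "https://example.com/orphan": "<p>No page links here</p>",
}

print(full_crawl("https://example.com/", SITE.get, {"example.com"}))
# ['https://example.com/', 'https://example.com/about', 'https://example.com/contact']
```

The external link is ignored because its domain is not in the allowed set, and `/orphan` is never reached because no crawled page links to it.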
Does the crawler respect canonical, robots.txt, and noindex?
The crawler automatically respects noindex tags, canonical tags, and the settings in your robots.txt file. The crawler can be set to ignore these; simply submit your request to support.
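As a rough illustration of what respecting these three signals can look like, the Python sketch below checks a page against robots.txt, a robots meta tag, and a canonical link before deciding whether to index it. The user agent name `ExampleBot`, the `index_decision` function, and the choice to skip pages whose canonical URL points elsewhere are assumptions made for this example; they do not describe Cludo's internals.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBot"  # hypothetical crawler name

# Made-up robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


class PageSignals(HTMLParser):
    """Detect <meta name="robots" content="noindex"> and <link rel="canonical">."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            if "noindex" in attr.get("content", "").lower():
                self.noindex = True
        elif tag == "link" and "canonical" in attr.get("rel", "").lower().split():
            self.canonical = attr.get("href") or None


def index_decision(url, html, robots):
    """Return (action, reason), where action is 'index' or 'skip'."""
    if not robots.can_fetch(USER_AGENT, url):
        return ("skip", "robots.txt")
    signals = PageSignals()
    signals.feed(html)
    if signals.noindex:
        return ("skip", "noindex")
    # Assumption: a page whose canonical URL differs is treated as a duplicate.
    if signals.canonical and urljoin(url, signals.canonical) != url:
        return ("skip", "canonical")
    return ("index", "")


robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

print(index_decision("https://example.com/private/report", "<p>Report</p>", robots))
print(index_decision("https://example.com/draft",
                     '<meta name="robots" content="noindex, follow">', robots))
print(index_decision("https://example.com/shoes?color=red",
                     '<link rel="canonical" href="https://example.com/shoes">', robots))
print(index_decision("https://example.com/about",
                     '<link rel="canonical" href="/about"><p>About</p>', robots))
```

The first three pages are skipped (for robots.txt, noindex, and canonical respectively), while the last one is indexed because its canonical link resolves to its own URL.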
Can you find orphan pages?
No. Orphan pages are web pages that no other page links to. This makes them impossible for crawlers to find, as crawlers rely on following links to index pages. However, if you have pages that are not located on the website but that you’d like to include in a crawl, you can add them to the domains to be crawled for a specific crawler.
Can I crawl my site on-demand or manually?
Yes. Please see this article for more information: How do I run a crawl on demand?
Can I see all of the pages that have been indexed?
A full list of indexed pages is not currently available as a feature, but you can see the total number of pages indexed by typing * into the Test Search box on your MyCludo dashboard. You can also view your crawler log and search for a page to see whether it has been successfully indexed. Furthermore, Page Inventory can give you a good overview of how your crawls went.