A crawler is the mechanism that indexes content on a website. It starts at the configured domains, indexes the content of each page, and then follows the links it finds and indexes those pages too. It continues to crawl and index pages as long as they are located within the configured domains. Crawlers are how you define the different sources of content to be indexed for your search engine’s results.
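The crawl behavior described above can be pictured with a minimal sketch. This is not Cludo’s implementation: the in-memory PAGES map, the crawl function, and the hostname-based domain check are all assumptions made for illustration.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "website": each URL maps to the links found on that page.
PAGES = {
    "https://www.example.com/": [
        "https://www.example.com/about",
        "https://blog.example.com/post",
    ],
    "https://www.example.com/about": [
        "https://www.example.com/",
        "https://other.example.org/",
    ],
    "https://blog.example.com/post": [],
    "https://other.example.org/": [],
}


def crawl(start_urls, allowed_domains):
    """Breadth-first crawl that only indexes pages on the allowed domains."""
    indexed = []
    seen = set(start_urls)
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if urlparse(url).netloc not in allowed_domains:
            continue  # outside the configured domains: neither indexed nor followed
        indexed.append(url)
        for link in PAGES.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed


result = crawl(["https://www.example.com/"], {"www.example.com"})
print(result)
```

Only the two pages on www.example.com end up in the index; the links to the blog and to the other domain are discovered but skipped because they fall outside the configured domain.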
Setting up a Crawler
To set up a new crawler, go to the Settings menu on the left side of the MyCludo dashboard and select Crawlers. Click New to create a new crawler, or select an existing crawler to edit it.
Crawler name & domains to be crawled
First, name the crawler. This can reflect the domains it will be crawling, like Main Site or Blog. If you have multiple languages on the website, you can name the crawler based on the language it’s crawling. You may edit the name as needed.
Set the language of the site that the crawler will be indexing. The language setting allows the crawler to interpret the site and analyze the language for the best search experience.
For sites with multiple languages, create a separate crawler for each language to keep search results accurate. You can also crawl multiple domains with the same crawler if they share a language and use a similar HTML structure for the page title, description, and, if set, category and image.
In domains to be crawled, enter the URL(s) that should be indexed. These URLs serve as the base, or starting place, for the crawler to begin the index. If there are sections of the website you do not want to include in the crawl, enter a string that identifies them under URLs to be excluded from the crawler. This is helpful if you would like to keep old archives from appearing in search. To exclude specific pages, go to Tools > Excluded pages instead.
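As a rough illustration of how an exclusion string could remove a whole section from the crawl, here is a short sketch. It assumes plain substring matching against the URL, which this article does not confirm; filter_excluded and the sample URLs are hypothetical.

```python
def filter_excluded(urls, excluded_strings):
    """Keep only the URLs that contain none of the excluded strings."""
    return [u for u in urls if not any(s in u for s in excluded_strings)]


# Hypothetical URLs discovered during a crawl.
discovered = [
    "https://www.example.com/news/2024/launch",
    "https://www.example.com/archive/2009/old-post",
    "https://www.example.com/products",
]

# Assumed exclusion string for an old archive section.
to_index = filter_excluded(discovered, ["/archive/"])
print(to_index)
```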
Respect noindex meta tag and robots.txt: When this box is selected, the crawler follows the guidelines set within the robots.txt file. For example, if the directive Disallow: / is set, the crawler will not index the site. If the box is not selected, the crawler disregards any guidelines set by the robots.txt file. You can typically find a robots.txt file by entering YOURWEBSITE.COM/robots.txt. This setting is enabled by default for all crawlers.
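To see what a Disallow: / rule means in practice, the sketch below uses Python’s standard urllib.robotparser module. It only illustrates the robots.txt part of the setting, not the noindex meta tag, and it is not Cludo’s code: the may_index helper, the respect_robots flag, and the example-crawler user agent are assumptions.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt file that blocks every crawler from the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_index(url, respect_robots=True, user_agent="example-crawler"):
    """Return True if the page may be indexed under the chosen setting."""
    if not respect_robots:
        return True  # setting unchecked: robots.txt rules are disregarded
    return parser.can_fetch(user_agent, url)


print(may_index("https://www.example.com/page"))
print(may_index("https://www.example.com/page", respect_robots=False))
```

With the setting enabled, the page is rejected because of the Disallow: / rule; with it disabled, the same page is accepted.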
Crawler type: The standard crawler type is used to index content on a website.
The Person type of crawler is used when crawling a directory of contacts to find specific names. Changing a crawler to Person is not recommended unless the content consists only of people.
When a URL is whitelisted, the crawler will index a page that contains this URL if it is encountered during the crawl. This is helpful if there is a different domain or page you would like to include in the index without including the whole website. For example, if you are in the legal industry, you may want to include a page from the local bar association website but not necessarily the whole website.
Note: When deciding whether a link should be crawled, the crawler checks whether the link’s URL contains a whitelisted URL. For example, if the whitelist contains 'https://www.example.com' and the crawler encounters 'https://www.example.com/subdirectory/index.html', it will index the page because the whitelisted URL is contained in the URL it is considering.
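The containment rule in the note can be sketched as follows. The whitelist check mirrors the "is contained in" wording above; treating the configured domains as URL prefixes is an assumption, and should_index is a hypothetical helper rather than part of Cludo.

```python
def should_index(url, base_urls, whitelist):
    """Decide whether a discovered URL should be indexed."""
    # Assumption: a URL is in scope when it starts with a configured base URL.
    in_scope = any(url.startswith(base) for base in base_urls)
    # From the note: a URL qualifies if it contains a whitelisted URL.
    whitelisted = any(entry in url for entry in whitelist)
    return in_scope or whitelisted


base_urls = ["https://www.mysite.com"]      # hypothetical domain to be crawled
whitelist = ["https://www.example.com"]     # whitelisted URL from the note

print(should_index("https://www.example.com/subdirectory/index.html", base_urls, whitelist))
print(should_index("https://unrelated.example.org/page", base_urls, whitelist))
```

Because the check is a containment test, a short whitelist entry such as a bare domain matches every page under that domain, so a more specific entry is needed to include a single page only.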
Fields to be crawled for content pages
This setup will determine how the title, description, category, and image will be displayed in the search engine’s results page.
Title: You may use the following methods for displaying a page’s title:
- Automatic extraction: Cludo will use its algorithm to intelligently pull content from a page for the title
- First H1: The title will be defined by the content located within the first <h1> tag. For example, <h1>This will be the title </h1>
- Website title: The title will be defined by the content located within the <title> tag. For example, <title>This will be the title </title>
- Open graph: The title will be defined by og:title
- XPath: Define the title using an XPath query
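To make the difference between the First H1, Website title, and Open graph options concrete, the sketch below reads all three sources from a sample page using Python’s standard html.parser module. It is only an illustration of where each value lives in the HTML: the TitleSources class, extract_title function, method names, and sample page are hypothetical, and the automatic extraction and XPath options are not covered.

```python
from html.parser import HTMLParser


class TitleSources(HTMLParser):
    """Collects the first <h1>, the <title>, and the og:title meta tag."""

    def __init__(self):
        super().__init__()
        self.first_h1 = None
        self.website_title = None
        self.og_title = None
        self._capture = None  # name of the tag whose text is being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1" and self.first_h1 is None:
            self._capture, self.first_h1 = "h1", ""
        elif tag == "title" and self.website_title is None:
            self._capture, self.website_title = "title", ""
        elif tag == "meta" and attrs.get("property") == "og:title":
            self.og_title = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == self._capture:
            self._capture = None

    def handle_data(self, data):
        if self._capture == "h1":
            self.first_h1 += data
        elif self._capture == "title":
            self.website_title += data


SAMPLE_PAGE = """
<html>
  <head>
    <title>Website title </title>
    <meta property="og:title" content="Open graph title">
  </head>
  <body>
    <h1>First heading </h1>
    <h1>Second heading</h1>
  </body>
</html>
"""


def extract_title(html, method):
    """Return the title for one of: first_h1, website_title, open_graph."""
    parser = TitleSources()
    parser.feed(html)
    parser.close()
    candidates = {
        "first_h1": parser.first_h1,
        "website_title": parser.website_title,
        "open_graph": parser.og_title,
    }
    value = candidates[method]
    return value.strip() if value else None


for method in ("first_h1", "website_title", "open_graph"):
    print(method, "->", extract_title(SAMPLE_PAGE, method))
```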
Description: Use the following methods to pull the content for the description of the page:
- Automatic extraction: Cludo will use its algorithm to intelligently pull content from the page for the description
- Body HTML: The description will be pulled from the content located within the <body> tag. For example, <body>This will be the description pulled for the search results page. </body>
- XPath: Define the description using an XPath query
Category (Optional): Use the following to determine what categories are available in the search engine’s results page:
- Cludo meta tag: Use a Cludo meta tag to define the category
- Meta keyword tag: Categories will be defined using the <meta> tag. For example, <meta name="keyword" content="Products"> would create a Products category.
- URL match: Set the category based on a match in the URL. For example, if there is /products in the URL, set the category for the page to Products. For results that do not fit into a category, you should insert a default value, like Other.
- XPath: Define the category using an XPath query
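The URL match option, including the default value for pages that fit no category, can be sketched like this. The category_for function, the rule list, and first-match-wins ordering are assumptions for illustration, not Cludo’s actual matching logic.

```python
def category_for(url, rules, default="Other"):
    """Return the category of the first rule whose fragment appears in the URL."""
    for fragment, category in rules:
        if fragment in url:
            return category
    return default  # no rule matched: fall back to the default category


# Hypothetical URL match rules.
RULES = [
    ("/products", "Products"),
    ("/blog", "Blog"),
]

print(category_for("https://www.example.com/products/widget", RULES))
print(category_for("https://www.example.com/about-us", RULES))
```

The first URL is categorized as Products, and the second falls back to the default, Other.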
Image (optional): Use the following to determine the image displayed alongside a search engine’s result:
- Open graph: The image will be defined by og:image. If this is required, the field must be present for the crawler to index the field.
- Cludo meta: The image will be defined by cludo:image. If this is required, the field must be present for the crawler to index the field.
- XPath: Define the image using an XPath query. If this is required, the field must be present for the crawler to index the field.
For any additional content that may need to be pulled, such as price, location, or department, custom fields can be created using an Open Graph tag, Cludo meta tag, meta tag, URL match, or XPath.
Fields to be crawled for files
For Professional, Business, and Enterprise customers, files are automatically crawled and displayed within a Documents category. You can also choose to set the category based on a URL match or HTTP header, or to leave the files uncategorized.
There’s also the option to not crawl files in this setting.
Test your Crawler
Before saving your settings, you can review how a specific page will appear in the search engine’s results page by entering a test URL and clicking Test Crawler. This lets you tweak the configuration until you get the expected results. The tool will also give you recommendations if the test fails.
Once you’ve completed the setup of the new crawler, click Create. The crawl will be added to a queue immediately. After the crawler has been created, saving changes to its settings will also start a new crawl. You can view the status of the crawl on the Crawlers page. For more information on running a crawl, view this article.