What are crawler fields?
When a crawler stores (or “indexes”) a web page, it doesn’t simply store all the text of the page in one big block. Instead, it will split the text into different compartments or “fields” that might serve different purposes. A very basic crawler with default settings will store two fields:
- Title: the title of the page that will be displayed as such in search results.
- Description: generally, the remaining text on the page that you would like to be searchable.
Beyond these two required fields, you can add further fields to compartmentalize your page content.
Why are crawler fields important?
Once fields are stored by the crawler, they can drive many different types of functionality:
- Make content searchable: Your content cannot be searched unless it is stored in a field. Generally, we encourage users to use the “Description” field to capture most or all of a page’s content.
- Display fields in search results: By default, a page’s title field and a snippet of the description field are displayed in search results. Search results can be expanded to also show an image or even a custom field such as a published date (note: adding custom fields to search results requires a custom implementation or use of the API to build your own template).
- Boost pages based on a field: Using the Boostings tool, you can increase or decrease a page’s relevance based on a field. For example, if you set up a category field to give each page a category, you could boost all pages in the category “Products” to make them more visible in search results.
- Filter pages from search based on a field: In engine settings, you can set up filters that use a specific field to hide pages from search results.
Field type #1: Page fields
Page fields are the main type of crawler fields. They are applied to all web pages in your website(s). By default, you will find the following page fields in your crawler configuration:
- Title (required and enabled by default)
- Description (required and enabled by default)
- Image (disabled by default)
- Category (disabled by default)
In addition to these fields, you can add your own custom page fields.
Field type #2: File fields
File fields are applied only to files such as plain text, Word, or PDF documents. They function just like page fields, but they have slightly different default fields with slightly different configuration options. We’ll cover those options in more depth in the “Adding/editing fields” section. By default, you will find the following file fields in your crawler configuration:
- Title (required and automatically set by the file’s meta title or filename)
- Description (required and automatically set by the file’s content)
- Category (enabled and set to give all files the category “Documents” by default)
Clicking the edit button of an existing field, or clicking the button in the bottom left, will open the field editor.
Here, you’ll find all the settings that tell the crawler how to apply the field as it crawls.
Give the field a unique name that describes the type of content that will be stored in the field.
Field required toggle
Setting a field as required tells the crawler to ignore any page that doesn’t contain the field.
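As a rough illustration of the required toggle, the following Python sketch filters out pages that lack a required field. It is not the crawler's actual implementation: the `pages_to_index` function, the dictionary representation of a page, and the sample field values are all hypothetical.

```python
from typing import Dict, List, Sequence

# Each page is modeled as a dict of field name -> extracted value.
# Field names and values are hypothetical.
Page = Dict[str, str]


def pages_to_index(pages: Sequence[Page], required_fields: Sequence[str]) -> List[Page]:
    """Keep only pages that have a non-empty value for every required field."""
    return [
        page
        for page in pages
        if all(page.get(name, "").strip() for name in required_fields)
    ]


crawled = [
    {"url": "/pricing", "title": "Pricing", "description": "Plans and prices."},
    {"url": "/draft", "title": "", "description": "Unfinished page."},
    {"url": "/empty", "title": "Empty"},
]

kept = pages_to_index(crawled, ["title", "description"])
print([page["url"] for page in kept])  # only "/pricing" has both fields
```

In this sketch, an empty value is treated the same as a missing one, which is an assumption about how "doesn't contain the field" is interpreted.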
A source is where the crawler should look to find the value for a field. For example, if you have a specific HTML tag in your pages that should be used as the title field, you can set that HTML tag as the field source. You can even add multiple sources for a page field, and the crawler will check them in order until it finds a value for the field. For example, the crawler can check a specific HTML tag for a page’s title and, if it doesn’t exist, fall back to a second source such as the title tag of the page.
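The fallback order described above can be sketched in a few lines of Python. This only illustrates the ordering logic and is not Cludo's implementation: the `resolve_field` function, the two source functions, and the dictionary standing in for a parsed page are all hypothetical.

```python
from typing import Callable, Optional, Sequence

# A "source" here is any function that takes a parsed page and returns
# a value, or None when the page has nothing for that source.
# The page is modeled as a plain dict; all names are illustrative.
Source = Callable[[dict], Optional[str]]


def custom_heading(page: dict) -> Optional[str]:
    """First source: a specific HTML tag chosen for titles (hypothetical)."""
    return page.get("custom_heading")


def title_tag(page: dict) -> Optional[str]:
    """Second source: the page's title tag."""
    return page.get("title_tag")


def resolve_field(page: dict, sources: Sequence[Source]) -> Optional[str]:
    """Check each source in order and return the first non-empty value."""
    for source in sources:
        value = source(page)
        if value and value.strip():
            return value.strip()
    return None


title_sources = [custom_heading, title_tag]

page_with_heading = {"custom_heading": "Pricing plans", "title_tag": "Pricing | Example"}
page_without_heading = {"title_tag": "About us | Example"}

print(resolve_field(page_with_heading, title_sources))     # Pricing plans
print(resolve_field(page_without_heading, title_sources))  # About us | Example
```

The order of the list matters: putting the more specific source first means the generic one is consulted only when the specific one finds nothing.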
Add a new source or edit an existing one to open the source editor.
You’ll find options for different source types on the left. On the right, you’ll find an area with explanatory text for the currently selected source type and possibly some additional settings that are required for that type. The available source types include:
- Automatic (page fields only): The crawler will attempt to automatically find the best value for the field. This type is only available for title, description, and image fields.
- Page element (page fields only): Set a specific page element. The crawler will look for this element on each page and, if it is found, the element's value will be used as the field's value. The available elements are:
  - First H1 on a page
  - Title HTML tag
  - Body HTML tag
  - Meta tag with corresponding name value
  - Open Graph tag with corresponding property value that begins with “og:”
  - Cludo meta tag with corresponding property value that begins with “cludo:”
- XPath target (page fields only): Set an XPath pattern. The crawler will use this pattern to target elements on each page. If the pattern successfully targets a text value, that value will be used as the field's value. Note that the crawler uses XPath 1.0 syntax.
- URL match: Create pairs of URLs and values. When a page or file's URL partially or fully matches the URL in a pair, the field will be given the corresponding value. This source type is most often used to set a page's category.
- Ignore: The crawler will not look for the field in pages or files.
- Static value (file fields only): Set a value that will be used as the field’s value for all files.
- HTTP header (file fields only): Set the name of an HTTP header. If the crawler receives this header in a response to a request for a file, it will use the header's value as the value of the field.
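To make the URL match source type from the list above more concrete, here is a minimal Python sketch. It assumes that a partial match means the URL in a pair appears somewhere inside the page's URL and that the first matching pair wins. Neither rule is confirmed by this article, and the `match_url` function and the example URLs are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical URL/value pairs for a "Category" field.
CATEGORY_PAIRS: Sequence[Tuple[str, str]] = [
    ("https://www.example.com/products/", "Products"),
    ("/blog/", "Blog"),
    ("https://www.example.com/contact", "Contact"),
]


def match_url(url: str, pairs: Sequence[Tuple[str, str]]) -> Optional[str]:
    """Return the value of the first pair whose URL fragment occurs in `url`.

    Assumption: a "partial match" is treated as a plain substring test and
    the first matching pair wins. The real crawler may apply other rules.
    """
    for fragment, value in pairs:
        if fragment in url:
            return value
    return None


print(match_url("https://www.example.com/products/widget-a", CATEGORY_PAIRS))  # Products
print(match_url("https://www.example.com/about", CATEGORY_PAIRS))              # None
```

A page whose URL matches no pair gets no value from this source, so the crawler would move on to the next source configured for the field, if any.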
Default field value
Lastly, you can optionally set a default value that the crawler will use as the field value if none of the configured sources yields a value.