Duplicate results - why they occur and how to eliminate them

If you run into duplicate results, there's probably a logical reason - and a good solution - for that!

First things first: pages are only crawled once, so the same URL can never be indexed twice. If two results look alike, a closer look will show that their URLs are actually different.
Often, duplicate results occur when certain parameters are added to the URL.
A few examples of URL parameters that would generate duplicate results:
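For illustration, here are a few hypothetical URLs (the parameter names are generic examples, not taken from any specific site) that could all serve the exact same page content while being treated as distinct URLs by a crawler:

```
https://www.example.com/products
https://www.example.com/products?sort=price
https://www.example.com/products?sessionid=8f3a2c
https://www.example.com/products?utm_source=newsletter
```

Each variation returns the same content, so without canonicalization or URL exclusions, each one could show up as its own search result.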
If your website is reachable over both http:// and https://, or both with and without www, it's also important that only one of those variants is crawled by Cludo - crawling both would double every result.
Avoiding duplicate results
There are several ways to eliminate duplicate results.
Proactively, you can use canonical tags to avoid duplicate search results. Canonicalization also applies to other search engines (like Google): using canonical tags is the globally recognized way of avoiding duplicate results in search engines, and it makes your site more SEO-friendly as well. On this site, you can learn much more about canonical tags and how to use them.
"A canonical tag (aka "rel canonical") is a way of telling search engines that a specific URL represents the master copy of a page. Using the canonical tag prevents problems caused by identical or "duplicate" content appearing on multiple URLs. Practically speaking, the canonical tag tells search engines which version of a URL you want to appear in search results."
For example, if you have a blog or a news archive, you'd only want to index page 1 of that overview in addition to the unique articles. You wouldn't want to index the overview of page 2, 3, and so on - your users are either looking for a specific article or just want to land on your overview. Indexing page 1, 2, and 3 of your blog overview would most likely result in three search results with the title "Blog", making those look like duplicate results.
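As a sketch of that scenario, page 2 of a blog overview could point search engines back to page 1 with a canonical tag (the URLs here are hypothetical):

```html
<!-- In the <head> of https://www.example.com/blog?page=2 -->
<!-- Tells crawlers that the first overview page is the master copy -->
<link rel="canonical" href="https://www.example.com/blog" />
```

With this in place, only page 1 of the overview would appear in search results, alongside the individual articles.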
You can prevent pages from appearing in search engines - and your Cludo search - by including a noindex meta tag in the HTML of that page. Google explains that pretty well here.
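A minimal example of such a tag, placed in the page's head:

```html
<!-- In the <head> of a page that should not appear in search results -->
<meta name="robots" content="noindex" />
```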
However, if the crawler ignores a page because it has a noindex tag, it won't look for additional links within that page either - links to pages you might want indexed. Because of that, we would always recommend canonicalization over noindex.
Exclude URL parameters from your crawler settings
If you want full control over your Cludo crawler - and maybe have some specific needs that you only want to apply here and not for external search engines - you can exclude URLs directly from your crawler settings in MyCludo.
In this article about setting up your crawler, you can read how URLs are excluded.
Say you want to exclude all URLs containing &page=: simply add this parameter to the URLs to be excluded field, and those URLs will no longer be crawled.
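For example (with hypothetical URLs), adding &page= to the excluded-URLs field would have this effect:

```
https://www.example.com/blog?category=news&page=2   -> excluded from the crawl
https://www.example.com/blog?category=news          -> still crawled
```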
At the end of the day, adding URL exclusions also makes your crawler more efficient, since it runs faster when it can ignore certain URL patterns.
By default, Cludo's crawler always ignores the following URL patterns:
If you experience duplicate results on your site and need help addressing them, you're always welcome to contact our support.