Online Stores and Filters: A Magnet for Web Crawlers
Filters in online shops are extremely useful for users, allowing them to quickly and accurately search for products by colour, size, price, brand and other attributes. In this way, they improve the user experience and increase the chances of successful sales.
Unfortunately, filters can also cause significant problems for the performance of an online shop and for search engine optimisation. Combinations of filters generate a huge number of URLs that Google and other search engines try to crawl and index. This can lead to temporary downtime of the shop, duplicate content, wasted crawl budget and lower visibility of the webshop in search results.
Contents
- Using filters: what’s the problem?
- Basic measures against over-indexing
- Solution with directives in robots.txt
- Solution with directives in .htaccess
- Summary and final recommendations
Using filters: what’s the problem?
Each filter combination in an online shop creates its own unique URL, often with long parameters in the query, e.g.:
https://shop.net/trgovina/?filter_brand=nike&filter_material=cotton&filter_discount=on
https://trgovina.com/izdelki/majice.html?oznake=13&prodajnamesta=24&product_awards=55
When a shop has many product categories and many filters, the number of generated pages can quickly grow into the thousands (purely as an illustration, 10 brands × 8 sizes × 12 colours already yield 960 filter combinations for a single category). Search engine web crawlers (spiders) try to visit and often index these pages, leading to:
- slow performance or even inoperability of the webshop (due to increased consumption of server resources), resulting in a poor user experience for visitors and lost sales,
- duplication of content, which is detrimental to web optimisation (SEO),
- unnecessary indexation of pages without added value,
- confusion for web spiders (e.g. Googlebot) when interpreting the page hierarchy,
- faster consumption of crawl budget.
The crawl budget represents the number of pages that a search engine (e.g. Google) will visit on a website in a given period of time. If the crawl budget is spent on irrelevant URLs (e.g. filter combinations), there is a risk that important pages such as categories and products will not be visited and indexed in time.
Basic measures against over-indexing
We can first try to alleviate the problem of filter over-indexing with some basic SEO solutions:
- Use canonical tags – every URL with filters should point to the matching URL without filters.
Example of a URL with filters:
https://trgovina.com/izdelki/majice.html?oznake=13&prodajnamesta=24&product_awards=55
Canonical tag to use:
<link rel="canonical" href="https://trgovina.com/izdelki/majice.html" />
- No internal links to filtered pages – this will make it harder for web spiders to find URLs with filters.
- Exclude URLs with filters from the sitemap – the sitemap.xml file should only include relevant pages (see the sketch just below this list).
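Purely as an illustration, a stripped-down sitemap.xml containing only the clean, unfiltered URLs might look like the sketch below; the URLs are the hypothetical ones used earlier in this article:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- clean category page, without filter parameters -->
  <url>
    <loc>https://trgovina.com/izdelki/majice.html</loc>
  </url>
  <!-- clean shop landing page -->
  <url>
    <loc>https://shop.net/trgovina/</loc>
  </url>
</urlset>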
Despite the above measures, for larger stores with many filters, spiders may still visit filtered URLs in large numbers and thus have a significant impact on the consumption of server resources. In such cases, an additional technical solution is needed:
- less aggressive – with directives in the robots.txt file,
- more aggressive – with directives in the .htaccess file.
Solution with directives in robots.txt
A recommended initial measure is to use rules in robots.txt. With the appropriate Disallow directives, you ask search engines not to crawl URLs that contain certain parameters, without blocking them at server level. Well-behaved bots respect these rules, still crawl the rest of the site’s structure, and no longer spend their crawl budget on unnecessary filter combinations. Keep in mind that robots.txt only controls crawling: a disallowed URL that is linked from elsewhere can still appear in the index, but without its content being fetched.
Example for WooCommerce
User-agent: *
Disallow: /*?*filter_*&filter_*
Disallow: /*?*filter_*&*shop_view=
Disallow: /*?*filter_*&*per_page=
Disallow: /*?*filter_*&*query_type_*
Disallow: /*?*query_type_*&*filter_*
Disallow: /*?*min_price=
Disallow: /*?*max_price=
Example for Magento
User-agent: *
Disallow: /*?dir=
Disallow: /*?order=
Disallow: /*?mode=
Disallow: /*?price=
Disallow: /*?cat=
The robots.txt file is located in the root directory of the website. It can be uploaded to the server manually or edited directly from the store administration. In the case of a WooCommerce store, an SEO plugin (e.g. Yoast SEO or RankMath) can be used, while editing robots.txt in Magento 2 (Adobe Commerce) is shown in these instructions.
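As a rough sketch, the finished robots.txt in the web root could combine the filter rules above with a Sitemap directive, so that crawlers are steered away from filter URLs and pointed to the relevant pages at the same time (the sitemap URL below is an assumption and depends on your shop or SEO plugin):
User-agent: *
Disallow: /*?*filter_*&filter_*
Disallow: /*?*min_price=
Disallow: /*?*max_price=
# ... remaining Disallow rules from the example above ...

# Sitemap location is an assumption - adjust it to your setup
Sitemap: https://trgovina.com/sitemap.xml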
Solution with directives in .htaccess
Alternatively, you can use .htaccess rules to block certain bots (e.g. Googlebot) from accessing URLs with filters. This is a fairly aggressive approach that completely prevents web spiders from visiting the pages specified by the directives.
Instructions for editing the .htaccess file are available here.
Example for WooCommerce
# BEGIN - Block UA & QUERY requests for WooCommerce filters and add-to-cart
<IfModule mod_rewrite.c>
RewriteEngine On
# Specify UA (User Agent) - Googlebot hammer
RewriteCond %{REQUEST_METHOD} ^(GET|POST)$
RewriteCond %{HTTP_USER_AGENT} (Googlebot) [NC]
# Specify patterns (QUERY_STRING) for filters and add-to-cart
RewriteCond %{QUERY_STRING} (add-to-cart|filter_color|filter_size|filter_brand|min_price|max_price) [NC]
# Return 429 and stop
RewriteRule "^.*$" - [R=429,L]
</IfModule>
# END - Block UA & URL QUERY requests for WooCommerce filters and add-to-cart
In the .htaccess rules above, adjust the RewriteCond %{QUERY_STRING} line (line 8 of the code), where the different filters are listed. Keep the add-to-cart entry on this line.
Example for Magento
# BEGIN - Block UA & QUERY requests for Magento filters
<IfModule mod_rewrite.c>
RewriteEngine On
# Specify UA (User Agent) - Googlebot hammer
RewriteCond %{REQUEST_METHOD} ^(GET|POST)$
RewriteCond %{HTTP_USER_AGENT} (Googlebot) [NC]
# Specify patterns (QUERY_STRING) for filters
RewriteCond %{QUERY_STRING} (color|size|brand|price|cat|mode|order) [NC]
# Return 429 and stop
RewriteRule "^.*$" - [R=429,L]
</IfModule>
# END - Block UA & URL QUERY requests for Magento filters
In the .htaccess rules above, adjust the RewriteCond %{QUERY_STRING} line (line 8 of the code) accordingly, where the different filters are listed.
The .htaccess blocking method is quite aggressive, as it prevents the crawler from visiting certain pages. This means that Google cannot access those URLs at all and cannot see links that might be useful. While this solution reduces the load on the server and prevents unwanted pages from being indexed, it can also lead to the loss of important search results.
Summary and final recommendations
The use of filters brings many benefits to visitors to an online shop, but it also requires special attention. Improper filter management leads to over-indexing and increased consumption of server resources, resulting in slower performance or even temporary unavailability of the webshop. Proper configuration of filters is therefore important both for faster store performance and for cleaner indexing and better search engine positions.
Fixing the problem with .htaccess rules can keep the unwanted URLs out of the index, but it is a rather aggressive approach, as it completely blocks web spiders from accessing those pages. A more elegant and better long-term solution is robots.txt, where precise rules steer search engines away from unnecessary URL parameters.
One final tip: we recommend that you add your domain (online shop) to the Google Search Console, where you will be able to quickly see if Google is indexing unwanted pages on your website.
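Alongside Search Console, a quick informal spot check is a Google search limited to your domain and the filter parameters; the query below assumes the hypothetical domain and the filter_ parameter prefix used in the examples above:
site:trgovina.com inurl:filter_
If filtered URLs still show up among the results, the measures described above have not yet fully taken effect.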