
Improve website scanning with Crawl Budget Optimization

Crawl Budget Optimization: how to facilitate scanning and indexing

Today we will deal with one of the most underrated SEO topics. Many know about it, others have only heard of it, but there is probably a good share of web enthusiasts who do not know what it is at all. We are talking about Crawl Budget Optimization. Let’s shed some light on the topic by starting with a definition: the crawl budget is the amount of resources that Google dedicates to scanning a site. Google argues that this is not a ranking factor; actually, empirical data shows quite the opposite. The crawl budget affects site scanning, site scanning affects indexing, and indexing affects ranking. We therefore allow ourselves to add a small clarification to Google’s statement: the crawl budget is not a direct ranking factor, but it can have a heavy impact on a website’s visibility in the SERPs.

Crawl Budget increase

You are probably now wondering what the crawl budget of your website is and what influences its value. Let’s start by saying that the crawl budget varies from site to site, and its allocation follows criteria that are anything but democratic. So how do you make sure that Googlebot devotes more time and energy to your website? The size of the crawl budget essentially depends on two factors: the Crawl Rate Limit and the Crawl Demand. The Crawl Rate Limit is the maximum number of simultaneous connections that Googlebot can open to scan the site. It can be influenced in two ways:

  • through server settings;
  • through Google Search Console.

When the crawl rate limit is reached, Googlebot stops scanning the site. If the limit is not reached, however, Googlebot can keep scanning, provided that there is sufficient crawl demand.

The Crawl Demand depends on the following factors:

  • website popularity (inbound links);
  • website content quality.

In summary, to increase the crawl budget you must configure your server so that it does not limit Googlebot’s scanning; the Search Console crawl-limit setting should only be used in exceptional cases. Beyond that, it is a matter of publishing quality content and increasing the link popularity of the site.

Crawl Budget Optimization

We’ve just talked about how to increase the scanning budget; let’s now see how to help Google make the best use of it. Within Crawl Budget Optimization, it is possible to distinguish between two types of interventions:

  • server-side interventions;
  • on-site interventions.

Server-side interventions


This will probably sound strange, but a good deal of Crawl Budget Optimization has nothing to do with content or website code. The spider’s scanning capability depends first and foremost on host and web server performance. Here are the main server-related optimization interventions.

Reduce server response time

The time the server takes to return the requested resources is a key factor in crawl budget optimization. If the server responds slowly, the crawler struggles and runs out of the available budget earlier. From this point of view the range of action is not very wide: if your server is not performing well, you need to opt for a better hosting package. A change of server can have a very visible effect on Googlebot’s crawl activity.
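
As a rough check, you can measure the time-to-first-byte of a page with a few lines of Python. The sketch below uses only the standard library; the URL is just a placeholder.

```python
import time
from urllib.request import Request, urlopen

def time_to_first_byte(url: str) -> float:
    """Rough TTFB estimate: urlopen() returns once the status line and
    headers have arrived; reading one byte forces the body to start."""
    req = Request(url, headers={"User-Agent": "ttfb-check/1.0"})
    start = time.perf_counter()
    with urlopen(req, timeout=10) as resp:
        resp.read(1)  # wait for the first byte of the body
        return time.perf_counter() - start

# Placeholder URL used purely for illustration.
print(f"TTFB: {time_to_first_byte('https://www.example.com/'):.3f}s")
```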

HTML, CSS, and JS resource compression

Compression with gzip or deflate reduces the number of bytes sent over the network. This type of compression is easy to implement: use the mod_deflate module on Apache servers or the gzip module (ngx_http_gzip_module) on Nginx servers; on IIS servers, HTTP compression has to be enabled in the server configuration.
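
A quick way to verify that compression is actually enabled is to request a page with an Accept-Encoding header and check what comes back. Here is a standard-library sketch with a placeholder URL:

```python
from urllib.request import Request, urlopen

def is_compressed(url: str) -> bool:
    # Ask explicitly for compressed content; urllib does not do this by default
    # and does not decompress, so we only inspect the response headers.
    req = Request(url, headers={"Accept-Encoding": "gzip, deflate"})
    with urlopen(req, timeout=10) as resp:
        encoding = resp.headers.get("Content-Encoding", "")
        print(f"{url} -> Content-Encoding: {encoding or 'none'}")
        return encoding in ("gzip", "deflate", "br")

is_compressed("https://www.example.com/")  # placeholder URL
```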

Correct cache management

Cache settings do not only aim at speeding up user browsing. They are also honoured by Googlebot, which follows them just as a browser would. With the appropriate HTTP headers (Expires and Cache-Control) you can set caching times for all site resources; a quick way to check these headers is sketched after the list. Here are the ideal settings:

  • HTML resources: no cache storage;
  • images and videos: cache for at least one week, maximum one month;
  • CSS resources: cache for one week;
  • JS resources: cache for one week.
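
To see how a resource is currently being cached, it is enough to look at its Cache-Control and Expires response headers. A small standard-library sketch follows; the URLs are placeholders.

```python
from urllib.request import Request, urlopen

def print_cache_headers(url: str) -> None:
    # A HEAD request is enough: we only care about the response headers.
    req = Request(url, method="HEAD")
    with urlopen(req, timeout=10) as resp:
        print(url)
        print("  Cache-Control:", resp.headers.get("Cache-Control", "(not set)"))
        print("  Expires:      ", resp.headers.get("Expires", "(not set)"))

# Placeholder URLs: an HTML page and a static asset.
print_cache_headers("https://www.example.com/")
print_cache_headers("https://www.example.com/static/style.css")
```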

On-site interventions

After optimizing the server, you can move on to on-site optimizations. There is a series of important actions you can take to optimize Googlebot’s crawl activity. Below are the main ones.

Minifying HTML, CSS, and JS resources

Minifying the code means removing all unnecessary bytes, such as redundant spaces, indentation, and blank lines. This is a very simple operation that can almost always be done directly from the CMS interface, perhaps by installing a dedicated plugin.
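
In practice this is best left to a plugin or build tool; the snippet below is only a naive illustration of the idea (collapsing whitespace in an HTML fragment), not a production-grade minifier.

```python
import re

def naive_minify_html(html: str) -> str:
    """Illustration only: collapse whitespace between tags and drop blank lines.
    Real minifiers also handle <pre>, inline JS/CSS and many other edge cases."""
    html = re.sub(r">\s+<", "><", html)     # whitespace between tags
    html = re.sub(r"[ \t]{2,}", " ", html)  # runs of spaces and tabs
    return "\n".join(line for line in html.splitlines() if line.strip())

page = """
<html>
    <body>
        <h1>  Crawl   Budget  </h1>
    </body>
</html>
"""
print(naive_minify_html(page))
```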

Image optimization

Image optimization saves many bytes, speeding up the scanning process and improving the user browsing experience. In addition to the basic optimization that can be done with any image editing software (such as Photoshop or GIMP), it is good practice to compress JPEG and PNG images with tools that preserve visual quality. Valid tools for the JPEG format are jpegtran and jpegoptim, while for PNG images you can opt for OptiPNG or PNGOUT.
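
The same kind of lossy recompression done by those command-line tools can also be scripted. The sketch below uses the Pillow library (a third-party package, assumed to be installed) and hypothetical file names.

```python
from pathlib import Path
from PIL import Image  # third-party: pip install Pillow

def recompress_jpeg(src: str, dst: str, quality: int = 82) -> None:
    """Re-save a JPEG with a lower quality setting and Huffman-table
    optimization; quality 80-85 is usually visually indistinguishable."""
    with Image.open(src) as img:
        img.save(dst, "JPEG", quality=quality, optimize=True, progressive=True)
    before, after = Path(src).stat().st_size, Path(dst).stat().st_size
    print(f"{src}: {before} -> {after} bytes ({100 * after // before}%)")

recompress_jpeg("hero.jpg", "hero-optimized.jpg")  # hypothetical file names
```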

Correct redirect management

Redirects, besides causing a loss of trust, weigh the scanning process down significantly. In the case of a migration or URL rewriting, it is important to update all internal links so that the spider is not forced to make unnecessary requests for pages that have been permanently moved. Redirect chains must also be avoided.
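
A quick way to spot redirect chains is to fetch a URL and print every intermediate hop. The sketch below assumes the third-party requests library and uses a placeholder URL.

```python
import requests  # third-party: pip install requests

def show_redirect_chain(url: str) -> None:
    resp = requests.get(url, timeout=10)
    # resp.history lists every intermediate redirect response, in order.
    for hop in resp.history:
        print(f"{hop.status_code}  {hop.url}")
    print(f"{resp.status_code}  {resp.url}  (final)")
    if len(resp.history) > 1:
        print("Redirect chain detected: link directly to the final URL.")

show_redirect_chain("https://www.example.com/old-page")  # placeholder URL
```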

Correct internal links management

The way the site’s pages are linked to each other is crucial for good scanning and proper indexing. You should help Googlebot reach the most important pages of the site easily, at the expense of those that are irrelevant for SEO. Correct menu and pagination management, as well as the interlinking strategy, are fundamental. The number of links within individual pages also matters: every time a URL is linked, a little more of the scanning budget is spent. It is worth remembering that Googlebot also follows the URLs found in the <head> section of the page, such as canonical and hreflang URLs.
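
A simple way to audit how many URLs a page exposes to the crawler is to parse it and count the links, including the canonical and alternate annotations in the <head>. Here is a standard-library sketch with a placeholder URL.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkCounter(HTMLParser):
    """Collects <a href> links plus canonical/alternate <link> tags."""
    def __init__(self):
        super().__init__()
        self.anchors, self.head_links = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.anchors.append(attrs["href"])
        elif tag == "link" and attrs.get("rel") in ("canonical", "alternate"):
            self.head_links.append(attrs.get("href"))

with urlopen("https://www.example.com/", timeout=10) as resp:  # placeholder URL
    parser = LinkCounter()
    parser.feed(resp.read().decode("utf-8", errors="replace"))

print(f"<a> links: {len(parser.anchors)}, head links: {len(parser.head_links)}")
```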

Removing duplicate content and unnecessary pages

Most sites with crawl budget problems are characterized by duplicate content and junk pages that spiders could safely ignore. The most typical case is that of e-commerce filters, real crawl budget eaters that can generate tens of thousands of URLs with no added value. You need to make sure that Googlebot does not scan those URLs or, better yet, that those URLs are not generated at all.
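
When those filter URLs cannot be avoided at the source, a common fallback is to exclude them in robots.txt (or handle them with noindex/canonical, depending on the case). The sketch below uses Python’s standard urllib.robotparser to verify that a hypothetical faceted-navigation path is blocked; the rules and URLs are purely illustrative.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules blocking a faceted-navigation path prefix.
robots_txt = """\
User-agent: *
Disallow: /shop/filter
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

for url in (
    "https://www.example.com/shop/filter?color=red&size=m",  # filter page
    "https://www.example.com/shop/red-shoes",                 # product page
):
    allowed = rp.can_fetch("Googlebot", url)
    print(f"{'ALLOWED' if allowed else 'BLOCKED'}: {url}")
```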

Let’s summarize the possible steps to optimize and increase the Crawl Budget:

Activity | Goal | Impact (1-5)
--- | --- | ---
Correct setting of scan limits | Increase of crawl limit | 4
Creating quality content | Increase of crawl demand | 4
Link building | Increase of crawl demand | 4
Reducing server response times | Crawl budget optimization | 5
Resource compression | Crawl budget optimization | 3
Resource minification | Crawl budget optimization | 4
Image optimization | Crawl budget optimization | 3
Correct redirect management | Crawl budget optimization | 2
Correct internal links management | Crawl budget optimization | 4
Removing duplicate content | Crawl budget optimization | 4