
Analysis of web server logs: tips and tricks

Let’s continue talking about scanning and scannability by discussing an equally technical topic: the analysis of web server logs (topics we addressed in detail at the latest SeriousMonkey).

Log files are extremely useful – though sadly underused – because they are rich in valuable insights into how a search engine scans a website in a given period of time. These are not speculations, abstractions or personal opinions: logs provide concrete, real data, showing exactly what is going on behind the scenes of our website. That’s why it’s essential to analyse server logs to make the SEO strategy even more effective.

Analysing logs means examining the access requests sent to the web server that hosts our website, and drawing many valuable insights from them.

What is log analysis?

Log analysis consists of analysing the access requests that are sent to the web server that hosts a website.

Image: web server log analysis

To carry out the analysis, you first need to obtain one or more log files containing the list of access requests made to the website. The following information is usually recorded for each request:

  • Date and time of the request
  • Requested URL
  • User Agent
  • IP address of the user agent
  • Status Code
  • Requested page size
  • Server response time
  • Referrer (the page from which the user agent comes)

Image: example of an access log
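
To make the structure above concrete, here is a minimal sketch of how a single line in the common Apache/Nginx “combined” log format could be parsed; the regular expression and the sample line are purely illustrative and assume that format, so adapt them to whatever your server actually writes.

```python
import re

# Minimal sketch, assuming the Apache/Nginx "combined" log format;
# adjust the pattern if your server uses a custom format.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Illustrative sample line only.
sample = ('66.249.66.1 - - [10/Mar/2019:13:55:36 +0100] '
          '"GET /category/shoes HTTP/1.1" 200 5316 '
          '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

match = LINE_RE.match(sample)
if match:
    print(match.groupdict())
```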

You should be familiar with the Crawl Budget concept and its defining factors: the Crawl Rate Limit and Crawl Demand.

Preliminary concepts

If you follow us regularly, you might remember an article we dedicated some time ago to the Crawl Budget, that is, the set of resources Google dedicates to scanning. Basically, optimising the crawl budget means ensuring that Google uses its resources to scan the pages that are most important for our business. The size of the Crawl Budget depends essentially on two factors: the Crawl Rate Limit and the Crawl Demand.

The Crawl Rate Limit is the maximum number of simultaneous connections Google can make to scan your website. It can be influenced in two ways: by server performance and configuration, and by the Search Console setting that lets you reduce it (but not, of course, increase it).

The Crawl Demand is the scanning demand, i.e. it indicates how much a site is worth scanning. It depends on two main variables: the popularity of the website (in terms of links received) and the quality of its content.

Log analysis is essentially aimed at optimising both the Crawl Rate Limit and the Crawl Demand, and therefore at optimising the Crawl Budget as a whole.

The analysis process: user agent

A User Agent analysis allows us to:

  • Understand whether a website is scanned by useless or harmful bots;
  • Understand whether all Google bots (or at least those of interest) are scanning the website.

By preventing access to useless or harmful bots, you can ease the workload on the server, which can then accept more requests from Google’s bots. If, for example, your website does not target China or Russia, it will certainly be useful to block Baiduspider and YandexBot.
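
A quick way to see which bots are consuming server resources is simply to count the requests per user agent. The sketch below is a minimal illustration: the `records` list stands in for the parsed log entries (for example, the dictionaries produced by the regular expression shown earlier).

```python
from collections import Counter

# Minimal sketch: `records` stands for the parsed log entries;
# only the user_agent field is needed here.
records = [
    {"user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
    {"user_agent": "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"},
    {"user_agent": "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"},
]

hits_per_agent = Counter(r["user_agent"] for r in records)
for user_agent, hits in hits_per_agent.most_common():
    print(hits, user_agent)
```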

URL analysis is critical for detecting Crawl Budget dispersion across useless pages, orphan pages, or infinite generation of URLs.

The analysis process: URL

The URL analysis allows us to understand whether:

  • Googlebot requests a website’s most important URLs more often than the others;
  • Googlebot is requesting useless URLs;
  • there are slow pages;
  • there are pages that are too large;
  • the website has mechanisms for infinite generation of URLs;
  • Google is scanning orphan pages;
  • there are URLs that are not scanned.
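
Aggregating Googlebot’s requests per URL is the starting point for most of these checks. The sketch below is a minimal illustration, with `records` again standing in for the parsed log entries.

```python
from collections import Counter

# Minimal sketch: how many times Googlebot requested each URL.
records = [
    {"url": "/category/shoes", "user_agent": "Googlebot/2.1"},
    {"url": "/category/shoes?colour=red&size=42", "user_agent": "Googlebot/2.1"},
    {"url": "/product/123", "user_agent": "Googlebot/2.1"},
]

googlebot_hits = Counter(
    r["url"] for r in records if "Googlebot" in r["user_agent"]
)
for url, hits in googlebot_hits.most_common(20):
    print(hits, url)
```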

It may happen that Googlebot rarely scans the pages that are most important to our business. Why does this happen? Before taking any further action, it is necessary to check whether there are technical obstacles that prevent Googlebot from reaching certain pages easily. If there are no such obstacles, the pages in question probably have a crawl demand that is inadequate to their importance, and what you need to do is try to increase it by acting on the content, on the internal linking and on the link popularity of the pages.

It could then happen that Googlebot scans unnecessary pages. A typical case is that of duplication phenomena managed with noindex or canonical. We know that noindex and canonical allow us to avoid SERP issues caused by duplicates, but they do not prevent Googlebot from scanning the duplicate pages; if there are many of them, Googlebot is wasting a great deal of crawl budget and it is good to take remedial action. The fastest and most effective solution is to block access to the duplicates via robots.txt, but it is not always convenient: duplicate pages may receive links from outside, and blocking them in robots.txt would mean losing the benefits of those links. It is therefore necessary to assess case by case, it being understood that the best solution would be to intervene on the development side so that duplicate pages are not generated in the first place.

Googlebot could also get lost within mechanisms of endless URL generation, as often happens with the filter pages of large e-commerce websites. An e-commerce filter system can generate millions of URLs, all of which are substantially useless for SEO purposes.

We recommend that you do not index filter pages to prevent bots from being lost and wasting Crawl Budget by scanning thousands – or millions – of worthless URLs.

Image: e-commerce filters

My personal opinion is that you should never index filter pages; if there is a need to focus on the positioning of particular keywords, you can find alternative solutions, since there are several very valid ones. The only way to prevent Googlebot from getting lost among millions of worthless URLs is to block access via robots.txt. Again, however, the best solution would be another: to develop the filter system so that it does not generate scannable URLs.
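
By way of illustration only, a robots.txt block on filter parameters might look like the lines below; the parameter names (colour, size, price) are purely hypothetical and must be replaced with the ones your filter system actually generates.

```
# Purely hypothetical parameter names: replace them with the ones
# your filter system actually generates.
User-agent: *
Disallow: /*colour=
Disallow: /*size=
Disallow: /*price=
```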

That’s what happened to one of our customers, a large e-commerce website:

Image: filter pages

A single listing page generated almost two million filter pages, which Googlebot scanned in a single day. So, do you think Google used its crawl budget well?

It may happen that some pages scanned by Google are not detected by our Screaming Frog crawl, that is, there may be unlinked pages on the website. Orphan pages may be perfectly normal: an e-commerce website, for example, may decide not to link product pages that are currently unavailable and then link them again when the products are back in stock. In cases like this it is quite normal for there to be orphan pages, but it may also happen that their presence has, at least at first sight, no logical explanation. In these cases it is necessary, at the very least, to investigate why these pages were not included in the website’s link structure.

Conversely, there may be pages that are correctly linked within the website but are never scanned by Google. In these cases, crawl demand issues can be ruled out: if Googlebot never scans a URL, there is certainly a technical obstacle.
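
A simple way to surface both situations is to compare the set of URLs requested by Googlebot (from the logs) with the set of URLs found by a crawl of the website; the sketch below is a minimal illustration with made-up URLs, and the crawl export could come from Screaming Frog or any similar crawler.

```python
# Minimal sketch: URLs that Googlebot requested (from the logs) but that a
# site crawl never found are orphan-page candidates; the reverse difference
# lists linked URLs that Googlebot never requested.
urls_in_logs = {"/product/123", "/product/456", "/old-landing-page"}
urls_in_crawl = {"/product/123", "/product/456", "/category/shoes"}

orphan_candidates = urls_in_logs - urls_in_crawl
never_scanned = urls_in_crawl - urls_in_logs

print("Possible orphan pages:", sorted(orphan_candidates))
print("Linked but never requested by Googlebot:", sorted(never_scanned))
```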

You can also identify the heaviest URLs and those for which the server takes the longest to respond. Speeding up the response time optimises the crawl budget that Google has allocated to our website.
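
If the log format includes the response time (for example %D in Apache or $request_time in Nginx), a small aggregation like the sketch below, again with illustrative records, is enough to rank the slowest and heaviest URLs.

```python
from collections import defaultdict

# Minimal sketch: average response time and size per URL.
records = [
    {"url": "/category/shoes", "size": 5316, "response_time_ms": 180},
    {"url": "/category/shoes", "size": 5316, "response_time_ms": 240},
    {"url": "/huge-page", "size": 2_400_000, "response_time_ms": 3100},
]

stats = defaultdict(lambda: {"hits": 0, "size": 0, "time": 0})
for r in records:
    s = stats[r["url"]]
    s["hits"] += 1
    s["size"] += r["size"]
    s["time"] += r["response_time_ms"]

# Sort by average response time, slowest first.
for url, s in sorted(stats.items(), key=lambda kv: kv[1]["time"] / kv[1]["hits"], reverse=True):
    print(url, "avg time:", s["time"] // s["hits"], "ms,", "avg size:", s["size"] // s["hits"], "bytes")
```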

The analysis process: directories

A directory analysis allows us to:

  • Understand whether Googlebot is scanning directories it should not scan;
  • Check whether Googlebot is scanning the directories that are most important to us more than the others;
  • Evaluate whether Googlebot attaches more importance to a country/language.

Image: directories

This latter aspect is particularly important because it also allows us to understand whether there are automatic geolocation-based redirection mechanisms that hinder Googlebot’s scanning activity: never forget that Googlebot almost always carries out its activity from US IP addresses.
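
Grouping the requested URLs by their first-level directory gives an immediate picture of where Googlebot concentrates its activity; the sketch below is a minimal illustration, with the /en/, /fr/ and /blog/ paths standing in for whatever structure your website actually uses.

```python
from collections import Counter
from urllib.parse import urlsplit

# Minimal sketch: group requests by first-level directory.
requested_urls = ["/en/product/123", "/en/category/shoes", "/fr/produit/123", "/blog/post-1"]

def top_directory(url: str) -> str:
    path = urlsplit(url).path
    parts = [p for p in path.split("/") if p]
    return "/" + parts[0] + "/" if parts else "/"

hits_per_directory = Counter(top_directory(u) for u in requested_urls)
for directory, hits in hits_per_directory.most_common():
    print(hits, directory)
```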

A log analysis can also help you to check whether the status codes set on your pages are correct.

The analysis process: status code

The status code analysis allows us to understand a number of things. First of all: does Googlebot receive the same status codes as users? It is very important that Googlebot be treated exactly like a user; if it is treated differently, an in-depth analysis is needed to understand the causes and possible shortcomings of a situation that we can undoubtedly call abnormal.
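
A minimal sketch of that first check: compare the distribution of status codes returned to Googlebot with the one returned to everyone else, with `records` once more standing in for the parsed log entries. Large differences between the two distributions are worth investigating.

```python
from collections import Counter

# Minimal sketch: status code distribution for Googlebot vs everyone else.
records = [
    {"status": "200", "user_agent": "Googlebot/2.1"},
    {"status": "301", "user_agent": "Googlebot/2.1"},
    {"status": "200", "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
]

googlebot = Counter(r["status"] for r in records if "Googlebot" in r["user_agent"])
others = Counter(r["status"] for r in records if "Googlebot" not in r["user_agent"])
print("Googlebot:", dict(googlebot))
print("Users:", dict(others))
```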

404 errors

Are 404 errors a problem from an SEO point of view? I hope the answer is there for all to see: absolutely not. They can become a problem, however, if there are a great many 404 pages and Googlebot keeps insisting on them, to the detriment of more important pages. We assume that these are not broken links within the site (we would have noticed them even before analysing the logs) but URLs that are no longer linked, perhaps for a long time. If we are sure that these pages will never be put back online, we can replace the 404 status code with a 410 which, in theory, should ensure a faster deindexing process.

301 redirects

What happens if the logs show millions of 301 status codes? A hasty conclusion might lead us to say, “they have migrated and managed it well. Well done!” A closer look would instead lead us to say, “here’s yet another superficially managed migration.” SEOs often consider redirects a panacea for all evils and forget that every time you set up a redirect you force Googlebot to send two requests to the server: one for the old URL and one for the new one. Imagine having a website with two million pages and having to migrate; the website receives visits on only one million of those pages, while the other million is made up of pages that do not bring visits, are not linked from outside and are not included in the new website. What do you do? If the answer is “I redirect everything to the home page,” you’re making a huge mistake.

The analysis process: the IP

By analysing the IP addresses of Google bots, you can check whether:

  • Google also scans a website with non-U.S. IPs;
  • the requests really come from Googlebot or from a bot that merely uses Google’s User Agent (see the verification sketch below).
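
The verification that an IP really belongs to Googlebot follows Google’s documented procedure: a reverse DNS lookup on the IP, a check that the resulting host name belongs to googlebot.com or google.com, and a forward DNS lookup to confirm it resolves back to the same IP. Below is a minimal sketch of that check in Python.

```python
import socket

# Minimal sketch: reverse DNS, domain check, then forward DNS back to the IP.
def is_real_googlebot(ip: str) -> bool:
    try:
        host, _, _ = socket.gethostbyaddr(ip)            # reverse DNS lookup
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        resolved_ips = socket.gethostbyname_ex(host)[2]  # forward DNS lookup
        return ip in resolved_ips
    except (socket.herror, socket.gaierror):
        return False

print(is_real_googlebot("66.249.66.1"))  # expected True for a genuine Googlebot IP
```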

Useful tools for log file analysis

One of the most used and simplest tools is Screaming Frog Log File Analyser. There are many other similar tools, both server side and client side, which perform better or worse depending on the size of the website and the number of requests it receives. They are generally paid software but beware: none of them is really necessary. If you are patient and willing, you can open the log files with Excel and carry out the analysis “by hand”. For a list of the best paid software, see this page.

Paolo Amorosi, SEO Coordinator at Pro Web Consulting