27 June 2019 Pro Web Blog

SEO & Machine Learning: classifying content to optimize editorial sites

Large news portals need impeccable content management to avoid traffic loss, declining quality, and possible penalties.

The choice often comes down to two risks: keep publishing similar content, which pushes earlier articles into the background and costs some traffic, or publish less and risk attracting fewer readers.

Machine learning can help address the problem: we discussed it in our latest white paper, which we prepared for the 2019 Web Marketing Festival.

It’s not just about process automation: a timely content audit starts with extracting and analyzing the site’s text. It can bring significant benefits, both in selecting which content to keep and in growing the content that performs.

Better online content for better editorial sites

An effective content strategy is the foundation of an editorial site, but reconciling the need to keep expanding content with the site’s existing archive is not easy. Specialists often find themselves caught between these two needs, facing the following critical issues:

Index bloat: the number of indexed pages keeps growing. Growth has historically been constant and, without intervention, will continue. As the site gets larger, quality content risks being overshadowed;
Content overlap and cannibalization: newsrooms do not always keep track of what has already been written, which leads to substantial overlap, with dozens of similar articles competing with one another;
Crawl budget and content accessibility: if the crawl budget is not optimized while content keeps being produced, there will be blind spots, that is, pages the search engine never crawls;
Low-quality content: the publisher’s need to monetize often favors quantity over quality. A large number of overlapping pages of questionable quality only makes the balance more unstable.
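To make the overlap problem concrete, here is a minimal sketch of how near-duplicate articles could be flagged. It is not the method from the white paper: it assumes a simple Jaccard similarity over the words of each title, and the function names, the sample titles, and the 0.5 threshold are all hypothetical choices for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of distinct words two texts have in common (0.0 to 1.0)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def overlapping_pairs(titles, threshold=0.5):
    """Return index pairs of titles whose similarity reaches the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(titles), 2)
        if jaccard(a, b) >= threshold
    ]


# Hypothetical article titles, used only as sample data.
titles = [
    "how to renew your passport online",
    "how to renew your passport by post",
    "best pizza places in rome",
]
pairs = overlapping_pairs(titles)
print(pairs)
```

Comparing every pair is quadratic in the number of articles, so on a large archive this would only be practical within a category or after a coarser grouping step.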

What to do? Our content strategy process

Use a process based on analytics data, third-party organic visibility tools, semantic analysis, and clustering to choose what to remove and what to keep.

Here is the strategy, step by step:

Content audit process: moving from the traditional audit to an automated, optimized process thanks to machine learning;
Exploratory analysis: this covers the distribution and segmentation of content and gives a precise picture of the health of the site’s texts. Texts are analyzed by category and length, then segmented by how efficiently they rank;
Identification of anomalies based on keyword analysis and clustering: the algorithm lets us verify which content can be considered anomalous. TF-IDF is a weighting function that measures how important a term is to a document within a collection of documents. The weight increases in proportion to the number of times the term appears in the document, but decreases as the term becomes more frequent across the collection as a whole. The idea is to give more importance to terms that appear in the document but are generally infrequent elsewhere.
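The TF-IDF weighting described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in the audit: it assumes term frequency normalized by document length and an inverse document frequency of log(N / df), which is one common variant among several. The `tf_idf` function and the sample articles are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # tf grows with occurrences in this document;
            # idf shrinks as the term spreads across the collection.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample texts, used only to show the behavior.
articles = [
    "the election results and the election night",
    "the election polls",
    "the football results",
]
scores = tf_idf(articles)
for doc_scores in scores:
    print(sorted(doc_scores.items(), key=lambda item: -item[1])[:3])
```

In this sample, "the" appears in every article and so receives a weight of zero, while "polls", which appears in only one article, outweighs "election" in the second text. Production libraries typically add smoothing and normalization on top of this basic form.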

Conclusions: do you have an editorial site you would like to optimize?

The content audit lets us show the customer the health of the site’s texts, as well as how well optimized they are relative to one another. The advantage is not only that the activity is automated, but also that it scales to large amounts of textual data, making better use of available resources.

This allows the customer to work directly on the problematic texts, saving time and effort. The use of machine learning in SEO cannot be separated from the human factor: results must always undergo a final review by human specialists.

The results achievable with this approach to content strategy are not an end point but an incentive to improve the publishing model, moving towards quality and a better response to users’ needs.

Andrea D'Agostino, Data Analyst at Pro Web Consulting