Thursday 15:35–16:05 in Track 3

Mining articles for practical insights for content creation

Łukasz Dziekan /CTO @ Finai, Michał Stolarczyk

Audience level:


To support our marketing team, we created a tool that analyzes article headlines and content. It gives insights into how to write headlines and models the potential "virality" of a content piece. This was particularly challenging because of the limited NLP support for the Polish language. The tool is actually used by our marketing team.


Using the Facebook API, we collected data from the fan pages of Polish portals that publish articles online. Based on the number of shares, comments, likes and other reactions, we defined a virality coefficient, which lets us measure how much potential each article has to become viral and, therefore, how interesting it is in terms of marketing potential. Given this dataset, we wanted to identify the catchiest phrases occurring in article titles and to check whether the content actually matters. We examined how these top phrases change over time and clustered them by meaning. Moreover, we automated the process of distinguishing between phrases tied to one-time events (27-1) and those occurring regularly. We also consider the impact of other headline features on the virality of the article. Additionally, we examine formatting features derived from the article content. Higher-level virality analysis concerns linking articles that cover the same topic, which requires adding each article's HTML code to our dataset and extracting the text (body) from it.
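As a rough illustration of the two ideas above, the sketch below computes a virality score from interaction counts and flags phrases that look like one-time events. The abstract does not give the actual formula or rule, so everything here is an assumption: the weights in WEIGHTS, the normalization by fan count, the week-based window, and the function names virality_coefficient and is_one_time are all hypothetical.

```python
# Hypothetical weights: the abstract does not say how interactions are combined.
WEIGHTS = {"shares": 3.0, "comments": 2.0, "likes": 1.0, "reactions": 1.0}


def virality_coefficient(post: dict, fan_count: int) -> float:
    """Weighted interactions per fan (dividing by fan count is an assumption)."""
    if fan_count <= 0:
        raise ValueError("fan_count must be positive")
    weighted = sum(weight * post.get(key, 0) for key, weight in WEIGHTS.items())
    return weighted / fan_count


def is_one_time(weeks_seen: list, max_span_weeks: int = 2) -> bool:
    """Treat a phrase as a one-time event if all occurrences fall in a short window.

    weeks_seen holds the week index of each headline containing the phrase.
    The two-week default window is an assumed threshold, not the authors' rule.
    """
    if not weeks_seen:
        raise ValueError("weeks_seen must not be empty")
    return max(weeks_seen) - min(weeks_seen) < max_span_weeks


# Example with made-up numbers.
post = {"shares": 10, "comments": 5, "likes": 100, "reactions": 20}
score = virality_coefficient(post, fan_count=1000)
burst = is_one_time([12, 12, 13])      # occurrences packed into two adjacent weeks
recurring = is_one_time([3, 20, 41])   # occurrences spread over many weeks
```

Dividing by fan count is one way to keep large and small fan pages comparable; a real coefficient might instead normalize by the page's typical engagement.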

During our talk we will cover the following areas:
Data collection:

Data preprocessing:


Analyses:

