PyData Warsaw 2017 - Presentation: Mining articles for practical insights for content creation

Using the Facebook API, we collected data from the fan pages of Polish portals that publish articles on the internet. Based on the number of shares, comments, likes and other reactions, we defined a virality coefficient, which measures how much potential each article has to go viral and therefore how interesting it is from a marketing perspective. Given this dataset, we wanted to classify the catchiest phrases occurring in article titles and to check whether the content actually matters. We examined how these top phrases change over time and clustered them by meaning. Moreover, we automated the process of distinguishing between phrases tied to one-time events (27-1) and those occurring regularly. We also considered the impact of other headline features on the virality of the article. Additionally, we examined features derived from the article's content and formatting. A higher-level virality analysis concerns linking articles that cover the same topic, which requires adding each article's HTML code to our dataset and extracting the body text from it.
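
To make the idea of a virality coefficient concrete, here is a minimal sketch. The abstract does not give the actual formula, so everything specific here is an assumption made for illustration: the weights, the normalization by fan count, the log scaling, and the names `WEIGHTS` and `virality_score`.

```python
import math

# Hypothetical weights: the talk does not publish its formula.
# Shares are weighted highest on the assumption that they spread an article furthest.
WEIGHTS = {"shares": 3.0, "comments": 2.0, "likes": 1.0, "other_reactions": 1.0}


def virality_score(post, fan_count):
    """Log-scaled weighted engagement per 1000 fans (illustrative only)."""
    weighted = sum(weight * post.get(name, 0) for name, weight in WEIGHTS.items())
    # Normalize by audience size so large and small fan pages are comparable.
    per_thousand_fans = weighted / max(fan_count, 1) * 1000
    # log1p dampens the heavy tail of a few extremely popular posts.
    return math.log1p(per_thousand_fans)


post = {"shares": 120, "comments": 45, "likes": 900, "other_reactions": 60}
print(round(virality_score(post, fan_count=250_000), 3))
```

Normalizing by fan count is one possible design choice: without it, the score would mostly reflect the size of the portal rather than the appeal of the individual article.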

During our talk we will cover the following areas:
Data collection:

Facebook API (headline, article link, reactions)
downloading HTML code
article text extraction

Data preprocessing:

stemming
tokenization
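
The two preprocessing steps can be sketched as follows. This is a toy illustration, not the authors' pipeline: the suffix list `SUFFIXES` and the helpers `tokenize` and `stem` are hypothetical, and a real Polish stemmer or lemmatizer would handle far more inflection patterns than this naive suffix stripper.

```python
import re

# Toy suffix list, for illustration only; real Polish morphology is much richer.
SUFFIXES = ("ami", "ach", "ego", "ych", "ów", "em", "ie", "y", "a", "e", "i", "u")


def tokenize(headline):
    """Lowercase the headline and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", headline.lower())


def stem(token, min_stem=3):
    """Strip the longest matching suffix, keeping at least `min_stem` characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


tokens = tokenize("Zobacz, co się stało!")
print(tokens)
print([stem(t) for t in ["kotami", "koty", "kot"]])
```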

Analysis:

frequencies and scores of tokens, bigrams, trigrams, and starting and ending phrases
variance and entropy – automatic detection of one-off, regular and seasonal headlines/topics
cross-validation on different time intervals and using different news sources
virality score vs headline length
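
One way the n-gram counting and the entropy-based detection of one-off versus regular phrases could fit together is sketched below. The abstract names variance and entropy but not the exact procedure, so the approach here is an assumption: count each phrase per time period, then compute the entropy of that distribution, normalized to the range 0 to 1. The function names and the weekly period labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return all consecutive n-token phrases as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def phrase_counts_by_period(headlines, n=2):
    """headlines: iterable of (period, tokens). Returns {phrase: Counter({period: count})}."""
    counts = defaultdict(Counter)
    for period, tokens in headlines:
        for gram in ngrams(tokens, n):
            counts[gram][period] += 1
    return counts


def normalized_entropy(period_counts, n_periods):
    """Entropy of a phrase's distribution over periods, scaled to [0, 1].

    Near 0: the phrase is concentrated in one period (a one-off event).
    Near 1: the phrase is spread evenly over all periods (a regular phrase).
    """
    total = sum(period_counts.values())
    if total == 0 or n_periods < 2:
        return 0.0
    entropy = -sum(
        (c / total) * math.log(c / total) for c in period_counts.values() if c > 0
    )
    return entropy / math.log(n_periods)


weeks = ["w1", "w2", "w3", "w4"]
data = [(w, ["you", "will", "not", "believe"]) for w in weeks]
data.append(("w2", ["poland", "denmark", "match"]))

counts = phrase_counts_by_period(data, n=2)
for phrase in [("you", "will"), ("poland", "denmark")]:
    print(phrase, round(normalized_entropy(counts[phrase], len(weeks)), 3))
```

Seasonal phrases would sit between the two extremes and would need an additional check, for example looking for repeated peaks at fixed intervals, which this sketch does not attempt.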

Analyses:

all of the above analyses for article text and HTML code
topic analysis (LDA)

Modeling:

ensemble modeling with regression and classification algorithms to predict virality

Thursday 15:35–16:05 in Track 3

Łukasz Dziekan /CTO @ Finai, Michał Stolarczyk
