Using visual features alone is not enough to fully exploit the content of social media videos. We propose a complete pipeline for extracting the textual data embedded in videos and fusing it with the original visual data to predict drops in viewer retention.
With all the hype surrounding deep learning, research in this area is advancing rapidly. Using images or text as input for prediction is a day-to-day task for most machine learning researchers. However, architectures that use several modalities are still far less common. We claim that combining different, often complementary, signals is a step forward in designing accurate and effective algorithms.
In this spirit, during our talk we will outline how we approached the real-life challenge of predicting retention of social media videos.
At the beginning, we will talk about the methods we used to compare corresponding parts of videos of different lengths in order to calculate relative retention drops. Then, we will discuss the challenging preprocessing step of extracting textual overlays from video frames. Finally, we will show how we approached the task of fusing the visual and textual data. To solve our problem we used a range of deep learning techniques: LSTMs, CNNs, and C-RNNs (convolutional recurrent neural networks). Models were developed in Keras and TensorFlow.
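To make the first step concrete, here is a minimal sketch of how relative retention drops could be computed once retention curves from videos of different lengths are resampled onto a common normalized time axis. The function name, the number of resampling points, and the input format are illustrative assumptions, not the exact method presented in the talk.

```python
import numpy as np

def relative_retention_drops(retention, n_points=20):
    """Resample a retention curve onto a fixed number of points over a
    normalized [0, 1] time axis, then compute the relative drop between
    consecutive points.  `retention` is assumed to be the fraction of
    viewers still watching at each second of the video."""
    retention = np.asarray(retention, dtype=float)
    # Map the original time axis onto [0, 1] so that videos of
    # different lengths become directly comparable.
    original_t = np.linspace(0.0, 1.0, num=len(retention))
    target_t = np.linspace(0.0, 1.0, num=n_points)
    resampled = np.interp(target_t, original_t, retention)
    # Relative drop between consecutive normalized segments.
    return (resampled[:-1] - resampled[1:]) / resampled[:-1]

# Example: a 90-second video whose audience decays roughly exponentially.
curve = np.exp(-np.linspace(0, 2, 90))
print(relative_retention_drops(curve, n_points=10))
```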
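For the fusion step, the sketch below shows a minimal two-branch Keras model: a small CNN over a video frame and an LSTM over the OCR-extracted overlay tokens, concatenated before a regression head that predicts the relative retention drop. All layer sizes, input shapes, and names are hypothetical placeholders rather than the architecture presented in the talk.

```python
from tensorflow.keras import layers, Model

# Hypothetical dimensions -- the real pipeline's values are not specified here.
FRAME_SHAPE = (64, 64, 3)   # a single downscaled video frame
MAX_TOKENS = 30             # length of the OCR'd textual overlay
VOCAB_SIZE = 5000

# Visual branch: a small CNN over one representative frame.
frame_in = layers.Input(shape=FRAME_SHAPE, name="frame")
x = layers.Conv2D(32, 3, activation="relu")(frame_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

# Textual branch: embedding + LSTM over the extracted overlay tokens.
text_in = layers.Input(shape=(MAX_TOKENS,), name="overlay_tokens")
t = layers.Embedding(VOCAB_SIZE, 64, mask_zero=True)(text_in)
t = layers.LSTM(64)(t)

# Late fusion: concatenate both modalities and regress the retention drop.
fused = layers.Concatenate()([x, t])
fused = layers.Dense(64, activation="relu")(fused)
drop = layers.Dense(1, activation="sigmoid", name="relative_drop")(fused)

model = Model(inputs=[frame_in, text_in], outputs=drop)
model.compile(optimizer="adam", loss="mse")
model.summary()
```

Late fusion by concatenation is only one option; the same two branches could feed attention-based or gated fusion layers instead.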
Hopefully, by the end of the talk you will be familiar with techniques for processing both video and text in order to train multimodal models.