Natural Language Processing is evolving quickly, and the Deep Learning revolution has allowed us to create systems that rival humans on some tasks. Unfortunately, most research focuses on English, and the optimistic results that receive the most attention do not transfer directly to other languages. This talk presents some of the most common challenges in processing Slavic languages.
During the talk we will discuss the most common challenges you are bound to face when working on NLP tasks in non-English languages. As an example, we will look at Part-of-Speech (POS) tagging, which is relatively simple in English but poses significant difficulty in Polish. For English the problem is often proclaimed solved, because taggers reach accuracy similar to or higher than human inter-annotator agreement, which is ca. 97%. For languages with rich morphology, such as Polish, the accuracy of around 91% delivered by taggers clearly leaves much to be desired.
We will also discuss Deep Learning approaches as applied to Slavic-language NLP and the problems faced when using vector-based representations such as word2vec and fastText. We will illustrate typical challenges and compare approaches using POS tagging and event detection as examples.
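
As a rough illustration of why rich morphology matters for vector-based representations, the sketch below extracts fastText-style character n-grams (with `<` and `>` as word-boundary markers) from several inflected forms of the Polish noun "dom" (house). A purely word-level model such as word2vec treats each form as an unrelated vocabulary entry, while subword n-grams let the forms share part of their representation. The helper names `char_ngrams` and `jaccard`, the chosen word forms, and the n-gram range of 3 to 6 are assumptions made for this example, not material from the talk.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of `word`, using '<' and '>'
    as word-boundary markers (the convention used by fastText)."""
    marked = f"<{word}>"
    return {
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    }


def jaccard(a, b):
    """Jaccard overlap between two sets of n-grams."""
    return len(a & b) / len(a | b)


# Hypothetical example data: inflected forms of the Polish noun "dom".
forms = ["dom", "domu", "domem", "domami"]

base = char_ngrams(forms[0])
for form in forms[1:]:
    grams = char_ngrams(form)
    shared = sorted(base & grams)
    # A word-level vocabulary sees four unrelated tokens here;
    # the shared n-grams are what a subword model can reuse.
    print(f"{forms[0]} vs {form}: overlap={jaccard(base, grams):.2f}, shared={shared}")
```

This only shows surface overlap between word forms. It does not train any embeddings, and it says nothing about how well such representations perform on POS tagging or event detection.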