In cyber security, analysis of command line logs is important for breach detection, yet it remains one of the most challenging problems. To simplify this process, we propose a framework called CMDLang (command line language), which has features of natural language. I will present the results of successfully training POS and NER models using popular NLP algorithms and demonstrate a real use case of CMDLang.
At F-Secure, in order to protect our customers, we use streams of command line logs coming from their systems to detect breaches and anomalies. Analysis of such data is one of the most challenging problems in the cyber security world. It requires domain knowledge and is hard to encapsulate in sets of rules.
What if we treat command line logs as semi-structured text data? They follow a set of grammar rules and have semantics. Therefore, we propose the CMDLang framework (command line language), which has features of natural language. We successfully trained a part-of-speech (POS) tagger and named entity recognition (NER) models. Using CMDLang along with NLP methods enables normalization, parsing, and categorization of logs. With a defined language framework, we are able to analyze huge streams of data faster, which improves our detection capabilities.
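To make the idea of treating a command line as tagged text more concrete, here is a minimal sketch. It is not the CMDLang implementation: the tag set (`CMD`, `FLAG`, `IP`, `PATH`, `ARG`), the helper names, and the hand-written rules are all assumptions made up for illustration, and a trained POS or NER model would replace the rules in practice.

```python
import re
import shlex

# Hypothetical tag set and rules, invented for this illustration only.
FLAG_RE = re.compile(r"--?[A-Za-z][\w-]*")
IP_RE = re.compile(r"(\d{1,3}\.){3}\d{1,3}")


def tag_token(token, position):
    """Assign a coarse tag to one token using simple hand-written rules."""
    if position == 0:
        return "CMD"   # assume the first token is the executable
    if FLAG_RE.fullmatch(token):
        return "FLAG"
    if IP_RE.fullmatch(token):
        return "IP"
    if "/" in token or "\\" in token:
        return "PATH"
    return "ARG"


def tag_command_line(line):
    """Split a command line shell-style and tag each token."""
    tokens = shlex.split(line)
    return [(tok, tag_token(tok, i)) for i, tok in enumerate(tokens)]


def normalize(line):
    """Keep commands and flags; replace other tokens with their tag."""
    return " ".join(
        tok if tag in ("CMD", "FLAG") else f"<{tag}>"
        for tok, tag in tag_command_line(line)
    )


print(tag_command_line("ping -c 4 10.0.0.5"))
print(normalize("tar -xf /tmp/archive.tar"))
```

Normalizing in this way maps command lines that differ only in their concrete arguments to the same template, which is one way log streams could be grouped and categorized.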
During the talk, I will present the results of building CMDLang with popular, open-source NLP algorithms. I will walk through the process and define the main ideas behind the language. At the end, I will demonstrate CMDLang in a real use case.
This talk will be interesting for every NLP enthusiast, as well as for people working with (semi-)structured text data or log processing. During the presentation, I will explain any cyber security terminology used.