Wikidata is a free and open knowledge base from the Wikimedia Foundation that anyone can edit. Unfortunately, this sometimes leads to people adding inaccurate, offensive, or otherwise damaging content to it. In this talk we will dig into an ML method to predict the likelihood that a revision is vandalism, so that damaging edits can either be reverted automatically or flagged to aid human moderators in their job.
Open knowledge bases in general, and Wikidata in particular, have become an important source of structured data for a growing array of information systems, including search engines. Like Wikipedia, Wikidata's content can be created and edited by anyone; this openness is the main source of its strength, but it also allows malicious users to vandalize it, risking the spread of misinformation through all the systems that rely on it as a source of structured facts.
To keep the knowledge base clean and useful, Wikidata relies on human moderators to detect and revert damaging revisions. However, since it receives over 6,000 human-made revisions per hour, of which lately only 0.1% to 0.2% are malicious, this task is becoming increasingly infeasible for humans alone.
In the WSDM Cup 2017 we were challenged to come up with a fast and reliable prediction system to automatically detect vandalism and/or narrow down suspicious edits for human review. In this talk, we will discuss the winning solution of the cup: how we used Python to process half a terabyte of semi-structured data (text, XML, and JSON) into useful features, train the ML model, and then use it in a production-ready system that ran much faster than real time (two months' worth of revisions in half a day) on a single-core machine with only 4 GB of RAM.
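To give a flavor of the approach, here is a minimal, hypothetical sketch of the general pattern (turning a revision's metadata into a feature vector and scoring it with a gradient-boosted classifier). The feature names, the toy data, and the choice of scikit-learn's GradientBoostingClassifier are illustrative assumptions for this example, not the actual feature set or model of the winning solution:

```python
# Hypothetical sketch: map raw revision metadata to features and score
# with a gradient-boosted classifier. Features and data are illustrative,
# not the actual winning solution's.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def extract_features(revision: dict) -> list:
    """Turn one revision's metadata into a fixed-length feature vector."""
    comment = revision.get("comment", "")
    return [
        float(revision.get("is_anonymous", False)),           # anonymous (IP) edit
        len(comment),                                          # very short comments are suspicious
        float(bool(comment) and comment.upper() == comment),   # all-caps "shouting"
        revision.get("bytes_changed", 0),                      # size/sign of the edit
    ]

# Toy training data: one dict per revision, label 1 = vandalism.
revisions = [
    {"is_anonymous": True,  "comment": "",             "bytes_changed": -500},
    {"is_anonymous": False, "comment": "fix label",    "bytes_changed": 12},
    {"is_anonymous": True,  "comment": "ASDFGH",       "bytes_changed": 300},
    {"is_anonymous": False, "comment": "add sitelink", "bytes_changed": 40},
]
labels = [1, 0, 1, 0]

X = np.array([extract_features(r) for r in revisions])
model = GradientBoostingClassifier().fit(X, labels)

# Score a new revision: the probability can drive an auto-revert
# threshold or rank edits in a human moderation queue.
new_rev = {"is_anonymous": True, "comment": "X", "bytes_changed": -900}
prob = model.predict_proba(np.array([extract_features(new_rev)]))[0, 1]
print(f"P(vandalism) = {prob:.2f}")
```

The same two-stage shape (cheap feature extraction over a stream of revisions, then a single model scoring pass) is what makes it possible to keep up with, and outrun, the live edit rate on modest hardware.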