Big data is cool. Bot fraud is cool too. Using nginx, fluentd, pandas-td, Presto, IPython, NetworkX, Jupyter, and scikit-learn, we present the intricacies of fraud, yield, and trading algorithms for online advertising. It's complex, fast, and remunerative out there. See what the tech side of it looks like from the inside out.
Hundreds of billions of video ad impressions are served daily across the internet. These impressions are logged via a variety of mechanisms to a plethora of repositories. We use Python tools to study and analyze the log events; at these volumes we need methodologies that work at scale, including probabilistic tools and data structures such as Bloom filters and Count-Min Sketch. We work in Python and R, and are exploring Julia.
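To give a flavor of the scale-friendly approach, here is a minimal, pure-Python sketch of a Count-Min Sketch for approximate per-key counting over an impression stream. The width, depth, and hashing scheme are assumptions for illustration, not production parameters.

```python
import hashlib


class CountMinSketch:
    """Approximate per-key counts in fixed memory; estimates only over-count."""

    def __init__(self, width=2048, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, key):
        # Derive `depth` roughly independent hashes from a salted MD5 digest.
        for i in range(self.depth):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.width

    def add(self, key, count=1):
        for row, col in enumerate(self._hashes(key)):
            self.table[row][col] += count

    def estimate(self, key):
        # Take the minimum across rows to limit collision over-counting.
        return min(self.table[row][col] for row, col in enumerate(self._hashes(key)))


# Example: count impressions per client IP from a (hypothetical) log stream.
cms = CountMinSketch()
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:
    cms.add(ip)
print(cms.estimate("10.0.0.1"))  # ~2
```

The trade-off is classic sketching: a fixed memory footprint regardless of how many distinct keys appear, at the cost of a small, one-sided estimation error.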
Serving impressions is conducted via real-time bidding, with extensive logging of the transactions. There are dozens of players and billions of dollars in play. Several topics are of interest: categorization and detection of 'invalid traffic', lost revenue, yield enhancement, and inventory trading. Two time spheres are relevant: 'ad time', during which milliseconds are monetizable, and postmortems, when billing, reconciliations, and negotiations for abatements and adjustments due to invalid traffic take place.
Invalid traffic, estimated at roughly 25% of traffic (the figure ranges widely depending on context), consists of bot fraud, 1x1 pixel frames, ad stacking, ad injection, below-the-fold placement, bogus publisher sites, incomplete ad calls, ...
Using the Python module pandas-td (pandas_td), we can access a repository containing billions of event records, typically NGINX logs or similar. Using NetworkX and matplotlib we visualize the data, and using scikit-learn, Theano, and TILO we probe for anomalies and opportunities.
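A minimal sketch of that workflow might look like the following, assuming a hypothetical `ad_events` database and `impressions` table in Treasure Data, with an IsolationForest standing in for the anomaly probing; the query, column names, and API-key handling are illustrative assumptions.

```python
import os

import pandas_td as td
from sklearn.ensemble import IsolationForest

# Connect to Treasure Data and query via Presto (database name is hypothetical).
con = td.connect(apikey=os.environ['TD_API_KEY'],
                 endpoint='https://api.treasuredata.com')
engine = td.create_engine('presto:ad_events', con=con)

# Hypothetical aggregation: impressions and distinct user agents per client.
df = td.read_td("""
    SELECT td_client_id,
           COUNT(1) AS impressions,
           COUNT(DISTINCT user_agent) AS agents
    FROM impressions
    GROUP BY td_client_id
""", engine)

# Unsupervised anomaly scoring on the aggregated features (-1 = outlier).
clf = IsolationForest(contamination=0.01, random_state=0)
df['anomaly'] = clf.fit_predict(df[['impressions', 'agents']])
print(df[df['anomaly'] == -1].head())
```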
One of the avenues being explored is dimensionality reduction. There are many factors to consider: IP addresses of the user, advertiser, and publisher, and of the intermediate ad exchanges and trading desks along the way.
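One way such a reduction might look, sketched with scikit-learn: hash the high-cardinality categorical factors into a fixed-width sparse matrix, then project onto a few latent components with TruncatedSVD. The feature names and values below are illustrative assumptions.

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.decomposition import TruncatedSVD

# Toy event records; real logs carry many more factors per impression.
events = [
    {"user_ip": "198.51.100.7", "publisher": "pub-123", "exchange": "exch-a"},
    {"user_ip": "203.0.113.9",  "publisher": "pub-456", "exchange": "exch-b"},
    {"user_ip": "198.51.100.7", "publisher": "pub-123", "exchange": "exch-b"},
]

# Hash "key=value" strings into a fixed number of sparse columns.
hasher = FeatureHasher(n_features=1024, input_type="string")
X = hasher.transform([f"{k}={v}" for k, v in e.items()] for e in events)

# Project onto a handful of components for visualization or clustering.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (3, 2)
```

Feature hashing keeps memory bounded no matter how many distinct IPs or publishers appear, which matters at billions of events.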
The talk consists of 4 pieces: