This talk is about two things: natural language processing (NLP) and statistical dependance. We will walk through the ins and outs of sentiment analysis in Python (mostly using NLTK) and a swift introduction to the statistics of dependence. We'll put these techniques to use on a dataset of every reddit public comment, perhaps the best data source for exploring the behavior of shouty Web comments.
This talk is about two things: natural language processing (NLP) and statistical dependence. We will embark on a data science workflow using various python scientific computing tools to better understand the behavior of commenters on Reddit. To do this we'll go through an introduction to sentiment analysis in Python (mostly using NLTK) and a swift explanation of the statistics of variable dependence.
We'll couple these freshly learned methods with an excellent dataset for this domain: every public reddit comment. We'll talk a bit about handling and preprocessing data of this size and character. Then we'll compile scores for both sentiment and spelling/grammar. In the end we may just discover if angry comment are also grammatically poor comments. And the audience will walk away a few more tools in scientific computing toolbelt.