Sometimes the greatest challenge in working with data is getting data to work with in the first place. In this talk I'll take the audience through the process of building a toolset for launching a virtual army of data collectors that can gather large volumes of useful data quickly. (No live coding or slides full of code will be presented; we're going to deal with concepts, and I'll direct the audience to a GitHub repository with examples at the end of the talk.)
Since it's the most widely available and a common source of valuable information, we'll focus primarily on gathering data from the web, although the principles could certainly be used to churn through other data sources as well.
We'll start by examining a simple web scraper and the limitations of a single, linear process. We'll then progress through the concepts of threading and concurrency, all the way through to multiprocessing. (Again, very little code, mostly graphics to help improve understanding of the concepts.)
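To make the progression concrete, here is a minimal sketch of a linear scraper next to a threaded one. The `fetch()` function and URLs are placeholders (a real scraper would do network I/O with `urllib` or `requests`), not code from the talk:

```python
# A minimal sketch: linear scraping vs. threaded scraping.
# fetch() is a stand-in for a real HTTP request; the URLs are fake.
import time
from concurrent.futures import ThreadPoolExecutor

URLS = ["http://example.com/page/%d" % n for n in range(8)]

def fetch(url):
    """Pretend to download a page; a real version would do network I/O."""
    time.sleep(0.05)          # simulate network latency
    return "<html>%s</html>" % url

def scrape_linear(urls):
    # One request at a time: total time grows linearly with len(urls).
    return [fetch(u) for u in urls]

def scrape_threaded(urls, workers=4):
    # Threads overlap the waiting, so I/O-bound work finishes sooner.
    # pool.map preserves input order in its results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

Because scraping is I/O-bound, threads help despite the GIL; multiprocessing becomes interesting once parsing or other CPU-bound work dominates.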
Once we reach this point, we'll discover together that there are limitations to this approach, even on super fast multi-core machines with tons of RAM: network bottlenecks, ISP issues, and the possibility of mounting an inadvertent denial-of-service attack, not to mention the fact that you may not be able to use the computer in question while the data harvesting is going on.
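One way to avoid the accidental denial-of-service problem is to rate-limit requests per host. This is a minimal illustration of that idea (the class name and `fetch()` body are invented for the example, not from the talk):

```python
# A minimal per-host rate limiter: enforce a minimum delay between
# requests to the same host so a scraper can't accidentally flood one
# server. The returned string is a stand-in for a real GET response.
import time
from urllib.parse import urlparse

class PoliteFetcher:
    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self.last_hit = {}        # host -> timestamp of last request

    def fetch(self, url):
        host = urlparse(url).netloc
        elapsed = time.time() - self.last_hit.get(host, 0.0)
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)   # back off politely
        self.last_hit[host] = time.time()
        return "<html>%s</html>" % url             # placeholder response
```

Requests to different hosts are not delayed, so concurrency across sites stays cheap while any single site is treated gently.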
From here we can consider the idea of using an inexpensive virtual machine running somewhere else (such as AWS) to do our bidding and harvest data while we wait. I'll show how some very simple tools like Vagrant and Fabric can be combined to make running code on a remote machine simple.
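As a flavor of how little code this takes, here is a sketch (not the speaker's actual code) of driving a remote box with Fabric 2. The host address, key path, and script names are placeholder assumptions; the VM itself could be brought up with `vagrant up` or an AWS launch:

```python
# Sketch: ship a script to a remote VM and run it there with Fabric 2.
# The host (a documentation-range IP), key path, and file names are
# placeholders for illustration only.
from fabric import Connection

def deploy_and_run(host="ubuntu@203.0.113.10", key="~/.ssh/id_rsa"):
    conn = Connection(host, connect_kwargs={"key_filename": key})
    conn.put("scraper.py", remote="scraper.py")    # ship our code up
    conn.run("pip install requests")               # install dependencies
    result = conn.run("python scraper.py", hide=True)
    return result.stdout                           # harvested output
```

The appeal is that the remote machine does the waiting: you kick off the run, close your laptop, and collect results later.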
We'll still have some limitations though. Moving everything to a remote machine solves some of our original problems, but in the end it's still one machine and even the most powerful machine is going to have limits.
I'll present ways that we can spawn a network (an Army!) of virtual machines that can all work together to complete the task at hand, and have that power available to run any Python code we desire.
Often there exists a divide between data teams, engineering, and product managers in organizations, but with the dawn of data-driven companies and applications, it is more pressing now than ever to be able to automate your analyses to personalize your users' experiences. LinkedIn's People You May Know, Netflix and Pandora's recommenders, and Amazon's eerily custom shopping experience have all shown us why it is essential to leverage data if you want to stay relevant as a company.
As data analyses turn into products, it is essential that your tech/data stack be flexible enough to run models in production, integrate with web applications, and provide users with immediate and valuable feedback. I believe Python is becoming the lingua franca of data science due to its flexibility as a performant general-purpose programming language, its rich scientific ecosystem (numpy, scipy, scikit-learn, pandas, etc.), its web frameworks and community, and its utilities and libraries for handling data at scale. In this talk I will walk through a fictional company bringing its first data product to market. Along the way I will cover Python and data science best practices for such a pipeline, some of the pitfalls of putting models into production, and how to make sure your users (and engineers) are as happy as they can be.
PyAlgoViz is an HTML5 browser application that allows Python students and practitioners to prototype an algorithm, visualize it, replay the execution, and share the end result with others. A great use would be as a tool in the Data Structures and Algorithms track of the Computer Science curriculum.
PyAlgoViz is an HTML5 browser application that allows Python students and practitioners to prototype an algorithm, visualize it, and share it with others. To visualize an algorithm, it is sent to a server that runs the code, records the execution, and sends the recording back to the client. In the browser, the recording is then replayed at the speed the user wants. Graphics primitives to draw rectangles, lines, and text, in addition to generating sounds, allow algorithm visualizations that enhance the understanding of the algorithm.
Intended usage for PyAlgoViz is in the Data Structures and Algorithms track of the Computer Science curriculum, or for personal education in the area of program algorithms. Not only will students learn how to implement algorithms in Python, they will also be able to better understand the asymptotic behavior of algorithms, or even diagnose buggy ones, by inducing patterns from the visualizations they create themselves.
Some of the biggest issues at the center of analyzing large amounts of data are query flexibility, latency, and fault tolerance. Modern technologies that build upon the success of “big data” platforms, such as Apache Hadoop, have made it possible to spread the load of data analysis to commodity machines, but these analyses can still take hours to run and do not respond well to rapidly-changing data sets.
A new generation of data processing platforms -- which we call “stream architectures” -- has converted data sources into streams of data that can be processed and analyzed in real time. This has led to the development of various distributed real-time computation frameworks (e.g. Apache Storm) and multi-consumer data integration technologies (e.g. Apache Kafka). Together, they offer a way to do predictable computation on real-time data streams.
In this talk, we will give an overview of these technologies and how they fit into the Python ecosystem. This will include a discussion of current open source interoperability options with Python, and how to combine real-time computation with batch logic written for Hadoop. We will also discuss alternatives to Kafka and Storm, current industry usage, and some real-world examples of how these technologies are being used in production by Parse.ly today.
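For a taste of the stream-processing model, here is a toy, dependency-free sketch of the canonical streaming word count: state is updated per message and a fresh result is emitted immediately, rather than waiting for a batch job. It uses a Python generator as the "stream" and is not actual Kafka or Storm API usage (a real deployment would consume from a Kafka topic and run the counting inside a Storm bolt):

```python
# Toy sketch of stream-style computation: a running word count over an
# unbounded stream of messages, updated and emitted per message.
from collections import Counter

def message_stream():
    # Stand-in for a Kafka topic: yields messages one at a time.
    for line in ["to be or not to be", "that is the question"]:
        yield line

def streaming_word_count(stream):
    counts = Counter()
    for message in stream:
        counts.update(message.split())   # update state per message...
        yield dict(counts)               # ...and emit a fresh snapshot
```

The key contrast with batch Hadoop jobs is latency: results are available after every message instead of hours later, at the cost of managing long-lived state.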