Blaze is a NumPy/Pandas interface to big data systems like SQL, HDFS, and Spark. Blaze provides Python developers access to the rich analytic processing available both within the Python ecosystem and beyond.
Internally, Blaze is a lightweight data modeling language (expressions with type information) alongside a set of interpreters (Python, SQL, Spark, MongoDB, ...). The modeling language provides an intuitive and familiar user experience; the interpreters connect that experience to a wide variety of data technologies. This combination lets developers construct connections to novel technologies, and those connections let users interact with their system of choice, whether that system is a single CSV file or a large HDFS cluster running Impala.
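As a rough illustration of the expression/interpreter split, here is a minimal sketch (assuming a recent Blaze release where `symbol`, `by`, and `compute` are available; this is not code from the talk):

```python
from blaze import symbol, by, compute
import pandas as pd

# an abstract table expression: just a name and a datashape, no data attached
t = symbol('t', 'var * {name: string, amount: int64}')
expr = by(t.name, total=t.amount.sum())

# the same expression can be handed to different interpreters;
# here it is computed against a pandas DataFrame
df = pd.DataFrame({'name': ['a', 'b', 'a'], 'amount': [1, 2, 3]})
print(compute(expr, df))
```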
This is followed by a second talk using Blaze in the wild.
Ever wonder how Google Chrome detects the language of every webpage you visit?
What is data science and how can you use Python to do it? In this talk, I'll teach you the data science process OSEMN, while creating a language prediction algorithm, utilizing nothing but Python and data from Wikipedia!
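For a flavor of what such an algorithm can look like, here is a simple character n-gram sketch (not necessarily the approach taken in the talk):

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Count overlapping character n-grams in a piece of text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def build_profile(articles, n=2):
    """Aggregate n-gram counts over a pile of Wikipedia articles in one language."""
    profile = Counter()
    for article in articles:
        profile.update(char_ngrams(article, n))
    return profile

def predict_language(text, profiles):
    """Pick the language whose n-gram profile best matches the text."""
    grams = char_ngrams(text)
    def score(profile):
        total = float(sum(profile.values())) or 1.0
        return sum(count * profile[g] / total for g, count in grams.items())
    return max(profiles, key=lambda lang: score(profiles[lang]))

# profiles = {'en': build_profile(english_articles), 'de': build_profile(german_articles)}
# predict_language('ein kleines Beispiel', profiles)
```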
As an engineer, analyst, or scientist, sharing your work with someone outside of your immediate team can be a challenge. End users fill many roles, with a wide range of technical skill and often no familiarity with Python or the command line. Findings, key results, and models are frequently boiled down to static graphs, tables, and figures presented in short reports or slideshow presentations. However, engaging research and data analysis is interactive, anticipating the users’ questions and giving them the tools to answer those questions with a simple and intuitive user interface.
Browser-based applications are an ideal vehicle for delivering these types of interactive tools, but building a web app requires setting up backend applications to serve up content and creating a UI with languages like HTML, CSS, and JavaScript. This is a non-trivial task that can be overwhelming for anyone unfamiliar with the web stack.
Spyre is a web application framework meant to help Python developers who may have little knowledge of how web applications work, much less how to build them. Spyre takes care of setting up both the front end and back end of your web application. It uses CherryPy to handle HTTP request logic and Jinja2 to auto-generate all of the client-side nuts and bolts, allowing developers to quickly move the inputs and outputs of their Python modules into a browser-based application. Inputs, controls, outputs, and the relationships between all of these components are specified in a Python dictionary. The developer need only define this dictionary and override the methods needed to generate content (text, tables, and plots).
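A minimal app, loosely following Spyre's documented quickstart pattern (attribute and method names are from memory and may differ between versions), might look like this:

```python
from spyre import server

class SimpleApp(server.App):
    title = "Simple App"

    # inputs, controls, and outputs are all declared as plain dictionaries
    inputs = [{"type": "text",
               "key": "words",
               "label": "write here",
               "value": "hello world",
               "action_id": "simple_html_output"}]

    outputs = [{"type": "html",
                "id": "simple_html_output"}]

    # override only the content-generation methods you need
    def getHTML(self, params):
        return "Here are the words you wrote: <b>%s</b>" % params["words"]

app = SimpleApp()
app.launch()
```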
While Spyre apps are launched on CherryPy’s production-ready server, Spyre’s primary goal is to provide a development path to simple, lightweight apps without the need for a designer or front-end engineer. At Next Big Sound, for example, we recently used Spyre to build an app to visualize the effects of sampling parameter values on the volume of tweets collected from one of our data providers.
Web applications like this can turn a highly technical process into a simple tool that can be used by anyone with any level of technical skill.
After you’ve finished the foundational parts of your project -- the data collection, data cleaning, exploration, modeling, and analysis -- Spyre provides a quick and simple way to package the results into an interactive web application that can be viewed by the rest of the world.
Choosing hardware for big data analysis is difficult because of the many options and variables involved. The problem is more complicated when you need a full cluster for big data analytics.
This session will cover the basic guidelines and architectural choices involved in choosing analytics hardware for Spark and Hadoop. I will cover processor core and memory ratios, disk subsystems, and network architecture. This is a practical, advice-oriented session focused on the performance and cost tradeoffs of many different options.
Using machine learning to beat your friends in an NFL confidence pool.
Betting spreads provide a consistent and robust mechanism for encapsulating the variables and predicting outcomes of NFL games. In a weekly confidence pool, spreads also perform very well compared with intuition-based guessing and the supposed knowledge that comes from years of being a fan. We present some attempts, and the accompanying analysis, to use machine learning to improve on the spread method of ranking winners on a weekly basis.
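To make the comparison concrete, here is a toy sketch of both strategies (the column names and data are invented for illustration, not from the talk):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# hypothetical per-game table; in practice this would come from historical
# results joined with the published point spreads
games = pd.DataFrame({
    'spread':        [-7.0, -3.0, -1.5, -10.0, -4.5, -2.5],  # favorite's spread
    'favorite_home': [1,     0,    1,     1,     0,    1],
    'favorite_won':  [1,     0,    1,     1,     1,    0],
})

# baseline strategy: always pick the favorite, and rank picks by the size of
# the spread (bigger spread = more confidence points)
games['confidence'] = games['spread'].abs().rank(method='first')

# ML attempt: model the probability that the favorite wins, then rank by it
X, y = games[['spread', 'favorite_home']], games['favorite_won']
model = LogisticRegression().fit(X, y)
games['p_favorite_wins'] = model.predict_proba(X)[:, 1]
games['ml_confidence'] = games['p_favorite_wins'].rank(method='first')
print(games)
```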
If so, this talk is for you!
In this talk, I will cover conventional and unconventional techniques that I've used to reduce the size of my data: traditional dimensionality reduction techniques such as PCA and NMF, as well as more esoteric approaches such as Random Projection.
By the end of this talk, you will understand when and how to apply these techniques to your own data.
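For reference, all three families of techniques mentioned above are available in scikit-learn; a minimal sketch on synthetic data (not the talk's examples):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, NMF
from sklearn.random_projection import GaussianRandomProjection

# synthetic stand-in for a wide dataset
X, _ = make_classification(n_samples=1000, n_features=200, random_state=0)

X_pca = PCA(n_components=20).fit_transform(X)
X_nmf = NMF(n_components=20, random_state=0).fit_transform(np.abs(X))  # NMF needs non-negative input
X_rp = GaussianRandomProjection(n_components=20, random_state=0).fit_transform(X)

print(X_pca.shape, X_nmf.shape, X_rp.shape)
```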
Data Science is a comparatively new field, and as such it is constantly changing as new techniques, tools, and problems emerge every day. Traditionally, education has taken a top-down approach in which courses are developed on the scale of years and committees approve curricula based on what might be the most theoretically complete approach. This, however, is at odds with an evolving industry that needs data scientists faster than they can be (traditionally) trained.
If we are to sustainably push the field of Data Science forward, we must collectively figure out how best to scale this type of education. At Zipfian I have seen (and felt) firsthand what works (and what doesn't) when tools and theory are combined in a classroom environment. This talk will be a narrative about the lessons learned trying to integrate high-level theory with practical application, how leveraging the Python ecosystem (numpy, scipy, pandas, scikit-learn, etc.) has made this possible, and what happens when you treat curriculum like product (and the classroom like a team).
This talk illustrates how selective-search object recognition and the latest deep-learning object identification algorithms were applied to the problem of image cropping.
How can you identify the most important part of an image, the part that must not be cropped out when it is shown as a thumbnail? This is a problem faced by media and e-commerce sites, where the space available for an image comes in many sizes and the best portion of the original image must be preserved to maximize its effectiveness.
Selective search is a recent method proposed by Uijlings (U. Amsterdam) et al. that significantly improved the accuracy of object recognition over previous exhaustive-search methods. It allows us to use advanced methods such as convolutional neural networks (deep-learning object identification) to identify the interesting objects contained in an image. With this, the interesting parts of the image can be preserved during cropping, producing effective thumbnails.
The code for selective search is available for MATLAB only, while the deep-learning framework, Caffe, comes with a Python wrapper. We will illustrate how well suited Python is to tying together the results of cutting-edge research in order to solve complex data processing problems.
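As an illustration of the final step only (not the speakers' actual pipeline), here is a sketch of how detector output might be turned into a thumbnail crop:

```python
def best_crop(image_width, image_height, boxes, target_aspect):
    """Choose a crop window of the requested aspect ratio (width / height)
    centered on the highest-scoring detected object.

    boxes: list of (x1, y1, x2, y2, score) tuples from the object detector."""
    x1, y1, x2, y2, _ = max(boxes, key=lambda b: b[4])
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0

    # largest window of the target aspect ratio that fits inside the image
    crop_w = min(float(image_width), image_height * target_aspect)
    crop_h = crop_w / target_aspect

    # center the window on the object, then clamp it to the image bounds
    left = min(max(cx - crop_w / 2.0, 0.0), image_width - crop_w)
    top = min(max(cy - crop_h / 2.0, 0.0), image_height - crop_h)
    return int(left), int(top), int(left + crop_w), int(top + crop_h)

# e.g. crop a 640x480 image to a square thumbnail around the best detection
print(best_crop(640, 480, [(100, 120, 300, 400, 0.9), (10, 10, 50, 50, 0.4)], 1.0))
```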
Visualizations are windows into datasets: they can help generate hypotheses, aid combinatory play to discover trends, and cement insight by providing structure and context.
This talk will touch on the current state of the Python visualization ecosystem, offer some thoughts on iteratively building visualizations, then launch into a data-driven exploration of visualization techniques grounded in the NYC taxicab dataset.
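One example of such a window into the data (the file path and column names here are placeholders, not the talk's actual code):

```python
import pandas as pd
import matplotlib.pyplot as plt

# placeholder path/columns standing in for a slice of the NYC taxi trip records
trips = pd.read_csv('yellow_tripdata_sample.csv', parse_dates=['pickup_datetime'])

# a first, simple view: how trip volume varies over the day
trips['hour'] = trips['pickup_datetime'].dt.hour
trips.groupby('hour').size().plot(kind='bar')
plt.xlabel('pickup hour')
plt.ylabel('number of trips')
plt.show()
```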
(or: how to only forget some things about your ML models instead of literally everything)
You're writing a classifier. So you trained 10 decision trees in October, with several sets of training data, different maximum depths, different scalings, and different features. Some of the experiments went better than others! Now it's November, and you want to go back to the project and start using one of these models. But which one?!
At Stripe, we train models to automatically detect and block fraudulent transactions in real time. We build a lot of models, and we need a way to keep track of all kinds of information about them. I'll talk about a simple tool we built to do exactly that.
This functions as a lightweight lab notebook for ML experiments, and it has been incredibly useful for us (as mere humans). Having a consistent way to look at the results of our experiments means we can compare models on equal footing. No more notes, no more forgetting, no more hand-crafted artisanal visualizations. [1]
[1] You're still allowed to make hand-crafted artisanal visualizations if you want.
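A hypothetical, stripped-down version of the lab-notebook idea (not Stripe's actual tool) is just structured, append-only logging of each experiment:

```python
import hashlib
import json
import time

def log_experiment(params, metrics, path='experiments.jsonl'):
    """Append one record per trained model: what it was, how it did, and when."""
    record = {'timestamp': time.strftime('%Y-%m-%dT%H:%M:%S'),
              'params': params,
              'metrics': metrics}
    record['id'] = hashlib.sha1(
        json.dumps(record, sort_keys=True).encode()).hexdigest()[:8]
    with open(path, 'a') as f:
        f.write(json.dumps(record) + '\n')
    return record['id']

# e.g. after training one of October's decision trees (illustrative numbers only)
log_experiment({'model': 'decision_tree', 'max_depth': 6, 'scaling': 'standard'},
               {'validation_auc': 0.91})
```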
SQL is still the bread-and-butter of the data world, and data analysts/scientists/engineers need to have some familiarity with it as the world runs on relational databases.
When first learning pandas (and coming from a database background), I found myself wanting to be able to compare equivalent pandas and SQL statements side-by-side, knowing that it would allow me to pick up the library quickly, but most importantly, apply it to my workflow.
This tutorial will provide an introduction to both syntaxes, allowing those inexperienced with either SQL or pandas to learn a bit of both, while also bridging the gap between the two so that practitioners of one can learn the other from their own perspective. Additionally, I'll discuss the tradeoffs between the two and why one might be better suited to some tasks than the other.
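For a taste of the side-by-side format, here is one hypothetical pairing (a tiny made-up table, not the tutorial's dataset):

```python
import sqlite3
import pandas as pd

# a small, made-up table just for the comparison
tips = pd.DataFrame({'day': ['Thu', 'Thu', 'Fri', 'Fri'],
                     'total_bill': [10.0, 20.0, 15.0, 25.0],
                     'tip': [1.5, 3.0, 2.0, 5.0]})

# SQL version: average tip per day
con = sqlite3.connect(':memory:')
tips.to_sql('tips', con, index=False)
sql_result = pd.read_sql_query(
    "SELECT day, AVG(tip) AS avg_tip FROM tips GROUP BY day", con)

# pandas version of the same query
pandas_result = tips.groupby('day', as_index=False)['tip'].mean()

print(sql_result)
print(pandas_result)
```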
Attendees will learn how to use data to generate opportunities for their organizations and to rank certain types of risk on scales similar to the ones used by rating agencies, opening up profitable risk-transfer opportunities.
This presentation documents a simplified analytical approach developed by the author to illustrate the key elements in the design of a parametric catastrophe bond, a type of Insurance Linked Security (ILS). The model was developed in order to 1) help potential clients of an investment bank understand the advantages and disadvantages of insurance linked securities vs. traditional insurance, 2) help government decision makers draft policies to accommodate the product in their risk management efforts, 3) expand the potential market of ILS buyers by sharing the analytical work with rating agencies and CDO managers in a reproducible way, and 4) help investment banks win structuring mandates.
Although the original model was developed using a combination of C/C++, Visual Basic, ActiveX, and an Excel front end, this presentation will show a modern, Python-based approach.
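To give a sense of the kind of calculation involved, here is a purely illustrative sketch of a parametric trigger under an assumed hazard model (the distribution and numbers are invented, not the author's model):

```python
import numpy as np

# hypothetical parametric trigger: the bond's principal erodes when a modeled
# hazard index (e.g. peak wind speed at a reference site) exceeds a threshold
np.random.seed(0)
n_years = 100_000
hazard_index = np.random.gumbel(loc=40.0, scale=8.0, size=n_years)  # assumed hazard model

attachment = 70.0   # index level at which principal starts to erode
exhaustion = 90.0   # index level at which principal is fully lost
loss_fraction = np.clip((hazard_index - attachment) / (exhaustion - attachment), 0.0, 1.0)

print('annual probability of attachment:', (loss_fraction > 0).mean())
print('expected loss (fraction of principal):', loss_fraction.mean())
```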