An increasing amount of information is being conveyed via images, with mobile photographs being a particularly notable case. While image metadata, such as the text linking to an image in Google image search, is a common input to machine learning tasks, the content of the image itself is used far less frequently. Whether picking your best vacation photos, recommending similar-looking images to use in a talk or searching for hidden cats, learning tasks can do better if the actual pixels are used as input.
However, the choice of features determines the quality of the result as much as the choice of machine learning algorithm does, and using the pixels directly often yields poor results. Higher-level image features, such as face detection, histograms and color statistics like hue binning, provide significantly better performance. While advantageous, these features force the developer to choose from a vast number of options to accurately capture the details of their problem domain, which is a challenging task. This talk covers classes of simple image features and how to employ them in machine learning algorithms, and focuses on providing basic domain knowledge in imaging and computer vision to developers already familiar with machine learning.
The outline is as follows: We begin with an overview of common image features and discuss potential applications for each. Common features include examples from computer vision, such as blob identification, face detection and edge statistics, as well as from image statistics, such as intensity histograms, Fourier properties and color statistics like hue binning. Next, we present how to generate the features with Python imaging libraries. Finally, we discuss approaches to converting complex image features into a series of scalars that best represent the problem domain, for use as the input vector of an ML algorithm.
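As a rough illustration of the last step, converting image statistics into scalars, the sketch below bins hue and intensity into normalized histograms using NumPy and the standard-library colorsys module. The function name hue_and_intensity_features, the bin counts and the synthetic test image are assumptions made for this example and are not taken from the talk.

```python
import colorsys

import numpy as np


def hue_and_intensity_features(rgb, hue_bins=8, intensity_bins=4):
    """Turn an H x W x 3 uint8 RGB image into a flat vector of scalars."""
    pixels = rgb.reshape(-1, 3).astype(float) / 255.0
    # colorsys converts one pixel at a time: fine for a sketch, slow for real images.
    hsv = np.array([colorsys.rgb_to_hsv(r, g, b) for r, g, b in pixels])
    hue_hist, _ = np.histogram(hsv[:, 0], bins=hue_bins, range=(0.0, 1.0))
    # Intensity is taken here as the plain mean of the three channels.
    intensity = pixels.mean(axis=1)
    int_hist, _ = np.histogram(intensity, bins=intensity_bins, range=(0.0, 1.0))
    n = len(pixels)
    # Normalize counts so images of different sizes are comparable.
    return np.concatenate([hue_hist / n, int_hist / n])


# Synthetic image (assumed for the example): left half red, right half blue.
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[:, :2] = (255, 0, 0)
image[:, 2:] = (0, 0, 255)
features = hue_and_intensity_features(image)
```

The resulting vector has one entry per bin and could be fed to a learning algorithm alongside other features; the choice of bin counts is a tuning decision that depends on the problem domain.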
Probabilistic Programming allows flexible specification of statistical models to gain insight from data. Estimation of the best-fitting parameter values, as well as the uncertainty in these estimates, can be automated by sampling algorithms like Markov chain Monte Carlo (MCMC). The high interpretability and flexibility of this approach have led to a huge paradigm shift in scientific fields ranging from Cognitive Science to Data Science and Quantitative Finance.
PyMC3 is a new Python module that features next-generation sampling algorithms and an intuitive model specification syntax. The whole code base is written in pure Python and just-in-time compiled via Theano for speed.
In this talk I will provide an intuitive introduction to Bayesian statistics and how probabilistic models can be specified and estimated using PyMC3.
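To make the sampling idea concrete, here is a minimal random-walk Metropolis sampler written in plain NumPy. It does not use PyMC3 and is far simpler than the samplers that library provides. The model (normal data with unit variance and a wide normal prior on the mean), the simulated data and all names are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated observations (assumed for the example, not real data).
data = rng.normal(loc=2.0, scale=1.0, size=200)


def log_posterior(mu):
    # Normal(0, 10) prior on mu plus Normal(mu, 1) likelihood, up to a constant.
    log_prior = -0.5 * (mu / 10.0) ** 2
    log_like = -0.5 * np.sum((data - mu) ** 2)
    return log_prior + log_like


def metropolis(n_samples, start=0.0, step=0.3):
    samples = np.empty(n_samples)
    current, current_lp = start, log_posterior(start)
    for i in range(n_samples):
        proposal = current + rng.normal(scale=step)
        proposal_lp = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio).
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples[i] = current
    return samples


trace = metropolis(5000)[1000:]  # discard the first 1000 draws as burn-in
posterior_mean = trace.mean()    # point estimate of mu
posterior_sd = trace.std()       # uncertainty in that estimate
```

The spread of the retained samples is what gives the uncertainty estimate mentioned above: the sampler returns a whole distribution over the parameter, not a single best-fit value.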
Erik will be talking about Luigi, a recently open-sourced Python framework that helps you build complex pipelines of batch jobs, handle dependency resolution, and create visualizations to help manage multiple workflows.
Luigi provides an infrastructure that powers several Spotify features including recommendations, top lists, A/B test analysis, external reports, internal dashboards, and many more. It also comes with Hadoop support built in (and that’s where its strength really becomes clear). Spotify uses it to run more than 10,000 Hadoop jobs every day, as well as other tasks such as training machine learning algorithms, sending out reports and loading data into databases.
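The dependency-resolution idea can be sketched without any framework. The toy below is not Luigi's API; the task names and the run_pipeline helper are hypothetical, and it only shows the general pattern of running each batch job after the jobs it requires. It assumes the dependency graph has no cycles.

```python
def run_pipeline(target, requires, done=None, order=None):
    """Depth-first resolution: run every prerequisite of `target` before it."""
    done = set() if done is None else done
    order = [] if order is None else order
    if target in done:
        return order  # already ran as a dependency of another task
    for dep in requires.get(target, []):
        run_pipeline(dep, requires, done, order)
    order.append(target)  # a real batch job would do its work here
    done.add(target)
    return order


# Hypothetical tasks, each mapped to the tasks it depends on.
requires = {
    "top_lists": ["aggregate_plays"],
    "recommendations": ["aggregate_plays", "user_profiles"],
    "aggregate_plays": ["raw_logs"],
    "user_profiles": ["raw_logs"],
    "daily_report": ["top_lists", "recommendations"],
}
execution_order = run_pipeline("daily_report", requires)
```

Shared prerequisites such as raw_logs run only once even though several tasks require them, which is the property that makes this kind of resolution useful for large pipelines.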
PlayHaven is working to model and predict user behavior in games on mobile devices. We are integrated into 5,000 games, see around 130 million unique users monthly, and record events from around 2.5 billion game sessions. Given the right data science tools, we will be able to intelligently redirect that traffic among our clients' games to dramatically improve the game player's experience and, in turn, our clients' businesses.
The project chosen as PlayHaven's first attempt at predicting user behavior is churn prediction. In our context, churn occurs when a user who was previously active within a game decides to leave that game. Successfully predicting churn gives us the chance to find out why a user has lost interest and to address it and, if that fails, to provide them with content more in line with their interests.
The method used is called Reconstructability Analysis (RA) - a graphical model framework that overlaps heavily with Loglinear Models and Bayesian Networks. This is a conception-to-deliverable overview. We will start with a top-level view of PlayHaven and what that means in terms of the data we collect and what we want from data science. We then discuss what data we pull and the resources needed to pull and pre-process it. With a data set built, we will discuss model construction and predictor performance in terms of predictive accuracy and the computational resources needed. We will conclude with a top-level discussion of a few special applications of RA.
As the popularity of machine learning techniques spreads to new areas of industry and science, the number of potential machine learning users is growing rapidly. While the fantastic scikit-learn library is widely used in the Python community for tackling such tasks, there are two significant hurdles in place for people working on new machine learning problems:
1. Scikit-learn requires writing a fair amount of boilerplate code to run even simple experiments.
2. Obtaining good performance typically requires tuning various model parameters, which can be particularly challenging for beginners.
SciKit-Learn Laboratory (SKLL) is an open source Python package, originally developed by the NLP & Speech group at the Educational Testing Service (ETS), that addresses these issues by providing the ability to run scikit-learn experiments with tuned models without writing any code beyond what generates the features. This talk will provide an overview of performing common machine learning tasks with SKLL.
This is an introduction to data analysis in the context of chip design.
Chip design complexity has grown exponentially in the last 5 years. The primary driver has been demand for higher performance and lower power. While these requirements are not new, what is relatively new is increased complexity in design styles and manufacturing processes: multiple process corners, multiple on-chip voltages, voltage and frequency scaling, many operating modes, etc. This has led to a huge increase in the amount of data generated in the process of designing current-generation chips, and has made that data harder to analyze with the traditional methods of the chip design space, such as reviewing waveforms and reports and ad-hoc scripting. All this makes the area attractive for rich data analysis.
This talk will be an introduction to pandas and matplotlib with chip design data as the background.
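As a small taste of what such an analysis might look like, the sketch below uses pandas to summarize timing slack across process corners. The column names and every value in the table are made up for illustration; they do not come from the talk or from any real design.

```python
import pandas as pd

# Made-up timing report rows: one per (path, process corner).
timing = pd.DataFrame({
    "path": ["p1", "p1", "p2", "p2", "p3", "p3"],
    "corner": ["slow", "fast", "slow", "fast", "slow", "fast"],
    "slack_ns": [-0.12, 0.30, 0.05, 0.41, -0.02, 0.22],
})

# Worst (minimum) slack seen in each corner.
worst_slack = timing.groupby("corner")["slack_ns"].min()

# Paths with negative slack in at least one corner.
violating = sorted(timing.loc[timing["slack_ns"] < 0, "path"].unique())
```

With matplotlib installed, worst_slack.plot(kind="bar") would draw the per-corner summary as a bar chart.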
Enaml is an open source library for building rich user interfaces utilizing a declarative extension to the Python language grammar. Notable features of the framework include automatic dependency analysis at the bytecode level, a constraints-based layout system, support for multiple rendering engines, and the ability to interface with a variety of model notification systems. The Enaml DSL is a strict superset of Python which allows the developer to declaratively define dynamic, extensible, and reactive trees; such trees are particularly well suited for the definition of user interface hierarchies.
Enaml has been used for building production applications at multiple Fortune 500 companies and serves as the foundational UI framework at one of the world's leading investment banks. This talk by the author of Enaml will provide an introduction to the language and demonstrate some of the notable features of the framework.
During the fall of 2012, a heated technical discussion regarding asynchronous programming occurred on python-ideas. One of the outcomes of this discussion was Tulip, an asynchronous programming API for Python 3.3, spearheaded by Guido van Rossum. A lesser known outcome was PyParallel: a set of modifications to the CPython interpreter that allows Python code to execute concurrently across multiple cores.
Twisted, Tulip, Gevent, Stackless/greenlets and even node.js are all variations on the same pattern for achieving "asynchronous I/O": non-blocking I/O performed on a single thread. Each framework provides extensive mechanisms for encapsulating computational work via deferreds, coroutines, generators and yield from clauses that can be executed in the future when a file descriptor is ready for reading or writing.
What I found troubling with all these solutions is that so much effort was being invested to encapsulate future computation (to be executed when a file descriptor is ready for reading or writing), without consideration of the fact that execution is still limited to a single core.
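The single-threaded pattern described above can be seen with asyncio, the standard-library module that Tulip became. In this sketch, asyncio.sleep stands in for waiting on a file descriptor, and the coroutine names and delays are made up for the example. It illustrates the conventional approach only, not PyParallel.

```python
import asyncio
import threading


async def fake_request(name, delay, log):
    # asyncio.sleep stands in for waiting on a socket to become ready.
    await asyncio.sleep(delay)
    log.append((name, threading.get_ident()))


async def main():
    log = []
    # Both coroutines are in flight at once, interleaved by the event loop.
    await asyncio.gather(
        fake_request("slow", 0.05, log),
        fake_request("fast", 0.01, log),
    )
    return log


log = asyncio.run(main())
completion_order = [name for name, _ in log]
thread_ids = {tid for _, tid in log}
```

The two requests overlap in time, yet every callback records the same thread identifier: the waiting is concurrent, but the computation is confined to one thread and therefore one core.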
PyParallel approaches the problem in a fundamentally different way. Developers will still write code in such a way that they're encapsulating future computation via the provided APIs, however, thanks to some novel CPython interpreter modifications, such code can be run concurrently across all available cores.
This talk will cover the history behind PyParallel, the numerous novel techniques invented to achieve concurrent interpreter execution, and real-life examples of PyParallel in action (multi-threaded HTTP servers, parallel task decomposition). It will detail the approach PyParallel takes towards facilitating asynchronous I/O compared to competing libraries like Twisted and Tulip. It will also provide insight into the future direction of PyParallel, including things like LLVM integration via Numba/Blaze to further improve computational performance.
Link to the deck being presented: https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploited-all-cores-1
Datadog processes and reports on tens of billions of events a day and is used to manage infrastructures ranging from a couple of servers to many thousands of instances.
This talk will explain why we chose Python as the core ecosystem to build it on, how we have scaled the system to date, and what we learned in the process.
Expect to hear about numerical processing with numpy or pandas, memory management, concurrency, profiling and performance tuning, as well as examples of optimization using C or Cython.
The Python data ecosystem has grown beyond the confines of single machines to embrace scalability. Here we describe one of our approaches to scaling, which is already being used in production systems.
The goal of in-database analytics is to bring the calculations to the data, reducing transport costs and I/O bottlenecks. The Greenplum Database is now part of the Pivotal Platform and provides very fast analytics capabilities through a shared-nothing architecture and SQL interface (based on Postgres). In addition to running parallel queries across terabytes of data using pure SQL, it can also run procedural languages such as PL/Python. MADlib, Pivotal’s open source library for scalable in-database machine learning, uses Python to glue SQL queries to low-level C++ functions and is also usable through the PyMADlib package. We will also show how we have used many of the standard tools in the Python data analysis toolkit in this framework, including Pandas, scikit-learn, nltk and of course NumPy and SciPy.
In particular, combining these tools has allowed us to perform sentiment analysis on large datasets, and we will discuss the strategies we have used and the issues we have come across along the way.
SciDB-Py connects two of the most powerful open source tools whose shared mission is to change the way quants, data scientists, scientists and analysts work with big data: SciDB and Python.
The need to safely store and quickly manipulate vast datasets has made databases increasingly important to data scientists. Unfortunately, the array-based datasets central to many applications do not fit neatly into the classic relational or key-value paradigms of popular database architectures. Enter SciDB, a new database platform built around large multi-dimensional arrays. It provides both efficient distributed storage and fast array-oriented computations, from simple slices and joins to more involved parallel linear algebraic operations. This talk will introduce SciDB-Py, a Python package which wraps SciDB with a familiar, numpy-like syntax. With SciDB-Py, data scientists can effortlessly store and manipulate extremely large array-based data from the comfort of the Python interpreter.
We'll demonstrate working with timeseries data in SciDB and present basic examples that illustrate SciDB's native analytics capabilities including aggregation and data decimation, regression and generalized linear models, covariance matrices, singular value decomposition, and extending SciDB with custom operations. The examples apply to a broad range of applications including quantitative finance, econometrics, and risk and credit analysis.
The examples demonstrate how SciDB-Py’s scale-out MPP architecture enables interactive exploratory analytics on large-scale data.
This panel is a unique take on the question of Python Scalability.
The first topic is the challenge of scaling Python across computational clusters and hardware like GPUs. How do we take a language that prides itself on readability and ease-of-use, and apply it to cutting edge challenges like cluster computing, multi-core processors, and GPUs? Are there fundamental limitations - do we have to move to a different language for these things?
The second topic is the challenge of scaling the *usage* of Python across a team or an organization. Oftentimes, Python is chosen by the early prototypers in a team, and as they successfully deploy their code into production, more and more people - who do not necessarily possess the mindset of the early evangelists - start needing to use or adopt the language, without necessarily having "organically" grown an appreciation for its idioms. Sometimes the code they produce more closely resembles Java or C++, just spelled in Python. How can experienced Python programmers help grow the effective usage of the language across the organization?
These are two very different questions, but they are both rooted in a common theme: Python's ease of use and its power as a scripting language are actually a double-edged sword. Since people are used to things being much easier and more pleasant to code in Python, when they run into traditionally challenging problems like distributed computing or software management, there is a tendency to just assume that Python is "really bad" for that use case. But Python can be successfully scaled across a cluster and across an organization. In this panel discussion, we'll talk about lessons learned in both of these situations.