Predictobot is a website that makes predictive modeling possible for people without experience in machine learning or programming. You upload a spreadsheet of data, and specify which column you would like to predict, using the data in the other columns. Predictobot automatically analyzes the data and builds a predictive model. The website allows you to see which columns had the biggest effect on the predictions, and how well your model should perform in the future. You get an Excel spreadsheet with formulas that make predictions on future data.
Predictobot is written entirely in Python. I'll talk a little about the architecture, which uses Django and PiCloud to create a scalable service.
It uses a new, proprietary machine learning algorithm. The technique produces a sparse set of interpretable rules specifying how attributes, or pairs of attributes, affect the model, and it performs variable selection while building the model. It combines ideas from Naive Bayes and boosting to give something better than either alone.
This will be the first public demonstration of Predictobot. Attendees of the conference will have early access to Predictobot during our beta test period.
Computational biology increasingly depends on integrating many kinds of measurements (such as mutation status, gene expression, and phenotype) from ever-growing datasets. Real-world analysis is ad hoc, experimental, iterative, and collaborative, and both data and statistical approaches constantly evolve. Ensuring that computational experiments are reproducible, understandable, and efficient remains a major challenge.
Drawing examples from current research into mechanisms of cancer and neural differentiation, I'll dissect some of these challenges and describe Grizzly, a set of abstractions and tools under development to help computational scientists design, run, and interactively explore complex analytical and statistical workflows over structured, multidimensional data. Grizzly is implemented in Python and builds on top of pandas, NumPy, statsmodels, and IPython.
People and businesses want to make decisions based on large amounts of quantifiable data. If what you actually have is text in natural language, how can you quantify it and make decisions with it? How do you compare it and put error bars on it?
Fortunately, there are now some freely available data sources and libraries that make these questions easier to answer, including Google Ngrams, WordNet, ConceptNet, and NLTK. I'll present code examples, mostly in Python, that use these resources to create language models from text. These models can be built into useful tools such as search engines, recommender systems, and classifiers.
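For instance, a tiny unigram model can be built with NLTK and used to score how typical a piece of text is relative to a reference corpus. The sketch below is illustrative only; the corpus choice, tokenization, and smoothing are assumptions, not material from the talk.

    import math
    from collections import Counter

    import nltk
    from nltk.corpus import brown

    # One-time corpus download (quiet if already present).
    nltk.download('brown', quiet=True)

    # Unigram counts over a reference corpus, with add-one (Laplace) smoothing.
    counts = Counter(w.lower() for w in brown.words())
    total = sum(counts.values())
    vocab = len(counts)

    def log_prob(word):
        """Smoothed log-probability of a single word under the unigram model."""
        return math.log((counts[word.lower()] + 1) / (total + vocab))

    def score(text):
        """Average per-word log-probability; higher means more 'typical' text."""
        tokens = text.lower().split()
        return sum(log_prob(t) for t in tokens) / max(len(tokens), 1)

    print(score("the cat sat on the mat"))
    print(score("colorless green ideas sleep furiously"))

Scores like these give a quantitative handle (and a basis for error bars via resampling) on otherwise unstructured text.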
Current online searches for real estate produce a dump of listings for a city, but the consumer is left with the burden of poring through them to assess personal fit. To produce a more useful search ranking, our spatially aware search engine uses personal local-environment preferences to map personal fit over all the listings. By aggregating (using Selenium and BeautifulSoup) and mining (using NumPy with MongoDB) social data and applying spatial statistical models (using NumPy, StatsModels, and Pandas), it allows for more useful and intuitive searches.
As people search by describing their desired local environments, the machine learning algorithm builds hedonic maps (spatial models of real estate demand) that enable targeted advertising and lead qualification for real estate professionals. It also compares these demand maps with real estate pricing maps to identify price correction opportunities (using distributed queue processing in Celery).
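To make the modeling step concrete, the sketch below fits a hedonic-style regression with statsmodels on synthetic listings. The column names and data are hypothetical stand-ins for the scraped listing and social data described above, not the production model.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 500
    listings = pd.DataFrame({
        "sqft": rng.normal(1500, 400, n),
        "dist_to_park": rng.exponential(1.0, n),     # miles
        "dist_to_transit": rng.exponential(0.5, n),  # miles
    })
    # Synthetic prices: proximity to parks and transit is worth more.
    listings["price"] = (
        200 * listings["sqft"]
        - 30000 * listings["dist_to_park"]
        - 50000 * listings["dist_to_transit"]
        + rng.normal(0, 25000, n)
    )

    # Hedonic regression: how much does each local-environment attribute
    # contribute to price? Fitted over space, this becomes a demand map.
    X = sm.add_constant(listings[["sqft", "dist_to_park", "dist_to_transit"]])
    model = sm.OLS(listings["price"], X).fit()
    print(model.params)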
Building predictive models in Python is fun. When I first started with Scientific Python, I was blown away by the examples in the scikit-learn documentation (http://scikit-learn.org/stable/auto_examples/index.html). Building a classifier or regressor is fairly easy with scikit-learn and pandas, but after I finish building a model I often find myself saying "Now what?".
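The "fairly easy" part might look like this with pandas and scikit-learn; the sketch below uses a bundled toy dataset and is illustrative, not code from the talk.

    import pandas as pd
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    iris = load_iris()
    X = pd.DataFrame(iris.data, columns=iris.feature_names)
    y = pd.Series(iris.target)

    # Hold out a test set, fit a classifier, and check accuracy.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
    # ...and now what?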
For many people, their "Now what?" moment means deploying their model to a production setting for real-time decision making. We'd like to show how Yhat makes deploying predictive models written in Python (or R) fast and easy, and how data scientists can incorporate Yhat into their workflow just by adding a couple of lines of code.
At the end of the presentation, you will know how to build a predictive model in Python and deploy it to a production setting for real-time decision making.
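A rough sketch of the kind of workflow we have in mind is below. The client class and method names are illustrative assumptions rather than a verbatim API reference; the training code is the toy example from above.

    import pandas as pd
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from yhat import Yhat, YhatModel, preprocess  # assumed client interface

    iris = load_iris()
    clf = RandomForestClassifier(random_state=0).fit(iris.data, iris.target)

    class IrisClassifier(YhatModel):
        @preprocess(in_type=dict, out_type=dict)
        def execute(self, data):
            # `data` arrives as one observation: {feature name: value}
            row = pd.DataFrame([data])[iris.feature_names]
            return {"species": int(clf.predict(row)[0])}

    # The "couple of lines of code": register credentials and deploy.
    yh = Yhat("username", "apikey", "https://cloud.yhathq.com/")
    yh.deploy("IrisClassifier", IrisClassifier, globals())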
SciDB-Py connects two of the most powerful open source tools whose shared mission is to change the way scientists, engineers, and analysts work with big data: SciDB and Python.
SciDB is an innovative, next-gen open source database designed for massively scalable in-database complex analytics. It natively organizes data in n-dimensional arrays, an optimal representation for many of the new types of data being generated and mashed up today: location data, sensor data, genomics data, population data, telematics data, financial time series data, and image data. SciDB supports both embarrassingly parallel and not-embarrassingly-parallel processing, distributed storage, fully ACID transactions, efficient sparse array storage, and native, scalable complex math operations like generalized linear models and principal component analysis.
SciDB-Py lets Python developers work in a familiar IDE like Wakari, using Blaze arrays that seamlessly reference large-scale arrays managed by SciDB. SciDB brings persistent data storage, MPP parallel processing, and scale-out linear algebra to Python.
We illustrate SciDB-Py with two examples. The first performs a truncated singular value decomposition of a very large, sparse matrix, an operation widely used as the "guts" of the recommendation engines behind many large web properties. The goal is to cluster customers with similar behavior, but the technique has very general utility.
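As a small-scale stand-in for what SciDB-Py runs inside SciDB, the sketch below performs the same kind of truncated SVD on a sparse matrix with SciPy; the matrix is synthetic and the code is illustrative rather than the SciDB-Py API.

    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import svds

    # A hypothetical customer-by-item interaction matrix, 99% sparse.
    ratings = sparse_random(10000, 2000, density=0.01, format="csr", random_state=0)

    # Keep only the top 10 singular vectors: a low-rank "taste" space.
    u, s, vt = svds(ratings, k=10)

    # Rows of u (scaled by s) place customers in the latent space; customers
    # with nearby rows behave similarly and can be clustered or used to
    # drive recommendations.
    customer_factors = u * s
    print(customer_factors.shape)  # (10000, 10)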
The second example is from computational finance: we use SciDB's fast parallel aggregation capability to quickly build a custom national best bid and offer (NBBO) price book from daily NYSE TAQ ARCA tick data.
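The same best bid / best offer aggregation can be sketched in pandas on a toy quote table (the column names are illustrative, not the TAQ schema); SciDB performs the equivalent group-by in parallel at full scale.

    import pandas as pd

    quotes = pd.DataFrame({
        "symbol":   ["IBM", "IBM", "IBM", "AAPL", "AAPL"],
        "time":     pd.to_datetime(["09:30:00.10", "09:30:00.25", "09:30:00.40",
                                    "09:30:00.15", "09:30:00.30"]),
        "exchange": ["N", "P", "B", "P", "N"],
        "bid":      [185.10, 185.12, 185.11, 520.00, 520.05],
        "ask":      [185.20, 185.18, 185.19, 520.40, 520.35],
    })

    # Best bid is the highest bid and best offer the lowest ask,
    # per symbol and one-second time bucket.
    nbbo = (quotes
            .groupby(["symbol", pd.Grouper(key="time", freq="1s")])
            .agg(best_bid=("bid", "max"), best_ask=("ask", "min")))
    print(nbbo)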
The examples demonstrate that SciDB-Py's scale-out MPP architecture enables interactive exploratory analytics on large-scale data.
Enaml is an open source library for building rich user interfaces using a declarative extension to the Python language grammar. Notable features of the framework include automatic dependency analysis at the bytecode level, a constraints-based layout system, support for multiple rendering engines, and the ability to interface with a variety of model notification systems. The Enaml DSL is a strict superset of Python that allows the developer to declaratively define dynamic, extensible, and reactive trees; such trees are particularly well suited to defining user interface hierarchies.
Enaml has been used to build production applications at multiple Fortune 500 companies and serves as the foundational UI framework at one of the world's leading investment banks. This talk will provide an introduction to the Enaml language and demonstrate some of the notable features of the library.
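To give a flavor of the DSL, here is a minimal sketch in Enaml's declarative syntax. The widgets come from enaml.widgets.api; the particular layout is illustrative rather than taken from the talk.

    # hello_view.enaml -- a regular Python module extended with enamldef
    from enaml.widgets.api import Window, Container, Label, PushButton

    enamldef Main(Window):
        title = 'Hello Enaml'
        Container:
            Label:
                text = 'Hello, world'
            PushButton:
                text = 'Click me'
                # The '::' operator binds a notification handler to the event.
                clicked ::
                    print('button clicked')

    # From plain Python, the view is loaded inside an enaml.imports() context
    # and shown with a Qt application object such as
    # enaml.qt.qt_application.QtApplication.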
StarCluster is a cluster computing toolkit for the cloud, developed in Python by the Software Tools for Academics and Researchers (STAR) group at MIT. StarCluster makes it easy to create and manage parallel and distributed computing clusters on Amazon's EC2. A command line interface provides utilities for working with clusters, machines, and data volumes, and a Python plugin API allows users to customize their systems beyond the defaults. StarCluster also includes public machine images equipped out-of-the-box with frameworks including OpenMPI, OpenMP, Hadoop, (Py)CUDA, (Py)OpenCL, and IPython (parallel).
In this talk I will give an overview of StarCluster and how to get started using it with these various parallel frameworks on real clusters in the cloud. I will also cover using StarCluster's Python plugin API to further configure a cluster and automate various workflows.
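As a taste of the plugin API, the sketch below follows the documented ClusterSetup interface to install an extra package on every node; the package and config section names are illustrative.

    from starcluster.clustersetup import ClusterSetup
    from starcluster.logger import log

    class PackageInstaller(ClusterSetup):
        """Install an extra apt package on every node after the cluster boots."""

        def __init__(self, pkg_to_install='htop'):
            self.pkg_to_install = pkg_to_install

        def run(self, nodes, master, user, user_shell, volumes):
            for node in nodes:
                log.info("Installing %s on %s" % (self.pkg_to_install, node.alias))
                node.ssh.execute('apt-get -y install %s' % self.pkg_to_install)

    # Enabled from the StarCluster config file, for example:
    # [plugin packageinstaller]
    # setup_class = mypackage.PackageInstaller
    # pkg_to_install = htop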