PythonFashionForecaster is an ongoing open source project that I'd like to present to the PyData community in order to initiate discussion about applications of Python in a traditionally non-data-centric industry, and hopefully to extend the use of Python and open source to the world of fashion. A quick search of Python repositories on GitHub shows a lack of true fashion apps: most involve weather forecasts or shopping tools rather than fashion styles specifically. At the other end of the spectrum, the apps that are highly relevant to fashion styles are commercial. PythonFashionForecaster is different in that its objective is to present fashion style trends as an information resource in an automatic, computational manner.
This talk would be of interest to anyone who would like to see a case study on parsing JSON data with Python, a survey of data analysis libraries that can be used to analyze social data, as well as anyone interested in fashion-related topics. I believe that, indirectly, this project will bring exposure to the Python open source community in non-traditional domains.
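The JSON-parsing case study mentioned above can be sketched with nothing but the standard library. The post structure and tag names below are invented for illustration; they are not the actual schema used by PythonFashionForecaster.

```python
import json
from collections import Counter

# Hypothetical sample of social-media posts; field names are assumptions,
# not PythonFashionForecaster's real data model.
raw = """
[{"user": "a", "tags": ["denim", "vintage"]},
 {"user": "b", "tags": ["denim", "neon"]},
 {"user": "c", "tags": ["vintage"]}]
"""

posts = json.loads(raw)

# Count how often each style tag appears across all posts.
tag_counts = Counter(tag for post in posts for tag in post["tags"])
print(tag_counts.most_common(2))  # the trending styles
```

The same two lines of parsing and counting scale naturally to real feeds pulled from a social API.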
Bitdeli is a platform for creating custom analytics in Python, conveniently in your web browser.
You can use Bitdeli to create real-time dashboards and reports, or as a quick and robust way to experiment with up to terabytes of real-time data. Bitdeli is based on vanilla Python to maximize developer-friendliness. There is no need to learn a new paradigm or stop using existing Python packages.
A typical customer of Bitdeli today is a mobile or web startup that wants to understand and leverage the behavior of their users in ways that are not supported by mainstream analytics services. To further support the long tail of custom analytics, we encourage developers to open-source and share their metrics on GitHub, which is tightly integrated with Bitdeli.
In 1967 sociologist Stanley Milgram began a series of experiments into the "small world problem" that would firmly cement the phrase "six degrees of separation" within popular culture. Because of these experiments, nearly all of us today have heard that we are simply a few handshakes away from anyone in the world. Indeed it's a popular pastime amongst academics to figure out their Erdos number and, amongst the rest of us, to calculate a favorite actor's Bacon number. Fast forward to today and the world seems even smaller. With the internet connecting all of us to one another at the speed of light, and social networks such as Twitter and Facebook creating communities that quite literally span the globe, this new era in connectedness has given us a wealth of data about how we interact with one another. There's hardly anyone in the tech community today who hasn't heard of social network analysis, but this combination of sociology, computer science, and mathematics has significance beyond just the analysis of social networks.
A relationship can be found between nearly any set of entities, and thus a network can be built from which the inner workings of those relationships can be studied. The still nascent field of network science is quickly becoming THE science of the 21st century, and this talk will introduce this budding field and demonstrate how tools such as NetworkX and Matplotlib make it possible for Pythonistas to make meaningful contributions, or simply to analyze their own popularity on Twitter.
The goal of this talk is to give the attendees a basic understanding of what network science is and what it can be used for, as well as demonstrate its use in a specific scenario. During the course of this talk we'll walk through a proper definition of a network and introduce some of the jargon necessary to converse with others working in the field. We'll also take a look at some of the statistical properties of networks and how to use them to analyze our own networks. Finally, we'll look at a specific example of the application of network science principles to a real-life social network. By the end of the talk, an attendee should feel comfortable enough with the field of network science to start analyzing their own networks of data.
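The statistical properties mentioned above can be computed in a few lines with NetworkX. The toy follower graph below is invented for illustration; real analyses would build the graph from, say, the Twitter API.

```python
import networkx as nx

# A toy undirected follower graph; the names are illustrative, not real data.
G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"), ("alice", "carol"),
    ("bob", "carol"), ("carol", "dave"),
])

# Three statistics of the kind discussed in the talk:
degrees = dict(G.degree())                       # how connected each node is
c = nx.clustering(G, "alice")                    # how tightly knit alice's neighbourhood is
sep = nx.shortest_path_length(G, "alice", "dave")  # their "degrees of separation"
print(degrees, c, sep)
```

Matplotlib enters the picture via `nx.draw(G)`, which renders the same graph for visual inspection.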
Some of today’s greatest challenges to the scientific community are “big data”, “reproducibility/transparency” and “code sharing”. The state-of-the-art Ultra-scale Visualization Climate Data Analysis Tools (UV-CDAT) environment addresses the first two issues with new visualizations and techniques to address big data and provenance. This talk addresses code re-sharing and re-distribution by introducing the UV-CDAT Re-sharable Analyses and Diagnoses (U-ReAD). U-ReAD will offer scientists a complete set of tools (framework) based on the Python programming language along with a code repository. U-ReAD’s goal is to use structured documentation to help build the interface between UV-CDAT and a diagnostic, with few or no changes to the original code. This framework will allow scientists to quickly and seamlessly re-implement their diagnostics so that they will fit perfectly into the UV-CDAT environment. As a result U-ReAD-enhanced diagnostics will be automatically provenance-enabled, making it easy to reproduce any set of results exactly and transparently, a crucial functionality considering today’s increased scrutiny toward scientific results.
This talk aims to demonstrate how easy it can be to plug any diagnostic into UV-CDAT using U-ReAD. We will show how few changes are necessary to create these plugins and how “augmented” the diagnostics are in return.
U-ReAD’s developers also hope to create a central repository of U-ReAD-enhanced tools so that scientists can easily share their tools. This talk will show what is in store along these lines. http://u-read.llnl.gov
This talk discusses generators as a mechanism for modelling data-centric problems. The techniques suggested focus on simplifying the semantics of processing code, adding flexibility by inverting control structures, and allowing performance optimisations through caching, laziness, and targeted specialisations.
* This would be a continuation of the material I presented at PyData NYC 2012. I would incorporate feedback from that presentation to cover areas of particular interest. It would also use material developed since then, including some illustrative examples of how generators could be used to model certain problems in finance (the benchmark pricing problem, the refdata problem, &c.)
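As a taste of the inversion-of-control style discussed above, here is a minimal lazy pipeline of composed generators. The tick-stream example is a made-up stand-in, not one of the actual finance problems from the talk.

```python
def read_ticks(lines):
    """Parse raw 'symbol,price' lines lazily, one record at a time."""
    for line in lines:
        symbol, price = line.split(",")
        yield symbol, float(price)

def only(symbol, ticks):
    """Filter stage: control is inverted -- the consumer drives the pipeline."""
    for sym, price in ticks:
        if sym == symbol:
            yield price

def running_mean(prices):
    """Stateful stage: a running average, computed only on demand."""
    total = n = 0
    for price in prices:
        total += price
        n += 1
        yield total / n

# Nothing is parsed, filtered, or averaged until the list() call pulls values.
raw = ["AAPL,10.0", "MSFT,20.0", "AAPL,12.0"]
means = list(running_mean(only("AAPL", read_ticks(raw))))
```

Each stage has simple, local semantics, and caching or specialised variants can be swapped in without touching the others.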
The goal of Disco has been to be a simple and usable implementation of MapReduce. To keep things simple, this MapReduce aspect has been hard-coded into Disco, both in the Erlang job scheduler, as well as in the Python library. To fix various issues in the implementation, we decided to take a cold hard look at the dataflow in Disco's version of MapReduce. We came up with a generalization that should be more flexible and hence also more useful than plain old MapReduce. We call this the Pipeline model, and we hope to use this in the next major release of Disco. This will implement the old MapReduce model in terms of a more general programmable pipeline, and also expose the pipeline to users wishing to take advantage of the optimization opportunities it offers.
If time permits, we will also discuss other aspects of the Disco roadmap, and the future of the Disco project.
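The idea of expressing MapReduce as one instance of a programmable pipeline can be illustrated in plain Python. This is a conceptual sketch only, not Disco's actual interface: a pipeline is a list of stages, each transforming a stream of (key, value) pairs.

```python
from itertools import groupby

def map_stage(records):
    """Classic word-count map: emit (word, 1) for every word."""
    for line in records:
        for word in line.split():
            yield word, 1

def reduce_stage(pairs):
    """Shuffle (sort + group by key) followed by a sum reduce."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(v for _, v in group)

def run_pipeline(stages, data):
    """Thread a data stream through an arbitrary sequence of stages."""
    for stage in stages:
        data = stage(data)
    return dict(data)

# MapReduce is just the two-stage special case of the general pipeline.
counts = run_pipeline([map_stage, reduce_stage], ["a b a", "b c"])
```

Adding a combiner or an extra aggregation pass is then simply a matter of inserting another stage into the list, which is the kind of flexibility the Pipeline model aims for.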
HDF5 is a hierarchical, binary database format that has become a de facto standard for scientific computing. While the specification may be used in a relatively simple way (persistence of static arrays) it also supports several high-level features that prove invaluable. These include chunking, ragged data, extensible data, parallel I/O, compression, complex selection, and in-core calculations. Moreover, HDF5 bindings exist for almost every language - including two Python libraries (PyTables and h5py).
This tutorial will discuss tools, strategies, and hacks for really squeezing every ounce of performance out of HDF5 in new or existing projects. It will also go over fundamental limitations in the specification and provide creative and subtle strategies for getting around them. Overall, this tutorial will show how HDF5 plays nicely with all parts of an application making the code and data both faster and smaller. With such powerful features at the developer's disposal, what is not to love?!
This tutorial is targeted at a more advanced audience with prior knowledge of Python and NumPy. Knowledge of C or C++ and basic HDF5 is recommended but not required.
This tutorial will require Python 2.7, IPython 0.12+, NumPy 1.5+, and PyTables 2.3+. ViTables and matplotlib are also recommended. These may all be found in Linux package managers. They are also available through EPD or easy_install; ViTables may need to be installed independently.
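Several of the HDF5 features named above (chunking, compression, extensible data, partial I/O) can be previewed in a few lines. The tutorial itself uses PyTables; the sketch below uses the other Python binding, h5py, and its chunk shape and compression level are illustrative defaults, not tuned values.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "demo.h5")
data = np.arange(1_000_000, dtype=np.float64)

# Chunking and compression are set at dataset-creation time.
with h5py.File(path, "w") as f:
    dset = f.create_dataset(
        "x", data=data,
        chunks=(10_000,),       # chunked storage enables partial I/O
        compression="gzip",     # transparent compression
        compression_opts=4,
        maxshape=(None,),       # extensible along the first axis
    )
    dset.resize((1_100_000,))   # grow the dataset in place

# Read a small slice without loading the whole array into memory.
with h5py.File(path, "r") as f:
    tail = f["x"][5:10]
```

PyTables exposes the same HDF5 machinery with a different API, plus extras such as out-of-core expressions.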
Python has been an important tool for analysis and manipulation of scientific data. This has traditionally taken the form of large datasets on disk or in local databases, which are then processed by sophisticated numerical and scientific libraries (SciPy and friends). Increasingly, science is becoming a collaborative enterprise where "big data" is generated in multiple locations and analyzed by multiple research groups.
In this talk we discuss how Python data analysis can help scientists work more collaboratively by integrating Web APIs to access remote data. We will discuss the details of this approach as applied to the Materials Project (see materialsproject.org), a Department of Energy project that aims to remove the guesswork from materials design using an open database of computed properties for all known materials. Using the Python Materials Genomics (pymatgen) analysis package (see packages.python.org/pymatgen), Materials Project data can be seamlessly analyzed alongside local computed and experimental data. We will describe how we make this data available as a web API (through Django) and how we provide access to both data and analysis under a single library. The talk will go over the technology stack and demonstrate the potential power of these tools within an IPython notebook. We will finish by describing plans to extend this work to address key challenges for distributed scientific data.
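The shape of the remote-data workflow above is just "fetch JSON over HTTP, then analyze locally". The sketch below is self-contained, so it parses a canned response instead of hitting the network; the endpoint shape and field names are invented for illustration and are not the real Materials Project or pymatgen API.

```python
import json

# Against a live service one would fetch the body with, e.g.,
# urllib.request.urlopen(...).read(); here we use a canned response.
canned = '{"response": [{"formula": "Fe2O3", "band_gap": 2.2}]}'

payload = json.loads(canned)

# Reshape the remote records for local analysis, e.g. alongside
# experimental data keyed by formula.
gaps = {m["formula"]: m["band_gap"] for m in payload["response"]}
```

The point of a library like pymatgen is to wrap this fetch-and-parse step so that remote and local data arrive in the same analysis objects.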
Luigi is Spotify's recently open sourced Python framework for batch data processing including dependency resolution and monitoring. We will demonstrate how Luigi can help you get started with data processing in Hadoop MapReduce as well as on your local workstation.
Spotify has terabytes of data logged by backend services every day, for everything from debugging to reporting. The logs are basically huge semi-structured text files that can be parsed using a few lines of Python. From this data, aggregated reports need to be created, data needs to be pushed into SQL databases for internal dashboards, related artists need to be calculated using complex algorithms, and many other tasks need to be performed, many of which have to run on a daily or even hourly basis.
A lot of the initial processing steps are very similar for the many data products that are produced, and instead of re-doing a lot of work, intermediate results are stored and form dependencies for later tasks. The dependency graph forms a data pipeline.
Luigi was created for managing task dependencies, monitoring the progress of the data pipeline and providing frameworks for common batch processing tasks.
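The core idea of dependency-driven scheduling can be sketched in plain Python. This is a toy resolver to illustrate the concept, not Luigi's actual API (Luigi expresses the same thing with `Task` classes and a `requires()` method), and the task names are invented.

```python
# Each task lists the tasks it depends on; together they form a pipeline.
PIPELINE = {
    "parse_logs": [],
    "aggregate": ["parse_logs"],
    "load_sql": ["aggregate"],
    "report": ["aggregate"],
}

def run(task, deps, done, order):
    """Run a task after its dependencies, each exactly once."""
    if task in done:
        return
    for dep in deps[task]:
        run(dep, deps, done, order)
    done.add(task)
    order.append(task)  # a real scheduler would execute the task here

order, done = [], set()
for task in PIPELINE:
    run(task, PIPELINE, done, order)
```

Luigi adds what the toy version lacks: persistence of completed outputs, monitoring, and Hadoop-aware task types.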
I will be discussing the approaches taken by the Editor Engagement Experimentation team at the Wikimedia Foundation to discover the new site features that lead to stronger collaborative contributions from editors and readers. The focus will be on how we define, gather and analyze our metrics [2,3,4] and how these have been exposed via a RESTful API built with Flask.
I'll also discuss the experimental results of new features (article feedback, post-edit feedback) and improved ones (account creation) in the context of the analytics implementation with the "e3_analysis" [3,4] Python package. Finally, I will give an overview of the work we are carrying out on ranking the quality of reader feedback comments using the pybrain and mdp machine learning and data processing packages.
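A metrics API of the kind described above can be served by a very small Flask app. The route and metric names below are hypothetical, not the Wikimedia team's actual API.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical precomputed metrics; a real service would query a store.
METRICS = {"daily_edits": 1234, "new_editors": 56}

@app.route("/metrics/<name>")
def metric(name):
    """Expose one named metric as JSON, RESTful-style."""
    if name not in METRICS:
        return jsonify(error="unknown metric"), 404
    return jsonify(name=name, value=METRICS[name])

# Exercise the endpoint in-process, without running a server.
client = app.test_client()
resp = client.get("/metrics/daily_edits")
```

Keeping the metric definitions behind a URL means dashboards and notebooks consume the same numbers without re-implementing the queries.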
In this talk we will introduce the typical predictive modeling tasks on "not-so-big-data-but-not-quite-small-either" datasets that benefit from distributing the work over several cores or nodes in a small cluster (e.g. 20 * 8 cores).
We will talk about cross validation, grid search, ensemble learning, model averaging, numpy memory mapping, Hadoop or Disco MapReduce, MPI AllReduce and disk & memory locality.
We will also feature some quick demos using scikit-learn and IPython.parallel from the notebook on a spot-instance EC2 cluster managed by StarCluster.
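Grid search with cross validation is the archetypal embarrassingly parallel task from the list above: every (parameter, fold) pair is independent. A minimal sketch using the current scikit-learn API (here parallelised over local cores via `n_jobs`; the same search fans out to an IPython.parallel cluster):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=200, random_state=0)

# 4 parameter values x 5 CV folds = 20 independent fits,
# spread over all available cores with n_jobs=-1.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    n_jobs=-1,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

On a cluster, memory-mapping the dataset (numpy memmap) keeps the workers from each holding a private copy, which is one of the locality points the talk covers.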
The Data Science team at Vast builds data products informed by the behavior of consumers making big purchases. Our big data is billions of user interactions with millions of pieces of inventory. Recently we have adopted a data processing, analysis, and visualization environment based on remote access to IPython Notebook hosted by a powerful compute server.
Our Data Science environment is inspired by a development environment proposed by blogger Mark O'Connor. O'Connor advocates using an iPad as a thin client to connect to a more powerful server in the cloud. The combination of tablet plus server is better than a laptop for several reasons.
IPython Notebook is the keystone of our environment. It enables us to use the tablet browser as a thin client to work with our favorite Python libraries, including matplotlib for visualization, scikit-learn for predictive modeling, and pandas for processing and aggregation.
In this talk, I'll discuss configuring the Notebook server and the tablet client. I'll also show examples and results of actual analyses performed in this environment.
Exploratory analysis and predictive modeling of time series is an enormously important part of practical data analysis. From basic processing and cleaning to statistical modeling and analysis, Python has many powerful and high productivity tools for manipulating and exploring time series data using numpy, pandas, and statsmodels.
We will illustrate these topics with practical code examples.
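In the spirit of the talk, here is a taste of pandas time-series handling: resampling, rolling statistics, and gap filling on a synthetic hourly series (the data is random, for illustration only).

```python
import numpy as np
import pandas as pd

# A synthetic hourly series standing in for real measurements.
rng = np.random.default_rng(0)
idx = pd.date_range("2013-01-01", periods=100, freq="h")
ts = pd.Series(rng.standard_normal(100), index=idx)

daily_mean = ts.resample("D").mean()    # downsample to daily frequency
rolling = ts.rolling(window=24).mean()  # 24-hour moving average
filled = ts.asfreq("30min").ffill()     # upsample and forward-fill the gaps
```

Each of these is a one-liner on top of a datetime index, which is what makes pandas so productive for exploratory time-series work.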
At wise.io we are building a machine-learning platform that makes efficient and accurate learning algorithms available in an easy-to-use service. In this presentation, I will describe how the platform works and how we're using Python to make it scalable and accessible.
Machine learning is an active field of data science, where sophisticated models are "trained" on data and used to enable human-like cognition in data analysis pipelines and data-heavy applications. Data scientists need the most efficient and most accurate machine learning implementations, while developers need on-ramps that make it easy to incorporate machine learning into their applications.
Highlights of our platform include one-step data ingestion and model building, validation, hosting, integration and sharing. A domain intelligence "marketplace" enables domain-specific knowledge to be incorporated in a model with a click (or a "git push") and is scaled automatically to handle large datasets. We use Python and a range of cloud and data frameworks to make this possible, including Anaconda, PiCloud, Pandas and PyTables.
Simulation has become an indispensable research tool across scientific disciplines ranging from neuroscience to econometrics and quantitative finance. These computational simulations often involve parameters that must be optimized against data. This optimization becomes increasingly challenging as simulations grow more complex and take longer to run. Cloud services like Amazon Web Services (AWS) offer a compelling way to scale this optimization problem, providing computing resources that allow anyone to spawn a personal cluster within minutes.
With a focus on algorithmic trading models, in this talk I will show how large-scale simulations can be optimized in parallel in the cloud. Specifically, I will (i) provide a tutorial on how trading strategies of varying sophistication can be developed using Zipline -- our open-source financial backtesting system written in Python; (ii) show how StarCluster provides an easy interface to launch an Amazon EC2 cluster; (iii) show how IPython Parallel can then be used to test large parameter ranges in parallel; and (iv) give a brief demo of how Quantopian.com can greatly simplify parts of this process by offering a completely web-based solution free of charge. While this is a case study in quantitative finance, the general approach has direct application to other research domains.