Blaze is a next-generation NumPy sponsored by Continuum Analytics. It is designed as a foundational set of abstractions on which to build out-of-core and distributed algorithms. Blaze generalizes many of the ideas found in popular PyData projects such as NumPy, Pandas, and Theano into a single data structure. Together with a powerful array-oriented virtual machine and runtime, Blaze will be capable of performing efficient linear algebra and indexing operations on top of a wide variety of data backends.
In this talk I will discuss the foundational ideas behind Blaze and motivate them with demos of real-world use cases.
Davin will talk about the large data problems at Stipple, which attempts to index and automatically tag all of the world's images. He will demonstrate ad-hoc data mining using a combination of Disco, Numpy, OpenCV, and Starcluster, and present a live demonstration of data exploration using IPython. He will also talk about their Erlang-based solution for real-time matching of photos, and time permitting, will talk about using PyOpenCL and OpenCV to power certain kinds of auto-tagging of objects in images.
Within the past decade, the amount of DNA sequencing data generated from next-generation sequencing platforms has exploded. As a result, biology has been propelled as a field in need of better scaling algorithms and data structures to efficiently analyze data. Jason Pell will present features of khmer, a software package developed in the GED Lab at Michigan State University, to efficiently filter and analyze data generated by next-generation sequencing platforms. More specifically, he will present the use of the Bloom filter and Counting Bloom filter data structures for assembly graph traversal and k-mer counting, respectively. The khmer software package is written primarily in C++ and wrapped in Python. It is released under the BSD license and is available at http://www.github.com/ged-lab/khmer.
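The counting Bloom filter idea behind khmer's k-mer counting can be sketched in a few lines. This is a toy illustration, not khmer's actual (C++) implementation: a fixed counter array, several hash functions per item, and a count that may overestimate on collisions but never underestimates. The sequence and parameters are made up.

```python
import hashlib

class CountingBloomFilter:
    """Toy counting Bloom filter: fixed counter array, several hashes per item."""
    def __init__(self, size=1000, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counts = [0] * size

    def _indexes(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.md5(("%d:%s" % (seed, item)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for i in self._indexes(item):
            self.counts[i] += 1

    def count(self, item):
        # Collisions can inflate counts, but never deflate them.
        return min(self.counts[i] for i in self._indexes(item))

def kmers(seq, k):
    """Yield every length-k substring (k-mer) of a DNA sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

cbf = CountingBloomFilter()
for kmer in kmers("ACGTACGTAC", 4):
    cbf.add(kmer)
print(cbf.count("ACGT"))  # reports at least the true count, 2
```

The memory cost is fixed up front regardless of how many distinct k-mers arrive, which is what makes this structure attractive at sequencing scale.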
Python's use in analytical settings is well-established and impressive. Most of the discussion though is confined to a few settings: web; finance; the sciences. In this talk, I'll share some of the things I have learned from bringing Python into traditional business groups, pitfalls to avoid, and how to shine if you are a Pythonista looking for a career in the rapidly growing job role of Data Scientist. Along the way I'll share examples of how large scale statistical analyses are used in retail marketing.
Why use GPUs from Python? This workshop will provide a brief introduction to GPU programming with Python, including run-time code generation and the use of high-level tools such as PyCUDA, PyOpenCL, and Loo.py.
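The run-time code generation idea at the heart of PyCUDA can be illustrated in pure Python: build source text with the parameters you want baked in, compile it at run time, and call the result. PyCUDA does this with CUDA C kernel source; the `make_saxpy` helper here is a hypothetical stand-in.

```python
def make_saxpy(a):
    """Generate and compile a saxpy-like function with `a` baked in."""
    source = ("def saxpy(x, y):\n"
              "    return [%r * xi + yi for xi, yi in zip(x, y)]" % a)
    namespace = {}
    exec(source, namespace)  # compile the generated source at run time
    return namespace["saxpy"]

saxpy2 = make_saxpy(2.0)
print(saxpy2([1.0, 2.0], [10.0, 20.0]))  # [12.0, 24.0]
```

Baking constants into generated source lets the compiler specialize the code, which matters far more on a GPU than it does in this toy.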
Introduction to business intelligence, data warehousing and online analytical processing with Cubes. Cubes is a lightweight Python framework and OLAP server that provides business point of view modeling for multidimensional data analysis.
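The core OLAP operation, rolling a measure up along a dimension, can be sketched with the standard library. The fact table here is made up, and Cubes itself exposes this through its logical model and aggregation browser rather than hand-written loops.

```python
from collections import defaultdict

# Hypothetical fact table: (region, product, amount)
facts = [
    ("EU", "book", 10), ("EU", "pen", 4),
    ("US", "book", 7),  ("US", "pen", 3), ("US", "pen", 2),
]

def aggregate(facts, dimension):
    """Roll up the 'amount' measure along one dimension."""
    totals = defaultdict(int)
    for region, product, amount in facts:
        key = region if dimension == "region" else product
        totals[key] += amount
    return dict(totals)

print(aggregate(facts, "region"))   # {'EU': 14, 'US': 12}
print(aggregate(facts, "product"))  # {'book': 17, 'pen': 9}
```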
IPython for Teaching and Collaboration: a discussion of the strengths and weaknesses of IPython for teaching statistical machine learning, as a medium for lecture notes and student collaboration. This talk will be based on the speaker's experiences as the instructor for General Assembly's course on data science.
Data focused computing involves many stages: exploration, visualization, production mode computing, collaboration, debugging, development, presentation and publication. The IPython Notebook is a web-based interactive computing environment that can carry the data scientist through all of these stages. The Notebook enables users to build documents that combine live, runnable code with text, LaTeX formulas, images and videos. These documents are version-controllable and sharable, and preserve a full record of a computation, its results and accompanying material. In this talk I will introduce the Notebook, show how to configure and run it, illustrate its main features and discuss its future.
Working with data at large scales requires parallel computing to access large amounts of RAM and CPU cycles. Users need a quick and easy way to leverage these resources without becoming an expert in parallel computing. IPython has parallel computing support that addresses this need by providing a high level parallel API that covers a wide range of usage cases with excellent performance. This API enables Python functions, along with their arguments, to be scheduled and called on parallel computing resources using a number of different scheduling algorithms. Programs written using IPython Parallel scale across multicore CPUs, clusters and supercomputers with no modification and can be run, shared and monitored in a web browser using the IPython Notebook. In this talk I will cover the basics of this API and give examples of how it can be used to parallelize your own code.
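The pattern described, submitting Python functions plus their arguments to a pool of workers via a high-level map, is analogous to the standard library's `concurrent.futures`. IPython's actual API lives in its parallel package and requires running engines, so a local thread pool stands in for the cluster here; `simulate` is a made-up workload.

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(n):
    # Stand-in for an expensive computation you want farmed out.
    return sum(i * i for i in range(n))

# IPython's load-balanced view offers a similar map() over remote
# engines; here a local thread pool plays that role.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(simulate, [10, 100, 1000]))
print(results)  # [285, 328350, 332833500]
```

The appeal of the API style is exactly this: the parallel version reads like the serial `map`, so moving from a laptop to a cluster needs no structural changes.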
Disco is a Python-based MapReduce framework that provides a refreshing alternative to the Hadoop hegemony. In this presentation, Chris will introduce Disco and the Disco Distributed File System and demonstrate how to deploy a basic Disco installation on Amazon EC2 using StarCluster. Using examples inspired by real projects, he will show how to use Disco to work with large collections of binary data and also discuss the strengths and weaknesses of using MapReduce for large data problems.
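The MapReduce model Disco implements can be shown in miniature without any framework: a map function emits key/value pairs, the pairs are grouped by key, and a reduce function folds each group. This is the classic word count, run locally; Disco distributes the same three phases across a cluster.

```python
from itertools import groupby

def map_fn(line):
    # Map phase: emit (word, 1) for every word on the line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce phase: fold all counts for one key.
    return (word, sum(counts))

lines = ["to be or not", "to be"]
# Shuffle phase: sort so equal keys become adjacent, then group.
pairs = sorted(kv for line in lines for kv in map_fn(line))
counts = dict(reduce_fn(key, [v for _, v in group])
              for key, group in groupby(pairs, key=lambda kv: kv[0]))
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```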
An important part of data-intensive scientific computing is data visualization. Matplotlib offers a full-featured data visualization package within Python, which is built to interface well with NumPy, SciPy, IPython, and related tools. In this tutorial we will introduce and explore the basic features of plotting with matplotlib: from simple plots such as line diagrams, scatter plots, and histograms, to more sophisticated features such as three-dimensional plotting and animations.
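The three basic plot types mentioned above fit in a few lines. This sketch assumes matplotlib and NumPy are installed and uses the Agg backend so it renders to a file without a display; the data is random and the filename is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering, no display needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(9, 3))
ax1.plot(x, np.sin(x))                                # line diagram
ax2.scatter(np.random.rand(50), np.random.rand(50))   # scatter plot
ax3.hist(np.random.randn(1000), bins=30)              # histogram
fig.savefig("basics.png")
```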
Have a data science problem in Python? Need to do some ML or NLP, but find the options daunting? In this whirlwind tour, we'll go over some common use-cases, and explain where to start. More importantly, you'll learn what to avoid, and what WON'T be a valuable use of your time.
The Message Passing Interface (MPI) has been called the assembly language of distributed parallel computing. It is the de facto message passing standard for effectively and portably utilizing the world's largest (and smallest) supercomputers. In this workshop, we will discuss how MPI can be utilized via several Python implementations, e.g., mpi4py and pupyMPI, as the messaging strategy between your parallel programs.
Are you interested in working with social data to map out communities and connections between friends, fans and followers? In this session I'll show ways in which we use the python networkx library along with the open source gephi visualization tool to make sense of social network data. We'll take a few examples from Twitter, look at how a hashtag spreads through the network, and then analyze the connections between users posting to the hashtag. We'll be constructing graphs, running stats on them and then visualizing the output.
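The kind of statistic involved can be sketched with only the standard library and made-up mention data; networkx stores edges like these in a Graph object and provides degree, centrality, and much more directly.

```python
from collections import Counter

# Hypothetical "user A mentioned user B under the hashtag" edges.
edges = [("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
         ("dave", "alice"), ("erin", "alice")]

degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

# The hashtag's most-connected user -- a crude centrality measure.
hub, connections = degree.most_common(1)[0]
print(hub, connections)  # alice 4
```

From a graph like this, the next steps in the talk, measuring spread and visualizing in gephi, are exports and traversals over the same edge list.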
Python's Natural Language Toolkit is one of the most widely used and actively developed natural language processing libraries in the open source community. This workshop will introduce the audience to NLTK -- what problems it aims to solve, how it differs from other natural language libraries in approach, and how it can be used for large-scale text analysis tasks. Concrete examples will be taken from Parse.ly's work on news article analysis, covering areas such as entity extraction, keyword collocations, and corpus-wide analysis.
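The collocation idea can be sketched with the standard library alone: count which word pairs co-occur. NLTK's BigramCollocationFinder goes much further, scoring pairs with proper association measures such as PMI; the sentence here is made up.

```python
from collections import Counter

words = ("new york is big and new york is busy "
         "and boston is big").split()

# Count adjacent word pairs (bigrams).
bigrams = Counter(zip(words, words[1:]))
print(bigrams.most_common(2))  # the top pairs each occur twice
```

Raw counts favor frequent words; association measures correct for that, which is why "new york" should outrank "is big" in real collocation analysis even when their counts tie.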
HDF5 is a de facto standard binary file format specification. However, what makes HDF5 great is the numerous libraries for interacting with files of this type and their extremely rich feature set. HDF5 has bindings for many languages, including C, C++, Fortran, Java, Perl and, of course, Python.
During my tutorial I'm going to explain the basics of using HDF5 through PyTables, one of the Python bindings for HDF5, and how PyTables leverages (and enhances) HDF5 capabilities so as to cope with extremely large datasets, especially in tabular format.
I'll start describing the basic capabilities that PyTables exposes out of HDF5, like creating and accessing large multidimensional datasets, both homogeneous and heterogeneous, and how they can be annotated with user-defined metadata (attributes).
Then I'll proceed to features specific to PyTables, like high-performance compressors (Blosc), automatic parametrization for optimizing performance, and how to do very fast queries (using OPSI, a query engine that allows different size/performance trade-offs in the indexes), and will finish with a glimpse of how to perform out-of-core (also called out-of-memory) computations on huge datasets in a very efficient, memory-conscious way (via the high-performance numexpr library).
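The out-of-core idea, streaming a dataset in blocks so memory use stays bounded, looks like this in miniature. PyTables and numexpr apply the same pattern over compressed HDF5 chunks; here a generator and a made-up block size stand in for the on-disk table.

```python
def chunked_sum(values, chunksize=4):
    """Aggregate a (potentially huge) iterable block by block,
    never holding more than `chunksize` items in memory."""
    total, block = 0, []
    for v in values:
        block.append(v)
        if len(block) == chunksize:
            total += sum(block)  # per-block kernel (numexpr's role)
            block = []
    return total + sum(block)    # leftover partial block

# A generator stands in for a dataset too big for RAM.
print(chunked_sum(i * 2 for i in range(1000)))  # 999000
```

The answer is identical to the in-memory computation; only the peak memory footprint changes, which is the whole point.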
At AppNexus, we've experienced explosive growth over the last three years. Our data pipeline, horizontally scaled in Hadoop and HBase, now processes more than 15 terabytes every day. This has meant the rapid scaling and iteration of our optimization tools used for big data exploration and aggregations. Unlike other more complicated programming languages, Python's versatility allows us to use it both for offline analytical tasks as well as production system development. Doing so allows us to bridge the gap between prototypes and production by relying on the same code libraries and frameworks for both, thereby tightening our innovation loop.
We'd like to share our best practices and lessons learned when iterating and scaling with Python. We'll discuss rapid prototyping and the importance of tightly integrating research with production. We'll explore specific tools including Pandas, NumPy, and IPython and how they have enabled us to quickly data-mine across disparate data sources, explore new algorithms, and rapidly bring new processes into production.
Whether it be on the road, in the shops, or at home--every day we see more computerized devices interact with a world they "see" primarily via cameras. Scikit-image is a library that implements many of the fundamental algorithms used in these machines, and aims to support reproducible research, industry application and education. It is available free of charge, released under a liberal open source license, and developed by an active community of volunteers. This talk gives an overview of the package, explores the underlying architecture, and illustrates some of the latest features using real-world data.
Machine Learning is a discipline involving algorithms designed to find patterns in and make predictions about data. It is nearly ubiquitous in our world today, and used in everything from web searches to financial forecasts to studies of the nature of the Universe. This tutorial will offer an introduction to scikit-learn, a python machine learning package, and to the central concepts of Machine Learning. We will introduce the basic categories of learning problems and how to implement them using scikit-learn. From this foundation, we will explore practical examples of machine learning using real-world data, from handwriting analysis to automated classification of astronomical images.
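The "fit, then predict" shape of a scikit-learn estimator can be shown with a minimal nearest-centroid classifier written from scratch. This is the spirit of scikit-learn's nearest-centroid estimator, not its implementation, and the two-dimensional points and labels are invented.

```python
def fit(X, y):
    """Compute the mean point (centroid) of each class."""
    centroids = {}
    for label in set(y):
        pts = [x for x, l in zip(X, y) if l == label]
        centroids[label] = [sum(col) / len(pts) for col in zip(*pts)]
    return centroids

def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))

X = [(1.0, 1.0), (1.2, 0.8), (8.0, 9.0), (9.0, 8.5)]
y = ["small", "small", "large", "large"]
model = fit(X, y)
print(predict(model, (1.1, 0.9)))  # small
print(predict(model, (8.5, 9.1)))  # large
```

In scikit-learn the same workflow is two method calls on an estimator object, and swapping algorithms means swapping the estimator class while the workflow stays fixed.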
There was a time when the go-to machine learning library was Weka, a behemoth of a Java library. Recently, Scikit-Learn has chipped away at the functionality provided by Weka, and given the Python community a comparable machine learning all-in-one library. In this talk Brian will discuss how Scikit-Learn is used to solve organic and inorganic problems at bitly. An organic decode is one in which a user makes an explicit decision to click on a link; an inorganic decode is one in which a link gets triggered without the user's explicit knowledge. An example of an inorganic link is a bitly-encoded link wrapping a small GIF embedded in a web page. Such links often accumulate inflated decode counts, which gives the naive appearance of popularity. Brian will show how Scikit-Learn is used to decide on discriminative features, build the classifier, and test the classifier.
Python is quickly becoming the glue language which holds together data science and related fields like quantitative finance. Zipline is a new, BSD-licensed quantitative trading system which allows easy backtesting of investment algorithms on historical data. The system is fundamentally event-driven and a close approximation of how live-trading systems operate. Moreover, Zipline comes "batteries included" as many common statistics like moving average and linear regression can be readily accessed from within a user-written algorithm. Input of historical data and output of performance statistics is based on Pandas DataFrames to integrate nicely into the existing Python ecosystem. Furthermore, statistics and machine-learning libraries like matplotlib, scipy, statsmodels, and sklearn integrate well to support development, analysis and visualization of state-of-the-art trading systems.
Zipline is currently used in production as the backtesting engine powering Quantopian.com -- a free, community-centered platform that allows development and real-time backtesting of trading algorithms in the web browser. Zipline will be released in time for PyData NYC'12.
The talk will be a hands-on IPython-notebook-style tutorial ranging from development of simple algorithms and their analysis to more advanced topics like portfolio and parameter optimization. While geared towards quantitative finance, the talk is a case study of how modern, general-purpose pydata tools support application-specific usage scenarios including statistical simulation, data analysis, optimization and visualization. We believe the talk to be of general interest to the diverse pydata community.
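The event-driven design described above can be sketched in miniature: each price bar is an event, the algorithm only sees data up to the current bar, and a moving-average rule decides the position. This is an illustration of the model, not Zipline's API, and the price series is invented.

```python
prices = [10, 11, 12, 11, 10, 9, 10, 12, 14, 13]  # made-up bars

def backtest(prices, window=3):
    """Event-driven loop: one bar at a time, no look-ahead."""
    cash, shares = 100.0, 0
    for t in range(window, len(prices)):
        price = prices[t]
        avg = sum(prices[t - window:t]) / window  # trailing moving average
        if price > avg and shares == 0:           # buy signal
            shares = int(cash // price)
            cash -= shares * price
        elif price < avg and shares > 0:          # sell signal
            cash += shares * price
            shares = 0
    return cash + shares * prices[-1]             # final portfolio value

print(backtest(prices))  # 108.0
```

Because the loop never peeks past bar `t`, the backtest is structurally the same program a live-trading feed would drive, which is the property the abstract emphasizes.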
Statsmodels is a Python package for conducting data exploration, estimating statistical models, and performing statistical tests. Recent work on statsmodels has been heavily focused on improving the user experience. This work includes tighter integration with pandas, a Python data analysis library, and a new package patsy. Patsy gives the user the ability to describe statistical models in a simple but powerful way through formulas. The formula syntax is very similar to the 'formula' mini-language in R and S and will feel familiar to users of those languages. This talk will introduce users to statsmodels and patsy by demonstrating how to fit a statistical model to an example dataset.
Topics covered include:
- Formulas in Patsy
- Exploratory graphical analysis
- Maximum likelihood estimation of a discrete choice model
- Testing model assumptions
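As a taste of what fitting a model means, here is ordinary least squares for a single regressor solved by hand with the closed-form normal equations. In statsmodels the same fit is one call with a patsy formula such as "y ~ x"; the data below is constructed to lie exactly on a line.

```python
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]  # exactly y = 1 + 2*x

n = len(x)
mx, my = sum(x) / n, sum(y) / n
# Slope: covariance of x and y over variance of x.
b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
     / sum((xi - mx) ** 2 for xi in x))
a = my - b * mx  # intercept passes through the means
print(a, b)  # 1.0 2.0
```

Statsmodels adds what the closed form alone does not: standard errors, confidence intervals, and the diagnostic tests listed above.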
Shapely is a Python library for performing geometric calculations. It is most commonly used to process and analyze geographic data, like geo-tagged media or shapefiles. In this talk, we'll take publicly available geo-tagged data, visualize it, and perform spatial analysis to find trends.
Topics that will be covered:
- Shapely shape objects: Points, LineStrings, LinearRings, Polygons
- Visualizing and graphing shapes
- Shape properties: coords, area, length, bounds, centroids
- Shape manipulation: extending, shrinking, simplifying
- Boolean operations on shapes: intersects, equals, overlaps, touches
- Geometric operations on shapes: distance, difference, intersections
- RTrees: indexes for shapes
- Spatial analysis of data
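One of the predicates listed above, containment, can be sketched from scratch with the classic ray-casting algorithm. Shapely's Polygon.contains (backed by the GEOS library) handles the hard cases robustly; this toy version with a made-up square only shows the idea.

```python
def contains(polygon, point):
    """Ray casting: cast a ray to the right and count edge crossings.
    An odd number of crossings means the point is inside."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(contains(square, (2, 2)))  # True
print(contains(square, (5, 2)))  # False
```

Point-in-polygon tests like this one, accelerated by an R-tree index over shape bounds, are the workhorse of the "which geo-tagged points fall in which region" analyses the talk covers.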
Since v0.8, the pandas library has greatly expanded its time series functionality. This tutorial will give an introduction to working with time series data in pandas. We'll cover how to create date ranges, convert between point (Timestamp) and interval (Period) representations, convenient indexing and time shifting, changing frequencies, resampling, filtering, and how to work with timezones. Attendees should be familiar with Python, Numpy, and pandas basics.
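Resampling in miniature: downsample hourly observations to daily means by bucketing timestamps on their date. This standard-library sketch with made-up readings is roughly what a daily resample with a mean aggregation computes in pandas, minus the frequency inference, alignment, and timezone handling.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical hourly observations: (timestamp, value)
observations = [
    (datetime(2012, 10, 26, 9), 10.0),
    (datetime(2012, 10, 26, 15), 14.0),
    (datetime(2012, 10, 27, 9), 20.0),
]

# Bucket values by calendar day, then average each bucket.
buckets = defaultdict(list)
for ts, value in observations:
    buckets[ts.date()].append(value)

daily = {day: sum(vals) / len(vals) for day, vals in buckets.items()}
print(daily)  # daily means: 12.0 for Oct 26, 20.0 for Oct 27
```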
Python has long been used as a language for crawling the web -- perhaps the most successful example being the early web crawlers built for the Google search engine. In recent times, open source libraries have improved dramatically for doing large-scale web crawling tasks. Further, the web has also matured in that many HTML pages now offer various metadata that can be extracted by well-equipped spiders, beyond the basics such as the text content or document title. This talk will cover Parse.ly's use of the open source Scrapy project and its own work on standardizing metadata extraction techniques on news stories.
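The metadata-extraction step can be sketched with only the standard library: pull <meta> name/content pairs and the <title> out of an HTML page. Scrapy wraps this kind of extraction in its selector and item-pipeline machinery; the page below is invented.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect <meta name=... content=...> pairs and the page title."""
    def __init__(self):
        super().__init__()
        self.meta, self.title, self._in_title = {}, "", False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

page = ('<html><head><title>Big News</title>'
        '<meta name="author" content="A. Reporter"></head></html>')
parser = MetaExtractor()
parser.feed(page)
print(parser.title, parser.meta)  # Big News {'author': 'A. Reporter'}
```

Standardizing which meta fields to trust across thousands of publishers is exactly the harder problem the talk addresses.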
Wikipedia’s corpus makes it ideal for doing some natural language processing (NLP) tasks. This talk will cover how to extract data out of Wikipedia for your own use using Python, MongoDB and Solr; it will also cover how to use this data to do familiar NLP tasks such as named entity recognition and suggesting related articles.
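To make the named-entity task concrete, here is a deliberately naive heuristic: treat runs of two or more capitalized words as entity candidates. Real NER uses trained models; this regex sketch over an invented sentence only illustrates the shape of the problem.

```python
import re

def candidate_entities(text):
    """Runs of two or more capitalized words -- crude entity candidates."""
    return re.findall(r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+", text)

text = ("The article from New York says Ada Lovelace "
        "wrote about the Analytical Engine.")
print(candidate_entities(text))
# ['New York', 'Ada Lovelace', 'Analytical Engine']
```

The heuristic already hints at why Wikipedia is useful for the real task: its link structure labels exactly these spans with disambiguated articles, giving free training data.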