As a penniless academic I wanted to do "big data" for science. Open source, Python, and simple patterns were the way forward. Staying on top of today's growing datasets is an arms race. Data analytics machinery (clusters, NoSQL, visualization, Hadoop, machine learning, ...) can spread a team's resources thin. Focusing on simple patterns, lightweight technologies, and a good understanding of the applications gets us most of the way for a fraction of the cost.
I will present a personal perspective on ten years of scientific data processing with Python. What are the emerging patterns in data processing? How can modern data-mining ideas be used without a big engineering team? What constraints and design trade-offs govern software projects like scikit-learn, Mayavi, or joblib? How can we make the most out of distributed hardware with simple framework-less code?
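A minimal sketch of the kind of lightweight pattern the talk refers to, assuming joblib: on-disk caching plus multi-core parallelism with no framework (the function and cache path are purely illustrative, not code from the talk):

# Lightweight caching and parallelism with joblib -- illustrative only.
from joblib import Memory, Parallel, delayed
import numpy as np

memory = Memory("./cache", verbose=0)      # memoize costly steps on disk

@memory.cache
def expensive_feature(seed):
    rng = np.random.RandomState(seed)
    data = rng.rand(100000)
    return data.mean(), data.std()

if __name__ == "__main__":
    # Fan the work out over all local cores; cached results survive reruns.
    results = Parallel(n_jobs=-1)(delayed(expensive_feature)(s) for s in range(8))
    print(results[:2])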
Derivatives analytics is one of the most compute- and data-intensive areas in the financial industry. This stems mainly from the fact that Monte Carlo simulation techniques generally have to be applied to value and risk-manage single derivatives trades as well as whole books of derivatives.
DX Analytics is a derivatives analytics library that is built entirely in Python and has a rather Pythonic API. It allows the modeling and valuation of both single- and multi-risk-factor derivatives with European and American exercise. It also allows the consistent valuation of complex portfolios of such derivatives, e.g. incorporating the correlation between single risk factors.
The talk provides some theoretical and technical background, discusses the basic architectural features of DX Analytics and illustrates its use with a number of simple and more complex examples.
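To make the Monte Carlo idea concrete, here is a plain-NumPy sketch (not DX Analytics' own API) that values a European call under geometric Brownian motion; all parameter values are illustrative:

# Plain-NumPy Monte Carlo valuation of a European call -- not DX Analytics code.
import numpy as np

S0, K, T, r, sigma = 100.0, 105.0, 1.0, 0.05, 0.2   # spot, strike, maturity, rate, vol
n_paths = 100000

rng = np.random.RandomState(42)
z = rng.standard_normal(n_paths)
ST = S0 * np.exp((r - 0.5 * sigma ** 2) * T + sigma * np.sqrt(T) * z)  # terminal prices
payoff = np.maximum(ST - K, 0.0)                     # call payoff at maturity
price = np.exp(-r * T) * payoff.mean()               # discounted MC estimator
print("Monte Carlo estimate of the call value: %.3f" % price)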
"Much of what we want to do with data involves optimization: whether it's to find a model that best fits the data, or to decide on the optimal action given some information.
We'll explore the embarrassment of riches Python offers to tackle custom optimization problems: the scipy.optimize package, Sympy for calculus and code generation, Cython for speedups and binding to external libraries."
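As a taste of the scipy.optimize part, a small self-contained example: minimizing the Rosenbrock function with BFGS, supplying the gradient analytically (the function choice is illustrative, not taken from the talk):

# Minimize the Rosenbrock function with an analytic gradient via BFGS.
import numpy as np
from scipy.optimize import minimize

def rosen(x):
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

def rosen_grad(x):
    grad = np.zeros_like(x)
    grad[:-1] = -400.0 * x[:-1] * (x[1:] - x[:-1] ** 2) - 2.0 * (1.0 - x[:-1])
    grad[1:] += 200.0 * (x[1:] - x[:-1] ** 2)
    return grad

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
result = minimize(rosen, x0, jac=rosen_grad, method="BFGS")
print(result.x)   # converges to the minimum at all ones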
NumPy and pandas are the cornerstones of data analysis in Python. They allow for efficient data access and manipulation. Yet they are not always appropriate for more heterogeneous data, when access patterns are hard to predict, or when you need to support write parallelism. This is an area where traditional database systems still shine compared to the usual data-science toolset.
The goal of this tutorial is to give you an idea of how databases can help you deal with data that is not just numerical, with minimal effort or knowledge. We will focus on PostgreSQL, an open source database that has powerful extensions for dealing with heterogeneous (aka 'schemaless') data, while being simple to use from Python.
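A hedged sketch of the kind of 'schemaless' usage the tutorial is about: storing heterogeneous records in a PostgreSQL JSON column from Python via psycopg2 (the table, connection string and fields are made up for illustration):

# Heterogeneous records in a PostgreSQL JSON column via psycopg2 -- illustrative.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=demo user=demo")    # adjust to your setup
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS measurements (
        id  serial PRIMARY KEY,
        doc json                  -- arbitrary, per-row structure
    )
""")

# Rows need not share the same keys -- the database does not care.
cur.execute("INSERT INTO measurements (doc) VALUES (%s)",
            [Json({"sensor": "A1", "temp": 21.5})])
cur.execute("INSERT INTO measurements (doc) VALUES (%s)",
            [Json({"sensor": "B7", "humidity": 0.4, "tags": ["lab", "test"]})])

# Query back, filtering on a field inside the JSON document.
cur.execute("SELECT doc FROM measurements WHERE doc ->> 'sensor' = %s", ("A1",))
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()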
This tutorial is free, but requires separate registration.
Today's financial market environment demands ever shorter times-to-insight when it comes to financial analytics tasks. For the analysis of financial time series, or for typical tasks related to derivatives analytics and trading, Python has developed into the ideal technology platform.
Not only does Python provide powerful and efficient libraries for data analytics, such as NumPy and pandas; with IPython there is also a tool and environment available that tremendously facilitates interactive, and even real-time, financial analytics.
The tutorial introduces IPython and shows, mainly on the basis of practical examples related to the VSTOXX volatility index, how Python and IPython might redefine interactive financial analytics.
Quants, traders, financial engineers, analysts, financial researchers, model validators and the like will all benefit from the tutorial and the new technologies provided by the Python ecosystem.
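A minimal pandas sketch in the spirit of the tutorial: load a time series of index levels and derive log returns and a rolling, annualised volatility (the file name 'vstoxx.csv' and the column name 'V2TX' are placeholders, not the actual Eurex data source):

# Rolling volatility of an index time series with pandas -- illustrative data source.
import numpy as np
import pandas as pd

data = pd.read_csv("vstoxx.csv", index_col=0, parse_dates=True)
data["returns"] = np.log(data["V2TX"] / data["V2TX"].shift(1))       # daily log returns
data["vola_30d"] = data["returns"].rolling(30).std() * np.sqrt(252)  # annualised volatility

print(data.tail())
data[["V2TX", "vola_30d"]].plot(subplots=True, figsize=(9, 5))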
BACKGROUND
For background information see the Python-based "VSTOXX Advanced Services" and the related backtesting applications:
http://www.eurexchange.com/vstoxx/
http://www.eurexchange.com/vstoxx/app1/
http://www.eurexchange.com/vstoxx/app2/
TECHNICAL REQUIREMENTS
To follow the tutorial, you should have installed the Anaconda Python distribution on your notebook. Download and follow the instructions here:
http://continuum.io/downloads
After installation, start IPython from the command line interface/shell as follows:
$ ipython notebook --pylab inline
IPython should then start in your default Web browser.
This talk describes Gradient Boosted Regression Trees (GBRT), a powerful statistical learning technique with applications in a variety of areas, ranging from web page ranking to environmental niche modeling. GBRT is a key ingredient of many winning solutions in data-mining competitions such as the Netflix Prize, the GE Flight Quest, and the Heritage Health Prize.
I will give a brief introduction to the GBRT model and regression trees -- focusing on intuition rather than mathematical formulas. The majority of the talk will be dedicated to an in-depth discussion of how to apply GBRT in practice using scikit-learn. We will cover important topics such as regularization, model tuning and model interpretation that should significantly improve your score on Kaggle.
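For a first impression of what this looks like in code, a compact scikit-learn example on a synthetic regression problem, with the usual regularization knobs exposed (dataset and parameter values are illustrative, not from the talk):

# Gradient boosted regression trees with scikit-learn on synthetic data.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_friedman1(n_samples=1200, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbrt = GradientBoostingRegressor(
    n_estimators=500,     # number of boosting stages
    learning_rate=0.05,   # shrinkage: smaller values need more stages
    max_depth=3,          # depth of the individual regression trees
    subsample=0.8,        # stochastic gradient boosting
    random_state=0,
)
gbrt.fit(X_train, y_train)

print("test MSE: %.3f" % mean_squared_error(y_test, gbrt.predict(X_test)))
print("most important features:", gbrt.feature_importances_.argsort()[::-1][:3])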
Would your decisions change if you knew that the audience of your website isn't 5M users, but rather 5'042'394'953? Unlikely, so why should we always calculate the exact solution at any cost? For this and many similar problems, an approximate solution takes only a fraction of the memory and runtime needed to compute the exact one.
This tutorial is a practical survey of useful probabilistic data structures and algorithmic tricks for obtaining approximate solutions, and of when we should, and should not, trade accuracy for scalability. In particular, we start with hashing and sampling; address the problems of comparing and filtering sets, and of counting the number of unique values and their occurrences; and touch on basic hashing tricks used in machine learning algorithms. Finally, we analyse some examples of their usage that show their full power: how to organise online analytics, or how to decode a DNA sequence by squeezing a large graph into a Bloom filter.
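To illustrate the trade-off, a toy Bloom filter in pure Python: constant memory and a small false-positive rate, but no false negatives (the size and number of hash functions are illustrative, not tuned):

# A toy Bloom filter -- for illustration only, not production use.
import hashlib

class BloomFilter(object):
    def __init__(self, size=10000, n_hashes=5):
        self.size = size
        self.n_hashes = n_hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # Derive several bit positions from one cryptographic hash, salted per round.
        for i in range(self.n_hashes):
            h = hashlib.sha1(("%d:%s" % (i, item)).encode("utf-8")).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for kmer in ("ACGT", "TTGA", "CCCG"):
    bf.add(kmer)
print("ACGT" in bf)   # True
print("GGGA" in bf)   # almost certainly False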
Diamond Light Source is the UK synchrotron, a national facility containing over 20 experimental stations or beamlines, many of which are capable of generating terabytes of raw data every day. In this data-rich environment, many scientists who come to the facility can be daunted by the sheer quantity and complexity of the data on offer. The scientific software group is charged with assisting with this deluge of data, and as a small team it is imperative that it provides sustainable and rapid solutions to problems. Python has proved to be well suited to this and is now used heavily at the facility, from cutting-edge research projects, through general pipelining and data management, to simple data manipulation scripts. It is used by a range of staff and facility users, from experienced software engineers and scientists to support staff and PhD students simply wanting something to help make sense of the data or the experimental set-up.
This presentation focuses on the current state of scientific data management and analysis within Diamond, and details the workhorses that are relied on, as well as what the future holds.
The pharmaceutical industry is worth £250 billion a year, and a third of the world's R&D in pharmaceuticals occurs in the UK. Python is well used in high-throughput screening and target validation, with a notable example at AstraZeneca displayed prominently on the python.org website; further along the drug development process, Python and its scientific stack offer a compelling and comprehensive toolkit for preclinical and clinical drug development.
This talk demonstrates how Python/SciPy was used to assess the cardiac liability of a drug as part of a routine preclinical screen, how Python was used to statistically analyse a Phase II clinical dataset, and how Python was used to organise and structure documentation about a new chemical entity according to regulated standards for submission to the European Medicines Agency. Lastly, the talk concludes with the current barriers to Python being used more routinely for pharmaceutical problems and how the community might address the use of Python in a heavily regulated environment.
Thanks to Python and R, data scientists and researchers have at hand highly powerful tools to program with data, simulate, and publish reproducible computational results. Educators have access to free and open environments to teach statistics and numerical subjects efficiently. Thanks to cloud computing, anyone can today work on advanced, high-capacity technological infrastructures without having to build them or comply with rigid and limiting access protocols. By combining the power of Python, R and public clouds such as Amazon EC2, it has become possible to build a new generation of collaboration-centric platforms for virtual data science and virtual education of considerable power and flexibility.
This tutorial aims to familiarise the attendees with what public clouds can do for e-Science and e-Learning, to present the challenges and opportunities raised by the use of Python and R on such infrastructures and to introduce Elastic-R, one of the first free Python/R-centric virtual data science platforms (www.elasticr.com).
Python threads cannot utilize the power of multiple CPUs. Other solutions such as multiprocessing or MPI wrappers are based on message passing, resulting in substantial overhead for certain types of tasks.
While pure Python does not support shared memory calculations, Cython combined with OpenMP can provide full access to this type of parallel data processing.
This talk gives a whirlwind tour of Cython and introduces Cython's OpenMP abilities focusing on parallel loops over NumPy arrays. Source code examples demonstrate how to use OpenMP from Python. Results for parallel algorithms with OpenMP show what speed-ups can be achieved for different data sizes compared to other parallelizing strategies.
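A minimal sketch of the pattern, written in Cython's 'pure Python' mode so that the same file runs (serially) under plain CPython and, once compiled by Cython with OpenMP enabled (e.g. -fopenmp), executes the loop across cores with the GIL released; names and sizes are illustrative, not code from the talk:

# cython: boundscheck=False, wraparound=False
# Parallel reduction over a NumPy array with cython.parallel.prange.
import numpy as np
import cython
from cython.parallel import prange

def parallel_square_sum(x: cython.double[:]) -> cython.double:
    total: cython.double = 0.0
    i: cython.Py_ssize_t
    n: cython.Py_ssize_t = x.shape[0]
    # prange maps onto an OpenMP 'parallel for'; 'total' becomes a reduction.
    for i in prange(n, nogil=True):
        total += x[i] * x[i]
    return total

if __name__ == "__main__":
    data = np.random.rand(10 ** 7)
    print(parallel_square_sum(data))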
Data and algorithms are artistic materials just as much as paint and canvas.
A talk covering my recent work with The Tate's CC dataset, David Cameron's deleted speeches and the role of the artist in the world of Big Data.