PyData London 2015 | Presentation: Python and scikit-learn based open research SDK for collaborative data management and exchange

Saturday 11:40 a.m.–12:20 p.m.

Python and scikit-learn based open research SDK for collaborative data management and exchange

Grigori Fursin, Anton Lokhmotov

Audience level: Novice

Description

We would like to share our experience with a Python-based Collective Knowledge SDK for collaborative and reproducible experimentation. It helps organize and share experimental setups (code, data and meta information) as unified and reusable components with a JSON API via GitHub. It also helps unify, automate and crowdsource analysis and exploration of multi-dimensional optimization spaces using scikit-learn.

Abstract

Faster and more power-efficient computer systems are vital for continued innovation in science and technology. However, designing such systems has become intolerably complex, ad hoc, costly and error-prone due to the enormous number of available design and optimization choices, and the complex interactions between all software and hardware components.

Originally, our automatic and machine-learning based exploration and autotuning techniques showed high potential to address the above problems [1]. However, we very quickly faced many other problems, such as dealing with ever-changing tools and their interfaces, lack of a common experimental methodology, lack of computational power for machine learning and feature selection, problems with reproducibility of empirical experiments, and lack of unified mechanisms for knowledge management and exchange.

Eventually, to be able to proceed with our research, we had no choice but to develop an open-source framework and repository (Collective Knowledge) for collaborative and reproducible experimentation [2]. This Python-based framework and repository helped our community start organizing, describing, cross-linking and sharing code, data, experimental setups and meta information as unified and reusable components with a JSON API via standard Git services (such as GitHub or Bitbucket) [3]. Such unification, in turn, helped researchers assemble various experimental setups (workflows) from shared Python components to quickly prototype ideas while crowdsourcing experiments across spare computer resources such as Android mobile phones and tablets [4].
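To make the idea of components described by JSON meta information more concrete, here is a minimal sketch using only the Python standard library. It is not the Collective Knowledge API: the directory layout, the file name meta.json, and the helper names add_entry, load_entry and search_by_tag are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def add_entry(repo_dir, kind, name, meta):
    """Store one component as a directory holding a JSON meta file.

    The layout <repo>/<kind>/<name>/meta.json is hypothetical.
    """
    entry_dir = Path(repo_dir) / kind / name
    entry_dir.mkdir(parents=True, exist_ok=True)
    (entry_dir / "meta.json").write_text(json.dumps(meta, indent=2, sort_keys=True))
    return entry_dir


def load_entry(repo_dir, kind, name):
    """Read the JSON meta information of one component back as a dict."""
    return json.loads((Path(repo_dir) / kind / name / "meta.json").read_text())


def search_by_tag(repo_dir, tag):
    """Return 'kind:name' for every component whose meta lists the given tag."""
    hits = []
    for meta_file in sorted(Path(repo_dir).glob("*/*/meta.json")):
        meta = json.loads(meta_file.read_text())
        if tag in meta.get("tags", []):
            hits.append(f"{meta_file.parent.parent.name}:{meta_file.parent.name}")
    return hits


# Illustrative entries only; the names and fields are made up.
repo = tempfile.mkdtemp()
add_entry(repo, "program", "matmul", {"tags": ["benchmark", "linear-algebra"], "language": "c"})
add_entry(repo, "dataset", "matrix-512", {"tags": ["linear-algebra"], "size": 512})

print(search_by_tag(repo, "linear-algebra"))
```

Because each component is a plain directory with a text meta file, such a repository could be versioned and shared through any Git service without extra tooling, which is the property the paragraph above relies on.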

Furthermore, a public and unified repository allowed the community to expose experiments to predictive analytics from scikit-learn (statistical analysis, data mining, machine learning) to automate and speed up exploration of multi-dimensional experimental choices, analysis of results and decision making (for example, predicting program optimizations or hardware configurations based on program and data set features) [1,3].
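As a rough illustration of predicting an optimization choice from program features, the sketch below trains a small scikit-learn decision tree. The feature names, the numbers and the labels are invented toy values, not measurements from our experiments, and the choice of a decision tree is an assumption made only to keep the example short.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data (not real measurements): each row describes one
# program as [number_of_loops, fraction_of_memory_instructions].
features = [
    [2, 0.10],
    [3, 0.15],
    [4, 0.20],
    [20, 0.60],
    [25, 0.70],
    [30, 0.65],
]

# Hypothetical labels: the optimization level assumed to have given the
# best result for the corresponding program.
best_flag = ["-O2", "-O2", "-O2", "-O3", "-O3", "-O3"]


def train_predictor(X, y):
    """Fit a shallow decision tree mapping program features to a flag."""
    model = DecisionTreeClassifier(max_depth=2, random_state=0)
    model.fit(X, y)
    return model


model = train_predictor(features, best_flag)

# Predict a flag for a previously unseen (also hypothetical) program.
prediction = model.predict([[28, 0.68]])[0]
print(prediction)
```

In practice the features and labels would come from shared experiment entries rather than hard-coded lists, and the model would be validated on programs held out from training.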

During the past 5 years, our free and open-source technology has been extensively used and validated by several major industrial partners. It also helped initiate a new publication model in computer engineering where experiments and artifacts are validated and improved by the community [5]. We therefore believe that our practical experience with this Python-based framework and scikit-learn tools may be useful to researchers from other fields.

In this talk, we would like to present our framework, its JSON-based APIs, our real-world experience using it to solve program optimization problems, and future work on Python-based data management and exchange. In the longer term, we are interested in collaborating with the Python community to speed up Python itself and related data analytics modules using our machine-learning based autotuning and run-time adaptation techniques.

More details about our techniques and public initiatives:

[1] https://hal.inria.fr/hal-01054763

[2] http://github.com/ctuning/ck

[3] http://c-mind.org/repo

[4] https://play.google.com/store/apps/details?id=com.collective_mind.node

[5] http://c-mind.org/reproducibility