Python machine learning libraries like scikit-learn are a fantastic resource, but they are not always well suited to large datasets, since they mostly assume the data fits on a single machine. How can we use Python for machine learning in such cases? This talk introduces PySpark and MLlib as tools for distributed machine learning. We will discuss what these tools are and how they work, and walk through some basic code examples of machine learning on a cluster.
1) Intro
   a. Why is scikit-learn not enough?
   b. What is Spark?
   c. What is MLlib?
2) Spark
   a. Overview of Spark
   b. Overview of PySpark
   c. PySpark code sample
3) MLlib
   a. Overview of MLlib
   b. MLlib code samples