Deep learning thrives on ever-bigger networks and ever-growing datasets, but a single machine can only handle so much. When should you scale to multiple machines, and how do you do it efficiently? What are the pros and cons of the available options, and what is the theory behind their approaches to distributed training? In this talk we will answer these questions and show what problems we are trying to solve at Avast.
As the accuracy of deep learning grows, so do its computational demands. Cutting-edge models, hyperparameter tuning, and architecture search either require large amounts of GPU memory or take a long time to finish. It is not always feasible to invest in high-end server-grade hardware, and the development of Big Data shows that scaling horizontally is a viable alternative. Why bother distributing training across multiple machines, and how does it differ from the single-machine scenario? What options are available, and which one is best?
This talk aims to introduce the audience to a few of the available distributed deep learning systems, such as Distributed TensorFlow, TensorFlow on Spark, and Horovod, and to compare them in terms of both theory and our own benchmarks. We will explain how this need arose at Avast and how we solved it, and share experiences from our journey.