Machine learning is among the core techniques for big data analytics. Nowadays, there are diverse parallel programming models and computing platforms such as Hadoop MapReduce, Spark, MPI, CUDA-GPU, etc. Depending on data size and the specific problem, we can choose one of them to implement parallel machine learning algorithms for analyzing big data. However, whenever a new model or platform emerges, we need to rewrite all the algorithms we use, which imposes a heavy burden of duplicated work. On the other hand, data analysts from different application domains focus on high-level business modeling and data analytics using high-level languages and tools such as R or Matlab, and it is very hard for them to learn and use the relatively low-level parallel programming models and computing platforms.
This talk will first present a number of commonly used parallel machine learning algorithms implemented with Hadoop MapReduce and Spark. It will then present a unified programming model and platform for machine learning and data analytics across heterogeneous parallel programming models and computing platforms. The key idea is to adopt the large-scale matrix as a unified abstraction for representing different machine learning and data analytic algorithms, with R as the high-level programming language. We design and provide a unified framework and platform, Octopus, that takes charge of translating R programs and mapping large-scale matrix computations onto the underlying parallel programming models and platforms. Based on this unified model and platform, we can achieve the goal of "write once, run anywhere" for big data machine learning and data analytic algorithms.
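The abstract does not show Octopus's actual API, but the core idea of a matrix abstraction with interchangeable execution backends can be sketched as follows. This is a hypothetical Python illustration, not Octopus code: the backend classes, the blocked "distributed" multiply, and the `gram` helper are all invented for this example. An algorithm written only against the abstract `multiply` operation produces the same result whichever backend is plugged in.

```python
def matmul(a, b):
    """Plain triple-loop matrix multiply on nested lists (reference semantics)."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def transpose(a):
    """Transpose a matrix stored as nested lists."""
    return [list(row) for row in zip(*a)]

class LocalBackend:
    """Single-node execution, analogous to running the R program locally."""
    def multiply(self, a, b):
        return matmul(a, b)

class BlockedBackend:
    """Toy stand-in for a distributed engine (e.g. MapReduce or Spark):
    splits A into row blocks, multiplies each block independently as a
    parallel engine might, then concatenates the partial results."""
    def __init__(self, block_rows=2):
        self.block_rows = block_rows

    def multiply(self, a, b):
        out = []
        for i in range(0, len(a), self.block_rows):
            out.extend(matmul(a[i:i + self.block_rows], b))
        return out

def gram(x, backend):
    """A 'write once' computation: X^T X expressed only through the
    backend's multiply, so it runs unchanged on any backend."""
    return backend.multiply(transpose(x), x)
```

For example, `gram(X, LocalBackend())` and `gram(X, BlockedBackend())` return identical matrices; in Octopus the analogous switch would retarget the same R program from one computing platform to another.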
Dr. Yihua Huang is currently a professor in the Department of Computer Science and Technology at Nanjing University in China, Vice Secretary General of the Big Data Committee of the China Computer Society (CCF), and Chair of the Big Data Committee of the Jiangsu Province Computer Society. He received his bachelor's, master's, and Ph.D. degrees in computer science from Nanjing University in 1983, 1986, and 2007, respectively. His main research interests include big data processing, parallel and distributed computing, semantic analytics, and web data mining. He is one of the earliest researchers in China on big data processing. He is currently engaged in a series of research projects on big data processing, supported by grants from state research programs in China as well as from industry partners including Intel, Google, ZTE, and Baidu. In 2009, sponsored by Google, he created and offered the course "MapReduce Big Data Parallel Processing" for graduate students in computer science at Nanjing University, and he received the "2012 Google Faculty Fellowship" from Google.