Machine Learning in the Big Data Era - Are We There Yet? Experiences with Data-Parallel Computing Frameworks

Dr. Sreenivas Sukumar
Prospective Faculty Candidate
Oak Ridge National Lab

December 8, 2014, 3:30 pm
Electrical Engineering, Bullen Room 226

Recent innovations in collecting, accessing, organizing, integrating, and querying massive amounts of data from a wide variety of sources have brought statistical machine learning under more scrutiny and evaluation than ever before as a means of gleaning insights from data. In that context, we pose and debate the question: are machine learning algorithms scaling with the ability to store and compute? If yes, how? If not, why not?

We share experiences from real-world Big Data knowledge discovery projects across the domains of national security and healthcare and identify three grand challenges: (i) the 'data science' challenge - designing scalable and flexible computational architectures for machine learning (beyond just data retrieval); (ii) the 'science of data' challenge - the ability to understand the characteristics of data before applying machine learning algorithms and tools; and (iii) the 'scalable predictive functions' challenge - the ability to construct, learn, and infer with increasing sample size, dimensionality, and categories of labels.

In the second part of the talk, we will present progress made towards addressing each of these challenges. In particular, we will present (i) benchmark results from scalable implementations of popular algorithms from the literature on different scalable architectures (shared-memory, shared-storage, and shared-nothing), both in-house (Urika, Titan, and CADES) and commercial options such as Amazon Web Services; (ii) a semantic scalable-knowledge nurturing representation; and (iii) results of analysis, pattern search, and predictive modeling at scale applied to use cases in healthcare.