Machine Learning in the Big Data Era Experiences with Data-Parallel Computing Frameworks

Machine Learning in the Big Data Era: Are We There Yet?
Experiences with Data-Parallel Computing Frameworks
Dr. Sreenivas Sukumar
Prospective Faculty Candidate
Oak Ridge National Lab
December 8, 2014
3:30 pm
Electrical Engineering Bullen Room 226
Recent innovations in collecting, accessing, organizing, integrating, and querying massive
amounts of data from a wide variety of sources have brought statistical machine learning
under more scrutiny than ever before as a means of gleaning insights from data. In that
context, we pose and debate the question: are machine learning algorithms scaling with
the ability to store and compute? If yes, how? If not, why not?
We share experiences from real-world Big Data knowledge discovery projects across domains of
national security and healthcare and identify three grand challenges: (i) the 'data science'
challenge - designing scalable and flexible computational architectures for machine learning
(beyond just data-retrieval); (ii) the 'science of data' challenge - the ability to understand
characteristics of data before applying machine learning algorithms and tools; and (iii) the
'scalable predictive functions' challenge - the ability to construct, learn and infer with increasing
sample size, dimensionality, and categories of labels. In the second part of the talk, we will
present progress made towards addressing each of these challenges. In particular, we will present
(i) benchmark results from scalable implementations of popular algorithms from the literature on
different scalable architectures (shared-memory, shared-storage, and shared-nothing), both
in-house (Urika, Titan, and CADES) and commercial options such as Amazon Web Services; (ii) a
semantic scalable-knowledge nurturing representation; and (iii) results of analysis, pattern search,
and predictive modeling at scale applied to use cases in healthcare.