Massive Data Analysis Lab (MassDAL)
S. Muthukrishnan
CS Dept

MassDAL
• Agenda: Gather, manage and process massive data logs: Web, IP/wireless traffic data, location trajectories of objects, sensor readings of the physical world.
• Key Challenges:
  – Scale: Beyond the traditional “human” scale. E.g., IP data at a single router interface for an hour exceeds total yearly worldwide credit card transactions!
  – Data Collection: probes/sensors with associated data quality and communication problems.
• Need breakthroughs in Mathematics, Algorithms, Systems and Engineering to meet these challenges.
• Potential: Major impact in Homeland Security, Telecom, Transportation and Society-at-large.

State of MassDAL
• Mathematics and Computer Science.
  – Algorithmic tools for embedding vectors, strings, trees and other objects for “compact” representation.
  – Algorithmic tools for analyzing data summaries for heavy hitters, deviants, clustering, decision trees, etc.
  – Invited talks at ACM, SIAM, and European conferences in Algorithms, Databases, Statistics, and Data Mining on novel models and algorithms.
  – Over a dozen research papers in the last 2 years on experience with massive data analysis.
  – Supported by NSF grants. Partners: MIT, DIMACS.

State of MassDAL
• Science.
  – Developing wearable sensors for tracking location of objects as well as “interactions” between objects. Measuring behavioral data.
  – Current partner: Telcordia. Their initial investment: $300k/3 months (est). Potential partner in the works: Los Alamos National Lab.
  – Potential: Analysis of social networks for Epidemiology, Homeland Security, and the health industry.

State of MassDAL
• Engineering.
  – Consulting in analysis of wireless network logs. AT&T Wireless, 3rd largest in US, 20 million customers. Terabytes/month. Fully operational, telco-grade!
  – Incorporated novel algorithms in operational IP network data analysis tools. Partner: Gigascope.
  – Developed a principled approach to data cleaning and data quality monitoring for operational IP networks. Partner: PACMAN.
  – Developed new burst-detection algorithms for text streams. Partner: DIMACS, Monitoring Message Streams.

Future
• See http://cs.rutgers.edu/~muthu/massdal.html

Future of MassDAL
• Research: Need breakthrough research in mathematics, systems, databases, algorithms, sensor networking.
• Expand data domains.
  – Potential partners: Google, NJ auto insurance fraud data, USPTO patent data, AWS location trajectories, etc.
• Build a state-of-the-art facility at Rutgers.
  – Secure, 24x7 data hosting and analysis infrastructure capable of gathering and processing petabytes of data/month across domains, data sources, etc. Unique in the world!
• Potential.
  – Every wireless, telecom, and internet service provider is looking to farm out this crucial piece of their operations. Estimated market for these services: hundreds of millions of US dollars per year. Crucial for NJ State. Interest from multiple VCs now.