Embedding Methods for Massive Data Sets and Their Applications

Massive Data Analysis Lab
S. Muthukrishnan
CS Dept
• Agenda: Gather, manage and process massive data logs: Web, IP/wireless traffic data, location trajectories of objects,
sensor readings of the physical world.
• Key Challenges:
– Scale: Beyond the traditional “human” scale. E.g., IP data at a
single router interface for an hour exceeds total yearly worldwide
credit card transactions!
– Data Collection: probes/sensors with associated data quality and
communication problems.
• Need breakthroughs in Mathematics, Algorithms, Systems
and Engineering, to meet these challenges.
• Potential: Major impact in Homeland Security, Telecom,
Transportation and Society-at-large.
State of MassDAL
• Mathematics and Computer Science.
– Algorithmic tools for embedding vectors, strings, trees and
other objects for “compact” representation.
– Algorithmic tools for analyzing data summaries for heavy
hitters, deviants, clustering, decision trees, etc.
– Invited talks at ACM, SIAM, European conferences in
Algorithms, Databases, Statistics, and Data Mining on
novel models and algorithms.
– Over a dozen research papers in the last 2 years on experience
with massive data analysis.
– Supported by NSF grants. Partners: MIT, DIMACS.
State of MassDAL
• Science
– Developing wearable sensors for tracking location
of objects as well as “interactions” between
objects. Measuring behavioral data.
– Current partner: Telcordia. Their initial
investment: $300k/3 months (est). Potential partner
in the works: Los Alamos National Lab.
– Potential: Analysis of social networks for
Epidemiology and Homeland Security, and health
State of MassDAL
• Engineering.
– Consulting in analysis of wireless network logs.
AT&T Wireless, 3rd largest in the US, 20 million
customers. Terabytes/month. Fully operational, telco-grade!
– Incorporated novel algorithms in operational IP
network data analysis tools. Partner: Gigascope.
– Developed principled approach to data cleaning and
data quality monitoring for operational IP network.
Partner: PACMAN.
– Developed new burst-detection algorithms for text
streams. Partner: DIMACS, Monitoring message
• See
Future of MassDAL
• Research: Need breakthrough research in mathematics,
systems, databases, algorithms, sensor networking.
• Expand data domains.
– Potential partners: Google, NJ auto insurance fraud data,
USPTO patent data, AWS location trajectories, etc.
• Build a state-of-the-art facility at Rutgers.
– Secure, 24x7 data hosting and analysis infrastructure capable
of gathering and processing petabytes of data/month across
domains, data sources, etc. Unique in the world!
• Potential.
– Every wireless, telecom, internet service provider is looking to
farm out this crucial piece of their operations. Estimated
market for these services: hundreds of millions of US dollars per year.
Crucial for NJ State. Interest from multiple VCs now.