Plans for New DHS Center based at Rutgers University

AT&T Labs – Research
Bell Labs/Lucent Technologies
Princeton University
Rensselaer Polytechnic Institute
Rutgers, the State University of New Jersey
Texas Southern University
Texas State University, San Marcos
Background
• DHS has established an Institute for Discrete
Sciences (IDS).
• Managed out of Lawrence Livermore National Laboratory.
• DHS is establishing four “university affiliated
centers” around the country.
• One of these will be a “coordinating” UAC.
• The Rutgers-based team has been designated as a
UAC and was asked to become a coordinating
UAC.
• Other centers: Univ. of Illinois Urbana-Champaign, Univ. of Southern California, Univ. of Pittsburgh
• In addition to 6 formal partners, we have told DHS
that NJ University Consortium institutions will be
involved.
Slide 2
What is Discrete Science?
• Discrete Science deals with
– Patterns
– Arrangements
– Assignments
– Schedules
• Discrete Science
– Seeks patterns in large amounts of
data
– Analyzes connections between entities
such as people and groups
– Develops efficient ways to quickly
spot changes in standard patterns
Slide 3
Why DyDAn?
• Homeland Security requires inferences from
massive flows of data, arriving continuously.
• Buried in the data are quickly changing patterns.
• DyDAn will develop novel technologies to find
patterns & relationships in dynamic,
nonstationary, massive datasets.
• DyDAn will produce pioneering educational
programs to nurture the homeland security
workforce of the future.
Slide 4
DyDAn Research
• Information Management and Knowledge
Discovery
• Fundamental Topics in Discrete Mathematical
Foundations
• Two research themes:
– Analysis of Large, Dynamic Multigraphs
– Continuous, Distributed Monitoring of
Dynamic, Heterogeneous Data
Slide 5
DyDAn Research I: Analysis of Large,
Dynamic Multigraphs
• Need to understand interactions between entities:
people, objects, groups
• Interactions often modeled as graphs
– Linking nodes (entities) with edges (connections)
• Multiple relationships between entities suggest
multigraphs.
• Adding new entities and new or changing connections
suggests dynamic multigraphs.
• Develop methods to represent, analyze,
interrogate, & navigate dynamic multigraphs.
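As a rough illustration of the idea, a dynamic multigraph can be held as an adjacency structure in which each pair of entities keeps a list of parallel edges, each tagged with a relationship type and a timestamp. The class and method names below are hypothetical, chosen only for this sketch; it is not a DyDAn implementation.

```python
from collections import defaultdict


class DynamicMultigraph:
    """Toy undirected multigraph; parallel edges carry a relation and a timestamp."""

    def __init__(self):
        # adjacency[u][v] is a list of (relation, timestamp), one per parallel edge
        self.adjacency = defaultdict(lambda: defaultdict(list))

    def add_edge(self, u, v, relation, timestamp):
        """Add one connection; repeated calls for the same pair add parallel edges."""
        self.adjacency[u][v].append((relation, timestamp))
        if u != v:  # avoid storing a self-loop twice
            self.adjacency[v][u].append((relation, timestamp))

    def edges_between(self, u, v, since=None):
        """Parallel edges between u and v, optionally only those at or after `since`."""
        edges = self.adjacency.get(u, {}).get(v, [])
        if since is None:
            return list(edges)
        return [(r, t) for (r, t) in edges if t >= since]

    def neighbors(self, u, since=None):
        """Entities connected to u by at least one edge at or after `since`."""
        return {v for v in self.adjacency.get(u, {}) if self.edges_between(u, v, since)}
```

Filtering by `since` is one simple way to "interrogate" how the graph looks over a recent time window.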
Slide 6
DyDAn Research II: Continuous,
Distributed Monitoring of Dynamic,
Heterogeneous Data
• Need to understand massive amounts of
data.
• Data inherently distributed (multiple
sources)
• Data arrives rapidly – “continuously”
• Seek anomalies, patterns, “emerging events”
• Run continuous queries to monitor
incoming data stream.
• Data takes numerous forms; requires data
mining methods that span the modalities.
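One simple form of continuous query is a sliding-window count that is updated as each item arrives and raises a flag when a key becomes unusually frequent. The sketch below assumes timestamps arrive in nondecreasing order; the class name, window length, and threshold are illustrative assumptions.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Continuously count events per key over the most recent `window` time units."""

    def __init__(self, window, threshold):
        self.window = window
        self.threshold = threshold
        self.events = deque()    # (timestamp, key) pairs in arrival order
        self.counts = Counter()  # per-key counts for events still in the window

    def observe(self, timestamp, key):
        """Ingest one event; return True if `key` has reached the threshold."""
        self.events.append((timestamp, key))
        self.counts[key] += 1
        # Expire events that have fallen out of the window
        while self.events and self.events[0][0] <= timestamp - self.window:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]
        return self.counts[key] >= self.threshold
```

Memory is bounded by the number of events inside the window, not by the length of the stream.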
Slide 8
DyDAn Research Portfolio: Flexibility
• 9 initial projects, 5 in Area I, 4 in Area II
• Not all starting in year 1.
• All build on previous work and leverage additional
funding from Rutgers.
• Portfolio reviewed regularly with DHS, national
lab partners, and other DHS centers; can readily
adapt to newly identified needs.
Slide 9
DyDAn Research Portfolio: Large
Graphs Projects
• Universal Information Graphs (initial emphasis)
• Adding Semantics to and Interconnecting
Semantic Graphs (initial emphasis)
• Analyzing Large, Dynamic Multigraphs Arising
from Blogs
• Algorithms for Identifying Hidden Social
Structures
• Statistical and Graph-Theoretical Approaches to
Time-Varying Multigraphs (initial emphasis)
Slide 10
DyDAn Research Portfolio:
Dynamic Data Projects
• Message Filtering and Entity Resolution
• Continuous, Distributed Data Stream Modeling
• Optimization and Data Analysis (initial emphasis)
• Dynamic Similarity Search in Multi-Modal Data
Slide 11
DyDAn Data
• Emphasis on publicly available data.
• How to acquire, publish, analyze, store data in a
private, secure way.
• Privacy-preserving data analysis.
• How to generate synthetic data sets that have the
characteristics of real data but mask protected
aspects.
• Director of Data Analysis will work on all aspects
of acquiring, sharing, publicizing, and analyzing data:
privacy, legal, technical, etc.
Slide 12
DyDAn Educational Programs
• Great need to train people to work in homeland
security.
• Key DyDAn performers: record of integrating
research and education from K-12 to
postgraduate.
• Integration of research and education: students in
all research projects.
• Integration of research and education: research
themes into educational programs.
Slide 13
DyDAn Educational Programs
• Workshops, tutorials, short courses: most open to all
• New courses, certificate programs, faculty training
– Repository for information about homeland security courses
nationally
– New homeland security certificate programs: RPI, RU, TSU
– Website to disseminate our models nationally
– Program for national college faculty
• Extensive program of “research experiences for
undergraduates.”
– Students from around the US in residence at DyDAn
Slide 14
DyDAn Educational Programs
• Internships/Visits
– by students/faculty to national labs, corporate
partner locations, and DyDAn.
– by national lab, DHS, other UAC scientists to
DyDAn
• K-12 programs:
– To build early awareness of educational and
career opportunities in homeland security
– Annual high school teacher “short course” in
discrete math and homeland security
Slide 15
Leadership as a Coordinating UAC
• Building on extensive experience managing large,
complex scientific & educational enterprises.
– Based at DIMACS (Center for Discrete Mathematics
and Theoretical Computer Science).
– An original NSF “science and technology center”
– 13 partner institutions (5 universities, 8 companies)
– Large portfolio of research & educational programs with
international scope
Slide 16
A Resource for NJ
• Connecting to the NJ Universities Homeland
Security Research Consortium: Seek to involve all
Consortium universities
• Building on Relationships with State and Local
Agencies
• Advisory Committee: State and National
Representatives
• DyDAn Events open to NJ university, industry, and
government participants and designed with their
help.
• Connecting NJ to DHS officials and efforts
nationally.
Slide 17
Project: Universal Information
Graphs
James Abello & Fred Roberts
(Rutgers Univ.)
Kiran Chilakamarri
(Texas Southern University)
Nate Dean
(Texas State University- San
Marcos)
Slide 19
Overview and Connection to Problems of
Homeland Security
• A variety of different massive data sources are available to
analysts: Web, Internet, Calls, Email, Transportation, …
• Problem: Coordinate information from multiple sources to
identify “interesting” collaborative information networks.
[Figure: example data sources: attack graphs, air traffic, Web, Internet, market baskets, call detail]
Slide 20
Overview and Connection to
Problems of Homeland Security
• Model each data source as a large multidigraph.
• Edges give information.
• Too much information to actually fuse all these
multidigraphs into one.
• Challenge: Fuse collection of multidigraphs in useful
ways.
Slide 21
Project: Adding Semantics to and
Interconnecting Semantic Graphs
Alex Borgida (Rutgers
University)
Lila Ghemri (Texas
Southern University)
Peter F. Patel-Schneider
(Bell Labs Research)
Slide 22
Overview and Connection to Problems of
Homeland Security
• Information of interest to DHS is often stored using
“shallow” representations.
– Much of the information is in English tags
– Susceptible to ambiguity, incompleteness, etc.
– These representations are nonetheless very useful
• Alleviate the limitations of such shallowly represented
information by augmenting it with rich ontologies that
describe and prescribe how a domain works
– can discover information inherent in shallow information
– can expose inconsistencies in shallow information
• Problem: reasoning with rich information is
computationally expensive.
Slide 23
Planned Work through DyDAn
• Extend the OWL Web Ontology Language, a powerful
ontology language, for use with shallow information
• Extend and specialize theory of Distributed
Description Logics (DDLs), designed to limit
interactions to lessen computational load
• Develop and extend a highly-optimized reasoner to
improve its performance with large amounts of
shallow information
• Study how dynamic change interacts with reasoners
Slide 24
Project: Analysis of Large,
Dynamic Multigraphs Arising
from Blogs
James Abello (Rutgers &
Ask.com)
Graham Cormode
(AT&T Labs – Research)
S. Muthukrishnan
(Rutgers Univ.)
Slide 25
Multigraphs in Security Applications
• Intelligence data is well-modeled by large, evolving
multigraphs
– Nodes: entities; edges: connections
– Many links between same pair of entities denote different
interactions at different times
– Relationships change (slowly, rapidly) over time.
• Examples:
– (User IDs, emails/telephone calls),
– (Text reports/blogs/webpages, implicit/explicit links)
• Our research: acquiring and analyzing multigraphs
from different applications.
Slide 26
Overview and Connection to Problems
of Homeland Security
• Blogs are an example of open source data
– Large, highly-interconnected source of timely posts
on observations, experiences, events, politics etc.
from citizen observers (bloggers).
– Chaotic source of information. What
(mis)information is being propagated?
– Challenging: find trustworthy sources.
• Goal: develop techniques for labeling
multigraphs in intelligence applications, apply
them to blogs.
Slide 27
Project: Algorithms for
Identifying Hidden Social
Structures in Virtual Communities
Yuliy Baryshnikov (Bell
Labs)
Mark Goldberg (RPI)
Malik Magdon-Ismail
(RPI)
William (Al) Wallace
(RPI)
Slide 28
Overview and Connection to
Problems of Homeland Security
• Prior to their acting, the perpetrators discuss and plan
using a variety of communication media.
• Challenge: Find hidden groups, coalitions and leaders
by non-semantic analysis of large communication
networks.
• Ideal result: Find a suspicious group based on its pre-event
communication activity, before they act.
• Useful forensic result: Ex-post discovery of the
relationship between the act and communication burst.
Slide 29
Project: Statistical and Graph-Theoretical
Approaches to Time-Varying Multigraphs
Colin Goodall (AT&T Labs – Research)
Robert Bell (AT&T Labs – Research)
David Madigan (Rutgers University)
Slide 30
Overview and Connection To
Problems of Homeland Security
• A COI (Community of Interest) is
an effective summary of significant
connections in a graph.
• Use COI for very large scale analysis
of a dynamic graph:
– Stark change in COI indicates an anomaly
– Has an entity changed its id?
– New cliques?
• Goal: To analyze and apply automated anomaly
detection to COIs of dynamic multigraphs in telecom,
blogs, and intelligence data.
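To make the COI idea concrete, one possible sketch treats an entity's COI as its top-k contacts by interaction count in a time window and scores change between windows as one minus the Jaccard similarity. Both the top-k definition and the Jaccard score are assumptions made for this illustration, not the project's definitions.

```python
from collections import Counter


def community_of_interest(edges, entity, k=3):
    """Top-k contacts of `entity` by number of interactions.

    `edges` is a list of (u, v) pairs, one per interaction in a time window.
    """
    counts = Counter()
    for u, v in edges:
        if u == entity:
            counts[v] += 1
        elif v == entity:
            counts[u] += 1
    return {contact for contact, _ in counts.most_common(k)}


def coi_change(old_coi, new_coi):
    """1 - Jaccard similarity: 0.0 for identical COIs, 1.0 for disjoint ones."""
    union = old_coi | new_coi
    if not union:
        return 0.0
    return 1.0 - len(old_coi & new_coi) / len(union)
```

A change score near 1.0 between consecutive windows would be a candidate "stark change" worth a closer look; where to set the cutoff is left open here.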
Slide 31
Project: Message Filtering and
Entity Resolution
Endre Boros (Rutgers)
Lila Ghemri (TSU)
Tin Kam Ho (Bell Labs)
Paul Kantor (Rutgers)
David Madigan (Rutgers)
Richard Mammone (Rutgers)
Debasis Mitra (Bell Labs)
Slide 32
Continuous Message Filtering and Entity
Resolution in the Distributed Environment
• Vast amounts of data flow into or through monitoring
points
• Chaff must be discarded, meaningful messages and
patterns of messages must be detected
– in real time
– with limited communication among the processing and
monitoring nodes
– with minimal interruption of normal communication and
privacy
– and maximum effectiveness
Slide 33
Overview of Research Problem
• Work on automatically learning to identify
topics, events, and actors in messages
• Recognizing the same entity (an actor, a target,
an organization) under different aliases.
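Once some process has decided that two names refer to the same entity, the bookkeeping of alias groups can be sketched with a union-find structure. How a match is decided (language, style, connections) is outside this sketch; the class and method names are hypothetical.

```python
class AliasResolver:
    """Group names into entities, given pairwise 'same entity' evidence."""

    def __init__(self):
        self.parent = {}  # name -> parent name; a root is its own parent

    def find(self, name):
        """Return the representative name for the entity that `name` belongs to."""
        self.parent.setdefault(name, name)
        root = name
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every name on the path directly at the root
        while self.parent[name] != root:
            next_name = self.parent[name]
            self.parent[name] = root
            name = next_name
        return root

    def link(self, a, b):
        """Record that aliases `a` and `b` refer to the same entity."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def same_entity(self, a, b):
        return self.find(a) == self.find(b)
```

Links are transitive: if A matches B and B matches C, all three resolve to one entity.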
Slide 34
New Research
Operations
• 10 million messages a day; billions of possible
identifications
• Multiple modeling and learning technologies
• Multiple optimization and combination technologies
• Thousands of potentially important messages,
identifications, etc.
Connection to Problems of Homeland
Security
• Millions or tens of millions of messages should be
screened for patterns of interest
• Actors often hide behind false identities
– Reveal themselves by
• Language
• Style
• Connections to other actors
• Goal is to maximize screening effectiveness and
minimize false positives, which disrupt individuals
and commerce
• Early, positive impact on detection of agents and
organizations
Slide 35
Project: Continuous, Distributed
Data Stream Monitoring
Moses Charikar
(Princeton)
Graham Cormode
(AT&T Labs – Research)
S. Muthukrishnan
(Rutgers)
Slide 36
Overview and Connection to
Problems of Homeland Security
• Data is massive, distributed, and evolving
– inconvenient or impossible to collect together in one place
– still need to monitor, identify patterns, correlate
• Example:
– monitoring streams of text: emails, blogs, newsfeeds, field
reports.
– identify patterns (profiles, clusters, outliers) that occur
across multiple sites
• Technical challenge:
– must be accurate, avoid false positives.
– optimize how (much) data is communicated between agents.
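A minimal sketch of the communication trade-off: each site counts events locally and reports to a coordinator only after its count has grown by a fixed slack since its last report. With integer counts, the coordinator's estimate of the global total is then low by less than (number of sites × slack). The classes and the slack rule are assumptions for illustration, not the project's protocol.

```python
class Site:
    """A monitoring site that reports its count only after `slack` new events."""

    def __init__(self, slack):
        self.slack = slack
        self.count = 0
        self.last_reported = 0

    def observe(self, n=1):
        """Record n local events; return the count if a report is due, else None."""
        self.count += n
        if self.count - self.last_reported >= self.slack:
            self.last_reported = self.count
            return self.count
        return None


class Coordinator:
    """Tracks the latest reported count from each site."""

    def __init__(self, num_sites):
        self.reported = [0] * num_sites
        self.messages = 0  # number of reports received so far

    def receive(self, site_id, count):
        self.reported[site_id] = count
        self.messages += 1

    def estimate(self):
        """Estimated global total; never exceeds the true total."""
        return sum(self.reported)
```

A larger slack means fewer messages but a looser estimate, which is the accuracy versus communication balance named in the technical challenge.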
Slide 37
Project: Optimization and Data
Analysis
Alexandre d'Aspremont (Princeton)
Yuliy Baryshnikov (Bell Labs)
Savas Dayanik (Princeton)
Paul Kantor (Rutgers)
Kai Li (Princeton)
Warren Powell (Princeton)
Seyed Roosta (TSU)
Slide 38
Overview and Connection to Problems of
Homeland Security
• Dynamic data raises optimization issues:
– Has rate of messages sent by X changed?
– Has flow of cash into/out of organization Y changed?
– Has there been unexpected change in travel plans?
• View as learning problem; use optimal learning
strategies.
• Research challenges:
– Optimal detection of changes in signals
– Optimal recursive estimation
– Rapid classification/pattern identification
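A question such as "has the rate of messages sent by X changed?" can be illustrated with a one-sided CUSUM detector, a classical change-detection technique. It is shown here only as an example of the problem type, not as the optimal method the project aims for; the parameter names and values are assumptions.

```python
def cusum_detect(samples, target_mean, drift, threshold):
    """One-sided CUSUM for an upward shift in the mean.

    Accumulates how far each sample exceeds `target_mean + drift`, resetting
    at zero, and returns the first index where the sum exceeds `threshold`
    (or None if no change is flagged).
    """
    cumulative = 0.0
    for index, value in enumerate(samples):
        cumulative = max(0.0, cumulative + (value - target_mean - drift))
        if cumulative > threshold:
            return index
    return None
```

For example, with a baseline of 10 messages per period, `drift=1`, and `threshold=10`, a jump to 15 messages per period is flagged on the third period after the jump.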
Slide 39
Project: Dynamic Similarity
Search in Multi-Modal Data
Moses Charikar, Perry
Cook, Kai Li, and
Olga Troyanskaya
(Princeton)
Ken Clarkson, Tin Kam
Ho, and Haobo Ren
(Bell Labs)
Slide 40
Overview and Connection to
Problems of Homeland Security
• Data arising in homeland security comes from
many modalities
– Often such data are sensor data (audio, images, video,
etc.) which are noisy and require similarity match and
similarity search
– Feature extractions are difficult and such features are
high dimensional
• Multi-modal data of interest are massive
– Current content-based similarity search and
classification are limited to small scale
– “Curse of dimensionality”
Slide 41
Overview and Connection to
Problems of Homeland Security
• How to build similarity search systems for multi-modal
data is not well understood
– How to manage and search at scale
– How to integrate annotations/attributes based search
with content-based search
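One common way to approach similarity search over high-dimensional feature vectors is to compress each vector into a short bit signature with random hyperplanes, so that vectors at a small angle tend to share most bits. The sketch below illustrates that general technique under stated assumptions (Gaussian hyperplanes, a fixed seed, hypothetical function names); it is not presented as the project's system.

```python
import random


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw `num_bits` random hyperplane normals in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(h_i * x_i for h_i, x_i in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(sig_a, sig_b):
    """Number of differing bits; smaller suggests a smaller angle between vectors."""
    return sum(a != b for a, b in zip(sig_a, sig_b))
```

Signatures depend only on direction, so a vector and a positively scaled copy of it get the same signature, while opposite vectors differ in every bit. Comparing short signatures instead of full vectors is one way to keep search cost manageable as dimensionality grows.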
Slide 42
We are looking forward to collaborating with
the DHS Institute for Discrete Sciences and to
involving the NJ homeland security community
in the new center