AT&T Labs – Research
Bell Labs/Lucent Technologies
Princeton University
Rensselaer Polytechnic Institute
Rutgers, the State University of New Jersey
Texas Southern University
Texas State University, San Marcos

Background
• DHS has established an Institute for Discrete Sciences (IDS).
• IDS is managed out of Lawrence Livermore National Laboratory.
• DHS is establishing four “university affiliated centers” (UACs) around the country.
• One of these will be a “coordinating” UAC.
• The Rutgers-based team has been designated as a UAC and was asked to become a coordinating UAC.
• Other centers: Univ. of Illinois Urbana-Champaign, Univ. of Southern California, Univ. of Pittsburgh
• In addition to 6 formal partners, we have told DHS that NJ University Consortium institutions will be involved.

Slide 2
What is Discrete Science?
• Discrete Science deals with
  – Patterns
  – Arrangements
  – Assignments
  – Schedules
• Discrete Science
  – Seeks patterns in large amounts of data
  – Analyzes connections between entities such as people and groups
  – Develops efficient ways to quickly spot changes in standard patterns

Slide 3
Why DyDAn?
• Homeland Security requires inferences from massive flows of data, arriving continuously.
• Buried in the data are quickly changing patterns.
• DyDAn will develop novel technologies to find patterns & relationships in dynamic, nonstationary, massive datasets.
• DyDAn will produce pioneering educational programs to nurture the homeland security workforce of the future.

Slide 4
DyDAn Research
• Information Management and Knowledge Discovery
• Fundamental Topics in Discrete Mathematical Foundations
• Two research themes:
  – Analysis of Large, Dynamic Multigraphs
  – Continuous, Distributed Monitoring of Dynamic, Heterogeneous Data

Slide 5
DyDAn Research I: Analysis of Large, Dynamic Multigraphs
• Need to understand interactions between entities: people, objects, groups
• Interactions are often modeled as graphs
  – Linking nodes (entities) with edges (connections)
• Multiple relationships between entities suggest multigraphs.
• Adding new entities and new & changing connections suggests dynamic multigraphs.
• Develop methods to represent, analyze, interrogate, & navigate dynamic multigraphs.

Slide 6
DyDAn Research I: Analysis of Large, Dynamic Multigraphs

Slide 7
DyDAn Research II: Continuous, Distributed Monitoring of Dynamic, Heterogeneous Data
• Need to understand massive amounts of data.
• Data is inherently distributed (multiple sources).
• Data arrives rapidly, “continuously”.
• Seek anomalies, patterns, “emerging events”.
• Run continuous queries to monitor the incoming data stream.
• Data takes numerous forms; requires data mining methods that span the modalities.

Slide 8
DyDAn Research Portfolio: Flexibility
• 9 initial projects: 5 in Area I, 4 in Area II
• Not all starting in year 1.
• All build on previous work and additional funding from Rutgers.
• Portfolio reviewed regularly with DHS, national lab partners, and other DHS centers; can readily adapt to newly identified needs.
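
As an illustration of the dynamic multigraph idea from Research Theme I above, the following is a minimal sketch, not DyDAn's actual representation. It assumes that each interaction is stored as a parallel edge carrying a relationship kind and a timestamp; the class name DynamicMultigraph, the Edge fields, and the labels "email" and "call" are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    kind: str  # type of relationship, e.g. "email" or "call" (illustrative labels)
    time: int  # timestamp of the interaction, in arbitrary integer units


class DynamicMultigraph:
    """Undirected multigraph: a pair of entities may share many timestamped edges."""

    def __init__(self):
        # Maps an unordered pair of entities to the list of edges between them.
        self._edges = defaultdict(list)

    def add_edge(self, u, v, kind, time):
        """Record a new interaction; earlier edges between u and v are kept."""
        self._edges[frozenset((u, v))].append(Edge(kind, time))

    def edges_between(self, u, v, since=None):
        """Return edges between u and v, optionally only those at or after `since`."""
        edges = self._edges.get(frozenset((u, v)), [])
        if since is None:
            return list(edges)
        return [e for e in edges if e.time >= since]

    def neighbors(self, u, since=None):
        """Return entities connected to u by at least one (recent enough) edge."""
        result = set()
        for key in self._edges:
            if u in key:
                other = next(iter(key - {u}), u)  # a self-loop maps back to u
                if self.edges_between(u, other, since):
                    result.add(other)
        return result
```

Restricting a query with `since` gives a simple view of how the graph looks over a recent time window, which is one basic way to ask about changing connections.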
Slide 9
DyDAn Research Portfolio: Large Graphs Projects
• Universal Information Graphs (initial emphasis)
• Adding Semantics to and Interconnecting Semantic Graphs (initial emphasis)
• Analyzing Large, Dynamic Multigraphs Arising from Blogs
• Algorithms for Identifying Hidden Social Structures
• Statistical and Graph-theoretical Approaches to Time-Varying Multigraphs (initial emphasis)

Slide 10
DyDAn Research Portfolio: Dynamic Data Projects
• Message Filtering and Entity Resolution
• Continuous, Distributed Data Stream Modeling
• Optimization and Data Analysis (initial emphasis)
• Dynamic Similarity Search in Multi-Modal Data

Slide 11
DyDAn Data
• Emphasis on publicly available data.
• How to acquire, publish, analyze, and store data in a private, secure way.
• Privacy-preserving data analysis.
• How to generate synthetic data sets that have the characteristics of real data but mask protected aspects.
• Director of Data Analysis will work on all aspects of acquiring, sharing, publicizing, and analyzing data: privacy, legal, technical, etc.

Slide 12
DyDAn Educational Programs
• Great need to train people to work in homeland security.
• Key DyDAn performers: record of integrating research and education from K-12 to postgraduate.
• Integration of research and education: students in all research projects.
• Integration of research and education: research themes into educational programs.
Slide 13
DyDAn Educational Programs
• Workshops, tutorials, short courses: most open to all
• New courses, certificate programs, faculty training
  – Repository for information about homeland security courses nationally
  – New homeland security certificate programs: RPI, RU, TSU
  – Website to disseminate our models nationally
  – Program for national college faculty
• Extensive program of “research experiences for undergraduates.”
  – Students from around the US in residence at DyDAn

Slide 14
DyDAn Educational Programs
• Internships/Visits
  – by students/faculty to national labs, corporate partner locations, and DyDAn.
  – by national lab, DHS, and other UAC scientists to DyDAn
• K-12 programs:
  – To build early awareness of educational and career opportunities in homeland security
  – Annual high school teacher “short course” in discrete math and homeland security

Slide 15
Leadership as a Coordinating UAC
• Building on extensive experience managing large, complex scientific & educational enterprises.
  – Based at DIMACS (Center for Discrete Mathematics and Theoretical Computer Science).
  – An original NSF “science and technology center”
  – 13 partner institutions (5 universities, 8 companies)
  – Large portfolio of research & educational programs with international scope

Slide 16
A Resource for NJ
• Connecting to the NJ Universities Homeland Security Research Consortium: seek to involve all Consortium universities
• Building on relationships with state and local agencies
• Advisory Committee: state and national representatives
• DyDAn events open to NJ university, industry, and government participants and designed with their help.
• Connecting NJ to DHS officials and efforts nationally.
Slide 17
DyDAn Research
• Information Management and Knowledge Discovery
• Fundamental Topics in Discrete Mathematical Foundations
• Two research themes:
  – Analysis of Large, Dynamic Multigraphs
  – Continuous, Distributed Monitoring of Dynamic, Heterogeneous Data

Slide 18
Project: Universal Information Graphs
James Abello & Fred Roberts (Rutgers Univ.)
Kiran Chilakamarri (Texas Southern University)
Nate Dean (Texas State University-San Marcos)

Slide 19
Overview and Connection to Problems of Homeland Security
• A variety of different massive data sources are available to analysts: Web, Internet, Calls, Email, Transportation, …
• Problem: Coordinate information from multiple sources to identify “interesting” collaborative information networks.
[Figure labels: Attack Graphs, Air Traffic]

Slide 20
[Figure labels: Web, Internet, Market Baskets, Call Detail]
Overview and Connection to Problems of Homeland Security
• Model each data source as a large multidigraph.
• Edges give information.
• Too much information to actually fuse all these multidigraphs into one.
• Challenge: Fuse the collection of multidigraphs in useful ways.

Slide 21
Project: Adding Semantics to and Interconnecting Semantic Graphs
Alex Borgida (Rutgers University)
Lila Ghemri (Texas Southern University)
Peter F. Patel-Schneider (Bell Labs Research)

Slide 22
Overview and Connection to Problems of Homeland Security
• Information of interest to DHS is often stored using “shallow” representations.
  – Much of the information is in English tags
  – Susceptible to ambiguity, incompleteness, etc.
  – These representations are nonetheless very useful
• Alleviate these weaknesses by augmenting shallowly represented information with rich ontologies that describe and prescribe how a domain works
  – can discover information inherent in shallow information
  – can expose inconsistencies in shallow information
• Problem: reasoning with rich information is computationally expensive.
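
To make the last two sub-points concrete, here is a toy sketch of how an ontology can add implied facts to shallow tags and expose contradictions among them. It is not OWL or a description logic reasoner. The class names, the SUBCLASS_OF and DISJOINT tables, and the tag format are all hypothetical and chosen only for illustration.

```python
# Hypothetical toy ontology: subclass links and pairs of mutually exclusive classes.
SUBCLASS_OF = {"Pilot": "Person", "Airline": "Organization"}
DISJOINT = {frozenset(("Person", "Organization"))}


def superclasses(cls):
    """Return cls together with all of its ancestors in SUBCLASS_OF."""
    seen = []
    while cls is not None and cls not in seen:
        seen.append(cls)
        cls = SUBCLASS_OF.get(cls)
    return set(seen)


def infer_types(tags):
    """Map each entity to every class implied by its shallow (entity, class) tags."""
    inferred = {}
    for entity, cls in tags:
        inferred.setdefault(entity, set()).update(superclasses(cls))
    return inferred


def inconsistencies(inferred):
    """List (entity, disjoint pair) for entities placed in two disjoint classes."""
    problems = []
    for entity in sorted(inferred):
        for pair in DISJOINT:
            if pair <= inferred[entity]:
                problems.append((entity, tuple(sorted(pair))))
    return problems
```

With the tags ("acme", "Airline") and ("acme", "Pilot"), the sketch infers that "acme" is both an Organization and a Person and reports the clash, even though neither tag mentions those classes directly. Real ontology languages support far richer constraints, which is where the computational cost noted above comes from.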
Slide 23
Planned Work through DyDAn
• Extend the OWL Web Ontology Language, a powerful ontology language, for use with shallow information
• Extend and specialize the theory of Distributed Description Logics (DDLs), designed to limit interactions to lessen computational load
• Develop and extend a highly optimized reasoner to improve its performance with large amounts of shallow information
• Study how dynamic change interacts with reasoners

Slide 24
Project: Analysis of Large, Dynamic Multigraphs Arising from Blogs
James Abello (Rutgers & Ask.com)
Graham Cormode (AT&T Labs – Research)
S. Muthukrishnan (Rutgers Univ.)

Slide 25
Multigraphs in Security Applications
• Intelligence data is well modeled by large, evolving multigraphs
  – Nodes: entities
  – Edges: connections
  – Many links between the same pair of entities denote different interactions at different times
  – Relationships change (slowly, rapidly) over time.
• Examples:
  – (User IDs, emails/telephone calls)
  – (Text reports/blogs/webpages, implicit/explicit links)
• Our research: acquiring and analyzing multigraphs from different applications.

Slide 26
Overview and Connection to Problems of Homeland Security
• Blogs are an example of open source data
  – Large, highly interconnected source of timely posts on observations, experiences, events, politics, etc. from citizen observers (bloggers).
  – Chaotic source of information. What (mis)information is being propagated?
  – Challenging: find trustworthy sources.
• Goal: develop techniques for labeling multigraphs in intelligence applications, and apply them to blogs.

Slide 27
Project: Algorithms for Identifying Hidden Social Structures in Virtual Communities
Yuliy Baryshnikov (Bell Labs)
Mark Goldberg (RPI)
Malik Magdon-Ismail (RPI)
William (Al) Wallace (RPI)

Slide 28
Overview and Connection to Problems of Homeland Security
• Prior to acting, the perpetrators discuss and plan using a variety of communication media.
• Challenge: Find hidden groups, coalitions, and leaders by non-semantic analysis of large communication networks.
• Ideal result: Find a suspicious group based on its pre-event communication activity, before they act.
• Useful forensic result: Ex post discovery of the relationship between the act and a communication burst.

Slide 29
Project: Statistical and Graph-Theoretical Approaches to Time-Varying Multigraphs
Colin Goodall, AT&T Labs – Research
Robert Bell, AT&T Labs – Research
David Madigan, Rutgers University

Slide 30
Overview and Connection to Problems of Homeland Security
• A COI (Community of Interest) is an effective summary of significant connections in a graph.
• Use COIs for very large scale analysis of a dynamic graph:
  – Stark change in a COI indicates an anomaly
  – Has an entity changed its ID?
  – New cliques?
• Goal: To analyze and apply automated anomaly detection to COIs of dynamic multigraphs in telecom, blogs, and intelligence data.

Slide 31
Project: Message Filtering and Entity Resolution
Endre Boros (Rutgers)
Lila Ghemri (TSU)
Tin Kam Ho (Bell Labs)
Paul Kantor (Rutgers)
David Madigan (Rutgers)
Richard Mammone (Rutgers)
Debasis Mitra (Bell Labs)

Slide 32
Continuous Message Filtering and Entity Resolution in the Distributed Environment
• Vast amounts of data flow into or through monitoring points
• Chaff must be discarded; meaningful messages and patterns of messages must be detected
  – in real time
  – with limited communication among the processing and monitoring nodes
  – with minimal interruption of normal communication and privacy
  – and with maximum effectiveness

Slide 33
Overview of Research Problem
• Work on automatically learning to identify topics, events, and actors in messages
• Recognizing the same entity (an actor, a target, an organization) under different aliases.

Slide 34
New Research Operations
10 million messages a day.
Billions of possible identifications
Multiple modeling and learning technologies
Multiple optimization and combination technologies
Thousands of potentially important messages, identifications, etc.

Connection to Problems of Homeland Security
• Millions or tens of millions of messages should be screened for patterns of interest
• Actors often hide behind false identities
  – They reveal themselves by
    • Language
    • Style
    • Connections to other actors
• Goal is to maximize screening effectiveness and minimize false positives
  – Minimize disruption to individuals and to commerce
• Early, positive impact on detection of agents and organizations

Slide 35
Project: Continuous, Distributed Data Stream Monitoring
Moses Charikar (Princeton)
Graham Cormode (AT&T Labs – Research)
S. Muthukrishnan (Rutgers)

Slide 36
Overview and Connection to Problems of Homeland Security
• Data is massive, distributed, and evolving
  – inconvenient or impossible to collect together in one place
  – still need to monitor, identify patterns, correlate
• Example:
  – monitoring streams of text: emails, blogs, newsfeeds, field reports.
  – identify patterns (profiles, clusters, outliers) that occur across multiple sites
• Technical challenge:
  – must be accurate, avoid false positives.
  – optimize how (much) data is communicated between agents.

Slide 37
Project: Optimization and Data Analysis
Alexandre d'Aspremont (Princeton)
Yuliy Baryshnikov (Bell Labs)
Savas Dayanik (Princeton)
Paul Kantor (Rutgers)
Kai Li (Princeton)
Warren Powell (Princeton)
Seyed Roosta (TSU)

Slide 38
Overview and Connection to Problems of Homeland Security
• Dynamic data raises optimization issues:
  – Has the rate of messages sent by X changed?
  – Has the flow of cash into/out of organization Y changed?
  – Has there been an unexpected change in travel plans?
• View as a learning problem; use optimal learning strategies.
• Research challenges:
  – Optimal detection of changes in signals
  – Optimal recursive estimation
  – Rapid classification/pattern identification

Slide 39
Project: Dynamic Similarity Search in Multi-Modal Data
Moses Charikar, Perry Cook, Kai Li, and Olga Troyanskaya (Princeton)
Ken Clarkson, Tin Kam Ho, and Haobo Ren (Bell Labs)

Slide 40
Overview and Connection to Problems of Homeland Security
• Data arising in homeland security comes from many modalities
  – Often such data are sensor data (audio, images, video, etc.), which are noisy and require similarity match and similarity search
  – Feature extraction is difficult, and the resulting features are high-dimensional
• Multi-modal data of interest are massive
  – Current content-based similarity search and classification are limited to small scale
  – “Curse of dimensionality”

Slide 41
Overview and Connection to Problems of Homeland Security
• How to build similarity search systems for multi-modal data is not well understood
  – How to manage and search at scale
  – How to integrate annotation/attribute-based search with content-based search

Slide 42
We are looking forward to collaborating with the DHS Institute for Discrete Sciences and to involving the NJ homeland security community in the new center.