Center for Science of Information: From Information Reduction to Knowledge Extraction
Wojciech Szpankowski, Purdue University
National Science Foundation / Science & Technology Centers Program, December 2012
Partner institutions: Bryn Mawr, Howard, MIT, Princeton, Purdue (lead), Stanford, Texas A&M, UC Berkeley, UC San Diego, UIUC

Outline
1. Science of Information
2. Center Mission: STC Team & Staff; Integrated Research
3. Research Results in Knowledge Extraction
4. Center Mission: Grand Challenges; Education & Diversity; Knowledge Transfer

Shannon Legacy
The Information Revolution started in 1948 with the publication of Claude Shannon's A Mathematical Theory of Communication. The digital age began.
Shannon information quantifies the extent to which a recipient of data can reduce its statistical uncertainty. "Semantic aspects of communication are irrelevant . . ."
Fundamental limits for storage and communication (the field of information theory).
Application enabler/driver: CD, iPod, DVD, video games, Internet, Facebook, WiFi, mobile, Google, ...
Design driver: universal data compression, voiceband modems, CDMA, multiantenna systems, discrete denoising, space-time codes, cryptography, distortion-theory approach to Big Data.

Shannon Fundamental Results

Post-Shannon Challenges
1. Back from infinity: Extend Shannon's findings to finite-size data structures (i.e., sequences, graphs); that is, develop information theory of various data structures beyond first-order asymptotics.
2. Science of Information: Extend information theory to meet new challenges in biology, massive data, communication, knowledge extraction, economics, ...
To accomplish this, we must understand new aspects of information in structure, time, space, and semantics, as well as dynamic information, limited resources, complexity, representation-invariant information, and cooperation & dependency.

Post-Shannon Challenges
Structure: Measures are needed for quantifying information embodied in structures (e.g., information in material structures, nanostructures, biomolecules, gene regulatory networks, protein networks, social networks, financial transactions).
• Szpankowski & Choi: information contained in unlabeled graphs & universal graphical compression.
• Grama & Subramaniam: quantifying the role of noise and incomplete data, identifying conserved structures, finding orthologies in biological network reconstruction.
• Neville: outlining characteristics (e.g., weak dependence) sufficient for network models to be well-defined in the limit.
• Yu & Qi: finding distributions of latent structures in social networks.

Time: Classical information theory is at its weakest in dealing with problems of delay (e.g., information arriving late may be useless or have less value).
• Verdu, Polyanskiy, Kostina: major breakthrough in extending Shannon's capacity theorem to finite-blocklength information theory and to lossy data compression.
• Kumar: design of reliable scheduling policies with delay constraints for wireless networks; new axiomatic approach to secure wireless networks.
• Weissman: real-time coding system with lookahead to investigate the impact of delay on expected distortion and bit rate.
• Subramaniam: reconstructing networks from dynamic biological data.
• Ramkrishna: quantifying fluxes in biological networks by developing a causality-based dynamic theory of cellular metabolism that takes into account a survival strategy in controlling enzyme syntheses.
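The finite-blocklength results mentioned above (Verdu, Polyanskiy, Kostina) sharpen Shannon's asymptotic capacity into an estimate that depends on the block length n. As a rough illustration (not the authors' code), the normal approximation for a binary symmetric channel can be evaluated in a few lines; the crossover probability p, error tolerance eps, and block lengths below are arbitrary illustrative values.

```python
# Sketch: normal approximation to the best channel-coding rate at finite
# blocklength n, illustrated on a binary symmetric channel BSC(p).
# Values of p, eps, and n are arbitrary illustrative choices.
from math import log2, sqrt
from statistics import NormalDist

def bsc_rate(p: float, n: int, eps: float) -> float:
    """Approximate maximal coding rate (bits/channel use) for a BSC(p)."""
    h = -p * log2(p) - (1 - p) * log2(1 - p)            # binary entropy
    capacity = 1 - h                                     # Shannon capacity C
    dispersion = p * (1 - p) * log2((1 - p) / p) ** 2    # channel dispersion V
    q_inv = NormalDist().inv_cdf(1 - eps)                # Q^{-1}(eps)
    # (1/n) log2 M*(n, eps) ~ C - sqrt(V/n) * Q^{-1}(eps) + log2(n)/(2n)
    return capacity - sqrt(dispersion / n) * q_inv + log2(n) / (2 * n)

for n in (100, 1000, 10000):
    print(n, round(bsc_rate(0.11, n, 1e-3), 4))
```

The point of the finite-blocklength program is visible in the output: the achievable rate sits well below capacity at short block lengths and closes the gap only slowly, roughly like 1/sqrt(n).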
Space: Bialek explores the transmission of information in making spatial patterns in developing embryos – the relation to the information capacity of a communication channel.

Post-Shannon Challenges
Limited Resources: In many scenarios, information is limited by available computational resources (e.g., cell phone, living cell).
• Bialek works on the structure of molecular networks that optimize information flow, subject to constraints on the total number of molecules being used.
Semantics: Is there a way to account for the meaning or semantics of information?
• Sudan argues that the meaning of information becomes relevant whenever there is diversity across communicating parties and when the parties themselves evolve over time.

Representation-invariance: How do we know whether two representations of the same information are information-equivalent?
Learnable Information:
• Courtade, Weissman, and Verdu are developing an information-theoretic approach to recommender systems based on a novel use of multiterminal source coding under logarithmic loss.
• Atallah investigates how to enable an untrusted remote server to store and manipulate a client's confidential data.
• Grama, Subramaniam, Weissman & Kulkarni work on the analysis of extreme-scale networks; such datasets are noisy, and their analyses need fast probabilistic algorithms for effective search in compressed representations.

Outline
1. Science of Information
2. Center Mission: STC Team & Staff; Integrated Research
3. Research Results in Knowledge Extraction
4.
Center Mission

Mission and Center Goals
Advance science and technology through a new quantitative understanding of the representation, communication, and processing of information in biological, physical, social, and engineering systems.
Some specific Center goals:
Research:
• define core theoretical principles governing the transfer of information,
• develop metrics and methods for information,
• apply them to problems in the physical and social sciences and in engineering,
• offer a venue for multi-disciplinary long-term collaborations.
Education & Diversity:
• explore effective ways to educate students,
• train the next generation of researchers,
• broaden participation of underrepresented groups.
Knowledge Transfer:
• transfer advances in research to education and industry.

STC Team
• Bryn Mawr College: D. Kumar
• Howard University: C. Liu, L. Burge
• MIT: P. Shor (co-PI), M. Sudan (Microsoft)
• Purdue University (lead): W. Szpankowski (PI)
• Princeton University: S. Verdu (co-PI)
• Stanford University: A. Goldsmith (co-PI)
• Texas A&M: P. R. Kumar
• University of California, Berkeley: Bin Yu (co-PI)
• University of California, San Diego: S. Subramaniam
• UIUC: O. Milenkovic
Also: R. Aguilar, M. Atallah, C. Clifton, S. Datta, A. Grama, S. Jagannathan, A. Mathur, J. Neville, D. Ramkrishna, J. Rice, Z. Pizlo, L. Si, V. Rego, A. Qi, M. Ward, D. Blank, D. Xu, C. Liu, L. Burge, M. Garuba, S. Aaronson, N. Lynch, R. Rivest, Y. Polyanskiy, W. Bialek, S. Kulkarni, C. Sims, G. Bejerano, T. Cover, T. Weissman, V. Anantharam, J. Gallant, C. Kaufman, D. Tse, T. Coleman.

Center Participant Awards
• Nobel Prize (Economics): Chris Sims
• National Academies (NAS/NAE): Bialek, Cover, Datta, Lynch, Kumar, Ramkrishna, Rice, Rivest, Shor, Sims, Verdu
• Turing Award winner: Rivest
• Shannon Award winners: Cover and Verdu
• Nevanlinna Prize (outstanding contributions in mathematical aspects of information sciences): Sudan and Shor
• Richard W. Hamming Medal: Cover and Verdu
• Humboldt Research Award: Szpankowski

Integrated Research
Create a shared intellectual space, integral to the Center's activities, providing a collaborative research environment that crosses disciplinary and institutional boundaries.
Research thrusts:
1. Life Sciences
2. Communication
3. Knowledge Extraction (Big/Small Data)
(Pictured: S. Subramaniam, A. Grama, D. Tse, T. Weissman, S. Kulkarni, M. Atallah)

Knowledge Extraction
Big Data characteristics:
• Large (peta- and exa-scale)
• Noisy (high rate of false positives and negatives)
• Multiscale (interaction at vastly different levels of abstraction)
• Dynamic (temporal and spatial changes)
• Heterogeneous (high variability over space and time)
• Distributed (collected and stored at distributed locations)
• Elastic (flexibility of the data model and clustering capabilities)
Small Data: a limited amount of data, often insufficient to extract information (e.g., classification of Twitter messages)

Analysis of Big/Small Data
Some characteristics of analysis techniques:
• Results are typically probabilistic (formulations must quantify and optimize statistical significance; deterministic formulations on noisy data are not meaningful).
• Overfitting to noisy data is a major problem.
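The overfitting point can be made concrete with a toy experiment (illustrative only, not Center code): in purely random data, some pattern always looks "enriched," so a deterministic most-frequent-pattern report is meaningless without a significance calculation against a noise model.

```python
# Sketch: why raw counts on noisy data mislead. In uniformly random
# bits, the most frequent 8-bit pattern always exceeds its expected
# count; reporting it as "structure" without a significance test is
# meaningless. Illustrative toy example only.
import random
from collections import Counter

random.seed(0)
bits = "".join(random.choice("01") for _ in range(100_000))

# Count every length-8 window and compare the winner to its expectation.
counts = Counter(bits[i:i + 8] for i in range(len(bits) - 7))
expected = (len(bits) - 7) / 256          # mean count under pure noise
pattern, top = counts.most_common(1)[0]

print(f"'best' pattern {pattern}: {top} occurrences, expected {expected:.0f}")
# The excess is an ordinary fluctuation, not signal; only a statistical
# model of the noise can tell the difference.
```

The same effect at petascale is far worse: with enough data and enough candidate patterns, spurious "discoveries" are guaranteed unless significance is quantified, which is exactly the point of the probabilistic formulations above.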
• Distribution-agnostic formulations (say, based on simple counts and frequencies) are not meaningful.
• Provide rigorously validated solutions.
Dynamic and heterogeneous datasets require a significant formal basis: ad-hoc solutions do not work at scale! We need new foundations and fundamental results.

Outline
1. Science of Information
2. Center Mission
3. Results in Knowledge Extraction and/or Life Sciences
4. Center Mission

Research Results
• Information Theory of DNA Sequencing – Tse, Motahari, Bresler, and Bresler (Berkeley)
• Structural Information/Compression – Szpankowski and Choi (Purdue & Venter Institute)
• Pathways in Biological Regulatory Networks – Grama, Mohammadi, and Subramaniam (Purdue & UCSD)
• Deinterleaving Markov Processes: Finding Relevant Information – Seroussi, Szpankowski, and Weinberger (HP & Purdue)
• Information-Theoretic Models of Knowledge Extraction Problems – Courtade, Weissman, and Verdu (Stanford & Princeton)

Research Results: Information Theory of DNA Sequencing (Tse, Motahari, Bresler, and Bresler – Berkeley)

Information Theory of Shotgun Sequencing
Reads are assembled to reconstruct the original genome.
State of the art: many sequencing technologies and many assembly algorithms, with ad-hoc designs tailored to specific technologies.
Our goal: a systematic, unified design framework.
Central question: Given the statistics of the genome, for what read length L and number of reads N is reliable reconstruction possible? An optimal algorithm is one that achieves this fundamental limit.

Simple Model: I.I.D. Genome (Motahari, Bresler & Tse 2012)
[Figure: number of reads N/Ncov versus read length L. Reconstruction by the greedy algorithm is possible above the curve; for L below (2 log G)/H_Renyi there are too many repeats of length L, and for N/Ncov below 1 there is no coverage of the genome.]

Real Genome Statistics
Example: Staphylococcus aureus (Bresler, Bresler & Tse 2012)
[Figure: upper and lower bounds on Nmin/Ncov versus read length L, based on repeat statistics, at 99% reconstruction reliability: K-mer repeat lower bound, coverage lower bound, i.i.d. fit, and curves for the greedy, improved K-mer, and optimized de Bruijn algorithms.]
A data-driven approach to finding near-optimal assembly algorithms.

Research Results: Structural Information/Compression (Szpankowski & Choi – Purdue & Venter Institute, IEEE Trans. Information Theory, 58, 2012)

Graph Structures
Information content of unlabeled graphs: a structure model S of a graph G is defined for an unlabeled version; some labeled graphs have the same structure.
Graph entropy vs. structural entropy: the probability of a structure S is P(S) = N(S) · P(G), where N(S) is the number of different labeled graphs having the same structure.

H_G and H_S

Structural Entropy – Main Result
Consider Erdős–Rényi graphs G(n,p).

Structural Zip (SZIP) Algorithm
Compression algorithm called Structural Zip, SZIP for short (demo).

Asymptotic Optimality of SZIP

Outline
1. Science of Information
2. Center Mission
3.
Research Results: Pathways in Biological Regulatory Networks (Grama, Mohammadi, Subramaniam – Purdue & UCSD)

Decoding Network Footprints
• We have initiated an ambitious computational effort aimed at constructing the network footprint of degenerative diseases (Alzheimer's, Parkinson's, cancers).
• Understanding the pathways implicated in degenerative diseases holds the potential for novel interventions, drug design/repurposing, and diagnostic/prognostic markers.
• Using rigorous random-walk techniques, TOR signaling complexes are reconstructed – this identifies temporal and spatial aspects of cell growth!

Network-Guided Characterization of Phenotype

Computationally Derived Aging Map
This map includes various forms of interactions, including protein interactions, gene regulation, and post-translational modifications. The underlying structural-information extraction technique relies on rigorous quantification of information flow. It allows us to understand temporal and spatial aspects of cell growth.

Research Results: Deinterleaving Markov Processes – Finding a Needle in a Haystack (Seroussi, Szpankowski, Weinberger – Purdue & HP, IEEE Trans. Information Theory, 58, 2012)

Interleaved Streams
User 1 output: aababbaaababaabbbbababbabaaaabbb
User 2 output: ABCABCABCABCABCABCABCABCABCAB
But only one log file is available, with no correlation between users:
aABacbAabbBaCaABababCaABCAabBbc
Questions:
• Can we detect the existence of two users?
• Can we de-interleave the two streams?
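The setup above is easy to simulate. The sketch below (illustrative, not the paper's code) interleaves two independent streams through a memoryless switch, producing exactly the kind of single log file that the deinterleaver must un-mix; the alphabets, transition weights, and switch bias are arbitrary choices.

```python
# Sketch: generating an interleaved log from two independent sources
# via a memoryless (Bernoulli) switch, as in the problem setup.
# Alphabets, weights, and switch bias are arbitrary illustrative choices.
import itertools
import random

random.seed(1)

def user1():
    """Order-1 Markov source over {a, b} (weights are made up)."""
    state = "a"
    while True:
        state = random.choices("ab", weights=(3, 1) if state == "a" else (1, 1))[0]
        yield state

def user2():
    """Deterministic (hence Markov) cycle over {A, B, C}."""
    yield from itertools.cycle("ABC")

s1, s2 = user1(), user2()
# Memoryless switch: at each step, emit from user 1 with probability 0.6.
log = "".join(next(s1) if random.random() < 0.6 else next(s2) for _ in range(40))
print(log)
# The deinterleaver observes only `log`; the partition {a,b} vs {A,B,C}
# must be inferred from the data alone.
```

Note that within the log, user 2's symbols still appear in ABCABC... order; it is this residual structure, visible only under the correct partition of the alphabet, that the method exploits.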
Separating Interleaved Streams

Probabilistic Model

Problem Statement
Given a sample z^n of P, infer the partition(s) (w.h.p. as n → ∞).
Is there a unique solution? (Not always!)
Any partition Π' of A induces a representation of P, but
• the component processes P_Π'_i may not have finite memory, or may not be independent;
• the switch process P_Π'_w may not be memoryless (Markov).
For k = 1 the problem was studied by Batu et al. (2004):
• an approach to identify one valid IMP presentation of P w.h.p.
• greedy, very partial use of statistics, slow convergence
• but simple and fast
We propose an MDL approach for unknown k:
• much better results; finds all partitions compatible with P
• complex if implemented via exhaustive search, but a randomized gradient-descent approximation achieves virtually the same results

Finding Π* via Penalized ML

Main Result
Summary: F_k(z^n) = H(z^n) + β κ log n, with the estimate F̂(z^n) = argmin F_k(z^n) over candidate partitions.
Theorem: Let P ∈ I(P_1, P_2, ..., P_m; P_w) and k ≥ max_i ord(P_i). For β ≥ 2 in the penalization we have
F̂(z^n) = F_k(Π) (a.s.) as n → ∞,
that is, the right partition will be found almost surely (with probability 1).
Proof ideas:
1. P is an FSM process.
2. If P' is the "wrong" process (partition), then either one of the underlying processes or the switch process will not be of finite order, or some of the independence assumptions will be violated.
3. The penalty for P' will only increase.
4. Rigorous proofs use large-deviations results and Pinsker's inequality over the underlying state space.
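A minimal version of the penalized-ML criterion can be sketched as follows. This is a toy re-implementation under simplifying assumptions (order-1 Markov components, β = 2, exhaustive search over two-block partitions of the alphabet), not the authors' code; the interleaved input is constructed deterministically for illustration.

```python
# Sketch: MDL-style scoring F = H + beta*kappa*log n of candidate
# alphabet partitions for deinterleaving. Toy assumptions: order-1
# Markov components, beta = 2, exhaustive search over 2-block partitions.
from collections import Counter
from itertools import combinations
from math import log2

def markov_cost(seq):
    """Negative log-likelihood (bits) of seq under its empirical
    order-1 Markov model, plus the model's free-parameter count."""
    if len(seq) < 2:
        return 0.0, 0
    pairs = Counter(zip(seq, seq[1:]))
    firsts = Counter(seq[:-1])
    nll = -sum(c * log2(c / firsts[a]) for (a, b), c in pairs.items())
    k = len(set(seq))
    return nll, k * (k - 1)

def score(log, block):
    """Penalized ML cost of splitting the alphabet into block/rest."""
    beta, n = 2, len(log)
    streams = ([c for c in log if c in block],       # component 1
               [c for c in log if c not in block],   # component 2
               [c in block for c in log])            # switch process
    nlls, kappas = zip(*(markov_cost(s) for s in streams))
    return sum(nlls) + beta * sum(kappas) * log2(n)

def best_partition(log):
    alphabet = sorted(set(log))
    cands = [set(c) for r in range(1, len(alphabet))
             for c in combinations(alphabet, r)]     # blocks and complements both appear
    return min(cands, key=lambda b: score(log, b))

# Deterministically interleave "abab..." and "ABCABC..." (switch pattern 1-1-2).
s1, s2 = iter("ab" * 60), iter("ABC" * 40)
log = "".join(next(s1) if p == "1" else next(s2) for p in ("112" * 60)[:150])
print(best_partition(log))
```

Under the correct partition, both component streams and the switch compress well, so its penalized cost is lowest; wrong partitions inflate either the entropy term (structure leaks into the switch) or the parameter penalty, which is the mechanism behind the theorem above.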
Experimental Results

Research Results: Information-Theoretic Models for Big Data (Courtade, Weissman, Verdu – Stanford & Princeton)

Modern Data Processing
Data is often processed for purposes other than reproduction of the original data (new goal: reliably answer queries rather than reproduce data!):
• Recommendation systems make suggestions based on prior information (e.g., email on a server, distributed data, prior search history). Recommendations are usually lists indexed by likelihood.
• Databases may be compressed for the purpose of answering queries of the form: "Is there an entry similar to y in the database?" The original (large) database is compressed; the compressed version can be stored at several locations; queries about the original database can be answered reliably from its compressed version.

Fundamental Limits
• Fundamental limits of distributed inference (e.g., data mining viewed as an inference problem: attempting to infer desired content given a query and a massive distributed dataset).
• Fundamental tradeoff: what is the minimum description (compression) rate required to generate a quantifiably good set of beliefs and/or reliable answers?
• General result: queries can be answered reliably if and only if the compression rate exceeds the identification rate!
• Practical algorithms to achieve these limits.

Information-Theoretic Formulation and Results
Distributed (multiterminal) source coding under logarithmic loss:
• Logarithmic loss is a natural penalty function when the processor output is a list of outcomes indexed by likelihood.
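Logarithmic loss has a simple operational reading: if the processor reports a belief distribution q over outcomes, the penalty when outcome x occurs is log2(1/q(x)) bits. A toy illustration (the belief values are made-up numbers, not from the paper):

```python
# Sketch: logarithmic loss log2(1/q[x]) of a reported belief list,
# the penalty function under which the multiterminal results hold.
# The belief distributions below are made-up illustrative numbers.
from math import log2

def log_loss(beliefs: dict, outcome: str) -> float:
    """Penalty (bits) when `beliefs` is reported and `outcome` occurs."""
    return log2(1.0 / beliefs[outcome])

hedged    = {"song A": 0.5, "song B": 0.3, "song C": 0.2}
confident = {"song A": 0.98, "song B": 0.01, "song C": 0.01}

# If the user actually wanted song B, the confident-but-wrong list
# pays far more than the hedged one:
print(log_loss(hedged, "song B"))     # ~1.74 bits
print(log_loss(confident, "song B"))  # ~6.64 bits
```

Because the loss rewards well-calibrated ranked lists rather than a single hard reconstruction, it matches the recommender-system setting above, and it is what makes the multiterminal problem tractable.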
• Rate constraints capture complexity constraints (e.g., a limited number of queries to a database, or a compressed/cached version of a larger database).
• Multiterminal (distributed) problems can be solved under logarithmic loss!
• Such fundamental limits were unknown for the last 40 years(!) except in the case of jointly Gaussian sources under MSE constraints.

Outline
1. Science of Information
2. Center Mission
3. Research Results in Knowledge Extraction (Big Data)
4. Center Mission: Grand Challenges (Big Data and Life Sciences); Education & Diversity; Knowledge Transfer

Grand Challenges
1. Data Representation and Analysis (Big Data)
• Succinct data structures (compressed structures that lend themselves to efficient operations)
• Metrics: tradeoff between storage efficiency and query cost
• Validation on specific data (sequences, networks, structures, high-dimensional sets)
• Streaming data (correlation and anomaly detection, semantics, association)
• Generative models for data (dynamic model generation)
• Identifying statistically correlated graph modules (e.g., given a graph with edges weighted by node correlations and a generative model for graphs, identify the most statistically significantly correlated sub-networks)
• Information-theoretic methods for inference of causality from data and modeling of agents
• Complexity of flux models/networks

Grand Challenges
2. Challenges in the Life Sciences
• Sequence analysis
  – Reconstructing sequences for emerging nanopore sequencers
  – Optimal error correction of NGS reads
• Genome-wide associations
  – Rigorous approach to associations while accurately quantifying the prior in the data
• Darwin channel and repetitive channel
• Semantic and syntactic integration of data from different datasets (with the same or different abstractions)
• Construction of flux models guided by measures of information-theoretic complexity of models
• Quantification of query-response quality in the presence of noise
• Rethinking interactions and signals from fMRI
• Developing an in-vivo system to monitor an animal brain in 3D (will allow us to see the interaction between different cells in the cortex: flow of information, structural relations)

Outline
1. Science of Information
2. Center Mission
3. Recent Research Results in Knowledge Extraction
4. Center Mission: Grand Challenges; Education & Diversity; Knowledge Transfer

Education and Diversity
Integrate cutting-edge, multidisciplinary research and education efforts across the Center to advance the training and diversity of the workforce.
1. Summer School, Purdue, May 2011
2. Introduction to Science of Information (HONR 399)
3. Summer School, Stanford, May 2012
4. Student Workshop & Student Teams, July 2012
5. NSF TUES Grant, 2012-2014
6. Information Theory & CSoI School, Purdue 2013
7. Channels Program
   • summer REU
   • academic-year DREU
   • new collaborations (female & Black post-docs/faculty)
8. Diversity Database
(Pictured: D. Kumar, M. Ward, B. Gibson, B. Ladd)

Knowledge Transfer
Develop effective mechanisms for interactions between the Center and external stakeholders to support the exchange of knowledge, data, and the application of new technology.
• SoiHub.org refined with a wiki to enhance communication
• Brown-bag seminars, research seminars, Prestige Lecture Series
• Development of open education resources
• International collaborations (LINCS, Colombia)
• Special Session on SoI at CISS'12, Princeton
• Collaboration with industry (consortium)
(Pictured: Ananth Grama)

Prestige Lecture Series