NSF Center for Science of Information:
Overview & Results
Bryn Mawr
Howard
MIT
Princeton
Purdue
Stanford
Texas A&M
UC Berkeley
UC San Diego
UIUC
National Science Foundation/Science & Technology Centers Program
Outline
1. Science of Information
2. Center Mission
Integrated Research
Education and Diversity
Knowledge Transfer
3. Research
Graph Compression
Deinterleaving Markov Processes
Shannon Legacy
The Information Revolution started in 1948 with the publication of:
A Mathematical Theory of Communication.
The digital age began.
Claude Shannon:
Shannon information quantifies the extent to which a recipient of data can reduce its statistical uncertainty.
"semantic aspects of communication are irrelevant ..."
Fundamental Limits for Storage and Communication (Information Theory Field)
Applications Enabler/Driver:
CD, iPod, DVD, video games, Internet, Facebook, WiFi, mobile, Google, ...
Design Driver:
universal data compression, voiceband modems, CDMA, multiantenna, discrete denoising, space-time codes, cryptography, ...
Post-Shannon Challenges
1. Back from infinity: Extend Shannon's findings to finite-size data structures (e.g., sequences, graphs); that is, develop information theory of various data structures beyond first-order asymptotics.
2. Science of Information: Extend information theory to meet new challenges in biology, massive data, communication, knowledge extraction, economics, ...
To accomplish this, we must understand new aspects of information in: structure, time, space, and semantics, as well as dynamic information, limited resources, complexity, representation-invariant information, and cooperation & dependency.
Post-Shannon Challenges
Structure:
Measures are needed for quantifying information embodied in structures (e.g., information in material structures, nanostructures, biomolecules, gene regulatory networks, protein networks, social networks, financial transactions).
Szpankowski & Choi: information contained in unlabeled graphs & universal graph compression.
Grama & Subramaniam: quantifying the role of noise and incomplete data in biological network reconstruction.
Neville: outlining characteristics (e.g., weak dependence) sufficient for network models to be well-defined in the limit.
Yu & Qi: finding distributions of latent structures in social networks.
Post-Shannon Challenges
Time:
Classical information theory is at its weakest in dealing with problems of delay (e.g., information arriving late may be useless or have less value).
Verdu & Polyanskiy: a major breakthrough in extending Shannon's capacity theorem to finite-blocklength information theory.
Kumar: designing reliable scheduling policies with delay constraints for wireless networks.
Weissman: a real-time coding system to investigate the impact of delay on expected distortion.
Subramaniam: reconstructing networks from dynamic biological data.
Ramkrishna: quantifying fluxes in biological networks by showing that enzyme metabolism is regulated by a survival strategy in controlling enzyme syntheses.
Space:
Bialek: explores the transmission of information in making spatial patterns in developing embryos, and its relation to the information capacity of a communication channel.
Post-Shannon Challenges
Limited Resources:
In many scenarios, information is limited by available computational resources (e.g., a cell phone, a living cell).
Bialek: works on the structure of molecular networks that optimize information flow, subject to constraints on the total number of molecules being used.
Verdu & Goldsmith: investigate the minimum energy per bit as a function of data length in Gaussian channels.
Semantics:
Is there a way to account for the meaning or semantics of information?
Sudan: argues that the meaning of information becomes relevant whenever there is diversity across communicating parties and when the parties themselves evolve over time.
Post-Shannon Challenges
Representation-invariance:
How do we know whether two representations of the same information are information-equivalent?
Learnable Information (BigData):
Data-driven science focuses on extracting information from data. How much information can actually be extracted from a given data repository? How much knowledge is in Google's database?
Atallah: investigates how to enable an untrusted remote server to store and manipulate a client's confidential data.
Clifton: investigates differential privacy methods for graphs to determine conditions under which privacy is practical.
Grama, Subramaniam and Szpankowski: work on the analysis of extreme-scale networks; such datasets are noisy, and their analysis needs fast probabilistic algorithms and compressed representations.
Post-Shannon Challenges
Cooperation & Dependency:
How does cooperation impact information (nodes should cooperate in their own self-interest)?
Cover: initiated a theory of cooperation and coordination in networks, that is, the study of the achievable joint distributions among network nodes when the communication rates are given.
Sims: dependency and rational expectation are critical ingredients in his work on modern dynamic economic theory.
Coleman: studies statistical causality in neural systems using Granger principles.
Quantum Information:
The flow of information in macroscopic systems and microscopic systems may possess different characteristics.
Aaronson and Shor lead these investigations; Aaronson studies the computational complexity of linear optics.
Outline
1. Science of Information
2. Center Mission
1. Integrated Research
2. Education and Diversity
3. Knowledge Transfer
3. Research
Mission and Center Goals
Advance science and technology through a new quantitative
understanding of the representation, communication and processing of
information in biological, physical, social and engineering systems.
Some specific Center goals:
• define core theoretical principles governing the transfer of information,
• develop metrics and methods for information,
• apply them to problems in the physical and social sciences, and in engineering,
• offer a venue for multi-disciplinary long-term collaborations,
• explore effective ways to educate students,
• train the next generation of researchers,
• broaden participation of underrepresented groups,
• transfer advances in research to education and industry.
(Diagram: Research, Education & Diversity, Knowledge Transfer.)
STC Team
Bryn Mawr College: D. Kumar
Howard University: C. Liu, L. Burge
MIT: P. Shor (co-PI), M. Sudan (Microsoft)
Purdue University (lead): W. Szpankowski (PI)
Princeton University: S. Verdu (co-PI)
Stanford University: A. Goldsmith (co-PI)
Texas A&M: P.R. Kumar
University of California, Berkeley: B. Yu (co-PI)
University of California, San Diego: S. Subramaniam
UIUC: O. Milenkovic
Other participants: R. Aguilar, M. Atallah, C. Clifton, S. Datta, A. Grama, S. Jagannathan, A. Mathur, J. Neville, D. Ramkrishna, J. Rice, Z. Pizlo, L. Si, V. Rego, A. Qi, M. Ward, D. Blank, D. Xu, C. Liu, L. Burge, M. Garuba, S. Aaronson, N. Lynch, R. Rivest, Y. Polyanskiy, W. Bialek, S. Kulkarni, C. Sims, G. Bejerano, T. Cover, T. Weissman, V. Anantharam, J. Gallant, C. Kaufman, D. Tse, T. Coleman.
Center Participant Awards
• Nobel Prize (Economics): Chris Sims
• National Academies (NAS/NAE): Bialek, Cover, Datta, Lynch, Kumar, Ramkrishna, Rice, Rivest, Shor, Sims, Verdu
• Turing Award: Rivest
• Shannon Award: Cover and Verdu
• Nevanlinna Prize (outstanding contributions in mathematical aspects of information sciences): Sudan and Shor
• Richard W. Hamming Medal: Cover and Verdu
• Humboldt Research Award: Szpankowski
STC Staff
Director – Wojciech Szpankowski
Managing Director – Bob Brown
Education Director – Brent Ladd
Diversity Director – Barbara Gibson
Multimedia Specialist – Mike Atwell
Administrative Asst. – Kiya Smith
Integrated Research
Create a shared intellectual space, integral to the Center's activities, providing a collaborative research environment that crosses disciplinary and institutional boundaries.
Research Thrusts:
1. Life Sciences
2. Communication
3. Knowledge Management (BigData)
(Pictured thrust leads: S. Subramaniam, A. Grama, D. Tse, T. Weissman, S. Kulkarni, M. Atallah.)
Opportunistic Research Workshops
Kickoff Workshop – Chicago, IL – October 6–7, 2010 (28 STC participants and students)
1. Allerton Workshop, IL – September 28, 2010
2. Stanford Workshop – Palo Alto, CA – January 24, 2011 (Princeton – Stanford – Berkeley)
3. UC San Diego Workshop – February 11, 2011 (Berkeley – UCSD)
4. Princeton Workshop – May 13, 2011 (Princeton – Purdue)
5. Purdue Workshop – September 9, 2011 (Berkeley – UCSD – Purdue)
6. Allerton Workshop – September 27, 2011 (Purdue – UIUC)
7. UCSD Workshop on Neuroscience – February 2012 (UCSD – Berkeley)
8. Princeton Grand & Petit Challenges Workshop – March 2012 (Center-wide)
9. Maui Workshop (ICC, Industrial Round-Table) – May 2012 (Center-wide)
10. MIT Workshop – July 2012
Grand and Petit Challenges
1. Data Representation and Analysis
• Succinct data structures (compressed structures that lend themselves to efficient operations)
• Metrics: trade-off between storage efficiency and query cost
• Validation on specific data (sequences, networks, structures, high-dimensional sets)
• Streaming data (correlation and anomaly detection, semantics, association)
• Generative models for data (dynamic model generation)
• Identifying statistically correlated graph modules (e.g., given a graph with edges weighted by node correlations and a generative model for graphs, identify the most statistically significantly correlated sub-networks)
• Information-theoretic methods for inference of causality from data and modeling of agents
• Complexity of flux models/networks
Grand and Petit Challenges
2. Challenges in Life Sciences
• Sequence analysis
  – Reconstructing sequences for emerging nanopore sequencers
  – Optimal error correction of NGS reads
• Genome-wide associations
  – A rigorous approach to associations while accurately quantifying the prior in data
• Darwin channel and repetitive channel
• Semantic and syntactic integration of data from different datasets (with the same or different abstractions)
• Construction of flux models guided by measures of the information-theoretic complexity of the models
• Quantification of query response quality in the presence of noise
• Rethinking interactions and signals from fMRI
• Developing in-vivo systems to monitor an animal brain in 3D (allowing one to see the interaction between different cells in the cortex: flow of information, structural relations)
Grand and Petit Challenges
3. Challenges in Economics
• Study the impact of information flow on the dynamics of economic systems
• Impact of finite-rate feedback and delay on system behavior
• Impact of agents with widely differing capabilities (information and computation) on the overall system state, and on the states of individual agents
• Choosing portfolios in the presence of information constraints
Grand and Petit Challenges
4. Challenges in Communications
• Coordination over networks
  – Game-theoretic solution concepts for coordination over networks
  – Applications in control
• Relearning information theory in non-asymptotic regimes
  – Modeling energy, computational power, feedback, sampling, and delay
• Point-to-point communication with delay and/or finite-rate feedback
• Back from infinity (analytic and non-asymptotic information theory)
• Information and computation
  a. Quantifying fundamental limits of in-network computation, and the computing capacity of networks for different functions
  b. Complexity of distributed computation in wireless and wired networks
  c. Information-theoretic study of aggregation for scalable query processing
Prestige Lecture Series
Outline
1. Science of Information
2. Center Mission
1. Integrated Research
2. Education and Diversity
3. Knowledge Transfer
3. Research
Graph Compression
Deinterleaving Markov Processes
Structural Information
Information Theory of Data Structures: Following Ziv (1997), we propose to explore the finite-size information theory of data structures (e.g., sequences, graphs), that is, to develop information theory of various data structures beyond first-order asymptotics. We focus here on the information of graphical structures (unlabeled graphs).
F. Brooks, Jr. (JACM, 50, 2003, "Three Great Challenges for ... CS"):
"We have no theory that gives us a metric for the information embodied in structure. This is the most fundamental gap in the theoretical underpinnings of information science and of computer science."
Networks (the Internet, protein-protein interactions, collaboration networks) and matter (chemicals and proteins) have structure. They can be abstracted by (unlabeled) graphs.
(Y. Choi, W. Szpankowski, IEEE Trans. Info. Th., 58, 2012)
Graph and Structural Entropy
Information Content of Unlabeled Graphs:
A structure model S of a graph G is defined as its unlabeled version. Several labeled graphs can have the same structure.
Graph Entropy vs. Structural Entropy:
The probability of a structure S is
P(S) = N(S) · P(G),
where N(S) is the number of different labeled graphs having the same structure.
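To make the counting concrete, here is a small sketch (ours, not from the deck) that computes N(S) and P(S) for a 4-vertex path under the Erdős–Rényi model G(n, p), using the standard fact that N(S) = n!/|Aut(G)|:

```python
import math
import networkx as nx
from networkx.algorithms.isomorphism import GraphMatcher

# Structure S: the 4-vertex path, viewed as an unlabeled graph,
# under the Erdos-Renyi model G(4, 1/2).
G = nx.path_graph(4)
n, m, p = G.number_of_nodes(), G.number_of_edges(), 0.5

# |Aut(G)|: count self-isomorphisms (2 for a path: identity and reversal).
aut = sum(1 for _ in GraphMatcher(G, G).isomorphisms_iter())

N_S = math.factorial(n) // aut                 # labeled graphs with structure S
P_G = p**m * (1 - p)**(math.comb(n, 2) - m)    # probability of one labeled graph
P_S = N_S * P_G                                # P(S) = N(S) * P(G)
print(N_S, P_S)                                # 12 labeled copies of P4; P(S) = 12/64
```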
Relationship between H_G and H_S
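This slide's content is a figure in the original deck; the relationship it refers to can be recovered from the previous slide (a derivation, assuming as in the Erdős–Rényi model that all labeled versions of a structure are equiprobable, so that N(S) = n!/|Aut(S)|; here H_G is the entropy of the labeled-graph ensemble and H_S the entropy of the structure ensemble):

$$H_S \;=\; -\sum_S P(S)\log P(S) \;=\; H_G - \sum_S P(S)\log N(S) \;=\; H_G - \log n! + \sum_S P(S)\log\lvert\mathrm{Aut}(S)\rvert .$$

Since almost every Erdős–Rényi graph has a trivial automorphism group, the last term vanishes asymptotically and H_S ≈ H_G − log n!.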
Erdős–Rényi Graph Model
Structural Entropy
Structural Zip (SZIP) Algorithm
A compression algorithm called Structural Zip, SZIP for short (demo).
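The demo itself is not reproducible here; as a rough back-of-the-envelope illustration (ours) of the savings at stake: for G(n, p), almost every graph has a trivial automorphism group, so the structural entropy sits roughly log2 n! bits below the graph entropy, and that gap is the saving a structural compressor like SZIP can aim to realize:

```python
import math

n, p = 1000, 0.1
h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # binary entropy h(p), bits
H_G = math.comb(n, 2) * h               # graph entropy of G(n, p), in bits
saving = math.log2(math.factorial(n))   # ~ log2 n!, the label information
print(f"H_G ~ {H_G:,.0f} bits; structural saving ~ {saving:,.0f} bits "
      f"({saving / H_G:.1%})")
```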
Asymptotic Optimality of SZIP
Outline
1. Science of Information
2. Center Mission
1. Integrated Research
2. Education and Diversity
3. Knowledge Transfer
3. Research
Graph Compression
Deinterleaving Markov Processes
Interleaved Streams
User 1 output:
aababbaaababaabbbbababbabaaaabbb
User 2 output:
ABCABCABCABCABCABCABCABCABCAB
But only one log file is available, with no correlation between users:
aABacbAabbBaCaABababCaABCAabBbc
Questions:
• Can we detect the existence of two users?
• Can we de-interleave the two streams?
(G. Seroussi, W.S., M. Weinberger, IEEE Trans. Info. Th., 2012.)
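A minimal sketch (ours; the transition probabilities are invented) of the generative model behind this example: a memoryless switch picks which of two independent component processes, over disjoint alphabets, emits the next symbol, producing a single interleaved log:

```python
import random

rng = random.Random(1)

# Component 1: first-order Markov chain over {a, b}.
T1 = {'a': [('a', 0.6), ('b', 0.4)], 'b': [('a', 0.5), ('b', 0.5)]}
s1 = 'a'
# Component 2: deterministic cycle over {A, B, C} (a degenerate Markov chain).
cycle, i2 = 'ABC', 0

z = []
for _ in range(40):
    if rng.random() < 0.5:        # memoryless switch P_w picks a component
        z.append(s1)              # component 1 emits, then advances its state
        syms, ws = zip(*T1[s1])
        s1 = rng.choices(syms, weights=ws)[0]
    else:
        z.append(cycle[i2])       # component 2 emits, then advances
        i2 = (i2 + 1) % 3
print(''.join(z))                 # one interleaved log, e.g. 'aAaBCabA...'
```

Deinterleaving asks us to recover the alphabet partition and the component processes from z alone.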
Probabilistic Model
Problem Statement
Given a sample z^n of P, infer the partition(s) Π (w.h.p. as n → ∞).
Some observations:
• Is there a unique solution? (Not always!)
• Any partition Π′ of A induces a representation of P, but
  – the P′_i may not have finite memory, or the P′_i's may not be independent
  – P′_w (the switch) may not be memoryless (Markov)
• For k = 1 the problem was studied by Batu et al. (2004):
  – an approach to identify one valid IMP representation of P w.h.p.
  – greedy, with very partial use of the statistics ⇒ slow convergence
  – but simple and fast
• We propose an MDL approach for unknown k:
  – much better results; finds all partitions compatible with P
  – complex if implemented via exhaustive search, but a randomized gradient descent approximation achieves virtually the same results
Uniqueness of Representation
Clearly, if P_i is memoryless and |A_i| > 1, then any re-partitioning of A_i leads to another partition compatible with P.
• Canonical partition (w.r.t. P): one in which all such A_i's are singletons.
• Π* = the canonical version of Π (both compatible with P).
Theorem: If Π is compatible with P, then Π′ is compatible with P iff Π′* = Π* ⇒ the IMP representation is unique up to re-partitioning of "memoryless alphabets".
Proof idea (warm-up):
• s: any string over A_i with P_i(s) > 0
• s′ = s[A′_j]: the sub-string of s "projected" onto A′_j
• u: a k-tuple over A_i \ A′_j with P_i(u) > 0 (there is always one)
• P_i(a | s) = P(a | s) / P_w(i) = P(a | s′) / P_w(i)
• P(a | s′) = P(a | s′u) = P(a | u) ⇒ independent of s′
• ⇒ P_i(a | s) is independent of s, for all a and s
Finding Π* via Penalized ML
Main Result
๐น๐‘˜ ๐’›๐‘› = ๐ป ๐’›๐‘› + ๐›ฝ๐œ…log(๐‘›)
๐น ๐’›๐‘› = argmin ๐น๐‘˜ ๐’›๐‘›
Theorem: Let P ๏€ฝ I ๏ƒ• ( P1 , P2 ,...,Pm ; Pw ) and k ๏€ฝ max i ๏ปord ( Pi )๏ฝ .
For ๏ข ๏€พ 2 in the penalization we have
๐น ๐’›๐‘› = ๐น๐‘˜ Π
(๐‘Ž. ๐‘ . ) ๐‘›
∞
Proof ideas:
1. P is an FSM process.
2. If P′ is the "wrong" process (partition), then either one of the underlying processes or the switch process will not be of finite order, or some of the independence assumptions will be violated.
3. The penalty for P′ will only increase.
4. The rigorous proofs use large-deviations results and Pinsker's inequality over the underlying state space.
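A brute-force sketch (ours: function names, the order-1 Markov assumption, and the choice of β are ours, and the paper uses a randomized gradient descent approximation rather than exhaustive search) of the penalized-ML criterion above: score every partition of the alphabet by empirical code length plus penalty, and keep the minimizer:

```python
import math
from collections import Counter

def markov_cost(seq):
    """Ideal code length (bits) of seq under its empirical order-1 Markov model."""
    if len(seq) < 2:
        return 0.0
    trans = Counter(zip(seq, seq[1:]))
    ctx = Counter(seq[:-1])
    return -sum(c * math.log2(c / ctx[a]) for (a, _), c in trans.items())

def iid_cost(seq):
    """Ideal code length (bits) of seq under its empirical memoryless model."""
    cnt = Counter(seq)
    return -sum(c * math.log2(c / len(seq)) for c in cnt.values())

def mdl_cost(z, partition, beta=2.5):
    """F_Pi(z^n): code length of switch + components, plus beta*kappa*log n."""
    block = {a: i for i, A in enumerate(partition) for a in A}
    cost = iid_cost([block[c] for c in z])      # memoryless switch sequence
    kappa = len(partition) - 1                  # free switch parameters
    for A in partition:
        sub = [c for c in z if c in A]          # projection onto sub-alphabet A
        cost += markov_cost(sub)
        kappa += len(A) * (len(A) - 1)          # order-1 Markov parameters
    return cost + beta * kappa * math.log2(len(z))

def partitions(items):
    """All set partitions of a list (feasible only for small alphabets)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]

z = "aABacbAabbBaCaABababCaABCAabBbc"           # the log from the earlier slide
best = min(partitions(sorted(set(z))), key=lambda p: mdl_cost(z, p))
print(best)
```

On such a short sample the minimizer need not match the intended partition; the theorem's guarantee is asymptotic in n.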
Experimental Results
Generalization and Future Work
• A different Markov order can be estimated for each P_i, as well as more general tree models
• Switch with memory: can be solved with similar tools
• Future work: intersecting sub-alphabets (much more challenging!)