Emerging Frontiers of Science of Information

Center for Science of Information:
From Information Reduction to Knowledge Extraction

Wojciech Szpankowski
Purdue University

National Science Foundation / Science & Technology Centers Program / December 2012

Partner institutions: Bryn Mawr, Howard, MIT, Princeton, Purdue, Stanford, Texas A&M, UC Berkeley, UC San Diego, UIUC
Outline
1. Science of Information
2. Center Mission
   - STC Team & Staff
   - Integrated Research
3. Research Results in Knowledge Extraction
4. Center Mission
   - Grand Challenges
   - Education & Diversity
   - Knowledge Transfer
Shannon Legacy
The Information Revolution started in 1948, with the publication of:
A Mathematical Theory of Communication.
The digital age began.
Claude Shannon:
Shannon information quantifies the extent to which a
recipient of data can reduce its statistical uncertainty.
“semantic aspects of communication are irrelevant . . .”
Fundamental Limits for Storage and Communication (Information Theory Field)
Applications Enabler/Driver:
CD, iPod, DVD, video games, Internet, Facebook, WiFi, mobile, Google, ...
Design Driver:
universal data compression, voiceband modems, CDMA, multiantenna, discrete denoising, space-time codes, cryptography, rate-distortion approach to Big Data.
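Shannon's measure is easy to make concrete. As a small illustration (ours, not the slide's), the entropy H = -Σ p log₂ p gives the average number of bits of statistical uncertainty a recipient can remove per symbol:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits: H = -sum p*log2(p), skipping zero-probability symbols."""
    return -sum(p * log2(p) for p in probs if p > 0)

# A fair coin carries 1 bit per flip; a biased coin carries less --
# exactly the uncertainty a recipient of the data can reduce.
print(entropy([0.5, 0.5]))   # 1.0
print(entropy([0.9, 0.1]))   # ~0.469
```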
Shannon Fundamental Results
Post-Shannon Challenges
1. Back from infinity: Extend Shannon's findings to finite-size data
structures (e.g., sequences, graphs); that is, develop information theory of
various data structures beyond first-order asymptotics.
2. Science of Information: Extend Information Theory to meet new
challenges in
biology, massive data, communication, knowledge extraction, economics,...
To accomplish this, we must understand new aspects of information in:
structure, time, space, and semantics,
and
dynamic information, limited resources, complexity, representation-invariant information, and cooperation & dependency.
Post-Shannon Challenges
Structure:
Measures are needed for quantifying
information embodied in structures (e.g.,
information in material structures,
nanostructures, biomolecules, gene regulatory
networks, protein networks, social networks,
financial transactions).
Szpankowski & Choi: information contained in unlabeled graphs & universal graphical compression.
Grama & Subramaniam: quantifying the role of noise and incomplete data, identifying conserved structures, finding orthologies in biological network reconstruction.
Neville: outlining characteristics (e.g., weak dependence) sufficient for network models to be well-defined in the limit.
Yu & Qi: finding distributions of latent structures in social networks.
Post-Shannon Challenges
Time:
Classical information theory is at its weakest in
dealing with problems of delay (e.g., information
arriving late may be useless or have less value).
Verdu, Polyanskiy, Kostina: major breakthrough in extending Shannon's capacity theorem to finite-blocklength information theory for lossy data compression.
Kumar: designs reliable scheduling policies with delay constraints for wireless networks; new axiomatic approach to secure wireless networks.
Weissman: real-time coding systems with lookahead to investigate the impact of delay on expected distortion and bit rate.
Subramaniam: reconstructs networks from dynamic biological data.
Ramkrishna: quantifies fluxes in biological networks by developing a causality-based dynamic theory of cellular metabolism that takes into account a survival strategy in controlling enzyme syntheses.
Space:
Bialek explores the transmission of information in making spatial
patterns in developing embryos and its relation to the
information capacity of a communication channel.
Post-Shannon Challenges
Limited Resources:
In many scenarios, information is limited by
available computational resources (e.g., cell
phone, living cell).
Bialek works on structure of molecular networks that
optimize information flow, subject to constraints on the total
number of molecules being used.
Semantics:
Is there a way to account for the meaning or
semantics of information?
Sudan argues that the meaning of information becomes
relevant whenever there is diversity across communicating
parties and when parties themselves evolve over time.
Post-Shannon Challenges
Representation-invariance:
How can we tell whether two representations of the
same information are informationally equivalent?
Learnable Information:
Courtade, Weissman, and Verdu are developing an
information-theoretic approach to recommender
systems based on a novel use of multiterminal source
coding under logarithmic loss.
Atallah investigates how to enable an untrusted
remote server to store and manipulate the client's
confidential data.
Grama, Subramaniam, Weissman & Kulkarni work
on the analysis of extreme-scale networks; such
datasets are noisy, and their analysis needs fast
probabilistic algorithms for effective search in
compressed representations.
Outline
1. Science of Information
2. Center Mission
   - STC Team & Staff
   - Integrated Research
3. Research Results in Knowledge Extraction
4. Center Mission
Mission and Center Goals
Advance science and technology through a new quantitative
understanding of the representation, communication and processing of
information in biological, physical, social and engineering systems.
Some specific Center goals:

Research:
• define core theoretical principles governing transfer of information,
• develop metrics and methods for information,
• apply them to problems in the physical and social sciences and in engineering,
• offer a venue for multi-disciplinary long-term collaborations.

Education & Diversity:
• explore effective ways to educate students,
• train the next generation of researchers,
• broaden participation of underrepresented groups.

Knowledge Transfer:
• transfer advances in research to education and industry.
STC Team
Bryn Mawr College: D. Kumar
Howard University: C. Liu, L. Burge
MIT: P. Shor (co-PI), M. Sudan (Microsoft)
Princeton University: S. Verdú (co-PI)
Purdue University (lead): W. Szpankowski (PI)
Stanford University: A. Goldsmith (co-PI)
Texas A&M: P.R. Kumar
University of California, Berkeley: B. Yu (co-PI)
University of California, San Diego: S. Subramaniam
UIUC: O. Milenkovic

Also: R. Aguilar, M. Atallah, C. Clifton, S. Datta, A. Grama, S. Jagannathan, A. Mathur, J. Neville, D. Ramkrishna, J. Rice, Z. Pizlo, L. Si, V. Rego, A. Qi, M. Ward, D. Blank, D. Xu, C. Liu, L. Burge, M. Garuba, S. Aaronson, N. Lynch, R. Rivest, Y. Polyanskiy, W. Bialek, S. Kulkarni, C. Sims, G. Bejerano, T. Cover, T. Weissman, V. Anantharam, J. Gallant, C. Kaufman, D. Tse, T. Coleman.
Center Participant Awards
• Nobel Prize (Economics): Chris Sims
• National Academies (NAS/NAE): Bialek, Cover, Datta, Lynch, Kumar, Ramkrishna, Rice, Rivest, Shor, Sims, Verdú
• Turing Award: Rivest
• Shannon Award: Cover and Verdú
• Nevanlinna Prize (outstanding contributions in mathematical aspects of information sciences): Sudan and Shor
• Richard W. Hamming Medal: Cover and Verdú
• Humboldt Research Award: Szpankowski
Integrated Research
Create a shared intellectual space, integral to the Center's activities, providing a collaborative research environment that crosses disciplinary and institutional boundaries.

Research Thrusts:
1. Life Sciences
2. Communication
3. Knowledge Extraction (Big/Small Data)

(S. Subramaniam, A. Grama, D. Tse, T. Weissman, S. Kulkarni, M. Atallah)
Knowledge Extraction
Big Data characteristics:
• Large (peta- and exa-scale)
• Noisy (high rate of false positives and negatives)
• Multiscale (interactions at vastly different levels of abstraction)
• Dynamic (temporal and spatial changes)
• Heterogeneous (high variability over space and time)
• Distributed (collected and stored at distributed locations)
• Elastic (flexibility in data model and clustering capabilities)

Small Data:
Limited amount of data, often insufficient to extract information
(e.g., classification of Twitter messages)
Analysis of Big/Small Data
Some characteristics of analysis techniques:
• Results are typically probabilistic (formulations must quantify and optimize statistical significance; deterministic formulations on noisy data are not meaningful).
• Overfitting to noisy data is a major problem.
• Distribution-agnostic formulations (say, based on simple counts and frequencies) are not meaningful.
• Solutions must be rigorously validated.
• Dynamic and heterogeneous datasets require a significant formal basis: ad-hoc solutions do not work at scale!
We need new foundations and fundamental results.
Outline
1. Science of Information
2. Center Mission
3. Results in Knowledge Extraction and/or Life Sciences
4. Center Mission
Research Results
• Information Theory of DNA Sequencing
  Tse, Motahari, Bresler, and Bresler (Berkeley)
• Structural Information/Compression
  Szpankowski and Choi (Purdue & Venter Institute)
• Pathways in Biological Regulatory Networks
  Grama, Mohammadi, and Subramaniam (Purdue & UCSD)
• Deinterleaving Markov Processes: Finding Relevant Information
  Seroussi, Szpankowski, and Weinberger (HP & Purdue)
• Information Theoretic Models of Knowledge Extraction Problems
  Courtade, Weissman, and Verdu (Stanford & Princeton)
Outline
1. Science of Information
2. Center Mission
3. Research Results
Information Theory of DNA Sequencing
Tse, Motahari, Bresler and Bresler -- Berkeley
4. Center Mission
Information Theory of Shotgun Sequencing
Reads are assembled to reconstruct the original genome.
State of the art: many sequencing technologies and many assembly algorithms,
with ad-hoc designs tailored to specific technologies.
Our goal: a systematic unified design framework.
Central question: Given statistics of the genome, for what read length L and
# of reads N is reliable reconstruction possible?
An optimal algorithm is one that achieves the fundamental limit.
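As a rough sketch of how such a limit looks (our own illustration under an i.i.d. genome model, not the authors' code), two back-of-the-envelope quantities govern the answer: covering a genome of length G with reads of length L takes roughly N_cov ≈ (G/L) ln G reads (a Lander-Waterman-style estimate), and reliable reconstruction further requires L to exceed about 2 ln G / H₂, where H₂ is the order-2 Rényi entropy of the base distribution:

```python
from math import log

def renyi2(probs):
    """Renyi entropy of order 2 (in nats): H2 = -ln(sum of p_i^2)."""
    return -log(sum(p * p for p in probs))

def critical_read_length(G, probs):
    """Approximate repeat-limited threshold for an i.i.d. genome of length G:
    below roughly 2*ln(G)/H2, long repeats make reliable reconstruction
    impossible no matter how many reads are taken."""
    return 2 * log(G) / renyi2(probs)

def coverage_reads(G, L):
    """Approximate number of reads needed just to cover the genome with
    high probability: N_cov ~ (G/L) * ln(G) (Lander-Waterman style)."""
    return (G / L) * log(G)

G = 4_000_000                # roughly a bacterial genome length
uniform = [0.25] * 4         # i.i.d. uniform bases (an assumption)
print(critical_read_length(G, uniform))  # ~22 bases
print(coverage_reads(G, 100))            # ~6.1e5 reads
```

A skewed base distribution has lower Rényi entropy, so repeats are more likely and the critical read length grows.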
Simple Model: I.I.D. Genome
(Motahari, Bresler & Tse 12)
[Phase diagram: normalized number of reads N/N_cov vs. read length L. Reconstruction fails when L is below (2 log G)/H_Rényi (too many repeats of length L) or when N/N_cov < 1 (no coverage of the genome); above both thresholds the genome is reconstructable by the greedy algorithm.]
Real Genome Statistics
Example: Staphylococcus aureus
(Bresler, Bresler & Tse 12)
Upper and lower bounds based on repeat statistics (99% reconstruction reliability).
[Plot: N_min/N_cov vs. read length L, comparing real data with an i.i.d. fit; lower bounds from genome coverage and from K-mer repeat statistics; performance of the greedy, improved K-mer, and optimized de Bruijn algorithms.]
A data-driven approach to finding near-optimal assembly algorithms.
Outline
1. Science of Information
2. Center Mission
3. Research Results
Structural Information/Compression
Szpankowski & Choi – Purdue & Venter Institute,
IEEE Trans. Information Theory, 58, 2012
4. Center Mission
Graph Structures
Information Content of Unlabeled Graphs:
A structure model S of a graph G is defined for an unlabeled version.
Some labeled graphs have the same structure.
Graph Entropy vs Structural Entropy:
The probability of a structure S is:
P(S) = N(S) · P(G)
where N(S) is the number of different labeled graphs having the same
structure.
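The relation P(S) = N(S) · P(G) can be checked by brute force on tiny graphs. The sketch below (our own illustration, not the authors' code) enumerates all 2³ labeled graphs on n = 3 nodes under G(n, p), groups them by isomorphism class via a canonical labeling, and compares the graph entropy H(G) with the structural entropy H(S); H(S) is smaller because the labeling information has been discarded:

```python
from itertools import combinations, permutations
from math import log2

n, p = 3, 0.3
edges = list(combinations(range(n), 2))

def canonical(graph):
    """Canonical form of a labeled graph: the lexicographically smallest
    edge list over all vertex relabelings (feasible only for tiny n)."""
    best = None
    for perm in permutations(range(n)):
        key = tuple(sorted(tuple(sorted((perm[u], perm[v]))) for u, v in graph))
        best = key if best is None or key < best else best
    return best

# P(G) for each labeled graph, then P(S) = N(S) * P(G) via grouping.
struct_prob = {}
HG = 0.0
for k in range(len(edges) + 1):
    for subset in combinations(edges, k):
        pg = p**k * (1 - p)**(len(edges) - k)   # probability of this labeled graph
        HG -= pg * log2(pg)
        s = canonical(subset)
        struct_prob[s] = struct_prob.get(s, 0.0) + pg

HS = -sum(q * log2(q) for q in struct_prob.values())
print(f"H(G) = {HG:.4f} bits, H(S) = {HS:.4f} bits")  # H(S) < H(G)
```

Here H(G) = 3·h(0.3) ≈ 2.644 bits while H(S) ≈ 1.645 bits; the gap is the information spent on labels.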
Graph Entropy H(G) and Structural Entropy H(S)
Structural Entropy – Main Result
Consider Erdös-Rényi graphs G(n,p).
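The main theorem can be stated from the published paper (Choi & Szpankowski, IEEE Trans. Information Theory, 58, 2012); in slightly simplified form, with h(p) the binary entropy function:

```latex
% Structural entropy of Erdos--Renyi graphs G(n,p) (simplified statement)
H_S \;=\; \binom{n}{2}\, h(p) \;-\; \log_2 n! \;+\; o(1),
\qquad h(p) \;=\; -p \log_2 p \;-\; (1-p)\log_2 (1-p),
```

valid provided p is in a range where G(n,p) is asymmetric with high probability; the subtracted log₂ n! term is precisely the label information removed in passing from the labeled graph G to its structure S.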
Structural Zip (SZIP) Algorithm
A compression algorithm called Structural Zip, SZIP for short (demo).
Asymptotic Optimality of SZIP
Outline
1. Science of Information
2. Center Mission
3. Research Results
Pathways in Biological Regulatory Networks
Grama, Mohammadi, Subramaniam – Purdue & UCSD
4. Center Mission
Decoding Network Footprints
• We have initiated an ambitious computational effort aimed at constructing the network footprint of degenerative diseases (Alzheimer's, Parkinson's, cancers).
• Understanding pathways implicated in degenerative diseases holds the potential for novel interventions, drug design/repurposing, and diagnostic/prognostic markers.
• Using rigorous random-walk techniques, TOR-complex signaling pathways are reconstructed; this identifies temporal and spatial aspects of cell growth!
Network-Guided Characterization of Phenotype
Computationally Derived Aging Map
This map includes various forms of interactions, including protein interactions, gene regulation, and post-translational modifications. The underlying structural-information extraction technique relies on rigorous quantification of information flow; it allows us to understand temporal and spatial aspects of cell growth.
Outline
1. Science of Information
2. Center Mission
3. Research Results
Deinterleaving Markov Process:
(Finding a Needle in a Haystack)
Seroussi, Szpankowski, Weinberger: Purdue & HP,
IEEE Trans. Information Theory, 58, 2012
4. Center Mission
Interleaved Streams
User 1 output:
aababbaaababaabbbbababbabaaaabbb
User 2 output:
ABCABCABCABCABCABCABCABCABCAB
But only one log file is available, with no correlation between users:
aABacbAabbBaCaABababCaABCAabBbc
Questions:
• Can we detect the existence of two users?
• Can we de-interleave the two streams?
Separating Interleaved Streams
Probabilistic Model
Problem Statement
Given a sample z^n of P, infer the partition(s) Π (w.h.p. as n → ∞).
• Is there a unique solution? (Not always!)
• Any partition Π' of A induces a representation of P, but
  – the P'_i may not have finite memory, or the P'_i may not be independent
  – P'_w (the switch) may not be memoryless (Markov)
• For k = 1 the problem was studied by Batu et al. (2004):
  – an approach to identify one valid IMP presentation of P w.h.p.
  – greedy, very partial use of statistics, hence slow convergence
  – but simple and fast
• We propose an MDL approach for unknown k:
  – much better results; finds all partitions compatible with P
  – complex if implemented via exhaustive search, but a randomized gradient-descent approximation achieves virtually the same results
Finding Π* via Penalized ML
Main Result
Summary: F_{k,Π}(z^n) = Ĥ_Π(z^n) + β κ_Π log n,   F(z^n) = argmin_Π F_{k,Π}(z^n)

Theorem: Let P = I_Π(P_1, P_2, ..., P_m; P_w) and k ≥ max_i ord(P_i).
For β > 2 in the penalization we have
F(z^n) = F_{k,Π}(z^n) almost surely as n → ∞,
that is, the right partition Π will be found almost surely (with probability 1).

Proof ideas:
1. P is an FSM process.
2. If P' is the "wrong" process (partition), then either one of the underlying processes or the switch process will not be of finite order, or some of the independence assumptions will be violated.
3. The penalty for P' will only increase.
4. Rigorous proofs use large-deviations results and Pinsker's inequality over the underlying state space.
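A toy version of the penalized-ML search can be sketched in a few lines (our own hedged illustration, not the authors' implementation; β, the order-1 models, and the parameter count κ are simplifications). It scores each candidate two-block partition by the empirical order-1 code lengths of its component streams and its switch stream, penalized by β·κ·log n, and keeps the minimizer:

```python
import random
from collections import Counter
from itertools import combinations
from math import log2

def markov1_codelength(seq):
    """Empirical order-1 Markov code length in bits: n * H(next | current)."""
    pairs = Counter(zip(seq, seq[1:]))
    ctx = Counter(seq[:-1])
    return -sum(c * log2(c / ctx[a]) for (a, b), c in pairs.items())

def cost(z, block, beta=2.0):
    """Penalized ML cost for the 2-partition (block, complement): component
    + switch code lengths plus beta * kappa * log n, kappa = #parameters."""
    s1 = [c for c in z if c in block]
    s2 = [c for c in z if c not in block]
    switch = [c in block for c in z]
    m1, m2 = len(set(s1)), len(set(s2))
    kappa = m1 * (m1 - 1) + m2 * (m2 - 1) + 2
    return (markov1_codelength(s1) + markov1_codelength(s2) +
            markov1_codelength(switch) + beta * kappa * log2(len(z)))

def deinterleave(z):
    """Exhaustive search over 2-partitions of the alphabet (toy-sized only;
    the paper replaces this with randomized gradient descent)."""
    alphabet = sorted(set(z))
    best, seen = None, set()
    for r in range(1, len(alphabet)):
        for block in combinations(alphabet, r):
            b = frozenset(block)
            if b in seen:
                continue
            seen.update([b, frozenset(alphabet) - b])
            f = cost(z, b)
            if best is None or f < best[0]:
                best = (f, b)
    return best[1]

# Two periodic users ('abab...' and 'ABCABC...') interleaved by a fair coin,
# as in the log-file example above.
random.seed(0)
u1, u2, z = "ab", "ABC", []
i = j = 0
for _ in range(2000):
    if random.random() < 0.5:
        z.append(u1[i % 2]); i += 1
    else:
        z.append(u2[j % 3]); j += 1
print(sorted(deinterleave(z)))  # ['a', 'b']
```

The correct split wins because both component streams become deterministic (zero empirical entropy), leaving only the switch cost, while any wrong split mixes the streams and inflates the code length.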
Experimental Results
Outline
1. Science of Information
2. Center Mission
3. Research Results
Information Theoretic Models for Big Data
(Courtade, Weissman, Verdu – Stanford & Princeton)
4. Center Mission
Modern Data Processing
Data is often processed for purposes other than reproduction of the original data
(new goal: reliably answer queries rather than reproduce data!).

• Recommendation systems make suggestions based on prior information (e.g., email on a server, distributed data, prior search history). Recommendations are usually lists indexed by likelihood.

• Databases may be compressed for the purpose of answering queries of the form: "Is there an entry similar to y in the database?" The original (large) database is compressed; the compressed version can be stored at several locations, and queries about the original database can be answered reliably from the compressed version of it.
Fundamental Limits
• Fundamental limits of distributed inference (e.g., data mining viewed as an inference problem: inferring desired content given a query and a massive distributed dataset)
• Fundamental tradeoff: what is the minimum description (compression) rate required to generate a quantifiably good set of beliefs and/or reliable answers?
• General result: queries can be answered reliably if and only if the compression rate exceeds the identification rate!
• Practical algorithms to achieve these limits
Information-Theoretic Formulation and Results
Distributed (multiterminal) source coding under logarithmic loss:
• Logarithmic loss is a natural penalty function when processor output is
a list of outcomes indexed by likelihood.
Rate constraints capture complexity constraints (e.g., a limited number of
queries to the database, or a compressed/cached version of a larger
database).
• Multiterminal (distributed) problems can be solved under logarithmic
loss!
• Such fundamental limits were unknown for the last 40 years(!) except
in the case of jointly Gaussian sources under MSE constraints.
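The log-loss penalty itself is one line; as a small illustration (ours, not from the talk), reporting a belief distribution q and then observing outcome x costs log₂(1/q(x)) bits, so a list indexed by likelihood is exactly a list indexed by increasing loss:

```python
from math import log2

def log_loss(q, x):
    """Logarithmic loss of reporting belief distribution q
    (a dict: outcome -> probability) when the realized outcome is x."""
    return log2(1.0 / q[x])

# Hypothetical recommender beliefs over what the user wants next.
beliefs = {"thriller": 0.7, "comedy": 0.2, "drama": 0.1}
print(log_loss(beliefs, "thriller"))  # ~0.515 bits: likely outcome, small loss
print(log_loss(beliefs, "drama"))     # ~3.322 bits: unlikely outcome, big loss
```

Averaging this loss over outcomes gives the cross-entropy, which is what the rate constraints in the multiterminal formulation trade off against.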
Outline
1. Science of Information
2. Center Mission
3. Research Results in Knowledge Extraction (Big Data)
4. Center Mission
Grand Challenges (Big Data and Life Sciences)
Education & Diversity
Knowledge Transfer
Grand Challenges
1. Data Representation and Analysis (Big Data)
• Succinct data structures (compressed structures that lend themselves to efficient operations)
• Metrics: tradeoff between storage efficiency and query cost
• Validation on specific data (sequences, networks, structures, high-dimensional sets)
• Streaming data (correlation and anomaly detection, semantics, association)
• Generative models for data (dynamic model generation)
• Identifying statistically correlated graph modules (e.g., given a graph with edges weighted by node correlations and a generative model for graphs, identify the most statistically significantly correlated sub-networks)
• Information-theoretic methods for inference of causality from data and modeling of agents
• Complexity of flux models/networks
Grand Challenges
2. Challenges in Life Science
• Sequence analysis
  – Reconstructing sequences for emerging nanopore sequencers
  – Optimal error correction of NGS reads
• Genome-wide associations
  – Rigorous approach to associations while accurately quantifying the prior in data
• Darwin channel and repetitive channel
• Semantic and syntactic integration of data from different datasets (with the same or different abstractions)
• Construction of flux models guided by measures of information-theoretic model complexity
• Quantification of query-response quality in the presence of noise
• Rethinking interactions and signals from fMRI
• Developing an in-vivo system to monitor an animal brain in 3D (which will allow us to see the interaction between different cells in the cortex: flow of information, structural relations)
Outline
1. Science of Information
2. Center Mission
3. Recent Research Results in Knowledge Extraction
4. Center Mission
   - Grand Challenges
   - Education & Diversity
   - Knowledge Transfer
Education and Diversity
Integrate cutting-edge, multidisciplinary research and education efforts across the Center to advance the training and diversity of the work force.

1. Summer School, Purdue, May 2011
2. Introduction to Science of Information (HONR 399)
3. Summer School, Stanford, May 2012
4. Student Workshop & Student Teams, July 2012
5. NSF TUES Grant, 2012-2014
6. Information Theory & CSoI School, Purdue 2013
7. Channels Program
   • summer REU
   • academic-year DREU
   • new collaborations (female & black post-docs/faculty)
8. Diversity Database

(D. Kumar, M. Ward, B. Gibson, B. Ladd)
Knowledge Transfer
Develop effective mechanisms for interactions between the Center and external stakeholders to support the exchange of knowledge, data, and the application of new technology.

• SoiHub.org refined with a wiki to enhance communication
• Brown Bag seminars, research seminars, Prestige Lecture Series
• Development of open education resources
• International collaborations (LINCS, Colombia)
• Special Session on SoI at CISS'12, Princeton
• Collaboration with industry (consortium)

Ananth Grama
Prestige Lecture Series