The Digital Universe
Scientific Data – Science of Data
(Algorithmic Information Theoretical Analyses)
András Benczúr
ELTE Faculty of Informatics
Supported by the following project:
„Independent steps in science”
ELTE TÁMOP-4.2.2/B-10/1-2010-0030
Latest Press Releases
• CERN awards major contract for computer
infrastructure hosting to Wigner Research Centre
for Physics in Hungary (08.05.2012)
• CERN today signed a contract with the Wigner Research
Centre for Physics in Budapest for an extension to the
CERN data centre. Under the new agreement, the Wigner
Centre will host CERN equipment that will substantially
extend the capabilities of the LHC Computing Grid Tier-0
activities and provide the opportunity for business
continuity solutions to be implemented. The contract
initially runs until 31 December 2015, with the possibility of
up to four one-year extensions thereafter.
Recent News
Wigner Data Center at the Wigner Research Centre:
Tier-0 center for LHC Computing, a 150M EUR investment.
Rolf-Dieter Heuer: 20 years of participation of Hungarian
physicists in CERN. A new high-tech data connection
between Budapest and CERN, a challenging new project that
will change the way computing supports research in
Europe.
Some history: Gy. Vesztergombi, DATA Grid initiative,
1999.
Hungarian projects: Demo-Grid, EGEE-I, II, III,
Hungarian Grid Competence Center, Hungrid, Cluster
Grid, Desktop-Grid.
Recent News
• Big data has the power to change scientific
research from a hypothesis-driven field to one
that’s data-driven, Farnam Jahanian, chief of the National
Science Foundation’s Computer and Information Science and
Engineering Directorate, said Wednesday. (Two weeks ago)
• The term big data refers generally to the mass of new
information created by the Internet and by scientific tools
such as the Hubble Telescope and the Large Hadron
Collider. The emerging field of big data analysis is aimed
at sorting through the massive volume of that data, whether
it’s social media posts, video clips, satellite feeds
or the reaction of accelerated particles, to gather
intelligence and spot new patterns.
Recent News
• Federal officials announced in March that the government will
invest $200 million in research grants and infrastructure
building for big data.
• The investment was spawned by a June 2011 report from the
President's Council of Advisors on Science and Technology,
which found a gap in the private sector's investment in basic
research and development for big data.
Digital Universe and Semantic Gap
Mankind has given birth to a new universe, the Digital Universe.
The majority of our data and information is somewhere inside it,
in digital form of some kind. Even new observations (from the
LHC, digital sensors, cameras, etc.) first enter it in digital
form.
The conjecture on the growing semantic gap between
human beings and computers:
As the size of databases grows, the length of
queries grows at least logarithmically, and may grow
linearly. According to the IDC estimate in [4], the
size of the Digital Universe will grow by a factor of 9 in the
next five years. It doubles roughly every one and a half years.
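The two growth figures quoted above can be compared with a few lines of arithmetic. The sketch below is illustrative only: it assumes pure exponential growth with a fixed doubling time, and the function names are hypothetical. Under that assumption, doubling every 1.5 years gives roughly a tenfold increase in five years, while a factor of 9 in five years corresponds to a doubling time of about 1.58 years, so the two statements agree only approximately.

```python
import math


def growth_factor(years: float, doubling_time_years: float) -> float:
    """Multiplicative growth over `years`, assuming a fixed doubling time."""
    return 2.0 ** (years / doubling_time_years)


def doubling_time(years: float, factor: float) -> float:
    """Doubling time implied by growing by `factor` over `years`."""
    return years * math.log(2.0) / math.log(factor)


# Assumption: the growth is exactly exponential over the whole period.
five_year_factor = growth_factor(5.0, 1.5)
implied_doubling = doubling_time(5.0, 9.0)
print(f"factor over 5 years at 1.5-year doubling: {five_year_factor:.2f}")
print(f"doubling time implied by a factor of 9:   {implied_doubling:.2f} years")
```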
Digital Universe and Semantic Gap
The Digital Universe contains only substitutes for, or
encodings of, information, independently of whatever the
information means. Inside the Digital Universe the physical
processes are either transformations of signals from one
form to another or materialized computations.
Digital Universe and Semantic Gap
Paradoxically, inside the Digital Universe, the basic
components, the physically existing (even if only temporarily)
digits such as bits and bytes, have no semantic meaning, only an
operational, computational or transformational one. The
observer’s meanings at the very end of the interaction with
the real world lie in the mappings of real-world things to
a formal computable model. This mapping is the kernel of
filling the gap between human beings and computers.
Digital Universe and Semantic Gap
H. Mason: data scientists need three skills:
• mathematical modeling of data, building the model
• engineering, in implementing data processing
• finding insight and telling stories about the data, asking the right
questions (the hardest task)
We need them to fill the SEMANTIC GAP
P. Gelsinger: „Thirty years ago we didn’t have CS
departments, now every quality school on the planet has
one. Now, nobody has a data-science department. In
thirty years every school on the planet will have one.”
In: „Big Data’s Big Problem: Little Talent”, The Wall Street Journal, 04/29/2012
Motivation
1967, Debrecen, Colloquium on Information Theory
„Where does information come from?” (from the
past)
S. Watanabe, abstract
The question was raised for inductive inference and for
deductive inference.
„Human mind, being an information transducer, it can
lose but not gain information.”
So the Digital Universe, being an information transducer,
can lose but not gain information.
Motivation
Today: Where is information? In the Digital Universe.
Digital Universe: can lose but not gain information.
Information is collected in it.
In 2011: 1.8 Zettabyte of data will be created.
Is information there?
There are signals only. How can we gain information
from it? By computation. Computation: signal
transformation.
How Much Information?
What is information?
Motivation
Data volume on the NET:
Estimate: the data on the Web doubles every
11-18 months
Exabyte: the size of new data in the year 1998
IDC research: the size of new data in 2011
will exceed 1.8 zettabytes ($1.8 \times 10^{21}$ bytes)
Upper estimate: $10^{8}$ programmers, 8 hours
daily, one keystroke (one byte) per second:
new programs in one year: about $10^{15}$ bytes
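The upper estimate can be checked directly. The following is a minimal sketch of that arithmetic; the assumption that programmers type on all 365 days of the year is mine, chosen to keep the bound generous, and the constant names are illustrative.

```python
# Figures taken from the slide.
PROGRAMMERS = 10**8
HOURS_PER_DAY = 8
SECONDS_PER_HOUR = 3600
# Assumption (not in the source): typing every day of the year.
DAYS_PER_YEAR = 365

# One keystroke, counted as one byte, per second of working time.
bytes_typed_per_year = (
    PROGRAMMERS * HOURS_PER_DAY * SECONDS_PER_HOUR * DAYS_PER_YEAR
)

# IDC figure quoted for the new data created in 2011.
new_data_2011_bytes = 1.8e21

ratio = new_data_2011_bytes / bytes_typed_per_year
print(f"bytes typed per year: {bytes_typed_per_year:.3e}")
print(f"new data / typed program text: {ratio:.2e}")
```

Under these assumptions the hand-typed program text is on the order of $10^{15}$ bytes, about six orders of magnitude below the quoted volume of new data.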
Motivation
Next generation science, data intensive science
(Jim Gray, Alex Szalay et al. 2005).
„Scientists generate new data much faster than they
can analyze it. All looks like optical illusion.”
(Hugh Kieffert)
Big Data
Scientific Data
The Data-Scope Project - 6PB storage,
500GBytes/sec sequential IO, 20M IOPS, 130TFlops
• Thursday, February 2, 2012 at 9:10AM
• “Data is everywhere, never be at a single
location. Not scalable, not maintainable.” –
Alex Szalay
• An interview by Nicole Hemsoth with Dr.
Alexander Szalay, Data-Scope team lead, is
available as "The New Era of Computing: An
Interview with 'Dr. Data'".
Semantic Gap
The semantic gap between two persons.
The semantic gap between a person and a
computer.
The effect of growing data volume on the
semantic gap:
the law of algorithmic information theory.
Mathematics: Information Theory
Mathematical theories of information deal with
quantitative properties. They mainly deal with the
objective parts of information (representation and the
mapping to their referents). The subjective aspect, the
semantics of the referents is the problem of the observer.
In [1], P.J. Denning summarizes the discussion on the
definition of information as follows: “The formal
definitions of data (objective symbols) and information
(subjective meaning) do not help me to design
computers and algorithms. … Still, what information is
remains an open question.”
Mathematics: Information Theory
If we want to get closer to the notion of information from
the point of view of mathematical models, we have to
investigate carefully what is measured by the entropy
functions. According to Kolmogorov [2], we can measure the
quantity of information in three ways. All three
measures are related to the length of a description and not to
the meaning of information. They are connected to the length
of an optimal digital code.
Measures of Information quantity
Kolmogorov: three approaches
1. Probabilistic: Shannon-entropy
$H(p_1, p_2, \dots, p_n) = -\sum_{i=1}^{n} p_i \log_2 p_i$
2. Algorithmic: Kolmogorov-entropy
$C(x) = C_U(x) = \min\{\, l(p) \mid U(p) = x \,\}$, and $\infty$ if no such $p$ exists.
In the definition U is the fixed reference function,
typically the universal Turing machine.
3. Combinatorial: uniform code length for all elements of
the set
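Two of the three measures can be computed directly, which makes the contrast concrete. The sketch below is illustrative, with hypothetical function names: it evaluates the Shannon entropy of a finite distribution in bits, and the combinatorial measure as the length of a uniform fixed-length binary code for a set of N elements, taken here as the ceiling of log2 N. The algorithmic measure is left out because C(x) is not computable.

```python
import math
from typing import Sequence


def shannon_entropy(probabilities: Sequence[float]) -> float:
    """Shannon entropy H(p_1, ..., p_n) in bits.

    Terms with p_i = 0 contribute nothing, following the usual convention.
    """
    if any(p < 0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def combinatorial_code_length(set_size: int) -> int:
    """Bits of a uniform fixed-length code for a set of `set_size` elements."""
    if set_size < 1:
        raise ValueError("the set must have at least one element")
    # (N - 1).bit_length() equals ceil(log2 N) for every N >= 1.
    return (set_size - 1).bit_length()


print(shannon_entropy([0.5, 0.5]))      # fair coin
print(shannon_entropy([0.9, 0.1]))      # biased coin, below one bit
print(combinatorial_code_length(5))     # five elements need three bits
```

For a uniform distribution over a set whose size is a power of two, the two measures coincide.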
Mathematics: Information Theory
In the Shannon-model, the expected value of the code
length is minimized, whilst Kolmogorov-entropy
measures the minimal length of codes used by the
Universal Reference machine. In both models we don’t
know what information is, we only know that there is a
way to construct/reconstruct it from a signal of given
length. We don’t know what information is, we only
know how much it is.
To process information you have to understand
meaning. Meaning should be in the eye of the beholder.
Basics of Algorithmic
Information Theory
The two basic principles of algorithmic
information theory:
Different things need different encodings.
Decoding needs computable functions.
Basic techniques:
1) counting the number of code words of given
lengths,
2) using a reference machine that enumerates
a set of decoding functions. Invariance
theorem.
The algorithmic information quantity:
the length of the shortest codeword used by
the Universal Turing-Machine as reference
machine.
l(p): the length of code p.
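The algorithmic information quantity itself is not computable, but any real compressor gives a computable upper bound on description length, up to an additive constant for the decompressor. The sketch below uses Python's `zlib` for this purpose. That choice is my substitution for illustration and is not the reference machine of the theory; the function name is hypothetical.

```python
import os
import zlib


def compressed_length_bits(data: bytes) -> int:
    """Length in bits of a zlib encoding of `data`.

    This is only an upper-bound proxy for algorithmic information:
    the true shortest description may be much shorter.
    """
    return 8 * len(zlib.compress(data, 9))


# A highly regular string: a short program ("repeat 'ab' 5000 times")
# describes it, and the compressor finds most of that regularity.
regular = b"ab" * 5000

# Bytes from the operating system's random source: with overwhelming
# probability no description much shorter than the data itself exists.
random_like = os.urandom(10000)

print(8 * len(regular), compressed_length_bits(regular))
print(8 * len(random_like), compressed_length_bits(random_like))
```

The counting technique from the slide explains the second case: there are fewer than $2^{n}$ code words shorter than n bits, so most strings of n bits cannot be shortened at all.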
Conditional Kolmogorov entropy
Definition:
$C(x \mid y) = C_U(x \mid y) = \min\{\, l(p) \mid U(p, y) = x \,\}$,
and $\infty$ when no such $p$ exists.
Prefix entropy:
choose the prefix Universal Turing-Machine
U(p,y) as reference machine
Conditional Kolmogorov entropy
The measure of algorithmic information quantity, the
Kolmogorov entropy, is not suited to direct investigation
of the Digital Universe. Only the construction of the
Universal Reference Machine is important, as a
measurement tool for approximate
quantitative analyses of the behavior of the Digital
Universe.
Querying a computer
- a model
Participants:
the computer Watson,
and the person Holmes.
Watson:
Content of the data system: M, which contains codes of
programs: Prog
Watson answers a query (request) Q if there exists a P in
Prog such that P computes some answer A from Q
and M. The reference to P must be given in Q.
Querying a computer
- a model
The person Holmes:
Conscious content of the brain: knowledge K,
contains a part on „Thinking”, the ability to
Articulate and Codify Knowledge, Cognitive
Processes, Mental Mechanisms
Holmes should articulate and codify a formal query
Q for retrieving data A from Watson. This process
is called filling the semantic gap between Holmes
and Watson.
In our simple model Holmes submits the query Q and
Watson answers A. Q contains some reference to a
program P in M used to compute the answer A=P(Q,M).
Now, for the conditional Kolmogorov-entropy,
$C(A \mid M) \le l(Q) + c_p$
(The law of no information growth.)
Meaning: $C(A \mid M)$ is the length of the shortest query used by U.
Practical limitation: strong only for large A and Q.
New reference machine: M with
Prog inside
The reference machine used in the definition
of Kolmogorov-entropy relies on the possibility
of enumerating all computable functions,
and it is rather far from practical applications.
Following the basic idea in the construction
of the reference machine, we can consider M
with Prog inside as the reference machine. (At
any time, the best approximation of the Universal
Reference Machine is in the Digital
Universe.)
The conditional algorithmic entropy of A given M is
the length of the shortest query for which Watson
gives the answer A.
In notation:
$C_{\mathrm{Watson}}(A \mid M) = \min\{\, l(q) \mid \exists p,\ p \in M \text{ and } p(q, M) = A \,\}$
note: q contains a reference to p
An important difference from the universal Turing machine is that
Watson contains a collection of facts in M. (Finite Oracle)
We can measure the querying efficiency of
Holmes in getting answer A from Watson as
$l(Q) - C_{\mathrm{Watson}}(A \mid M)$
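Unlike the universal machine, a finite Watson allows the shortest query to be found by brute force. The toy model below is a sketch under stated assumptions: the facts, the three one-letter programs `g`, `s`, `m`, the query syntax (first character names the program, the rest is its argument) and the alphabet are all hypothetical, and the efficiency gap is taken as the difference between the length of the query used and the shortest possible one.

```python
from itertools import product
from typing import Callable, Dict, List, Optional

# Hypothetical facts stored in M (illustrative values only).
FACTS: List[int] = [3, 1, 4, 1, 5, 9, 2, 6]


def get_item(arg: str, facts: List[int]) -> Optional[str]:
    """Program 'g': the fact at index `arg`, if the index is valid."""
    if arg.isdecimal() and int(arg) < len(facts):
        return str(facts[int(arg)])
    return None


def total(arg: str, facts: List[int]) -> Optional[str]:
    """Program 's': the sum of all facts; takes no argument."""
    return str(sum(facts)) if arg == "" else None


def largest(arg: str, facts: List[int]) -> Optional[str]:
    """Program 'm': the largest fact; takes no argument."""
    return str(max(facts)) if arg == "" else None


# Hypothetical Prog: the programs that live inside M.
PROG: Dict[str, Callable[[str, List[int]], Optional[str]]] = {
    "g": get_item,
    "s": total,
    "m": largest,
}
ALPHABET = "gsm0123456789"


def watson(query: str) -> Optional[str]:
    """Answer A = p(q, M); the first character of the query refers to p."""
    if not query or query[0] not in PROG:
        return None
    return PROG[query[0]](query[1:], FACTS)


def c_watson(answer: str, max_len: int = 3) -> Optional[int]:
    """Length of the shortest query giving `answer`; None stands for infinity
    (no query of at most `max_len` characters works)."""
    for length in range(1, max_len + 1):
        for chars in product(ALPHABET, repeat=length):
            if watson("".join(chars)) == answer:
                return length
    return None


def query_overhead(query: str) -> Optional[int]:
    """l(Q) minus the shortest query length for the same answer."""
    answer = watson(query)
    if answer is None:
        return None
    shortest = c_watson(answer, max_len=len(query))
    return None if shortest is None else len(query) - shortest


print(watson("g5"), c_watson("9"), query_overhead("g5"))
```

Here the query `g5` retrieves the same answer as the shorter query `m`, so Holmes pays one character more than necessary; knowing what Prog contains is what lets him close that gap.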
Quantitative modelling of
human-computer interaction
Suppose that today Holmes solves a problem D
after entering query Q and retrieving some
information A from Watson.
This means that, using a human reasoning
“program” R, Holmes obtains solution S from
D, K, Q and A:
R(D,K,Q,A)=S
Note: the semantics of A is relative to Q.
Douglas Adams:
The Hitchhiker’s Guide to
the Galaxy
“Tell us!”
“All right,” said Deep Thought. “The Answer to the
Great Question…”
“Yes…!”
“Of Life, the Universe and Everything…” said Deep
Thought
“Is Forty-two.”
“You have never actually known what the question
is.”
“So once you do know what the question actually
is, you know what the answer means.”
Individual information measure
As for Watson, we can introduce
information measures for Holmes.
The need to query Watson means that Holmes
cannot give a solution S, even if the problem is
formulated in the form of D, so
$C_{\mathrm{Holmes}}(S \mid K) = \infty$ and also $C_{\mathrm{Holmes}}(S \mid K, D) = \infty$
Explanation: K is closed
Model fitting
Model fitting between the problem domain of D and a
pre-coded model in M is necessary for codifying query
Q. During this process the knowledge on M contained
in K plays an important role in formulating an efficient
query. Also, M may contain some information on K,
this is the possibility of personalization. All this
influences the semantic gap in formulating query Q.
Explanation – the role of stochastic modelling
Problem of (scientific) databases: mapping the
semantics of measurement information to
computational data model
Information no growth law revisited
In formulating query Q, Holmes uses K and the
problem description D. Having added Q to M, he
receives back some information that was
added to M by someone else. If the answer A
is sufficient for the solution S, then there is no
semantic gap. Otherwise, in order to obtain
the solution S from K, D, Q and A he uses some
process R not codified for Watson. Another
semantic gap arises: codifying R into a code
QR, so that Watson gives answer SR for QR.
How can we use the model?
Estimate the cardinality of the sets of
possible answers, questions, problems, and
then estimate the average length of queries
and answers.
Let us fix the present situation as above.
As M grows, the code lengths of a new
query and answer with the same semantics as
the former A also grow.
The effect of growing M
Conditional entropy of answer A according
to the reference machine Watson, or the
Digital Universe or the Universal Turing
machine uses the condition that M is
given. How will the conditional entropy
vary when we add some new data (digital
signals) to M? Denoting the new content
by M’ we can ask what the new
conditional entropy of the same answer A
is.
The effect of growing M
The number of possible answers grows
exponentially.
So the number of queries also grows
exponentially.
Typical query length grows linearly.
Example: subset query
M encodes n elements of a set.
A query retrieves a subset.
Number of queries and answers: $2^{n}$.
Average length of queries and answers: c*n.
Adding m new elements to the set:
Number of queries and answers: $2^{n+m}$.
Average length of queries and answers: c*(n+m).
The average length is independent of the reference
machine.
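The subset example can be made concrete with one particular encoding. The sketch below is only an illustration, with hypothetical function names: it encodes a subset query as one bit per stored element, which corresponds to c = 1. Since there are $2^{n}$ distinct answers, no encoding can make most queries much shorter than n bits, which is why the linear growth does not depend on the encoding chosen.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def encode_subset_query(universe: Sequence[T], wanted: Sequence[T]) -> str:
    """Encode a subset of `universe` as one bit per element."""
    wanted_set = set(wanted)
    return "".join("1" if item in wanted_set else "0" for item in universe)


def answer_subset_query(universe: Sequence[T], query: str) -> List[T]:
    """Decode a bitmask query back into the requested subset."""
    if len(query) != len(universe):
        raise ValueError("query must have one bit per element")
    return [item for item, bit in zip(universe, query) if bit == "1"]


def number_of_queries(n: int) -> int:
    """Number of distinct subsets (and thus answers) of an n-element set."""
    return 2 ** n


universe = list("abcd")                      # n = 4
query = encode_subset_query(universe, ["b", "d"])
print(query, answer_subset_query(universe, query), number_of_queries(4))

# Adding m = 3 new elements: the query for the same subset is now
# n + m bits long, and the number of possible queries grows by 2**m.
grown = universe + list("efg")
grown_query = encode_subset_query(grown, ["b", "d"])
print(grown_query, number_of_queries(len(grown)))
```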
The threat of growing semantic
gap
The size of queries and answers exceeds the
processing capacity of a human being.
The difference between the information quantity
of K (human knowledge) and M (World’s
data) is growing exponentially.
The same will be true for the common
knowledge of a group of people, and finally
for mankind.
World’s Data
A survey conducted by Revolution Analytics at the Joint
Statistical Meeting, held in Miami from July 30 through
Aug. 4, shows that 97% of data scientists
believe "big data" analytics technology currently is
falling short of enterprise needs.
Specifically, the 200 or so scientists surveyed
highlighted three obstacles to running analytics on big
data:
• the inherent complexities of big data software
• problems applying valid statistical models to the
data
• a general lack of insight into what the data means
Evolution of info communication
technologies will help us
Search engines – concentration
(Google, Yahoo, Ms Explorer, Mozilla, …)
Distributed and parallel technologies: HPC,
Clusters, Grid, Cloud, …
Social Networking: Twitter, Blogging, Youtube,
Facebook, …
Semantic technologies (Semantic Web, RDF,
OWL,…)
Data Mining, Data Warehousing, OLAP, Big Data,
NoSQL
World’s Data
Unstructured data (files, email, video) will account
for 90% of all data created over the next decade.
The number of servers managing the world’s data
stores will grow tenfold.
The bad news: the number of IT professionals
available to manage all that data will grow only to
1.5 times today’s level. They simply won’t keep
pace with demand. (Threat of growing Semantic
Gap.)
New data sources: embedded systems, sensors in
clothing, medical devices, buildings, …
Data intensive science
Next generation science by
Jim Gray, Alex Szalay et al. 2005.
„Scientists generate new data much faster than they
can analyze it. All looks like optical illusion.”
(Hugh Kieffert)
Jim Gray’s Law of Data
Engineering
1. Scientific computing is revolving around data.
2. Need scale-out solutions for analyses
Jim Gray: The Big Picture
[Figure: diagram with the labels Experiments & Instruments, Other Archives, Literature, Simulations, facts, questions, answers.]
The Big Problems
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it?
• How to reorganize it?
• How to coexist with others?
• Data Query and Visualization tools
• Support/training
• Performance
– Execute queries in a minute
– Batch (big) query scheduling
The Big Picture - extended
[Figure: Jim Gray's diagram placed inside the Digital Universe, with the labels Experiments & Instruments, Other Archives, Literature, Simulations, facts, Questions, Answers, Documents, Programs, Prog, M.]
Computational Statistics
Unstructured data, such as recorded facts about
stochastic and random phenomena in M, needs
queries formulated in terms of computational
statistics.
From MIT Technology Review, Jan/Feb 2010,
Mike Lynch (cofounder of Autonomy), p. 24:
Why can’t Google’s algorithms search unstructured
information? To process unstructured information
you have to understand meaning.
Meaning should be in the eye of the beholder.
Theory of Algorithmic Statistics.
Two-part code: description of a set, conditional
encoding of the elements
Kolmogorov’s structure function:
$h_x(\alpha) = \min_{S \,:\, x \in S,\ C(S) \le \alpha} \log_2 |S|$
The description of the set S is the structural part; it gives
the regular or statistical properties of x, and usually has
some natural meaning. The second part, the long code,
is the random component.
Now, probably, the random part of the Digital Universe
is much larger than the discovered structure.
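A two-part code can be illustrated on binary strings. In the sketch below, which is a toy instance and not the general structure function, the set S is "all strings of length n with exactly k ones". The structural part describes n and k, and the random part is the index of the string within S, costing $\log_2 \binom{n}{k}$ bits. Charging about $\log_2(n+1)$ bits for each of n and k is my simplifying assumption, and the function name is hypothetical.

```python
import math
from typing import Tuple


def two_part_code_length(bits: str) -> Tuple[float, float]:
    """Return (structural bits, random bits) for a binary string.

    The model set S contains every string with the same length n and
    the same number k of ones as `bits`.
    """
    if any(ch not in "01" for ch in bits):
        raise ValueError("expected a string of 0s and 1s")
    n = len(bits)
    k = bits.count("1")
    # Assumption: n and k each cost about log2(n + 1) bits to describe.
    structural_bits = 2 * math.log2(n + 1)
    # Index of the string inside S, i.e. log2 |S|.
    random_bits = math.log2(math.comb(n, k))
    return structural_bits, random_bits


sparse = "1" + "0" * 63          # one 1 among 64 positions
balanced = "01" * 32             # 32 ones among 64 positions

print(two_part_code_length(sparse))
print(two_part_code_length(balanced))
```

For the sparse string the random part is short, so the set description carries most of the information. For the balanced string this particular model explains almost nothing and the random part is close to the full 64 bits, even though the string is obviously regular: a richer model class would be needed to capture its structure.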
The three Universes
The Universe
The Universe in a human brain
The Digital Universe
Three different pasts to be observed
„Where does information come from?” (from the past)
Research: force and provoke Nature (a
Universe) to produce and show a past
that we have not yet observed.
DEMON OF THE SECOND KIND
"WE WANT THE DEMON, YOU SEE, TO EXTRACT FROM
THE DANCE OF ATOMS ONLY INFORMATION THAT IS
GENUINE, LIKE MATHEMATICAL THEOREMS, FASHION
MAGAZINES, BLUEPRINTS, HISTORICAL CHRONICLES,
OR A RECIPE FOR ION CRUMPETS, OR HOW TO CLEAN
AND IRON A SUIT OF ASBESTOS, AND POETRY TOO, AND
SCIENTIFIC ADVICE, AND ALMANACS, AND
CALENDARS, AND SECRET DOCUMENTS, AND
EVERYTHING THAT EVER APPEARED IN ANY
NEWSPAPER IN THE UNIVERSE, AND TELEPHONE
BOOKS OF THE FUTURE…" (STANISLAW LEM, THE
CYBERIAD)
DEMON OF THE SECOND KIND
A Demon of the Second Kind is a fictional machine that writes
factual statements, but only all too well. It appears in the short
story "The Sixth Sally," which is part of the novel The Cyberiad
by Stanislaw Lem.
In the story, two clever, space-traveling robots (Trurl and
Klapaucius) fall into the clutches of an evil robot, the giant pirate
Pugg. This pirate does not want to rob them of gold or silver;
instead, he wants information. Specifically, Pugg tells his two
captives that he will forcibly hold them until they tell him
everything they know.
Faced with the possibility of spending eons reciting all their
knowledge, Trurl and Klapaucius offer the pirate a bargain. If he
promises to let them go afterwards, the pair will build him a
Demon of the Second Kind, a special machine that can print out
an infinite amount of information.
DEMON OF THE SECOND KIND
The process is straightforward. In any gas, molecules are bumping
into each other with trillions of collisions per second. Sometimes,
they happen to arrange themselves in the shape of a letter. More
rarely, they arrange themselves in the shape of a word. Rarer still,
they arrange themselves to read out a statement. Some of these
statements are true; some aren't. The specialty of a Demon of the
Second Kind is that it can separate the false statements from the
true, and given a roll of paper, it will write out the truth and forget
the falsehood.
The Demon can separate fact from fiction, but it cannot separate
the useful from the useless, and almost every fact it prints is good
for absolutely nothing.
An overabundance of useless information is a curse.
DEMON OF THE SECOND KIND
Demon of the Second Kind gathering
intelligence
Data Mining:
Potentials and Challenges
Rakesh Agrawal & Jeff Ullman
Summary

Data mining has shown promise but needs
much more research
We stand on the brink of great new answers, but even
more, of great new questions -- Matt Ridley
Thank you for the attention
Computers and Information
technology
[Figure: communication diagram with the labels source, encoder, channel, decoder, destination, message, signal, Sender, Receiver, Computer, THE NET; "?" marks the tools of interaction (consciousness).]
Common knowledge in electronic
databases = Digital Universe
Computers and Information
technology
[Figure: the same communication diagram with Artifact/Nature in place of consciousness at the source and destination; "?" marks the tools of interaction.]
Common knowledge in electronic
databases = Digital Universe