The Digital Universe: Scientific Data – Science of Data (Algorithmic Information Theoretical Analyses)
András Benczúr, ELTE Faculty of Informatics
Supported by the project „Independent steps in science", ELTE TÁMOP-4.2.2/B-10/1-2010-0030 1

Latest Press Releases
• CERN awards major contract for computer infrastructure hosting to the Wigner Research Centre for Physics in Hungary, 08.05.2012.
• CERN today signed a contract with the Wigner Research Centre for Physics in Budapest for an extension to the CERN data centre. Under the new agreement, the Wigner Centre will host CERN equipment that will substantially extend the capabilities of the LHC Computing Grid Tier-0 activities and provide the opportunity for business continuity solutions to be implemented. The contract initially runs until 31 December 2015, with the possibility of up to four one-year extensions thereafter. 2

Recent News
Wigner Data Center at the Wigner Research Institute: the Tier-0 centre for LHC Computing, a 150M EUR investment.
Rolf-Dieter Heuer: 20 years of participation of Hungarian physicists in CERN. A new high-tech data connection between Budapest and CERN: a challenging new project that will change the way computing supports research in Europe.
Some history: Gy. Vesztergombi, the DataGrid initiative, 1999. Hungarian projects: Demo-Grid, EGEE-I, II, III, the Hungarian Grid Competence Center, Hungrid, ClusterGrid, Desktop-Grid. 3

Recent News
• Big data has the power to change scientific research from a hypothesis-driven field to one that is data-driven, Farnam Jahanian, chief of the National Science Foundation's Computer and Information Science and Engineering Directorate, said Wednesday. (Two weeks ago)
• The term big data refers generally to the mass of new information created by the Internet and by scientific tools such as the Hubble Telescope and the Large Hadron Collider.
The emerging field of big data analysis aims at sorting through the massive volume of that data (whether social media posts, video clips, satellite feeds or the reactions of accelerated particles) to gather intelligence and spot new patterns. 4

Recent News
• Federal officials announced in March that the government will invest $200 million in research grants and infrastructure building for big data.
• The investment was spawned by a June 2011 report from the President's Council of Advisors on Science and Technology, which found a gap in the private sector's investment in basic research and development for big data. 5

Digital Universe and Semantic Gap
Mankind has given birth to a new universe, the Digital Universe. The majority of our data and information is inside it somewhere, in digital form of some kind. Even new observations (from the LHC, digital sensors, cameras, etc.) first enter it in digital form.
The conjecture on the growing semantic gap between human beings and computers: as the size of databases grows, the length of queries grows at least logarithmically, and may grow linearly.
According to the IDC estimate in [4], the size of the Digital Universe will grow by a factor of 9 in the next five years; it doubles every one and a half years. 6

Digital Universe and Semantic Gap
The Digital Universe contains only the substitutes, or encodings, of information, independently of whatever information means. Inside the Digital Universe, the physical processes are either transformations of signals from one form into another, or they are materialized computations. 7

Digital Universe and Semantic Gap
Paradoxically, inside the Digital Universe the basic components, the physically (even if temporarily) existing digits, bits and bytes, have no semantic meaning, only an operational, computational or transformational one. The observer's meanings, at the very end of the interaction with the real world, lie in the mappings of real-world things to a formal computable model.
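The two IDC growth figures quoted above are consistent with each other: a ninefold increase over five years implies a doubling time of 5 ln 2 / ln 9 ≈ 1.58 years, i.e. roughly one and a half years. A minimal check (the function name is mine):

```python
import math

def doubling_time(factor: float, years: float) -> float:
    """Doubling time implied by a total growth factor over a given period."""
    return years * math.log(2) / math.log(factor)

print(round(doubling_time(9, 5), 2))  # ~1.58 years
```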
This mapping of the real world to a formal computable model is the kernel of filling the gap between human beings and computers. 8

Digital Universe and Semantic Gap
H. Mason: a data scientist needs three skills:
• mathematical modelling of data: building the model;
• engineering: implementing the data processing;
• finding insight in the data and telling stories about it, asking the right questions (the hardest task).
We need them to fill the SEMANTIC GAP.
P. Gelsinger: „Thirty years ago we didn't have CS departments; now every quality school on the planet has one. Now, nobody has a data-science department. In thirty years every school on the planet will have one." In: „Big Data's Big Problem: Little Talent", The Wall Street Journal, 04/29/2012. 9

Motivation
1967, Debrecen, Colloquium on Information Theory. S. Watanabe, abstract: „Where does information come from?" (from the past). The question was raised for inductive inference and for deductive inference. „The human mind, being an information transducer, can lose but not gain information." So the Digital Universe, being an information transducer, can lose but not gain information. 10

Motivation
Today: Where is information? In the Digital Universe. The Digital Universe can lose but not gain information; information is collected in it. In 2011, 1.8 Zettabytes of data will be created. Is information there? There are only signals. How can we gain information from it? By computation. Computation: signal transformation. How much information? What is information? 11

Motivation
Data volume on the NET. Estimation: the data on the Web doubles every 11-18 months. Exabyte: the size of the new data in the year 1998. IDC research: the size of the new data in 2011 will exceed 1.8 Zettabytes (1.8·10^21 bytes). Upper estimate: 10^8 programmers, 8 hours daily, one keystroke (one byte) per second: the new programs in one year amount to about 10^15 bytes. 12

Motivation
Next-generation science, data-intensive science (Jim Gray, Alex Szalay et al., 2005). „Scientists generate new data much faster than they can analyze them.
All looks like an optical illusion." (Hugh Kieffert)
Big Data, Scientific Data 13

The Data-Scope Project: 6 PB storage, 500 GB/sec sequential IO, 20M IOPS, 130 TFlops
• Thursday, February 2, 2012 at 9:10 AM
• „Data is everywhere, never at a single location. Not scalable, not maintainable." (Alex Szalay)
• An interview by Nicole Hemsoth with Dr. Alexander Szalay, Data-Scope team lead, is available at The New Era of Computing: An Interview with "Dr. Data". 14

Semantic Gap
The semantic gap between two persons. The semantic gap between a person and a computer. The effect of the growing data volume on the semantic gap: the law of algorithmic information theory. 15

Mathematics: Information Theory
Mathematical theories of information deal with quantitative properties. They mainly deal with the objective parts of information (representations and the mapping to their referents). The subjective aspect, the semantics of the referents, is the problem of the observer. In [1], P. J. Denning summarizes the discussion on the definition of information as follows: "The formal definitions of data (objective symbols) and information (subjective meaning) do not help me to design computers and algorithms. … Still, what information is remains an open question." 16T

Mathematics: Information Theory
If we want to get closer to the notion of information from the point of view of mathematical models, we have to investigate carefully what is measured by the entropy functions. We can measure the quantity of information in three ways, according to Kolmogorov [2]. All three measures are related to the length of a description, not to the meaning of the information. They are connected to the length of an optimal digital code. 17T

Measures of information quantity
Kolmogorov: three approaches.
1. Probabilistic: the Shannon entropy
   H(p_1, p_2, …, p_n) = − Σ_{i=1}^{n} p_i · log₂ p_i
2. Algorithmic: the Kolmogorov entropy
   C(x) = C_U(x) = min { l(p) : U(p) = x }, and C(x) = ∞ if no such p exists.
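The probabilistic measure above is directly computable from a distribution; a minimal sketch (the function name is mine, not from the lecture):

```python
import math

def shannon_entropy(probs):
    """H(p1, ..., pn) = sum_i p_i * log2(1/p_i); zero-probability terms contribute 0."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))   # 1.0 bit: a fair coin
print(shannon_entropy([0.25] * 4))   # 2.0 bits: a fair four-sided die
print(shannon_entropy([1.0]))        # 0.0 bits: a certain outcome
```

The Kolmogorov entropy, by contrast, is not computable by any program; it can only be bounded from above, for example by a real compressor.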
In the definition, U is the fixed reference function, typically the universal Turing machine.
3. Combinatorial: a uniform code length for all elements of the set. 18

Mathematics: Information Theory
In the Shannon model the expected value of the code length is minimized, whilst the Kolmogorov entropy measures the minimal length of the codes used by the universal reference machine. In both models we do not know what information is; we only know that there is a way to construct/reconstruct it from a signal of a given length. We do not know what information is, we only know how much of it there is. To process information you have to understand meaning, and meaning is in the eye of the beholder. 19T

Basics of Algorithmic Information Theory
The two basic principles of algorithmic information theory: different things need different encodings; decoding needs computable functions. 20T

Basic techniques: 1) counting the number of code words of given lengths; 2) using a reference machine that enumerates a set of decoding functions. The invariance theorem. The algorithmic information quantity: the length of the shortest codeword used by the universal Turing machine as the reference machine. l(p): the length of the code p. 21

Conditional Kolmogorov entropy
Definition:
   C(x | y) = C_U(x | y) = min { l(p) : U(p, y) = x }, and C(x | y) = ∞ when no such p exists.
Prefix entropy: choose the prefix universal Turing machine U(p, y) as the reference machine. 22

Conditional Kolmogorov entropy
The measure of the algorithmic information quantity, the Kolmogorov entropy, is not suitable for a direct investigation of the Digital Universe. Only the construction of the universal reference machine is important, as a measurement tool for finding approximations in the quantitative analyses of the behaviour of the Digital Universe. 23

Querying a computer: a model
Participants: the computer Watson and the person Holmes.
Watson: the content of the data system is M, which contains the codes of programs, Prog. Watson answers a query (request) Q if there exists a P in Prog such that P computes some answer A from Q and M. The reference to P must be given in Q. 24

Querying a computer: a model
The person Holmes: the conscious content of his brain is the knowledge K, which contains a part for „Thinking": the ability to articulate and codify knowledge, cognitive processes, mental mechanisms. Holmes should articulate and codify a formal query Q for retrieving the data A from Watson. This process is called filling the semantic gap between Holmes and Watson. 25

In our simple model Holmes submits the query Q and Watson answers A. Q contains some reference to a program P in M used to compute the answer A = P(Q, M). For the conditional Kolmogorov entropy we now have
   C(A | M) ≤ l(Q) + c_P
(the law of information non-growth). Meaning: it bounds the length of the shortest query used by U. Practical limitation: the bound is strong only for large A and Q.

A new reference machine: M with Prog inside
The reference machine used in the definition of the Kolmogorov entropy exploits the possibility of enumerating all computable functions, and it is a bit far from practical applications. Following the basic idea in the construction of the reference machine, we can consider M, with Prog inside, as the reference machine. (The best approximation of the universal reference machine available at any time is in the Digital Universe.) 27

The conditional algorithmic entropy of A given M is the length of the shortest query for which Watson gives the answer A. In notation:
   C_Watson(A | M) = min { l(q) : there is a p such that p ∈ M and p(q, M) = A }
Note: q contains a reference to p. An important difference from the universal Turing machine is that Watson contains a collection of facts in M (a finite oracle). We can measure the querying efficiency of Holmes in getting the answer A from Watson as the difference
   l(Q) − C_Watson(A | M). 28

Quantitative modelling of the human-computer interaction
Suppose that today Holmes solves a problem D after entering a query Q and retrieving some information A from Watson.
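Unlike the universal machine, a Watson with finite M and Prog can be analysed by brute force. Below is a toy sketch of the model above; the stored data, program names and query encoding are all illustrative assumptions, not from the lecture:

```python
# A toy Watson: M holds a data set and named programs (Prog);
# a query is "programname argument".
M = {
    "data": ["alpha", "beta", "gamma", "delta"],
    "prog": {
        "get": lambda arg, data: data[int(arg)],                  # fetch by index
        "find": lambda arg, data: [w for w in data if arg in w],  # substring search
    },
}

def watson(query):
    """Answer a query 'name arg' by running the referenced program on M."""
    name, _, arg = query.partition(" ")
    return M["prog"][name](arg, M["data"])

def c_watson(answer):
    """Brute-force C_Watson(A | M): the length of the shortest query answering A."""
    args = [str(i) for i in range(len(M["data"]))]
    args += sorted({w[i:j] for w in M["data"]
                    for i in range(len(w)) for j in range(i + 1, len(w) + 1)})
    best = None
    for name in M["prog"]:
        for arg in args:
            q = f"{name} {arg}"
            try:
                if watson(q) == answer and (best is None or len(q) < best):
                    best = len(q)
            except (ValueError, IndexError):
                pass  # ill-formed query: Watson gives no answer
    return best  # None plays the role of infinity (no query yields A)

q = "find ta"                        # Holmes's actual query, l(Q) = 7
print(watson(q))                     # ['beta', 'delta']
print(len(q) - c_watson(watson(q)))  # querying inefficiency: 7 - 6 = 1
```

Here the shortest query producing the same answer is "find t" (length 6), so Holmes's query "find ta" carries one byte of inefficiency in the sense of l(Q) − C_Watson(A | M).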
That is, using a human reasoning „program" R, Holmes obtains the solution S from D, K and A: R(D, K, Q, A) = S. Note: the semantics of A is relative to Q. 29

Douglas Adams: The Hitchhiker's Guide to the Galaxy
"Tell us!" "All right," said Deep Thought. "The Answer to the Great Question…" "Yes…!" "Of Life, the Universe and Everything…" said Deep Thought. "Is Forty-two." "You have never actually known what the question is." "So once you do know what the question actually is, you'll know what the answer means."

Individual information measure
Similarly to Watson, we can introduce information measures for Holmes. The need to query Watson means that he cannot give a solution S on his own, even if the problem is formulated in the form of D; that is,
   C_Holmes(S | K) = ∞ and also C_Holmes(S | K, D) = ∞.
Explanation: K is closed. 31

Model fitting
Model fitting between the problem domain of D and a pre-coded model in M is necessary for codifying the query Q. During this process, the knowledge about M contained in K plays an important role in formulating an efficient query. Also, M may contain some information about K; this is the possibility of personalization. All this influences the semantic gap in formulating the query Q. Explanation: the role of stochastic modelling. The problem of (scientific) databases: mapping the semantics of measurement information to a computational data model. 32

The law of information non-growth revisited
Formulating the query Q, Holmes uses K and the problem description D. Adding Q to M, he receives back some information that had been added to M by someone else. If the answer A is sufficient for the solution S, then there is no semantic gap. Otherwise, in order to obtain the solution S from K, D, Q and A, he uses some process R not codified for Watson. Another semantic gap arises: codifying R into a code Q_R, so that Watson gives the answer S_R for Q_R. 33

How can we use the model? Estimate the cardinality of the sets of possible answers, questions and problems, and then estimate the average length of queries and answers.
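The counting step behind such estimates is elementary: there are only 2^L binary strings of length L, so any code that distinguishes N possible answers needs length at least log₂ N. A minimal sketch (the function name is mine):

```python
import math

def min_code_length(n_answers: int) -> int:
    """Length of the shortest uniform binary code distinguishing n_answers items."""
    return math.ceil(math.log2(n_answers))

# The required code length grows with the logarithm of the answer space;
# so if the answer space grows exponentially, code lengths grow linearly.
for n in (2, 1024, 10**6, 2**40):
    print(n, min_code_length(n))
```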
Let us fix the present situation as above. With a growing M, the code lengths of a new query and answer with the same semantics as the former A are growing. 34

The effect of growing M
The conditional entropy of the answer A, according to the reference machine Watson, or the Digital Universe, or the universal Turing machine, uses the condition that M is given. How will the conditional entropy vary when we add some new data (digital signals) to M? Denoting the new content by M', we can ask what the new conditional entropy of the same answer A is. 35

The effect of growing M
The number of possible answers grows exponentially, so the number of queries also grows exponentially. Typical query lengths grow linearly. 36

Example: subset query
M encodes the n elements of a set; a query retrieves a subset. The number of queries and answers: 2^n. The average length of queries and answers: c·n. Adding m new elements to the set, the number of queries and answers becomes 2^(n+m), and the average length of queries and answers becomes c·(n+m). The average length is independent of the reference machine. 37

The threat of the growing semantic gap
The size of queries and answers exceeds the processing capacity of a human being. The difference between the information quantity of K (human knowledge) and of M (the world's data) is growing exponentially. The same will be true for the common knowledge of a group of people, and finally for mankind. 38

World's Data
Conducted by Revolution Analytics at the Joint Statistical Meeting held in Miami from July 30 through Aug. 4, the survey shows that 97% of data scientists believe "big data" analytics technology currently is falling short of enterprise needs.
• Specifically, the roughly 200 scientists surveyed highlighted three obstacles to running analytics on big data:
• the inherent complexities of big data software;
• problems applying valid statistical models to the data;
• a general lack of insight into what the data means. 39

The evolution of info-communication technologies will help us
Search engines: concentration (Google, Yahoo, MS Explorer, Mozilla, …). Distributed and parallel technologies: HPC, clusters, Grid, Cloud, … Social networking: Twitter, blogging, YouTube, Facebook, … Semantic technologies (Semantic Web, RDF, OWL, …). Data mining, data warehousing, OLAP, Big Data, NoSQL. 40

World's Data
Unstructured data, files, email and video will account for 90% of all data created over the next decade. The number of servers managing the world's data stores will grow tenfold. The bad news: the number of IT professionals available to manage all that data will grow only to 1.5 times today's level. They simply won't keep pace with demand. (The threat of the growing semantic gap.) New data sources: embedded systems, sensors in clothing, medical devices, buildings, …

Data-intensive science
Next-generation science, by Jim Gray, Alex Szalay et al., 2005. „Scientists generate new data much faster than they can analyze them. All looks like an optical illusion." (Hugh Kieffert) 42

Jim Gray's Law of Data Engineering
1. Scientific computing is revolving around data.
2. We need scale-out solutions for the analyses. 43

Jim Gray: The Big Picture
[Diagram: experiments & instruments, other archives, literature and simulations feed facts into the archive; questions go in, answers come out.]
The Big Problems
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it?
• How to reorganize it?
• How to coexist with others?
• Data query and visualization tools
• Support/training
• Performance: execute queries in a minute; batch (big) query scheduling

The Big Picture, extended
[Diagram: the Digital Universe. Experiments & instruments, other archives, literature, simulations, documents and programs feed facts into M; questions go in and answers come out through Prog.]
The Big Problems
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it?
• How to reorganize it?
• How to coexist with others?
• Data query and visualization tools
• Support/training
• Performance: execute queries in a minute; batch (big) query scheduling

Computational Statistics
Unstructured data, such as recorded facts about stochastic and random phenomena in M, needs queries formulated in terms of computational statistics. From MIT Technology Review, Jan/Feb 2010, Mike Lynch (cofounder of Autonomy), p. 24: Why can't Google's algorithms search unstructured information? Processing unstructured information you have to understand meaning, and meaning is in the eye of the beholder. 46

Theory of Algorithmic Statistics
Two-part codes: the description of a set, plus the conditional encoding of the element within it. Kolmogorov's structure function:
   h_x(α) = min { log₂ |S| : x ∈ S, C(S) ≤ α }
The description of the set S is the structural part; it gives the regular or statistical properties of x, and it usually has some natural meaning. The second part, the long code, is the random component. Now, probably, the random part of the Digital Universe is much larger than the discovered structure. 47

The three Universes
The Universe; the Universe in a human brain; the Digital Universe. Three different pasts to be observed. „Where does information come from?" (from the past). Research: force and provoke Nature (a Universe) to produce and show a past that we have not yet observed.
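Kolmogorov complexity itself is uncomputable, but any real compressor gives an upper bound on it, which is enough to see the split between structure and randomness described above. A crude sketch using zlib as a stand-in for the optimal code (the sizes in the comments are approximate):

```python
import os
import zlib

def description_size(data: bytes) -> int:
    """Compressed size in bytes: an upper bound on the information content."""
    return len(zlib.compress(data, 9))

structured = b"abcd" * 4096      # 16384 bytes with a short structural description
random_bits = os.urandom(16384)  # 16384 bytes with (almost) no structure

print(description_size(structured))   # tiny: the regularity is the whole story
print(description_size(random_bits))  # near 16384: almost pure random part
```

The structured input compresses to a few dozen bytes, while the random input stays essentially incompressible, mirroring the claim that the random part of the Digital Universe dwarfs its discovered structure.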
48 Demon of the Second Kind
"We want the demon, you see, to extract from the dance of atoms only information that is genuine, like mathematical theorems, fashion magazines, blueprints, historical chronicles, or a recipe for ion crumpets, or how to clean and iron a suit of asbestos, and poetry too, and scientific advice, and almanacs, and calendars, and secret documents, and everything that ever appeared in any newspaper in the Universe, and telephone books of the future…" (Stanislaw Lem, The Cyberiad)

Demon of the Second Kind
A Demon of the Second Kind is a fictional machine that writes factual statements, but only all too well. It appears in the short story "The Sixth Sally", which is part of the novel The Cyberiad by Stanislaw Lem. In the story, two clever, space-traveling robots (Trurl and Klapaucius) fall into the clutches of an evil robot, the giant pirate Pugg. This pirate does not want to rob them of gold or silver; instead, he wants information. Specifically, Pugg tells his two captives that he will forcibly hold them until they tell him everything they know. Faced with the possibility of spending eons reciting all their knowledge, Trurl and Klapaucius offer the pirate a bargain: if he promises to let them go afterwards, the pair will build him a Demon of the Second Kind, a special machine that can print out an infinite amount of information.

Demon of the Second Kind
The process is straightforward. In any gas, molecules are bumping into each other, with trillions of collisions per second. Sometimes they happen to arrange themselves in the shape of a letter. More rarely, they arrange themselves in the shape of a word. Rarer still, they arrange themselves to read out a statement. Some of these statements are true; some aren't. The specialty of a Demon of the Second Kind is that it can separate the false statements from the true, and, given a roll of paper, it will write out the truth and forget the falsehood.
The Demon can separate fact from fiction, but it cannot separate the useful from the useless, and almost every fact it prints is good for absolutely nothing. An overabundance of useless information is a curse.

Demon of the Second Kind: gathering intelligence.

Data Mining: Potentials and Challenges (Rakesh Agrawal & Jeff Ullman)
Summary: data mining has shown promise but needs much more research. "We stand on the brink of great new answers, but even more, of great new questions." (Matt Ridley)

Thank you for your attention. 58

Computers and Information Technology
[Diagram: Shannon's communication model extended to the NET. A sender (consciousness in one version, an artifact or Nature in the other) encodes a message into a signal; the signal travels through a channel between computers; a receiver decodes the signal back into a message for the destination. The "?" marks the tools of interaction. Common knowledge in electronic databases = the Digital Universe.]