Research Problems in Digital Libraries: Data Mining and Text Mining
Jaime Carbonell and Raj Reddy
Carnegie Mellon University
April 21, 2006
Talk presented at CS50 symposium at CMU

Keepers of the Faith

Digital Libraries and Universal Access to Information
• Create a Universal Digital Library containing all the books ever published
• Unfortunately, many of the books are in English
• Not readable by over 80% of the population

Information Overload
• If we read a book every day, we can read at most 40,000 books in a lifetime
• Having millions of books online and accessible creates an information overload
• “We have a wealth of information and scarcity of (human) attention!”, Herbert Simon
• Multilingual search technology can help to reduce the overload
• It permits users to search very large databases quickly and reliably, independent of language and location

Understanding Language
• Books in non-native languages remain incomprehensible to most people
• Translation and summarization are essential for worldwide use
• Current translation systems are not yet perfect
• Language understanding systems have improved significantly in the past few decades
• Systems based on statistical and linguistic techniques have shown significant performance improvements
• They improve performance using machine learning
• Digitization projects will act as a test bed for validating language understanding systems research, e.g.
The Million Book Digital Library Project

The Million Book Digital Library
• Collaborative venture among many countries including USA, China and India
• So far 400,000 books have been scanned in China and 200,000 in India
• Content is made freely available around the globe
• Those wishing to see the video in the next slide should download it from http://www.rr.cs.cmu.edu/MSRI.zip

Million Book Project: Status
• 21 centers in India
• 17 centers in China
• 1 center in Egypt
• Planned: Australia and Europe
• About 600,000 books scanned
• 120,000+ accessible on the web from India: http://dli.iiit.ac.in/
• Uses 8 TB of storage
• 10 TB server at CMU Library planned for July 2005
• 1,000,000 books by the end of 2007
• Capacity to scan a million pages a day expected to be operational by the end of 2006

Title: Rig Veda
Author: Pandit Sriram Sharma Acharya
Language: Sanskrit
Subject: Philosophy
Publisher: Sanskriti Sansthan Bareli
Abstract: Rig Veda is the oldest of the Vedas. The Rig Veda is the oldest book in Sanskrit or any Indo-European language. Many great yogis and scholars who have understood the astronomical references in the hymns date the Rig Veda to before 4000 B.C., perhaps as early as 12,000 B.C. Modern Western scholars date it to around 1500 B.C., though recent archaeological finds in India (like Dwaraka) now appear to require a much earlier date.

Title: Elementary Treatise on the Wave-Theory of Light
Author: Humphrey Lloyd, D.D., D.C.L.
Language: English
Subject: Physics
Publisher: Longmans, Green & Co
Year: 1873
Abstract: This book deals with the various aspects of the wave theory of light. It is a critical work which contains an analytical discussion of the most recent researches in optics. It presents a clear and connected view of the subject.

Title: Beauties from Kalidas
Author: Keshav Appa Padhye
Language: Sanskrit
Subject: Poetry
Year: 1927
Abstract: A collection of some of the best works of Kalidas, ancient India’s most famous Sanskrit poet.
Abhignyana Sakuntalam, Kumara Sambhavam, and Ritu Samhara are some of the renowned works of Kalidas.

Title: Gems, Jewels, Coins and Medals Ancient & Modern
Author: Archibald Billing
Language: English
Subject: Fine Arts
Publisher: Daldy, Isbister & Co
Year: 1875
Abstract: This volume gives a detailed description of the varied types of fine arts dealing with precious stones, jewelry and sculpture.

Title: Mudalayiram Mulamum
Author: Periya Jeeyar
Language: Tamil
Subject: Religion
Publisher: Sri Vaishnava Sampirathaya Sanjeevikiri Sabayai
Year: 1909
Abstract: This volume is written in Tamil. It provides a detailed account of the origin of Vaishnava and is written by Periya Jeeyar.

Title: Gulzar-A-Badesha
Author: Khader Badesha
Language: Urdu
Subject: Literature
Publisher: Namipress, Chennai
Year: 1919
Abstract: Literature

Title: Jawahar Ali Joyviyah
Author: Dr. Ilyas lomas
Language: Arabic
Subject: Metrology
Publisher: Bakri and Issa
Year: 1876
Abstract: A book on metrology, the study of measurements.

Title: Panchatantramu
Author: Narayana Kavi
Language: Telugu
Subject: Moral Stories
Publisher: Vavilla Ramaswamy and Sons
Year: 1912
Abstract: A compilation of stories told by a guru to his royal students, each story teaching a moral. Most of the characters in the stories are animals. The book served as an excellent guide to prospective kings in their everyday life, including their behaviour and their choice of friends. It is also a great asset to parents teaching ethics to their children.

Title: Bharateeya Smritigalu
Author: Vidwan Ragu Sutta
Language: Kannada
Subject: Biographical Notes
Publisher: Hemantha Sahitya
Abstract: Compilation of ancient memories

Title: The Fauna of British India including Ceylon and Burma
Author: Lt. Col. J.
Stephenson
Language: English
Subject: Biology
Publisher: Taylor and Francis
Year: 1929
Abstract: Biological notes on fauna and insects compiled during British India

Title: Harijan: A Journal of Applied Gandhism, 1933-1955
Author: Joan Bondurant (introduction)
Language: English
Subject: Philosophy
Publisher: Garland Publishing Inc.
Year: 1973
Abstract: A journal on the practical implementation of Gandhism in everyday life

Title: Structure Des Molecules
Author: Victor Henri
Language: French
Subject: Chemistry
Publisher: Taylor and Francis
Year: 1925
Abstract: This is a unique book that explicates, in detail, the structure of molecules and touches upon certain specific characteristics of molecules, with particular reference to benzene.

Million Book Project: Research Challenges
• Providing access to billions every day
• Distributed cached servers in every country and region
• Self-healing databases
• Easy-to-use interfaces for billions

Text Mining Challenges
• Multilingual Information Retrieval
• Summarization
• Text Categorization
• Named-Entity Identification
• Novelty Detection
• Translation

Information Bill of Rights
• Get the right information
• To the right people
• At the right time
• On the right medium
• In the right language
• With the right level of detail

Relevant Text Mining Technologies
• “…right information”: IR (search engines)
• “…right people”: Classification, routing
• “…right time”: Anticipatory analysis
• “…right medium”: Info extraction, speech
• “…right language”: Machine translation
• “…right level of detail”: Summarization
• …

The Right Information: Next Generation Search Engines
Search Criteria Beyond Query-Relevance
• Google: Popularity (link density, click freq, …)
• Vivisimo: Panoramic view (clustering + labeling)
• Information novelty (content differential, recency)
• Trustworthiness of source
• Appropriateness to user (difficulty level, …)
• Hidden web: 10X visible web (federated search)
“Find What I Mean” Principle
• Search on semantically related terms
• Induce user profile from past history, etc.
• Disambiguate terms (e.g.
“Jordan”)

Clustering (Vivisimo-style) Search vs Standard IR
[Figure: documents, query, IR, cluster summaries]

MMR Ranking vs Standard IR
[Figure: documents, query, IR, MMR; λ controls spiral curl]

… In The Right Level of Detail
Synthetic Document = Summary++
• Extractive combo (tracking, MMR, …)
• Centrality of info
• KIT model relevant
• Novelty (vs last time)
• Entities, relations, dates, … + raw text
• Later: contradiction & attitude detection
• Combine: CMU, IBM (NE + rel extraction), UMD (user model, summ), Stanford (contradiction detection)
[Figure labels: Sources (audio transcripts; texts in Eng, Arabic, Chinese …); Entities; Relations; Textual summary; Analyst zoom-in; Novel; Attitude; mixed]

… In the Right Language (MT)
[Figure labels: Source (Arabic); Syntactic Parsing; Semantic Analysis; Interlingua; Sentence Planning; Text Generation; Target (English); Transfer Rules; Direct: EBMT, SMT]

EBMT example
English: I would like to meet her.
Mapudungun: Ayükefun trawüael fey engu.

English: The tallest man is my father.
Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.

English: I would like to meet the tallest man
Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru
Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.
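
The EBMT example above can be read as fragment substitution: a stored example sentence is reused, and one aligned fragment is swapped for another. The following is a minimal Python sketch of that idea, not the system described in the talk. The sentence pairs come from the slide, but the fragment alignments in `FRAGMENTS`, the `EXAMPLES` table, and the function names are hypothetical, inferred from the slide layout for illustration only.

```python
# Toy example-based MT (EBMT) by fragment substitution.
# Sentence pairs are taken from the slide; the fragment alignments below
# are ASSUMPTIONS inferred from the slide layout, not verified linguistics.

# Translation memory: full example sentences (English -> Mapudungun).
EXAMPLES = {
    "I would like to meet her": "Ayükefun trawüael fey engu",
}

# Assumed sub-sentential alignments (hypothetical, for illustration).
FRAGMENTS = {
    "her": "fey engu",
    "the tallest man": "Chi doy fütra chi wentru",
}


def swap(text, old, new):
    """Replace `old` with `new` in `text`, matching whole words only."""
    return f" {text} ".replace(f" {old} ", f" {new} ").strip()


def translate_by_substitution(sentence):
    """Translate `sentence` by reusing a stored example in which exactly
    one aligned fragment has to be exchanged. Returns None if no stored
    example can be adapted this way."""
    for src, tgt in EXAMPLES.items():
        for old_en, old_tgt in FRAGMENTS.items():
            # The fragment must occur on both sides of the stored example.
            if f" {old_en} " not in f" {src} ":
                continue
            if f" {old_tgt} " not in f" {tgt} ":
                continue
            for new_en, new_tgt in FRAGMENTS.items():
                # Does swapping the English fragment yield the input?
                if swap(src, old_en, new_en) == sentence:
                    # Apply the same swap on the target side.
                    return swap(tgt, old_tgt, new_tgt)
    return None


result = translate_by_substitution("I would like to meet the tallest man")
print(result)
```

Under these assumed alignments the sketch produces the string labelled "Mapudungun (new)" on the slide, which differs from the translation the slide marks as correct. Plain substitution leaves the surrounding words of the stored example unchanged, so any adjustment the correct form needs has to come from a later step.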
Illustration of Multi-Engine MT
Source: El punto de descarge se cumplirá en el puente Agua Fria
Engine outputs:
• The drop-off point will comply with The cold Bridgewater
• The discharge point will self comply in the “Agua Fria” bridge
• Unload of the point will take place at the cold water of bridge

[Figure: MT projects on a time axis (1986, 1991, 1993, 1996, 2000). Approaches: Interlingua, Spoken Language, Multi Engine, Example Based, Statistical, Low Resource, Automatic MT Evaluation, Portable. Projects: Letras, Avenue, MEMT, Diplomat, Tongues, METEOR, GEBMT, KANT, KBMT-89, JANUS, C-STAR I, MT Lab, Pangloss, RADD, MT/TIDES, GALE, Enthusiast, TransTac, C-STAR II, Nespole, Lingwear, ThaiLator, Speechalator. Related areas: Semantic Annotation, Q&A, Extraction, CALL]

“Language of Life”: vocabulary of chemical groups, properties of amino acids (AA)

Evolutionary Methods for Discovering Sequence Structure Mapping
• Distribution of amino acids
• A multiple sequence alignment
• Conserved properties across rhodopsin
[Figure: alignment rows for Human, Monkey, Mouse, Rat, Cow, Dog, Fly, Worm, Yeast]

Results: β-Helical Rung Prediction
• 1DBG: correctly identify 10 out of 11 rungs

Concluding Observations … and Exaggerations
• Everything can be reduced to information
• Information is the key to everything
• All “natural” information has an underlying language (genomics, linguistics, …)
• Information is at all levels of granularity: subatomic, DNA/proteins, society, …
• Information + language + computation = lifetime employment