Research Problems in Digital Libraries: Data Mining and Text Mining

Jaime Carbonell and Raj Reddy
Carnegie Mellon University
April 21, 2006
Talk presented at CS50 symposium at CMU
Keepers of the Faith

Digital Libraries and Universal Access to Information
• Create a Universal Digital Library containing all the books ever published
• Unfortunately, many of the books are in English
  • Not readable by over 80% of the population

Information Overload
• If we read a book every day, we can read at most 40,000 books in a lifetime
• Having millions of books online and accessible creates an information overload
  • “We have a wealth of information and scarcity of (human) attention!” (Herbert Simon)
• Multilingual search technology can help to reduce the overload
  • Permits users to search very large databases quickly and reliably
  • Independent of language and location

Understanding Language
• Books in non-native languages remain incomprehensible to most people
  • Translation and summarization are essential for worldwide use
• Current translation systems are not yet perfect
• Language understanding systems have improved significantly in the past few decades
  • Systems based on statistical and linguistic techniques have shown significant performance improvements
  • Improve performance using machine learning
• Digitization projects will act as test beds for validating language understanding systems research
  • e.g. The Million Book Digital Library Project

The Million Book Digital Library
• Collaborative venture among many countries including USA, China and India
• So far 400,000 books have been scanned in China and 200,000 in India
• Content is made freely available around the globe
• Those wishing to see the video in the next slide should download it from http://www.rr.cs.cmu.edu/MSRI.zip

Million Book Project: Status
• 21 centers in India
• 17 centers in China
• 1 center in Egypt
• Planned: Australia and Europe
• About 600,000 books scanned
• About 120,000+ accessible on the web from India
  • http://dli.iiit.ac.in/
• Uses 8 TB of storage
• 10 TB server at CMU Library planned for July 2005
• 1,000,000 books by the end of 2007
• Capacity to scan a million pages a day expected to be operational by the end of 2006

Title: Rig Veda
Author: Pandit Sriram Sharma Acharya
Language: Sanskrit
Subject: Philosophy
Publisher: Sanskriti Sansthan Bareli
Year: (not given)
Abstract: Rig Veda is the oldest of the Vedas. The Rig Veda is the oldest book in Sanskrit or any Indo-European language. Many great Yogis and scholars who have understood the astronomical references in the hymns, date the Rig Veda as before 4000 B.C., perhaps as early as 12,000. Modern western scholars date it around 1500 B.C., though recent archaeological finds in India (like Dwaraka) now appear to require a much earlier date.

Title: Elementary Treatise on the Wave-Theory of Light
Author: Humphrey Lloyd, D.D., D.C.L.
Language: English
Subject: Physics
Publisher: Longmans, Green & Co
Year: 1873
Abstract: This book deals with the various aspects of the wave theory of light. It is a critical work which contains an analytical discussion of the most recent researches in Optics. It presents a clear and connected view of the subject.

Title: Beauties from Kalidas
Author: Keshav Appa Padhye
Language: Sanskrit
Subject: Poetry
Publisher: (not given)
Year: 1927
Abstract: A collection of some of the best works of Kalidas, ancient India’s most famous Sanskrit poet. Abhignyana Sakuntalam, Kumara Sambhavam, Ritu Samhara are some of the renowned works of Kalidas.

Title: Gems, Jewels, Coins and Medals Ancient & Modern
Author: Archibald Billing
Language: English
Subject: Fine Arts
Publisher: Daldy, Isbister & Co
Year: 1875
Abstract: This volume deals with the detailed description of the varied types of fine arts dealing with precious stones, jewelry and sculpture.

Title: Mudalayiram Mulamum
Author: Periya Jeeyar
Language: Tamil
Subject: Religion
Publisher: Sri Vaishnava Sampirathaya Sanjeevikiri Sabayai
Year: 1909
Abstract: This volume is written in Tamil. It provides a detailed account of the origin of Vaishnava and is written by Periya Jeeyar.

Title: Gulzar-A-Badesha
Author: Khader Badesha
Language: Urdu
Subject: Literature
Publisher: Namipress, Chennai
Year: 1919
Abstract: Literature

Title: Jawahar Ali Joyviyah
Author: Dr. Ilyas lomas
Language: Arabic
Subject: Metrology
Publisher: Bakri and Issa
Year: 1876
Abstract: It is a book on Metrology, a study of measurements.

Title: Panchatantramu
Author: Narayana Kavi
Language: Telugu
Subject: Moral Stories
Publisher: Vavilla Ramaswamy and Sons
Year: 1912
Abstract: It is a compilation of stories told by a guru to his royal students, each story teaching a moral. Most of the characters in the stories are animals. The book served as an excellent guide to prospective kings in their everyday life, including their behaviour and their choice of friends. It also is a great asset to parents to teach ethics to their children.

Title: Bharateeya Smritigalu
Author: Vidwan Ragu Sutta
Language: Kannada
Subject: Biographical Notes
Publisher: Hemantha Sahitya
Year: (not given)
Abstract: Compilation of Ancient Memories

Title: The Fauna of British India including Ceylon and Burma
Author: Lt. Col. J. Stephenson
Language: English
Subject: Biology
Publisher: Taylor and Francis
Year: 1929
Abstract: Biological notes on fauna and insects compiled during British India

Title: Harijan: A Journal of Applied Gandhism, 1933-1955
Author: Joan Bondurant (introduction)
Language: English
Subject: Philosophy
Publisher: Garland Publishing Inc.
Year: 1973
Abstract: A journal on the practical implementation of Gandhism in everyday life

Title: Structure Des Molecules
Author: Victor Henri
Language: French
Subject: Chemistry
Publisher: Taylor and Francis
Year: 1925
Abstract: This is a unique book that explicates, in detail, the structure of molecules and touches upon certain specific characteristics of molecules with particular reference to Benzene.

Million Book Project: Research Challenges
• Providing access to billions every day
  • Distributed cached servers in every country and region
  • Self-healing databases
  • Easy-to-use interfaces for billions
• Text mining challenges
  • Multilingual information retrieval
  • Summarization
  • Text categorization
  • Named-entity identification
  • Novelty detection
  • Translation

Information Bill of Rights
• Get the right information
• To the right people
• At the right time
• On the right medium
• In the right language
• With the right level of detail

Relevant Text Mining Technologies
• “…right information”: IR (search engines)
• “…right people”: Classification, routing
• “…right time”: Anticipatory analysis
• “…right medium”: Info extraction, speech
• “…right language”: Machine translation
• “…right level of detail”: Summarization

… The Right Information: Next Generation Search Engines
• Search criteria beyond query relevance
  • Google: popularity (link density, click frequency, …)
  • Vivisimo: panoramic view (clustering + labeling)
  • Information novelty (content differential, recency)
  • Trustworthiness of source
  • Appropriateness to user (difficulty level, …)
  • Hidden web: 10X visible web (federated search)
• “Find What I Mean” principle
  • Search on semantically related terms
  • Induce user profile from past history, etc.
  • Disambiguate terms (e.g. “Jordan”)

Clustering (Vivisimo-style) Search vs Standard IR
[Figure labels: documents, query, IR, cluster summaries]

MMR Ranking vs Standard IR
[Figure labels: documents, query, IR, MMR]
λ controls spiral curl
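
The trade-off that λ controls can be made concrete with a small sketch. It uses the usual Maximal Marginal Relevance criterion: at each step, pick the unselected document that maximizes λ · sim(d, query) − (1 − λ) · max sim(d, already selected). The slide gives no implementation, so the bag-of-words cosine similarity, the function names, and the three sample documents below are assumptions made for illustration only.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def mmr_rank(query, docs, lam=0.5):
    """Return document indices in MMR order (hypothetical helper, not from the talk).

    lam = 1.0 reduces to plain relevance ranking (standard IR);
    smaller lam penalizes documents similar to ones already selected.
    """
    q = Counter(query.lower().split())
    vecs = [Counter(d.lower().split()) for d in docs]
    remaining = list(range(len(docs)))
    selected = []
    while remaining:
        def score(i):
            relevance = cosine(vecs[i], q)
            # Redundancy is the similarity to the closest already-selected document.
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Made-up sample data: document 1 is a near-duplicate of document 0.
docs = [
    "digital library books scanned",
    "digital library books scanned online",
    "library translation of books",
]
print(mmr_rank("digital library books", docs, lam=1.0))  # [0, 1, 2]
print(mmr_rank("digital library books", docs, lam=0.5))  # [0, 2, 1]
```

With λ = 1.0 the near-duplicate is ranked second because it is still highly relevant; with λ = 0.5 it drops behind the less relevant but novel document.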

… In The Right Level of Detail
Synthetic Document = Summary++
• Extractive combo (tracking, MMR, …)
• Centrality of info
• KIT model relevant
• Novelty (vs last time)
• Entities, relations, dates, … + raw text
• Later: contradiction & attitude detection
• Combine: CMU, IBM (NE + rel extraction), UMD (user model, summ), Stanford (contradiction detection)
[Figure labels: Audio transcripts; Texts (Eng, Arabic, Chinese, …); Sources; Entities; Relations; Textual summary; Analyst zoom-in; Novel; Attitude mixed]

… In the Right Language (MT)
[Figure labels: Source (Arabic); Syntactic Parsing; Semantic Analysis; Interlingua; Sentence Planning; Text Generation; Target (English); Transfer Rules; Direct: EBMT, SMT]

EBMT example
English: I would like to meet her.
Mapudungun: Ayükefun trawüael fey engu.

English: The tallest man is my father.
Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.

English: I would like to meet the tallest man
Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru
Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.
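
The "new" output above is what naive fragment reuse produces: matching pieces of stored examples are looked up and concatenated, with no agreement or reordering. A minimal sketch of that step follows. The fragment-level alignments in `EXAMPLES` are assumptions inferred from how the slide splits the two example sentences; the slide itself only gives whole-sentence pairs, and `ebmt_translate` is a hypothetical helper, not the system described in the talk.

```python
# Assumed fragment alignments, inferred from the slide's example pairs.
EXAMPLES = {
    "i would like to meet": "Ayükefun trawüael",
    "her": "fey engu",
    "the tallest man": "Chi doy fütra chi wentru",
    "is my father": "fey ta inche ñi chaw",
}

def ebmt_translate(sentence, examples=EXAMPLES):
    """Greedy longest-match lookup of source fragments, concatenating their targets."""
    words = sentence.lower().rstrip(".").split()
    out = []
    i = 0
    while i < len(words):
        # Try the longest fragment starting at position i first.
        for j in range(len(words), i, -1):
            fragment = " ".join(words[i:j])
            if fragment in examples:
                out.append(examples[fragment])
                i = j
                break
        else:
            # No stored example covers this word; mark it as untranslated.
            out.append(f"<{words[i]}>")
            i += 1
    return " ".join(out)

print(ebmt_translate("I would like to meet the tallest man"))
# Ayükefun trawüael Chi doy fütra chi wentru
```

The result matches the slide's "new" line rather than the "correct" one, which is the point of the example: recombined fragments still need adjustment at their boundaries.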

Illustration of Multi-Engine MT
Source (Spanish): El punto de descarge se cumplirá en el puente Agua Fria
Engine output 1: The drop-off point will comply with The cold Bridgewater
Engine output 2: The discharge point will self comply in the “Agua Fria” bridge
Engine output 3: Unload of the point will take place at the cold water of bridge

[Figure: MT research projects by approach, with year marks 1986, 1991, 1993, 1996, 2000]
Approaches: Interlingua, Spoken Language, Multi Engine, Example Based, Statistical, Low Resource, Automatic MT Evaluation, Portable
Projects: Letras, Avenue, MEMT, Diplomat, Tongues, METEOR, GEBMT, KANT, KBMT-89, JANUS, C-STAR I, MT Lab, Pangloss, RADD MT/TIDES, GALE, Enthusiast, TransTac, C-STAR II, Nespole, Lingwear, ThaiLator, Speechalator
Other labels: Semantic Annotation, Q&A, Extraction, CALL

“Language of Life”:
vocabulary
chemical groups, properties of amino acids (AA)

Evolutionary Methods for Discovering Sequence → Structure Mapping
• Distribution of amino acids
• A multiple sequence alignment: Human, Monkey, Mouse, Rat, Cow, Dog, Fly, Worm, Yeast
• Conserved properties across rhodopsin

Results: β-Helical Rung Prediction
• 1DBG: correctly identifies 10 out of 11 rungs

Concluding Observations … and Exaggerations
• Everything can be reduced to information
• Information is the key to everything
• All “natural” information has an underlying language (genomics, linguistics, …)
• Information is at all levels of granularity
  • Subatomic → DNA/proteins → society → …
• Information + language + computation = lifetime employment