Text Databases

Ján GENČI
PDT
Contents
• References
• Terminology
• Definition of text databases
• Query types
• Full-text search
• Linguistic corpora
References
• Pokorný J.: Databázové systémy 2, Nakladatelství
ČVUT, 2007
• Pokorný J., Snášel V., Kopecký M.: Dokumentografické
informačné systémy, Nakladatelství Karolinum, 2005.
• Laura C. Rivero, Jorge H. Doorn, Viviana E. Ferraggine:
Encyclopedia Of Database Technologies And
Applications. Idea Group Publishing, 2005 (entry "Text
Databases", p. 688)
• Erickson J.: Database Technologies: Concepts,
Methodologies, Tools, and Applications. IGI Global,
2009. ISBN 978-1-60566-058-5 (pp. 931-939)
References (cont.)
• Oracle Text.
http://www.oracle.com/technology/products/text/index.html
• Oracle Text. An Oracle Technical White
Paper. June 2007 (to read)
http://www.oracle.com/technology/products/text/pdf/11goracletexttwp.pdf
TXT DB - Terminology
• Text databases / text information systems
(Slovak: textové databázy)
• Document databases (Slovak: dokumentové databázy)
• Documentographic information systems
(Slovak: dokumentografické informačné systémy)
Text Databases – definition
• A text is any sequence of symbols (or characters) drawn from an
alphabet.
• A large portion of the information available worldwide in electronic
form is actually in text form (other popular forms are structured and
multimedia information):
– natural language text (e.g., books, journals, newspapers, jurisprudence
databases, corporate information, the Web),
– biological sequences (e.g., DNA and protein sequences),
– continuous signals (e.g., audio and video sequence descriptions, time
functions),
– and so on.
• A text database is a system that maintains a (usually large) text
collection and provides fast and accurate access to it. These two
goals are relatively orthogonal, and both are critical if one is to profit
from the text collection.
TXT DB - Types of queries
– Syntactic search (expressed in terms of the sequence of characters
present in the text):
• String matching (the simplest query; a whole family of algorithms exists, and
Knuth-Morris-Pratt was the first to run in O(n) time)
• Regular expressions
• Approximate searching (to recover from the different kinds of errors that
the text collection or the user query may contain; a simple error
model is edit distance)
– Semantic search (of great value): the user expresses an information
need and the system retrieves portions of the text collection (i.e.,
documents) that are relevant to that need, even if the query
words do not directly appear in the answer. The system ranks the
documents and offers the highest-ranked documents to the user.
There are no right or wrong answers, just better and worse ones.
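The two syntactic query types with a named algorithm or error model above can be sketched in Python. This is a minimal illustration, not taken from the cited literature: the function names `kmp_search` and `edit_distance` are chosen here, and edit distance is assumed to mean Levenshtein distance (insertions, deletions, and substitutions, each with cost 1).

```python
def kmp_search(text, pattern):
    """Return the start indices of all occurrences of pattern in text
    using the Knuth-Morris-Pratt algorithm."""
    if not pattern:
        return []
    # failure[i] = length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k

    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # continue, allowing overlapping matches
    return matches


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]
```

An approximate search can then accept every indexed word whose `edit_distance` to the query word is below a chosen threshold, while `kmp_search` scans the text once without backing up in it.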
Fulltext search
• In traditional database management systems
(DBMS), text manipulation is restricted to the
usual string manipulation facilities (exact
matching of substrings)
• Traditional string-level operations are very
costly for large documents. Because the traditional
DBMS engine is inefficient for these operations, it
is usually extended with a special full-text
search (FTS) engine module.
Fulltext search (cont.)
• There is significant market demand for
free-text and text-mining operations,
since information is often stored as free
text (e.g., text analysis in medical systems,
analysis of customer feedback, and
bibliographic databases)
• Simple character-level string matching
would retrieve only a fraction of related
documents
Alternatives for implementing
an FTS engine
• Built-in FTS engine module (Oracle,
Microsoft SQL Server, Postgres, and
MySQL; Informix Text DataBlade)
• DBMS-independent FTS engine (SPSS
LexiQuest, SAS Text Miner, dtSearch, and
Statistica Text Miner)
Ways of processing
• Text mining
• Full text search
Text mining
• Subfield of document management that aims at
processing, searching, and analyzing text
documents
• The goal – to discover the non-trivial or hidden
characteristics of individual documents or
document collections
• Interdisciplinary field of machine learning which
exploits tools and resources from computational
linguistics, natural language processing,
information retrieval, and data mining
[Figure: General application schema of text mining]
Information Extraction
• Includes, among others, the following subtasks:
– named entity recognition – recognition of
specified types of entities in free text,
– co-reference resolution – identification of text
fragments referring to the same entity,
– identification of roles and their relations –
determination of roles defined in event
templates
Text Categorization
• Aim: sorting documents into a given
category system, e.g.:
– document filtering: spam filtering or
newsfeed filtering;
– patent document routing: determination of
experts in the given fields;
– assisted categorization: helping domain
experts in manual categorization with useful
suggestions;
– automatic metadata generation.
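As an illustration of the spam-filtering case, the sketch below sorts documents into two categories with a multinomial naive Bayes classifier and Laplace smoothing. The slides do not name a classification method, so the choice of naive Bayes is an assumption made here, and the training data, `train`, and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (document text, category)
TRAINING = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("project meeting tomorrow", "ham"),
]


def train(labeled_docs):
    """Count documents per category and word occurrences per category."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocabulary = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return word_counts, class_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)  # category prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token in vocabulary:  # words never seen in training are ignored
                # Laplace (add-one) smoothing avoids zero probabilities
                score += math.log((word_counts[label][token] + 1)
                                  / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


MODEL = train(TRAINING)
```

With this toy data, `classify("cheap money", MODEL)` selects the spam category and `classify("project meeting", MODEL)` the ham category. A realistic category system would need far more training data and better tokenization.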
Document Clustering
• Grouping elements of a document collection
based on their similarity.
• Documents are usually clustered based on their
content.
• Document clustering is applied, e.g., for:
– clustering the results of (internet) search for helping
users in locating information,
– improving the speed of vector space based
information retrieval,
– providing a navigation tool when browsing a
document collection.
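Content-based similarity in the vector space model can be sketched as follows. The example assumes plain term-frequency vectors, cosine similarity, and a simple single-pass clustering with a fixed threshold. These are illustrative choices made here, not a method prescribed by the slides, and all names are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Represent a document as a sparse term-frequency vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster(docs, threshold=0.3):
    """Single-pass clustering: a document joins the first cluster whose
    first member is similar enough, otherwise it starts a new cluster."""
    vectors = [tf_vector(d) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


DOCS = [
    "oracle text index search",
    "full text search index",
    "protein sequence alignment",
    "dna sequence alignment tools",
]
```

Here `cluster(DOCS)` groups the two search-related documents together and the two sequence-related documents together. The result of single-pass clustering depends on document order and on the threshold, which is why production systems typically use more robust algorithms.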
Summarization
• Automatic generation of short and
comprehensible summaries of documents
FULL-TEXT SEARCH (FTS)
ENGINES
Fulltext indices
• A crucial sub-problem in the information retrieval
area is the design and implementation of
efficient data structures and algorithms for
indexing and searching information objects that
are vaguely described.
• The most commonly used indexing structures
are:
– inverted files,
– signature files,
– bitmaps.
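Of the structures listed, the inverted file is the easiest to sketch: it maps every term to the set of documents that contain it, so a query only touches the posting sets of its own terms. The code below is a minimal in-memory illustration under simplifying assumptions (whitespace tokenization, lowercase folding, AND semantics). The names `build_inverted_index` and `search_all` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search_all(index, query):
    """Return sorted ids of documents containing every query term (AND query)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = None
    for term in terms:
        postings = index.get(term, set())  # empty set for unknown terms
        result = postings if result is None else result & postings
    return sorted(result)


DOCS = [
    "text databases store text",
    "inverted files index text",
    "signature files and bitmaps",
]
INDEX = build_inverted_index(DOCS)
```

For example, `search_all(INDEX, "files text")` intersects the posting sets of both terms and yields only the second document. Real FTS engines additionally store term positions and frequencies to support phrase queries and ranking.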
Informix
• Excalibur Text DataBlade Module provides
text search capabilities that include:
– phrase matching,
– exact and fuzzy searches,
– compensation for misspelling,
– synonym matching.
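The idea behind misspelling compensation and synonym matching can be illustrated generically. The sketch below does not use or reproduce the Informix DataBlade API. It relies only on Python's standard `difflib` module, and the vocabulary, the thesaurus, and the function names are hypothetical.

```python
import difflib

# Hypothetical index vocabulary and thesaurus, for illustration only
VOCABULARY = ["database", "datablade", "search", "index", "document"]
SYNONYMS = {"search": {"lookup", "retrieval"}}


def fuzzy_terms(term, vocabulary=VOCABULARY, cutoff=0.8):
    """Return vocabulary words that closely resemble a possibly misspelled term."""
    return difflib.get_close_matches(term.lower(), vocabulary, n=3, cutoff=cutoff)


def expand_query(term):
    """Correct the term against the vocabulary, then add known synonyms."""
    candidates = fuzzy_terms(term)
    expanded = set(candidates)
    for word in candidates:
        expanded |= SYNONYMS.get(word, set())
    return sorted(expanded)
```

A misspelled query word such as "serch" is first mapped to the vocabulary word "search" and then expanded with its synonyms, so the engine can look up all resulting terms in its index.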
Linguistic corpora
• Collections of texts in a particular language,
intended primarily for linguistic research
• Annotated (tagged) texts
• Examples:
– British National Corpus (100 million words)
– Slovenský národný korpus (Slovak National Corpus; 530 million tokens)
– Český národný korpus (Czech National Corpus; 300 million words)
• Parallel corpora