mg4j-exercise

MG4J: Managing Gigabytes for Java
Exercise
Ida Mele
Document
• Indexing in MG4J is centered around documents
• Package: it.unimi.di.big.mg4j.document
• An instance of Document represents a single document
that can be indexed
• Different documents have different numbers and types of
fields.
• For example,
• E-mail: from, to, date, subject, body
• HTML page: title, url, body
Document
• Summary of methods:
DocumentCollection
• Package: it.unimi.di.big.mg4j.document
• DocumentCollection is a randomly addressable list
of documents
FileSetDocumentCollection
• Package: it.unimi.di.big.mg4j.document
• The main method of FileSetDocumentCollection
builds and serializes a collection of documents
specified by their filenames
Document Factory
• Package: it.unimi.di.big.mg4j.document
• The factory turns a raw stream of bytes (a file) into a
document made of several fields (e.g., title and text)
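As a rough illustration of what a factory does, the sketch below splits the raw bytes of an HTML file into a title field and a text field. It is not the MG4J API: the class ToyHtmlFactory, its parse method and the regular expressions are hypothetical and far cruder than a real HTML parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch (not MG4J code): bytes in, named fields out.
class ToyHtmlFactory {
    private static final Pattern TITLE =
        Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static Map<String, String> parse(byte[] raw) {
        // The encoding is assumed to be UTF-8, as in the exercise.
        String html = new String(raw, StandardCharsets.UTF_8);

        // Field 1: the content of the first <title> element, if any.
        Matcher m = TITLE.matcher(html);
        String title = m.find() ? m.group(1).trim() : "";

        // Field 2: everything else, with the title element and all tags removed.
        String text = TITLE.matcher(html).replaceAll(" ")
            .replaceAll("<[^>]*>", " ")
            .replaceAll("\\s+", " ")
            .trim();

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", title);
        fields.put("text", text);
        return fields;
    }

    public static void main(String[] args) {
        String page = "<html><head><title>DIS Home</title></head>"
            + "<body><p>Computer science department</p></body></html>";
        System.out.println(parse(page.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Each field produced this way is what ends up in its own index (dis-title, dis-text) later in the exercise.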
Standard MG4J Document Factories
• CompositeDocumentFactory
• HtmlDocumentFactory
• IdentityDocumentFactory
• MailDocumentFactory
• PdfDocumentFactory
• ReplicatedDocumentFactory
• PropertyBasedDocumentFactory
• TRECHeaderDocumentFactory
• ZipDocumentCollection.ZipFactory
Query
• Package: it.unimi.di.big.mg4j.query
• To query the index we can use the main method of
the class Query
• We can submit queries by using:
• command line
• web browser
• QueryEngine: The query engine receives the query
and returns the ranked list of results
• HttpQueryServer: A simple web server for query
processing
Indexing and querying: exercise
• TECHNICAL REQUIREMENTS:
• UNIX Operating System
• Java (>=6)
• Document collection and the libraries are available at:
http://www.dis.uniroma1.it/~mele/WebIR.html
Set the classpath
• Download and extract htmlDIS.tar.gz
• Download and extract lib.zip
• Download the file set-classpath.sh
• Edit the first line of the file set-classpath.sh: replace
your_directory with the path of the folder containing all the
.jar files (lib folder)
• Set the CLASSPATH: source set-classpath.sh
Building the collection of documents (1)
• Help:
java it.unimi.di.big.mg4j.document.FileSetDocumentCollection --help
• Create the collection:
find htmlDIS -iname \*.html | java
it.unimi.di.big.mg4j.document.FileSetDocumentCollection -f
HtmlDocumentFactory -p encoding=UTF-8 dis.collection
find returns the list of files, one per line. This list is piped
to the standard input of the main method of
FileSetDocumentCollection
Building the collection of documents (2)
• We also need to specify a factory (the -f option) and the
encoding as a property (the -p option)
• The name of the collection is dis.collection
• The collection does not contain the files, but only their
names
• Deleting or modifying files in the htmlDIS directory may
make the collection inconsistent
Building the index
• Help: java it.unimi.di.big.mg4j.tool.IndexBuilder --help
• Create the index:
java it.unimi.di.big.mg4j.tool.IndexBuilder --downcase -S
dis.collection dis
• --downcase: this option forces all the terms to be downcased
• -S: specifies that we are producing an index for the specified
collection. If the option is omitted, IndexBuilder expects to index a
document sequence read from standard input
• dis: basename of the index
If you run out of memory, you can use -Xmx to allocate more
memory to Java:
java -Xmx512M it.unimi.di.big.mg4j.tool.IndexBuilder --downcase -S
dis.collection dis
Index files (1)
• dis-{text,title}.terms: contain the terms of the dictionary.
One term per line
more dis-text.terms
• dis-{text,title}.stats: contain statistics
more dis-text.stats
• dis-{text,title}.properties: contain global information
more dis-text.properties
Index files (2)
• dis-{text,title}.frequencies: for each term, the
number of documents containing the term (γ-code)
• dis-{text,title}.globcounts: for each term, the
number of occurrences of the term (γ-code)
• dis-{text,title}.offsets: for each term, its offset in the index (γ-code)
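The difference between the frequency of a term (number of documents containing it) and its global count (total number of occurrences) can be sketched on a toy collection. The class TermStatsSketch is hypothetical and only mimics the idea, including downcasing; it does not read or write MG4J files, and splitting on whitespace is an assumption about tokenization.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch: per-term document frequency and global count.
class TermStatsSketch {
    // Returns term -> {frequency, global count}; terms are downcased and sorted.
    static SortedMap<String, int[]> stats(String[] documents) {
        SortedMap<String, int[]> out = new TreeMap<>();
        for (String doc : documents) {
            Set<String> seenInThisDoc = new HashSet<>();
            for (String word : doc.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (word.isEmpty()) continue;
                int[] s = out.computeIfAbsent(word, k -> new int[2]);
                if (seenInThisDoc.add(word)) s[0]++; // first time in this document
                s[1]++;                              // every occurrence
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] docs = { "Computer science", "computer lab computer" };
        for (java.util.Map.Entry<String, int[]> e : stats(docs).entrySet()) {
            System.out.println(e.getKey() + " frequency=" + e.getValue()[0]
                + " globcount=" + e.getValue()[1]);
        }
    }
}
```

Here "computer" has frequency 2 (it appears in both documents) but global count 3.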
Index files (3)
• dis-{title,text}.sizes: contain the list of the document
sizes. The document size is the number of words
contained in each document (γ-code)
• dis-{text,title}.batch<i>: temporary files with sub-indices
(compressed binary format). Use the option --keep-batches to keep the
temporary files
• dis-{text,title}.index: contain the index (compressed binary format)
Web server
• Help:
java it.unimi.di.big.mg4j.query.Query --help
• Querying the index:
java it.unimi.di.big.mg4j.query.Query -h -i
FileSystemItem -c dis.collection dis-text dis-title
• Command line: {text, title} > computer
• Web browser: http://localhost:4242/Query
Query (1)
• Search one word: The result is the set of documents
that contain the specified word
• Example: computer
• AND: more than one term separated by whitespace or
by AND or &. The result is the set of documents that
contain all the specified words
• Example: computer science
• Example: computer AND science
• Example: computer & science
Query (2)
• OR: more than one term separated by OR or |. The
result is the set of documents that contain any of the
given words
• Example: conference | workshop
• NOT: the operator NOT or ! is used for negation
• Example: conference & ! workshop
• Parentheses: the parentheses are used to enforce
priority in complex queries
• Example: university & (rome | california)
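The Boolean operators above correspond to set operations on the lists of documents containing each term. A minimal sketch, assuming each term is represented by a set of document numbers (the class BooleanQuerySketch and the sample document numbers are made up for illustration; this is not how MG4J evaluates queries internally):

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: AND, OR and AND NOT as set operations on document numbers.
class BooleanQuerySketch {
    static Set<Integer> setOf(int... ids) {
        Set<Integer> s = new TreeSet<>();
        for (int id : ids) s.add(id);
        return s;
    }

    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.retainAll(b); // intersection
        return r;
    }

    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.addAll(b); // union
        return r;
    }

    static Set<Integer> andNot(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.removeAll(b); // difference
        return r;
    }

    public static void main(String[] args) {
        Set<Integer> conference = setOf(0, 2, 3), workshop = setOf(2, 4);
        Set<Integer> university = setOf(1, 2, 3), rome = setOf(2), california = setOf(3, 5);

        System.out.println(or(conference, workshop));                // conference | workshop -> [0, 2, 3, 4]
        System.out.println(andNot(conference, workshop));            // conference & ! workshop -> [0, 3]
        System.out.println(and(university, or(rome, california)));   // university & (rome | california) -> [2, 3]
    }
}
```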
Query (3)
• Proximity restriction: the words must appear within a
limited portion of the document
• Example: (university rome)~6
• Phrase: using " " we can look for documents that
contain the exact phrase
• Example: "university of rome la sapienza"
• Ordered AND: more than one term separated by <
• Example: computer < science < department
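The proximity restriction can be pictured as a check on word positions. The sketch below assumes that ~n means "both words fall inside some window of at most n consecutive words"; the class ProximitySketch is hypothetical and works on a plain array of words rather than on an index.

```java
// Hypothetical sketch: do two words occur within a window of `window` words?
class ProximitySketch {
    static boolean within(String[] words, String a, String b, int window) {
        int lastA = -1, lastB = -1; // most recent positions of each word
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(a)) lastA = i;
            if (words[i].equals(b)) lastB = i;
            // Positions p and q fit in a window of n words when |p - q| < n.
            if (lastA >= 0 && lastB >= 0 && Math.abs(lastA - lastB) < window) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String[] doc = "the university of rome la sapienza".split(" ");
        System.out.println(within(doc, "university", "rome", 6)); // true
        System.out.println(within(doc, "university", "rome", 2)); // false
    }
}
```

In the sample text "university" and "rome" are at positions 1 and 3, so they fit in a window of 3 words or more.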
Query (4)
• Wildcard (*): wildcard queries can be submitted
appending * at the end of a term
• Example: infor*
• Index specifiers: prefixing a query with the name of an
index followed by : you can restrict the search to that
index
• Example: title:computer
• Example: text:computer science AND title:FOCS
Sophisticated queries (1)
• MG4J provides sophisticated query tuning
• To use these features, we must use the command line
interface
• $ --- to get some help on the available options
• Some examples:
• $mode --- to choose the kind of results
Example: > $mode short
• $selector --- to choose the way the snippet or intervals
are shown
Example: > $selector 3 40
Sophisticated queries (2)
• Other examples:
• $mplex --- when multiplexing is on, each query is
multiplexed to all indices. When a scorer is used, it is
a good idea to use multiplexing
Example: > $mplex on
• $score --- to choose the scorer
Example: > $score VignaScorer
• $weight --- to change the weight of the indices. This
is useful when multiplexing is on
Example: > $weight text:1 title:3
Scorer (1)
• Scorers rank the documents returned by a query.
Default: BM25Scorer and VignaScorer
• ConstantScorer. Each document has a constant score (default is
0)
>$score ConstantScorer
• CountScorer. The score is the product of the number of
occurrences of the term in the document and the weight
assigned to the index
>$score CountScorer
Scorer (2)
• TfIdfScorer. It implements TF/IDF
TF is the term frequency of the term t for the document d: c/l;
where c is the number of occurrences of t in d and l is the
length of d
IDF is the inverse document frequency of the term t in the
collection: log(N/f); where N is the number of documents in
the collection and f is the number of documents where t
appears
>$score TfIdfScorer
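The two formulas above can be combined into a single score, (c/l) * log(N/f). The sketch below is a worked computation only: the class TfIdfSketch is hypothetical, the natural logarithm is an assumption (the base is not specified here), and MG4J's TfIdfScorer may differ in details.

```java
// Hypothetical sketch of TF/IDF as defined above: (c / l) * log(N / f).
class TfIdfSketch {
    // c: occurrences of t in d, l: length of d,
    // N: documents in the collection, f: documents containing t (f >= 1).
    static double tfIdf(int c, int l, int N, int f) {
        double tf = (double) c / l;
        double idf = Math.log((double) N / f); // natural logarithm (assumption)
        return tf * idf;
    }

    public static void main(String[] args) {
        // A term occurring 3 times in a 100-word document,
        // present in 10 of 1000 documents: 0.03 * ln(100).
        System.out.println(tfIdf(3, 100, 1000, 10));
        // A term present in every document carries no information: idf = 0.
        System.out.println(tfIdf(3, 100, 1000, 1000));
    }
}
```

A rare term (small f) gets a high IDF, so its occurrences weigh more than those of a term found in most documents.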
Scorer (3)
• DocumentRankScorer. The scores of documents are stored
in a text file
>$score DocumentRankScorer nameFile
Virtual fields (1)
• A virtual field produces pieces of text that refer to other documents
(possibly belonging to the collection)
• Referrer: the document that refers to another document
• Referee: the document to which a piece of text of the Referrer
refers
• Intuitively, the Referrer gives us information about the Referee
• The Referrer produces in a virtual field a number of fragments of
text, each referring to a Referee
• The content of a virtual field is a list of pairs, each made of a piece of
text (called the virtual fragment) and a string that
identifies the Referee (called the document spec)
Virtual fields (2)
• In the case of an HTML document:
• the document spec is a URL (as specified in the href
attribute)
• the virtual fragment is the content of the anchor element
and some surrounding text (anchor context)
• The HtmlDocumentFactory produces the pairs (document
spec, virtual fragment)
Virtual fields (3)
• Create the list of URLs of the documents in the collection:
java it.unimi.di.big.mg4j.tool.ScanMetadata -S dis.collection -u
dis.urls
• Create the document resolver. It is able to map the document
spec produced by some document factory into actual references to
documents in the collection
• Given a document spec, the resolver decides whether the spec
really refers to a document in the collection and, if so,
finds out which document it refers to:
java
it.unimi.di.big.mg4j.tool.URLMPHVirtualDocumentResolver -o
dis.urls dis-anchor.resolver
Virtual fields (4)
• Building the index:
java it.unimi.di.big.mg4j.tool.IndexBuilder -a -v anchor:dis-anchor.resolver --downcase -S dis.collection dis
• Querying the index:
java it.unimi.di.big.mg4j.query.Query -h -i FileSystemItem -c
dis.collection dis-text dis-title dis-anchor
{text, title, anchor} > anchor:conference
{text, title, anchor} > title:combinatorial algorithms AND
anchor:conference
{text, title, anchor} > text:RoboCup AND anchor:info
Virtual gap (1)
• All the virtual fragments that refer to a given document of
the collection are treated as a single text, called the virtual text
• Virtual fragments coming from different anchors are
concatenated, and they are in a text file
• This may produce false positive results
• For example, the query
anchor:(computer AND science)
returns a list of documents that contain both
words in some of their anchors, but not necessarily in
the same anchor
Virtual gap (2)
• To avoid such kinds of false positives, we can use virtual
gaps
• The virtual gap is a positive integer, representing the virtual
space left between different virtual fragments
• For example, if the virtual gap is 64 (the default), anchors
are concatenated by leaving 64 “empty words” between
subsequent fragments
• We can submit the query:
>anchor:(computer AND science)~64
and we will be sure that only documents containing both
terms in the same anchor are retrieved
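The effect of the gap can be simulated on word positions. In this sketch the fragments are concatenated into one virtual text, leaving gap empty positions between consecutive fragments, and a query with ~window matches when both terms fall inside a window of that many words. The class VirtualGapSketch and the sample anchors are hypothetical; MG4J's own bookkeeping may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: virtual text positions with a gap between fragments.
class VirtualGapSketch {
    // Positions of `term` in the virtual text built from `fragments`.
    static List<Integer> positions(String[][] fragments, String term, int gap) {
        List<Integer> out = new ArrayList<>();
        int offset = 0;
        for (String[] fragment : fragments) {
            for (int i = 0; i < fragment.length; i++) {
                if (fragment[i].equals(term)) out.add(offset + i);
            }
            offset += fragment.length + gap; // leave `gap` empty words
        }
        return out;
    }

    // True if some occurrence of a and some occurrence of b fit in `window` words.
    static boolean matches(String[][] fragments, String a, String b, int gap, int window) {
        for (int pa : positions(fragments, a, gap)) {
            for (int pb : positions(fragments, b, gap)) {
                if (Math.abs(pa - pb) < window) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Two different anchors pointing to the same document.
        String[][] anchors = { { "computer", "lab" }, { "science", "museum" } };
        System.out.println(matches(anchors, "computer", "science", 0, 64));  // true: false positive
        System.out.println(matches(anchors, "computer", "science", 64, 64)); // false: gap separates them
    }
}
```

With a gap of 64, words from different anchors are always more than 64 positions apart, so a ~64 restriction can only be satisfied inside a single anchor.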
Virtual gap (3)
• If the anchor is longer than 64 words, we can still
have false positives
• In the indexing phase, it is possible to specify a different
virtual gap
• For example, we can use:
java it.unimi.di.big.mg4j.tool.IndexBuilder -a -g
anchor:100 -v anchor:dis-anchor.resolver --downcase
-S dis.collection dis
It uses a virtual gap of 100 words
Term map (1)
• A simple representation of a dictionary is the term list (the file
.terms): a text file containing the whole dictionary, one term
per line, in index order (the first line contains the term with
index 0, the second line the term with index 1, etc.)
• A more efficient representation is based on a monotone
minimal perfect hash function: a very compact data
structure that can answer the question "What is the
index of the term XXX?"
• You can build such a function from a sorted term list using:
java it.unimi.dsi.sux4j.mph.MinimalPerfectHashFunction
titles.mph dis-title.terms
Term map (2)
• Monotone minimal perfect hash functions have a serious limitation:
they answer the question "What is the
index of the term XXX?" correctly only if the term appears in the
dictionary
• To solve this problem, we can use a signed function
• For terms not in the dictionary, the function will answer
with a special value (-1) that means "the word is not in the
dictionary"
java it.unimi.dsi.util.ShiftAddXorSignedStringMap
titles.mph titles.map mycollection-title.terms
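The idea of signing can be sketched as follows: next to each index we store a signature (a hash) of the term that owns it, and a lookup is accepted only if the signature matches. The class SignedMapSketch is hypothetical; in particular, unsignedIndex is only a stand-in for a minimal perfect hash function (it uses an ordinary HashMap and returns an arbitrary index for unknown terms), and String.hashCode is an arbitrary choice of signature.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: an unsigned term -> index function made "signed".
class SignedMapSketch {
    private final Map<String, Integer> known = new HashMap<>();
    private final int[] signature; // one signature per term index

    SignedMapSketch(String[] sortedTerms) {
        signature = new int[sortedTerms.length];
        for (int i = 0; i < sortedTerms.length; i++) {
            known.put(sortedTerms[i], i);
            signature[i] = sortedTerms[i].hashCode();
        }
    }

    // Stand-in for an unsigned function: right on dictionary terms,
    // an arbitrary (but valid-looking) index on anything else.
    int unsignedIndex(String term) {
        Integer i = known.get(term);
        return i != null ? i : Math.floorMod(term.length(), signature.length);
    }

    // Signed version: -1 means "the word is not in the dictionary".
    int signedIndex(String term) {
        int i = unsignedIndex(term);
        return signature[i] == term.hashCode() ? i : -1;
    }

    public static void main(String[] args) {
        SignedMapSketch map = new SignedMapSketch(new String[] { "algorithms", "computer", "science" });
        System.out.println(map.signedIndex("computer"));  // 1
        System.out.println(map.unsignedIndex("robot"));   // 2, a wrong answer
        System.out.println(map.signedIndex("robot"));     // -1
    }
}
```

Since signatures are hashes, an unknown term can in principle collide with a stored signature and be accepted by mistake; the probability shrinks as the signature gets wider.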
Term map (3)
• Wildcard searches require the use of a prefix map
• A prefix map can answer the question
"What are the indices of terms starting with the characters
YYY?"
• If terms are lexicographically sorted, the answer is a pair
of integers, representing the first and the last index of
terms satisfying the property
• We can build a prefix map by using:
java it.unimi.dsi.util.ImmutableExternalPrefixMap -b4Ki -o
dis-title.terms dis-title.dict
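On a sorted term list, the pair of indices for a prefix can be found with two binary searches. This is only a sketch of the question a prefix map answers, not of how ImmutableExternalPrefixMap is implemented; the class PrefixRangeSketch and the sample terms are hypothetical, and the sketch assumes no term contains the character \uFFFF.

```java
// Hypothetical sketch: first and last index of the terms starting with a prefix.
class PrefixRangeSketch {
    // Index of the first term that is >= key (a.length if there is none).
    static int lowerBound(String[] a, String key) {
        int lo = 0, hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid].compareTo(key) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // Returns {first, last}, or null if no term starts with the prefix.
    static int[] range(String[] sortedTerms, String prefix) {
        int first = lowerBound(sortedTerms, prefix);
        // Every term starting with the prefix sorts before prefix + '\uFFFF'.
        int end = lowerBound(sortedTerms, prefix + Character.MAX_VALUE);
        return first < end ? new int[] { first, end - 1 } : null;
    }

    public static void main(String[] args) {
        String[] terms = { "index", "infor", "inform", "information", "informed", "input" };
        int[] r = range(terms, "infor"); // the wildcard query infor*
        System.out.println(r[0] + ".." + r[1]); // 1..4
    }
}
```

A wildcard query such as infor* can then be expanded to all the terms whose index lies in the returned interval.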
Homework
1. Read the MG4J (big) manual:
http://www.dis.uniroma1.it/~mele/teaching/WebIR/manual-mg4j.pdf
2. Repeat the exercise
3. Create your own document collection, build the inverted
index (with or without virtual fields), then submit some
queries and try the different scorers