preprocessing of records and queries

LIS618 lecture 2
preparing and preprocessing
Thomas Krichel
2011-04-21
today
• text files and non-text files
• characters
• lexical analysis
preliminaries
• Here we discuss some preparatory stages that
shape the way the IR system is built.
• This may sound unduly technical.
• But the reality is that often, when querying IR
systems
– we don’t know how they have been built
– we can’t find something that is not there
– we have to make guesses and try several things
key aid: index
• An index is a list of features, together with, for
each feature, a list of locations where the
feature is to be found.
– The feature may be a word.
– The locations depend on where we need to find
the feature.
example: book
• In a book, the index links terms to locations,
and the locations are page numbers.
• Index terms are selected manually by the
author or a specially trained indexer.
• Once indexing terms are chosen the index can
be built by computer.
• Note that any index term may appear on
several pages.
example: computer files
• Here the feature can be a word.
• And a location can be the location of the file,
together with the position of the start of the
term within the file.
• The end of the term can be found by taking
the start and adding the length of the term.
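The word/location idea can be sketched in Python. This is a minimal sketch, not a real IR system; the file names and contents are made up:

```python
# A minimal sketch of an index for computer files: each word maps to a
# list of (file name, byte offset) locations.
def build_index(files):
    """files: dict mapping file name -> text content."""
    index = {}
    for name, text in files.items():
        offset = 0
        for word in text.split():
            start = text.index(word, offset)      # where the word begins
            index.setdefault(word.lower(), []).append((name, start))
            offset = start + len(word)            # end = start + length of term
    return index

files = {"a.txt": "travel schedule", "b.txt": "travel links"}
index = build_index(files)
print(index["travel"])   # [('a.txt', 0), ('b.txt', 0)]
```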
working with files
• Usually, when indexing takes place, we study
the contents of documents that are stored in
files.
• Usually you know something about the files
– by the name of the directory
– by commonly used file name part, e.g. extension
• JHOVE is a tool that tries to guess file types
when you don’t know much about them.
character extraction
• The first thing to do is to extract the
characters out of these documents.
• That means translating the file content, a
series of 0s and 1s into a sequence of
characters.
• For some types the textual information is
limited.
image file, music files
• Even image files and music files usually carry
some textual data.
• jpeg files have exif metadata.
• mp3 files have id3 metadata.
• These data are simple attribute: value pairs.
example for id3 version 1, start
• header: "TAG"
• title: 30 bytes of the title
• artist: 30 bytes of the artist name
• album: 30 bytes of the album name
• year: 4 bytes, a four-digit year
• comment: 30 bytes of comment
• … more fields
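A sketch of reading such a tag. An ID3v1 tag is the last 128 bytes of an mp3 file, starting with the header "TAG", then fixed-width text fields; here we build a fake tag in memory instead of reading a real file, and the tag values are made up:

```python
# Sketch of parsing an ID3v1 tag: fixed-width fields at fixed offsets.
def parse_id3v1(tag: bytes):
    if len(tag) != 128 or tag[:3] != b"TAG":
        return None
    field = lambda a, b: tag[a:b].rstrip(b"\x00 ").decode("latin-1")
    return {
        "title":  field(3, 33),    # 30 bytes of the title
        "artist": field(33, 63),   # 30 bytes of the artist name
        "album":  field(63, 93),   # 30 bytes of the album name
        "year":   field(93, 97),   # 4 bytes, a four-digit year
    }

# A fake 128-byte tag built in memory (3 + 30 + 30 + 30 + 4 + 31 = 128).
raw = (b"TAG" + b"My Title".ljust(30, b"\x00")
             + b"Some Artist".ljust(30, b"\x00")
             + b"Some Album".ljust(30, b"\x00")
             + b"1999" + b"\x00" * 31)
print(parse_id3v1(raw)["title"])   # My Title
```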
mixed contents files
• If you look at an MS Word, spreadsheet or PDF
file, it contains text surrounded by non-
textual code that the software reading it can
understand.
• Such code contains descriptions of document
structure, as well as formatting instructions.
markup?
• Markup is the non-textual contents of a text
document.
• Markup is used to provide structure to a
document and to prepare for its formatting.
• An easy example for markup is any page
written in HTML
example from a very fine page
<li><a tabindex="400" class="int"
href="travel_schedule.html" accesskey="t">travel
schedule</a></li>
<li><a tabindex="500" class="int" accesskey="l"
href="links.html">favorite links</a>.</li>
</ul>
<div class="hidden">
This text is hidden.
</div>
text extracted
• The extracted text is “travel schedule favorite
links. This text is hidden.”
• The phrase “This text is hidden” is not
normally shown on the page but it is still
extracted and indexed by Google.
• Well, I think it is.
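The extraction step can be sketched with the standard library's HTMLParser: tags and attributes are dropped, character data is kept. The HTML string here is a shortened copy of the example above:

```python
from html.parser import HTMLParser

# Collect the character data between tags, discarding the markup.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)
    def text(self):
        # join pieces, then collapse runs of whitespace
        return " ".join(" ".join(self.parts).split())

html = ('<li><a href="travel_schedule.html">travel schedule</a></li>'
        '<li><a href="links.html">favorite links</a>.</li>'
        '<div class="hidden">This text is hidden.</div>')
p = TextExtractor()
p.feed(html)
print(p.text())
```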
tokenization
• Tokenization refers to the splitting of the
source to be indexed into tokens that can be
used in a query language.
• In the most common case, we think about
splitting the text into words.
• Hey that is easy: “travel” “schedule” “favorite”
“links” “this” “text” “is” “hidden”
• Spot a problem?
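The naive split can be tried in Python, and the problem shows itself:

```python
text = "travel schedule favorite links. This text is hidden."
print(text.split())
# ['travel', 'schedule', 'favorite', 'links.', 'This', 'text', 'is', 'hidden.']
# The problem: "links." and "hidden." keep their punctuation, and "This"
# keeps its capital, so none of them would match the query terms
# "links", "hidden", "this".
```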
text files
• Some files contain only text, meant for the
end user.
• Others contain text, plus some textual
instructions that are meant for processing
software. Such instructions are called markup.
• Typical examples for markup are web pages.
key aid: index terms
• The index term is a part of the document that
has a meaning on its own.
• It is usually a word. Sometimes it is a phrase.
• Let us stick with the idea of the indexing term
being a word.
the representation of text
• In principle, every file is just a sequence of
on/off electrical signals.
• How can these signals be translated into text?
• Well, text is a sequence of characters, and
every character needs to be recognized.
• Character recognition leads to text
recognition.
bits and bytes
• A bit is a unit of digital information. It can take
the values 0 and 1.
• A byte (a respelling of “bite” to avoid confusion)
is a unit of storage. In principle it is hardware
dependent, but de facto it is 8 bits.
characters and computer
• Computers cannot deal with characters
directly. They can only deal with numbers.
• Therefore we need to associate a number with
every character that we want to use in an
information encoding system.
• A character set associates characters with
numbers.
characters
• When we are working with characters, two
concepts appear.
– The character set.
– The character encoding.
• To properly recognize digital data, we need to
know both.
The character set
• The character set is an association of
characters with numbers. Examples:
– ‘ ‘ → 32
– bell → 7
– ‘→’ → 8594
• Without such a concordance the processing of
text by computer would be impossible.
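In Python this concordance is exposed by the built-ins ord() and chr(), using Unicode numbers; the examples match the slide above:

```python
# ord() gives the number the character set assigns to a character;
# chr() goes the other way.
print(ord(" "))      # 32
print(ord("\a"))     # 7, the bell character
print(ord("→"))      # 8594, the rightwards arrow
print(chr(8594))     # →
```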
the character encoding
• This regards the way the character number is
written out as a sequence of bytes.
• There are several ways to do this. Going into
details here would be a bit too technical.
• Let us look at some examples.
ascii
• The most famous, but old, character set is the
American Standard Code for Information
Interchange, ASCII.
• It contains 128 characters.
• In its representation in bytes, every byte starts
with a 0, followed by the seven bits that make
up the character number.
iso-8859-1 (iso-latin-1)
• This is an extension of ASCII that has
additional characters that are suitable for the
Western European languages.
• There are 256 characters.
• Every character occupies a byte.
Windows 1252
• This is an extension of iso-latin-1.
• It is a legacy character set used on the
Microsoft Windows operating system.
• Some code positions of iso-latin-1 have been
filled with characters that Microsoft liked.
UCS / Unicode
• UCS is a universal character set.
• It is maintained by the International
Organization for Standardization (ISO).
• Unicode is an industry standard for characters.
It is better documented than UCS.
• For what we discuss here, UCS and Unicode
are the same.
Basic multilingual plane
• This is a name for the first 65536 characters in
Unicode.
• Each of these characters fits into two bytes
and is conveniently represented by four hex
numbers.
• Even for these characters, there are numerous
complications associated with them.
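A quick check of the "two bytes, four hex numbers" claim, using the euro sign as an example:

```python
# Characters in the basic multilingual plane have numbers below 65536
# (hex FFFF), so four hex digits suffice for each.
euro = "€"                       # U+20AC, the euro sign
print(format(ord(euro), "04X"))  # 20AC
print(ord(euro) < 65536)         # True
```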
ascii and unicode
• The first 128 characters of UCS/Unicode are
the same as the ones used by ASCII.
• So you can think of UCS/Unicode as an
extension of ASCII.
in foreign languages
• Everything becomes difficult.
• As an example consider the characters
– o
– ő
– ö
• The latter two can be considered o with
diacritics or as separate characters.
most problematic: encoding
• One issue is how to map characters to
numbers.
• This is complicated for languages other than
English.
• But assume UCS/Unicode has solved this.
• But this is not the main problem that we
have when working with non-ASCII
character data.
encoding
• The encoding determines how the number
of each character should be put into bytes.
• If you have a character set that has one
byte for each character, you have no
encoding issue.
• But then you are limited to 256 characters
in your character set.
fixed-length encoding
• If you have a fixed-length encoding, all
characters take the same number of bytes.
• Say, for the basic multilingual plane of
Unicode, you need two bytes for each
character, and then you are limited to that
plane.
• If you are writing only ASCII, this appears
wasteful.
variable length encoding
• The most widely used scheme to encode
Unicode is a variable-length scheme called
UTF-8.
• It is important to understand that the
encoding needs to be known and correct.
basics of UTF-8
• Every ASCII character, represented as a byte,
starts with a zero.
• Characters beyond ASCII require two, three,
or four bytes to be complete.
• The first byte tells you how many bytes are
coming to make the character complete.
byte shapes in UTF-8 encoding
• 0??????? ASCII
• 110????? first byte of a two-byte character
• 1110???? first byte of a three-byte character
• 11110??? first byte of a four-byte character
• 10?????? a byte that is not the first byte
• As you can see, there are sequences of bytes
that are not valid.
hex range to UTF-8
• 0000 to 007F: 0???????
• 0080 to 07FF: 110????? 10??????
• 0800 to FFFF: 1110???? 10?????? 10??????
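The byte shapes can be inspected directly in Python; here A (ASCII), é (in the 0080–07FF range) and € (in the 0800–FFFF range) show the one-, two- and three-byte cases:

```python
# Print each character's UTF-8 bytes as bit patterns.
for ch in "A", "é", "€":
    byts = ch.encode("utf-8")
    print(ch, [format(b, "08b") for b in byts])
# A ['01000001']                                  one byte, starts with 0
# é ['11000011', '10101001']                      110…, then one 10… byte
# € ['11100010', '10000010', '10101100']          1110…, then two 10… bytes
```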
in the simplest case
• In the simplest case, let us assume that we
index a bunch of text files as full text.
• These documents contain text written in some
sort of a human language.
• They may also contain some markup, but we
assume we can strip the markup.
lexical analysis
• Lexical analysis is the process of turning a
sequence of characters into a sequence of
words.
• These candidate words can then be used as
index term (or what I have called features).
text direction
• Text is not necessarily linear.
• When right to left writing is used, you
sometimes have to jump in the opposite
direction when a left to right expression
appears. Example:
• ‫حققت الجزائر استقلالها عام 1962 بعد 132 سنة من الاحتلال الفرنسي.‬
(“Algeria achieved its independence in 1962,
after 132 years of French occupation.” The
numbers read left to right inside the right-to-
left sentence.)
spaces
• One important delimiter between words is the
space.
• The blank, the newline, the carriage return
and the tab characters are collectively known
as whitespace.
• We can collapse whitespace and get the
words.
• But there are special cases.
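Python's str.split() with no argument does exactly this collapsing: blanks, newlines, carriage returns and tabs all count as delimiters, and runs of them are treated as one:

```python
# All whitespace characters delimit words; runs of them collapse.
text = "one two\tthree\r\nfour\n\n five"
print(text.split())   # ['one', 'two', 'three', 'four', 'five']
```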
text without spaces: Japanese
• Japanese uses three scripts: kanji, hiragana
and katakana.
• Foreign words are written with katakana.
Any continuous katakana sequence is a word.
• A Japanese word has at most 4 kanji/hiragana.
A dictionary has to check backward, from 4
characters down to 1, whether a word exists.
punctuation
• Normally lexical analysis discards punctuation.
But some punctuation can be part of a word,
say “333B.C.”
• What we have to understand here is that
probably at query time, the same removal of
punctuation occurs.
• The user searches for “333B.C.”, the query is
translated to 333BC, and the occurrence is
found.
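The point that the same normalization runs on both sides can be sketched as follows; the regular expression used here is one possible choice, not a standard:

```python
import re

# The same punctuation removal is applied to documents at indexing
# time and to queries at query time, so the two sides match.
def normalize(s):
    return re.sub(r"[^\w\s]", "", s).lower()   # drop punctuation, fold case

doc_term = normalize("333B.C.")   # 333bc
query    = normalize("333B.C")    # 333bc
print(doc_term == query)          # True
```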
hyphen
• It’s a particularly problematic punctuation
mark.
• Breaking them up, like, for example, in
“state-of-the-art”, is good.
• End-of-line hyphens break a word across
lines; they may not be ends of words.
• Some hyphens are part of a word, e.g. UTF-8.
There I should use the non-breaking hyphen.
apostrophe
• Example:
– Mr. O’Neill thinks that the boys’ stories about
Chile’s capital aren’t amusing.
• “O Neill” … “aren t”
• The first seems OK, but the second looks silly.
stop words
• Some words are considered to carry no
meaning because they are too frequent.
• Sometimes, they are completely removed, and
not recorded in the index.
• Then you can’t search for the phrase “to be or
not to be”.
• Stop words are very language dependent.
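A sketch of stop word removal with a small hand-made stop list (real lists are language dependent and much longer); it shows why the phrase becomes unsearchable:

```python
# A tiny made-up English stop list.
STOP = {"to", "be", "or", "not", "the", "a", "is"}

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP]

print(remove_stop_words("to be or not to be".split()))   # []
print(remove_stop_words("the hidden text".split()))      # ['hidden', 'text']
```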
stemming
• Frequently, text contains grammatical forms
that a user may not search for. In English, e.g.
– plurals
– gerund form
– past tense suffixes
• The most famous algorithm for English
stemming is Martin Porter’s stemming
algorithm.
• Evidence on usefulness is mixed.
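A toy suffix stripper, much cruder than Martin Porter's algorithm, just to show the idea: grammatical endings are cut off so that variants map to one stem. The suffix list and length guard here are made up for illustration:

```python
def crude_stem(word):
    # Try a few common English suffixes, longest first; keep a minimum
    # stem length so short words are left alone.
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            if suffix == "ies":
                return word[:-3] + "y"     # libraries -> library
            return word[:-len(suffix)]
    return word

print(crude_stem("libraries"))   # library
print(crude_stem("indexing"))    # index
print(crude_stem("searched"))    # search
```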
grammatical variants
• These are much more important in other
languages. E.g., in Russian
– “My name is Irina.” → “Меня зовут Ирина.”
– “I love Irina.” → “Я люблю Ирину.”
• Text in a query in Russian must be matched
against a variety of grammatical forms, some of
which are very tricky to derive.
casing
• Normalization usually means that we treat
uppercase and lowercase the same, meaning
that the uppercase and lowercase variants of a
token become the same term.
• However in English
– “US” vs “us”
– “IT” vs “it”…
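A sketch of case folding with a small exception list for tokens like "US" and "IT" whose uppercase form carries meaning; the exception list here is made up:

```python
# Tokens whose uppercase form is a different word.
KEEP_CASE = {"US", "IT"}

def fold_case(token):
    return token if token in KEEP_CASE else token.lower()

print([fold_case(t) for t in ["The", "US", "sent", "us", "IT", "staff"]])
# ['the', 'US', 'sent', 'us', 'IT', 'staff']
```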
treatment of diacritics
• There are rules in Unicode for how diacritics
can be stripped.
• These involve decomposing characters with
diacritics into a base character and combining
marks.
• This is helpful for people who find the input of
diacritics tough.
• But there are lots of other tough choices.
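The decomposition rules are available in Python's unicodedata module; this is the mechanical stripping, with the shortcomings discussed on the next slide:

```python
import unicodedata

# Decompose each character (NFD), then drop the combining marks.
def strip_diacritics(s):
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_diacritics("Völklingen"))   # Volklingen
```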
historic equivalents
• Many languages that use diacritics or ligatures
have developed spelling variants without
them.
– Jaspersstraße → Jaspersstrasse
– Völklingen → Voelklingen
• A mechanical de-diacritic operation would
yield “Volklingen”, which few people would
choose.
• Such operations are very language specific.
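For German, the language-specific operation is a variant table applied instead of mechanical stripping: oe for ö, ss for ß, and so on. The table below covers only the common German cases and is a sketch:

```python
# German spelling variants for umlauts and ß.
GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
          "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}

def germanize(s):
    return "".join(GERMAN.get(c, c) for c in s)

print(germanize("Völklingen"))      # Voelklingen
print(germanize("Jaspersstraße"))   # Jaspersstrasse
```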
http://openlib.org/home/krichel
Please shut down the computers when
you are done.
Thank you for your attention!