LIS618 lecture 2: preparing and preprocessing
Thomas Krichel
2011-04-21

today
• text files and non-text files
• characters
• lexical analysis

preliminaries
• Here we discuss some preparatory stages that shape the way the IR system is built.
• This may sound unduly technical.
• But the reality is that often, when querying IR systems
– we don't know how they have been built
– we can't find something that is not there
– we have to make guesses and try several things

key aid: index
• An index is a list of features, each with a list of locations where the feature is to be found.
– The feature may be a word.
– The locations depend on where we need to find the feature.

example: book
• In a book the index links terms to locations, and the locations are page numbers.
• Index terms are selected manually by the author or a specially trained indexer.
• Once indexing terms are chosen, the index can be built by computer.
• Note that any index term may appear on several pages.

example: computer files
• Here the feature can be a word.
• A location can be the location of the file, together with the location of the start of the term within the file.
• The end of the term can be found by taking the start and adding the length of the term.

working with files
• Usually, when indexing takes place, we study the contents of documents that are stored in files.
• Usually you know something about the files
– by the name of the directory
– by a commonly used file name part, e.g. the extension
• JHOVE is a tool that tries to guess file types when you don't know much about them.

character extraction
• The first thing to do is to extract the characters out of these documents.
• That means translating the file content, a series of 0s and 1s, into a sequence of characters.
• For some file types the textual information is limited.

image files, music files
• Even image files and music files usually have some textual data stored in them.
• JPEG files have Exif metadata.
• MP3 files have ID3 metadata.
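Because ID3v1 fields sit at fixed offsets in the last 128 bytes of the file, extracting this textual data takes only a few lines. A minimal sketch in Python (the choice of latin-1 for decoding is an assumption, since ID3v1 does not specify an encoding; real MP3 files may instead carry the newer ID3v2 format):

```python
def read_id3v1(path):
    """Read the 128-byte ID3v1 tag at the end of an MP3 file.

    Returns a dict of the textual fields, or None if no tag is present.
    """
    with open(path, "rb") as f:
        f.seek(-128, 2)  # the tag occupies the last 128 bytes
        tag = f.read(128)
    if tag[:3] != b"TAG":
        return None

    def field(raw):
        # Fields are NUL-padded; latin-1 decoding is a common assumption.
        return raw.split(b"\x00")[0].decode("latin-1").strip()

    return {
        "title": field(tag[3:33]),
        "artist": field(tag[33:63]),
        "album": field(tag[63:93]),
        "year": field(tag[93:97]),
        "comment": field(tag[97:127]),
    }
```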
• These data are simple attribute: value pairs.

example for ID3 version 1
• header: "TAG"
• title: 30 bytes for the title
• artist: 30 bytes for the artist name
• album: 30 bytes for the album name
• year: 4 bytes, a four-digit year
• comment: 30 bytes of comment
• … more fields

mixed contents files
• If you look at an MS Word, spreadsheet or PDF file, it contains text surrounded by non-textual code that the software reading it can understand.
• Such code contains descriptions of document structure, as well as formatting instructions.

markup?
• Markup is the non-textual contents of a text document.
• Markup is used to provide structure to a document and to prepare for its formatting.
• An easy example of markup is any page written in HTML.

example from a very fine page
<li><a tabindex="400" class="int" href="travel_schedule.html" accesskey="t">travel schedule</a></li>
<li><a tabindex="500" class="int" accesskey="l" href="links.html">favorite links</a>.</li>
</ul>
<div class="hidden">
This text is hidden.
</div>

text extracted
• The extracted text is "travel schedule favorite links. This text is hidden."
• The phrase "This text is hidden" is not normally shown on the page, but it is still extracted and indexed by Google.
• Well, I think it is.

tokenization
• Tokenization refers to the splitting of the source to be indexed into tokens that can be used in a query language.
• In the most common case, we think about splitting the text into words.
• Hey, that is easy: "travel" "schedule" "favorite" "links" "this" "text" "is" "hidden"
• Spot a problem?

text files
• Some files contain only text, meant for the end user.
• Some others contain text plus textual instructions that are meant for processing software. Such instructions are called markup.
• Typical examples of files with markup are web pages.

key aid: index terms
• An index term is a part of the document that has a meaning on its own.
• It is usually a word. Sometimes it is a phrase.
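The idea of splitting text into words and then recording where each word occurs can be sketched as a toy inverted index. This is an illustration, not a real IR system; the document texts and helper names are invented:

```python
from collections import defaultdict

def tokenize(text):
    # Naive tokenization: lowercase, split on whitespace,
    # and strip surrounding punctuation from each token.
    return [tok.strip(".,;:!?\"'").lower() for tok in text.split()]

def build_index(documents):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            if term:
                index[term].add(doc_id)
    return index

docs = {1: "travel schedule", 2: "favorite links. This text is hidden."}
index = build_index(docs)
print(sorted(index["text"]))  # [2]
```

A query for a word is then just a lookup in the index, instead of a scan over every document.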
• Let us stick with the idea of the indexing term being a word.

the representation of text
• In principle, every file is just a sequence of on/off electrical signals.
• How can these signals be translated into text?
• Well, text is a sequence of characters, and every character needs to be recognized.
• Character recognition leads to text recognition.

bits and bytes
• A bit is a unit of digital information. It can take the values 0 and 1.
• A byte (a respelling of "bite" to avoid confusion) is a unit of storage. It is in principle hardware dependent, but it is de facto 8 bits.

characters and computers
• Computers can not deal with characters directly. They can only deal with numbers.
• Therefore we need to associate a number with every character that we want to use in an information encoding system.
• A character set combines characters with numbers.

characters
• When we are working with characters, two concepts appear.
– The character set.
– The character encoding.
• To properly recognize digital data, we need to know both.

the character set
• The character set is an association of characters with numbers. Example:
– ' ' (space) 32
– bell 7
– '→' 8594
• Without such a concordance the processing of text by computer would be impossible.

the character encoding
• This regards the way the character number is written out as a sequence of bytes.
• There are several ways to do this. Going into details here would be a bit too technical.
• Let us look at some examples.

ascii
• The most famous, but old, character set is the American Standard Code for Information Interchange, ASCII.
• It contains 128 characters.
• Its representation in bytes has every byte start with a 0, followed by the seven bits that make up the character number.

iso-8859-1 (iso-latin-1)
• This is an extension of ASCII with additional characters that are suitable for the Western European languages.
• There are 256 characters.
• Every character occupies a byte.

Windows 1252
• This is an extension of iso-latin-1.
• It is a legacy character set used on the Microsoft Windows operating system.
• Some code positions of iso-latin-1 have been filled with characters that Microsoft liked.

UCS / Unicode
• UCS is a universal character set.
• It is maintained by the International Organization for Standardization.
• Unicode is an industry standard for characters. It is better documented than UCS.
• For what we discuss here, UCS and Unicode are the same.

basic multilingual plane
• This is a name for the first 65536 characters in Unicode.
• Each of these characters fits into two bytes and is conveniently represented by four hex digits.
• Even for these characters, there are numerous complications.

ascii and unicode
• The first 128 characters of UCS/Unicode are the same as the ones used by ASCII.
• So you can think of UCS/Unicode as an extension of ASCII.

in foreign languages
• Everything becomes difficult. As an example, consider the characters
– o
– ő
– ö
• The latter two can be considered o with diacritics, or as separate characters.

most problematic: encoding
• One issue is how to map characters to numbers.
• This is complicated for languages other than English.
• But assume UCS/Unicode has solved this.
• Even so, this is not the main problem that we have when working with non-ASCII character data.

encoding
• The encoding determines how the number of each character should be put into bytes.
• If you have a character set that has one byte for each character, you have no encoding issue.
• But then you are limited to 256 characters in your character set.

fixed-length encoding
• In a fixed-length encoding, all characters take the same number of bytes.
• Say, for the basic multilingual plane of Unicode, you need two bytes for each character, and then you are limited to that.
• If you are writing only ASCII, this appears a waste.

variable-length encoding
• The most widely used scheme to encode Unicode is a variable-length scheme called UTF-8.
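The variable-length behaviour of UTF-8, and why the encoding must be known, can be demonstrated in a few lines of Python:

```python
# One character takes one to four bytes in UTF-8.
for ch in ["o", "ö", "€", "𝄞"]:
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded.hex(" "))
# o 1 6f
# ö 2 c3 b6
# € 3 e2 82 ac
# 𝄞 4 f0 9d 84 9e

# Decoding the bytes with the wrong character set produces garbage
# ("mojibake"): the two bytes of 'ö' become two latin-1 characters.
print("ö".encode("utf-8").decode("iso-8859-1"))  # Ã¶
```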
• It is important to understand that the encoding needs to be known and correct.

basics of UTF-8
• Every ASCII character, represented as a byte, starts with a zero.
• Characters beyond ASCII require two, three or four bytes to be complete.
• The first byte tells you how many bytes are coming to make the character complete.

byte shapes in UTF-8 encoding
• 0??????? ASCII character
• 110????? first byte of a two-byte character
• 1110???? first byte of a three-byte character
• 11110??? first byte of a four-byte character
• 10?????? byte that is not the first byte of a character
• As you can see, there are sequences of bytes that are not valid.

hex range to UTF-8
• 0000 to 007F: 0???????
• 0080 to 07FF: 110????? 10??????
• 0800 to FFFF: 1110???? 10?????? 10??????

in the simplest case
• In the simplest case, let us assume that we index a bunch of text files as full text.
• These documents contain text written in some sort of human language.
• They may also contain some markup, but we assume we can strip the markup.

lexical analysis
• Lexical analysis is the process of turning a sequence of characters into a sequence of words.
• These candidate words can then be used as index terms (or what I have called features).

text direction
• Text is not necessarily linear.
• When right-to-left writing is used, you sometimes have to jump in the opposite direction when a left-to-right expression appears. Example:
• حققت الجزائر استقلالها عام 1962 بعد 132 سنة من الاحتلال الفرنسي. ("Algeria achieved its independence in 1962, after 132 years of French occupation.")

spaces
• One important delimiter between words is the space.
• The blank, the newline, the carriage return and the tab characters are collectively known as whitespace.
• We can collapse whitespace and get the words.
• But there are special cases.

text without spaces: Japanese
• Japanese uses 3 scripts: kanji, hiragana and katakana.
• All foreign words are written with katakana. Any continuous katakana sequence is a word.
• A Japanese word has at most 4 kanji/hiragana characters. A dictionary lookup has to check backward from length 4 to 1 whether a word exists.
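A dictionary-driven segmenter of the kind described for Japanese can be sketched as a greedy longest-match scan. The tiny dictionary here is invented for illustration; real segmenters use large dictionaries and smarter disambiguation:

```python
# Toy dictionary for the sketch; a real one would be far larger.
DICTIONARY = {"東京", "大学", "東京大学", "に", "行く"}

def segment(text, max_len=4):
    """Greedy left-to-right segmentation: try the longest candidate
    first, going backward from max_len to 1, as described above."""
    words = []
    i = 0
    while i < len(text):
        for length in range(max_len, 0, -1):
            candidate = text[i:i + length]
            if candidate in DICTIONARY or length == 1:
                # Single characters are accepted as a fallback.
                words.append(candidate)
                i += len(candidate)
                break
    return words

print(segment("東京大学に行く"))  # ['東京大学', 'に', '行く']
```

Note that the greedy strategy prefers "東京大学" (Tokyo University) over the shorter words "東京" and "大学" that it contains.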
punctuation
• Normally lexical analysis discards punctuation. But some punctuation can be part of a word, say "333B.C."
• What we have to understand here is that, probably, the same removal of punctuation occurs at query time.
• The user searches for "333B.C.", the query is translated to 333BC, and the occurrence is found.

hyphen
• It's a particularly problematic punctuation mark.
• Breaking them, like, for example, in "state-of-the-art", is good.
• Hyphenation breaks at the end of a line may not be ends of words.
• Some hyphenation is part of a word, e.g. UTF-8. I should use the non-breaking hyphen.

apostrophe
• Example:
– Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.
• O Neill … aren t
• The first seems OK, but the second looks silly.

stop words
• Some words are considered to carry no meaning because they are too frequent.
• Sometimes they are completely removed and do not appear in the index at all.
• Then you can't search for the phrase "to be or not to be".
• Stop words are very language dependent.

stemming
• Frequently, text contains grammatical forms that a user may not search for. In English, e.g.
– plurals
– gerund forms
– past tense suffixes
• The most famous algorithm for English stemming is Martin Porter's stemming algorithm.
• Evidence on its usefulness is mixed.

grammatical variants
• These are much more important in other languages. E.g., in Russian
– "My name is Irina." "Меня зовут Ирина."
– "I love Irina." "Я люблю Ирину."
• Text in a query in Russian must be matched against a variety of grammatical forms, some of which are very tricky to derive.

casing
• Normalization usually means that we treat uppercase and lowercase the same, meaning that the uppercase and lowercase variants of a token become the same term.
• However, in English
– "US" vs "us"
– "IT" vs "it" …

treatment of diacritics
• There are rules in Unicode for how diacritics can be stripped.
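These Unicode rules are available in Python's standard unicodedata module; a minimal stripping sketch:

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining diacritical marks from text."""
    # NFD decomposition splits 'ö' into 'o' plus a combining diaeresis;
    # combining marks have category 'Mn' and are dropped.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed
                   if unicodedata.category(c) != "Mn")

print(strip_diacritics("Völklingen"))  # Volklingen
```

Note that ligatures such as "ß" are not combining marks and survive this operation unchanged, which is one reason purely mechanical stripping is not enough.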
• These involve decomposing a character into a base character plus combining diacritics.
• This is helpful for people who find the input of diacritics tough.
• But there are lots of other tough choices.

historic equivalents
• Many languages that use diacritics or ligatures have developed spelling variants without them.
– Jaspersstraße Jasperstrasse
– Völklingen Voelklingen
• A mechanical de-diacritic operation would yield "Volklingen", which few people would choose.
• Such operations are very language specific.

http://openlib.org/home/krichel
Please shut down the computers when you are done.
Thank you for your attention!