Evidence from Content INST 734 Module 2 Doug Oard Agenda Character sets • Terms as units of meaning • Boolean retrieval • Building an index Where Representation Fits Query Documents Representation Function Representation Function Query Representation Document Representation Comparison Function Index Hits The character ‘A’ • ASCII encoding: 7 bits used per character 01000001 0100 0001 01 000 001 = 65 (decimal) = 41 (hexadecimal) = 101 (octal) • Number of representable character codes: 27 = 128 • Some codes are used as “control characters” e.g. 7 (decimal) rings a “bell” (these days, a beep) (“^G”) ASCII • Widely used for English – American Standard Code for Information Interchange – ANSI X3.4-1968 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 64 SPACE ! " # $ % & ' ( ) * + , . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ DEL | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Latin-1 Character Set • ISO 8859-1 8-bit characters for Western Europe – French, Spanish, Catalan, Galician, Basque, Portuguese, Italian, Albanian, Afrikaans, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English Printable Characters, 7-bit ASCII Additional Defined Characters, ISO 8859-1 Other ISO-8859 Character Sets -2 -6 -3 -7 -4 -8 -5 -9 East Asian Character Sets • More than 256 characters are needed – Two-byte encoding schemes (e.g., EUC) are used • Several countries have unique character sets – GB: China, BIG5: Taiwan, JIS: Japan, KS: Korea, TCVN: Vietnam • Many characters appear in several languages – Research Libraries Group developed EACC as a unified “CJK” character set for USMARC records Unicode • Single code for all the world’s characters – ISO Standard 10646 • Separates “code space” from “encoding” – Code space extends Latin-1 • The first 256 positions are identical – UTF-7 encoding will pass through email • Uses only the 64 printable ASCII characters – UTF-8 encoding is designed for disk file systems Limitations of Unicode • Produces larger files than Latin-1 • Fonts may be hard to obtain for some characters • Some characters have multiple representations – e.g., accents can be part of a character or separate • Some characters look identical when printed – But they come from unrelated languages • Encoding does not define the “sort order” Agenda • Character sets Terms as units of meaning • Boolean retrieval • Building an index