PowerPoint slides

advertisement
Evidence from Content
INST 734
Module 2
Doug Oard
Agenda
Character sets
• Terms as units of meaning
• Boolean retrieval
• Building an index
Where Representation Fits
Query
Documents
Representation
Function
Representation
Function
Query Representation
Document Representation
Comparison
Function
Index
Hits
The character ‘A’
• ASCII encoding: 7 bits used per character
01000001
0100 0001
01 000 001
= 65 (decimal)
= 41 (hexadecimal)
= 101 (octal)
• Number of representable character codes:
27 = 128
• Some codes are used as “control characters”
e.g. 7 (decimal) rings a “bell” (these days, a beep) (“^G”)
ASCII
• Widely used for English
– American Standard
Code for Information
Interchange
– ANSI X3.4-1968
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
NUL
SOH
STX
ETX
EOT
ENQ
ACK
BEL
BS
HT
LF
VT
FF
CR
SO
SI
DLE
DC1
DC2
DC3
DC4
NAK
SYN
ETB
CAN
EM
SUB
ESC
FS
GS
RS
US
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
64
SPACE
!
"
#
$
%
&
'
(
)
*
+
,
.
/
0
1
2
3
4
5
6
7
8
9
:
;
<
=
>
?
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
@
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
[
\
]
^
_
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
`
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
{
|
}
~
DEL
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The Latin-1 Character Set
• ISO 8859-1 8-bit characters for Western Europe
– French, Spanish, Catalan, Galician, Basque,
Portuguese, Italian, Albanian, Afrikaans, Dutch,
German, Danish, Swedish, Norwegian, Finnish,
Faroese, Icelandic, Irish, Scottish, and English
Printable Characters, 7-bit ASCII
Additional Defined Characters, ISO 8859-1
Other ISO-8859 Character Sets
-2
-6
-3
-7
-4
-8
-5
-9
East Asian Character Sets
• More than 256 characters are needed
– Two-byte encoding schemes (e.g., EUC) are used
• Several countries have unique character sets
– GB: China, BIG5: Taiwan, JIS: Japan, KS: Korea,
TCVN: Vietnam
• Many characters appear in several languages
– Research Libraries Group developed EACC as a
unified “CJK” character set for USMARC records
Unicode
• Single code for all the world’s characters
– ISO Standard 10646
• Separates “code space” from “encoding”
– Code space extends Latin-1
• The first 256 positions are identical
– UTF-7 encoding will pass through email
• Uses only the 64 printable ASCII characters
– UTF-8 encoding is designed for disk file systems
Limitations of Unicode
• Produces larger files than Latin-1
• Fonts may be hard to obtain for some characters
• Some characters have multiple representations
– e.g., accents can be part of a character or separate
• Some characters look identical when printed
– But they come from unrelated languages
• Encoding does not define the “sort order”
Agenda
• Character sets
Terms as units of meaning
• Boolean retrieval
• Building an index
Download