bits bytes and characters

lis508 lecture 1:
bits, bytes and characters
Thomas Krichel
2003-09-30
Structure
• Numbers
– Bits
– Bytes
• Character sets
– Coded character set
– Character encoding
Literature, no need to read…
• Norton “new inside the PC” chapter 4
• http://www.danbbs.dk/~erikoest/bb_terms.htm
• http://wwwinfo.cern.ch/asdoc/WWW/publications/ictp99/ictp99N2705.html
• http://www.cl.cam.ac.uk/~mgk25/unicode.html
Information
• Information is best understood as “what it
takes to answer a question”.
• The simplest question has a “yes” or “no”
answer. Therefore a bit is the natural
measure of information.
• Term first used by John Tukey in 1946.
• Contraction of “binary digit”.
Usage of bits
• Computers are sometimes classified by
the number of bits they can process at one
time, e.g. a "32-bit processor".
• Graphics are also often described by the
number of bits used to represent each dot.
bits and bytes
• a bit can take the values 0 or 1, thus it can
describe 2 possibilities
• two bits can take the values 00, 01, 10, 11, thus they
can describe 2×2 = 4 possibilities
• n bits can encode 2 to the power n possibilities.
• The first chips used to process 8 bits at a time. It
became customary to refer to 8 bits as a byte. A byte
can encode 2 to the power 8 = 256 possibilities.
• We can use binary numbers just as decimal
numbers.
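A minimal sketch of the counting argument above, in Python. The function name bit_patterns is made up for this illustration; it simply lists every pattern that n bits can take.

```python
from itertools import product

def bit_patterns(n):
    """Return every pattern that n bits can take, as strings of 0s and 1s."""
    return ["".join(bits) for bits in product("01", repeat=n)]

print(bit_patterns(2))        # the four patterns of two bits
print(len(bit_patterns(8)))   # the number of values one byte can take
```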
application of bytes
• IP (Internet Protocol) numbers are used as
the addresses of computers on the
Internet.
• In IP version 4 (the one that is most
commonly used), each IP number has 4
bytes.
• It is represented as x.x.x.x where x is a
number between 0 and 255 (why?)
• how many computers can there be on the
Internet at any one time?
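One way to explore the two questions above is the following Python sketch. The helper ipv4_to_bytes and the sample address 192.0.2.1 are assumptions chosen for illustration. The count is an upper bound on distinct addresses, not a statement about how many computers are really connected.

```python
def ipv4_to_bytes(address):
    """Split a dotted IPv4 address into its four byte values."""
    parts = [int(p) for p in address.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError("not a valid IPv4 address: " + address)
    return bytes(parts)

LARGEST_BYTE = 2 ** 8 - 1       # largest value that fits in one byte
ADDRESS_COUNT = 2 ** (4 * 8)    # number of distinct 4-byte addresses

print(list(ipv4_to_bytes("192.0.2.1")))
print(LARGEST_BYTE, ADDRESS_COUNT)
```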
decimal/binary numbers
• 0    0
• 1    1
• 2    10
• 3    11
• 4    100
• 5    101
• 6    110
• 7    111
• 8    1000
• 9    1001
• 10   1010
• 11   1011
• 12   1100
• 13   1101
• 14   1110
• 15   1111
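The decimal/binary pairs for 0 to 15 can be regenerated with a short Python sketch. The names to_binary and from_binary are invented here; they wrap the built-in format and int functions.

```python
def to_binary(n):
    """Write a non-negative integer in binary notation."""
    return format(n, "b")

def from_binary(digits):
    """Read a string of 0s and 1s as a binary number."""
    return int(digits, 2)

for n in range(16):
    print(n, to_binary(n))
```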
Many bytes
• Larger units are
– Kilobyte is 2 to the power 10 bytes (= 1024 bytes)
– Megabyte is 2 to the power 20 bytes
– Gigabyte is 2 to the power 30 bytes
– Terabyte is 2 to the power 40 bytes
• From ancient Greek words for "thousand",
"large", "giant", and "monster",
respectively. The prefix "kilo" dates back to the
French revolution; the others were adopted later.
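The sizes of these units can be worked out directly. This is a small Python sketch using the powers of two given above; the names UNIT_EXPONENTS and unit_size are assumptions for the example.

```python
UNIT_EXPONENTS = {"kilobyte": 10, "megabyte": 20, "gigabyte": 30, "terabyte": 40}

def unit_size(name):
    """Return the number of bytes in one unit, using powers of two."""
    return 2 ** UNIT_EXPONENTS[name]

for name in UNIT_EXPONENTS:
    print(name, unit_size(name))
```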
Hex numbers
• A byte is often represented by two hex
digits.
• Each hex digit can encode 16 values
• Written 0 to 9, then A B C D E F. F is 15.
• Conventionally prefixed with 0x
• Use the Microsoft calculator in scientific
view to convert.
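As an alternative to a calculator, the conversion can be sketched in Python. The helper byte_to_hex is a made-up name; it writes one byte as two hex digits with the conventional 0x prefix.

```python
def byte_to_hex(value):
    """Write one byte (0 to 255) as two hex digits with the 0x prefix."""
    if not 0 <= value <= 255:
        raise ValueError("a byte must be between 0 and 255")
    return "0x{:02X}".format(value)

print(byte_to_hex(255))
print(int("0xF", 16))   # reading a hex number back as decimal
```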
application of hex numbers
• Media Access Control (MAC) addresses identify
hardware that allows access to computer
networks. They are 6-byte numbers, each
byte written as 2 hex digits, e.g.
00:60:08:F5:20:A9
• character numbers that you see when you
are inserting a special symbol in Microsoft
software, e.g. PowerPoint.
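To see the six bytes behind the example address, here is a minimal Python sketch. The function name mac_to_bytes is an assumption for illustration; it relies on the standard bytes.fromhex method.

```python
def mac_to_bytes(mac):
    """Turn a colon-separated MAC address into its raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))

raw = mac_to_bytes("00:60:08:F5:20:A9")
print(len(raw), list(raw))
```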
Characters
• Much of the information processed by
computers is in the form of characters.
• A character only makes sense for a human
user of a minimum cultural level.
• A character is not a glyph.
– ligatures
Information in a computer file
• A file is a piece of data stored on a
computer.
• Any file contains a sequence of 0s and 1s,
like 1010100101010011110101010101…
• For a computer to make sense of a file, it
has to know what type of file it is.
executable files
• Files that are executable are files that
make the computer do something. For
example, the file starts a program, say
PowerPoint. An executable on one
computer may not run on another.
• Non-executable files hold data that is used
by an executable file. We will call them
data files. Example: PowerPoint slides file.
text files
• Many data files contain textual data.
• Textual data is a sequence of characters.
• A character is an elementary symbol that
has some meaning
– alphabet letter
– hieroglyph
• Example: email file
• Text files can be read by many computer
programs.
non-text files
• Examples for non-text files are
– graphics files
– movie files
– sound files
• non-text files are not very important in
library settings
– there is no way to organize information
retrieval for non-text files. They have to be
retrieved using a textual surrogate.
– traditional library materials are textual
• will talk about this later.
Representing characters
• Computers don't understand text, they only
understand numbers. For computers to be able
to treat text, there must be a correspondence
between numbers and text characters. Such a
correspondence is called a character set.
• Examples for characters are
– a
– c
– ë
– €
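The correspondence between characters and numbers can be made visible with a short Python sketch. It assumes the numbering that Python uses for its strings, which is Unicode; other character sets may assign other numbers. The name char_numbers is invented for the example.

```python
def char_numbers(text):
    """Pair each character of text with the number Python assigns to it."""
    return [(ch, ord(ch)) for ch in text]

# \u00eb is the letter e with diaeresis, \u20ac is the euro sign.
for ch, number in char_numbers("ac\u00eb\u20ac"):
    print(ch, number, hex(number))
```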
Legacy character sets
• In early days, computers were a lot less
powerful than they are today.
• Could only deal with the characters that
are most commonly used.
• Such sets are
– ASCII
– ISO-8859-1
– cp1252
ASCII
• American Standard Code for Information
Interchange
• 7-bit character set. There is no such thing
as 8-bit ASCII
• 95 printable symbols
• 33 control characters (0-31, 127)
• http://www.ccmr.cornell.edu/helpful_data/ascii2.html has a list up to 127
some ASCII control characters
• CR (13, ^M) is the carriage return
• LF (10, ^J) is the linefeed
• FF (12, ^L) is the form feed (new page)
• BS (8, ^H) is the backspace
• DEL (127, ALT-127) is delete
• ESC (27, ^[) is escape
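The numbers and ^ names in this list can be checked with a small Python sketch. The table CONTROL_CHARS and the helper caret_name are assumptions for illustration. The ^ name of a control character in the range 0 to 31 is the printable character 64 positions higher.

```python
CONTROL_CHARS = {
    "CR": "\r", "LF": "\n", "FF": "\f",
    "BS": "\b", "DEL": "\x7f", "ESC": "\x1b",
}

def caret_name(ch):
    """Return the ^X keyboard notation for a control character in 0 to 31."""
    code = ord(ch)
    if not 0 <= code <= 31:
        raise ValueError("caret notation is only computed here for codes 0 to 31")
    return "^" + chr(code + 64)

for name, ch in CONTROL_CHARS.items():
    print(name, ord(ch))
```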
ISO-8859-1
• ISO-8859-1, aka ISO-latin-1 extends
ASCII with characters that are commonly
used by the western European languages.
• It is the default character set of HTML.
• Positions 128 to 159 are not used for printable characters.
• Cp1252 fills these with graphic chars. It is
a Microsoft character set.
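The difference between the two sets shows up when the same byte is decoded both ways. This is a sketch using Python's built-in "latin-1" and "cp1252" codecs; the helper decode_both is a made-up name. Note that Python's cp1252 codec rejects a few positions in the range 128 to 159, so the example sticks to bytes that both codecs accept.

```python
def decode_both(value):
    """Decode one byte value as ISO-8859-1 and as cp1252."""
    raw = bytes([value])
    return raw.decode("latin-1"), raw.decode("cp1252")

latin1_char, cp1252_char = decode_both(0x80)
print(repr(latin1_char), repr(cp1252_char))   # position 128 differs
print(decode_both(0xE9))                      # position 233 is the same in both
```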
This is not enough
• There are around 6800 different languages
in the world.
• Some of these languages use character
sets that are not finite, i.e. folks can make
up new characters out of existing ones!
• Setting up a character set for all
languages is almost impossible.
ISO 10646-1
• Defines the Universal Character Set
(UCS).
• UCS contains the characters required to
represent characters used by many known
languages, even the likes of Oriya, Telugu,
Bopomofo, Runic.
• ISO 10646 formally defines a 31-bit
character set. Characters are represented as 32
bits, i.e. 4 bytes, or 8 hex digits.
• Not finished.
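To see what 4 bytes per character look like, here is a Python sketch. It assumes that Python's "utf-32-be" codec is an acceptable stand-in for the 4-byte UCS form described above; the name ucs4_hex is invented for the example.

```python
def ucs4_hex(ch):
    """Show one character as 4 bytes, written as 8 hex digits."""
    return ch.encode("utf-32-be").hex().upper()

print(ucs4_hex("A"))
print(ucs4_hex("\u20ac"))   # the euro sign
```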
Unicode
• ISO is an inter-government agency. Slow
and bureaucratic.
• Industry has come together to work on
Unicode, a 2-byte character set.
• With some minor exceptions, the Unicode
characters are the same as the first 65536
characters in UCS.
• Much better documented standard.
Unicode and legacy sets
• The first 128 characters are identical to
those in ASCII
• The next 128 characters are identical to
ISO 8859-1 (Latin-1).
• Unicode is well documented and the
Unicode book can be downloaded from
the Internet. A must-have for the serious
digital librarian.
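The overlap with the legacy sets can be checked position by position. This is a sketch that relies on Python's "ascii" and "latin-1" codecs and on chr, which returns the Unicode character for a number; matches_unicode is a hypothetical helper name.

```python
def matches_unicode(codec, count):
    """Check that the first count positions of codec equal Unicode's."""
    return all(bytes([i]).decode(codec) == chr(i) for i in range(count))

print(matches_unicode("ascii", 128))
print(matches_unicode("latin-1", 256))
```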
Politics…
• Does it make sense to use Unicode rather
than, say, ISO-latin-1?
• Many commercial pieces of software have
data files that contain character data
interspersed with non-character data. Is
that good?
http://openlib.org/home/krichel
Thank you for your attention!