lis508 lecture 1: bits, bytes and characters
Thomas Krichel
2003-09-30

Structure
• Numbers
  – Bits
  – Bytes
• Character sets
  – Coded character set
  – Character encoding

Literature, no need to read…
• Norton “new inside the PC” chapter 4
• http://www.danbbs.dk/~erikoest/bb_terms.htm
• http://wwwinfo.cern.ch/asdoc/WWW/publications/ictp99/ictp99N2705.html
• http://www.cl.cam.ac.uk/~mgk25/unicode.html

Information
• Information is best understood as “what it takes to answer a question”.
• The simplest question has a “yes” or “no” answer. Therefore a bit is the natural measure of information.
• Term first used by John Tukey in 1946.
• Contraction of “binary digit”.

Usage of bits
• Computers are sometimes classified by the number of bits they can process at one time, e.g. "32 bit processor".
• Graphics are also often described by the number of bits used to represent each dot.

bits and bytes
• a bit can take the values 0 or 1, thus it can describe 2 possibilities
• two bits can take the values 00, 01, 10, 11, thus they can describe 2×2 = 4 possibilities
• n bits can encode 2 to the power n possibilities.
• Early chips processed 8 bits at a time. It became customary to refer to a group of 8 bits as a byte. A byte can encode 2 to the power 8 = 256 possibilities.
• We can use binary numbers just as decimal numbers.

application of bytes
• IP (Internet Protocol) numbers are used as the addresses of computers on the Internet.
• In IP version 4 (the one that is most commonly used), each IP number has 4 bytes.
• It is represented as x.x.x.x where x is a number between 0 and 255 (why?)
• how many computers can there be on the Internet at any one time?
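
The counting rule from the slides above (n bits encode 2 to the power n possibilities) can be checked with a few lines of Python. This is only a minimal sketch: the helper names `possibilities` and `ipv4_to_int` and the sample address 192.0.2.1 are made up for illustration, and the count of 4-byte addresses is an upper bound, since in practice some address ranges are reserved.

```python
# Sketch: counting possibilities with bits, and the size of the IPv4 address space.
# The function names and the sample address are hypothetical, chosen for illustration.

def possibilities(n_bits: int) -> int:
    """Number of distinct values that n_bits bits can encode (2 to the power n)."""
    return 2 ** n_bits


def ipv4_to_int(address: str) -> int:
    """Pack a dotted address x.x.x.x into one 4-byte (32-bit) number."""
    parts = [int(p) for p in address.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError(f"not a valid IPv4 address: {address!r}")
    value = 0
    for p in parts:
        value = value * 256 + p  # each part occupies exactly one byte
    return value


print(possibilities(1))    # one bit: 2
print(possibilities(8))    # one byte: 256, so each x ranges over 0..255
print(possibilities(32))   # four bytes: 4294967296 possible addresses
print(format(255, "08b"))  # the largest byte value in binary: 11111111
print(ipv4_to_int("192.0.2.1"))
```

Each x in x.x.x.x is one byte, which is why it cannot exceed 255, and four such bytes give at most 2 to the power 32 distinct addresses.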

decimal/binary numbers
decimal  binary
0        0
1        1
2        10
3        11
4        100
5        101
6        110
7        111
8        1000
9        1001
10       1010
11       1011
12       1100
13       1101
14       1110
15       1111

Many bytes
• Larger units are
  – Kilo byte is 2 to the power 10 bytes (=1024 bytes)
  – Mega byte is 2 to the power 20 bytes
  – Giga byte is 2 to the power 30 bytes
  – Tera byte is 2 to the power 40 bytes
• From ancient Greek words for "thousand", "large", "giant", and "monster", respectively. The prefix kilo dates back to the French revolution; the others were adopted later.

Hex numbers
• A byte is often represented by two hex digits.
• Each hex digit can encode 16 values.
• Written 0 to 9, then A B C D E F. F is 15.
• Conventionally prefixed with 0x
• Use the Microsoft calculator in scientific mode to convert.

application of hex numbers
• Media Access Control (MAC) addresses identify the hardware that allows access to computer networks. They are 6-byte numbers, each byte written as 2 hex digits, e.g. 00:60:08:F5:20:A9
• character numbers that you see when you are inserting a special symbol in Microsoft software, e.g. PowerPoint.

Characters
• Much of the information processed by computers is in the form of characters.
• A character only makes sense to a human user with a minimum level of cultural knowledge.
• A character is not a glyph.
  – ligatures

Information in a computer file
• A file is a piece of data stored on a computer.
• Any file contains a sequence of 0s and 1s, like 1010100101010011110101010101…
• For a computer to make sense of a file, it has to know what type of file it is.

executable files
• Executable files are files that make the computer do something. For example, the file starts a program, say PowerPoint. An executable for one computer may not run on another.
• Non-executable files hold data that is used by an executable file. We will call them data files. Example: PowerPoint slides file.

text files
• Many data files contain textual data.
• Textual data is a sequence of characters.
• A character is an elementary symbol that has some meaning
  – alphabet letter
  – hieroglyph
• Example: email file
• Text files can be read by many computer programs.

non-text files
• Examples of non-text files are
  – graphics files
  – movie files
  – sound files
• non-text files are not very important in library settings
  – there is no way to organize information retrieval for non-text files. They have to be retrieved using a textual surrogate.
  – traditional library material is textual
• will talk about this later.

Representing characters
• Computers don't understand text, they only understand numbers. For computers to be able to treat text, there must be a correspondence between numbers and text characters. Such a correspondence is called a character set.
• Examples of characters are
  – a
  – c
  – ë
  – €

Legacy character sets
• In the early days, computers were a lot less powerful than they are today.
• They could only deal with the characters that are most commonly used.
• Such sets are
  – ASCII
  – ISO-8859-1
  – cp1252

ASCII
• American Standard Code for Information Interchange
• 7-bit character set. There is no such thing as 8-bit ASCII
• 95 printable symbols
• 33 control characters (0-31, 127)
• http://www.ccmr.cornell.edu/helpful_data/ascii2.html has a list up to 127

some ASCII control characters
• CR (13, ^M) is the carriage return
• LF (10, ^J) is the linefeed
• FF (12, ^L) is the form feed (new page)
• BS (8, ^H) is the backspace
• DEL (127, ALT-127) is delete
• ESC (27, ^[) is escape

ISO-8859-1
• ISO-8859-1, aka ISO-latin-1, extends ASCII with characters that are commonly used by the western European languages.
• It is the default character set of HTML.
• Positions 128 to 159 are not used for printable characters.
• cp1252 fills these with graphic characters. It is a Microsoft character set.

This is not enough
• There are around 6800 different languages.
• Some of these languages use character sets that are not finite, i.e. folks can make up new characters out of existing ones!
• Setting up a character set for all languages is almost impossible.

ISO 10646-1
• Defines the Universal Character Set (UCS).
• UCS contains the characters required to write many known languages, even the likes of Oriya, Telugu, Bopomofo, Runic.
• ISO 10646 formally defines a 31-bit character set. Characters are represented as 32 bits, i.e. 4 bytes, or 8 hex digits.
• Not finished.

Unicode
• ISO is an international federation of national standards bodies. Slow and bureaucratic.
• Industry has come together to work on Unicode, originally designed as a 2-byte character set.
• With some minor exceptions, the Unicode characters are the same as the first 65536 characters in UCS.
• Much better documented standard.

Unicode and legacy sets
• The first 128 characters are identical to those in ASCII
• The next 128 characters are identical to the upper half of ISO 8859-1 (Latin-1).
• Unicode is well documented and the Unicode book can be downloaded from the Internet. A must-have for the serious digital librarian.

Politics…
• Does it make sense to use Unicode rather than, say, ISO-latin-1?
• Many commercial pieces of software have data files that contain character data interspersed with non-character data. Is that good?

http://openlib.org/home/krichel
Thank you for your attention!
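
As a closing illustration of the character set slides, here is a minimal Python sketch of the correspondence between characters and numbers. It assumes Python's built-in codecs named "ascii", "latin-1" and "cp1252" as stand-ins for the sets discussed in the lecture; the helper `encodable` is a hypothetical name introduced only for this example.

```python
# Sketch: characters as numbers, using the character sets named in the slides.
# `encodable` is a hypothetical helper; the codec names are Python's own.

def encodable(ch: str, charset: str) -> bool:
    """Return True if the character has a number in the given character set."""
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# The example characters from the "Representing characters" slide,
# shown as their Unicode/UCS numbers in hex.
for ch in "acë€":
    print(hex(ord(ch)))

print(encodable("a", "ascii"))    # True: plain letters are in ASCII
print(encodable("ë", "ascii"))    # False: ASCII has only 128 positions
print(encodable("ë", "latin-1"))  # True: ISO-8859-1 adds western European letters
print(encodable("€", "latin-1"))  # False: the euro sign is not in ISO-8859-1
print("€".encode("cp1252"))       # cp1252 places it at position 128 (0x80)

# The first 128 Unicode characters match ASCII, and the first 256 match ISO-8859-1.
ascii_matches = all(bytes([i]).decode("ascii") == chr(i) for i in range(128))
latin1_matches = all(bytes([i]).decode("latin-1") == chr(i) for i in range(256))
print(ascii_matches, latin1_matches)
```

The euro sign shows why the choice of character set matters: the same character has no number in ISO-8859-1, the number 128 in cp1252, and the number 0x20AC in Unicode.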