Characters and Images - Department of Computer and Information

Department of Computer and Information Science,
School of Science, IUPUI
CSCI N305
Information Representation:
Characters and Images
Dale Roberts
Information Representation Review
All information must be rendered into binary in order to
be stored on a computer.
Prior examples of binary information representations
include positive integers, negative integers, and
floating-point numbers.
Besides numbers, almost all applications must store
characters and string information.
Images are pervasive in today’s internet world and must
be rendered in binary to be handled by internet
browsers.
Crucial to making general-purpose computers (computers
that can easily perform many different tasks) is the idea
that a program is just data. Like any other
information, programs must be rendered into binary in
order to be stored within a computer.
Character Representations
EBCDIC – IBM Mainframes
ASCII – PC workstations
Unicode – International Character sets
EBCDIC
Expanded name
Extended Binary Coded Decimal Interchange Code
Area covered
8-bit coded character set for information interchange between IBM computers
Sponsoring body
Proprietary specification developed by IBM
Characteristics/description
A set of national character sets for interchange of documents between IBM
mainframes. Most EBCDIC character sets do not contain all of the characters defined
in the ASCII code set but there is a special International Reference Version (IRV)
code set that contains all of the characters in ISO/IEC 646 (and, therefore, ASCII).
Several national versions have been updated to support the encoding of the euro
sign (in lieu of the currency sign).
Usage
Not much used outside of IBM and similar mainframe environments. When
transmitting EBCDIC files between systems, care needs to be taken to ensure that the
systems are set up for the relevant national code set.
Further details available from
Your local IBM office.
Other references
Details of the most commonly used sets of EBCDIC codes can be obtained from
http://www.dkuug.dk/i18n/charmaps, which, however, has not necessarily been
updated to cover the new code pages that also support the euro sign.
EBCDIC Code Table
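A few entries of an EBCDIC code table can be inspected from Python. This is only a sketch: it assumes code page 037 (available in Python as the built-in "cp037" codec) as a stand-in for "EBCDIC", and other national code pages assign some codes differently. The helper name ebcdic_code is made up for this illustration.

```python
# Assumption: code page 037 is used as a representative EBCDIC variant.
EBCDIC_CODEC = "cp037"

def ebcdic_code(ch: str) -> int:
    """Return the 8-bit EBCDIC (cp037) code of a single character."""
    return ch.encode(EBCDIC_CODEC)[0]

# Show the EBCDIC code next to the ASCII code for a few characters.
for ch in ["A", "a", "0", " "]:
    print(repr(ch), "EBCDIC", hex(ebcdic_code(ch)), "ASCII", hex(ord(ch)))
```

In this code page lowercase letters have smaller codes than uppercase letters, and digits come last, which is the reverse of the ASCII ordering.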
ASCII
Expanded name
American Standard Code for Information Interchange
Area covered
7-bit coded character set for information interchange
Sponsoring body
American National Standards Institute (ANSI)
Source documents
Information Systems – Coded Character Sets – 7-Bit American National Standard
Code for Information Interchange (7-Bit ASCII)
Characteristics/description
Specifies coding of space and a set of 94 characters (letters, digits and punctuation
or mathematical symbols) suitable for the interchange of basic English language
documents. Forms the basis for most computer code sets and is the American
National Version of ISO/IEC 646.
Usage
Used as the basic US code set for personal and workstation computers.
Further details available from
ANSI, 25 West 43rd Street, New York, NY 10036, USA
Other references
A list of ASCII codes can be obtained from
http://www.dkuug.dk/i18n/charmaps/ANSI_X3.4-1968.
ASCII Code Set
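The printable part of the ASCII code set (the space plus the 94 graphic characters) can be listed with a short Python sketch. The function name ascii_row and the row layout are choices made for this illustration, not part of any standard.

```python
def ascii_row(start: int, count: int = 8) -> str:
    """Format `count` consecutive ASCII codes as 'decimal binary character'."""
    cells = []
    for code in range(start, start + count):
        cells.append(f"{code:3d} {code:07b} {chr(code)!r}")
    return " | ".join(cells)

# Codes 32..126: space (32) followed by the 94 graphic characters.
for start in range(32, 127, 8):
    print(ascii_row(start, min(8, 127 - start)))
```

Each entry shows the decimal code, the 7-bit binary pattern, and the character itself.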
Unicode
From MSDN: Unicode can represent all of the world's
characters in modern computer use, including technical
symbols and special characters used in publishing.
Because each Unicode code value is 16 bits wide, it is
possible to have separate values for up to 65,536
characters. Unicode-enabled functions are often referred
to as "wide-character" functions. Note that the
implementation of Unicode in 16-bit values is referred to
as UTF-16. For compatibility with 8- and 7-bit
environments, UTF-8 and UTF-7 are two transformations
of 16-bit Unicode values. For more information, see The
Unicode Standard, Version 2.0.
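The relationship between code values and their encoded forms can be checked in Python. The sketch below assumes Python's built-in "utf-8", "utf-16-be" and "utf-32-be" codecs; the function name encoded_sizes is made up here. Note that the 16-bit description quoted above fits characters in the Basic Multilingual Plane; later versions of Unicode also define code points above U+FFFF, which UTF-16 stores as two 16-bit units.

```python
def encoded_sizes(ch: str) -> dict:
    """Return how many bytes one character needs in UTF-8, UTF-16 and UTF-32."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),  # big-endian, no byte-order mark
        "utf-32": len(ch.encode("utf-32-be")),
    }

# Latin letter, accented letter, euro sign, CJK ideograph, emoji (outside the BMP).
for ch in ["A", "\u00e9", "\u20ac", "\u4e2d", "\U0001F600"]:
    print(ch, encoded_sizes(ch))
```

ASCII characters keep their one-byte form in UTF-8, which is what makes UTF-8 compatible with 8-bit environments.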
Universal Character Set (Unicode)
ISO/IEC 10646
Expanded name
ISO/IEC 10646: Universal Multiple-Octet Coded Character Set (UCS)
Area covered
Multilingual, multi-octet character set covering all major trading languages. The intent is to provide coding for all the characters of all the
scripts of the world.
Sponsoring body
ISO/IEC JTC1/SC2 and ISO/IEC JTC1/SC22 WG20
Source documents
ISO/IEC 10646-1 Information technology -- Universal Multiple-Octet Coded Character Set (UCS)
Part 1: Architecture and Basic Multilingual Plane
Part 2: Supplementary Planes
ISO/IEC DIS 14651 International string ordering and comparison -- Method for comparing character strings and description of the
common template tailorable ordering
ISO/IEC PRF TR 14652 Information technology -- Specification method for cultural conventions
ISO/IEC 14755:1997 Information technology -- Input methods to enter characters from the repertoire of ISO/IEC 10646 with a keyboard or
other input devices
Unicode 3.2
RFC 2279 UTF-8, a transformation format of ISO 10646
Characteristics/description
Integrates previous internationally/nationally agreed character sets into a single code set, together with additional characters for
previously encoded scripts and new scripts, both current and ancient. ISO/IEC 10646 is based on a 4-octet (32-bit) coding scheme known
as the "canonical form" (UCS-4), but a 2-octet (16-bit) form (UCS-2) is used for the Basic Multilingual Plane (BMP), where the missing
two high-order octets are assumed to be 00 00. The code set is split into 128 "groups" of 256 "planes", each containing 256 "rows" with
256 "cells" for characters. Each character is given a code position using multiple octets: in UCS-4 the third octet (the first in UCS-2)
identifies the row containing the character, and the fourth (the second in UCS-2) its cell number.
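The group/plane/row/cell structure described above can be sketched in Python by splitting a character's 32-bit code position into its four octets. The function names ucs4_position and in_bmp are hypothetical helpers for this illustration.

```python
def ucs4_position(ch: str) -> dict:
    """Split a character's 32-bit UCS-4 code position into its four octets."""
    group, plane, row, cell = ord(ch).to_bytes(4, "big")
    return {"group": group, "plane": plane, "row": row, "cell": cell}

def in_bmp(ch: str) -> bool:
    """True when the two high-order octets are 00 00, so the 2-octet form suffices."""
    pos = ucs4_position(ch)
    return pos["group"] == 0 and pos["plane"] == 0

for ch in ["A", "\u4e2d", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", ucs4_position(ch), "BMP" if in_bmp(ch) else "not BMP")
```

For example, U+4E2D lies in group 0, plane 0, row 4E (hexadecimal), cell 2D, so its 2-octet form is simply 4E 2D.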
Usage
This standard has become the basic coding form for all 16- and 32-bit computer systems. Users of Internet Explorer 5, and XLink-aware
XML browsers, can obtain more details about applications of ISO 10646 from the Diffuse Topic Map service.
Further details available from
ISO and national standards bodies.
Other references
Details of the Unicode standard, the repertoire and coding of which are identical to those of the ISO/IEC 10646 code set, can be obtained
from http://www.unicode.org.
Unicode Latin Set
Additional Unicode Pages
Code Range (hexadecimal)    Character Set
0000-00FF                   Latin-1 (ASCII)
0000-2000                   General character alphabets: Latin, Cyrillic, Greek, Hebrew, Arabic, Thai, etc.
2000-3000                   Symbols and dingbats: punctuation, maths, technical, geometric shapes, etc.
3000-33FF                   Miscellaneous punctuation, symbols, and phonetics for Chinese, Japanese, Korean
4E00-9FFF                   Chinese, Japanese, Korean ideographs
AC00-D7AF                   Korean Hangul syllables
D800-DFFF                   Space for surrogates
E000-F8FF                   Private use
Comparing Characters: Collating Sequence
If you look at the ASCII Character Code Table,
the ASCII binary number for “A” is 1000001,
which is 65 decimal. The ASCII binary number
for “a” is 1100001, which is 97 decimal.
Therefore, “A” is less than “a”. A blank is
stored as 0100000, or 32 decimal. The blank has
a smaller value than any digit, letter, or other
printable character.
Rules:
Upper case < lower case
Space < any other character
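These values and rules can be checked directly in Python, where ord() returns a character's code. The helper name code_bits is made up for this sketch.

```python
def code_bits(ch: str) -> str:
    """Return the 7-bit ASCII pattern of a character as a string of 0s and 1s."""
    return format(ord(ch), "07b")

for ch in [" ", "0", "A", "a"]:
    print(repr(ch), ord(ch), code_bits(ch))

# Single characters compare by their codes.
print("A" < "a")   # upper case < lower case
print(" " < "0")   # space < any other printable character
```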
Comparing Strings
A useful operation is the comparison of two strings. Two strings are related in the
same three basic ways as number values. One string is either less than, equal to, or
greater than the other. String comparison is usually based on the positions of the
characters in the character set.
Scanning along both strings and comparing corresponding characters establishes the
relationship between two strings. The strings are equal as long as corresponding
characters are equal. If two characters are different, the comparison is based on
their relative order in the character set. The character whose code is less belongs to
the lesser string.
Ex.
“abcd” < “abcz”
If the two strings are of different length, but identical up to the end of the shorter
one, then the shorter string is the lesser of the two:
Ex.
“abc” < “abcd”
If the two strings contain both uppercase and lowercase letters, the same scan applies
regardless of length: uppercase letters come before lowercase letters, and a blank has
a lower value than any letter.
Ex.
“AZZZ” < “Aaaah”
Below is an example of a comparison of strings that contain blanks. Scanning along
both strings and comparing corresponding characters, you see the strings are equal
for the first two characters. The third characters are a blank and a t; because the
blank has the lower code, the first string is the lesser.
Ex.
“hi there” < “hit a ball”
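The scanning procedure described above can be sketched in Python as follows. The function name compare_strings and its -1/0/1 return convention are choices made for this illustration; Python's own < operator on strings follows the same character-by-character rule.

```python
def compare_strings(a: str, b: str) -> int:
    """Return -1, 0 or 1 when a is less than, equal to or greater than b."""
    # Scan both strings and stop at the first pair of differing characters.
    for ca, cb in zip(a, b):
        if ca != cb:
            return -1 if ord(ca) < ord(cb) else 1
    # Identical up to the end of the shorter string: the shorter one is lesser.
    if len(a) == len(b):
        return 0
    return -1 if len(a) < len(b) else 1

examples = [("abcd", "abcz"), ("abc", "abcd"), ("AZZZ", "Aaaah"), ("hi there", "hit a ball")]
for a, b in examples:
    print(repr(a), repr(b), compare_strings(a, b))
```

All four example pairs from this slide give -1, meaning the first string is the lesser.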
Image Data
Because of the number of different shapes, colors, textures, sizes and shadings of images,
there is no standard representational format as there is with alphanumeric codes.
There are two ways of representing images:
1. Bitmap or raster images are made up of an array of pixel values.
2. Object or vector images are made up of simple geometrical elements. Each element is
specified by its geometric parameters, its location in the picture and other details.
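The difference between the two representations can be sketched in Python. The Circle class and the rasterize function below are hypothetical: the vector form stores only a few geometric parameters, while the raster form stores one value per sampled pixel.

```python
from dataclasses import dataclass

@dataclass
class Circle:
    """A vector element: location (cx, cy) and one geometric parameter (radius)."""
    cx: float
    cy: float
    radius: float

def rasterize(circle: Circle, width: int, height: int) -> list:
    """Sample the circle on a width x height grid: 1 = inside, 0 = outside."""
    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            inside = (x - circle.cx) ** 2 + (y - circle.cy) ** 2 <= circle.radius ** 2
            row.append(1 if inside else 0)
        rows.append(row)
    return rows

# Three numbers describe the vector image; the 7 x 7 raster needs 49 pixel values.
bitmap = rasterize(Circle(cx=3, cy=3, radius=2), width=7, height=7)
for row in bitmap:
    print("".join("#" if bit else "." for bit in row))
```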
Common Graphics Formats
pixel-based     math-based
bitmap          outline, object-oriented
paint           draw
raster          vector
GIF             PICT, WMF
TIFF            PostScript
Raster Images
Bit map or raster images consist of an array of
pixel values (pixel stands for 'picture element').
Each pixel represents the sampling of a small
area of the picture.
In its simplest form an image is represented as
a long string of bits representing the rows of
pixels in the image, where each bit is either 1
or 0 depending on whether the corresponding
pixel is black or white.
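As a minimal sketch of this simplest form, the Python below turns a small black-and-white image into one long bit string and back. It assumes that 1 means black and 0 means white, and the function names pack_rows and unpack_rows are made up here. The image width must be stored separately, since the bit string alone does not say where each row ends.

```python
def pack_rows(rows: list) -> str:
    """Concatenate rows of 0/1 pixels into one long bit string."""
    return "".join(str(bit) for row in rows for bit in row)

def unpack_rows(bits: str, width: int) -> list:
    """Rebuild the rows from the bit string, given the image width."""
    return [[int(b) for b in bits[i:i + width]] for i in range(0, len(bits), width)]

# A 3 x 3 plus sign (assumption: 1 = black pixel, 0 = white pixel).
plus_sign = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
bits = pack_rows(plus_sign)
print(bits)
print(unpack_rows(bits, width=3) == plus_sign)
```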
Color images are only slightly more
complicated, since each pixel can be
represented by a combination of bits indicating
the color of that pixel. It is common to record
the color of each pixel as three components:
red
green
blue
One byte is typically used to represent the
intensity of each color component, giving three
bytes (24 bits) per pixel.
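A short Python sketch of the three-component scheme follows. It assumes one byte per component stored in red, green, blue order with no compression; pack_rgb and image_size_bytes are hypothetical helper names.

```python
def pack_rgb(red: int, green: int, blue: int) -> bytes:
    """Store one pixel as three bytes, one per color component (0..255)."""
    for value in (red, green, blue):
        if not 0 <= value <= 255:
            raise ValueError("each color component must fit in one byte")
    return bytes([red, green, blue])

def image_size_bytes(width: int, height: int) -> int:
    """Uncompressed size of a width x height image at three bytes per pixel."""
    return width * height * 3

orange = pack_rgb(255, 165, 0)
print(orange.hex())
print(image_size_bytes(640, 480), "bytes for a 640 x 480 image")
```

At three bytes per pixel, even a modest image takes far more storage than the same area in a one-bit-per-pixel black-and-white image.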
Acknowledgements
The list of character standards was obtained from www.diffuse.org.
A portion of the discussion regarding character and string comparisons was
obtained from Emad Hayajneh.
A portion of the discussion regarding images was obtained from Dr. Robert
Stephens.