2. Data formats - Department of Electrical Engineering & Computer

2. Data Formats
Chapt. 3
ITEC 1011
Introduction to Information Technologies
Introduction
• Examples
[Figure: real-world data passes through an input device and becomes computer data. "Dear Mom:" typed on a keyboard becomes 10110010…; a digital camera likewise produces 10110010…]
pp. 59-61
Format must be appropriate
• The internal representation must be
appropriate for the type of processing to
take place (e.g., text, images, sound)
Rules/Conventions
• Proprietary formats
– Unique to a product or company
– E.g., Microsoft Word, Corel Word Perfect, IBM Lotus Notes
• Standards
– Evolve two ways:
• Proprietary formats become de facto standards (e.g., Adobe PostScript, Apple Quick Time)
• Committee is struck to solve a problem (Moving Picture Experts Group, MPEG)
pp. 61-62
Standards Organizations
• ISO – International Organization for Standardization
• CSA – Canadian Standards Association
• ANSI – American National Standards Institute
• IEEE – Institute of Electrical and Electronics Engineers
• Etc.
Examples of Standards
Type of Data             Standards
Alphanumeric             ASCII, EBCDIC, Unicode
Image                    JPEG, GIF, PCX, TIFF
Motion picture           MPEG-2, Quick Time
Sound                    Sound Blaster, WAV, AU
Outline graphics/fonts   PostScript, TrueType, PDF
Why Standards?
• Standards are “arbitrary”
• They exist because they are
– Convenient
– Efficient
– Flexible
– Appropriate
– Etc.
Alphanumeric Data
• Problem: Distinguishing between the number 123
(one hundred and twenty-three) and the characters
“123” (one, two, three)
• Four standards for representing letters (alpha) and
numbers
– BCD – Binary-coded decimal
– ASCII – American standard code for information
interchange
– EBCDIC – Extended binary-coded decimal interchange
code
– Unicode
pp. 63-69
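The difference between the number 123 and the characters “123” can be seen by looking at the bits each one produces. The following is a minimal sketch, assuming the number is stored in plain binary and the characters are stored as 8-bit ASCII codes; the variable names are illustrative only.

```python
# Compare the number 123 with the character string "123".
number = 123
text = "123"

# Stored as a binary number, 123 is a single value.
number_bits = format(number, "b")

# Stored as text, each character gets its own ASCII code.
text_codes = [ord(ch) for ch in text]
text_bits = [format(code, "08b") for code in text_codes]

print("number 123 :", number_bits)
print("codes      :", text_codes)
print('text "123" :', " ".join(text_bits))
```

The number occupies one binary value, while the text occupies three separate codes, one per character.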
Standard Alphanumeric Formats
• BCD
• ASCII
• EBCDIC
• Unicode
Binary-Coded Decimal (BCD)
• Four bits per digit

Digit   Bit pattern
0       0000
1       0001
2       0010
3       0011
4       0100
5       0101
6       0110
7       0111
8       1000
9       1001

Note: the following bit patterns are not used: 1010, 1011, 1100, 1101, 1110, 1111
Example
• 7093 (base 10) = ? (in BCD)

Digit         7      0      9      3
Bit pattern   0111   0000   1001   0011
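The digit-by-digit conversion can be sketched in a few lines of Python. This is an illustrative sketch, not part of the course material: the function names `to_bcd` and `from_bcd` are made up for the example, and only non-negative integers are handled.

```python
def to_bcd(number):
    """Encode a non-negative integer as BCD: four bits per decimal digit."""
    if number < 0:
        raise ValueError("this sketch handles non-negative integers only")
    return " ".join(format(int(digit), "04b") for digit in str(number))


def from_bcd(bits):
    """Decode space-separated 4-bit groups back into an integer."""
    digits = []
    for group in bits.split():
        value = int(group, 2)
        if value > 9:
            # 1010 through 1111 are the unused bit patterns.
            raise ValueError(group + " is not a valid BCD digit")
        digits.append(str(value))
    return int("".join(digits))


print(to_bcd(7093))
print(from_bcd("0111 0000 1001 0011"))
```

Decoding rejects the six unused patterns, which is one simple way to detect corrupted BCD data.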
The Problem
• Representing text strings, such as
“Hello, world”, in a computer
Codes and Characters
• Each character is coded as a byte
• Most common coding system is ASCII
(Pronounced ass-key)
• ASCII = American Standard Code for Information Interchange
• Defined in ANSI document X3.4-1977
ASCII Features
• 7-bit code
• 8th bit is unused (or used for a parity bit)
• 2^7 = 128 codes
• Two general types of codes:
– 95 are “Graphic” codes (displayable on a console)
– 33 are “Control” codes (control features of the console or communications channel)
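The counts on this slide can be checked directly. The sketch below assumes the usual split of the code set (graphic codes from 20 to 7E hexadecimal, control codes from 00 to 1F plus 7F) and uses even parity as one possible use of the 8th bit; the slide does not say which parity convention is meant, and the name `with_even_parity` is hypothetical.

```python
# All 7-bit codes: 2**7 = 128 of them.
codes = range(2 ** 7)

# Graphic codes run from space (0x20) to tilde (0x7E).
graphic = [c for c in codes if 0x20 <= c <= 0x7E]

# Control codes are 0x00-0x1F plus DEL (0x7F).
control = [c for c in codes if c < 0x20 or c == 0x7F]


def with_even_parity(code):
    """Set bit 7 so that the total number of 1 bits is even."""
    parity = bin(code).count("1") % 2
    return code | (parity << 7)


print(len(codes), len(graphic), len(control))
print(format(with_even_parity(ord("a")), "08b"))
```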
ASCII Chart
Rows: 4 least significant bits. Columns: 3 most significant bits.

        000    001    010    011    100    101    110    111
0000    NULL   DLE    SP     0      @      P      `      p
0001    SOH    DC1    !      1      A      Q      a      q
0010    STX    DC2    "      2      B      R      b      r
0011    ETX    DC3    #      3      C      S      c      s
0100    EOT    DC4    $      4      D      T      d      t
0101    ENQ    NAK    %      5      E      U      e      u
0110    ACK    SYN    &      6      F      V      f      v
0111    BEL    ETB    '      7      G      W      g      w
1000    BS     CAN    (      8      H      X      h      x
1001    HT     EM     )      9      I      Y      i      y
1010    LF     SUB    *      :      J      Z      j      z
1011    VT     ESC    +      ;      K      [      k      {
1100    FF     FS     ,      <      L      \      l      |
1101    CR     GS     -      =      M      ]      m      }
1110    SO     RS     .      >      N      ^      n      ~
1111    SI     US     /      ?      O      _      o      DEL
Reading the chart: the column heading gives the 3 most significant bits and the row heading gives the 4 least significant bits
e.g., ‘a’ = 1100001 (column 110, row 0001)
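Locating a character in the chart amounts to splitting its 7-bit code into the upper 3 bits and the lower 4 bits. A minimal sketch follows; `chart_position` is a name chosen for this example.

```python
def chart_position(ch):
    """Return (column, row) bit strings for a 7-bit ASCII character."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    column = format(code >> 4, "03b")    # 3 most significant bits
    row = format(code & 0b1111, "04b")   # 4 least significant bits
    return column, row


column, row = chart_position("a")
print(column, row)     # the two halves of the code
print(column + row)    # the full 7-bit code
```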
• 95 Graphic codes: SP through ~ (columns 010 to 111, except DEL)
• 33 Control codes: columns 000 and 001, plus DEL
• Alphabetic codes: A to Z and a to z
• Numeric codes: 0 to 9
• Punctuation, etc.: the remaining graphic codes
“Hello, world” Example
Character   Binary     Hexadecimal   Decimal
H           01001000   48            72
e           01100101   65            101
l           01101100   6C            108
l           01101100   6C            108
o           01101111   6F            111
,           00101100   2C            44
(space)     00100000   20            32
w           01110111   77            119
o           01101111   6F            111
r           01110010   72            114
l           01101100   6C            108
d           01100100   64            100
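The three columns of codes can be generated rather than looked up by hand. This is a small sketch using Python's built-in `ord`; the function name `ascii_rows` is made up for the example.

```python
def ascii_rows(text):
    """Return (character, binary, hexadecimal, decimal) for each character."""
    rows = []
    for ch in text:
        code = ord(ch)
        rows.append((ch, format(code, "08b"), format(code, "02X"), code))
    return rows


for ch, binary, hexadecimal, decimal in ascii_rows("Hello, world"):
    print(repr(ch), binary, hexadecimal, decimal)
```

The space between the comma and "world" is a character too, so the string has 12 codes, not 11.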
Common Control Codes
Code    Hexadecimal code    Meaning
CR      0D                  carriage return
LF      0A                  line feed
HT      09                  horizontal tab
DEL     7F                  delete
NULL    00                  null
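These control codes show up in ordinary text as tabs and line endings. The sketch below lists the codes in a sample line; the sample string and the dictionary name `CONTROL_CODES` are illustrative assumptions, and the CR LF line ending is just one common convention.

```python
# The control codes from the slide, keyed by name.
CONTROL_CODES = {"CR": 0x0D, "LF": 0x0A, "HT": 0x09, "DEL": 0x7F, "NULL": 0x00}

# A line containing a tab and ending in carriage return + line feed.
line = "name\tvalue\r\n"

# Show each byte of the line as a two-digit hexadecimal code.
hex_codes = [format(byte, "02X") for byte in line.encode("ascii")]
print(" ".join(hex_codes))

# Python's escape characters map onto the same codes.
print(ord("\r") == CONTROL_CODES["CR"],
      ord("\n") == CONTROL_CODES["LF"],
      ord("\t") == CONTROL_CODES["HT"])
```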
Terminology
• Learn the names of the special symbols
– [ ]   brackets
– { }   braces
– ( )   parentheses
– @     commercial ‘at’ sign
– &     ampersand
– ~     tilde
Escape Sequences
• Extend the capability of the ASCII code set
• For controlling terminals and formatting output
• Defined by ANSI in documents X3.41-1974 and X3.64-1977
• The escape code is ESC = 1B (hexadecimal)
• An escape sequence begins with two codes: ESC (1B hexadecimal) followed by [ (5B hexadecimal)
Examples
• Erase display: ESC [ 2 J
• Erase line: ESC [ K
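The two sequences can be built as ordinary strings. The sketch below only constructs and inspects them; whether writing them to a terminal actually erases anything depends on the terminal supporting these sequences, which is an assumption. The constant names are chosen for this example.

```python
ESC = "\x1b"          # escape code, 1B hexadecimal
CSI = ESC + "["       # the two codes that begin an escape sequence

ERASE_DISPLAY = CSI + "2J"
ERASE_LINE = CSI + "K"

# Show the raw codes instead of sending them to the terminal.
print(repr(ERASE_DISPLAY), ERASE_DISPLAY.encode("ascii").hex())
print(repr(ERASE_LINE), ERASE_LINE.encode("ascii").hex())
```

To try a sequence on a terminal that supports it, write it without a newline, for example `print(ERASE_DISPLAY, end="")`.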
EBCDIC
• Extended BCD Interchange Code
(pronounced ebb’-se-dick)
• 8-bit code
• Developed by IBM
• Rarely used today
• IBM mainframes only
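EBCDIC assigns different 8-bit codes to the same characters than ASCII does. The sketch below compares the two using Python's built-in `cp037` codec, which is one EBCDIC variant (US/Canada); the slides do not say which variant is meant, so that choice is an assumption, as is the function name `compare_codes`.

```python
def compare_codes(text):
    """Return (character, ASCII code, EBCDIC code) for each character."""
    rows = []
    for ch in text:
        ascii_code = ch.encode("ascii")[0]
        ebcdic_code = ch.encode("cp037")[0]   # cp037: one EBCDIC code page
        rows.append((ch, ascii_code, ebcdic_code))
    return rows


for ch, ascii_code, ebcdic_code in compare_codes("Aa0"):
    print(ch, format(ascii_code, "02X"), format(ebcdic_code, "02X"))
```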
Unicode
• 16-bit standard
• Developed by a consortium
• Intended to supersede older 7- and 8-bit codes
Unicode Version 2.1
• 1998
• Improves on version 2.0
• Includes the Euro sign (20AC hexadecimal = €)
• From the standard:
“…contains 38,887 distinct coded characters derived from the supported scripts. These characters cover the principal written languages of the Americas, Europe, the Middle East, Africa, India, Asia, and Pacifica.”
http://www.unicode.org
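The Euro sign is a convenient example of a character that fits in a 16-bit code but not in 7-bit ASCII. In the sketch below, `utf-16-be` is used as one way to view a code as two bytes; that encoding choice and the helper name `fits_in_ascii` are assumptions made for illustration.

```python
euro = "\u20ac"   # the Euro sign, code 20AC hexadecimal


def fits_in_ascii(ch):
    """Return True if the character can be encoded as 7-bit ASCII."""
    try:
        ch.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(format(ord(euro), "04X"))            # the code as four hex digits
print(euro.encode("utf-16-be").hex())      # the same code as two bytes
print(fits_in_ascii("A"), fits_in_ascii(euro))
```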
Keyboard Input
• Key (“scan”) codes are converted to ASCII
• ASCII code sent to host computer
• Received by the host as a “stream” of data
• Stored in buffer
• Processed
• Etc.
p. 69
Shift Key
• Inhibits (clears) bit 5 in the ASCII code

Key(s)      ASCII code (bits 6 5 4 3 2 1 0)   Character
a           1 1 0 0 0 0 1                     a
Shift + a   1 0 0 0 0 0 1                     A
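Clearing bit 5 is a single bitwise operation. The sketch below applies it to letters only, since that is the case the slide shows; the function name `shift_letter` is hypothetical, and the sketch is not a model of how a real keyboard handles other keys.

```python
BIT_5 = 1 << 5   # 0100000 in binary


def shift_letter(ch):
    """Clear bit 5 of an ASCII letter, giving its upper-case form."""
    if not (ch.isascii() and ch.isalpha()):
        raise ValueError("this sketch handles ASCII letters only")
    return chr(ord(ch) & ~BIT_5)


print(format(ord("a"), "07b"), "->", format(ord(shift_letter("a")), "07b"))
print(shift_letter("a"))
```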
Control Key
• Inhibits (clears) bits 5 & 6 in the ASCII code

Key(s)     ASCII code (bits 6 5 4 3 2 1 0)   Character
c          1 1 0 0 0 1 1                     c
Ctrl + c   0 0 0 0 0 1 1                     ETX (control code)
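Clearing bits 5 and 6 leaves only the low five bits of the code, which always lands in the control-code range. A minimal sketch follows; `control_code` and the small name table are illustrative.

```python
LOW_FIVE_BITS = 0b0011111   # mask that clears bits 5 and 6

# Names for a few of the resulting control codes.
NAMES = {0x03: "ETX", 0x09: "HT", 0x0D: "CR"}


def control_code(ch):
    """Return the code produced by holding Ctrl with an ASCII letter."""
    return ord(ch) & LOW_FIVE_BITS


for key in "cim":
    code = control_code(key)
    print("Ctrl +", key, "=", format(code, "07b"), NAMES[code])
```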
Other Input
• OCR – optical character recognition
• Bar code readers
• Voice/audio input
• Punched cards
• Images / objects
• Pointing devices
pp. 69-86
OCR
[Figure: a page of text (“Hello, world”) is optically scanned and stored as a computer file (10110110…)]
Bar Codes
• An automatic identification (Auto ID)
technology that streamlines identification
and data collection
• See http://www.digital.net/barcoder/barcode.html
Voice/audio Input
• Input device: microphone
• Audio input is “digitized” and stored
• Processed in two ways
– As is (no recognition)
– Recognized and converted to alphanumeric data (ASCII)
[Figure: microphone input is digitized into binary data (10110010…)]
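Digitizing means measuring the signal at regular intervals and storing each measurement as a number. The sketch below stands in for a microphone with a pure sine tone and stores each sample as an 8-bit value; the tone, the sample rate, the 8-bit resolution and the function name `digitize` are all assumptions chosen to keep the example small.

```python
import math


def digitize(frequency_hz, sample_rate_hz, n_samples):
    """Sample a sine tone and quantize each sample to 8 bits (0-255)."""
    samples = []
    for n in range(n_samples):
        # Amplitude of the tone at sample n, between -1.0 and 1.0.
        amplitude = math.sin(2 * math.pi * frequency_hz * n / sample_rate_hz)
        # Map -1.0..1.0 onto the integer levels 0..255.
        samples.append(round((amplitude + 1) / 2 * 255))
    return samples


samples = digitize(frequency_hz=1000, sample_rate_hz=8000, n_samples=8)
print(samples)
print(" ".join(format(level, "08b") for level in samples))
```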
Punched Cards
• Invented by Herman Hollerith (whose company was a forerunner of IBM)
• Each card holds 80 characters
Images
• Typically images are pictures that are
optically scanned and saved as a “bit map”
or in some other format
• Many formats
– GIF, JPEG, …
[Figure: typical “Save As” dialog]
Objects
• Images made of geometrically definable
shapes
• Offer efficiency, flexibility, small size, etc.
Pointing Devices
• Originally used for specifying coordinates (x, y) for graphical input
• Today used as general-purpose devices for “graphical user interfaces” (GUIs)
Thank you