MATH 1020: Mathematics For Non-science
Chapter 4.1: Informatics
Instructor: Prof. Ken Tsang
Room E409-R9
Email: kentsang@uic.edu.hk
Informatics – the science of information
– What's information?
– Correcting errors in transmitted messages
– Genetic code and information
– Data compression
– Cryptography
Data, messages and information
• "Message" is a generic term: something that conveys information.
• Data are records of messages.
• Data/messages may or may not be comprehensible to us.
• Information is knowledge that helps us understand the world around us.
• Information is derived/obtained from data/messages.
Messages
• Messages can be a smell: female insects can attract males with pheromones.
Messages
• Messages can be a sound: a ring, noise, applause, music, speech.
Messages
• Messages can be visual: color, shape, design.
Writings
• More than 80,000 characters are used to code the Chinese language.
Writings
• Ancient Egyptians used hieroglyphs to code sounds and words.
Writings
• The Japanese language also uses the 96 hiragana characters, which code syllables.
Writings
• Modern numbers are coded with 10 digits, created in India and transmitted to Europe through the Arabs.
Writings
• George Boole (1815–1864) used only two characters to code logical values: 0 and 1.
Writings
• John von Neumann (1903–1957) developed the concept of programming, also using the binary system of 0 and 1 to code all possible information.
Binary Codes: the basis of data
• CD, MP3, and DVD players, digital TV, cell phones, the Internet, GPS systems, etc. all represent data as strings of 0's and 1's rather than digits 0–9 and letters A–Z.
• Whenever information needs to be digitally transmitted from one location to another, or stored in an electronic device, a binary code is used.
Binary Codes
• A binary code is a system for encoding data made up of 0's and 1's.
• Examples:
– UPC (universal product code; dark = 1, light = 0)
– Morse code (dash = 1, dot = 0)
– Braille (raised bump = 1, flat surface = 0)
– Yi-jing 易经 (yin = 0, yang = 1)
The Challenges in an Information Age
Mathematical challenges in the digital revolution:
• How to detect and correct errors in data transmission
• How to electronically send and store information economically/efficiently
• How to ensure the security of transmitted data
Claude Shannon (1916-2001)
"Father of Information Theory"
Claude Shannon's 1948 paper "A Mathematical Theory of Communication" is the paper that made the digital world we live in possible.
A Brief Introduction: Information Theory
• Information theory is a branch of science that deals with the analysis of a communication system.
• We will study digital communications – using a file as the channel.
[Shannon's model: Source of Message → Encoder → Channel (subject to NOISE) → Decoder → Destination of Message]
Information theory gives a very precise definition of information: a message made up of symbols from some finite alphabet. It applies to any form of message.
An example of a human communication system: ASCII.
The dictionary defines a code as "a system of symbols for communication." Information is "communication between an encoder and decoder using agreed-upon symbols."
Information content
• The information content of a message is a quantifiable amount and is always positive.
• The information content of a message is inversely related to the probability that the message will be received from the set of all possible messages.
• The message with the lowest probability of being received contains the highest information content.
Information content -1
The information content I of a message is defined as the negative of the base-2 logarithm of its probability p: I = log2(1/p) = −log2(p).
• If p = 1, then I = 0. A message you are certain will occur has no information content.
• The logarithmic measure is justified by the desire for information to be additive. We want the algebra of our measures to follow the rules of probability: when independent packets of information arrive, we would like to say that the total information received is the sum of the individual pieces.

For a particular outcome Xj with probability P(Xj):
I(Xj) = log2(1/P(Xj)) = −log2 P(Xj)
• Suppose we have an event X, where Xj represents a particular outcome of the event.
• Consider flipping a fair coin; there are two equiprobable outcomes: say X0 = heads, P0 = 1/2, and X1 = tails, P1 = 1/2.
• The amount of information for any single result is 1 bit, the unit of information. In other words, the number of bits required to communicate the result of the event is 1 bit.
• When outcomes are equally likely, there is a lot of information in the result.
• The higher the likelihood of a particular outcome, the less information that outcome conveys.
• However, if the coin is biased such that it lands heads up 99% of the time, there is not much information conveyed when we flip the coin and it lands on heads.
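The arithmetic here is easy to check. A minimal Python sketch (my illustration; the slides give only the formula I = log2(1/p)):

```python
import math

def information_content(p):
    """Information content, in bits, of an outcome with probability p."""
    return math.log2(1 / p)

print(information_content(0.5))   # fair-coin outcome: 1.0 bit
print(information_content(0.99))  # biased coin, heads: ~0.014 bits (almost no surprise)
print(information_content(0.01))  # biased coin, tails: ~6.64 bits (very surprising)
```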
Information content -2
• Suppose we speak a funny language with only 4 letters, and we play a game of questions and answers to determine which letter I have chosen. You can ask me questions and I will only answer 'yes' or 'no'.
• What is the minimum number of such questions that you must ask in order to guarantee finding the answer?
• "Is it A?", "Is it B?" ... or is there some more intelligent way to solve this problem?
Information content -3
The answer to a yes/no (binary) question whose two answers are equally probable conveys one bit worth of information.
In the above example with 4 (equally probable) letters, the minimum number of questions needed to discover the answer is 2, even though there are 4 possibilities.
In English, with 26 letters, the information content of the answer is I = log2(26) ≈ 4.7 bits, so 5 yes/no questions are needed to guarantee discovering it.
Information Complexity of Coded Messages
• Let's think about written numbers:
– k digits → 10^k possible messages
• How about written English?
– k letters → 26^k possible messages
– k words → D^k possible messages, where D is the size of the English dictionary
∴ Length ~ content ~ log(complexity)
Information Entropy (熵)

H(p(X)) = −Σx p(x) log2 p(x) = Ex[ log2(1/p(x)) ]

• The expected content (length in bits) of a binary message with symbols {x}
– other common descriptions: "code complexity", "uncertainty", "expected surprise", the minimum number of bits required to store data, etc.
• Entropy is the measurement of the average uncertainty of information.
– A highly predictable sequence contains little actual information.
Example: 11011011011011011011011011 (what's next?)
– A completely unpredictable sequence of n bits contains n bits of information.
Example: 01000001110110011010010000 (what's next?)
• X is a random variable with a discrete set of possible outcomes (X0, X1, X2, …, Xn−1), where n is the total number of possibilities. Then:

Entropy = −Σj=0..n−1 Pj log2 Pj = Σj=0..n−1 Pj log2(1/Pj)

• Entropy is greatest when the probabilities of the outcomes are equal.
• Let's consider our fair coin experiment again. The entropy H = ½ log2 2 + ½ log2 2 = 1. Since each outcome has information content 1, the average over the 2 outcomes is (1+1)/2 = 1.
• Consider a biased coin with P(H) = 0.98 and P(T) = 0.02:
H = 0.98 × log2(1/0.98) + 0.02 × log2(1/0.02)
  = 0.98 × 0.029 + 0.02 × 5.644 = 0.0285 + 0.1129 ≈ 0.1414
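The two entropy values above can be reproduced with a few lines of Python (a sketch of my own, not from the slides):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum of p * log2(p), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1.0
print(entropy([0.98, 0.02]))  # biased coin: ~0.1414
```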
Entropy of a coin flip
[Figure: the entropy H(X) of a coin flip, measured in bits, graphed versus the fairness of the coin Pr(X=1).]
The entropy reaches its maximum value when the coin has equal probability ½ of landing with the "heads" or "tails" side up (i.e. the coin is fair). The fair coin is the most uncertain situation for predicting the outcome of the next toss; hence each toss of the coin delivers a full 1 bit of information.
SETI: Search for Extraterrestrial Intelligence
The SETI project uses information theory to search for intelligent life on other planets. Currently, radio signals from deep space are monitored for signs of transmissions from civilizations on other worlds. If those signals are generated by intelligent beings, they should not be random in nature but should contain discernible patterns with significant information content.
DNA: Genetic Code
• Nature uses 4 molecules to code genetic heredity.
Information Theory & genetics
[Shannon's diagram of a typical communication system, applied to heredity: Information Source (Parents) → Transmitter → Signal → Received Signal → Receiver → Destination (Child); the Message is DNA, and the Noise Source is mutation/evolution.]
Informatics
– What's information?
– Correcting errors in transmitted messages
– Genetic code and information
– Data compression
– Cryptography
Transmission Problems
• What are some problems that can occur when data is transmitted from one place to another?
• The two main problems are:
– transmission errors: the message sent is not the same as the message received
– security: someone other than the intended recipient receives the message
Transmission Error Example
• Suppose you were looking at a newspaper ad for a job, and you see the sentence "must have bive years experience".
• We detect the error since we know that "bive" is not a word.
• Can we correct the error?
• Why is "five" a more likely correction than "three"?
• Why is "five" a more likely correction than "nine"?
Another Example
• Suppose NASA is directing one of the Mars rovers by telling it which crater to investigate.
• There are 16 possible signals that NASA could send, and each signal represents a different command.
• NASA uses a 4-digit binary code to represent this information:
0000  0001  0010  0011
0100  0101  0110  0111
1000  1001  1010  1011
1100  1101  1110  1111
Lost in Transmission
• The problem with this method is that if there is a single-digit error, there is no way that the rover could detect or correct the error.
• If the message sent was "0100" but the rover receives "1100", the rover will never know a mistake has occurred.
• This kind of error – called "noise" – occurs all the time.
• Errors are also introduced into data stored over a long period of time on CD or magnetic tape as the hardware deteriorates.
Coding theory
• Coding theory – the study of codes, including error-detecting and error-correcting codes – has been studied extensively for the past forty years.
• Messages, in the form of bit strings, are encoded by translating them into longer bit strings, called codewords.
• As long as not too many errors were introduced in transmission, we can recover the codeword that was sent from the bit string received.
BASIC IDEA
• The details of the techniques used to protect information against noise in practice are sometimes rather complicated, but the basic principles are easily understood.
• The key idea is that in order to protect a message against noise, we should encode the message by adding some redundant information to it.
• In that case, even if the message is corrupted by noise, there will be enough redundancy in the encoded message to recover, or decode, the message completely.
Adding Redundancy to our Messages
• To decrease the effects of noise, we add redundancy to our messages.
• First method: repeat each digit five times.
• The computer is then programmed to take any five-digit block received and decode the result by majority rule.
Majority Rule
• So, if we sent 00000 and the computer receives any of the following, it will still be decoded as 0:
00000, 10000, 01000, 00010, 00001, 11000, 10100, 10010, 10001, etc.
• Notice that for the computer to decode incorrectly, at least three errors must be made.
Independent Errors
• Using the five-time repeats, and assuming the errors happen independently, it is less likely that three or more errors will occur than that two or fewer will occur.
• Decoding to the most likely transmitted message is called maximum likelihood decoding.
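As an illustration of the repetition scheme (the slides describe it in words; the code below is a minimal sketch of my own), each bit is repeated five times and each received five-bit block is decoded by majority rule:

```python
def encode(bits, n=5):
    """Five-fold repetition: '01' -> '0000011111'."""
    return "".join(b * n for b in bits)

def decode(received, n=5):
    """Decode each n-bit block by majority rule (maximum likelihood for independent errors)."""
    blocks = [received[i:i + n] for i in range(0, len(received), n)]
    return "".join("1" if block.count("1") > n // 2 else "0" for block in blocks)

print(encode("01"))          # 0000011111
print(decode("0100010110"))  # '01' - one error in the first block, two in the second, all corrected
```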
More on Redundancy
• Another way to try to avoid errors is to send the same message twice.
• This would allow the rover to detect an error, but not correct it (since it has no way of knowing whether the error occurred in the first copy of the message or the second).
Why don't we use this?
• Repetition codes have the advantage of simplicity, both for encoding and decoding.
• But they are too inefficient!
• In a five-fold repetition code, 80% of all transmitted information is redundant.
• Can we do better? Yes!
Parity-Check Sums
• Parity-check sums: sums of digits whose parities determine the check digits.
• Even parity – even integers are said to have even parity.
• Odd parity – odd integers are said to have odd parity.
Decoding
• The process of translating received data into code words.
• Example: say the parity-check sums detect an error.
– The encoded message is compared to each of the possible correct messages.
– Decoding works by comparing two strings of equal length and determining the number of positions in which the strings differ.
– The codeword that differs from the received message in the fewest positions is chosen to replace it.
– In other words, the computer is programmed to automatically correct the error by choosing the "closest" permissible answer.
Error Correction
• Over the past 40 years, mathematicians and engineers have developed sophisticated schemes to build redundancy into binary strings to correct errors in transmission!
• One example can be illustrated with Venn diagrams!
Computing the Check Digits
• The original message is four digits long; we will call these digits I, II, III, and IV.
• We will add three new digits: V, VI, and VII.
• Draw three intersecting circles as shown here.
• Digits V, VI, and VII should be chosen so that each circle contains an even number of ones.
[Venn diagram: three intersecting circles; the message digits I, II, III, IV sit in the overlapping regions and the check digits V, VI, VII in the outer regions.]
A Hamming (7,4) code
• A Hamming code of (n,k) means that a message k digits long is encoded into a codeword n digits long.
• The 16 possible messages:
0000  0001  0010  0011
0100  0101  0110  0111
1000  1001  1010  1011
1100  1101  1110  1111
Binary Linear Codes
• The error-correcting scheme we just saw is a special case of a Hamming code.
• These codes were first proposed in 1948 by Richard Hamming (1915-1998), a mathematician working at Bell Laboratories.
• Hamming was frustrated with losing a week's worth of work due to an error that a computer could detect, but not correct.
Appending Digits to the Message
• The message we want to send is "0100".
• Digit V should be 1 so that the first circle has two ones.
• Digit VI should be 0 so that the second circle has zero ones (zero is even!).
• Digit VII should be 1 so that the last circle has two ones.
• Our message is now 0100101.
[Venn diagram with message digits 0, 1, 0, 0 and check digits 1, 0, 1 filled in.]
Encoding those messages
Message → codeword
0000 → 0000000    0110 → 0110010
0001 → 0001011    0101 → 0101110
0010 → 0010111    0011 → 0011100
0100 → 0100101    1110 → 1110100
1000 → 1000110    1101 → 1101000
1010 → 1010001    1011 → 1011010
1100 → 1100011    0111 → 0111001
1001 → 1001101    1111 → 1111111
Detecting and Correcting Errors
• Now watch what happens when there is a single-digit error.
• We transmit the message 0100101 and the rover receives 0101101.
• The rover can tell that the second and third circles have odd numbers of ones, but the first circle is correct.
• So the error must be in the digit that is in the second and third circles, but not the first: that's digit IV.
• Since we know digit IV is wrong, there is only one way to fix it: change it from 1 to 0.
[Venn diagram with received digits 0, 1, 0, 1 and check digits 1, 0, 1 filled in.]
Try It!
• Encode the message 1110 using this method.
• You have received the message 0011101. Find and correct the error in this message.
Extending This Idea
• This method only allows us to encode 4-bit messages (16 possibilities), which isn't even enough to represent the alphabet!
• However, if we use more digits, we won't be able to use the circle method to detect and correct errors.
• We'll have to come up with a different method that allows for more digits.
Parity Check Sums
• The circle method is a specific example of a "parity check sum".
• The "parity" of a number is 1 if the number is odd and 0 if the number is even.
• For example, digit V is 0 if I + II + III is even, and 1 if I + II + III is odd.
Conventional Notation
• Instead of using Roman numerals, we'll use a1 to represent the first digit of the message, a2 to represent the second digit, and so on.
• We'll use c1 to represent the first check digit, c2 to represent the second, etc.
Old Rules in the New Notation
• Using this notation, our rules for the check digits become:
– c1 = 0 if a1 + a2 + a3 is even; c1 = 1 if a1 + a2 + a3 is odd
– c2 = 0 if a1 + a3 + a4 is even; c2 = 1 if a1 + a3 + a4 is odd
– c3 = 0 if a2 + a3 + a4 is even; c3 = 1 if a2 + a3 + a4 is odd
[Venn diagram: circle c1 contains a1, a2, a3; circle c2 contains a1, a3, a4; circle c3 contains a2, a3, a4.]
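These three parity rules translate directly into code. A minimal Python sketch (my illustration, not from the slides); it reproduces the codeword table shown earlier, e.g. 0100 → 0100101:

```python
def hamming74_encode(message):
    """Encode a 4-digit binary string a1 a2 a3 a4 as the 7-digit codeword a1 a2 a3 a4 c1 c2 c3."""
    a1, a2, a3, a4 = (int(b) for b in message)
    c1 = (a1 + a2 + a3) % 2  # parity of the first circle
    c2 = (a1 + a3 + a4) % 2  # parity of the second circle
    c3 = (a2 + a3 + a4) % 2  # parity of the third circle
    return message + f"{c1}{c2}{c3}"

print(hamming74_encode("0100"))  # 0100101, as in the slides
print(hamming74_encode("1110"))  # 1110100 - the "Try It!" exercise
```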
An Alternative System
• If we want a system with enough code words for the entire alphabet, we need 5 message digits: a1, a2, a3, a4, a5.
• We will also need more check digits to help us decode our message: c1, c2, c3, c4.
Rules for the New System
• We can't use the circles to determine the check digits for our new system, so we use the parity notation from before:
– c1 is the parity of a1 + a2 + a3 + a4
– c2 is the parity of a2 + a3 + a4 + a5
– c3 is the parity of a1 + a2 + a4 + a5
– c4 is the parity of a1 + a2 + a3 + a5
Making the Code
• Using 5 digits in our message gives us 32 possible messages; we'll use the first 26 to represent letters of the alphabet.
• On the next slide you'll see the code itself: each letter together with the 9-digit codeword representing it.
The Code
Letter  Code         Letter  Code
A       000000000    N       011010101
B       000010111    O       011101100
C       000101110    P       011111011
D       000111001    Q       100001011
E       001001101    R       100011100
F       001011010    S       100100101
G       001100011    T       100110010
H       001110100    U       101000110
I       010001111    V       101010001
J       010011000    W       101101000
K       010100001    X       101111111
L       010110110    Y       110000100
M       011000010    Z       110010011
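As a cross-check, the table above can be generated from the four parity rules of the new system. A minimal Python sketch (my own illustration, using the convention A = 00000, B = 00001, ..., Z = 11001):

```python
def encode_letter(index):
    """Encode letter number index (0 = A, ..., 25 = Z) as the 9-digit codeword a1..a5 c1..c4."""
    a1, a2, a3, a4, a5 = (int(b) for b in format(index, "05b"))  # the 5 message digits
    c1 = (a1 + a2 + a3 + a4) % 2
    c2 = (a2 + a3 + a4 + a5) % 2
    c3 = (a1 + a2 + a4 + a5) % 2
    c4 = (a1 + a2 + a3 + a5) % 2
    return f"{a1}{a2}{a3}{a4}{a5}{c1}{c2}{c3}{c4}"

for i, letter in enumerate("ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    print(letter, encode_letter(i))  # A 000000000, B 000010111, ..., K 010100001, ...
```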
Using the Code
• Now that we have our code, using it is simple.
• When we receive a message, we simply look it up in the table.
• But what happens when the message we receive isn't on the list?
• Then we know an error has occurred, but how do we fix it? We can't use the circle method anymore.
Beyond Circles
• Using this new system, how do we decode messages?
• Simply compare the (incorrect) message with the list of possible correct messages and pick the "closest" one.
• What should "closest" mean?
• The distance between two messages is the number of digits in which they differ.
The Distance Between Messages
• What is the distance between 1100101 and 1010101?
– The messages differ in the 2nd and 3rd digits, so the distance is 2.
• What is the distance between 1110010 and 0001100?
– The messages differ in all but the 7th digit, so the distance is 6.
Hamming Distance
• Def: The Hamming distance between two vectors of a vector space is the number of components in which they differ, denoted d(u,v).
Hamming Distance
• Ex. 1: The Hamming distance between v = [1011010] and u = [0111100] is d(u, v) = 4.
• Notice: d(u,v) = d(v,u).
Hamming weight of a Vector
• Def: The Hamming weight of a vector is the number of nonzero components of the vector, denoted wt(u).
Hamming weight of a code
• Def: The Hamming weight of a linear code is the minimum weight of any nonzero vector in the code.
Hamming Weight
• The Hamming weights of v = [1011010], u = [0111100], and w = [0100101] are wt(v) = 4, wt(u) = 4, and wt(w) = 3.
Nearest-Neighbor Decoding
• The nearest-neighbor decoding method decodes a received message as the code word that agrees with the message in the most positions.
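A minimal Python sketch of nearest-neighbor decoding (my illustration; the codeword list is the Hamming (7,4) table from the earlier slide):

```python
def hamming_distance(u, v):
    """Number of positions in which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))

# The 16 codewords of the Hamming (7,4) code from the earlier table
CODEWORDS = ["0000000", "0001011", "0010111", "0100101", "1000110", "1100011",
             "1010001", "1001101", "0110010", "0101110", "0011100", "1110100",
             "1101000", "1011010", "0111001", "1111111"]

def nearest_neighbor_decode(received):
    """Return the codeword closest to the received string in Hamming distance."""
    return min(CODEWORDS, key=lambda c: hamming_distance(c, received))

print(hamming_distance("1100101", "1010101"))  # 2, as in the earlier example
print(nearest_neighbor_decode("0101101"))      # 0100101 - the single-digit error is corrected
```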
Trying it Out
• Suppose that, using our alphabet code, we receive the message 010100011.
• We can check and see that this message is not on our list.
• How far away is it from the messages on our list?
Distances From 010100011
Code        Distance    Code        Distance
000000000   4           011010101   5
000010111   4           011101100   5
000101110   4           011111011   3
000111001   4           100001011   4
001001101   6           100011100   8
001011010   6           100100101   4
001100011   2           100110010   4
001110100   6           101000110   6
010001111   3           101010001   6
010011000   5           101101000   6
010100001   1           101111111   6
010110110   3           110000100   5
011000010   3           110010011   3
Fixing the Error
• Since 010100001 was closest to the message that we received, we know that this is the most likely actual transmission.
• We can look this corrected message up in our table and see that the transmitted message was (probably) "K".
• This might still be incorrect, but other errors can be corrected using context clues or check digits.
Distances From 1010 110
• The distances between the message "1010 110" and all possible code words:
v          code word   distance    v          code word   distance
1010 110   0000 000    4           1010 110   0110 010    3
1010 110   0001 011    5           1010 110   0101 110    4
1010 110   0010 111    2           1010 110   0011 100    3
1010 110   0100 101    5           1010 110   1110 100    2
1010 110   1000 110    1           1010 110   1101 000    5
1010 110   1100 011    4           1010 110   1011 010    2
1010 110   1010 001    3           1010 110   0111 001    6
1010 110   1001 101    4           1010 110   1111 111    3
Informatics
– What's information?
– Correcting errors in transmitted messages
– Genetic code and information
– Data compression
– Cryptography
The genetic code
• The genome 基因組 is the instruction manual for life, an information system that specifies the biological body.
• In its simplest form, it consists of a linear sequence of four extremely small molecules, called nucleotides.
• These nucleotides make up the "steps" of the spiral-staircase structure of DNA and are the letters of the genetic code.
DNA from an Information Theory Perspective
• The "alphabet" for DNA is {A, C, G, T}. Each DNA strand is a sequence of symbols from this alphabet.
• These sequences are replicated and translated in processes reminiscent of Shannon's communication model.
• There is redundancy in the genetic code that enhances its error tolerance.
What Information Theory Contributes to Genetic Biology
• A useful model for how genetic information is stored and transmitted in the cell
• A theoretical justification for the observed redundancy of the genetic code
• The ideas that information theory (Claude Shannon) introduced have profound and far-reaching implications for origin-of-life research and evolutionary theory.
A DNA double helix
The main role of DNA (deoxyribonucleic acid 脱氧核糖核酸) molecules is the long-term storage of the genetic instructions used in the development and functioning of all known living organisms.
The Genetic Code
The four bases (nucleotides) found in DNA are adenine (abbreviated A), cytosine (C), guanine (G) and thymine (T). These four bases can be regarded as the symbols used in the genetic code.
The DNA code thus has a four-letter alphabet: A, C, G, T. Three letters come together to form a "triplet" or "codon". The codon GGG is an instruction to make glycine, an amino acid commonly found in proteins.
Processes of Life in the Cell
Information stored in the DNA in the nucleus is copied over to RNA (ribonucleic acid) strands, which act as messengers to govern the chemical processes in the cell.
Duplication and Division
In the course of cell division, the DNA strands in the nucleus (chromosomes) are duplicated by splitting the double-helix strand up and replacing the open bonds with the corresponding nucleotides.
The process must be sufficiently accurate, but also capable of occasional minor mistakes to allow for evolution.
Transmitting
• Prior to cell fission, the DNA molecule is unzipped.
Escherichia coli genome
• >gb|U00096|U00096 Escherichia coli 大腸桿菌 K-12 MG1655 complete genome 基因組
• AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTG
TGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTG
AACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATT
GACTTAGGTCACTAAATACTTTAACCAATATAGGCAT
AGCGCACAGACAGATAAAAATTACAGAGTACACAAC
ATCCATGAAACGCATTAGCACCACCATTACCACCACC
ATCACCATTACCACAGGTAACGGTGCGGGCTGACGCG
TACAGGAAACACAGAAAAAAGCCCGCACCTGACAGT
GCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAAC
AACCATGCGAGTGTTGAA
Hierarchies of symbols
English               computer        genetics
letter (26)           bit (2)         nucleotide 核苷酸 (4)
word (1-28 letters)   byte (8 bits)   codon (3 nucleotides)
sentence              line            gene
book                  program         genome
The Central Dogma of Molecular Biology
[Diagram: DNA → (transcription) → RNA (ribonucleic acid 核糖核酸) → (translation) → Protein; also DNA → (replication) → DNA and RNA → (reverse transcription) → DNA.]
A long string of DNA like ACGGGTCTTTAAGATG… – that DNA pattern might hold the instructions to build a big mouth or blue eyes.
Information Theory & genetics
[Shannon's diagram of a typical communication system, applied to heredity: Information Source (Parents) → Transmitter → Signal → Received Signal → Receiver → Destination (Child); the Message is DNA, and the Noise Source is mutation/evolution.]
Redundancy in DNA code
• Two other codons, GGC and GGA (besides GGG), also code for glycine. An amino acid can typically be represented by several different codons, not just one. This is "redundancy".
• DNA thus has a clever mechanism for reducing copying errors: some copying errors (GGC instead of GGG) will not cause a problem.
• This is why DNA has survived over 3 billion years intact.
The very existence of information changes the materialistic worldview. Materialistic philosophy has no explanation for the existence of information.
Norbert Wiener, the MIT mathematician and father of cybernetics, said, "Information is information, neither matter nor energy. No materialism that denies this can survive the present day."
Information is an entity carried by, but beyond, matter and energy. The observable universe consists not only of matter and energy but also of information.
Is life a vehicle to transport some kind of information through space and time?
Informatics
– What's information?
– Correcting errors in transmitted messages
– Genetic code and information
– Data compression
– Cryptography
Data compression
• Purpose: increase the efficiency of storing/transmitting data.
• How: by removing data redundancy while preserving information content. Data with low entropy permit a larger compression ratio than data with high entropy.
• Types: lossless compression and lossy compression. Entropy effectively limits the strongest lossless compression possible.
Data compression
• Data compression is important to storage systems because it allows more bytes to be packed into a given storage medium than when the data is uncompressed.
• Some storage devices (notably tape) compress data automatically as it is written, resulting in less tape consumption and significantly faster backup operations.
• Compression also reduces file transfer time, saving time and communications bandwidth.
Compression
• There are two main categories:
– Lossless
– Lossy
• Compression is measured by a compression ratio, defined on the next slide.
Compression factor
• A good metric for compression is the compression factor (or compression ratio), given by:
compression factor = original size / compressed size
• If we have a 100KB file that we compress to 40KB, we have a compression factor of:
100KB / 40KB = 2.5
Inefficiency of ASCII
• Realization: in many natural (English) files, we are much more likely to see the letter 'e' than the character '&', yet they are both encoded using 7 bits!
• Solution: use variable-length encoding! The encoding for 'e' should be shorter than the encoding for '&'.
Relative Frequency of Letters in English Text
ASCII (cont.)
• Here are the ASCII bit strings for the capital letters in our alphabet:
Letter  ASCII        Letter  ASCII
A       0100 0001    N       0100 1110
B       0100 0010    O       0100 1111
C       0100 0011    P       0101 0000
D       0100 0100    Q       0101 0001
E       0100 0101    R       0101 0010
F       0100 0110    S       0101 0011
G       0100 0111    T       0101 0100
H       0100 1000    U       0101 0101
I       0100 1001    V       0101 0110
J       0100 1010    W       0101 0111
K       0100 1011    X       0101 1000
L       0100 1100    Y       0101 1001
M       0100 1101    Z       0101 1010
Example: Morse code
• Morse code is a method of transmitting textual information as a series of on-off tones, lights, or clicks that can be directly understood by a skilled listener or observer without special equipment.
• Each character is a sequence of dots and dashes, with the shorter sequences assigned to the more frequently used letters in English – the letter 'E' is represented by a single dot, and the letter 'T' by a single dash.
• Invented in the early 1840s, it was extensively used in the 1890s for early radio communication before it was possible to transmit voice.
A U.S. Navy seaman sends Morse code signals in 2005.
Vibroplex semiautomatic key: the paddle, when pressed to the right by the thumb, generates a series of dits; when pressed to the left by the knuckle of the index finger, it generates a dah.
International Morse Code
Variable-Length Coding
• Assume we know the distribution of characters ('e' appears 1000 times, '&' appears 1 time).
• Each character will be encoded using a number of bits that is inversely proportional to its frequency (made precise later).
• We need a 'prefix-free' encoding: if 'e' = 001, then we cannot assign '&' = 0011. Since the encoding is variable-length, we need to know when to stop.
Encoding Trees
• Think of an encoding as an (unbalanced) binary tree.
• Data is stored in leaf nodes only (prefix-free).
[Tree: branch 0 leads to the leaf 'e'; branch 1 splits again, with 0 leading to 'a' and 1 to 'b'.]
'e' = 0, 'a' = 10, 'b' = 11
• How to decode '01110'?
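The question can be answered mechanically: read bits until they match a codeword, emit that character, and start over. Because the code is prefix-free, no backtracking is ever needed. A minimal Python sketch (my illustration):

```python
CODE = {"0": "e", "10": "a", "11": "b"}  # the prefix-free code from the tree above

def decode(bits):
    """Decode a bit string with a prefix-free code by greedy matching."""
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in CODE:            # complete codeword reached; unambiguous because
            out.append(CODE[buffer])  # no codeword is a prefix of another
            buffer = ""
    return "".join(out)

print(decode("01110"))  # 'eba': 0 -> e, 11 -> b, 10 -> a
```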
Cost of a Tree
• For each character ci, let fi be its frequency in the file.
• Given an encoding tree T, let di be the depth of ci in the tree (the number of bits needed to encode the character).
• The length of the file after encoding it with the coding scheme defined by T will be:
C(T) = Σ di fi
Example Huffman encoding
• A = 0, B = 100, C = 1010, D = 1011, R = 11
• ABRACADABRA = 01001101010010110100110
• This is eleven letters in 23 bits.
• A fixed-width encoding would require 3 bits for 5 different letters, or 33 bits for 11 letters.
• Notice that the encoded bit string can be decoded!
Why it works
• In this example, A was the most common letter.
• In ABRACADABRA there are:
– 5 As → the code for A is 1 bit long
– 2 Rs → the code for R is 2 bits long
– 2 Bs → the code for B is 3 bits long
– 1 C → the code for C is 4 bits long
– 1 D → the code for D is 4 bits long
Creating a Huffman encoding
• For each encoding unit (a letter, in this example), associate a frequency (the number of times it occurs).
– A percentage or a probability can be used instead.
• Create a binary tree whose children are the two encoding units with the smallest frequencies.
– The frequency of the root is the sum of the frequencies of the leaves.
• Repeat this procedure until all the encoding units are in the binary tree.
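This procedure is easy to implement with a priority queue. Below is a compact Python sketch (my own illustration, not from the slides). As the slides note, ties between equal frequencies can be broken arbitrarily, so this sketch may produce a different but equally efficient set of codes than the A = 0, B = 100, ... example; the total encoded length comes out the same:

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman code for a {symbol: frequency} dict; returns {symbol: bit string}."""
    # Heap entries are (frequency, tie-breaker, tree); a tree is a symbol or a (left, right) pair.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # the two smallest frequencies...
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))  # ...merge under a new node
        counter += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")  # assign 0 to left branches
            walk(tree[1], prefix + "1")  # assign 1 to right branches
        else:
            codes[tree] = prefix or "0"  # degenerate case: a one-symbol alphabet
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"A": 40, "B": 20, "C": 10, "D": 10, "R": 20}))
```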
Example, step I
• Assume that the relative frequencies are:
– A: 40
– B: 20
– C: 10
– D: 10
– R: 20
(I chose simpler numbers than the real frequencies.)
• The smallest numbers are 10 and 10 (C and D), so connect those.
Example, step II
• C and D have already been used, and the new node above them (call it C+D) has value 20.
• The smallest values are B, C+D, and R, all of which have value 20.
– Connect any two of these; it doesn't matter which two.
Example, step III
• The smallest value now is R's, while A and B+C+D both have value 40.
• Connect R to either of the others.
[Tree diagram, with the root at the top and the leaves at the bottom.]
Example, step IV
• Connect the final two nodes.
Example, step V
• Assign 0 to left branches, 1 to right branches.
• Each encoding is a path from the root:
A = 0, B = 100, C = 1010, D = 1011, R = 11
• Each path terminates at a leaf.
• Do you see why encoded strings are decodable?
Unique prefix property
• A = 0, B = 100, C = 1010, D = 1011, R = 11
• No bit string is a prefix of any other bit string.
• For example, if we added E = 01, then A (0) would be a prefix of E.
• Similarly, if we added F = 10, then it would be a prefix of three other encodings (B = 100, C = 1010, and D = 1011).
• The unique prefix property holds because, in a binary tree, a leaf is not on a path to any other node.
Practical considerations
• It is not practical to create a Huffman encoding for a single short string, such as ABRACADABRA.
– To decode it, you would need the code table.
– If you include the code table with the message, the whole thing is bigger than just the ASCII message.
• Huffman encoding is practical if:
– the encoded string is large relative to the code table, OR
– we agree on the code table beforehand.
• For example, it's easy to find a table of letter frequencies for English (or any other alphabet-based language).
Data compression
• Huffman encoding is a simple example of data compression: representing data in fewer bits than it would otherwise need.
• A more sophisticated method is GIF (Graphics Interchange Format) compression, for .gif files.
• Another is JPEG (Joint Photographic Experts Group), for .jpg files.
– Unlike the others, JPEG is lossy – it loses information.
– It is generally OK for photographs (if you don't compress them too much) because decompression adds "fake" data very similar to the original.
JPEG Compression
• Photographic images incorporate a great deal of information. However, much of that information can be lost without objectionable deterioration in image quality.
• With this in mind, JPEG allows user-selectable image quality, but even at the "best" quality levels, JPEG makes an image file smaller owing to its multiple-step compression algorithm.
• It's important to remember that JPEG is lossy, even at the highest quality setting. It should be used only when the loss can be tolerated.
2. Run Length Encoding (RLE)
• RLE: when data contain strings of repeated symbols (such as bits or characters), the strings can be replaced by a special marker, followed by the repeated symbol, followed by the number of occurrences. In general, the number of occurrences (length) is shown by a two-digit number.
• If the special marker itself occurs in the data, it is duplicated (as in character stuffing).
• RLE can be used in audio (silence is a run of 0s) and video (a run of picture elements having the same brightness and color).
An Example of Run-Length Encoding
2. Run Length Encoding (RLE)
• Example
– # is chosen as the special marker.
– A two-digit number is chosen for the repetition count.
– Consider the following string of decimal digits (32 digits):
15000000000045678111111111111118
Using the RLE algorithm, the string above would be encoded as (16 digits):
15#01045678#1148
– The compression ratio would be (1 – 16/32) × 100% = 50%.
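A minimal Python sketch of this scheme (my illustration, matching the slide's conventions: marker '#', then the repeated symbol, then a two-digit count; runs shorter than four symbols are left alone, since encoding them would not save space):

```python
def rle_encode(s, marker="#", min_run=4):
    """Run-length encode a string of symbols, e.g. the slide's 32-digit example -> 16 digits."""
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                      # find the end of the current run
        run = j - i
        if s[i] == marker:
            out.append(marker * 2 * run)  # marker in the data: duplicate it (character stuffing)
        elif run >= min_run:
            out.append(f"{marker}{s[i]}{run:02d}")  # marker + symbol + two-digit count (run <= 99)
        else:
            out.append(s[i] * run)      # short runs are cheaper left as-is
        i = j
    return "".join(out)

print(rle_encode("15000000000045678111111111111118"))  # 15#01045678#1148
```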
One of the essential aspects of communication systems is that the codes, the encoders, and the decoders have layers. Information is encoded from the top layer down, and it is decoded from the bottom layer up.