CS 4100 Artificial Intelligence
Prof. C. Hafner
Class Notes March 27, 2012
Bayes net another example
• What are the conditional independence assumptions
embodied in this model ?
[Figure: Bayes net with nodes Ulcer, Infection, Fever, Stomach Ache; links Ulcer → Infection, Infection → Fever, Ulcer → Stomach Ache]
Bayes net another example
• What are the conditional independence assumptions
embodied in this model ? (and how is it useful)
– Fever is conditionally independent of ulcer and stomach
ache, given Infection
– Stomach ache is conditionally independent of Infection
and fever, given Ulcer.
[Figure: same Bayes net: Ulcer → Infection → Fever; Ulcer → Stomach Ache]
Bayes net another example
• What are the conditional independence assumptions
embodied in this model ?
[Figure: same Bayes net: Ulcer → Infection → Fever; Ulcer → Stomach Ache]
P(Ulcer | Fever) = α P(Fever | Ulcer) P(Ulcer)
                 = α [P(Fever, Inf, SA | Ulc) + P(Fever, ~Inf, SA | Ulc) +
                      P(Fever, Inf, ~SA | Ulc) + P(Fever, ~Inf, ~SA | Ulc)] P(Ulc)
Simplifications (from the network structure):
P(Fever, Inf, SA | Ulc) = P(Fever | Inf) P(Inf | Ulc) P(SA | Ulc)
P(Fever, ~Inf, SA | Ulc) = P(Fever | ~Inf) P(~Inf | Ulc) P(SA | Ulc)
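The enumeration above can be checked with a short Python sketch. All of the conditional probability numbers below are hypothetical, chosen only so the code runs; they are not part of the notes.

```python
# Enumeration for P(Ulcer | Fever) in the Ulcer -> Infection -> Fever,
# Ulcer -> Stomach Ache network. All CPT numbers below are hypothetical.

p_ulcer = 0.1                                   # P(Ulcer)  (assumed)
p_inf_given_ulc = {True: 0.6, False: 0.2}       # P(Infection | Ulcer)  (assumed)
p_fever_given_inf = {True: 0.8, False: 0.1}     # P(Fever | Infection)  (assumed)
p_sa_given_ulc = {True: 0.7, False: 0.3}        # P(Stomach Ache | Ulcer)  (assumed)

def p_fever_given_ulcer(ulc: bool) -> float:
    """Sum out Infection and Stomach Ache: sum over Inf, SA of P(Fever, Inf, SA | Ulc)."""
    total = 0.0
    for inf in (True, False):
        p_inf = p_inf_given_ulc[ulc] if inf else 1 - p_inf_given_ulc[ulc]
        for sa in (True, False):
            p_sa = p_sa_given_ulc[ulc] if sa else 1 - p_sa_given_ulc[ulc]
            total += p_fever_given_inf[inf] * p_inf * p_sa
    return total

# Unnormalized P(Ulcer | Fever) and P(~Ulcer | Fever), then normalize (this is the alpha).
unnorm = {True: p_fever_given_ulcer(True) * p_ulcer,
          False: p_fever_given_ulcer(False) * (1 - p_ulcer)}
alpha = 1 / sum(unnorm.values())
print("P(Ulcer | Fever) =", alpha * unnorm[True])
```

Note that the Stomach Ache terms sum out to 1, which is exactly why the four-term expansion above collapses so neatly.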
Test your understanding: design a Bayes net with plausible
numbers
Information Theory
• Information is about categories and classification
• We measure quantity of information by the
resources needed to represent/store/transmit the
information
• Messages are sequences of 0’s and 1’s (dots/dashes)
which we call “bits” (for binary digits)
• You need to send a message containing the identity
of a spy
– It is known to be Mr. Brown or Mr. Smith
• You can send the message with 1 bit, therefore the
event “the spy is Smith” has 1 bit of information
Calculating quantity of information
• Def: A uniform distribution of a set of possible
outcomes (X1 . . . Xn) means the outcomes are
equally probable; that is, they each have probability
1/n.
• Suppose there are 8 people who can be the spy.
Then the message requires 3 bits. If there are 64
possible spies the message requires 6 bits, etc.
(assuming a uniform distribution)
• Def: The information quantity of a message where
the (uniform) probability of each value is p:
I = -log p bits
Intuition and Examples
• Intuitively, the more “surprising” a message is, the more information it contains. If there are 64 equally probable spies, we are more surprised by the identity of the spy than if there are only two equally probable spies.
• There are 26 letters in the alphabet. Assuming they
are equally probable, how much information is in
each letter:
I = -log (1/26) = log 26 = 4.7 bits
• Assuming the digits from 0 to 9 are equally probable, will the information in each digit be more or less than the information in each letter?
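A quick check of these two cases, using the definition I = -log2(p) above (the script is only an illustration):

```python
from math import log2

def information(p: float) -> float:
    """Information (in bits) of an outcome with probability p: I = -log2(p)."""
    return -log2(p)

print(information(1 / 26))   # one letter, uniform: about 4.70 bits
print(information(1 / 10))   # one digit, uniform: about 3.32 bits (less than a letter)
```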
Sequences of messages
• Things get interesting when we look beyond a single message to a long sequence of messages.
• Consider a 4-sided die, with symbols A, B, C, D:
– Let 00 = A, 01=B, 10=C, 11=D
– Each message is 2 bits. If you throw the die 800 times,
you get a message 1600 bits long
That’s the best you can do if A,B,C,D equally probable
Non-uniform distributions (cont.)
• Consider a 4-sided die, with symbols A, B, C, D:
– But assume P(A) = 7/8 and P(B) = P(C) = P(D) = 1/24 (together 3/24, i.e. 1/8)
– We can take advantage of that with a different code:
0 = A, 10= B, 110 = C, 111 = D
– If we throw the die 800 times, what is the expected length of the message? What is the entropy? (a worked sketch follows below)
• ENTROPY is the average information (in bits) of
events in a long repeated sequence
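A minimal sketch of the expected-length and entropy calculation, assuming the corrected probabilities P(A) = 7/8 and P(B) = P(C) = P(D) = 1/24 and the variable-length code above:

```python
from math import log2

# Assumed (corrected) distribution and the prefix code from the notes.
probs = {"A": 7/8, "B": 1/24, "C": 1/24, "D": 1/24}
code_len = {"A": 1, "B": 2, "C": 3, "D": 3}     # 0, 10, 110, 111

throws = 800
expected_bits_per_throw = sum(probs[s] * code_len[s] for s in probs)
entropy_bits = -sum(p * log2(p) for p in probs.values())

print(expected_bits_per_throw * throws)   # expected message length, about 967 bits
print(entropy_bits)                       # about 0.74 bits per throw (the theoretical floor)
```

So this code already does much better than 2 bits per throw, and the entropy value says an even shorter encoding is possible in principle.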
Entropy
Formula for entropy with outcomes x1 . . . xn :
- Σ P(xi) * log P(xi) bits
For a uniform distribution this is the same as –log P(x1)
since all the P(xi) are the same.
What does it mean? Consider a 6-sided die with equally probable outcomes:
-log(1/6) = 2.58 tells us a long sequence of die throws can be transmitted using 2.58 bits per throw on average, and this is the theoretical best.
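A small Python helper that evaluates this formula directly; it is only a convenience for checking the examples in these notes:

```python
from math import log2

def entropy(probs):
    """Entropy in bits: -sum of p * log2(p) over the outcome probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([1/6] * 6))       # fair 6-sided die: about 2.58 bits per throw
print(entropy([3/4, 1/4]))      # biased coin used later in the notes: about 0.811 bits
print(entropy([1/2, 1/2]))      # fair coin: exactly 1 bit
```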
Review/Explain Entropy
• Let the possible outcomes be x1 . . . . xn
– With probabilities p1 . . . pn that add up to 1
• Ex: an unfair coin where n = 2, x1 = H (3/4), x2 = T
(1/4)
• In a long sequence of events E = e1 . . . ek, we
assume that outcome xi will occur k * pi times, etc.
E = HHTHTHHHTTHHHHHTHHHHTTHTHTHHHH …….
If k = 10000, we can assume H occurs 7500 times, T
2500.
Note: the concept TYPES vs. TOKENS. There are two
types and 10000 tokens in this scenario.
Review/Explain Entropy
The entropy of E, H(E), is the average information of the events in the sequence e1 . . . ek:

H(E) = 1/k * Σ (j = 1 to k) I(ej)
     [now switch to summation over outcomes]
     = 1/k * Σ (i = 1 to n) I(xi) * (k * pi)
     = k/k * Σ (i = 1 to n) I(xi) * pi
     = Σ (i = 1 to n) -log(pi) * pi bits
Review/Explain Entropy
Entropy is sometimes called “disorder” – it represents
the lack of predictability as to the outcome for any
element of a sequence (or set)
If a set has just one outcome, entropy = 1 * -log(1) = 0
If there are 2 outcomes, then 50/50 probability gives
the maximum entropy – complete unpredictability.
This generalizes to any uniform distribution for n
outcomes.
- (0.5 * log(.5) + 0.5 * log(.5)) = 1 bit
Note: log(1/2) = -log(2) = -1
Calculating Entropy
• Consider a biased coin: P(heads) = ¾; P(tails) = ¼
• What is the entropy of a coin toss outcome?
• H = ¼ * -log(1/4) + ¾ * -log(3/4) = 0.811 bits
• Using the Information Theory Log Table
• H = 0.25 * 2.0 + 0.75 * 0.415 = 0.5 + 0.311 = .811
• A fair coin toss has more “information”
• The more unbalanced the probabilities, the more
predictable the outcome, the less you learn from
each message.
Maximum disorder
[Figure: entropy H (in bits) plotted against the probability of x1 for a set containing 2 possible outcomes (x1, x2); H is 0 when the probability is 0 or 1 and peaks at 1 bit when the probability is 1/2.]
What if there are 3 possible outcomes?
for equal probability case: H = -log(1/3) = about 1.58
Define classification tree and ID3 algorithm
• Def: Given a table with one result attribute and several
designated predictor attributes, a classification tree for
that table is a tree such that:
– Each leaf node is labeled with a value of the result
attribute
– Each non-leaf node is labeled with the name of a
predictor attribute
– Each link is labeled with one value of the parent’s
predictor
• Def: the ID3 algorithm takes a table as input and
“learns” a classification tree that efficiently maps
predictor value sets into their results from the table.
A trivial example of a classification tree
Record#   Color     Shape    Fruit
1         red       round    apple
2         yellow    round    lemon
3         yellow    oblong   banana
Color
  red    → apple
  yellow → Shape
             round  → lemon
             oblong → banana
The goal is to create an “efficient” classification tree which always gives
the same answer as the table
A well-known “toy” example: sunburn data
Name    Hair     Height    Weight    Lotion   Sunburned
Sarah   Blonde   Average   Light     No       Yes
Dana    Blonde   Tall      Average   Yes      No
AleX    Brown    Short     Average   Yes      No
Annie   Blonde   Short     Average   No       Yes
Emily   Red      Average   Heavy     No       Yes
Pete    Brown    Tall      Heavy     No       No
John    Brown    Average   Heavy     No       No
Katie   Blonde   Short     Light     Yes      No
Predictor attributes: hair, height, weight, lotion
Hair
  Blonde → Lotion
             Y → Not Sunburned
             N → Sunburned
  Red    → Sunburned
  Brown  → Not Sunburned
Outline of the algorithm
1. Create the root, and make its COLLECTION the entire table
2. Select any non-singular leaf node N to SPLIT
   a. Choose the best attribute A for splitting N (use info theory)
   b. For each value of A (a1, a2, . .) create a child of N, Nai
   c. Label the links from N to its children: “A = ai”
   d. SPLIT the collection of N among its children according to their values of A
3. When no more non-singular leaf nodes exist, the tree is finished
4. Def: a singular node is one whose COLLECTION includes just one value for the result attribute (therefore its entropy = 0)
Choosing the best attribute to SPLIT: the one that is MOST INFORMATIVE, i.e. the one that reduces the entropy (DISORDER) the most.

Assume there are k attributes we can choose. For each one, we compute how much less entropy exists in the resulting children than in the parent:

IG(A) = H(N) – weighted sum of H(children of N)

Each child’s entropy is weighted by the “probability” of that child (estimated by the proportion of the parent’s collection that would be transferred to the child in the split).
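The following Python sketch is one way to implement the outline above on the sunburn table. The names (entropy, info_gain, id3, ROWS, PREDICTORS) and the nested-dict tree representation are my own choices, not part of the notes, and the sketch assumes the table is consistent (no two rows with identical predictors but different results).

```python
from math import log2
from collections import Counter

# The sunburn table from the notes: (Name, Hair, Height, Weight, Lotion, Sunburned)
ROWS = [
    ("Sarah", "Blonde", "Average", "Light",   "No",  "Yes"),
    ("Dana",  "Blonde", "Tall",    "Average", "Yes", "No"),
    ("AleX",  "Brown",  "Short",   "Average", "Yes", "No"),
    ("Annie", "Blonde", "Short",   "Average", "No",  "Yes"),
    ("Emily", "Red",    "Average", "Heavy",   "No",  "Yes"),
    ("Pete",  "Brown",  "Tall",    "Heavy",   "No",  "No"),
    ("John",  "Brown",  "Average", "Heavy",   "No",  "No"),
    ("Katie", "Blonde", "Short",   "Light",   "Yes", "No"),
]
PREDICTORS = {"Hair": 1, "Height": 2, "Weight": 3, "Lotion": 4}
RESULT = 5  # index of the result attribute (Sunburned)

def entropy(rows):
    """Entropy (in bits) of the result attribute over a collection of rows."""
    counts = Counter(r[RESULT] for r in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def info_gain(rows, attr):
    """H(parent) minus the size-weighted entropy of the children after splitting on attr."""
    col = PREDICTORS[attr]
    children = {}
    for r in rows:
        children.setdefault(r[col], []).append(r)
    weighted = sum(len(c) / len(rows) * entropy(c) for c in children.values())
    return entropy(rows) - weighted

def id3(rows):
    """Return a nested-dict classification tree: {attribute: {value: subtree-or-label}}."""
    labels = {r[RESULT] for r in rows}
    if len(labels) == 1:                      # singular node: entropy 0, make a leaf
        return labels.pop()
    best = max(PREDICTORS, key=lambda a: info_gain(rows, a))
    col = PREDICTORS[best]
    return {best: {value: id3([r for r in rows if r[col] == value])
                   for value in {r[col] for r in rows}}}

for a in PREDICTORS:                          # information gains at the root
    print(a, round(info_gain(ROWS, a), 3))
print(id3(ROWS))
```

Running it reproduces the root-level gains worked out below (Hair wins with about 0.45) and builds the same tree: Hair Color at the root, with Lotion under Blonde.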
(Notation: C(node) = {collection members}(# sunburned, # not sunburned)/entropy)
C(S1) = {S,D,X,A,E,P,J,K}(3,5)/____
S1: _______ (attribute to be chosen)
Calculate entropy: -[3/8 log 3/8 + 5/8 log 5/8] = .53 + .424 = .954
Find information gain (IG) for all 4 predictors: hair, height, weight, lotion
Start with lotion: values (yes, no)
Child 1: (yes) = {D,X,K}(0, 3)/0
Child 2: (no) = {S,A,E,P,J}(3,2)/ -[3/5 log 3/5 + 2/5 log 2/5] = .971
Child set entropy = 3/8 * 0 + 5/8 * .971 = 0.607
IG(Lotion) = .954 - .607 = .347
Then try hair color: values (blond, brown, red)
Child 1(blond) = {S,D,A,K}(2,2)/1
Child 2(brown) = {X,P,J}(0,3)/0
Child 3(red) = {E}(1,0)/0
Child set entropy = 4/8 * 1 + 3/8 * 0 + 1/8 * 0 = 0.5
IG(Hair color) = .954 - 0.5 = .454
Next try Height: values (average, tall, short)
Child1(average) = {S,E,J}(2,1)/ -[2/3 log 2/3 + 1/3 log 1/3] = 0.92
Child2(tall) = {D,P}(0,2)/0
Child3(short)={X,A,K}(1,2)/0.92
Child set entropy = 3/8 * 0.92 + 2/8 * 0 + 3/8 * 0.92 = 0.69
IG(Height) = .954 - .69 = 0.26
Next try Weight . . . IG(Weight) = 0.954 – 0.94 = 0.014
So Hair color wins: Draw the first split and assign the collections
S1: Hair Color
  Blonde → S2: _______          C(S2) = {S,D,A,K}(2,2)/1
  Red    → yes (Sunburned)
  Brown  → no (Not Sunburned)
C(S2) = {S,D,A,K}(2,2)/1
Start with lotion: values (yes, no)
Child 1: (yes) = {D, K}(0, 2)/0
Child 2: (no) = {S,A}(2,0)/ 0
Child set entropy = 0
IG(Lotion) = 1 – 0 = 1
No reason to go any further
S1: Hair Color
  Blonde → S2: Lotion           C(S2) = {S,D,A,K}(2,2)/1
             no  → yes (Sunburned)
             yes → no (Not Sunburned)
  Red    → yes (Sunburned)
  Brown  → no (Not Sunburned)
Discuss assignment 5
Perceptrons and Neural Networks:
Another Supervised Learning Approach
Perceptron Learning (Supervised)
• Assign random weights (or set all to 0)
• Cycle through input data until change < target
• Let α be the “learning coefficient”
• For each input:
  – If perceptron gives correct answer, do nothing
  – If perceptron says yes when answer should be no, decrease the weights on all units that “fired” by α
  – If perceptron says no when answer should be yes, increase the weights on all units that “fired” by α
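A minimal Python sketch of this update rule. The AND-gate training data, the fixed threshold of 0.5, and the names train_perceptron and examples are illustrative choices of mine, not from the notes.

```python
# Perceptron learning with the rule described above:
# correct -> no change; false positive -> subtract alpha from weights of inputs that fired;
# false negative -> add alpha to weights of inputs that fired.

def train_perceptron(examples, n_inputs, alpha=0.1, epochs=100):
    weights = [0.0] * n_inputs          # start with all weights 0
    threshold = 0.5                     # fixed threshold (an assumption of this sketch)
    for _ in range(epochs):
        changed = False
        for x, target in examples:
            fired = sum(w * xi for w, xi in zip(weights, x)) > threshold
            if fired == target:
                continue                # correct answer: do nothing
            delta = -alpha if fired else alpha
            # only inputs that "fired" (xi = 1) have their weights changed
            weights = [w + delta * xi for w, xi in zip(weights, x)]
            changed = True
        if not changed:                 # a full pass with no change: stop
            break
    return weights

# Illustrative data: learn the AND function of two binary inputs.
data = [((0, 0), False), ((0, 1), False), ((1, 0), False), ((1, 1), True)]
print(train_perceptron(data, n_inputs=2))
```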