
EE513
Audio Signals and Systems
Statistical Pattern Classification
Kevin D. Donohue
Electrical and Computer Engineering
University of Kentucky
Interpretation of Auditory Scenes
• Human perception and cognition greatly exceed any computer-based system at abstracting sounds into objects and creating meaningful auditory scenes. This perception of objects (not just detection of acoustic energy) allows interpretation of situations, leading to an appropriate response or further analyses.
• Sensory organs (ears) separate acoustic energy into frequency bands and convert band energy into neural firings.
• The auditory cortex receives the neural responses and abstracts an auditory scene.
Auditory Scene
• Perception derives a useful representation of reality from sensory input.
• Auditory stream refers to a perceptual unit associated with a single happening (A.S. Bregman, 1990).
[Diagram: Acoustic-to-Neural Conversion → Organize into Auditory Streams → Representation of Reality]
Computer Interpretation
• In order for a computer algorithm to interpret a scene:
  - Acoustic signals must be converted to numbers using meaningful models.
  - Sets of numbers (or patterns) are mapped into events (perceptions).
  - Events are analyzed with other events in relation to the goal of the algorithm and mapped into a situation (cognition, or deriving meaning).
  - The situation is mapped into an action/response.
• Numbers extracted from the acoustic signal for the purpose of classification (determination of an event) are referred to as features.
• Time-based features are extracted from signal transforms such as:
  - Envelope
  - Correlations
• Frequency-based features are extracted from signal transforms such as:
  - Spectrum (Cepstrum)
  - Power Spectral Density
A sketch of extracting such features appears below.
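As a rough illustration of these transforms, the sketch below (Python with NumPy/SciPy assumed) computes an envelope, an autocorrelation, a magnitude spectrum with a real cepstrum, and a Welch PSD estimate; the function name and the test signal are made up for the example.

```python
import numpy as np
from scipy.signal import hilbert, welch

def extract_features(x, fs):
    """Compute a few illustrative time- and frequency-based features
    from a single-channel signal x sampled at fs Hz."""
    # Time-based: envelope via the analytic signal (Hilbert transform)
    envelope = np.abs(hilbert(x))
    # Time-based: normalized autocorrelation (non-negative lags)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    ac = ac / ac[0]
    # Frequency-based: magnitude spectrum and real cepstrum
    spectrum = np.abs(np.fft.rfft(x))
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-12))
    # Frequency-based: power spectral density (Welch estimate)
    freqs, psd = welch(x, fs=fs, nperseg=min(1024, len(x)))
    return envelope, ac, spectrum, cepstrum, freqs, psd

# Example: a 440 Hz tone in noise (illustrative only)
fs = 8000
t = np.arange(0, 0.5, 1 / fs)
x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(len(t))
env, ac, spec, ceps, freqs, psd = extract_features(x, fs)
```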
Feature Selection Example
• Consider the problem of discriminating between the spoken words yes and no based on 2 features:
  1. The estimate of the first formant frequency g1 (resonance of the spectral envelope).
  2. The ratio in dB of the amplitude of the second formant frequency over the third formant frequency, g2.
• A fictitious experiment was performed and these 2 features were computed for 25 recordings of people saying these words. The features were plotted for each class to develop an algorithm to classify these samples correctly.
Feature Plot
• Define a feature vector:
$$\mathbf{G} = \begin{bmatrix} g_1 \\ g_2 \end{bmatrix}$$
• Plot G with green o's given a yes was spoken, and with red x's given a no was spoken.
[Figure: scatter plot of the yes (green o) and no (red x) feature vectors; x-axis: First Formant Frequency (g1), 440 to 600 Hz; y-axis: dB of Ratio Formant 3 over 4 (g2), -10 to 20 dB]
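A plot like the one described above could be produced with something like the following sketch; the arrays yes_feats and no_feats (25 x 2 each) are hypothetical stand-ins for the measured formant features.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical feature matrices (25 samples x 2 features per class);
# in practice these would come from the formant measurements.
yes_feats = rng.normal([500, 5], [30, 5], size=(25, 2))
no_feats = rng.normal([540, 0], [30, 5], size=(25, 2))

plt.plot(yes_feats[:, 0], yes_feats[:, 1], "go", label="yes")
plt.plot(no_feats[:, 0], no_feats[:, 1], "rx", label="no")
plt.xlabel("First Formant Frequency ( g1 )")
plt.ylabel("dB of Ratio Formant 3 over 4 ( g2 )")
plt.legend()
plt.show()
```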
Minimum Distance Approach
• Create a representative vector for the yes and no features:
$$\boldsymbol{\mu}_{\text{yes}} = \frac{1}{25}\sum_{n=1}^{25}\mathbf{G}(n \mid \text{yes}) \qquad \boldsymbol{\mu}_{\text{no}} = \frac{1}{25}\sum_{n=1}^{25}\mathbf{G}(n \mid \text{no})$$
• For a new sample with estimated features, use the decision rule:
$$\left\|\mathbf{G}-\boldsymbol{\mu}_{\text{no}}\right\| \;\underset{\text{yes}}{\overset{\text{no}}{\lessgtr}}\; \left\|\mathbf{G}-\boldsymbol{\mu}_{\text{yes}}\right\|$$
• Results in 3 incorrect decisions.
[Figure: scatter of the yes and no feature vectors with class means; x-axis: First Formant Frequency (g1), 440 to 600 Hz; y-axis: dB of Ratio Formant 3 over 4 (g2), -10 to 20 dB]
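A minimal sketch of this minimum distance rule, assuming hypothetical feature arrays in place of the actual formant measurements:

```python
import numpy as np

def min_distance_classify(G, class_means):
    """Assign feature vector G to the class whose mean is closest
    in Euclidean distance. class_means maps label -> mean vector."""
    dists = {label: np.linalg.norm(G - mu) for label, mu in class_means.items()}
    return min(dists, key=dists.get)

# Hypothetical training features (25 x 2 per class), e.g. from the yes/no experiment
rng = np.random.default_rng(1)
yes_feats = rng.normal([500, 5], [30, 5], size=(25, 2))
no_feats = rng.normal([540, 0], [30, 5], size=(25, 2))

means = {"yes": yes_feats.mean(axis=0), "no": no_feats.mean(axis=0)}
print(min_distance_classify(np.array([510.0, 3.0]), means))
```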
Normalization With STD
• The frequency features had larger values than the amplitude ratios, and therefore had more influence in the decision process.
• Remove scale differences by normalizing each feature by its standard deviation over all classes:
$$\sigma_i^2 = \frac{1}{2\cdot 25}\left[\sum_{n=1}^{25}\big(g_i(n \mid \text{yes})-\mu_{i\mid\text{yes}}\big)^2 + \sum_{n=1}^{25}\big(g_i(n \mid \text{no})-\mu_{i\mid\text{no}}\big)^2\right]$$
• Now 4 errors result (why would it change?).
[Figure: scatter of the normalized yes and no feature vectors; x-axis: Normalized First Formant Frequency (g1), 14 to 19; y-axis: Normalized dB of Ratio Formant 3 over 4 (g2), -1.5 to 3.5]
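A sketch of this per-feature normalization, continuing the hypothetical yes/no arrays from the previous sketch; the variance is pooled over the recordings of both classes:

```python
import numpy as np

def normalize_features(yes_feats, no_feats):
    """Scale each feature by its standard deviation pooled over both classes."""
    # Deviations from each class mean, stacked over all recordings
    dev_yes = yes_feats - yes_feats.mean(axis=0)
    dev_no = no_feats - no_feats.mean(axis=0)
    pooled = np.vstack([dev_yes, dev_no])
    sigma = np.sqrt((pooled ** 2).sum(axis=0) / len(pooled))
    return yes_feats / sigma, no_feats / sigma, sigma
```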
Minimum Distance Classifier
• Consider a feature vector x with the potential to be classified as belonging to one of K exclusive classes.
• The classification decision will be based on the distance of the feature vector to one of the template vectors representing each of the K classes.
• The decision rule is: for a given observation x and set of template vectors z_k for each class, decide on the class k such that
$$\arg\min_k\; D_k = (\mathbf{x}-\mathbf{z}_k)^T(\mathbf{x}-\mathbf{z}_k)$$
Minimum Distance Classifier
• If some features need to be weighted more than others in the decision process, or if correlation between the features is to be exploited, the distance for each feature can be weighted, resulting in the weighted minimum distance classifier:
$$\arg\min_k\; D_k = (\mathbf{x}-\mathbf{z}_k)^T \mathbf{W}\,(\mathbf{x}-\mathbf{z}_k)$$
where W is a square matrix of weights with dimension equal to the length of x. If W is a diagonal matrix, it simply scales each of the features in the decision process. Off-diagonal terms scale the correlation between features. If W is the inverse of the covariance matrix of the features in x, and z_k is the mean feature vector for each class, then the above distances are referred to as the Mahalanobis distance:
$$\mathbf{z}_k = E[\mathbf{x} \mid k] \qquad \mathbf{W} = \left[\frac{1}{K}\sum_{k=1}^{K} E\big[(\mathbf{x}-\mathbf{z}_k)(\mathbf{x}-\mathbf{z}_k)^T \mid k\big]\right]^{-1}$$
Correlation Receiver
• It can be shown that selecting the class based on the minimum distance between the observation vector and the template vector is equivalent to finding the maximum correlation between the observation vector and the template:
$$\arg\min_k\; D_k = (\mathbf{x}-\mathbf{z}_k)^T(\mathbf{x}-\mathbf{z}_k) \;\Longleftrightarrow\; \arg\max_k\; C_k = \mathbf{x}^T\mathbf{z}_k$$
or
$$\arg\min_k\; D_k = (\mathbf{x}-\mathbf{z}_k)^T\mathbf{W}\,(\mathbf{x}-\mathbf{z}_k) \;\Longleftrightarrow\; \arg\max_k\; C_k = \mathbf{x}^T\mathbf{W}\,\mathbf{z}_k$$
where the template vectors have been normalized such that $\mathbf{z}_k^T\mathbf{z}_k = P$ (P is a constant) for all k.
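A quick numerical check of the unweighted equivalence, using made-up templates that share a common norm:

```python
import numpy as np

rng = np.random.default_rng(2)
d, K = 4, 3
# Templates normalized to a common squared norm z_k^T z_k = P (here P = 1)
Z = rng.normal(size=(K, d))
Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
x = rng.normal(size=d)

dist = np.array([(x - z) @ (x - z) for z in Z])
corr = Z @ x
# Minimum distance and maximum correlation pick the same class
assert dist.argmin() == corr.argmax()
```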

Definitions
• Random variable (RV): a function that maps events (sets) into a discrete set of real numbers for a discrete RV, or a continuous set of real numbers for a continuous RV.
• Random process (RP): a series of RVs indexed by a countable set for a discrete RP, or by a non-countable set for a continuous RP.
Definitions: PDF First Order
• The likelihood of RV values is described through the probability density function (pdf):
$$\Pr[x_b \le X \le x_e] = \int_{x_b}^{x_e} p_X(x)\,dx$$
$$p_X(x) \ge 0 \;\; \forall x \qquad \text{and} \qquad \int_{-\infty}^{\infty} p_X(x)\,dx = 1$$
Definitions: Joint PDF
• The probabilities describing more than one RV are described by a joint pdf:
$$\Pr\big[(x_b \le X \le x_e) \cap (y_b \le Y \le y_e)\big] = \int_{y_b}^{y_e}\!\!\int_{x_b}^{x_e} p_{XY}(x,y)\,dx\,dy$$
$$p_{XY}(x,y) \ge 0 \;\; \forall x,y \qquad \text{and} \qquad \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} p_{XY}(x,y)\,dx\,dy = 1$$
Definitions: Conditional PDF
• The probabilities describing an RV, given that another event has already occurred, are described by a conditional pdf:
$$p_{X|Y}(x \mid y) = \frac{p_{XY}(x,y)}{p_Y(y)}$$
• Closely related to this is Bayes' rule:
$$p_{X|Y}(x \mid y)\,p_Y(y) = p_{XY}(x,y) = p_{Y|X}(y \mid x)\,p_X(x)
\quad\Longrightarrow\quad
p_{Y|X}(y \mid x) = \frac{p_{X|Y}(x \mid y)\,p_Y(y)}{p_X(x)}$$
Examples: Gaussian PDF
• A first order Gaussian RV pdf (scalar x) with mean μ and standard deviation σ is given by:
$$p_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
• A higher order joint Gaussian pdf (column vector x) with mean vector m and covariance matrix Σ is given by:
$$p_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}\,|\boldsymbol{\Sigma}|^{1/2}}\exp\!\left(-\frac{1}{2}(\mathbf{x}-\mathbf{m})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\mathbf{m})\right)$$
$$\mathbf{x} = [x_1, x_2, \dots, x_n]^T \qquad \mathbf{m} = E[\mathbf{x}] \qquad \boldsymbol{\Sigma} = E\big[(\mathbf{x}-\mathbf{m})(\mathbf{x}-\mathbf{m})^T\big]$$
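A sketch evaluating the joint Gaussian density directly from the formula, cross-checked against scipy.stats.multivariate_normal; the mean, covariance, and test point are illustrative values:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, m, Sigma):
    """Evaluate the n-dimensional joint Gaussian pdf at vector x."""
    n = len(x)
    diff = x - m
    quad = diff @ np.linalg.inv(Sigma) @ diff
    norm_const = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * quad)

m = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
x = np.array([0.5, 0.5])
assert np.isclose(gaussian_pdf(x, m, Sigma), multivariate_normal(m, Sigma).pdf(x))
```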
Example Uncorrelated
Prove that for an Nth order sequence of uncorrelated Gaussian zero-mean RVs the joint pdf can be written as:
$$p_{\mathbf{X}}(\mathbf{x}) = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi\sigma_i^2}}\exp\!\left(-\frac{x_i^2}{2\sigma_i^2}\right)$$
Note that for Gaussian RVs, uncorrelated implies statistical independence.
Assume variances are equal for all elements. What would the autocorrelation of this sequence look like?
How would the above analysis change if the RVs were not zero mean?
Class PDFs
When features are modeled as RVs, their pdfs can be used to derive distance measures for the classifier, and an optimal decision rule that minimizes classification error can be designed.
Consider K classes individually denoted by ω_k. Feature values associated with each class can be described by:
• a posteriori probability (likelihood of the class after the observation/data): $p_k(\omega_k \mid \mathbf{x})$
• a priori probability (likelihood of the class before the observation/data): $p_k(\omega_k)$
• likelihood function (likelihood of the observation/data given a class): $p_x(\mathbf{x} \mid \omega_k)$
Class PDFs
The likelihood function can be estimated through empirical studies.
Consider 3 speakers whose 3rd formant frequency is distributed by:
[Figure: likelihood functions p_x(x|ω_1), p_x(x|ω_2), p_x(x|ω_3) for classes ω_1 (-3, 0.9), ω_2 (0, 1.2), ω_3 (2, 0.5), plotted over feature values -8 to 4, with decision thresholds marked]
Classifier probabilities can be obtained from Bayes' rule:
$$p_k(\omega_k \mid x) = \frac{p_x(x \mid \omega_k)\,p_k(\omega_k)}{p_x(x)}$$
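A sketch of this posterior computation for the three-speaker example; the (mean, spread) pairs are read from the figure, the spread is assumed here to be a standard deviation, and equal priors are assumed:

```python
import numpy as np
from scipy.stats import norm

# Class-conditional Gaussians read from the figure (assumed mean, std) and assumed equal priors
params = {"w1": (-3.0, 0.9), "w2": (0.0, 1.2), "w3": (2.0, 0.5)}
priors = {k: 1.0 / 3.0 for k in params}

def posteriors(x):
    """Bayes' rule: p(w_k | x) = p(x | w_k) p(w_k) / p(x)."""
    joint = {k: norm.pdf(x, mu, sd) * priors[k] for k, (mu, sd) in params.items()}
    total = sum(joint.values())
    return {k: v / total for k, v in joint.items()}

print(posteriors(-1.0))
```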
Maximum a posteriori Decision Rule
For K classes and an observed feature vector x, the maximum a posteriori (MAP) decision rule states:
$$\text{Decide } \omega_i \text{ if } \; p_k(\omega_i \mid \mathbf{x}) > p_k(\omega_j \mid \mathbf{x}) \quad \forall j \ne i$$
or, by applying Bayes' rule:
$$\text{Decide } \omega_i \text{ if } \; p_x(\mathbf{x} \mid \omega_i) > \frac{p_k(\omega_j)}{p_k(\omega_i)}\,p_x(\mathbf{x} \mid \omega_j) \quad \forall j \ne i$$
For the binary case this reduces to the (log) likelihood ratio test:
$$\frac{p_x(\mathbf{x} \mid \omega_i)}{p_x(\mathbf{x} \mid \omega_j)} \;\underset{\omega_j}{\overset{\omega_i}{\gtrless}}\; \frac{p_k(\omega_j)}{p_k(\omega_i)}
\qquad\text{or}\qquad
\ln p_x(\mathbf{x} \mid \omega_i) - \ln p_x(\mathbf{x} \mid \omega_j) \;\underset{\omega_j}{\overset{\omega_i}{\gtrless}}\; \ln\!\left(\frac{p_k(\omega_j)}{p_k(\omega_i)}\right)$$
Example
Consider a 2 class problem with Gaussian distributed feature vectors $\mathbf{x} = [x_1, x_2, \dots, x_N]^T$:
$$\mathbf{m}_1 = E[\mathbf{x} \mid \omega_1] \qquad \boldsymbol{\Sigma}_1 = E\big[(\mathbf{x}-\mathbf{m}_1)(\mathbf{x}-\mathbf{m}_1)^T \mid \omega_1\big]$$
$$\mathbf{m}_2 = E[\mathbf{x} \mid \omega_2] \qquad \boldsymbol{\Sigma}_2 = E\big[(\mathbf{x}-\mathbf{m}_2)(\mathbf{x}-\mathbf{m}_2)^T \mid \omega_2\big]$$
Derive the log likelihood ratio and describe how the classifier
uses distance information to discriminate between the classes.
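A sketch of the resulting two-class Gaussian log-likelihood-ratio classifier; equal priors are assumed and the parameter values are illustrative. Note that the quadratic terms are the Mahalanobis distances of x to each class mean, which is how the classifier uses distance:

```python
import numpy as np

def gaussian_llr(x, m1, S1, m2, S2):
    """Log-likelihood ratio ln p(x|w1) - ln p(x|w2) for Gaussian classes."""
    d1 = (x - m1) @ np.linalg.inv(S1) @ (x - m1)  # Mahalanobis distance^2 to class 1
    d2 = (x - m2) @ np.linalg.inv(S2) @ (x - m2)  # Mahalanobis distance^2 to class 2
    log_dets = np.log(np.linalg.det(S2)) - np.log(np.linalg.det(S1))
    return 0.5 * (d2 - d1) + 0.5 * log_dets

# Decide w1 when the LLR exceeds ln(p(w2)/p(w1)); the threshold is 0 for equal priors.
m1, m2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
S1 = S2 = np.eye(2)
x = np.array([0.4, 0.2])
decision = "w1" if gaussian_llr(x, m1, S1, m2, S2) > 0.0 else "w2"
```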
Homework
Consider 2 features for use in a binary classification problem. The features are Gaussian distributed and form the feature vector x = [x1, x2]^T. Derive the log likelihood ratio and corresponding classifier for the four cases listed below:

1) $p_k(\omega_1) = p_k(\omega_2) = 0.5$, $\mathbf{m}_1 = [1, 1]^T$, $\mathbf{m}_2 = [-1, -1]^T$, $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \begin{bmatrix} 0.5 & -0.2 \\ -0.2 & 0.5 \end{bmatrix}$

2) $p_k(\omega_1) = p_k(\omega_2) = 0.5$, $\mathbf{m}_1 = [1, 1]^T$, $\mathbf{m}_2 = [-1, -1]^T$, $\boldsymbol{\Sigma}_1 = \begin{bmatrix} 0.6 & 0 \\ 0 & 1.2 \end{bmatrix}$, $\boldsymbol{\Sigma}_2 = \begin{bmatrix} 0.8 & 0 \\ 0 & 0.2 \end{bmatrix}$

3) $p_k(\omega_1) = p_k(\omega_2) = 0.5$, $\mathbf{m}_1 = \mathbf{m}_2 = [0, 0]^T$, $\boldsymbol{\Sigma}_1 = \begin{bmatrix} 0.1 & 0 \\ 0 & 0.1 \end{bmatrix}$, $\boldsymbol{\Sigma}_2 = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}$

4) $p_k(\omega_1) = 0.2$, $p_k(\omega_2) = 0.8$, $\mathbf{m}_1 = [1, 1]^T$, $\mathbf{m}_2 = [-1, -1]^T$, $\boldsymbol{\Sigma}_1 = \begin{bmatrix} 0.6 & 0 \\ 0 & 1.2 \end{bmatrix}$, $\boldsymbol{\Sigma}_2 = \begin{bmatrix} 0.8 & 0 \\ 0 & 0.2 \end{bmatrix}$

Comment on how each classifier computes "distance" and uses it in the classification process.
Classification Error
Classification error is the probability that the decision statistic falls on the wrong side of its threshold, weighted by the probability of each class occurring:
$$p_e = p_k(\omega_1)\int_{T_1}^{\infty} p_\lambda(\lambda \mid \omega_1)\,d\lambda
 + p_k(\omega_2)\left[\int_{-\infty}^{T_1} p_\lambda(\lambda \mid \omega_2)\,d\lambda + \int_{T_2}^{\infty} p_\lambda(\lambda \mid \omega_2)\,d\lambda\right]
 + p_k(\omega_3)\int_{-\infty}^{T_2} p_\lambda(\lambda \mid \omega_3)\,d\lambda$$
[Figure: class-conditional densities p(λ|ω_1), p(λ|ω_2), p(λ|ω_3) of the decision statistic, plotted over -8 to 6, with decision thresholds T1 and T2 marked]
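A sketch computing this error probability numerically for the three Gaussian classes shown earlier; the priors and the thresholds T1, T2 are hypothetical, and the class spreads are assumed to be standard deviations:

```python
from scipy.stats import norm

# Class-conditional decision-statistic densities (mean, std) -- illustrative values
# taken from the earlier figure -- and assumed equal priors.
classes = [(-3.0, 0.9), (0.0, 1.2), (2.0, 0.5)]
priors = [1 / 3, 1 / 3, 1 / 3]
T1, T2 = -1.5, 1.0  # hypothetical decision thresholds

p1 = norm(*classes[0]).sf(T1)                               # class 1 decided past T1
p2 = norm(*classes[1]).cdf(T1) + norm(*classes[1]).sf(T2)   # class 2 outside (T1, T2)
p3 = norm(*classes[2]).cdf(T2)                              # class 3 decided below T2
pe = priors[0] * p1 + priors[1] * p2 + priors[2] * p3
print(pe)
```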
Homework
For the previous example, write an expression for the probability of a correct classification by changing the integrals and limits (i.e., do not simply write p_c = 1 - p_e).
Approximating a Bayes Classifier
If the density functions are not known:
• Determine template vectors that minimize distances to the feature vectors in each class for the training data (vector quantization).
• Assume a form for the density function and estimate its parameters (directly or iteratively) from the data (parametric estimation or expectation maximization).
• Learn posterior probabilities directly from the training data and interpolate on test data (neural networks).
A sketch of the first approach appears below.
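A minimal sketch of the vector quantization approach, assuming a training feature array per class; the function and variable names are hypothetical:

```python
import numpy as np

def vq_templates(samples, n_templates, n_iter=50, seed=0):
    """Simple k-means style vector quantization: find template vectors that
    minimize the distance to the training feature vectors of one class."""
    rng = np.random.default_rng(seed)
    templates = samples[rng.choice(len(samples), n_templates, replace=False)]
    for _ in range(n_iter):
        # Assign each sample to its nearest template
        d = np.linalg.norm(samples[:, None, :] - templates[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update each template as the mean of its assigned samples
        for k in range(n_templates):
            if np.any(labels == k):
                templates[k] = samples[labels == k].mean(axis=0)
    return templates

# Usage idea: per-class templates feeding the minimum distance classifier, e.g.
# class_templates = {c: vq_templates(feats, 2) for c, feats in training_data.items()}
```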