Activity Analysis of Sign Language Video

Generals exam
Neva Cherniavsky
MobileASL goal:
• ASL communication using video cell phones over
current U.S. cell phone network
Challenges:
• Limited network bandwidth
• Limited processing power on cell phones
Activity Analysis and MobileASL
• Use qualities unique to sign language
– Signing/Not signing/Finger spelling
– Information at beginning and ending of signs
• Decrease cost of sending video
– Maximum bandwidth
– Total data sent and received
– Power consumption
– Processing cost
One Approach:
Variable Frame Rate
Variable Frame Rate
• Decrease frame rate during “listening”
• Goal: reduce cost while maintaining or
increasing intelligibility
– Maximum bandwidth? NO
– Total data sent and received? YES
– Power consumption? YES
– Processing cost? YES
Demo
The story so far...
• Showed variable frame rate can reduce
cost (25% savings in bit rate)
• Conducted user studies to determine
intelligibility of variable frame rate videos
– Quality of each frame held constant (data
transmitted decreased with decreased frame
rate)
– Lowering frame rate did not affect intelligibility
– Freeze-frame effect was thought unnatural by participants
Outline
1. Introduction
2. Completed Activity Analysis Research
a. Feature extraction
b. Classification
3. Proposed Activity Analysis Research
4. Timeline to complete dissertation
Activity Analysis, big picture
Raw Data → Feature Extraction → Classification Engine → Classification → Modification
Activity Analysis, thus far
Video frames → Feature Extraction → Classification → Signing, Listening
Features
H.264 encoder information:
• Type of macroblock
• Motion vectors
Features cont.
Features:
• (x, y) motion vector of the face region
• (x, y) motion vector of the left region
• (x, y) motion vector of the right region
• Number of I (intra-coded) blocks
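As a rough illustration of how such a per-frame feature vector could be assembled from the decoder output, the Python sketch below averages the motion vectors per region and counts I blocks; the macroblock data layout, the face/left/right region assignment, and the helper name frame_features are hypothetical, not the MobileASL implementation.

# Sketch only: one feature vector per frame from H.264 macroblock data.
# The dict layout and the face/left/right region labels are assumptions.
import numpy as np

def frame_features(macroblocks):
    """macroblocks: iterable of dicts with keys 'region', 'type', 'mv' = (dx, dy)."""
    sums = {"face": np.zeros(2), "left": np.zeros(2), "right": np.zeros(2)}
    counts = {"face": 0, "left": 0, "right": 0}
    i_blocks = 0
    for mb in macroblocks:
        if mb["type"] == "I":              # count intra-coded blocks
            i_blocks += 1
        sums[mb["region"]] += np.asarray(mb["mv"], dtype=float)
        counts[mb["region"]] += 1
    feat = []
    for region in ("face", "left", "right"):
        feat.extend(sums[region] / max(counts[region], 1))  # mean (x, y) motion
    feat.append(i_blocks)
    return np.array(feat)                  # 7 values: 3 regions x (x, y) + I-block count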
Classification
• Train via labeled examples
• Training can be performed offline, testing
must be real-time
• Support vector machines
• Hidden Markov models
Support vector machines
• More accurately called a support vector classifier
• Separates the training data into two classes so that the margin between them is maximal
Maximum margin hyperplane
[figure: small-margin vs. large-margin separating hyperplanes; support vectors highlighted]
What if it’s non-linear?
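The implementation notes that follow answer this with a radial basis kernel. For reference (the formula is not on the slide), the standard RBF kernel replaces the inner product between feature vectors with

\[
K(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\gamma \,\lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\right), \qquad \gamma > 0,
\]

which amounts to finding a linear separator in an implicit, higher-dimensional feature space.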
Implementation notes
• May not be separable
– Use linear separation, but allow training errors
– Higher cost for errors = more accurate model, may
not generalize
• libsvm, a publicly available library (used via its Matlab interface)
– Exhaustive search on training data to choose best
parameters
– Radial basis kernel function
• As originally published, no temporal information
– Use “sliding window”, keep track of classification
– Majority vote gives result
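A minimal sketch of this pipeline is shown below using scikit-learn's SVC, which wraps libsvm, rather than the Matlab interface used here; the parameter grid, 5-fold cross-validation, and default 3-frame window are illustrative assumptions, not the settings from this work.

# Sketch only: RBF-kernel SVM with a grid search over (C, gamma), then a
# sliding-window majority vote over per-frame predictions.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def train_svm(X_train, y_train):
    # Exhaustive search on the training data for the best parameters.
    grid = {"C": [2.0 ** k for k in range(-3, 8)],
            "gamma": [2.0 ** k for k in range(-7, 2)]}
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
    search.fit(X_train, y_train)
    return search.best_estimator_

def classify_frames(clf, X_test, window=3):
    # Per-frame labels, smoothed by a majority vote over the last `window` frames.
    raw = clf.predict(X_test)
    smoothed = []
    for i in range(len(raw)):
        recent = raw[max(0, i - window + 1): i + 1]
        labels, counts = np.unique(recent, return_counts=True)
        smoothed.append(labels[np.argmax(counts)])
    return np.array(smoothed)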
SVM Classification Accuracy

Test video    SVM      SVM 3-frame   SVM 4-frame   SVM 5-frame
gina1         87.8%    88.8%         87.9%         88.7%
gina2         85.2%    87.4%         90.3%         88.3%
gina3         90.6%    91.3%         91.1%         91.3%
gina4         86.6%    87.1%         87.6%         87.6%
Average       87.6%    88.7%         89.2%         89.0%
Hidden Markov models
• Markov model: finite state model, obeys
Markov property
Pr[X_n = x | X_{n-1} = x_{n-1}, X_{n-2} = x_{n-2}, …, X_1 = x_1] = Pr[X_n = x | X_{n-1} = x_{n-1}]
• Current state depends only on previous
state
• Hidden Markov model: states are hidden,
infer through observations
[figure: example Markov model with transition probabilities on the edges]
Different models
[figure: Markov models with different topologies and transition probabilities]
Two ways to solve recognition
1. Given observation sequence O and a choice of models λ, find the model that maximizes Pr(O|λ)
Speech recognition: which word produced the observation?
2. Given an observation sequence and a model, find the most likely state sequence.
Has been used for continuous sign recognition [Starner95].
Implementation notes
• Use HTK, a publicly available library written in C
• Model signing/not signing as “words”
– Other possibility is to trace the state sequence
– Each is a 3-state model, no backward transitions
• Must include some temporal info, else degenerate (a biased coin flip)
• Use 3-, 4-, and 5-frame windows
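To make the “words” formulation concrete, here is a toy Python sketch: two 3-state left-to-right discrete HMMs (no backward transitions), one per class, score a short frame window with the forward algorithm for Pr(O|λ), and the higher-likelihood model wins. Every observation symbol and probability below is invented for illustration; the real models are trained with HTK on the actual features.

# Sketch only: classify a frame window by comparing Pr(O | lambda) under two
# 3-state left-to-right HMMs ("signing" vs. "listening"). All numbers are
# invented; the real models are HTK models over the extracted features.
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """Scaled forward algorithm: log Pr(obs | lambda) for a discrete HMM."""
    alpha = pi * B[:, obs[0]]
    log_p = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate, then weight by emission
        log_p += np.log(alpha.sum())
        alpha = alpha / alpha.sum()      # rescale to avoid underflow
    return log_p

# Left-to-right topology: no backward transitions, start in the first state.
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])

# Hypothetical emissions over 3 symbols (say low / medium / high motion).
B_sign   = np.array([[0.2, 0.3, 0.5],
                     [0.1, 0.3, 0.6],
                     [0.2, 0.4, 0.4]])
B_listen = np.array([[0.7, 0.2, 0.1],
                     [0.6, 0.3, 0.1],
                     [0.8, 0.1, 0.1]])

def classify_window(obs):
    ll_sign = forward_log_likelihood(pi, A, B_sign, obs)
    ll_listen = forward_log_likelihood(pi, A, B_listen, obs)
    return "signing" if ll_sign > ll_listen else "listening"

print(classify_window([2, 2, 1, 2, 2]))  # 5-frame window of high motion -> "signing"
print(classify_window([0, 0, 1, 0, 0]))  # mostly low motion -> "listening"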
HMM Classification Accuracy

Test video    HMM 3-frame   HMM 4-frame   HMM 5-frame   Best SVM
gina1         87.3%         88.4%         88.4%         88.8%
gina2         85.4%         86.0%         86.8%         90.3%
gina3         87.3%         88.6%         89.2%         91.3%
gina4         82.6%         82.5%         81.4%         87.6%
Average       85.7%         86.4%         86.5%         89.2%
Outline
1. Motivation
2. Completed Activity Analysis Research
3. Proposed Activity Analysis Research
a. Recognize finger spelling
b. Recognize movement epenthesis
4. Timeline to complete dissertation
Activity Analysis, thus far
Video frames → Feature Extraction → Classification → Signing, Listening
Activity Analysis, proposed
Video frames → Feature Extraction → Classification → Signing, Listening, Finger spelling, Movement epenthesis
Proposed Research
• Recognize new activity
– Finger spelling
– Movement epenthesis (= sign segmentation)
• Questions
– Why is this valuable?
– Is it feasible?
– How will it be solved?
Why? Finger spelling
• Believe that increased frame rate will increase intelligibility
• Will confirm optimal frame rate through user studies
Why? Movement epenthesis
• Choose which frames to keep so that low frame rate video is more intelligible
• Potentially first step in
continuous sign
language recognition
engine
• Irritation must not
outweigh savings; verify
through user studies
Is it feasible?
• Previous (somewhat successful) work:
– Direct measure device
– Rules-based
• Change in motion trajectory, low motion
[Sagawa00]
• Finger flexion [Liang98]
• Previous very successful work (98.8%)
– Neural Network + direct measure device
– Frame classified as beginning of sign (left boundary), end of sign (right boundary), or interior [Fang01]
How?
• Improved feature extraction
– Use the parts of sign to inform extraction
– See what works from the sign recognition
literature
• Improved classification
Parts of sign
• Handshape
– Most work in sign language recognition focused here
– Includes expensive techniques (time, power)
• Movement
– We only use this right now!
– Often implicitly recognized in machine learning
• Location
• Palm orientation
• Nonmanual signals (facial expression)
Add center of gravity to features
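One simple way to compute such a feature, sketched below, is the centroid of a binary mask for each tracked region; the mask itself (skin color, motion, or the Viola-Jones detector mentioned later) is assumed to be available and is not part of the current feature set.

# Sketch only: center of gravity (COG) of a binary region mask as two extra
# features per region. How the mask is obtained (skin color, motion,
# a face detector, ...) is an assumption here.
import numpy as np

def center_of_gravity(mask):
    """mask: 2-D boolean array; returns the (x, y) centroid in pixels."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return (0.0, 0.0)                  # region not found in this frame
    return (xs.mean(), ys.mean())

# Usage idea: append the COG to the existing per-frame feature vector, e.g.
# features = np.concatenate([features, center_of_gravity(hand_mask)])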
Parts of sign recognized by center of gravity
• Handshape
• Movement
• Location
• Palm orientation
• Nonmanual signals (facial expression)
Accurate COG
• Bayesian filters
– Very similar to hidden Markov models
– What state are we in, given the (noisy)
observations?
– Find posterior pdf of state
– Kalman filter, particle filter
• Viola and Jones [01] object detection
Bayesian filters
Predict
• Kalman: add in noise, guess state
• Particle: add in noise, guess particle location
Update
• Kalman: assume linear system, minimize MSE; measure
• Particle: sum of weighted samples; measure, update weights
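As one concrete instance of this predict/update loop, the sketch below runs a constant-velocity Kalman filter over a noisy (x, y) COG track; the state layout and noise covariances are assumptions chosen for illustration, not tuned values.

# Sketch only: constant-velocity Kalman filter smoothing a noisy COG track.
# State s = [x, y, vx, vy]; F, H, Q, R below are illustrative choices.
import numpy as np

dt = 1.0                                    # one time step per frame
F = np.array([[1, 0, dt, 0],                # motion model: constant velocity
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],                 # we only measure position
              [0, 1, 0, 0]], dtype=float)
Q = 0.01 * np.eye(4)                        # process noise covariance
R = 4.0 * np.eye(2)                         # measurement noise covariance

def kalman_smooth(measurements):
    """measurements: list of noisy (x, y) COG observations, one per frame."""
    s = np.array([measurements[0][0], measurements[0][1], 0.0, 0.0])
    P = np.eye(4)
    track = []
    for z in measurements:
        # Predict: push the state and its uncertainty through the motion model.
        s = F @ s
        P = F @ P @ F.T + Q
        # Update: blend the prediction with the new noisy measurement.
        innovation = np.asarray(z, dtype=float) - H @ s
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)      # Kalman gain (minimizes MSE)
        s = s + K @ innovation
        P = (np.eye(4) - K @ H) @ P
        track.append(s[:2].copy())
    return np.array(track)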
How?
• Improved feature extraction
• Improved machine learning
– 3 class SVM for finger spelling
– State sequence HMM
– AdaBoost [Freund97]
AdaBoost (adaptive boosting)
AdaBoost Algorithm
• In each round t = 1 to T:
– Train a “weak learner” on the weighted data
– h_t : features → {signing, listening}; error is the sum of the weights of misclassified examples
– α_t = ½ ln((1 − error)/error)
– Reweight based on error, normalize weights
• Answer is sign(∑_t α_t h_t)
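A compact sketch of these rounds, using depth-1 decision trees (“stumps”) as the weak learners and mapping the two classes to +1/-1 so that the sign(∑ α_t h_t) rule applies directly; the stump learner and round count are assumptions, since the slide leaves the weak learner unspecified.

# Sketch only: AdaBoost with decision stumps as weak learners.
# Labels are +1 (signing) / -1 (listening).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=50):
    y = np.asarray(y, dtype=float)
    w = np.full(len(y), 1.0 / len(y))        # start with uniform weights
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)     # train weak learner on weighted data
        pred = stump.predict(X)
        err = max(w[pred != y].sum(), 1e-10) # weighted error of this round
        if err >= 0.5:                       # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)
        learners.append(stump)
        alphas.append(alpha)
        w *= np.exp(-alpha * y * pred)       # up-weight misclassified examples
        w /= w.sum()                         # normalize the weights
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    votes = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(votes)                    # sign of the weighted vote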
Outline
1. Motivation
2. Completed Research
3. Proposed Research
4. Timeline to complete dissertation
Timeline
• October 2007 - March 2008: Recognize signing/listening/finger spelling
• Deadline: Automatic Face and Gesture
Recognition, March 28, 2008
1. Bayesian filters for better features.
2. Viola and Jones’s object detection.
3. Improve hidden Markov model.
4. Evaluate three class support vector machine.
5. Implement AdaBoost, cascade.
6. Experiment with combining these
techniques.
Timeline, cont.
• April 2008 - May 2008: Run user study to evaluate optimal frame rate for finger spelling.
• Deadline: ASSETS 2008, May 25, 2008
• June 2008 - December 2008: Apply techniques to the problem of sign segmentation.
1. Evaluate feature set and improve.
2. Conduct a user study to evaluate intelligibility of dropping frames during movement epenthesis.
3. Improve machine learning techniques; implement combination via decision trees.
• Early 2009: Complete dissertation and defend.
Questions?