AN IMPROVED AUDIO CAPTCHA
Jenn Tam
jdtam@cs.cmu.edu
Computer Science Dept.
Carnegie Mellon University
SOAPS 2008, Pittsburgh, PA
WHAT ARE CAPTCHAS?
• CAPTCHAs are tests generated by computers and generally passable by humans but not current computer programs.
THE PROBLEM WITH CURRENT AUDIO CAPTCHAS
• In some cases the human passing rate is only 70%!
• To make the CAPTCHAs secure, noise was injected into the audio files, making it harder for both computers and humans to pass.
ARE CURRENT AUDIO CAPTCHAS SECURE?
• A CAPTCHA is considered broken once a program can pass it 5% of the time.
• Since the current audio CAPTCHAs use a limited vocabulary, it was possible for us to collect enough data to train a system that could pass the current audio CAPTCHAs more than 45% of the time.
HOW DID WE TEST THE CURRENT AUDIO CAPTCHAs?
• Selected three different types of audio CAPTCHAs: Google, reCAPTCHA, and Digg
• Collected 1000 CAPTCHAs per type of audio CAPTCHA to use for training and testing
• Created an automatic speech recognition (ASR) system using machine learning techniques
THE ALGORITHM
• Given the .wav file of an audio CAPTCHA
• Segmentation: select the portions of the audio that are most likely digits/letters
• Recognition
  ◦ Extract features from the segment
  ◦ Classify the segment as digit/letter or noise and output the label
• Stop once a maximum number of segments have been classified
ALGORITHM DETAILS - SEGMENTATION
• CAPTCHAs were manually labeled and segmented. We created training segments using this information.
• For testing, we chose the highest energy peaks in the test CAPTCHA and selected fixed size segments roughly centered at the peaks.
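The test-time segmentation described above could be sketched roughly as follows. This is a minimal illustration, not the system from the talk: the function names, the frame length, the segment length, and the rule for skipping overlapping peaks are all assumptions, and the audio is taken to be a plain list of sample amplitudes.

```python
def frame_energies(samples, frame_len):
    """Sum of squared amplitudes for each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def pick_segments(samples, frame_len=4, seg_len=8, max_segments=3):
    """Return (start, end) sample ranges around the highest-energy frames.

    Assumes len(samples) >= seg_len. All parameter values are hypothetical.
    """
    energies = frame_energies(samples, frame_len)
    # Visit frames from loudest to quietest.
    order = sorted(range(len(energies)), key=lambda i: energies[i], reverse=True)
    segments = []
    for idx in order:
        center = idx * frame_len + frame_len // 2
        # Clamp the fixed-size window so it stays inside the recording.
        start = max(0, min(center - seg_len // 2, len(samples) - seg_len))
        end = start + seg_len
        # Assumption: skip peaks that overlap a segment already selected.
        if any(start < e and s < end for s, e in segments):
            continue
        segments.append((start, end))
        if len(segments) == max_segments:
            break
    return sorted(segments)
```

Called on a recording with two loud bursts separated by silence, `pick_segments` returns one window around each burst, ordered by position in the audio.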
ALGORITHM DETAILS - FEATURES
• We used three popular techniques for extracting features from speech to derive 5 sets of features from the audio.
  ◦ Mel-frequency cepstral coefficients (MFCC)
  ◦ Perceptual linear prediction (PLP)
  ◦ Relative spectral transform with PLP (RASTA-PLP)
ALGORITHM DETAILS - AdaBoost
• Used decision stumps for weak classifiers
• For each type of audio CAPTCHA we created enough classifiers to label a segment as a digit, letter, or noise.
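To make the idea of boosting decision stumps concrete, here is a small textbook-style sketch of binary AdaBoost. It is an illustration under stated assumptions, not the classifier from the talk: labels are assumed to be -1 or +1 (for example "is this digit" versus "is not"), the function names are hypothetical, and the slides do not say how the individual classifiers were combined into a digit/letter/noise decision.

```python
import math


def best_stump(X, y, w):
    """Find the (error, feature, threshold, polarity) stump with lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                preds = [pol if row[f] <= thr else -pol for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best


def train_adaboost(X, y, rounds=5):
    """Train AdaBoost on labels in {-1, +1}; returns a list of weighted stumps."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, f, thr, pol = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, f, thr, pol))
        # Raise the weight of misclassified examples, then renormalize.
        w = [
            wi * math.exp(-alpha * yi * (pol if row[f] <= thr else -pol))
            for wi, row, yi in zip(w, X, y)
        ]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * (pol if row[f] <= thr else -pol) for a, f, thr, pol in model)
    return 1 if score >= 0 else -1
```

One plausible way to obtain a digit/letter/noise label from binary classifiers like this is to train one per class and pick the class with the highest score, but that combination rule is an assumption here.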
ALGORITHM DETAILS - SVM
• Created a single multiclass classifier using all the training segments (from 900 CAPTCHAs)
ALGORITHM DETAILS - k-NN
• Created 5 classifiers, one for each of the feature sets
• Used Euclidean distance as our distance metric
• Cross-validation selected k = 1
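With k = 1 and Euclidean distance, each classifier simply returns the label of the closest training segment. The sketch below shows that rule, plus one possible way to merge the per-feature-set outputs. The majority vote is purely an assumption for illustration (the slides do not say how the five classifiers were combined), and the function names are hypothetical.

```python
import math
from collections import Counter


def nearest_label(train, query):
    """Return the label of the training vector closest to `query` (k = 1).

    `train` is a list of (feature_vector, label) pairs.
    """
    best_label, best_dist = None, float("inf")
    for features, label in train:
        d = math.dist(features, query)  # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label


def classify_segment(train_sets, query_sets):
    """Run one 1-NN classifier per feature set and take a majority vote.

    Assumption: voting is only one way the outputs might be combined.
    """
    votes = [nearest_label(t, q) for t, q in zip(train_sets, query_sets)]
    return Counter(votes).most_common(1)[0][0]
```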
THE ALGORITHM
• Input: Audio CAPTCHA as an audio file
• Segmentation
  ◦ Find the highest energy peak, and extract a fixed size segment centered at that peak
• Recognition
  ◦ Extract features from segment
  ◦ Give segment to classifier and obtain label
• Stop extracting segments once all segments have been labeled or a max solution size is reached.
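The loop above can be sketched end to end as follows. This is a toy under stated assumptions: the audio is a plain list of amplitudes, the single loudest sample stands in for the energy peak, feature extraction is folded into a caller-supplied `classify` function, and zeroing out used regions is one hypothetical way to make the next iteration find a new peak. None of these details come from the talk.

```python
def solve_captcha(samples, classify, seg_len=4, max_len=3):
    """Label fixed-size segments around energy peaks, loudest first."""
    if not samples:
        return ""
    remaining = list(samples)  # working copy; used regions get zeroed out
    found = []                 # (start index, label) pairs
    while len(found) < max_len:
        # The highest-energy sample stands in for the "energy peak".
        peak = max(range(len(remaining)), key=lambda i: remaining[i] ** 2)
        if remaining[peak] == 0:
            break              # nothing left to label
        start = max(0, min(peak - seg_len // 2, len(samples) - seg_len))
        end = min(start + seg_len, len(samples))
        label = classify(samples[start:end])
        if label != "noise":
            found.append((start, label))
        # Silence this region so the next iteration finds a new peak.
        for i in range(start, end):
            remaining[i] = 0
    # Report labels in the order they occur in the audio.
    return "".join(label for _, label in sorted(found))


def toy_classify(segment):
    """Hypothetical stand-in classifier: maps peak amplitude to a label."""
    peak = max(abs(s) for s in segment)
    return {3: "3", 9: "9"}.get(peak, "noise")
```

In a real system `classify` would extract MFCC, PLP, or RASTA-PLP features and hand them to a trained classifier; `toy_classify` exists only so the loop can be exercised.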
ANALYSIS OF CURRENT AUDIO CAPTCHAs
• Using three machine learning techniques to perform ASR on the CAPTCHAs
  ◦ AdaBoost
  ◦ Support Vector Machines (SVM)
  ◦ k-Nearest Neighbor (k-NN)
[Figure: Exact Match Rate (%) of AdaBoost, SVM, and k-NN on Google, reCAPTCHA, and Digg audio CAPTCHAs]
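As a small worked example of the metric named on this slide, the sketch below computes an exact match rate and compares it with the 5% threshold quoted earlier in the talk. It assumes "exact match" means the predicted solution string equals the true solution string in full; the function names and sample data are hypothetical.

```python
BROKEN_THRESHOLD = 5.0  # percent; the talk's criterion for a broken CAPTCHA


def exact_match_rate(predicted, actual):
    """Percentage of CAPTCHAs whose predicted solution equals the true one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    hits = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * hits / len(actual)


def is_broken(rate_percent, threshold=BROKEN_THRESHOLD):
    """True when the pass rate reaches the 'broken' threshold."""
    return rate_percent >= threshold
```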
THE GOAL
• Make a secure audio CAPTCHA which will be easier for a human to pass and harder for a computer to pass.
• Equate solving a CAPTCHA with doing some useful work.
• In other words, create an audio reCAPTCHA.
WHAT IS reCAPTCHA?
• reCAPTCHA helps digitize text on which OCR fails by using the text as its CAPTCHA.
• Since millions of people solve CAPTCHAs each day, millions of words get digitized each day!
THE AUDIO reCAPTCHA
• Takes advantage of the human ability to understand words through context.
• Will help transcribe digital audio on which ASR systems fail.
• The audio being used was originally recorded with the intention that it should be easily understood by humans.
HOW WILL IT WORK?
• Start with a database of phrases with known transcriptions.
• Give the user adjacent phrases to transcribe as the CAPTCHA.
• Check the user's solution against the database to determine the result of the test. Store the rest of the solution as a transcription.
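The checking step above might look something like the following sketch. It is an illustration only: it assumes the known phrase must appear word for word in the user's answer, whereas a deployed system would likely tolerate small errors, and the function names are hypothetical.

```python
import re


def words(text):
    """Lowercase a transcription and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def check_solution(user_text, known_text):
    """Return (passed, leftover_words) for a user's transcription.

    The test passes when the words of the known phrase appear as a
    contiguous run in the user's answer. Everything else is returned as a
    candidate transcription of the adjacent, unknown phrase.
    """
    user, known = words(user_text), words(known_text)
    if not known:
        raise ValueError("known_text must contain at least one word")
    n = len(known)
    for i in range(len(user) - n + 1):
        if user[i:i + n] == known:
            return True, user[:i] + user[i + n:]
    return False, []
```

The leftover words from a passing answer are only a candidate transcription; how multiple users' answers would be reconciled before being stored is not covered by this sketch.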
[Figure: overlapping audio segments labeled Segment #1, Segment #2, and Segment #3, with example transcriptions]
"That was the shot that killed Harry Lime. He died in a"
"Harry Lime he died in a sewer beneath Vienna"
"Harry Lime. He died in a"
ANALYSIS OF SECURITY
• Speaker-independent recognition is difficult.
• Open vocabularies make it even more difficult for ASR systems.
• AM broadcasts and .mp3 compression cause the loss of important data needed for automatic analysis.
CONCLUSION
• CAPTCHAs need to be more accessible, yet remain secure and not too difficult for humans.
• Deploy audio reCAPTCHA through reCAPTCHA site.
• Help make knowledge captured in audio available in text form
ACKNOWLEDGEMENTS
• Dr. Luis von Ahn, CMU
• Dr. Manuel Blum, CMU
• Dr. Roni Rosenfeld, CMU
• David Huggins-Daines, CMU
• Jiri Simsa, CMU
• Sean Hyde, CMU