Speech Recognition LIACS Media Lab Leiden University

Seminar
Speech Recognition Projects
E.M. Bakker
LIACS Media Lab
Leiden University
Project Outline
• Implementation
• Project Modules
– Speech Database
– Speech Signal Analysis
– Hidden Markov Models + Training
– Language Models + Training
– Recognition Algorithms
• Evaluation
Implementation
• A Safe C++ Programming Style
– Not to be used in C++
– Syntax and Programming Style
– Conventions
– Basic Design Rules
• Program Services
• Memory Services
• Diagnostics
• Important Topics
– Portability
– Testing
– Reliability
Implementation: A Safe C++ Programming Style
• Features to be avoided, or not to be used in C++
– C inherited features
if(c=0), ?:, the comma operator, goto, break, continue, union, struct, bit-wise operators, (&& || !), int, short, double, unsigned, ++, --, explicit constant numbers, cast, variable argument lists
– Preprocessor features
macros for constants, macros for functions, #pragma, compiler/platform specific directives
– Object Oriented
global data, global non-member functions, public data, friend, overloading operators @, ++, ...
– Memory and pointer-related
pointers, new, delete, malloc, free(), pointers to functions, ->, ->*, .*, const char*, NULL, type &ref - t, type count[], type *count, type *count[], type (*count)[], type (&count)
printf, scanf, assembly language, objects passed by value and temporary objects
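As one possible reading of the list above, the sketch below shows a few substitutes that standard C++ offers: a typed constant in place of a macro, std::vector in place of new/delete and raw arrays, private data with a small set of methods, and a named bool in place of an assignment inside a condition. The class name CoefficientBuffer and the constant NUM_COEFFICIENTS are made up for illustration; the slides do not prescribe these particular replacements, and the sketch does not try to honour every restriction in the list (it still uses double, for instance).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Typed constant instead of "#define NUM_COEFFICIENTS 12" (the value is hypothetical).
const std::size_t NUM_COEFFICIENTS = 12;

class CoefficientBuffer
{
public:
    // std::vector owns the storage, so no new, delete, malloc or free is needed.
    CoefficientBuffer() : coefficients(NUM_COEFFICIENTS, 0.0) {}

    // The index is checked; a raw "type *count" could silently be indexed out of bounds.
    void Set_Coefficient(const std::size_t index, const double value)
    {
        assert(index < coefficients.size());
        coefficients[index] = value;
    }

    // The result is kept in a named bool, which avoids the "if (c = 0)" trap.
    bool Is_All_Zero() const
    {
        bool all_zero = true;
        for (std::size_t index = 0; index < coefficients.size(); index = index + 1)
        {
            if (coefficients[index] != 0.0)
            {
                all_zero = false;
            }
        }
        return all_zero;
    }

private:
    std::vector<double> coefficients;  // private data member, no public data
};
```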
Implementation: Syntax and Programming Style
• Programs in plain English
• Meaningful names
• One statement per line
• const: for data and methods whenever possible
• variables: local whenever possible
• private/protected data members whenever possible
• do not use confusing syntax like
– if (a)
– for (I=0;I++<4;)
• always use default in switch-statement
• use assert at all the critical points
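A minimal sketch of how these rules might look together: const parameters and locals, one statement per line, a switch that always carries a default, and an assert guarding a critical precondition. The names Phase, PhaseReport, Phase_Name and Mean_Energy are hypothetical and not part of the course code.

```cpp
#include <cassert>
#include <string>

enum class Phase { TRAINING, RECOGNITION, EVALUATION };

class PhaseReport
{
public:
    // Returns a readable label for a processing phase.
    static std::string Phase_Name(const Phase phase)
    {
        std::string name;
        switch (phase)
        {
            case Phase::TRAINING:
                name = "training";
                break;
            case Phase::RECOGNITION:
                name = "recognition";
                break;
            case Phase::EVALUATION:
                name = "evaluation";
                break;
            default:
                // Present even though every case seems covered.
                name = "unknown";
                break;
        }
        return name;
    }

    // Averages the energy over all frames; the assert guards the division.
    static double Mean_Energy(const double total_energy, const long frame_count)
    {
        assert(frame_count > 0);
        const double mean_energy = total_energy / frame_count;
        return mean_energy;
    }
};
```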
Implementation: Conventions
• Functions and methods: My_Example_Function()
• Variables: my_example_var
• Classes: MyExampleClass
• Constants: MY_EXAMPLE_CONSTANT
• In general: meaningful names, except for indices
• Comment:
– file-description
– version history (bugs, new functionality)
– user information (user guide)
– implementation information (reference guide)
– code comment
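The conventions above could be applied as in the following sketch. The file name, the version entry and the FrameCounter class are invented for illustration; only the naming patterns and the five kinds of comment come from the slide.

```cpp
// File description: frame_counter.h (hypothetical file), counts analysis frames.
// Version history: 1.0 first version, no known bugs.
// User information: call Add_Frame() once per frame, then Get_Frame_Count().
// Implementation information: the count is a plain long, limited by MAX_FRAME_COUNT.
#include <cassert>

// Constant: MY_EXAMPLE_CONSTANT style.
const long MAX_FRAME_COUNT = 100000;

// Class: MyExampleClass style.
class FrameCounter
{
public:
    FrameCounter() : frame_count(0) {}

    // Method: My_Example_Function() style.
    void Add_Frame()
    {
        // Code comment: refuse to count past the configured limit.
        assert(frame_count < MAX_FRAME_COUNT);
        frame_count = frame_count + 1;
    }

    long Get_Frame_Count() const
    {
        return frame_count;
    }

private:
    // Variable: my_example_var style.
    long frame_count;
};
```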
Implementation: Basic Design Rules
– Project modularity achieved through classes.
– Structure the program by Classes only (only methods are
allowed, no separate functions)
– Project is decomposed into modules with as little cross-dependence as possible
– One module per class
– Classes should have minimal interfaces
– Modules should have minimal dependencies
– Implementation issues hidden from clients (information
hiding)
– Inheritance should be extensively used
• Advantages:
– Improved readability
– Reduced maintenance work
– Improved robustness
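To make the design rules concrete, here is a small sketch of a module with a one-method interface, a hidden helper, and a derived class. ProcessingBlock and MeanSubtraction are hypothetical names (the second borrows the name of a processing block mentioned later in the slides); this is not the actual RES class hierarchy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal interface: clients see a single method and no implementation details.
class ProcessingBlock
{
public:
    virtual ~ProcessingBlock() {}
    virtual std::vector<double> Apply(const std::vector<double>& samples) const = 0;
};

// One module per class; the behaviour is supplied through inheritance.
class MeanSubtraction : public ProcessingBlock
{
public:
    std::vector<double> Apply(const std::vector<double>& samples) const override
    {
        assert(!samples.empty());
        const double mean = Compute_Mean(samples);
        std::vector<double> result(samples);
        for (std::size_t index = 0; index < result.size(); index = index + 1)
        {
            result[index] = result[index] - mean;
        }
        return result;
    }

private:
    // Hidden from clients (information hiding).
    static double Compute_Mean(const std::vector<double>& samples)
    {
        double sum = 0.0;
        for (std::size_t index = 0; index < samples.size(); index = index + 1)
        {
            sum = sum + samples[index];
        }
        return sum / samples.size();
    }
};
```

A client that holds only a `const ProcessingBlock&` can call Apply() without knowing which block it is talking to, which keeps the dependencies between modules small.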
Implementation: Program Services
• Safe memory management
– memory service
– dynamic memory management: C++ without pointers
• Diagnostics
– decide which data must be checked when, and define the actions
• File management, user interfaces
• User program configuration management
• Text data management
• Mathematical data management
Implementation: Memory Services & Diagnostics
• Memory Services
• Diagnostics
Some Important Topics
• Portability
– portability and defined options in files: compatib.h, defopt.h, Boolean.h
• Testing
– test routines and version history
• Reliability
– readability
– maintainability
RES General Specification
• RES (Recognition Experimental System) is an HMM-based
experimental tool for continuous multispeaker speech
recognition. The system works on recorded speech files and
includes:
– the batch modules for acoustic model initialization and
training
– grammar models training
– phoneme/word recognition
– performance evaluation.
• RES is state of the art in speaker-independent phonetic recognition:
– 69.2% correct on the full TIMIT test set, using
context-independent phonetic models.
– 87.83% correct in speaker-independent word recognition on
ATIS, using context-independent phonetic models not optimally
tuned on this database.
RES General Specification
• How to build an ASR system for a different language?
– We need many segmented speech recordings to feed the
training programs and obtain good HMM models of our voices.
– Use a freeware program like Snack 1.4 (search on the
Internet) to prepare the data.
– Search for a Dutch multispeaker phonetic database.
– Design and feed the right language model.
• Speech samples to train and test the RES system?
– You can download speech samples from the Linguistic Data
Consortium (LDC) after you have obtained a user account.
General Specification
• Required C++ custom libraries:
– none
• Portability:
– Linux
– Windows 3.x, Windows 95, NT
– DOS with DjGpp
• Compilers:
– Ms Visual C++ >4.0
– DjGpp version 2.8.1 or
– GNU Linux Gpp version 2.8.1 or newer
Speech Database
• Speech data retrieval
• Speech files:
– NIST1A (ATIS x, TIMIT)
– MS WAV
– custom, adding software drivers
• Label File:
– ATIS
– TIMIT
– various subsets, custom label alphabets included in a file,
custom label handling by supplying a driver.
• Other options:
– overlap
– window length
– file buffering
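The overlap and window length options suggest that the speech samples are delivered as overlapping frames. A minimal sketch of that idea follows, assuming both options are expressed in samples and that a trailing partial frame is dropped; FrameReader and Split are hypothetical names, and the actual RES option semantics may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class FrameReader
{
public:
    // Both values are assumed to be counted in samples.
    FrameReader(const std::size_t length, const std::size_t shared)
        : window_length(length), overlap(shared)
    {
        assert(window_length > 0);
        assert(overlap < window_length);
    }

    // Cuts the signal into frames; consecutive frames share "overlap" samples.
    std::vector<std::vector<double>> Split(const std::vector<double>& samples) const
    {
        const std::size_t step = window_length - overlap;
        std::vector<std::vector<double>> frames;
        for (std::size_t start = 0; start + window_length <= samples.size(); start = start + step)
        {
            std::vector<double> frame;
            for (std::size_t offset = 0; offset < window_length; offset = offset + 1)
            {
                frame.push_back(samples[start + offset]);
            }
            frames.push_back(frame);
        }
        return frames;
    }

private:
    std::size_t window_length;
    std::size_t overlap;
};
```

With ten samples, a window length of 4 and an overlap of 2, the frames start at samples 0, 2, 4 and 6.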
Speech Signal Analysis
• Feature Extraction
• Signal processing:
– Any concatenation of processing blocks is allowed. Each
block performs a class of processing and the actual
processing is specified by the options.
• Available processing blocks:
– Preemphasis_and_Hamming
– Mean_Subtraction
– FFT
– MFCC with Log/non Log Energy
– any order differences
– Other blocks can be added by supplying proper drivers.
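As an illustration of what a block such as Preemphasis_and_Hamming typically computes, the sketch below applies the first-order pre-emphasis filter y[n] = x[n] - a·x[n-1] and then multiplies by the Hamming window w[n] = 0.54 - 0.46·cos(2πn/(N-1)). The coefficient a = 0.97 is a common textbook choice and an assumption here, as is the class name PreemphasisAndHamming; the RES block may use different settings.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

const double PREEMPHASIS_COEFFICIENT = 0.97;  // assumed value, configurable in practice
const double PI = 3.14159265358979323846;

class PreemphasisAndHamming
{
public:
    // Processes one frame; the sample before the frame is taken as zero.
    static std::vector<double> Apply(const std::vector<double>& frame)
    {
        assert(frame.size() > 1);
        const std::size_t frame_length = frame.size();
        std::vector<double> result(frame_length);
        for (std::size_t n = 0; n < frame_length; n = n + 1)
        {
            double previous = 0.0;
            if (n > 0)
            {
                previous = frame[n - 1];
            }
            const double emphasized = frame[n] - PREEMPHASIS_COEFFICIENT * previous;
            const double window = 0.54 - 0.46 * std::cos(2.0 * PI * n / (frame_length - 1));
            result[n] = emphasized * window;
        }
        return result;
    }
};
```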
Hidden Markov Models
• HMM model Initialization
• HMM topology:
– 4 predefined types with configurable number of states.
• Acoustic Units:
– as allowed by the available database
• Emission densities:
– Untied Gaussian mixtures
– full or diagonal covariance matrix
– number of mixtures configurable for each acoustic unit
• Initialization method:
– maximum distortion splitting on segmented database
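For the diagonal-covariance case, the emission density of a state is a weighted sum of Gaussians whose log-density can be evaluated per dimension. The following sketch computes the mixture log-density with the usual log-sum-exp step for numerical stability. DiagonalGaussianMixture and its members are hypothetical names, and nothing here reflects how RES stores its models.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

const double PI = 3.14159265358979323846;

class DiagonalGaussianMixture
{
public:
    // One row of means and one row of variances per mixture component.
    DiagonalGaussianMixture(const std::vector<double>& component_weights,
                            const std::vector<std::vector<double>>& component_means,
                            const std::vector<std::vector<double>>& component_variances)
        : weights(component_weights), means(component_means), variances(component_variances)
    {
        assert(!weights.empty());
        assert(means.size() == weights.size());
        assert(variances.size() == weights.size());
    }

    // log( sum_k w_k * N(observation; mean_k, diag(variance_k)) )
    double Log_Density(const std::vector<double>& observation) const
    {
        std::vector<double> terms;
        double largest = 0.0;
        for (std::size_t k = 0; k < weights.size(); k = k + 1)
        {
            assert(weights[k] > 0.0);
            const double term = std::log(weights[k]) + Component_Log_Density(k, observation);
            if (k == 0 || term > largest)
            {
                largest = term;
            }
            terms.push_back(term);
        }
        double sum = 0.0;
        for (std::size_t k = 0; k < terms.size(); k = k + 1)
        {
            sum = sum + std::exp(terms[k] - largest);
        }
        return largest + std::log(sum);
    }

private:
    double Component_Log_Density(const std::size_t k, const std::vector<double>& observation) const
    {
        assert(observation.size() == means[k].size());
        assert(variances[k].size() == means[k].size());
        double log_density = 0.0;
        for (std::size_t d = 0; d < observation.size(); d = d + 1)
        {
            assert(variances[k][d] > 0.0);
            const double difference = observation[d] - means[k][d];
            log_density = log_density
                - 0.5 * (std::log(2.0 * PI * variances[k][d]) + difference * difference / variances[k][d]);
        }
        return log_density;
    }

    std::vector<double> weights;
    std::vector<std::vector<double>> means;
    std::vector<std::vector<double>> variances;
};
```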
Hidden Markov Models Training
• Training algorithm:
– Single and Simultaneous Model Re-estimation Baum-Welch.
– parameter re-estimation: selective by configuration.
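For reference, the textbook Baum-Welch re-estimation of the transition probabilities is given below. The notation is an assumption, since the slides define none: α and β are the forward and backward variables, a_ij the transition probabilities, b_j the emission density of state j, and o_t the observation at time t. Whether RES uses exactly this form is not stated in the slides.

```latex
\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{k} \alpha_t(k)\,\beta_t(k)}
\qquad
\xi_t(i,j) = \frac{\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{\sum_{k} \alpha_t(k)\,\beta_t(k)}
\qquad
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}
```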
Language Models
• Language Model:
– unigram and bigram on words and phonemes
– Smoothing techniques: Good-Turing, non-linear and linear
interpolation model
– Word Clustering: minimum mean square error on transition
probability
– Perplexity: word and phoneme based computation
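The sketch below illustrates one of the listed combinations: a bigram model smoothed by linear interpolation with the unigram estimate, P(w|v) = λ·c(v,w)/c(v) + (1-λ)·c(w)/N, together with a perplexity computation over a word sequence. BigramModel and its method names are hypothetical, the interpolation weight is fixed by the caller rather than trained, and Good-Turing and non-linear interpolation are not shown.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

class BigramModel
{
public:
    explicit BigramModel(const double weight)
        : bigram_weight(weight), total_count(0.0)
    {
        assert(weight >= 0.0);
        assert(weight <= 1.0);
    }

    // Collects unigram, history and bigram counts from one word sequence.
    void Train(const std::vector<std::string>& words)
    {
        for (std::size_t i = 0; i < words.size(); i = i + 1)
        {
            unigram_counts[words[i]] += 1.0;
            total_count += 1.0;
            if (i > 0)
            {
                history_counts[words[i - 1]] += 1.0;
                bigram_counts[std::make_pair(words[i - 1], words[i])] += 1.0;
            }
        }
    }

    // Linear interpolation of the bigram and unigram relative frequencies.
    double Probability(const std::string& previous, const std::string& word) const
    {
        assert(total_count > 0.0);
        const double unigram = Lookup(unigram_counts, word) / total_count;
        const double history = Lookup(history_counts, previous);
        double bigram = 0.0;
        if (history > 0.0)
        {
            bigram = Lookup(bigram_counts, std::make_pair(previous, word)) / history;
        }
        return bigram_weight * bigram + (1.0 - bigram_weight) * unigram;
    }

    // exp of the negative mean log-probability over the bigrams of the sequence.
    double Perplexity(const std::vector<std::string>& words) const
    {
        assert(words.size() > 1);
        double log_sum = 0.0;
        for (std::size_t i = 1; i < words.size(); i = i + 1)
        {
            const double probability = Probability(words[i - 1], words[i]);
            assert(probability > 0.0);
            log_sum = log_sum + std::log(probability);
        }
        return std::exp(-log_sum / (words.size() - 1));
    }

private:
    // Returns the stored count, or zero for an unseen key.
    template <typename Key>
    static double Lookup(const std::map<Key, double>& counts, const Key& key)
    {
        const typename std::map<Key, double>::const_iterator entry = counts.find(key);
        if (entry == counts.end())
        {
            return 0.0;
        }
        return entry->second;
    }

    double bigram_weight;
    double total_count;
    std::map<std::string, double> unigram_counts;
    std::map<std::string, double> history_counts;
    std::map<std::pair<std::string, std::string>, double> bigram_counts;
};
```

The same class works for phoneme sequences if each phoneme label is passed as a string.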
Recognition
• Recognition
– Recognition Unit: acoustic units, words
– Algorithm Type: Viterbi with Beam search and Window
search pruning strategies
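A compact sketch of Viterbi decoding with beam pruning: at every frame, states whose score falls more than beam_width below the best score are not expanded. Scores are log-probabilities, log_emission[t][s] is assumed to hold the emission score of state s at frame t, and ViterbiDecoder, Decode and LogMatrix are hypothetical names. The window search strategy mentioned on the slide is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

typedef std::vector<std::vector<double>> LogMatrix;

class ViterbiDecoder
{
public:
    // Returns the best state sequence, one state index per frame.
    static std::vector<std::size_t> Decode(const std::vector<double>& log_initial,
                                           const LogMatrix& log_transition,
                                           const LogMatrix& log_emission,
                                           const double beam_width)
    {
        const double LOG_ZERO = -std::numeric_limits<double>::infinity();
        const std::size_t num_states = log_initial.size();
        const std::size_t num_frames = log_emission.size();
        assert(num_states > 0);
        assert(num_frames > 0);
        assert(beam_width >= 0.0);

        std::vector<double> score(num_states);
        for (std::size_t s = 0; s < num_states; s = s + 1)
        {
            score[s] = log_initial[s] + log_emission[0][s];
        }
        std::vector<std::vector<std::size_t>> back(num_frames, std::vector<std::size_t>(num_states, 0));

        for (std::size_t t = 1; t < num_frames; t = t + 1)
        {
            const double threshold = score[Best_State(score)] - beam_width;
            std::vector<double> next(num_states, LOG_ZERO);
            for (std::size_t from = 0; from < num_states; from = from + 1)
            {
                if (score[from] >= threshold)  // states outside the beam are not expanded
                {
                    for (std::size_t to = 0; to < num_states; to = to + 1)
                    {
                        const double candidate =
                            score[from] + log_transition[from][to] + log_emission[t][to];
                        if (candidate > next[to])
                        {
                            next[to] = candidate;
                            back[t][to] = from;
                        }
                    }
                }
            }
            score = next;
        }

        // Trace the best final state back to the first frame.
        std::vector<std::size_t> path(num_frames);
        path[num_frames - 1] = Best_State(score);
        for (std::size_t t = num_frames - 1; t > 0; t = t - 1)
        {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

private:
    static std::size_t Best_State(const std::vector<double>& score)
    {
        std::size_t best = 0;
        for (std::size_t s = 1; s < score.size(); s = s + 1)
        {
            if (score[s] > score[best])
            {
                best = s;
            }
        }
        return best;
    }
};
```

A narrow beam saves computation but can prune the path that would have won in the end, so the width is usually tuned on held-out data.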
Evaluation
• Evaluation: Wagner-Fischer algorithm
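The Wagner-Fischer algorithm fills a dynamic-programming table with the minimum number of substitutions, deletions and insertions needed to turn the reference transcription into the recognizer output. A minimal sketch on word sequences follows; Evaluator and Edit_Distance are hypothetical names, and all three error types are given unit cost, which is an assumption about the scoring.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class Evaluator
{
public:
    // Minimum number of word edits between reference and hypothesis.
    static std::size_t Edit_Distance(const std::vector<std::string>& reference,
                                     const std::vector<std::string>& hypothesis)
    {
        const std::size_t rows = reference.size() + 1;
        const std::size_t columns = hypothesis.size() + 1;
        std::vector<std::vector<std::size_t>> cost(rows, std::vector<std::size_t>(columns, 0));

        // Turning a prefix into the empty sequence costs one deletion per word.
        for (std::size_t r = 0; r < rows; r = r + 1)
        {
            cost[r][0] = r;
        }
        // Building a prefix from the empty sequence costs one insertion per word.
        for (std::size_t c = 0; c < columns; c = c + 1)
        {
            cost[0][c] = c;
        }

        for (std::size_t r = 1; r < rows; r = r + 1)
        {
            for (std::size_t c = 1; c < columns; c = c + 1)
            {
                std::size_t substitution = cost[r - 1][c - 1];
                if (reference[r - 1] != hypothesis[c - 1])
                {
                    substitution = substitution + 1;
                }
                const std::size_t deletion = cost[r - 1][c] + 1;
                const std::size_t insertion = cost[r][c - 1] + 1;
                cost[r][c] = std::min(substitution, std::min(deletion, insertion));
            }
        }
        return cost[rows - 1][columns - 1];
    }
};
```

The same routine applies to phoneme strings when each phoneme label is passed as one element.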
Projects
1. Dutch Speech Corpus + Database Interface (2 groups)
– in an early phase some example classes should be available, like
counting, etc.
– maybe use tools like ‘praat’ (for wav labeling with phonetics), etc.
2. Signal Analysis and Feature Extraction (2 groups)
3. HMM Initialization + HMM Dutch Phonetic Training (2 groups)
4. Dutch Language Model + Word Class Training (2 groups)
– in an early phase some examples should be available
5. Recognition (2 groups)
Evaluation (All)
Project Designs
The design of the project should contain the following:
• The implementation goals
– The underlying technique and theory
– A functional description of the starting-code and tools
– The design of new code and functionality
– Implementation goals and a time schedule
– NB: if it is considered difficult to reach all the goals within the
current time-frame, team up with the other team
• Interfacing
– Define the module-interfaces
– Define the time-path for the essential module-inputs
– Define a realistic time-path for the (partial-)outputs of the
module.