

Ben Lambert

ASRU 2009 Summary

Merano, Italy

December 14th to December 17th, 2009. http://www.asru2009.org/

Invited and Keynote Talks

Generalization problem in ASR acoustic model training and adaptation

Sadaoki Furui

Department of Computer Science, Tokyo Institute of Technology

This keynote talk started off the workshop. (Dr. Furui is very distinguished; many awards, etc.) The talk presented a survey of techniques for generalization in acoustic model training and adaptation. That is, a sound or acoustic pattern can be highly variable, yet is understood by people under many conditions. The generalization problem is how to take some standard or prototypical representation of a sound and find a generalization of it that is still an accurate representation of that sound. It seems the challenge is how to generalize the representation such that a bigger range of conditions for that sound can be recognized without also recognizing other sounds (i.e. how to increase recall without decreasing precision?).

One bit of evidence he presented is that recognition performance saturates or levels off after enough data has been seen. This suggests that it’s not a lack of data that’s the problem, but the ability to generalize from the data.

He very briefly presented many techniques for generalization without going into detail on any of them. I didn’t fully understand them all, but I think it is helpful just to list them here.

Constraining the degree of freedom by using a priori knowledge

 Vocal tract normalization

 Correlation - (?) “Pair-wise correlation between the mean vectors can be used to enhance estimation of the mean parameters of some speech units even if they are not directly observed in the adaptation data…”

 MAP and Bayesian estimation - MAP for speaker adaptation of HMMs; model parameters are treated as random variables adjusted (?) using adaptation data. (MAP is better than ML, especially when the adaptation data is small; see the sketch after this list.) Bayesian estimation is closely related…

 Extended-MAP and Quasi-Bayes - ??

 Jacobian approach – An analytic approach to adapting models from an initial condition to a target condition. Here noise (or whatever) is represented by a matrix, and adaptation is performed by matrix arithmetic.

 Eigen-voice – PCA over multiple speakers

 Multiple modeling – separate models for separate conditions (age, gender, rate of speaking), best model is selected at recognition time…

 Cluster adaptive training – A cluster of models are linearly interpolated.

 Bayesian Networks – Huge amount of computation needed for training, but he thinks we might see this working within 10 years.
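
As a concrete aside on the MAP item above: a minimal sketch of MAP adaptation of a single Gaussian mean, assuming a conjugate prior centered on the speaker-independent mean. The variable names and the prior weight tau are my own placeholders, not anything from the talk.

    import numpy as np

    def map_adapt_mean(mu_prior, frames, tau=10.0):
        # mu_prior: speaker-independent mean vector for one Gaussian
        # frames: adaptation frames assigned to this Gaussian (e.g. via forced alignment)
        # tau: prior weight; larger tau trusts the speaker-independent model more
        n = len(frames)
        if n == 0:
            return mu_prior                              # unseen unit: keep the prior
        mu_ml = np.mean(frames, axis=0)                  # maximum-likelihood estimate
        return (tau * mu_prior + n * mu_ml) / (tau + n)  # shrinks toward the prior when n is small

With little adaptation data the estimate stays close to the prior, which matches the point that MAP beats ML when the adaptation data is small.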

Constraining the degree of freedom without using a priori knowledge

 structural approach - ?

 MLLR – Max Likelihood Linear Regression

 Signal bias removal

 CMLLR – Constrained MLLR

 Interpolation

 Vector field smoothing (VFS)

 Ensemble methods

Combinations and extensions

 ML-based combinations

 SMAP – Structural MAP

 N-best based methods

 Discriminative approach based combination

 Bayesian discriminative adaptation (Discriminative MAP estimation)

 I-smoothing

 Large-margin discriminative training.

A few notes I took:

 Discriminative training techniques are biased toward the training data

 Never succeeded in using Bayes Nets, maybe in 10 years, with more powerful computers.

 Furui’s paper on speech perception: abrupt transition in understanding from 100% to 10% within 10 msec (?)

His view of the future:

 Something is missing (but doesn’t know what)

 How to combine everything? How to aggregate it?

 Humans are very flexible, in many conditions—why can’t ASR?

 Can’t get to human performance by just accumulating everything we’ve got

 → need some kind of roadmap to get there

 “not something wrong, something is missing”

 “people are very adaptable and flexible, but how?”

Manipulation of Consonants in Natural Speech

Jont Allen and Feipeng Li

Beckman Institute, University of Illinois at Urbana-Champaign, USA

This was the second talk of the Workshop, and I think one of the most interesting. Jont works on speech perception, not ASR. So, he studies how people perceive speech. (I believe this includes how people with hearing loss perceive speech differently than people with “normal” hearing.)

He made some radical-sounding claims, but at least to me, they sounded pretty reasonable. Among the things he said were that ASR researchers are making the same mistakes over and over because they haven’t gone back to read early work from the 1920s (including Fletcher and colleagues at Bell Labs). He also said something along the lines of, “HMMs just aren’t the right tool for the job.” In the results he presented he hinted at some alternatives, but he hasn’t tried them (yet)—he wants to understand how people understand speech first. (He said he refuses students who want to work on ASR, because they don’t yet understand how human speech recognition works.)

A quotation from another article: “I am researching human-speech recognition, as opposed to machine-speech recognition,” said Allen. “We fundamentally don’t understand how the cochlea - the inner ear - and the brain process speech. We just don’t have the information.”

Jont was at Bell Labs until he moved to UIUC around 2004.

His talk was on “manipulation of consonants” in natural speech. What he did was show how to take a consonant sound like /ta/ and modify the acoustic signal to turn it into a /pa/.

I think this is very interesting, because by identifying the portion of the signal that you modify to get from one consonant to the other, he seems to have identified exactly what is characteristic of the /ta/ sound. He does this with numerous consonant pairs. To determine which consonants are most confusable, he presented a confusion matrix of consonant pairs. When there is very low background noise, people (without hearing loss, and under ideal listening conditions) can identify consonants correctly with 100% accuracy.

As the background noise increases, some of the consonants are mistaken for one another.

By increasing the noise, confusion sets begin to emerge, for instance /ta/, /ka/, and /pa/.

I believe one of the most interesting things here is how he visualizes or processes the acoustic signal. As far as I can tell, it’s a pretty novel technique. I believe he doesn’t use an MFCC filter bank (?) at all. He uses something that he calls an AI-gram. It derives from Harvey Fletcher’s Articulation Index (AI) model (1920s), which is “an objective appraisal criterion of speech audibility.” (See the paper cited below for more about AI.) To depict the speech signal, he uses what he calls the “AI-gram,” which is “the integration of a simple linear auditory model filter-bank and Fletcher’s AI model.”

Visualizing these AI-grams, it is possible to see the “tiny burst of energy” that changes from one phone to another. For instance, below are the sounds /ta/, /ka/, and /pa/ depicted.

The AI-gram (I believe) is the lower-left box for each sound. The characteristic burst of energy for each sound is highlighted with a little blue box (hard to see at this size, but to the left of the stripey shape in each case). He was able to determine that this is the characteristic part of the sound, because by moving or masking this part of the sound, he can transform it into another sound. After manipulating a consonant, people don’t hear the new consonant with 100% accuracy, but around 98-99% accuracy.

/ta/ is characterized by a 4kHz burst of energy ~50ms before the vowel; /k/ is characterized by a 1.4-2kHz burst of energy ~50ms before the vowel; and /p/ by a 0.7-1kHz burst of energy leading into the vowel.

[Figure: AI-grams of /ta/, /ka/, and /pa/ (images not reproduced here)]

In the article cited below, he compared these characteristics of each sound in the AI-grams with the tongue position, but didn’t attempt to explain how the tongue position is related to the frequency observed.

When the background noise is high enough to mask these bursts of energy, the sound is not easily recognized (if at all).

He believes automatic speech recognizers would be more likely to work by recognizing these bursts of energy. Other sounds are characterized by other shapes in the AI-gram, such as edges. Thus, he thinks a better ASR engine would consist of edge-detectors, and burst detectors, not HMMs.

A few quotations from his talk:

“no redundancy in speech… but there are conflicting cues”

“not the formant transition”

“tiny bursts of energy to change from one phone to another”

The specific paper he presented is still under review and not yet available, but the following paper has a lot of what he presented. Code, demos, and videos are available from his website.

Allen, Jont and Li, Feipeng (2009). Speech perception and cochlear signal processing, IEEE Signal Processing Magazine, Invited: Life-sciences, 26(4), pp. 73-77, July.

His website: http://hear.ai.uiuc.edu/

Later during the workshop I spoke with him about when we spliced “pope” into “the pulp will be used to produce newsprint”. He said that under ideal listening circumstances (which we did not guarantee with MTurk), people should be able to hear what was actually spoken 100% of the time (in other words, they should hear “pope” if that’s what was spoken).

But as the acoustic condition is degraded, people’s “common sense” (or whatever) will begin to kick in, and people will hear “pulp” even if what is spoken is “pope.” I believe our results are somewhat erratic because they probably mostly depend on the MTurk user’s listening setup.

He referred me to a few papers including:

Bronkhorst, "A Model for context effects in speech recognition" 1993

Bronkhorst 1992

Shannon 1948 (beginning around page 10), and Shannon 1950

His textbook, Allen, “Articulation and Intelligibility”, 2005. (I just ordered this.) http://www.amazon.com/Articulation-Intelligibility-Synthesis-Lectures-Processing/dp/1598290088

He’s a strong advocate of reading original early work, such as Shannon’s papers. I think he thinks it’s a mistake that these are not part of core CS curricula (and I think he may be correct—I want to go back and read these papers).

He also referred me to:

 Lori Holt (Psych at CMU)

 Bob Boston (Pitt)

 John Durant (Pitt)

 And, Rich Stern—Rich’s work is more on binaural hearing, but he’s one of the people we know best.

More on the AI-gram:

“THE AI-GRAM

The AI-gram is the integration of a simple linear auditory model filter-bank and Fletcher’s AI model [i.e., Fletcher’s SNR model of detection]. Figure 2 depicts the block diagram of the AI-gram. Once the speech sound reaches the cochlea, it is decomposed into multiple auditory filter bands, each followed by an “envelope” detector. Fletcher-audibility of the narrow-band speech is predicted by the formula of specific AI (2). A time-frequency pixel of the AI-gram (a two-dimensional image) is denoted AI(t, f), where t and f are time and frequency. The implementation used here quantizes time to 2.5 [ms] and uses 200 frequency channels”
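
To fix the idea for myself, here is a very rough sketch of an AI-gram-like display based only on the description above (filter bank, envelope detector, per-band audibility against a noise floor). The band edges, filter order, noise floor, and clipping are all my own assumptions; this is not Allen’s implementation.

    import numpy as np
    from scipy.signal import butter, lfilter, hilbert

    def aigram_like(signal, fs, n_bands=200, frame_ms=2.5, noise_floor=1e-6):
        # Log-spaced band edges up to a bit below the Nyquist frequency (assumption).
        edges = np.logspace(np.log10(100.0), np.log10(0.45 * fs), n_bands + 1)
        hop = int(fs * frame_ms / 1000.0)                   # 2.5 ms time quantization
        n_frames = len(signal) // hop
        out = np.zeros((n_bands, n_frames))
        for b in range(n_bands):
            lo, hi = edges[b] / (fs / 2.0), edges[b + 1] / (fs / 2.0)
            bb, aa = butter(2, [lo, hi], btype="band")      # narrow-band filter
            env = np.abs(hilbert(lfilter(bb, aa, signal)))  # "envelope" detector
            for t in range(n_frames):
                p = np.mean(env[t * hop:(t + 1) * hop] ** 2)
                # Crude per-band audibility: log band power over an assumed noise
                # floor, clipped to [0, 1] in the spirit of the specific-AI formula.
                out[b, t] = np.clip(np.log10((p + 1e-12) / noise_floor) / 3.0, 0.0, 1.0)
        return out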

Acoustic Modelling for Speech Recognition: Hidden Markov Models and Beyond?

M.J.F. Gales

Department of Engineering, University of Cambridge, UK

<missed this one due to technical difficulties, dammit>

I got the slides from Mark. I think slides 19-29 may be some of the most interesting. http://mi.eng.cam.ac.uk/~mjfg/ASRU_talk09.pdf

Trends and challenges in language modeling for speech recognition and machine translation

Holger Schwenk

University of Le Mans, France

<missed>

Spoken Dialogue Systems: Challenges, and Opportunities for Research

Jason D. Williams

AT&T Labs – Research

<missed> http://www2.research.att.com/~jdw/papers/williams-2009-asru.pdf

Toward Machine Translation with Statistics and Syntax and Semantics

Dekai Wu

Hong Kong University of Science & Technology

Segment-based translation: defer segmentation until the last possible moment. This is especially important for languages like Chinese, where if you perform segmentation first, and do it incorrectly, then the translation is all messed up.

He went into some detail about how to use syntax in MT, in particular “transduction grammars.” As in formal language theory, there is a complexity hierarchy of these. At the bottom end are finite-state transducers, which are most similar to regular languages or finite-state machines. At the other end are “syntax-directed transductions,” which allow unrestricted syntax in both source and destination languages; but the hypothesis space of grammars is so large that these are unwieldy (?). In between lies what Dekai thinks is where we want to be: “inversion transduction grammars,” which are most similar to context-free grammars. (They are called “inversion” grammars, I believe, because the grammar allows inverting the order of words/phrases between the two languages.)
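
As a toy illustration of the inversion idea (my own example, not from the talk): an ITG rule combines two translated constituents either in the same order in both languages (“straight”) or with the target-side order swapped (“inverted”).

    def apply_itg_rule(left, right, inverted):
        # left, right: (source_phrase, target_phrase) pairs already built bottom-up
        src = left[0] + " " + right[0]
        tgt = (right[1] + " " + left[1]) if inverted else (left[1] + " " + right[1])
        return (src, tgt)

    # Straight: source "A B" -> target "a b"; inverted: source "A B" -> target "b a".
    print(apply_itg_rule(("A", "a"), ("B", "b"), inverted=False))   # ('A B', 'a b')
    print(apply_itg_rule(("A", "a"), ("B", "b"), inverted=True))    # ('A B', 'b a')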

He didn’t get into semantics in the talk. There is only a very short section in the paper on using semantic role labels.

So, I looked up his paper on using semantic role labels for SMT:

Dekai Wu, and Pascale Fung. “Can semantic role labeling improve SMT?” EAMT 2009.

In this they find that many translation errors occur because the semantic roles are not preserved. Additionally, in their dataset, they find that 84% of the semantic role mappings in their sentence pairs remained consistent. This suggests that attempting to maintain semantic roles would make the translations better.

I believe their experimental setup is somewhat artificial, since it’s a new line of research.

For instance, they only look at sentences where all the verbs are correctly translated. But under these circumstances, enforcing that semantic roles are preserved increases BLEU and METEOR scores by 2 points each.

Pascale Fung, Dekai’s wife and coauthor of the SRL-for-MT paper, was very interested in my work because she’s worked on both ASR and ontologies. I spoke to Pascale Fung on 1-5-10, 10pm.

They have started some work on semantics for ASR. They found some positive perplexity results, but no actual ASR results yet. They’re using it in code-mixed ASR (when there is more than one language in the speech). She has one PhD student looking at using SRL in that.

Rapid Language Adaptation Tools for Multilingual Speech Processing

Tanja Schultz

University of Karlsruhe

Some notes:

There is a lack of language experts. There are also serious issues with segmentation and writing systems that are non-trivial for many languages.

Web-based tools: give language speakers the tools to collect the data themselves (crowdsourcing): cmuspice.org

 GlobalPhone(?) – dataset

 ELRA available from link on Tanja’s website

 Get pronunciations from Wiktionary

 (John Kominek?)

 Textbook: Multilingual Speech Processing

 Olympics: Multilingual data scenario

It’s Not You, It’s Me: Automatically Extracting Social Meaning from Speed Dates

Dan Jurafsky

Stanford University

Notes:

 “flirt detectors”

 Machines are better at detecting flirting than people

 Men are sympathetic when they are flirting

 “Funny detectors”

 Grad students @ Stanford

 People talk exactly the same (?) to native as to non-native speakers.

 (no video).

 The people who “sit” are more (picky?).

New Perspectives on Spoken Language Understanding: Does Machine Need to Fully Understand Speech?

Tatsuya Kawahara

Kyoto University, ACCMS

The conventional formulation of spoken language understanding is frame identification and slot filling. For instance, in ATIS, this means filling slots like destination, time, etc. This formulation seems to be fine when used with a back-end relational database. However, when you move from a back-end database to something more like a search engine, then frame-and-slot filling goes out the window.

He cites two surveys of SLU that I need to take a closer look at:

 DeMori, “Spoken Language Understanding”, ASRU 2007

 Wang, Deng, Acero, “An introduction to Statistical Spoken Language Understanding”, 2005.

“SLU [for a search engine] is typically based on a vector space model (VSM),” which is something like TF-IDF weighted by acoustic confidence. For a very large search space, such as the Web or newspaper articles, there is little room for SLU beyond the VSM… but in a restricted domain there is some room. They consider two restricted domains: software manuals for Microsoft products, and tourist information for travel agents.
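
A minimal sketch of what “TF-IDF weighted by acoustic confidence” could look like; the weighting scheme and names are my own guesses, not the formulation from the talk.

    import math
    from collections import Counter

    def query_vector(hyp_words, confidences, doc_freq, n_docs):
        # hyp_words: words in the ASR hypothesis; confidences: per-word scores in [0, 1]
        vec = Counter()
        for w, c in zip(hyp_words, confidences):
            idf = math.log((n_docs + 1) / (doc_freq.get(w, 0) + 1))
            vec[w] += c * idf              # confidence-weighted TF times IDF
        return vec

    def cosine(u, v):
        # rank documents (also represented as term-weight dicts) by this score
        dot = sum(u[w] * v.get(w, 0.0) for w in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0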

For the manuals, they do a dependency parse of all the documents beforehand, and use the dependency parses to generate questions that can be presented to the user to help refine the query. Thus, asking which version of Excel they are using might provide the most information gain (in terms of narrowing down the number of documents).
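
A toy formulation of picking the clarification question that narrows the candidate documents the most; this is entirely my own sketch of the information-gain idea, and the attribute names are made up.

    import math
    from collections import Counter

    def best_question(docs, questions):
        # docs: candidate documents as attribute -> value dicts, e.g. {"excel_version": "2007"}
        # questions: candidate attributes we could ask the user about
        n = len(docs)
        def expected_remaining(q):
            split = Counter(d.get(q, "unknown") for d in docs)
            # expected log2(number of documents left) after hearing the user's answer
            return sum((c / n) * math.log2(c) for c in split.values())
        # the best question minimizes the expected remaining search space
        return min(questions, key=expected_remaining)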

Also look at:

 Transcription of Japanese congress.

 Transcription(?) of poster session dialogs.

Some notes:

Q: Does understanding help ASR?

Yes, for people.

No, for machines??? (← this is the question)

 MMR

 VSM-Vector space model

 “When asked, stenographers said they ‘don’t understand’, yet they don’t transcribe disfluencies.”

 Constraint(?)-based to interaction-based.

 Dataset of poster sessions; “hotspots” where the audience was impressed.

 They automatically detect: appreciation and laughter.

 [Look at this paper]

 Inter-annotator agreement of cleaning of transcripts… [is low?]

Roger Moore talk

Summary of a survey of conference attendees’ predictions about where things are going/when/etc. Ray Kurzweil predictions, etc. It wasn’t that interesting.

Jont’s comment: There is some research on how good ASR has to be for people to use it— but he didn’t say/remember where it was from or by whom.

Audio-Visual Automatic Speech Recognition and Related Bimodal Speech Technologies: A Review of the State-of-the-Art and Open Problems

Gerasimos Potamianos

National Centre for Scientific Research “Demokritos”

<skipped>

Online Discriminative Learning: Theory and Applications

Nicolò Cesa-Bianchi

DSI, Università degli Studi di Milano via Comelico 39, 20135 Milano, Italy

A few notes; he lost me after 10 minutes:

[Von Neumann game theory from the 1930’s—static ]

Vovte (?) and Warmuth (1989) – prediction with expert advice

Structured perceptron for training CRFs (?)

Margin optimization problem

“Passive-aggressive” algorithm

MIRA

Cost-sensitive margin-based labels.

Multi-view learning

-want the multiple views to (interact?) with each other.

Panel: Moderated by Alex Acero

Panel members:

 Fred Jelinek (JHU)

 Jerome Bellegarda (Apple)

 Pascale Fung (Hong Kong? UST?)

 ??? Someone from Loquendo

 Hermann Ney (Aachen)

 Al Gorin (NSA)

Some notes

 “Universal phoneme recognizer”

 Gorin: Internal DoD report about ASR: “We thought we were done… we were wrong”

 “DARPA-hard” = very, very hard problems

 “GALE”

 EU: Networks of excellence (like U.S. centers of excellence)

 Gorin: We need a “Moore’s Law”, or 10 questions…

 Alex: First dictation, then cell phones, what’s next? (As in, the next big flop?)

 Loquendo: In-car apps, also accessibility

 Pascale: multi-lingual address book

 Gorin: Agriculture article in most recent New Yorker… “used to be 100,000 languages”. May not be worth it to save/preserve the current 6,000.

 Alan Black’s Q: what’s been the progress of the last 10 years, what for the next 10 years?

 Al, quoting Ed Feigenbaum, “AI is one of those fields where we disown our successes.”

 “Need to pick the right metrics and goals”

Relevant/Interesting Posters in the Poster Sessions

Ontology-based Grounding of Spoken Language Understanding

Silvia Quarteroni, Marco Dinarelli, Giuseppe Riccardi

DISI – University of Trento

This paper is very similar to our work. The goal is to use a measure of ontology-relatedness to prefer recognition hypotheses that are “more related” than those which are not. Unlike our work, which is open-domain, this is for a dialog system. Specifically, the system is part of the “LUNA” project, and in this case is for a (hardware?) help desk. For example, people call to say things like, “my printer doesn’t work any more.” They use a small ontology to determine relatedness in a hypothesized ASR transcription.

The ontology consists of 32 concept classes (e.g. software operation, network operation, hardware, number, problem, institution). And each may have attributes, such as type (Mac/PC) and brand (DELL).

Rather than use a kind of “distance” metric through the ontology as we did, they use six binary relations to determine relatedness. They are: is-a, superclass, direct relation, relation to super-class, X negates attribute Y, and X and Y are attributes of the same concept class.

Then, for each “concept” in an ASR hypothesis, they compute its average relatedness to the concepts within an n-concept window. Then they take the average relatedness over all the concepts in the ASR hypothesis (?). The attribute-of relations were the most common.
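
My rough reading of that score, as a sketch; the window size and the final averaging are exactly the parts I’m unsure about, so treat this as an assumption rather than their method.

    def hypothesis_relatedness(concepts, related, window=3):
        # concepts: ordered list of concepts found in one ASR hypothesis
        # related(a, b): 1 if one of the six binary ontology relations holds, else 0
        scores = []
        for i, c in enumerate(concepts):
            lo, hi = max(0, i - window), min(len(concepts), i + window + 1)
            neighbors = [concepts[j] for j in range(lo, hi) if j != i]
            if neighbors:
                scores.append(sum(related(c, n) for n in neighbors) / len(neighbors))
        return sum(scores) / len(scores) if scores else 0.0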

The interesting positive result that they found was a positive correlation between “concept” accuracy and the relatedness metric. It’s not entirely clear how they derive these graphs (#4 and #5 in the paper). However, in the end this positive correlation was not enough to yield increased concept accuracy through n-best re-ranking. They did some additional analysis that I don’t fully understand yet.

Like us, they were not able to improve performance by using these scores to do n-best re-ranking. I spoke with first author Quarteroni afterward and shared our similar experience. She wasn’t sure why it didn’t work yet.

They use Protégé+Frames to write the ontology.

Active Learning for rule-based and corpus-based Spoken Language Understanding models

Pierre Gotab, Frederic Bechet, Geraldine Damnati

Université d’Avignon (?)

<This one seemed interesting, but I need to look back at the paper. It’s not that important though, except that it’s an SLU paper.>

Representing the Reinforcement Learning State in a Negotiation Dialogue

Peter A. Heeman

Center for Spoken Language Understanding, Oregon Health & Science University

In this research, they simulate a negotiation dialog between a user and a dialog system. The system is trying to sell furniture to the user. Both the user and the system have a preference (or policy) that determines the “quality” of the set of furniture that is ultimately settled on at the end. (E.g. the caller might want all the furniture the same color, and the system might only be able to deliver furniture with net weight under 1000 pounds.)

They use reinforcement learning, in which the learner is only given feedback at the very end of the process (i.e. the quality of the solution arrived at); it isn’t given feedback for the decisions it makes at each “turn”. Using a reinforcement learning algorithm, they attempt to learn a “Q” function, which specifies the utility of each action available in each state. (For example, what’s the quality of suggesting an ottoman in state 5?)
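
A minimal tabular Q-learning sketch for this kind of episodic setting (reward only at the end). The env object and its methods are placeholders of my own, not Heeman’s actual setup or algorithm.

    import random
    from collections import defaultdict

    def q_learning(env, episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
        Q = defaultdict(float)                  # Q[(state, action)] -> estimated utility
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                actions = env.actions(state)
                if random.random() < epsilon:
                    action = random.choice(actions)                      # explore
                else:
                    action = max(actions, key=lambda a: Q[(state, a)])   # exploit
                next_state, reward, done = env.step(action)  # reward is 0 until the final turn
                best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions(next_state))
                Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                state = next_state
        return Q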

Russell and Norvig, 2nd ed., 2003, Ch. 21 is a good review of reinforcement learning.

Back-off Action Selection in Summary Space-Based POMDP Dialogue Systems

M. Gašić, F. Lefèvre, F. Jurčíček, S. Keizer, F. Mairesse, B. Thomson, K. Yu, S. Young

Spoken Dialogue Systems Group, Cambridge University Engineering Department

A partially observable Markov decision process is similar to the scenario for reinforcement learning, except that there is less state information known (in other words, it may not be possible to determine the exact state?). These appear to be a common or natural way to manage a dialog system, since at each state (in the dialog) there are a number of actions one might take (e.g. ask for clarification, suggest something, etc.). Since the state is not known, there is a probability distribution over all states. But this makes the model intractable for even moderately sized problems.

In this paper, they look at “compressing” the state space into a smaller summary space, where the algorithms are tractable. The problem that arises with this strategy is that the “optimal” action to take in the summary space may not be a valid action in the normal state space. This research deals with how to handle that case. The two main options when this occurs are: 1) look at the valid actions in “nearby” states in the summary space, or 2) look at suboptimal actions in the current state (i.e. from an n-best list of actions). They find that looking down the n-best list of actions is preferable.
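
A toy sketch of the back-off idea as I understand it: walk down the summary policy’s ranked actions and take the first one that maps to a valid action in the full (“master”) space. Function names are placeholders, not the Cambridge system’s API.

    def select_action(belief, summary_policy, to_master_action, is_valid):
        # summary_policy(belief): summary-space actions ranked best-first (an "n-best list")
        for summary_action in summary_policy(belief):
            action = to_master_action(summary_action, belief)   # map back to the full space
            if is_valid(action, belief):
                return action                                   # first valid action wins
        raise RuntimeError("no valid action found in the n-best list")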

Russell and Norvig, 2nd ed., 2003, Ch. 17 is a good review of POMDPs.

The Exploration/Exploitation Trade-off in Reinforcement Learning for Dialogue Management

Sebastian Varges, Giuseppe Riccardi, Silvia Quarteroni, Alexei V. Ivanov

I didn’t get a chance to look closely at this one. But I’m very interested to learn that people use reinforcement learning and POMDPs in dialog systems. It doesn’t appear that these techniques are especially new; they’ve been around for at least a few years, if not since ~2000. But it’s neat to see some AI being used. I believe planning systems are also used in dialog managers, but I haven’t read much about how yet.

None of these are probably directly related to my work, but it will be good to learn about, know about, and to keep an eye on.

Prefix to the next few

I am beginning to learn more about conditional random fields (CRFs). I believe they are like maximum entropy models for sequences, and, as such, allow one to throw lots of redundant features into the model without messing things up. Like an HMM, CRFs can be used to label sequences. I believe their original purpose, when invented by McCallum and Lafferty (and Pereira?), was to label sections of a document (for instance, an email might have a header, a salutation, a body, and a signature). And I think they can be used for any sort of sequence labeling. (MEMMs, maximum entropy Markov models, are closely related to CRFs, I think.)

I believe that a CRF effectively has a maximum entropy model at each position in the sequence. So, for a sequence of length n, there are n MEMs, but the training and “inference” (i.e. “decoding”) of them are not independent. Thus, the number of features one can use is even more severely limited than for a single MEM. I think this means they will work OK if there is a focused set of features, but they cannot approximate a language model.
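
To make the “a MaxEnt model at each position, but not independent” picture concrete, here is a toy unnormalized score for a linear-chain CRF; this is my own illustration, and feature_fn and weights are placeholders.

    def sequence_score(weights, feature_fn, words, labels):
        # feature_fn(prev_label, label, words, i) -> dict of feature name -> value
        # Each position contributes a MaxEnt-style weighted feature sum, but features
        # may look at the previous label, which ties the positions together.
        score, prev = 0.0, "<START>"
        for i, y in enumerate(labels):
            feats = feature_fn(prev, y, words, i)
            score += sum(weights.get(f, 0.0) * v for f, v in feats.items())
            prev = y
        # P(labels | words) is proportional to exp(score), normalized over all label sequences.
        return score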

Around 2005/6, Hong-Kwang Jeff Kuo and Yuqing Gao used MEMMs for speech recognition. But since it is not possible to incorporate a language model into the model, their task is phone recognition, not word recognition. (This paper is called “Maximum Entropy Direct Models for Speech Recognition”.) (There may be ways to interpolate MEMMs/CRFs with a LM, but I haven’t seen if/how those work yet—Mark Gales’s talk at ASRU ’09 may have some more on that.)

Since MEMs may be one of the more promising methods for my research, I want to understand what’s happening with CRFs and MEMMs, especially in the speech community.

There’s still more to learn about all this. But there are a number of poster-session papers at ASRU ’09 on using CRFs in an ASR setting. Here are two.

*Hidden Conditional Random Fields for Phone Recognition

Yun-Hsuan Sung and Dan Jurafsky

Stanford University

Sung and Jurafsky use an HCRF for phone recognition. HCRFs are CRFs augmented with a hidden state variable in each of the “state sequence nodes”; these are used to represent “subphones and mixture components.” I’m not completely clear on how they use the hidden state variables.

They show these outperform both generatively and discriminatively trained HMMs. I believe they said this is the first work to incorporate “bigram” features into the model. In this case, since it’s phone recognition, they are phone bigrams. I asked if they could use trigrams, and they said even phone trigrams would be computationally infeasible. Thus word bigrams and trigrams would be pretty much out of the question. I wonder if a class-based language model would be more feasible (?). Perhaps not yet, but it might be worth considering, since we can (potentially) model word classes with a knowledge base.

Dan Jurafsky said he thought MaxEnt/CRF was the right thing to look at for our work, but that the big limitation of these models is the number of features you can use. Since it seems that our semantic features are more likely to be sentence/phrase-level, not word-level, a whole-sentence ME model might make more sense than a CRF or MEMM.

(Yun-Hsuan Sung is defending January 2010 and going to Google).

*A Segmental CRF Approach to Large Vocabulary Continuous Speech Recognition

Geoffrey Zweig and Patrick Nguyen

Microsoft Research

In this work, they again use CRFs. The application and data are from Bing Mobile Search, so the speech might be someone speaking “Jim’s Crab Shack” into their cell phone in order to locate it on a map or get information like the phone number. I thought I remembered him saying that they don’t use a LM, but it appears from the paper that they do. However, I don’t think the LM is incorporated directly into the CRF model—I think the CRF model has a single LM feature, which is like a subroutine that calls out to the actual language model.

What’s interesting about this work, however, is that they use a “segmental CRF.” This means that rather than having a state node for each phone, a state node can span a number of phones. This means they can incorporate word-level features into the CRF model. Since the segments are not known a priori, I believe the model considers all possible segmentations (this is possibly more feasible since the speech is short queries, not long sentences).
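
A toy dynamic program over segmentations, just to show why “consider all possible segmentations” stays tractable for short queries; segment_score and max_len are my own placeholders, not Zweig and Nguyen’s model.

    import math

    def logsumexp(xs):
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))

    def log_partition(n_frames, segment_score, max_len=10):
        # segment_score(start, end): log-score of treating observations [start, end) as one
        # segment (where word-level "detector" features would be computed).
        alpha = [float("-inf")] * (n_frames + 1)
        alpha[0] = 0.0
        for end in range(1, n_frames + 1):
            terms = [alpha[start] + segment_score(start, end)
                     for start in range(max(0, end - max_len), end)
                     if alpha[start] != float("-inf")]
            if terms:
                alpha[end] = logsumexp(terms)
        return alpha[n_frames]   # log of the summed score over all segmentations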

Segmental CRFs are also called Semi-Markov Random Fields. Geoff says they have been used for text, but this is the first time they’ve been used for speech.

Interestingly, the segment features are derived from a series of “detectors” (reminiscent of our sense machines, but without any notion of “sense”). I believe the detectors are more like LM features, but I’m not really sure. He says a big piece of this work going forward is to develop more “detectors.”

Papers I need to take a closer look at (some of)

Most important ones marked with a *.

*Scaling Shrinkage-Based Language Models

Stanley F. Chen, Lidia Mangu, Bhuvana Ramabhadran, Ruhi Sarikaya, Abhinav Sethy

IBM T.J. Watson Research Center

 Arthur Toth, Alan Black, and Ruhi tell me I need to look at this. This is next I think.

*Syntactic Features for Arabic Speech Recognition

Hong-Kwang Jeff Kuo, Lidia Mangu, Ahmad Emami, Imed Zitouni, Young-Suk Lee

IBM T.J. Watson Research Center

 Ruhi and Pascale said to definitely look at this one. This won the best-paper award. Another, later version of this will be presented at (HLT or ICASSP? I forget which).

*Towards the Use of Inferred Cognitive States in Language Modeling

Nigel G. Ward, Alejandro Vega

Department of Computer Science, University of Texas at El Paso

Hierarchical Variational Loopy Belief Propagation for Multi-talker Speech Recognition

Steven J. Rennie, John R. Hershey, Peder A. Olsen

IBM T.J. Watson Research Center

This doesn’t really have anything to do with my work. But it seemed pretty neat. The idea is to do recognition of multiple simultaneous speakers. What I thought was particularly interesting is that they can beat human performance (which may not be that surprising, but is still pretty cool).

Lexicon Adaptation for Subword Speech Recognition

Timo Mertens, Daniel Schneider, Arild Brandrud Næss, Torbjørn Svendsen

Norwegian University of Science and Technology, Norway

Investigations on Features for Log-Linear Acoustic Models in Continuous Speech Recognition

S. Wiesler, M. Nußbaum-Thom, G. Heigold, R. Schlüter, H. Ney

RWTH Aachen University

Log-Linear Framework for Linear Feature Transformations in Speech Recognition

Muhammad Ali Tahir, Georg Heigold, Christian Plahl, Ralf Schlüter, Hermann Ney

Aachen University

Dynamic Network Decoding revisited

Hagen Soltau and George Saon

IBM T. J. Watson Research Center

Transition features for CRF-based speech recognition and boundary detection

Spiros Dimopoulos, Eric Fosler-Lussier, Chin-Hui Lee, Alexandros Potamianos
