>> Dong Yu: We're pleased to have Eric today. And Eric got his Ph.D. degree
from U.C. Berkeley, and he's associate professor at Ohio State University
right now. He's been working on conditional random fields for many years.
And today he will talk about his recent work in this area.
>> Eric Fosler-Lussier: Thank you. So when I was trying to figure out what
to talk about in this talk I thought, well, my students are doing all this
whacky stuff and how do I sort of summarize all this stuff.
So let me give you a quick overview of the lab. Basically my viewpoint is
that we want to be looking at linguistics to help us do speech recognition
and language processing. We want to be looking at language processing and
speech recognition and help us do linguistics. And I'm going to talk more
about the former point than the latter point today. But that is sort of the
overall thing that we're doing. And traditionally we've been working a lot
in machine learning on the CSE side and with a lot of people in linguistics to
sort of get some good insight.
Recently I've also been dabbling in stuff with bioinformatics and a partner
over at the Wexner Medical Center; I'll talk a little bit about that
today too, just to give you a broad overview. I was kind of surprised when I
actually sat down and asked how many projects are we actually doing, and
there's quite a few.
Both on the speech side and the language side. And up until recently we had
Chris Brew, who left for ETS, so I inherited a number of language projects.
So a lot of the stuff we've been doing traditionally that people know me for
is working on speech recognition with CRFs and feature-based stuff.
There's been some stuff with Karen Livescu doing discriminative FSTs and
also some of the feature-based stuff. We just recently got involved in a
government project on multi-lingual keyword spotting. And you can read all
the rest of the stuff.
We do a little bit of linguistics so computational models of child language
acquisition. That's work with Mary Beckman. And some stuff on the medical
side of looking at timeline extractions from electronic medical records and
virtual patients.
And there's also a little bit of stuff in kids' literature on how to predict
the reading level.
So there's a number of partners in all of these things that I won't have time
to go into detail. So today I said, okay, we'll try and give you a slice
across. So I'm going to try and balance my speech time and my language time,
although I will tend to talk more about the speech, I think, because I go
into that.
So here's the overview. There are a lot of old familiar faces in the field in
the room, but for those who don't know what I do, I'm going to give you sort
of a how-did-we-get-here background on conditional random fields and stuff like that,
then I'm going to talk a little bit about some of the stuff that we've been
doing with segmental conditional random fields, articulatory feature modeling
and I'll talk a little bit about the language stuff we've been doing where
we've been taking CRFs, which kind of grew up in the natural language
community, and using them as event extractors to do semi-supervised learning.
Okay. So, background: how we got into this mess. Okay. So a lot of the
things that I've been interested in involve thinking about how do we break apart one
of these problems into some sort of feature description and use combiners to
do feature-based modeling. And that's sort of the overall theme of the talk.
And the idea is that in general we're going to be looking at ways to extract
features from either speech or language and then combine using some sort of
weighted model, usually a log linear weighted model.
And just to give you a flavor of how I got into this whole mess, a little bit
of the old history and I did my thesis way back when on pronunciation
modeling.
So I stole this slide from Rohit, who stole the slide from Karen, on
pronunciation variation in Switchboard, and it comes out of work from
Steve Greenberg where they did fine transcriptions of how
people pronounce words in conversational speech.
And you'll see that a lot of -- there are a lot of different ways people say
things like, for example, probably. And in fact it's probably very
unlikely -- in fact, it was not observed that you actually see the full
pronunciation of probably. But usually you say probably or probably. So
there's a lot of variation.
And this is a real problem from the viewpoint of trying to build a good model
of this because we don't have complete linguistic evidence in the signal
according to linguists. So these are linguists transcribing these utterances
saying here are the things I've observed.
So I did a thesis which nobody should ever look at on doing pronunciation
variation and people back in the '90s were getting moderate success by saying
I'm going to take my pronunciation dictionary and add these pronunciations.
I don't want to add too many of them because things are confusable and things
like that. But we never got too far with that approach, because the more
pronunciations you add to the system, the more likely it is that you're going
to start confusing things.
And so let's take a little bit of a closer look at some of the data. And the
thing that I'm going to show you is actually inspired by some work that was
done by Murat Saraçlar and Sanjeev Khudanpur at Hopkins, which
basically said what are the acoustic, the real acoustics that go along with
variations that we're observing?
So I have this following little example that if we just look at the Buckeye
corpus of speech. This is again conversational speech, but we get long-term
speakers; Switchboard has like five minutes of each conversation side. These
are like hour-long transcriptions. So we get a lot of instances of variation
within the speaker.
And we can take a look at things like let's just look at ah versus eh in the
Buckeye corpus. And I'm going to look at just looking at the formant
structure of saying I'm looking at where are the peaks in energy, okay, for
ah, eh and then when the dictionary said there should have been an ah but the
transcriber said that it was an eh.
So that would be like if I had "bad" and somebody transcribed it as "bed."
Which would be a confusable thing, but people do say this.
So here are the data that just correspond to ahs versus ehs: our transcriber says
these are ahs and these are ehs. This is transcriber ehs. First formant frequency versus
the second formant frequency. If you've done linguistics you'll know this is
one section of the vowel triangle.
So --
>>: More in terms of duration than in terms of the formant?
>> Eric Fosler-Lussier: Not in this corpus. The duration is actually
roughly equivalent for this speaker. So not for this. But I agree I'm only
looking at one dimension of a very multi-dimensional thing.
So just to give you -- and the circles really mean nothing, statistically,
but just to give you a sense of like here's a region where we see ahs. And
here's a region where we see ehs. So now here's your test: what happens when an ah
is transcribed as an eh? So the observer saw this particular thing
and said, okay, this was canonically an ah but it was an eh. So the transcriber
decided to change the canonical interpretation to eh. Where do you think that
data should be? It should be, you would hope, in that overlapping region,
right? Of course it's not otherwise why would I be showing this slide,
right?
So the actual data, which is marked with the blue stars, is all over the
place. And so some of it is in that region that you would expect. Some of
those are in the ah territory. Some of them are actually higher than eh.
This is more over into eh where the performance is and some sit lower than ah
and there's no phoneme lower than ah, right?
Again, this is a very hand wavy argument but you can -- this kind of data was
borne out more statistically with the study that Murat and Sanjeev did. But
here's the key. Where is the data not? The data is not over here. These
are the back vowels in here. So what is the transcriber really trying to
tell us? It's not that ah really was pronounced as an eh. It's that it was
a front vowel. It should have been an ah, I don't know what it really should
be. I don't know what the back -- I don't know what the vowel height should
be. So I have some uncertainty as to height.
I mean, that's sort of my loose interpretation of this. So we might have
uncertainty in some dimensions in terms of the variation, but certainty in
other dimensions. So if we start thinking about the linguistics in terms of
the multi-dimensional variation of some sort of phonetic space then we can
start modeling some of these phenomena.
So, right. So we can't take those transcriptions at face value. Okay? So
where I've been looking at different kinds of representations is to think
about subphonetic representations such as phonological features or
articulatory features to represent these kinds of transcriptional differences.
The interesting question then becomes how do you involve that into some sort
of statistical model. So backing up a little bit to the statistical model
side, I grew up at ICSI where neural nets were a big thing.
So one of our favorite things to do was to basically take some acoustics
here, put a multi-layer perceptron on it to try to predict which phone class this is given
this little local chunk of speech. For each little local chunk of speech I'll
get a posterior estimate for every frame, effectively, out.
And if you are an ICSI person you might do a diagonalization and
decorrelation and put it into an HMM. But one of the things that we did is
basically say, okay, we're going to build a log linear model on top of that
which is a conditional random field. And I'll talk a little bit more about
the conditional random fields in a second.
But essentially what you're going to get at is instead of having these frame
level posteriors you end up getting these sequence level posteriors out. Why
do I like this so much? Because we can start playing around with saying,
hey, look, I have little local detectors of: Is this a manner? Is this a
place? Is this a height? I can talk about these as features that I'm
extracting from the data and I can talk about combining features of different
kinds of things.
So this can be used in place of or in parallel to other types of things. So,
for example, Asela did a lot of active work to begin with on trying to extract
things like using the sufficient statistics as the functions, right, and
using a hierarchical or hidden conditional random field to do this kind of
estimation.
>>: [inaudible] also at the same time?
>> Eric Fosler-Lussier: Yes, so we've done it. You have to put in x and x
squared. So you get the sufficient statistics. It helps us a little bit.
But it complicates things -- it just -- it doesn't help enough to add
that in all the time. But it's a good thing.
>>: [inaudible].
>> Eric Fosler-Lussier: Please do.
>>: So to train this classifier you need a training signal. And you just
showed us that the training signal was unreliable.
>> Eric Fosler-Lussier: Right. You're asking something that's not on the
slides but it's a very good question. So there's a different -- let me -- there's two answers to that question. In fact, they do come into the slides
later on. So you're right you are anticipating.
So at first guess what we're going to do is basically take a phone level
alignment. In fact, usually from an HMM, because we need -- that's one of
the downsides to this is that you actually need to have, if you're not doing
a hidden conditional random field you need to have that label sequence.
And so what we do is we take a phone sequence and we can just project it back
to what the appropriate features were, train independent nets, and what
we're capturing essentially is in some ways the errors that each one of these
makes in its estimation.
And we might get a little bit of feature asynchrony or something like that.
But if you want to go to models more like the kind of articulatory feature
model you're going to need some other mechanism to try and handle that
misalignment. And I'll talk a little bit about Rohit's model which you might
have seen before about trying to do articulatory feature alignment. So hang
on to that thought.
>>: That point.
>> Eric Fosler-Lussier: Yeah.
>>: My understanding is that with a CRF versus a hidden CRF, you don't need to
have precise alignment at the frame level to train it.
>> Eric Fosler-Lussier: To train a straight-up CRF you need to have the
full label alignment. To train the hidden one you don't.
>>: I see.
>> Eric Fosler-Lussier: Right. Because the alignment is actually hidden, is
actually one of the things that you need to -- but you can do -- you can do
an embedded Viterbi-style training which doesn't -- with one
realignment basically you get back to where you need to be.
So it's not a big deal to --
>>: When you put this CRF in the form that models the whole sequence, then
by default you get hidden there. But you don't really expect -- rather
than -- these things here.
>> Eric Fosler-Lussier: So the way we're training it actually you do have -- well, we've used both one state and three state type models.
But, yeah, you do actually have an explicit -- the way we train it we have
this explicit labeling, then we go and we Viterbi relabel and that gets us to
get away from the mismatch. Okay. That actually gets to an interesting
point about the criteria. So let me hold off on that idea for a minute.
Okay. So one of the very few equations that I have on this, in this talk is
basically what we're interested in is a label posterior given the acoustics,
and what we're going to do is talk about this in terms of a bunch of what
we'll call state functions, which associate with each label essentially some
feature function here, some generic feature function. I'm going to label
those as my state functions, and also the transition functions, which talk
about pairs of states -- in this case, because this is what we call a linear-chain
CRF. And these also may have associated feature functions.
This actually turns out for us to be an important thing. It's an open
question as to how important is it to have observational dependence on this
transition. We find that having the ability to talk about whether you're
going from one state to another based on the transitions which you can't do
in an HMM straight up, this actually is important for us that we get quite a
bit -- a little bit of gain out of doing that.
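To make that concrete, here is a minimal sketch of the linear-chain CRF posterior being described. The notation (Q for the frame label sequence, X for the acoustics, s_i for state functions, f_j for transition functions, and the lambda/mu weights) is mine, not taken from the slide:

```latex
% Hedged sketch of the linear-chain CRF posterior described above; notation is mine.
P(\mathbf{Q}\mid\mathbf{X})
  = \frac{1}{Z(\mathbf{X})}\exp\!\left(
      \sum_{t}\Big[\sum_{i}\lambda_i\, s_i(q_t,\mathbf{X},t)
                 + \sum_{j}\mu_j\, f_j(q_{t-1},q_t,\mathbf{X},t)\Big]\right),
\qquad
Z(\mathbf{X}) = \sum_{\mathbf{Q}'}\exp\!\left(
      \sum_{t}\Big[\sum_{i}\lambda_i\, s_i(q'_t,\mathbf{X},t)
                 + \sum_{j}\mu_j\, f_j(q'_{t-1},q'_t,\mathbf{X},t)\Big]\right)
```

The key point in the paragraph above is that the transition functions f_j are allowed to condition on the observations X, which is the observational dependence on transitions that a plain HMM does not give you.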
Okay. So just to sum up this little sort of like this is like 2008 and
before kind of thing, you know, basically you can play around with different
versions of the feature functions. So, for example, if you're Yasser Hifny: here
are my 10,000 Gaussians -- which are my 20 closest ones? -- and that would be a
very sparse feature space.
So there's a nice thing about this is that you can talk about different kinds
of features, plug them in together and just train. Just to summarize the
results that we got on previous studies with phone recognition basically we
found that we were beating the Tandem HMM systems using the posteriors input,
with many fewer parameters.
Even just a monophone-based CRF was actually beating our triphone-based
system. Admittedly this is not a discriminatively trained HMM. So caveat
there.
And for TIMIT, training discriminatively is a mess -- not enough data
to get it done.
>>: Until IBM --
>> Eric Fosler-Lussier: Until IBM. Brian's going to show me wrong, exactly.
And the transition features actually help us quite a bit. That was one of
the other things we saw. And we played around with things where we combined
a large number of phonological features and phone posteriors and we found
that that was a good effective combination technique.
So any questions on sort of the background before I sort of segue into like
what's new? All right. So I'm going to talk a little bit about upcoming
interspeech paper we have on boundary factored CRFs.
So I'm glad Jeff's in the audience, because I'm going to sort of sing to his
choir. This whole talk in some ways is like bringing coals to Newcastle,
because you guys do a lot of this stuff within the speech group.
So I'm -- we're adding little bits of knowledge to the kind of things that
you guys are looking at.
So the frame level approach, while nice and easy to train, has a real
problem. And that is that what we're trying to do is maximize the conditional
likelihood of the frame labels, as opposed to, upstream, what we really
want, which is words in the end -- not just phones, we want words. There's a big
criteria mismatch.
And it gets down to exactly about this issue about segmentations and the fact
that if I get -- and the way to think about it is if I had "cat" and I got a
whole bunch of Ks -- let's see, "cat" is not a good example because the ah should be long.
Let's imagine I recognize "cat" as: the K should be one frame long and there's a
bunch of ahs and a bunch of Ts. And if I label that one frame wrong, it's a
big problem in terms of the recognition, right, but the criteria basically
says, well, you just got one wrong out of this long sequence, right? So the
frame level criteria is really kind of a bad mismatch.
As a side note to that, one of the things that you want is sort of phonotactic
grammars and things like that -- this sound is likely to correspond to that one -- and
on a frame level those probabilities, or actually potentials in a CRF, get
really spread far apart because you've got all the sequence of frames.
So this is kind of bad from that point of view. And we want to be able to
incorporate long span features like the duration exactly to your point.
Duration is very distinctive especially in a lot of languages that are not
English. Formant trajectory, syllable and phoneme counts, all these kinds of
things.
So we want something else to work on. So segmental CRFs to the rescue, da da
da da. So the idea is I'm going to change my labels from Q to Y where I'm
talking Y is now a segment as opposed to Q is my frame.
And we're going to talk about labels being on the segment level but every
label is going to correspond to every chunk of frames and probably Jeff has
talked your ear off about it. But the idea is that what you're going to do
is you've got this implicit sort of segmentation that's going on that
basically says, hey, you know, this Y3 actually corresponds to two outputs
and I need, the segmentation is somewhat hidden from me.
But the nice thing is now phonotactic grammars actually fit. So we're now
talking about really this phone is changed from that phone. So when you plug
all this thing in, I'm going to take all my state functions and transition
functions and put them into sort of one form here.
What I'm going to do is look at the segmentations, possible segmentations and
sum out over all the possible segmentations and now I can get a posterior on
the sequence rather than the posterior on just the frames. Okay. So that's
essentially the model that's built into SCARF. And one of the questions -- one of the mantras in my lab is, you know, what is the stupidest thing you
can think of to do first. And one of the questions we said is what if you
relax that assumption basically say, hey, I'm going to actually just try and
jointly predict the labels and the segmentations at the same time, which
saves me on training time because I don't have to do all of this
hypothesizing and summing over all the possible segmentations.
But on the other hand it is exactly the same type of Viterbi assumption that
I'm making in terms of the segmentation. If my segmentation was wrong in the
first place, then I'll have to correct it with retraining.
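As a rough sketch -- my notation, loosely following the SCARF-style formulation described above -- the segmental posterior marginalizes the hidden segmentation, while the variant just described treats the segmentation as part of the prediction:

```latex
% Hedged sketch; y_k are segment labels, e_k segment end times, f_j feature
% functions over a label pair and the spanned frames -- notation is mine.
% SCARF-style: sum over segmentations e consistent with |Y| segments.
P(\mathbf{Y}\mid\mathbf{X}) =
  \frac{\sum_{\mathbf{e}:|\mathbf{e}|=|\mathbf{Y}|}
        \exp\sum_{k}\sum_{j}\lambda_j\, f_j(y_{k-1},y_k,\mathbf{X},e_{k-1},e_k)}
       {\sum_{\mathbf{Y}',\mathbf{e}'}
        \exp\sum_{k}\sum_{j}\lambda_j\, f_j(y'_{k-1},y'_k,\mathbf{X},e'_{k-1},e'_k)}
% Joint variant described above: predict labels and segmentation together,
% i.e. work with P(Y, e | X) and skip the inner sum over e.
P(\mathbf{Y},\mathbf{e}\mid\mathbf{X}) \propto
  \exp\sum_{k}\sum_{j}\lambda_j\, f_j(y_{k-1},y_k,\mathbf{X},e_{k-1},e_k)
```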
Okay. So this is the model that we're working with. But I think a lot of
the lessons that we learned are really sort of transferrable between the two.
I don't think there's a big deal about this. This is mostly because it makes
it computationally tractable within the university setting, I think, for us.
Yes?
>>: When you consider, when you're doing decoding, say, don't you still have
to consider all the possible segmentations?
>> Eric Fosler-Lussier: You do have to consider all the possible
segmentations during decode time, right. This is mostly to save on the train
time. We do put a limit on our segment length. That's another thing that we
have done to sort of optimize the decode thing.
And one of the things we're trying to do and I was very annoyed to see Jeff
coming out with a paper on this right before we did, but is to do one pass
decoding directly from the acoustics.
So that's why we were kind of interested in getting into this model.
>>: For estimation, the computation will be the square of the rate?
>> Eric Fosler-Lussier: No. No. So you have to -- so there's some
computational tricks.
>>: You have to limit --
>> Eric Fosler-Lussier: So you limit -- I'll show you in a second.
>>: Training doesn't fix -- don't you still have to do the work for Z?
>> Eric Fosler-Lussier: You do have to do the work for Z. But -- oh, right.
So this is the problem with presenting your student's work. Is the reason
you went into this is not the reason that ended up happening.
So you do need to do it for the Z. You're right. And so we did these
computational tricks on the segment length to basically make it more
manageable. So basically -- going off the slide a little bit -- we
experimented with the idea -- actually, let me come back to that, because
I'll have a slide that I can point to in a second.
So I'll just go up here. So on the efficient segmental CRF slide, the upshot
is that we found that if you -- we played around with actually in a
classification task what would be the optimal size of segments, like if I had
a long segment longer than whatever my optimal, whatever my maximal size was
D. I'm going to call that D, right?
Then I would go and split a phone in two at a boundary. So I might have the
first half of the phone and the second half of the phone, or maybe multiple
chunks of length D.
The question we had was how bad does that hurt? And we found out that after
doing some experimentation for the Timit task at least, that having an
optimal size of about 10 was a good trade-off between decoding speed and
accuracy. We didn't really lose so much accuracy by saying my maximum length
of duration was 10 if I had something that was longer than 10 I'm just going
to postulate two segments in the end.
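Here is a minimal sketch of that max-segment-length trick, assuming reference segments given as (label, start, end) frame spans; the function name and data layout are mine, not from the talk:

```python
# Hypothetical sketch: any reference phone longer than max_len frames is
# split into multiple segments of at most max_len frames each.
def split_long_segments(segments, max_len=10):
    """segments: list of (label, start_frame, end_frame), end exclusive."""
    out = []
    for label, start, end in segments:
        pos = start
        while end - pos > max_len:
            out.append((label, pos, pos + max_len))
            pos += max_len
        out.append((label, pos, end))
    return out

# e.g. a 23-frame /aa/ becomes three segments of length 10, 10 and 3
print(split_long_segments([("aa", 0, 23)], max_len=10))
```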
All right. So one of the ways that I've been thinking about this problem is
that what you're really talking about is a different kind of CRF when you're
talking about a segmental CRF in some ways. What you're really talking about
is a model where the time now becomes sort of a first-class variable, right?
So the segmentation variables are also going to be, the segmentation times of
my segments are actually also variables that I want to hypothesize.
So when I have a graph like this, a CRF basically defines its probability
structure over the cliques in the graph. So I have some natural cliques here
because I can talk about this phone is between this time point and this time
point. Okay? I can also talk about -- so I can talk about extracting
evidence from that from the acoustics. I can talk about a prior to say this
phone is likely to be this long, right? On the thing. But I also have these
inverted triangles which basically say, hey, here is a phone and its
successor and I believe the boundary is at this point, and I can talk about
acoustic observations that correspond between the transitions between phones.
So there's a lot of -- there were a lot of models that at some points tried
to -- like there was a model called the spam model out of ICSI that tried to
focus on transitional areas in speech rather than segmental areas.
And this is a natural way to think about incorporating that type of evidence
of a different nature. Okay.
>>: Use it to model the situation? You mentioned earlier about --
>> Eric Fosler-Lussier: The formant transition, that's an interesting
question. So the steady state within -- so the natural -- like let's imagine
if Y were a triphone, then you would expect a formant transition would be
appropriate here. But if you're looking at local area of formant transitions
from ba to eh, for example, this might be an area where you could actually
incorporate features like that. Right?
So it gives you that flexibility to start thinking about the linguistics of
the situation. Right? And plug in I've got this great transition detector
that will do X and I can plug it in as oh this is something that focuses on
boundaries versus steady state.
Now, the problem -- and I said before we were very interested in all these
transitional, like having dependence, acoustic dependence on the transitions
between phones, if you try and do this directly within the segmental CRF, you
incur essentially an N square D times the length. That's the decode
complexity. And the question is, because you need to keep track of
essentially the durations of either one of these and also the pair of these
things, right? So all of the -- you have to look at all possible segments.
So what Ryan discovered is, hey, let's do the following: We're going to
model a boundary itself as an intermediate node. This is deterministic.
This essentially is going to carry something about its segmentation but this
doesn't carry any information about the segmentation.
So you give up the ability to talk about the value, the duration of both
segments in joint. So you do give up something with this model. But if you
don't have features that talk about that, then you can basically say, look,
I'm interested in knowing the time point that this ends, it's transitioning
to this point and I might look over a local window of features.
So we're changing the problem so you're no longer able to look over the
entire span of the segment but you're allowed to look at a local window
around a boundary. That's really what we want to focus on. So Ryan called
this a boundary-factored segmental conditional random field.
>>: Essentially what the portion of values --
>> Eric Fosler-Lussier: So this is a deterministic value. This basically
carries, like, this might be an ah that goes for five frames, this one is an
ah that goes for three frames; this just says you're going into an ah. So
that makes this -- it's not deterministic because essentially it carries a
prior on the duration of segments, right. But in the time point it's
actually carried by the previous time, right? So this, one way of thinking
about it is this has, it carries a label and a time point as opposed to a
label and a duration.
So it turns out that with that clever trick you can actually get quite a bit
of speed up in doing this. So we did some --
>>: For the second one, why do you say you call it a segmental CRF? It's not
segmental.
>> Eric Fosler-Lussier: It is segmental because we still have these
segments. The labels are still on the segmental level.
>>: But you said you're not using any of the features about the segment. I
don't think --
>> Eric Fosler-Lussier: In the transition. In the transition boundary.
Segments still now respect --
>>: The transition feature, you're only looking at like the local --
>> Eric Fosler-Lussier: That's right. So -- and these are transitions
between segments, right? So this segment actually gets to observe everything
about its segment information.
Sure, this gets to observe all of the data that's within its box, that is a
segmental thing. It's not a single frame.
>>: But they have a little, boundaries certain features from.
>> Eric Fosler-Lussier: That's right. You have to hypothesize the
segmentation.
>>: Why do you save the computation now over here?
>> Eric Fosler-Lussier: Because we don't have to -- we no longer have to
model the joint pair of all of phone one with duration D1 times all of phone
two with duration D2. We compile it down into a single time point
so that the transition feature is only allowed to look at local boundaries,
but the segmental feature, the state feature, is allowed to look within its
segment.
And I want to point out most of the models out there don't even use
information on the observational side with those segments. They're just
mostly priors on this. So it's particularly this question of how do you get
the observational dependence into the transitions in the right way.
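A loose illustration of the contrast, with names and structure entirely my own (not the actual decoder): in the plain segmental CRF a transition feature can in principle see both whole segments, so a transition hypothesis has to carry both durations; in the boundary-factored version it only carries the two labels and the boundary time, and features look at a fixed window of frames around that boundary.

```python
from typing import NamedTuple

class PairTransition(NamedTuple):      # plain segmental CRF (sketch)
    prev_label: str
    prev_duration: int                 # frames in the left segment
    label: str
    duration: int                      # frames in the right segment

class BoundaryTransition(NamedTuple):  # boundary-factored segmental CRF (sketch)
    prev_label: str
    label: str
    boundary_frame: int                # features use frames around this point
```

Dropping the joint (duration, duration) bookkeeping from the transition part of the hypothesis space is where the training and decoding savings in the numbers below come from.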
>>: You're talking about the computation encompassing in terms of decoding.
>> Eric Fosler-Lussier: Yeah.
>>: Because otherwise you don't get a feature -- get feature extraction based
on the segment, the segment how --
>> Eric Fosler-Lussier: So here's how we -- okay. You're anticipating me.
Good. Good. Let's do a head-to-head comparison between frame level CRF and
segmental CRF. That was the key thing that we realized nobody's really
done -- see what happens. We'll use the same exact feature input. So I'm
going to use my posteriors coming off all my phones or my phones and
phonological features.
And I'll train up my usual Jeremy Morris-style CRF, frame level CRF. Now the
question is: How do we go to extracting segmental level features. Remember,
you hypothesize the segmentation and do feature extraction. Basically what
we're saying is, for the state features, we'll sample uniformly from my posterior space.
So I might take five snapshots. If it's 10 frames long I'll make them evenly
spaced. If they're three long I might oversample.
But what I end up with is a fixed feature vector that basically describes
different points within that segment.
We played around with different things like what's the maximum of this value?
What's the maximum within the window? All these kinds of things we played
with a number of things.
This seems to work well. But you could imagine plugging in something else.
There's nothing that says that you can't. What you need is an online feature
extractor that basically says I'm hypothesizing this segment at this time
give me the features that correspond to this. So ours basically -- our
version of that basically just says I'm going to uniformly sample within my
posterior space in time, I should say.
We throw in the duration feature as well.
So for the experiments I'm going to show the maximum duration was 10. For
the transition features we played around with two different things. One is
that basically this frame level transition feature basically just says give
me the posteriors that surround the boundary frames within some window. And
I also might incorporate a direct MLP prediction of is this a boundary or
not. So you can train an MLP that does that as well.
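Here is a minimal sketch of that feature extraction, assuming `posteriors` is a (num_frames x num_classes) matrix of MLP phone or phonological-feature posteriors; the function names, the five-sample count and the window size are my own illustrative choices, not the exact recipe:

```python
import numpy as np

def state_features(posteriors, start, end, num_samples=5):
    """Uniformly sample frames within [start, end) and stack their posteriors
    into one fixed-length vector, plus a duration feature."""
    duration = end - start
    # evenly spaced sample points; short segments effectively get oversampled
    idx = np.linspace(start, end - 1, num_samples).round().astype(int)
    return np.concatenate([posteriors[idx].reshape(-1), [duration]])

def boundary_transition_features(posteriors, boundary, window=2):
    """Frame-level transition features: posteriors in a small window around
    the hypothesized boundary frame (edge handling kept simple here)."""
    lo = max(0, boundary - window)
    hi = min(len(posteriors), boundary + window + 1)
    return posteriors[lo:hi].reshape(-1)

# usage sketch: a hypothesized segment spanning frames 30..40 of a fake utterance
post = np.random.rand(100, 48)
feats = state_features(post, 30, 40)
trans = boundary_transition_features(post, 40)
```

A boundary-prediction MLP posterior, as mentioned above, could simply be appended to the transition feature vector.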
Now the segment level basically says I have to look at -- I have to look at
the hypothesized segmentation on either side and now extract the features
corresponding to both sides.
So if I want the segments, I now have to both hypothesize segmentation on the
left and segmentation on the right. So this is more expensive. Okay. And
this is what we're trying to get away from. But we want to know how much do
we lose by actually going to this local boundary idea.
Okay. So this is core test accuracy. We also tend to report it on enhanced
accuracy, which is the full dev set; that tends to be two points higher. If you
wonder why these numbers look a little lower than Jeremy's it's because he
tends to report on the enhanced stuff.
So the core test, basically if you plug the stuff into a Tandem HMM you get
roughly the same thing as frame CRF. On the enhanced set it tends to be
bigger and statistically significant.
So when we just use segmental state features with no transitional
information, we get essentially a boost. And so this is really compared to
that. So you get about two points going from frame to segment. When you
incorporate -- now if we do full SCRF training where you have to look at all
pairs of possible things you get training time of about 62 minutes. When you
now collapse it down by having this fan into a single state, you actually
improve training time quite a bit. This is per epoch of training. So when we
put in the transitional feature we get another point here. That's nice.
It turns out if you instead do some sort of windowed computation you can
actually get a little bit more. And you'll notice that this becomes really
expensive to train in the CRF world, right? It's not so bad. It's about
eight times faster or so, right, to do this sort of just focus on the
boundary rather than looking at all pairs. Okay. So you can do this
trade-off of like is it worth the extra point to do essentially three times
as much training. You can play with this idea. There's some happy medium.
We're just showing you some choice points.
Now, this was using just the phone posteriors.
>>: Are the numbers in the last two rows there, there's two different
training times.
>> Eric Fosler-Lussier: Yes.
>>: Right.
>> Eric Fosler-Lussier: That's right.
>>: For the boundary SCRF and regular SCRF.
>> Eric Fosler-Lussier: Right. Exactly.
>>: Are the accuracies the same.
>> Eric Fosler-Lussier: The accuracies turn out to be exactly the same
because the models are exactly the same. Because we're not using any
features that talk about --
>>: The SCRF, it's not using --
>> Eric Fosler-Lussier: It's not using -- it doesn't have a feature that
talks about that. So it's just wasted power, exactly right. So they're
equivalent because of the types of features you want to use.
>>: So one of the advantages uses segmental feature is you can, for example,
take average of the [inaudible].
>> Eric Fosler-Lussier: That's right.
>>: You will miss --
>> Eric Fosler-Lussier: No, no, no, we can do that. We did that as one of
our experiments. Instead of taking the average MFCCs you take the average of
the posteriors. We have done that. It works about the same as taking the
uniform sampling or all this stuff.
So each one works a little better or a little worse in various situations.
Right.
So that's if you add in the phonological features. It turns out annoyingly
that we actually ended up with the same amount by adding phonological
features so it helps down here. So we're still sort of puzzling over that a
little bit.
I don't know what to say about that yet. Okay. So we're eventually moving
to word recognition on this and the idea is that we can -- one option that
we're interested in doing is basically saying, look, we can use Jeff's SCARF
system as a higher level processing system, so that this becomes a first pass
that SCARF can now incorporate larger segmental features on. So we
basically put out lattices and then have SCARF sort of work on that.
So that's one option that we're looking at for doing word level decoding.
The other option that we're looking at is Jeremy Morris did some sort of
hybrid recognition style thing for basically turning this into like a hybrid
neural network except for it works on sequences rather than individual
phones.
So that's all I had to say on this topic. And I see that I'm running really
long. So I guess I'm going to give you guys a choice, because I will have to
sort of shrink.
So I have some -- I know Rohit actually gave a talk here on the articulatory
features stuff last year when he was interning. I could turn to the more
text-based processing stuff or talk a little, dabble in each. So maybe I'll
have a show of hands.
>>: [inaudible].
>> Eric Fosler-Lussier: Do we have time?
>>: 11:40. We have Dan showing up.
>>: I see.
>>: 11:30 pick him up. If there's a group of us, though, that's going to
lunch we should probably beat the lunch crowd.
>> Eric Fosler-Lussier: Yes, absolutely.
>>: Eric, just so you can calibrate, there's probably five or six people in
this room who might have seen Rohit's talk.
>> Eric Fosler-Lussier: Okay. So that's a good calibration. So maybe I
should -- I don't know. So would people prefer to hear more about like -- so
articulatory feature modeling, that answers Lee's question a little bit about
this stuff, or text-based processing? I'll touch on each one so you get a
flavor of what's going on.
>>: [inaudible].
>> Eric Fosler-Lussier: Okay. I'll zip through a little bit. Okay.
Articulatory feature modeling. So what we've been doing up to this point is
basically taking articulatory features and sort of using them as estimates
and then combining them by thinking about a linear sequence of phones.
And one of the things that we're interested in -- I've been working with Karen
Livescu and with two of my students, Rohit, who you guys know, he interned here
last year, and [inaudible], and we've been working on some articulatory feature
models, and we've been complexifying the CRFs in a different dimension, which is
in terms of factored state spaces rather than factoring time. So just to give
you a view of this.
One way of thinking about the world, of language, is: if you have the
word "sense", it can often be pronounced as "sents", because if you, if the
articulators are out of synchrony, you end up having more of a closure and a
release which ends up sounding like a T. So you get sents instead of sense.
It's hard to do on the fly.
So there are models of this like articulatory phonology. And Karen spent a
lot of time thinking about models and this built on some stuff that Lee had
done as well in thinking about can we build models of how articulators
combine to produce phonological effects.
So the idea here is that if you get an asynchrony in the tongue body, the
tongue tip and the lips moving at a different time than the velum and glottis
does, you end up with -- I should say it's over here, the asynchrony. I
guess they're both asynchronous.
But you end up with a nasalized vowel and you also end up with this extra
phone that didn't appear in the original transcription. And the claim is
that this can account for a lot of that pronunciation variation that we saw
in the beginning.
It can and it can't. It accounts for some of the effects. It doesn't account
for all of them. And so Rohit built this model that he talked about last
year where we're trying to do articulatory feature alignment. And this gets
back to the question I was asked about how do I get the targets for all of
the MLPs that I'm training? And the problem is that if you are going from a
phone, if you're going from this, right, if I just project from this phone to
each one of these dimensions. So I've changed my categorization because
we're using articulatory instead of phonological features, but the point is
the same, I'm not going to be able to really match what's really going on
which is this, right?
So a mapping here from N and this nasalized N, right, isn't going to be able
to get me this particular asynchrony very well. Unless I have a really fine
transcription. In fact, so we can use the switchboard transcription project
as our fine transcription and then project backwards. That's the way we're
going to seed our models to answer some questions.
But if I'm just going off a dictionary where I don't have detailed hand
transcriptions I'm not going to be able to get this.
The question is how do I bootstrap from models where I've got this really
fine transcription and be able to project back to this kind of space.
And that's where the alignment comes in. So I basically want to say, hey,
I've got the word "sense" coming in and I want to be able to say I expect sort
of some asynchronous thing going on with -- there's some synchrony going on
between the streams but there's some asynchrony going on between the streams.
I need to have a model of that.
So we put up this monstrosity. Oh! Right? But let me make that a little
simpler. Okay. So let's think of each one of these colors as some sort of
asynchronous state where basically you're now thinking about what is the
value of -- so I have essentially a phone that -- so between this word and
the sub word state it will map to a particular articulatory feature which
corresponds to a phone. So there's a model that basically says are these
features allowed to be asynchronous, what is the probability distribution of
them being asynchronous, and I can talk about different streams going on in
parallel, one, two, three, as many as I want. Where they each have these
linear structures going over time but also have synchrony constraints that go
on between the streams.
So this is kind of a really ugly factor graph. By the way, the red things are
trainable. The blue things are actually deterministic. So it's not as bad
as having to learn everything about these things.
But we can take advantage of these deterministic constraints because the
number of red constraints is pretty small. And we can boil down this model
into essentially a very simple model that talks about a vector space of sub
word configurations. And we can talk about the fact that so let's imagine I
put a limit -- so if I had cat and I had three streams, I could have one,
one, one, two, two, two, three, three, three, or I could have one, one, one,
and then one, two, one, and two, three, three, three, three, three start
getting things asynchronous out of line.
If I put a limit on that on how far my articulator is allowed to get out of
synchrony, the 1s and 2s and 3s basically talk about the canonical positions
for the k, the ah and the t, or for "sense", the s, eh, n, s. So I can talk about
this one moved ahead but this one didn't and all that really is a change in
index space.
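A toy sketch of that index-space idea, entirely my own simplification: each stream sits at an index into the canonical sequence, each stream either stays or advances one step at a time, and an asynchrony limit bounds how far the stream indices may drift apart.

```python
from itertools import product

def successors(config, num_targets, max_async=1):
    """config: tuple of per-stream indices into the canonical phone sequence."""
    next_configs = set()
    for moves in product((0, 1), repeat=len(config)):    # each stream stays or advances
        cand = tuple(min(c + m, num_targets - 1) for c, m in zip(config, moves))
        if max(cand) - min(cand) <= max_async:            # asynchrony constraint
            next_configs.add(cand)
    return sorted(next_configs)

# e.g. three streams aligned to the three canonical positions of "cat":
# from (0, 0, 0) the streams may move on together or drift by one index
print(successors((0, 0, 0), num_targets=3))
```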
So if you do that, basically we now have basically a bunch of trainable
parameters and one basic configuration parameter that basically says hey this
is a -- these are the possible transitions between these two states. So we
actually -- do we keep that fixed? Doesn't matter. But the idea is that
what we're doing is eliminating deterministic variables. So originally we
tried to implement it with a general-purpose CRF toolkit and it got hairy.
By doing some neat tricks, which actually are related to the tricks for the
boundary-related factored thing about restricting your state spaces, you can
actually get a computationally tractable model. And we found that if you take
the alignment error rate, we actually end up getting articulator alignment
errors reduced by about five to 15 percent relative. So that's sort of where
that's been.
>>: [inaudible].
>> Eric Fosler-Lussier: So you initialize the model but using switchboard
transcription project, so you have hand transcribed things and then we're
going to propagate -- actually in that case we're training and testing on
that set. But the idea is eventually we then took that and aligned all of
switchboard for another experiment. But I'm not going to talk about that.
But, yes, you use that as a bootstrap.
>>: [inaudible].
>> Eric Fosler-Lussier: We don't have TIMIT results for this model, that's
right.
>>: But you follow the details.
>> Eric Fosler-Lussier: We probably should do a TIMIT model for this result --
that's a good point. If you're interested in more details that was our ASRU
2011 paper.
So Rohit has to graduate at some point so he can go off and be a great
researcher somewhere. And he was playing around with trying to get this to
be full recognition. And he said, well, what if I try to do this as keyword
spotting. And we elicited the help of Joseph Keshet, who had done
some discriminative keyword spotting, and built a model that basically says,
hey, I'm going to develop this feature-based keyword spotter that basically
says all right I've got two segments of speech and one has the word, one does
not have the word and the objective function that I want to optimize
basically says my score for the thing that has the word had better be better
than the score that doesn't have the word. That's the very simple version of
that. So we want this function.
I'm not going to talk too much about that, but the idea that's behind this is
that this now becomes something we can parameterize with some lambda-weighted sets of
features that we extract from something.
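As a hedged sketch of that training criterion -- not the exact published objective, and the names, margin and learning rate are my own assumptions -- the lambda-weighted score of the segment containing the keyword should beat the score of the one that doesn't, by a margin, with an update when the constraint is violated:

```python
import numpy as np

def score(weights, features):
    return float(np.dot(weights, features))

def margin_update(weights, feats_pos, feats_neg, lr=0.1, margin=1.0):
    """feats_pos: features (numpy array) from the utterance containing the
    keyword; feats_neg: features from the utterance that does not."""
    loss = margin - (score(weights, feats_pos) - score(weights, feats_neg))
    if loss > 0:  # constraint violated: push the two scores apart
        weights = weights + lr * (feats_pos - feats_neg)
    return weights
```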
And in this case we're going to use the alignment model to basically say
what's my best alignment of the word and then we're going to extract features
from those segments found in the alignment. And currently he's done this
with phone based alignments but what he's trying to work on now is getting
articulatory feature alignments.
And at the moment, we just finished a rough paper draft for MLSLP -- machine
learning in speech and language processing, a new symposium
that's coming up.
It turns out that's actually worked pretty well in low resource settings
where you don't have a lot of data which is important to us if we, let's say,
have data from a new language where we don't have
a lot of data.
So we haven't conducted any experiments on that kind of multi-lingual
setting.
So I realize that was an extremely hand wavy sort of version of this. But I
do want to talk a little bit about the stuff we've been doing with electronic
medical records.
If anybody had any questions about that, but that's just to give you the
flavor of we've got articulators alignments and then we're going to use those
to extract features for doing one of these keyword spotters. That's the real
rub.
>>: You say you have two cuts for the CRF?
>> Eric Fosler-Lussier: So this turns out -- the model turns out not to be a
CRF in that case. It's kind of like a perceptron. Yossi wouldn't call it a
perceptron, I call it a perceptron. It's actually trained using a max-margin
style approach. So you could think of it as an SVM, linear SVM.
Okay. Last bit. I've been working on some electronic medical record stuff,
which has kind of been eye-opening, taking me back to the CRF stuff, where
did CRF stuff come from in terms of language processing.
So Zillo was saying, I've been dragged back into language [laughter] it's
like, whoa. So let me give you a medical record. This is a sanitized medical
record out of the OSU Medical Center. And you'll notice some interesting things
about this. So you've got some information corresponding to the patient and
the doctor and this is all structured data.
Right? And then we've got a bunch of unstructured data that's in particular
regions. So what's the history, what did we find in terms of tests, what are
we planning to do? And the idea is that we're going to get multiple of these
nodes. And eventually, what we want to be building over the lifetime is a
model where we say here's our first narrative. We have some things where
here's our admission date. There's some things that happen before admission.
There's some things that happen after admission but before discharge and
there's some things that happen after discharge or plan to happen after
discharge.
And here's the second admission which is hopefully after the first discharge
date. Hopefully way after but not often. And some of these things. And so
if there's a readmission, you know, we now know that the medical history is
going to correspond to some events but maybe some other events that weren't
in this original thing. And how do we find all these correspondences. So
this becomes an interesting problem of how do I know when a medical event in
one document or within one document corresponds to another medical event
within that document or across documents. So there's sort of two problems.
Within document and cross-document, co-reference resolution. Medical events.
It's kind of like anaphora resolution in language processing but here we're
talking about events, event resolution.
So the obvious thing to do -- and I just want to point out: Here we've got K
competitive use these are labeled twice in the same data and this is the
same. But you might have chest pain here but this chest pain actually is not
the same as that chest pain.
We'll see an example what that really means. And the thing is that labeling
these things is hard. You can get reasonable inter-annotator agreement if you
spend a lot of time at it but it takes forever. We've got -- at the point of
this study we had three patients and about 35 clinical notes. It really
takes -- that was with four annotators going through and doing all the stuff.
So it's pretty ugly. So we want to use some sort of semi-supervised
techniques, right?
And the question is can we use unlabeled data to do that, right? So what we
want to do is build a temporal classifier to do this because when people
think about co-references they often think about semantic concepts, chest
pain, acute chest pain -- although the same thing, so semantic overlap, right?
But it turns out there's a lot of cues that correspond to temporal
information about events that really can tell you is this the same
event or not. I'm going to skip the semantic -- so I'm going to -- we take
it for granted we extract a bunch of semantic features that are medically
relevant.
But I want to talk about the temporal stuff, because that's actually where
our newest work -- people have done that kind of thing for the semantic
stuff.
So the idea is that what we're going to do is try and build coarse time bins
and build a sequence tagger conditional random field to assign medical events
to relative points in time -- points in time relative to admission.
So we may have things that are way before admission -- just, way before
admission, like a year -- before admission, after admission, or
after discharge. Those are -- or at admission is the other one.
The idea is that here's our medical note. Right? So you've got K
competitive use and hypertension. It's "with a history of," so sort of way
before. Right? Chest pain, which started two days ago. It's also before
admission. He does not have chest pain now. That's after admission.
But ever since the episode two days ago before admission. So here we have
chest pain. These two are not co-referent but this episode is co-referent
with the chest pain. Okay. So we're going to need some semantic information
to know that chest pain could be an episode. So that's what the semantic
stuff is doing and temporal stuff is basically giving us a different view.
So we're going to extract state functions and then we're going to have some
transition functions in the usual CRF thing. So what are the kinds of
functions you would actually extract? One is what section are you currently
in? It's not a guaranteed thing. So past medical history doesn't
necessarily always -- it's going to be a bias definitely to things that are
before admission. But things doctors don't always pay attention to what
they're typing where.
So things that can happen after admission still occur in that. So we look at
the section -- but it's a good clue, right? We're also interested in lexical
features. So "history of," "presented with" -- these are going to be the preceding
bigrams that will tell us something.
We also extract things like not only just the preceding bigrams but also
parts of speech and stuff like that so we know that oh this is probably a
temporal modifier so that's a good thing to know. The other thing is that
there's actually a lot of time relative time references in here. So two days
ago. Two days ago. Well, these two things are going to map to this. So we
look for the closest relevant bits of temporal expressions. So we do a
temporal expression parser, for example. And essentially what we end up
doing is building this model using all of these expressions and we can now
label every event with we know the admission date. Because that's in the
structured data. And now we basically say we can give a relative time
essentially based on where we're assigning things in the temporal bin.
Why is this important to us? Well, essentially we're going to then take our
semantic features as one view of the world. Temporal features are another
view of the world and we're going to do semi-supervised learning in a
multi-view type environment and we play with two different multi-view
strategies. One is to basically do co-training where you basically say I'm
going to train classifier on one type of feature. Predict label instances in
my unlabeled data that I'm pretty sure about and then use that to train the
other and then go back and forth, go back and forth.
>>: This bigram.
>> Eric Fosler-Lussier: It's a huge sparse vector. Right. And the other
option is to basically talk about posterior regularization. I'm not an expert
in posterior regularization; I'll wave my hands by saying, look, you can
develop a probability structure and then have a prior from some other model.
Well, the other model, the temporal model now serves as a prior for the
semantic model. And semantic model is a prior for the temporal model. Then
you can again do that kind of labeling so that you try and constrain the
amount that you disagree on the outputs of those two things. So you go back
and forth, back and forth.
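A schematic sketch of the co-training strategy just described, where the classifier wrapper, thresholds and round count are my own assumptions: the semantic view and the temporal-bin view take turns labeling confident examples from the unlabeled pool for each other.

```python
def co_train(labeled, unlabeled, train, predict_proba, rounds=5, thresh=0.9):
    """labeled: list of ((sem, temp), label); unlabeled: list of (sem, temp).
    `train(pairs)` returns a classifier; `predict_proba(clf, x)` returns a
    list of class probabilities."""
    sem_data = [(s, y) for (s, t), y in labeled]
    temp_data = [(t, y) for (s, t), y in labeled]
    for _ in range(rounds):
        sem_clf, temp_clf = train(sem_data), train(temp_data)
        still_unlabeled = []
        for s, t in unlabeled:
            p_sem = predict_proba(sem_clf, s)
            p_temp = predict_proba(temp_clf, t)
            if max(p_sem) >= thresh:          # semantic view teaches temporal view
                temp_data.append((t, p_sem.index(max(p_sem))))
            elif max(p_temp) >= thresh:        # temporal view teaches semantic view
                sem_data.append((s, p_temp.index(max(p_temp))))
            else:
                still_unlabeled.append((s, t))
        unlabeled = still_unlabeled
    return train(sem_data), train(temp_data)
```

Posterior regularization replaces the hard label-swapping with a soft constraint that the two views' output distributions should not disagree too much, but the back-and-forth structure is similar.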
So just to give you a sense of this, if you do supervised learning on a 60/40
split of the data -- this is the dataset for which we have all the data -- you get
about 77 percent on clinical notes: a precision of 77 percent on
is-this-co-referent, with a recall of 90 percent. Co-training basically works a
little worse than posterior regularization for this, but essentially we're
getting pretty close to the supervised learning, which is surprising to us,
and we have not actually taken advantage of all of the vast numbers of
clinical notes that we could do. That's to be the future experiment.
So, summing up. Sorry to blast through the last bit. But I just wanted to give
you a flavor of the kinds of things we're working on in the speech and
language technology lab. We've got a number of other projects, as I said.
But a lot of the stuff we're doing is basically how do I think about
intelligent feature extraction and then plugging it into relatively standard
segmental tools. So we've been doing this in speech processing -- and how do
you do intelligent factorization of that space -- and in the text processing we've
been using this as features for something else, right?
So one of the things that we've noticed is that, basically, if you start
using things like this -- people have talked about the CRF as a replacement for HMMs,
discriminative HMMs -- a linear-chain CRF is more powerful than an HMM when you
start thinking about transition observations.
I think the practical import of it is that the Heigold paper kind of shows
you can factorize the space back into that -- in practical terms this is an
easy way to have this observational dependence, right? And so we've been
using the sequence models also as features for other learners and finding out
that incorporating sequence information at the lower levels can often help
when you're doing the prediction for other tasks as well. So those are the
two main messages. So that's all I have to say. [applause].
>>: [inaudible].
>> Eric Fosler-Lussier: I'm sorry?
>>: You mentioned [inaudible] off. Do you have any thought how to
incorporate that?
>> Eric Fosler-Lussier: No, but I would love it if -- I've been wrapping my
brain around that for a while, and I have some ideas on like curve modeling
and stuff like that. But it's not -- I don't have anything solid.
>>: Whose idea how you can incorporate [inaudible] in CRF systems?
>> Eric Fosler-Lussier: I think I'll have Jeff do that. [laughter] I mean,
that's an interesting question, because the state space I think the way that
Jeff does it is not -- is not a bad start because he thinks about -- it's
hard to talk about it with him in the room. [laughter]. But you know
thinking about the language model state space, you know, allows you to think
about any depth of language modeling you want because you're thinking about
the state space, but I think if you -- I think my take on it is you're
probably -- you're not going to be able to get it all in one system. And we
get this with the deep nets and we get all this stuff is that I kind of see
this as the acoustic smoothing over a number of lower level estimates that
can give you some sort of phonological substrate and then you use that to go
upwards from there. So I think the language model stuff, you know, you would
do -- you could do some interesting things with either maxent- or CRF-
style models on top of the lattice. Kind of the way that Jeff thinks about
this model, I think. And so as long as you can -- so that would be my
answer. But I don't have a good answer for you I think is the upshot.
[applause]