>> Dong Yu: We're pleased to have Eric today. Eric got his Ph.D. degree from U.C. Berkeley, and he's an associate professor at Ohio State University right now. He's been working on conditional random fields for many years, and today he will talk about his recent work in this area.

>> Eric Fosler-Lussier: Thank you. So when I was trying to figure out what to talk about in this talk I thought, well, my students are doing all this wacky stuff, and how do I sort of summarize all of it? So let me give you a quick overview of the lab. Basically my viewpoint is that we want to be looking at linguistics to help us do speech recognition and language processing, and we want to be looking at language processing and speech recognition to help us do linguistics. I'm going to talk more about the former point than the latter point today, but that is sort of the overall thing that we're doing. Traditionally we've been working a lot with machine learning on the CSE side and with a lot of people in linguistics to get some good insight. Recently I've also been dabbling in bioinformatics with a partner over at the Wexner Medical Center; I'll talk a little bit about that today too, just to give you a broad overview. I was kind of surprised when I actually sat down and asked how many projects we're actually doing, and there's quite a few, both on the speech side and the language side. Up until recently we had Chris Brew, who left for ETS, so I inherited a number of language projects.

So a lot of the stuff we've been doing traditionally, that people know me for, is working on speech recognition with CRFs and feature-based stuff. There's been some work with Karen Livescu doing discriminative FSTs and also some of the feature-based stuff. We just recently got involved in a government project on multi-lingual keyword spotting. And you can read all the rest of the stuff. We do a little bit of linguistics, so computational models of child language acquisition -- that's work with Mary Beckman -- and some stuff on the medical side looking at timeline extraction from electronic medical records and virtual patients. And there's also a little bit of work on kids' literature, on how to predict the reading level. So there are a number of partners in all of these things that I won't have time to go into in detail.

So today I said, okay, we'll try and give you a slice across. I'm going to try and balance my speech time and my language time, although I will tend to talk more about the speech, I think. So here's the overview. There are a lot of old familiar faces in the field in the room, so for those who don't know what I do, I'll give you a sort of "how did we get here" background on conditional random fields and stuff like that. Then I'm going to talk a little bit about some of the stuff that we've been doing with segmental conditional random fields and articulatory feature modeling, and I'll talk a little bit about the language stuff we've been doing, where we've been taking CRFs -- which kind of grew up in the natural language community -- and using them as event extractors to do semi-supervised learning.

Okay. So, background: how we got into this mess. A lot of the things that I've been interested in involve thinking about how we break apart one of these problems into some sort of feature description and use combiners to do feature-based modeling. And that's sort of the overall theme of the talk.
And the idea is that in general we're going to be looking at ways to extract features from either speech or language and then combine them using some sort of weighted model, usually a log-linear weighted model. And just to give you a flavor of how I got into this whole mess, a little bit of the old history: I did my thesis way back when on pronunciation modeling. So I stole this slide from Rohit, who stole the slide from Karen, on pronunciation variation in Switchboard. It comes out of work from Steve Greenberg where they did fine transcriptions of how people pronounce words in conversational speech. And you'll see that there are a lot of different ways people say things like, for example, "probably." In fact it's very unlikely -- in fact, it was not observed -- that you actually see the full pronunciation of "probably." Usually you say "probably" or "probably," in some reduced form. So there's a lot of variation. And this is a real problem from the viewpoint of trying to build a good model of this, because we don't have complete linguistic evidence in the signal, according to linguists. These are linguists transcribing these utterances, saying here are the things I've observed.

So I did a thesis, which nobody should ever look at, on pronunciation variation, and people back in the '90s were getting moderate success by saying, I'm going to take my pronunciation dictionary and add these pronunciations -- I don't want to add too many of them because things get confusable, and things like that. But we never got too far with that approach, because the more pronunciations you add to the system, the more likely it is that you're going to start confusing things.

So let's take a little bit of a closer look at some of the data. And the thing that I'm going to show you is actually inspired by some work that was done by Murat Saraclar and Sanjeev Khudanpur at Hopkins, which basically asked: what are the real acoustics that go along with the variations we're observing? So I have the following little example, if we just look at the Buckeye corpus. This is again conversational speech, but we get long recordings per speaker -- Switchboard has like five minutes of each conversation side; these are like hour-long transcriptions. So we get a lot of instances of variation within the speaker. And we can take a look at things like, let's just look at ah versus eh in the Buckeye corpus. And I'm going to look at just the formant structure -- where are the peaks in energy -- for ah, for eh, and then for the cases where the dictionary said there should have been an ah but the transcriber said that it was an eh. So that would be like if I had "bad" and somebody transcribed it as "bed," which would be a confusable thing, but people do say this. So here are the data: our transcriber says these are ahs and these are ehs. This is first formant frequency versus second formant frequency. If you've done linguistics you'll know this is one section of the vowel triangle. So --

>>: More in terms of duration than in terms of the formants?

>> Eric Fosler-Lussier: Not in this corpus. The duration is actually roughly equivalent for this speaker. So not for this. But I agree, I'm only looking at one dimension of a very multi-dimensional thing. So just to give you -- and the circles really mean nothing, statistically, but just to give you a sense -- here's a region where we see ahs, and here's a region where we see ehs.
So now here's your test. What happens when an ah is transcribed as an eh? The observer saw this particular thing and said, okay, this was supposed to be an ah but it was an eh -- the transcriber decided to change the canonical interpretation to eh. Where do you think that data should be? It should be, you would hope, in that overlapping region, right? Of course it's not, otherwise why would I be showing this slide, right? So the actual data, which is marked with the blue stars, is all over the place. Some of it is in the region that you would expect. Some of it is in ah territory. Some of it is actually higher than eh, more over toward ih, and some sits lower than ah -- and there's no phoneme lower than ah, right? Again, this is a very hand-wavy argument, but this kind of data was borne out more statistically in the study that Murat and Sanjeev did. But here's the key: where is the data not? The data is not over here -- these are the back vowels over here. So what is the transcriber really trying to tell us? It's not that ah really was pronounced as an eh. It's that it was a front vowel. It should have been an ah; I don't know what it really should be. I don't know what the vowel height should be. So I have some uncertainty as to height -- that's sort of my loose interpretation of this. So we might have uncertainty in some dimensions of the variation, but certainty in other dimensions. And if we start thinking about the linguistics in terms of multi-dimensional variation in some sort of phonetic space, then we can start modeling some of these phenomena. So, right, we can't take those transcriptions at face value. Okay?

So where I've been looking at different kinds of representations is to think about subphonetic representations, such as phonological features or articulatory features, to represent these kinds of transcriptional differences. The interesting question then becomes how you incorporate that into some sort of statistical model.

So backing up a little bit to the statistical model side: I grew up at ICSI, where neural nets were a big thing. So one of our favorite things to do was to basically take some acoustics, put a multi-layer perceptron on it to try to predict which phone class this is given this little local chunk of speech, and effectively get a posterior estimate out for every frame. And if you are an ICSI person you might do a diagonalization and decorrelation and put that into an HMM. But one of the things that we did is basically say, okay, we're going to build a log-linear model on top of that, which is a conditional random field. And I'll talk a little bit more about the conditional random fields in a second. But essentially what you're getting at is that instead of having these frame-level posteriors you end up getting sequence-level posteriors out. Why do I like this so much? Because we can start playing around with saying, hey, look, I have little local detectors of: is this a manner? Is this a place? Is this a height? I can talk about these as features that I'm extracting from the data and I can talk about combining features of different kinds. So this can be used in place of, or in parallel to, other types of things.
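As a rough illustration of that frame-level setup -- a minimal sketch of my own, not the actual system, assuming the MLP posteriors are already computed -- the log-posteriors become per-frame state features and a linear-chain CRF turns them into a sequence-level posterior via the forward algorithm:

```python
# A minimal sketch (my own illustration, not the toolkit described in the talk) of
# MLP frame posteriors feeding a linear-chain CRF: state features are the
# log-posteriors, and the forward recursion gives a sequence-level posterior.
import numpy as np

def crf_sequence_log_posterior(log_post, labels, W_state, W_trans):
    """log P(label sequence | acoustics) for one utterance.

    log_post : (T, K) log MLP posteriors per frame (the "feature functions")
    labels   : length-T label sequence (ints in [0, K))
    W_state  : (K,) weight on the matching log-posterior (a hypothetical parameterization)
    W_trans  : (K, K) transition weights
    """
    T, K = log_post.shape
    # Score of the given path: state features plus transition features.
    path = (W_state[labels] * log_post[np.arange(T), labels]).sum()
    path += W_trans[labels[:-1], labels[1:]].sum()
    # Log partition function Z(X) via the forward recursion in log space.
    alpha = W_state * log_post[0]
    for t in range(1, T):
        alpha = W_state * log_post[t] + \
            np.logaddexp.reduce(alpha[:, None] + W_trans, axis=0)
    logZ = np.logaddexp.reduce(alpha)
    return path - logZ

# Toy usage: 5 frames, 3 phone classes, random "posteriors".
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(3), size=5)
print(crf_sequence_log_posterior(np.log(post), np.array([0, 0, 1, 1, 2]),
                                 W_state=np.ones(3), W_trans=np.zeros((3, 3))))
```

In the systems described here the feature set is richer (phonological feature detectors, transition features that also look at the observations), but the scoring structure is the same.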
So, for example, Asela did a lot of the active work to begin with on trying to extract things like the sufficient statistics as the feature functions, right, and using a hierarchical or hidden conditional random field to do this kind of estimation.

>>: [inaudible] also at the same time?

>> Eric Fosler-Lussier: Yes, so we've done it. You have to put in x and x squared, so you get the sufficient statistics. It helps us a little bit. But it complicates things -- it just doesn't help enough to add that in all the time. But it's a good thing.

>>: [inaudible].

>> Eric Fosler-Lussier: Please do.

>>: So to train this classifier you need a training signal. And you just showed us that the training signal was unreliable.

>> Eric Fosler-Lussier: Right. You're asking something that's not on the slides, but it's a very good question. So there's a -- let me -- there are two answers to that question. In fact, they do come into the slides later on, so you're anticipating. As a first guess, what we're going to do is basically take a phone-level alignment, usually from an HMM -- that's one of the downsides to this, that if you're not doing a hidden conditional random field, you actually need to have that label sequence. And so what we do is we take a phone sequence and we can just project it back to what the appropriate features were. We train independent nets, and what we're capturing, essentially, is in some ways the errors that each one of these makes in its estimation. And we might get a little bit of feature asynchrony or something like that. But if you want to go to models more like the kind of articulatory feature model, you're going to need some other mechanism to try and handle that misalignment. And I'll talk a little bit about Rohit's model, which you might have seen before, for trying to do articulatory feature alignment. So hang on to that thought.

>>: On that point.

>> Eric Fosler-Lussier: Yeah.

>>: My understanding is that with a hidden CRF you don't need to have a precise alignment at the frame level to train it.

>> Eric Fosler-Lussier: To train a straight-up CRF you need to have the full label alignment. To train the hidden one you don't.

>>: I see.

>> Eric Fosler-Lussier: Right. Because the alignment is actually hidden, it's actually one of the things that you need to -- but you can do an embedded Viterbi-style training, which doesn't -- with one realignment basically you get back to where you need to be. So it's not a big deal to --

>>: When you put this CRF in the form that models the whole sequence, then by default you get the hidden part there. But you don't really expect -- rather than -- these things here.

>> Eric Fosler-Lussier: So the way we're training it, you do have -- well, we've used both one-state and three-state type models. But, yeah, you do actually have an explicit -- the way we train it, we have this explicit labeling, then we go and we Viterbi relabel, and that gets us away from the mismatch. Okay. That actually gets to an interesting point about the criteria, so let me hold off on that idea for a minute.

Okay. So one of the very few equations that I have in this talk: basically what we're interested in is a label posterior given the acoustics, and what we're going to do is talk about this in terms of a bunch of what we'll call state functions, which associate with each label essentially some feature function here, some generic feature function.
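As a hedged reconstruction -- the notation here is mine, not copied from the slides -- the linear-chain CRF posterior being described is roughly

$$
P(Q \mid X) \;=\; \frac{1}{Z(X)} \exp\!\Big( \sum_{t} \Big[ \sum_i \lambda_i\, s_i(q_t, X, t) \;+\; \sum_j \mu_j\, f_j(q_{t-1}, q_t, X, t) \Big] \Big)
$$

where the $s_i$ are the state functions (for example, the MLP posterior outputs), the $f_j$ are the transition functions over pairs of labels, which may also depend on the observations, and $Z(X)$ is the normalizer obtained by summing the same expression over all label sequences.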
I'm going to label those as my state functions, and also the transition functions, which talk about pairs of states -- in this case, because this is what we call a linear-chain CRF -- and these may also have associations with the feature functions. This actually turns out to be an important thing for us. It's an open question how important it is to have observational dependence on the transitions. We find that having the ability to talk about whether you're going from one state to another based on the observations -- which you can't do in an HMM straight up -- is actually important for us; we get a little bit of gain out of doing that.

Okay. So just to sum up this sort of 2008-and-before era: basically you can play around with different versions of the feature functions. So, for example, if you're Yasser Hifny, you might say: here are my 10,000 Gaussians and which are my 20 closest ones, and that would be a very sparse feature space. So there's a nice thing about this, that you can talk about different kinds of features, plug them in together, and just train. Just to summarize the results that we got in previous studies with phone recognition: basically we found that we were beating the tandem HMM systems using the posteriors as input, with many fewer parameters. Even just a monophone-based CRF was actually beating our triphone-based system. Admittedly this is not a discriminatively trained HMM, so caveat there. And for TIMIT, training discriminatively is a mess -- not enough data to get it done.

>>: Until IBM --

>> Eric Fosler-Lussier: Until IBM. Brian's going to prove me wrong, exactly. And the transition features actually help us quite a bit -- that was one of the other things we saw. And we played around with things where we combined a large number of phonological features and phone posteriors, and we found that that was a good, effective combination technique. So any questions on the background before I segue into what's new?

All right. So I'm going to talk a little bit about an upcoming Interspeech paper we have on boundary-factored CRFs. I'm glad Jeff's in the audience, because I'm going to sort of sing to his choir. This whole talk in some ways is like bringing coals to Newcastle, because you guys do a lot of this stuff within the speech group. So we're adding little bits of knowledge to the kinds of things that you guys are looking at.

So the frame-level approach, while nice and easy to train, has a real problem, and that is that what we're trying to do is maximize the conditional likelihood of the frame labels, as opposed to, upstream, what we really want, which is words in the end -- not just phones, we want words. There's a big criterion mismatch. And it gets down to exactly this issue about segmentations. The way to think about it is: if I had "cat" -- let's imagine I recognize "cat" where the K is one frame long and then there's a bunch of ahs and a bunch of ts. If I label that one frame wrong, it's a big problem in terms of the recognition, right? But the criterion basically says, well, you just got one frame wrong out of this long sequence, right? So the frame-level criterion is really kind of a bad mismatch.
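A toy version of that "cat" example, just to make the mismatch concrete (the frame labels and lengths here are made up for illustration):

```python
# Reference: one frame of "k", then long "aa" and "t" regions.
ref = ["k"] + ["aa"] * 6 + ["t"] * 3
# Hypothesis: only the single "k" frame is mislabeled as "aa".
hyp = ["aa"] * 7 + ["t"] * 3

frame_acc = sum(r == h for r, h in zip(ref, hyp)) / len(ref)

def collapse(frames):
    """Collapse a frame-level labeling to its phone sequence."""
    return [p for i, p in enumerate(frames) if i == 0 or frames[i - 1] != p]

print(frame_acc)                      # 0.9 -- looks fine by the frame criterion
print(collapse(ref), collapse(hyp))   # ['k', 'aa', 't'] vs ['aa', 't'] -- the k is gone
```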
As a side note to that: one of the things that you want is phonotactic grammars and things like that -- this sound is likely to follow that sound -- and on a frame level those probabilities, or actually potentials in a CRF, get really spread out because you've got the whole sequence of frames. So this is kind of bad from that point of view. And we want to be able to incorporate long-span features like duration -- exactly to your point. Duration is very distinctive, especially in a lot of languages that are not English. Formant trajectories, syllable and phoneme counts, all these kinds of things. So we want something else to work on.

So, segmental CRFs to the rescue, da da da da. The idea is I'm going to change my labels from Q to Y, where Y is now a segment as opposed to Q, which is my frame. And we're going to talk about labels being on the segment level, but every label is going to correspond to a chunk of frames -- and probably Jeff has talked your ear off about this. But the idea is that you've got this implicit segmentation going on that basically says, hey, this Y3 actually corresponds to two observations, and the segmentation is somewhat hidden from me. But the nice thing is that now phonotactic grammars actually fit: we're really talking about this phone transitioning to that phone. So when you plug all of this in, I'm going to take all my state functions and transition functions and put them into one form here. What I'm going to do is look at the possible segmentations and sum out over all of them, and now I can get a posterior on the sequence rather than a posterior on just the frames. Okay. So that's essentially the model that's built into SCARF.

And one of the mantras in my lab is, you know: what is the stupidest thing you can think of to do first? And one of the questions we asked is: what if you relax that assumption and basically say, hey, I'm going to actually just try and jointly predict the labels and the segmentations at the same time? That saves me on training time, because I don't have to do all of this hypothesizing and summing over all the possible segmentations. But on the other hand it is exactly the same type of Viterbi assumption that I'm making in terms of the segmentation: if my segmentation was wrong in the first place, then I'll have to correct it with a retraining. Okay. So this is the model that we're working with. But I think a lot of the lessons that we learned are really transferrable between the two; I don't think there's a big deal about this. This is mostly because it makes things computationally tractable within the university setting for us. Yes?

>>: When you're doing decoding, say, don't you still have to consider all the possible segmentations?

>> Eric Fosler-Lussier: You do have to consider all the possible segmentations during decode time, right. This is mostly to save on the training time. We do put a limit on our segment length -- that's another thing that we have done to optimize the decoding. And one of the things we're trying to do -- and I was very annoyed to see Jeff coming out with a paper on this right before we did -- is to do one-pass decoding directly from the acoustics. So that's why we were kind of interested in getting into this model.

>>: But for estimation, won't the computation be the square of the rate?

>> Eric Fosler-Lussier: No. No.
So you have to -- so there are some computational tricks.

>>: You have to limit --

>> Eric Fosler-Lussier: So you limit -- I'll show you in a second.

>>: Training doesn't fix -- don't you still have to do the work for Z?

>> Eric Fosler-Lussier: You do have to do the work for Z. But -- oh, right. So this is the problem with presenting your student's work: the reason you went into this is not the reason that ended up mattering. So you do need to do it for the Z, you're right. And so we did these computational tricks on the segment length to basically make it more manageable. So basically we experimented -- going off-slide a little bit -- we experimented with the idea -- actually, let me come back to that, because I'll have a slide that I can point to in a second. So I'll just go up here. On the efficient segmental CRF side, the upshot is that we played around, in a classification task, with what would be the optimal maximum size of segments. If I had a segment longer than whatever my maximal size was -- I'm going to call that D, right? -- then I would go and split a phone into two pieces. So I might have the first half of the phone and the second half of the phone, or maybe multiple chunks of length D. The question we had was: how badly does that hurt? And we found out, after doing some experimentation, for the TIMIT task at least, that having a maximum size of about 10 was a good trade-off between decoding speed and accuracy. We didn't really lose much accuracy by saying my maximum duration was 10; if I had something longer than 10 I'd just postulate two segments in the end.

All right. So one of the ways that I've been thinking about this problem is that what you're really talking about is a different kind of CRF when you're talking about a segmental CRF, in some ways. What you're really talking about is a model where time now becomes sort of a first-class variable, right? So the segmentation times of my segments are actually also variables that I want to hypothesize. So when I have a graph like this, a CRF basically defines its probability structure over the cliques in the graph. So I have some natural cliques here, because I can talk about this phone being between this time point and this time point. Okay? I can talk about extracting evidence for that from the acoustics. I can talk about a prior that says this phone is likely to be this long, right? But I also have these inverted triangles, which basically say, hey, here is a phone and its successor and I believe the boundary is at this point, and I can talk about acoustic observations that correspond to the transitions between phones. So there were a lot of models that at some point tried to -- like there was a model called the SPAM model out of ICSI that tried to focus on transitional areas in speech rather than segmental areas. And this is a natural way to think about incorporating that type of evidence of a different nature. Okay.

>>: Can you use it to model the situation you mentioned earlier about --

>> Eric Fosler-Lussier: The formant transitions? That's an interesting question. So the steady state within -- so the natural -- like, let's imagine Y were a triphone; then you would expect a formant transition would be appropriate here.
But if you're looking at a local area of formant transitions, from ba to eh, for example, this might be an area where you could actually incorporate features like that. Right? So it gives you the flexibility to start thinking about the linguistics of the situation, and to plug in: I've got this great transition detector that will do X, and I can plug it in as something that focuses on boundaries versus steady state.

Now, the problem -- and I said before we were very interested in these transitional things, in having acoustic dependence on the transitions between phones -- is that if you try to do this directly within the segmental CRF, you incur essentially an N squared D times the length decode complexity, because you need to keep track of essentially the durations of both of these segments, the pair of them, right? You have to look at all possible pairs of segments. So what Ryan discovered is, hey, let's do the following: we're going to model a boundary itself as an intermediate node. This is deterministic. This node essentially carries something about its segmentation, but this one doesn't carry any information about the segmentation. So you give up the ability to talk about the durations of both segments jointly -- you do give up something with this model. But if you don't have features that talk about that, then you can basically say, look, I'm interested in knowing the time point where this segment ends and transitions to the next one, and I might look over a local window of features. So we're changing the problem: you're no longer able to look over the entire span of both segments, but you're allowed to look at a local window around a boundary. That's really what we want to focus on. So Ryan called this a boundary-factored segmental conditional random field.

>>: Essentially what the portion of values --

>> Eric Fosler-Lussier: So this is a deterministic value. This one basically carries -- like, this might be an ah that goes for five frames, this one is an ah that goes for three frames -- and this one just says you're going into an ah. So that makes this -- it's not deterministic, because it essentially carries a prior on the duration of segments, right, but the time point is actually carried over from the previous time, right? So one way of thinking about it is that this carries a label and a time point, as opposed to a label and a duration. So it turns out that with that clever trick you can actually get quite a bit of speed-up. So we did some --

>>: For the second one, why do you still call it a segmental CRF? It's not segmental.

>> Eric Fosler-Lussier: It is segmental, because we still have these segments. The labels are still on the segmental level.

>>: But you said you're not using any of the features about the segment. Segments still -- I don't think --

>> Eric Fosler-Lussier: In the transition. In the transition boundary.

>>: The transition feature, you're only looking at, like, the local --

>> Eric Fosler-Lussier: That's right. And these are transitions between segments, right? This segment still gets to observe everything about its segment -- sure, this gets to observe all of the data that's within its box. That is a segmental thing; it's not a single frame.

>>: But at the boundaries you have to hypothesize a little bit to extract certain features from.

>> Eric Fosler-Lussier: That's right. You have to hypothesize the segmentation.

>>: So where do you save the computation over here?
>> Eric Fosler-Lussier: Because we don't have to -- we no longer have to model the joint pair of phone one with duration D1 times phone two with duration D2. We compile it down into a single time point, so the transition feature is only allowed to look at local boundaries, but the segmental feature, the state feature, is allowed to look within its segment. And I want to point out that most of the models out there don't even use information on the observational side with those transitions; they're mostly just priors. So it's particularly this question of how you get the observational dependence into the transitions in the right way.

>>: You're talking about the computation savings in terms of decoding?

>> Eric Fosler-Lussier: Yeah.

>>: Because otherwise you don't get a feature -- you get feature extraction based on the segment, the segment how --

>> Eric Fosler-Lussier: So here's how we -- okay, you're anticipating me. Good. Good. Let's do a head-to-head comparison between a frame-level CRF and a segmental CRF. That was the key thing we realized nobody had really done -- see what happens. We'll use the same exact feature input. So I'm going to use the posteriors coming off all my phones, or my phones and phonological features, and I'll train up my usual Jeremy Morris-style frame-level CRF. Now the question is: how do we go to extracting segment-level features? Remember, you hypothesize the segmentation and do feature extraction. Basically what we're saying for the state features is we'll sample uniformly from my posterior space (see the sketch below). So I might take five snapshots. If the segment is 10 frames long I'll make them evenly spaced; if it's three long I might oversample. But what I end up with is a fixed feature vector that basically describes different points within that segment. We played around with different things, like what's the maximum of this value, what's the maximum within the window, all these kinds of things -- we played with a number of things, and this seems to work well. But you could imagine plugging in something else; there's nothing that says you can't. What you need is an online feature extractor that basically says: I'm hypothesizing this segment at this time, give me the features that correspond to it. Ours basically just says I'm going to uniformly sample within my posterior space -- in time, I should say. We throw in a duration feature as well. For the experiments I'm going to show, the maximum duration was 10.

For the transition features we played around with two different things. One is a frame-level transition feature that basically says: give me the posteriors that surround the boundary frames within some window. And I also might incorporate a direct MLP prediction of whether this is a boundary or not -- you can train an MLP that does that as well. Now, the segment-level version basically says I have to look at the hypothesized segmentation on either side and extract the features corresponding to both sides. So if I want the segment-level transition features, I now have to hypothesize both the segmentation on the left and the segmentation on the right. So this is more expensive, and this is what we're trying to get away from. But we want to know how much we lose by actually going to this local boundary idea.

Okay. So this is core test accuracy. We also tend to report it on the enhanced accuracy, which is the full dev set; it tends to be two points higher.
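A minimal sketch of that kind of online segment feature extractor -- my own illustration under the assumptions just described (five uniformly spaced posterior snapshots plus a duration feature on the state side, a window of posteriors around the boundary on the transition side), not the actual code:

```python
import numpy as np

def segment_state_features(post, start, end, n_snapshots=5):
    """post: (T, K) frame posteriors; [start, end) is the hypothesized segment."""
    # Evenly spaced sample points; short segments get oversampled (indices repeat).
    idx = np.linspace(start, end - 1, n_snapshots).round().astype(int)
    duration = np.array([end - start], dtype=float)
    return np.concatenate([post[idx].ravel(), duration])

def boundary_transition_features(post, boundary, window=2):
    """Posteriors in a window around a hypothesized boundary frame
    (clipped at utterance edges, so the vector is shorter there)."""
    lo, hi = max(0, boundary - window), min(len(post), boundary + window)
    return post[lo:hi].ravel()

# Toy usage: 20 frames, 4 classes.
post = np.random.default_rng(1).dirichlet(np.ones(4), size=20)
print(segment_state_features(post, 6, 14).shape)      # 5 snapshots * 4 classes + 1 duration
print(boundary_transition_features(post, 14).shape)   # 4 frames * 4 classes
```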
If you wonder why these numbers look a little lower than Jeremy's, it's because he tends to report on the enhanced set. So on the core test, basically, if you plug this stuff into a tandem HMM you get roughly the same thing as the frame CRF; on the enhanced set the difference tends to be bigger and statistically significant. When we just use segmental state features with no transition information, we get essentially a boost -- so this is really compared to that -- you get about two points going from frame to segment. Now, when you do full SCRF training, where you have to look at all pairs of possible things, you get a training time of about 62 minutes; when you collapse it down by having this fan into a single state, you actually improve training time quite a bit. This is per epoch of training. So when we put in the transition feature we get another point here -- that's nice. It turns out if you instead do some sort of windowed computation you can actually get a little bit more. And you'll notice that this becomes really expensive to train in the CRF world, right? It's not so bad -- it's about eight times faster or so to just focus on the boundary rather than looking at all pairs. So you can do this trade-off of: is it worth the extra point to do essentially three times as much training? You can play with this idea; there's some happy medium. We're just showing you some choice points. Now, this was using just the phone posteriors.

>>: For the numbers in the last two rows there, there are two different training times.

>> Eric Fosler-Lussier: Yes.

>>: Right.

>> Eric Fosler-Lussier: That's right.

>>: For the boundary SCRF and regular SCRF.

>> Eric Fosler-Lussier: Right. Exactly.

>>: Are the accuracies the same?

>> Eric Fosler-Lussier: The accuracies turn out to be exactly the same, because the models are exactly the same -- because we're not using any features that talk about --

>>: The SCRF, it's not using --

>> Eric Fosler-Lussier: It's not using -- it doesn't have a feature that talks about that. So it's just wasted power, exactly right. So they're equivalent because of the types of features we want to use.

>>: So one of the advantages of using segmental features is that you can, for example, take the average of the [inaudible].

>> Eric Fosler-Lussier: That's right.

>>: You will miss --

>> Eric Fosler-Lussier: No, no, no, we can do that. We did that as one of our experiments: instead of taking the average MFCCs you take the average of the posteriors. We have done that. It works about the same as taking the uniform sampling and all this stuff -- each one works a little better or a little worse in various situations. Right. So that's if you add in the phonological features. It turns out, annoyingly, that we actually ended up with the same amount by adding phonological features, so it helps down here. We're still sort of puzzling over that a little bit; I don't know what to say about that yet.

Okay. So we're eventually moving to word recognition with this, and one option that we're interested in is basically saying, look, we can use Jeff's SCARF system as a higher-level processing system, so that this becomes a first pass and SCARF can incorporate larger segmental features on top of it. So we basically put out lattices and then have SCARF work on those. So that's one option that we're looking at for doing word-level decoding.
The other option that we're looking at is that Jeremy Morris did some sort of hybrid-recognition-style thing, basically turning this into something like a hybrid neural network system, except it works on sequences rather than individual phones. So that's all I had to say on this topic. And I see that I'm running really long, so I guess I'm going to give you guys a choice, because I will have to shrink something. I know Rohit actually gave a talk here on the articulatory features stuff last year when he was interning. I could turn to the more text-based processing stuff, or dabble in each. So maybe I'll have a show of hands.

>>: [inaudible].

>> Eric Fosler-Lussier: 11:40. Do we have time?

>>: We have Dan showing up.

>>: I see.

>>: 11:30 pick him up. If there's a group of us, though, that's going to lunch, we should probably beat the lunch crowd.

>> Eric Fosler-Lussier: Yes, absolutely.

>>: Eric, just so you can calibrate, there's probably five or six people in this room who might have seen Rohit's talk.

>> Eric Fosler-Lussier: Okay. So that's a good calibration. So maybe I should -- I don't know. So would people prefer to hear more about articulatory feature modeling -- that answers Li's question a little bit -- or text-based processing? I'll touch on each one so you get a flavor of what's going on.

>>: [inaudible].

>> Eric Fosler-Lussier: Okay. I'll zip through a little bit. Okay. Articulatory feature modeling. So what we've been doing up to this point is basically taking articulatory features, using them as estimates, and then combining them by thinking about a linear sequence of phones. And one of the things that we're interested in -- I've been working with Karen Livescu and with two of my students, Rohit, who you guys know interned here last year, and [inaudible] -- is some articulatory feature modeling where we've been complexifying the CRFs in a different dimension, which is in terms of factored state spaces rather than factoring time.

So just to give you a view of this. One way of thinking about the world of language is: if you have the word "sense," it can often be pronounced as "sents," because if the articulators are out of synchrony, you end up having more of a closure and a release, which ends up sounding like a T. So you get "sents" instead of "sense." It's hard to do on the fly. So there are models of this, like articulatory phonology. And Karen spent a lot of time thinking about models -- and this built on some stuff that Li had done as well -- about whether we can build models of how articulators combine to produce phonological effects. So the idea here is that if you get an asynchrony between the tongue body, tongue tip and lips moving at a different time than the velum and glottis do -- I should say the asynchrony is over here; I guess they're both asynchronous -- you end up with a nasalized vowel and you also end up with this extra phone that didn't appear in the original transcription. And the claim is that this can account for a lot of that pronunciation variation we saw in the beginning. It can and it can't -- it accounts for some of the effects, but not all of them. And so Rohit built this model that he talked about last year where we're trying to do articulatory feature alignment. And this gets back to the question I was asked about how I get the targets for all of the MLPs that I'm training.
And the problem is that if you're going from a phone -- if I just project from this phone to each one of these dimensions (I've changed my categorization because we're using articulatory instead of phonological features, but the point is the same), I'm not going to be able to really match what's really going on, which is this, right? So a mapping from N to this nasalized N isn't going to be able to capture this particular asynchrony very well, unless I have a really fine transcription. In fact, we can use the Switchboard Transcription Project as our fine transcription and then project backwards -- that's the way we're going to seed our models to answer some questions. But if I'm just going off a dictionary, where I don't have detailed hand transcriptions, I'm not going to be able to get this. The question is how I bootstrap from models where I've got this really fine transcription and project back to this kind of space. And that's where the alignment comes in. So I basically want to say, hey, I've got the word "sense" coming in, and I want to be able to say I expect -- there's some synchrony between the streams but there's also some asynchrony between the streams, and I need a model of that.

So we put up this monstrosity. Oh! Right? But let me make that a little simpler. Okay. So let's think of each one of these colors as some sort of asynchronous state, where you're now thinking about what the value is -- so I have essentially a phone where, between the word and the sub-word state, it maps to a particular articulatory feature which corresponds to a phone. So there's a model that basically says: are these features allowed to be asynchronous, what is the probability distribution over them being asynchronous, and I can talk about different streams going on in parallel -- one, two, three, as many as I want -- where they each have these linear structures going over time but also have synchrony constraints between the streams. So this is kind of a really ugly factor graph. By the way, the red things are trainable; the blue things are actually deterministic. So it's not as bad as having to learn everything about these things. We can take advantage of the deterministic constraints, because the number of red constraints is pretty small. And we can boil this model down into essentially a very simple model that talks about a vector space of sub-word configurations.

So let's imagine I put a limit on it -- if I had "cat" and I had three streams, I could have one, one, one; two, two, two; three, three, three. Or I could have one, one, one and then one, two, one and then two, three, three -- things start getting asynchronous, out of line. If I put a limit on how far my articulators are allowed to get out of synchrony, then the 1s, 2s and 3s basically refer to the canonical positions for the k, the ah and the t -- or for "sense," the s, eh, n, s. So I can talk about this stream having moved ahead while that one didn't, and all of that is really just a change in index space. If you do that, we now have a bunch of trainable parameters and one basic configuration parameter that says, hey, these are the possible transitions between these two states. So do we keep that fixed? Doesn't matter. But the idea is that what we're doing is eliminating deterministic variables.
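A toy sketch of that index-space view -- my own illustration, with an assumed asynchrony limit of one position and the assumption that each stream advances by at most one canonical position per step:

```python
# Each articulatory stream keeps its own index into the canonical phone sequence;
# streams move monotonically, and no two streams may drift more than `max_async`
# positions apart. The deterministic structure is just this index bookkeeping.
from itertools import product

def async_states(n_streams, n_phones, max_async=1):
    return [s for s in product(range(1, n_phones + 1), repeat=n_streams)
            if max(s) - min(s) <= max_async]

def transitions(states):
    ok = []
    for a in states:
        for b in states:
            steps = [bj - aj for aj, bj in zip(a, b)]
            if all(st in (0, 1) for st in steps):  # each stream stays or advances by one
                ok.append((a, b))
    return ok

# "cat" = 3 canonical positions, 3 streams, asynchrony limit 1.
S = async_states(3, 3)
print(len(S), "states, e.g.", S[:5])
print(len(transitions(S)), "allowed transitions (including self-loops)")
```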
So originally we tried to implement it with a general-purpose CRF toolkit and it got hairy. By doing some neat tricks -- which are actually related to the tricks for the boundary-factored thing, about restricting your state spaces -- you can get a computationally tractable model. And we found that, in terms of alignment error rate, we end up getting articulatory alignment errors reduced by about five to 15 percent relative. So that's sort of where that's been.

>>: [inaudible].

>> Eric Fosler-Lussier: So you initialize the model using the Switchboard Transcription Project, so you have hand-transcribed things -- actually, in that case we're training and testing on that set. But the idea is that we then took that and aligned all of Switchboard for another experiment, which I'm not going to talk about. But, yes, you use that as a bootstrap.

>>: [inaudible].

>> Eric Fosler-Lussier: We don't have TIMIT results for this model, that's right.

>>: But you follow the details.

>> Eric Fosler-Lussier: We probably should do a TIMIT model for this; that's a good point. If you're interested in more details, that was our ASRU 2011 paper.

So Rohit has to graduate at some point so he can go off and be a great researcher somewhere, and he was playing around with trying to get this to be full recognition. And he said, well, what if I try to do this as keyword spotting? And we enlisted the help of Joseph Keshet, who had done some discriminative keyword spotting, and built a model that basically says: I'm going to develop this feature-based keyword spotter where I've got two segments of speech, one that has the word and one that does not, and the objective function I want to optimize basically says my score for the thing that has the word had better be better than the score for the thing that doesn't. That's the very simple version of it. So we want this function. I'm not going to talk too much about that, but the idea behind it is that this becomes something we can parameterize with some lambda-weighted sets of features that we extract from something. And in this case we're going to use the alignment model to say what's my best alignment of the word, and then we're going to extract features from the segments found in the alignment. Currently he's done this with phone-based alignments, but what he's trying to work on now is getting articulatory alignments. And at the moment, we just finished a rough draft of a paper for MLSLP -- the new Symposium on Machine Learning in Speech and Language Processing that's coming up. It turns out this actually works pretty well in low-resource settings where you don't have a lot of data, which is important to us if we, let's say, have data from a new language where we don't have a lot of it. But we haven't conducted any experiments on that kind of multi-lingual setting yet. So I realize that was an extremely hand-wavy version of this, but I do want to talk a little bit about the stuff we've been doing with electronic medical records -- if anybody had any questions about that. But that's just to give you the flavor: we've got articulatory alignments and then we're going to use those to extract features for one of these keyword spotters. That's the real rub.

>>: You say you have two cuts for the CRF?
>> Eric Fosler-Lussier: So this turns out -- the model turns out not to be a CRF in that case. It's kind of like a perceptron. Yossi wouldn't call it a perceptron; I call it a perceptron. It's actually trained using a max-margin style approach, so you could think of it as a linear SVM.

Okay. Last bit. I've been working on some electronic medical record stuff, which has been kind of eye-opening, taking me back to where the CRF stuff came from in terms of language processing. So Asela was saying I've been dragged back into language [laughter] -- it's like, whoa. So let me give you a medical record. This is a sanitized medical record out of the OSU Medical Center. And you'll notice some interesting things about it. You've got some information corresponding to the patient and the doctor, and this is all structured data, right? And then we've got a bunch of unstructured data in particular sections: what's the history, what did we find in terms of tests, what are we planning to do? And the idea is that we're going to get multiple of these notes. And eventually, what we want to be building over the lifetime is a model where we say: here's our first narrative, here's our admission date, there are some things that happened before admission, some things that happened after admission but before discharge, and some things that happened after discharge or are planned to happen after discharge. And here's the second admission, which is hopefully after the first discharge date -- hopefully way after, but not often. And if there's a readmission, we now know that the medical history is going to correspond to some of the same events, but maybe also some other events that weren't in the original note. And how do we find all these correspondences? So this becomes an interesting problem of how I know when a medical event in one document corresponds to another medical event within that document or across documents. So there are really two problems: within-document and cross-document co-reference resolution for medical events. It's kind of like anaphora resolution in language processing, but here we're talking about events -- event resolution.

So the obvious thing to do -- and I just want to point out: here we've got K competitive use -- these are labeled twice in the same data and they're the same; but you might have chest pain here, and this chest pain actually is not the same as that chest pain. We'll see an example of what that really means. And the thing is that labeling these things is hard. You can get reasonable inter-annotator agreement if you spend a lot of time at it, but it takes forever. At the point of this study we had three patients and about 35 clinical notes, and that was with four annotators going through and doing all the stuff. So it's pretty ugly. So we want to use some sort of semi-supervised techniques, right? And the question is: can we use unlabeled data to do that? So what we want to do is build a temporal classifier, because when people think about co-reference they often think about semantic concepts -- chest pain, acute chest pain, are they the same thing, so semantic overlap, right? But it turns out there are a lot of cues that correspond to temporal information about events that can really tell you whether this is the same event or not.
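As a generic sketch of the two-view semi-supervised setup he gets to next -- semantic features as one view, temporal features as the other, each view labeling the unlabeled event pairs it is most confident about -- here is a minimal co-training loop. The classifier choice, confidence threshold, and batch size are my own assumptions, not the actual system's settings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(Xs_l, Xt_l, y_l, Xs_u, Xt_u, rounds=5, per_round=20, thresh=0.9):
    """Xs_*: semantic-view features, Xt_*: temporal-view features.
    *_l are labeled event pairs (coreferent or not), *_u are unlabeled."""
    sem = LogisticRegression(max_iter=1000)
    tem = LogisticRegression(max_iter=1000)
    unlabeled = np.arange(len(Xs_u))
    for _ in range(rounds):
        sem.fit(Xs_l, y_l)
        tem.fit(Xt_l, y_l)
        for clf, X_view in ((sem, Xs_u), (tem, Xt_u)):
            if len(unlabeled) == 0:
                break
            # This view nominates the unlabeled pairs it is most confident about.
            conf = clf.predict_proba(X_view[unlabeled]).max(axis=1)
            order = np.argsort(-conf)[:per_round]
            pick = unlabeled[order[conf[order] >= thresh]]
            if len(pick) == 0:
                continue
            y_new = clf.predict(X_view[pick])
            Xs_l = np.vstack([Xs_l, Xs_u[pick]])
            Xt_l = np.vstack([Xt_l, Xt_u[pick]])
            y_l = np.concatenate([y_l, y_new])
            unlabeled = np.setdiff1d(unlabeled, pick)
    return sem, tem

# Toy usage with random features (just to show the shapes involved).
rng = np.random.default_rng(0)
Xs_l, Xt_l = rng.normal(size=(30, 8)), rng.normal(size=(30, 5))
y_l = rng.integers(0, 2, size=30)
Xs_u, Xt_u = rng.normal(size=(200, 8)), rng.normal(size=(200, 5))
sem, tem = co_train(Xs_l, Xt_l, y_l, Xs_u, Xt_u)
```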
I'm going to skip the semantic part -- we take it for granted that we extract a bunch of semantic features that are medically relevant. But I want to talk about the temporal stuff, because that's actually where our newest work is; people have done that kind of thing for the semantic side. So the idea is that we're going to try to build coarse time bins and build a sequence-tagging conditional random field to assign medical events to relative points in time -- points in time relative to admission. So we may have things that are way before admission (like a year before), before admission, at admission, after admission, or after discharge. The idea is: here's our medical note, right? You've got K competitive use and hypertension, with "a history of," so that's sort of way before, right? Chest pain, which started two days ago -- that's also before admission. He does not have chest pain now -- that's after admission. But "ever since the episode two days ago" -- that's before admission. So here we have chest pain twice: these two are not co-referent, but this "episode" is co-referent with the chest pain. Okay? So we're going to need some semantic information to know that chest pain could be an "episode" -- that's what the semantic stuff is doing -- and the temporal stuff is basically giving us a different view.

So we're going to extract state functions and then have some transition functions in the usual CRF way. So what are the kinds of functions you would actually extract? One is: what section are you currently in? It's not a guaranteed thing -- "past medical history" doesn't necessarily always mean before admission, though it's definitely a bias toward things that are before admission. Doctors don't always pay attention to what they're typing where, so things that happen after admission can still occur in that section. But it's a good clue, right? We're also interested in lexical features: "history of," "presented with" -- these are the preceding bigrams that will tell us something. We also extract not only the preceding bigrams but also parts of speech and things like that, so we know that, oh, this is probably a temporal modifier -- that's a good thing to know. The other thing is that there are actually a lot of relative time references in here: "two days ago," "two days ago." Well, these two things are going to map to this. So we look for the closest relevant temporal expressions -- we run a temporal expression parser, for example. And essentially what we end up doing is building this model using all of these expressions, and we can now label every event -- we know the admission date, because that's in the structured data, and now we can basically give a relative time based on where we're assigning things in the temporal bins.

Why is this important to us? Well, essentially we're going to take our semantic features as one view of the world and the temporal features as another view of the world, and we're going to do semi-supervised learning in a multi-view type environment. We played with two different multi-view strategies. One is to do co-training, where you basically say: I'm going to train a classifier on one type of feature, predict labels for instances in my unlabeled data that I'm pretty sure about, then use those to train the other, and then go back and forth, back and forth.

>>: These bigram features --
>> Eric Fosler-Lussier: It's a huge sparse vector. Right. And the other option is posterior regularization. I'm not an expert in posterior regularization; I'll wave my hands by saying, look, you can develop a probability structure and then have a prior from some other model. Well, the other model -- the temporal model -- now serves as a prior for the semantic model, and the semantic model serves as a prior for the temporal model. Then you can again do that kind of labeling, so that you try to constrain the amount that the two disagree on their outputs. So you go back and forth, back and forth.

So just to give you a sense of this: if you do supervised learning on a 60/40 split of the data -- this is the dataset for which we have all the labels -- you get a precision of about 77 percent on clinical notes for "is this co-referent," with a recall of 90 percent. Co-training works a little worse than posterior regularization on this, but essentially we're getting pretty close to the supervised learning, which is surprising to us, and we have not actually taken advantage of the vast numbers of clinical notes that we could use. That's to be the future experiment.

So, summing up -- sorry to blast through the last bit -- I just wanted to give you a flavor of the kinds of things we're working on in the speech and language technology lab. We've got a number of other projects, as I said. But a lot of the stuff we're doing is basically: how do I think about intelligent feature extraction and then plug it into relatively standard segmental tools? In speech processing it's been about how you do intelligent factorization of that space, and in the text processing we've been using these as features for something else, right? So one of the things that we've noticed is that -- people have talked about CRFs as a replacement for HMMs, discriminative HMMs -- a linear-chain CRF is more powerful than an HMM when you start thinking about transition observations. I think the Heigold paper kind of shows you can factorize the space back into that, but in practical terms this is an easy way to get this observational dependence, right? And we've been using the sequence models also as features for other learners and finding that incorporating sequence information at the lower levels can often help when you're doing prediction for other tasks as well. So those are the two main messages. That's all I have to say.

[applause]

>>: [inaudible].

>> Eric Fosler-Lussier: I'm sorry?

>>: You mentioned [inaudible]. Do you have any thoughts on how to incorporate that?

>> Eric Fosler-Lussier: No, but I would love it if -- I've been wrapping my brain around that for a while, and I have some ideas on things like curve modeling, but I don't have anything solid.

>>: Any idea how you can incorporate [inaudible] in CRF systems?

>> Eric Fosler-Lussier: I think I'll have Jeff do that. [laughter] I mean, that's an interesting question, because the state space -- I think the way that Jeff does it is not a bad start, because he thinks about -- it's hard to talk about it with him in the room. [laughter]
But, you know, thinking about the language model state space allows you to think about any depth of language modeling you want, because you're thinking about the state space. But my take on it is that you're probably not going to be able to get it all in one system. And we see this with the deep nets and all this stuff: I kind of see this as acoustic smoothing over a number of lower-level estimates that can give you some sort of phonological substrate, and then you use that to go upwards from there. So for the language model stuff, you could do some interesting things either with maxent or CRF-style models on top of the lattice -- kind of the way that Jeff thinks about this model, I think. So that would be my answer. But I don't have a good answer for you, I think, is the upshot.

[applause]