>> Patrick Pantel: Okay. We'll get started. So I gave Hal a formal introduction yesterday, so
today I won't go through it again. But in case you weren't there, professor Hal Daumé is here
visiting this week alone from Maryland where he's an assistant professor in the Department of
Computer Science there.
And he gave us a fantastic talk yesterday on structured prediction, and today he's going to give
us a 3 1/2-ish hour tutorial on domain adaptation, which he's given in the past with John Blitzer
at ICML, which was very, very well received. And I think we're in for a great treat.
About midway through we'll have a break with snacks in the back, and Hal will control when
that will be.
>> Hal Daumé: So you should be nice to me.
>>: [inaudible] spit balls thrown at you [inaudible].
>> Hal Daumé: I guess if you really want snacks, you could just leave. And I -- they couldn't
lock the doors, so -- okay. So this is a tutorial I gave with John at ICML last year. John's no
longer at Berkeley; he's at Google, so -- but I didn't update the slides.
So, anyway, he deserves credit for a lot of the work in the tutorial, and some of the stuff is work
that he's done. But since he's not here, he doesn't get to give his half. So I'll try to do it some
justice.
Okay. So I want to sort of start out with the nonadaptation setting, just so that we sort of are all
on the same page. So the classical setting is single-domain learning. Our job is to get some sort of input X, predict a
corresponding output Y, and we assume that X and Y are drawn according to some probability
distribution over joint pairs X and Y.
So, for instance, our Xs might be book reviews from Amazon. So here's a book review of the
book Running With Scissors. It says the book was horrible, I read half, suffered from a headache
the entire time and eventually I lit it on fire. One less copy in the world. Don't waste your
money. I wish I had the time spent reading this book back. It wasted my life.
So our job is to read this review and predict whether this person likes the book or dislikes the
book. So what do you think? Likes the book? It could be sarcasm, right? But probably not,
right? So we probably think that this is a negative review.
So here's a second example. So this is a speech recognition example. So this is Yoram Singer,
and our job is to listen to him talk and transcribe his speech. So here's him talking.
Audio playing: So the topic of the talk today is online learning.
>> Hal Daumé: Okay. So that would be our X. And the corresponding Y would be some sort of
transcription. Okay? So this is sort of the classic setting.
So in domain adaptation, we're going to talk about having two distributions, so we're going to
have a source distribution and a target distribution. The source is essentially going to be our
training distribution. So we get to observe labeled examples of Yoram speaking.
Audio playing: So the topic of [inaudible].
>> Hal Daumé: I could listen to Yoram forever, so -- and -- and then we're going to get test
examples from some different distribution, which I'm calling T for the target distribution. So, for
instance, maybe we get some speech from Manfred Warmuth, who sounds more like this.
Audio playing: Everything's happening online, even the slides are produced online.
>> Hal Daumé: Okay. So the task is the same. So in both cases we're trying to predict the same
sort of Y for a given X. It's just the distribution over Xs has changed, right? Like Yoram and
Manfred sound very different from each other.
Okay. So this is the general setting that we'll talk about for sort of the first half of the talk. So
there are a lot of examples. So in NLP we have the sentiment problem. So we might get reviews
of books and we might say that book is packed with fascinating information. This is a really
good indicator of a positive review.
Or we might have a review of a blender that says it's a breeze to clean up, and this is, again, an
indicator of a positive review. But I'm not going to review a book positively by saying that it's a
breeze to clean up, and I'm not going to review a blender positively by saying that it's packed
with fascinating information.
So there are other cases that we run into all the time, right? So somehow someone decided 20
years ago that it would be a good idea to annotate all sorts of data only in Wall Street Journal
because of course the rest of the world only cares about doing NLP on Wall Street Journal. But
unfortunately, you know, there are various other places where we might want to apply NLP
techniques where the data looks different although the task remains the same.
The same thing happens in visual object recognition. So these are images from Amazon's
products, which you might want to use coupled with some Web cam to do object identification,
but here the objects have the background subtracted and edited out, here they don't, so it's still
the same task but the sorts of examples that we're getting are changing.
So this was actually a really recent paper that did this. I think it's really cool. It was ICML last
year. I guess I'm going to talk mostly about the NLP side just because it's easy for examples.
Okay. So this is basically what we're going to talk about. So at the beginning I'm just going to
introduce a bunch of common concepts that will [inaudible] the rest of the talk, and then there
will be two sort of main sections with a break in the middle. So the first is going to be what I'm
going to call semi-supervised adaptation. So this means that I get transcribed versions of Yoram
speaking and my job is to transcribe Manfred speaking but I don't get any labeled data for
Manfred speaking. I only get sound files. So no transcriptions.
And then in the supervised case I'll get training data from both Manfred and Yoram, and I want
to use both in some interesting way.
And then at the end I'll talk about sort of briefly stuff that this doesn't address. And since I'm
here and since there are a lot of NLP people here, I'm going to talk specifically about some issues
in translation and how to do adaptation in a translation setting.
>>: [inaudible] it should be unsupervised adaptation, or that's what we always call this space.
>> Hal Daumé: It's what you call unsupervised?
>>: Always.
>> Hal Daumé: Okay, okay.
>>: [inaudible]
>> Hal Daumé: Well, it's because you get labeled data from the source and unlabeled data from
the target, so it looks kind of like semi-supervised.
Yeah, I mean, there's a big question about -- like so John and I had an argument for about an
hour about whether we should call this unsupervised or semi-supervised. Somehow we settled
on semi-supervised.
Although, part of the reason is because I actually think there's an additional question which you
can ask, which I would consider unsupervised adaptation, so all of the time here we're going to
be talking about like classification in supervised learning problems, but you can also talk about
adapting clustering algorithms or adapting dimensionality reduction algorithms. And so there I
would think of that -- I mean, you could call it like super unsupervised domain adaptation, but,
yeah, it's kind of arbitrary. There's definitely not a consensus about what these things should be
called.
So then, you know, if you want to sneak out but you care about MT, you can come back for the
last like 20 minutes and hear some translation stuff.
Okay. So there's been a lot of work, for, you know, about 50 years, on trying to come up with useful
measures in the classic supervised learning setting for how well you'll do on test data based on
how well you've already done on training data.
So the classic result that you get looks something like this. So my error on test data, so this is
my expected error, is bounded by my empirical error on the training data, plus some term that
usually looks like the square root of some complexity, so like VC dimension, Rademacher
complexity, whatever you like, divided by the number of training points I have.
So as I get more and more training points, the test error is going to get closer and closer to the
training error. So people have been doing this forever. This is sort of in the line of structural
risk minimization.
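In symbols, the kind of bound being described looks roughly like this, where C(H) stands for whichever complexity measure you prefer -- this is a schematic form, not any one specific theorem:

```latex
% Schematic single-domain generalization bound:
% C(H) = VC dimension, Rademacher complexity, etc.; N = training set size.
\epsilon_{\text{test}}(h) \;\le\; \hat{\epsilon}_{\text{train}}(h)
  \;+\; O\!\left( \sqrt{ \frac{ C(\mathcal{H}) }{ N } } \right)
```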
In adaptation, we'd like to be able to say similar things. So we want to be able to bound how
well we'll do on our test data as some function of how well we've done on training data and
maybe other terms.
The reason why this becomes difficult is that these two errors, although one is an expected error
and one is an empirical error, they're both measured on the same distribution. They're both
measured on Manfred or Yoram speaking. Whereas down here, the test error is measured on
some new distribution that we don't have at training time.
Okay. So there are a couple concepts that come up over and over again, so we're going to give
them names. So the first is covariate shift. So formally this means that for a given X the
distribution over possible labels for this X under the source distribution is the same as the distribution under the target distribution. All right. So what this
basically means is once I've given you an X, the right label is basically the same regardless of
what distribution I'm in.
>>: [inaudible] reduced to the classic case, wouldn't it?
>> Hal Daumé: No, because P of X can be different. But P of Y given X is the same.
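To write the assumption down (a paraphrase of the slide's notation):

```latex
% Covariate shift: conditionals agree, marginals may differ.
p_S(y \mid x) \;=\; p_T(y \mid x) \quad \text{for all } x,
\qquad \text{while possibly} \quad p_S(x) \;\neq\; p_T(x)
```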
Okay. So sort of pictorially, you know, there might be some guy, Kobe, who can understand
both Yoram and Manfred. And so somehow, regardless of which X he gets, he's able to produce the correct Y.
A slightly stronger assumption is the single good hypothesis assumption, which basically says
that there's some classifier out there in the world, I don't know what it is, but there's some
classifier for which this classifier's error on the source distribution and its error on the target
distribution are both small.
Okay. So pictorially sort of it means there is a classifier who understands both Yoram and
Manfred. Okay. Do these -- does the distinction make sense? So this is with respect to a
particular type of classifier, whereas this is just generally the distributions are the same.
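And the single good hypothesis assumption, written out (again paraphrasing; lambda is just "some small number"):

```latex
% Single good hypothesis: one fixed classifier does well on both.
\exists\, h^{*} \in \mathcal{H}: \quad
\epsilon_S(h^{*}) \;\le\; \lambda
\quad\text{and}\quad
\epsilon_T(h^{*}) \;\le\; \lambda
\qquad \text{for some small } \lambda
```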
Okay. So if the distribution -- well, let me not say that. Okay.
>>: So coming back to the previous slide -- is there an end result for that problem where the term on the right-hand side could be the KL distance [inaudible]?
>> Hal Daumé: Yeah, I will show you one in about 20 minutes. Otherwise, I wouldn't put that
up, right? If I can't come up with an answer, I'm not going to tease you like that.
So then the third concept that's going to come up more and more is this domain discrepancy. So
throughout the talk, I think blue is always going to mean the source distribution and red is always going to mean the target distribution. So here think of these like Gaussians, but, you know -- so if any of you talk to product people, having an ability to actually draw a two-dimensional Gaussian
would be awesome in PowerPoint.
So but these are Gaussians represented by, say, one contour. So the idea is that if the source
distribution and the target distribution are close, the adaptation problem shouldn't be that hard.
Right? So, you know, if you take me and Patrick, maybe Patrick says "eh" at the end of every
sentence, but other than that our distributions are not that different. So maybe if you can
understand me, it's pretty easy to understand Patrick. Sorry, Patrick. I know you don't say eh.
Most of the time.
But if you take two distributions that are quite different, so like me and Yoram, even if you're
very good at me, the set of inputs that Yoram and I are both likely to say, like basically unless
I'm trying to do a Yoram impression or Yoram is trying to do a Hal impression, we're almost
never going to produce the same X. And so intuitively this should be harder.
Okay. So those are the three common concepts, and we'll see them pop up over and over again.
So we're going to start with this semi-supervised or unsupervised approach. The first case that
we're going to talk about is the shared support, which basically means the distributions overlap,
and then the second part we're going to talk about is no shared support. So there is potentially no
utterance that Manfred and I would say the same. And you might think that's crazy, you're never
going to be able to do anything in that case, but you actually can, which is kind of magic.
Okay. So this is a paper from a couple of years ago. And I'm not going to prove this, although
it's not difficult. It takes five or six lines. But basically what it says is that if I have the covariate
shift assumption, so my conditional distributions are the same, for any classifier H -- this doesn't
have to be the optimal classifier, any classifier you choose -- my error for this classifier on the
target distribution is bounded by its error on the source distribution plus some measure of the
distance between the source and target distribution.
Okay. So the sort of intuitive idea is you're going to say, okay, I want to count how often I'm
going to make an error on the target. So if I draw an example from the target distribution, and
this example looks a lot like the source distribution, then my error from source and target are
going to be pretty close. But if the example that I draw doesn't look like the source distribution
at all, then my errors are going to be far apart.
And so what I'm doing here is I'm basically saying over all possible examples that I could draw,
how far apart -- basically how likely is T to say this versus how likely is S to say this.
Okay. So it's not particularly complicated. This measure is usually called the total variation between distributions. Okay. So -- yeah.
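Spelled out, the bound and the distance look roughly like this -- one common form; the paper's exact statement may differ in constants and in how total variation is normalized:

```latex
% Under covariate shift, for any classifier h:
\epsilon_T(h) \;\le\; \epsilon_S(h) \;+\; d(S, T),
\qquad
d(S, T) \;=\; \sum_x \bigl|\, p_T(x) - p_S(x) \,\bigr|
```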
>>: So just a question about that. Imagine on the source distribution that there's a point which is very unlikely under the P of X, but we make [inaudible]; but because it's unlikely, in expectation the error on the source distribution is low.
>> Hal Daumé: Right.
>>: But now that point is very high probability [inaudible] target distribution.
>> Hal Daumé: Yep.
>>: So now we're going to make large errors.
>> Hal Daumé: Right.
>>: Is this bound capturing that?
>> Hal Daumé: Yeah, it'll capture that. Because if it has very low probability, it -- right. So this will be
approximately 0. This will be approximately 1. And, I mean, maybe not approximately 1, but it
will be like 1/2. Yeah.
>>: [inaudible] example when the covariate shift is [inaudible]?
>> Hal Daumé: It's hard. I guess intuitively. It's a fine line. I mean, we can talk about this a
little bit later, but there's one case where it's definitely true. So one of the pieces of work that
spurred -- or I shouldn't say spurred, but is related to a lot of the work on domain adaptation is this
work on sample selection bias that came out of the economics literature back in the '70s. And
the guy -- I can't remember his name, Heckman, something like that, actually got a Nobel Prize
for it.
And so the problem that they have is like say I'm an insurance company and I want to classify
people as to whether they're going to have an insurance claim or not. So what I want to do is
when a person asks for insurance, I'm going to run my classifier on them, and if it looks like
they're never going to have a large claim then I'll offer them insurance, and if it looks like they
will have a large claim, then I won't offer them insurance.
But, you know, okay, so there's this total population of people, some of whom will have claims. The problem is
that the only people that I get to know if they'll file a claim or not are the ones I've already
offered insurance to. Right? So I only get to see the people that I've offered insurance to.
So what's happening here is I have a big distribution over all people, and then I have this subset
distribution over people that I happened to have offered insurance to in the past.
And so in this case, I think the covariate shift definitely holds. In fact, it's stronger, right? It's
not just that the conditional is the same. It's that one's actually a subset of the other. So that's
one case where it's clearly true.
Another case is it's definitely a fine line. A lot of it actually depends on sort of how you define
X, like in -- from a feature engineering perspective. You can sort of argue whether it's covariate
shift or not. It's tricky.
>>: So this problem must be really bad [inaudible].
>> Hal Daumé: It's terrible. Yeah. But it's easy.
Okay. So we're going to make this assumption. In addition to covariate shift, we're going to
assume shared support, which basically means that there's no example that's possible under one
distribution and impossible under the other distribution.
Okay. So, for instance, this is not a shared support scenario, because there are examples out here
that are possible under one distribution but not possible under the other.
Okay. And one of the things that you can do here is if you have sort of this distribution over examples, so here the examples look very sourcey, and here the examples look very targety,
maybe what I can do is I can increase the importance of the examples that look very targety and
decrease the importance of the examples that look very sourcey. And then maybe if I train on
this reweighted distribution, I can do a better job.
Okay. So it turns out that this actually works pretty well. So I actually am going to walk
through the steps on this just to show that none of this stuff is actually scary. It's all pretty
straightforward.
Okay. So our error is our expected error on the target distribution. So we're drawing an X from
the target distribution. We're drawing the corresponding Y from this conditional distribution,
and I don't have to specify whether it's source or target, because I believe covariate shift. So P of
Y given X is the same for both.
And then I'm looking at under these two distributions how often does my classification differ
from the true label.
Okay. So we just sort of expand things out. So we're summing over all X. So I've rewritten the
expectation as an explicit sum. And now we're going to do this trick that comes up over and over again, where we multiply by 1, writing 1 as probability S of X divided by probability S of X. Right? So this is where I'm using the fact that there is shared support, because otherwise this could go to infinity.
Okay. So now I can just rearrange things a little bit. So basically what I'm going to do -- it was an expectation under the target distribution, and I want to make it look like an expectation under the source distribution, because the source distribution is what I have data about.
Okay. So I now can rewrite it as an expectation under the source distribution with a weighting
term that's given by the ratio of the probability of this data point under the target divided by the
probability of this data point under the source.
Okay. Okay. So now I can write my expected target error as my -- as an expectation over source
examples, which is good because source examples are what I see at training time.
Okay. And this term I'm going to call my instance weight. Okay. So now I have a weighted
error of my hypothesis with respect to these weights, where the weights are defined as the ratio
of these distributions.
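Compressing the chain of steps into one line of algebra (the standard importance-weighting identity; shared support is what keeps the ratio finite):

```latex
\epsilon_T(h)
  = \sum_x p_T(x)\,\Pr_{y \sim p(\cdot \mid x)}[\,h(x) \neq y\,]
  = \sum_x p_S(x)\,\frac{p_T(x)}{p_S(x)}\,\Pr_y[\,h(x) \neq y\,]
  = \mathbb{E}_{x \sim S}\!\bigl[\, w(x)\,\Pr_y[\,h(x) \neq y\,] \bigr],
\qquad w(x) = \frac{p_T(x)}{p_S(x)}
```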
Okay. So that's one way to think about the problem. There's another way which actually leads
to more interesting algorithms. So the way that I'm going to think about it is I'm going to ask
myself how do I get a source data point. Okay? And I'm going to describe a really weird way to
get a source data point.
Okay. The first thing I'm going to do is I'm going to draw a target data point. Okay. So I have
my distribution over source, my distribution over target, I draw some target point.
Okay. Now I'm going to flip a coin: if it comes up heads I'm going to relabel this target point as a source point, and if it comes up tails I'm going to throw it out. Okay. So I'm drawing something from the target and then some of the
things I draw I'm throwing out, the other things I return as draws from the source distribution.
So this one I might relabel as a source point; this other one, maybe its probability under the source distribution is so small that I don't relabel it as a source point.
Okay. So if you follow this analogy, the way that I can draw a point from the source distribution
is to draw it from the target distribution and then decide whether I should move it over to source
or not. Okay. And then I have to normalize this. So this is just the regular normalization
constant. All right.
So we're going to do some algebra. I'm not going to walk through all the steps. You know, you
can look back at the slides after the fact if you really want the details. But it's not that
interesting.
So I basically apply Bayes' rule. I rearrange things so that I have this "did I select it into the source distribution" term on both sides. I recognize that this top part is independent of X. So
from a classification perspective it doesn't matter.
And I can then rewrite my error on the source distribution as an expectation over examples
drawn from the source with a weight given by 1 divided by the probability that this coin comes
up heads and then my usual error on this example.
Okay. So this has basically given us a different way of coming up with instance weights. So in
the previous case we got instance weights which were the ratio of the probability of this example
and the two distributions. Now we have an instance weight which is 1 divided by this
probability of Bernoulli variable, this probability of a coin flip.
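A sketch of the algebra behind this view, with sigma = 1 meaning "relabeled as source" (proportionality constants dropped, as in the talk):

```latex
% Rejection-sampling view: a source point is a target draw that survives
% a coin flip, so p_S(x) \propto p_T(x)\,\Pr[\sigma = 1 \mid x].  Inverting:
p_T(x) \;\propto\; \frac{p_S(x)}{\Pr[\sigma = 1 \mid x]}
\quad\Longrightarrow\quad
\epsilon_T(h) \;\propto\;
\mathbb{E}_{x \sim S}\!\left[ \frac{1}{\Pr[\sigma = 1 \mid x]}\,
  \Pr_y[\,h(x) \neq y\,] \right]
```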
Okay. So why did I do this? Why wasn't I happy with the first one?
>>: Because you're about to do something tricky.
>>: Well, that's just --

>> Hal Daumé: This was the tricky part. Hmm?
>>: Oh, that's just enough [inaudible] I mean, it's your -- it's how overlapping you think your --

>> Hal Daumé: Oh, but I'll compute this. Don't worry. I'm not going to just put .6 in or something. Don't worry. I'll compute it.
>>: So you are doing the proof of the [inaudible] or are you --

>> Hal Daumé: No, so I'm going to propose an algorithm based on this. But the question is I
could turn this into an algorithm, right, so I weight each of my examples by this probability
divided by this probability, and then I run some classifier, like an SVM or something.
>>: But the problem is, you don't know the [inaudible].
>> Hal Daumé: Yeah, exactly, right? So this is like -- these are two density estimation
problems, right? I have to compute my distribution under T, I have to compute my distribution
under S, and I have to compute their ratio. This is going to be really, really hard. Right?
The nice thing about this second version is that this is basically some sort of like binary
classification problem, right? Does this example look like a source example or does this
example look like a target example. This is a much, much simpler thing to compute. So we like
that a lot.
Okay. So now we get a little bit arbitrary. So we're going to now model this distribution as
some logistic function. Okay. So essentially what we're going to do is we're going to take
unlabeled source points and unlabeled target points. We're going to call all of the source points
class 1, all of the target points class 0, and we're going to train a classifier to distinguish source
points from target points.
Okay. And the nice thing is we can do this using only unlabeled data. We don't need labeled
data to do this.
Okay. And if we do this using logistic regression, then what we end up with at the end of the
day is a function that says this example looks with 90 percent probability like a source point. Or
this example looks with 2 percent probability like a source point.
>>: So what data do you use to train [inaudible]?
>> Hal Daumé: We have to get unlabeled data from both distributions. We don't have the
[inaudible] to totally host.
>>: Okay. But you have to balance them, right? Because if you take unbalanced [inaudible].
>> Hal Daumé: Yeah, probably it should be balanced. Although, I don't know. Because what's going to happen is that you're going to plug it in here, right, and if it's imbalanced it only changes the estimates by a constant, so I can sort of -- I mean, actually, I already threw out a constant of proportionality. So the bias only changes this by a constant and
shouldn't matter.
But in practice that's probably not always going to happen. So, yeah, you probably should
balance it.
Right. So we're going to train the classifier. Source instances we're going to say sigma is 1.
Target instances we're going to say sigma equals 0 and then train some logistic regression.
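A minimal sketch of that training step in Python -- scikit-learn's LogisticRegression as a stand-in, with hypothetical array names; the downstream weighted learner is whatever you like:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def instance_weights(X_source, X_target):
    """Estimate 1 / P(sigma = 1 | x) for each source example, using only
    unlabeled feature vectors from the two domains."""
    X = np.vstack([X_source, X_target])
    # sigma = 1 for source instances, sigma = 0 for target instances.
    sigma = np.concatenate([np.ones(len(X_source)), np.zeros(len(X_target))])
    domain_clf = LogisticRegression(max_iter=1000).fit(X, sigma)
    # Column 1 is P(sigma = 1 | x); target-looking examples get big weights.
    p_source = domain_clf.predict_proba(X_source)[:, 1]
    return 1.0 / p_source

# Then train any weighted learner on the labeled source data, e.g.:
#   clf.fit(X_source, y_source,
#           sample_weight=instance_weights(X_source, X_target))
```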
Okay. So does this make sense? All right. So here's a little pictorial example. I love this, and I
can say I love it because John came up with it. I didn't come up with it. So this is a really I think
great example of how and why this approach works. Okay.
So here's the setup. We have a binary classification problem. It's one dimensional, so our inputs
are just some position along the real line. And the label of an example is minus, so it's a negative
example any time it's in between these two lines, and it's a positive example off to the left and a
positive example off to the right.
Okay. So now our source distribution is some Gaussian centered, say, here. And our target
distribution is some Gaussian centered here. Okay.
So before I show you any data, if I'm in the source domain, I should probably learn a class -- and
I'm learning a linear classifier, I should probably learn a classifier that roughly corresponds to
this line. So anything to the right of this line I'll say is positive; anything to the left of this line
I'll say is negative; and, yes, I'll screw stuff up over here, but stuff over here has such low
probability under the source distribution that it doesn't matter.
On the other hand, for the target distribution, what I should do is I should say I have a
classification -- I have a decision boundary here, anything to the left I say is positive; anything to
the right I say is negative. And, again, it will screw up over here, but things are low probability
so it doesn't matter.
Okay. The problem, of course, is that at training time I only get to see examples from the source distribution. Right? So I see this positive example, these negative
examples, these positive examples, I learn a classifier, the classifier probably sticks the decision
boundary somewhere around here, and then it says anything over here is positive; anything over
here is negative. So when I start getting test examples from the target distribution, it basically
labels everything negative, and so I get like 50 percent error, which means I might as well not have
learned anything.
Okay. So we're now going to run our algorithm. So we have our labeled data from the source
distribution, and now we get a bunch of unlabeled points from the target distribution.
Okay. We now call all of the unlabeled points from the target distribution negative examples
and all of the labeled points from the source distribution positive examples. And I learn a
decision boundary that separates the target examples from the source examples.
Okay. Now I reweight every example by 1 divided by the probability that it's a source example. All right. So this plus gets really, really big, because it looks a lot like a target example, and these pluses get really, really small because they look nothing like a target example.
And, in fact, I mean, Gaussians, you know, they go to zero really quickly, so this would actually
be like the size of this room and these would be like much smaller, like a pixel, right? But I can't
draw that.
Okay. So now we get this reweighted data. We train a new predictor. And on the weighted data
it now says, you know, these three are really the only ones that matter to get correct, these kind of don't matter at all, so I'll stick my decision boundary somewhere here, and positive examples
are to the left; negative examples are to the right, and I basically found the optimal classifier on
the target distribution.
Obviously this is a carefully hand-engineered example, but I hope the intuition is there.
Yeah, so that was basically what we looked at.
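Here is a small end-to-end reconstruction of that picture (my own made-up numbers, not the slide's): labels are negative inside a band around zero, source and target are two 1-D Gaussians on opposite sides, and the reweighting pulls the learned threshold over toward the target's optimum:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def label(x):                            # negative inside the band (-1, 1)
    return np.where(np.abs(x) < 1.0, -1, +1)

X_s = rng.normal(+2.0, 2.0, (500, 1))    # source: mostly right of the band
X_t = rng.normal(-2.0, 1.0, (500, 1))    # target: mostly left of the band

# Domain classifier on unlabeled data: source = 1, target = 0.
dom = LogisticRegression(max_iter=1000).fit(
    np.vstack([X_s, X_t]),
    np.concatenate([np.ones(500), np.zeros(500)]))
w = 1.0 / dom.predict_proba(X_s)[:, 1]   # huge for target-looking points

plain = LogisticRegression(max_iter=1000).fit(X_s, label(X_s).ravel())
adapt = LogisticRegression(max_iter=1000).fit(X_s, label(X_s).ravel(),
                                              sample_weight=w)

for name, model in [("unweighted", plain), ("reweighted", adapt)]:
    err = np.mean(model.predict(X_t) != label(X_t).ravel())
    print(f"{name} target error: {err:.2f}")
```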
Okay. So that gives us an algorithm. Does this algorithm actually work? So we have to look at a couple of terms. So one of the things that bugs me in domain adaptation is that, compared to these nice clean bounds that we have in single-task learning, you have to add like 50 new terms to deal with the fact that there's a domain divergence, but I'll try to step through it in a reasonable way.
Okay. So this is our weighted source error on a sample of size N. So it's empirical error because
it has a hat, it's source because it has S, it's weighted because it has a weight, and it's on N
examples.
And you can basically prove something like this. So the difference between my test error on the
target distribution and my weighted training error on the source distribution is not too big. All
right. And what does it mean by being not too big? Well, we have this order 1 over delta, which
is just to make the probabilities go through. But the thing that matters is we have something that
looks like a complexity divided by N, which is sort of the bound we expect to see.
So what does this complexity say? So it's a term that grows linearly in the square of the largest
weight for an example, okay, where the weight is the reciprocal of the probability of this
example being a source example.
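Schematically, the guarantee has the familiar square-root shape, except the capacity term is inflated by the largest weight -- this is a paraphrase, not the exact statement from the paper:

```latex
\bigl|\, \epsilon_T(h) - \hat{\epsilon}^{\,w}_{S,N}(h) \,\bigr|
\;\le\; O\!\left( \sqrt{ \frac{ W_{\max}^{2} \,\log(1/\delta) }{ N } } \right),
\qquad
W_{\max} \;=\; \max_x \frac{1}{\Pr[\sigma = 1 \mid x]}
```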
Okay. So there are actually more complicated bounds. If you're interested, there's this paper by
Arthur Gretton that talks about these things.
Okay. Actually, I thought there was a slide about this. So the problem is that this can be
enormous. So if there is an example that is, say, a hundred times more likely in your target distribution than in your source distribution, then this is a hundred squared, so now you have
something that grows linearly in 10,000, which means that unless you have, you know, at least
20,000 data points, this bound is completely vacuous.
Okay. So this is their measure of how different are these distributions. And the problem is that
this is a really, really weak measure.
>>: So assume that for multiple classes you just replace [inaudible].
>> Hal Daumé: If you have multiple domains?
>>: Yeah.
>> Hal Daumé: Yeah.
>>: You can do that.
>> Hal Daumé: Yeah. And I think the theory more or less goes through. It should work fine.
>>: So the question that I have is that when you describe this logistic regression could be [inaudible] what kind of weight you have, but you can't label [inaudible] domain or how do you
know [inaudible].
>> Hal Daumé: Oh. So I assume that you tell me; that these are speeches given by Yoram and
these are speeches given by Manfred.
>>: Oh, so that would [inaudible] will be supervised.
>> Hal Daumé: Well, but you don't get to see the transcripts. You only get to know who said it.
>>: Oh, I see. Okay. So you have that level of the supervision, the supervision [inaudible].
>> Hal Daumé: Yeah.
>>: Okay.
>>: So this is just [inaudible] the difference [inaudible] it doesn't guarantee an absolute
[inaudible].
>> Hal Daumé: You can rewrite it so that it does. So this always used to confuse me. People
like to present bounds like this. It's actually easier to prove them in this way. But if you just
rewrite it as this error is bounded by this error plus this, the same thing holds. So you do get the
form that you expect, it's just this is easier to derive.
Yeah, Patrick?
>> Patrick Pantel: Does the number [inaudible]?
>> Hal Daumé: Yeah, of course it has. Okay. The -- so it's kind of hidden here. Let me see
what I can say. I can't remember exactly what the assumption is. I think the assumption is that
you have enough unlabeled data that you can actually estimate these weights pretty much
correctly. So the general assumption here is that I have mounds of unlabeled data and really what's limited is the labeled data.
I think I'll mention something later that looks at this tradeoff, but there's not been much that I
know of that looks at that.
>>: And in practice?
>> Hal Daumé: In practice I think that -- I think that the N is the much more limiting thing. I
don't think that it makes that big of a difference. Yeah, Rich?
>>: Are we assuming covariate shift here or are we assuming that there is a single good hypothesis?
>> Hal Daumé: No, we're assuming covariate shift. So we're assuming covariate shift with
shared support. Because if there's -- so, for instance, if there's no shared support, then this could
be 0, and then 1 divided by this is going to be really bad news.
>>: So then I've got a question about it, then. So suppose the model class that you're using to do the learning, not the learned --

>> Hal Daumé: This learning?
>>: Yeah.
>> Hal Daumé: Yeah.
>>: Oh, it has to be this model class?
>> Hal Daumé: No, it can be whatever you want.
>>: Okay. So suppose that model class can do very well on the source distribution but it turns
out that model can't learn the target distribution well. Now it seems like -- it seems like you've
made some assumption that sort of the model class is equally good on the source of the target
distribution, which is why I was wondering if you were making the assumption that there exists
[inaudible] single hypothesis that was good on both as opposed to just the covariate hypothesis?
>> Hal Daumé: Yeah, let me think about that for a second. I know the answer. I think I know
the answer. So the idea is that -- right. Okay. So what could happen is maybe on my source
distribution there's this nice linear classifier that's happy. But then maybe things go wacky on
my target distribution where I get this like really curved thing. Right?
The reason why this is still okay is because what I'm measuring is my weighted source error, not
my vanilla source error. So what's going to happen is these points are going to blow up, and then
this is going to blow up. I think that's the right answer.
>>: I think that's right.
>> Hal Daumé: Okay. All right. So in practice, even though this looks scary because of this,
this algorithm actually works pretty well. Like I actually have used it and I recommend using it.
I think that what's going on here is that we just don't know exactly the right way to analyze it. I
don't think it's that the algorithm is really broken.
>>: But the intuition is clear, I mean, if the source looked more like [inaudible].
>> Hal Daumé: Yeah, exactly.
>>: But when you train that classifier, you need to have the knowledge about whether particular
[inaudible].
>> Hal Daumé: Yeah, yeah. So I think -- so this is interesting. I mean, this is something I've
been interested in for a while, is like -- so we've been looking at this -- I mean, maybe I'll talk
about this at the end, but we've been looking at this problem where say you're doing online learning and examples come in -- I mean, the normal assumption is they come in with the input and what domain it comes from, and your job is to predict the output.
But sometimes they don't come in with the domain, right? And so one of the things that we've been thinking about is can you learn on the basis of X to predict what domain it's coming from
and then simultaneously learn that with the actual predictor.
So maybe for some of them you get the domain information, but not for all. I think it's an
interesting question, but I don't have a great answer.
Okay. So in order to make this work, we had to make two assumptions. So I guess Rich sort of
pre-empted this slide. So the first assumption is the covariate shift assumption, so the
distribution of labels is the same, and the second is the shared support which basically prevents
us from having 1 divided by 0.
The nice thing about this algorithm is that in the limit as I get infinitely many training points, my
empirical error, my empirical weighted error on the source distribution approaches my target
expected error. Okay? And this is actually kind of amazing. It means that we're getting optimal
classification on the target distribution without ever seeing any labeled target data.
I don't know. I mean, like, okay, maybe in symbols it's not that impressive, but, I don't know, I
think that this is pretty -- this is a strong result.
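In symbols, then, the consistency claim (stated loosely) is just:

```latex
% Weighted empirical source error converges to expected target error.
\hat{\epsilon}^{\,w}_{S,N}(h) \;\longrightarrow\; \epsilon_T(h)
\qquad \text{as } N \to \infty
```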
The problem is that the convergence rate is terrible. So if you ask the question, well -- you
know, okay, well, as N trends toward infinity this happens, what happens for finite values of N.
The problem is that the rate of convergence depends on the largest ratio of probability of an
example between target and source, which can be quite huge. Right? Even if you think that it's
possible that sometimes I sound like Manfred or Manfred sounds like me, that's like so unlikely,
right? And so this ratio is just going to blow up to something enormous.
Okay. So if you want more information -- yeah, I got his name right. So Heckman, this is the
sample selection bias stuff from the '70s. So you can get a Nobel Prize by working on domain
adaptation, or at least you could 30 years ago. I don't know if you can anymore. I'm keeping my
fingers crossed. And then I presented the result by Gretton. So Corinna and other authors have
sort of follow-on work to the Gretton work. And then the Steffen Bickel paper is basically the
algorithm that I sketched where you train the logistic regression to separate the two distributions.
Okay. And I should mention this tutorial is also online on John's Web page, and it's linked off
my Web page. Okay. So that was covariate shift. So now we're going to forget about -- we're
going to forget about shared support and we're going to ask what can we do in the case of
nonshared support.
Okay. So where do we not have shared support? So here's an example from Amazon. So this is
our book review for Running With Scissors. We're going to have a bunch of labeled examples
for book reviews, and then we're going to have test data about kitchen appliances that looks more
like this. Avante Deep Fryer, black. Lid does not work well. I love the way the Tefal deep fryer
cooks. However, I'm returning my second one due to a defective lid closure. The lid may close
initially, but after a few uses it no longer stays closed. I won't be buying this one again.
So this is clearly a negative example to us, but it looks nothing like the book examples, right?
You would never, ever, ever see a book review where they say this. Right? It's just not going to
happen.
So from a sort of mathematical perspective, the probability of this example under the source
distribution is zero. And perhaps the probability of this example under the target distribution is
also zero. I'm never going to review a blender by saying that I lit it on fire and there's one less
copy. You would never refer to a blender as a -- I don't know. I wouldn't ever refer to a blender
as a copy.
Okay. So, again, the setup is we have a bunch of labeled data from the source distribution, we
have a bunch of unlabeled data from the target distribution. But now I'm no longer willing to
assume that there -- the probability distributions overlap; they're completely disjoint at this point.
So the hope -- let me actually see what it -- yeah. So actually I don't remember what I wanted to
say for that.
So the main maybe takeaway message from a practical perspective is that if you train and test on
books you get an error of like 13 percent. If you train on books and test on appliances your error
rate basically doubles. So we'd like to sort of avoid this, because 26 percent is pretty terrible.
So just to back up for a second, I want to talk a little bit more specifically about where our
examples come from. So the example I'm going to be using is basically bag of words
representation for documents. Or I guess this is maybe term frequency vectors for documents.
So we get some document. We write it as a vector, so some word occurred three times, a bunch
of words occurred zero times, some other word occurred once, a bunch of things occurred zero
times, some other word appeared once.
Okay. So this is our vector X. Maybe the first word is excellent. That word is great. And that
last word is fascinating. And then our classification is into positive or negative. And we're just
going to assume that it's the sign of a linear classifier parameterized by theta. So theta is going
to be our weight vector. Oops. I thought that animated more. We're going to take the dot
product. If it's greater than zero we say positive; if it's less than zero we say negative.
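As a concrete toy version of this representation and decision rule (hypothetical three-word vocabulary and made-up weights):

```python
import numpy as np

vocab = {"excellent": 0, "great": 1, "fascinating": 2}   # toy vocabulary

def term_frequency_vector(doc):
    x = np.zeros(len(vocab))
    for word in doc.lower().split():
        if word in vocab:
            x[vocab[word]] += 1      # x[i] = count of vocabulary word i
    return x

theta = np.array([1.2, 0.8, 1.5])    # made-up weight vector

x = term_frequency_vector("a fascinating and excellent read")
prediction = "positive" if theta @ x > 0 else "negative"
print(prediction)
```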
Okay. So we're going to move into the no shared support setting. So I guess there should be an
X through this picture. So there is no shared -- oh, wait. No. I flipped the "for all" and the "there exists."
Okay. So shared support means there is no point that's possible in one and not possible in the
other. No shared support means there do exist points that are possible in one and not possible in
the other. It doesn't mean every point is possible in one and not possible in the other. Okay. So
this is the correct picture. It's good we had pictures, because otherwise I would have lied to you
for the rest of today.
And the second thing we're going to assume is the single good hypothesis. So there's some in
this case linear classifier theta that gets small combined error on source and target distribution.
So pictorially maybe we have -- I can't even remember which color is source and which is target.
But we have one of them. We have the other one. Even though they're non -- or even though
they don't completely overlap, there's some linear boundary that is good for both.
Okay. And this is a stronger assumption than the covariate shift assumption. And what we're
going to do is we're going to hope that there's some coupled representation of the two domains
that makes them look similar. Okay. So in a sense we kind of want to throw out whatever
makes these domains different and only retain what makes them the same. And if we can train a
classifier on the stuff that's the same, then hopefully it will work well on both domains because
we haven't overfit to something that's source specific.
Okay. So we're basically going to try to take our books, take our kitchen appliances, and project
them into some common space. And then we're going to of course bound the target error,
because that's what we love to do.
So the question -- the first question is, well, do I believe the single good linear hypothesis
assumption. Right? Like if I don't believe that, then why am I even talking about it?
So here's empirical data to try to give you a sense. So I'm actually going to measure squared
error rather than classification error, just because it matches the theory slightly better. So here
we're predicting plus 1 and minus 1 and we're measuring how far off our squared error is.
So if we train on books and test on books, the error is like 1.35, whatever that means. If we train
on kitchen and test on kitchen, the error is like 1.19. Lower is better, of course.
If we train on the union of books and kitchen and test on books, the error goes up slightly, but
not really by much. And if we train on both and test on kitchen, the error goes up slightly but not
by much.
>>: So the fact that they increase, that means the sum is just [inaudible] in the training, right? It does have something in common across the different domains.
>> Hal Daumé: No, the next numbers I show will show that. What this is saying is that even
though -- okay. So we're always training a linear classifier, right? So if I add -- so if I'm training
my linear classifier in books and I add in a bunch of crap from kitchen appliances and it doesn't
really make it much worse, then I kind of believe single good hypothesis, right? Because I found
some hypothesis that's pretty good for both.
Okay. So the other thing you can ask of course is, well, what happens if you train on books and
test on kitchen and vice versa, and it turns out that this is pretty terrible. So this is evidence that
there is -- there are points that don't have shared support.
Okay. So if we go back to what we talked about at the beginning, right, so we know that our
error on the target is bounded by the error on the source plus this total variation, but there's a lot of
problems with total variation which we'll go into in a second. But the question is if we assume
that a single good hypothesis exists, can we get a better notion of discrepancy than total
variation.
So the idea is total variation is completely agnostic to the fact that we're trying to classify.
Right? It's just some measure of difference between probability distributions. So maybe if I
keep in mind the fact that my end goal is classification, I can develop a better measure of the
difference between the domains. And so the idea is we're going to look at two classifiers, so H
and H prime. You can think of H and H prime maybe as like H is really good on books and H
prime is really good on kitchen appliances.
Okay. And what we're going to ask is if I learn H but the truth is H prime, how bad is that for
me. Right? So in order to do that, we're going to look at the regions of space where they
disagree. Right? So maybe H says everything down here is positive; everything up here is
negative. Maybe H prime says everything over here is positive; everything over here is negative.
So I don't know if I set this up right because I'm not sure whether gray is supposed to be good or
bad, but either these are both good and these are both bad, or these are both bad and these are
both good. Okay. So gray is bad. So gray is where they disagree.
So we're essentially going to define this measure of discrepancy as under the distribution that we
care about how big are those bad pie wedges. Right? So they might be big, but maybe there's no
data in there anyway. Okay. So we want to measure how often are we going to draw an
example that lies in that discrepancy region.
Okay. So to sort of unfold it, we're going to consider all hypotheses H and H prime, so all linear
classifiers, because we're going to -- you know, all of this is worst case, right? We learned H, we
were trying to learn H prime, what's the worst we possibly could have done. Right? So we have
two classifiers, H and H prime.
We're looking at how much do H and H prime disagree on the source distribution versus how
much do H and H prime disagree on the target distribution.
Okay. And the reason why I care about this is because I can learn -- okay. So now think of it
slightly differently. Think of H as the thing we're learning, H prime as the thing we want to
learn. Right? So what I want to know is that if I'm doing well on the source distribution, I'm not
doing too badly on the target distribution.
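Written as a formula (this follows the standard discrepancy-distance definition; the slide's notation may differ):

```latex
\mathrm{disc}_{\mathcal{H}}(S, T) \;=\;
\max_{h,\,h' \in \mathcal{H}}
\Bigl|\, \Pr_{x \sim S}[\,h(x) \neq h'(x)\,]
      \;-\; \Pr_{x \sim T}[\,h(x) \neq h'(x)\,] \,\Bigr|
```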
Okay. So here's a little example. So -- and this is why we don't need shared support anymore.
So here's an H and H prime. Here's my source distribution, here's my target distribution. Right?
So intuitively you might say, well, these two hypotheses are very different. So maybe they lead
to high discrepancy. You could also say these distributions don't overlap at all. Maybe this is
high discrepancy. But in this case it's actually low discrepancy, because there are no points on
which H and H prime disagree. Right? On the -- let me see if I can say this better.
So I shouldn't say there are no points. There are exactly as many points in one distribution on
which they disagree as there are in the other distribution.
Okay. So they're disagreeing all the time, but they're disagreeing equally often on each
distribution. So we say that this distribution discrepancy is low.
It's also low the other way around. So if they're never disagreeing on the source distribution and
they're never disagreeing on the target distribution, this is also low discrepancy, even though
these two distributions are far apart.
The case where it becomes high is when one distribution lies in this discrepancy region and the
other distribution doesn't lie in this discrepancy region.
Okay. So we've defined this notion of discrepancy. The question is, is it good in any way? And
we're going to sort of compare it to this notion of total variation.
So the nice thing about discrepancy is that if I get unlabeled samples from the source distribution
and the target distribution, I can compute, at least approximately, the discrepancy of these
distributions. Basically I train a classifier to separate the distributions. If the classifier does a
good job at separating them, then they're far apart and they have high discrepancy. If the
classifier does a crummy job of separating them, then they're close together -- did I say high or
low last time? Okay. If they're far apart, it's easy to separate. They have low discrepancy. If
they're close together, they're hard to separate, they have ->>: Higher.
>> Hal Daumé: High. They have high when they're close together. Right? I hope I said that
right. Okay. So this is computable from finite samples. The only reason I have to say
approximately is because I have to find a linear decision boundary that optimizes 0, 1 loss, which
is NP-hard, but you can approximate it by an SVM or something like that.
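One way this is done in practice is the so-called proxy A-distance: train a separator on unlabeled source-versus-target data and turn its held-out error into a distance. A rough sketch, with illustrative names (the 2(1 - 2·err) scaling is the convention from the Ben-David et al. line of work):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def proxy_distance(X_source, X_target):
    X = np.vstack([X_source, X_target])
    d = np.concatenate([np.ones(len(X_source)), np.zeros(len(X_target))])
    err = 1.0 - cross_val_score(LinearSVC(), X, d, cv=5).mean()
    # err ~ 0   -> easy to separate -> distributions differ -> distance ~ 2
    # err ~ 0.5 -> hard to separate -> distributions alike  -> distance ~ 0
    return 2.0 * (1.0 - 2.0 * err)
```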
Total variation is much harder essentially because we have to do density estimation. So if you
do density estimation with, say, two Gaussians, maybe these two distributions look very close,
but if you do density estimation with a mixture of Gaussians with small variants, maybe the
variation looks very high. It's very hard to tell from a finite sample. And in fact it's actually
impossible to tell from a finite sample. There are some theory papers about this.
So you can view this as a pro or a con. But the total variation has nothing to do with the
hypothesis class that we're trying to learn, whereas the discrepancy distance depends crucially on
the hypothesis class we're trying to learn. So I see this as a positive for discrepancy. But you
could argue that it's a negative.
Okay. And so this -- the Bickel algorithm that I sketched before, it's sort of heuristically
minimizing both. So when you're training this classifier to distinguish between source and target
and weigh examples by the probability under this classifier, this is basically the same -- or it's
pretty much exactly the same as computing this discrepancy value and then weighting the
distributions according to the discrepancy. Okay. So that's not how it was motivated, but it turns
out that it's sort of a heuristic for solving that problem.
Okay. So okay. So we've defined this measure. It has a couple of nice properties, at least maybe nicer than total variation, but does it actually mean anything in practice?
So here we have four domains, so books, DVDs, electronics, and kitchen. Books and DVDs,
they tend to have a lot of shared vocabulary. Electronics and kitchen also tend to have a lot of
shared vocabulary, but the two pairs are quite different.
So you might say fascinating and boring for books and DVDs. You probably wouldn't describe a
blender as fascinating or boring. And you might describe a blender as super easy or bad quality,
but you probably wouldn't describe a book as super easy.
So here the plot is the empirical discrepancy measured between the two domains. So say
between electronics and kitchen. And on the Y axis it's the error rate if you train on electronics
and test on kitchen. Okay. And so basically what you see is as the discrepancy goes up, the
error for training on one domain and testing on the other domain also goes up. It's linear, but
that's kind of an accident. Okay?
All right. So one thing that's hidden in this slide is that this Y axis is not symmetric. So training
on electronics and testing on kitchen doesn't give you the same error rate as training on kitchen
and testing on electronics. So the other ones aren't plotted. They look more or less the same, but
it just confuses the picture a little bit. Oh, actually, there's a more important point there. The Y
axis is not symmetric. The X axis is symmetric. So there's something a little funny going on
there. I don't know. Maybe someone can figure out how to define a discrepancy measure that
actually reflects this asymmetry between the distributions.
Okay. So I'm just going to make this all show up, and then we'll talk about it. Okay. So this is
the first of the nasty things that I'll show you. And I hope you didn't think the other things were
nasty, because if you did, you're in for really bad news now.
So what we're looking at is how far is our learned classifier on the target domain from the
optimal classifier on the target domain, where each star is my single good hypothesis. Okay.
Each star is not my optimal test -- my optimal target classifier. It's my optimal source and target
together classifier.
Okay. So in a sense this is the best we could hope to do without labeled target data. If we get
labeled target data, maybe we can do better. But without label target data, this is the best we can
do. Okay. So we have our usual term, which is how well does H do on the source data. So this
is our normal training error. Actually, I think there are more ovals.
The next two terms are our complexity measure. So this is -- you know, you can plug your ears
if you don't like weird words. This is the Rademacher complexity of the hypothesis class we're
considering on the source distribution, and this is the Rademacher complexity of that class on the
target distribution.
If you don't know what Rademacher complexity is, that's fine. It's just some measure of
complexity of classifiers. If you want to think VC dimension, thinking VC dimension is fine. If
you don't want to think about it at all, that's fine too.
>>: So related to size of the parameter how many --

>> Hal Daumé: Yeah. It's -- so Rademacher complexity is basically how good is my hypothesis
class at fitting noise. So if I take data, I add noise to it, can it fit that noise. And if it's a very rich
hypothesis class, it can fit all sorts of crazy noise. But if it's not very rich, it can't fit noise. So
that's sort of the intuition behind Rademacher.
Okay. So then we have this other term which basically has to do with the sort of standard
probabilistic argument which we can ignore, and then the last term is this discrepancy term
which we just defined a couple slides ago.
So basically I think the way to think about this is if I can get low error on the source distribution
and there's a small discrepancy between the source and target distribution, then I'll have low
error on the target distribution.
Okay. And these are sort of -- you know, if you're doing linear classifiers, this is just a constant.
If you have a fixed number of points, this is just a constant. Okay?
Oh. One last thing. The thing that's really nice about this bound is everything on the right-hand
side you can compute from unlabeled data. Well, this you need labeled data, but you have
labeled source data. Rademacher complexity you can compute from unlabeled data. This is just
something, and the discrepancy we said we can compute. So it's not one of those bounds where
it's like gee, I hope that number's small. You can actually compute the number and see if it is
small or not.
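Putting the pieces together, the bound has roughly this shape -- a loose paraphrase of the result being shown, where h* is the single good hypothesis and the exact terms and constants live on the slide:

```latex
\epsilon_T(h) - \epsilon_T(h^{*}) \;\lesssim\;
\hat{\epsilon}_S(h)
\;+\; R_S(\mathcal{H}) \;+\; R_T(\mathcal{H})
\;+\; O\!\Bigl( \sqrt{ \tfrac{\log(1/\delta)}{N} } \Bigr)
\;+\; \mathrm{disc}_{\mathcal{H}}(S, T)
```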
Okay. So here's an example of how you can do this in practice. So we're going to have a linear
hypothesis class. And we're going to take our data and we're going to use some projection
matrix P to project it into a lower dimensional space. And the idea is that information that's
important for classification should be preserved under this projection, but information that is
relevant only to one domain or the other should not be preserved under this projection.
Okay. So the simplest projection of course is the identity projection. So we just copy our
vectors. This is kind of uninteresting. The more interesting case is when we have some
nonidentity projection and we actually drop the dimensionality of the space. Okay. So maybe
we start out with this big vector and we project everything down into two dimensions.
Okay. So what do we want for P? So we want it to minimize divergence. So what this means is
the distribution over projected examples in the source domain and the distribution over projected
examples in the target domain should be small. Because that makes that discrepancy term that
was the last term in the bound go down.
And then the second thing is that we want there to be -- to still be a single good hypothesis in this
projected space. Right? So one way to make the divergence small is just to project everything to
zero. But then you can't find a single good hypothesis. One way to make the single good
hypothesis hold is to not project anything but then our divergence will kind of suck. So this is
the tradeoff that we have.
Okay. So this is basically John's thesis. So if you want lots more information about it, you can
look there.
So the idea is that there are going to be a lot of things that are specific to one domain or the
other. And then there are also hopefully going to be some things that are general to both
domains.
And essentially what we want to do is we want to use the stuff that's general to the two domains
to try to figure out that fascinating books are kind of like blenders that work like a charm. Or
boring books are kind of like defective fryers. Okay. And the way that we're going to do this is
we're going to use the information that's coupled between the domains to try to find a projection
that puts similar words in a similar part of space, regardless of which side they come from.
Okay. And these purple words are called pivot words.
Okay. So here's a little example. So this is from -- the top is -- oh, wait, no. These are all target
distribution examples. So maybe this first one says do not buy. We know from our pivot
features that not buy is probably negative. The second one says absolutely great purchase, we
know from our source distribution that great is good. And then this last one, it says a sturdy deep
fryer. And since no one describes a book or a DVD as sturdy, we have no idea whether this is a
good thing or a bad thing.
So what do we do? We look at context, right? So we noticed that in this target example it says
do not buy, so this is a pivot, so we know this is bad, blah, blah, blah, blah, blah, blah, defective.
Right? So maybe defective is indicative of negative.
And then the next one says an absolutely great product, blah, blah, blah, blah, blah, this blender
is incredibly sturdy. So maybe sturdy is a positive thing.
And so then when we get down to this bottom case, we can kind of hope that we can learn that
sturdy co-occurs with great, great is positive, so maybe sturdy is also positive.
Okay. So that's the intuition. The mechanics behind it, how it works, are a little bit complicated.
But basically the game that we're going to play is we're going to take a document and we're
going to try to train a classifier to predict whether this document contains the word great or not.
Okay?
>>: What if they said not great?
>> Hal Daumé: Negations. I hate those. They should be disallowed. I have nothing to say
about negations. I mean, negations are actually interesting, because it's basically an XOR
problem, and so linear classifiers are going to die anyway. So yeah. There are lots of issues with
those.
So maybe we see that, you know, often if we see the word sturdy, this is a good predictor of the
word great, which hopefully indicates that there's some relationship between being sturdy and
being great.
Okay. So for each of these features we train a classifier to predict whether the review contains
the phrase highly recommended based on the rest of the text in the review. And we can write
this down as a big matrix of weights, right? Weights for each possible pivot.
Let me actually skip this part. No, I shouldn't skip this part. Okay. So -- yeah, I should actually
go through this. No, I'm actually not going to go through this. It will take too long.
Okay. So what do I want to say? So I'll give the sort of brief summary. So the brief summary is
that there might be features that are consistently highly indicative of positive pivots and other
features that are consistently highly indicative of negative pivots.
And so what I'm going to do is I'm going to -- so basically what this means is that even though
this is a high-dimensional matrix, there is low-dimensional variation in it. So I'm going to run
PCA, or whatever you want, to do a singular value decomposition on this giant matrix and find
the direction of maximum variation, and this is going to correlate features which are just other
words with pivots, which are the words that I know things about.
Okay. So it turns out that later people discovered that this is basically running CCA. So if you
know CCA, this is exactly CCA between the source distribution and the target distribution.
And so now the classifier that we're going to learn is a linear classifier not on the inputs X but on
the projected inputs. And then we're going to threshold it.
Okay. So if we go back to the bounds. So I'm not going to measure these things because they're
going to be constant for everything that we talk about, but I can measure how good is the source
classifier and what's the discrepancy. Okay.
So we're going to start with the identity projection, so we're not going to project anything. So it
has relatively high discrepancy. I mean, I don't know what 1.7 means, but when I show the other
numbers you'll see that this is big.
So it has relatively high discrepancy. It has very low loss on the source distribution. So this is
Huber loss, doesn't really matter what it is. But it's very low loss, but it has quite high target
error. Okay. I think this is going from books to kitchen, because that's the running example.
So I could alternatively do a random -- yeah.
>>: Will this handle the case where [inaudible]?
>> Hal Daumé: No.
>>: So you kind of assume that the condition is the same?
>> Hal Daumé: Yes. But what's going to happen is if -- so let's say that great is always positive.
And in one domain sturdy correlates with great and in the other domain sturdy anticorrelates
with great. So then what's going to happen is when the singular value decomposition runs or
when CCA runs, sturdy is going to get thrown out, because it's not highly indicative of one of the
pivots. So it will just ignore that information. It won't use it in any way, but it will just
completely ignore it.
>>: [inaudible]
>> Hal Daumé: Yeah. Okay. So discrepancy is high, source error is low, target error is high.
You can also do random projections. People like random projections. Just choose a Gaussian
random matrix.
Discrepancy drops. This is kind of just an artifact of the fact that you've dropped the dimension,
so there just aren't as many classifiers in lower dimensions. The source error has gone up
dramatically, because we're throwing away useful stuff by this random projection. And the
target error is worse than random guessing. So that's not a good idea.
So then you can do this coupled projection thing, this CCA thing. So the discrepancy drops
actually below what you have for random. So the number of dimensions retained here is the
same, so it's actually kind of comparable. The source error only goes up slight -- well, it goes up
6 1/2 percent. Whether you think that's slight or not, I don't know. But the target error is sort of
better than the previous.
Okay. So at least, in terms of whether I can tell a story about this bound, it seems there's some
story I can tell.
Okay. How does this actually work in practice? So we're going to consider three source
domains, so books, DVDs, and electronics. We're going to always test on a kitchen domain.
And this is how well we do if we train on kitchen and test on kitchen. So this is now accuracy.
This isn't something opaque like Huber loss or something.
So now we're going to do no adaptation. So we train on books, test on kitchen, train on DVDs,
test on kitchen, train on electronics, test on kitchen. So electronics and kitchen are the most
similar, so it's not too terrible.
And now we do the adaptation. So we don't recover all the way to having labeled kitchen data,
but you kind of wouldn't really expect to. And electronics we split the difference. And the rest
we actually more than split the difference, so it seems like it's doing a pretty good job for not
having any labeled target data. Yeah.
>>: Do you have any account of why DVDs [inaudible] so much more?
>> Hal Daumé: Rather than books?
>>: Rather than books [inaudible].
>> Hal Daumé: Well, actually, let's see. So DVDs to kitchen, right? We can go back to this
picture maybe and see --
>>: [inaudible] DVD machine or for DVD --
>> Hal Daumé: DVD movies. Not the machine.
Where is this slide? I didn't realize it was this far back. Let's see. DVDs to kitchen. Actually
you would expect it to be pretty -- well, okay. So DVDs to kitchen versus books to kitchen, I
guess the discrepancy is slightly lower for DVDs to kitchen. So you might expect it to do a little
bit better. And I don't have the good intuitive -- I mean, it might just be that you actually can say
that DVDs break, you can say -- I don't know what else you could say.
>>: Scratch?
>> Hal Daumé: Scratch, yeah. I don't have a great intuitive answer. And I doubt that that little
bit of discrepancy is actually accounting for that big jump in performance.
>>: So do you have a comparable result for these tasks using source weighting?
>> Hal Daumé: I don't have it in the slides. It doesn't work as well. It's definitely better than no
adaptation.
>>: [inaudible].
>> Hal Daumé: Yeah. I'd have to look it up. It's been too long since I've looked at it.
Okay. So we get something like a 36 percent reduction in error.
Okay. So you can also look at what P does. This is -- I don't know if this gives you that much
intuition, but up here -- so this is zero, these are things that are indicative of positive, things that
are indicative of negative, and where different words fall along this projection.
So in books I guess Grisham writes good books, or at least he writes popular books, things you
must read, things that are engaging, things that are fascinating. And sort of the corresponding
words on the kitchen side are years now, so like I've owned this for 20 years now, it's a breeze to
use, some things are perfect, espresso. People just like espresso in general.
>>: Are you looking at all bigrams, or they just --
>> Hal Daumé: No, it's just some bigrams. On the negative side, so it's kind of bad to say it's
predictable. If you're talking about how long it is, this is probably not a good sign. If you
mention the plot, I guess you never talk about the plot if it's good, you only talk about the plot if
it sucks. Kitchen appliances can be awkward to use, they can leak, they can be poorly designed.
And I guess in general if you have a kitchen appliance made of plastic it's not going to be liked.
Okay. So there is -- these are sort of the references. The most interesting stuff that I didn't talk
about is actually this Yishay Mansour work where they actually extend this notion of domain
discrepancy and they tighten a lot of the bounds. So this is a great, great paper. The other ones
are great too, but this one really settles a lot of the questions that were raised in the previous
work.
Okay. So it's now time to eat. So get snacks. I don't know, what do you want to say, like 15
minutes? Ten minutes?
>> Patrick Pantel: Yeah, let's get back [inaudible]. 15 minutes we'll reconvene.
>> Hal Daumé: Okay.
[break]
>> Hal Daumé: Okay. So we can get started again. All right. So just to sort of summarize what
we've done so far. So in the beginning we talked about these three important concepts, so
covariate shift, single good hypothesis, and discrepancy, some measure of discrepancy.
And then we talked about basically two cases. The first case is the shared support case, the
shared support covariate shift case. And there we have this algorithm where we train a classifier
to separate the two distributions and then use the probabilities that it produces to weight
examples when we're training a source classifier.
And then the second one we basically said, okay, don't do weighting, instead we're going to try
to project the source data and the target data into some common space and we're going to hope
that single good hypothesis still holds in that space and there's relatively low discrepancy.
Okay. And so in all of these cases, the best that we could hope to do -- say in the single good
hypothesis case, is as good as that single good hypothesis.
Okay. So what we're going to do now is we're going to ask the question, well, what happens if I
can get a small amount of labeled target data. Now I have the potential to be able to do better
than single good hypothesis. Okay. And so the question of course is can we.
So I'm going to talk about two sort of styles of approach. And at some level they're basically the
same. So one is basically thinking about adaptation at a feature level and the next is thinking
about adaptation at a parameter level. So sort of if you squint, these -- this set is going to look
sort of like discriminative learning and this set is going to look not like generative learning but
it's going to be like graphical model stuff.
So there will be a little bit of things like that in there. And then I'll talk about translation at the
end for those of you who are holding out for the translation.
>>: What?
>> Hal Daumé: I'll have Lucy pat you on the shoulder when it's time to pay attention.
Okay. So all right. So here is feature based approach. So let's say we're looking at reviews of
cell phones and reviews of hotels. So actually -- someone asked a question almost exactly like
this earlier, right?
So in both cases saying that this product is horrible is a bad thing, but saying that your cell phone
is small is usually good but saying that your hotel is small is actually usually bad. And so if you don't
have any labeled target data, sort of the best you can hope for is that you just learn to ignore the
word small.
But now that we have some labeled target data, we're going to actually try to learn that horrible
behaves the same across the domains and small behaves differently across the domains.
Okay. So this is sort of the brief summary of this 2007 paper I had which probably actually
everyone knows about. It's like that classic case where if you have a really easy, trivial-to-implement
paper, it will be your most cited paper, even if it's like not something that you're that
proud of.
So but the idea is we want to share some features across the domain, and we want to not share
others. But of course I don't want to have to hand specify which ones to share and which ones
not to share, so I'm going to let some arbitrary learning algorithm decide which is which.
And I'll tell you in advance, I'm actually going to tell you more than what's in that ACL paper.
So you might actually get something new.
Okay. So these are the feature augmentation techniques. So we have two examples: The phone
is small; the hotel is small. So if we're doing bag of words representation, we can extract these
words from this source example, these words from this target example. And then naively we
could learn a hypothesis on, say, the union of this data, in which case maybe we would just learn
that small isn't indicative of good or bad because sometimes it looks good, sometimes it looks
bad, so let's give it a weight of zero.
So these are going to be our original features. I'm then going to add source specific features and
target specific features, so these are the augmented features. So source specific features I only
generate for source examples; target specific features I only generate for target examples.
And so now if it wants to learn that small is good in the source domain, it can put a high weight
on S colon small, and if it wants to learn that small is bad in the target domain it can put a large
negative weight on T colon small. And then maybe put a 0 weight on W colon small.
So for the source domain we expand the feature vector X to X followed by X followed by a
bunch of zeros, and for the target domain to X followed by a bunch of zeros followed by X.
Okay. And this is of course a redundant representation. You could actually get rid of the last
copy and it doesn't really matter.
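In code the augmentation is essentially a one-liner. A minimal sketch, using dense numpy arrays for readability (in practice you would keep everything sparse):

    import numpy as np

    def augment(X, domain):
        # Shared copy, then a source-specific slot, then a target-specific slot:
        # source examples become [x, x, 0]; target examples become [x, 0, x].
        Z = np.zeros_like(X)
        return np.hstack([X, X, Z]) if domain == "source" else np.hstack([X, Z, X])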
Okay. If you like kernels, there's sort of another way to think about this. So if we do this feature
expansion and then we compute the kernel between two examples, the kernel in the augmented
space for examples that come from different domains is just the same as the kernel in the original
space, because you basically get X dot X plus X dot 0 plus 0 dot X. And in the -- if these two
examples are from the same domain, then we get X dot X plus X dot X plus 0 dot 0, which is just
2 times the original kernel.
And of course now you can play this game, well, now 2 is kind of like a hyperparameter, maybe
I can play with 2 and set it to different values maybe depending on the discrepancy.
So if they're very different, maybe I want this to be high. If they're very similar, I want this to be
low. I never really tried anything like that, but it seems like a reasonable idea.
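The kernel view is equally short. A sketch, with c standing in for the hyperparameter just mentioned (c = 2 recovers the plain augmentation):

    def augmented_kernel(k_xy, same_domain, c=2.0):
        # K_aug(x, x') is c * K(x, x') when x and x' come from the same domain,
        # and just K(x, x') when they come from different domains.
        return (c if same_domain else 1.0) * k_xy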
So one interesting thing to note is that at this point we've basically ensured that single good
hypothesis is going to hold. Like if there was a good hypothesis in the source distribution and
there was a good hypothesis in the target distribution, maybe they have nothing to do with each
other, there's going to be a good hypothesis in this data, because I can just have the good one in
the source ignore everything but this middle component, and the good one in the target ignore
everything but the final component, and they're going to do awesome.
And we've also destroyed shared support, right? It's absolutely impossible to see a nonzero here
in the target domain.
Okay. So here's what some of these weights look like just to sort of give you a sense. So here
what we're doing is we're doing named entity recognition. We're trying to pick out people,
geopolitical entities, organizations, and locations. And we have six different domains. We have
broadcast news, telephone conversations, newswire, blogs, Usenet, so news groups, and
telephone conversations -- sorry, broadcast conversations. This is like conversations of people on the news, right, like, you
know, gee, how's the weather today, blah, blah, blah, right?
Okay. So big means large weight, white means positive, black means negative.
So the feature that we're looking at in particular is this feature bush. And so in general, which
basically means across all the domains for the shared feature, this word bush is highly indicative
of being a person and it's highly not indicative of being an organization. But in broadcast news,
it's even more certainly a person. Which means probably in broadcast news they're not really
talking about like shrubbery. And so on.
So this is kind of a not super interesting case. Maybe a more interesting case is this. So here the
feature is the previous word is the word the. Okay. So in general if I'm a word following the, I
am probably not an entity. Except in conversations. And in conversations if the previous word
is the, I'm actually reasonably likely to be some sort of entity.
So why? I think it's on the slide. No, it's not on the slide.
>>: [inaudible] the church or --
>> Hal Daumé: Yeah, the church or -- yeah, exactly. I don't know why people only say this in -- hmm?
>>: The bat.
>> Hal Daumé: Yeah. Okay. So, anyway, it -- so the point of this is that, you know, even if you
had asked me ahead of time like previous word equals the, should this be a shared feature or a
domain specific feature, I probably would have said it should be a shared feature. But actually
it's a useful discriminator in one domain, even though I wouldn't have anticipated that.
Okay. I have other examples, but I won't show them all.
Okay. So this is like the nasty chart. I'm not going to go into too much detail. But this basically
like other ways you could try to solve this problem, and then this is how well you do with the
feature augmentation stuff. And the bold ones are of course the case where this either wins or it
ties for the best within some confidence. I think it's like 5 percent or something.
And we have a bunch of tasks. So this is named entity recognition. Here we have a bunch of
different domains. These are the ones I showed before. This is part -- no, this is named entity
recognition in the CoNLL data. This is part of speech tagging where you train on Wall Street
Journal, test on PubMed. This is capitalization where you train on Wall Street Journal and test
on CNN.
This is syntactic chunking on the tree bank. So lots of people don't know this, but the tree bank
isn't just Wall Street Journal. There's other stuff in there too. So there's Wall Street Journal,
there's switchboard, which is telephone conversations, right? Yeah, okay. And then there's
subsets of the brown corpus, so there's like fiction and other things. I have no idea what all these
things stand for, but just different types of English text.
So one of the things that's interesting is that -- oh. And so all of these numbers it's I train on
everything else and test on the one that's held out.
So if I train on everything else and test on Wall Street Journal, augment does a good job. If I
train on everything else, test on switchboard, it does a good job. If I train on everything else and
then test on any of the browns individually, it doesn't win. But if I treat the brown corpus as just
one big corpus, so if I lump all of these together, then it starts working again. So this is a little weird.
So the question was can I kind of figure out why this is going on.
>>: Well, are they very small? I mean, how big are [inaudible]?
>> Hal Daumé: They're pretty small. So actually another thing I want to point out is that even
though we're losing here, so this is a bunch of different baselines and this says which baseline it
is. So it's not consistently the same one that's winning. So that's kind of annoying from a
practical perspective.
Okay. So these blue guys are the ones where we seem to win. It turns out that if you look at the
ones on which we win and look at these two columns, which I didn't explain, so this is if you
train only on everything else and test on Wall Street Journal how well do you do, so you train
only on the source domains and then test on the target domain. This is if you train only on the
target domain and test on the target domain.
Okay. So blue is where we win. And it turns out that every time we win the source only error is
worse than the target only error. So what does this mean? So why would the source only error
be better than the target only error? Here I'm training on other stuff, testing on brown. Here I'm
training on brown, testing on brown, but I'm doing worse training on brown and testing on
brown. Why? Because I've got no data. Right? So the reason why you do better training on
source and testing on target is if you have lots of data and the domains are pretty close.
If they're far, it doesn't -- I can give you like a billion examples and it doesn't matter. But if
they're close and I have lots more data here than here, I'm going to win with the source only
model.
And it turns out, and we couldn't really explain this at the time, that these were exactly the cases
where we didn't win. Okay. But it was at least a correlation.
Okay. So then, you know, some student's like I need a project to work on, I'm like, okay, well,
go explain this. Okay.
So then he comes back a couple weeks later and he says, okay, I explained it. Here's something.
Okay. And it takes me a while to try to figure out how this explains it. Okay. So all right. I'm
going to walk you through this. So this now -- you know, if you thought the semi-supervised
bounds were bad, it's getting even worse, right? Because now I have more data in the mix, I
have more terms that I have to talk about. But, you know, whatever.
Okay. So our error on the target data is what we care about of course. Now we have training
data from both source and target, right? So I'm going to look at my average source and target
error. Okay. So for the way that I presented this bound, I've assumed that I have equal amounts
of source and -- no, I haven't assumed that. Erase the last ten seconds of your memory.
Okay. I just measured these errors. Okay? So I have my Rademacher complexity term just like
before. I have my discrepancy term just like before. And the new thing that's changed, so I used
to have something like square root 1 over delta divided by N. Now I have some labeled
examples in the source domain and some labeled examples in the target domain. Okay. And I
can get more of both. Or get more of either, I guess.
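Again as a rough rendering, suppressing constants, the bound now has the shape:

    \varepsilon_T(h) \;\le\; \tfrac{1}{2}\big(\hat{\varepsilon}_S(h) + \hat{\varepsilon}_T(h)\big) \;+\; R_S(\mathcal{H}) \;+\; R_T(\mathcal{H}) \;+\; \mathrm{disc}(\mathcal{D}_S, \mathcal{D}_T) \;+\; O\big(\sqrt{\log(1/\delta)/N_S} + \sqrt{\log(1/\delta)/N_T}\big)

with the source and target sample sizes N_S and N_T entering separately, which is the new part.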
Okay. So, you know, I can look at this, I'm like, okay, well, it kind of makes sense, right? As I
get more examples, it's good. Okay. But why does it sort of correspond to the observation that
we had here?
So what I want to do is I want to think about two cases. So I'm going to think about the case
where I train and test only on the target versus the case where I add in the source data and I do
this augmentation. So forget about source only. I'm only going to talk about target only and the
augmented.
Okay. If I train and test only on the target, then this term goes away, because I'm not using any
source data, and this term goes away because I'm not doing any adaptation.
When I do the adaptation and I add in this source data, what's happening is I'm going to pay this
discrepancy term, but I win because my number of source examples is going up.
So what's happening is that if I get enough source data that this 1 over square root N dominates
this discrepancy penalty, then it's going to be advantageous to throw the source data in. But if
the discrepancy is either really large or I don't have that much source data, then I'm better off just
training on the target.
Okay. So this was -- yeah. So this was a paper at NIPS last year. I actually highly don't
recommend reading the paper. It's very unpleasant. This is a selection of one of the results that
we had. The others are even worse to write down. So I'm not going to show them.
Okay. So I gave this example where we're adapting -- you know, first I said we're adapting from
one domain to a second, then I said, okay, well, we're actually going to do six domains. Now
we're going to consider the case what happens when I do lots of domains.
So in particular if I want to do something like personalization, maybe every person in my system
is a new domain. Okay. So this seems like the natural thing to do. The problem is that if I
actually execute this on k many domains, I end up with k plus 1 times D many parameters, where
D is my original dimensionality. And so this is bad. It's like this ridiculously high dimensional
thing, maybe it's hard to store, I now have all these parameters to estimate, so maybe I need lots
and lots of training data. So, you know, this is bad news.
So pictorially what happens is I start out with a feature vector like this. I do feature
augmentation so I get the first copy, then I have some number of zeros, then I get a second copy,
then I have like a bajillion zeros off at the end.
So what I'm going to do now is I'm going to take this augmented feature vector and I'm going to
apply this trick of feature hashing. So the idea is I'm going to compress this large vector into a
much shorter vector by basically hashing each feature. So I pick some hash function, I say:
Hash function, what do you say about 1? And he says: Oh, 1 hashes to 9. So I'm going to put
this 1 in position 9.
And then I say: Okay, hash function, what do you say about position 2? 2 hashes to 1, so I put
this 2 in position 1.
And then, you know: What do you say about -- well, I never noticed that the numbers that I put
in were exactly the position. That's very ambiguous. Okay.
Then I say: What do you say about position 6? 6 hashes to 10. So this was a -- oh, I guess this
was a 1, so I put a 1 here, but then of course there's going to be collisions, and so this position
also hashes to here, so I get the sum.
Okay. So I shrink the vector. I introduce some noise because there's collisions. But the hope is
that if my hash function obeys the usual properties that you hope that hash functions obey, these
collisions are sort of going to wash out as noise.
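Here is a minimal sketch of the hashing trick in Python. The helper names are mine, and Python's built-in hash stands in for a proper hash function; real implementations also use a second sign hash, as below, so that collisions cancel in expectation:

    import numpy as np

    def hash_features(pairs, num_bits=18):
        # pairs: iterable of (feature_key, value). Returns a 2**num_bits vector,
        # summing the values of any keys that collide under the hash.
        dim = 1 << num_bits
        v = np.zeros(dim)
        for key, x in pairs:
            sign = 1.0 if hash((key, "sign")) & 1 else -1.0
            v[hash(key) % dim] += sign * x
        return v

    def hashed_augment(pairs, user, num_bits=18):
        # Hash the shared copy and the user-specific copy directly, without
        # ever materializing the (k+1)*D augmented vector.
        shared = [(("shared", f), x) for f, x in pairs]
        specific = [((user, f), x) for f, x in pairs]
        return hash_features(shared + specific, num_bits)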
Okay. So does this make sense? All right.
And so the nice thing of course is that when you implement this you don't say, you know, start
with this, compute this, and then hash. You just go directly to the hash. And actually --
>>: [inaudible].
>> Hal Daumé: Yeah, it will be too big.
>>: And if --
>> Hal Daumé: And so it's computationally expensive. And actually if you have the right type
of hash function, it's no work to compute this. No extra work than if you didn't do the
augmentation.
So the question is does hashing do a good job. So let me see if I can say this well. So the
question is does -- do collisions lead to confusion at the time of classification. Right? I only
care about -- I don't care about collisions that have no effect, I only care about collisions that
cause me to screw up.
And so basically what we're going to do is we're going to say let's take our weight vector and
let's pick some person, Lucy. Lucy corresponds to some subset of the features. I'm going to take
the weight vector and I'm going to remove all of the Lucy features. And then I'm going to ask
what happens when I take the dot product between that weight vector and the hashed vector from
Lucy. Right?
So if I weren't doing hashing, this dot product should be zero. Because I removed everything
that's specific to her. But because of collisions, this isn't going to be zero. But I hope that it's
small.
So what I'm going to say is that the probability that it's bigger than some constant epsilon is
exponentially small in epsilon. Okay. So basically if you want this sort of two-word summary,
it works.
Okay. And it depends sort of nicely on the target dimensionality, so how big my hashed
representation is.
Okay. Here's how it works. This is spam classification. I sort of personalized spam
classification. I can't remember how many people there are. So this is how many bits I maintain
in the hashed representation. And this is basically the relative error over a nonpersonalized baseline.
Okay. And so basically what happens is that if you don't personalize and you hash and you hash
into a very small thing, you add extra error. This sort of makes sense. But once I use enough
buckets in my hash table, there's basically no difference between using a hashed representation
and a nonhashed representation for the nonpersonalized case.
In the personalized case, you can't even compute the nonhashed version. It's just way too big.
So all we can look at is the hashed version. And so basically what you see is unless you use a
tiny hash table, in which case you suck, you're doing a lot better by doing this personalization.
>>: How big was the [inaudible]?
>> Hal Daumé: It is a bag of words from e-mail. So whatever that is.
>>: What's the compression ratio?
>> Hal Daumé: Okay. So let's see. So let's say we go over here. So 26 bits, that's, what, 2 to
the 26? That's like some thousands. Wait. No.
>>: [inaudible].
>> Hal Daumé: Okay. 64 million. So it's a -- well, from the original representation, it's getting
bigger, because you're maybe starting at 50,000 features for bag of words. But I think this is on
something like 10,000 users. So it's -- it's not an outrageous compression, it's, you know,
somewhere between 1 and 100, I think. Okay.
So I think this is pretty cool. So this is the big problem with the feature augmentation approach
and this hash stuff seems to at least address it in a way that makes this approach usable for a
personalization setting.
Okay. So the next question is can I do this -- now I really don't have a term for this. So can I do
this in like a semi-supervised supervised way. Okay. So the setting is I've labeled source data.
I've labeled target data. And I have unlabeled target data. So I don't know what you want to call
this. Semi-semi-supervised or something.
Okay. So what we're going to do is the usual thing for labeled data. So we're going to expand
labeled pairs (X, Y) to (X, X, 0) with label Y for source and (X, 0, X) with label Y for target.
And so the question is what to do about unlabeled data. And so the thing that I really love about
this story is the way that we came up with this algorithm is we proved the bound that I showed
you like three slides ago and then we realized, hey, we can improve this bound by using
unlabeled data, and that gave the algorithm that I'm about to show.
So I think this is like an awesome case for theory, like doing what theory is supposed to do.
So what do we want to do? So the way that we derive the bound is basically there this approach
of -- so it's called coregularization. So people use it in cotraining, where I'm basically training
two classifiers and I want them to agree on unlabeled data.
And so we sort of adapted the same sorts of techniques to get the result that I showed earlier.
So but the way that they work is they basically say the complexity of a hypothesis class that
agrees on unlabeled data is smaller than the complexity of a hypothesis class that may or may not
agree on the unlabeled data. The agreement on the unlabeled data restricts the space of
classifiers that I can consider.
So what I want to do is I want to say for my unlabeled data I want my hypothesis, my source
hypothesis, to agree with my target hypothesis. Okay. So if you believe single good hypothesis,
this is actually a really good idea. Even if you don't believe single good hypothesis, it turns out
it's still a good idea.
Okay. So if you're doing linear classification, what does this mean? It means that your weight
vector in the target dotted with your example minus your weight vector in the source dotted with
your example is zero. Because they both have the same general weight vector, the same shared
component. And they only differ in the second and third parts. And so they'll agree if this
difference is zero.
So to map our unlabeled points, we're going to take an unlabeled point and we're going to create
two labeled points. The first labeled point is going to be a positive example whose feature vector
is 0, X, minus X. And the second one is going to be a negative example with the same feature
vector, 0, X, minus X.
So this seems crazy, right? Like, I mean, we wrote this on the board and we were like this can't
possibly work. This seems like a ridiculously bad idea.
All right. So the idea is that I want to encourage the difference between these dot products to be
zero. So I can't tell a classifier make something zero. The best I can tell it is make it plus 1 and
make it minus 1. Okay. So that's why we're creating two: one with a positive and one with a
negative label.
And so what's going to happen here is in order for the overall weight vector to have -- so if the
overall weight vector has a 0 dot product with this, then it means the source specific part dotted
with X minus the target specific part dotted with X is 0.
Okay. And then, you know, same for the positive and the negative case. Okay. So if you think
about this in terms of hinge loss, maybe this picture makes a little bit more sense.
So if we have a positive example, we compute the margin, so W dot X. If it's -- okay. Positive
example is red. So if our margin -- this is like an SVM. So if our margin is at least 1, we get 0
loss, otherwise our loss grows linearly.
For negative examples, if we have a margin of less than minus 1, we get 0 loss, otherwise our
error grows linearly.
And then for these examples, it's basically the sum of these two curves. So if we're making a
prediction on an unlabeled point and our prediction falls in this sort of region of uncertainty, we
get minimal loss. Otherwise the loss grows as you go out.
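As a sketch, the unlabeled construction in the augmented space looks like this (names mine):

    import numpy as np

    def eapp_unlabeled(X_unlab):
        # Each unlabeled target point x becomes the augmented vector [0, x, -x],
        # emitted twice: once with label +1 and once with label -1. Under hinge
        # loss this penalizes disagreement between the source-specific and
        # target-specific parts, |w_s . x - w_t . x|, outside the margin.
        n, d = X_unlab.shape
        A = np.hstack([np.zeros((n, d)), X_unlab, -X_unlab])
        return np.vstack([A, A]), np.concatenate([np.ones(n), -np.ones(n)])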
Okay. This picture also surprised me because there is a paper by Neil Lawrence where they do
some Gaussian process unsupervised learning stuff, and they regularize not with a function that
looks like this on unlabeled data but with a function that looks like this on unlabeled data. So it's like
exactly the opposite. And I don't have an intuition of why they both work, but somehow they
both work.
Okay. So this is encouraging agreement on unlabeled data. It's akin to multiview learning, and
you can show that if you do this it shrinks the generalization bound.
And so this is how well it works in practice. This picture is kind of hard to read. So I'll explain
the legend first. So source only is you train only on the source data, test on the target. Target
only full means that you train on all the available target data. Okay. So for the rest -- no. Target
only we're only training on target data also.
Okay. So for the rest, what we're going to do is we're going to keep the ratio of source examples
to target examples the same, and we're going to adjust the total number of examples that you see.
Okay. So, for instance, at 2,000 it means we have 200 target examples and 1,800 source
examples. And here it means we have 800 target examples and, whatever, 7,200 source
examples.
Okay. So you can draw the picture in other ways, but this is the one that I chose to show, even
though it's kind of hard to read.
So the thing that I want to sort of look at is -- well, I don't know. I don't want to dwell on it too
much. But basically it seems to help. It doesn't help like orders of magnitude, but adding the
unlabeled data, so that's EA++, does sort of give you a consistent win over everything else,
except the case where you have extra target data. So this -- right?
So the reason this graph is hard to read is that this point is training on 2,000 target examples,
whereas this point is training on 1,100 target examples and a bunch more source examples.
So if you're keeping the number of target examples fixed, you need to like compare this point
with like a point way over here. Okay. So from that perspective I think it's doing quite a good
job.
>>: [inaudible]
>> Hal Daumé: Oh. Sentiment. It's the same as before. I think it's from books to kitchen.
Okay. So yeah. Okay. Here are some papers. So this was the frustratingly easy paper, which is
highly related to this [inaudible] paper. This is the paper about feature hashing, and then this is
the one that has like the ridiculously terrible bound. And also the semi-supervised extension.
Okay. So I want to talk a little bit about sort of a completely different style of approach. So
pretty much everything I've been talking about up until this point has been what I sort of consider
the linear algebra version of machine learning. And not really the probabilistic version of
machine learning, although we did have this P of sigma equals 1. But that was kind of the only
place that it crept up.
So now I want to talk a little bit about probabilistic models, because there's been sort of a whole
different set of intuitions applied to adaptation in a probabilistic setting.
Okay. So one way of thinking about the feature augmentation approach from a parameter
perspective rather than from a feature perspective is that I'm not duplicating the features. I'm just
going to train two separate classifiers, and instead of regularizing these classifiers towards 0, I'm
going to regularize them toward some shared underlying classifier that's common to both
domains.
So I have sort of one general classifier, and then I have my source specific classifier, my target
specific classifier, and I regularize my source classifier and my target classifier to be similar to
the general classifier.
Okay. So this was an observation that I think was in the original frustratingly easy paper, but
sort of didn't go anywhere.
So if I draw graphical models, are people happy? All right. I'm going to assume that you're
happy. Okay. So here's a simple graphical model for just vanilla single domain classification.
So we have N data points, XY, we're trying to learn some weights W, and I've explicitly drawn a
prior on W as like a Gaussian with the 0 mean.
So the first thing that we could try is we could learn weights on the source examples, we could
fix -- we could clamp these weights to their value and then learn on the target examples but
regularize toward whatever we learned on the source examples. Okay. So this is a pretty
popular approach in speech land.
Okay. What else could we do? So this is sort of more like the thing that I hand-waved about. So
we can jointly learn weights on the source and target examples that are regularized towards some
unknown weight vector which itself is regularized towards 0.
I guess since I'm talking about Bayesian, I shouldn't say regularized, I should say has a prior
distribution centered at 0. But it's basically the same thing.
Okay. So you can take this further. So you can now say, well, instead of two domains I have k
domains, each with its own weight vector which are sort of communally regularized toward a
shared weight vector which itself is regularized towards 0.
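Written out, the hierarchy being described is roughly the following, with a logistic likelihood as one arbitrary choice for the data term:

    w_0 \sim \mathcal{N}(0, \sigma_0^2 I), \qquad w_k \sim \mathcal{N}(w_0, \sigma^2 I) \;\; (k = 1, \dots, K), \qquad y \sim \mathrm{Bernoulli}\big(\sigma(w_k^\top x)\big)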
Okay. And in fact you can derive the augment approach as sort of an approximation to this
model.
So now people went crazy. They're like, okay, we have this general framework, what should our
priors and stuff like that look like. So the first thing that you might try is put a Gaussian prior on
W0 and put another Gaussian prior on each WK. So this was a 2008 paper by Jenny Finkel
[phonetic] in some conference, ACL, [inaudible] something like that.
Or maybe you don't want to be linear. So instead of putting a Gaussian prior, you can put a
Gaussian process prior. So this was a paper by someone, I can't remember who, actually earlier
than the Gaussian case. So people went for the hard version before they went for the easy
version.
You could also choose to cluster the task. So maybe I don't want to share across all tasks, I want
to say, well, these three tasks are kind of the same and these five tasks are kind of the same, and
this one task is sort of an outlier, so I can put a Dirichlet process here, and this will have the
effect of clustering the tasks. Which, again, came before the easy version. I don't know why no
one tried the easy version first. But it also works.
So okay. So these are sort of three different ways you can do it, and you get different behavior.
So you either have nonlinearity, or you have this clustering assumption.
So one thing that always bothered me about the clustering assumption was that basically what
you're saying is either my tasks or my domains are basically the same, or there's no relation
between them at all.
And this doesn't really seem to capture how I think things happen in reality. So maybe what I'd
prefer is some sort of tree structured prior. So these are our Amazon categories: appliances,
kitchen, electronics, other, music, books, DVDs, and video. And this is the tree structure that you
get if you just run sort of an agglomerative clustering over the Xs. So you ignore the label. You
just cluster on the basis of the document.
So DVDs and videos look very similar. Electronics and other look very similar, appliances and
kitchen look very similar.
Okay. So what we want to do -- I guess what I want to do, because I thought it would be more
fun, is instead of keeping the tree structure fixed based on the inputs, let's infer the tree structure
as part of the classification process. So I simultaneously want to learn a classifier and learn a
good tree structure for classification, not necessarily a good tree structure for explaining the
documents that I see.
Okay. So to do this, we use this model called the coalescent, which I'm just going to skip over,
even though I have like awesome animations for it. You can look it up if you're interested. You
can see the animations.
Okay. But this is the part that matters. Okay. So we have a distribution over tree structures.
This is what the coalescent gives us. And once we have a tree structure, we can now just think of
this as a graphical model. Right? So we have random variables at each of the nodes in the tree,
and then we have some process by which we describe the distribution over some node given the
value of its parent. And this is going to completely specify our graphical model. So we have
ways of doing efficient inference in these models.
So basically the way that this is going to work is we do EM. So in the E step we're going to have
a fixed tree structure and we're going to compute expectations over the weights which are weight
vectors sitting at each junction in this tree. And then in the M step I'm going to keep those
expectations fixed and I'm going to reoptimize the tree structure given the weights that I just learned, and then I iterate. And
then I'm going to initialize with the tree given by the data.
So you can do the expectations by belief propagation on the tree, and then to maximize the tree
structure we use this greedy agglomerative algorithm that we have.
Okay. So one of the things that's interesting is that the tree that you learn for classification is
actually quite different than the tree that's good for explaining the data. So I apologize for the
like low-resolution version, but this was copied out of the PDF. So this is the tree that's good for
explaining the data. This is the tree that's good for explaining the classification decisions.
And so one of the things that I think is interesting is that over here music and DVD look totally
different, right, presumably because, you know, I don't know, when I review music and when I
review DVD, I talk about plots in DVDs and stuff like that, and won't really talk about a plot of
music.
But from a classification perspective, the way that I say good things about DVDs is very similar
to the way that I say good things about music.
So I don't know. I think this is kind of cool, that it learns a fairly different tree structure to
optimize classification performance.
Okay. So here's how well it does. I'm not going to explain all the points. I feel like we're like
playing a 1985 video game. It's like blocky. But basically the thing that matters is that this
approach works awesome. One of these curves is if you use the data tree rather than the inferred
tree, and then actually the black one is if you do the feature augmentation, so for this task the
feature augmentation actually loses.
Okay. All right. So here's some stuff to look at. So this was the paper that does something. I
don't remember which. This is the one that does the Gaussian process stuff. This is the one that
does the Dirichlet process stuff. And then this is the one that does the hierarchical stuff.
Actually one plug I'll put in for the hierarchical stuff, just because I think it's incredibly cute, is
that one of the awesome things about the coalescent is that if I have a sample from a coalescent
and I cut it at any point, the resulting distribution is a draw from the Dirichlet process.
So in a sense this completely captures the Dirichlet process model if it wants to. But it's more
general.
So, anyway, I think that's kind of a cute little theory thing. It doesn't subsume the Gaussian
process thing, though. Although, I guess you could put Gaussian processes on the nodes, but
that's just way too complicated.
Okay. So now everyone can wake up because I'm going to talk about translation.
Okay. So I got interested in adapting translation models across domains. Basically because, you
know, I go to ACL and 90 percent of the things that I hear are about translation, and so it's kind
of impossible to avoid. And I'm like, hey, I know how to do adaptation, that should be cool.
We'll fix all these translation problems.
So the setting that I'm going to consider I'm going to translate from French to English. I'm not
always just going to do just French and English, I just need to talk about it concretely. So French
is always going to be my -- so this is one of these super annoying things about adaptation for
MT.
So adaptation has source distributions and target distributions, MT has source languages and
target languages. So I'm trying to get around this problem by calling my source language French
and my target language English.
And I'm going to move from parliament proceedings to medical texts, so that's going to be my
adaptation goal.
So this is some stuff that I did with [inaudible] a couple years ago. We basically tried to analyze
the effect that adaptation has in translation. And actually I think this is more general.
So I subsequently ran similar experiments to analyze the effects in other tasks, not just
translation, because I think it's sort of an interesting way to think about this problem. So we
broke it down into four sources of error, so -- and we had to make them all start with the same
letter, so some of them are a little bit weird so that they all start with the same letter.
Okay. So one is seen error. So this is basically I'm trying to translate a French medical text, all
of a sudden I see a French word that I've never seen before. So obviously I'm just hosed. Like
there's just nothing I can do. I mean, if it's a name, maybe I just copy it and I cross my fingers.
But if it's not or if my source language is Chinese, this isn't going to work very well.
The second issue is basically word sense. So I see a French word and I've seen this French word
before, but I've never seen the proper translation of this French word. So by sense here I really
mean it in the sort of like word sense disambiguation from MT sense, which is that a word sense
is described by a translation pair, not by some lexicographer sitting down and saying, you know,
these are 15 senses of the word go or whatever they do.
>>: So you're saying in that case we [inaudible] the English translation -- the correct English
translation is not in [inaudible] at all?
>> Hal Daumé: It's not paired with this French word. I might have seen the English translation
before, but never from this French word.
Although, actually, I hope that that's rare. I hope that the case that I have seen the English word
but I haven't seen the French word, I kind of hope that's rare.
>>: Why [inaudible]?
>> Hal Daumé: All right. We can talk about it. I don't have a great intuition. But, anyway, this
is one source pair.
Score is basically I have the French, I have the English, but it's just scored low or something. So
someone else wins when I'm producing a translation.
And then of course I can make search errors, which God knows what to do about search errors.
So I'm going to completely ignore search errors for the rest of the time. But it also starts with S,
so he had to include it.
I mean, it's possible that you make more search errors in the target domain. Right? Maybe the
entropy of your model is higher so it's less certain and so it makes more -- so it is plausible, but
it's just so hard to measure that we didn't look at it.
So we looked at four target domains. So one was news, the other three were medical, movie
subtitles, and PHP documentation. And basically what happens -- so this is sort of through a
series of oracle experiments. But if you move from parliament to news, seen and sense, each
account for about 10 percent of the added error that you get, and score accounts for about 80
percent.
Okay. So when we first saw this, I was like this is awesome, right? Like what are all these
machine learning techniques good at? They're good at fixing score. They can't find new words.
Like they don't know what a new word is. But they're really good at adjusting weights. So I was
like, Oh, we're going to kick butt, right?
And then we did the same thing on these three other domains. It turns out that score really isn't
so much of an issue there. It's really all about unseen source words or seen source words with
unseen translations.
Which is maybe not super shocking, but if you go and read papers where people do adaptation
MT, they're -- with the exception of maybe people at Microsoft, they're almost always adapting
to news, which is maybe the less interesting case. Although, it makes all of my machine learning
stuff work, so maybe it's good.
So here's a little bit about the data that we used. So we're assuming that we have parallel
parliament data and only comparable data in the target domain. So this is how much data we
have, so news, medical, subtitles, tech docs.
I'm always going to tune in the target domain, so I do have a little bit of parallel data in the target
domain. It doesn't really matter. This sort of makes like 1 1/2 to 2 BLEU points difference, but it
pretty much slides everything up and down the same. So it's not that big of a deal.
I'm also going to assume I have a target language model, but, again, it sort of slides everything
up and down about 1 1/2 to 2 BLEU points. And then we have some test sentences.
Okay. So here are the most frequent OOV words. I showed the English because it's easier to
read. So the OOV rate in news is only about 17 percent, whereas in the others it's upwards of 50
or worse. So these are the most frequent ones, so --
>>: [inaudible] behavior doesn't show up in [inaudible]?
>> Hal Daumé: It -- I think I counted OOV also as things that only occur once.
>>: [inaudible] is this [inaudible]?
>> Hal Daumé: Yeah.
>>: This is a [inaudible].
>> Hal Daumé: Ah. That explains it. Oh, yeah, because some people put u's in there for some
weird reason. I'm surprised color isn't there either, right? But, I mean, these aren't like super
frequent. It's only like 17 percent.
There's also obviously a temporal issue, right? Like [inaudible] would probably be in Europarl
now, but the news data came after.
>>: [inaudible]
>> Hal Daumé: It's token level.
>>: [inaudible]
>> Hal Daumé: Yeah. Okay. So medical, it's about 50 percent you see words that, you know,
who knows what these words mean. Subtitles you see a lot of swearing, which I edited out. But
of course they're not edited out in the actual data, but you can fill it in yourself.
What else? PHP documentation? And you see the things that you expect, right? So there's
nothing really crazy going on.
Okay. So what we want to do is, of course, we can fix this scoring problem using the stuff that I
talked about, but for the other domains this isn't going to work. So the setting we're going to
work in is that we have a seed dictionary which we extracted from Europarl, and then we have
comparable data in the medical domain, and I want to use my seed dictionary to extract new
translation pairs to cover these OOV words.
So I'm only right now looking at the OOV words, because the sense errors, it's very easy to say,
oh, there's a French word I've never seen before. It's much harder to say, oh, there's a French
word and I've seen it, but I think it's being used in a new sense. So we have some thoughts about
how to do that, but if you have thoughts I'd love to hear them.
But our goal of course is to --
>>: [inaudible]
>> Hal Daumé: Hmm?
>>: [inaudible]
>> Hal Daumé: We only have comparable data.
>>: [inaudible] test set.
>> Hal Daumé: Yes.
>>: So your test set, if you have a word, and you can produce the [inaudible] translation
[inaudible].
>> Hal Daumé: Right. So that's how we got the oracle numbers earlier. But, yeah, with just
comparable data it's harder. Okay. So we want a large translation dictionary, right?
So I'm going to briefly describe this paper that [inaudible] had. So the idea is maybe we know
that ate and mange are pairs. So we're going to look at the contexts of each of these words and
we're going to put them in some high dimensional space corresponding to these red dots. And
then we're going to look at other contexts of these words and other contexts of these sort of
shared pairs and put those in other parts of English space and French space.
And then we're going to run CCA because we love CCA, and we're going to find directions that
preserve the pairwise ordering between these paired points.
And this will give us a low dimensional representation of words that's common to both English
and French, and then from that we can look for English words that are highly similar to French
words in the shared representation.
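A minimal sketch of that pipeline with scikit-learn, assuming context-vector matrices whose i-th rows correspond to the i-th seed pair (all names mine):

    import numpy as np
    from sklearn.cross_decomposition import CCA

    def mine_translations(E_seed, F_seed, E_new, F_new, k=100):
        # Fit CCA on the aligned seed pairs, project unseen words from both
        # languages into the shared space, and match by cosine similarity.
        cca = CCA(n_components=k).fit(E_seed, F_seed)
        Ep, Fp = cca.transform(E_new, F_new)
        Ep /= np.linalg.norm(Ep, axis=1, keepdims=True)
        Fp /= np.linalg.norm(Fp, axis=1, keepdims=True)
        return (Ep @ Fp.T).argmax(axis=1)  # best French candidate per English word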
Okay. So more generally I can think about this as a matching problem. Right? So I get a bunch
of examples, right, so you can think of it as English words or French words or images and text,
or something like that, and my job is to match one to the other.
So the intuition about how we're going to do this is a little bit confusing. You have to maintain a
lot of words in your head in order to follow it. Okay. So let's say that this half of the room is
English and that half of the room is French. Okay. And what I can do is I can measure
similarity between any pair of English words or any pair of English people and any pair of
French people. Wow. This actually kind of worked out. So -- sort of. So -- but I can't measure
similarity between an English person and a French person. That's just -- I don't know how to do
it. Okay.
So what I'm going to say is that everyone on this side of the room is going to have a counterpart
on that side of the room. I'm going to assume there are equal numbers. And let's say that Arul
thinks that his counterpart is Patrick. Okay. Now, if Arul's counterpart is Patrick, Arul is going
to look at his friends on the English side, and Patrick is going to look at his friends on the French
side. And if Arul and Patrick are aligned, then Arul's friends and Patrick's friends had better be
aligned also.
So I should be aligned to someone in such a way that my friends are aligned to my counterpart's
friends.
All right. I'm getting better at that every time I say it. You should have heard me like six months
ago. It was a disaster. So this is the basic intuition. We're going to do things in
terms of kernels as a way of representing similarity. So we're going to have similarity between
English words represented as a kernel K, similarity between French words represented as a
kernel L.
And what we're going to do is try to permute the rows and columns of L so that K and L look as
similar as possible. Okay. And we can represent this permutation as a permutation matrix: pi
times L on the left permutes the rows, pi transposed on the right permutes the columns, and it
permutes them in the same way. So we're going to search for some permutation pi star that
maximizes this product. Yeah.
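In symbols, the thing being maximized is the trace of K pi L pi-transposed. Here is a small sketch of that objective and of the usual linear-assignment heuristic for improving it -- my own illustration, under the assumption that K and L are precomputed kernel matrices:

# Sketch only; K and L are precomputed similarity (kernel) matrices,
# and perm[i] is the index of the French item matched to English item i.
import numpy as np
from scipy.optimize import linear_sum_assignment

def objective(K, L, perm):
    Pi = np.eye(len(perm))[perm]  # permutation matrix
    return np.trace(K @ Pi @ L @ Pi.T)  # large when K and permuted L agree

def lap_step(K, L, perm):
    # Linearize around the current permutation: for symmetric kernels the
    # gradient in Pi is proportional to K @ Pi @ L, so re-solve a linear
    # assignment problem against that score matrix.
    Pi = np.eye(len(perm))[perm]
    score = K @ Pi @ L
    _, cols = linear_sum_assignment(-score)  # negate to maximize
    return cols  # the improved permutation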
>>: Is there a single number in the same language means [inaudible]?
>> Hal Daumé: So in this case we're going to define it as occurring in similar contexts. So it's
distributional similarity stuff. But theoretically you could do anything you want.
Okay. And so the next thing is that we're not doing this from nothing. We have the seed
dictionary, which we extracted from Europarl. So for those words we're going to clamp the
permutation according to the dictionary we've extracted and only optimize over the rest of the
permutation. And we're going to do this in an iterative fashion: in one step we learn a
permutation, in the next step we learn a better embedding, then a new permutation, then a
better embedding, and so on.
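Clamping is easy to picture on top of the previous sketch: freeze the entries covered by the seed dictionary and re-solve the assignment only over the rest. Again, this is just my illustration with hypothetical names:

# Sketch: seed_pairs is a list of (english_index, french_index) pairs
# from the extracted dictionary; those entries of the permutation stay
# fixed while the assignment is re-solved for the remaining words.
import numpy as np
from scipy.optimize import linear_sum_assignment

def lap_step_with_seeds(K, L, perm, seed_pairs):
    Pi = np.eye(len(perm))[perm]
    score = K @ Pi @ L
    fixed_e = {e for e, f in seed_pairs}
    fixed_f = {f for e, f in seed_pairs}
    free_e = [i for i in range(len(perm)) if i not in fixed_e]
    free_f = [j for j in range(len(perm)) if j not in fixed_f]
    rows, cols = linear_sum_assignment(-score[np.ix_(free_e, free_f)])
    new_perm = np.array(perm)
    for e, f in seed_pairs:
        new_perm[e] = f  # clamped by the seed dictionary
    for r, c in zip(rows, cols):
        new_perm[free_e[r]] = free_f[c]  # re-optimized for the rest
    return new_perm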
So there's an initialization issue. When I initialize, for the things that are not in my dictionary,
should I say they're not aligned to anyone, or should I say they're aligned to everyone with equal
probability? It makes a small difference, but I don't remember which one wins. We'll see in the
next iteration. Or in the next slide.
Okay. So here are a bunch of problems. This is image alignment. This is document alignment
from Europarl where we use parallel data. This is document alignment from Europarl where we
use comparable data -- this is sort of the usual trick: you take the first half of the English
document and the second half of the French document and try to pair them up, so it's not a
perfect translation anymore. This is the same thing on Wikipedia, and this is translation mining.
So this is sort of vanilla kernelized sorting, which basically sucks. This is the Haghighi and Klein
approach, which does awesome on parallel data but is kind of not so good on the other things.
This is a previous baseline we had. And then these are the two versions using the seeds: this is
with the zeros and this is with the uniform. I guess the zeros work slightly better, but it doesn't
make that much of a difference. And basically what happens is that by the end, for instance, in
the translation mining we're able to correctly mine about 260 out of 300 pairs in sort of a
first-best way.
>>: [inaudible] that you're trying to learn?
>> Hal Daumé: Yeah. So I'd have the words that I flagged as OOV.
So here's what it actually extracts. So here are five different German words and the top three
translations -- or sometimes it doesn't even think there's more than one translation. I honestly
have no idea if these are reasonable or not. Some of them I can guess: this one's probably right,
this one's probably right, this one's probably right. I don't really know about the others. Any
German speakers want to say?
>>: [inaudible]
>> Hal Daumé: This is right?
>>: How do you say that?
>>: [inaudible]
[multiple people speaking at once]
>>: And then the top three, I mean, the German word is a drop in blood pressure. I don't know
how that [inaudible].
>> Hal Daumé: I guess you can understand it, right? Like, it's definitely affiliated with stroke,
but it's probably not a great translation. Okay. And so here's what ends up happening if you
integrate these into your phrase table for translation in some kind of hacky way.
So basically the plus is how much things change. For news it changes a little bit, not much; there
wasn't that much OOV to start with. In medical it makes a big difference. A little bit
surprisingly, even though there was a lot of room for improvement, for subtitles and PHP it
doesn't help that much, at least in German. It helps a little bit more in French, but somehow the
dictionary mining seems harder here. I think it's also because we have far less data for these: for
this one we have a ton of data, and for this one we only have a little bit of comparable data.
Okay. So that's it.
>>: Does data need to be comparable at all?
>> Hal Daumé: No. But that's what we had.
Okay. So that's basically all I wanted to talk about. Hopefully you learned something
interesting. So I can take questions.
[applause]
>>: So I was wondering if you had any thoughts on things like [inaudible], like for sent
[inaudible], it's a very simple way of doing things. Let's say you have four annotated labeled
domains where you train four classifiers and then you combine them in a weighted way, and
those four weights basically just come from a few examples from the new domain. And I
wonder if you have any thoughts about sort of formal bounds for that, or if there's any way, for
example, to choose your ensemble in the smartest possible way to actually optimize the weights,
or calibrate the training weights.
>> Hal Daumé: Yeah. Okay. So there's one thing that I can say. So I haven't tried that. A
similar thing that I have tried is: train a classifier in the source domain, and then when you get a
target domain example, predict its label with the source domain classifier and include that
prediction as a feature for a target domain classifier. So you get one additional feature. Maybe
you give it the margin, not just the decision.
So this is actually a really good baseline in the semi-supervised setting.
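For the binary case, that baseline is only a few lines -- this is a minimal sketch with scikit-learn-style classifiers, and the data variables (X_source and so on) are placeholders I'm assuming rather than anything from the tutorial:

# Sketch of the "source prediction as an extra feature" baseline.
# X_source, y_source, X_target, y_target, X_target_test are assumed given.
import numpy as np
from sklearn.linear_model import LogisticRegression

src_clf = LogisticRegression(max_iter=1000).fit(X_source, y_source)

def augment(X):
    # Append the source classifier's margin (not just its hard decision)
    # as one extra feature column (binary case).
    return np.hstack([X, src_clf.decision_function(X).reshape(-1, 1)])

tgt_clf = LogisticRegression(max_iter=1000).fit(augment(X_target), y_target)
predictions = tgt_clf.predict(augment(X_target_test))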
I suspect that actually the ensemble might be better if you had a large -- well, it's hard to say. So
the tricky thing about this add-one-feature approach is that, yes, you do have this one really
good feature, but you also have like a bajillion other features. So you might need a reasonable
number of examples to learn something.
So if you don't have a large number of examples, that might suck. For the ensemble, you should
be able to get away with just a few labeled examples, unless you have a large number of
domains. But the other one would have the same problem with large number of domains.
Yeah, I don't have a great sense. I mean, I'm sure that it would work well. So this is one of the
things about domain adaptation. Like almost anything you can dream up works pretty well.
Which is nice. So dream up something, write a paper. But, yeah, I don't have much to say other
than that.
>> Patrick Pantel: Okay. We'll take questions afterwards, but, Hal, thank you very, very much
for this.
[applause]