>> Kuansan Wang: All right. Let's get started. It is my greatest pleasure to welcome Mianwei
to come back and have an interview with us. Mianwei has been an outstanding intern with ISRC
before, and he's graduating this summer, I hope. His work is on entity-centric search, and this happens to be an area that we are focusing on a lot, so we are very happy to have him to
come and tell us what he has been doing. So without further ado, Mianwei.
>> Mianwei Zhou: Okay, thank you. Thanks, Kuansan. Yes, so thank you for coming to my
job talk today. I'm very happy to share the work from my PhD studies with you. Yes, so the topic I will show here is about entity-centric search: querying by entities and for entities. So should I
speak louder, or is this volume fine?
>>: This is fine.
>> Mianwei Zhou: Okay, sure. So you know nowadays the web has become a very huge
database, storing all kinds of different entities, so by entities, that means like, for example,
person, like products, like companies. All these things are entities. And because of this, most of
the information-retrieval operations on the web are also related to entities: people search for all kinds of information, for example, the latest events about some person, the phone number of some company, some historical facts. So all these kinds of information-retrieval operations are entity centric. According to some statistics reported in previous work, 72.9% of web queries are actually entity oriented and 71% of web queries contain named entities. Well, in
order to get this information, nowadays people are used to using search engines. In the traditional search engine framework, people use keywords to characterize what they want, and then the search engine returns lots of documents. But we also find that this kind of search framework cannot well support these entity-centric search operations. So, for example, given a query like "when did Michael Jordan graduate", in terms of the input, if the search engine is not aware that this query is actually trying to search for an entity, and also has some entity involved in it, then it may not be able to judge which result is correct. And also, whenever you use keywords, usually you have some ambiguity issue
there. And in terms of the result, for those kinds of search operations, when people try to search for some particular entity, directly returning a document may not be a good option either, because the users have to look into the document to find answers. Yes. So in my PhD studies, we tried to move beyond this kind of traditional information-retrieval framework. We want to study, when the concept of entity is involved in the input query or in the output result, how we can actually change these search behaviors, such as linking results or something. So here I use a diagram to show the position of our work: I use the x-axis to show the input and the y-axis to show the output. In this way, traditional IR will be in this area, and what we will study is when the concept of entities appears in the input or output. We call these kinds of operations entity-centric search operations. So
specifically, according to whether the entities appear in the input or in the output, we can further
categorize these kind of search operations. So if the concept of entity appears in the input,
usually people will know some entity, and then they want to find some information about this
entity. We call this kind of operation querying by entities. For example, find the latest events for some person, or some useful reviews of some product. And in contrast with querying by entities, we also have querying for entities. That means the users will expect the output result to contain the entity. In this case, for example, we
want to find the highest building in the world or the most expensive handbag or something like
that. Yes. You can also think that these two areas have some overlapping area. So in this kind
of operation, people usually try to search for entities that match some desired relation with the input entities, so it's querying by and for entities. For example, the phone number of some company, or the graduation year of some person; this will have entities involved in
both the input part and also the output part. So basically, my research work can fit into these
three categories. Yes, so for the querying for entities part, we have some work about it already, the content query system. For the querying by entities part, we have the entity-centric document filtering, and in the overlapping area, we have the relational entity search. In this talk, I will specifically focus on this area -- this work and this work, the querying by entities work and the querying for entities work. For the relational entity search, although we have some publications, right now I'm still actively working on that, so I will briefly introduce it in the future work. So this is going to be the overview of my talk today. I will start with querying for entities and querying by entities, followed by our future work. Yes. It's pretty
much just an overview of my talk today. So let's start with the querying for entities part. So the
position of this work is here. So in this work, we tried to design a general data-oriented content
query system, to support searching directly into documents to find related entities. The input will be some keywords; the output will be some entities. The motivation of
this work is that, nowadays, we are actually witnessing many different kinds of efforts trying to search for entities inside text data. For example, entity search, proposed around 2006, tried to use some keywords to directly return an entity as a result. With information extraction, people try to use some simple patterns and then use some idealized [indiscernible] to iteratively get a lot of relational data from web text. With web-based question answering, people give some question as input and want to find an answer directly as the result. Well, you can see that these different efforts are all doing similar stuff. They have different keywords or different patterns and try to find entities inside the text data. They all belong to the querying for entities applications. Well, since these projects were developed independently, the keywords, patterns, and scoring functions are all hard-coded in their projects, and that's their limitation. What one project developed cannot be easily reused in other projects. So the motivation of this work is: can we build a
more general system to better support these kind of different applications? So this is the goal of
our work. Since they are doing similar stuff, so we want to build a general system that can better
support these kind of different querying for entities applications -- query for entities projects. So
this is my idea. It's actually very similar to databases: before databases were invented, when people tried to do some data management job, they needed to write the data storage and data indexing themselves in different projects, and then the invention of the DBMS largely simplified their jobs. It's similar here: we want to build a general system to better support these different querying for entities applications. So how should this system look? A couple of
requirements. As I have mentioned before, different projects actually will have different kind of
entity extractions and different kinds of scoring functions to find out the correct entities. And
also, they need to support different entity types. So specifically, we require that, in order to build a general system, it should support very flexible entity extraction, the scoring function should be customizable for different applications, and the entity types should also be extensible -- every time we try to support some new entity, we don't need to change the product code; we can directly extend it easily in the system. So these are the requirements of our system. To handle this, our solution is kind of a relational model. The key idea is that we try to view the sets of keywords and entities as a lot of tables over the corpus in the back end, and then we build a system on top of that, and we design a language to query them and return the result as a kind of table. I know it's
still a bit unclear here, so I want to show you a demo to see how actually it works. So this is a
system that I developed for this project, and here's the interface. This system is built upon a Wikipedia corpus and has some entity types extracted and indexed, and then it supports a query language to query those entities -- for example, person, location -- in the text data. So how does it work? Let me write a simple query first. Say, for example, I want to find all the person entities: we'd like to have some pattern here, and say we want to find all the persons which have "computer science", this kind of noun phrase, appearing close to them. Then we process it and get all the entities that match this kind of pattern; that is the idea. So we can write some different kinds of patterns here and specify what kind of entities we want. Here we use the braces to mean that "computer science" is a sequential phrase. Here, we use the square brackets on each side of it, kind of like window patterns. Yes.
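For illustration, here is a minimal sketch (not from the talk) of how such a window pattern might be evaluated over indexed keyword and entity occurrences; the data layout, names, and the 10-word window are assumptions based on the description above.

    from collections import defaultdict

    # Toy "tables": one posting list per keyword and one per entity type.
    # Each occurrence records (doc_id, position); a real index would also
    # store the span and an extraction-confidence score, as described later.
    keyword_index = defaultdict(list)   # token -> [(doc_id, pos), ...]
    entity_index = defaultdict(list)    # type  -> [(doc_id, pos, surface), ...]

    def add_document(doc_id, tokens, entities):
        """entities: list of (position, entity_type, surface_form)."""
        for pos, tok in enumerate(tokens):
            keyword_index[tok.lower()].append((doc_id, pos))
        for pos, etype, surface in entities:
            entity_index[etype].append((doc_id, pos, surface))

    def window_query(entity_type, phrase, window=10):
        """Entity occurrences of `entity_type` that have the sequential
        phrase `phrase` appearing within `window` words of them."""
        phrase_tokens = [t.lower() for t in phrase.split()]
        starts = [
            (d, p) for d, p in keyword_index[phrase_tokens[0]]
            if all((d, p + i) in set(keyword_index[t])
                   for i, t in enumerate(phrase_tokens[1:], start=1))
        ]
        return [
            (d, surface)
            for d, p, surface in entity_index[entity_type]
            if any(d == sd and abs(p - sp) <= window for sd, sp in starts)
        ]

    # Example: person entities with "computer science" within 10 words.
    add_document(1, "Burgan studied computer science at a small college".split(),
                 [(0, "person", "Burgan")])
    print(window_query("person", "computer science"))   # [(1, 'Burgan')]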
>>: Do you have a ranking of these persons?
>> Mianwei Zhou: Right now, no. For ranking, we need to have something like group by and order by. I can first show how these entities match the pattern -- I can use a context function. I can see how they match our patterns. For example, here we get a person name, and then we see how "computer science", these kinds of words, appear around it. So this is just some patterns similar to that. And then, if we wanted to support weighting, we would need to order the results. So, for example, I show a query here. So in this query --
>>: Wait. I don't see any connection between broaching with computer science being
[indiscernible].
>> Mianwei Zhou: So yes, so computer science. So that's another person here. So, yes, there's
no connection, but I just say we'll collect -- so this basically is a collecting phase.
>>: What does this context mean?
>> Mianwei Zhou: Context means that we show that, for example, Burgan here is a person name
here, and "computer science". We try to find all the entities that match -- that have these words appearing around them.
>>: That means that they are 10 words apart?
>> Mianwei Zhou: Yes, 10 words around, 10 words apart. Within 10 words, actually.
>>: What's the raw data?
>> Mianwei Zhou: The raw data is the -- so this actually is a smaller demo, so this demo is just
built on the Wikipedia corpus. I think it's 2,000 Wikipedia corpus. We will have a larger system
built on a large data set using [full] web data. That one is bigger.
>>: The context is the document, and the person is the title of the document?
>> Mianwei Zhou: The person is the entity I want to extract. Here, for example, I extract a lot
of entities from this kind of context here.
>>: For the first evidence, you actually did understand, right, because this is a paper written by
this guy? Unfortunately, in this journal, it's like monographs in computer science.
>> Mianwei Zhou: Yes, so this is not a ranked result. This is just some snippet that matches this pattern. If we want to rank the results, we still need to do some aggregation and ranking. So, for example, here, if I would like to find the population of China, then maybe I write a similar pattern, "population of", and then close to "China". And then I can do some group by, because those entities that match the pattern frequently will be more likely to be the correct result, right? So we do some aggregation, and then we can do some ranking.
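A sketch of the aggregation step just described, assuming the pattern matches come back as a list of extracted values, one per matching snippet; grouping by value and counting is one simple way to rank them, as in the population-of-China example.

    from collections import Counter

    def rank_by_frequency(extracted_values):
        """Group extracted entity values and rank by how many snippets
        produced each value; values matched by many snippets are more
        likely to be the correct answer."""
        return Counter(extracted_values).most_common()

    # Hypothetical output of a "number near 'population of' and 'China'" query:
    matches = ["1,339,724,852", "1.3 billion", "1,339,724,852", "100,000"]
    print(rank_by_frequency(matches))
    # [('1,339,724,852', 2), ('1.3 billion', 1), ('100,000', 1)]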
>>: I see. So now you have all these kinds of tables constructed, and then you use this kind of --
>> Mianwei Zhou: Yes, yes, it seems like this. But, actually, in the back end, there's no table. It's still just an inverted index.
>>: So in this way, you cannot support a natural language parser?
>> Mianwei Zhou: No, so I just say this is a general system to better support this kind of project,
so it's a kind of like a database, actually.
>>: So what are the key technologies here? The area is, how do you fill the inverted index, or
how do you [indiscernible] this or how do you get the basic data? Can we see some data?
>> Mianwei Zhou: So yes, in this talk, actually, I didn't go into the details about the technique,
but in the paper, the key technique is, for example, when this kind of pattern becomes very complex, what kind of inverted index do I need to choose -- the query plan, and which index should be used. So this is the general idea: basically, try to support querying for entities
inside text data. I already showed how it works: for each occurrence of a keyword or entity, I just view it as a row in a table. Here, we will have its position, we'll have its span, we'll have some confidence-like measure of whether this kind of extraction is correct or not. We store this information in the inverted index. And then, as I said before, we need to fulfill two requirements, flexible entity extraction and a customizable scoring function, so how can we do that? Basically, we use the relational model. For example, we use a FROM clause to specify which entities we want, use some pattern to filter the results, use some aggregation to aggregate the results by how frequently they are mentioned, and use some customizable scoring function to rank the results. So, basically, it's a general system
to support it. And another interesting idea for this work: I have mentioned a third requirement, extensible entity types. In a lot of projects, when they try to support
new entities, they need to rebuild the index, do a lot of other things. Well, for this project, we
use the view concept from databases to support extensible entity types. Like the example shown here, we already indexed number, location, and person, right? But we can, for example, specify some patterns that characterize a number to define a population type from the number type, so I can show an example here. You see, here we define a population from a number -- we define it as a view, like here. We can specify a lot of different patterns, like "inhabitants" or "population of", and define this population type over the number type. By doing that, when we go back to the query, we can query the population directly. Although these entities are not indexed in our back-end system, we can still query them and get the same result. So this is kind of like virtual entity types built from the existing types, and defining such a type is actually pretty easy. We just write the new type here, and then simply say it will appear in the left part, and you can directly query it. So we can define some new types, new entities, based on some existing entities. Yes, so these are the requirements, how I fulfill these requirements in our system, and then the main technique -- yes?
>>: Yes, so I have some question about your definition of entities? What do you mean by
entities?
>> Mianwei Zhou: Entity is so -- so in our system, we pre-extract some entities in the back end.
For example, we extract numbers, locations, organizations. These are our very basic entities. So "define" means I define some higher-level entities based on these basic ones.
>>: In the R, right? In the R package?
>> Mianwei Zhou: Yes, in the R package to extract that, yes. And also, this demo already showed these four types, but in the larger system, I used a lot of dictionaries to match more entity types. Yes.
>>: So the population you extracted, is it a new entity, or is it a new attribute of an existing
entity?
>> Mianwei Zhou: It's a new entity based on the basic entity. The number is pre-extracted and
indexed, while the population is not indexed at all; we just use these kinds of patterns to extract a new type -- to extract those numbers that are more likely to be populations. So it's a new entity type based on the existing types, existing entities.
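Continuing the earlier sketch (again an assumption, not the system's actual syntax): a derived type such as "population" can be expressed as a view over the already indexed base type "number", resolved at query time from trigger patterns like "population of" or "inhabitants", with no re-indexing.

    def define_view(base_type, trigger_phrases, window=10):
        """Create a virtual entity type on top of an indexed base type,
        e.g. 'population' as numbers that co-occur with trigger phrases."""
        def view():
            results = []
            for phrase in trigger_phrases:
                results.extend(window_query(base_type, phrase, window))
            return results
        return view

    # 'population' is never indexed; it is resolved at query time as a view
    # over the base 'number' type (reusing window_query from the sketch above).
    population = define_view("number", ["population of", "inhabitants"])
    candidate_populations = population()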
>>: So I had one question. So what kind of problem cannot be solved by keyword association
and can be solved by this demo.
>> Mianwei Zhou: No, I'd just say this is a back-end system to better support those kind of
projects. I didn't say I solved a new problem. I just built a general system to better support
them. So they don't need to write too much code for their projects. Yes. Okay, so this is the
first piece of our work. As I said, the main techniques focus on the index design -- how can we build an inverted index for those keywords and entities to make this kind of querying faster -- and also some query optimization, which keywords should be used and which entities should be searched, something like that. This talk doesn't go into those details, so if you have questions, we can discuss them offline. Yes. So basically, this is the first piece of our work. It's about querying
for entities, so our work is about data-oriented content query systems. So the second work I'm
going to introduce is this part, and this will be the main focus of my talk today. It's about entity-centric document filtering. The idea here is that the input is some entity, and we want to output documents, and the problem is: given an entity, which is characterized by an identification page, how can we identify relevant documents? So, entity-centric document
filtering. So as I have introduced before, many kind of information-retrieval operations on the
web are searching information, searching documents about particular entities. For example, for
common people, they might want to know the latest events for their favorite celebrities. For business people, for example, after they release some product, they want to know the feedback or reviews of their product. And recently, TREC also proposed a track about knowledge base acceleration: they want to help Wikipedia editors identify which documents are relevant for some particular entity and then help them enrich the content of Wikipedia in a faster manner. So we can see that all
these kind of scenarios are all about querying by entities. They want to find documents that
relate to entities, so the goal of our work is to build a system to better support -- to facilitate -- these kinds of applications. In order to build this system, what information do we need? The system needs to know what kind of entity we want to track and to know the information about that entity, and we will say that, actually, using the entity name is usually insufficient, because it can be very ambiguous: given "Michael Jordan", it can refer to different people. So in our project, our idea is that we will directly give some page that
characterizes this entity as input, so for most of the entities in the world nowadays, we will have
some pages on the web characterizing this entity. For example, for a movie star, we will have an
IMDb page. For a product, we will have their specification page. For academic people, we
would have their homepage, or for a popular entity, we will have their Wikipedia page. We will have some descriptive page that characterizes this entity. Using these kinds of pages as inputs has two advantages. First, we definitely resolve the ambiguity problem. And second, it also provides more information about this entity, so the system can better understand the input query. And in terms of output, given some entity, we will try to know what kinds of documents are relevant or irrelevant for this entity. Here is a simple example. The first one is relevant, while the second one is irrelevant, because it's mainly talking about Steve Jobs instead of Bill Gates. And similarly, here, for Michael Jordan, the first one is related while the second one is unrelated, because the query and the first document both refer to the computer science Michael Jordan.
>>: [Indiscernible].
>> Mianwei Zhou: So the input will be a page that characterizes.
>>: So a page, we see this entity name.
>> Mianwei Zhou: Yes, yes, it characterizes this entity. But you know, how to define this kind of relevance or irrelevance is pretty subjective. Different applications, or even different labelers, will have different criteria, so in our work, we try to learn that from user labels, instead of defining some criteria for what is relevant or not. So this will be the input, this will
be the output, and this will be the problem we study: given a query page and a lot of documents, we want to find which documents are relevant or irrelevant. We call this the entity-centric document filtering problem. This is the problem we are studying in this work. So why is this problem challenging? The key challenge of this problem is that we have a very noisy entity page as input; the query is very long. Let's look at traditional IR. Usually, in traditional information retrieval, people use very short keyword queries. A keyword query only contains one or two keywords, or one sentence, and people have already developed a lot of information-retrieval models for those scenarios, like BM25, the Language Model, the Vector Space model. They all work pretty well for this kind of short keyword query. However, at the same time, people have also observed that the performance of these information-retrieval models usually degrades when the query becomes longer, for example, when it consists of two or three sentences. When the query becomes longer, these IR models don't work well. Well, in our problem, it becomes more difficult. Our query
becomes a very long page. It's not just a few sentences. It's a very long document. So in this
document, definitely, it provides a lot of information about this entity. However, at the same
time, it also incurs a lot of noisy keywords, which may not be that relevant for this entity. So in
this case, definitely, traditional information-retrieval models wouldn't work very well here. So this will be the key challenge we focus on in this work: the noisy query page as input. So how can we solve this? Instead of directly proposing some quick solution, we actually start by investigating the fundamental principle behind document scoring. It's a very simple question: given a query and a document, how can we judge whether the document is relevant to this query or not? The answer is that we need to check the content, right? Basically, we need to check how this document contains keywords that are important for this query. It's a very straightforward, intuitive idea. We call this the document scoring principle.
The relevance of a document depends on how it contains keywords that are relevant for this query.
But, actually, you might notice that this kind of very simple principle actually contains two
questions. First, which keywords are actually important for this query? So in this scenario, for
example, like Microsoft, Bill Gates, these kind of keywords will be very important for this input
entity, right? And second question, how the document -- how the keywords are contained in the
document? So usually we will see, if this kind of document contains more important keywords,
it's more likely to become a relevant document. So it actually can be decomposed into two
questions. And traditional non-learning scoring models actually fulfill this principle quite well. For example, BM25 and the Language Model actually follow this principle. For the question of which keywords are important, they generally say that keywords that appear a lot in the query and less in the whole corpus are more important -- those high inverse document frequency keywords will be the more important keywords. And for how they are contained, usually this kind of model will count the term frequency in the document. So this is how traditional non-learning scoring models fulfill the principle.
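For reference (this formula is standard and was not on the slides), BM25 makes exactly this split: the IDF factor answers which keywords are important, and the saturated term-frequency factor answers how they are contained in the document:

    \mathrm{score}(q, d) \;=\; \sum_{w \in q} \mathrm{IDF}(w)\cdot
        \frac{\mathrm{tf}(w, d)\,(k_1 + 1)}
             {\mathrm{tf}(w, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}

with the usual defaults around k_1 = 1.2 and b = 0.75.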
But we will say that these simple answers to the two questions are actually insufficient for our scenario. Different from traditional information retrieval, our query becomes a long page, so simply using IDF is not enough. In order to identify the important keywords, we need to make use of more information. For example, we need to look at the kind of keyword and how it appears: whether it appears in the title, in the infobox, or in the opening paragraph. We have more information to leverage to identify which keywords are important for this entity. And similarly here, because our target document is a webpage, and a webpage is also somewhat structured -- it has a title, a URL, the [indiscernible] text, and also things like pictures or words -- there is also a lot of other information characterizing how the keywords are contained in the target document. And once we want to incorporate more information about these two questions, simply manually fulfilling the principle becomes impractical. Because of that, we ask: is it possible to learn to answer these two questions? So it's very natural to think about learning to rank. But when we looked at learning to rank, we found that traditional learning to rank actually fails to learn to fulfill this principle.
Right now, there is a community -- the academic community has already proposed many
learning to rank models, like RankSVM, LambdaMART or RankBoost, a lot of different kind of
learning to rank models. And all these kind of learning to rank models actually follow the same
abstraction. They try to learn a ranker for a query and a document: they have some input features to characterize the pair, and they try to output the document relevance.
And from this abstraction, we can see that actually this kind of abstraction doesn't learn to fulfill
our principle, because in the principle, the most important concept is keywords, how the
keywords appear in the query, and how the keywords are important for the query and how the
keywords are contained in the document. The concept of the keyword is not modeled here. So how does this type of learning to rank model actually score the document?
>>: I'm sorry. What do you mean by keywords here?
>> Mianwei Zhou: Actually, by keywords I mean -- maybe "words" is actually the better term. It means the words in the query, each word in this query, and how this kind of word appears in
the document.
>>: Why does it not appear in the document?
>> Mianwei Zhou: Why it's not appearing --
>>: You said the keyword is not --
>> Mianwei Zhou: So the keyword does not appear in the abstraction, this abstraction. In this model abstraction, we already have the query and the document. We don't model the word, the concept of a word, here. Yes.
>>: So I can say that you can model that in a feature function, right?
>> Mianwei Zhou: Yes, so that's my point here. So in order to make this kind of system work,
they need to use this kind of feature, BM25, Language Model. They need to -- in this feature
design, they need to characterize how the document contains keywords from the query. So,
actually, they require the designer to fulfill the principle in the feature design instead of learning
that. And as we have seen before, for some scenarios, manually designing this kind of feature -- manually fulfilling the principle -- is difficult, as I claimed before, because the query is complex and the document is also complex, so manually designing these features is difficult. So this kind of traditional learning to rank framework simply leaves this burden to the feature
designers. So it's not good. So that's why I claimed that they didn't learn to fulfill the principle
by simply leaving it to the designers. So the purpose of our work here is to facilitate the feature design for people, and the first important concept we propose is called the feature decoupling idea. The key point is that in traditional feature design, people need to describe how a document contains keywords that are important for the query. And we say it's difficult, because in this feature design, people need to address these two questions together. Our intuition is that, actually, we can decouple this feature design into two types of more elementary features, such that one type of feature addresses one question at a time. This kind of feature becomes much easier to design. So this is the key intuition of our work. We try to decouple the features into two directions: one is the meta-features, characterizing which keywords are important, and the other is the intra-features, characterizing how the keywords are contained. Yes. So here are some example features. For our entity-centric document filtering, an example feature can be, for example, a general feature like IDF
can be used as a meta-feature on the query side, and whether this keyword is a noun or a verb, whether it is mentioned in the entity or not; and you also have some structural features, like the position of this keyword in the page, or whether it is mentioned in the infobox. These are all meta-features, the query-side features. Similarly, we can define the intra-features: how the keywords appear inside the document. So, for example, different positions, in the URL or in the title, and different representations besides simply counting the term frequency -- we can have a log-scaled TF or normalized TF. We have different kinds of term frequency representations characterizing how the term appears in the document. So these are the example features we designed for our application. And because we decouple the features, we will have a new learning to rank framework. So -- yes?
>>: So essentially, the left-hand feature is a query-level feature, which --
>> Mianwei Zhou: Query-level feature.
>>: The right-hand side is query URL features.
>> Mianwei Zhou: Document features.
>>: Query document features.
>> Mianwei Zhou: Keyword document features, I think.
>>: So you can also put BM25 and all these features in the intro feature, as well?
>> Mianwei Zhou: For BM25, I think you have already combined the keywords inside the query. BM25 is a summation over all the keywords inside the query.
>>: I'm saying that you can also put it in, right?
>> Mianwei Zhou: No, I think -- so the meta-feature will be f(w, q), for the keyword and the query, and then BM25 is like a feature defined on the document and the query. So this BM25 will correspond to the traditional learning to rank features.
>>: Yes, but for each of the words, you can treat it as a document per return and sum over all the
terms. It's going to be a BM25, right?
>> Mianwei Zhou: Yes, you can say that, but BM25 usually is a sum of all of our different
keywords in the query.
>>: All the keywords?
>> Mianwei Zhou: Yes, all the keywords. Usually, this is how BM25 is designed. So it's just
like this one is designed over one single keyword.
>>: Oh, I understand this part. This is a query-level feature, model.
>> Mianwei Zhou: This is also a keyword part. You also define it over one keyword at a time.
>>: I think what they mentioned is that the right part is only document level, right?
>> Mianwei Zhou: No, no [indiscernible], only the keyword document.
>>: For the document, yes.
>> Mianwei Zhou: Yes, and this part is like keyword for the query.
>>: Can I assume you faced no overlap between -- there is no keyword overlap between left part
and right part. You can learn [indiscernible]. You can learn [indiscernible].
>> Mianwei Zhou: So usually, when we are doing an information-retrieval task, we will see
how a document contains keywords only from the query, so if a keyword isn't mentioned in the query, usually we don't count it. So, yes, basically, we just look at the overlapping keywords between the query part and the document part. Yes. So the basic idea on this slide is that we try to decouple the feature design, and then we will have a new learning to rank problem. This setting is still like traditional learning to rank, where you have some query and relevant and irrelevant documents for this query, and then, given a new query entity, also characterized by a Wikipedia page, we want to judge whether a document is relevant or irrelevant using the learned model. This is pretty much similar to traditional learning to rank. Yes.
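A minimal sketch of what the decoupled input might look like in code; the feature names are examples taken from the talk (IDF, part of speech, page position on the query side; TF variants and document fields on the document side), and the exact representation in the paper may differ.

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class DecoupledInstance:
        """One training instance: an (entity page, document) pair described
        by two feature sets instead of a single joint feature vector."""
        # Meta-features: one vector per overlapping keyword, computed from the
        # query page only (e.g. IDF, is_noun, in_title, in_infobox).
        meta: Dict[str, List[float]]
        # Intra-features: one vector per overlapping keyword, computed from the
        # target document only (e.g. TF, log TF, in_title, in_url).
        intra: Dict[str, List[float]]
        # Optional document-level features (PageRank, document length, ...).
        doc: List[float]
        # Relevance label for the pair.
        label: int

    example = DecoupledInstance(
        meta={"microsoft": [7.1, 1.0, 1.0, 1.0], "the": [0.2, 0.0, 0.0, 0.0]},
        intra={"microsoft": [3.0, 1.1, 1.0, 0.0], "the": [12.0, 2.5, 0.0, 0.0]},
        doc=[0.4, 532.0],
        label=1,
    )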
>>: So is one assumption that you have [indiscernible] intention, so is the keyword that exists
for the query independent of the documents or not?
>> Mianwei Zhou: Independent of the documents?
>>: For the page Bill Gates, the keyword is always Bill Gates and Michael, so independent of
the documents, already what you are going to -- by learning to rank keyword depends on even
document relations?
>> Mianwei Zhou: Actually, no. Given the document, I check which keywords mentioned in this document also appear in the query -- they are the overlapping keywords. Given the query and the document, I check their overlapping keywords.
>>: Okay.
>> Mianwei Zhou: Yes. But yes, definitely, we can extend it. For example, given a query, we
can extend it using synonyms or other techniques to find more keywords related to this query. We can also do that. But as I said, because of the decoupled features, we will have a new learning to rank framework: this time, the query and document pair is described by two types of features, instead of one feature function like before. So how to solve this kind of decoupled-feature-based learning to rank problem will be the focus of our work. Since the features are decoupled, we need to recover them back, so let's review the document scoring principle: the relevance of a document depends on how it contains keywords that are important for the query. Mathematically, we can translate it here: the relevance of a document to a query becomes a summation of the contributions of each keyword for this document and this query. Yes, this will be the keyword contribution for the query and the document at the same time. And what I mean by recovering is: how can we define this contribution function based on the meta-features and the intra-features? This will be the first requirement in our learning to rank framework. And
then a second requirement that this kind of function should be noise aware, so as I mentioned
before, the key challenge is that our query is very noisy, so this contribution function should be
aware of those noisy keywords. So how can we fulfill these two requirements? The second concept we propose is called the inferred sparsity idea. The idea is very straightforward. Because the query is long, there will be a lot of overlapping keywords, and lots of those keywords are noisy -- they may not be related to this query. And because of the existence of these noisy keywords, if we just assign some low value to them, their scores might accumulate, and as a result it will affect the final scoring accuracy. Because of this, we require that for those noisy keywords, the contribution function should be equal to zero. We call this inferred sparsity.
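In symbols (the notation here is mine, not from the slides), the decoupled scoring principle with the inferred-sparsity requirement reads:

    s(q, d) \;=\; \sum_{w \,\in\, q \cap d} \phi(w, q, d),
    \qquad \phi(w, q, d) = 0 \ \text{ whenever } w \text{ is a noisy keyword of } q,

where the contribution \phi has to be computed from the meta-feature vector of w (query side) and the intra-feature vector of w in d (document side), rather than being a free per-keyword weight.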
>>: How do you define if it's a noisy keyword?
>> Mianwei Zhou: Yes, this is a good question. Actually, I will mention that. Because we don't
have keyword labels saying whether a keyword is noisy or not, right, this is simply a key challenge I need to handle in the later solution. I will introduce it very soon. First, let me explain why this is called inferred sparsity. It's pretty related to traditional sparse learning. In sparse learning, in order to select features and improve the prediction accuracy, people require the feature weightings to be very sparse: only the important features will have nonzero weights, while the unimportant ones should have zero weight. And you see, right here, we want the contribution function to be nonzero only for those important keywords, so they're actually pretty related. But they're also somewhat different. In traditional sparse learning, the alpha here is a feature weighting; it's a free parameter, so they can make it sparse by using techniques such as L1 regularization. Here, the target we want to sparsify is a function instead of a free parameter. This function is defined based on our decoupled meta-features and intra-features. So the traditional technique doesn't work well in our
scenario. To achieve this inferred sparsity, the key idea is that we use a two-layer scoring model. Basically, I try to learn a keyword classifier to judge whether a keyword is important or not. Here, C denotes the keyword classifier: if it equals zero, it means the keyword is a noisy keyword, and if the keyword is important, it will have some nonzero value. And based on the output of this keyword classifier, I further determine the keyword's contribution based on how it appears in the target document. But designing this model is not that easy. As you have mentioned, we don't have keyword labels here, so it's not easy to learn the keyword classifier, and also, how can we design the features to determine these two aspects? So how can we solve it? Our solution is to use a graphical model, a restricted Boltzmann machine style model, to handle that. The general idea is like this. Given the document, we use the random variable Y to denote whether this document is relevant or not. The document contains a lot of keywords, right, so for each keyword, we also have a variable denoting whether this keyword is an important keyword for this query or not: if it takes a nonzero value, that means it's important; if it equals zero, that means this keyword is a noisy keyword. And then we have three types of factors to determine how to define the keyword classifier and what the keyword contribution should look like. First, we have the keyword classifier, and we use this factor to represent it. According to our definition of the meta-features -- we use meta-features to characterize whether a keyword is important for the query or not -- we use this factor to incorporate that information. And this part represents the contribution of a keyword to the document: if this is an important keyword, then we further look at how it appears in the target document, so we measure it here; and if it's not an important keyword, then this keyword should be eliminated, so it should not contribute any score to the final document relevance, so we have zero here. So basically, it's an [indiscernible] line, and we also have document factors. In this part, we can incorporate traditional learning to rank features like PageRank and document length, or even BM25 and the Vector Space model -- we can have these traditional features combined in these factors. And then our algorithm just tries to learn this model. This design actually handles the lack of keyword labels, because we can simply view this part as a hidden variable, and then, based on the document labels, we are able to learn which keywords should be important, as well as how they should contribute to the final document relevance score. And the key advantage of this model is that it's actually a tree structure, so when we try to optimize this model, we can use belief propagation to do it efficiently. So this is the key idea. Yes.
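A highly simplified sketch of that two-layer idea, reusing the DecoupledInstance example from earlier. A deterministic sigmoid gate stands in for the hidden keyword-importance variables; the actual model in the paper is a restricted-Boltzmann-machine-style graphical model in which noisy keywords are forced to contribute exactly zero and which is trained with belief propagation.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def dot(weights, features):
        return sum(w * f for w, f in zip(weights, features))

    def score_document(inst, w_meta, w_intra, w_doc):
        """Two-layer scoring over decoupled features.
        Layer 1: a keyword classifier on meta-features gates each keyword,
        so noisy keywords get a gate near 0 and contribute almost nothing
        (an approximation of the exact zeros in the graphical model).
        Layer 2: gated keywords contribute according to their intra-features."""
        total = dot(w_doc, inst.doc)                 # document-level factor
        for kw, meta_vec in inst.meta.items():
            gate = sigmoid(dot(w_meta, meta_vec))    # importance of this keyword
            contribution = dot(w_intra, inst.intra[kw])
            total += gate * contribution
        return total                                 # higher = more likely relevant

    relevance = score_document(example,
                               w_meta=[0.9, 0.5, 1.2, 1.0],
                               w_intra=[0.3, 0.2, 0.8, 0.4],
                               w_doc=[0.1, 0.0])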
So, to make a summary: in our technique, there are two important ideas. The first one is the feature decoupling idea -- decouple the feature design so that people can more easily design the features -- and the second is inferred sparsity -- we can filter out those noisy keywords and only use the important ones. And then we say, actually, this kind of framework is
a very general framework that can be applied to other applications. Take, for example, a recommendation system that tries to recommend some items for users. We can have a similar principle here, call it the item ranking principle, because we can connect the user and the item by their attributes. Usually, when we try to measure whether an item is good or not for a user, we will see how this item matches the attributes that the user is interested in, right? So they're connected by the attributes. So the user here becomes the query, the attributes become the keywords from before, and the items become the documents. Similarly, we can have a lot of meta-features and intra-features to characterize these two links. With meta-features, we can characterize how much the user likes these attributes -- for example, some long-term preferences, some short-term preferences, how he or she has rated things -- different kinds of features to characterize how he or she likes these different attributes. And for the intra-features, we can similarly have intra-features for the item attributes -- for example, do we have the exact value of the attribute, or do we just have some estimated information, or is this value missing? We can characterize these things as feature vectors. And I also investigated applying my work to the domain adaptation
framework. So here is one application. We have a lot of reviews coming from some domain -- lots of reviews, positive or negative, for example, from the book domain -- and we try to learn a model which can be adapted to the kitchen appliance domain. Yes, and this is a domain adaptation problem, because the different domains use different keywords. A book review will use keywords such as "interesting" or "boring", while a kitchen appliance review will use words like "high quality", "leaking", "broken". They use different keywords. So how can we match them? Previously, for this application, people proposed an algorithm called structural correspondence learning.
The idea is to try to match these keywords based on their mutual information -- their correlation with common keywords such as "good" or "bad". Their intuition is that keywords with similar correlations will have similar sentiment. And then they proposed their solution. Well, what I want to say here is that their solution is pretty specific to the application. We can say this is actually just meta-features -- they can be viewed as meta-features in our framework -- and similarly, we can have intra-features characterizing how each keyword appears in the document, so we can apply our framework to this kind of scenario as well. So in terms of
experiment, in my paper recently submitted to KDD, I evaluate our framework in two different types of applications. One is the entity-centric document filtering, and the second one is the domain adaptation task. Basically, it uses a data set, tries to learn the model from some domain, and then tries to adapt it to a different domain. And then we compare our solution with many different kinds of baselines. First, the standard learning algorithm, basically just using the traditional learning to rank framework, but with some features we designed for it. For example, for entity-centric document filtering, I use a lot of learning to rank features, BM25, the Vector Space model, and then compare. Basically, there is LETOR, the Learning to Rank data set from MSR; on that page, there are a lot of features designed, so I just followed that feature design to build a lot of traditional learning to rank features and then built this baseline. The second one is where I simply combine the meta-features and the intra-features by multiplying them, much like TF-IDF simply multiplies TF and IDF. For this baseline, I don't use our graphical model solution; I directly multiply the two types of features, and the disadvantage is that it doesn't fulfill the inferred sparsity requirement -- it will introduce a lot of noisy contributions, because I simply multiply them together. And this solution is a boosting framework that I proposed in my previous paper; my graphical model actually extends this model and outperforms it. And this algorithm is specifically designed for domain adaptation; we compare the performance with this existing work.
>>: So as far as I know, the [indiscernible] baseline is quite low, already. People already did a
lot of experiments on this data set. So did you try to compare to a more advanced baseline,
like the neural network one?
>> Mianwei Zhou: No.
>>: The [indiscernible] and then the other baseline?
>> Mianwei Zhou: Oh, okay, this part I haven't considered, because in this paper, especially, I
just wanted to show the possibility of applying this approach to domain adaptation. So I didn't -- yes, that part, I haven't looked at that. Maybe in a later experiment, I can try to use stronger baselines for that. Yes, but I include the comparison with the domain adaptation work just to say that it can be applied to that scenario using a more general framework. So
this is the experimental result for our framework. Yes, so in this work, we are studying a querying by entities problem, the entity-centric document filtering problem, and we proposed a general framework to solve it. As for future work: so far, we have studied querying by entities and querying for entities. We will try to move one step forward and study their overlapping part. And also, we want to try to connect it back to the traditional
information retrieval problem. This will be the direction of our future work. So first, for the
querying by and for entities, the overlapping part, we are studying the relational entity search problem. Basically, the idea is that given some entity, we try to find other entities that are related to it, according to their attributes. For example, given Bill Gates, we can find his wife, his education background, or the companies he founded -- different kinds of attributes. Right now, this kind of project has been realized in industry: for example, Google has the Knowledge Graph, and Microsoft has [indiscernible], this kind of knowledge graph idea. However, this kind of output is supported by a back-end knowledge graph, so if the entity is not that popular and is not stored in the knowledge graph, then it cannot be shown in the result. So in our future work, what we are studying is to build a lot of relational searchers for that. We try to do online relation extraction for this kind of input query. For example, we will learn a lot of searchers: one searcher specifically finds the wife of a person, another the education. We have different relations specifically designed for this person, for this type of entity. And then we can try to rank the entities according to these returned results -- yes, based on how
these kinds of entities appear in web documents. As for the current progress, we have learned some models that try to learn these kinds of rankers based on some data from a knowledge base -- for example, Freebase: we extract a lot of relational data there and then use it to learn these searchers, and this has already been published. However, there are still a lot of unsolved problems, and this is what we are working on right now. For example, in the previous work, we only worked on one relation at a time, so right now we are trying to figure out how we can efficiently search multiple relations together, to improve both efficiency and accuracy, and also how we can resolve entity ambiguity: the input query can belong to different persons, so how can we search their relations to discriminate between the different entities? So this will be the first direction of our relational entity search work. And
the other future work I am working on right now is about entity-intent-based passage scoring. Basically, we try to better understand the entity concept hidden inside the query, and we also want to search for passages that support this kind of question answering. So the problem is how we can rank passages according to the entity intent hidden inside the query; this is also a problem I'm working on right now. Yes, so this is my work. I mainly focus on entity-centric search, and I also have some other related work on citation prediction and social annotation. So if you're interested in those other directions, we can also discuss them offline. Yes, thank you.
>>: We have time for questions. I guess a lot of you are probably going to want to have more in-depth discussions when you are meeting with the candidate.
But if there is any question, quick question?
>>: How do you envision this in a product? For example, it's [indiscernible].
>> Mianwei Zhou: Which work, actually?
>>: On this querying by entities, for entities.
>> Mianwei Zhou: You mean, how to evaluate that?
>>: Not evaluate it, but how do you envision it working in a product like [indiscernible]?
>> Mianwei Zhou: Okay. So, for example, for the first piece of work, I don't think I propose a new problem. I just propose a new platform, a back-end platform, that can be used to help different kinds of querying for entities applications. So, for example, when we design different kinds of projects, we can easily just write some simple queries to fulfill a lot of operations that were difficult before. Yes, so this can be used as a back-end supporter for different kinds of applications, online applications.
>>: Applications which we can use [indiscernible]?
>> Mianwei Zhou: Yes, yes. So, for example, actually, one example application can be simply
the entity search, where we can still use a keyword query, but it can be translated into our designed query language and then return a result. So this is built upon our algorithm -- yes, our platform.
>>: Why would it come up with 100,000?
>> Mianwei Zhou: I just ranked the results. The second one may not be good; the top one is most likely correct, because it is mentioned more often with "inhabitants", based on these kinds of patterns I designed.
>>: So why China is? Oh, okay.
>> Mianwei Zhou: I think it's pretty close. It [indiscernible]. It mentioned multiple cities, so there are a lot of errors, I guess, based on some of the entity extraction. We can design some
applications on top of our general platform. And as for our second piece of work, I think it can be very useful, as I said, for those kinds of motivating scenarios I mentioned here -- trying to find documents relevant to an entity. It's also easy to get the homepage or descriptive page and some text data for an entity, so it can be used to improve the accuracy for that part.
>>: All right, so if there are no further questions, then let's thank the speaker.