>> Kuansan Wang: All right. Let's get started. It is my greatest pleasure to welcome Mianwei
to come back and have an interview with us. Mianwei has been an outstanding intern with ISRC
before, and he's graduating this summer, I hope. His work is on entity-centric search, and this happens to be an area that we are focusing on a lot, so we are very happy to have him to
come and tell us what he has been doing. So without further ado, Mianwei.
>> Mianwei Zhou: Okay, thank you. Thanks, Kuansan. Yes, so thank you for coming to my
job talk today. I'm very happy to share the work from my PhD studies with you. Yes, so the topic I will show here is about entity-centric search: querying by entities and for entities. So should I
speak louder, or is this volume fine?
>>: This is fine.
>> Mianwei Zhou: Okay, sure. So you know nowadays the web has become a very huge
database, storing all kinds of different entities, so by entities, that means like, for example,
person, like products, like companies. All these things are entities. And because of this, most of
the information-retrieval operations on the web are also related to entities: people search for all kinds of information, for example, the latest events about some person, the phone number of some company, some historical facts. So all these kinds of information-retrieval operations are entity centric. According to some statistics reported in previous work, 72.9% of web queries are actually entity oriented and 71% of web queries contain named entities. Well, in
order to get this information, nowadays people are used to using search engines. In the traditional search engine framework, people use keywords to characterize what they want, and then the search engine returns lots of documents. But we also find that this kind of search framework cannot well support these entity-centric search operations. So, for example, given a query like "when did Michael Jordan graduate", in terms of the input, if the search engine is not aware that this query is actually trying to search for an entity, and also has some entity involved in it, then it may not be able to judge which result is correct. And also, whenever you use keywords, usually you have some ambiguity issue
there. And in terms of the result, for those kinds of search operations, when people try to search for some particular entity, directly returning a document may not be a good option either, because the users have to look into the document to find answers. Yes. So in my PhD studies, we tried to move beyond this kind of traditional information-retrieval framework. We want to study, when the concept of entity is involved in the input query or in the output result, how we can actually change these search behaviors, such as linking results or something. So here I use a diagram to show the position of our work: I use the x-axis to show the input and the y-axis to show the output. In this way, traditional IR will be in this area, and what we will study is when the concept of entities appears in the input or output. We call these kinds of operations entity-centric search operations. So
specifically, according to whether the entities appear in the input or in the output, we can further
categorize these kind of search operations. So if the concept of entity appears in the input,
usually people will know some entity, and then they want to find some information about this
entity. We call this kind of operation querying by entities. For example, find the latest events for some person, or some useful reviews of some product. And in contrast with querying by entities, we also have querying for entities. That means the users will expect the output result to contain the entity. In this case, for example, we
want to find the highest building in the world or the most expensive handbag or something like
that. Yes. You can also think that these two areas have some overlapping area. So in this kind
of operation, people usually try to search for entities that match some desired relation with the input entities, so it's querying by and for entities. For example, the phone number of some company, or the graduation year of some person; this will have entities involved in
both the input part and also the output part. So basically, my research work can fit into these
three categories. Yes, so for the querying for entities part, we have some work about it already, the content query system. For the querying by entities part, we have the entity-centric document filtering, and in the overlapping area, we have the relational entity search. In this talk, I will specifically focus on this area -- this work and this work, the querying by entities work and the querying for entities work. For the relational entity search, although we have some publications, right now I'm still actively working on that, so I will briefly introduce it in the future work. So this is going to be the overview of my talk today. I will start with querying for entities and querying by entities, followed by our future work. Yes. It's pretty
much just an overview of my talk today. So let's start with the querying for entities part. So the
position of this work is here. So in this work, we tried to design a general data-oriented content
query system, to support searching directly into documents to find related entities. The input will be some keywords; the output will be some entities. The motivation of
this work is that, nowadays, we are actually witnessing many different kinds of efforts trying to search for entities inside text data. For example, entity search, proposed around 2006, tried to use some keywords to directly return an entity as a result. With information extraction, people try to use some simple patterns and then use some idealized [indiscernible] to iteratively get a lot of relational data from web text. With web-based question answering, people give some question as input and want to find an answer directly as the result. Well, you can see that these different efforts are all doing similar stuff. They have different keywords or different patterns and try to find entities inside the text data. They all belong to the querying for entities applications. Well, since these projects were developed independently, the keywords, patterns, and scoring functions are all hard-coded in their projects, and that's their limitation. What one project developed cannot be easily reused in other projects. So the motivation of this work is: can we build a
more general system to better support these kind of different applications? So this is the goal of
our work. Since they are doing similar stuff, so we want to build a general system that can better
support these kind of different querying for entities applications -- query for entities projects. So
this is my idea. It's actually very similar to databases: before databases were invented, when people tried to do some data management job, they needed to write the data storage and data indexing themselves in different projects, and then the invention of the DBMS largely simplified their jobs. It's similar here: we want to build a general system to better support these different querying for entities applications. So how should this system look? A couple of
requirements. As I have mentioned before, different projects actually will have different kind of
entity extractions and different kinds of scoring functions to find out the correct entities. And
also, they need to support different entity types. So specifically, we require that, in order to build a general system, it should support very flexible entity extraction, the scoring function should be customizable for different applications, and the entity types should also be extensible -- every time we try to support some new entity, we don't need to change the product code; we can directly extend it easily in the system. So these are the requirements of our system. To handle this, our solution is kind of a relational model. The key idea is that we try to view the sets of keywords and entities as a lot of tables over the corpus in the back end, and then we build a system on top of that, and we design a language to query them and return the result as a kind of table. I know it's
still a bit unclear here, so I want to show you a demo to see how actually it works. So this is a
system that I developed for this project, and here's the interface. This system is built upon a Wikipedia corpus and has some entity types extracted and indexed, and then it supports a query language to query those entities -- for example, person, location -- in the text data. So how does it work? Let me write a simple query first. Say, for example, I want to find all the person entities: we'd like to have some pattern here, and say we want to find all the persons which have "computer science", this kind of noun phrase, appearing close to them. Then we process it and get all the entities that match this kind of pattern; that is the idea. So we can write some different kinds of patterns here and specify what kind of entities we want. Here we use the braces to mean that "computer science" is a sequential phrase. Here, we use the square brackets on each side of it, kind of like window patterns. Yes.
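For illustration, here is a minimal sketch (not from the talk) of how such a window pattern might be evaluated over indexed keyword and entity occurrences; the data layout, names, and the 10-word window are assumptions based on the description above.

    from collections import defaultdict

    # Toy "tables": one posting list per keyword and one per entity type.
    # Each occurrence records (doc_id, position); a real index would also
    # store the span and an extraction-confidence score, as described later.
    keyword_index = defaultdict(list)   # token -> [(doc_id, pos), ...]
    entity_index = defaultdict(list)    # type  -> [(doc_id, pos, surface), ...]

    def add_document(doc_id, tokens, entities):
        """entities: list of (position, entity_type, surface_form)."""
        for pos, tok in enumerate(tokens):
            keyword_index[tok.lower()].append((doc_id, pos))
        for pos, etype, surface in entities:
            entity_index[etype].append((doc_id, pos, surface))

    def window_query(entity_type, phrase, window=10):
        """Entity occurrences of `entity_type` that have the sequential
        phrase `phrase` appearing within `window` words of them."""
        phrase_tokens = [t.lower() for t in phrase.split()]
        starts = [
            (d, p) for d, p in keyword_index[phrase_tokens[0]]
            if all((d, p + i) in set(keyword_index[t])
                   for i, t in enumerate(phrase_tokens[1:], start=1))
        ]
        return [
            (d, surface)
            for d, p, surface in entity_index[entity_type]
            if any(d == sd and abs(p - sp) <= window for sd, sp in starts)
        ]

    # Example: person entities with "computer science" within 10 words.
    add_document(1, "Burgan studied computer science at a small college".split(),
                 [(0, "person", "Burgan")])
    print(window_query("person", "computer science"))   # [(1, 'Burgan')]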
>>: Do you have a ranking of these persons?
>> Mianwei Zhou: Right now, no. For ranking, we need to have something like group by and order by. I can first show how these entities match the pattern -- I can use a context function. I can see how they match our patterns. For example, here we get a person name, and then we see how "computer science", these kinds of words, appear around it. So this is just some patterns similar to that. And then, if we wanted to support weighting, we would need to order the results. So, for example, I show a query here. So in this query --
>>: Wait. I don't see any connection between broaching with computer science being
[indiscernible].
>> Mianwei Zhou: So yes, so computer science. So that's another person here. So, yes, there's
no connection, but I just say we'll collect -- so this basically is a collecting phase.
>>: What does this context mean?
>> Mianwei Zhou: Context means that we show that, for example, Burgan here is a person name
here, and "computer science". We try to find all the entities that match -- that have these words appearing around them.
>>: That means that they are 10 words apart?
>> Mianwei Zhou: Yes, 10 words around, 10 words apart. Within 10 words, actually.
>>: What's the raw data?
>> Mianwei Zhou: The raw data is the -- so this actually is a smaller demo, so this demo is just
built on the Wikipedia corpus. I think it's 2,000 Wikipedia corpus. We will have a larger system
built on a large data set using [full] web data. That one is bigger.
>>: The context is the document, and the person is the title of the document?
>> Mianwei Zhou: The person is the entity I want to extract. Here, for example, I extract a lot
of entities from this kind of context here.
>>: For the first evidence, you actually did understand, right, because this is a paper written by
this guy? Unfortunately, in this journal, it's like monographs in computer science.
>> Mianwei Zhou: Yes, so this is not a ranked result. This is just some snippet that matches this pattern. If we want to rank the results, we still need to do some aggregation and ranking. So, for example, here, if I would like to find the population of China, then maybe I write a similar pattern, "population of", and then close to "China". And then I can do some group by, because those entities that match the pattern frequently will be more likely to be the correct result, right? So we do some aggregation, and then we can do some ranking.
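A sketch of the aggregation step just described, assuming the pattern matches come back as a list of extracted values, one per matching snippet; grouping by value and counting is one simple way to rank them, as in the population-of-China example.

    from collections import Counter

    def rank_by_frequency(extracted_values):
        """Group extracted entity values and rank by how many snippets
        produced each value; values matched by many snippets are more
        likely to be the correct answer."""
        return Counter(extracted_values).most_common()

    # Hypothetical output of a "number near 'population of' and 'China'" query:
    matches = ["1,339,724,852", "1.3 billion", "1,339,724,852", "100,000"]
    print(rank_by_frequency(matches))
    # [('1,339,724,852', 2), ('1.3 billion', 1), ('100,000', 1)]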
>>: I see. So now you have all these kinds of tables constructed, and then you use this kind of --
>> Mianwei Zhou: Yes, yes, it seems like this. But, actually, in the back end, there's no table. It's still just an inverted index.
>>: So in this way, you cannot support a natural language parser?
>> Mianwei Zhou: No, so I just say this is a general system to better support this kind of project,
so it's a kind of like a database, actually.
>>: So what are the key technologies here? The area is, how do you fill the inverted index, or
how do you [indiscernible] this or how do you get the basic data? Can we see some data?
>> Mianwei Zhou: So yes, in this talk, actually, I didn't go into the details about the technique,
but in the paper, the key technique is, for example, when this kind of pattern becomes very complex, what kind of inverted index do I need to choose -- the query plan, and which index should be used. So this is the general idea: basically, try to support querying for entities
inside text data. I already showed how it works: for each occurrence of a keyword or entity, I just view it as a row in a table. Here, we will have its position, we'll have its span, we'll have some confidence-like measure of whether this kind of extraction is correct or not. We store this information in the inverted index. And then, as I said before, we need to fulfill two requirements, flexible entity extraction and a customizable scoring function, so how can we do that? Basically, we use the relational model. For example, we use a FROM clause to specify which entities we want, use some pattern to filter the results, use some aggregation to aggregate the results by how frequently they are mentioned, and use some customizable scoring function to rank the results. So, basically, it's a general system
to support it. And another interesting idea for this work: I have mentioned a third requirement, extensible entity types. In a lot of projects, when they try to support
new entities, they need to rebuild the index, do a lot of other things. Well, for this project, we
use the view concept from databases to support extensible entity types. Like the example shown here, we already indexed number, location, and person, right? But we can, for example, specify some patterns that characterize a number to define a population type from the number type, so I can show an example here. You see, here we define a population from a number -- we define it as a view, like here. We can specify a lot of different patterns, like "inhabitants" or "population of", and define this population type over the number type. By doing that, when we go back to the query, we can query the population directly. Although these entities are not indexed in our back-end system, we can still query them and get the same result. So this is kind of like virtual entity types built from the existing types, and defining such a type is actually pretty easy. We just write the new type here, and then simply say it will appear in the left part, and you can directly query it. So we can define some new types, new entities, based on some existing entities. Yes, so these are the requirements, how I fulfill these requirements in our system, and then the main technique -- yes?
>>: Yes, so I have some question about your definition of entities? What do you mean by
entities?
>> Mianwei Zhou: Entity is so -- so in our system, we pre-extract some entities in the back end.
For example, we extract numbers, locations, organizations. These are our very basic entities. So "define" means I define some higher-level entities based on these basic ones.
>>: In the R, right? In the R package?
>> Mianwei Zhou: Yes, in the R package to extract that, yes. And also, this demo already showed these four types, but in the larger system, I used a lot of dictionaries to match more entity types. Yes.
>>: So the population you extracted, is it a new entity, or is it a new attribute of an existing
entity?
>> Mianwei Zhou: It's a new entity based on the basic entity. The number is pre-extracted and
indexed, while the population is not indexed at all; we just use these kinds of patterns to extract a new type -- to extract those numbers that are more likely to be populations. So it's a new entity type based on the existing types, existing entities.
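Continuing the earlier sketch (again an assumption, not the system's actual syntax): a derived type such as "population" can be expressed as a view over the already indexed base type "number", resolved at query time from trigger patterns like "population of" or "inhabitants", with no re-indexing.

    def define_view(base_type, trigger_phrases, window=10):
        """Create a virtual entity type on top of an indexed base type,
        e.g. 'population' as numbers that co-occur with trigger phrases."""
        def view():
            results = []
            for phrase in trigger_phrases:
                results.extend(window_query(base_type, phrase, window))
            return results
        return view

    # 'population' is never indexed; it is resolved at query time as a view
    # over the base 'number' type (reusing window_query from the sketch above).
    population = define_view("number", ["population of", "inhabitants"])
    candidate_populations = population()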
>>: So I had one question. So what kind of problem cannot be solved by keyword association
and can be solved by this demo.
>> Mianwei Zhou: No, I'd just say this is a back-end system to better support those kind of
projects. I didn't say I solved a new problem. I just built a general system to better support
them. So they don't need to write too much code for their projects. Yes. Okay, so this is the
first piece of our work. As I said, the main techniques focus on the index design -- how can we build an inverted index for those keywords and entities to make this kind of querying faster -- and also some query optimization, which keywords should be used and which entities should be searched, something like that. This talk doesn't go into those details, so if you have questions, we can discuss them offline. Yes. So basically, this is the first piece of our work. It's about querying
for entities, so our work is about data-oriented content query systems. So the second work I'm
going to introduce is this part, and this will be the main focus of my talk today. It's about entity-centric document filtering. The idea here is that the input is some entity, and we want to output documents, and the problem is: given an entity, which is characterized by an identification page, how can we identify relevant documents? So, entity-centric document
filtering. So as I have introduced before, many kind of information-retrieval operations on the
web are searching information, searching documents about particular entities. For example, for
common people, they might want to know the latest events for their favorite celebrities. For business people, for example, after they release some product, they want to know the feedback or reviews of their product. And recently, TREC also proposed a track about knowledge base acceleration: they want to help Wikipedia editors identify which documents are relevant for some particular entity and then help them enrich the content of Wikipedia in a faster manner. So we can see that all
these kind of scenarios are all about querying by entities. They want to find documents that
relate to entities, so the goal of our work is to build a system to better support -- to facilitate -- these kinds of applications. In order to build this system, what information do we need? The system needs to know what kind of entity we want to track and to know the information about that entity, and we will say that, actually, using the entity name is usually insufficient, because it can be very ambiguous: given "Michael Jordan", it can refer to different people. So in our project, our idea is that we will directly give some page that
characterizes this entity as input, so for most of the entities in the world nowadays, we will have
some pages on the web characterizing this entity. For example, for a movie star, we will have an
IMDb page. For a product, we will have their specification page. For academic people, we
would have their homepage, or for a popular entity, we will have their Wikipedia page. We will have some descriptive page that characterizes this entity. Using these kinds of pages as inputs has two advantages. First, we definitely resolve the ambiguity problem. And second, it also provides more information about this entity, so the system can better understand the input query. And in terms of output, given some entity, we will try to know what kinds of documents are relevant or irrelevant for this entity. Here is a simple example. The first one is relevant, while the second one is irrelevant, because it's mainly talking about Steve Jobs instead of Bill Gates. And similarly, here, for Michael Jordan, the first one is related while the second one is unrelated, because the query and the first document both refer to the computer science Michael Jordan.
>>: [Indiscernible].
>> Mianwei Zhou: So the input will be a page that characterizes.
>>: So a page, we see this entity name.
>> Mianwei Zhou: Yes, yes, it characterizes this entity. But you know, how to define this kind of relevance or irrelevance is pretty subjective. Different applications, or even different labelers, will have different criteria, so in our work, we try to learn that from user labels, instead of defining some criteria for what is relevant or not. So this will be the input, this will
be the output, and this will be the problem we study: given a query page and a lot of documents, we want to find which documents are relevant or irrelevant. We call this the entity-centric document filtering problem. This is the problem we are studying in this work. So why is this problem challenging? The key challenge of this problem is that we have a very noisy entity page as input; the query is very long. Let's look at traditional IR. Usually, in traditional information retrieval, people use very short keyword queries. A keyword query only contains one or two keywords, or one sentence, and people have already developed a lot of information-retrieval models for those scenarios, like BM25, the Language Model, the Vector Space model. They all work pretty well for this kind of short keyword query. However, at the same time, people have also observed that the performance of these information-retrieval models usually degrades when the query becomes longer, for example, when it consists of two or three sentences. When the query becomes longer, these IR models don't work well. Well, in our problem, it becomes more difficult. Our query
becomes a very long page. It's not just a few sentences. It's a very long document. So in this
document, definitely, it provides a lot of information about this entity. However, at the same
time, it also incurs a lot of noisy keywords, which may not be that relevant for this entity. So in
this case, definitely, traditional information-retrieval models wouldn't work very well here. So this will be the key challenge we focus on in this work: the noisy query page as input. So how can we solve this? Instead of directly proposing some quick solution, we actually start by investigating the fundamental principle behind document scoring. It's a very simple question: given a query and a document, how can we judge whether the document is relevant to this query or not? The answer is that we need to check the content, right? Basically, we need to check how this document contains keywords that are important for this query. It's a very straightforward, intuitive idea. We call this the document scoring principle.
The relevance of a document depends on how it contains keywords that are relevant for this query.
But, actually, you might notice that this kind of very simple principle actually contains two
questions. First, which keywords are actually important for this query? So in this scenario, for
example, like Microsoft, Bill Gates, these kind of keywords will be very important for this input
entity, right? And second question, how the document -- how the keywords are contained in the
document? So usually we will see, if this kind of document contains more important keywords,
it's more likely to become a relevant document. So it actually can be decomposed into two
questions. And traditional non-learning scoring models actually fulfill this principle quite well. For example, BM25 and the Language Model actually follow this principle. For the question of which keywords are important, they generally say that keywords that appear a lot in the query and less in the whole corpus are more important -- those high inverse document frequency keywords will be the more important keywords. And for how they are contained, usually this kind of model will count the term frequency in the document. So this is how traditional non-learning scoring models fulfill the principle.
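For reference (this formula is standard and was not on the slides), BM25 makes exactly this split: the IDF factor answers which keywords are important, and the saturated term-frequency factor answers how they are contained in the document:

    \mathrm{score}(q, d) \;=\; \sum_{w \in q} \mathrm{IDF}(w)\cdot
        \frac{\mathrm{tf}(w, d)\,(k_1 + 1)}
             {\mathrm{tf}(w, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}

with the usual defaults around k_1 = 1.2 and b = 0.75.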
But we will say that these simple answers to the two questions are actually insufficient for our scenario. Different from traditional information retrieval, our query becomes a long page, so simply using IDF is not enough. In order to identify the important keywords, we need to make use of more information. For example, we need to look at the kind of keyword and how it appears: whether it appears in the title, in the infobox, or in the opening paragraph. We have more information to leverage to identify which keywords are important for this entity. And similarly here, because our target document is a webpage, and a webpage is also somewhat structured -- it has a title, a URL, the [indiscernible] text, and also things like pictures or words -- there is also a lot of other information characterizing how the keywords are contained in the target document. And once we want to incorporate more information about these two questions, simply manually fulfilling the principle becomes impractical. Because of that, we ask: is it possible to learn to answer these two questions? So it's very natural to think about learning to rank. But when we looked at learning to rank, we found that traditional learning to rank actually fails to learn to fulfill this principle.
Right now, there is a community -- the academic community has already proposed many
learning to rank models, like RankSVM, LambdaMART or RankBoost, a lot of different kind of
learning to rank models. And all these kind of learning to rank models actually follow the same
abstraction. They try to learn a ranker for a query and a document: they have some input features to characterize the pair, and they try to output the document relevance.
And from this abstraction, we can see that actually this kind of abstraction doesn't learn to fulfill
our principle, because in the principle, the most important concept is keywords, how the
keywords appear in the query, and how the keywords are important for the query and how the
keywords are contained in the document. The concept of the keyword is not modeled here. So how does this type of learning to rank model actually score the document?
>>: I'm sorry. What do you mean by keywords here?
>> Mianwei Zhou: Actually, by keywords I mean -- maybe "words" is actually the better term. It means the words in the query, each word in this query, and how this kind of word appears in
the document.
>>: Why does it not appear in the document?
>> Mianwei Zhou: Why it's not appearing --
>>: You said the keyword is not --
>> Mianwei Zhou: So the keyword does not appear in the abstraction, this abstraction. In this model abstraction, we already have the query and the document. We don't model the word, the concept of a word, here. Yes.
>>: So I can say that you can model that in a feature function, right?
>> Mianwei Zhou: Yes, so that's my point here. So in order to make this kind of system work,
they need to use this kind of feature, BM25, Language Model. They need to -- in this feature
design, they need to characterize how the document contains keywords from the query. So,
actually, they require the designer to fulfill the principle in the feature design instead of learning
that. And as we have seen before, for some scenarios, manually designing this kind of feature -- manually fulfilling the principle -- is difficult, as I claimed before, because the query is complex and the document is also complex, so manually designing these features is difficult. So this kind of traditional learning to rank framework simply leaves this burden to the feature
designers. So it's not good. So that's why I claimed that they didn't learn to fulfill the principle
by simply leaving it to the designers. So the purpose of our work here is to facilitate the feature design for people, and the first important concept we propose is called the feature decoupling idea. The key point is that in traditional feature design, people need to describe how a document contains keywords that are important for the query. And we say it's difficult, because in this feature design, people need to address these two questions together. Our intuition is that, actually, we can decouple this feature design into two types of more elementary features, such that one type of feature addresses one question at a time. This kind of feature becomes much easier to design. So this is the key intuition of our work. We try to decouple the features into two directions: one is the meta-features, characterizing which keywords are important, and the other is the intra-features, characterizing how the keywords are contained. Yes. So here are some example features. For our entity-centric document filtering, an example feature can be, for example, a general feature like IDF
can be used as a meta-feature on the query side, and whether this keyword is a noun or a verb, whether it is mentioned in the entity or not; and you also have some structural features, like the position of this keyword in the page, or whether it is mentioned in the infobox. These are all meta-features, the query-side features. Similarly, we can define the intra-features: how the keywords appear inside the document. So, for example, different positions, in the URL or in the title, and different representations besides simply counting the term frequency -- we can have a log-scaled TF or normalized TF. We have different kinds of term frequency representations characterizing how the term appears in the document. So these are the example features we designed for our application. And because we decouple the features, we will have a new learning to rank framework. So -- yes?
>>: So essentially, the left-hand feature is a query-level feature, which --
>> Mianwei Zhou: Query-level feature.
>>: The right-hand side is query URL features.
>> Mianwei Zhou: Document features.
>>: Query document features.
>> Mianwei Zhou: Keyword document features, I think.
>>: So you can also put BM25 and all these features in the intro feature, as well?
>> Mianwei Zhou: For BM25, I think you have already combined the keywords inside the query. BM25 is a summation over all the keywords inside the query.
>>: I'm saying that you can also put it in, right?
>> Mianwei Zhou: No, I think -- so the meta-feature will be f(w, q), for the keyword and the query, and then BM25 is like a feature defined on the document and the query. So this BM25 will correspond to the traditional learning to rank features.
>>: Yes, but for each of the words, you can treat it as a document per return and sum over all the
terms. It's going to be a BM25, right?
>> Mianwei Zhou: Yes, you can say that, but BM25 usually is a sum of all of our different
keywords in the query.
>>: All the keywords?
>> Mianwei Zhou: Yes, all the keywords. Usually, this is how BM25 is designed. So it's just
like this one is designed over one single keyword.
>>: Oh, I understand this part. This is a query-level feature, model.
>> Mianwei Zhou: This is also a keyword part. You also define it over one keyword at a time.
>>: I think what they mentioned is that the right part is only document level, right?
>> Mianwei Zhou: No, no [indiscernible], only the keyword document.
>>: For the document, yes.
>> Mianwei Zhou: Yes, and this part is like keyword for the query.
>>: Can I assume you faced no overlap between -- there is no keyword overlap between left part
and right part. You can learn [indiscernible]. You can learn [indiscernible].
>> Mianwei Zhou: So usually, when we are doing an information-retrieval task, we will see
how a document contains keywords only from the query, so if a keyword isn't mentioned in the query, usually we don't count it. So, yes, basically, we just look at the overlapping keywords between the query part and the document part. Yes. So the basic idea on this slide is that we try to decouple the feature design, and then we will have a new learning to rank problem. This setting is still like traditional learning to rank, where you have some query and relevant and irrelevant documents for this query, and then, given a new query entity, also characterized by a Wikipedia page, we want to judge whether a document is relevant or irrelevant using the learned model. This is pretty much similar to traditional learning to rank. Yes.
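A minimal sketch of what the decoupled input might look like in code; the feature names are examples taken from the talk (IDF, part of speech, page position on the query side; TF variants and document fields on the document side), and the exact representation in the paper may differ.

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class DecoupledInstance:
        """One training instance: an (entity page, document) pair described
        by two feature sets instead of a single joint feature vector."""
        # Meta-features: one vector per overlapping keyword, computed from the
        # query page only (e.g. IDF, is_noun, in_title, in_infobox).
        meta: Dict[str, List[float]]
        # Intra-features: one vector per overlapping keyword, computed from the
        # target document only (e.g. TF, log TF, in_title, in_url).
        intra: Dict[str, List[float]]
        # Optional document-level features (PageRank, document length, ...).
        doc: List[float]
        # Relevance label for the pair.
        label: int

    example = DecoupledInstance(
        meta={"microsoft": [7.1, 1.0, 1.0, 1.0], "the": [0.2, 0.0, 0.0, 0.0]},
        intra={"microsoft": [3.0, 1.1, 1.0, 0.0], "the": [12.0, 2.5, 0.0, 0.0]},
        doc=[0.4, 532.0],
        label=1,
    )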
>>: So is one assumption that you have [indiscernible] intention, so is the keyword that exists
for the query independent of the documents or not?
>> Mianwei Zhou: Independent of the documents?
>>: For the page Bill Gates, the keyword is always Bill Gates and Michael, so independent of
the documents, already what you are going to -- by learning to rank keyword depends on even
document relations?
>> Mianwei Zhou: Actually, no. Given the document, I check which keywords mentioned in this document also appear in the query -- they are the overlapping keywords. Given the query and the document, I check their overlapping keywords.
>>: Okay.
>> Mianwei Zhou: Yes. But yes, definitely, we can extend it. For example, given a query, we
can extend it using synonyms or other techniques to find more keywords related to this query. We can also do that. But as I said, because of the decoupled features, we will have a new learning to rank framework: this time, the query and document pair is described by two types of features, instead of one feature function like before. So how to solve this kind of decoupled-feature-based learning to rank problem will be the focus of our work. Since the features are decoupled, we need to recover them back, so let's review the document scoring principle: the relevance of a document depends on how it contains keywords that are important for the query. Mathematically, we can translate it here: the relevance of a document to a query becomes a summation of the contributions of each keyword for this document and this query. Yes, this will be the keyword contribution for the query and the document at the same time. And what I mean by recovering is: how can we define this contribution function based on the meta-features and the intra-features? This will be the first requirement in our learning to rank framework. And
then a second requirement that this kind of function should be noise aware, so as I mentioned
before, the key challenge is that our query is very noisy, so this contribution function should be
aware of those noisy keywords. So how can we fulfill these two requirements? The second concept we propose is called the inferred sparsity idea. The idea is very straightforward. Because the query is long, there will be a lot of overlapping keywords, and lots of those keywords are noisy -- they may not be related to this query. And because of the existence of these noisy keywords, if we just assign some low value to them, their scores might accumulate, and as a result it will affect the final scoring accuracy. Because of this, we require that for those noisy keywords, the contribution function should be equal to zero. We call this inferred sparsity.
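In symbols (the notation here is mine, not from the slides), the decoupled scoring principle with the inferred-sparsity requirement reads:

    s(q, d) \;=\; \sum_{w \,\in\, q \cap d} \phi(w, q, d),
    \qquad \phi(w, q, d) = 0 \ \text{ whenever } w \text{ is a noisy keyword of } q,

where the contribution \phi has to be computed from the meta-feature vector of w (query side) and the intra-feature vector of w in d (document side), rather than being a free per-keyword weight.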
>>: How do you define if it's a noisy keyword?
>> Mianwei Zhou: Yes, this is a good question. Actually, I will mention that. Because we don't
have keyword labels saying whether a keyword is noisy or not, right, this is simply a key challenge I need to handle in the later solution. I will introduce it very soon. First, let me explain why this is called inferred sparsity. It's pretty related to traditional sparse learning. In sparse learning, in order to select features and improve the prediction accuracy, people require the feature weightings to be very sparse: only the important features will have nonzero weights, while the unimportant ones should have zero weight. And you see, right here, we want the contribution function to be nonzero only for those important keywords, so they're actually pretty related. But they're also somewhat different. In traditional sparse learning, the alpha here is a feature weighting; it's a free parameter, so they can make it sparse by using techniques such as L1 regularization. Here, the target we want to sparsify is a function instead of a free parameter. This function is defined based on our decoupled meta-features and intra-features. So the traditional technique doesn't work well in our
scenario. To achieve this inferred sparsity, the key idea is that we use a two-layer scoring model. Basically, I try to learn a keyword classifier to judge whether a keyword is important or not. Here, C denotes the keyword classifier: if it equals zero, it means the keyword is a noisy keyword, and if the keyword is important, it will have some nonzero value. And based on the output of this keyword classifier, I further determine the keyword's contribution based on how it appears in the target document. But designing this model is not that easy. As you have mentioned, we don't have keyword labels here, so it's not easy to learn the keyword classifier, and also, how can we design the features to determine these two aspects? So how can we solve it? Our solution is to use a graphical model, a restricted Boltzmann machine style model, to handle that. The general idea is like this. Given the document, we use the random variable Y to denote whether this document is relevant or not. The document contains a lot of keywords, right, so for each keyword, we also have a variable denoting whether this keyword is an important keyword for this query or not: if it takes a nonzero value, that means it's important; if it equals zero, that means this keyword is a noisy keyword. And then we have three types of factors to determine how to define the keyword classifier and what the keyword contribution should look like. First, we have the keyword classifier, and we use this factor to represent it. According to our definition of the meta-features -- we use meta-features to characterize whether a keyword is important for the query or not -- we use this factor to incorporate that information. And this part represents the contribution of a keyword to the document: if this is an important keyword, then we further look at how it appears in the target document, so we measure it here; and if it's not an important keyword, then this keyword should be eliminated, so it should not contribute any score to the final document relevance, so we have zero here. So basically, it's an [indiscernible] line, and we also have document factors. In this part, we can incorporate traditional learning to rank features like PageRank and document length, or even BM25 and the Vector Space model -- we can have these traditional features combined in these factors. And then our algorithm just tries to learn this model. This design actually handles the lack of keyword labels, because we can simply view this part as a hidden variable, and then, based on the document labels, we are able to learn which keywords should be important, as well as how they should contribute to the final document relevance score. And the key advantage of this model is that it's actually a tree structure, so when we try to optimize this model, we can use belief propagation to do it efficiently. So this is the key idea. Yes.
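A highly simplified sketch of that two-layer idea, reusing the DecoupledInstance example from earlier. A deterministic sigmoid gate stands in for the hidden keyword-importance variables; the actual model in the paper is a restricted-Boltzmann-machine-style graphical model in which noisy keywords are forced to contribute exactly zero and which is trained with belief propagation.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def dot(weights, features):
        return sum(w * f for w, f in zip(weights, features))

    def score_document(inst, w_meta, w_intra, w_doc):
        """Two-layer scoring over decoupled features.
        Layer 1: a keyword classifier on meta-features gates each keyword,
        so noisy keywords get a gate near 0 and contribute almost nothing
        (an approximation of the exact zeros in the graphical model).
        Layer 2: gated keywords contribute according to their intra-features."""
        total = dot(w_doc, inst.doc)                 # document-level factor
        for kw, meta_vec in inst.meta.items():
            gate = sigmoid(dot(w_meta, meta_vec))    # importance of this keyword
            contribution = dot(w_intra, inst.intra[kw])
            total += gate * contribution
        return total                                 # higher = more likely relevant

    relevance = score_document(example,
                               w_meta=[0.9, 0.5, 1.2, 1.0],
                               w_intra=[0.3, 0.2, 0.8, 0.4],
                               w_doc=[0.1, 0.0])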
So, to make a summary: in our technique, there are two important ideas. The first one is the feature decoupling idea -- decouple the feature design so that people can more easily design the features -- and the second is inferred sparsity -- we can filter out those noisy keywords and only use the important ones. And then we say, actually, this kind of framework is
a very general framework that can be applied to other applications. Take, for example, a recommendation system that tries to recommend some items for users. We can have a similar principle here, call it the item ranking principle, because we can connect the user and the item by their attributes. Usually, when we try to measure whether an item is good or not for a user, we will see how this item matches the attributes that the user is interested in, right? So they're connected by the attributes. So the user here becomes the query, the attributes become the keywords from before, and the items become the documents. Similarly, we can have a lot of meta-features and intra-features to characterize these two links. With meta-features, we can characterize how much the user likes these attributes -- for example, some long-term preferences, some short-term preferences, how he or she has rated things -- different kinds of features to characterize how he or she likes these different attributes. And for the intra-features, we can similarly have intra-features for the item attributes -- for example, do we have the exact value of the attribute, or do we just have some estimated information, or is this value missing? We can characterize these things as feature vectors. And I also investigated applying my work to the domain adaptation
framework. So here is one application. We have a lot of reviews coming from some domain -- lots of reviews, positive or negative, for example, from the book domain -- and we try to learn a model which can be adapted to the kitchen appliance domain. Yes, and this is a domain adaptation problem, because the different domains use different keywords. A book review will use keywords such as "interesting" or "boring", while a kitchen appliance review will use words like "high quality", "leaking", "broken". They use different keywords. So how can we match them? Previously, for this application, people proposed an algorithm called structural correspondence learning.
The idea is to try to match these keywords based on their mutual information -- their correlation with common keywords such as "good" or "bad". Their intuition is that keywords with similar correlations will have similar sentiment. And then they proposed their solution. Well, what I want to say here is that their solution is pretty specific to the application. We can say this is actually just meta-features -- they can be viewed as meta-features in our framework -- and similarly, we can have intra-features characterizing how each keyword appears in the document, so we can apply our framework to this kind of scenario as well. So in terms of
experiment, in my paper recently submitted to KDD, I evaluate our framework in two different types of applications. One is the entity-centric document filtering, and the second one is the domain adaptation task. Basically, it uses a data set, tries to learn the model from some domain, and then tries to adapt it to a different domain. And then we compare our solution with many different kinds of baselines. First, the standard learning algorithm, basically just using the traditional learning to rank framework, but with some features we designed for it. For example, for entity-centric document filtering, I use a lot of learning to rank features, BM25, the Vector Space model, and then compare. Basically, there is LETOR, the Learning to Rank data set from MSR; on that page, there are a lot of features designed, so I just followed that feature design to build a lot of traditional learning to rank features and then built this baseline. The second one is where I simply combine the meta-features and the intra-features by multiplying them, much like TF-IDF simply multiplies TF and IDF. For this baseline, I don't use our graphical model solution; I directly multiply the two types of features, and the disadvantage is that it doesn't fulfill the inferred sparsity requirement -- it will introduce a lot of noisy contributions, because I simply multiply them together. And this solution is a boosting framework that I proposed in my previous paper; my graphical model actually extends this model and outperforms it. And this algorithm is specifically designed for domain adaptation; we compare the performance with this existing work.
>>: So as far as I know, the [indiscernible] baseline is quite low, already. People already did a
lot of experiments on this data set. So did you try to compare to a more advanced baseline,
like the neural network one?
>> Mianwei Zhou: No.
>>: The [indiscernible] and then the other baseline?
>> Mianwei Zhou: Oh, okay, this part I haven't considered, because in this paper, especially, I
just wanted to show the possibility of applying this approach to domain adaptation. So I didn't -- yes, that part, I haven't looked at that. Maybe in a later experiment, I can try to use stronger baselines for that. Yes, but I include the comparison with the domain adaptation work just to say that it can be applied to that scenario using a more general framework. So
this is the experimental result for our framework. Yes, so in this work, we are studying a querying by entities problem, the entity-centric document filtering problem, and we proposed a general framework to solve it. As for future work: so far, we have studied querying by entities and querying for entities. We will try to move one step forward and study their overlapping part. And also, we want to try to connect it back to the traditional
information retrieval problem. This will be the direction of our future work. So first, for the
querying by and for entities, the overlapping part, we are studying the relational entity search problem. Basically, the idea is that given some entity, we try to find other entities that are related to it, according to their attributes. For example, given Bill Gates, we can find his wife, his education background, or the companies he founded -- different kinds of attributes. Right now, this kind of project has been realized in industry: for example, Google has the Knowledge Graph, and Microsoft has [indiscernible], this kind of knowledge graph idea. However, this kind of output is supported by a back-end knowledge graph, so if the entity is not that popular and is not stored in the knowledge graph, then it cannot be shown in the result. So in our future work, what we are studying is to build a lot of relational searchers for that. We try to do online relation extraction for this kind of input query. For example, we will learn a lot of searchers: one searcher specifically finds the wife of a person, another the education. We have different relations specifically designed for this person, for this type of entity. And then we can try to rank the entities according to these returned results -- yes, based on how
these kinds of entities appear in web documents. As for the current progress, we have learned some models that try to learn these kinds of rankers based on some data from a knowledge base -- for example, Freebase: we extract a lot of relational data there and then use it to learn these searchers, and this has already been published. However, there are still a lot of unsolved problems, and this is what we are working on right now. For example, in the previous work, we only worked on one relation at a time, so right now we are trying to figure out how we can efficiently search multiple relations together, to improve both efficiency and accuracy, and also how we can resolve entity ambiguity: the input query can belong to different persons, so how can we search their relations to discriminate between the different entities? So this will be the first direction of our relational entity search work. And
the other future work I am working on right now is about entity-intent-based passage scoring. Basically, we try to better understand the entity concept hidden inside the query, and we also want to search for passages that support this kind of question answering. So the problem is how we can rank passages according to the entity intent hidden inside the query; this is also a problem I'm working on right now. Yes, so this is my work. I mainly focus on entity-centric search, and I also have some other related work on citation prediction and social annotation. So if you're interested in those other directions, we can also discuss them offline. Yes, thank you.
>>: We have time for questions. I guess a lot of you are probably going to want to have more in-depth discussions when you are meeting with the candidate.
But if there is any question, quick question?
>>: How do you envision this in a product? For example, it's [indiscernible].
>> Mianwei Zhou: Which work, actually?
>>: On this querying by entities, for entities.
>> Mianwei Zhou: You mean, how to evaluate that?
>>: Not evaluate it, but how do you envision it working in a product like [indiscernible]?
>> Mianwei Zhou: Okay. So, for example, for the first piece of work, I don't think I propose a new problem. I just propose a new platform, a back-end platform, that can be used to help different kinds of querying for entities applications. So, for example, when we design different kinds of projects, we can easily just write some simple queries to fulfill a lot of operations that were difficult before. Yes, so this can be used as a back-end supporter for different kinds of applications, online applications.
>>: Applications which we can use [indiscernible]?
>> Mianwei Zhou: Yes, yes. So, for example, actually, one example application can be simply
the entity search, where we can still use a keyword query, but it can be translated into our designed query language and then return a result. So this is built upon our algorithm -- yes, our platform.
>>: Why would it come up with 100,000?
>> Mianwei Zhou: I just ranked the results. The second one may not be good; the top one is most likely correct, because it is mentioned more often with "inhabitants", based on these kinds of patterns I designed.
>>: So why China is? Oh, okay.
>> Mianwei Zhou: I think it's pretty close. It [indiscernible]. It mentioned multiple cities, so there are a lot of errors, I guess, based on some of the entity extraction. We can design some
applications on top of our general platform. And as for our second piece of work, I think it can be very useful, as I said, for those kinds of motivating scenarios I mentioned here -- trying to find documents relevant to an entity. It's also easy to get the homepage or descriptive page and some text data for an entity, so it can be used to improve the accuracy for that part.
>>: All right, so if there are no further questions, then let's thank the speaker.