>> Evelyne Viegas: Hello, everybody. Thank you for joining us this afternoon. So this is an internship presentation. And I know that Eric -- I actually promised I would do it with my French accent -- Eric Rozell has a slide that talks about himself, so I'll just let you go. >> Eric Rozell: Thank you very much. As she said, I'm Eric Rozell. In my internship project, we worked with a technology out of Microsoft Research Asia called Probase in order to do semantic analysis. By semantic analysis we mean trying to pull the meaning or the concepts out of text. And in order to evaluate that system, we applied it to two different applications, recommendation and document clustering. So the title of my talk is applying semantic analysis to content-based recommendation and document clustering. Before I get into that, just a little bit about myself. The picture in the bottom right corner is from my favorite intern event that we had here, which was trampoline dodge ball. So if you ever get a chance to play dodge ball on a trampoline, I really recommend it. I'm a graduate student at RPI. I work as a research assistant with the Tetherless World Constellation, where my advisor is Professor Peter Fox, and my research focus is in semantic e-science. If you need to contact me after I leave, my contact information is there. To give you a quick overview of what I'm going to talk about today: I'll talk a little bit about the background and the scope of the problem, I'll get into what semantic analysis is and the different techniques that we used, I'll talk about the recommendation experiment and also the clustering experiment, and then go through some of the conclusions that we derived from our results. So just a little bit of background. As most of you are probably aware, there are a growing number of documents on the Web, on the order of billions at this point. Much of the data on the Web right now is in fact in semi-structured format, especially with the advent of Web 2.0 technologies, things like folksonomies and microformats. However, most of the knowledge on the Web still remains in unstructured text. That being said, there are quite a few techniques in NLP, natural language processing, for pulling the signal from the noise, if you will, and trying to generate the meaning behind this text: things like ontology extraction, topic extraction, named entity disambiguation and, of course, semantic analysis, which we're going to talk about today. And our intuition was that some techniques might be better than others in terms of the various information retrieval tasks that you can apply them to, whether you're applying them to document clustering or recommendation or query expansion. So I wanted to talk a little bit about Probase, which was our motivation for tackling this problem. Probase, as I said before, was developed at Microsoft Research Asia. It's a probabilistic knowledge base generated from the Bing index, the Bing query logs and other sources like Freebase, Wikipedia, tables on the Web, et cetera. Basically how it works is it uses text mining patterns, namely Hearst patterns. So when the system encounters plain text like "artists such as Picasso," the system has evidence that Picasso is an artist, or that there's this hypernym relationship between artist and Picasso. This is just a demonstration of the concepts that are actually found in Probase.
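A minimal sketch of the Hearst-pattern idea described above, assuming plain-text input. The two regular expressions and the function name are illustrative stand-ins, not Probase's actual pipeline, which runs over the Bing index, restricts matches to noun phrases, and aggregates the evidence probabilistically.

```python
import re

# Illustrative Hearst patterns; real systems constrain these to noun phrases.
HEARST_PATTERNS = [
    # "artists such as Picasso, Monet"
    re.compile(r"(?P<concept>[\w ]+?)\s+such as\s+(?P<instances>[\w ,]+)"),
    # "Picasso and other artists"
    re.compile(r"(?P<instances>[\w ,]+?)\s+and other\s+(?P<concept>[\w ]+)"),
]

def extract_isa_pairs(sentence):
    """Return (instance, concept) evidence pairs found in one sentence."""
    pairs = []
    for pattern in HEARST_PATTERNS:
        for match in pattern.finditer(sentence):
            concept = match.group("concept").strip().lower()
            for instance in match.group("instances").split(","):
                instance = instance.strip().lower()
                if instance:
                    pairs.append((instance, concept))
    return pairs

print(extract_isa_pairs("artists such as Picasso, Monet"))
# [('picasso', 'artists'), ('monet', 'artists')]
```

Each extracted pair is only one piece of evidence; aggregated over billions of sentences, the counts become the probabilistic is-a edges the speaker describes.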
Some of the resources that are already out there, large knowledge bases like Freebase or DBpedia, have very good coverage over a relatively small number of broad concepts, maybe on the order of tens of thousands. So they have full coverage of things like countries and presidents, but what Probase is really good at is capturing the long tail of more obscure, more fine-grained concepts, things like late 18th century writers. And given that long tail of concepts, the system is very capable of conceptualizing groups of entities and finding the most specific or the most relevant concept in various different scenarios. So if you feed the system three countries like China, India and the United States, Probase would tell you you're most likely talking about countries. But if you fed it the BRIC countries, Brazil, Russia, India and China, it would likely tell you you're dealing with emerging markets, because these concepts show up together in articles about BRIC countries. The other thing about Probase is that it differentiates between entities and attributes. An entity comes from the is-a relationship, the hypernym relationship that we discussed before, where something like a birthday is an occasion or is a party; but you can also capture the attributes of different concepts. The system would also use patterns to recognize that a birthday could be an attribute of a person or a politician or a celebrity. There's a variety of different applications that have already been built around Probase by the MSRA group, and the one that we're really focused on today is this top application. They developed a technique called short text conceptualization, ran that algorithm over a corpus of tweets that they collected, clustered the tweets using the concepts generated by Probase, and checked the correlation of the clusters with the hashtags they used in their initial collection process. So given that we have this resource at MSRA, we had a bunch of research questions that we thought we could address using it. Number one, what's the best way of extracting concepts from text? One way to answer that is to compare different techniques for abstract analysis. Number two, how are abstracted concepts useful? What we did was generate data about where these semantic analysis techniques are most applicable in information retrieval applications. A more specific question that we asked is: are user ratings affected by the concepts in the descriptions of media items such as movies? And so we tested semantic analysis techniques in recommender systems. And then, and this is a broader question that we hope to address in future research, how useful are these Web-scale knowledge bases in a narrower domain for information retrieval? Probase was generated at Web scale; it takes all of the documents from the Bing index to generate the concepts. But these kinds of things are a lot noisier than might be required for a narrower domain application. So I've talked a little bit about the background; now I'm going to talk about semantic analysis and the different techniques we used. As I said before, semantic analysis, to put it simply, is just generating the meaning from natural language, and specifically the task that we were trying to address is generating hypernyms from unstructured text.
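A toy illustration of the conceptualization behavior described above. The P(concept | entity) numbers below are invented purely so that {China, India, United States} resolves to country while the BRIC set resolves to emerging market; the real system scores candidate concepts with a naive Bayes model over Probase's actual statistics rather than this crude product-and-filter.

```python
from collections import defaultdict

# Invented P(concept | entity) scores standing in for Probase lookups.
CONCEPTS = {
    "china":         {"country": 0.4, "emerging market": 0.5},
    "india":         {"country": 0.4, "emerging market": 0.5},
    "united states": {"country": 0.8, "superpower": 0.2},
    "brazil":        {"country": 0.3, "emerging market": 0.6},
    "russia":        {"country": 0.3, "emerging market": 0.6},
}

def conceptualize(entities):
    """Rank candidate concepts shared by every entity, by product of scores."""
    scores = defaultdict(lambda: 1.0)
    for entity in entities:
        for concept, prob in CONCEPTS.get(entity, {}).items():
            scores[concept] *= prob
    shared = [c for c in scores
              if all(c in CONCEPTS.get(e, {}) for e in entities)]
    return sorted(shared, key=lambda c: scores[c], reverse=True)

print(conceptualize(["china", "india", "united states"]))
# ['country']
print(conceptualize(["brazil", "russia", "india", "china"]))
# ['emerging market', 'country']  -- emerging market ranks first
```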
So an example is if you see an article with the terms Apple, IBM and Microsoft, then you might want to infer that this article is about technology companies or IT firms. There are different approaches to semantic analysis. One set of approaches uses an external knowledge base, such as the conceptualization technique from MSRA. There's also a technique called explicit semantic analysis, which is based on Wikipedia, and another technique based on WordNet synsets, or groups of synonyms that occur together in WordNet; those all use external knowledge. The other half is semantic analysis that uses the latent or probabilistic features generated from the text alone, without using external knowledge, and two examples of that are latent Dirichlet allocation and latent semantic analysis. So now that I've introduced the semantic analysis techniques, I'm going to talk about each of the algorithms that were italicized on the previous slide in more detail. The first resource that we wanted to use was Probase. This is a variation on the technique that was used in the short text conceptualization algorithm from MSRA. Basically you start with a document corpus. For each document you split it into the words or the tokens in that document. You identify the phrases in that document that occur in Probase; basically you just run through the document and find the longest phrases, and I'll show an example of that in the next slide. You feed those to Probase, and then for each of these terms, because you're feeding them individually, you're generating a set of concepts for those terms. So when you send the term China, it sends back a probability distribution over things like country and emerging market and all those things. So basically, using these terms that we've identified in the text, we can generate this matrix of concepts. But then what we want to do is take this matrix of concepts and reduce it into a single feature vector for the entire document that can be used later in recommender systems and clustering. The technique used at MSRA was a naive Bayes model with some Laplace smoothing. That worked great for their application because they were generating these features over tweets, which are rather short; they're limited to 140 characters. But when you're looking at longer texts like news articles or descriptions of movies, this smoothing and trying to reduce this gigantic matrix into a vector doesn't work out as well, and the probabilities that you end up with are extremely small. So instead of using the more sophisticated naive Bayes model, we just did a simple summation, and that was based on previous work in another semantic analysis technique, explicit semantic analysis, which I'll talk about more in one second. At this point we've generated the concept vectors for each document. Now, with some of the concepts in Probase, especially when we're doing the simple summation, there's a tendency toward a bias for the more general concepts; you get extremely general things like "word" as a concept. So we want to filter out these sorts of generic terms, and two well-accepted ways of doing that are to weight by inverse document frequency and to do some simple filtering to get rid of the nondiscriminative features. If a concept shows up in more than half of the corpus, then it's not going to be helpful for us in an information retrieval application, and if it shows up in only a very few documents then, again, it's not going to help us.
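A rough sketch of the document pipeline just described: greedy longest-phrase matching against the knowledge base, a per-term concept lookup, simple summation into one document vector, and IDF weighting to push down overly general concepts. The LOOKUP table is a tiny hypothetical stand-in for a real Probase query.

```python
import math
from collections import Counter

# Hypothetical stand-in for Probase: term -> {concept: probability}.
LOOKUP = {
    "tom hanks": {"actor": 0.8, "celebrity": 0.2},
    "buzz lightyear": {"lovable toy story character": 0.6, "toy": 0.4},
}

def longest_phrases(tokens, known_terms, max_len=3):
    """Greedily match the longest phrases that occur in the knowledge base."""
    phrases, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in known_terms:
                phrases.append(candidate)
                i += n
                break
        else:
            i += 1
    return phrases

def concept_vector(tokens, known_terms):
    """Sum the per-term concept distributions into one document vector."""
    vec = Counter()
    for phrase in longest_phrases(tokens, known_terms):
        for concept, prob in LOOKUP.get(phrase, {}).items():
            vec[concept] += prob
    return vec

def idf_weight(doc_vectors):
    """Down-weight concepts that appear in many documents in the corpus."""
    n = len(doc_vectors)
    df = Counter(c for vec in doc_vectors for c in vec)
    return [{c: w * math.log(n / df[c]) for c, w in vec.items()}
            for vec in doc_vectors]

# Example: concept_vector("Tom Hanks voices Woody".split(), set(LOOKUP))
# -> Counter({'actor': 0.8, 'celebrity': 0.2})
```

The document-frequency thresholds the speaker mentions (drop concepts appearing in more than half the corpus, or in almost none) would be applied on top of the df counts computed here.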
And then at the very end, from the document corpus, we end up with one of these vectors for each document. This is just an example to demonstrate how we're sending these articles to Probase. It's taken from IMDb; it's the beginning of a movie description for Toy Story. And you can see the example that I talked about before, where we find the longest phrases that are relevant; so rather than sending Tom and Hanks individually, we send the term Tom Hanks to Probase. This is an example of some of the concepts that came out using Probase, and it's really not what you might expect; we found this quite frequently for all the movies. But you do get some good concepts coming out, like lovable Toy Story character, which comes out of terms like Buzz Lightyear and Woody and shows up in the top ten. But the other ones don't seem to be very useful for Toy Story; DVD encryptions comes out because there's a character named RC in the movie. That being said, we still needed to evaluate whether this was going to work out well in our applications. So that's it for our Probase technique, but I'm going to present the other two techniques that we're evaluating against. One of them is explicit semantic analysis, and like I said before, this is based on Wikipedia. Essentially what the authors do is take Wikipedia and build an inverted index from it. They take all the words in all the different articles, and then, based on the term frequencies in each of those articles, they have a ranked set of articles for each word in this inverted index. And then it works similarly to the way the Probase system works: you feed it text and it tells you what the most likely relevant articles are. This image is from their paper. That's not my pointer. They went on to another application where they were comparing the semantic relatedness between two documents, but we just stopped at the point where we had the concept vectors, or the article vectors, from Wikipedia. I wanted to give a comparison of the sort of concepts being generated by Probase and ESA. You can see here that you automatically think, well, the Probase concepts look a lot better, and the reason is that even though ESA has recognized that the word "buzz" comes up a lot in Toy Story, the actual usage of buzz here, Buzz Lightyear, isn't even in the top 10 concepts. So that was the example for that. Then the last semantic analysis technique we used is latent Dirichlet allocation. This was developed in 2003; it's an unsupervised learning method. Essentially the way the model works is that each document has a distribution over topics and each topic has a distribution over words, and when you combine these two things together you can, quote/unquote, generate a corpus. But obviously, if you already have the corpus, then you can use an inference engine like Infer.NET to infer what the topic distributions are for these documents. So that's what we've done: we basically used the Infer.NET system to infer the document topic distributions and then used those as features for the corpus.
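The project used Infer.NET to infer the topic distributions; as a rough equivalent, here is a sketch of the same idea using gensim's LDA implementation on toy documents. The documents and topic count are illustrative; the point is only that each document's inferred topic distribution becomes its feature vector.

```python
from gensim import corpora, models

# Toy tokenized documents; the real corpora were movie synopses and newsgroup posts.
texts = [
    "space nasa launch orbit shuttle".split(),
    "hockey team game season goal".split(),
    "windows graphics driver video card".split(),
]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]

# Fit LDA, then read back each document's topic distribution as its features.
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
                      random_state=0, passes=10)
for bow in bow_corpus:
    print(lda.get_document_topics(bow, minimum_probability=0.0))
```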
So that's an overview of all the semantic analysis techniques that we used, and now I'm going to get into the actual evaluation. The first evaluation that I want to talk about is recommendation systems, and before I do that I want to talk in general about recommendation. Basically there are two primary approaches in the recommendation field: one is collaborative filtering, and the other is the content-based approach. For collaborative filtering, the Amazon shopping cart is maybe a good example: you see that customers who purchased some set of items also purchased another set of items, so if you haven't purchased those, then they should be recommended to you. In the content-based approach, you use features of the items being recommended themselves as the way of performing the recommendation; so the movie GoldenEye is actually similar to Mission: Impossible, and I'll show that in one second. Most modern-day systems take a hybrid approach, where they mix the collaborative and content-based approaches together. We're interested in content-based recommendation, because there's not a lot you can do with semantic analysis around the collaborative approach, and in particular we're interested in the unstructured item content rather than the structured content. Just as an example, structured item content is things like movie genre, where both GoldenEye and Mission: Impossible are action adventure thriller movies; we want to use the descriptions of those movies instead and try to figure out whether or not those can help in making recommendations. So this is just an example: some of the top terms that come out of doing a simple TF-IDF overlap for the two movie descriptions are helicopter, agent, infiltrate and CIA. We thought maybe the underlying concepts behind these words might be a better way of making the recommendations, and in fact some of the concepts that come out from using Probase are aircraft and intelligence agencies. So if you like movies about the CIA, you might also like movies about British intelligence or something. This is just a quick reminder that we're working on the unstructured content-based approaches in the recommendation field. And this is basically our experiment; it's really a simple view of it. We're pulling down these movie synopses from IMDb, running them through the semantic analysis techniques and generating features, plugging those features in as item information for the Matchbox recommendation platform, and also pulling down some movie ratings from MovieLens. We do some training and testing, try to predict the ratings in the testing set, and what we get is a mean absolute error, which is similar to the root mean squared error if you're familiar with the Netflix challenge. This is just a quick overview of the Matchbox system. What we did was generate features for the item model; that was where the semantic analysis features got plugged in. The way the system works is that it uses an expectation propagation algorithm, if you're familiar with that, iterates a certain number of times, and reduces each of these different components, the user model, the item model and the context model, into some number of latent features. Experimentally, we determined that the best number of latent features for our data was around 20. And you can stop me at any point if you have any questions; I forgot to say that in the beginning.
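To ground the GoldenEye / Mission: Impossible overlap example above, here is a small sketch of content-based similarity over TF-IDF vectors of abbreviated, made-up synopses; the actual experiment fed the feature vectors into Matchbox as item metadata rather than comparing them directly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Abbreviated stand-ins for the IMDb synopses used in the experiment.
goldeneye = "a rogue agent steals a helicopter and bond must infiltrate the operation"
mission_impossible = "a cia agent is framed and must infiltrate the agency to find a mole"
toy_story = "a cowboy doll feels threatened when a new spaceman action figure arrives"

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform([goldeneye, mission_impossible, toy_story])

sims = cosine_similarity(X)
print(sims[0, 1])  # GoldenEye vs. Mission: Impossible -- shares 'agent', 'infiltrate'
print(sims[0, 2])  # GoldenEye vs. Toy Story -- essentially no overlap
```

Swapping the TF-IDF vectors for Probase or ESA concept vectors gives the concept-level version of the same comparison.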
So this is our experimental data. We used, as I said before, the MovieLens dataset from the workshop on heterogeneous recommendation systems. The nice thing about this dataset was that it had mappings from the IDs in their data to IMDb IDs, so it was easy for us to pull down the movie synopses. It has over 800,000 ratings for over 10,000 movies from over 2,000 users. There wasn't a movie synopsis for every movie, so we actually collected around 2,600, leaving over 400,000 ratings from, luckily, all 2,000 of the users. The ratings were scored in half points from 0.5 to 5, so basically 10 values. And in order to test whether these features would work better in, say, a cold-start scenario where you don't have a lot of user data, or on basically the whole corpus, we performed the training and testing for 200 movies, a thousand movies, and the whole set. We trained on 90 percent of the ratings and tested on the remaining 10 percent. These were the features that we used. We had three different baselines: in one baseline we didn't add any item features whatsoever; in the second baseline we added movie genres, which you can consider to be a small amount of structured data from a limited vocabulary; and in the third baseline we used movie tags, which is a much larger vocabulary, and you can think of it like a folksonomy, where users are contributing semi-structured data for movies. And then, of course, we used all the semantic analysis techniques as features. These were the different training regimens that we used. In the top left case we trained on a subset of the ratings and tested only on ratings where the users and movies had never been seen before by the system. In the top right case, we tested only on movies that had never been seen before by the system, and in the bottom left, only on users that hadn't been seen. And then in the bottom right case we tested on a random distribution, but really what it was was that anything that was tested on had some other training data for both the user and the movie.
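One simple way to construct those four test regimens after a random 90/10 split, assuming a pandas DataFrame of (user_id, movie_id, rating) rows; the actual experimental protocol may have differed in how the held-out sets were drawn.

```python
import pandas as pd

# Hypothetical ratings file with columns: user_id, movie_id, rating (0.5-5.0).
ratings = pd.read_csv("ratings.csv")

test = ratings.sample(frac=0.10, random_state=0)   # 10 percent held out
train = ratings.drop(test.index)                   # remaining 90 percent

seen_users = set(train.user_id)
seen_movies = set(train.movie_id)
user_seen = test.user_id.isin(seen_users)
movie_seen = test.movie_id.isin(seen_movies)

# Bucket each held-out rating into one of the four regimens described above.
regimens = {
    "new user, new movie":   test[~user_seen & ~movie_seen],
    "seen user, new movie":  test[user_seen & ~movie_seen],
    "new user, seen movie":  test[~user_seen & movie_seen],
    "seen user, seen movie": test[user_seen & movie_seen],
}
for name, subset in regimens.items():
    print(name, len(subset))
```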
And so these are the results. I'm not going to talk through these curves, but I just wanted to show you that the results were extremely noisy, which was something we weren't expecting. We weren't expecting it because the Matchbox paper itself had some very nice convergence curves, and they were only using maybe twice the number of ratings we were; I think they were using around one million ratings. So even in the first baseline case, where we weren't adding any item features, we were still getting this really noisy curve. But what I am going to talk you through are the data tables for each of these, because it's much easier to see which technique won. This is the first testing set, which contained both users and movies that had not been seen in training. One way you can think about this is that the recommendations being made are based on the item features alone: they're based on the overlapping concepts, or on the overlapping genres in the baseline cases. What we found here is that a small amount of structured data, such as movie genres, is the most influential in this scenario where you have never seen an item before. The second case is the one where the testing set contained users that had not been seen before. There was an extensive amount of collaborative data available; there were a lot of user models to learn from for a particular movie before actually testing. And what we found here was that, given this extensive amount of collaborative data, any of the item features are really only marginally beneficial. Even the best case only beats the baseline with no item features by about one percent, and in some cases less than one percent. And this is the case where, similar to the first case, we tested on movies that had not been seen before, and we found the same result as for the first set: it was a small amount of structured data that really improved the recommendations. The only difference between this and the first set was that you had an extensive amount of data beforehand to train for the user. We also have this one result that we're pretty sure is an outlier, and we're still in the process of generating more results to test that. And then this is the last result, and this scenario is kind of similar to the second scenario, where you have an extensive amount of collaborative data, and again we found that these item features are really only marginally beneficial, on the order of one percent. And these are the results in general. I wanted to put this here because I wanted to talk about the fact that none of these semantic analysis techniques actually panned out for recommendation. I definitely don't think that's to say that these semantic analysis techniques aren't useful in general, because they are. What it really shows is that something noisy like a Web-scale generated knowledge base might not be useful in recommendation in particular. But there are other applications, like query expansion and document clustering, which we're about to talk about. So, document clustering is pretty simple: basically you want to automatically divide a set of documents into some specified number of groups. This is useful for a variety of different information retrieval tasks. You can automatically generate topics for search results to help users navigate in some search scenario, you can make recommendations for items that are similar to pages that are currently being visited, and you can also visualize the search space. We used a really simple approach because we were just testing semantic analysis. We used k-means: for those who aren't familiar, you start with some initial clusters, compute the means, reassign points based on minimum distance, and repeat until convergence. This is the experimental setup; again, it was really simple. We generated features using the semantic analysis techniques, randomly initialized the clusters ten times, ran k-means, and computed purity and adjusted Rand index scores, and then we were able to take the mean and standard deviation. The experimental data we used was a miniature version of the 20 Newsgroups dataset. It has around 2,000 messages from Usenet newsgroups across 20 groups, so that makes 100 messages per topic. We also filtered down to the messages' body text, because the headers of those messages had some discriminative information in them, including the actual name of the newsgroup. That was our source, and there's an example in the top right corner of one of the messages; that's a subset of it, because they're rather long.
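A small sketch of the clustering evaluation loop just described, using scikit-learn's k-means with random initialization in place of whatever implementation was actually used: ten runs from random starting points, then purity and adjusted Rand index summarized by mean and standard deviation. The feature matrix can be TF-IDF vectors, concept vectors, or a combination of the two.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def purity(true_labels, cluster_labels):
    """Fraction of documents falling in the majority true class of their cluster."""
    true_labels = np.asarray(true_labels)
    total = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        total += np.bincount(members).max()
    return total / len(true_labels)

def cluster_and_score(features, labels, k=20, runs=10):
    """Run k-means `runs` times and report mean/std of purity and ARI."""
    purities, aris = [], []
    for seed in range(runs):
        km = KMeans(n_clusters=k, init="random", n_init=1,
                    random_state=seed).fit(features)
        purities.append(purity(labels, km.labels_))
        aris.append(adjusted_rand_score(labels, km.labels_))
    return (np.mean(purities), np.std(purities),
            np.mean(aris), np.std(aris))
```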
These are our results. We're still working on getting the latent Dirichlet allocation results because they're actually still running, and we should have those by the end of next week, when my internship is finished. So if you'd like to contact me and you're really interested in how that pans out, feel free to do so. What we found was that the semantic analysis techniques alone weren't as good as just using something like TF-IDF, but when you combine the two together, when you use the actual document terms and you add the features from the semantic analysis, then you get a significant improvement; it was on the order of about 10 percent for Probase, and Probase did beat out the explicit semantic analysis technique. This confirmed the results from the MSRA group, who did tweet clustering, so those were short texts; they also found that Probase was able to improve clustering over ESA. And even though it did improve, the results were comparable, which was similar to one of the experiments that they ran, where there were only subtle differences between the clusterings they used. If you'd like, I can explain that during the questions, but I'm going to proceed. It's also similar to some clustering work done using WordNet in 2003. What they found was that by combining WordNet features with the actual text, they were able to get about a 10 percent improvement, which is similar to what we found. The difference here is that WordNet is a human-generated resource and Probase is an automatically generated resource. So these are some of the conclusions that we've drawn from these results at this point, and again, we're still waiting on some of the results to come through. Basically we found that semantic analysis features are only marginally beneficial in recommendation. We also found that movie genres were the features that performed best in the case where you're making recommendations for something you had never trained on or seen before, so a small amount of structured data from a limited vocabulary is the best approach in that scenario. We also found that the explicit and latent semantic analysis approaches were comparable, in movie recommendation at least; if you paid close attention to the results, that was just because each of the semantic analysis techniques we evaluated was pretty comparable in its marginal improvement over the baseline. We'd also like to think we've found that knowledge bases generated at Web scale are less useful for narrower domain tasks, though this would probably need confirmation in yet another domain, maybe another recommendation domain or maybe another narrow information retrieval application; we talk about that a little bit in the future work. We also just confirmed the efficacy of semantic analysis techniques, so we made sure that trying these sorts of things was actually a good idea. The features that we generated were somewhat useful in some tasks; they weren't necessarily useful in recommendation, but they confirmed some of the other results in document clustering. As for some of the future directions that we have, we'd like to do noise reduction for those examples that I showed you early on in semantic analysis, and there's a couple of different ways we can do that.
Maybe we can create an extension of the recommendation system to be fine-tuned for these semantic analysis concepts. There were a lot of different parameters that we could play with in actually generating the concepts, especially for Probase. You can change the number of concepts you create for each term. So if you feed in a term like Barack Obama, do you want just the number one concept, which is president, or do you want the top ten concepts, which include president, politician, Democrat, senator? You can vary how many of these come out, and there are other parameters to be explored. One last potential for noise reduction that we thought about after looking at the ESA results was doing some kind of hybrid of conceptualization and named entity disambiguation, because if you look at that buzz example, where there were ten different varieties of buzz and none of them was actually Buzz Lightyear, if we had identified beforehand that we were talking specifically about Buzz Lightyear, we could feed that into the semantic analysis techniques. And to further test whether or not Web-scale knowledge sources are useful in a narrow domain, we might want to try an information retrieval task where we have a domain-specific knowledge source and see whether the Web-scale resource really compares to the domain-specific knowledge source. This is just some further reading. And I wanted to thank the group at MSR Cambridge for helping me get set up with Matchbox, the group at MSR Asia for listening to my extensive e-mails while I was working with Probase, and a special thanks to my mentor for putting up with me for the summer, I guess. So thanks, Evelyne Viegas. And, of course, Microsoft Research Connections for allowing me to do my research on their time. [applause] >>: So you broke down how there's group analysis and content analysis and how most recommendation systems use a hybrid, and then you jumped into talking about your content analysis and showed us how you set up all those 90 percent/10 percent experiments and the results thereof, and from my understanding the analysis was purely on the content side, correct? >> Eric Rozell: No, actually, that was the reason for doing the different training regimens. So let me just go back to this. In these top two cases, where you're recommending based on new movies, the recommendations are being made on item features alone, so that was more of a content-based approach. Actually, the way Matchbox works, it's a hybrid approach, so it's taking both the collaborative features and the content-based features. But we also controlled for that, because the only things we were varying were the item features. So we were seeing how much we could improve the results by varying the different item features that were used, and we assumed that by getting the results for each of these different cases we could see which would be the best in a purely content-based scenario. >>: Okay. So what I was going to ask next, then, was: if you only tested content and you saw the various efficacies of semantic analysis there, I was wondering if perhaps when you combined content with collaborative there were any unique synergies that came out from your semantic analysis, and was that explored at all? >> Eric Rozell: Right.
So if you look at either of these bottom two scenarios, where you have an extensive amount of collaborative data available for the Matchbox system to consume, it does use that, because it's a hybrid at its core. And the results that we found were that in this scenario you really just can't improve on how good Matchbox is with the collaborative features. Any of the item features are only marginally beneficial, so, yeah, the best case here was using the tags from a folksonomy and the genres on the thousand-movie dataset, but it only marginally improved on using no item features whatsoever, just the collaborative approach. So, yeah, the first baseline is basically a purely collaborative approach, using no content features whatsoever, and we barely improved on that using any item features, especially with the semantic analysis. >>: But when you say analysis, are you going pure content or content plus collaborative? >> Eric Rozell: Content plus collaborative. Rather than implementing our own content features, we used Probase, and we tried to set up the controls so we could see which content approach was the best. >>: Where does Probase stand right now? What's the accuracy? >> Eric Rozell: I think -- in what task, I guess? >>: So in terms of this, the concepts -- for Obama you were getting president and those lists -- is there some kind of evaluation? >> Eric Rozell: So the evaluation that they're publishing in Hki [phonetic] in September -- I don't remember, actually, I don't have those numbers for you, but I can get them for you if you want to give me your e-mail afterwards, or they have a website also, which I should list at some point. >>: So what you required from Probase was the concepts for entities; then this could very well be taken from the categories, right? >> Eric Rozell: The what? >>: The categories. >> Eric Rozell: Yes. >>: It sounds familiar -- I assume that it's taken from the categories most of the time. >> Eric Rozell: Actually, no, most of the time for Probase it's based on these text patterns. So if it encounters something like presidents such as George Washington, then it uses that pattern to infer that George Washington is a president. >>: Is it likely that using some other knowledge base we could overcome this limitation that Probase has -- >> Eric Rozell: So using less probabilistic knowledge, is that what you're asking? Yeah, and I think that's one of the things we brought up at the end: using a combined named entity disambiguation and semantic analysis technique where we can inform the semantic analysis by using some semi-structured data. So first we can say, okay, this for sure is George Washington, and we know from DBpedia or Freebase that George Washington is a president, so we can rule out everything else or at least weight it very negatively. >>: Yago, have you heard of Yago? >> Eric Rozell: Yes. >>: So Yago stands at around 95 percent accuracy, quite clean in terms of these concepts and all that. >> Eric Rozell: Yeah, definitely, I'll look into it. And that was the other thing -- in the actual Probase literature they compare Probase against things like Yago and Freebase, and the value of Probase is they have over 120,000 entities and over three million groups of concepts. >>: 30,000 entities. >> Eric Rozell: Like the entities -- as in George Washington, or the instances of the different concept clusters. >>: Yago has ten million. >> Eric Rozell: Did I say 20,000?
I meant 20 million. >>: Yago was 20 million. >> Eric Rozell: How many classes in Yago? >>: Around 95. >> Eric Rozell: 95,000? >>: Yes. >> Eric Rozell: Okay. So Probase has around two million classes. So that's what they're really focusing on at this point. >> Evelyne Viegas: Any other questions? All right. Let's thank Eric again. Thank you very much. [applause]