>> Evelyne Viegas: Hello, everybody. Thank you for joining us this afternoon.
So this is an internship presentation. And I know that Eric -- I actually promised
I would do it with my French accent -- Eric Rozell, he has a slide that talks about
himself, so I'll just let you go.
>> Eric Rozell: Thank you very much. As she said, I'm Eric Rozell. In my
internship project, we worked with a technology out of Microsoft Research Asia
called Probase, in order to do semantic analysis. And basically by semantic
analysis we mean trying to pull the meaning or the concepts out of text. And in
order to evaluate that system, we applied it to two different applications,
recommendations and document clustering.
So the title of my talk is applying semantic analysis to content-based
recommendation and document clustering. So before I get into that, just a little
bit about myself. This bottom right corner picture is my favorite intern event that
we had here, which was trampoline dodge ball. So if you ever get a chance to
play dodge ball on a trampoline I really recommend that.
I'm a graduate student at RPI. I work as a research assistant with the Tetherless
World Constellation where my advisor is Professor Peter Fox. And my research
focus is in semantic e-science. And if you need to contact me after I leave, my
contact information is there.
So just to give you a quick overview of what I'm going to talk about today, I'll talk
a little bit about the background and the scope of the problem. I'll get into what
semantic analysis is and the different techniques that we used.
I'll talk about the recommendation experiment and also the clustering experiment
and then go through some of the conclusions that we derived from our results.
So just a little bit of background. As most of you are probably aware, there are a
growing number of documents on the Web, on the order of billions at this point.
Much of the data on the Web right now is in fact in semi-structured format,
especially with the advent of Web 2.0 technologies, things like folksonomies and
micro formats. However, most of the knowledge on the Web still remains in
unstructured text.
So that being said, there are quite a few techniques in NLP, natural
language processing, for pulling the signal from the noise, if you will, and trying to
generate the meaning behind this text -- things like ontology extraction, topic
extraction, named entity disambiguation and, of course, semantic analysis which
we're going to talk about today.
And our intuition was that some techniques might be better than others in terms
of the various information retrieval tasks that you can apply them to, whether
you're applying it to document clustering or recommendation or query expansion.
So I wanted to talk a little bit about Probase, which was our sort of motivation for
tackling this problem.
And Probase, as I said before, was developed at Microsoft Research Asia. It's a
probabilistic knowledge base generated from the Bing index, the Bing query logs
and other sources like Freebase, Wikipedia, tables on the Web, et cetera. And
basically how it works is it uses these text mining patterns, namely Hearst patterns.
So when the system encounters plain text like "artists such as Picasso," the
system has evidence that Picasso is an artist or there's this hypernym
relationship between artists and Picasso.
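A toy illustration of that kind of Hearst-pattern matching, in Python, covering only the single "such as" pattern (the regex and the crude head-noun heuristic here are illustrative and are not Probase's actual extraction pipeline):

    import re

    # Toy "X such as Y" Hearst pattern; a real system handles many more patterns
    # ("including", "especially", "and other", ...) plus noun-phrase chunking.
    SUCH_AS = re.compile(r"(\w[\w ]*?)\s+such as\s+([\w ]+(?:,\s*[\w ]+)*)")

    def extract_hypernyms(text):
        """Yield (instance, concept) pairs, e.g. ('Picasso', 'artists')."""
        for match in SUCH_AS.finditer(text):
            concept = match.group(1).strip().split()[-1]   # crude head noun
            for instance in re.split(r",| and ", match.group(2)):
                if instance.strip():
                    yield instance.strip(), concept

    print(list(extract_hypernyms("artists such as Picasso and Monet")))
    # [('Picasso', 'artists'), ('Monet', 'artists')]

Each match like this is only one piece of evidence; Probase aggregates many such observations into probabilities.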
This is just a demonstration of the concepts that are actually found in Probase.
So some of the large knowledge bases that are already out there, like Freebase or
DBpedia, have very good coverage over a small number of
concepts, maybe on the order of tens of thousands. So they have full coverage
of things like countries and presidents, but what Probase is really good at is
capturing this long tail of more obscure, more fine-grained concepts.
Things like late 18th century writers. And given that long tail of concepts,
the system is very capable of conceptualizing groups of entities and finding what
the most specific or the most relevant concept is in various different scenarios.
So if you feed the system three countries like China, India and the United States,
Probase would tell you you're most likely talking about countries. But if you fed
it the BRIC countries, Brazil, Russia, India and China, it would likely tell you
you're dealing with emerging markets, because these sort of things show up
together in articles -- well, these concepts show up together in articles about
BRIC countries.
And the other thing about Probase is that it differentiates between entities and
attributes. An entity is in the is-a relationship, the hypernym relationship that we
discussed before, where something like a birthday is an occasion or is a party, but
you can also capture the attributes of different concepts. The system would also
use patterns to recognize that a birthday could be an attribute of a person or a
politician or celebrity.
There's a variety of different applications that have been used or that have been
made surrounding Probase already, from the MSRA group. And the one that
we're really focused on today is this top application, where they developed a
technique called short text conceptualization, and they ran the short text
conceptualization algorithm over a corpus of tweets that they collected, clustered
them using the concepts generated by Probase, and checked the
correlation with the hashtags that they used in their initial collection process.
So given that we have this resource at MSRA, we had a bunch of research
questions that we thought we could address using this resource. So number
one, what's the best way of extracting concepts from text? And one way to do
that is to compare different techniques for abstract analysis. How are abstracted
concepts useful, and what we'd like to do and what we did was we generated
data about where these semantic analysis techniques are most applicable in
information retrieval applications.
A more specific question that we asked is are user ratings affected by the
concepts in the descriptions of media items such as movies. And so we tested
semantic analysis techniques in recommender systems. And then how useful -- this
is sort of a broader question that we hope to address in future research -- how
useful are these Web scale knowledge bases in a narrower
domain for information retrieval? So Probase was generated at Web scale. It
takes all of the documents from the Bing index to generate the concepts. But
these kinds of things are a lot noisier than might be required for a narrower
domain application. So I've talked a little bit about the background. I'm going to
talk about semantic analysis and different techniques we used. So as I said
before semantic analysis, to put it simply, is just to generate the meaning from
natural language. And specifically the task that we were trying to address is
generating hypernyms from unstructured text.
So an example is if you see an article with the terms Apple, IBM and Microsoft,
then you might want to generate or you might want to infer that this article is
about technology companies or IT firms.
So there are different approaches to semantic analysis. One set of approaches
uses an external knowledge base, such as the conceptualization technique from
MSRA. There's also a technique called explicit semantic analysis, which is based
on Wikipedia. And then there's another technique based on WordNet
synsets, or groups of synonyms that occur together in WordNet, all of which use
external knowledge. And then the other half of the approaches use the
latent features or the probabilistic features generated from the text alone, without
using external knowledge, and two examples of that are latent Dirichlet
allocation and latent semantic analysis.
So now that I've introduced the semantic analysis techniques, I'm going to talk
about each of the algorithms that were italicized on the previous slide in more
detail. And so the first resource that we wanted to use was Probase. And this is
a variation on the technique that was used in the short text conceptualization
algorithm from MSRA, and basically you start with a document corpus. For each
document you split it into the words or the tokens in that document. You identify
the phrases in that document that co-occur in Probase. So basically you just run
through the document and find long phrases, and I'll show an example of that in
the next slide.
Feed those to Probase, and then for each of these terms, because you're feeding
them individually, you're generating a set of concepts for those terms. So when
you send the term China, it sends back a probability distribution over
things like country and emerging market and all those things.
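As a rough sketch of that phrase-identification step, here is a greedy longest-match pass over the tokens, assuming a hypothetical known_phrases set of multi-word terms (the real system queries Probase itself rather than a local set):

    def identify_phrases(tokens, known_phrases, max_len=4):
        """Greedily pick the longest spans of tokens found in known_phrases.

        tokens: word tokens from one document
        known_phrases: lowercased multi-word terms assumed known to Probase
        """
        terms, i = [], 0
        while i < len(tokens):
            for length in range(max_len, 0, -1):            # try longest span first
                candidate = " ".join(tokens[i:i + length]).lower()
                if length == 1 or candidate in known_phrases:
                    terms.append(candidate)
                    i += length
                    break
        return terms

    # e.g. ["Tom", "Hanks", "stars"] -> ["tom hanks", "stars"]
    # when "tom hanks" is in known_phrases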
So basically using these terms that we've identified in the text we can generate
this matrix of concepts. But then what we want to do is take this matrix of
concepts and reduce it into a single feature vector for the entire document that
can be used later in recommender systems and clustering. And so the technique
used at MSRA was to use a naive Bayes model and some Laplacian smoothing.
That worked great for their application because they were generating these
features over tweets which are rather short. They're limited to 160 characters.
But when you're looking at longer texts like news articles or things like
descriptions of movies, this Laplacian smoothing and trying to reduce this
gigantic matrix into a vector, it doesn't work out as well, and the probability -- the
probabilities that you end up with are extremely small.
So what we did was, instead of using this more sophisticated naive
Bayes model, we just did a simple summation, and that was based on
previous work in another semantic analysis technique, explicit
semantic analysis. And I'll talk about that more in one second.
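A minimal sketch of that simple summation, assuming a hypothetical get_concepts(term) lookup that returns Probase's concept distribution for a single term:

    from collections import Counter

    def document_concept_vector(terms, get_concepts):
        """Sum per-term concept distributions into one document-level vector.

        terms: phrases identified in the document
        get_concepts: term -> {concept: probability}, assumed to wrap Probase
        """
        vector = Counter()
        for term in terms:
            for concept, prob in get_concepts(term).items():
                vector[concept] += prob       # plain summation, not naive Bayes
        return dict(vector)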
At this point we've generated the concept vectors for each document. And then
some of the concepts in Probase, especially when we're doing the simple
summation, tend to be biased towards the more general concepts.
You get extremely general things like "word" itself as a concept. And
so we want to filter out these sorts of generic terms.
And two well-accepted ways of doing that are to use inverse document
frequency, and then we also just do some simple filtering to get rid of the
nondiscriminative features. So if a concept shows up in more than half of the
corpus, then it's not going to be helpful for us in an information retrieval
application. And if it shows up in too few documents, then, again, it's not going to
help us. So at the very end, from the document corpus, we end up with a
vector for each document.
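Roughly, the IDF weighting and the document-frequency filtering might look like this (the thresholds are illustrative, not the exact values we used):

    import math

    def filter_and_weight(doc_vectors, max_df=0.5, min_df=2):
        """doc_vectors: one {concept: weight} dict per document."""
        n_docs = len(doc_vectors)
        df = {}                                    # document frequency per concept
        for vec in doc_vectors:
            for concept in vec:
                df[concept] = df.get(concept, 0) + 1

        # drop concepts that appear in more than half the corpus or in too few docs
        keep = {c for c, d in df.items() if min_df <= d <= max_df * n_docs}
        return [
            {c: w * math.log(n_docs / df[c]) for c, w in vec.items() if c in keep}
            for vec in doc_vectors
        ]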
This is just an example to show, to demonstrate how we're sending these articles
to Probase. This is taken from IMDB. It's the beginning of a movie description
for Toy Story. And you can see the example that I talked about before where we
find the longest phrases that are relevant. So rather than sending Tom and
Hanks individually we send the term Tom Hanks to Probase. This is an example
of some of the results, concepts that came out using Probase. And it's really not
what you might expect. And we found this quite frequently for all the movies.
But you do get some good concepts to come out. So like lovable Toy Story
character that comes out of terms like Buzz Lightyear and Woody. And that
shows up in the top ten. But the other ones don't seem to be very useful for
Toy Story -- like DVD encryptions, which comes out because there's a character named
RC in the movie. But that being said, we still needed to evaluate whether this
was going to work out well in our applications. So that's it for our Probase
technique, but I'm going to present the other two techniques that we're evaluating
against. One of which is explicit semantic analysis. And like I said before, this is
based on Wikipedia. Essentially what the authors do is take Wikipedia and build
an inverted index from it. They take all the words in all the different articles and
then based on the term frequencies in each of those articles, they have a ranked
set of articles for each word in this inverted index. And then the same thing
goes, I guess -- it works similarly to the way that the Probase system works. You
feed it text and then it tells you what the most likely relevant articles are. And so
this image is from their paper. That's not my pointer. And so they continued
another application where they were comparing the semantic relatedness
between two documents. But we just stopped at this point where we had the
concept vectors or the article vectors for Wikipedia. And I wanted to give a
comparison of the sort of concepts that were being generated between Probase
and ESA. So you can see here, automatically you think that, well, the Probase
concepts look a lot better. And the reason here is because even though they've
recognized that the word "buzz" comes up a lot in Toy Story, the actual sense of
buzz there, Buzz Lightyear, isn't even in the top 10 concepts. And so that was the
example for that. Then the last semantic analysis technique we used is latent
Dirichlet allocation. This was developed in 2003. It's an unsupervised learning
method. Essentially the way the model works is you have a distribution over
topics and a distribution over words. And when you combine these two things
together, then you can, quote/unquote, generate a corpus. But obviously if you
already have the corpus, then you can use an engine like Infer.NET to reverse-infer
what the topic distributions are over these documents. So that's what
we've done. We've basically used the Infer.NET system to infer the document
topic distributions and then use those as features for the corpus.
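We used Infer.NET for the inference itself; purely as an illustration, the same kind of document-topic features could be produced with an off-the-shelf LDA implementation such as scikit-learn's (the corpus and parameter values below are placeholders, not the ones we used):

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["first movie synopsis ...", "second movie synopsis ..."]  # placeholder corpus

    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=20, random_state=0)
    doc_topic_features = lda.fit_transform(counts)   # one topic distribution per document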
So that's just basically an overview of all the semantic analysis techniques that
we used. And now I'm going to get into the actual evaluation. And so before I do
that, well, the first evaluation that I want to talk about is recommendation
systems. And before I do that I want to talk in general about recommendation.
Basically, there are two primary approaches in the recommendation field. One of
which is collaborative filtering, and the other is content-based approach. So
collaborative filtering is -- I guess maybe the Amazon shopping cart is a good
example of that. It's where you see that the customer, or group of customers, who
purchased some set of items also purchased another set of items. So if you
haven't purchased those, then you should be recommended those, and in the
content-based approach, you use features about the things being recommended
themselves as ways of performing the recommendation.
So the movie GoldenEye is actually similar to Mission: Impossible. And I'll show
that in one second. And also most modern-day systems take a hybrid approach,
where they're mixing this collaborative and content-based approach together.
So we're interested in content-based recommendations, because there's not a lot
of things you can do using semantic analysis surrounding collaborative
approaches. And in particular we're interested in the unstructured item content
rather than the structured content.
So just as an example, structured item content is things like movie genre, where
both GoldenEye and Mission: Impossible are action/adventure/thriller movies. We
want to use the descriptions of those movies and try and figure out whether or
not those can help in making recommendations.
So this is just an example. Some of the top terms that come out of doing a
simple TF-IDF overlap between the two movie descriptions are
helicopter, agent, infiltrate and CIA. We thought maybe the underlying
concepts behind these words might be
a better way of making the recommendations, and so in fact some of the concepts
that come out from using Probase are aircraft and intelligence agencies. So if
you like movies about the CIA you might also like movies about British intelligence or
something.
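A minimal sketch of the kind of TF-IDF comparison being described, using scikit-learn; the two synopses are stand-ins, not the actual IMDB text:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    goldeneye = "A rogue agent steals a helicopter, and James Bond must infiltrate ..."
    mission_impossible = "A disavowed agent must infiltrate the CIA by helicopter ..."

    tfidf = TfidfVectorizer(stop_words="english")
    vectors = tfidf.fit_transform([goldeneye, mission_impossible])
    print(cosine_similarity(vectors[0], vectors[1])[0, 0])   # term-level similarity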
This is just a quick overview or quick reminder we're working in the unstructured
content-based approaches in the recommendation field. And this is basically our
experiment. It's really a simple view of it. We're pulling down these movie
synopses from IMDB, running them through the semantic analysis techniques and
generating features, plugging in those features as item information for the Matchbox
recommendation platform, and also pulling down some movie ratings from
MovieLens. We do some training and testing, try to approximate the ratings in
the testing set, and what we get is a mean absolute error,
which is similar to the root mean squared error if you're familiar with the Netflix
challenge. This is just a quick overview of the Matchbox system. So what we did
was we generated features for the item model. That was where the semantic
analysis features got plugged in. But the way the system works is that it uses an
expectation propagation algorithm, if you're familiar with that. It iterates a
certain number of times and reduces each of these different components, the
user model, the item model and the context model into some number of latent
features, and experimentally we determined that the best number of latent
features for our data was around 20. And you can stop me at any point if you
have any questions. I forgot to say that in the beginning.
So this is our experimental data. We used, as I said before, the MovieLens
dataset from the workshop on heterogeneous recommendation systems. And
the nice thing about this dataset was it had mappings from the IDs in their data to
IMDB IDs, so it was easy for us to pull down the movie synopses. It has over
800,000 ratings, over 10,000 movies from over 2,000 users. And there wasn't a
movie synopsis for every movie. So we actually collected around 2,600 synopses, leaving
over 400,000 ratings from, luckily, all 2,000 of the users. And the way the ratings
data worked was it was scored by half points from 0.5 to 5. So there were 10
values and it was a -- I'm missing that word.
Yes, basically 10 values from 0.5 to 5. And in order to test whether or not these
features would work better in, say, a cold start scenario where you don't have
a lot of user data, or on basically the whole corpus, we
performed the training and testing for 200 movies, a thousand movies, and the
whole set. And we trained on 90 percent of the ratings and tested on the
remaining 10 percent. These were the features that we used. We had three
different baselines: in one baseline we didn't add any item features whatsoever.
In the second baseline we added movie genres, which you can consider to
be like a small amount of structured data from a limited vocabulary.
And then in a third baseline, we used movie tags, which is a much larger
vocabulary, and you can think of it as like a folksonomy, where users are
contributing to the semi-structured data for movies.
And then, of course, we used all the semantic analysis techniques as features.
And these were the different training regimens that we used. And in one case in
the top left we trained on a subset of the ratings and tested on only ratings where
users and movies had never been seen before by the system.
In the top right case, we trained on only movies that had never been seen before
by the system. And the bottom left only users that hadn't been seen, and then in
the bottom right case, we trained on sort of a random distribution, but really what
it meant was that anything tested on had some other training data both for the
user and for the movie.
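A simplified sketch of how a random split's test set could be carved into those four regimens, assuming a pandas DataFrame of ratings with hypothetical user, movie and rating columns (this only approximates how the regimens were actually constructed):

    import pandas as pd

    def split_regimens(ratings, train_frac=0.9, seed=0):
        """ratings: DataFrame with 'user', 'movie', 'rating' columns (assumed)."""
        train = ratings.sample(frac=train_frac, random_state=seed)
        test = ratings.drop(train.index)

        new_user = ~test["user"].isin(set(train["user"]))
        new_movie = ~test["movie"].isin(set(train["movie"]))

        return train, {
            "cold":      test[new_user & new_movie],    # neither user nor movie seen
            "new_movie": test[~new_user & new_movie],
            "new_user":  test[new_user & ~new_movie],
            "warm":      test[~new_user & ~new_movie],  # both seen during training
        }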
And so these are the results. I'm not going to talk through these results. But I just wanted
to show you that the results were extremely noisy, which was something we
weren't expecting. And actually we weren't expecting it because the Matchbox
paper itself had some very nice convergent curves.
And they were only using maybe a factor of two more ratings than us.
I think they were using around one million ratings. So even in the first baseline
case where we weren't adding any item features, we were still getting this really
noisy curve. But what I am going to talk to you about are the data tables for each
of these, because it's much easier to see which technique won. So this is
the first testing set which contained both users and movies that had not been
seen in training.
So one way you can think about this is that the recommendations being
made are based on the item features alone. They're based on the overlapping
concepts, or on the overlapping genres in the baseline cases, and
what we found here is that a small amount of structured data such as movie
genres is the most influential in this scenario where you have never seen an item
before. The second case is the case where the testing set contained
users that had not been seen before.
There was an extensive amount of collaborative data available. There were a lot
of user models to learn from for a particular movie before actually
testing. And what we found here was that, given this extensive amount of
collaborative data, any of the item features are really only
marginally beneficial. Even the best case only beats the baseline
with no item features by around one percent, and in some
cases less than one percent.
And this is the case where, similar to the first case, we tested on movies that had
not been seen before. And we found the same results as for the first set. It was
a small amount of structured data that really improved the recommendations.
The only difference between this and the first set was that you had an extensive
amount of data beforehand to train for the user. We also have this
one point that we're pretty sure is an outlier, and we're still in the process of
generating more results to test that.
And then this is the last result. And again this scenario is kind of similar to the
second scenario where you have an extensive amount of collaborative data. And
again we found these item features are really only marginally beneficial on the
order of one percent.
And these are the results in general. And I wanted to put this here, because
wanted to talk about the fact that none of these semantic analysis techniques
actually panned out for recommendation. And I definitely don't think that's to say
that these recommendation techniques aren't useful in general. Because they
are. But really what it just shows is that something noisy like Web scale
generated knowledge bases might not be useful in recommendation, particularly.
But there are other applications like query expansion and document clustering
which we're about to talk about.
So document clustering is pretty simple. Basically you want to automatically
divide a set of documents into some specified number of groups. And this is
useful for a variety of different information retrieval tasks. You can automatically
generate topics for search results, to help users navigate in some search
scenario. You can make recommendations for items that are similar to pages
that are currently being visited. And then you can also visualize the search
space.
We used a really simple approach because we were just testing semantic
analysis. So we used K-means. For those who aren't familiar, you start with some
initial clusters, you compute the means, you reassign based on minimum
distance, and you repeat this until convergence.
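For those who want those steps in code, a bare-bones K-means sketch over dense document feature vectors (illustrative only; in practice a library implementation would normally be used):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        """X: (n_docs, n_features) array of document feature vectors."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
        for _ in range(n_iter):
            # assign each document to its nearest cluster center
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # recompute the means; keep the old center if a cluster went empty
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):                # converged
                break
            centers = new_centers
        return labels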
This is the experimental setup. Again it was really simple. We
generated features using the semantic analysis techniques, randomly assigned
the clusters ten times, ran K-means, and computed purity and adjusted Rand
index scores, and then we were able to take the mean and standard deviation.
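The two scores can be computed roughly like this, with the adjusted Rand index taken from scikit-learn and purity computed from a contingency table (true_labels and cluster_labels are assumed to be the newsgroup labels and the K-means assignments):

    from sklearn.metrics import adjusted_rand_score
    from sklearn.metrics.cluster import contingency_matrix

    def purity(true_labels, cluster_labels):
        cm = contingency_matrix(true_labels, cluster_labels)
        # fraction of documents falling in the majority true class of their cluster
        return cm.max(axis=0).sum() / cm.sum()

    def cluster_scores(true_labels, cluster_labels):
        return (purity(true_labels, cluster_labels),
                adjusted_rand_score(true_labels, cluster_labels))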
So the experimental data we used was a miniature version of the 20 Newsgroups
dataset. This has around 2,000 messages from Usenet newsgroups, which, as most
of us can do the math, makes 100 messages per topic. We also filtered to keep only
the messages' body text, because the headers of those messages had some
discriminative information in them, including the actual name
of the cluster.
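We used the miniature distribution of the dataset; as an illustration of the same header-stripping idea, scikit-learn's loader for the full 20 Newsgroups corpus can drop headers, footers and quoted replies:

    from sklearn.datasets import fetch_20newsgroups

    # strip headers/footers/quotes so the newsgroup name can't leak into the text
    news = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
    documents, labels = news.data, news.target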
And that was our source, and there's an example on the top right corner of one of
the news articles, and that's a subset of it because they're rather long. These are
our results. We're still working on getting the latent Dirichlet allocation results
because they're actually still running, and we should have those by the end of
next week when my internship is finished.
So if you'd like to contact me and you're really interested in how that pans out,
feel free to do so. What we found was that the semantic
analysis techniques alone weren't as good as just using
something like TF-IDF, but when you combine the two together, when you use
the actual document terms and you add the
features from the semantic analysis, then you get a significant improvement, and
actually it was on the order of about 10 percent for Probase, and Probase did
beat out the explicit semantic analysis technique.
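The combination was essentially a concatenation of the two feature sets; roughly, with placeholder documents and precomputed concept features standing in for the real corpus and the Probase or ESA vectors:

    from scipy.sparse import csr_matrix, hstack
    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = ["first newsgroup message ...", "second newsgroup message ..."]  # placeholders
    concept_features = [[0.4, 0.0, 0.1], [0.0, 0.7, 0.2]]  # placeholder concept weights per document

    tfidf_features = TfidfVectorizer(stop_words="english").fit_transform(documents)
    combined = hstack([tfidf_features, csr_matrix(concept_features)])  # terms + concepts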
And this confirmed the results from the MSRA group who did tweet clustering, so
those were short texts. They also found that Probase was able to improve on
clustering over ESA. And even though it did improve, the results were
comparable, and that was similar to one of the experiments that they ran, where
the clusters they actually used had subtle differences between
them. If you'd like, I can explain that during the questions, but I'm
going to proceed.
It also was similar to some clustering work done using WordNet in 2003. And
what they found was that by combining WordNet features with the actual text,
they were able to get about a 10 percent improvement, which is similar to what
we found. The difference here is that WordNet is a human-generated resource
and Probase is an automatically generated resource.
So these are some of the conclusions that we've made from these results at this
point. And again we're still waiting on some of the results to come through. But
basically we found that semantic analysis features are only marginally beneficial
in recommendation. The structured data from -- let me just -- oh, and then we
also found that the movie genres were the features that performed best
in the case where you're making
recommendations for something you had never trained on or seen before. So
this small amount of structured data from a limited vocabulary is the best
approach in that scenario.
We also found that the explicit and latent semantic analysis approaches were
comparable, in movie recommendation at least. And if you paid close attention
to the results, that was just because each of the semantic analysis techniques
we evaluated was pretty comparable in its marginal improvement over the
baseline.
We'd also like to think that we found that knowledge bases generated at Web
scale are noisy for narrower domain tasks, though this probably needs
confirmation in yet another domain, maybe another recommendation domain or
maybe another narrow information retrieval application. And we talk about that a
little bit in the future work.
We also just confirmed the efficacy of semantic analysis techniques. So we
made sure that we were in fact sane in thinking that trying these sorts of things was
actually a good idea. The features that we generated were somewhat useful in
some tasks, even though they weren't necessarily useful in recommendation, and they
confirmed some of the other results in document clustering. Some of the future
directions that we have, we'd like to do noise reduction for those examples that I
showed you early on in semantic analysis. And there's a couple different ways
we can do this. Maybe we can create an extension of the recommendation
system to be fine tuned for these semantic analysis concepts. There were a lot
of different parameters that we could play with and actually generating the
concepts especially for Probase. You can change the number of concepts you
create for each term. So if you feed a term like Barack Obama, do you want just
the number one term, which is president or do you want the top ten terms which
include president, politician, Democrat, senator, and so you can sort of vary how
many of these come out. And then there's other parameters to be explored. And
then one last potential for noise reduction that we thought about after looking at
the ESA results was doing some kind of hybrid of conceptualization and named entity
disambiguation, because if you looked at that buzz example, where there were
ten different varieties of buzz and none of them were actually Buzz Lightyear, if
we had identified beforehand that we were talking specifically about Buzz
Lightyear, we could feed that into the semantic analysis techniques.
And to further test whether or not Web scale knowledge
sources are useful in a narrow domain, we might want to try an
information retrieval task where we have a domain-specific knowledge source
and show that the Web scale resource really does not compare to the
domain-specific knowledge source. And this is just some further reading. And I
wanted to thank the group at MSR Cambridge for helping me get set up in
working with Matchbox. The group at MSR Asia, for listening to my extensive
e-mails while I was working with Probase. And a special thanks to my mentor for
putting up with me for the summer, I guess. So thanks Evelyne Viegas. And, of
course, Microsoft Research Connections for allowing me to do my research on
their time.
[applause]
>>: So you broke down how there's group analysis and content analysis
and how most websites or most recommendation systems use a hybrid. And
then you went and jumped off talking about your content analysis and showed us
how you set up all those 90 percent, 10 percent experiments and whatnot and
results thereof, and from my understanding the analysis was purely on the
content side, correct?
>> Eric Rozell: No, actually that was the reason for doing the different training
regimens. So let me just go back to this.
So in the case where -- these top two cases where you're recommending based
on new movies, the recommendations are being made on item features alone.
So that was more of a content-based approach. Actually, the way Matchbox
works is that it's a hybrid approach. So it's taking both the collaborative features
and the content-based features. But we also controlled for that fact based on the
fact that the only things we were varying were the item features. So we were
seeing how well we can improve the system or improve the results by varying on
the different item features that were used. And we assumed that by getting the
results for each of these different things we could see which would be the best in
a purely content-based scenario.
>>: Okay. So what I was going to ask next then was if you only tested content
and you saw the various efficacies of semantic analysis there, I was wondering if
perhaps when you combined content with collaborative were there any unique
synergies that come out from your semantic analysis and was that explored at
all?
>> Eric Rozell: Right. So if you look at either of these bottom two scenarios,
where you have an extensive amount of collaborative data available for the
Matchbox system to consume, it does use that, because it's a hybrid approach --
and so the results that we found were that in this scenario you really just can't
improve on how good Matchbox is at collaborative features. These things are
only -- any of the item features are only marginally beneficial. So, yeah, the best
case here was using collaborative tags from a
folksonomy and genre on the thousand-movie dataset, but it only marginally improved on
using no features whatsoever, only the collaborative approach. So, yeah, so
the first baseline is basically a purely collaborative approach, using no content
features whatsoever. So we barely improved on that using any item features,
especially with the semantic analysis.
>>: But when you say analysis, are you going pure content or content plus
collaborative?
>> Eric Rozell: Content plus collaborative. Rather than implementing our own
content analysis, we used Probase, and we tried to set up the controls so we
could see which content approach was the best.
>>: What accuracy does Probase stand right now? What's the accuracy?
>> Eric Rozell: I think -- in what task, I guess?
>>: So in terms of this, the concepts for semantic -- for Obama you were having
president and lists, is this some kind of evaluation?
>> Eric Rozell: So the evaluation that they're publishing in Hki [phonetic] in
September is -- I don't remember, actually, what the numbers were -- no, I don't have those
numbers for you, but I can get them for you if you want to give me your e-mail
afterwards, or you can -- they have a website also, which I should list at some
point.
>>: So what you required from Probase was the concepts about entities; then
this could very well be taken from the categories, right?
>> Eric Rozell: The what?
>>: The categories.
>> Eric Rozell: Yes.
>>: Sounds familiar -- I assume that it's taken from the
categories most of the time.
>> Eric Rozell: Actually, no, most of the time for Probase it's based on these text
patterns. So it would encounter something like presidents such as George
Washington, then it uses that pattern to infer that George Washington is a
president.
>>: Is it likely that, using some other knowledge base, we overcome this limitation
which Probase has?
>> Eric Rozell: So using less probabilistic knowledge, is that what you're asking?
Yeah, and I think that's one of the things we brought up in the end is using a
combined named entity disambiguation and semantic analysis technique where
we can inform the semantic analysis by using some semi-structured data. So
first we can say, okay, this for sure is George Washington. And we know from
DBpedia or Freebase that George Washington is a president, so we can rule out
everything else, or at least weight it very heavily negatively.
>>: Yago, have you heard of Yago?
>> Eric Rozell: Yes.
>>: So Yago stands at around 95 percent accuracy, quite clean in terms of these
concepts and all that.
>> Eric Rozell: Yeah, definitely. I'll look into it. And so that was the other thing --
in the actual Probase literature they compare Probase against things like Yago
and Freebase, and the value of Probase is they have over 120,000 entities and
over three million groups of concepts.
>>: 30,000 entities.
>> Eric Rozell: Like the entities like in George Washington or the instances of
the different concept clusters.
>>: Yago has ten million.
>> Eric Rozell: Did I say 20,000? I meant 20 million.
>>: Yago was 20 million.
>> Eric Rozell: How many classes in Yago?
>>: Around 95.
>> Eric Rozell: 95,000?
>>: Yes.
>> Eric Rozell: Okay. So Probase has around two million classes. So that's
what they're really focusing on at this point.
>> Evelyne Viegas: Any other questions? All right. Let's thank Eric again.
Thank you very much.
[applause]