Developing Semantic Classifiers for Big Data Richard Scherl

Semantics for Big Data
AAAI Technical Report FS-13-04
Developing Semantic Classifiers for Big Data
Richard Scherl
Department of Computer Science & Software Engineering
Monmouth University
West Long Branch, NJ 07764
[email protected]
Here the use of machine learning to develop such filters is
Linked Data (Heath and Bizer 2011) is widely used,
even when the publishers of data do not utilize a recognized ontology for type categorization or for defining the
semantics of the properties. We can use the names of the
properties (links) without their definition in an ontology.
Page Rank (Brin and Page 1998) has proven useful in web
search and many variants have been developed (Manning,
Raghavan, and Schütze 2008; Baeza-Yates and Ribeiro-Neto
2011). Lisa Getoor and collaborators (Sen and Getoor 2007;
Lu and Getoor 2003) have worked on methods for linkedbased classification of web pages. Here a type of generalization of Page Rank is proposed for web pages that are part of
the space of Linked Data, but are not necessarily classified
by type and may have links that are errors.
When the amount of RDF data is very large, it becomes
more likely that the triples describing entities will contain errors and may not include the specification of a
class from a known ontology. The work presented here
explores the utilization of methods from machine learning to develop classifiers for identifying the semantic
categorization of entities based upon the property names
used to describe the entity. The goal is to develop classifiers that are accurate, but robust to errors and noise.
The training data comes from DBpedia, where entities
are categorized by type and densely described with RDF
properties. The initial experimentation reported here indicates that the approach is promising.
The problem being addressed by this paper is the development of techniques to help in the search for entities within
large Linked Data sets. The entity is a thing or object of
some sort that has an identifying URI and is described
by RDF (subject, predicate, object) triples. The techniques
should be applicable in the context of Big Data, where the
number of RDF triples is large and is likely to contain errors.
A number of researchers (Hitzler and van Harmelen 2010;
d’Amato et al. 2010; Hitzler 2009; Buikstra et al. 2011;
Fensel and van Harmelen 2007; Jain et al. 2010b; Auer and
Lehmann 2010; Jain et al. 2010a) have pointed out that the
semantic web as originally conceived (based on classical
logic) does not usefully apply to the large amounts of material on the web today, even that portion with semantic annotations. They argue that what is needed are methods that
are probabilistic, inductive and robust to noise and errors, as
opposed to those that are deductive and make use of formally
defined ontologies in Owl.
Here, an attempt is made to utilize features of semantic
annotation, but without the full power of the semantic representation and semantic query languages such as SPARQL.
It is assumed that potentially relevant RDF triples have been
gathered through some sort of adaptation of traditional document retrieval techniques (for example (Neumayer 2012)).
Then filters would be useful to narrow down the entities towards those that fall into the category that the user wants.
Motivating Examples
Figure 1 shows a keyword search done with FactForge1 . The
keyword is “Paris” and as expected we get not only the capital of France, but many other RDF entities that have the
name “Paris.” In fact, FactForge has a variety of different
tools (including SPARQL) to narrow down the search using
the full semantic capabilities of RDF. But the method proposed here is designed to be used in circumstances where a
RDF type property is not available or where the value of the
property may be in error.
Another potential application is the Billion Triples Semantic Web Challenge2 . Here a large data set has been
crawled from the Web and the goal is to make sense of this
data. The classifiers being proposed here could be used to
classify the entities into types.
The Project
The proposed approach is to use machine learning methods from text classification to develop classifiers for the type
of a RDF entity based on the names of the properties used
to describe the entity. The intuition that forms the basis of
this work is that the names of the properties that are used
to describe an entity correlate very highly with the categories of which the entity is an instance. The text classifi1
Copyright c 2013, Association for the Advancement of Artificial
Intelligence ( All rights reserved.
1.000000 (xsd:double)
312, 773, 872
588445298.668339 (xsd:double)
28163530711.793663 (xsd:double)
606057217.818624 (xsd:double)
606100000.000000 (xsd:double)
5498026760.621261 (xsd:double)
17870917.961318 (xsd:double)
181.965600 (xsd:double)
1837-03-04 (xsd:date)
1837-03-04 (xsd:date)
City Council
(City in a Garden), Make Big Plans (Make No Small
Plans), I Will
3.000000 (xsd:float)
Figure 2: DBpedia Chicago Entry
database(Auer et al. 2007) that is automatically created from
Wikipedia. DBpedia does contain semantic categorization
with regard to the type of object that is the subject of the
DBpedia forms an important hub in the space of Linked
Data4 . There are links both within DBpedia and also outside
of Dbpedia. There are quite a number of different property
names used in DBpedia. Some are defined in RDF(S), OWL,
or FOAF. Others are defined in DBpedia. For this work the
prefixes are ignored. So, the source does not matter. Only the
RDF type labels are removed. Any further winnowing of
the labels is done automatically by the classifier algorithm.
Although the data is processed in RDF N-triple format, it is easier to visualize with the representation that
one sees when viewing DBpedia entries with a web
browser. For example, consider Figure 2 and Figure 3.
These show portions of the DBpedia entry for the city of
“Chicago.” Much has been removed as many links (e.g.
dbpedia-owl:birthPlace ) occur numerous times.
The selection shows that there are a number of link names
that are highly indicative of “Chicago” being a city.
Figure 1: Key Word Search
A set of Python programs were developed to access DBpedia
and other sources of RDF data and to train and test off-theshelf implementations of machine learning algorithms. The
code makes use of existing libraries for processing RDF5 ,
cation method will distinguish between those that are useful and those which are not. The result will be a probabilistic classifier that is robust to noise and errors. For training
data, use is made of DBpedia3 , the semantically annotated
property names). Using Bayes’ rule, we have:
is dbpprop:rd1t4Loc of
is dbpprop:rd2t1Loc of
is dbpprop:popplace of
is dbpprop:prev of
is dbpprop:rd3t1Loc of
is dbpprop:recLocation
Generally, the denominator is dropped as indicated here
cmap = arg max P (c | d)P (c)
P (d | c) = P (ht1 , . . . , tnd i | c))
for the Multinomial Naive Bayes case and
is dbpprop:restingPlace
is dbpprop:restingplace
is dbpprop:runnerup of
is dbpprop:seat of
is dbpprop:residence of
is dbpprop:resides of
is dbpprop:setting of
is dbpprop:shipBuilder
P (d | c) = P (he1 , . . . , eM i | c)
for the Bernoulli Naive Bayes case.
The conditional independence assumption that each of the
attribute values is independent of the others given the class
is what gives the name Naive in Naive Bayes. With this assumption, the Multinomial computation becomes
P (d | c) = P (ht1 , . . . , tnd i | c)) =
P (Xk = tk | c)
where Xk is a random variable for position k in the document. So, P (Xk = t | c) is the probability that in a document of class c the term t will occur in position k. An
additional independence assumption is that the conditional
probabilities for a term is the same for all positions, allowing the random variable to be written as P (X = t | c). The
multinomial Naive Bayes therefore counts the number of occurrences of a term in a document, but does not include the
With the independence assumption, the Bernoulli Naive
Bayes formula becomes
P (d | c) = P (he1 , . . . , end i | c)) =
P (Ui = ei | c)
Figure 3: DBpedia Chicago Entry
code for accessing DBpedia6 , machine learning algorithms7 ,
as well as other Python resources (Harrington 2012; Perkins
There are a variety of methods commonly used in
text classification (Manning, Raghavan, and Schütze 2008;
Baeza-Yates and Ribeiro-Neto 2011). The experiments reported here use the Naive Bayes method. Naive Bayes
(following the presentation in (Manning, Raghavan, and
Schütze 2008)) works by computing the maximum a posteriori probability (MAP)
where Ui is the random variable for vocabulary term i and
takes as values 0 for absence and 1 for presence. Hence
P (Ui = 1 | c) is the probability that in a document of class
c the term ti will occur in the document.
Multinomial Naive Bayes works by computing the probability of a document (here a list of
Qproperty names) d being
in a class c as P (c | d) ∝ P (c) 1≤j≤nd P (tj | c) where
P (tj | c) is the probability of the term (here property name)
occurring in a document of class c and P (c) is the prior probability of a document
Q occurring in class c. The formula is
P (c | d) ∝ P (c) 1≤j≤M P (ej | c) in the Bernoulli case,
were the representation of d is a vector of 0s and 1s.
cmap = arg max P (c | d)
of the class c given the document d (in our case a list of
is dbpprop:recorded of
P (c | d)P (c)
P (d)
since we are only interested in the argmax and not in the
actual probability.
There are two interpretations of Naive Bayes used in text
classification. In the Multinomial Naive Bayes the document
d is a sequence of terms ht1 , . . . , tnd i. In the Bernoulli Naive
Bayes model the document d is a binary vector he1 , . . . , eM i
(of dimensionality M the number of terms) indicating for
each term whether or not it occurs in d. So for the computation of the conditional probability of the document given the
class, we have
arg max P (c | d) = arg max,
Intuition about RDF and semantic representation leads us
to expect that the number of times a property occurs in a
RDF description of an entity is not all that significant. A city
is a city regardless of the number of famous people born in
that city. But whether or not a property is stated is likely
to be significant. On this basis, one would predict that the
Bernoulli classifier will perform better.
The Naive Bayes Multinomial algorithm uses the training
set to compute P̂ (c) ( NNc )) and P̂ (tj | c) as estimates for
P (c) and P (tj | c). Laplace smoothing (adding one to each
count) is used to handle the cases where some terms do not
occur in the training data, in the computation
intuitions) were obtained with the NLTK (Bernoulli) Naive
Bayes classifier. For country we have, an accuracy of .8333,
a precision of positive of .75, of negative of 1.0, a recall of
positive of 1.0, and a recall of negative of .6666. On person
we have an accuracy of .8333, a precision of positive of .75,
of negative of 1.0, a recall of positive of 1.0, and a recall of
negative of .6666.
The version of naive Bayes in NLTK has a useful method
that shows the most informative features in a particular classifier. Here it was run with the number of features set to 10.
The result for the country classifier is given in Table 1. The
establishedevent = None
gini = None
populationcensus = None
demonym = None
gdpnominalyear = None
populationdensityrank = None
populationcensusyear = None
areasqmi = None
legislature = None
currency = None
Tct + 1
t0 ∈V (Tct + 1)
P̂ (tj | c) = P
where Tct is the count of the number of occurrences of t in
the documents of class c in the training set, N is the total
number of documents, and Nc is the number of documents
of class c. These estimates are used
arg max P̂ (c | d) = arg ma xc∈C P̂ (c)
P̂ (tj | c)
to find find the best class cmap . In the Bernoulli case, the
formula used for learning is
P̂ (t | c) =
neg : pos
neg : pos
neg : pos
neg : pos
neg : pos
neg : pos
neg : pos
neg : pos
neg : pos
neg : pos
6.2 : 1.0
6.2 : 1.0
6.2 : 1.0
6.2 : 1.0
6.2 : 1.0
6.2 : 1.0
6.2 : 1.0
6.2 : 1.0
6.2 : 1.0
6.2 : 1.0
Table 1: Most Informative Features for Country Classifier
most informative features for the person classifier is given
in Table 2. From these displays, it is clear that the nonoccur-
(Nct + 1)
(Nc + 2)
givenname = None
birthplace = None
surname = None
surname = True
caption = True
wasderivedfrom = None
wordnet type = True
caption = None
areatotal = None
timezone = None
where Nct is the number of documents of class c containing
term t in the training set.
A small set of test entities was extracted from DBpedia8
for learning the category country and the category person.
Using the Scikitlearn multinomial naive Bayes classifier, the
accuracy9 for country is .8333, the precision10 of positive11
is .75, the precision for negative12 is 1.0, the recall13 of positive is 1.0 and the recall of negative is .666. For person the
accuracy is .666, with a positive precision of .6, a negative
precision of 1.0, a positive recall of 1.0 and a negative recall
of .333.
Using the Scikitlearn Bernoulli classifier on the country
data has an accuracy of only .333, a precision of positive
of 0, a precision on negative of .4, a recall on positive of 0,
and a recall on negative of .666. The Bernoulli classifier on
person has an accuracy of .666, a precision on negative of
1.0, a precision on positive of .6, a recall on negative of .333
and a recall on positive of 1.0.
These results do not match our intuitions reported earlier, since the Multinomial classifier is doing better than the
Bernoulli. On the other hand better results (that do match the
neg : pos
neg : pos
neg : pos
pos : neg
pos : neg
neg : pos
pos : neg
neg : pos
pos : neg
pos : neg
6.2 : 1.0
6.2 : 1.0
5.4 : 1.0
4.5 : 1.0
4.0 : 1.0
3.8 : 1.0
3.5 : 1.0
3.2 : 1.0
3.0 : 1.0
3.0 : 1.0
Table 2: Most Informative Features for Person Classifier
rence of particular property names is crucial in distinguishing between categories.
Current work is increasing the amount of DBpedia examples and the number of categories being considered. Additionally, experiments are underway with other methods of
Further tests
U.S. government data14 on the current members of congress
was downloaded in JSON format. Additionally, data from
the CIA World Factbook15 was downloaded in RDF format.
The NLTK Naive Bayes country classifier (multinomial)
was tested on 5 congress members. In each case, it classified
the entity as not a country. It was also tested on 5 countries
Since the full content of DBpedia was not used, the prior probability estimates do not contribute to the classifier.
Accuracy is the fraction of the classifications that are correct.
P recision = TruePositves+FalsePositives
Here a true positive result means that the example is an instance of the class.
Here a true positive result means that the example is not an
instance of the class.
Recall = TruePositives+FalseNegatives
from the CIA World Factbook and in each case it classified the entity as a country. The NLTK Naive Bayes person
classifier (multinomial) was tested on 5 congress members.
In each case, it classified the entity as a person. It was also
tested on 5 countries from the CIA World Factbook and in
each case it classified the entity as not a person.
Even though this test was surprisingly successful, it is
likely that more is needed to ensure that the classifier trained
on the terminology of one ontology’s properties will work
on another’s. Possibilities being investigated are to parse the
property names into potentially significant pieces and using
some form of stemming.
Buikstra, A.; Neth, H.; ten Teije, L. S. A.; and van Harmelen, F. 2011. Ranking Query Results from Linked Open
Data Using a Simple Cognitive Heuristic. In Proceedings to
the Workshop on Discovering Meaning On the Go in Large
Heterogeneous Data, collocated with The Twenty-second International Joint Conference on Artificial Intelligence (IJCAL2011), Barcelona, Spain.
d’Amato, C.; Fanizzi, N.; Fazzinga, B.; Gottlob, G.; and
Lukasiewicz, T. 2010. Combining semantic web search
with the power of inductive reasoning. In Deshpande, A.,
and Hunter, A., eds., Scalable Uncertainty Management:
Proceedings of the 4th Internatioal Conference, SUM 2010,
137–150. Springer.
Doan, A.; Madhavan, J.; Domingos, P.; and Halevy, A. Y.
2002. Learning to map between ontologies on the semantic
web. In Lassner, D.; Roure, D. D.; and Iyengar, A., eds.,
WWW, 662–673. ACM.
Euzenat, J., and Shvaiko, P. 2007. Ontology matching. Heidelberg (DE): Springer-Verlag.
Fensel, D., and van Harmelen, F. 2007. Unifying reasoning and search to web scale. Internet Computing, IEEE
Harrington, P. 2012. Machine Learning in Action. Greenwich, CT, USA: Manning Publications Co.
Heath, T., and Bizer, C. 2011. Linked Data: Evolving the
web into a Global Data Space. Morgan & Claypool Publishers.
Hitzler, P., and van Harmelen, F. 2010. A reasonable semantic web. Semantic Web Journal 1(1):39–44.
Hitzler, P. 2009. Towards reasoning pragmatics. In Janowicz, K.; Raubal, M.; and Levashkin, S., eds., GeoSpatial Semantics, Third International Conference, FeoS 2009, Mexico
City, Mexico, December 3-4 2009, 9–25. Springer.
Jain, P.; Hitzler, P.; Sheth, A. P.; Verma, K.; and Yeh, P. Z.
2010a. Ontology alignment for linked open data. In International Semantic Web Conference (1), 402–417.
Jain, P.; Hitzler, P.; Yeh, P. Z.; Verma, K.; and Sheth,
A. P. 2010b. Linked data is merely more data. In AAAI
Spring Symposium: Linked Data Meets Artificial Intelligence. AAAI.
Lu, Q., and Getoor, L. 2003. Link-based classification. In
20th International Conference on Machine Learning.
Manning, C. D.; Raghavan, P.; and Schütze, H. 2008. Introduction to Information Retrieval. Cambridge, England:
Cambridge University Press.
Neumayer, R. 2012. Semantic and Distributed Entity Search
in the Web of Data. Ph.D. Dissertation, Department of
Computer and Information Science, Norwegian University
of Science and Technology.
Perkins, J. 2010. Python Text Processing with NLTK 2.0
Cookbook. Packt Publishing.
Sen, P., and Getoor, L. 2007. Link-based blassification.
Dept. of Computer Science Technical Report CS-TR4858,
University of Maryland.
Conclusion and Future Work
The paper presented here explores the development of classifiers for identifying the semantic categorization of RDF
entities based upon the semantic links referencing and referenced by the entity. Methods from text classification are
adapted to use the names of the links. The goal is to develop
classifiers that are accurate, but robust to errors and noise.
The training data comes from entities that are already semantically categorized. DBpedia was used as a testbed for
this approach. These preliminary experiments are very encouraging as the tests worked quite well.
The next stage of the project will extend this work with
binary classifiers to the development of a multi-label classifier. Dbpedia provides an elaborate ontology that can be
used as the basis for the multi-label classifier.
An additional application for this approach is to address the ontology matching problem (Euzenat and Shvaiko
2007). Building on previous work (Doan et al. 2002) on the
use of machine learning for ontology matching, one could
use classifiers (of the sort discussed in this paper) trained
on the categories in DBpedia to determine the relationship
between the DBpedia ontology and another ontology.
The author thanks Dr. Joe Chung of Monmouth University
for invaluable help with systems issues.
Auer, S., and Lehmann, J. 2010. Making the web a data
washing machine - creating knowledge out of interlinked
data. Semantic Web Journal 1(1-2):97–104.
Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak,
R.; and Ives, Z. 2007. Dbpedia: a nucleus for a web of open
data. In Proceedings of the 6th international The semantic
web and 2nd Asian conference on Asian semantic web conference, ISWC’07/ASWC’07, 722–735. Berlin, Heidelberg:
Baeza-Yates, R., and Ribeiro-Neto, B. 2011. Modern Information Retrieval: the concepts and technology behind
search. Harlow, England: Addison Wesley.
Brin, S., and Page, L. 1998. The anatomy of a large-scale
hypertextual web search engine. In Seventh International
World-Wide Web Conference (WWW 1998).