Full Text

advertisement
Hybrid Classification Method for Multi-source
POIs Based on Semantic Analysis
Liu Jiping, Luo An1, Zhang Fuhao, Wang Yong
Chinese Academy of Surveying and Mapping, 28 Lianhuachi West Road,
Haidian District, Beijing 100830, China
Abstract. With the increasing number of Internet users and mobile devices in the recent years, the amount of POI (point of interest) which is used as
a special point location on the Web is growing at a rapid rate. It is always
provided by many different web sites such as google map, bing map, baidu
map and so on. To better exploit the POIs from different website, there are
some problems of multi-sources data fusion need to overcome. The first one
of these problems is classifying POIs automatically. In this paper, we present a classification method for multi-source POIs based on semantic analysis, which matches through three aspects information of POI as category
tagging, name and attributes. It can classify the POI datasets fast and automatically and save a lot of manual labor. In order to prove the effectiveness of this method, a prototype system is designed and implemented, and
it demonstrates that this proposed method yields high-quality results under
realistic situation.
Keywords: multi-source, POIs, classification, semantic
1.
Introduction
In recent years, the size of Internet users and mobile devices are in the rapid increase especially in China. Until 2014, the number of Internet users is
632 million and the number of mobile phone users is 527 million in China.
Geo-reference information which is provided by Internet or mobile phone
users will become more and more over time. As a typical type of georeference information, points of interest, or POIs for short, are the specific
point locations that many people may use to find somewhere or something
1
Luo An, E-mail address: luoan@casm.ac.cn.
on the map. They are very useful and maybe the primary portal for public
contact of geographic information.
In spite of their importance, the production of POIs is scattered across a
myriad of different websites, systems and devices, and each one uses its
own taxonomy to classify the POIs, thus making it extremely difficult to
classify POIs using a unified taxonomy. The classification of POIs has been
traditionally a burdensome and manual task. Until recently, the pace of
change in those approaches was still slow, but currently the massive quantity and update rate of POIs available in the web demands for an automatic
approach.
The most atomic units of POIs consist of the name of the place, a short text
description, a category tag that unambiguously identifies it is, a pair of Latitude and Longitude, some extra information such as the attributes of POI.
With these aspects information of POIs, the researchers in geospatial field
are tried to put forward many methods to classify POIs automatically, and it
has borne some fruit. There are three different approaches for POIs classification based on dictionary (e.g. WordNet, HowNet, Synset), ontology and
machine learning (McKenzie Grant et al. 2013, Filipe Rodrigues 2010, Santos
M & Moreira A 2006, Ying Xia et al. 2014, S. Ilarri et al. 2011).
In this paper, we extract the information related to its category from POIs
and propose a hybrid method for multi-source POIs based on semantic
analysis. It matches POIs on three aspects as the category tagging, name of
space and attributes related POI’s type. It can classify the POI datasets
automatically and save a lot of manual labor.
2.
Semantic classification method
Combining the features of POI information, the proposed method realized
automatic and fast classification for multi-source POIs from web through
category tag, name and attributes, as shown in Figure 1. First, we extract the
category tag, name and attributes related POI’s type. Second, choose the
different matching with having category tag or not. If there is a category tag,
we can get the candidate classifications for POIs according the category tag
mapping in the taxonomy. Otherwise, skipping category matching, go to the
next step directly. Third, classify POIs with matching name of POIs based
on text matching and semantic analysis. It can further determine the classification of POIs. Finally, we can improve to classify POIs accuracy by attributes matching, which always reflects the application of POI from some
aspects.
POI information
Information extraction
Has category or not
Y
N
Category Matching
Name Matching
Attributes Matching
Semantic similarity
calculating
Selection and
Classification
Figure 1. The approach of semantic classification for multi-source POIs.
2.1. Category tag matching
Category tag is a method of management for POIs by the data providers. It
is artificial marked for its purpose or the way people use it, and can reflect
what it is and how to use it accurately and reliably. Category is not just a
short text for its function, or is one of categories in taxonomy which defines
the hierarchical relationships between categories. Therefore, we think the
category tag is very important for automatic classification of multi-source
POIs.
Category tag matching is the process of mapping the category in one taxonomy to a category in a specify taxonomy. Each taxonomy is not exactly the
same, it makes the certain category in one taxonomy cannot match exactly
with a category in a specify taxonomy. Sometime one category is mapping
to a few categories in other taxonomy.
Before the category matching, we need extract semantics related with the
category tag for POIs from corresponding taxonomy. With these semantics
of category tag, the result of category matching is determined by the following rules:



Exactly matching: The semantic factors extracted from POI category
tag are exactly the same semantics with the POI classification matched.
Contain matching: The set of semantic factors extracted from POI category tag is a subset of POI classification matched.
Partial matching: Between two sets of semantics have intersection, but
no one is contained.
2.2. Name of POIs matching
All POIs have their names which people use to find them or guide for tour
and drive. Just looking at the name of POI, it is easy for people to know
what it is or how to use it, because of different categories have different
naming rules. For all kinds of naming rules, we adopt a hybrid method for
name matching.
Keyword matching: Only for a few categories, they always contain some
special key words which other categories do not contain.
 Center-word matching: For the categories whose center word can identify the function of POI.
 Regular expression matching: For the categories which can express by
combinations of words.
(1) Keyword matching

Firstly, construct a set of keywords for the special category in the taxonomic
hierarchies. Then match the name of POI one by one. It can get the preliminary category for a part POIs data. For example: The POI whose name containing “cuisine” is always belongs to “Restaurant”.
(2) Center-word matching
Firstly, tag the role for words in POI’s name with the phrase segmentation
and use a word dictionary. Secondly, extract the central word in the POI’s
name, then cut in the POI’s name with the help of central word as shown in
Figure 3. Finally, calculate the similarity of the POI’s name with the results
of cutting. For example: a POI’s name is “北京海关驻顺义区办事处南面”,
and the cutting result is “北京 | 海关 | 驻 | 顺义区 | 办事处 | 南面”. The center word is “办事处”, not “南面”,and it belongs to “政府机构”, whose english name is “agency unit”.
Nc
Nj
Nj
Ns
Nd
北京
海关
办事处
v
Ns
驻
顺义区
adv
南面
Figure 2. The example of center-word cutting.
(3) Regular expression matching
Match the regular expression with the name of POIs. If not matched, go to
the next expression. Else go to the next POI. The rules will be more and
more through machine learning. In this paper, the Regular expressions contains as following:




The rules for the first word of name
The rules of combination words
The numeral Rules
The rules for the last word of name
2.3. Attribute matching
After category and name matching, more than one category are maybe
matched with the POI. In order to classify more accurately, we refine the
relevant information with the category of POIs. Considering the different
categories of POI with the different characteristics and properties, then attributes information are found and extracted.
Attributes can be refined from the description information of POIs, and
they describe more details about characteristics, functions and uses for
POIs. Attributes of POIs can be divided into two varieties as following:


The first is the description for POI’s proprietary features. They are connotations of POI.
The other is the instance for POI. They always describe what POI is.
A concepts ( L  (U ,D ) ) can be used to express one POI. L represents a
POI, U contains the first kind of attributes, D contains the rest.
Firstly, we formulate the corresponding attribute template for each category
of POI in the taxonomy. Then according to the attribute template, we extract the attributes from description of POIs, and form a POI concept lattice.
Lastly, this paper promotes an algorithm to calculate the similarity between
two formal concepts ( L1  (U 1 ,D1 ) and L2  (U 2 ,D 2 )). The formula is:
Sim(L1 ,L2 )  
3.
U1  U2
max(U 1 ,U 2 )
 
D1  D 2
max( D 1 , D 2 )
Experiment and Application
3.1. Experiment and analysis
In this section, we report experimental results for multi-source POIs matching. First, we refine POIs in the region of Beijing from Google map, Baidu
map and Amap, and the total of POIs is about 500 thousand. Then we categorize POIs by human means, and extract more than 3 thousand school
POIs. In the category of school, there are seven subclasses such as kindergarten, primary school, middle school, university, adult school, private
technical school, special education school and so on.
Though the method of this paper proposed, we can find 2817 school POIs
and the matching precision is about 86.2%. The detailed matching results
as shown in Table 1 and a part of results display as shown in Figure 3.
School
kindergarten
primary
middle
school
school
university
adult
private
special
school
technical
education
school
school
total
3268
1409
1093
495
76
6
157
32
correct
2817
1146
982
452
73
5
134
25
86.2%
81.3%
89.8%
91.3%
96.1%
83.3%
85.3%
78.1%
match
precision
Table 1. The result of school matching for multi-source POIs in beijing.
Figure 3. Some example for the school POIs in beijing.
3.2. Application
In order to evaluate the actual performance of our proposed method, and
this prototype system is implemented in Visual C# language under Windows 7 on a PC. First, we choose the category of residence and extract them
form multi-source POIs on the Web. Then we display the property of price
on the map as shown in Figure 4, which can reflect the prices of heat in Beijing.
Figure 4. The application of property of price.
Another application: we extract more than 21 thousand farmers markets
from 14 million web POIs and display the distribution on map in China,
which is shown in Figure 5. The result data was used to analysis and forecast of bird flu and had been shown to be important in the spatial epidemiology.
Figure 5. The extracting result of farmers markets.
4.
Conclusion
This paper proposes a new classification method for multi-source POIs
based on semantic analysis, which contains three types matching: category
tag matching, name of POI matching and attributes matching. It can classify the POI datasets from their source taxonomy to a special one fast and
automatically, that have been verified in the experiment and application.
The method in this paper will save a lot of human labor, and promote the
development of POI service effectively.
Our experiments suggest that the efficiency of the classification method
depends largely on the keywords and matching rules. In the future work, we
will extend the taxonomy and enrich the keywords and regular expressions.
This classification method will be better and better by machine learning
from large number of POIs sample.
Acknowedments
This research was funded by National High Technology Research and
Development Program of China (863 Program) under grant No.
2012AA12A402 and No. 2013AA12A403, and the Basic Research Fund of
CASM under grant No. 7771403.
References
McKenzie Grant, Janowicz Krzysztof, Adams Benjamin (2013). Weighted multiattribute matching of user-generated points of interest, Proceedings of the ACM
International Symposium on Advances in Geographic Information Systems, p
430-433
Filipe Rodrigues (2010): POI Mining and Generation. University of Coimbra, 2010
Santos M, Moreira A (2006). Automatic classification of location contexts with
decision trees. Processdings of the Conference on Mobile and Ubiquitous Systems, Guimares, Portugal, 2006
Ying Xia, Shiyan Luo, Xu Zhang, Hae Yong Bea (2014). Organization and Retrieval
Method of Multimodal Point of Interest Data based on Geo-ontology. Advanced
Science and Technology Letters, Vol.45, pp49-54
Kerr Dermot, Coleman Sonya, Scotney Bryan (2008). Comparing cornerness
measures for interest point detection, 2008 International Machine Vision and
Image Processing Conference (IMVIP 2008), p 105-110
S. Ilarri, A lllarramendi, E Mena, A sheth (2011). Semantics in Location-Based
Services. IEEE Internet Computing, Vol. 15, pp.10-14
POI Classification, Stratification and Attribute Structure Document of Geographic
Information Public Service Platform. www.esrichina-bj.cn
Yogesh Dwivedi, Navonil Mustafee, Micheal D, Banita Lal (2009). Classification of
information systems research revisited: A keyword analysis approach. Pacific
Asia Conference on Information Systems (PACIS), PACIS 2009 Proceedings
Viktoratos I., Tsadiras A., Bassiliades N (2013). A Rule Based Personalized
Location In-formation System for the Semantic Web.14th International
Conference, EC-Web 2013, Prague, Czech Republic, August 27-28. Proceedings.
pp 27-38
Yu H, Zhang H, Liu Q (2003) Recognition of Chinese Organization Name Based on
Role Tagging. Proceedings of the 20th International Conference on
Computer Processing of Oriental Languages, 2003
Li J, Wang D, Wang X (2008). Chinese organization name recognition based
on template matching. Information Technology, 6(25):p97-99
Li L, Goodchild MF (2010) Automatically and accurately matching objects in
geo-spatial datasets. Theory, Data Handling and Modelling in GeoSpatial
Information Science. Hong Kong, 26-28 May, 2010
Download