Hybrid Classification Method for Multi-source POIs Based on Semantic Analysis Liu Jiping, Luo An1, Zhang Fuhao, Wang Yong Chinese Academy of Surveying and Mapping, 28 Lianhuachi West Road, Haidian District, Beijing 100830, China Abstract. With the increasing number of Internet users and mobile devices in the recent years, the amount of POI (point of interest) which is used as a special point location on the Web is growing at a rapid rate. It is always provided by many different web sites such as google map, bing map, baidu map and so on. To better exploit the POIs from different website, there are some problems of multi-sources data fusion need to overcome. The first one of these problems is classifying POIs automatically. In this paper, we present a classification method for multi-source POIs based on semantic analysis, which matches through three aspects information of POI as category tagging, name and attributes. It can classify the POI datasets fast and automatically and save a lot of manual labor. In order to prove the effectiveness of this method, a prototype system is designed and implemented, and it demonstrates that this proposed method yields high-quality results under realistic situation. Keywords: multi-source, POIs, classification, semantic 1. Introduction In recent years, the size of Internet users and mobile devices are in the rapid increase especially in China. Until 2014, the number of Internet users is 632 million and the number of mobile phone users is 527 million in China. Geo-reference information which is provided by Internet or mobile phone users will become more and more over time. As a typical type of georeference information, points of interest, or POIs for short, are the specific point locations that many people may use to find somewhere or something 1 Luo An, E-mail address: luoan@casm.ac.cn. on the map. They are very useful and maybe the primary portal for public contact of geographic information. In spite of their importance, the production of POIs is scattered across a myriad of different websites, systems and devices, and each one uses its own taxonomy to classify the POIs, thus making it extremely difficult to classify POIs using a unified taxonomy. The classification of POIs has been traditionally a burdensome and manual task. Until recently, the pace of change in those approaches was still slow, but currently the massive quantity and update rate of POIs available in the web demands for an automatic approach. The most atomic units of POIs consist of the name of the place, a short text description, a category tag that unambiguously identifies it is, a pair of Latitude and Longitude, some extra information such as the attributes of POI. With these aspects information of POIs, the researchers in geospatial field are tried to put forward many methods to classify POIs automatically, and it has borne some fruit. There are three different approaches for POIs classification based on dictionary (e.g. WordNet, HowNet, Synset), ontology and machine learning (McKenzie Grant et al. 2013, Filipe Rodrigues 2010, Santos M & Moreira A 2006, Ying Xia et al. 2014, S. Ilarri et al. 2011). In this paper, we extract the information related to its category from POIs and propose a hybrid method for multi-source POIs based on semantic analysis. It matches POIs on three aspects as the category tagging, name of space and attributes related POI’s type. It can classify the POI datasets automatically and save a lot of manual labor. 2. Semantic classification method Combining the features of POI information, the proposed method realized automatic and fast classification for multi-source POIs from web through category tag, name and attributes, as shown in Figure 1. First, we extract the category tag, name and attributes related POI’s type. Second, choose the different matching with having category tag or not. If there is a category tag, we can get the candidate classifications for POIs according the category tag mapping in the taxonomy. Otherwise, skipping category matching, go to the next step directly. Third, classify POIs with matching name of POIs based on text matching and semantic analysis. It can further determine the classification of POIs. Finally, we can improve to classify POIs accuracy by attributes matching, which always reflects the application of POI from some aspects. POI information Information extraction Has category or not Y N Category Matching Name Matching Attributes Matching Semantic similarity calculating Selection and Classification Figure 1. The approach of semantic classification for multi-source POIs. 2.1. Category tag matching Category tag is a method of management for POIs by the data providers. It is artificial marked for its purpose or the way people use it, and can reflect what it is and how to use it accurately and reliably. Category is not just a short text for its function, or is one of categories in taxonomy which defines the hierarchical relationships between categories. Therefore, we think the category tag is very important for automatic classification of multi-source POIs. Category tag matching is the process of mapping the category in one taxonomy to a category in a specify taxonomy. Each taxonomy is not exactly the same, it makes the certain category in one taxonomy cannot match exactly with a category in a specify taxonomy. Sometime one category is mapping to a few categories in other taxonomy. Before the category matching, we need extract semantics related with the category tag for POIs from corresponding taxonomy. With these semantics of category tag, the result of category matching is determined by the following rules: Exactly matching: The semantic factors extracted from POI category tag are exactly the same semantics with the POI classification matched. Contain matching: The set of semantic factors extracted from POI category tag is a subset of POI classification matched. Partial matching: Between two sets of semantics have intersection, but no one is contained. 2.2. Name of POIs matching All POIs have their names which people use to find them or guide for tour and drive. Just looking at the name of POI, it is easy for people to know what it is or how to use it, because of different categories have different naming rules. For all kinds of naming rules, we adopt a hybrid method for name matching. Keyword matching: Only for a few categories, they always contain some special key words which other categories do not contain. Center-word matching: For the categories whose center word can identify the function of POI. Regular expression matching: For the categories which can express by combinations of words. (1) Keyword matching Firstly, construct a set of keywords for the special category in the taxonomic hierarchies. Then match the name of POI one by one. It can get the preliminary category for a part POIs data. For example: The POI whose name containing “cuisine” is always belongs to “Restaurant”. (2) Center-word matching Firstly, tag the role for words in POI’s name with the phrase segmentation and use a word dictionary. Secondly, extract the central word in the POI’s name, then cut in the POI’s name with the help of central word as shown in Figure 3. Finally, calculate the similarity of the POI’s name with the results of cutting. For example: a POI’s name is “北京海关驻顺义区办事处南面”, and the cutting result is “北京 | 海关 | 驻 | 顺义区 | 办事处 | 南面”. The center word is “办事处”, not “南面”,and it belongs to “政府机构”, whose english name is “agency unit”. Nc Nj Nj Ns Nd 北京 海关 办事处 v Ns 驻 顺义区 adv 南面 Figure 2. The example of center-word cutting. (3) Regular expression matching Match the regular expression with the name of POIs. If not matched, go to the next expression. Else go to the next POI. The rules will be more and more through machine learning. In this paper, the Regular expressions contains as following: The rules for the first word of name The rules of combination words The numeral Rules The rules for the last word of name 2.3. Attribute matching After category and name matching, more than one category are maybe matched with the POI. In order to classify more accurately, we refine the relevant information with the category of POIs. Considering the different categories of POI with the different characteristics and properties, then attributes information are found and extracted. Attributes can be refined from the description information of POIs, and they describe more details about characteristics, functions and uses for POIs. Attributes of POIs can be divided into two varieties as following: The first is the description for POI’s proprietary features. They are connotations of POI. The other is the instance for POI. They always describe what POI is. A concepts ( L (U ,D ) ) can be used to express one POI. L represents a POI, U contains the first kind of attributes, D contains the rest. Firstly, we formulate the corresponding attribute template for each category of POI in the taxonomy. Then according to the attribute template, we extract the attributes from description of POIs, and form a POI concept lattice. Lastly, this paper promotes an algorithm to calculate the similarity between two formal concepts ( L1 (U 1 ,D1 ) and L2 (U 2 ,D 2 )). The formula is: Sim(L1 ,L2 ) 3. U1 U2 max(U 1 ,U 2 ) D1 D 2 max( D 1 , D 2 ) Experiment and Application 3.1. Experiment and analysis In this section, we report experimental results for multi-source POIs matching. First, we refine POIs in the region of Beijing from Google map, Baidu map and Amap, and the total of POIs is about 500 thousand. Then we categorize POIs by human means, and extract more than 3 thousand school POIs. In the category of school, there are seven subclasses such as kindergarten, primary school, middle school, university, adult school, private technical school, special education school and so on. Though the method of this paper proposed, we can find 2817 school POIs and the matching precision is about 86.2%. The detailed matching results as shown in Table 1 and a part of results display as shown in Figure 3. School kindergarten primary middle school school university adult private special school technical education school school total 3268 1409 1093 495 76 6 157 32 correct 2817 1146 982 452 73 5 134 25 86.2% 81.3% 89.8% 91.3% 96.1% 83.3% 85.3% 78.1% match precision Table 1. The result of school matching for multi-source POIs in beijing. Figure 3. Some example for the school POIs in beijing. 3.2. Application In order to evaluate the actual performance of our proposed method, and this prototype system is implemented in Visual C# language under Windows 7 on a PC. First, we choose the category of residence and extract them form multi-source POIs on the Web. Then we display the property of price on the map as shown in Figure 4, which can reflect the prices of heat in Beijing. Figure 4. The application of property of price. Another application: we extract more than 21 thousand farmers markets from 14 million web POIs and display the distribution on map in China, which is shown in Figure 5. The result data was used to analysis and forecast of bird flu and had been shown to be important in the spatial epidemiology. Figure 5. The extracting result of farmers markets. 4. Conclusion This paper proposes a new classification method for multi-source POIs based on semantic analysis, which contains three types matching: category tag matching, name of POI matching and attributes matching. It can classify the POI datasets from their source taxonomy to a special one fast and automatically, that have been verified in the experiment and application. The method in this paper will save a lot of human labor, and promote the development of POI service effectively. Our experiments suggest that the efficiency of the classification method depends largely on the keywords and matching rules. In the future work, we will extend the taxonomy and enrich the keywords and regular expressions. This classification method will be better and better by machine learning from large number of POIs sample. Acknowedments This research was funded by National High Technology Research and Development Program of China (863 Program) under grant No. 2012AA12A402 and No. 2013AA12A403, and the Basic Research Fund of CASM under grant No. 7771403. References McKenzie Grant, Janowicz Krzysztof, Adams Benjamin (2013). Weighted multiattribute matching of user-generated points of interest, Proceedings of the ACM International Symposium on Advances in Geographic Information Systems, p 430-433 Filipe Rodrigues (2010): POI Mining and Generation. University of Coimbra, 2010 Santos M, Moreira A (2006). Automatic classification of location contexts with decision trees. Processdings of the Conference on Mobile and Ubiquitous Systems, Guimares, Portugal, 2006 Ying Xia, Shiyan Luo, Xu Zhang, Hae Yong Bea (2014). Organization and Retrieval Method of Multimodal Point of Interest Data based on Geo-ontology. Advanced Science and Technology Letters, Vol.45, pp49-54 Kerr Dermot, Coleman Sonya, Scotney Bryan (2008). Comparing cornerness measures for interest point detection, 2008 International Machine Vision and Image Processing Conference (IMVIP 2008), p 105-110 S. Ilarri, A lllarramendi, E Mena, A sheth (2011). Semantics in Location-Based Services. IEEE Internet Computing, Vol. 15, pp.10-14 POI Classification, Stratification and Attribute Structure Document of Geographic Information Public Service Platform. www.esrichina-bj.cn Yogesh Dwivedi, Navonil Mustafee, Micheal D, Banita Lal (2009). Classification of information systems research revisited: A keyword analysis approach. Pacific Asia Conference on Information Systems (PACIS), PACIS 2009 Proceedings Viktoratos I., Tsadiras A., Bassiliades N (2013). A Rule Based Personalized Location In-formation System for the Semantic Web.14th International Conference, EC-Web 2013, Prague, Czech Republic, August 27-28. Proceedings. pp 27-38 Yu H, Zhang H, Liu Q (2003) Recognition of Chinese Organization Name Based on Role Tagging. Proceedings of the 20th International Conference on Computer Processing of Oriental Languages, 2003 Li J, Wang D, Wang X (2008). Chinese organization name recognition based on template matching. Information Technology, 6(25):p97-99 Li L, Goodchild MF (2010) Automatically and accurately matching objects in geo-spatial datasets. Theory, Data Handling and Modelling in GeoSpatial Information Science. Hong Kong, 26-28 May, 2010