Online Prices Checking System based on Web Information Extraction

Sawsan Ali Hamid, Noor Liza Binti Ahmad, Hamzah Ali Al-Aidaros, Dr. Norliza Katuk
School of Computing, College of Arts and Science, Universiti Utara Malaysia, Kedah, Malaysia
Sawsan_ali1983@yahoo.com, liza.khasasi@gmail.com, m7amza7@yahoo.com, k.norliza@uum.edu.my

Abstract: Information Extraction (IE) has become well known owing to the prevalence and rapid development of Internet technology. Users often extract information from the Internet by browsing, for example when checking email, reading news, taking part in social communities, checking prices, and comparing prices. To obtain accurate information, especially when comparing prices, users need to visit numerous relevant websites and do the comparison manually. Because consumers are increasingly price conscious, especially when buying household items such as groceries and fresh food, we have developed a system that performs the comparison for them. The purpose of this paper is to give a general overview of web information extraction and to explain how to implement a price comparison website based on the information extraction concept.

Key words: Web, Information Extraction, Internet, price, check, system models, techniques.

I. INTRODUCTION
The amount of information and data shared on the Internet is growing rapidly, and all of this data is open for anyone to use and query.
Users must spend considerable time and effort to reach the correct information and then make a decision, especially when comparing the prices of grocery and household items. Efficiency in retrieving the information is sacrificed for the sake of accuracy. Information Extraction (IE) aims to eliminate the many repeated queries that users would otherwise have to issue to reach the right information, and to help them decide faster based on the gathered information. IE is the automatic extraction, by a program, of structured information from unstructured data. Its major task is to take a search text query, identify the corresponding data in semi-structured or unstructured information from the queried web pages or from a collection of documents, and turn it into structured data. Supporting systems or applications can be built on the information extraction principle, with information gathered from multiple websites. Examples include a local news application that collects stories from multiple local newspaper websites, a currency exchange rate application that queries multiple currency exchange websites, and a price checking application that obtains the prices of goods from the websites of hypermarkets, department stores, online stores, and small or medium-sized enterprise (SME) shops. This paper focuses on how information extraction is used to obtain and compare grocery and household item prices from Tesco hypermarket, the Youbeli online store, and Giant hypermarket.
Section 1 of this paper introduces Web Information Extraction (WIE). Section 2 provides different definitions of information extraction and its properties.
Section 3 explains the system models used in the price checking and comparison system. Section 4 describes the system in terms of its programming environment and the techniques used to extract information from different websites. Section 5 presents the importance and benefits of the price checking and comparison system. The last section, Section 6, summarizes the paper and outlines the future work planned for the developed system.

II. WEB INFORMATION EXTRACTION CONCEPT
The information on the Internet has become richer and more abundant with the rapid development of Internet technology in the twentieth century. Because so much information has accumulated, users have many possible sources, and they need to focus on choosing the right source or website from this huge amount of information [1]. Many search engines, such as Google and Yahoo, are ready to be used out of the box. Their drawback is that they do not provide the needed information directly. They return all relevant search results with the associated web links [2], and users must then filter these manually to reach the correct information.

A. Information Extraction Concept
The Web is an enormous repository of information contained in billions of individual websites and pages. Information extraction (IE) tries to process this information and make it available for querying by a program or software layer. An information extraction system (IES) is typically targeted at particular domains of interest and involves either manual or semi-automatic query processes. The objective of automatic information extraction is to discover the relationships between identical or similar data items from several separate domains [3]. The definition of information extraction differs between international and domestic research.
International research holds that the task of information extraction is to locate specific information in natural language text, while others assume that conventional Information Extraction (IE) requires considerable human involvement or a set of manually written rules. In China, a number of scholars consider information extraction to be the extraction of a particular class of information from a given piece of text, together with giving users a platform to query the resulting structured data [4]. In general, information extraction is the automatic extraction of structured information from unstructured documents, which makes the information more machine-processable, more practical, and more accessible to users. Its purpose is to build large knowledge bases [5]. Its main function is to extract exact data from the given text, that is, to recognize specific pieces of data in semi-structured or unstructured textual documents, drawn from multiple web pages or from a document collection, and convert them into a structured format. The principle can be applied to many types of text search, such as searching scientific articles, newspaper articles, web pages, newsgroup messages, medical notes, classified advertisements, and banking documents. The extracted information is then put into a structured form that can be used for further analysis or data drilling [5]. Web information extraction is also used to extract data from semi-structured documents that consist of tree-like tag structures, sentences, and free-text paragraphs [6].

B. Information Extraction Examples
Many information extraction systems are available. Two examples are as follows:
1) MedEx System
Medication information is one of the most important types of clinical data in electronic medical records. It is vital for the safety and quality of healthcare, and it is also very important for clinical research that uses electronic medical record data.
However, medication data are often recorded in clinical notes as free text rather than as coded data that computers can process directly. MedEx is a natural language processing system that extracts medication information from clinical notes [7] and turns it into machine-processable information.
2) Protein Active Site Template Acquisition (PASTA)
The Protein Active Site Template Acquisition (PASTA) system automatically extracts information about the roles of specific amino acid residues in protein molecules from online scientific articles and abstracts. PASTA is the first information extraction system developed for the protein structure domain and one of the most thoroughly evaluated information extraction systems operating on biological scientific text to date [8].

C. Information Extraction Challenges
Existing technologies are not yet satisfactory for web information extraction. Three major challenges identified in completing this project are as follows:
1) Structure Instability Hazard: The structure of a website's pages changes incidentally and frequently. A web information extraction system may be invalidated or may terminate unexpectedly [9] because of even a slight change to the existing data structure.
2) Dependence on Heuristics Hazard: The heuristic information used by an information extraction system (such as rules for a domain or website) is tied to a fixed website or a fixed domain. Extraction algorithms that rely heavily on such heuristic constraints make the system poorly suited to scaling and adaptation.
3) Ambiguity of Structure Hazard: Hypertext markup language (HTML) pages may contain unspecified and unexpected structures, such as missing features, varying feature orders, and so on. These ambiguities require the information extraction system to be more expressive in representing high-level semantics.

III. SYSTEM MODELS
A. Use Case Diagram: Figure 1 shows the use case diagram of the price checking website and how the user interacts with it. The system provides a price search function for products sold by Tesco hypermarket, the Youbeli online shop, and Giant hypermarket. In short, the user can access the site, check prices, and compare them across three different websites: Tesco, Youbeli, and Giant.
Figure 1: Price Checking System Use Case Diagram

B. Sequence Diagram: Figure 2 illustrates the sequence diagram of the price checking website and the agents used to contact the predefined websites (Tesco, Youbeli, and Giant) once a user triggers a price check. A maximum of five items can be selected in the price checking system at one time. The user first selects a category to narrow the list of available items, then selects the item to query, and then keys in the required quantity for that item. After that, the user clicks the Check Price button to retrieve the results from the predefined websites. Finally, the user clicks the Get Total Price button to display the processed results and the total expected expenses in grid view format.

IV. SYSTEM DESCRIPTION
A. Program Environment: The price checking system was developed with a combination of ASP.NET, C#, JavaScript, and JSON. Visual Studio 2010 Express was our development tool, as it is a free official edition provided by Microsoft. In addition, XML files are used to keep lightweight data for the user selection lists and to temporarily store query results for indexing and rendering before the collected results are presented in the result division of the page.

B. Extraction Approach/Techniques: We use the free Google Custom Search application programming interface (API) to extract information from the source websites. The API is invoked on the server side, within a web server.
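As a rough illustration of a server-side Custom Search query, the sketch below builds a request URL for Google's Custom Search JSON API and pulls a price out of each result snippet. It is not the system's actual code, which was written in ASP.NET and C#; Python is used here only for brevity. The endpoint and the `key`, `cx`, and `q` parameters belong to the public Custom Search JSON API. The helper names, the placeholder credentials, the sample response, and the assumption that prices appear in snippets as "RM" followed by an amount are all hypothetical.

```python
import re
from urllib.parse import urlencode

# Public endpoint of the Custom Search JSON API.
CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

# Assumption: prices appear in snippets as "RM" followed by an amount, e.g. "RM 12.90".
PRICE_PATTERN = re.compile(r"RM\s*(\d+(?:\.\d{1,2})?)")


def build_search_url(api_key, engine_id, query):
    """Return the GET URL for one Custom Search query (hypothetical helper)."""
    params = {"key": api_key, "cx": engine_id, "q": query}
    return CSE_ENDPOINT + "?" + urlencode(params)


def extract_prices(response):
    """Map each result link to the first price found in its snippet."""
    prices = {}
    for item in response.get("items", []):
        match = PRICE_PATTERN.search(item.get("snippet", ""))
        if match:
            prices[item["link"]] = float(match.group(1))
    return prices


# Made-up response fragment, shaped like the API's JSON
# (an "items" list whose entries carry "link" and "snippet").
sample_response = {
    "items": [
        {"link": "https://store.example/milk",
         "snippet": "Fresh milk 1L RM 6.50 per carton"},
        {"link": "https://store.example/about",
         "snippet": "About our store"},
    ]
}

url = build_search_url("YOUR_API_KEY", "YOUR_ENGINE_ID", "fresh milk 1L")
prices = extract_prices(sample_response)
```

Results whose snippets contain no recognizable price are simply skipped, which is one reason a snippet-based approach is sensitive to changes in how the source pages present their data.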
A server-side web API is a programming interface defined in terms of request-response messages, typically expressed in XML or JSON and exchanged over the web, most commonly through an HTTP-based web server [9]. An API is a set of procedural functions with input and output parameters. Existing binding specifications adapt these structural elements to languages such as C, Java, Python, Perl, Ruby, and C# [10]. To obtain short descriptions of the search results, we made use of the Google Custom Search API [11].
Figure 2: Price Checking Sequence Diagram

C. Activity Diagram: Figure 3 shows the activity diagram of the price checking website and explains the activities that run in the background during a price checking operation.
Figure 3: Price Checking Activity Diagram

Google offers a search engine customization product named Google Custom Search Engine. This product provides a few customization options that can be used to emphasize particular resources, and thereby to determine how the choice of information sources affects performance [12]. The Google Custom Search API allows websites and programs to retrieve and display search results from Google Custom Search. As Giant hypermarket does not publish its product prices online, we used the website savershub.com to obtain prices for Giant. That website is a small-enterprise website that aims to improve the way business-related information is searched.

C. Functions and Screen Layout: Free web hosting from Somee.com is used to host our price checking system. We developed our code and uploaded it to a subdomain on the Somee.com host server, at the URL http://www.rmpricecheck.somee.com/. Picture 1 shows the home page of our website.
Picture 1: Price Checking System - Home Page Screenshot

Picture 2 shows the user interface, where users select a category, select the items to query, and enter the required quantity for each selected item.
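The selection and totalling steps that this interface supports (at most five items, each with a quantity, and a total per store) can be illustrated with a small sketch. It is an outline under stated assumptions, not the system's ASP.NET/C# code: the function name, the policy of reporting no total for a store that lacks a price, and all store price tables and prices are made-up placeholders.

```python
MAX_ITEMS = 5  # the system lets a user query at most five items at one time


def total_per_store(selection, store_prices):
    """Return {store: total cost} for a selection of {item: quantity}.

    Assumed policy: a store that lacks a price for any selected item gets
    None, so incomplete totals are never compared with complete ones.
    """
    if len(selection) > MAX_ITEMS:
        raise ValueError("at most %d items can be selected" % MAX_ITEMS)
    totals = {}
    for store, prices in store_prices.items():
        if all(item in prices for item in selection):
            cost = sum(prices[item] * qty for item, qty in selection.items())
            totals[store] = round(cost, 2)
        else:
            totals[store] = None
    return totals


# Placeholder prices (assumed to be in RM); not real data from any store.
store_prices = {
    "Tesco": {"rice 5kg": 14.00, "eggs x10": 4.50},
    "Youbeli": {"rice 5kg": 15.50, "eggs x10": 4.00},
    "Giant": {"rice 5kg": 13.00},
}
selection = {"rice 5kg": 2, "eggs x10": 1}

totals = total_per_store(selection, store_prices)
# Pick the store with the lowest complete total.
cheapest = min((s for s in totals if totals[s] is not None), key=totals.get)
```

With these placeholder figures, only the stores that price every selected item receive a total, and the comparison is made among those stores alone.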
The bottom part of the form shows the “Check Price” button, which users click to activate the search function. The “Get Total Price” button, located next to the “Check Price” button, displays the results to the users in grid view format. The “Reset” button next to it resets and clears the previous results and controls for the next query session.
Picture 2: Price Checking System - User Interface Screenshot

Picture 3 shows the search results processed from the queries made to the three predefined websites. The total expected expenses for each hypermarket are also shown at the end of the grid view.
Picture 3: Price Checking System - Search Result Screenshot

V. DISCUSSION
A. The Importance of the Developed System: Our main vision for the website is to enable Malaysian consumers to query grocery prices from multiple websites. The website is intended to be the gateway that users refer to when they want to know the prices of their preferred grocery and household items at various hypermarkets, department stores, and online stores. Indirectly, this can also help consumers to plan their budget and expenses in advance, before the actual shopping takes place.

B. The Benefits of the System: Our system has several benefits, such as:
Comparing the prices of grocery and household items.
Helping with budgeting expenses on groceries and household items.

VI. CONCLUSION AND FUTURE WORK
Information Extraction is a well-known concept owing to the prevalence and rapid development of Internet technology and to the increasing amount of data and information readily available on the Internet. This paper explored the concept of web information extraction and explained the system models used in developing the price checking system. The programming environment and the techniques used to extract the information were also explained in brief. Finally, we discussed the importance and the benefits of this system.
The future work planned as enhancements to the developed price checking system is 1) to add more shops or websites to query, such as Presto, Redtick, MatRuncit, Mydin, and others, 2) to increase and diversify the categories and items appended to the initial set of selection lists, 3) to add a function for printing the shopping list, and 4) to add a function that lets users save the shopping list to their own local drive, which may help them analyze their shopping trends and habits.

REFERENCES
[1] T. Q. Dung and W. Kameyama, "A proposal of ontology-based health care information extraction system: Vnhies," in Research, Innovation and Vision for the Future, 2007 IEEE International Conference on, 2007, pp. 1-7.
[2] C.-H. Chang, M. Kayed, R. Girgis, and K. F. Shaalan, "A survey of web information extraction systems," Knowledge and Data Engineering, IEEE Transactions on, vol. 18, pp. 1411-1428, 2006.
[3] W. Gatterbauer, P. Bohunsky, M. Herzog, B. Krüpl, and B. Pollak, "Towards domain-independent information extraction from web tables," in Proceedings of the 16th international conference on World Wide Web, 2007, pp. 71-80.
[4] H. Mingsheng, J. Zhijuan, and Z. Xiangyu, "An approach for text extraction from web news page," in Robotics and Applications (ISRA), 2012 IEEE Symposium on, 2012, pp. 562-565.
[5] T. Zhou, C.-J. Sun, L. Lin, and B.-Q. Liu, "An information extraction system for heterogeneous Web source," in Machine Learning and Cybernetics (ICMLC), 2010 International Conference on, 2010, pp. 3287-3292.
[6] C.-W. Tsai, J.-H. Ho, T.-W. Liang, and C.-S. Yang, "An intelligent Web portal system for Web information region integration," in Systems, Man and Cybernetics, 2005 IEEE International Conference on, 2005, pp. 3878-3883.
[7] H. Xu, S. P. Stenner, S. Doan, K. B. Johnson, L. R. Waitman, and J. C. Denny, "MedEx: a medication information extraction system for clinical narratives," Journal of the American Medical Informatics Association, vol. 17, pp. 19-24, 2010.
[8] R. Gaizauskas, G. Demetriou, P. J. Artymiuk, and P. Willett, "Protein structures and information extraction from biological texts: the PASTA system," Bioinformatics, vol. 19, pp. 135-143, 2003.
[9] F. Hong and Z. Zhao, "Information extraction system in large-scale web," in Communications and Information Technology, 2005. ISCIT 2005. IEEE International Symposium on, 2005, pp. 809-812.
[10] P. Troger, H. Rajic, A. Haas, and P. Domagalski, "Standardization of an API for distributed resource management systems," in Cluster Computing and the Grid, 2007. CCGRID 2007. Seventh IEEE International Symposium on, 2007, pp. 619-626.
[11] M. Sappelli, S. Verberne, and W. Kraaij, "TNO and RUN at the TREC 2012 Contextual Suggestion Track: Recommending personalized touristic sights using Google Places," in 21st Text REtrieval Conference Notebook Proceedings (TREC 2012), 2013.
[12] R. Dragusin, P. Petcu, C. Lioma, B. Larsen, H. L. Jørgensen, I. J. Cox, L. K. Hansen, P. Ingwersen, and O. Winther, "FindZebra: A search engine for rare diseases," International journal of medical informatics, 2013.