Online Prices Checking by Noor Liza, Hamzah and Sawsan

Online Prices Checking System based on Web
Information Extraction
Sawsan Ali Hamid
School of Computing,
College of Arts and Science
Universiti Utara Malaysia,
Kedah, Malaysia
Sawsan_ali1983@yahoo.com
Noor Liza Binti Ahmad
School of Computing,
College of Arts and Science
Universiti Utara Malaysia,
Kedah, Malaysia
liza.khasasi@gmail.com
Hamzah Ali Al-Aidaros
School of Computing,
College of Arts and Science
Universiti Utara Malaysia,
Kedah, Malaysia
m7amza7@yahoo.com
Dr. Norliza Katuk
School of Computing,
College of Arts and Science
Universiti Utara Malaysia,
Kedah, Malaysia
k.norliza@uum.edu.my
Abstract: Information Extraction (IE) has become widely known
owing to the massive development of Internet technology. Users
often perform information extraction manually while browsing,
for example when checking email, reading news, following social
communities, checking prices, and comparing prices. To obtain
accurate information, especially when comparing prices, users
must visit numerous websites and do the comparison themselves.
Given the current trend of consumers being more price conscious,
especially when buying household items such as groceries and
fresh food, we have developed a system that performs the
comparison for the users. The purpose of this paper is to give a
general overview of web information extraction and to explain
how to implement a price comparison website based on the
information extraction concept.
Key words: Web, Information Extraction, Internet, price, check,
system models, techniques.
I. INTRODUCTION
The amount of information and data shared on the Internet
is growing rapidly, and the available data are open for anyone
to use and to query. Users must spend considerable time and
effort to reach the correct information and then to make a
decision, especially when comparing the prices of grocery and
household items. Efficiency in retrieving the information is
sacrificed in order to obtain accurate information.
Information Extraction (IE) aims to eliminate the many
separate queries that users would otherwise have to issue to
reach the right information, and to help them make faster
decisions based on the gathered information. IE is the
automatic extraction, by a program, of structured information
from unstructured or semi-structured data. Its major task is
to take a given search text query, identify the corresponding
data in semi-structured or unstructured web pages or in a
collection of documents, and convert it into structured data.
Supporting systems or applications can be developed on the
information extraction principle, gathering information from
multiple websites. Examples include a local news application
that collects stories from several local newspaper websites, a
currency exchange rate application that queries several
currency exchange websites, and a price checking application
that retrieves the prices of goods from hypermarkets,
department stores, online stores, and the websites of Small or
Medium sized Enterprises (SME) shops. This paper focuses on
how information extraction is used to obtain and compare the
prices of grocery and household items from the Tesco
hypermarket, the Youbeli online store, and the Giant
hypermarket.
Section 1 of this paper introduces Web Information
Extraction (WIE). Section 2 provides different definitions of
information extraction and its properties. Section 3 explains
the system models used in the price checking and comparison
system. Section 4 describes the system in terms of the
programming environment and the techniques used to extract
information from different websites. The importance and the
benefits of the price checking and comparison system are
discussed in Section 5. The last section, Section 6, summarizes
the paper and outlines the future work planned for the
developed system.
II. WEB INFORMATION EXTRACTION CONCEPT
The information on the Internet has become richer and more
abundant owing to the rapid development of Internet
technology in the twentieth century. As a result of the massive
amount of accumulated information, users have many sources to
choose from, and they need to focus on choosing the right
source or website [1]. Many search engines, such as Google and
Yahoo, are ready to be used out of the box. Their drawback is
that they do not provide the needed information directly; they
return all relevant search results with associated web links
[2], which users must filter manually to reach the correct
information.
A. Information Extraction Concept
The Web is an enormous repository of information contained
in billions of individual websites and pages. Information
extraction (IE) tries to process this information and make it
available to be queried by a program or software layer. An
information extraction system (IES) is typically targeted at
particular domains of interest and involves either manual or
semi-automatic query processes. The objective of automatic
information extraction is to discover the relationships between
exact or similar data items from several separate domains [3].
The definition of information extraction differs between
international and domestic research. International research
holds that the task of information extraction is to locate
specific information in natural language text, while others
assume that conventional Information Extraction (IE) requires
considerable human involvement or a set of manually defined
rules. In China, a number of scholars hold that information
extraction means extracting a particular class of information
from a provided section of text and, in addition, giving users
a platform for querying information from structured data [4].
Information extraction in general is the automatic
extraction of structured information from unstructured
documents, making information more machine-processable, more
practical, and more accessible to users. The purpose of
information extraction is to build large knowledge bases [5].
Its main function is to extract exact data from the given text,
and to recognize exact pieces of data in semi-structured or
unstructured textual documents, from multiple web pages or from
a collection of documents, and put them into a structured
format. The principle can be applied to many types of text,
such as scientific articles, newspaper articles, web pages,
newsgroup messages, medical notes, classified advertisements
and banking. The extracted information is then put into a
structured form that can be used for further analysis or data
drilling [5]. Web information extraction is also used to
extract data from semi-structured documents that consist of
tree-like structured tags, sentences, and free-text
paragraphs [6].
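As a minimal sketch of this idea, the Python fragment below turns a semi-structured HTML fragment into structured records. The markup, the class names `product`, `name` and `price`, the helper names, and the sample values are all hypothetical; they are not taken from any of the websites discussed in this paper.

```python
from html.parser import HTMLParser

# Hypothetical semi-structured input; real store pages will differ.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Rice 5kg</span><span class="price">RM 13.90</span></li>
  <li class="product"><span class="name">Cooking Oil 1kg</span><span class="price">RM 2.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects {name, price} records from <span class="name|price"> tags."""

    def __init__(self):
        super().__init__()
        self.records = []    # structured output
        self._field = None   # field currently being read, if any
        self._current = {}   # record under construction

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self._current = {}
        elif tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        # Text outside a name/price span is ignored.
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            self.records.append(self._current)
            self._current = {}


def extract_products(html):
    """Return a list of dictionaries, one per product found in the HTML."""
    parser = ProductParser()
    parser.feed(html)
    return parser.records


if __name__ == "__main__":
    for record in extract_products(SAMPLE_HTML):
        print(record)
```

The free-text page becomes a list of dictionaries that a program can sort, filter, or compare, which is the sense of "structured format" used in this section.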
B. Information Extraction examples
Many information extraction systems are available. Two
examples are as follows:
1) MedEx System.
Medication information is one of the most important
types of clinical data in electronic medical records. It is
vital for the safety and quality of healthcare, and it is very
important for clinical research that uses electronic medical
record data. However, medication data are often recorded in
clinical notes as free text rather than as coded data that
computers can process directly. MedEx is a natural language
processing system that extracts medication information from
clinical notes [7] and converts it into machine-processable
information.
2) Protein Active Site Template Acquisition (PASTA)
The Protein Active Site Template Acquisition (PASTA) system performs automatic extraction of information
relating to the roles of specific amino acid residues in protein molecules from online scientific articles and abstracts.
PASTA is the first information extraction system developed
for the protein structure domain and one of the most thoroughly evaluated information extraction systems operating
on biological scientific text to date [8].
C. Information Extraction Challenges
Existing technologies are not fully satisfactory for the web
information extraction process. Three major challenges
identified in completing this project are as follows:
1) Structure Instability Hazard: the structure of a website's
pages changes frequently and without notice. Any slight change
to the existing data structure may invalidate the web
information extraction system or cause it to terminate
unexpectedly [9].
2) Dependence on Heuristics Hazard: the heuristic information
(such as rules for a domain or website) used by an information
extraction system relies on a fixed website or a fixed domain.
An information extraction system whose algorithms rely heavily
on heuristic constraints is not well suited to scaling and
adaptation.
3) Ambiguity of Structure Hazard: hypertext markup language
(HTML) pages might have unspecified and unexpected structures,
such as missing features, multiple features, and varying
feature order. Such ambiguities require the information
extraction system to be more expressive in representing
high-level semantics.
III. SYSTEM MODELS
A. Use Case Diagram:
Figure 1 below shows the use case diagram of the price
checking website and how the user interacts with it. The
system provides a price search function for products sold at
the Tesco hypermarket, the Youbeli online shop, and the Giant
hypermarket. In short, the user can access, check, and compare
prices from three different websites: Tesco, Youbeli and
Giant.
Figure 1: Price Checking System Use Case Diagram
B. Use Sequence Diagram:
Figure 2 illustrates the sequence diagram of the price
checking website and the agents used to contact the
predefined websites (Tesco, Youbeli and Giant) once the price
checking action is triggered by a user. A maximum of five
items can be selected in the price checking system at one
time. The user first selects a category to narrow the list of
available items, then selects the required item, and then keys
in the required quantity for that item. After that, the user
clicks the Check Price button to retrieve the results from the
predefined websites. Finally, the user clicks the Get Total
Price button to display the processed result and the total
expected expenses in a grid view format.
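The workflow described above can be sketched as follows. This is an illustration in Python, not the system's actual ASP.NET/C# code: the function name `total_per_store`, the data layout, and every price are hypothetical. Only the five-item limit and the three store names come from the description above. Prices are held as integers in sen to avoid floating-point rounding.

```python
MAX_ITEMS = 5  # the system allows at most five items per query


def total_per_store(selection, prices):
    """Return the total expected expense per store, in sen.

    selection: {item name: quantity}
    prices:    {store name: {item name: unit price in sen}}
    A store's total is None if it has no price for a selected item.
    """
    if len(selection) > MAX_ITEMS:
        raise ValueError(f"at most {MAX_ITEMS} items can be selected")
    totals = {}
    for store, price_list in prices.items():
        if all(item in price_list for item in selection):
            totals[store] = sum(
                price_list[item] * quantity
                for item, quantity in selection.items()
            )
        else:
            totals[store] = None
    return totals


if __name__ == "__main__":
    # Hypothetical unit prices in sen, for illustration only.
    prices = {
        "Tesco": {"rice 5kg": 1390, "sugar 1kg": 250},
        "Youbeli": {"rice 5kg": 1450, "sugar 1kg": 240},
        "Giant": {"rice 5kg": 1350},
    }
    selection = {"rice 5kg": 2, "sugar 1kg": 3}
    for store, total in total_per_store(selection, prices).items():
        label = "n/a" if total is None else f"RM {total / 100:.2f}"
        print(f"{store}: {label}")
```

Returning `None` for a store with an incomplete price list is one possible design choice; it keeps a partial total from looking cheaper than a complete one.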
Figure 2: Price Checking Use Sequence Diagram
C. Use Activity Diagram:
Figure 3 demonstrates the activity diagram of the price
checking website and explains the activities running in the
background during the price checking operation.
IV. SYSTEM DESCRIPTION
A. Program Environment:
The price checking system was developed using a
combination of ASP.NET, C#, JavaScript and JSON. Visual
Studio 2010 Express was used as the development tool because
it is a free official version provided by Microsoft. In
addition, XML files are used to keep lightweight data for the
user selection lists and to temporarily store query results
for indexing and rendering before the collective result is
presented in the result division of the page.
B. Use Extraction Approach/ Techniques:
We make use of the free Google Custom Search application
programming interface (API) to extract information from the
source websites. The API is applied on the server side, within
a web server. A server-side web API is a programmatic
interface that exposes a defined request-response message
system, typically expressed in XML or JSON, via the web, most
commonly by means of an HTTP-based web server [9].
An API is a set of procedural functions with input and output
parameters. Existing binding specifications adapt these
structural elements to languages such as C, Java, Python, Perl,
Ruby and C# [10]. To obtain short descriptions of the search
results we made use of the Google Custom Search API [11].
Google offers a search engine customization product named
Google Custom Search Engine. This product has a few
customization options that can be used to emphasize particular
resources and thereby determine how the selection of
information sources influences performance [12]. The Google
Custom Search API allows websites and programs to retrieve and
display search results from Google Custom Search.
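As a rough sketch of how such a request could be issued and its results reduced to prices, the Python fragment below builds a request URL for the Custom Search JSON API and scans result snippets for a ringgit amount. The endpoint and the `key`, `cx` and `q` parameters, as well as the `items`, `title` and `snippet` response fields, follow Google's public documentation. The helper names, the `RM` price pattern, the placeholder credentials, and the sample response are assumptions made for illustration; the system described in this paper was written in C#, and its actual extraction logic is not shown here.

```python
import re
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(api_key, engine_id, query):
    """Build a Custom Search JSON API request URL (no network call is made)."""
    params = {"key": api_key, "cx": engine_id, "q": query}
    return f"{API_ENDPOINT}?{urlencode(params)}"


# Assumed price format: "RM" followed by an amount such as 13.90.
PRICE_PATTERN = re.compile(r"RM\s*(\d+(?:\.\d{1,2})?)")


def extract_prices(response):
    """Return (title, price) pairs for result items whose snippet shows a price."""
    results = []
    for item in response.get("items", []):
        match = PRICE_PATTERN.search(item.get("snippet", ""))
        if match:
            results.append((item.get("title", ""), float(match.group(1))))
    return results


if __name__ == "__main__":
    # Placeholder credentials; a real key and engine ID are required.
    print(build_search_url("YOUR_API_KEY", "YOUR_ENGINE_ID", "rice 5kg"))

    # Hand-written sample shaped like a decoded JSON response.
    sample_response = {
        "items": [
            {"title": "Rice 5kg", "snippet": "Rice 5kg now RM 13.90 per pack"},
            {"title": "Store locator", "snippet": "Find a store near you"},
        ]
    }
    print(extract_prices(sample_response))
```

Matching prices with a regular expression over snippets is fragile in the way Section II.C describes: if a site changes how it prints prices, the pattern silently stops matching.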
As the Giant hypermarket does not offer its product pricing
online, we used the website savershub.com to obtain pricing
for Giant. The website is a small-enterprise website that aims
to improve the way business-related information is searched.
C. Functions and Screen Layout:
Somee.com free web hosting is used to host our price
checking system. We developed our code and uploaded it to a
subdomain on the Somee.com host server with the URL:
http://www.rmpricecheck.somee.com/. Picture 1 shows the
home page of our website.
Figure 3: Price Checking Use Activity Diagram
Picture 1: Price Checking System - Home Page Screenshot
Picture 2 shows the user interface designed for users to
select a category, select the required items to query, and
specify the required quantity for each selected item. The
bottom part of the form shows the “Check Price” button, which
users click to activate the search function. The “Get Total
Price” button, located next to the “Check Price” button, is
used to display the results to the users in grid view format.
The “Reset” button next to it is used to reset and clear
previous results and controls for the next query session.
Picture 2: Price Checking System - User Interface Screenshot
Picture 3 shows the search results processed from the
queries made to the three predefined websites. The total
expected expenses for each hypermarket are also shown at the
end of the grid view.
Picture 3: Price Checking System - Search Result Screenshot
V. DISCUSSION
A. The Importance of the Developed System:
Our main vision for the website is to enable Malaysian
consumers to query grocery prices from multiple websites.
Our website is intended to be the gateway for users who
want to know the prices of their preferred grocery and
household items at various hypermarkets, department stores and
online stores. Indirectly, this can also help consumers to plan
their budget and expenses in advance, before the actual
shopping takes place.
B. The Benefits of the System:
Our system has numerous benefits, such as:
• Comparing the prices of grocery and household items.
• Helping with budgeting expenses on groceries and
household items.
VI. CONCLUSION AND FUTURE WORK
Information Extraction is a well-known concept owing to the
massive development of Internet technology and to the
increasing amount of data and information readily available on
the Internet.
This paper explored the concept of web information
extraction and explained the system models used in developing
the price checking system. The programming environment and the
techniques used to extract the information were also explained
in brief. Finally, we discussed the importance and the benefits
of this system.
Future work planned to enhance the developed price checking
system includes 1) adding more shops or websites to query, such
as Presto, Redtick, MatRuncit, Mydin, and others, 2) increasing
and diversifying the categories and items in the initial set of
selection lists, 3) adding a function to print the shopping
list, and 4) adding a function for users to save the shopping
list to their own local drive, which may help them analyze
their shopping trends and habits.
REFERENCES
[1] T. Q. Dung and W. Kameyama, "A proposal of ontology-based
health care information extraction system: Vnhies," in Research,
Innovation and Vision for the Future, 2007 IEEE International
Conference on, 2007, pp. 1-7.
[2] C.-H. Chang, M. Kayed, R. Girgis, and K. F. Shaalan, "A survey
of web information extraction systems," Knowledge and Data
Engineering, IEEE Transactions on, vol. 18, pp. 1411-1428, 2006.
[3] W. Gatterbauer, P. Bohunsky, M. Herzog, B. Krüpl, and B. Pollak,
"Towards domain-independent information extraction from web
tables," in Proceedings of the 16th international conference on
World Wide Web, 2007, pp. 71-80.
[4] H. Mingsheng, J. Zhijuan, and Z. Xiangyu, "An approach for text
extraction from web news page," in Robotics and Applications
(ISRA), 2012 IEEE Symposium on, 2012, pp. 562-565.
[5] T. Zhou, C.-J. Sun, L. Lin, and B.-Q. Liu, "An information
extraction system for heterogeneous Web source," in Machine
Learning and Cybernetics (ICMLC), 2010 International
Conference on, 2010, pp. 3287-3292.
[6] C.-W. Tsai, J.-H. Ho, T.-W. Liang, and C.-S. Yang, "An intelligent
Web portal system for Web information region integration," in
Systems, Man and Cybernetics, 2005 IEEE International
Conference on, 2005, pp. 3878-3883.
[7] H. Xu, S. P. Stenner, S. Doan, K. B. Johnson, L. R. Waitman, and
J. C. Denny, "MedEx: a medication information extraction system
for clinical narratives," Journal of the American Medical
Informatics Association, vol. 17, pp. 19-24, 2010.
[8] R. Gaizauskas, G. Demetriou, P. J. Artymiuk, and P. Willett,
"Protein structures and information extraction from biological
texts: the PASTA system," Bioinformatics, vol. 19, pp. 135-143,
2003.
[9] F. Hong and Z. Zhao, "Information extraction system in large-scale
web," in Communications and Information Technology, 2005.
ISCIT 2005. IEEE International Symposium on, 2005, pp. 809-812.
[10] P. Troger, H. Rajic, A. Haas, and P. Domagalski, "Standardization
of an API for distributed resource management systems," in
Cluster Computing and the Grid, 2007. CCGRID 2007. Seventh
IEEE International Symposium on, 2007, pp. 619-626.
[11] M. Sappelli, S. Verberne, and W. Kraaij, "TNO and RUN at the
TREC 2012 Contextual Suggestion Track: Recommending
personalized touristic sights using Google Places," in 21st Text
REtrieval Conference Notebook Proceedings (TREC 2012), 2013.
[12] R. Dragusin, P. Petcu, C. Lioma, B. Larsen, H. L. Jørgensen, I. J.
Cox, L. K. Hansen, P. Ingwersen, and O. Winther, "FindZebra: A
search engine for rare diseases," International journal of medical
informatics, 2013.