Movie Showtime by Saiful Azmi, Omar Tariq

Web Information Extraction: Movie Showtimes System
Omar Tariq Mohammed, Mohammed Ahmed Taiye, and Saifulazmi Tayib
School of Computing
Universiti Utara Malaysia, Kedah, Malaysia
omar_alsaegh@hotmail.com, tfeatslekan@gmail.com, imzalufias@yahoo.com
Abstract—Web information extraction is very popular
among organizations and businesses, which use it to collect
information from various websites and to produce analyses or
statistical reports from the extracted information. The technique
helps users who need to make decisions. In this paper, a web
information extraction system, the Movie Showtimes System, is
proposed to extract the required information, namely movie show
dates and times, from the websites of selected cinema operators in
Malaysia. Web information extraction can be implemented with
several approaches, such as wrappers, regular expressions,
classifiers, and sequence models. The system design is modeled
with a use case diagram, a sequence diagram, and an activity
diagram. Developing the system requires choosing the right
programming environment, a suitable extraction approach, and
appropriate system functions. The system should benefit users by
removing the need to check each operator’s website separately.
Future improvements are also needed to ensure that the system
achieves its purpose.
Keywords—Information extraction, web, movie show time.
I. INTRODUCTION
Information has become essential in driving businesses forward.
The Internet, as a source of information, provides what these
businesses need for their decision-making processes. In this
context, businesses use web information extraction to analyze
product specifications, pricing information, and market trends
from various websites.
Generally, information on the Internet comes in the form of
text. There are three basic concepts for text representation:
data, information, and knowledge [1]. Data is unstructured text
that needs to be processed and represented in a way that humans
can understand. Information is data that has been processed into
a structured form that humans can understand. Knowledge, in turn,
is processed information that formulates facts derived from that
information.
The Movie Showtimes System is a proposed web information
extraction system that allows the user to search for movie show
times from three main cinema operators in Malaysia. With this
system, the user does not need to browse each operator’s website
to find which shows are available.
Section II discusses the definition and concepts of web
information extraction systems. Section III describes the
proposed system models, and Section IV describes the proposed
system. Section V discusses the importance and benefits of the
proposed system, and Section VI concludes the paper.
II. WEB INFORMATION EXTRACTION CONCEPTS AND SYSTEMS
Web information extraction has been defined in many ways, but
the definitions share the same meaning. It can be defined as the
technique that converts information taken from web resources,
which is in the form of natural language text, into a structured
knowledge representation with a fixed format in a database [2].
Instances of a particular class of events or relationships in a
natural language text are identified and extracted to be
transformed into a structured representation. Web information
extraction simplifies the huge amount of information available
on the Internet, gathers the information from multiple
resources, and organizes it into a formatted report.
One extraction approach uses wrapper technology. A wrapper is a
program that extracts the information the user is searching for
from web pages and puts it into a specified format. Regular
expressions can also be used as an extraction approach. In this
approach, the extraction pattern is defined by the sequence of
characters in the text [3]. It is usually used for pattern
matching or string matching.
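As a rough illustration of the regular expression approach, the following Python sketch pulls a title, a date, and a list of show times out of a small text sample. The markup in SAMPLE, the pattern names ROW and TIME, and the function extract_showtimes are all hypothetical; real cinema pages are structured differently, and the proposed system itself is written in PHP.

```python
import re

# Hypothetical listing markup, invented for this example only.
SAMPLE = """
<li class="show">Movie A | 2016-02-04 | 11:30 AM, 2:15 PM, 9:00 PM</li>
<li class="show">Movie B | 2016-02-11 | 1:00 PM, 10:45 PM</li>
"""

# One listing row: title, ISO date, then a free-form list of times.
ROW = re.compile(
    r'<li class="show">(?P<title>[^|<]+)\|\s*'
    r'(?P<date>\d{4}-\d{2}-\d{2})\s*\|(?P<times>[^<]+)</li>'
)
# A single 12-hour clock time such as "2:15 PM".
TIME = re.compile(r"\b\d{1,2}:\d{2}\s?(?:AM|PM)\b")


def extract_showtimes(text):
    """Return one dict per listing row matched by the character pattern."""
    results = []
    for match in ROW.finditer(text):
        results.append({
            "title": match.group("title").strip(),
            "date": match.group("date"),
            "times": TIME.findall(match.group("times")),
        })
    return results


if __name__ == "__main__":
    for row in extract_showtimes(SAMPLE):
        print(row)
```

A pattern like this depends entirely on the character sequence of the page, so it has to be revised whenever the page layout changes.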
Another approach uses classifiers such as the naïve Bayes
classifier and the maximum entropy model [3]. The naïve Bayes
classifier is a generative classifier that applies Bayes’
theorem with strong independence assumptions. The maximum
entropy model is a discriminative classifier that is widely
used in natural language processing.
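To make the classifier approach concrete, here is a minimal Python sketch of a multinomial naïve Bayes classifier with add-one smoothing that labels a line of page text as show time content or other content. The class name, the two labels, and the four training lines are invented for illustration and are far too few for real use.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenization; deliberately simplistic."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # log P(label) + sum of log P(token | label); tokens are
            # treated as independent given the label.
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokenize(doc):
                score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training lines, invented for this example.
TRAIN_DOCS = [
    "showtimes today 7:30 pm hall 2",
    "book tickets 9:00 pm show",
    "contact us for advertising",
    "terms and conditions of use",
]
TRAIN_LABELS = ["showtime", "showtime", "other", "other"]

if __name__ == "__main__":
    model = NaiveBayes().fit(TRAIN_DOCS, TRAIN_LABELS)
    print(model.predict("pm show tonight"))
    print(model.predict("terms of use"))
```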
Apart from the approaches mentioned above, sequence models such
as the hidden Markov model (HMM) and conditional random fields
(CRF) can also be used for information extraction [3]. A hidden
Markov model is a statistical model in which the system being
modeled is assumed to be a Markov process with hidden states.
Conditional random fields are a class of statistical modeling
methods often applied in pattern recognition and machine
learning, where they are used for structured prediction.
Many systems extract information from web-based resources. Two
examples of web information extraction systems are described
here:
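The idea of hidden states can be sketched with a two-state HMM decoded by the Viterbi algorithm. In the Python example below, the hidden states TITLE and TIME, the two observation types, and every probability are hand-picked assumptions for illustration; in practice they would be estimated from labeled pages.

```python
import math
import re

# All states and probabilities are illustrative assumptions.
STATES = ("TITLE", "TIME")
START = {"TITLE": 0.8, "TIME": 0.2}
TRANS = {
    "TITLE": {"TITLE": 0.6, "TIME": 0.4},
    "TIME": {"TITLE": 0.3, "TIME": 0.7},
}
EMIT = {
    "TITLE": {"word": 0.9, "clock": 0.1},
    "TIME": {"word": 0.1, "clock": 0.9},
}
CLOCK = re.compile(r"^\d{1,2}:\d{2}$")


def observe(token):
    """Map a raw token to one of the two observation symbols."""
    return "clock" if CLOCK.match(token) else "word"


def viterbi(tokens):
    """Return the most probable hidden state sequence for the tokens."""
    obs = [observe(token) for token in tokens]
    if not obs:
        return []
    # Log probabilities avoid numeric underflow on long sequences.
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][symbol]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("Movie A 11:30 14:15".split()))
```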
Fig. 2. The sequence diagram of the system
A. Electronic Citation Extraction System
The system implements automatic extraction for Indonesian
electronic journal systems, extracting information from the
e-journal sites of four universities [4]: University of
Indonesia, Jakarta; Bandung Institute of Technology, Bandung;
Udayana University, Bali; and Petra Christian University,
Surabaya.
The purpose of the system is to help relate electronic
documents that are available on the Internet to each other.
These documents include multiple electronic resources such as
technical reports, articles, journals, and papers.
Developed with the PHP language and a MySQL database, the
system is a web-based application that searches and shows
citation indexes extracted from the above-mentioned
institutions and the relationships between data collected from
each document.
The system starts when the user enters the desired keyword in
the field provided. This keyword might be the document’s title
or the author’s name. The system then displays the results for
that keyword.
B. Web Information/Knowledge Extraction System
The system, known as WIKE, is used to extract knowledge from
certain parts of websites [5]. At present, to collect
information related to a certain country, we have to refer to
the web pages that provide information about that country.
Likewise, to compare 20 or 100 countries, we have to refer to
the web pages of all of them.
The purpose of the WIKE system is to extract information from a
targeted web page based on which part of the page the user wants
to extract. First, the system gets typical web pages from the
desired web applications and generates an extraction pattern.
This extraction pattern is then used to extract information from
the previously selected web pages.
Fig. 1. The use case diagram of the system
Lastly, the WIKE system generates a table that shows the
extraction result. The extraction process is based on two
aspects: the part selection (which part of the website the user
needs) and the data type (what type of information the user
needs).
III. SYSTEM MODELS
The design of the system is represented by a use case diagram,
a sequence diagram, and an activity diagram. These diagrams give
a pictorial representation of the activities performed by the
system. The tool used for the system modeling is Visual Paradigm
for Unified Modeling Language 10.2.
The use case diagram is the foundation of the Unified Modeling
Language diagrams and gives a framework for how the other
diagrams will be represented. Use cases are a text-based method
of describing and documenting complex processes that adds detail
to the requirements outlined in the definition.
Each use case gives a set of activities that produce some output
from the designed system. A use case is executed in response to
a trigger, which is the event that causes it to run.
A. Use Case Diagram
The use case diagram for the system is shown in Figure 1.
B. Sequence Diagram
The sequence diagram for the system is shown in Figure 2.
C. Activity Diagram
The activity diagram for the system is shown in Figure 3.
Fig. 3. The activity diagram of the system
IV. SYSTEM DESCRIPTION
The development of the system can be described in three
major elements, which are the programming environment used,
the extraction approach or technique employed, and the
functions of the system.
A. The Programming Environment
The Movie Showtimes System is developed using HTML and the PHP
programming language. PHP is used because it is an open source
programming language suited to developing web-based
applications. The code is written in a web design and
development tool, Adobe Dreamweaver, which is used because it
provides syntax highlighting and makes PHP code easier to write.
Fig. 4. The system output for search by movie
The development of the system also involves the creation of a
database to store the cinema show times. The database is created
with MySQL, an open source relational database management system
(RDBMS). phpMyAdmin, a MySQL administration tool written in PHP,
is used to manage the system’s database.
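The paper does not give the database schema, so the following Python sketch only suggests what a show time table might look like. It uses the standard library sqlite3 module as a stand-in for MySQL, and the table name, the column names, and the helper functions are assumptions.

```python
import sqlite3

# Assumed layout: one row per cinema, movie, date, and time.
# Dates are stored as ISO strings and times as 24-hour HH:MM.
SCHEMA = """
CREATE TABLE IF NOT EXISTS showtimes (
    id        INTEGER PRIMARY KEY,
    cinema    TEXT NOT NULL,
    movie     TEXT NOT NULL,
    show_date TEXT NOT NULL,
    show_time TEXT NOT NULL,
    UNIQUE (cinema, movie, show_date, show_time)
)
"""


def open_database(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store(conn, rows):
    """Insert (cinema, movie, date, time) tuples, skipping duplicates."""
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO showtimes "
            "(cinema, movie, show_date, show_time) VALUES (?, ?, ?, ?)",
            rows,
        )


if __name__ == "__main__":
    conn = open_database()
    store(conn, [("Cinema X", "Movie A", "2016-02-04", "19:30")])
    print(conn.execute("SELECT COUNT(*) FROM showtimes").fetchone()[0])
```

The uniqueness constraint is one way to keep repeated extraction runs from storing the same show time twice.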
B. The Extraction Approach and Technique
The system extracts cinema show times from the websites of
these three cinema operators: Golden Screen Cinemas
(http://www.gsc.com.my), TGV Cinemas (http://www.tgv.com.my),
and MBO Cinemas (http://www.mbocinemas.com).
The extraction process begins with these three cinemas’
websites defined in the system. The system then opens each
website’s pages and reads their source code. From the source,
it extracts the required information, which is the movie title
together with its show dates and times.
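The steps of opening a page, reading its source, and extracting titles with their dates and times can be sketched as a small wrapper. The Python example below uses only the standard library. The markup it expects (a div with class "movie" and spans with class "show") is hypothetical and does not reflect how the operators’ pages are built, and the paper does not state which extraction technique the PHP implementation uses.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ShowtimeParser(HTMLParser):
    """Collect (title, date, time) tuples from hypothetical markup:

    <div class="movie" data-title="...">
      <span class="show" data-date="YYYY-MM-DD">HH:MM</span>
    </div>
    """

    def __init__(self):
        super().__init__()
        self.title = None   # title of the movie block being read
        self.date = None    # set only while inside a "show" span
        self.rows = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "movie":
            self.title = attrs.get("data-title")
        elif tag == "span" and attrs.get("class") == "show":
            self.date = attrs.get("data-date")

    def handle_data(self, data):
        if self.date is not None and data.strip():
            self.rows.append((self.title, self.date, data.strip()))

    def handle_endtag(self, tag):
        if tag == "span":
            self.date = None


def fetch_source(url):
    """Read the source of one page; requires network access."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract(source):
    """Run the wrapper over page source and return the collected rows."""
    parser = ShowtimeParser()
    parser.feed(source)
    parser.close()
    return parser.rows


# Invented page fragment used instead of a live request.
SAMPLE = (
    '<div class="movie" data-title="Movie A">\n'
    '  <span class="show" data-date="2016-02-04">19:30</span>\n'
    '  <span class="show" data-date="2016-02-05">21:00</span>\n'
    '</div>'
)

if __name__ == "__main__":
    print(extract(SAMPLE))
```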
The extracted information is then inserted into the system’s
database in a structured format. While extraction is in
progress, the user can already perform searches: the system
receives the user’s input, matches the request against the
information stored in the database, and displays the result to
the user.
C. The Functions of the System
Through this system, the user can search for movie show times
from three main cinema operators in Malaysia, which are Golden
Screen Cinemas, TGV Cinemas, and MBO Cinemas.
The user can perform three types of search to get the movie
show times: search by movie, search by date, or both.
• Search by movie
In this type of search, the user can select any movie from the
movie list. The system then displays the list of cinemas that
are showing the selected movie, together with its show dates and
times (see Figure 4).
• Search by date
In this type of search, the user can select any date from the
date list. The system then displays the list of movies and the
cinemas that are showing them on the selected date, together
with their show times (see Figure 5).
• Search by movie and date
In this type of search, the user can select any movie from the
movie list and any date from the date list. The system then
displays the list of cinemas that are showing the selected movie
on the selected date, together with its show times (see
Figure 6).
Fig. 5. The system output for search by date
Fig. 6. The system output for search by movie and date
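The three types of search can be expressed as one query with two optional filters. The Python sketch below shows the idea using the standard library sqlite3 module in place of MySQL. The showtimes table, its columns, and the sample rows are assumptions, since the paper does not describe the actual schema or queries.

```python
import sqlite3


def search(conn, movie=None, date=None):
    """Search by movie, by date, or by both, depending on the arguments."""
    clauses, params = [], []
    if movie is not None:
        clauses.append("movie = ?")
        params.append(movie)
    if date is not None:
        clauses.append("show_date = ?")
        params.append(date)
    sql = "SELECT cinema, movie, show_date, show_time FROM showtimes"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY show_date, show_time, cinema"
    # Values are passed as parameters rather than pasted into the SQL text.
    return conn.execute(sql, params).fetchall()


def demo_connection():
    """Build an in-memory table with invented rows for demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE showtimes "
        "(cinema TEXT, movie TEXT, show_date TEXT, show_time TEXT)"
    )
    conn.executemany(
        "INSERT INTO showtimes VALUES (?, ?, ?, ?)",
        [
            ("Cinema X", "Movie A", "2016-02-04", "19:30"),
            ("Cinema Y", "Movie A", "2016-02-05", "21:00"),
            ("Cinema X", "Movie B", "2016-02-04", "14:00"),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print(search(conn, movie="Movie A"))
    print(search(conn, date="2016-02-04"))
    print(search(conn, movie="Movie A", date="2016-02-04"))
```

Building the WHERE clause from whichever filters are present lets a single function serve all three search types.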
V. DISCUSSION
The main purpose of the system, as mentioned earlier, is to act
as a portal or hub for users to find movie show times from three
main cinema operators in Malaysia. All of these cinema operators
reschedule their movie show times every week on Thursday, so
each week a new set of show times, including those for newly
released movies, is posted on their websites.
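Because the schedules change every Thursday, stored show times can be treated as stale once a Thursday has passed since the last extraction. The paper does not say how or when the system re-extracts, so the Python sketch below is only one possible rule, and the function names are hypothetical.

```python
from datetime import date, timedelta

THURSDAY = 3  # date.weekday() counts Monday as 0


def latest_thursday(today):
    """Return the most recent Thursday on or before the given date."""
    return today - timedelta(days=(today.weekday() - THURSDAY) % 7)


def needs_refresh(last_extracted, today):
    """True if the last extraction predates the latest weekly reschedule."""
    return last_extracted < latest_thursday(today)


if __name__ == "__main__":
    # 4 February 2016 was a Thursday.
    print(latest_thursday(date(2016, 2, 6)))
    print(needs_refresh(date(2016, 2, 3), date(2016, 2, 5)))
```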
The system helps users decide which movie to watch, where, and
at what time. Among the benefits of the system are:
A. Centralized Information
There is no need for the user to browse the three cinema
operators’ websites. All the movie show times from these
cinemas are combined in the system.
B. Single Interface
The system has a single page on which the user performs
searches. The results of the user’s request are displayed on the
same page, and the user can perform as many searches as they
want.
C. Easy Navigation
The user interface of the system is very simple: it has a
dropdown list of movies and a dropdown list of dates. The user
can select from these two lists, either separately or together.
VI. CONCLUSION AND FUTURE WORKS
Web information extraction is very useful in helping users
manage information for a specific purpose. The concepts of web
information extraction have the potential to be expanded in the
future for the users’ benefit. As possible future work, the
information extraction technique used in the system can be
improved to produce better results for users. More functions can
also be added to the system, such as sorting and displaying the
movie show times by state and displaying the movie posters along
with the show times.
REFERENCES
[1] M. Hu, Z. Jia, and X. Zhang, “An approach for text extraction from web
news page,” 2012, IEEE Symposium on Robotics and Applications
(ISRA), pp. 562-565.
[2] Y. Gui-Sheng and G. Guang-Dong, “A template-based method for
theme information extraction from web pages,” 2010, International
Conference on Computer Application and System Modeling (ICCASM),
pp. 721-725.
[3] T. Michal, “Comparison of approaches for information extraction from
the web,” 2008, 9th International PhD Workshop on Systems and
Control: Young Generation Viewpoint, pp. 1-3.
[4] S. Riri Fitri and K. Agung, “Implementation of Indonesian electronic
citation system based on web extraction techniques,” 2010, 3rd
International Conference on Knowledge Discovery and Data Mining, pp.
494-497.
[5] H. Hao and T. Takehiro, “WIKE: a web information/knowledge
extraction system for web service generation,” 2008, 8th International
Conference on Web Engineering, pp. 354-357.