Web Information Extraction: Movie Showtimes System

Omar Tariq Mohammed, Mohammed Ahmed Taiye, and Saifulazmi Tayib
School of Computing, Universiti Utara Malaysia, Kedah, Malaysia
omar_alsaegh@hotmail.com, tfeatslekan@gmail.com, imzalufias@yahoo.com

Abstract: Web information extraction is widely used by organizations and businesses to collect information from various websites so that they can produce analyses or statistical reports based on the extracted information. The technique is particularly useful for users who need to make decisions. In this paper, a web information extraction system, namely the Movie Showtimes System, is proposed to extract the required information from selected cinema operators’ websites in Malaysia. The information takes the form of movie show dates and times. Web information extraction can be done using several approaches such as the wrapper technique, regular expressions, classifiers, and sequence models. The system design is modeled using a use case diagram, a sequence diagram and an activity diagram. It is important to choose the right programming environment to develop the system, together with a suitable extraction approach and appropriate system functions. The system has to benefit its users and remove the hassles they currently face. Future improvements are also needed to ensure that the system achieves its purpose.

Keywords: Information extraction, web, movie show time.

I. INTRODUCTION

Nowadays, information is an essential part of driving businesses forward. The Internet provides the information these businesses need in their decision-making processes. In this context, businesses use web information extraction to analyze product specifications, pricing information and market trends from various websites.

Generally, information on the Internet comes in the form of text. There are three basic concepts for text representation: data, information and knowledge [1]. Data is unstructured text that needs to be processed and represented in a way that can be understood by humans. Information is data processed into a structured form that humans can understand. Knowledge, meanwhile, is processed information that formulates facts derived from that information.

Movie Showtimes System is a proposed web information extraction system that allows the user to search for movie show times from three main cinema operators in Malaysia. With this system, the user does not need to browse through each operator’s website just to find which shows are available.

This paper discusses web information extraction systems, whose definition and concepts are elaborated in Section II. The proposed system models are described in Section III, while Section IV describes the proposed system. Section V discusses the importance and benefits of the proposed system, and the paper concludes in Section VI.

II. WEB INFORMATION EXTRACTION CONCEPTS AND SYSTEMS

Many definitions can be used to describe web information extraction, but they all share the same meaning. Web information extraction can be defined as the technique that converts information taken from web resources, which is in the form of natural language text, into a structured knowledge representation with a fixed format in a database [2]. Instances of a particular class of events or relationships in a natural language text are identified and extracted to be transformed into a structured representation. Web information extraction simplifies the huge amount of information available on the Internet, gathers information from multiple resources, and organizes it into a formatted report.
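
To make the definition above concrete, the following is a minimal sketch of turning web text into fixed-format records stored in a database. It is an illustration only, not the implementation of the proposed system: the HTML fragment, the class names in it, the `Showtime` record, and the helper functions are all hypothetical, and Python with an in-memory SQLite database is used here purely as a self-contained stand-in.

```python
import re
import sqlite3
from dataclasses import dataclass

# Hypothetical page fragment; real cinema pages use different markup.
SAMPLE_HTML = """
<div class="movie"><h2>Example Movie A</h2>
  <span class="show">2013-05-16 14:30</span>
  <span class="show">2013-05-16 19:00</span>
</div>
<div class="movie"><h2>Example Movie B</h2>
  <span class="show">2013-05-17 21:15</span>
</div>
"""


@dataclass(frozen=True)
class Showtime:
    """One structured record: a movie title with a show date and time."""
    title: str
    date: str
    time: str


# Character-sequence patterns that match the assumed markup above.
MOVIE_BLOCK = re.compile(r'<div class="movie">(.*?)</div>', re.DOTALL)
TITLE = re.compile(r"<h2>(.*?)</h2>")
SHOW = re.compile(r'<span class="show">(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2})</span>')


def extract_showtimes(html):
    """Convert page source text into a list of Showtime records."""
    records = []
    for block in MOVIE_BLOCK.findall(html):
        title_match = TITLE.search(block)
        if title_match is None:
            continue  # skip blocks that do not contain a title
        title = title_match.group(1).strip()
        for date, time in SHOW.findall(block):
            records.append(Showtime(title, date, time))
    return records


def store_showtimes(conn, records):
    """Insert the structured records into a fixed-format table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS showtime "
        "(title TEXT, show_date TEXT, show_time TEXT)"
    )
    conn.executemany(
        "INSERT INTO showtime VALUES (?, ?, ?)",
        [(r.title, r.date, r.time) for r in records],
    )
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    store_showtimes(connection, extract_showtimes(SAMPLE_HTML))
    for row in connection.execute("SELECT * FROM showtime ORDER BY show_date"):
        print(row)
```

Keeping extraction (`extract_showtimes`) separate from storage (`store_showtimes`) means that the patterns can be adapted to a different page layout without touching the database code.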

Information extraction uses wrapper technology: a wrapper is a program that extracts the information the user searches for from web pages and puts it into a specified format. Regular expressions can also be used as an extraction approach. In the regular expression approach, the extraction pattern is defined by a sequence of characters in the text [3]. This approach is usually used for pattern matching or string matching. Another approach is to use classifiers such as the naïve Bayes classifier and the maximum entropy model [3]. The naïve Bayes classifier is a generative classifier that applies Bayes’ theorem with strong independence assumptions. The maximum entropy model is a discriminative classifier that is widely used in natural language processing. Apart from these approaches, sequence models such as the hidden Markov model (HMM) and conditional random fields (CRF) can also be used for information extraction [3]. A hidden Markov model is a statistical model in which the system being modeled is assumed to be a Markov process with hidden states. Conditional random fields are a class of statistical modeling methods often applied in pattern recognition and machine learning, where they are used for structured prediction.

There are many examples of systems that extract information from web-based resources. Two examples of available web information extraction systems are described here.

Fig. 2. The sequence diagram of the system

A. Electronic Citation Extraction System

The system implements automatic extraction for an Indonesian electronic journal system, extracting information from the e-journal sites of four universities [4]. These four universities are University of Indonesia, Jakarta; Bandung Institute of Technology, Bandung; Udayana University, Bali; and Petra Christian University, Surabaya. The purpose of the system is to help relate electronic documents that are available on the Internet to each other. These documents include multiple electronic resources such as technical reports, articles, journals and papers. Developed using the PHP language and a MySQL database, the system is a web-based application that searches and shows citation indexes extracted from the above-mentioned institutions, as well as the relationships between the data collected from each document. The system starts when the user enters the desired keyword in the field provided. This keyword might be the document’s title or the author’s name. The system then displays the results for this keyword.

B. Web Information/Knowledge Extraction System

The system, known as WIKE, is used to extract knowledge from certain parts of websites [5]. Currently, to collect information related to a certain country, we have to refer to the web pages that provide information about that country. Likewise, to compare 20 or 100 countries, we have to refer to the web pages of all of them. The purpose of the WIKE system is to extract information from a targeted web page based on the part of the page that the user wants to extract. First, the system gets typical web pages from the desired web applications and generates an extraction pattern. This extraction pattern is then used to extract information from the previously selected web pages. Lastly, the WIKE system generates a table that shows the extraction result. The extraction process is based on two aspects: the part selection (which part of the website the user needs) and the data type (what type of information the user needs).

Fig. 1. The use case diagram of the system

III. SYSTEM MODELS

The design of the system is represented by a use case diagram, a sequence diagram and an activity diagram. These diagrams help to give a pictorial representation of the activities performed by the system.
The tool used for the system modeling is Visual Paradigm for Unified Modeling Language 10.2. The use case diagram is the foundation of the Unified Modeling Language (UML) diagrams and gives a framework for how the other diagrams are represented. Use cases are a text-based method of describing and documenting complex processes that adds detail to the requirements outlined in the definition. They describe a set of activities that produce some output from the designed system and are started by a trigger, which is an event that causes the use case to be executed.

A. Use Case Diagram

The use case diagram for the system is shown in Figure 1.

B. Sequence Diagram

The sequence diagram for the system is shown in Figure 2.

C. Activity Diagram

The activity diagram for the system is shown in Figure 3.

Fig. 3. The activity diagram of the system

IV. SYSTEM DESCRIPTION

The development of the system can be described in terms of three major elements: the programming environment used, the extraction approach or technique employed, and the functions of the system.

A. The Programming Environment

Movie Showtimes System is developed using HTML and the PHP programming language. PHP is used because it is an open source programming language that enables the development of web-based applications. The code is written using a web design and development tool, Adobe Dreamweaver, because it provides ease of coding and syntax highlighting for PHP.

The development of the system also involves the creation of a database to store the cinema show times. The database is created using an open source relational database management system (RDBMS), MySQL. phpMyAdmin, a PHP application and MySQL administration tool, is used to manage the system’s database.

B. The Extraction Approach and Technique

The system extracts cinema show time information from three cinema operators’ websites: Golden Screen Cinemas (http://www.gsc.com.my), TGV Cinemas (http://www.tgv.com.my), and MBO Cinemas (http://www.mbocinemas.com). The extraction process begins with these three cinemas’ websites defined in the system. The system then opens each website’s pages and reads their source code. It extracts the information needed, which is the movie title together with its show dates and times. The extracted information is then inserted into the system’s database in a structured format. While extraction is in progress, the user can already perform searches: the system receives the user’s input, matches the request against the information stored in the database, and displays the result to the user.

C. The Functions of the System

Through this system, the user can search for movie show times from three main cinema operators in Malaysia: Golden Screen Cinemas, TGV Cinemas and MBO Cinemas. The user can perform three types of search to get the movie show times: search by movie, search by date, or both.

Search by movie
In this type of search, the user can select any movie from the movie list. The system then displays the list of cinemas that are showing the selected movie, together with its show dates and times (see Figure 4).

Fig. 4. The system output for search by movie

Search by date
In this type of search, the user can select any date from the date list. The system then displays the list of movies and the cinemas that are showing them on the selected date, together with their show times (see Figure 5).

Fig. 5. The system output for search by date

Search by movie and date
In this type of search, the user can select any movie from the movie list and any date from the date list. The system then displays the list of cinemas that are showing the selected movie on the selected date, together with its show times (see Figure 6).

Fig. 6. The system output for search by movie and date

V. DISCUSSION

The main purpose of the system, as mentioned earlier, is to act as a portal or hub for users to find movie show times from three main cinema operators in Malaysia. All these cinema operators reschedule their movie show times every week on Thursdays, so every week a new set of show times for newly released movies is published on their websites. The system helps users decide which movie to watch, where, and at what time. Among the benefits of the system are:

A. Centralized Information

There is no need for the user to browse through the three cinema operators’ websites. All the movie show times from these cinemas are combined in the system.

B. Single Interface

The system has only a single page on which the user performs searches. The results of the user’s request are displayed on the same page, and the user can perform as many searches as they want.

C. Easy Navigation

The user interface of the system is very simple: it consists of a dropdown list of movies and a dropdown list of dates. The user can select their preferences from these two lists, either separately or together.

VI. CONCLUSION AND FUTURE WORKS

Web information extraction is very useful in helping users manage information for a specific purpose. The concepts of web information extraction have the potential to be expanded in the future for the users’ benefit. As possible future work, the information extraction technique used in the system can be improved to produce better results for the users. More functions can be added to the system, such as sorting and displaying the movie show times by state, and displaying the movie posters along with their show times.

REFERENCES

[1] M. Hu, Z. Jia, and X. Zhang, “An approach for text extraction from web news page,” 2012, IEEE Symposium on Robotics and Applications (ISRA), pp. 562-565.
[2] Y. Gui-Sheng and G. Guang-Dong, “A template-based method for theme information extraction from web pages,” 2010, International Conference on Computer Application and System Modeling (ICCASM), pp. 721-725.
[3] T. Michal, “Comparison of approaches for information extraction from the web,” 2008, 9th International PhD Workshop on Systems and Control: Young Generation Viewpoint, pp. 1-3.
[4] S. Riri Fitri and K. Agung, “Implementation of Indonesian electronic citation system based on web extraction techniques,” 2010, 3rd International Conference on Knowledge Discovery and Data Mining, pp. 494-497.
[5] H. Hao and T. Takehiro, “WIKE: a web information/knowledge extraction system for web service generation,” 2008, 8th International Conference on Web Engineering, pp. 354-357.