Journal of Emerging Technologies in Web Intelligence
ISSN 1798-0461, Volume 4, Number 2, May 2012

Contents

Special Issue: Selected Best Papers of the International Conference on Information and Communication Systems (ICICS 09)
Guest Editor: Mohammad AL-Rousan

Guest Editorial
Mohammad AL-Rousan 117

SPECIAL ISSUE PAPERS

The Boundaries of Natural Language Processing Techniques in Extracting Knowledge from Emails
Thomas W. Jackson, Sara Tedmori, Chris J. Hinde, and Anoud I. Bani-Hani 119

MALURLS: A Lightweight Malicious Website Classification Based on URL Features
Monther Aldwairi and Rami Alsalman 128

Vision-based Presentation Modeling of Web Applications: A Reverse Engineering Approach
Natheer Khasawneh, Oduy Samarah, Safwan Al-Omari, and Stefan Conrad 134

Adaptive FEC Technique for Multimedia Applications Over the Internet
Mohammad AL-Rousan and Ahamd Nawasrah 142

Semantic Approach for the Spatial Adaptation of Multimedia Documents
Azze-Eddine Maredj and Nourredine Tonkin 148

Adaptive Backoff Algorithm for Wireless Internet
Muneer O. Bani Yassein, Saher S. Manaseer, and Ahmad A. Momani 155

ConTest: A GUI-Based Tool to Manage Internet-Scale Experiments over PlanetLab
Basheer Al-Duwairi, Mohammad Marei, Malek Ireksossi, and Belal Rehani 164

Context-oriented Software Development
Basel Magableh and Stephen Barrett 172

Aggregated Search in XML Documents
Fatma Zohra Bessai-Mechmache and Zaia Alimazighi 181

The Developing Economies' Banks Branches Operational Strategy in the Era of E-Banking: The Case of Jordan
Yazan Khalid Abed-Allah Migdadi 189

Evolving Polynomials of the Inputs for Decision Tree Building
Chris J. Hinde, Anoud I. Bani-Hani, Thomas W. Jackson, and Yen P. Cheung 198

Special Issue on Selected Best Papers of the International Conference on Information and Communication Systems (ICICS 09)

Guest Editorial

The International Conference on Information and Communications Systems (ICICS2011) is a forum for industry professionals and academics from companies, governmental agencies, and universities around the world to present their latest research results, ideas, developments and applications in all areas of computer and information sciences. The topics covered in ICICS2011 include, but are not limited to: Artificial Intelligence, Mobile Computing, Networking, Information Security and Cryptography, Web Content Mining, Bioinformatics and IT Applications, Database Technology, Systems Integration, Information Systems Analysis and Specification, and Telecommunications. We selected 11 high-quality papers (out of the 54 papers presented at ICICS2011) and invited the authors of the selected papers to extend them and submit them for a completely new peer review for consideration in this Special Issue (SI). The final decision on inclusion in the SI was strictly based on the outcome of the review process. The main objective of the SI is to make available the latest results in the field to the research community and to report state-of-the-art and in-progress research on all aspects of information and communication systems. The selected papers span a broad range of topics in information retrieval, e-business and the Internet. The contributions of these papers are outlined below.
Jackson et al. have studied the boundaries of natural language processing techniques in extracting knowledge from emails, aiming to determine whether natural language processing techniques can be used to fully automate the extraction of knowledge from emails. Based on the system built by the authors, it has been shown that although the f-measure results are world leading, user intervention is still required for the system to be accurate enough to be of use to an organisation. On the other hand, Aldwairi and Alsalman focused on a major problem in the World Wide Web: they propose a lightweight system, called MALURLs, to detect malicious websites online based on URL lexical and host features. The system relies on a Naïve Bayes classifier as a probabilistic model to detect whether the target website is malicious or benign. It introduces new features and employs self-learning using a Genetic Algorithm to improve the classification speed and precision. The system achieves an average precision of 87%. Two interesting studies have been presented for Web applications. The first one is vision-based presentation modelling of web applications, by Khasawneh et al. This valuable work discussed the design, implementation, and evaluation of a reverse engineering tool that extracts and builds an appropriate UML presentation model of existing Web applications. Their approach relies on a number of structured techniques such as page segmentation, and it was applied and evaluated on the Goalzz home page. The second work is presented by AL-Rousan and Nawasrah on Forward Error Correction (FEC) techniques for multimedia streams over the Internet. The work proposed a new adaptive FEC scheme that optimises the redundancy of the codewords generated by a Reed-Solomon (RS) encoder. The adaptation of the FEC scheme is based on predefined probability equations, which are derived from the data loss rates related to the recovery rates at the clients. Along the same line, Maredj and Tonkin proposed a new generic approach to the spatial adaptation of multimedia documents. This approach allows heterogeneous devices (desktop computers, personal digital assistants, phones, etc.), connected via the Internet, to play multimedia documents under various constraints (small screens, low bandwidth). It will enhance access to multimedia content over the Internet regardless of device specifications. Next, an interesting backoff algorithm for wireless Internet is proposed by Bani Yassein et al. Unlike the Binary Exponential Backoff algorithm, which makes exponential increments to contention window sizes, their work studied the effect of choosing a combination of linear, exponential and logarithmic increments to contention windows. Results have shown that choosing the right increment based on network status enhances the data delivery ratio by up to 37% compared to the Binary Exponential Backoff algorithm, and by up to 39% compared to the Pessimistic Linear Exponential Backoff algorithm for wireless Internet. A study presented by Al-Duwairi et al. introduced a new GUI-based tool to manage Internet-scale experiments over PlanetLab. PlanetLab is used extensively to conduct experiments and to implement and study a large number of applications and protocols in an Internet-like environment. The tool (called ConTest) enables PlanetLab users to set up experiments and collect results in a transparent and easy way. It also allows different measurements of different variables over the PlanetLab network.
Context-oriented programming, searching, and retrieval are emerging techniques that enable such systems to be more context aware. Magableh and Barrett described a development paradigm for building context-oriented applications using a combination of a Model-Driven Architecture that generates an ADL, which presents the architecture as a component-based system, and a runtime infrastructure (middleware) that enables transparent self-adaptation to the underlying context environment. On the other hand, Bessai-Mechmache and Alimazighi proposed an aggregated approach for searching the content of XML documents. The objective of their work is to build virtual elements that contain relevant and non-redundant elements and that are likely to answer the query better than elements taken separately. In a related area, e-banking systems involve heavy retrieval and content-based transactions. Yazan Migdadi, in his paper, has addressed this issue with particular attention paid to e-banking in Jordan. Finally, as guest editors of this SI, we would like to express our deepest thanks to the Editor-in-Chief, Professor Sabah Mohammed, for hosting this issue in JETWI and for his continued support and helpful guidance throughout all the stages of preparing this SI. Our sincere thanks also go to the editorial-office staff of the journal for their excellent job during the course of preparing this special issue. We also thank the authors for their contributions, including those whose papers were not included. We thank and greatly appreciate the thoughtful work of the many reviewers who provided invaluable evaluations and recommendations.

Guest Editor
Mohammad AL-Rousan
College of Computer and Information Technology, Jordan University of Science & Technology, Jordan
Email: alrousan@just.edu.jo

Mohammad Al-Rousan is currently a full professor at the Department of Network Engineering and Security, Jordan University of Science and Technology (JUST). He was educated in KSA, Jordan and the USA. He received his BSc in Computer Engineering from King Saud University, Saudi Arabia, in 1986. He received his M.S. in Electrical and Computer Engineering from the University of Missouri-Columbia, MO, USA, in 1992. In 1996, he was awarded a PhD in Electrical and Computer Engineering from Brigham Young University, UT, USA. He was then an assistant professor at JUST, Jordan. In 2002, he joined the Computer Engineering Department at the American University of Sharjah, UAE. Since 2008, he has been the Dean of the College of Computer and Information Technology at JUST. He is the Director of the Artificial Intelligence and Robotics Laboratory, and a co-founder of the Nano-bio Laboratory, at JUST. His research interests include wireless networking, system protocols, intelligent systems, computer applications, nanotechnology and Internet computing. Dr. Al-Rousan has served on organising and program committees for many prestigious international conferences. He is the recipient of several prestigious awards and recognitions. He co-chaired the International Conferences on Information and Communication Systems (ICICS09 and ICICS11).

The Boundaries of Natural Language Processing Techniques in Extracting Knowledge from Emails
Thomas W. Jackson
Information Science, Loughborough University, UK
Email: T.W.Jackson@lboro.ac.uk

Sara Tedmori
Computer Science Department, Princess Sumaya University for Technology, Jordan
Email: S.Tedmori@psut.edu

Chris J. Hinde and Anoud I. Bani-Hani
Computer Science, Loughborough University, UK
Email: C.J.Hinde@lboro.ac.uk; A.I.Bani-Hani@lboro.ac.uk

Abstract—The aim of this research is to determine if natural language processing techniques can be used to fully automate the extraction of knowledge from emails. The paper reviews the five generations of building systems to share knowledge and highlights the challenges faced by all. The paper discusses the system built by the authors and shows that although the f-measure results are world leading, there is still a requirement for user intervention to enable the system to be accurate enough to be of use to an organisation.

Index Terms—Natural Language Processing, Keyphrase Extraction, f-measure, Email Communication, Knowledge Extraction, NLP Limitations

I. INTRODUCTION

Over the last several decades, many reports [1], [2], [3], [4], [5] have indicated that people searching for information prefer to consult other people, rather than to use on-line or off-line manuals. Allen [5] found that engineers and scientists were roughly five times more likely to consult individuals rather than impersonal sources such as a database or file cabinet for information. In spite of the advancements in computing and communications technology, this tendency still holds; people remain the most valued and used source of knowledge [6], [7]. Unfortunately, finding individuals with the required expertise can be extremely expensive [8], [9], as it is time consuming and can interrupt the work of multiple persons. A common problem with many businesses today, large and small, is the difficulty associated with identifying where the knowledge lies. A lot of the data and information generated and the knowledge gained from projects reside in the minds of employees. Therefore the key problem is, how do you discover who possesses the knowledge sought?

In the search for the solution, information systems have been identified as key players with regard to their ability to connect people to people, enabling them to share their expertise and collaborate with each other [10], [11], [12], [13]. Thus, the solution is not to attempt to archive all employees' knowledge, but to link questions to answers or to knowledgeable people who can help find the answers sought [14]. This has led to interest in systems which help connect people to others that can help them solve their problems, answer their questions, and work collaboratively. Cross et al. [15] reviewed [16], [17], [18], [19], [20], [21], [22], [23], [24] and summarised the benefits of seeking information from other people. These benefits include:

• provision of solutions to problems;
• provision of answers to questions;
• provision of pointers to others that might know the answer;
• provision of pointers to other useful sources;
• engagement in interaction that helps shape the dimension of the problem space;
• psychological benefits (e.g. confidence, assurance);
• social benefits (e.g. social approval for decisions, actions);
• improvement in the effectiveness with which a person advances their knowledge in new and often diverse social contexts;
• improvement in efficiency (e.g. reduction in time wasted pursuing other avenues); and
• legitimation of decisions.
Cross [15] identifies five categories that these benefits fall under: (1) solutions (know-what and know-how); (2) meta-knowledge (pointers to databases or other people); (3) problem reformulation; (4) validation of plans or solutions; and (5) legitimation from contact with a respected person. It has been recognised that the idea of connecting people to people is a way forward, yet from a natural language processing viewpoint, what has been attempted before and what are the limitations of the current systems? This paper reviews the expert finding approaches and discusses the natural language processing (NLP) techniques used to extract knowledge from email, including the one developed by the authors. It concludes by reflecting on the current f-measure scores for knowledge extraction and the role of the user in any knowledge location system.

II. EMAIL, KNOWLEDGE WORK, AND EXPERTISE LOCATION

This section explores the role of email in knowledge work, focusing on its potential value for expertise location. Email is an important knowledge channel [47] and collaboration tool [48], actively used by organisational knowledge workers worldwide. Email has been reported as the most common Internet activity, used daily by ninety-one percent of U.S. internet users [49]. The organisational dominance of email is also demonstrated by a recent study reporting that email messages involve an estimated average of 15.8 Megabytes of archive storage per corporate end-user per day [50]. Despite email's popularity and ubiquity, there is little research on the value that email provides to organisational knowledge management (KM) strategies. Email enables greater knowledge work than was possible in earlier technological eras [51], [52]. It enables knowledge creation [53], [54], knowledge sharing and knowledge flow [55]. According to Tedmori et al., employees are motivated to use email for knowledge work for reasons including [56]:

• Email messages attract workers' attention;
• Email is well integrated with everyday work;
• Email discourse provides a context for sensemaking about ideas, projects and other types of business knowledge;
• Email enables the referencing of work objects (such as digital documents), and provides a history via quoted messages;
• Email's personalized messages are appealing, meaningful and easily understood;
• Email encourages commitment and accountability by automatically documenting email exchanges;
• Email collected in inboxes and organisational archives represents valuable individual, collective and organisational memories that may be tapped later; and
• Email discourse facilitates the resolution of multiple conflicting perspectives, which can stimulate an idea for a new or improved process, product or service.

Email provides several important, often unexploited, opportunities for expertise-finding. Knowledge in email can be accessed and reused directly [57] or can serve indirectly as a pointer to an expert [30], [58], [56]. A recognized definition of an expert is someone who possesses specialised skills and knowledge derived from training and experience [59]. Traditionally, email clients were designed for the reuse of personal knowledge archives. For example, folders are popular structures for organising email messages so that they assist owners with knowledge reuse.
This facility was highlighted by a recent study of Enron's publicly available email archive, where significant folder usage was employed [60]. Employees often search personal email archives seeking knowledge, in preference to searching electronic knowledge repositories (EKR) [57], raising questions about the effectiveness of EKRs for reuse, an issue first raised by [61]. As mentioned earlier, Swaak et al.'s study also found that employees prefer to find an expert to help them with their knowledge-based concern, rather than searching directly for knowledge [57]. The next section describes automated expert finder tools that exploit email as evidence of expertise.

III. EXPERT FINDING APPROACHES

Various approaches to expertise location have been developed and implemented to link expertise seekers with internal experts. These were initially discussed in Lichtenstein et al.'s research [62]; in this paper the authors extend the review to include the fifth generation of extraction systems. The first generation of such systems sprang out of the use of helpdesks as formal sources of knowledge, and comprised knowledge directories and expert databases. Microsoft's SPUD project, Hewlett-Packard's CONNEX KM system, and the SAGE expert finder are key examples of this genre. Generally, expert databases have 'Yellow Pages' interfaces representing electronic directories of experts linked to their main areas of expertise. Such directories are based on expert profiles which must be maintained by experts on a voluntary basis. The key advantages of such directories include conveniently connecting those employees inadequately tapped into social and knowledge networks with relevant experts. However, such approaches also suffer from significant shortcomings. Unless employees regularly update their profiles, the profiles lose accuracy and no longer reflect reality. Yet employees are notorious for neglecting to update such profiles, as such duties are often considered onerous and low priority [25]. Employees may not wish to provide expertise. Overall, when large numbers of employees are registered and profiles are inaccurate, credibility is rapidly lost in such systems, which are increasingly ignored by knowledge seekers, who instead rely on social networks or other methods [9]. In addition, expertise descriptions are usually incomplete and general, in contrast with the expert-related queries that are usually fine-grained and specific, and replete with various qualitative requirements [25]. In the second generation of expertise locators, companies took advantage of personal web pages where employees could advertise expertise internally or externally. Such pages are designed according to corporate templates or personal design, and are usually made accessible via the World Wide Web or corporate intranets. The convenience of web site creation and update, web site retrieval and access, and sophisticated search engines, are key advantages of this approach. However, employees may lack the motivation, time or technical expertise to develop or update their profiles, which rapidly lose accuracy and credibility and the capacity to meet expert location needs [25]. In addition, as noted by Yimam-Seid and Kobsa [25], employee use of search engines for locating an expert's web page may be ineffective since such a process is based on a simple keyword matching task which does not always yield the most relevant experts' web pages.
The search activity can also be very time consuming when a high number of hits is returned and an employee must then systematically or randomly attempt to choose and explore the listed link(s). As Yimam-Seid and Kobsa have observed for this approach, knowledge seekers are allocated significant and often onerous responsibility for finding relevant experts [25]. The second generation of approaches also included the development of more dynamic expert databases. Answer Garden [26], [27], which is a question-answering system, maintains a database of frequently asked questions and answers. When the system does not find required information in the database, an end-user may ask the question of the system. Answer Garden then routes the question to the corresponding experts. However, it is not clear with this approach how the system identifies experts and, in particular, whether experts have nominated their own areas and levels of expertise. The third generation of approaches relies primarily on secondary sources for expert identification. For example, the web application Expertise Browser [28] studies browsing patterns and activities in order to identify experts. With this application, if the user knows a particular expert, the user can ask the system to reveal the browsing path of that expert, relevant to the user's query. Among other disadvantages, if an employee does not know an expert, the user must ask the system to identify one or more experts. The employee must then scan the browsing paths of the identified experts for possibly useful links, which can be a very time consuming process. Furthermore, it is likely that browsing reveals interests rather than expertise [25]. The monitoring of browsing patterns clearly involves privacy issues that such systems fail to address. Other secondary-source approaches utilise message board discussions as indicators of expertise. For example, ContactFinder [29] is a research prototype that reviews messages posted on message boards. ContactFinder analyses subject areas from messages and links them to the names of the experts who wrote the messages. It provides users seeking experts with expert referrals when user questions match experts' earlier postings. All such approaches infer experts from secondary sources but do not allow experts to confirm such inferences. A recently recognised socially based approach is the use of social networks, which provide a complex social structure for the development of social capital and the connection of novices and experts [11]. In a study conducted by [7], while some people were difficult to access, they were still judged to be valuable sources of help. The use of a social network to locate expertise has become popular because colleagues are often physically available, are personal friends, or are known to be experts on the topic. However, there is no guarantee that a genuine expert will be consulted, as users may choose to consult a moderately knowledgeable person, a person with whom a good relationship exists, a proximate employee, or a quickly located employee, simply because that person is within the expertise seeker's social network. With this approach, low quality expertise may be systematically input into an organisation where it is quickly applied. Automated social network approaches such as Referral Web suffer from similar concerns.
The fourth generation may include one or more of the above approaches together with natural language processing and artificial intelligence techniques in order to analyse stored knowledge, seeking to identify expertise and experts [25], [30], [31]. A forerunner of such systems was Expert Locator, which returns pointers to research groups in response to natural language queries on reports and web pages [32]. A second example is Expert Finder [33], which considers self-published documents containing the topic keyword, and the frequency of the person named near the same topic keyword in non-self-published documents, in order to produce expertise scores and ranks. In 1993, Schwartz and Wood first attempted to utilise e-mail messages, known to be heavily knowledge-based, to deduce shared-interest relationships between employees. In 2001, following other experts' promising attempts, Sihn & Heeren implemented XpertFinder, the first substantial attempt to exploit the knowledge-based content of e-mail messages by employing technology to analyse message content. More recently, Google Mail has used similar techniques to scan email content whilst messages are read on-line, to extract key phrases that can then be matched with specific marketing adverts that appear to the right-hand side of the browser. This is more a case of just-in-time knowledge that could be extremely useful to employees if, for example, they were writing reports and the application mined for keywords and linked the user to existing written material or experts to aid in the report-writing task. The drawback of many of the fourth generation approaches is the output. For example, expert listings are unordered when presented to a user seeking experts, which means significant user effort is required to identify the best expert. Such systems identify experts by textual analysis but rarely support expert selection by users. In addition, such systems fail to present the varying degrees (or levels) of expertise that people possess and tend to assume a single level of expertise. It is thus entirely the user's responsibility to systematically process the returned results in order to identify the most suitable experts for answering specific queries. Techniques employed to build the fourth generation expertise profiles should be advanced to ensure that the textual fragments analysed accurately convey employees' expertise. To date, the automated techniques have been inadequate because they cannot distinguish between what is important and what is not important in identifying an expert. In addition, the system should be able to match user needs with expertise profiles by using appropriate retrieval techniques, ensuring that relevant experts are not overlooked and that less relevant experts are not overburdened with inappropriate queries. Numerous attempts have been made by researchers in both academia and industry to semi-automate or automate the process of finding the right expert. What could be deemed the fifth generation of systems works by analysing whom we socialise with. IBM's SmallBlue project (which is also part of the Atlas software suite) applies artificial intelligence algorithms for the purpose of uncovering participants' social network ("who they know") and also the expertise of those participants ("what they know") [63]. The application analyses employees' email outboxes, but only for the employees that work at the same company.
The application also analyses employees' outgoing instant messaging chat transcripts and their profile information. The inferred social network for each participant is private and displayed as a visualization in Ego for each user. SmallBlue comes with a search engine that enables users to search for expertise based on a topic and displays the results in the context of the wider enterprise social network. There are many other systems available that could be deemed part of the fifth generation of systems, e.g. AskMe, Autonomy's IDOL, Triviumsoft's SEE-K, MITRE's ExpertFinder, MITRE's XpertNet and MindServer Expertise, but they also have limitations similar to those discussed earlier in this section. This extended but abbreviated evolutionary review of expertise locator systems has highlighted the need for new expert locator systems with enhanced information retrieval techniques that provide user-friendly expertise seeking and high levels of accuracy in identifying relevant experts. In the next section, we summarise the techniques that have been used to extract key phrases and then discuss the latest attempts by the authors to improve upon these techniques to enhance the accuracy of the key phrases extracted and the ranking of their importance according to a user's expertise.

IV. KEY PHRASE EXTRACTION

Numerous papers explore the task of producing a document summary by extracting key sentences from the document [34], [35], [36], [37], [38], [39], [40], [41], [42], [43]. The two main techniques are domain dependent and domain independent. Domain dependent techniques employ machine learning and require a collection of documents with key phrases already attached, for training purposes. Furthermore, the techniques (both domain dependent and domain independent) are related to linguistics and/or use pure statistical methods. A number of applications have been developed using such techniques. A full discussion of existing approaches, together with their merits and pitfalls, is provided in [44]. There are many weaknesses with current approaches to automatic key phrase identification, several of which are discussed here to illustrate the issues. First, the extraction of noun phrases from a passage of text is common to all such approaches [43], [45]. However, a disadvantage of the noun extraction approach is that, despite the application of filters, many extracted key phrases are common words likely to occur in numerous emails in many contexts. Therefore, it is important to distinguish between more general nouns and nouns more likely to comprise key phrases. Second, Hulth [45] pinpoints two common drawbacks with existing algorithms, such as KEA. The first drawback is that the number of words in a key phrase is limited to three. The second drawback is that the user must state the number of keywords to extract from each document [45]. In the attempt to push the boundaries of key phrase extraction, work undertaken by the authors aimed to enable end-users to locate employees who may possess the specialised knowledge that users seek. The underlying technical challenge in utilising e-mail message content for expert identification is the extraction of key phrases indicative of sender skills and experience. In developing the new system, the Natural Language ToolKit (NLTK) was employed to build a key phrase extraction "engine".
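To make the two-stage process described below more concrete, the following is a minimal sketch of how a part-of-speech tagging model could be trained with NLTK (corresponding to stage one of the process described in the next paragraphs). The Brown corpus and the simple back-off tagger chain are illustrative assumptions, not the authors' actual training data or speech-tagging model.

```python
# Minimal sketch of training a part-of-speech tagging model with NLTK.
# The Brown corpus and the back-off chain are illustrative assumptions;
# they stand in for the authors' own training data and POS model.
# Requires: nltk.download("brown")
import nltk
from nltk.corpus import brown

def train_pos_model():
    tagged_sents = brown.tagged_sents(categories="news")
    # Back-off chain: unknown words default to NN, then unigram, then bigram context.
    default = nltk.DefaultTagger("NN")
    unigram = nltk.UnigramTagger(tagged_sents, backoff=default)
    bigram = nltk.BigramTagger(tagged_sents, backoff=unigram)
    return bigram

if __name__ == "__main__":
    tagger = train_pos_model()
    print(tagger.tag("Please check the online database schema".split()))
```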
NLTK comprises a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language. The completed key phrase extractor was then embedded within EKE - an Email Knowledge Extraction process based on two stages. The first stage involves a training process which enables the creation of a speech-tagging model for tagging parts-of-speech (POS) within an e-mail message. The second stage involves the extraction of key phrases from e-mail messages with the help of the speech-tagging model. Figure 1 shows the keyphrase extraction steps of stage two. The email message sent by the user is captured by the system and the text in the email body is fed into the keyphrase extraction engine. The text is then divided into tokens using regular expression rules and tagged by parts of speech with the help of the POS model created. Keyphrases are identified by applying rules, which were manually set by the authors, to group all occurrences of specific sequences of tags together. Each rule is formed from a sequence of grammatical tags which are most likely to contain words that make up a keyphrase. Once sequences of tags are collated, more rules are applied to remove a subset of non-relevant phrases. Keyphrases are then chosen from the identified candidate phrases. The process concludes with the use of linguistic filtering to extract the most important keyphrases. This results in a set of lines, each containing a sequence of tokens representing key competencies. Figure 2 depicts how the EKE system analyses e-mail messages to identify experts (but many other systems use a similar process). Once a message is sent by a user (step 1), the body of the message is captured by EKE.

Figure 2 - Stages of the extraction process

EKE's key phrase extraction engine will parse the body of the email seeking appropriate key phrases that might represent the user's expertise (step 2). This process is fully automated and takes only milliseconds to complete, and is so far transparent to both sender and receiver. It is possible that key phrases will not be identified by the key phrase extraction engine, as the message may not contain any text suggesting key phrases, or the message contains key phrases that were not detected. In such cases, EKE will not require any action from the user, whose work activities will therefore remain uninterrupted. In step 3, if the engine identifies key phrases, the user is requested to rank each extracted key phrase using a scale of 1-4, to denote the level of user expertise in the corresponding field. The rankings 1-4 represent basic knowledge, working knowledge, expert knowledge, or not applicable. The four-point categorisation scale was devised because a seeker of knowledge should be forewarned that a self-nominated expert may lack an expected capability. The knowledge seeker can then decide whether to proceed to contact such an expert for help. In Figure 1, "Questionnaire", "Semantics", "Casino" and "Online database" are examples of the key phrases that have been extracted from the body of a message. On average very few key phrases are extracted from a message because generally, according to our development tests and pilot studies, there are few key phrases contained within any one e-mail message. Therefore typically a user is not unduly delayed by the key phrase expertise categorisation process.
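As an illustration of stage two, the sketch below tokenises an email body, tags parts of speech and chunks candidate keyphrases with a single regular-expression rule. The grammar and the length filter are hypothetical stand-ins for the authors' manually set rules and linguistic filtering, and NLTK's built-in tagger replaces the trained POS model for brevity.

```python
# Minimal sketch of stage two: tokenise, POS-tag and chunk candidate keyphrases.
# The grammar and the length filter are hypothetical stand-ins for the authors'
# rules and linguistic filtering; nltk.pos_tag replaces the trained POS model.
# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger")
import nltk

GRAMMAR = r"KP: {<JJ>*<NN.*>+}"  # optional adjectives followed by one or more nouns

def extract_keyphrases(email_body):
    chunker = nltk.RegexpParser(GRAMMAR)
    candidates = []
    for sentence in nltk.sent_tokenize(email_body):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        tree = chunker.parse(tagged)
        for subtree in tree.subtrees(filter=lambda t: t.label() == "KP"):
            phrase = " ".join(word for word, _ in subtree.leaves())
            if len(phrase) > 3:  # crude filter removing very short, low-value phrases
                candidates.append(phrase)
    return candidates

if __name__ == "__main__":
    print(extract_keyphrases("The questionnaire semantics are stored in an online database."))
```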
Once categorised (for example, in Figure 1, "Questionnaire" may be categorised as basic knowledge, "Semantics" as expert knowledge, and so on), key phrases are stored in an expertise profile database (excluding key phrases categorised as "not applicable"). The user can access and edit his/her expert profile at any time (step 4). The key phrases that are stored in the expertise profile database are also made available to other employees within the organisation, to enable them to locate relevant experts by querying the database (step 5).

Figure 2 - Overview of the E-mail Knowledge Extraction System

The EKE system has significant advantages compared with other e-mail key phrase extraction systems, not all of which perform steps 3 and 4. The present system gains accuracy by requiring a user in steps 3 and 4 to rank his or her level of expertise for a particular key phrase. Most existing systems attempt to rank experts automatically rather than consulting users for their perceptions of their level of expertise. Such systems are likely to be less successful at accurately identifying expertise levels as they do not capture employee knowledge of their own expertise. The above approach has been trialled at Loughborough University and has been shown to be effective in correctly identifying experts [44]. However, it is important to note that this system used a hybrid approach of NLP and user intervention to determine the usefulness of the key phrases extracted. The focus of this paper is reviewing the boundaries of NLP techniques, and the next section reviews the results of the NLP system without user intervention, which leads to a discussion about the boundaries of NLP in key phrase extraction.

V. RESULTS AND BOUNDARIES OF NLP

It is important to note that the full EKE system, which includes user intervention, has not been included in these results. The full results have been published in Tedmori et al.'s research [44]. The purpose of showing the results in this paper is to illustrate that the f-measure results have not improved over a number of years. This highlights that the discipline has potentially found the boundaries of NLP in extracting knowledge keyphrases from email. The Natural Language ToolKit system developed by the authors was tested on a number of corpora:

• Corpus 1 - Emails from various academic domains; size 45
• Corpus 2 - Total office solutions organisation; size 19
• Corpus 3 - Enron; size 50

The sampling units were collected from subjects from different backgrounds (people with English as their first language and people who can communicate in English but whose first language is not English). All subjects belong to the age group 24-60. All the sampling units were outgoing mail. The authors believe that the sampling units are representative of typical messages that are sent out in institutional and corporate environments. The sampling units of the first sample, Corpus 1, were collated from various academic disciplines (computer science, information science, building and construction engineering). The sampling units of the second sample, Corpus 2, are specific to one employee from a large supplier of total office solutions in the UK & Ireland, which for confidentiality reasons is referred to as Employee E from Company XYZ. The sampling units of the final sample, Corpus 3, are collated from the Enron email dataset, which is freely available on the net.
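The results that follow are reported as precision, recall and f-measure. A minimal sketch of how these metrics can be computed for one corpus is given below; the gold-standard keyphrase sets and the lower-casing of phrases are illustrative assumptions rather than the authors' evaluation code.

```python
# Minimal sketch of the evaluation reported below: precision, recall and f-measure
# of extracted keyphrases against a manually assigned gold-standard set.
# The example phrase sets are hypothetical; lower-casing is an assumed normalisation.
def precision_recall_f(extracted, gold):
    extracted = set(p.lower() for p in extracted)
    gold = set(p.lower() for p in gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

if __name__ == "__main__":
    system_output = ["online database", "semantics", "casino"]
    manual_keyphrases = ["online database", "questionnaire", "semantics"]
    # Two of three phrases match in each direction, so all three metrics are ~0.667.
    print(precision_recall_f(system_output, manual_keyphrases))
```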
The f-measure, a widely used performance measure in information retrieval, was used to evaluate the system and is defined by (1):

f-measure = (2 x precision x recall) / (precision + recall)    (1)

Equation 1 - F-Measure Calculation

where precision is the estimate of the probability that, if a given system outputs a phrase as a key phrase, then it is truly a key phrase, and recall is an estimate of the probability that, if a given phrase is a key phrase, then a given system will output it as a key phrase.

Table I - Results of testing the authors' Natural Language ToolKit system

Corpus      Precision   Recall   f-measure
Corpus 1    53.3        57.6     55.4
Corpus 2    59.6        63.1     61.3
Corpus 3    41.7        48.3     44.8

In Table I, the precision, recall, and f-measure results are shown. The highest precision (59.6), recall (63.1), and f-measure (61.3) were achieved on the smallest sample (19 messages). Since only three sets were evaluated, one cannot determine the correlation between the size of the sample and the performance of the extractor. Turney [46] evaluates four key phrase extraction algorithms using 311 email messages collected from 6 employees, in which 75% of each employee's messages were used for training and 25% (approximately 78 messages) were used for testing. His evaluation approach is similar to that of the authors of this paper, and the highest f-measure reported was that of the NRC extractor, the extractor component of GenEx, which uses supervised learning from examples. The f-measure reported is 22.5, which is, as expected, significantly less than the f-measures shown in Table I. Hulth [45] reports results from three different term selection approaches. The highest f-measure reported was 33.9, from the n-gram approach with POS tags assigned to the terms as features. All unigrams, bigrams, and trigrams were extracted, after which a stop list was used where all terms beginning or ending with a stopword were removed. The Natural Language ToolKit system developed by the authors appears to have the best f-measure results in the world when it comes to email knowledge extraction. Although the results are pleasing, the prospect of a fully automated system that can extract knowledge from email without user intervention appears to be many years away, if it is possible at all. However, with the financial muscle of organizations like Google developing techniques for their range of information retrieval applications, this domain is likely to see rapid progress within a short period of time.

VI. CONCLUSION

This paper has reviewed the five generations of building systems to share knowledge and highlighted the challenges faced by all. The paper discussed the techniques used to extract key phrases and the limitations in the NLP approaches which have defined the boundaries of the domain. The paper has shown that although the f-measure results of the study are encouraging, there is still a requirement for user intervention to enable the system to be accurate enough to provide substantial results to the end users. It is concluded that NLP techniques are still many years away from providing a fully automated knowledge extraction system.

REFERENCES

[1] [2] [3] [4] [5] Hiltz, S.R. (1985). Online Communities: A Case Study of the Office of the Future, Ablex Publishing Corp, Norwood, NJ. Lang, K.N., Auld, R. & Lang, T. (1982). "The Goals and Methods of Computer Users", International Journal of Man-Machine Studies, vol. 17, no. 4, pp. 375-399. Mintzberg, H. (1973). The Nature of Managerial Work, Harper & Row, New York. Pelz, D.C. & Andrews, F.M. (1966).
Scientists in Organizations: Productive Climates for Research and Development, Wiley, New York. Allen, T. (1977). Managing the Flow of Technology, MIT Press, Cambridge, MA. JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] Cross, R. & Sproull, L. (2004). "More Than an Answer: Information Relationships for Actionable Knowledge", Organization Science, vol. 15, no. 4, pp. 446-462. Kraut, R.E. & Streeter, L.A. (1995). "Coordination in Software Development", Communications of the ACM, vol. 38, no. 3, pp. 69-81. Maltzahn, C. (1995). "Community Help: Discovering Tools and Locating Experts in a Dynamic Environment", CHI '95: Conference Companion on Human Factors in Computing SystemsACM, New York, NY, USA, pp. 260. Campbell, C.S., Maglio, P.P., Cozzi, A. & Dom, B. (2003). "Expertise Identification Using Email Communications", Twelfth International Conference on Information and Knowledge Management New Orleans, LA, pp. 528. Bishop, K. (2000). "Heads or Tales: Can Tacit Knowledge Really be Managed", Proceeding of ALIA Biennial Conference Canberra, pp. 23. Cross, R. & Baird, L. (2000). "Technology is not Enough: Improving Performance by Building Organizational Memory", Sloan Management Review, vol. 41, no. 3, pp. 41-54. Gibson, R. (1997). Rethinking the Future: Rethinking Business, Principles, Competition, Control & Complexity, Leadership, Markets, and the World, Nicholas Brealey, London. Lang, J.C. (2001). "Managing in Knowledge-based Competition", Journal of Organizational Change Management, vol. 14, no. 6, pp. 539-553. Stewart, T.A. (1997). Intellectual Capital: The New Wealth of Organizations, Doubleday, New York, NY, USA. Cross, R. (2000). "More than an Answer: How Seeking Information Through People Facilitates Knowledge Creation and Use", Toronto, Canada. Burt, R.S. (1992). Structural Holes: The Social Structure of Competition. Harvard University Press, Cambridge. Erickson, B.H. (1988). "The Relational Basis of Attitudes." in Social Structures: A Network Approach:, Barry Wellman and S. D. Berkowitz (eds.), edn, Cambridge University Press., New York:, pp. 99-121. Schön, D.A. (1993). "Generative Metaphor: A Perspective on Problem-setting in Social Policy" in Metaphor and Thought, ed. A. Ortony, 2nd edn, Cambridge University Press, Cambridge, pp. 137-163. Walsh, J.P. (1995). "Managerial and Organizational Cognition: Notes from a Trip down Memory Lane.", Organizational Science, vol. 6, no. 3, pp. 280-321. Weick, K.E. (1979). The Social Psychology of Organising, 2nd edn, McGraw-Hill, New York. Weick, K.E. (1995). Sense making in Organisations, Sage, London. Blau, P.M. (1986). Exchange and Power in Social Life, Transaction Publishers, New Brunswick, NJ. March, J.G. & Simon, H.A. (1958). Organizations, Wiley, New York. Lave, J. & Wenger, E. (1991). Situated Learning : Legitimate Peripheral Participation, Cambridge University Press, U.K. Yimam-Seid, D. and Kobsa, A. (2003) ‘Expert finding systems for organizations: problem and domain analysis and the DEMOIR approach’, Journal of Organizational Computing and Electronic Commerce, Vol. 13, No. 1, pp.1–24. © 2012 ACADEMY PUBLISHER [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] 125 Ackerman, M.S. and Malone, T.W. (1990) ‘Answer garden: a tool for growing organizational memory’, Proceedings of ACM Conference on Office Information Systems, Cambridge, Massachusetts, pp.31–39. Ackerman, M.S. 
(1994) ‘Augmenting the organizational memory: a field study of answer garden’, Proceedings of the ACM Conference on Computer-Supported Cooperative Work, pp.243–252. Cohen, A.L., Maglio, P.P. and Barrett, R. (1998) ‘The expertise browser: how to leverage distributed organizational knowledge’, Presented at Workshop on Collaborative Information Seeking at CSCW’98, Seattle, Washington. Krulwich, B. and Burkey, C. (1996a) ‘Learning user information interests through the extraction of semantically significant phrases’, In AAAI 1996 Spring Symposium on Machine Learning in Information Access, Stanford, California. Balog, K. and de Rijke, M. (2007) ‘Determining expert profiles (with an application to expert finding)’, Proceedings of the Twentieth International Joint Conferences on Artificial Intelligence, Hyderabad, India, pp.2657–2662. Maybury, M., D’Amore, R. and House, D. (2002) ‘Awareness of organizational expertise’, International Journal of Human-Computer Interaction, Vol. 14, No. 2, pp.199–217. Streeter, L.A. and Lochbaum, K.E. (1988) ‘An expert/expert-locating system based on automatic representation of semantic structure’, Proceedings of the Fourth Conference on Artificial Intelligence Applications, San Diego, California, pp.345–349. Mattox, D., Maybury, M. and Morey, D. (1999) ‘Enterprise expert and knowledge discovery’, Proceedings of the 8th International Conference on Human-Computer Interaction, Munich, Germany, pp.303–307. Luhn HP. 1958. The automatic creation of literature abstracts. I.B.M. Journal of Research and Development, 2 (2), 159-165. Marsh E, Hamburger H, Grishman R. 1984. A production rule system for message summarization. In AAAI-84, Proceedings of the American Association for Artificial Intelligence, pp. 243-246. Cambridge, MA: AAAI Press/MIT Press. Paice CD. 1990. Constructing literature abstracts by computer: Techniques and prospects. Information Processing and Management, 26 (1), 171-186. Paice CD, Jones PA. 1993. The identification of important concepts in highly structured technical papers. SIGIR-93: Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 69-78, New York: ACM. Johnson FC, Paice CD, Black WJ, Neal AP. 1993. The application of linguistic processing to automatic abstract generation. Journal of Document and Text Management 1, 215-241. Salton G, Allan J, Buckley C, Singhal A. 1994. Automatic analysis, theme generation, and summarization of machine-readable texts. Science, 264, 1421-1426. Kupiec J, Pedersen J, Chen F. 1995. A trainable document summarizer. Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 68-73, New York: ACM. 126 [41] [42] [43] [44] [45] [46] [47] [48] [49] [50] [51] [52] [53] [54] JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 Brandow R, Mitze K, Rau LR. 1995. The automatic condensation of electronic publications by sentence selection. Information Processing and Management, 31 (5), 675-685. Jang DH, Myaeng SH. 1997. Development of a document summarization system for effective information services. RIAO 97 Conference Proceedings: Computer-Assisted Information Searching on Internet; 101-111. Montreal, Canada. Tzoukermann E, Muresan S, Klavans JL. 2001. GISTIT: Summarizing Email using Linguistic Knowledge and Machine Learning. In Proceeding of the HLT and KM Workshop, EACL/ACL. Tedmori, S., Jackson, T.W. and Bouchlaghem, D. 
(2006) ‘Locating knowledge sources through keyphrase extraction’, Knowledge and Process Management, Vol. 13, No. 2, pp.100–107. Hulth A. 2003. Improved Automatic Keyword Extraction Given More Linguistic Knowledge. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'03). Sapporo. Turney PD. 1997. Extraction of Keyphrases from Text: Evaluation of Four Algorithms, National Research Council, Institute for Information Technology, Technical Report ERB-1051. (NRC #41550) Lichtenstein, S. 2004, "Knowledge Development and Creation in Email", The 37th Annual Hawaii International Conference on System Sciences (HICSS'04) - Track 8IEEE Computer Society, Washington, DC, USA. Garcia, I. 2006, 04/03-last update, The Good in Email (or why Email is Still the Most adopted Collaboration Tool) [Homepage of Central Desktop], [Online]. Available: http://blog.centraldesktop.com/?entry=entry060403214628 [2011, 06/22]. Pew Internet & American Life Project 2007, 01/11last update, Pew Internet & American Life Project Surveys, March 2000-April 2006 [Homepage of Pew Research Center], [Online]. Available: http://www.pewinternet.org/trends/Internet_Activiti es_1.11.07.htm [2007, 11/27] . Derrington, S. 2006, 11/11-last update, Getting Control of the Storage Environment, [Online]. Available: http://www.datastorageconnection.com/article.mvc/ Gaining-Control-Of-The-Storage-Environment-iB0002 [2011, 06/22] . Jackson, T. & Burgess, A. 2003, "Capturing and managing email knowledge.", Business Innovation in the Knowledge Economy - Abstracts from the IBM \& Stratford-Upon-Avon Conference, eds. J. Abbot, L. Martin, R. Palmer, M. Stone & L.T. Wright, , pp. 28. Whittaker, S., Bellotti, V. & Moody, P. 2005, "Introduction to this special issue on revisiting and reinventing e-mail", Human-Computer Interaction, vol. 20, pp. 1-9. Ducheneaut, N. & Belloti, V. 2003, "Ceci n’est pas un objet? Talking about objects in e-mail", HumanComputer Interaction, vol. 18, pp. 85-110. Jackson, T.W. & Tedmori, S. 2004, "Capturing and Managing Electronic Knowledge: The Development of the Email Knowledge Extraction", The International Resource Management Association ConferenceIdea Group, New Orleans, USA, pp. 463. © 2012 ACADEMY PUBLISHER [55] [56] [57] [58] [59] [60] [61] [62] [63] Bontis, N., Fearon, M. & Hishon, M. (2003). "The e-Flow Audit: An Evaluation of Knowledge Flow Within and Outside a High-tech Firm", Journal of Knowledge Management, vol. 7, no. 1, pp. 6-19. Tedmori, S., Jackson, T.W., Bouchlaghem, N.M. & Nagaraju, R. 2006, "Expertise Profiling: Is Email Used to Generate, Organise, Share, or Leverage Knowledge", , eds. H. Rivard, E. Miresco & H. Melham, , pp. 179. Swaak, J., de Jong, T. & van Joolingen, W.R. 2004, "The effects of discovery learning and expository instruction on the acquisition of definitional and intuitive knowledge", Journal of COmputer Assisted Learning, vol. 20, no. 4, pp. 225-234. Campbell, C.S., Maglio, P.P., Cozzi, A. & Dom, B. 2003, "Expertise identification using email communications", twelfth international conference on Information and knowledge managementNew Orleans, LA, pp. 528. Shanteau, J. & Stewart, T.R. 1992, "Why study expert decision making? Some historical perspectives and comments", Organizational Behavior and Human Decision Processes, vol. 53, no. 2, pp. 95-106. Klimt, B. & Yang, Y. 2004, "Introducing the Enron Corpus", First Conference on Email and Anti-spam. Markus, M.L. 
2001, "Toward a Theory of Knowledge Reuse: Types of Knowledge Reuse Situations and Factors in Reuse Success", Journal of Management Information Systems, vol. 18, no. 1, pp. 57-93. Lichtenstein, S., Tedmori, S. and Jackson, T.W., 2008, ''Socio-ethical issues for expertise location from electronic mail'', International Journal of Knowledge and Learning, 4(1), 58-74. Lin, C., Cao, N., Liu, S., Papadimitriou, S., Sun, J., and Yan, X. 2009, “SmallBlue:Social Network Analysis for Expertise Search and Collective Intelligence”, IEEE 25th International Conference on Data Engineering, pp. 1483-1486. Thomas W. Jackson - is a Senior Lecturer in the Department of Information Science at Loughborough University. Nicknamed ‘Dr. Email’ by the media Tom and his research team work in two main research areas, Electronic Communication and Information Retrieval within the Workplace, and Applied and Theory based Knowledge Management. He has published more than 70 papers in peer reviewed journals and conferences. He is on a number of editorial boards for international journals and reviews for many more. He has given a number of keynote talks throughout the world. In both research fields Tom has, and continues to work closely with both private and public sector organisations throughout the world and over the last few years he has brought in over £1M in research funding from research councils, including EPSRC. He is currently working on many research projects, including ones with The National Archives and the Welsh Assembly Government surrounding information management issues. JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 Sara Tedmori - In 2001, she received her BSc degree in Computer Science from the American University of Beirut, Lebanon. In 2003, she obtained her MSc degree in Multimedia and Internet Computing from Loughborough University. In 2008, she received her Engineering Doctorate in Computer Science from Loughborough University, UK. Currently she is appointed as an assistant Professor in the Computer Science Department at Princess Sumaya University of Technology, Jordan. Her research interests include: Object Tracking, image processing, expertise locator, knowledge extraction, knowledge sharing, and privacy. Chris Hinde is Professor of Computational Intelligence in the Department of Computer Science at Loughborough University. His interests are in various areas of Computational Intelligence including fuzzy systems, evolutionary computing, neural networks and data mining. In particular he has been working on contradictory and inconsistent logics with a view to using them for data mining. A recently completed project was concerned with railway scheduling using an evolutionary system. He has been funded by various research bodies, including EPSRC, for most of his career and is a member of the EPSRC peer review college. Amongst other activities he has examined over 100 PhD's. Anoud Bani-Hani is an EngD research Student at Loughborough University, researching into Knowledge Management in SME’s in the UK, with specific focus on implementing an ERP system into a low-tech SME. Prior to joining the EngD scheme Anoud was a Lecturer at Jordan University of Science and Technology and holds an undergraduate degree in Computer science and information technology system from the same university and a Master degree in Multimedia and Internet computing from Loughborough University. © 2012 ACADEMY PUBLISHER 127 128 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 
MALURLS: A Lightweight Malicious Website Classification Based on URL Features

Monther Aldwairi
Department of Network Engineering and Security, Jordan University of Science and Technology, Irbid, Jordan
munzer@just.edu.jo

Rami Alsalman
Department of Computer Engineering, Jordan University of Science and Technology, Irbid, Jordan
rsalsalman08@cit.just.edu.jo

Abstract—Surfing the World Wide Web (WWW) is becoming a dangerous everyday task, with the Web becoming rich in all sorts of attacks. Websites are a major source of many scams, phishing attacks, identity theft, SPAM commerce and malware. However, browsers, blacklists and popup blockers are not enough to protect users. That requires fast and accurate systems with the ability to detect new malicious content. We propose a lightweight system, called MALURLs, to detect malicious websites online based on URL lexical and host features. The system relies on a Naïve Bayes classifier as a probabilistic model to detect whether the target website is malicious or benign. It introduces new features and employs self-learning using a Genetic Algorithm to improve the classification speed and precision. A small dataset is collected and expanded through GA mutations to train the system in a short time and with low memory usage. A completely independent testing dataset is automatically gathered and verified using different trusted web sources. The algorithm achieves an average precision of 87%.

Index Terms—malicious websites, machine learning, genetic algorithm, classification

I. INTRODUCTION

Internet access is an integral part of modern life, and employees in today's fast economy depend on Internet-connected smart phones, laptops and personal assistants to perform their jobs on the go. Even the average user and young children are becoming tech-savvy and cannot live without the Internet. Users shop, check movies, bank accounts, email and health insurance, renew driving licenses, pay bills, make calls over IP, chat and play games, just to name a few daily activities. Such Internet access takes place through web browsers, making them the most popular application for most users. As browser vendors race to introduce more features and new functionalities, more vulnerabilities arise and more personal data are put at risk. Browsers have become the main system entry point for many attacks that aim at stealing private data and manipulating users into revealing sensitive information. Unsuspecting web surfers are not aware of the many drive-by downloads of malware, adware, spyware and Trojans to their devices. Just a single visit to a shady website is sufficient to allow the intruder to detect vulnerabilities in the surfer's computer and inject malware that might enable the intruder to gain remote access or open a backdoor for future blunders. Users do not have to visit pornographic or hacker websites to get compromised. Commerce-related SPAM, such as pharmaceuticals and fake products, is one way to coerce users to click and access malicious websites. In addition, they can be redirected to such websites through more organized schemes such as fast flux networks (FFN) [1]. Consequently, users are easily tricked into revealing private information through phishing and pharming attacks [2]. In addition to all of that, browsers collect sensitive data such as favorites, cache files, history files, cookies, form data and passwords.
This puts such information at risk and keeping your browser and computer up to date will not cut it. For instance, cache timer sniffing enables intruders to determine websites you have visited. Finding and identifying such websites is no simple task due to the ever growing World Wide Web and the dynamic nature of malicious websites. Blacklisting services rose to the challenge and were encapsulated into browsers, toolbars and search engines. The lists are constructed through manual reporting, honeypots or web spiders. But blacklists grow uncontrollably and become a performance bottleneck. Incorrect listing is a major problem, due to reporting, analysis and record keeping mistakes. Therefore, legitimate websites may be incorrectly evaluated and listed, while malicious websites are not listed because they are new and haven’t been analyzed yet. Researchers have been very active in devising online and offline solutions to classify malicious websites and make web surfing safer. Shortly we give a brief survey of the current state of art techniques for classifying websites. In this paper we propose lightweight statistical selflearning scheme to classify websites based on their features. It is fast and designed to run online to protect the users. We use a Naïve Bayes classifier to classify the websites into two classes: malicious or benign. The number of features used is small and they fall under one of three categories: lexical, host-based or special features. Features include those suggested by McGrath et al. [3] JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 and Ma et al. [4]. We add special features to improve the classification accuracy such as JavaScript Enable/Disable, Document Frequency, and Title Tag. In addition, Genetic Algorithm is used to expand the training dataset through mutations to learn the Naïve Bayes classifier better and faster without the need to deal with huge datasets. The authors presented preliminary results in a previous paper [5] and in this paper they expand the original work by including more implementation details and adding additional results. The rest of the paper is organized as follows. Section II surveys the related work. Section III details the methodology followed to classify websites, this includes the list features, collecting the training and testing datasets and how the Naïve Bayes and GA are used. Section IV presents the experimental results. II. RELATED WORK Blacklisting was and still is a popular technique. Whittaker et al. [6] offline analyzed millions of pages daily from the noisy Google's phishing blacklist. Their main contribution was achieving 90% classification accuracy for phishing pages after a three weeks training. PhishNet [7] used approximate pattern matching algorithm to match URL components against blacklist entries. Though the above techniques tried to automatically manage blacklists and increase their accuracy, they are still insufficient and suffer from their growing size and incorrect listing. Blacklists can be combined with other techniques that uses machine learning to classify malicious websites. One of the earliest classification systems for malicious websites was concerned with the detection of SPAM in blog posts. Blog identification and splog detection by Kolari et al. [8] used the activity and comments generated by a blog post as the main classification feature in addition to ping update services. 
Support Vector Machines (SVM) was used with only linear kernel in all experiments and reported moderate results. Subsequent work focused on detecting Phishing URLs in SPAM emails. Garera et al. [9] main contribution was identifying eighteen features to detect phishing URL embedded in SPAM. They used linear regression compare millions of Google’s toolbar URLs to identify 777 phishing pages a day and 9% of the users that visit them are potential victims. McGrath et al. [3] studied phishing infrastructure and the anatomy of phishing URLs. They pointed out the importance of features such as the URL length, linked-to domains age, number of links in e-mails and the number of dots in the URL. PhishDef [10] used features that resist obfuscation and suggested used the AROW algorithm to achieve higher accuracy. To further increase the accuracy several approaches focused on page content statistics such as Seifert et al. [11]. They added features derived from JavaScript and HTML tags such as redirects, long script code lines and shell code. Seifert et al. used features from the page contents such as the number of HTML script tags and size of iframe tags. Cova [12] et al. went too far by © 2012 ACADEMY PUBLISHER 129 profiling the normal JavaScript behavior and applying anomaly detection which is prone to high false positives. Anomaly detecting works by extracting features during the normal learning phase based on a specific model. In the testing phase the new feature values for the websites to be tested are checked against the training models representing the normal behavior. The features used include: the number of code executions, code length, number of bytes, shell codes and the difference in returned pages for different browsers and the number of redirections. The Prophiler by Canali [13] used HTML tag counts, percentage of the JavaScript code in the page, percentage of whitespace, entropy of the script, entropy of the of the strings declared, number of embed tags, presence of meta refresh tags, the number of elements whose source is on an external domain and the number of characters in the page. While improving accuracy the Prophiler significantly increased the number of features to eighty eight. In addition to the increased overhead due to the statistically processing the page content, those techniques suffered from the inherent danger of having to access the malicious page and download the content before deciding it was malicious. Ma et al. [4] used Yahoo−PhishTank dataset and validated their work using three machine learning models Naïve Bayes, SVM with an RBF kernel and regularized logistic regression. Later, Ma et al. [14] [15] developed a light weight algorithm for website classification based on lexical and host-based features while excluding page properties. It was designed as real-time, low-cost and fast alternatives for black listing. They reported 3.5% error rates and 10–15% false negatives, but the tradeoff was between memory usage and accuracy. However, the main disadvantage was the fact that they use tens and hundreds of thousands of features to achieve their results. Another disadvantage shared among all previous approaches is collecting and handling a large number of websites and features which makes it hard to run them online. III. MTHODOLOGY To address the drawback of previous wok we need to identify malicious websites. 
We define a malicious webpage as a page that downloads a file, uploads a file, collects data, installs an application, opens a pop window(s), displays an advertisement or any combination of the above without the knowledge or consent of the user. We manually construct our own dataset and label websites as either benign or malicious based on trusted web directories. The complete MALURLs framework is shown in Fig. 1. In step 3 the features for the training dataset are calculated through various sources. We collect 100 benign and 100 malicious sites. In steps 5 and 6 the dataset is expanded to 10000 records using Genetic Algorithm (GA). GA self learns the classifier through mutations on the dataset. Based on a fitness function we can use mutations and crossovers from the current dataset to generate a larger dataset and grantee not to learn our classifier based on specific domain. In step 7 Naïve Bayes are trained using part of the collected features. Finally, a completely different dataset of 200 websites is 130 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 Figure 1. MALURLs framework. used for testing. The testing dataset is collected and classified automatically into benign and malicious. The features are calculated based on a same web sources used in training. It is worth mentioning that the testing and training datasets are completely independent and are verified and classified using different trusted web sources to eliminate any chance of data poisoning. Unlike most of previous approaches which used the same source, mainly PhishTank. The following Subsections explain the three feature groups and the basis for their selection, how the training dataset is collected, how the Naïve Bayes classifier works, the Genetic Algorithm and finally how the testing dataset is built. A. Features Features used fall into one of three categories: lexical, host-based features and special features. 1) Lexical Features URL stands for uniform resource locator or formerly the universal resource locator. URL and uniform resource identifier (URI) are equivalent and identify any document retrieved over the WWW. The URL has three main parts: the protocol, hostname and path. Consider the following URL for example: “http://www.just.edu.jo/~munzer/ Courses/INCS741/Lec/ch3.ppt”. The protocol is: http://”, the hostname is: “www.just.edu.jo” and the path is: “~munzer/Courses/ INCS741/Lec/ch3.ppt”. Lexical features are the properties of the URL itself and do not include content of the page it points to. The URL properties include the length of the top level domain (TLD), other domains, the hostname, URL length, as well as the number of dots in the URL. In addition, lexical features include each token in the hostname (delimited by ‘.’) and tokens in the path URL delimited by ‘/’, ‘?’, ‘+’, ‘ ̶ ’, ‘%’, ‘&’, ‘.’, ‘=’, and ‘_’. Those feature groups are known as a “bag-of-words”. The features above tell a lot © 2012 ACADEMY PUBLISHER about a webpage. The domain might indicate a blacklisted malicious content provider. A large number of NULL token probably attributed to too many slashes might indicate an http denial of service attack (DOS) on Microsoft Internet Information Server (IIS). 2) Host-based Features Host-based features are derived from the host properties such as the IP address, geographic properties, domain name properties, DNS time to live (TTL), DNS A, DNS PTR and DNS MX records as well as WHOIS information and dates. 
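Before continuing with the host-based group, a minimal sketch of the lexical "bag-of-words" extraction described above is given here. It splits the hostname on '.' and the path on the listed delimiters, and records simple counts such as the URL length and the number of dots. The function and field names are our own illustrative choices and are not taken from the MALURLs implementation.

from urllib.parse import urlparse
import re

PATH_DELIMS = r"[/?+\-%&.=_]"   # path-token delimiters listed in the text

def lexical_features(url):
    """Bag-of-words style lexical features for a single URL (illustrative sketch only)."""
    parsed = urlparse(url)
    hostname = parsed.hostname or ""
    path = parsed.path or ""
    host_tokens = [t for t in hostname.split(".") if t]          # tokens delimited by '.'
    path_tokens = [t for t in re.split(PATH_DELIMS, path) if t]  # tokens delimited by '/', '?', '+', ...
    tld = host_tokens[-1] if host_tokens else ""
    return {
        "url_length": len(url),
        "hostname_length": len(hostname),
        "num_dots": url.count("."),
        "tld_length": len(tld),
        "host_tokens": host_tokens,
        "path_tokens": path_tokens,
    }

# The example URL from the text:
print(lexical_features("http://www.just.edu.jo/~munzer/Courses/INCS741/Lec/ch3.ppt"))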
Those features are very important and can help any classifier better the detection process. They help address a lot of important questions such as: does the IP address belong to a geographical location associated with malicious content? Does the PTR record resolve an IP that belongs to the host? Do the IP addresses of the DNS records belong to one autonomous system (AS)? 3) Special Features Those are new features that do not fit under the aforementioned categories or reported good results with previous systems. Some are simple to get a value for such as JS Enable/Disable, HTML Title tag content (<title></title>), 3-4-5 grams (n-grams) and Term Frequency and Inverse Document Frequency (TF-IDF). JavaScript code usually is downloaded and run on the client’s browser which can be very dangerous. Term frequency is the number of times a term occurs in a document. The inverse document frequency is the logarithm of the number of documents divided by the number of documents containing the term and it measures the importance of a term. TF-IDF is commonly used in search engines, classification and data mining and finally 3-4-5 grams take longer to calculate than the other features. Other features require significant computation time such as Anchors or bag-of-anchors which are extracted from all URLs in Anchor tags on the page being examined. Table I below shows a list of features’ groups used for training and testing purposes. B. Training Dataset The dataset is composed to 200 websites, half benign and the other half is malicious. The websites are chosen randomly and ALEXA Web Information Company’s TABLE I. FEATURE GROUPS JS-Enable-Disable Document Frequency DF Title tag <title>??</title> 3-4-5 grams TF-IDF weighting Blacklists WHOIS dates IP address misc Lexical misc 4grams Features DNS PTR record WHOIS info Connection speed TLD + domain DNS A record Geographic Hostname Words+URLs Meta+link URLs+anchors URLs+anchors+meta Path tokens Last token of the path Spamassassin plugin TLD DNS TTL DNS MX record Bag-of-words URLs Anchors Meta tags JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 website [16] is used to determine benign websites while PhishTank dataset [17] is used to determine malicious websites. ALEXA is one of the most influential and trusted WWW information companies. It provides information about websites including Internet traffic and top sites rankings. MALURLs uses IP tracer website [18] to extract the URL, DNS, IP address and the geographic properties for benign websites. PhishTank is an open source anti-phishing website that is widely used by almost all major browsers and WWW vendors such as Mozilla, Yahoo and McAfee. It offers phish pages verification tool through a voting system and issues annual phishing reports. PhishTank dataset contains partial information about malicious websites such as URL, DNS, IP-address and the geographic information. However, not all features are available on PhishTank, particularly the new special features. Therefore, we use Sphider [19] which is an open source web spider and a search engine. Sphider performs full text indexing, for example it finds links anywhere in a document whether in href or even in JavaScript strings. Sphider is able to calculate the Term Frequency (TF), Document Frequency (DF) and Inverse Document Frequency (IDF) features. C. Naïve Bayes classifier Bayes [20] is a probabilistic model based on Bayesian theorem. 
Though simple, it often outperforms most other classifiers, especially when trained using supervised learning methods. Naïve Bayes classifiers assume that the effect of a feature on a class is independent of the values of the other features. This conditional independence simplifies the computation without sacrificing accuracy, which makes Naïve Bayes a perfect match for our lightweight online algorithm. The Bayesian theorem, for example, calculates the probability that a website is malicious from the independent probabilities that the website is hosted in a geographic location that generates fake traffic, has a random and very long hostname, has an IP address that does not match the DNS A record, and so on. In our case the number of features and their value ranges are large. If C represents the class and F represents a feature, then the conditional probability (Pr) of C given F is calculated according to (1).

Pr(C | F) = Pr(F | C) · Pr(C) / Pr(F)    (1)

D. Genetic Algorithm
To generate a larger dataset from the initial dataset we use a Genetic Algorithm. GA is a biologically inspired algorithm that applies the principles of evolution and natural selection. The algorithm starts with an initial population encoded as chromosome structures, each composed of genes encoded as numbers or characters. In our case the initial population represents the group of features for the training dataset. Chromosome goodness is evaluated using a fitness function, while mutations and crossovers simulate the evolution of species. The fittest chromosomes are selected and the process is repeated until we converge to a solution [21]. The initial population is the dataset collected as specified in Subsection B and used in the learning phase. Mutations are applied to the initial dataset, which is composed of a number of features, or genes. A mutation simply changes certain features, such as flipping JS-Enable-Disable from True to False (binary encoding) or adding a random amount between 0.2 and 0.3 to DF and TF (value encoding). In the testing step the fitness function is calculated by multiplying the probability values of all the features, as shown in (2). The fitness function is used to calculate the malicious and benign probabilities, and the website is classified into the class with the highest probability.

Fitness(C) = ∏_{i=1..n} Pr(F_i | C),  where F_1, ..., F_n are the lexical, host-based and special feature values    (2)

E. Testing Dataset
For testing we collect 200 URLs (100 malicious, 100 benign) using the WOT Mozilla Plug-in [22]. The feature values are collected in the same way as in the training dataset. WOT is a traffic-light style rating system where green means go (benign) and red means stop (malicious). The rating of a website depends on a combination of user ratings and data from trusted sources such as Malware Patrol [23], Panda [24], PhishTank, TRUSTe [25], hpHosts [26] and SpamCop [27]. In addition, WOT enables users to evaluate the trustworthiness of a website and incorporates their ratings in calculating the reputation of the website.

IV. IMPLEMENTATION AND RESULTS
We implement MALURLs using the PHP programming language and a MySQL database. Equation (3) defines the precision metric used to evaluate the relative accuracy of MALURLs using different features. Precision (P) is a measure of the usefulness of the retrieved documents, i.e. the fraction of the websites flagged by the classifier that are flagged correctly:

P = TP / (TP + FP)    (3)

The testing dataset of 200 instances was divided into five different subsets and the average precision with and without the Genetic Algorithm was calculated as shown in Table II.
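Before turning to the results, the interplay between the GA-based expansion of the training set and the per-feature Naïve Bayes scoring of equations (1) and (2) can be sketched as follows. This is only an illustration under our own simplifying assumptions (binary and numeric features, coarse discretisation, a crude smoothing constant); none of the names below come from the MALURLs implementation, which the paper builds in PHP with MySQL.

import random
from collections import defaultdict

def mutate(record, rate=0.1):
    """Expand the training set: flip binary genes or nudge numeric ones (value encoding)."""
    child = dict(record)
    for name, value in record.items():
        if name == "label" or random.random() > rate:
            continue
        if isinstance(value, bool):
            child[name] = not value                         # e.g. JS-Enable-Disable: True -> False
        elif isinstance(value, (int, float)):
            child[name] = value + random.uniform(0.2, 0.3)  # e.g. DF / TF nudged by 0.2-0.3
    return child

def expand(dataset, target_size):
    """Grow the initial population to target_size records by mutating random members."""
    pool = list(dataset)
    while len(pool) < target_size:
        pool.append(mutate(random.choice(dataset)))
    return pool

def train_naive_bayes(dataset):
    """Estimate Pr(C) and per-feature counts for Pr(F = v | C) from discretised values."""
    priors, likelihoods, class_counts = {}, defaultdict(float), defaultdict(int)
    for rec in dataset:
        c = rec["label"]
        class_counts[c] += 1
        for name, value in rec.items():
            if name != "label":
                likelihoods[(c, name, round(float(value), 1))] += 1
    for c, n in class_counts.items():
        priors[c] = n / len(dataset)
    return priors, likelihoods, class_counts

def classify(record, priors, likelihoods, class_counts):
    """Score each class by Pr(C) times the product of Pr(F_i | C), as in (1)-(2), and pick the largest."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for name, value in record.items():
            if name == "label":
                continue
            count = likelihoods.get((c, name, round(float(value), 1)), 0.5)  # crude smoothing for unseen values
            score *= count / class_counts[c]
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Tiny illustration with made-up records (not data from the paper):
seed = [{"label": "malicious", "js_enabled": True,  "tf_idf": 0.8},
        {"label": "benign",    "js_enabled": False, "tf_idf": 0.1}]
model = train_naive_bayes(expand(seed, 50))
print(classify({"js_enabled": True, "tf_idf": 0.7}, *model))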
The use of GA to expand the training dataset results in a significant improvement in the classification precision. The average precision for classifying a website as benign or malicious using GA is 87% when using all feature groups. We ran experiments to measure the improvement in precision attained by adding the new individual features to MALURLs. The features added include TF-IDF, JS-Enable-Disable and 3-4-5 grams. The addition of TF-IDF results in a significant increase in classification precision from 66% to 76%, as shown in Table III. This is expected because of the volume of information presented by this feature. Adding JS-Enable-Disable showed a very good increase in average precision from 63% to 69%, as illustrated by Table IV. The experiments to measure the improvement in MALURLs precision achieved by adding 3-4-5 grams show a small increase in average precision from 77% to 80%, as illustrated by Table V. The 3-4-5 grams calculation is complex, takes a long time and puts the user at risk due to the need to download the document. Therefore n-grams can be deemed irrelevant because of the high overhead and small improvement, which is consistent with our goal of keeping the algorithm lightweight.

TABLE II. OVERALL PRECISION (%) WITH AND WITHOUT GA
Dataset Number    Without GA    With GA
1                 70            85
2                 75            90
3                 80            95
4                 80            80
5                 65            85
Average           74            87

TABLE III. OVERALL PRECISION USING GA (%) WITH AND WITHOUT TF-IDF
Dataset Number    Without TF-IDF    With TF-IDF
1                 60                70
2                 65                60
3                 90                95
4                 55                80
5                 60                75
Average           66                76

TABLE IV. PRECISION USING GA (%) WITH AND WITHOUT JS-ENABLE-DISABLE
Dataset Number    Without JS    With JS
1                 55            65
2                 50            60
3                 75            75
4                 70            65
5                 65            80
Average           63            69

TABLE V. PRECISION USING GA (%) WITH AND WITHOUT 3-4-5 GRAMS
Dataset Number    Without n-grams    With n-grams
1                 80                 80
2                 70                 70
3                 85                 85
4                 60                 75
5                 90                 90
Average           77                 80

V. CONCLUSIONS
This paper presents a new website classification system based on URL, host-based and special features. We experiment with various features and determine the ones that improve the precision with minimum overhead. The data is collected using the WOT Mozilla plug-in and the features are calculated using various web resources. The MALURLs system reduces the training time by using GA to expand the training dataset and learn the Naïve Bayes classifier. The experimental results show an average system precision of 87%. The additional features proved valuable in improving the overall classification precision: TF-IDF improved precision by up to 10%, the JS-Enable-Disable improvement was about 6%, while the 3-4-5 grams improvement was limited to 3%.

REFERENCES
[1] A. Caglayan, M. Toothaker, D. Drapeau, D. Burke and G. Eaton, "Real-time Detection and Classification of Fast Flux Service Networks", In Proceedings of the Cybersecurity Applications and Technology Conference for Homeland Security (CATCH), Washington, DC, Mar 2009.
[2] J. Zdziarski, W. Yang, P. Judge, "Approaches to Phishing Identification using Match and Probabilistic Digital Fingerprinting Techniques", In Proceedings of the MIT Spam Conference, 2006.
[3] D. McGrath and M. Gupta, "Behind Phishing: An Examination of Phisher Modi Operandi", In Proceedings of the USENIX Workshop on Large-scale Exploits and Emergent Threats (LEET), San Francisco, CA, Apr 2008.
[4] J. Ma, L. Saul, S. Savage, and G.
Voelker, “Beyond Blacklists: Learning to Detect Malicious Websites from Suspicious URLs”, In Proceedings of the ACM SIGKDD Conference, Paris, France, Jun 2009. [5] M. Aldwairi, R. Alsalman. “MALURLs: Malicious URLs Classification System”. In Proceedings of the Annual International Conference on Information Theory and Applications (ITA), Singapore, Feb 2011. [6] C. Whittaker, B. Ryner, and M. Nazif, “Large-Scale Automatic Classification of Phishing Pages”, In Proceedings of the 17th Annual Network and Distributed System Security Symposium (NDSS’10), San Diego, CA, Mar 2010. [7] P. Prakash, M. Kumar, R. R. Kompella and M. Gupta, “PhishNet: Predictive Blacklisting to Detect Phishing Attacks”, In Proceedings of INFOCOM ’10, San Diego, California, Mar 2010. [8] P. Kolari, T. Finin, and A. Joshi, “SVMs for the Blogosphere: Blog Identification and Splog Detection”, In AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, Mar 2006. [9] S. Garera, N. Provos, M. Chew, and A. D. Rubin, “A Framework for Detection and Measurement of Phishing Attacks”, In Proceedings of the ACM Workshop on Rapid Malcode (WORM), Alexandria, VA, Nov 2007. [10] A. Le, A. Markopoulou and M. Faloutsos, “PhishDef: URL Names Say It All”, In Proceedings of the 30th IEEE INFOCOM 2011 (Mini Conference), Shanghai, China, April 10-15, 2011. [11] C. Seifert, I. Welch and P. Komisarczuk, “Identification of Malicious Web Pages with Static Heuristics”, In Proceedings of the Australasian Telecommunication Networks and Applications Conference (ATNAC), 2008. [12] M. Cova, C. Kruegel and G. Vigna, "Detection and Analysis of drive-by-Download Attacks and Malicious JavaScript Code", In Proceedings of the 19th International Conference on World Wide Web (WWW'10), Raleigh, NC, Apr 2010. [13] D. Canali, M. Cova, G. Vigna and C. Kruegel, “Prophiler: a Fast Filter for the Large-Scale Detection of Malicious Web Pages", In Proceedings of the 20th International World Wide Web Conference (WWW), Hyderabad, India, Mar 2011. [14] J. Ma, L. Saul, S. Savage, and G. Voelker, “Identifying Suspicious URLs: An Application of Large-Scale Online Learning”, In Proceedings of the International Conference JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] on Machine Learning (ICML), Montreal, Quebec, Jun 2009. J. Ma, L. Saul, S. Savage, and G. Voelker, “Learning to Detect Malicious URLs”, ACM Transactions on Intelligent Systems, vol. 2, no. 3, Apr 2011. ALEXA Web Information Company’s website, http://www.alexa.com/, last access Feb 2011. PhishTank, http://www.phishtank.com/, last access Feb 2011. IP Tracer Website. http://www.ip-adress.com/ip_tracer/, last access Feb 2011. Sphider, http://www.sphider.eu/, last access Feb 2011 G. John, G. and P. Langley, Estimating continuous distributions in Bayesian classifiers, In proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (pp. 338–345), San Francisco, CA, 1995. J. Koza, M. Keane, M. Streeter, W. Mydlowec, J. Yu and G. Lanza “Genetic Programming IV: Routine HumanCompetitive Machine Intelligence”, Kluwer Academic Publishers, 2003. Web of Trust Mozilla Plug-in, http://www.mywot.com/en/download, last access Feb 2011. Malware Patrol. http://www.malware.com.br/, last access Feb 2011. Panda WOT. http://www.pandasecurity.com/, last access Feb 2011. TRUSTe – WOT. http://www.truste.org/, last access Feb 2011. hpHosts online. http://www.hosts-file.net/, last access Feb 2011. SpamCop. 
http://www.spamcop.net/, last access Feb 2011. Monther Aldwairi was born in Irbid, Jordan in 1976. Aldwairi received a B.S. in electrical engineering from Jordan University of Science and University (JUST) in 1998, and his M.S. and PhD in computer engineering from North Carolina State University © 2012 ACADEMY PUBLISHER 133 (NCSU), Raleigh, NC in 2001 and 2006, respectively. He is an Assistant Professor of Network Engineering and Security Department at Jordan University of Science and Technology, where he has been since 2007. He is also the Vice Dean of Faculty of Computer and Information Technology since 2010 and was the Assistant Dean for Student Affairs in 2009. In addition, he is an Adjunct Professor at New York Institute of Technology (NYiT) since 2009. He worked as Post-Doctoral Research Associate in 2007 and as a research assistant at NCSU from 2001 to 2006. He interned at Borland Software Corporation in 2001. He worked as a system integration engineer for ARAMEX from 1998 to 2000. His research interests are in network and web security, intrusion detection and forensics, artificial intelligence, pattern matching, natural language processing and bioinformatics. He published several well cited articles. Dr. Aldwairi is an IEEE and Jordan association of engineers’ member. He served at the steering and TPC committees of renowned conferences and he is a reviewer for several periodicals. He organized the Imagine cup 2011-Jordan and the national technology parade 2010. Rami Alsalman was born in Irbid, Jordan in 1986. He received his B.S. in computer information systems from Jordan University of Science and University in 2008. He received his M.S. degree in computer engineering JUST in 2011. He is currently working on his PhD in cognitive systems from the computer science department at the University of Bremen, Germany. He has three publications in AI and security areas. He received a three months research scholarship at the Heinrich-Heine University of Düsseldorf, Germany during the summer of 2010. His research interests include security, data mining, information assurance and data reasoning. Eng Alsalman is an editorial board at an International Journal of Web Applications. Additionally, he was a committee member in International Conference on Informatics, Cybernetics, and Computer Applications (ICICCA2010). 134 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 Vision-based Presentation Modeling of Web Applications: A Reverse Engineering Approach Natheer Khasawneh Department of Software Engineering, Jordan University of Science and Technology, Irbid, Jordan Email: natheer@just.edu.jo Oduy Samarah Department of Computer Engineering, Jordan University of Science and Technology, Irbid, Jordan Email: oasamarah09@cit.just.edu.jo Safwan Al-Omari Department of Software Engineering, Jordan University of Science and Technology, Irbid, Jordan Email: ssomari@just.edu.jo Stefan Conrad Institute of Computer Science, Heinrich Heine University, Dusseldorf, Germany Email: conrad@cs.uni-duesseldorf.de Abstract—Presentation modeling, which captures the layout of an HTML page, is a very important aspect of modeling Web Applications (WAs). However, presentation modeling is often neglected during forward engineering of Web Applications; therefore, most of these applications are poorly modeled or not modeled at all. 
This paper discusses the design, implementation, and evaluation of a reverse engineering tool that extracts and builds appropriate UML presentation model of existing Web Applications. The tool consists of three steps. First, we identify and extract visual blocks and presentation elements of an HTML page such as navigation bars, header sections, text input, etc. In this step, we adopt the VIPS algorithm, which divides an HTML into semantically coherent blocks. Second, the identified presentation elements in step one are mapped to the most appropriate UML presentation model elements. Third, the resulting presentation model is made available in Magicdraw for manipulation. Our approach is applied and evaluated in the Goalzz home page. Index Terms—Reverse Engineering, Web Application, Web UML, Vision-based Page Segmentation I. INTRODUCTION Recently, many applications and services have evolved from being stand-alone and monolithic applications into web applications. According to a recent study [1], most of the existing web applications lack proper modeling, which is necessary for maintenance, reengineering, and proper evolution to emerging web technologies. For the purpose of modeling legacy web applications, there is an urgent need to have a reverse engineering method to extract models of existing web applications. Chikofsky describes reverse engineering as “the process of analyzing a subject system to identify the system’s © 2012 ACADEMY PUBLISHER doi:10.4304/jetwi.4.2.134-141 components and their interrelationships and create representations of the system in another form or at a higher level of abstraction” [2]. Current modeling languages and methodologies are not sufficient for capturing all aspects of web applications. UML for example is not sufficient to express the hyperlinks between different HTML pages. Modeling of web applications can be performed at three levels: 1. Content modeling: focuses on modeling data in an HTML page. 2. Hyper text modeling: focuses on modeling links between HTML pages in a web application. 3. Presentation modeling: focuses on the layout of the items inside a particular HTML page. In this paper we focus on the presentation model, which is used to model the page layout in a UML presentation model. The approach we follow in generating a presentation model is based on page segmentation method discussed in [3]. Page segmentation divides the page into different blocks according to its visual appearance when rendered in a web browser. The reset of the paper is organized as follows: in Section 2, we provide an overview of the related work; Section 3 introduces the proposed approach in details; Implementation details are discussed in Section 4; Section 5 discusses case studies to illustrate and demonstrate our approach; finally in Section 6, we sketch concluding remarks and future work. II. RELATED WORK In the literature there are many methods and tools of web reverse engineering which are built on the standard of reverse engineering techniques. These methods and tools can be used to describe and model the web JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 applications with respect to different levels: content, hypertext, and presentation [4]. The UML modeling language is the most widely used during the forward engineering design process and in many web application reverse engineering techniques. 
For example, in [1] authors defined a process for reverse engineering by describing a method for understanding web applications to be easily maintained and evolved. In [5], authors showed how web applications with UML presentation can be easily maintained. The approach in [6] is based on structured and model based techniques. In this approach the HTML page is divided into several blocks according to a cognitive visual analysis. After that the specific patterns with these blocks are extracted to produce structural blocks and through these structural blocks a conceptual model is represented. The approach in [7] relies on HTML pages analysis by extracting the useful information from the web page and analyze the extracted information using the domain ontology and form the analysis results the UML conceptual schema is generated. III. APPROACH AND METHODOLOGY In this paper, we present a new approach to reverse engineer existing web applications into UML presentation model. The proposed approach focuses on discovering the structure of the web page and presenting the structure in UML presentation model, as shown in Fig. 1. 135 model generation stage. Page segmentation stage accepts an HTML page as input and produces an XML description of segmented page. Mapping process stage transforms the XML description to Web UML XML description. Finally, UML generation Stage accepts a Web UML XML description of the segmented page and outputs the UML presentation model. In the rest of this section, we describe, in details, page segmentation stage, web UML XML, mapping process stage, and UML model generation stage, respectively. A. Page Segmentation Stage In this stage we use Vision-based Page Segmentation Algorithm (VIPS) [3] to segment the web page into different blocks. VIPS incorporates both the page appearance (visual cues) and Document Object Model (DOM) to perform the segmentation. VIPS is done in three steps (Fig. 2): block extraction, separator detection, and content structure construction. These steps are repeated recursively several times until a user-defined threshold is reached. The threshold is called Permitted Degree of Coherence (PDoC) and it is based on Degree of Coherence (DoC). DoC is a numeric value between 1 and 10 that increases as the consistency between blocks increases. Following is a description of each step in VIPS: Visual Block Extraction The input to the visual block extraction is the visual Webpages Page Segmentation Stage 1 VIPS XML Description of Segmented Page Mapping Process Stage 2 Web UML XML Description of Segmented Page Parsing Web UML XML Description by Magicdraw Tool Stage 3 UML Presentation Model of Segmented Webpage Figure 1. Approach Architecture. The presented method consists of three stages: page segmentation stage, mapping process stage, and UML © 2012 ACADEMY PUBLISHER Figure 2. Flowchart of the Segmentation Process. cues and the DOM tree of the web page. During this step each node in the DOM tree is matched with the block where it belongs to in the visual cues. From the tree node, 136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 VIPS starts looking if the sub nodes belong to the same block. Nodes with a coherence value less than PDoC are matched together to belong to the same group. This process is repeated recursively for the sub roots of the unmatched nodes until all nodes are matched to a block in the visual cues. 
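The recursive block-extraction step can be sketched roughly as follows: a DOM-like tree is walked, and a node is emitted as a visual block once its degree of coherence reaches the permitted threshold (PDoC); otherwise its children are visited in turn. The Node structure and the hard-coded DoC values are placeholders of our own; the real VIPS algorithm derives DoC from visual cues (background, font, layout) and the rendered DOM, which this toy does not model.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    tag: str
    doc: float                      # degree of coherence assigned to this node (1-10)
    children: List["Node"] = field(default_factory=list)

def extract_blocks(node, pdoc=6.0):
    """Return the visual blocks of the page: stop dividing once DoC >= PDoC or the node is a leaf."""
    if node.doc >= pdoc or not node.children:
        return [node]               # coherent enough: the whole subtree becomes one block
    blocks = []
    for child in node.children:
        blocks.extend(extract_blocks(child, pdoc))
    return blocks

# Toy page: the header is coherent on its own, the content area must be split further.
page = Node("body", doc=3, children=[
    Node("div#header", doc=9),
    Node("div#content", doc=4, children=[Node("table", doc=8), Node("div#ads", doc=7)]),
])
print([b.tag for b in extract_blocks(page)])   # ['div#header', 'table', 'div#ads']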
Visual Separator Detection The input to this step is the collection of the extracted blocks from the previous step, whereas, the output is a set of separators which separate different blocks. There are two types of separators, horizontal separator and vertical separator. The process starts by one separator which spans the whole page. Then blocks are added one by one and the separator gets updated according to the following rules: 1. If the added block falls inside the separator, the separator will be splitted into two. The splitting will be done either vertically or horizontally. 2. If the added block covers part of the separator the separator will be resized. 3. If the added block covers the whole separator, the separator will be removed. Weight separator is assigned to each separator by considering factors that show how similar the neighboring blocks are. For a separator which separates two blocks, the more the two blocks differ the higher the weight that will be assigned for the separator and vice versa. The used factors are: distance between blocks, overlapping with HTML tags, background differences, font differences, and structure similarity. Content Structure Construction In this step content structure is constructed by merging lowest weight separators with the neighboring separators. The newly merged separator will be given a DoC value equals to the maximum DoC of the merged separators. This step is iterative until a separator with maximum weight is reached. Finally, each node in the newly generated block is checked whether its DoC meets the condition given by PDoC or not. If not, the Visual Extraction Process starts over. Otherwise, the process terminates and the page structure is generated in VIPS XML description format. VIPS XML description format is the immediate output of the VIPS algorithm. It is very simple and understandable description, which is defined for blocks and their content in the segmented page. The VIPS XML description captures and describes some attributes for each block and HTML elements in a web page, as shown in Table I. This table contains the main attributes which are used during the mapping process stage. Fig. 3 represents the VIPS XML description for blocks with ID (1-2) which contains three elements irrespective of the type of these elements (block or primitive HTML Element). Fig. 3 also shows more information for block such as coordinates and not containing any tables or images. In contrast, Fig. 4 represents VIPS XML description for primitive HTML Element (anchor) with ID (1-2-1), and also Fig. 4 shows more information about © 2012 ACADEMY PUBLISHER TABLE I. VIPS XML ATTRIBUTES AND THEIR DESCRIPTION VIPS XML Attributes ContainImg IsImg Descriptions Determine whether the block contains image Determine whether the HTML Element is image ContainTable Determine whether the block contains table ContainP Determine whether the HTML Element contains text TextLen Determine the length of text in blokes or HTML Elements DOMCldNum Determine whether the element is block or primitive HTML tag ( it is block if DOMCldNum >1) ObjectRectLeft ObjectRectTop ObjectRectWidth ObjectRectHeight Determine the coordinates for blocks or HTML Elements Content Determine the content of blocks and anchor which appears for user .(used for data mining techniques ) SRC Determine the source of image, anchor or any HTML Element ID Unique ID for block or HTML Element, which is used for specific purpose during the implementation. 
order Unique ID (integer number) the element such as coordinates and Uniform Resource Locator (URL) for anchor in SRC attribute. Through ID attribute, we can know that the block with ID (1-2) contains the HTML element with ID (1-2-1). <LayoutNode FrameSourceIndex="0" SourceIndex="16" DoC="10" ContainImg="0" IsImg="false" ContainTable="false" ContainP="0" TextLen="15" LinkTextLen="15" DOMCldNum="3" FontSize="7.5" FontWeight="700" BgColor="#00ffff" ObjectRectLeft="10" ObjectRectTop="289" ObjectRectWidth="202" ObjectRectHeight="236" ID="1-2" order="10"> Figure 3. VIPS XML Description for Block. <LayoutNode FrameSourceIndex="0" SourceIndex="18" DoC="11" ContainImg="0" ContainTable="false" ContainP="0" TextLen="5" LinkTextLen="5" DOMCldNum="1" FontSize="7.5" FontWeight="700" BgColor="#00ffff" IsImg="false" ObjectRectLeft="60" ObjectRectTop="308" ObjectRectWidth="95" ObjectRectHeight="36" Content="Page1 " SRC="&lt;A href= &quot; FirstPage. html&quot; &gt;Page1&lt;/A&gt; " ID="1-2-1" order="11"/> Figure 4. VIPS XML Description for Anchor HTML Elements. JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 B. Web UML XML Web UML XML description is based on UML-Based Web Engineering (UWE) Metamodel [8]. UWE is a methodology used for Web application modeling purpose, especially, structure modeling and behavior modeling. Furthermore, this methodology provides guidelines for systematic modeling of Web applications. UWE comprises four main parts to model web application: Notations, Methods, Metamodel, and Process. Next we present a brief description about the UWE notation before discussing the process of transforming VIPS XML description format to Web UML XML format. UWE Notation For UML presentational model, UWE defines stereotypes for user interface elements, these elements are divided into two parts: primitive user interface (UIElement) elements and user interface container (UIContainer) elements which contain a collection of primitive user interface elements. A case-tool Magicdraw supports the design and model of web applications with different aspects by using UWE notations and Metamodels as plug-in. This casetool is used to parse the Web UML XML description to generate UML presentation model in generation model stage. Table II summarizes the UWE Stereotypes, which are used to model web application into UML presentation model. These Stereotypes are: 1. Page: A Page element is the area of the user interface which contains all of UIElement elements and UIContainer elements. The page constructs the root of presentation model. 2. Anchor: An anchor element permits the user to move from page to another page, or from location to another location on the same page. 3. Group Presentation: A group presentation element is used to define a set of UIElement elements, such as a collection of anchors. 4. Text: A text element is used to display a sequence of characters. 5. Textinput: A text input element permits the user to enter text. 6. Form: A form element contains a collection of UIElement elements that are used to provide data for a submitted process. 7. Button: A button element allows the user to initiate some actions on the web page. Actions include submitting the content for Textinput element, playing video, displaying image, triggering anchor and so on. 8. Selection: A selection element displays a list of items for the user to select one or more items. 9. Image: An image element is used to display the image. 10. 
Media Object: MediaObject elements are used to play multimedia objects such audio and video. 137 TABLE II. UWE SEROTYPES SYMBOL. Stereotypes Symobl Page Anchor Text Text input Selection File upload Image mediaObject Form Button Collection Custom component 11. File upload: A File upload element allows user to upload files. 12. Custom component: UWE also defines custom component stereotypes for custom HTML elements which are not defined by UWE. The Web UML XML description is a complicated description which is used to describe the UWE stereotypes in XML format. For that, this type of XML description can be interpreted and rendered to UWE stereotypes by any case-tool that is capable of parsing this XML description to build a UML presentation model. The Web UML XML description is divided into three main parts, each part specifies some properties for elements. Fig. 5, Fig. 6, and Fig. 7 contain the UML XML description for a simple block of web page, this block is represented by the DIV HTML tag which consists of two Textinput elements and one submit button as shown in Fig. 8. The Web UML XML description in Fig. 5 specifies the ID number and name for each element, and also determines whether the element is visible or not in UML presentation model. The benefit of visibility property appears through modeling the hidden HTML elements that are not displayed by the browser. <packagedElement xmi:type='uml:Class' xmi:id='1_1_2' name='Administrator Login' visibility='public'> <ownedAttribute xmi:type='uml:Property' xmi:id='5' name='textinput' visibility='private' aggregation='composite' type='1_1_2_1'/> <ownedAttribute xmi:type='uml:Property' xmi:id='6' name='textinput' visibility='private' aggregation='composite' type='1_1_2_2'/> <ownedAttribute xmi:type='uml:Property' xmi:id='7' name='button' visibility='private' aggregation='composite' type='1_1_2_3'/> </packagedElement> ------------------------------------------------------------------------------<packagedElement xmi:type='uml:Class' xmi:id='1_1_2_1' name=''Username' ' visibility='public'/> <packagedElement xmi:type='uml:Class' xmi:id='1_1_2_2' name=''Password' visibility='public'/> <packagedElement xmi:type='uml:Class' xmi:id='1_1_2_3' name='Submit' visibility='public'/> Figure 5. UML XML Description (Part1). © 2012 ACADEMY PUBLISHER 138 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 Fig. 5 is also divided into two parts as shown by the dashed lines. In the first part, the block is defined as a class type and the elements in this block are defined as a property type. In the second part, the elements in block are only defined as a class type. This means that the group element, or in other words UIContainer elements, is defined as a class type only, and the primitive user interface (UIElement) element is defined as a class and property type. Fig. 6 represents the second part of Web UML XML description. In this part, the coordinates for each element are specified by four points as shown in bold text in Fig.7. These points are: 1. Padding-left: sets the left padding (space) of an element. 2. Padding-Top: sets the top padding (space) of an element. 3. Width: sets the width for element. 4. Height: sets the height for element. These points are structured as: Padding-left, PaddingTop, Width, and Height. 
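A minimal sketch of the coordinate part of the mapping is given below: the ObjectRectLeft, ObjectRectTop, ObjectRectWidth and ObjectRectHeight attributes of a VIPS node are turned into the (padding-left, padding-top, width, height) quadruple written into the <geometry> element. Taking the offsets relative to the enclosing block, and the helper names used here, are assumptions made for illustration; the actual tool adjusts the coordinates so that they stay consistent inside the MagicDraw presentation model.

def vips_rect(attrs):
    """Read the ObjectRect* attributes of a VIPS XML node as a (left, top, width, height) tuple."""
    return tuple(int(attrs[k]) for k in
                 ("ObjectRectLeft", "ObjectRectTop", "ObjectRectWidth", "ObjectRectHeight"))

def to_geometry(element_attrs, block_attrs):
    """Build the 'padding-left,padding-top, width, height' string placed in the <geometry> element.
    Offsets are computed relative to the enclosing block (an assumption for this sketch)."""
    ex, ey, ew, eh = vips_rect(element_attrs)
    bx, by, _, _ = vips_rect(block_attrs)
    return f"{ex - bx},{ey - by}, {ew}, {eh}"

# Block 1-2 and its anchor 1-2-1, with the attribute values shown in Fig. 3 and Fig. 4:
block = {"ObjectRectLeft": "10", "ObjectRectTop": "289",
         "ObjectRectWidth": "202", "ObjectRectHeight": "236"}
anchor = {"ObjectRectLeft": "60", "ObjectRectTop": "308",
          "ObjectRectWidth": "95", "ObjectRectHeight": "36"}
print(to_geometry(anchor, block))   # "50,19, 95, 36"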
<mdElement elementClass='Class' xmi:id='1004'> <elementID xmi:idref='1_1_2'/> <properties> <mdElement elementClass='BooleanProperty'> <propertyID>SUPPRESS_STRUCTURE</propertyID> <propertyDescriptionID>SUPPRESS_STRUCTURE_DESCRIP TION</propertyDescriptionID> </mdElement> </properties> <geometry>90,225, 20, 130</geometry> <propertyID>SUPPRESS_STRUCTURE</propertyID> <compartment xmi:value='5^6^7' compartmentID='ATTRIBUTES'/> <parts> <mdElement elementClass='Part' xmi:id='1005'> <elementID xmi:idref='5'/> <geometry>100,255, 20, 20</geometry> <propertyID>SUPPRESS_STRUCTURE</propertyID> </mdElement> <mdElement elementClass='Part' xmi:id='1006'> <elementID xmi:idref='6'/> <geometry>100,280, 20, 20</geometry> <propertyID>SUPPRESS_STRUCTURE</propertyID> </mdElement> <mdElement elementClass='Part' xmi:id='1007'> <elementID xmi:idref='7'/> <geometry>100,305, 20, 20</geometry> <propertyID>SUPPRESS_STRUCTURE</propertyID> </mdElement> </parts> </mdElement> <UWE_PROFILE:PRESENTATIONGROUP XMI:ID='1018' BASE_CLASS='1_1_2'/> <UWE_PROFILE: TEXTINPUT XMI:ID='1019' BASE_CLASS='1_1_2_1'/> <UWE_PROFILE: TEXTINPUT XMI:ID='1020' BASE_CLASS='1_1_2_2'/> XMI:ID='1021' <UWE_PROFILE:BUTTON BASE_CLASS='1_1_2_3'/> -----------------------------------------------------------------------------<UWE_Profile: textInput xmi:id='1030' base_Property='5'/> <UWE_Profile: textInput xmi:id='1031' base_Property='6'/> <UWE_Profile:button xmi:id='1032' base_Property='7'/> Figure 7. UML XML Description (Part3). Figure 8. Group of HTML Elements. description as output file. Substantially, the mapping process consists of the following steps: 1. Firstly, Contrast Map table between HTML elements and UWE stereotypes, as shown in Table III. 2. Read the VIPS XML Description for each block, this step continues until the end of VIPS XML Description file is reached. 3. The VIPS XML attributes for each element are examined, and then useful information from these attributes such as the type of element and coordinates are extracted. TABLE III. VIPS XML ATTRIBUTES AND THEIR DESCRIPTION. HTML Element <div> </div> <span></span> <fieldset></fieldset> <select> <input type=” radio”/> <input type=”checkbox”/> <button></button> <input type=" reset " /> <input type=" button " /> <input type=" submit " /> <input type=" file " /> Figure 6. UML XML Description (Part2). <input type=" text " /> <input type=" password " /> < textarea></ textarea> Fig. 7 represents the third part of Web UML XML description. In this part, the stereotype of element is specified by UWE profile. Also, Fig. 7 is divided into two parts as shown by the dashed lines, First part specifies the stereotype of element from class type, and the second part also specifies the stereotype of element from property type, as is shown by bold text in Fig. 7. C. Mapping Process Stage The mapping process is considered the main part of our approach. This process accepts the VIPS XML description as input file and produces Web UML XML © 2012 ACADEMY PUBLISHER <a href=””></a> <p> </p> 4. 5. UEW stereotype Group Presentation Selection Button File upload Text input Anchor Text <form > </form> Form <img src=””></img> Image The extracted information for all elements in block is stored together in the data structure list for block. Examine the data structure list for each block to check whether the block has a collection of elements JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 139 that have the same type or not. 
For example, check whether the blocks contain a collection of anchors or not. 6. Mapping location step is considered the most important step; this step needs to change the coordinates for elements to become consistent within UML presentation model. 7. Mapping from HTML element to UWE Stereotypes. 8. Now, the content of data structure list for each block is Web UML XML Description. This content is stored into Web UML XML file. Fig. 9 shows the pseudo code for mapping process. 1: Input: VIPS XML description for blocks in the segmented Web Page. 2: Output: UML XML description for blocks in the segmented page. 3: Begin 4: Counter Å 1 // Counter for blocks. 5: total_blocks Å N // N is the no. of blocks in segmented page. 6: For counter to total_blocks step by 1 7: BlockÅ list_of_blocks [counter] 8: Counter_elemnent Å 1 // Counter for elements in block 9: While Block contains elements with VIPS XML description 10: ElementÅ Block [Counter_elemnent] 11:VIPS_ Info Å Extract useful information from VIPS XML att. of Element. 12: WebUMLXML_info Å Mapping _to_WebUMLXML (VIPS_ Info). 13: Block [Counter_elemnent] Å WebUMLXML_info 14: Counter_elemnent INCREMNET BY 1 15: End While 16: Store the information of block and elements into Web UML XML file. 17: End for Figure 9. Mapping Process Pseudo Code. D. UML Presentation Model Generation Stage After mapping process stage, the Web UML XML description is used as input file for Magicdraw tool to generate UML presentation model. The UML presentation model in Fig. 10 is generated by parsing the WebUML XML Description in Fig. 5, Fig.7 and Fig. 7 which form the simple HTML block as shown in Fig. 8. IV. IMPLEMENTATION DETAILS The process described in section 3 is implemented using C#.Net Programming language (.NET Framework 3.5). This language provides the object oriented programming which assists in programming process through providing the libraries which provides many features. In addition, we use the following APIs: 1. microsoft.mshtml.dll 2. MSXML3.dll 3. PageAnalyzer.dll The PageAnalyzer.dll is considered the main API used in our implementation; this dll file generates the VIPS XML Description for any Web Page. The main challenges which we faced during the implementation process are: Figure 10. Simple UML Presentation Model. 1. 2. Mapping the coordinates of element and block in VIPS XML Description to coordinates in Web UML XML Description. Take into account all the HTML elements. V. CASE STUDY OF THE PROPOSED APPROACH In this section, we will illustrate the reverse engineering process described earlier by applying it to a specific website. A. Page Segmentation Fig. 11 shows the home page for goalzz web site , this page is segmented using the VIPS algorithm into blocks as shown in Fig. 12, some of these blocks are small and others are large in size depending on the structure of the page. The page layout of the segmented page is shown in Fig. 13. In addition, Fig. 14 shows the DOM tree for the segmented page. B. Mapping Process After the page segmentation process, the VIPS algorithm assigns each block, sub-block, and element with XML descriptions; these descriptions capture some properties of each block and sub-block in the web page, then the VIPS XML format is mapped and transformed into the Web UML XML format as illustrated in Fig. 2. Figure 11.Web Page before Segmentation Stage. © 2012 ACADEMY PUBLISHER 140 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 C. 
UML Presentation Model Generation Stage After the mapping and transformation process is achieved, the presentation model can be imported and manipulated in the Magicdraw modeling tool. Fig.15 shows the presentation model for goalzz home page. Figure 14. DOM Tree for Segmented Page. Figure 12.Web Page after Segmentation Stage. Figure 15. UML Presentation Model for Goalzz Homepage. Figure 13. Page Layout for Goalzz Homepage. VI. SUMMARY AND FUTURE WORK In this paper we present an approach of reverse engineering web applications into UML web presentation model. This issue is evolved from the more generic reverse engineering process, concentrating on the structure of the web page. We have presented an approach to the identification of structure in a web page and model this structure in UML presentation model. The approach relies on a number of structured techniques such as page segmentation. Future work will concentrate on building a complete framework which automatically builds the UML presentation model for any given application. The process of presenting the UML presentation model will be automated and will apply content mining along with the © 2012 ACADEMY PUBLISHER segmentation technique to accurately identify different blocks of the web page. Also, we will work on handling Dynamic HTML pages (DHTML). By DHTML pages we mean pages which change its layout on the client side. So you will have a unique URL with different page layout segmentation. For example consider a faculty member website, page simply list the faculty member publication, if the user clicks on any publication the abstract of that publication will be shown on the page without the need to connect back to the server. Handling DHTML pages introduces two issues that need to be considered in our future work. First, we need to make sure that the segmentation process takes this into consideration. Second, we need to check if UWE notation is flexible and rich enough to capture and model such a dynamic behavior of the web page. ACKNOWLEDGMENT This research was supported in part by German Research Foundation (DFG), Higher Council for Science and Technology (HCST) and Jordan University of Science and Technology (JUST). REFERENCES [1] G. A. D. Lucca, A. R. Fasolino, and P. Tramontana, "Reverse engineering web applications: the WARE approach," Journal of Software Maintenance and Evolution: Research and Practice, v.16, n.1-2, p.71-101, January-April 2004 “doi:10.1002/smr.281”. JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 [2] E. J. Chikofsky and J. H. II Cross, "Reverse engineering and design recovery: a taxonomy," Software, IEEE, vol.7, no.1, pp.13-17, January 1990 “doi:10.1109/52.43044”. [3] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, VIPS: a visionbased page segmentation algorithm, Microsoft Technical Report, MSR-TR-2003-79, November 2003. [4] S. Tilley and S. Huang, "Evaluating the reverse engineering capabilities of Web tools for understanding site content and structure: a case study," Proceedings of the 23rd International Conference on Software Engineering, pp. 514523, May 2001 “doi: 10.1109/ICSE.2001.919126”. [5] S. Chung and Y.-S. Lee, "Reverse software engineering with UML for Web site maintenance," Proceedings of the First International Conference on Web Information Systems Engineering, vol.2, pp.157-161, 2000 “doi: 10.1109/WISE.2000.882874”. [6] R. Virgilio and R. 
Torlone, "A Structured Approach to Data Reverse Engineering of Web Applications," Proceedings of the 9th International Conference on Web Engineering, San Sebastián, Spain, June 2009 “ doi:10.1007/978-3-642-02818-2_7”. [7] B. Djelloul, M. Mimoun, and B. S. Mohamed, "Ontology based Web Application Reverse-Engineering Approach," INFOCOMP Journal of Computer Science, vol. 6, pp. 3746, March 2007. [8] C. Kroiß and N. Koch, "The UWE Metamodel and Profile – User Guide and Reference", Technical Report 0802, Ludwig-Maximilians-Universität München, p. 35, February 2008. Natheer Khasawneh is an Assistant Professor in the Department of Software Engineering at Jordan University of Science and Technology since 2005. He received his BS in Electrical Engineering from Jordan University of Science and Technology in 1999. He received his Master and PhD degrees in Computer Science and Computer Engineering from University Akron, Akron, Ohio, USA in the years 2002 and © 2012 ACADEMY PUBLISHER 141 2005 respectively. His current research interest is data mining, biomedical signals analysis, software engineering and web engineering. Oduy Samarah is a Master student in the Department of Computer Engineering at Jordan University of Science and Technology since 2009. He received his BS in Computer Engineering from Jordan University of Science and Technology in 2009. His current research interest is in Wireless sensor network, software engineering and web engineering. Safwan Al-Omari is an assistant professor in the department of Software Engineering at Jordan University of Science and Technology. Dr. Safwan Al-Omari received his PhD degree in Computer Science from Wayne State University in 2009. He received his Master degree in Computer and Information Science from the University of Michigan-Dearborn and Bachelor degree in Computer Science from the University of Jordan in 2003 and 1995, respectively. His current research is in software engineering, service-oriented computing, and cloud computing.. Stefan Conrad is a Professor in the Department of Computer Science at Heinrich-Heine-University Duesseldorf, Germany. He was an Associate Professor in the Department of Computer Science at Ludwig-Maximilians- University in Munich, Germany, from 2000 to 2002. From 1994 to 2000, he was an Assistant Professor at the University of Magdeburg where he finished his ‘Habilitation’ in 1997 with a thesis on federated database systems. From 1998 to 1999, he was also a Visiting Professor at the University of Linz, Austria. He received his PhD in Computer Science in 1994 at Technical University of Braunschweig, Germany. His current research interests include database integration, knowledge discovery in databases, and information retrieval. He is a (co)author of two books (in German) and a large number of research papers. 142 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 Adaptive FEC Technique for Multimedia Applications Over the Internet M. AL-Rousan and . A. Nawasrah Jordan University of Science & Technology College of Computer and Information Technology Irbid, Jordan alrousan@just.edu.jo Abstract— Forward Error Correction (FEC) is a common technique for transmitting multimedia streams over the Internet. In this paper we propose a new approach of adaptive FEC scheme for multimedia applications over the Internet. This adaptive FEC will optimize the redundancy of the generated codewords from a Reed-Solomon (RS) encoder, in–order to save the bandwidth of the channel. 
The adaptation of the FEC scheme is based on predefined probability equations, which are derived from the data loss rates related to the recovery rates at the clients. The server uses the RTCP reports from clients together with the probability equations to approximate the final delivery ratio of the sent packets at the client after applying the adaptive FEC. The server also uses the RTCP reports to predict the next network loss rate using a curve fitting technique, generating the optimized redundancy needed to meet a given residual error rate at the clients.

Index Terms— Forward Error Correction (FEC), Reed-Solomon coder, network loss prediction, redundant bandwidth optimization.

I. INTRODUCTION

Streaming multimedia over the Internet suffers many difficulties, because the already installed equipment and protocols mainly support data applications. Recently, many multimedia applications have been deployed over the Internet infrastructure. Applications such as Voice over IP (VoIP), Video on Demand (VoD), and interactive gaming utilize the installed switches, routers, and backbone of the Internet. Using the Internet infrastructure provides multimedia applications at low cost to end users. Although a TCP session guarantees the delivery of all packets, it is not appropriate for online or interactive multimedia streams, because the out-of-order discard mechanism and the NACK-retransmit technique generate unacceptable jitter in the play-back multimedia player at the client. The data of multimedia streams are bound to time, which means that an arriving packet is useful only if it arrives before its play-back time. For this reason, the Forward Error Correction (FEC) technique has been used for multimedia applications; it mainly depends on sending redundant packets that can be used to recover lost packets [1-3].

A. Related Work

Most FEC research has concentrated on bit error recovery, where the Reed-Solomon codec is extensively used for storage devices such as Compact Disks (CDs) [15]. Recent network installations suffer very low Bit Error Rates (BER), as low as 10^-9 in fiber networks [15]. The main loss contributor in the Internet is the buffering of packets and discard mechanisms in routers. Multimedia applications also suffer from out-of-time delivery of packets, due to the latency of arriving packets. Packet-level recovery involves more complex computations because of the large arrays involved in the generation of large blocks of data. For a Reed-Solomon codec the computational complexity is still relatively low at the server, but it is much higher at the client side [7].

FEC source techniques vary based on the application. Nonnenmacher et al. [12] suggested a hybrid FEC and ARQ layer for delay-tolerant multicast applications, where FEC is applied as the first layer and the normal ARQ procedure then takes place for packets still lost after the FEC operation. Chan et al. [4] also target delay-tolerant video streams by introducing a selectively delayed replication stream rather than an FEC scheme, in order to achieve certain residual loss requirements. The work of Parthasarathy et al. [13] presents another hybrid approach for high-quality video over ATM networks, joining an FEC technique at the sender side with simple passive error concealment at the client side, which in turn enhances packet recovery even at high network loss rates.
Yang et al. [16] introduced an FEC scheme adapted for video over the Internet, based on the importance of high spatial-temporal frame packets and their effect on further dependent packets; they send multiple redundancy levels based on the spatial content of the packets. Chang, Lin and Wu [5] studied the FEC impact for CDMA 3-D systems over wireless ATM networks and present two levels of FEC for header and payload packets: the header contains the Discrete Cosine Transform (DCT) information, so it requires a powerful FEC scheme to be reliably delivered, while the payloads are transmitted with lower FEC protection. Song, Yu and Shaffer [15] present ideas for hardware design blocks of Reed-Solomon coders.

B. Forward Error Correction

In traditional FEC, the server adds n − k redundant (parity) packets to every k data packets, which yields n packets to be transmitted. At the client side, if any k packets out of the n packets are received, the client can recover all the lost packets without retransmission. The amount of parity n − k is determined at the start of a session, where the redundancy is calculated based on a long-term average of network loss. The redundancy R is defined as the ratio of the parity packets n − k to the block of data packets k, as in equation (1) [12]:

R = (n − k) / k    (1)

Figure 1. FEC group packets with k = 4, n − k = 2 and n = 6.

The generation of the extra parities requires a mathematical codec. The codec must be reversible, so that the client can reconstruct the lost data out of the received data. The Reed-Solomon codec is often used [6, 7]. In this paper we adaptively optimize the parity packets n − k in order to save the redundant bandwidth without degrading the quality of the displayed media. In our approach, the source generates the maximum allowed redundant parities n − k using a Reed-Solomon encoder, but it only sends the r parity packets that are required to overcome the expected network loss.

Our adaptive FEC achieves a considerable bandwidth saving over networks with low to medium loss rates without affecting the quality of the multimedia applications. That is, our scheme saves about 25% of the redundant bandwidth, which allows more clients to subscribe to the same server. The proposed scheme also responds to networks with high loss rates by saturating at the maximum allowed redundancy, which corresponds to a best-effort mode in which the adaptive FEC cannot save bandwidth but can achieve the same quality as a Reed-Solomon FEC.

II. BANDWIDTH OPTIMIZED ADAPTIVE FEC (BOAFEC)

The traditional FEC approach determines the redundant parities based on the long-term average of network loss, which is not suitable for multimedia applications where the loss is instantaneous. On the other hand, generating on-the-spot parities is not possible, since the source does not know the current loss rate at the client side, and the generation of adaptive parities involves more computational complexity. The long-term average of loss can also mislead the source into sending more parities than required, which wastes bandwidth.

To overcome the above weakness, we propose the BOAFEC approach, where the source uses the long-term average only to determine the maximum allowed redundancy R. BOAFEC predicts the current network loss using a simple three-point curve fitting technique. The network loss prediction is used to determine the number of parity packets r to be transmitted with the block of data packets k.

A. Probability Equations and Residual Loss Calculation

BOAFEC uses the Reed-Solomon encoder to generate the maximum allowed redundancy packets n − k, but only sends r parities. That is because Reed-Solomon coding involves large array computations, and it is computationally efficient to design only one coder with a fixed, maximum possible number of parities. BOAFEC predicts the network loss and then uses the probability equations to calculate the expected loss at the client. Since a packet has only two possible outcomes, delivered or lost, the Binomial distribution holds [7]. Treating the loss of a packet as the event of success and applying the Binomial distribution to a group of n packets, the probability of l packets being lost out of n packets, if the loss probability is π_v, is:

P(Loss = l) = C(n, l) · (1 − π_v)^(n−l) · π_v^l    (2)

Since the group of n packets includes k data packets and r parity packets, the Reed-Solomon coder can recover up to r of the lost packets. So the FEC coder has the property that if l packets are lost from the above group, two cases apply:

l ≤ r: all packets will be recovered;
l > r: none of the packets can be recovered.

The FEC function of order r is defined as:

FEC_r(l) = 0 if l ≤ r, and FEC_r(l) = l if l > r    (3)

The expectation [8] of the number of lost packets is:

E(x) = Σ_{l=1}^{n} l · C(n, l) · (1 − π_v)^(n−l) · π_v^l    (4)

and hence the expectation after applying the FEC is:

E(x) = Σ_{l=r+1}^{n} l · C(n, l) · (1 − π_v)^(n−l) · π_v^l    (5)

The expectation gives the number of packets expected to be lost over a group of n packets, so the new loss rate π_v′, which is the same as the residual loss rate ξ_r after applying the FEC operation, is:

π_v′ = ξ_r = E(x) / n    (7)

and the delivery rate D is given as:

D = 1 − π_v′    (8)

The above equations allow the source to calculate the residual loss rate ξ_r. The source adaptively increases or decreases the number of redundant packets r in order to meet the specified residual loss ξ_s at the client. The number of data packets k and the maximum number of added codewords n − k influence the Reed-Solomon encoder; larger block sizes result in more computational complexity. The L. Rizzo Reed-Solomon codec [7] is commonly used for software FEC coding.

B. Network Loss Prediction

The network loss fluctuates heavily over time, so an average over long periods rarely reflects the actual network status. Figure 2 shows the loss of a network running for 120 seconds; although each reading in Figure 2 is an average over 500 ms, the loss still fluctuates over time. The long-term average loss of this run was 15.3%, yet the instantaneous loss obviously deviates from this average most of the time. We present a simple network loss prediction based on the last three RTCP reports from the destination to the source. Every RTCP report contains the loss observed at the client during the last transmission. The source constructs a matrix of the network loss and, using the Gaussian elimination method with pivoting, can predict the next loss rate in a finite time. The source then uses the predicted loss to send the appropriate number of parity packets r, which can be used at the client to reconstruct the lost packets in order to meet the specified residual loss rate ξ_s.
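To make the use of the probability equations above concrete, the following sketch (in Python, not part of the original paper) computes the expected residual loss ξ_r and delivery rate D for a given block size k, parity count r and loss rate π_v, and then scans for the smallest r that meets a target residual loss ξ_s. The function and variable names are illustrative; the sketch assumes that the transmitted group contains the k data packets plus only the r parities actually sent, and the block and parity limits in the example are placeholder values.

```python
from math import comb

def residual_loss(n: int, r: int, pi_v: float) -> float:
    """Expected residual loss rate xi_r after FEC of order r
    (equations (2), (5) and (7)): only loss patterns with more
    than r missing packets remain unrecoverable."""
    expected_lost = sum(
        l * comb(n, l) * (1 - pi_v) ** (n - l) * pi_v ** l
        for l in range(r + 1, n + 1)
    )
    return expected_lost / n

def delivery_rate(n: int, r: int, pi_v: float) -> float:
    """Delivery rate D = 1 - xi_r (equation (8))."""
    return 1.0 - residual_loss(n, r, pi_v)

def optimal_parities(k: int, r_max: int, pi_v: float, xi_s: float) -> int:
    """Smallest number of parity packets r <= r_max whose expected residual
    loss meets the target xi_s; falls back to r_max (best-effort mode)
    when the target cannot be met."""
    for r in range(r_max + 1):
        if residual_loss(k + r, r, pi_v) <= xi_s:
            return r
    return r_max

if __name__ == "__main__":
    k, r_max = 30, 18            # example block and parity limits
    for pi_v in (0.05, 0.15, 0.30):
        r = optimal_parities(k, r_max, pi_v, xi_s=0.01)
        print(f"loss {pi_v:.2f}: send r = {r} parities, "
              f"D = {delivery_rate(k + r, r, pi_v):.4f}")
```

In this reading, the server would re-run the scan whenever a new loss prediction arrives, which is exactly the role the procedure in the next subsection assigns to the "calculate the optimal redundancy" step.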
C. BOAFEC Procedure

The server and client negotiate at the start of the session to determine the network status, such as the round-trip time and the long-term loss average. They also negotiate to determine the parameters of the operation, such as the residual loss rate, the number of packets k in each block, and the maximum allowed redundancy. The procedure of operation for BOAFEC is as follows:

Server:
• Allocate a certain bandwidth for each client at the start of a session, based on the server's available resources and the multimedia application.
• Determine the k and n − k parameters, and hence the maximum allowable redundancy from equation (1), based on the long-term history of the network loss and the target residual loss ξ_s: R_max = (n − k) / k.
• Start the transmission assuming the highest network loss, and therefore use the maximum redundancy.
• Wait for three RTCP reports in order to predict the next network loss rate.
• Calculate the optimal redundancy based on the probability equations, then update the number of redundant packets for the next transmission.

Client:
• Ask for a reservation of a suitable bandwidth for the multimedia application.
• Determine the acceptable residual loss rate ξ_s for the application.
• Specify a client window size based on the round-trip time, the block of packets k, and R_max.
• Send RTCP reports.

Figure 2: Raw network loss of a 120-second simulated network; every reading represents an average over 500 ms.

D. The BOAFEC Parameters

BOAFEC generates the redundant parities based on four parameters: the Network Loss Rate (NLR), the maximum available redundant bandwidth, the maximum allowed jitter, and the target residual loss rate ξ_s.

Network Loss Rate (NLR): The NLR is clearly the main factor influencing the BOAFEC decision on the number of redundant parities. BOAFEC has no control over the NLR, because the NLR corresponds to network failures or congestion. The NLR is also the only continuously changing variable among the four BOAFEC parameters; the other three are fixed at the start of a session.

Maximum Redundant Bandwidth (MRB): The MRB determines how much redundancy the server can afford for the client as redundant codewords to be used later for recovering lost packets. The MRB is inversely proportional to the residual loss rate. The MRB is defined as the ratio of the number of redundant parity packets to the number of data packets; equation (1) represents the MRB. The server determines the MRB based on the registered QoS for each client: a higher QoS requirement needs more MRB associated with the data transmission, in order for the receiver to reconstruct more lost packets.

Maximum Allowed Jitter (MAJ): The MAJ represents the maximum jitter a packet may have and still be displayed, so that the media application runs smoothly. The MAJ indicates the degree of interactivity of the media application; applications with higher interactivity require a smaller MAJ. The MAJ lets the media player determine when it should discard a delayed packet.

Target Residual Loss Rate (ξ_s): The target residual loss rate ξ_s is the maximum tolerable loss rate for a multimedia application to run smoothly. Since multimedia applications cannot tolerate the excess delay generated by the retransmission of lost packets, there must be some residual loss even when using FEC.
The residual loss can be further reduced by using receiver-based error correction techniques such as interleaving or repetition of lost packets [14].

III. NETWORK LOSS BEHAVIOR

The network loss behavior over the Internet is very difficult to characterize, because many of the variables involved can be neither predicted nor precisely defined. Although the loss over a channel is simply the number of lost packets relative to the total number of sent packets, the sequence of losses, i.e., the probability of a particular packet being lost, is not simply defined.

A. Gilbert Model

A well-known approximation of network loss is the Gilbert model, which uses a two-state Markov chain to represent the end-to-end loss. The Gilbert model is widely used to simulate Internet loss due to its simplicity and mathematical tractability [9][12][16]. The two-state Markov chain is shown in Figure 3. State 0 represents a lost packet, while state 1 represents a packet that reached the destination. Let p denote the transition probability from state 1 to state 0 and q the transition probability from state 0 to state 1. The probability that a loss is followed by another loss is then (1 − q), and once a packet is lost, the probability that the burst extends to n consecutive losses is (1 − q)^(n−1). From the Markov chain transition matrix [16], the long-run loss probability π can be defined as:

π = p / (p + q)    (9)

Figure 3: Two-state Markov model.

IV. SIMULATION RESULTS

In this section we present the simulation results for the BOAFEC scheme. We show the response of BOAFEC to different MRB values and how the MRB affects the overall residual loss ξ_s and the used redundancy. We also compare the BOAFEC results with the pure Reed-Solomon FEC (RS FEC) and with a replicated stream approach.

The MRB is a very important factor for improving the recovery rates at the clients, whereas the server must bound the MRB in order to determine the number of clients that can be attached to it at once. The MRB relates directly to the recovery rates at the clients, and hence inversely to the residual loss. Using BOAFEC lets the server assign more MRB to clients, since the BOAFEC scheme optimizes that redundancy.

Figures 4 and 5 show the response of BOAFEC to different NLR values as the MRB varies. Figure 4 shows that increasing the MRB results in better residual loss rates at the clients, but only up to a certain level for networks with low loss rates: when the NLR = 0.05, increasing the MRB beyond 0.2 results in only a slight reduction of the residual loss at the clients. This behavior is due to the BOAFEC response once the residual loss matches the target loss ξ_s. Figure 5 presents the relation of the MRB to the actually used redundancy; note that BOAFEC optimizes the redundant bandwidth so that the parity packets are used as much as possible.

Finally, we compare BOAFEC with the traditional Reed-Solomon FEC and with the replicated stream FEC scheme. The replicated stream approach simply sends every packet twice, which increases the probability that at least one of the two packets arrives. The replicated stream obviously requires 100% redundancy, but has the lowest processing overhead among the known FEC schemes. The replicated stream is very useful for devices with low processing resources, and it is also suitable for networks with high loss rates. The Reed-Solomon codec was simulated using the parameters k = 30 and n − k = 18, hence n = 48. The BOAFEC parameters were a residual loss rate ξ_s = 1%, k = 30, and a maximum redundancy of 0.6.
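The loss process in these simulations follows the Gilbert model of Section III. As a small illustration (in Python, not from the paper; the transition probabilities are arbitrary examples), the sketch below generates a packet-loss trace from the two-state chain and checks that the empirical loss rate approaches π = p / (p + q) from equation (9).

```python
import random

def gilbert_trace(p: float, q: float, n_packets: int, seed: int = 1):
    """Simulate the two-state Gilbert model.
    State 1 = packet delivered, state 0 = packet lost.
    p: P(1 -> 0), q: P(0 -> 1).  Returns a list of 0/1 states."""
    rng = random.Random(seed)
    state = 1                       # start in the 'delivered' state
    trace = []
    for _ in range(n_packets):
        if state == 1 and rng.random() < p:
            state = 0
        elif state == 0 and rng.random() < q:
            state = 1
        trace.append(state)
    return trace

if __name__ == "__main__":
    p, q = 0.03, 0.17               # example transition probabilities
    trace = gilbert_trace(p, q, 200_000)
    empirical_loss = trace.count(0) / len(trace)
    print(f"empirical loss = {empirical_loss:.4f}, "
          f"pi = p/(p+q) = {p / (p + q):.4f}")
```

Such a trace can be fed to the BOAFEC decision logic in place of real RTCP measurements when reproducing the kind of experiment reported below.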
Figure 4: Residual loss versus the MRB for different NLR; k = 20 and ξ_s = 1%.

Figure 5: Used redundant bandwidth versus the MRB for different NLR; k = 20 and ξ_s = 1%.

Figure 6: Residual loss versus the network loss; k = 30, n − k = 18, ξ_s = 1% and maximum redundancy = 60%.

Figure 7: Redundant bandwidth versus the network loss; k = 30, n − k = 18, ξ_s = 1% and maximum redundancy = 60%.

From Figure 6, the residual loss rate of the BOAFEC scheme is very close to that of the RS FEC in the loss region of 0% to 20%, where both BOAFEC and RS FEC perform much better than the replicated stream; the replicated stream, however, still performs better than both in networks with high loss rates (greater than 40%). Figure 6 also shows oscillating results for BOAFEC below the specified residual loss ξ_s, which is 1% in our example. This residual loss ξ_s is considered to be compensated for, whether it is objectively tolerable or the destination uses other client-based repair techniques, such as interpolation or regeneration of lost packets [14]. Figure 7 shows the redundant bandwidth required by each scheme. BOAFEC clearly uses the lowest bandwidth while maintaining comparable residual loss rates. BOAFEC achieves its best results for networks with low to medium loss rates. Note that when the loss rate exceeds 25%, BOAFEC saturates at the limit of its maximum allowed redundant bandwidth in order to reduce the residual loss rate.

V. CONCLUSION

Packet loss is inevitable in networks. Data networks can tolerate latency but not loss, whereas multimedia networks can tolerate loss but cannot tolerate latency, due to the interactive nature of multimedia applications. FEC is the recovery technique with the lowest latency, and it is a very promising technique for developing multimedia applications over the Internet without sacrificing the QoS of the media applications. In this paper, we proposed and studied a bandwidth-optimized FEC approach that predicts the loss at the client while optimizing the amount of redundancy in order to achieve a given residual loss rate. BOAFEC achieves recovery rates very close to those of the pure FEC, while saving 25% of the bandwidth on average, when the network loss rates are in the range of 0% to 20%. In networks with high loss rates, BOAFEC saturates at the maximum allowed redundancy in order to achieve the best possible quality.

REFERENCES

[1] C.-H. Ke, R.-S. Cheng, C.-D. Tsai, and M.-F. Tsai, "Bandwidth Aggregation with Path Interleaving Forward Error Correction Mechanism for Delay-Sensitive Video Streaming in Wireless Multipath Environments," Tamkang Journal of Science and Engineering, vol. 13, no. 1, pp. 1-9, 2010.
[2] M.-F. Tsai, C.-K. Shieh, W.-S. Hwang, and D.-J. Deng, "An Adaptive Multi-Hop FEC Protection Scheme for Enhancing the QoS of Video Streaming Transmission over Wireless Mesh Networks," International Journal of Communication Systems, vol. 22, pp. 1297-1318, 2009.
[3] E. Au, F. Caggioni, and J. Hutchins, "Core Optics 100G Forward Error Correction," White Paper, 2010, www.oifforum.com, accessed June 21, 2011.
[4] S. Chan, X. Zheng, Q. Zhang, W. Zhu, and Y. Zhang, "Video Loss Recovery with FEC and Stream Replication," IEEE Transactions on Multimedia, vol. 8, no. 2, April 2006.
[5] P. Chang, C. Lin, and M. Wu, "Design of Multimedia CDMA for 3-D Stereoscopic Video over Wireless ATM Networks," IEEE Transactions on Vehicular Technology, vol. 49, no. 2, March 2000.
[6] J. Lacan, V. Roca, J. Peltotalo, and S. Peltotalo, "Reed-Solomon Forward Error Correction (FEC)," draft-ietf-rmt-bb-fec-rs-01.txt (work in progress), June 2006.
[7] M. Luby, L. Vicisano, J. Gemmell, L. Rizzo, M. Handley, and J. Crowcroft, "The Use of Forward Error Correction (FEC) in Reliable Multicast," RFC 3453, December 2002.
[8] J. H. Mathews, "Numerical Methods for Computer Science, Engineering and Mathematics," 1st Edition, ISBN 0-13-626565-0.
[9] T. Mizuochi, "Recent Progress in Forward Error Correction and Its Interplay With Transmission Impairments," IEEE Journal of Selected Topics in Quantum Electronics, vol. 12, no. 4, July/August 2006.
[10] A. W. Moore, "Probability Density in Data Mining," Carnegie Mellon University.
[11] A. W. Moore, "Probabilistic and Bayesian Analytics," Carnegie Mellon University.
[12] J. Nonnenmacher, E. Biersack, and D. Towsley, "Parity-Based Loss Recovery for Reliable Multicast Transmission," IEEE/ACM Transactions on Networking, vol. 6, no. 4, August 1998.
[13] V. Parthasarathy, J. Modestino, and K. S. Vastola, "Reliable Transmission of High-Quality Video over ATM Networks," IEEE Transactions on Image Processing, vol. 8, no. 3, March 1999.
[14] C. Perkins, O. Hodson, and V. Hardman, "A Survey of Packet Loss Recovery Techniques for Streaming Audio," University College London, IEEE Network, September 1998.
[15] L. Song, M. Yu, and M. Shaffer, "10 and 40 Gb/s Forward Error Correction Devices for Optical Communications," IEEE Journal of Solid-State Circuits, vol. 37, no. 11, November 2002.
[16] X. Yang, C. Zhu, Z. Li, X. Lin, and N. Ling, "An Unequal Packet Loss Resilience Scheme for Video Over the Internet," IEEE Transactions on Multimedia, vol. 7, no. 4, August 2005.

Mohammad Al-Rousan is currently a full professor at the Department of Network Engineering and Security, Jordan University of Science and Technology (JUST). He was educated in KSA, Jordan, and the USA. He received his BSc in Computer Engineering from King Saud University, Saudi Arabia, in 1986. He received his M.S. in Electrical and Computer Engineering from the University of Missouri-Columbia, MO, USA, in 1992. In 1996, he was awarded the PhD in Electrical and Computer Engineering from Brigham Young University, UT, USA. He was then an assistant professor at JUST, Jordan. In 2002, he joined the Computer Engineering Department at the American University of Sharjah, UAE. Since 2008, he has been the Dean of the College of Computer and Information Technology at JUST. He is the Director of the Artificial Intelligence and Robotics Laboratory, and a co-founder of the Nano-bio laboratory at JUST. His research interests include wireless networking, system protocols, intelligent systems, computer applications, nanotechnology, and Internet computing. Dr. Al-Rousan has served on organizing and program committees for many prestigious international conferences. He is the recipient of several prestigious awards and recognitions. He co-chaired the international conferences on Information and Communication Systems (ICICS09 and ICICS11).

Ahmad Nawasrah received his MS in Computer Engineering from Jordan University of Science and Technology in 2007. He has been the COE of AL-Bahith Co. His research interests include embedded systems and coding.
Semantic Approach for the Spatial Adaptation of Multimedia Documents

Azze-Eddine Maredj and Nourredine Tonkin
Research Center on Scientific and Technical Information (CERIST)
Ben Aknoun, Algiers, Algeria
Email: {amaredj, ntonkin}@mail.cerist.dz

Abstract—With the proliferation of heterogeneous devices (desktop computers, personal digital assistants, phones), multimedia documents must be played under various constraints (small screens, low bandwidth, etc.). Taking these constraints into account with current document models is impossible. Hence, generic source documents must be adapted or transformed into documents compatible with the target contexts. The adaptation consists of modifying the specification of the document in a minimal way so that it satisfies the target profile. The profile defines the constraints that must be satisfied for the document to be played. Transgressive adaptation becomes necessary when no model of the specification satisfies this profile. We focus on the spatial dimension of a multimedia document and provide an approach to spatial adaptation that best preserves the semantics of the initial document. In existing transgressive adaptation, the relations that do not comply with the target profile are often not replaced by the closest ones, because their immediate neighboring relations all have the same similarity degree even though there may be differences between them. In this paper, we extend this approach to, firstly, better preserve the proximity between the adapted and the original documents by weighting the arcs of the conceptual neighborhood graphs and, secondly, deal with richer relation models by integrating the concept of relation relaxation graphs, which make it possible to handle the distances defined within the relations.

Index Terms—multimedia document, spatial adaptation, conceptual neighborhood graph, distance relaxation

I. INTRODUCTION

A multimedia document should be playable on different platforms (mobile phones, PDAs, laptops, PCs, etc.) and must be presented according to the user's preferences. The challenge is to deliver media objects to users with high quality, in a consistent presentation that reproduces as closely as possible the semantics of the original document. Indeed, in interactive online courses, the reader may use any type of device without worrying about whether the document is supported, and the provided document must reproduce the same content as specified by the author. To deal with the diversity of target contexts and user preferences, a multimedia document must be adapted before being played: given the profile (hardware and software constraints and user preferences) and the source document, the adaptation must transform the document to be compatible with the target profile.

A lot of work has been done along these lines and has shown that there are two types of adaptation. The first type is to specify directly the different organizations of the document for each target platform. In this case, the task becomes complex, since it requires extra work from the author, who must specify the conditions of execution of the document on each target context. It can also be incomplete, since the author must foresee all the existing targets. The second type of adaptation is based on the dynamic adaptation of the document performed by a program transforming the document.
In this type, two kinds of adaptation are possible: a local adaptation, which considers each media object individually but most often does not preserve the document semantics, and a global adaptation, which concerns the document composition (temporal, spatial and hypermedia specifications) and preserves the semantics of the document [1]. This paper focuses on the latter kind of adaptation and addresses only the spatial aspect. It is devoted to the transgressive adaptation of the spatial relations, where the document is represented by an abstract structure expressing all the relations between media objects. In this context, adapting a multimedia document consists in transforming its abstract structure so that it meets the requirements of the target profile.

In [1], it was shown that a spatial relation can be represented by the combination of Allen relations [2] on the horizontal and vertical axes, and that spatial adaptation follows the same principle as the one defined for temporal adaptation in [3]. This principle states that if no model of the original specification satisfies the adaptation constraints (context constraints), then transgressive adaptation is applied. In that approach, the transgressive adaptation consists in transforming the relations between media objects while ensuring two main properties: (i) the adaptation constraints are satisfied, and (ii) the adapted document is, semantically, as close as possible to the initial document. This amounts to finding another set of models (solutions), close to the initial one, that satisfies these constraints. The proposed solution is to replace each relation that does not meet the profile by another, semantically close, relation. To find the closest relations, the conceptual neighborhood graph proposed in [4] is used.

In that approach, the specification uses models (temporal and spatial) where delays and distances between media objects are not considered, whereas the produced documents (which are the subjects of adaptation) are often, for expressiveness purposes, composed using much richer models. In our previous work [5], we showed the interest of using a weighted conceptual neighborhood graph when seeking the closest substitution relations. In this work, we extend this approach to spatial models where distances between media objects are defined. To represent such spatial models, we use the temporal model of Wahl and Rothermel [6], which is an adaptation of the general model of Allen [2] in which the authors proposed a set of composition operators integrating the delay concept into the relations. The elaboration of those operators was motivated by the need to facilitate the temporal specification of the document. More details about this representation are presented in Section IV.

In the second section of this article, we present the context of our work. The third section presents multimedia adaptation approaches, and in Section IV we present our approach to spatial adaptation. In the fifth section we present the adaptation procedure, and the last section concludes this paper.

II. CONTEXT OF THE WORK

A multimedia document specification comprises temporal relations defining the synchronization between media objects and spatial relations expressing the spatial arrangement of these media objects.
Different users of multimedia documents impose different presentation constraints on the specification, such as display capabilities (screen size and resolution). The user's device may not have the capabilities required to support the spatial constraints of the document. For example, consider a multimedia document with the spatial relation "image B Above image A", where the resolution of both images is 200 x 300. If the terminal resolution is at least 400 x 600, the user will have no issue displaying this document; but if its resolution is lower than 400 x 600, we have the following possible solutions [6]: (i) delete one of the two images, (ii) resize image A or image B, or (iii) change the spatial relation Above to another relation. Deleting one of the two images may distort the document or make it incomprehensible. Resizing one of the objects does not affect the relation between them, but may lead to a wrong interpretation or make the image indistinguishable, as in the case of an X-ray image, for example. Modifying the spatial relation causes no loss of information, unlike image deletion, and does not make the images indistinguishable; it only changes the positions of those images. Here, we focus on transforming the relations while trying to preserve the document semantics as well as possible.

Before presenting the spatial adaptation of multimedia documents, we first review the different approaches to multimedia document adaptation.

III. ADAPTATION APPROACHES

Several approaches have been proposed for multimedia document adaptation; we group them into four categories: the three categories presented in [1], namely specification of alternatives, use of transformation rules, and use of flexible document models, to which we add the semantic and dynamic approaches.

Specification of alternatives: The author of the document specifies a set of presentation alternatives by defining criteria on some media objects of the document. If the media objects satisfy the criteria, they are selected and presented; otherwise, they are deleted from the presentation. The adaptation is performed beforehand, as in SMIL, which defines the switch operator to specify alternatives that are played only if they comply with the target profile. The advantage of this family is that the adaptation is instantaneous. However, the author has to foresee all the possible target profiles and specify all the conceivable alternatives.

Using transformation rules: This category uses a set of transformation rules that are applied to the multimedia documents. The adaptation consists of selecting and applying rules that transform the document to satisfy the target profile. The advantage of this approach is that the author does not have to care about the execution context of the document. Furthermore, this set of rules can be extended if new contexts appear. However, all the transformation rules must be specified to ensure an effective adaptation.

Using flexible document models: The adapted document is generated automatically from a non-composed set of media objects represented by a model defining an abstraction of the document. Thanks to a formatting model, a multimedia presentation can then be generated. In this category, we can mention Cuypers [8], which aims to generate web-based presentations from multimedia databases and is based on the use of semantic relations between multimedia objects and ontologies.
Semantic and dynamic approaches: In [1], an approach based on qualitative specifications of the document was proposed. Each document is considered as a set of potential executions, and each profile is considered as a set of possible executions. The adaptation is done according to the context at execution time. It consists of computing the intersection between the potential executions (the models of the initial specification) and the possible executions corresponding to the target profile. The advantage of this approach is its independence from description languages. However, the use of conceptual neighborhood graphs in which all arc weights are set to one (1) assumes that a relation may be replaced by any of its immediate neighbors, while there are substantial differences between them, especially when more elaborate relation models are used. Furthermore, the delays and the distances defined within the relations are not considered.

In the remainder of this paper, we present our proposal for spatial adaptation, which fits into this last category.

IV. SPATIAL ADAPTATION

A. Context Description [9]

To describe the target context (profile), the universal profiling schema (UPS) [10] can be used. It is defined to serve as a universal model that provides a detailed description framework for different contexts. UPS is built on top of CC/PP and RDF [11]. Unlike CC/PP, UPS does not describe only the context of the client; it also describes all the entities that exist in the client environment and that can play a role in the adaptation chain from the content server to the target client. UPS includes six schemas: the client profile schema, the client resources profile, the document instance profile, the adaptation method profile and the network profile. The description of the user is included in the client profile schema.

B. Spatial Relations Model

To describe the spatial presentation of a multimedia document, we use the directional representation [12], which defines the orientation in space between media objects. In this representation, a media object is considered as two intervals corresponding to its projections on the horizontal and vertical axes. The set of directional relations is obtained by combining the intervals of the two media objects on the two axes using, on each axis, the Wahl and Rothermel relation model [6] presented in Figure 1. There are 20 possible relations (10 basic relations and their inverses) on each axis. Thus, a spatial relation is represented by two temporal relations [1]: one on the horizontal axis and one on the vertical axis. For example, the spatial relation left_top can be represented by its two components: before (on the horizontal axis) and before (on the vertical axis). For the spatial adaptation of multimedia documents, we follow the same procedure as the one presented in [5] for temporal adaptation.

C. Conceptual Neighborhood Graph of the Spatial Relations

The representation of spatial relations by two components (temporal relations) allows us to use the conceptual neighborhood graph of the temporal relations [4] to build the conceptual neighborhood graph of the spatial relations. Indeed, for each component of a spatial relation (vertical and horizontal), we use the temporal conceptual neighborhood graph. The composition of the two graphs gives the conceptual neighborhood graph of the spatial relations.
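As a small illustration (in Python, not from the paper), the sketch below handles a spatial relation as a pair of axis relations, as in the left_top example above, and treats two spatial relations as neighbors when exactly one axis component changes by a single neighborhood step; this assumes the product-graph construction described next, and the tiny neighbor table is only a hypothetical excerpt of the full Wahl-Rothermel graph.

```python
from typing import Dict, Set, Tuple

# Hypothetical excerpt of the temporal conceptual neighborhood
# (the complete graph is the one of Figure 4 below).
TEMPORAL_NEIGHBORS: Dict[str, Set[str]] = {
    "before":   {"overlaps"},
    "overlaps": {"before", "while"},   # illustrative arcs only
    "while":    {"overlaps"},
}

# A spatial relation is a pair (horizontal relation, vertical relation).
SpatialRelation = Tuple[str, str]

LEFT_TOP: SpatialRelation = ("before", "before")   # example from the text

def are_spatial_neighbors(r1: SpatialRelation, r2: SpatialRelation) -> bool:
    """Neighbors iff exactly one axis component changes, and that change
    is itself a temporal neighborhood step."""
    (h1, v1), (h2, v2) = r1, r2
    if h1 == h2 and v1 != v2:
        return v2 in TEMPORAL_NEIGHBORS.get(v1, set())
    if v1 == v2 and h1 != h2:
        return h2 in TEMPORAL_NEIGHBORS.get(h1, set())
    return False

if __name__ == "__main__":
    print(are_spatial_neighbors(LEFT_TOP, ("overlaps", "before")))   # True
    print(are_spatial_neighbors(LEFT_TOP, ("overlaps", "overlaps"))) # False
```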
It’s the square product of the conceptual neighborhood graph of temporal relations. Conceptual Neighborhood Two relations between two media objects are conceptual neighbors if they can be directly transformed into one another by continuous deformation (shortening or lengthening) of the duration of the media objects without going through an intermediate relation. For example, in figure 2, the relations before and Overlaps are conceptual neighbors since a temporal extension of the media object A may cause a direct transition from the relation before to the relation Overlaps. And in figure 3, the relations before and Contains are not conceptual neighbors, since a transition between those relations must go through one of the relations Overlaps, Endin, Cobegin, Coend, Beforeendof1 , Cross-1, Delayed-1 or Startin-1. The Conceptual neighborhood graph, presented in figure 4, is defined as a graph where the nodes correspond to the relations of Wahl and Rothermel model and each arc between two nodes (relations) r and r’ corresponds to the satisfaction of the propriety of the conceptual neighborhood, i.e., r and r’ are conceptual neighbors. Figure 2. Example of neighboring relations. Figure 1. The relations model of Wahl and Rothermel [6] © 2012 ACADEMY PUBLISHER Figure 3. Example of non-neighboring relations JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 151 Figure 6. Information items of a relation TABLE I. INFORMATION ITEMS THAT CHARACTERIZE A TEMPORAL RELATION Information 1 2 3 4 5 6 Signification begin(A) begin(B) end(A) end(B) 1>2 1<2 Information 7 8 9 10 11 12 1>4 1<4 3>4 3<4 3>2 3<2 Signification Figure 4. Conceptual neighborhood graph of the Wahl and Rothermel relations Figure 5. Conceptual neighborhood graph of Allen relations Weighting of the Conceptual Neighborhood Graph In the conceptual neighborhood graph of the relations of Allen [2] (figure 5) presented in [3], the weights of the arcs are set to 1. This assumes that a relation can be indifferently replaced by any one of its neighbors having the same smallest conceptual distance whereas there may be a substantial difference between the candidate relations. It would be interesting to differentiate the proximity degree between these relations. The distinction in the proximity of the neighboring relations is done by assigning different weights to the arcs of a graph.To assign different weights to the arcs of a conceptual neighborhood graph, our idea is to identify all the information items that characterize a temporal relation so they serve as a basis for comparing and differentiate the similarity between the relations. Information Items of a Relation The analysis of a relation between two media A and B (Figure 6) on a time axis showed that the positioning is done according to the order that exists between their respective edges (occurrence order of the beginning and ending instants of the media objects). Therefore, to characterize a temporal relation, we selected the following information items: the values of the beginnings and the endings of the media objects and the orders (precedes (>) or succeeds (<)) between their edges. Table1 gives a recapitulation of the selected information items. Then, we have for each relation, determined the information items that it contains among the selected ones and the result is given in Table 2. © 2012 ACADEMY PUBLISHER TABLE II. 
TABLE II. INFORMATION ITEMS CONTAINED IN THE WAHL AND ROTHERMEL RELATIONS

Relation        1  2  3  4  5  6  7  8  9  10 11 12
Before          0  1  1  0  0  1  0  1  0  1  0  1
Overlaps        1  1  1  1  0  1  0  1  0  1  1  0
Endin           0  1  1  1  0  1  0  1  0  1  1  0
Cobegin         1  1  0  0  0  1  0  1  0  1  1  0
Coend           0  0  1  1  0  1  0  1  0  1  1  0
Beforeendof-1   0  1  1  0  0  1  0  1  0  1  1  0
Cross-1         1  1  1  1  0  1  0  1  0  1  1  0
Delayed-1       1  1  1  1  0  1  0  1  0  1  1  0
Startin-1       1  1  1  0  0  1  0  1  0  1  1  0
While           1  1  1  1  1  0  0  1  0  1  1  0
Contains        1  1  1  1  0  1  0  1  1  0  1  0
Beforeendof     1  0  0  1  1  0  0  1  1  0  1  0
Cross           1  1  1  1  1  0  0  1  1  0  1  0
Delayed         1  1  1  1  1  0  0  1  1  0  1  0
Startin         1  1  0  1  1  0  0  1  1  0  1  0
Cobegin-1       1  1  0  0  1  0  0  1  1  0  1  0
Endin-1         1  0  1  1  1  0  0  1  1  0  1  0
Coend-1         0  0  1  1  1  0  0  1  1  0  1  0
Overlaps-1      1  1  1  1  1  0  0  1  1  0  1  0
Before-1        1  0  0  1  1  0  1  0  1  0  1  0

Calculation of the Similarity Degree Between a Relation and its Immediate Neighbors

To calculate the similarity degree between a relation and its neighbors, a distance must be defined. The aim of this distance is to differentiate the proximity between the relations, not to achieve precision; thus, any distance definition can be used (Euclidean distance, Manhattan distance, etc.). We chose the Manhattan distance, defined as follows. Let us consider two vectors V = (v1, v2, ..., vn) and U = (u1, u2, ..., un). The Manhattan distance between V and U is:

d(V, U) = Σ_{i=1}^{n} |v_i − u_i|

In our work, we consider the information items of each relation as a vector, where the value of each information item is set to 1 if this information item is included in the relation and to 0 otherwise. Using the Manhattan distance, we established the distances between each relation and its immediate neighbors, as presented in Table III. Then, we assign the calculated distances to the arcs of the conceptual neighborhood graph, as shown in Figure 7.

TABLE III. DISTANCES BETWEEN THE RELATIONS AND THEIR NEIGHBORS

Figure 7. Weighted conceptual neighborhood graph of the Wahl and Rothermel relations.

The Conceptual Distance of a Spatial Relation

The conceptual distance [1] between two relations r and r' is the length of the shortest path between those relations in the conceptual neighborhood graph. The conceptual distance of a spatial relation is defined as the sum of the conceptual distances of its two components (the temporal relations on the two axes). For example, let us consider the two spatial relations r1 = <Before, Overlaps> and r2 = <Endin, While>. The conceptual distance between r1 and r2 is dx(r1, r2) + dy(r1, r2), i.e., d(Before, Endin) + d(Overlaps, While).

D. Distance Relaxation

Since a spatial relation is considered as two temporal relations, its distances are treated as delays in the two temporal relations. In this case, before seeking a substitution relation, we start by trying to relax these delays in order to keep the relations, which gives us the opportunity to best preserve the semantics of the initial document. For example, the relation A Before(d) B can be transformed into the relation A Before(-) B, where the symbol "-" means that the distance is not specified. We present below the delay relaxation of the relations of the Wahl and Rothermel model. For this, we use a graph structure (which we call a relaxation graph) based on the number of delays defined for a relation, as shown in Figure 8. The relaxation graph of a relation is composed of the different forms of this relation obtained by a progressive relaxation of its delays. To illustrate this principle, consider the example of Figure 9, where the specified relation is

Image <Overlaps(200, 50, 250), Contains(10, 90)> Text

and the target screen size is 480 x 360. The two objects cannot be displayed, because the width of the area they occupy (500 px) is larger than the screen width (480 px). The distance relaxation of this relation can lead to an adaptation solution without changing the relation, and this solution (see Figure 10) may be

Image <Overlaps(180, 70, 230), Contains(10, 90)> Text

Figure 8. Delay relaxation graphs.
Figure 9. Example of a spatial specification.
Figure 10. An adaptation solution with distance relaxation.
Figure 11. An adaptation solution without distance relaxation.

If we proceeded directly by replacing the relation, as was proposed in [1], then the relation Overlaps would be directly replaced by the relation While and we would obtain the result of Figure 11.

V. ADAPTATION PROCEDURE

The semantic adaptation of a multimedia document is achieved by modifying the specification of the document. This involves finding another set of values for the distances of the relations or, failing that, a set of relations satisfying the adaptation constraints of the target platform with the smallest distance from the initial specification. The distance relaxation and relation replacement processes are performed by traversing the graphs (the relaxation graph for distance relaxation and the conceptual neighborhood graph for relation replacement) in both directions. This is done by searching for the shortest path between the relation to be replaced and the other relations of the graph. A relation is considered as a candidate to replace the initial one only if it does not lead to an inconsistency (for this we recommend the use of the Cassowary solver [13]). Finally, the solution with the smallest conceptual distance is taken as the adapted document. The adaptation procedure follows two phases: distance relaxation and transgressive adaptation.

A. Adaptation Algorithm

We implemented this procedure through an algorithm, shown as Algorithm 1.
// Phase 1: relaxation
Input: MI[i,j]   // matrix of the document relations

// Search for relaxed relations
for i = 0 to n-1 do            // n: number of objects
  for j = 0 to n-1 do
    RG = SelectRG(MI[i,j])
    MS[i,j] = determineRelaxedRelations(RG)
  end for
end for

// Elaboration of the possible combinations
// output: combinations list Cp
Cp = ElaborateCombinationsMatrix(MS[i,j])

// Sort combinations according to the conceptual distance
for i = 0 to nCombinations-1 do
  d[i] <- 0                    // matrix of the conceptual distances
  for j = 0 to n-1 do
    d[i] = d[i] + Dijkstra(C[i,j], MR[i,j])
  end for
end for
QuickSortCombinations(C[i], d[i])

// Consistency verification
found <- false
for i = 0 to nCombinations-1 do
  if Consistency(C[i]) then
    Solution <- C[i]
    Exit()
  end if
end for

// Phase 2: replacement
// Search for replacement relations
Clear(MS[i,j])
for i = 0 to n-1 do            // n: number of objects
  for j = 0 to n-1 do
    for k = 1 to NR do         // NR: number of relations of the model
      if respectsProfile(Rm[k]) then   // Rm: set of the model relations
        MS[i,j] <- MS[i,j] U {Rm[k]}
      end if
    end for
  end for
end for

// Elaboration of the possible combinations
// output: combinations list Cp
Cp = ElaborateCombinationsMatrix(MS[i,j])

// Sort combinations according to the conceptual distance
for i = 0 to nCombinations-1 do
  d[i] <- 0                    // matrix of the conceptual distances
  for j = 0 to n-1 do
    d[i] = d[i] + Dijkstra(C[i,j], MR[i,j])
  end for
end for
QuickSortCombinations(C[i], d[i])

// Consistency verification
found <- false
for i = 0 to nCombinations-1 do
  if Consistency(C[i]) then
    Solution <- C[i]
    Exit()
  end if
end for

Algorithm 1. Adaptation algorithm

- RG: relaxation graph of the selected relation MI[i,j].
- determineRelaxedRelations(RG): determines all the forms of a relation obtained by relaxing the initial relation using the corresponding relaxation graph RG.
- ElaborateCombinationsMatrix(MS[i,j]): determines all the possible combinations of the candidate relations for the replacement of the initial relations of the document.
- Consistency(C[i]): calls the linear constraint solver Cassowary to verify the consistency of the solution.

B. Algorithm Description

Distance Relaxation

The algorithm takes as input the matrix MI[i,j], the matrix of the complete relation graph of the initial specification. For each relation of MI[i,j], we identify its corresponding relaxation graph RG, from which we determine all the different forms of this relation obtained by the relaxation process (only the forms that comply with the target profile are retained). The result is placed in the substitution matrix MS[i,j]. Then we determine, by combination, all the possible solutions C[i] from the matrix MS[i,j]. Next, we perform an ascending sort of all solutions C[i] with the classical "quick sort" algorithm, using the conceptual distances calculated with Dijkstra's shortest path algorithm in the relaxation graph of each relation. This ensures that the solutions are sorted from the specification closest to the original to the farthest. Finally, we call the constraint solver (Cassowary [13]) for the consistency verification and the calculation of the solution (new values for the distances) for each specification, in the order defined by the sort. The first consistent solution found stops the process, and that solution is considered as the adapted document.
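Before describing the replacement phase, the sketch below (Python, not part of the paper) illustrates the distance machinery that both phases rely on: arc weights obtained as Manhattan distances between the information-item vectors of Table II, Dijkstra's shortest path as the conceptual distance, and the spatial distance as the sum over the two axes. Only the vectors come from Table II; the subset of relations and the arcs themselves are hypothetical, and the function names are mine.

```python
import heapq

# Information-item vectors for a few relations, taken from Table II
# (1 = the item characterizes the relation, 0 = it does not).
ITEM_VECTORS = {
    "Before":   [0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1],
    "Overlaps": [1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0],
    "Endin":    [0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0],
    "While":    [1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0],
}

# Hypothetical excerpt of the neighborhood arcs (the full graph is Figure 7);
# only the weights are computed from Table II, the arcs are assumed.
NEIGHBOR_ARCS = [("Before", "Overlaps"), ("Overlaps", "Endin"),
                 ("Overlaps", "While"), ("Endin", "While")]

def manhattan(r1: str, r2: str) -> int:
    """Weight of an arc: Manhattan distance between item vectors."""
    return sum(abs(a - b) for a, b in zip(ITEM_VECTORS[r1], ITEM_VECTORS[r2]))

def build_weighted_graph():
    graph = {r: {} for r in ITEM_VECTORS}
    for r1, r2 in NEIGHBOR_ARCS:
        w = manhattan(r1, r2)
        graph[r1][r2] = w
        graph[r2][r1] = w
    return graph

def conceptual_distance(graph, start: str, goal: str) -> float:
    """Dijkstra shortest path, as invoked from Algorithm 1."""
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            return d
        if d > dist.get(node, float("inf")):
            continue
        for nxt, w in graph[node].items():
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return float("inf")

def spatial_distance(graph, r1, r2) -> float:
    """Conceptual distance of a spatial relation: sum over the two axes."""
    (h1, v1), (h2, v2) = r1, r2
    return conceptual_distance(graph, h1, h2) + conceptual_distance(graph, v1, v2)

if __name__ == "__main__":
    g = build_weighted_graph()
    # The worked example of the text: d(Before, Endin) + d(Overlaps, While).
    print(spatial_distance(g, ("Before", "Overlaps"), ("Endin", "While")))
```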
Relations Replacement

If no solution is found at the end of the distance relaxation phase, we perform the replacement of the relations. We take the initial relation matrix MI[i,j] again and determine the substitution matrix MS[i,j], which gives, for each relation of the matrix MI[i,j], the candidate relations for its substitution (not its relaxation, as in the first phase), chosen among the relations of the model that meet the target profile constraints. We then determine, by combination, all the possible solutions C[i] from the matrix MS[i,j]. Next, we perform an ascending sort of all solutions C[i] using the conceptual distances calculated with Dijkstra's shortest path algorithm over the conceptual neighborhood graph. Finally, we call the constraint solver Cassowary for the consistency verification and the calculation of the solution for each specification. The first solution found to be consistent stops the verification process.

VI. CONCLUSION

In this paper, we proposed an approach to the spatial adaptation of multimedia documents that takes into account the distances defined in the relations. Differentiating the similarity degrees between each relation and its neighbors, by assigning different weights to the arcs of the conceptual neighborhood graph, and introducing the relaxation principle to take the distances into consideration, make it possible to replace a relation with its semantically closest one and thus keep the adapted document as close as possible to the original document.

The first axis of our future work will be to merge the two phases of the adaptation procedure by integrating the relaxation graphs of the relations into the conceptual neighborhood graph, so as to have a single phase that can produce an adaptation solution composed of both relaxed and replaced relations. The second axis will be to determine a similarity measure between the adapted document and the original one by using some extra information (annotations), such as weights assigned to relations based on their importance in the specification, to determine which relations are to be modified or removed if necessary.

REFERENCES

[1] S. Laborie, "Adaptation sémantique de document multimédia," PhD Thesis, University Joseph Fourier Grenoble 1, France, 2008.
[2] J. F. Allen, "Maintaining knowledge about temporal intervals," Communications of the ACM, vol. 26, no. 11, 1983, pp. 832-843.
[3] J. Euzenat, N. Layaïda, and V. Diaz, "A semantic framework for multimedia document adaptation," 18th International Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, 2003, pp. 31-36.
[4] C. Freksa, "Conceptual neighborhood and its role in temporal and spatial reasoning," Proceedings of the IMACS International Workshop on Decision Support Systems and Qualitative Reasoning, Toulouse, France, 1991, pp. 181-187.
[5] A. Maredj and N. Tonkin, "Weighting of the conceptual neighborhood graph for multimedia documents temporal adaptation," Proceedings of the International Arab Conference on Information Technology (ACIT 2010), Benghazi, Libya, 2010, p. 98.
[6] T. Wahl and K. Rothermel, "Representing Time in Multimedia Systems," Proceedings of the International Conference on Multimedia Computing and Systems, Boston, Massachusetts, 1994, pp. 538-543.
[7] A. Gomaa, N. Adam, and V. Atluri, "Adapting Spatial Constraints of Composite Multimedia Objects to Achieve Universal Access," IEEE International Workshop on Multimedia Systems and Networking (WMSN05), held in conjunction with the 24th IEEE International Performance, Computing, and Communications Conference (IPCCC), Phoenix, Arizona, USA, April 2005.
[8] J. Geurts, J. van Ossenbruggen, and L. Hardman, "Application specific constraints for multimedia presentation generation," Proceedings of the International Conference on Multimedia Modeling, 2001, pp. 247-266.
[9] T. Lemlouma and N. Layaïda, "Context-aware adaptation for mobile devices," Proceedings of the IEEE International Conference on Mobile Data Management, Berkeley, CA, USA, 2004, pp. 19-22.
[10] T. Lemlouma and N. Layaïda, "Universal Profiling for Content Negotiation and Adaptation in Heterogeneous Environments," W3C Workshop on Delivery Context, W3C/INRIA Sophia-Antipolis, France, 4-5 March 2002.
[11] G. Klyne et al., "Composite Capability/Preference Profiles (CC/PP): Structure and Vocabularies," http://www.w3.org/TR/CCPP-struct-vocab/, W3C Working Draft, 15 March 2001.
[12] D. Papadias and T. Sellis, "Qualitative Representation of Spatial Knowledge in Two-Dimensional Space," Special Issue on Spatial Databases, vol. 3, no. 4, 1994, pp. 479-516.
[13] G. J. Badros, A. Borning, and P. J. Stuckey, "The Cassowary linear arithmetic constraint solving algorithm," ACM Transactions on Computer-Human Interaction (TOCHI), vol. 8, no. 4, 2001, pp. 267-306.

Azze-Eddine Maredj received his doctorate in computer science at the University of Science and Technology Houari Boumediene (Algeria). He is a research scientist at the Research Center on Scientific and Technical Information (CERIST), Algiers (Algeria), where he leads the "multimedia systems and structured documents" team. His research interests are multimedia authoring systems and retrieval systems for structured and multimedia documents. Currently, he works on a project devoted to delivering an authoring system that permits the editing, rendering, and exploitation (indexing and retrieval) of multimedia documents.

Nourredine Tonkin received the engineer degree in computer science from the Mouloud Mammeri University, Tizi-Ouzou (Algeria), in 2004. He works at the Research Center on Scientific and Technical Information (CERIST), Algeria. His research interests are multimedia authoring systems. Currently, he works on the adaptation of multimedia documents.

Adaptive Backoff Algorithm for Wireless Internet

Muneer O. Bani Yassein
Jordan University of Science and Technology, Department of Computer Science, Irbid, Jordan
Email: masadeh@just.edu.jo

Saher S. Manaseer and Ahmad A. Momani
University of Jordan, Department of Computer Science, Amman, Jordan
Jordan University of Science and Technology, Department of Computer Science, Irbid, Jordan
Email: saher@ju.edu.jo, ahmed_cs2004@yahoo.com

Abstract— The standard IEEE 802.11 MAC protocol uses the Binary Exponential Backoff algorithm, which makes exponential increments to contention window sizes. This work studies the effect of choosing a combination of linear, exponential, and logarithmic increments to contention windows. Results have shown that choosing the right increment based on network status enhances the data delivery ratio by up to 37% compared to the Binary Exponential Backoff, and by up to 39% compared to the Pessimistic Linear Exponential Backoff algorithm, for wireless Internet.
Index Terms— Wireless Internet, MAC, CW, Backoff algorithms
I. INTRODUCTION
Wireless networks first appeared in the 1970s and have been developing rapidly ever since [22, 23]. Over the last decade, the trend has moved toward wireless Internet technology [23], and the mobile wireless network, also called the mobile ad hoc network, has become the new generation of wireless networks. We can distinguish two types of networks, infrastructure and ad hoc networks [1, 3, 22, 23]; see Figures 1 and 2.
Figure 1: An example of an Infrastructure Wireless Internet [29]
Fig. 1 shows a simple example of the first type, an infrastructure wireless network. Communication between nodes in such networks is managed via a base station or a central access point. Each base station has a limited transmission range; therefore each node in the network connects to the nearest base station within its transmission range [23].
Figure 2: An example of Wireless Internet
On the other hand, Fig. 2 shows an example of the second type of wireless Internet, a mobile ad hoc network (MANET). A MANET is a set of mobile nodes that communicate through wireless links. Each node acts as both a host and a router. Since nodes are mobile, the network topology can change rapidly and unpredictably over time [1, 3]. In other words, a MANET does not have a base station, so communication between nodes is managed by the nodes themselves. Moreover, nodes are not expected to be fully connected; hence nodes in a MANET must use multihop paths for data transfer when needed [24]. Recently, most interest has focused on MANETs because of the potential applications of this type of network, such as military operations, disaster recovery, and temporary conference meetings.
A. Features and Characteristics of MANETs
Although a MANET shares many features with infrastructure networks, it also has its own additional characteristics. Some of these are:
• Dynamic network topology: nodes in the network are free to move unpredictably over time. Thus, the network topology may change rapidly and unpredictably. This change may lead to serious issues, such as increasing the number of messages transmitted between the nodes of the network to keep the routing tables updated, which increases the network overhead [25].
• Distributed operations: in a MANET there is no centralized management to control network operations such as security and routing; therefore, the nodes must collaborate to implement such functions. In other words, control and management are distributed among the nodes of the network [22].
• Limited resources: in MANETs nodes are mobile, so they suffer from constrained resources compared to wired networks. For example, nodes in a MANET depend on batteries for communication and computation, so energy consumption should be optimized [26, 27, 28].
B. Applications of MANETs
MANETs are deployed in different environments thanks to their valuable features, such as mobility and the absence of base stations. Some of their applications are [22, 23]:
• Military Operations: in battlefield environments, a MANET can be very useful for setting up reliable communication between vehicles and soldiers, where it is almost impossible to have an infrastructure network.
• Emergency Operations: MANETs are very useful in places where the conventional infrastructure-based communication facilities have been destroyed by earthquakes, volcanic eruptions, or other natural disasters. This is because a MANET is flexible, mobile, inexpensive, and quick to deploy.
• Mobile Conferencing: it is unrealistic to expect all business to be done inside an office environment, so communication between a group of people or researchers can be achieved using MANETs.
II. LITERATURE REVIEW
The Binary Exponential Backoff (BEB) [7, 8, 12, 13, 14] is widely used by IEEE 802.11 MAC protocols due to its simplicity and generally good performance. The BEB algorithm works as follows. When a node attempts to send a packet to a specified destination, it first senses the shared medium in order to decide whether to start transmitting. If the channel is found to be idle, the transmission starts. Otherwise the node waits for a random time drawn from the range [0, CW-1] and calculated using the formula:
Backoff time = (Rand() MOD CW) * aSlotTime   (1)
After obtaining the backoff time, the node must wait until this time reaches zero before it starts transmitting. The backoff time (BO) is decremented by one at each idle time slot, but if the channel is busy the BO timer is frozen. Finally, if the node receives an acknowledgment for the packet sent, the contention window (CW) is reset to its minimum for that node. If the node does not receive an acknowledgment (a send failure occurs), the CW is incremented exponentially to obtain the new backoff value.
S. Manaseer and M. Masadeh [1] proposed the Pessimistic Linear Exponential Backoff (PLEB). This algorithm combines two increment behaviors for the backoff value: exponential and linear. When a transmission failure occurs, the algorithm starts by increasing the contention window size exponentially; after incrementing the backoff value a number of times, it switches to increasing the contention window size linearly. PLEB works best when implemented in large networks.
S. Manaseer, M. Ould-Khaoua and L. Mackenzie [2] proposed Fibonacci Increment Backoff (FIB). This algorithm uses the Fibonacci series, defined by
F(n) = F(n-1) + F(n-2), with F(0) = 0 and F(1) = 1.   (2)
FIB aims to reduce the difference between consecutive contention window sizes, resulting in a higher network throughput than the standard IEEE 802.11.
H. Ki, S. Choi, M. Chung and T. Lee [15] proposed the binary negative-exponential backoff (BNEB) algorithm. This algorithm uses exponential increments to the contention window size on collisions (transmission failures), and reduces the contention window size by half after a successful transmission of a frame. The analytical model and simulation results in [15, 16] show that BNEB outperforms the BEB implemented in the standard IEEE 802.11 MAC protocol.
S. Kang, J. Cha and J. Kim [17] proposed the Estimation-based Backoff Algorithm (EBA). This algorithm has two main functions: the first estimates the number of active nodes, and the second decides which contention window (CW) is optimal for the current situation. The estimation function uses the average number of idle slots observed during the backoff time to estimate the number of active nodes, from which the optimal CW for the current situation is then derived.
EBA outperforms the binary exponential backoff (BEB), the exponential increase exponential decrease (EIED), the exponential increase linear decrease (EILD), the pause count backoff (PCB) and the history based adaptive backoff (HBAB) algorithms in network throughput and mean packet delay.
S. Pudasaini, A. Thapa, M. Kang, and S. Shin [18] proposed an intelligent contention window control scheme for backoff based on a Collision Resolution Algorithm (CRA). This algorithm keeps a history of successful and failed access attempts and uses it to modify the contention window interval (CWmin, CWmax). This modification dynamically shifts the backoff interval to a more suitable region. The scheme improves channel efficiency in terms of packet end-to-end delay.
A. Balador, A. Movaghar, and S. Jabbehdari [19] proposed a History Based Contention Window Control (HBCWC) algorithm for the IEEE 802.11 MAC protocol. HBCWC optimizes the contention window values by saving the last three transmission states. The main factor in this algorithm is the packet loss rate: if it increases, due to collisions or channel errors, the CW size is increased; otherwise it is decreased.
S. Manaseer and M. Ould-Khaoua [7] proposed the logarithmic backoff algorithm (LOG) for the MAC protocol in MANETs. This algorithm uses logarithmic increments, instead of exponential ones, to provide the new backoff values, which are computed using the formula: (CW)new = (log(CW)old) * (CW)old * aSlotTime. The LOG algorithm generates values that are close to each other in order to achieve a higher throughput when used in MANETs.
V. Bharghavan, A. Demers, S. Shenker, and L. Zhang [10] proposed the Multiplicative Increase and Linear Decrease (MILD) backoff algorithm. This algorithm multiplies the contention window by a factor when a transmission fails (due to collision or transmission failure). After a successful transmission, the contention window CW is decreased by a factor in order to reduce the probability that successful users access the channel all the time. This decrement helps solve the unfairness problem that might otherwise affect users who experience collisions and send failures [8, 9, 10].
J. Deng, P. Varshney, and Z. Haas [9] proposed the linear multiplicative increase and linear decrease (LMILD) backoff algorithm. LMILD uses both linear and multiplicative increments in the case of a send failure: when a collision occurs, the colliding nodes increase their contention window CW multiplicatively, and other nodes overhearing the collision increase their CW linearly. In the case of a successful transmission, all nodes decrease their contention windows linearly [8, 9]. LMILD has shown better performance than the standard IEEE 802.11 in large networks. It also outperforms the pessimistic linear exponential backoff (PLEB) in small networks, but PLEB achieves better performance than LMILD in large networks [1].
III. THE NEW PROPOSED BACKOFF ALGORITHM
A. Overview
In general, backoff algorithms increase the contention window (CW) size after each transmission failure. A backoff algorithm should therefore use a suitable increment for the CW size in order to achieve the best performance. Many increment behaviors have been used in this field, such as linear, exponential, logarithmic, and Fibonacci-series increments.
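To make the contrast between these increment rules concrete, the short Python sketch below applies each of them to a contention window value. It is illustrative only and not taken from the cited papers; the constants CW_MAX, SLOT_TIME and the linear step T, as well as the logarithm base used for the LOG rule, are placeholder assumptions.

import math
import random

CW_MAX = 1024        # assumed upper bound on the contention window (slots)
SLOT_TIME = 20e-6    # assumed slot duration, in seconds
LINEAR_STEP_T = 64   # assumed linear step T

def beb_next(cw):
    # Binary Exponential Backoff: double the CW on every failure.
    return min(cw * 2, CW_MAX)

def linear_next(cw, step=LINEAR_STEP_T):
    # Linear increment: add a fixed step T on every failure.
    return min(cw + step, CW_MAX)

def log_next(cw):
    # Logarithmic increment in the spirit of the LOG algorithm [7]:
    # CW_new = log(CW_old) * CW_old (base-10 logarithm assumed here,
    # since the text does not specify the base).
    return min(int(math.log10(cw) * cw), CW_MAX)

def fib_next(cw_prev, cw):
    # Fibonacci-style increment as in FIB [2]: the next window is the sum
    # of the previous two, so consecutive windows grow more gently than BEB.
    return min(cw_prev + cw, CW_MAX)

def backoff_time(cw):
    # Equation (1): Backoff time = (Rand() MOD CW) * aSlotTime,
    # i.e. a uniform draw from [0, CW-1] scaled by the slot duration.
    return random.randrange(0, cw) * SLOT_TIME

For example, starting from CW = 32, beb_next gives 64 while log_next gives int(1.505 * 32) = 48, which illustrates why the logarithmic rule produces gentler growth than doubling for moderate window sizes.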
If we split networks into three types, small, medium and large, each increment scheme suits at most two of them and degrades sharply on the third. For example, the exponential increment of the BEB algorithm used in the standard IEEE 802.11 MAC does not achieve the best performance because of the large CW gaps it produces. Another example is the linear increment of LMILD, which does not allow adequate backoff time before data retransmission.
The Binary Exponential Backoff (BEB) algorithm increases the CW size exponentially on transmission failure. That is, when a node has a packet to transmit, it first starts sensing the channel. If the channel is found to be idle, the node immediately starts transmitting the data packets. Otherwise, the backoff mechanism is triggered: a backoff timer is selected randomly from the current CW, and this timer is decremented only at time slots found to be idle. When the timer reaches zero, the node transmits the data packets. If the acknowledgement is received from the destination node, the CW size is reset to its minimum. On the other hand, if the acknowledgement is lost, the CW size is incremented exponentially. See Fig. 3.
Figure 3: BEB algorithm description
Another backoff algorithm is the pessimistic linear exponential backoff (PLEB). This algorithm combines two increment behaviors: exponential and linear. PLEB assumes that congestion in the network will not be resolved in the near future; thus, it uses exponential increments in the first stages of the algorithm and then continues with linear increments. See Fig. 4.
Figure 4: PLEB algorithm description
The next section presents a new backoff algorithm that aims to improve the overall network performance. Our goal is to achieve a higher data delivery ratio with less overhead on the network. The new algorithm uses a combination of different increments in order to take advantage of each. The increments used are linear, logarithmic, and exponential. The exponential increment aims to produce adequate CW lengths in the first stages of the algorithm. The logarithmic increment provides proper increments while the CW is still small; it is generally used as a transition stage towards the linear increment, to avoid continued exponential growth. The last increment used is the linear one, which avoids the large increments of the exponential and logarithmic stages. The simulation results presented in the next section show that the new backoff algorithm improves the network delivery ratio and overhead.
B. The Smart Adaptive Backoff Algorithm
The Smart Adaptive Backoff Algorithm (SABA) is the newly proposed backoff algorithm. It assumes that network performance can be enhanced when the network is either very sparse or heavily congested.
The explanation is the following: SABA assumes that, in very sparse and in congested networks, collisions will not be resolved in the near future; therefore, exponential increments are used. In very sparse networks, paths break easily due to mobility, and there is usually only one path in the route table, so the exponential increments provide adequate time to establish other paths and start using them. On the other hand, in congested networks many nodes use the same paths, so more time should be allowed before transmitting on these paths again.
Figure 5 shows the basic functionality of the SABA algorithm in detail. As shown, the first five lines of the algorithm set the initial value of the backoff timer and then decrement it based on idle time slots; that is, the timer is decremented only when a time slot is idle and frozen otherwise. When the timer reaches zero, the data packet is transmitted. In case of a successful transmission, the CW value is saved in the history array, as shown in line 21 of the algorithm; otherwise, the backoff mechanism is triggered. Lines 7-19 describe the adaptive process in SABA. It starts at line 7 by incrementing the CW exponentially for a number of times, based on transmission success: SABA applies an exponential increment and saves the CW size whenever a transmission succeeds, and this is repeated until the array of five elements is full. After that, lines 8 and 9 show that SABA computes the average of the CW sizes in the history array in order to start a new increment behavior based on that average value; this average is computed only once. Lines 11-16 express that one of two increment behaviors is then chosen, linear or logarithmic: if the average is not high (less than the threshold N) the next increment behavior is logarithmic, otherwise it is linear.
1  Set BO to initial value
2  While BO ≠ 0 do
3    For each time slot
4      If channel is idle then BO = BO - 1
5  Wait for a period of DIFS then Send
6  If (Send-Failure) then
7    If (CW array of last five successes is full)
8      If (Array used for the first time)
9        Calculate the average of the history array and use it as a new CW value
10     Else
11       If (CW > N) then
12         CW = CW + T
13         Backoff-Timer = Random x; 1 ≤ x ≤ CW - 1
14       Else
15         CW = Log(CW) * CW
16         Backoff-Timer = Random x; 1 ≤ x ≤ CW - 1
17   Else
18     CW = CW * 2
19     Backoff-Timer = Random x; 1 ≤ x ≤ CW - 1
20 Else
21   Save the CW value used in the history
22 Go to line number 1
23 Stop
Figure 5: The Smart Adaptive Backoff Algorithm
The purpose of using the average is to reduce the CW size when it has grown large. Both the logarithmic and the linear increments aim to avoid excessive CW sizes in order to enhance the network performance.
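Read as a small state machine, the update rule in Figure 5 can be modelled in a few lines of Python. The sketch below is an illustration only, not the authors' GloMoSim implementation; the initial CW, the threshold N, the step T and the logarithm base are placeholder assumptions.

import math
import random

class SABA:
    # Illustrative model of the SABA contention-window update (Figure 5).
    def __init__(self, cw_init=32, threshold_n=256, step_t=64):
        self.cw = cw_init                # assumed initial contention window
        self.threshold_n = threshold_n   # placeholder for the threshold N
        self.step_t = step_t             # placeholder for the linear step T
        self.history = []                # CW values of the last five successes
        self.avg_used = False            # the average is applied only once

    def on_success(self):
        # Line 21: remember the CW that led to a successful transmission.
        self.history.append(self.cw)
        if len(self.history) > 5:
            self.history.pop(0)

    def on_failure(self):
        # Lines 7-19: adapt the CW, then draw a new backoff timer.
        if len(self.history) == 5:
            if not self.avg_used:
                # Lines 8-9: replace CW by the average of the history array.
                self.cw = sum(self.history) // 5
                self.avg_used = True
            elif self.cw > self.threshold_n:
                # Lines 11-12: large window, so increment linearly.
                self.cw = self.cw + self.step_t
            else:
                # Lines 14-15: small window, so increment logarithmically
                # (base-10 logarithm assumed; the text does not specify it).
                self.cw = int(math.log10(self.cw) * self.cw)
        else:
            # Line 18: history not yet full, keep exponential increments.
            self.cw = self.cw * 2
        # Lines 13, 16 and 19: new backoff timer drawn from [1, CW-1].
        return random.randint(1, max(1, self.cw - 1))

In use, on_failure() would be called after each unacknowledged transmission to obtain the next backoff timer, and on_success() after each acknowledged one.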
IV. SIMULATION AND RESULTS EVALUATION
In this paper the network performance is measured by two criteria: packet delivery ratio and network overhead. We present and evaluate the simulation results obtained for different scenarios; in the simulation experiments, we varied the number of sources and the maximum node speed. We implemented the proposed backoff algorithm SABA in the GloMoSim [21] simulator to evaluate its performance against the well-known BEB and PLEB algorithms.
A. Simulation Environment
Our simulations use a network of 10, 20, 50 and 100 nodes placed randomly within a 1000 meter × 1000 meter area. Each node has a radio propagation range of 250 meters, and the channel capacity is 2 Mb/s. Each run executes for 900 seconds. We use IEEE 802.11 as the MAC layer protocol and Constant Bit Rate (CBR) traffic, with the random waypoint model for node mobility. We use various maximum node speeds (1, 2, 3 and 4 meters per second) and traffic loads of 5, 10 and 20 packets per second, repeated for 5 and 10 sources.
B. Simulation Results
Different performance metrics have been used to evaluate backoff algorithms [1, 2, 4, 6, 7, 9, 10, 11, 12]. This study uses the data delivery ratio and the network overhead to measure performance; the ideal case is to achieve the maximum delivery ratio with the minimum network overhead.
Figures 6-13 show that SABA achieves the best performance in both light and heavy (small and large) networks. This is not surprising: a network with a small number of nodes does not trigger the backoff mechanism frequently, so SABA, which uses the average of the last five exponential increments (successful increments only), outperforms the exponential increments of BEB and PLEB. Moreover, when the number of nodes increases and the network becomes large (e.g., a network of 100 nodes), the exponential increments play an essential role in network performance: in BEB the exponential increments keep enlarging the gaps between contention window sizes, which significantly reduces the network performance, while in PLEB the linear increments begin too early, so the increments do not reach the best values for large networks (i.e., they remain small compared to the network size and the node mobility speed). SABA, on the other hand, provides the best contention window increments. It utilizes exponential, logarithmic and linear increments to achieve the best performance in large networks: it starts with exponential increments until the node has successfully sent five times; after that, the contention window is expected to be large, so it is reduced by computing the average contention window size of these successive transmissions. If the result is still a large contention window, the increment becomes linear; otherwise the increment is logarithmic and then continues linearly. The logarithmic increment is significantly smaller than the exponential one but still larger than the linear one (which is needed for large networks to perform better).
Figure 6: Data delivery ratio of BEB, PLEB and SABA for 20 nodes, 5 sources, each source sending 20 packets per second.
Figure 7: Overhead of BEB, PLEB and SABA for 20 nodes, 5 sources, each source sending 20 packets per second.
Figure 8: Data delivery ratio of BEB, PLEB and SABA for 20 nodes, 10 sources, each source sending 10 packets per second.
In medium-sized networks, the simulation results in Figures 14-17 show that SABA is still closely comparable to the BEB and PLEB algorithms, and even gives better performance at high traffic rates. BEB, PLEB and SABA all start by increasing the contention window size exponentially.
The number of backoff processes is expected to be moderate, so the continuous exponential increments of the BEB algorithm are not a problem in this case. Moreover, SABA continues with logarithmic increments, which are more suitable for high traffic rates in this type of network, while PLEB starts its linear increments early, which makes it unsuitable for obtaining the best performance in medium networks.
Figure 9: Overhead of BEB, PLEB and SABA for 20 nodes, 10 sources, each source sending 10 packets per second.
Figure 10: Data delivery ratio of BEB, PLEB and SABA for 100 nodes, 5 sources, each source sending 20 packets per second.
Figure 11: Overhead of BEB, PLEB and SABA for 100 nodes, 5 sources, each source sending 20 packets per second.
Figure 12: Data delivery ratio of BEB, PLEB and SABA for 100 nodes, 10 sources, each source sending 20 packets per second.
Figure 13: Overhead of BEB, PLEB and SABA for 100 nodes, 10 sources, each source sending 20 packets per second.
Figure 14: Data delivery ratio of BEB, PLEB and SABA for 50 nodes, 5 sources, each source sending 20 packets per second.
Figure 15: Overhead of BEB, PLEB and SABA for 50 nodes, 5 sources, each source sending 20 packets per second.
Figure 16: Data delivery ratio of BEB, PLEB and SABA for 50 nodes, 10 sources, each source sending 10 packets per second.
Figure 17: Overhead of BEB, PLEB and SABA for 50 nodes, 10 sources, each source sending 10 packets per second.
Furthermore, the results show some other important information. First, the linear increment behavior directly affects the network overhead. For example, PLEB provides the lowest network performance compared to BEB and SABA in all scenarios, because the linear increments in the early stages of PLEB do not allow adequate increases of the CW values. To explain this in detail, the three types of network should be examined: in a sparse network, linear increments make the source node send route requests more frequently in order to establish a path between source and destination (which is hard because the number of nodes is still small), while in medium and large networks more frequent backoff triggers are expected, so a congested network with more broken routes is the normal outcome. For all these reasons, it is to be expected that linear increments in backoff algorithms cause a higher network overhead than exponential ones for the scenarios applied in this paper. Secondly, in lightly loaded networks the data delivery ratio increases with mobility: for a network of 20 nodes (5 and 10 sources sending data packets) and a network of 50 nodes (only 5 sources sending data packets), the data delivery ratio increased because the higher mobility of nodes can actually be helpful.
In other words, the data delivery ratio increases because, when routes break due to mobility, other routes are built quickly. Finally, in highly loaded networks (e.g., a network of 100 nodes) the data delivery ratio decreases as the mobility speed increases.
V. CONCLUSIONS
In this paper we presented a new backoff algorithm for MANETs called the Smart Adaptive Backoff Algorithm (SABA). The main objective of this work is to evaluate the performance of the new backoff algorithm with respect to network size, mobility speed and traffic rate. The results obtained confirm that changes made to the contention window increment and decrement directly affect network performance metrics such as the data delivery ratio and the overhead. The results show that SABA outperforms the BEB and PLEB algorithms in different network types.
The data packet delivery ratio of SABA against the BEB and PLEB algorithms varied across scenarios. For example, in a small network, with a transmission rate of twenty packets per second, five sources and low mobility, SABA outperforms BEB and PLEB by 37.11% and 38.45%, respectively; at high mobility, SABA outperforms BEB and PLEB by 7.74% and 14.13%, respectively. In a medium network, with a transmission rate of ten packets per second, ten sources and low mobility, SABA outperforms BEB and PLEB by 5.59% and 22.79%, respectively; at high mobility, SABA outperforms BEB and PLEB by 13.19% and 21.56%, respectively. In large networks, with a transmission rate of ten packets per second, ten sources and low mobility, SABA outperforms BEB and PLEB by 0.93% and 30.34%, respectively; at high mobility, SABA outperforms BEB and PLEB by 36.42% and 39.67%, respectively.
The network overhead of SABA against the BEB and PLEB algorithms also varied. For example, in a small network, with a transmission rate of twenty packets per second, five sources and low mobility, SABA outperforms BEB and PLEB by 0.249% and 0.392%, respectively; at high mobility, SABA outperforms BEB and PLEB by 0.299% and 0.746%, respectively. In a medium network, with a transmission rate of twenty packets per second, ten sources and low mobility, SABA outperforms BEB and PLEB by 0.427% and 8.875%, respectively; at high mobility, SABA outperforms BEB and PLEB by 1.793% and 28.486%, respectively. In large networks, with a transmission rate of twenty packets per second, ten sources and low mobility, SABA outperforms BEB and PLEB by 0.454% and 6.645%, respectively; at high mobility, SABA outperforms BEB and PLEB by 18.248% and 39.211%, respectively.
In general, the results of this paper indicate that each type of network needs a different way of handling the contention window increment: for small networks, low increments (lower than exponential) are preferred, while for medium and large networks it is preferable to apply low increments after large ones. Finally, this work has studied the effect of choosing the point at which the behavior changes between linear, logarithmic and exponential increments in the proposed algorithm SABA. The results show that using the increment type suited to the network status increases the overall network performance.
ACKNOWLEDGMENT
We would like to thank the staff of the Faculty of Computer and Information Technology, Jordan University of Science and Technology (JUST) and the University of Jordan (JU), Jordan, for their great and continuous support.
REFERENCES
[1] S. Manaseer and M. Masadeh. "Pessimistic Backoff for Mobile Ad hoc Networks". Al-Zaytoonah University, the International Conference on Information Technology (ICIT'09), Jordan, 2009.
[2] S. Manaseer, M. Ould-Khaoua, and L. Mackenzie. "Fibonacci Backoff Algorithm for Mobile Ad Hoc Networks". Liverpool John Moores University, the 7th Annual Postgraduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting (PGNET 06), Liverpool, 2006.
[3] Y. Yuan, A. Agrawala. "A Secure Service Discovery Protocol for MANET". Computer Science Technical Report CS-TR-4498, Computer Science Department, University of Maryland, 2003.
[4] S. Manaseer, M. Ould-Khaoua, and L. Mackenzie. "On a Modified Backoff Algorithm for MAC Protocol in MANETs". Int. J. of Information Technology and Web Engineering, Vol. 2(1), pp. 34-46, 2007.
[5] P. Karn. "MACA - A new channel access method for packet radio". ARRL/CRRL Amateur Radio 9th Computer Networking Conference, London, pp. 134–140, 1990.
[6] M. Bani Yassein, S. Manaseer, and A. Al-Turani. "A Performance Comparison of Different Backoff Algorithms under Different Rebroadcast Probabilities for MANETs". University of Leeds, the 25th UK Performance Engineering Workshop (UKPEW), UK, 2009.
[7] S. Manaseer and M. Ould-Khaoua. "Logarithmic Based Backoff Algorithm for MAC Protocol in MANETs". Technical Report, University of Glasgow, 2006.
[8] H. Wu and Y. Pan. "Medium Access Control in Wireless Networks". Nova Science Publishers Inc, pp. 26-29, 2008.
[9] J. Deng, P. Varshney, and Z. Haas. "A New Backoff Algorithm for the IEEE 802.11 Distributed Coordination Function". Communication Networks and Distributed Systems Modeling and Simulation (CNDS'04), San Diego, California, 2004.
[10] V. Bharghavan et al. "MACAW: A Media Access Protocol for Wireless LAN's", in Proc. ACM SIGCOMM '94, pp. 212–225, 1994.
[11] S. Zhalehpoor and H. Shahhoseini. "SBA Backoff Algorithm to Enhance the Quality of Service in MANETs". International Conference on Signal Acquisition and Processing (ICSAP), Kuala Lumpur, Malaysia, pp. 43-47, 2009.
[12] C. Hu, H. Kim, and J. Hou. "An Evaluation of the Binary Exponential Backoff Algorithm in Distributed MAC Protocols". Technical Report, University of Illinois at Urbana-Champaign, 2005.
[13] J. Goodman et al. "Stability of binary exponential backoff". Journal of the ACM, Vol. 35(3), pp. 579–602, 1988.
[14] Y. Wang. "Medium Access Control in Ad Hoc Networks with Omni-directional and Directional Antennas". Doctoral Dissertation, University of California, 2003.
[15] H. Ki, S. Choi, M. Chung, and T. Lee. "Performance Evaluation of Binary Negative-Exponential Backoff Algorithm in IEEE 802.11 WLAN". Lecture Notes in Computer Science, Springer-Verlag Berlin Heidelberg, vol. 4325, pp. 294–303, 2006.
[16] B. Choi, S. Bae, T. Lee, and M. Chung. "Performance Evaluation of Binary Negative-Exponential Backoff Algorithm in IEEE 802.11a WLAN under Erroneous Channel Condition". Lecture Notes in Computer Science, Springer-Verlag Berlin Heidelberg, vol. 5593, pp. 237-249, 2009.
[17] S. Kang, J. Cha, and J. Kim. "A Novel Estimation-Based Backoff Algorithm in the IEEE 802.11 Based Wireless Network".
7th Annual IEEE Consumer Communications & Networking Conference (CCNC), Las Vegas, Nevada, USA, 2010.
[18] S. Pudasaini, A. Thapa, M. Kang, and S. Shin. "An Intelligent Contention Window Control Scheme for Distributed Medium Access". 7th Annual IEEE Consumer Communications & Networking Conference (CCNC), Las Vegas, Nevada, USA, 2010.
[19] A. Balador, A. Movaghar, and S. Jabbehdari. "History Based Contention Window Control in IEEE 802.11 MAC Protocol in Error Prone Channel". Journal of Computer Science, 6(2), pp. 205-209, 2010.
[20] B. Ramesh, D. Manjula. "An Adaptive Congestion Control Mechanism for Streaming Multimedia in Mobile Ad-hoc Networks". International Journal of Computer Science and Network Security (IJCSNS), Vol. 7(6), pp. 290-295, 2007.
[21] X. Zeng, R. Bagrodia, M. Gerla. "GloMoSim: A Library for Parallel Simulation of Large-scale Wireless Networks". Proceedings of the 12th Workshop on Parallel and Distributed Simulation, Banff, Alberta, Canada, pp. 154-161, 1998.
[22] I. Stojmenovic. "Handbook of Wireless Networks and Mobile Computing", Wiley, New York, 2002.
[23] C-K Toh. "Ad Hoc Mobile Wireless Networks: Protocols and Systems", Prentice-Hall, New York, 2002.
[24] IEEE, ANSI/IEEE Standard 802.11, 1999 Edition (R2003), Part 11: "Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications".
[25] R. Rajaraman. "Topology control and routing in ad hoc networks: a survey". ACM Special Interest Group on Algorithms and Computation Theory (SIGACT), Vol. 33(2), 2002.
[26] V. Rodoplu and T. Meng. "Minimum energy mobile wireless networks". IEEE Journal on Selected Areas in Communications, Vol. 17(8), pp. 1333-1344, 1999.
[27] S. Nesargi and R. Prakash. "MANETconf: configuration of hosts in a mobile ad hoc network". Proceedings of the IEEE Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM), Vol. 2, pp. 1059-1068, 2002.
[28] Y. Tseng and T. Hsieh. "Fully power-aware and location-aware protocols for wireless multi-hop ad hoc networks". Proceedings of the Eleventh International Conference on Computer Communications and Networks, pp. 608-613, 2002.
[29] S. Manaseer. "On Backoff Mechanisms for Wireless Mobile Ad Hoc Networks". Doctoral Dissertation, University of Glasgow, United Kingdom (UK), December 2009.
Muneer O. Bani Yassein received his B.Sc. degree in Computing Science and Mathematics from Yarmouk University, Jordan, in 1985 and his M.Sc. in Computer Science from Al al-Bayt University, Jordan, in 2001. He received his Ph.D. degree in Computing Science from the University of Glasgow, U.K., in 2007. His current research interests include network protocol development and the performance analysis of probabilistic flooding in mobile ad hoc networks.
Saher S. Manaseer received his PhD degree in Computer Science from the Department of Computing Science at the University of Glasgow in 2009. His main areas of research are computer networks and embedded systems. Currently, he is an active researcher in the field of mobile ad hoc networks; more specifically, his research on MANETs focuses on developing MAC layer protocols. Before obtaining his PhD, he received his Master's in Computing Science from the University of Jordan in 2004, where his research focused on software engineering. Since earning his undergraduate degree in Computer Science, he has worked in two parallel directions: training and software development.
Ahmad A. Momani received his B.Sc.
degree in Computer Science from Jordan University of Science and Technology, Jordan, in 2007 and his M.Sc. in Computer Science from Jordan University of Science and Technology, Jordan, in 2011. His main area of research is computer networks. Currently, he is an active researcher in the field of mobile ad hoc networks; more specifically, his research on MANETs focuses on developing MAC layer protocols. Before obtaining his Master's, he worked as a programmer and then as a teacher at the Ministry of Education.
ConTest: A GUI-Based Tool to Manage Internet-Scale Experiments over PlanetLab
Basheer Al-Duwairi
Jordan University of Science and Technology / Department of Network Engineering and Security, Irbid, Jordan
Email: basheer@just.edu.jo
Mohammad Marei, Malek Ireksossi, and Belal Rehani
Jordan University of Science and Technology / Department of Computer Engineering, Irbid, Jordan
Email: {mymarei, malek.erq, belalrehani}@hotmail.com
Abstract— PlanetLab is being used extensively to conduct experiments and to implement and study a large number of applications and protocols in an Internet-like environment. With this increase in PlanetLab usage, there is a pressing need for an efficient way to set up and manage experiments. This paper proposes a graphical user interface-based tool, called ConTest, to set up and manage experiments over PlanetLab. ConTest enables PlanetLab users to set up experiments and collect results in a transparent and easy way. The tool also allows different measurements of different variables over the PlanetLab network. The paper discusses the design and implementation of ConTest and shows different usage scenarios.
Index Terms— Internet measurements, PlanetLab, API, networking, GUI
I. INTRODUCTION
PlanetLab is an open shared platform for developing, deploying, and accessing planetary-scale applications [1]. Currently, it has more than 1100 nodes distributed all over the world. It allows its users to freely access the shared nodes, upload programs and execute them. It is used by many distributed application developers around the world for testing their applications (e.g., [2, 3, 4, 5]). It can also serve the purpose of having machines distributed all around the Internet, breaking the limits of locality and allowing users to perform measurements on random nodes at random places to ensure generality and confidence in the obtained results. PlanetLab is being used extensively to conduct experiments and to implement and study a large number of applications and protocols in an Internet-like environment. These applications and protocols fall into different areas that include real-time measurements of delay, bandwidth, and security. With this increase in PlanetLab usage, there is a pressing need for an efficient way to set up and manage experiments. Such a tool would be a great help not only to the PlanetLab community, but also to new researchers and users who find it difficult to explore and utilize this platform. In fact, several systems (as described in Section V) have been proposed to achieve that goal. This paper builds on previous efforts in this direction and proposes a GUI-based tool, called ConTest, that allows users to visually manage experiments in an Internet-like environment. In this regard, ConTest provides a great deal of flexibility and simplicity for conducting experiments over PlanetLab.
Using this tool, users can visually create the network topology they want, simply by selecting PlanetLab nodes as they appear in the world map view. The tool also allows users to specify the role of each node (client, server, or proxy) and to generate different types of traffic (e.g., TCP, UDP, ICMP, etc.). In addition, ConTest can perform its basic operations in sequence or in parallel; this includes authentication, file upload, and command execution. ConTest provides an efficient and simple way to deploy and test applications over PlanetLab. With this tool it becomes very easy to study the properties of Internet applications and protocols, and to measure their performance in terms of response time, delay jitter, packet loss, etc. With its support for application deployment and monitoring in an Internet-like environment such as PlanetLab, we believe that ConTest will promote research in this field and significantly contribute to increasing the number of PlanetLab users, thanks to its attractive features that hide many of the complications behind the scenes.
The rest of this paper is organized as follows. Section II explains the ConTest architecture. Section III discusses implementation details. Section IV describes ConTest functionality and usage. Section V discusses related work. Finally, conclusions and future work are presented in Section VI.
II. CONTEST ARCHITECTURE
The design and implementation of ConTest are based on modularity and usability concepts. ConTest is composed of several modules, each with a specific functionality: the authentication module, the command execution module, the file management module, and the connection management module. What follows is a description of each module.
A. Authentication Module
Typically, users access PlanetLab nodes through slices. Basically, a slice is a multi-user account registered with PlanetLab Central (PLC); it offers all users of that account the same resources on the nodes they add to the slice, so that the users of the account work on the same project while preserving the nodes' resources. Accessing a PlanetLab node requires the user to have access to the slice to which the node belongs. PlanetLab offers a web interface that allows users to access their slices and see the nodes registered there, but it also offers an API that allows external programs to communicate with the PLC databases through remote procedure calls (RPC) to retrieve user account and node information. ConTest utilizes this API to perform authentication. The authentication module is a program that communicates directly with the PlanetLab Central API. It accepts the authentication information (email, PlanetLab password and slice name) from the main module. This information is then validated by the PLC. After that, the module retrieves the slice information from the PLC and the list of nodes registered on this slice, with all the necessary data about each node (ID, name, location, status, etc.), as illustrated in Fig. 1.
Figure 1: How the authentication module accesses the API to retrieve the node list
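As a rough illustration of what this module does, the following Python sketch authenticates against the PLCAPI XML-RPC interface and retrieves the nodes registered in a slice. It is not the authors' C++/Qt implementation; the endpoint URL, the credential values and the slice name are placeholders, and the method names (GetSlices, GetNodes) follow the public PLCAPI documentation, so the exact signatures should be treated as assumptions.

import xmlrpc.client

# Placeholder endpoint and credentials.
PLC_API_URL = "https://www.planet-lab.org/PLCAPI/"
AUTH = {
    "AuthMethod": "password",
    "Username": "user@example.org",   # PlanetLab-registered e-mail
    "AuthString": "secret",           # PlanetLab password
}
SLICE_NAME = "example_slice"          # hypothetical slice name

plc = xmlrpc.client.ServerProxy(PLC_API_URL, allow_none=True)

# Fetch the slice record to obtain the IDs of the nodes added to it.
slices = plc.GetSlices(AUTH, {"name": SLICE_NAME}, ["node_ids"])
node_ids = slices[0]["node_ids"] if slices else []

# Retrieve the basic data (ID, name, status, site) for each node in the slice.
nodes = plc.GetNodes(AUTH, node_ids, ["node_id", "hostname", "boot_state", "site_id"])
for node in nodes:
    print(node["node_id"], node["hostname"], node["boot_state"])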
B. Command Execution Module
The command execution module allows users to run their own custom commands on the nodes to which they have access (i.e., within their slices). It is also used by the connection management module to run the tests requested by the user. For the tool to communicate with the remote nodes and access their resources to perform the required test, and because of the security restrictions on access to PlanetLab nodes, any connection to the nodes must be established through the secure shell (SSH) protocol. Command execution is therefore done through SSH connections: the command execution module connects to the desired nodes using public-key SSH authentication, sends the command to the destination nodes, and retrieves the output and error messages from each node.
C. File Management Module
Similar to the command execution module, file copying is done over SSH using the secure copy function. After the key-pair identification is accepted, the files are uploaded to the destination nodes and their console output is retrieved to ensure that the copying process was successful.
D. Connection Management Module
This module is the operational core of the tool. In order to perform their tests, users need to connect the nodes together using a testing program, which they also need to upload; the connection management module does that for the user. This module makes heavy use of the other modules' capabilities. It uses a special class to set up the connections required by the user, and the role of each node (e.g., client, server) is determined using the programmed connection algorithm. The connection management module communicates with the map interface of the GUI to help the user select the connection topology in a simple and efficient way. It also utilizes the file copying and command execution modules to upload the test files to the desired nodes, execute these files, retrieve the results from each node, and output these results to the GUI. The results associated with an experiment are stored in separate files for reference, and some statistics (e.g., bandwidth, round trip time, etc.) are displayed to the user upon the execution of these experiments. Fig. 2 shows the execution flow of the connection management module and its interaction with the other modules at each step.
Figure 2: Execution flow in the connection management module
E. The Graphical User Interface (GUI) Module
The GUI represents the main program that links all other modules of the tool. The GUI module accepts user inputs and manages the output of results and error messages to the user, in text file format and in the program log. It communicates with the authentication module to read the node data and display it in table format, and to identify and show the nodes' locations on the world map using the coordinates retrieved by the authentication module. It also sends the user input to the command execution and file management modules and retrieves the console log from each module to display it in the program log or save it to separate files. In addition, it integrates the connection management module with the command execution and file management modules to carry out the tests required by the user.
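Conceptually, the command execution and file management modules wrap ssh and scp. A minimal Python sketch of the same idea, assuming key-based access to a slice (on PlanetLab the slice name is typically used as the SSH login; the slice name, host name and key path below are hypothetical), is shown here; the actual tool performs these steps from C++/Qt worker threads.

import subprocess

SLICE = "example_slice"                     # hypothetical slice name (SSH login)
KEY_FILE = "/home/user/.ssh/planetlab_rsa"  # hypothetical private key path

def run_command(host, command, timeout=60):
    # Run a command on a remote node over SSH and return (stdout, stderr).
    result = subprocess.run(
        ["ssh", "-i", KEY_FILE, "-o", "BatchMode=yes",
         f"{SLICE}@{host}", command],
        capture_output=True, text=True, timeout=timeout)
    return result.stdout, result.stderr

def upload_file(host, local_path, remote_dir, timeout=120):
    # Copy a local file to a remote node with scp, as the file module does.
    result = subprocess.run(
        ["scp", "-i", KEY_FILE, local_path, f"{SLICE}@{host}:{remote_dir}"],
        capture_output=True, text=True, timeout=timeout)
    return result.returncode == 0, result.stderr

# Example: upload a hypothetical test script and execute it on one node.
ok, err = upload_file("planetlab1.example.edu", "test.sh", "~/")
if ok:
    out, err = run_command("planetlab1.example.edu", "sh ~/test.sh")
    print(out)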
F. ConTest Classes
GUI class: the core of ConTest is the GUI module/class, which connects the other modules together and manages their interactions. This class is responsible for interacting with the user, for I/O management between the sub-classes, and for error handling. Most of the content of the main class concerns the graphical widgets displayed on the GUI, the initialization of the other modules, and the management of their inputs and outputs, so we will not describe this class in depth, as it contains many auto-generated elements (the graphical widgets). That leaves the sub-classes in the design, which are the following.
Copying thread class: this class represents the file management module. Instances of this class receive and store the essential data of the nodes to which the copying is going to be performed, as well as the data of the file to be copied to the nodes; this includes:
• Node name.
• Key file (for the SSH connection).
• Source file (the file to be copied).
• Destination directory on the remote node.
Each instance runs an scp (secure copy) command using the information passed by the main module, retrieves the console output from the remote nodes, and relays this output to the GUI.
Execution thread class: similar to the file copying class, instances of this class contain the data of the nodes on which the command is going to be executed:
• Node name.
• Key file.
• Command string.
Each instance runs an SSH command using that information, which is passed to it by the main module, retrieves the console output from the remote nodes, and relays this output back to the GUI.
Node view class: this class is mainly used to represent the nodes graphically; it also stores all the available data of a node, namely:
• Node ID.
• Node name.
• Location.
• Status.
• Coordinates (for display in the GUI's map view).
It also contains pointers to other nodes. These pointers are used to create linked lists that help construct the connections required for the tests performed by the connection management module.
Fig. 3 illustrates the relationships between the different classes; it also shows the relationship between the main module and the authentication module.
Figure 3: Relationship between the different classes
III. IMPLEMENTATION DETAILS
A. Tools and Programs Used to Implement ConTest
ConTest was implemented using Qt Creator [6][7], a widely used IDE (integrated development environment) that helps programmers create powerful graphical tools. It provides an effective way to integrate an easy-to-use, simple graphical interface with the ability to perform a wide set of operations (such as accessing APIs, using SSH commands, displaying a map, etc.). Qt has a large library of graphical and non-graphical classes and functions that enable programmers to implement almost anything they want, giving their programs maximum functionality. It uses C++ as the programming language, and its set of libraries makes the programming process fast and effective, helping the programmer focus on the core functionality of the program while it handles the graphical aspects. In addition, Ping [8, 9] and Iperf [10] have been integrated into ConTest to provide means to measure different network performance metrics, such as round trip time and connection throughput. In ConTest, this is done by starting two processes on the designated remote node: one measures the bandwidth using the Iperf tool and the other measures the round trip time using the ping command.
These processes then return their output through SSH to the program, which records it locally and analyzes it to obtain the required connectivity statistics for all the nodes of interest to the user.
B. Inter-process Communication
The main program, the GUI, constitutes the main process that initializes and connects the other processes. The sub-processes take the data they need from the GUI through an inter-process link, run their separate code independently in the background, and return their output to the main program through the same link. When created, each sub-process is initialized and given its parameters; each process then establishes a connection to the main process through two channels: the standard output channel (stdout), which communicates all outputs of the process to the main process, and the standard error channel (stderr), which transfers the error messages (if any) from the sub-process back to the calling process. Even when a sub-process needs to communicate with another sub-process, this communication has to go through the main process, as illustrated in Fig. 4.
Figure 4: Process communication through channels
C. Multi-threading
Since the execution of commands and file copying through SSH is a lengthy process, and since a lot of copying, command execution and file writing has to happen alongside active interaction with the GUI, we had to structure the tool around threads. Threads are, essentially, child processes that execute concurrently. Each separate thread runs in parallel with the GUI, keeping the program responsive at all times and increasing the utilization of the machine's resources. Of course, these threads need to access the GUI to write their output through it to the user; however, multiple threads writing to the same object (i.e., the program log) is prohibited, as it can cause access errors and even program crashes. So we used the concept of an event/event handler or, as it is called in the Qt environment, a signal/slot pair. When a thread has output that it needs to write to the interface, it emits a signal to the GUI telling it that its output is ready. The GUI receives this signal through a specific slot and puts it in the queue of signals waiting to access the interface. This regulation prevents multiple accesses to one object at the same time and gives each thread its turn to write to the interface.
D. Authentication
As mentioned earlier, PlanetLab Central has high security standards that restrict access to its user and node databases; fortunately, it provides an API to access the PLC database and retrieve the node data our tool requires, so we programmed a separate authentication script. This script performs the authentication through the API functions, using the user's data provided in the GUI (the slice name and the PlanetLab-registered email and password), and retrieves the detailed node list for that specific user. The connection to the nodes themselves requires the use of the SSH protocol, and since SSH uses public-key cryptography to authenticate to the remote computer, we had to ask the user for the key file registered in their PlanetLab account. Another issue with the SSH connections is that the SSH protocol asks for the key's password each time you connect to a remote host, for confirmation.
This is the default behavior and it is recommended, as it protects against alteration or theft of the user's SSH key; but if the user needs to access multiple hosts at a time, it makes the connection process redundant and time consuming. Since this check can be disabled through the command options, we decided to give the user the liberty to choose whether to enable this confirmation or not; disabling the check is not recommended, however, as it may raise security warnings and put the security of the user's key at risk.
IV. CONTEST FUNCTIONALITY AND USAGE
ConTest provides an easy and flexible way to set up and perform experiments over PlanetLab. This is done mainly through the experiment manager interface, where the user specifies the experiment parameters, and through the connection tester interface, which establishes different connection schemes between nodes and measures some aspects of these connections. Before going into the details of these two interfaces, we describe the way authentication is performed.
A. Authentication
The authentication dialog pops up at the start-up of the program; it requires the following data before the tool can be used:
• Email: the e-mail registered in your PlanetLab Central account.
• Password: the password of your account.
• Slice name: the name of a slice reserved for you on PlanetLab.
• Key file: the path of the private key used to access the nodes through the secure shell (ssh).
Upon successful authentication, the user information is displayed in the upper right corner and the number of nodes in the provided slice is displayed in the program log. The details of the node list registered on the slice are displayed in the table view, and the nodes themselves are represented on the map, as shown in Fig. 5.
Figure 5: After successful authentication
B. Experiment Management
In this tab, the user can execute commands on remote nodes or upload files to them.
The basic functions that can be performed within this tab are:
• Node selection: nodes can be selected by clicking on them in the node list table or by clicking on their icons on the map. Information about a node can be obtained by hovering the mouse pointer over it; this displays a tool tip showing the node details.
• Command execution: to execute a command, simply select the nodes on which you want to run the command, insert the command text with all variables in the command text box, and then click the "run command" button. The output from each node is displayed on the program log.
• Uploading files: to upload files to nodes, go to the "copy file" tab, where you can select the file to upload using the "browse" button; you can also specify the destination directory on the remote nodes in which to store the file. After selecting the nodes, click "upload file" to start the upload. When uploading is done, the log displays the output from the remote consoles to verify that the process was successful; if there are no error messages, the file has been uploaded.
C. Connection Testing
The connection testing tab provides an interface that allows users to test the connection between any nodes from the list of nodes within a slice. It also allows selecting the protocol of preference. The test results are displayed on the program log and stored in external files for documentation. The basic functions that can be performed within this tab are:
• Test protocol: here you select the test protocol; each protocol has its own client, server and forwarding program. You can select the protocol from the "test protocol" drop-down box, and you can also browse for your own programs if you have customized code to be used with ConTest. To restore the default test programs, press the "restore defaults" button.
• The map interface: in the connection tester, nodes to test can be added only through the map interface. This is done by clicking on a node using the left mouse button; as long as you keep selecting nodes with the left button, the path stays open, until you close it by clicking a node with the right mouse button, thereby setting it as a server. Currently, a single node cannot be given two roles (e.g., a client and a server at the same time). After setting up your connection, click the "run test" button to start the test; when the test is done, its results are displayed on the program log, as shown in Fig. 6.
Figure 6: Adding nodes to a path; the test results shown on the program log
D. Illustrative Examples
We consider two scenarios to show the functionality and flexibility that ConTest offers in setting up and executing experiments over PlanetLab. In the first scenario, we show how to use the tool to obtain information about the disk usage percentage on each node within a slice. In the second scenario, we show how to perform round trip time measurements between two nodes and how to assign a role to each node.
It is important to emphasize here that using ConTest to perform such experiments over PlanetLab provides a great deal of flexibility and saves a lot of time in setting up and running the experiment and collecting the related results. Instead of using the CLI to access each node individually to upload files, execute commands, or set up tests, ConTest automates this process and saves the user's time and effort.

V. RELATED WORK

Since the deployment of PlanetLab, there have been many efforts to simplify the setup and control of experiments over this distributed testbed. These efforts have focused mainly on providing a graphical user interface to select nodes, execute commands, and perform different measurement tests. These tools typically take advantage of PlanetLab Central (PLC), which provides detailed information about PlanetLab nodes, such as host name, geographical location, and TCP connectivity; it also provides convenient interfaces for adding and removing nodes. The PlanetLab experiment manager [13] is one of these tools. It basically allows users to choose nodes for their slices and then to choose some of these nodes for an experiment based on the common information about these nodes. It has the capability of deploying experiment files and executing single or multiple commands on every node in parallel. It also provides means to monitor the progress of an experiment and to view the output from the nodes. The PlanetLab application manager [14] is another tool that was designed to simplify deploying, monitoring, and running applications over PlanetLab. It has several features that enable users to centrally manage their experiments and monitor their applications. The tool is composed of two components: the server side, which requires access to a PostgreSQL/MySQL database and a PHP web server to allow web access, and the client side, which is basically a set of shell scripts that run under bash. The client-side shell scripts require specific customization, which makes the tool a little complicated to use. Stork is a software installation utility service that installs and maintains software for other services [15]. It allows users to install software on a large number of PlanetLab nodes and keep that software updated for a long period of time. It also has a novel security system that empowers users and package creators while reducing trust in a repository administrator. As a result, Stork provides a secure, scalable and effective way to maintain tarball and RPM packages and software on large networks. Other tools have been developed to simplify the process of evaluating the characteristics of PlanetLab nodes, thus allowing users to select suitable nodes for their experiments, e.g., pShell [16] and CoDeeN [17][18]. pShell is a Linux-like shell that provides basic commands to interact with PlanetLab nodes; however, pShell users still need to select suitable nodes manually. ConTest shares many of these features with the tools above. The most obvious difference between ConTest and the other tools is that ConTest provides a visual way to select nodes within a slice using the world-map view. It also allows users to visually construct the network topology and project it on the map.
Moreover, users can specify the role of each node to be a client, a server, or a forwarding node.

VI. CONCLUSION AND FUTURE WORK

In this paper, we have discussed the design and implementation of ConTest, a graphical user interface-based tool that enables researchers to visualize experiments over the PlanetLab network. ConTest is a powerful graphical tool that helps PlanetLab users access nodes within their slices, upload files and execute commands easily, with far fewer complications compared to command-line access, which is the default way of accessing the PlanetLab network. It also helps users perform different tests on the nodes they choose and automates gathering and summarizing the results of these tests. In addition, it gives users the ability to use its connection automation to run their own tests using their custom testing applications. This tool is therefore very promising, with many improvement possibilities that we hope to explore in order to make it an even more effective tool in the world of distributed application testing and development. As part of our ongoing efforts to improve ConTest's functionality and usability, we seek to achieve the following goals:
• In the current implementation, the user can browse remote machine folders using the browsing commands (such as ls). A future release of ConTest will enable browsing directories on a remote node graphically to increase usability and ease of use.
• Enhance the tool's testing capabilities by adding additional statistics. Users of the tool can easily help improve this aspect, since the test codes are separated from the main tool functionality, so virtually any code can be tested and used.
• Improve the graphical design of the tool, specifically the map view. In particular, we will increase the accuracy of the coordinate mapping and add a zooming feature to make the map interface easier to use.
• Even though we enable the user to choose the test codes, this is still a somewhat limited feature. In the longer term, we want to enable users to conduct their own experiments based on the connection setup they choose, instead of limiting them to simple connection-testing codes.

ACKNOWLEDGMENT

The authors would like to thank Dr. Mohammad Fraiwan for his constructive thoughts and comments to improve this work.

REFERENCES

[1] PlanetLab. [Online]. Available: http://www.planetlab.org.
[2] M. J. Freedman, E. Freudenthal, and D. Mazières, "Democratizing content publication with Coral," in Proceedings of the 1st USENIX Symposium on Networked Systems Design and Implementation (NSDI '04), 2004.
[3] V. Ramasubramanian and E. G. Sirer, "O(1) lookup performance for power-law query distributions in peer-to-peer overlays," in NSDI, 2004, pp. 99-112.
[4] K. Park and V. S. Pai, "Scale and performance in the CoBlitz large-file distribution service," in Proc. 3rd Symposium on Networked Systems Design and Implementation (NSDI '06), 2006.
[5] N. Spring, D. Wetherall, and T. Anderson, "Scriptroute: A public Internet measurement facility," 2002.
[6] Nokia Corporation, Qt Creator, qt.nokia.com/products/developer-tools, 2010.
[7] Qt Centre Forum - Qt Centre Community Portal, www.qtcentre.org.
[8] R. C. Morin, "How to PING," 2001.
[9] B. A. Forouzan, "Data Communications and Networking," 2007.
[10] Iperf project, sourceforge.net/projects/iperf, 2008.
[11] T. Holz, C. Gorecki, K. Rieck, and F. Freiling, "Measuring and detecting fast-flux service networks," in Proceedings of the Network & Distributed System Security Symposium, 2008.
[12] E. Passerini, R. Paleari, L. Martignoni, and D. Bruschi, "FluxOR: detecting and monitoring fast-flux service networks," in Detection of Intrusions and Malware, and Vulnerability Assessment, 2008, pp. 186-206.
[13] PlanetLab Experiment Manager, http://www.cs.washington.edu/research/networking/cplane/
[14] R. Huebsch, PlanetLab application manager, http://appmanager.berkeley.intel-research.net/, 2004.
[15] J. Cappos and J. Hartman, "Why it is hard to build a long-running service on PlanetLab," in Proc. Workshop on Real, Large Distributed Systems (WORLDS), San Francisco, CA, Dec. 2005.
[16] B. Maniymaran, pShell: An Interactive Shell for Managing PlanetLab Slices, http://cgi.cs.mcgill.ca/anrl/projects/pShell/index.php/
[17] L. Wang, V. Pai, and L. Peterson, "The Effectiveness of Request Redirection on CDN Robustness," in Proceedings of the 5th OSDI Symposium, December 2002.
[18] CoDeeN, Princeton University: http://codeen.cs.princeton.edu/

Basheer Al-Duwairi received the PhD and MS degrees in computer engineering from Iowa State University in Spring 2005 and Spring 2002, respectively. Prior to this, he received the BS degree in electrical and computer engineering from Jordan University of Science and Technology (JUST), Irbid, Jordan, in 1999. He has been an assistant professor at the Department of Network Engineering and Security at JUST since fall 2009; prior to this, he was an assistant professor in the Computer Engineering Department at the same university from fall 2005 to fall 2009. His research expertise is in the areas of trusted Internet, encompassing Internet infrastructure security with a focus on DDoS, botnets, and P2P security, wireless network security, and resource management in real-time systems. He has coauthored several research papers in these fields. He has given tutorials at reputed conferences and served as a technical program committee member and session chair for many conferences. He has also served as Assistant Dean for Student Affairs, Chairman of the Computer Engineering Department, Vice Dean for Computer and Information Technology, and Director of the Computer and Information Center, all at JUST. http://www.just.edu.jo/~basheer

Mohammad Marei, Malek Ireksossi, and Belal Rehani are undergraduate students at the Department of Computer Engineering at Jordan University of Science & Technology. They are expected to graduate in summer 2011.

Context-oriented Software Development

Basel Magableh
Trinity College Dublin, Dublin, Ireland
Email: magablb@cs.tcd.ie

Stephen Barrett
Trinity College Dublin, Dublin, Ireland
Email: Stephen.barrett@tcd.ie

Abstract— Context-oriented programming (COP) is an emerging technique that enables dynamic behaviour variation based on context changes. In COP, context can be handled directly at the code level by enriching the business logic of the application with code fragments responsible for performing context manipulation, thus providing the application code with the required adaptive behavior. Unfortunately, the whole set of sensors, effectors, and adaptation processes is mixed with the application code, which often leads to poor scalability and maintainability. In addition, the developers have to embed all probable behaviors inside the source code.
As a result, the anticipated adaptation is restricted to the code stubs provided in advance by the developers. Context-driven adaptation requires dynamic composition of context-dependent parts. This can be achieved through the support of a component model that encapsulates the context-dependent functionality and decouples it from the application's core functionality. The complexity behind modeling the context-dependent functionalities lies in the fact that they can occur separately or in any combination, and cannot be encapsulated because of their impact across all the software modules. Before encapsulating crosscutting context-dependent functionality into a software module, the developers must first identify it in the requirements documents. This requires a formal development paradigm for analyzing the context-dependent functionality, and a component model that modularizes its concerns. COCA-MDA is proposed in this article as a model-driven architecture for constructing self-adaptive applications from a context-oriented component model.

Index Terms— Adaptable middleware, Context-oriented component, Self-adaptive application, Object.

I. INTRODUCTION

There is a growing demand for developing applications with aspects such as context awareness and self-adaptive behaviors. Self-adaptive software evaluates its own behavior and changes it when the evaluation indicates that it is not accomplishing what the software is intended to do, or when better functionality or performance is possible. Traditionally, self-adaptability is needed to handle complexity, robustness under unanticipated conditions, changing priorities and policies governing the objective goals, and changing conditions in the contextual environment. Hirschfeld et al. [1] considered context to be any information that is computationally accessible and upon which behavioral variations depend. A context-dependent application adjusts its behavior according to context conditions arising during execution. The techniques that enable applications to handle the contextual information are generally known as "context-handling" techniques. Context handling is of vital importance for developers and for self-adaptive architectures, since it provides dynamic behavioral adaptation and makes it possible to produce more useful computational services for end users in the mobile computing environment [2]. The mobile computing environment is heterogeneous and dynamic. Everything from the devices used, the resources available, and network bandwidths, to the user context, can change drastically at runtime [3]. This presents application developers with the challenge of tailoring behavioral variations to each specific user and context. With the capacity to move and the desire to be socially collaborative, mobile computing users can benefit from the self-adaptability and context-awareness features that are supported by self-adaptive applications. This article focuses on describing a development paradigm for Context-Oriented Programming, which enables self-adaptability features in this emerging class of applications. The development methodology, Context-Oriented Component-based Application Model-Driven Architecture (COCA-MDA), modularizes the application's context-dependent behavior into context-oriented components. The components separate the application's functional concerns from the extra-functional concerns.
The application is organized into two causally connected layers: the base layer, which provides the application's core structure, and the meta-layer, where the COCA-components are located and which provides composable units of behavior. The component model design (COCA-components) has been proposed in previous work in [4]. A COCA-component refers to any subpart of the software system that encapsulates a specific context-dependent functionality in a unit of behavior contractually specified by interfaces and explicit dependencies. The result of this methodology is a component-based application described by an architecture description language, COCA-ADL. COCA-ADL is used to bridge the gap between the software models in the platform-independent model of the MDA and the software architecture runtime model. Such employment of the ADL decouples the application's architecture design from the target platform implementation.

The rest of the article is structured as follows. Section II discusses behavioral variability support in Context-Oriented Programming and aspects. Section III describes the rationale for providing a development paradigm for Context-Oriented Programming. Section IV describes the COCA-component model. Section V describes the COCA-ADL elements. The COCA-MDA phases are described in Section VI, and Section VII demonstrates a case study designed using COCA-MDA and implemented with the COCA-middleware.

II. VARIABILITY MANAGEMENT WITH CONTEXT-ORIENTED PROGRAMMING AND ASPECTS

Compositional adaptation enables an application to adopt a new structure or behavior to anticipate concerns that were unforeseen during the original design and construction. Normally, compositional adaptation can be achieved using separation-of-concerns techniques, computational reflection, component-based design, and adaptive middleware [5]. The separation of concerns enables software developers to separate the functional behavior and the crosscutting concerns of self-adaptive applications. The functional behavior refers to the business logic of an application [5]. Context-driven behavioral variations are heterogeneous crosscutting concerns, a set of collaborating aspects that extend the application behavior in several parts of the program and have an impact across the whole system. Such behaviors are called crosscutting concerns. Crosscutting concerns are properties or areas of interest such as quality of service, energy consumption, location awareness, users' preferences, and security. This work considers the functional behavior of an application as the base-component that provides the user with context-free functionality. On the other hand, context-dependent behavior variations are considered crosscutting concerns that span the software modules in several places.

Context-oriented programming is an emerging technique that enables context-dependent adaptation and dynamic behavior variations [6, 7]. In COP, context can be handled directly at the code level by enriching the business logic of the application with code fragments responsible for performing context manipulation, thus providing the application code with the required adaptive behavior [8]. Unfortunately, the whole set of sensors, effectors, and adaptation processes is mixed with the application code, which often leads to poor scalability and maintainability [9].
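The COP style can be made concrete with a small, hand-rolled sketch; Python is used here purely for illustration and is not one of the COP languages cited above. Behaviour variations are grouped into named layers that are activated and deactivated as the context changes, and a context-dependent operation dispatches to the variation of the most recently activated matching layer.

# A minimal, illustrative sketch of the context-oriented style (hand-rolled,
# not a real COP language). Function and layer names are hypothetical.
_active_layers = []

class layer:
    """Context manager that activates a named layer for a block of code."""
    def __init__(self, name):
        self.name = name
    def __enter__(self):
        _active_layers.append(self.name)
    def __exit__(self, *exc):
        _active_layers.remove(self.name)

def context_dependent(variations):
    """Pick the variation of the most recently activated matching layer."""
    def decorator(base_func):
        def wrapper(*args, **kwargs):
            for name in reversed(_active_layers):
                if name in variations:
                    return variations[name](*args, **kwargs)
            return base_func(*args, **kwargs)   # context-free default
        return wrapper
    return decorator

def _render_text_only(item):
    return f"{item} (text only)"

@context_dependent({"low_bandwidth": _render_text_only})
def render(item):
    return f"{item} (full multimedia view)"

print(render("page"))                      # -> full multimedia view
with layer("low_bandwidth"):
    print(render("page"))                  # -> text only

Real COP languages provide such layer activation and dynamic scoping as first-class language constructs rather than library code, which is precisely why the business logic ends up interleaved with the context-manipulation fragments discussed above.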
In general, the proposed COP approaches support fine-grained adaptation only among the behavior variants that were introduced at compile time, and a special compiler is needed to perform the context-handling operation. To the best of our knowledge, COP does not support dynamic composition of software modules, and it has no support for introducing new behavior or adjusting the application structure to anticipate the context changes. In addition, the developers have to embed all probable behaviors inside the source code. As a result, the anticipated adaptation is restricted to the code stubs provided in advance by the developers. On the other hand, it is impractical to foresee all likely behaviors and program them in the source code. For a more complex context-aware system, the same context information may be relevant in different parts of an application and may trigger the invocation of additional behavior. In this way, context handling becomes a concern that spans several application units, essentially crosscutting the main application execution. A programming paradigm aiming at handling such crosscutting concerns (referred to as aspects) is aspect-oriented programming (AOP) [10]. In contrast to COP, using the AOP paradigm, context information can be handled through aspects that interrupt the main application execution. In order to achieve self-adaptation to context in a manner similar to COP, the context-dependent behavioral variations must be addressed. Unfortunately, the aspect-oriented development methodology can only be used to handle homogeneous behavioral variations, where the same piece of code is invoked in several software modules [11, 12]; it does not support the adaptation of aspects to context, in what is called context-driven adaptation [9]. Moreover, static AOP is classified as a compositional adaptation performed at compile time [5]; anticipating context changes at runtime is not an option, especially in the presence of unforeseen changes. Another approach supported by AOP is called dynamic weaving of aspects [13]; this injects code into the program execution whenever a new behavior is needed. However, existing AOP languages tend to add a substantial overhead in both execution time and code size, which restricts their practicality for small devices with limited resources [14].

III. RATIONALE

Context changes are the causes of adaptation. Context-driven adaptation requires the self-adaptive software to anticipate its context-dependent variations. The context-dependent variations can be classified into actor-dependent, system-dependent, and environment-dependent behavior variations. The complexity behind modeling these behavior variations lies in the fact that they can occur separately or in any combination, and cannot be encapsulated because of their impact across all the software modules. Context-dependent variations can be seen as a collaboration of individual features (aspects) expressed in requirements, design, and implementation, and they are sufficient to qualify as heterogeneous crosscutting concerns in the sense that different code fragments are applied to different program parts. Before encapsulating crosscutting context-dependent behaviors into a software module, the developers must first identify them in the requirements documents. This is difficult to achieve
because, by their nature, context-dependent behaviors are tangled with other behaviors and are likely to be included in multiple parts of the software modules (i.e., scattered). Using intuition or even domain knowledge is not necessarily sufficient for identifying the context-dependent parts of self-adaptive applications. This requires a formal procedure for analyzing them in the software requirements and separating their concerns. Moreover, a formal procedure for modeling these variations is needed. Such analysis and modeling procedures can reduce the complexity of modeling self-adaptive applications. In this sense, a formal development methodology can facilitate the development process and provide a new modularization of a self-adaptive software system in order to isolate the context-dependent from the context-free functionalities. Such a methodology, it is argued, can decompose the software system into several behavioral parts that can be used dynamically to modify the application behavior based on the execution context. Behavioral decomposition of a context-aware application can provide a flexible mechanism for modularizing the application into several units of behavior. Because each behavior realizes a specific context condition at runtime, such a mechanism requires separating the concerns of context handling from the concerns of the application business logic. In addition, separating the application's context-dependent and context-independent parts can support a behavioral modularization of the application, which simplifies the selection of the appropriate parts to be invoked in the execution whenever a specific context condition is captured.

The adaptive software operates through a series of substates (modes). The substates are indexed by j, and j might represent a known or unknown conditional state. Examples of known states, in the generic form, include detecting context changes in a reactive or proactive manner. At runtime, the middleware transforms the self-adaptive software from state_i into state_(i+1), considering a specific context condition t_jk, as shown in Figure 1. This enables the developer to decide clearly which part of the architecture should respond to the context changes t_jk, and provides the middleware with sufficient information to consider only a subset of the architecture during the adaptation; this improves the adaptation process, its impact and cost, and reduces the computational overhead of implementing this class of applications on mobile devices.

Self-adaptive and context-aware applications can be seen as the collaboration of individual behavioral variations expressed in requirements, design, and implementation. This article contributes by proposing a model-driven architecture (COCA-MDA) integrated with a behavioral decomposition technique, based on observation of the context information in requirements and modeling. As a result of combining a decomposition mechanism with MDA, a set of behavioral units is produced. Each unit implements several context-dependent functionalities. This requires a component model that encapsulates these code fragments in distinct architecture units and decouples them from the core-functionality components. This is what motivates the research towards proposing a context-oriented component model (COCA-component).
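To make the state/mode view above more concrete, the toy sketch below shows the kind of mapping a decision point gives the middleware: given the current state and an observed context condition t_jk, it yields the next state. The states, conditions and rules are hypothetical and are not taken from the article's model.

# Illustrative only: hypothetical context-driven state transitions.
TRANSITIONS = {
    # (current state, observed context condition) -> next state
    ("gps_tracking",    "battery_low"):    "coarse_tracking",
    ("coarse_tracking", "battery_high"):   "gps_tracking",
    ("gps_tracking",    "bandwidth_low"):  "cached_maps",
}

def step(state, condition):
    """Return the next state, or stay in the current one if no rule applies."""
    return TRANSITIONS.get((state, condition), state)

state = "gps_tracking"
for event in ["battery_low", "bandwidth_low", "battery_high"]:
    state = step(state, event)
    print(event, "->", state)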
Context-driven adaptation requires dynamic composition of context-dependent parts, which enables the developer to add, remove, or reconfigure components within an application at runtime. Each COCA-component embeds a specific context-dependent functionality (C+S)_ji, realized by a context-oriented component (COCA-component) model. Each COCA-component realizes several layers that encapsulate a fragment of code related to a specific software mode layer (M_jk), as shown in Figure 1. The developers have the ability to provide a decision policy for each decision point j whenever a specific context-related condition is captured. Hereafter, the COCA-components and their internal parts are dynamically managed by the COCA-middleware to modify the application behavior. The COCA-middleware performs context monitoring, dynamic decision-making, and adaptation, based on policy evaluation. The decision policy framework is maintained both at modeling time and at runtime.

Figure 1: Behavioral Decomposition Model

In the presence of uncertainty and unforeseen context changes, the self-adaptive application might be notified about an unknown condition not anticipated prior to the software design. Such adaptation is reflected in a series of context-system states. (C+S)_ji denotes the i-th combination of context-dependent behavior, which is related to decision point j by the mode notation M_jk. In this way, the development methodology decomposes the software into a set of context-driven and context-free states.

IV. CONTEXT-ORIENTED COMPONENT MODEL (COCA-COMPONENT)

The COCA-component model was proposed in [21], based on the concept of a primitive component introduced by Khattak et al. in [17] and on Context-Oriented Programming (COP) [13]. COP provides several features that fit the requirements of a context-aware application, such as behavioral composition, dynamic layer activation, and scoping. This component model dynamically composes adaptable context-dependent applications based on a specific context-dependent functionality. The authors developed the component model by designing components as compositions of behaviors, embedding decision points in the component at design time to determine the component behaviors, and supporting reconfiguration of decision policies at runtime to adapt behaviors. At this stage, each COCA-component must adopt the COCA-component model design. A sample COCA-component is shown in Figure 2; it is modeled as a control class with the required attributes and operations. Each layer entity must implement two methods that collaborate with the context manager. These two methods inside the layer class, namely ContextEntityDidChanged and ContextEntityWillChanged, are called when the context manager posts notifications of the form [NotificationCenter Post:ContextConditionDidChanged]. This triggers the layer class to invoke its method ContextEntityDidChanged, which embeds a subdivision of the COCA-component implementation.

Figure 2: COCA-component Conceptual Diagram

The COCA-component has three major parts: a static part, a dynamic part, and ports. The component itself provides information about its implementation to the middleware. The COCA-component has the following attributes: ID, name, context entity, creation time, location, and a remote variable. The Boolean attribute remote indicates whether or not the component is located in a distributed environment. The decision policy and decision points are attributes with getter and setter methods. These methods are used by the middleware to read the attached PolicyID and to manipulate the application behavior by manipulating the decision policy.
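The structure just described can be summarized in a schematic sketch. The code below is not the authors' implementation (the original targets Objective-C-style notification centers); it only mirrors, in Python, the attributes, the layer callbacks and the policy accessors named above, with illustrative simplifications.

# Schematic rendering of the COCA-component structure described above.
import time

class Layer:
    """One behaviour variation; reacts to context-manager notifications."""
    def context_entity_will_change(self, entity, new_value):
        pass                                   # e.g. release resources

    def context_entity_did_change(self, entity, new_value):
        pass                                   # run this layer's code fragment

class COCAComponent:
    def __init__(self, comp_id, name, context_entity, remote=False):
        # Static part: identifying attributes read by the middleware.
        self.id = comp_id
        self.name = name
        self.context_entity = context_entity    # which context it depends on
        self.creation_time = time.time()
        self.remote = remote                    # True if deployed remotely
        self.decision_policy = None             # PolicyID set via the setter
        # Dynamic part: the layers that encapsulate behaviour variations.
        self.layers = []

    # Getter/setter pair used by the middleware, as described above.
    def get_decision_policy(self):
        return self.decision_policy

    def set_decision_policy(self, policy_id):
        self.decision_policy = policy_id

    # Called when the context manager posts a change notification.
    def notify(self, entity, new_value):
        if entity != self.context_entity:
            return
        for layer in self.layers:
            layer.context_entity_will_change(entity, new_value)
        for layer in self.layers:
            layer.context_entity_did_change(entity, new_value)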
The COCA-component handles the implementation of a context-dependent functionality by employing the delegate design pattern [6], so the adaptation manager invokes these components whenever the COCA-component is notified by the context manager. A delegate is a component that is given an opportunity to react to changes in another component or to influence the behavior of another component. The basic idea is that two components coordinate to solve a problem. A COCA-component is very general and intended for reuse in a wide variety of contextual situations. The base-component stores a reference to another component, i.e., its delegate, and sends messages to the delegate at critical times. The messages may only inform the delegate that something has happened, giving the delegate an opportunity to do extra processing, or the messages may ask the delegate for critical information that will control what happens. The delegate is typically a unique custom object within the controller subsystem of an application [6].

V. COCA-ADL: A CONTEXT-ORIENTED COMPONENT-BASED APPLICATION ADL

The aim of this section is to introduce the architecture description language COCA-ADL. COCA-ADL is an XML-based language used to describe the architecture produced by the development methodology COCA-MDA. COCA-ADL is used to bridge the gap between the application models and the implementation language; thus, it enables the architecture to be implemented in several programming languages.

Figure 3: COCA-ADL Elements

COCA-ADL is designed as a three-tier system. The first tier consists of the building blocks, i.e., the components, including the COCA-components and the base-components. The second tier refers to connectors, and the third refers to the architecture configuration, which includes a full description of the state-machine models, the activity diagram, and the decision policies' syntax. Figure 3 shows the main elements of COCA-ADL. Each element is associated with an architecture template type. The main features provided by the element types are instantiation, evolution, and inheritance.

VI. COCA-MDA DEVELOPMENT APPROACH

COCA-MDA has adopted the component collaboration architecture (CCA) and the entity model. The CCA details how to model the structure and behavior of the components that comprise a system at varying and mixed levels of granularity. The entity model describes a metamodel that may be used to model entity objects that are a representation of concepts in the application problem domain and to define them as composable components [17]. COCA-MDA partitions the software into three viewpoints: the structure, behavior, and enterprise viewpoints. The structure viewpoint focuses on the core components of the self-adaptive application and hides the context-driven components. The behavior viewpoint focuses on modeling the context-driven behavior of the components, which may be invoked in the application execution at runtime. The enterprise viewpoint focuses on remote components or services, which may be invoked from the distributed environment. The design of a context-aware application according to the COCA-MDA approach was proposed in [18].
The use of COCA-MDA for developing self-adaptive applications for indoor wayfinding for individuals with cognitive impairments was proposed in [19]. An evaluation of the development effort and productivity of COCA-MDA was demonstrated in [20]. This article focuses on describing in detail the process of analyzing and classifying the software requirements, and on how the software is designed through the platform-independent model and the platform-specific model. Model transformation and code generation were discussed in a previous work [18].

VII. CONTEXT-ORIENTED COMPONENT-BASED APPLICATION EXAMPLE

IPetra is a tourist-guide application that helps visitors explore the historical city of Petra, Jordan. IPetra offers a map-based user interface maintained by an augmented reality browser (ARB). The browser displays points of interest (POIs) within the physical view of the device's camera. Information related to each POI is displayed in the camera overlay view. The POIs comprise buildings, tourist-service sites, restaurants, hotels, and ATMs in Petra. The AR browser offers an instantaneous, live view through the device's camera. When the user points the camera in the direction of a building, a short description related to that building is shown to the user. Constant use of the device's camera, together with acquiring data from many sensors, can drain the device's resources. This requires the application to adjust its tasks among several contexts to maintain quality of service without disrupting the application's functions. The application requires frequent updates of the user's position, the network bandwidth, and the battery level.

Figure 6 summarizes the modeling tasks, using the associated UML diagrams. The developer starts with an analysis of an application scenario to capture the requirements. The requirements are combined in one model in the requirements diagram. The requirements diagram is modeled using a use-case diagram that describes the interaction between the software system and the context entity. The use-case model is partitioned into two separate views. The core-structure view describes the core functionality of the application. The extra-functionality object diagram describes the COCA-component interaction with the core application classes. The state diagram and the activity diagram are extracted from the behavioral view. Finally, the core structure, the behavioral models, and the context model are transformed into the COCA-ADL model.

Figure 6: Modelling tasks

A. Computational Independent Model

In the analysis phase, the developers analyze several requirements using separation-of-concern techniques. The developers focus on separating the functional requirements from the extra-functional requirements as a first stage. They then separate the user and context requirements from each other. There are two subtasks in the analysis phase.

1) Task 1: Requirements capturing by textual analysis: In this task, the developer identifies the candidate requirements for the illustration scenario using a textual analysis of the application scenario. It is recommended that the developer identifies the candidate actors, use-cases, classes, and activities, as well as capturing the requirements in this task. This can be achieved by creating a table that lists the results of the analysis. This table provides an example of a data dictionary for the proposed scenario.
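That table is not reproduced here; purely as an illustration, and not the authors' actual dictionary, a fragment of such a data dictionary for the IPetra scenario could record entries of the following form:

# Hypothetical data-dictionary fragment for the IPetra scenario.
DATA_DICTIONARY = {
    "actors":     ["Tourist", "Context manager", "Location service"],
    "use_cases":  ["Display POIs", "Update location", "Plan route"],
    "classes":    ["MapUI", "CameraUI", "POI", "LocationProvider"],
    "activities": ["Capture camera frame", "Overlay POI description",
                   "Refresh location, bandwidth and battery readings"],
}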
2) Task 2: Identifying the extra-functional requirements and relating them to the middleware functionality: The first step in this process is to understand the application's execution environment. The context is classified in the requirements diagram based on its type and on whether it comes from a context provider or a context consumer. A context can be generated from a physical or logical source (e.g., available memory) or from resources (e.g., battery power level and bandwidth). The representations of sensors and resources inside the application that is going to consume them at runtime are referred to as the context consumers.

Figure 7: Requirements UML profile

The next level of requirements classification is to classify the requirements based on their anticipation level; this can be foreseeable, foreseen, or unforeseen. This classification allows the developer to model the application behavior as far as possible and to plan for the adaptation actions. To facilitate this classification framework, a UML profile is designed to support the requirements analysis and to be used by the software designer, as shown in Figure 7. As shown in Figure 8, extra-functional requirements are captured during this task, for example requirement number 3: adapt the location service. IPetra is required to adapt its behavior and increase the battery life. This is achieved by adopting a location service that consumes less power. For example, if the battery level is low, the IPetra application switches off the GPS location services and uses the cell-tower location services. Using an IP-based location reduces the accuracy of the location but saves battery energy. In addition, the application may reduce the number of POIs it displays to those around the most recent device location. Moreover, the application reduces the frequency of the location updates. On the other hand, if the battery level is high and healthy, IPetra uses the GPS service with more accurate locations, and the application starts listening for all events in the monitored region inside Petra city.

3) Task 3: Capturing user requirements: This task focuses on capturing the user's requirements as a subset of the functional requirements, as shown in the UML profile in Figure 7. This task is similar to a classical requirements-engineering process, where the developers analyze the main functions of the application that achieve specific goals or objectives.

B. Modelling: Platform Independent Model

In order to be aware of possible resource and context variations and the necessary adaptation actions, a clear analysis of the context environment is the key to building dynamic context-aware applications.

4) Task 4: Resources and context entity model: The resources and context model provides a generic overview of the underlying device's resources, sensors, and logical context providers. This diagram models the engagement between the resources and the application under development, and it helps the developer to understand the relationships between the context providers and their dependencies.

5) Task 5: Use-cases: The requirements diagram in Figure 8 represents the main input for this task. Each requirement is incorporated into a use-case, and the developers identify the actor of the requirement. An actor could be a user, a system, or the environment. The use-cases are classified into two distinct classes, i.e., the core-functionality use-cases and the use-cases extended by the context conditions.
The first step is to identify the interaction between the actor and the software functions that satisfy the user requirement in a context-free fashion. For example, the "displaying POIs" functionality in the figure is context independent, in the sense that the application must provide it regardless of the context conditions. All these use-cases are modeled separately, using a class diagram that describes the application core structure, or base-component model, as shown in the following task.

Figure 8: Functional and extra-functional partial requirements diagram

6) Task 6: Modeling the application core structure: In this task, a classical class diagram models the components that provide the application's core functions. These functions are identified from the use-case diagram in the previous task. However, the class diagram is modeled independently of the variations in the context information. For this scenario, some classes, such as "Displaying POIs", "Route-planningUI", "CameraUI", "MapUI", and "User Interface", are classified as part of the application core. These classes provide the core functions for the user during a tour of Petra city. Figure 9 shows the core-structure class model, without any interaction with the context environment or the middleware.

7) Task 7: Identifying application-variant behavior (behavior view): In this phase, the developers specify how the application can adapt to context conditions to achieve a specific goal or objective. After specifying the core elements of the application in the previous task, the behavioral view is identified in this task. This task identifies when and where an extra-functionality can be invoked in the application execution. This means the developer has to analyze the components involved, their communication, and possible variations in their subdivisions, where each division realizes a specific implementation of that COCA-component. To achieve this integration, the developers have to consider two aspects of the context-manager design: how to notify the adaptation manager about the context changes, and how the component manager can identify the parts of the architecture that have to respond to these changes. These aspects can be achieved by adopting the notification design pattern in modeling the relation between the context entity and the behavioral component. Hereafter, these extra-functionalities are called the COCA-components. Each component must be designed on the basis of the component model described in Figure 2.

Figure 9: IPetra core-classes structure

The IPetra application is modularized into several COCA-components. Each component models one extra-functionality, such as the Location COCA-component in Figure 10. The COCA-component sub-layers implement several context-dependent functionalities that use the location service. Each layer is activated by the middleware, based on context changes. After applying the observer design pattern and the COCA-component model to the use-cases, the class diagram for the middleware functionality "Update Location" can be modeled as shown in Figure 10. Figure 10 shows a COCA-component modelled to anticipate the 'direction output'. The COCA-component implements a delegate object and sub-layers; each layer implements a specific context-dependent function. The COCA-middleware uses this delegate object to redirect the execution among the sub-layers, based on the context condition.
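As a rough illustration of this redirection, the sketch below swaps the active sub-layer of a location component when the battery context changes. The layer names and the threshold are hypothetical, and in the real design the switch is driven by the COCA-middleware through notifications rather than by direct calls.

# Illustrative delegate that redirects "Update Location" between sub-layers.
class GPSLayer:
    update_interval = 5           # seconds; frequent, accurate fixes
    def locate(self):
        return "high-accuracy GPS fix"

class CellTowerLayer:
    update_interval = 60          # seconds; coarse but battery-friendly
    def locate(self):
        return "coarse cell-tower fix"

class LocationDelegate:
    """Chooses the active sub-layer whenever the battery context changes."""
    def __init__(self):
        self.active_layer = GPSLayer()

    def battery_level_did_change(self, level_percent):
        # Mirrors the scenario above: a low battery switches the application
        # to the cheaper location source and a slower update rate.
        if level_percent < 20:
            self.active_layer = CellTowerLayer()
        else:
            self.active_layer = GPSLayer()

    def update_location(self):
        return self.active_layer.locate()

delegate = LocationDelegate()
delegate.battery_level_did_change(15)
print(delegate.update_location())   # -> coarse cell-tower fix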
Invoking different variations of the COCA-component requires identification of the application architecture, the behavior, and the decision policies in use. As mentioned before, these decision policies play an important role in the middleware functions, which use them in handling the architecture evolution and the adaptation actions. The model in Figure 10 helps the developer to extract the decision policies and the decision points from the interactions between the context entities and the COCA-components. Decision policies are used by the middleware to select suitable adaptation actions under specific context conditions. The application behavioural model is used to identify the decision points in the execution that might be reached whenever internal or external variations are found. Each decision point requires several parameter inputs to make the correct choice at this critical time. Using the activity diagram, the developers can extract numerous decision policies. Each policy must be modelled in a state diagram; for example, the "camera flashes" policy is attached to the "Camera flashes" COCA-component. The policy syntax can be described by the code shown in Listing 1.

If (direction is provided && available memory >= 50 && CPU throughput <= 89 && light level >= 50 && BatteryLevel >= 50) then
    { EnableFlashes(); }
Else if (BatteryLevel < 50 || LightLevel < 50) then
    { DisableFlashes(); SearchForPhotos(); }
Else if (BatteryLevel < 20) then
    { DisableFlashes(); }

Listing 1: The "camera flashes" decision policy syntax

Figure 10: Extra-functionality object diagram of the context-oriented components

VIII. EXPERIMENTS EVALUATION

The IPetra application has been implemented in two distinct versions, i.e., with and without the COCA-middleware. Instruments is a tool application for dynamically tracing and profiling iPhone devices. The battery life has been measured by the Energy Diagnostics instrument running on the device [22]; the energy diagnostics measure the battery while the device is not connected to an external power supply. The experiments show that the COCA-based IPetra application reduced the battery consumption by 13%, in addition to providing self-tuning adaptability. Fig. 14 shows the experimental results for energy usage. The IPetra implementation that does not adopt the COCA platform consumes more energy during context monitoring, draining the battery faster. On the other hand, when the same application adopts the COCA-middleware, the application is able to adapt its behaviour and use less energy. The adaptation times for handling low and high battery levels are shown in Figure 13. It is worth mentioning here that when the battery level is low, the COCA-middleware allocates less memory because of the size of the COCA-component, which is small compared to its full implementation.

Figure 13: Adaptation time (ms) and memory allocation (KB)

Figure 14: Energy usage for the IPetra application
IX. CONCLUSIONS AND FUTURE WORK

This article described a development paradigm for building context-oriented applications using a combination of a Model-Driven Architecture that generates an ADL, which presents the architecture as a component-based system, and a runtime infrastructure (middleware) that enables transparent self-adaptation to the underlying context environment. Specifically, a new approach to building context-aware and self-adaptive applications is demonstrated by adopting a Model-Driven Architecture (COCA-MDA). COCA-MDA enables developers to modularize applications based on their context-dependent behaviors, to separate context-dependent functionalities from the application's generic functionality, and to achieve dynamic context-driven adaptation without compromising the quality attributes. COCA-MDA needs to be improved with respect to support for both requirement reflection and modeling requirements as runtime entities. The requirement reflection mechanism requires support at the modeling level and at the architecture level. Reflection can be used to anticipate the evolution of both functional and non-functional requirements. The decision policies require more development with respect to policy mismatch and resolution. This is in line with an improvement in terms of self-assurance and dynamic evaluation of the adaptation output.

REFERENCES

[1] R. Hirschfeld, P. Costanza, and O. Nierstrasz, "Context-oriented programming," Journal of Object Technology, vol. 7, no. 3, pp. 125-151, March 2008.
[2] A. K. Dey, "Providing architectural support for building context-aware applications," Ph.D. dissertation, Georgia Institute of Technology, Atlanta, GA, USA, 2000.
[3] N. M. Belaramani, C.-L. Wang, and F. C. M. Lau, "Dynamic component composition for functionality adaptation in pervasive environments," in Proceedings of the Ninth IEEE Workshop on Future Trends of Distributed Computing Systems (FTDCS '03). IEEE Computer Society, 2003, pp. 226-232.
[4] B. Magableh and S. Barrett, "PCOMs: A component model for building context-dependent applications," in Proceedings of the 2009 Computation World: Future Computing, Service Computation, Cognitive, Adaptive, Content, Patterns (COMPUTATIONWORLD '09). Washington, DC, USA: IEEE Computer Society, 2009, pp. 44-48.
[5] P. K. McKinley, S. M. Sadjadi, E. P. Kasten, and B. H. C. Cheng, "Composing adaptive software," Computer, vol. 37, pp. 56-64, July 2004.
[6] M. Gassanenko, "Context-oriented programming," in Proceedings of the EuroFORTH'93 conference, Marianske Lazne (Marienbad), Czech Republic, 15-18 October 1998, pp. 1-14.
[7] R. Keays and A. Rakotonirainy, "Context-oriented programming," in Proceedings of the 3rd ACM International Workshop on Data Engineering for Wireless and Mobile Access (MobiDE '03), San Diego, CA, USA, 2003, pp. 9-16.
[8] M. Salehie and L. Tahvildari, "Self-adaptive software: Landscape and research challenges," ACM Trans. Auton. Adapt. Syst., vol. 4, pp. 14:1-14:42, May 2009.
[9] G. Kapitsaki, G. Prezerakos, N. Tselikas, and I. Venieris, "Context-aware service engineering: A survey," Journal of Systems and Software, vol. 82, no. 8, pp. 1285-1297, 2009.
[10] G. Kiczales, J. Lamping, A. Mendhekar, C. Maeda, C. Lopes, J.-M. Loingtier, and J. Irwin, "Aspect-oriented programming," in ECOOP'97 - European Conference on Object-Oriented Programming, ser. Lecture Notes in Computer Science, vol. 1241. Springer Berlin/Heidelberg, 1997, pp. 220-242.
[11] S. Apel, T. Leich, and G. Saake, "Aspectual mixin layers: aspects and features in concert," in Proceedings of the 28th International Conference on Software Engineering (ICSE '06). Shanghai, China: ACM, 2006, pp. 122-131.
[12] M. Mezini and K. Ostermann, "Variability management with feature-oriented programming and aspects," SIGSOFT Softw. Eng. Notes, vol. 29, pp. 127-136, October 2004.
[13] A. Popovici, T. Gross, and G. Alonso, "Dynamic weaving for aspect-oriented programming," in Proceedings of the 1st International Conference on Aspect-Oriented Software Development (AOSD '02). New York, NY, USA: ACM, 2002, pp. 141-147.
[14] C. Hundt, D. Stöhr, and S. Glesner, "Optimizing aspect-oriented mechanisms for embedded applications," Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 6141 LNCS, pp. 137-153, 2010.
[15] Y. Khattak and S. Barrett, "Primitive components: towards more flexible black-box AOP," in Proceedings of the 1st International Workshop on Context-Aware Middleware and Services (CAMS '09), affiliated with the 4th International Conference on Communication System Software and Middleware (COMSWARE 2009). New York, NY, USA: ACM, 2009, pp. 24-30.
[16] E. Buck and D. Yacktman, Cocoa Design Patterns, 2nd ed. Developer's Library, 2010.
[17] "Enterprise collaboration architecture (ECA) specification," http://www.omg.org/, pp. 1-202, Feb 2004.
[18] B. Magableh and S. Barrett, "Objective-COP: Objective context-oriented programming," in International Conference on Information and Communication Systems (ICICS 2011), vol. 1, May 2011, pp. 45-49.
[19] ——, "Self-adaptive application for indoor wayfinding for individuals with cognitive impairments," in The 24th IEEE International Symposium on Computer-Based Medical Systems (CBMS 2011), Lancaster, UK, June 2011, in press, pp. 45-49.
[20] B. Magableh, "Model-driven productivity evaluation for self-adaptive context-oriented software development," in 5th International Conference and Exhibition on Next Generation Mobile Applications, Services, and Technologies (NGMAST '11), Cardiff, Wales, United Kingdom, Sep. 2011, in press.
[21] R. Anthony, D. Chen, M. Pelc, M. Persson, and M. Törngren, "Context-aware adaptation in DySCAS," in Proceedings of the Context-aware Adaptation Mechanisms for Pervasive and Ubiquitous Services (CAMPUS 2009), 2009, p. 15.
[22] iOS 4.0 Apple developer library, http://developer.apple.com/library/ios/navigation/, 2010. [Online; accessed 1 April 2011].

Basel Magableh received his MS degree in computer science from the New York Institute of Technology, NY, USA, in 2004. He is currently a Ph.D. candidate at the Distributed Systems Group, Trinity College Dublin, Ireland. His research focuses on integrating Model-Driven Architecture with a component-based system to construct self-adaptive and context-aware applications. He is a full-time lecturer at Grafton College of Management Science, Dublin, Ireland. He was a member of staff at the National Digital Research Centre of Ireland from 2008 to 2011.

Stephen Barrett is currently a lecturer at the Distributed Systems Group, Trinity College Dublin, Ireland. His research centers on middleware support for adaptive computing (with particular focus on model-driven paradigms) and on large-scale applications research (particularly in the context of web search, trust computation, and peer and cloud computing).
Aggregated Search in XML Documents

Fatma Zohra Bessai-Mechmache
Research Centre on Scientific and Technical Information, CERIST, Algiers, Algeria
zbessai@cerist.dz

Zaia Alimazighi
University of Science and Technology, USTHB, LSI, Algiers, Algeria
alimazighi@wissal.dz

Abstract—In this paper, we are interested in aggregated search in structured XML documents. We present a structured information retrieval model based on possibilistic networks. Term-element and element-document relations are modeled through possibility and necessity. In this model, the user's query starts a propagation process to retrieve the elements. Thus, instead of retrieving a list of elements that are each likely to answer the user's query only partially, our objective is to build a virtual element that contains relevant and non-redundant elements and is likely to answer the query better than the elements taken separately. Indeed, the possibilistic network structure provides a natural representation of the links between a document, its elements and its content, and allows an automatic selection of relevant and non-redundant elements. We evaluated our approach using a sub-collection of INEX (INitiative for the Evaluation of XML retrieval) and present some results for evaluating the impact of the aggregation approach.

Index Terms— XML Information Retrieval, possibilistic networks, aggregated search, redundancy.

I. INTRODUCTION

The main problem of content-based XML information retrieval is how to select the unit of information that best answers the user's query [13] [9]. Most XML Information Retrieval (IR) approaches [23] [17] [15] [16] [18] consider that the returned units are a list of disjoint elements (subtrees of XML documents). We assume that the relevant unit is not necessarily a single adjoining element or a document; it could also be any aggregation of elements of that document. Let us consider a document with the following structure: (document (title) (chapter1 (section1) (section2)) (chapter2 (…))). If the relevant information is located in the "title" and "section1", most XML IR systems will return the whole document as the relevant unit. In our case, we consider that the only unit to be returned is an aggregate (element set) formed by both elements: "title" and "section1". To achieve this objective, we propose a model enabling the automatic selection of an aggregation of non-redundant elements of the document that better answers the user's need, formulated through a list of keywords.

The model we propose finds its theoretical basis in possibilistic networks. The network structure provides a natural manner to represent the links between a document, its elements and its content. As for possibilistic theory, it makes it possible to quantify, both qualitatively and quantitatively, the various underlying links: it allows us to express the fact that a term is certainly or possibly relevant with respect to an element and/or a document, and to measure to what extent an element (or a set of elements) can necessarily or possibly answer the user's query.

This paper is organized as follows: Section 2 presents a brief state of the art on aggregated search. Section 3 gives a brief definition of possibilistic theory. Section 4 is devoted to the description of the model we propose. Section 5 shows an example illustrating this model. Section 6 gives the evaluation results and shows the effectiveness of the model. Section 7 concludes the paper.

II. STATE OF THE ART

The aim of aggregated search is to assemble information from diverse sources in order to construct responses including all the information relevant to the query. This comes in contrast with the common search paradigm, where users are provided with a list of information sources, which they have to examine in turn to find relevant content. It is well known that, in the context of Web search, users typically access a small number of documents [12]. A study of Web users [24] showed that the percentage of users who consult few documents (Web pages) per query increases with time. For example, from 1997 to 2001, the percentage of users looking at a single document per query went from 28.6% to 50.5%, and it increased further to over 70% after 2001 [25]. This suggests that the consultation of a list of documents is mainly confined to documents at the first, second and sometimes (at most) third rank. The study reported in [11] showed that out of 10 documents displayed, 60% of users looked at fewer than 5 documents and nearly 30% read a single document. Aggregated search brings solutions to this problem. Indeed, its aim is to integrate other types of documents (web documents, images, videos, maps, news, books) in the results page. Examples of search engines that are beginning to perform such aggregation include Google Universal Search, Yahoo! Alpha, etc.
Users thus have access to various document types. This can be beneficial for certain queries: for example, "trip to Finland" can return maps, blogs, weather, etc. Another technique that can be used to improve the search results page is clustering. However, it is not enough simply to return the clusters; it is important to provide users with an overview of the contents of the documents forming a cluster [25]. A common approach to providing such an overview is a summary of all the documents of the cluster (a multi-document summary). Examples of systems based on this technique include NewsInEssence [20] and QCS [8].

The issue of aggregating elements from a collection of XML documents is not addressed in the literature. Indeed, the proposed approaches that address this issue are limited to Web documents [6] [1]. However, a few information retrieval systems are beginning to aggregate the results of a query over XML documents in the form of summaries. For example, eXtract [10] is an information retrieval system that generates results as XML fragments. An XML fragment is considered a result if it satisfies four features: it is autonomous (understandable by the user), distinct (different from the other fragments), representative (of the themes of the query) and succinct. XCLUSTERs [19] is a model for representing XML abstracts. It includes some XML elements and uses a small amount of space to store data. The objective is to provide meaningful excerpts that allow users to easily evaluate the relevance of query results.

The approach we propose in this paper lies at the junction between the retrieval of relevant elements and their grouping (aggregation) into the same result. Our approach is based on possibilistic theory [26] [7] [4] and more particularly on possibilistic networks [2] [3]. These networks offer a simple and natural model for representing the hierarchical structure of XML documents and for handling the uncertainty inherent in information retrieval.
We find this uncertainty in the notion of document relevance with respect to a query, in the degree of representativeness of a term in a document or part of a document, and in the identification of the relevant part that answers the query. Within this framework, and unlike the approaches suggested in the literature, which select the sub-tree likely to be relevant, our approach identifies and selects, in a natural way, an aggregation of non-redundant elements of the XML document that may answer the query. Besides the points mentioned above, the theoretical framework that supports our proposals, namely possibilistic networks, clearly differentiates our work from the settings used in previous approaches.

III. THE POSSIBILISTIC THEORY

Possibilistic logic [26] enables modeling and quantifying the relevance of a document with respect to a query through two measures: necessity and possibility. The necessarily relevant elements are those that must appear at the top of the list of selected elements and must ensure system efficiency. The possibly relevant elements are those that could eventually answer the user query.

A. Possibility Distribution

A possibility distribution, denoted by π, is a mapping from X to the scale [0, 1] encoding our knowledge of the real world. π(x) evaluates to what extent x is the actual value of some variable to which π is attached. π(x) = 1 means that it is completely possible that x is the real world (or that x is completely satisfactory), 1 > π(x) > 0 means that x is somewhat possible (or satisfactory), and finally π(x) = 0 means that x is certainly not the real world (or is completely unsatisfactory). Saying that an event is not possible does not merely mean that the opposite event is possible; it actually means that the opposite event is certain. Two dual measures are used: the possibility measure Π(A) and the necessity measure N(A). The possibility of an event A, denoted Π(A), is obtained by Π(A) = max x∈A π(x) and describes the most normal situation in which A is true. The necessity N(A) = min x∉A (1 − π(x)) = 1 − Π(¬A) of an event A reflects the most normal situation in which A is false.

B. Product-based Conditioning

In the possibilistic setting, product-based conditioning consists of modifying our initial knowledge, encoded by the possibility distribution π, upon the arrival of a new, fully certain piece of information e. Let Φ = [e] be the set of models of e. The initial distribution π is then replaced by another one π′, such that π′ = π(·/Φ). Assuming that Φ ≠ Ø and that Π(Φ) > 0, the natural postulate for possibilistic conditioning is:

π(w /p Φ) = π(w) / Π(Φ) if w ∈ Φ, and 0 otherwise   (1)

where /p denotes product-based conditioning.

C. Possibilistic Networks

A directed possibilistic network over a set of variables V = {V1, V2, …, Vn} is characterized by:
- A graphical component composed of a Directed Acyclic Graph (DAG). The DAG structure encodes a set of independence relations between variables.
- A numerical component consisting of a quantification of the different links in the DAG using the conditional possibilities of each node in the context of its parents.

Such conditional distributions should respect the following normalization constraints for each variable Vi of the set V. Let UVi be the set of parents of Vi:
- If UVi = Ø (i.e. Vi is a root), then the a priori possibility relative to Vi should satisfy: max ai∈DVi Π(ai) = 1.
- If UVi ≠ Ø (i.e. Vi is not a root), then the conditional distribution of Vi in the context of its parents should satisfy: max ai∈DVi Π(ai /ParVi) = 1, where ParVi ranges over the possible configurations of the parents of Vi.

Using the definition of conditioning based on the product operator leads to the following definition of a product-based possibilistic network.

D. Product-based Possibilistic Network

A product-based possibilistic network over a set of variables V = {A1, A2, …, AN} is a possibilistic graph whose conditionals are defined using product-based conditioning (1). Product-based possibilistic networks are appropriate for a numerical interpretation of the possibilistic scale. The possibility distribution of the product-based possibilistic network, denoted by ΠP, is obtained by the following product-based chain rule [2]:

ΠP(A1, ..., AN) = Prod i=1..N Π(Ai /PARAi)   (2)

where 'Prod' is the product operator and PARAi denotes the configuration of the parents of Ai.

IV. THE AGGREGATED SEARCH MODEL

A. Model Architecture

The architecture of the proposed model is illustrated in Fig. 1. The graph contains the document node, the element nodes (elements of the XML document) and the index term nodes. The links between the nodes represent the dependence relations between the various nodes.

Figure 1. Model architecture.

Document nodes represent the documents of the collection. Each document node Di represents a binary random variable taking values in the set dom(Di) = {di, ¬di}, where the value Di = di (resp. ¬di) means that document Di is relevant (resp. non-relevant) for a given query.

Nodes E1, E2, …, En represent the elements of document Di. Each node Ei represents a binary random variable taking values in the set dom(Ei) = {ei, ¬ei}. The value Ei = ei (resp. ¬ei) means that the element Ei is relevant (resp. non-relevant) for the query.

Nodes T1, T2, …, Tm are the term nodes. Each term node Ti represents a binary random variable taking values in the set dom(Ti) = {ti, ¬ti}, where the value Ti = ti (resp. ¬ti) means that term Ti is representative (resp. non-representative) of the parent node to which it is attached. It should be noticed that a term is connected to the node that includes it as well as to all of its ancestors.

The passage from the document to its representation as a possibilistic network is done in a simple way. All element nodes form the level of variables Ei. The values assigned to the dependence arcs between term-element nodes and element-document nodes depend on the meaning given to these links. Each structural variable Ei, Ei ∈ E = {E1, E2, …, En}, depends directly on its parent node, which is the root Di of the possibilistic network of the document. Each content variable Ti, Ti ∈ T = {T1, …, Tm}, depends only on its structural variable (structural element or tag). It should also be noticed that the representation considers only one document: the documents are considered independent from each other, so we can restrict attention to the sub-network representing the document being processed. We denote by T(E) (resp. T(Q)) the set of index terms of the elements of the document (resp. of the query).

The arcs are oriented and are of two types:
- Term-element links. These links connect each term node Ti ∈ T(E) to each node Ei in which it appears.
- Element-document links. These links connect each element node of the set E to the document that includes it, in our case Di.

We discuss the interpretation we give to these various links, and the way we quantify them, in what follows.
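To make the architecture above concrete, the following minimal Python sketch (illustrative only, not the authors' implementation; all identifiers and numbers are invented) stores the two kinds of conditional possibilities of a one-document sub-network and combines them with the product-based chain rule (2).

```python
# Illustrative sketch: one-document possibilistic sub-network with a root D,
# element nodes E and term nodes T, combined with the chain rule (2).

# Π(Ej = value | Di = value); 'e'/'not_e' stand for ej / ¬ej, 'd'/'not_d' for di / ¬di.
elem_given_doc = {
    ("E1", "e", "d"): 1.0, ("E1", "not_e", "d"): 1.0,
    ("E1", "e", "not_d"): 0.02, ("E1", "not_e", "not_d"): 1.0,
}

# Π(Ti = value | Ej = value); 't'/'not_t' stand for ti / ¬ti.
term_given_elem = {
    ("T1", "t", "E1", "e"): 1.0, ("T1", "not_t", "E1", "e"): 1.0,
    ("T1", "t", "E1", "not_e"): 0.0, ("T1", "not_t", "E1", "not_e"): 1.0,
}

def joint_possibility(doc_value, elem_values, term_values):
    """Chain rule (2): product of each node's conditional possibility given its parent."""
    degree = 1.0  # Assumption of a non-informative root prior: Π(di) = Π(¬di) = 1
    for elem, e_val in elem_values.items():
        degree *= elem_given_doc[(elem, e_val, doc_value)]
    for (term, elem), t_val in term_values.items():
        degree *= term_given_elem[(term, t_val, elem, elem_values[elem])]
    return degree

# Example: possibility of the configuration (Di = d, E1 = e, T1 = t).
print(joint_possibility("d", {"E1": "e"}, {("T1", "E1"): "t"}))  # 1.0
```

The query evaluation described next maximizes such joint degrees over the possible aggregations of elements.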
B. Query Evaluation

As underlined previously, we model relevance along two dimensions: the necessity and the possibility of relevance. Our model must be able to infer propositions of the following types:
- "the document di is relevant for the query Q" is possible to a certain degree (or not), quantified by Π(Q/di);
- "the document di is relevant for the query Q" is certain (or not), quantified by N(Q/di).

The first type of proposition allows eliminating the non-relevant documents, i.e. those that have a weak possibility. The second focuses attention on those that seem highly relevant. For the model presented here, we adopt the following assumptions:

Assumption 1: A document has as much possibility of being relevant as of being non-relevant for a given user, i.e. Π(di) = Π(¬di) = 1, ∀i.

Assumption 2: The query is composed of a simple list of keywords Q = {t1, t2, …, tn}. The relative importance between terms in the query is ignored.

According to the definitions of possibilistic theory, the quantity Π(Q/di) is calculated as follows:

Π(Q/di) = max ∀θe∈θE [ Prod Ej∈θe ( Prod ti∈T(E)∧T(Q) Π(ti /θej) ) * Prod Ej∈θe Π(θej /di) * Π(di) ]   (3)

where:
- Prod denotes the product (we use this symbol instead of ∏ so as not to confuse it with the symbol designating possibility);
- ti ∈ T(E) ∧ T(Q) represents the query terms that index the elements of the XML document;
- θe is an aggregation (set) of non-redundant elements, and θE is the set of such aggregations;
- θej represents the value of Ej in the aggregation θe (for example, the value of E1 in the aggregation (e1, e2) is e1).

The necessity degree is obtained by duality, as in Section III.A: N(Q/di) = 1 − Π(Q/¬di).

The selection of the relevant parts (units of information) is inherent to the model. Indeed, (3) calculates the relevance by considering all possible aggregations (groupings) of elements; the factor θe ranges over the possible values of the elements. The aggregation of elements that will be selected is the one that necessarily includes the query terms and presents the best relevance (maximum relevance) in terms of necessity and/or possibility. As mentioned in the introduction, our model is able to select the best aggregation of elements that are likely to be relevant to the query. This aggregation is the one that maximizes the necessity if it exists, or otherwise the possibility. It is obtained by:

θ* = arg max ∀θe∈θE Π(Q/di)   (4)
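Equation (3) can be read as a brute-force maximization over all aggregations θe. A small sketch under that reading follows; the function and dictionary layout are hypothetical, not the authors' prototype, and the way the inner product ranges only over query terms that index each element is our interpretation of (3).

```python
# Illustrative brute-force evaluation of (3) and (4); all structures are assumptions.
from itertools import product

def evaluate_query(elements, query_terms, pi_term_elem, pi_elem_doc, pi_doc=1.0):
    """Enumerate every aggregation θe (each element Ej set to ej or ¬ej) and keep the
    maximum score.  pi_term_elem[(term, elem)] = (Π(t/ej), Π(t/¬ej)) for terms indexing
    elem; pi_elem_doc[elem] = (Π(ej/di), Π(¬ej/di))."""
    best_score, best_aggregation = 0.0, None
    for assignment in product([True, False], repeat=len(elements)):
        theta = dict(zip(elements, assignment))           # θe: value chosen for each Ej
        score = pi_doc                                    # Π(di), = 1 under Assumption 1
        for elem, present in theta.items():
            score *= pi_elem_doc[elem][0 if present else 1]
            for term in query_terms:
                if (term, elem) in pi_term_elem:          # ti ∈ T(E) ∧ T(Q) for this element
                    score *= pi_term_elem[(term, elem)][0 if present else 1]
        if score > best_score:
            best_score, best_aggregation = score, theta
    return best_score, best_aggregation                   # Π(Q/di) as in (3), θ* as in (4)
```

Calling the function once with the degrees conditioned on di and once with those conditioned on ¬di yields Π(Q/di) and Π(Q/¬di), from which N(Q/di) = 1 − Π(Q/¬di).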
The various degrees Π and N between the nodes of the network are calculated as follows.

1) Possibility Distribution Π(ti /ej)

In information retrieval, the terms used to represent the content of a document are weighted in order to better characterize this content. The same principle is used in XML retrieval. The weights are generally calculated using the term frequency (tf) within a document or the inverse document frequency (idf) in the collection. It has also been shown [21] [22] that the performance of the system can be improved if an element is represented by considering its own content together with the content of its child nodes. In our model, we distinguish the terms possibly representative of the elements of the document from those necessarily representative of these elements (terms that are sufficient to characterize the elements). With this intention, the possibility that a term ti represents an element ej, noted Π(ti /ej), is calculated as follows:

Π(ti /ej) = tfij / max ∀tk∈ej (tfkj)   (5)

where tfij represents the frequency of the term ti in the element ej. A term having a degree of possibility 0 is not representative of the element. If the degree of possibility is strictly higher than 0, then the term is possibly representative of the element. If it appears with the maximum degree of possibility, then it is considered the best candidate for the representation, and thus the restitution, of the element. Let us note that max (Π(ti /ej)) = 1, ∃ti ∈ ej.

In an XML document, a necessarily representative term of an element is a term that contributes to its restitution in response to a query. Such a term is called a discriminative term; it is a term that appears frequently in few elements of the XML document [5]. The factor commonly used in IR to quantify the discriminative power of a term is idf (ief in XML IR). Therefore, a degree of necessary relevance, βij, of the term ti to represent the element ej is defined by:

N(ti → ej) ≥ βij = μ( tfij * iefij * idf ) = μ( tfij * log(Ne /(nei + 1)) * log(N /(ni + 1)) )   (6)

where:
- N and Ne are respectively the number of documents and of elements in the collection;
- ni and nei are respectively the number of documents and the number of elements containing the term ti;
- μ is a normalization function (a simple way to normalize is to divide by the maximal value of the factor);
- tfij represents the frequency of the term ti in the element ej;
- iefij represents the inverse frequency of the element ej for the term ti;
- idf represents the inverse document frequency.

It should be noticed that (6) was chosen according to experiments undertaken by Sauvagnat [22]. This degree of necessary relevance limits the possibility that the term is compatible with the rejection of the element: Π(ti /¬ej) ≤ 1 − βij (this is deduced by definition in possibilistic theory). We summarize the possibility distribution on the Cartesian product {ej, ¬ej} x {ti, ¬ti} in the following table:

TABLE I. POSSIBILITY DISTRIBUTION ON THE SET OF TERMS T
  Π     | ti                        | ¬ti
  ej    | tfij / max ∀tk∈ej (tfkj)  | 1
  ¬ej   | 1 − βij                   | 1

2) Possibility Distribution Π(ej /di)

The document-element arc (or root-element arc) indicates the interest of propagating information from an element towards the document (root) node. The nodes close to the root (of a tree) carry more information for the root than those located lower in the tree [22]. Thus it seems intuitive that the farther an element is from the root, the less relevant it is. We model this intuition by using, in the propagation function, the parameter dist(root, e), which represents the distance between the root and one of its descendant nodes (elements) e in the hierarchical tree of the document, i.e. the number of arcs separating the two nodes. The degree of possibility of propagating the relevance of an element ej towards the document node di is defined by Π(ej /di) and is quantified as follows:

Π(ej /di) = α^(dist(di, ej) − 1)   (7)

where:
- dist(di, ej) is the distance from the element ej to the root di according to the hierarchical structure of the document;
- α ∈ ]0..1] is a parameter quantifying the importance of the distance separating the element nodes (structural elements of the document) from the root of the document.

Concerning the necessity of propagation, one can intuitively think that the designer of a document uses nodes of small size to highlight important information. These nodes can thus give precious indications on the relevance of their ancestor nodes. A title node in a section, for example, allows locating with precision the subject of its ancestor section node. It is thus necessary to propagate the signal calculated at the level of the node towards the root node. To capture this intuition, we propose to calculate the necessity of propagating the relevance of an element ej towards the root node di, denoted N(ej → di), as follows:

N(ej → di) = 1 − lej /dl   (8)

where lej is the size of the element node ej and dl the size of the document (in number of terms). According to this equation, the smaller an element is, the greater the necessity to propagate it. Therefore, Π(ej /¬di) = lej /dl. We summarize the possibility distribution on the Cartesian product {di, ¬di} x {ej, ¬ej} in the following table:

TABLE II. POSSIBILITY DISTRIBUTION ON THE SET OF ELEMENTS E
  Π     | di                      | ¬di
  ej    | α^(dist(di, ej) − 1)    | lej /dl
  ¬ej   | 1                       | 1

V. ILLUSTRATIVE EXAMPLE

An example of an XML document (an extract of a document) related to a book will be used to illustrate our approach. The XML document is as follows:

<Book>
  <Title> Information Retrieval </Title>
  <Abstract> In front of the increasing mass of information …</Abstract>
  ….
  <Chapter>
    <TitleChapter> Indexing </TitleChapter>
    <Paragraph> The indexing is the process intended to represent by the elements of a documentary or natural language of … </Paragraph>
  </Chapter>
</Book>

The possibilistic network associated with the XML document 'Book' is shown in Fig. 2.

Figure 2. Possibilistic network of the XML document 'Book'.

For this example, the set of elements is E = {e1=Title, e2=Abstract, e3=Chapter, e4=TitleChapter, e5=Paragraph}. The set of indexing terms of the elements, computed from the content of each element together with that of its child elements in the document, is T(E) = {t1=Retrieval, t2=Information, t3=System, t4=Indexing, t5=XML}. We consider only some terms so as not to clutter the example. The table containing the values of the element-node/term-node arcs of the possibilistic network of the document 'Book' is given in Table III. We recall that a term is connected to the node that includes it as well as to all the ancestors of this node.

TABLE III. POSSIBILITY DISTRIBUTION Π(ti /ej)
  Π(ti/ej) |  t1   |  t2  |  t3  |  t4   |  t5
  e1       |  1    |  1   |  0   |  0    |  0
  ¬e1      |  0    |  0   |  1   |  1    |  1
  e2       |  0.5  |  1   |  1   |  0.25 |  0
  ¬e2      |  0.5  |  0   |  0   |  0.88 |  1
  e3       |  0    |  0   |  0   |  0.70 |  0.5
  ¬e3      |  1    |  1   |  1   |  0.10 |  0.2
  e4       |  0    |  0   |  0   |  1    |  0
  ¬e4      |  1    |  1   |  1   |  0    |  1
  e5       |  0    |  0   |  0   |  0.88 |  1
  ¬e5      |  1    |  1   |  1   |  0.05 |  0

The table containing the values of the root-element arcs of the possibilistic network of the document 'Book' is given in Table IV (we take α = 0.6 and dl = 100).

TABLE IV. POSSIBILITY DISTRIBUTION Π(ej /di)
  Π(ej/di) |  di = book  |  di = ¬book
  e1       |  1          |  0.02
  ¬e1      |  1          |  1
  e2       |  1          |  0.1
  ¬e2      |  1          |  1
  e3       |  1          |  1
  ¬e3      |  1          |  1
  e4       |  0.6        |  0.01
  ¬e4      |  1          |  1
  e5       |  0.6        |  1
  ¬e5      |  1          |  1
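The entries of Tables III and IV are instances of (5)-(8). The following minimal sketch shows how such local degrees could be computed, assuming raw term frequencies, element sizes and collection counts are available; the helper names are ours, and μ is left as the identity for illustration.

```python
# Illustrative helpers for the local degrees (5)-(8); not the authors' code.
import math

def pi_term_given_elem(tf, max_tf):
    """(5): Π(ti/ej) = tf_ij / max over tk in ej of tf_kj."""
    return tf / max_tf

def beta_necessity(tf, n_docs_total, n_elems_total, n_docs_with_t, n_elems_with_t,
                   mu=lambda x: x):
    """(6): beta_ij = mu( tf_ij * ief_ij * idf ), with
    ief = log(Ne/(ne_i+1)) and idf = log(N/(n_i+1)); mu is a normalization function."""
    ief = math.log(n_elems_total / (n_elems_with_t + 1))
    idf = math.log(n_docs_total / (n_docs_with_t + 1))
    return mu(tf * ief * idf)

def pi_elem_given_doc(dist, alpha=0.6):
    """(7): Π(ej/di) = alpha^(dist(di, ej) - 1)."""
    return alpha ** (dist - 1)

def necessity_elem_to_doc(elem_size, doc_size):
    """(8): N(ej -> di) = 1 - |ej|/dl, hence Π(ej/¬di) = |ej|/dl."""
    return 1.0 - elem_size / doc_size

# Checks against the 'Book' example of Fig. 2 (alpha = 0.6, dl = 100):
print(pi_elem_given_doc(dist=2))                                      # 0.6, as for e4 and e5
print(round(1 - necessity_elem_to_doc(elem_size=2, doc_size=100), 2))  # 0.02 = Π(Title/¬book)
```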
When a query is submitted, a propagation process is started through the network, modifying the a priori possibility values. In this model, the propagation equation used is (3).

Let us take a query Q composed of the keywords "Retrieval" and "Information", Q = {Retrieval, Information}. According to Assumption 1, Π(di) = Π(¬di) = 1, ∀i. Given the query Q, the propagation process (3) considers only the aggregates of the set E that include the query terms t1 = 'Retrieval' and t2 = 'Information'. In fact, only the elements e1 = 'Title' and e2 = 'Abstract' will be considered. The aggregations that must therefore be considered are: {(e1, e2), (e1, ¬e2), (¬e1, e2), (¬e1, ¬e2)}. We then calculate:

For di = book:
a1 = Π(t1/e1) . Π(t2/e1) . Π(t2/e2) . Π(e1/book) . Π(e2/book) = 1 * 1 * 1 * 1 * 1 = 1
a2 = Π(t1/e1) . Π(t2/e1) . Π(t2/¬e2) . Π(e1/book) . Π(¬e2/book) = 1 * 1 * 0 * 1 * 1 = 0
a3 = Π(t1/¬e1) . Π(t2/¬e1) . Π(t2/e2) . Π(¬e1/book) . Π(e2/book) = 0 * 0 * 1 * 1 * 1 = 0
a4 = Π(t1/¬e1) . Π(t2/¬e1) . Π(t2/¬e2) . Π(¬e1/book) . Π(¬e2/book) = 0 * 0 * 0 * 1 * 1 = 0
According to (3): Π(Q/book) = max(a1, a2, a3, a4) = 1 = a1

For ¬di = ¬book:
a5 = Π(t1/e1) . Π(t2/e1) . Π(t2/e2) . Π(e1/¬book) . Π(e2/¬book) = 1 * 1 * 1 * 0.02 * 0.1 = 0.002
a6 = Π(t1/e1) . Π(t2/e1) . Π(t2/¬e2) . Π(e1/¬book) . Π(¬e2/¬book) = 1 * 1 * 0 * 0.02 * 1 = 0
a7 = Π(t1/¬e1) . Π(t2/¬e1) . Π(t2/e2) . Π(¬e1/¬book) . Π(e2/¬book) = 0 * 0 * 1 * 1 * 0.1 = 0
a8 = Π(t1/¬e1) . Π(t2/¬e1) . Π(t2/¬e2) . Π(¬e1/¬book) . Π(¬e2/¬book) = 0 * 0 * 0 * 1 * 1 = 0
According to (3): Π(Q/¬book) = max(a5, a6, a7, a8) = 0.002 = a5

The necessity N(Q/book) = 1 − Π(Q/¬book) = 1 − 0.002 = 0.998.
The necessity N(Q/¬book) = 1 − Π(Q/book) = 1 − 1 = 0.

The preferred documents are those that have a high value of N(Q/di) among those that also have a high value of Π(Q/di). If N(Q/di) = 0, the returned documents are (without any guarantee of total adequacy) those that have a high value of Π(Q/di). Therefore, for the query Q = {Retrieval, Information}, it is the aggregation "a1" (Title, Abstract) that is returned to the user as the answer to the query.

VI. EXPERIMENTS AND RESULTS

A. Goals

All studies performed to assess aggregated search have been based on user studies [14]. Our user study is designed with two major goals:
- Evaluate the aptitude of an aggregate of XML elements to answer user queries.
- Identify some of the advantages of aggregated search in XML documents.

B. Results

To evaluate our model, a prototype was developed. Our experiments were conducted on a sample of about 2000 XML documents from the INEX 2005 collection, with a set of 20 queries from the same collection. Every query was assessed by exactly 3 users. The histogram in Fig. 3 gives the judgments of users, per query, regarding aggregate relevance.

Figure 3. Distribution of aggregation relevance results.

The experimental evaluation shows that aggregated search makes a significant contribution to XML information retrieval. Indeed, the aggregate gathers non-redundant elements (parts of the XML document). These elements can be semantically complementary; in this case the aggregate improves the interpretation of results, guides the user to the relevant elements of the XML document faster, and also reduces the effort the user must make to locate the information searched for. However, in some cases the elements of the aggregate may be non-complementary, that is, not semantically related with respect to the information need expressed by the user's query. This sort of aggregation is also very useful because it allows a very fine distinction of the different themes expressed in the user's query when the information need is generic. It also helps inform the user about the various pieces of information in the corpus related to his information need, and thus helps him, if necessary, to reformulate his query.
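The propagation of Section V can be checked with a few lines of Python that reproduce a1-a8 from the degrees of Tables III and IV (an illustrative check, not the prototype used in the experiments above).

```python
# Reproduces the Section V example for Q = {Retrieval, Information},
# using the degrees of Tables III and IV.
pi_t_e = {("t1", "e1"): 1, ("t2", "e1"): 1, ("t2", "e2"): 1,
          ("t1", "not_e1"): 0, ("t2", "not_e1"): 0, ("t2", "not_e2"): 0}
pi_e_d = {("e1", "book"): 1, ("e2", "book"): 1,
          ("not_e1", "book"): 1, ("not_e2", "book"): 1,
          ("e1", "not_book"): 0.02, ("e2", "not_book"): 0.1,
          ("not_e1", "not_book"): 1, ("not_e2", "not_book"): 1}

def score(e1, e2, doc):
    """One term of the max in (3) for the aggregation (E1 = e1, E2 = e2)."""
    return (pi_t_e[("t1", e1)] * pi_t_e[("t2", e1)] * pi_t_e[("t2", e2)]
            * pi_e_d[(e1, doc)] * pi_e_d[(e2, doc)])

aggregations = [("e1", "e2"), ("e1", "not_e2"), ("not_e1", "e2"), ("not_e1", "not_e2")]
pi_q_book = max(score(a, b, "book") for a, b in aggregations)          # a1 = 1
pi_q_not_book = max(score(a, b, "not_book") for a, b in aggregations)  # a5 = 0.002
print(pi_q_book, round(pi_q_not_book, 3), round(1 - pi_q_not_book, 3))  # 1 0.002 0.998
```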
VII. CONCLUSION

This paper presents a new approach for aggregated search based on possibilistic networks. The model provides a formal setting for aggregating non-redundant elements into the same unit. It also directs the user more quickly toward the relevant elements of the XML document. The user study is built around two main goals. First, we analyze the distribution of relevant results across the different elements of the XML document. Second, we identify some of the advantages of aggregated search. The user study was used to collect supporting data for these goals. The analysis of the distribution of relevant results provides interesting information: we notice that relevant information is spread sparsely across many elements of the XML document. Our analysis then focuses on specific advantages of aggregated search. It is shown that aggregated search is useful for identifying different interpretations of a query and helps find different aspects of the same information need. Thus, it seems very important to identify other evaluation criteria in order to identify all the benefits of aggregated search in XML documents.

REFERENCES

[1] R. Agrawal, S. Gollapudi, A. Halverson, "Diversifying search results", ACM Int. Conference on WSDM, 2009.
[2] N. Ben Amor, "Qualitative possibilistic graphical models: from independence to propagation algorithms", Thèse pour l'obtention du titre de Docteur en Gestion, Université de Tunis, 2002.
[3] S. Benferhat, D. Dubois, L. Garcia and H. Prade, "Possibilistic logic bases and possibilistic graphs", In Proc. of the 15th Conference on Uncertainty in Artificial Intelligence, pp. 57-64, 1999.
[4] C. Borgelt, J. Gebhardt and R. Kruse, "Possibilistic graphical models", Computational Intelligence in Data Mining, CISM Courses and Lectures 408, Springer, Wien, pp. 51-68, 2000.
[5] A. Brini, M. Boughanem and D. Dubois, "A model for information retrieval based on possibilistic networks", SPIRE'05, Buenos Aires, LNCS, Springer Verlag, pp. 271-282, 2005.
[6] C.L. Clarke, M. Kolla, G.V. Cormack and O. Vechtomova, "Novelty and diversity in information retrieval evaluation", SIGIR'08, pp. 659-666, 2008.
[7] D. Dubois and H. Prade, "Possibility theory", Plenum, 1988.
[8] D. M. Dunlavy, D. P. O'Leary, J. M. Conroy and J. D. Schlesinger, "QCS: A system for querying, clustering and summarizing documents", Information Processing and Management, pp. 1588-1605, 2007.
[9] N. Fuhr, M. Lalmas, S. Malik and Z. Szlavik, "Advances in XML information retrieval: INEX 2004", Dagstuhl Castle, Germany, 2004.
[10] Y. Huang, Z. Liu and Y. Chen, "Query biased snippet generation in XML search", ACM SIGMOD, pp. 315-326, 2008.
[11] B. J. Jansen and A. Spink, "An analysis of document viewing patterns of web search engine users", Web Mining: Applications and Techniques, pp. 339-354, 2005.
[12] B. J. Jansen and A. Spink, "How are we searching the world wide web?: a comparison of nine search engine transaction logs", Information Processing and Management, pp. 248-263, 2006.
[13] J. Kamps, M. Marx, M. De Rijke and B. Sigurbjörnsson, "XML retrieval: What to retrieve?", ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 409-410, 2003.
[14] A. Kopliku, F. Damak, K. Pinel-Sauvagnat and M. Boughanem, "A user study to evaluate aggregated search", In IEEE/WIC/ACM International Conference on Web Intelligence, in press.
[15] M. Lalmas, "Dempster-Shafer's theory of evidence applied to structured documents: modelling uncertainty", In Proceedings of the 20th Annual International ACM SIGIR, pp. 110-118, Philadelphia, PA, USA, ACM, 1997.
[16] M. Lalmas and P. Vannoorenberghe, "Indexation et recherche de documents XML par les fonctions de croyance", CORIA'2004, pp. 143-160, 2004.
[17] P. Ogilvie and J. Callan, "Using language models for flat text queries in XML retrieval", In Proceedings of INEX 2003 Workshop, Dagstuhl, Germany, pp. 12-18, December 2003.
[18] B. Piwowarski, G.E. Faure and P. Gallinari, "Bayesian networks and INEX", In INEX 2002 Workshop Proceedings, pp. 149-153, Germany, 2002.
[19] N. Polyzotis and M. N. Garofalakis, "XCluster synopses for structured XML content", ICDE, p. 63, 2006.
[20] D. Radev, J. Otterbacher, A. Winkel and S. Blair-Goldensohn, "NewsInEssence: summarizing online news topics", Communications of the Association for Computing Machinery, pp. 95-98, 2005.
[21] T. Rölleke, M. Lalmas, G. Kazai, I. Ruthven and S. Quicker, "The accessibility dimension for structured document retrieval", BCS-IRSG European Conference on Information Retrieval (ECIR), Glasgow, March 2002.
[22] K. Sauvagnat, "Modèle flexible pour la recherche d'information dans des corpus de documents semi-structurés", Thèse de Doctorat de l'Université Paul Sabatier, Juillet 2005.
[23] B. Sigurbjornsson, J. Kamps and M. de Rijke, "An element-based approach to XML retrieval", INEX 2003 Workshop, Dagstuhl, Germany, December 2003.
[24] A. Spink, B. J. Jansen, D. Wolfram and T. Saracevic, "From e-sex to e-commerce: web search changes", IEEE Computer Science, vol. 35, pp. 107-109, 2002.
[25] S. Sushmita and M. Lalmas, "Using digest pages to increase user result space: preliminary designs", Special Interest Group on Information Retrieval 2008 Workshop on Aggregated Search, 2008.
[26] L. A. Zadeh, "Fuzzy sets as a basis for a theory of possibility", In Fuzzy Sets and Systems, 1:3-28, 1978.

Fatma Zohra Bessai-Mechmache, Algiers, Algeria. She obtained her Engineer Degree in Computer Science from Houari Boumediene University (USTHB) in Algeria and her Magister from the Research Centre of Advanced Technologies in Algeria. She has been a member of the scientific and research staff of the Research Centre on Scientific and Technical Information of Algeria (CERIST) and, since 2007, she has been the head of the Databases Team at CERIST. Her research interests include database security and information retrieval. She is particularly interested in XML information retrieval, aggregated search and mobile information retrieval.

Zaia Alimazighi, Algiers, Algeria. She obtained a Doctorate in Computer Science in 1976 at Paris VI University and a PhD in Information Systems at USTHB (Algiers' university) in 1999. After 1976, and for more than ten years, she was a project leader on several industrial projects in public companies in Algeria. She has been a researcher at USTHB since 1988. Today she is a full professor in the Computer Science Department of the Faculty of Electronics and Computer Science of USTHB and Dean of this faculty. She is a team manager in the Computer Engineering Laboratory at USTHB. Her current research interests include information systems, collaborative work, data warehouses, Web service development and database modeling.
The Developing Economies' Banks Branches Operational Strategy in the Era of E-Banking: The Case of Jordan

Yazan Khalid Abed-Allah Migdadi
Assistant Professor of Operations Management, Business Administration Department, Yarmouk University, Irbid, Jordan
E-mail: Ymigdadi@yahoo.com

Abstract—The aim of this study is to identify the impact of adopting e-banking on branches' operations strategy during the period 1999 to 2008. Fifteen local banks in Jordan were surveyed using three questionnaires: one directed to account operations officers, another to tellers, and the last to branch managers. Annual reports and bank websites were reviewed to identify the changes in performance indicators and in e-banking adoption. The study revealed that branches are still the main channels for conducting banking transactions, and that e-banking works in parallel with branches. This is the first paper to report on this issue in developing economies.

Index Terms—E-banking, Branches, Operations Strategy, Jordan

I. INTRODUCTION

Banks in developing economies have moved toward adopting e-banking channels during the last decade. Banks in Jordan have adopted many e-banking channels: about 18 out of 23 banks adopted internet banking, 8 banks adopted telephone banking, 15 adopted mobile banking, and 6 banks adopted ATMs. This adoption raises two questions: what is the impact of adopting e-banking on branch operations, and to what extent have these changes affected the performance of banks? Answering these questions helps managers in developing economies to identify the effective role of branches in the era of e-banking. Despite the significant adoption of e-banking worldwide, only a limited number of previous studies have reported on this issue in developed economies, such as [40, 25], and there is no evidence about this issue in developing economies, so this study bridges the gap by reporting on the issue in Jordan. Accordingly, the objectives of this study are:
1. Identifying the degree of change in adopting e-banking in Jordan during the period 1999 to 2008.
2. Identifying the degree of change in branches' operations strategy actions and capabilities in Jordan during the period 1999 to 2008.
3. Identifying the degree of change in banking performance indicators during the period 1999 to 2008.
4. Identifying the impact of significant adoption of e-banking on branch operations and performance in Jordan during 1999 to 2008.

II. OVERVIEW OF THE BANKING SECTOR IN JORDAN

Jordan, officially known as the Hashemite Kingdom of Jordan, is a small country located in the Middle East; it shares borders with Iraq, Saudi Arabia, and Syria. The Jordanian economy is service oriented, and the banking sector plays a dynamic role in it, accounting for 56.4% of the total capital invested in the Amman Stock Exchange [2]. Furthermore, the banking sector of Jordan is a leader in comparison with its counterparts in other countries of the Middle East and North Africa region [15]. The number of licensed banks in Jordan is 23 (15 local and 8 foreign), and the banking sector is controlled by the Central Bank of Jordan [3]. During the last decade the banking sector in Jordan was reformed: a new banking law was launched by mid-2001, an electronic banking transaction law was launched in 2002, and the banks were directed toward compliance with the Basel II Accord by the end of 2008 [24]. The banks in Jordan provide personal and corporate services, and the banks' branches cover the majority of the Kingdom's regions. During the last decade many banks in Jordan adopted more e-banking services: more ATM kiosks have been launched, 23 banks have launched internet banking [31], 13 banks provide telephone banking, and 15 banks have launched mobile banking.

III. LITERATURE REVIEW

The purpose of the literature review is to identify the branch operational issues that are affected by e-banking and, further, to identify the banking performance indicators to be used in evaluating the branches' operations strategy. A limited number of previous studies have traced the impact of e-banking on branch operations in developed economies, such as [40, 25]; however, no studies have examined this issue in developing economies.

Identifying the Operations Actions and Capabilities Affected by E-banking

A- Branches Location and Accessibility

At the beginning of the 1980s, the international trend was toward decreasing the number of branches as a result of investing in alternative service delivery channels such as automatic teller machines, which reduced operating costs; for example, the Bank of America closed one third of its overall branch network while increasing its automatic banking machine network, and Barclays of the UK closed 100 of its 3,000 full-service branches [8]. Generally speaking, the number of bank branches operating in the UK (excluding Northern Ireland) declined by over 11%, from 37,761 in 1983 down to 33,511 in 1993, and building society branches similarly declined by over 9%, from 6,480 in 1983 down to 5,876 in 1993 [17]. Despite the move of banks in recent years toward remote-access distribution channels, and the expanded use of ATMs over the last two decades to deal with their customers, commercial bank branch networks have been expanded in developed countries such as the USA [19, 27, 18]; accordingly, banks in the era of e-banking will be more concerned about convenient site location than in the past [5]. The period from 2001 to 2006 brought about the greatest increase in bank branch construction in US history, with nearly 3,000 new branches opening in 2005 [35]. According to the Federal Deposit Insurance Corporation, the number of bank branches increased by about 38% to just under 70,000 between 1990 and 2004; in 2005 alone the banking industry had a 3% annual growth rate, adding 2,255 net branch offices [16]. Urban-based banks in the USA follow their customers into developing suburbs; they opened branches in new retail centers along major highways, creating an advantage over older established competitors whose branches are located in declining main street, downtown or courthouse square areas [35], with more concern about the availability of parking in front of the branch, easy visibility, low-traffic areas, and low physical barriers [13]. However, in Canada the clicks-and-mortar banks' density in urban areas is higher than in suburban or rural areas, with closures occurring in urban areas and openings occurring in suburban areas [33]; banks have also moved toward opening in-store branches as part of a more convenience-focused delivery strategy [37].
B- Branches Layout Management and Human Resource

During the era of e-banking, bank operations managers should start by overhauling branches with a high percentage of premium customers first, and focus more on developing the customer experience [21]. Branches will adopt the retail-mall layout or the high-touch, high-tech strategy [10]; the focus of this type of layout is on marketing, or on combining better customer service with an expanded opportunity to sell financial products and improve the relationship with customers [13]. An external parking area in front of the branch, as at malls or restaurants, is available; the branch is generally open; the majority of the space is used for selling and marketing purposes; and a greeter or greeting station is placed at the front of the lobby. This station assesses customer needs and directs traffic based on those needs, or helps customers, especially older customers, to use online banking [26, 13]. In order to create an environment that encourages extended interaction between consumers and bank employees, the bank should move toward more open teller stations, free-standing teller towers that give customers the option of standing side by side with the teller to view information on a waist-high computer stand, or seated teller stations, which are designed for customers who cannot stand at regular teller windows, or for opening a new account if all closing rooms are occupied [13]; the bank should also create an area for private conferences apart from the traffic [6]. The reception area is very comfortable, with large chairs and a view onto the street, current newspapers and magazines, and a large TV turned to news and financial programming; there is also marketing on specially designed marketing walls and displays. Each branch has an internet café with fresh coffee or internet kiosks, digital signage, and a computer and printer for customers' use; some banks should also consider creating a living room in at least some of their branches [38, 13]. The branches will also have extensive warm and natural lighting, and natural colors and materials; the lighting should draw attention to the teller lines, and the lighting system should support productivity and affect how colors are perceived and how inviting the environment is. The lobby or branch hall includes some plants and pictures, and attention is also paid to the availability of central heating, air conditioning and ventilation, CCTV and bandit screens, carpeted floors, a hard-surface area in front of the teller stations, and easily visible signs [6].

C- Branches Capacity Strategy

The capacity of branches, as measured by the number of employees and teller stations, could be affected by adopting e-banking: if the number of branches is reduced, as was done by banks in the UK, the capacity of branches will be reduced; however, if the number of branches increases, the capacity will be extended.

D- Transaction Cost and Customers Waiting Time

Using different routes in the branches, such as ATMs, will reduce the transaction cost [24], since the number of employees will be reduced. Increasing the process routings, for example by using e-banking channels in the branches, will reduce customers' waiting time [41], since a reasonable number of customers will use e-channels, so the number of customers using the traditional banking channel will decrease, and waiting time will decrease accordingly.

Identifying the Performance Indicators of Banking Operations
2, MAY 2012 A- Financial Indicators The widely adopted financial indicators in the banking industry are ROA (return on assets) and ROE (return on equity) [39], and are widely evaluated by previous studies as the study of [29, 34]. However, the previous indicators have income from interest and noninterest, so in order to reflect the core banking operations non-interest income as a percentage of total revenue can be used [17]. B- Marketing Indicators The marketing performance can be evaluated to have better insight about the impact of operations, the widely adopted indicators by previous studies are 1) market share [9, 1], 2) growth rate [9], and 3) perceived quality [36, 29], Firms in recent years have focused more on tracing retention. Customer retention is a pivotal strategic issue, and in the 1990s several streams of research focused on this [11]. IV. RESEARCH METHODOLOGY Analytical survey methodology is adopted by this study as a result of small sample size which is 15 local banks. The unit of observation is the bank, all banks in Jordan are the population, however, the local banks (15 banks) were surveyed since the locals are outperform the foreigners, also, the local banks have larger branches networks [3]. The time frame of this research is set at ten years, from the beginning of 1999 to the end of 2008. During this period the majority of banks started to adopt certain e-banking elements. A- The Achievement of First Objective The adoption of e-banking was traced by revising banks websites (website archive used for this purpose: www.archive.org) and annual reports. Table (1) shows the definition and scales of banking adoption dimensions. The adoption was measured by subtracting the adoption points of last period (2008) from the beginning period 1999 for every bank (Microsoft Excel was used for this purpose), then, the result was divided by the adoption during the first period; the result was the percentage of adopting e-banking. B- The Achievement of Second Objective: B-1 Changes in branches accessibility: The data required to compute the branches accessibility was collected from bank annual reports and websites, the number of branches in urban, suburban and rural areas of every bank was identified from annual reports. However, the population numbers in urban, suburban and rural areas were identified from the website of the department of statistics. After collecting data accessibility was computed as summarized in table (2), then, the changes was identified by subtracting the last © 2012 ACADEMY PUBLISHER 191 period accessibility from the beginning period by using Microsoft Excel for every bank and every action, then, the result was divided by the accessibility scores during the first period; the result was the percentage of change in branches accessibility. TABLE (1) THE MEASUREMENT OF ADOPTING E-BANKING E-banking Channel Adoption Scale alternatives Adoption of ATMs ATM Urban Accessibility: Number of Accessibility. ATM Suburban ATM/10,000 People. Accessibility. ATM Rural Models: each models 1 Accessibility. Number of ATM point. Models. Adoption of Internet Number of 1 point each service Banking. Transactional Services. Adoption Banking Adoption Banking of of General Adoption Mobile Telephone E-banking SMS Push SMS Pull Mobile Internet Banking. IVR (Interactive Voice Response) Call Centre. Contact Centre. 
ATM Internet Banking SMS Push SMS Pull Mobile internet banking IVR Call Centre Contact Centre 1 point each service 1 point each service 1 point each service TABLE (2) BRANCHES ACCESSIBILITY ALTERNATIVES AND SCALES Branches Accessibility References Scale Branches urban accessibility Branches suburban accessibility Branches rural accessibility [32, 14] Number of branches /10,000 people B-2 Changes in branches layout design and number of tellers stations Primary data was collected from branch managers’ inbranch, since they have direct contact with branch layout daily, and so are capable to provide historical data about layout design. Branch managers in different geographic regions were surveyed, since branch-layout design may be change from region to region; therefore multiple respondents from different geographic regions increased the reliability of data. Data was collected using a questionnaire developed by the researcher, and the items were developed according to literature, the number of items in this questionnaire was 192 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 57 (see table (3)). Data was collected from convenient sample of branches managers of 15 banks, who have been appointed in this position since 1999, the total number of participated managers in this survey was 236 (response rate 68%). After collected the facts the data was coded by using competency score presented in table (3), for each bank, and the data was aggregated for every bank by computing the average responses. The changes in each sub-dimension for every bank were identified by subtracting the competency score of the last period from the beginning period (Microsoft Excel was used for this purpose) then, then, the result was divided by the layout scores during the first period; the result was the percentage of change in layout design. B-3 Changes in Tellers Flexibility, Account Customers Waiting Time: Tellers were surveyed to identify whether there is a change in their role (directing toward selling and promoting products), and customers waiting time. Tellers were source of these facts since they are in direct interaction with operations. Data was collected from convenient sample of tellers of 15 banks, who have been appointed in this position since 1999, the total number of participated tellers in this survey was 241 (response rate 70%). Questionnaire was developed by researcher, after collecting data the responses were coded as presented in table (4). The customers waiting time was measured by minutes. However, tellers flexibility was measured as summarized in table (4). TABLE (4) TELLERS FLEXIBILITY: QUESTIONNAIRE’S ITEMS AND SCALES Flexibility Number Items Scale Dimensions of items References Promotion 1 [22, 23,30, Yes: 1 28] Making 2 solve problems: Decisions 1, making interest decision: 2 Cross Trained 2 Do back office: 1, do front office: 1 B-4 Transaction Cost and number of tellers Account transaction cost and number of tellers were identified from account operations unit in the headquarter by surveying account executive officer who have been appointed in this position since 1999 (one officer in each bank) so 15 executives were surveyed, since this unit has records about this fact, further, the branches managers and tellers do not have idea about such facts. Data was collected by using a questionnaire. The scale of transaction cost was currency unit (JD) and for number teller (number). 
The changes in all of these dimensions were identified by subtracting the score of the beginning period from that of the last period for every bank (using Microsoft Excel); the result was then divided by the score of the first period, giving the percentage of change in transaction cost and number of tellers.

C. The Achievement of the Third Objective

The source of all of these indicators was the banks' annual reports, except for customer satisfaction and retention, which were collected from headquarters by surveying the account operations officer. The change over the study period was identified by subtracting the results of the first period from those of the last period; then the relative performance of each bank on each performance indicator was identified by dividing the bank's change in the performance indicator by that of the best-performing bank.

D. The Achievement of the Fourth Objective

K-means cluster analysis was used to cluster the banks according to their degree of e-banking adoption. All clustering trials were generated, up to the maximum number of clusters, since the number of clusters was not known. Then, the clusters that adopted significant e-banking practices were identified using the Kruskal-Wallis H test, since the data are not normally distributed and the sample size is small. The significant branch operations strategy actions, capabilities and performance indicators of each clustering trial that adopted significant e-banking practices were also identified using the Kruskal-Wallis H test. The level of significance was 0.05. The significant e-banking and branch operations practices were represented using actual scores (the scores are given in Tables (1), (2), (3) and (4) in the previous section); these were identified by subtracting the indicators of the beginning period from those of the last period. The percentage of change was identified in the same way.

V. DATA ANALYSIS AND FINDINGS

Table (5) shows the significant e-banking practices across clustering trials. It can be seen that significant e-banking practices were identified in the 3-cluster and 4-cluster trials only. The significant practice across the clusters of the 3-cluster trial was ATM suburban accessibility only; there were no significant practices related to internet banking, mobile banking or telephone banking. The significant practices across the clusters of the 4-cluster trial were ATM suburban accessibility and the degree of adopting mobile banking; there were no significant practices for internet banking or telephone banking. The average ATM accessibility across the 3-cluster trial was 1.28 ATMs/10,000 people, and the accessibility across the 4-cluster trial was 1.19 ATMs/10,000 people. The maximum accessibility in the 3-cluster trial was reported by cluster 2 (3.34 ATMs/10,000 people), and the maximum accessibility in the 4-cluster trial was reported by cluster 3 (3.53 ATMs/10,000 people). The significant mobile banking practice of the 4-cluster trial was adopting push SMS banking; the best cluster was the 3rd cluster, which adopted pull SMS banking.
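The clustering-and-testing procedure of Section IV.D can be sketched as follows. The per-bank figures below are invented for illustration and are not the study's data; the study surveyed 15 local banks, generated K-means clustering trials and applied the Kruskal-Wallis H test at the 0.05 level.

```python
# Illustrative sketch of the Section IV.D procedure (invented data, not the survey results):
# cluster banks by their percentage change in e-banking adoption, then test whether an
# operations measure differs across clusters with the Kruskal-Wallis H test.
import numpy as np
from sklearn.cluster import KMeans
from scipy.stats import kruskal

# Percentage change per bank = (adoption score in 2008 - score in 1999) / score in 1999.
adoption_change = np.array([0.10, 0.15, 0.80, 0.75, 0.20, 0.90, 0.05,
                            0.60, 0.65, 0.12, 0.85, 0.18, 0.70, 0.08, 0.95]).reshape(-1, 1)
# Change in account customers' waiting time per bank, in minutes (invented).
waiting_time_change = np.array([-1, -2, -9, -8, -3, -10, -1,
                                -7, -6, -2, -9, -2, -8, -1, -11])

for k in (3, 4):  # the 3-cluster and 4-cluster trials examined in the paper
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(adoption_change)
    groups = [waiting_time_change[labels == c] for c in range(k)]
    stat, p = kruskal(*groups)
    print(f"{k} clusters: H = {stat:.3f}, p = {p:.3f}, significant = {p < 0.05}")
```

The same test is then repeated for each operations action, capability and performance indicator to populate tables like Tables (5)-(7).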
2, MAY 2012 193 TABLE (3) BRANCHES LAYOUT DESIGN: QUESTIONNAIRE’S ITEMS, SCALES AND COMPETENCY SCORES Dimensions of Main Sub-dimension Number of items Item References Competency Score Actions (Items in Questionnaire) Visual -----3 [4, 7, 13, 15] Window: 1 teller stations: 3, lighting: 5 Isolation 2 [4, 7, 13, 15] Yes: 1 No: 0 Air conditioning 5 Central heating 2 Seats and disks 3 Parking Colours 1 3 Floor 2 Pictures and plants CCTV 1 2 Alarms 3 Recycling 3 Walls 2 Community 2 Signs and labels 5 Promotional leaflets and facilities 3 Badges and uniforms Customer service 2 2 Departmentalisation 2 Internet banking 2 ATM 2 Type of teller stations 3 Number 1 Convenience Aesthetic Safety and security [15] [13, 7] Social responsibility Information factor E-banking channels Teller stations © 2012 ACADEMY PUBLISHER [4, 6, 13, 15] [4, 13] [4, 13] Ventilation: 1, entrance air conditioner: 3, hall air conditioner: 5, teller station air conditioner: 7, credit offices air conditioner: 9 Central heating: 1 Heating thermostat: 2 Seats in the front of teller stations: 1, VIP halls: 5, disks: 7 Yes: 1 No: 0 Warm colours: 3, cool colours: 2, subdued colours: 1 Carpeted floor: 1, hard surface: 2 Yes: 1, No: 0 CCTV in the branch: 1, outside: 2 Entrance: 1, front of teller stations: 3, offices: 5, connected: 7 Solar panel: 5, collecting water: 3, recycled rubber floor: 1 Changed colours: 1, classed walls: 2 Children's playing area: 1, community meeting halls: 2 Dept. title signs: 5, direction signs: 3, teller station title and number signs: 1 Digital signs: 2, traditional: 1 Leaflets at the entrance: 1, leaflets in the hall: 3, TV screens: 5 Uniform: 2, badges: 1 Electronic kiosks: 2, staff: 1 Departmentalisation: 1, customer service unit: 2 Wireless laptop area: 2, Internet café: 1 Wall ATMs: 1, inside ATM: 2 Stand-up: 1, seat: 3, tower: 5 Number Max. Score 9 1 25 3 13 1 3 3 1 3 16 15 9+2 5 3 3 3 3 3 9 Relative number > Average + standard deviation 194 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 Cluster number Across all clusters Significant banking practices e- ATM Suburban Accessibility Cluster 1 ATM Suburban Accessibility Cluster 2 ATM Suburban Accessibility Cluster 3 ATM Suburban Accessibility Cluster 4 ------- TABLE (5) SIGNIFICNAT E-BANKING PRACTICES ACROSS CLUSTERING TRIALS 3 clusters trial 4 clusters trial Percentage of Percentage of Actual Significant e- Kruskal Kruskal Change banking Wallis H change Wallis H change practices test test Chi. Chi. Sig. Sig. 5.21% ATM Suburban 10.406 5.165 6.39% 1.28 0.015 Accessibility 0.023 ATMs/ 10,000 Mobile Banking people Adoption 1.44% 8.775 0.023 1.78% ATM Suburban --------0.21% 0.35 Accessibility ATMs/ 10,000 Mobile Banking people Adoption 0.00 15.62% ATM Suburban --------3.34% 3.34 Accessibility ATMs/ 10,000 Mobile Banking people Adoption 0.00 ----- ----- 15.62% ----- 0.84 ATMs/ 10,000 people ----- ATM Suburban Accessibility Mobile Banking Adoption ATM Suburban Accessibility Mobile Banking Adoption Table (6) shows significant branches operations actions across clustering trials, it can be seen that; the significant actions identified in 3rd clusters trial were; trained front office employees to do back-office job and reduce customers waiting time, however, the significant actions identified in 4th clusters trial was trained frontoffice employees to do back-office jobs, so no significant actions related to branches accessibility, branches layout quality, and transaction cost. 
Cluster number Across clusters Cluster 1 all ----- ----- 1.19 ATMs/ 10,000 people 0.68: Push SMS 1.78 ATMs/ 10,000 People 0 0.84 ATMs/ 10,000 people 0 3.53% 3.21ATMs/ 10,000 people 1.75% 1.75: Pull SMS -0.47 ATMs/ 10,000 people -0.08% 1.13% 1: push SMS The average percentage of employees trained to do back-office job across 3 clusters trial was 56%, and the percentage across 4 cluster trial was 66% of employees. The maximum percentage of trained employees of 3 cluster trial was reported by all clusters except fourth cluster (100% of employees). The average reduction of customers waiting time of 3 cluster trial was reducing customers waiting time by 7.9 minutes; the best cluster was 3rd cluster, the customers waiting time was reduced by 9.5 minutes. TABLE (6) SIGNIFICNAT BRANCHES OPERATIONS ACTIONS ACROSS CLUSTERING TIRALS 3 clusters trial 4 clusters trial Percentage Percentage Actual Significant e- Kruskal Kruskal Significant Wallis H of change Change banking Wallis H of change branches test practices test operations actions Chi. Chi. Sig. Sig. 70% 56% of Trained front 11.345 61% 6.051 Trained front 0.010 office employees 0.014 office employees employees to to do back-office do back-office job. job. Reduce account -7.9 minutes -63.4% 3.710 customers waiting 0.05 time Trained front -----16.7% -16.7% of Trained front ----100% © 2012 ACADEMY PUBLISHER Actual Change Actual Change 66% of employees 100% of JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 employees office employees to do back-office job. Cluster 2 Reduce account customers waiting time Trained front office employees to do back-office job. Cluster 3 Reduce account customers waiting time Trained front office employees to do back-office job. Cluster 4 Reduce account customers waiting time ------- ----- ----- ----- -41.5% -7.03 minutes 100% 88% of employees -73% -7.72 minutes 100% of employees 100% of employees -76% -9.5 minutes ----- ----- Table (7) shows the differences in performance indicators across clustering trial, it can be seen that; no significant differences in performance indicators across 195 employees office employees to do back-office job. Trained front office employees to do back-office job. ----- 100% 100% of employees Trained front office employees to do back-office job. ----- 83% 100% of employees Trained front office employees to do back-office job. ----- -18.8% 18.8-% of employees clusters in 3 and 4 clusters trials, so, the significant changes in e-banking and branches operations did not lead to significant changed in performance. TABLE (7) KRUSCAL WALLIS H TEST FOR DIFFERENCES IN PERORMANCE INDICATORS ACROSS CLUSTERING TRIALS Performance Indicators 3 clusters trial 4 clusters trial Chi-square Sig. Chi-square Sig. Return on Equity 1.900 0.593 0.000 1.000 Return on Assets 2.657 0.448 0.320 0.572 Operating revenue/total revenue 3.244 0.356 2.337 0.126 Deposits Market share 1.250 0.741 0.080 0.777 Customers Satisfaction 4.447 0.217 0.082 0.774 Customers Retention 1.998 0.573 0.139 0.710 VI. DISCUSSION ATM suburban accessibility was significant ebanking practices since ATM has deployed in Jordan since mid. 1990s, but the majority of banks have ATMs in urban areas, so it is more attractive to launch ATMs in suburban. However, significant mobile banking practice is reasonable since the majority of banks adopted Push SMS so adopting Pull SMS is the significant. 
No significant operations related to branches location, branches layout, tellers role in promoting or selling products, number of teller stations and number of tellers which indicates that the branches are still working in parallel with e-banking channels, so the branches are still a channel of conducting transaction rather than a place of promoting and selling products, accordingly, no significant impact on transaction cost. The significant training of employees to do backoffice tasks is reasonable since the direction of customers toward using e-banking channels reduce the work load on front-office employees so they may be have more time to © 2012 ACADEMY PUBLISHER do back office job, further, the reduction of customers waiting time is as a result of using customers e-banking channels which reduce number of served customers by tellers, and reduce customers waiting time in accordance. No significant impact of adopting e-banking on performance, since e-banking practices are still recent in Jordan so the impact could be significant in future. VII. CONCLUSION This study investigated the impact of adopting ebanking on traditional banking operations in Jordan. Ebanking practices traced were; degree of adopting ATMs, internet banking, telephone banking, mobile banking and e-banking channels. A scale was developed in this study to measure the degree of adoption. Traditional banking operations actions and capabilities were traced in this study; branches accessibility, branches layout quality, tellers flexibility, transaction cost, customers waiting time, and branches capacity. The practices of 15 local banks in Jordan were reported by using questionnaires directed to tellers, 196 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 account operations executives, branches managers, further, annual reports were revised. K-means Cluster analysis was used to identify the clusters of banks adopted significant e-banking practices; the significant practices were adopted by three and four cluster trials, Kruskal Wallis H test used to identify significant differences across clusters. The significant e-banking practices adopted by banks in Jordan were; ATM suburban accessibility and adopt mobile banking, however, the significant traditional banking operations actions and capabilities were; reduce customers waiting time and trained tellers to do backoffice job. No significant change in branches accessibility, branches layout design, transaction cost, and branches capability were reported in Jordan which indicates that branches are still the main channels of conducting banking transactions, and e-banking is working in parallel with branches. VIII. APPLICATIONS AND FUTURE RESEARCHES The result of this is study is important for decisions makers and academics alike. The decision makers of banks in Jordan are having now facts about the effectiveness of actions they made in era of e-banking, so they can focus on the effective actions made and those not adopted some of these actions can plan now to realize these actions in future. Further, the proposed investors who are concerning about banking in Jordan can now make more rational actions. 
Academics now have evidence about the impact of e-banking on traditional banking operations strategy in developing countries, which can be used to develop propositions and hypotheses. They also have material on banking processes to teach to university students, helping students in developing countries gain better insight into the actions they should take in the future. Although Jordan is one of the developing countries, the results of this study are most applicable to banks in Jordan, so it is recommended to conduct further studies in other developing countries to find whether the same significant actions and capabilities are adopted by banks, and to trace the impact of the actions on capabilities. Future studies would be more beneficial if they were cross-country or cross-region studies, since comparisons could then be made and better conclusions drawn about the impact of the country or regional context.

REFERENCES

[1] Ahmed, N. and Montagno, R. (1995), "Operations strategy and organizational performance: an empirical study", International Journal of Operations and Production Management, Vol. 16 No. 5, pp. 41-53.
[2] Amman Stock Exchange (2006), Capital market profile 2006 [online]. Available from: http://www.exchange.jo [accessed 20 Feb. 2008].
[3] Association of Banks in Jordan (2007), 29th Annual Report [online]. Available from: http://www.abj.org.jo/Portals/0/AnnualReport/AnnualReport2007.pdf [accessed 20 July 2008].
[4] Baker, J., Berry, L.L. and Parasuraman, A. (1988), "The marketing impact of branch facility design", Journal of Retail Banking, Vol. 10 No. 2, pp. 33-42.
[5] Beery, A. (2002), "Site selection insight", ABA Banking, Vol. 34 No. 4, pp. 28-32.
[6] Belski, L. (2004), "Seven design problems and a few good ideas on improving interiors", ABA Banking Journal, Vol. 96 No. 8, pp. 28-32.
[7] Bielski, L. and Street, B. (2007), "Evoking a sense of place", ABA Banking Journal, Vol. 99 No. 8, pp. 25-30.
[8] Channon, D.F. (1986), Bank strategic management and marketing, Wiley and Sons, United Kingdom.
[9] Cleveland, G. (1989), "A theory of production competency", Decision Science, Vol. 20 No. 4, pp. 655-668.
[10] Davidson, S. (2002), "Bank branch design helps improve the bottom line", Community Banker, Vol. 11 No. 3, pp. 38-39.
[11] Erikssan, K. and Vaghutt, A.L. (2000), "Customer retention, purchasing behavior and relationship substance in professional services", Industrial Marketing Management, Vol. 29, pp. 363-372.
[12] Creane, S., Goyal, R., Mobarak, A.M. and Sab, R. (2004), Financial sector development in the Middle East and North Africa, IMF working paper [online], International Monetary Fund, Middle East and Central Asia Department, Washington, DC, USA. Available from: http://www.imf.org/external/Pubs/Pubs/and-inf.htm [accessed 2 August 2008].
[13] Feig, N. (2005), "Extreme (and not so extreme) makeovers in modern branch design", Community Banker, Vol. 14 No. 5, pp. 32-40.
[14] Frohlich, M. and Dixon, J. (2001), "A taxonomy of manufacturing strategies revisited", Journal of Operations Management, Vol. 19, pp. 541-558.
[15] Greenland, S. and McGoldrick, P. (2005), "Evaluating the design of retail financial service environments", The International Journal of Bank Marketing, Vol. 23 No. 2/3, pp. 132-152.
[16] Grover, C. and Ferris, J. (2007), "Right location wrong building", Community Banker, Vol. 16 No. 10, pp. 48-52.
[17] Hallowell, R. (1996), "The relationship of customer satisfaction, customer loyalty, and profitability: an empirical study", International Journal of Service Industry Management, Vol. 7 No. 4, pp. 27.
(1996), "The relationship of customer satisfaction, customer loyalty, and profitability: an empirical study", International Journal of Service Industry Management, Vol.7 No.4, pp.27 [18] Hirtle, B. (2007), "The impact of network size on bank branch performance", Journal of Banking and Finance, Vol. 31, pp. 3782-3805. [19] Hirtle, B. and Melti, C. (2004), "The evaluation of US bank branch network: growth, consolidation, and strategy", Current Issues in Economic and Finance, Vol. 10 No. 8. [20] Howcroft, B. and Beckett, A. (1996), "Branch network and the retailing of high credence products", The International Journal of Bank Marketing, Vol.14 No.4, pp.4-6. [21] Hoeck, B. (2006), Customer centric banking 2007: customer relationship redefined-how retail banks can learn from other industries [online], White paper: GFT technology AG: Germany, Available from: http//www.GFT.com, [accessed at May 3 2008]. [22] Hunter, L. and Hittle, L.M. (1999), What makes a highperformance workplace?: evidence from retail bank branches [online], Financial Institution Centre: The Wharton School: University of Pennsylvania: USA. Available from: http://www.Watonschool.com [accessed at 3rd Dec. 2008]. [23] Hunter, L. (1995), How will competition change human resource management in retail banking?: a strategic perspective [online]. Financial Institution centre: The JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 4, NO. 2, MAY 2012 Wharton School: University of Pennsylvania: USA. Available from: : http://www.Watonschool.com [accessed at 3rd Dec. 2008]. [24] Khamis, M. (2003), Financial Sector reforms and issues in Jordan [online]. The Euro-Med Regional Economic Dialoge: Rome, October 20. Available form: http://eceuropa.eu/external_relations/euromed/en/red/kha mis.pdf, [assessed at March 10 2008]. [25] Lymperopoulos, C. Chaniotakis, I. (2004), “Branch employees' perceptions towards implications of e-banking in Greece”, International Journal of Retail & Distribution Management, Vol. 32 No.6/7, pp.302-311. [26] Lombardo, T. (2003), "Facility design plays key role in supporting brand identity", Bank News, Vol. 103 No. 8, pp. 20-21. [27] Lopez, R.H. and Shah, J. (2004), Branch Development in an era of commercial banking consolidation (1994-2006) [online], Pace University: Lubin School of Business, USA, Available from: http//digitalcommons.pace.edu/lubinfaculty_working papers/59, [accessed at February 24 2008]. [28] McKendrick, J. (2002) “Back to the branch: customer may demand access at every e-channel but they still want a smiling face”, Bank Technology News, Vol.15 No.5, p.1. [29] Menor, L. Roth, V. and Mason, C. (2001), "Agility retail banking: a numerical taxonomy of strategic service groups", Manufacturing and Service Operations Management, Vol. 13 No. 4, pp. 273-292. [30] Metter, R. and Vargas, V. (2000), "A typology of decoupling strategies in mixed services", Journal of Operations Management, Vol. 18, pp. 663-682. [31] Migdadi, Y.K.A. (2008), "the quality of internet banking service encounter in Jordan", Journal of Internet Banking and Commerce [online], December, 13(3), Available form: http://http://www.arraydev.com/commerce/jibc/, [Accessed at 13th March 2008]. [32] Miller, J. and Roth, A. (1994), "A taxonomy of manufacturing strategies", Management Science, Vol. 40 No. 3, pp. 285-304. [33] Mok, M.A. (2002), From bricks-and-mortar to Bricks-andclicks: the transformation of a bank branch network within greater Toronto area, Master thesis, Wielfrid Laurier University: Canada. 
[34] Power, T. and Hahn, W. (2004), "Critical competitive methods, generic strategies and firm performance", The International Journal of Bank Marketing, Vol. 22 No. 1, pp. 43-64.
[35] Reider, S. (2006), "The branching boom", Kentucky Banker Magazine, June, pp. 18-19.
[36] Roth, A. and Jackson, W. (1995), "Strategic determinants of service quality and performance: evidence from the banking industry", Management Science, Vol. 41 No. 11, pp. 1720-1733.
[37] Stroup, C. (1998), "Beyond in-store banking at Star Bank", Journal of Retail Banking Services, Vol. 20 No. 4, pp. 19-25.
[38] Swann, J. (2006), "Let the branch be unbroken", Community Banker, Vol. 15 No. 10, pp. 58-62.
[39] Uzelac, N. and Sudarevic, T. (2006), "Strengthening the link between marketing strategy and financial performance", Journal of Financial Service Marketing, Vol. 11 No. 2, pp. 142-156.
[40] Yakhlef, A. (2001), "Does the Internet compete with or complement bricks-and-mortar bank branches?", International Journal of Retail & Distribution Management, Vol. 29 No. 6/7, pp. 274-28.
[41] Klassen, K. and Rohleder, T. (2001), "Combining operations and marketing to manage capacity and demand in services", The Service Industry Journal, Vol. 21 No. 2, pp. 1-30.

Dr. Yazan Khalid Abed-Allah Migdadi is an assistant professor of operations management at Yarmouk University, Jordan. He was awarded a Ph.D. in operations strategy from Bradford University, UK, and holds a BA and an MBA from Yarmouk University, Jordan. He worked as a teaching assistant and researcher in operations and information management at Bradford University School of Management, UK, and as a lecturer in management at Yarmouk University. His research interests concentrate on operations strategy: reporting best practices in developing economies, reporting taxonomies of operations strategies in developing economies, and developing typologies of operations strategy.

Evolving Polynomials of the Inputs for Decision Tree Building

Chris J. Hinde and Anoud I. Bani-Hani
Loughborough University/Computer Science, Loughborough, UK
Email: {C.J.Hinde, A.I.Bani-Hani}@lboro.ac.uk

Thomas W. Jackson
Loughborough University/Information Science, Loughborough, UK
Email: T.W.Jackson@lboro.ac.uk

Yen P. Cheung
Monash University/Clayton School of IT, Wellington, Australia
Email: Yen.Cheung@monash.edu.au

Abstract—The aim of this research is to extend the discrimination of a decision tree builder by adding polynomials of the base inputs to the inputs. The polynomials used to extend the inputs are evolved using the quality of the decision trees resulting from the extended inputs as a fitness function. Our approach generates a decision tree using the base inputs and compares it with a decision tree built using the extended input space. Results show substantial improvements. Rough set reducts are also employed and show no reduction in discrimination through the transformed space.

Index Terms—Decision tree building, Polynomials

I. INTRODUCTION

This paper addresses the well-known problem of data mining where, given a set of data, the expected output is a set of rules. Decision trees using the ID3 approach [1], [2] are popular and in most cases successful in generating rules correctly. Extensions to ID3 such as C4.5 and CART were developed to cope with uncertain data.
Fu et al. [3] used C4.5 followed by a Genetic Algorithm (GA) to evolve better quality trees; in Fu's work C4.5 was used to seed a GA, and Genetic Programming (GP) techniques were then used to cross over the trees and evolve better ones. Many rule discovery techniques combining ID3 with other intelligent techniques such as genetic algorithms and genetic programming have also been suggested [4], [5], [6]. Generally, when ID3 is used with genetic algorithms, individuals, which are usually fixed length strings, represent decision trees and the algorithm evolves to find the optimal tree. When genetic programming is used to generate decision trees, individuals are variable length trees which represent the decision tree. Variations in these approaches can be found in the gene encoding. One rule per individual, as in Greene [7] and Freitas et al. [8], [9], is a simple approach, but the fitness of a single rule is not necessarily the best indicator of the quality of the discovered rule set. Encoding several rules in an individual requires longer and more complex operators [10], [11]. In genetic programming, a program can be represented by a tree with rule conditions and/or attribute values in the leaf nodes and functions in the internal nodes. Here the tree can grow dynamically and pruning of the tree is necessary [12]. Papagelis and Kalles [13] used a gene to represent a decision tree and the GA then evolves to find the optimal tree, similar to Fu et al. [3]. To further improve the quality of the trees, Eggermont et al. [14] applied several fitness measures and ranked them according to their importance in order to tackle uncertain data.

Previous work has taken the input space as a given and used evolution to produce the trees. In this work, as we shall see, the trees are generated using a variant of C4.5 and the input space is evolved rather than the trees, in direct contrast to other workers. A vast majority of the approaches use decision trees as a basis for the search in conjunction with either a GA or GP to further improve the quality of the trees. The approach described in this paper addresses continuous data and adds polynomials of the input values to extend the input set. A GA is used to search the space of these polynomials based on the quality of the tree discovered using a version of C4.5.

II. ITERATIVE DISCRIMINATION

ID3, C4.5 and their derivatives proceed by selecting an attribute that results in an information gain with respect to the dependent variable. A simple data set with 2 continuous attributes that are linearly separable is shown in Fig. 1. Applying C4.5 to the data set gives the result shown in Fig. 2, which was first documented in [15]. If no errors are required over a large training set then the complexity of the decision tree grows with the size of the training set. This is unsatisfactory. Anticipating the results of the proposed system, a higher level discriminant of x - y in addition to the two basic variables x and y would give the result shown in Fig. 3.

Fig. 1. A granularised version of the linearly separable set of data based on a 2 dimensional data set.

x <= -0.25 :
|   y > -0.75 : in (36.0)
    ...
x > -0.25 :
|   y <= 0.75 : out (40.0)
|   y > 0.75 :
    ...
Evaluation on training data (128 items):
Before pruning: size 27, errors 1 (0.8%).
After pruning: size 27, errors 1 (0.8%), estimate (13.3%).

Fig. 2. The decision tree produced by C4.5 from the linearly separable data shown in Fig. 1. The size of 27 indicates why this tree is not replicated here.

x-y <= -0.5 : out (64.0)
x-y > -0.5 : in (64.0)
Evaluation on training data (128 items):
Before pruning: size 3, errors 0 (0.0%).
After pruning: size 3, errors 0 (0.0%), estimate (2.1%).

Fig. 3. The decision tree produced by C4.5 from the linearly separable data using the discriminant value x - y.
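The effect illustrated in Figs 2 and 3 can be reproduced with a few lines of Python; the sketch below uses scikit-learn's CART implementation as a stand-in for C4.5 and synthetic data, so the node counts will not match the figures exactly, but the axis-parallel tree on x and y alone needs many splits while the tree given the extra column x - y needs only one.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
xy = rng.uniform(-1, 1, size=(128, 2))
label = (xy[:, 0] - xy[:, 1] > -0.5).astype(int)   # the "in"/"out" concept

# Tree on the base attributes only.
base = DecisionTreeClassifier(criterion="entropy").fit(xy, label)

# Tree on the base attributes plus the derived attribute x - y.
augmented = np.column_stack([xy, xy[:, 0] - xy[:, 1]])
aug = DecisionTreeClassifier(criterion="entropy").fit(augmented, label)

print("nodes without x - y:", base.tree_.node_count)
print("nodes with    x - y:", aug.tree_.node_count)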
III. MORE COMPLEX DISCRIMINANTS

So far we have made no more progress than Konstam [16], who used a GA to find linear discriminants. He makes the comment that the technique can be applied to quadratic discriminants; however, he makes no statements about focusing the search. A set of data was prepared using the same data points as above to explore higher order and higher dimensional discriminants. The data prepared used a torus such that points inside the torus were in the concept and points outside the torus, including those that are within the inner part of the torus, were deemed outside the concept. Fig. 4 illustrates the data set although, as above, it does not show all the points.

Fig. 4. A granularised version of the torus illustrating a quadratic form.

Applying C4.5 to the data set represented in Fig. 4 gives the decision tree shown in Fig. 5. This decision tree is smaller than the decision tree derived from the linearly separable data, although the function used to produce the data is much more complex, and the predictions from the tree show fewer errors. The decision tree is difficult to interpret.

x <= -3.25 : out (16.0)
x > -3.25 :
|   x > 2.75 : out (16.0)
|   x <= 2.75 :
|   |   y <= -3.25 : out (12.0)
|   |   y > -3.25 :
|   |   |   y <= 2.75 : in (72.0/24.0)
|   |   |   y > 2.75 : out (12.0)
Evaluation on training data (128 items):
Before pruning: size 9, errors 24 (18.8%).
After pruning: size 9, errors 24 (18.8%), estimate (25.5%).

Fig. 5. The decision tree produced by C4.5 from the toroidal data.

Taking the toroidal data set, Fig. 4, and adding another attribute computed from the sum of squares of x and y gives better discrimination and a more interpretable tree, shown in Fig. 6. Notice that the decision tree is much smaller, with 5 decision points compared to 9, and has no errors compared with 18.8% in the original tree, Fig. 5.

r2 > 8.125 : out (72.0)
r2 <= 8.125 :
|   r2 <= 0.625 : out (8.0)
|   r2 > 0.625 : in (48.0)
Evaluation on training data (128 items):
Before pruning: size 5, errors 0 (0.0%).
After pruning: size 5, errors 0 (0.0%), estimate (3.1%).

Fig. 6. The decision tree produced by C4.5 from the augmented toroidal data. r2 is the sum of the squares of x and y.
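The quadratic augmentation of Fig. 6 can be sketched in the same way; again scikit-learn's CART stands in for C4.5 and the data are synthetic, so the exact numbers differ from the figures, but the tree given r2 = x^2 + y^2 isolates the ring with a couple of thresholds while the tree on x and y alone is much larger.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
xy = rng.uniform(-4, 4, size=(512, 2))
r2 = (xy ** 2).sum(axis=1)
label = ((r2 > 0.625) & (r2 <= 8.125)).astype(int)   # inside the ring

plain = DecisionTreeClassifier(criterion="entropy").fit(xy, label)
aug = DecisionTreeClassifier(criterion="entropy").fit(
    np.column_stack([xy, r2]), label)

print("nodes with x, y only:", plain.tree_.node_count)
print("nodes with x, y, r2 :", aug.tree_.node_count)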
IV. NON PROJECTABLE DATA SETS

Thus far we have seen data sets that can be projected onto 1 dimension and which result in large trees but are nonetheless useful predictors. Section III shows that these trees can be reduced in size considerably by adding higher dimensional combined functions of the original data elements.

A. Banded data set

This test shows a data set that does not project down onto 1 dimension. This 2 dimensional data set, shown in Fig. 7, results in the tree from C4.5 shown in Fig. 8.

Fig. 7. A granularised version of banded linearly separable data.

out (128.0/52.0)
Evaluation on training data (128 items):
Before pruning: size 1, errors 52 (40.6%).
After pruning: size 1, errors 52 (40.6%), estimate (44.1%).

Fig. 8. The decision tree produced by C4.5 from the banded data.

The decision tree produced from the banded data, shown in Fig. 8, is almost useless. It does not reveal any useful information from the data; the most that can be gained is that there are 52 elements in the concept and the rest are out. Adding the attribute x - y gives the tree shown in Fig. 9, which is a good predictor and also makes the information held in the data clear.

x-y <= -2 : out (38.0)
x-y > -2 :
|   x-y <= 1.5 : in (52.0)
|   x-y > 1.5 : out (38.0)
Evaluation on training data (128 items):
Before pruning: size 5, errors 0 (0.0%).
After pruning: size 5, errors 0 (0.0%), estimate (3.2%).

Fig. 9. The decision tree produced by C4.5 from the banded data given the added input feature of x-y.

B. Quadrant data set

This test shows a data set that cannot be discriminated by C4.5, although a decision tree does exist; it is shown in Fig. 10. This clear 2 dimensional data set, shown in Fig. 11, results in the tree from C4.5 shown in Fig. 12.

x <= 0.0 : (64.0)
|   y <= 0.0 : in (32.0)
|   y > 0.0 : out (32.0)
x > 0.0 : (64.0)
|   y <= 0.0 : out (32.0)
|   y > 0.0 : in (32.0)
Evaluation on training data (128 items):
Before pruning: size 6, errors 0 (0.0%).
After pruning: size 6, errors 0 (0.0%), estimate (3.2%).

Fig. 10. The decision tree which could be used to discriminate the quadrant data, but cannot be produced by C4.5.

Fig. 11. A granularised version of the quadrant data set.

out (128.0/52.0)
Evaluation on training data (128 items):
Before pruning: size 1, errors 52 (40.6%).
After pruning: size 1, errors 52 (40.6%), estimate (44.1%).

Fig. 12. The decision tree produced by C4.5 from the quadrant data.

x*y <= -2 : out (38.0)
x*y > -2 :
|   x*y <= 1.5 : in (52.0)
|   x*y > 1.5 : out (38.0)
Evaluation on training data (128 items):
Before pruning: size 5, errors 0 (0.0%).
After pruning: size 5, errors 0 (0.0%), estimate (3.2%).

Fig. 13. The decision tree produced by C4.5 from the quadrant data given the added input feature of x*y.

With the data sets shown in Figs 7 and 11, C4.5 does not produce a tree at all. Of the two data sets presented, a higher order combined attribute results in a concise tree where no tree is produced without the higher order attribute. In the case of the quadrant data set, although a concise decision tree is possible with the unaugmented data set (Fig. 10), one is not produced by C4.5.
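The following sketch illustrates, under synthetic data, why a greedy one-attribute-at-a-time splitter struggles with the quadrant concept of Fig. 11: the best single threshold on x or on y yields almost no information gain, while a single threshold on the combined attribute x*y separates the classes completely. The small gain and entropy helpers are written out here for illustration; they are not taken from the authors' system.

import numpy as np

def entropy(y):
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def best_split_gain(feature, y):
    # Information gain of the best single threshold on one attribute.
    base = entropy(y)
    gains = []
    for t in np.unique(feature)[:-1]:
        left, right = y[feature <= t], y[feature > t]
        w = len(left) / len(y)
        gains.append(base - w * entropy(left) - (1 - w) * entropy(right))
    return max(gains)

rng = np.random.default_rng(3)
xy = rng.uniform(-1, 1, size=(128, 2))
label = ((xy[:, 0] > 0) == (xy[:, 1] > 0)).astype(int)   # quadrant concept

print("best gain on x   :", round(best_split_gain(xy[:, 0], label), 3))
print("best gain on y   :", round(best_split_gain(xy[:, 1], label), 3))
print("best gain on x*y :", round(best_split_gain(xy[:, 0] * xy[:, 1], label), 3))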
V. GENETIC ALGORITHM

The genetic algorithm attached to the front of C4.5 has a few special features. It follows most of the guidelines in [17], [18] and so has aspects designed to preserve inheritability and to ensure that no part of the genome has an inordinate effect on the phenome. With this in mind the structure of the genome is made up from a set of integers, rather than a binary genome.

A. Genetic Structure

The chromosome can deliver several genes corresponding to several combined attributes. The chromosome has a fixed maximum length and achieves a variable number of genes by an activation flag. Each gene delivers one new attribute and each variable is a combination of simpler variables.

1) Variable: If the number in the variable slot is N, there are K basic continuous variables in the data set and M variables in the gene prior to this one, then N mod (K + M) refers to a variable within those K + M variables.

2) Function: If the function is a monadic function then it is applied to variable 1, otherwise to both. The prototype system has a set of simple arithmetic functions: power, multiplication, division and subtraction. This is sufficient to extract all the decision trees we have considered.

3) Number of genes and gene length: A variable length chromosome has disadvantages, as the effect on the gene of the two fields that determine its length is considerably greater than that of any other field and can be destructive. The variable length gene has similar disadvantages. The gene structure finally chosen for the system is shown in Fig. 14.

Active | Variable 1 | Function | Variable 2

Fig. 14. This shows the basic structure of the gene adopted. The Active/Variable/Function/Variable segment is repeated up to the gene length.

This potentially has some of the properties of recessive genes that are attributed to diploid gene structures, although no experiments have been conducted to determine this. An example gene is shown in Fig. 15. This gene has 4 segments, 1 of which is active. Each segment has 2 attributes, some active and some not. The function field is interpreted as 2 for plus, 3 for minus and 5 for multiply; no other function types are illustrated.

1 1 3 2   active     x - y
0 1 5 1   inactive   x * x
0 2 5 2   inactive   y * y
0 4 2 5   inactive   x2 + y2

Fig. 15. An exemplar gene. x and y are variables number 1 and 2. The first new variable is x - y and is variable number 3. As this is activated it will be made available as an input to the decision tree generator. If variable 6 is activated then, because it relies on variables 4 and 5, they will also be kept but not necessarily activated.
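A possible decoding of such an integer genome is sketched below as an illustration of the scheme, not the authors' code: each four-integer segment (active, variable 1, function, variable 2) builds one new column from the columns already available, the function codes follow Fig. 15 (2 = plus, 3 = minus, 5 = multiply), and a variable index N is mapped onto the existing columns as N mod (K + M), here assuming the 1-based numbering used in Fig. 15. Inactive derived variables remain available for later segments but are not handed to the tree builder.

import numpy as np

FUNCTIONS = {2: np.add, 3: np.subtract, 5: np.multiply}

def decode(genome, X):
    """genome: flat list of (active, var1, func, var2) integers;
    X: array of shape (n_samples, K) holding the base attributes."""
    columns = [X[:, k] for k in range(X.shape[1])]
    active_flags = []
    for i in range(0, len(genome) - 3, 4):
        active, v1, func, v2 = genome[i:i + 4]
        a = columns[(v1 - 1) % len(columns)]      # N mod (K + M), 1-based
        b = columns[(v2 - 1) % len(columns)]
        op = FUNCTIONS.get(func, np.add)          # default if code unknown
        columns.append(op(a, b))                  # always kept for later use
        active_flags.append(bool(active))
    # Base attributes are always passed on; derived ones only if activated.
    derived = [c for c, keep in zip(columns[X.shape[1]:], active_flags) if keep]
    return np.column_stack([X] + derived) if derived else X

# The exemplar gene of Fig. 15: x - y active; x*x, y*y and their sum inactive.
X = np.array([[3.0, 1.0], [2.0, 5.0]])
print(decode([1, 1, 3, 2,  0, 1, 5, 1,  0, 2, 5, 2,  0, 4, 2, 5], X))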
VI. EXEMPLAR DATA

The system described above was applied to some data sets taken from the Machine Learning Repository [19] in order to compare the capability of the system with other known decision tree generators. The experiment compares the decision trees generated by C4.5 with the decision trees generated by C4.5 using the enhanced input space. The results consist of:
• the percentage of correct results on the training set
• the percentage of correct results on the test set
• the degrees of freedom for the decision space
• the probability that the result could not have arisen by chance
• the decision tree size

A. Experimental results

Each data set was split randomly into two sets: the training set, which comprised 90% of the data, and the test set, which comprised 10% of the data. The split was generated by choosing whether a particular data point was to be in the training set or the test set using a random number generator; this way any temporal aspects that may be in the data are accounted for. Notice that the degrees of freedom are different for the training set and the test set; this is because there were no data elements belonging to one of the categories in the test set, where there were elements in the training set.

TABLE I. EXPERIMENTAL RESULTS FOR GLASS DATA SET

           Train Correct   Test Correct   DOF Train   DOF Test   Probability of not null   Tree size
C4.5       92.8            75             30          25         1.0                       43
C4.5+GP    98.5            100            30          25         1.0                       61

The glass data set shows a considerable improvement for the enhanced input space; however, the decision tree is larger. The iris data set shows an improvement for the enhanced input space, but the improvement is marginal; here the decision tree is smaller.

TABLE II. EXPERIMENTAL RESULTS FOR IRIS DATA SET

           Train Correct   Test Correct   DOF   Probability of not null   Tree size
C4.5       98              100            6     0.99                      9
C4.5+GP    100             100            6     0.99                      7

TABLE III. EXPERIMENTAL RESULTS FOR PIMA INDIANS DATA SET

           Train Correct   Test Correct   DOF   Probability of not null   Tree size
C4.5       82.7            74.5           2     1.0                       33
C4.5+GP    98.3            84.2           2     1.0                       119

The Pima Indians data set shows a considerable improvement for the enhanced input space. Both the training set and the test set show improvement. The enhanced decision tree is also considerably bigger, by a factor of nearly 4. The experiments have shown that the enhanced system is able to significantly improve the quality of the decisions made; however, this is often at the expense of a larger tree. The test on the iris data set indicates that the decision tree can also be smaller, as shown by some of the demonstration data sets earlier in the paper.

VII. REDUCED DATA SETS

Work by Jensen and Shen on rough set theory aims to reduce the input set to a subset of attributes that have the same predictive value as the original set [20]. In this sense, whereas the work reported here extends the input space by adding polynomials of the base features, Jensen's work reduces the input space. Using the glass data set as an example, Jensen's model removes the input value that measures the amount of Barium in the glass sample. Initial experiments show that the reduced set does reduce accuracy, but only marginally and not significantly; see Table IV.

TABLE IV. EXPERIMENTAL RESULTS FOR GLASS DATA SET COMPARING COMPLETE AND REDUCED DATA SETS WITH C4.5

                Train Correct   Test Correct   DOF Train   DOF Test   Probability of not null   Tree size
C4.5 Complete   92.8            75             30          25         1.0                       43
C4.5 Reduced    92.3            70             30          25         1.0                       45

The interim conclusion is that no predictive accuracy has been lost, but it is also true that C4.5 alone does not extract everything from the data that it is possible to extract. The next test evolves the reduced input space to extract as much predictive power as it can. These preliminary tests indicate that the reduction system of Jensen and Shen [20] does not remove useful information by eliminating input attributes and, coupled with the enhanced input space system reported here, shows no loss of accuracy.

TABLE V. EXPERIMENTAL RESULTS FOR GLASS DATA SET COMPARING COMPLETE AND REDUCED DATA SETS WITH C4.5 AND C4.5 + GP

                   Train Correct   Test Correct   DOF Train   DOF Test   Probability of not null   Tree size
C4.5 Complete      92.8            75             30          25         1.0                       43
C4.5 Reduced       92.3            70             30          25         1.0                       45
C4.5+GP Complete   98.5            100            30          25         1.0                       61
C4.5+GP Reduced    99.0            100            30          25         1.0                       59

VIII. CONCLUSIONS

This paper has extended the capability of decision tree induction systems where the independent variables are continuous. The incremental decision process has been shown to be inadequate in explaining the structure of several sets of data without enhancement. The paper has shown that introducing variables based on higher order and higher dimensional combinations of the original variables can result in significantly better decision trees. This can all be accomplished by introducing these variables at the start of the decision tree generation, and a suitable method for generating these would be a genetic algorithm.
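The overall arrangement can be summarised in a compact sketch: integer genomes that build extra attributes are evolved, with the quality of the resulting decision tree as the fitness, echoing the C4.5 + GA pipeline evaluated in Tables I-V. scikit-learn's CART replaces C4.5, the GA is deliberately minimal (truncation selection and point mutation only), and the fitness here combines test accuracy with a small size penalty; all of these are illustrative choices, not the authors' implementation.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

FUNCTIONS = {2: np.add, 3: np.subtract, 5: np.multiply}

def decode(genome, X):
    # Build derived columns from (active, var1, func, var2) segments.
    cols = [X[:, k] for k in range(X.shape[1])]
    keep = []
    for i in range(0, len(genome) - 3, 4):
        active, v1, func, v2 = genome[i:i + 4]
        cols.append(FUNCTIONS.get(func, np.add)(cols[(v1 - 1) % len(cols)],
                                                cols[(v2 - 1) % len(cols)]))
        keep.append(bool(active))
    extra = [c for c, k in zip(cols[X.shape[1]:], keep) if k]
    return np.column_stack([X] + extra) if extra else X

def fitness(genome, X, y):
    Xtr, Xte, ytr, yte = train_test_split(decode(genome, X), y,
                                          test_size=0.1, random_state=0)
    tree = DecisionTreeClassifier(criterion="entropy").fit(Xtr, ytr)
    # Reward accuracy on held-out data, lightly penalise large trees.
    return tree.score(Xte, yte) - 0.001 * tree.tree_.node_count

def evolve(X, y, pop_size=20, segments=8, generations=30, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 8, size=(pop_size, 4 * segments))
    for _ in range(generations):
        scores = np.array([fitness(g, X, y) for g in pop])
        parents = pop[np.argsort(scores)[-pop_size // 2:]]   # keep the best half
        children = parents.copy()
        mask = rng.random(children.shape) < 0.1              # point mutation
        children[mask] = rng.integers(0, 8, size=mask.sum())
        pop = np.vstack([parents, children])
    return pop[np.argmax([fitness(g, X, y) for g in pop])]

# Toy use on the toroidal concept of Section III.
rng = np.random.default_rng(4)
XY = rng.uniform(-4, 4, size=(256, 2))
r2 = (XY ** 2).sum(axis=1)
inside = ((r2 > 0.625) & (r2 <= 8.125)).astype(int)
print("best genome:", evolve(XY, inside))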
A fitness function for a genetic programming system has been introduced and serves to discover structure in the continuous domain. Although the work of [20] shows how to reduce the input set without losing any discriminating power, it did not achieve all the predictive power that the input space could provide; further work on a variety of different data sets should be performed to confirm this.

REFERENCES

[1] J. Quinlan, "Discovering rules from large collections of examples: a case study," in Expert Systems in the Microelectronic Age, D. Michie, Ed. Edinburgh University Press, 1979.
[2] J. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, 1986.
[3] Z. Fu, B. Golden, and S. Lele, "A GA based approach for building accurate decision trees," INFORMS Journal on Computing, vol. 15, no. 5, pp. 3–23, 2003.
[4] M. Ryan and V. Rayward-Smith, "The evolution of decision trees," in Proceedings of the Third Annual Conference on Genetic Programming, J. Koza, Ed. San Francisco, CA: Morgan Kaufmann, 1998, pp. 350–358.
[5] R. Marmelstein and G. Lamont, "Pattern classification using a hybrid genetic program-decision tree approach," in Proceedings of the Third Annual Conference on Genetic Programming, J. Koza, Ed. San Francisco, CA 94104, USA: Morgan Kaufmann, 1998.
[6] G. Folino, C. Puzzuti, and G. Spezzano, "Genetic programming and simulated annealing: a hybrid method to evolve decision trees," in Proceedings of the Third European Conference on Genetic Programming, R. Poli, W. Banzhaf, W. Langdon, J. Miler, P. Nordin, and T. Fogarty, Eds. Edinburgh, Scotland, UK: Springer-Verlag, 2000, pp. 294–303.
[7] D. Greene and S. Smith, "Competition-based induction of decision models from examples," Machine Learning, vol. 13, pp. 229–257, 1993.
[8] A. Freitas, "A GA for generalised rule induction," in Advances in Soft Computing, Engineering Design and Manufacturing. Berlin: Springer, 1999, pp. 340–353.
[9] D. Carvalho and A. Freitas, "A genetic-algorithm based solution for the problem of small conjuncts," in Principles of Data Mining and Knowledge Discovery (Proc. 4th European Conf., PKDD-2000, Lyon, France), ser. Lecture Notes in Artificial Intelligence, vol. 1910. Springer-Verlag, 2000, pp. 345–352.
[10] K. De Jong, W. Spears, and D. Gordon, "Using a genetic algorithm for concept learning," Machine Learning, vol. 13, pp. 161–188, 1993.
[11] C. Janikow, "A knowledge intensive GA for supervised learning," Machine Learning, vol. 13, pp. 189–228, 1993.
[12] Y. Hu, "A genetic programming approach to constructive induction," in Genetic Programming, Proceedings 3rd Annual Conference, San Mateo, California, 1998, pp. 146–151.
[13] A. Papagelis and D. Kalles, "GA tree: Genetically evolved decision trees," in Tools with AI, ICTAI Proceedings 12th IEEE International Conference, 13–15 Nov 2000, pp. 203–206.
[14] J. Eggermont, J. Kok, and W. Kosters, "Genetic programming for data classification: Partitioning the search space," in ACM Symposium on Applied Computing, 2004, pp. 1001–1005.
[15] D. Michie and R. Chambers, "Boxes: an experiment in adaptive control," in Machine Intelligence 2, E. Dale and D. Michie, Eds. Edinburgh: Oliver and Boyd, 1968.
[16] A. Konstam, "Linear discriminant analysis using GA," in Proceedings Symposium on Applied Computing, Indianapolis, IN, 1993.
[17] M. Withall, "Evolution of complete software systems," Ph.D. dissertation, Computer Science, Loughborough University, Loughborough, UK, 2003.
[18] M. Withall, C. Hinde, and R. Stone, "An improved representation for evolving programs," Genetic Programming and Evolvable Machines, vol. 10, no. 1, pp. 37–70, 2009.
[19] "Machine learning repository," http://archive.ics.uci.edu/ml/datasets.html, 2010.
[20] R. Jensen and Q. Shen, "New approaches to fuzzy-rough feature selection," IEEE Transactions on Fuzzy Systems, vol. 17, no. 4, pp. 824–838.

Chris J. Hinde is Professor of Computational Intelligence in the Department of Computer Science at Loughborough University. His interests are in various areas of Computational Intelligence, including fuzzy systems, evolutionary computing, neural networks and data mining. In particular he has been working on contradictory and inconsistent logics with a view to using them for data mining. A recently completed project was concerned with railway scheduling using an evolutionary system. He has been funded by various research bodies, including EPSRC, for most of his career and is a member of the EPSRC peer review college. Amongst other activities he has examined over 100 PhDs.

Anoud I. Bani-Hani is an EngD research student at Loughborough University, researching Knowledge Management in SMEs in the UK, with a specific focus on implementing an ERP system into a low-tech SME. Prior to joining the EngD scheme Anoud was a Lecturer at Jordan University of Science and Technology. She holds an undergraduate degree in Computer Science and Information Technology Systems from the same university and a Master's degree in Multimedia and Internet Computing from Loughborough University.

Thomas W. Jackson is a Senior Lecturer in the Department of Information Science at Loughborough University. Nicknamed 'Dr. Email' by the media, Tom and his research team work in two main research areas: Electronic Communication and Information Retrieval within the Workplace, and Applied and Theory-based Knowledge Management. He has published more than 70 papers in peer reviewed journals and conferences. He is on a number of editorial boards for international journals and reviews for many more. He has given a number of keynote talks throughout the world. In both research fields Tom has worked, and continues to work, closely with both private and public sector organisations throughout the world, and over the last few years he has brought in over £1M in research funding from research councils, including EPSRC. He is currently working on many research projects, including ones with The National Archives and the Welsh Assembly Government surrounding information management issues.

Yen P. Cheung gained an honours degree in Data Processing from Loughborough University of Technology (UK) in 1986 and completed her doctorate in Engineering at the Warwick Manufacturing Group (WMG), University of Warwick, in 1991. She worked initially as a Teaching Fellow and then as a Senior Teaching Fellow at WMG from 1988 to 1994, where she was responsible for the AI and IT courses in the M.Sc. programs in Engineering Business Management. She also ran IT courses for major companies such as British Aerospace, Royal Ordnance, London Electricity and British Airways in the UK and for Universiti Technology Malaysia in Malaysia. Besides teaching, she also supervised a large number of industry based projects. After moving to Australia in 1994, she joined the former School of Business Systems at Monash University.
Currently she is a Senior Lecturer at the Clayton School of IT at Monash University, Australia, where she has developed and delivered subjects in the areas of business information systems, systems development, process design, modelling and simulation. Her current research interests are in the areas of collaborative networks such as eMarketplaces, particularly for SMEs, intelligent algorithms for business systems, and applications and data mining of social media. She publishes regularly in both international conferences and journals in these areas of research.

Call for Papers and Special Issues

Aims and Scope

Journal of Emerging Technologies in Web Intelligence (JETWI, ISSN 1798-0461) is a peer reviewed and indexed international journal that aims at gathering the latest advances of various topics in web intelligence and reporting how organizations can gain competitive advantages by applying the different emergent techniques in real-world scenarios. Papers and studies which couple intelligence techniques and theories with specific web technology problems are mainly targeted. Survey and tutorial articles that emphasize the research and application of web intelligence in a particular domain are also welcomed. These areas include, but are not limited to, the following:

• Web 3.0
• Enterprise Mashup
• Ambient Intelligence (AmI)
• Situational Applications
• Emerging Web-based Systems
• Ambient Awareness
• Ambient and Ubiquitous Learning
• Ambient Assisted Living
• Telepresence
• Lifelong Integrated Learning
• Smart Environments
• Web 2.0 and Social intelligence
• Context Aware Ubiquitous Computing
• Intelligent Brokers and Mediators
• Web Mining and Farming
• Wisdom Web
• Web Security
• Web Information Filtering and Access Control Models
• Web Services and Semantic Web
• Human-Web Interaction
• Web Technologies and Protocols
• Web Agents and Agent-based Systems
• Agent Self-organization, Learning, and Adaptation
• Agent-based Knowledge Discovery
• Agent-mediated Markets
• Knowledge Grid and Grid intelligence
• Knowledge Management, Networks, and Communities
• Agent Infrastructure and Architecture
• Cooperative Problem Solving
• Distributed Intelligence and Emergent Behavior
• Information Ecology
• Mediators and Middlewares
• Granular Computing for the Web
• Ontology Engineering
• Personalization Techniques
• Semantic Web
• Web based Support Systems
• Web based Information Retrieval Support Systems
• Web Services, Services Discovery & Composition
• Ubiquitous Imaging and Multimedia
• Wearable, Wireless and Mobile e-interfacing
• E-Applications
• Cloud Computing
• Web-Oriented Architectures

Special Issue Guidelines

Special issues feature specifically aimed and targeted topics of interest contributed by authors responding to a particular Call for Papers or by invitation, edited by guest editor(s). We encourage you to submit proposals for creating special issues in areas that are of interest to the Journal. Preference will be given to proposals that cover some unique aspect of the technology and ones that include subjects that are timely and useful to the readers of the Journal. A Special Issue is typically made of 10 to 15 papers, with each paper 8 to 12 pages in length. The following information should be included as part of the proposal:

• Proposed title for the Special Issue
• Description of the topic area to be focused upon and justification
• Review process for the selection and rejection of papers
• Name, contact, position, affiliation, and biography of the Guest Editor(s)
• List of potential reviewers
• Potential authors for the issue
• Tentative timetable for the call for papers and reviews

If a proposal is accepted, the guest editor will be responsible for:

• Preparing the "Call for Papers" to be included on the Journal's Web site.
• Distribution of the Call for Papers broadly to various mailing lists and sites.
• Getting submissions, arranging the review process, making decisions, and carrying out all correspondence with the authors. Authors should be informed of the Instructions for Authors.
• Providing us the completed and approved final versions of the papers formatted in the Journal's style, together with all authors' contact information.
• Writing a one- or two-page introductory editorial to be published in the Special Issue.

Special Issue for a Conference/Workshop

A special issue for a Conference/Workshop is usually released in association with the committee members of the Conference/Workshop, such as general chairs and/or program chairs, who are appointed as the Guest Editors of the Special Issue. A Special Issue for a Conference/Workshop is typically made of 10 to 15 papers, with each paper 8 to 12 pages in length. Guest Editors are involved in the following steps in guest-editing a Special Issue based on a Conference/Workshop:

• Selecting a Title for the Special Issue, e.g. "Special Issue: Selected Best Papers of XYZ Conference".
• Sending us a formal "Letter of Intent" for the Special Issue.
• Creating a "Call for Papers" for the Special Issue, posting it on the conference web site, and publicizing it to the conference attendees. Information about the Journal and Academy Publisher can be included in the Call for Papers.
• Establishing criteria for paper selection/rejection. The papers can be nominated based on multiple criteria, e.g. rank in the review process plus the evaluation from the Session Chairs and the feedback from the Conference attendees.
• Selecting and inviting submissions, arranging the review process, making decisions, and carrying out all correspondence with the authors. Authors should be informed of the Author Instructions. Usually, the Proceedings manuscripts should be expanded and enhanced.
• Providing us the completed and approved final versions of the papers formatted in the Journal's style, together with all authors' contact information.
• Writing a one- or two-page introductory editorial to be published in the Special Issue.

More information is available on the web site at http://www.academypublisher.com/jetwi/.

(Contents Continued from Back Cover)

The Developing Economies' Banks Branches Operational Strategy in the Era of E-Banking: The Case of Jordan
Yazan Khalid Abed-Allah Migdadi 189

Evolving Polynomials of the Inputs for Decision Tree Building
Chris J. Hinde, Anoud I. Bani-Hani, Thomas W. Jackson, and Yen P. Cheung 198