IJCSI proceedings are currently indexed by:

© IJCSI PUBLICATION 2010
www.IJCSI.org

IJCSI Publicity Board 2010

Dr. Borislav D Dimitrov
Department of General Practice, Royal College of Surgeons in Ireland
Dublin, Ireland

Dr. Vishal Goyal
Department of Computer Science, Punjabi University
Patiala, India

Mr. Nehinbe Joshua
University of Essex
Colchester, Essex, UK

Mr. Vassilis Papataxiarhis
Department of Informatics and Telecommunications
National and Kapodistrian University of Athens, Athens, Greece

EDITORIAL

In this second edition of 2010, we bring forward issues from various dynamic computer science areas ranging from system performance, computer vision, artificial intelligence, software engineering, multimedia, pattern recognition, information retrieval, databases and networking, among others.

We thank all our reviewers for providing constructive comments on the papers sent to them for review. This helps enormously in improving the quality of the papers published in this issue.

IJCSI is still maintaining its policy of sending print copies of the journal to all corresponding authors worldwide free of charge. Apart from the availability of the full texts from the journal website, all published papers are deposited in open-access repositories to make access easier and ensure the continuous availability of its proceedings.

We are pleased to present IJCSI Volume 7, Issue 2, split into five numbers (this is IJCSI Vol. 7, Issue 2, No. 3). The acceptance rate for this issue is 27.55%: out of the 98 papers submitted for review, 27 were eventually accepted for publication in this month's issue.

We wish you a happy reading!

IJCSI Editorial Board
March 2010
www.IJCSI.org

IJCSI Editorial Board 2010

Dr Tristan Vanrullen
Chief Editor
LPL, Laboratoire Parole et Langage - CNRS - Aix en Provence, France
LABRI, Laboratoire Bordelais de Recherche en Informatique - INRIA - Bordeaux, France
LEEE, Laboratoire d'Esthétique et Expérimentations de l'Espace - Université d'Auvergne, France

Dr Constantino Malagón
Associate Professor
Nebrija University
Spain

Dr Lamia Fourati Chaari
Associate Professor
Multimedia and Informatics Higher Institute in SFAX
Tunisia

Dr Mokhtar Beldjehem
Professor
Sainte-Anne University
Halifax, NS, Canada

Dr Pascal Chatonnay
Assistant Professor (Maître de Conférences)
Laboratoire d'Informatique de l'Université de Franche-Comté
Université de Franche-Comté
France

Dr Yee-Ming Chen
Professor
Department of Industrial Engineering and Management
Yuan Ze University
Taiwan

Dr Vishal Goyal
Assistant Professor
Department of Computer Science
Punjabi University
Patiala, India

Dr Natarajan Meghanathan
Assistant Professor, REU Program Director
Department of Computer Science
Jackson State University
Jackson, USA

Dr Deepak Laxmi Narasimha
Department of Software Engineering, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia

Dr Navneet Agrawal
Assistant Professor
Department of ECE, College of Technology & Engineering, MPUAT, Udaipur 313001 Rajasthan, India

Prof N. Jaisankar
Assistant Professor
School of Computing Sciences, VIT University
Vellore, Tamilnadu, India

IJCSI Reviewers Committee 2010

• Mr. Markus Schatten, University of Zagreb, Faculty of Organization and Informatics, Croatia • Mr.
Vassilis Papataxiarhis, Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Athens, Greece • Dr Modestos Stavrakis, University of the Aegean, Greece • Dr Fadi KHALIL, LAAS -- CNRS Laboratory, France • Dr Dimitar Trajanov, Faculty of Electrical Engineering and Information technologies, ss. Cyril and Methodius Univesity - Skopje, Macedonia • Dr Jinping Yuan, College of Information System and Management,National Univ. of Defense Tech., China • Dr Alexis Lazanas, Ministry of Education, Greece • Dr Stavroula Mougiakakou, University of Bern, ARTORG Center for Biomedical Engineering Research, Switzerland • Dr Cyril de Runz, CReSTIC-SIC, IUT de Reims, University of Reims, France • Mr. Pramodkumar P. Gupta, Dept of Bioinformatics, Dr D Y Patil University, India • Dr Alireza Fereidunian, School of ECE, University of Tehran, Iran • Mr. Fred Viezens, Otto-Von-Guericke-University Magdeburg, Germany • Dr. Richard G. Bush, Lawrence Technological University, United States • Dr. Ola Osunkoya, Information Security Architect, USA • Mr. Kotsokostas N.Antonios, TEI Piraeus, Hellas • Prof Steven Totosy de Zepetnek, U of Halle-Wittenberg & Purdue U & National Sun Yat-sen U, Germany, USA, Taiwan • Mr. M Arif Siddiqui, Najran University, Saudi Arabia • Ms. Ilknur Icke, The Graduate Center, City University of New York, USA • Prof Miroslav Baca, Faculty of Organization and Informatics, University of Zagreb, Croatia • Dr. Elvia Ruiz Beltrán, Instituto Tecnológico de Aguascalientes, Mexico • Mr. Moustafa Banbouk, Engineer du Telecom, UAE • Mr. Kevin P. Monaghan, Wayne State University, Detroit, Michigan, USA • Ms. Moira Stephens, University of Sydney, Australia • Ms. Maryam Feily, National Advanced IPv6 Centre of Excellence (NAV6) , Universiti Sains Malaysia (USM), Malaysia • Dr. Constantine YIALOURIS, Informatics Laboratory Agricultural University of Athens, Greece • Mrs. Angeles Abella, U. de Montreal, Canada • Dr. Patrizio Arrigo, CNR ISMAC, italy • Mr. Anirban Mukhopadhyay, B.P.Poddar Institute of Management & Technology, India • Mr. Dinesh Kumar, DAV Institute of Engineering & Technology, India • Mr. Jorge L. Hernandez-Ardieta, INDRA SISTEMAS / University Carlos III of Madrid, Spain • Mr. AliReza Shahrestani, University of Malaya (UM), National Advanced IPv6 Centre of Excellence (NAv6), Malaysia • Mr. Blagoj Ristevski, Faculty of Administration and Information Systems Management - Bitola, Republic of Macedonia • Mr. Mauricio Egidio Cantão, Department of Computer Science / University of São Paulo, Brazil • Mr. Jules Ruis, Fractal Consultancy, The netherlands • Mr. Mohammad Iftekhar Husain, University at Buffalo, USA • Dr. Deepak Laxmi Narasimha, Department of Software Engineering, Faculty of Computer Science and Information Technology, University of Malaya, Malaysia • Dr. Paola Di Maio, DMEM University of Strathclyde, UK • Dr. Bhanu Pratap Singh, Institute of Instrumentation Engineering, Kurukshetra University Kurukshetra, India • Mr. Sana Ullah, Inha University, South Korea • Mr. Cornelis Pieter Pieters, Condast, The Netherlands • Dr. Amogh Kavimandan, The MathWorks Inc., USA • Dr. Zhinan Zhou, Samsung Telecommunications America, USA • Mr. Alberto de Santos Sierra, Universidad Politécnica de Madrid, Spain • Dr. Md. Atiqur Rahman Ahad, Department of Applied Physics, Electronics & Communication Engineering (APECE), University of Dhaka, Bangladesh • Dr. Charalampos Bratsas, Lab of Medical Informatics, Medical Faculty, Aristotle University, Thessaloniki, Greece • Ms. 
Alexia Dini Kounoudes, Cyprus University of Technology, Cyprus • Mr. Anthony Gesase, University of Dar es salaam Computing Centre, Tanzania • Dr. Jorge A. Ruiz-Vanoye, Universidad Juárez Autónoma de Tabasco, Mexico • Dr. Alejandro Fuentes Penna, Universidad Popular Autónoma del Estado de Puebla, México • Dr. Ocotlán Díaz-Parra, Universidad Juárez Autónoma de Tabasco, México • Mrs. Nantia Iakovidou, Aristotle University of Thessaloniki, Greece • Mr. Vinay Chopra, DAV Institute of Engineering & Technology, Jalandhar • Ms. Carmen Lastres, Universidad Politécnica de Madrid - Centre for Smart Environments, Spain • Dr. Sanja Lazarova-Molnar, United Arab Emirates University, UAE • Mr. Srikrishna Nudurumati, Imaging & Printing Group R&D Hub, Hewlett-Packard, India • Dr. Olivier Nocent, CReSTIC/SIC, University of Reims, France • Mr. Burak Cizmeci, Isik University, Turkey • Dr. Carlos Jaime Barrios Hernandez, LIG (Laboratory Of Informatics of Grenoble), France • Mr. Md. Rabiul Islam, Rajshahi university of Engineering & Technology (RUET), Bangladesh • Dr. LAKHOUA Mohamed Najeh, ISSAT - Laboratory of Analysis and Control of Systems, Tunisia • Dr. Alessandro Lavacchi, Department of Chemistry - University of Firenze, Italy • Mr. Mungwe, University of Oldenburg, Germany • Mr. Somnath Tagore, Dr D Y Patil University, India • Ms. Xueqin Wang, ATCS, USA • Dr. Borislav D Dimitrov, Department of General Practice, Royal College of Surgeons in Ireland, Dublin, Ireland • Dr. Fondjo Fotou Franklin, Langston University, USA • Dr. Vishal Goyal, Department of Computer Science, Punjabi University, Patiala, India • Mr. Thomas J. Clancy, ACM, United States • Dr. Ahmed Nabih Zaki Rashed, Dr. in Electronic Engineering, Faculty of Electronic Engineering, menouf 32951, Electronics and Electrical Communication Engineering Department, Menoufia university, EGYPT, EGYPT • Dr. Rushed Kanawati, LIPN, France • Mr. Koteshwar Rao, K G Reddy College Of ENGG.&TECH,CHILKUR, RR DIST.,AP, India • Mr. M. Nagesh Kumar, Department of Electronics and Communication, J.S.S. research foundation, Mysore University, Mysore-6, India • Dr. Ibrahim Noha, Grenoble Informatics Laboratory, France • Mr. Muhammad Yasir Qadri, University of Essex, UK • Mr. Annadurai .P, KMCPGS, Lawspet, Pondicherry, India, (Aff. Pondicherry Univeristy, India • Mr. E Munivel , CEDTI (Govt. of India), India • Dr. Chitra Ganesh Desai, University of Pune, India • Mr. Syed, Analytical Services & Materials, Inc., USA • Dr. Mashud Kabir, Department of Computer Science, University of Tuebingen, Germany • Mrs. Payal N. Raj, Veer South Gujarat University, India • Mrs. Priti Maheshwary, Maulana Azad National Institute of Technology, Bhopal, India • Mr. Mahesh Goyani, S.P. University, India, India • Mr. Vinay Verma, Defence Avionics Research Establishment, DRDO, India • Dr. George A. Papakostas, Democritus University of Thrace, Greece • Mr. Abhijit Sanjiv Kulkarni, DARE, DRDO, India • Mr. Kavi Kumar Khedo, University of Mauritius, Mauritius • Dr. B. Sivaselvan, Indian Institute of Information Technology, Design & Manufacturing, Kancheepuram, IIT Madras Campus, India • Dr. Partha Pratim Bhattacharya, Greater Kolkata College of Engineering and Management, West Bengal University of Technology, India • Mr. Manish Maheshwari, Makhanlal C University of Journalism & Communication, India • Dr. Siddhartha Kumar Khaitan, Iowa State University, USA • Dr. Mandhapati Raju, General Motors Inc, USA • Dr. M.Iqbal Saripan, Universiti Putra Malaysia, Malaysia • Mr. 
Ahmad Shukri Mohd Noor, University Malaysia Terengganu, Malaysia • Mr. Selvakuberan K, TATA Consultancy Services, India • Dr. Smita Rajpal, Institute of Technology and Management, Gurgaon, India • Mr. Rakesh Kachroo, Tata Consultancy Services, India • Mr. Raman Kumar, National Institute of Technology, Jalandhar, Punjab., India • Mr. Nitesh Sureja, S.P.University, India • Dr. M. Emre Celebi, Louisiana State University, Shreveport, USA • Dr. Aung Kyaw Oo, Defence Services Academy, Myanmar • Mr. Sanjay P. Patel, Sankalchand Patel College of Engineering, Visnagar, Gujarat, India • Dr. Pascal Fallavollita, Queens University, Canada • Mr. Jitendra Agrawal, Rajiv Gandhi Technological University, Bhopal, MP, India • Mr. Ismael Rafael Ponce Medellín, Cenidet (Centro Nacional de Investigación y Desarrollo Tecnológico), Mexico • Mr. Supheakmungkol SARIN, Waseda University, Japan • Mr. Shoukat Ullah, Govt. Post Graduate College Bannu, Pakistan • Dr. Vivian Augustine, Telecom Zimbabwe, Zimbabwe • Mrs. Mutalli Vatila, Offshore Business Philipines, Philipines • Dr. Emanuele Goldoni, University of Pavia, Dept. of Electronics, TLC & Networking Lab, Italy • Mr. Pankaj Kumar, SAMA, India • Dr. Himanshu Aggarwal, Punjabi University,Patiala, India • Dr. Vauvert Guillaume, Europages, France • Prof Yee Ming Chen, Department of Industrial Engineering and Management, Yuan Ze University, Taiwan • Dr. Constantino Malagón, Nebrija University, Spain • Prof Kanwalvir Singh Dhindsa, B.B.S.B.Engg.College, Fatehgarh Sahib (Punjab), India • Mr. Angkoon Phinyomark, Prince of Singkla University, Thailand • Ms. Nital H. Mistry, Veer Narmad South Gujarat University, Surat, India • Dr. M.R.Sumalatha, Anna University, India • Mr. Somesh Kumar Dewangan, Disha Institute of Management and Technology, India • Mr. Raman Maini, Punjabi University, Patiala(Punjab)-147002, India • Dr. Abdelkader Outtagarts, Alcatel-Lucent Bell-Labs, France • Prof Dr. Abdul Wahid, AKG Engg. College, Ghaziabad, India • Mr. Prabu Mohandas, Anna University/Adhiyamaan College of Engineering, india • Dr. Manish Kumar Jindal, Panjab University Regional Centre, Muktsar, India • Prof Mydhili K Nair, M S Ramaiah Institute of Technnology, Bangalore, India • Dr. C. Suresh Gnana Dhas, VelTech MultiTech Dr.Rangarajan Dr.Sagunthala Engineering College,Chennai,Tamilnadu, India • Prof Akash Rajak, Krishna Institute of Engineering and Technology, Ghaziabad, India • Mr. Ajay Kumar Shrivastava, Krishna Institute of Engineering & Technology, Ghaziabad, India • Mr. Deo Prakash, SMVD University, Kakryal(J&K), India • Dr. Vu Thanh Nguyen, University of Information Technology HoChiMinh City, VietNam • Prof Deo Prakash, SMVD University (A Technical University open on I.I.T. Pattern) Kakryal (J&K), India • Dr. Navneet Agrawal, Dept. of ECE, College of Technology & Engineering, MPUAT, Udaipur 313001 Rajasthan, India • Mr. Sufal Das, Sikkim Manipal Institute of Technology, India • Mr. Anil Kumar, Sikkim Manipal Institute of Technology, India • Dr. B. Prasanalakshmi, King Saud University, Saudi Arabia. • Dr. K D Verma, S.V. (P.G.) College, Aligarh, India • Mr. Mohd Nazri Ismail, System and Networking Department, University of Kuala Lumpur (UniKL), Malaysia • Dr. Nguyen Tuan Dang, University of Information Technology, Vietnam National University Ho Chi Minh city, Vietnam • Dr. Abdul Aziz, University of Central Punjab, Pakistan • Dr. P. Vasudeva Reddy, Andhra University, India • Mrs. Savvas A. Chatzichristofis, Democritus University of Thrace, Greece • Mr. 
Marcio Dorn, Federal University of Rio Grande do Sul - UFRGS Institute of Informatics, Brazil • Mr. Luca Mazzola, University of Lugano, Switzerland • Mr. Nadeem Mahmood, Department of Computer Science, University of Karachi, Pakistan • Mr. Hafeez Ullah Amin, Kohat University of Science & Technology, Pakistan • Dr. Professor Vikram Singh, Ch. Devi Lal University, Sirsa (Haryana), India • Mr. M. Azath, Calicut/Mets School of Enginerring, India • Dr. J. Hanumanthappa, DoS in CS, University of Mysore, India • Dr. Shahanawaj Ahamad, Department of Computer Science, King Saud University, Saudi Arabia • Dr. K. Duraiswamy, K. S. Rangasamy College of Technology, India • Prof. Dr Mazlina Esa, Universiti Teknologi Malaysia, Malaysia • Dr. P. Vasant, Power Control Optimization (Global), Malaysia • Dr. Taner Tuncer, Firat University, Turkey • Dr. Norrozila Sulaiman, University Malaysia Pahang, Malaysia • Prof. S K Gupta, BCET, Guradspur, India • Dr. Latha Parameswaran, Amrita Vishwa Vidyapeetham, India • Mr. M. Azath, Anna University, India • Dr. P. Suresh Varma, Adikavi Nannaya University, India • Prof. V. N. Kamalesh, JSS Academy of Technical Education, India • Dr. D Gunaseelan, Ibri College of Technology, Oman • Mr. Sanjay Kumar Anand, CDAC, India • Mr. Akshat Verma, CDAC, India • Mrs. Fazeela Tunnisa, Najran University, Kingdom of Saudi Arabia • Mr. Hasan Asil, Islamic Azad University Tabriz Branch (Azarshahr), Iran • Prof. Dr Sajal Kabiraj, Fr. C Rodrigues Institute of Management Studies (Affiliated to University of Mumbai, India), India • Mr. Syed Fawad Mustafa, GAC Center, Shandong University, China • Dr. Natarajan Meghanathan, Jackson State University, Jackson, MS, USA • Prof. Selvakani Kandeeban, Francis Xavier Engineering College, India • Mr. Tohid Sedghi, Urmia University, Iran • Dr. S. Sasikumar, PSNA College of Engg and Tech, Dindigul, India • Dr. Anupam Shukla, Indian Institute of Information Technology and Management Gwalior, India • Mr. Rahul Kala, Indian Institute of Inforamtion Technology and Management Gwalior, India • Dr. A V Nikolov, National University of Lesotho, Lesotho • Mr. Kamal Sarkar, Department of Computer Science and Engineering, Jadavpur University, India • Dr. Mokhled S. AlTarawneh, Computer Engineering Dept., Faculty of Engineering, Mutah University, Jordan, Jordan • Prof. Sattar J Aboud, Iraqi Council of Representatives, Iraq-Baghdad • Dr. Prasant Kumar Pattnaik, Department of CSE, KIST, India • Dr. Mohammed Amoon, King Saud University, Saudi Arabia • Dr. Tsvetanka Georgieva, Department of Information Technologies, St. Cyril and St. Methodius University of Veliko Tarnovo, Bulgaria • Dr. Eva Volna, University of Ostrava, Czech Republic • Mr. Ujjal Marjit, University of Kalyani, West-Bengal, India • Dr. Prasant Kumar Pattnaik, KIST,Bhubaneswar,India, India • Dr. Guezouri Mustapha, Department of Electronics, Faculty of Electrical Engineering, University of Science and Technology (USTO), Oran, Algeria • Mr. Maniyar Shiraz Ahmed, Najran University, Najran, Saudi Arabia • Dr. Sreedhar Reddy, JNTU, SSIETW, Hyderabad, India • Mr. Bala Dhandayuthapani Veerasamy, Mekelle University, Ethiopa • Mr. Arash Habibi Lashkari, University of Malaya (UM), Malaysia • Mr. Rajesh Prasad, LDC Institute of Technical Studies, Allahabad, India • Ms. Habib Izadkhah, Tabriz University, Iran • Dr. Lokesh Kumar Sharma, Chhattisgarh Swami Vivekanand Technical University Bhilai, India • Mr. Kuldeep Yadav, IIIT Delhi, India • Dr. Naoufel Kraiem, Institut Superieur d'Informatique, Tunisia • Prof. 
Frank Ortmeier, Otto-von-Guericke-Universitaet Magdeburg, Germany • Mr. Ashraf Aljammal, USM, Malaysia • Mrs. Amandeep Kaur, Department of Computer Science, Punjabi University, Patiala, Punjab, India • Mr. Babak Basharirad, University Technology of Malaysia, Malaysia • Mr. Avinash singh, Kiet Ghaziabad, India • Dr. Miguel Vargas-Lombardo, Technological University of Panama, Panama • Dr. Tuncay Sevindik, Firat University, Turkey • Ms. Pavai Kandavelu, Anna University Chennai, India • Mr. Ravish Khichar, Global Institute of Technology, India • Mr Aos Alaa Zaidan Ansaef, Multimedia University, Cyberjaya, Malaysia • Dr. Awadhesh Kumar Sharma, Dept. of CSE, MMM Engg College, Gorakhpur-273010, UP, India • Mr. Qasim Siddique, FUIEMS, Pakistan • Dr. Le Hoang Thai, University of Science, Vietnam National University - Ho Chi Minh City, Vietnam • Dr. Saravanan C, NIT, Durgapur, India • Dr. Vijay Kumar Mago, DAV College, Jalandhar, India • Dr. Do Van Nhon, University of Information Technology, Vietnam • Mr. Georgios Kioumourtzis, University of Patras, Greece • Mr. Amol D.Potgantwar, SITRC Nasik, India • Mr. Lesedi Melton Masisi, Council for Scientific and Industrial Research, South Africa • Dr. Karthik.S, Department of Computer Science & Engineering, SNS College of Technology, India • Mr. Nafiz Imtiaz Bin Hamid, Department of Electrical and Electronic Engineering, Islamic University of Technology (IUT), Bangladesh • Mr. Muhammad Imran Khan, Universiti Teknologi PETRONAS, Malaysia • Dr. Abdul Kareem M. Radhi, Information Engineering - Nahrin University, Iraq • Dr. Mohd Nazri Ismail, University of Kuala Lumpur, Malaysia • Dr. Manuj Darbari, BBDNITM, Institute of Technology, A-649, Indira Nagar, Lucknow 226016, India • Ms. Izerrouken, INP-IRIT, France • Mr. Nitin Ashokrao Naik, Dept. of Computer Science, Yeshwant Mahavidyalaya, Nanded, India • Mr. Nikhil Raj, National Institute of Technology, Kurukshetra, India • Prof. Maher Ben Jemaa, National School of Engineers of Sfax, Tunisia • Prof. Rajeshwar Singh, BRCM College of Engineering and Technology, Bahal Bhiwani, Haryana, India • Mr. Gaurav Kumar, Department of Computer Applications, Chitkara Institute of Engineering and Technology, Rajpura, Punjab, India • Mr. Ajeet Kumar Pandey, Indian Institute of Technology, Kharagpur, India • Mr. Rajiv Phougat, IBM Corporation, USA • Mrs. Aysha V, College of Applied Science Pattuvam affiliated with Kannur University, India • Dr. Debotosh Bhattacharjee, Department of Computer Science and Engineering, Jadavpur University, Kolkata700032, India • Dr. Neelam Srivastava, Institute of engineering & Technology, Lucknow, India • Prof. Sweta Verma, Galgotia's College of Engineering & Technology, Greater Noida, India • Mr. Harminder Singh BIndra, MIMIT, INDIA • Dr. Lokesh Kumar Sharma, Chhattisgarh Swami Vivekanand Technical University, Bhilai, India • Mr. Tarun Kumar, U.P. Technical University/Radha Govinend Engg. College, India • Mr. Tirthraj Rai, Jawahar Lal Nehru University, New Delhi, India • Mr. Akhilesh Tiwari, Madhav Institute of Technology & Science, India • Mr. Dakshina Ranjan Kisku, Dr. B. C. Roy Engineering College, WBUT, India • Ms. Anu Suneja, Maharshi Markandeshwar University, Mullana, Haryana, India • Mr. Munish Kumar Jindal, Punjabi University Regional Centre, Jaito (Faridkot), India • Dr. Ashraf Bany Mohammed, Management Information Systems Department, Faculty of Administrative and Financial Sciences, Petra University, Jordan • Mrs. Jyoti Jain, R.G.P.V. Bhopal, India • Dr. 
Lamia Chaari, SFAX University, Tunisia • Mr. Akhter Raza Syed, Department of Computer Science, University of Karachi, Pakistan • Prof. Khubaib Ahmed Qureshi, Information Technology Department, HIMS, Hamdard University, Pakistan • Prof. Boubker Sbihi, Ecole des Sciences de L'Information, Morocco • Dr. S. M. Riazul Islam, Inha University, South Korea • Prof. Lokhande S.N., S.R.T.M.University, Nanded (MH), India • Dr. Vijay H Mankar, Dept. of Electronics, Govt. Polytechnic, Nagpur, India • Dr. M. Sreedhar Reddy, JNTU, Hyderabad, SSIETW, India • Mr. Ojesanmi Olusegun, Ajayi Crowther University, Oyo, Nigeria • Ms. Mamta Juneja, RBIEBT, PTU, India • Dr. Ekta Walia Bhullar, Maharishi Markandeshwar University, Mullana Ambala (Haryana), India • Prof. Chandra Mohan, John Bosco Engineering College, India TABLE OF CONTENTS 1. A General Simulation Framework for Supply Chain Modeling: State of the Art and Case Study – pg 1-9 Antonio Cimino, Mechanical Department, University of Calabria, Rende (CS), 87036, Italy Francesco Longo, Mechanical Department, University of Calabria, Rende (CS), 87036, Italy Giovanni Mirabelli, Mechanical Department, University of Calabria, Rende (CS), 87036, Italy 2. Database Reverse Engineering based on Association Rule Mining – pg 10-15 Nattapon Pannurat, Faculty of Information Sciences, Nakhon Ratchasima College, 290 Moo 2, Mitraphap Road, Nakhon Ratchasima, 30000, Thailand Nittaya Kerdprasop, Data Engineering and Knowledge Discovery Research Unit, Suranaree University of Technology, 111 University Avenue, Nakhon Ratchasima, 30000, Thailand Kittisak Kerdprasop, Data Engineering and Knowledge Discovery Research Unit, Suranaree University of Technology, 111 University Avenue, Nakhon Ratchasima, 30000, Thailand 3. A New Approach to Keyphrase Extraction Using Neural Networks – pg 16-25 Kamal Sarkar, Computer Science and Engineering Department, Jadavpur University, Kolkata700 032, India Mita Nasipuri, Computer Science and Engineering Department, Jadavpur University, Kolkata700 032, India Suranjan Ghose, Computer Science and Engineering Department, Jadavpur University, Kolkata700 032, India 4. C Implementation & comparison of companding & silence audio compression techniques – pg 26-30 Kruti Dangarwala, Department of Computer Engineering, Sri S'ad Vidya Mandal Institute of Technology, Bharuch, Gujarat, India Jigar Shah, Department of Electronics and Telecommunication Engineering, Sri S'ad Vidya Mandal Institute of Technology, Bharuch, Gujarat, India 5. Color Image Compression Based On Wavelet Packet Best Tree – pg 31-35 G. K. Kharate, Matoshri College of Engineering and Research Centre, Nashik - 422003, Maharashtra, India V. H. Patil, Department of Computer Engineering, University of Pune 6. A Pedagogical Evaluation and Discussion about the Lack of Cohesion in Method (LCOM) Metric Using Field Experiment – pg 36-43 Ezekiel Okike, School of Computer Studies, Kampala International University , Kampala, Uganda 256, Uganda IJCSI International Journal of Computer Science Issues, Vol. 
7, Issue 2, No 3, March 2010
ISSN (Online): 1694-0784 ISSN (Print): 1694-0814

A General Simulation Framework for Supply Chain Modeling: State of the Art and Case Study

Antonio Cimino1, Francesco Longo2 and Giovanni Mirabelli3
1, 2, 3 Mechanical Department, University of Calabria, Rende (CS), 87036, Italy

Abstract
Nowadays there is a large availability of discrete event simulation software that can be easily used in different domains: from industry to supply chain, from healthcare to business management, from training to complex systems design. The simulation engines of commercial discrete event simulation software use specific rules and logics for simulation time and event management. Difficulties and limitations come up when commercial discrete event simulation software is used for modeling complex real-world systems (i.e. supply chains, industrial plants). The objective of this paper is twofold: first, a state of the art on commercial discrete event simulation software and an overview of discrete event simulation model development using general purpose programming languages are presented; then a Supply Chain Order Performance Simulator (SCOPS, developed in C++) for investigating the inventory management problem along the supply chain under different supply chain scenarios is proposed to readers.
Keywords: Discrete Event Simulation, Simulation Languages, Supply Chain, Inventory Management.

1. Introduction

As reported in [1], discrete-event simulation software selection can be an exceedingly difficult task, especially for inexpert users. The simulation software selection problem was already known many years ago. A simulation buyer's guide that identifies possible features to consider in simulation software selection is proposed in [2]. The guide includes in its analysis several aspects such as Input, Processing, Output, Environment, Vendor and Costs. A survey on users' requirements about discrete-event simulation software is presented in [3]. The analysis shows that simulation software with good visualization/animation properties is easier to use but limited in the case of complex and non-standard problems. Further limitations include lack of software compatibility, output analysis tools and advanced programming languages. In [4] and [5] the functionalities and potentialities of different commercial discrete-event simulation software are reported in order to support users in software selection. In this case the author provides the reader with information about software vendor, primary software applications, hardware platform requirements, simulation animation, support, training and pricing.
Needless to say, Modeling & Simulation should be used when analytical approaches do not succeed in identifying proper solutions for analyzing complex systems (i.e. supply chains, industrial plants, etc.). For many of these systems, simulation models must be: (i) flexible and parametric (for supporting scenario evaluation), (ii) time efficient (even for very complex real-world systems) and (iii) repetitive in their architectures for scalability purposes [6].
Let us consider the traditional modeling approach proposed by two commercial discrete event simulation software packages, eM-Plant by Siemens PLM Software and Anylogic by XJ Technologies. Both of them propose a typical object-oriented modeling approach.
Each discrete event simulation model is made up of system state variables, entities and attributes, list processing, activities and delays. Usually complex systems involve high numbers of resources and entities flowing within the simulation model. The time required for executing a simulation run depends on the number of entities in the simulation model: the higher the number of entities, the higher the time required for executing a simulation run. In addition, library objects, which should be used for modeling static entities, very often fall short of recreating the real system with satisfactory accuracy. In other words, the traditional modeling approach (proposed by eM-Plant and Anylogic as well as by a number of discrete event simulation software) presents two problems: (i) difficulties in modeling complex scenarios; (ii) too many entities could cause computationally heavy simulation models. Further information on discrete event simulation software can be found in [7].
An alternative to commercial discrete event simulation software is to develop simulation models based on general purpose programming languages (i.e. C++, Java). The use of general purpose programming languages makes it possible to develop ad-hoc simulation models with class objects able to carefully recreate the behavior of the real-world system. The objective of this paper is twofold: first, a state of the art on commercial discrete event simulation software and an overview of discrete event simulation model development using general purpose programming languages are presented; then a Supply Chain Order Performance Simulator (SCOPS, developed in C++) for investigating the inventory management problem along the supply chain under different supply chain scenarios is proposed to readers.
Before getting into the details of the work, a brief overview of the paper sections is reported. Section 2 provides the reader with a detailed description of different commercial discrete event simulation software. Section 3 presents a general overview of programming languages and describes the main steps to develop a simulation model based on general purpose programming languages. Section 4 presents a three-stage supply chain simulation model (called SCOPS) used for investigating inventory problems along the supply chain. Section 5 describes the simulation experiments carried out by using the simulation model. Finally, the last section reports conclusions and research activities still ongoing.

2. Discrete Event Simulation Software

Table 1 reports the results of a survey on the most widely used discrete event simulation software (conducted on 100 people working in the simulation field). The survey considers, among others, some critical aspects such as domains of application (specifically manufacturing and logistics), 3D and virtual reality potentialities, simulation languages, prices, etc. For each aspect and for each software the survey reports a score between 0 and 10. Table 1 helps modelers in discrete event simulation software selection. Moreover, the following sections report a brief description of all the software of Table 1 in terms of domains of applicability, types of libraries (i.e. modeling libraries, optimization libraries, etc.), input-output functionalities, animation functionalities, etc.

2.1 Anylogic

Anylogic is a Java-based simulation software by XJ Technologies [8], used for forecasting and strategic planning, process analysis and optimization, optimal operational management and process visualization. It is widely used in logistics, supply chains, manufacturing, healthcare, consumer markets, project management, business processes and military applications. Anylogic supports Agent Based, Discrete Event and System Dynamics modeling and simulation.
The latest Anylogic version (Anylogic 6) was released in 2007; it supports both graphical and flow-chart modeling and provides the user with Java code for simulation model extension. For input data analysis, Anylogic provides the user with Stat-Fit (a simulation support software by Geer Mountain Software Corp.) for distribution fitting and statistical analysis. Output analysis functionalities are provided by different types of datasets, charts and histograms (including an export function to text files or Excel spreadsheets). Finally, simulation optimization is performed by using OptQuest, an optimization tool integrated in Anylogic.

2.2 Arena

Arena is a simulation software by Rockwell Corporation [9] and it is used in different application domains: from manufacturing to supply chain (including logistics, warehousing and distribution), from customer service and strategies to internal business processes. Arena (as Anylogic) provides the user with object libraries for system modeling and with a domain-specific simulation language, SIMAN [10]. Simulation optimizations are carried out by using OptQuest. Arena includes three modules respectively called Arena Input Analyzer (for distribution fitting), Arena Output Analyzer (for simulation output analysis) and Arena Process Analyzer (for simulation experiment design). Moreover, Arena provides the user with animation at run time and allows the import of CAD drawings to enhance animation capabilities.

Table 1: Survey on most widely used simulation software
Aspects surveyed (scored from 0 to 10): Logistics, Manufacturing, 3D Virtual Reality, Simulation Engine, User Ability, User Community, Simulation Language, Runtime, Analysis Tools, Internal Programming, Modular Construction, Price
Software surveyed: Anylogic, Arena, AutoMod, Em-plant, Promodel, Flexsim, Witness
Scores: 6.5 6.6 6.6 7 7 6.2 6.8 7.5 6.5 7.2 6.1 7 7.5 7.5 6.9 8 8 9 7 7 8 7 7 6 7 6.5 7.3 7.5 6 6.7 6.25 6.5 6.9 6 6 5.6 7.2 7.2 6.8 8 7 6.5 6.5 6.5 7.1 7 6.5 5.8 6.5 6.7 6.7 7 9 7.5 6.5 7.5 7.7 6.2 7.5 7 7 6.7 7.2 7.5 7.5 6.6 6.7 6 6 7 7 5.7 7.5 7.5 7 8 8 8.5 6.5 7 7.8 6.5 7 6

2.3 Automod

Automod is a discrete event simulation software, developed by Applied Materials Inc. [11], and it is based on the domain-specific simulation language AutoMod. Typical domains of application are manufacturing, supply chain, warehousing and distribution, automotive, airports and semiconductors. It is strongly focused on transportation systems, including objects such as Conveyor, Path Mover, Power & Free, Kinematic, Train Conveyor, AS/RS, Bridge Crane, Tank & Pipe (each one customizable by the user). For input data analysis, experimental design and simulation output analysis, Automod provides the user with AutoStat [12]. Moreover, the software includes different modules such as AutoView, devoted to supporting simulation animation with AVI formats.
2.4 Em-Plant

Em-Plant is a simulation software by Siemens PLM Software [13], developed for strategic production decisions. Em-Plant enables users to create well-structured, hierarchical models of production facilities, lines and processes. Em-Plant's object-oriented architecture and modeling capabilities allow users to create and maintain complex systems, including advanced control mechanisms. The Application Object Libraries support the user in modeling complex scenarios in a short time. Furthermore, Em-Plant provides the user with a number of mathematical analysis and statistics functions for input distribution fitting and single or multi-level factor analysis, histograms, charts, a bottleneck analyzer and Gantt diagrams. Experiment design functionalities (with the Experiments Manager) are also provided. Simulation optimization is carried out by using Genetic Algorithms and Artificial Neural Networks.

2.5 Promodel

Promodel is a discrete event simulation software developed by Promodel Corporation [14] and it is used in different application domains: manufacturing, warehousing, logistics and other operational and strategic situations. Promodel enables users to build computer models of real situations and experiment with scenarios to find the best solution. The software provides users with an easy-to-use interface for creating models graphically. Real system randomness and variability can be recreated either by utilizing over 20 statistical distribution types or by directly importing users' data. Data can be directly imported from and exported to Microsoft Excel, and simulation optimizations are carried out by using SimRunner or OptQuest. Moreover, the software technology allows users to create customized front- and back-end interfaces that communicate directly with ProModel.

2.6 Flexsim

Flexsim is developed by Flexsim Software Products [15] and allows the user to model, analyze, visualize and optimize any kind of real process, from manufacturing to supply chains. The software can be interfaced with common spreadsheet and database applications to import and export data. Moreover, Flexsim's powerful 3D graphics allow in-model charts and graphs to dynamically display output statistics. The Flexsim Chart tool gives the possibility to analyze the simulation results, and simulation optimizations can be performed by using both OptQuest and a built-in experimenter tool. Finally, in addition to the previously described capabilities, Flexsim allows the user to create custom classes, libraries, GUIs or applications.

2.7 Witness

Witness is developed by Lanner Group Limited [16]. It allows the user to represent real-world processes in a dynamic animated computer model and then experiment with "what-if" alternative scenarios to identify the optimal solution. The software can be easily linked with the most common spreadsheet, database and CAD files. Simulation optimization is performed by the Witness Optimizer tool, which can be used with any Witness model. Finally, the software provides the user with a scenario manager tool for the analysis of the simulation results.

3. General Purpose and Specific Simulation Programming Languages

There are many programming languages, general purpose or domain-specific simulation languages (DSL), that can be used for simulation model development. General purpose languages are usually adopted when the programming logic cannot be easily expressed in GUI-based systems or when simulation results are more important than advanced animation/visualization [17]. Simulation models can be developed both by using discrete-event simulation software and general purpose languages, such as C++ or Java [18].
As reported in [1], a simulation study requires a number of different steps: it starts with problem formulation and passes through different and iterative steps such as conceptual model definition, data collection, simulation model implementation, verification, validation and accreditation, simulation experiments, simulation results analysis, documentation and reports. Simulation model development using general purpose programming languages (i.e. C++) requires a deep knowledge of the logical foundation of discrete event simulation. Among the different aspects to be considered, it is important to underline that a discrete event simulation model consists of entities, resources, control elements and operations [19]. Dynamic entities flow in the simulation model (i.e. parts in a manufacturing system, products in a supply chain, etc.). Static entities usually work as resources (a system part that provides services to dynamic entities). Control elements (such as variables, boolean expressions, specific programming code, etc.) support the control of the simulation model states. Finally, operations represent all the actions generated by the flow of dynamic entities within the simulation model.
During its life within the simulation model, an entity changes its state several times. There are five different entity states [19]: Ready state (the entity is ready to be processed), Active state (the entity is currently being processed), Time-delayed state (the entity is delayed until a predetermined simulation time), Condition-delayed state (the entity is delayed until a specific condition is resolved) and Dormant state (in this case the condition resolution that frees the entity is managed by the modeler). Entity management is supported by different lists, each one corresponding to an entity state: the CEL (Current Event List, for active state entities), the FEL (Future Event List, for time-delayed entities), the DL (Delay List, for condition-delayed entities) and the UML (User-Managed Lists, for dormant entities). In particular, Siman and GPSS/H call the CEL the CEC (Current Events Chain), while the ProModel language calls it the AL (Action List). The FEL is called FEP (Future Events Heap) and FEC (Future Event Chain) respectively by Siman and GPSS/H. After entity states definition and lists creation, the next step is the implementation of the phases of a simulation run: the Initialization Phase (IP), the Entity Movement Phases (EMP) and the Clock Update Phase (CUP). A detailed explanation of the simulation run anatomy is reported in [19].
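To make these mechanisms concrete, the following minimal C++ sketch shows one common way to keep the future event list as a priority queue ordered by event time, with a clock update phase that jumps the simulation clock to the next scheduled event. It is purely illustrative: the class and function names are invented for this example and are not taken from SCOPS or from [19].

```cpp
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// Illustrative future event list (FEL) kept as a priority queue ordered by
// event time. Class and member names are invented for this sketch.
struct Event {
    double time;                   // scheduled simulation time
    std::function<void()> action;  // operation executed when the event becomes current
};

struct LaterFirst {
    bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
};

class Simulator {
public:
    // Time-delayed entities are represented by events placed on the FEL.
    void schedule(double delay, std::function<void()> action) {
        fel.push(Event{clock + delay, std::move(action)});
    }
    // Repeatedly move the clock to the earliest scheduled event and execute it.
    void run(double endTime) {
        while (!fel.empty() && fel.top().time <= endTime) {
            Event current = fel.top();
            fel.pop();
            clock = current.time;   // clock update phase (CUP)
            current.action();       // entity movement phase (EMP)
        }
        clock = endTime;
    }
    double now() const { return clock; }
private:
    double clock = 0.0;
    std::priority_queue<Event, std::vector<Event>, LaterFirst> fel;
};

int main() {
    Simulator sim;
    // Initialization phase (IP): schedule the first events.
    sim.schedule(0.5, [&] { std::printf("truck departs at t=%.1f\n", sim.now()); });
    sim.schedule(1.5, [&] { std::printf("customer order arrives at t=%.1f\n", sim.now()); });
    sim.run(10.0);
    return 0;
}
```

In a complete model, condition-delayed and dormant entities would additionally be held in delay lists and user-managed lists and re-examined whenever the system state changes.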
4. A Supply Chain Simulation Model Developed in C++

Following the idea of implementing simulation models based on general purpose programming languages, the authors propose a three-stage supply chain simulation model implemented by using Borland C++ Builder to compile the code (further information on Borland C++ Builder can be found in [20]). The acronym of the simulation model is SCOPS (Supply-Chain Order Performance Simulator). SCOPS investigates the inventory management problem along a three-stage supply chain and allows the user to test different scenarios in terms of demand intensity, demand variability and lead times. Note that such a problem can also be investigated by using discrete event simulation software [21], [22], [23] and [24].
The supply chain conceptual model includes suppliers, distribution centers, stores and final customers. In the supply chain conceptual model a single network node can be considered as a store, a distribution center or a supplier. A supply chain begins with one or more suppliers and ends with one or more stores. Usually stores satisfy final customers' demand, distribution centers satisfy stores' demand and plants satisfy distribution centers' demand. By using these three types of nodes we can model a general supply chain (also including more than three stages). Suppliers, distribution centers and stores work 6 days per week, 8 hours per day. Stores receive orders from customers. An order can be completely or partially satisfied. At the end of each day, on the basis of an Order-Point, Order-Up-to-Level (s, S) inventory control policy, the stores decide whether to place an order to the distribution centers or not. Similarly, distribution centers place orders to suppliers according to the same inventory control policy. Distribution centers select suppliers according to their lead times (which include production times and transportation times).

4.1 Supply Chain Order Performance Simulator

According to the Order-Point, Order-Up-to-Level policy [25], an order is emitted whenever the available quantity drops to the order point (s) or lower. A variable replenishment quantity is ordered to raise the available quantity to the order-up-to-level (S). For each item, the order point s is the safety stock calculated as the standard deviation of the lead-time demand, and the order-up-to-level S is the maximum number of items that can be stored in the warehouse space assigned to the item type considered. For the i-th item, the evaluation of the replenishment quantity, Qi(t), has to take into consideration the quantity available (in terms of inventory position) and the order-up-to-level S. The inventory position (equation 1) is the on-hand inventory, plus the quantity already on order, minus the quantity to be shipped. The calculation of si(t) requires the evaluation of the demand over the lead time. The lead time demand of the i-th item (see equation 2) is evaluated by using the moving average methodology. Both at store and distribution center levels, managers know their peak and off-peak periods, and they usually use that knowledge to manually correct future estimates based on the moving average methodology. They also correct their future estimates based on truck capacity and suppliers' quantity discounts. Finally, equations 3 and 4 respectively express the order condition and calculate the replenishment quantity.

Pi(t) = Ohi(t) + Ori(t) - Shi(t)    (1)

Dlti(t) = Σ Dfi(k), for k = t+1, ..., t+LTi    (2)

Pi(t) ≤ si(t) + SSi(t)    (3)

Qi(t) = Si - Pi(t)    (4)

where:
Pi(t), inventory position of the i-th item;
Ohi(t), on-hand inventory of the i-th item;
Ori(t), quantity already on order of the i-th item;
Shi(t), quantity to be shipped of the i-th item;
Dlti(t), lead time demand of the i-th item;
Dfi(t), demand forecast of the i-th item (evaluated by means of the moving average methodology);
LTi, lead time of the i-th item;
si(t), order point at time t of the i-th item;
Si, order-up-to-level of the i-th item;
SSi(t), safety stock at time t of the i-th item;
Qi(t), quantity to be ordered at time t of the i-th item.
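As an illustration of equations (1)-(4), the short C++ sketch below evaluates the daily ordering decision for a single item. It is not SCOPS code: the data structure, the 7-day moving-average window and the way the order point is formed as lead-time demand plus safety stock are assumptions made only for this example.

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// Illustrative evaluation of equations (1)-(4) for a single item.
// Field names, the 7-day forecast window and the test data are assumptions.
struct ItemState {
    double onHand;       // Oh_i(t)
    double onOrder;      // Or_i(t)
    double toBeShipped;  // Sh_i(t)
    double safetyStock;  // SS_i(t)
    double orderUpTo;    // S_i
    int leadTimeDays;    // LT_i
};

// Moving-average forecast of daily demand, Df_i, over the last `window` days.
double movingAverageForecast(const std::vector<double>& dailyDemand, std::size_t window) {
    std::size_t n = std::min(window, dailyDemand.size());
    if (n == 0) return 0.0;
    double sum = std::accumulate(dailyDemand.end() - static_cast<std::ptrdiff_t>(n),
                                 dailyDemand.end(), 0.0);
    return sum / static_cast<double>(n);
}

// Returns Q_i(t) if the order condition of equation (3) holds, otherwise 0.
double replenishmentQuantity(const ItemState& item, const std::vector<double>& dailyDemand) {
    double position = item.onHand + item.onOrder - item.toBeShipped;  // equation (1)
    double forecast = movingAverageForecast(dailyDemand, 7);
    double leadTimeDemand = forecast * item.leadTimeDays;             // equation (2): forecast over LT_i days
    double orderPoint = leadTimeDemand + item.safetyStock;            // s_i(t) + SS_i(t)
    if (position <= orderPoint)                                       // equation (3)
        return item.orderUpTo - position;                             // equation (4)
    return 0.0;
}

int main() {
    ItemState item{40.0, 10.0, 5.0, 12.0, 120.0, 3};
    std::vector<double> history{18, 22, 20, 19, 21, 23, 20};
    std::printf("quantity to order: %.1f\n", replenishmentQuantity(item, history));
    return 0;
}
```

In a simulator, such a function would be called for every item and every node at the end of each simulated day, and a positive result would generate an order to the upstream node.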
SCOPS translates the supply chain conceptual model, recreating the complex and highly stochastic environment of a real supply chain. For each type of product, customers' demand to stores is assumed to be Poisson, with independent arrival processes (in relation to product types). The quantity required at stores is based on triangular distributions with different levels of intensity and variability. Partially satisfied orders are recorded at store and distribution center levels for performance measure calculation. In our application example fifty stores, three distribution centers, ten suppliers and thirty different items define the supply chain scenario. Figure 1 shows the SCOPS user interface.

Fig. 1 SCOPS User Interface.

The SCOPS graphic interface provides the user with many commands, for instance simulation time length, start, stop and reset buttons, a check box for unique simulation experiments (that should be used for resetting the random number generator in order to compare different scenarios under the same conditions) and supply chain configurations (number of items, stores, distribution centers, suppliers, input data, etc.). For each supply chain node a button allows access to the following information: number of orders, arrival times, ordered quantities, received quantities, waiting times, fill rates. The SCOPS graphic interface also allows the user to export simulation results to txt and Excel files. One of the most important features of SCOPS is its flexibility in terms of scenario definition. The graphic interface gives the user the possibility to carry out a number of different what-if analyses by changing the supply chain configuration and input parameters (i.e. inventory policies, demand forecast methods, demand intensity and variability, lead times, inter-arrival times, number of items, number of stores, distribution centers and plants, number of supply chain echelons, etc.). Figure 2 displays several SCOPS windows the user can use for setting the supply chain configuration and input parameters.

Fig. 2 SCOPS Windows.
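Section 4.1 models customers' demand as Poisson order arrivals with triangularly distributed quantities. The following C++ fragment sketches how such stochastic inputs could be generated with the standard <random> library. The distribution parameters and the fixed seed are invented for the example (the fixed seed plays the same role as the check box that resets the random number generator), and the triangular variate is obtained by inverse-transform sampling since the standard library does not provide it directly.

```cpp
#include <cmath>
#include <cstdio>
#include <random>

// Illustrative generation of the stochastic demand inputs: Poisson order
// arrivals (exponential inter-arrival times) and triangular order quantities.
// All parameter values and the fixed seed are invented for this example.
double sampleTriangular(std::mt19937& gen, double a, double mode, double b) {
    // Inverse-transform sampling for a triangular distribution on [a, b].
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    double u = u01(gen);
    double cut = (mode - a) / (b - a);
    if (u < cut) return a + std::sqrt(u * (b - a) * (mode - a));
    return b - std::sqrt((1.0 - u) * (b - a) * (b - mode));
}

int main() {
    std::mt19937 gen(42);  // fixed seed, so different scenarios see the same demand stream
    std::exponential_distribution<double> interArrival(1.0 / 5.0);  // mean inter-arrival time of 5
    double t = 0.0;
    for (int i = 0; i < 5; ++i) {
        t += interArrival(gen);                                     // next customer order arrival
        double quantity = sampleTriangular(gen, 14.0, 20.0, 26.0);  // ordered quantity [items]
        std::printf("order at t=%.2f for %.1f items\n", t, quantity);
    }
    return 0;
}
```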
4.2 SCOPS Verification, Simulation Run Length and Validation

Verification and validation processes assess the accuracy and the quality throughout a simulation study [26]. Verification and validation are defined by the American Department of Defense Directive 5000.59 as follows: verification is the process of determining that a model implementation accurately represents the developer's conceptual description and specifications, while validation is the process of determining the degree to which a model is an accurate representation of the real world from the perspective of the intended use of the model.
The simulator verification has been carried out by using the debugging technique. The debugging technique is an iterative process whose purpose is to uncover errors or misconceptions that cause the model's failure and to define and carry out the model changes that correct the errors [1]. In this regard, during the simulation model development, the authors tried to find the existence of errors (bugs). The causes of each bug have been correctly identified and the model has been opportunely modified and tested (once again) for ensuring error elimination as well as for detecting new errors.
Before going into the details of simulation model validation, it is important to evaluate the optimal simulation run length. Note that the supply chain is a non-terminating system and one of the priority objectives for such a type of system is the evaluation of the simulation run length [1]. Information regarding the length of a simulation run is used for the validation. The length is the correct trade-off between results accuracy and the time required for executing the simulation runs. The run length has been determined using the mean square pure error (MSPE) analysis. After the MSPE analysis, the simulation run length chosen is 390 days.
Choosing for each simulation run the length evaluated by means of the MSPE analysis (390 days), the validation phase has been conducted by using Face Validation (an informal technique). For each retailer and for each distribution centre the simulation results, in terms of fill rate, have been compared with real results. Note that during the validation process the simulation model works under input conditions identical to those of the real supply chain. The Face Validation results have been analyzed by several experts; their analysis revealed that, in its domain of application, the simulation model recreates the real system with satisfactory accuracy.

5. Supply Chain Configuration and Design of Simulation Experiments

The authors propose as application example the investigation of 27 different supply chain scenarios. In particular, the simulation experiments take into account three different levels for demand intensity, demand variability and lead times (minimum, medium and maximum, respectively indicated with "-", "0" and "+" signs). Table 2 reports (as an example) the factors and levels for one of the thirty items considered, and Table 3 reports the scenario descriptions in terms of simulation experiments. Each simulation run has been replicated three times (81 replications in total).

Table 2: Factors and levels
Factor (levels: Minimum / Medium / High)
Demand Intensity [inter-arrival time]: 3 / 5 / 8
Demand Variability [item]: [18,22] / [16,24] / [14,26]
Lead Time [days]: 2 / 3 / 4

Table 3: Simulation experiments and supply chain scenarios (runs 1 to 27, one for each combination of the "-", "0", "+" levels of demand intensity, demand variability and lead time; run 1 corresponds to "-", "-", "-").
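Since the 27 scenarios form a full factorial design over the three factors, they can be enumerated programmatically. The following small C++ program, given only as an illustration, lists the 3^3 = 27 factor-level combinations behind Table 3; the run numbering it produces is arbitrary and does not necessarily match the order used by the authors.

```cpp
#include <cstdio>

// Enumerate the 3^3 = 27 combinations of the three experimental factors at
// their "-", "0" and "+" levels. The run numbering here is arbitrary.
int main() {
    const char levels[3] = {'-', '0', '+'};
    int run = 1;
    for (int di = 0; di < 3; ++di)          // demand intensity
        for (int dv = 0; dv < 3; ++dv)      // demand variability
            for (int lt = 0; lt < 3; ++lt)  // lead time
                std::printf("run %2d: demand intensity %c, demand variability %c, lead time %c\n",
                            run++, levels[di], levels[dv], levels[lt]);
    return 0;
}
```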
5.1 Supply Chain Scenarios Analysis and Comparison

After the definition of factor levels and scenarios, the next step is the definition of the performance measures. SCOPS includes, among others, two fill rate performance measures defined as (i) the ratio between the number of satisfied orders and the total number of orders, and (ii) the ratio between the lost quantity and the total ordered quantity. Simulation results, for each supply chain node and for each factor-level combination, are expressed in terms of average fill rate (intended as the ratio between the number of satisfied orders and the total number of orders). The huge quantity of simulation results allows the analysis of a comprehensive set of supply chain operative scenarios. Let us consider the simulation results regarding store #1; we have considered three different scenarios (low, medium and high lead times) and, within each scenario, the effects of demand variability and demand intensity are investigated. Figure 2 shows the fill rate trend at store #1 in the case of low lead time.

Fig. 2 Fill rate at store #1, low lead time.

The major effect is due to changes in demand intensity: as soon as the demand intensity increases there is a strong reduction of the fill rate. A similar trend can be observed in the case of medium and high lead time (figure 3 and figure 4, respectively).

Fig. 3 Fill rate at store #1, medium lead time.

Fig. 4 Fill rate at store #1, high lead time.

The simultaneous comparison of figures 2, 3 and 4 shows the effect of different lead times on the average fill rate. The only minor issue is a small fill rate reduction passing from a 2-day lead time to 3- and 4-day lead times. As an additional aspect (not shown in figures 2, 3 and 4), the higher the demand intensity, the higher the average on-hand inventory. Similarly, the higher the demand variability, the higher the average on-hand inventory. In effect, the demand forecast usually overestimates the ordered quantity in case of high demand intensity and variability.

6. Conclusions

The paper first presents an overview of the most widely used discrete event simulation software in terms of domains of applicability, types of libraries (i.e. modeling libraries, optimization libraries, etc.), input-output functionalities, animation functionalities, etc. In the second part the paper proposes, as an alternative to discrete event simulation software, the use of general purpose programming languages and provides the reader with a brief description of how a discrete event simulation model works. As an application example the authors propose a supply chain simulation model (SCOPS) developed in C++. SCOPS is a flexible simulator used for investigating the inventory management problem along a three-stage supply chain. The SCOPS simulator is currently being used for reverse logistics problems in the large scale retail supply chain.

Acknowledgments
All the authors gratefully thank Professor A. G. Bruzzone (University of Genoa) for his valuable support on this manuscript.

References
[1] J. Banks, Handbook of Simulation: Principles, Methodology, Advances, Applications, and Practice, New York: Wiley-Interscience, 1998.
[2] J. Banks and R. G. Gibson, Simulation software buyer's guide, IIE Solutions, pp. 48-54, 1997.
[3] V. Hlupic, Discrete-Event Simulation Software: What the Users Want, Simulation, Vol. 73, No. 6, 1999, pp. 362-370.
[4] J. J. Swain, Gaming Reality: Biennial survey of discrete-event simulation software tools, OR/MS Today, Vol. 32, No. 6, 2005, pp. 44-55.
[5] J. J. Swain, New Frontiers in Simulation: Biennial survey of discrete-event simulation software tools, OR/MS Today, 2007.
[6] F. Longo and G. Mirabelli, An advanced supply chain management tool based on modeling and simulation, Computers & Industrial Engineering, Vol. 54, No. 3, 2008, pp. 570-588.
[7] G. S. Fishman, Discrete-Event Simulation: Modeling, Programming, and Analysis, Berlin: Springer-Verlag, 2001.
[8] Anylogic by XJ Technologies, www.xjtech.com.
[9] Arena by Rockwell Corporation, http://www.arenasimulation.com/.
[10] D. J. Hhuente, Critique of SIMAN as a programming language, ACM Annual Computer Science Conference, 1987, pp. 385.
[11] Automod by Applied Materials Inc., http://www.automod.com/.
[12] J. S. Carson, AutoStat: output statistical analysis for AutoMod users, in Proceedings of the 1997 Winter Simulation Conference, 1997, pp. 649-656.
[13] Em-plant by Siemens PLM Software solutions, http://www.emplant.com/.
[14] Promodel by Promodel Corporation, http://www.promodel.com/products/promodel/.
[15] Flexsim by Flexsim Software Products, http://www.flexsim.com/.
[16] Witness by Lanner Group Limited, http://www.lanner.com/en/witness.cfm.
[17] V. P. Babich and A. S. Bylev, An approach to compiler construction for a general-purpose simulation language, New York: Springer, 1991.
[18] M. Pidd and R. A. Cassel, Using Java to Develop Discrete Event Simulations, The Journal of the Operational Research Society, Vol. 51, No. 4, 2000, pp. 405-412.
[19] T. J. Schriber and D. T. Brunner, How discrete event simulation works, in J. Banks (ed.), Handbook of Simulation, New York: Wiley-Interscience, 1998.
[20] K. Reisdorph and K. Henderson, Borland C++ Builder, Apogeo, 2005.
[21] G. De Sensi, F. Longo and G. Mirabelli, Inventory policies analysis under demand patterns and lead times constraints in a real supply chain, International Journal of Production Research, Vol. 46, No. 24, 2008, pp. 6997-7016.
[22] F. Longo and G. Mirabelli, An Advanced Supply Chain Management Tool Based On Modeling & Simulation, Computers & Industrial Engineering, Vol. 54, No. 3, 2008, pp. 570-588.
[23] D. Curcio and F. Longo, Inventory and Internal Logistics Management as Critical Factors Affecting the Supply Chain Performances, International Journal of Simulation & Process Modelling, Vol. 5, No. 2, 2009, pp. 127-137.
[24] A. G. Bruzzone and E. Williams, Modeling and Simulation Methodologies for Logistics and Manufacturing Optimization, Simulation, Vol. 80, 2004, pp. 119-174.
[25] E. Silver, F. D. Pike and R. Peterson, Inventory Management and Production Planning and Control, USA: John Wiley & Sons, 1998.
[26] O. Balci, Verification, validation and testing, in Handbook of Simulation, New York: Wiley-Interscience, 1998.

Antonio Cimino took his degree in Management Engineering, summa cum laude, in September 2007 from the University of Calabria. He is currently a PhD student at the Mechanical Department of the University of Calabria. He has published more than 20 papers in international journals and conferences. His research activities concern the integration of ergonomic standards, work measurement techniques, artificial intelligence techniques and Modeling & Simulation tools for effective workplace design.

Francesco Longo received his Ph.D. in Mechanical Engineering from the University of Calabria in January 2006. He is currently Assistant Professor at the Mechanical Department of the University of Calabria and Director of the Modeling & Simulation Center - Laboratory of Enterprise Solutions (MSC-LES). He has published more than 80 papers in international journals and conferences. His research interests include Modeling & Simulation tools for training procedures in complex environments, supply chain management and security. He is Associate Editor of "Simulation: Transactions of the Society for Modeling & Simulation International". For the same journal he is Guest Editor of the special issue on Advances of Modeling & Simulation in Supply Chain and Industry. He is Guest Editor of the "International Journal of Simulation and Process Modelling", special issue on Industry and Supply Chain: Technical, Economic and Environmental Sustainability.
He is Editor in Chief of the SCS M&S Newsletter and he works as a reviewer for different international journals.

Giovanni Mirabelli is currently an Assistant Professor at the Mechanical Department of the University of Calabria. He has published more than 60 papers in international journals and conferences. His research interests include ergonomics, methods and time measurement in manufacturing systems, production systems maintenance and reliability, and quality.

Database Reverse Engineering based on Association Rule Mining

Nattapon Pannurat1, Nittaya Kerdprasop2 and Kittisak Kerdprasop2

1 Faculty of Information Sciences, Nakhon Ratchasima College, 290 Moo 2, Mitraphap Road, Nakhon Ratchasima, 30000, Thailand
2 Data Engineering and Knowledge Discovery Research Unit, Suranaree University of Technology, 111 University Avenue, Nakhon Ratchasima, 30000, Thailand

Abstract
Maintaining a legacy database is a difficult task, especially when system documentation is poorly written or even missing. Database reverse engineering is an attempt to recover the high-level conceptual design from the existing database instances. In this paper, we propose a technique to discover the conceptual schema using the association mining technique. The discovered schema corresponds to normalization at the third normal form, which is a common practice in many business organizations. Our algorithm also includes a rule filtering heuristic to solve the problem of the exponential growth of discovered rules inherent in the association mining technique.
Keywords: Legacy Databases, Reverse Engineering, Database Design, Database Normalization, Association Mining.

1. Introduction

Legacy databases are obviously valuable assets to many organizations. These databases were mostly developed with technologies of the 1970s [14] using old programming languages such as COBOL and RPG, and file systems of the mini-computer platforms. Some databases were even designed with outdated concepts such as the hierarchical data model, which makes them difficult to maintain and adjust to serve the current needs of modern companies. One solution to modernize legacy databases is to migrate and transform their structures and corresponding contents to new systems. This approach is, however, hard to achieve if the design documents of the system no longer exist, which is the common situation in most enterprises. To solve the problems of recovering database structures and migrating legacy databases, we propose a database reverse engineering methodology.
The process of reverse engineering [7] originally aimed at discovering design and production procedures from devices, end products, or other hardware. This methodology was often used in the Second World War for military advantage by copying opponents' technologies. Reverse engineering of software refers to the process of discovering source code and system design from the available software [7]. In the database community, reverse engineering is an attempt to extract domain semantics such as keys, functional dependencies and integrity constraints from the existing database structures [6, 13]. Typically, database reverse engineering is the process of extracting design specifications from legacy systems and making the reverse transformation from logical to conceptual schema [6, 15].
Our work deals with the reverse schema process by making a step further from the logical schema to the lower level of database instances. We apply a machine learning technique, association rule mining in particular, to induce dependency relationships among data attributes. The major problem of applying association mining to real-life databases is that it always generates a tremendous amount of association rules [11, 12]. We thus include a rule-filtering component in our design to select only promising association rules.
The structure of this paper is organized as follows. Section 2 presents the basic concept and the design framework of our methodology. Section 3 explains the system implementation by means of an example. Section 4 discusses related work. Finally, Section 5 concludes the paper.

2. Database Reverse Engineering with NoWARs

The objective of our system is to induce a conceptual schema from the database instances under the basic assumption that database design documents are absent. We apply the normalization principles and the association mining technique to discover the missing database design.
Normalization [8] is the process of transforming an unstructured relation into separate relations, called normalized ones. The main purpose of this separation is to eliminate redundant data and reduce data anomalies (insert, update, and delete). There are many different levels of normalization depending on the purpose of the database designer. Most database applications are designed to be in the third and the Boyce-Codd normal forms, in which their dependency relations [3] are sufficient for most organizational requirements. Figure 1 [9] illustrates the refinement steps from un-normalized relations to relations in the fifth normal form. The main condition to transform from one normal form to the next level is the dependency relationship, which is a constraint between two sets of attributes in a relation. Experienced database designers are able to elicit this kind of information. But in the reverse engineering process, in which the business process and operational requirements are unknown, this task of dependency analysis is tough even for the experienced ones. We thus propose to use the machine learning technique called association mining.

Fig. 1 Normalization steps.

Association mining searches for interesting relationships among a large set of data items [1, 2]. The discovery of interesting association relationships among huge amounts of business transaction records can help in many business decision making processes, such as catalog design, cross-marketing, and loss-leader analysis [11]. An example of association rule mining is market basket analysis. This process analyzes customer buying habits by finding associations between the different items that customers place in their shopping baskets. For example, customers who buy milk also tend to buy bread and drinking water at the same time. This can be represented as an association rule as follows:

milk => bread, drinking water [support = 5%, confidence = 100%]

A support of 5% for this association rule means that 5% of all the transactions under analysis show that milk, bread, and drinking water are purchased together. A confidence of 100% means that 100% of the customers who purchased milk also bought bread and drinking water.
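To make the two measures concrete, the short sketch below computes support and confidence for a rule over a toy set of transactions. The transactions and the resulting figures are invented for illustration and do not reproduce the 5%/100% example above.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent => consequent."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

# Toy market-basket data (hypothetical transactions).
transactions = [
    {"milk", "bread", "drinking water"},
    {"bread", "butter"},
    {"bread", "drinking water"},
    {"coffee", "sugar"},
]

rule_lhs, rule_rhs = ["milk"], ["bread", "drinking water"]
print(support(rule_lhs + rule_rhs, transactions))   # 0.25 -> support of the rule
print(confidence(rule_lhs, rule_rhs, transactions)) # 1.0  -> every milk buyer also bought the rest
```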
Our methodology of database reverse engineering consists of designing and improving the normalization process with the association analysis technique. We use the normalization concept and the association analysis technique to create a new algorithm called NoWARs (Normalization With Association Rules). NoWARs is an algorithm that combines the normalization process and the association mining technique together. We can find association rules by taking a dataset from the database and feeding it into a data mining process. We use the Apriori algorithm to find association rules. NoWARs has two important steps: first, finding association rules, and second, normalization with the rules obtained from the first step. The details of the NoWARs algorithm are shown in Figure 2 and its workflow is shown in Figure 3.

Fig. 2 NoWARs algorithm.

Fig. 3 The workflow of the NoWARs algorithm.

3. Implementation

The NoWARs algorithm starts when the user enters a query to define the dataset to normalize. NoWARs then finds the association rules by calling the Apriori algorithm and saves the resulting association rules in the database. NoWARs then selects some rules to use in the normalization process. Finally, it uses the selected rules to generate the 3NF tables in relational schema form.
The input of NoWARs is an un-normalized table. An example of the input data format is shown in Table 1.

Table 1: Example of input data.
INV  DATE      C_ID  P_ID  P_Name   QTY
001  9/1/2010  C01   P01   Printer  3
001  9/1/2010  C01   P02   Phone    5
002  9/1/2010  C03   P05   TV       6
002  9/1/2010  C03   P04   Lamp     2
...

The un-normalized data as shown in Table 1 is analyzed by the algorithm, and then its schema in 3NF is generated. We perform experimentation with five datasets as shown in Table 2. We use Oracle Database 10g XE Edition, tested on a Pentium IV 3.0 GHz machine with 512 MB of RAM.

Table 2: Number of records and attributes in the experimental datasets.
Dataset Name   Number of Records   Number of Attributes
Register       12438               157
Video_Rental   483478              523
Data_Org       91845               334
Invoice        119795              123
Car_Color      199337              312

We take the Register dataset as a running example. This dataset is originally un-normalized and its structure is as follows.

Register (STUDENT_CODE, STUDENT_NAME, TEACHER_CODE, TEACHER_NAME, UNIT, SUBJECT_CODE, SUBJECT_NAME)

After execution, its conceptual schema is recovered as shown in Figure 4. The performance of the rule-filtering component was also analyzed and is shown in Figure 5.

Fig. 4 The result of running the NoWARs algorithm on the Register dataset.

Fig. 5 Performance of the rule-filtering component of the NoWARs algorithm in reducing the number of association rules (rules found vs. rules used for each dataset, logarithmic scale).
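The core idea, that association rules holding with 100% confidence behave like the dependencies that drive the table decomposition, can be illustrated with a simplified sketch. The code below is not the NoWARs implementation, which calls Apriori inside the Oracle database and applies a rule-filtering heuristic; it only checks single-attribute rules on an in-memory table, and the sample rows for the Register schema are invented.

```python
from collections import defaultdict

def single_attribute_dependencies(rows):
    """Return {A: {B, ...}} where the rule A -> B holds with 100% confidence on
    the given rows, i.e. every value of attribute A co-occurs with exactly one
    value of attribute B."""
    attrs = list(rows[0].keys())
    deps = defaultdict(set)
    for a in attrs:
        for b in attrs:
            if a == b:
                continue
            mapping, holds = {}, True
            for r in rows:
                if mapping.setdefault(r[a], r[b]) != r[b]:
                    holds = False
                    break
            if holds:
                deps[a].add(b)
    return deps

# Hypothetical fragment of the Register dataset described above.
rows = [
    {"STUDENT_CODE": "S1", "STUDENT_NAME": "Anan", "SUBJECT_CODE": "C1",
     "SUBJECT_NAME": "Databases", "TEACHER_CODE": "T1", "TEACHER_NAME": "Kerd", "UNIT": 3},
    {"STUDENT_CODE": "S1", "STUDENT_NAME": "Anan", "SUBJECT_CODE": "C2",
     "SUBJECT_NAME": "Data Mining", "TEACHER_CODE": "T2", "TEACHER_NAME": "Nitt", "UNIT": 3},
    {"STUDENT_CODE": "S2", "STUDENT_NAME": "Boon", "SUBJECT_CODE": "C1",
     "SUBJECT_NAME": "Databases", "TEACHER_CODE": "T1", "TEACHER_NAME": "Kerd", "UNIT": 3},
]

deps = single_attribute_dependencies(rows)
# e.g. deps["SUBJECT_CODE"] contains "SUBJECT_NAME" and "TEACHER_CODE";
# each determinant with its dependent attributes suggests a separate 3NF relation,
# such as Student(STUDENT_CODE, STUDENT_NAME) and Subject(SUBJECT_CODE, SUBJECT_NAME, ...).
```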
4. Related Work

Since the introduction of the famous association mining technique known as the Apriori algorithm [1, 2], there have been numerous attempts to integrate this technique to improve database design, consistency checking, and querying. Han et al. [10] improved the DBMiner system to work with relational databases and data warehouses. DBMiner can perform many data mining tasks such as classification, prediction and association. Sreenath, Bodagala, Alsabti, and Ranka [16] adopted the Apriori algorithm to work with relational database systems. They created the Fast UPdate algorithm to search for association data when the system has new transactions. Tsechansky, Pliskin, Rabinowitz and Porath [17] applied Apriori to find association data from many relations in the database. Berzal, Cubero, Marín and Serrano [4] used Tree-Based Association Rule mining (TBAR) to find association data in relational databases. They kept large itemsets in a tree structure to reduce the time cost of the association process. Hipp, Güntzer and Grimmer [12] implemented the Apriori algorithm in the C++ programming language to work on the DB2 database system. They used the program to find association data in the Daimler-Chrysler company database.
In parallel to the attempts to apply learning techniques to existing large databases, researchers in the area of database reverse engineering have proposed some means of extracting conceptual schemas. Lee and Yoo [14] proposed a method to derive a conceptual model from object-oriented databases. The derivation process is based on forms, including business forms and forms for database interaction in the user interface. The final products of their method are the object model and the scenario diagram describing a sequence of operations. The work of Perez et al. [15] emphasized the extraction of object-oriented conceptual schemas from relational databases. Their reverse engineering technique is based on a formal method of term rewriting. They use terms to represent relational and object-oriented schemas. Term rewriting rules are then generated to represent the correspondences between relational and object-oriented elements. The output of the system is the source code to migrate the legacy database to the new system.
Recent work in database reverse engineering has not concentrated on the broad objective of system migration. Researchers rather focus their study on the particular issue of semantic understanding. Lammari et al. [13] proposed a reverse engineering method to discover inter-relational constraints and inheritances embedded in a relational database. Chen et al. [5] also based their study on the entity-relationship model. They proposed to apply association rule mining to discover new concepts leading to a proper design of a relational database schema. They employed the concept of fuzziness to deal with the uncertainty inherent in the association mining process. Our work is also in the line of applying the association mining technique to database design. But our main purpose is the understanding of legacy databases, and our method deals with uncertainty by means of a heuristic in the rule-filtering step.

5. Conclusions and Future Work

A forward engineering approach to the design of a complete database starts from the high-level conceptual design to capture the detailed requirements of the enterprise. The tool normally used to represent these requirements is the entity-relationship, or ER, diagram, and the product of this design phase is a conceptual schema. Typically, the schema at this level needs some adjustments based on the procedure known as normalization in order to reach a proper database design. Then, the database implementation moves to the lower abstraction level of logical design, in which the logical schema is constructed in the form of relations, or database tables. In legacy systems whose design documents are incomplete or even missing, system maintenance or modification is a difficult task due to the lack of knowledge regarding the high-level design of the system. To tackle this problem, a database reverse engineering approach is essential.
In this paper, we propose a method to discover the conceptual schema from the database instances, or relations. The discovering technique is based on association mining incorporated with some heuristics to produce a minimal set of association rules. Transformation rules are then applied to convert association rules to database dependencies. Normalization is the principal concept of our heuristic and transformation; it is applied to reduce repeating groups and insert, delete, and update anomalies. We introduce a novel algorithm, called NoWARs, to normalize the database tables. In the normalization process, NoWARs uses only 100% confidence association rules with any support values. The results from the NoWARs algorithm are the same as the design schema obtained from the database designer. But NoWARs cannot normalize a data model to a level higher than the third normal form, which might be the desired level of a highly secured database. We thus plan to improve our methodology to discover a conceptual schema up to the level of the fifth normal form.

Acknowledgments

This research has been conducted at the Data Engineering and Knowledge Discovery (DEKD) research unit, fully supported by Suranaree University of Technology. The work of the first and second authors has been funded by grants from Suranaree University of Technology and the National Research Council of Thailand (NRCT), respectively. The third author has been supported by a grant from the Thailand Research Fund (TRF, grant number RMU5080026).

References
[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases", in Proceedings of the ACM SIGMOD International Conference on Management of Data, 1993, pp. 207-216.
[2] R. Agrawal, and R. Srikant, "Fast algorithms for mining association rules in large databases", in Proceedings of the 20th International Conference on Very Large Data Bases, 1994, pp. 487-499.
[3] W. W. Armstrong, "Dependency structures of database relationships", Information Processing, Vol. 74, 1974, pp. 580-583.
[4] F. Berzal, J. Cubero, N. Marin, and J. Serrano, "TBAR: An efficient method for association rule mining in relational databases", Data & Knowledge Engineering, Vol. 37, No. 1, 2001, pp. 47-64.
[5] G. Chen, M. Ren, P. Yan, and X. Guo, "Enriching the ER model based on discovered association rules", Information Sciences, Vol. 177, 2007, pp. 1558-1566.
[6] R. Chiang, T. M. Barron, and V. C. Storey, "A framework for the design and evaluation of reverse engineering methods for relational databases", Data & Knowledge Engineering, Vol. 21, 1997, pp. 57-77.
[7] E. J. Chikofsky, and J. H. Cross, "Reverse engineering and design recovery: A taxonomy", IEEE Software, Vol. 7, No. 1, 1990, pp. 13-17.
[8] E. F. Codd, "A relational model of data for large shared data banks", Communications of the ACM, Vol. 13, No. 6, 1970, pp. 377-387.
[9] C. J. Date, and R. Fagin, "Simple conditions for guaranteeing higher normal forms in relational databases", ACM Transactions on Database Systems, Vol. 17, No. 3, 1992, pp. 465-476.
[10] J. Han, et al., "DBMiner: A system for data mining in relational databases and data warehouses", in Proceedings of CASCON'97: Meeting of Minds, 1997, pp. 249-260.
[11] J. Han, and M. Kamber, Data Mining: Concepts and Techniques, San Diego: Academic Press, 2001.
[12] J. Hipp, U. Guntzer, and U.
Grimmer, "Integrating association rule mining algorithms with relational database systems", in Proceedings of the 3rd International Conference on Enterprise Information Systems, 2001, pp. 130-137.
[13] N. Lammari, I. Comyn-Wattiau, and J. Akoka, "Extracting generalization hierarchies from relational databases: A reverse engineering approach", Data & Knowledge Engineering, Vol. 63, 2007, pp. 568-589.
[14] H. Lee, and C. Yoo, "A form driven object-oriented reverse engineering methodology", Information Systems, Vol. 25, No. 3, 2000, pp. 235-259.
[15] J. Perez, I. Ramos, V. Anaya, J. M. Cubel, F. Dominguez, A. Boronat, and J. A. Carsi, "Data reverse engineering for legacy databases to object oriented conceptual schemas", Electronic Notes in Theoretical Computer Science, Vol. 72, No. 4, 2003, pp. 7-19.
[16] S. T. Sreenath, S. Bodogala, K. Alsabti, and S. Ranka, "An efficient algorithm for the incremental updation of association rules in large databases", in Proceedings of the 3rd International Conference on KDD and Data Mining, 1997, pp. 263-266.
[17] S. Tsechansky, N. Pliskin, G. Rabinowitz, and A. Porath, "Mining relational patterns from multiple relational tables", Decision Support Systems, Vol. 27, No. 1-2, 1999, pp. 179-195.

Natthapon Pannurat received his bachelor and master degrees in computer engineering in 2007 and 2009, respectively, from Suranaree University of Technology. He is currently a faculty member of the Faculty of Information Sciences, Nakhon Ratchasima College. His research interests are database management systems, data mining and machine learning.

Nittaya Kerdprasop is an associate professor at the school of computer engineering, Suranaree University of Technology, Thailand. She received her B.S. from Mahidol University, Thailand, in 1985, M.S. in computer science from the Prince of Songkla University, Thailand, in 1991 and Ph.D. in computer science from Nova Southeastern University, USA, in 1999. She is a member of the ACM and the IEEE Computer Society. Her research interests include Knowledge Discovery in Databases, Artificial Intelligence, Logic and Constraint Programming, and Deductive and Active Databases.

Kittisak Kerdprasop is an associate professor and the director of the DEKD research unit at the school of computer engineering, Suranaree University of Technology, Thailand. He received his bachelor degree in Mathematics from Srinakarinwirot University, Thailand, in 1986, master degree in computer science from the Prince of Songkla University, Thailand, in 1991 and doctoral degree in computer science from Nova Southeastern University, USA, in 1999. His current research interests include Data Mining, Machine Learning, Artificial Intelligence, Logic and Functional Programming, Probabilistic Databases and Knowledge Bases.

A New Approach to Keyphrase Extraction Using Neural Networks

Kamal Sarkar, Mita Nasipuri and Suranjan Ghose

Computer Science and Engineering Department, Jadavpur University, Kolkata-700 032, India

Abstract
Keyphrases provide a simple way of describing a document, giving the reader some clues about its contents. Keyphrases can be useful in various applications such as retrieval engines, browsing interfaces, thesaurus construction, text mining, etc. There are also other tasks for which keyphrases are useful, as we discuss in this paper. This paper describes a neural network based approach to keyphrase extraction from scientific articles.
Our results show that the proposed method performs better than some state-of-the-art keyphrase extraction approaches.
Keywords: Keyphrase Extraction, Neural Networks, Text Mining.

1. Introduction

The pervasion of huge amounts of information through the World Wide Web (WWW) has created a growing need for the development of techniques for discovering, accessing, and sharing knowledge. Keyphrases help readers rapidly understand, organize, access, and share the information of a document. Keyphrases are phrases consisting of one or more significant words. Keyphrases can be incorporated in search results as subject metadata to facilitate information search on the web [1]. A list of keyphrases associated with a document may serve as an indicative summary or document metadata, which helps readers in searching for relevant information.
Keyphrases are meant to serve various goals. For example, (1) when they are printed on the first page of a journal document, the goal is summarization. They enable the reader to quickly determine whether the given article is worth in-depth reading. (2) When they are added to the cumulative index for a journal, the goal is indexing. They enable the reader to quickly find an article relevant to a specific need. (3) When a search engine form contains a field labeled keywords, the goal is to enable the reader to make the search more precise. A search for documents that match a given query term in the keyword field will yield a smaller, higher quality list of hits than a search for the same term in the full text of the documents. When the search is done on limited display area devices such as mobile phones, PDAs, etc., the concise summary in the form of keyphrases provides a new way of displaying search results in the smaller display area [2][3].
Although the research articles published in journals generally come with several author-assigned keyphrases, many documents such as news articles, review articles, etc. may not have author-assigned keyphrases at all, or the number of author-assigned keyphrases available with the documents is too limited to represent the topical content of the articles. Many documents also do not come with author-assigned keyphrases. So, an automatic keyphrase extraction process is highly desirable.
Manual selection of keyphrases from a document by a human is not a random act. Keyphrase extraction is a task related to human cognition. Hence, automatic keyphrase extraction is not a trivial task and it needs to be automated due to its usability in managing information overload on the web.
Some previous works on automatic keyphrase extraction used machine learning techniques such as Naïve Bayes, decision trees, genetic algorithms [15][16], etc. Wang et al. (2006) proposed in [14] a neural network based approach to keyphrase extraction, where keyphrase extraction is viewed as a crisp binary classification task. They train a neural network to classify whether a phrase is a keyphrase or not. This model is not suitable when the number of phrases classified by the classifier as positive is less than the desired number of keyphrases, K. To overcome this problem, we think that keyphrase extraction is a ranking problem rather than a classification problem. One good solution to this problem is to train a neural network to rank the candidate phrases. Designing such a neural network requires the keyphrases in the training data to be ranked manually. Sometimes, this is not feasible.
In this paper, we present a keyphrase extraction method that uses a multilayer perceptron neural network which is trained to output the probability estimate of a class: positive (keyphrase) or negative (not a keyphrase). Candidate phrases which are classified as positive are ranked first, based on their class probabilities. If the number of desired keyphrases is greater than the number of phrases classified as positive by the classifier, the candidate phrases classified as negative by the classifier are considered and they are sorted in increasing order of their class probabilities; that is, the candidate phrase classified as negative with the minimum probability estimate is added first to the list of previously selected keyphrases. This process continues until the number of extracted keyphrases exceeds the number K, where K is the desired number of keyphrases.
Our work also differs from the work proposed by Wang et al. (2006) [14] in the number and the types of features used. While they use the traditional TF*IDF and position features to identify the keyphrases, we use three extra features: phrase length, word length in a phrase, and the links of a phrase to other phrases. We also use the position of a phrase in a document as a continuous feature rather than a binary feature.
The paper is organized as follows. In section 2 we present the related work. Some background knowledge about artificial neural networks is discussed in section 3. In section 4, the proposed keyphrase extraction method is discussed. We present the evaluation and the experimental results in section 5.

2. Related Work

A number of previous works have suggested that document keyphrases can be useful in various applications such as retrieval engines [1], [4], browsing interfaces [5], thesaurus construction [6], and document classification and clustering [7]. Some supervised and unsupervised keyphrase extraction methods have already been reported by researchers.
An algorithm to choose noun phrases from a document as keyphrases has been proposed in [8]. Phrase length, its frequency and the frequency of its head noun are the features used in this work. Noun phrases are extracted from a text using a base noun phrase skimmer and an off-the-shelf online dictionary.
Chien [9] developed a PAT-tree-based keyphrase extraction system for Chinese and other oriental languages.
HaCohen-Kerner et al. [10][11] proposed a model for keyphrase extraction based on supervised machine learning and combinations of the baseline methods. They applied J48, an improved variant of the C4.5 decision tree, for feature combination.
Hulth et al. [12] proposed a keyphrase extraction algorithm in which a hierarchically organized thesaurus and frequency analysis were integrated. Inductive logic programming has been used to combine evidence from frequency analysis and the thesaurus.
A graph based model for keyphrase extraction has been presented in [13]. A document is represented as a graph in which the nodes represent terms, and the edges represent the co-occurrence of terms. Whether a term is a keyword is determined by measuring its contribution to the graph.
A neural network based approach to keyphrase extraction has been presented in [14] that exploits traditional term frequency, inverse document frequency and position (binary) features. The neural network has been trained to classify a candidate phrase as a keyphrase or not.
Turney [15] treats the problem of keyphrase extraction as a supervised learning task. In this task, nine features are used to score a candidate phrase; some of the features are the positional information of the phrase in the document and whether or not the phrase is a proper noun. Keyphrases are extracted from candidate phrases based on an examination of their features. Turney's program is called Extractor. One form of this extractor is called GenEx, which is designed based on a set of parameterized heuristic rules that are fine-tuned using a genetic algorithm. Turney compares GenEx to a standard machine learning technique called bagging, which uses a bag of decision trees for keyphrase extraction, and shows that GenEx performs better than the bagging procedure.
A keyphrase extraction program called Kea, developed by Frank et al. [16][17], uses the Bayesian learning technique for the keyphrase extraction task. A model is learned from the training documents with exemplar keyphrases and corresponds to a specific corpus containing the training documents. Each model consists of a Naive Bayes classifier and two supporting files containing phrase frequencies and stopped words. The learned model is used to identify the keyphrases from a document. In both Kea and Extractor, the candidate keyphrases are identified by splitting up the input text according to phrase boundaries (numbers, punctuation marks, dashes, brackets, etc.). Finally a phrase is defined as a sequence of one, two, or three words that appear consecutively in a text. Phrases beginning or ending with a stopped word are not taken under consideration. Kea and Extractor both use supervised machine learning based approaches. Two important features, the distance of the phrase's first appearance into the document and TF*IDF (used in the information retrieval setting), are considered during the development of Kea. Here TF corresponds to the frequency of a phrase in a document and IDF is estimated by counting the number of documents in the training corpus that contain a phrase P. Frank et al. [16][17] have shown that the performance of Kea is comparable to GenEx proposed by Turney.
An n-gram based technique for filtering keyphrases has been presented in [18]. In this approach, the authors compute n-grams such as unigrams, bigrams, etc. to extract the candidate keyphrases, which are finally ranked based on features such as term frequency and the position of a phrase in a document and a sentence.

3. Background

In this section, we briefly describe some basics of artificial neural networks and how to estimate class probabilities in an artificial neural network.
The estimation of class probabilities is important for our work because we use the estimated class probabilities as the confidence scores which are used in re-ranking the phrases belonging to a class: positive or negative.
Artificial Neural Networks (ANN) are predictive models loosely motivated by biological neural systems. In a generic sense, the terms "Neural Network" (NN) and "Artificial Neural Network" (ANN) usually refer to a Multilayer Perceptron (MLP) network, which is the most widely used type of neural network. A multilayer perceptron (MLP) is capable of expressing a rich variety of nonlinear decision surfaces. An example of such a network is shown in Figure 1. A multilayer perceptron neural network usually has three layers: one input layer, one hidden layer and one output layer. A vector of predictor variable values (x1...xi) is presented to the input layer. In the keyphrase extraction task, this input vector is the feature vector, which is a vector of values of features characterizing the candidate phrases. Before presenting a vector to the input layer, it is normalized. The input layer distributes the values to each of the neurons in the hidden layer. In addition to the predictor variables, there is a constant input of 1.0, called the bias, that is fed to each of the hidden neurons. The bias is multiplied by a weight and added to the sum going into the neuron. The value from each input neuron is multiplied by a weight (wij) and arrives at a neuron in the hidden layer, and the resulting weighted values are added together, producing a combined value at a hidden node. The weighted sum is then fed into a transfer function (usually a sigmoid function), which outputs a value. The outputs from the hidden layer are distributed to the output layer. Arriving at a node (a neuron) in the output layer, the value from each hidden layer neuron is again multiplied by a weight (wjk), and the resulting weighted values are added together, producing a combined value at an output node. The weighted sum is fed into a transfer function (usually a sigmoid function), which outputs a value Ok. The Ok values are the outputs of the network. One hidden layer is sufficient for nearly all problems. In some special situations, such as modeling data which contains saw-tooth-wave-like discontinuities, two hidden layers may be required. There is no theoretical reason for using more than two hidden layers.

Fig. 1 A multilayer feed-forward neural network: a training sample, X = (x1, x2, ..., xi), is fed to the input layer. Weighted connections exist between each layer, where wij denotes the weight from a unit j in one layer to a unit i in the previous layer.

The backpropagation algorithm performs learning on a multilayer feed-forward neural network. The backpropagation training algorithm was the first practical method for training multilayer perceptron (MLP) neural networks. The backpropagation (BP) algorithm implements a gradient descent search through the space of possible network weights, iteratively reducing the error between the training example target values and the network outputs. BP allows supervised mapping of input vectors to corresponding target vectors. The backpropagation training algorithm follows the following cycle to refine the weight values: (1) randomly choose a tentative set of weights (initial weight configuration) and run a set of predictor variable values through the network, (2) compute the difference between the predicted target value and the training example target value, (3) average the error information over the entire set of training instances, (4) propagate the error backward through the network and compute the gradient (vector of derivatives) of the change in error with respect to changes in weight values, (5) make adjustments to the weights to reduce the error. Each cycle is called an epoch.
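As a minimal numerical sketch of the forward pass and a single weight update described above, one hidden layer with sigmoid transfer functions can be written as follows. This is not the WEKA implementation the paper actually uses; the toy dimensions and values are invented, and for brevity the update is done per example rather than averaged over the whole training set.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Toy dimensions: 5 input features, 3 hidden neurons, 1 output neuron.
W_ih = rng.normal(scale=0.5, size=(3, 5))   # input -> hidden weights (w_ij)
b_h  = np.zeros(3)                          # bias fed to each hidden neuron
W_ho = rng.normal(scale=0.5, size=(1, 3))   # hidden -> output weights (w_jk)
b_o  = np.zeros(1)

x = np.array([0.8, 0.1, 0.5, 0.3, 0.9])     # one normalized feature vector
t = np.array([1.0])                         # target: 1 = keyphrase

# Forward pass: weighted sums pushed through the sigmoid transfer function.
h = sigmoid(W_ih @ x + b_h)
o = sigmoid(W_ho @ h + b_o)                 # O_k, the network output

# Backward pass: propagate the squared-error gradient layer by layer.
err = o - t
delta_o = err * o * (1 - o)
delta_h = (W_ho.T @ delta_o) * h * (1 - h)

lr = 0.3                                    # learning rate, as in the experiments below
W_ho -= lr * np.outer(delta_o, h)
b_o  -= lr * delta_o
W_ih -= lr * np.outer(delta_h, x)
b_h  -= lr * delta_h
```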
One of the most important issues in designing a perceptron network is the number of neurons to be used in the hidden layer(s). If an inadequate number of neurons is used, the network will be unable to model complex data, and the resulting network will fit poorly to the training data. If too many neurons are used, the training time may be excessively long, and the network may overfit the data. When overfitting occurs, the network will begin to model random noise in the data. As a result, the model fits the training data extremely well, but it performs poorly on new, unseen data. Cross validation can be used to test for this. The number of neurons in the hidden layers may be optimized by building models using varying numbers of neurons and measuring the quality using the cross validation method.

3.1 Computing Class Probability

Given the training data, a standard statistical technique such as Parzen Windows [22] can be used to estimate the probability density in the output space. After calculating the output vector O for an unknown input, one can compute the estimated probability that it belongs to each class c using the following formula:

P(c | O) = p(c | O) / Σ_c' p(c' | O)

where p(c|O) is the density of points of category c at location O in the scatter plot of category 1 vs. category 0 in a two-class problem [23]. We use the estimated class probabilities as the confidence scores to order the phrases belonging to a class: positive or negative.

4. Proposed Keyphrase Extraction Method

The proposed keyphrase extraction method consists of three primary components: document preprocessing, candidate phrase identification and keyphrase extraction using a neural network.

4.1 Document Preprocessing

The preprocessing task includes formatting each document. If a source document is in PDF format, it is converted to a text format before submission to the keyphrase extractor.

4.2 Candidate Phrase Identification

Candidate phrase identification is an important step in the keyphrase extraction task. We treat all the noun phrases in a document as the candidate phrases [1]. The following sub-section discusses how to identify noun phrases.

Noun Phrase Identification

To identify the noun phrases, documents should be tagged. The articles are passed to a POS tagger called MontyTagger [25] to extract the lexical information about the terms. Figure 2 shows a sample output of the MontyTagger for the following text segment:

"European nations will either be the sites of religious conflict and violence that sets Muslim minorities against secular states and Muslim communities against Christian neighbors, or it could become the birthplace of a liberalized and modernized Islam that could in turn transform the religion worldwide."

European/JJ nations/NNS will/MD either/DT be/VB the/DT sites/NNS of/IN religious/JJ conflict/NN and/CC violence/NN that/IN sets/NNS Muslim/NNP minorities/NNS against/IN secular/JJ states/NNS and/CC Muslim/NNP communities/NNS against/IN Christian/NNP neighbors/NNS ,/, or/CC it/PRP could/MD become/VB the/DT birthplace/NN of/IN a/DT liberalized/VBN and/CC modernized/VBN Islam/NNP that/WDT could/MD in/IN turn/NN transform/VB the/DT religion/NN worldwide/JJ ./.

Fig. 2 A sample output of the tagger.

In figure 2, NN, NNS, NNP, JJ, DT, VB, IN, PRP, WDT, MD, etc. are the lexical tags assigned by the tagger.
The meanings of the tags are as follows: NN and NNS for nouns (singular and plural respectively), NNP for proper nouns, JJ for adjectives, DT for determiners, VB for verbs, IN for prepositions, PRP for pronouns. This is not the complete tag set; the above mentioned tags are some examples of tags in the Penn Treebank tag set used by the MontyTagger.
The noun phrases are identified from the tagged sentences using the DFA (deterministic finite automaton) shown in figure 3. In this DFA, the states for adjective and noun represent all variations of adjectives and nouns.

Fig. 3 DFA for noun phrase identification (states: Start, Article, Adjective, Noun).

Figure 4 shows the noun phrases identified by our noun phrase identification component when the tagged sentences shown in figure 2 become its input. As shown in figure 4, the 10th phrase is "Islam", but manual inspection of the source text may suggest that it should be "Modernized Islam". This discrepancy occurs since the tagger assigns the tag "VBN" to the word "Modernized", and "VBN" indicates the participle form of a verb, which is not accepted by our DFA in figure 3 as part of a noun phrase. To avoid this problem "VBN" might be considered as a state in the DFA, but it might lead to mistakenly recognizing some verb phrases as noun phrases.

Document number   Sentence number   Noun phrase number   Noun phrase
100               4                 1                    European nations
100               4                 2                    sites
100               4                 3                    religious conflict
100               4                 4                    violence
100               4                 5                    sets muslim minorities
100               4                 6                    secular states
100               4                 7                    muslim communities
100               4                 8                    christian neighbors
100               4                 9                    birthplace
100               4                 10                   Islam
100               4                 11                   turn
100               4                 12                   religion

Fig. 4 Output of the noun phrase extractor for a sample input.
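A simple way to realize the adjective/noun automaton of figure 3 in code is sketched below. It is not the authors' implementation, and the exact transitions (for example, how determiners and participles are handled) are assumptions, since only the figure caption survives here.

```python
def extract_noun_phrases(tagged_tokens):
    """Extract candidate noun phrases from (word, tag) pairs produced by a POS
    tagger, accepting runs of adjectives followed by one or more nouns,
    in the spirit of the DFA of figure 3."""
    ADJ = {"JJ", "JJR", "JJS"}
    NOUN = {"NN", "NNS", "NNP", "NNPS"}
    phrases, current, seen_noun = [], [], False
    for word, tag in tagged_tokens:
        if tag in ADJ and not seen_noun:
            current.append(word)            # adjectives may start a phrase
        elif tag in NOUN:
            current.append(word)            # nouns start or continue a phrase
            seen_noun = True
        else:
            if seen_noun:                   # any other tag ends the phrase
                phrases.append(" ".join(current))
            current, seen_noun = [], False
    if seen_noun:
        phrases.append(" ".join(current))
    return phrases

tagged = [("European", "JJ"), ("nations", "NNS"), ("will", "MD"),
          ("either", "DT"), ("be", "VB"), ("the", "DT"), ("sites", "NNS"),
          ("of", "IN"), ("religious", "JJ"), ("conflict", "NN")]
print(extract_noun_phrases(tagged))
# ['European nations', 'sites', 'religious conflict']
```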
4.3 Features, Weighting and Normalization

After identifying the document phrases, a document is reduced to a collection of noun phrases. Since, in our work, we focus on the keyphrase extraction task for scientific articles, which are generally very long (6 to more than 20 pages), the collection of noun phrases identified in an article may be huge. Among this huge collection, a small number of phrases (5 to 15 phrases) may be selected as the keyphrases. Whether a candidate phrase is a keyphrase or not can be decided by a classifier based on a set of features characterizing the phrase. Discovering good features for a classification task is very much an art. The different features characterizing candidate noun phrases, and the feature weighting and normalization methods, are discussed below.

Phrase frequency, phrase links to other phrases and inverse document frequency

If a noun phrase occurs more frequently in a document, the phrase is assumed to be more important in the document. The number of times a phrase occurs independently in a document in its entirety is considered as the phrase frequency (PF). A noun phrase may appear in a text either independently or as a part of other noun phrases. These two types of appearances of noun phrases should be distinguished. If a noun phrase P1 appears in full as a part of another noun phrase P2 (that is, P1 is contained in P2), it is considered that P1 has a link to P2. The number of times a noun phrase (NP) has links to other phrases is counted and considered as the phrase link count (PLC). The two features, phrase frequency (PF) and phrase link count (PLC), are combined into a single feature value using the following measure:

Ffreq = (1/2) * (PF * PF + PLC)

In the above formula, the frequency of a noun phrase (PF) is squared only to give it more importance than the phrase link count (PLC). The value 1/2 has been used to moderate the value. We explain this formula with an example. Assume a phrase P1 whose PF value is 10, PLC value is 20 and PF+PLC = 30, and another phrase P2 whose PF value is 20, PLC value is 10 and PF+PLC = 30. For these two cases, simple addition of PF and PLC does not make any difference in assigning weights to the noun phrases, although the independent occurrence of noun phrase P2 is higher than that of noun phrase P1. But the independent existence of a phrase should get higher importance when deciding whether a phrase is keyphrase-worthy or not. In a more general case, consider that a single-word noun phrase NP1 occurs only once in independent existence and occurs (n+1) times as a part of other noun phrases, and NP2 is another phrase which occurs n times independently and occurs only once as a part of other phrases. In this situation, simple addition of PF and PLC will favor the first phrase, but our formula will give a higher score to the second phrase because it occurs more independently than the first one.
Inverse document frequency (IDF) is a useful measure to determine the commonness of a term in a corpus. The IDF value is computed using the formula log(N/df), where N is the total number of documents in a corpus and df (document frequency) means the number of documents in which a term occurs. A term with a lower df value is less frequent in the corpus and hence its IDF value becomes higher. So, if the IDF value of a term is higher, the term is relatively rare in the corpus. In this way, the IDF value is a measure for determining the rarity of a term in a corpus. Traditionally, the TF (term frequency) value of a term is multiplied by IDF to compute the importance of the term, where TF indicates the frequency of the term in a document. The TF*IDF measure favors a relatively rare term which is more frequent in a document. We combine Ffreq and IDF in the following way to have a variant of the Edmundsonian thematic feature [24]:

Fthematic = Ffreq * IDF

The value of this feature is normalized by dividing it by the maximum Fthematic score in the collection of Fthematic scores obtained by the phrases corresponding to a document.

Phrase Position

If a phrase occurs in the title or abstract of a document, it should be given a higher score. So, we consider the position of the first occurrence of a phrase in a document as a feature. Unlike the previous approaches [14][16] that treat the position of a phrase as a binary feature, in our work the score of a phrase that occurs first in sentence i is computed using the following formula:

Fpos = 1/i, if i <= n, where n is the position of the last sentence in the abstract of a document. For i > n, Fpos is set to 0.

Phrase Length and Word Length

These two features can be considered as the structural features of a phrase. Phrase length becomes an important feature in the keyphrase extraction task because the length of keyphrases usually varies from 1 word to 3 words. We find that keyphrases consisting of 4 or more words are relatively rare in our corpus.
The length of the words in a phrase can also be considered as a feature. According to Zipf's Law [21], shorter words occur more frequently than longer ones. For example, articles occur more frequently in a text. So, the word length can be an indication of the rarity of a word. We consider the length of the longest word in a phrase as a feature. If the length of a phrase is PL and the length of the longest word in the phrase is WL, these two feature values are combined into a single feature value using the following formula:

F = (PL * WL) / (log(1 + PL) * log(1 + WL))

The value of this feature is normalized by dividing it by the maximum value of the feature in the collection of phrases corresponding to a document.
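A small sketch of how PF, PLC and IDF might be computed for the candidate phrases of one document is given below. It is not the authors' extraction program: substring containment is used as a stand-in for "appears as part of another noun phrase", and the sample phrase list is invented.

```python
import math

def phrase_frequency_features(phrases):
    """Compute PF (independent occurrences) and PLC (occurrences as part of a
    longer phrase) for every distinct noun phrase extracted from one document.
    `phrases` is the list of noun phrase occurrences, in document order."""
    distinct = set(phrases)
    pf = {p: phrases.count(p) for p in distinct}
    plc = {p: sum(1 for q in phrases if p != q and p in q) for p in distinct}
    return pf, plc

def idf(document_frequency, n_docs):
    """IDF = log(N / df), as defined above."""
    return math.log(n_docs / document_frequency)

phrases = ["adult immunization", "immunization", "barriers",
           "adult immunization barriers", "immunization"]
pf, plc = phrase_frequency_features(phrases)
# pf["immunization"] == 2 and plc["immunization"] == 2, because "immunization"
# also appears inside "adult immunization" and "adult immunization barriers".
# The combined frequency feature given above is then
# f_freq = 0.5 * (pf[p] ** 2 + plc[p]), normalized per document.
```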
4.4 Keyphrase Extraction Using a Multilayer Perceptron Neural Network

Training a Multilayer Perceptron (MLP) neural network for keyphrase extraction requires the document noun phrases to be represented as feature vectors. For this purpose, we write a computer program for automatically extracting the values of the features characterizing the noun phrases in the documents. Author-assigned keyphrases are removed from each original document and stored in different files with a document identification number. For each noun phrase NP in each document d in our dataset, we extract the values of the features of the NP from d using the measures discussed in subsection 4.3. If the noun phrase NP is found in the list of author-assigned keyphrases associated with the document d, we label the noun phrase as a "positive" example, and if it is not found we label the phrase as a "negative" example. Thus the feature vector for each noun phrase looks like {<a1 a2 a3 ... an>, <label>}, which becomes a training instance (example) for a Multilayer Perceptron neural network, where a1, a2, ..., an indicate the feature values for a noun phrase. A training set consisting of a set of instances of the above form is built up by running a computer program on a set of documents selected from our corpus. After preparation of the training dataset, a Multilayer Perceptron neural network is trained on the training set to classify the noun phrases into one of two categories: "positive" or "negative". The positive category indicates that a noun phrase is a keyphrase and the negative category indicates that it is not a keyphrase.
For our experiment, we use the Weka (www.cs.waikato.ac.nz/ml/weka) machine learning tools. We use Weka's Simple CLI utility, which provides a simple command-line interface that allows direct execution of WEKA commands. The training data is stored in the .ARFF format, which is an important requirement for WEKA. The multilayer perceptron is included under the panel Classifier/functions of the WEKA workbench. The description of how to use the MLP in keyphrase extraction has been discussed in section 3. For our work, the classifier MLP of the WEKA suite has been trained with the following values of its parameters:

Number of layers: 3 (one input layer, one hidden layer and one output layer)
Number of hidden nodes: (number of attributes + number of classes)/2
Learning rate: 0.3
Momentum: 0.2
Training iterations: 500
Validation threshold: 20

WEKA uses the backpropagation algorithm for training the multilayer perceptron neural network. The trained neural network is applied to a test document whose noun phrases are also represented in the form of feature vectors using the same method applied to the training documents. During testing, we use the -p option (soft threshold option). With this option, we can generate a probability estimate for the class of each vector. This is required when the number of noun phrases classified as positive by the classifier is less than the desired number of keyphrases. It is possible to save the output in a file using the indirection sign (>) and a file name. We save the output produced by the classifier for each test document in a separate file. Then we rank the phrases using the algorithm shown in figure 5 for keyphrase extraction.

Input: A file containing the noun phrases of a test document with their classifications (positive or negative) and the probability estimates of the classes to which the phrases belong.
Begin:
i. Select the noun phrases which have been classified as positive by the classifier and reorder these selected noun phrases in decreasing order of their probability estimates of being in class 1 (positive). Save the selected phrases into an output file and delete them from the input file.
ii. For the rest of the noun phrases in the input file, which are classified by the classifier as "negative", order the phrases in increasing order of their probability estimates of being in class 0 (negative). In effect, the phrase for which the probability estimate of being in class 0 is minimum comes at the top. Append the ordered phrases to the output file.
iii. Save the output file.
End

Fig. 5 Noun phrase ranking based on the classifier's decisions.

After ranking the noun phrases, the K top-ranked noun phrases are selected as keyphrases for each input test document.
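The ranking procedure of figure 5 is straightforward to express in code. The sketch below assumes the classifier output has already been parsed into (phrase, predicted class, probability of the predicted class) tuples; the example values are invented.

```python
def rank_candidate_phrases(predictions, k):
    """Rank candidate noun phrases from the classifier output and return the
    top-k keyphrases, following the procedure of figure 5.

    `predictions` is a list of (phrase, predicted_class, class_probability)
    tuples, where class_probability is the probability estimate of the
    predicted class ('positive' or 'negative')."""
    positives = [p for p in predictions if p[1] == "positive"]
    negatives = [p for p in predictions if p[1] == "negative"]

    # Phrases predicted positive: higher P(positive) first.
    positives.sort(key=lambda p: p[2], reverse=True)
    # Phrases predicted negative: lower P(negative) first, i.e. the least
    # confidently rejected phrases are appended first.
    negatives.sort(key=lambda p: p[2])

    ranked = [phrase for phrase, _, _ in positives + negatives]
    return ranked[:k]

# Example with hypothetical classifier output.
preds = [("adult immunization", "positive", 0.91),
         ("barriers", "positive", 0.74),
         ("healthcare providers", "negative", 0.55),
         ("survey", "negative", 0.83)]
print(rank_candidate_phrases(preds, 3))
# ['adult immunization', 'barriers', 'healthcare providers']
```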
5. Evaluation and Experimental Results

There are two usual practices for evaluating the effectiveness of a keyphrase extraction system. One method is to use human judgment, asking human experts to give scores to the keyphrases generated by a system. Another method, less costly, is to measure how well the system-generated keyphrases match the author-assigned keyphrases. It is a common practice to use the second approach in evaluating a keyphrase extraction system [7][8][11][19]. We also prefer the second approach to evaluate our keyphrase extraction system, by computing its precision and recall using the author-provided keyphrases for the documents in our corpus. For our experiments, precision is defined as the proportion of the extracted keyphrases that match the keyphrases assigned by a document's author(s). Recall is defined as the proportion of the keyphrases assigned by a document's author(s) that are extracted by the keyphrase extraction system.

5.1 Experimental Dataset

The data collection used for our experiments consists of 150 full journal articles whose size ranges from 6 pages to 30 pages. Full journal articles are downloaded from the websites of journals in three domains: Economics, Legal (Law) and Medical. Articles on Economics are collected from various issues of journals such as the Journal of Economics (Springer), Journal of Public Economics (Elsevier), Economics Letters, and Journal of Policy Modeling. All these articles are available in PDF format. Articles on Law and legal cases have been downloaded from various issues of law journals such as Computer Law and Security Review (Elsevier), International Review of Law and Economics (Elsevier), European Journal of Law and Economics (Springer), Computer Law and Security Report (Elsevier), and AGORA International Journal of Juridical Sciences (Open access).
Medical articles are downloaded from various issues of medical journals such as the Indian Journal of Medicine, Indian Journal of Pediatrics, Journal of Psychology and Counseling, African Journal of Traditional, Complementary and Alternative Medicines, Indian Journal of Surgery, Journal of General Internal Medicine, The American Journal of Medicine, International Journal of Cardiology, and Journal of Anxiety Disorders. The number of articles under each category used in our experiments is shown in Table 1.

Table 1: Source documents used in our experiments.
Source Document Type   Number of Documents
Economics              60
Law                    40
Medical                50

For the system evaluation, the set of journal articles is divided into multiple folds where each fold consists of one training set of 100 documents and a test set of 50 documents. The training set and the test set are independent from each other. The set of author-assigned keyphrases available with the articles are manually removed before candidate terms are extracted. For all experiments discussed in this paper, the same splits of our dataset into a training set and a test set are used.
Some useful statistics about our corpus are given below. The total number of noun phrases in our corpus is 144978. The average number of author-provided keyphrases for all the documents in our corpus is 4.90. The average number of keyphrases that appear in all the source documents in our corpus is 4.34. Here it is interesting to note that all the author-assigned keyphrases for a document may not occur in the document itself. The average number of keyphrases that appear in the list of candidate phrases extracted from all the documents in our corpus is 3.50. These statistics interestingly show that some keyphrase-worthy phrases may be missed at the stage of candidate phrase extraction. The main problems related to designing a robust candidate phrase extraction algorithm are: (1) the irregular structure of a keyphrase, that is, it may contain only a single word, a multiword noun phrase, or multiple multiword noun phrases connected by prepositions (an example of a keyphrase containing multiple multiword noun phrases is: "The National Council for Combating Discrimination"), and (2) the ill-formatted input texts which are generated by a pdf-to-text converter from the scientific articles, usually available in PDF format.

5.2 Experiments

We conducted two experiments to judge the effectiveness of the proposed keyphrase extraction method.

Experiment 1
In this experiment, we develop a neural network based keyphrase extraction system as discussed in this paper. All the features discussed in subsection 4.3 are incorporated in this system.

Experiment 2
This experiment compares the proposed system to an existing system. Kea [17] is a publicly available keyphrase extraction system. Kea uses a limited number of features, such as positional information and the TF*IDF feature, for keyphrase extraction. Kea uses the Naïve Bayesian learning algorithm for keyphrase extraction. We downloaded version 5.0 of Kea (http://www.nzdl.org/Kea/) and installed it on our machine. A separate model is built for each fold, which contains 100 training documents and 50 test documents. Kea builds a model from each training dataset using Naïve Bayes and uses this pre-built model to extract keyphrases from the test documents.
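Under the precision and recall definitions of Section 5, the per-document evaluation can be sketched as below. Exact string matching after lower-casing is an assumption (the authors' matching may be more lenient), and the sample phrase lists are illustrative only.

```python
def precision_recall(extracted, author_keyphrases):
    """Precision and recall against the author-assigned keyphrases, following
    the definitions in Section 5. Exact match after lower-casing is assumed."""
    extracted_set = {k.lower().strip() for k in extracted}
    gold = {k.lower().strip() for k in author_keyphrases}
    matches = len(extracted_set & gold)
    precision = matches / len(extracted_set) if extracted_set else 0.0
    recall = matches / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical output for a single test document.
extracted = ["neural network", "keyphrase extraction", "training data",
             "precision", "scientific articles"]
gold = ["keyphrase extraction", "neural network", "text mining"]
print(precision_recall(extracted, gold))   # (0.4, 0.666...): 2 matches out of 5 and 3
```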
5.3 Results

To measure the overall performance of the proposed neural network based keyphrase extraction system and the publicly available keyphrase extraction system Kea, our experimental dataset consisting of 150 documents is divided into 3 folds for 3-fold cross validation, where each fold contains two independent sets: a training set of 100 documents and a test set of 50 documents. A separate model is built for each fold to collect 3 test results, which are averaged to obtain the final results for a system. The number of keyphrases to be extracted (the value of K) is set to 5, 10 and 15 for each of the keyphrase extraction systems discussed in this paper.
Table 2 shows the author-assigned keyphrases for journal article number 12 in our corpus. Table 3 and Table 4 show, respectively, the top 5 keyphrases extracted by the MLP based system and Kea when journal article number 12 in our corpus is presented as a test document to these systems.

Table 2: Author-assigned keyphrases for journal article number 12 in our test corpus.
Dno   AuthorKey
12    adult immunization
12    barriers
12    consumer
12    provider survey

Table 3: Top 5 keyphrases extracted by the proposed MLP based keyphrase extractor.
Dno   NP
12    immunization
12    adult immunization
12    healthcare providers
12    consumers
12    barriers

Table 4: Top 5 keyphrases extracted by Kea.
Dno   NP
12    adult
12    immunization
12    vaccine
12    healthcare
12    barriers

Table 2 and Table 3 show that out of the 5 keyphrases extracted by the MLP based approach, 3 keyphrases match the author-assigned keyphrases. The overall performance of the proposed MLP based keyphrase extractor is shown in Table 5. Table 2 and Table 4 show that out of the 5 keyphrases extracted by Kea, only one matches the author-assigned keyphrases. The overall performance of Kea is compared with that of the proposed MLP based keyphrase extraction system in Table 5.

Table 5: Comparison of the performances of the proposed MLP based keyphrase extraction system and Kea.
Number of keyphrases   Average Precision (MLP)   Average Precision (Kea)   Average Recall (MLP)   Average Recall (Kea)
5                      0.34                      0.28                      0.35                   0.29
10                     0.22                      0.19                      0.46                   0.40
15                     0.17                      0.15                      0.51                   0.48

Table 5 shows the comparison of the performances of the proposed MLP based keyphrase extraction system and Kea. From Table 5, we can clearly conclude that the proposed keyphrase extraction system outperforms Kea for all three cases shown in the three rows of the table.
To interpret the results shown in Table 5, we would like to analyze the upper bounds on the precision and recall of a keyphrase extraction system on our dataset. This analysis can be presented in two ways: (1) some author-provided keyphrases might not occur in the document they were assigned to. In our corpus, about 88% of author-provided keyphrases appear somewhere in the source documents. After extracting candidate phrases using our candidate phrase extraction algorithm, we find that only 72% of author-provided keyphrases appear somewhere in the list of candidate phrases extracted from all the source documents. So, keeping our candidate phrase extraction algorithm fixed, if a system is designed with the best possible features, or a system is allowed to extract all the phrases in each document as keyphrases, the highest possible average recall for a system can be 0.72.
In our experiments, the average number of author-provided keyphrases per document is only 4.90, so the precision would not be high even when the number of extracted keyphrases is large. For example, when the number of keyphrases to be extracted for each document is set to 10, the highest possible average precision is around 0.3528 (4.90 * 0.72 / 10 = 0.3528). (2) Assume that the candidate phrase extraction procedure is perfect, that is, it is capable of representing all the source documents as collections of candidate phrases in such a way that all author-provided keyphrases appearing in the source documents also appear in the list of candidate phrases. In that case, 88% of the author-provided keyphrases appear somewhere in the list of candidate phrases, because, on average, 88% of the author-provided keyphrases appear somewhere in the source documents of our corpus. If a system is then allowed to extract all the phrases in each document as keyphrases, the highest possible average recall can be 0.88, and when the number of keyphrases to be extracted for each document is set to 10, the highest possible average precision is around 0.4312 (4.90 * 0.88 / 10 = 0.4312).

6. Conclusions
This paper presents a novel keyphrase extraction approach using neural networks. For predicting whether a phrase is a keyphrase or not, we use the estimated class probabilities as confidence scores, which are used to re-rank the phrases belonging to a class: positive or negative. To identify the keyphrases, we use five features: TF*IDF, the position of a phrase's first appearance, phrase length, word length in a phrase, and the links of a phrase to other phrases. The proposed system performs better than a publicly available keyphrase extraction system called Kea. As future work, we plan to improve the proposed system by (1) improving the candidate phrase extraction module of the system and (2) incorporating new features such as structural features and lexical features.
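To make the feature set listed above concrete, the sketch below (our own illustration, not the authors' code) shows how the five features could be packed into a vector for a candidate phrase before it is passed to the trained MLP; the structure and function names are hypothetical.

/* Hypothetical illustration of the five-feature input vector. */
struct phrase_features {
    double tf_idf;          /* TF*IDF of the candidate phrase               */
    double first_position;  /* normalized position of first appearance      */
    double phrase_length;   /* number of words in the phrase                */
    double word_length;     /* word length within the phrase                */
    double links;           /* links of the phrase to other phrases         */
};

/* Assumed interface: the trained network maps the feature vector to a
   class probability, which is used as a confidence score for re-ranking. */
extern double mlp_keyphrase_probability(const struct phrase_features *f);

Candidate phrases would then be sorted by the returned probability and the top K reported as keyphrases, matching the re-ranking step described in the conclusions.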
References
[1] Y. B. Wu, Q. Li, "Document keyphrases as subject metadata: incorporating document key concepts in search results," Journal of Information Retrieval, Vol. 11, No. 3, 2008, pp. 229-249.
[2] O. Buyukkokten, H. Garcia-Molina, and A. Paepcke, "Seeking the Whole in Parts: Text Summarization for Web Browsing on Handheld Devices," in Proceedings of the World Wide Web Conference, Hong Kong, 2001.
[3] O. Buyukkokten, O. Kaljuvee, H. Garcia-Molina, A. Paepcke, and T. Winograd, "Efficient Web Browsing on Handheld Devices Using Page and Form Summarization," ACM Transactions on Information Systems (TOIS), 2002, 20(1), 82-115.
[4] S. Jones, M. Staveley, "Phrasier: A system for interactive document retrieval using keyphrases," in Proceedings of SIGIR, Berkeley, CA, 1999.
[5] C. Gutwin, G. Paynter, I. Witten, C. Nevill-Manning, E. Frank, "Improving browsing in digital libraries with keyphrase indexes," Journal of Decision Support Systems, 2003, 27(1-2), 81-104.
[6] B. Kosovac, D. J. Vanier, T. M. Froese, "Use of keyphrase extraction software for creation of an AEC/FM thesaurus," Journal of Information Technology in Construction, 2000, 25-36.
[7] S. Jones, M. Mahoui, "Hierarchical document clustering using automatically extracted keyphrases," in Proceedings of the Third International Asian Conference on Digital Libraries, Seoul, Korea, 2000, pp. 113-120.
[8] K. Barker, N. Cornacchia, "Using Noun Phrase Heads to Extract Document Keyphrases," in H. Hamilton, Q. Yang (eds.): Canadian AI 2000, Lecture Notes in Artificial Intelligence, Vol. 1822, Springer-Verlag, Berlin Heidelberg, 2000, 40-52.
[9] L. F. Chien, "PAT-tree-based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval," Information Processing and Management, 1999, 35, 501-521.
[10] Y. HaCohen-Kerner, "Automatic Extraction of Keywords from Abstracts," in V. Palade, R. J. Howlett, L. C. Jain (eds.): KES 2003, Lecture Notes in Artificial Intelligence, Vol. 2773, Springer-Verlag, Berlin Heidelberg, 2003, 843-849.
[11] Y. HaCohen-Kerner, Z. Gross, A. Masa, "Automatic Extraction and Learning of Keyphrases from Scientific Articles," in A. Gelbukh (ed.): CICLing 2005, Lecture Notes in Computer Science, Vol. 3406, Springer-Verlag, Berlin Heidelberg, 2005, 657-669.
[12] A. Hulth, J. Karlgren, A. Jonsson, H. Boström, "Automatic Keyword Extraction Using Domain Knowledge," in A. Gelbukh (ed.): CICLing 2001, Lecture Notes in Computer Science, Vol. 2004, Springer-Verlag, Berlin Heidelberg, 2001, 472-482.
[13] Y. Matsuo, Y. Ohsawa, M. Ishizuka, "KeyWorld: Extracting Keywords from a Document as a Small World," in K. P. Jantke, A. Shinohara (eds.): DS 2001, Lecture Notes in Computer Science, Vol. 2226, Springer-Verlag, Berlin Heidelberg, 2001, 271-281.
[14] J. Wang, H. Peng, J.-S. Hu, "Automatic Keyphrases Extraction from Document Using Neural Network," ICMLC 2005, 633-641.
[15] P. D. Turney, "Learning algorithms for keyphrase extraction," Journal of Information Retrieval, 2000, 2(4), 303-336.
[16] E. Frank, G. Paynter, I. H. Witten, C. Gutwin, C. Nevill-Manning, "Domain-specific keyphrase extraction," in Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, San Mateo, CA, 1999.
[17] I. H. Witten, G. W. Paynter, E. Frank et al., "KEA: Practical Automatic Keyphrase Extraction," in E. A. Fox, N. Rowe (eds.): Proceedings of Digital Libraries '99: The Fourth ACM Conference on Digital Libraries, ACM Press, Berkeley, CA, 1999, 254-255.
[18] N. Kumar, K. Srinathan, "Automatic keyphrase extraction from scientific documents using N-gram filtration technique," in Proceedings of the Eighth ACM Symposium on Document Engineering, Sao Paulo, Brazil, September 16-19, 2008.
[19] Q. Li, Y. Brook Wu, "Identifying important concepts from medical documents," Journal of Biomedical Informatics, 2006, 668-679.
[20] C. Fellbaum, WordNet: An Electronic Lexical Database, Cambridge: MIT Press, 1998.
[21] G. K. Zipf, The Psycho-Biology of Language, Cambridge, MA: MIT Press, 1935 (reprinted 1965).
[22] R. Duda, P. Hart, Pattern Classification and Scene Analysis, Wiley and Sons, 1973.
[23] J. S. Denker, Y. LeCun, "Transforming neural-net output levels to probability distributions," AT&T Bell Labs Technical Memorandum 11359-901120-05.
[24] H. P. Edmundson, "New methods in automatic extracting," Journal of the Association for Computing Machinery, 1969, 16(2), 264-285.
[25] H. Liu, MontyLingua: An end-to-end natural language processor with common sense, 2004, retrieved in 2005 from web.media.mit.edu/~hugo/montylingua.
C Implementation & Comparison of Companding & Silence Audio Compression Techniques

Mrs. Kruti Dangarwala (1) and Mr. Jigar Shah (2)
(1) Department of Computer Engineering, Sri S'ad Vidya Mandal Institute of Technology, Bharuch, Gujarat, India
(2) Department of Electronics and Telecommunication Engineering, Sri S'ad Vidya Mandal Institute of Technology, Bharuch, Gujarat, India

Abstract
Just about all of the newest living-room audio-video electronics and PC multimedia products being designed today incorporate some form of compressed digitized-audio processing capability. Audio compression reduces the bit rate required to represent an analog audio signal while maintaining the perceived audio quality. Discarding inaudible data reduces the storage, transmission and compute requirements of handling high-quality audio files. This paper covers the WAVE audio file format and the algorithms of the silence compression method and the companding method used to compress and decompress WAVE audio files, and then compares the results of these two methods.
Keywords: threshold, chunk, bitstream, companding, silence

1. Introduction
Audio compression reduces the bit rate required to represent an analog audio signal while maintaining the perceived audio quality. Most audio decoders being designed today are called "lossy," meaning that they throw away information that cannot be heard by most listeners. The information to be discarded is chosen based on psychoacoustics, which uses a model of human auditory perception to determine which parts of the audible spectrum the largest portion of the human population can detect. First, an audio encoder [1] divides the frequency domain of the signal being digitized into many bands and analyzes a block of audio to determine what is called a "masking threshold." The number of bits used to represent a tone depends on the masking threshold. The noise associated with using fewer bits is kept low enough so that it will not be heard. Tones that are completely masked may not have any bits allocated to them. Discarding inaudible data reduces the storage, transmission and compute requirements of handling high-quality audio files.
Consider the example of a typical audio signal found in a CD-quality audio device. The CD player produces two channels of audio. Each analog signal [2] in each channel is sampled at a 44.1 kHz sample rate, and each sample is represented as a 16-bit digital data word. Producing both channels therefore requires a data rate of 1.4 Mbits/second. With audio compression, however, this data rate is reduced by around an order of magnitude; thus, a typical CD player is reading compressed data from a compact disk at a rate just over 100 Kbits/s.
Audio compression really consists of two parts. The first part, called encoding, transforms the digital audio data that resides, say, in a WAVE file into a highly compressed form called a bitstream. To play the bitstream on a soundcard, the second part, called decoding, is needed. Decoding takes the bitstream and re-expands it to a WAVE file.

2. WAVE Audio File Format
The WAVE file format [1] is a subset of Microsoft's RIFF specification, which can include many different kinds of data. RIFF is a file format for storing many kinds of data, primarily multimedia data such as audio and video. It is based on chunks and sub-chunks. Each chunk has a type, represented by a four-character tag. This chunk type comes first in the file, followed by the size of the chunk, then the contents of the chunk.
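As an illustration of the chunk layout just described (tag, size, then contents), the following C declarations sketch the generic RIFF chunk header and the PCM format chunk fields listed in the tables that follow; they are provided for orientation only and are not part of the original paper.

#include <stdint.h>

/* Generic RIFF chunk header: a four-character tag followed by the
   size of the chunk contents in bytes. */
struct riff_chunk_header {
    char     id[4];      /* e.g. "RIFF", "fmt ", "data" */
    uint32_t size;       /* length of the chunk contents */
};

/* Fields of the PCM "fmt " chunk of a WAVE file (little-endian). */
struct wave_format_chunk {
    uint16_t format_tag;       /* 1 = PCM                        */
    uint16_t channels;         /* 1 = mono, 2 = stereo           */
    uint32_t sample_rate;      /* samples per second, e.g. 8000  */
    uint32_t bytes_per_second; /* sample_rate * block_align      */
    uint16_t block_align;      /* channels * bits_per_sample / 8 */
    uint16_t bits_per_sample;  /* 8 or 16                        */
};

In practice these fields are read one by one (or with structure packing enabled), since compilers may insert padding between members.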
The entire RIFF file is a big chunk that contains all the other chunks. The first thing in the contents of the RIFF chunk is the "form type," which describes the overall type of the file's contents. So the structure of a WAVE audio file looks like this: a) RIFF chunk, b) Format chunk, c) Data chunk.

Table 1: RIFF CHUNK
Byte Number   Description
0-3           "RIFF" (ASCII characters)
4-7           Total length of package to follow (binary)
8-11          "WAVE" (ASCII characters)

The RIFF chunk is described as follows:
Offset   Length    Contents
0        4 bytes   'RIFF'
4        4 bytes   <file length - 8>
8        4 bytes   'WAVE'

Table 2: FORMAT CHUNK
Byte Number   Description
0-3           "fmt_" (ASCII characters)
4-7           Length of format chunk (binary)
8-9           Always 0x01
10-11         Channel numbers (0x01 = mono, 0x02 = stereo)
12-15         Sample rate (binary, in Hz)
16-19         Bytes per second
20-21         Bytes per sample (1 = 8-bit mono, 2 = 8-bit stereo/16-bit mono, 4 = 16-bit stereo)
22-23         Bits per sample

The FORMAT chunk is described as follows:
Offset   Length    Contents
12       4 bytes   'fmt '
16       4 bytes   0x00000010
20       2 bytes   0x0001            // Format tag: 1 = PCM
22       2 bytes   <channels>
24       4 bytes   <sample rate>     // Samples per second
28       4 bytes   <bytes/second>    // sample rate * block align
32       2 bytes   <block align>     // channels * bits/sample / 8
34       2 bytes   <bits/sample>     // 8 or 16

Table 3: DATA CHUNK
Byte Number   Description
0-3           "data" (ASCII characters)
4-7           Length of data to follow
8-end         Data (samples)

The DATA chunk is described as follows:
Offset   Length    Contents
36       4 bytes   'data'
40       4 bytes   <length of the data block>
44       -         <sample data>

The sample data must end on an even byte boundary. All numeric data fields are in the Intel format of low-high byte ordering. 8-bit samples are stored as unsigned bytes, ranging from 0 to 255. 16-bit samples are stored as 2's-complement signed integers, ranging from -32768 to 32767. For multi-channel data, samples are interleaved between channels, like this: sample 0 for channel 0, sample 0 for channel 1, sample 1 for channel 0, sample 1 for channel 1. For stereo audio, channel 0 is the left channel and channel 1 is the right.

3. Silence Compression & Decompression Techniques
3.1 Introduction
Silence compression [4] on sound files is the equivalent of run-length encoding on normal data files. In this case, the runs we encode are sequences of relative silence in a sound file. Here we replace sequences of relative silence with absolute silence, so it is a lossy technique.

3.2 User Parameters
1) Threshold value: this defines what is considered as silence. With 8-bit samples, 80H is considered as "pure" silence. Any sample value within a range of plus or minus 4 from 80H is considered as silence.
2) Silence_Code: the code used to encode a run of silence. We use the value FF to encode silence. The Silence_Code is followed by a single byte that indicates how many consecutive silence codes there are.
3) Start_Threshold: this recognizes the start of a run of silence. We would not want to start encoding silence after seeing just a single byte of silence. It does not even become economical until 3 bytes of silence are seen. We may want to experiment with even higher values than 3 to see how it affects the fidelity of the recording.
4) Stop_Threshold: this indicates how many consecutive non-silence codes need to be seen in the input stream before we declare the silence run to be over.

3.3 Silence Compression Algorithm (Encoder)
1) Read 8-bit sample data from the audio file.
2) Checking of silence means finding at least 5 consecutive silence values: 80H or +4/-4 from 80H
(Indicate start of silence) 3) If get, Encode with Silence_Code followed by runs. \ (Consecutive Silence values). 4) Stop to Encode when found atleast two Non-Silence values. 5) Repeat all above steps until end of file character found. 6) Print input File size, Output File Size and Compression Ratio. This algorithm [4] takes 8-bit wave audio file as input. Here It find starting of silence means check that at least 5 consecutive silence value present or not. Here 80H considered as pure silence and +/- 4 from 80H also consider as silence. If found, then it start encoding process. Here consecutive silence values are encoded by silence_code followed by runs (Consecutive silence values). It stop encoding when it found at least two nonsilence values. Then it generate compressed file extension of that file is also wav file. Example of algorithm as follows: IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 2, No 3, March 2010 www.IJCSI.org Input file consists of sample data like 80 81 80 81 80 80 80 45. Output file consists of compressed data like FF745. It display following attributes of input wave audio file. a. Input file size in terms of bytes b. Output file size in terms of bytes c. Compression ratio. 3.4 Silence Dcompression Algorithm (Decoder) 1) 2) 3) 4) Read 8-bit Sample from Compress file. Check the Silence code means 0xff if it found, Check the next value means runs which Indicate no of silence value. Replace it with 0x80 (silence value) no of runs times. Repeat above step until we get end of file Character. Example of algorithm as follows: In input file (compressed file (extension of this file .wav)) we find silence code , if we get then we check next value which indicate no of silence value. Then we replace with silence value no of runs times decided by user. We stop the procedure when we get end of file character. If we get value 0xff5 in compress file, decode that value by 0x80 0x80 0x80 0x80 0x80 0x80 Mapped 15-bit numbers can be decoded back into the original 16-bit samples by the inverse formula: Sample=65356 log2 (1+mapped/32767) Here the Amount of Compression[2] should thus be a user-Controlled parameter. And this is an interesting example of a compression method where the compression ratio is known in advance. Now no need to go through the “(1)” & “(2)”. Since the mapping of all the samples can be prepared in advance in a table. Both decoding and encoding are thus fast.Use [4] in this method. 4.2 Companding Compression Algorithm 1) 2) 4.1 Introduction Mapped=32767*(pow (2, sample/65356) 3) 4) 5) Non-Linear Formula For 16-bit samples to 15-bit samples conversion: (1) Using this formula every 16-bit sample data converted into 15-bit sample data. It performs non-linear mapping such that small samples are less affected than large ones. (2) Reducing 16-bit numbers to 15-bits does not produce much compression. Better Compression can be achieved by substituting a smaller number for 32767 in “(1)” & “(2)”. A value of 127, For example would map each 16-bit sample into 8-bit sample. So in this case we compress file with compression ratio of 0.5 means 50%. Here Decoding should be less accurate. A 16-bit sample of 60,100, for example, would be mapped into the 8-bit number 113, but this number would produce 60,172 when decoded by “(2)”. Even worse, the small 16-bit sample 1000 would be mapped into 1.35, which has to be rounded to 1. When “(2)” is used to decode a 1, it produce 742, significantly different from the original sample. 4. 
Companding Compression & Decompression Techniques Companding [4] uses the fact that the ear requires more precise samples at low amplitudes (soft sounds). But is more forgiving at higher amplitudes. A typical ADC used in sound cards for personal computers convert voltages to numbers linearly. If an amplitude a is converted to the number n, then amplitude 2a will be converted to the number 2n. It examines every sample in the sound file and uses a nonlinear formula to reduce the no. of bits devoted to it. 28 Input the No. of Bits to use for output code. Make Compress Look-Up Table using NonLinear Formula : (8-bit to Input bit) Value=128.0*(pow (2, code/N)-1.0)+0.5 Where, code: pow(2,inputbit) to 1 N: pow (2, Inputbit) For each code we assign value 0 to 15 in table: Index of table value J+127 code+N-1 128-j N-code where j=value to zero Now Read 8-bit samples from audio file and That sample become the index of compress look-up table, Find corresponding value, that output vale store in output file. Repeat step-III until we get end of file character. Print the Input file size in bytes, output file size in bytes & compression ratio. Description of algorithm is that it used for converting 8- bit sample file into user defined output bit. For example if we input output bit : 4 then we achieve 50% compression. and we say compression ratio is 0.5. So we say that in this method compression ratio is known in advance. We adjust compression ratio according our requirement. So it is crucial point compare to another method. IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 2, No 3, March 2010 www.IJCSI.org 4.3 Companding Decompression Algorithm 1) Find the No. of bits used in compressed file. 2) Make Expand Look-Up table using Non-Linear Formula: (Input bit 8-bit) Value=128.0*(pow (2.0, code/N)-1.0)+0.5 Where code: 1 to pow (2, Inputbit) N: pow (2, Inputbit) For each code: we assign value 0 to 255 in table: Index of table value N+code-1 128+(value+last_value)/2 N-code 127-(value+last_value)/2 Here initially, last_value=0 & for each code, last_value=value 3) 4) Now read input bit samples from audio file & that sample become the index of expand lookup table, find corresponding value, that output sample value store in output file. Repeat the step-III until file size becomes zero. 5. Result/comparisons Between Two Lossy Method 5.1 Companding Compression Method 1) INPUT AUDIO FILE: Name of File: Media Length: Audio Format: File Size: J1. WAV 4.320 sec PCM, 8000Hz, 8-Bit, Mono 33.8KB (34,618 BYTES) User Parameter (No. Of Bits) Input File Size(in Bytes) Output File Size(in Bytes) Compression Ratio(in Percentage) 1 34618 4390 User Parameter (No. Of Bits) Input File Size(in Bytes) Output File Size(in Bytes) Compression Ratio(in Percentage) 2 34618 8709 75% User Parameter (No. Of Bits) Input File Size(in Bytes) Output File Size(in Bytes) Compression Ratio(in Percentage) 3 34618 13028 63% User Parameter (No. Of Bits) Input File Size(in Bytes) Output File Size(in Bytes) Compression Ratio(in Percentage) 88% 4 34618 17347 50% 29 User Parameter (No. Of Bits) Input File Size(in Bytes) Output File Size(in Bytes) Compression Ratio(in Percentage) 5 34618 21666 38% User Parameter (No. Of Bits) Input File Size(in Bytes) Output File Size(in Bytes) Compression Ratio(in Percentage) 6 34618 25985 25% User Parameter (No. Of Bits) Input File Size(in Bytes) Output File Size(in Bytes) Compression Ratio(in Percentage) 7 34618 30304 13% User Parameter (No. 
Of Bits) Input File Size(in Bytes) Output File Size(in Bytes) Compression Ratio(in Percentage) 8 34618 34618 0% 5.2 Silence Compression Method: 1)INPUT AUDIO FILE : Name of File: Media Length: Audio Format: File Size: J1. WAV 4.320 SEC PCM, 8000HZ, 8-BIT, Mono 33.8KB (34,618 BYTES) Input File Size(in Bytes) Output File Size(in Bytes) Compression Ratio(in Percentage) 34618 25099 28% 2) INPUT AUDIO FILE: Name Of File:: Media Length: Audio Format: File Size: Chimes2.wav 0.63 sec PCM, 22,050Hz, 8-Bit, Mono 14028 bytes Input File Size(in Bytes) Output File Size(in Bytes) Compression Ratio(in Percentage) 14028 7052 50% 3) INPUT AUDIO FILE: Name Of File: Media Length: Audio Format: File Size: Chord2.wav 1.09 sec PCM, 22,050Hz, 8-Bit, Mono 14028 bytes Input File Size(in Bytes) Output File Size(in Bytes) Compression Ratio(in Percentage) 14028 9074 36% 4) INPUT AUDIO FILE: Name Of File: Media Length: Audio Format: File Size: Ding2.wav 0.91 sec PCM, 22,050Hz, 8-Bit, Mono 20298 bytes Input File Size(in Bytes) 20298 IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 2, No 3, March 2010 www.IJCSI.org Output File Size(in Bytes) Compression Ratio(in Percentage) 13887 32% 5) INPUT AUDIO FILE: Name Of File: Logoff2.wav Media Length: 3.54 sec Audio Format : PCM,22,050Hz,8-Bit,Mono File Size: 783625 bytes Input File Size(in Bytes) Output File Size(in Bytes) Compression Ratio(in Percentage) 783625 60645 23% 6) INPUT AUDIO FILE: Name Of File:: Media Length: Audio Format: File Size: Notify2.wav 1.35 sec PCM, 22,050Hz, 8-Bit, Mono 29930 Bytes Input File Size(in Bytes) Output File Size(in Bytes) Compression Ratio(in Percentage) 29930 13310 56% 6. Conclusions We can achieve more compression using silence compression if more silence values are present out input audio file. Silence method is lossy method because here we consider +/-4 from 80H is consider as silence & when we perform compression we replace it with 80H runs times. When we decompress the audio file using silence method we get the original size but we do not get original data. Some losses occur. Companding method is also lossy method but one advantage is here we can adjust compression ratio according our requirement. Here depending upon no. of bits used in output file we get compression ratio. When we decompress the audio file using companding method we get the original size as well as we get original data only minor losses occur. But we get audio quality. Compare to silence method companding method is good and better. References [1] [2] [3] [4] [5] [6] David Salomon Data Compression 1995, 2nd ed., The Complete reference John G. Proakis and Dimitris G. Manolakis Digital Signal Processing Principles, Algorithms & Applications ,3rd ed., David Pan, “A Tutorial on MPEG/Audio Compression”, IEEE multimedia, Vol 2, No. 2 Mark Nelson Data Compression, 2nd ed. Stephen J. Solari Digital Video and Audio Compression. Cliff Wootton A practical guide to video and audio compression Kruti J Dangarwala had passed B.E (computer science) in 2001, M.E (computer engg.) In 2005. She is currently employed at SVMIT, Bharuch, Gujarat State, India as an assistant 30 professor. She has published two technical papers in various conferences. Jigar H. Shah had passed B.E. (Electronics) in 1997, M.E. (Microprocessor) in 2006. Presently he is pursuing Ph.D. degree. He is currently employed at SVMIT, Bharuch, Gujarat State, India as an assistant professor. He has published five technical papers in various conferences and also has five book titles. 
He is life member of ISTE and IETE. IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 2, No 3, March 2010 ISSN (Online): 1694-0784 ISSN (Print): 1694-0814 31 Color Image Compression Based On Wavelet Packet Best Tree Prof. Dr. G. K. Kharate Principal, Matoshri College of Engineering and Research Centre, Nashik – 422003, Maharashtra, India Dr. Mrs. V. H. Patil Professor, Department of Computer Engineering, University of Pune Abstract In Image Compression, the researchers’ aim is to reduce the number of bits required to represent an image by removing the spatial and spectral redundancies. Recently discrete wavelet transform and wavelet packet has emerged as popular techniques for image compression. The wavelet transform is one of the major processing components of image compression. The result of the compression changes as per the basis and tap of the wavelet used. It is proposed that proper selection of mother wavelet on the basis of nature of images, improve the quality as well as compression ratio remarkably. We suggest the novel technique, which is based on wavelet packet best tree based on Threshold Entropy with enhanced run-length encoding. This method reduces the time complexity of wavelet packets decomposition as complete tree is not decomposed. Our algorithm selects the sub-bands, which include significant information based on threshold entropy. The enhanced run length encoding technique is suggested provides better results than RLE. The result when compared with JPEG-2000 proves to be better. Keywords: Compression, JPEG, RLE, Wavelet, Wavelet Packet 1. Introduction In today’s modern era, multimedia has tremendous impact on human lives. Image is one of the most important media contributing to multimedia. The unprocessed image heavily consumes very important resources of the system. And hence it is highly desirable that the image be processed, so that efficient storage, representation and transmission of it can be worked out. The processes involve one of the important processes- “Image Compression”. Methods for digital image compression have been the subject of research over the past decade. Advances in Wavelet Transforms and Quantization methods have produced algorithms capable of surpassing image compression standards. The recent growth of data intensive multimedia based applications have not only sustained the need for more efficient ways to encode the signals and images but also have made compression of such signals central to storage and communication technology. In Image Compression, the researchers’ aim is to reduce the number of bits needed to represent an image by removing the spatial and spectral redundancies. Image Compression method used may be Lossy or Lossless. As lossless image compression focuses on the quality of compressed image, the compression ratio achieved is very low. Hence, one cannot save the resources significantly by using lossless image compression. The image compression technique with compromising resultant image quality, without much notice of the viewer is the lossy image compression. The loss in the image quality is adding to the percentage compression, hence results in saving the resources There are various methods of compressing still images and every method has three basic steps: Transformation, quantization and encoding. The transformation transforms the data set into another equivalent data set. 
For image compression, it is desirable that the selection of transform should reduce the size of resultant data set as compared to source data set. Many mathematical transformations exist that transform a data set from one system of measurement into another. Some mathematical transformations have been invented for the sole purpose of data compression; selection of proper transform is one of the important factors in data compression scheme. In the process of quantization, each sample is scaled by the quantization factor whereas in the process of thresholding all insignificant samples are eliminated. These two methods are responsible for introducing data loss and it degrades the quality. The encoding phase of compression reduces the overall number of bits needed to represent the data set. An entropy encoder further compresses the quantized values to give better overall compression. This process removes the redundancy in the form of repetitive bits. We suggest the novel technique, which is based on wavelet packet best tree based on threshold Entropy with enhanced run-length encoding. This method reduces the time complexity of wavelet packets decomposition as complete tree is not decomposed. Our algorithm selects the sub-bands, which include significant information based on Threshold entropy. The results when compared with JPEG-2000 prove to be better. The basic theme of the paper is the extraction of the information from the original image based on Human Visual System. By exploring Human Visual interaction characteristics carefully, the compression algorithm can discard information, which is irrelevant to human eye. IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 2, No 3, March 2010 ISSN (Online): 1694-0784 ISSN (Print): 1694-0814 32 2. Today’s Scenario The International Standards Organization (ISO) has proposed the JPEG standard [2, 4, 5] for image compression. Each color component of still image is treated as a separate gray scale picture by JPEG. Although JPEG allows any color component separation, images are usually separated into Red, Green, and Blue (RGB) or Luminance (Y), with Blue and Red color differences (U = B – Y, V = R – Y). Separation into YUV color components allows the algorithm to take the advantages of human eyes’ lower sensitivity to color information. For quantization, JPEG uses quantization matrices. JPEG allows a different quantization matrix to be specified for each color component [3]. Though the JPEG provides good results previously, it is not perfectly suited for modern multimedia applications because of blocking artifacts. Wavelet theory and its application in image compression had been well developed over the past decade. The field of wavelets is still sufficiently new and further advancements will continue to be reported in many areas. Many authors have contributed to the field to make it what it is today, with the most well known pioneer probably being Ingrid Daubechies. Other researchers whose contribution directly influence this work include Stephane Mallat for the pyramid filtering algorithm, and the team of R. R. Coifman, Y. Meyer, and M. V. Wickerhauser for their introduction of wavelet packet [6]. Further research has been done on still image compression and JPEG-2000 standard is established in 1992 and work on JPEG-2000 for coding of still images has been completed at end of year 2000. The JPEG2000 standard employs wavelet for compression due to its merits in terms of scalability, localization and energy concentration [6, 7]. 
It also provides the user with many options to choose from to achieve further compression. The JPEG-2000 standard supports decomposition of all the sub-bands at each level and hence requires full decomposition at a certain level. The compressed images look slightly washed out, with less brilliant color. This problem appears to be worse in JPEG than in JPEG-2000 [9]. Both JPEG-2000 and JPEG operate in the spectral domain, trying to represent the image as a sum of smooth oscillating waves. JPEG-2000 suffers from ringing and blurring artifacts [9]. Most researchers have worked on this problem and have suggested different techniques that minimize the said problem at the cost of a compromise in compression ratio.

3. Wavelet And Wavelet Packet
In order to represent complex signals efficiently, a basis function should be localized in both time and frequency domains. The wavelet function is localized in the time domain as well as in the frequency domain, and it is a function of variable parameters. The wavelet decomposes the image and generates four outputs covering different horizontal and vertical frequencies. These outputs are referred to as approximation, horizontal detail, vertical detail, and diagonal detail. The approximation contains the low frequency horizontal and vertical components of the image. The decomposition procedure is repeated on the approximation sub-band to generate the next level of the decomposition, and so on, leading to the well known pyramidal decomposition tree. Wavelets with many vanishing moments yield sparse decompositions of piecewise smooth surfaces; therefore they provide a very appropriate tool to compactly code smooth images. Wavelets, however, are ill suited to represent oscillatory patterns [13, 14]. Oscillating variations, such as those arising from a texture, and rapid variations in intensity can only be described by the small-scale wavelet coefficients. Unfortunately, these small-scale coefficients carry very little energy and are often quantized to zero even at high bit rates. This weakness of the wavelet transform is overcome by a newer transform method, which is based on the wavelet transform and known as wavelet packets. Wavelet packets are better able to represent high frequency information [11]. Wavelet packets represent a generalization of multiresolution decomposition. In the wavelet packets decomposition, the recursive procedure is applied to the coarse scale approximation along with the horizontal detail, vertical detail, and diagonal detail, which leads to a complete binary tree. The pyramid structure of wavelet decomposition up to the third level is shown in Figure 1, the tree structure of wavelet decomposition up to the third level is shown in Figure 2, the structure of the two-level decomposition of the wavelet packet is shown in Figure 3, and the complete decomposed three-level wavelet packet tree is shown in Figure 4.

Figure 1: The pyramid structure of wavelet decomposition up to the third level (sub-bands LL3, HL3, LH3, HH3, HL2, LH2, HH2, HL1, LH1, HH1).
Figure 2: The tree structure of wavelet decomposition up to the third level.
Figure 3: The structure of two-level decomposition of the wavelet packet (sub-bands LL1LL2 through HH1HH2).
Figure 4: The complete decomposed three-level wavelet packet tree.
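As a concrete illustration of one decomposition level described above, the following sketch (ours, using the simple Haar wavelet rather than the basis chosen in the paper) splits an image into the approximation sub-band and the three detail sub-bands.

/* Minimal sketch: one level of a 2-D Haar decomposition of a width x height
   image (both even). out_ll, out_d1, out_d2 and out_dd each hold
   (width/2) x (height/2) coefficients. Haar is shown only because it keeps
   the arithmetic obvious; the paper itself selects the mother wavelet
   according to the nature of the image. */
static void haar_decompose_level(const double *img, int width, int height,
                                 double *out_ll, double *out_d1,
                                 double *out_d2, double *out_dd)
{
    int hw = width / 2, hh = height / 2;
    for (int i = 0; i < hh; i++) {
        for (int j = 0; j < hw; j++) {
            double a = img[(2 * i) * width + 2 * j];
            double b = img[(2 * i) * width + 2 * j + 1];
            double c = img[(2 * i + 1) * width + 2 * j];
            double d = img[(2 * i + 1) * width + 2 * j + 1];
            out_ll[i * hw + j] = (a + b + c + d) / 4.0; /* approximation (LL)      */
            out_d1[i * hw + j] = (a - b + c - d) / 4.0; /* detail: column differences */
            out_d2[i * hw + j] = (a + b - c - d) / 4.0; /* detail: row differences    */
            out_dd[i * hw + j] = (a - b - c + d) / 4.0; /* diagonal detail            */
        }
    }
}

Repeating this step on the approximation sub-band alone gives the pyramid of Figure 1, while applying it to all four sub-bands gives the complete wavelet packet tree of Figure 4.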
4. Proposed Compression Algorithm for Image
Modern image compression techniques use the wavelet transform for image compression. Considering the limitations of the wavelet transform for image compression, we suggest a novel technique based on the wavelet packet best tree, using Threshold Entropy with lossy enhanced run-length encoding. This method reduces the time complexity of the wavelet packets decomposition and selects the sub-bands which include significant information in compact form. The Threshold entropy criterion measures the information content of the transform coefficients of a sub-band. Threshold entropy is obtained by the equation

Entropy = number of coefficients Xi, i = 0, ..., N-1, for which abs(Xi) > Threshold   (1)

where Xi is the ith coefficient of the sub-band and N is the length of the sub-band. The information content of the decomposed components of wavelet packets may be greater than or less than the information content of the component which has been decomposed. The sum of the costs (Threshold entropy) of the decomposed components (child nodes) is checked against the cost of the component which has been decomposed (parent node). If the sum of the costs of the child nodes is less than the cost of the parent node, then the child nodes are considered as leaf nodes of the tree; otherwise the child nodes are removed from the tree and the parent node becomes a leaf node. This process is iterated up to the last level of decomposition. The time complexity of the proposed algorithm is less than that of the algorithm in paper [12]. In [12], a full wavelet packets decomposition of level 'J' first takes place, and the cost functions of all nodes in the decomposition tree are evaluated. Beginning at the bottom of the tree, the cost function of the parent node is compared with the union of the cost functions of the child nodes. According to the comparison results, the best basis node(s) is selected. This procedure is applied recursively at each level of the tree until the topmost node of the tree is reached. In the proposed algorithm there is no need for a full wavelet packets decomposition of level 'J' and no need to evaluate the cost function of all nodes initially.

The algorithm of best basis selection based on Threshold entropy is:
1) Load the image.
2) Set the current node equal to the input image.
3) Decompose the current node using the wavelet packet tree.
4) Evaluate the cost of the current node and of the decomposed components.
5) Compare the cost of the parent node (current node) with the sum of the costs of the child nodes (decomposed components).
6) If the sum of the costs of the child nodes is greater than that of the parent node, consider the parent node as a leaf node of the tree and prune the child nodes; else repeat steps 3, 4 and 5 for each child node, considering the child node as the current node, until the last level of the tree is reached.

This algorithm reduces the time complexity because there is no need to decompose the full wavelet packets tree and no need to evaluate the costs initially. The decision about further decomposition and cost calculation is made at run time, and the algorithm decides at run time whether to retain or prune the decomposed components [15]. Once the best basis has been selected based on the cost function, the image is represented by a set of wavelet packets coefficients. A high compression ratio is achieved by applying thresholding to the wavelet packets coefficients. The advantages of wavelet packets can be gained by proper selection of thresholds.
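A compact sketch of the cost criterion and of the parent/child comparison in the algorithm above is given below. It is an illustration under our reading of equation (1); wp_decompose() is a placeholder for an actual wavelet packet decomposition routine and is assumed, not provided.

#include <math.h>

/* Cost of a sub-band under the threshold-entropy criterion: the number of
   coefficients whose magnitude exceeds the threshold. */
static int threshold_entropy(const double *coef, int n, double threshold)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (fabs(coef[i]) > threshold)
            count++;
    return count;
}

/* Placeholder (assumed): decompose a node into its four child sub-bands,
   returning the children's coefficients and lengths through the arguments. */
extern void wp_decompose(const double *node, int n,
                         double *child[4], int child_len[4]);

/* Best-basis rule from steps 5 and 6: keep the children only if their total
   cost is lower than the parent's cost; otherwise the parent stays a leaf. */
static int keep_children(const double *node, int n, double threshold,
                         double *child[4], int child_len[4])
{
    int parent_cost = threshold_entropy(node, n, threshold);
    wp_decompose(node, n, child, child_len);
    int child_cost = 0;
    for (int k = 0; k < 4; k++)
        child_cost += threshold_entropy(child[k], child_len[k], threshold);
    return child_cost < parent_cost;   /* 1 = recurse into the children */
}

When keep_children() returns 0 the node is retained as a leaf and its children are pruned, as in step 6; otherwise the same test is applied recursively to each child until the last level is reached.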
Encoder further compresses the coefficients of wavelet packets tree to give better overall compression. Simple Run-Length Encoding (RLE) has proven very effective encoding in many applications. Run-Length Encoding is a pattern recognition scheme that searches for the repetition (redundancy) of identical data values in the code-stream. The data set can be compressed by replacing the repetitive sequence with a single data value and length of that data. We propose the modified technique for the encoding named as Enhanced RunLength Encoding, and then for the bit coding well Image AISHWARYA CHEETAH LENA BARBARA MANDRILL BIRD ROSE DONKEY BUTTERFLY HORIZONTAL VERTICAL 34 known Huffman coding or Arithmetic coding methods are used. The problems with existing Run Length Encoding, is that the compression ratio obtained from run-length encoding schemes vary depending on the type of data to be encoded, and the repetitions present within the data set. Some data sets can be highly compressed by runlength encoding whereas other data sets can actually grow larger due to the encoding [16]. This problem of an existing run-length encoding techniques are eliminated up to the certain extent by using Enhanced Run-Length Encoding technique. In the proposed Enhanced RLE, the neighboring coefficients are compared, with acceptable value, which is provided by the user according to the applications. If the difference is less than the acceptable value then the changes are undone. 5. Results The proposed algorithm is implemented and tested over the range of natural and synthetic images. The natural test images used are AISHWARYA, CHEETAH, LENA, BARBARA, MANDRILL, BIRD, ROSE, DONKEY, and synthetic images used are BUTTERFLY, HORIZONTAL, and VERTICAL. The results for 10 images is given in table1 and few output images are given in figure 4 are given as follows: Table 1 Results of selected Images Percentage of compression Compression ratio Peak signal to noise ratio (dB) 97.6873 50 64.8412 97.3983 39 59.1732 98.4885 67 66.7228 97.0166 34 51.9716 94.5585 19 46.1098 97.1217 35 57.2633 94.4339 18 46.7576 97.0665 35 50.6168 96.0195 26 46.5698 89.2611 10 47.7637 97.5687 42 49.5102 IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 2, No 3, March 2010 ISSN (Online): 1694-0784 ISSN (Print): 1694-0814 Compression Ratio =35 PSNR = 57.2633 dB Original ROSE Image Original Bird image Compression Ratio= 18 PSNR = 46.7576 dB Original MANDRILL Image Compression Ratio = 21 PSNR = 46.1098 dB Original Horizontal Image Compression Ratio= 10 PSNR = 47.7637 dB Fig. 4 Resultant Images with PSNR and compression ratio 6. Conclusion The novel algorithm of image compression using wavelet packet best tree based on Threshold entropy and enhanced RLE is implemented, and tested over the set of natural and synthetic images and concluding remarks based on results are discussed. The results show that the compression ratio is good for low frequency (smooth) images, and it is observed that it is very high for gray images. For high frequency images such as Mandrill, Barbara, the compression ratio is good, and the quality of the images is retained too. These results are compared with JPEG-2000 application, and it is found that the results obtained by using the proposed algorithm are better than the JEPG2000. 7. References [1]. Subhasis Saha, “Image Compression- from DCT To [2]. [3]. [4]. [5]. 35 Wavelet: A Review,” ACM Crossroads Student Magazine, Vol.6, No.3, Spring 2000. Andrew B. 
Watson, “Image Compression Using the Discrete Cosine Transform,” NASA Ames Research Center, Mathematical Journal, Vol.4, Issue 1, pp 8188, 1994. “Video Compression- An Introduction” Array Microsystems Inc., White Paper, 1997. Sergio D. Servetto, Kannan Ramchandran, Michael T. Orchard, “ Image Coding Based on A Morphological Representation of Wavelet Data,” IEEE Transaction On Image Processing, Vol.8, No.9, pp.1161-1174, 1999. M. K. Mandal, S. Panchnathan, T. Aboulnasr, “Choice of Wavelets For Image Compression (Book Title: Information Theory & Application),” Lecture Notes In Computer Science, Vol.1133, pp.239-249, 1995. [6]. Andrew B. Watson, “Image Compression Using the Discrete Cosine Transform,” NASA Ames Research Center, Mathematical Journal, Vol.4, Issue 1, pp. 8188, 1994. [7]. Wayne E. Bretl, Mark Fimoff, “MPEG2 Tutorial Introduction & Contents: Video Compression BasicsLossy Coding,” 1999. [8]. M. M. Reid, R.J. Millar, N. D. Black, “ SecondGeneration Image Coding; An Overview,” ACM Computing Surveys, Vol.29, No.1, March 1997. [9]. Aleks Jakulin, “Baseline JPEG And JPEG2000 Artifacts Illustrated,” Visicron, 2002. [10]. Agostino Abbate, Casimer M. DeCusatis, Pankaj K. Das, “Wavelets and Subbands fundamentals and applications”, ISBN 0-8176-4136-X printed at Birkhauser Boston.. [11]. Deepti Gupta, Shital Mutha, “Image Compression Using Wavelet Packet,” IEEE, Vol.3, pp. 922-926, Oct. 2003. [12]. Andreas Uhl, “Wavelet Packet Best Basis Selection On Moderate Parallel MIMD Architectures1,” Parallel Computing, Vol.22, No.1, pp. 149-158, Jan. 1996. [13]. Francois G. Meyar, Amir Averbuch, Jan-Olvo Stromberg, Ronald R. Coifman, “Fast Wavelet Packet Image Compression,” IEEE Transaction On Image Processing, Vol.9, No.5, pp. 563-572, May 2000. [14]. Jose Oliver, Manuel Perez Malumbres, “Fast And Efficient Spatial Scalable Image Compression Using Wavelet Lower Trees,” Proceedings Of The Data Compression Conference, pp. 133-142, Mar. 2003. [15]. G. K. Kharate, Dr. A. A. Ghatol and Dr. P. P. Rege, “Image compression using wavelet packet tree”, Published at ICGST International Journal on Graphics, Vision and Image Processing GVIP-2005 Cairo, Egypt, Vol-5, Issue-7, July 2005. [16]. Mark Nelson, Jea-Loup Gailly “The Data Compression Book”, M & T Books, ISBN-1-55851434-1, Printed in the USA 1996. IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 2, No 3, March 2010 ISSN (Online): 1694-0784 ISSN (Print): 1694-0814 36 A Pedagogical Evaluation and Discussion about the Lack of Cohesion in Method (LCOM) Metric Using Field Experiment. Ezekiel Okike School of Computer Studies, Kampala International University , Kampala, Uganda 256, Uganda. Abstract Chidamber and Kemerer first defined a cohesion measure for object-oriented software – the Lack of Cohesion in Methods (LCOM) metric. This paper presents a pedagogic evaluation and discussion about the LCOM metric using field data from three industrial systems. System 1 has 34 classes, System 2 has 383 classes and System 3 has 1055 classes. The main objectives of the study were to determine if the LCOM metric was appropriate in the measurement of class cohesion and the determination of properly and improperly designed classes in the studied systems. Chidamber and Kemerer’s suite of metric was used as metric tool. Descriptive statistics was used to analyze results. The result of the study showed that in System 1, 78.8% (26 classes) were cohesive; System 2 54% (207 classes) were cohesive; System 3 30% (317 classes) were cohesive. 
We suggest that the LCOM metric measures class cohesiveness and was appropriate in the determination of properly and improperly designed classes in the studied system. Keywords: Class Cohesion, LCOM Metric, Systems, Software Measurement. 1. Introduction Software metric is any type of measurement that relates to a software system, process or related documentation. On the other hand, software measurement is concerned with deriving a numeric value for some attributes of a software product or process. By comparing these values to each other and to standards that apply across an organization, one may be able to draw conclusions about the quality of a software or software processes. The Lack of Cohesion in Methods (LCOM) metric was proposed in [5,6] as a measure of cohesion in the object oriented paradigm. The term cohesion is defined as the “intramodular functional relatedness” in software [1]. This definition, considers the cohesion of each module in isolation: how tightly bound or related its internal elements are. Hence, cohesion as an attribute of software modules capture the degree of association of elements within a module, and the programming paradigm used determines what is an element and what is a module. In the object-oriented paradigm, for instance, a module is a class and hence cohesion refers to the relatedness among the methods of a class. Cohesion may be categorized ranging from the weakest form to the strongest form in the following order: coincidental, logical, temporal, procedural, communicational, sequential and functional. i. Coincidental cohesion: A coincidentally cohesive module is one whose elements contribute to activities in a module, but with no meaningful relationship to one another. An example is to have unrelated statements bundled together in a module. Such a module would be hard to understand what it does and can not be reused in another program. ii. Logical cohesion: A logically cohesive module is one whose elements contribute to activities of the same general category in which the activity or activities to be executed are selected from outside the module. A logically cohesive module does any of several different related things, hence, presenting a confusing interface since some parameters may be needed only sometimes. iii. Temporal cohesion: A temporally cohesive module is one whose elements are involved in activities that are related in time. That is, the activities are carried out at a particular time. The elements occurring together in a temporally cohesive module do diverse things and execute at the same time. iv. Procedural cohesion: A procedurally cohesive module is one whose elements are involved in different and possibly unrelated activities in which control flows from each activity to the next. Procedurally cohesive modules tend to be composed of pieces of functions that have little IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 2, No 3, March 2010 www.IJCSI.org relationship to one another (except that they are carried out in a specific order at a certain time). v. Communicational cohesion: A communicational cohesive module is one whose elements contribute to activities that use the same input or output data. vi. Sequential cohesion: A sequentially cohesive module is one whose elements are involved in activities such that output data from one activity serve as input data to the next. Some authors identify this as informational cohesion. vii. 
Functional cohesion: A functionally cohesive module contains elements that all contribute to the execution of one and only one problem-related task. The elements do exactly one thing or achieve one goal. viii. A module exhibits one of these forms of cohesion depending on the skill of the designer. However, functional cohesion is generally accepted as the best form of cohesion in software design. Functional cohesion is the most desirable because it performs exactly one action or achieves a single goal. Such a module is highly reusable, relatively easy to understand (because you know what it does) and is maintainable. In this paper, the term “cohesion” refers to functional cohesion. Several measures of cohesion have been defined in both the procedural and object-oriented paradigms. Most of the cohesion measures defined in the object-oriented paradigm are inspired from the Lack of Cohesion in methods (LCOM) metric defined by Chidamber and Kemerer. In this paper, the Lack of Cohesion in methods (LCOM) metric is pedagogically evaluated and discussed with empirical data. The rest of the paper is organized as follows: Section 2 presents a summary of the approaches to measuring cohesion in procedural and object-oriented programs. Section 3 examines the Chidamber and Kemerer LCOM metric. Section 4 present the empirical study of LCOM with three Java based industrial software systems. Section 5 presents the result of the study upon which the LCOM metric was evaluated. Section 6 concludes the paper by suggesting that Chidamber and Kemerer’s LCOM metric measures cohesiveness. 2. Measuring Cohesion in Procedural and Object oriented Programs 2.1 Measuring cohesion in procedural programs Procedural programs are those with procedure and data declared independently. Examples of purely procedure oriented languages include C, Pascal, Ada83, Fortran and so on. In this case, the module is a procedure and an 37 element is either a global value which is visible to all the modules or a local value which is visible only to the module where it is declared. As noted in [2], the approaches taken to measure cohesiveness of this kind of programs have generally tried to evaluate cohesion on a procedure by procedure basis, and the notational measure is one of “functional strength” of procedure, meaning the degree to which data and procedures contribute to performing the basic function. In other words the complexity is defined in the control flow. Among the best known measures of cohesion in the procedural paradigm are discussed in [3] and [4]. 2.2 Measuring cohesion in object-oriented systems In the Object Oriented languages, the complexity is defined in the relationship between the classes and their methods. Several measures exist for measuring cohesion in Object-Oriented systems [7,8,9,10,11,12]. Most of the existing cohesion measures in the object-oriented paradigm are inspired from the Lack of Cohesion in Methods (LCOM ) metric [5,6]. Some examples include LCOM3, Connectivity model, LCOM5, Tight Class Cohesion (TCC), and Low Class Cohesion (LCC), Degree of Cohesion in class based on direct relation between its public methods (DCD) and that based on indirect methods (DCI), Optimistic Class cohesion (OCC) and Pessimistic Class Cohesion (PCC). 3. The Lack of Cohesion in Methods (LCOM) Metric. The LCOM metric is based on the number of disjoint sets of instance variables that are used by the method. Its definition is given as follows [5,6]. Definition 1. Consider a class C1 with n methods M1, M2,…,Mn. 
Let {Ii} = set of instance variables used by method Mi. There are n such sets {I1}, ..., {In}. Let P = { (Ii, Ij) | Ii ∩ Ij = ∅ } and Q = { (Ii, Ij) | Ii ∩ Ij ≠ ∅ }. If all n sets {I1}, ..., {In} are ∅, then let P = ∅.
LCOM = |P| - |Q|, if |P| > |Q|; 0, otherwise.
Example: Consider a class C with three methods M1, M2 and M3. Let {I1} = {a,b,c,d,e}, {I2} = {a,b,e} and {I3} = {x,y,z}. {I1} ∩ {I2} is nonempty, but {I1} ∩ {I3} and {I2} ∩ {I3} are null sets. LCOM is the number of null intersections minus the number of nonempty intersections, which in this case is 1.
The theoretical basis of LCOM uses the notion of degree of similarity of methods. The degree of similarity of two methods M1 and M2 in class C1 is given by: σ(M1, M2) = {I1} ∩ {I2}, where {I1} and {I2} are the sets of instance variables used by M1 and M2. LCOM is a count of the number of method pairs whose similarity is 0 (i.e., σ is a null set) minus the count of method pairs whose similarity is not zero. The larger the number of similar methods, the more cohesive the class, which is consistent with the traditional notions of cohesion that measure the interrelatedness between portions of a program. If none of the methods of a class display any instance behaviour, i.e. do not use any instance variables, they have no similarity and the LCOM value for the class will be zero. The LCOM value provides a measure of the relative disparate nature of methods in the class. A smaller number of disjoint pairs (elements of set P) implies greater similarity of methods. LCOM is intimately tied to the instance variables and methods of a class, and therefore is a measure of the attributes of an object class.
In this definition, it is not stated whether inherited methods and attributes are included or not. Hence, a refinement is provided as follows [14]:
Definition 2. Let
P = ∅, if AR(m) = ∅ for all m ∈ MI(c);
P = { {m1, m2} | m1, m2 ∈ MI(c), m1 ≠ m2, AR(m1) ∩ AR(m2) ∩ AI(c) = ∅ }, otherwise;
Q = { {m1, m2} | m1, m2 ∈ MI(c), m1 ≠ m2, AR(m1) ∩ AR(m2) ∩ AI(c) ≠ ∅ }.
Then LCOM2(c) = |P| - |Q|, if |P| > |Q|; 0, otherwise,
where MI are the methods implemented in the class c, AI are the attributes (or instance variables) of the class c, and AR denotes attribute references. In this definition, only methods M implemented in class c are considered, and only references to attributes AR implemented in class c are counted. The definition of LCOM2 has been widely discussed in the literature [6,9,11,14,16]. LCOM2 of many classes is set to zero although different cohesions are expected.
3.1 Remarks
In general, the Lack of Cohesion in Methods (LCOM) measures the dissimilarity of methods in a class by instance variables or attributes. Chidamber and Kemerer's interpretation of the metric is that LCOM = 0 indicates a cohesive class. However, LCOM > 0 implies that instance variables belong to disjoint sets; such a class may be split into two or more classes to make it cohesive. Consider the case of n sequentially linked methods, as shown in Fig. 3.1, where n methods are sequentially linked by shared instance variables.
Fig. 3.1. n sequentially linked methods (methods M1, M2, M3, ..., Mn linked by shared instance variables).
In this special case of sequential cohesion:
P = n(n - 1)/2 - (n - 1)   (1)
Q = n - 1   (2)
so that
LCOM = [P - Q]+ = [n(n - 1)/2 - 2(n - 1)]+   (3)
39 There are two pairs of methods accessing no common instance variables namely (<f, g>, <f, h>). Hence P = 2. From (1) and (2) One pair of methods shares variable E, namely, <g, h>. n P Q (n 1) (n 1) 2 Hence, Q = 1. Therefore, LCOM is 2 - 1 =1. 3.3 n 2n 2 2 The LCOM metric has been criticized for not satisfying all the desirable properties of cohesion measures. For instance, the LCOM metric values are not normalized n 2(n 1) 2 Critique of LCOM metric values [11,13]. A method for normalizing the LCOM metric has been proposed in [18, 19]. It is also observed n! 2(n 1) (n 2)!2! (4) From (4), for n < 5, LCOM = 0 indicating that classes with less than 5 methods are equally cohesive. For n 5, 1 < LCOM < n, suggesting that classes with 5 or more methods need to be split [8,18]. 3.2 Class design and LCOM computation that the LCOM metric is not able to distinguish between the structural cohesiveness of two classes, in the way in which the methods share instance variables [8]. Hence, a connectivity metric to be used in conjunction with the LCOM metric was proposed. The value of the connectivity metric always lies between 0 and 1 [8]. 4. The empirical study 4.1 The Method Chidamber and Kemerer’s suit of metrics namely: Lack of A B f() Cohesion in methods (LCOM), Coupling Between Object C Classes (CBO), Response For a Class (RFC),Weighted Methods Per Class (WMC), Depth of Inheritance (DIT) g() D h() E F and Number of Children (NOC) were used in the study. Two other metrics used in this experiment which are not part of the Chidamber and Kemerer metrics are: Number Fig. 3.2. Class design showing LCOM computation Source:[8] of Public Methods (NPM) and Afferent Coupling (CA). The choice of these metrics is informed by the need to have a metric to measure the number of public methods in a class as well as the number of other classes using a Class x { Int A, B, C, D, E, F; Void f() {…uses A,B, C …} Void g () {…uses D, E…} Figure 3.2 presents a class x written in C++. specific class. All the metrics used in this study provide the appropriate variables required for the experiments and the tools for measuring the metrics were readily available for use. In addition Chidamber and Kemerer’s set of The Lack of Cohesion in Methods (LCOM) for class x = 1, calculated as follows: IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 2, No 3, March 2010 www.IJCSI.org 40 measure seems to be the basic set of object-oriented Let P be the pairs of methods without shared instance measures widely accepted [15]. variables, and Q be the pairs of methods with shared Specifically cohesion was measured using the LCOM instance variables. metric. Coupling was measured using CBO, RFC, and CA. Then LCOM = |P| – |Q| , if |P| < |Q| i.e. If this difference is Size was measured using WMC, and NPM. Inheritance negative, LCOM is set to zero was measured using DIT. Descriptive statistics was used • CBO. (Coupling between Object classes). A class is to analyze results. coupled to another, if methods of one class use methods or 4.2 attributes of the other, or vice versa. CBO for a class is Description of variables Table 4.1 below shows the variables used in the test then defined as the number of other classes to which it is systems. The metric of paramount interest is the LCOM coupled. This includes inheritance based coupling. although CBO, RFC, CA, WMC, NPM, NOC, and DIT • RFC (Response set for a class). 
4. The empirical study

4.1 The Method

Chidamber and Kemerer's suite of metrics, namely Lack of Cohesion in Methods (LCOM), Coupling Between Object Classes (CBO), Response For a Class (RFC), Weighted Methods per Class (WMC), Depth of Inheritance (DIT) and Number of Children (NOC), was used in the study. Two other metrics used in this experiment which are not part of the Chidamber and Kemerer metrics are Number of Public Methods (NPM) and Afferent Coupling (CA). The choice of these metrics is informed by the need to have a metric to measure the number of public methods in a class as well as the number of other classes using a specific class. All the metrics used in this study provide the appropriate variables required for the experiments, and the tools for measuring the metrics were readily available for use. In addition, Chidamber and Kemerer's set of measures seems to be the basic set of object-oriented measures that is widely accepted [15].

Specifically, cohesion was measured using the LCOM metric. Coupling was measured using CBO, RFC, and CA. Size was measured using WMC and NPM. Inheritance was measured using DIT. Descriptive statistics was used to analyze the results.

The metrics are defined as follows:
• LCOM (Lack of Cohesion in Methods), as defined in Section 3.
• CBO (Coupling Between Object classes). A class is coupled to another if methods of one class use methods or attributes of the other, or vice versa. CBO for a class is then defined as the number of other classes to which it is coupled. This includes inheritance-based coupling.
• RFC (Response set For a Class). The response set for a class consists of the set M of methods of the class and the set of methods directly or indirectly invoked by methods in M. In other words, the response set is the set of methods that can potentially be executed in response to a message received by an object of that class. RFC is the number of methods in the response set of the class.
• NPM (Number of Public Methods). The NPM metric counts all the methods in a class that are declared as public. It can be used to measure the size of an Application Program Interface (API) provided by a package [17].
• CA (Afferent Coupling). A class's afferent coupling is a measure of how many other classes use the specific class [17].

4.2 Description of variables

Table 1 below shows the variables used in the test systems. The metric of paramount interest is the LCOM, although CBO, RFC, CA, WMC, NPM, NOC, and DIT values were obtained in order to verify whether there are significant correlations between these and the LCOM.

Table 1: Metric variables used in the experiment
Metric | Meaning                     | Attribute
LCOM   | Lack of Cohesion in Methods | Cohesion
CBO    | Coupling Between Objects    | Coupling
RFC    | Response For a Class        | Coupling
CA     | Afferent Coupling           | Coupling
WMC    | Weighted Methods per Class  | Size
NPM    | Number of Public Methods    | Size
NOC    | Number of Children          | Inheritance
DIT    | Depth of Inheritance        | Inheritance

5. Result and Discussion

The results of applying a Chidamber and Kemerer metric tool to the selected test systems, consisting of 1472 Java classes from three different industrial systems, are presented in this section. Descriptive statistics is used to analyze and interpret the results.

5.1 Descriptive statistics of the test systems

Descriptive statistics were used to obtain the minimum, maximum, mean, median, and standard deviation values for the test systems, as shown in Tables 2-4. In the case of the cohesion measurement, the LCOM value lies in the range [0, maximum]. From Chidamber and Kemerer's interpretation of their LCOM metric, a class is cohesive if its LCOM = 0. Using descriptive statistics, a low median value in this range shows the level of cohesiveness in the system; it also means that at least half of the classes in the system are cohesive. The actual number of cohesive classes and their percentages, based on the number of classes in the test systems, were obtained from a simple frequency count of cohesive classes in each test system.

In this experiment we applied a normalized interpretation of the LCOM median, i.e. the range [0,1]: systems exhibiting high cohesion show low median values within [0,1]. From Chidamber and Kemerer's view, a median value of 0 indicates cohesive classes; however, a median value of 1 is also low enough to indicate a cohesive class. A minimum value indicates the lowest LCOM value among the measured classes; if this value is zero, that class is cohesive under the LCOM interpretation. A maximum value indicates the highest LCOM value for a class. Using Chidamber and Kemerer's metric, the LCOM value of a class can be any value from zero upward, with no fixed upper bound. The presence of such arbitrarily large values makes the LCOM metric not very appealing to most practitioners, because a cohesion metric should not generate values which are not standardized (normalized). Chidamber and Kemerer's position, however, is that classes whose LCOM > 0 are improperly designed and as such could be split into two or more classes to make them cohesive. The presence of outliers and un-standardized (un-normalized) values for LCOM is still a shortcoming of the Chidamber and Kemerer LCOM metric. Using descriptive statistics, a maximum LCOM value indicates the value of the highest outlier in the measured system, and there could be more outliers within. Descriptive statistics for the test systems are shown in Tables 2-4; Table 5 provides the combined descriptive statistics for the cohesion comparison across the test systems.

Table 2: Descriptive statistics for system 1
Statistics  | WMC   | DIT  | NOC  | CBO  | RFC   | LCOM | CA   | NPM
N (Valid)   | 34    | 34   | 34   | 34   | 34    | 34   | 34   | 34
N (Missing) | 3     | 3    | 3    | 3    | 3     | 3    | 3    | 3
Mean        | 7.88  | 1.41 | .41  | 5.59 | 22.21 | 11   | 4.21 | 6.12
Median      | 4.00  | 1.00 | .00  | 5.00 | 18.50 | .00  | 2.00 | 4.00
Std. Dev    | 13.30 | .50  | 1.52 | 5.87 | 24.23 | 44   | 4.78 | 8.63
Min         | 0     | 1    | 0    | 0    | 0     | 0    | 0    | 0
Max         | 74    | 2    | 8    | 31   | 129   | 2531 | 22   | 45
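For concreteness, the summary statistics reported in Tables 2-5 can be computed per metric (e.g. over the LCOM values of all classes in one system) with a small sketch such as the following. This is not the tool used in the study, the names are illustrative, and the sample standard deviation is only one possible convention, since the paper does not state which was used:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Summary { double min, max, mean, median, stdDev; };

// Computes min, max, mean, median and (sample) standard deviation.
// Assumes at least two values, e.g. one LCOM value per class of a system.
Summary summarize(std::vector<double> values) {
    std::sort(values.begin(), values.end());
    const std::size_t n = values.size();
    double sum = 0.0;
    for (double v : values) sum += v;
    const double mean = sum / n;
    double squares = 0.0;
    for (double v : values) squares += (v - mean) * (v - mean);
    const double stdDev = std::sqrt(squares / (n - 1));
    const double median = (n % 2 == 1)
        ? values[n / 2]
        : (values[n / 2 - 1] + values[n / 2]) / 2.0;
    return {values.front(), values.back(), mean, median, stdDev};
}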
Table 3: Descriptive statistics for system 2
Statistics  | WMC   | DIT  | NOC  | CBO   | RFC   | LCOM   | CA    | NPM
N (Valid)   | 383   | 383  | 383  | 383   | 383   | 383    | 383   | 383
N (Missing) | 0     | 0    | 0    | 0     | 0     | 0      | 0     | 0
Mean        | 8.24  | 2.14 | .58  | 8.33  | 20.93 | 150.40 | 5.70  | 6.97
Median      | 3.00  | 2.00 | .00  | 5.00  | 10.00 | 1.00   | 3.00  | 2.00
Std. Dev    | 19.16 | 1.16 | 3.00 | 20.07 | 31.14 | 131.02 | 14.02 | 18.63
Min         | 0     | 1    | 0    | 0     | 0     | 0      | 0     | 0
Max         | 118   | 5    | 36   | 195   | 256   | 16290  | 157   | 181

Table 4: Descriptive statistics for system 3
Statistics  | WMC  | DIT  | NOC  | CBO  | RFC   | LCOM   | CA   | NPM
N (Valid)   | 1055 | 1055 | 1055 | 1055 | 1055  | 1055   | 1055 | 1055
N (Missing) | 0    | 0    | 0    | 0    | 0     | 0      | 0    | 0
Mean        | 7.96 | 1.42 | .36  | 6.25 | 26.49 | 44.91  | 1.69 | 5.91
Median      | 5.00 | 1.00 | .00  | 3.00 | 16.00 | 6.00   | .00  | 4.00
Std. Dev    | 9.40 | .62  | 3.00 | 7.55 | 30.93 | 180.45 | 5.83 | 6.82
Min         | 0    | 1    | 0    | 0    | 0     | 0      | 0    | 0
Max         | 109  | 4    | 64   | 65   | 210   | 2744   | 71   | 61

5.2 Cohesion comparisons across systems

Table 5 below shows the comparison of the cohesion measures across the three test systems. The actual number of cohesive and uncohesive classes per system and their percentages are indicated. A median value in the range [0,1) indicates that the system is cohesive.

Table 5: Cohesion comparison across the test systems
System  | No. of classes | Cohesive   | Uncohesive | LCOM Min | LCOM Max | LCOM Mean | LCOM Median | LCOM Std. dev
System1 | 34             | 26 (78.8%) | 7 (21.2%)  | 0        | 2534     | 79.03     | 0           | 440.22
System2 | 383            | 207 (54%)  | 176 (46%)  | 0        | 16290    | 150.40    | 1.00        | 1318.3
System3 | 1055           | 317 (30%)  | 738 (70%)  | 0        | 2744     | 44.91     | 6.0         | 180.45
Total   | 1472           |            |            |          |          |           |             |
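The frequency count behind the cohesive and uncohesive columns of Table 5 is straightforward; a sketch is given below (the assumptions are one LCOM value per class and the Chidamber and Kemerer criterion that a class is cohesive when LCOM = 0, with the cut-off left as a parameter because the discussion that follows also treats LCOM = 1 as low enough to be cohesive):

#include <cstddef>
#include <vector>

struct CohesionTally {
    std::size_t cohesive = 0;
    std::size_t uncohesive = 0;
    double cohesivePercent = 0.0;
};

// lcomPerClass holds one LCOM value per class of a test system.
CohesionTally tallyCohesion(const std::vector<long>& lcomPerClass, long cutoff = 0) {
    CohesionTally tally;
    for (long value : lcomPerClass) {
        if (value <= cutoff) ++tally.cohesive;   // at or below the cut-off counts as cohesive
        else ++tally.uncohesive;
    }
    if (!lcomPerClass.empty())
        tally.cohesivePercent = 100.0 * tally.cohesive / lcomPerClass.size();
    return tally;
}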
5.3 Discussion

Following Chidamber and Kemerer's guide to interpreting their LCOM metric using descriptive statistics (minimum, maximum, mean, median, standard deviation) [6], a low median value indicates that at least 50% of the classes have cohesive methods. In their original work this low median value was 0; however, a median value of 1 (also low) was considered in this work. In the field experiment results (Tables 2-4), it was observed that the LCOM median values for systems 1 and 2 are 0.00 and 1.00 respectively. Hence these systems are considered to have more cohesive classes than system 3, whose LCOM median value is 6.00. To confirm this, a simple frequency count of cohesive and un-cohesive classes was carried out to find the actual percentages, as shown in Table 5.

However, Chidamber and Kemerer's view that a class is not cohesive when LCOM = 1 does not seem appropriate, as there was no reason to suggest that classes with LCOM = 1 are improperly designed. Since the LCOM metric is an inverse cohesion measure, a low value indicates high cohesion and vice versa [14]. For illustration, suppose the cohesion of a class c1 is such that LCOM(c1) = 0, and the cohesion of another class c2 is such that LCOM(c2) = 1; this should mean only that c1 is more cohesive than c2, and should not be interpreted to mean that c2, with LCOM = 1, is not cohesive and therefore may be split.

6. Conclusion

In this paper the concept of cohesion in both the procedural and the object-oriented paradigm has been extensively discussed. It is suggested that Chidamber and Kemerer's Lack of Cohesion in Methods (LCOM) metric measures cohesiveness; however, the presence of outliers and non-standardized values makes the metric not as appealing as its variant measures whose cohesion descriptive statistics values are standardized (normalized). A normalized LCOM metric is already proposed in [19]. The metric may be used to predict improperly designed classes, especially when the LCOM metric is used with reference to the Number of Public Methods (NPM) being greater than or equal to five (NPM ≥ 5) [18,5,6]. Cohesion as an attribute of software, when properly measured, serves as a guiding principle in the design of good software which is easy to maintain and whose components are reusable [3,4,5,18].

References
[1] E. Yourdon and L. Constantine, Structured Design: Fundamentals of a Discipline of Computer Program and Systems Design, Englewood Cliffs, New Jersey: Prentice-Hall, 1979.
[2] M. F. Shumway, "Measuring Class Cohesion in Java", Master of Science Thesis, Department of Computer Science, Colorado State University, Technical Report CD-97-113, 1997.
[3] J. M. Bieman and L. M. Ott, "Measuring Functional Cohesion", IEEE Transactions on Software Engineering, vol. 20, no. 8, pp. 644-658, August 1994.
[4] J. M. Bieman and B. K. Kang, "Measuring Design-Level Cohesion", IEEE Transactions on Software Engineering, vol. 20, no. 2, pp. 111-124, February 1998.
[5] S. R. Chidamber and C. F. Kemerer, "Towards a Metric Suite for Object Oriented Design", Object Oriented Programming Systems, Languages and Applications, Special Issue of SIGPLAN Notices, vol. 26, no. 10, pp. 197-211, October 1991.
[6] S. R. Chidamber and C. F. Kemerer, "A Metric Suite for Object Oriented Design", IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476-493, June 1994.
[7] W. Li and S. Henry, "Object Oriented Metrics that Predict Maintainability", Journal of Systems and Software, vol. 23, pp. 111-122, February 1993.
[8] M. Hitz and B. Montazeri, "Chidamber and Kemerer's Metrics Suite: A Measurement Theory Perspective", IEEE Transactions on Software Engineering, vol. 22, no. 4, pp. 267-270, April 1996.
[9] B. Henderson-Sellers, Software Metrics, U.K.: Prentice Hall, 1996.
[10] J. M. Bieman and B. K. Kang, "Cohesion and Reuse in an Object Oriented System", in Proceedings of the Symposium on Software Reusability (SSR '95), Seattle, WA, pp. 259-262, April 1995.
[11] L. Badri and M. Badri, "A Proposal of a New Class Cohesion Criterion: An Empirical Study", Journal of Object Technology, vol. 3, no. 4, pp. 145-159, April 2004.
[12] H. Aman, K. Yamasaki, and M. Noda, "A Proposal of Class Cohesion Metrics Using Sizes of Cohesive Parts", in Knowledge-Based Software Engineering, T. Welzer et al., Eds., IOS Press, pp. 102-107, September 2002.
[13] B. S. Gupta, "A Critique of Cohesion Measures in the Object Oriented Paradigm", M.S. Thesis, Department of Computer Science, Michigan Technological University, iii + 42 pp., 1997.
[14] L. C. Briand, J. Daly and J. Wust, "A Unified Framework for Cohesion Measurement in Object Oriented Systems", Empirical Software Engineering, vol. 3, no. 1, pp. 67-117, 1998.
[15] H. Zuse, A Framework for Software Measurement, New York: Walter de Gruyter, 1988.
[16] V. R. Basili, L. C. Briand, and W. Melo, "A Validation of Object Oriented Design Metrics as Quality Indicators", IEEE Transactions on Software Engineering, vol. 22, no. 10, pp. 751-761, 1996.
[17] D. Spinellis, "ckjm - A Tool for Calculating Chidamber and Kemerer Java Metrics", http://www.spinellis.gr/sw/ckjm/doc/indexw.html
[18] E. U. Okike, "Measuring Class Cohesion in Object-Oriented Systems Using Chidamber and Kemerer Metrics and Java as Case Study", Ph.D. Thesis, Department of Computer Science, University of Ibadan, xvii + 133 pp., 2007.
[19] E. U. Okike and A. Osofisan, "An Evaluation of Chidamber and Kemerer's Lack of Cohesion in Method (LCOM) Metric Using Different Normalization Approaches", Afr. J. Comp. & ICT, vol. 1, no. 2, pp. 35-54, ISSN 2006-1781, 2008.

Ezekiel U. Okike received the BSc degree in computer science from the University of Ibadan, Nigeria, in 1992, the Master of Information Science (MInfSc) in 1995 and the PhD in computer science in 2007, all from the same university. He has been a lecturer in the Department of Computer Science, University of Ibadan, since 1999. Since September 2008 he has been on leave as a senior lecturer and Dean of the School of Computer Studies, Kampala International University, Uganda. His current research interests are in the areas of software engineering, software metrics, compilers and programming languages. He is a member of the IEEE Computer and Communication societies.

IJCSI CALL FOR PAPERS SEPTEMBER 2010 ISSUE
Volume 7, Issue 5

The topics suggested by this issue can be discussed in terms of concepts, surveys, state of the art, research, standards, implementations, running experiments, applications, and industrial case studies. Authors are invited to submit complete unpublished papers, which are not under review in any other conference or journal, in the following, but not limited to, topic areas. See the authors' guide for manuscript preparation and submission guidelines.

Accepted papers will be published online and authors will be provided with printed copies. Published papers are indexed by Google Scholar, Cornell University Library, ScientificCommons, CiteSeerX, Bielefeld Academic Search Engine (BASE), SCIRUS and more.
Deadline: 31st July 2010
Notification: 31st August 2010
Revision: 10th September 2010
Online Publication: 30th September 2010

Topic areas include, but are not limited to:
• Evolutionary computation
• Industrial systems
• Autonomic and autonomous systems
• Bio-technologies
• Knowledge data systems
• Mobile and distance education
• Intelligent techniques, logics, and systems
• Knowledge processing
• Information technologies
• Internet and web technologies
• Digital information processing
• Cognitive science and knowledge agent-based systems
• Mobility and multimedia systems
• Systems performance
• Networking and telecommunications
• Software development and deployment
• Knowledge virtualization
• Systems and networks on the chip
• Context-aware systems
• Networking technologies
• Security in network, systems, and applications
• Knowledge for global defense
• Information Systems [IS]
• IPv6 Today - Technology and deployment
• Modeling
• Optimization
• Complexity
• Natural Language Processing
• Speech Synthesis
• Data Mining

For more topics, please see http://www.ijcsi.org/call-for-papers.php

All submitted papers will be judged based on their quality by the technical committee and reviewers. Papers that describe research and experimentation are encouraged. All paper submissions will be handled electronically, and detailed instructions on the submission procedure are available on the IJCSI website (www.IJCSI.org). For more information, please visit the journal website (www.IJCSI.org).

© IJCSI PUBLICATION 2010
www.IJCSI.org

IJCSI

The International Journal of Computer Science Issues (IJCSI) is a refereed journal for scientific papers dealing with any area of computer science research. The purpose of establishing the scientific journal is the assistance in the development of science, fast operative publication and storage of materials and results of scientific research, and representation of the scientific conception of the society. It also provides a venue for researchers, students and professionals to submit ongoing research and developments in these areas. Authors are encouraged to contribute to the journal by submitting articles that illustrate new research results, projects, surveying works and industrial experiences that describe significant advances in the field of computer science.

Indexing of IJCSI:
1. Google Scholar
2. Bielefeld Academic Search Engine (BASE)
3. CiteSeerX
4. SCIRUS
5. Docstoc
6. Scribd
7. Cornell University Library
8. SciRate
9. ScientificCommons

© IJCSI PUBLICATION
www.IJCSI.org