Full-Text - International Journal of Computer Science Issues

© IJCSI PUBLICATION 2010
www.IJCSI.org
IJCSI Publicity Board 2010
Dr. Borislav D Dimitrov
Department of General Practice, Royal College of Surgeons in Ireland
Dublin, Ireland
Dr. Vishal Goyal
Department of Computer Science, Punjabi University
Patiala, India
Mr. Nehinbe Joshua
University of Essex
Colchester, Essex, UK
Mr. Vassilis Papataxiarhis
Department of Informatics and Telecommunications
National and Kapodistrian University of Athens, Athens, Greece
EDITORIAL
In this second edition of 2010, we bring forward issues from various dynamic computer science areas ranging from system performance, computer vision, artificial intelligence, software engineering, multimedia, pattern recognition, information retrieval, databases and networking, among others.
We thank all our reviewers for providing constructive comments on papers
sent to them for review. This helps enormously in improving the quality
of papers published in this issue.
IJCSI is still maintaining its policy of sending print copies of the journal
to all corresponding authors worldwide free of charge. In addition to the availability of full texts on the journal website, all published papers are deposited in open-access repositories to make access easier and to ensure the continuous availability of its proceedings.
We are pleased to present IJCSI Volume 7, Issue 2, split into five numbers (this is IJCSI Vol. 7, Issue 2, No. 3). The acceptance rate for this issue is 27.55%: out of the 98 papers submitted for review, 27 were eventually accepted for publication in this month's issue.
We wish you a happy reading!
IJCSI Editorial Board
March 2010
www.IJCSI.org
IJCSI Editorial Board 2010
Dr Tristan Vanrullen
Chief Editor
LPL, Laboratoire Parole et Langage - CNRS - Aix en Provence, France
LABRI, Laboratoire Bordelais de Recherche en Informatique - INRIA - Bordeaux, France
LEEE, Laboratoire d'Esthétique et Expérimentations de l'Espace - Université d'Auvergne, France
Dr Constantino Malagôn
Associate Professor
Nebrija University
Spain
Dr Lamia Fourati Chaari
Associate Professor
Multimedia and Informatics Higher Institute in SFAX
Tunisia
Dr Mokhtar Beldjehem
Professor
Sainte-Anne University
Halifax, NS, Canada
Dr Pascal Chatonnay
Assistant Professor
Maître de Conférences
Laboratoire d'Informatique de l'Université de Franche-Comté
Université de Franche-Comté
France
Dr Yee-Ming Chen
Professor
Department of Industrial Engineering and Management
Yuan Ze University
Taiwan
Dr Vishal Goyal
Assistant Professor
Department of Computer Science
Punjabi University
Patiala, India
Dr Natarajan Meghanathan
Assistant Professor
REU Program Director
Department of Computer Science
Jackson State University
Jackson, USA
Dr Deepak Laxmi Narasimha
Department of Software Engineering,
Faculty of Computer Science and Information Technology,
University of Malaya,
Kuala Lumpur, Malaysia
Dr Navneet Agrawal
Assistant Professor
Department of ECE,
College of Technology & Engineering,
MPUAT, Udaipur 313001 Rajasthan, India
Prof N. Jaisankar
Assistant Professor
School of Computing Sciences,
VIT University
Vellore, Tamilnadu, India
IJCSI Reviewers Committee 2010
• Mr. Markus Schatten, University of Zagreb, Faculty of Organization and Informatics, Croatia
• Mr. Vassilis Papataxiarhis, Department of Informatics and Telecommunications, National and Kapodistrian
University of Athens, Athens, Greece
• Dr Modestos Stavrakis, University of the Aegean, Greece
• Dr Fadi KHALIL, LAAS -- CNRS Laboratory, France
• Dr Dimitar Trajanov, Faculty of Electrical Engineering and Information Technologies, Ss. Cyril and Methodius
University - Skopje, Macedonia
• Dr Jinping Yuan, College of Information System and Management,National Univ. of Defense Tech., China
• Dr Alexis Lazanas, Ministry of Education, Greece
• Dr Stavroula Mougiakakou, University of Bern, ARTORG Center for Biomedical Engineering Research,
Switzerland
• Dr Cyril de Runz, CReSTIC-SIC, IUT de Reims, University of Reims, France
• Mr. Pramodkumar P. Gupta, Dept of Bioinformatics, Dr D Y Patil University, India
• Dr Alireza Fereidunian, School of ECE, University of Tehran, Iran
• Mr. Fred Viezens, Otto-Von-Guericke-University Magdeburg, Germany
• Dr. Richard G. Bush, Lawrence Technological University, United States
• Dr. Ola Osunkoya, Information Security Architect, USA
• Mr. Kotsokostas N.Antonios, TEI Piraeus, Hellas
• Prof Steven Totosy de Zepetnek, U of Halle-Wittenberg & Purdue U & National Sun Yat-sen U, Germany, USA,
Taiwan
• Mr. M Arif Siddiqui, Najran University, Saudi Arabia
• Ms. Ilknur Icke, The Graduate Center, City University of New York, USA
• Prof Miroslav Baca, Faculty of Organization and Informatics, University of Zagreb, Croatia
• Dr. Elvia Ruiz Beltrán, Instituto Tecnológico de Aguascalientes, Mexico
• Mr. Moustafa Banbouk, Engineer du Telecom, UAE
• Mr. Kevin P. Monaghan, Wayne State University, Detroit, Michigan, USA
• Ms. Moira Stephens, University of Sydney, Australia
• Ms. Maryam Feily, National Advanced IPv6 Centre of Excellence (NAV6) , Universiti Sains Malaysia (USM),
Malaysia
• Dr. Constantine YIALOURIS, Informatics Laboratory Agricultural University of Athens, Greece
• Mrs. Angeles Abella, U. de Montreal, Canada
• Dr. Patrizio Arrigo, CNR ISMAC, Italy
• Mr. Anirban Mukhopadhyay, B.P.Poddar Institute of Management & Technology, India
• Mr. Dinesh Kumar, DAV Institute of Engineering & Technology, India
• Mr. Jorge L. Hernandez-Ardieta, INDRA SISTEMAS / University Carlos III of Madrid, Spain
• Mr. AliReza Shahrestani, University of Malaya (UM), National Advanced IPv6 Centre of Excellence (NAv6),
Malaysia
• Mr. Blagoj Ristevski, Faculty of Administration and Information Systems Management - Bitola, Republic of
Macedonia
• Mr. Mauricio Egidio Cantão, Department of Computer Science / University of São Paulo, Brazil
• Mr. Jules Ruis, Fractal Consultancy, The Netherlands
• Mr. Mohammad Iftekhar Husain, University at Buffalo, USA
• Dr. Deepak Laxmi Narasimha, Department of Software Engineering, Faculty of Computer Science and
Information Technology, University of Malaya, Malaysia
• Dr. Paola Di Maio, DMEM University of Strathclyde, UK
• Dr. Bhanu Pratap Singh, Institute of Instrumentation Engineering, Kurukshetra University Kurukshetra, India
• Mr. Sana Ullah, Inha University, South Korea
• Mr. Cornelis Pieter Pieters, Condast, The Netherlands
• Dr. Amogh Kavimandan, The MathWorks Inc., USA
• Dr. Zhinan Zhou, Samsung Telecommunications America, USA
• Mr. Alberto de Santos Sierra, Universidad Politécnica de Madrid, Spain
• Dr. Md. Atiqur Rahman Ahad, Department of Applied Physics, Electronics & Communication Engineering
(APECE), University of Dhaka, Bangladesh
• Dr. Charalampos Bratsas, Lab of Medical Informatics, Medical Faculty, Aristotle University, Thessaloniki, Greece
• Ms. Alexia Dini Kounoudes, Cyprus University of Technology, Cyprus
• Mr. Anthony Gesase, University of Dar es salaam Computing Centre, Tanzania
• Dr. Jorge A. Ruiz-Vanoye, Universidad Juárez Autónoma de Tabasco, Mexico
• Dr. Alejandro Fuentes Penna, Universidad Popular Autónoma del Estado de Puebla, México
• Dr. Ocotlán Díaz-Parra, Universidad Juárez Autónoma de Tabasco, México
• Mrs. Nantia Iakovidou, Aristotle University of Thessaloniki, Greece
• Mr. Vinay Chopra, DAV Institute of Engineering & Technology, Jalandhar
• Ms. Carmen Lastres, Universidad Politécnica de Madrid - Centre for Smart Environments, Spain
• Dr. Sanja Lazarova-Molnar, United Arab Emirates University, UAE
• Mr. Srikrishna Nudurumati, Imaging & Printing Group R&D Hub, Hewlett-Packard, India
• Dr. Olivier Nocent, CReSTIC/SIC, University of Reims, France
• Mr. Burak Cizmeci, Isik University, Turkey
• Dr. Carlos Jaime Barrios Hernandez, LIG (Laboratory Of Informatics of Grenoble), France
• Mr. Md. Rabiul Islam, Rajshahi university of Engineering & Technology (RUET), Bangladesh
• Dr. LAKHOUA Mohamed Najeh, ISSAT - Laboratory of Analysis and Control of Systems, Tunisia
• Dr. Alessandro Lavacchi, Department of Chemistry - University of Firenze, Italy
• Mr. Mungwe, University of Oldenburg, Germany
• Mr. Somnath Tagore, Dr D Y Patil University, India
• Ms. Xueqin Wang, ATCS, USA
• Dr. Borislav D Dimitrov, Department of General Practice, Royal College of Surgeons in Ireland, Dublin, Ireland
• Dr. Fondjo Fotou Franklin, Langston University, USA
• Dr. Vishal Goyal, Department of Computer Science, Punjabi University, Patiala, India
• Mr. Thomas J. Clancy, ACM, United States
• Dr. Ahmed Nabih Zaki Rashed, Faculty of Electronic Engineering, Menouf 32951, Electronics and Electrical
Communication Engineering Department, Menoufia University, Egypt
• Dr. Rushed Kanawati, LIPN, France
• Mr. Koteshwar Rao, K G Reddy College Of ENGG.&TECH,CHILKUR, RR DIST.,AP, India
• Mr. M. Nagesh Kumar, Department of Electronics and Communication, J.S.S. research foundation, Mysore
University, Mysore-6, India
• Dr. Ibrahim Noha, Grenoble Informatics Laboratory, France
• Mr. Muhammad Yasir Qadri, University of Essex, UK
• Mr. Annadurai P., KMCPGS, Lawspet, Pondicherry, India (Aff. Pondicherry University, India)
• Mr. E Munivel , CEDTI (Govt. of India), India
• Dr. Chitra Ganesh Desai, University of Pune, India
• Mr. Syed, Analytical Services & Materials, Inc., USA
• Dr. Mashud Kabir, Department of Computer Science, University of Tuebingen, Germany
• Mrs. Payal N. Raj, Veer South Gujarat University, India
• Mrs. Priti Maheshwary, Maulana Azad National Institute of Technology, Bhopal, India
• Mr. Mahesh Goyani, S.P. University, India, India
• Mr. Vinay Verma, Defence Avionics Research Establishment, DRDO, India
• Dr. George A. Papakostas, Democritus University of Thrace, Greece
• Mr. Abhijit Sanjiv Kulkarni, DARE, DRDO, India
• Mr. Kavi Kumar Khedo, University of Mauritius, Mauritius
• Dr. B. Sivaselvan, Indian Institute of Information Technology, Design & Manufacturing, Kancheepuram, IIT
Madras Campus, India
• Dr. Partha Pratim Bhattacharya, Greater Kolkata College of Engineering and Management, West Bengal
University of Technology, India
• Mr. Manish Maheshwari, Makhanlal C University of Journalism & Communication, India
• Dr. Siddhartha Kumar Khaitan, Iowa State University, USA
• Dr. Mandhapati Raju, General Motors Inc, USA
• Dr. M.Iqbal Saripan, Universiti Putra Malaysia, Malaysia
• Mr. Ahmad Shukri Mohd Noor, University Malaysia Terengganu, Malaysia
• Mr. Selvakuberan K, TATA Consultancy Services, India
• Dr. Smita Rajpal, Institute of Technology and Management, Gurgaon, India
• Mr. Rakesh Kachroo, Tata Consultancy Services, India
• Mr. Raman Kumar, National Institute of Technology, Jalandhar, Punjab., India
• Mr. Nitesh Sureja, S.P.University, India
• Dr. M. Emre Celebi, Louisiana State University, Shreveport, USA
• Dr. Aung Kyaw Oo, Defence Services Academy, Myanmar
• Mr. Sanjay P. Patel, Sankalchand Patel College of Engineering, Visnagar, Gujarat, India
• Dr. Pascal Fallavollita, Queens University, Canada
• Mr. Jitendra Agrawal, Rajiv Gandhi Technological University, Bhopal, MP, India
• Mr. Ismael Rafael Ponce Medellín, Cenidet (Centro Nacional de Investigación y Desarrollo Tecnológico), Mexico
• Mr. Supheakmungkol SARIN, Waseda University, Japan
• Mr. Shoukat Ullah, Govt. Post Graduate College Bannu, Pakistan
• Dr. Vivian Augustine, Telecom Zimbabwe, Zimbabwe
• Mrs. Mutalli Vatila, Offshore Business Philippines, Philippines
• Dr. Emanuele Goldoni, University of Pavia, Dept. of Electronics, TLC & Networking Lab, Italy
• Mr. Pankaj Kumar, SAMA, India
• Dr. Himanshu Aggarwal, Punjabi University,Patiala, India
• Dr. Vauvert Guillaume, Europages, France
• Prof Yee Ming Chen, Department of Industrial Engineering and Management, Yuan Ze University, Taiwan
• Dr. Constantino Malagón, Nebrija University, Spain
• Prof Kanwalvir Singh Dhindsa, B.B.S.B.Engg.College, Fatehgarh Sahib (Punjab), India
• Mr. Angkoon Phinyomark, Prince of Songkla University, Thailand
• Ms. Nital H. Mistry, Veer Narmad South Gujarat University, Surat, India
• Dr. M.R.Sumalatha, Anna University, India
• Mr. Somesh Kumar Dewangan, Disha Institute of Management and Technology, India
• Mr. Raman Maini, Punjabi University, Patiala(Punjab)-147002, India
• Dr. Abdelkader Outtagarts, Alcatel-Lucent Bell-Labs, France
• Prof Dr. Abdul Wahid, AKG Engg. College, Ghaziabad, India
• Mr. Prabu Mohandas, Anna University/Adhiyamaan College of Engineering, India
• Dr. Manish Kumar Jindal, Panjab University Regional Centre, Muktsar, India
• Prof Mydhili K Nair, M S Ramaiah Institute of Technnology, Bangalore, India
• Dr. C. Suresh Gnana Dhas, VelTech MultiTech Dr.Rangarajan Dr.Sagunthala Engineering
College,Chennai,Tamilnadu, India
• Prof Akash Rajak, Krishna Institute of Engineering and Technology, Ghaziabad, India
• Mr. Ajay Kumar Shrivastava, Krishna Institute of Engineering & Technology, Ghaziabad, India
• Mr. Deo Prakash, SMVD University, Kakryal(J&K), India
• Dr. Vu Thanh Nguyen, University of Information Technology HoChiMinh City, VietNam
• Prof Deo Prakash, SMVD University (A Technical University open on I.I.T. Pattern) Kakryal (J&K), India
• Dr. Navneet Agrawal, Dept. of ECE, College of Technology & Engineering, MPUAT, Udaipur 313001 Rajasthan,
India
• Mr. Sufal Das, Sikkim Manipal Institute of Technology, India
• Mr. Anil Kumar, Sikkim Manipal Institute of Technology, India
• Dr. B. Prasanalakshmi, King Saud University, Saudi Arabia.
• Dr. K D Verma, S.V. (P.G.) College, Aligarh, India
• Mr. Mohd Nazri Ismail, System and Networking Department, University of Kuala Lumpur (UniKL), Malaysia
• Dr. Nguyen Tuan Dang, University of Information Technology, Vietnam National University Ho Chi Minh city,
Vietnam
• Dr. Abdul Aziz, University of Central Punjab, Pakistan
• Dr. P. Vasudeva Reddy, Andhra University, India
• Mrs. Savvas A. Chatzichristofis, Democritus University of Thrace, Greece
• Mr. Marcio Dorn, Federal University of Rio Grande do Sul - UFRGS Institute of Informatics, Brazil
• Mr. Luca Mazzola, University of Lugano, Switzerland
• Mr. Nadeem Mahmood, Department of Computer Science, University of Karachi, Pakistan
• Mr. Hafeez Ullah Amin, Kohat University of Science & Technology, Pakistan
• Dr. Professor Vikram Singh, Ch. Devi Lal University, Sirsa (Haryana), India
• Mr. M. Azath, Calicut/Mets School of Enginerring, India
• Dr. J. Hanumanthappa, DoS in CS, University of Mysore, India
• Dr. Shahanawaj Ahamad, Department of Computer Science, King Saud University, Saudi Arabia
• Dr. K. Duraiswamy, K. S. Rangasamy College of Technology, India
• Prof. Dr Mazlina Esa, Universiti Teknologi Malaysia, Malaysia
• Dr. P. Vasant, Power Control Optimization (Global), Malaysia
• Dr. Taner Tuncer, Firat University, Turkey
• Dr. Norrozila Sulaiman, University Malaysia Pahang, Malaysia
• Prof. S K Gupta, BCET, Guradspur, India
• Dr. Latha Parameswaran, Amrita Vishwa Vidyapeetham, India
• Mr. M. Azath, Anna University, India
• Dr. P. Suresh Varma, Adikavi Nannaya University, India
• Prof. V. N. Kamalesh, JSS Academy of Technical Education, India
• Dr. D Gunaseelan, Ibri College of Technology, Oman
• Mr. Sanjay Kumar Anand, CDAC, India
• Mr. Akshat Verma, CDAC, India
• Mrs. Fazeela Tunnisa, Najran University, Kingdom of Saudi Arabia
• Mr. Hasan Asil, Islamic Azad University Tabriz Branch (Azarshahr), Iran
• Prof. Dr Sajal Kabiraj, Fr. C Rodrigues Institute of Management Studies (Affiliated to University of Mumbai,
India), India
• Mr. Syed Fawad Mustafa, GAC Center, Shandong University, China
• Dr. Natarajan Meghanathan, Jackson State University, Jackson, MS, USA
• Prof. Selvakani Kandeeban, Francis Xavier Engineering College, India
• Mr. Tohid Sedghi, Urmia University, Iran
• Dr. S. Sasikumar, PSNA College of Engg and Tech, Dindigul, India
• Dr. Anupam Shukla, Indian Institute of Information Technology and Management Gwalior, India
• Mr. Rahul Kala, Indian Institute of Information Technology and Management Gwalior, India
• Dr. A V Nikolov, National University of Lesotho, Lesotho
• Mr. Kamal Sarkar, Department of Computer Science and Engineering, Jadavpur University, India
• Dr. Mokhled S. AlTarawneh, Computer Engineering Dept., Faculty of Engineering, Mutah University, Jordan,
Jordan
• Prof. Sattar J Aboud, Iraqi Council of Representatives, Iraq-Baghdad
• Dr. Prasant Kumar Pattnaik, Department of CSE, KIST, India
• Dr. Mohammed Amoon, King Saud University, Saudi Arabia
• Dr. Tsvetanka Georgieva, Department of Information Technologies, St. Cyril and St. Methodius University of
Veliko Tarnovo, Bulgaria
• Dr. Eva Volna, University of Ostrava, Czech Republic
• Mr. Ujjal Marjit, University of Kalyani, West-Bengal, India
• Dr. Prasant Kumar Pattnaik, KIST,Bhubaneswar,India, India
• Dr. Guezouri Mustapha, Department of Electronics, Faculty of Electrical Engineering, University of Science and
Technology (USTO), Oran, Algeria
• Mr. Maniyar Shiraz Ahmed, Najran University, Najran, Saudi Arabia
• Dr. Sreedhar Reddy, JNTU, SSIETW, Hyderabad, India
• Mr. Bala Dhandayuthapani Veerasamy, Mekelle University, Ethiopia
• Mr. Arash Habibi Lashkari, University of Malaya (UM), Malaysia
• Mr. Rajesh Prasad, LDC Institute of Technical Studies, Allahabad, India
• Ms. Habib Izadkhah, Tabriz University, Iran
• Dr. Lokesh Kumar Sharma, Chhattisgarh Swami Vivekanand Technical University Bhilai, India
• Mr. Kuldeep Yadav, IIIT Delhi, India
• Dr. Naoufel Kraiem, Institut Superieur d'Informatique, Tunisia
• Prof. Frank Ortmeier, Otto-von-Guericke-Universitaet Magdeburg, Germany
• Mr. Ashraf Aljammal, USM, Malaysia
• Mrs. Amandeep Kaur, Department of Computer Science, Punjabi University, Patiala, Punjab, India
• Mr. Babak Basharirad, University Technology of Malaysia, Malaysia
• Mr. Avinash singh, Kiet Ghaziabad, India
• Dr. Miguel Vargas-Lombardo, Technological University of Panama, Panama
• Dr. Tuncay Sevindik, Firat University, Turkey
• Ms. Pavai Kandavelu, Anna University Chennai, India
• Mr. Ravish Khichar, Global Institute of Technology, India
• Mr Aos Alaa Zaidan Ansaef, Multimedia University, Cyberjaya, Malaysia
• Dr. Awadhesh Kumar Sharma, Dept. of CSE, MMM Engg College, Gorakhpur-273010, UP, India
• Mr. Qasim Siddique, FUIEMS, Pakistan
• Dr. Le Hoang Thai, University of Science, Vietnam National University - Ho Chi Minh City, Vietnam
• Dr. Saravanan C, NIT, Durgapur, India
• Dr. Vijay Kumar Mago, DAV College, Jalandhar, India
• Dr. Do Van Nhon, University of Information Technology, Vietnam
• Mr. Georgios Kioumourtzis, University of Patras, Greece
• Mr. Amol D.Potgantwar, SITRC Nasik, India
• Mr. Lesedi Melton Masisi, Council for Scientific and Industrial Research, South Africa
• Dr. Karthik.S, Department of Computer Science & Engineering, SNS College of Technology, India
• Mr. Nafiz Imtiaz Bin Hamid, Department of Electrical and Electronic Engineering, Islamic University of
Technology (IUT), Bangladesh
• Mr. Muhammad Imran Khan, Universiti Teknologi PETRONAS, Malaysia
• Dr. Abdul Kareem M. Radhi, Information Engineering - Nahrin University, Iraq
• Dr. Mohd Nazri Ismail, University of Kuala Lumpur, Malaysia
• Dr. Manuj Darbari, BBDNITM, Institute of Technology, A-649, Indira Nagar, Lucknow 226016, India
• Ms. Izerrouken, INP-IRIT, France
• Mr. Nitin Ashokrao Naik, Dept. of Computer Science, Yeshwant Mahavidyalaya, Nanded, India
• Mr. Nikhil Raj, National Institute of Technology, Kurukshetra, India
• Prof. Maher Ben Jemaa, National School of Engineers of Sfax, Tunisia
• Prof. Rajeshwar Singh, BRCM College of Engineering and Technology, Bahal Bhiwani, Haryana, India
• Mr. Gaurav Kumar, Department of Computer Applications, Chitkara Institute of Engineering and Technology,
Rajpura, Punjab, India
• Mr. Ajeet Kumar Pandey, Indian Institute of Technology, Kharagpur, India
• Mr. Rajiv Phougat, IBM Corporation, USA
• Mrs. Aysha V, College of Applied Science Pattuvam affiliated with Kannur University, India
• Dr. Debotosh Bhattacharjee, Department of Computer Science and Engineering, Jadavpur University, Kolkata-700032, India
• Dr. Neelam Srivastava, Institute of engineering & Technology, Lucknow, India
• Prof. Sweta Verma, Galgotia's College of Engineering & Technology, Greater Noida, India
• Mr. Harminder Singh BIndra, MIMIT, INDIA
• Dr. Lokesh Kumar Sharma, Chhattisgarh Swami Vivekanand Technical University, Bhilai, India
• Mr. Tarun Kumar, U.P. Technical University/Radha Govinend Engg. College, India
• Mr. Tirthraj Rai, Jawahar Lal Nehru University, New Delhi, India
• Mr. Akhilesh Tiwari, Madhav Institute of Technology & Science, India
• Mr. Dakshina Ranjan Kisku, Dr. B. C. Roy Engineering College, WBUT, India
• Ms. Anu Suneja, Maharshi Markandeshwar University, Mullana, Haryana, India
• Mr. Munish Kumar Jindal, Punjabi University Regional Centre, Jaito (Faridkot), India
• Dr. Ashraf Bany Mohammed, Management Information Systems Department, Faculty of Administrative and
Financial Sciences, Petra University, Jordan
• Mrs. Jyoti Jain, R.G.P.V. Bhopal, India
• Dr. Lamia Chaari, SFAX University, Tunisia
• Mr. Akhter Raza Syed, Department of Computer Science, University of Karachi, Pakistan
• Prof. Khubaib Ahmed Qureshi, Information Technology Department, HIMS, Hamdard University, Pakistan
• Prof. Boubker Sbihi, Ecole des Sciences de L'Information, Morocco
• Dr. S. M. Riazul Islam, Inha University, South Korea
• Prof. Lokhande S.N., S.R.T.M.University, Nanded (MH), India
• Dr. Vijay H Mankar, Dept. of Electronics, Govt. Polytechnic, Nagpur, India
• Dr. M. Sreedhar Reddy, JNTU, Hyderabad, SSIETW, India
• Mr. Ojesanmi Olusegun, Ajayi Crowther University, Oyo, Nigeria
• Ms. Mamta Juneja, RBIEBT, PTU, India
• Dr. Ekta Walia Bhullar, Maharishi Markandeshwar University, Mullana Ambala (Haryana), India
• Prof. Chandra Mohan, John Bosco Engineering College, India
TABLE OF CONTENTS
1. A General Simulation Framework for Supply Chain Modeling: State of the Art and Case
Study – pg 1-9
Antonio Cimino, Mechanical Department, University of Calabria, Rende (CS), 87036, Italy
Francesco Longo, Mechanical Department, University of Calabria, Rende (CS), 87036, Italy
Giovanni Mirabelli, Mechanical Department, University of Calabria, Rende (CS), 87036, Italy
2. Database Reverse Engineering based on Association Rule Mining – pg 10-15
Nattapon Pannurat, Faculty of Information Sciences, Nakhon Ratchasima College, 290 Moo 2,
Mitraphap Road, Nakhon Ratchasima, 30000, Thailand
Nittaya Kerdprasop, Data Engineering and Knowledge Discovery Research Unit, Suranaree
University of Technology, 111 University Avenue, Nakhon Ratchasima, 30000, Thailand
Kittisak Kerdprasop, Data Engineering and Knowledge Discovery Research Unit, Suranaree
University of Technology, 111 University Avenue, Nakhon Ratchasima, 30000, Thailand
3. A New Approach to Keyphrase Extraction Using Neural Networks – pg 16-25
Kamal Sarkar, Computer Science and Engineering Department, Jadavpur University, Kolkata-700 032, India
Mita Nasipuri, Computer Science and Engineering Department, Jadavpur University, Kolkata-700 032, India
Suranjan Ghose, Computer Science and Engineering Department, Jadavpur University, Kolkata-700 032, India
4. C Implementation & comparison of companding & silence audio compression techniques
– pg 26-30
Kruti Dangarwala, Department of Computer Engineering, Sri S'ad Vidya Mandal Institute of
Technology, Bharuch, Gujarat, India
Jigar Shah, Department of Electronics and Telecommunication Engineering, Sri S'ad Vidya
Mandal Institute of Technology, Bharuch, Gujarat, India
5. Color Image Compression Based On Wavelet Packet Best Tree – pg 31-35
G. K. Kharate, Matoshri College of Engineering and Research Centre, Nashik - 422003,
Maharashtra, India
V. H. Patil, Department of Computer Engineering, University of Pune
6. A Pedagogical Evaluation and Discussion about the Lack of Cohesion in Method
(LCOM) Metric Using Field Experiment – pg 36-43
Ezekiel Okike, School of Computer Studies, Kampala International University , Kampala,
Uganda 256, Uganda
IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 2, No 3, March 2010
ISSN (Online): 1694-0784
ISSN (Print): 1694-0814
A General Simulation Framework for Supply Chain Modeling:
State of the Art and Case Study
Antonio Cimino1, Francesco Longo2 and Giovanni Mirabelli3
1, 2, 3 Mechanical Department, University of Calabria, Rende (CS), 87036, Italy
Abstract
Nowadays there is a large availability of discrete event
simulation software that can be easily used in different
domains: from industry to supply chain, from healthcare to
business management, from training to complex systems
design. Simulation engines of commercial discrete event
simulation software use specific rules and logics for
simulation time and events management. Difficulties and
limitations come up when commercial discrete event
simulation software is used for modeling complex real-world
systems (i.e. supply chains, industrial plants). The
objective of this paper is twofold: first a state of the art on
commercial discrete event simulation software and an
overview on discrete event simulation models
development by using general purpose programming
languages are presented; then a Supply Chain Order
Performance Simulator (SCOPS, developed in C++) for
investigating the inventory management problem along the
supply chain under different supply chain scenarios is
proposed to readers.
Keywords: Discrete Event Simulation, Simulation languages,
Supply Chain, Inventory Management.
1. Introduction
As reported in [1], discrete-event simulation software
selection can be an exceedingly difficult task, especially
for inexperienced users. The simulation software selection problem
was already known many years ago. A simulation buyer’s
guide that identifies possible features to consider in
simulation software selection is proposed in [2]. The guide
includes in its analysis several aspects
such as Input, Processing, Output, Environment, Vendor
and Costs. A survey on users’ requirements about discrete-
event simulation software is presented in [3]. The analysis
shows that simulation software with good visualization/animation properties is easier to use but limited in the case of complex and non-standard problems.
Further limitations include lack of software compatibility,
output analysis tools and advanced programming languages.
In [4] and [5] functionalities and potentialities of different
commercial discrete-event simulation software, in order to
support users in software selection, are reported. In this
case the author provides the reader with information about
software vendor, primary software applications, hardware
platform requirements, simulation animation, support,
training and pricing.
Needless to say, Modeling & Simulation should be
used when analytical approaches do not succeed in
identifying proper solutions for analyzing complex
systems (i.e. supply chains, industrial plants, etc.). For
many of these systems, simulation models must be: (i)
flexible and parametric (for supporting scenarios
evaluation) (ii) time efficient (even in correspondence of
very complex real-world systems) and (iii) repetitive in
their architectures for scalability purposes [6].
Let us consider the traditional modeling approach
proposed by two commercial discrete event simulation
software, Em-Plant by Siemens PLM Software solutions
and Anylogic by Xj-Technologies. Both of them propose a
typical object oriented modeling approach. Each discrete
event simulation model is made up by system state
variables, entities and attributes, lists processing, activities
and delays. Usually complex systems involve high
numbers of resources and entities flowing within the
simulation model. The time required for executing a
simulation run depends on the numbers of entities in the
simulation model: the higher the number of entities, the longer the time required for executing a simulation run. In addition, library objects, which should be used for
modeling static entities, very often fall short of recreating
the real system with satisfactory accuracy. In other words,
the traditional modeling approach (proposed by eM-Plant
and Anylogic as well as by a number of discrete event
simulation software), presents two problems: (i)
difficulties in modeling complex scenarios; (ii) too many
entities could cause computational heavy simulation
models. Further information on discrete event simulation
software can be found in [7].
An alternative to commercial discrete event simulation
software is to develop simulation models based on general
purpose programming languages (i.e. C++, Java). The use
of general purpose programming languages allows the development of ad-hoc simulation models with class objects able to carefully recreate the behavior of the real-world system.
The objective of this paper is twofold: first a state of the
art on commercial discrete event simulation software and
an overview on discrete event simulation models
development by using general purpose programming
languages are presented; then a Supply Chain Order
Performance Simulator (SCOPS, developed in C++) for
investigating the inventory management problem along the
supply chain under different supply chain scenarios is
proposed to readers.
Before getting into the details of the work, a brief overview of the paper sections is reported below. Section 2
provides the reader with a detailed description of different
commercial discrete event simulation software. Section 3
presents a general overview of programming languages
and describes the main steps to develop a simulation
model based on general purpose programming languages.
Section 4 presents a three stages supply chain simulation
model (called SCOPS) used for investigating inventory
problems along the supply chain. Section 5 describes the
simulation experiments carried out by using the simulation
model. Finally, the last section reports conclusions and ongoing research activities.
2. Discrete Event Simulation Software

Table 1 reports the results of a survey on the most widely used discrete event simulation software, conducted among 100 people working in the simulation field. The survey considers, among others, some critical aspects such as domains of application (specifically manufacturing and logistics), 3D and virtual reality potentialities, simulation languages, prices, etc. For each aspect and for each software package the survey reports a score between 0 and 10. Table 1 helps modelers in discrete event simulation software selection. Moreover, the following sections report a brief description of all the software of Table 1 in terms of domains of applicability, types of libraries (i.e. modeling libraries, optimization libraries, etc.), input-output functionalities, animation functionalities, etc.
2.1 Anylogic
Anylogic is a Java based simulation software, by XJ
Technologies [8], used for forecasting and strategic
planning, processes analysis and optimization, optimal
operational management, processes visualization. It is
widely used in logistics, supply chains, manufacturing,
healthcare, consumer markets, project management,
business processes and military. Anylogic supports Agent
Based, Discrete Event and System Dynamics modeling
and simulation. The latest Anylogic version (Anylogic 6) was released in 2007; it supports both graphical and
flow-chart modeling and provides the user with Java code
for simulation models extension. For input data analysis,
Anylogic provides the user with Stat-Fit (a simulation
support software by Geer Mountain Software Corp.) for
distributions fitting and statistics analysis. Output analysis
functionalities are provided by different types of datasets,
charts and histograms (including export function to text
files or Excel spreadsheets). Finally, simulation optimization
is performed by using Optquest, an optimization tool
integrated in Anylogic.
2.2 Arena
Arena is a simulation software by Rockwell Corporation
[9] and it is used in different application domains: from
manufacturing to supply chain (including logistics,
warehousing and distribution), from customer service and
strategies to internal business processes. Arena (as
Anylogic) provides the user with objects libraries for
systems modeling and with a domain-specific simulation
language, SIMAN [10]. Simulation optimizations are
carried out by using Optquest. Arena includes three
modules respectively called Arena Input Analyzer (for
distributions fitting), Arena Output Analyzer (for
simulation output analysis) and Arena Process Analyzer
(for simulation experiments design). Moreover, Arena provides the user with animation at run time and allows importing CAD drawings to enhance animation capabilities.
Table 1: Survey on most widely used Simulation software (scores from 0 to 10)

                        Anylogic  Arena  AutoMod  Em-plant  Promodel  Flexsim  Witness
Logistic                  6.5      7.5     7        7.2       6.5       7        7.5
Manufacturing             6.6      7.5     6.5      7.2       6.7       6.7      7.5
3D Virtual Reality        6.6      6.9     7.3      6.8       6.7       7.2      7
Simulation Engine         7        8       7.5      8         7         7.5      8
User Ability              7        8       6        7         9         7.5      8
User Community            6.2      9       6.7      6.5       7.5       6.6      8.5
Simulation Language       6.8      7       6.25     6.5       6.5       6.7      6.5
Runtime                   7.5      7       6.5      6.5       7.5       6        7
Analysis tools            6.5      8       6.9      7.1       7.7       6        7.8
Internal Programming      7.2      7       6        7         6.2       7        6.5
Modular Construction      6.1      7       6        6.5       7.5       7        7
Price                     7        6       5.6      5.8       7         5.7      6
2.3 Automod
2.5 Promodel
Automod is a discrete event simulation software,
developed by Applied Materials Inc. [11] and it is based
on the domain-specific simulation language Automod.
Typical domains of application are manufacturing, supply
chain, warehousing and distribution, automotive, airports
and semiconductor. It is strongly focused on transportation
systems including objects such as conveyor, Path Mover,
Power & Free, Kinematic, Train Conveyor, AS/RS,
Bridge Crane, Tank & Pipe (each one customizable by the
user). For input data analysis, experimental design and
simulation output analysis, Automod provides the user
with AutoStat [12]. Moreover, the software includes different modules, such as AutoView, devoted to supporting simulation animation with AVI formats.
Promodel is a discrete event simulation software
developed by Promodel Corporation [14] and it is used in different application domains: manufacturing, warehousing, logistics and other operational and strategic
situations. Promodel enables users to build computer
models of real situations and experiment with scenarios to
find the best solution. The software provides the users
with an easy to use interface for creating models
graphically. Real systems randomness and variability can
be either recreated by utilizing over 20 statistical
distribution types or directly importing users’ data. Data
can be directly imported and exported with Microsoft
Excel and simulation optimizations are carried out by
using SimRunner or OptQuest. Moreover, the software
technology allows the users to create customized front-end and back-end interfaces that communicate directly with
ProModel.
2.4 Em-Plant
Em-Plant is a Siemens PLM Software solution [13],
developed for strategic production decisions. EM-Plant
enables users to create well-structured, hierarchical models
of production facilities, lines and processes. Em-Plant
object-oriented architecture and modeling capabilities
allow users to create and maintain complex systems,
including advanced control mechanisms. The Application
Object Libraries support the user in modeling complex
scenarios in short time. Furthermore EM-Plant provides
the user with a number of mathematical analysis and
statistics functions for input distribution fitting and single
or multi-level factor analysis, histograms, charts,
bottleneck analyzer and Gantt diagram. Experiments
Design functionalities (with Experiments Manager) are
also provided. Simulation optimization is carried out by
using Genetic Algorithms and Artificial Neural Networks.
2.6 Flexsim
Flexsim is developed by Flexsim Software Products [15]
and allows to model, analyze, visualize, and optimize any
kind of real process - from manufacturing to supply
chains. The software can be interfaced with common
spreadsheet and database applications to import and export
data. Moreover, Flexsim's powerful 3D graphics allow in-model charts and graphs to dynamically display output
statistics. The tool Flexsim Chart gives the possibility to
analyze the simulation results and simulation
optimizations can be performed by using both Optquest as
well as a built-in experimenter tool. Finally, in addition to the features of the previously described software, Flexsim allows users to create their own classes, libraries, GUIs, or applications.
2.7 Witness
Witness is developed by Lanner Group Limited [16]. It
allows users to represent real-world processes in a dynamic,
animated computer model and then experiment with
“what-if” alternative scenarios to identify the optimal
solution. The software can be easily linked with the most
common spreadsheet, database and CAD files. The
simulation optimization is performed by the Witness
Optimizer tool that can be used with any Witness model.
Finally the software provides the user with a scenario
manager tool for the analysis of the simulation results.
3. General Purpose and Specific Simulation
Programming Languages
There are many programming languages, general purpose or domain-specific simulation languages (DSLs), that can be used for simulation model development. General purpose
languages are usually adopted when the programming
logics cannot be easily expressed in GUI-based systems or
when simulation results are more important than advanced
animation/visualization [17]. Simulation models can be
developed both by using discrete-event simulation
software and general purpose languages, such as C++ or
Java [18].
As reported in [1], a simulation study requires a number of different steps; it starts with problem formulation and passes through different and iterative steps: conceptual model definition, data collection, simulation model implementation, verification, validation and accreditation, simulation experiments, simulation results analysis, documentation and reports. Simulation model
development by using general purpose programming
languages (i.e. C++) requires a deep knowledge of the
logical foundation of discrete event simulation. Among
different aspects to be considered, it is important to
underline that a discrete event simulation model consists of entities, resources, control elements and operations [19].
Dynamic entities flow in the simulation model (i.e. parts in
a manufacturing system, products in a supply chain, etc.).
Static entities usually work as resources (a system part that
provides services to dynamic entities). Control elements
(such as variables, boolean expressions, specific
programming code, etc.) support simulation model states
control. Finally, operations represent all the actions
generated by the flow of dynamic entities within the
simulation model. During its life within the simulation
model, an entity changes its state different times. There are
five different entity states [19]: Ready state (the entity is
ready to be processed), Active state (the entity is currently
being processed), Time-delayed state (the entity is delayed until a predetermined simulation time), Condition-delayed state (the entity is delayed until a specific condition is satisfied) and Dormant state (in this case the condition that frees the entity is managed by the modeler).
Entity management is supported by different lists, each
one corresponding to an entity state: the CEL (Current Event List, for active-state entities), the FEL (Future Event List, for time-delayed entities), the DL (Delay List, for condition-delayed entities) and the UML (User-Managed Lists,
for dormant entities). In particular, Siman and GPSS/H
call the CEL list CEC list (Current Events Chain), while
ProModel language calls it AL (Action List). The FEL is
called FEP (Future Events Heap) and FEC (Future Event
Chain) respectively by Siman and GPSS/H. After entities
states definition and lists creation, the next step is the
implementation of the phases of a simulation run: the
Initialization Phase (IP), the Entity Movement Phases
(EMP) and the Clock Update Phase (CUP). A detailed
explanation of the simulation run anatomy is reported in
[19].
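To make the run anatomy described above concrete, the following minimal C++ sketch (illustrative only, not taken from [19] and not the SCOPS code) shows the core event-scheduling loop: a Future Event List ordered by timestamp, a Clock Update Phase that advances the simulation clock to the next event time, and an Entity Movement Phase that executes the corresponding event. All class and function names are hypothetical.

```cpp
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// One scheduled occurrence: a timestamp plus the action to execute.
struct Event {
    double time;
    std::function<void()> action;
};

// Order the Future Event List (FEL) so the earliest event is served first.
struct LaterFirst {
    bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
};

class Simulator {
public:
    void schedule(double delay, std::function<void()> action) {
        fel.push(Event{clock + delay, std::move(action)});
    }
    // Repeatedly perform the Clock Update Phase (advance the clock to the next
    // event time) and the Entity Movement Phase (execute that event).
    void run(double endTime) {
        while (!fel.empty() && fel.top().time <= endTime) {
            Event ev = fel.top();
            fel.pop();
            clock = ev.time;   // Clock Update Phase
            ev.action();       // Entity Movement Phase
        }
    }
    double now() const { return clock; }
private:
    double clock = 0.0;
    std::priority_queue<Event, std::vector<Event>, LaterFirst> fel;
};

int main() {
    Simulator sim;
    int arrivals = 0;
    // Initialization Phase: schedule a recurring dynamic-entity arrival every 3 time units.
    std::function<void()> arrive = [&]() {
        ++arrivals;
        std::printf("t=%.1f arrival #%d\n", sim.now(), arrivals);
        sim.schedule(3.0, arrive);   // schedule the next arrival
    };
    sim.schedule(3.0, arrive);
    sim.run(15.0);
    return 0;
}
```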
4. A Supply Chain Simulation Model
developed in C++
Following the idea of implementing simulation models based on general purpose programming languages, the authors propose a three-stage supply chain simulation model implemented by using Borland C++ Builder to
compile the code (further information on Borland C++
Builder can be found in [20]). The acronym of the
simulation model is SCOPS (Supply-Chain Order
Performance Simulator). SCOPS investigates the
inventory management problem along a three stages
supply chain and allows the user to test different scenarios
in terms of demand intensity, demand variability and lead
times. Note that such a problem can also be investigated by
using discrete event simulation software [21], [22], [23]
and [24].
The supply chain conceptual model includes suppliers,
distribution centers, stores and final customers. In the
supply chain conceptual model a single network node can
be considered as store, distribution center or supplier. A
supply chain begins with one or more suppliers and ends
with one or more stores. Usually stores satisfy final
customers’ demand, distribution centers satisfy stores
demand and plants satisfy distribution centers demand. By
using these three types of nodes we can model a general
supply chain (also including more than three stages).
Suppliers, distribution centers and stores work 6 days per
week, 8 hours per day. Stores receive orders from
customers. An order can be completely or partially
satisfied. At the end of each day, on the basis of an Order-Point, Order-Up-to-Level (s, S) inventory control policy, the stores decide whether to place an order to the distribution
centers or not. Similarly distribution centers place orders
to suppliers according to the same inventory control
policies. Distribution centers select suppliers according to
their lead times (which include production and transportation times).
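As an illustration of the conceptual model just described, the short C++ sketch below shows one possible way to represent the three node types and their per-item state. It is a hypothetical data structure for illustration only, not the SCOPS class design.

```cpp
#include <string>
#include <vector>

// Three node types of the conceptual model. Per item, a node keeps its on-hand,
// on-order and to-be-shipped quantities and its order-up-to-level S.
enum class NodeType { Store, DistributionCenter, Supplier };

struct ItemState {
    double onHand;          // Ohi(t)
    double onOrder;         // Ori(t)
    double toShip;          // Shi(t)
    double orderUpToLevel;  // Si, bounded by the warehouse space assigned to the item
};

struct Node {
    std::string name;
    NodeType type;
    std::vector<ItemState> items;   // one entry per item type
    std::vector<Node*> upstream;    // stores order from DCs, DCs from suppliers
    double leadTimeDays;            // includes production and transportation times
};

int main() {
    // A minimal three-stage chain: one supplier, one distribution center, one store.
    Node supplier{"Supplier 1", NodeType::Supplier, {}, {}, 0.0};
    Node dc{"DC 1", NodeType::DistributionCenter, {}, {&supplier}, 4.0};
    Node store{"Store 1", NodeType::Store, {{25.0, 0.0, 0.0, 120.0}}, {&dc}, 2.0};
    (void)store;
    return 0;
}
```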
According to the Order-Point, Order-Up-to-Level policy [25], an order is emitted whenever the available quantity drops to the order point (s) or lower. A variable replenishment quantity is ordered to raise the available quantity to the order-up-to-level (S). For each item, the order point s is the safety stock, calculated as the standard deviation of the lead-time demand, while the order-up-to-level S is the maximum number of items that can be stored in the warehouse space assigned to the item type considered. For the i-th item, the evaluation of the replenishment quantity, Qi(t), has to take into consideration the quantity available (in terms of inventory position) and the order-up-to-level S. The inventory position (equation 1) is the on-hand inventory, plus the quantity already on order, minus the quantity to be shipped. The calculation of si(t) requires the evaluation of the demand over the lead time. The lead-time demand of the i-th item (see equation 2) is evaluated by using the moving average methodology. At both store and distribution center levels, managers know their peak and off-peak periods, and they usually use that knowledge to manually correct future estimates based on the moving average methodology. They also correct their future estimates based on truck capacity and supplier quantity discounts. Finally, equations 3 and 4 express, respectively, the order condition and the replenishment quantity.
Pi(t) = Ohi(t) + Ori(t) - Shi(t)    (1)

Dlti(t) = \sum_{k=t+1}^{t+LTi} Dfi(k)    (2)

Pi(t) \le si(t) + SSi(t)    (3)

Qi(t) = Si - Pi(t)    (4)
where,
Pi(t), inventory position of the i-th item;
Ohi(t), on-hand inventory of the i-th item;
Ori(t), quantity already on order of the i-th item;
Shi(t), quantity to be shipped of the i-th item;
Dlti(t), lead time demand of the i-th item;
Dfi(t), demand forecast of the i-th item (evaluated by means of the moving average methodology);
LTi, lead time of the i-th item;
si(t), order point at time t of the i-th item;
Si, order-up-to-level of the i-th item;
SSi(t), safety stock at time t of the i-th item;
Qi(t), quantity to be ordered at time t of the i-th item.
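The following C++ fragment is an illustrative sketch of how equations 1-4 can be translated into code for a single item: it computes the inventory position, the lead-time demand obtained from the demand forecast, and the replenishment quantity emitted when the order condition holds. Variable names mirror the symbols above; the numeric values in main are placeholders, and the code is not the actual SCOPS implementation.

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// State of one item at one supply chain node (symbols follow equations 1-4).
struct Item {
    double onHand;                      // Ohi(t)
    double onOrder;                     // Ori(t)
    double toShip;                      // Shi(t)
    double safetyStock;                 // SSi(t)
    double orderUpToLevel;              // Si
    int leadTimeDays;                   // LTi
    std::vector<double> demandForecast; // Dfi(k), one value per future day
};

// Eq. (1): inventory position.
double inventoryPosition(const Item& it) {
    return it.onHand + it.onOrder - it.toShip;
}

// Eq. (2): lead-time demand, i.e. the forecast summed over the next LTi days.
double leadTimeDemand(const Item& it) {
    int horizon = std::min<int>(it.leadTimeDays, (int)it.demandForecast.size());
    return std::accumulate(it.demandForecast.begin(),
                           it.demandForecast.begin() + horizon, 0.0);
}

// Eqs. (3) and (4): if Pi(t) <= si(t) + SSi(t), order Qi(t) = Si - Pi(t).
double replenishmentQuantity(const Item& it) {
    double p = inventoryPosition(it);
    double orderPoint = leadTimeDemand(it);  // si(t)
    if (p <= orderPoint + it.safetyStock)
        return it.orderUpToLevel - p;        // Qi(t)
    return 0.0;                              // no order emitted today
}

int main() {
    Item item{40.0, 10.0, 5.0, 12.0, 120.0, 3, {20.0, 22.0, 18.0, 21.0}};
    std::printf("P = %.1f, Q = %.1f\n", inventoryPosition(item),
                replenishmentQuantity(item));
    return 0;
}
```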
4.1 Supply Chain Order Performance Simulator

SCOPS translates the supply chain conceptual model, recreating the complex and highly stochastic environment of a real supply chain. For each type of product, customers' demand to stores is assumed to be Poisson, with independent arrival processes (in relation to product types). The quantity required at stores is based on triangular distributions with different levels of intensity and variability. Partially satisfied orders are recorded at store and distribution center levels for performance measures calculation.
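As a sketch of this stochastic demand model (illustrative only, not the SCOPS source), Poisson arrivals can be generated by sampling exponential inter-arrival times, and the ordered quantity can be sampled from a triangular distribution. The mean inter-arrival time of 5 and the quantity range [16, 24] below correspond to the medium levels later reported in Table 2; the mode of 20 is an assumption.

```cpp
#include <cmath>
#include <cstdio>
#include <random>

// Sample a triangular distribution on [a, b] with mode c by inverse transform.
double triangular(std::mt19937& rng, double a, double c, double b) {
    double u = std::uniform_real_distribution<double>(0.0, 1.0)(rng);
    double f = (c - a) / (b - a);
    if (u < f) return a + std::sqrt(u * (b - a) * (c - a));
    return b - std::sqrt((1.0 - u) * (b - a) * (b - c));
}

int main() {
    std::mt19937 rng(42);                      // fixed seed: comparable scenarios
    // Poisson arrival process: exponential inter-arrival times with mean 5 hours.
    std::exponential_distribution<double> interArrival(1.0 / 5.0);
    double t = 0.0, horizon = 48.0;            // one simulated six-day week, 8 h/day
    while (true) {
        t += interArrival(rng);
        if (t > horizon) break;
        double qty = triangular(rng, 16.0, 20.0, 24.0);
        std::printf("t=%6.2f h  order of %5.1f units\n", t, qty);
    }
    return 0;
}
```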
In our application example, fifty stores, three distribution centers, ten suppliers and thirty different items define the
supply chain scenario. Figure 1 shows the SCOPS user
interface. The SCOPS graphic interface provides the user
with many commands as, for instance, simulation time
length, start, stop and reset buttons, a check box for unique
simulation experiments (that should be used for resetting
the random number generator in order to compare
different scenarios under the same conditions), supply
chain configurations (number of items, stores, distribution
centers, suppliers, input data, etc.). For each supply chain
node, a button allows access to the following information: number of orders, arrival times, ordered quantities, received quantities, waiting times and fill rates. SCOPS
graphic interface also allows the user to export simulation
results to txt and Excel files. One of the most important
features of SCOPS is the flexibility in terms of scenarios
definition. The graphic interface gives the user the possibility to carry out a number of different what-if analyses by changing the supply chain configuration and input
parameters (i.e. inventory policies, demand forecast
methods, demand intensity and variability, lead times,
inter-arrival times, number of items, number of stores,
distribution centers and plants, number of supply chain
echelons, etc.). Figure 2 display several SCOPS windows
the user can use for setting supply chain configuration and
input parameters.
4.2 SCOPS verification, simulation run length and validation

Verification and validation processes assess the accuracy and the quality of a model throughout a simulation study [26]. Verification and validation are defined by the US Department of Defense Directive 5000.59 as follows: verification is the process of determining that a model implementation accurately represents the developer's conceptual description and specifications, while validation is the process of determining the degree to which a model is an accurate representation of the real world from the perspective of the intended use of the model.
The simulator verification has been carried out by using the debugging technique. Debugging is an iterative process whose purpose is to uncover errors or misconceptions that cause the model's failure and to define and carry out the model changes that correct the errors [1]. In this regard, during the simulation model development, the authors tried to uncover the existence of errors (bugs). The cause of each bug has been identified and the model has been appropriately modified and tested (once again) to ensure error elimination as well as to detect new errors.
Fig. 1 SCOPS User Interface.
Before going into the details of simulation model validation, it is important to evaluate the optimal simulation run length. Note that the supply chain is a non-terminating system, and one of the priority objectives for this type of system is the evaluation of the simulation run length [1]. Information regarding the length of a simulation run is used for the validation. The length is the correct trade-off between results accuracy and the time required for executing the simulation runs. The run length has been determined using the mean square pure error (MSPE) analysis. After the MSPE analysis, the simulation run length chosen is 390 days.
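The MSPE procedure itself is not detailed in the paper, so the following C++ sketch only illustrates, under stated assumptions, one common way to operationalize it: for each candidate run length, the pure error is computed as the variance of an output statistic across replications, and the run length is chosen where this error stabilizes below a tolerance. Function names, the tolerance and the toy data are hypothetical.

```cpp
#include <cstdio>
#include <vector>

// Given one output series per replication (e.g. a daily cumulative statistic),
// compute, for every candidate run length t, the mean square pure error across
// replications: MSPE(t) = sum_r (y_r(t) - ybar(t))^2 / (R - 1).
std::vector<double> mspeCurve(const std::vector<std::vector<double>>& reps) {
    size_t R = reps.size(), T = reps[0].size();
    std::vector<double> mspe(T, 0.0);
    for (size_t t = 0; t < T; ++t) {
        double mean = 0.0;
        for (size_t r = 0; r < R; ++r) mean += reps[r][t];
        mean /= R;
        double ss = 0.0;
        for (size_t r = 0; r < R; ++r) {
            double d = reps[r][t] - mean;
            ss += d * d;
        }
        mspe[t] = ss / (R - 1);
    }
    return mspe;
}

// Pick the shortest run length whose MSPE stays below a tolerance from then on.
size_t chooseRunLength(const std::vector<double>& mspe, double tol) {
    size_t best = mspe.size() - 1;
    for (size_t t = mspe.size(); t-- > 0;) {
        if (mspe[t] <= tol) best = t; else break;
    }
    return best + 1;   // convert index to a length in days
}

int main() {
    // Toy data: three replications of a daily statistic over ten days.
    std::vector<std::vector<double>> reps = {
        {0.50, 0.58, 0.62, 0.63, 0.64, 0.645, 0.648, 0.650, 0.651, 0.651},
        {0.45, 0.55, 0.60, 0.62, 0.635, 0.642, 0.647, 0.649, 0.650, 0.651},
        {0.55, 0.60, 0.63, 0.64, 0.645, 0.648, 0.650, 0.651, 0.651, 0.652}};
    std::vector<double> curve = mspeCurve(reps);
    std::printf("suggested run length: %zu days\n", chooseRunLength(curve, 1e-5));
    return 0;
}
```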
Using for each simulation run the length evaluated by means of the MSPE analysis (390 days), the validation phase has been conducted by using Face Validation (an informal technique). For each retailer and for each distribution centre, the simulation results, in terms of fill rate, have been compared with real results. Note that during the validation process the simulation model works under input conditions identical to those of the real supply chain. The Face Validation results have been analyzed by several experts; their analysis revealed that, in its domain of application, the simulation model recreates the real system with satisfactory accuracy.
Fig. 2 SCOPS Windows.
5. Supply Chain Configuration and Design of
Simulation Experiments
The authors propose as an application example the investigation of 27 different supply chain scenarios. In particular, the simulation experiments take into account three different levels for demand intensity, demand variability and lead times (minimum, medium and maximum, respectively indicated with the "-", "0" and "+" signs). Table 2 reports (as an example) the factors and levels for one of the thirty items considered, and Table 3 reports the scenario descriptions in terms of simulation experiments. Each simulation run has been replicated three times (81 replications in total).
Table 2: Factors and levels

                                        Minimum   Medium    High
Demand Intensity [inter-arrival time]   3         5         8
Demand Variability [item]               [18,22]   [16,24]   [14,26]
Lead Time [days]                        2         3         4
Table 3: Simulation experiments and supply chain scenarios

Run   Demand Intensity   Demand Variability   Lead Time
 1           -                   -                -
 2           -                   -                0
 3           -                   -                +
 4           -                   0                -
 5           -                   0                0
 6           -                   0                +
 7           -                   +                -
 8           -                   +                0
 9           -                   +                +
10           0                   -                -
11           0                   -                0
12           0                   -                +
13           0                   0                -
14           0                   0                0
15           0                   0                +
16           0                   +                -
17           0                   +                0
18           0                   +                +
19           +                   -                -
20           +                   -                0
21           +                   -                +
22           +                   0                -
23           +                   0                0
24           +                   0                +
25           +                   +                -
26           +                   +                0
27           +                   +                +
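The 27 scenarios of Table 3 form a full 3^3 factorial design. The short C++ sketch below (illustrative only, with placeholder level codes) shows how such a run list, including the three replications per scenario, can be enumerated programmatically.

```cpp
#include <cstdio>

int main() {
    const char levels[3] = {'-', '0', '+'};   // minimum, medium, maximum levels
    const int replications = 3;               // 27 scenarios x 3 replications = 81 runs
    int scenario = 0;
    for (int intensity = 0; intensity < 3; ++intensity)
        for (int variability = 0; variability < 3; ++variability)
            for (int leadTime = 0; leadTime < 3; ++leadTime) {
                ++scenario;
                for (int rep = 1; rep <= replications; ++rep)
                    std::printf("scenario %2d  rep %d  intensity %c  variability %c  lead time %c\n",
                                scenario, rep, levels[intensity],
                                levels[variability], levels[leadTime]);
            }
    return 0;
}
```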
5.1 Supply Chain Scenarios analysis and comparison
After the definition of factors levels and scenarios, the
next step is the performance measures definition. SCOPS
includes, among others, two fill rate performance measures defined as (i) the ratio between the number of satisfied orders and the total number of orders; and (ii) the ratio between the lost quantity and the total ordered quantity.
Simulation results, for each supply chain node and for each combination of factor levels, are expressed in terms of average fill rate (intended as the ratio between the number of satisfied orders and the total number of orders).
The huge quantity of simulation results allows the analysis
of a comprehensive set of supply chain operative
scenarios. Let us consider the simulation results regarding
the store #1; we have considered three different scenarios
(low, medium and high lead times) and, within each
scenario, the effects of demand variability and demand
intensity are investigated.
Figure 2 shows the fill rate trend at store #1 in the case of
low lead time.
[Figure: "Fill Rate - Store 1 - Low Lead Time"; average fill rate (y-axis, 0 to 0.8) versus demand variability (low, medium, high) for the low, medium and high demand intensity series.]
Fig. 2 Fill rate at store #1, low lead time.
The major effect is due to changes in demand intensity: as
soon as the demand intensity increases there is a strong
reduction of the fill rate. A similar trend can be observed
in the case of medium and high lead time (figure 3 and
figure 4, respectively).
[Figure: "Fill Rate - Store 1 - Medium Lead Time"; average fill rate versus demand variability for the low, medium and high demand intensity series.]
Fig. 3 Fill Rate at store # 1, medium lead time.
[Figure: "Fill Rate - Store 1 - High Lead Time"; average fill rate versus demand variability for the low, medium and high demand intensity series.]
Fig. 4 Fill Rate at store # 1, high lead time.
The simultaneous comparison of figures 2, 3 and 4 shows the effect of different lead times on the average fill rate. The effect is minor: there is only a small fill rate reduction when passing from a 2-day lead time to 3- and 4-day lead times.
As an additional aspect (not shown in figures 2, 3 and 4), the higher the demand intensity, the higher the average on-hand inventory. Similarly, the higher the demand variability, the higher the average on-hand inventory. In effect, the demand forecast usually overestimates the ordered quantity in the case of high demand intensity and variability.
6. Conclusions
The paper first presents an overview on the most widely
used discrete event simulation software in terms of
domains of applicability, types of libraries (i.e. modeling
libraries, optimization libraries, etc.), input-output
functionalities, animation functionalities, etc. In the
second part, the paper proposes, as an alternative to discrete event simulation software, the use of general purpose programming languages and provides the reader with a brief description of how a discrete event simulation model works.
As an application example, the authors propose a supply chain simulation model (SCOPS) developed in C++. SCOPS is a flexible simulator used for investigating the inventory management problem along a three-stage supply chain under different scenarios. The SCOPS simulator is currently being used for reverse logistics problems in the large-scale retail supply chain.

Acknowledgments
All the authors gratefully thank Professor A. G. Bruzzone (University of Genoa) for his valuable support on this manuscript.
References
[1] J. Banks, Handbook of simulation, Principles, Methodology,
Advances, Application, and Practice, New York: Wiley-Interscience, 1998.
[2] J. Banks, and R.G. Gibson, Simulation software buyer’s
guide, IIE Solution, pp 48-54, 1997.
[3] V. Hlupic, Discrete-Event Simulation Software: What the
Users Want, Simulation, Vol. 73, No. 6, 1999, pp 362-370.
[4] J. J. Swain, Gaming Reality: Biennial survey of discrete-event simulation software tools, OR/MS Today, Vol. 32,
No. 6, 2005, pp. 44-55.
[5] J. J. Swain, New Frontiers in Simulation, Biennial survey of
discrete-event simulation software tools, OR/MS Today,
2007.
[6] F. Longo, and G. Mirabelli, An Advanced supply chain
management tool based on modeling and simulation,
Computer and Industrial Engineering, Vol. 54, No. 3, 2008,
pp 570-588.
[7] G. S. Fishman, Discrete-Event Simulation: Modeling,
Programming, and Analysis. Berlin: Springer-Verlag, 2001.
[8] Anylogic by XjTech, www.xjtech.com.
[9] Arena by Rockwell Corporation, http://www.arenasimulation.com/.
[10] D. J. Hhuente, Critique of SIMAN as a programming
language, ACM Annual Computer Science Conference,
1987, pp 385.
[11] Automod by Applied Materials Inc., http://www.automod.com/.
[12] J. S. Carson, AutoStat: output statistical analysis for
AutoMod users in Proceedings of the 1997 Winter
Simulation Conference, 1997, pp. 649-656.
[13] Em-plant by Siemens PLM Software solutions,
http://www.emplant.com/.
[14] Promodel by Promodel Corporation, http://www.promodel.com/products/promodel/.
[15] Flexsim by Flexsim Software Products, http://www.flexsim.com/.
[16] Witness by Lanner Group Limited, http://www.lanner.com/en/witness.cfm.
[17] V. P. Babich and A. S. Bylev, An approach to compiler
construction for a general-purpose simulation language,
New York: Springer, 1991.
[18] M. Pidd, and R. A. Cassel, Using Java to Develop Discrete
Event Simulations, The Journal of the Operational Research
Society, Vol. 51, No. 4, 2000, pp. 405-412.
[19] T. J. Schriber, and D. T. Brunner, How discrete event
simulation works, in J. Banks, Handbook of Simulation,
New York: Wiley Interscience, 1998.
[20] K. Reisdorph, and K. Henderson, Borland C++
Builder, Apogeo, 2005.
[21] G. De Sensi, F. Longo, G. Mirabelli, Inventory policies
analysis under demand patterns and lead times constraints in
a real supply chain, International Journal of Production Research, Vol. 46, No. 24, 2008, pp 6997-7016.
[22] F. Longo, and G. Mirabelli, An Advanced Supply Chain Management Tool Based On Modeling & Simulation, Computer and Industrial Engineering, Vol. 54, No. 3, 2008, pp 570-588.
[23] D. Curcio, F. Longo, Inventory and Internal Logistics Management as Critical Factors Affecting the Supply Chain Performances, International Journal of Simulation & Process Modelling, Vol. 5, No 2, 2009, pp 127-137.
[24] A. G. Bruzzone, and E. Williams, Modeling and Simulation Methodologies for Logistics and Manufacturing Optimization, Simulation, Vol. 80, 2004, pp 119-174.
[25] E. Silver, F. D. Pike, R. Peterson, Inventory Management and Production Planning and Control, USA: John Wiley & Sons, 1998.
[26] O. Balci, Verification, validation and testing, in Handbook of Simulation, New York: Wiley Interscience, 1998.
Antonio Cimino took his degree in Management Engineering,
summa cum laude, in September 2007 from the University of
Calabria. He is currently a PhD student at the Mechanical
Department of the University of Calabria. He has published more
than 20 papers in international journals and conferences. His
research activities concern the integration of ergonomic standards,
work measurement techniques, artificial intelligence techniques and
Modeling & Simulation tools for effective workplace design.
Francesco Longo received his Ph.D. in Mechanical Engineering
from the University of Calabria in January 2006. He is currently an
Assistant Professor at the Mechanical Department of the University
of Calabria and Director of the Modelling & Simulation Center –
Laboratory of Enterprise Solutions (MSC-LES). He has published
more than 80 papers in international journals and conferences.
His research interests include Modeling & Simulation tools for
training procedures in complex environments, supply chain
management and security. He is Associate Editor of
"Simulation: Transactions of the Society for Modeling & Simulation
International". For the same journal he is Guest Editor of the
special issue on Advances of Modeling & Simulation in Supply
Chain and Industry. He is Guest Editor of the "International Journal
of Simulation and Process Modelling", special issue on Industry
and Supply Chain: Technical, Economic and Environmental
Sustainability. He is Editor in Chief of the SCS M&S Newsletter
and he works as a reviewer for several international journals.
Giovanni Mirabelli is currently an Assistant Professor at the
Mechanical Department of the University of Calabria. He has
published more than 60 papers in international journals and
conferences. His research interests include ergonomics, methods
and time measurement in manufacturing systems, production
systems maintenance and reliability, and quality.
Database Reverse Engineering based on Association Rule
Mining

Nattapon Pannurat1, Nittaya Kerdprasop2 and Kittisak Kerdprasop2
1 Faculty of Information Sciences, Nakhon Ratchasima College, 290 Moo 2, Mitraphap Road, Nakhon Ratchasima, 30000, Thailand
2 Data Engineering and Knowledge Discovery Research Unit, Suranaree University of Technology, 111 University Avenue, Nakhon Ratchasima, 30000, Thailand
Abstract
Maintaining a legacy database is a difficult task, especially when
system documentation is poorly written or even missing. Database
reverse engineering is an attempt to recover the high-level
conceptual design from the existing database instances. In this
paper, we propose a technique to discover the conceptual schema
using the association mining technique. The discovered schema
corresponds to normalization at the third normal form, which
is a common practice in many business organizations. Our
algorithm also includes a rule-filtering heuristic to solve the
problem of the exponential growth of discovered rules inherent in
the association mining technique.
Keywords: Legacy Databases, Reverse Engineering, Database
Design, Database Normalization, Association Mining.
1. Introduction
Legacy databases are obviously valuable assets to many
organizations. These databases were mostly developed
with technologies in the 1970s [14] using old
programming languages such as COBOL and RPG, and
file systems of the mini-computer platforms. Some
databases were even designed with outdated concepts
such as the hierarchical data model, which makes them
difficult to maintain and adjust to serve the current
needs of modern companies.
One solution to modernize the legacy databases is to
migrate and transform their structures and corresponding
contents to the new systems. This approach is, however,
hard to achieve if the design documents of the system
no longer exist, which is a common situation in most
enterprises. To solve the problems of recovering database

structures and migrating legacy databases, we propose the
database reverse engineering methodology.
The process of reverse engineering [7] originally aimed at
discovering design and production procedure from devices,
end products, or other hardware. This methodology was often
used during the Second World War to gain military advantage by
copying opponents' technologies. Reverse engineering of
software refers to the process of discovering source code
and system design from the available software [7].
In the database community, reverse engineering is an attempt
to extract the domain semantics such as keys, functional
dependencies and integrity constraints from the existing
database structures [6, 13]. Typically, database reverse
engineering is the process of extracting design
specifications from legacy systems and making the reverse
transformation from logical to conceptual schema [6, 15].
Our work deals with the reverse schema process by
taking a step further, from the logical schema down to the
level of database instances. We apply a machine
learning technique, association rule mining in particular,
to induce dependency relationships among data attributes.
The major problem of applying association mining to real-life
databases is that it always generates a tremendous
amount of association rules [11, 12]. We thus include a
rule-filtering component in our design to select only
promising association rules.
The structure of this paper is organized as follows. Section
2 presents the basic concept and the design framework of
our methodology. Section 3 explains the system
implementation by means of an example. Section 4
discusses related work. Finally, Section 5 concludes the
paper.
2. Database Reverse Engineering with
NoWARs
The objective of our system is to induce conceptual
schema from the database instances with the basic
assumption that database design documents are absent. We
apply the normalization principles and the association
mining technique to discover the missing database design.
Normalization [8] is the process of transforming an unstructured
relation into separate relations, called normalized relations.
The main purpose of this separation is to eliminate
redundant data and reduce data anomalies (insert, update,
and delete anomalies). There are many different levels of
normalization depending on the purpose of the database
designer. Most database applications are designed to be in
the third and the Boyce-Codd normal forms in which their
dependency relations [3] are sufficient for most
organizational requirements. Figure 1 [9] illustrates the
refinement steps from un-normalized relations to the
relations in fifth normal form.
The main condition to transform from one normal form to
the next level is the dependency relationship, which is a
constraint between two sets of attributes in a relation.
Experienced database designers are able to elicit this kind
of information. But in the reverse engineering process in
which the business process and operational requirements
are unknown, this task of dependency analysis is tough
even for the experienced ones. We thus propose to use the
machine learning technique called association mining.
Fig. 1 Normalization steps.
Association mining searches for interesting relationships
among a large set of data items [1, 2]. The discovery of
interesting association relationships among huge amounts
of business transaction records can help in many business
decision-making processes, such as catalog design, cross-marketing,
and loss-leader analysis [11]. An example of
association rule mining is market basket analysis. This
process analyzes customer buying habits by finding
associations between the different items that customers place in their
shopping baskets. For example, customers who buy milk
also tend to buy bread and drinking water at the same
time. This can be represented as an association rule as follows:
milk => bread, drinking water [support = 5%, confidence = 100%]
A support of 5% for this association rule means that 5% of all
the transactions under analysis show that milk, bread, and
drinking water are purchased together. A confidence of
100% means that 100% of the customers who purchased
milk also bought bread and drinking water.
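To make these two measures concrete, the following minimal Python sketch (ours, not part of the original NoWARs implementation; the transactions are hypothetical) computes the support and confidence of such a rule over a small list of market baskets.

def support(itemset, transactions):
    # fraction of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # support of the whole rule divided by the support of the antecedent
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# hypothetical market-basket data
transactions = [
    {"milk", "bread", "drinking water"},
    {"bread", "lamp"},
    {"milk", "bread", "drinking water", "tv"},
    {"printer"},
]
rule_support = support({"milk", "bread", "drinking water"}, transactions)
rule_confidence = confidence({"milk"}, {"bread", "drinking water"}, transactions)
print(rule_support, rule_confidence)   # 0.5 and 1.0 for this toy data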
Our methodology of database reverse engineering
consists of designing and improving the normalization
process with the association analysis technique. We use
the normalization concept and the association analysis
technique to create a new algorithm called NoWARs
(Normalization With Association Rules).
NoWARs is an algorithm that combines the normalization
process and the association mining technique. We
can find association rules by taking the dataset from a
database and feeding it into a data mining process. We use the
Apriori algorithm to find association rules. NoWARs has
two important steps: first, finding association rules, and
second, normalization with the rules obtained from the first
step. The details of the NoWARs algorithm are shown in
Figure 2 and its workflow is shown in Figure 3.
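As a rough illustration of how these two steps can fit together, the sketch below is ours, not the authors' code: it assumes, as stated later in the conclusions, that only 100%-confidence rules are used, reads each such rule as a dependency (determinant -> dependents), and projects an un-normalized table into separate relations. The helper names and sample rows are hypothetical.

def decompose(rows, columns, dependencies):
    # rows: list of dicts (one per tuple); dependencies: list of
    # (determinant, dependents) pairs derived from 100%-confidence rules
    relations = []
    remaining = set(columns)
    for determinant, dependents in dependencies:
        cols = list(determinant) + list(dependents)
        projected = {tuple(r[c] for c in cols) for r in rows}   # project and de-duplicate
        relations.append((cols, projected))
        remaining -= set(dependents)
    base_cols = [c for c in columns if c in remaining]          # keys plus remaining attributes
    base = {tuple(r[c] for c in base_cols) for r in rows}
    relations.append((base_cols, base))
    return relations

# hypothetical rows following part of the Register schema used later in the paper
rows = [
    {"STUDENT_CODE": "S1", "STUDENT_NAME": "Somsak", "SUBJECT_CODE": "C1",
     "SUBJECT_NAME": "Databases", "UNIT": 3},
    {"STUDENT_CODE": "S1", "STUDENT_NAME": "Somsak", "SUBJECT_CODE": "C2",
     "SUBJECT_NAME": "Data Mining", "UNIT": 3},
]
deps = [(("STUDENT_CODE",), ("STUDENT_NAME",)),
        (("SUBJECT_CODE",), ("SUBJECT_NAME", "UNIT"))]
for cols, tuples in decompose(rows, list(rows[0]), deps):
    print(cols, tuples)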
3. Implementation
The NoWARs algorithm starts when the user enters a query to
define the dataset to normalize. NoWARs then finds the
association rules by calling the Apriori algorithm and saves the
resulting association rules in the database. NoWARs then
selects some of the rules to use in the normalization
process. Finally, the selected rules are used to generate the 3NF
tables in relational schema form. The input of NoWARs is
an un-normalized table. An example of the input data format
is shown in Table 1.
Table 1: Example of input data.
INV   DATE       C_ID   P_ID   P_Name    QTY
001   9/1/2010   C01    P01    Printer   3
001   9/1/2010   C01    P02    Phone     5
002   9/1/2010   C03    P05    TV        6
002   9/1/2010   C03    P04    Lamp      2
...   ...        ...    ...    ...       ...
Fig. 2 NoWARs algorithm.
The un-normalized data shown in Table 1 will be
analyzed by the algorithm, and then its schema in 3NF is
generated. We perform experiments with five datasets,
as shown in Table 2. We use Oracle Database 10g XE,
tested on a Pentium IV 3.0 GHz machine with 512 MB of
RAM.
Table 2: Number of records and attributes in experimental datasets.
Dataset Name    Number of Records   Number of Attributes
Register        12438               157
Video_Rental    483478              523
Data_Org        91845               334
Invoice         119795              123
Car_Color       199337              312
We take the Register dataset as a running example. This
dataset is originally un-normalized and its structure is as
follows.
Register (STUDENT_CODE, STUDENT_NAME,
TEACHER_CODE, TEACHER_NAME,
UNIT, SUBJECT_CODE, SUBJECT_NAME)
Fig. 3 The workflow of the NoWARs algorithm.
After execution, its conceptual schema is recovered as
shown in Figure 4. The performance of the rule-filtering
component is also analyzed and shown in Figure 5.
Fig. 4 The result of running the NoWARs algorithm on the Register dataset.
Fig. 5 Performance of the rule-filtering component of the NoWARs algorithm in reducing the number of association rules (rules found vs. rules used for the Register, Video_Rental, Data_Org, Invoice and Car_Color datasets).
13
IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 2, No 3, March 2010
www.IJCSI.org
4. Related Work
Since the introduction of the famous association mining technique
known as the Apriori algorithm [1, 2], there have been
numerous attempts to integrate this technique to improve
database design, consistency checking, and querying. Han
et al. [10] improved the DBMiner system to work with
relational databases and data warehouses. DBMiner can do
many data mining tasks such as classification, prediction
and association. Sreenath, Bodagala, Alsabti, and Ranka
[16] adopted the Apriori algorithm to work with a relational
database system. They created the Fast UPdate algorithm to
search for association data when the system has new
transactions. Tsechansky, Pliskin, Rabinowitz and Porath
[17] applied Apriori to find association data from many
relations in the database. Berzal, Cubero, Marín and
Serrano [4] used Tree-Based Association Rule mining
(TBAR) to find association data in relational databases.
They kept large itemsets in a tree structure format to reduce
the time cost of the association process. Hipp, Güntzer and
Grimmer [12] implemented the Apriori algorithm in the C++
programming language to work on the DB2 database system.
They used the program to find association data in the
Daimler-Chrysler company database.
In parallel to the attempts of applying learning techniques
to existing large databases, researchers in the area of
database reverse engineering have proposed some means
of extracting conceptual schema. Lee and Yoo [14]
proposed a method to derive a conceptual model from
object-oriented databases. The derivation process is based
on forms including business forms and forms for database
interaction in the user interface. The final products of their
method are the object model and the scenario diagram
describing a sequence of operations. The work of Perez et
al. [15] emphasized relational to object-oriented
conceptual schema extraction. Their reverse engineering
technique is based on a formal method of term rewriting.
They use terms to represent relational and object-oriented
schemas. Term rewriting rules are then generated to
represent the correspondences between relational and
object-oriented elements. Output of the system is the
source code to migrate legacy database to the new system.
Recent work in database reverse engineering has not
concentrated on a broad objective of system migration.
Researchers rather focus their study on a particular issue
of semantic understanding. Lammari et al. [13] proposed a
reverse engineering method to discover inter-relational
constraints and inheritances embedded in a relational
database. Chen et al. [5] also based their study on the entity-relationship model. They proposed to apply association
rule mining to discover new concepts leading to a proper
design of relational database schema. They employed the
concept of fuzziness to deal with the uncertainty inherent
in the association mining process. Our work is also in
the line of applying the association mining technique to
database design. But our main purpose is the
understanding of legacy databases, and our method deals
with uncertainty by means of a heuristic in the rule-filtering
step.
5. Conclusions and Future Work
A forward engineering approach to the design of a
complete database starts from the high-level conceptual
design to capture the detailed requirements of the enterprise.
The common tool normally used to represent these
requirements is the entity-relationship (ER) diagram, and
the product of this design phase is a conceptual schema.
Typically, the schema at this level needs some adjustments
based on the procedure known as normalization in order to
reach a proper database design. Then, the database
implementation moves to the lower abstraction level of
logical design in which logical schema is constructed in a
form of relations, or database tables.
In legacy systems whose design documents are incomplete or
even missing, system maintenance or modification is a
difficult task due to the lack of knowledge regarding the high-level design of the system. To tackle this problem, a
database reverse engineering approach is essential.
In this paper, we propose a method to discover conceptual
schema from the database instances, or relations. The
discovering technique is based on the association mining
incorporated with some heuristic to produce a minimal set
of association rules. Transformation rules are then applied
to convert association rules to database dependencies.
Normalization is the principal concept of our heuristic and
transformation; it is aimed at removing repeating groups and
insert, delete, and update anomalies. We
introduce a novel algorithm, called NoWARs, to
normalize the database tables. In the normalization
process, NoWARs uses only 100% confidence association
rules with any support values. The results from the
NoWARs algorithm are the same as the design schema
obtained from the database designer. But NoWARs cannot
normalize a data model to a level higher than the third normal
form, which might be the desired level for a highly secure
database. We thus plan to improve our methodology to
discover a conceptual schema up to the level of fifth
normal form.
Acknowledgments
This research has been conducted at the Data Engineering
and Knowledge Discovery (DEKD) research unit, fully
supported by Suranaree University of Technology. The
work of the first and second authors has been funded by
grants from Suranaree University of Technology and the
National Research Council of Thailand (NRCT),
respectively. The third author has been supported by a
grant from the Thailand Research Fund (TRF, grant
number RMU5080026).
References
[1] R. Agrawal, T. Imielinski, and A. Swami, “Mining
association rules between set of items in large databases”, in
Proceedings of ACM SIGMOD International Conference on
Management of Data, 1993, pp. 207-216.
[2] R. Agrawal, and R. Srikant, “Fast algorithms for mining
association rules in large database”, in Proceedings of 20th
International Conference on Very Large Data Base, 1994, pp.
487-499.
[3] W. W. Armstrong, “Dependency structures of database
relationships”, Information Processing, Vol.74, 1974, pp.
580-583.
[4] F. Berzal, J. Cubero, N. Marin, and J. Serrano, “TBAR: An
efficient method for association rule mining in relational
databases”, Data & Knowledge Engineering, Vol.37, No.1,
2001, pp. 47-64.
[5] G. Chen, M. Ren, P. Yan, and X. Guo, “Enriching the ER
model based on discovered association rules”, Information
Sciences, Vol.177, 2007, pp. 1558-1556.
[6] R. Chiang, T. M. Barron, and V. C. Storey, “A framework for
the design and evaluation of reverse engineering methods for
relational databases”, Data & Knowledge Engineering,
Vol.21, 1997, pp. 57-77.
[7] E.J. Chikofsky, and J. H. Cross, “Reverse engineering and
design recovery: A taxonomy”, IEEE Software, Vol.7, No.1,
1990, pp. 13-17.
[8] E. F. Codd, “A relational model of data for large shared data
banks”, Communications of the ACM, Vol.13, No.6, 1970,
pp. 377-387.
[9] C. J. Date, and R. Fagin, “Simple conditions for guaranteeing
higher normal forms in relational databases”, ACM
Transactions on Database Systems, Vol.17, No.3, 1992, pp.
465-476.
[10] J. Han, et al., “DBMiner: System for data mining in
relational databases and data warehouses”, in Proceedings of
CASCON’97: Meeting of Minds, 1997, pp. 249-260.
[11] J. Han, and M. Kamber, Data Mining: Concepts and
Techniques, San Diego: Academic Press, 2001.
[12] J. Hipp, U. Guntzer, and U. Grimmer, “Integrating
association rule mining algorithms with relational database
systems”, in Proceedings of 3rd International Conference on
Enterprise Information Systems, 2001, pp. 130-137.
[13] N. Lammari, I. Comyn-Wattiau, and J. Akoka, “Extracting
generalization hierarchies from relational databases: A
reverse engineering approach”, Data & Knowledge
Engineering, Vol.63, 2007, pp. 568-589.
[14] H. Lee, and C. Yoo, “A form driven object-oriented reverse
engineering methodology”, Information Systems, Vol.25,
No.3, 2000, pp. 235-259.
[15] J. Perez, I. Ramos, V. Anaya, J. M. Cubel, F. Dominguez, A.
Boronat, and J. A. Carsi, “Data reverse engineering for
legacy databases to object oriented conceptual schemas”,
Electronic Notes in Theoretical Computer Science, Vol.72,
No.4, 2003, pp. 7-19.
[16] S. T. Sreenath, S. Bodogala, K. Alsabti, and S. Ranka, “An
efficient algorithm for the incremental updation of
association rules in large databases”, in Proceedings of 3rd
International Conference on KDD and Data Mining, 1997, pp.
263-266.
[17] S. Tsechansky, N. Pliskin, G. Rabinowitz, and A. Porath,
“Mining relational patterns from multiple relational tables”,
Decision Support Systems, Vol.27, No.1-2, 1999, pp. 179-195.
Natthapon Pannurat received his bachelor and master degrees in
computer engineering in 2007 and 2009, respectively, from
Suranaree University of Technology. He is currently a faculty
member of Information Sciences, Nakhon Ratchasima College. His
research interests are database management systems, data
mining and machine learning.
Nittaya Kerdprasop is an associate professor at the school of
computer engineering, Suranaree University of Technology,
Thailand. She received her B.S. from Mahidol University, Thailand,
in 1985, M.S. in computer science from the Prince of Songkla
University, Thailand, in 1991 and Ph.D. in computer science from
Nova Southeastern University, USA, in 1999. She is a member of
ACM and the IEEE Computer Society. Her research interests
include Knowledge Discovery in Databases, Artificial Intelligence,
Logic and Constraint Programming, Deductive and Active
Databases.
Kittisak Kerdprasop is an associate professor and the director of
DEKD research unit at the school of computer engineering,
Suranaree University of Technology, Thailand. He received his
bachelor degree in Mathematics from Srinakarinwirot University,
Thailand, in 1986, master degree in computer science from the
Prince of Songkla University, Thailand, in 1991 and doctoral
degree in computer science from Nova Southeastern University,
USA, in 1999. His current research includes Data mining, Machine
Learning, Artificial Intelligence, Logic and Functional Programming,
Probabilistic Databases and Knowledge Bases.
A New Approach to Keyphrase Extraction Using Neural
Networks
Kamal Sarkar, Mita Nasipuri and Suranjan Ghose
Computer Science and Engineering Department, Jadavpur University,
Kolkata-700 032, India
Abstract
Keyphrases provide a simple way of describing a document,
giving the reader some clues about its contents. Keyphrases can
be useful in various applications such as retrieval engines,
browsing interfaces, thesaurus construction, text mining, etc.
There are also other tasks for which keyphrases are useful, as we
discuss in this paper. This paper describes a neural network based
approach to keyphrase extraction from scientific articles. Our
results show that the proposed method performs better than some
state-of-the-art keyphrase extraction approaches.
Keywords: Keyphrase Extraction, Neural Networks, Text
Mining
1. Introduction
The pervasion of huge amounts of information through the
World Wide Web (WWW) has created a growing need for
the development of techniques for discovering, accessing,
and sharing knowledge. Keyphrases help readers
rapidly understand, organize, access, and share the
information of a document. Keyphrases are phrases
consisting of one or more significant words. Keyphrases
can be incorporated in the search results as subject
metadata to facilitate information search on the web [1]. A
list of keyphrases associated with a document may serve as
indicative summary or document metadata, which helps
readers in searching relevant information.
Keyphrases are meant to serve various goals. For example,
(1) when they are printed on the first page of a journal
document, the goal is summarization. They enable the
reader to quickly determine whether the given article is
worth in-depth reading. (2) When they are added to the
cumulative index for a journal, the goal is indexing. They
enable the reader to quickly find an article relevant to a
specific need. (3) When a search engine form contains a
field labeled keywords, the goal is to enable the reader to
make the search more precise. A search for documents that
match a given query term in the keyword field will yield a
smaller, higher quality list of hits than a search for the
same term in the full text of the documents.
When searching is done on limited-display-area devices
such as mobile phones, PDAs, etc., a concise summary in
the form of keyphrases provides a new way of
displaying search results in the smaller display area [2][3].
Although the research articles published in journals
generally come with several author-assigned keyphrases,
many documents such as news articles, review articles,
etc. may not have author-assigned keyphrases at all, or the
number of author-assigned keyphrases available with the
documents is too limited to represent the topical
content of the articles. So, an automatic
keyphrase extraction process is highly desirable.
Manual selection of keyphrases from a document by a
human is not a random act; keyphrase extraction is a task
related to human cognition. Hence, automatic
keyphrase extraction is not a trivial task, and it needs to
be automated due to its usefulness in managing information
overload on the web.
Some previous works on automatic keyphrase extraction
used machine learning techniques such as Naïve
Bayes, decision trees, genetic algorithms [15][16], etc.
Wang et al. (2006) proposed in [14] a neural network
based approach to keyphrase extraction, where keyphrase
extraction has been viewed as a crisp binary classification
task. They train a neural network to classify whether a
phrase is a keyphrase or not. This model is not suitable when
the number of phrases classified by the classifier as
positive is less than the desired number of keyphrases, K.
To overcome this problem, we think that keyphrase
extraction is a ranking problem rather than a classification
problem. One good solution to this problem is to train a
neural network to rank the candidate phrases. Designing
such a neural network requires the keyphrases in the
training data to be ranked manually. Sometimes, this is not
feasible.
In this paper, we present a keyphrase extraction method
that uses a multilayer perceptron neural network which is
trained to output the probability estimate of a class:
positive (keyphrase) or negative (not a keyphrase).
Candidate phrases which are classified as positive are
ranked first based on their class probabilities. If the
number of desired keyphrases is greater than the number
of phrases classified as positive by the classifier, the
candidate phrases classified as negative by the classifier
are considered and sorted in increasing order of
their class probabilities; that is, the candidate phrase
classified as negative with the minimum probability estimate
is added first to the list of previously selected keyphrases.
This process continues until the number of extracted
keyphrases exceeds K, the desired
number of keyphrases.
Our work also differs from the work proposed by Wang
et al. (2006) [14] in the number and the types of features
used. While they use the traditional TF*IDF and position
features to identify the keyphrases, we use three extra
features: phrase length, word length in a phrase, and
links of a phrase to other phrases. We also use the position
of a phrase in a document as a continuous feature rather
than a binary feature.
The paper is organized as follows. In section 2 we present
the related work. Some background knowledge about
artificial neural network has been discussed in section 3. In
section 4, the proposed keyphrase extraction method has
been discussed. We present the evaluation and the
experimental results in section 5.
2. Related Work
A number of previous works have suggested that document
keyphrases can be useful in various applications such as
retrieval engines [1], [4], browsing interfaces [5],
thesaurus construction [6], and document classification
and clustering [7].
Some supervised and unsupervised keyphrase extraction
methods have already been reported by the researchers. An
algorithm to choose noun phrases from a document as
keyphrases has been proposed in [8]. Phrase length, its
frequency and the frequency of its head noun are the
features used in this work. Noun phrases are extracted
from a text using a base noun phrase skimmer and an off-the-shelf online dictionary.
Chien [9] developed a PAT-tree-based keyphrase
extraction system for Chinese and other oriental
languages.
HaCohen-Kerner et al [10][11] proposed a model for
keyphrase extraction based on supervised machine
learning and combinations of the baseline methods. They
applied J48, an improved variant of C4.5 decision tree for
feature combination.
Hulth et al [12] proposed a keyphrase extraction algorithm
in which a hierarchically organized thesaurus and the
frequency analysis were integrated. The inductive logic
programming has been used to combine evidences from
frequency analysis and thesaurus.
A graph based model for keyphrase extraction has been
presented in [13]. A document is represented as a graph in
which the nodes represent terms, and the edges represent
the co-occurrence of terms. Whether a term is a keyword is
determined by measuring its contribution to the graph.
A Neural Network based approach to keyphrase extraction
has been presented in [14] that exploits traditional term
frequency, inverse document frequency and position
(binary) features. The neural network has been trained to
classify a candidate phrase as keyphrase or not.
Turney [15] treats the problem of keyphrase extraction as
a supervised learning task. In this task, nine features are
used to score a candidate phrase; some of the features are
positional information of the phrase in the document and
whether or not the phrase is a proper noun. Keyphrases are
extracted from candidate phrases based on examination of
their features. Turney’s program is called Extractor. One
form of this extractor is called GenEx, which is designed
based on a set of parameterized heuristic rules that are
fine-tuned using a genetic algorithm. Turney compares
GenEx to a standard machine learning technique called
bagging, which uses a bag of decision trees for keyphrase
extraction, and shows that GenEx performs better than the
bagging procedure.
A keyphrase extraction program called Kea, developed by
Frank et al. [16][17], uses the Bayesian learning technique
for keyphrase extraction task. A model is learned from the
training documents with exemplar keyphrases and
corresponds to a specific corpus containing the training
documents. Each model consists of a Naive Bayes
classifier and two supporting files containing phrase
frequencies and stopped words. The learned model is used
to identify the keyphrases from a document. In both Kea
and Extractor, the candidate keyphrases are identified by
splitting up the input text according to phrase boundaries
(numbers, punctuation marks, dashes, and brackets etc.).
Finally a phrase is defined as a sequence of one, two, or
three words that appear consecutively in a text. The
phrases beginning or ending with a stopped word are not
taken into consideration. Kea and Extractor both used
supervised machine learning based approaches. Two
important features, the distance of the phrase's first
appearance in the document and TF*IDF (used in an
information retrieval setting), are considered during the
development of Kea. Here TF corresponds to the
frequency of a phrase in a document, and IDF is
estimated by counting the number of documents in the
training corpus that contain a phrase P. Frank et al.
[16][17] have shown that the performance of Kea is
comparable to that of GenEx proposed by Turney.
An n-gram based technique for filtering keyphrases has
been presented in [18]. In this approach, the authors compute
n-grams such as unigrams, bigrams, etc. for extracting the
candidate keyphrases, which are finally ranked based on
features such as term frequency and the position of a phrase in
a document and in a sentence.
3. Background
In this section, we briefly describe some basics of artificial
neural networks and how to estimate class probability in an
artificial neural network. The estimation of class
probabilities is important for our work because we use the
estimated class probabilities as the confidence scores
which are used in re-ranking the phrases belonging to a
class: positive or negative.
Artificial Neural Networks (ANN) are predictive models
loosely motivated by biological neural systems. In a
generic sense, the terms “Neural Network” (NN) and
“Artificial Neural Network” (ANN) usually refer to a
Multilayer Perceptron (MLP) network, which is the most
widely used type of neural network. A multilayer
perceptron (MLP) is capable of expressing a rich variety of
nonlinear decision surfaces. An example of such a network
is shown in Figure 1. A multilayer perceptron neural
network usually has three layers: one input layer, one
hidden layer and one output layer. A vector of predictor
variable values (x1...xi) is presented to the input layer. In
the keyphrase extraction task, this input vector is the
feature vector, which is a vector of values of features
characterizing the candidate phrases. Before presenting a
vector to the input layer, it is normalized. The input layer
distributes the values to each of the neurons in the hidden
layer. In addition to the predictor variables, there is a
constant input of 1.0, called the bias, that is fed to each of
the hidden layers. The bias is multiplied by a weight and
added to the sum going into the neuron. The value from
each input neuron is multiplied by a weight (wij) and
arrives at a neuron in the hidden layer, and the resulting
weighted values are added together producing a combined
value at a hidden node. The weighted sum is then fed into
a transfer function (usually a sigmoid function), which
outputs a value. The outputs from the hidden layer are
distributed to the output layer. Arriving at a node (a
neuron) in the output layer, the value from each hidden
layer neuron is again multiplied by a weight (wjk), and the
resulting weighted values are added together producing a
combined value at an output node. The weighted sum is
fed into a transfer function (usually a sigmoid function),
which outputs a value Ok. The Ok values are the outputs of
the network.
One hidden layer is sufficient for nearly all problems. In
some special situations, such as modeling data which
contains saw-tooth-wave-like discontinuities, two hidden
layers may be required. There is no theoretical reason for
using more than two hidden layers.
Fig. 1 A multilayer feed-forward neural network: a training sample, X =
(x1, x2, ..., xi), is fed to the input layer; weighted connections exist
between each layer, where wij denotes the weight from a unit j in one
layer to a unit i in the previous layer. (The figure shows the input layer
x1...xi, the hidden layer Hj and the output layer Ok, with weights wij and wjk.)
The backpropagation algorithm performs learning on a
multilayer feed-forward neural network. The
backpropagation training algorithm was the first practical
method for training multilayer perceptron (MLP) neural
networks.
implements a gradient descent search through the space of
possible network weights, iteratively reducing the error
between the training example target values and network
outputs. BP allows supervised mapping of input vectors
and corresponding target vectors. The backpropagation
training algorithm repeats the following cycle to refine the
weight values:
(1) randomly choose a tentative set of weights (initial
weight configuration) and run a set of predictor variable
values through the network, (2) compute the difference
between the predicted target value and the training
example target value, (3) average the error information
over the entire set of training instances, (4) propagate the
error backward through the network and compute the
gradient (vector of derivatives) of the change in error with
respect to changes in weight values, (5) make adjustments
to the weights to reduce the error. Each cycle is called an
epoch.
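As a concrete and deliberately simplified illustration of this cycle, the short NumPy sketch below is ours (the actual training in this paper relies on WEKA, as described later); it performs one batch gradient-descent epoch for a network with a single hidden layer, sigmoid units and a squared-error criterion.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_epoch(X, t, W1, W2, lr=0.3):
    # One epoch of batch backpropagation for a 1-hidden-layer MLP.
    # X: (m, n) inputs, t: (m, 1) targets, W1: (n+1, h), W2: (h+1, 1).
    m = X.shape[0]
    Xb = np.hstack([X, np.ones((m, 1))])           # add the constant bias input of 1.0
    H = sigmoid(Xb @ W1)                            # hidden-layer outputs
    Hb = np.hstack([H, np.ones((m, 1))])
    O = sigmoid(Hb @ W2)                            # network outputs
    delta_o = (O - t) * O * (1.0 - O)               # output error through the sigmoid
    delta_h = (delta_o @ W2[:-1].T) * H * (1.0 - H) # propagate the error backward
    W2 -= lr * (Hb.T @ delta_o) / m                 # adjust weights with averaged gradients
    W1 -= lr * (Xb.T @ delta_h) / m
    return W1, W2, float(np.mean((O - t) ** 2))     # remaining mean squared error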
One of the most important issues in designing a perceptron
network is the number of neurons to be used in the hidden
layer(s). If an inadequate number of neurons are used, the
network will be unable to model complex data, and the
resulting network will fit the training data poorly. If too
many neurons are used, the training time may be
excessively long, and the network may overfit the data.
When overfitting occurs, the network begins to model
random noise in the data. As a result, the model fits the
training data extremely well, but it performs poorly on
new, unseen data. Cross-validation can be used to test for
this. The number of neurons in the hidden layers may be
optimized by building models with varying numbers of
neurons and measuring the quality using the cross-validation
method.
3.1 Computing Class probability
Given the training data, a standard statistical technique
such as Parzen windows [22] can be used to estimate the
probability density in the output space. After calculating
the output vector O for an unknown input, one can
compute the estimated probability that it belongs to each
class using the following formula:
Pco(c | O) = p(c | O) / Σc' p(c' | O), for class c,
where p(c|O) is the density of points of category c at location
O in the scatter plot of category 1 vs. category 0 in a
two-class problem [23].
We use the estimated class probabilities as the confidence
scores to order phrases belonging to a class: positive or
negative.
4. Proposed Keyphrase Extraction Method
The proposed keyphrase extraction method consists of
three primary components: document preprocessing,
candidate phrase identification and keyphrase extraction
using a neural network.
4.1 Document Preprocessing
The preprocessing task includes formatting each document.
If a source document is in pdf format, it is converted to a
text format before submission to the keyphrase extractor.
4.2 Candidate Phrase Identification
Candidate phrase identification is an important step in
the keyphrase extraction task. We treat all the noun phrases in
a document as the candidate phrases [1]. The following
sub-section discusses how to identify noun phrases.
Noun Phrase Identification
To identify the noun phrases, documents should be tagged.
The articles are passed to a POS tagger called
MontyTagger [25] to extract the lexical information about
the terms. Figure 2 shows a sample output of the Monty
tagger for the following text segment:
“European nations will either be the sites of religious
conflict and violence that sets Muslim minorities against
secular states and Muslim communities against Christian
neighbors, or it could become the birthplace of a
liberalized and modernized Islam that could in turn
transform the religion worldwide.”
European/JJ nations/NNS will/MD either/DT be/VB
the/DT sites/NNS of/IN religious/JJ conflict/NN
and/CC violence/NN that/IN sets/NNS Muslim/NNP
minorities/NNS against/IN secular/JJ states/NNS
and/CC Muslim/NNP communities/NNS against/IN
Christian/NNP neighbors/NNS,/, or/CC it/PRP
could/MD become/VB the/DT birthplace/NN of/IN
a/DT liberalized/VBN and/CC modernized/VBN
Islam/NNP that/WDT could/MD in/IN turn/NN
transform/VB the/DT religion/NN worldwide/JJ ./.
Fig.2 A sample output of the tagger
In figure 2, NN,NNS,NNP,JJ,DT,VB,IN,PRP,WDT,MD
etc. are lexical tags assigned by the tagger.
Fig. 3 DFA for noun phrase identification (states: Start, Article, Adjective, Noun).
The meanings of the tags are as follows:
NN and NNS for nouns (singular and plural respectively),
NNP for proper nouns, JJ for adjectives, DT for
determiner, VB for a verb, IN for a preposition, PRP for a
pronoun. This is not the complete tag set.
The above mentioned tags are some examples of tags in
the Penn Treebank tag set used by the MontyTagger.
The noun phrases are identified from the tagged sentences
using the DFA (deterministic finite automaton) shown in
Figure 3. In this DFA, the states for adjective and noun
represent all variations of adjectives and nouns.
Figure 4 shows the noun phrases identified by our noun
phrase identification component when the tagged sentences
shown in Figure 2 become its input. As shown in Figure
4, the 10th phrase is “Islam”, but manual inspection of the
source text may suggest that it should be “Modernized
Islam”. This discrepancy occurs because the tagger assigns the
tag “VBN” to the word “Modernized”, and “VBN”
indicates the participle form of a verb, which is not accepted by
our DFA in Figure 3 as part of a noun phrase. To avoid
this problem, “VBN” might be considered as a state in the
DFA, but it might lead to recognizing some verb phrases
mistakenly as noun phrases.
Document number   Sentence number   Noun phrase number   Noun phrase
100               4                 1                    European nations
100               4                 2                    sites
100               4                 3                    religious conflict
100               4                 4                    violence
100               4                 5                    sets muslim minorities
100               4                 6                    secular states
100               4                 7                    muslim communities
100               4                 8                    christian neighbors
100               4                 9                    birthplace
100               4                 10                   Islam
100               4                 11                   turn
100               4                 12                   religion
Fig. 4 Output of noun phrase extractor for a sample input
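The pattern accepted by the DFA can be approximated in a few lines of Python. The sketch below is our illustration, not the authors' component; it simply collects maximal runs of adjectives followed by one or more nouns from (word, tag) pairs such as those in Figure 2.

def noun_phrases(tagged):
    # tagged: list of (word, tag) pairs, e.g. [("European", "JJ"), ("nations", "NNS")]
    phrases, current, has_noun = [], [], False
    for word, tag in tagged:
        if tag == "JJ" and not has_noun:
            current.append(word)                  # adjectives before the head noun(s)
        elif tag in ("NN", "NNS", "NNP"):
            current.append(word)                  # nouns: singular, plural, proper
            has_noun = True
        else:
            if has_noun:                          # any other tag closes the phrase
                phrases.append(" ".join(current))
            current, has_noun = [], False
    if has_noun:
        phrases.append(" ".join(current))
    return phrases

tagged = [("European", "JJ"), ("nations", "NNS"), ("will", "MD"),
          ("religious", "JJ"), ("conflict", "NN")]
print(noun_phrases(tagged))   # ['European nations', 'religious conflict']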
4.3 Features, Weighting and Normalization
After identifying the document phrases, a document is
reduced to a collection of noun phrases. Since, in our
work, we focus on the keyphrase extraction task from
scientific articles which are generally very long in size (6
to more than 20 pages), the collection of noun phrases
identified in an article may be huge in number. Among
this huge collection, a small number of phrases (5 to 15
phrases) may be selected as the keyphrases. Whether a
candidate phrase is a keyphrase or not can be decided by a
classifier based on a set of features characterizing a phrase.
Discovering good features for a classification task is very
much an art. The different features characterizing
candidate noun phrases, feature weighting and
normalization methods are discussed below.
Phrase frequency, phrase links to other phrases and
Inverse Document Frequency
If a noun phrase occurs more frequently in a
document, the phrase is assumed to be more important in the
document. The number of times a phrase occurs independently
in a document in its entirety is considered the
phrase frequency (PF). A noun phrase may appear in a text
either independently or as a part of other noun phrases.
These two types of appearances of noun phrases should be
distinguished. If a noun phrase P1 appears in full as a part
of another noun phrase P2 (that is, P1 is contained in P2),
it is considered that P1 has a link to P2. The number of times a
noun phrase (NP) has links to other phrases is counted and
considered the phrase link count (PLC). The two features,
phrase frequency (PF) and phrase link count (PLC), are
combined into a single feature value using the
following measure:
Ffreq = (1/2) * PF * PF + PLC
In the above formula, the frequency of a noun phrase (PF) is
squared only to give it more importance than the phrase
link count (PLC). The value 1/2 has been used to moderate
the value. We explain this formula below with an
example:
Assume a phrase P1 whose PF value is 10, PLC value is
20 and PF+PLC = 30. For another phrase P2 whose PF
value is 20, PLC value is 10 and PF+PLC = 30. So, for
these two cases, simple addition of PF and PLC does not
make any difference in assigning weights to the noun
phrases although the independent occurrence of noun
phrase P2 is more than that of the noun phrase P1. But the
independent existence of a phrase should get higher
importance while deciding whether a phrase is keyphrase
worthy or not. In a more general case, consider that a
single word noun phrase NP1 occurs only once in
independent existence and occurs (n+1) times as a part of
other noun phrases and NP2 is another phrase, which
IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 2, No 3, March 2010
www.IJCSI.org
occurs n times independently and occurs only once as a
part of other phrases. In this situation, simple addition of
PF and PLC will favor the first phrase, but our formula
will give higher score to the second phrase because it
occurs more independently than the first one.
Inverse document frequency (IDF) is a useful measure to
determine the commonness of a term in a corpus. IDF
value is computed using the formula: log(N/df), where N=
total number of documents in a corpus and df (document
frequency) means the number of documents in which a
term occurs. A term with a lower df value means the term
is less frequent in the corpus and hence idf value becomes
higher. So, if idf value of a term is higher, the term is
relatively rare in the corpus. In this way, idf value is a
measure for determining the rarity of a term in a corpus.
Traditionally, TF (term frequency) value of a term is
multiplied by IDF to compute the importance of a term,
where TF indicates frequency of a term in a document.
TF*IDF measure favors a relatively rare term which is
more frequent in a document. We combine Ffreq and IDF in
the following way to have a variant of Edmundsonian
thematic feature [24]:
Fthematic = Ffreq * IDF
The value of this feature is normalized by dividing the
value by the maximum Fthematic score in the collection of
Fthematic scores obtained by the phrases corresponding to a
document.
Phrase Position
If a phrase occurs in the title or abstract of a document, it
should be given more score. So, we consider the position of
the first occurrence of a phrase in a document as a feature.
Unlike the previous approaches [14] [16] that assume the
position of a phrase as a binary feature, in our work, the
score of a phrase that occurs first in sentence i is
computed using the following formula:
Fpos = 1/i, if i <= n,
where n is the position of the last sentence in the abstract
of a document. For i > n, Fpos is set to 0.
Phrase Length and Word Length
These two features can be considered as the structural
features of a phrase. Phrase length becomes an important
feature in the keyphrase extraction task because the length of
keyphrases usually varies from 1 word to 3 words. We find
that keyphrases consisting of 4 or more words are relatively
rare in our corpus.
The length of the words in a phrase can also be considered as a
feature. According to Zipf's Law [21], shorter words
occur more frequently than larger ones. For example,
articles occur more frequently in a text. So, the word
length can be an indication of the rarity of a word. We
consider the length of the longest word in a phrase as a
feature.
If the length of a phrase is PL and the length of the longest
word in the phrase is WL, these two feature values are
combined into a single feature value using the
following formula:
FPL*WL = log(1 + PL) * log(1 + WL)
The value of this feature is normalized by dividing the
value by the maximum value of the feature in the
collection of phrases corresponding to a document.
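Putting the feature definitions above together, a per-phrase computation could look like the following sketch. It is ours, not the authors' program; the Ffreq combination is written exactly as reconstructed above, and the per-document normalization is applied afterwards over all candidate phrases of a document.

from math import log

def phrase_features(pf, plc, idf, first_sentence, n_abstract_sentences, pl, wl):
    # pf: phrase frequency, plc: phrase link count, idf: log(N/df),
    # first_sentence: 1-based index of the sentence where the phrase first occurs,
    # n_abstract_sentences: position of the last sentence of the abstract,
    # pl: phrase length in words, wl: length of the longest word in the phrase
    f_freq = 0.5 * pf * pf + plc
    f_thematic = f_freq * idf
    f_pos = 1.0 / first_sentence if first_sentence <= n_abstract_sentences else 0.0
    f_plwl = log(1 + pl) * log(1 + wl)
    return f_thematic, f_pos, f_plwl

def normalize(values):
    # divide each value by the per-document maximum of that feature
    m = max(values)
    return [v / m if m else 0.0 for v in values]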
4.4 Keyphrase Extraction Using Multilayer Perceptron
Neural Network
Training a Multilayer Perceptron (MLP) Neural Network
for keyphrase extraction requires document noun phrases
to be represented as the feature vectors. For this purpose,
we write a computer program for automatically extracting
values for the features characterizing the noun phrases in
the documents. Author-assigned keyphrases are removed
from each original document and stored in separate
files with a document identification number. For each
noun phrase NP in each document d in our dataset, we
extract the values of the features of the NP from d using
the measures discussed in subsection 4.3. If the noun
phrase NP is found in the list of author assigned
keyphrases associated with the document d, we label the
noun phrase as a “Positive” example and if it is not found
we label the phrase as a “negative” example. Thus the
feature vector for each noun phrase looks like {<a1 a2 a3
….. an>, <label>} which becomes a training instance
(example) for a Multilayer Perceptron Neural Network,
where a1, a2 . . .an, indicate feature values for a noun
phrase. A training set consisting of a set of instances of the
above form is built up by running a computer program on
a set of documents selected from our corpus.
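A minimal sketch of how such labeled training instances could be assembled is shown below. It is ours, not the authors' program, and matching a candidate phrase against the author-assigned list is simplified here to case-insensitive string equality.

def build_training_instances(doc_phrases, author_keyphrases, feature_fn):
    # doc_phrases: candidate noun phrases of one document
    # author_keyphrases: keyphrases removed from that document beforehand
    # feature_fn: maps a phrase to its feature vector (a1, ..., an)
    gold = {k.lower() for k in author_keyphrases}
    instances = []
    for phrase in doc_phrases:
        label = "positive" if phrase.lower() in gold else "negative"
        instances.append((feature_fn(phrase), label))   # {<a1 ... an>, <label>}
    return instances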
After preparation of the training dataset, a Multilayer
Perceptron Neural Network is trained on the training set to
classify the noun phrases as one of two categories:
“Positive” or “Negative”. Positive category indicates that a
noun phrase is a keyphrase and the negative category
indicates that it is not a keyphrase.
Input:
A file containing the noun phrases of a test document
with their classifications (positive or negative) and the
probability estimates of the classes to which the phrases
belong.
Begin:
i. Select the noun phrases, which have been classified as
positive by the classifier and reorder these selected noun
phrases in decreasing order of their probability estimates of
being in class 1 (positive). Save the selected phrases into an
output file and delete them from the input file.
ii. For the rest of the noun phrases in the input file,
which are classified by the classifier as “Negative”, we
order the phrases in increasing order of their probability
estimates of being in the class 0 (negative). In effect, the
phrase for which the probability estimate of being in class 0
is minimum comes at the top. Append the ordered phrases to
the output file.
iii. Save the output file.
end
Fig. 5 Noun Phrase Ranking Based on Classifier's Decisions
For our experiment, we use the Weka
(www.cs.waikato.ac.nz/ml/weka) machine learning tools.
We use Weka's Simple CLI utility, which provides a
simple command-line interface that allows direct execution
of WEKA commands. The training data is stored in the
.ARFF format, which is an important requirement for WEKA.
The multilayer perceptron is included under the panel
Classifier/functions of the WEKA workbench. The description
of how to use the MLP in keyphrase extraction has been
discussed in section 3. For our work, the classifier MLP
of the WEKA suite has been trained with the following
values of its parameters:
Number of layers: 3 (one input layer, one hidden layer and one output layer)
Number of hidden nodes: (number of attributes + number of classes)/2
Learning rate: 0.3
Momentum: 0.2
Training iterations: 500
Validation threshold: 20
WEKA uses the backpropagation algorithm for training the
multilayer perceptron neural network.
The trained neural network is applied on a test document
whose noun phrases are also represented in the form of
feature vectors using the same method applied to the
training documents. During testing, we use the -p option (soft
threshold option). With this option, we can generate a
probability estimate for the class of each vector. This is
required when the number of noun phrases classified as
positive by the classifier is less than the desired number of
keyphrases. It is possible to save the output in a file
using the indirection sign (>) and a file name. We save the
output produced by the classifier for each test document in
a separate file. Then we rank the phrases using the
algorithm shown in Figure 5 for keyphrase extraction.
After ranking the noun phrases, the K top-ranked noun
phrases are selected as keyphrases for each input test
document.
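The ranking procedure of Figure 5 then amounts to the following sketch (ours; it assumes each test phrase comes with its predicted class and the classifier's probability estimate of the positive class).

def rank_phrases(predictions, k):
    # predictions: list of (phrase, predicted_class, p_positive) triples
    positives = [p for p in predictions if p[1] == "positive"]
    negatives = [p for p in predictions if p[1] == "negative"]
    positives.sort(key=lambda p: p[2], reverse=True)   # most confident positives first
    negatives.sort(key=lambda p: 1.0 - p[2])           # smallest p(negative) first
    ranked = positives + negatives
    return [phrase for phrase, _, _ in ranked][:k]     # K top-ranked phrases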
5. Evaluation and Experimental Results
There are two usual practices for evaluating the
effectiveness of a keyphrase extraction system. One
method is to use human judgment, asking human experts
to give scores to the keyphrases generated by a system.
Another method, less costly, is to measure how well the
system-generated keyphrases match the author-assigned
keyphrases. It is a common practice to use the second
approach in evaluating a keyphrase extraction system
[7][8] [11][19]. We also prefer the second approach to
evaluate our keyphrase extraction system by computing its
precision and recall using the author-provided keyphrases
for the documents in our corpus. For our experiments,
precision is defined as the proportion of the extracted
keyphrases that match the keyphrases assigned by a
document’s author(s). Recall is defined as the proportion
of the keyphrases assigned by a document’s author(s) that
are extracted by the keyphrase extraction system.
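In code, these two measures reduce to the following sketch (ours; exact, case-insensitive matching is an assumption made here for illustration, since the matching criterion used in the evaluation may be more tolerant).

def precision_recall(extracted, author_assigned):
    extracted = {p.lower() for p in extracted}
    gold = {p.lower() for p in author_assigned}
    matched = len(extracted & gold)
    precision = matched / len(extracted) if extracted else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall

# e.g. precision_recall(["adult immunization", "barriers", "consumers"],
#                       ["adult immunization", "barriers", "consumer", "provider survey"])
# returns (0.666..., 0.5) under this strict matching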
5.1 Experimental Dataset
The data collection used for our experiments consists of
150 full journal articles whose size ranges from 6 pages to
30 pages. Full journal articles are downloaded from the
websites of the journals in three domains: Economics,
Legal (Law) and Medical.
Articles on Economics are collected from the various
issues of the journals such as Journal of Economics
(Springer), Journal of Public Economics (Elsevier),
Economics Letters, Journal of Policy Modeling. All these
articles are available in PDF format.
Articles on Law and legal cases have been downloaded
from the various issues of the law journals such as
Computer Law and Security Review (Elsevier),
International Review of Law and Economics (Elsevier),
European Journal of Law and Economics (Springer),
Computer Law and Security Report (Elsevier), AGORA
International Journal of Juridical Sciences(Open access).
Medical articles are downloaded from the various issues of
medical journals such as the Indian Journal of Medicine,
Indian Journal of Pediatrics, Journal of Psychology and
Counseling, African Journal of Traditional, Complementary
and Alternative Medicines, Indian Journal of Surgery,
Journal of General Internal Medicine, The American Journal of
Medicine, International Journal of Cardiology, and Journal of
Anxiety Disorders. The number of articles under each category
used in our experiments is shown in Table 1.
Table 1: Source documents used in our experiments
Source Document Type   Number of Documents
Economics              60
Law                    40
Medical                50
For the system evaluation, the set of journal articles is
divided into multiple folds, where each fold consists of one
training set of 100 documents and a test set of 50
documents. The training set and the test set are
independent of each other. The set of author-assigned
keyphrases available with the articles are manually
removed before candidate terms are extracted. For all
experiments discussed in this paper, the same splits of our
dataset into a training set and a test set are used. Some
dataset in to a training set and a test set are used. Some
useful statistics about our corpus are given below.
Total number of noun phrases in our corpus is 144978.
The average number of author-provided keyphrases for all
the documents in our corpus is 4.90.
The average number of keyphrases that appear in all the
source documents in our corpus is 4.34. Here it is
interesting to note that all the author assigned keyphrases
for a document may not occur in the document itself.
The average number of keyphrases that appear in the list
of candidate phrases extracted from all the documents in
our corpus is 3.50. These statistics interestingly show that
some keyphrase worthy phrases may be missed at the stage
of the candidate phrase extraction. The main problems
related to designing a robust candidate phrase extraction
algorithm are: (1) an irregular structure of a keyphrase,
that is, it may contain only a single word or a multiword
Experiment 2
This experiment compares the proposed system with an existing one. Kea [17] is a publicly available keyphrase extraction system that uses a limited number of features, such as positional information and TF*IDF, and employs the Naïve Bayes learning algorithm for keyphrase extraction. We downloaded version 5.0 of Kea (http://www.nzdl.org/Kea/) and installed it on our machine. A separate model is built for each fold, which contains 100 training documents and 50 test documents: Kea builds a model from each training set using Naïve Bayes and uses this pre-built model to extract keyphrases from the test documents.
5.3 Results
To measure the overall performance of the proposed neural network based keyphrase extraction system and the publicly available keyphrase extraction system Kea, our experimental dataset of 150 documents is divided into 3 folds for 3-fold cross-validation, where each fold contains two independent sets: a training set of 100 documents and a test set of 50 documents. A separate model is built for each fold to collect 3 test results, which are averaged to obtain the final results for a system. The number of keyphrases to be extracted (the value of K) is set to 5, 10 and 15 for each of the keyphrase extraction systems discussed in this paper.
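As an illustration of this evaluation protocol, the following minimal Python sketch (ours, not part of the paper; all function and variable names are hypothetical) computes average precision and recall at a cutoff K over the test folds.

def precision_recall_at_k(extracted, gold, k):
    """Precision and recall for one document, keeping the top-k extracted phrases."""
    top_k = set(extracted[:k])
    gold = set(gold)
    hits = len(top_k & gold)
    precision = hits / k
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

def average_over_folds(folds, k):
    """folds: list of test sets; each test document is (ranked_phrases, author_keyphrases)."""
    p_scores, r_scores = [], []
    for test_set in folds:
        for ranked_phrases, author_keyphrases in test_set:
            p, r = precision_recall_at_k(ranked_phrases, author_keyphrases, k)
            p_scores.append(p)
            r_scores.append(r)
    return sum(p_scores) / len(p_scores), sum(r_scores) / len(r_scores)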
Table 2 shows the author-assigned keyphrases for journal article number 12 in our corpus. Table 3 and Table 4 show, respectively, the top 5 keyphrases extracted by the MLP based system and by Kea when journal article number 12 is presented to these systems as a test document.
Table 2: Author-assigned keyphrases for journal article number 12 in our test corpus
Dno   AuthorKey
12    adult immunization
12    barriers
12    consumer
12    provider survey
Table 3: Top 5 keyphrases extracted by the proposed MLP based keyphrase extractor
Dno   NP
12    immunization
12    adult immunization
12    healthcare providers
12    consumers
12    barriers
Table 4: Top 5 keyphrases extracted by Kea
Dno   NP
12    adult
12    immunization
12    vaccine
12    healthcare
12    barriers
Table 2 and Table 3 show that, out of the 5 keyphrases extracted by the MLP based approach, 3 match the author-assigned keyphrases. Table 2 and Table 4 show that, out of the 5 keyphrases extracted by Kea, only one matches the author-assigned keyphrases. The overall performance of the proposed MLP based keyphrase extractor is compared with that of Kea in Table 5.
Table 5: Comparison of the performance of the proposed MLP based keyphrase extraction system and Kea
Number of keyphrases   Average Precision (MLP)   Average Precision (Kea)   Average Recall (MLP)   Average Recall (Kea)
5                      0.34                      0.28                      0.35                   0.29
10                     0.22                      0.19                      0.46                   0.40
15                     0.17                      0.15                      0.51                   0.48
From Table 5 we can conclude that the proposed keyphrase extraction system outperforms Kea, in both precision and recall, for all three values of K.
To interpret the results shown in Table 5, we analyze the upper bounds on the precision and recall of a keyphrase extraction system on our dataset. The analysis can be presented in two ways.
(1) Some author-provided keyphrases might not occur in the document they were assigned to. In our corpus, about 88% of author-provided keyphrases appear somewhere in the source documents. After extracting candidate phrases using our candidate phrase extraction algorithm, only 72% of author-provided keyphrases appear in the list of candidate phrases. So, keeping our candidate phrase extraction algorithm fixed, even if a system were designed with the best possible features, or were allowed to extract every phrase in each document as a keyphrase, the highest possible average recall would be 0.72. Since the average number of author-provided keyphrases per document is only 4.90, precision cannot be high even when the number of extracted keyphrases is large. For example, when the number of keyphrases to be extracted per document is set to 10, the highest possible average precision is about 0.3528 (4.90 * 0.72 / 10 = 0.3528).
(2) Assume instead that the candidate phrase extraction procedure is perfect, that is, every author-provided keyphrase that appears in a source document also appears in the list of candidate phrases. Then 88% of the author-provided keyphrases appear in the candidate list, because, on average, 88% of the author-provided keyphrases appear somewhere in the source documents of our corpus. In this case, if a system is allowed to extract every phrase in each document as a keyphrase, the highest possible average recall is 0.88, and when the number of keyphrases to be extracted per document is set to 10, the highest possible average precision is about 0.4312 (4.90 * 0.88 / 10 = 0.4312).
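The bound arithmetic above can be summarized in a few lines of Python (an illustrative sketch; the helper name is ours, the constants are the corpus statistics reported above):

def upper_bounds(avg_author_keys, coverage, k):
    """coverage: fraction of author keyphrases reachable by the system
    (0.72 with the current candidate extractor, 0.88 with a perfect one)."""
    max_recall = coverage
    max_precision = avg_author_keys * coverage / k
    return max_precision, max_recall

print(upper_bounds(4.90, 0.72, 10))   # (0.3528, 0.72)
print(upper_bounds(4.90, 0.88, 10))   # (0.4312, 0.88)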
6. Conclusions
This paper presents a novel keyphrase extraction approach
using neural networks. For predicting whether a phrase is a
keyphrase or not, we use the estimated class probabilities
as the confidence scores which are used in re-ranking the
phrases belonging to a class: positive or negative. To
identify the keyphrases, we use five features such as
TF*IDF, position of a phrase’s first appearance, phrase
length, word length in a phrase and the links of a phrase to
other phrases. The proposed system performs better than a
publicly available keyphrase extraction system called Kea.
As future work, we plan to improve the proposed system by (1) improving its candidate phrase extraction module and (2) incorporating new features, such as structural and lexical features.
References
[1] Y. B. Wu, Q. Li, Document keyphrases as subject
metadata: incorporating document key concepts in search
results, Journal of Information Retrieval, 2008, Volume 11,
Number 3, 229-249
[2] O. Buyukkokten, H. Garcia-Molina, and A. Paepcke.
Seeking the Whole in Parts: Text Summarization for Web
Browsing on Handheld Devices. In Proceedings of the
World Wide Web Conference, 2001, Hong Kong.
[3] O. Buyukkokten, O. Kaljuvee, H. Garcia-Molina, A.
Paepcke, and T. Winograd. Efficient Web Browsing on
Handheld Devices Using Page and Form Summarization.
ACM Transactions on Information Systems (TOIS), 2002,
20(1):82–115
[4] S. Jones, M. Staveley, Phrasier: A system for interactive
document retrieval using Keyphrases, In: proceedings of
SIGIR, 1999, Berkeley, CA
[5] C. Gutwin, G. Paynter, I. Witten, C. Nevill-Manning, E.
Frank, Improving browsing in digital libraries with
keyphrase indexes, Journal of Decision Support Systems,
2003, 27(1-2), 81-104
[6] B. Kosovac, D. J. Vanier, T. M. Froese, Use of keyphrase
extraction software for creation of an AEC/FM thesaurus,
Journal of Information Technology in Construction, 2000,
25-36
[7] S. Jones, M. Mahoui, Hierarchical document clustering using
automatically extracted keyphrase, In proceedings of the
third international Asian conference on digital libraries,
2000, Seoul, Korea. pp. 113-20
[8] K. Barker, N. Cornacchia, Using Noun Phrase Heads to
Extract Document Keyphrases. In H. Hamilton, Q. Yang
(eds.): Canadian AI 2000. Lecture Notes in Artificial
Intelligence, 2000, Vol. 1822, Springer-Verlag, Berlin
Heidelberg, 40 – 52.
[9] L. F Chien, PAT-tree-based Adaptive Keyphrase Extraction
for Intelligent Chinese Information Retrieval, Information
Processing and Management, 1999, 35, 501 – 521.
[10] Y. HaCohen-Kerner, Automatic Extraction of Keywords
from Abstracts, In V. Palade, R. J. Howlett, L. C. Jain
(eds.): KES 2003. Lecture Notes in Artificial Intelligence,
2003, Vol. 2773,Springer-Verlag, Berlin Heidelberg, 843 –
849.
[11] Y. HaCohen-Kerner, Z. Gross, A. Masa, Automatic
Extraction and Learning of Keyphrases from Scientific
Articles, In A. Gelbukh (ed.): CICLing 2005. Lecture Notes
in Computer Science, 2005, Vol. 3406, Springer-Verlag,
Berlin Heidelberg, 657 – 669.
[12] A. Hulth, J. Karlgren, A. Jonsson, H. Boström, Automatic Keyword Extraction Using Domain Knowledge, In A. Gelbukh (ed.): CICLing 2001, Lecture Notes in Computer Science, 2001, Vol. 2004, Springer-Verlag, Berlin Heidelberg, 472-482.
[13] Y. Matsuo, Y. Ohsawa, M. Ishizuka, KeyWorld: Extracting Keywords from a Document as a Small World, In K. P. Jantke, A. Shinohara (eds.): DS 2001, Lecture Notes in Computer Science, 2001, Vol. 2226, Springer-Verlag, Berlin Heidelberg, 271-281.
[14] J. Wang, H. Peng, J.-S. Hu, Automatic Keyphrases Extraction from Document Using Neural Network, ICMLC 2005, 633-641.
[15] P. D. Turney, Learning algorithms for keyphrase extraction, Journal of Information Retrieval, 2000, 2(4), 303-336.
[16] E. Frank, G. Paynter, I. H. Witten, C. Gutwin, C. Nevill-Manning, Domain-specific keyphrase extraction, In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999, San Mateo, CA.
[17] I. H. Witten, G. W. Paynter, E. Frank et al., KEA: Practical Automatic Keyphrase Extraction, In E. A. Fox, N. Rowe (eds.): Proceedings of Digital Libraries '99: The Fourth ACM Conference on Digital Libraries, 1999, ACM Press, Berkeley, CA, 254-255.
[18] N. Kumar, K. Srinathan, Automatic keyphrase extraction from scientific documents using N-gram filtration technique, In Proceedings of the Eighth ACM Symposium on Document Engineering, September 16-19, 2008, Sao Paulo, Brazil.
[19] Q. Li, Y. Brook Wu, Identifying important concepts from medical documents, Journal of Biomedical Informatics, 2006, 668-679.
[20] C. Fellbaum, WordNet: An Electronic Lexical Database, Cambridge: MIT Press, 1998.
[21] G. K. Zipf, The Psycho-Biology of Language, Cambridge, MA: MIT Press, 1935 (reprinted 1965).
[22] R. Duda, P. Hart, Pattern Classification and Scene Analysis, 1973, Wiley and Sons.
[23] J. S. Denker, Y. LeCun, Transforming neural-net output levels to probability distributions, AT&T Bell Labs Technical Memorandum 11359-901120-05.
[24] H. P. Edmundson, "New methods in automatic extracting", Journal of the Association for Computing Machinery, 1969, 16(2), 264-285.
[25] H. Liu, MontyLingua: An end-to-end natural language processor with common sense, 2004, retrieved in 2005 from web.media.mit.edu/~hugo/montylingua.
C Implementation & comparison of companding & silence
audio compression techniques
Mrs. Kruti Dangarwala1 and Mr. Jigar Shah2
1
Department of Computer Engineering,
Sri S’ad Vidya Mandal Institute of Technology
Bharuch, Gujarat, India
2
Department of Electronics and Telecommunication Engineering
Sri S’ad Vidya Mandal Institute of Technology
Bharuch, Gujarat, India
Abstract
Just about all the newest living room audio-video electronics
and PC multimedia products being designed today will
incorporate some form of compressed digitized-audio
processing capability. Audio compression reduces the bit rate
required to represent an analog audio signal while maintaining
the perceived audio quality. Discarding inaudible data reduces
the storage, transmission and compute requirements of
handling high-quality audio files. This paper covers the WAVE audio file format and the algorithms of the silence compression and companding methods used to compress and decompress WAVE audio files, and then compares the results of these two methods.
Keywords: threshold, chunk, bitstream, companding, silence
1. Introduction
Audio compression reduces the bit rate required to
represent an analog audio signal while maintaining the
perceived audio quality. Most audio decoders being
designed today are called "lossy," meaning that they
throw away information that cannot be heard by most
listeners. The information to be discarded is based on psychoacoustics, which uses a model of human auditory perception to determine which parts of the audible spectrum the largest portion of the human population can detect. First, an audio encoder [1] divides the
frequency domain of the signal being digitized into
many bands and analyzes a block of audio to determine
what's called a "masking threshold." The number of bits
used to represent a tone depends on the masking
threshold. The noise associated with using fewer bits is
kept low enough so that it will not be heard. Tones that
are completely masked may not have any bits allocated
to them. Discarding inaudible data reduces the storage,
transmission and compute requirements of handling
high-quality audio files. Consider the example of a
typical audio signal found in a CD-quality audio device.
The CD player produces two channels of audio. Each
analog signal [2] in each channel is sampled at a 44.1kHz sample rate. Each sample is represented as a 16-bit
digital data word. To produce both channels requires a
data rate of 1.4 Mbits/second. However, with audio
compression this data rate is reduced around an order of
magnitude. Thus, a typical CD player is reading
compressed data from a compact disk at a rate just over
100 Kbit/s. Audio compression really consists of two parts. The first part, called encoding, transforms the digital audio data that resides, say, in a WAVE file into a highly compressed form called a bitstream. To play the bitstream on a sound card, you need the second part, called decoding, which takes the bitstream and re-expands it to a WAVE file.
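A quick back-of-the-envelope check of the CD-quality figures quoted above (an illustrative Python sketch, not part of the paper):

sample_rate = 44100          # samples per second, per channel
bits_per_sample = 16
channels = 2
raw_rate = sample_rate * bits_per_sample * channels   # 1,411,200 bits/s, i.e. about 1.4 Mbit/s
compressed_rate = raw_rate / 10                       # roughly an order of magnitude lower
print(raw_rate, compressed_rate)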
2. WAVE AUDIO FILE FORMAT
The WAVE file format [1] is a subset of Microsoft's
RIFF spec, which can include lots of different kinds of
data. RIFF is a file format for storing many kinds of data,
primarily multimedia data like audio and video. It is
based on chunks and sub-chunks. Each chunk has a type,
represented by a four-character tag. This chunk type
comes first in the file, followed by the size of the chunk,
then the contents of the chunk. The entire RIFF file is a
big chunk that contains all the other chunks. The first
thing in the contents of the RIFF chunk is the "form
type," which describes the overall type of the file's
contents. So the structure of wave audio file looks like
this: a) RIFF Chunk b) Format Chunk c) Data Chunk
26
IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 2, No 3, March 2010
www.IJCSI.org
Table 1: RIFF CHUNK
Byte Number   Description
0-3           "RIFF" (ASCII characters)
4-7           Total length of package to follow (binary)
8-11          "WAVE" (ASCII characters)

Description of the RIFF chunk as follows:
Offset  Length   Contents
0       4 bytes  'RIFF'
4       4 bytes  <file length - 8>
8       4 bytes  'WAVE'
Table 2: FORMAT CHUNK
Byte Number   Description
0-3           "fmt_" (ASCII characters)
4-7           Length of format chunk (binary)
8-9           Always 0x01
10-11         Number of channels (0x01 = mono, 0x02 = stereo)
12-15         Sample rate (binary, in Hz)
16-19         Bytes per second
20-21         Bytes per sample: 1 = 8-bit mono, 2 = 8-bit stereo or 16-bit mono, 4 = 16-bit stereo
22-23         Bits per sample

Description of the FORMAT chunk as follows:
Offset  Length   Contents
12      4 bytes  'fmt '
16      4 bytes  0x00000010
20      2 bytes  0x0001            // format tag: 1 = PCM
22      2 bytes  <channels>
24      4 bytes  <sample rate>     // samples per second
28      4 bytes  <bytes/second>    // sample rate * block align
32      2 bytes  <block align>     // channels * bits/sample / 8
34      2 bytes  <bits/sample>     // 8 or 16

Table 3: DATA CHUNK
Byte Number   Description
0-3           "data" (ASCII characters)
4-7           Length of data to follow
8-end         Data (samples)

Description of the DATA chunk as follows:
Offset  Length   Contents
36      4 bytes  'data'
40      4 bytes  <length of the data block>
44      ...      <sample data>

The sample data must end on an even byte boundary. All numeric data fields are in the Intel format of low-high byte ordering. 8-bit samples are stored as unsigned bytes, ranging from 0 to 255; 16-bit samples are stored as 2's-complement signed integers, ranging from -32768 to 32767. For multi-channel data, samples are interleaved between channels, like this: sample 0 for channel 0, sample 0 for channel 1, sample 1 for channel 0, sample 1 for channel 1, and so on. For stereo audio, channel 0 is the left channel and channel 1 is the right.
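The chunk layout above can be read with a few fixed-size reads. The following Python sketch (ours, not from the paper) parses the RIFF, format and data chunk headers of a canonical PCM WAVE file in which the data chunk immediately follows a 16-byte format chunk:

import struct

def read_wave_header(path):
    """Parse the RIFF, fmt and data chunk headers of a canonical PCM WAVE file."""
    with open(path, "rb") as f:
        riff, file_len, wave = struct.unpack("<4sI4s", f.read(12))
        assert riff == b"RIFF" and wave == b"WAVE"
        fmt_id, fmt_len = struct.unpack("<4sI", f.read(8))
        assert fmt_id == b"fmt "
        (fmt_tag, channels, sample_rate,
         byte_rate, block_align, bits_per_sample) = struct.unpack("<HHIIHH", f.read(16))
        data_id, data_len = struct.unpack("<4sI", f.read(8))
        return {"channels": channels, "sample_rate": sample_rate,
                "bits_per_sample": bits_per_sample, "data_bytes": data_len}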
3. Silence Compression & Decompression
Techniques
3.1 Introduction
Silence compression [4] on sound files is the equivalent of run-length encoding on normal data files. In this case, the runs we encode are sequences of relative silence in a sound file, which we replace with absolute silence; the technique is therefore lossy.
3.2 User Parameters:
1) Threshold Value: the value treated as silence. With 8-bit samples, 80H is considered "pure" silence, and any sample value within plus or minus 4 of 80H is also treated as silence.
2) Silence_Code: the code used to encode a run of silence. We use the value FF to encode silence. The silence code is followed by a single byte that indicates how many consecutive silence values there are.
3) Start_Threshold: the number of silence values that must be seen before a silence run is started. We would not want to start encoding silence after seeing just a single byte of silence; it does not even become economical until 3 bytes of silence are seen. One may experiment with values higher than 3 to see how they affect the fidelity of the recording.
4) Stop_Threshold: the number of consecutive non-silence values that must be seen in the input stream before the silence run is declared over.
3.3 Silence Compression Algorithm (Encoder)
1) Read 8-bit sample data from the audio file.
2) Check for silence, i.e. find at least 5 consecutive silence values: 80H, or within +4/-4 of 80H (this indicates the start of a silence run).
3) If found, encode the run with the Silence_Code followed by the run length (the number of consecutive silence values).
4) Stop encoding the run when at least two non-silence values are found.
5) Repeat the above steps until the end-of-file character is found.
6) Print the input file size, output file size and compression ratio.

This algorithm [4] takes an 8-bit wave audio file as input. It finds the start of silence by checking whether at least 5 consecutive silence values are present; 80H is considered pure silence, and values within +/-4 of 80H are also considered silence. If found, the encoding process starts: consecutive silence values are encoded as the silence code followed by the run length. Encoding of the run stops when at least two non-silence values are found. The encoder then generates a compressed file, whose extension is also .wav.
An example of the algorithm is as follows: if the input file contains the sample data 80 81 80 81 80 80 80 45, the output file contains the compressed data FF 7 45, i.e. the silence code, the run length (7), and the non-silence sample. The program also displays the following attributes of the input wave audio file:
a. input file size in bytes,
b. output file size in bytes,
c. compression ratio.
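A minimal Python sketch of this encoder under the assumptions stated above (0x80 plus or minus 4 counts as silence, 0xFF is the silence code, and a run is only encoded once at least 5 silent samples are seen); the constant and function names are ours, and the stop-threshold refinement and the handling of a literal 0xFF sample are omitted for brevity:

SILENCE, SILENCE_CODE = 0x80, 0xFF
TOLERANCE, START_THRESHOLD = 4, 5

def is_silent(b):
    return abs(b - SILENCE) <= TOLERANCE

def silence_encode(samples):
    """samples: bytes of an 8-bit mono WAVE data chunk; returns the encoded bytes."""
    out, i, n = bytearray(), 0, len(samples)
    while i < n:
        run = 0
        while i + run < n and is_silent(samples[i + run]):
            run += 1
        if run >= START_THRESHOLD:
            # encode the run as: silence code, run length (capped at one byte per pair)
            while run > 0:
                chunk = min(run, 255)
                out += bytes([SILENCE_CODE, chunk])
                run -= chunk
                i += chunk
        elif run > 0:
            out += samples[i:i + run]   # short silent stretch copied literally
            i += run
        else:
            out.append(samples[i])      # non-silent sample copied literally
            i += 1
    return bytes(out)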
3.4 Silence Decompression Algorithm (Decoder)
1) Read 8-bit samples from the compressed file.
2) Check for the silence code (0xFF); if it is found, read the next value, the run length, which indicates the number of silence values.
3) Replace the pair with the silence value (0x80) repeated run-length times.
4) Repeat the above steps until the end-of-file character is reached.

An example of the algorithm is as follows: in the input (compressed) file, whose extension is also .wav, we look for the silence code; when we find it, we read the next value, which indicates the number of silence values, and replace the pair with that many silence values. The procedure stops when the end-of-file character is reached. For example, if we find the value 0xFF 0x05 in the compressed file, it is decoded as 0x80 0x80 0x80 0x80 0x80.
4. Companding Compression & Decompression Techniques

4.1 Introduction
Companding [4] uses the fact that the ear requires more precise samples at low amplitudes (soft sounds) but is more forgiving at higher amplitudes. A typical ADC used in sound cards for personal computers converts voltages to numbers linearly: if an amplitude a is converted to the number n, then amplitude 2a will be converted to the number 2n. Companding examines every sample in the sound file and uses a non-linear formula to reduce the number of bits devoted to it.

The non-linear formula for converting 16-bit samples to 15-bit samples is:

Mapped = 32767 * (pow(2, Sample/65536) - 1)        (1)

Using this formula, every 16-bit sample is converted into a 15-bit sample. The mapping is non-linear, so small samples are less affected than large ones. Mapped 15-bit numbers can be decoded back into the original 16-bit samples by the inverse formula:

Sample = 65536 * log2(1 + Mapped/32767)        (2)

Reducing 16-bit numbers to 15 bits does not produce much compression. Better compression can be achieved by substituting a smaller number for 32767 in (1) and (2). A value of 127, for example, would map each 16-bit sample into an 8-bit sample, compressing the file with a compression ratio of 0.5, i.e. 50%. Decoding is then less accurate: a 16-bit sample of 60,100, for example, would be mapped into the 8-bit number 113, but this number would produce 60,172 when decoded by (2). Even worse, the small 16-bit sample 1000 would be mapped into 1.35, which has to be rounded to 1; when (2) is used to decode a 1, it produces 742, significantly different from the original sample. The amount of compression [2] should thus be a user-controlled parameter, and this is an interesting example of a compression method where the compression ratio is known in advance. There is no need to evaluate (1) and (2) for every sample, since the mapping of all the samples can be prepared in advance in a table; both encoding and decoding are thus fast [4].

4.2 Companding Compression Algorithm
1) Input the number of bits to use for the output code.
2) Build a compression look-up table (8-bit to input-bit) using the non-linear formula
   value = 128.0 * (pow(2, code/N) - 1.0) + 0.5
   where code runs from pow(2, inputbit) down to 1 and N = pow(2, inputbit).
   For each code we assign values 0 to 15 in the table:
   index of table     value
   j + 127        ->  code + N - 1
   128 - j        ->  N - code
   where j runs from value down to zero.
3) Read 8-bit samples from the audio file; each sample becomes an index into the compression look-up table, and the corresponding value is stored in the output file.
4) Repeat step 3 until the end-of-file character is reached.
5) Print the input file size in bytes, output file size in bytes and compression ratio.

The algorithm converts an 8-bit sample file into a user-defined output bit width. For example, if the output bit width is 4, we achieve 50% compression, i.e. a compression ratio of 0.5. In this method the compression ratio is therefore known in advance and can be adjusted according to requirements, which is an important advantage compared to the other method.
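A small Python sketch of the companding mapping described above, applying formulas (1) and (2) directly and treating 16-bit samples as unsigned values in 0..65535; the function names are ours, and a table-driven implementation would simply precompute these values:

import math

def compand_encode(sample, levels=127):
    """Map a 16-bit sample (0..65535) non-linearly onto 0..levels, cf. formula (1)."""
    return round(levels * (2 ** (sample / 65536) - 1))

def compand_decode(code, levels=127):
    """Inverse mapping back to the 16-bit range, cf. formula (2)."""
    return round(65536 * math.log2(1 + code / levels))

# The worked example from the text:
print(compand_encode(60100))   # -> 113
print(compand_decode(113))     # -> roughly 60,180: close to, but not equal to, 60,100
print(compand_encode(1000))    # -> 1 (1.35 before rounding)
print(compand_decode(1))       # -> about 742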
4.3 Companding Decompression Algorithm
1) Find the number of bits used in the compressed file.
2) Build an expansion look-up table (input-bit to 8-bit) using the non-linear formula
   value = 128.0 * (pow(2.0, code/N) - 1.0) + 0.5
   where code runs from 1 to pow(2, inputbit) and N = pow(2, inputbit).
   For each code we assign values 0 to 255 in the table:
   index of table     value
   N + code - 1   ->  128 + (value + last_value)/2
   N - code       ->  127 - (value + last_value)/2
   where initially last_value = 0 and, after each code, last_value = value.
3) Read input-bit samples from the audio file; each sample becomes an index into the expansion look-up table, and the corresponding value is stored in the output file.
4) Repeat step 3 until the file size becomes zero.
5. Results/Comparison Between the Two Lossy Methods

5.1 Companding Compression Method
1) INPUT AUDIO FILE:
Name of File:  J1.WAV
Media Length:  4.320 sec
Audio Format:  PCM, 8000 Hz, 8-bit, Mono
File Size:     33.8 KB (34,618 bytes)

User Parameter (No. of Bits)   Input File Size (bytes)   Output File Size (bytes)   Compression Ratio (%)
1                              34618                     4390                       88%
2                              34618                     8709                       75%
3                              34618                     13028                      63%
4                              34618                     17347                      50%
5                              34618                     21666                      38%
6                              34618                     25985                      25%
7                              34618                     30304                      13%
8                              34618                     34618                      0%
5.2 Silence Compression Method

Name of File   Media Length   Audio Format                   Input File Size (bytes)   Output File Size (bytes)   Compression Ratio (%)
J1.WAV         4.320 sec      PCM, 8000 Hz, 8-bit, Mono      34618                     25099                      28%
Chimes2.wav    0.63 sec       PCM, 22,050 Hz, 8-bit, Mono    14028                     7052                       50%
Chord2.wav     1.09 sec       PCM, 22,050 Hz, 8-bit, Mono    14028                     9074                       36%
Ding2.wav      0.91 sec       PCM, 22,050 Hz, 8-bit, Mono    20298                     13887                      32%
Logoff2.wav    3.54 sec       PCM, 22,050 Hz, 8-bit, Mono    783625                    60645                      23%
Notify2.wav    1.35 sec       PCM, 22,050 Hz, 8-bit, Mono    29930                     13310                      56%
6. Conclusions
Silence compression achieves more compression when more silence values are present in the input audio file. The silence method is lossy because any value within +/-4 of 80H is treated as silence and, during compression, is replaced by 80H for the length of the run; when we decompress the audio file we recover the original size but not the original data, so some losses occur. The companding method is also lossy, but it has the advantage that the compression ratio can be adjusted to requirements, since it depends on the number of bits used in the output file. When we decompress an audio file compressed by companding we recover the original size and very nearly the original data, with only minor losses, and the audio quality is preserved. Compared to the silence method, the companding method is therefore the better choice.
References
[1] David Salomon, Data Compression: The Complete Reference, 2nd ed., 1995.
[2] John G. Proakis and Dimitris G. Manolakis, Digital Signal Processing: Principles, Algorithms & Applications, 3rd ed.
[3] David Pan, "A Tutorial on MPEG/Audio Compression", IEEE Multimedia, Vol. 2, No. 2.
[4] Mark Nelson, Data Compression, 2nd ed.
[5] Stephen J. Solari, Digital Video and Audio Compression.
[6] Cliff Wootton, A Practical Guide to Video and Audio Compression.
Kruti J. Dangarwala received her B.E. (Computer Science) in 2001 and M.E. (Computer Engineering) in 2005. She is currently employed at SVMIT, Bharuch, Gujarat State, India as an assistant professor. She has published two technical papers in various conferences.

Jigar H. Shah received his B.E. (Electronics) in 1997 and M.E. (Microprocessor) in 2006, and is presently pursuing a Ph.D. degree. He is currently employed at SVMIT, Bharuch, Gujarat State, India as an assistant professor. He has published five technical papers in various conferences and has also written five book titles. He is a life member of ISTE and IETE.
Color Image Compression Based On Wavelet Packet Best Tree
Prof. Dr. G. K. Kharate
Principal, Matoshri College of Engineering and Research Centre,
Nashik – 422003, Maharashtra, India
Dr. Mrs. V. H. Patil
Professor, Department of Computer Engineering, University of Pune
Abstract
In Image Compression, the researchers’ aim is to reduce the
number of bits required to represent an image by removing
the spatial and spectral redundancies. Recently discrete
wavelet transform and wavelet packet has emerged as popular
techniques for image compression. The wavelet transform is
one of the major processing components of image
compression. The result of the compression changes as per the
basis and tap of the wavelet used. It is proposed that proper selection of the mother wavelet on the basis of the nature of the image improves the quality as well as the compression ratio remarkably. We suggest a novel technique based on the wavelet packet best tree built on threshold entropy with enhanced run-length encoding. This method reduces the time complexity of wavelet packet decomposition, since the complete tree is not decomposed. Our algorithm selects the sub-bands that include significant information based on threshold entropy. The suggested enhanced run-length encoding technique provides better results than RLE. The results, when compared with JPEG-2000, prove to be better.
Keywords: Compression, JPEG, RLE, Wavelet, Wavelet
Packet
1. Introduction
In today’s modern era, multimedia has tremendous
impact on human lives. Image is one of the most
important media contributing to multimedia. The
unprocessed image heavily consumes very important
resources of the system. And hence it is highly
desirable that the image be processed, so that efficient
storage, representation and transmission of it can be
worked out. The processes involve one of the important
processes- “Image Compression”. Methods for digital
image compression have been the subject of research
over the past decade. Advances in Wavelet Transforms
and Quantization methods have produced algorithms
capable of surpassing image compression standards.
The recent growth of data intensive multimedia based
applications have not only sustained the need for more
efficient ways to encode the signals and images but also
have made compression of such signals central to
storage and communication technology. In Image
Compression, the researchers’ aim is to reduce the
number of bits needed to represent an image by
removing the spatial and spectral redundancies. Image
Compression methods may be lossy or lossless. As lossless image compression preserves the quality of the compressed image, the compression ratio achieved is very low; hence one cannot save resources significantly using lossless compression. Lossy image compression compromises the resultant image quality in ways that go largely unnoticed by the viewer; the loss in image quality adds to the percentage compression and hence saves resources.
There are various methods of compressing still images
and every method has three basic steps:
Transformation, quantization and encoding.
The transformation transforms the data set into another
equivalent data set. For image compression, it is
desirable that the selection of transform should reduce
the size of resultant data set as compared to source data
set. Many mathematical transformations exist that
transform a data set from one system of measurement
into another. Some mathematical transformations have
been invented for the sole purpose of data compression;
selection of proper transform is one of the important
factors in data compression scheme.
In the process of quantization, each sample is scaled by
the quantization factor whereas in the process of
thresholding all insignificant samples are eliminated.
These two methods are responsible for introducing data
loss and it degrades the quality.
The encoding phase of compression reduces the overall
number of bits needed to represent the data set. An
entropy encoder further compresses the quantized
values to give better overall compression. This process
removes the redundancy in the form of repetitive bits.
We suggest the novel technique, which is based on
wavelet packet best tree based on threshold Entropy
with enhanced run-length encoding. This method
reduces the time complexity of wavelet packets
decomposition as complete tree is not decomposed. Our
algorithm selects the sub-bands, which include
significant information based on Threshold entropy.
The results when compared with JPEG-2000 prove to
be better. The basic theme of the paper is the extraction of information from the original image based on the Human Visual System: by carefully exploiting the characteristics of human visual perception, the compression algorithm can discard information that is irrelevant to the human eye.
2. Today’s Scenario
The International Standards Organization (ISO) has
proposed the JPEG standard [2, 4, 5] for image
compression. Each color component of still image is
treated as a separate gray scale picture by JPEG.
Although JPEG allows any color component separation,
images are usually separated into Red, Green, and Blue
(RGB) or Luminance (Y), with Blue and Red color
differences (U = B – Y, V = R – Y). Separation into
YUV color components allows the algorithm to take the
advantages of human eyes’ lower sensitivity to color
information. For quantization, JPEG uses quantization
matrices. JPEG allows a different quantization matrix to
be specified for each color component [3]. Though JPEG has provided good results in the past, it is not perfectly suited for modern multimedia applications because of blocking artifacts.
Wavelet theory and its application in image
compression had been well developed over the past
decade. The field of wavelets is still sufficiently new
and further advancements will continue to be reported
in many areas. Many authors have contributed to the
field to make it what it is today, with the most well
known pioneer probably being Ingrid Daubechies.
Other researchers whose contribution directly influence
this work include Stephane Mallat for the pyramid
filtering algorithm, and the team of R. R. Coifman, Y.
Meyer, and M. V. Wickerhauser for their introduction
of wavelet packet [6].
Further research has been done on still image compression: the JPEG standard was established in 1992, and work on JPEG-2000 for the coding of still images was completed at the end of the year 2000. The JPEG-2000 standard employs wavelets for compression because of their merits in terms of scalability, localization and energy concentration [6, 7]. It also provides the user with many options for achieving further compression. The JPEG-2000 standard supports decomposition of all the sub-bands at each level and hence requires full decomposition at a certain level. The compressed images look slightly washed out, with less brilliant color; this problem appears to be worse in JPEG than in JPEG-2000 [9]. Both JPEG-2000 and JPEG operate in the spectral domain, trying to represent the image as a sum of smooth oscillating waves, and JPEG-2000 suffers from ringing and blurring artifacts [9].
Most of the researchers have worked on this problem
and have suggested the different techniques that
minimize the said problem against the compromise for
compression ratio.
3. Wavelet And Wavelet Packet
In order to represent complex signals efficiently, a basis
function should be localized in both time and frequency
domains. The wavelet function is localized in time
domain as well as in frequency domain, and it is a
function of variable parameters.
The wavelet decomposes the image, and generates four
different horizontal frequencies and vertical frequencies
outputs. These outputs are referred as approximation,
horizontal detail, vertical detail, and diagonal detail.
The approximation contains low frequency horizontal
and vertical components of the image. The
decomposition procedure is repeated on the
approximation sub-band to generate the next level of
the decomposition, and so on, leading to the well-known pyramidal decomposition tree. Wavelets with many vanishing moments yield sparse decompositions of piecewise smooth surfaces; therefore they provide a very appropriate tool to compactly code smooth images. Wavelets, however, are ill suited to representing oscillatory patterns [13, 14]. A special form of texture, oscillating variations, that is, rapid variations in intensity, can only be described by the small-scale wavelet coefficients. Unfortunately, these small-scale coefficients carry very little energy and are often quantized to zero even at high bit rates.
The weakness of wavelet transform is overcome by new
transform method, which is based on the wavelet
transform and known as wavelet packets. Wavelet
packets are better able to represent the high frequency
information [11].
Wavelet packets represent a generalization of multiresolution decomposition. In the wavelet packets
decomposition, the recursive procedure is applied to the
coarse scale approximation along with horizontal detail,
vertical detail, and diagonal detail, which leads to a
complete binary tree. The pyramid structure of wavelet decomposition up to the third level is shown in Figure 1, the tree structure of wavelet decomposition up to the third level is shown in Figure 2, the structure of the two-level decomposition of the wavelet packet is shown in Figure 3, and the tree structure of the wavelet packet decomposition (the complete decomposed three-level tree) is shown in Figure 4.
[Figure 1: The pyramid structure of wavelet decomposition up to the third level, with sub-bands LL3, HL3, LH3, HH3, HL2, LH2, HH2, HL1, LH1, HH1]
[Figure 2: The tree structure of wavelet decomposition up to the third level]
[Figure 3: The structure of the two-level decomposition of the wavelet packet, with sub-bands LL1LL2 through HH1HH2]
[Figure 4: The complete decomposed three-level wavelet packet tree]
4. Proposed Algorithm for Image Compression

Modern image compression techniques use the wavelet transform. Considering the limitations of the wavelet transform for image compression, we suggest a novel technique based on the wavelet packet best tree built on threshold entropy with lossy enhanced run-length encoding. This method reduces the time complexity of wavelet packet decomposition and selects the sub-bands that include significant information in compact form.

The threshold entropy criterion measures the information content of the transform coefficients of a sub-band. The threshold entropy is obtained by the equation

Entropy = SUM over i = 0 to N-1 of [ |Xi| > Threshold ]        (1)

where [.] equals 1 when the condition holds and 0 otherwise, Xi is the i-th coefficient of the sub-band and N is the length of the sub-band.
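A minimal sketch of criterion (1), assuming it counts the coefficients of a sub-band whose magnitude exceeds the threshold (Python; the function name is ours):

def threshold_entropy(coefficients, threshold):
    """Number of sub-band coefficients whose magnitude exceeds the threshold, cf. (1)."""
    return sum(1 for x in coefficients if abs(x) > threshold)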
The information content of the decomposed components of a wavelet packet may be greater than or less than the information content of the component that has been decomposed. The sum of the costs (threshold entropies) of the decomposed components (child nodes) is compared with the cost of the component that has been decomposed (the parent node). If the sum of the costs of the child nodes is less than the cost of the parent node, the child nodes are kept as leaf nodes of the tree; otherwise the child nodes are removed from the tree and the parent node becomes a leaf node. This process is iterated up to the last level of decomposition.

The time complexity of the proposed algorithm is lower than that of the algorithm in [12]. In [12], a full wavelet packet decomposition to level J is performed first, and the cost functions of all nodes in the decomposition tree are evaluated. Beginning at the bottom of the tree, the cost function of each parent node is compared with the union of the cost functions of its child nodes, and according to the result the best basis node(s) are selected. This procedure is applied recursively at each level of the tree until the topmost node is reached.
In the proposed algorithm there is no need for a full wavelet packet decomposition to level J, and no need to evaluate the cost functions of all nodes initially. The best basis selection algorithm based on threshold entropy is as follows (a sketch of this procedure is given after the list):
• Load the image.
• Set the current node equal to the input image.
• Decompose the current node using the wavelet packet tree.
• Evaluate the cost of the current node and of its decomposed components.
• Compare the cost of the parent node (current node) with the sum of the costs of the child nodes (decomposed components). If the sum of the costs of the child nodes is greater than the cost of the parent node, keep the parent node as a leaf node of the tree and prune the child nodes; otherwise repeat steps 3, 4 and 5 for each child node, treating the child node as the current node, until the last level of the tree is reached.
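A minimal Python sketch of this selection procedure (ours, not the authors' implementation). The decomposition step is abstracted behind a user-supplied decompose function, for instance a wrapper around PyWavelets, and cost can be the threshold entropy of equation (1) applied to the flattened sub-band:

def best_basis(node, decompose, cost, max_level, level=0):
    """Return the leaf sub-bands of the pruned wavelet packet tree.

    node      : coefficients of the current sub-band (initially the image itself)
    decompose : function mapping a node to its four child sub-bands (LL, LH, HL, HH)
    cost      : cost function, e.g. the threshold entropy above
    """
    if level == max_level:
        return [node]
    children = decompose(node)
    if sum(cost(c) for c in children) >= cost(node):
        return [node]                 # children pruned; the parent becomes a leaf
    leaves = []
    for child in children:
        leaves.extend(best_basis(child, decompose, cost, max_level, level + 1))
    return leaves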
This algorithm reduces the time complexity, because
there is no need to decompose the full wavelet packets
tree and no need to evaluate the costs initially. The
decision of further decomposition, and cost calculation
is based on the run time strategy of the algorithm, and it
decides at run time whether to retain or prune the
decomposed components [15].
Once the best basis has been selected based on cost
function, the image is represented by a set of wavelet
packets coefficients. The high compression ratio is
achieved by using the thresholding to the wavelet
packets coefficients. The advantages of wavelet packets
can be gained by proper selection of thresholds.
Encoder further compresses the coefficients of wavelet
packets tree to give better overall compression. Simple
Run-Length Encoding (RLE) has proven very effective
encoding in many applications. Run-Length Encoding
is a pattern recognition scheme that searches for the
repetition (redundancy) of identical data values in the
code-stream. The data set can be compressed by
replacing the repetitive sequence with a single data
value and the length of that run. We propose a modified encoding technique, named Enhanced Run-Length Encoding, and for the subsequent bit coding the well-known Huffman coding or arithmetic coding methods are used.
The problem with existing run-length encoding is that the compression ratio obtained from run-length encoding schemes varies depending on the type of data to be encoded and on the repetitions present within the data set: some data sets can be highly compressed by run-length encoding, whereas other data sets can actually grow larger due to the encoding [16]. This problem is eliminated to a certain extent by the Enhanced Run-Length Encoding technique. In the proposed Enhanced RLE, neighbouring coefficients are compared against an acceptable value, which is provided by the user according to the application; if the difference is less than the acceptable value, the difference is discarded and the neighbouring coefficients are treated as identical, which produces longer runs. A minimal sketch of this idea is given below.
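A minimal Python sketch of this lossy enhanced run-length encoder as we read it (the function names and the (value, count) output format are ours):

def enhanced_rle(coefficients, tolerance):
    """Lossy RLE: values within `tolerance` of the current run value join the run."""
    if not coefficients:
        return []
    runs = []
    run_value, run_length = coefficients[0], 1
    for x in coefficients[1:]:
        if abs(x - run_value) <= tolerance:
            run_length += 1          # small difference discarded: x is treated as run_value
        else:
            runs.append((run_value, run_length))
            run_value, run_length = x, 1
    runs.append((run_value, run_length))
    return runs

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]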
5. Results

The proposed algorithm is implemented and tested over a range of natural and synthetic images. The natural test images used are AISHWARYA, CHEETAH, LENA, BARBARA, MANDRILL, BIRD, ROSE and DONKEY, and the synthetic images used are BUTTERFLY, HORIZONTAL and VERTICAL. The results for these images are given in Table 1, and a few output images are shown in Figure 4.
Table 1: Results of selected images
Image        Percentage of compression   Compression ratio   Peak signal to noise ratio (dB)
AISHWARYA    97.6873                     50                  64.8412
CHEETAH      97.3983                     39                  59.1732
LENA         98.4885                     67                  66.7228
BARBARA      97.0166                     34                  51.9716
MANDRILL     94.5585                     19                  46.1098
BIRD         97.1217                     35                  57.2633
ROSE         94.4339                     18                  46.7576
DONKEY       97.0665                     35                  50.6168
BUTTERFLY    96.0195                     26                  46.5698
HORIZONTAL   89.2611                     10                  47.7637
VERTICAL     97.5687                     42                  49.5102
[Fig. 4: Resultant images (BIRD, ROSE, MANDRILL, HORIZONTAL) with their PSNR and compression ratio values]
6. Conclusion
The novel algorithm of image compression using
wavelet packet best tree based on Threshold entropy
and enhanced RLE is implemented, and tested over the
set of natural and synthetic images and concluding
remarks based on results are discussed. The results
show that the compression ratio is good for low
frequency (smooth) images, and it is observed that it is
very high for gray images. For high frequency images
such as Mandrill, Barbara, the compression ratio is
good, and the quality of the images is retained too.
These results are compared with a JPEG-2000 application, and it is found that the results obtained using the proposed algorithm are better than those of JPEG-2000.
7. References
[1] Subhasis Saha, "Image Compression - from DCT to Wavelet: A Review," ACM Crossroads Student Magazine, Vol. 6, No. 3, Spring 2000.
[2] Andrew B. Watson, "Image Compression Using the Discrete Cosine Transform," NASA Ames Research Center, Mathematical Journal, Vol. 4, Issue 1, pp. 81-88, 1994.
[3] "Video Compression - An Introduction," Array Microsystems Inc., White Paper, 1997.
[4] Sergio D. Servetto, Kannan Ramchandran, Michael T. Orchard, "Image Coding Based on a Morphological Representation of Wavelet Data," IEEE Transactions on Image Processing, Vol. 8, No. 9, pp. 1161-1174, 1999.
[5] M. K. Mandal, S. Panchnathan, T. Aboulnasr, "Choice of Wavelets for Image Compression," in Information Theory and Applications, Lecture Notes in Computer Science, Vol. 1133, pp. 239-249, 1995.
[6] Andrew B. Watson, "Image Compression Using the Discrete Cosine Transform," NASA Ames Research Center, Mathematical Journal, Vol. 4, Issue 1, pp. 81-88, 1994.
[7] Wayne E. Bretl, Mark Fimoff, "MPEG2 Tutorial Introduction & Contents: Video Compression Basics - Lossy Coding," 1999.
[8] M. M. Reid, R. J. Millar, N. D. Black, "Second-Generation Image Coding: An Overview," ACM Computing Surveys, Vol. 29, No. 1, March 1997.
[9] Aleks Jakulin, "Baseline JPEG and JPEG2000 Artifacts Illustrated," Visicron, 2002.
[10] Agostino Abbate, Casimer M. DeCusatis, Pankaj K. Das, Wavelets and Subbands: Fundamentals and Applications, Birkhauser Boston, ISBN 0-8176-4136-X.
[11] Deepti Gupta, Shital Mutha, "Image Compression Using Wavelet Packet," IEEE, Vol. 3, pp. 922-926, Oct. 2003.
[12] Andreas Uhl, "Wavelet Packet Best Basis Selection on Moderate Parallel MIMD Architectures," Parallel Computing, Vol. 22, No. 1, pp. 149-158, Jan. 1996.
[13] Francois G. Meyer, Amir Averbuch, Jan-Olov Stromberg, Ronald R. Coifman, "Fast Wavelet Packet Image Compression," IEEE Transactions on Image Processing, Vol. 9, No. 5, pp. 563-572, May 2000.
[14] Jose Oliver, Manuel Perez Malumbres, "Fast and Efficient Spatial Scalable Image Compression Using Wavelet Lower Trees," Proceedings of the Data Compression Conference, pp. 133-142, Mar. 2003.
[15] G. K. Kharate, A. A. Ghatol, P. P. Rege, "Image Compression Using Wavelet Packet Tree," ICGST International Journal on Graphics, Vision and Image Processing (GVIP), Vol. 5, Issue 7, July 2005.
[16] Mark Nelson, Jean-Loup Gailly, The Data Compression Book, M&T Books, ISBN 1-55851-434-1, 1996.
A Pedagogical Evaluation and Discussion about the Lack of Cohesion in
Method (LCOM) Metric Using Field Experiment.
Ezekiel Okike
School of Computer Studies, Kampala International University ,
Kampala, Uganda 256, Uganda.
Abstract
Chidamber and Kemerer first defined a cohesion measure for
object-oriented software – the Lack of Cohesion in Methods
(LCOM) metric. This paper presents a pedagogic evaluation and
discussion about the LCOM metric using field data from three
industrial systems. System 1 has 34 classes, System 2 has 383
classes and System 3 has 1055 classes. The main objectives of
the study were to determine if the LCOM metric was appropriate
in the measurement of class cohesion and the determination of
properly and improperly designed classes in the studied systems.
Chidamber and Kemerer's suite of metrics was used as the measurement tool, and descriptive statistics were used to analyze the results. The results of the study showed that in System 1, 78.8% (26 classes) of the classes were cohesive; in System 2, 54% (207 classes) were cohesive; and in System 3, 30% (317 classes) were cohesive. We suggest that the LCOM metric measures class cohesiveness and was appropriate for determining properly and improperly designed classes in the studied systems.
Keywords: Class Cohesion, LCOM Metric, Systems, Software
Measurement.
1.
Introduction
Software metric is any type of measurement that relates to
a software system, process or related documentation. On
the other hand, software measurement is concerned with
deriving a numeric value for some attributes of a software
product or process. By comparing these values to each
other and to standards that apply across an organization,
one may be able to draw conclusions about the quality of a
software or software processes. The Lack of Cohesion in
Methods (LCOM) metric was proposed in [5,6] as a
measure of cohesion in the object oriented paradigm.
The term cohesion is defined as the “intramodular
functional relatedness” in software [1]. This definition,
considers the cohesion of each module in isolation: how
tightly bound or related its internal elements are. Hence,
cohesion as an attribute of software modules capture the
degree of association of elements within a module, and the
programming paradigm used determines what is an
element and what is a module. In the object-oriented
paradigm, for instance, a module is a class and hence
cohesion refers to the relatedness among the methods of a
class. Cohesion may be categorized ranging from the
weakest form to the strongest form in the following order:
coincidental,
logical,
temporal,
procedural,
communicational, sequential and functional.
i. Coincidental cohesion: A coincidentally cohesive
module is one whose elements contribute to activities in a
module, but with no meaningful relationship to one
another. An example is to have unrelated statements
bundled together in a module. Such a module would be
hard to understand what it does and can not be reused in
another program.
ii. Logical cohesion: A logically cohesive module is one
whose elements contribute to activities of the same general
category in which the activity or activities to be executed
are selected from outside the module. A logically cohesive
module does any of several different related things, hence,
presenting a confusing interface since some parameters
may be needed only sometimes.
iii. Temporal cohesion: A temporally cohesive module is
one whose elements are involved in activities that are
related in time. That is, the activities are carried out at a
particular time. The elements occurring together in a
temporally cohesive module do diverse things and execute
at the same time.
iv. Procedural cohesion: A procedurally cohesive module
is one whose elements are involved in different and
possibly unrelated activities in which control flows from
each activity to the next. Procedurally cohesive modules
tend to be composed of pieces of functions that have little
relationship to one another (except that they are carried
out in a specific order at a certain time).
v. Communicational cohesion: A communicational
cohesive module is one whose elements contribute to
activities that use the same input or output data.
vi. Sequential cohesion: A sequentially cohesive module
is one whose elements are involved in activities such that
output data from one activity serve as input data to the
next.
Some authors identify this as informational
cohesion.
vii. Functional cohesion: A functionally cohesive module
contains elements that all contribute to the execution of
one and only one problem-related task. The elements do
exactly one thing or achieve one goal.
A module exhibits one of these forms of cohesion depending on the skill of the designer. However,
functional cohesion is generally accepted as the best form
of cohesion in software design. Functional cohesion is the
most desirable because it performs exactly one action or
achieves a single goal. Such a module is highly reusable,
relatively easy to understand (because you know what it
does) and is maintainable. In this paper, the term
“cohesion” refers to functional cohesion. Several measures
of cohesion have been defined in both the procedural and
object-oriented paradigms. Most of the cohesion measures
defined in the object-oriented paradigm are inspired from
the Lack of Cohesion in methods (LCOM) metric defined
by Chidamber and Kemerer. In this paper, the Lack of
Cohesion in methods (LCOM) metric is pedagogically
evaluated and discussed with empirical data. The rest of
the paper is organized as follows: Section 2 presents a
summary of the approaches to measuring cohesion in
procedural and object-oriented programs. Section 3
examines the Chidamber and Kemerer LCOM metric.
Section 4 present the empirical study of LCOM with three
Java based industrial software systems. Section 5 presents
the result of the study upon which the LCOM metric was
evaluated. Section 6 concludes the paper by suggesting
that Chidamber and Kemerer’s LCOM metric measures
cohesiveness.
2. Measuring Cohesion in Procedural and
Object oriented Programs
2.1
Measuring cohesion in procedural programs
Procedural programs are those with procedure and data
declared independently. Examples of purely procedure
oriented languages include C, Pascal, Ada83, Fortran and
so on. In this case, the module is a procedure and an
element is either a global value which is visible to all the
modules or a local value which is visible only to the
module where it is declared. As noted in [2], the approaches taken to measure the cohesiveness of this kind of program have generally tried to evaluate cohesion on a procedure-by-procedure basis, and the notional measure is one of the "functional strength" of a procedure, meaning the degree to which data and procedures contribute to performing its basic function. In other words, the complexity is defined in the control flow. Among the best-known measures of cohesion in the procedural paradigm are those discussed in [3] and [4].
2.2 Measuring cohesion in object-oriented systems
In the Object Oriented languages, the complexity is
defined in the relationship between the classes and their
methods. Several measures exist for measuring cohesion
in Object-Oriented systems [7,8,9,10,11,12]. Most of the
existing cohesion measures in the object-oriented
paradigm are inspired from the Lack of Cohesion in
Methods (LCOM ) metric [5,6]. Some examples include
LCOM3, Connectivity model, LCOM5, Tight Class
Cohesion (TCC), and Low Class Cohesion (LCC), Degree
of Cohesion in class based on direct relation between its
public methods (DCD) and that based on indirect methods
(DCI), Optimistic Class cohesion (OCC) and Pessimistic
Class Cohesion (PCC).
3. The Lack of Cohesion in Methods (LCOM) Metric.
The LCOM metric is based on the number of disjoint sets
of instance variables that are used by the method. Its
definition is given as follows [5,6].
Definition 1.
Consider a class C1 with n methods M1, M2, …, Mn. Let {Ii} be the set of instance variables used by method Mi. There are n such sets {I1}, …, {In}. Let P = { (Ii, Ij) | Ii ∩ Ij = ∅ } and Q = { (Ii, Ij) | Ii ∩ Ij ≠ ∅ }. If all n sets {I1}, …, {In} are ∅, then let P = ∅.
LCOM = |P| − |Q|, if |P| > |Q|
     = 0, otherwise
Example: Consider a class C with three methods M1, M2 and M3. Let {I1} = {a, b, c, d, e}, {I2} = {a, b, e} and {I3} = {x, y, z}. The intersection {I1} ∩ {I2} is non-empty, but {I1} ∩ {I3} and {I2} ∩ {I3} are empty. LCOM is the number of empty intersections minus the number of non-empty intersections, which in this case is 2 − 1 = 1.
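To make Definition 1 concrete, the following short Java sketch (an illustration of ours, not code from [5,6]; the class and variable names are invented for this example) counts the pairs in P and Q and applies the definition. For the sets {I1}, {I2} and {I3} above it returns 1.

import java.util.*;

public class LcomDefinition1 {
    // LCOM per Definition 1: |P| - |Q| if |P| > |Q|, else 0, where P are the
    // method pairs with disjoint instance-variable sets and Q are the method
    // pairs whose instance-variable sets intersect.
    static int lcom(List<Set<String>> instanceVarsPerMethod) {
        // Special case from Definition 1: if no method uses any instance
        // variable, P is taken to be the empty set, so LCOM = 0.
        boolean allEmpty = instanceVarsPerMethod.stream().allMatch(Set::isEmpty);
        if (allEmpty) return 0;

        int p = 0, q = 0;
        for (int i = 0; i < instanceVarsPerMethod.size(); i++) {
            for (int j = i + 1; j < instanceVarsPerMethod.size(); j++) {
                Set<String> shared = new HashSet<>(instanceVarsPerMethod.get(i));
                shared.retainAll(instanceVarsPerMethod.get(j));
                if (shared.isEmpty()) p++; else q++;
            }
        }
        return (p > q) ? p - q : 0;
    }

    public static void main(String[] args) {
        List<Set<String>> sets = List.of(
                Set.of("a", "b", "c", "d", "e"),   // I1, used by M1
                Set.of("a", "b", "e"),             // I2, used by M2
                Set.of("x", "y", "z"));            // I3, used by M3
        System.out.println(lcom(sets));            // prints 1
    }
}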
The theoretical basis of LCOM uses the notion of degree
of similarity of methods. The degree of similarity of two
methods M1 and M2 in class C1 is given by:
σ(M1, M2) = {I1} ∩ {I2}
where {I1} and {I2} are the sets of instance variables used by M1 and M2. LCOM is a count of the number of method pairs whose similarity is empty (i.e., σ(M1, M2) = ∅) minus the count of method pairs whose similarity is non-empty. The larger the number of similar methods, the more
cohesive the class, which is consistent with the traditional
notions of cohesion that measure the interrelatedness
between portions of a program. If none of the methods of
a class display any instance behaviour, i.e. do not use any
instance variables, they have no similarity and the LCOM
value for the class will be zero. The LCOM value provides
a measure of the relative disparate nature of methods in
the class. A smaller number of disjoint pairs (elements of
set P) implies greater similarity of methods. LCOM is
intimately tied to the instance variables and methods of a
class, and therefore is a measure of the attributes of an
object class.
In Definition 1 it is not stated whether inherited methods and attributes are included or not. Hence, a refinement is provided as follows [14]:

Definition 2.
Let P = ∅, if AR(m) = ∅ for all m ∈ MI(c); otherwise
P = { {m1, m2} | m1, m2 ∈ MI(c) ∧ m1 ≠ m2 ∧ AR(m1) ∩ AR(m2) ∩ AI(c) = ∅ }.
Let Q = { {m1, m2} | m1, m2 ∈ MI(c) ∧ m1 ≠ m2 ∧ AR(m1) ∩ AR(m2) ∩ AI(c) ≠ ∅ }.
Then LCOM2(c) = |P| − |Q|, if |P| > |Q|
              = 0, otherwise
where MI(c) denotes the methods implemented in class c, AI(c) the attributes (or instance variables) of class c, and AR(m) the attributes referenced by method m. In this definition, only methods implemented in class c are considered, and only references to attributes implemented in class c are counted.

The definition of LCOM2 has been widely discussed in the literature [6,9,11,14,16]. The LCOM2 values of many classes are set to zero even though different cohesion values would be expected.

3.1 Remarks

In general, the Lack of Cohesion in Methods (LCOM) metric measures the dissimilarity of methods in a class through their instance variables or attributes. Chidamber and Kemerer's interpretation of the metric is that LCOM = 0 indicates a cohesive class. For LCOM > 0, instance variables belong to disjoint sets, and such a class may be split into two or more classes to make it cohesive.

Consider the case of n sequentially linked methods, as shown in Fig. 3.1, where the n methods M1, M2, M3, …, Mn are sequentially linked by shared instance variables, that is, each method shares instance variables only with its immediate neighbour.
Fig. 3.1. n sequentially linked methods.

In this special case of sequential cohesion:

P = C(n, 2) − (n − 1)                      (1)
Q = n − 1                                  (2)

so that

LCOM = P − Q = [C(n, 2) − 2(n − 1)]+       (3)

where C(n, 2) = n(n − 1)/2 is the number of method pairs and [k]+ equals k if k > 0 and 0 otherwise [8]. From (1) and (2),

P − Q = C(n, 2) − (n − 1) − (n − 1)
      = C(n, 2) − 2(n − 1)
      = n!/((n − 2)! 2!) − 2(n − 1)        (4)

From (4), for n < 5, LCOM = 0, indicating that classes with fewer than 5 methods are equally cohesive. For n ≥ 5, LCOM > 1, suggesting that classes with 5 or more methods need to be split [8,18].
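As a quick check of equations (1)-(4), the small Java sketch below (again our own illustration, not taken from [8] or [18]) evaluates the closed form [C(n, 2) − 2(n − 1)]+ for several chain lengths and shows that the value is 0 for n = 2, 3, 4 and becomes positive from n = 5 onwards.

public class SequentialChainLcom {
    // LCOM of n sequentially linked methods per equation (3):
    // [C(n,2) - 2(n-1)]+  where [k]+ = max(k, 0).
    static int lcomOfChain(int n) {
        int pairs = n * (n - 1) / 2;      // C(n,2): all method pairs
        int sharing = n - 1;              // Q: neighbouring pairs share a variable
        int value = pairs - 2 * sharing;  // |P| - |Q| = (pairs - sharing) - sharing
        return Math.max(value, 0);
    }

    public static void main(String[] args) {
        for (int n = 2; n <= 7; n++) {
            System.out.println("n = " + n + "  LCOM = " + lcomOfChain(n));
        }
        // Prints 0, 0, 0, 2, 5, 9 for n = 2 .. 7
    }
}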
3.2 Class design and LCOM computation

Figure 3.2 presents a class x, written in C++, in which the methods f, g and h are linked to the instance variables they use (A, B, C, D, E, F):

Class x {
    int A, B, C, D, E, F;
    void f() { … uses A, B, C … }
    void g() { … uses D, E … }
    void h() { … uses E, F … }
}

Fig. 3.2. Class design showing LCOM computation. Source: [8]

The Lack of Cohesion in Methods (LCOM) for class x is 1, calculated as follows. Let P be the pairs of methods without shared instance variables, and Q be the pairs of methods with shared instance variables; then LCOM = |P| − |Q| if |P| > |Q|, otherwise (when the difference is zero or negative) LCOM is set to zero. There are two pairs of methods accessing no common instance variables, namely <f, g> and <f, h>; hence |P| = 2. One pair of methods shares a variable (E), namely <g, h>; hence |Q| = 1. Therefore, LCOM = 2 − 1 = 1.
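The same result can be checked mechanically by feeding the instance-variable usage of class x to the Definition 1 sketch given earlier in Section 3 (the LcomDefinition1 helper is our illustrative code, not part of [8]; the names mirror the figure):

import java.util.*;

public class ClassXExample {
    public static void main(String[] args) {
        // Instance-variable usage of class x from Fig. 3.2:
        // f -> {A, B, C}, g -> {D, E}, h -> {E, F}
        List<Set<String>> usage = List.of(
                Set.of("A", "B", "C"),   // f()
                Set.of("D", "E"),        // g()
                Set.of("E", "F"));       // h()
        // Pairs with no shared variables: <f,g>, <f,h>  -> |P| = 2
        // Pairs sharing a variable:       <g,h>         -> |Q| = 1
        System.out.println(LcomDefinition1.lcom(usage)); // prints 1
    }
}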
3.3 Critique of the LCOM metric

The LCOM metric has been criticized for not satisfying all the desirable properties of cohesion measures. For instance, LCOM values are not normalized [11,13]; a method for normalizing the LCOM metric has been proposed in [18,19]. It has also been observed that the LCOM metric is not able to distinguish between the structural cohesiveness of two classes, that is, the way in which their methods share instance variables [8]. Hence, a connectivity metric to be used in conjunction with the LCOM metric was proposed; the value of the connectivity metric always lies between 0 and 1 [8].
4. The Empirical Study

4.1 The Method
Chidamber and Kemerer's suite of metrics, namely Lack of Cohesion in Methods (LCOM), Coupling Between Object classes (CBO), Response For a Class (RFC), Weighted Methods per Class (WMC), Depth of Inheritance (DIT) and Number of Children (NOC), was used in the study. Two other metrics used in this experiment, which are not part of the Chidamber and Kemerer suite, are Number of Public Methods (NPM) and Afferent Coupling (CA). The choice of these metrics is informed by the need to have a metric that measures the number of public methods in a class as well as one that measures the number of other classes using a specific class. All the metrics used in this study provide the variables required for the experiments, and the tools for measuring them were readily available for use. In addition, Chidamber and Kemerer's set of measures seems to be the widely accepted basic set of object-oriented measures [15]. Specifically, cohesion was measured using the LCOM metric; coupling was measured using CBO, RFC and CA; size was measured using WMC and NPM; and inheritance was measured using DIT. Descriptive statistics were used to analyze the results.

4.2 Description of variables
Table 1 below shows the variables used in the test systems. The metric of paramount interest is LCOM, although CBO, RFC, CA, WMC, NPM, NOC and DIT values were also obtained in order to verify whether there are significant correlations between these metrics and LCOM.

Table 1: Metric variables used in the experiment
Metric  Meaning                       Attribute
LCOM    Lack of Cohesion in Methods   Cohesion
CBO     Coupling Between Objects      Coupling
RFC     Response For a Class          Coupling
CA      Afferent Coupling             Coupling
WMC     Weighted Methods per Class    Size
NPM     Number of Public Methods      Size
NOC     Number of Children            Inheritance
DIT     Depth of Inheritance          Inheritance

• LCOM (Lack of Cohesion in Methods), as defined in Section 3.
• CBO (Coupling Between Object classes). A class is coupled to another if methods of one class use methods or attributes of the other, or vice versa. CBO for a class is then defined as the number of other classes to which it is coupled, including inheritance-based coupling.
• RFC (Response set For a Class). The response set of a class consists of the set M of methods of the class and the set of methods directly or indirectly invoked by methods in M. In other words, the response set is the set of methods that can potentially be executed in response to a message received by an object of that class; RFC is the number of methods in the response set of the class.
• NPM (Number of Public Methods). The NPM metric counts all the methods in a class that are declared public. It can be used to measure the size of the Application Program Interface (API) provided by a package [17].
• CA (Afferent Coupling). A class's afferent coupling is a measure of how many other classes use the specific class [17].
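The paper does not spell out how the raw metric values were exported from the measurement tool. Assuming a ckjm-style output format [17], in which each line lists a class name followed by its WMC, DIT, NOC, CBO, RFC, LCOM, CA and NPM values, a minimal Java sketch for loading the per-class rows used in the analysis could look as follows (illustrative only; the type and field names are our own):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class MetricRow {
    // One row of per-class measurements used in the study.
    final String className;
    final double wmc, dit, noc, cbo, rfc, lcom, ca, npm;

    MetricRow(String[] f) {
        className = f[0];
        wmc = Double.parseDouble(f[1]);
        dit = Double.parseDouble(f[2]);
        noc = Double.parseDouble(f[3]);
        cbo = Double.parseDouble(f[4]);
        rfc = Double.parseDouble(f[5]);
        lcom = Double.parseDouble(f[6]);
        ca = Double.parseDouble(f[7]);
        npm = Double.parseDouble(f[8]);
    }

    // Reads whitespace-separated rows ("class WMC DIT NOC CBO RFC LCOM CA NPM")
    // from standard input, skipping lines that do not have all nine fields.
    public static List<MetricRow> readAll() throws IOException {
        List<MetricRow> rows = new ArrayList<>();
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        for (String line; (line = in.readLine()) != null; ) {
            String[] f = line.trim().split("\\s+");
            if (f.length >= 9) rows.add(new MetricRow(f));
        }
        return rows;
    }
}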
5. Results and Discussion

The results of applying a Chidamber and Kemerer metrics tool in the experimental study of the selected test systems, consisting of 1472 Java classes from three different industrial systems, are presented in this section. Descriptive statistics are used to analyze and interpret the results.

5.1 Descriptive statistics of the test systems
Descriptive statistics were used to obtain the minimum, maximum, mean, median and standard deviation values for the test systems, as shown in Tables 2-4. In the case of the cohesion measurement, the LCOM value lies in a range [0, maximum]. From Chidamber and Kemerer's interpretation of their LCOM metric, a class is cohesive if its LCOM = 0. Using descriptive statistics, a median value in this range shows the level of cohesiveness in the system; it also means that at least half of the classes in the system are cohesive. The actual number of cohesive classes and their percentages, based on the number of classes in each test system, were obtained from a simple frequency count of cohesive classes in each test system.

In this experiment we applied a normalized interpretation of LCOM, i.e. [0, 1]: systems exhibiting high cohesion show low median values within [0, 1]. From Chidamber and Kemerer's view, a median value of 0 indicates cohesive classes; however, a median value of 1 is low enough to be considered cohesive. A minimum value indicates the lowest LCOM value for the classes being measured; if this value is zero, the class is cohesive according to the LCOM interpretation. A maximum value indicates the highest LCOM value for a class. Using Chidamber and Kemerer's metric, the LCOM value of a class can be any value from zero upwards [0, 1, 2, 3, …, 222, …, 6789, …, maximum]. The presence of such arbitrarily large values makes the LCOM metric not really appealing to most practitioners, because a cohesion metric should not generate values which are not standardized (normalized). Chidamber and Kemerer's position, however, is that classes whose LCOM > 0 are improperly designed classes and as such could be split into two or more classes to make them cohesive. The presence of outliers and unstandardized (un-normalized) values for LCOM is still a shortcoming of the Chidamber and Kemerer LCOM metric. Using descriptive statistics, a maximum LCOM value indicates the value of the highest outlier in the measured system, and there could be more outliers within. Descriptive statistics for the test systems are shown in Tables 2-4, and Table 5 provides the combined descriptive statistics for the cohesion comparison across the test systems.

Table 2: Descriptive statistics for system 1
Statistic     WMC    DIT   NOC   CBO   RFC    LCOM    CA    NPM
N (valid)     34     34    34    34    34     34      34    34
N (missing)   3      3     3     3     3      3       3     3
Mean          7.88   1.41  0.41  5.59  22.21  79.03   4.21  6.12
Median        4.00   1.00  0.00  5.00  18.50  0.00    2.00  4.00
Std. dev.     13.30  0.50  1.52  5.87  24.23  440.22  4.78  8.63
Min           0      1     0     0     0      0       0     0
Max           74     2     8     31    129    2531    22    45

Table 3: Descriptive statistics for system 2
Statistic     WMC    DIT   NOC   CBO    RFC    LCOM    CA     NPM
N (valid)     383    383   383   383    383    383     383    383
N (missing)   0      0     0     0      0      0       0      0
Mean          8.24   2.14  0.58  8.33   20.93  150.40  5.70   6.97
Median        3.00   2.00  0.00  5.00   10.00  1.00    3.00   2.00
Std. dev.     19.16  1.16  3.00  20.07  31.14  1318.3  14.02  18.63
Min           0      1     0     0      0      0       0      0
Max           118    5     36    195    256    16290   157    181

Table 4: Descriptive statistics for system 3
Statistic     WMC    DIT   NOC   CBO   RFC    LCOM    CA    NPM
N (valid)     1055   1055  1055  1055  1055   1055    1055  1055
N (missing)   0      0     0     0     0      0       0     0
Mean          7.96   1.42  0.36  6.25  26.49  44.91   1.69  5.91
Median        5.00   1.00  0.00  3.00  16.00  6.00    0.00  4.00
Std. dev.     9.40   0.62  3.00  7.55  30.93  180.45  5.83  6.82
Min           0      1     0     0     0      0       0     0
Max           109    4     64    65    210    2744    71    61
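The descriptive statistics and frequency counts reported in this section can be reproduced with a short routine such as the sketch below (an illustration of the procedure described above, not the authors' actual script); it computes the minimum, maximum, mean, median and sample standard deviation of a system's LCOM values and counts classes with LCOM = 0 as cohesive, following Chidamber and Kemerer's interpretation.

import java.util.Arrays;

public class LcomDescriptiveStats {
    // Summarizes the LCOM values of one test system and counts the
    // cohesive classes (LCOM = 0) as described in Section 5.1.
    public static void summarize(double[] lcomValues) {
        double[] sorted = lcomValues.clone();
        Arrays.sort(sorted);
        int n = sorted.length;

        double sum = 0;
        long cohesive = 0;
        for (double v : sorted) {
            sum += v;
            if (v == 0) cohesive++;
        }
        double mean = sum / n;

        double sqDiff = 0;
        for (double v : sorted) sqDiff += (v - mean) * (v - mean);
        double stdDev = (n > 1) ? Math.sqrt(sqDiff / (n - 1)) : 0;   // sample std. dev.

        double median = (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;

        System.out.printf("N=%d min=%.2f max=%.2f mean=%.2f median=%.2f stdDev=%.2f%n",
                n, sorted[0], sorted[n - 1], mean, median, stdDev);
        System.out.printf("cohesive=%d (%.1f%%) uncohesive=%d (%.1f%%)%n",
                cohesive, 100.0 * cohesive / n,
                n - cohesive, 100.0 * (n - cohesive) / n);
    }
}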
5.2 Cohesion comparisons across systems
Table 5 below shows the comparison of the cohesion measures across the three test systems. The actual numbers of cohesive and uncohesive classes per system and their percentages are indicated, together with the descriptive statistics of LCOM for each system. A median value in the range [0, 1) indicates that the system is cohesive.

Table 5: Cohesion comparison across the test systems
System     No. of classes  Cohesive     Uncohesive   LCOM descriptive statistics
System 1   34              26 (78.8%)   7 (21.2%)    Min 0, Max 2534, Mean 79.03, Median 0.00, Std. dev. 440.22
System 2   383             207 (54%)    176 (46%)    Min 0, Max 16290, Mean 150.40, Median 1.00, Std. dev. 1318.3
System 3   1055            317 (30%)    738 (70%)    Min 0, Max 2744, Mean 44.91, Median 6.00, Std. dev. 180.45
Total      1472

5.3 Discussion
Following Chidamber and Kemerer's guide to interpreting their LCOM metric using descriptive statistics (minimum, maximum, mean, median, standard deviation) [6], a low median value indicates that at least 50% of the classes have cohesive methods. In their original work this low median value was 0; however, a median value of 1 (also low) was considered in this work. In the experimental results of Tables 2-4, it was observed that the LCOM median values for systems 1 and 2 are 0.00 and 1.00 respectively. Hence these systems are considered to have more cohesive classes than system 3, whose LCOM median value is 6.00. To confirm this, a simple frequency count of cohesive and uncohesive classes was carried out to find the actual percentages, as shown in Table 5. However, Chidamber and Kemerer's view that a class is not cohesive when LCOM = 1 does not seem appropriate, as there is no reason to suggest that classes with LCOM = 1 are improperly designed. Since the LCOM metric is an inverse cohesion measure (a low value indicates high cohesion and vice versa [14]), suppose for illustration that one class c1 has LCOM(c1) = 0 and another class c2 has LCOM(c2) = 1. This should simply mean that c1 is more cohesive than c2; it should not
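The shortcoming of unbounded LCOM values discussed above is what normalized variants address. The fragment below sketches only the general idea of bounding a pair-based lack-of-cohesion value to [0, 1] by dividing the number of disjoint method pairs by the total number of method pairs; it is an assumption-laden illustration and is not claimed to be the normalization actually defined in [19].

public class NormalizedLcomSketch {
    // Illustrative normalization only: maps a pair-count based lack of
    // cohesion into [0, 1] by dividing the number of method pairs that
    // share no instance variable (|P|) by the total number of method pairs.
    // This is NOT necessarily the normalization proposed in [19].
    static double normalizedLackOfCohesion(int disjointPairs, int totalPairs) {
        if (totalPairs == 0) return 0.0;          // fewer than two methods
        return (double) disjointPairs / totalPairs;
    }

    public static void main(String[] args) {
        // Class x from Fig. 3.2: 3 methods give 3 pairs, 2 of them disjoint,
        // so the illustrative normalized value is 2/3 (un-normalized LCOM is 1).
        System.out.println(normalizedLackOfCohesion(2, 3));
    }
}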
be interpreted to mean that c2, with LCOM(c2) = 1, is not cohesive and therefore may be split.

6. Conclusion

In this paper, the concept of cohesion in both the procedural and object-oriented paradigms has been extensively discussed. It is suggested that Chidamber and Kemerer's Lack of Cohesion in Methods (LCOM) metric does measure cohesiveness. However, the presence of outliers and unstandardized values makes the metric not as appealing as its variant measures whose cohesion descriptive-statistics values are standardized (normalized); a normalized LCOM metric has already been proposed in [19]. The metric may be used to predict improperly designed classes, especially when the LCOM metric is used with reference to the Number of Public Methods (NPM) being greater than or equal to five (NPM ≥ 5) [18,5,6]. Cohesion as an attribute of software, when properly measured, serves as a guiding principle in the design of good software which is easy to maintain and whose components are reusable [3,4,5,18].
References
[1] E. Yourdon and L. Constantine, Structured Design: Fundamentals of a Discipline of Computer Program and Systems Design, Englewood Cliffs, New Jersey: Prentice-Hall, 1979.
[2] M. F. Shumway, "Measuring Class Cohesion in Java", Master of Science Thesis, Department of Computer Science, Colorado State University, Technical Report CD-97-113, 1997.
[3] J. M. Bieman and L. M. Ott, "Measuring Functional Cohesion", IEEE Transactions on Software Engineering, vol. 20, no. 8, pp. 644-658, August 1994.
[4] J. M. Bieman and B. K. Kang, "Measuring Design-level Cohesion", IEEE Transactions on Software Engineering, vol. 24, no. 2, pp. 111-124, February 1998.
[5] S. R. Chidamber and C. F. Kemerer, "Towards a Metrics Suite for Object Oriented Design", Object Oriented Programming Systems, Languages and Applications, Special Issue of SIGPLAN Notices, vol. 26, no. 10, pp. 197-211, October 1991.
[6] S. R. Chidamber and C. F. Kemerer, "A Metrics Suite for Object Oriented Design", IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476-493, June 1994.
[7] W. Li and S. Henry, "Object-Oriented Metrics that Predict Maintainability", Journal of Systems and Software, vol. 23, pp. 111-122, February 1993.
[8] M. Hitz and B. Montazeri, "Chidamber and Kemerer's Metrics Suite: A Measurement Theory Perspective", IEEE Transactions on Software Engineering, vol. 22, no. 4, pp. 267-270, April 1996.
[9] B. Henderson-Sellers, Software Metrics, U.K.: Prentice Hall, 1996.
[10] J. M. Bieman and B. K. Kang, "Cohesion and Reuse in an Object Oriented System", in Proceedings of the Symposium on Software Reusability (SSR '95), Seattle, WA, pp. 259-262, April 1995.
[11] L. Badri and M. Badri, "A Proposal of a New Class Cohesion Criterion: An Empirical Study", Journal of Object Technology, vol. 3, no. 4, pp. 145-159, April 2004.
[12] H. Aman, K. Yamasaki, and M. Noda, "A Proposal of Class Cohesion Metrics Using Sizes of Cohesive Parts", in Knowledge-Based Software Engineering, T. Welzer et al., Eds., IOS Press, pp. 102-107, September 2002.
[13] B. S. Gupta, "A Critique of Cohesion Measures in the Object Oriented Paradigm", M.S. Thesis, Department of Computer Science, Michigan Technological University, iii + 42pp, 1997.
[14] L. C. Briand, J. Daly and J. Wust, "A Unified Framework for Cohesion Measurement in Object-Oriented Systems", Empirical Software Engineering, vol. 3, no. 1, pp. 67-117, 1998.
[15] H. Zuse, A Framework of Software Measurement, New York: Walter de Gruyter, 1998.
[16] V. R. Basili, L. C. Briand, and W. Melo, "A Validation of Object-Oriented Design Metrics as Quality Indicators", IEEE Transactions on Software Engineering, vol. 22, no. 10, pp. 751-761, 1996.
[17] D. Spinellis, "Ckjm - A Tool for Calculating Chidamber and Kemerer Java Metrics", http://www.spinellis.gr/sw/ckjm/doc/indexw.html
[18] E. U. Okike, "Measuring Class Cohesion in Object-Oriented Systems Using Chidamber and Kemerer Metrics and Java as Case Study", Ph.D. Thesis, Department of Computer Science, University of Ibadan, xvii + 133pp, 2007.
[19] E. U. Okike and A. Osofisan, "An Evaluation of Chidamber and Kemerer's Lack of Cohesion in Methods (LCOM) Metric Using Different Normalization Approaches", Afr. J. Comp. & ICT, vol. 1, no. 2, pp. 35-54, ISSN 2006-1781, 2008.
Ezekiel U. Okike received the BSc degree in computer science from the University of Ibadan, Nigeria, in 1992, the Master of Information Science (MInfSc) in 1995 and the PhD in computer science in 2007, all from the same university. He has been a lecturer in the Department of Computer Science, University of Ibadan, since 1999. Since September 2008 he has been on leave as a senior lecturer and Dean of the School of Computer Studies, Kampala International University, Uganda. His current research interests are in the areas of software engineering, software metrics, compilers and programming languages. He is a member of the IEEE Computer and Communication Societies.
IJCSI CALL FOR PAPERS SEPTEMBER 2010 ISSUE
Volume 7, Issue 5
The topics suggested by this issue can be discussed in terms of concepts, surveys, state of the art, research, standards, implementations, running experiments, applications, and industrial case studies. Authors are invited to submit complete unpublished papers, which are not under review in any other conference or journal, in the following (but not limited to) topic areas.
See authors guide for manuscript preparation and submission guidelines.
Accepted papers will be published online and authors will be provided with printed
copies and indexed by Google Scholar, Cornell’s University Library,
ScientificCommons, CiteSeerX, Bielefeld Academic Search Engine (BASE), SCIRUS
and more.
Deadline: 31st July 2010
Notification: 31st August 2010
Revision: 10th September 2010
Online Publication: 30th September 2010
• Evolutionary computation
• Industrial systems
• Autonomic and autonomous systems
• Bio-technologies
• Knowledge data systems
• Mobile and distance education
• Intelligent techniques, logics, and systems
• Knowledge processing
• Information technologies
• Internet and web technologies
• Digital information processing
• Cognitive science and knowledge agent-based systems
• Mobility and multimedia systems
• Systems performance
• Networking and telecommunications
• Software development and deployment
• Knowledge virtualization
• Systems and networks on the chip
• Context-aware systems
• Networking technologies
• Security in network, systems, and applications
• Knowledge for global defense
• Information Systems [IS]
• IPv6 Today - Technology and deployment
• Modeling
• Optimization
• Complexity
• Natural Language Processing
• Speech Synthesis
• Data Mining
For more topics, please see http://www.ijcsi.org/call-for-papers.php
All submitted papers will be judged based on their quality by the technical committee and
reviewers. Papers that describe research and experimentation are encouraged.
All paper submissions will be handled electronically and detailed instructions on submission
procedure are available on IJCSI website (www.IJCSI.org).
For more information, please visit the journal website (www.IJCSI.org)
© IJCSI PUBLICATION 2010
www.IJCSI.org
IJCSI
The International Journal of Computer Science Issues (IJCSI) is a refereed journal for scientific papers dealing with any area of computer science research. The purpose of establishing the journal is to assist the development of science through the fast publication and archiving of research materials and results, and the representation of the scientific conception of the society.
It also provides a venue for researchers, students and professionals to submit ongoing research and developments in these areas. Authors are encouraged to contribute to the journal by submitting articles that illustrate new research results, projects, surveying works and industrial experiences that describe significant advances in the field of computer science.
Indexing of IJCSI:
1. Google Scholar
2. Bielefeld Academic Search Engine (BASE)
3. CiteSeerX
4. SCIRUS
5. Docstoc
6. Scribd
7. Cornell’s University Library
8. SciRate
9. ScientificCommons
© IJCSI PUBLICATION
www.IJCSI.org