A WEB-BASED ELECTRONIC FILING SYSTEM USING CONVERSION OF IMAGE FILE TO TEXT FILE APPROACH

YOUSIF NABEIL YOUSIF

DISSERTATION SUBMITTED IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF COMPUTER SCIENCE

FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY
UNIVERSITY OF MALAYA
KUALA LUMPUR

MARCH 2010

DEDICATION

This work is dedicated to my beloved parents and my beloved wife.

Acknowledgment

The author wishes to extend his grateful appreciation to all those who have contributed directly and indirectly to the preparation of this thesis. In particular, the author thanks Doctor Norizan Mohd Yasin, Project Supervisor, for her advice, guidance and encouragement throughout the preparation of this thesis. Special thanks go to the Panel of Assessors, whose reviews, assessments and comments contributed significantly to the betterment of the thesis. Finally, the author expresses his sincere thanks to his family members and his best friends Gassan and Samih for the encouragement, inspiration and patience which they provided at every step during this course of studies.

Abstract

The purpose of this thesis is to develop a document management system for the departments in the Faculty of Computer Science and Information Technology (FCSIT) at the University of Malaya. The system makes the administration and management of students' files more efficient, and it can also be used in other departments of the faculty or university. The system developed is called the Electronic Filing System (EFS). It covers the scanning, storing, indexing, archiving, retrieval, and accessing of original documents. The EFS also helps users save time when searching for documents, and it protects documents from being lost or damaged by disasters such as fire. The system also increases the productivity of FCSIT users and enhances the efficiency of using information and communication technology.
This study employs a qualitative research method that includes observation, document analysis and interviews for the data collection process. The findings of the data analysis are used as the functional requirements for developing the system.

TABLE OF CONTENTS

Acknowledgment
Abstract
List of Figures
List of Tables

Chapter 1: Introduction
1.1 What is Paper Document
    1.1.1 What is Document Management
1.2 What is Electronic Document Management System
    1.2.1 What is Electronic Filing System
1.3 Problem Background
1.4 Objective
1.5 Expected Outcome
1.6 Project Scope
1.7 Research Significance
1.8 Research Methodology
1.9 Thesis Layout
1.10 Summary

Chapter 2: Literature Review
2.1 The Definition of Document and Document Management
2.2 Different Types of Documents
2.3 The Uses of Paper Files and Electronic Files
2.4 Document Management System
2.5 Elements of Document Management
2.6 Centralized Filing System
    2.6.1 Benefits of Centralized Filing System
2.7 Electronic Document Management Systems
    2.7.1 Advantages of EDMS
2.8 Neural Networks
    2.8.1 What is Neural Networks
    2.8.2 Neural Networks Types
        The First Type is Perceptron
        The Second Type is Multi-Layer-Perceptron
        The Third Type is Back Propagation Net
2.9 Self-Organizing Map
    2.9.1 What is Self Organizing Map
2.10 Document Management Application - Literature Review
2.11 Summary

Chapter 3: Research Methodology
3.1 Research Methodology
3.2 Data Collection Process
    3.2.1 Observation
    3.2.2 Interviews
3.3 System Development Methodology Approach
    3.3.1 Waterfall Model - Introduction
    3.3.2 Task Regions of Waterfall Model
        Requirement Gathering and Analysis
        System Design
        Coding
        Testing
        Installation
        Maintenance
3.4 Justification of Methodology Selection
3.5 Summary

Chapter 4: Case Study
4.1 University of Malaya
4.2 Quality Management & Enhancement Centre (QMEC)
    4.2.1 Documentation Section
    4.2.2 Internal Quality Audit Section
    4.2.3 Training & Awareness Section
    4.2.4 Quality Management Section
    4.2.5 Customers Feedback & Continuous Improvement Section
4.3 Faculty of Computer Science and Information Technology (FCSIT)
4.4 Office of FCSIT
4.5 Staff of FCSIT
4.6 Students of FCSIT
4.7 Research Unit of Analysis
4.8 The Current Document Managing System in FCSIT
4.9 Current System Drawbacks
4.10 Summary

Chapter 5: Data Collection and Analysis
5.1 The Answers to the Interview Questions
5.2 Observation
5.3 Challenges of Current Systems
5.4 General System Requirements
5.5 Summary

Chapter 6: System Implementation and Design
6.1 System Requirements
    6.1.1 General Requirements
    6.1.2 User Management Requirements
    6.1.3 Functional Requirements
    6.1.4 Non-Functional Requirements
6.2 Systems Development Consideration
    6.2.1 System Environment
    6.2.2 Programming Language and Development Tools
        PHP Programming Language
        MySQL Database System
6.3 Development Tools
    6.3.1 PHP Designer 2007 Personal
    6.3.2 MySQL Query Browsers
    6.3.3 MySQL Administrator
6.4 System Design
    6.4.1 Interface Module
    6.4.2 Administrator Module
        Administrator: Manage Users
        Administrator: Update User Account
        Administrator: Updating Screen
        Administrator: Delete User Account
        Administrator: Admission Records
        Delete Student Form
    6.4.3 User Operation
        User Interface Screen
        User: Admission Records Screen
        User: View Student Form Details
        Update Student Form
        User: Printing Students Form
    6.4.4 Other Services
        Java-Base Application Interface
        Document Scanning
        The Scanning Processes
        Document Format
        Scanning and Saving the Scanned Students Registration Forms into the System
        Scanning the Students' Registration Forms
        Browse for the Forms
        Artificial Neural Network
        Definition of Artificial Neural Network
        Definition of Self-Organizing Map
        Form Training
        Training Procedures in Reading the Registration Form
        Sending Email
6.5 System Testing
    6.5.1 Unit Testing
6.6 Summary

Chapter 7: Conclusion
7.1 Project Objectives
7.2 Training Staff and Users
7.3 System Limitation
7.4 Future Enhancements
7.5 Summary

APPENDIX A
References

List of Figures

Figure 2.1 The Activity Profile for 8 Disk Officers and Chiefs
Figure 2.2 Electronic Document Management Systems
Figure 2.3 Perceptron Characteristics
Figure 2.4 Multi-Layer-Perceptron Characteristics
Figure 2.5 Back Propagation Net Characteristics
Figure 2.6 Link Node
Figure 3.1 General Overview of "Waterfall Model"
Figure 4.1 Current Students Registration
Figure 6.1 Use Case of System Architecture
Figure 6.2 Web-Based Interfaces
Figure 6.3 Use Case of Admin Privilege
Figure 6.4 Administrator Management Page
Figure 6.5 Administrator Updating Accounts
Figure 6.6 Administrator Delete Users Accounts
Figure 6.7 Use Cases of Administrator Admission Records
Figure 6.8 Administrators: Admission Records Screen
Figure 6.9 Delete student form from the database
Figure 6.10 Web-Base User Interface Screen
Figure 6.11 Use Case of User: Admission Records Screen
Figure 6.12 Staff: Admission Records
Figure 6.13 Users: Student Form Screen
Figure 6.14 Users: Update Student Form
Figure 6.15 User: Printing the Admission Records Page
Figure 6.16 Users: Printing the Student Form
Figure 6.17 Java-Base Application Interfaces
Figure 6.18 Main Page of the Java-Base Application
Figure 6.19 Student Form
Figure 6.20 New Specially Designed Registrations Form
Figure 6.21 Designs New Registration Form Template
Figure 6.22 Selecting the Scanner
Figure 6.23 Scan process
Figure 6.24 Process of Changing from Image to Text
Figure 6.25 Upload Result Image-to-Texts
Figure 6.26 Browse and Select the Required Registration Form
Figure 6.27 Neural Network Processes
Figure 6.28 First Examples for Training Form
Figure 6.29 Second Examples for Training Form
Figure 6.30 Training Procedures
Figure 6.31 Selecting the Character and Numbers in the Form
Figure 6.32 Sending Student Forms by Email

List of Tables

Table 2.1 Differences Between Free OCR, OmniPage 17 and Simple OCR Systems
Table 3.1 Data Collection Source Evidence
Table 6.1 EFS System Environment
Table 6.2 EFS Hardware Requirement
Table 6.3 Unit Testing for the Entire Administrator User Functionality Module
Table 6.4 Unit Testing for the Entire Staff User Functionality Module

List of Abbreviations

ANN: Artificial Neural Network
CAD: Computer-Aided Design
COLD: Computer Output to Laser Disk
DMS: Document Management System
EDMS: Electronic Document Management System
EFS: Electronic Filing System
FCSIT: Faculty of Computer Science and Information Technology
IDE: Integrated Development Environment
IPS: Institute of Postgraduate Studies
ISC: International Student Centre
IT: Information Technology
OCR: Optical Character Recognition
PHEI: Public Higher Education Institution
QMEC: Quality Management & Enhancement Centre
SOM: Self-Organizing Map
UM: University of Malaya
WM: Waterfall Model

Chapter 1: Introduction

Paper files are one of the most important basics of office work. As the quantity of files involved in administrative transactions grew, people used paper to create and distribute documents. Now, however, they create electronic documents with a word processor, or presentation documents with presentation software, on a personal computer and distribute these documents via computer networks (AbuSafiya & Mazumdar 2004). File storage systems have gradually become more important, especially as the world moves towards computerized systems (Konishi & Ikeda 2007).
Great advances in electronic information technology have made the creation, storage and flow of electronic documents not only feasible but economical, and consequently have led to great increases in productivity. Yet paper documents exist in virtually every office and are involved in most business and non-business processes. Some institutions hold a very large number of files, which require a large storage space. According to AbuSafiya & Mazumdar (2004), this creates an additional problem of managing and administering proper storage. Poor management of documents, for example files left lying on the office floor, can cause health and safety problems: people who share the office space can accidentally kick the files and injure themselves.

Electronic documents are easier and more efficient to manage and administer. They require less storage space because they are stored electronically, and electronic systems enable the rapid creation and distribution of documents. Therefore, it can be speculated that people will eventually replace paper documents with electronic documents and realize a paperless office.

The aim of the electronic filing system, referred to as EFS in this thesis, is to manage documents more efficiently and effectively: firstly by reducing storage space, and secondly by ensuring the safety of the files, since no one can access the system without the correct password, which protects the students' documents. The functions of the EFS include storing, retrieving, reading and printing document files. Changes to the text are allowed with permission. This is discussed in detail in Chapter 6. To store a file, the system takes paper files as input: users scan the paper source document with a scanner, and the system stores it in the allocated space (database) in the system storage.
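The access rules described above (a password is required for every operation, and changing text needs an extra permission) can be illustrated with a minimal sketch. It is written in Python for brevity and is not the thesis implementation, which is described in Chapter 6. The class names, the field names and the can_edit flag are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    # Hypothetical user record; the field names are assumptions.
    name: str
    password: str           # a real system would store a salted hash instead
    can_edit: bool = False  # permission to change stored text


@dataclass
class FilingSystem:
    users: dict = field(default_factory=dict)      # user name -> User
    documents: dict = field(default_factory=dict)  # document id -> text

    def add_user(self, user):
        self.users[user.name] = user

    def _login(self, name, password):
        # Every operation starts with a password check.
        user = self.users.get(name)
        if user is None or user.password != password:
            raise PermissionError("wrong user name or password")
        return user

    def store(self, name, password, doc_id, text):
        self._login(name, password)
        self.documents[doc_id] = text

    def read(self, name, password, doc_id):
        self._login(name, password)
        return self.documents[doc_id]

    def edit(self, name, password, doc_id, new_text):
        # Changing text needs the extra permission on top of the password.
        user = self._login(name, password)
        if not user.can_edit:
            raise PermissionError(f"{name} has no permission to change text")
        self.documents[doc_id] = new_text
```

In this sketch, a user created without can_edit=True can store and read a form but receives a PermissionError when calling edit, while a wrong password blocks every operation.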
This thesis uses the terms user and staff interchangeably to refer to anyone who uses the system.

1.1 What is Paper Document?

A paper document is any source of information, in material form, capable of being used for reference or study or as an authority. Examples of documents include manuscripts, printed matter, illustrations, diagrams and museum specimens.

1.1.1 What is Document Management?

Document management is the process of handling documents in such a way that information can be created, shared, organized and stored efficiently and appropriately (LaMarca et al. 2006).

1.2 What is Electronic Document Management System?

An electronic document management system (EDMS) is a computer system or suite of programs designed to store and track electronic documents and other media.

1.2.1 What is Electronic Filing System?

An electronic filing system holds a plurality of document files, each associated with an identifying mark for that document. It includes an input for entering a retrieval request, in the form of a single document or a multiple registration process.

1.3 Problem Background

Managing documents can be a problem in many organizations. Poor document management can affect the efficiency of an organization. An organization needs systematic procedures for administering and managing its documents. However, this is not always the case, and many organizations still have problems managing their documents. Some institutions have to keep their documents and files for several years before they can be destroyed. This implies that each institution holds numerous paper files, all of which must be kept and take up office space. This creates a storage space problem. In addition, retrieving a document takes time because the file has to be looked for. In the current system, a few users need to work in the archive to search for required files.
The time taken to search for files in the archive also reduces users' productivity and efficiency. In contrast, the new system does not require any dedicated user to sit in front of a computer to search for the required documents or files. Required documents can be found quickly, which increases users' productivity and efficiency at work.

1.4 Objective

The main objectives of this research are:
• To study how files are managed in an organization.
• To investigate the work processes involved in managing files.
• To develop a system that manages the files electronically.
• To develop a program that can read handwritten characters, so that a paper source document can be scanned as an image, manipulated as text, and kept as an input file for later use.

1.5 Expected Outcome

The output of this research is a system that manages documents and files electronically. This system will overcome the problems identified in Section 1.3 by allowing documents to be stored electronically, minimizing the large storage space needed for manual paper files. In addition, storing, organizing and retrieving electronic files is easier and faster. Data input is also faster because the source paper document is scanned with a scanning device and stored as an image in the system.

1.6 Project Scope

This research is intended to develop a system that manages documents for easy storage, organization and retrieval of files. The input to the system is paper documents, which are scanned as images and stored in the computer system. Each image is then converted to text using the neural network method and displayed so that staff can manipulate it just like any other text file.
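To make the image-to-text step more concrete, the following minimal sketch shows only the lookup step that a trained Self-Organizing Map would perform: each scanned glyph is assigned the character of the closest stored prototype. It is a simplified stand-in written in Python, not the trained network used in this thesis. The 3x3 bitmaps, the three prototype characters and the function names are assumptions made for illustration.

```python
# Hypothetical 3x3 binary glyph prototypes (row by row).
# A trained network would learn such weight vectors from training forms.
PROTOTYPES = {
    "I": (0, 1, 0,
          0, 1, 0,
          0, 1, 0),
    "L": (1, 0, 0,
          1, 0, 0,
          1, 1, 1),
    "T": (1, 1, 1,
          0, 1, 0,
          0, 1, 0),
}


def squared_distance(a, b):
    # Squared Euclidean distance between two pixel vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_char(glyph):
    # Return the character whose prototype is closest to the glyph.
    return min(PROTOTYPES, key=lambda ch: squared_distance(PROTOTYPES[ch], glyph))


def image_to_text(glyphs):
    # Convert a sequence of segmented glyph bitmaps into a text string.
    return "".join(best_matching_char(g) for g in glyphs)


# A "T" with one missing pixel is still closest to the "T" prototype.
noisy_t = (1, 1, 1,
           0, 1, 0,
           0, 0, 0)
text = image_to_text([PROTOTYPES["I"], noisy_t])  # "IT"
```

The sketch assumes the scanned form has already been segmented into one bitmap per character; segmentation and training are outside its scope.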
1.7 Research Significance

This research identifies the need to develop and promote a comprehensive Web-based electronic filing system in the Faculty of Computer Science and Information Technology (FCSIT), University of Malaya. Research significance normally applies to two types of users: researchers and practitioners. For researchers, the research provides a good base for further study in the field of electronic filing. It studies in depth the concept, contents and important role of electronic filing in education institutes. On the other hand, for practitioners, the research conducted on the system users identifies their awareness of the current system and their willingness to move from the conventional method of managing a manual file system to the modern method of managing files through a web portal. Moreover, the research replaces work-intensive, space-hogging file cabinets with a fully automated paperless environment. Users can be more productive with the electronic filing system and get an immediate response to a document inquiry by:
• Providing universal access to accurate administrative forms.
• Reducing administrative time and costs for handling students' documents.
• Organizing and saving all the student files in the system.
• Viewing all the students' files that have been stored in the database.
• Using the system with different privileges for the head of department and the staff.

1.8 Research Methodology

The research used a qualitative research methodology. The instruments used to collect data are interviews and observation. The interviews were conducted with staff in the office of the Faculty of Computer Science and Information Technology (FCSIT), University of Malaya. All the interview questions focused on information pertaining to the work process and technical information of the current system and the proposed new system. The qualitative data came from observations of staff using the current system.
The observation data are very important because they provide the technical aspects of the system.

1.9 Thesis Layout

The thesis is divided into several chapters for easy reading.

Chapter 1: Introduction
This first chapter presents the background of the problem, the main objectives of the research and the scope of the research. It also highlights the methodology used to conduct the research.

Chapter 2: Literature Review
This chapter presents a thorough review of previous work done in this field. It includes the models, definitions and techniques used by other researchers.

Chapter 3: Research Methodology
This chapter explains the data collection process, which includes the collection of primary data through interviews and observation, and the secondary data resources used in the research.

Chapter 4: Case Study
This chapter details the case study conducted at the Faculty of Computer Science and Information Technology, University of Malaya.

Chapter 5: Data Analysis and Findings
This chapter analyses the data collected in the case study. The findings of the analysis are then used as a guide to determine the user requirements of the proposed EFS.

Chapter 6: System Implementation and Design
This chapter focuses on the implementation of the electronic filing system. It describes the system design, development, implementation and testing, including the coding of the classes, user interface development, and the creation of the installation package for the system. It also covers in detail the design aspects of the EFS: architecture design, functional design, data format design, user interface design and database design. The functional design is expressed in UML use case diagrams.

Chapter 7: Conclusion
This chapter concludes the research. It also includes the contribution of the research, future enhancements and suggestions to improve the system.
1.10 Summary

In summary, the statements above indicate the intention to build a web-based electronic filing system that can be used by staff in the university offices. The system is used to store students' forms, and it is developed to make it easier for staff to work with students' documents.

Chapter 2: Literature Review

This chapter reviews the existing literature pertaining to document handling and storage. Prior research in this domain is reviewed thoroughly. The capabilities and features of existing document management systems are reviewed and critically analysed. The objective of the literature review is to gain a better understanding of the information systems that have been implemented and are already in use in similar situations. The suitable features of existing systems are examined and considered for incorporation into the proposed system.

2.1 The Definition of Document and Document Management

In general, the word document usually means an information carrier or container (often paper) holding written or drawn information for a particular purpose (Matheu 2005). Usually, a document is a sheet of paper or a collection of papers, for example a memorandum, a piece of correspondence, a mission statement, a receipt for materials or a client statement. Central to the idea of a document is that it can normally be transported, stored and handled as a single unit without difficulty.

Over the last decade, the term document has undergone a radical change in definition. This change is due in part to information technology. Thus, a large part of the documents used in business today are files that are stored on personal computers and handled by operating systems and email systems.
Information technology is now capable of producing a new kind of electronic document, which can house graphics, text, computer-aided design (CAD) drawings, and multimedia objects such as audio or video clips (Zantout & Marir 1999). Documents are processed and stored in electronic form, not as physical objects but as digital ones. The document is no longer the place where words are put on a page, but rather a collection of elements or objects related to a particular topic, brought together as one. Therefore, a new definition of a document emerges in the electronic age.

2.2 Different Types of Documents

a) Paper Document

The consumption of paper, which is usually made of wood fibre, exerts considerable pressure on forest ecosystems around the world. On the face of it, the rise of the computer and the capacity to store documents in electronic form would be expected to reduce paper consumption, which would no doubt be good news for forests (York 2006). However, paper files are still widely used during the life cycle of a document: paper documents are easy to annotate, they provide an inexpensive way to display large amounts of information, they are socially acceptable and work well in meetings and interactions, and paper is very flexible to use.

The paper file has several advantages over the electronic file, such as being able to lie flat on a desk and be read without the assistance of any other tool or software. In addition, folders, files and piles have a number of visually distinguishable attributes: size, location, and the look of the topmost document. Paper is extensively used for document reviewing and note-taking due to its versatility and simplicity (Sellen & Harper 1997). Users annotate printed documents as a means of gathering notes, and the information is easy and fast to locate if the files are properly organized on the shelves by number or in a predetermined pattern.
While the paper file has its own merits, it is not free from defects. Many companies still store hundreds or even thousands of paper documents in filing cabinets or boxes, which requires a huge storage space. The information stored in this manner will also be lost or damaged if there is a fire or natural disaster. Maintaining these files and keeping copies of the material off site can be very expensive, and if the files are not arranged properly it takes a long time to find a particular file (Sellen & Harper 1997).

b) Electronic File

There are many definitions of the electronic file. In this thesis, however, the definition is limited to the following:
• A collection of data stored in a defined electronic format. An electronic file may be a single electronic record or a group or series of transactions.
• A document that is created and stored on a computer.

The electronic file has been improved in order to support the effort to reduce the printing of documents such as letters and images (Matsuo, Nakamura & Tatekawa 2001). Electronic paper is an information terminal that stores letters and images as electronic information that can be copied and edited freely, and it is configured like a familiar book or notebook.

From one point of view, the electronic file is a block of arbitrary information, or a resource for storing information, which is available to software and is usually based on some kind of durable storage. A file is durable in the sense that it remains available for programs to use after the current program has finished. Computer files can be viewed as the modern counterpart of the paper documents that were typically kept in the files of offices and libraries, which are the source of the term.

An electronic file has many benefits. It resolves a conventional problem concerning the display of a document when it is being communicated. Exchanging electronic files is more efficient and faster.
Electronic document storage solves many problems, for example by storing electronic documents on a server (Aura, Kuhn & Roe 2006). Backing up data electronically is trivial, and electronic document storage is cheap too. As the cost of hard drives keeps coming down, so does the cost of data storage. Online data storage is a specific form of electronic document storage that allows one to save documents to a data centre and access them at any time from anywhere in the world. As part of their service, most online data storage services back up the stored data regularly.

The electronic file also has its disadvantages. Two major disadvantages that can cause tremendous loss are damage to the computer and virus attacks. If the computer is damaged and there is no copy of the data it stored before the damage, the loss would be enormous. Virus attacks are also very prevalent. If the computer is infected by a virus that deletes all the files in the database, it would not be easy to recover and restore the files to their original form (Aura, Kuhn & Roe 2006).

2.3 The Uses of Paper Files and Electronic Files

Figure 2.1 shows the uses of paper files and electronic files across a wide range of activities. Based on the graph, files are widely used in nearly all industries (Sellen & Harper 1997).

[Figure 2.1: bar chart. Vertical axis: minutes (0 to 2500). Horizontal axis: activities, including drafting, editing, reviewing others' documents, collaboration, conversations, meetings, reading, note taking, form filling, photocopying, printing, searching and telephone. Legend: Paper Only, Electronic Only, Paper & Electronic, No Documents.]

Figure 2.1 The Activity Profile for 8 Disk Officers and Chiefs

Although there were 16 economists in the sample, due to their busy schedules and days away from the office, only 8 of them provided complete data on which to base the quantitative analysis. Figure 2.1 shows the activity profile that could be constructed. The profile shows that a large proportion of their time was spent on authoring activities, as one might expect. The figure also shows the extent to which these processes relied on paper, or on a combination of paper and electronic tools. In particular, collaborative authoring processes, either co-authoring a document or reviewing the documents of others, were heavily paper-based. Paper was also often present in the drafting and editing of their own text and data, although this tended to be in conjunction with online tools (Sellen & Harper 1997).

One can also see from Figure 2.1 that over half of conversations and the majority of meetings were supported by paper documents. Of further interest is the fact that paper also tended to be the preferred medium for reading documents, for document delivery, for thinking and planning activities, and for document organization.

2.4 Document Management System

A document management system organizes, stores and retrieves documents according to properties attached to them. Applications that work with hierarchical path names communicate with the document management system through a translator. The most sophisticated method currently used to manage documents is the Document Management System (DMS), in which documents are stored centrally on a server and users interact with this central repository through interfaces implemented using standard web browsers (Omar 2005). DMSs are developed to provide a library or repository where documents can be created, managed and stored for easier access by departments and users across an enterprise.
A document management system (DMS) is a management control system used to regulate the creation, use and maintenance of electronically created documents. It links paper, image and electronic documents into a flexible and powerful document management system. The DMS allows paper to be converted to electronic format, and images, raw data, facsimile transmissions, e-mail, sound or video clips and paper records can be linked through a single indexing and retrieval application. Bar code technology used on both paper and imaged documents allows all records to be indexed, tracked and retrieved through a single user application (Omar 2005).

Some systems also include scanning and network features that allow multiple users on the network to access the necessary documents remotely and simultaneously. Records can be indexed, stored, retrieved, printed or faxed by all authorized users on a network (Lea & Smith Judy Read 2002). Fax messages are captured, stored, routed or re-faxed, eliminating the need for hard copies. Electronic documents can be stored on optical as well as electronic media, and raw data can be located automatically and directly via searches on Computer Output to Laser Disk (COLD). COLD is a technique for transferring computer-generated output to optical disk so that it can be viewed and printed without using the original program. COLD combines the capabilities of scanning paper documents created on another system and linking them to COLD documents.

The crucial technologies incorporated into the document management system include full-text retrieval, electronic document imaging, film-based imaging, and workflow systems. The workflow system provides automated work processes and the scheduling, controlling and routing of electronic documents and other related work processes in an organization (Lea & Smith Judy Read 2002).
A DMS can prevent lost records, save storage space, make records easier to manage, find documents quickly, make images centrally available, and eliminate file cabinets. Document imaging not only keeps all documents organized, it also allows the documents to be maintained and backed up daily, weekly, monthly or even yearly.

2.5 Elements of Document Management

A complete document management system consists of six components, namely scanning, storage, indexing, archiving, retrieval, and access. The following paragraphs briefly describe each of these components.

a) Document Scanning

Scanning technology makes the conversion of paper documents to electronic format a fast, inexpensive and easy process. A good quality scanner makes it easy to put paper files into the computer. It should also convert the original information accurately so that no details are lost in the process.

b) Document Storage

Storage, also called filing, places hard copies or saves computer records in a suitable location. The storage system creates an organized document filing system and makes retrieval simple and efficient. A stable storage system should be able to adapt to ever-changing documents, increasing volumes, and advancing technology.

c) Document Indexing

Indexing is a very important component of a document management system. The indexing system produces an organized document filing system that makes retrieval simple and efficient. A proper indexing system permits more effective procedures and systems. The index can incorporate physical location information (where the document is stored) and document identification information (the date created, the date archived, and the subject matter of its contents).
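As an illustration of the index fields just listed, the following minimal sketch models one index entry and a subject lookup. It is written in Python and is only a sketch under stated assumptions: the class name, field names and sample values are hypothetical and are not taken from the EFS database design.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class IndexEntry:
    # Hypothetical index record; the field names are assumptions.
    doc_id: str
    location: str                   # where the document is stored
    date_created: date
    date_archived: Optional[date]   # None while the document is still active
    subjects: Tuple[str, ...]       # subject matter of the contents


def find_by_subject(index, subject):
    # Return every entry whose subjects include the given term (case-insensitive).
    wanted = subject.lower()
    return [e for e in index if wanted in (s.lower() for s in e.subjects)]


index = [
    IndexEntry("form-001", "cabinet A / shelf 2", date(2009, 7, 1), None,
               ("registration", "postgraduate")),
    IndexEntry("form-002", "cabinet A / shelf 3", date(2008, 7, 1), date(2009, 12, 31),
               ("registration", "undergraduate")),
]

matches = find_by_subject(index, "Postgraduate")  # one entry: form-001
```

Keeping the location and the identification fields in one record means a single lookup answers both what the document is and where it is stored.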
In addition, indexing is the mental process of determining the filing segment, or name, by which a record is stored; it is also the placing or listing of items in an order that follows a specific system (Lea & Smith Judy Read 2002).

d) Document Archiving
An archive is a group of records or documents with specific characteristics; the term also refers to the location in which these documents are kept. Archiving is the long-term storage of electronic documents so that they can be retrieved in the future.

e) Document Retrieval
The system uses information about the documents, including the index and the text, to find images stored in the system (Lea & Smith Judy Read 2002). This makes the right document easy to find and quick to retrieve. Retrieval is the process of locating and removing a record or file from storage. It also covers finding information on a particular subject within the stored data.

f) Document Access
Access is a major component of any document management system. Document viewing should be readily available to those who need it, with the flexibility to control access to the system. To access a document, the user must know the location of the file; otherwise it will be difficult to find.

2.6 Centralized Filing System

Organizations with many employees have undoubtedly experienced the frustration of trying to find files and documents that have been used by someone else, or that have been misplaced or lost by someone who cannot remember where they were kept (Liu, McMahon & Culley 2008). The files might be hidden on someone else's desk, in a drawer somewhere, or at the bottom of one's own stack of items "to be dealt with later". Usually, the last person to use the file simply kept it in their work area. This happens when there is no central location, with attendant procedures, for records that need to be handled by many people from many departments.
A centralized filing system is therefore one solution to this situation. In a centralized filing system, all the records from all the departments or units are kept in one central location. The management of these records is placed under the control of one member of staff or, in the case of large centralized filing systems, several people. Files are conveniently available to all departments. Establishing a central filing system and its processes requires determining what information needs to be accessible to all staff and what should be available only to certain individuals. The director must be involved in all phases of the project, for obvious reasons but also because the project involves an analysis and assessment of information needs (Liu, McMahon & Culley 2008).

Consider this quote from Information and Records Management: "No organization should permit bits and segments of its records to be scattered randomly wherever they happen to be created or to have accumulated" (Liu, McMahon & Culley 2008). At the same time, an organization should not arbitrarily force the centralization of records without taking into account the practical needs of its offices. Careful planning and consideration are therefore important before beginning to reorganize, in electronic form, files that are used by many people from various departments.

2.6.1 Benefits of Centralized Filing System

The main benefits of a centralized filing system include less duplication of files and more efficient use of equipment, supplies and space. All related data can also be kept together. Another benefit is that the organization is able to provide a uniform service (Liu, McMahon & Culley 2008). In addition, when access is provided to all staff positions, the frustration of finding and locating information is reduced, which in turn decreases bickering among staff.
A further benefit is that a centralized filing system simplifies routine maintenance and the annual and periodic archiving of files. In establishing central files, it is also important to designate which records must not be available to all. The side benefits of creating a central filing system are not insignificant either, because the process almost always uncovers out-dated records that should be archived or that are long overdue for destruction. These records are already taking up valuable space, and the reorganization is the best, and perhaps the only, opportunity to put records retention guidelines into effect.

2.7 Electronic Document Management Systems

An electronic document management system (EDMS) is the application of technology to save paper, speed up communications and increase the productivity of business operations. From a broader perspective, the EDMS represents a significant expansion of the area of information management and a concomitant increase in the responsibilities of managers and executives (Zantout & Marir 1999).

Electronic document management systems have been around for several decades, and the technologies have been developed in recent years to include a variety of features (Zantout & Marir 1999). Imaging technology provides the facility to replace the paper-based document management system with an online one, while multimedia technology involves the capture and display of various data types, together with the facility to retrieve constituent objects of a multimedia document. In addition, systems may incorporate groupware, workflow and text retrieval functionality, with some overlap among them. These are discussed in the following sections.

The most important functions of current document management systems enable users to:
• directly manipulate the documents,
• index and store the records so that they are retrievable,
• communicate through the exchange of documents,
• collaborate around documents,
• model and automate the flow of documents.
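The second function in the list above, indexing and storing records so that they are retrievable, can be illustrated with a small inverted index for text retrieval. This is a minimal Python sketch under simplifying assumptions: the document identifiers and contents are invented for illustration, and it is not presented as the indexing scheme used by any particular EDMS.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-case alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical stored documents.
documents = {
    "doc1": "Student registration form, semester 1",
    "doc2": "Course withdrawal form",
    "doc3": "Student transcript request",
}
index = build_index(documents)
print(sorted(search(index, "student form")))
```

Building the index once at storage time is what makes later retrieval fast: a query only intersects a few small sets instead of rereading every stored document.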
Sooner or later, any company will feel the need for some kind of electronic document management system to control its ever-growing number of documents and drawings. Companies often recognise the need for an EDMS but are deterred by the expense and difficulty involved in implementing one. Using an EDMS effectively requires a substantial change in working practices, even though most technical aspects are resolved through the adoption of low-cost databases and ease of integration with the Windows environment (Sprague Jr 1995).

A useful EDMS should not only control documents but also provide access to them throughout the company, and even to customers or other participants in a project through the Internet or a network. An EDMS should also centralize data in an easily accessible environment, allowing users to store, access and modify information easily and quickly. Managing all the information necessary to design and build any major business is a real challenge, and many believe that more efficient information management is a main mechanism by which companies can increase their productivity.

The standard features of a good system should comprise the following functions: a searching facility, viewing without the use of the original application, red-lining and mark-up, printing and plotting, workflows and document life cycles, revision and version control, document security, document relationships, status reporting, issue/distribution management, and remote access (Sprague Jr 1995).

The basic stages of an electronic document management system are shown in Figure 2.2. Data from stand-alone systems enters the EDMS either from CD or by being scanned into the system. The EDMS then provides data storage and retrieval, with outputs in the form of hard copies or computer files.

[Figure 2.2: Electronic Document Management Systems. The figure shows four stages. 1. Scan: printed documents are scanned. 2. Register: all documents are registered with metadata such as author, input date, content type, etc. 3. Store: documents are indexed and stored in database or file structures. 4. Retrieve: documents are retrieved through search and queries; the system provides version check in/out, activity tracking, etc.]

2.7.1 Advantages of EDMS

Many companies use an EDMS to standardize the way in which users with the right privileges find and access the documents and information they want. An EDMS helps users do their jobs more easily, and provides the company with secure and reliable data and management actions. Many of its features aim to save time, simplify work, protect the investment made in creating the documents, enforce quality standards, enable audits and ensure accountability (Groetzner, Guenthner & Streckeisen 2004). The advantages include:
• Generally efficient location and delivery of documentation.
• The ability to manage documents and system data regardless of source or format.
• The ability to integrate computer and paper records.
• Control over the access, distribution and modification of documents.
• Provision of document editing and mark-up tools.

2.8 Neural Networks

In this research, a neural network is the method used to read scanned input data so that it can be saved in the system database. In developing the EFS, an artificial neural network is used to read the forms and convert them from image to text. Many researchers work on this type of algorithm, and each defines it in their own way. Some of the definitions of artificial neural networks are studied and presented here for discussion.

2.8.1 What is Neural Networks?

Neural networks are a different way of programming a computer (Chen 1996). They perform exceptionally well in pattern recognition and other tasks that are very difficult to program using conventional techniques.
Programs using neural networks are also able to learn and adapt to changing circumstances. Neural networks are a strong tool for modelling data, able to capture and represent complex relationships between inputs and outputs. The incentive for developing neural network technology stemmed from the desire to design a system that could perform "intelligent" tasks similar to those performed by the human brain. Neural networks resemble the human brain in the following ways (Chen 1996):
1. A neural network acquires knowledge through learning.
2. A neural network's knowledge is stored within inter-neuron connection strengths known as synaptic weights.

The real power of neural networks lies in their ability to represent both linear and non-linear relationships and to learn these relationships directly from the data being modelled. Traditional linear models are simply inadequate when used to model data with non-linear characteristics (Chen 1996).

Neural networks follow a different model of computing:
• Von Neumann machines are based on the processing/memory abstraction of human information processing.
• Neural networks are based on the parallel architecture of animal brains.

Neural networks are a form of multi-processor computer system, with:
• Simple processing elements.
• A high degree of interconnection.
• Simple messages passed between elements.
• Adaptive interaction between elements.

2.8.2 Neural Networks Types

There are several types of neural networks. They can be distinguished by their type (feed-forward or feedback), their structure and the learning algorithm they use. The type of a neural net indicates whether the neurons of one of the net's layers are connected to each other. Feed-forward neural networks only allow connections between neurons of different layers, while networks of the feedback type also have links between neurons of the same layer (Cho 2000). This section describes a selection of neural networks.
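The two properties listed in section 2.8.1, acquiring knowledge through learning and storing it in connection weights, can be sketched with a single artificial neuron. The Python example below is only an illustration using the classic error-correction update rule; it is not the network used in the EFS, and the function names and the epoch limit are assumptions made for the example. It also shows why a purely linear unit is inadequate for a non-linear relationship such as XOR.

```python
def predict(weights, bias, inputs):
    """Threshold unit: output 1 if the weighted sum is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train(samples, epochs=20):
    """Adjust the weights after every misclassified sample."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        mistakes = 0
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            if error:
                # Move the weights towards the correct answer.
                weights = [w + error * x for w, x in zip(weights, inputs)]
                bias += error
                mistakes += 1
        if mistakes == 0:  # every sample classified correctly
            break
    return weights, bias


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_weights, and_bias = train(AND)
xor_weights, xor_bias = train(XOR)

and_correct = all(predict(and_weights, and_bias, x) == t for x, t in AND)
xor_correct = all(predict(xor_weights, xor_bias, x) == t for x, t in XOR)
print("AND learned:", and_correct)
print("XOR learned:", xor_correct)
```

The learned "knowledge" is nothing more than the final values of the two weights and the bias. AND is learned because a single straight line separates its outputs; no such line exists for XOR, so the same unit cannot classify all four XOR cases however long it is trained.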
The First Type is Perceptron

The Perceptron was first introduced by F. Rosenblatt in 1958. It is a very simple type of neural net with two layers of neurons, which accepts only binary input and output values (0 or 1). The learning process is supervised, and the network is able to solve basic logical operations such as AND or OR. It is also used for pattern classification. As shown in Figure 2.3, more complicated logical operations (such as XOR) cannot be solved by a Perceptron (Cho 2000).

Figure 2.3 Perceptron Characteristics

The Second Type is Multi-Layer-Perceptron

The Multi-Layer Perceptron was introduced by M. Minsky and S. Papert in 1969 and is shown in Figure 2.4. It is an extended Perceptron with one or more hidden layers of neurons between the input and output layers. Because of this extended structure, a Multi-Layer Perceptron is capable of solving all the logical operations, including the XOR problem (Cho 2000).

Figure 2.4 Multi-Layer-Perceptron Characteristics

The Third Type is Back Propagation Net

The Back Propagation Net was first published by D.E. Rumelhart, G.E. Hinton and R.J. Williams in 1986 and is one of the most powerful types of neural net (Cho 2000). It has the same structure as the Multi-Layer Perceptron and uses the back propagation learning algorithm, as displayed in Figure 2.5.

Figure 2.5 Back Propagation Net Characteristics

2.9 Self-Organizing Map

The previous section gave an overview of the neural network algorithms used to convert a scanned document to text format. In the EFS implementation, however, a special type of neural network is used: the self-organizing map (SOM) algorithm. This section describes the SOM algorithm in detail.

2.9.1 What is Self Organizing Map?

The self-organizing map (SOM) is a software tool that is effective for the visualization of high-dimensional data (Oja, Kaski & Kohonen 2003).
It implements an orderly mapping of a high-dimensional distribution onto a regular low-dimensional grid. It is thereby able to convert complicated, non-linear statistical relationships between high-dimensional data items into simple geometric relationships on a low-dimensional display. Because it compresses the information while preserving the most important topological and metric relationships of the primary data items on the display, it can also be thought of as producing a kind of abstraction. These two aspects, visualization and abstraction, can be used in a number of ways in complex tasks such as process analysis, machine perception, control, and communications.

The SOM usually consists of a regular two-dimensional grid of nodes (Kohonen 1998). A model of some observation is associated with each node, as in Figure 2.6. The SOM algorithm computes the models so that they optimally describe the domain of (discrete or continuously distributed) observations. The models are automatically organized into a meaningful two-dimensional order in which similar models are closer to each other in the grid than dissimilar ones. In this sense the SOM is a similarity graph, and a clustering diagram, too. Its computation is a nonparametric, recursive regression process (Kohonen 1998).

Figure 2.6 Link Node

In this exemplary application, each processing element in the hexagonal grid holds a model of a short-time spectrum of natural speech. Note that neighbouring models are mutually similar.

2.10 Document Management Application- Literature Review

Allergan, founded in 1948, is a technology-driven health care company with its headquarters in Irvine, California. It develops and commercializes products for the eye care pharmaceutical, ophthalmic surgical device, over-the-counter contact lens care, movement disorder and dermatological markets throughout the world. Allergan markets its products in more than 100 countries, and in 1997 generated approximately $1.1 billion in worldwide revenue.
In order to be successful, Allergan requires a very efficient and streamlined business operation strategy. To achieve this, Allergan's management teams invested a great amount of time in researching and selecting the business systems that would best meet their needs as they planned ahead. They also had to ensure that their staff were adequately trained to reap the maximum benefit. This resulted in the purchase of IXOS-ARCHIVE (footnote 1) as their imaging and archiving solution for SAP R/3.

Footnote 1: Available from: http://www.ixos.com, last accessed 15/6/2009

Competitive global companies such as Allergan are implementing imaging and archiving solutions to manage the ever-increasing volume of data, so that system performance is optimized, data is secure and employee productivity increases. IXOS-ARCHIVE, the SAP-certified imaging and archiving product suite, addresses these concerns by storing and retrieving documents, reports, data and images digitally under the control of R/3 processes. System performance remains optimal with regular data archiving, data is stored securely, and employees become more productive because they simply access the information they need on-line, when and where it is needed, from one file system.

The Accounts Payable Shared Service Centre was the first department at Allergan to use IXOS-ARCHIVE in order to cut costs and save time. Instead of being processed as hard copies, invoices are now scanned and processed for payment on-line. IXOS-ARCHIVE allows the scanned documents to be manipulated before they are transported to the transaction processor. Pages that were inadvertently scanned upside down can be turned right side up, and the page order of multiple-page documents can be changed.
Once a scanned invoice has been transported to the transaction processor, the processor can send the invoice image to the appropriate person for payment approval and general ledger coding, or match the invoice on-line to the appropriate purchase order and receiver before processing it for payment. Before IXOS-ARCHIVE was implemented, approval requests had to be sent via interoffice mail, and the process could take as long as two weeks.

Allergan has sites in California, Texas, Massachusetts, Mexico, Puerto Rico and Canada that use IXOS-ARCHIVE. Very often, employees or Cost Centre Managers at these locations needed to research a particular invoice, and would call the Shared Service Accounts Payable Centre in Irvine to ask for copies of specific documents to be faxed to them. With IXOS-ARCHIVE, most of these phone calls to the Shared Service Centre no longer have to be made. Authorized viewers can quickly access the appropriate invoice or related documents on their computer screens. In addition, outsourcing the microfilming or microfiching of paper documents (at considerable cost) is no longer necessary.

The Accounts Payable Shared Service Centre has realized considerable time savings because it receives fewer phone calls asking for research. Staff no longer need to stand in front of a photocopier making duplicates of invoices that would later be sent out for approval, general ledger coding or auditing before being mailed to the requestors. This has translated into more efficient time management and improved performance. IXOS-ARCHIVE has also been put to use in the Return Goods processing department, where it has helped to eliminate much manual paperwork and filing. In addition, SAP transaction history is being archived for permanent storage and retrieval.

As the volume of the company's data grows, system performance becomes a greater priority. Allergan's management wisely chose to plan ahead before system performance became adversely affected.
They began data archiving. Sandy Howard, the Project Manager of the Business Systems Development Group at Allergan, noted, "We wanted to make sure we didn't run into any performance problems. Our financial ledger is growing by about 10 gigabytes per month, and so we started archiving this data first. We'll also start archiving our SIS and LIS reporting data. IXOS required very little implementation time and the IXOS SOFTWARE team has really impressed us with their technical knowledge and outstanding customer service."

The timeshare administrator Hutchinson & Co was formed in 1985 as a collection agent for just one resort. As the company grew, it gradually took over all the back office administration, which included the work of the trustee. As a result, five years later, in 1990, it formed Hutchinson & Co Trust Company Ltd. The company now supports more than 100 resorts in the United Kingdom, Europe and South East Asia from its headquarters in Camberley, Surrey.

Hutchinson is required to retain documentation relating to its customers' timeshare agreements for a period of 80 years. Storing and accessing such an amount of paperwork was becoming a very expensive problem. "We were surrounded by paperwork," said Hutchinson Systems Administrator, David Earles. "Every single wall of the office was lined in shelves and files. We even had a separate building just to store the files."

Earles said that the company had been researching the available digital archiving systems for several years, but had found that the technology at the time was not yet able to meet its needs. "Back then, the machines were too expensive and too slow to make it a worthwhile option for us. Fortunately, the situation has improved a lot." In June 2001, Earles and his colleagues began discussing a document management solution with Canon. "We went to Canon because we knew they were the best in the industry, and we needed a powerful, reliable solution," he said.
The document management solution proposed by Canon combined the Canon DR5020, a document scanner, with Scan-File 2000, an archiving software package. The size and thickness of documents were not a problem for the Canon DR5020, as it can scan documents up to and including A3. As for speed, it can handle up to 75 pages per minute, and with automatic feeding and double feed detection the volume can go up to 500 sheets. The control panel is placed on the product itself, making the DR5020 compact, lightweight and especially user-friendly. It also has a variety of scanning options: a barcode unit for automatic indexing, an endorser for "post-stamping" documents, and an imprinter for "pre-stamping" documents on the actual digital image.

Free OCR (footnote 2) is a free online Optical Character Recognition (OCR) tool. It can be used to perform OCR on any image supplied to it, at no cost. To use it, the user simply uploads the image files. Free OCR supports different kinds of files such as JPG, GIF, TIFF, BMP and PDF, but it processes only the first page. The restrictions on using the tool are that the image file must not be larger than 2 MB or wider or higher than 5000 pixels, and that there is a limit of 10 image uploads per hour. The image is automatically pre-processed and optimized before it is fed into the OCR engine, which reduces background noise and adjusts the resolution. The only thing left to the user is to de-skew the image if the skew is more than 10. Free OCR can now handle images with multi-column text, and it also supports more languages, including Bulgarian, Catalan, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Indonesian, Italian, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovene, Spanish, Swedish, Tagalog, Turkish, Ukrainian and Vietnamese. The Free OCR software extracts text from an image and converts it into an editable text document.
A user who needs the text from an image can scan the page and use the OCR tool to convert it into editable text. The result is always plain text.

Footnote 2: Available from: http://www.free-ocr.com, last accessed 1/5/2009

OmniPage 17 (footnote 3) is described by its vendor as the world's most accurate document conversion application. With more than 99% character accuracy, it can convert PDF and paper files into electronic files that can be edited, searched and shared in the formats of the user's choice. Documents that would take hours to re-type can be turned into formatted documents in seconds. OmniPage 17 supports the conversion of all popular image formats such as TIF, JPG, BMP, PCX, GIF, PDF, MAX and more. The tool offers optical character recognition with up to 99% accuracy for 119 different languages, and it now recognizes Simplified and Traditional Chinese, Japanese and Korean. According to the vendor, no better OCR application can be found for the price. This level of accuracy greatly reduces the need for post-recognition proofreading and correction, which means that organizations can save significant amounts of time and money by improving the ways in which paper and digital documents are processed, archived and shared. The tool is simple to use: tasks are automated, information is easily accessed, and productivity rises. Even digital camera or iPhone pictures can be converted into files that can be edited in common PC applications, while scanned documents can be turned into electronic books formatted for reading on the Amazon Kindle.

SimpleOCR (footnote 4) is a popular OCR freeware with thousands of users worldwide. SimpleOCR, which has up to 99% accuracy, is also a royalty-free OCR SDK that developers can use in their custom applications. SimpleOCR currently supports the English and French languages, and its vendor states that recognition for additional languages is being added.
For users who have a scanner and want to avoid retyping their documents, SimpleOCR is presented as the fastest free way to do so. The SimpleOCR software is 100% free and not limited in any way. Anyone can use it at no cost: home users, educational institutions, and even corporate users.

Footnote 3: Available from: http://www.nuance.com/imaging, last accessed 13/8/2009
Footnote 4: Available from: http://www.simpleocr, last accessed 3/8/2009

For documents with multi-column layouts, non-standard fonts, poor quality or colour images, the vendor recommends one of its commercial OCR applications or imaging SDKs to obtain an accurate read. The vendor's OCR Guide compares desktop and server OCR solutions from several major engines, including ABBYY, IRIS, Nuance (formerly ScanSoft) and more. Its imaging solutions website, ScanStore.com, offers demo downloads and online ordering for all these applications through a ScanStore user account. Users are encouraged to try the SimpleOCR freeware with their own documents and, if they need better features and accuracy for more demanding applications, to look for a commercial solution at ScanStore.

SimpleOCR has a large dictionary. With more than 120,000 words, it is unlikely to encounter a word it does not know. In the rare event that it does, the text editor allows the user to add the new word to the dictionary easily. By adding new words to the dictionary, SimpleOCR becomes better with every use.

Input formats: SimpleOCR works with all fully compliant TWAIN scanners and accepts input from TIFF files. Output formats: SimpleOCR can save the documents it acquires in text formats (TXT and RTF) that can be imported into almost every program, such as Word, WordPerfect, HTML editors and e-mail programs, either fully formatted or as plain text. It can also save scanned documents in the industry-standard TIFF format, a format as widely accepted as PDF.
Table 2.1 shows the differences between the Free OCR, OmniPage 17 and SimpleOCR systems.

Feature | Free OCR | OmniPage 17 | SimpleOCR
Free | Free | Must be purchased | Must be purchased
File types | JPG, GIF, TIFF, BMP, PDF | TIF, JPG, BMP, PCX, GIF, PDF, MAX | JPG, GIF, TIFF, BMP, PDF, MAX
File size | 2 MB | 4 MB | 2 MB
Languages | 32 different languages | 119 different languages | English, French
Online | Online OCR | Not online | Not online
Accuracy | Up to 89% | Up to 99% | Up to 99%

Table 2.1: Differences between Free OCR, OmniPage 17 and SimpleOCR systems

2.11 Summary

The review of literature covers various topics: the definition of 'document' and 'document management', different types of documents, document management systems, the elements of document management, and centralized filing systems. The definition of electronic document management systems and their benefits have been discussed. In addition, different types of neural networks and the self-organizing map have been identified. Examples of current electronic filing systems have also been discussed.

Chapter 3: Research Methodology

This chapter focuses on the criteria for selecting a suitable methodology for the data collection process as well as for the project software life cycle and its development tool.

3.1 Research Methodology

According to Yin (1994) and Zikmund (1987), research can be used for three purposes: exploratory, descriptive and explanatory.

Exploratory studies are a valuable means of finding out what is happening, seeking new insights, asking questions and assessing a phenomenon in a new light. Robson (2002) explained that an exploratory study is a particularly useful approach if one wishes to clarify the understanding of a problem. The advantage of exploratory research is that it has great flexibility and is adaptable to change. The flexibility inherent in exploratory research does not mean the absence of direction.
Descriptive research is used within problem areas where plenty of literature already exists, and where the aim is to study events that have occurred or are happening at the present time. The aim of descriptive research is to describe the characteristics of a population or phenomenon. It seeks to determine the answers to who, what, when, where and how questions (Zikmund 1987). According to Robson (2002), the objective of descriptive research is to portray an accurate profile of persons, events or situations. It is usually undertaken as an extension of, or a forerunner to, a piece of exploratory research (Robson 2002). Zikmund (1987) noted that accuracy is of immense importance in descriptive research. Although errors cannot be eliminated completely, good research strives for descriptive precision. Descriptive research is usually based on some previous knowledge and understanding of the nature of the research problem.

Explanatory research is aimed at establishing causal relationships between variables. The emphasis here is on studying a situation or a problem in order to explain the relationships between variables. Exploratory and/or descriptive research usually precedes this kind of research (Zikmund 1987), and the researcher must be knowledgeable about the research subject.

The research purpose of this study has been assessed as both exploratory and descriptive. The study is exploratory because knowledge about the research area is limited and because the research aims to gain a deeper understanding of this field. The research is also descriptive in nature, as an attempt is made to describe the data collected.

3.2 Data Collection Process

Data can be classified as primary or secondary. Primary data is gathered and assembled specifically for the research project at hand. Secondary data has already been collected for purposes other than the problem at hand.
According to Yin (1994), there are six sources of evidence that can be made the focus of data collection for case studies: documentation, archival records, interviews, direct observations, participant-observation, and physical artefacts. Each of these sources of evidence is explained in Table 3.1.

Table 3.1: Data Collection Source Evidence

Source of Evidence: Description

Documentation: The different types of documents include statistics, registrations, official publications, letters, diaries, newspapers, journals, branch literature and brochures. Documents are mostly used for collecting secondary data.

Archival Records: These can be, for example, service records, organisational records, maps and charts, survey data, and personal records. Archival records are often used in computerised form, also for collecting secondary data.

Interviews: Interviews are mostly open-ended in nature; an investigator can ask key respondents for the facts of a matter, as well as for the respondents' opinions about events. The interview can also take the form of a focused interview, in which a respondent is interviewed for a short period of time, an hour for example. Moreover, the interview can entail more structured questions, along the lines of a formal survey.

Direct Observation: This can involve observations of meetings, sidewalk activities, factory work, classrooms, and the like. Observational evidence is often useful in providing additional information about the topic being studied. To increase the reliability of observational evidence, a common procedure is to have more than a single observer making an observation, whether of the formal or the casual variety.

Participant-Observation: Participant-observation is a special mode of observation in which the investigator is not merely a passive observer. Instead, the investigator may take a variety of roles within a case study situation and may actually participate in the events being studied.

Physical Artefacts: A final source of evidence is a physical or cultural artefact: a technological device, a tool or instrument, a work of art, or some other physical evidence. Such artefacts may be collected or observed as part of a field visit and have been used extensively in anthropological research.

Source: Adapted from Yin, 1994, pp.85

This research employs two methods of collecting data:

3.2.1 Observation

Observation can give useful insight into problems, work conditions, bottlenecks and methods of work (Avison & Fitzgerald 2006). Observation is the first method used to gather information regarding the development of an online Smart E-Portfolio System. Observation helps to identify the potential users of the system. For the purpose of this research, the office staff in FCSIT were visited in order to observe the current system and to learn how they handle the students' forms. The author's supervisor was also observed during one semester, to see how the course portfolio is collected and prepared. In an interview the staff might not reveal all the information needed; observation therefore helps by giving additional insight. The observation covered how the files are managed, including storage, access and searching.

3.2.2 Interviews

The second method used to collect data for this research is the interview. This method was chosen because it represents a significant source of information for a case study (Yin 1994). The data obtained by this method is called primary data, as it is collected for a specific purpose by the researcher.

There are three different types of interview: open-ended, focused and structured. In an open-ended interview, the respondent is allowed to answer the questions in his or her own words. A focused interview is bounded to a certain degree: although it follows a set of questions, it is performed in an informal, conversational manner.
The third type, the structured interview, is based on a survey in which the researcher predetermines the questions, with no flexibility (Yin 1994). An interview can be conducted over the telephone or in person. Most qualitative interviews are conducted on a one-to-one, face-to-face basis. The main advantages of interviewing someone in person are that the interview can include more complex questions and can be conducted over a longer period of time.

In this research, one-to-one question-and-answer sessions were held. The interviews were recorded and reviewed later, incorporating the researcher's additional remarks. A total of two interviews were held with FCSIT staff, namely my supervisor and the office staff who work with the student documents. Interviews with the office staff took place on 1/8/08 and were used to obtain the initial functional requirements of the application. The data gathered was then used to identify the data entities and hence the design of the application.

The findings of the interviews indicate the need for an electronic file storage system, as it will improve work efficiency. This is because every student file needs to be kept in the office for seven years. All these files take up a large amount of office space, and the staff find it difficult to search for a particular file when they need it. The staff were interviewed with relevant questions about the current system used to manage the students' files, and to identify whether there is a need for a new system that makes the process of administering and managing the students' files easier. The questions and the results of the interviews are presented in Chapter 5. The questions asked during the interviews with the staff are:

• Are the staff satisfied with the current file storage system?
This question was asked to find out whether the users are satisfied with the system they are currently using.
• What are the workflow steps in the current file storage system?
This question was asked to understand the workflow of the current system and how long it takes to complete.

• What are the advantages of the current file storage system?
This question was asked to identify the advantages of the current system so that they can be retained when the new EFS is developed.

• What are the disadvantages of the current file storage system?
This question was asked to identify the disadvantages of the current system, so that they can be improved upon or replaced with a new approach in the workflow of the new system.

• Of the steps mentioned before, which steps take a long time to complete in the current file storage system?
This question was asked to identify the steps that take a long time to complete, and to find where the problem lies and fix it.

• What do you think is the reason for the delay in the process of the current file storage system?
This question was asked because the user who works with the current system knows where and why the system is delayed.

• What changes must take place in the current file storage system?
This question was asked because there may be some processes that the users do not like or consider unnecessary.

• What are the capabilities of the staff in dealing with computers?
This question was asked to assess the users' computer skills and to determine how the new system must be developed.

• Will the staff accept the change from the current file storage system to a computerized system?
This question was asked because some users may see no need to change the current system, or may not want to start learning another new system.

• Do the staff need any training for the new computerized system?
This question was asked to find out whether the users need training if the new system is used.
3.3 System Development Methodology Approach

A methodology is a collection of techniques for building models, and it is applied across the software development life cycle. There are a few categories of software development methodologies, such as: object-oriented methodology, whereby systems are modelled as a collection of cooperating objects; structured methodology, which is based on functional (algorithmic) decomposition; and data-driven methodology, by which the structure of the system is derived by mapping system inputs to outputs. A good software design methodology provides at least three models: a structural model, a functional model and a control model.

3.3.1 Waterfall Model-Introduction

The waterfall model is the classic software life cycle model. This model was the only widely accepted life cycle model until the early 1980s. The waterfall model is the earliest method of structured system development. Although it has come under attack in recent years for being too rigid and unrealistic when it comes to quickly meeting customers' needs, the waterfall model is still widely used. It is credited with providing the theoretical basis for other process models, because it most closely resembles a 'generic' model for software development (Royce 1987).

[Figure 3.1: General Overview of Waterfall Model. Phases in sequence: Requirement Gathering and Analysis, System Design, Coding, Testing, Installation, Maintenance]

The waterfall model methodology is chosen for this project since it gives full focus to each and every aspect of software development. This is needed since the project directly involves the integration of three technology elements: electronic document imaging, electronic workflow and an electronic centralized filing system.

3.3.2 Task Regions of Waterfall Model

The waterfall model consists of the following steps according to Gilb (1985):

3.3.2.1 Requirement Gathering and Analysis

All possible requirements of the system to be developed are captured in this phase.
Requirements are the set of functionalities and constraints that the end-user (who will be using the system) expects from the system. The requirements are gathered from the end-user by consultation. These requirements are analyzed for their validity, and the possibility of incorporating them in the system to be developed is also studied. Finally, a requirement specification document is created, which serves as a guideline for the next phase of the model.

System Design

Before starting the actual coding, it is highly important to understand what system is to be created and what it should look like. The requirement specifications from the first phase are studied in this phase and the system design is prepared. System design helps in specifying hardware and system requirements and also helps in defining the overall system architecture. The system design specifications serve as input for the next phase of the model.

Coding

Also known as programming, this step involves the creation of the system software. Requirements and system specifications from the system design step are translated into machine-readable computer code.

Testing

As the software is created and added to the developing system, testing is performed to ensure that it is working correctly and efficiently. Testing is generally focused on two areas: internal efficiency and external effectiveness. The goal of external effectiveness testing is to verify that the software is functioning according to the system design, and that it is performing all necessary functions or sub-functions. The goal of internal testing is to make sure that the computer code is efficient, standardized, and well documented. Testing can be a labour-intensive process, due to its iterative nature.

Installation

Once the system has been tested satisfactorily, it is delivered to the customer and installed for use.
The introduction of the system has to be managed carefully so as not to cause unnecessary disruption and to minimize the risks of change.

Maintenance

This phase of the waterfall model is a virtually never-ending phase. Generally, problems with the developed system (which were not found during the development life cycle) come up after its practical use starts, so these issues are solved after deployment of the system. Not all the problems appear immediately; they arise from time to time and need to be solved, hence this process is referred to as maintenance.

3.4 Justification of Waterfall Methodology Selection

The methodology selection brings many benefits towards the final delivery of the proposed system. The selected methodology brings a systematic development technique to the project. This approach will create a more scalable system as it models the real world via abstraction. The selection of the Waterfall Model (WM) encourages planning before designing and enforces some important rules in the process of developing the proposed system. It breaks the system into sub-components with milestones corresponding to the completion of intermediate products. Since the WM is a disciplined approach, it requires each stage of the software development to be documented. Besides that, the correctness of the product is checked at each stage of the product building. This ensures that only the correct product, one that fulfils the users' requirements, is built during the whole development process.

Another main reason for choosing the WM is that it is a stable and reliable model. As it has been widely used in the industry for a long time, its reliability is tested and proven. Developers are also familiar with this model, as it is classic and popular. Given the limited time available for software development, using a stable and familiar model reduces misunderstandings and problems in the system.
3.5 Summary

This chapter has discussed the research methodology, research techniques, and research tools which are used in this dissertation. The research methodology provides the main guidelines for developing the EFS. The research techniques are used to collect and capture requirements from end users, who were interviewed and observed during working hours. The output of the data collection is presented and discussed in Chapter 5.

Chapter 4: Case Study

A university is an institution of higher education and research which grants academic degrees in a variety of subjects. A university provides both undergraduate and postgraduate degrees. Each enrolment requires the student to fill in a form with his/her bio data. Every year the university graduates hundreds of students from different disciplines, who go out into the real world to apply and develop what they have learned.

4.1 University of Malaya

University of Malaya is the first and oldest public university in Malaysia. University of Malaya (UM) traditionally provides education, research and service to society. The university had its roots in Singapore with the establishment of the King Edward VII College of Medicine in 1905 and Raffles College in 1929 to meet the need for medical and tertiary education. On October 8, 1949, University of Malaya was formed with the amalgamation of both colleges. The amalgamation paved the way for University of Malaya to emerge as an education institution catering for the tertiary education needs of the Federation of Malaya and Singapore (University of Malaya Student Handbook, 2007).

The growth of the university was very rapid during the first decade of its establishment, and this resulted in the setting up of two autonomous divisions in 1959, one located in Singapore and the other in Kuala Lumpur. In 1960, the governments of the two territories indicated their desire to change the status of the divisions into that of a national university.
Legislation was passed in 1961 and University of Malaya was established on January 1, 1962. To date, University of Malaya has an estimated population of 25,000 registered students, pursuing various levels of courses. The university has 12 faculties, 2 academies, 3 centres and 2 institutes.

4.2 Quality Management & Enhancement Centre (QMEC)

June 2002 was a turning point in UM history, with the formalization of the UM Quality Management System (QMS). Based on the framework and requirements of MS ISO 9001:2000, the UM QMS encompasses all core processes, which include teaching and learning, research and consultation, and supporting services. On 24 December 2002, as listed in the Malaysia Book of Records, UM became the first public higher education institution (PHEI) to be certified with MS ISO 9001:2000 on a comprehensive basis.

The Quality Management & Enhancement Centre (QMEC) was formed on 27th July 2002 with the aim of managing and coordinating activities associated with the UM QMS. QMEC has been actively engaged in coordinating, strengthening and continually improving the UM QMS. These activities include conducting training sessions, courses and workshops in the effort to instil awareness amongst the staff of UM, and to stress the importance of ensuring quality in all aspects of the organization. QMEC's scope has since expanded to include other quality management frameworks, namely the criteria for the Ministry of Higher Education quality management, Research University, University Ranking, and ASEAN University Network quality management. QMEC has five main sections, as follows:

4.2.1 Documentation Section

The Documentation Section consists of a Document Manager, an e-Document Manager and other members who are responsible for the management of controlled quality documents, which are currently available online through the QMEC website.
4.2.2 Internal Quality Audit Section

The Internal Quality Audit Section is headed by a Chief Auditor who is assisted by a Deputy Chief Auditor and Assistant Auditors. This section coordinates the University's internal quality audit exercises, which aim to check the University's compliance with the UM QMS.

4.2.3 Training & Awareness Section

The Training & Awareness Section comprises a Manager and other QMEC members. The section's main function is to coordinate training in all aspects that pertain to quality in UM. Activities on awareness and appreciation of the UM QMS are regularly and continually conducted for all levels of UM staff.

4.2.4 Quality Assurance Section

The Quality Assurance Section members include a Manager as section head. Its main responsibility is to coordinate activities with regard to the quality management of PHEI in UM. It is responsible for monitoring internal quality management activities, disseminating good practices and conducting awareness and training programmes in quality assurance.

4.2.5 Customers Feedback & Continuous Improvement Section

This section, consisting of a manager and other members, manages matters pertaining to feedback and complaints from customers. The Customer's Satisfaction Survey as well as the Continual Improvement Projects are also carried out and assessed by this section.

4.3 Faculty of Computer Science and Information Technology (FCSIT)

Historically, the computer facilities and services at the University of Malaya were provided from mid-1967 by the Computer Centre, which was formed in 1965. In December 1969 the centre also took on an additional role of teaching and research in the field of computer science and information technology (FCSIT, Annual Report, 2002). A post-graduate Diploma in Computer Science was then introduced in 1974. During the 1990/91 academic session the Centre began offering the Bachelor of Computer Science (CS) programme with a maiden intake of 50 students.
After various proposals, the University's Council agreed in September 1994 to the formation of the Faculty of Computer Science and Information Technology (FCSIT) and a separate Computer Services Division. The Bachelor of Information Technology (IT) commenced during the 1996/97 academic session. At present the faculty has four departments: Artificial Intelligence, Software Engineering, Information Science, and Computer Systems and Technology. Currently, apart from the two Bachelor programmes, its graduate studies offer Masters and Doctor of Philosophy programmes in Computer Science, Information Technology, Software Engineering, and Library and Information Science.

4.4 Office of FCSIT

This section describes the office of FCSIT. The office is responsible for everything related to the faculty, including lecturers, classrooms and students. The office is divided into two: the postgraduate office, which manages doctoral and master's students, and the undergraduate office, which covers all the departments of the faculty. These offices manage the files of the students. Each student has his/her own file that holds all of his/her documents, such as the registration forms filled in at registration, certificates and payment vouchers. Unfortunately, the student files are paper files and are handled manually. According to the university's regulations, all student files must be kept in the offices for seven years. This means a huge collection of thousands of files, all of them paper files that take up a large amount of office space for storage.

4.5 Staff of FCSIT

Each office has its own staff who are responsible for the student files. The staff of the undergraduate office were interviewed. The interviews identified the problems they face in managing the students' paper files. Special cabinets have been installed in both offices.
The staff put all the students' files in these cabinets, where they accumulate over time, creating storage problems and making the search for a particular file tedious. Keeping the students' paper files in this manner may increase risks such as fragmentation, fire and safety hazards.

4.6 Students of FCSIT

At University of Malaya, students can be categorized in two ways: (a) undergraduate and postgraduate students, and (b) Malaysian and foreign students, across the various disciplines within the university. Undergraduate students register for the first time at the International Student Centre (ISC), and postgraduate students register for the first time at the Institute of Postgraduate Studies (IPS). Each student is required to fill in the student registration forms, enclose transcripts, certificates and payment vouchers, and submit them to the office. The ISC and IPS send copies of the students' files, which hold every document, to the students' faculties, since faculties must keep every student's file in their offices. Every new semester the undergraduate students have to complete the registration process at the ISC and the faculty. The postgraduate students follow the same procedure, completing the registration process at the IPS and the faculty (see Figure 4.1).

[Figure 4.1: Current Students Registration. Boxes in the flow: all students register at ISC or IPS; ISC and IPS send all student files to their faculties; the faculty office administers and manages the student files; new students: organize files by names, by years and by departments; existing students: update student files]

4.7 Research Unit of Analysis

This study is conducted at the FCSIT office that manages the undergraduate students' files. These files hold all the students' documents from the first registration until they graduate.
4.8 The Current Document Managing System in FCSIT

In the undergraduate office of FCSIT there are three staff members who work with the students' files. They administer and manage hundreds of new and old student files that must be kept in the office; all these files are slowly taking up the office space. The staff create a single file for each student that holds all of his/her documents from the first registration day until the student completes his/her studies and graduates. These files are organized by discipline, name and year before being stored in the allocated space in the specially built cabinets.

4.9 Current System Drawbacks

• The staff find it difficult to organize files.
• The staff find it difficult to search for files.
• The paper files can be torn.
• Updates to a file make the form look very messy.
• If a head of department needs to see a student's file, he/she has to ask one of the staff to search for and bring the file, which takes time and effort.
• If the staff need to send a student file to another office, they have to make a copy of the file and have one of them deliver it.

These drawbacks stem from the way files are currently stored in the office. The new system will resolve all these problems and make the procedures simpler for all parties.

4.10 Summary

This chapter discussed the case study where the data collection took place. A thorough discussion of the unit of analysis, that is FCSIT, has been made. The work process involved in managing the students' files and the responsible staff has also been explored.

Chapter 5: Data Analysis and Finding

This chapter, as a continuation of the previous chapters of the study, focuses on the analysis of the data collected. The findings from the data analysis are then used as system requirements to develop the proposed electronic file storage system (EFS), which is more efficient for administering and managing students' files.
5.1 The Answers to the Interview Questions

This section presents the answers given by the staff, which were used to produce the data analysis for the current system.

• Are the staff satisfied with the current file storage system?
The answer to this question is yes: they are satisfied with the system because they do not have any other system or options to work with. The current storage system has been in place since the first day they worked in FCSIT. They also found that the senior staff have been working with this system all along.

• What are the work processes involved in the current file storage system?
In the case of a foreign student, when the student registers for the first time at IPS, a copy of all the student's registration forms is sent by IPS to the faculty where the student has enrolled. The staff of the respective office receive the student registration forms and open a folder for each student before filing them away. Each student has his/her own folder, which is paper based. Most of the time the staff file the student folders by year or name and store them in the cabinets.

• What are the advantages of the current file storage system?
The current system has the advantage that the files exist in paper form: it is better for the user to read from paper than from the computer. That is the only advantage that can be found in the current system.

• What are the disadvantages of the current file storage system?
The current system has many disadvantages, such as the difficulty of receiving a large number of student files at the beginning of every new semester and of managing, organizing and storing these files in the cabinets. The large number of student files requires many cabinets that take up a large area of the office floor. Also, searching for a particular student's file can be time consuming and tedious, and the file must be returned to the same location.
If the staff need to send a student file to another office, they have to scan the student's file, save it on the computer as an electronic document and send it by email, or they can simply make a copy of the student's file and post it.

• Of the work process steps mentioned before, which steps take the longest time in the current file storage system?
The staff answered that all the steps take a long time to complete. The most time consuming is organizing the student forms and keeping them in the cabinets, because of the large number of files; it may take a few days to organize the files. Other work processes, such as searching for files, returning files to the same place and sending files to other offices, do not take more than half an hour.

• What do you think is the reason for the delay in the process of the current file storage system?
The staff answered that they work with too many student files/folders and that every process in the current system must be done by hand.

• What changes must take place in the current file storage system?
The staff answered that they hope for an electronic system that can save all the student files on a computer (i.e. in a database), so that it will be easy for them to administer and manage the students' files.

• What are the capabilities of the staff in dealing with computers?
The staff answered that everyone in the office knows how to use a computer, because many computers in the office are already used for other work.

• Will the staff accept the change from the current file storage system to a computerized system?
The staff answered that if the new computerized system is easy to use and works well, they will of course accept the change, because it will reduce the time and effort they spend.
• Do the staff need any training for the new computerized system?
The staff answered yes: anyone who works with the new system for the first time needs to be trained, so that he/she can use it correctly without making mistakes, because student data is very sensitive.

5.2 Observation

An observation can give useful insight into problems, work conditions, bottlenecks and work methods. Observation is the first method used to gather information regarding the development of a new electronic file storage system. A visit was made to the FCSIT undergraduate office to personally observe the existing system and the workflow of how the staff work with the student files. From the observation it is noted that:

• The work process of the current system is still done manually.
• The main stakeholders working with the students' files are the administrator and staff of the faculty's office.

From the above findings, the staff handling the students' files are the main source for the interviews, because they are directly involved in processing the students' files.

5.3 Challenges of Current Systems

Based on the data collected and the interviews, the current system suffers from the difficulties summarized below:

1. File storage problem: many cabinets need to be built in the faculty's office to store all the paper-based student files.
2. File organizing problem: the staff do not have a systematic way to organize the students' files. Files are normally organized by name, department or year.
3. File retrieval problem: the staff take too long to search for a particular student's file, because the search is done manually and there are thousands of student files in the cabinets.
4. File sending problem: if the staff need to send a student's file to another location, they have to find the file and send it by post, or scan it as an image and send it by email.
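The file retrieval and file organizing problems listed above can be illustrated with a short sketch. It is not the EFS implementation: the record fields, the sample entries and the function names are all hypothetical, chosen only to contrast inspecting folders one by one with looking a file up through an index, which is roughly what a database-backed system would offer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudentFile:
    # All field names are hypothetical; the thesis does not define a schema.
    student_id: str
    name: str
    department: str
    intake_year: int


def manual_search(cabinet, student_id):
    """Mimic the paper-based search: inspect folders one by one.

    Returns the matching file (or None) and the number of folders inspected.
    """
    inspected = 0
    for folder in cabinet:
        inspected += 1
        if folder.student_id == student_id:
            return folder, inspected
    return None, inspected


def build_index(cabinet):
    """Build a lookup table keyed by student ID, as a database index would."""
    return {folder.student_id: folder for folder in cabinet}


def sort_key(folder):
    """Order files by department, then year, then name."""
    return (folder.department, folder.intake_year, folder.name)


# Made-up sample records, for illustration only.
cabinet = [
    StudentFile("S003", "Chong", "Software Engineering", 2008),
    StudentFile("S001", "Aminah", "Artificial Intelligence", 2007),
    StudentFile("S002", "Bala", "Artificial Intelligence", 2008),
]
index = build_index(cabinet)
ordered = sorted(cabinet, key=sort_key)
```

With thousands of folders, the manual search grows with the size of the cabinet, while the indexed lookup (`index["S002"]`) does not depend on how many files are stored. A single sort key also gives one systematic ordering in place of the ad hoc choice between name, department and year.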
5.4 General System Requirements

The requirements for the system were based on the findings of the literature review as well as the interview sessions. After a careful analysis of the data collected, the findings were used to derive the following application requirements:

1. The application must be web based so that the user can access the system from any place.
2. The application must be developed on an open source platform to allow easier future upgrades and enhancements.
3. The interface of the system must be simple so that the user can use it easily.
4. The users must be trained to use the new system.

There should be two main groups of users: administrator and staff. Each user shall have a different access level or privilege in the application. The following describes each user role:

- The administrator has access to the entire system.
- The staff can access the whole system and are given the privilege to manage student files.

5.5 Summary

In this chapter, the data collected was analysed, whereby the responses to the interviews were examined carefully and in depth. The findings of the analysis are used as input to determine the system functions of the new student filing system. Chapter 6 discusses how the system functions are determined and how the system is designed, developed, implemented and tested.

Chapter 6: System Design, Development and Implementation

This chapter describes the system design, development and implementation. The system in general is divided into two modules. The first module is the user interface, which offers a web-based interface. The second module is the document scanning module, which is a Java-based application used for scanning documents. The two modules are integrated into the proposed system to offer an efficient electronic file storage system.

6.1 System Requirements

The requirements for the system are based on the findings from the literature review, as well as from interview sessions with users.
After careful analysis of the findings, the following application requirements were derived.

6.1.1 General Requirements

The general requirements refer to the generic features of the application. These features span the application regardless of functionality and modularity. The derived general requirements are as follows:

• The application must be a web-based thin client system. This allows users to have greater access to the application.
• The application must operate on an open source platform to allow easier future upgrades and enhancements.
• Application navigation should be easy to use and self-explanatory.

6.1.2 User Management Requirements

There should be two main groups of users: administrator and administrative staff. Each user shall have a different access level or privilege in the application. The following describes each user's role:

• The administrator has access to manage the system users and assign roles.
• The administrative staff have access to manage students' forms in the system.

6.1.3 Functional Requirements

Functional requirements are important because they determine what the system should be able to do, and the functions it should perform to produce the outputs desired by the system users. The system has two main components, as explained in detail below:

1. The Administrator Subsystem includes modules to:
   a. log in to the system using a username and password
   b. add administrative staff
   c. add students' forms by scanning the form
   d. delete students' forms from the database system
   e. update students' forms in the database system
   f. print students' forms
   g. log out of the system

2. The Administrative Staff Subsystem includes five modules to:
   a. log in to the system using a username and password
   b. add students' forms by scanning the form
   c. update students' forms in the database system
   d. print students' forms
   e. log out of the system

6.1.4 Non-Functional Requirements

Non-functional requirements are factors used to judge how the system operates. Unlike functional requirements, which describe the specific functions that the system has to deliver, non-functional requirements describe the quality of the system.

• Accessibility - The system should be accessible to any authorised user from anywhere without requiring excessive effort. This also includes compatibility across platforms. The system is designed to be a web-based system that can be accessed through a web browser with an Internet connection. To log in to the system, the user supplies a valid username with the corresponding password. Once the user's access rights have been authenticated, the user is signed on.
• Availability - The system should be readily available at any time of the day.
• Maintainability - The system should be easy to maintain and should not demand too much effort to enhance or extend.
• Security - All passwords are encrypted, while usernames are unique to ensure that each system user is distinct from the others. This also ensures that only authorised users can use the functionalities of the system, based on the level of privilege and access rights granted. In addition, only the system administrator is allowed to make changes to the internal features and structure of the system. It is crucial that the system is secure from malicious attacks.
• Usability - The system should require little effort to learn and use (refer to the interviews). Thus, it is important that the layout of the system components and the workflow of the system be consistent, to speed up familiarisation with the system. Besides that, the automatic calculation of evaluation scores will also enable higher efficiency, as the time required to accomplish the task is greatly reduced.
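The role and security requirements above can be illustrated with a small Java sketch. It is only an illustration under stated assumptions, not the EFS implementation (the web interface of EFS is written in PHP): the class `AccessSketch`, the `Action` and `Role` names, and the choice of PBKDF2 for storing passwords are all hypothetical. The permission sets mirror the two subsystems listed in section 6.1.3.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.EnumSet;
import java.util.Set;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

// Hypothetical names throughout; none are taken from the EFS source code.
final class AccessSketch {

    enum Action { LOGIN, ADD_STAFF, ADD_FORM, DELETE_FORM, UPDATE_FORM, PRINT_FORM, LOGOUT }

    enum Role {
        // Administrator subsystem: every action listed in section 6.1.3.
        ADMINISTRATOR(EnumSet.allOf(Action.class)),
        // Administrative staff subsystem: no staff management and no deletion.
        STAFF(EnumSet.of(Action.LOGIN, Action.ADD_FORM, Action.UPDATE_FORM,
                Action.PRINT_FORM, Action.LOGOUT));

        private final Set<Action> allowed;

        Role(Set<Action> allowed) { this.allowed = allowed; }

        boolean may(Action action) { return allowed.contains(action); }
    }

    // Creates a random 16-byte salt, stored next to the hash.
    static byte[] newSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        return salt;
    }

    // Derives a 256-bit salted hash so the plain password is never stored.
    // PBKDF2 is an assumed scheme; the thesis only says passwords are encrypted.
    static byte[] hashPassword(char[] password, byte[] salt) {
        try {
            PBEKeySpec spec = new PBEKeySpec(password, salt, 100_000, 256);
            return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                    .generateSecret(spec).getEncoded();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("PBKDF2 is not available", e);
        }
    }

    // Compares a login attempt with the stored hash in constant time.
    static boolean verify(char[] attempt, byte[] salt, byte[] storedHash) {
        return MessageDigest.isEqual(hashPassword(attempt, salt), storedHash);
    }

    public static void main(String[] args) {
        byte[] salt = newSalt();
        byte[] stored = hashPassword("secret".toCharArray(), salt);
        System.out.println("login ok: " + verify("secret".toCharArray(), salt, stored));
        System.out.println("staff may delete: " + Role.STAFF.may(Action.DELETE_FORM));
    }
}
```

Keeping the permissions in one table per role makes the difference between the two subsystems explicit: the staff role simply lacks the staff management and deletion actions.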
6.2 Systems Development Consideration

6.2.1 System Environment

The final system runs on a typical web environment setup, which consists of the following components: relational database system, web server, web application, and the user interface or browser. The system was developed and tested on a single machine or server with the software installations listed in Table 6.1.

Table 6.1: EFS System Environment

Item               Software Product
Operating System   Microsoft Windows XP Professional
Web Server         Apache server 2.2
Web Application    PHP 4.3.10
Database           MySQL 5.0 Database Server, Community Edition

The server has the hardware specifications listed in Table 6.2, which also represent the recommended minimum hardware requirements.

Table 6.2: EFS Hardware Requirement

Item                Hardware Specifications
Processor           Intel Pentium 4 (1.8 GHz)
Memory              1 GB DDR2 RAM at 533 MHz
Hard Disk           40 GB SATA
Network Interface   10/100 Ethernet

6.2.2 Programming Language and Development Tools

This section examines the chosen scripting language and database, as well as the development tools that were used in the construction of the application.

6.2.2.1 PHP Programming Language

PHP is a recursive acronym for PHP: Hypertext Preprocessor. It is an open-source server-side scripting language that was first introduced in 1994. Since then, it has become the most popular open-source web programming language, used by over 6 million domains with a monthly growth rate of 15% (according to Netcraft, http://www.netcraft.com/survey/). Amongst the benefits of using PHP are:

1. The scripting language is very easy to learn and there is an abundance of PHP resources available on the Internet. This makes it easier to maintain and upgrade PHP applications compared to applications in other scripting languages such as Perl or ASP.
2. PHP works on almost any operating system.
This cross-platform compatibility makes it easier to deploy and install a completed application on existing Internet servers such as Apache, Microsoft and Netscape service solutions. Thus, it is highly suitable for today's heterogeneous network environments.
3. PHP also has built-in support for a wide variety of commercial and non-commercial databases such as MySQL, Informix, mSQL, Microsoft SQL Server, PostgreSQL, Oracle and Sybase, as well as ODBC database connections.
4. PHP supports protocols such as POP3, LDAP, SNMP, HTTP and IMAP, and also offers integration with various external libraries. This allows PHP developers to do almost anything, from generating PDF documents and creating graphic images to parsing XML documents. It is also able to work with other technologies, such as Java and COM.
5. Being an open source scripting language with wide distribution and a large community of users, PHP is very well supported. PHP bugs are found and fixed quite regularly, and the language enjoys continuous improvements to its capabilities thanks to its large pool of open-source developers. Most importantly, all these benefits are made available to its users without any hidden cost.

6.2.2.2 MySQL Database System

MySQL is a powerful, secure and scalable multi-threaded, multi-user relational database management system originally developed by the Swedish firm MySQL AB. Although small in size compared to other commercial relational databases, MySQL is extremely fast. It is widely deployed by large web companies; Google, for example, has used MySQL in parts of its infrastructure.

The main reason for using MySQL as the application database is PHP's extensive built-in support for MySQL. PHP has numerous functions that allow developers to control and manipulate a MySQL database without having to code new procedures. This expedites application development, as less coding needs to be done.
6.3 Development Tools

The following tools were used during the development of the EFS system:

6.3.1 PHP Designer 2007 Personal

This tool is available as freeware and can be downloaded from the Internet. It is developed by MPSoftware and is an Integrated Development Environment (IDE) for PHP, designed to ease and enhance the process of editing, debugging and analysing PHP scripts.

6.3.2 MySQL Query Browser

MySQL Query Browser is a tool for creating, executing, optimising and testing SQL queries for the MySQL database server. It is available for free at http://www.mysql.com.

6.3.3 MySQL Administrator

MySQL Administrator is a free tool, available from the MySQL website, for administering and managing MySQL databases. It provides database administrators with an easy-to-use but powerful visual interface that gives better visibility of how the databases operate.

6.4 System Design

This section describes the user interface and the scanning modules. Figure 6.1 illustrates the design of the system and how it works.

Figure 6.1 Use Case of System Architecture

The system user first scans the forms using the Java desktop interface, and the system then saves them directly in the database. All the forms are saved in this database. Any student form can then be opened through the web-based interface from any location, provided the user has permission to enter the system. The following sections describe the major functionalities of the system and the basic operations it offers.

6.4.1 Interface Module

This section describes the user interface module of the system. The user interface is composed of two layers. The first is the interface layer, a web-based layer developed using the PHP language. The second is the database layer, which is the internal storage layer of the system. It presents the structure of the database used to store the students' forms.
Figure 6.2 shows the web-based interface of the system that has been developed to manage the student forms. All users must enter the web-based system from this page because it is the main page of the system.

Figure 6.2 Web-Based Interfaces

There are two types of users in this system: the administrator and the staff user. The administrator is the global manager of the system and has full access permissions. The administrator has control of the system and can provide usernames and other privileges for users. An administrator can also update and maintain the system.

A staff member, on the other hand, is a restricted user with limited permissions compared to the administrator. Staff can work with both the web-based and the Java-based application. Staff have all privileges except deleting information. Staff are also responsible for scanning students' documentation using the Java-based application. Each user has their own username and password to enter the system.

The web-based application is integrated with the Java-based application so that the two complement each other. The web-based application reads the scanned files that have been stored in the database by the Java-based application, as shown in Figure 6.1.

6.4.2 Administrator Module

The next sections describe all the administrator operations.

Administrator: Manage Users

In this system module, the administrator is the only one who is given the privilege to access the user management functions. The use cases show the privileges that are given only to the administrator and not to the staff, for security reasons.

Figure 6.3 Use Case of Administrator Privilege

Administrator: Update User Account

The administrator accesses the system with the administrator username and password, which grant the privilege to access the system. After that, the administrator has to click the manage user link. Figure 6.4 shows the manage user page.
Figure 6.4 Administrator Management Page

Administrator: Updating Screen

When administrators access the user management page, they can open any user account and update it (Figure 6.5).

Figure 6.5 Administrator Updating Accounts

After opening a user account as seen in Figure 6.5, the administrator can perform maintenance such as changing the username, the user password and the user access level, and then click the update data button for the system to accept the update. Users who are not administrators cannot see the user management link in the system.

Administrator: Delete User Account

In this module the administrator must enter his or her own username and password to have the privilege to enter the manage user link, as seen in Figure 6.4. The administrator can press the delete button to delete any user account from the system, as seen in Figure 6.6.

Figure 6.6 Administrator Delete Users Accounts

Administrator: Admission Records

In this module both the administrator and the staff user can access the admission records module of the system. The admission records module provides different privileges to the staff user and to the administrator, as seen in Figure 6.7. The administrator has all the privileges of the system, such as viewing, updating, deleting and printing, and even controlling who may enter the system.

Figure 6.7 Use Cases of Administrator Admission Records

Figure 6.8 shows all the privileges of the administrator in the admission records module.

Figure 6.8 Administrators: Admission Records Screen

In the main menu of the administrator admission records screen there are four links.
The first is HOME, which takes the user to the main page. The second is MANAGE USER, which takes the administrator to the page for creating the usernames and passwords used to enter the system. The third is ADMISSION RECORD, which lets the user view all student forms that have been stored in the database of the system. The fourth is LOGOUT, which signs the user out of the system.

Delete Student Form

The administrator is the only user who has the privilege to delete student forms from the system, for security reasons. As shown in Figure 6.9, when administrators enter the system with their username and password, they can find the delete button inside the admission records and use it to delete a student form from the system.

Figure 6.9 Delete student form from the database

6.4.3 User Operation

The next sections describe all the user operations.

User Interface Screen

Figure 6.10 depicts the user interface screen.

Figure 6.10 Web-Base User Interface Screen

The welcome note of the user interface explains the idea behind the system, how to use it, and its benefits for the user.

User: Admission Records Screen

Figure 6.11 Use Case of User: Admission Records Screen

The staff have some privileges in the system, such as viewing, updating and printing the student forms, and scanning them using the Java interface of the system. The privilege of deleting cannot be given to the staff because every form must be stored for at least seven years.

Figure 6.12 shows all the privileges of users in the admission records module. The layout of this screen has been explained with Figure 6.8.

Figure 6.12 Staff: Admission Records

User: View Student Form Details

In this system module, both the administrator and the staff user can view the details of the student forms.
When a user presses the details button, as seen in Figure 6.12, a page opens with all the details on the form. The details are presented in the web-based interface and match the source document (form). This is because the form has been scanned and stored in the database of the system, so it can be accessed through the web. If the user wants to make any changes to the design of the web-based form, the user has to go to the Java-based application and choose the option to change the form's design. Figure 6.13 presents the view of the web-based student form.

Figure 6.13 Users: Student Form Screen

This page shows the user all the details of the student's forms that have been scanned and stored. From this page the user can update and print the forms, as explained in the next sections.

User: Update Student Form

Students' data may change during their years of study, e.g. a change of postal address, phone number or email address. The students need to inform the FCSIT office about any change in their data. The staff responsible for the student forms then update the students' data in the system. To update a student's details, the user presses the update button, as seen in Figure 6.13; this opens the relevant page with all the details that the form holds. The user can then make the necessary updates to the student's record, as in Figure 6.14.

Figure 6.14 Users: Update Student Form

There is a difference between Figure 6.13 and Figure 6.14. In Figure 6.14 the marital status is single and the nationality is Malaysia, whereas in Figure 6.13 the marital status is married and the nationality is empty. That is the update made to the student's form.

User: Printing Students Form

Both the administrator and the staff user can print student forms from the system. The user can print the main page of the admission records, which holds the names of all the student forms that have been saved in the system, as shown in Figure 6.15.
Figure 6.15 User: Printing the Admission Records Page

The user can also open any individual student's details and print them, as shown in Figure 6.16.

Figure 6.16 Users: Printing the Student Form

Other Services

To increase the use of the web-based electronic file storage system (EFS), the application is linked to the University of Malaya main page, giving access to the university and faculty web sites. This service adds flexibility to the system and provides an easy way to keep users up to date with the university and faculty web sites.

6.4.4 Java-Base Application Interface

This section describes the Java-based application, which is the basis of the development of the EFS system. No user is allowed to access the Java-based application directly; only the administrator, using the administrator username and password, can grant access to the web-based system that provides the interface to the Java-based application. The main Java-based application interface is shown in Figure 6.17.

Figure 6.17 Java-Based Application Interfaces

After the administrator accesses the main page of the Java-based application, all the processes of the system are made available, as seen in Figure 6.18.

Figure 6.18 Main Page of the Java-Based Application

In Figure 6.18 the user will find three buttons on the left of the screen. The first button is SCAN; the user can press it to enter the printer application for scanning the forms. The second button is VIEW; the user can press it to view all the students' forms in the system. The third button is USER; only the administrator can enter this part to view all the users of the system. In the middle of the screen there is TEMPLATE, which the user can use to change the design of the form in the system. At the bottom of the template there is SOURCE, which shows the user all the printers or devices that are connected to the computer. The ACQUIRE button also takes the user to the scanning part.
The TRAIN button opens all the training forms that have been used to train the system with different handwriting.

Document Scanning

This section discusses the Java-based application that has been used to scan all the students' documents. This is the most important part of the whole system. The scanned documents are then saved in the database of the system.

The Scanning Processes

Scanning with the Java-based application involves a number of processes that have been developed for this purpose. The way of scanning a document using the Java-based application is described later in this chapter.

Document Format

This section describes the design of the student registration form at the University of Malaya and the changes that have been made to the form to fit the new EFS system. Figure 6.19 illustrates the registration form that the University of Malaya is currently using.

Figure 6.19 Student Form

The current registration form is not a good match for the new EFS system because students do not all have the same handwriting, and some of them do not separate the letters when they write. It is very difficult for the EFS system to read every handwriting pattern because there are hundreds of different kinds of handwriting. The EFS system needs to be trained to recognise the different kinds of handwriting.

A new registration form has been specially designed to work with the new EFS system. Figure 6.20 depicts an example of the new registration form designed for use in EFS.

Figure 6.20 New Specially Designed Registrations Form

The new registration form is designed so that each letter is separated from the others. This helps the algorithm in the Java-based application to read each character by itself, and hence to read the students' handwriting as identified characters.
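The benefit of separated letters can be shown with a short Java sketch. It is a simplified illustration, not the EFS code: it assumes the scanned line has already been turned into a black-and-white grid (`true` means ink), and it finds each character as a run of columns containing ink, bounded by blank columns. The class and method names (`CellSplitSketch`, `splitColumns`, `parse`) are hypothetical, and the real application may instead locate characters from the positions defined in the form template.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not taken from the EFS source code.
final class CellSplitSketch {

    // Returns [start, end) column ranges, one per run of columns that contain ink.
    static List<int[]> splitColumns(boolean[][] ink) {
        List<int[]> cells = new ArrayList<>();
        int width = ink.length == 0 ? 0 : ink[0].length;
        int start = -1; // -1 means "currently inside a blank gap"
        for (int x = 0; x < width; x++) {
            boolean hasInk = false;
            for (boolean[] row : ink) {
                if (row[x]) {
                    hasInk = true;
                    break;
                }
            }
            if (hasInk && start < 0) {
                start = x;                       // a character begins here
            } else if (!hasInk && start >= 0) {
                cells.add(new int[] {start, x}); // a blank column ends the character
                start = -1;
            }
        }
        if (start >= 0) {
            cells.add(new int[] {start, width}); // character touching the right edge
        }
        return cells;
    }

    // Builds a grid from text rows of equal length, where '#' marks ink.
    static boolean[][] parse(String... rows) {
        boolean[][] ink = new boolean[rows.length][];
        for (int y = 0; y < rows.length; y++) {
            ink[y] = new boolean[rows[y].length()];
            for (int x = 0; x < rows[y].length(); x++) {
                ink[y][x] = rows[y].charAt(x) == '#';
            }
        }
        return ink;
    }

    public static void main(String[] args) {
        boolean[][] line = parse("#..##.#",
                                 "#...#.#");
        for (int[] cell : splitColumns(line)) {
            System.out.println("character in columns " + cell[0] + " to " + (cell[1] - 1));
        }
    }
}
```

When letters are joined, as on the old form, no blank column separates them and the whole word comes out as a single range, which is why the redesigned form matters for recognition.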
If the administrator wants to change the design of the registration form at any time, this can be done by pressing "NEW" in the template and designing the new form, as shown in Figure 6.21. The registration form can also be edited and deleted as needed.

Figure 6.21 Designs New Registration Form Template

Scanning and Saving the Scanned Students Registration Forms into the System

These sections describe how to save the scanned students' registration forms into the system. There are two ways to save the scanned forms.

Scanning the Students' Registration Forms

The user needs to access the main page of the Java-based application and go to the correct option for the scanning process. Next, the user places the paper-based student registration form on the scanner device. Then the user selects the scanner by choosing the scanner button option, as shown in Figure 6.22.

Figure 6.22 Selecting the Scanner

It is important to remember that users must put the source form on the scanner before they select the scanner, because once the scanner option has been selected and "OK" pressed, the EFS system automatically scans the form, as in Figure 6.23.

Figure 6.23 Scanning Process

The resolution of the scanner must be 300 so that the system can read the characters. The resolution determines the size of the characters that the system will read from the forms. After the system has scanned the registration form, the user presses "ACCEPT" so that the system can convert the image to text. The user can see the conversion process, as in Figure 6.24.

Figure 6.24 Process of Changing From Image to Text

When the image-to-text conversion is finished, the system displays the output. If the user wants to save the scanned registration form in the database, the user just needs to press "OK"; otherwise the user presses "NO". The upload result is shown in Figure 6.25.
Figure 6.25 Upload Result Image-to-Texts

Browse for the Forms

When users access the main page of the Java-based application, they have to press the relevant button to select the source from which to acquire the image. The user chooses the file button to browse for the registration form that needs to be converted, as shown in Figure 6.26.

Figure 6.26 Browse and Select the Required Registration Form

After the user selects the required registration form, the system converts the registration form from image to text, and the upload result is shown as in Figure 6.25.

Artificial Neural Network

The system contains an artificial neural network that is used to read the scanned paper-based registration form and convert it from image to text. The users scan the registration forms as images, and the forms are saved into the system's database by means of the artificial neural network. The next section defines and explains the artificial neural network used.

Definition of Artificial Neural Network

An artificial neural network (ANN), often just called a "neural network" (NN), is a mathematical or computational model based on biological neural networks. It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase (Matthews 2000). In more practical terms, neural networks are non-linear statistical data modelling tools. They can be used to model complex relationships between inputs and outputs or to find patterns in data (Chen 1996).

Figure 6.27 Neural Network Processes

A neural network is an interconnected group of nodes, akin to the vast network of neurons in the human brain.
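To make the idea of interconnected nodes and a learning phase concrete, the following Java sketch shows one layer of artificial neurons. It is a minimal illustration under stated assumptions, not the network used in EFS: each neuron computes a weighted sum of the inputs, the neuron with the largest sum is taken as the winner, and a learning step moves the winner's weights towards the input. The class `NeuronLayerSketch` and its method names are hypothetical; the Kohonen network listed in Appendix A follows a similar winner-based scheme with additional normalisation.

```java
// Hypothetical names; a simplified stand-in for the network in Appendix A.
final class NeuronLayerSketch {

    // One activation per neuron: the dot product of the input and that neuron's weights.
    static double[] activations(double[] input, double[][] weights) {
        double[] out = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < input.length; j++) {
                sum += input[j] * weights[i][j];
            }
            out[i] = sum;
        }
        return out;
    }

    // Index of the neuron with the largest activation (first one wins ties).
    static int winner(double[] activations) {
        int best = 0;
        for (int i = 1; i < activations.length; i++) {
            if (activations[i] > activations[best]) {
                best = i;
            }
        }
        return best;
    }

    // Learning step: move one neuron's weights a fraction learnRate towards the input.
    static void adapt(double[] neuronWeights, double[] input, double learnRate) {
        for (int j = 0; j < neuronWeights.length; j++) {
            neuronWeights[j] += learnRate * (input[j] - neuronWeights[j]);
        }
    }

    public static void main(String[] args) {
        double[][] weights = {{1.0, 0.0}, {0.0, 1.0}};
        double[] input = {0.2, 0.9};
        int best = winner(activations(input, weights));
        adapt(weights[best], input, 0.5);
        System.out.println("winning neuron: " + best);
    }
}
```

Repeating the learning step over many samples is what lets each neuron settle on one recurring pattern, such as the shape of a particular handwritten character.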
The section above gave an overview of the neural network as an algorithm for converting the scanned document to text format. In the system implementation, however, the self-organizing map (SOM) algorithm is used, which is a special type of neural network. The following section describes the SOM algorithm in detail.

Definition of Self-Organizing Map

A self-organizing map is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map (Kohonen 1998). Self-organizing maps differ from other artificial neural networks in that they use a neighbourhood function to preserve the topological properties of the input space. This makes SOM useful for visualizing low-dimensional views of high-dimensional data, akin to multidimensional scaling. The model was first described as an artificial neural network by the Finnish professor Teuvo Kohonen, and is sometimes called a Kohonen map.

Form Training

The artificial neural network used in the system must be trained to be able to identify different types of handwriting. In this system the artificial neural network was trained with six different kinds of handwriting. Two of them are presented as examples in Figure 6.28 and Figure 6.29. After the training has been done, the system can identify the different kinds of handwriting.

Figure 6.28 First Examples for Training Form

Figure 6.29 Second Examples for Training Form

These are examples of two forms that have the same design but different handwriting.

Training Procedures in Reading the Registration Form

These sections describe how to train the system with different forms to improve its reading. There are a few simple steps to train the reading of the forms.
When users enter the Java-based application, they have to choose "TRAINING" in the template and press the "NEW" button to browse for the form to be trained, and then select the correct button, as shown in Figure 6.30.

Figure 6.30 Training Procedures

After the user selects the form and presses "OK", the training form opens, and the user has to enter all the characters and numbers line by line, together with how many there are in each line. The user then highlights them in the form to let the system know which characters and numbers must be trained in the selected form, as shown in Figure 6.31.

Figure 6.31 Selecting the Character and Numbers in the Form

Sending Email

One of the operations that the system can perform is sending a student's form by email. The user must have a Gmail account or a university email account in order to send the email with the student's form. The student form can be sent by email only from the Java-based application. The student's form cannot be sent from the web-based interface because that interface is already connected to the Internet. Any user who has a username and password can view the students' forms in detail in the web-based system.

When the user has accessed the Java-based application and is viewing the details of the student's form, the user needs to press the "EMAIL" button to send the form, and the student form will be sent, as seen in Figure 6.32.

Figure 6.32 Sending Student Forms by Email

6.5 System Testing

Testing is the process carried out to ensure that the system conforms to the specification and meets the requirements of the users, namely the administrator and staff of the office. Testing was conducted not only at the end but also during the development of the prototype system. Functional and interface testing were carried out for each module and for the whole system. Each and every link was checked to make sure all the links are working correctly.
Interface testing is carried out to verify that the interface works correctly and that faults are not created by interface errors.

6.5.1 Unit Testing

Unit testing tests software in terms of a unit: a module, a function, or a specific section of code. This testing occurs while the software is being developed and before completion. For unit testing, test cases are designed to verify that an individual unit implements all design decisions made in the unit's design specification. A thorough unit test specification should include positive testing, where the unit does what it is supposed to do, and also negative testing, where the unit does not do anything that it is not supposed to do. Table 6.3 shows the unit testing for the entire Administrator Functionality module.

Table 6.3 Unit Testing for the Entire Administrator user Functionality Module:

Functionality                                          Status
login to the system using user name and password       PASS
add administrative staff                               PASS
add students forms by scanning the form                PASS
delete students forms from the database system         PASS
update student forms in the database system            PASS
print students forms                                   PASS
log out of the system                                  PASS

For the staff user, Table 6.4 shows the unit testing for the entire Staff Functionality module.

Table 6.4 Unit testing for the Entire Users Functionality Module:

Functionality                                          Status
login to the system using user name and password       PASS
add students forms by scanning the form                PASS
update student forms in the database system            PASS
print students forms                                   PASS
logout of the system                                   PASS

Table 6.3 and Table 6.4 both show that the functionality of the system has been successfully achieved and that the user requirements have been met and implemented.

6.6 Summary

This chapter began with a discussion of the design of the system functionality, based on the findings of the data collection done in the case study.
Different tools were examined when considering which to use in the development and implementation of the EFS system. Finally, the chapter discussed and explained how the electronic file storage system and the Java-based application interact. The process of scanning and storing the image file and the process of converting the image file into a text file have also been discussed.

Chapter 7: Conclusion

This chapter presents the conclusion of the study. It discusses the differences between the current file system of the faculty and the new electronic file storage system that has been developed to simplify and facilitate the work of staff in administering and managing the students' files. This chapter also discusses the limitations of the new electronic file storage system and considerations for future enhancement.

7.1 Project Objectives

On the whole, the objective of this study has been achieved. As discussed in Chapter 1, section 1.4, the main objective of this research is to develop a system that improves the process of administering and managing students' files. An electronic file storage system (EFS), which is a web-based system, has been developed for this purpose.

It is anticipated that if the EFS is used at full scale in FCSIT, it can help to reduce the file storage space needed in the office, by scanning the forms and storing them electronically in the database. In addition, the system helps to protect files from natural disasters. Electronic file storage also simplifies and facilitates the work of the staff handling the files. Furthermore, the system has functionality that allows students' files to be shared by email directly from the system, using University email or Gmail (Google email).

7.2 Training Staff and Users

The objective of developing the new electronic file system is to improve the work process in the FCSIT offices. The staff and users who deal with the new system must be trained to use it competently and correctly.
Training will be required in handling the two parts of the EFS system: (i) the web-based interface and (ii) the Java-based application. During the training all the system features and functions will be explained step by step in detail. Training must be easy and friendly so that users do not have any difficulty in understanding the system. Furthermore, a detailed user manual has also been developed for reference.

The difference between training for the web-based interface and for the Java-based application lies in the use of the scanner. Working with the scanner needs more training because the users need to know how to manage the scanner and the procedure for scanning the input source, i.e. the students' forms.

7.3 System Limitation

The system has some limitations, as discussed below:

• Limitation in neural network training

The system is developed using a neural network algorithm, the self-organizing map (SOM). This algorithm cannot read everyone's handwriting because there are hundreds of different kinds of handwriting. That is why the algorithm must be trained with at least seven different types of handwriting, but not more than that, because of the system's memory limitation. The system keeps only seven trained forms. Each additional form replaces the oldest form in that memory.

7.4 Future Enhancements

Two enhancements are proposed for this project:

• Improve the storage capability of the system by enlarging the memory size of the neural network, so that more training forms can be applied.
• Reduce the scanning time by using a more advanced and faster scanner.

7.5 Summary

This chapter presented the conclusion of the study. It summarized the project objective and the training needs of the staff and users of the new EFS system. It also outlined the constraints of the new system and its future enhancements. The EFS solved all the problems that were mentioned in the problem statement in Chapter 1.
With the system developed, all the students' documents have been stored in the system's database for the university, which means there is no longer any need to archive paper files that take up office space or to lose time searching for files. The entire work process has become electronic with the new system.

Appendix A: Source Code for Converting from Image to Text

    // --- KohonenNetwork.java ---
    package som;

    /**
     * Java Neural Network Example Handwriting Recognition
     * Copyright 2005 by Heaton Research, Inc.
     * by Jeff Heaton (http://www.heatonresearch.com) 10-2005
     */
    public class KohonenNetwork extends Network {

        /**
         * The weights of the output neurons based on the input from the input neurons.
         */
        double outputWeights[][];

        /**
         * The learning method.
         */
        protected int learnMethod = 1;

        /**
         * The learning rate.
         */
        protected double learnRate = 0.3;

        /**
         * Stop training once the error falls below this value.
         */
        protected double quitError = 0.1;

        /**
         * How many retries before quit.
         */
        protected int retries = 10000;

        /**
         * Reduction factor.
         */
        protected double reduction = .99;

        // The owner object, to report to (disabled in this version).
        //protected Applet owner;

        /**
         * Set to true to abort learning.
         */
        public boolean halt = false;

        /**
         * The training set.
         */
        protected TrainingSet train;

        /**
         * The constructor. The owner parameter of the original code is
         * commented out in this version.
         *
         * @param inputCount
         *            Number of input neurons
         * @param outputCount
         *            Number of output neurons
         */
        public KohonenNetwork(int inputCount, int outputCount/*, Applet owner*/) {
            int n;
            totalError = 1.0;
            this.inputNeuronCount = inputCount;
            this.outputNeuronCount = outputCount;
            this.outputWeights = new double[outputNeuronCount][inputNeuronCount + 1];
            this.output = new double[outputNeuronCount];
            //this.owner = owner;
        }

        /**
         * Set the training set to use.
         *
         * @param set
         *            The training set to use.
         */
        public void setTrainingSet(TrainingSet set) {
            train = set;
        }

        /**
         * Copy the weights from one network to another.
         *
         * @param dest
         *            The destination for the weights.
         * @param source
         *            The source of the weights.
         */
        public static void copyWeights(KohonenNetwork dest, KohonenNetwork source) {
            for (int i = 0; i < source.outputWeights.length; i++) {
                System.arraycopy(source.outputWeights[i], 0,
                        dest.outputWeights[i], 0,
                        source.outputWeights[i].length);
            }
        }

        /**
         * Clear the weights.
         */
        public void clearWeights() {
            totalError = 1.0;
            for (int y = 0; y < outputWeights.length; y++)
                for (int x = 0; x < outputWeights[0].length; x++)
                    outputWeights[y][x] = 0;
        }

        /**
         * Normalize the input.
         *
         * @param input
         *            input pattern
         * @param normfac
         *            the result
         * @param synth
         *            synthetic last input
         */
        void normalizeInput(final double input[], double normfac[], double synth[]) {
            double length, d;
            length = vectorLength(input);
            // just in case it gets too small
            if (length < 1.E-30)
                length = 1.E-30;
            normfac[0] = 1.0 / Math.sqrt(length);
            synth[0] = 0.0;
        }

        /**
         * Normalize weights
         *
         * @param w
         *            Input weights
         */
        void normalizeWeight(double w[]) {
            int i;
            double len;
            len = vectorLength(w);
            // just in case it gets too small
            if (len < 1.E-30)
                len = 1.E-30;
            len = 1.0 / Math.sqrt(len);
            for (i = 0; i < inputNeuronCount; i++)
                w[i] *= len;
            w[inputNeuronCount] = 0;
        }

        /**
         * Try an input pattern. This can be used to present an input pattern to the
         * network. Usually it is best to call winner to get the winning neuron though.
         *
         * @param input
         *            Input pattern.
         */
        void trial(double input[]) {
            int i;
            double normfac[] = new double[1], synth[] = new double[1], optr[];
            normalizeInput(input, normfac, synth);
            for (i = 0; i < outputNeuronCount; i++) {
                optr = outputWeights[i];
                output[i] = dotProduct(input, optr) * normfac[0]
                        + synth[0] * optr[inputNeuronCount];
                // Remap to bipolar (-1,1 to 0,1)
                output[i] = 0.5 * (output[i] + 1.0);
                // account for rounding
                if (output[i] > 1.0)
                    output[i] = 1.0;
                if (output[i] < 0.0)
                    output[i] = 0.0;
            }
        }

        /**
         * Present an input pattern and get the winning neuron.
         *
         * @param input
         *            input pattern
         * @param normfac
         *            the result
         * @param synth
         *            synthetic last input
         * @return The winning neuron number.
         */
        public int winner(double input[], double normfac[], double synth[]) {
            int i, win = 0;
            double biggest, optr[];
            normalizeInput(input, normfac, synth); // Normalize input
            biggest = -1.E30;
            for (i = 0; i < outputNeuronCount; i++) {
                optr = outputWeights[i];
                output[i] = dotProduct(input, optr) * normfac[0]
                        + synth[0] * optr[inputNeuronCount];
                // Remap to bipolar (-1,1 to 0,1)
                output[i] = 0.5 * (output[i] + 1.0);
                if (output[i] > biggest) {
                    biggest = output[i];
                    win = i;
                }
                // account for rounding
                if (output[i] > 1.0)
                    output[i] = 1.0;
                if (output[i] < 0.0)
                    output[i] = 0.0;
            }
            return win;
        }

        /**
         * This method does much of the work of the learning process. This method
         * evaluates the weights against the training set.
         *
         * @param rate
         *            learning rate
         * @param learn_method
         *            method (0=additive, 1=subtractive)
         * @param won
         *            holds how many times a given neuron won
         * @param bigerr
         *            returns the error
         * @param correc
         *            returns the correction
         * @param work
         *            a work area
         * @exception java.lang.RuntimeException
         */
        void evaluateErrors(double rate, int learn_method, int won[],
                double bigerr[], double correc[][], double work[])
                throws RuntimeException {
            int best, size, tset;
            double dptr[], normfac[] = new double[1];
            double synth[] = new double[1], cptr[], wptr[], length, diff;

            // reset correction and winner counts
            for (int y = 0; y < correc.length; y++) {
                for (int x = 0; x < correc[0].length; x++) {
                    correc[y][x] = 0;
                }
            }
            for (int i = 0; i < won.length; i++)
                won[i] = 0;
            bigerr[0] = 0.0;

            // loop through all training sets to determine correction
            for (tset = 0; tset < train.getTrainingSetCount(); tset++) {
                dptr = train.getInputSet(tset);
                best = winner(dptr, normfac, synth);
                won[best]++;
                wptr = outputWeights[best];
                cptr = correc[best];
                length = 0.0;
                for (int i = 0; i < inputNeuronCount; i++) {
                    diff = dptr[i] * normfac[0] - wptr[i];
                    length += diff * diff;
                    if (learn_method != 0)
                        cptr[i] += diff;
                    else
                        work[i] = rate * dptr[i] * normfac[0] + wptr[i];
                }
                diff = synth[0] - wptr[inputNeuronCount];
                length += diff * diff;
                if (learn_method != 0)
                    cptr[inputNeuronCount] += diff;
                else
                    work[inputNeuronCount] = rate * synth[0] + wptr[inputNeuronCount];
                if (length > bigerr[0])
                    bigerr[0] = length;
                if (learn_method == 0) {
                    normalizeWeight(work);
                    for (int i = 0; i <= inputNeuronCount; i++)
                        cptr[i] += work[i] - wptr[i];
                }
            }
            bigerr[0] = Math.sqrt(bigerr[0]);
        }

        /**
         * This method is called at the end of a training iteration. This method
         * adjusts the weights based on the previous trial.
         *
         * @param rate
         *            learning rate
         * @param learn_method
         *            method (0=additive, 1=subtractive)
         * @param won
         *            holds number of times each neuron won
         * @param bigcorr
         *            holds the error
         * @param correc
         *            holds the correction
         */
        void adjustWeights(double rate, int learn_method, int won[],
                double bigcorr[], double correc[][]) {
            double corr, cptr[], wptr[], length, f;
            bigcorr[0] = 0.0;
            for (int i = 0; i < outputNeuronCount; i++) {
                if (won[i] == 0)
                    continue;
                wptr = outputWeights[i];
                cptr = correc[i];
                f = 1.0 / (double) won[i];
                if (learn_method != 0)
                    f *= rate;
                length = 0.0;
                for (int j = 0; j <= inputNeuronCount; j++) {
                    corr = f * cptr[j];
                    wptr[j] += corr;
                    length += corr * corr;
                }
                if (length > bigcorr[0])
                    bigcorr[0] = length;
            }
            // scale the correction
            bigcorr[0] = Math.sqrt(bigcorr[0]) / rate;
        }

        /**
         * If no neuron wins, then force a winner.
         *
         * @param won
         *            how many times each neuron won
         * @exception java.lang.RuntimeException
         */
        void forceWin(int won[]) throws RuntimeException {
            int i, tset, best, size, which = 0;
            double dptr[], normfac[] = new double[1];
            double synth[] = new double[1], dist, optr[];

            size = inputNeuronCount + 1;
            dist = 1.E30;
            for (tset = 0; tset < train.getTrainingSetCount(); tset++) {
                dptr = train.getInputSet(tset);
                best = winner(dptr, normfac, synth);
                if (output[best] < dist) {
                    dist = output[best];
                    which = tset;
                }
            }
            dptr = train.getInputSet(which);
            best = winner(dptr, normfac, synth);
            dist = -1.e30;
            i = outputNeuronCount;
            while ((i--) > 0) {
                if (won[i] != 0)
                    continue;
                if (output[i] > dist) {
                    dist = output[i];
                    which = i;
                }
            }
            optr = outputWeights[which];
            System.arraycopy(dptr, 0, optr, 0, dptr.length);
            optr[inputNeuronCount] = synth[0] / normfac[0];
            normalizeWeight(optr);
        }

        /**
         * This method is called to train the network. It can run for a very long time
         * and will report progress back to the owner object.
         *
         * @exception java.lang.RuntimeException
         */
        public void learn() throws RuntimeException {
            int i, key, tset, iter, n_retry, nwts;
            int won[], winners;
            double work[], correc[][], rate, best_err, dptr[];
            double bigerr[] = new double[1];
            double bigcorr[] = new double[1];
            KohonenNetwork bestnet; // Preserve best here

            totalError = 1.0;
            for (tset = 0; tset < train.getTrainingSetCount(); tset++) {
                dptr = train.getInputSet(tset);
                if (vectorLength(dptr) < 1.E-30) {
                    throw (new RuntimeException(
                            "Multiplicative normalization has null training case in training set "
                                    + tset));
                }
            }
            bestnet = new KohonenNetwork(inputNeuronCount, outputNeuronCount/*, owner*/);
            won = new int[outputNeuronCount];
            correc = new double[outputNeuronCount][inputNeuronCount + 1];
            if (learnMethod == 0)
                work = new double[inputNeuronCount + 1];
            else
                work = null;
            rate = learnRate;
            initialize();
            best_err = 1.e30;

            // main loop:
            n_retry = 0;
            for (iter = 0;; iter++) {
                evaluateErrors(rate, learnMethod, won, bigerr, correc, work);
                totalError = bigerr[0];
                if (totalError < best_err) {
                    best_err = totalError;
                    copyWeights(bestnet, this);
                }
                winners = 0;
                for (i = 0; i < won.length; i++)
                    if (won[i] != 0)
                        winners++;
                if (bigerr[0] < quitError)
                    break;
                if ((winners < outputNeuronCount)
                        && (winners < train.getTrainingSetCount())) {
                    forceWin(won);
                    continue;
                }
                adjustWeights(rate, learnMethod, won, bigcorr, correc);
                // owner.updateStats(n_retry,totalError,best_err);
                if (halt) {
                    // owner.updateStats(n_retry,totalError,best_err);
                    break;
                }
                Thread.yield();
                if (bigcorr[0] < 1E-5) {
                    if (++n_retry > retries)
                        break;
                    initialize();
                    iter = -1;
                    rate = learnRate;
                    continue;
                }
                if (rate > 0.01)
                    rate *= reduction;
            }

            // done
            copyWeights(this, bestnet);
            for (i = 0; i < outputNeuronCount; i++)
                normalizeWeight(outputWeights[i]);
            halt = true;
            n_retry++;
            // owner.updateStats(n_retry,totalError,best_err);
        }

        /**
         * Called to initialize the Kohonen network.
         */
        public void initialize() {
            int i;
            double optr[];
            clearWeights();
            randomizeWeights(outputWeights);
            for (i = 0; i < outputNeuronCount; i++) {
                optr = outputWeights[i];
                normalizeWeight(optr);
            }
        }
    }

    // --- Network.java ---
    package som;

    import java.io.*;
    import java.util.*;

    /**
     * Java Neural Network Example Handwriting Recognition
     * Copyright 2005 by Heaton Research, Inc.
     * by Jeff Heaton (http://www.heatonresearch.com) 10-2005
     */
    abstract public class Network implements Serializable {

        /**
         * The value to consider a neuron on
         */
        public final static double NEURON_ON = 0.9;

        /**
         * The value to consider a neuron off
         */
        public final static double NEURON_OFF = 0.1;

        /**
         * Output neuron activations
         */
        protected double output[];

        /**
         * Mean square error of the network
         */
        protected double totalError;

        /**
         * Number of input neurons
         */
        protected int inputNeuronCount;

        /**
         * Number of output neurons
         */
        protected int outputNeuronCount;

        /**
         * Random number generator
         */
        protected Random random = new Random(System.currentTimeMillis());

        /**
         * Called to learn from training sets.
         *
         * @exception java.lang.RuntimeException
         */
        abstract public void learn() throws RuntimeException;

        /**
         * Called to present an input pattern.
         *
         * @param input
         *            The input pattern
         */
        abstract void trial(double[] input);

        /**
         * Called to get the output from a trial.
         */
        double[] getOutput() {
            return output;
        }

        /**
         * Called to calculate the trial errors.
         *
         * @param train
         *            The training set.
         * @return The trial error.
         * @exception java.lang.RuntimeException
         */
        double calculateTrialError(TrainingSet train) throws RuntimeException {
            int i, size, tset, tclass;
            double diff;

            totalError = 0.0; // reset total error to zero

            // loop through all samples
            for (int t = 0; t < train.getTrainingSetCount(); t++) {
                // trial
                trial(train.getOutputSet(t));
                tclass = (int) (train.getClassify(train.getInputCount() - 1));
                for (i = 0; i < train.getOutputCount(); i++) {
                    if (tclass == i)
                        diff = NEURON_ON - output[i];
                    else
                        diff = NEURON_OFF - output[i];
                    totalError += diff * diff;
                }
                for (i = 0; i < train.getOutputCount(); i++) {
                    diff = train.getOutput(t, i) - output[i];
                    totalError += diff * diff;
                }
            }
            totalError /= (double) train.getTrainingSetCount();
            return totalError;
        }

        /**
         * Calculate the squared length of a vector, i.e. the sum of the squares
         * of its components. Callers take the square root where the true length
         * is needed.
         *
         * @param v
         *            vector
         * @return Squared vector length.
         */
        public static double vectorLength(double v[]) {
            double rtn = 0.0;
            for (int i = 0; i < v.length; i++)
                rtn += v[i] * v[i];
            return rtn;
        }

        /**
         * Called to calculate a dot product. The loop is unrolled four elements
         * at a time, with the remainder handled separately.
         *
         * @param vec1
         *            one vector
         * @param vec2
         *            another vector
         * @return The dot product.
         */
        double dotProduct(double vec1[], double vec2[]) {
            int k, m, v;
            double rtn;
            rtn = 0.0;
            k = vec1.length / 4;
            m = vec1.length % 4;
            v = 0;
            while ((k--) > 0) {
                rtn += vec1[v] * vec2[v];
                rtn += vec1[v + 1] * vec2[v + 1];
                rtn += vec1[v + 2] * vec2[v + 2];
                rtn += vec1[v + 3] * vec2[v + 3];
                v += 4;
            }
            while ((m--) > 0) {
                rtn += vec1[v] * vec2[v];
                v++;
            }
            return rtn;
        }

        /**
         * Called to randomize weights.
         *
         * @param weight
         *            A weight matrix.
         */
        void randomizeWeights(double weight[][]) {
            double r;
            int temp = (int) (3.464101615 / (2. * Math.random())); // SQRT(12)=3.464...
            for (int y = 0; y < weight.length; y++) {
                for (int x = 0; x < weight[0].length; x++) {
                    r = (double) random.nextInt() + (double) random.nextInt()
                            - (double) random.nextInt() - (double) random.nextInt();
                    weight[y][x] = temp * r;
                }
            }
        }
    }

References

Abusafiya, M. & Mazumdar, S.
2004, Accommodating paper in document databases, ACM New York, NY, USA, pp. 155-162.
Aura, T., Kuhn, T.A. & Roe, M. 2006, Scanning electronic documents for personally identifiable information, ACM New York, NY, USA, pp. 41-50.
Avison, D. & Fitzgerald, G. 2006, Information systems development methodologies, tools and techniques, 4th edn, McGraw-Hill Education.
Chen, H.H. 1996, Neural network: software simulation on a massively parallel computer, University of Portsmouth, Unpublished B.Sc. (Hons) Computing Final Year Project.
Cho, J.M. 2000, 'Chromosome classification using back propagation neural networks', IEEE Engineering in Medicine and Biology Magazine, vol. 19, no. 1, pp. 28-33.
Gilb, T. 1985, Evolutionary Delivery versus the "waterfall model", vol. 10, ACM New York, NY, USA, pp. 49-61.
Groetzner, M., Guenthner, U. & Streckeisen, H. 2004, Method of storage management in document databases, United States Patent 6704753, retrieved online from http://www.freepatentsonline.com/6704753.html.
Kohonen, T. 1998, 'The self-organizing map', Neurocomputing, vol. 21, no. 1-3, pp. 1-6.
Konishi, K. & Ikeda, N.F.H. 2007, Data model and architecture of a paper-digital document management system, ACM New York, NY, USA, pp. 29-31.
Lamarca, A.G., Dourish, J.P., Edwards, W.K. & Salisbury, M.P. 2006, Tagging related files in a document management system, United States Patent 7086000, retrieved online from http://www.freepatentsonline.com/7086000.html.
Lea, G.M. & Smith Judy Read, K.N.F. 2002, Records Management With Disk And Practice Set With Disk: (2 Books With Disks).
Liu, S., Mcmahon, C.A. & Culley, S.J. 2008, A review of structured document retrieval (SDR) technology to improve information access performance in engineering document management, vol. 59, Elsevier, pp. 3-16.
Matheu, F. 2005, Life cycle document management system for construction, Doctoral Thesis, Universitat Politecnica De Catalunya, Spain, retrieved online from http://www.tdx.cat/TDX-0518105-155912/#documents.
Matsuo, H., Nakamura, T. & Tatekawa, M. 2001, Electronic paper file, United States Patent 7249324, retrieved online from http://www.freepatentsonline.com/7249324.html.
Matthews, J. 2000, 'An Introduction to Neural Networks', Generation5 at the forefront of Artificial Intelligence.
Oja, M., Kaski, S. & Kohonen, T. 2003, 'Bibliography of self-organizing map (SOM) papers: 1998-2001 addendum', Neural Computing Surveys, vol. 3, no. 1, pp. 1-156.
Omar, M. 2005, Felda document management system, Master thesis, Universiti Teknologi Malaysia.
Robson, C. 2002, Real world research: a resource for social scientists and practitioner researchers, Blackwell, Oxford, UK.
Royce, W.W. 1987, 'Managing the development of large software systems: concepts and techniques', IEEE Computer Society Press Los Alamitos, CA, USA, pp. 328-338.
Sellen, A. & Harper, R. 1997, Paper as an analytic resource for the design of new technologies, ACM New York, NY, USA, pp. 319-326.
Sprague Jr, R.H. 1995, Electronic document management: Challenges and opportunities for information systems managers, The Society for Information Management and The Management Information Systems Research Center of the University of Minnesota, pp. 29-49.
Yin, R.K. 1994, Case Study research: Design and methods, Second edition, Thousand Oaks: Sage Publications, Inc.
York, R. 2006, Ecological paradoxes: William Stanley Jevons and the paperless office, vol. 13, Society for Human Ecology, p. 143.
Zantout, H. & Marir, F. 1999, Document management systems from current capabilities towards intelligent information retrieval: an overview, vol. 19, Elsevier, pp. 471-484.
Zikmund, W.G. 1987, Business research methods, Dryden Press.