Interim Design Report for the Idaho State Board of Education Document Search Tool

Prepared by: Data Miners Senior Design Team
Team Members: Dallas Stinger, Wenlong Huang, Aaron Phillips
Date: December 10, 2010

Executive Summary

The Data Miners team designed and implemented a software application that allows the Idaho State Board of Education to search a large collection of text-based documents. The application (ISBEDigger) is designed to run on Windows XP, Windows Vista, and Windows 7. It was written in the C# programming language, using the Microsoft .NET Framework.

The software has several run-time phases. The first is called indexing. Our software pre-analyzes all available documents, recording each unique word in each document, where it occurs within the document, and how many times it occurs. The results of this phase are stored in an index file, which is used to search the documents for keywords.

The second phase is searching. The software uses multiple techniques to search documents based on inexact search parameters. The first is known as “stemming”. Stemming involves finding the root of each word in the search string, as well as the “stems” of those words, in order to perform a broad search of the index file. Our software also uses a large thesaurus to build a list of synonyms, which are also used to search the index file.

The third phase is the results phase. When the user performs a search, a list of documents found to contain relevant information is displayed. The displayed files are ranked according to how well they match the initial search parameters. When the user selects a file, the text from the original file is read in and displayed, with relevant keywords highlighted.

ISBEDigger is currently able to read several different file-types. It can index Microsoft Word, Excel, and plaintext files, as well as PDF files that do not require optical character recognition.

Table of Contents

1.0 Introduction
    1.1 Background
    1.2 Objective
    1.3 Run-Time Modes
2.0 Problem Definition
    2.1 Search a Collection of Documents Efficiently
    2.2 Search Text Without Exact String Specification
    2.3 Handle Multiple File-Types
    2.4 Multi-User Access From Windows-Based Computers
    2.5 Correlation Between Documents
3.0 Concepts Considered
    3.1 Operating System and Underlying Framework
    3.2 Native vs. Web-Based Interface
    3.3 Searching Documents Efficiently
    3.4 Finding Useful Results Without Exact String Searching
    3.5 Handling Multiple File-Types
    3.6 Correlation Between Documents
4.0 Concept Selection
    4.1 Operating System and Underlying Framework
    4.2 Native vs. Web-Based Interface
    4.3 Searching Documents Efficiently
    4.4 Finding Useful Results Without Exact String Searching
    4.5 Handling Multiple File-Types
    4.6 Correlation Between Documents
5.0 Selected Design
    5.1 User Interface
    5.2 Reverse Indexing
    5.3 Searching
    5.4 Reading File-Types
6.0 Future Work
    6.1 Scheduled Delivery
    6.2 Future Semesters
Appendix A – Requirements Document
Appendix B – Searching Flowchart

1.0 Introduction

1.1 Background
The client for this project was the Idaho State Board of Education. The State Board currently has a large collection of data stored on its internal network. This data set contains several different file-types. It includes meeting minutes from State Board meetings, as well as documents related to decisions brought to the board for approval. Because the collection is large and only loosely organized, it is difficult to search it thoroughly and find relevant information at a later date, which is a necessary task. The State Board therefore wants a tool that can search the documents quickly for relevant information.

1.2 Objective
The goal of the Data Miners senior design team was to develop a piece of software that assists the State Board employees in searching the collection of documents for relevant information. We did this by creating an easy-to-use search tool that runs on most Windows-based computers and can search several modern types of electronic text-based documents. We designed the software to search not only using the given search parameters, but also to allow the user to search effectively without knowing an exact text string. We did this by using a well-known stemming algorithm, as well as a thesaurus. Our objective was to go through one iteration of the research/design/implementation/testing cycle, and deliver a working prototype by the end of the semester.

1.3 Run-Time Modes
The ISBEDigger program has three distinct modes of run-time operation. The first mode, which is required for basic functionality, is called indexing. Indexing involves pre-analyzing all documents that the user has selected for searching. The indexer will create a list of all unique words, where they occur, and how many times they appear in each document.
This information is stored in a compact index file, which is used during the search phase. The second mode is searching. Once the user has indexed the desired files, he or she can enter search parameters using one of two search modes. The software then scans the index file using the given parameters, as well as parameters that are calculated using stemming and a thesaurus. Relevant documents are presented to the user in order of calculated relevance. The third mode allows the user to select documents to preview, as well as access the original files. If the user chooses to preview a document, the document text will be displayed, with relevant words highlighted.

2.0 Problem Definition

Below we outline several major project requirements. See Appendix A for a complete listing.

2.1 Search a Collection of Documents Efficiently
To find relevant information today, a State Board employee must search through the shared network drive that contains the set of documents representing all meeting minutes and board decisions. This requires knowing which document to look in, when it was created, and where exactly in the document the pertinent information resides (some documents are over 1,600 pages in length). Our software needs to be able to search through all of this information and provide accurate results in a timely manner. Our goal was to provide search results in under one second, given reasonable search parameters.

2.2 Search Text Without Exact String Specification
An important requirement that our software needs to satisfy is the ability to provide useful search results without requiring exact keywords from the user. We spent significant resources researching and implementing software that would broaden the search results to include text that was related to the given search parameters.

2.3 Handle Multiple File-Types
The State Board has been accumulating electronic documents for almost 20 years.
This means that the documents we need to search represent a wide variety of file-types. Our software should read Microsoft Word, Excel, PDF, WordPerfect, and plaintext files. Our software must also be able to display a preview of a selected document, as well as open the original file. This requirement involved a significant amount of research into programmatic processing of these file-types.

2.4 Multi-User Access From Windows-Based Computers
The computers that the State Board employees use run Microsoft Windows, and as such our software must be able to run in that environment. We must also allow multiple employees to access the software at the same time.

2.5 Correlation Between Documents
The documents that are created by the State Board often have shared or related content between separate files. The State Board employees would like to be made aware of related information, as well as time-sensitive information that exists in the documents.

3.0 Concepts Considered

3.1 Operating System and Underlying Framework
One of the major requirements we were given was that our software must be able to run on Windows-based machines. We took this to mean Windows XP, Windows Vista, and Windows 7. This limited the choice of languages and libraries available to us. We considered several popular programming platforms:

C#:
  o A Microsoft product, C# is a large, well-documented platform that was an excellent candidate for our project. Two members of our team have prior experience with the language, and it is built from the ground up to work on Windows. It also provides support for COM components, which would turn out to be crucial to our ability to support Word and Excel files.

C++:
  o All three team members have prior experience with C++, and it is able to run in a Windows environment.
    It is also the fastest of all the languages that we considered, which made it an excellent option, considering the data-processing-intensive indexing and searching that were required.

Python:
  o Python was considered due to the ease with which we could have created a web interface for our application. The downside to Python is that because it is an interpreted language, it runs slower than our other candidates.

Java:
  o Java is well documented, and is also well supported in the Windows environment. However, it is not as tightly coupled with Windows as C#, and does not provide the speed boost that C++ does. Also, only one team member had prior experience with Java.

3.2 Native vs. Web-Based Interface
One of the requirements given was that the State Board wanted the application to be accessible from the web. This differs from the traditional solution, which would be to run the software on each user’s machine. There were several concepts to consider when we made this decision:

Permissions:
  o Some of the employees that use our software may not have access to all possible documents. Some documents will have file permissions that restrict viewing. This means that a system would have to be developed to determine who has access to which documents when accessing them remotely. This would likely involve a database that logs remote user credentials.

Indexing:
  o If the application were web-based, the software would only have to index once a day on a server, instead of once per user. This would simplify the indexing process.

Machine Learning:
  o If the software runs on a user’s machine, it becomes much easier to keep track of which documents an individual accesses most, and weight them accordingly in future searches.

Research and Implementation Time Required:
  o The time required to implement a web-based interface would be significantly higher than a native one.
    The prior experience of the team members, the challenges that come with client-server communication, and security are all hindrances to a web-based interface.

3.3 Searching Documents Efficiently
The major algorithmic problem we encountered was how to best search through the large collection of documents. We researched and considered two approaches:

Real-Time Search:
  o Real-time searching is simple and easy to implement. This algorithm involves opening each file in the system, and scanning each word for matches with the search criteria. The benefit of this method is that there is no initial overhead. Unfortunately, search time increases significantly as the size of the text grows.

Reverse-Indexing:
  o Reverse-indexing is the process of analyzing all documents before doing any searching. The words, as well as their location and frequency, are recorded in an index file, which is then searched at runtime. This algorithm has a large initial overhead, but greatly decreases search time, and makes it possible to do inexact searching.

3.4 Finding Useful Results Without Exact String Searching
Another major requirement that our system needed to satisfy was the ability to search for text without knowing exact keywords. We researched and implemented two separate methods in order to accomplish this:

Stemming:
  o Stemming involves finding the root word of each search parameter, as well as the “stems” of that word, which are the combinations of the root word with possible suffixes. This allows us to provide search results that are extremely close in content, without needing an exact match between search keywords and results. We had the option of using one of several known stemming algorithms, or creating our own custom algorithm. A custom algorithm would provide results that are tailored to the context of the State Board, but would require much more time to implement.

Thesaurus:
  o We decided to use a thesaurus to generate a larger set of keywords when searching.
    As with the stemming algorithm, we were faced with the choice of creating our own thesaurus, which would have words relating to our specific domain, or using a pre-built thesaurus, which is much easier to implement.

3.5 Handling Multiple File-Types
We were given a broad range of file-types that needed to work with our system. We considered several file-types, as well as methods of implementation.

Microsoft Word and Excel:
  o Microsoft Word and Excel files are two of the more common file-types used by the State Board. We considered writing a piece of software that would parse these files, as well as existing solutions. Writing a parser for these files would be an extremely time-consuming task, as Word/Excel files have a vast array of options that must be accounted for. The two viable options that we had were using COM interop components (provided by Microsoft), or using third-party software (pay-to-use).

PDF:
  o Most of the sample files we had access to were of this type. There were several different options to consider here. First, we had to decide whether or not to attempt to read PDFs that are created from scanned images, which require optical character recognition (OCR). Text-based PDFs, the simple case, are much easier to read. We also had to consider whether we wanted to write our own parser, or use third-party software.

WordPerfect:
  o WordPerfect files are not abundant in the State Board’s system, as most have been converted to PDF or Word documents. Also, since it is an older file-type, it would be more difficult to incorporate into our system. These facts placed WordPerfect files low on our priority list.

Plaintext:
  o Plaintext files are not abundant on the system, but are extremely simple to read.

We also faced a design decision regarding how best to structure our code to allow for easy re-use and extension. We designed a class hierarchy that works well for the file-types described above and allows more to be added in the future.
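
The extensible structure described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the team's C# code. The class names Document and TextDocument follow Section 5.4; the method name extract_text, the READERS registry, and the open_document helper are hypothetical names introduced only for this illustration.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class Document(ABC):
    """Base type: every file-type exposes its text in the same way."""

    def __init__(self, path):
        self.path = Path(path)

    @abstractmethod
    def extract_text(self) -> str:
        """Return the document's contents as plain text."""


class TextDocument(Document):
    """Plaintext reader; other file-types would get their own subclass."""

    def extract_text(self) -> str:
        return self.path.read_text(encoding="utf-8", errors="replace")


# Hypothetical registry mapping file extensions to Document subclasses.
# Supporting a new file-type means writing one subclass and adding one entry.
READERS = {".txt": TextDocument}


def open_document(path):
    """Return a Document for a supported file, or None if unsupported."""
    reader = READERS.get(Path(path).suffix.lower())
    return reader(path) if reader else None
```

Under these assumptions, the rest of the program only ever calls extract_text on a Document, so adding a reader for another file-type would not require changes to the indexing or searching code.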

3.6 Correlation Between Documents
We considered several different methods for correlating related documents. We wanted a way to group documents together both automatically and manually. We proposed the concept of “tagging,” as well as date-correlation. Tagging involves creating and maintaining a set of tags that are applied both manually and automatically (by using a set of rules to classify documents) to documents that are added to the system. This would allow employees to group documents that contain information on similar projects or ideas. Date-correlation essentially means grouping documents together based on the dates that they were created or last modified.

4.0 Concept Selection

4.1 Operating System and Underlying Framework
The members of the group decided that C# was the best choice of language given the constraints. C# was built to run on Windows-based machines, and has a large framework that was created specifically for that environment. C# also makes it easy to develop a user interface, whether native or web-based. Finally, we felt that the run-time speed, while arguably slower than C++, was fast enough to fit our needs.

4.2 Native vs. Web-Based Interface
The group elected to develop a native Windows application, as opposed to a web-based application. Although a web-based interface was a requested feature, we did not feel that it was the best use of our time. We were operating under a short one-semester timeframe, and no members of the group had prior experience with web-based applications. In addition, a web-based application would have required a security credentials system. By developing a native application, we no longer have to worry about handling any security credentials. Another benefit of a native application is that it would allow us to easily track the user’s search history. This information could be used to rank previously viewed files higher in future searches.

4.3 Searching Documents Efficiently
The design we chose for searching was reverse indexing. This was clearly the best solution, as its benefits far outweigh those of real-time searching. While we would not have to worry about creating any index files with real-time searching, searches would take far longer than one second to run, and adding inexact searching on top of that would only make the problem worse. By using a reverse indexing scheme, we are able to move the large majority of the computation into the “pre-analysis” phase. The user lets the software index for several hours ahead of time, and in return each search runs much faster. Also, we have developed a “Windows Service,” which is a small program that runs in the background on the user’s machine and starts the indexing program at 1 a.m. daily. This will hopefully take the burden of indexing off of the user.

4.4 Finding Useful Results Without Exact String Searching

Stemming:
  o Due to the complex and inconsistent nature of the English language, we decided not to develop our own stemming algorithm. Instead, we implemented the Porter Stemming Algorithm, which is widely used in industry, in our application. It is somewhat dated, but it was relatively simple to implement, especially compared with trying to create one from the ground up.

Thesaurus:
  o As with the stemming algorithm, we decided not to create our own thesaurus. Instead, we use a large open-source thesaurus to find synonyms for our program. The advantage of this method is that it saved the implementation time that assembling a custom thesaurus matched to the words found in the collection of documents would have required. However, this also means that when we search for synonyms, we sometimes come up with words that are not relevant.

4.5 Handling Multiple File-Types
We decided on several different techniques for allowing compatibility with the desired file-types.

Microsoft Word and Excel:
  o Microsoft Word and Excel are built using the outdated COM architecture. What this means for us is that there is no .NET library that handles reading and writing of Office documents in managed code. Instead, we either had to use Microsoft’s COM interop DLLs from within C#, or purchase a more stable third-party library. Because we were not able to spend any money, we decided to go with the COM components.

PDF:
  o Reading PDF files represents a unique challenge, due to the fact that there are PDFs representing scanned images in the State Board’s collection of documents. These require optical character recognition (OCR), which we felt would be a time-consuming feature to implement. While there are OCR libraries available, we decided to use a free third-party library to read standard text-based PDFs, and leave OCR capability as a future addition.

WordPerfect:
  o Because the State Board has largely converted WordPerfect files to Word documents or PDF files, we decided not to implement this functionality.

Plaintext:
  o Plaintext files are by far the easiest file-type to implement. We used standard .NET libraries to read these types of files.

In addition, we decided to create a class structure that would allow for easy extension, should the State Board decide more file-types are required.

4.6 Correlation Between Documents
Our team designed a solution for correlation that involved a tagging system (Section 3.6). This system would allow users to create and maintain a collection of tags that could be applied to sets of documents. However, after bringing this solution before the State Board, it became clear that it did not meet their requirements. Therefore, we decided instead to add a search parameter that allows the user to search for documents within a specific date range. This does not provide any coupling between documents that can be retrieved later, but it does allow the user to find documents that are related based on date.
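
As a rough illustration of the date-range parameter, the sketch below trims a list of search results to those created within a given range. It is written in Python rather than the project's C#, and the SearchResult record, its field names, and the filter_by_date function are hypothetical; the report does not describe how ISBEDigger implements this filter internally.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class SearchResult:
    # Hypothetical record; the field names are illustrative only.
    file_name: str
    created: date
    score: float


def filter_by_date(
    results: List[SearchResult],
    start: Optional[date] = None,
    end: Optional[date] = None,
) -> List[SearchResult]:
    """Keep results created within [start, end]; a bound of None is open."""
    return [
        r
        for r in results
        if (start is None or r.created >= start)
        and (end is None or r.created <= end)
    ]
```

Both bounds are inclusive and optional in this sketch, so a search can be limited to documents created after a date, before a date, or between two dates. The order of the incoming results, and therefore their ranking, is preserved.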

5.0 Selected Design

5.1 User Interface
Figure 5.1 below shows the main program window, which is divided into three parts. The first part, which occupies the left quarter of the window, is the search box. The search box has two main parts. The top part is the Simple Search box, where the user can enter a search string, click the “search” button, and view the results. The only option for this search method is a choice between a narrow and a broad search, which tells the program how many words apart keywords can be when searching for results. Below the Simple Search box is the Advanced Search. Here the user enters a search string, as well as several optional search parameters. The user may enter words that must appear in the files, words that should not appear in the files, a date range, and specific file-types. The user may also choose whether or not to include stems and synonyms in the search results. The “slider bar” in the Advanced Search determines how the system should rank stems vs. synonyms.

Figure 5.1: Main Application Window

To the right of the search boxes is the results box. When the user performs a search, the table at the top is populated with the search results. Each row contains the search score the file received, as well as the file-name, location, creation date, and type. When the user double-clicks a row, the preview box is populated with the document text. If the user right-clicks on a row, a menu appears that allows the user to view the original file.

Additionally, the top menu bar allows the user to add or remove locations that should be indexed, as well as start the indexing process.

5.2 Reverse Indexing
The first action that the user must perform when using the system is indexing. Without this step, there will be no index file, which means that no search results will be returned. Figure 5.2 shows a flowchart representing the indexing process. When indexing begins, a list of files is created from the directory list that is maintained by the user.
The directories are recursively searched for all files that are able to be read by our software. The files are then “tokenized” into words, which is done through the use of regular expressions. After “stop words” (common words with no search value) have been removed, a HashSet is created for each file. The HashSet pairs each unique word in the whole system with an object representing where the word was found in each file, along with its frequency. The hash set is then saved in text form in a file on the system. The file is extremely compact, and does not take up a significant amount of storage on the machine.

Figure 5.2: Indexing

5.3 Searching
Our search algorithm is fairly complex, and involves multiple steps (a complete diagram can be found in Appendix B). The two main modes of searching are Simple and Advanced:

Simple Search:
  1. User enters a search string, which is used to perform a binary search of the index file.
  2. A list of stems and synonyms of each word in the original string is created.
  3. The index file is searched using the stems and synonyms, and a score is created based on the possible permutations of words, and standard deviation of the distances between words.
  4. If the set of results is not large enough at this point, files with partial combinations are added to the results list.

Advanced Search:
  1. User enters a search string, which is used to perform a binary search of the index file.
  2. The results from step 1 are trimmed based on the optional search parameters.
  3. Stemming is (optionally) performed.
  4. Synonyms are (optionally) computed.
  5. The index file is searched using the stems and synonyms, and a score is created based on the possible permutations of words, and standard deviation of the distances between words.
  6. If the set of results is not large enough at this point, files with partial combinations are added to the results list.

5.4 Reading File-Types
We have designed our system not only to allow for the file-types we have implemented, but also for future types that may be added. We accomplished this through the use of abstraction and inheritance, both of which are features of object-oriented programming. Figure 5.4 shows the basic inheritance structure.

Figure 5.4: Document Class Hierarchy

The WordDocument, PDFDocument, ExcelDocument, and TextDocument classes all “inherit” from the Document class. This means that they conform to the structure of the Document class. Each class performs the same function, but in a different way. For example, they all retrieve text from documents, but the WordDocument class uses a COM interop DLL, whereas the PDFDocument class uses the iTextSharp library. The value of this is that if the State Board decides that more file-types are required, this structure easily allows for extension. A child class would be created that inherits from Document, but implements its own type of file parsing. When the class is added to our system, it would only take a change in a few lines of code to make it function like Word, Excel, PDF, and text files do.

6.0 Future Work

6.1 Scheduled Delivery
The software is currently undergoing some basic usability testing, and bugs are being fixed. We plan on releasing the software to the client between December 10, 2010 and December 18, 2010. Because our senior design class is only one semester long, we must unfortunately leave some improvements as suggestions that will hopefully be implemented by future teams.

6.2 Future Semesters
We have identified several areas of our application that could be improved upon or extended. In addition, we recommend several enhancements that have not been implemented at any level.

Reading Files:
  o As mentioned in Section 4.5, we currently have the ability to search through Microsoft Word and Excel files, non-OCR PDF files, and plaintext files. Future work should include an extension of these capabilities to all file-types represented in the State Board’s collection, especially WordPerfect and OCR-PDF files.
  o Support for Microsoft Word and Excel files is currently built on top of unreliable COM interop components. We recommend that these be replaced with a more stable third-party library.

Indexing Improvements:
  o Currently, we have a relatively simple reverse-indexing scheme. If a future team were to focus on creating a more complex scheme, indexing time could be significantly reduced. In addition, the indexing scheme could be made to cater more to the needs of the State Board. If the index were trimmed to only include words that are relevant in the Board’s domain, this could lead to more relevant search results.

Searching Improvements:
  o The time it takes to return results in our system grows quickly with the number of search terms added. It depends on the specific words entered, but when more than eight search terms are given, the time required to return results is much higher than the goal of one second.
  o Our thesaurus is extremely broad. If it could be trimmed, or a new one created from the ground up that contained only words relevant to the State Board, search results would be more relevant, and search times would decrease.
  o The Porter stemming algorithm currently in use by our software is outdated. We recommend that a new algorithm be implemented that is more up-to-date and reliable.

Correlation:
  o Currently, the only correlation that exists in our system is trimming search results to a date range. This, however, does not group documents beyond a single search. We recommend that a system be implemented that allows documents to be semi-permanently correlated based on content or date.

Decision Database:
  o The State Board mentioned that a decision database would be a useful tool. They would like something that tracks all board decisions, as well as their lifetimes. Employees would be able to search the database for motions that are still in effect given a specific date range.

Appendices

Appendix A: Requirements Document

Appendix B: Searching Flowchart