Interim Design Report for the
Idaho State Board of Education Document Search Tool
Prepared by: Data Miners Senior Design Team
Team Members:
Dallas Stinger
Wenlong Huang
Aaron Phillips
Date: December 10, 2010
Executive Summary
The Data Mining team designed and implemented a software application that allows
the Idaho State Board of Education to search a large collection of text-based
documents. The application (ISBEDigger) is designed to run on Windows XP,
Windows Vista, and Windows 7. It was written in the C# programming language,
using the Microsoft .NET Framework.
The software has several run-time phases. The first is called indexing. Our software
pre-analyzes all available documents, recording each unique word in the document,
where it occurs within the document, and how many times it occurs. The results of
this phase are stored in an index file, which is used to search the documents for
keywords.
The second phase is searching. The software utilizes multiple techniques to search
documents based on inexact search parameters. The first is what is known as
“stemming”. Stemming involves finding the root of each word in the search string, as
well as “stems” of those roots, in order to perform a broad search of the index file.
Our software also uses a large thesaurus to build a list of synonyms, which are also
used to search the index file.
The third phase is the results phase. When the user performs a search, a list of
documents found to contain relevant information is displayed. The displayed files
are ranked according to how well they reflect the initial search parameters. When
the user selects a file, the text from the original file is read in and displayed, with
relevant keywords highlighted.
ISBEDigger is currently able to recognize several different file-types. It can index
Microsoft Word, Excel, and plaintext files, as well as PDF files that do not require
optical character recognition.
Table Of Contents
1.0 Introduction
1.1 Background
1.2 Objective
1.3 Run-Time Modes
2.0 Problem Definition
2.1 Search a Collection of Documents Efficiently
2.2 Search Text Without Exact String Specification
2.3 Handle Multiple File-Types
2.4 Multi-User Access From Windows-Based Computers
2.5 Correlation Between Documents
3.0 Concepts Considered
3.1 Operating System and Underlying Framework
3.2 Native vs. Web-Based Interface
3.3 Search a Collection of Documents Efficiently
3.4 Finding Useful Results Without Using Exact String Searching
3.5 Handling Multiple File-Types
3.6 Correlation Between Documents
4.0 Concept Selection
4.1 Operating System and Underlying Framework
4.2 Native vs. Web-Based Interface
4.3 Search a Collection of Documents Efficiently
4.4 Finding Useful Results Without Using Exact String Searching
4.5 Handling Multiple File-Types
4.6 Correlation Between Documents
5.0 Selected Design
5.1 User Interface
5.2 Reverse Indexing
5.3 Searching
5.4 Reading File-Types
6.0 Future Work
6.1 Scheduled Delivery
6.2 Future Semesters
Appendix A – Requirements Document
Appendix B – Searching Flowchart
1.0 Introduction
1.1 Background
The client for this project was the Idaho State Board of Education. The State Board
currently has a large collection of data stored on their internal network. This large
data set contains several different file-types. It includes meeting minutes from State
Board meetings, as well as documents related to decisions brought to the board for
approval. Because this information is large and only loosely organized, it is
difficult to search thoroughly and find relevant information at a later date, which
is a necessary task. The State Board therefore wants a tool that can search
through the documents quickly for relevant information.
1.2 Objective
The goal of the Data Miners senior design team was to develop a piece of software
that assists the State Board employees in searching the collection of documents for
relevant information. We did this by creating an easy-to-use search tool that runs on
most Windows-based computers, and can search several modern types of electronic
text-based documents. We designed the software to search not only using the given
search parameters, but also to allow the user to search effectively without knowing
an exact text string. We did this by utilizing a well-known stemming algorithm, as
well as through the use of a thesaurus. Our objective was to go through one iteration
of the research/design/implementation/testing cycle, and deliver a working
prototype by the end of the semester.
1.3 Run-Time Modes
The ISBEDigger program has three distinct modes of run-time operation. The first
mode, which is required for basic functionality, is called indexing. Indexing involves
pre-analyzing all documents that the user has selected for searching. The indexer
will create a list of all unique words, where they occur, and how many times they
appear in each document. This information is stored in a compact index file, which is
used during the search phase. The second mode is searching. Once the user has
indexed the desired files, he or she can enter search parameters using one of two
search modes (Simple or Advanced). The software then scans the index file using the
given parameters, as well as parameters that are calculated using stemming and a
thesaurus. Relevant
documents are presented to the user in order of calculated relevancy. The third
mode allows the user to select documents to preview, as well as access the original
files. If the user chooses to preview a document, the document text will be displayed,
with relevant words highlighted.
2.0 Problem Definition
Below we outline several major project requirements. See Appendix A for a
complete listing.
2.1 Search a collection of documents efficiently
In order to find relevant information currently, a State Board employee must search
through the shared network drive that contains the set of documents representing
all meeting minutes and board decisions. This requires knowing which document to
look in, when it was created, and where exactly in the document the pertinent
information resides (some documents are over 1,600 pages in length). Our software
needs to be able to search through all of this information and provide accurate
results in a timely manner. Our goal was to provide search results in under one
second, given reasonable search parameters.
2.2 Search text without exact string specification
An important requirement that our software needs to satisfy is the ability to provide
useful search results without requiring exact keywords from the user. We spent
significant resources researching and implementing software that would broaden
the search results to include text that was related to the given search parameters.
2.3 Handle multiple file-types
The State Board has been accumulating electronic documents for almost 20 years.
This means that the documents we need to search represent a wide variety of
file-types. Our software should read Microsoft Word, Excel, PDF, WordPerfect, and
plaintext files. Our software must also be able to display a preview of a selected
document, as well as open the original file. This requirement involved a significant
amount of research into programmatic processing of these file-types.
2.4 Multi-user access from Windows-based computers
The computers that the State Board employees use run Microsoft Windows, and as
such our software must be able to run in that environment. We must also allow
for multiple employees to access the software at the same time.
2.5 Correlation between documents
The documents that are created by the State Board often have shared or related
content between separate files. The State Board employees would like to be made
aware of related information, as well as time-sensitive information that exists in the
documents.
3.0 Concepts Considered
3.1 Operating System and Underlying Framework
One of the major requirements we were given was that our software must be able to
run on Windows-based machines. We took this to mean Windows XP, Windows
Vista, and Windows 7. This limited the choice of languages and libraries available to
us. We considered several popular programming platforms:




C#:
o A Microsoft product, C# is a large, well-documented platform that was
an excellent candidate for our project. Two members of our team have
prior experience with the language, and it is built from the ground-up
to work on Windows. It also provides support for COM components,
which would turn out to be crucial to our ability to support Word and
Excel files.
C++:
o All three team members have prior experience with C++, and it is able
to run in a Windows environment. It is also the fastest of all the
languages that we considered, which made it an excellent option,
considering the data-processing-intensive indexing and searching that
were required.
Python:
o Python was considered due to the ease with which we could have
created a web interface for our application. The downside to Python is
that because it is an interpreted language, it runs slower than our
other candidates.
Java:
o Java is well documented, and is also well supported in the Windows
environment. However, it is not as tightly coupled with Windows as C#,
and does not provide the speed boost that C++ does. Also, only one
team member had prior experience with Java.
3.2 Native vs. Web-Based Interface
One of the requirements given was that the State Board wanted the application to be
accessible from the web. This differs from the traditional solution, which would be
to run the software on each user’s machine. There were several concepts to consider
when we made this decision:

Permissions:
o Some of the employees that use our software may not have access to
all possible documents. Some documents will have file permissions
that restrict viewing. This means that a system would have to be
developed to determine who has access to what when accessing
remotely. This would likely involve a database that logs remote user
credentials.
Indexing:
o If the application were web-based, the software would only have to
index once a day on a server, instead of once per user. This would
simplify the indexing process.
Machine Learning:
o If the software runs on a user’s machine, it becomes much easier to
keep track of which documents an individual accesses most, and
weight them accordingly in future searches.
Research and Implementation Time Required:
o The time required to implement a web-based interface would be
significantly higher than for a native one. The team members’ lack of
prior experience with web applications, the challenges that come with
client-server communication, and security concerns are all hindrances
to a web-based interface.
3.3 Searching Documents Efficiently
The major algorithmic problem we encountered was how to best search through the
large collection of documents. We researched and considered two approaches:


Real-Time Search:
o Real-time searching is simple and easy to implement. This algorithm
involves opening each file in the system, and scanning each word for
matches with the search criteria. The benefit of this method is that
there is no initial overhead. Unfortunately, search time increases
significantly as the size of the text grows.
Reverse-Indexing:
o Reverse-indexing (commonly called building an inverted index) is the
process of analyzing all documents before doing any searching. The
words, as well as their location and frequency, are recorded in an
index file, which is then searched at run-time. This algorithm has a
large initial overhead, but greatly decreases search time, and makes it
possible to do inexact searching.
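The trade-off between the two approaches can be sketched in a few lines. The example below is written in Python rather than the project's C#, and the document names, the `DOCS` collection, and all function names are hypothetical stand-ins, not ISBEDigger code. It shows that a real-time search rescans every document on each query, while a reverse index pays once up front and then answers a query with a single lookup.

```python
import re
from collections import defaultdict

# Hypothetical in-memory stand-in for the document collection.
DOCS = {
    "minutes_2009.txt": "The board approved the tuition increase.",
    "agenda_2010.txt": "Tuition waivers were discussed by the board.",
    "memo.txt": "Parking permits are available in the lobby.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def real_time_search(docs, keyword):
    """Scan every document on every query: no setup cost, slow per query."""
    return sorted(name for name, text in docs.items() if keyword in tokenize(text))


def build_index(docs):
    """One-time pass mapping each word to the set of documents containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def indexed_search(index, keyword):
    """Answer a query with a single dictionary lookup."""
    return sorted(index.get(keyword, set()))


index = build_index(DOCS)
print(real_time_search(DOCS, "tuition"))
print(indexed_search(index, "tuition"))
```

Both calls return the same two documents; the difference is that only `real_time_search` has to reread the whole collection each time it is called.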
3.4 Finding Useful Results without Exact String Searching
Another major requirement that our system needed to satisfy was the ability to
search for text without knowing exact keywords. We researched and implemented
two separate methods in order to accomplish this:

Stemming:
o Stemming involves finding the root word of all search parameters, as
well as “stems” of this word, which are the combinations of the root
word with possible suffixes. This allows us to provide search results
that are extremely close in content, without needing an exact match
between search keywords and results. We had the option of using one
of several known stemming algorithms, or creating our own custom
algorithm. A custom algorithm would provide results that are tailored
to the context of the State Board, but would require much more time
to implement.
Thesaurus
o We decided to use a thesaurus to generate a larger set of keywords
when searching. Like with the stemming algorithm, we were faced
with the choice of creating our own thesaurus, which would have
words relating to our specific domain, or using a pre-built thesaurus,
which is much easier to implement.
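To illustrate how stemming and a thesaurus broaden a query, here is a minimal Python sketch. It is deliberately not the Porter algorithm: `root_of` is a naive suffix stripper, and the `SUFFIXES` list and the two-entry `THESAURUS` are assumptions made up for the example. The point is only to show the shape of the expansion: one search word becomes a list of candidate index terms.

```python
# Illustrative suffix list, checked in this order (longest first).
SUFFIXES = ["ing", "ed", "es", "s"]

# Hypothetical miniature thesaurus keyed by root word.
THESAURUS = {
    "fund": ["finance", "endow"],
    "school": ["academy"],
}


def root_of(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems_of(root):
    """Combine the root with every suffix in the list."""
    return [root] + [root + suffix for suffix in SUFFIXES]


def expand_query(query):
    """Return the original words plus their stems and synonyms, without duplicates."""
    terms = []
    for word in query.lower().split():
        root = root_of(word)
        candidates = [word] + stems_of(root) + THESAURUS.get(root, [])
        for term in candidates:
            if term not in terms:
                terms.append(term)
    return terms


print(expand_query("funding"))
```

Blindly appending suffixes produces some non-words (here "fundes"). In this sketch they are harmless, since a term that never appears in the index simply matches nothing.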
3.5 Handling Multiple File-Types
We were given a broad range of file-types that needed to work with our system. We
considered several file-types, as well as methods of implementation.




Microsoft Word and Excel
o Microsoft Word and Excel files are two of the more common file-types
used by the State Board. We considered writing a piece of software
that would parse these files, as well as existing solutions. Writing a
parser for these files would be an extremely time-consuming task, as
Word/Excel files have a vast array of options that must be accounted
for. The two viable options that we had were to use COM interop
components (provided by Microsoft), or to use third-party software
(pay-to-use).
PDF
o Most of the sample files we had access to were of this type. There
were several different options to consider here. First, we had to
decide whether or not to attempt to read PDFs that are created from
scanned images, which require Optical Character Recognition (OCR).
The simple case, text-based PDFs, is much easier
to read. We also had to consider whether we wanted to write our own
parser, or use third-party software.
WordPerfect
o WordPerfect files are not abundant in the State Board’s system, as
most have been converted to PDF or Word documents. Also, since it is
an older file-type, it will be more difficult to incorporate with our
system. These facts placed WordPerfect files low on our priority list.
Plaintext
o Plaintext files are not abundant on the system, but are extremely
simple to read.
We also faced a design decision regarding how best to structure our code to allow
for easy re-use and extension. We designed a class hierarchy that works well
for the file-types described above and allows more to be added in the future.
3.6 Correlation between documents
We considered several different methods for correlating related documents. We
wanted a way to group documents together both automatically and manually. We
proposed the concept of “tagging,” as well as date-correlation. Tagging involves
creating and maintaining a set of tags that are applied both manually and
automatically (by using a set of rules to classify documents) to documents that are
added to the system. This would allow employees to group documents that contain
information on similar projects or ideas. Date-correlation essentially means
grouping documents together based on the dates that they were created or last
modified.
4.0 Concept Selection
4.1 Operating System and Underlying Framework
The members of the group decided that C# was the best choice of language given the
constraints. C# was built to run on Windows-based machines, and has a large
framework that was created specifically for that environment. C# also makes it easy
to develop a user interface, both natively and web-based. Finally, we felt that the
run-time speed, while arguably slower than C++, was fast enough to fit our needs.
4.2 Native vs. Web-Based Interface
The group elected to develop a native Windows application, as opposed to a
web-based application. Although a web-based interface was a requested feature, we did
not feel that it was the best use of our time. This is due to the fact that we were
operating under a short 1-semester timeframe, and no members of the group had
prior experience with web-based applications. In addition, a web-based application
would have required a security credentials system. By developing a native
application, we no longer have to worry about handling any security credentials.
Another benefit of a native application is that it would allow us to easily track the
user’s search history. This information could be used to rank previously viewed
files higher in future searches.
4.3 Searching Documents Efficiently
The design we chose for searching was reverse indexing. This was clearly the best
solution, as its benefits far outweigh those of real-time searching. While we would
not have to worry about creating any index files with real-time searching, searches
would take far longer than 1 second to run, and adding inexact searching on top of
that would only make the problem worse. By using a reverse indexing scheme, we
are able to move the large majority of the computation into the “pre-analysis” phase.
Because the software spends several hours indexing ahead of time, the searches that
the user actually runs are much faster. We have also developed a
“Windows Service”, a small program that runs in the background on the
user’s machine and starts the indexing program at 1 a.m. daily. This should
take the burden of indexing off of the user.
4.4 Finding Useful Results without Exact String Searching

Stemming:
o Due to the complex and inconsistent nature of the English language,
we decided not to develop our own stemming algorithm. Instead, we
implemented the Porter Stemming Algorithm, an algorithm that is
widely used in industry, in our application. It is somewhat dated, but it
was relatively simple to implement, especially compared with trying
to create one from the ground up.
Thesaurus
o As with the stemming algorithm, we decided not to create our own
thesaurus. Instead, we use a large open-source thesaurus to find
synonyms for our program. The advantage of this method is that it
saved the implementation time that assembling a custom thesaurus
matched to the words found in the collection of documents would have
required. However, it also means that a search for synonyms can return
words that are not relevant.
4.5 Handling Multiple File-Types
We decided on several different techniques for allowing compatibility with the
desired file-types.




Microsoft Word and Excel
o Microsoft Word and Excel are built using the outdated COM
architecture. What this means for us is that there is no .NET library
that handles reading and writing of Office documents in managed
code. Instead, we either had to use Microsoft’s COM interop DLLs from
within C#, or purchase a more stable third party library. Because we
were not able to spend any money, we decided to go with the COM
components.
PDF
o Reading PDF files represents a unique challenge, due to the fact that
there are PDFs representing scanned images in the State Board’s
collection of documents. These require optical character recognition
(OCR), which we felt would be a time-consuming feature to
implement. While there are OCR libraries available, we decided to use
a free third-party library to do standard text-based PDFs, and leave
OCR capability as a future addition.
WordPerfect
o Because the State Board has largely converted WordPerfect files to
Word documents or PDF files, we decided not to implement this
functionality.
Plaintext
o Plaintext files are by far the easiest file-type to implement. We used
standard .NET libraries to read these types of files.
In addition, we decided to create a class structure that would allow for easy
extension, should the State Board decide more file-types are required.
4.6 Correlation between documents
Our team designed a solution for correlation that involved a tagging system (Section
3.6). This system would allow users to create and maintain a collection of tags that
could be applied to sets of documents. However, after bringing this solution before
the State Board, it became obvious that it did not meet requirements. Therefore, we
decided instead to add a search parameter that would allow the user to search for
documents within a specific date range. This does not provide any coupling between
documents that can be retrieved later, but it will allow the user to find documents
that are related based on date.
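A date-range parameter of this kind amounts to a filter applied to the result list. The Python sketch below shows one way it could look; the `SearchResult` record and its field names are hypothetical and chosen only for the example, and the report does not say whether the real tool filters on creation date, modification date, or both.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SearchResult:
    """Hypothetical result record; field names are assumptions."""
    path: str
    score: float
    created: date


def filter_by_date(results, start, end):
    """Keep only results whose creation date falls within [start, end], inclusive."""
    return [r for r in results if start <= r.created <= end]


results = [
    SearchResult("minutes_2009.pdf", 0.9, date(2009, 5, 1)),
    SearchResult("agenda_2010.doc", 0.7, date(2010, 3, 15)),
    SearchResult("memo_2010.txt", 0.5, date(2010, 12, 1)),
]
in_2010 = filter_by_date(results, date(2010, 1, 1), date(2010, 12, 31))
print([r.path for r in in_2010])
```

Because the filter preserves the incoming order, results that were already ranked by score stay ranked after trimming.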
5.0 Selected Design
5.1 User Interface
Figure 5.1 below shows the main program window, which is divided into three
parts. The first part, which occupies the left quarter of the window, is the search
box. The search box has two main parts. The top part is the Simple Search box,
where the user can enter a search string, hit the “search” button, and view the
results. The only options for this search method are a narrow or broad search,
which tells the program how many words apart keywords can be when searching
for results. Below the Simple Search box is the Advanced Search. Here the user
enters a search string, as well as several optional search parameters. The user may
enter words that must appear in the files, words that should not appear in the files, a
date range, and specific file-types. The user may also choose whether or not to
include stems and synonyms in the search results. The “slider bar” in the Advanced
Search determines how the system should rank stems vs. synonyms.
Figure 5.1: Main Application Window
To the right of the search boxes is the results box. When the user performs a search,
the table at the top is populated with the search results. These contain the search
score the file received, as well as the file-name, location, creation date, and type.
When the user double-clicks a row, the preview box is populated with the document
text. If the user right-clicks on a row, a menu appears that allows the user to view
the original file.
Additionally, the top menu bar allows the user to add or remove locations that
should be indexed, as well as start the indexing process.
5.2 Reverse Indexing
The first action that the user must perform when using the system is indexing.
Without this step, there will be no index file, which means that no search results
will be returned. Figure 5.2 shows a flowchart representing the indexing process.
When indexing begins, a list of files is created from the directory list that is
maintained by the user. The directories are recursively searched for all files that
can be read by our software. The files are then “tokenized” into words through
the use of regular expressions. After “stop words” (common words with no search
value) have been removed, a HashSet is created for each file. The HashSet pairs each
unique word in the whole system with an object representing where the word was
found in each file, along with its frequency. The hash set is then saved in text form
in a file on the system. The file is extremely compact, and does not take up a
significant amount of space on the machine.
Figure 5.2: Indexing
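The indexing steps described above (tokenize with a regular expression, drop stop words, record positions and frequency, save as text) can be sketched as follows. This is a Python illustration under stated assumptions, not the team's C# implementation: the stop-word list is a tiny sample, a nested dictionary stands in for the hash-based structure, and the one-line-per-word text format is invented for the example (it would break on file names containing ":" or ";").

```python
import re
from collections import defaultdict

# Illustrative subset; a real stop-word list would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "was"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_document(name, text, index):
    """Record the word positions of every non-stop word in one file.

    Positions are counted before stop words are dropped, so distances
    between recorded words match distances in the original text.
    """
    for position, word in enumerate(tokenize(text)):
        if word not in STOP_WORDS:
            index[word][name].append(position)


def build_index(docs):
    """Build word -> {file name -> [positions]} for a dict of name -> text."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in docs.items():
        index_document(name, text, index)
    return index


def save_index(index, path):
    """Write one sorted line per word: word<TAB>file:pos,pos;file:pos"""
    with open(path, "w", encoding="utf-8") as out:
        for word in sorted(index):
            postings = ";".join(
                f"{name}:{','.join(map(str, positions))}"
                for name, positions in sorted(index[word].items())
            )
            out.write(f"{word}\t{postings}\n")
```

In this layout the frequency of a word in a file is simply the length of its position list, so it does not need to be stored separately. Writing the words in sorted order also makes the saved file suitable for binary search.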
5.3 Searching
Our search algorithm is fairly complex, and involves multiple steps (A complete
diagram can be found in Appendix B). The two main modes of searching are Simple
and Advanced:

Simple Search:
1. User enters a search string, which is used to perform a binary search
of the index file
2. A list of stems and synonyms of each word in the original string is
created
3. The index file is searched using the stems and synonyms, and a score
is created based on the possible permutations of words, and standard
deviation of the distances between words
4. If the set of results is not large enough at this point, files with partial
combinations are added to the results list
Advanced Search
1. User enters a search string, which is used to perform a binary search
of the index file
2. The results from step 1 are trimmed based on the optional search
parameters
3. Stemming is (optionally) performed
4. Synonyms are (optionally) computed
5. The index file is searched using the stems and synonyms, and a score
is created based on the possible permutations of words, and standard
deviation of the distances between words
6. If the set of results is not large enough at this point, files with partial
combinations are added to the results list
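Two of the steps above, the binary search of the index and the distance-based score, can be sketched in Python. The report does not give the scoring formula, so `proximity_score` is an assumption: it tries every combination of one occurrence per search term and uses the population standard deviation of the chosen positions as a stand-in for "standard deviation of the distances between words". The parallel `sorted_words` and `postings` lists are a hypothetical in-memory form of the index file.

```python
from bisect import bisect_left
from itertools import product
from statistics import pstdev


def lookup(sorted_words, postings, word):
    """Binary-search the sorted word list; return {file: [positions]} or {}."""
    i = bisect_left(sorted_words, word)
    if i < len(sorted_words) and sorted_words[i] == word:
        return postings[i]
    return {}


def proximity_score(position_lists):
    """Return a score in (0, 1], higher when the terms occur close together.

    Returns 0.0 if any term has no occurrences. Every combination of one
    position per term is tried, so cost grows quickly with the term count.
    """
    if not position_lists or any(not p for p in position_lists):
        return 0.0
    best = min(pstdev(combo) for combo in product(*position_lists))
    return 1.0 / (1.0 + best)


def rank(sorted_words, postings, terms):
    """Score every file that contains all terms; best score first."""
    per_term = [lookup(sorted_words, postings, t) for t in terms]
    if not per_term:
        return []
    common = set(per_term[0]).intersection(*per_term[1:])
    scored = [(proximity_score([p[doc] for p in per_term]), doc) for doc in common]
    return sorted(scored, reverse=True)


sorted_words = ["board", "budget", "tuition"]
postings = [{"a.txt": [1, 40]}, {"a.txt": [4, 6]}, {"b.txt": [3]}]
print(rank(sorted_words, postings, ["board", "budget"]))
```

Trying all combinations is the simplest way to find the tightest cluster, but the number of combinations multiplies with each added term, which is one plausible reason a scheme like this slows down on long queries.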
5.4 Reading File-Types
We have designed our system not only to allow for the file-types we have
implemented, but also for future types that may be added. We accomplished this
through the use of abstraction and inheritance, both of which are features of
object-oriented programming. Figure 5.4 shows the basic inheritance structure:
Figure 5.4: Document Class Hierarchy
The WordDocument, PDFDocument, ExcelDocument, and TextDocument classes all
“inherit” from the Document class. This means that they conform to the structure of
the Document class. Each class performs the same function, but in a different way.
For example, they all retrieve text from documents, but the WordDocument class
uses a COM interop DLL, whereas the PDFDocument class uses the iTextSharp
library. The value of this is that if the State Board decides that more file-types are
required, this structure easily allows for extension. A child-class would be created
that inherits from Document, but implements some unique type of file parsing.
When the class is added to our system, it would only take a change in a few lines of
code to make it function like Word, Excel, PDF, and text files do.
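The extension pattern described above can be sketched in Python (the actual classes are written in C#). Only the class names `Document` and `TextDocument` come from the report; the `extract_text` method name, the `READERS` table, the `open_document` helper, and the `CsvDocument` child class are hypothetical, added to show what "a change in a few lines of code" might look like for a new file-type.

```python
import csv
from abc import ABC, abstractmethod
from pathlib import Path


class Document(ABC):
    """Common structure that every file-type reader conforms to."""

    def __init__(self, path):
        self.path = Path(path)

    @abstractmethod
    def extract_text(self):
        """Return the document's text as a single string."""


class TextDocument(Document):
    """Plaintext files: read the contents directly."""

    def extract_text(self):
        return self.path.read_text(encoding="utf-8")


class CsvDocument(Document):
    """Hypothetical new file-type added after the fact."""

    def extract_text(self):
        with self.path.open(newline="", encoding="utf-8") as handle:
            return " ".join(cell for row in csv.reader(handle) for cell in row)


# Registering a new child class is the only change the rest of the code needs.
READERS = {
    ".txt": TextDocument,
    ".csv": CsvDocument,
}


def open_document(path):
    """Pick the reader class that matches the file extension."""
    reader = READERS.get(Path(path).suffix.lower())
    if reader is None:
        raise ValueError(f"unsupported file-type: {path}")
    return reader(path)
```

The indexer can then call `open_document(path).extract_text()` for every file without knowing which parser is behind it.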
6.0 Future Work
6.1 Scheduled Delivery
The software is currently undergoing some basic usability testing, and bugs are
being fixed. We plan on releasing the software to the client in the window between
December 10, 2010 and December 18, 2010. Because our senior design class is only
one semester long, we must unfortunately make suggestions for future
improvement that will hopefully be implemented by future teams.
6.2 Future Semesters
We have identified several areas of our application that could be improved upon or
extended. In addition, we recommend several enhancements that have not been
implemented at any level.

Reading Files
o As mentioned in section 4.5, we currently have the ability to search
through Microsoft Word and Excel files, non-OCR PDF files, and plaintext
files. Future work should include an extension of these capabilities to all
file-types represented in the State Board’s collection, especially
WordPerfect and OCR-PDF files.
o Our support for Microsoft Word and Excel files is currently built on top of
unreliable COM interop components. We recommend that these be replaced with a
more stable third-party library.
Indexing Improvements
o Currently, we have a relatively simple reverse-indexing scheme. If a
future team were to focus on creating a more complex scheme,
index-time could be significantly reduced. In addition, the indexing scheme
could be made to cater more to the needs of the State Board. If the
indexing were trimmed to only include words that were relevant in the
Board’s domain, this could lead to more relevant search results.
Searching Improvements
o The time it takes to return results in our system scales quickly with the
number of search terms added. It depends on the specific words entered,
but when more than eight search terms are given, the time required to
return results is much higher than the goal of one second.
o Our thesaurus is extremely broad. If it could be trimmed, or a new one
created from the ground up, that contained only words relevant to the
State Board, search results would be more relevant, and search times
would decrease.
o The Porter stemming algorithm currently in use by our software is
outdated. We recommend that a new algorithm be implemented that is
more up-to-date and reliable.
Correlation
o Currently, the only correlation that exists in our system is trimming
search results to a date range. This, however, does not group documents
beyond a single search. We recommend that a system be implemented
that allows documents to be semi-permanently correlated based on
content or date.
Decision Database
o It was mentioned by the State Board that a decision database would be a
useful tool. They would like something that tracks all board decisions, as
well as their lifetimes. Employees would be able to search the database
for motions that are still in effect given a specific date range.
Appendices
A: Requirements Document
B: Searching Flowchart