Text Mining –
An Examination of Text Mining Software for Tri-State Times
Sujan Manandhar
Virginia Dressler
Information Science
Dr. J. Holmes
Mar 19, 2007
Introduction
With more than an estimated 80 percent of content on the Web in unedited,
unstructured text format (Haravu & Neelameghan, 2003, p. 100), the problem of providing
effective, relevant information retrieval, particularly for massive amounts of data, is
increasingly evident. The problem can be seen in queries to commercial search engines, which
often yield high recall and low precision. Commercial search engines also include indexing and
ranking mechanisms that are opaque to the user and greatly influence the results of any search
through their ordering mechanisms and selection of terms. The greater the number of recalled
documents, the more time and energy the user must spend sorting meaningful information from
the irrelevant. As a result, quality information retrieval methods have become a growing field
of interest in software development.
Text mining is in a relatively early phase and, as such, is still quite limited in application
and use, yet it often yields beneficial results. In this paper, the authors look into text mining
as a solution to one of the problems of information retrieval. Text mining can be defined as the
discovery of new, previously unknown information by automatically extracting information
from processed text sources. In addition, text mining is capable of linking extracted information
together to form new facts or new hypotheses to be explored further by more conventional means
of experimentation (Hearst, 2003). Text mining can also be used to uncover a “narrative in
an unstructured mass of text” (Haravu & Neelameghan, p. 103) and to reveal how a particular
environment, such as a defined market or business, is evolving.
Concepts between texts can be linked together through a number of natural
language processing methods, such as a corresponding thesaurus, a glossary, a pre-structured
subject representation, or the schedule of a classification scheme. Depending on the application,
these are either preformatted components of a software package or can be manipulated by the end
user. For this study, the authors chose relatively basic applications of text mining in order to
study the results of a query with respect to the decisions and considerations of a test group.
The authors conducted a study in which they evaluated different text mining software for
the information organization needs of the Tri-State University newspaper. The study compares
various text mining systems on their functionality and ease of use, as well as their fit with
the small budget and skill set of the users.
Problem Statement
Different text mining software, run over the same sets of documents with the same query,
will yield different results. Differences in the algorithms and weighting schemas are the main
influences on the retrieved data, though the assessment and evaluation of the end user should
also be considered an influencing factor in the overall effectiveness and relevance of the
retrieved information. Additionally, a comparison of the results will examine the varying
degrees of user satisfaction, indicating a strong degree of subjectivity and difference in
users' relations to information and language.
Extent of the Study
This study looks at only a small set of simple applications to provide a
cursory example of the limitations and benefits of text mining. By using a small set of documents
from a single source, the authors hope to constrain the study enough to effectively examine the
relevance of the retrieved information.
We believe that future research will be needed to delve further into the topic, with larger
case studies and with investigations of how text mining can be implemented in settings other
than the medical, technical, and business domains. A review and study of the major applications
of text mining would also be useful at this point in time. Since applications of text mining are
at an early stage, we feel that it is also important to observe the current limitations and gain
information from existing applications. The possibility of creating links between increasingly
massive banks of textual information will be a major topic of study as our dependence on
technology increases.
Background
Text mining can mean different things depending on the method and purpose. Sharp
(2001) aptly says, “text mining per se is new and is still defining itself.” Yeates (2002) says that
text mining discovers patterns in natural language text and is the process of analyzing text to
extract information from it for particular purposes. Natarajan (2005) describes text mining as
intelligent text analysis that discovers unknown links and relationships between sets of
documents, perhaps even non-trivial information. Whatever the context or definition,
the detection of patterns is the fundamental computational process involved in text mining.
The applications of text mining date to the mid-1990s, with IBM releasing its
Intelligent Text Miner software package in 1998. The recognition that information is raw
material to an organization has drawn software designers' attention to this aspect of information
retrieval (do Prado, Oliveira, et al., 2004, p. 225). The considerations of the user have also
become a larger part of software design, particularly with natural language processing
capabilities. This is particularly effective in situations where there is a lack of controlled (or
structured) data, such as Web content, e-mails, and other informal documents. Simple
information retrieval is often complicated by factors inherent to unstructured data, such as
variations in spelling.
Natarajan (2005) cites five requirements for high-quality text mining. The desirable conditions
and factors for information retrieval include: comprehensiveness of information, quality of the
knowledge base, a high-quality method of information retrieval, techniques and protocols of
information extraction (implementation of an internal thesaurus, etc.), and technical expertise (p.
33). Interestingly, the classification and cataloguing methods of the library have often been used
for reference in many natural language processing applications.
Literature Review
The topic of text mining is still relatively new, and most of the literature on the
subject tends to be geared towards practical application (business, financial, medical, etc.).
But as far back as the 1950s, Luhn (1958), in a seminal paper on automatic abstracting, pointed
out "the resolving power of significant words" in primary text. Doyle (1961) hints at early text
mining, saying that "natural characterization and organization of information can come from
analysis of frequencies and distributions of words in libraries." Swanson (1988_1), a major force
behind text mining, examined scientific literature as a natural phenomenon worthy of
"exploration, correlation, and synthesis."
Sullivan (2004) discusses text mining within a business model, outlining the constructs of
such software and the impact and benefits of text mining in a practical scenario. Sullivan also
mentions the limitations, as of 2004, in the state of natural language processing and its error
rates. He also concludes that pre-existing (preformatted) processing schemas and components
in many text mining applications are inappropriate for most queries. More effective are methods
that allow the user to identify relations and categories within a sample set of documents
(“supervised learning”, p. 102). In turn, algorithms would be created from these decisions and
choices of related documents.
do Prado (2004) applies the CRISP-DM methodological approach to a case study. A set
of 57,000 documents from a news agency was gathered in 2001 to study the relationships
between the text and the defined groupings of information (types of news). Clusters of text
connected unrelated items, indicating a method of knowledge discovery. Seven main schemas of
clustering, both hierarchical and non-hierarchical, were found in this group of documents. A
conceptual approach to the search yielded better results (p. 224). Sharp (2001) looks at several
examples of text mining and natural language processing (NLP). He also traces the role of
machine learning in text mining and how it can play a pivotal part in the field's development,
and he highlights some of the main features of text mining along with some of the seminal works
related to it.
Feldman and Sanger (2006) were among the first to encapsulate the whole topic of text
mining in a book, breaking down the core principles as well as examining probabilistic
models. A selection of existing applications in specific fields of interest, mainly business,
technical, and medical settings, is also studied.
Major Concepts in Text Mining
Before getting into the nuts and bolts of text mining, it is important to know some of the
key concepts that lead up to it. To expound on all the concepts related to text mining
would not be possible within the scope of this paper. The authors will explain some of the
important terms that this study encounters: natural language processing, knowledge
discovery, and data mining.
Knowledge Discovery
Knowledge discovery is the process of finding novel, interesting, and useful patterns in data;
data mining is a subset of knowledge discovery. This method allows the data to suggest new
hypotheses to test (Purple Insight, n.d.). James M. Caruthers makes an interesting analogy:
"Instead of mining for a nugget of gold, knowledge discovery is more like sifting through a
warehouse filled with small gears, levers, etc., none of which is particularly valuable by itself.
After appropriate assembly, however, a Rolex watch emerges from the disparate parts." (Venere,
2004)
In his seminal work, Swanson (1988_2) proved that it was possible to discover new
knowledge from existing literature by linking the information present in complementary but
disconnected articles. Smalheiser and Swanson (1998) postulated a number of new biomedical
hypotheses, which were later verified by domain experts. They developed two approaches,
known as open and closed discovery, for generating new hypotheses. However, their research
required substantial manual input. Since then, a number of efforts have aimed to automate the
discovery process.
Natural Language Processing
Natural language processing (NLP) is a major component of text mining. NLP is the
branch of linguistics that deals with computational models of language. Sharp (2001) says that
NLP can differentiate how words are used, for example through sentence parsing and
part-of-speech tagging, and in the process add discriminatory power to statistical text analysis.
He suggests that NLP could be a powerful tool for text mining.
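To make the idea concrete, the following short Python sketch tokenizes a sentence and assigns part-of-speech tags. It assumes the freely available NLTK toolkit, which is not one of the systems evaluated in this study; any comparable NLP library would serve.

    # A minimal sketch of part-of-speech tagging, assuming the NLTK
    # library and its standard models are installed.
    import nltk

    nltk.download('punkt', quiet=True)                       # tokenizer model
    nltk.download('averaged_perceptron_tagger', quiet=True)  # POS tagger model

    sentence = "Washington signed the bill in Washington."
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
    # e.g. [('Washington', 'NNP'), ('signed', 'VBD'), ...] - the tagger marks
    # both occurrences as proper nouns; telling the person from the place
    # still requires context beyond the tag itself.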
Natural language has evolved to help humans communicate with one another and to
record information. Computers are still incapable of understanding natural language, though
language processing mechanisms that attempt to make meaningful and relevant connections
between words remain an active area of interest and study. Humans are able to recognize and
apply linguistic patterns to text, overcoming obstacles such as slang, spelling variations, and
contextual meaning, while computers do not handle these easily. Meanwhile, although our
language capabilities allow us to comprehend unstructured data, we lack the computer's ability to
process text in large volumes at high speeds. The key to text mining is creating technology that
combines a human's linguistic capabilities with the speed and accuracy of a computer (Fan,
Wallace, et al., 2005).
The complete understanding of natural language text is difficult to attain. Text mining
focuses on extracting a small amount of information from text with high reliability (Yeates,
2002). Natural language is ambiguous, and the same keyword may express entirely different
meanings; e.g., “Washington” may be a person or a place. Such ambiguity is normally resolved
through context. The inverse problem is that different expressions may refer to the same
meaning, e.g., “car” and “automobile”. These two problems make it easy to rule out the surface
expression of keywords alone as a proper representation for text mining (Chibelushi, Sharp,
& Salter, n.d.).
Moreover, because of the centrality of natural language text to its mission, text mining
also draws on advances made in other computer science disciplines concerned with the handling
of natural language. Perhaps most notably, text mining exploits techniques and methodologies
from the areas of information retrieval, information extraction, and corpus-based computational
linguistics (Feldman & Sanger, 2006).
One example of the application of a large-scale natural language processing database to
practical search methods is the WordNet project (Stevenson, 2003, p. 39). A cognitive
psychologist used research on the structure of the human mental lexicon and attempted to
construct a resource that would mirror this structure. A basic block of terms was created in the
research, and over half of these terms were found to have a rather large number of synonyms,
which were tiered into hierarchical chains. For example, the term ‘canary’ was related to about
20 other terms, the closest being ‘finch’ and the furthest (but still deemed to be related) ‘entity’ (p.
40). Applied to text mining, all relevant synonyms would in effect be used with a query.
Again, there are a number of thesaurus and dictionary programs, of varying quality, that can
become part of the natural language processing. The quality of these programs would impact the
search, as too many synonyms would result in higher recall of similar terms that are less likely
to be relevant.
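As a hedged illustration, the WordNet hierarchy can be walked programmatically. The sketch below assumes Python's NLTK interface to WordNet, one of several available front ends; the WordNet project itself does not prescribe this tooling.

    # Sketch: walking WordNet's hypernym chains, assuming NLTK's WordNet
    # corpus has been downloaded (nltk.download('wordnet')).
    from nltk.corpus import wordnet as wn

    canary = wn.synsets('canary', pos=wn.NOUN)[0]  # first noun sense
    # hypernym_paths() returns chains of increasingly general synsets,
    # each ending at the root synset 'entity'.
    for path in canary.hypernym_paths():
        print(' -> '.join(synset.name() for synset in path))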
Technological advances are, however, beginning to close the gap between human and
computer languages. The field of natural language processing has produced technologies that
teach computers natural language, enabling them to analyze, understand, and even generate text
(Fan, Wallace, et al., 2005). As more time and research are put into this facet of text mining, the
greater the possibility of relevant, meaningful results.
Data Mining
Text mining has its roots in data mining and consequently has many similar features.
Like data mining, text mining seeks to extract useful information from data sources through the
identification and exploration of interesting patterns. But unlike data mining, in text mining the
data sources are created from defined and processed document collections; interesting patterns
are found not among formalized database records but in the unstructured textual data of the
documents in these collections (Feldman & Sanger, 2006). In addition, text mining and
data mining systems have similar high-level architectures, with preprocessing routines,
pattern-discovery algorithms, and presentation-layer elements such as visualization tools to
enhance the browsing of answer sets.
Text mining adopts many of the specific types of patterns in its core knowledge discovery
operations that were first introduced in data mining research (Feldman & Sanger, 2006). While
data mining mostly deals with structured data, text mining is designed to handle structured data
from databases or XML files as well as unstructured or semi-structured data sets (such
as e-mail, full-text documents, and HTML files). As a result, text mining is a better solution for
companies where huge volumes of diverse information must be merged and managed (Fan,
Wallace, et al., 2005).
To date, however, most research and development efforts have centered on data mining
using structured data. Because data mining deals with data that have already been stored in a
structured format, much of its preprocessing focus falls on two critical tasks: 1) scrubbing and
normalizing data and 2) creating extensive numbers of table joins. In text mining, by contrast,
preprocessing deals with the identification and extraction of representative features for natural
language documents. These preprocessing operations mainly transform unstructured data stored
in document collections into a more explicitly structured intermediate format (Feldman &
Sanger, 2006).
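A minimal sketch of such a preprocessing operation follows; the stop-word list is a toy assumption, not a standard one, and real systems add stemming and richer normalization.

    # Sketch: turning raw document strings into a structured intermediate
    # representation (a term-frequency map per document).
    import re
    from collections import Counter

    STOP_WORDS = {'the', 'a', 'an', 'of', 'in', 'and', 'to', 'is', 'are'}  # toy list

    def preprocess(text):
        """Scrub and normalize one document into a term-frequency map."""
        tokens = re.findall(r'[a-z]+', text.lower())  # lowercase, strip punctuation
        return Counter(t for t in tokens if t not in STOP_WORDS)

    docs = ["The canary is a finch.", "Finches are common in aviaries."]
    intermediate = [preprocess(d) for d in docs]
    print(intermediate[0])   # Counter({'canary': 1, 'finch': 1})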
Process of Text Mining
Text mining can be examined in three stages: text preparation, text processing, and text
analysis. A key element of text mining is its focus on the document collection. At its simplest, a
document collection can be any grouping of text-based documents. Practically speaking,
however, most text mining solutions are aimed at discovering patterns across very large
document collections. Another critical element in text mining is the document. For practical
purposes, a document can be informally defined as a unit of discrete textual data within a
collection that usually, but not necessarily, correlates with some real-world document such as a
business report, legal memorandum, e-mail, research paper, manuscript, article, press release, or
news story (Feldman & Sanger, 2006).
During the initial stage, text preparation, a selected set of documents is input into text
mining software that cleans and preprocesses the data. The text processing stage is where the
user enters the picture and submits an information query to the program. An algorithm is applied
to the processed data, clustering the data to find meaningful patterns. Text analysis evaluates the
mined text and determines its relevance, rendering it in a more tangible form (Natarajan, 2005).
After text files are input to a system, text mining software produces a semantic network
of key concepts and terms in each file, defined by a weighting algorithm. This algorithm is used
to find meaningful patterns in data by the frequency of terms. Phrasal analysis is also available in
addition to term searches; this has often proved to be the most effective method of text mining,
as noun phrases such as company names, personal names, locations, or case names are often
more useful than single-term searches. Subject headings and content descriptors are frequently
derived from library classification schemes. The user can enter a query and receive a compilation
of relevant data pulled from the documents in the format of an XML or HTML file, or even a
comma-separated file. These results are often highly visual or graphical, aiming to produce an
organizational knowledge map. The purpose of this compilation would be to discover new
knowledge or information based on similar concepts, or to find a narrative in an unstructured
set of documents.
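The packages reviewed here do not disclose their specific weighting algorithms; TF-IDF is one common choice and is sketched below over a toy collection as an illustration of frequency-based weighting.

    # Sketch of TF-IDF weighting (an assumed, standard scheme): terms frequent
    # in one document but rare across the collection score highest.
    import math
    from collections import Counter

    docs = [
        "student newspaper publishes campus news",
        "campus election news dominates the newspaper",
        "library hosts student reading group",
    ]
    tokenized = [d.split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    N = len(docs)

    def tfidf(doc_tokens):
        tf = Counter(doc_tokens)
        return {t: tf[t] * math.log(N / df[t]) for t in tf}

    for vec in (tfidf(d) for d in tokenized):
        print({t: round(w, 2) for t, w in vec.items()})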
An understanding of the specific mapping of the selected group of information is
important to the information retriever. To use text mining software effectively, there should be a
certain level of awareness of the classification, clustering, and categorization schemes of
platform servers, network servers, databases, and workgroup applications. The quality of the
information that is mined depends largely on a certain level of organization within the
originating knowledge base. The results are then reviewed by the individual, who assigns
relevance and value to the information.
Figure 1 - Text Mining Process (Adapted from Fan, Wallace, et al., 2005)
Technologies in Text Mining
Several techniques in text mining technology are utilized across different applications.
These include information extraction, topic tracking, summarization, categorization, clustering,
concept linkage, information visualization, and question answering (Fan, Wallace, et al., 2005).
In this section we review the technologies that we used in evaluating the text mining software
for our research, the Tri-State project.
Information extraction - This technology is a popular method for analyzing unstructured text and
identifying key phrases and relationships within it. It looks for predefined sequences in the text
by pattern matching, and is especially useful when dealing with large volumes of text (Fan,
Wallace, et al., 2005).
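A hedged sketch of this kind of pattern matching follows; the pattern and the sample text are invented for illustration and are far simpler than a production extraction grammar.

    # Sketch: matching a predefined "<Name>, <title> of <Organization>" sequence.
    import re

    PATTERN = re.compile(
        r'([A-Z][a-z]+ [A-Z][a-z]+), (editor|president|director) of ([A-Z][A-Za-z -]+)')

    text = ("Jane Doe, editor of Tri-State Times, announced the archive project. "
            "John Smith, director of Campus Computing, will assist.")
    for name, title, org in PATTERN.findall(text):
        print(f'{name} | {title} | {org.strip()}')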
Summarization - Text summarization helps users figure out whether a lengthy document meets
their needs and is worth reading. The aim is to reduce the length and detail of a document
while retaining its main points and overall meaning. Sentence extraction mines important
sentences from a given text by statistically weighting all the sentences in the text. Other
heuristics, such as position information and the extraction of sentences following key phrases
like "in conclusion", are also used in summarization. Headings and other markers of subtopics
are searched in order to identify the document's key points (Fan, Wallace, et al., 2005).
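A toy sentence-extraction sketch along these lines appears below; real summarizers add the positional and cue-phrase heuristics just described.

    # Sketch: score each sentence by the summed frequency of its words and
    # keep the top-scoring ones (a deliberately naive heuristic).
    import re
    from collections import Counter

    def summarize(text, n_sentences=1):
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        freq = Counter(re.findall(r'[a-z]+', text.lower()))
        def score(s):
            return sum(freq[w] for w in re.findall(r'[a-z]+', s.lower()))
        return sorted(sentences, key=score, reverse=True)[:n_sentences]

    article = ("The newspaper staff met on Monday. The staff discussed the "
               "archive. The archive will store newspaper articles.")
    print(summarize(article))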
Topic tracking - A topic-tracking system keeps user profiles and, based on the documents a user
views, forecasts other documents of interest to that user. This allows users to choose keywords
and be notified when news relating to those topics becomes available. Some tools let users select
particular categories of interest, or infer users' interests from their reading histories and the
click-through information they leave behind online (Fan, Wallace, et al., 2005).
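A minimal sketch of profile-based tracking is given below; the profile, threshold, and stories are all invented assumptions.

    # Sketch: score incoming stories against a user's keyword profile and
    # alert when the overlap passes a threshold.
    profile = {'election', 'campus', 'budget'}

    incoming = {
        'story-1': 'campus election results announced amid budget debate',
        'story-2': 'local bakery wins regional pastry award',
    }
    THRESHOLD = 2
    for story_id, text in incoming.items():
        overlap = profile & set(text.split())
        if len(overlap) >= THRESHOLD:
            print(f'notify user: {story_id} matches {sorted(overlap)}')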
Term weighting and association rules are also common in text mining. In the term weighting
technique, document representation is done by removing functional words (e.g., conjunctions,
prepositions, pronouns, adverbs) and then assigning weights to content words (e.g., agent,
decision making) in order to describe how important each word is for a particular document or
document collection; some words carry more meaning than others (Chibelushi, Sharp, & Salter,
n.d.). Association rules have made their way from data mining to text mining. An association
rule is a probabilistic statement about the co-occurrence of certain events in a database or large
collection of texts (ibid.).
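One such rule can be estimated by its support and confidence, sketched below over a toy collection; the term pair echoes Swanson's migraine and magnesium example, but the counts are invented.

    # Sketch: support and confidence for the rule "documents mentioning A
    # also mention B" over a toy collection of term sets.
    docs = [
        {'migraine', 'magnesium', 'stress'},
        {'migraine', 'magnesium'},
        {'migraine', 'stress'},
        {'diet', 'magnesium'},
    ]
    a, b = 'migraine', 'magnesium'
    n_a = sum(1 for d in docs if a in d)
    n_ab = sum(1 for d in docs if a in d and b in d)
    support = n_ab / len(docs)     # P(A and B) = 0.50
    confidence = n_ab / n_a        # P(B | A)  = 0.67
    print(f'support={support:.2f}, confidence={confidence:.2f}')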
Purpose of Text Mining
The purpose of text mining is to make meaningful connections within unstructured text
data. As we have discussed, the three main stages in text mining are data preparation, data
processing, and data analysis. Depending on the software or sites reviewed (or the guidance of
the human counterpart), the data is formatted during the selection and preprocessing stage. One
issue in the results of mined text is the lack of a hierarchy in the display of indexes or clustered
information.
A data-mining algorithm is then applied to the preprocessed data. Sentence and paragraph
identification, as well as part-of-speech tagging, are performed on the set of documents at this
point. Natural language processing would provide conceptual relations between entities and
perhaps make links between certain chunks of information (the people, associated companies,
etc.). Text analysis is the more subjective aspect of the process, in the evaluation of the output.
After being run through certain algorithms, the resulting text is subjected to further processing
(a link discovery tool, or other).
Rudimentary term extraction is the most basic form of text mining; it weights lists of
terms from a set of texts into a feature vector. A search of any scale would in effect measure the
similarities between documents by their feature vectors. In some text mining software, the user or
systems administrator takes a sample group of documents and creates certain rules on terms,
which the software translates into an algorithm (the “supervised learning” of Sullivan, p. 102,
mentioned earlier). Alternatively, other software comes with preexisting classification schemas
that weight phrases and terms (“unsupervised learning”, Sullivan, p. 102).
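One common way to measure the similarity between such feature vectors is cosine similarity; it is our choice for illustration, as the paper does not fix a particular measure.

    # Sketch: cosine similarity over term-frequency feature vectors.
    import math
    from collections import Counter

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm = (math.sqrt(sum(v * v for v in a.values())) *
                math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    d1 = Counter("campus election news coverage".split())
    d2 = Counter("election coverage continues on campus".split())
    print(round(cosine(d1, d2), 2))   # shared terms drive the score toward 1.0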
The underlying notion in text mining is that the frequency of term occurrence equates to
relevance. Perhaps more useful are maximal frequent sequences (MFS), a method of extracting
the phrases that occur most frequently in a set of documents. Specific phrases can often provide
higher precision in the recall, for example through the use of a company name, product name,
proper name, etc. Also of note, a certain level of awareness on the part of the user as to the
effect of the query's language, controlled or natural, on the search is vital to pertinent results.
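A contiguous-bigram sketch of the idea follows; true maximal frequent sequences allow gaps and arbitrary lengths, so this only illustrates the frequency-counting core under toy data.

    # Sketch: count word bigrams across a toy collection and keep those
    # above a frequency threshold as candidate phrases.
    from collections import Counter

    docs = [
        "tri state times covers campus news",
        "campus news section of the tri state times",
        "editorial board of the tri state times",
    ]
    bigrams = Counter()
    for d in docs:
        words = d.split()
        bigrams.update(zip(words, words[1:]))

    MIN_FREQ = 2
    for (w1, w2), n in bigrams.most_common():
        if n >= MIN_FREQ:
            print(f'"{w1} {w2}": {n}')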
Within an increasing set of documents, patterns begin to emerge within the mined text,
and the number of patterns eventually grows as well. The more successful cases of text mining
are in areas with highly controlled language, such as biotechnology, competitive intelligence,
and consumer product development. Table 1 summarizes some of the main uses of and
technologies used in text mining in the principal industries.
Table 1 - Applications of text mining in various industries (Source: Fan, Wallace, et al., 2005)
As previously mentioned, natural language techniques were applied to text mining in an
attempt to represent the user and the document in the search process and thus aid the search
method. In addition, searching with NLP can classify documents together by discovering
multiple-point relationships between terms and phrases. NLP is, however, prone to error,
particularly in environments with a range of topics and styles. Even so, beneficial connections
can be made between previously unrelated items.
Research Model Using the CRISP-DM
CRISP-DM was developed in late 1996 by three main players in the then-young data
mining market: DaimlerChrysler (then Daimler-Benz), SPSS (then ISL), and NCR.
DaimlerChrysler already had some experience applying data mining in its business
operations. SPSS had been providing services related to data mining since 1990 and had
launched the first commercial data mining workbench, Clementine, in 1994. NCR had the
Teradata data warehouse and teams of data mining consultants and technology specialists to
service its clients' requirements (CRISP-DM, 2000). Over the years the model has been
developed and adapted for better data mining for various purposes. We found that this particular
model would be an effective method to apply to our case study.
The CRISP-DM methodological model consists of the following phases (Sullivan, 2004):

Business understanding - The client's point of view is considered at this first stage,
identifying the requirements and objectives of the selected applications. Problems and
restrictions of each application are identified and examined as well. This phase also
incorporates a description of the client background, the business objectives, and a
description of the criteria used to measure the success of the study.

Data understanding - All relevant information is identified to carry out the application,
and also to develop an initial gauge of the applications' content, quality, and utility. This
initial collection of data assists the analyst in discovering the particulars of the individual
programs. Problems related to the format and values of the applications are also examined
at this point. The manner in which data was collected, including the different sources,
meaning, volumes, reading procedure, etc., will also be of interest, as these can give an
indication of the quality of the data.

Data preparation - In this stage, the final data set from which the model will be created
and validated is assembled. Tools for data extraction, cleaning, and transformation are
applied. Combinations of tables, format changes, and aggregations of values are drawn out
to satisfy the input requirements of the particular learning algorithms.

Modeling - Data-mining techniques are selected and applied at this stage, according to the
objectives defined in the first step of the model. Modeling is the core phase of KDD
(Knowledge Discovery and Data Mining), corresponding to the choice of the technique, its
parameterization, and its execution over a defined training data set. Other models can also
be created in this phase if required.

Evaluation - A review of the previous steps is made in order to verify the results against
the objectives defined in the business understanding phase. The next tasks to be performed
are also defined here. Depending on the results, route corrections may be defined, which
correspond to returning to one of the already-performed phases with other parameters, or
looking for additional data.

Deployment - This phase sets out the necessary actions to make the acquired knowledge
available to the organization. A final report is generated to explain the results and the
experiences useful to the client's business.
17
Figure 2: Phases of the CRISP-DM Process Model (Source: CRISP-DM, 2000)
The CRISP-DM process is a cycle in which the sequence of the phases is not
necessarily strict. Moving back and forth between different phases is essential, as the
sequence of tasks depends on the outcome of each phase. The arrows in Figure 2 indicate the
most important and frequent dependencies between phases, while the outer circle indicates
that the process is cyclic in nature.

The mining process continues after a solution has been deployed. The lessons learned
during practice can generate new, often more focused business questions, and subsequent mining
processes will benefit from the experience of previous ones through the discovery and
examination of successful results.
In our study for the Tri-State Times, we decided to use the CRISP-DM model as well; it
is a popular and proven approach on which many business solutions have depended. In a study
analyzing meeting transcripts, Chibelushi, Sharp, and Salter (n.d.) recorded a set of meetings and
transcribed them for further processing. These transcripts were manually edited to prepare for
the modeling phase and then further analyzed to track the themes discussed and to extract the
key issues and any associated actions, as well as to identify the initiator. Their approach
combined statistical natural language processing and semantic analysis of the transcripts. do
Prado (2004) applies the CRISP-DM methodological approach to a case study in order to
examine the relationships between the text and the defined groupings of news items.
The Tri-State Times Text Mining Project
The student newspaper Tri-State Times is researching the use of text mining software to
assist in finding news articles, columns, and editorials that are similar to selected news items.
In addition, the software should be able to investigate the primary terms and phrases in a
selected article so that similarities to other articles can be identified.
The members of the Tri-State Times have approached the School of Library and
Information Science (SLIS) for assistance in researching text mining software that would be
appropriate for their use. Due to lack of funds, however, they would like to use a system that can
be purchased at minimal cost or that is available for free. In addition, the upkeep of the system
should be easy and should not incur any major additional costs.
The members of the research team at SLIS will conduct a study in which they evaluate
different text mining software that could satisfy the needs of the Tri-State Times. Once the
search has been narrowed down, they will conduct tests based on the criteria of the tasks needed
by the newspaper. In addition to comparing the ease of use of the various systems, a survey will
be conducted with a sample of the users to determine the functionality of each system and what
the users feel about the system and the resulting data sets. A combination of these factors will
be used to determine the best text mining solution for the newspaper.
Step 1 - Business Understanding
During this initial phase, it is necessary to focus on understanding the project objectives and
requirements from the organization's perspective, then converting this knowledge into a data
mining problem definition and a preliminary plan designed to achieve the objectives.
The student newspaper at Tri-State University is short-staffed. The editorial team has
consulted the SLIS for assistance in coming up with a solution to help them archive important
articles. The objective of this project is to find a method of locating articles, columns, and
editorials similar to the stories that the editorial team selects. To achieve this, the team surveys
and selects major news articles and then looks for similar articles covering the same news story
in other major newspapers. One can look through most of the major news sources manually, one
at a time, or a text mining model can be used to assist in the process. The text mining model will
be able to identify the key terms and ideas in the articles, and these terms can in turn be used to
look up similar articles. In addition, other articles can be identified and selected from other news
sources automatically with the use of certain text mining software. Finally, the key terms and
sentences can also be used to formulate a summary of the article. This summary can be entered
into a collection or database of summaries and accessed in the future for use in writing editorials,
opinions, and other articles.
Step 2 - Data Understanding
Data understanding begins with the initial collection of news stories from various national
news websites, e.g., CNN, Fox News, Yahoo News, etc. Next, local stories of interest are
selected from local and regional newspapers. To become familiar with the data, the main
stories are identified from one or two of these news sources. A major data quality issue may be
the availability of more than one version of a news article at various times, especially on the
Internet. Another issue is the amount of time spent identifying the main news article, and
whether this becomes unwieldy. Once the editorial team selects the news article (the dataset), a
member of the staff should be able to enter it into the system and generate the output of key
terms and phrases, a summary of the article, and possible matches with similar articles.
Step 3 - Data Preparation
The data preparation phase covers all activities needed to construct the final dataset (the data
that will be fed into the modeling tools) from the initial raw data. In this case, the raw data
includes all the news articles that the editorial team deemed worthy of archiving during the first
selection. The final dataset includes the articles that have been weeded out from the raw data and
are considered more important than the rest. During this phase, the team chooses one or two
versions of the main stories of the day from the news sources. These news articles are converted
to text versions so that they can be input without images. The text versions of the news stories
make up the final dataset, ready to be entered into the text mining system.
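A minimal sketch of that conversion step is shown below, using only Python's standard library; the markup and article content are invented for illustration.

    # Sketch: strip HTML markup from a fetched article, keeping only text.
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.parts = []
        def handle_data(self, data):
            self.parts.append(data)          # collect text nodes only
        def text(self):
            return ' '.join(' '.join(self.parts).split())

    extractor = TextExtractor()
    extractor.feed('<html><body><h1>Campus Vote</h1><p>Turnout rose '
                   '<img src="chart.png"> sharply this year.</p></body></html>')
    print(extractor.text())   # Campus Vote Turnout rose sharply this year.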
Step 4 - Modeling
During this phase, various modeling techniques are selected and applied. The SLIS team looked
at various text mining tools in light of the needs of the Tri-State Times. Text mining tools
come with varying capabilities and prices; major vendors offer tools that cost thousands of
dollars. Some of the major text mining tools are compared below along with their major
features:
Table 2 - Text-mining technologies offered by commercial vendors (Source: Fan, Wallace, et al., 2005)
Although most of the above systems offered all the functionality that the Tri-State
project needed, these tools were beyond the budget of the student newspaper, and learning and
applying them would require additional effort. Hence the SLIS team had to look for systems that
were simpler to use and were less expensive or offered for free on the Internet; with this in
mind, the initial costs were minimal. The SLIS team narrowed the search down to three text
mining models: 1) Termine, 2) Textalyser, and 3) Topicalizer.
The SLIS team took a sample set of chosen articles from the final data set and plugged
them into the respective models. The outputs from these systems were then tabulated and
compared. In some cases the outputs were not comparable; in these cases, another set of data
was taken and run through the systems again, and the outputs for these articles were then
evaluated.
Step 5 - Evaluation
During this stage the SLIS team evaluated the results generated by the three selected
text mining models. The results were compared on the basis of three criteria: 1) the terms and
phrases selected, 2) the summary generated, and 3) the generated list of similar articles.

The key terms that the systems generated were evaluated in terms of whether they
could be used to look for related articles from other news sources. The summaries
generated were compared to see whether the models were able to produce a concise summary
that could be used in the future for editorials and opinions. In addition, the summaries were
evaluated to see whether the system was able to connect ideas, or merely extracted sentences
from the input text.
Finally, the set of articles that each system generated as similar to the input news article
was also evaluated, along with the number of articles in the output and the variety of sources.
These outputs were assessed by a team of eleven participants who considered the three criteria
above for each of the three systems. In addition, the participants evaluated the systems on ease
of use and available functionality.
The users tested the three software models with a few news articles that they input into the
systems. The outputs were generated and compared for each system. In addition, a short
survey was taken to evaluate the user experience (see Appendix A for the survey). Overall, the
users were satisfied with all three models of text mining evaluated. The fact that these
systems were available on the Internet at no cost was attractive to the users, who were aware of
the financial constraints, and the ease of use of all three systems was also a feature that the
participants liked. All three models tested provided keywords and phrases that were helpful in
locating other news articles of interest to the users. But due to the additional functionality of
Topicalizer, including the summary and the links to other articles, most of the participants
preferred this system to the other ones tested.
Topicalizer is a service that automatically analyses a document, specified by a URL or as
plain text, with regard to its word, phrase, and text structure. It provides functional information
on a given text, including word, sentence, and paragraph counts, collocations, syllable structure,
lexical density, keywords, readability, and a short abstract of what the given text is about
(Topicalizer, n.d.).
The results of the study can be summarized as follows. Based on the information collected
from the users, there was a 72% agreement rate with the resulting sets of information.
Interestingly, this is practically concurrent with previous studies measuring the frequency of
agreement between expert and non-expert semantic taggers (Stevenson, 2003, p. 50). Often, the
areas of disagreement involved topics that were more subject to multiple interpretations (where
the user was not sure of context or relevance).
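For transparency, a percent-agreement figure of this kind can be computed as in the sketch below; the ratings shown are invented for illustration and are not the study's actual data.

    # Sketch: percent agreement from survey ratings, counting a response
    # of 4 or 5 as agreement (an assumed cutoff; the study reports 72%).
    responses = [5, 4, 3, 4, 5, 2, 4, 5, 4, 3, 4]   # one rating per participant
    agree = sum(1 for r in responses if r >= 4)
    print(f'{100 * agree / len(responses):.0f}% agreement')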
Step 6 - Deployment
Creation of the model is generally not the end of the project. Once the testers chose Topicalizer
as the system of choice for the Tri-State Times and the trial runs were complete, the results were
organized and stored so that they can be retrieved as needed by the student newspaper. A
reporting structure was created with a consistent and understandable format so that the users of
these results can readily interpret them.
Conclusion
By using a selection of text mining software, our study found that differences exist in the manner
in which people relate to information, even from the same query. Apart from differences in the
particular algorithms and weighting schemas, a definitive difference was found in the decision
processes of different individuals presented with the same set of information, and also in the
comparison of satisfaction surveys. This ultimately proves to be largely subjective in nature,
though important information can be found within these differences for further improvements
in the software.

We feel that although certain limitations can be found in text mining, there are also opportunities
for further research and study. There have been many benefits in the existing applications of text
mining, and we feel that there are many possibilities for improvement in the current software.
Our study was limited to a specific set of data and users; future research could utilize a much
broader scope of application (larger sets of data, a wider array of topics, a larger user base).
References
Atkinson-Abutridy, J. (2004). Semantically-driven explanatory text mining: Beyond keywords.
Retrieved March 3, 2007, from Universidad de Concepción, Departamento de Ingeniería
Informática: http://www.springerlink.com/content/v78y4u242a67uupe/fulltext.pdf

Chibelushi, C., Sharp, B., & Salter, A. (2004). A text mining approach to tracking elements of
decision making: A pilot study. Retrieved March 5, 2007, from Staffordshire University, School
of Computing:
http://www.comp.lancs.ac.uk/computing/research/cseg/projects/tracker/css_iceis04.pdf

CRISP-DM (2000). Retrieved March 2, 2007, from http://www.crisp-dm.org/Overview/index.htm

do Prado, H. A., Moreira de Oliveira, J. P., Ferneda, E., Wives, L. K., Silva, E. M., & Loh, S.
(2004). Transforming textual patterns into knowledge. In M. Raisinghani (Ed.), Business
intelligence in the digital economy: Opportunities, limitations, and risks (pp. 207-227). Hershey,
PA: Idea Group Publications.

Doyle, L. (1961). Semantic road maps for literature searchers. Journal of the Association for
Computing Machinery, 8, 223-239.

Fan, W., Wallace, L., Rich, S., & Zhang, Z. (2005). Tapping into the power of text mining.
Retrieved March 10, 2007, from
http://pubs.dlib.vt.edu:9090/2/01/text_mining_final_preprint.pdf

Feldman, R., & Sanger, J. (2006). The text mining handbook: Advanced approaches in analyzing
unstructured data. New York: Cambridge University Press.

Haravu, L. J., & Neelameghan, A. (2003). Text mining and data mining in knowledge
organization and discovery: The making of knowledge-based products. Cataloging &
Classification Quarterly, 37(1/2), 97-113.

Hearst, M. (2003). What is text mining? Retrieved March 12, 2007, from
http://www.ischool.berkeley.edu/~hearst/text-mining.html

Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and
Development, 2, 159-165.

Mironova, S. Y., Berry, M. W., Atchley, S., & Beck, M. (2004). Advancements in text mining
algorithms and software. In H. Kargupta (Ed.), Data mining: Next generation challenges and
future directions (pp. 425-436). Menlo Park, CA: MIT Press.

Natarajan, M. (2005, July). Role of text mining in information extraction and information
management. DESIDOC Bulletin of Information Technology, 25(4), 31-38.

Purple Insight (n.d.). Retrieved March 12, 2007, from
http://www.purpleinsight.com/downloads/docs/visualizer_tutorial/glossary/go01.html

Sharp, M. (2001). Text mining. Seminar in Information Studies. Retrieved March 2, 2007, from
http://www.scils.rutgers.edu/~msharp/text_mining.htm

Smalheiser, N. R., & Swanson, D. R. (1998). Using ARROWSMITH: A computer assisted
approach to formulating and assessing scientific hypotheses. Computer Methods and Programs
in Biomedicine, 57(3), 149-153.

SPSS (2005). Improve business results with text mining. Retrieved March 1, 2007, from
http://www.spss.com/PDFs/TMC4SPC-0105.pdf

Stevenson, M. (2003). Word sense disambiguation. Stanford, CA: Center for the Study of
Language and Information.

Sullivan, D. (2004). Text mining in business intelligence. In M. Raisinghani (Ed.), Business
intelligence in the digital economy: Opportunities, limitations, and risks (pp. 98-110). Hershey,
PA: Idea Group Publications.

Swanson, D. R. (1988_1). Historical note: Information retrieval and the future of an illusion.
Journal of the American Society for Information Science, 39, 92-98.

Swanson, D. R. (1988_2). Migraine and magnesium: Eleven neglected connections. Perspectives
in Biology and Medicine, 31, 526-557.

Termine. Retrieved February 22, 2007, from http://www.nactem.ac.uk/software/termine/

Textalyser. Retrieved February 22, 2007, from http://textalyser.net/

Topicalizer. Retrieved February 22, 2007, from http://www.topicalizer.com/

Venere, E. (2004). 'Knowledge discovery' could speed creation of new products. Purdue News
Service. Retrieved March 3, 2007, from
http://www.purdue.edu/UNS/html4ever/2004/041018.Caruthers.discover.html

Yeates, S. (2002). Text mining. Retrieved March 2, 2007, from
http://www.cs.waikato.ac.nz/~nzdl/textmining/
Appendix A – Questionnaire for Participants (Note: one form for each system)
Please feel free to add or remove success viewpoints as appropriate.
Estimate the level of success of each query, using this response scale.
5 very satisfied 4 satisfied 3 neutral 2 dissatisfied 1 very dissatisfied
1. __ The generated summary was relevant to the query.
2. __ The terms and phrases of the query were found to give relevant results.
3. __ The results gave too many variations of similar articles.
4. __ The system was easy to use and understand.
5. __ The system provided simple and ample search mechanisms.
Identify both the successful and unsuccessful elements found in the resulting information. Were
these useful for further study, or were they irrelevant?
Estimate the level of satisfaction with the results, using this response scale.
5 very satisfied 4 satisfied 3 neutral 2 dissatisfied 1 very dissatisfied
1. __ The resulting data sets were sufficient for our purpose of use and study.
2. __ The extracted information led to a discovery of new knowledge.
3. __
4. __
5. __
Rate the following characteristics of the environment for the project being reviewed. Use a scale
of 1 to 5, where 1 is the lowest rating and 5 is the highest. If an item does not apply, mark an X.
__ ease of software use
__ software quality
__ analysis capability
__ design capability
__ appropriateness of technology to query
__ effective use of data configuration
__ quality assurance of data
__ clarity of source