The Evolution and Evaluation of an
Internet Search Tool for Information Analysts
Elizabeth T. Whitaker and Robert L. Simpson, Jr.
Georgia Tech Research Institute and Applied Systems Intelligence, Inc.
Atlanta, GA. 30332 and Roswell, GA. 30076
Betty.Whitaker@gtri.gatech.edu
BSimpson@asinc.com
Abstract

We are working on a project aimed at building next-generation analyst support tools that focus analysts' attention on the most critical and novel information found within the data, thus helping analysts deal with the information overload problem. This paper discusses Case-Based Reasoning for Knowledge Discovery (CBR for KD), which is designed to support the Internet-based search and information gathering activities of information analysts. An information analyst gathers, organizes and analyzes information and, based on that analysis, makes predictions that can be used for decision making. Because of the huge volumes of data that information analysts must search, effective information gathering on the web is a complex activity requiring planning, text processing, and interpretation of extracted data to find information relevant to a major analysis task or subtask (Etzioni and Weld, 1994), (Knoblock, 1995), (Lesser, 1998) and (Nodine, Fowler et al., 2000). We have identified knowledge discovery plan categories that correspond to different contextual domains and provide analysts with indications of activities of potential interest, opportunity or threat. Using a case-based reasoning engine, a plan is selected from one of these categories and adapted to the current knowledge discovery problem. The resulting search plan is executed, relevant information is extracted from unstructured documents, and the extracted information is used to make further inferences and launch additional searches. This paper discusses the evolution and evaluation of the system presented in the FLAIRS 2004 paper "Case-Based Reasoning in Support of Intelligence Analysis."

Copyright © 2007, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

The Users and Their Tasks

The envisioned users are information analysts who are given an assignment to research a topic and produce a finished information product. Information analysts may work for business or government. Examples of areas in which they provide analysis are business intelligence, technology tracking, financial information, crime analysis, counter-terrorist information or military defense. The analytic process includes searching, reading, organizing, integrating and drawing inferences from many sources of information, both proprietary and open. The users are typically very methodical, and analysts attempt to be thorough, but occasionally time constraints prevent them from doing as much research as they would like (Patterson, Roth and Woods, 1999). They may receive a task outside of their areas of expertise (Heuer, 2001), (Krizan, 1999), (Bodnar, 2003). The tasking can be either long term or short term; that is, the analyst may be given a few hours or a few days to provide the finished analysis, or may be monitoring a situation over a long sustained period.

Background

We previously introduced our project, Case-Based Reasoning for Knowledge Discovery (CBR for KD), in which the Georgia Tech Research Institute team is investigating information analysts' strategies for discovering new knowledge in support of a variety of assignments called "taskings," and analyzing the data collected as they conduct searches for information on the web (Whitaker and Simpson, 2004). A number of search strategies have been devised to illustrate the assistance given to the analyst in the performance of knowledge discovery activities. A software prototype that applies case-based reasoning, combined with other reasoning techniques, has been developed for use by information analysts to help discover novel information from documents on the Internet. The search strategies have been developed based on feedback from analysts and information gleaned from the literature, and have been stored as cases in a case library. The cases, in combination with the prototype software, are being used to illustrate and investigate improvements in the knowledge discovered and the time required for analysis.
The variety of approaches to information analysis derives from differences in the analysts' domains of expertise, the agency or company they work for, the sources of information they can access, their experience, and the stage of the analytic process on which they focus. The basic analytic process is as follows:
• Understand the customer's need
• Decompose the task into component questions
• For each question, gather relevant information
• Analyze available information to form hypotheses and test for conclusions
• Produce the finished analytic product
The focus of our project is the third step in the analytical process: gather relevant information.
Figure 1 is a simplified illustration of our approach: capturing the analyst's implicit search strategies in the form of explicit knowledge discovery plans, which are used via our software to accelerate and improve the derived knowledge needed to support an improved overall analytic process.
An Analyst’s Knowledge Discovery Problems
Examples of knowledge discovery problems that an analyst
might use to fulfill some of the knowledge requirements of
a tasking include the following which will act as goals in
the context of case-based planning:
• Find researchers and market needs for cutting edge
knowledge discovery research
• Find events connected to people who are
mathematicians from Country X
• Describe (or explain) the computer capabilities of the
Offshore Gambling Casinos
• Find clusters of related neural network experts
• Discover organizations with illegal gambling activity
• Discover bioterrorist experts who might be associated
with a given set of organizations
• Identify the leaders of Terrorist Organization Y
Research Approach
Our approach is to represent analytic strategies as domain-specific search plans, so that a future analyst support system, interacting with the analyst, can reuse a successful analytic strategy on massive data. Significant
portions of the search and analytic strategy can be
automated, but we have come to understand the importance
of allowing interaction between the analysts and the
automated process. Analysts want to understand the search
strategies as well as the results and to be able to interact
with and tailor these strategies based on their experience
and background knowledge. We have added the ability for
the analyst to add or delete terms to the search plans to
enhance the chosen plan.
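The core idea, representing an analytic strategy as an explicit, reusable plan, can be sketched as a small data structure. This is a hypothetical illustration; the class and field names are ours, not the system's:

```python
from dataclasses import dataclass, field

@dataclass
class PlanAction:
    # One executable step of a knowledge discovery plan,
    # e.g. "Search" or "Extract", with its parameters.
    name: str
    parameters: dict = field(default_factory=dict)

@dataclass
class KDCase:
    # A stored case: the goal it solves, the features used to
    # retrieve it, and the ordered action sequence to execute.
    goal: str
    features: dict = field(default_factory=dict)
    actions: list = field(default_factory=list)

case = KDCase(
    goal="Find clusters of biochemical experts",
    features={"structure": "cluster", "focus": "biochemistry"},
    actions=[PlanAction("CreateString", {"concept": "biochemical expert"}),
             PlanAction("Search", {"source": "web"})],
)
```

Under this kind of representation, an analyst could tailor a retrieved case by editing action parameters before execution.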
Capabilities we have explored include:
• Identifying and capturing the often implicit search plans (analytic strategies) used by successful information analysts
• Providing an infrastructure for reusing search plans among a community of analysts for the purpose of enabling collaborative investigation of hypotheses and respective assemblies of supporting evidence, which ultimately constitute the discovery of new knowledge
• Determining the types of queries analysts issue to a particular source, and which sources they query for a particular type of problem, so that the queries may be made explicit in a case's search plan
• Determining what assumptions drive the analysts' search and making those assumptions explicit in the search plans

Figure 1: CBR for KD Research Approach

A set of prior cases will help the system manage the computationally expensive task of search planning. For example, a notional case would contain the following:

Case Goal: Find clusters of restaurant owners in City X who are associated with suspect organizations
KD Plan:
1. Extract names of people with the particular characteristics "restaurant owner," "City X"
2. Find organizations that each person is associated with
3. Compare these organizations against suspect organizations and store the resulting organizations in a database used to accumulate intermediate values and results that the analyst may wish to reference later
4. Find links between people (the above selected restaurant owners in City X) through organizations, with the result being links between people with a particular characteristic (restaurant owners in City X) who are associated with suspect organizations

Steps 1 through 4 are the sequence of actions used to solve the knowledge discovery problem. This same case could be instantiated differently and reused by the analyst to solve the following related problems: "find clusters of microbiology experts," "find clusters of explosives experts," and "find clusters of drug dealers."
The CBR for KD project is based on the assumption that
analysts have many assignments that cause them to search
for and make inferences about information, reusing
techniques that have worked for them before. Often this
involves extracting pieces of information from many open-source websites, then sorting, organizing and linking them in ways that are very time consuming. Tools that automate a significant subset of that work will allow analysts to focus on the difficult aspects of their assignments, allowing them to examine more sources, discover more pieces of evidence and find associations among those pieces of evidence.

Case-Based Planning

Our technical approach leverages research in case-based planning. Case-based planning (also CBR planning) (Hammond, 1989) is the reuse of past plans to solve new planning problems. The system retrieves previously generated solutions from a case library and adapts or repairs one of them (the closest match) to suit the current problem. A plan in this context consists of:
• A goal: a knowledge discovery problem that the information analyst wishes to solve
• An initial state: in analysts' knowledge discovery plans, the initial state is described by a set of pieces of information including the analyst's task goal and task decomposition, as well as explicit assumptions such as background about the world context, e.g., history, recent events and geographic conditions
• A sequence of actions that, when executed starting in the initial state, results in a goal state
The goal is to have a set of pieces of information with their appropriate connections and inferences to address the knowledge discovery problem. Each action has a set of preconditions which, in information space, consist of the state of having the information necessary to perform the next step. Each action also has a set of post-conditions consisting of the knowledge and the representation of partial solutions that exist after the step is performed.
Many case-based reasoning systems retrieve and adapt problem solutions given as advice or for execution by a human. The CBR for KD work has the added complexity of providing and adapting solutions that must be automatically executed by the software. The approach is to represent knowledge discovery plans as scripts and to map them to the modules being used to execute the individual steps of knowledge discovery plans. This also allows module reuse for better support of plan actions.
Plan actions are chosen from the following set:
• Create String
• Search
• Extract
• Extract Named Entities
• Filter
• Create Table
• Write Elaboration
• Write Table
• Display Results
Our analysis has led to the following set of components for building a knowledge discovery plan:
• Actions allowed by the knowledge discovery plan and executable by the system
• Preconditions and sequences to define the allowable execution sequences of plan actions
• Parameters for the adaptation of plans to plans that address the current problem
• Elaboration files for mapping search concepts to a set of specific search engine query terms
• A similarity metric for retrieving the appropriate cases
• A parser file for recognition of word constructions in information extraction from documents
Based on the preconditions for a given action, there are constraints on the order in which the plan actions can be executed. Each plan action has a set of parameters which include the conceptual object to be acted upon and, potentially, an object to be produced or transformed by the action. Elaboration files are necessary to map from a domain concept being sought to all of the terms and phrases that can be used to express it in retrieved documents. This technique makes our searches more powerful and allows the elaborations to be customized by the user to the types of wordings that are most likely to be fruitful. The similarity metric is used to describe the most important characteristics of a case and to help retrieve the cases most relevant to the knowledge discovery problem being solved by a user. Parser files are used by the system to extract potentially relevant information. They include wording permutations and patterns that can be recognized by the Extraction modules.
There are over 50 cases in our current case library. A case consists of a set of features, i.e., attributes that describe and characterize the plan; a plan for solving the knowledge discovery problem; and a description of the expected results of executing the plan.
One of the actions that our plans include is an elaboration action. Many of the knowledge discovery plans stored in the case library and reused in support of a new problem include plan steps which use search engines to search the Internet for particular topics. They also include steps which scan the documents retrieved for particular terms. One of the things that makes this difficult is that the documents are unstructured and are not marked up. In natural language there are many different words to express related concepts, and in some situations we are searching for terms which combine two or more concepts, e.g., "biological chemistry expert." We must not only be able to recognize terms related to biological chemistry and terms related to expertise, but, in order to increase recall, we must also be able to deal with a variety of grammatical constructs and alternative phrases:
• biochem expert
• bio-chemistry expert
• authority on bio-chem
• former director of a biological chemistry laboratory
• researcher in biological chemistry
• bio-chem specialist
• expert in the production of bio-chem products
• professor of bio-chemistry
For concepts that analysts use routinely in searches, building elaboration files, that is, files that contain many phrases related to the concept(s) that an analyst is searching for, will magnify the utility of the knowledge discovery plans. The elaboration files contain knowledge that can be used to adapt plans to different domains according to analysts' specializations. This allows the construction of knowledge discovery plans that, when executed, spawn searches for many alternative wordings, saving analysts from this tedious set of activities. A simple analysis, yielding an elaboration file which contains only a handful of terms, makes for much more effective searches.
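An elaboration file of the sort described above can be approximated as a mapping from a concept to its alternative phrasings, from which a single combined query is built. This is a hypothetical sketch; the file format and query syntax are illustrative, not the system's:

```python
# Illustrative elaboration data: one concept, many surface forms.
ELABORATIONS = {
    "biological chemistry expert": [
        "biochem expert",
        "bio-chemistry expert",
        "authority on bio-chem",
        "researcher in biological chemistry",
        "professor of bio-chemistry",
    ],
}

def elaborate(concept: str) -> list:
    # Return the concept itself plus all stored alternative phrasings.
    return [concept] + ELABORATIONS.get(concept, [])

def to_query(concept: str) -> str:
    # Combine the phrasings into one OR query so a single search
    # covers wordings the analyst would otherwise try by hand.
    return " OR ".join('"%s"' % phrase for phrase in elaborate(concept))
```

Because the mapping is data rather than code, an analyst can customize the phrasings most likely to be fruitful in their own domain.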
Knowledge Discovery Cases

Since we want to be able to retrieve and adapt the most useful plan for solving the analyst's current problem, the important features of the knowledge discovery plan must be identified and stored as a feature vector. The feature choices are driven by the goal of the knowledge discovery plan and by the characteristics that distinguish one case from another.
Through requirements analysis, knowledge acquisition, experimentation with knowledge discovery, and analysis of published analysts' processes, we have identified and classified some of the types of information that information analysts look for, and attached some initial attributes that have proven useful in case retrieval. There are eight types of knowledge in the feature set used to index the cases. This set will continue to evolve as we learn more about the knowledge discovery problems of different information analysts.
The following features of the knowledge discovery cases were chosen for their utility in defining similarity to a current knowledge discovery problem, for the purpose of retrieving a case to adapt and reuse:

1) Structure: One of the primary attributes of knowledge discovery problems is the "structure" of the information that the analyst is looking for. Because the information being searched for by information analysts goes beyond simple facts or processes, the structure can be very complex, and it has the primary influence on the characteristics of the search plan. Examples of structures commonly used in a knowledge discovery plan are:
Associations or relationships: One technique that analysts and other information workers use when looking for relationships between two entities is to look for common relationships to a third entity.
Clusters: A more complex kind of association is a cluster of related entities, such as people (e.g., counterfeiting experts or gang members), organizations (e.g., suspect businesses and organizations that do business with criminal organizations), or events (e.g., bombings, attacks, or criminal events).
Time Sequences: When tracking an event such as a criminal event, or trying to identify a potential criminal event, there are sequences of subevents or activities that take place as part of training and preparation. We have created search plans to search Internet web pages and documents, extract information and create a representation that allows the analyst to see the time sequence of events. Many knowledge workers, such as historians or epidemiologists, search for information and relate it to a time sequence in order to explain, prevent, influence or predict future events.
Spatial Associations: An important structure that analysts use is a representation relating the movements of entities through space and time. In trying to predict or influence terrorist events, analysts might look for components that can be tied together in a space-time representation.

Sample KD Plan: "Find Clusters of Biochemical Experts"
1. CreateString (Parameters): Creates a search string on biochemical experts using an elaboration file
2. Search (Parameters): Uses the search string for web searches
3. Extract (Parameters): From the URLs found, extracts phrases with the requested content
4. ExtractNamedEntities (Parameters): Marks up or tags names, organizations, dates, etc.
5. CreateString (Parameters): Names found from the extract are put in a string
6. WriteElab (Parameters): Elaborates organization words
7. Search (Parameters): Searches data sources for the names found above
8. CreateTable (Parameters): Creates a database table to store the information found
9. Extract (Parameters): For each URL found, extracts phrases with name and organization
10. ExtractNamedEntities (Parameters): From the above phrases, extracts organization names
11. WriteTable (Parameters): The information is written to a database table
12. Display (Parameters): The results are displayed for the user

2) Type of knowledge: Information analysts have taskings that require searches or knowledge discovery techniques specialized to the types of knowledge they are looking for. Some examples of types of knowledge that analysts find useful are:
• capabilities, such as software development processes
• expertise and knowledge
• beliefs and intents
• communication patterns
• financial transactions
• organizational knowledge, e.g., people filling leadership positions
3) Focus of Information: The Focus of Information is the
specific information domain of the knowledge search being
performed. Examples are “neural network,” “weapons of
mass destruction,” “terrorism,” and “gambling.” The Focus
of Information is included as one of the attributes used to
index into the case library to identify the most similar case.
There may be sources, search approaches, and inferences
that analysts reuse and share in a particular domain.
4) Geographic Area: Analysts often search for particular types of information related to a particular geographic area. There are specialized websites and search and inference techniques related to specific geographic areas that are represented in a case. Searching for information about a geographic area may require references to distances, cultures, geography and climate, for example.

For case retrieval we are experimenting with similarity metrics so that the retrieved cases are those that most closely match the target case. One approach is to apply different weightings to the above features. Our analysis of the current system suggests that the highest weight is for "Structure" and the lowest weights are for "Focus of Information" and "Geographic Area." Our adaptation process does not combine actions from multiple plans to create a unique, domain-specific plan. Instead, the nearest plan is adapted by search-and-replace of the corresponding terms from the elaboration files. Because it is easy for us to adapt a case by changing its Focus of Information or Geographic Area, substituting the concepts and related elaboration files in the plan script, it is unnecessary to match those features closely. The Structure has much more influence on the steps of the knowledge discovery plan, making it harder to adapt a plan to a different structure. As we continue to grow the case library and increase the richness of the case representation, we expect our similarity metric to require more sophisticated matching. For example, in the current implementation, features either match completely or not at all (1 or 0), but over time we may find it useful to define partial matches.

Results from User Evaluation

The CBR for KD system was evaluated during early 2005 in a study performed by the National Institute of Standards and Technology (NIST). Six Naval reservists, most of whom had experience as analysts, participated in the study; they had a variety of mostly professional backgrounds. After receiving training, they each performed two analysis tasks, one using Google for search as the baseline for comparison, and one using CBR for KD for search. The task was to find the information to use in a report and to provide the outline of that report. The subjects were given two hours to search for the information and provide the outline for the requested report. Each task was based on a scenario that described the subject of the report that the analyst was to provide. One of the tasks was to provide a report on the "Status of Russia's Chemical Weapons Program" and the other was to "Identify the Science and Technology Interests of Iran by Analyzing the Organizational Structure of the Universities." The primary goal of the study was to determine the utility of CBR for KD in terms of time savings, lower cognitive workload, improved quality of product, and increased amount of information. No subject finished before the 2-hour limit. After the two hours, the NASA TLX, a cognitive workload instrument (Dufort and Reid, 2002), was administered as an on-line questionnaire.
The metrics collected for each scenario analysis were:
Timing information: overall time, time in applications and time in CBR for KD windows.
System performance: time to formulate case selection criteria, number of cases run, number of cases abandoned, time to process cases, time to process each search of a multi-search case, number of nodes in the results tree (including instances where the set is empty), number of leaves (snippets) in the results tree, and number of documents represented in the results set.
Document measures: number of relevant documents, percentage of viewed documents that were considered relevant, fraction of documents that were represented in the final report, and number of keystrokes in the report vs. number of cut/paste events.
Other measures: cognitive workload; product ratings and rankings; questionnaire ratings and comments; observation notes; and scenario complexity rating.

Some of the summary results of the study were:
• CBR for KD was easy to learn to use. Subjects exhibited proficiency after training and also indicated this on post-scenario questionnaires.
• CBR for KD was used for an average of about 20 minutes during the 2-hour session; this came at the expense of both using Word and using the IE browser. Most of the time, subjects used the Case Selector panel to express their information need and to select a matching case, and the Results panel to browse the information returned by the system.
• Seven of the twenty cases in the case library (at the time of this experiment) matched the needs of the analysts.
• CBR for KD use was accompanied by decreased temporal pressure compared to Google use. This finding is based on the TLX data and questionnaire responses. When subjects were debriefed, most indicated that CBR for KD saved them time.
• Although there were no appreciable differences in the quality of the reports that subjects wrote when they used CBR for KD, most subjects indicated in their debrief that they believed they wrote better reports with the tool than without it.
• The information that CBR for KD returned to the subjects was perceived to be of better quality. One subject cautioned that filtering data can be a bad thing, since it tends to weed out alternative explanations and divergent opinions. Other subjects noted that the tool would be good to use at the beginning of an analysis, when they are working on the broad picture; other tools could be used later to fill in gaps.
Other findings from the study identified user interface
aspects that need improvements. This is an area that is not
part of our current research. The case library associated
with the current prototype CBR for KD system is small and
may not cover the space necessary for a particular analyst’s
area of interest. For this evaluation we seeded the case
library with a few special cases. However, the evaluation
identified needed extensions to functionality that should be
addressed by expanding the collection of cases. Later
evaluations have provided us suggestions for the redesign
of the user interface to align more closely with the thought
processes of the analysts.
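The weighted, all-or-nothing feature matching described in the retrieval discussion can be sketched as follows. The weights and feature names here are illustrative, not the system's actual values:

```python
# Illustrative per-feature weights: "structure" dominates retrieval,
# while easily-adapted features carry the least weight.
WEIGHTS = {"structure": 4.0, "knowledge_type": 2.0,
           "focus": 1.0, "geographic_area": 1.0}

def similarity(case_features: dict, problem_features: dict) -> float:
    # Features match completely or not at all (1 or 0); matches are
    # then weighted and summed into a single retrieval score.
    return sum(weight for feature, weight in WEIGHTS.items()
               if feature in problem_features
               and case_features.get(feature) == problem_features[feature])

def retrieve(case_library: list, problem_features: dict) -> dict:
    # Return the stored case that scores highest against the problem.
    return max(case_library, key=lambda c: similarity(c["features"],
                                                      problem_features))
```

Because a mismatch on a cheap-to-adapt feature such as focus costs little, a case with the right structure still wins retrieval even when its domain differs from the current problem's.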
Conclusions

We have presented some detail about the application of case-based reasoning to a challenging new task: knowledge discovery for information analysts. We have shown the results of our research into information analysts' needs and processes, the initial case representation, the overall system design and the current state of implementation. In conducting this research, we have been challenged on numerous occasions as we applied case-based reasoning to the broad task of knowledge discovery, and the even broader domain of information analysis. This required experimenting with approaches to apply case-based reasoning to an area with no explicitly represented preexisting cases. The project required making explicit a conceptual model of search planning, an area that the domain practitioners rarely think about. We conceptualized hypothetical user scenarios to guide the exploration of system design in the context of different layers of knowledge and the likely types of questions that analysts might ask both within and between layers. Cases were created by mapping current user processes into envisioned future technologies. We hope that our research can ultimately become part of an analyst's standard toolkit to assist in effective knowledge discovery about threats to U.S. strategic interests.
Several areas are natural for further development of this system. One is the integration of more opportunities for the user to prune the retrieved document set. We have also done a small experiment with multilingual searching and have had some success. Finally, design and development of the maintenance tools to support this system in an operational environment is needed. These are three directions for further development of this tool.

Acknowledgements

This paper is based upon work funded by the U.S. Government, and any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the U.S. Government. The authors gratefully acknowledge the evaluation work of Emile Morse and Jean Scholtz at NIST, and the assistance and contributions of our Georgia Tech Research Institute Case-Based Reasoning for Knowledge Discovery research team members: Laura Burkhart, Reid MacTavish, Collin Lobb, and especially our Government Contracting Officer's Technical Representative.

References

Bodnar, J. 2003. Warning Analysis for the Information Age: Rethinking the Intelligence Process. Center for Strategic Intelligence Research, Joint Military Intelligence College, Washington, DC.
Dufort, P. and Reid, L.D. 2002. NASA TLX Task Load Index Evaluation Program User Guide. Prepared for DCIEM.
Etzioni, O. and Weld, D. 1994. A Softbot-Based Interface to the Internet. Communications of the ACM 37(7): 72-76.
Hammond, K. 1989. Case-Based Planning: Viewing Planning as a Memory Task. San Diego: Academic Press.
Heuer, Richards J. 2001. Psychology of Intelligence Analysis. Center for the Study of Intelligence.
Jones, Morgan D. 1995. The Thinker's Toolkit. New York: Three Rivers Press.
Knoblock, C. 1995. Planning, Executing, Sensing, and Replanning for Information Gathering. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Montreal, Canada.
Kolodner, Janet L. 1993. Case-Based Reasoning. Morgan Kaufmann.
Kolodner, J. and Simpson, R. 1989. The MEDIATOR: Analysis of an Early Case-Based Problem Solver. Cognitive Science 13(4): 507-549.
Krizan, L. 1999. Intelligence Essentials for Everyone. Joint Military Intelligence College, Washington, DC.
Lesser, V., et al. 1998. BIG: A Resource-Bounded Information Gathering Decision Support Agent. UMass Computer Science Technical Report 1998-52, Multi-Agent Systems Laboratory, Computer Science Department, University of Massachusetts.
Nodine, M., Fowler, J., et al. 2000. Active Information Gathering in InfoSleuth. International Journal of Cooperative Information Systems 9(1-2): 3-27. World Scientific Publishing Company.
Patterson, E., Roth, E. and Woods, D. 1999. Aiding the Intelligence Analyst in Situations of Data Overload: A Simulation Study of Computer-Supported Inferential Analysis Under Data Overload. The Ohio State University Institute for Ergonomics/Cognitive Systems Engineering Laboratory, Report ERGO-CSEL-99-02.
Whitaker, E. T. and Bonnell, R.D. 1992. Plan Recognition in Intelligent Tutoring Systems. Intelligent Tutoring Media.
Whitaker, E. and Simpson, R. 2003. Case-Based Reasoning for Knowledge Discovery. In Proceedings of the Human Factors and Ergonomics Society (HFES) Conference, Denver, CO.
Whitaker, E. and Simpson, R. 2004. Case-Based Reasoning in Support of Intelligence Analysis. In Proceedings of the 17th Annual FLAIRS Conference, Miami, FL.