Assessing the Effectiveness of LSI in ... the Intention of a User’s Query

advertisement
From: FLAIRS-01 Proceedings. Copyright © 2001, AAAI (www.aaai.org). All rights reserved.
Assessing the Effectiveness of LSI in Approachingthe Intention of a
User’s Query
Massoud Moussavi I 2and Robert
Najlis
The World Bank*’
1818H Street N.W.
Washington,D.C. 20433
[email protected]~
2ComputerScience Department
Indiana University
Lindley Hall, Rm.215
Bloomington,Indiana 47405
[email protected]
Abstract
keywords and documents also need to be found. One
system for doing this is Latent Semantic Indexing (LSI).
Documents
retrieved in response to a user’s query
shouldreflect the intention of the user. Keyword
searchesare not sufficient to accomplish
this task. This
paper considers Latent SemanticIndexing(LSI) as
possible solution. LSI is consideredin the context of a
large set of WorldBankdocuments.LSI is shownto
be useful in this regard, but not a completesolution.
Additional, domainspecific, information wouldbe
neededas well.
Introduction
The WorldBank has a very large repository of case
studies and reports on various economicand social
development projects undertaken in the developing
countries. Staff working on a new development project
often have to perform a keywordsearch to find an
examplethat is similar to their situation. But keyword
based searches can miss pertinent documents because
those documentsdo not contain the presented keywords.
Furthermore, they might retrieve documents which,
although containing the keywords, in fact are unrelated
to the actual search (Moussavi 1999). This means that
the difficult task of trying to find the correct wordsto
enter for a search is placed on the user. It wouldbe
preferable to give as muchof the job as possible to the
computerinstead. Thefirst place to look for solutions is
how the knowledge base is indexed. Keywords are an
essential building block to any indexing scheme, but they
are not enough by themselves. Correlations between the
* The findings,
interpretations,
and conclusions
LSI groups wordsby their relationships to other
words in and across documents. This allows the system
to place related documentsin close proximity, and
unrelated ones further away. Thus, although some actual
keywordsin the documentsmight differ, if there is
enough similarity across documents, they will be
grouped together. Furthermore, a search on one of the
keywords will also yield documentscontaining the other
related words. LSI works by taking a matrix of terms
and documentsand analyzing it by singular value
decomposition (SVD). Each term and document
indexed by its vector in the matrix (Deerwester, et ai.
1990) (Laham1997). The result is a geometric space
correlating terms and documents.
Research
This paper describes findings from an ongoing
examination of application of LSI to indexing (and
concept generation) a large set of World Bank
documents on economic and social development issues.
Our research objective is to develop a conversational
system (Aha and Breslow 1997) via which the users can
retrieve the most relevant documentsby answering
questions on indexed features. This paper focuses on the
efficacy of LSI in the retrieval of documentsin a manner
whichmore closely fits with the user’s concept for the
query. Queries of over 500 abstracts of documentson the
topic of economicgrowth in East Asia, and especially
expressed
and should not be attributed
in any manner to the World
to the Board of Directors
or the countries
they represent.
320
FLAIRS-2001
in this
paper
Bank, to its
are
those
affiliated
of the
author
organizations,
Korea, were conducted. Each query returned 100
documents.The correlation of the queries and the
documentsretrieved by LSI will be examined.
Threeof the queries that werepresentedto LSI:
( I ) Howdid East Asiancountries achievetheir economic
growth
(2) Howdid East Asian countries achieve their growth
rates
(3) Howdid Koreaachieve its economicgrowth
LSItransformedthose user entered queries into
the followingtbrms:
( i ) east asian countries achieveeconomicgrowth
(2) east asian countries achievegrowthrates
(3) korea achieve economicgrowth
Thus, all commonlyused words were removedfrom
the query before the search was begun. The phrases
werenot treated as sentences,but rather as a
conglomerationof keywords.Queries could be
presentedin the formof sentences, or evenas
documents,but LSIwouldbase the search solely on the
relevance of the keywords,not by attemptingany kind of
linguistic parsingof the query.
certain projects, and wheninformation problems
discourage lending to small and medium-size
firms. The assumptionunderlying policy-based
assistance and other formsof industrial assistance
(such as lowertaxes) is that the mainconstraint
on newor expandingenterprises is limited to
access to credit. Theauthors give an overviewof
credit policies in East Asiancountries (China,
Japan, and the Republicof Korea)as well as
India, and summarizewhatthese countries have
learned about directed credit programs.Among
the lessons: 1) Credit programsmustsmall,
narrowlyfocused, and of limited duration (with
clear sunset provisions); 2) subsidies mustbe low
to minimizedistortion of incentives as well as the
tax on financial intermediationthat all such
programsentail; 3) credit programsmustbe
financedby long-termfunds to preventinflation
and macroeconomic
instability, recourse to
central bankcredit should be avoidedexcept in
the very early stages of developmentwhenthe
central bank’sassistance can help jump-start
economicgrowth;4) they should aim at achieving
positive externalities (or avoidingnegativeones),
any help to declining industries shouldinclude
plans for their timely phaseout;5) they should
promoteindustrialization and export orientation
in a competitiveprivate
Results
The documentsreturned for queries (I) and (2)
quite similar in manyways,but query(2) also included
manydocumentsrelating to interest rates and other
topics whichwerenot quite as relevant to economic
growthon the whole. Results of query (3) were highly
focusedon Korea.at times evento the exclusionof
economicgrowth.
This document
seemsto fit fairly well into the
conceptof the query,so this is an appropriateretrieval
from LSI. Notably though, this documentwas not
consideredterribly relevant to query(3), whereit was
only given a weightof 0.279761,quite a low score,
especially giventhe strong weightingit receivedfor the
other twoqueries. This, despite the fact that Koreais
even mentionedexplicitly in the document.
Queries (1) and (2) both returned with document
as the mostrelevant, giving it a weightof 0.740514and
0.712007respectively. As can be seen, the document
refers to the effect of credit policies in the development
of East Asiancountries:
Documents
512, 516, and 517 were all considered to
be highly relevant to query(3), being given weights
0.690721,0.619509,and 0.596581respectively. All of
the documentsrefer to Korea’s economicgrowth, thus
one wouldexpect this to be the case.
Credit policies : lessons fromEast Asia -Directed credit programswerea majortool of
developmentin the 1960sand 1970s. In the
1980s, their usefulness wasreconsidered.
Experiencein most countries showedthat they
stimulatedcapital-intensiveprojects, that
preferential funds wereoften (mis)usedfor
nonprioritypurposes,that a declinein financial
discipline led to lowrepaymentrates, and that
budget deficits swelled. Moreover,the programs
were hard to remove.But Japan and other East
Asiancountries havelong touted the merits of
focused, well-managed
directed credit programs,
saying they are warrantedwhenthere is a
significant discrepancybetweenprivate and social
benefits, wheninvesmentrisk is too high on
Document512:
Korea- Problemsand issues in a rapidly growing
economy-- The phenomenaleconomicprogress
of the Republicof Koreaduringthe past decadeis
one of the outstandingsuccessstories in
international development.Despite a lack of
natural resources, Koreahas madethe transition
froma largely agricultural society into a semiindustrialized countryin a relatively short time.
This booktraces the impressiveindustrial growth
and discussesthe issues that Korea’sfive-year
plan for 1977-81musttry to resolve. It focuseson
the problemof mobilizingfinancial resources both foreign capital and domesticsavings- for
future growthand on the problemof increasing
industrial exports without undermining
efforts
KNOWLEDGE
MANAGEMENT 321
towardimportsubstitution. This study of Korea’s
plans for industrial expansionemphasizesthe
textile, electronics, andshipbuildingindustries, as
well as the contribution that small and mediumsize firms can make.It also considersthe social
goals of moreevendistribution of the benefits of
growth, increased employment,and greater parity
betweenurban and rural incomes.
Onewouldalso expect that any documentsso
relevant to a query on the economicgrowthof Korea,
wouldalso be of value in a querypertaining to the
economicgrowth of East Asia. However,LSI did not
weighthese documentsheavily for queries (1) and (2).
For query(1), they were given weightsof 0.406988,
0.204237,and0.221794,while for query (2) they were
given weights of 0.356005, 0.169973,and 0.276461
respectively. Likely, these documentsmight not even be
consideredrelevant in a search for queries (I) and (2).
In the case of document
504, LSIrated it similarly
for queries (1), (2), and (3), 0.414071,0.370077,
0.372094respectively. This despite the fact that the
documentdoes not mentionEast Asia or Koreaat any
point in the document.
Nepal- 1997Economicupdate: the challenge of
accelerating economicgrowth-- This report was
prepared for the NepalAid Groupmeeting
scheduledto be held in Paris in early 1998. The
report focusesselectively on a limited set of key
issues critical to accelerating economicgrowth
and improving developmentmanagementin the
short-to-mediumterm. The report is organized
into twochapters. ChapterI reviewsrecent
economicdevelopments,focusing particularly on
economicgrowth, fiscal management,the
financial sector, and progressin economic
reforms. Chapter 2 looks at the development
agendawhich will accelerate economicgrowth
and developmenton a sustainable basis. The main
elements of the developmentagendainclude: 1)
Aimfor sustained high rates of broad-basedand
more equitable economicgrowth, and make
strong efforts to reducethe populationgrowthrate
in order to improvethe living conditionsof the
predominantlyrural poor. 2) Makeconcerted
efforts to promotehumanresource development
and to providebasic infrastructure and services.
Improvingeducation, health, and nutrition are
essential for enhancing
literacy, skills,
productivity, and income-earningcapacity.
Increasingbasic infrastructure woulddirectly help
support economicactivities and improveliving
standards. 3) Totake care of those not directly
benefiting from the growthprocess, especially
vulnerable and underprivilegedgroups, including
women.
322
FLAIRS-2001
In fact, perhapsit is becausethis document
mentions
neither East Asiaor Koreathat the weightsare so
similar. It is not pulledoff in anydirection by the
inclusion of anyof these terms.
Anotherinteresting result is the case of document
230. It is a very short document,withoutevenan
abstract.
Date: 1997-06-26
Report#:i 6812
Title: Korea-Public Hospital Modernization
Project
Abstract:
Thereis verylittle, if any relation to economic
development.Whythen wouldit receive a weight from
query(3) as high as 0.684027?Onepossible explanation
is that of the fewtermspresent in the document,Koreais
one. Thusthe relation of the occurrenceof the term
Koreato the total numberof terms in the document
is
quite high. However,
there is no correlation to the
concerns of economicgrowthmentionedin the query.
Not surprisingly document230was not one of the
documentsretrieved in query(1) or query(2) (each
queryretrieved 100of the over 500 documents).
Discussion
WhileLSIseemsquite able to retrieve documents
relevant to the keywords
given in a queryand in this
respect its performanceis superior to keyword-based
searchengines,it is still not satisfactoryin dealingwith
the intention of the user. Our last examplewhereLSI
returns a documentthat has no relation to economic
developmentin Koreabrings up the followinginteresting
question: Whatadditional (external to LSI) criteria
shouldbe used for rejecting or selecting a document’?
Wecould consider several parameters: length of a
document,relative weightsof terms within a query(for
instance, given our domainknowledge,we might
associate a higher weight to ’economicgrowth’than
’Korea’), etc. Moreover,we can take advantageof
subsumptionaxioms(Woods2000) and rules that
capture a gooddeal of domainknowledgeabout
relationships betweenconceptsto preprocessqueries and
achieve a better matchto user’s intention. For example,
our results werenot quite as expectedin so far as
returning similar documentsfor queries on economic
growthin East Asia and Korea.Giventhat Koreais an
East Asiancountry, one mightwell expectto see a higher
correlation of weightsfor retrieved documents.
Preprocessingof the queryin this case amountsto
utilizing a keywordbasedconcepthierarchy to insure
that related keywordsare included in any given query.
Thusfor example,Koreawouldbe included in a query
on East Asia.
It is clear that, despiteLSI’sability to find
correlations betweenterms and documents,there is still a
greatneedfor storingdomain
knowledge
in termsof
heuristicrulesandrelationsbetween
concepts
(e.g.,
relationbetween
countriesandregions).Sucha system
wouldconsistof threeintegratedcomponents:
A case
libra,, containing
concretecasesof development
projects;Aknowledge
basethat containsgeneraldomain
knowledge
in the formof relationsamong
featuresand
heuristicrules;andAsituationassessment
userinterface
that derivesmoremeaningful
featuresbeforeattempting
retrievalof a usefulcase(Moussavi
1999).LSIwould
integralto thesystemin a number
of ways.Forone
thing,it would
beusedto indexthecaselibrary,thus
extending
the indexing
of thesystemto includemore
thanjust theknowledge
base.
One, very useful, ability of manyCBR
applicationsis the ability to adapt old cases to
newones. Given a knowledgebase of
documents,no adaptation is possible, as the
documentscannot be changed. However,with
a systemthat is constructedto handleuser
t~eries
of the
base, Thus
the queries
emselves
canknowledge
becomecases.
adaptation
of old queriesto newonesis a possibility.
Furthermore,this focuses the adaptation
processon the desires of the user, rather than
the facts of the knowledgebase, whichseems
appropriate for improvingsystem response to
user needs. In order to do this, the queries,
both old and new, wouldneed to be indexed.
Again, LSIis a powerfultool for
accomplishing such indexing. LS] wouldbe
useful for this in two ways.One,the actual
queries could be indexed. Whilethe queries
are quite short, over time, quite a numberof
themwouldbe built up. Secondly,the results
of the queries could be indexed. This would
allow for the generation of a knowledge
concerning query results. Such knowledge
could be used as feedbackfor the systemin the
creation and maintenanceof concept
categories, types of informationfounduseful
by users given with different searches, and
other informationrelated to search cases. Such
informationwouldbe invaluable to the longterm maintenanceand growth of such a CBR
system. Indexing by a system such as LSI
wouldmakethe reformation eminently more
accessible.
LSIwasgoodat this task, it wouldlikely need
to be augmentedby other, domainspecific
knowledge.
Acknowledgements
Theauthorswouldlike to thankthe FLAIRS
reviewersfor their helpfulcomments
andfeedback.
We
would
also like to thankDavid
Leake
for his helpand
suggestions.
References:
Aha, D. W. and Breslow, L.A. (1997). Refining
Conversational Case Libraries. In Proceedings of the
Second International
Conference on Case-Based
reasoning, pages 267-278. Providance, RI: SpringerVerlag.
S. Deerwester,
S.T.Dumais,
G.W.
Furnas,
T.K.
Landauer,
R. Harshman,
"Indexing
by LatentSemantic
Analysis",
Journalof the American
Societyfor
Information
Science,41(6),1990,pp.391-407.
Laham,
D.(1997).LatentSemantic
Analysisapproaches
to categorization.
In M.G.Shafto&P. Langley
(Eds.),
Proceedings
of the 19thannualmeeting
of theCognitive
ScienceSociety(p. 979).Mawhwah,
NJ:Erlbaum
Moussavi,
Massoud.
(1999).A Case-Based
Approach
to
Knowledge
Management.
In (Aha&Mufioz-Avila,
1999)
Woods,
William
(1997).Conceptual
Indexing:A better
wayto organizeknowledge.
TechnicalReportSMLI
TR-97-61,
SunMicrosystems
Laboratories,Mountain
View,CA.
www.sun.comlresearchltechrepl
1997/abstract-61
.html.
Summary
This paper lookedat the effectiveness of
LSIin returning documentsthat are close to
what the user was looking for. The
performanceof LSI in this task has
ramificationsfor its use in that task, as wellas
other related. In order for LSIto be of use in
anyof those tasks, it mustreturn results true to
the user’s intentions. It wasseen that while
KNOWLEDGE
MANAGEMENT
323
Download