Hypertext Summary Extraction for Fast Document Browsing

From: AAAI Technical Report SS-97-02. Compilation copyright © 1997, AAAI (www.aaai.org). All rights reserved.
Kavi Mahesh
Computing Research Laboratory (CRL)
New Mexico State University
Las Cruces, NM 88003-8001 USA
mahesh@crl.nmsu.edu
http://crl.nmsu.edu/users/mahesh/papers/hypex-summary.html
Abstract
This article describes an application of Natural Language Processing (NLP) techniques to enable fast browsing of on-line documents by automatically generating Hypertext summaries of one or more documents. Unlike previous work on summarization, the system described here, HyperGen, does not produce plain-text summaries and does not throw away parts of the document that were not included in the summary. HyperGen is based on the view that summarization is essentially the task of synthesizing Hypertext structure in a document so that parts of the document "important" to the user are accessible up front while other parts are hidden in multiple layers of increasing detail. In fact, HyperGen generates short descriptions of the contents and rhetorical purposes of the hidden parts to label the Hypertext links between the summary and the different layers of detail that it generates. A prototype HyperGen system has been implemented to illustrate the techniques and demonstrate its usefulness in browsing World Wide Web documents.
1. Summarization as Hypertext Generation

There is no correct algorithm for summarization. What belongs in a summary extracted from a document is determined by the needs, likes, and stylistic preferences of users and the tasks they are performing using the summary. Why, then, should a summarization system decide what to include in a summary and what to throw away? On the other hand, not throwing away any part of the document defeats the very purpose of summarization, namely, fast browsing of documents. It seems that in general the best way to summarize a document is to assign to each piece of the document a level of interestingness (or importance). The most interesting pieces should be most easily accessible to the user and can be considered to constitute the summary of the document. Less interesting (and more detailed) pieces should still be accessible but hidden behind more interesting ones in multiple layers of detail. In this view of summarization, each layer can be considered a summary of the more detailed layer immediately behind it.
Linking the different pieces and layers together in a Hypertext structure enables the user to navigate to the different parts of the document as desired. Labeling such Hypertext links with a short phrase or two indicates to the user what lies behind a link. In HyperGen, a prototype system that generates Hypertext summaries of English and Spanish texts, these labels are in fact very short summaries of the pieces of text that lie behind them. Each label summarizes the content and/or the perceived rhetorical purpose (e.g., illustration, historical background, comparison, explanation, etc.) of one or more pieces of text. Users can "open" a hidden piece and "zoom into" any part of the document if its label suggests matter of interest to them.
Hypertext links in a HyperGen summary need not be limited to other parts of the same document. Such links can in fact function as cross-links to related parts of other documents, even those in other languages. Such cross-links are generated by HyperGen using keywords that it recognizes while summarizing the documents involved.
HyperGen has been developed under the belief that a summarization system that (i) runs much faster than reading the full document, (ii) is robust, (iii) is portable, (iv) is richly customizable, and (v) provides excellent aids for the user to navigate through the different parts of a document, will be useful in practice even if the summaries are less than ideal. HyperGen is intended for document browsing scenarios over any large collection of documents, such as an on-line library or the World Wide Web, whether the documents themselves are in plain text or Hypertext. Current Web browsers and search engines merely follow predefined Hypertext links and retrieve entire documents by searching for keywords in pre-computed index databases. HyperGen goes beyond this technology and dynamically constructs Hypertext presentations of documents at multiple levels of detail, from brief summaries to entire documents. This view of summarization as a means for document visualization through a mere manipulation of Hypertext links between different parts of documents promises to deliver the full potential of Hypertext document collections such as the WWW to the large community of Internet users.
2. Example
Figure 1 shows a Hypertext summary extracted by HyperGen. The Web browser screen on the left shows the main summary along with automatically generated keywords in the document and several labels in between paragraphs that have Hypertext links to parts of the document not included in the summary. The paragraphs should in fact be considered as "callouts" ("highlights" or "pull quotes") extracted from the document. The labels, Hypertext links, and the separate files for the hidden parts are all automatically generated by HyperGen. The part of the document hidden behind the first label "The Dealer Network Option" is shown on the right half of Figure 1. It can be seen that this label is in fact a section heading in the document. Other labels, such as "Rank Xerox's expertise...the chaos...", are automatically extracted from one of the "most interesting" sentences in the hidden part.

Some of the labels are in fact links to large chunks of the document which often contain several subsections. These chunks are in fact summarized again by HyperGen to create multiple layers of intermediate detail. HyperGen is currently being extended to produce labels describing the rhetorical purpose of a piece of text. For example, rhetorical categories are identified and presented using phrases such as "comments," "analysis," and "example." Optionally, any multimedia elements in the document will also be retained by HyperGen in the appropriate parts of the Hypertext summary.

3. Generating Hypertext Summaries

The core summarization problem is taking a single text and producing a shorter text in the same language that contains all the main points in the input text. Figure 2 shows the basic architecture of the HyperGen summarization engine.
HyperGen has adopted a robust, graded approach to building the core engine by incorporating statistical, syntactic, and document-structure analyses among other techniques. This approach is less expensive and more robust than a summarization technique based entirely on a single method. HyperGen is designed in such a way that as additional resources, such as lexical and other knowledge bases or text processing and machine translation engines, become available, they can be incorporated into HyperGen's architecture to incrementally enhance its capabilities and improve the quality of summaries. For example, there is no morphological analyzer (or stemmer) in the current implementation of HyperGen. If available, it can be added easily as shown in Figure 2.

Some of the main modules in HyperGen are (i) document structure analysis, (ii) sentence selection, (iii) sentence simplification, (iv) summary construction, and (v) customization.
3.1. Document Structure Analysis
Document structure analysis is important for extracting the topic of a text (Paice and Jones, 1993; Salton and Singhal, 1994; Salton et al., 1995). In such an analysis, for example, titles and sub-titles would be given a more important weight than the body of the text. Similarly, the introduction and conclusion for the text itself and for each section are more important than other paragraphs, and the first and last sentence in each paragraph are more important than others. The applicability of these depends, of course, on the style adopted in a particular domain, and on the language: the stylistic structure and the presentation of arguments vary significantly across genres and languages. Structure analysis must be tailored to a particular type of text in a particular language. Document structure analysis in HyperGen involves the following subtasks:

• Document Structure Parsing: HyperGen assumes that documents have HTML markers for headers, section headings, and paragraph breaks. It separates the title, section and subsection headings, and other data and graphics from paragraphs of text.

• Sentence Segmentation: HyperGen breaks each paragraph into sentences by looking for sentence boundaries. It uses a stop list to deal with special cases such as certain uses of '.' in abbreviations (a sketch of this step follows the list).

• Text Structure Heuristics: HyperGen uses heuristic rules based on document structure to rank the sentences in a document (see below). For example, different scores are assigned to first sentences of paragraphs, to single-sentence paragraphs, and so on.
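The segmentation step can be pictured with a short sketch. The code below is not HyperGen's implementation; the abbreviation stop list, class names, and boundary heuristic are assumptions made only for this illustration.

```java
import java.util.*;

// Minimal sentence segmenter in the spirit of the Sentence Segmentation subtask:
// split on '.', '!', or '?' unless the period belongs to a known abbreviation.
public class SentenceSegmenter {
    // Hypothetical abbreviation stop list; HyperGen's actual list is not given in the paper.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("Mr", "Mrs", "Dr", "Inc", "Corp", "U.S", "e.g", "i.e"));

    public static List<String> segment(String paragraph) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : paragraph.split("\\s+")) {
            current.append(token).append(' ');
            if (endsSentence(token)) {
                sentences.add(current.toString().trim());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            sentences.add(current.toString().trim());
        }
        return sentences;
    }

    // A token ends a sentence if it ends in '.', '!' or '?' and its stem is not an abbreviation.
    private static boolean endsSentence(String token) {
        if (!(token.endsWith(".") || token.endsWith("!") || token.endsWith("?"))) return false;
        String stem = token.replaceAll("[.!?\"')\\]]+$", "");
        return !ABBREVIATIONS.contains(stem);
    }

    public static void main(String[] args) {
        System.out.println(segment("Mr. Lehngruber has two choices. Both are rotten."));
    }
}
```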
Figure 1. A Hypertext summary generated by HyperGen is shown on the left. On the right is one of the parts of the document hidden under the label "The Dealer Network Option." The document is 200 lines long.
Figure 2. Architecture of HyperGen showing the basic steps in extracting the Hypertext summary of a document (the modules shown include a stop list, word frequency and keyword analysis, sentence selection, and optional morphological analysis). Note that only selected sentences are parsed.
3.2. Sentence Selection
In order to allow a multitude of techniques to contribute to sentence selection, HyperGen uses a flexible method of scoring the sentences in a document by each of the techniques and then ranking them by combining the different scores. Text-structure based heuristics provide one way to rank and select sentences in a document. Additional methods included in HyperGen are described below.
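One way such a combination could look is sketched below. The weighted-sum scheme and all class and method names are illustrative assumptions; the paper does not specify how HyperGen combines the individual scores.

```java
import java.util.*;
import java.util.stream.Collectors;

// Illustrative score combiner for the sentence-selection step: each heuristic contributes
// a score per sentence, and sentences are ranked by a weighted sum of those scores.
public class SentenceRanker {

    /** One scoring technique, e.g. word frequency, keywords, or text-structure heuristics. */
    public interface Scorer {
        double score(String sentence);
    }

    /** Rank sentences by combined score and keep the top ones for the summary. */
    public static List<String> select(List<String> sentences, List<Scorer> scorers,
                                      double[] weights, int summarySentences) {
        Map<String, Double> combined = new HashMap<>();
        for (String s : sentences) {
            double total = 0.0;
            for (int i = 0; i < scorers.size(); i++) {
                total += weights[i] * scorers.get(i).score(s);
            }
            combined.put(s, total);
        }
        return sentences.stream()
                .sorted(Comparator.comparingDouble((String s) -> -combined.get(s)))
                .limit(summarySentences)
                .collect(Collectors.toList());
    }
}
```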
3.2.1. Word Frequency Analysis

The basic technique is to sort the words in the document by frequency of occurrence within the document and select a few of the most frequent content words (i.e., words other than articles, prepositions, conjunctions, and other closed-class words). Sentences containing those words get a score increment (a sketch of this scoring step follows the list below). Supporting processes needed for word frequency analysis include:

• Morphological Analysis (Optional): Statistical analysis works more reliably on a text which has been morphologically processed to recognize the same word with different inflections. The current implementation of HyperGen uses simple string matching for counting and comparing words.
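A minimal sketch of this frequency-based scoring, using simple string matching as described above, might look as follows. The closed-class word list and the score increment of one per frequent word are assumptions made for the example.

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch of the word-frequency technique: count content words across the document,
// keep the most frequent ones, and give a score increment to sentences containing them.
public class FrequencyScorer {
    // Hypothetical closed-class list; HyperGen uses a closed-class lexicon not given here.
    private static final Set<String> CLOSED_CLASS = new HashSet<>(Arrays.asList(
            "the", "a", "an", "of", "in", "on", "to", "and", "or", "but", "for", "with", "is", "are"));

    /** Most frequent content words in the document (lower-cased, simple string matching). */
    public static List<String> frequentContentWords(List<String> sentences, int howMany) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : sentences) {
            for (String word : sentence.toLowerCase().split("[^a-z]+")) {
                if (!word.isEmpty() && !CLOSED_CLASS.contains(word)) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(howMany)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    /** Score increment: +1 for each frequent content word the sentence contains. */
    public static double score(String sentence, List<String> frequentWords) {
        String lower = sentence.toLowerCase();
        return frequentWords.stream().filter(lower::contains).count();
    }
}
```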
3.2.2. Corpus Statistics

Word frequency analysis can be enhanced by counting word frequencies across an entire corpus instead of just within a document. Words that have significantly higher frequencies within the document relative to the whole corpus are likely to indicate significant parts of the document. Such techniques will be integrated into future enhancements of HyperGen.
3.2.3. Keyword and Keyword Pattern Analysis

Preset or user-specified keywords, from either the domain or to focus on a particular style, can be used to introduce elements of targeted summarization. Keyword-based ranking and selection has already been implemented in HyperGen. This will be further developed in the future to accept more expressive keyword patterns. The new capability will enable other query-based IE/IR modules to be integrated with HyperGen.

Open-class words in document titles and section headings are also treated as keywords. Sentences in each section containing such keywords are given score increments to boost their chances of being selected for inclusion in the summary.

Finally, sentences in the document are ranked based on the scores returned by all of the above techniques. The highest ranking sentences are selected to constitute a summary of a given length (which is either a default or set by the user).

3.3. Sentence Simplification

Sentences selected for inclusion in the summary are often lengthy and can be simplified to further shorten the summary. HyperGen has implemented a novel "phrase dropping" (or parse-tree pruning) algorithm based on phrase-structure heuristics for English and Spanish. For example, it drops embedded clauses and right-branching prepositional adjuncts. Figure 3 shows an example of sentence simplification.

This method requires a robust, shallow parser, at least a partial one that identifies phrase boundaries, with good coverage. The simplification module is currently being integrated with the sentence selection module so as not to drop parts that were deemed important by the selection scores. The summary shown in Figure 1 was produced without sentence simplification.

Shallow Syntactic Parsing. A shallow syntactic analysis helps not only sentence simplification but also the recognition of important sentences and extraction of semantically relevant parts of these sentences. For example, labels for Hypertext links are generated by extracting several noun phrases from the highest ranking sentence in the hidden part attached to the link. Syntactic analysis is also used in HyperGen to recognize the rhetorical purposes of texts. A very simple parser has been implemented for English using only a closed-class English lexicon. This lexicon merely specifies the part of speech of closed-class words and a few stop words. The parser can correctly tag many noun, verb, and prepositional phrases in texts. HyperGen is very fast in spite of using a parser, since the parser runs only on selected sentences.
Figure 3. Sentence simplification example showing the retained parts in boldface and the dropped parts in a smaller font.
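The following sketch conveys the phrase-dropping idea on a flat list of shallow-parse chunks. It is not HyperGen's parse-tree pruning algorithm; the chunk representation, labels, and dropping rule are simplified assumptions made for the example.

```java
import java.util.*;

// Rough sketch of "phrase dropping": given a shallow parse of a sentence as a sequence of
// labelled chunks, drop prepositional adjuncts and embedded clauses and keep the rest.
public class PhraseDropper {
    /** A chunk from a shallow parse, e.g. NP, VP, PP, or an embedded clause (SBAR). */
    public record Chunk(String label, String text) {}

    public static String simplify(List<Chunk> chunks) {
        StringBuilder kept = new StringBuilder();
        for (Chunk c : chunks) {
            // Illustrative rule: prepositional phrases and embedded clauses are droppable.
            boolean droppable = c.label().equals("PP") || c.label().equals("SBAR");
            if (!droppable) {
                kept.append(c.text()).append(' ');
            }
        }
        return kept.toString().trim();
    }

    public static void main(String[] args) {
        List<Chunk> parsed = List.of(
                new Chunk("NP", "The company"),
                new Chunk("VP", "opened a new plant"),
                new Chunk("PP", "in the region"),
                new Chunk("SBAR", "which had been planned for years"));
        System.out.println(simplify(parsed));  // -> "The company opened a new plant"
    }
}
```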
3.4. Summary Construction
Using the simplified parts of selected sentences, a summary is constructed by extracting the corresponding parts of the source document. A number of issues such as capitalization and other punctuation must be addressed to render the summary and make it readable, especially when sentences have been simplified by dropping some of their parts. In addition, the document title (if any) and information about the source and date of the document are presented at the beginning of a summary. Several keywords (frequent words or user-specified keywords) are also presented at the top of the summary. Future versions of HyperGen will highlight occurrences of the keywords in the body of the summary or the hidden parts and provide cross-links to other documents containing the same keywords.
Current work on HyperGen is developing algorithms for identifying the rhetorical purpose of a paragraph by matching its sentences against a glossary of syntactic patterns that act as keys for the rhetorical purposes. The glossary contains syntactic patterns for phrases such as "consider x" and "illustrated by x."
3.4.1. Hypertext Generation
Each contiguous piece of unselected text is written into a separate HTML file with Hypertext links to the appropriate point in the summary (or the next layer in front of it, in general). Labels are generated for these links as follows: if there is a section heading for the hidden part, use it as the label; otherwise, select the highest scoring sentence(s) in the hidden part, parse them, simplify them by dropping phrases, and, if they are longer than a preset maximum length, pick any keywords or other noun phrases in them to construct a label of length less than the maximum length. These labels are in a sense summaries of the hidden parts and are intended to indicate the main topic (or theme) of the hidden part of the document.

If a hidden part is longer than a (user customizable) threshold or has multiple sections, then the part is summarized again by HyperGen to create an intermediate layer of summary with hidden subparts behind it. Thus the document is ultimately broken into individual paragraphs where the contents of each unit is summarized by a label that links it to other pieces and subpieces of the document in the Hypertext summary.
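A minimal sketch of this step is given below; the file naming scheme, markup, and method names are assumptions for illustration rather than HyperGen's actual output format.

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch of the Hypertext generation step: each contiguous piece of unselected text is
// written to its own HTML file, and a labelled link to it is inserted into the summary.
public class HypertextWriter {
    /** Writes one hidden part to its own file and returns the link to place in the summary. */
    public static String writeHiddenPart(Path outputDir, int partNumber,
                                         String label, String hiddenHtml) throws IOException {
        Path partFile = outputDir.resolve("hidden-part-" + partNumber + ".html");
        String page = "<html><body>\n" + hiddenHtml
                    + "\n<p><a href=\"summary.html\">Back to the summary</a></p>\n</body></html>";
        Files.writeString(partFile, page);
        // The label (a section heading or a simplified high-scoring sentence) names the link.
        return "<p><a href=\"" + partFile.getFileName() + "\">" + label + "</a></p>";
    }
}
```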
Apart from Hypertext links between a summary and different parts of a document, future versions of HyperGen will generate additional navigational aids by matching keywords and key topic areas across documents in a collection. Results will be presented in the Hypertext summary by providing an overall set of links to related documents as well as links to related documents based on each keyword. Some such cross-links may in fact be between documents in different languages (if the keywords can be translated by a machine translation module).
3.5. User Customization
Users can customize HyperGen in the following ways (a sketch follows this list):

• setting the length of the summary: relative to the length of the document or in absolute terms (number of characters or sentences)

• specifying keywords to use: e.g., "joint venture"

• specifying the number of frequent or title words to find

• controlling sentence ranking heuristics: by adjusting the levels of preference for frequent words, keywords, title words, the first sentence in a paragraph, etc.

• extending the stop list: by specifying words to be ignored
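These options can be pictured as a small settings object, sketched below; the field names and default values are assumptions made for illustration and are not taken from HyperGen.

```java
import java.util.*;

// Sketch of a settings object covering the customizations listed above.
public class SummarySettings {
    int summaryLengthSentences = 10;          // absolute length; alternatively a fraction of the document
    double summaryLengthFraction = 0.0;       // > 0 to use a length relative to the document
    List<String> userKeywords = new ArrayList<>(List.of("joint venture"));
    int frequentWordsToFind = 8;              // number of frequent or title words to use
    double frequentWordWeight = 1.0;          // ranking-heuristic preference levels
    double keywordWeight = 2.0;
    double titleWordWeight = 1.5;
    double firstSentenceWeight = 1.0;
    Set<String> extraStopWords = new HashSet<>();   // user-supplied words to ignore
}
```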
In essence, all aspects of HyperGen's summarization behavior can be controlled and customized by a user through a friendly graphical interface. Future versions of HyperGen will allow users to customize rhetorical purpose identification by adding new phrase patterns to look for or by telling HyperGen to focus on particular rhetorical types.
4. HyperGen: The System
HyperGen has been implemented entirely in Java and can be used in conjunction with a Web browser. It is fast, robust, modular, portable, essentially multilingual, and simple (e.g., it does not require complete lexicons, full parsers, etc.). It has been tested on several English and Spanish news articles. No usability study has been conducted so far to determine if HyperGen summaries are in fact useful for Web browsing or other document filtering applications. A new version of HyperGen with rhetorical-purpose labels will be developed and demonstrated during the symposium. Results from testing HyperGen on larger collections of documents and preliminary usability studies will be reported during the symposium if possible.
5. Related Work
Previous work in summarization (also called automatic abstracting) has addressed primarily the simpler problem of producing a plain-text summary of a single document in the same language (Cohen, 1995; Luhn, 1958; Paice, 1990; Pinto Molina, 1995; Preston and Williams, 1994; Salton et al., 1994). Document retrieval and classification efforts, on the other hand, have produced a multitude of techniques for selecting a subset of an entire collection of documents. However, they simply present the retrieved documents in their entirety, providing little support for quickly digesting the contents of an individual document. Information extraction (Cowie and Lehnert, 1996) systems have assumed that what is of interest to users is known a priori in the form of templates. Most work in natural language generation has focused on generating summaries of data, not texts (Kalita, 1989; Kukich et al., 1994; McKeown et al., 1995; Robin, 1994). Work in machine translation has assumed for the most part that source texts must be translated in their entirety. A key drawback of the above research is that it assumed that a document must either be processed (e.g., translated, retrieved, etc.) in its entirety or be summarized while throwing away the unselected parts. By bringing some of these areas together, HyperGen has attempted to exploit NLP techniques to improve the utility of Web browsers. Since there is no algorithm for summarization that guarantees a summary that always meets the needs of a task (i.e., never misses a piece of information that is of interest to the user or is relevant to the task on hand), it is critical for such a system to indicate to the user the contents of the unselected parts of a document. HyperGen provides this new functionality.

6. Future Work: Further Uses of NLP

This section briefly outlines additional applications of NLP techniques to improve the quality of HyperGen's summaries. The core summarization method outlined above has the advantage of simplicity, does not require a language generator, and delivers grammatical and fairly readable summaries even in untargeted summarization situations. However, it suffers from several problems which lead to important research issues:

• Pronouns, reference, anaphora, etc.: When the selected sentences contain pronouns or other references, anaphora, or ellipsis, the required context may be missing in the summary, thereby hurting its readability and ease of understanding. HyperGen needs simple reference resolution techniques to overcome these problems, and heuristics for avoiding such problems by not selecting sentences that are beyond the capabilities of the resolution techniques.
• Poor flow in summary: Since the summary is put together by conjoining different pieces of the source text, it is not likely to have a good flow. However, this may not be a critical problem for HyperGen since it does not produce a plain-text summary. HyperGen summaries are inherently Hypertexts and are interspersed with hyperlinks to other pieces of documents.
Further applications of NLP techniques may yield benefits in several areas:

Ontology-based content classification: Future versions of HyperGen will attempt to classify the topic (or "theme") of a piece of text using a broad-coverage ontology, such as the Mikrokosmos Ontology (Carlson and Nirenburg, 1990; Mahesh and Nirenburg, 1995; Mahesh, 1996). Words in a text will be mapped to ontological concepts using a semi-automatically acquired mapping from WordNet entries to concepts in the Mikrokosmos Ontology. An appropriate clustering algorithm will be used to select the main concept from among the concepts to which words in a piece of text are mapped. Such a method is expected to yield better labels that summarize hidden parts than current ones often generated by simplifying a sentence in the hidden text.
Translation issues: A glossary-based (or other fast and simple) translation engine can be used to translate necessary parts of documents in other languages to English. The entire document should be translated only if demanded by the user. Different methods for interleaving summarization and translation operations can be explored to minimize translation efforts. For example, the source text may be summarized in the source language and the summary translated to English. Unselected parts may sometimes need to be translated to English before labels can be generated to identify and describe their topic areas and rhetorical purposes.
Self-evaluation: Every decision made by HyperGen is based on scores assigned according to a variety of user-customizable parameters. Self-evaluation can be added to this design by attaching simple evaluation procedures to each of the parameters so that a measure of the system's confidence is determined whenever a decision is made and a score is assigned along some parameter. A simple algorithm can be developed for combining individual measures of confidence so that an overall measure can be presented to the user. Such confidence measures can be integrated into the labels and summaries by generating English phrases such as "certainly about," "appears to be," or "not sure about" to express HyperGen's evaluation of its own summarization decisions.
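A minimal sketch of such a combination, assuming a simple average of per-parameter confidences and arbitrary thresholds for the three phrases, might look as follows:

```java
// Sketch of the self-evaluation idea: individual confidence measures attached to scoring
// decisions are averaged into an overall confidence, which is rendered as a hedging phrase.
// The averaging scheme and thresholds are assumptions made for illustration.
public class ConfidenceReporter {
    /** Combine per-parameter confidences (each in [0,1]) into an overall measure. */
    public static double combine(double[] confidences) {
        double sum = 0.0;
        for (double c : confidences) sum += c;
        return confidences.length == 0 ? 0.0 : sum / confidences.length;
    }

    /** Map the overall confidence onto one of the phrases suggested in the paper. */
    public static String phrase(double confidence) {
        if (confidence > 0.8) return "certainly about";
        if (confidence > 0.5) return "appears to be";
        return "not sure about";
    }
}
```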
7. Conclusions

There is an overload of on-line documents on the users of the Internet. Current search engines and browsers are limited in their capabilities to ease this overload and to help users quickly find and digest the information they need from the large collection of arbitrarily irrelevant documents available to them. The HyperGen system described in this article attempts to remedy this situation by applying NLP techniques to enhance the capabilities of a Web browser. HyperGen automatically constructs Hypertext summaries of documents so that users can see interesting subparts of documents in a summary up front and, at the same time, see labels summarizing the parts that are hidden behind the main summary in layers of increasing detail. The author firmly believes that these summaries are more meaningful and useful than the first few lines of documents typically displayed in the results from a state-of-the-art Net search engine, and that these capabilities cannot be achieved without the use of NLP techniques.

Acknowledgments

I would like to thank Sergei Nirenburg and Rémi Zajac for many insightful comments and discussions.

References

Carlson, L. and Nirenburg, S. (1990). "World Modeling for NLP." Center for Machine Translation, Carnegie Mellon University, Tech Report CMU-CMT-90-121.

Cohen, J.D. (1995). "Highlights: Language- and domain-independent automatic indexing terms for abstracting." Journal of the American Society for Information Science, 46(3):162-174.

Cowie, J. and Lehnert, W. (1996). "Information Extraction." Communications of the ACM, special issue on Natural Language Processing, January 1996.

Kalita, J. (1989). "Automatically generating natural language reports." International Journal of Man-Machine Studies, 30:399-423.

Kukich, K., McKeown, K., Shaw, J., Robin, J., Morgan, N., and Phillips, J. (1994). "User-needs analysis and design methodology for an automated document generator." In Zampolli, A., Calzolari, N., and Palmer, M., editors, Current Issues in Computational Linguistics: In Honor of Don Walker. Kluwer Academic Press, Boston, MA.

Luhn, H.P. (1958). "The automatic creation of literature abstracts." IBM Journal of Research and Development, 2(2):159-165.

Mahesh, K. and Nirenburg, S. (1995). "A situated ontology for practical NLP." In Proceedings of the Workshop on Basic Ontological Issues in Knowledge Sharing, International Joint Conference on Artificial Intelligence (IJCAI-95), Montreal, Canada, August 1995.

Mahesh, K. (1996). "Ontology development for machine translation: Ideology and methodology." Technical Report, Computing Research Laboratory, New Mexico State University, Las Cruces, NM.

McKeown, K., Robin, J., and Kukich, K. (1995). "Generating concise natural language summaries." Information Processing and Management (special issue on summarization).

Paice, C.D. and Jones, P.A. (1993). "The identification of important concepts in highly structured technical papers." Proc. of the 16th ACM SIGIR Conference, Pittsburgh, PA, June 27-July 1, 1993, pp. 69-78.

Paice, C.D. (1990). "Constructing literature abstracts by computer: Techniques and prospects." Information Processing & Management, 26(1):171-186.

Pinto Molina, M. (1995). "Documentary abstracting: toward a methodological model." Journal of the American Society for Information Science, 46(3):225-234.

Preston, K. and Williams, S. (1994). "Managing the information overload: New automatic summarization tools are good news for the hard-pressed executive." Physics in Business, Institute of Physics, June 1994.

Robin, J. (1994). "Revision-based generation of natural language summaries providing historical background: Corpus-based analysis, design, implementation and evaluation." Ph.D. Thesis, Computer Science Department, Columbia University, Technical Report CUCS-034-94.

Salton, G., Allan, J., Buckley, C., and Singhal, A. (1994). "Automatic analysis, theme generation, and summarization of machine-readable texts." Science, 264:1421-1426.

Salton, G., Singhal, A., Buckley, C., and Mitra, M. (1995). "Automatic text decomposition using text segments and text themes." Technical Report, Department of Computer Science, Cornell University, Ithaca, NY.

Salton, G. and Singhal, A. (1994). "Automatic text theme generation and the analysis of text structure." Technical Report TR94-1438, Department of Computer Science, Cornell University, Ithaca, NY.