A Case-Basedapproachto Software Component Retrieval Carmen Fern~indez-Chamizo, Pedro A. Gonz~lez-Calero Luis Hermtndez-Yltfiez, and Alvaro Urech Baqu~ Dep. Informtttica -Fac. F/sica - Universidad Complutense 28040 Madrid, SPAIN 34-1-3944381 (Phone); 34-1-3944687 (FAX); cfernan@dia.ucm.es From: AAAI Technical Report SS-93-07. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved. Abstract A major problemconcerning the reusability of software is the retrieval of software components. Different approaches, ranging from automatic indexing methods to knowledge-basedsystems, have been followed to solve this problem. In this paper we present three prototypes which implementdifferent methods of componentindexing. From these prototypes we propose a hybrid system that uses CBRas primary approach but taking advantage of IR methods to facilitate the access to the adequate components. 1. Introduction Softwarereuse is widely’believedto be one of the most promisingtechnologiesfor improvingsoftware quality andproductivity(Biggerstaff& Richter1987). Work on reusability has followed several approaches.As Krueger(Krueger1992)says, wecan find verydifferent types andlevels of reusability. Hedivides the different approachesto softwarereuse into eight categories: high-level languages, design and code scavenging,source codecomponentes, softwareschemas, application generators, very high-level languages, transformational systemsand software architectures. Theseapproachesrely on reuse techniquesthat rangefrom abstractions at a very low level, such as assembly languagepatterns, to very high level abstractions, as softwaredesigns. However, there are manytechnical and non-technicalproblemsto be solvedbefore widespread softwarereuse becomesa reality. Oneof the problemsis the one concernedwith the classification, storage and retrieval of reusable components.Ourcurrent workis aimedat the solutionof this problem. If a reusabilitylibraryis to be successful,it mustbe structuredto help the reuser to find softwarecomponents of interest, browsethroughrelated components to locate the component that best suits the current needs,andbuild a custom-tailored software componentusable in the softwaresystembeing developed. Theuser shouldbe able to find possiblecomponents with only an approximate idea of the component’s function, without requiring to write a detailed formal specification. 2. Software Component Retrieval Oneof the fundamentalissues in the reusability of softwareis the retrieval of softwarecomponents. In order to reuse a software component,wehave to be able to retrieve it froma component base in an easy andefficient way.Therefore,the softwarecomponent retrieval focuses in two majorproblemareas: the classification of the software componentsand the retrieval method. Both problems are interrelated, because the way the components are organizedinfluencesthe retrieval method to be used. Traditionalapproaches to softwareretrieval fall into two complementary categories: high.level classification techniques, which emphasizeretrieval by software category, and low-level cross-reference tools, which facilitate variouskindsof browsing at the codelevel. We Thisworkis supported by the SpanishComitteeof Science&Technology (CICYT, TIC92-0058) 35 are primarily concernedwith high-level techniquesbut low-level tools are also needed to establish some importantrelationships betweencomponents at the code level. The high-level classification of the software components can followbasically two approaches.Wecan extract the classification informationfromthe component itself (code, documentation). Or, on the other hand, can implement a classification schemeusing information about the components that lies outsideof them.That is, we can provide for each componentsomeadditional informationfor the classificationscheme to base on; this information could be obtained, for example,from an analysis of the domain.Mostof the systemsbasedon the first approachuse automaticindexingmethods,whilethe systems based on the second approach are usually "knowledge-based systems". ThoseLAswhich have a high resolving powerin the documentwith respect to the corpus are selected as indices. In Guru,all softwarecomponents are classified, stored, compared,and retrieved according to these indices. Theycan evenbe organizedinto hierarchies for browsingusing clustering techniques. At the retrieval stag.e, the user canspecifya queryin free-style natural language, whichis indexed using the sameindexing technique.Theset of LAsextractedfromthe querydirects the repositorysearch. Variousrankingmeasurescan then be usedby the Gurusystemto select the best candidate for a particular query.Thissystemhas beenappliedto the UNIXcommand set and also to an object-oriented class library (Helm& Maarek1991). As the informationprovidedby IR tools is derived automatically,this approachpresentsadvantagesin cost, transportability and scalability. Statistical methods, however,can’t substitute for meaning. 2.1. Automaticindexing approach 2.2. Knowledge-Basedapproach Dueto the increasingsize of natural language descriptions of softwarecomponents in recent libraries, information retrieval (IR) techniquesbasedon statistical methods (Salton 1986) are becomingmore usual in component retrieval. This approachextracts informationfromthe natural-language documentation of the software components.It doesn’t use any semanticknowledgeand it doesn’t intend to understandthe documentation.The goal of this approachis to characterizeeach component by a set of indicesthat are automaticallyextractedfrom its natural languagedocumentation. Maarek(Maarek et al. 1991) identifies three requirements that shouldbe fulfilled by IR techniquesin the softwaredomain:allowmultiplewordretrieval, select onlykeyindices, andachievehigh precision. Thereis a growinginterest in the potential contributions of artificial intelligenceto softwareengineering(Arango 1988). Developmentof knowledge-basedtools for softwarereusability is oneof the mostpromisingresearch topicsin this area. Thekey feature of this approachis that it draws semanticinformationabout softwarecomponentsfroma humanexpert. Knowledge-based systemsare often very sophisticated.Asa tradeoff, they requiredomainanalysis and a great deal of pre-encoded, manuallyprovided semanticinformation. Prieto-Dfaz(Prieto-Diaz &Freeman1987) created a classification schemebasedon the library science. He proposes a set of six facets: three related to the functionality of the component and three related to its environment.Thedifferent values a facet can haveare called terms. Theseterms are structured aroundcertain supertypes that represent organizing concepts. Conceptualdistances betweenterms are assignedby the user. Toclassify a component, a value for eachfacet, a term, mustbe given, so eachcomponent is characterized by a six-tuple of terms. The search for a reusable component is accomplishedby entering a querywith six terms.Theset of termsis finite, but a thesaurus is provided to help makingthe query. Theconceptual graph that organizesdomainconcepts represents manuallyencoded knowledgeabout the domain. Thesystem proposedin (Frakes and Nejmeh1987) uses an existing IR system, CATALOG, for storing and retrieving C software components.Each componentis characterizedby a set of single- termindices that are automatically extracted from the natural- language headers of C programs. The atomic indexing unit in the Guru system (Maareket al. 1991)is the lexical affinity (LA).An betweentwounits of languagestandsfor a correlation of their common appearance.A set of LA-based indices (or profile) is built for each documentby performing statistical analysisof the distributionof wordsnotonlyin the document, but also in the corpusformedbythe set of all the documents in the repository.LA-based profiles are automatically built for each componentdescription. The system proposed in (Woodand Sommerville 1988) uses Conceptual Dependency(Schank 1972) 36 represent knowledgeabout software components.They define the "componentdescriptor frames" in order to represent the function performedby the component and the objects manipulatedby the function. After analyzing severalapplicationdomains,theyidentified from10 to 25 basic functions(primitiveactions) for software.There onebasic functionfor eachclassification of conceptually similar verbs, that is, verbsthat describe semantically similar software functions. Also required is a classification of the objects manipulatedby software components,into classes or "nominals"that represent conceptuallysimilar objects. Theinterface to the system is forms-based. hierarchyensuresthat components are properlyorganized andcategorized.Thetaxonomy can also be useful in query formulation and reformulation. Whenquerying the database,if there are no answersor if there are too many answers, the hierarchy can be used to specialize, generalize, or look for alternatives for an appropiate portionof the query;modifythis portionandqueryagain. LASSIE incorporates a natural-languageinterface which uses a list of compatibilitytuples to parse the input. Compatibility tuples, which indicate plausible associations amongobjects, are obtained from the frame-like knowledgerepresentation mechanism used by LASSIE. Embley and Woodfield (Embley & Woodfield 1987)definea knowledge structure for a softwarelibrary consistingof abstract data types (ADTs).This knowledge structure supportsdifferent relations amongADTs.The ADTs in the library includenatural languagedescriptions andkeywords to assist in findingandbrowsing activities. Relationships in the ADTlibrary can be automatic (relationships implied by ADTinformationcontainedin the library) or user-defined(explicitly providedby the user). Defined relationships for automatically determiningcloseness, generalizations,specializations, generics, and aggregationsimplicitly imposea useful structureonthe library. In addition,library administrators may also define their own generalizations, specializations,andclassificationsto furtheraid the user. The"find" operation, implementedin the prototype, allows several meansby whichan ADTcan be located: by name,by general descriptionsin natural-language,by keywordsor by operation descriptions. Closeness betweennatural-languagedescriptions is determinedby evaluating the numberand frequencyof content words foundin both the givenandcandidatedescriptions. All these knowledge-based systems have several things in common. Theyall represent the knowledgein frame-likestructureswithsimilarsets of slot-fillers. They all organize the knowledgearound the functions performedby the components,although someof them also includeframerepresentationsof the objectsinvolved. In every single case, the characterization of the components with slot-fillers is donemanually,following a pre-establishedmodelof the domain. TheLASSIE system(Devanbuet al. 1991) embodies a frame-based knowledgerepresentation. Software components are describedin terms of the operationsthat theyperform.Eachof these actionsis describedby giving its actor, object, recipient, agent,environment, andso on. Theserelationshipsare codedin a specializedknowledge representation system whichclassifies them into a conceptualhierarchy.Thefour principalobjects typesare OBJECT,ACTION,DOER,and STATE.Nodes below DOERand OBJECTrepresent the architectural component of the system. Nodes below ACTION represent the system’s functional component. The relationship between the two system componentsis captured by various slot-filler relationships between ACTIONs, OBJECTsand DOERs. This taxonomic 37 Another approach is undertaken in CODEFINDER (Henninger1991). This systemfaces a slightly different goal. It exploresthe problemof findingprogramexamples whichare relevant to a design task. CODEFINDER uses an associative spreadingactivation methodfor locating softwareobjects. Curtis(Curtis 1989)has analyzeddifferent software indexing methodsused by knowledge-based systems. He postulatesthat the effectiveuse of a reusablelibrary will require an indexing schemesimilar to the knowledge structures built into mostprogrammers workingin an applicationarea. Thechallengeof suchindexingschemes is that the organizationof a programmer’s knowledge will changewith increasedexperience. 3. Previous work Wehave developedthree prototypes: ARGOS, S ARCand PROSA.ARGOS is an IR system which helps UNIX users find out the commands they need. TheSARC system implementsa component classification mechanism based on facets. The PROSA system is a CBRsystem for ProgramSynthesis whichindexes software components in a DynamicKnowledgeBase. These prototypes are describedbelow. 3.1. ARGOS: An IR assistant for Total values of frequencyof appearanceof the wordsand LAsin the wholecorpora are computed. UNIX ARGOS (Buenagaet al. 1992) is an on line intelligent assistant for UNIX basedon IR techniques,user modeling and hypertext. The goal of this system is to obtain information about the UNIXcommand(or commands) that implements an operationdescribedby the user using natural-language.TheUNIX manualtext files constitute the document database. ARGOS uses the method proposedin (Maarek1991) to characterizeeachdocument by detectingthe lexical affinities (LAs). The user will normally use ARGOS to solve a particularproblem,but he or she will needto makeseveral queriesto find the solution. Todisplaythe information on screen, the systemtakes into accountnot only the last query,but the wholeset of previousqueriesrelative to the user’s problem.Thereis a buttonfor the user to click on whenthe documentation displayedis irrelevant to his or her problem.Thesystemwill use this informationabout irrelevance whendoingthe subsequentsearches. This is accomplishedby the ARGOS’ User ModelModule.The user maypartially reset the user modelto indicatethat he or she wantsto start a newsearchon a differentsubject. ARGOS uses hypertext techniques as navigating tool through the information. Whenthe user wants to knowmoreabout a topic, ARGOS searches the related information in the document database,anddisplaysit. The hypertext links are not fixed. They are generated dynamically,depending on the interest of the user. Theuser modeltries to guess whatare the user’s interests in order to modifythe subsequent searchesor to makecorrectionsto the outputof the IR module.Theuser modelis incremental. The information is implicitly obtainedby the interactionwiththe user. Theanalysis of the documentsin the database is performed onlyonce.Theresults of the analysisare stored in auxiliary files. In the analysis of eachdocument, the followinggoals are accomplished: ¯ Onlyopenclass words(nouns,verbs, adjectivesand adverbs) are considered. Closed class words (pronouns, prepositions, conjunctions and interjections)are ignored. ¯ Wordsare taken into their canonical form, eliminating morphemes and suffixes. ¯ For each document,the set of wordsand their frequencyof appearanceis computed.The set of lexical affinities andtheir frequencyof appearance is also computed. 38 Valuesfor the resolvingpowerof eachwordand LA are calculated for eachdocument. 3.2. SARC: A faceted system for class retrieval Wehave also developed the SARC system (Cervig6n Hernfindez 1992),a prototypethat helpsthe reusersin the retrieval of software components. The software components are the classes of an object-orientedlibrary andthe systemlets the user select oneof twolibraries: S malltalkor ObjectProfessional.Thesystemincorporates a friendly user interface with multiple windows,mouse supportandcontext-sensitivehelp. SARC integrates high-level functional knowledge and low-level knowledge.The high-level functional knowledgeis built into the classification scheme,a simplified version of the faceted classification of (Prieto-Dfaz 1991) with somemodificationsdue to the differences betweenobject classes and general software components. Thecomponents are characterizedby facets. For eachfacet, a component has an associatedset of terms (or keywords).In order to find candidatecomponents, the user selects onetermfor eachfacet andthe systemshows a list with all the components that matchthe required terms,if any. Thesystemalso uses low-level knowledge acquired by a deepanalysis of the sourcecodeof the components. Thefacetedclassificationis constructed by the interaction with expert programmerswhocharacterize the new components they incorporate to the library. This means that the characterization is moreor less subjective.Forthis reason,the systemusesan affinity metricbasedin the code shared betweenany pair of components(Chidamber Kemerer1991). These affinity values let the system associateto eachcandidatea set of "near"components to be considered by the user. Theuser establishesthe affinity thresholdthe systemuses in this search. SARC uses other metrics in order to computethe reuse complexityof the selected components,sorting them by the complexity values. Oncethe systemdisplays the selected components, the user can inspect the code and access a hypertext browserthat allowshimor her to get moreinformation about the selected andother related components. Finally, whena newcomponentis added to the library, the systemasksthe programmer for the termsthat characterizethe facets of the newcomponent. In order to help the user in the characterizationof the facet that reflects the functionality of the component, the system extracts keywords fromthe comments of the sourcecode. Theuser mayselect anyof those keywords,or any other word,as terms. integers, links, etc. andnon-primitive objectslike linear lists, trees, graphs,etc. Thesystemknows implicitlyabout the primitive objects. Non-primitive objects are introducedinto the systemgivingtheir namesanda set of attributes. Newobjectsare definedin termsof the objects (primitive or non-primitive)knownby the system. Operationsrefer to the actions that manipulatethe objects. Wecan distinguish betweenprimitiveoperations (non-definedoperations) and non-primitiveoperations (stated in terms of primitive or previously defined operations). 3.3. PROSA:A CBRsystem for Program Synthesis The study of the behavior of humanprogrammerscan offer useful guidancefor supporting software reuse, because human programmers perform many programming tasks by "reusing" previously acquired mental schemas,rather than by reasoning from first principles (Rich andWaters1988)(Steier 1991). Programming is usually taught by example.Froma set of training programs,students can abstract some programschemas, or even somegeneral programming techniques. Therefore, weconsider that CBR(Riesbeck &Schank1989) is a natural approachto the software design problem.Followingthis approachwedevelopeda prototype, the PROSA System(Mdndizet al. 1992), whichis describedbelow.. The core of the system is a DynamicKnowledge Base (DKB)(Fermindez-Valmayor and Fermindez Chamizo1992), implemented as a discrimination net of frame-like structures named MOPs(Memory OrganizationPackages)(Schank1980). This system initially developedto modelthe learningprocessesin a different application domain(Fem/mdez-Valmayor and Fern~dez Chamizo1991). Initially, weintroducedinto the DKB two kinds of information: a basic Concept Dictionary and a Case Library. The Concept Dictionary is a hierarchical structurethat containsthe basic terminology to be usedin the applicationdomain.At anytime, the user can expand the dictionary(and the CaseLibrary)accordingto his her needs.Eachconceptdefinitionis introducedas a name and a property list (set of attribute-value pairs characterizing the concept). Weaccess the conceptsby givingtheir names or bygivingtheir attribute lists. Access by nameis providedby a concepttable whichlinks each namewith the MOP correspondingto the concept in the DKB. There are two kinds of entities in the Concept Dictionary:objects andoperations. The system makesabstractions from the common attributes of various objects or operations. Whenan abstract conceptis obtainedin this manner,the system asks the user for a name.This name,if providedby the user, will identifythe abstractconcept. Eachease stored in the CaseLibraryconsists of a problemand its solution. A solution is described at different levels of abstraction (stepwise refinement paradigm).At each level, the solution is expressedin pseudocodeas the compositionof several subproblem solutions (software components).Therefore, we can represent recursively the refinement tree which correspondsto the problemsolution. Thefinal refinement levelin this tree canbe directlytranslatedto an imperative programminglanguage. Problem descriptions and solutionsare also introducedinto the systemas property lists and they are stored in the DKB as instance MOPs. Problem(and subproblem)solutions are accessible giving the problemattributes or someof the concepts mentionedin the problem(or subproblem) description. Casesare indexedin the DKB by the operationsthey implement. The subproblems obtained in the decomposition of the initial problemare also indexedby the operationsthey implement.Theseindexesallow the reuse of subproblem solutions in other problems. Whenintroducinga newtraining case, the system attemptsto find common attributes betweenthe newcase andexisting cases resident in the DKB,similar to the abstraction process we described in the Concept Dictionary. Fromthese common attributes the system generates abstraction MOPswhichrepresent generic problems. Common attributes are selected from the problem descriptions (without refinement trees). Therefore, they correspond to abstract problem descriptionsinsteadof abstractsolutions. Whena new problem is presented, the sytem searches the DKB for analogousproblems,by following a top-downstrategy. The major goal of PROSA is the Objectsrefer to entities that are manipulated by a program. Weuse primitive objects like characters, 39 adaptation of the retrieved solutions to solve the new problems. The problem solving mechanismsused in this adaptationprocessare out of the scopeof this paper, where we are concernedwith the software indexing methods. 4. A Hybrid Approach to Software Components Retrieval The goal of our current work is to develop a tool for indexing and retrieving software components from a library of object classes. In fact, this tool will be part of a wider environment that will embodyseveral tools for helping and teaching the usage of the class library. We intend to take advantage of the results of the three previously developedsystems, but taking into account the different requirements of the newsystem: ¯ Object-oriented design, unlike classical functional design, bases the modular decomposition of a software systemon the classes of objects the system manipulates, not on the functions the system performs (Meyer 1987). Therefore, in a class library, the software componentsare classes of objects which include the operational blocks (methods)that are applicable to these objects. In a class library, the documentationis organized aroundobject classes instead of aroundfunctional blocks. ARGOSand PROSAassumed that each software description correspondedto a functional block. Due to the inheritance mechanism, many class descriptions are incomplete because the inherited methodsare decribed in the ancestor classes. The Case Library of software componentswe used in the PROSA system was specifically designed for it. Therefore, the functional description of each software componentwas tailored to fit the property list required by the prototype. Now,the goal is to apply the same indexing methodsto a commercially available class library. Load Declaration constructorLoad(varS : idStream); Purpose Loada string dictionary froma stream. Ducription S is a properlyinitialized streamobject (a streamfile openedfor reading,for example).Notethat S canbe any descendant of the IdStreamtype,which includes DosIdStream,BufldStream,and MemIdStream. Loadreads the next sequenceof bytes fromthe streamS. Thesebytes musthavebeenwritten by a previouscall to StringDict.Store.Loadallocates heapspacefor the hash table, then reads a series of strings and data valuesfromthe stream and adds themto the dictionary. If Loadfails, the constructorwill returnFalse andan error codewill be stored in the globalvariableInitStatus. Thestreamregistration procedurefor a StringDictis StringDictStream. Declaration procedureStore(varS : idStream); Purpox Storea string dictionaryin a stream. Deecription S is a properlyinitialized streamobject (a streamfile openedfor writing, usually). Notethat S can be anydescendant the IdStreamtype, whichincludes DosldStream,BufldStream,and MemIdStream (although youtypically wouldn’t write to a Meml-dStream). Store writes the current size of the hashtable followedby a packedimageof all the strings anddata values for the dictionary. Thisimagemaybe read by a later call to Load. Tocheckfor errors that mayhaveocurredduringthe Stere, call the stream’sGetStatusfunctionas shownin the examplebelow. Thestreamregistration procedurefor a StringDictis StringDictStream. Figure 1 40 As starting point, wehave selected two widely available class libraries, Object Professional (Turbo PowerSoftware1990)and Smalltalk(Digitalk 1986), experimentwith. In this way,wecan benefit fromthe experienceacquiredusing the faceted SARC system,that workswith the sameclass libraries. Weare particularly interestedin the set of termsselectedto characterizeeach class, and in the low-levelknowledge obtainedfromthe deepanalysis of the sourcecode(inheritance andclient relationshipsbetween classes). ("string dictionary")is describedin twelvemanualpages, includingthe descriptionof its thirteen methods.Some fragmentsof two of its methoddescriptionsare shownin Fig. 1. Wehaveanalyzedthe purposedescription of every methodin this library, andalso somemethodsof other different libraries, and wecan concludethat the syntax and the vocabularyused in these descriptions is very simple.Byperforminga domainanalysis, wecan identify a set of basic functions and objects that allowsus to expressthe functionalities of the method.Atthe moment, weare characterizing each methodby a MOP structure that will be stored in the DynamicKnowledgeBase (DKB)that we used in the PROSA system. Following with the previous example, we will have two MOPs characterizingthe twomethodsof Fig.1. Theclass libraries wehaveselected are relatively small. Theycontain aroundone hundredclasses andone thousandmethods.Theassociate documentation contains approximately twothousandpagesof text, mostof which are natural languagedescriptions with someexamplesof use at the sourcecode-level. Theuser will describehis or her needsby givinga free-text generaldescriptionof the class or of the methods he or she requires. Theretrievingprocessconstitutes an iterative process of queryingthe systemto retrieve adequatecomponents.This process can be implemented as a formulate-retrieve-reformulate cycle (Devanbu et al. 1991).If the answerto an initial queryis unsatisfactory, the user can reformulate the query by using partial descriptionsof the retrieved classes, or browseamongst functionally related classes. Anotherapproachwouldbe basedona interactiveprocesswhere,after the first query, the systeminterrogates the user to disambiguatethe meaningof the query. Weconsider that the browsing approachis the mostadequateapproachfor our initial prototype. Thekey points of the systemwill be the indexing andrelrieving methods. I-M -LOAD.#12 isA M-LOAD selector : implementor : Load l-M-StringDict code: I.M-LOAD-IMPLEMENTATION.#15 origin : destination : I-M-Stream I-M-StringDict I-M-STORE.#5 iaA M-STORE selector : implementor : code : origin : Store I-M-StringDict I-M~TORE-IMPLEMENTATION.#8 l.M~StringDict d~tination : I.M-Stream EverymethodMOP includes three slots specifying the ownerclass of that method,implementor,the actual implementation of the method,code, andthe identifier of the method,selector. The MOP also includes the slots correspondingto the specific action being described (origin anddestinationin this case). Indexing methods Wewill assignan indexnot onlyto the classes but also to every methodof each class. Eachmethodwill be linked to the ownerclass andeachclass will be linked to every oneof its methods.Thiswill allowthe accessbyclass and the access by method. These two MOPS will be linked in the DKBunder the corresponding action type, LOAD or STORE. Both actions are placed under the basic action ASSIGN, becausetheyare instancesof it. Weintend to use a hybrid approach (knowledge basedandautomaticfree-text indexing)to characterize the software components.Wedescribe this approachin the context of the ObjectProfessionallibrary. Aclass descriptionmayrangefrom3 to 55 text pages.Amethod, however, is usuallydescribedin less thanonepageandits purposeis describedwith only oneline. This figures are not mantained in other libraries, but the proportionis more or less conserved.For example,the object StringDict This processwill producea classification of the methods in the library. Amethod will usuallybe classified under only one action, but whena methodimplements severalactionsit will be placedunderall of them. MOPscorresponding to functionally related methodswill be close in the DKB. For example,there are manyother "store methods"in the library. Oneof them 41 corresponds to the class SingleList and its associate MOP is: I-M-STORE.#7isA M-STORE selector : Store implementor: I-M-SingleList code : I-M-STORE-IMPLEMENTATION.#12 origin : I-M-SingleList affinity basedindices of every class in the library. These indices will be usedin the first step of the analysis of the user’s query. Furthermore,we will create for each class a MOPcontaining information about its structural description, its relationships with other classes (inheritance, clien0 and the links to all its methods.The structural description definesa set of values for the class, i.e. a list of other classes neededto construct the current class. For example, destination : I-M.Stream This MOPwill be placed under the same M-STORE MOPthat the MOPrepresenting the previous "store method". These methods, however,were not close in the class hierarchy becausethey belongto different classes. StringDict {Dictionary of strings} object(Root) sdSize : Word; {Maximum elements in hash pool} sdUsed : Word; {Current elements in hash pool} sdPool : HashPoolPtr; sdStatus : Word; {Pointer to hash pool} {Status variable} The slructural description and the relationships with other classes will be obtained from the source code using the low-level tool of the SARC system. MOPs correspondingto classes are stored in the DKB under the MOPOBJECTat the root level of the memory. A fragment of the resulting memoryorganization is shown in Fig. 2. Anabstraction MOPwill be generated from the two MOPsbecause they have the same filler for the destinationslot: M-STORE-TO-STREAM isA M.STORE destination : I.M-Stream Aclass description, as previouslystated, can extend to several text pages. Therefore, it would be a very complextask to express all this information as MOPs.We think this is the right place to use statistical indexing methods. So, we will use ARGOS to obtain the lexical Retrieving method The analysis of the user’s query will be madein two steps: Figure 2 42 Automaticindexing methodshave proved to be succesfulwhenapplied to softwarecomponents. Wedon’t haveanyresults that predict better results in component retrieval by addingknowledgebasedmethods. 1. Thequery is indexedby the samestatistical methodusedto indexthe classes. Theprofile of lexical affinities detectedin the queryallowsthe accessto the classes with similar profiles. MOPs correspondingto these classes andall their methods will be activated. 2. The query is analyzed again by an expectation-based parser (Dyer 1983) (Martin 1989) which uses the MOPsactivated in the step 1 as expectations.Whenthe parser recognizesthe methodsor the structural descriptionof a class, this class will be displayedas an answerto the query.Acompletematching between the queryandthe selectedclass is not necessary, becausethe class requiredby the user maynot exist in the library. In this case,the user wantsto obtaina functionally closeclass to adaptit. If the parserhas recognized several classes, the class with morerecognizedmethodswill be displayed. If several classes have the samenumberof recognizedmethods,the rate betweenthe functionality requiredby the user andthe global functionality of the class will be used to select the candidate class. For example, if the querywas"a dictionaryto be usedto detect whethera particularwordis in the dictionaryor not", two classes wouldbe selected, SlringDictandStringSet.Both implementthe two methodsrequired, but the first one implementsanother eleven methodsand the secondone doesn’thaveanyothers. Sothe S tringSetwill be displayed becauseit fits better the user requirements. Thisparser has to be completed witha browser.This browserwill allow to explorethe classes functionally related to the class selected by the system.Thebrowser will also allowto exploresimilar methodsbelongingto differentclasses. Thecurrentstate of the prototypeonly supportsthe indexingmethods.Weare undertakingthe designof the parser and specifyingthe browserfeatures. Weare also recodifying someparts of the prototype becausethe previous systems were implemented in different languagesand machines. Anywaywe believe that CBRapproaches can be usefulwhenteaching(or aiding)the user to reusesoftware components.If someof the componentsin the library comefrompast designs, the experienceacquiredin the designingof sucha systemcan be of use to novices.Even experienceddesigners, can save time and effort using reusable design templatesand exploringrationale from previousdesigns,insteadof findingtheir waythroughtrial anderror prototyping. Asweundertook not onlythe retrievingtask but also the educationaltask, weneedto represent the user’s mentalmodelof the domain,and also be able to reason about the functionality of the components. Therefore,in order to develop an environment which helps programmerslearn about the library’s structure and contents, wehavechosenCBRas the primaryapproach. Weexpect that this approach will produce, as a side-effect, a better performancein the retrieving of softwarecomponents. Evenin this case, the information provided by IR tools wouldbe useful since it can be derivedautomatically. 6. References Arango,G., Baxter, I. & Freeman,P., 1988.AFramework for IncrementalProgressin the Applicationof Artificial Intelligence to Software Engineering, ACM Software Eng.Notes,vol. 13, no. 1,46-50. Biggerstaff, T.J. & Richter, C., 1987. Reusability Framework, Assessment,and Directions, IEEESoftware, vol.4, 2. Theprototype is being implemented in BIMProlog underSunOS4.1 in a SUNSparc workstation. 5. Conclusions Wehavepresenteda hybridapproachto the indexingand retrieving problem for software components.This approachconsists of using CBRtechniquesto represent knowledge about classes and methods,but simultaneously usingstatistical indexingmethods to facilitate the access to classes. 43 Buenaga,M., Fernandez,B., Vaquero,A. & Urech, A., 1992.An OnLine Intelligent Assistant for Unix, Tech. Report DIA92/16. Departmentof ComputerScience, Complutense University of Madrid. Cervig6n,C., Hernfindez,L., 1992. Manualde uso del Sistemade Ayudaparala Recuperacitnde Clases. Tech. Report DIA92/11. Departmentof ComputerScience, Complutense University of Madrid. Chidamber,S.R. & Kemerer, C.F., 1991. "Towardsa Metrics Suite for Object-OrientedDesign". OOPSLA-91. Curtis, B., 1989."Cognitiveissues in reusing software artifacts", SoftwareReusability,volumeH, Applications and Experience, (Biggerstaff, T.J. &Perlis, A.J., eds.), ACMPress, Addison-Wesley Publishing Company. Martin, C., 1989. "Case-based parsing". Inside Cased-Based Reasoning. Riesbeck, C.K. & Schank, R.C.(eds.) 1989. Hillsdale, N.J. Lawrence Erlbaum Associates. Devanbu,P., Ballard, B.W., Brachman,RJ. &Selfridge, P.G., 1991. "LASSIE: A Knowledge-Based Software Information System", Automating Software Design (Lowry, M.R. & McCarlney, R.D., eds.), AAAIPress/ The MITPress. Digitalk, 1986. SmalltalklV. Tutorial and Programming Handbook. M6ndiz, I., Fern~Indez-Chamizo, C. & Fermindez-Valmayor, A., 1992. "Program Synthesis Using Case Based Reasoning", 12 IFIP Congress, Poster Session: Software Development and Maintenance, Madrid, Sept. 1992. Meyer, B., 1987. Reusability: The Case for Object-OrientedDesign. IEEESoftware, vol.4, 2. Dyer, M., 1983. In Depth Understanding. MITPress. Embley, D.W. & Woodfield, S.N., 1987. "A Knowledge Structure for ReusingAbstract Data Types", Proceedings of the Ninth International Conference on Software Engineering, ACM,360-368. Fern~dez-Valmayor,A.&Fermtndez Chamizo, C., 1991. "An Educational Application of MemoryOrganization Models", AdvancedResearch on Computersin Education (R. Lewis&S. Otsuki, eds.). Elsevier Science Publishers B.V., IFIP. Prieto-D[az, R., Freeman,P., 1987. Classifying Software for Reusability. IEEESoftware, January. Prieto-Dfaz, R., 1991. Implementing Faceted Classification for Software Reuse. Communication s of the ~ ACM,vol. 34, n 5. Rich, C. &Waters, R.C., 1988. Automatic Programming : Myths and Prospects, Computer, August 1988, 40-51. Fermtndez-Valmayor,A.& FermtndezChamizo, C., 1992. Educational and Research Utilization of a Dynamic KnowledgeBase, Computers &Education, 18, No. 1-3, 51-61. Frakes, W. B. & Nejmeh, B. A., 1987. "Software Reuse through Information Retrieval". Proceedingsof the 20th Annual HICSS, Kona, HI. Riesbeck, C.K. & Schank, R.C., Cased-Based Reasoning. Hillsdale, ErlbaumAssociates. 1989. Inside N.J. Lawrence Salton, G., 1986. Another look at Automatic TextRetrieval Systems, Communicationsof the ACM,vol.29, . Helm,R., Maarek, Y.S., 1991. "Integrating Information Retrieval and DomainSpecific Approachesfor Browsing and Retrieval in Object-Oriented Class Libraries". OOPSLA-91. Schank, R.C., 1972. Conceptual Dependency: A Theory of Natural Language Understanding, Cognitive Psychology, 552-631. Schank, R.C., 1980. Language and Memory,Cognitive Science, 4, 243-284. Steier, D., 1991. "AutomatingAlgorithmDesign within a General Architecture for Intelligence", Automating Software Design (Lowry, M.R. &McCartney,R.D., eds.), AAAIPress/The MIT Press. Henninger,S., 1991. "Retrieving Software Objects in an Example-Based Programming Environment". Proceedings of the Fourteenth Annual International ACM/SIGIRConference on Research and Development in InformationRetrieval. Krueger, C.W., 1992. Software Reuse. ACMComputing Surveys,vol. 24, 2. Maarek, Y.S. et al., 1991. An Information Retrieval Approach for Automatically Constructing Software Libraries. IEEETransactions on Software Engineering, vol. 17, 8. 44 Turbo PowerSoftware, 1990. Object Professional User’s Manual. Wood, M. & SommerviUe, I., 1988. An Information Retrieval System for Software Components, ACMSIGIR, vol. 22, 3,4.