Département de mathématiques et d'informatique
Université du Québec à Trois-Rivières,
C.P. 500, Trois-Rivières, Québec, Canada, G9A 5H7
{elamrani, biskri, delisle}@uqtr.uquebec.ca http://www.uqtr.uquebec.ca/{~delisle/, ~biskri/}
Abstract
We present the salient features of a new research project in which we are developing software that can help the user to customise (subsets of) the WWW to her individual and subjective information needs. The software’s architecture is based on several natural language processing tools, some of them numerical and others linguistic. The software involves a semi-automatic processing strategy in which the user’s contribution ensures that the results are useful and meaningful to her. These results are saved in various databases that capture several types of information (e.g. classes of related Web sites/pages, user profile, data warehouse, lexicons, etc.) that will constitute a local and personalised subset of the WWW, support the user’s information retrieval tasks and make them more effective and more productive.
Whether one looks at the World Wide Web (abbreviated here as WWW or Web) as the largest electronic library ever built by man or as the biggest semi-structured database of digital information available today, one has to admit that the WWW has also generated unprecedented levels of user frustration. Indeed, finding on the Web the information that is relevant to one’s needs all too often leads to one or both of these situations: either search results are unrealistically too numerous (typically with unacceptable levels of noise, i.e. with very low precision) to be processed in a reasonable period of time, or search results are suspiciously too few (typically due to unacceptable levels of silence, i.e. with very low recall) to be trusted. There are several interrelated reasons why information retrieval (IR) on the Web is so difficult; let us now consider them.
* The work described here is in progress. In particular, the agent-based architecture is being implemented and is not described here. However, further details will be available in the final version of this paper and at the conference.
First, most WWW search engines are based on information access schemes, in particular indexing, with which only superficial searches may be performed. Accessing information via an index is a simple and efficient technique but it is not necessarily the best one for IR purposes, at least from the user’s perspective. Typically, search engines do not have enough “intelligence” to go beyond the surface of the Web.
Second, the wide variety of data on the WWW and the lack of more content-based access information (i.e. metadata) associated with each document make it difficult for engines to search the Web more intelligently. The various types of document encoding schemes, formats and contents, combined with the heterogeneous and multimedia nature (i.e. text, image, sound, etc.) of documents now available on the WWW, significantly complicate the task of making Web documents more intelligible by computerised means.
Third, the incredible rate at which the Web has developed and continues to do so, plus the number of communication links, computers, documents, and users it now involves, and all this within the context of an obvious lack of standardisation and organisation, cast serious doubts on the feasibility of an eventual project of “Web normalisation”.
The WWW has pretty much evolved as a dynamic organism and will most likely keep on doing so in the near future.
Fourth, despite certain hopes of normalisation and self-imposed control within the WWW community, an important difficulty remains, a difficulty which we think is often overlooked: when a user performs a search on the
WWW, whether she knows exactly what she is looking for or not, she is actually trying to find a match between her concepts (i.e. what she knows) and those available on the
Web (i.e. what the Web “knows”). As of today, this mapping can essentially be established through the use of keywords (indexing terms). However, if the available
keywords are not at the meeting point of the user’s concepts and the concepts used in the relevant Web documents, the user will find search engines’ results desperately useless as they appear meaningless to her.
Some argue that in order to solve the IR problem on the
WWW, we should develop “brighter” search engines and standardise all aspects of the Web, including how indexing terms should be determined, how documents should be categorised, what metadata should be associated with any document, and even how any document should be encoded
(e.g. with XML). Maybe this is how it should be. However, we think this is not how it will be in the foreseeable future.
We believe the WWW will continue to evolve almost freely, subject to the democratic pressure of its large user community, and will thus resist any large-scale normalisation effort. If our hypothesis is right, if only for a few more years, the conclusion is clear: Web users will continue to experience frustration with search engines’ results for a good while!
The purpose of this paper is to present an alternative to the current situation in which many WWW users are trapped: the Web is not how they think it should be and they simply cannot productively make use of current search engines’ results. So be it! Why not allow them to weave (i.e. construct) themselves a meaningful subset of the WWW especially tailored to their needs? In order to do that, we propose a semi-automatic approach which involves the use of natural language processing (NLP) tools. In Section 3, we argue that the evaluation of Web documents with regard to a user’s information needs is highly personal and that the user must actively take part in the evaluation
(interpretation) process. The computerised tools are designed to facilitate her IR endeavour by supporting the task of identifying intersecting concepts between what she knows and what the WWW has to offer, by supporting the weaving of a personalised sub-WWW, and by learning from the user’s interaction with the system in order to push user assistance even further. Our work can be associated with recent research related to the personalisation (or customisation) of the Web and also to other recent work on the use of NLP and Artificial Intelligence techniques for the Web: we summarise such related work in the next section.
Due to space constraints, the review of related work will only appear in the final version. However, our references appear at the end of the paper.
We now describe the main elements of our Personal Web
Weaver Project, which is currently in progress. Although the project is still in its early stage, we have already designed and developed several components; the implemented components were programmed in C++ and Prolog. Continuing some of our recent work on hybrid models, in particular Biskri & Meunier (1998), Biskri & Delisle (1999) and Biskri & Delisle (2000), this project combines numerical and linguistic tools. The main justification for such a hybrid approach is that numerical tools quickly give indices about a text’s (or corpus’) subject matter, whereas linguistic tools use these indices to perform further, more expensive, processing in order to extract the most interesting information for a given user.
The quick numerical processing can be applied to a very large text (corpus) in order to select subsets that seem to deserve further detailed processing by the linguistic tools.
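For illustration only, the following C++ sketch (C++ being one of our implementation languages) shows the general shape of this hybrid strategy: a cheap numerical pass scores every segment of a corpus and only the highest-scoring segments are handed over to the expensive linguistic pass. The scoring function, the threshold and the linguisticAnalysis stub are invented for the example and are not the actual classifier or linguistic tools of the project.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Cheap numerical pass: score each segment by the frequency of the
// query's indexing terms (a stand-in for the real numerical classifier).
double numericScore(const std::string& segment, const std::vector<std::string>& terms) {
    std::map<std::string, int> freq;
    std::istringstream in(segment);
    for (std::string w; in >> w; ) ++freq[w];
    double score = 0.0;
    for (const auto& t : terms) score += freq[t];
    return score;
}

// Expensive linguistic pass (parsing, term extraction, etc.); here only a stub.
void linguisticAnalysis(const std::string& segment) {
    std::cout << "  [detailed linguistic processing of segment: "
              << segment.substr(0, 40) << "...]\n";
}

int main() {
    std::vector<std::string> corpus = {
        "hybrid models combine numerical and linguistic processing of text",
        "recipe for apple pie with cinnamon",
        "numerical classifiers give quick indices about a corpus subject matter"
    };
    std::vector<std::string> query = {"numerical", "linguistic", "corpus"};

    // Only segments whose cheap numeric score passes a threshold are handed
    // over to the costly linguistic tools.
    const double threshold = 2.0;
    for (const auto& seg : corpus) {
        if (numericScore(seg, query) >= threshold) linguisticAnalysis(seg);
    }
}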
As we know all too well, the Web grows in size and complexity at an incredible rate: every day, thousands of new Web pages are created. In just a few years, the
WWW has become an impressive source of information.
But how to get access to this information? Or, more precisely, how to reach Web sites (and Web pages) that are of interest to us, i.e. of interest to our individual needs?
The current default solution to this non-trivial problem is to perform a search on the WWW with the help of a search engine. Typically, such a search is performed after the user has specified a few keywords, most of the time fewer than five and often only two or three. This deceptively simple procedure hides several difficulties and subtleties. First, the user often has only partial knowledge of the domain she is searching or knows only partially what her exact information needs are. Second, as a consequence of the first point, the user does not know what keywords will best represent the profile of her information needs. Third, many search engines use indexes which are built automatically with a list of “objective” or “standard” keywords. But these keywords are anything but objective! Many words have many different meanings or usages, depending on the particular context or domain in which they appear. This is the problem we consider in the Personal Web Weaver Project: we present a hybrid software tool, comprised of several components, that constitutes an environment for tailoring (a subset of) the Web to the user’s information needs. Let us now have a look at the overall processing strategy, which is organised in four main phases.
Phase #1 (fully automatic): An initial query is formulated and submitted to a standard search engine, such as
AltaVista or Yahoo!. This initial query is most probably sub-optimal, because of the reasons we mentioned just above. But that is exactly our starting point: we only expect the user to have a relatively good idea of what she is looking for, not too vague but not too precise either. In fact, we expect this initial query to subsume the exact information she is hoping to find on the Web. In a
subsequent step, the user will be given the opportunity to reconsider her query in the light of the search results returned for her original query. These search results are typically Web addresses of sites containing information in textual or other forms. Of course, there is the possibility that the number of Web addresses returned by the search engine will be very large, maybe too large for practical reasons. We are planning to consider only a certain percentage of these addresses (amongst the highest ranked) when their number exceeds a certain upper limit, a limit which remains to be determined by empirical evaluation. This selection will be done, if necessary, by the “corpus construction process” module (see Figure 1).

Figure 1: From the user’s original query to a satisfying query (phases #1 and #2 of a 4-phase process). [The figure shows the original query submitted to a standard Web search engine, the returned list of Web page addresses fed to the corpus construction process, the resulting corpus of segments (i.e. of Web page contents) passed to the numerical classifier, the classes of similar (or related) Web sites and of co-occurring words, and the loop in which the user picks new words from the classes, improves the query and decides whether the results satisfy her information needs, leading to the final query and final classes.]

An important hypothesis in this project is the following: we consider that each Web site represents a text segment. Thus, a set of Web sites forms a set of text segments which, taken altogether, can be looked at as a corpus—every segment retains its identity and source. We then submit this corpus to a numerical classifier (Meunier et al., 1997; Rialle et al., 1998) that will help us identify segments sharing lexical regularities. Assuming that such similarities correspond to content similarities between the text segments, and thus between the Web sites, related segments will tend to be grouped in the same classes. In other words, the classes produced by the numerical classifier will tend to contain segments about the same topic and, by the same token, will identify the lexical units that tend to co-occur within these topics. And this is where we have a first gain with respect to the original query: the numerical classifier’s results will provide a list of candidate query terms that are related to the user’s original query and that will help her reformulate a new, more precise, query (perhaps using some of the words in the original query plus some others from step #1). Figure 1 depicts phase #1.
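To give a concrete, if simplistic, picture of what phase #1 produces, the following sketch groups text segments by a crude lexical-overlap measure and lists the words shared within each class. The segments, the overlap threshold and the grouping heuristic are invented for the example; the actual numerical classifier (Meunier et al., 1997; Rialle et al., 1998) is considerably more elaborate.

#include <iostream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

using WordSet = std::set<std::string>;

WordSet words(const std::string& segment) {
    WordSet s;
    std::istringstream in(segment);
    for (std::string w; in >> w; ) s.insert(w);
    return s;
}

// Crude lexical similarity: size of the intersection of the two word sets.
size_t overlap(const WordSet& a, const WordSet& b) {
    size_t n = 0;
    for (const auto& w : a) if (b.count(w)) ++n;
    return n;
}

int main() {
    // Each Web site is treated as one text segment of the corpus.
    std::vector<std::string> segments = {
        "hybrid text mining with numerical classifiers",
        "numerical classifiers for text mining applications",
        "gardening tips for spring flowers"
    };

    // Greedy grouping: a segment joins the first class it sufficiently overlaps with.
    std::vector<std::vector<size_t>> classes;
    for (size_t i = 0; i < segments.size(); ++i) {
        bool placed = false;
        for (auto& cls : classes) {
            if (overlap(words(segments[i]), words(segments[cls[0]])) >= 3) {
                cls.push_back(i);
                placed = true;
                break;
            }
        }
        if (!placed) classes.push_back({i});
    }

    // The words shared inside a class are candidate co-occurring terms
    // that the user may reuse to reformulate her query.
    for (const auto& cls : classes) {
        WordSet shared = words(segments[cls[0]]);
        for (size_t k = 1; k < cls.size(); ++k) {
            WordSet next = words(segments[cls[k]]), keep;
            for (const auto& w : shared) if (next.count(w)) keep.insert(w);
            shared = keep;
        }
        std::cout << "class of " << cls.size() << " segment(s), shared words:";
        for (const auto& w : shared) std::cout << ' ' << w;
        std::cout << '\n';
    }
}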
Phase #2 (semi-automatic): The classes of Web pages obtained at the end of phase #1 are now considered as extra or contextual information, relative to the user’s original query, that will allow her to reformulate her query (if need be). Indeed, words in classes of co-occurring words can act
as carriers of a more selective or complementary sense when compared with the keywords used in the original query.
For instance, such words will be more selective when the original keywords are polysemous or when they have multiple usages in different domains; these new words will be complementary when they broaden the subset of the WWW implied by the original query. Now, which of these new words should be picked by the user to reformulate her query?
An obvious starting point is to pick words that appear in classes in which the original query keywords appear. At that point the user can formulate an updated query from which she will obtain, through the processing already presented in phase #1, another set of classes containing similar (or related) Web pages and co-occurring words. Again, at the end of this step, the user may discover new words that could guide her in a more precise reformulation of her query. Eventually, after a few iterations, the user will end up with a precise query formulation that leads (via the Web addresses) to the information she was looking for or, at least, with which she is now satisfied. It should be clear that it is entirely up to the user to determine when this iterative query reformulation process will end. This is a personal decision relative to a personal information need via a personal evaluation scheme: this is what subjectivity is all about and it must be accounted for in customisable Web tools. Phase #2 is depicted in the lower part of Figure 1 above.
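As a simplified illustration of this starting point (and not of the actual selection mechanism of our system), the sketch below gathers every word belonging to a class that contains at least one of the original keywords; the final choice among these candidates remains the user’s. The example classes and the polysemous keyword “jaguar” are hypothetical.

#include <iostream>
#include <set>
#include <string>
#include <vector>

// One possible way to surface candidate query terms: collect the words of
// every class that contains at least one of the original keywords.
std::set<std::string> candidateTerms(const std::vector<std::set<std::string>>& classes,
                                     const std::set<std::string>& originalKeywords) {
    std::set<std::string> candidates;
    for (const auto& cls : classes) {
        bool hit = false;
        for (const auto& kw : originalKeywords)
            if (cls.count(kw)) { hit = true; break; }
        if (!hit) continue;
        for (const auto& w : cls)
            if (!originalKeywords.count(w)) candidates.insert(w);
    }
    return candidates;
}

int main() {
    // Hypothetical classes of co-occurring words produced in phase #1.
    std::vector<std::set<std::string>> classes = {
        {"jaguar", "engine", "convertible", "dealer"},
        {"jaguar", "habitat", "rainforest", "predator"},
        {"recipe", "pie", "cinnamon"}
    };
    std::set<std::string> query = {"jaguar"};

    // The user, not the system, decides which of these suggestions actually
    // disambiguate or broaden her original (polysemous) query.
    for (const auto& w : candidateTerms(classes, query))
        std::cout << w << '\n';
}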
Phase #3 (semi-automatic): At the end of phase #2, the user has an improved request that satisfies her information needs. It is now time to explore in further detail the final segments produced by the numerical classifier.
This in-depth exploration will serve two main purposes: i) to allow the user to construct (weave) her own personal subset of the Web, including the construction of a meaningful index; and ii) to extract information and knowledge from these Web sites, much in the sense of data mining. To assist the user during this phase, a variety of
NLP tools will be available, such as text summarisers, text analysers, or, more simply, lexicon construction tools, that can all be applied to entire segments or subsets of them that are of particular interest to the user—we already have several tools (e.g. Biskri & Delisle, 1999; Biskri &
Meunier, 1998; Biskri et al., 1997), some others are being developed, and others are available on the WWW. Further research is required to identify other potentially useful tools and eventually develop them.
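As a small, purely illustrative example of the simplest of these tools, the sketch below builds a frequency lexicon from the segments the user has kept; a real lexicon construction tool would add lemmatisation, stop-word filtering, term recognition and so on.

#include <cctype>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Build a frequency lexicon over the segments the user chose to keep.
std::map<std::string, int> buildLexicon(const std::vector<std::string>& segments) {
    std::map<std::string, int> lexicon;
    for (const auto& seg : segments) {
        std::string word;
        for (char c : seg) {
            if (std::isalpha(static_cast<unsigned char>(c))) {
                word += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
            } else if (!word.empty()) {
                ++lexicon[word];
                word.clear();
            }
        }
        if (!word.empty()) ++lexicon[word];
    }
    return lexicon;
}

int main() {
    // Hypothetical textual content of the segments retained by the user.
    std::vector<std::string> keptSegments = {
        "Hybrid text mining combines numerical and linguistic tools.",
        "Numerical tools select segments; linguistic tools analyse them."
    };
    for (const auto& [word, count] : buildLexicon(keptSegments))
        std::cout << word << '\t' << count << '\n';
}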
The extracted information will be saved in a database, following the basic principles of data warehousing. This information will be organised in a way that will facilitate further processing in the context of this project. For example, for the reformulation of her query (see phase #2), the user may find it useful to consult the database for relationships between words, terms, concepts and Web addresses. In a sense, this database will constitute the basic level of a memory-based learning mechanism that will guide the user during her interaction with the system. The database aspects of the project have numerous ramifications, one of which is the evaluation of extended
SQLs that are especially developed for the processing of textual data. This is, again, subject to further research.
Finally, the improved request will be used to update the user’s profile kept by the system. User profiles have several applications (e.g. information routing), the most important of which we mentioned in our review of related work (Section 2). Phase #3 is depicted in Figure 2 (not included here due to space constraints).
Phase #4 (semi-automatic): It is in this phase that personalisation (or, customisation) takes place. With the system being developed within the Personal Web Weaver
Project, the user will be able to shape a subset of the
WWW according to her own personal vision. Web pages that the user has explored (in the course of previous phases) and decided to keep as truly useful and informative will be indexed by her, according to her criteria. Indexing will be done locally on her computer with the help of a set of words, terms, proper nouns, etc., to which she can relate in a meaningful way. At the end of this fourth phase, the user will be able to directly consult Web pages that she knows are useful to her, as well as to submit search queries based on her own (local) personal index only. The customised Web represents her information needs well and can be seen as a projection of the WWW over the set of attributes that come from a combination of the user’s profile and the dynamic choices she made when exploring the Web.
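The sketch below gives a minimal picture of what such a local personal index could look like; the class name, descriptors and URLs are invented for the example, and the actual system would of course persist the index rather than keep it in memory.

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// A personal index: the user's own descriptors mapped to the Web pages
// she decided to keep.
class PersonalIndex {
public:
    void tag(const std::string& url, const std::vector<std::string>& descriptors) {
        for (const auto& d : descriptors) index_[d].insert(url);
    }

    // Conjunctive search restricted to the user's customised Web.
    std::set<std::string> search(const std::vector<std::string>& descriptors) const {
        std::set<std::string> result;
        bool first = true;
        for (const auto& d : descriptors) {
            auto it = index_.find(d);
            std::set<std::string> hits =
                (it == index_.end()) ? std::set<std::string>{} : it->second;
            if (first) { result = hits; first = false; }
            else {
                std::set<std::string> keep;
                for (const auto& url : result) if (hits.count(url)) keep.insert(url);
                result = keep;
            }
        }
        return result;
    }

private:
    std::map<std::string, std::set<std::string>> index_;
};

int main() {
    PersonalIndex myWeb;
    myWeb.tag("http://example.org/hybrid-nlp", {"text mining", "hybrid model"});
    myWeb.tag("http://example.org/classifiers", {"text mining", "numerical classifier"});

    for (const auto& url : myWeb.search({"text mining", "hybrid model"}))
        std::cout << url << '\n';
}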
Of course, this customised Web comes at a certain price.
As is typical of semi-automatic approaches involving some knowledge extraction mechanism, this cost is usually relatively high at the beginning but tends to decrease significantly in the long run—see Barker et al. (1998). In this case, a semi-automatic “multi-extraction” process (see Figure 3; not included here due to space constraints) working on the textual content of the Web pages kept by the user will produce a lexicon, a list of single- and multi-word terms, a list of proper names, a list of acronyms, etc.
These lists will populate a term database that will be useful during the personalised indexing process. Since it is the user who selected the Web pages that were of interest to her, the probability is high that the resulting terms (and associated concepts) will be meaningful to her and thus well adapted to her customised Web index. Obviously, not all terms in that database will be used for indexing: here again, the user’s supervision and decision will determine the final outcome. Finally, in order to keep track of changes in Web sites or pages that are on the user’s personal index (i.e. in her customised Web), an update process will regularly verify the URLs that appear in the user’s index.
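As a rough illustration of this update process, the sketch below stores a fingerprint (here, a simple string hash) of each kept page and flags any page whose re-fetched content no longer matches it; the fetch function is a stand-in for real page retrieval, which is outside the scope of this sketch.

#include <functional>
#include <iostream>
#include <map>
#include <string>

// Stand-in for real page retrieval; a production version would issue an
// HTTP request and return the page's textual content.
std::string fetch(const std::string& url) {
    if (url == "http://example.org/hybrid-nlp") return "page content, revised edition";
    return "unchanged page content";
}

int main() {
    std::hash<std::string> h;

    // Fingerprints recorded when the user added each page to her index.
    std::map<std::string, std::size_t> fingerprints = {
        {"http://example.org/hybrid-nlp",  h(std::string("page content, first edition"))},
        {"http://example.org/classifiers", h(std::string("unchanged page content"))}
    };

    // The periodic update process re-fetches each URL and flags changes.
    for (const auto& [url, storedHash] : fingerprints) {
        if (h(fetch(url)) != storedHash)
            std::cout << url << " has changed since it was indexed\n";
        else
            std::cout << url << " is up to date\n";
    }
}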
There are three important aspects that distinguish our work when compared to others with similar objectives: 1) we think that users should not expect too much from the Web community in terms of self-regulation or generalised development of tools that will significantly improve the adaptability of the Web to individual information needs; 2) we think that the very idea of developing an “objective”
Web is problematic; 3) we think that the idea that only fully automatic Web tools are of any interest will prevent the users from reaching many of their personal goals when searching the Web. Such tools must make room for the user’s subjectivity: this is a must in a process involving the human in a personal reading or text interpretation activity.
All this justifies an approach such as the one put forward here: let us give Web users the tools that will allow them to weave their own personal and subjective, but truly meaningful, WWW. Only then will information retrieval take on another dimension from the user’s viewpoint: more efficient, more effective and certainly more productive than it typically is now.
We have presented the main ideas of a new research project called the Personal Web Weaver Project. The latter is still at an early stage and it is much too soon to draw conclusions: a lot of work remains to be done at the implementation level and an extensive evaluation involving “real” users will have to be carried out once the implementation is completed. One aspect of evaluation could try to measure the gains that our software can bring to the user, in the short and longer run, and make a comparison with current tools. Several aspects of our present design might also have to be reconsidered or extended. For instance, user queries are currently formulated with keywords only. But we are also interested in testing queries based on: bookmarks, phrases, sentences or paragraphs; syntactic and semantic markers; logical representations; and even extended SQL queries in situations where our software could be integrated in a database environment. As well, several aspects of machine learning seem quite promising to investigate further in the context of the current system architecture. For instance, the potential iteration from phase #4 to phase #1, guided by the information already integrated in the user’s personal
Web, could intelligently support further extensions of the user’s Web as well as provide relevant background for new information retrieval queries. There is also the possibility of integrating an interactive user feedback mechanism that would allow the system to keep track of successful operations during the overall Web customisation process and thus better adapt to the user’s information needs. Finally, two other areas of research will have to be investigated: user interface design (to make the semi-automatic processing as easy to use as possible) and multimedia adaptation (to process data in other forms than text alone).
Acknowledgments. This research is supported by l’UQTR’s internal funding program (FIR) and by the
Natural Sciences and Engineering Research Council of
Canada (NSERC).
References

Amati G., F. Crestani & F. Ubaldini (1997), "A Learning System for Selective Dissemination of Information", Proc. of the 15th International Joint Conf. on Artificial Intelligence (IJCAI-97), Nagoya, Japan, 23-29 August 1997, 764-769.

Anikina N., V. Golender, S. Kozhukhina, L. Vainer & B. Zagatsky (1997), "REASON: NLP-based Search System for the WWW, Lingosense Ltd.", 1997 AAAI Symposium on Natural Language Processing for the World Wide Web, Menlo Park, California, 24-26 March 1997, 1-9.

Arampatzis A.T., T. Tsoris & C.H.A. Koster (1997), "IRENA: Information Retrieval Engine Based on Natural Language Analysis", Proc. of the Conf. on Computer-Assisted Information Searching on Internet (RIAO-97), Montréal (Québec), Canada, 25-27 June 1997, 159-175.

Barker K., S. Delisle & S. Szpakowicz (1998), "Test-Driving TANKA: Evaluating a Semi-automatic System of Text Analysis for Knowledge Acquisition", 12th Biennial Conference of the Canadian Society for Computational Studies of Intelligence (CAI'98), Vancouver (B.C.), Canada, June 18-20 1998, 60-71. Published in Lecture Notes in Artificial Intelligence #1418, Springer.

Biskri I. & J.-G. Meunier (1998), "Vers un modèle hybride pour le traitement de l'information lexicale dans les bases de données textuelles", Actes du Colloque International JADT-98, Nice, France.

Biskri I. & S. Delisle (1999), "Un modèle hybride pour le textual data mining - un mariage de raison entre le numérique et le linguistique", 6ème Conférence Annuelle sur le Traitement Automatique des Langues (TALN-99), Cargèse, Corse, 12-17 juillet 1999, 55-64.

Biskri I. & S. Delisle (2000), "User-Relevant Access to Textual Information Through Flexible Identification of Terms: A Semi-Automatic Method and Software Based on a Combination of N-Grams and Surface Linguistic Filters", Actes de la 6ème Conférence RIAO-2000 (Content-Based Multimedia Information Access), Paris (France), 12-14 avril 2000, to appear.

Biskri I., C. Jouis, F. Le Priol, J.-P. Desclés, J.-G. Meunier & W. Mustafa Elhadi (1997), "Outil d'aide à la fouille documentaire : approche hybride numérique linguistique", Actes du colloque international FRACTAL 1997, Linguistique et Informatique : Théories et Outils pour le Traitement Automatique des Langues, Revue Annuelle BULAG, Année 1996-1997, Numéro Hors-Série, Besançon, France, 35-43.

Chandrasekar R. & B. Srinivas (1997), "Using Syntactic Information in Document Filtering: A Comparative Study of Part-Of-Speech Tagging and Supertagging", Proc. of the Conf. on Computer-Assisted Information Searching on Internet (RIAO-97), Montréal (Québec), Canada, 25-27 June 1997, 531-545.

Ciravegna F., A. Lavelli, N. Mana, J. Matiasek, L. Gilardoni, S. Mazza, M. Ferraro, W.J. Black, F. Rinaldi & D. Mowatt (1999), "FACILE: Classifying Texts Integrating Pattern Matching and Information Extraction", Proc. of the 16th International Joint Conf. on Artificial Intelligence (IJCAI-99), Stockholm, Sweden, 31 July-6 August 1999, 890-895.

Cohen W.W. & D. Kudenko (1997), "Transferring and Retraining Learned Information Filters", Proc. of the 14th National Conf. on Artificial Intelligence (AAAI-97), Providence (Rhode Island), USA, July 27-31 1997, 583-590.

Corvaisier F., A. Mille & J.M. Pinon (1997), "Information Retrieval on The World Wide Web Using a Decision Making System", Proc. of the Conf. on Computer-Assisted Information Searching on Internet (RIAO-97), Montréal (Québec), Canada, 25-27 June 1997, 284-295.

Craven M., D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam & S. Slattery (1998), "Learning to Extract Symbolic Knowledge from the World Wide Web", Proc. of the 15th National Conf. on Artificial Intelligence (AAAI-98), Madison, Wisconsin, USA, July 26-30 1998, 509-516.

DiMarco C. & M.E. Foster (1997), "The Automated Generation of Web Documents that Are Tailored to the Individual Reader", 1997 AAAI Symposium on Natural Language Processing for the World Wide Web, Menlo Park, California, 24-26 March 1997, 44-53.

Gaizauskas R. & A.M. Robertson (1997), "Coupling Information Retrieval and Information Extraction: A New Text Technology for Gathering Information from the Web", Proc. of the Conf. on Computer-Assisted Information Searching on Internet (RIAO-97), Montréal (Québec), Canada, 25-27 June 1997, 356-372.

Gaizauskas R. & Y. Wilks (1998), "Information Extraction: Beyond Document Retrieval", Journal of Documentation, 54(1), 70-105.

Geldof S. & W. van de Velde (1997), "Context-Sensitive Hypertext Generation", 1997 AAAI Symposium on Natural Language Processing for the World Wide Web, Menlo Park, California, 24-26 March 1997, 54-61.

Joachims T., D. Freitag & T. Mitchell (1997), "WebWatcher: A Tour Guide for the World Wide Web", Proc. of the 15th International Joint Conf. on Artificial Intelligence (IJCAI-97), Nagoya, Japan, 23-29 August 1997, 770-777.

Jouis C., W. Mustafa el-Hadi & V. Rialle (1998), "COGNIWEB, Modélisation hybride linguistique et numérique pour un outil de filtrage d'informations sur les réseaux", Actes de la Rencontre Internationale sur l'Extraction, le Filtrage et le Résumé Automatique (RIFRA-98), Sfax, Tunisie, 11-14 novembre 1998, 191-203.

Katz B. (1997), "Annotating the World Wide Web Using Natural Language", Proc. of the Conf. on Computer-Assisted Information Searching on Internet (RIAO-97), Montréal (Québec), Canada, 25-27 June 1997, 136-158.

Kindo T., H. Yoshida, T. Morimoto & T. Watanabe (1997), "Adaptive Personal Information Filtering System that Organizes Personal Profiles Automatically", Proc. of the 15th International Joint Conf. on Artificial Intelligence (IJCAI-97), Nagoya, Japan, 23-29 August 1997, 716-721.

Leong H., S. Kapur & O. de Vel (1997), "Text Summarization for Knowledge Filtering Agents in Distributed Heterogeneous Environments", 1997 AAAI Symposium on Natural Language Processing for the World Wide Web, Menlo Park, California, 24-26 March 1997, 87-94.

Meunier J.-G., I. Biskri, G. Nault & M. Nyongwa (1997), "ALADIN et le traitement connexionniste de l'analyse terminologique", Proc. of the Conf. on Computer-Assisted Information Searching on Internet (RIAO-97), Montréal (Québec), Canada, 25-27 June 1997, 661-664.

Ohsawa Y. & M. Yachida (1997), "An Index Navigator for Understanding and Expressing User's Coherent Interest", Proc. of the 15th International Joint Conf. on Artificial Intelligence (IJCAI-97), Nagoya, Japan, 23-29 August 1997, 722-728.

Perkowitz M. & O. Etzioni (1998), "Adaptive Web Sites: Automatically Synthesizing Web Pages", Proc. of the 15th National Conf. on Artificial Intelligence (AAAI-98), Madison, Wisconsin, USA, July 26-30 1998, 727-732.

Pustejovsky J., B. Boguraev, M. Verhagen, P. Buitelaar & M. Johnston (1997), "Semantic Indexing and Typed Hyperlinking", 1997 AAAI Symposium on Natural Language Processing for the World Wide Web, Menlo Park, California, 24-26 March 1997, 120-128.

Rialle V., J.-G. Meunier, S. Oussedik, I. Biskri & G. Nault (1998), "Application de l'algorithmique génétique à l'analyse terminologique", Actes du Colloque International JADT-98, Nice, France.

Salton G. & M. McGill (1983), Introduction to Modern Information Retrieval, McGraw-Hill.

Salton G. (1989), Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer, Addison-Wesley.

Sugimoto M., N. Katayama & A. Takasu (1997), "COSPEX: A System for Constructing Private Digital Libraries", Proc. of the 15th International Joint Conf. on Artificial Intelligence (IJCAI-97), Nagoya, Japan, 23-29 August 1997, 738-744.

Zaiane O.R., A. Fall, S. Rochefort, V. Dahl & P. Tarau (1997), "On-Line Resource Discovery Using Natural Language", Proc. of the Conf. on Computer-Assisted Information Searching on Internet (RIAO-97), Montréal (Québec), Canada, 25-27 June 1997, 336-355.

Zarri G.P. (1997), "Natural Language Indexing of Multimedia Objects in the Context of a WWW Distance Learning Environment", 1997 AAAI Symposium on Natural Language Processing for the World Wide Web, Menlo Park, California, 24-26 March 1997, 155-158.