COMP341001

This question paper consists of 4 printed pages, each of which is identified by the Code Number COMP341001.

UNIVERSITY OF LEEDS
School of Computing

January 2009

DB32: Technologies for Knowledge Management

Time allowed: 2 hours

Answer ALL THREE questions.

Question 1

(a) Serge Sharoff is a lecturer at Leeds University who has published many research papers relating to technologies for knowledge management, for example:

Sharoff, S., Sokolova, L. 1995. Representation of technical manuals by means of rhetorical relations (in Russian). Proceedings of DIALOGUE'95, Kazan, Russia, pp. 283-293.

Kononenko, I., Sharoff, S. 1996. Understanding short texts with integration of knowledge representation methods. In D. Bjorner, M. Broy, I.V. Pottosin (Eds.) Perspectives of System Informatics. Springer Lecture Notes in Computer Science 1181, pp. 111-121. Berlin: Springer Verlag.

Sharoff, S. 1998. On difference between natural language ontology and ontology of the problem domain (in Russian). Proceedings of the 6th Russian National Artificial Intelligence Conference, Pushchino, Russia, pp. 41-49.

Sharoff, S., Zhigalov, V. 1999. Register-domain separation as a methodology for development of natural-language interfaces to databases. Proceedings of INTERACT'99 Seventh IFIP Conference on Human-Computer Interaction, Edinburgh, pp. 79-85.

Sharoff, S. 2004. Towards basic categories for describing properties of texts in a corpus. Proceedings of LREC2004 Language Resources and Evaluation Conference, Lisbon, pp. 1743-1746.

Sharoff, S. 2006. Open-source corpora: using the net to fish for linguistic data. International Journal of Corpus Linguistics, 11(4), pp. 435-462.

Baroni, M., Chantree, F., Kilgarriff, A., Sharoff, S. 2008. Cleaneval: a competition for cleaning web pages. Proceedings of LREC2008 Language Resources and Evaluation Conference, Marrakech.

(i) Imagine you are asked to assess the impact of Dr Sharoff's research by finding a list of papers by other researchers which cite these publications. Suggest three Information Retrieval tools you could use for this task. State an advantage and a disadvantage of each of these three IR tools for this search task, in comparison to the other tools.
[9 marks]

(ii) Suggest three reasons why citations for some of these papers might not be found by any of your suggested IR tools.
[3 marks]

(b) What is the difference between Information Retrieval and Information Extraction? A Knowledge Management consultancy aims to build a database of all Data Mining tools available for download via the WWW, including name, cost, implementation language, input/output format(s), and Machine Learning algorithm(s) included; should they use IR or IE for this task, and why?
[4 marks]

(c) A specialised set of documents is indexed by the terms:

T = {pudding, jam, traffic, lane, treacle}

Four documents d1 to d4 have the following index term vectors:

d1 = (0.8, 0.8, 0.0, 0.0, 0.4)
d2 = (0.0, 0.0, 0.9, 0.7, 0.0)
d3 = (0.8, 0.0, 0.0, 0.0, 0.8)
d4 = (0.6, 0.8, 0.4, 0.6, 0.0)

The following query q found that documents d1 and d4 were the two best matches:

q = (1.0, 0.6, 0.0, 0.0, 0.0)

A user decided that, although document d1 was relevant, document d4 was not. Use this relevance feedback to generate a new query q', using the feedback function:

q' = α q + (β / |HR|) Σ_{di ∈ HR} di - (γ / |HNR|) Σ_{di ∈ HNR} di

where α is the weight of the initial query, β is the weight of positive feedback, γ is the weight of negative feedback, and HR and HNR are the sets of documents the user judged relevant and not relevant, respectively. You should assume α = β = γ = 0.5.
[4 marks]
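The feedback function above can be checked mechanically. The short Python sketch below applies it to the query and document vectors given in part (c), with d1 as the only relevant document and d4 as the only non-relevant one; the function name rocchio_update and the plain-list vector representation are choices made for this sketch only, not part of the question.

```python
# A minimal sketch of the relevance-feedback update given in part (c),
# using plain Python lists as term-weight vectors.

def rocchio_update(q, relevant, non_relevant, alpha=0.5, beta=0.5, gamma=0.5):
    """Return q' = alpha*q + (beta/|HR|)*sum(HR) - (gamma/|HNR|)*sum(HNR)."""
    def centroid(docs):
        # Mean vector of a set of documents; zero vector if the set is empty.
        if not docs:
            return [0.0] * len(q)
        return [sum(d[i] for d in docs) / len(docs) for i in range(len(q))]
    pos = centroid(relevant)
    neg = centroid(non_relevant)
    return [round(alpha * q[i] + beta * pos[i] - gamma * neg[i], 2)
            for i in range(len(q))]

d1 = [0.8, 0.8, 0.0, 0.0, 0.4]
d4 = [0.6, 0.8, 0.4, 0.6, 0.0]
q  = [1.0, 0.6, 0.0, 0.0, 0.0]

# d1 was judged relevant (HR = {d1}), d4 not relevant (HNR = {d4}).
print(rocchio_update(q, [d1], [d4]))   # -> [0.6, 0.3, -0.2, -0.3, 0.2]
```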
Question 2

“In 2008, Leeds University adopted the Blackboard Virtual Learning Environment (VLE) to be used in undergraduate taught modules in all schools and departments. In future, lectures and tutorials may become redundant at Leeds University: if we assume that student learning fits Coleman’s model of Knowledge Management processes, then the Virtual Learning Environment provides technologies to deal with all stages in this model. All relevant explicit, implicit, tacit and cultural knowledge can be captured and stored in our Virtual Learning Environment, for students to access using Information Retrieval technologies.”

Is this claim plausible? In your answer, explain what is meant by Coleman’s model of Knowledge Management processes, citing examples relating to learning and teaching at Leeds University. Define and give relevant examples of the four types of knowledge, and state whether they could be captured and stored in our VLE, and searched for via an Information Retrieval system.
[20 marks]

Question 3

The ukus.arff dataset below relates to a set of example English-language text documents. For each document, the dataset shows: the frequency of the term center; the frequency of the term centre; the relative frequency of these two terms expressed as a percentage (centerpercent); the frequency of the term color; the frequency of the term colour; the relative frequency of these two terms expressed as a percentage (colorpercent); and lastly whether the document came from the UK or US, indicating which of these two varieties of English it represents.

@relation ukus
@attribute center numeric
@attribute centre numeric
@attribute centerpercent numeric
@attribute color numeric
@attribute colour numeric
@attribute colorpercent numeric
@attribute english {UK,US}
@data
1,32,3, 0,20,0, UK
0,25,0, 0,12,0, UK
9,27,33, 0,84,0, UK
0,19,0, 0,24,0, UK
0,16,0, 0,14,0, UK
0,16,0, 0,12,0, UK
0,21,0, 0,38,0, UK
0,25,0, 0,34,0, UK
2,26,7, 2,3,40, UK
2,32,5, 1,59,2, UK
31,0,100, 55,0,100, US
61,0,100, 26,0,100, US
24,0,100, 11,0,100, US
12,1,92, 21,4,84, US
8,0,100, 4,2,67, US
10,0,100, 8,0,100, US
19,0,100, 22,0,100, US
14,0,100, 7,0,100, US
14,0,100, 6,0,100, US
8,5,62, 24,0,100, US

(a) Explain the difference between association rules and classification rules. Illustrate your answer with an association rule and a classification rule involving the above dataset; state the accuracy of your rules with respect to the training set.
[6 marks]

(b) Draw a decision tree derived from this training data, predicting whether a document is written in UK English or US English (with minimal classification errors).
[2 marks]

(c) In general, an algorithm for machine-learning a decision tree from a training data-set must decide which attribute in the dataset would make the best top-level decision-point. Explain the criterion used in choosing the most appropriate attribute, illustrating with your answer to (b).
[2 marks]

(d) The test.arff dataset below relates to another set of example English-language text documents.

@relation test
@attribute center numeric
@attribute centre numeric
@attribute centerpercent numeric
@attribute color numeric
@attribute colour numeric
@attribute colorpercent numeric
@attribute english {UK,US}
@data
10,5,66, 0,20,0, UK
0,10,0, 4,4,50, UK
4,4,50, 10,10,50, UK

Draw the confusion matrix you would get if you used test.arff to evaluate your answer to (b).
[2 marks]
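For illustration, the Python sketch below shows how the training-set accuracy of a candidate classification rule (part (a)) and a confusion matrix over test.arff (part (d)) can be computed. The rule used here, predicting US whenever colorpercent >= 50 and UK otherwise, is simply one plausible example, and keeping only the centerpercent and colorpercent columns is a simplification made for this sketch.

```python
# A minimal sketch: evaluate an example classification rule
# ("IF colorpercent >= 50 THEN US ELSE UK") on the two datasets above.
# Only the centerpercent, colorpercent and english columns are kept.

# (centerpercent, colorpercent, english) triples copied from ukus.arff
train = [
    (3, 0, "UK"), (0, 0, "UK"), (33, 0, "UK"), (0, 0, "UK"), (0, 0, "UK"),
    (0, 0, "UK"), (0, 0, "UK"), (0, 0, "UK"), (7, 40, "UK"), (5, 2, "UK"),
    (100, 100, "US"), (100, 100, "US"), (100, 100, "US"), (92, 84, "US"),
    (100, 67, "US"), (100, 100, "US"), (100, 100, "US"), (100, 100, "US"),
    (100, 100, "US"), (62, 100, "US"),
]

# (centerpercent, colorpercent, english) triples copied from test.arff
test = [(66, 0, "UK"), (0, 50, "UK"), (50, 50, "UK")]

def predict(centerpercent, colorpercent):
    """Example rule only: predict US when colorpercent >= 50, else UK."""
    return "US" if colorpercent >= 50 else "UK"

# Accuracy on the training set (relevant to part (a)).
correct = sum(predict(cp, colp) == label for cp, colp, label in train)
print(f"training accuracy: {correct}/{len(train)}")

# Confusion matrix on test.arff (relevant to part (d)),
# keyed by (actual class, predicted class).
matrix = {("UK", "UK"): 0, ("UK", "US"): 0, ("US", "UK"): 0, ("US", "US"): 0}
for cp, colp, label in test:
    matrix[(label, predict(cp, colp))] += 1
print("confusion matrix (actual, predicted):", matrix)
```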
(e) Explain the difference between supervised and unsupervised learning, giving examples involving these datasets.
[4 marks]

(f) The researcher who collected these datasets chose to record the attributes center, centre, centerpercent, color, colour, colorpercent as possible indicators or predictors of the variety of English used in each document: UK English or US English. Suggest three justifications for this choice of features. Suggest an additional feature which might improve results with unseen data-sets; justify your added feature.
[4 marks]

END