COMP341001
This question paper consists of 4 printed pages, each of which is identified by the Code Number COMP341001
 UNIVERSITY OF LEEDS
School of Computing
January 2009
DB32: Technologies for Knowledge Management
Time allowed: 2 hours
Answer ALL THREE questions.
Question 1
(a) Serge Sharoff is a lecturer at Leeds University who has published many research papers relating to technologies
for knowledge management, for example:
Sharoff, S. and Sokolova, L. 1995. Representation of technical manuals by means of rhetorical relations (in Russian). Proceedings of DIALOGUE'95, Kazan, Russia, pp. 283-293.
Kononenko, I. and Sharoff, S. 1996. Understanding short texts with integration of knowledge representation methods. In D. Bjorner, M. Broy and I.V. Pottosin (eds.), Perspectives of System Informatics, Springer Lecture Notes in Computer Science 1181, pp. 111-121. Berlin: Springer Verlag.
Sharoff, S. 1998. On difference between natural language ontology and ontology of the problem domain (in Russian). Proceedings of the 6th Russian National Artificial Intelligence Conference, Pushchino, Russia, pp. 41-49.
Sharoff, S. and Zhigalov, V. 1999. Register-domain separation as a methodology for development of natural-language interfaces to databases. Proceedings of INTERACT'99, Seventh IFIP Conference on Human-Computer Interaction, Edinburgh, pp. 79-85.
Sharoff, S. 2004. Towards basic categories for describing properties of texts in a corpus. Proceedings of LREC2004 Language Resources and Evaluation Conference, Lisbon, pp. 1743-1746.
Sharoff, S. 2006. Open-source corpora: using the net to fish for linguistic data. International Journal of Corpus Linguistics, 11(4), pp. 435-462.
Baroni, M., Chantree, F., Kilgarriff, A. and Sharoff, S. 2008. Cleaneval: a competition for cleaning web pages. Proceedings of LREC2008 Language Resources and Evaluation Conference, Marrakech.
(i) Imagine you are asked to assess the impact of Dr Sharoff’s research, by finding a list of papers by other
researchers which cite these publications. Suggest three Information Retrieval tools you could use for this task. State
an advantage and a disadvantage of each of these three IR tools for this search task, in comparison to the other tools.
[9 marks]
(ii) Suggest three reasons why citations for some of these papers might not be found by any of your suggested IR
tools.
[3 marks]
(b) What is the difference between Information Retrieval and Information Extraction? A Knowledge Management
consultancy aims to build a database of all Data Mining tools available for download via the WWW, including name,
cost, implementation language, input/output format(s), and Machine Learning algorithm(s) included; should they use
IR or IE for this task, and why?
[4 marks]
TURN OVER
(c) A specialised set of documents is indexed by the terms: T = {pudding, jam, traffic, lane, treacle}
Four documents d1 .. d4 have the following index term vectors:
d1 = (0.8, 0.8, 0.0, 0.0, 0.4)
d2 = (0.0, 0.0, 0.9, 0.7, 0.0)
d3 = (0.8, 0.0, 0.0, 0.0, 0.8)
d4 = (0.6, 0.8, 0.4, 0.6, 0.0)
The following query found that documents d1 and d4 were the two best matches:
(1.0, 0.6, 0.0, 0.0, 0.0)
A user decided that, although document d1 was relevant, document d4 was not.
Use this relevance feedback to generate a new query q', using the feedback function:
q' = α·q + (β / |HR|) · Σ_{di ∈ HR} di − (γ / |HNR|) · Σ_{di ∈ HNR} di
where α is the weight of the initial query, β is the weight of positive feedback, γ is the weight of negative feedback, HR is the set of documents judged relevant, and HNR is the set judged not relevant.
You should assume α = β = γ = 0.5.
[4 marks]
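(By way of illustration only: the update can be checked with a short Python script. This is a minimal sketch assuming the standard Rocchio-style reading of the formula above, with HR = {d1} and HNR = {d4}.)

# Sketch: applying the feedback function with alpha = beta = gamma = 0.5,
# HR = {d1} (relevant) and HNR = {d4} (not relevant), so |HR| = |HNR| = 1.
q  = [1.0, 0.6, 0.0, 0.0, 0.0]
d1 = [0.8, 0.8, 0.0, 0.0, 0.4]
d4 = [0.6, 0.8, 0.4, 0.6, 0.0]
alpha = beta = gamma = 0.5

q_new = [alpha * qi + beta * r / 1 - gamma * n / 1
         for qi, r, n in zip(q, d1, d4)]
print(q_new)  # negative term weights are often clipped to 0.0 in practice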
Question 2
“In 2008, Leeds University adopted the Blackboard Virtual Learning Environment (VLE) to be used in
undergraduate taught modules in all schools and departments. In future, lectures and tutorials may become
redundant at Leeds University: if we assume that student learning fits Coleman’s model of Knowledge
Management processes, then the Virtual Learning Environment provides technologies to deal with all stages
in this model. All relevant explicit, implicit, tacit and cultural knowledge can be captured and stored in our
Virtual Learning Environment, for students to access using Information Retrieval technologies.”
Is this claim plausible? In your answer, explain what is meant by Coleman’s model of Knowledge Management
processes, citing examples relating to learning and teaching at Leeds University. Define and give relevant examples
of the four types of knowledge; and state whether they could be captured and stored in our VLE, and searched for via
an Information Retrieval system.
[20 marks]
TURN OVER
Question 3
The ukus.arff dataset below relates to a set of example English-language text documents. For each document, the dataset shows the frequency of the term center; the frequency of the term centre; the relative frequency of these two terms, expressed as a percentage (centerpercent); the frequency of the term color; the frequency of the term colour; the relative frequency of these two terms, expressed as a percentage (colorpercent); and lastly whether the document came from the UK or the US, indicating which of these two varieties of English it represents.
@relation ukus
@attribute center numeric
@attribute centre numeric
@attribute centerpercent numeric
@attribute color numeric
@attribute colour numeric
@attribute colorpercent numeric
@attribute english {UK,US}
@data
1,32,3,0,20,0,UK
0,25,0,0,12,0,UK
9,27,33,0,84,0,UK
0,19,0,0,24,0,UK
0,16,0,0,14,0,UK
0,16,0,0,12,0,UK
0,21,0,0,38,0,UK
0,25,0,0,34,0,UK
2,26,7,2,3,40,UK
2,32,5,1,59,2,UK
31,0,100,55,0,100,US
61,0,100,26,0,100,US
24,0,100,11,0,100,US
12,1,92,21,4,84,US
8,0,100,4,2,67,US
10,0,100,8,0,100,US
19,0,100,22,0,100,US
14,0,100,7,0,100,US
14,0,100,6,0,100,US
8,5,62,24,0,100,US
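(For reference only: a minimal sketch of exploring this training data with a decision-tree learner, assuming scikit-learn is available; the question itself does not prescribe any particular toolkit. The feature rows below are transcribed from the @data section above.)

# Sketch: fitting a decision tree to the ukus training data (scikit-learn assumed).
from sklearn.tree import DecisionTreeClassifier, export_text

features = ["center", "centre", "centerpercent", "color", "colour", "colorpercent"]
X = [
    [1, 32, 3, 0, 20, 0], [0, 25, 0, 0, 12, 0], [9, 27, 33, 0, 84, 0],
    [0, 19, 0, 0, 24, 0], [0, 16, 0, 0, 14, 0], [0, 16, 0, 0, 12, 0],
    [0, 21, 0, 0, 38, 0], [0, 25, 0, 0, 34, 0], [2, 26, 7, 2, 3, 40],
    [2, 32, 5, 1, 59, 2],                                  # UK documents
    [31, 0, 100, 55, 0, 100], [61, 0, 100, 26, 0, 100],
    [24, 0, 100, 11, 0, 100], [12, 1, 92, 21, 4, 84],
    [8, 0, 100, 4, 2, 67], [10, 0, 100, 8, 0, 100],
    [19, 0, 100, 22, 0, 100], [14, 0, 100, 7, 0, 100],
    [14, 0, 100, 6, 0, 100], [8, 5, 62, 24, 0, 100],       # US documents
]
y = ["UK"] * 10 + ["US"] * 10

tree = DecisionTreeClassifier(criterion="entropy")  # entropy-based splitting
tree.fit(X, y)
print(export_text(tree, feature_names=features))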
(a) Explain the difference between association rules and classification rules. Illustrate your answer with an
association rule and a classification rule involving the above dataset; state the accuracy of your rules with respect to
the training set.
[6 marks]
(b) Draw a decision tree derived from this training data, predicting whether a document is written in UK English or
US English (with minimal classification errors).
[2 marks]
(c) In general, an algorithm for machine-learning a decision tree from a training data-set must decide which attribute
in the dataset would make the best top-level decision-point. Explain the criterion used in choosing the most
appropriate attribute, illustrating with your answer to (b).
[2 marks]
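(Illustration only: a minimal sketch of the entropy-based information-gain calculation commonly used when choosing the top-level attribute; the candidate split on centerpercent at a threshold of 50 is an example, not a prescribed answer.)

# Sketch: information gain of a candidate top-level split (entropy criterion).
from math import log2

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(values, labels, threshold):
    left  = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    remainder = (len(left) / len(labels)) * entropy(left) \
              + (len(right) / len(labels)) * entropy(right)
    return entropy(labels) - remainder

labels = ["UK"] * 10 + ["US"] * 10
centerpercent = [3, 0, 33, 0, 0, 0, 0, 0, 7, 5,
                 100, 100, 100, 92, 100, 100, 100, 100, 100, 62]
print(info_gain(centerpercent, labels, 50))  # 1.0 bit: this split separates the two classes perfectly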
(d) The test.arff dataset below relates to another set of example English-language text documents.
@relation test
@attribute center numeric
@attribute centre numeric
@attribute centerpercent numeric
@attribute color numeric
@attribute colour numeric
@attribute colorpercent numeric
@attribute english {UK,US}
@data
10,5,66,0,20,0,UK
0,10,0,4,4,50,UK
4,4,50,10,10,50,UK
Draw the confusion matrix you would get if you used test.arff to evaluate your answer to (b).
[2 marks]
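(Illustration only: one way to tabulate such a confusion matrix. The predicted labels below are placeholders; in practice they would come from applying the tree from (b) to the three test instances.)

# Sketch: tabulating a 2x2 confusion matrix from (actual, predicted) label pairs.
from collections import Counter

actual    = ["UK", "UK", "UK"]   # gold labels of the three test.arff instances
predicted = ["UK", "UK", "UK"]   # hypothetical predictions -- replace with the tree's output
counts = Counter(zip(actual, predicted))
print("           predicted UK   predicted US")
for a in ("UK", "US"):
    print("actual", a, [counts[(a, p)] for p in ("UK", "US")])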
TURN OVER
(e) Explain the difference between supervised and unsupervised learning, giving examples involving these datasets.
[4 marks]
(f) The researcher who collected these datasets chose to record the attributes center, centre, centerpercent, color,
colour, colorpercent as possible indicators or predictors of the variety of English used in each document: UK
English or US English. Suggest three justifications for this choice of features. Suggest an additional feature which
might improve results with unseen datasets; justify your added feature.
[4 marks]
END