Copy of Assignment 1

Assignment 1: Database Search Results (10%)
Submit a report that includes:
1) A list of 10-20 articles or books that seem potentially relevant to your
research topic
1. Aas, K. and L. Eikvil, Text Categorisation: A Survey. 1999.
2) A review of 1 paper from this list that you and your supervisor consider
particularly important for your research topic. This should be no more than 1
page long and include:
- Paper details
The paper reviewed here is the report “Text Categorisation: A Survey”,
published in 1999 by K. Aas and L. Eikvil of the Norwegian Computing
Center. The 37-page report consists of 7 chapters plus an appendix that
describes the standardised Reuters-21578 collection of newswire stories from
1987.
After the introduction, Chapter 2 lists the pre-processing steps that
transform free (raw) text into a representation suitable for the subsequent
categorisation task. Chapter 3 describes six classification algorithms, all of
which have been applied successfully in earlier text classification work.
Chapter 4 defines performance measures for evaluating category ranking and
binary categorisation. Building on the two preceding chapters, Chapter 5
describes previous work that was evaluated on the Reuters-21578 collection.
Chapter 6 presents the authors’ own work on the same collection, and
Chapter 7 gives a summary.
- Summary of the paper
The report describes different approaches to pre-processing, indexing,
dimensionality reduction, and classification, which together make up a typical
text categorisation process. In addition, drawing on the results of previous
text categorisation work on the Reuters collection and on the authors’ own
experiments, it evaluates the following classification methods:
Rocchio’s algorithm, Naive Bayes, K-nearest neighbour, Decision Trees,
Support Vector Machines, and Voted Classification.
The evaluation results showed that all of the methods give acceptable
classifiers, and that none of them is clearly superior to the rest.
- 1-2 main strengths and weaknesses of the paper
Pros 1. The survey is persuasive because the previous work it covers
applied a range of statistical classification and machine learning
techniques.
Pros 2. Using the standardised Reuters-21578 collection for all the
implementations makes it possible to measure progress in the field.
Cons 1. The authors’ own experiments used a combination of simple
approaches, which limits how well they support the authors’
conclusions.
Cons 2. Some aspects of text categorisation are not clarified in the report,
such as the multi-class and multi-label problems.
- How does the paper relate to your research topic?
Mentioning of my research topic - Converting Natural Language to Medical
Codes,
- Questions that you want to ask the authors
This method is also called winner-take-all classification. Suppose the dataset
is to be classified into M classes. Then M binary SVM classifiers are
created, where each classifier is trained to distinguish one class from the
remaining M-1 classes. For example, the binary classifier for class one is
designed to discriminate between the data vectors of class one and the data
vectors of the remaining classes. The other SVM classifiers are constructed in
the same manner. During the testing or application phase, each data vector is
classified by computing its margin from the linear separating hyperplane of
every classifier. The final output is the class that corresponds to the SVM
with the largest margin.
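
The winner-take-all scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy data, the function names, and the training routine are all hypothetical, and a plain hinge-loss linear classifier trained by subgradient steps (with no regularisation term) stands in for a real SVM solver.

```python
# Toy one-vs-rest (winner-take-all) sketch. All names and data are hypothetical.

def decision_value(model, x):
    """Signed output w.x + b of one binary classifier."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_binary(X, y, epochs=200, lr=0.1):
    """Fit a linear classifier by subgradient steps on the hinge loss.

    Stand-in for a real SVM solver (no regularisation term); y holds +1 / -1.
    """
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Update only when the point is misclassified or inside the margin.
            if yi * decision_value((w, b), xi) < 1:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def train_one_vs_rest(X, labels):
    """Train M binary classifiers, one per class, each against the other M-1 classes."""
    return {
        c: train_binary(X, [1 if label == c else -1 for label in labels])
        for c in sorted(set(labels))
    }


def predict(models, x):
    """Winner-take-all: return the class whose classifier gives the largest output."""
    return max(models, key=lambda c: decision_value(models[c], x))


# Three well-separated toy clusters, so each class is linearly separable from the rest.
X = [(0, 4), (1, 5), (-1, 5), (-5, -4), (-4, -5), (-5, -5), (5, -4), (4, -5), (5, -5)]
labels = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
models = train_one_vs_rest(X, labels)
predictions = [predict(models, x) for x in X]
```

Note that the sketch compares the raw outputs of classifiers that were trained separately, which is exactly what the winner-take-all rule does; it does not handle the case where two outputs are nearly equal.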
However, if the outputs corresponding to two or more classes are very close to
each other, those points are labeled as unclassified, and a subjective decision
may have to be made by the analyst. Otherwise, a reject decision (Schölkopf
and Smola, 2002) may also be applied using a threshold to decide the class
label. This multiclass method has an advantage in the sense that the number
of binary classifiers to construct equals the number of classes. However, there
are some drawbacks. First, during the training phase,
- Directions for future research that you think are worthwhile to consider, identified by you or the authors
3) A list of the top conference and journals in your research area (up to 6 each)
Top conferences:
Journals:
4) A list of the main research groups working in your area (up to 6).