Assignment 1: Database Search Results (10%)

Submit a report that includes:

1) A list of 10-20 articles or books which seem potentially relevant to your research topic

1. Aas, K. and L. Eikvil, Text Categorisation: A Survey. 1999.

2) A review of 1 paper from this list that you and your supervisor consider particularly important for your research topic. This should be no more than 1 page long and include:

- Paper details

The report “Text Categorisation: A Survey” was written in 1999 by K. Aas and L. Eikvil of the Norwegian Computing Center. The 37-page report consists of 7 chapters plus an appendix describing the standardised Reuters-21578 collection of newswire stories from 1987. After the introduction, Chapter 2 lists the pre-processing steps that transform free (raw) text into a representation suitable for the subsequent categorisation task. Chapter 3 describes 6 classification algorithms, all of which have been successfully used in previous text classification work. Chapter 4 defines performance measures for evaluating category ranking and binary categorisation. Building on the two preceding chapters, Chapter 5 describes and evaluates previous work that used the Reuters-21578 collection. The authors’ own work on the same Reuters collection is presented in Chapter 6, followed by a summary in Chapter 7.

- Summary of the paper

The report describes different approaches to pre-processing, indexing, dimensionality reduction, and classification, which together constitute a typical text categorisation process. In addition, based on the results of previous text categorisation work on the Reuters collection and on the authors’ own experiments, the following classification methods are evaluated: Rocchio’s algorithm, Naive Bayes, K-nearest neighbour, Decision Trees, Support Vector Machines, and Voted Classification.
The evaluation results showed that all of the methods give acceptable classifiers, while none of them is clearly superior to the rest.

- 2 main strengths and weaknesses of the paper

Pros 1. The survey is persuasive because the previous work it covers applied a number of statistical classification and machine learning techniques.
Pros 2. Using the standardised Reuters-21578 collection for all the implementations makes it possible to measure progress in the field.
Cons 1. The authors’ own experiments adopted a combination of simple approaches, which limits the evidence supporting their conclusions.
Cons 2. Some aspects of text categorisation are not clarified in the report, such as the multi-class and multi-label problems.

- How does the paper relate to your research topic?

Mentioning of my research topic - Converting Natural Language to Medical Codes,

- Questions that you want to ask the authors

This method is also called winner-take-all classification. Suppose the dataset is to be classified into M classes. Then M binary SVM classifiers are created, where each classifier is trained to distinguish one class from the remaining M-1 classes. For example, the binary classifier for class one is designed to discriminate between class one data vectors and the data vectors of the remaining classes. The other SVM classifiers are constructed in the same manner. During the testing or application phase, data vectors are classified by computing their margin from each linear separating hyperplane. The final output is the class that corresponds to the SVM with the largest margin. However, if the outputs corresponding to two or more classes are very close to each other, those points are labeled as unclassified, and a subjective decision may have to be made by the analyst. Alternatively, a reject decision (Schölkopf and Smola, 2002) may be applied, using a threshold to decide the class label.
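
The winner-take-all decision rule described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in any of the surveyed work: the M binary classifiers are assumed to be already trained linear models given as (weights, bias) pairs, the raw decision value w.x + b stands in for the margin, and the names decision_value, one_vs_rest_predict and min_gap are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# Assumption: each trained binary classifier is a linear model (weights, bias).
LinearClassifier = Tuple[Sequence[float], float]


def decision_value(clf: LinearClassifier, x: Sequence[float]) -> float:
    """Signed score w.x + b of one binary classifier for the input vector x."""
    weights, bias = clf
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def one_vs_rest_predict(
    classifiers: List[LinearClassifier],
    x: Sequence[float],
    min_gap: float = 0.1,  # hypothetical reject threshold
) -> Optional[int]:
    """Return the index of the winning class, or None if x is left unclassified.

    Expects at least two classifiers (one per class).
    """
    scores = [decision_value(clf, x) for clf in classifiers]
    # Rank class indices from the largest score to the smallest.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    # Reject when the two strongest classes are too close to call.
    if scores[best] - scores[runner_up] < min_gap:
        return None
    return best


# Toy example with M = 3 classes in a 2-dimensional feature space.
toy_classifiers: List[LinearClassifier] = [
    ([1.0, 0.0], 0.0),
    ([0.0, 1.0], 0.0),
    ([-1.0, -1.0], 0.0),
]
clear_case = one_vs_rest_predict(toy_classifiers, [2.0, 0.5])
close_case = one_vs_rest_predict(toy_classifiers, [1.0, 0.95])
```

In this toy setup the first vector is assigned to class 0, while the second is left unclassified because the scores for classes 0 and 1 differ by less than min_gap. The value of min_gap is arbitrary here and would have to be chosen by the analyst.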
This multiclass method has the advantage that the number of binary classifiers to construct equals the number of classes. However, there are some drawbacks. First, during the training phase,

- Directions for future research that you think are worthwhile to consider, identified by you or the authors

3) A list of the top conference and journals in your research area (up to 6 each)

top conference:
journals:

4) A list of the main research groups working in your area (up to 6).