Combining labeled and unlabeled data for text classification with a large number of categories
Ghani, R.; Center for Automated Learning & Discovery, Carnegie Mellon Univ.
This paper appears in: Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on, 11/29/2001-12/02/2001. Location: San Jose, CA, USA. On page(s): 597-598. References Cited: 6. Number of Pages: xxi+677. INSPEC Accession Number: 7169351.
Abstract: We develop a framework to incorporate unlabeled data in the error-correcting output coding (ECOC) setup by decomposing multiclass problems into multiple binary problems and then using co-training to learn the individual binary classification problems. We show that our method is especially useful for classification tasks involving a large number of categories, where co-training does not perform very well by itself but, when combined with ECOC, outperforms several other algorithms that combine labeled and unlabeled data for text classification in terms of accuracy, precision-recall tradeoff, and efficiency.
Index Terms: error correction codes; learning (artificial intelligence); pattern classification; text analysis; accuracy; binary classification problems; categories; co-training; error-correcting output coding setup; labeled data; multiclass problems; multiple binary problems; precision-recall tradeoff; text classification; unlabeled data

IR34 Finding comparatively important concepts between texts
Lecoeuche, R.; Div. of Inf., Edinburgh Univ.
This paper appears in: Automated Software Engineering, 2000. Proceedings ASE 2000. The Fifteenth IEEE International Conference on, 09/11/2000-09/15/2000. Location: Grenoble, France. On page(s): 55-60. References Cited: 15. Number of Pages: xiii+330. INSPEC Accession Number: 6735774.
Abstract: Finding important concepts is a common task in requirements engineering.
For example, it is needed when building models of a domain or organising requirements documents. Since a lot of information is available in textual form, methods to identify important concepts from texts are potentially useful. Traditional methods for finding important concepts in texts rely on the assumption that the most frequent concepts are the most important. We present an approach that does not depend on this assumption. It makes use of two texts to find important concepts comparatively. We show that this approach is viable: it discovers concepts similar to those found by traditional approaches as well as concepts that are not frequent. Finally, we discuss the possibility of extending this work to requirements classification.
Index Terms: classification; systems analysis; text analysis; important concepts; requirements classification; requirements documents; requirements engineering; textual form

Automated support for text-based system assessment
Merriman, M.; Evans, R.P.; Park, S.; George Mason Univ., Fairfax, VA
This paper appears in: Systems Engineering of Computer Based Systems, 1995. Proceedings of the 1995 International Symposium and Workshop on, 6-9 Mar 1995. Location: Tucson, AZ, USA. On page(s): 85-92. References Cited: 8. INSPEC Accession Number: 4981028.
Abstract: Describes the need to explore and evaluate text descriptions of proposed computer-based systems, presents an approach for automated support for text-based system assessment, and reports on the use of this approach in support of system assessment for a complex multi-segment project: the Federal Bureau of Investigation's Integrated Automated Fingerprint Identification System (IAFIS). Text-based system assessment is a key to early discovery of issues and risk in system development.
Text-based assessments encounter many challenges, particularly the high potential for ambiguity in English, but still have significant advantages.
Index Terms: English language; FBI; Federal Bureau of Investigation; IAFIS; Integrated Automated Fingerprint Identification System; ambiguity; automated support; change impact assessment; classification; complex multi-segment project; computer aided software engineering; configuration control; document handling; early issues discovery; fingerprint identification; natural languages; project support environments; proposed computer-based systems; requirements; risk assessment; risk discovery; risk management; system development; system documentation; text categorization; text descriptions; text-based system assessment; traceability

IR 35 Automated diagnosis of non-native English speaker's natural language
Fox, R.; Bowden, M.; Dept. of Math. & Comput. Sci., Northern Kentucky Univ., Highland Heights, KY, USA
This paper appears in: Tools with Artificial Intelligence, 2002. (ICTAI 2002). Proceedings. 14th IEEE International Conference on. On page(s): 301-306. 2002. ISSN: 1082-3409. Number of Pages: xx+548. INSPEC Accession Number: 7555168.
Abstract: Typical grammar checking software uses some form of natural language parsing to determine if errors exist in the text.
If a sentence is found ungrammatical, the grammar checker usually seeks a single grammatical error as an explanation. For non-native speakers of English, it is possible that a given sentence contains multiple errors, and grammar checkers may not adequately explain these mistakes. This paper presents GRADES, a diagnostic program that detects and explains grammatical mistakes made by non-native English speakers. GRADES performs its diagnostic task not through parsing but through the application of classification and pattern matching rules. This makes the diagnostic process more efficient than other grammar checkers. GRADES is envisioned as a tool to help non-native English speakers learn to correct their English mistakes, but it is also a demonstration that grammar checking need not rely on parsing techniques.
Index Terms: computational linguistics; grammars; natural languages; GRADES; grammar checking; grammar checking software; grammatical mistakes; natural language parsing; non-native English speakers

IR 36 Automatic text categorization and its application to text retrieval
Wai Lam; Ruiz, M.; Srinivasan, P.; Dept. of Syst. Eng. & Eng. Manage., Chinese Univ. of Hong Kong, Shatin
This paper appears in: Knowledge and Data Engineering, IEEE Transactions on. On page(s): 865-879. Volume: 11, Issue: 6, Nov/Dec 1999. ISSN: 1041-4347. References Cited: 23. CODEN: ITKEEH. INSPEC Accession Number: 6526752.
Abstract: We develop an automatic text categorization approach and investigate its application to text retrieval. The categorization approach is derived from a combination of a learning paradigm known as instance-based learning and an advanced document retrieval technique known as retrieval feedback. We demonstrate the effectiveness of our categorization approach using two real-world document collections from the MEDLINE database. Next, we investigate the application of automatic categorization to text retrieval.
Our experiments clearly indicate that automatic categorization improves retrieval performance compared with no categorization. We also demonstrate that retrieval using automatic categorization achieves the same retrieval quality as retrieval using manual categorization. Furthermore, a detailed analysis of the retrieval performance on each individual test query is provided.
Index Terms: information retrieval; MEDLINE database; automatic text categorization; detailed analysis; document retrieval technique; instance-based learning; learning paradigm; real-world document collections; retrieval feedback; retrieval quality; text retrieval

Documents that cite this document:
A research on Web resources automatic classification using SVMs
Cai Wei; Wang Yongcheng; Yin Zhonghang; Zou Tao; Shanghai Jiao Tong Univ.
This paper appears in: Intelligent Control and Automation, 2002. Proceedings of the 4th World Congress on. On page(s): 1359-1363 vol.2. Volume: 2, 2002. Number of Pages: 4 vol. 3353. INSPEC Accession Number: 7412373.
Abstract: With the rapid growth of Web information, text categorization has become an important research field for the management of Internet information. Most existing methods are based on traditional statistics, which provides conclusions only for the situation where the sample size tends to infinity, so they may not work well in practical cases with limited samples and easily lead to the problem of over-fitting. This paper theoretically analyzes the causes of over-fitting and describes the conditions under which it occurs, as well as a method to avoid it. We introduce SVMs, a method based on statistical learning theory that avoids over-fitting and matches the limited number of Internet news examples.
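As an aside to the SVM entry above: the linear-SVM-for-text setup it describes can be sketched with a minimal, self-contained example. This is an illustrative sketch only; the toy corpus, the bag-of-words features, and the Pegasos-style training loop are assumptions for illustration, not the paper's actual data or method.

```python
import random
from collections import Counter

def featurize(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary, plus a bias term."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab] + [1.0]  # trailing 1.0 = bias

def train_linear_svm(docs, labels, vocab, lam=0.01, epochs=200, seed=0):
    """Pegasos-style sub-gradient descent on the hinge loss.

    labels must be +1/-1; returns a weight vector (last entry = bias weight).
    """
    rng = random.Random(seed)
    w = [0.0] * (len(vocab) + 1)
    data = [(featurize(d, vocab), y) for d, y in zip(docs, labels)]
    t = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # shrink weights (L2 regularization), then step if margin violated
            w = [(1 - eta * lam) * wi for wi in w]
            if margin < 1:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, text, vocab):
    score = sum(wi * xi for wi, xi in zip(w, featurize(text, vocab)))
    return 1 if score >= 0 else -1
```

On a small, linearly separable toy corpus (e.g. finance headlines labeled -1, sports headlines labeled +1), the learned weights separate the two vocabularies and classify held-out phrases accordingly.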
Index Terms: Bayes methods; Internet; decision theory; learning (artificial intelligence); learning automata; text analysis; Internet information management; SVMs; Web information; Web resources automatic classification; statistical learning theory; support vector machines; text categorization

IR 37 Text classification and keyword extraction by learning decision trees
Sakakibara, Y.; Misue, K.; Koshiba, T.; Fujitsu Lab., Ltd., Numazu, Shizuoka
This paper appears in: Artificial Intelligence for Applications, 1993. Proceedings., Ninth Conference on, 1-5 Mar 1993. Location: Orlando, FL, USA. On page(s): 466. References Cited: 0. INSPEC Accession Number: 4851079.
Abstract: Summary form only given. The authors propose a completely new approach to the problem of text classification and automatic keyword extraction using machine learning techniques. They introduce a class of representations for classifying text data based on decision trees, and present an algorithm for learning such representations inductively. The algorithm does not need any natural language processing technique and is robust to noisy data. It is shown that the learning algorithm can be used for automatic extraction of keywords for text retrieval and for automatic text categorization. Some experimental results on the use of the algorithm are reported.
Index Terms: classification; learning (artificial intelligence); linguistics; natural languages; automatic keyword extraction; automatic text categorization; decision trees; learning; machine learning; natural language processing; noisy data; text classification; text retrieval

IR 38 Automatic category generation for text documents by self-organizing maps
Hsin-Chang Yang; Chung-Hong Lee; Dept. of Inf.
Manage., Chang Jung Univ., Tainan
This paper appears in: Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, 07/24/2000-07/27/2000. Location: Como, Italy. On page(s): 581-586 vol.3. Volume: 3, 2000. References Cited: 8. Number of Pages: 6 vol. (xxxvii+371+xxxvi+313+679+630+669+659). INSPEC Accession Number: 6703760.
Abstract: One important task for text data mining is automatic text categorization, which assigns a text document to some predefined category according to their correlations. Traditionally, these categories, as well as the correlations among them, are determined by human experts. In this paper, we devise a novel approach to generate categories automatically. The self-organizing map model is used to generate two maps, namely the word cluster map and the document cluster map, in which a neuron represents a cluster of words or documents respectively. Our approach is to analyze the document cluster map to find centroids of some super-clusters. We also devise a method to select the category term from the word cluster map. A hierarchical structure of categories may be generated by recursively applying the same method. Text categorization is the natural consequence of such an automatic category generation process.
Index Terms: category theory; data mining; generalisation (artificial intelligence); self-organising feature maps; text analysis; automatic category generation; document cluster map; hierarchical structure; self-organizing maps; text categorisation; text data mining; text document; word cluster map

Documents that cite this document:
Using rough sets to construct sense type decision trees for text categorization
Bleyberg, M.Z.; Elumalai, A.; Comput. & Inf. Sci. Dept., Kansas State Univ., Manhattan, KS
This paper appears in: IFSA World Congress and 20th NAFIPS International Conference, 2001.
Joint 9th, 25-28 July 2001. Location: Vancouver, BC, Canada. On page(s): 19-24 vol.1. Volume: 1. References Cited: 8. Number of Pages: 5 vol. (xxxviii+xxii+3100). INSPEC Accession Number: 7081704.
Abstract: Accurate text categorization is needed for efficient and effective text retrieval, search and filtering. Finding appropriate categories and manually assigning them to existing documents is very laborious. The paper shows a simple procedure for automatic extraction of atomic sense types (semantic categories) from documents based on rough sets. The atomic sense types are nodes of a sense type decision tree, which represents a taxonomy.
Index Terms: decision trees; information retrieval; rough set theory; text analysis; atomic sense types; automatic extraction; rough sets; semantic categories; sense type decision tree; sense type decision trees; taxonomy; text categorization; text filtering; text retrieval; text search

IR 39 Automatic text categorization: case study
Correa, R.F.; Ludermir, T.B.
This paper appears in: Neural Networks, 2002. SBRN 2002. Proceedings. VII Brazilian Symposium on. On page(s): 150. 2002. Number of Pages: xiii+270. INSPEC Accession Number: 7568947.
Abstract: Text categorization is a process of classifying documents with regard to a group of one or more existing categories according to the themes or concepts present in their contents. Its most common application is in information retrieval systems (IRS) for document indexing. A method to make text categorization a viable task is to use machine-learning algorithms to automate text classification, allowing it to be carried out quickly, concisely, and on a broad scale.
The objective of this work is to present and compare the results of experiments on text categorization using artificial neural networks of the multilayer perceptron and self-organizing map types, and traditional machine-learning algorithms used in this task: the C4.5 decision tree, PART decision rules, and the Naive Bayes classifier.
Index Terms: classification; decision trees; information retrieval; learning (artificial intelligence); multilayer perceptrons; self-organising feature maps; text analysis; Naive Bayes classifier; PART decision rules; decision tree; document classification; document indexing; information retrieval systems; machine-learning; multilayer perceptron; neural networks; self-organizing maps; text categorization

IR40 An incremental approach to text representation, categorization, and retrieval
O'Neil, P.; Rome Lab.
This paper appears in: Document Analysis and Recognition, 1997. Proceedings of the Fourth International Conference on, 18-20 Aug 1997. Location: Ulm, Germany. On page(s): 714-717 vol.2. Volume: 2. References Cited: 8. Number of Pages: 2 vol. xxiv+1119. INSPEC Accession Number: 5704673.
Abstract: Efficient and accurate information retrieval is a goal of just about everyone. Whether you are looking for information on the Internet, a book or article in the library, satellite imagery of missile silos, or a recipe for dinner, finding exactly what you want or need, even if you know exactly what you are looking for, can be an imposing and most difficult task. Many current techniques require an intimate understanding of the actual processes involved. The method presented in this paper provides for an automatic representation of text data by vectors, which can then be manipulated to categorize and organize the data. Information can be retrieved without knowledge of the underlying process.
The user can ask for information using normal discourse. This technology can also be applied to data mining and visualization.
Index Terms: data structures; data visualisation; information retrieval; data mining; data visualization; incremental approach; information retrieval; missile silos; normal discourse; text categorization; text representation; text retrieval

IR 41 Managing semantic content for the Web
Sheth, A.; Bertram, C.; Avant, D.; Hammond, B.; Kochut, K.; Warke, Y.; Dept. of Comput. Sci., Georgia Univ., Athens, GA
This paper appears in: Internet Computing, IEEE. On page(s): 80-87. Volume: 6, Issue: 4, Jul/Aug 2002. ISSN: 1089-7801. INSPEC Accession Number: 7344439.
Abstract: By associating meaning with content, the Semantic Web will facilitate search, interoperability, and the composition of complex applications. The paper discusses the Semantic Content Organization and Retrieval Engine (SCORE, see www.voquette.com), which is based on research transferred from the University of Georgia's Large Scale Distributed Information Systems lab. SCORE belongs to a new generation of technologies for the emerging Semantic Web. It provides facilities to define ontological components that software agents can maintain. These agents use regular-expression-based rules in conjunction with various semantic techniques to extract ontology-driven metadata from structured and semistructured content. Automatic classification and information-extraction techniques augment these results and also let the system deal with unstructured text.
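As an aside to the SCORE entry above: the regular-expression-based metadata extraction it mentions can be sketched minimally. The rule set below is entirely hypothetical (the attribute names and patterns are illustrative assumptions, not SCORE's actual rules); it shows only the general shape of rule-driven extraction.

```python
import re

# Hypothetical rule set: each entry maps a metadata attribute to a regex
# with one capture group. Illustrative only, not SCORE's actual rules.
RULES = {
    "ticker":  re.compile(r"\(NASDAQ:\s*([A-Z]{1,5})\)"),
    "price":   re.compile(r"\$(\d+(?:\.\d{2})?)"),
    "quarter": re.compile(r"\b(Q[1-4]\s+\d{4})\b"),
}

def extract_metadata(text):
    """Apply each rule to the text; keep the first match per attribute."""
    metadata = {}
    for attr, pattern in RULES.items():
        m = pattern.search(text)
        if m:
            metadata[attr] = m.group(1)
    return metadata

print(extract_metadata(
    "Acme Corp. (NASDAQ: ACME) closed at $12.50, beating Q3 2002 estimates."
))  # -> {'ticker': 'ACME', 'price': '12.50', 'quarter': 'Q3 2002'}
```

In a full system such rules would feed an ontology (attaching extracted values to ontological components) rather than returning a flat dictionary; the flat dictionary keeps the sketch self-contained.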
Index Terms: Internet; classification; information resources; information retrieval; meta data; search engines; software agents; Internet; SCORE; Semantic Content Organization and Retrieval Engine; Semantic Web; classification; expression-based rules; information extraction; interoperability; ontological components; ontology-driven metadata; searching; semantic search engine; software agents

IR42 Three term weighting and classification algorithms in text automatic classification
Qian Diao; Shanghai Jiaotong Univ.
This paper appears in: High Performance Computing in the Asia-Pacific Region, 2000. Proceedings. The Fourth International Conference/Exhibition on, 05/14/2000-05/17/2000. Location: Beijing, China. On page(s): 629-630 vol.2. Volume: 2, 2000. References Cited: 2. Number of Pages: 2 vol. xxiv+1179. INSPEC Accession Number: 6604932.
Abstract: Three automatic text classification algorithms are provided: the Bayes method, based on Bayes' theorem and IDF (inverse document frequency); VSM, based on Shannon entropy; and a fuzzy method based on fuzzy theory. Furthermore, a method of combining term weighting methods with the three classification algorithms is also provided in the paper.
Index Terms: Bayes methods; classification; entropy; fuzzy set theory; text analysis; Bayes method; Bayes theorem; inverse document frequency; Shannon entropy; VSM; automatic text classification algorithms; fuzzy method; fuzzy theory; term weighting algorithms

IR 43 The TaxGen framework: automating the generation of a taxonomy for a large document collection
Muller, A.; Dorre, J.; Gerstl, P.; Seiffert, R.; Dept. of Software Solutions Dev., IBM Germany
This paper appears in: System Sciences, 1999. HICSS-32.
Proceedings of the 32nd Annual Hawaii International Conference on, 01/05/1999-01/08/1999. Location: Maui, HI, USA. On page(s): 9 pp. 1999. References Cited: 12. Number of Pages: liii+341. INSPEC Accession Number: 6182111.
Abstract: Text mining is an active area of research and development which combines and expands techniques found in related areas like information retrieval, computational linguistics and data mining to perform an analysis of large corpora of digital documents. This paper describes the TaxGen text mining project carried out at the IBM Software Development Lab at Boeblingen, Germany. The goal of TaxGen was the automatic generation of a taxonomy for a collection of previously unstructured documents, namely a set of 73,000 news wire documents spanning one year.
Index Terms: classification; computational linguistics; data mining; information retrieval; text analysis; very large databases; IBM Software Development Lab, Boeblingen, Germany; TaxGen text mining project; automatic taxonomy generation; computational linguistics; data mining; digital documents; information retrieval; large document collection; news wire documents; text corpus analysis; unstructured documents

Automatic labeling of self-organizing maps for information retrieval
Merkl, D.; Rauber, A.; Inst. fur Softwaretech., Tech. Univ. Wien
This paper appears in: Neural Information Processing, 1999. Proceedings. ICONIP '99. 6th International Conference on, 11/16/1999-11/20/1999. Location: Perth, WA, Australia. On page(s): 37-42 vol.1. Volume: 1, 1999. References Cited: 17. Number of Pages: 3 vol. xv+1240. INSPEC Accession Number: 6605092.
Abstract: The self-organizing map is a very popular unsupervised neural network model for the analysis of high-dimensional input data, as in information retrieval applications. However, the interpretation of the map requires much manual effort, especially as far as the analysis of the learned features and the characteristics of identified clusters is concerned.
We present our novel LabelSOM method which, based on the features learned by the map, automatically selects the most descriptive features of the input patterns mapped onto a particular unit of the map, thus making the characteristics of the various clusters within the map explicit. We demonstrate the benefits of this approach on an example from text classification using a real-world document archive. In this particular case, the features correspond to keywords describing the contents of a document. The benefit of this approach is that the various document clusters are characterized in terms of shared keywords, making it easy for the user to explore the contents of an unknown document archive.
Index Terms: classification; information retrieval; self-organising feature maps; unsupervised learning; LabelSOM method; automatic labeling; document archive; high-dimensional input data analysis; information retrieval; keywords; self-organizing maps; text classification; unsupervised neural network model

Automatic document classification based on probabilistic reasoning: model and performance analysis
Wai Lam; Kon-Fan Low; Dept. of Syst. Eng. & Eng. Manage., Chinese Univ. of Hong Kong, Shatin
This paper appears in: Systems, Man, and Cybernetics, 1997. 'Computational Cybernetics and Simulation'. 1997 IEEE International Conference on, 12-15 Oct 1997. Location: Orlando, FL, USA. On page(s): 2719-2723 vol.3. Volume: 3. References Cited: 6. Number of Pages: 5 vol. 4535. INSPEC Accession Number: 5753489.
Abstract: We develop a new approach to text classification based on automatic feature extraction and probabilistic reasoning. The knowledge representation used to perform this task is known as Bayesian inference networks. A Bayesian network text classifier is automatically constructed from a set of training text documents.
We have conducted a series of experiments on two text document corpora, namely CACM and Reuters, to analyze the performance of our approach; the experiments are described in the paper.

Improving the classification accuracy of automatic text processing systems using context vectors and back-propagation algorithms
Farkas, J.; Centre for Inf. Technol. Innovation, Ind. Canada, Laval, Que.
This paper appears in: Electrical and Computer Engineering, 1996. Canadian Conference on, 26-29 May 1996. Location: Calgary, Alta., Canada. On page(s): 696-699 vol.2. Volume: 2. References Cited: 13. Number of Pages: 2 vol. xl+1026. INSPEC Accession Number: 5456375.
Abstract: We analyze some of the benefits of combining the context-vector representation of documents with the back-propagation paradigm for document classification. We discuss an implementation of this architecture, called NeuroFile, which combines automatic document classification with similarity-based as well as Boolean retrieval facilities in a single electronic filing system. The performance of NeuroFile is compared with that of an earlier system called NeuroClass. We show that NeuroFile achieves a 9% classification improvement over NeuroClass.
Index Terms: backpropagation; classification; document image processing; feedforward neural nets; information retrieval; word processing; Boolean retrieval facilities; NeuroClass; NeuroFile; automatic document classification; automatic text processing systems; backpropagation algorithms; classification accuracy; context vectors; context-vector representation; documents; electronic filing system; performance; similarity-based retrieval facilities

Towards classifying full-text using recurrent neural networks
Farkas, J.; Centre for Inf. Technol. Innovation, Ind. Canada, Laval, Que.
This paper appears in: Electrical and Computer Engineering, 1995. Canadian Conference on, 5-8 Sep 1995. Location: Montreal, Que.
, Canada. On page(s): 511-514 vol.1. Volume: 1. References Cited: 12. INSPEC Accession Number: 5205006.
Abstract: This paper describes an automatic document classification system called NeuroClass, developed for the air transportation field of Transport Canada. The properties of the system show that, for the specific domain for which NeuroClass was developed, recurrent neural networks as developed by Elman (1990) can be used to build systems that classify natural-language full text automatically and reliably, with a degree of accuracy proportional to the level of class adherence of the text involved.

IR 44 Acquisition of linguistic patterns for knowledge-based information extraction
Jun-Tae Kim; Moldovan, D.I.; Dept. of Comput. Eng., Dongguk Univ., Seoul
This paper appears in: Knowledge and Data Engineering, IEEE Transactions on. On page(s): 713-724. Volume: 7, Issue: 5, Oct 1995. ISSN: 1041-4347. References Cited: 35. INSPEC Accession Number: 5103896.
Abstract: The paper presents an automatic acquisition of linguistic patterns that can be used for knowledge-based information extraction from texts. In knowledge-based information extraction, linguistic patterns play a central role in the recognition and classification of input texts. Although the knowledge-based approach has been proved effective for information extraction in limited domains, there are difficulties in the construction of a large number of domain-specific linguistic patterns. Manual creation of patterns is time consuming and error prone, even for a small application domain. To solve the scalability and portability problems, an automatic acquisition of patterns must be provided. We present the PALKA (Parallel Automatic Linguistic Knowledge Acquisition) system, which acquires linguistic patterns from a set of domain-specific training texts and their desired outputs. A specialized representation of patterns called FP structures has been defined.
Patterns are constructed in the form of FP structures from training texts, and the acquired patterns are tuned further through the generalization of semantic constraints. An inductive learning mechanism is applied in the generalization step. The PALKA system has been used to generate patterns for our information extraction system developed for the Fourth Message Understanding Conference (MUC-4).
Index Terms: knowledge acquisition; knowledge based systems; learning by example; linguistics; natural languages; pattern recognition; word processing; FP structures; PALKA; Parallel Automatic Linguistic Knowledge Acquisition; automatic acquisition; domain-specific linguistic patterns; domain-specific training text; input text; knowledge-based information extraction; knowledge-based natural language processing; linguistic pattern acquisition; semantic constraints
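As a rough illustration of the semantic-constraint generalization the PALKA entry above describes: when two acquired patterns differ only in the semantic class of one slot, the slot constraint can be widened to the least common ancestor of the two classes in a concept taxonomy. The toy taxonomy and the example phrases below are assumptions for illustration, not PALKA's actual FP-structure formalism or concept hierarchy.

```python
# Toy is-a taxonomy: child -> parent. Illustrative only.
TAXONOMY = {
    "car": "vehicle", "truck": "vehicle",
    "vehicle": "physical-object", "building": "physical-object",
    "physical-object": "entity",
}

def ancestors(concept):
    """Concept plus all its ancestors, most specific first."""
    chain = [concept]
    while concept in TAXONOMY:
        concept = TAXONOMY[concept]
        chain.append(concept)
    return chain

def generalize(c1, c2):
    """Least common ancestor of two concepts: the widened slot constraint."""
    a2 = set(ancestors(c2))
    for a in ancestors(c1):
        if a in a2:
            return a
    return None

# Merging a pattern learned from "bomb damaged the car" with one learned
# from "bomb damaged the building" widens the object slot:
print(generalize("car", "building"))  # -> physical-object
```

Repeating this merge over many training instances is what lets a small set of learned patterns cover unseen but semantically related inputs.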