Combining labeled and unlabeled data for text
classification with a large number of categories
Ghani, R.
Center for Automated Learning & Discovery, Carnegie Mellon Univ.;
This paper appears in: Data Mining, 2001. ICDM 2001, Proceedings IEEE
International Conference on
11/29/2001 -12/02/2001, 2001
Location: San Jose, CA , USA
On page(s): 597-598
2001
References Cited: 6
Number of Pages: xxi+677
INSPEC Accession Number: 7169351
Abstract:
We develop a framework to incorporate unlabeled data into the error-correcting
output coding (ECOC) setup by decomposing multiclass problems into multiple
binary problems and then using co-training to learn the individual binary
classification problems. We show that our method is especially useful for
classification tasks involving a large number of categories, where co-training
does not perform well by itself; when combined with ECOC, it outperforms
several other algorithms that combine labeled and unlabeled data for text
classification in terms of accuracy, precision-recall tradeoff, and efficiency.
Index Terms:
error correction codes learning (artificial intelligence) pattern classification text analysis
accuracy binary classification problems categories co-training error correcting output
coding setup labeled data multiclass problems multiple binary problems precision-recall
tradeoff text classification unlabeled data
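The ECOC decomposition summarized in the abstract can be sketched in a few lines: each class gets a bit-string codeword, one binary classifier is learned per bit (in the paper, via co-training over labeled and unlabeled text), and a new document is assigned the class whose codeword is nearest in Hamming distance to the classifiers' outputs. The codewords and class names below are invented for illustration, not taken from the paper.

```python
def hamming(a, b):
    """Number of positions where two bit vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

def ecoc_predict(bits, codewords):
    """Assign the class whose codeword is nearest in Hamming distance."""
    return min(codewords, key=lambda c: hamming(bits, codewords[c]))

# Toy 7-bit codewords for four classes (illustrative only).
codewords = {
    "sports":   [0, 0, 1, 1, 0, 1, 1],
    "politics": [1, 1, 0, 0, 1, 0, 1],
    "science":  [1, 0, 1, 0, 1, 1, 0],
    "arts":     [0, 1, 0, 1, 1, 0, 0],
}

# Outputs of the 7 binary classifiers for a new document; one bit is
# flipped relative to "science", and ECOC decoding corrects the error.
outputs = [1, 0, 1, 0, 0, 1, 0]
print(ecoc_predict(outputs, codewords))  # -> science
```

The error-correcting property is what makes the scheme attractive when individual binary learners (such as co-trained classifiers) are noisy: a few wrong bits still decode to the right class.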
IR34
Finding comparatively important concepts between
texts
Lecoeuche, R.
Div. of Inf., Edinburgh Univ.;
This paper appears in: Automated Software Engineering, 2000. Proceedings
ASE 2000. The Fifteenth IEEE International Conference on
09/11/2000 -09/15/2000, 2000
Location: Grenoble , France
On page(s): 55-60
2000
References Cited: 15
Number of Pages: xiii+330
INSPEC Accession Number: 6735774
Abstract:
Finding important concepts is a common task in requirements engineering. For
example, it is needed when building models of a domain or organising
requirements documents. Since a lot of information is available in textual form,
methods to identify important concepts from texts are potentially useful.
Traditional methods for finding important concepts from texts rely on the
assumption that the most frequent concepts are the most important. We present
an approach that does not depend on this assumption. It makes use of two texts
to find important concepts comparatively. We show that this approach is viable. It
discovers concepts similar to those found by traditional approaches as well as
concepts that are not frequent. Finally, we discuss the possibility of extending this
work to requirements classification.
Index Terms:
classification systems analysis text analysis important concepts requirements
classification requirements documents requirements engineering textual form
Automated support for text-based system assessment
Merriman, M. Evans, R.P. Park, S.
George Mason Univ., Fairfax, VA;
This paper appears in: Systems Engineering of Computer Based Systems,
1995., Proceedings of the 1995 International Symposium and Workshop
on
03/06/1995 -03/09/1995, 6-9 Mar 1995
Location: Tucson, AZ , USA
On page(s): 85-92
6-9 Mar 1995
References Cited: 8
INSPEC Accession Number: 4981028
Abstract:
Describes the need to explore and evaluate text descriptions of proposed
computer-based systems, presents an approach for automated support for text-based system assessment, and reports on the use of this approach in support of
system assessment for a complex multi-segment project: the Federal Bureau of
Investigation's Integrated Automated Fingerprint Identification System (IAFIS).
Text-based system assessment is a key to early discovery of issues and risk in
system development. Text-based assessments encounter many challenges,
particularly the high potential for ambiguity in English, but still have significant
advantages.
Index Terms:
English language FBI Federal Bureau of Investigation IAFIS Integrated Automated
Fingerprint Identification System ambiguity automated support change impact
assessment classification complex multi-segment project computer aided software
engineering configuration control document handling early issues discovery fingerprint
identification natural languages project support environments proposed computer-based
systems requirements risk assessment risk discovery risk management system
development system documentation text categorization text descriptions text-based
system assessment traceability
IR 35
Automated diagnosis of non-native English speaker's
natural language
Fox, R. Bowden, M.
Dept. of Math. & Comput. Sci., Northern Kentucky Univ., Highland Heights, KY,
USA;
This paper appears in: Tools with Artificial Intelligence, 2002. (ICTAI
2002). Proceedings. 14th IEEE International Conference on
On page(s): 301- 306
2002
ISSN: 1082-3409
Number of Pages: xx+548
INSPEC Accession Number: 7555168
Abstract:
Typical grammar checking software uses some form of natural language parsing to
determine if errors exist in the text. If a sentence is found ungrammatical, the
grammar checker usually seeks a single grammatical error as an explanation. For
non-native speakers of English, it is possible that a given sentence contains
multiple errors, and grammar checkers may not adequately explain these
mistakes. This paper presents GRADES, a diagnostic program that detects and
explains grammatical mistakes made by non-native English speakers. GRADES
performs its diagnostic task, not through parsing, but through the application of
classification and pattern matching rules. This makes the diagnostic process more
efficient than other grammar checkers. GRADES is envisioned as a tool to help
non-native English speakers learn to correct their English mistakes, but is also a
demonstration that grammar checking need not rely on parsing techniques.
Index Terms:
computational linguistics grammars natural languages GRADES grammar checking
grammar checking software grammatical mistakes natural language parsing non-native
English speakers
IR 36
Automatic text categorization and its application to
text retrieval
Wai Lam Ruiz, M. Srinivasan, P.
Dept. of Syst. Eng. & Eng. Manage., Chinese Univ. of Hong Kong, Shatin;
This paper appears in: Knowledge and Data Engineering, IEEE Transactions
on
On page(s): 865-879
Volume: 11, Issue: 6, Nov/Dec 1999
ISSN: 1041-4347
References Cited: 23
CODEN: ITKEEH
INSPEC Accession Number: 6526752
Abstract:
We develop an automatic text categorization approach and investigate its
application to text retrieval. The categorization approach is derived from a
combination of a learning paradigm known as instance-based learning and an
advanced document retrieval technique known as retrieval feedback. We
demonstrate the effectiveness of our categorization approach using two real-world document collections from the MEDLINE database. Next, we investigate the
application of automatic categorization to text retrieval. Our experiments clearly
indicate that automatic categorization improves the retrieval performance
compared with no categorization. We also demonstrate that the retrieval
performance using automatic categorization achieves the same retrieval quality
as the performance using manual categorization. Furthermore, a detailed analysis
of the retrieval performance on each individual test query is provided.
Index Terms:
information retrieval MEDLINE database automatic text categorization detailed analysis
document retrieval technique instance-based learning learning paradigm real-world
document collections retrieval feedback retrieval quality text retrieval
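The instance-based learning the abstract builds on can be illustrated by a minimal k-nearest-neighbour categorizer over bag-of-words vectors: a new document takes the majority category of its most similar training documents. The mini-corpus and categories below are invented for illustration; a real system would add tf-idf weighting and, as in the paper, retrieval feedback.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_categorize(text, training, k=3):
    """Majority vote over the k training documents most similar to text."""
    query = Counter(text.lower().split())
    scored = sorted(training,
                    key=lambda d: cosine(query, Counter(d[0].lower().split())),
                    reverse=True)
    votes = Counter(cat for _, cat in scored[:k])
    return votes.most_common(1)[0][0]

training = [
    ("aspirin reduces heart attack risk", "cardiology"),
    ("beta blockers treat heart failure", "cardiology"),
    ("antibiotics cure bacterial infection", "infectious"),
    ("vaccine prevents viral infection", "infectious"),
]
print(knn_categorize("heart attack treatment with aspirin", training, k=3))
```

Because classification reduces to similarity search, the same machinery doubles as a retrieval component, which is the connection to text retrieval the abstract exploits.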
A research on Web resources automatic classification
using SVMs
Cai Wei Wang Yongcheng Yin Zhonghang Zou Tao
Shanghai Jiao Tong Univ.;
This paper appears in: Intelligent Control and Automation, 2002.
Proceedings of the 4th World Congress on
On page(s): 1359- 1363 vol.2
Volume: 2, 2002
Number of Pages: 4 vol. 3353
INSPEC Accession Number: 7412373
Abstract:
With the rapid growth of Web information, text categorization has become an
important research field for the management of Internet information. Most
existing methods are based on traditional statistics, which provide guarantees
only as the sample size tends to infinity, so they may not work well in
practical cases with limited samples and can easily lead to over-fitting. This
paper theoretically analyzes the cause of over-fitting, introduces the
conditions under which it occurs, and presents a method to address it. We
introduce SVMs, a method that avoids over-fitting, based on statistical
learning theory and suited to a limited number of Internet news examples.
Index Terms:
Bayes methods Internet decision theory learning (artificial intelligence) learning
automata text analysis Internet information management SVMs Web information Web
resources automatic classification statistical learning theory support vector machines text
categorization
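The over-fitting resistance the abstract attributes to SVMs comes from minimizing a regularized hinge loss (maximizing the margin). As an illustrative sketch only (not the paper's solver), a linear classifier can be trained on that objective by stochastic subgradient descent; the two-dimensional "documents" below are invented toy data.

```python
import random

def train_linear_svm(data, dim, lam=0.01, epochs=200):
    """Minimize lam/2*||w||^2 + hinge loss by stochastic subgradient steps."""
    w = [0.0] * dim
    random.seed(0)
    for t in range(1, epochs + 1):
        x, y = random.choice(data)      # y in {-1, +1}
        eta = 1.0 / (lam * t)           # decreasing step size
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        w = [wi * (1 - eta * lam) for wi in w]   # regularization shrinkage
        if margin < 1:                  # hinge-loss subgradient is active
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# Linearly separable toy "documents" as 2-term frequency vectors.
data = [([3.0, 0.0], 1), ([2.5, 0.5], 1),
        ([0.0, 3.0], -1), ([0.5, 2.5], -1)]
w = train_linear_svm(data, dim=2)
print(predict(w, [2.0, 0.1]), predict(w, [0.1, 2.0]))  # separates the classes
```

The regularization term caps model complexity regardless of sample size, which is the statistical-learning-theory point the abstract makes about limited training examples.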
IR 37
Text classification and keyword extraction by learning
decision trees
Sakakibara, Y. Misue, K. Koshiba, T.
Fujitsu Lab., Ltd., Numazu, Shizuoka;
This paper appears in: Artificial Intelligence for Applications, 1993.
Proceedings., Ninth Conference on
03/01/1993 -03/05/1993, 1-5 Mar 1993
Location: Orlando, FL , USA
On page(s): 466
1-5 Mar 1993
References Cited: 0
INSPEC Accession Number: 4851079
Abstract:
Summary form only given. The authors propose a completely new approach to
the problem of text classification and automatic keyword extraction by using
machine learning techniques. They introduce a class of representations for
classifying text data based on decision trees, and present an algorithm for
learning it inductively. The algorithm does not need any natural language
processing technique, and is robust to noisy data. It is shown that the learning
algorithm can be used for automatic extraction of keywords for text retrieval and
automatic text categorization. Some experimental results on the use of the
algorithm are reported.
Index Terms:
classification learning (artificial intelligence) linguistics natural languages automatic
keyword extraction automatic text categorization decision trees learning machine
learning natural language processing noisy data text classification text retrieval
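The core quantity behind learning decision trees over keyword tests, as in the abstract above, is information gain: choose the word whose presence/absence split best reduces label entropy. The documents and keywords below are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, word):
    """docs: list of (set_of_words, label). Gain of splitting on word."""
    labels = [y for _, y in docs]
    with_w = [y for words, y in docs if word in words]
    without = [y for words, y in docs if word not in words]
    split = (len(with_w) / len(docs)) * entropy(with_w) + \
            (len(without) / len(docs)) * entropy(without)
    return entropy(labels) - split

docs = [
    ({"stock", "market", "rally"}, "finance"),
    ({"stock", "prices", "fall"}, "finance"),
    ({"match", "goal", "rally"}, "sports"),
    ({"team", "wins", "goal"}, "sports"),
]
best = max({"stock", "match", "rally"}, key=lambda w: information_gain(docs, w))
print(best)  # -> stock
```

A tree learner applies this test recursively; the words chosen at the top of the tree are natural candidates for the automatically extracted keywords the abstract mentions.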
IR 38
Automatic category generation for text documents by
self-organizing maps
Hsin-Chang Yang Chung-Hong Lee
Dept. of Inf. Manage., Chang Jung Univ., Tainan ;
This paper appears in: Neural Networks, 2000. IJCNN 2000, Proceedings of
the IEEE-INNS-ENNS International Joint Conference on
07/24/2000 -07/27/2000, 2000
Location: Como , Italy
On page(s): 581-586 vol.3
Volume: 3, 2000
References Cited: 8
Number of Pages: 6 vol.(xxxvii+371+xxxvi+313+679+630+669+659)
INSPEC Accession Number: 6703760
Abstract:
One important task for text data mining is automatic text categorization, which
assigns text documents to predefined categories according to their correlations.
Traditionally, these categories, as well as the correlations among them, are
determined by human experts. In this paper, we devised a novel approach to
automatically generate categories. The self-organizing map model is used to
generate two maps, namely the word cluster map and the document cluster map,
in which a neuron represents a cluster of words or documents, respectively. Our
approach is to analyze the document cluster map to find centroids of some
super-clusters. We also devised a method to select the category term from the
word cluster map. The hierarchical structure of categories may be generated by
recursively applying the same method. Text categorization is the natural
consequence of such an automatic category generation process.
Index Terms:
category theory data mining generalisation (artificial intelligence) self-organising feature
maps text analysis automatic category generation document cluster map hierarchical
structure self-organizing maps text categorisation text data mining text document word
cluster map
Using rough sets to construct sense type decision
trees for text categorization
Bleyberg, M.Z. Elumalai, A.
Comput. & Inf. Sci. Dept., Kansas State Univ., Manhattan, KS;
This paper appears in: IFSA World Congress and 20th NAFIPS International
Conference, 2001. Joint 9th
07/25/2001 -07/28/2001, 25-28 July 2001
Location: Vancouver, BC , Canada
On page(s): 19-24 vol.1
Volume: 1, 25-28 July 2001
References Cited: 8
Number of Pages: 5 vol.(xxxviii+xxii+3100)
INSPEC Accession Number: 7081704
Abstract:
Accurate text categorization is needed for efficient and effective text retrieval,
search and filtering. Finding appropriate categories and manually assigning them
to existing documents is very laborious. The paper shows a simple procedure for
automatic extraction of atomic sense types (semantic categories) from
documents based on rough sets. The atomic sense types are nodes of a sense
type decision tree, which represents a taxonomy.
Index Terms:
decision trees information retrieval rough set theory text analysis atomic sense types
automatic extraction rough sets semantic categories sense type decision tree sense type
decision trees taxonomy text categorization text filtering text retrieval text search
IR 39
Automatic text categorization: case study
Correa, R.F. Ludermir, T.B.
This paper appears in: Neural Networks, 2002. SBRN 2002. Proceedings.
VII Brazilian Symposium on
On page(s): 150
2002
Number of Pages: xiii+270
INSPEC Accession Number: 7568947
Abstract:
Text categorization is a process of classifying documents with regard to a group
of one or more existent categories according to themes or concepts present in
their contents. The most common application of it is in information retrieval
systems (IRS) for document indexing. A way to make text categorization viable
is to use machine-learning algorithms to automate text classification, allowing
it to be carried out quickly, concisely, and on a broad scale. The objective of
this work is to present and compare the results of
experiments on text categorization using artificial neural networks of multilayer
perceptron and self-organizing map types, and traditional machine-learning
algorithms used in this task: C4.5 decision tree, PART decision rules and Naive
Bayes classifier.
Index Terms:
classification decision trees information retrieval learning (artificial intelligence) multilayer
perceptrons self-organising feature maps text analysis Naive Bayes classifier PART
decision rules decision tree document classification document indexing information
retrieval systems machine-learning multilayer perceptron neural networks self-organizing
maps text categorization
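One of the traditional algorithms the abstract compares against neural networks, the Naive Bayes classifier, is compact enough to sketch directly: class-conditional word probabilities with Laplace smoothing, combined in log space. The tiny corpus below is invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, label). Returns counts needed at test time."""
    class_counts = Counter(y for _, y in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, y in docs:
        words = text.lower().split()
        word_counts[y].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab, len(docs)

def classify_nb(text, model):
    class_counts, word_counts, vocab, n = model
    best, best_lp = None, -math.inf
    for y in class_counts:
        lp = math.log(class_counts[y] / n)          # log prior
        total = sum(word_counts[y].values())
        for w in text.lower().split():
            # Laplace smoothing so unseen words do not zero out the class
            lp += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

docs = [("cheap pills buy now", "spam"), ("win money now", "spam"),
        ("meeting agenda attached", "ham"), ("project status report", "ham")]
model = train_nb(docs)
print(classify_nb("buy cheap pills", model))  # -> spam
```

Its simplicity and speed are why it remains a standard baseline in comparisons like the one the paper reports.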
IR40
An incremental approach to text representation,
categorization, and retrieval
O'Neil, P.
Rome Lab.;
This paper appears in: Document Analysis and Recognition, 1997.,
Proceedings of the Fourth International Conference on
08/18/1997 -08/20/1997, 18-20 Aug 1997
Location: Ulm , Germany
On page(s): 714-717 vol.2
Volume: 2, 18-20 Aug 1997
References Cited: 8
Number of Pages: 2 vol. xxiv+1119
INSPEC Accession Number: 5704673
Abstract:
Efficient and accurate information retrieval is a goal of just about everyone.
Whether you are looking for information on the Internet, a book or article in the
library, satellite imagery of missile silos, or a recipe for dinner, finding exactly
what you want or need, even if you know exactly what you are looking for, can be
an imposing and most difficult task. Many current techniques require an intimate
understanding of the actual processes involved. The method presented in this
paper provides for an automatic representation of text data by vectors, which can
then be manipulated to categorize and organize the data. Information can be
retrieved without knowledge of the underlying process. The user can ask for
information using normal discourse. This technology can also be applied to data
mining and visualization.
Index Terms:
data structures data visualisation information retrieval data mining data visualization
incremental approach information retrieval missile silos normal discourse text
categorization text representation text retrieval
IR 41
Managing semantic content for the Web
Sheth, A. Bertram, C. Avant, D. Hammond, B. Kochut, K. Warke, Y.
Dept. of Comput. Sci., Georgia Univ., Athens, GA ;
This paper appears in: Internet Computing, IEEE
On page(s): 80- 87
Volume: 6, Issue: 4, Jul/Aug 2002
ISSN: 1089-7801
INSPEC Accession Number: 7344439
Abstract:
By associating meaning with content, the Semantic Web will facilitate search,
interoperability, and the composition of complex applications. The paper
discusses the Semantic Content Organization and Retrieval Engine (SCORE, see
www.voquette.com), which is based on research transferred from the University
of Georgia's Large Scale Distributed Information Systems. SCORE belongs to a
new generation of technologies for the emerging Semantic Web. It provides
facilities to define ontological components that software agents can maintain.
These agents use regular expression based rules in conjunction with various
semantic techniques to extract ontology-driven metadata from structured and
semistructured content. Automatic classification and information-extraction
techniques augment these results and also let the system deal with unstructured
text.
Index Terms:
Internet classification information resources information retrieval meta data search
engines software agents Internet SCORE Semantic Content Organization and Retrieval
Engine Semantic Web classification expression based rules information extraction
interoperability ontological components ontology-driven metadata searching semantic
search engine software agents
IR42
Three term weighting and classification algorithms in
text automatic classification
Qian Diao
Shanghai Jiaotong Univ.;
This paper appears in: High Performance Computing in the Asia-Pacific
Region, 2000. Proceedings. The Fourth International
Conference/Exhibition on
05/14/2000 -05/17/2000, 2000
Location: Beijing , China
On page(s): 629-630 vol.2
Volume: 2, 2000
References Cited: 2
Number of Pages: 2 vol. xxiv+1179
INSPEC Accession Number: 6604932
Abstract:
Three automatic text classification algorithms are presented: a Bayes method
based on Bayes' theorem and IDF (inverse document frequency), a VSM method
based on Shannon entropy, and a fuzzy method based on fuzzy theory.
Furthermore, a way of combining term weighting methods with the three
classification algorithms is also provided in the paper.
Index Terms:
Bayes methods classification entropy fuzzy set theory text analysis Bayes method
Bayes theorem Invert Document Frequency Shannon entropy VSM automatic text
classification algorithms fuzzy method fuzzy theory term weighting algorithms
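The IDF-based term weighting the abstract refers to is simple to state: tf-idf multiplies a term's in-document frequency by the log of how rare the term is across the collection, down-weighting words that occur everywhere. The mini-corpus below is invented for illustration.

```python
import math

def tf_idf(term, doc, corpus):
    """doc: list of tokens; corpus: list of such documents."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)      # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock prices rose sharply".split(),
]
# "stock" occurs in 1 of 3 documents, "the" in 2 of 3:
print(round(tf_idf("stock", corpus[2], corpus), 3))  # -> 1.099
print(round(tf_idf("the", corpus[0], corpus), 3))    # -> 0.811
```

Despite "the" appearing twice in its document, the rare term "stock" ends up weighted higher, which is exactly the discriminative effect classifiers built on these weights rely on.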
IR 43
The TaxGen framework: automating the generation of
a taxonomy for a large document collection
Muller, A. Dorre, J. Gerstl, P. Seiffert, R.
Dept. of Software Solutions Dev., IBM Germany;
This paper appears in: System Sciences, 1999. HICSS-32. Proceedings of
the 32nd Annual Hawaii International Conference on
01/05/1999 -01/08/1999, 1999
Location: Maui, HI , USA
On page(s): 9 pp.
1999
References Cited: 12
Number of Pages: liii+341
INSPEC Accession Number: 6182111
Abstract:
Text mining is an active area of research and development, which combines and
expands techniques found in related areas like information retrieval,
computational linguistics and data mining to perform an analysis of large corpora
of digital documents. This paper describes the TaxGen text mining project carried
out at the IBM Software Development Lab. at Boeblingen, Germany. The goal of
TaxGen was the automatic generation of a taxonomy for a collection of previously
unstructured documents, namely a set of 73,000 news wire documents spanning
one year.
Index Terms:
classification computational linguistics data mining information retrieval text analysis
very large databases IBM Software Development Lab., Boeblingen, Germany TaxGen text
mining project automatic taxonomy generation computational linguistics data mining
digital documents information retrieval large document collection news wire documents
text corpus analysis unstructured documents
Automatic labeling of self-organizing maps for
information retrieval
Merkl, D. Rauber, A.
Inst. fur Softwaretech., Tech. Univ. Wien;
This paper appears in: Neural Information Processing, 1999. Proceedings.
ICONIP '99. 6th International Conference on
11/16/1999 -11/20/1999, 1999
Location: Perth, WA , Australia
On page(s): 37-42 vol.1
Volume: 1, 1999
References Cited: 17
Number of Pages: 3 vol. xv+1240
INSPEC Accession Number: 6605092
Abstract:
The self-organizing map is a very popular unsupervised neural network model for
the analysis of high-dimensional input data as in information retrieval
applications. However, the interpretation of the map requires much manual effort,
especially as far as the analysis of the learned features and the characteristics of
identified clusters is concerned. We present our novel LabelSOM method which,
based on the features learned by the map, automatically selects the most
descriptive features of the input patterns mapped onto a particular unit of the
map, thus making the characteristics of the various clusters within the map
explicit. We demonstrate the benefits of this approach on an example from text
classification using a real-world document archive. In this particular case, the
features correspond to keywords describing the contents of a document. The
benefit of this approach is that the various document clusters are characterized in
terms of shared keywords, thus making it easy for the user to explore the
contents of an unknown document archive.
Index Terms:
classification information retrieval self-organising feature maps unsupervised learning
LabelSOM method automatic labeling document archive high-dimensional input data
analysis information retrieval keywords self-organizing maps text classification
unsupervised neural network model
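The LabelSOM selection step the abstract describes can be approximated in miniature: for a trained map unit, rank the feature dimensions by the quantization error accumulated over the input patterns mapped onto that unit, and keep the well-matching, high-weight dimensions as labels. The vocabulary, weight vector, and patterns below are invented toy values, not the paper's data.

```python
def label_unit(weight, patterns, vocab, k=2, min_w=0.5):
    """Pick k labels for a map unit: features whose weight-vector entry is
    high and whose accumulated quantization error over the unit's mapped
    input patterns is low."""
    errs = [sum((p[i] - weight[i]) ** 2 for p in patterns)
            for i in range(len(weight))]
    candidates = [i for i in range(len(weight)) if weight[i] >= min_w]
    picks = sorted(candidates, key=lambda i: errs[i])[:k]
    return [vocab[i] for i in picks]

vocab = ["neural", "network", "market", "election"]
weight = [0.9, 0.8, 0.05, 0.0]          # the unit's weight vector
patterns = [[1.0, 0.9, 0.1, 0.0],       # input documents mapped onto it
            [0.8, 0.7, 0.0, 0.0]]
print(label_unit(weight, patterns, vocab))  # -> ['neural', 'network']
```

Run over every unit of the map, this yields the shared-keyword cluster descriptions that the abstract argues make an unknown archive explorable.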
Automatic document classification based on
probabilistic reasoning: model and performance
analysis
Wai Lam Kon-Fan Low
Dept. of Syst. Eng. & Eng. Manage., Chinese Univ. of Hong Kong, Shatin;
This paper appears in: Systems, Man, and Cybernetics, 1997.
'Computational Cybernetics and Simulation'., 1997 IEEE International
Conference on
10/12/1997 -10/15/1997, 12-15 Oct 1997
Location: Orlando, FL , USA
On page(s): 2719-2723 vol.3
Volume: 3, 12-15 Oct 1997
References Cited: 6
Number of Pages: 5 vol. 4535
INSPEC Accession Number: 5753489
Abstract:
We develop a new approach to text classification based on automatic feature
extraction and probabilistic reasoning. The knowledge representation used to
perform this task is known as a Bayesian inference network. A Bayesian network
text classifier is automatically constructed from a set of training text
documents. We have conducted a series of experiments on two text document
corpora, namely CACM and Reuters, to analyze the performance of our approach;
these are described in the paper.
Improving the classification accuracy of automatic
text processing systems using context vectors and
back-propagation algorithms
Farkas, J.
Centre for Inf. Technol. Innovation, Ind. Canada, Laval, Que.;
This paper appears in: Electrical and Computer Engineering, 1996.
Canadian Conference on
05/26/1996 -05/29/1996, 26-29 May 1996
Location: Calgary, Alta. , Canada
On page(s): 696-699 vol.2
Volume: 2, 26-29 May 1996
References Cited: 13
Number of Pages: 2 vol. xl+1026
INSPEC Accession Number: 5456375
Abstract:
We analyze some of the benefits of combining the context-vector representation
of documents with the back-propagation paradigm for document classification.
We discuss an implementation of this architecture, called NeuroFile, which
combines automatic document classification with similarity-based, as well as
Boolean retrieval facilities in a single electronic filing system. The quality of
performance of NeuroFile is compared with an earlier system called NeuroClass.
We show that NeuroFile achieves a 9% classification improvement over
NeuroClass.
Index Terms:
backpropagation classification document image processing feedforward neural nets
information retrieval word processing Boolean retrieval facilities NeuroClass NeuroFile
automatic document classification automatic text processing systems backpropagation
algorithms classification accuracy context vectors context-vector representation
documents electronic filing system performance similarity based retrieval facilities
Towards classifying full-text using recurrent neural
networks
Farkas, J.
Centre for Inf. Technol. Innovation, Ind. Canada, Laval, Que.;
This paper appears in: Electrical and Computer Engineering, 1995.
Canadian Conference on
09/05/1995 -09/08/1995, 5-8 Sep 1995
Location: Montreal, Que. , Canada
On page(s): 511-514 vol.1
Volume: 1, 5-8 Sep 1995
References Cited: 12
INSPEC Accession Number: 5205006
Abstract:
This paper describes an automatic document classification system called
NeuroClass, developed for the air transportation field of Transport Canada. The
properties of the system show that for the specific domain for which NeuroClass
was developed, recurrent neural networks as developed by Elman (1990) can be
used to build systems that classify natural language full-text automatically and
reliably with a degree of accuracy proportional to the level of class adherence of
the text involved.
IR 44
Acquisition of linguistic patterns for knowledge-based
information extraction
Jun-Tae Kim Moldovan, D.I.
Dept. of Comput. Eng., Dongguk Univ., Seoul;
This paper appears in: Knowledge and Data Engineering, IEEE Transactions
on
On page(s): 713-724
Volume: 7, Issue: 5, Oct 1995
ISSN: 1041-4347
References Cited: 35
INSPEC Accession Number: 5103896
Abstract:
The paper presents an automatic acquisition of linguistic patterns that can be
used for knowledge based information extraction from texts. In knowledge based
information extraction, linguistic patterns play a central role in the recognition
and classification of input texts. Although the knowledge based approach has
been proved effective for information extraction on limited domains, there are
difficulties in construction of a large number of domain specific linguistic patterns.
Manual creation of patterns is time consuming and error prone, even for a small
application domain. To solve the scalability and the portability problem, an
automatic acquisition of patterns must be provided. We present the PALKA
(Parallel Automatic Linguistic Knowledge Acquisition) system that acquires
linguistic patterns from a set of domain specific training texts and their desired
outputs. A specialized representation of patterns called FP structures has been
defined. Patterns are constructed in the form of FP structures from training texts,
and the acquired patterns are tuned further through the generalization of
semantic constraints. An inductive learning mechanism is applied in the
generalization step. The PALKA system has been used to generate patterns for
our information extraction system developed for the fourth Message
Understanding Conference (MUC-4).
Index Terms:
knowledge acquisition knowledge based systems learning by example linguistics natural
languages pattern recognition word processing FP structures PALKA Parallel
Automatic Linguistic Knowledge Acquisition automatic acquisition domain specific linguistic
patterns domain specific training text input text knowledge based information extraction
knowledge based natural language processing knowledge-based information extraction
linguistic pattern acquisition semantic constraints
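The "linguistic patterns" of the abstract map surface text onto the slots of an extraction template. As a heavily simplified, regex-based stand-in for such a pattern (real FP structures generalize over semantic classes rather than literal strings), consider a hypothetical MUC-style event pattern "<perpetrator> bombed <target>"; the sentence and slot names are invented for illustration.

```python
import re

# Hypothetical surface pattern for a terrorism-domain event template.
pattern = re.compile(r"(?P<perp>[\w\s]+?) bombed (?P<target>[\w\s]+)")

m = pattern.search("Unknown guerrillas bombed the power station yesterday")
if m:
    print({"perpetrator": m.group("perp").strip(),
           "target": m.group("target").strip()})
```

PALKA's contribution is learning and generalizing such patterns automatically from annotated training texts, precisely because hand-writing one regex per construction does not scale.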