Reference for MetaClass
This file is an advanced reference for the MetaClass database. Terms and abbreviations used in the
database appear in italics. The database consists of 5 tables:
1) Datasets
Datasets are described in this table in their standard form, i.e. the form in which they appear in the UCI
repository or, if they are not in any repository, their most common form. Under the definition adopted
here, a dataset is defined by its class distribution (other characteristics may vary), so this standard
form determines the values of the fields related to dataset size (training and test), number of variables,
and type of variables.
The fields of the table are, with their coding between brackets:
- Identification number [ID]: In the form d_X where X is a number.
- Name [Name]: Standard dataset name, shortened.
- Origin [Origin]: Physical provenance of the dataset. Simplified to UCI, Statlog (for datasets which
are in Statlog but not in the UCI repository), or Other.
- Domain [Domain]: General domain of the dataset. Every dataset is different, so any process of
categorisation is necessarily subjective. We chose the following categories:
* Credit: Credit scoring datasets, i.e. classification of credit applicants as 'good' or 'bad'.
Examples: Australian credit [Crx] and German credit [GerCr], both from Statlog.
* Medicine: Medical diagnosis datasets, where the goal is usually to determine whether a
patient is at risk or not for a specific disease (and which physical attributes affect this risk).
Examples: Cleveland heart disease (of which a 5-class version [HeartCl5] and a 2-class version
[HeartCl2] have been used), Wisconsin breast cancer [BreastW], Pima diabetes [Pima].
* Image: Image recognition (classification of a set of pixels) or pixel-by-pixel classification
like in satellite imaging.
Examples: Vehicle recognition [Vehicle], Satimage [SatIm], image segmentation [Segm].
* Speech: Speech generation (production of stress and/or phonemes from alphabetic
representation) or speech recognition.
Examples: the multiple variations around the UCI Nettalk dataset ([Nettalkfull], [Nettalk-A], [Netpho],
[Netstr], [Nettyp]...).
* Physics: Datasets related to the control of some physical or industrial process, and those
where the physical properties of an object are analysed to classify the object.
Examples: shuttle controls [Shuttle], sonar waves [Sonar], glass recognition from chemical components
[Glass].
* Genetic: Molecular biology datasets, which consist of sequences of nucleotides.
Examples: Splice junction DNA [Splice], E. Coli promoter gene sequences [Promoters].
* Artificial: Datasets invented to test classification methods, and datasets about games.
Examples: CART book's example datasets: [Waveform21] and [Waveform40], [Led7] and [Led24]; the
MONK series ([MONK1], [MONK2], [MONK3]).
* Other: Datasets from various other sources: psychology, biology, sociology... Sometimes very
similar to the Artificial domain.
Examples: the 1984 United States Congressional Voting Records ([Votes16], and its variation
[Votes15] where the physician-freeze attribute, which discriminates the 2 classes too well, has been
removed), the agaricus-lepiota dataset [Mushroom], Fisher's Iris data [Iris].
- Sub-domain [SubDomain]: Indicates potential similarity between datasets, typically when they
present different class distributions but are otherwise very similar. For example, the UCI Heart
Cleveland dataset has 5 classes, which were reduced to 2 for Statlog and numerous other studies;
by our definition these two versions constitute two separate datasets, but with the same sub-domain,
coded Heart. Blank if the dataset does not show similarity to any other.
- Training set size [Train]: Meaningless if the dataset is artificial or simulated and is not widely used
with the same size (e.g. the Waveform and Led datasets).
- Test set size [Test]: Same as training set size; also meaningless if there is no widely used distinct
test set.
- Number of classes [#Class].
- Total number of variables [#Feat].
- Number of categorical variables [#Cat].
- Number of binary variables [#Bin].
- Number of numerical variables [#Num].
- Articles [Articles]: List of studies where results related to the dataset have been published (in the
form of a list of study IDs ``aX,aY,aZ...'').
- Minimum known error rate [Min-err].
- Maximum known error rate [Max-err].
- Number of known error rates [Num-err].
- Default error rate [Def-err]: Error rate of the default rule, which assigns every instance to the most
frequent class of the training set; derived from the test set if possible, otherwise from the training set
(both derivations would normally give similar results). See the sketch after this list.
- Short description [Full name]: Gives a small amount of information about what the dataset
represents.
- Special remarks [Special]: Helps the user identify and retrieve the dataset if necessary.
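To make the table layout concrete, here is a minimal sketch of how the Datasets table could be declared
and how [Def-err] could be derived from class frequencies, assuming an SQLite back end. The schema,
SQL types, and the helper function are illustrative assumptions, not part of the original database;
columns such as [#Class] are renamed (NClass etc.) because # is not a valid bare SQL identifier.

    import sqlite3
    from collections import Counter

    # Hypothetical schema for the Datasets table; column names follow the
    # codings above, the SQL types are assumptions.
    conn = sqlite3.connect("metaclass.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS Datasets (
            ID        TEXT PRIMARY KEY,   -- d_X
            Name      TEXT,
            Origin    TEXT,               -- UCI / Statlog / Other
            Domain    TEXT,
            SubDomain TEXT,
            Train     INTEGER,
            Test      INTEGER,
            NClass    INTEGER,            -- [#Class]
            NFeat     INTEGER,            -- [#Feat]
            NCat      INTEGER,            -- [#Cat]
            NBin      INTEGER,            -- [#Bin]
            NNum      INTEGER,            -- [#Num]
            Articles  TEXT,               -- "aX,aY,aZ..."
            MinErr    REAL,               -- [Min-err]
            MaxErr    REAL,               -- [Max-err]
            NumErr    INTEGER,            -- [Num-err]
            DefErr    REAL,               -- [Def-err]
            FullName  TEXT,
            Special   TEXT
        )
    """)

    def default_error_rate(labels):
        """Error rate of the rule that assigns every instance to the most
        frequent class; computed here on a single list of class labels."""
        majority_count = Counter(labels).most_common(1)[0][1]
        return 1.0 - majority_count / len(labels)

    # e.g. a 2-class dataset with 65% of instances in the majority class:
    print(default_error_rate(["a"] * 65 + ["b"] * 35))  # 0.35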
2) Paradigms
We define study paradigms as the way studies use datasets to produce results. Whereas the Datasets
table characterizes the usual form in which datasets are known, the Paradigms table describes the
specifics of how a dataset was used. Two different studies using the same dataset (in the sense we
define them) would have two different paradigms; a paradigm would be shared at most by all the
results related to a given dataset within a single study (see the query sketch after the field list below).
The fields of the table are, with their coding between brackets:
- Identification number [ID]: In the form p_X where X is a number.
- Result type [Type]: Error (Error rate) / Loss (Error rate with non-identity misclassification matrix) /
ALS (Average Logarithmic Score) / AQS (Average Quadratic Score) / DT and DQ (reliability measures
specific to Titterington et al. (1981), see the Articles table for the exact reference).
- Estimation method [Estimation]: tt (Train set and test set) / cv (Cross-validation) / loo (Leave-one-out).
- Training set size [Train]: Actual number of distinct instances on which methods were trained.
- Test set size [Test]: Actual number of distinct instances on which methods were assessed.
- Number of cross-validation folds [Cv]: Meaningful only if the estimation method is cross-validation.
- Number of test iterations [Iterations]: Records the number of times the whole training/testing process
was repeated, usually the number of train/test splits or cross-validations performed (often equal to
one).
- Total number of variables [#Feat]. Blank if the dataset has been used in its standard form.
- Number of categorical variables [#Cat]. Same remark as [#Feat].
- Number of binary variables [#Bin]. Same remark as [#Feat].
- Number of numerical variables [#Num]. Same remark as [#Feat].
- Related dataset [Dataset]: ID of the dataset in the Datasets table corresponding to the paradigm.
- Related study [Article]: ID of the study in the Articles table corresponding to the paradigm.
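Continuing the sketch above, here is a hypothetical Paradigms table and a query that joins it to the
Datasets table, e.g. to list every paradigm that estimated error by 10-fold cross-validation. Again, the
schema and types are assumptions for illustration only.

    import sqlite3

    conn = sqlite3.connect("metaclass.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS Paradigms (
            ID         TEXT PRIMARY KEY,   -- p_X
            Type       TEXT,               -- Error / Loss / ALS / AQS / DT / DQ
            Estimation TEXT,               -- tt / cv / loo
            Train      INTEGER,
            Test       INTEGER,
            Cv         INTEGER,            -- folds; meaningful only for cv
            Iterations INTEGER,
            NFeat      INTEGER,            -- [#Feat]; blank if standard form
            Dataset    TEXT REFERENCES Datasets(ID),
            Article    TEXT                -- study ID, a_X
        )
    """)

    # All paradigms that estimated error by 10-fold cross-validation,
    # with the name of the dataset they relate to:
    rows = conn.execute("""
        SELECT p.ID, d.Name
        FROM Paradigms AS p JOIN Datasets AS d ON p.Dataset = d.ID
        WHERE p.Estimation = 'cv' AND p.Cv = 10
    """).fetchall()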
3) Articles
This table stores the data related to the studies themselves. It is used mainly for bookkeeping.
The fields of the table are, with their coding between brackets:
- Identification number [ID]: In the form a_X where X is a number.
- Name [Name]: In the form ``First author's name XX'' where XX are the last two digits of the year
in which the study was made public.
- Selected [Sel]: Does the study include results which have been selected in the meta-analysis? (y/n).
- Reason for exclusion [Reason]: Details why the whole study was not included (if relevant).
- Reference [Reference]: How the study was found. Usually in the form ``Ref-XX'' where XX is the ID
of the referring study, or the name of its first author if that study is not in the database itself. Can be
Other if the study was discovered through personal communication, proximity to another referenced
study in a collection of articles, random search...
- Type [Type]: Article / InProceedings / Book / Misc.
- List of authors' names [Authors]: In BibTeX format (as are all the following fields; see the sketch after this list).
- Title [Title]
- Origin [Origin]: Name of the journal, book, or URL in which the study was made public.
- Year [Year]
- Pages [Pages]
- Volume [Volume]
- Number [Number]
- Editors [Editors]
- Publisher [Publisher]
- Address [Address]
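Because the bibliographic fields are stored in BibTeX format, an Articles row maps onto a BibTeX
entry almost directly. A minimal sketch, where the record values are invented placeholders rather than
actual database content:

    # Hypothetical export of one Articles row to a BibTeX entry; the keys
    # mirror the field codings above, the sample values are placeholders.
    record = {
        "ID": "a_1", "Type": "Article", "Authors": "Smith, J. and Doe, A.",
        "Title": "An example study", "Origin": "Some Journal",
        "Year": "1995", "Pages": "1--10", "Volume": "7",
    }
    mapping = [("author", "Authors"), ("title", "Title"),
               ("journal", "Origin"), ("year", "Year"),
               ("pages", "Pages"), ("volume", "Volume")]
    entry = "@{}{{{},\n".format(record["Type"].lower(), record["ID"])
    entry += ",\n".join("  {} = {{{}}}".format(b, record[f]) for b, f in mapping)
    entry += "\n}"
    print(entry)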
4) Methods
Methods are treated like dataset paradigms, i.e. they are always specific to studies. We adopted this
view since methods are always applied in different ways across studies, and it leaves the user of the
database free to group methods according to his or her own criteria. We do, however, provide one kind
of subjective grouping (the [Type] column), which may be used in analyses; see the query sketch after
the field list below.
The fields of the table are, with their coding between brackets:
- Identification number [ID]: In the form m_X where X is a number.
- Name [Name]: Method name, kept as close as possible to the one reported in the original study.
(Warning: this may not be unique, since different studies often use the same abbreviations; check
against the study ID.)
- Type [Type]: Common method name.
- Description [Description]: Gives some more information about what the method is, especially if it is
different from the `standard' implementation (which is given in [Type]).
- Article [Article]: Study ID where the method appears.
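Continuing the sketch, grouping methods by the subjective [Type] column shows how many
study-specific variants of each common method the database records. The Methods schema is again an
illustrative assumption.

    import sqlite3

    conn = sqlite3.connect("metaclass.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS Methods (
            ID          TEXT PRIMARY KEY,  -- m_X
            Name        TEXT,              -- name as reported in the study
            Type        TEXT,              -- common method name
            Description TEXT,
            Article     TEXT               -- study ID, a_X
        )
    """)

    # Count the study-specific variants of each common method:
    for method_type, n in conn.execute(
            "SELECT Type, COUNT(*) FROM Methods GROUP BY Type ORDER BY 2 DESC"):
        print(method_type, n)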
5) Results
This is the core table of the database. Results and estimated standard errors are stored here, along with
scaled results (to avoid recomputing them repeatedly). Each result corresponds to a dataset, a dataset
paradigm, a method, and a study.
The fields of the table are, with their coding between brackets:
- Article ID [Article]
- Dataset ID [Dataset]
- Method ID [Method]
- Paradigm ID [Paradigm]
- Result on test set [Result]
- Result scaled between 0 and 1 by dataset [Scaled]: For each dataset, the best known result becomes
0, the worst 1, and the other results are linearly scaled in between (see the sketch after this list).
- Estimated standard error [SE]: Meaningful when there has been more than one iteration in the
estimation procedure, and if available from the study.
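As an illustration of the [Scaled] field, here is a minimal sketch of the per-dataset 0-1 scaling,
assuming all raw results for one dataset are collected in a list:

    # Per-dataset 0/1 scaling: the best known result maps to 0, the worst
    # to 1, everything else linearly in between. For error rates the best
    # result is the minimum; a degenerate dataset where all results are
    # equal maps to 0.
    def scale_results(results):
        lo, hi = min(results), max(results)
        if hi == lo:
            return [0.0 for _ in results]
        return [(r - lo) / (hi - lo) for r in results]

    print(scale_results([0.10, 0.25, 0.40]))  # [0.0, 0.5, 1.0]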