Error rate estimation if the dataset available for machine learning results from cluster or stratified sampling

Jean-Hugues CHAUCHAT
Université Lumière-Lyon 2 - Faculté de Sciences Economiques et de Gestion
Laboratoire ERIC (Equipe de Recherche sur l'Ingénierie des Connaissances)
5 avenue Pierre Mendès-France - 69676 Bron Cedex - FRANCE
jean-hugues.chauchat@univ-lyon2.fr
1) Error rate estimation if the dataset available for machine learning results from
cluster sampling, or stratified sampling. A simulation study and a real application
of automatic spoken language identification.
KEYWORDS: Data mining, language identification, cross-validation, clustered dataset, stratified
dataset.
ABSTRACT. If the dataset available for machine learning results from cluster sampling, the usual cross-validation
error rate estimate can be biased and misleading. An adapted cross-validation is described for this case. Using
simulation, the sampling distributions of the generalization error rate estimate, under the cluster and the
simple random sampling hypotheses, are compared to the true value. The results highlight the impact of the
sampling design on inference: clustering clearly has a significant impact, and the split between learning
set and test set should result from a random partition of the clusters, not from a random partition of the
examples. This result is confirmed on a real application of automatic spoken language identification. The
stratified case is also presented.
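The recommendation to partition clusters rather than examples can be sketched as follows. This is a minimal illustration, not the procedure used in the paper: the function name `cluster_kfold`, the round-robin assignment of clusters to folds, and the toy data (6 clusters of 4 examples each) are all assumptions made for the example.

```python
import random


def cluster_kfold(cluster_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which whole clusters are held out together.

    cluster_ids[i] is the cluster label of example i (hypothetical input format).
    """
    clusters = sorted(set(cluster_ids))
    rng = random.Random(seed)
    rng.shuffle(clusters)  # random partition of the clusters, not of the examples
    # Deal the shuffled clusters round-robin into the folds.
    fold_of = {c: k % n_folds for k, c in enumerate(clusters)}
    for fold in range(n_folds):
        test = [i for i, c in enumerate(cluster_ids) if fold_of[c] == fold]
        train = [i for i, c in enumerate(cluster_ids) if fold_of[c] != fold]
        yield train, test


# Toy data: 6 clusters, 4 examples per cluster.
speakers = [c for c in range(6) for _ in range(4)]
folds = list(cluster_kfold(speakers, n_folds=3, seed=42))

for train, test in folds:
    # No cluster contributes examples to both the learning set and the test set.
    assert not {speakers[i] for i in train} & {speakers[i] for i in test}
```

With an example-level split, examples from the same cluster would typically land on both sides of the partition, which is the situation the abstract warns against.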
2) Multidimensional Data Analysis, Seriation, Blockmodels and Bertin's Graphics
KEYWORDS: Graphical Software, Seriation, Blockmodel, Cluster Analysis, Correspondence
Analysis, Principal Components Analysis
The idea of permuting the rows and columns of a data matrix for the purpose of revealing hidden structure is
having an increasing influence in applied mathematics. Numerous applications are made in production
management, searching for "group technology" (see Kamrani et al. 1993 for a bibliographic survey; Arvindh &
Irani 1994; Mukhopadhyay & Gopalakrishnan 1995), as well as in archaeology, phytosociology, economics,
history and behavioural sciences (see references in Arabie 1991; Caraux 1984; Bertin 1967, 1981 and 1987).
To discover, analyse and display the structure of a data matrix, one can use either mathematical or
graphical approaches:
- Multidimensional Statistical Data Analysis (Principal Components, Correspondence, or Cluster Analysis),
or other algorithms (see, for instance, Marcotorchino 1987; Arvindh & Irani 1994);
- Graphics, which the eye, a powerful analysis tool, can read very efficiently to discover and show
similarities and contrasts between the matrix elements (rows or columns); Jacques Bertin established some rules
for building such graphics (Bertin 1967, 1981 and 1987).
We present a new exploratory method that integrates these two approaches, and AMADO (Risson et al.
1994), software that implements this method under Microsoft Windows or Macintosh.
AMADO works with any matrix of non-negative values: a contingency table, a logical table (which may
represent a graph), a symmetrical co-occurrence table, ...
Two examples are presented:
- an ecological data set: an items (species) by attributes (sites) matrix with binary entries denoting the
presence or absence of a particular species at a given site;
- a cross-classification of more than 10,000 US patents granted to France-based firms during the 1985-1990
period (Bergeron et al. 1996). The data matrix is a contingency table where N_ij = number of patents filed by
firms of industrial sector "i" in technological field "j", the field being the technological knowledge that
characterises the patents.
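As an illustration of how such a contingency table is assembled, the sketch below counts one (sector, field) pair per patent. The records, sector names and field names are invented for the example and do not come from the patent data described above.

```python
from collections import Counter

# Hypothetical records: one (industrial sector, technological field) pair per patent.
patents = [
    ("chemistry", "polymers"),
    ("chemistry", "polymers"),
    ("chemistry", "pharmaceuticals"),
    ("electronics", "semiconductors"),
    ("electronics", "polymers"),
]

counts = Counter(patents)
sectors = sorted({sector for sector, _ in patents})  # row labels i
fields = sorted({field for _, field in patents})     # column labels j

# table[i][j] = number of patents of sector i in technological field j.
table = [[counts[(s, f)] for f in fields] for s in sectors]

for sector, row in zip(sectors, table):
    print(sector, row)
```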
In every case, the graphics significantly improve the readability of the results of the multidimensional
statistical data analysis.
They provide the researcher who collected the data with an easily understandable visual representation of the
data structure, without using any list of means, variances, correlations, contributions, etc.
Each bit of information, each entry of the table is presented in its original form. Only the orders of the rows
and columns are changed, but everything is there.
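The idea of reordering rows and columns without altering any entry can be sketched with a simple barycentric heuristic: rows are sorted by the weighted mean position of their non-zero columns, then columns by the weighted mean position of their rows, until the orders stop changing. This is an assumed stand-in chosen for brevity, not the algorithm implemented in AMADO, and the function names and the toy matrix are hypothetical.

```python
def mean_rank(weights, positions):
    """Barycentre of `positions`, weighted by the non-negative `weights`."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, positions)) / total if total else 0.0


def barycentric_seriation(matrix, max_iter=50):
    """Return (row_order, col_order) found by alternating barycentric sorting."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_order, col_order = list(range(n_rows)), list(range(n_cols))
    for _ in range(max_iter):
        # Current position of each column, indexed by original column number.
        col_pos = [0] * n_cols
        for p, j in enumerate(col_order):
            col_pos[j] = p
        new_rows = sorted(row_order, key=lambda i: mean_rank(matrix[i], col_pos))
        # Position of each row after the row sort, indexed by original row number.
        row_pos = [0] * n_rows
        for p, i in enumerate(new_rows):
            row_pos[i] = p
        new_cols = sorted(
            col_order,
            key=lambda j: mean_rank([row[j] for row in matrix], row_pos),
        )
        if new_rows == row_order and new_cols == col_order:
            break  # orders are stable
        row_order, col_order = new_rows, new_cols
    return row_order, col_order


# Toy binary matrix hiding two blocks: rows {0, 2} x columns {1, 3} and rows {1, 3} x columns {0, 2}.
matrix = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
rows, cols = barycentric_seriation(matrix)
reordered = [[matrix[i][j] for j in cols] for i in rows]
for line in reordered:
    print(line)
```

On this toy matrix the permuted table shows the two blocks along the diagonal, while every original entry is still present.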