Error rate estimation if the dataset available for machine learning results from cluster sampling, or stratified sampling

Jean-Hugues CHAUCHAT
Université Lumière-Lyon 2 - Faculté de Sciences Economiques et de Gestion
Laboratoire ERIC (Equipe de Recherche sur l'Ingénierie des Connaissances)
5 avenue Pierre Mendès-France - 69676 Bron Cedex - FRANCE
jean-hugues.chauchat@univ-lyon2.fr

1) Error rate estimation if the dataset available for machine learning results from cluster sampling, or stratified sampling. A simulation study and a real application of automatic spoken language identification.

KEYWORDS: Data mining, language identification, cross-validation, clustered dataset, stratified dataset.

ABSTRACT. If the dataset available for machine learning results from cluster sampling, the usual cross-validation error rate estimate can lead to biased and misleading results. An adapted cross-validation is described for this case. Using simulation, the sampling distributions of the generalization error rate estimate under the cluster sampling and simple random sampling hypotheses are compared to the true value. The results highlight the impact of the sampling design on inference: clustering clearly has a significant impact, and the split between learning set and test set should result from a random partition of the clusters, not from a random partition of the examples. This result is confirmed on a real application of automatic spoken language identification. The stratified case is also presented.

2) Multidimensional Data Analysis, Seriation, Blockmodels and Bertin's Graphics

KEYWORDS: Graphical Software, Seriation, Blockmodel, Cluster Analysis, Correspondence Analysis, Principal Components Analysis.

The idea of permuting the rows and columns of a data matrix in order to reveal hidden structure is having an increasing influence in applied mathematics. Numerous applications are found in production management, in the search for "group technology" (see Kamrani et al. 1993 for a bibliographic survey; Arvindh & Irani 1994; Mukhopadhyay & Gopalakrishnan 1995), as well as in archaeology, phytosociology, economics, history and the behavioural sciences (see references in Arabie 1991; Caraux 1984; Bertin 1967, 1981 and 1987).

To find, analyse and make visible the structure of a data matrix, one can use either mathematical or graphical approaches:
- Multidimensional Statistical Data Analysis (Principal Components, Correspondence, or Cluster Analysis), or other algorithms (see, for instance, Marcotorchino 1987; Arvindh & Irani 1994);
- Graphics, which our eyes, themselves powerful analysis tools, can analyse very efficiently to discover and show similarities and contrasts between the matrix elements (rows or columns). Jacques Bertin established rules for building such graphics (Bertin 1967, 1981 and 1987).

We present a new exploratory method that integrates these two approaches, and AMADO (Risson et al. 1994), a software package that implements this method under Microsoft Windows or Macintosh. AMADO works with any matrix of non-negative values: a contingency table, a logical table (which may represent a graph), a symmetrical table of co-occurrences, and so on.

Two examples are presented:
- an ecological data set: an items (species) by attributes (sites) matrix with binary entries denoting the presence or absence of a particular species at a given site;
- a cross-classification of more than 10,000 US patents granted to France-based firms during the 1985-1990 period (Bergeron et al. 1996). The data matrix is a contingency table where N_ij = number of patents filed by firms of industrial sector "i" in technological field "j", the technological field being the technological knowledge that characterises the patents.

In every case, the graphics significantly improve the readability of the results of the multidimensional statistical data analysis. They provide the researcher who collected the data with an easily understandable visual representation of the data structure, without any list of means, variances, correlations, contributions, etc. Each bit of information, each entry of the table, is presented in its original form. Only the orders of the rows and columns are changed, but everything is there.
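
To make the idea of reordering rows and columns concrete, here is a minimal Python sketch of one simple seriation heuristic: alternately sorting rows by the mean position of their non-zero entries, then columns likewise. This barycenter heuristic is an assumption chosen for illustration and is not claimed to be the method AMADO implements. The function names and the toy species-by-site matrix are hypothetical.

```python
def barycenters(matrix, other_order):
    """Weighted mean position of each row's entries, given the current
    ordering of the other dimension."""
    rank = {index: pos for pos, index in enumerate(other_order)}
    result = []
    for row in matrix:
        total = sum(row)
        if total == 0:
            result.append(0.0)  # empty row: position is arbitrary
        else:
            result.append(sum(v * rank[j] for j, v in enumerate(row)) / total)
    return result


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def seriate(matrix, n_iter=20):
    """Return (row_order, col_order) after alternating barycenter sorts.
    Only the orders change; no entry of the table is modified."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_order, col_order = list(range(n_rows)), list(range(n_cols))
    transposed = transpose(matrix)
    for _ in range(n_iter):
        row_bary = barycenters(matrix, col_order)
        new_rows = sorted(range(n_rows), key=lambda i: row_bary[i])
        col_bary = barycenters(transposed, new_rows)
        new_cols = sorted(range(n_cols), key=lambda j: col_bary[j])
        if new_rows == row_order and new_cols == col_order:
            break  # orders are stable
        row_order, col_order = new_rows, new_cols
    return row_order, col_order


def permute(matrix, row_order, col_order):
    return [[matrix[i][j] for j in col_order] for i in row_order]


# Hypothetical presence/absence table: 4 species (rows) by 4 sites (columns).
species_by_site = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
rows, cols = seriate(species_by_site)
reordered = permute(species_by_site, rows, cols)
for line in reordered:
    print(line)
```

On this toy table the reordering gathers the two groups of species and sites into two diagonal blocks. On real data such a heuristic can stop at a local optimum, so the result depends on the starting order.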

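Returning to the first abstract, the recommendation to partition clusters rather than examples can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the simulation design of the paper. The data generator, the 1-nearest-neighbour classifier and all function names are hypothetical. Each cluster has its own feature value and a label unrelated to that value, so a classifier cannot do better than chance on a new cluster.

```python
import random


def kfold_by_example(n, k, rng):
    """Usual cross-validation: random partition of the examples."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[f::k] for f in range(k)]


def kfold_by_cluster(cluster_ids, k, rng):
    """Adapted cross-validation: random partition of the clusters, so that
    all examples of a cluster fall into the same fold."""
    clusters = sorted(set(cluster_ids))
    rng.shuffle(clusters)
    fold_of = {c: pos % k for pos, c in enumerate(clusters)}
    folds = [[] for _ in range(k)]
    for i, c in enumerate(cluster_ids):
        folds[fold_of[c]].append(i)
    return folds


def one_nn_error(x, y, folds):
    """Cross-validated error rate of a 1-nearest-neighbour classifier."""
    errors = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(x)) if i not in held_out]
        for i in test:
            nearest = min(train, key=lambda j: abs(x[j] - x[i]))
            errors += y[nearest] != y[i]
    return errors / len(x)


def simulate(n_clusters=40, per_cluster=5, seed=0):
    """Toy clustered sample: examples of a cluster share a centre and a label;
    the label is independent of the centre."""
    rng = random.Random(seed)
    x, y, cluster_ids = [], [], []
    for c in range(n_clusters):
        center = rng.uniform(0.0, 1000.0)
        label = rng.randint(0, 1)
        for _ in range(per_cluster):
            x.append(center + rng.gauss(0.0, 0.01))
            y.append(label)
            cluster_ids.append(c)
    return x, y, cluster_ids


x, y, cluster_ids = simulate()
rng = random.Random(1)
err_examples = one_nn_error(x, y, kfold_by_example(len(x), 5, rng))
err_clusters = one_nn_error(x, y, kfold_by_cluster(cluster_ids, 5, rng))
print("partition of examples:", err_examples)
print("partition of clusters:", err_clusters)
```

With a partition of the examples, the nearest neighbour of a test example is almost always another example from its own cluster, so the estimated error rate is optimistic. With a partition of the clusters, the test clusters are unseen during learning, which is the situation the generalization error rate is meant to describe.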