Abstract

The use of ground truth (GT) data in the learning and/or the assessment of classification algorithms is essential. To support better decisions, these algorithms must provide results that are consistent with the reference spectral signatures of the GT and with observations of the Earth’s surface. Using a biased or simplified GT attached to a hyperspectral image to be partitioned does not allow a rigorous explanation of the physical phenomena that these images reflect. Unfortunately, this scientific problem is not always treated carefully and is generally neglected in the relevant literature. In this case, the impact on the design of classification algorithms is negative. This is inconsistent with the considerable investments made in both the development of sophisticated sensors and the design of objective classification algorithms. Any GT must be validated according to a rigorous protocol before any use, which is unfortunately not always the case. In this paper, we study two example images (Indian Pine and Pavia University) and bring evidence that they are frequently used without care by the remote sensing community, because the associated GTs are not accurate. Through this analysis, we also demonstrate the heterogeneity of the spectral signatures of some GT classes by using a semi-supervised and an unsupervised classification method. From this critical analysis, we propose a general framework for a minimum objective assessment and validation of the accuracy of GT data before they are exploited in a classification method.

Introduction

During the last decade, hyperspectral imagery has become an appropriate Earth observation means to help decision making, and is today considered an excellent information source to help the analysis and interpretation of imaged objects in a variety of applications, well beyond the field of remote sensing.
Indeed, it is recognized that hyperspectral imagery allows a better characterization of physical phenomena and a more accurate discrimination of observed materials than traditional three-band images in the visible range (RGB) or even multispectral images (a few to tens of spectral bands). Aerial hyperspectral imagery provides detailed and objective information on a scene thanks to its large spectral range (several hundreds of spectral bands covering the visible and infrared domains) and its fine spatial resolution (a few tenths of centimeters). Owing to such richness of information, the interest in hyperspectral image (HSI) data has increased in recent years in many application fields. Among these fields, we can mention the qualitative and quantitative inventory of vegetation species and their spatial distribution,1,2 the early detection of vegetation diseases3,4 and of invasive species,5,6 the identification of marine algae,7,8 the human and animal impacts on the environment,9,10 etc. Despite the richness of information it provides, and the wide range of current and potential applications it encompasses, the exploitation of hyperspectral imagery is still a big challenge, due to the difficulty of analyzing image data sets which can be very large in both the spatial and spectral dimensions.11,12 To highlight and exploit the wealth of information given by hyperspectral images (HSIs), classification is a central stage in decision making processes. It helps to summarize the information content of the image by assigning a unique label to similar pixels in the image, objectively based on their spectral signatures.
Classification methods can be categorized into three families, namely supervised, semi-supervised and unsupervised.13,14,15 Supervised methods require a priori knowledge of the ground truth (GT) label information in the learning and assessment stages.16,17 In the case of semi-supervised methods, the knowledge of the number of classes (often given by the GT), and/or some threshold values, or the number of iterations for iterative methods, is required to perform the classification task.18,19 Lastly, unsupervised methods objectively aggregate the objects (pixels) into classes without any prior knowledge (neither the number of classes to discriminate, nor learning samples). They estimate the number of classes and aggregate pixels into classes according to one or several optimization criteria.20 Whatever the category a classification method belongs to, one always needs a reliable GT. This knowledge is essential during the evaluation and validation of classification results or algorithms; otherwise the assessment of classification methods has no scientific credibility. For instance, imagine we have an aerial HSI of a cultivated area for which the reference class information (GT) is wrongly reduced to a single class (e.g. wheat). It is very likely that this image exhibits spectral variations due to the existence of several homogeneous regions, even though the area is claimed to be homogeneous and is wrongly reduced to a single region in the GT map. These spectral variations detected by the hyperspectral imager may come from regions in which the seeded plants did not grow uniformly for multiple reasons (plant disease, local moisture, path through the plant crop, etc.). Now, assume we want to apply an unsupervised (no prior knowledge) classification algorithm to this image and to assess it.
The chosen algorithm, without much a priori information, will probably be able to objectively discriminate these variations and to provide several classes which account for them, therefore highlighting informational content which is not present in the original GT map. Forcing pixels to belong to a wrong class during the learning stage of a supervised classification method, or assuming a lower number of classes than exist in reality in the case of a semi-supervised method, can have a strong negative impact: the measured classification accuracy does not meaningfully reflect the physical reality of the observed image, since pixels with very different spectral signatures are merged into ‘virtual’ classes. Indeed, strictly speaking, a homogeneous class must consist of individuals or objects having the same or very close characteristics. Therefore, at a first level, a ground truth must take into account the physical characteristics of the objects present in the imaged scene. Then, at another level, during the elaboration of the GT or with the help of end-users, it must state how the classes will be forced to merge to form virtual classes: for example, in an agricultural field, how pixels belonging to bare soil should be grouped with those belonging to growing corn. The practical consequences of such knowledge-based (sometimes arbitrary) class merging are not very critical here, but they might be disastrous, depending on the application area; for example, in the medical field, imagine the consequences of confusing a tumor area with a healthy area in a partitioned image. Another important point is the evaluation of classification methods based on a false or simplified GT. With such a GT, unsupervised classification methods15 are doomed to failure and unjustifiably disqualified versus supervised or semi-supervised methods using a biased GT, though they are likely to provide classification maps closer to the physical reality.
To illustrate the problem addressed here, we provide in this paper an analysis of the GT data associated with two well-known HSIs: Indian Pine (AVIRIS) and Pavia University (ROSIS). Both images have been extensively used in the remote sensing literature dealing with HSI pixel classification or clustering. For example, so far, more than 200 scientific papers mention these two data sets in their abstract or keywords. By analyzing some specific classes defined in the GT map, and when possible thanks to field observations, we demonstrate that these reference maps are ill-conditioned and should at least be reconsidered before being used for classification purposes. We must specify that the work presented here does not aim to propose a new method for selecting learning samples. It is rather an objective critical analysis that underlines the use of inconsistent GT data for the assessment of classification algorithms, as well as the incoherent results given by certain algorithms which follow the biased GTs too closely. This scientifically worrying problem is becoming more and more pressing and unfortunately creates a lot of confusion in the related scientific literature. It calls into question the credibility of the contribution of new generation sensors and of the accurate and objective analysis, with sophisticated algorithms, of the information these sensors can acquire. It is regrettable that this problem is not systematically avoided despite the existence of credible scientific reasons. This paper underlines the fact that a ground truth should never be systematically considered as absolute. Before any use, it must be validated according to a rigorous protocol, which is unfortunately not always the case. It is therefore important to remember that the fineness and richness of the data provided by the new generation of imaging sensors, and the development of increasingly sophisticated algorithms, must contribute to more and more objective decision making.
The paper gives a comprehensive analysis and further details of the work published in Chehdi and Cariou22. The steps of the proposed analysis can be used as a basic approach to validate a ground truth data set. The remainder of the paper is organized into two sections. The second section presents (i) a spectral analysis of two popular HSIs based on their associated GT maps, (ii) an assessment of the homogeneity of the GT classes using a semi-supervised and an unsupervised classification method, (iii) a description of the impacts of a biased GT, and (iv) a general framework to assess and validate a given GT data base. The last section provides a conclusion.

Spectral analysis of biased Ground Truth of HSIs and Impacts

In the remote sensing field, the ancillary data associated with acquired images are sometimes misleadingly called ground truth data, although they are incorrect or too simplified. This problem is particularly frequent in airborne and spaceborne remote imaging, where GT data are often utilized in an abusive and inappropriate manner. Before we prove this finding, it is very important to first recall the definitions and the meaning of GT authenticity.

1.1 Ground truth definition

According to the Oxford English dictionary,23 there are three definitions of ground truth, depending on its usage:
i. Information that has been checked or facts that have been collected at source.
ii. Information obtained by direct measurement at ground level, rather than by interpretation of remotely obtained data (as aerial or satellite images, etc.), especially as used to verify or calibrate remotely obtained data.
iii. Information obtained by direct observation of a real system, as opposed to a model or simulation; a set of data that is considered to be accurate and reliable, and is used to calibrate a model, algorithm, procedure, etc.
Also (specifically in image recognition technologies) information obtained by direct visual examination, especially as used to check or calibrate an automated recognition system.

These definitions converge and bring no confusion to the interpretation of the noun “ground truth”. They are also concordant with that given by Claval21 in the sense “that it guarantees the authenticity of the collected observations”.

1.2 Ground truth authenticity

Since the advent of technological remote sensing means, several authors have pointed out the risk of abandoning the precision and authenticity of the so-called "microlevel" knowledge (e.g. Rundstrom and Kenzer24) in favor of the "macrolevel" generalization. Field work, called "intimate sensing" by Porteous25, therefore remains a necessary complement of knowledge, even at the macroscopic scale. Whatever the application domain or the theme with which a “ground truth” is associated, the latter must therefore guarantee the authenticity and accuracy of observations, and must be faultless, since it is a reference, a model. In a decision making framework based on image processing and analysis, a GT map must be consistent with the corresponding image data, since the latter are bound to the physical characteristics of the objects or real materials which are present in the imaged scene. Moreover, each area declared as a homogeneous class must refer to the same content. This GT area must therefore be coherent with the corresponding area in the HSI that objectively represents the real scene, meaning that the pixels of a homogeneous image region must have similar spectral features; otherwise the results of the objective analysis of images exploited in the decision making process will never match those of the simplified or wrong GT. This means that any analysis method using untrue GT data will provide biased results that cannot be rigorously exploited, as well as irrelevant conclusions.
To illustrate this, in the following subsection, we present analysis results focusing on two significant examples, namely the Indian Pine and Pavia University datasets. These are the most widely used benchmark datasets (HSIs and associated superimposable GT maps) referred to in the remote sensing community for classification purposes. For each dataset, we first present the characteristics of the image and the corresponding GT. Then, we show the main results of the different analyses performed to bring out the inhomogeneity of the GT classes with respect to the corresponding spectral signatures of the pixels. The average and standard deviation of the spectral signatures of each GT class are presented. We highlight the anomalies of these two GTs by calculating the spectral dispersions within the reference classes. Due to space limitations, only the results of the analysis on GT classes containing a significant number of samples are presented. We also provide examples of classification results which highlight the need to subdivide the classes of the original GTs to allow a better coherence with the HSIs, based on the spectral features. Finally, we discuss the approximations made in constructing the GT maps associated with HSIs and their negative impacts on the analysis and interpretation of their informational content.

1.3 Analysis of two biased ground truths

1.3.1 Indian Pine GT classes

The AVIRIS Indian Pine HSI26 has a spatial size of 145 × 145 pixels, where each pixel is characterized by a set of 220 spectral values (features). The spectral range is from 400 to 2499 nm. The ground spatial resolution is approximately 20 m per pixel. The corresponding GT map is made of 16 classes. Fig. 1 displays the HSI visualized under two different wavelength triplets in order to highlight the variations in the regions corresponding to each original GT class, as well as the image of class labels of the associated GT.
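The per-class statistics announced above (band-wise average and standard deviation of the spectral signatures of a GT class) can be sketched as follows. This is only a minimal illustration in plain Python, not the code used for the paper: the function name `class_mean_and_std`, the toy three-band signatures and the use of the population standard deviation (division by the class size) are assumptions made for the example.

```python
from math import sqrt


def class_mean_and_std(signatures, labels, target):
    """Band-wise mean and standard deviation of the pixels labelled `target`.

    signatures: list of spectral signatures (one list of band values per pixel)
    labels:     list of GT labels, one per pixel
    Assumption: population standard deviation (divide by the class size).
    """
    members = [s for s, lab in zip(signatures, labels) if lab == target]
    if not members:
        raise ValueError(f"no pixel carries the label {target!r}")
    n_pixels = len(members)
    n_bands = len(members[0])
    mean = [sum(s[k] for s in members) / n_pixels for k in range(n_bands)]
    std = [
        sqrt(sum((s[k] - mean[k]) ** 2 for s in members) / n_pixels)
        for k in range(n_bands)
    ]
    return mean, std


# Hypothetical toy data: three pixels with three bands each, two GT labels.
toy_signatures = [[1.0, 2.0, 3.0], [3.0, 2.0, 5.0], [10.0, 10.0, 10.0]]
toy_labels = [2, 2, 7]

mean_c2, std_c2 = class_mean_and_std(toy_signatures, toy_labels, target=2)
print(mean_c2, std_c2)
```

A class whose standard deviation stays small in every band is spectrally coherent; wide intervals around the average signature suggest that the GT class merges pixels of different natures.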
Table 1 details the nature of each supposedly homogeneous class and the number of pixels that compose it.

A. Spectral signature analysis of hyperspectral images

In an HSI, a pixel is characterized by its spectral signature, a set of features corresponding to spectral bands. Let $X = \{x_1, x_2, \ldots, x_N\}$ be the set of elements (pixels) to be partitioned. Each pixel $x_i$ is characterized by the feature vector $A_i = A(x_i) = (a_{i1}, a_{i2}, \ldots, a_{iB})^T$, where $B$ is the number of features (spectral bands). Consider a partition of $X$ into $K$ indexed subsets or classes $C_n$ ($1 \le n \le K$), and $l_i \in \{1, \ldots, K\}$ the label associated to pixel $x_i$, so that the $n$-th class is $C_n = \{x_i : l_i = n,\ 1 \le i \le N\}$ and $|C_n| = M_n$. The average spectral signature (barycenter) is given by:

$$g_n = \frac{1}{M_n} \sum_{x_i \in C_n} A(x_i), \tag{1}$$

The metric used here to calculate the dispersion of a class $C_n$ is the L1-norm distance (sum of the absolute values of errors). The total dispersion of a class $C_n$ is defined by:

$$D_n = \sum_{x_i \in C_n} d(x_i, g_n), \tag{2}$$

with $d(x_i, g_n)$ the L1-norm distance between a pixel $x_i$ and the barycenter $g_n = (g_{n1}, \ldots, g_{nB})^T$ of class $C_n$:

$$d(x_i, g_n) = \sum_{k=1}^{B} \left| a_{ik} - g_{nk} \right|, \tag{3}$$

In order to account for the population size within a class, we also calculate the average total dispersion of class $C_n$:

$$\bar{D}_n = \frac{D_n}{M_n}, \tag{4}$$

For the GT data under study, $n = 1, 2, \ldots, 16$, i.e. $K = 16$. Table 2 shows the total dispersion, the average total dispersion and the dispersion rank of each GT class using the L1-norm distance for the Indian Pine dataset. The four GT classes which exhibit the highest total dispersion are (in decreasing order) C11, C2, C12 and C14. This ranking is different when considering the average dispersions. Apart from C12, these classes are the ones which contain the highest number of pixels. In the following, we have limited the spectral analysis to the C11 (Soybeans min-till) and C2 (Corn no-till) GT classes. Fig. 2 shows the selected regions of the original image corresponding to these GT classes. Fig.
3 shows the spectral signatures of the pixels, the average spectral signature and the standard deviation within the C11 and C2 GT classes. The wavelengths of the first and last bands are 400 and 2499 nm, respectively. The high variations of the spectral signatures inside each GT class confirm the dissimilarity of the pixels which form these two classes. This conclusion is consistent with the disparity of these classes observed with just three bands of the original HSI, as seen in Fig. 1 and Fig. 2, and no further criterion is necessary to confirm it. The most homogeneous class for this GT is C7, even if a few pixels are distant from the class barycenter. This is confirmed by the weak variations of the standard deviation of the spectral signatures around the average spectral signature (see Fig. 4).

Fig. 1 Original Indian Pine image. (a) and (b): visualization based on two compositions of three different spectral bands, (26, 16, 6) and (37, 21, 5) respectively; (c) and (d): the selected regions of images (a) and (b), respectively, corresponding to the GT class labels given in (e).

Table 1 Data from the Indian Pine GT. Total GT pixels: 10 336.

Table 2 Total dispersion and average dispersion of Indian Pine spectral signatures per GT class using the L1-norm distance. Columns: supposed GT classes; dispersion in each class and dispersion rank; average dispersion in each class and dispersion rank.

Fig. 2 Indian Pine original images visualized with three spectral bands (26, 16, 6), corresponding to the labels of C11, claimed as Soybeans min-till, and C2, claimed as Corn no-till.

Fig. 3 Indian Pine GT: spectral signatures (black), average spectral signature (central curve), and standard deviation interval (blue) of the C11 and C2 GT classes.

Fig. 4 Indian Pine GT: spectral signatures (black), average spectral signature (central curve), and standard deviation interval (blue) of the assumed homogeneous class C7.

B.
Discussion

The above examples of the C11 and C2 classes show that some regions of the HSI which are declared in the GT as relating to two classes of vegetation species do not exhibit coherent and similar spectral signatures in the acquired image. The corresponding variations can even be detected visually with ease, and from the visible bands alone. One might ask whether such variations really exist from the field viewpoint and are not artifacts, e.g. caused by the sensor itself. In fact, some answers to the issue of the heterogeneity of most original GT classes reside in the supplemental material provided with the HSI, i.e. the observation notes and field pictures associated with the field work of Baumgartner et al.26, which surprisingly is barely referred to in the HSI classification literature. This document of about 70 pages, which includes handwritten notes taken approximately at the time of the Indian Pine flight survey as well as the pictures taken by the field specialists, contains rich information that has only partially been reported in the GT map. For instance, let us consider the field numbered as 3-10 in the observation notes document. In Fig. [unresolved reference]-(a), this field corresponds to the bottom-most, left-most field among those of the C11 class (Soybeans min-till). The vegetative canopy reported for this field in the observation notes is soybeans, drilled in 8” rows, with a plant height of 4-5” and very little weed infestation. In the same report, the soil characteristics also mention a minimum tillage system, not freshly tilled, with corn residues on the surface. These observations, which are only partly reported in the name of the C11 GT class, seem to indicate that the same, uniform soil and vegetation conditions prevail over the whole field. However, this is not the case, as can be seen from the picture of this field taken during the field observations. This picture, shown in Fig. [unresolved reference]-(a) and available from the field work,25 was taken from the north end of the field, in the direction of southeast, and clearly exhibits local variations along lineaments traversing the north part of the field (particularly the WSW-ENE lineament). The first line of trees and bushes in the background corresponds to the east end of the field. Fig. [unresolved reference]-(b) shows the C11 class regions overlaid on a Google Earth archive image acquired three months before the hyperspectral acquisition. The two orange lines in Fig. [unresolved reference]-(b) approximately delineate the field of view of the picture in Fig. [unresolved reference]-(a). We can notice that the local variations of grey levels in this image are in accordance with those observed in the HSI. As said above, this crop field is claimed by the GT map to be uniformly grown with soybeans on a minimum tillage soil. However, the central part of the picture in Fig. [unresolved reference]-(a), which shows brown areas (probably bare soil), partly contradicts the original GT class map. Besides, this area is very likely to correspond to the lineaments detectable in both the HSI (cf. C11 in Fig. 2) and the image in Fig. [unresolved reference]-(b). The variations of the spectral signatures in classes C11 and C2 (cf. Fig. 3) probably have two origins: the first is the influence of the nature and moisture of the soil, because the vegetation is not at a very advanced stage of growth; the second is the inclusion in these classes of objects of different natures. From this example, it is clear that the users and developers of classification algorithms must pay great attention to the truthfulness of the GT maps provided with the HSI rather than take it for granted.
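As a complement to this discussion, the homogeneity check based on Eqs. (1) to (4) (class barycenter, L1-norm total dispersion, average dispersion and the resulting ranks) can be sketched in a few lines. This is a minimal plain-Python illustration under stated assumptions, not the authors' implementation: the helper names `l1_distance`, `class_dispersions` and `rank_by` and the two-band toy data are hypothetical.

```python
def l1_distance(a, g):
    """L1-norm distance between a signature `a` and a barycenter `g` (Eq. 3)."""
    return sum(abs(ak - gk) for ak, gk in zip(a, g))


def class_dispersions(signatures, labels):
    """Return {label: (M_n, D_n, D_n / M_n)} for every GT class."""
    by_class = {}
    for sig, lab in zip(signatures, labels):
        by_class.setdefault(lab, []).append(sig)

    stats = {}
    for lab, members in by_class.items():
        m = len(members)
        n_bands = len(members[0])
        # Eq. (1): barycenter (average spectral signature) of the class.
        g = [sum(s[k] for s in members) / m for k in range(n_bands)]
        # Eq. (2): total dispersion, summing the distances of Eq. (3).
        total = sum(l1_distance(s, g) for s in members)
        # Eq. (4): average dispersion, normalised by the class size.
        stats[lab] = (m, total, total / m)
    return stats


def rank_by(stats, index):
    """Labels sorted by decreasing dispersion (index 1: total, 2: average)."""
    return sorted(stats, key=lambda lab: stats[lab][index], reverse=True)


# Hypothetical toy data: two spectral bands, three GT classes.
toy_signatures = [
    [1.0, 1.0], [3.0, 1.0], [1.0, 3.0], [3.0, 3.0],  # class 1: four pixels
    [0.0, 0.0], [6.0, 0.0],                          # class 2: two distant pixels
    [5.0, 5.0], [5.0, 5.0],                          # class 3: identical pixels
]
toy_labels = [1, 1, 1, 1, 2, 2, 3, 3]

stats = class_dispersions(toy_signatures, toy_labels)
print(rank_by(stats, 1))  # ranking by total dispersion
print(rank_by(stats, 2))  # ranking by average dispersion
```

In this toy case the largest class has the highest total dispersion, while the small class with two distant pixels comes first once the dispersion is averaged, which mirrors the remark that the two rankings need not coincide.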