JMP Tutorial ~ Graphical Displays and Summary Statistics for Numeric Data

Graphical Displays for Numeric Data

Histograms and Outlier Boxplots

To obtain a histogram and boxplot for numeric data, select Distribution from the Analyze pull-down menu and place the variable(s) that you wish to examine in the right-hand box.

These data are from a study of DDT levels found in fish in the Tennessee River near the Wheeler Reservoir. The data are contained in the file Catfish.JMP. The variables in this data file are:

location - location on the river from which the fish were sampled
distance - distance of the sample location from the mouth of the Tennessee River
species - numeric indicator of the fish species (1 = catfish, 2 = smallmouth buffalo, 3 = largemouth bass)
Spec. Name - fish species
length - length of fish sampled (cm)
weight - weight of fish sampled (g)
DDT - DDT concentration found in a fillet of the fish (parts per million, ppm)
log(DDT) - natural logarithm of the DDT concentration

We begin by examining histograms and boxplots for the length, weight and DDT concentration of the fish sampled. To do this, select Distribution from the Analyze menu and place length, weight and DDT in the right-hand box. The results are shown below.

The Horizontal Layout, Prob Axis, Normal Curve and Smooth Curve options have been used in constructing the plots above. These options are illustrated in the graphics below:

The normal curve and smooth curve density estimate are added by selecting these options from the Fit Distribution pull-out menu.

We can see that the lengths of the fish sampled appear to have a left-skewed distribution with several outliers on the low end. These outliers are all largemouth bass. The typical length appears to be somewhere between 42 and 45 cm.

The weight distribution appears to be slightly skewed to the right, but is not far from normal, as evidenced by the fairly close agreement between the normal curve and the smooth curve distribution estimate.
There are also a couple of outliers flagged in the boxplot. A typical weight for the fish sampled is approximately 1000 grams.

The DDT concentrations of the fish sampled follow a severely right-skewed distribution with several obvious outliers on the high end. Using the location and Spec. Name columns to label the points in succession shows that these observations correspond to catfish and smallmouth buffalo sampled from locations 1, 8, and 13. Examination of the map shows that locations 1 and 13 are in close proximity to the plant that was the source of the DDT contamination of the ecosystem.

Transformations to Improve Normality

When the distribution of a variable is markedly skewed (left or right), we can often use a transformation to obtain approximate normality. The common remedy is to raise the variable to some power. This type of transformation is known as a power transformation.

To remove right skewness we consider powers less than 1, such as 1/2 (i.e. square root), 1/3 (i.e. cube root), 0 (which corresponds to a log transformation), -1/2 (i.e. reciprocal square root), -1 (i.e. reciprocal), etc. As a rule of thumb, we often avoid negative power transformations because they reverse the ordering of the data, i.e. the largest observed value will become the smallest and vice versa. The units of a variable transformed with a negative power can also be difficult to explain.

To remove left skewness, which is less common, we typically raise the variable to a power greater than 1 (e.g. 1.5, 2 or 3).

In this example we see that the distribution of the DDT concentration is extremely skewed to the right. To improve normality we will consider transformation to the log scale. To do this in JMP you must use the JMP Calculator, which allows you to perform a variety of data transformations and manipulations.
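Outside JMP, the effect of moving down the ladder of powers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names power_transform and skewness are hypothetical, the skewness measure is the simple moment-based ratio m3 / m2^1.5, and the values are made-up right-skewed numbers, not the readings in Catfish.JMP.

```python
import math


def power_transform(values, power):
    """Raise each value to `power`; power 0 is treated as the natural log."""
    if any(v <= 0 for v in values):
        raise ValueError("power and log transformations here assume positive data")
    if power == 0:
        return [math.log(v) for v in values]
    return [v ** power for v in values]


def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (positive means right-skewed)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed concentrations (NOT taken from Catfish.JMP).
ddt_example = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]

# Powers 1 (no change), 1/2, 1/3 and 0 (log), as discussed in the text.
for power in (1, 0.5, 1 / 3, 0):
    transformed = power_transform(ddt_example, power)
    print(f"power {power:.3f}: skewness = {skewness(transformed):.3f}")
```

For data like these, the skewness shrinks as the power decreases toward 0, which is the behavior the rule of thumb above relies on.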
To create a column containing a function of another column, double-click to the right of the last column to add a new column to the spreadsheet. Next, double-click at the top of the column to obtain the Column Info window. In the window, change the name of the new column to log10(DDT), select Formula from the New Property pull-down menu and click Edit Formula. The JMP Calculator should then appear on the screen.

To take the base 10 logarithm of the DDT variable, select Transcendental from the menu to the right of the calculator keypad, because the logarithm is a transcendental (non-algebraic) function. In the list that appears in the rightmost menu, select base 10 logarithm (i.e. log10). In the formula window you should see log. Now you need to supply the name of the variable you wish to take the logarithm of, which is DDT in this case. From the leftmost menu select DDT from the list, and the formula window will then look like:

Log10(DDT)

Finally, click Apply and close the calculator window. The new column you created should now contain the base 10 logarithm of the DDT concentrations. The histogram and boxplot for the log scale DDT readings are shown below. We can clearly see that approximate normality has been achieved through the transformation.

Summary Statistics - Measures of Central Tendency, Variability and Location

Next to each of the histograms and boxplots shown above you will find the basic summary statistics for each variable, shown below.

To obtain the variance and coefficient of variation, select More Moments from the Display Options pull-out menu.

To obtain the z-score associated with each observation, select Save Standardized from the Save menu, which is located within the main pull-down menu for the variable. Three new columns labeled Std length, Std weight, and Std DDT, containing the z-scores, will appear in the original spreadsheet. You could examine the distribution of the z-scores themselves by using the Distribution command.
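As a rough check on what these menu items compute, the variance, coefficient of variation and z-scores can be sketched in Python. This is an illustrative sketch, not JMP's own code: it assumes the sample standard deviation (n - 1 divisor) is used throughout, the helper names are hypothetical, and the lengths are made-up values rather than data from Catfish.JMP.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100.0


def z_scores(values):
    """Standardize each value: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: n - 1 divisor
    return [(v - mean) / sd for v in values]


# Hypothetical fish lengths in cm (NOT taken from Catfish.JMP).
lengths_example = [30.0, 40.0, 42.0, 44.0, 45.0, 46.0, 48.0]

print("variance:", statistics.variance(lengths_example))
print("CV (%):", coefficient_of_variation(lengths_example))
for length, z in zip(lengths_example, z_scores(lengths_example)):
    print(f"length {length:5.1f} cm -> z = {z:+.2f}")
```

By construction the standardized values have mean 0 and standard deviation 1, so they can be compared across variables measured in different units.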
Any observations with z-scores exceeding 3 in absolute value could be classified as potential outliers. The histogram below is for length standardized using z-scores. All of the observations with extreme z-scores for length are largemouth bass.

Comparative Displays

In this study we could compare the DDT levels of the different fish species and also compare the DDT levels of fish by location. We first consider the potential difference in the DDT levels in catfish found at different river locations by using comparative boxplots and mean diamonds. To do this in JMP, select Fit Y by X from the Analyze menu and put location in the X box and log(DDT) in the Y box. The resulting display will show the log(DDT) levels plotted versus the location number. To add boxplots or other items to this plot, use the Display Options menu located within the main pull-down menu. The options and their effects are summarized below.

Box Plots - adds quantile boxplots to the display.
Mean Diamonds - adds mean diamonds to the plot.
Mean Lines - adds a horizontal line showing the mean for each group/population.
Mean CI Lines - adds lines depicting the 95% confidence interval for the mean to the plot.
Mean Error Bars - adds the means and standard errors (Ch. 6) to the plot.
Std Dev Lines - adds lines one standard deviation above and below the mean.
Connect Means - adds line segments connecting the individual means.
X-Axis Proportion - if checked, the space allocated to each group will be proportional to the sample size for that group.
Points Jittered - "jitters" the points so individual observations are more easily seen.
Points Spread - staggers the points much more than jittering.

The display below shows comparative boxplots for log(DDT) level across location with the X-Axis Proportion option turned off. Here we can clearly see that the fish from locations 1 and 13 have the highest DDT levels and that locations 6 and 17 appear to have the lowest.
It is important to note that the latter two locations are the only locations where largemouth bass were sampled.

We can construct a similar display for comparing the log DDT measurements across species by placing Spec. Name instead of location in the X box. To obtain summary statistics for the log(DDT) levels within each species type, select Quantiles and Mean, Std Dev, Std Err from the main pull-down menu. The results are shown on the following page.

How do the different species compare in terms of summary statistics? Catfish have the highest mean and median DDT levels in the log scale, while largemouth bass have the smallest. Catfish have the smallest amount of variation, as seen by comparing the standard deviations or the coefficients of variation, CV = (s / x̄) × 100%, where s is the sample standard deviation and x̄ is the sample mean.

CDF Plots

The plot below gives the CDF plots for the DDT levels found in each of the fish species in this study. To obtain these, select CDF Plots from the Oneway Analysis... pull-down menu.

We can clearly see that we are much more likely to find a catfish with a high DDT level, e.g. there is an approximate 50% chance that we sample a catfish with a log10(DDT) level exceeding 1, which is 10 ppm in the original scale. The same probability for smallmouth buffalo is less than 25%, and it is estimated to be 0 for bass.
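The way such a probability is read off a CDF plot can be sketched in Python: the chance of exceeding a threshold is one minus the empirical CDF at that threshold. The function name exceedance_probability is hypothetical, and the values below are invented for illustration only; they are not the measurements in Catfish.JMP and are not meant to reproduce the percentages quoted above.

```python
def exceedance_probability(values, threshold):
    """Empirical proportion of values above threshold, i.e. 1 - ECDF(threshold)."""
    return sum(1 for v in values if v > threshold) / len(values)


# Hypothetical log10(DDT) readings by species (NOT taken from Catfish.JMP).
log10_ddt_example = {
    "catfish": [0.6, 0.9, 1.1, 1.4],
    "smallmouth buffalo": [0.1, 0.3, 0.5, 0.8, 1.2],
    "largemouth bass": [-0.3, 0.0, 0.2, 0.4],
}

threshold = 1.0  # log10 scale
print(f"threshold in the original scale: {10 ** threshold} ppm")
for species, readings in log10_ddt_example.items():
    prob = exceedance_probability(readings, threshold)
    print(f"{species}: P(log10(DDT) > {threshold}) = {prob:.2f}")
```

Because the threshold is applied on the log10 scale, it is converted back with 10 raised to the threshold to express it in ppm.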