U.B.C. BIOLOGY 300 BIOMETRICS COMPUTER LAB ASSIGNMENTS

1. INTRODUCTION

The purpose of these computer lab exercises is to give you an additional means of understanding the basic statistics taught in lectures and to provide exposure to data analysis using a modern microcomputer system. In the lab, you have access to a computer network containing the necessary programs, especially JMPin, the student version of JMP from the SAS Institute. JMP is one of the most versatile and easiest to use statistical programs and is widely used in academic, government and corporate settings. Although the package includes a number of features that are beyond the scope of most introductory biostatistics courses, it is easy to use and designed to introduce novice users to statistical analysis. The program is designed to emphasise the graphical and exploratory aspects of statistics.

The package is entirely menu driven and runs in Windows and Macintosh environments. It is installed on all the computers in the Biostatistics lab in room 4329, as well as on the computers in Zoolab, the undergraduate computer lab on the second floor of the biology building. To run a personal copy of JMPin at home you will need Windows 3.01 or higher, or Windows 95 or 98, a 386 or higher microprocessor, and at least 4 MB of RAM. To run the program on a Mac, you need a current Macintosh with at least 2 MB of free RAM.

These exercises are not designed to replace the JMPin documentation. They are designed to give you practical, hands-on experience with biostatistics. This is our second year using JMPin and these new lab exercises, so bear with us as we continue to remove any glitches. We welcome your comments and suggestions for changes and improvements. By all means, experiment with the program and try new things. The odds are that you will discover useful tricks that we haven’t mentioned.

Outline of Topics for this Week

1. Logging onto the System
2. Using JMPin in the labs and at home
3. Data Entry and Editing
4. Data Description and Exploration:
   Levels of Measurement
   Data Types
   Plotting Distributions
      Histograms
      Quantile Box Plots
      Outlier Box Plots
      Other Graphic Techniques
   Descriptive Statistics
      Measures of Central Tendency
      Measures of Dispersion

Using The Program In Class: General Start-up Instructions

Our computer network and server require a password before they will allow you access. You will be assigned a user-id and password during your first lab. The user-id will be valid for the duration of the course and will allow access to the network, the Internet, assorted applications and a home directory where you can store several megabytes of files. The password you will be given is temporary. You should change it during the first lab session to protect your account from hackers, etc. Follow the instructions you will be given in class to change your password using the telnet program. This is the only way that you can change your password. Write your user-id and password down and keep them in a secure location. You will need them for the rest of term to access the system.

To access the system, type your user-id and password into the Windows networking dialog box that will be displayed on the screen of your computer. If a further dialog box appears asking if you want to use this password for Windows, click Cancel. This dialog box has no function and will be disappearing from the computers shortly. Once Windows has booted up, click on the START button in the bottom left corner of the screen. Use the mouse to move to the PROGRAMS option, then to JMPin.

2. DATA ENTRY AND EDITING

Data may be entered from experimental designs or observational studies involving simple random sampling. Random sampling requires that each member of the population has an equal and independent chance of being selected.
This requirement can be met by assigning each individual in a population a number and using a table of random numbers to decide which individuals will be included in the sample. More often, the researcher simply uses all of the individuals available and assumes that the sample is a random representation of the population about which inferences are to be made (i.e., a sample of convenience).

Using the Program

Before you can use any of the exploratory or inferential statistics programs, data must be put into the computer's memory. Data may be entered directly or may be loaded from a file saved in a previous session. When you first open JMPin, you will see a table labeled untitled 1. This is where you will enter data. The table is initially blank, with 1 column and 0 rows.

To enter data you will need to add some rows. Double clicking on the ROWS menu choice, then choosing add rows, allows you to specify how many data points you wish to enter. You may also double click directly on the table at any point to add rows down to the cell to which you have moved the mouse cursor. Columns may be added to the table in a similar fashion. To delete rows or columns, first select them (move the cursor to the desired location, hold down the left mouse button, and drag the cursor over the rows or columns you wish to remove), and then use the pull down menus for rows and columns. Try adding some rows and columns to the table and then deleting them again.

Editing values is as simple as selecting and changing them. More advanced tools are also available, including formatting, transforming and grouping data points. We will explore some of these tools in future exercises.

In general, each column on the screen represents a single variable. A variable is simply the measurement of interest. Each cell on the screen represents a single data point.
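
The random-number approach to sampling described at the start of this section can also be sketched in code. The following is a minimal Python illustration, not part of JMPin or the course software. The population size of 200, the sample size of 10, the seed and the function name simple_random_sample are hypothetical values chosen only for the example.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Return the ID numbers (1..population_size) selected for the sample."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    ids = range(1, population_size + 1)  # one number per individual
    # sample() draws without replacement, so every individual has
    # the same chance of being included in the sample.
    return sorted(rng.sample(ids, sample_size))


# Hypothetical example: choose 10 individuals out of a population of 200.
chosen = simple_random_sample(population_size=200, sample_size=10, seed=1)
print(chosen)
```

The individuals whose numbers are printed would then be the ones measured and entered into the data table.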

Accessing Files from the Server

To access files on the shared drive, choose the open file menu. From the drive sub-menu choose shared. From the shared directory choose the file that you want.

Problems

1. Ten randomly chosen sections of a river showed the following number of spawning coho salmon: 22, 18, 40, 16, 12, 17, 23, 41, 29, 33.
a) What is a "variable"? How many variables are in this data set?
b) Enter these data and save them in a file named salmon. (Note: if you have changed your password but didn't log out afterwards, the computer will not let you save your data, since your password will be different from the one you logged in with. In future labs this should not be a problem.)
c) Change the third value to 19 and the eighth value to 27.
d) Insert a value of 16 after the second record.
e) Delete the fourth and fifth values.
f) Add the following values to the data set: 17, 15, 11, 21, 23, 26.
Your data set should now include the following values: 22, 18, 16, 12, 17, 23, 27, 29, 33, 17, 15, 11, 21, 23, 26.

3. EXPLORING AND DESCRIBING DATA

The first thing to do with a set of data is to inspect it visually. Inspection affords an opportunity to determine the shape of a distribution. This information is of interest on its own, but will also help to determine the type of analysis to carry out next on the data. There are a number of useful tools for this, including descriptive statistics, histograms and boxplots. JMPin offers all of these (including two different variations of the boxplot) plus a number of additional exploratory tools. For now, however, we will limit ourselves to these three methods as they are the most widely used and the most generally useful.

A. Histograms

A histogram is a plot of how frequently different values of a variable occur. The X-axis of such a plot covers the range of values that the variable can assume. The Y-axis indicates the frequency of observations occurring in each interval of X-values.
This frequency is represented by a bar that allows the viewer to easily compare frequencies of observations in different intervals of X.

The number and width of X intervals used in a histogram is arbitrary, and there is no set rule for determining how many classes to use. By luck of the draw some classes of X will be over-represented in a sample and others will be under-represented. If the X-axis is divided too finely, many classes will contain no observations as the result of chance alone, and the histogram will resemble the skyline of a city dominated by skyscrapers. Someone viewing such a histogram will have difficulty determining the shape of the true distribution. Using fewer, larger classes can alleviate this problem. Holes in the distribution are smoothed over, giving a better picture of the shape of the distribution.

It is possible to go overboard with smoothing. Histograms consisting of a few very wide classes may hide significant features of a distribution. As the number of classes is increased, take note of how robust the overall pattern is and of new features as they appear. Are features real, or are they simply the result of chance variation in the sample of observations? There is no set rule for answering this question; however, you can form an opinion by performing a small mental experiment. Suppose that you were to increase the size of the sample by a few observations and that you were to strategically place those observations on the histogram in such a way that they reduce the conspicuousness of a feature that interests you. The feature is likely to be an aberration if it is wiped out, but real if it remains conspicuous.

Once you understand the distribution of a sample, how many X intervals should you use to present the data to an audience? Divide the axis as finely as possible without requiring the viewer to do too much smoothing to see the pattern. Why?
Viewers will miss your point or not bother with the histogram at all if they must take the time to smooth the pattern themselves. The fineness with which the X-axis is divided also provides viewers with a means of judging the strength of whatever pattern is purported to be present. Viewers will be convinced by a pattern that remains clearly visible when many classes are used, whereas they will be suspicious when relatively few classes are used for large samples. Indeed, it is probably safe to assume that most people who present data will operate according to this strategy, and it is reasonable to judge histograms on the assumption that investigators follow the same rule.

Using the Program

To produce a histogram, choose ANALYZE, then DISTRIBUTION OF Y from the pulldown menus. Select the variable you wish to analyze. The graph that appears provides a rough estimate of the distribution. You can increase the amount of information the program provides by modifying the graph. First, choose the check mark icon at the lower left of the histogram window. This accesses the controls for this set of analyses. Change to a horizontal layout and the histogram will be in the same orientation you have seen in class. Add a count axis, so that you can more easily see the frequencies for each interval. If you click on the histogram with the pointing tool (the arrow, which is the default tool), you will be able to enlarge the diagram by clicking and dragging the small square that appears in the bottom right corner of the graph.

Problems

1. Open the data file bigclass by choosing file, then open, from the menus at the top of the screen. The file is located in the shared directory (along with most of the data files we will use this term). Select the variable weight, then carry out the procedures mentioned above.
a) Describe the general shape of the data distribution using the terms explained to you by your TA (normal, uniform, skewed left or right, platykurtic, leptokurtic or bimodal).
b) Choose the hand tool and move it within the histogram. What happens when you move the hand parallel to the X axis? Why is this happening?
c) How strongly is the histogram affected by changes in interval start points?
d) What happens when you move the hand parallel to the count (frequency) axis?
e) What are the consequences of too few intervals in a histogram? Too many?
f) Try highlighting one bar of the histogram using the pointer tool (arrow). What effect does this have on the original data table when you examine it?
g) The check menu provides several other useful tools. Display a normal or bell-shaped curve over your histogram. The normal curve is one of the most useful distributions for statistical analyses. It is often this shape that we hope to find in a plot of our data. How well does your histogram approximate a normal curve? (In future sessions we will learn about more powerful tools for testing normality.)

B. Descriptive statistics

Quantitative descriptions of distributions are also useful. The most common measure of the location of a distribution is the mean. The breadth of a distribution can be described using the standard deviation. Together, the mean and standard deviation completely describe a distribution that conforms to a normal bell-shaped curve. They are less apt descriptors of non-normal distributions, particularly distributions that are skewed or contain outliers. Another related statistic is the standard error, which we will deal with in future sessions.

Quantiles or percentiles of a distribution are alternative descriptors that can be used to describe both the location and spread of a distribution. The median or 50th percentile is often used as a measure of location. The difference between the 75th percentile (or 3rd quartile) and the 25th percentile (or 1st quartile) can be used to describe the breadth of a distribution. This distance is often referred to as the interquartile range.
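
As a rough check on these definitions, the statistics can be computed with a few lines of code. The sketch below assumes Python's standard statistics module rather than JMPin, and uses the final salmon counts from the data entry problem. Quartile conventions differ between programs, so the quartiles reported by JMPin may not match these exactly.

```python
import statistics

# Final salmon counts from the data entry problem.
salmon = [22, 18, 16, 12, 17, 23, 27, 29, 33, 17, 15, 11, 21, 23, 26]

mean = statistics.mean(salmon)      # measure of location
sd = statistics.stdev(salmon)       # sample standard deviation (n - 1 denominator)
median = statistics.median(salmon)  # 50th percentile

# quantiles(..., n=4) returns the three cut points that divide the data
# into quarters: 1st quartile, median, 3rd quartile.
q1, q2, q3 = statistics.quantiles(salmon, n=4)
iqr = q3 - q1                       # interquartile range

print(f"mean = {mean:.2f}, standard deviation = {sd:.2f}")
print(f"median = {median}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```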
These quantiles are particularly useful in the form of a boxplot, described in the next section.

Problems

1. Continuing to use your histogram data, examine the statistics given in the tables beside the histogram.
a) How similar are the values for the mean and median of the weight data? Do you think that this result will always occur?
b) Change the weight for Tim, data point 6, to 384. (Note that this is an American program, so weights are given in pounds. Also note that JMPin, like most statistics programs, differs from spreadsheets like Excel: it does not automatically refresh graphs when you change your data. You must produce a new analysis to see the change. This allows you to view the effects of changes by comparing a new graph to the original.) How does this affect the mean and median of your data set? Which measure of central tendency is more sensitive to outliers (unusual or aberrant data points)? DO NOT SAVE THE EDITED DATA SET.

C. Boxplots

Many parametric statistical tests assume that data have been sampled from normally-distributed populations. This assumption can be tested using goodness of fit tests such as chi-squared or the Kolmogorov-Smirnov test. Prior to testing, however, the general distribution of a data set should be scrutinised graphically. Both histograms and boxplots provide graphical summaries of data to help indicate the distribution of the variable in the population. Histograms indicate the frequency of occurrence of all values, whereas boxplots summarise only the most prominent features of a data set. A boxplot shows the centre and spread of a data set, as well as the extent and nature of departures from symmetry.

A boxplot is particularly useful for detecting outliers. Outliers are observations that lie unusually far from the main body of the data.
These unusual observations may reflect an unusual distribution of the variable in the population (e.g., the data may be highly skewed), but sometimes they are errors of measurement or transcription, or represent individuals from a population other than the one under study. Whatever the cause, a decision must be made either to use these extreme values or to eliminate them from further analyses.

When should outliers be deleted? There is no single correct answer to this question. If an outlier is not an error but is deleted, then valuable information is lost and a bias is introduced into later statistical tests. Yet including an erroneous outlier also has harmful consequences. The decision to delete an observation or not should always be based on what is known about the sampling procedure and the experimental design. A good strategy is to conduct analyses with and without the suspect observation and compare the results. If the conclusions from the two analyses are different then the decision to reject a value or not must be made with great care.

Boxplots are also informative about other aspects of a distribution, such as asymmetry. If a distribution is asymmetrical, and hence not normal, then a transformation of the data may often result in a more normal distribution. If no simple transformation is satisfactory, then nonparametric statistics should be used in subsequent analyses.

Statistics such as the mean and standard deviation can be drastically affected by even a single outlier. Therefore, boxplots are based on measures that are resistant to the presence of a few outliers. These measures are the median and the interquartile range.

To draw a basic boxplot, the n observations in a data set are first ordered from smallest to largest and the overall median is determined. The overall median is then plotted as a horizontal line. Next, the median of the smallest half of the data (the lower quartile) and the median of the largest half of the data (the upper quartile) are determined.
Note that the overall median is included in both halves of the data set if n is odd. The interquartile range is then calculated simply as the difference between the medians of the upper and lower halves of the data set. The interquartile range is shown graphically by plotting the medians of each half of the data set as horizontal lines and then joining the ends of the lines to form a box. If the data set is symmetrical (e.g., from a uniform or normally-distributed population), then the box will appear to be divided equally into two halves by the overall median. A vertical line is then drawn between the smallest and the largest values in the data set to indicate the range. In a symmetrical data set, this vertical line will extend the same distance on either side of the box.

Using the Program

JMPin produces two variants of the boxplot: an outlier boxplot and a quantile boxplot. Both are available from the check menu for distribution of y analyses. The default version displayed by the program is the outlier boxplot. In this figure, each tail is drawn as a solid line extending up to 1.5 times the interquartile range beyond the central box. Data points beyond this distance are shown individually as outliers, or aberrant points. A red line on the side of the plot shows the range of the most closely grouped 50% of the data points.

The quantile boxplot shows the distribution from the quartiles to the minimum and maximum values in the data set as solid lines, but puts tick marks at selected quantiles along the tails. These can include values such as the 90% quantile, the 95% quantile or the 99% quantile (see the help menu of the program for a diagram showing quantiles on a quantile boxplot). The quantile boxplot also shows a diamond which represents a 95% confidence interval around the mean of the data set. We will deal with confidence intervals in a future lab session.

Problems

1. Using the bigclass data set and the weight variable, examine the outlier and quantile boxplots.
a) Do the data contain any outliers? Use the selection tool (arrow) to highlight any such values in the data set.
b) Is the interquartile range (box) symmetrical about the overall median? Does the range of the data set extend equally on either side of the box? Are the data normally distributed?
c) Use the cross tool on the outlier boxplot to locate which quantiles are represented by tick marks on the quantile boxplot. Compare the values to those displayed in the quantiles chart for this analysis. How can these tick marks be useful for checking the symmetry or normality of a data set?
d) As you may recall, we altered the value for Tim’s weight from 84 to 384 pounds. Change the value back to 84 pounds. What effect does this have on the boxplots?

2. Use the same data set, but examine the height variable (given in inches). Format the output as you did for the weight variable.
a) Compare the information from the boxplots and histogram. Which graphic tool provides more information about the distribution of the data?
b) Use the hand tool to alter the number of intervals and their starting points in the histogram. How strongly does this affect the shape of the histogram? Is there any change in the accompanying boxplots? What does this suggest about the reliability of histograms as a sole tool for exploring data distributions?
c) What information can you gather about normality of the data from the boxplots?

3. Use the same data set, but examine the sex variable.
a) What happens to the boxplots for this variable? Why do you think this might happen?
b) Mosaic plots, such as the one displayed for sex, are most useful for comparing breakdowns of responses across subsets of a variable. What type of data is information about sex? We will work more with this data type in future weeks.
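
The boxplot construction described in section C can also be followed step by step in code. The sketch below is a minimal Python illustration under stated assumptions, not a description of how JMPin works internally: it uses the median-of-halves rule given in the text (JMPin may compute quartiles differently) and the 1.5 times interquartile range rule of the outlier boxplot. The function name boxplot_summary and the extra count of 60 are hypothetical.

```python
from statistics import median


def boxplot_summary(values):
    """Summarise a data set using the median-of-halves rule from section C."""
    data = sorted(values)              # order from smallest to largest
    n = len(data)
    half = (n + 1) // 2                # odd n: the middle value joins both halves
    lower_half = data[:half]
    upper_half = data[n - half:]
    q1 = median(lower_half)            # bottom edge of the box
    q3 = median(upper_half)            # top edge of the box
    iqr = q3 - q1                      # interquartile range
    low_fence = q1 - 1.5 * iqr         # points outside the fences are
    high_fence = q3 + 1.5 * iqr        # flagged individually as outliers
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {"median": median(data), "q1": q1, "q3": q3,
            "iqr": iqr, "outliers": outliers}


# Final salmon counts from the data entry problem.
salmon = [22, 18, 16, 12, 17, 23, 27, 29, 33, 17, 15, 11, 21, 23, 26]
summary = boxplot_summary(salmon)
print(summary)

# Hypothetical extra observation, added only to show how an outlier is flagged.
with_extreme = boxplot_summary(salmon + [60])
print(with_extreme)
```

Comparing the two summaries shows that the median and the box edges move only slightly when the extreme value is added, which is why these measures are described as resistant to outliers.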