IB STATISTICS HANDBOOK
Mag. Karl Schauer, BSc

CONTENTS

Why Statistics?
What should you know?
Types of Data
    Categorical data
    Ordinal data
    Numerical data
    Frequency
    Other data
Sampling Techniques
    Random sampling
    Systematic sampling
    Stratified sampling
Descriptive Statistics
    Averages
        Mean
        Median
        Mode
    Measures of spread
        Standard deviation
        Minimum, maximum and range
        Quartiles and interquartile ranges
    Histograms
    The normal curve
Data Tables
    Title
    Labels
    Data
    Summary statistics
    Formatting
Graphical Techniques
    Formatting and labelling
    Bar graphs
    Line graphs
    Scatter plots and correlation
        Correlation
        Extrapolation and interpolation
    Other graphs
Hypothesis Testing
    Testing for differences
        The t-test
        ANOVA
        The Mann-Whitney U test
    Testing for correlation
        The Pearson correlation test
        The Spearman rank correlation test
    Other statistical tests
        The chi-squared test
        Nearest neighbour analysis
Critical value tables
How to use spreadsheet software
More help and further reading

WHY STATISTICS?

An academic investigation is a way to try to answer a question. The question must be clearly defined, and a method determined to collect appropriate data. Predictions are then made based on the knowledge gained from answering previous questions in previous investigations. So where do statistics come in? Statistics are the tools you need to boil down all of your carefully collected data into a clear answer. Importantly, they also tell you how sure you can be of that answer.

WHAT SHOULD YOU KNOW?

In order to complete your internally assessed work or data-based extended essays in the IB, you will need to apply some basic statistics.
You will need to summarise and describe your data using descriptive statistics like averages and standard deviations. Then you will need to present your data in tables and graphs. Finally, depending on the investigation, you may need to perform a hypothesis test or other calculations to definitively answer your question.

You won't generally be expected to do the sometimes complicated calculations by hand. Tools like spreadsheet software (Excel, LibreOffice etc.) or your TI-Nspire handheld make many calculations a trivial matter of entering numbers. You may need to look up tutorials for your particular software online, as there are many different packages and platforms, but tutorials are readily available (see this list of resources for help). What is trickier, and mostly up to you, is deciding which statistics to apply in which circumstances, and understanding what those calculations tell you. This handbook should help you with those decisions and that understanding.

This handbook is intended to be used digitally, and contains some cross-referencing and external links. Underlined text, as well as the table of contents, can be clicked to take you where you want to go.

TYPES OF DATA

You might encounter all kinds of data in your investigations. It is important to distinguish between a few different types of data, because not all statistical techniques work with all types.

CATEGORICAL DATA

This type of data fits into defined categories. For example: red, green and blue as options for people's favourite colour are categories.

ORDINAL DATA

This is similar to categorical data, but there is a clear order to the groups. For example: low, medium and high income categories. These categories don't necessarily have the same distances between them.

NUMERICAL DATA

This type of data includes measurements of all kinds.
There is a clear order, as in ordinal data, but the distances between data points are clearly defined. Length, mass and speed are all numerical data.

FREQUENCY

When a statistician uses the word frequency, they generally mean a count of the number of things. 'What is the frequency of…' can usually be translated to 'how many…'.

OTHER DATA

This list is certainly not complete. There are other specialty types of data that you might encounter, but these should be sufficient for most of your investigations.

SAMPLING TECHNIQUES

Since you will never have enough time or resources to measure all of the possible data points in the population (and if you use statistics, you shouldn't need to), you will only ever measure a small portion of all of the possible points, called a sample. But which data points should go into the sample? In order to have a fair test, it is important that each possible data point is equally likely to be chosen for the sample. That is to say, there should be no sampling bias. To achieve this, you will need to use a sampling strategy that fits your investigation.

RANDOM SAMPLING

Just to be clear, it is not sufficient to claim that a sample is random if you have simply chosen places to measure 'at random'. You might have a subconscious bias for certain measurements. To be truly random, you need to assign a number to each possible data point, and use a random number generator to tell you which measurements to collect. One way to do this is with the RAND() function in spreadsheet software. Simply enter '=RAND()' in a cell, and it will show you a random number between 0 and 1. You can then multiply this number by whatever you need in order to get a random number between 0 and that value. For example: if you wanted a random number between 0 and 100, you could simply type '=RAND()*100' into a cell.

SYSTEMATIC SAMPLING

In this technique, you simply choose to sample at regular intervals.
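Both random and systematic selection can be sketched in Python; the population of 100 numbered points and the sampling interval of ten are made-up examples:

```python
import random

# a hypothetical population of 100 numbered measurement points
population = list(range(100))

# random sampling: number every point, let the generator pick
random_sample = random.sample(population, 10)

# systematic sampling: take every tenth point along the 'transect'
systematic_sample = population[::10]

print(sorted(random_sample))
print(systematic_sample)   # → [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
```

The random sample changes on every run; the systematic sample is always the same once the starting point and interval are fixed.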
For example, you might choose to make a measurement every ten meters along a transect.

STRATIFIED SAMPLING

This more complicated sampling method is used when the population is made up of different sub-sets that make up different proportions of the whole. It can be important to make sure that no sub-set is over- or under-represented in the data. This is most commonly used with survey data.

DESCRIPTIVE STATISTICS

Once you have collected your data, you will need to boil it down. Descriptive statistics, sometimes called summary statistics, do just that: they help your reader see the general trends and patterns in your data.

AVERAGES

MEAN

This is generally what is meant when someone says 'average'. It is the sum of the values divided by the number of values. This is the most common way to summarise sample data.

MEDIAN

This is the 'middle' data point: there are just as many data points above it as below it in the sample. This measure of average is more likely to be used if the sample is distributed in a strange way, or if outliers might strongly affect the mean. For example, in a sample measuring personal wealth, one or two billionaires might skew the mean so far upwards that the 'average' is higher than almost every individual value. In that case the median would better represent the sample.

MODE

The mode is much less commonly used. It is the 'most common' data point, or in other words, the data point with the highest frequency.

For example, given the seven values in the sample 1, 3, 3, 5, 8, 11, 12:

    Mean (average):       43/7 ≈ 6.1
    Median (middle):      5
    Mode (most common):   3

MEASURES OF SPREAD

Just knowing the average of a group only tells part of the story. Another important aspect is the spread of the data. How similar are the data points to each other?
Are they all basically the same, or are there wild differences? The following measurements all describe how spread out the data are around the average.

STANDARD DEVIATION

Standard deviation measures the typical distance between each data point and the mean (technically, it is the square root of the average squared deviation from the mean). A large standard deviation, relative to the size of the measurements, means that the data are very spread out: there are generally large differences between data points. A small standard deviation, relative to the size of the measurements, indicates that the measurements are close together, or that they 'agree'. Standard deviations are useful for numerical data. If the data are not numerical, or are not normally distributed, you will need a different way to show the spread of the data.

MINIMUM, MAXIMUM AND RANGE

To give a basic idea of the spread of your data, it is often good to include the total range of the data: point out the highest and lowest measured values (maximum and minimum), and the distance between them (range).

QUARTILES AND INTERQUARTILE RANGES

These measures of spread apply to the median in a similar way to how the standard deviation applies to the mean. Here is how they are calculated:

1. Arrange the data in rank order and divide it into four parts, each containing an equal number of values. Each section is called a quartile. The quartile containing the highest values is the upper quartile, while the one with the lowest values is the lower quartile.
2. The lower quartile value, Q1, is the mean of the highest value in the lower quartile and the lowest value in the quartile above it.
3. The upper quartile value, Q3, is the mean of the lowest value in the upper quartile and the highest value in the quartile below it.
4. The interquartile range (IQR) is the difference between the values calculated in steps 2 and 3: IQR = Q3 - Q1. A high IQR means the data are very dispersed, while a low IQR means the data are less dispersed.
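These steps can be sketched with Python's standard library. Quartile conventions differ slightly between software packages; the 'inclusive' method used here reproduces the mean-of-adjacent-values approach described above for this example:

```python
import statistics

# eleven example values, already in rank order
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# n=4 asks for quartile cut points; 'inclusive' interpolates between
# neighbouring values, matching the steps described above
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)  # → 3.5 6.0 8.5 5.0
```

Spreadsheet functions such as QUARTILE or PERCENTILE use the same interpolating convention, so their results should agree with this sketch.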
For example, with the eleven values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11: the median Q2 is 6, the lower quartile value Q1 is 3.5, and the upper quartile value Q3 is 8.5, so IQR = Q3 - Q1 = 8.5 - 3.5 = 5.

For a more detailed explanation, visit this site: https://stattrek.com/statistics/dictionary.aspx?definition=interquartile%20range

HISTOGRAMS

A histogram is a way to visualise data in a sample. It is essentially a bar chart with categories for the measured values (for example 1-10, 11-20, 21-30) on the x-axis and frequency (the number of data points in that category) on the y-axis. A histogram can be useful to show whether your data are normally distributed, that is, whether they generally look like a bell curve (see the section on the normal curve). It may be important to show this before you can use some types of hypothesis tests.

Your histogram might not show a normal curve. The data might be skewed in one direction, or even show several peaks. These might be important aspects to bring up in your evaluation, and might help you choose what hypothesis test, if any, you can use.

[Figure: a histogram of the distribution of age in a sample, which appears to be slightly right-skewed, alongside typical histogram shapes: normal, right-skewed and left-skewed.]

Here is a deeper explanation of histograms and how they are made by hand: https://youtu.be/4eLJGG2Ad30
Here is a longer explanation of the different shapes you might encounter in histograms: https://youtu.be/Y53_8WRrPzg

THE NORMAL CURVE

Many data just happen to fit a normal distribution curve, also called a bell curve or, more technically, a Gaussian curve. If your data fit this type of distribution, you can make some predictions using your data and the mathematics behind this curve. The standard, 'bell', or Gaussian curve shows the pattern of how normally distributed data spread around the mean. If your histogram looks like a bell curve, your data is probably normally distributed.
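Tallying measurements into histogram classes, as described in the histograms section above, can be sketched as follows; the ages and the bin width of 10 years are made-up choices:

```python
from collections import Counter

# made-up ages for illustration
ages = [23, 27, 31, 35, 36, 38, 41, 44, 47, 52, 58, 63, 71]

# map each value to the lower edge of its bin (20-29, 30-39, ...)
bins = Counter(10 * (age // 10) for age in ages)

# print a crude text histogram, one '#' per data point
for lower in sorted(bins):
    print(f"{lower}-{lower + 9}: {'#' * bins[lower]}")
```

Each row is one category of the histogram; the length of the row of '#' marks is the frequency for that category.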
The area under the normal curve shows how many data points are likely to be found in any given range: the x-axis shows distances from the mean in standard deviations (with the mean at 0), and the shaded areas under the curve represent the proportion of data points likely to be found in each section. On average, 68% of data points will be within one standard deviation of the mean, and 95% will fall within two standard deviations. This is helpful for predicting probabilities, and this type of mathematics is the basis for hypothesis tests.

The normal distribution curve is one of many used in statistics, but it is the most common shape you are likely to encounter. If your data appear to be normally distributed, that is, your histogram appears to have a normal curve shape, then you may be able to use some hypothesis tests that require normal data as a prerequisite.

DATA TABLES

Once you have boiled your data down into some tangible values, you will need to present the raw data and your descriptive statistics in well-organised tables. Designing data tables is an art form all its own. A few points might help you make yours beautiful.

TITLE

Be sure that each table has a meaningful and descriptive title (not just 'Table 1'). With multiple tables, it is usually a good idea to number them (hint: check that your numbers are right before you hand in a draft!) so that you can refer to them easily in your text.

LABELS

Your data columns need proper labels including:
• A clear, descriptive title of what data is listed in the column,
• The appropriate units of those values, and
• The measurement precision of those values.

DATA

The data itself should have the correct number of significant figures to reflect the precision of the data (see these links for help with significant figures). Be careful not to show more precision (more significant figures) in an average than in the measurements it was calculated from.
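As an illustration, rounding an average back to the precision of the raw data can be sketched as follows; the helper function is hypothetical, and the three readings are the 345, 330 and 404 mm heights from the example table below:

```python
from math import floor, log10

def round_sig(value, sig_figs):
    """Round a value to a given number of significant figures
    (hypothetical helper, not a built-in function)."""
    if value == 0:
        return 0.0
    return round(value, sig_figs - 1 - floor(log10(abs(value))))

# three raw heights, each measured to 3 significant figures
mean_height = (345 + 330 + 404) / 3   # 359.666..., too precise to report
print(round_sig(mean_height, 3))      # → 360.0
```

Reporting the mean as 360 mm matches the three significant figures of the raw measurements; reporting 359.667 would overstate the precision.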
You will likely need to format the cells of your table to show the appropriate number of digits, since trailing zeros disappear otherwise. If you have very large or very small values, use scientific notation.

SUMMARY STATISTICS

You may want to include your averages and standard deviations right in the table with your raw data. If you have a lot of data, or it is relatively complex, you might want to create a separate table of your summary statistics. Use whatever you think will help your reader see the data best.

FORMATTING

It is usually a good idea, if possible, to present your data table on one page. Having the first half of a table at the end of one page, with the second half continuing on the next, makes it very hard to get an overview of the data. Also, try to size your columns carefully to fit them on the page without muddling the titles.

Here is an example of a well-organised table:

Table 1: The height of 15 Z. mays plants after growing for 30 days at different fertiliser concentrations in three different field sites.

Concentration of        Plant height 30 days after germination (+/- 1 mm)
fertiliser in soil     Field site 1   Field site 2   Field site 3   Average   Standard deviation
(+/- 0.10 mg/kg)
0.10                        345            330            404          360            39
0.20                        442            410            430          427            16
0.30                        510            470            550          510            40
0.40                        580            530            603          571            37
0.50                        200            130            240          190            56

GRAPHICAL TECHNIQUES

Once you have presented your data in tables, you will need to make it more readily visible to your reader. It is important to choose the right graph for the type of data you are presenting. The formatting and labelling of the graph are also important. If done well, graphs should show the reader the answer to your research question at a glance.

FORMATTING AND LABELLING

Generally speaking, the same rules for tables apply to graphs.
Be sure that each graph has a clear and descriptive title, and that the axes are labelled in the same way as the columns of the corresponding tables. Make sure that the axes are scaled so that the data fill the graph, and that the scale numbers reflect the same level of precision as the data. Also, make sure that the independent variable (the one you changed or defined on purpose) is on the x-axis, and that the y-axis shows your dependent variable (the measured result). It is almost always best to graph your processed data, not the raw data, unless there is something important you want to show the reader about your raw data.

BAR GRAPHS

Bar graphs are used to represent numerical data (y-axis) from different categories (x-axis). Bar graphs of averages should have error bars showing standard deviation, or some other measure of spread. Somewhere on the graph or in its caption you need to declare what the error bars represent. Be sure that you are using the standard deviation values that you calculated in your tables, and not the automatic values that some software packages apply (incorrectly).

[Figure 1: The average growth of cress seeds after growing for 4 days under different coloured light (red, blue, green, yellow, orange). The error bars represent one standard deviation.]

LINE GRAPHS

Line graphs and scatter plots are often confused with each other. Line graphs show straight lines connecting the data points. This represents the fact that line graphs show multiple measurements of the same thing: the straight line is an assumed linear change of the measured value between measurements. Therefore, only use a line graph if you are tracking the change of something.
[Figure 2: Global average temperature (+/- 0.01 °C) from 1880 to 2000.]

SCATTER PLOTS AND CORRELATION

Scatter plots are used to compare numerical values on both axes. If both your independent and dependent variables are numerical measurements, this is probably the type of graph you should use. Each dot on the graph represents a data point, and this can show trends in the data. Usually this type of investigation is looking for some sort of relationship between the two variables. You will need to start with this type of graph to look for correlations, or in order to perform interpolations or extrapolations. If you graph average values, you will need error bars to show the spread of the data (see bar graphs). Be sure, however, to use all data points, not just averages, to calculate an R² value.

[Figure 3: A scatter plot showing a strong positive correlation between the height (+/- 2 m) and age (+/- 1 year) of the sampled trees, with the trend line y = 0.7334x - 2.8792 and R² = 0.9189.]

CORRELATION

A correlation is a relationship between two numerical variables. A correlation can be positive or negative:
• Positive correlation: as the independent variable increases, the dependent variable also increases.
• Negative correlation: as the independent variable increases, the dependent variable decreases.

[Figure: example scatter plots showing positive correlation, negative correlation, and no correlation.]

The line of best fit, or trend line, is chosen by the computer to be as close as possible to all of the data points. It is an approximation of the linear trend in the data. The closer all of the data points are to the trend line, the stronger the correlation.
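A line of best fit is usually computed by the method of least squares. Your spreadsheet does this for you, but as a sketch of what it is doing (the (x, y) data here are made up):

```python
# Least-squares line of best fit: y = m*x + b
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up independent variable
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up dependent variable

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = covariance of x and y divided by variance of x
m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))

# the line always passes through the point (mean_x, mean_y)
b = mean_y - m * mean_x

print(round(m, 3), round(b, 3))
```

The resulting m and b are exactly the numbers a spreadsheet shows in a trend line equation such as y = 0.7334x - 2.8792.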
The degree of correlation is measured by the correlation coefficient, r or R, more technically called the 'Pearson product-moment correlation coefficient'. This value ranges from -1, for a set of data that aligns perfectly to a line with a negative slope, to 1, for a set of data points that align perfectly to a line with a positive slope. The closer the r value is to either -1 or 1, the stronger the relationship between the two variables.

[Figure: example scatter plots for r values of 1, 0.8, 0.3, 0, -0.3, -0.8 and -1.]

Alternatively, correlation can be reported as the coefficient of determination, r² or R². This is simply the correlation coefficient squared, and it therefore always has a value between 0 and 1. This value is defined as the proportion of the variance in the dependent variable that can be predicted from the independent variable. For example, with an r² value of 1, all of the data points align perfectly with the trend line, which means that the value of the dependent variable can be predicted exactly for any value of the independent variable. For an r² value of 0.8, 80% of the variance in the dependent variable is accounted for by the independent variable, and the remaining 20% is unexplained scatter. Predicting the values of variables based on a correlation is called extrapolation or interpolation.

In order to make claims about a linear correlation, it is important that the data show a linear trend. If the data are not linear, or not expected to be linear, it is not appropriate to compare them to a trend line! Here are some examples of when linear regression is not appropriate:

• Enzyme activity is expected to increase as temperature increases, then peak at the optimum temperature for that enzyme, then drop sharply as the enzyme denatures at higher temperatures.
A graph comparing temperature and enzyme activity might therefore look something like this:

[Figure: enzyme activity rising with temperature (+/- 1 °C) to a peak and then dropping sharply. This shape of graph should not be compared to a line.]

• The rate of a chemical reaction slows over time as substrate is used up. The shape of the curve produced is predictable and depends on the type of reaction. The graph of the concentration of the product over time might look something like this:

[Figure: concentration of product (+/- 0.1 M) rising steeply at first and then levelling off over time (+/- 1 s). This shape of graph should not be compared to a line; instead, a Spearman rank test can be performed.]

When interpreting r or r² values, be realistic about the strength of the correlation. An r² value of 0.3 may or may not indicate any kind of relevant relationship. If you want more certainty about whether your correlation is statistically significant, consider using a Pearson's r test or a Spearman's rank test for correlation. You can read more about these tests in the section on hypothesis testing.

EXTRAPOLATION AND INTERPOLATION

If you have determined a strong linear relationship in your data, you can use it to make predictions: use the equation of the line of best fit to calculate the expected value for an unknown. Interpolation is using the trend line to predict values within the range of your data. Extrapolation is extending the trend line beyond the data to make predictions outside of the range of the data. The further the predicted value is from the measured values, the less reliable the extrapolated value will be.

An example of extrapolation is using current trends in climate change to make predictions about how the planet's climate will continue to change in the future; this is how climatologists make predictions about how warm the earth might be in the coming decades. An example of interpolation is determining the osmolarity of a tissue.
Suppose you measured the rate of osmosis in potato tissue in various concentrations of sugar solution, and your data were linear:

[Figure: change in mass of potato tissue after 1 h (+/- 0.01 g) plotted against the concentration of sucrose solution (+/- 0.01 M), with the trend line y = -7.1143x + 2.073 and R² = 0.9659, crossing the x-axis at 0.29 M. You must use all of the data points (not just the averages!) in order for your software to accurately calculate R².]

The high R² value of 0.97 shows that the data have a strong linear correlation, and the negative slope of -7.1143 shows that the relationship is a negative correlation. Because the trend is very strong, the equation for the line can be used to make predictions.

You were asked to determine what concentration would be isotonic to the potato tissue, that is, at what concentration no net osmosis would occur. At this concentration the change in the mass of the potato tissue would be zero. To find the corresponding concentration, substitute zero for y (the change in mass) in the trend line equation and solve for x (the concentration):

    y = -7.1143x + 2.073
    0 = -7.1143x + 2.073
    7.1143x = 2.073
    x = 2.073 / 7.1143 ≈ 0.2914

This value of 0.29 M is where the trend line crosses the x-axis, and it is the osmolarity, or isotonic concentration, of the potato tissue. It must be rounded to reflect the precision of the measurements that were used to calculate it. This process could be repeated to predict any given change in mass or any concentration within the range of the data.

OTHER GRAPHS

Though the graphs listed above are the ones you are most likely to need, there are of course many other types of graphs. Here are two others that you might consider:

Pie charts show the breakdown of a group into its parts, usually percentages. The percentages should add up to 100. Avoid too many categories, as the chart can quickly become difficult to read.
Radar charts can be used to show many different attributes at once, and to compare these between locations or individuals.

HYPOTHESIS TESTING

The goal of an experiment or investigation is to answer a specific question, and the data should make it clear what the answer to that question is. Often, though, due to the uncertainty inherent in data, the answer is not entirely clear. It may look like there is a difference between two groups, but the difference might only be due to chance. It may appear that there is a correlation between two variables, but the sample may have been a fluke. Hypothesis testing allows you to determine how sure you can be of the answer, and the likelihood of the observed pattern being due to chance.

A hypothesis test requires that you make an assumption, and then calculate how likely your data would be if that assumption were true. This assumption is called the null hypothesis, H0. If the null hypothesis can be shown to be very unlikely to have produced your data, then you can conclude instead that the alternative hypothesis, HA, is true.

Despite the naming, these hypotheses are different from your experimental hypothesis, that is, your reasoning about what you think will happen in your experiment. You always need to declare and explain an experimental hypothesis in the exploration portion of your work. You only need to declare null and alternative hypotheses in the context of a hypothesis test, if you choose to use one; this should be included in your explanation of the data analysis.

A hypothesis test generates a test statistic. The value of this test statistic gives you information about how likely your result would be if the null hypothesis were true. The test statistic can be compared to a table of critical values: it must be higher or lower than the critical value in order to conclude a statistically significant result. Usually this process is simplified, and a p-value can be calculated based on the test statistic.
The p of p-value stands for probability: it is the probability of obtaining a result at least as extreme as yours purely by chance if the null hypothesis were true. It is always a value between 0 and 1 (i.e. 0 and 100%). If the p-value is low enough, then your data would be very surprising if the null hypothesis were true, and the null hypothesis can safely be rejected. When the null hypothesis is rejected, the alternative hypothesis can be concluded, and you have a statistically significant result.

The p-value is compared to the alpha value. For our intents and purposes you will use an alpha value of 0.05. This is the threshold below which you decide that the null hypothesis is too unlikely to be reasonable: if the p-value is less than 5%, you reject the null hypothesis. For an example of this process, read the section on t-testing.

If the p-value is above the alpha threshold of 0.05, then you must 'fail to reject the null hypothesis'. This is different from accepting the null hypothesis! You don't have enough evidence to conclude that the null hypothesis is true. Instead you simply 'fail to reject' it and conclude that you cannot be sure whether the observed result is due to random chance or a real effect. For example, if a test gives a p-value of 0.2, a result at least as extreme as yours would occur by chance about 20% of the time if the null hypothesis were true. That is far too often to rule out chance, so you cannot conclude a real effect; but it is also not evidence that there is no effect. Therefore you simply 'fail to reject the null hypothesis'.

TESTING FOR DIFFERENCES

Often an investigation aims to find differences between groups. The t-test, ANOVA, and the Mann-Whitney U test are different ways to determine whether observed differences are statistically significant, or just due to random chance.
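Whichever test you use, the p-value decision rule described above can be sketched as a tiny function (the wording of the returned messages is just an illustration):

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when p < alpha."""
    if p_value < alpha:
        return "reject H0: the result is statistically significant"
    return "fail to reject H0: the result could be due to chance"

print(decide(0.03))  # below alpha, so reject
print(decide(0.2))   # above alpha, so fail to reject
```

Note that the rule is asymmetric: a small p-value lets you reject H0, but a large one never lets you accept it.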
The t-test is used when the data can be assumed to be normal and the sample sizes are relatively large (more than 10 measurements). It might be a good idea to make a histogram to see if the data appear to be normal, but at the very least you should state that you assume the data to be normally distributed, and why you think so. If the assumptions of normality are met but you have more than two groups, you will need to perform an ANOVA (analysis of variance) test to see if the variability between the groups is due to chance or some real effect. If it is not safe to assume that the data are normally distributed, if you have small samples, or if your data are ordinal rather than numerical, then you can make a comparison between two groups using the Mann-Whitney U test instead. This test is less powerful, meaning it is less likely to detect a real difference, but it is safer to use if the prerequisites for a t-test are unclear or not met.

THE T-TEST

The t-test assumes a null hypothesis that there is no significant difference between the groups (any observed difference is due to chance), then calculates how likely your data would be under that hypothesis.

H0: There is no significant difference between the two groups.
HA: The observed difference between the groups is statistically significant, and not likely due to chance.

Suppose you want to find out if dandelions (T. officinale) grow to different heights in two different types of soil. In your experiment you measure the growth of the dandelions in each of two soil types.

Figure 1: The average maximum height of 16 T. officinale plants grown in two different soil types. Error bars represent one standard deviation.

Maximum achieved height of T. officinale plants (±0.1 cm)

Soil a    Soil b
 8.0      15.5
 6.3      14.7
 9.1      14.5
13.2      12.2
12.0      10.1
 6.3      15.0
10.0      12.1
11.0      13.2
12.1      16.0
 9.8      14.2
 8.5      13.1
12.2       9.9
 9.7      17.8
10.1      10.3
13.2      16.4
10.3      19.0

Average              10.1      14.0
Standard deviation    2.1       2.7

You notice a difference between the groups: the plants in soil b have grown taller on average than the plants in soil a. Since the error bars overlap, it is hard to say whether this observed difference is due to chance, or whether the two groups are really different. Therefore you decide to perform a t-test to find out. First, you need to determine whether the data are normally distributed, so you make a histogram to see.

Figure 2: A histogram of the data (frequency of plants per 2 cm height class, from 6.0–7.9 to 18.0–19.9 cm) shows that both groups appear to be normally distributed.

Because the data appear to be normally distributed, and the sample sizes are sufficiently large (n=16), you can proceed with the t-test. Using the T.TEST() function of your spreadsheet software, you enter the following information:

=T.TEST(dataset from soil a, dataset from soil b, 2, 2)

The two datasets are the lists of raw values, not the averages. The two 2's in the syntax tell the software what kind of t-test to perform: you want a 'two-tailed', 'non-paired' test. (The exact punctuation of the formula may or may not be required in your software.) The T.TEST() function returns the p-value for the test: the probability of obtaining data like yours if H0 were true. You can watch a tutorial video on the t-test in Excel here: https://youtu.be/DPNUpldVC4M

As the p-value decreases, the likelihood of the difference being due to random chance also decreases.
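If you would like to check the spreadsheet result by hand, the t statistic behind a non-paired, equal-variance t-test can be computed directly. The sketch below uses only the Python standard library and the soil data from the table above; the function name and the critical value quoted in the comment (about 2.04 for 30 degrees of freedom at alpha = 0.05, two-tailed) are illustrative additions, not part of the T.TEST() method itself.

```python
from math import sqrt
from statistics import mean, stdev  # stdev is the *sample* standard deviation

def t_statistic(a, b):
    """Pooled two-sample (Student's) t statistic for two independent groups."""
    na, nb = len(a), len(b)
    # pooled variance combines the spread of both samples
    sp2 = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(b) - mean(a)) / sqrt(sp2 * (1 / na + 1 / nb))

soil_a = [8.0, 6.3, 9.1, 13.2, 12.0, 6.3, 10.0, 11.0,
          12.1, 9.8, 8.5, 12.2, 9.7, 10.1, 13.2, 10.3]
soil_b = [15.5, 14.7, 14.5, 12.2, 10.1, 15.0, 12.1, 13.2,
          16.0, 14.2, 13.1, 9.9, 17.8, 10.3, 16.4, 19.0]

t = t_statistic(soil_a, soil_b)
print(round(t, 2))  # about 4.5, well above the critical value of about 2.04 for df = 30
```

A t statistic this large corresponds to the very small p-value the spreadsheet reports, so both routes lead to the same conclusion.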
Eventually, the p-value is so small that it is no longer reasonable to blame random chance, and the null hypothesis can be rejected. The most commonly used threshold level (also called the alpha value) for rejecting the null hypothesis is 0.05. That means that if the p-value sinks below 0.05 (5%), then the null hypothesis can be rejected and the alternative hypothesis accepted.

In the case of these data the p-value is 0.00008. This is well below the threshold level of 0.05, and therefore the null hypothesis can be rejected. The observed difference between the two soils is not due to random chance; it is a statistically significant difference.

ANOVA

The ANOVA test should be used if you have more than two groups to compare. Though you could theoretically perform many t-tests between each of the possible combinations, this is inefficient and mathematically risky. Every time you perform a t-test, there is a small probability that your difference was in fact due to a random fluke, and not a real difference. If you perform many t-tests, the likelihood of making such an error increases.

The ANOVA test assumes the following hypotheses:

H0: The groups are all the same.
HA: At least one of the groups appears to be different from the rest. The variability between the groups is not likely to be due to chance.

The ANOVA test produces a p-value that can be interpreted in the same way as in the t-test. If the p-value is below the threshold of 0.05, then the null hypothesis can be rejected and the alternative hypothesis concluded. The ANOVA test does not tell you which groups differ from each other, only that the variability between the groups is not due to chance.
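To see what ANOVA actually measures, the F statistic can be computed by hand: it is the ratio of the variability between the group means to the variability within the groups. The sketch below uses only the Python standard library; the three groups of measurements are invented purely for illustration.

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F statistic: between-group variance / within-group variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of measurements
    # sum of squares between groups: how far each group mean sits from the grand mean
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # sum of squares within groups: how spread out each group is internally
    ssw = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ssb / (k - 1)) / (ssw / (n - k))

groups = [[1, 2, 3], [2, 3, 4], [7, 8, 9]]  # hypothetical measurements for three groups
f = f_statistic(groups)
print(round(f, 1))  # 31.0 - the group means differ far more than the spread within groups
```

A large F produces a small p-value, just as a large t statistic does in the t-test.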
You can learn how to perform the ANOVA test in Excel or LibreOffice here:

Excel: https://youtu.be/qQSQr_JldyY
LibreOffice: https://youtu.be/TxTKq4W8qX8

THE MANN-WHITNEY U TEST

The Mann-Whitney U test is mathematically very different from the t-test, but achieves a similar goal: finding out whether an observed difference is statistically significant. Instead of using the actual values of the measurements for the comparison, it simply compares the rank order of the values. This is somewhat analogous to the difference between the mean and the median, with the t-test being similar to the mean and the Mann-Whitney U similar to the median. A hypothesis test that does not rely on the actual values of the measurements is called a non-parametric test.

The null and alternative hypotheses are the same as for the t-test:

H0: There is no significant difference between the two groups.
HA: The observed difference between the groups is statistically significant, and not likely due to chance.

Although you can painstakingly calculate the Mann-Whitney U statistic by hand, with some help from your spreadsheet software, this is not a requirement of the IB. Instead simply enter your two sets of data in this online calculator:

https://www.socscistatistics.com/tests/mannwhitney/default2.aspx

The calculator gives you the p-value for the test, which you compare to the alpha threshold of 0.05 as in the t-test. If the p-value is below 0.05, you can reject the null hypothesis and conclude that there is a statistically significant difference.

TESTING FOR CORRELATION

If your investigation intends to look for a relationship between two variables, it might be a good idea to test whether the correlation your data suggest is statistically significant or likely to be due to chance. A correlation test does just that. These tests work in a similar way to tests for differences.
The two most relevant tests are the Pearson product-moment correlation test (also called the Pearson correlation test) and the Spearman rank correlation test.

THE PEARSON CORRELATION TEST

The Pearson correlation test, similar to the t-test, requires the data to meet some prerequisites. In order to run this test, the following conditions must be true:

• The data are numerical for both variables.
• The data are paired, that is, there are two measurements or values for each data point: the dependent and independent variables.
• The data follow a linear trend.
• The data can be assumed to be normally distributed for both variables.
• There are no obvious outliers in the data set.

If these conditions are met, you can continue with the test. The null and alternative hypotheses are as follows:

H0: There is no correlation in the data. The observed trend is due to random chance.
HA: The observed trend is a statistically significant correlation.

In your spreadsheet software, enter the formula to calculate r, the Pearson correlation coefficient:

=PEARSON(dataset of independent variable, dataset of dependent variable)

This formula calculates the r value, which then needs to be compared to a critical value table (see the critical value tables at the end of this handbook). The critical value table shows what values of r are significant. The strength of the test depends on the number of data points included (n), so the critical value also changes with n. Keep in mind that one data point has two measurements. If you were comparing, for example, height and weight of plants, and measured the height and weight of 12 plants, then n would be 12, not 24. You simply need to compare the r value that you calculated to the critical value corresponding to the number of data points you used. If the absolute value of r is greater than the critical value, then the correlation is statistically significant and not likely to be due to chance.
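The PEARSON() formula can be reproduced in a few lines, which also shows what r measures: how the two variables vary together, scaled by how much each varies on its own. A minimal standard-library Python sketch; the paired data below are invented for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two paired lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))  # how x and y vary together
    sx = sqrt(sum((a - mx) ** 2 for a in x))              # spread of x
    sy = sqrt(sum((b - my) ** 2 for b in y))              # spread of y
    return cov / (sx * sy)

heights = [4.1, 5.0, 6.2, 7.1, 8.3]       # hypothetical independent variable
weights = [10.5, 11.9, 14.1, 15.8, 18.0]  # hypothetical dependent variable
r = pearson_r(heights, weights)
```

With n = 5 data points here, you would compare the absolute value of r to the critical value for n = 5 in the table at the end of this handbook.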
For example, if your r value was -0.65 and you had 10 data points, you would compare it to the critical value 0.521 from the critical value table. Since the absolute value of r (0.65) is greater than the critical value, the correlation is statistically significant.

SPEARMAN RANK CORRELATION TEST

The Spearman rank correlation is denoted by the symbol ρ (rho) or rs. Analogous to how the Mann-Whitney U test compares ranks instead of the actual values of the data, the Spearman rank test determines a correlation in the data by looking at the rank order of the data instead of its actual values. Use this test instead of the Pearson correlation test if any of the following are true:

• The data are not numerical but are ordinal.
• The data are not normally distributed.
• The data do not appear to have a linear trend, but do trend either positively or negatively.
• There are apparent outliers in the data.

The null and alternative hypotheses are the same as for the Pearson correlation test:

H0: There is no correlation in the data. The observed trend is due to random chance.
HA: The observed trend is a statistically significant correlation.

To calculate the test statistic, simply enter your data in the calculator at this site, and interpret the p-value as in the other tests:

https://www.socscistatistics.com/tests/spearman/default2.aspx

OTHER STATISTICAL TESTS

Though making comparisons between groups and determining correlations between variables are the most common statistical tests, there are many other ways to test data for significance. The chi-squared test for goodness of fit and the nearest neighbour analysis are two that you might need for biology and geography respectively.

THE CHI-SQUARED TEST

This test determines whether data fit a pattern or model. The data for the test need to be categorical.
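The chi-squared statistic itself is simple: for each category, compare the observed count O with the count E expected under the model. A minimal standard-library Python sketch; the counts below (a 3:1 expected ratio in 200 offspring) are invented for illustration.

```python
def chi_squared(observed, expected):
    """Chi-squared statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [160, 40]   # hypothetical counted genotypes
expected = [150, 50]   # counts expected from a 3:1 ratio of 200 offspring
chi2 = chi_squared(observed, expected)
print(round(chi2, 2))  # 2.67
```

As with the other tests, the resulting statistic is compared to a critical value (or converted to a p-value) to decide whether the data deviate significantly from the model.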
One of the simplest versions of this test is a test for an association between species, that is, whether the locations of one species of immobile organism coincide with the locations of another species. A chi-squared test can also be used in genetics to determine whether the frequency of genotypes matches the expected ratios. For more information about this type of test and how to calculate and interpret the chi-squared value, refer to your Oxford biology textbook on pages 215 (association between species) and 453 (genotype ratios). Here is a demonstration of how you can use Excel to calculate chi-squared: https://youtu.be/o0VhMWeotFg

NEAREST NEIGHBOUR ANALYSIS

In geography, the nearest neighbour analysis can be used to determine whether the spacing between points is random, clustered, or ordered. First, the data are collected by measuring the distance between each location (e.g. each tree) and its nearest neighbour. The nearest neighbour index (NNI or Rn) is then calculated according to the following formula:

NNI = 2D̄ × √(n / A)

where D̄ is the average nearest neighbour distance, n is the number of observations, and A is the total area studied. The NNI value can indicate that the points are clustered, random, or ordered, depending on its value:

NNI = 0: the points are completely clustered.
NNI = 1.0: the points have a completely random distribution.
NNI = 2.15: the points are distributed uniformly.

Given the number of data points, you can compare the NNI value to the critical value table. If NNI is below the number for clustered points, then there is statistically significant clustering. If it is above the value for uniformity, then the points are statistically significantly uniform. If it lies between the two values, the points are randomly distributed.
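The NNI formula translates directly into code. The sketch below, in standard-library Python, uses the nine nearest-neighbour distances from the park example worked through below; the function name is an illustrative choice.

```python
from math import sqrt

def nearest_neighbour_index(distances, area):
    """NNI = 2 * (mean nearest-neighbour distance) * sqrt(n / A)."""
    n = len(distances)
    d_bar = sum(distances) / n  # average nearest-neighbour distance
    return 2 * d_bar * sqrt(n / area)

distances = [1.1, 1.1, 1.3, 0.4, 1.2, 1.0, 0.4, 2.0, 2.0]  # metres, one per tree
nni = nearest_neighbour_index(distances, area=36)           # park area in m^2
print(round(nni, 2))  # 1.17 - between the critical values for n = 9, so the spacing is random
```

The same comparison against the critical value table then tells you whether the pattern is significantly clustered, significantly uniform, or random.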
For example, the following data were collected by measuring the distances between trees in a 36 m² park:

Tree number   Nearest tree   Distance between trees (m)
1             2              1.1
2             1              1.1
3             2              1.3
4             7              0.4
5             3              1.2
6             7              1.0
7             4              0.4
8             9              2.0
9             8              2.0

The average nearest neighbour distance is

D̄ = (1.1 + 1.1 + 1.3 + 0.4 + 1.2 + 1.0 + 0.4 + 2.0 + 2.0) / 9 ≈ 1.17 m

Since n = 9 and A = 36 m², the NNI value is therefore calculated as:

NNI = 2 × 1.17 × √(9 / 36) ≈ 1.2

For n = 9, the critical value table gives 0.713 as the limit below which the points would be considered clustered, and 1.287 as the upper limit, above which the points would be considered ordered. You can therefore conclude that the trees are randomly dispersed.

CRITICAL VALUE TABLES

Pearson r

n      Critical value
2      0.988
3      0.900
4      0.805
5      0.729
6      0.669
7      0.622
8      0.582
9      0.549
10     0.521
11     0.497
12     0.476
13     0.458
14     0.441
15     0.426
16     0.412
17     0.400
18     0.389
19     0.378
20     0.369
21     0.360
26     0.323
31     0.296
36     0.275
41     0.257
46     0.243
51     0.231
61     0.211
71     0.195
81     0.183
91     0.173
101    0.164

Nearest neighbour index

n      clustered   uniform
2      0.392       1.608
3      0.504       1.497
4      0.570       1.430
5      0.616       1.385
6      0.649       1.351
7      0.675       1.325
8      0.696       1.304
9      0.713       1.287
10     0.728       1.272
11     0.741       1.259
12     0.752       1.248
13     0.762       1.239
14     0.770       1.230
15     0.778       1.222
16     0.785       1.215
17     0.792       1.209
18     0.797       1.203
19     0.803       1.197
20     0.808       1.192
21     0.812       1.188
22     0.817       1.183
23     0.821       1.179
24     0.825       1.176
25     0.828       1.172
26     0.831       1.169
27     0.835       1.166
28     0.838       1.163
29     0.840       1.160
30     0.843       1.157
31     0.846       1.155
32     0.848       1.152
33     0.850       1.150
34     0.853       1.148
35     0.855       1.145
36     0.857       1.143
37     0.859       1.141
38     0.861       1.140
39     0.862       1.138
40     0.864       1.136
41     0.866       1.134
42     0.867       1.133
43     0.869       1.131
44     0.870       1.130
45     0.872       1.128
50     0.878       1.122
60     0.889       1.111
70     0.897       1.103
80     0.904       1.096
90     0.909       1.091
100    0.914       1.086

HOW TO USE SPREADSHEET SOFTWARE

You will need to spend some time learning how to use your brand of software on your platform, as they all differ somewhat.
Excel© (subscription based) and LibreOffice© (freeware) are both good options, but you could also use Numbers© on MacOS or Google© Sheets, though the latter has some significant limitations. Many of these calculations can also be performed on a TI-Nspire© handheld. Searching the web, or using your software's help function, will usually yield quick answers to tricky problems. Here are some tips and resources that might help you on your way:

Tip: Be sure you know whether your software expects a decimal point ( . ) or a comma ( , ) as a separator. If you use the wrong one, the computer does not recognise your data as numbers, but instead treats it as text, which causes all calculations to fail. Use the 'search and replace' function of your software to change all of them at once.

Tip: Use this site to find the appropriate function in your software's language: http://www.excelfunctions.eu

The Moodle site 7AB Tabellenkalkulation has guides for performing simple calculations and making diagrams in Excel© and LibreOffice© here:
https://moodle.tsn.at/course/view.php?id=36089

At Mr. Schauer's YouTube channel you can find a handful of videos on data analysis: bit.ly/mrschauersyoutube

Here are some instructions on how to make a histogram in Excel:
https://support.office.com/en-us/article/create-a-histogram-85680173-064b-4024-b39d-80f17ff2f4e8

For more information on calculating quartiles and the interquartile range using Excel, visit this site:
https://www.statisticshowto.com/probability-and-statistics/interquartile-range/#IQRExcel

MORE HELP AND FURTHER READING

For help reviewing how to use significant figures appropriately, watch these videos:

An introduction to significant figures: https://youtu.be/eCJ76hz7jPM
Rules to determine significant figures: https://youtu.be/eMl2z3ezlrQ

For lots of in-depth information on the geography Internal Assessment, visit these pages:

https://www.thinkib.net/geography/page/22606/ia-student-guide
https://sites.google.com/site/geographyfais/fieldwork

For more help with biological statistics for the IA, visit this site:
https://www.biologyforlife.com/statistics.html

For a very useful handbook of basic statistics, look for a copy of this book:

St. John, P. and Richardson, D. A. (1996). Methods of Statistical Analysis of Fieldwork Data. Geographical Association.