Academic Skills Advice

Projects Involving Statistics (& SPSS)

Starting a project that involves statistics can feel confusing, as there seem to be many different things you can do (charts, graphs, tests, etc.) and different ways of looking at your data. This summary provides suggestions for getting started when using SPSS for your project.

Using Statistics:
You can use statistics to:
- look at what has already happened,
- draw conclusions,
- predict what is likely to happen in the future.

Before you start:
- You need to know exactly what your question is: what do you want to know, and why?
- Only collect data that will help you to answer your question.
- You will need to decide what type of sampling to use. (You probably already use sampling in real life: for example, you might sample a small cube of cheese in the supermarket and conclude that you will enjoy the whole block.)

Experiment: Look at what’s already happened
- Collect, describe and organise your data.
- Look at averages, spread, shape etc.

Predict: Make general conclusions about the whole population based on your sample
- Test your data and use your results to predict what might happen in the future.

© H Jackson 2012 /2014/ Academic Skills

A brief overview of SPSS:
SPSS is a useful stats package which helps you to analyse your data and draw conclusions from it. It has 2 different windows:
- the input window, where you set up and enter your data,
- the output window, which appears every time you ask SPSS to do something (e.g. produce a report or chart).
The input window has 2 tabs:
- the variable view (for specifying what type of data should be entered),
- the data view (for inputting your data, like a spreadsheet).

A suggested order to follow:
Experiment:
- Set up your variables in the variable view tab.
- Enter your data in the data view tab (1 row = 1 case, e.g. 1 subject’s data).
- It’s often good to start with “Explore”, as this can give you ideas about your data and where to start (Analyze / Descriptive Statistics / Explore).
- Generate any appropriate charts or graphs that help you to see what is happening with your data (e.g. bar charts to compare frequencies, box plots to compare distributions).
- Look at descriptive statistics (e.g. mean, max, min, standard deviation etc.) and make comparisons (decide what they tell you).
- If your questions are about the relationships between data, look at scatter graphs, correlation, crosstabs, regression etc.

Predict:
Once you have done all the comparisons and drawn some conclusions, you need to decide how likely your results are to happen again in the future. You can test your idea (hypothesis) by doing hypothesis testing:
- Check for normality to see what sort of data you have and, therefore, which tests can be performed (using histograms, the Normal Q-Q plot, and the Kolmogorov-Smirnov or Shapiro-Wilk test). This helps to decide if parametric tests are appropriate.
- Decide on the test to use (see decision making flow chart).
- The test you choose will tell you how significant your results are and whether they are likely to happen again or if they are just due to chance.

Some useful information:
The Null Hypothesis (often denoted 𝐻0): the assumption that what you were testing is not true and that things just happened by chance. Instead of trying to prove that your idea is right, you will be trying to show that the null hypothesis is probably wrong.
Example of a null hypothesis: 𝐻0: there is no difference between the means (𝜇1 = 𝜇2).

The Alternative Hypothesis (often denoted 𝐻𝐴 or 𝐻1): this is your idea, i.e. what you think is true. You have to assume this is wrong until you find evidence to say otherwise.
Example of an alternative hypothesis: 𝐻𝐴: there is a difference between the means (𝜇1 ≠ 𝜇2).

The ‘p’ value is the common name for the ‘sig’ value produced by the various tests.
It is the probability of obtaining results at least as extreme as yours if 𝐻0 is true. The ‘p’ value provides evidence for us to decide whether we can reject 𝐻0. Commonly, if p ≤ 0.05 then you reject 𝐻0 and accept 𝐻𝐴 (normally this means that you have found a significant difference at the 5% level).

If your test statistic falls in the critical region you would reject 𝐻0.

When SPSS reports test statistics (e.g. ‘F’, ‘t’ value etc.) we tend to just look at the significance (‘p’) value, because SPSS has done all the hard work for us and found the probability corresponding to the test statistic (we used to have to look this up in tables).

Generally, if p ≤ 0.05 we “reject the null hypothesis”; otherwise we say “there is not enough evidence to reject the null hypothesis” (we cannot say that we “accept the null hypothesis”). However, we can “accept the alternative hypothesis”.

An example scenario:
You have done some investigating and think that the population of the village you live in is of above average intelligence. If the average intelligence rating is 100, then your hypothesis is that the mean for your village is greater than 100. Remember that the null hypothesis says that you are wrong and everything is equal:
𝐻0: your village population has average intelligence (𝜇 = 100).
𝐻1: your village population has above average intelligence (𝜇 > 100).
You run an appropriate test and SPSS reports a ‘p’ value of 0.0065 (this is equivalent to 0.65%, i.e. less than 1%). The ‘p’ value is less than 0.05, so you will reject the null hypothesis and report that you are confident that your village has above average intelligence.
(The ‘p’ value is saying that if 𝐻0 is true then there is only a 0.65% chance of getting data like yours. This is such a small chance that we conclude that 𝐻0 is probably not true and reject it.)

Hypothesis Test Decision Making Flow Chart

Data type?
  Categorical: Chi-Squared test (one sample or two sample)
  Continuous: Questions about relationships or differences between outcome means?
    Relationships: How many variables?
      2: Correlation Analysis (Parametric or Non-Parametric?)
        Parametric: Pearson’s r or simple regression
        Non-Parametric: Spearman’s Rank Correlation
      More than 2: Multiple Regression Analysis
    Differences: How many groups? (2 or more than 2?)
      2: Parametric or Non-Parametric?
        Parametric: T-test (independent or paired)
        Non-Parametric: Mann-Whitney U Test (different groups) or Wilcoxon Signed-Rank Test (same group)
      More than 2: Parametric or Non-Parametric?
        Parametric: ANOVA
        Non-Parametric: Kruskal-Wallis Test

Please note that this flow chart is neither definitive nor exhaustive. There are other tests and approaches, and you should bear this in mind with regard to your own data.

Glossary of terms:
There are lots of technical (and sometimes confusing) terms used in statistics. This glossary is to give a basic idea, in “layman’s terms”, of what things mean.

Categorical data: Data that can only take certain values; there are gaps between them. E.g. shoe size (you can buy size 5 or 5 ½ but not size 5.3), counting (we count 1, 2, 3, etc.), people in a class (there may be 22 or 23 but not 22.5).

Continuous data: Data that can take any value; there are no gaps between the values. E.g. height (a person can be 150cm or 150.1cm, or even 150.15cm).

Differences: Looking at the difference between the means (averages) of different sets of data.

Non-parametric data: Data that does not satisfy the assumptions of parametric data.

Parametric data: A large part of this is checking that the data is normally distributed. Strictly speaking we should also check that the data:
- has the same variances,
- is at least interval (scale) level,
- is independent (e.g. answers from one participant do not affect another.
  Repeated data from the same participant, such as one person doing the same test twice, is OK)

Relationships: Looking at the relationship between sets of data. For example, does one set of data increase as the other increases, or vice versa? Or is there no pattern (relationship)? This is often called correlation.

𝜇: The arithmetic mean (average)
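
As a rough illustration, the decision making flow chart and the 5% rule described earlier in this handout could be written as two small Python functions. This is only a sketch: the names used here (choose_test, decide, how_many, same_group) are made up for this example and are not part of SPSS, and, like the flow chart itself, the code is neither definitive nor exhaustive. For two related samples it names the Wilcoxon signed-rank test, which is the usual Wilcoxon test for the same group measured twice.

```python
def choose_test(data_type, question=None, how_many=None,
                parametric=False, same_group=False):
    """Walk the flow chart and return the name of the suggested test.

    data_type  : "categorical" or "continuous"
    question   : "relationships" or "differences" (continuous data only)
    how_many   : number of variables (relationships) or groups (differences)
    parametric : True if the parametric assumptions look reasonable
    same_group : True if the same group was measured twice (hypothetical flag)
    """
    if data_type == "categorical":
        return "Chi-Squared test"
    if data_type != "continuous":
        raise ValueError("data_type must be 'categorical' or 'continuous'")
    if how_many is None or how_many < 2:
        raise ValueError("how_many must be 2 or more")

    if question == "relationships":
        if how_many > 2:
            return "Multiple Regression Analysis"
        if parametric:
            return "Pearson's r or simple regression"
        return "Spearman's Rank Correlation"

    if question == "differences":
        if how_many > 2:
            return "ANOVA" if parametric else "Kruskal-Wallis Test"
        if parametric:
            return "T-test (independent or paired)"
        # Two groups, non-parametric: the choice depends on whether
        # the two sets of scores come from the same group.
        if same_group:
            return "Wilcoxon Signed-Rank Test"
        return "Mann-Whitney U Test"

    raise ValueError("question must be 'relationships' or 'differences'")


def decide(p_value, alpha=0.05):
    """Apply the 5% rule: reject H0 when p is at or below alpha."""
    if p_value <= alpha:
        return "reject H0"
    return "not enough evidence to reject H0"


if __name__ == "__main__":
    # Three groups, continuous data, assumptions not met.
    print(choose_test("continuous", "differences", how_many=3))
    # The village example: SPSS reported p = 0.0065.
    print(decide(0.0065))
```

Running the file prints "Kruskal-Wallis Test" and then "reject H0". Note that decide never returns "accept H0": when p is above 0.05 we can only say there is not enough evidence to reject the null hypothesis.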