PY1PR1 lecture 1: Describing data
Dr David Field

General Information
• The Research Methods course consists of statistics lectures, workshop exercises, and laboratory practicals
• Bring calculators to workshops, not mobile phones!
• You should have two handouts for this lecture:
  – Handout 1
    • The schedule for Autumn term Psychological Research Methods PY1PR1
    • Details of assessment for this module
  – Handout 2
    • Lecture handout, “Describing data”
• PowerPoint presentations for this lecture series will be available to download from my web page
  – http://www.personal.rdg.ac.uk/~sxs02dtf/home.html
• and also on BlackBoard
• There is additional information in the “notes” sections at the bottom of the slides that you won’t see projected on the screen today
• So, no need to write everything down
• Today’s slides contain some questions that you should try to answer at home using the course textbook
  – Discovering Statistics using SPSS. 3rd Edition. Andy Field
• The questions are repeated at the end of your printed handout

Using the course textbook
• The material in today’s lecture is covered in Andy Field chapters 1 and 2
• My slides and handouts will indicate which specific sections of the textbook you need to read
  – e.g. calculating the mean is covered in section 1.7.2
• But reading whole chapters is a good idea
• A guide to the meanings of symbols and Greek letters is given on page XXXI, just before Chapter 1
• Occasionally I will point out to you an issue where my teaching will diverge from the book
  – There is subjectivity in statistics
  – For purposes of this course, my procedure should be followed
• If you have studied ‘A’ Level Psychology then you will be familiar with some of today’s topics
  – But you might find that things are covered in more depth here
• If you have not studied ‘A’ Level, you might be wondering “Why on earth am I taking a course in statistics?”

Help with statistics
• University of Reading Maths Support Centre
• Located on the first floor of the Main Library
• Specialist statistics tutor available every Wednesday afternoon in term time from 2.00pm to 4.00pm
• Alternatively, fill in a form with your question on the website and get a reply by email
  – http://www.reading.ac.uk/mathssupport/

What is data?
• Data is made up of variables
• A variable is something that can take different values between individuals or in the same individual at different time points
  – Gender can take the value “male” or “female”
  – Age can take a minimum numeric value of zero, and a maximum numeric value of many years
  – Time to react to your name being called out is an example of a variable that would vary if you measured it in the same individual at several time points
• It is usual in Psychology to measure the value of a variable in many separate individuals

What does statistics do to data?
• Describe – today’s topic
  – Different types of variables
    • categorical, ordinal, continuous (interval and ratio)
  – If you have measured the same variable in many individuals you need a way of summarising the data
  – What’s the “average” value?
  – How much variation is there in the data?
• Compare – ask if one group differs from another on the value of a variable
• Relate – ask how one variable changes as a function of another one

Variables are classified according to their level of measurement
• Country of birth
  – Example values are France, UK, Germany
  – This is an unordered category because France is not more or less than the UK
  – We may assign numbers to category values for convenience (e.g. 1 = UK, 2 = France), but you cannot meaningfully add or subtract the numbers
  – This severely restricts the type of statistics we can use with categorical variables
• Finishing position in a running race
  – This is an ordinal variable because 1st is better (more) than 2nd
  – But you can’t finish 1.5th (no decimals)
  – It is not meaningful to say that 3rd is twice as good as 6th because gaps between positions are not equal
  – Therefore, you can’t add, subtract, multiply, or divide the values of ordinal variables, and statistics should be calculated based on ranks
• Annual salary
  – This is a continuous variable because the gap between £20,000 and £21,000 is the same as that between £40,000 and £41,000
  – It makes sense to add and subtract, and decimal places make sense too
  – Annual salary has a true zero that refers to the absence of the quantity under consideration (money)
    • Ratio level measurement
  – Zero does not mean absence for all continuous variables (e.g. zero Celsius is not the absence of temperature)
    • Interval level measurement

Working with variables
• The following examples are based on an imaginary set of data
• The following variables have been measured in a sample of 30 people
  – Country of birth
  – Intelligence Quotient (IQ)
  – Extroversion

Measures of central tendency
• If we have values on a variable for a sample of 30 people (or 300 people) one thing we might need to do is summarise the values in a shorter form
• The aim is to find a single number that characterises the typical value of the variable in the sample
• The options we will consider are the
  – Mode
  – Mean
  – Median
• Which one you use depends in part on the level of measurement of the variable
• The mode can be used with all data types, and is the only measure applicable to unordered categories
• The mode is the most frequently occurring score, and may be illustrated with a pie chart
• In the example data set the variable “birthCountry” contains 15 instances of “France”, 13 instances of “UK”, and 2 instances of “Germany”

[Pie chart of birthCountry: France, UK, Germany]

Questions to answer at home
• What is the modal birth country for a sample containing 20 UK, 23 French, 50 Indian, and 50 Chinese?
  – What word describes this sample?
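As a rough illustration of finding the mode, the sketch below counts the categories in the example “birthCountry” variable (15 France, 13 UK, 2 Germany). It is written in Python using only the standard library. The course itself uses SPSS, so the language and the variable names here are assumptions made purely for illustration.

```python
from collections import Counter

# Rebuild the example variable from the counts given in the lecture.
birth_country = ["France"] * 15 + ["UK"] * 13 + ["Germany"] * 2

# Count how often each category occurs.
counts = Counter(birth_country)

# The mode is every category that shares the highest count.
# A list is used because tied counts would give more than one mode.
highest = max(counts.values())
modes = [country for country, n in counts.items() if n == highest]

print(counts)
print("Mode:", modes)
```

Note that the code only counts category labels. It never adds or subtracts them, which is why the mode remains usable for unordered categories.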
Central tendency for ordinal, interval and ratio level variables
• Before calculating a measure of central tendency you should first visually inspect the variable using a frequency histogram
• Histograms are most informative for large sample sizes of several hundred cases or more
  – But they are still an essential step for small samples
• The first step in producing a histogram is to sort the cases in the variable from lowest to highest
• The second step is to count the frequency of occurrence of each value

The 30 IQ values from earlier
• Sorted: 68 72 77 79 82 90 90 96 97 97 97 100 101 101 101 103 103 104 105 105 109 109 109 114 115 117 118 124 134 140
• The IQ score 101 occurs 3 times in the sample

Histogram x axis intervals or “bin sizes”
• In the previous example the interval was equal to one unit on the IQ scale
• Typically, the interval will be wider than a single unit of the scale
• Be aware of the interval, because a bad interval choice can make a histogram misleading
  – Often every score contained in a variable is slightly different, so a histogram with very small bin sizes will just look flat

[Histogram: with the same data, the interval is now 5 IQ points. Note that the y axis maximum has now changed]
[Histogram: with the same data, the interval is now 50 IQ points. Note that the y axis maximum has now increased dramatically]

The mean (commonly “average”)
• To calculate the mean you sum all the scores (e.g. IQs 109 + 90 + 134 + 115 + 114 + …)
• Then you divide by the number of scores you added together (30, in the example data set)
• This gives an indication of the typical score
• The mean IQ in this sample is 101.9

The median
• The median is the score that lies in the middle of the sample, which therefore has an equal number of scores higher and lower than it
• To calculate the median you first sort the scores, as for making a histogram
• Then assign ranking positions in the list and locate the score corresponding to the middle rank

  Scores:  3    13   10.5  7    6    8    8     12   4
  Sorted:  3    4    6     7    8    8    10.5  12   13
  Rank:    1    2    3     4    5    6    7     8    9

• With 9 scores the middle rank is 5, so the median here is 8
• At home, find out how the procedure is modified when the number of scores in the variable is even

The mean IQ in this sample is 101.9. The median IQ is 102.
The mean Extroversion score in this sample is 36.17. The median is 33.

When to choose the median
• Firstly, if the histogram is not symmetrical about its peak (most frequently occurring value) then the median and mean will differ, and you can make the case that the middle ranking score (median) is a more appropriate description of central tendency
• Secondly, if the histogram reveals a few outlying values that seem to be quite different from the rest of the sample, then these outlying values will have a large and disproportionate influence on the mean, but not on the median
• Always calculate both and compare them
• These outliers will “drag” the mean away from the median

Measures of dispersion
• Imagine we contact the example sample and use a questionnaire to assess their attitude to the European Union
• The questionnaire produces scores ranging from 5 (very negative) to 50 (very positive)
• We can compare French and British attitudes to the European Union
• There are only 2 Germans in the sample, and intuitively this is too few to assess German attitudes to the European Union

The first 10 cases from the 30 in the example.
Note missing data for Germany.

Mean 22.20, Median 23
Mean 22.54, Median 23

The range
• The simplest measure of dispersion is obtained by subtracting the minimum score from the maximum score
  – French sub-sample attitudeEurope has a range of 22
  – UK sub-sample attitudeEurope has a range of 31
• Reporting the mean and the range is adequate as a way of comparing UK and French attitudes to Europe in this sample
• But the range fails to capture dispersion properly in some cases, which is why the standard deviation is normally preferred
  – At home, find out what the weaknesses of the range as a measure of dispersion are

The standard deviation
• This is a measure of how much all the scores in a data set vary around the mean, in the same units as the mean itself (e.g. years, grams)
  – A big SD implies very spread out data
  – If the SD is small the data is clustered close to the mean
• Understanding what the standard deviation means, and how to calculate it, is very important
• It will be mentioned frequently in the next two lectures
• For each score in the sample, subtract the mean of the sample to produce “deviation scores”

  scores               1    4    5    6    9   11
  deviations          -5   -2   -1    0    3    5

• 1 – 6 = -5, 4 – 6 = -2, … , 11 – 6 = 5
• Intuitively, the mean of the deviation scores should be a measure of the amount of variation in the sample
• But the mean deviation is always zero because the positive deviations exactly cancel the negative ones
• The negative signs are removed by squaring the deviation scores
• $2^2 = 4$, $(-2)^2 = 4$, $3^2 = 9$, $(-3)^2 = 9$, $(-4)^2 = 16$ etc
• An important statistic called the variance is obtained by assessing the central tendency in the squared deviation scores
• Sum the squared deviations
  – The squaring process increases the relative contribution of scores that are far from the mean to the variance, compared to those scores that are close to the mean
• To calculate the variance you divide the sum of squared deviations by the number of original scores minus 1

  scores               1    4    5    6    9   11
  deviations          -5   -2   -1    0    3    5
  squared deviations  25    4    1    0    9   25

• The sum of the squared deviations is 64
• The mean squared deviation (variance) is therefore
  – 64 / (6 – 1) = 12.8
• If the units of the scores are kg, what are the units of the variance?
• To convert the variance back into units we can understand intuitively we take the square root of the variance and call it the standard deviation
  – In the worked example the square root of 12.8 is 3.58
• The standard deviation (SD) is in the same units as the sample mean, so, for example, you can write that the mean weight of adult domestic cats in the sample is 5.0 kg (SD 1.0 kg)
• If the population of cat weights is normally distributed then about 68% of cats will weigh within one SD of the mean (5.0 kg ± 1.0 kg)
  – About 68% of cats weigh between 4 kg and 6 kg

Mean 22.20, SD 6.5
Mean 22.54, SD 8.7

List of questions to answer at home
• What is the modal birth country for a sample containing 20 UK, 23 French, 50 Indian, and 50 Chinese?
  – What word describes this sample?
• How is the procedure for calculating the median modified when the number of scores in the variable is even, compared to when there is an odd number of scores?
• The range fails to capture dispersion properly in some cases, which is why the standard deviation is normally preferred
  – Find out what the weaknesses of the range as a measure of dispersion are
• Below is a list of statistical terms that you should know the meaning of in order to be sure you have understood the material from today’s lecture. Note that the technical meaning of terms in statistics is not always the same as the everyday meaning of the words.
You can use this list to help you with your exam revision.
• Variable
• Level of measurement
  – Categorical
  – Ordinal
  – Continuous
    • Interval
    • Ratio
• Measures of central tendency
  – Mode
  – Mean
  – Median
• Frequency histogram
  – Bin sizes
• Measures of dispersion
  – Range
  – Variance
  – Standard deviation

Variance ($s^2$) formula
• The variance is the average of the squared differences between each individual score and the mean for that sample, where the average is taken by dividing by the number of scores minus 1

$$s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$$

  – $\sum$ : the sum of
  – $X$ : each score in the sample
  – $\bar{X}$ : mean of the sample
  – $N - 1$ : number of scores in the sample minus 1

Standard deviation formula

$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$$

• Step 1. Calculate the variance
• Step 2. Take the square root of the variance
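The two steps above can be sketched in Python for the worked example scores from the lecture (1, 4, 5, 6, 9, 11). This is only a minimal illustration using the standard library. The course itself uses SPSS, and the variable names below are assumptions chosen for readability.

```python
import math

scores = [1, 4, 5, 6, 9, 11]

n = len(scores)
mean = sum(scores) / n  # 36 / 6

# Deviation scores: each score minus the sample mean.
deviations = [x - mean for x in scores]

# Squaring removes the negative signs, so the deviations no longer cancel.
squared_deviations = [d ** 2 for d in deviations]
sum_of_squares = sum(squared_deviations)

# Step 1: variance = sum of squared deviations divided by (N - 1).
variance = sum_of_squares / (n - 1)

# Step 2: standard deviation = square root of the variance.
sd = math.sqrt(variance)

print(f"mean = {mean}")
print(f"sum of squared deviations = {sum_of_squares}")
print(f"variance = {variance}")
print(f"SD = {sd:.2f}")
```

Python's built-in `statistics.variance` and `statistics.stdev` also divide by N – 1, so they can be used to cross-check a hand calculation like this one.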