0AP03: Methods and models in behavioral research
Part 2: Understanding statistics using SPSS (Field)
Chris Snijders
c.c.p.snijders@gmail.com
www.tue-tm.org/moodle

EXAMPLE: NETFLIX DVD RENTAL

Example: The Netflix Prize
$1,000,000
www.netflixprize.com

Example: The Netflix Prize (3)
input = kind of previous rentals
input = number of previous rentals
input = day of the week
input = ...
output = extent to which you like a movie

Example: The Netflix Prize (2)
• Predict the extent to which a person will like a movie, from previous ratings by others.
• NB
  – Measurement
  – Root Mean Square Error
  – Large prizes!
  – You have about 2 GB of data to work on ...

0AP03: two parts
Book              Lecturer         Blocks
Blumberg et al.   Gerrit Rooks     A and B
Field             Chris Snijders   B and C
Language: English

Understanding statistics using SPSS
http://www.sagepub.co.uk/field/field.htm
- CD-ROM material
  -- data sets
  -- some software (G*Power)
- answers to (some) assignments in the book
- test banks (note: not identical to the exam)

www.tue-tm.org/moodle
enrolment key = "fieldspss"

Course home-page
http://www.tue-tm.org/moodle

Let’s get acquainted …
• Technische InnovatieWetenschappen
  – Bachelor’s
  – Pre-Master program
• Technische Bedrijfskunde
  – Bachelor’s
  – Pre-Master program

For each key concept below, answer:
A: never heard of it
B: was a topic in previous lectures, but don’t ask me what it is or how to do it
C: was covered and understood

Some key concepts:
- Stochastic variables, distributions, normal distribution
- SPSS usage (StatGraphics users?)
- Mean, median, skewness, kurtosis
- Correlation
- Simple regression: Y = a + b X
- Factor analysis
- A chi2 test

Understanding statistics using SPSS
About: Style
About: Content
Chapter     Topic
1           About statistics
2           SPSS
3           Exploring data
4           Correlation
5           Multiple regression
6           Logistic regression
7           The t-test
8           ANOVA
9-12, 14    More ANOVAs
13          Non-param. tests
15          Factor analysis
16          Chi2-tests etc.

T-test, chi2-test
• We have two groups of students: one group that started early and worked regularly, and one group that started late (in the last three lectures or later).

Are the grades of the students in the regular group higher? (t-test)
            REGULAR   LATE
Average     6.3       3.0
Max         8.4       3.9

Are the regular students more likely to pass the course? (chi2-test)
            REGULAR   LATE
Pass        30        2
No          10        38

Exam for the Field-part
[tentative: check the course website later]
Chapters 1, 2, 3, 4, 7: assumed to be common knowledge
Chapters 5, 8, 15, <probably 9-12, 14, perhaps 6>
+ additional material supplied with the course (such as the PS software)
===
Exam on laptop:
1 – multiple choice questions
2 – you are given data and must be able to handle the data sensibly

The average (quantitative) paper …
• Problem formulation
  – What are sensible questions?
• Theory-development and hypotheses
  – “What do I expect to be the answer to my question, and what are the implications from the theory that I want to test?” (nb: different in exploratory work)
• Choice of research design
  – Experiment
  – Survey
  – Case study
  – Participant observation
  – …
• Data collection
  – Designing questionnaires
  – Designing experimental procedures
  – Finding your respondents. Sampling (how and how many?)
  – …
• Analysis of results
  – Measurement: from raw data to measured constructs
  – Relational claims: X → Y?
• Conclusions
  – What can we conclude, given our analyses?

About the course setup
• Mainly on the moodle-site; studyweb is only used to send mail to you
• “Do-it-yourself course”: mastering SPSS, getting up to speed with SPSS, and keeping up with the material is up to you
  – Extra material and links on the website
  – Practice material for the exam
  If you do not practice in between, you will not be able to pass the exam.
• Part 1 (Rooks): “Think, then do”; Part 2 (me): “Do, then think”
• We have data, now what do we do?
  (and partly we collect these data from you)
• Hybrid setup:
  – English/Dutch
  – business administration / social sciences

THE ART OF SAMPLING

Sampling
[diagram: population, sample]
We want conclusions about the population, but we only have (enough time and money to collect) data from part of the population: a sample.
From sample data to population statement: STATISTICAL INFERENCE

Two parts to every analysis
• Calculate some property of the sample
  – Mean (mean height of soccer players)
  – Difference between the means of two groups (difference in height between groups of soccer players)
  – Correlation between two things measured (correlation between height and number of goals you score)
• Calculate a confidence interval around the property, creating a statement about the property in the population

On sampling "analog cheese"
Analog cheese = palm oil + starch (Dutch: zetmeel)
"Keuringsdienst van waarde" took a sample of 11 products and found 5 to contain "analog cheese".
Estimate of the percentage of products containing analog cheese = 5/11 = 45%
What is the (approximate) confidence interval?
A  40 – 50 %
B  32 – 58 %
C  25 – 65 %
D  17 – 77 %
E   9 – 81 %

Applying the 1/sqrt(n) rule
You want to predict how many seats in congress a certain Dutch political party will get. You allow for a range of plus or minus 2 seats. Say you expect the number of seats to be around 50. You intend to call a representative sample of people. About how many do you need?
A  50
B  100
C  500
D  5,000
E  50,000
F  more than 50,000

Some more sampling
Suppose you want to know, say, the percentage of people in The Netherlands who support the recent foreign policy of the US government. The Netherlands has 12,000,000 voters. According to your (correct) calculations you need a sample of 2,000 people.
Now you want to do the same, but in France (population = 36,000,000 voters). How large should your sample size be in France?
A  less than 2,000
B  about 2,000
C  about 6,000
D  more than 6,000
E  you need more information
Rule of thumb: for large populations, the required sample size is independent of the population size.

Explanation: Mean and variance of the mean
We measure x and get measurements x_1, …, x_n.

\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \operatorname{Var}(\bar{x}) = \left(\frac{N-n}{N}\right)\frac{s_x^2}{n} \]

x_i = measurement of x for unit i
N = size of population
n = size of sample

\[ s_x^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1} \quad \text{(variance in the sample)} \]

Expectation and variance give the 95% confidence interval:

\[ \left[\, \bar{x} - 1.96\sqrt{\operatorname{Var}(\bar{x})},\;\; \bar{x} + 1.96\sqrt{\operatorname{Var}(\bar{x})} \,\right] \]

Sample size determined by:
(example: are white soccer players shorter?)
• How precisely do you want to measure your statistic?
  [what is the height difference you would find interesting enough to report about]
• What is the probability of a Type I error that you will allow? (rejecting the H0-hypothesis when in fact it is true) Usually 5%.
  [How small do you want the probability to be that you reject “(on average) black and non-black players are equally tall” when in fact it is true?]
• How likely do you want it to be that you will find an effect, assuming that it exists in the population? Power, usually 80% or 90%.
• One-sided or two-sided tests?
You need special purpose software for this, for instance G*Power (on the disc), or PS.

X → Y

All the same, but different
• Problem formulation
  – What are sensible questions?
• Theory-development and hypotheses
  – “What do I expect to be the answer to my question, and what are the implications from the theory that I want to test?” (nb: different in exploratory work)
  [X1 → Y1, X2 → Y2, …]
• Choice of research design
  – Experiment
  – Survey
  – Case study
  – Participant observation
  – …
• Data collection
  – Designing questionnaires
  – Designing experimental procedures
  – Finding your respondents. Sampling (how and how many?)
  – …
  [X1 → Y1, X2 → Y2: HOW?]
• Analysis of results
  – Measurement: from raw data to measured constructs
  – Relational claims: X → Y?
• Conclusions
  – What can we conclude?
  [X1 → Y1, X2 → Y2: AND?]
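
The formulas on the "Explanation: Mean and variance of the mean" slide can be tried out in a few lines. The sketch below is in Python rather than SPSS, and the function names are made up for illustration. It assumes the slide's normal approximation (the factor 1.96) and, for the two-country comparison, an assumed worst-case variance of 0.25 for a yes/no question.

```python
import math

Z_95 = 1.96  # normal-approximation multiplier used on the slide


def variance_of_mean(sample_variance, n, population_size):
    """Var(x-bar) = ((N - n) / N) * s_x^2 / n, as on the slide."""
    fpc = (population_size - n) / population_size  # finite population correction
    return fpc * sample_variance / n


def half_width(sample_variance, n, population_size):
    """Half-width of the approximate 95% confidence interval."""
    return Z_95 * math.sqrt(variance_of_mean(sample_variance, n, population_size))


def mean_confidence_interval(values, population_size):
    """Return (lower, upper) of the approximate 95% interval for the population mean."""
    n = len(values)
    mean = sum(values) / n
    s2 = sum((x - mean) ** 2 for x in values) / (n - 1)  # variance in the sample
    hw = half_width(s2, n, population_size)
    return mean - hw, mean + hw


# Assumed illustration: same sample size, two very different population sizes.
# 0.25 is the largest possible variance of a yes/no answer (p = 0.5).
netherlands = half_width(0.25, 2000, 12_000_000)
france = half_width(0.25, 2000, 36_000_000)
print(f"NL: +/- {netherlands:.4f}   FR: +/- {france:.4f}")
```

With n = 2,000 the factor (N - n)/N is almost exactly 1 for both countries, so both margins come out at roughly 2 percentage points. That is the rule of thumb from the "Some more sampling" slide: for large populations the precision depends on n, not on N.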
About $80 / hour

It is all about X → Y
X                        Y
“white soccer player”    “height”
“being a woman”          “sensitive to alcohol”
“being bald”             “prob. of a heart attack”
“left handed”            “die early”
“listen to Mozart”       “score higher on IQ-test”

Y = dependent variable, response variable, target variable, Y-variable, explanandum
X = independent variable, X-variable, predictor variable, explanans
Usually we want to say something like “X causes Y”, but often we have to settle for “X is related to Y”.

Survey vs experiment (Milgram)
Y = which voltage do you apply?
measured X's:
  – subject is male
  – subject is young
manipulated X's:
  – experimenter wears white coat
  – experimenter is older (vs young)
Experiment: researcher determines X
Survey: researcher measures X

X → Y

Kinds of variables (in case you forgot)
Categorical / Nominal
  Two or more categories, without intrinsic ordering (ex.: “kind of movie”: action/drama/...).
  When there are only two categories, also called a binary variable (ex.: gender, “age over 40”, etc.)
Ordinal
  Two or more categories, with intrinsic ordering (ex.: ratings such as never/sometimes/often/always, …)
Interval
  Ordinal + intervals between values are evenly spaced (age, income, number of movies rented).
NB Not always easy to classify. Categorical and interval are the most important (ordinal variables are often treated as either categorical or interval).

Statistics at UCLA
{http://www.ats.ucla.edu/stat/mult_pkg/whatstat/default.htm}
(overview by kind of Y and kind of X)

Dealing with data
1. Import SPSS file
2. Check your data
   • To get acquainted with it
   • For outliers and coding errors
3. Determine the kind of analysis
4. Recode your data so that you have the variables in the appropriate format
5. Check the assumptions for the analysis of choice (1)
6. Run your analysis
7. Check the assumptions for the analysis of choice (2)
8. If necessary, back to 3.
… until CONCLUSION

Fact and fiction
Are white soccer players shorter?

Example data: soccer players
File: soccer_0AP03.sav. All players from WC2002.
Let’s see what the data looks like: <to SPSS>
  Variable view vs Data view
  Run a “Frequencies”
  Check histograms
  Create new variables (Transform > Compute)
  Recode variables (Transform > Recode)
  Run analyses
USE SYNTAX FILES (*.SPS)!

Weekly not-on-the-exam fact
[diagram: several inputs, one output]
Suppose: you have a handful of numerical inputs and want to use these to predict some output. For instance: chance of survival of a firm based on firm characteristics, probability of job success based on credentials, probability of surgery survival based on medical records, …
We compare experts in the field with computer models (both have the same amount of data). Out of 160 studies of this kind, how often do the experts perform significantly better?
(sources: see “Super Crunchers” by Ayres)

To Do
Get familiar with SPSS: reading data, recoding variables, and running a t-test or a correlation. Recoding variables and the syntax window are especially important.
You should be able to do the assignments on the web page fairly quickly.
Check chapters 1 through 4 (up to 4.5.4) of the Field-book for anything that looks unfamiliar to you.
Don’t wait until the last couple of weeks!
Add to the WIKIs
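
As a warm-up for the To Do list, the pass/fail table from the "T-test, chi2-test" slide can be checked by hand before running it in SPSS. The sketch below is in Python, not SPSS syntax, and the function name is made up for illustration. It computes the plain Pearson chi-square statistic without continuity correction and compares it to 3.841, the 5% critical value for one degree of freedom.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if passing were unrelated to the group
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Counts from the slide: rows = Pass / No, columns = REGULAR / LATE
observed = [[30, 2],
            [10, 38]]

CRITICAL_5_PERCENT_DF1 = 3.841  # chi-square critical value, df = 1, alpha = 0.05

chi2 = chi_square_statistic(observed)
print(f"chi-square = {chi2:.2f}")
if chi2 > CRITICAL_5_PERCENT_DF1:
    print("Reject H0: passing is related to starting regularly or late.")
else:
    print("Do not reject H0.")
```

For a 2x2 table SPSS also reports a continuity-corrected value, so its output will show more than this one number; the uncorrected Pearson row is the one to compare with this sketch.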