Introduction to SPSS for GHA Staff
Prof Gwilym Pryce: g@gpryce.com
Tutors: George Vlachos, Christian Holz
Lab notes based on material by John Malcolm

Plan:
• A. Data Types
  – 1. Variables
  – 2. Constants
• B. Introduction to SPSS
  – 1. SPSS Menu Bar
  – 2. File Types
• C. Tabulating Data
  – 1. Categorical variables
  – 2. Continuous variables
• D. Graphing Data
  – 1. Categorical variables
  – 2. Continuous variables

A. Data Types
• 1. Variables
• 2. Constants

1. What is a variable?
– A measurement or quantity that can take on more than one value:
  • E.g. size of planet: varies from planet to planet
  • E.g. weight: varies from person to person
  • E.g. gender: varies from person to person
  • E.g. fear of crime: varies from person to person
  • E.g. income: varies from household (HH) to household
– I.e. values vary across 'individuals' = the objects described by our data
• Individuals = basic units of a data set whom we observe or experiment on in a controlled way
  – not necessarily persons
    • (could be schools, organisations, countries, groups, policies, or objects such as cars or safety pins)
• Variables = information that can vary across the individuals we observe
  – e.g. age, height, gender, income, exam scores, whether signed Nuclear Test Ban Treaty

Variable Type, for Coding Purposes: Variable View of the Data

Variable Type for Coding Purposes:
• Available data types in SPSS are as follows:
  – Numeric – the default for new variables
  – Comma
  – Dot
  – Scientific Notation
  – Date
  – String
• Numeric
  – A variable whose values are numbers. Values are displayed in standard numeric format.
  – The Data Editor accepts numeric values in standard format or in scientific notation.
• Comma
  – A numeric variable whose values are displayed with commas delimiting every three places, and with the period as a decimal delimiter.
  – The Data Editor accepts numeric values for comma variables with or without commas, or in scientific notation.
    • Values cannot contain commas to the right of the decimal indicator.
• Dot
  – A numeric variable whose values are displayed with periods delimiting every three places and with the comma as a decimal delimiter.
  – The Data Editor accepts numeric values for dot variables with or without periods, or in scientific notation.
    • Values cannot contain periods to the right of the decimal indicator.
• Scientific notation
  – A numeric variable whose values are displayed with an embedded E and a signed power-of-ten exponent.
  – The Data Editor accepts numeric values for such variables with or without an exponent.
    • The exponent can be preceded either by E or D with an optional sign, or by the sign alone: for example, 123, 1.23E2, 1.23D2, 1.23E+2, and even 1.23+2.
• Date
  – A numeric variable whose values are displayed in one of several calendar-date or clock-time formats. Select a format from the list.
  – You can enter dates with slashes, hyphens, periods, commas, or blank spaces as delimiters.
  – The century range for two-digit year values is determined by your Options settings (from the Edit menu, choose Options and click the Data tab).
• Custom currency
  – A numeric variable whose values are displayed in one of the custom currency formats that you have defined in the Currency tab of the Options dialog box.
  – Defined custom currency characters cannot be used in data entry but are displayed in the Data Editor.
• String
  – Values of a string variable are not numeric and therefore are not used in calculations.
  – They can contain any characters up to the defined length.
  – Uppercase and lowercase letters are considered distinct.
  – Also known as an alphanumeric variable.

Conceptual Approach to Variable Type:
• Numeric = values are numbers that can be used in calculations.
• String = values are not numeric, and hence not used in calculations.
  – But can often be coded, i.e. transformed into a numerical variable:
    • e.g. If (LA = 'Aberdeen') X = 1. If (LA = 'East Renfrewshire') X = 2. etc.
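The coding rule above (a string value mapped to a number) can be sketched outside SPSS as well. The Python below is only an illustration under stated assumptions: the function name `recode`, the dictionary `LA_CODES`, and the sample list are hypothetical, and only the two authority names and their codes come from the example above.

```python
# Codes taken from the example above; any further authorities would be added here.
LA_CODES = {"Aberdeen": 1, "East Renfrewshire": 2}


def recode(values, codes, missing=None):
    """Return the numeric code for each string value.

    Values with no entry in `codes` are returned as `missing`,
    so unmatched strings are visible rather than silently dropped.
    """
    return [codes.get(value, missing) for value in values]


# Hypothetical sample column of local authority names.
la = ["Aberdeen", "East Renfrewshire", "Aberdeen", "aberdeen"]
x = recode(la, LA_CODES)
print(x)  # [1, 2, 1, None]
```

The last value stays uncoded because, as with SPSS string variables, uppercase and lowercase letters are treated as distinct.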
Continuous vs Categorical
• Continuous (or Scale or quantitative) variables = data values are numeric values on an interval or ratio scale
  – (e.g., age, income). Scale variables must be numeric.
  – E.g. dimmer switch: brightness of light can be measured along a continuum from dark to full brightness
• Categorical variables = variables whose values fall into two or more discrete categories
  – E.g. conventional light switch: either total darkness or full brightness, on or off.
  – Male or female, employment category, country of origin

Two types of Categorical variables: Ordinal & Nominal
• Ordinal variables = data values represent categories with some intrinsic order
  – (e.g., low, medium, high; strongly agree, agree, disagree, strongly disagree).
  – Ordinal variables can be either string (alphanumeric) or numeric values that represent distinct categories (e.g., 1=low, 2=medium, 3=high).

Ordinal variables:
• Values fall within discrete but ordered categories
  – I.e. the sequence of categories has meaning
    • e.g. education categories:
      – 1 = primary
      – 2 = secondary
      – 3 = college
      – 4 = university undergraduate
      – 5 = university postgraduate masters
      – 6 = university postgraduate PhD
    • e.g. 1 = very poor, 2 = poor, 3 = good, 4 = very good

Nominal variables
• Nominal variables = data values represent categories with no intrinsic order
  – the sequence of categories is arbitrary; the ordering has no meaning in and of itself:
    • e.g. country of origin: Wales, Scotland, Germany…
    • e.g. make of car: Ford, Vauxhall
    • e.g. job category
    • e.g. company division
  – Nominal variables can be either string (alphanumeric) or numeric values that represent distinct categories (e.g., 1=Male, 2=Female).

2. What is a constant?
– A measurement or quantity that has only one value for all the objects described in our data
– Also called a 'scalar' or 'intercept' or 'parameter'
  • E.g. speed of light in a vacuum: constant for all light transmissions
  • E.g. ratio of a circle's circumference to its diameter (π): constant for all circles
  • E.g. average increase in life expectancy: constant at 1 year per annum since 1900
  • E.g. price elasticity of housing supply: assumed constant for a particular market
• Often it is a constant that we want to estimate:
  – we employ statistical techniques to estimate 'parameters' or 'constants' that summarise or link variables.
    • e.g. mean = 'typical' value of a variable = measure of central tendency
    • e.g. standard deviation = measure of the variability of a variable = measure of spread
    • e.g. correlation coefficient = measures the strength of the linear association between two variables
    • e.g. slope coefficient = how much y changes when x increases by one unit

B. Introduction to SPSS

1. SPSS Menu Bar
• When you first open SPSS, you will usually be presented with a blank Data View window
  – The Data View lists variables as columns and observations (also called "cases" or "individuals") as rows
• Data View without and with data looks like this…

Data View of Home Sales data:

• Variable View looks like this…

Variable View of Home Sales data:

SPSS Menu Bar

B.2. File Types & SPSS Structure
• If you try opening a new file (File, New), you will see that you are presented with five choices of file type.
• These choices reflect the basic structure of SPSS:
  – Data
  – Syntax
    • Steep learning curve, but essential for larger projects
      – Backup
      – Record/checking
      – Re-use
  – Output
    • Graphs, tables, commands, error messages

SPSS Scripting Facility
• The scripting facility allows you to automate tasks, including:
  – Automatically customize output in the Viewer.
  – Open and save data files.
  – Display and manipulate dialog boxes.
  – Run data transformations and statistical procedures using command syntax.
  – Export charts as graphic files in a number of formats.

C. Tabulating Data

• 1. Categorical Data: Frequency Tables
  – E.g. Neighbourhood type (House Sales data)
    • Analyze, Descriptive Statistics, Frequencies

Neighborhood
                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid   A               42       1.7             1.7                  1.7
        B              319      13.1            13.1                 14.8
        C              258      10.6            10.6                 25.4
        D              467      19.1            19.1                 44.5
        E              500      20.5            20.5                 65.0
        F              372      15.2            15.2                 80.2
        G              482      19.8            19.8                100.0
        Total         2440     100.0           100.0

• Categorical Data: Crosstabs (2-Way Tables)
  – E.g. Does Ethnic Minority Status affect job type? (Employment data)
    • Analyze, Descriptive Statistics, Crosstabs

Minority Classification * Employment Category Crosstabulation
                                                      Employment Category
Minority Classification                            Clerical   Custodial   Manager    Total
No      Count                                           276          14        80      370
        % within Employment Category                  76.0%       51.9%     95.2%    78.1%
Yes     Count                                            87          13         4      104
        % within Employment Category                  24.0%       48.1%      4.8%    21.9%
Total   Count                                           363          27        84      474
        % within Employment Category                 100.0%      100.0%    100.0%   100.0%

2. Scale Data
• Scale or quantitative data: usually a measurement of size or quantity
  – not meaningful to report % or count
    • Not unless you break the variable into categories (& then it becomes categorical data!)
    • e.g. income bands = "grouped data"
• Tables of raw data are not much use unless there are only a few values... How would you tabulate 129,000 observations?

CM SML 1988 Total Income, by borrower (first 80 borrowers; "." = missing value)
Borrower  Income    Borrower  Income    Borrower  Income    Borrower  Income
       1       .          21   10800          41       .          61       .
       2       .          22       .          42    7216          62       .
       3       .          23   19072          43       .          63       .
       4       .          24       .          44   12000          64       .
       5       .          25       .          45    9758          65       .
       6       .          26       .          46    6084          66       .
       7       .          27       .          47       .          67       .
       8       .          28       .          48       .          68       .
       9       .          29       .          49       .          69   18336
      10       .          30       .          50    9345          70   15096
      11       .          31       .          51    9810          71       .
      12       .          32       .          52   14406          72   12597
      13       .          33       .          53    9190          73    9700
      14       .          34       .          54       .          74       .
      15   18720          35       .          55       .          75       .
      16   16000          36       .          56       .          76       .
      17   16455          37       .          57       .          77       .
      18       .          38   11500          58       .          78    5295
      19    7020          39    2912          59       .          79    4539
      20    4576          40   11745          60       .          80       .

Tables of Summary Statistics for Continuous Data:
• Descriptives Function in SPSS:
  – E.g. House Sales data
    • On SPSS Menu Bar select:
      – Analyze, Descriptive Statistics, Descriptives

Descriptive Statistics
                        N    Minimum    Maximum       Mean   Std. Deviation
Current Salary        474    $15,750   $135,000   $34419.6      $17,075.661
Valid N (listwise)    474

• Explore Function in SPSS:
  – On SPSS Menu Bar select:
    • Analyze, Descriptive Statistics, Explore

Descriptives: Current Salary
                                                   Statistic   Std. Error
Mean                                                $34419.6     $784.311
95% Confidence Interval for Mean   Lower Bound      $32878.4
                                   Upper Bound      $35960.7
5% Trimmed Mean                                     $32455.2
Median                                              $28875.0
Variance                                              3E+008
Std. Deviation                                      $17075.7
Minimum                                              $15,750
Maximum                                             $135,000
Range                                               $119,250
Interquartile Range                                  $13,163
Skewness                                               2.125         .112
Kurtosis                                               5.378         .224

D. Graphs of Variables: 1. Graphs of Categorical Data
• Pie Charts
  – If all the categories sum to a meaningful total, then you can use a pie chart
  – Pie charts emphasise the differences in proportions between categories
  – OK for a single snapshot, but not very good for showing trends
    • you would need a separate pie chart for each year
  • On SPSS Menu Bar select:
    • Graphs, Pie, Summaries for Groups of Cases
• Bar Charts
  – can show either % or count
  – not very good for showing trends in more than one category

[Figure: bar chart, "Income Support claimants with housing costs by statistical group in May 1999". Y-axis: 000's (0 to 100). X-axis: Category of Claimant (Aged 60 or over, Lone Parents, Disabled, Other). Source: DSS Quarterly Statistical Enquiry.]

[Figure: chart, "Income Support claimants with housing costs by statistical group: May 1993 to May 1999". Y-axis: 000's (0 to 140). X-axis: Year (1993 to 1999). Series: Aged 60 or over, Lone Parents, Disabled, Other.]

Beware of scaling...

[Figure: "Income Support claimants with housing costs by statistical group: May 1993 to May 1999", Lone Parents series only. Y-axis: 000's, running from 60 to 120. X-axis: Year (1993 to 1999).]

[Figure: the same Lone Parents series, with the y-axis (000's) running from 0 to 200. X-axis: Year (1993 to 1999).]

D. Graphs of Variables: 2. Graphs of Continuous Data
• What are we interested in when describing data?
• E.g. income:
  – Is income evenly spread?
  – Or are most people rich?
  – Or are most people poor?
  – Or are most reasonably well off?
• These are all questions about the variable's distribution
  – We can represent the whole data set with one picture...
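To see what such a picture summarises, the sketch below bins a small sample by hand and prints a text histogram together with N, mean and standard deviation. It is an illustration only: the values are the 24 non-missing incomes from the 80-borrower table in section C (not the full data set), and the 5,000 bin width and the function name `histogram` are assumptions chosen for this example.

```python
import statistics

# Non-missing incomes from the 80-borrower table in section C ("." rows dropped).
incomes = [18720, 16000, 16455, 7020, 4576, 10800, 19072, 11500, 2912, 11745,
           7216, 12000, 9758, 6084, 9345, 9810, 14406, 9190, 18336, 15096,
           12597, 9700, 5295, 4539]


def histogram(values, width):
    """Count values falling in each bin [lower, lower + width).

    Returns a dict mapping each bin's lower edge to its count,
    sorted by lower edge. Empty bins are not included.
    """
    counts = {}
    for value in values:
        lower = int(value // width) * width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


WIDTH = 5000  # assumed bin width, chosen only to keep the output short
bins = histogram(incomes, WIDTH)

# One row per bin, one asterisk per borrower.
for lower, count in bins.items():
    print(f"{lower:>6} - {lower + WIDTH:<6} {'*' * count}")

print("N =", len(incomes))
print("Mean =", round(statistics.mean(incomes), 1))
print("Std. Dev =", round(statistics.stdev(incomes), 1))
```

Changing `WIDTH` changes the shape of the picture without changing the data, which is worth keeping in mind when comparing histograms of the same variable drawn with different bins.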
• On SPSS Menu Bar select:
  • Graphs, Histogram, and select variable

[Figure: histogram of TOTAL INCOME OF BORROWER(S). X-axis: income in bins of 1,500 from 0 to 58,500. Y-axis: frequency (0 to 12,000). Std. Dev = 12830.02, Mean = 17993.3, N = 125541.]

[Figure: histogram, "LTV Frequency Distribution, All HHs in Low Price Areas (1995-1998 CML SML Data)". X-axis: LTV in bins of 0.5 from -.25 to 17.25. Y-axis: frequency (0 to 60,000). Std. Dev = .25, Mean = .80, N = 74736.]

[Figure: histogram, "LTV Frequency Distribution, All HHs in Low Price Areas (1995-1998 CML SML Data)". X-axis: LTV in bins of .05 from 0.00 to 1.50. Y-axis: frequency (0 to 30,000). Std. Dev = .22, Mean = .80, N = 74552.]

[Figure: histogram, "LTV Frequency Distribution, All HHs in Low Price Areas (1995-1998 CML SML Data)". X-axis: LTV in bins of .05 from 0.00 to 1.00. Y-axis: frequency (0 to 30,000). Std. Dev = .22, Mean = .78, N = 70545.]

[Figure: histogram, "LTV Frequency Distribution, All HHs in Low Price Areas (1995-1998 CML SML Data)". X-axis: LTV in three bins (0.00 - .50, .50 - 1.00, 1.00 - 1.50). Y-axis: frequency (0 to 70,000). Std. Dev = .22, Mean = .80, N = 74552.]

Summary
• A. Data Types
  – 1. Variables
  – 2. Constants
• B. Introduction to SPSS
  – 1. SPSS Menu Bar
  – 2. File Types
• C. Tabulating Data
  – 1. Categorical variables
  – 2. Continuous variables
• D. Graphing Data
  – 1. Categorical variables
  – 2. Continuous variables
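
As a closing check on section C.1, the Percent and Cumulative Percent columns of the Neighborhood frequency table can be reproduced from the counts alone. The Python below is a sketch, not SPSS's own procedure: the function name `frequency_table` is hypothetical, and rounding to one decimal place is assumed to match the displayed table.

```python
# Counts copied from the Neighborhood frequency table in section C.1.
counts = {"A": 42, "B": 319, "C": 258, "D": 467, "E": 500, "F": 372, "G": 482}


def frequency_table(counts):
    """Build (category, frequency, percent, cumulative percent) rows.

    Percentages are taken over the total of all counts and rounded to one
    decimal place; the cumulative column is computed from running counts,
    not from the already-rounded percentages.
    """
    total = sum(counts.values())
    rows = []
    running = 0
    for category, n in counts.items():
        running += n
        rows.append((category, n,
                     round(100 * n / total, 1),
                     round(100 * running / total, 1)))
    return rows, total


rows, total = frequency_table(counts)
for category, n, pct, cum in rows:
    print(f"{category}  {n:>5}  {pct:>5}  {cum:>5}")
print(f"Total  {total}")
```

Because no cases are missing in this table, Percent and Valid Percent coincide, so a single percentage column is enough here.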