University of the Philippines, Diliman, Quezon City – Department of Mathematics, summer term 2010
Assoc. Prof. Dr. Christian Traweger, University of Innsbruck, Austria – Department of Political Science, Social Science Methods Group
christian.traweger@uibk.ac.at

Overview – lecture topics:
- basics in market research, definitions
- frequency distributions, measures of central tendency, measures of dispersion
- boxplots, statistical testing
- hypotheses, hypothesis testing, level of significance
- probability, probability tables, contingency tables
- chi-square test for comparing two or more independent samples on nominal variables
- Mann-Whitney U-test for ordinal data
- Kruskal-Wallis test for ordinal data
- analysis of variance (ANOVA), t-test
- measures of association: Cramér's V for nominal data, the Spearman correlation coefficient for ordinal data, the Pearson correlation coefficient for metric data
- regression analysis
- written exam on May 17th, 2007

Basics in market research – the typical steps of a market-research project:
- formulation of goals: what should be analysed?
- survey methods
- questionnaire
- sampling
- data collection (field work)
- plausibility checks and follow-up work regarding interviewers
- representativeness
- data analysis and statistical analysis
- written report
- presentation of results (press conferences, board meetings, ...)

Formulation of goals: What should be analysed? Is market research the proper tool for solving the problem? What are the constraints? What has to be considered?

Survey methods: The choice of the right survey method is determined by the budget, the time frame, and how much information should be retrieved through the survey.

Kinds of surveys – surveys can be conducted:
- face to face
- through mailing / post-mail
- by telephone (CATI, CI, ...)
- via Internet, text

Discuss the advantages and disadvantages of the above survey methods.

Questionnaire:
- open questions, where every answer is possible
- questions with specific answers:
  - dichotomous scale: only two answers to choose from
  - multiple-choice question: three or more alternatives, of which one or more answers can be chosen
  - Likert scale: the subject is asked to indicate his/her degree of agreement with each of a series of statements relevant to the attitude; e.g.: "Travelling with big and well-known airlines is safer." – strongly agree / agree / undecided / disagree / strongly disagree
  - semantic differential: a scale with a pair of opposite attributes; e.g.: the hotel xy is: modern ------------- old-fashioned
  - itemized rating scale: e.g.: "As a President, Gloria Arroyo has been ..." – very competent / somewhat competent / neither competent nor incompetent / somewhat incompetent / very incompetent
  - Stapel scale: e.g.: "As a President, G.A. has been ..." – -3 -2 -1 competent +1 +2 +3

Always try to avoid batteries of identical answer scales with more than 12 items.

Scales: In order to choose the correct statistical test and/or statistical measures, it is important to know which scale is used:
- nominal
- ordinal
- metric (interval, ratio)

Sampling methods: A sample is part of a population (e.g. the Philippine population). In order to relate the results of the sample to the population, the sample has to be an exact representation of the population with regard to various demographic characteristics.
- quota sampling
- random sampling:
  1) determination of sample points: area sampling
  2) determination of households (randomized)
  3) determination of the person in the household
"Each element must have the same chance to be selected."

Sample error: If one has no survey results yet (e.g. before starting the survey), one has to start from the worst case, a 50% : 50% answering pattern. The sample error is

e = 1.96 \sqrt{\frac{p(1-p)}{n}}

with p the answer percentage expressed as a proportion (e.g. 50% = 0.5) and n the sample size.

Sample size: The sample size is determined by
- the exactness of the sample (the interval around the most likely result),
- how sure one wants to be when predicting a result (95%), and
- the budget available from the customer.

Solving the above formula for n gives

n = \frac{1.96^2 \cdot p(1-p)}{e^2}

n ... sample size
p ... proportion of respondents that gave a certain answer
e ... sample error

Both formulas are illustrated in the short sketch below.
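A minimal Python sketch of the two formulas above (the function names are ours, not part of the lecture); for n = 1000 and the worst case p = 0.5 it reproduces the 3.10% entry of the table that follows.

```python
# Sample error and sample size for a proportion at the 95% confidence
# level (z = 1.96) - a minimal sketch; function names are assumptions.
import math

def sample_error(p: float, n: int, z: float = 1.96) -> float:
    """Sample error e = z * sqrt(p(1-p)/n)."""
    return z * math.sqrt(p * (1 - p) / n)

def sample_size(p: float, e: float, z: float = 1.96) -> float:
    """Required sample size n = z^2 * p(1-p) / e^2."""
    return z ** 2 * p * (1 - p) / e ** 2

print(sample_error(0.5, 1000))  # ~0.031 -> +/- 3.1 percentage points
print(sample_size(0.5, 0.02))   # ~2401 interviews needed for e = 2%
```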
Table to determine the sample error (p ... percentage of people giving a certain answer, shown as a split such as 30/70, i.e. p = 0.3; n ... sample size; table entries are the sample error in %, computed as e = 1.96 \sqrt{p(1-p)/n}):

n       10/90   15/85   20/80   25/75   30/70   35/65   40/60   45/55   50/50
100     5.88%   7.00%   7.84%   8.49%   8.98%   9.35%   9.60%   9.75%   9.80%
200     4.16%   4.95%   5.54%   6.00%   6.35%   6.61%   6.79%   6.89%   6.93%
300     3.39%   4.04%   4.53%   4.90%   5.19%   5.40%   5.54%   5.63%   5.66%
400     2.94%   3.50%   3.92%   4.24%   4.49%   4.67%   4.80%   4.88%   4.90%
500     2.63%   3.13%   3.51%   3.80%   4.02%   4.18%   4.29%   4.36%   4.38%
600     2.40%   2.86%   3.20%   3.46%   3.67%   3.82%   3.92%   3.98%   4.00%
700     2.22%   2.65%   2.96%   3.21%   3.39%   3.53%   3.63%   3.69%   3.70%
800     2.08%   2.47%   2.77%   3.00%   3.18%   3.31%   3.39%   3.45%   3.46%
900     1.96%   2.33%   2.61%   2.83%   2.99%   3.12%   3.20%   3.25%   3.27%
1000    1.86%   2.21%   2.48%   2.68%   2.84%   2.96%   3.04%   3.08%   3.10%
1500    1.52%   1.81%   2.02%   2.19%   2.32%   2.41%   2.48%   2.52%   2.53%
2000    1.31%   1.56%   1.75%   1.90%   2.01%   2.09%   2.15%   2.18%   2.19%
2500    1.18%   1.40%   1.57%   1.70%   1.80%   1.87%   1.92%   1.95%   1.96%
3000    1.07%   1.28%   1.43%   1.55%   1.64%   1.71%   1.75%   1.78%   1.79%

Example: Within a survey, 500 travellers at the airport in Manila were asked how they book their trips; 30% said that they book their holidays "last minute". One can now maintain that the real percentage of people who book their holiday "last minute" lies, with a likelihood of 95%, between 26% and 34% (30% ± 4.02%).

If the sample is more than 10% of the population, the sample error has to be corrected for the small population:

SP = 1.96 \sqrt{\frac{p(1-p)}{n} \cdot \frac{N-n}{N-1}}

with p the proportion and N the population size; SP is the sample error corrected for a small population (sample is more than 10% of the population).

Data collection: The actual fieldwork – face to face, by telephone or by mailing – is carried out by specially trained interviewers.

Plausibility and follow-up work regarding interviewers: This part is done visually as well as computer-assisted. Special questions in the questionnaire help to verify how serious the respondent was. Specialised market-research companies working with a CATI system are able to monitor every interview.

Representativeness: The objective of a representative survey is an exact "picture" of the respective population. The sample has to be an exact representation of the population with regard to various demographic characteristics. Usually representativeness is based on: sex, age, education.

Data analysis and statistical analysis: Within the data analysis one always has to refer to the scale of each question, so that one knows when it makes sense to calculate a mean, median, mode, variance, etc. The same differentiation has to be made when applying statistical tests.

Statistical measures according to variable type (scale):

Variable type   Measures                                                    Graphical representation
Nominal         frequency distribution, mode                                bar and pie charts
Ordinal         frequency distribution, mode, median, quartiles             bar and pie charts
Metric          min., max., median, mean, variance, std. dev., std. err.    boxplots, histogram, ...
Concerning graphical representation, statisticians can apply any graphic that visualizes the results in an optimal way; the above table gives only a brief overview.

Written report: The final report should contain tables, graphs and a verbal interpretation of each question. Interesting hypotheses should be tested and explained.

Presentation of results: Besides the long written report, a short report should be made and discussed within a board meeting and with the customer. A press conference with a PowerPoint presentation should be prepared.

Descriptive statistics

Frequency distribution: shows the number of times each observation occurs when the values of a variable are arranged in order according to their magnitude. Example: student scores on an exam.

score   x_i (absolute frequency)   h_i (relative frequency)   H_i (cumulative relative frequency)
1       2                          8.3                        8.3
2       4                          16.7                       25.0
3       9                          37.5                       62.5
4       6                          25.0                       87.5
5       3                          12.5                       100.0
total   24                         100.0

How to compute relative frequencies: h_2 = (x_2 / N) · 100 = (4/24) · 100 = 16.7%.
How to compute cumulative relative frequencies: H_3 = h_1 + h_2 + h_3 = 8.3 + 16.7 + 37.5 = 62.5%.

A frequency curve is the result of: a histogram → the curve-polygon in the histogram → smoothing of the curve-polygon → the density curve.

Statistical measures:

Mode: x_Mode = the value with maximal h_i. Definition: the observation that occurs with the greatest frequency. In the previous frequency distribution table, the mode is score 3.

Mean: \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i. Definition: the mean is the arithmetic average, defined as the sum of a set of values divided by their number; as its computation involves algebraic manipulation of the individual data values, the mean is an appropriate measure of central location for metric data only.

Excursus – mean of given class intervals: \bar{x} = \frac{1}{N} \sum_{i=1}^{k} x_i \cdot x_{mi}, where x_i is the frequency of class i and x_{mi} the mean value (midpoint) of class i. Example:

class   IQ interval   mean value in class (x_mi)   x_i      x_i · x_mi
1       80 – 100      90                           4        360
2       101 – 121     111                          9        999
3       122 – 142     132                          9        1188
4       143 – 163     153                          3        459
                                                   Σ = 25   Σ = 3006

\bar{x} = \frac{1}{N} \sum_{i=1}^{k} x_i \cdot x_{mi} = 3006 / 25 = 120.24

Geometric mean: x_G = \sqrt[n]{x_1 \cdot \ldots \cdot x_n}, with all x_i > 0. Definition: to be used on annual growth or loss processes, e.g. the mean turnover growth of a company as a time series: turnover increase from year 1 to year 2: 2%; turnover increase from year 2 to year 3: 18%. IMPORTANT: calculate via the growth factors (1.02 and 1.18), not the growth rates: x_G = \sqrt{1.02 \cdot 1.18} = 1.097, i.e. 9.7% mean turnover growth.

Median: for an odd number of observations n: z = x_{(n+1)/2}; for an even number of observations: z = \frac{x_{n/2} + x_{n/2+1}}{2}. Definition: the observation (or potential observation) in a set that divides the set so that the same number of observations lie on each side of it.

Quantiles: divisions of a probability distribution or frequency distribution into equal, ordered subgroups, for example quartiles or percentiles. Quartiles: four groups of 25% each. Percentiles: the set of divisions that produce exactly 100 equal parts in a series of continuous values. Very commonly used quartiles:
Q1: 1st quartile = x_{0.25}
Q2: 2nd quartile = x_{0.5} = z = median
Q3: 3rd quartile = x_{0.75}
With the minimum value, Q1, z, Q3 and the maximum value, a boxplot can be drawn to give a better picture of the distribution of a set of observations.

Population variance: \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2
Sample variance: s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2
Definition: the average of the squared differences between the observations and their mean.

Standard deviation: s = \sqrt{s^2} or \sigma = \sqrt{\sigma^2}. Definition: the square root of the variance. The sketch below computes these measures for the exam-score data above.
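A minimal Python sketch (not part of the lecture) computing the measures just defined for the exam-score data, expanded from the frequency table:

```python
# Descriptive statistics for the exam-score example - a minimal sketch.
import numpy as np
from statistics import mode, median

# expand the frequency table: score 1 twice, 2 four times,
# 3 nine times, 4 six times, 5 three times (N = 24)
scores = np.repeat([1, 2, 3, 4, 5], [2, 4, 9, 6, 3])

print(mode(scores))                          # 3 (greatest frequency)
print(np.mean(scores))                       # arithmetic mean ~3.17
print(median(scores))                        # 3
print(np.percentile(scores, [25, 50, 75]))   # quartiles Q1, z, Q3
print(np.var(scores, ddof=0))                # population variance
print(np.var(scores, ddof=1))                # sample variance s^2
print(np.std(scores, ddof=1))                # standard deviation s
```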
Normal distribution / normal curve: Under the bell-shaped curve,
\bar{x} ± s covers ~68% of the observations,
\bar{x} ± 2s covers ~95% of the observations,
\bar{x} ± 3s covers ~99.7% of the observations.

Standard error of the mean: s_{\bar{x}} = \frac{s}{\sqrt{N}}. We can be 95% confident that the real mean lies in the interval \bar{x} ± 1.96 s_{\bar{x}}.

Variation coefficient: VC = \frac{s}{\bar{x}} \cdot 100\%. Definition: this coefficient shows whether a distribution is homogeneous. If it exceeds 50%, the distribution is most likely not homogeneous, and the mean would not be a good measure.

Graphical representation – boxplot: a graphical method of displaying the important characteristics of a set of observations. The display is based on the five-number summary:
- x_{0.25} = beginning of the box
- x_{0.75} = end of the box
- the median inside the box, which covers the inter-quartile range
- two "whiskers" extending towards x_min and x_max; outliers or "outside observations" are indicated separately.

In order to identify so-called outliers, the inter-quartile range d_Q = x_{0.75} - x_{0.25} is used. The lower and upper ends of the whiskers are

z_u = x_{0.25} - 1.5 d_Q
z_o = x_{0.75} + 1.5 d_Q

All observations greater than z_o or smaller than z_u are outliers or "outside observations"; see the small sketch below.
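A minimal sketch of the five-number summary and the outlier fences (the data values are made up purely for illustration):

```python
# Five-number summary and boxplot fences - a minimal sketch;
# the data values are invented for illustration only.
import numpy as np

data = np.array([980, 1050, 1100, 1200, 1250, 1300, 1400, 1900])
q1, med, q3 = np.percentile(data, [25, 50, 75])
dq = q3 - q1                    # inter-quartile range d_Q
z_u = q1 - 1.5 * dq             # lower fence
z_o = q3 + 1.5 * dq             # upper fence
outliers = data[(data < z_u) | (data > z_o)]
print(data.min(), q1, med, q3, data.max())
print(z_u, z_o, outliers)       # 1900 falls above the upper fence
```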
Statistical testing

Statistical tests within data analysis are needed to determine differences between groups or to compare groups. These could be two different surveys with the same questions or, in our case, one single sample which is split into two or more sub-samples on the basis of some characteristic, e.g. sub-samples of males and females, different age groups, different education levels, ...

Another point of interest is the relationship between two variables; this is measured by a measure of association, where we are especially interested in the magnitude of the relationship. These measures of association are usually calibrated to range between 0 and +/-1, with 0 indicating no relationship between the variables (= complete independence) and 1 a perfect relationship; the + or - sign indicates the direction of the relationship (e.g. +1: the higher someone's income, the more money he/she spends on Christmas presents).

As a very rough rule of thumb, a relationship is usually considered
- very strong, if the association measure is larger than 0.8
- strong, if the association measure is between 0.7 and 0.8
- moderate, if the association measure is between 0.4 and 0.7
- weak, if the association measure is below 0.4

The table below gives an overview of the different statistical tests according to their level of measurement (scale):

Scale     Comparing groups                                        Computing relationships
Nominal   chi-square test                                         contingency coefficient or Cramér's V
Ordinal   n = 2 groups: Mann-Whitney U-test;                      Spearman's rank-order correlation
          n > 2 groups: Kruskal-Wallis test
Metric    one-way ANOVA (pre-condition: the metric variable       Pearson's product-moment correlation
          has to be normally distributed, and homogeneity
          of variance must hold)

Hypothesis testing: One approach to making inferences about the population is via hypothesis testing. Whereas in estimation the focus was on making informed guesses about the values of population parameters using our sample data and a relevant sampling distribution, in testing hypotheses the aim is to examine whether a particular proposition concerning the population is likely to hold or not. Within a bivariate analysis, one presumption concerning, for instance, men and women could be that there is a difference in their smoking behaviour: men are more likely to smoke than women. Such presumptions about a population, based on a survey, are called hypotheses. Basically there are two types of hypotheses:
- null hypothesis: H0
- alternative hypothesis: H1

In the case of comparing groups, the null hypothesis states that there is no difference between the two or more groups. Only null hypotheses can be tested; if they are rejected, this is taken to signify support for the alternative hypothesis. We can never test an alternative hypothesis directly, nor can we ever prove a hypothesis.

Steps in hypothesis testing:
- Formulate the null and the alternative hypothesis.
- Specify the significance level.
- Select an appropriate statistical test.
- Identify the probability distribution of the test statistic and define the region of rejection.
- Compute the value of the test statistic from the data and decide whether to reject or not reject the null hypothesis; any statistical software computes the exact level of significance, and again it is your turn to decide whether to reject or not reject the null hypothesis.

Formulating the null and the alternative hypothesis: The null hypothesis should contain a statement of equality – of no (= null) difference between groups or no (= null) relationship between variables. By convention, a null hypothesis is denoted – as mentioned previously – as H0 (read: H-nought) and is always given the benefit of the doubt: it is assumed to be true unless it is rejected as a result of the testing procedure. Note that the inability to reject the null hypothesis does not prove that H0 is actually true; it may be true, but our tests are only capable of disproving (not confirming) a hypothesis. If, as a result of testing, the null hypothesis is rejected, this is interpreted as signifying support for the alternative hypothesis, which, again by convention, is denoted H1 (read: H-one). Since H0 and H1 are complementary, the alternative hypothesis should always include a statement of inequality.

Specification of the significance level: Having formulated the null and the alternative hypothesis, the next step is to specify the circumstances under which H0 will be rejected. The following example should help to understand this decision problem: As a market researcher you are consulting a big company which is introducing a new product on the market. One result of your market study was that significantly more men than women will buy this new product (= you rejected H0 in favour of H1). Based on your study, the marketing manager of the company will develop a new strategy and advertising campaign in order to obtain two equal groups of buyers; the product should sell equally among men and women. All this costs a lot of money. The very first question of the marketing manager concerning your study could be: "How sure are you about your advice?" or "How high is the possibility of an error?" – or, in statistical notation: "How high is the risk of wrongly rejecting H0?" As you want to minimize the possibility of an error, one possible answer could be: "The chance of an error is ≤ 5%!"

There are four possible outcomes whenever you test hypotheses:

Decision made (based on a sample)   Situation in the population: H0 true   Situation in the population: H1 true
H0 not rejected                     correct decision                       beta error (β)
H0 rejected                         alpha error (α)                        correct decision
We denote α as our significance level and use it to indicate the maximum risk we are willing to take in rejecting a true null hypothesis; the less risk we are willing to assume, the lower the α. Typical values for α are 0.05, 0.01 and 0.001. Always remember that the significance level is the probability of making a mistake: rejecting a true null hypothesis.

Having specified a significance level, the way we use it is simple: If the result of our statistical test is such that the value obtained has a probability of occurrence less than or equal to α, then we reject H0 in favour of H1 and declare the test result significant. If the probability associated with the test result is greater than α, we cannot reject H0 and we denote the test result non-significant. This is the reason why statistical tests are often referred to as significance tests, and hypothesis testing as significance testing.

Selection of an appropriate statistical test: Basically there are three criteria to look at when selecting a statistical test. First, the type of hypothesis to be tested requires a different test each time; for example, different tests are appropriate for hypotheses concerning differences between groups than for hypotheses concerning relationships between variables. Second, the distributional assumptions made about the population from which the sample was drawn affect the choice of test; the most common assumption in this context is that the sample data have been drawn from a normally distributed population. Another question could be: "Do the samples come from populations with equal variances?" (especially relevant when applying an ANOVA). Third, the level of measurement (data scale) of the variables involved in the analysis is also relevant. Based on the above criteria, the following tests are commonly used:

Scale     Comparing groups                                        Computing relationships
Nominal   chi-square test                                         contingency coefficient or Cramér's V
Ordinal   n = 2 groups: Mann-Whitney U-test;                      Spearman's rank-order correlation
          n > 2 groups: Kruskal-Wallis test
Metric    one-way ANOVA (pre-condition: the metric variable       Pearson's product-moment correlation
          has to be normally distributed, and homogeneity
          of variance must hold)

Identification of the probability distribution of the test statistic: Each test generates what is known as a test statistic, which is a measure for expressing the results of the test. Referring to the table above, test statistics are e.g. the chi-square value, the U-value, the F-value, ... Each of these test statistics can be computed from the sample data and has to be compared with its counterpart drawn from a table (see the appendices of any statistics textbook); this tabulated value is also called the critical value, and it separates the rejection region from the acceptance region. In a next step you draw the distribution, mark the critical value from the table, define the acceptance and rejection regions, and either reject or do not reject H0.

Computation of the test statistic: How to compute the test statistic will be shown on the following pages, depending on which test is used. Every test statistic can be transformed to the standard normal distribution (= bell-shaped curve). The tabulated values of all the test statistics mentioned above depend on a statistical measure called degrees of freedom and, of course, on the applied level of significance. In order to standardize these test statistics, a z-transformation has to be applied. The critical z-values for the rejection of H0 (|z| greater than the critical value) within the standard normal distribution are:
- at a significance level of 1% (α = 0.01): z = ±2.575
- at a significance level of 5% (α = 0.05): z = ±1.96
- at a significance level of 10% (α = 0.10): z = ±1.645

These critical values can be reproduced with any statistical software, as in the sketch below.
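A minimal scipy sketch reproducing the critical z-values; norm.ppf is the inverse of the standard normal cumulative distribution function.

```python
# Two-sided critical z-values of the standard normal distribution.
from scipy.stats import norm

for alpha in (0.01, 0.05, 0.10):
    z_crit = norm.ppf(1 - alpha / 2)   # two-sided critical value
    print(alpha, round(z_crit, 3))     # 2.576 (the lecture rounds to
                                       # 2.575), 1.96, 1.645
```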
Probability – in brief: The subjective view of probability states that a probability is an estimate of what an individual thinks is the likelihood that an event will happen. This also means that two individuals may estimate the probability differently. How can we show that probability is somehow determined? Flip a coin and you will see that the probability for heads is 50%, as is the probability for tails. Throw a die: the probability for a 6 is 1/6 = 16.66%.

The set of all possible results is called the probability space; each possible result is also called an outcome. An event is a set that consists of a group of outcomes.

Probability of an event: First we count the number of outcomes in A; we call that N(A). Let s be the total number of outcomes in the probability space. The probability that event A will occur is then defined as

P(A) = N(A) / s

So probability is sometimes also defined as the long-term relative frequency with which an outcome or event occurs. There are many other ways of describing probabilities, e.g. the probability of a union, the probability of an intersection, and other probability-related topics like sampling with replacement and sampling without replacement; we will focus on the sort of probability that is related to frequency tables.

Conditional probability: The probabilities most widely interpreted within a frequency or contingency table are:
- joint probability
- marginal probability
- conditional probability

Events and probabilities (each cell: count; % within row; % within column; % of total):

           event C                     event D                     Total
event E    52; 59.1%; 45.6%; 23.5%     36; 40.9%; 33.6%; 16.3%     88; 100.0%; 39.8%; 39.8%
event F    62; 46.6%; 54.4%; 28.1%     71; 53.4%; 66.4%; 32.1%     133; 100.0%; 60.2%; 60.2%
Total      114; 51.6%; 100.0%          107; 48.4%; 100.0%          221; 100.0%; 100.0%

- The "% of total" entries are the joint probabilities:
  - joint probability of C and E: p(C&E) = 52/221 = 0.235 = 23.5%
  - joint probability of F and D: p(F&D) = 71/221 = 0.321 = 32.1%
  - E and D / F and C accordingly.
- The marginal probabilities are found under "Total" – within the rows for events C and D and within the columns for events E and F:
  - marginal probability of event C: p(C) = 114/221 = 0.516 = 51.6%
  - marginal probability of event E: p(E) = 88/221 = 0.398 = 39.8%
  - marginal probabilities for D and F accordingly.
- The conditional probabilities are:
  - p(E|C) = probability of event E under the condition of C: 52/114 = 0.456 = 45.6%
  - p(F|C) = probability of event F under the condition of C: 62/114 = 0.544 = 54.4%
  - p(C|F) = probability of event C under the condition of F: 62/133 = 0.466 = 46.6%
  - p(D|F) = probability of event D under the condition of F: 71/133 = 0.534 = 53.4%

The sketch below computes all three kinds of probabilities from the counts.
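A minimal numpy sketch (not part of the lecture) deriving the joint, marginal and conditional probabilities from the count table above:

```python
# Joint, marginal and conditional probabilities from the 2x2 counts
# (events E/F in rows, C/D in columns) - a minimal sketch.
import numpy as np

counts = np.array([[52, 36],    # event E: C, D
                   [62, 71]])   # event F: C, D
n = counts.sum()                # 221

joint = counts / n                        # e.g. p(C & E) = 0.235
p_row = counts.sum(axis=1) / n            # marginals p(E), p(F)
p_col = counts.sum(axis=0) / n            # marginals p(C), p(D)
cond_on_col = counts / counts.sum(axis=0)                  # p(E|C) = 0.456
cond_on_row = counts / counts.sum(axis=1, keepdims=True)   # p(C|F) = 0.466

print(joint.round(3), p_row.round(3), p_col.round(3))
print(cond_on_col.round(3), cond_on_row.round(3))
```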
Tests to compare differences

Chi-square test: The chi-square test is the most widely used non-parametric test. Our form of the chi-square test is the test of homogeneity: it tests whether two samples come from populations with alike distributions. It is applied to nominal variables to detect possible differences between two or more groups. The Pearson chi-square statistic is computed as

\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}

O ... observed values
E ... expected values

It is the comparison of observed versus expected values in a two-dimensional table, where the expected values are calculated from the product of the appropriate row and column totals divided by the number of observations. The expected values would be the answering pattern if there were no difference between, e.g., two groups:

e_{ij} = \frac{c_i \cdot r_j}{N}

c ... column total, r ... row total, N ... total (sample size)

The following example shows the application of the chi-square test. The question is: Are there differences between men and women regarding their book-reading behaviour? The two hypotheses are:
H0: There is no difference between men and women regarding their book-reading behaviour.
H1: Between men and women there are significant differences regarding their book-reading behaviour.

The following table shows the result of a survey among 300 students.

Step 1: observed frequencies:

         detective novels   love novels   SciFi   total
men      65                 25            50      140
women    70                 50            40      160
total    135                75            90      300
in %     45%                25%           30%     100%

Step 2: expected frequencies:

         detective novels   love novels   SciFi
men      63                 35            42
women    72                 40            48

\chi^2_{calc} = \frac{(65-63)^2}{63} + \frac{(25-35)^2}{35} + \ldots + \frac{(40-48)^2}{48} = 8.333

In order to find the critical \chi^2-value in the table of the chi-square distribution, we need to know the degrees of freedom of our test problem and the proposed level of significance. For the chi-square distribution the degrees of freedom are calculated as

DF = (c-1)(r-1) = (3-1)(2-1) = 2,

i.e. 2 degrees of freedom. The level of significance should be 5%, i.e. α = 0.05. The critical value then is \chi^2_{table}(0.95; 2) = 5.991. As the chi-square value calculated from the sample is greater than the critical chi-square value, we reject H0.

The chi-square distribution can be approximated through the normal distribution. The appropriate z-value can be calculated as

z = \sqrt{2\chi^2} - \sqrt{2 \cdot DF - 1}

(this is only one way to calculate z; others may be found in the SPSS handbook). For the above example: z = \sqrt{2 \cdot 8.333} - \sqrt{2 \cdot 2 - 1} = 2.351.

The exact level of significance (= the area in the two tails of the bell-shaped curve of the standard normal distribution) can be computed via the integral

1 - \int_{-z}^{z} \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx \approx 0.016395  (with z rounded to 2.4)

The probability of making a type-1 error (= α-error) – rejecting H0 in favour of H1 although in the population (~ reality) H0 is true – is only 1.6%. According to this result (sig = 0.016395 ≤ 0.05) we can reject H0: there are statistically significant differences between men and women regarding their book-reading behaviour. If we reject H0, the contingency table has to be interpreted (see the output below).

SPSS output – SEX * BOOK crosstabulation (each cell: count; % within SEX; % within BOOK):

         detective stories      love novels            SciFi                  Total
male     65; 46.4%; 48.1%       25; 17.9%; 33.3%       50; 35.7%; 55.6%       140; 100.0%; 46.7%
female   70; 43.8%; 51.9%       50; 31.3%; 66.7%       40; 25.0%; 44.4%       160; 100.0%; 53.3%
Total    135; 45.0%; 100.0%     75; 25.0%; 100.0%      90; 30.0%; 100.0%      300; 100.0%; 100.0%

Chi-square tests: Pearson chi-square = 8.333 (df = 2, asymp. sig. 2-sided = 0.016); likelihood ratio = 8.459 (df = 2, sig. = 0.015); linear-by-linear association = 0.661 (df = 1, sig. = 0.416); N of valid cases = 300. 0 cells (0.0%) have an expected count of less than 5; the minimum expected count is 35.00.

Interpretation of the contingency table!
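The lecture computes this by hand and in SPSS; the following is a minimal scipy cross-check of the same example:

```python
# Chi-square test of the book-reading example - a minimal sketch;
# chi2_contingency returns chi2, p, df and the expected frequencies.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[65, 25, 50],    # men:   detective, love, SciFi
                     [70, 50, 40]])   # women: detective, love, SciFi
chi2, p, df, expected = chi2_contingency(observed, correction=False)
print(chi2, df, p)   # 8.333..., 2, ~0.016 -> reject H0
print(expected)      # [[63. 35. 42.], [72. 40. 48.]]
```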
Special application of the chi-square test – 2x2 tables: the chi-square test with continuity correction (Yates correction for continuity). When testing for independence in a contingency table, a continuous probability distribution, namely the chi-square distribution, is used as an approximation to the discrete probability of the observed frequencies, namely the multinomial distribution. To improve this approximation, Yates suggested a correction that involves subtracting 0.5 from the positive discrepancies (observed - expected) and adding 0.5 to the negative discrepancies before these values are squared in the calculation of the usual chi-square statistic:

\chi^2 = \sum_{i=1}^{n} \frac{(|O_i - E_i| - 0.5)^2}{E_i}

If the sample size is very large, the correction has only little effect on the value of the test statistic.

Example: Is there a difference between men and women regarding their religiosity? The hypotheses are:
H0: Between men and women there is no difference regarding their religiosity (M = F).
H1: Between men and women there are differences regarding their religiosity (M ≠ F).

SPSS output – SEX * "Are you a religious person?" crosstabulation (each cell: count; % within SEX; % within the religiosity question):

         yes                    no                     Total
female   6; 75.0%; 60.0%        2; 25.0%; 20.0%        8; 100.0%; 40.0%
male     4; 33.3%; 40.0%        8; 66.7%; 80.0%        12; 100.0%; 60.0%
Total    10; 50.0%; 100.0%      10; 50.0%; 100.0%      20; 100.0%; 100.0%

Chi-square tests: Pearson chi-square = 3.333 (df = 1, asymp. sig. = 0.068); continuity correction = 1.875 (df = 1, sig. = 0.171; computed only for a 2x2 table); likelihood ratio = 3.452 (df = 1, sig. = 0.063); Fisher's exact test: exact sig. 2-sided = 0.170, 1-sided = 0.085; linear-by-linear association = 3.167 (df = 1, sig. = 0.075); N of valid cases = 20. 2 cells (50.0%) have an expected count of less than 5; the minimum expected count is 4.00.

The example shows that the level of significance based on the continuity correction is 0.171; this indicates that there is no difference between men and women regarding their religiosity, so we cannot reject H0. If we rejected H0 anyway, the risk of a type-1 error – rejecting H0 in favour of H1 while H0 is actually the right decision – would be 17%.
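A minimal scipy cross-check of the religiosity example; for a 2x2 table chi2_contingency applies the Yates continuity correction by default:

```python
# Yates-corrected chi-square on the 2x2 religiosity table.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[6, 2],    # female: yes, no
                     [4, 8]])   # male:   yes, no

chi2, p, df, expected = chi2_contingency(observed)  # correction=True
print(chi2, p)        # 1.875, ~0.171 -> H0 cannot be rejected

chi2_u, p_u, _, _ = chi2_contingency(observed, correction=False)
print(chi2_u, p_u)    # 3.333..., ~0.068 (uncorrected Pearson value)
```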
Nonparametric tests

The following statistical tests – the Mann-Whitney U-test and the Kruskal-Wallis test – are used when one compares two or more groups on an ordinal variable, or on a metric variable that is not normally distributed.

Mann-Whitney U-test: The Mann-Whitney U-test (also known as the Wilcoxon rank-sum W test) is very useful when you have two groups to compare on a variable which is measured at the ordinal level, or on a metric variable which is not normally distributed. The test focuses on differences in central location and makes the assumption that any differences in the distributions of the two populations are due only to differences in location (rather than, say, variability). Basically, all observations are ranked jointly and then divided into the two observed groups. One test statistic is the sum of ranks in the smaller group; the U-test statistic is computed via this so-called W (= sum of ranks). The null hypothesis tested by the Mann-Whitney U-test is that there is no difference between the two groups in terms of location, the mean ranks serving as measures of central tendency only when comparing them: the value of a mean rank by itself does not state anything about how good or bad something was evaluated or ranked. Data collected on an ordinal scale only allow interpretations based on their ranks; mean and variance are meaningless in their interpretation.

The following example shows the application of the Mann-Whitney U-test. The question is: Is there a difference between male and female students in how they rate the possibilities of doing at least some sport in the area where they live?

The hypotheses are:
H0: There is no difference between male and female students regarding how they rate the possibilities of doing at least some sport in the area where they live.
H1: There is a significant difference between male and female students regarding how they rate the possibilities of doing at least some sport in the area where they live.
(alpha = 5%)

The table below shows the data as collected:

rating        male      female    t_i     (t_i^3 - t_i)
very good     6         3         9       720
good          5         1         6       210
less good     2         3         5       120
bad           0         4         4       60
total         n1 = 13   n2 = 11   N = 24  Σ = 1110

In a first step all data are ranked by their rating, independently of the grouping variable sex: we calculate a (mean) rank value for all "very good" ratings, all "good" ratings, all "less good" ratings and all "bad" ratings. So we have 24 rankings, starting with the first male person who said "very good" up to the last female person – in our example the 24th – who rated "bad". In a next step the rank values are weighted by the grouping variable (= sex; the numbers of males and females), and the rank sums for all male and all female persons can be calculated:

rating          mean rank                       male              female
1 = very good   (1+2+...+9)/9 = 45/9 = 5        5 x 6 = 30        5 x 3 = 15
2 = good        (10+...+15)/6 = 75/6 = 12.5     12.5 x 5 = 62.5   12.5 x 1 = 12.5
3 = less good   (16+...+20)/5 = 90/5 = 18       18 x 2 = 36       18 x 3 = 54
4 = bad         (21+...+24)/4 = 90/4 = 22.5     22.5 x 0 = 0      22.5 x 4 = 90
sum of ranks R                                  R1 = 128.5        R2 = 171.5
mean rank                                       128.5/13 = 9.88   171.5/11 = 15.59

Check: R1 + R2 = N(N+1)/2, i.e. 128.5 + 171.5 = 300.

In a next step we determine the U-test statistic:

U1 = R1 - n1(n1+1)/2 = 128.5 - 91 = 37.5
U2 = R2 - n2(n2+1)/2 = 171.5 - 66 = 105.5

The U-test statistic is always the smaller of U1 and U2: U = min(U1, U2) = 37.5. According to the U-distribution table the critical value is U(α/2; n1; n2) = U(0.025; 13; 11) = 42. As the calculated U-value (37.5) is smaller (remember: U-distribution) than the critical U-value from the table (U_tab = 42), we can reject H0 in favour of H1, so we may interpret the calculated mean ranks.

SPSS output – Mann-Whitney U-test, ranks ("rating sport possibilities in your area" by sex): male N = 13, mean rank 9.88, sum of ranks 128.50; female N = 11, mean rank 15.59, sum of ranks 171.50; total N = 24.

In order to get the exact level of significance, we first have to compute z. The following formula contains a correction term for so-called ties (= several people giving the same answer):

z = \frac{U - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2}{12 N (N-1)} \left( N^3 - N - \sum_{i=1}^{m} (t_i^3 - t_i) \right)}} = -2.054

Excursus: If each data value of the ordinal (or non-normally distributed) variable is unique (each value appears only once within this variable), then we do not need a correction for ties and z can be computed as

z = \frac{U - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}}

Computing the exact level of significance (limits ±2.054):

1 - \int_{-z}^{z} \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx = 0.040

sig = 0.040 = 4.0%. As 0.040 ≤ 0.05, we reject H0 in favour of H1.
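A minimal scipy cross-check of the example (use_continuity=False matches the lecture's z-value, which carries no continuity correction):

```python
# Mann-Whitney U-test on the sport-possibilities ratings - a minimal
# sketch; ratings are coded 1 = very good ... 4 = bad as above.
import numpy as np
from scipy.stats import mannwhitneyu

male   = np.repeat([1, 2, 3, 4], [6, 5, 2, 0])   # n1 = 13
female = np.repeat([1, 2, 3, 4], [3, 1, 3, 4])   # n2 = 11

res = mannwhitneyu(male, female, alternative='two-sided',
                   method='asymptotic', use_continuity=False)
print(res.statistic, res.pvalue)   # 37.5, ~0.040 (z = -2.054)
```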
Conclusion: There is a significant difference between male and female students regarding how they rate the possibilities of doing at least some sport in the area where they live. How can we interpret the difference? Take a look at the mean ranks:

mean ranks: male 9.88, female 15.59

The mean rank itself does not help us through its value; only when we compare both mean ranks can we see the difference: the mean rank of the female persons is higher than that of the male persons. Now look at the coding of the variable "rating the possibilities of doing sport in the area where one lives": 1 = very good, 2 = good, 3 = less good, 4 = bad. A higher value is a worse rating than a low value; this means: female persons rated the possibilities worse than the male persons did. We cannot say how much worse! In order to clarify the result even more, we can take a closer look at the contingency table of the two analysed variables.

Output of the Mann-Whitney U-test, continued – test statistics (grouping variable: sex): Mann-Whitney U = 37.500; Wilcoxon W = 128.500; Z = -2.054; asymp. sig. (2-tailed) = 0.040; exact sig. [2·(1-tailed sig.)] = 0.047 (not corrected for ties).

To decide between H0 and H1, we look at the asymptotic significance (0.040), which is corrected for ties, and we decide to reject H0. Once again: if we reject H0, we interpret the mean ranks or take a look at the contingency table.

Ranks: male N = 13, mean rank 9.88, sum of ranks 128.50; female N = 11, mean rank 15.59, sum of ranks 171.50; total N = 24.

For interpretation only – contingency table "rating sport possibilities in your area" * sex (count; % within sex):

             male           female         Total
very good    6; 46.2%       3; 27.3%       9; 37.5%
good         5; 38.5%       1; 9.1%        6; 25.0%
less good    2; 15.4%       3; 27.3%       5; 20.8%
bad          0; 0.0%        4; 36.4%       4; 16.7%
Total        13; 100.0%     11; 100.0%     24; 100.0%

Kruskal-Wallis H-test: The H-test by Kruskal and Wallis is used to compare more than two independent groups (k > 2) regarding an ordinal or a not normally distributed metric variable. Like the Mann-Whitney U-test, the Kruskal-Wallis H-test tests the same null hypothesis, but across k rather than two groups. Basically, all observations are ranked jointly and then divided into the k observed groups; from the rank sums of the groups a chi-square-distributed test statistic is computed in order to find out whether H0 can be rejected or not. The test statistic is based on an approximation of the chi-square distribution with k-1 degrees of freedom, where k is the number of groups compared. An adjustment taking tied ranks into account is also provided. The null hypothesis tested by the Kruskal-Wallis H-test is that there is no difference between the k groups in terms of location, the mean ranks again serving as measures of central tendency only when comparing them. As within the Mann-Whitney U-test, the value of a mean rank by itself does not state anything about how good or bad something was evaluated or ranked; data collected on an ordinal scale only allow interpretations based on their ranks, and mean and variance are meaningless in their interpretation.

Example to demonstrate the application of the Kruskal-Wallis H-test: Is there a difference between the four regions of origin regarding how the students rate the possibilities of doing at least some sport in the area where they live?
The hypotheses are:
H0: There is no difference between students originating from the four areas (Luzon, Visayas, Mindanao, abroad) regarding how they rate the possibilities of doing at least some sport in the area where they live.
H1: There is a significant difference between students originating from the four areas (Luzon, Visayas, Mindanao, abroad) regarding how they rate the possibilities of doing at least some sport in the area where they live.
(alpha = 5%)

The computation of the sums of ranks and the mean ranks is the same as within the Mann-Whitney U-test. We start with the following table:

rating          Luzon   Visayas   Mindanao   abroad   t_i      (t_i^3 - t_i)
1 = very good   6       1         0          2        9        720
2 = good        5       0         0          1        6        210
3 = less good   2       2         1          0        5        120
4 = bad         2       1         1          0        4        60
total           15      4         2          3        N = 24   Σ = 1110

Weighted rank values per group:

rating             Luzon    Visayas   Mindanao   abroad
1 = very good      30       5         0          10
2 = good           62.5     0         0          12.5
3 = less good      36       36        18         0
4 = bad            45       22.5      22.5       0
sum of ranks R *)  173.5    63.5      40.5       22.5
mean rank (R/n)    11.57    15.88     20.25      7.5

*) Sums of ranks and mean ranks are computed as in the Mann-Whitney U-test.

The test statistic H is approximately chi-square distributed with k-1 degrees of freedom, where k is the number of groups (here: 4 groups of origin): DF = 4 - 1 = 3.

The formula below refers to a data set where every value appears only once within the ordinal variable (T_i being the rank sum of group i):

H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{T_i^2}{n_i} - 3(N+1)

In our example the same rating was given more often than once; in this case a correction term is used:

H' = \frac{H}{1 - \frac{\sum_{i=1}^{m} (t_i^3 - t_i)}{N^3 - N}}

H'_corr = 5.519

Due to the chi-square approximation we refer to the chi-square table in order to find the critical value: H'(1-α; DF) = H'(0.95; 3) = 7.815. As our calculated H-value is smaller than the critical value from the chi-square table, we cannot reject H0. As with the chi-square test, one could also calculate the z-value and then the exact level of significance.

SPSS output – Kruskal-Wallis test. Ranks ("rating sport possibilities in your area" by origin): Luzon N = 15, mean rank 11.57; Visayas N = 4, mean rank 15.88; Mindanao N = 2, mean rank 20.25; abroad N = 3, mean rank 7.50; total N = 24. Test statistics (Kruskal-Wallis test; grouping variable: origin): chi-square = 5.519, df = 3, asymp. sig. = 0.138.

To decide between H0 and H1, we look at the asymptotic significance (0.138), which is corrected for ties, and we decide not to reject H0. Once again: if we could reject H0, we would interpret the mean ranks or take a look at the contingency table.
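A minimal scipy cross-check of the example; scipy's kruskal applies the tie correction automatically:

```python
# Kruskal-Wallis H-test on the four origin groups - a minimal sketch;
# ratings are coded 1 = very good ... 4 = bad as above.
import numpy as np
from scipy.stats import kruskal

luzon    = np.repeat([1, 2, 3, 4], [6, 5, 2, 2])   # n = 15
visayas  = np.repeat([1, 2, 3, 4], [1, 0, 2, 1])   # n = 4
mindanao = np.repeat([1, 2, 3, 4], [0, 0, 1, 1])   # n = 2
abroad   = np.repeat([1, 2, 3, 4], [2, 1, 0, 0])   # n = 3

h, p = kruskal(luzon, visayas, mindanao, abroad)
print(h, p)   # ~5.519, ~0.138 -> H0 cannot be rejected
```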
Analysis of variance: Analysis of variance (ANOVA) is used to compare the means of two or more groups regarding a metric variable (the t-test is used to compare only two groups). As the F-test of the ANOVA is the more widely applicable test, we will go deeper into its test statistic. Two assumptions must be met before you can run a one-way ANOVA:
- normal distribution (Kolmogorov-Smirnov test)
- equal variances of the groups (Levene statistic)

If these assumptions are not given, we have to go back to the tests for the ordinal scale and compute a Mann-Whitney U-test or a Kruskal-Wallis test.

The analysis of variance partitions the total variance of a set of observations into parts due to particular factors – for example sex, origin, age groups, etc. – and by comparing variances (mean squares) by way of F-tests, differences between means can be assessed. The following hypotheses are tested:

H0: μ1 = μ2 = ... = μn
H1: the means are not all equal

One approach to calculate the F-test statistic (F calculated from the data) is

F_{calc} = \frac{1 - W/T}{W/T} \cdot \frac{\nu_2}{\nu_1}

where T is the total variation and W is the variation within the groups:

T = \sum_{i=1}^{N} (x_i - \bar{x})^2

W is computed accordingly for each level (group), i.e. as the sum of the squared deviations of each observation from its own group mean. Degrees of freedom: ν1 = I - 1, where I is the number of groups to be compared; ν2 = N - I, where N is the sample size.

The following example demonstrates the application of an ANOVA. The question is: Is there a difference between male and female persons regarding how often they perform sport per week, if at all? (Formulate the hypotheses!)

Data ("how often sport a week, if at all"); for each observation the T-contribution is (x_i - 3.0667)^2 and the W-contribution is (x_i - group mean)^2:

group 1 = male (11 values, group mean 3.7273): 1, 2, 2, 3, 4, 4, 4, 5, 5, 5, 6
group 2 = female (4 values, group mean 1.25): 1, 1, 1, 2
total mean = 3.0667

T = Σ(x_i - 3.0667)^2 = 42.933
W = Σ(x_i - group mean)^2 = 24.93

ν1 = I - 1 = 2 - 1 = 1
ν2 = N - I = 15 - 2 = 13

F_calc = \frac{1 - W/T}{W/T} \cdot \frac{\nu_2}{\nu_1} = 9.387

As within the other statistical tests, the critical F-value has to be taken from the table of the F-distribution: F_tab(0.05; 1; 13) ≈ 4.67. As F_calc ≥ F_tab we can reject H0, and we can state that there is a statistically significant difference in the means of the two groups (male/female) regarding how often they perform sport per week.

SPSS output of the ANOVA, including the K-S test for normal distribution and the test of homogeneity of variances:

1) K-S test for normal distribution ("how often sport per week"): N = 15; normal parameters: mean 3.0667, std. deviation 1.75119; most extreme differences: absolute 0.195, positive 0.195, negative -0.170; Kolmogorov-Smirnov Z = 0.757; asymp. sig. (2-tailed) = 0.615. (H0: the data are normally distributed; H1: the data are not normally distributed.)

2) Test of homogeneity of variances ("how often sport per week"): Levene statistic 4.224 (df1 = 1, df2 = 13), sig. = 0.061. (H0: homogeneity of variances is given; H1: homogeneity of variances is not given.)

3) ANOVA results – descriptives ("how often sport per week"): male N = 11, mean 3.7273, std. dev. 1.55505, std. error 0.46887, 95% CI for the mean [2.6826; 4.7720], min 1.00, max 6.00; female N = 4, mean 1.2500, std. dev. 0.50000, std. error 0.25000, 95% CI [0.4544; 2.0456], min 1.00, max 2.00; total N = 15, mean 3.0667, std. dev. 1.75119, std. error 0.45216, 95% CI [2.0969; 4.0364], min 1.00, max 6.00.

ANOVA table: between groups sum of squares 18.002 (df = 1, mean square 18.002); within groups 24.932 (df = 13, mean square 1.918); total 42.933 (df = 14); F = 9.386, sig. = 0.009. (H0: the means of the groups male and female are equal; H1: the means of the groups male and female are not equal.)

As the level of significance is less than or equal to 0.05 (sig = 0.009), we can reject H0 and say that there is a statistically significant difference between the means of the groups (male and female persons). The next step is the interpretation of the means above.
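A minimal scipy cross-check of the example, including the two assumption checks; note that scipy's shapiro is used here in place of the Kolmogorov-Smirnov test from the SPSS output, and center='mean' mirrors the mean-based Levene statistic:

```python
# One-way ANOVA of the sport-frequency example - a minimal sketch.
import numpy as np
from scipy.stats import f_oneway, levene, shapiro

male   = np.array([1, 2, 2, 3, 4, 4, 4, 5, 5, 5, 6])
female = np.array([1, 1, 1, 2])

print(levene(male, female, center='mean'))        # ~4.224, p ~0.061
print(shapiro(np.concatenate([male, female])))    # normality check
print(f_oneway(male, female))                     # F ~9.386, p ~0.009
```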
Measures of association / relationships

Contingency coefficient and Cramér's V (0 to 1): For nominal variables an appropriate measure of relationship is the contingency coefficient or Cramér's V. Their values always fall between 0 and 1 and can thus be interpreted as reflecting relationships of different magnitudes. Both measures are chi-square based, as the following formulas show:

C = \sqrt{\frac{\chi^2}{\chi^2 + n}}

Cramér's V = \sqrt{\frac{\chi^2}{n(k-1)}}

where n is the sample size and k is the smaller of the number of rows and the number of columns in the contingency table.

We analyse the question: Is there a relationship between one's own sport behaviour and one's smoking behaviour? The hypotheses are:
H0: There is no relationship between sport behaviour and smoking behaviour.
H1: There is a significant relationship between sport behaviour and smoking behaviour.

The results of the contingency coefficient and Cramér's V are:

C = \sqrt{\frac{\chi^2}{\chi^2 + n}} = 0.488

Cramér's V = \sqrt{\frac{\chi^2}{n(k-1)}} = 0.559

SPSS output – symmetric measures (nominal by nominal): Cramér's V = 0.559 (approx. sig. = 0.024); contingency coefficient = 0.488 (approx. sig. = 0.024); N of valid cases = 24.

The level of significance (sig = 0.024) indicates a significant relationship between the two variables; the magnitude is 0.559 (Cramér's V) or 0.488 (contingency coefficient).
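A minimal sketch of the two formulas (the helper functions are ours; the χ² value of about 7.5 is not given in the lecture but is the one implied by C = 0.488 with n = 24):

```python
# Contingency coefficient and Cramer's V from a chi-square value -
# a minimal sketch; chi2 ~7.5 is inferred from C = 0.488, n = 24.
import math

def contingency_coefficient(chi2: float, n: int) -> float:
    return math.sqrt(chi2 / (chi2 + n))

def cramers_v(chi2: float, n: int, k: int) -> float:
    """k = smaller of the number of rows and columns."""
    return math.sqrt(chi2 / (n * (k - 1)))

print(contingency_coefficient(7.5, 24))  # ~0.488
print(cramers_v(7.5, 24, 2))             # ~0.559
```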
Spearman's rank-order correlation (-1 to +1): If dealing with a situation in which both variables concerned are ordinal, one can investigate not only the strength of the association but also its direction (positive or negative relationship). Spearman's rank-order correlation coefficient ranges from -1 to +1, with values close to zero indicating little or no association between the variables.

Given two variables x and y: x_1 ≤ ... ≤ x_n, ranked in order with rank(x_i) = i, and y_1 ≤ ... ≤ y_n, ranked in order with rank(y_i) = i. Each observation gives a pair of ratings (x_i, y_i). Within the x-values as well as within the y-values identical values can appear; in that case mean rank values are calculated. The formula for the computation of Spearman's rank-order coefficient is

r_{SP} = 1 - \frac{6 \sum d_i^2}{(n^2 - 1) n}

If we have to correct our data for ties, which is quite often the case, the following formula has to be applied:

r_{SP,corr} = 1 - \frac{6 \sum d_i^2}{(n^2 - 1) n - (T_{x'} + T_{y'})}

where the d_i are the rank differences and n is the sample size, and

T_{x'} = \frac{1}{2} \sum_i (t_{x'_i}^3 - t_{x'_i})

considers the frequency of each single rating value of the ordinal variable; T_{y'} is computed likewise.

Example: Two variables are given:
x ... rating of the possibilities of doing sport where you live (1 = very good, 2 = somehow good, 3 = not so good)
y ... rating of your own sporting behaviour (1 = very sporty, 2 = somehow sporty, 3 = not so sporty)

The hypotheses are:
H0: Between the rating of the possibilities of doing sport where you live and the rating of your own sporting behaviour there is no relationship.
H1: Between the rating of the possibilities of doing sport where you live and the rating of your own sporting behaviour there is a significant relationship.

The n = 25 observations carry the following mean rank values:

x-values (ten 1s, five 2s, ten 3s): rk(1) = (1+2+...+10)/10 = 5.5; rk(2) = (11+...+15)/5 = 13; rk(3) = (16+...+25)/10 = 20.5
y-values (seven 1s, ten 2s, eight 3s): rk(1) = 4; rk(2) = 12.5; rk(3) = 21.5

Pairs (x, y), rank differences d = rk(x) - rk(y) and squared differences d²:

x   y   d                   d²       occurrences
1   1   5.5 - 4 = 1.5       2.25     6
1   2   5.5 - 12.5 = -7     49       4
2   2   13 - 12.5 = 0.5     0.25     3
2   3   13 - 21.5 = -8.5    72.25    2
3   1   20.5 - 4 = 16.5     272.25   1
3   2   20.5 - 12.5 = 8     64       3
3   3   20.5 - 21.5 = -1    1        6
                            Σ d² = 825

T_{x'} = ½ · [(10³-10) + (5³-5) + (10³-10)] = 1050
T_{y'} = ½ · [(7³-7) + (10³-10) + (8³-8)] = 915

r_{SP,corr} = 1 - \frac{6 \cdot 825}{(25^2 - 1) \cdot 25 - (1050 + 915)} = 1 - \frac{4950}{13635} = +0.6369

The relationship between the two variables is positive; we do not yet know whether this relationship is significant or not. Via a z-transformation we can calculate the exact level of significance:

z = ż \sqrt{n-3}, where ż is the correlation figure and can be computed via r as ż = \frac{1}{2} \ln \frac{1+r}{1-r} = \frac{1}{2} \ln \frac{1.6369}{0.3631} = 0.753

hence z = 0.753 · \sqrt{25 - 3} = 3.532. The level of significance is

1 - \int_{-z}^{z} \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx \approx 0.00123

As the level of significance is ≤ 0.05 (sig = 0.00123), we can reject H0, and one can say that the relationship between the rating of the possibilities of doing sport where you live and the rating of your own sporting behaviour is statistically significant. The magnitude is positive, +0.637. This means: the better somebody rates the sporting possibilities where he/she lives, the better he/she also rates his/her own sporting behaviour.

SPSS output – Spearman's rank-order coefficient (Spearman's rho): correlation coefficient between "rating the sport possibilities where one lives" and "rating the own sporting behaviour" = 0.637 (sig. 2-tailed = 0.001, N = 25); the correlation is significant at the 0.01 level (2-tailed).

How to read the output:
1) Read the level of significance and decide between H0 and H1.
2) If H1: how strong is the relationship?
3) Which direction does the relationship have?
4) Interpret the result!
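A minimal scipy cross-check of the example (the 25 pairs are expanded from the occurrence counts in the worked table above; spearmanr handles ties via mean ranks):

```python
# Spearman rank-order correlation of the 25 rating pairs.
import numpy as np
from scipy.stats import spearmanr

# (x, y, number of occurrences), as in the worked table
pairs = [(1, 1, 6), (1, 2, 4), (2, 2, 3), (2, 3, 2),
         (3, 1, 1), (3, 2, 3), (3, 3, 6)]
x = np.concatenate([np.full(c, xi) for xi, _, c in pairs])
y = np.concatenate([np.full(c, yi) for _, yi, c in pairs])

rho, p = spearmanr(x, y)
print(rho, p)   # ~+0.637, p ~0.001 -> significant, positive
```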
The hypotheses are: H0: Between income and expenses for x-mas presents is no relationship H1: Between income and expenses for x-mas presents there is a significant relationship 56 Formula for computing the Pearson´s product moment correlation coeffizient: n rP = [( x i x ) * ( y i y )] i 1 n (x n i x) 2 * i 1 (y y) 2 i 1 income (x) 1000,00 1050,00 1100,00 1200,00 1250,00 1250,00 1250,00 1500,00 1400,00 1600,00 mean(x)=1260 rP = i 37100 334000 * 4290 expenses (y) 75,00 80,00 85,00 90,00 100,00 110,00 110,00 130,00 120,00 140,00 mean(y)=104 [( xi x ) * ( y i y )] 7540,00 5040,00 3040,00 840,00 40,00 -60,00 -60,00 6240,00 2240,00 12240,00 Sum=37100 ( xi x ) 2 67600,00 44100,00 25600,00 3600,00 100,00 100,00 100,00 57600,00 19600,00 115600,0 Sum=334000 ( yi y) 2 841,00 576,00 361,00 196,00 16,00 36,00 36,00 676,00 256,00 1296,00 Sum=4290 = +0.98 The relationship between income and expenses for x-mas presents is a positive with a magnitude of +0.98. In other words: The more somebody earns, the more he/she spends for x-mas presents. 57 SPSS-Output Pearson´s r: Correlations monthly income in USD expens es for x-mas presents in USD Pearson Correlation Sig. (2-tailed) N Pearson Correlation Sig. (2-tailed) N monthly income in USD 1 . 10 ,980** ,000 10 expens es for x-mas presents in USD ,980** ,000 10 1 . 10 **. Correlation is s ignificant at the 0.01 level (2-tailed). This output shows, that there is a significant (sig=0.000 ≤ 0.05) relationship between the two variables. The level of significance can be computed like the significance within the Spearman rank order coefficient via z. H 0 can be rejected. The correlation coefficient +0.98 indicates a very strong positive relationship between income and expenses for x-mas presents. Again: The more somebody earns, the more he/she spends for x-mas presents. 58 Regression-Analysis Sometimes it might be interesting in statistics to investigate the question: How much does one quantity affect another quantity ? How can one variable be explained through her relationship to another variable ? Within the regression analysis we try to find a mathematical approach or mathematical function (=equation) to predict the value of an dependent variable if we know the value of the independent variable. Usually a first step into regression analysis is to find out if there is any relationship between two variables at all, using the Pearson correlation coefficient. If so, a diagram (e.g. a scatterplot) would also show this kind of relationship. Knowing more about that relationship, we would be able to predict – refering to our data (income and expenses for x-mas presents) – how much one will spend for x-mas presents if we´d know his/her income. This form of relationship will be represented by a straight line in the diagram. Such a line is called a regression line, and this type of analysis is called regression analysis. We need to know two numbers to summarize the line: the slope and the vertical intercept. We also would like some way to measure whether the line fits the data very well or rather poorly. This could be expressed by measure called R2 (ranges from 0 to 1). The closer to 1 the better the fit, or the more we are able to predict the value of the dependent variable with our applied model. 59 At this point we introduce the simple regression, which shows the relationship between one dependent and one independent variable. The relation between the two variables is expressed by a linear function: ŷi = α + βxi , where β is the slope and α is the vertical intercept. 
Regression analysis

Sometimes it is interesting in statistics to investigate the questions: How much does one quantity affect another quantity? How can one variable be explained through its relationship to another variable? Within regression analysis we try to find a mathematical approach or mathematical function (= equation) to predict the value of a dependent variable if we know the value of the independent variable.

Usually a first step in regression analysis is to find out whether there is any relationship between the two variables at all, using the Pearson correlation coefficient. If so, a diagram (e.g. a scatterplot) will also show this kind of relationship. Knowing more about that relationship, we are able to predict – referring to our data (income and expenses for x-mas presents) – how much somebody will spend on x-mas presents if we know his/her income. This form of relationship is represented by a straight line in the diagram; such a line is called a regression line, and this type of analysis is called regression analysis. We need two numbers to summarize the line: the slope and the vertical intercept. We would also like some way to measure whether the line fits the data well or rather poorly; this is expressed by a measure called R² (ranging from 0 to 1). The closer to 1, the better the fit, i.e. the better we are able to predict the value of the dependent variable with our model.

At this point we introduce simple regression, which shows the relationship between one dependent and one independent variable. The relation between the two variables is expressed by the linear function

ŷ_i = α + β x_i

where β is the slope and α is the vertical intercept. β is also known as the regression coefficient. Its sign expresses the direction of the relationship:
- positive: the more, the more / the less, the less
- negative: the more, the less / the less, the more
- ~0: no pattern, no relationship.

The formula to compute β is

β = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2}

Knowing β, the vertical intercept is α = ȳ - β x̄.

Within the following example the regression analysis is demonstrated; the variables of our last example are used (income, expenses for x-mas presents):

n_i    income (x)   expenses (y)   y²         x²         x_i · y_i
1      1000.00      75.00          5625.00    1000000    75000.00
2      1050.00      80.00          6400.00    1102500    84000.00
3      1100.00      85.00          7225.00    1210000    93500.00
4      1200.00      90.00          8100.00    1440000    108000.00
5      1250.00      100.00         10000.00   1562500    125000.00
6      1250.00      110.00         12100.00   1562500    137500.00
7      1250.00      110.00         12100.00   1562500    137500.00
8      1500.00      130.00         16900.00   2250000    195000.00
9      1400.00      120.00         14400.00   1960000    168000.00
10     1600.00      140.00         19600.00   2560000    224000.00
mean   1260         104            11245      1621000    134750
sum    12600        1040           112450     16210000   1347500

β = \frac{1347500 - \frac{1}{10} \cdot 1040 \cdot 12600}{16210000 - \frac{1}{10} \cdot 12600^2} = 0.1110778

α = 104 - 0.1110778 · 1260 = -35.958

The regression model for our data is ŷ_i = -35.958 + 0.1110778 · x_i.

Remember: the relationship between the two variables has to be linear, otherwise it makes no sense to compute the above model; so you had better check the data via a simple correlation diagram, which roughly shows the relationship.

[Correlation diagram of the variables income / expenses for x-mas presents: scatterplot "income vs. expenses", expenses for x-mas presents (70–150 USD) plotted against monthly income (900–1700 USD).]

As we can see from the diagram (scatterplot), there is a linear relationship between the two variables. The regression line that should be applied to our data fits best when the distance between the line and the single pairs of observations is minimal. The method for finding that line is called the least-squares method (calculated via the Euclidean distance). The differences between the observed and the fitted (estimated) values are called residuals:

\hat{\epsilon}_i = y_i - \hat{y}_i

The values of the residuals should be as small as possible.

If we apply the above regression model ŷ_i = -35.958 + 0.1110778 · x_i to the question "How much will somebody spend on x-mas presents if his monthly income is 1250 USD?", the answer is

ŷ = 0.1110778 · 1250 - 35.958 = 102.8892 USD

We can expect a person who earns 1250 USD to spend 102.89 USD on x-mas presents. The difference between the actually observed expenses of 100 USD for an income of 1250 in the data set and the estimated expenses of 102.89 is -2.8892; this is called the residual for the expenses value of 100 USD. For our data set, the estimated values for y and the residuals are:

income (x)   expenses (y)   ŷ          residual
1000.00      75.00          75.11976   -0.11976
1050.00      80.00          80.67365   -0.67365
1100.00      85.00          86.22754   -1.22754
1200.00      90.00          97.33533   -7.33533
1250.00      100.00         102.8892   -2.88922
1250.00      110.00         102.8892   7.11078
1250.00      110.00         102.8892   7.11078
1500.00      130.00         130.6587   -0.65868
1400.00      120.00         119.5509   0.44910
1600.00      140.00         141.7665   -1.76647
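A minimal scipy cross-check of the fit, reproducing slope, intercept, R² and the residuals:

```python
# Least-squares fit of expenses on income - a minimal sketch
# using scipy.stats.linregress.
import numpy as np
from scipy.stats import linregress

income   = np.array([1000, 1050, 1100, 1200, 1250,
                     1250, 1250, 1500, 1400, 1600])
expenses = np.array([  75,   80,   85,   90,  100,
                      110,  110,  130,  120,  140])

fit = linregress(income, expenses)
print(fit.slope, fit.intercept)            # ~0.1110778, ~-35.958
print(fit.rvalue ** 2)                     # R^2 ~0.961
print(fit.intercept + fit.slope * 1250)    # predicted ~102.89 USD
print(expenses - (fit.intercept + fit.slope * income))  # residuals
```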
Partial SPSS output – regression analysis:

Model summary: R = 0.980, R² = 0.961, adjusted R² = 0.956, std. error of the estimate = 4.59636 (predictors: (constant), monthly income in USD).

Coefficients (dependent variable: expenses for x-mas presents in USD):
(constant): B = -35.958, std. error = 10.126, t = -3.551, sig. = 0.007
monthly income in USD: B = 0.111, std. error = 0.008, standardized beta = 0.980, t = 13.966, sig. = 0.000

[Linear regression plot: expenses for x-mas presents in USD = -35.96 + 0.11 · income, R² = 0.96; expenses (80–140 USD) plotted against monthly income (1000–1600 USD), with the fitted line.]

Multiple regression: Multiple regression is used if you analyse one dependent variable and several independent variables (a minimum of two). The relation between three variables is expressed by the following equation:

ŷ_i = α + β_1 x_1 + β_2 x_2

where the β_j are the slopes and α is the vertical intercept; β is again known as the regression coefficient, and its sign expresses the direction of the relationship:
- positive: the more, the more / the less, the less
- negative: the more, the less / the less, the more
- ~0: no pattern, no relationship.

The x-variables (independent variables) should not be correlated with each other, and the variances should be equal or at least similar.

Logistic regression: see example during the lecture!
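To close, a minimal sketch of fitting the two-predictor model shown in the multiple-regression section above via ordinary least squares; the lecture gives no multiple-regression data set, so the values below are invented purely for illustration:

```python
# Multiple regression via ordinary least squares - a minimal sketch
# with numpy.linalg.lstsq; x1, x2 and y are invented example data.
import numpy as np

x1 = np.array([1000, 1100, 1200, 1300, 1400, 1500], dtype=float)
x2 = np.array([  20,   35,   25,   40,   30,   45], dtype=float)
y  = np.array([  80,   95,  100,  120,  115,  135], dtype=float)

# design matrix with a column of ones for the intercept alpha
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha, beta1, beta2 = coef
print(alpha, beta1, beta2)   # fitted y = alpha + beta1*x1 + beta2*x2
```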