Data Preparation

Steps in Data Preparation
• Editing
• Coding
• Entering data
• Data tabulation
• Reviewing tabulations
• Statistically adjusting the data

Editing
Carefully checking survey data for:
• Completeness (no omissions)
• Legibility (no ambiguity)
• Right informant
• Consistency, e.g. a respondent reports charging a purchase but does not own a charge card
• Accuracy
The most important purpose of editing is to eliminate, or at least reduce, the number of errors in the raw data.

Solutions
1. Ideally, re-interview the respondent.
2. Eliminate all unacceptable surveys (casewise deletion), if the sample is large and few surveys are unacceptable.
3. In each calculation, use only the cases with complete responses on the variables involved (pairwise deletion); some statistics will then be based on different sample sizes.
4. Code illegible or missing answers into a "no valid response" category.
5. Substitute a neutral value, typically the mean response to the variable, so that the mean remains unchanged.

Coding
• The process of systematically and consistently assigning each response a numerical score.
• The key to a good coding system is for the coding categories to be mutually exclusive and the entire system to be collectively exhaustive.
• To be mutually exclusive, every response must fit into only one category.
• To be collectively exhaustive, all possible responses must fit into one of the categories.
• Exhaustive means that you have covered the entire range of the variable with your measurement.
• Coding missing numbers: needed when respondents fail to complete portions of the survey.
  – Whatever the reason for incomplete surveys, you must indicate that the respondent provided no response.
  – For single-digit responses code as "9"; for two-digit responses code as "99".
• Coding open-ended questions: when open-ended questions are used, you must create categories.
  – All responses must fit into a category; similar responses should fall into the same category.
    e.g. Who services your car?
    ______________
    Possible categories: self, garage, husband, wife, friend, relative, etc.
• To make the categories collectively exhaustive, add an "other" or "none of the above" category.
  – Only a few responses (i.e. < 10%) should fit into this category.

Precoded Questionnaires
Sometimes you can place codes on the actual questionnaire, which simplifies data entry.

This…
Are you:   Male   Female
How satisfied are you with our product?
___ Very Satisfied
___ Somewhat Satisfied
___ Somewhat Dissatisfied
___ Very Dissatisfied
___ No opinion

Becomes this…
Are you:   (1) Male   (2) Female
How satisfied are you with our product?
_1_ Very Satisfied
_2_ Somewhat Satisfied
_3_ Somewhat Dissatisfied
_4_ Very Dissatisfied
_5_ No opinion

Sample questionnaire
1. Are you solely responsible for taking care of your automotive service needs?   ___ Yes   ___ No
2. If no, who performs the simple maintenance? ___________
3. If scheduled maintenance is done on your automobile, how do you keep track of what has been done?
   • Not tracked
   • Auto dealer records
   • Mental recollection
   • Other
4. How often is your automobile serviced?
   • Once per month
   • Once every three months
   • Once every six months
   • Once per year
   • Other _______________

Code Book
Col. no.   Question no.   Question description          Range of permissible values
1-3        N/A            ID #                          001-200
4          1              Responsible for maintenance   0 = no, 1 = yes, 9 = blank
5          2              Performs simple maintenance   0 = husband, 1 = boyfriend, 2 = father, 3 = mother, 4 = relative, 5 = friend, 6 = other, 9 = blank
6          3              How maintenance is tracked    0 = not tracked, 1 = auto dealer records, 2 = personal records, 3 = mental recollection, 4 = other, 9 = blank
7          4              How often maintenance is performed   0 = once per month, 1 = once every 3 months, 2 = once every 6 months, 3 = once per year, 4 = other, 9 = blank
8          4              "Other" response for how often

In questions that permit multiple responses, each possible response option should be assigned a separate column.

6. Which magazines do you read? Choose all that apply.
   • Time
   • National Geographic
   • Reader's Digest
   • Chatelaine
   • Maclean's

Col. no.   Question no.   Question description   Range of permissible values
15         6              Time                   0 = read, 1 = not read
16         6              Reader's Digest        0 = read, 1 = not read
17         6              Maclean's              0 = read, 1 = not read
18         6              National Geographic    0 = read, 1 = not read
19         6              Chatelaine             0 = read, 1 = not read

For rank-order questions, separate columns are also needed.

7. Please rank the following brands of toothpaste in order of preference (1-5).
   • Crest
   • Colgate
   • Aquafresh
   • Arm & Hammer
   • Pepsodent

Col. no.   Question no.   Question description
20         7              Crest rank
21         7              Colgate rank
22         7              Aquafresh rank
23         7              Arm & Hammer rank
24         7              Pepsodent rank
Range of permissible values for each rank column: 0 = blank, 1 = most important, 2 = second most important, 3 = third, 4 = fourth, 5 = fifth

Preparing the Data for Analysis

Variable Re-specification
• Existing data are modified to create new variables.
• A large number of variables is collapsed into fewer variables.
• E.g. if 10 reasons for purchasing a car are given, they might be collapsed into four categories: performance, price, appearance, and service.
• Creates variables that are consistent with the research questions.

Entering Data
• Problems can occur during data entry, such as transposing numbers or entering an infeasible (e.g. out-of-range) code.
  – E.g. if a score is measured on a range of 1-5, then 0, 6, 7, and 8 are out of range (possibly due to a transcription error).
• Always check the data-entry work.

Descriptive Statistics

Five types of statistical analysis
• Descriptive: What are the characteristics of the respondents?
• Inferential: What are the characteristics of the population?
• Differences: Are two or more groups the same or different?
• Associative: Are two or more variables related in a systematic way?
• Predictive: Can we predict one variable if we know one or more other variables?

Descriptive Statistics
• Summarization of a collection of data in a clear and understandable way
• The most basic form of statistics
• Lays the foundation for all statistical knowledge

The tradeoff in descriptive statistics
• If you use fewer statistics to describe the distribution of a variable, you lose information but gain clarity.
• When should one use fewer statistics?
  – When reducing the number of statistics leaves more information per remaining statistic.
  – When the information you drop is unimportant to the research question.

Type of measurement                  Type of descriptive analysis
Nominal, two categories              Frequency table; proportion (percentage)
Nominal, more than two categories    Frequency table; category proportions (percentages); mode
Ordinal                              Rank order; median
Interval                             Arithmetic mean
Ratio                                Means

Data Tabulation
• Tabulation: the organized arrangement of data in a table format that is easy to read and understand.
  – Tabulate the data to count the number of responses to each question.
• Simple tabulation: tabulating the results of only one variable, which tells you how often each response was given.
• Frequency distribution: a summary of the number of times each value of a variable occurs, which can also be expressed in terms of percentages.

Frequency Tables
The arrangement of statistical data in a row-and-column format that exhibits the count of responses or observations for each category assigned to a variable. A frequency table helps answer questions such as:
• How many users of a certain brand can be called loyal?
• What percentage of the market are heavy users and light users?
• How many consumers are aware of a new product?
• What brand is the "Top of Mind" of the market?

More on relative frequency distributions
• Rules for relative frequency distributions:
  – Make sure each observation is in one and only one category.
  – Use categories of equal width.
  – Choose an appealing number of categories.
  – Provide labels.
  – Double-check your graph.
• Definitions:
  – A histogram is a relative frequency distribution of a quantitative variable.
  – A bar graph is a relative frequency distribution of a qualitative variable.

WebSurveyor Bar Chart
How did you find your last job?
Response                   Count   Percent
Networking                 643     55.2%
Print ad                   213     18.3%
Online recruitment site    179     15.4%
Placement firm             112     9.6%
Temporary agency           18      1.5%

How many times per week do you use mouthwash?   1__ 2__ 3__ 4__ 5__ 6__ 7__
Raw responses (27 respondents): 1 1 2 2 2 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 6 6 6 7 7
Times per week   Frequency
1                2
2                3
3                5
4                7
5                5
6                3
7                2
[Figure: histogram of mouthwash use per week]

Normal Distributions
[Figure: normal distribution curve]
• The curve is basically bell shaped and extends from $-\infty$ to $+\infty$.
• It is symmetric, with scores more concentrated in the middle (i.e. around the mean) than in the tails.
• Mean, median and mode coincide.
• Normal distributions differ in how spread out they are.
• The area under each curve is 1.
• The height of a normal distribution can be specified mathematically in terms of two parameters: the mean ($\mu$) and the standard deviation ($\sigma$).

Skewed Distributions
Occur when one tail of the distribution is longer than the other.

Positive Skew
• Distributions have a long tail in the positive direction.
• Sometimes called "skewed to the right".
• More common than distributions with negative skews.
• E.g. the distribution of income: most people make under $40,000 a year, but some make quite a bit more, with a small number making many millions of dollars per year. The positive tail therefore extends out quite a long way, whereas the negative tail stops at zero.

Negative Skew
• Distributions have a long tail in the negative direction.
• Called "skewed to the left".

Kurtosis
• How peaked a distribution is. A value of zero indicates a normal distribution, positive numbers indicate a more peaked distribution, and negative numbers indicate a flatter distribution.
[Figure: a peaked distribution and a flat distribution]
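As a rough illustration of the frequency table, skewness and kurtosis ideas above, the mouthwash responses can be tabulated in a few lines of Python. This is a minimal sketch, not a prescribed procedure: the function names are made up for this example, and skewness and kurtosis are computed from simple population moments (dividing by n), whereas statistical packages often apply small-sample corrections and may report slightly different values.

```python
from collections import Counter

# Mouthwash uses per week for the 27 respondents listed in the example.
responses = [int(ch) for ch in "112223333344444445555566677"]


def frequency_table(values):
    """Return {value: (count, percent)} ordered by value."""
    counts = Counter(values)
    n = len(values)
    return {v: (c, 100.0 * c / n) for v, c in sorted(counts.items())}


def central_moment(values, k):
    """k-th central moment, dividing by n (population form)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


table = frequency_table(responses)
for value, (count, pct) in table.items():
    print(f"{value} uses/week: {count:2d} respondents ({pct:4.1f}%)")
print("skewness:", round(skewness(responses), 3))
print("excess kurtosis:", round(excess_kurtosis(responses), 3))
```

Because these responses are symmetric around 4, the skewness comes out at zero, and the negative excess kurtosis (about -0.63) says the distribution is somewhat flatter than a normal curve.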
Summary statistics
• Central tendency
• Dispersion or variability: a quantitative measure of the degree to which scores in a distribution are spread out or clustered together

Descriptive Analysis: Measures of Central Tendency
• Mode: the value that occurs most often (nominal data)
• Median: half of the responses fall above this point, half fall below this point (ordinal data)
• Mean: the average (interval/ratio data)

Mode
The most frequent category, e.g. users 25%, non-users 75%: the mode is "non-users".
Advantages:
• Meaning is obvious.
• The only measure of central tendency that can be used with nominal data.
Disadvantages:
• Many distributions have more than one mode, i.e. are "multimodal".
• Greatly subject to sample fluctuations.
• Therefore not recommended as the only measure of central tendency.

Median
The middle observation of the data.
Number of times per week consumers use mouthwash: 1 1 2 2 2 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 6 6 6 7 7
With 27 observations, the median is the 14th value, which is 4.
[Figure: frequency distribution of mouthwash use per week, from light user to heavy user, with the mode, median and mean marked]

The Mean (average value)
• The sum of all the scores divided by the number of scores.
• A good measure of central tendency for roughly symmetric distributions.
• Can be misleading in skewed distributions, since it can be greatly influenced by extreme scores; in that case other statistics such as the median may be more informative.
• Formula: $\mu = \sum X / N$ (population), $\bar{X} = \sum x_i / n$ (sample), where $\mu$ / $\bar{X}$ is the population/sample mean and $N$ / $n$ is the number of scores.
[Figure: normal distributions with different means]

Measures of Dispersion or Variability
• Minimum, maximum, and range (highest value minus the lowest value)
• Variance
• Standard deviation (a typical distance of the observations from the mean)

Distribution of Final Course Grades in MGMT 3220Y
Grade       F    D    C    B    A
Frequency   3    10   20   23   12
[Figure: bar chart of the grade frequencies, annotated with -1 SD, +1 SD and the range]

Variance
• The difference between an observed value and the mean is called the deviation from the mean.
• The variance is the mean squared deviation from the mean.
• That is, you subtract each value from the mean, square each result, and then take the average: $\sigma^2 = \sum (\bar{x} - x_i)^2 / n$
• Because each deviation is squared, the variance can never be negative.

Standard Deviation
• The standard deviation is the square root of the variance: $S = \sqrt{\sum (\bar{x} - x_i)^2 / n}$
• Thus the standard deviation is expressed in the same units as the variable.
• It helps us understand how clustered or spread out the distribution is around the mean value.

Measures of Dispersion
Suppose we are testing the new flavor of a fruit punch. Six respondents rate it on a scale from 1 (dislike) to 5 (like). In each example, $\sigma^2 = \sum (\bar{x} - x_i)^2 / n$ and $S = \sqrt{\sum (\bar{x} - x_i)^2 / n}$.

Example 1
Data: 5, 3, 5, 3, 3, 5
$\bar{X} = 4$, $\sigma^2 = 1$, $S = 1$

Example 2
Data: 5, 4, 5, 5, 5, 4
$\bar{X} \approx 4.67$, $\sigma^2 \approx 0.22$, $S \approx 0.47$ (dividing by $n - 1$ instead of $n$ gives $0.27$ and $0.52$)

Example 3
Data: 1, 1, 5, 1, 5, 5
$\bar{X} = 3$, $\sigma^2 = 4$, $S = 2$

[Figure: normal distributions with different standard deviations]

Cross Tabulation
• A statistical technique that involves tabulating the results of two or more variables simultaneously
• Tells you how often each combination of responses was given
• Shows relationships among and between variables
• The frequency distribution for each subgroup is compared to the frequency distribution for the total sample
• Variables must be nominally scaled

Cross-tabulation
• Helps answer questions about whether two or more variables of interest are linked:
  – Is the type of mouthwash user (heavy or light) related to gender?
  – Is the preference for a certain flavor (cherry or lemon) related to the geographic region (north, south, east, west)?
  – Is income level associated with gender?
• Cross-tabulation determines association, not causality.

Dependent and Independent Variables
• The variable being studied is called the dependent variable or response variable.
• A variable that influences the dependent variable is called an independent variable.
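The variance and standard deviation formulas from the Measures of Dispersion examples can be checked with a short Python sketch. The three lists are the fruit punch ratings from those examples (three 3s and three 5s; four 5s and two 4s; three 1s and three 5s). The helper names and the `sample` flag are assumptions made for this illustration: `sample=False` divides by n as in the formulas in the text, and `sample=True` divides by n - 1, the usual sample estimate.

```python
import math


def mean(values):
    """Arithmetic mean: sum of the scores divided by their number."""
    return sum(values) / len(values)


def variance(values, sample=False):
    """Mean squared deviation from the mean.

    sample=False divides by n (the formula used in the text);
    sample=True divides by n - 1 (the usual sample estimate).
    """
    m = mean(values)
    squared = sum((x - m) ** 2 for x in values)
    return squared / (len(values) - 1 if sample else len(values))


def std_dev(values, sample=False):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(values, sample))


# Fruit punch ratings on a 1 (dislike) to 5 (like) scale.
samples = {
    "example 1": [5, 3, 5, 3, 3, 5],
    "example 2": [5, 4, 5, 5, 5, 4],
    "example 3": [1, 1, 5, 1, 5, 5],
}

for label, ratings in samples.items():
    print(
        f"{label}: mean={mean(ratings):.2f} "
        f"var(n)={variance(ratings):.2f} sd(n)={std_dev(ratings):.2f} "
        f"sd(n-1)={std_dev(ratings, sample=True):.2f}"
    )
```

The three samples have similar or identical ranges of possible answers but very different spreads: ratings that agree closely (example 2) give a small standard deviation, while ratings split between the extremes (example 3) give a large one.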
Cross-tabulation
• Cross-tabulation of two or more variables is possible if the variables are discrete:
  – The frequency of one variable is subdivided by the categories of the other variable.
• Generally a cross-tabulation table has:
  – Row percentages
  – Column percentages
  – Total percentages
• Which one is better? It depends on which variable is treated as the independent variable.

Contingency Table
• A contingency table shows the joint distribution of two discrete variables.
• This distribution represents the probability of observing a case in each cell.
  – Probability is calculated as: $P = \text{observed cases} / \text{total cases}$

Cross tabulation
GROUPINC * Gender Crosstabulation

GROUPINC             Statistic             Female    Male      Total
income <= 5          Count                 10        9         19
                     % within GROUPINC     52.6%     47.4%     100.0%
                     % within Gender       55.6%     18.8%     28.8%
                     % of Total            15.2%     13.6%     28.8%
5 < income <= 10     Count                 5         25        30
                     % within GROUPINC     16.7%     83.3%     100.0%
                     % within Gender       27.8%     52.1%     45.5%
                     % of Total            7.6%      37.9%     45.5%
income > 10          Count                 3         14        17
                     % within GROUPINC     17.6%     82.4%     100.0%
                     % within Gender       16.7%     29.2%     25.8%
                     % of Total            4.5%      21.2%     25.8%
Total                Count                 18        48        66
                     % within GROUPINC     27.3%     72.7%     100.0%
                     % within Gender       100.0%    100.0%    100.0%
                     % of Total            27.3%     72.7%     100.0%

Chi-square Test for Independence
• The chi-square test for independence determines whether two variables are associated or not.
  H0: The two variables are independent
  H1: The two variables are not independent
• Chi-square test results are unstable if the expected count in any cell is lower than 5.
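To make the row, column and total percentages and the chi-square test concrete, the sketch below recomputes them from the counts in the GROUPINC * Gender table using only the Python standard library. The variable and function names are illustrative assumptions. The p-value line relies on the fact that a chi-square distribution with exactly 2 degrees of freedom has upper-tail probability exp(-x/2); tables with other dimensions need a general chi-square routine from a statistics package instead.

```python
import math

# Observed counts from the GROUPINC * Gender table.
# Rows are income groups, columns are (Female, Male).
observed = {
    "income <= 5": (10, 9),
    "5 < income <= 10": (5, 25),
    "income > 10": (3, 14),
}

row_totals = {group: sum(counts) for group, counts in observed.items()}
col_totals = [sum(counts[j] for counts in observed.values()) for j in range(2)]
n = sum(row_totals.values())


def percentages(group, j):
    """Return (row %, column %, total %) for one cell of the table."""
    count = observed[group][j]
    return (
        100.0 * count / row_totals[group],
        100.0 * count / col_totals[j],
        100.0 * count / n,
    )


# Expected counts under H0 (independence): row total * column total / n.
expected = {
    group: tuple(row_totals[group] * col_totals[j] / n for j in range(2))
    for group in observed
}

chi2 = sum(
    (observed[g][j] - expected[g][j]) ** 2 / expected[g][j]
    for g in observed
    for j in range(2)
)
df = (len(observed) - 1) * (2 - 1)

# Closed form valid only because df == 2 for this 3 x 2 table.
p_value = math.exp(-chi2 / 2)

smallest_expected = min(min(cells) for cells in expected.values())

print("row %, column %, total % for low-income females:",
      [round(p, 1) for p in percentages("income <= 5", 0)])
print(f"chi-square = {chi2:.2f}, df = {df}, p = {p_value:.4f}")
print(f"smallest expected count = {smallest_expected:.2f}")
```

For these counts the statistic is about 8.66 with 2 degrees of freedom (p of roughly 0.013), which would reject independence at the 5% level. The smallest expected count is about 4.6, just under 5, so by the rule of thumb above the result should be read with some caution.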