Data Preparation


Steps in Data Preparation

Editing

Coding

Entering Data

Data Tabulation

Reviewing Tabulations

Statistically adjusting the data (e.g. weighting)

Editing

Carefully checking survey data for

Completeness (no omissions)

Unambiguity (e.g. no two boxes checked instead of one)

Right informant (e.g. no under-age respondents when all are supposed to be over 18)

Consistency (e.g. a respondent reports charging something but does not own a charge card)

Accuracy (e.g. no numbers out of range)

Most important purpose is to eliminate or at least reduce the number of errors in the raw data.

Solutions

1. Ideally re-interview respondent

2. Eliminate all unacceptable surveys (casewise deletion), if the sample is large and few surveys are unacceptable

3. In each calculation, use only the cases with complete responses on the variables involved (pairwise deletion)

(means that some statistics will be based on different sample sizes)

4. Code illegible or missing answers into a “no valid response” category

5. Substitute a neutral value, typically the mean response to the variable, so that the mean remains unchanged
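The deletion and substitution options above can be sketched in a few lines of Python. This is only an illustration: the four responses, the variable names `satisfaction` and `usage`, and the use of `None` for a missing answer are assumptions made for the example, not part of the original survey.

```python
from statistics import mean

# Hypothetical responses; None marks a missing answer.
responses = [
    {"id": 1, "satisfaction": 4, "usage": 3},
    {"id": 2, "satisfaction": None, "usage": 5},
    {"id": 3, "satisfaction": 2, "usage": None},
    {"id": 4, "satisfaction": 5, "usage": 4},
]
variables = ["satisfaction", "usage"]

# Casewise deletion: drop every survey that has any missing answer.
casewise = [r for r in responses if all(r[v] is not None for v in variables)]

# Pairwise deletion: each statistic uses whatever cases are available for it.
def available(var):
    return [r[var] for r in responses if r[var] is not None]

pairwise_means = {v: mean(available(v)) for v in variables}
pairwise_n = {v: len(available(v)) for v in variables}

# Mean substitution: fill each gap with the mean of the observed answers.
imputed = [
    {**r, **{v: r[v] if r[v] is not None else pairwise_means[v] for v in variables}}
    for r in responses
]
imputed_means = {v: mean(r[v] for r in imputed) for v in variables}

print(len(casewise), pairwise_n, pairwise_means, imputed_means)
```

Casewise deletion keeps only two of the four surveys here, while pairwise deletion keeps three answers per variable. Mean substitution leaves each variable's mean where it was, as the list above notes.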

Coding

• The process of systematically and consistently assigning each response a numerical score.

The key to a good coding system is for the coding categories to be mutually exclusive and the entire system to be collectively exhaustive.

• To be mutually exclusive, every response must fit into only one category.

To be collectively exhaustive, all possible responses must fit into one of the categories.

Exhaustive means that you have covered the entire range of the variable with your measurement.

Coding

Coding Missing Numbers: When respondents fail to complete portions of the survey.

– Whatever the reason for incomplete surveys, you must indicate that there was no response provided by the respondent.

– For single digit responses code as “9”, 2 digit code as “99”

Coding Open-Ended Questions: When open-ended questions are used, you must create categories.

All responses must fit into a category

– similar responses should fall into the same category.

e.g. Who services your car? ______________

Possible categories: self, garage, husband, wife, friend, relative etc.

• To make it collectively exhaustive, add an “other” or “none of the above” category

Only a few responses, i.e. < 10%, should fit into this category

Precoded Questionnaires: Sometimes you can place codes on the actual questionnaire, which simplifies data entry.

This…

Are you: Male Female

How satisfied are you with our product?

___Very Satisfied

___Somewhat Satisfied

___Somewhat Dissatisfied

___Very Dissatisfied

___No opinion

Becomes this…

Are you: (1) Male (2) Female

How satisfied are you with our product?

_ 1 __Very Satisfied

_ 2 __Somewhat Satisfied

_ 3 __Somewhat Dissatisfied

_ 4 __Very Dissatisfied

_ 5 __No opinion

1. Are you solely responsible for taking care of your automotive service needs? ___ Yes ___ No

2. If No, who performs the simple maintenance? ___________

3. If scheduled maintenance is done on your automobile, how do you keep track of what has been done?

4. How often is your automobile serviced?

Code Book

Col. No. | Question No. | Question Des. | Range of permissible values
1-3 | N/A | ID # | 001-200 (this also means the surveys themselves should be numbered)
4 | 1 | Responsible for maintenance | 0=no, 1=yes, 9=blank
5 | 2 | Who performs simple maintenance | 0=husband, 1=boyfriend, 2=father, 3=mother, 4=relative, 5=friend, 6=other, 9=blank
5 | 3 | How maintenance tracked | 0=not tracked, 1=auto dealer records, 2=personal records, 3=mental recollection, 4=other, 9=blank
6 | 4 | How often maintenance performed | 1=once per month, 2=3 months, 3=6 months, 4=year, 5=other, 9=blank
7 | 4 | Other for how often |

In questions that permit multiple responses, each possible response option should be assigned a separate column

6. Which magazines do you read? Choose all that apply.

Col. No. | Question No. | Question Des. | Range of permissible values
15 | 6 | Time | 0=read, 1=not read
16 | 6 | Readers Dig. | 0=read, 1=not read
17 | 6 | MacLean's | 0=read, 1=not read
18 | 6 | National Geo. | 0=read, 1=not read
19 | 6 | Chatelaine | 0=read, 1=not read

For rank order questions, separate columns are also needed

7. Please rank the following brands of toothpaste in order of preference (1-5) with 1 being the most important

Col. No. | Question No. | Question Des. | Range of permissible values
20 | 7 | Crest rank | 0=blank, 1=most important, 2=2nd most important, 3=third, 4=fourth, 5=fifth
21 | 7 | Colgate rank | 0=blank, 1=most important, 2=2nd most important, 3=third, 4=fourth, 5=fifth
22 | 7 | A & H rank | 0=blank, 1=most important, 2=2nd most important, 3=third, 4=fourth, 5=fifth
23 | 7 | Aquafresh rank | 0=blank, 1=most important, 2=2nd most important, 3=third, 4=fourth, 5=fifth
25 | 7 | Pepsodent rank | 0=blank, 1=most important, 2=2nd most important, 3=third, 4=fourth, 5=fifth

Preparing the Data for Analysis

Variable Re-specification

Existing data modified to create new variables

Large number of variables collapsed into fewer variables

E.g. If 10 reasons for purchasing a car are given they might be collapsed into four categories e.g. performance, price, appearance, and service

Creates variables that are consistent with research questions
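Collapsing many reasons into a few categories amounts to a lookup from old codes to new ones. The sketch below assumes ten reason codes numbered 1 to 10 and an invented assignment of those codes to the four categories named above. The mapping and the sample answers are hypothetical and only show the mechanics.

```python
from collections import Counter

# Hypothetical mapping of ten original reason codes to four broader categories.
REASON_TO_CATEGORY = {
    1: "performance", 2: "performance", 3: "performance",
    4: "price", 5: "price",
    6: "appearance", 7: "appearance", 8: "appearance",
    9: "service", 10: "service",
}

# Hypothetical coded answers to "main reason for purchasing the car".
reasons = [1, 4, 4, 7, 9, 2, 10, 5]

# Re-specified variable: one of four categories per respondent.
collapsed = [REASON_TO_CATEGORY[r] for r in reasons]
category_counts = Counter(collapsed)
print(category_counts)
```

Keeping the mapping in one dictionary documents the re-specification, so the new variable can be traced back to the original codes.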

Entering Data

• Problems can occur during data entry, such as transposing numbers and inputting an infeasible code (e.g. out of range)

– E.g. if scores range from 1 to 5, then 0, 6, 7, and 8 are unacceptable or out of range (might be due to a transcription error)

• Always check the data-entry work.
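One way to check the data-entry work is to compare every entered code with the permissible values in the code book. The sketch below uses the permissible values from the automotive example. The variable names, the dictionary layout and the two entered records are assumptions made for illustration.

```python
# Permissible codes per variable, following the code book example
# (the variable names are hypothetical).
CODEBOOK = {
    "responsible": {0, 1, 9},                  # 0=no, 1=yes, 9=blank
    "who_performs": {0, 1, 2, 3, 4, 5, 6, 9},  # 0=husband ... 6=other, 9=blank
    "how_tracked": {0, 1, 2, 3, 4, 9},         # 0=not tracked ... 4=other, 9=blank
    "how_often": {1, 2, 3, 4, 5, 9},           # 1=once per month ... 5=other, 9=blank
}

def out_of_range(record):
    """Return the variables whose entered code is not allowed by the code book."""
    return [var for var, allowed in CODEBOOK.items() if record.get(var) not in allowed]

# Hypothetical entered records; the second contains a keying error (7).
records = [
    {"id": 1, "responsible": 1, "who_performs": 9, "how_tracked": 2, "how_often": 3},
    {"id": 2, "responsible": 0, "who_performs": 1, "how_tracked": 7, "how_often": 2},
]

problems = {r["id"]: out_of_range(r) for r in records if out_of_range(r)}
print(problems)
```

A check like this catches out-of-range codes only. A transposition that still produces a valid code has to be found by comparing the entry against the original questionnaire.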

Descriptive Statistics

Five types of statistical analysis

Type | Question answered
Descriptive | What are the characteristics of the respondents?
Inferential | What are the characteristics of the population?
Differences | Are two or more groups the same or different?
Associative | Are two or more variables related in a systematic way?
Predictive | Can we predict one variable if we know one or more other variables?

Descriptive Statistics

Summarization of a collection of data in a clear and understandable way

• the most basic form of statistics

• lays the foundation for all statistical knowledge

Measures of central tendency (mean, median, mode)

Measures of dispersion (range, standard deviation, and coefficient of variation)

Measures of shape (skewness and kurtosis)

The tradeoff in descriptive statistics

• If you use fewer statistics to describe the distribution of a variable, you lose information but gain clarity.

• When should one use fewer statistics?

– When dropping the number of statistics would leave more information per remaining statistic.

– When the information you drop is unimportant to one’s research question.

Type of measurement | Type of descriptive analysis
Nominal, two categories | Frequency table; proportion (percentage)
Nominal, more than two categories | Frequency table; category proportions (percentages); mode
Ordinal | Rank order; median
Interval | Arithmetic mean
Ratio | Means

Data Tabulation

Tabulation: The organized arrangement of data in a table format that is easy to read and understand.

– Tabulate the data to count the number of responses to each question.

Simple Tabulation: The tabulating of results of only one variable informs you how often each response was given.

Frequency Distribution: A distribution of data that summarizes the number of times a certain value of a variable occurs; it can also be expressed in terms of percentages.

Frequency Tables

The arrangement of statistical data in a row-and-column format that exhibits the count of responses or observations for each category assigned to a variable

How many of certain brand users can be called loyal?

• What percentage of the market are heavy users and light users?

• How many consumers are aware of a new product?

• What brand is the “Top of Mind” of the market?

More on relative frequency distributions

• Rules for relative frequency distributions:

Make sure each observation is in one and only one category.

Use categories of equal width.

– Choose an appealing number of categories.

– Provide labels

– Double-check your graph.

Definitions:

– A histogram is a relative frequency distribution of a quantitative variable

– A bar graph is a relative frequency distribution of a qualitative variable

WebSurveyor Bar Chart

How did you find your last job?

Answer | Responses | Percentage
Networking | 643 | 55.2 %
Print ad | 213 | 18.3 %
Online recruitment site | 179 | 15.4 %
Placement firm | 112 | 9.6 %
Temporary agency | 18 | 1.5 %

[Figure: horizontal bar chart of the responses, x-axis from 0 to 700]

How many times per week do you use mouthwash ?

1__ 2__ 3__ 4__ 5__ 6__ 7__

1 1 2 2 2 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 6 6 6 7 7

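Uses per week | Frequency
1 | 2
2 | 3
3 | 5
4 | 7
5 | 5
6 | 3
7 | 2

[Figure: histogram of mouthwash uses per week (x-axis: uses per week, 1 to 7; y-axis: frequency, 0 to 7)]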
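The tabulation of the mouthwash answers can be reproduced with a short script. This is a minimal sketch using only the standard library; the percentage column and the text bars are additions for illustration.

```python
from collections import Counter

# The 27 answers to "How many times per week do you use mouthwash?"
uses_per_week = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4,
                 5, 5, 5, 5, 5, 6, 6, 6, 7, 7]

counts = Counter(uses_per_week)   # value -> number of respondents
n = len(uses_per_week)

# One row per value: count, relative frequency, and a text bar.
for value in sorted(counts):
    share = 100 * counts[value] / n
    print(f"{value}  {counts[value]:>2}  {share:5.1f}%  {'#' * counts[value]}")
```

The text bars give a rough histogram of the same distribution, and the percentage column is the relative frequency distribution.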

Normal Distributions

The curve is bell shaped and extends from $-\infty$ to $+\infty$.

It is symmetric, with scores more concentrated in the middle (i.e. around the mean) than in the tails.

Mean, median and mode coincide.

Normal distributions differ in how spread out they are.

The area under each curve is 1.

The height of a normal distribution can be specified mathematically in terms of two parameters: the mean ($\mu$) and the standard deviation ($\sigma$).

Skewed Distributions

Occur when one tail of the distribution is longer than the other.

Positive Skew Distributions

• have a long tail in the positive direction

• sometimes called "skewed to the right"

• more common than distributions with negative skews

E.g. distribution of income. Most people make under $40,000 a year, but some make quite a bit more, with a small number making many millions of dollars per year.

The positive tail therefore extends out quite a long way, whereas the negative tail stops at zero.

Negative Skew Distributions

• have a long tail in the negative direction

• called "skewed to the left"

• Kurtosis: how peaked a distribution is. A value of zero indicates a normal distribution, positive values indicate a more peaked distribution, and negative values indicate a flatter distribution.

[Figure: a peaked distribution and a flat distribution]


Summary statistics

– central tendency

Dispersion or variability

A quantitative measure of the degree to which scores in a distribution are spread out or are clustered together.

Descriptive Analysis: Measures of Central Tendency

Mode: the value that occurs most often (nominal data)

Median: half of the responses fall above this point, half fall below this point (ordinal data)

Mean: the average (interval/ratio data)

Mode

• the most frequent category, e.g. with users 25% and non-users 75%, the mode is non-users

Advantages:

• meaning is obvious

• the only measure of central tendency that can be used with nominal data.

Disadvantages

• many distributions have more than one mode, i.e. are "multimodal"

• greatly subject to sample fluctuations

• therefore not recommended to be used as the only measure of central tendency.

Median

• the middle observation of the data

E.g. number of times per week consumers use mouthwash:

1 1 2 2 2 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 6 6 6 7 7

[Figure: frequency distribution of mouthwash use per week, from light users to heavy users, with the mode, median and mean marked]
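For the mouthwash data, the three measures of central tendency can be computed directly with Python's standard `statistics` module. This is a small sketch for checking the figures, not part of the original slides.

```python
from statistics import mean, median, mode

# Number of times per week each of the 27 consumers uses mouthwash.
uses_per_week = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4,
                 5, 5, 5, 5, 5, 6, 6, 6, 7, 7]

data_mode = mode(uses_per_week)      # most frequent value
data_median = median(uses_per_week)  # middle (14th) observation of the sorted data
data_mean = mean(uses_per_week)      # sum of the scores divided by their number

print(data_mode, data_median, data_mean)
```

Because this distribution is symmetric, all three measures land on the same value, 4 uses per week.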

The Mean (average value)

• sum of all the scores divided by the number of scores

• a good measure of central tendency for roughly symmetric distributions

• can be misleading in skewed distributions since it can be greatly influenced by extreme scores, in which case other statistics such as the median may be more informative

• formula:

$\mu = \sum X / N$ (population)

$\bar{x} = \sum x_i / n$ (sample)

where $\mu$ / $\bar{x}$ is the mean and $N$ / $n$ is the number of scores.

Normal Distributions with different Mean

[Figure: normal curves with different means]

Measures of Dispersion or Variability

• Minimum, Maximum, and Range (highest value minus the lowest value)

• Variance

• Standard Deviation (a measure of how far observations typically lie from the mean)

Distribution of Final Course Grades in MGMT 3220Y

Grade | Frequency
F | 3
D | 10
C | 20
B | 23
A | 12

[Figure: bar chart of frequency by grade, with the range and the points 1 SD below and above the mean marked]

Variance

• The difference between an observed value and the mean is called the deviation from the mean

• The variance is the mean squared deviation from the mean

• i.e. you subtract the mean from each value, square each result and then take the average.

$\sigma^2 = \sum (x_i - \bar{x})^2 / n$

• Because it is squared it can never be negative

Standard Deviation

The standard deviation is the square root of the variance

$S = \sqrt{\sum (x_i - \bar{x})^2 / n}$

Thus the standard deviation is expressed in the same units as the variable

Helps us to understand how clustered or spread out the distribution is around the mean value.

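Measures of Dispersion

Suppose we are testing the new flavor of a fruit punch. Six respondents rate it on a scale from 1 (Dislike) to 5 (Like):

Respondent | 1 | 2 | 3 | 4 | 5 | 6
Rating | 3 | 5 | 3 | 5 | 3 | 5

$\bar{x} = 4$

$\sigma^2 = 1$

$S = 1$

$\sigma^2 = \sum (x_i - \bar{x})^2 / n$

$S = \sqrt{\sum (x_i - \bar{x})^2 / n}$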

Measures of Dispersion

Ratings from six respondents (1 = Dislike, 5 = Like): 5, 5, 4, 5, 4, 5

$\bar{x} = 4.67$

$\sigma^2 = 0.22$

$S = 0.47$

Measures of Dispersion

Ratings from six respondents (1 = Dislike, 5 = Like): 5, 1, 5, 1, 5, 1

$\bar{x} = 3$

$\sigma^2 = 4$

$S = 2$
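The three fruit punch examples can be checked with the population formulas, that is, dividing by n as in the variance formula given earlier. The sketch below uses `pvariance` and `pstdev` from Python's standard `statistics` module; the labels for the three data sets are invented for the example.

```python
from statistics import mean, pstdev, pvariance

# Ratings from the three fruit punch examples (1 = dislike, 5 = like).
examples = {
    "moderate spread": [3, 5, 3, 5, 3, 5],
    "small spread": [5, 5, 4, 5, 4, 5],
    "large spread": [5, 1, 5, 1, 5, 1],
}

# Population variance and standard deviation: sum of squared deviations / n.
summary = {
    label: (mean(ratings), pvariance(ratings), pstdev(ratings))
    for label, ratings in examples.items()
}

for label, (m, var, sd) in summary.items():
    print(f"{label}: mean={m:.2f} variance={var:.2f} sd={sd:.2f}")
```

Dividing by n - 1 instead (`statistics.variance` and `statistics.stdev`) gives slightly larger values. For the second data set, for example, the variance is about 0.27 rather than about 0.22.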

Normal Distributions with different SD

[Figure: normal curves with different standard deviations]

Cross Tabulation

• A statistical technique that involves tabulating the results of two or more variables simultaneously

• informs you how often each combination of responses was given

• shows relationships among and between variables

• the frequency distribution for each subgroup is compared to the frequency distribution for the total sample

• the variables must be nominally scaled (or divided into categories)

Cross-tabulation

• Helps answer questions about whether two or more variables of interest are linked:

Is the type of mouthwash user (heavy or light) related to gender?

Is the preference for a certain flavor (cherry or lemon) related to the geographic region (north, south, east, west)?

Is income level associated with gender?

• Cross-tabulation determines association not causality.

Dependent and Independent Variables

• The variable being studied is called the dependent variable or response variable.

• A variable that influences the dependent variable is called an independent variable.

Cross-tabulation

Cross-tabulation of two or more variables is possible if the variables are discrete:

The frequency of one variable is subdivided by the other variable categories.

Generally a cross-tabulation table has:

Row percentages

– Column percentages

Total percentages

Which one is better?

It depends on which variable is considered independent: percentages are usually computed within the categories of the independent variable.

Contingency Table

• A contingency table shows the joint distribution of two discrete variables

• This distribution represents the probability of observing a case in each cell

– Probability is calculated as:

$P = \dfrac{\text{Observed cases}}{\text{Total cases}}$

Cross tabulation

GROUPINC * Gender Crosstabulation

GROUPINC | Statistic | Female | Male | Total
income <= 5 | Count | 10 | 9 | 19
income <= 5 | % within GROUPINC | 52.6% | 47.4% | 100.0%
income <= 5 | % within Gender | 55.6% | 18.8% | 28.8%
income <= 5 | % of Total | 15.2% | 13.6% | 28.8%
5 < income <= 10 | Count | 5 | 25 | 30
5 < income <= 10 | % within GROUPINC | 16.7% | 83.3% | 100.0%
5 < income <= 10 | % within Gender | 27.8% | 52.1% | 45.5%
5 < income <= 10 | % of Total | 7.6% | 37.9% | 45.5%
income > 10 | Count | 3 | 14 | 17
income > 10 | % within GROUPINC | 17.6% | 82.4% | 100.0%
income > 10 | % within Gender | 16.7% | 29.2% | 25.8%
income > 10 | % of Total | 4.5% | 21.2% | 25.8%
Total | Count | 18 | 48 | 66
Total | % within GROUPINC | 27.3% | 72.7% | 100.0%
Total | % within Gender | 100.0% | 100.0% | 100.0%
Total | % of Total | 27.3% | 72.7% | 100.0%
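The row, column and total percentages in a crosstabulation all come from the same cell counts, divided by different totals. The sketch below recomputes them for the income-by-gender counts. The dictionary layout and the category labels are assumptions made for the example; SPSS produces the same figures directly.

```python
# Cell counts from the income group by gender crosstabulation.
counts = {
    "income <= 5": {"Female": 10, "Male": 9},
    "5 < income <= 10": {"Female": 5, "Male": 25},
    "income > 10": {"Female": 3, "Male": 14},
}
genders = ("Female", "Male")

row_totals = {group: sum(cells.values()) for group, cells in counts.items()}
col_totals = {g: sum(cells[g] for cells in counts.values()) for g in genders}
n = sum(row_totals.values())

def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

# For every cell: count, % within income group, % within gender, % of total.
table = {
    (group, g): {
        "count": k,
        "row_pct": pct(k, row_totals[group]),
        "col_pct": pct(k, col_totals[g]),
        "total_pct": pct(k, n),
    }
    for group, cells in counts.items()
    for g, k in cells.items()
}

for cell, stats in table.items():
    print(cell, stats)
```

Reading along the row percentages compares the gender mix across income groups. Reading along the column percentages compares the income mix of women with that of men.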

General Procedure for Hypothesis Test

1. Formulate $H_0$ (null hypothesis) and $H_1$ (alternative hypothesis)

2. Select appropriate test

3. Choose level of significance

4. Calculate the test statistic (SPSS)

5. Determine the probability associated with the statistic.

• Determine the critical value of the test statistic.

6. a) Compare the probability with the level of significance, $\alpha$

b) Determine if the test statistic falls in the rejection region (check tables)

7. Reject or do not reject $H_0$

8. Draw a conclusion

1. Formulate $H_1$ and $H_0$

• The hypothesis the researcher wants to test is called the alternative hypothesis $H_1$.

• The opposite of the alternative hypothesis is the null hypothesis $H_0$ (the status quo: no difference between the sample and the population, or between samples).

• The objective is to DISPROVE the null hypothesis.

• The significance level is the critical probability used to choose between the null hypothesis and the alternative hypothesis.

2. Select Appropriate Test

The selection of a proper test depends on:

– Scale of the data

• nominal

• interval

– The statistic you seek to compare

• proportions (percentages)

• means

– The sampling distribution of such statistic

• Normal distribution

• t distribution

• $\chi^2$ distribution

– Number of variables

• univariate

• bivariate

• multivariate

– Type of question to be answered

Example

A tire manufacturer believes that men are more aware of their brand. To find out, a survey is conducted of 100 customers, 65 of whom are men and 35 of whom are women.

The question they are asked is:

Are you aware of our brand: Yes or No. 50 of the men were aware and 15 were not whereas 10 of the women were aware and 25 were not.

Are these differences significant?

Observed counts

 | Men | Women
Aware | 50 | 10
Unaware | 15 | 25
Total | 65 | 35

Awareness of Tire Manufacturer’s Brand (observed / expected counts)

 | Men | Women | Total
Aware | 50/39 | 10/21 | 60
Unaware | 15/26 | 25/14 | 40
Total | 65 | 35 | 100

1. Formulate $H_1$ and $H_0$

We want to know whether brand awareness is associated with gender. What are the hypotheses?

$H_0$: There is no difference in brand awareness based on gender

$H_1$: There is a difference in brand awareness based on gender

2. Select Appropriate Test

$\chi^2$ (Chi Square)

Used to discover whether two or more groups defined by one variable (the independent variable) differ significantly from each other with respect to some other variable (the dependent variable).

Are the two variables of interest associated?

• Do men and women differ with respect to product usage (heavy, medium, or light)?

• Is the preference for a certain flavor (cherry or lemon) related to the geographic region (north, south, east, west)?

$H_0$: The two variables are independent (not associated)

$H_1$: The two variables are not independent (associated)

The variables must be nominal level or, if interval or ratio, must be divided into categories.

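Estimated cell frequency

$E_{ij} = \dfrac{R_i C_j}{n}$

$R_i$ = total observed frequency in the i-th row

$C_j$ = total observed frequency in the j-th column

$n$ = sample size

$E_{ij}$ = estimated cell frequency

In the brand awareness example the row totals are 60 (aware) and 40 (unaware), and the column totals are 65 (men) and 35 (women), so the estimated frequencies are 60 × 65 / 100 = 39, 60 × 35 / 100 = 21, 40 × 65 / 100 = 26 and 40 × 35 / 100 = 14.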

3. Choose Level of Significance

Whenever we draw inferences about a population, there is a risk that an incorrect conclusion will be reached

The real question is how strong the evidence in favor of the alternative hypothesis must be to reject the null hypothesis.

The significance level states the probability of incorrectly rejecting $H_0$. This error is commonly known as a Type I error.

The value of $\alpha$ is called the significance level of the test.

In the example, a Type I error would be committed if we said that there is a difference between men and women with respect to brand awareness when in fact there was no difference.

The significance level selected is typically .05 or .01

• i.e. 5% or 1%

In other words, we are willing to accept the risk that, when there is in fact no difference between men and women with respect to brand awareness, 5% (or 1%) of the time the results we get indicate that there is a difference.

3. Choose Level of Significance

• We commit a Type II error when we incorrectly accept a null hypothesis that is false. The probability of committing a Type II error is denoted by $\beta$.

• In our example, we commit a Type II error when we say that there is NO difference between men and women with respect to brand awareness (we accept the null hypothesis) when in fact there is.

Type I and Type II Errors

 | Accept null | Reject null
Null is true | Correct (no error) | Type I error
Null is false | Type II error | Correct (no error)

Which is worse?

• Both are serious, but traditionally a Type I error has been considered more serious. That is why the objective of hypothesis testing is to reject $H_0$ only when there is enough evidence against it.

• Therefore, we choose $\alpha$ to be as small as possible without compromising $\beta$.

• Increasing the sample size for a given $\alpha$ will decrease $\beta$ (i.e. the probability of accepting the null hypothesis when it is in fact false).


Chi-Square Test

Estimated cell frequency: $E_{ij} = \dfrac{R_i C_j}{n}$, where $R_i$ is the total observed frequency in the i-th row, $C_j$ is the total observed frequency in the j-th column, and $n$ is the sample size.

Chi-square statistic

$\chi^2 = \sum_i \dfrac{(O_i - E_i)^2}{E_i}$

$\chi^2$ = chi-square statistic

$O_i$ = observed frequency in the i-th cell

$E_i$ = expected frequency in the i-th cell

Degrees of freedom

d.f. = $(R - 1)(C - 1)$

4. Calculate the Test Statistic

Chi-Square Test: Differences Among Groups

$\chi^2 = \dfrac{(50 - 39)^2}{39} + \dfrac{(10 - 21)^2}{21} + \dfrac{(15 - 26)^2}{26} + \dfrac{(25 - 14)^2}{14}$

$\chi^2 = 3.102 + 5.762 + 4.654 + 8.643 = 22.161$

d.f. = $(R - 1)(C - 1) = (2 - 1)(2 - 1) = 1$

Chi-square test results are unstable if the expected count in any cell is lower than 5
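The calculation for the brand awareness example can be reproduced step by step: expected counts from the row and column totals, then the chi-square statistic and its degrees of freedom. The sketch below uses plain Python and no statistics package; the list-of-lists layout is an assumption made for the example.

```python
# Observed counts: rows are aware / unaware, columns are men / women.
observed = [
    [50, 10],
    [15, 25],
]

row_totals = [sum(row) for row in observed]        # 60, 40
col_totals = [sum(col) for col in zip(*observed)]  # 65, 35
n = sum(row_totals)                                # 100

# Expected count in each cell: row total * column total / sample size.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum over all cells of (O - E)^2 / E.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(row_totals) - 1) * (len(col_totals) - 1)

print(expected, round(chi_square, 3), df)
```

Every expected count here is well above 5, so the rule of thumb about small cells is not a concern for this table.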

Degrees of Freedom

• the number of values in the final calculation of a statistic that are free to vary

• For example, to calculate the standard deviation of a random sample, we must first calculate the mean of that sample and then compute the sum of the squared deviations from that mean.

While there will be n such squared deviations, only (n - 1) of them are free to assume any value whatsoever.

This is because the final squared deviation from the mean must include the one value of X such that the sum of all the Xs divided by n will equal the obtained mean of the sample.

All of the other (n - 1) squared deviations from the mean can, theoretically, have any values whatsoever.

5. Determine the Probability-value (Critical Value)

• The p-value is the probability of seeing a random sample at least as extreme as the sample observed, given that the null hypothesis is true.

• Given the value of alpha, $\alpha$, we use statistical theory to determine the rejection region.

• If the sample falls into this region we reject the null hypothesis; otherwise, we do not reject it.

• Sample evidence that falls into the rejection region is called statistically significant at the alpha level.

Significance from p-values (continued)

• How small is a “small” p-value? This is largely a matter of semantics, but if the

– p-value is less than 0.01, it provides “convincing” evidence that the alternative hypothesis is true;

– p-value is between 0.01 and 0.05, there is “strong” evidence in favor of the alternative hypothesis;

– p-value is between 0.05 and 0.10, it is in a “gray area”;

– p-value is greater than 0.10, it is interpreted as weak or no evidence in support of the alternative.

5. Determine the Probability-value (Critical Value)

Chi-square Test for Independence

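Under $H_0$, the test statistic is approximately distributed as the chi-square distribution ($\chi^2$).

[Figure: chi-square distribution with the rejection region (“Reject $H_0$”) to the right of the critical value 3.84; the observed $\chi^2$ = 22.16 lies inside it]

$\chi^2$ with 1 d.f. at .05: critical value = 3.84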

6 a) Compare with the level of significance, $\alpha$

b) Determine if the test statistic falls in the rejection region (check tables)

22.16 is greater than 3.84 and falls in the rejection area

In fact it is significant at the .001 level, which means that, if our variables were independent, the chance of drawing a sample at least this extreme would be less than 1/1000

7 Reject or do not reject $H_0$

Since 22.16 is greater than 3.84 we reject the null hypothesis

8 Draw a conclusion

Men and women differ with respect to brand awareness; specifically, men are more brand aware than women
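The comparison with the critical value can also be expressed as a p-value. With one degree of freedom, a chi-square variable is the square of a standard normal variable, so its upper-tail probability can be written with the complementary error function from Python's standard `math` module. The helper name `chi2_upper_tail_1df` is made up for this sketch, and it is valid only for 1 d.f.

```python
import math

def chi2_upper_tail_1df(x):
    """P(chi-square with 1 d.f. > x), using chi-square(1) = Z**2 for standard normal Z."""
    return math.erfc(math.sqrt(x / 2))

alpha = 0.05
p_at_critical = chi2_upper_tail_1df(3.84)  # tail area beyond the tabled critical value
p_observed = chi2_upper_tail_1df(22.16)    # tail area beyond the observed statistic

reject_h0 = p_observed < alpha
print(round(p_at_critical, 3), p_observed, reject_h0)
```

The tail area beyond 3.84 comes out at about 0.05, which matches the tabled critical value, and the tail area beyond 22.16 is far below .001, so both routes lead to rejecting the null hypothesis.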
