Chapter 1

Introduction to Statistics

Section 1.1

Fundamental Statistical Concepts

Objectives

• Explain the purpose of statistics.

• Decide what tasks to complete before you analyze your data.

• Distinguish between populations and samples.

What Is Statistics?

[Figure: a column of raw HEIGHT values.]

Descriptive Statistics

[Figure: the HEIGHT values in sorted order, annotated with MIN, AVERAGE, and MAX.]

Inferential Statistics

[Figure: the sorted HEIGHT values, annotated with MIN, AVERAGE, and MAX.]

Defining the Problem

Before you begin any analysis, you should complete certain tasks.

1. Outline the purpose of the study.

2. Document the study questions.

3. Define the population of interest.

4. Determine the need for sampling.

5. Define the data collection protocol.

Cereal Example

[Figure: a box of Rise n Shine cereal, 15 ounces.]

Defining the Problem

The purpose of the study is to determine whether Rise n Shine cereal boxes contain 15 ounces of cereal.

The study question is whether the average amount of cereal in Rise n Shine boxes is equal to 15 ounces.

Population

[Figure: the population, drawn as many boxes of Rise n Shine cereal.]

Sample

[Figure: boxes of Rise n Shine cereal illustrating a sample.]

Simple Random Sampling

[Figure: boxes of Rise n Shine cereal illustrating simple random sampling.]

Convenience Sampling

[Figure: boxes of Rise n Shine cereal illustrating convenience sampling.]
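The difference between the two sampling schemes can be sketched in a few lines of Python. This is only an illustration: the population of 1,000 numbered boxes, the sample size of 25, and the seed are all assumptions, not values from the course.

```python
import random

# Hypothetical population: ID numbers for 1,000 cereal boxes (assumed size).
population_ids = list(range(1, 1001))

# A seeded generator makes the illustration reproducible.
rng = random.Random(2024)

# Simple random sample: every box is equally likely to be chosen,
# and no box is chosen twice.
simple_random_sample = rng.sample(population_ids, k=25)

# Convenience sample, for contrast: the first 25 boxes that are easy to reach.
convenience_sample = population_ids[:25]

print(sorted(simple_random_sample))
print(convenience_sample)
```

The convenience sample always comes from the same corner of the population, whereas the simple random sample can include any box.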

Parameters and Statistics

Statistics are used to approximate population parameters.

                        Mean         Variance      Standard Deviation
Population Parameters   $\mu$        $\sigma^2$    $\sigma$
Sample Statistics       $\bar{x}$    $s^2$         $s$
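For reference, the sample statistics named on this slide are conventionally computed from observations $x_1, \dots, x_n$ as shown below. The $n - 1$ divisor is the usual convention for the sample variance and is assumed here rather than stated on the slide.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
s = \sqrt{s^2}
```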

Levels of Measurement

The two levels of measurement of data used in this course are

• continuous

• discrete.

Describing Your Data

The goals when you are describing data are to

• screen for unusual data values

• inspect the spread and shape of continuous variables

• characterize the central tendency

• draw preliminary conclusions about your data.

Process of Data Analysis

[Diagram: Population → Random Sample → Describe → Sample Statistics → Make Inferences]

Section 1.2

Examining Distributions

Objectives

• Examine distributions of data.

• Explain and interpret measures of location, dispersion, and shape.

• Use the MEANS and UNIVARIATE procedures to produce summary statistics.

• Use the UNIVARIATE procedure to generate stem-and-leaf plots, box-and-whisker plots, normal probability plots, and histograms.

Cereal Data Set

[Figure: the Rise n Shine cereal data set, with the variables WEIGHT and ID NUMBER.]

Distributions

When you examine the distribution of values for the variable WEIGHT, you can find out

• the range of possible data values

• the frequency of data values

• whether the data values accumulate in the middle of the distribution or at one end.

Symmetric Distributions

[Figure: symmetric distributions of WEIGHT.]

Skewed Distributions

[Figure: skewed distributions of WEIGHT.]

Normal Distribution

[Figure: examples of normal distributions with std 1.5, std 1.0, and std 0.5.]

Measures of Central Tendency

The mean is the balancing point of your data.

[Figure: the values 15.02, 14.98, 15.01, 14.99, and 15.00 balanced about their mean.]
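The balancing-point idea can be checked numerically: deviations above the mean exactly offset deviations below it. The Python sketch below uses the five values shown on this slide; the variable names are illustrative only.

```python
from statistics import fmean

# The five values shown on the slide.
weights = [15.02, 14.98, 15.01, 14.99, 15.00]

mean_weight = fmean(weights)

# Deviations from the mean cancel out, which is what "balancing point" means.
deviations = [w - mean_weight for w in weights]

print(round(mean_weight, 2))
print(round(sum(deviations), 10))
```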

Percentiles

[Figure: the 40th percentile of WEIGHT, which splits the data values 40% and 60%.]
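As a rough sketch of what a percentile means, the Python snippet below finds a 40th percentile and confirms that 40% of the values lie below it. The weights are hypothetical, and percentile definitions differ between software packages, so the exact cut point need not match what SAS reports.

```python
from statistics import quantiles

# Hypothetical box weights in ounces (not course data), all distinct.
weights = [14.90 + 0.01 * i for i in range(20)]

# quantiles(..., n=10) returns the nine decile cut points;
# index 3 is the 4th decile, that is, the 40th percentile.
p40 = quantiles(weights, n=10)[3]

# Share of observations that fall below the 40th percentile.
share_below = sum(w < p40 for w in weights) / len(weights)
print(round(p40, 3), share_below)
```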

Measures of Dispersion

[Figure: two distributions of WEIGHT, each centered at 15.00.]
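Two samples can share the same center and still differ in spread. The sketch below compares common measures of dispersion for two hypothetical samples, both centered at 15.00 ounces; the data and the helper name `describe_spread` are assumptions made for illustration.

```python
from statistics import quantiles, stdev, variance

# Two hypothetical samples (not course data), both centered at 15.00 ounces.
tight = [14.98, 14.99, 15.00, 15.01, 15.02]
wide = [14.80, 14.90, 15.00, 15.10, 15.20]

def describe_spread(values):
    """Return the range, interquartile range, variance, and standard deviation."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": variance(values),   # sample variance, n - 1 divisor
        "std": stdev(values),           # sample standard deviation
    }

for name, sample in (("tight", tight), ("wide", wide)):
    print(name, describe_spread(sample))
```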

Measures of Shape

[Figure: three distributions of WEIGHT: skewed to the left, symmetric, and skewed to the right.]

Measures of Shape

[Figure: light-tailed, normal, and heavy-tailed distributions.]

The MEANS Procedure

PROC MEANS DATA=SAS-data-set <options>;

VAR variables;

RUN;

The UNIVARIATE Procedure

PROC UNIVARIATE DATA=SAS-data-set <options>;

VAR variables;

ID variable;

HISTOGRAM variables / <options>;

PROBPLOT variables / <options>;

RUN;

Descriptive Statistics

This demonstration illustrates using the MEANS and UNIVARIATE procedures to calculate descriptive statistics for continuous variables.

Graphical Displays of Distributions

PROC UNIVARIATE produces three kinds of plots for examining the distribution of your data values:

• stem-and-leaf plots

• box-and-whisker plots

• normal probability plots.

PROC UNIVARIATE can also generate histograms and graphically enhanced normal probability plots.

Stem-and-Leaf Plots

9 01338

8 0012347789

7 0013455667799

6 03568

5 8

4

3 9

2 0

1 4

Multiply Stem.Leaf by 10**1

Box-and-Whisker Plots

[Figure: a box-and-whisker plot on a vertical axis running from 10 to 100. The box spans the 25th percentile to the 75th percentile, with the 50th percentile (median) marked inside it. The whiskers extend to the minimum and maximum points within 1.5 IQ units of the box. Points more than 1.5 IQ units from the box are marked 0, and points more than 3 IQ units from the box are marked *.]

The mean is denoted by +.

Normal Probability Plots

[Figure: five example normal probability plots, numbered 1 to 5.]

Examining Distributions

This demonstration illustrates using PROC UNIVARIATE to generate stem-and-leaf plots, box-and-whisker plots, normal probability plots, and histograms.

Section 1.3

Confidence Intervals for the Mean

Objectives

• Explain and interpret the confidence intervals for the mean.

• Explain the central limit theorem.

• Calculate confidence intervals using the MEANS procedure.

Point Estimates

[Figure: sample statistics shown as estimates of the corresponding population parameters.]

Variability among Samples

[Figure: different samples yield different sample means, for example a mean of 15.02 and a mean of 15.03.]

Standard Error of the Mean

A statistic that measures the variability of your estimate is the standard error of the mean.

It differs from the sample standard deviation because

• the sample standard deviation deals with the variability of your data

• the standard error of the mean deals with the variability of your sample mean.
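The distinction can be made concrete with a short calculation: the standard error of the mean is the sample standard deviation divided by the square root of the sample size. The Python sketch below uses hypothetical weights, not course data.

```python
from math import sqrt
from statistics import fmean, stdev

# Hypothetical sample of box weights in ounces (not course data).
weights = [15.02, 14.98, 15.01, 14.99, 15.00, 15.03, 14.97, 15.00]

n = len(weights)
sample_mean = fmean(weights)
sample_std = stdev(weights)          # variability of the individual weights
std_error = sample_std / sqrt(n)     # variability of the sample mean

print(f"mean={sample_mean:.4f} std={sample_std:.4f} stderr={std_error:.4f}")
```

Because the divisor is the square root of n, the standard error shrinks as the sample grows, even when the spread of the individual weights stays the same.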

Confidence Intervals

[Figure: two confidence intervals drawn as bracketed ranges; the first is labeled 95% Confidence.]

Assumptions about Confidence Intervals

The confidence intervals in this course assume that the sample means are normally distributed.
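A confidence interval for the mean is conventionally built from the sample mean and its standard error. The form below is the standard $t$-based interval with confidence level $100(1-\alpha)\%$; the notation $t_{1-\alpha/2,\,n-1}$ for the critical value is an assumption of this sketch and does not appear on the slides.

```latex
\bar{x} \pm t_{1-\alpha/2,\; n-1}\, s_{\bar{x}},
\qquad s_{\bar{x}} = \frac{s}{\sqrt{n}}
```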

Distribution of Sample Means

[Figure: the distribution of Weight beside the distribution of Mean of Weight.]

Normal Distribution

Useful Probabilities for Normal Distributions

[Figure: a normal curve with regions labeled 68%, 95%, and 99%.]

Confidence Intervals

Distribution of the Sample Means

[Figure: the distribution of the sample means with a 95% region marked.]

Central Limit Theorem

To satisfy the assumption of normality, you can either

• verify that the population distribution is approximately normal, or

• apply the central limit theorem.

The central limit theorem states that the distribution of sample means is approximately normal provided that the sample size is large enough.
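A small simulation can make the theorem tangible. The Python sketch below draws repeated samples from a strongly skewed population and summarizes the resulting sample means. The exponential population, the sample size of 50, the 2,000 repetitions, and the seed are all assumptions chosen for illustration.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(42)  # fixed seed so the run is reproducible

SAMPLE_SIZE = 50      # observations per sample (assumed)
NUM_SAMPLES = 2000    # number of repeated samples (assumed)

# Exponential population with mean 1 and standard deviation 1: strongly skewed.
sample_means = [
    fmean([rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

# The sample means cluster around the population mean (1.0), with spread
# close to sigma / sqrt(n) = 1 / sqrt(50), roughly 0.141.
print(round(fmean(sample_means), 3), round(pstdev(sample_means), 3))
```

Even though individual observations are far from normal, a histogram of `sample_means` would look roughly bell-shaped.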

Central Limit Theorem

Confidence Intervals

This demonstration illustrates calculating confidence intervals using PROC MEANS.

Section 1.4

Hypothesis Testing

Objectives

• Define some common terminology related to hypothesis testing.

• Perform hypothesis testing using the UNIVARIATE procedure.

• Compare the means of paired groups using the TTEST procedure.

Judicial Analogy

[Diagram: Hypothesis → Significance Level → Collect Evidence → Decision Rule]

Coin Example

[Figure: a sequence of coin tosses: H, T, H, T, H.]

Coin Analogy

[Diagram: Hypothesis → Significance Level → Collect Evidence → Decision Rule]

Types of Errors

You used a decision rule to make a decision, but was the decision correct?

                              ACTUAL
DECISION          Fair Coin         Not Fair Coin
Fair Coin         correct           Type II error
Not Fair Coin     Type I error      correct

Modified Coin Experiment

Which coins are fair?

Heads   Tails   p-value
55      45      = .27
63      37      < .01
40      60      = .04
15      85      < .01

Statistical Hypothesis Test

[Diagram: Set Hypothesis (Rise n Shine, 15 oz.), Collect Data, Significance Level, p-value, Decision Rule]

Comparing $\alpha$ and the p-Value

In general, you

• reject the null hypothesis if $p < \alpha$

• fail to reject the null hypothesis if $p \geq \alpha$.
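The decision rule is simple enough to express as a small function. The Python sketch below applies it to the two exact p-values quoted in the modified coin experiment; the function name `decide` and the default significance level of 0.05 are assumptions made for illustration.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the decision rule: reject H0 only when p < alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

# The two exact p-values quoted in the coin experiment, at an assumed alpha of 0.05.
for p in (0.27, 0.04):
    print(p, "->", decide(p))
```

Note that a p-value exactly equal to the significance level falls on the "fail to reject" side of the rule.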

Performing a Test of Hypothesis

To test the null hypothesis $H_0: \mu = \mu_0$, SAS software calculates the t statistic

$$t = \frac{\bar{x} - \mu_0}{s_{\bar{x}}}$$

where $s_{\bar{x}}$ is the standard error of the mean.
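The t statistic can be computed by hand to see how its pieces fit together. This Python sketch uses hypothetical weights and tests a hypothesized mean of 15 ounces; the function name `t_statistic` is illustrative, and the p-value that SAS reports alongside the statistic is not computed here.

```python
from math import sqrt
from statistics import fmean, stdev

def t_statistic(sample, mu0):
    """Return (x_bar - mu0) / (s / sqrt(n)) for a one-sample t test."""
    n = len(sample)
    std_error = stdev(sample) / sqrt(n)
    return (fmean(sample) - mu0) / std_error

# Hypothetical box weights in ounces (not course data); H0: mu = 15.
weights = [15.04, 15.01, 15.06, 14.99, 15.05, 15.03, 15.02, 15.04]
print(round(t_statistic(weights, mu0=15.0), 3))
```

A t statistic far from zero, in either direction, indicates that the sample mean lies many standard errors away from the hypothesized value.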

Two-Sided Test of Hypothesis

The test of hypothesis is two-sided if the null is rejected when the actual value of interest is either less than or greater than the hypothesized value.

$H_0: \mu = 15.00$

$H_1: \mu \neq 15.00$

Two-Sided Test of Hypothesis

[Figure: the distribution of T on an axis running from -3 to 3.]

One-Sided Test of Hypothesis

In many situations, you are only interested in one direction. Perhaps you only want evidence that the mean is significantly lower than fifteen.

For example, instead of testing $H_0: \mu = 15$ versus $H_1: \mu \neq 15$, you test $H_0: \mu \geq 15$ versus $H_1: \mu < 15$.

One-Sided Test of Hypothesis

[Figure: the distribution of T on an axis running from -3 to 3.]

Hypothesis Testing

This demonstration illustrates using PROC UNIVARIATE to perform hypothesis testing.

Paired Samples

[Diagram: Sales BEFORE and Sales AFTER ADVERTISING.]

The TTEST Procedure

PROC TTEST DATA=SAS-data-set;

CLASS variable;

VAR variables;

PAIRED variable*variable;

RUN;

Paired t-Test

This demonstration illustrates using PROC TTEST to conduct a paired sample t-test.
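The logic behind a paired t-test is that it reduces to a one-sample test on the within-pair differences. The Python sketch below illustrates this with hypothetical before and after sales figures for six stores; the data and variable names are assumptions, and the p-value is not computed.

```python
from math import sqrt
from statistics import fmean, stdev

# Hypothetical sales figures for the same six stores (not course data).
sales_before = [120, 135, 110, 150, 128, 140]
sales_after = [126, 139, 118, 151, 135, 147]

# A paired test works on the within-pair differences.
differences = [after - before for before, after in zip(sales_before, sales_after)]

n = len(differences)
mean_diff = fmean(differences)
std_error = stdev(differences) / sqrt(n)

# One-sample t statistic for H0: mean difference = 0, with n - 1 degrees of freedom.
t_stat = mean_diff / std_error
print(f"mean difference={mean_diff:.2f} t={t_stat:.3f} df={n - 1}")
```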

Section 1.5

Two-Sample t-Tests

Objectives

• Recognize and validate the assumptions of a two-sample t-test.

• Analyze two populations with the TTEST procedure.

Cereal Example

[Figure: two cereal boxes, Rise n Shine and Morning.]

Assumptions

Comparing Two Populations

• independent observations

• normally distributed data for each group

• equal variances for each group.

F Test for Equality of Variances

$H_0: \sigma_1^2 = \sigma_2^2$

$H_1: \sigma_1^2 \neq \sigma_2^2$

$$F = \frac{\max(s_1^2, s_2^2)}{\min(s_1^2, s_2^2)}$$
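The F statistic on this slide is the larger sample variance divided by the smaller one, so it is never less than 1. The Python sketch below computes it for two hypothetical groups; the data and the function name `folded_f` are assumptions, and the associated p-value is not computed.

```python
from statistics import variance

def folded_f(sample_1, sample_2):
    """Return the larger sample variance divided by the smaller one."""
    var_1, var_2 = variance(sample_1), variance(sample_2)
    return max(var_1, var_2) / min(var_1, var_2)

# Hypothetical weights for two brands (not course data).
brand_1 = [15.02, 14.98, 15.01, 14.99]
brand_2 = [15.10, 14.90, 15.05, 14.95]
print(round(folded_f(brand_1, brand_2), 2))
```

Because the larger variance always goes in the numerator, the result does not depend on which group is listed first.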

Test Statistics and p-Values

F Test for equal variances: $H_0: \sigma_1^2 = \sigma_2^2$

Variance Test: F’ = 1.51, DF = (3,3), Prob > F’ = 0.7446

t-Tests for equal means: $H_0: \mu_1 = \mu_2$

Unequal Variance t-test: T = 7.4017, DF = 5.8, Prob > |T| = 0.0004

Equal Variance t-test: T = 7.4017, DF = 6.0, Prob > |T| = 0.0003

Test Statistics and p-Values

F Test for equal variances: $H_0: \sigma_1^2 = \sigma_2^2$

Variance Test: F’ = 15.28, DF = (9,4), Prob > F’ = 0.0185

t-Tests for equal means: $H_0: \mu_1 = \mu_2$

Unequal Variance t-test: T = -2.4518, DF = 11.1, Prob > |T| = 0.0320

Equal Variance t-test: T = -1.7835, DF = 13.0, Prob > |T| = 0.0979
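The two t-test rows differ in how they combine the group variances. The Python sketch below computes both versions for hypothetical data. The unequal variance degrees of freedom use Satterthwaite's approximation, which is an assumption about how the fractional DF values on these slides were obtained; the function name `two_sample_t` is illustrative, and p-values are not computed.

```python
from math import sqrt
from statistics import fmean, variance

def two_sample_t(sample_1, sample_2):
    """Return ((t, df) pooled, (t, df) unequal-variance) for H0: mu1 = mu2."""
    n1, n2 = len(sample_1), len(sample_2)
    v1, v2 = variance(sample_1), variance(sample_2)
    diff = fmean(sample_1) - fmean(sample_2)

    # Equal variance (pooled) version.
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t_pooled = diff / sqrt(pooled_var * (1 / n1 + 1 / n2))
    df_pooled = n1 + n2 - 2

    # Unequal variance version with Satterthwaite's approximate df.
    a, b = v1 / n1, v2 / n2
    t_unequal = diff / sqrt(a + b)
    df_unequal = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

    return (t_pooled, df_pooled), (t_unequal, df_unequal)

# Hypothetical weights for two brands (not course data).
brand_1 = [15.02, 14.98, 15.01, 14.99]
brand_2 = [15.20, 15.00, 15.15, 15.05]
print(two_sample_t(brand_1, brand_2))
```

When the two groups have the same size, as in this hypothetical data, the two t statistics coincide and only the degrees of freedom differ. With unequal group sizes the statistics themselves can differ.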

Testing for Equality of Means

This demonstration illustrates using PROC TTEST to test for the equality of means for two groups.

Section 1.6

Output Delivery System

Objectives

• Introduce the Output Delivery System (ODS).

• Examine some simple statements in ODS.

• Use ODS to capture some specific UNIVARIATE procedure output.

• Use ODS to generate a report in the HTML format.

• Use ODS to generate data sets with specific PROC UNIVARIATE output.

Output Delivery System

[Diagram: SAS procedure computes results → output object created in ODS → ODS converts data component into SAS data set]

ODS Statements

TRACE provides information about the output object, such as its name and path.

LISTING opens, manages, or closes the Listing destination.

OUTPUT creates a SAS data set from an output object.

Output Delivery System

This demonstration illustrates the Output Delivery System by introducing some simple concepts and building on that knowledge.

Section 1.7

Exercises

Section 1.8

Chapter Summary

Section 1.9

Solutions to Exercises
