Data Analysis with SPSS 16

General Sir John Kotelawala Defence
University
Workshop on
Descriptive and Inferential Statistics
Faculty of Research and Development
14th May 2013
1. Introduction to Statistics
1.1 What is Statistics?
In the common usage, 'statistics' refers to numerical information. (Here, 'Statistics' is the
plural of 'Statistic', which means one piece of numerical information.) For example,

Percentage of male nurses in Sri Lanka is 5%

Birth rate: 17.42 births/1,000 population

Death rate: 5.92 deaths/1,000 population

Infant mortality rate: 9.7 deaths/1,000 live births

Life expectancy at birth: male: 72.21 years female: 79.38 years

GDP (value of all final goods and services produced in a year): $106.5 billion

Unemployment rate (the percent of the labor force that is without jobs) : 5.8%

Inflation rate (the annual percent change in consumer prices compared with the
previous year's consumer prices): 5.9% (2010 est.)

In the more specific sense, 'statistics' refers to a field of study. It has been defined in several
ways. For example,

Statistics is the study of the collection, organization, analysis, and interpretation of
data - http://en.wikipedia.org/wiki/Statistics

Statistics is the mathematical science involved in the application of quantitative
principles to the collection, analysis, and presentation of numerical data. –
http://stat.fsu.edu/undergrad/statinf2.php

Statistics is the science of collecting, organizing, presenting, analyzing, and
interpreting numerical data to assist in making more effective decisions. http://business.clayton.edu/arjomand/business/l1.html
1.2 Data and Information

These words are often used interchangeably. However, there are some differences.

Data are the numbers, characters, symbols, images etc., collected in the raw form for
analysis whereas information is processed data.

Data is unprocessed facts and figures without any added interpretation or analysis.

Information is data that has been interpreted so that it has meaning for the user.

Knowledge is a combination of information, experience and insight that may benefit
the individual or the organization.
1.3 Distinguishing between Variables and Data

A variable is some characteristic which has different 'values' or categories for
different units (items/subjects/individuals).

Examples of variables on which data are collected at a prenatal clinic: Gender,
Ethnicity, Age, Body temperature, Pulse rate, Blood pressure, Fasting blood sugar
level, Urine pH value, Income group, Number of children.

We collect data on variables.

Data are raw numbers or facts that must be processed (analyzed) to get useful
information.

We get information by processing data.
Variable: Age (in years) of patients
• Data: 31, 42, 34, 33, 41, 45, 35, 39, 28, 41
• Information:
the mean age is 36.9 years.
the percentage of patients above 40 years of age: 40%
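The data-to-information step above can be reproduced in a few lines of code. This is a plain Python sketch for illustration only; it is not part of SPSS.

```python
# Raw data: ages (in years) of the ten patients listed above.
ages = [31, 42, 34, 33, 41, 45, 35, 39, 28, 41]

# Processing the raw data yields information.
mean_age = sum(ages) / len(ages)
pct_above_40 = 100 * sum(1 for a in ages if a > 40) / len(ages)

print(mean_age)      # 36.9
print(pct_above_40)  # 40.0
```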
1.4 Population and sample
Statistics is used for making conclusions regarding a group of units
(individuals/items/subjects). Such a group of interest is called a population. In research, the
'population' represents a group of units that one wishes to generalize the conclusions to. The
populations of interest are usually large.
Even though the decisions have to be made pertaining to the population of interest, often it is
impossible or very difficult to collect data from the whole population, due to practical
constraints on the available money, time and labour etc., or due to the nature of the
population. Therefore, often data are collected from only a subset of the population. Such a
subset is called a sample.
1.5 Descriptive Statistics and Inferential Statistics
Descriptive Statistics is the branch of Statistics that includes methods of organizing,
summarizing and presenting data in an informative way. Commonly used methods are:
frequency tables, graphs, and summary measures.
Inferential Statistics is the branch of Statistics that includes methods used to make decisions,
estimates, predictions, or generalizations about a population, based on a sample. This
includes point estimation, interval estimation, tests of hypotheses, regression analysis, time
series analysis, multivariate analysis, etc.
1.6 Classification of Variables
Why do we need to know about the types of variables? You need to know in order to evaluate
the appropriateness of the statistical techniques used, and consequently whether the
conclusions derived from them are valid. In other words, you cannot tell whether the results of
a particular medical research study are credible unless you know what types of variables or
measures have been used in obtaining the data.
1.6.1 Qualitative Variables

The characteristic is a quality.

The data are categories. They cannot be given numerical values. However, they may
be given numerical labels.

Examples: Gender of patient, Ethnicity, income group
1.6.2 Quantitative Variables

The characteristic is a quantity.

The data are numbers. They are obtained by counting or measuring with some scale.

Examples: Age, Body temperature, Pulse rate, Blood pressure, Fasting blood sugar
level, Urine pH value, Number of children
1.6.3 Discrete Variables

Quantitative.

Usually, the data are counts.

There are impossible values between any two possible values.

Examples: Pulse rate, Number of children
1.6.4 Continuous Variables

Quantitative.

Usually, the data are obtained by measuring with a scale.

There are no impossible values between any two possible values. Any value between
any two possible values is also a possible value.

Examples: Age, Fasting blood sugar level, Body temperature, Urine pH value
1.6.5 Scales of measurement
1.6.5.1 Nominal Variables

Qualitative

No order or ranking in categories.

Examples: Gender, Ethnicity
1.6.5.2 Ordinal Variables

Qualitative

Categories can be ordered or ranked.

Examples: income group
1.6.5.3 Interval Variables

Quantitative.

Data can be ordered or ranked.

There is no absolute zero. Zero is only an arbitrary point against which other values
can be compared.

Difference between two numbers is a meaningful numerical value.

They are called interval variables because the intervals between the numbers
represent something real. This is not the case with ordinal variables.

Ratio of two numbers is not a meaningful numerical value.

Examples: Temperature
1.6.5.4 Ratio Variables

Possesses all the characteristics of an interval variable.

There exists an absolute (true) zero.

Ratio between different measurements is meaningful.

Examples: Age, Pulse rate, Fasting blood sugar level, Number of children
2. Data Analysis with SPSS 16
2.1 Running SPSS for Windows
Method 01
Click on the Start button at the lower left of your screen and, among the programs listed, find
SPSS for Windows and select SPSS 16.0 for Windows.
Method 02
If there is an SPSS shortcut on the desktop, simply put the cursor on it and double click the
left mouse button.
Shown below is an image of the screen you will see when SPSS is ready.
Figure 01: The SPSS start-up screen, showing the menu bar, the tool bar and the start-up
dialog box.
You can select any one of the options in the start-up dialog box and click OK, or you can
simply hit Cancel. If you hit Cancel, you can either enter new data in the blank Data Editor or
open an existing file using the File menu, as explained later.
2.2 Different Types of Windows in SPSS
2.2.1 The Data Editor
As shown in Figure 01, first you will see the start-up dialog box listing several options; behind it
is the Data Editor. The Data Editor is a worksheet used for entering and editing data. It has
two panes: Data View and Variable View. SPSS uses several types of windows:
- Data Editor (with its Variable View and Data View panes)
- Output Viewer
- Syntax Editor
- Script Window
2.2.1.1 Naming and defining variables
When preparing a new dataset in SPSS, the following attributes must be set from the
Variable View.
Move your cursor to the bottom of the Data Editor, where you will see a tab labeled
Variable View. Click on that tab. A different grid appears, with these column
headings:
For each variable we create, we need to specify all or most of the attributes described by
these column headings.
Name
- Should be a single word.
- Spaces and special characters (!, ?, *) are not allowed.
- Each variable name must be unique; duplication is not allowed.
- The underscore character is frequently used where a space is desired in names.

Type
Click within the Type column, and a small gray button marked with three dots will
appear; click on it and you will see the Variable Type dialog box. Numeric is the
default type. (Basically, the numeric and string types are suitable for most
variables.) (For a full description of each of the variable types, click on the Help
button.)

Width & Decimals
Applicable to numeric variables.

Label
An optional attribute which can be used for entering a more detailed variable name.

Values
This option allows the user to configure the coding structure for categorical
variables. (In the Values column, click on the word None and then click the gray
box with three dots. This opens the Value Labels dialog box.)
(e.g., type "1" in the Value box and "male" in the Label box, then click Add. Next
type "0" in the Value box and "female" in the Label box. Click Add and then OK.)

Missing
The user can assign codes to represent missing observations.

Measure
The scale of measurement applicable to the variable. Both interval and ratio scales
are referred to as 'Scale' type.
2.2.1.2 Entering Data
The Data View pane of the Data Editor window is used to enter the data. Displayed initially
is an empty spreadsheet with the variable names you have defined appearing as the column
headings.
2.2.1.3 Saving a Data File
On the File menu, choose Save As…. In the Save in box, select the destination directory of
your choice (in our example, we are saving to the Desktop). Then give a suitable file name and
click Save.
2.2.2 Output Viewer
Displays output and error messages. The extension of the saved file will be ".spv".
2.3 Reading Data into SPSS
Data can be entered directly, or they can be imported from a number of different sources. The
process of reading data stored in SPSS-format data files and in spreadsheet applications such
as Microsoft Excel will be covered in the classroom session. SPSS-format data files are
organized by cases (rows) and variables (columns).
3. Descriptive Analysis of Data
Descriptive statistics consists of organizing and summarizing the information collected.
Descriptive statistics describes the information collected through numerical measurements,
charts, graphs and tables. The main purpose of descriptive statistics is to provide an overview
of the information collected.
3.1 Organizing Qualitative Data
Recall that qualitative data are non-numerical measures that categorize or classify an
individual. When qualitative data are collected, we are often interested in determining the
number of individuals that fall within each category.
3.1.1 Tabular Data Summaries
A frequency table (frequency distribution) is a listing of the values a variable takes in a data
set, along with how often (frequency) each value occurs.
Definition 3.1: The frequency is the number of observations in the data set that fall into a
particular class.
Definition 3.2: The relative frequency is the class frequency divided by the total number of
observations in the data set; that is,
Relative frequency = class frequency / total number of observations
Definition 3.3: The percentage is the relative frequency multiplied by 100; that is,
Percentage = Relative frequency * 100
Relative frequency is usually more useful than a comparison of absolute frequencies.
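As a quick illustration of Definitions 3.1–3.3, the following Python sketch computes frequencies, relative frequencies and percentages for a small set of qualitative data (the sample itself is invented for the example):

```python
from collections import Counter

# Hypothetical qualitative data: gender recorded for 10 patients.
data = ["male", "female", "female", "male", "female",
        "female", "male", "female", "female", "female"]

freq = Counter(data)        # Definition 3.1: class frequencies
n = len(data)
for category in freq:
    rel = freq[category] / n        # Definition 3.2: relative frequency
    pct = rel * 100                 # Definition 3.3: percentage
    print(category, freq[category], rel, pct)
```

SPSS reports the same quantities in the "Frequency" and "Percent" columns of its Frequencies output.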
One- way frequency tables (Simple frequency table)
Analyze → Descriptive Statistics → Frequencies
(Select the variable and click OK.)
Table 01: Composition of the sample by activity
Note: The "Valid Percent" column takes missing values into account. For instance, if there
were one missing value in this data set, then the valid number of cases would be 91. In that
case, the valid percentage of the slight category would be 11%. Note that "Percent" and
"Valid Percent" both always total 100%.
The "Cumulative Percent" is a cumulative percentage of the cases for a category and all
categories listed above it in the table. The cumulative percentages are not meaningful, of
course, unless the scale has ordinal properties.
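The distinction between "Percent", "Valid Percent" and "Cumulative Percent" can be checked by hand. In the sketch below, the counts are invented to echo the note above (92 cases in total, one of them missing, and 10 cases in the "slight" category):

```python
# Hypothetical valid counts for an ordinal variable (assumed values).
counts = {"none": 40, "slight": 10, "moderate": 25, "heavy": 16}
n_missing = 1
n_valid = sum(counts.values())          # 91
n_total = n_valid + n_missing           # 92

# "Percent" uses all cases; "Valid Percent" uses only valid cases.
percent = {k: 100 * v / n_total for k, v in counts.items()}
valid_percent = {k: 100 * v / n_valid for k, v in counts.items()}

# "Cumulative Percent" accumulates the valid percents down the table.
cumulative, running = {}, 0.0
for k, v in valid_percent.items():
    running += v
    cumulative[k] = running

print(round(valid_percent["slight"]))   # 11, as in the note
```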
3.2 Cross classification tables
Cross classification tables (contingency tables/ two-way tables) display the relationship
between two or more categorical (nominal or ordinal) variables.
Analyze → Descriptive Statistics → Crosstabs…
Note: The Crosstabs command does not present percentages by default. You can add row,
column and total percentages as appropriate using the Cells… option in the Crosstabs
dialog box.
Table 02: Composition of the sample by smoke and gender
3.3 Graphical Presentation for Categorical Data
The most effective way to present information is by means of visual display. Graphs are
frequently used in statistical analyses both as a means of uncovering patterns in a set of data
and as a means of conveying the important information from a survey in a concise and
accurate fashion.
3.3.1 Bar Charts
Simple Bar Chart
Graphs → Legacy Dialogs → Bar
Choose the options Simple and Summaries for groups of cases, and choose the relevant
variable as the category axis.
Clustered Bar Chart
Graphs → Legacy Dialogs → Bar
Choose the options Clustered and Summaries for groups of cases.
Component Bar Chart (Sub-divided Bar Diagram)
These diagrams show the total of the values and its break-up into parts. The bar is subdivided
into various parts in proportion to the values given in the data and may be drawn on absolute
figures or percentages. Each component occupies a part of the bar proportional to its share of
the total. To distinguish the components from one another, different colors or shades may be
used. When a sub-divided bar diagram is drawn on a percentage basis, it is called a
percentage bar diagram. The various components should be kept in the same order in each
bar.
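The arithmetic behind a percentage bar diagram is simply each component's share of its bar total. A short Python sketch with invented figures:

```python
# Hypothetical data: smoking status within each gender (assumed counts).
groups = {
    "male":   {"smoker": 20, "non-smoker": 30},
    "female": {"smoker": 5,  "non-smoker": 45},
}

# Convert each bar to percentage components; every bar then totals 100.
shares = {}
for gender, parts in groups.items():
    total = sum(parts.values())
    shares[gender] = {k: 100 * v / total for k, v in parts.items()}

print(shares["male"]["smoker"])    # 40.0
print(shares["female"]["smoker"])  # 10.0
```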
Pie Chart
SPSS Command
Graphs → Legacy Dialogs → Pie → Define
3.2 Organizing Quantitative Data
3.2.1 Grouped frequency tables
In order to construct a grouped frequency distribution, the numerical variable should first be
classified. We can use the Recode option in SPSS to perform this classification. Once the
variable is classified into a different variable, a frequency table can be prepared to present the
grouped frequency distribution.
SPSS command for Recode (into different variables)
Transform → Recode into Different Variables
or
Transform → Visual Binning
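What Recode (or Visual Binning) does can be mimicked in a few lines of Python. The cut points below are assumptions chosen for the illustration:

```python
from collections import Counter

def age_group(age):
    """Classify an age in years into a group (hypothetical bins)."""
    if age < 30:
        return "20-29"
    elif age < 40:
        return "30-39"
    else:
        return "40+"

# Ages of the ten patients from Section 1.3.
ages = [31, 42, 34, 33, 41, 45, 35, 39, 28, 41]
groups = [age_group(a) for a in ages]

# The grouped frequency distribution.
print(Counter(groups))  # Counter({'30-39': 5, '40+': 4, '20-29': 1})
```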
3.2.2 Graphical Presentation of Numerical Data
When presenting and analyzing the behavior of a numerical variable, different graphical
options such as the histogram, dot plot and box plot can be used.
SPSS commands
Histogram: Graphs → Legacy Dialogs → Histogram
Dot plot: Graphs → Legacy Dialogs → Scatter/Dot → Simple Dot → Define
Box plot: Graphs → Legacy Dialogs → Boxplot → Simple → Define
3.3 Summary measures
SPSS Command
Analyze → Descriptive Statistics → Frequencies (Statistics…)
Analyze → Descriptive Statistics → Descriptives
Analyze → Descriptive Statistics → Explore

Central Tendency
- Mean: the sum of the data values divided by the number of observations.
- Median: the value that lies in the middle of the data when arranged in ascending order.
That is, half the data are below the median and half the data are above the median.
- Mode: the most frequent observation of the variable that occurs in the data set.

Measures of Dispersion
- Range: the difference between the largest data value and the smallest data value.
- Sample variance: s² = Σ(xᵢ − x̄)²/(n − 1).
- Sample standard deviation: the square root of the sample variance.
- Inter-quartile range: measures the spread of the data around the median. The range of
the middle 50% of the data is called the inter-quartile range.

Quartiles
The quartiles of a set of values are the three points that divide the data set into four groups,
each representing a fourth of the population being sampled.

Measures of Skewness
Skewness is the characteristic that describes the lack of symmetry.

Kurtosis
The degree of peakedness of a distribution, usually taken relative to a normal distribution.
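To make the measures concrete, here is a Python sketch using the standard library's `statistics` module (not SPSS), applied to the patient ages from Section 1.3:

```python
import statistics as st

# Ages of the ten patients from Section 1.3.
data = [31, 42, 34, 33, 41, 45, 35, 39, 28, 41]

# Central tendency.
mean = st.mean(data)                  # 36.9
median = st.median(data)              # 37.0
mode = st.mode(data)                  # 41

# Dispersion.
data_range = max(data) - min(data)    # 17
variance = st.variance(data)          # sample variance, divisor n - 1
std_dev = st.stdev(data)

# Quartiles and inter-quartile range (middle 50% of the data).
q1, q2, q3 = st.quantiles(data, n=4)
iqr = q3 - q1

print(mean, median, mode, data_range, round(variance, 2), iqr)
```

Note that `st.variance` divides by n − 1, matching the sample variance SPSS reports.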
3.4 Scatter Plot
When you analyze bivariate data it is best to start with a suitable graph. In a quantitative
bivariate data set, we have an (x, y) pair for each sampling unit, where x denotes the
independent variable and y denotes the dependent variable. Each (x, y) pair can be considered
as a point on the Cartesian plane. A scatter plot is a plot of all the (x, y) pairs in the data set.
The purpose of a scatter plot is to illustrate diagrammatically any relationship between two
quantitative variables.
- If the variables are related, what kind of relationship is it: linear or nonlinear?
- If the relationship is linear, the scatter plot will show whether it is negative or positive.
SPSS Command
Graphs → Legacy Dialogs → Scatter/Dot → Simple Scatter → Define
3.5 Correlation

- The correlation coefficient, r, lies between −1 and +1.
- When r = 1, it signifies a perfect positive linear relationship.
- When r = −1, it signifies a perfect negative linear relationship.
- The further away r is from 0, the stronger the correlation. Figure 6.5 shows some
examples.
SPSS Command
Analyze → Correlate → Bivariate
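The coefficient itself can be computed directly from its definition, r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √(Σ(xᵢ − x̄)² Σ(yᵢ − ȳ)²). A short Python sketch (illustrative, not SPSS):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]          # y = 2x + 1: a perfect positive linear relationship
print(pearson_r(x, y))        # 1.0 (up to floating-point rounding)
print(pearson_r(x, y[::-1]))  # -1.0: a perfect negative linear relationship
```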
4. Fundamentals of Statistical Inference
The need for making educated guesses and drawing conclusions regarding some group of
units of interest arises in almost every field. Such a group of interest is called a population.
In research, the population represents a group of units that you wish to generalize your
conclusions to.
Even though the decisions have to be made pertaining to the population of interest, often it is
impossible or very difficult to collect data from the whole population, due to practical
constraints on the available money, time and labour etc., or due to the nature of the
population. Therefore, often data are collected from only a subset of the population. Such a
subset is called a sample.
The process of making educated guesses and conclusions regarding a population, using a
sample from that population, is called Statistical Inference. Usually this involves collecting
suitable data, analyzing the data using suitable statistical techniques, measuring the
uncertainty of the results and making conclusions.
Statistical inference problems usually involve one or more unknown constants related to the
population of interest. Such unknown constants are called parameters. Some examples of
parameters: the total of the values of variable X over the units of a finite population (the
population total), the mean of the values of X over the units of a finite population (the
population mean), the proportion of units with some specified characteristic (the population
proportion) and the mean of some random variable (the expected value). In addition, we come
across parameters in various models, such as regression models and probability distributions.
Often statistical inference problems involve estimation of parameters and test of hypotheses
concerning parameters. Estimation can be of the form of point estimation and/or interval
estimation.
4.1 Point Estimation
It involves using the sample data to calculate a single number to estimate the parameter of
interest. For instance, we might use the sample mean x̄ to estimate the population mean μ.
The problem is that two different samples are very likely to result in different sample means,
and thus there is some degree of uncertainty involved. A point estimate does not provide any
information about the inherent variability of the estimator; we do not know how close x̄ is to
μ in any given situation. However, x̄ is more likely to be near the true population mean if the
sample on which it is based is large.
4.2 Interval Estimation
This method is often preferred. The technique provides a range of reasonable values that are
intended to contain the parameter of interest; this range of values is called a confidence
interval. In interval estimation we derive an interval so that we can say that the parameter lies
within the interval with a given level of confidence.
4.3 Terminology and Notation
4.3.1 Estimate
An approximate value for a parameter, determined using a sample of data, is called a point
estimate or, in short, an estimate.
4.3.2 Estimator
We obtain an estimate by substituting the sample of data into a formula. Such a formula is
called an estimator. An estimator is a function of the data.
4.3.3 Notation
We usually use Greek letters to denote parameters. For example the population mean,
population standard deviation, population proportion are usually denoted by µ, σ and θ
respectively.
Example:
Suppose that we are interested in estimating the mean µ and the variance σ² of a population.
Let X1, X2, …, X5 be 5 random observations from this population. Let {3, 5, 2, 1, 2} be one
observed sample from this population and {4, 1, 3, 2, 1} be another observed sample. Table 01
illustrates the terms parameter, estimator and estimate.

Parameter   Estimator                   Estimate 01                Estimate 02
                                        (using {3, 5, 2, 1, 2})    (using {4, 1, 3, 2, 1})
µ           x̄ = Σxᵢ/n                   2.6                        2.2
σ²          s² = Σ(xᵢ − x̄)²/(n − 1)     2.3                        1.7
4.4 Point Estimation of Population Mean
Suppose X is a variable defined on the units of a large population and we are interested in the
population mean μ. Suppose we have selected a random sample of n units and we have
observed X on those units. Let x1, x2, …, xn be the observed values of X. Then
x̄ = (x1 + x2 + … + xn)/n can be used as an approximate value for the population mean.
Therefore, we say that x̄ is an estimate for μ. It is a point estimate.
In order to estimate the population mean using the sample mean, one of the following
options can be used. These were introduced in the previous section.
Analyze → Descriptive Statistics → Frequencies
Analyze → Descriptive Statistics → Descriptives
Analyze → Descriptive Statistics → Explore
4.5.1 Bound on the error of x̄ and confidence intervals
Usually an estimate is not exactly equal to the parameter. The difference between the actual
value of the parameter and the estimate is called the 'error' of the estimate. Since we do not
know the actual value of the parameter, we cannot know the exact error in our estimate.
However, we can place a bound on the error with a known level of confidence. For example,
using statistical theory, we may be able to make a statement like 'we are 95% confident
that the error of the estimate is less than 75'. This is equivalent to saying that 'we are 95%
confident that |x̄ − μ| < 75', which in turn means that we are 95% confident that μ is in the
interval (x̄ − 75, x̄ + 75). Such an interval is called a 95% confidence interval.
Computing an Appropriate Confidence interval for a Population Mean
Is n ≥ 30?
- Yes: Is the value of σ known?
  - Yes: use x̄ ± z_{α/2} σ/√n.
  - No: use the sample standard deviation s to estimate σ and use x̄ ± z_{α/2} s/√n, or,
    more correctly, x̄ ± t_{α/2} s/√n. Since n is large, there is little difference between
    these intervals.
- No: Is the population normal?
  - Yes: Is the value of σ known?
    - Yes: use x̄ ± z_{α/2} σ/√n.
    - No: use x̄ ± t_{α/2} s/√n.
  - No: use a nonparametric technique, or increase the sample size to at least 30 to
    develop a confidence interval.
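The decision logic in the chart above can be written as a small function. This is an illustrative sketch of the standard rules (the cut-off of 30 is the usual convention), not SPSS code:

```python
def interval_method(n, sigma_known, population_normal=None):
    """Return which confidence-interval formula applies for a population mean."""
    if n >= 30:
        # Large sample: z with sigma; with s, the z and t intervals barely differ.
        return "z-interval with sigma" if sigma_known else "z- or t-interval with s"
    if population_normal:
        # Small sample from a normal population.
        return "z-interval with sigma" if sigma_known else "t-interval with s"
    # Small sample, population not known to be normal.
    return "nonparametric technique, or increase n to at least 30"

print(interval_method(16, sigma_known=False, population_normal=True))
# t-interval with s
```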
Small sample from a normal population
Example 1
A researcher wishes to estimate the average number of heart beats per minute for a certain
population. In one such study, the following data were obtained from 16 individuals.
77, 92, 93, 77, 98, 81, 76, 71, 100, 87, 88, 86, 97, 95, 81, 96
It is known from past research that the number of heart beats per minute among humans is
normally distributed. Find a 90% confidence interval for the mean.
SPSS Command for the Interval Estimation of a Population Mean
Analyze → Descriptive Statistics → Explore
Note: Use the Statistics… button in the Explore dialog box to set the confidence level if it
needs to be changed. The default confidence level is 95%.
Interpretation:
We are 90% confident that the mean heart rate for the population is between 82.7019 and
90.4231.
What do we mean by saying that we are 90% confident that the mean heart rate for the
population is between 82.7019 and 90.4231?
…………………………………………………………………………………………………
…………………………………………………………………………………………………
…………………………………………………………………………………………………
…………………………………………………………………………………………………
Example 02
As reported by the US National Center for Health Statistics, the mean serum high-density
lipoprotein (HDL) cholesterol of females 20–29 years old is μ = 53. Dr. Paul wants to
estimate the mean serum HDL cholesterol of his 20–29-year-old female patients. He
randomly selects 15 of his 20–29-year-old patients and obtains the data shown below.
65, 47, 51, 54, 70, 55, 44, 48, 36, 53, 45, 34, 59, 45, 54
a) Use the data to compute a point estimate for the population mean serum HDL
cholesterol in his patients.
b) Construct a 95% confidence interval for the mean serum HDL cholesterol for the
patients. Interpret the result.
Note: In this problem it is not given that the population is normally distributed. Since the
sample size is small, we must verify that serum HDL cholesterol is normally distributed. If a
population cannot be assumed normal, we must use large-sample or nonparametric
techniques. However, if we can assume that the parent population is normal, then small
samples can be handled using the t distribution.
Assessing normality
The assumption of normality is a prerequisite for many inferential statistical techniques.
There are a number of different ways to explore this assumption graphically:
- Histogram
- Stem-and-leaf plot
- Boxplot
- Normal probability plot
Furthermore, a number of statistics are available to test normality:
- the Kolmogorov–Smirnov statistic, with a Lilliefors significance level, and the
Shapiro–Wilk statistic
- Skewness
- Kurtosis
Normal probability plots
1. Select the Analyze menu.
2. Click on Descriptive Statistics and then Explore… to open the Explore dialogue box.
3. Select the variable you require (i.e., HDL) and click on the ► button to move this
variable into the Dependent List: box.
4. Click on the Plots… command pushbutton to obtain the Explore: Plots sub-dialogue
box.
5. Click on the Normality plots with tests check box, and ensure that the Factor levels
together radio button is selected in the Boxplots display.
6. Click on Continue.
7. In the Display box, ensure that Both is activated.
8. Click on the Options… command pushbutton to open the Explore: Options
sub-dialogue box.
9. In the Missing Values box, click on the Exclude cases pairwise radio button. If this
option is not selected then, by default, any case with missing data will be excluded
from the analysis; that is, plots and statistics will be generated only for cases with
complete data.
10. Click on Continue and then OK.
Normal Probability Plot
In a normal probability plot, each observed value is paired with its expected value from the
normal distribution. If the sample is from a normal distribution, then the cases fall more or
less in a straight line.
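The idea behind the plot can be sketched in plain Python: pair each ordered observation with the corresponding normal quantile and measure how close the pairing is to a straight line via their correlation. This is the logic of a probability-plot correlation check (related to the Ryan–Joiner test), not the Shapiro–Wilk statistic SPSS reports, and the Blom plotting positions used are one common convention (an assumption here):

```python
import math
import statistics as st

# HDL data from Example 02.
hdl = [65, 47, 51, 54, 70, 55, 44, 48, 36, 53, 45, 34, 59, 45, 54]

n = len(hdl)
ordered = sorted(hdl)

# Expected normal quantiles using Blom's plotting positions.
nd = st.NormalDist()
expected = [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

# Correlation between ordered data and expected quantiles:
# close to 1 if the sample looks normal.
mx, my = st.mean(expected), st.mean(ordered)
sxy = sum((a - mx) * (b - my) for a, b in zip(expected, ordered))
sxx = sum((a - mx) ** 2 for a in expected)
syy = sum((b - my) ** 2 for b in ordered)
r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))
```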
Kolmogorov-Smirnov and Shapiro-Wilk statistics
The Kolmogorov-Smirnov statistic with a Lilliefors significance level for testing normality is
produced along with the normal probability and detrended probability plots. If the significance
level is greater than 0.05, then normality is assumed.
Since the conditions are satisfied, we can proceed with the t-based confidence intervals.
Large sample from a normal distribution (σ unknown)
Example 03
A researcher is interested in obtaining an estimate of the average level of some enzyme in a
certain human population. He has taken a sample of 35 individuals and determined the level
of the enzyme in each individual. It is known from past research that the level of this enzyme
among humans is normally distributed. The following are the values:
20, 11, 32, 25, 6, 23, 19, 24, 15, 31, 19, 23, 21, 27, 17, 20, 23, 23, 22, 13, 15, 28, 27, 18, 11,
32, 23, 28, 14, 23, 21, 25, 19, 29, 17
Construct a 95% confidence interval for the population mean and interpret the result.
Large sample from a non-normal distribution, or when we do not know whether the data are
normally distributed (σ unknown)
Example 04
(Pulse data set)
1. Construct a 95% confidence interval for the mean pulse rate of all males
2. Construct a 95% confidence interval for the mean pulse rate of all females
3. Compare the preceding results. Can we conclude that the population means for males and
females are different? Why or Why not?
Note:
We said that if we do not know σ (which is almost always the case) and the sample size n is
large (say at least 30), then we can estimate σ by s in the z-based confidence interval
(x̄ ± z_{α/2} s/√n).
It can be argued, however, that because the t-based confidence interval
(x̄ ± t_{α/2} s/√n)
is a statistically correct interval that does not require that we know σ, it is best, if we do not
know σ, to use this interval for any sample size, even a large one. Most common t-tables give
t points for degrees of freedom from 1 to 30, so we would need a more complete t-table or a
computer software package to use the t-based confidence interval for a sample whose size n
exceeds 31. For large samples (n > 30), the traditional "by-hand" approach is to invoke the
Central Limit Theorem, to estimate σ using the sample standard deviation s, and to construct
an interval using the normal distribution, but this is just a practical approach from
pre-computing days. With software like SPSS, the default presumption is that we do not
know σ, and so the Explore command automatically uses the sample standard deviation and
builds an interval using the value of the t distribution rather than the normal. However,
because these intervals do not differ by much when n is at least 30, it is reasonable, if n is at
least 30, to use the large-sample, z-based interval as an approximation to the t-based
interval. In practice, the values of the normal and t distributions become very close when n
exceeds 30.
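The closeness of the z- and t-based intervals for n > 30 can be checked with the Example 03 data. The critical values below are taken as assumptions from standard tables (z ≈ 1.960; t with 34 degrees of freedom ≈ 2.032):

```python
import math
import statistics as st

# Enzyme data from Example 03 (n = 35).
data = [20, 11, 32, 25, 6, 23, 19, 24, 15, 31, 19, 23, 21, 27, 17,
        20, 23, 23, 22, 13, 15, 28, 27, 18, 11, 32, 23, 28, 14, 23,
        21, 25, 19, 29, 17]

n = len(data)
xbar = st.mean(data)
se = st.stdev(data) / math.sqrt(n)

z_crit = 1.960   # standard normal (assumed from tables)
t_crit = 2.032   # t with 34 degrees of freedom (assumed from tables)

z_interval = (xbar - z_crit * se, xbar + z_crit * se)
t_interval = (xbar - t_crit * se, xbar + t_crit * se)

print(z_interval)
print(t_interval)  # only slightly wider than the z-based interval
```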
5. Hypothesis testing
5.1 Introduction

- Sometimes, the objective of an investigation is not to estimate a parameter, but
instead to decide which of two contradictory statements about the parameter is
correct. This is called hypothesis testing.
- Hypothesis testing typically begins with some theory, claim or assertion about a
particular parameter or several parameters.
- In any hypothesis testing problem, there are two contradictory hypotheses under
consideration: one is called the null hypothesis; the other is called the
alternative hypothesis.
- The validity of a hypothesis is tested by analyzing the sample. The procedure
which enables us to decide whether a certain hypothesis is true or not is called a
Test of Hypothesis.
5.2 Terminology and Notation
Hypothesis: A hypothesis is a statement or claim regarding a characteristic of one or more
populations.
Test of Hypothesis: The testing of hypothesis is a procedure based on sample evidence and
probability, used to test claims regarding a characteristic of one or more populations.
Hypothesis testing is based upon two types of hypotheses.
The null hypothesis, denoted by H0, is a statement to be tested. The null hypothesis is
assumed true until evidence indicates otherwise.
The alternative hypothesis, denoted by H1, is a claim to be tested. We are trying to find
evidence for the alternative hypothesis.
Table 5.1
- Two-tailed test: H0: μ = μ0 vs H1: μ ≠ μ0
- Left-tailed test: H0: μ = μ0 vs H1: μ < μ0
- Right-tailed test: H0: μ = μ0 vs H1: μ > μ0
Computation of Test Statistics
A function of the sample observations (i.e., a statistic) whose computed value determines the
final decision regarding acceptance or rejection of H0 is called a Test Statistic. The appropriate
test statistic has to be chosen very carefully, and knowledge of its sampling distribution under
H0 (i.e., when the null hypothesis is true) is essential in framing the decision rule. If the value
of the test statistic falls in the critical region, the null hypothesis is rejected.
Types of Errors in Hypothesis Testing - Type I and Type II Errors
As stated earlier, we use sample data to determine whether to reject or not reject the null
hypothesis. Because the decision to reject or not reject the null hypothesis is based upon
incomplete (i.e., sample) information, there is always the possibility of making an incorrect
decision. In fact, there are four possible outcomes from hypothesis testing.
Four Outcomes from Hypothesis Testing

Conclusion          Reality: H0 is True    Reality: H1 is True
Do not Reject H0    Correct decision       Type II error
Reject H0           Type I error           Correct decision

Table 5.2
The Level of Significance
The level of significance is the maximum probability of making a type I error and it is
denoted by α,
α = P (Type I error) = P( rejecting H0 when H0 is true)
The probability of making a Type I error is chosen by the researcher before the sample data
are collected. Traditionally, 0.01, 0.05 or 0.1 are taken as α
Critical Region or Rejection Region
The rejection region or critical region is the region of the standard normal curve
corresponding to a predetermined level of significance α. The region under the normal curve
which is not covered by the rejection region is known as the Acceptance Region. Thus the set
of values of the test statistic which leads to rejection of the null hypothesis H0 gives us the region known as
the Rejection region or Critical region. The value that separates the rejection
region from the acceptance region is known as the Critical Value.
Table 5.3: Rejection regions under the curve for two-tailed, left-tailed and right-tailed tests.
Methods for making conclusion
Method 01: Compare the critical value with the test statistic:
Two-Tailed: reject H0 if the test statistic falls below the lower critical value or above the upper critical value.
Left-Tailed: reject H0 if the test statistic is less than the critical value.
Right-Tailed: reject H0 if the test statistic is greater than the critical value.
Table 5.4
Method 02: Compare the p-value with the significance level:
In all three cases (two-tailed, left-tailed and right-tailed), reject H0 if the p-value is less than the significance level α; otherwise do not reject H0.
Table 5.5
Power
The probability of rejecting a false null hypothesis is called the power of the test. The
probability of committing a Type II error is denoted by ß.
Power = 1 - ß
5.3 Formulating a hypothesis
It is ideal if a test can be derived such that both errors are minimized simultaneously.
However, this may not be possible with the available data.
Instead, we consider tests for which the probability of one error is controlled. Conventionally,
the type I error is controlled.
Usually, out of the two errors, one error is more serious than the other. In such situations it is
reasonable to minimize the probability of the more serious error. In order to achieve this, the
hypothesis is constructed so that the more serious error will be the type I error.
An alternative way is to take the initially favored claim as the null hypothesis. The initially
favored claim will not be rejected in favor of the alternative unless the sample evidence
contradicts it and provides strong support for the alternative assertion.
If one of the hypotheses is an equality and the other is an inequality, then the equality
hypothesis is taken to be the null hypothesis.
5.4 Steps in test of hypothesis
1. Set up the “Null Hypothesis” H0 and the “Alternative Hypothesis” H1.
2. State the appropriate “test statistic” and also its sampling distribution when the null
hypothesis is true.
3. Select the “level of significance” α of the test, if it is not specified in the given problem.
4. Find the “critical region” of the test at the chosen level of significance.
5. Compute the value of the test statistic on the basis of the sample data, assuming the null
hypothesis is true.
6. If the computed value of test statistic lies in the critical region “reject H0” otherwise “do
not reject H0”.
7. Write the conclusion in plain non-technical language.
5.5 One – Sample Hypothesis Tests about Population Mean
Selecting an Appropriate Test Statistic to Test a Hypothesis about a Population Mean

1. Is the population normal?
   - Yes: Is the value of σ known?
     - Yes: use Z = (x̄ - μ0) / (σ/√n)
     - No: use the sample standard deviation s to estimate σ and use t = (x̄ - μ0) / (s/√n)
   - No: Is n ≥ 30?
     - Yes: Is the value of σ known?
       - Yes: use Z = (x̄ - μ0) / (σ/√n)
       - No: use Z = (x̄ - μ0) / (s/√n) or, more correctly, t = (x̄ - μ0) / (s/√n); since n is large, there is little difference between these tests
     - No: use a nonparametric technique, or increase the sample size to at least 30 to conduct a parametric hypothesis test
5.5.1 A small-sample two-sided hypothesis test
Example 5.1
File: ph.sav
An engineer wants to measure the bias in a pH meter. She uses the meter to measure the pH
in 14 neutral substances (pH = 7) and obtains the data shown below.

7.01  7.04  6.97  7.00  6.99  6.97  7.04  7.04  7.01  7.00  6.99  7.04  7.07  6.97
Is there sufficient evidence to support the claim that the pH meter is not correctly calibrated
at the α = 0.05 level of significance?
Approach:
In this case, we have only fourteen observations, meaning that the Central Limit Theorem does
not apply. With a small sample, we should only use the t test if we can reasonably assume
that the parent population is normally distributed. Therefore, since the sample size is
small, before proceeding with the test we must verify that pH is normally distributed.
Hypothesis to be tested
H0: Data are normally distributed.
H1: Data are not normally distributed.
Analyze → Descriptive Statistics → Explore
According to the Kolmogorov-Smirnov test, the p-value 0.2 > 0.05. Hence we do not reject H0 at the
0.05 level of significance. We can conclude that the data are normally distributed.
Since the conditions are satisfied we can proceed with the t test.
Hypothesis to be tested:
………………………………………………………………………………………………….
To conduct a one-sample t-test
1. Select the Analyze menu.
2. Click on Compare Means and then One-Sample T Test… to open the One-Sample T
Test dialogue box.
3. Select the variable you require (i.e. pH) and click on the ► button to move the variable
into the Test Variable(s): box.
4. In the Test Value: box type the mean score (i.e. 7).
5. Click on OK.
Note: In the SPSS output, the column labeled t displays the calculated value of the test
statistic, and the column labeled Sig. (usually two-tailed Sig.) displays the p-value of a
particular hypothesis test.
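For readers who want to cross-check the SPSS output by hand, the one-sample t statistic can be computed directly from the formula t = (x̄ - μ0) / (s/√n). A minimal Python sketch using only the pH data from Example 5.1 (Python is not part of the SPSS workflow; this is purely an arithmetic cross-check):

```python
import math
import statistics

# pH readings from Example 5.1 (n = 14, hypothesized mean mu0 = 7)
ph = [7.01, 7.04, 6.97, 7.00, 6.99, 6.97, 7.04,
      7.04, 7.01, 7.00, 6.99, 7.04, 7.07, 6.97]
mu0 = 7.0

n = len(ph)
xbar = statistics.mean(ph)   # sample mean
s = statistics.stdev(ph)     # sample standard deviation (n - 1 divisor)

# One-sample t statistic: t = (xbar - mu0) / (s / sqrt(n)), df = n - 1
t = (xbar - mu0) / (s / math.sqrt(n))
print(f"t = {t:.3f} with {n - 1} degrees of freedom")
```

The computed t should match the value SPSS reports in its output table; the p-value itself still comes from a t table or from software.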
Decision:………………………………………………………………………………………..
Conclusion:……………………………………………………………………………………..
…………………………………………………………………………………………………..
Note:
5.5.2 Performing One-tail Tests using the One-Sample T Test Procedure
The One-Sample T Test procedure in SPSS is designed to test two-tail hypotheses. However, a
researcher may need to test a one-tail (left-tail or right-tail) hypothesis. In this situation the
p-value for the corresponding test has to be computed using the following criteria.
1. For left-tail tests (i.e. H1: μ < μ0)
If the sample mean is less than μ0 (i.e. t < 0) then, p-value = Sig/2
Otherwise, p-value = 1 - Sig/2
2. For right-tail tests (i.e. H1: μ > μ0)
If the sample mean is greater than μ0 (i.e. t > 0) then, p-value = Sig/2
Otherwise, p-value = 1 - Sig/2
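The criteria above can be wrapped in a small helper function. A hedged Python sketch (`sig` stands for the two-tailed Sig. value SPSS reports and `t` for the reported t statistic; the function name is ours, not SPSS's):

```python
def one_tail_p(sig, t, tail):
    """Convert SPSS's two-tailed Sig. value to a one-tailed p-value.

    tail = 'left'  for H1: mu < mu0
    tail = 'right' for H1: mu > mu0
    """
    if tail == 'left':
        # sample mean below mu0 corresponds to t < 0
        return sig / 2 if t < 0 else 1 - sig / 2
    if tail == 'right':
        # sample mean above mu0 corresponds to t > 0
        return sig / 2 if t > 0 else 1 - sig / 2
    raise ValueError("tail must be 'left' or 'right'")

# If SPSS reports Sig. = 0.08 with t = 2.1, the right-tailed p-value is 0.04
print(one_tail_p(0.08, 2.1, 'right'))
```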
Example 5.2
In a study conducted by the U.S. Department of Agriculture, it was found that the mean daily
caffeine intake of 20-29 year old females in 2010 was 142.8 milligrams. A nutritionist claims
that the mean daily caffeine intake has increased since then. She obtains a simple random
sample of 35 females between 20 and 29 years of age and determines their daily caffeine
intakes. The results are presented in caffine.sav. Test the nutritionist's claim at the α = 0.05
level of significance.
Approach: The dataset represents a large sample (n=35), so we can rely on the Central Limit
Theorem to assert that the sampling distribution is approximately normal.
Hypothesis:…………………………………………………………………………………….
P-value:…………………………………………………………………………………………
Decision:………………………………………………………………………………………..
Conclusion:……………………………………………………………………………………
…………………………………………………………………………………………………
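Because the actual observations live in caffine.sav and are not reproduced here, the arithmetic can only be illustrated with hypothetical summary statistics. A sketch of the large-sample (CLT) right-tailed test, with x̄ and s assumed for illustration only:

```python
import math

# Hypothetical summary statistics for illustration only --
# the real values must be computed from caffine.sav.
n, xbar, s = 35, 165.0, 60.0
mu0 = 142.8   # mean daily intake reported in 2010

# Large-sample test statistic: z = (xbar - mu0) / (s / sqrt(n))
z = (xbar - mu0) / (s / math.sqrt(n))
print(f"z = {z:.3f}")

# Right-tailed test at alpha = 0.05: critical value z(0.05) = 1.645
print("reject H0" if z > 1.645 else "do not reject H0")
```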
Nonparametric Binomial Test for the One-Sample Test Procedure
The Binomial Test procedure compares an observed proportion of cases to the proportion
expected under a binomial distribution with a specified probability parameter. The observed
proportion is defined either by the number of cases having the first value of a dichotomous
variable (a variable that has two possible values) or by the number of cases at or below a given
cut point on a scale (quantitative) variable.
Hypothesis (to be tested on a quantitative variable)
H0: median = m0 vs H1: median ≠ m0
SPSS command
Analyze → Nonparametric → Binomial Test
Note: Set the cut point to the hypothesized median value.
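A binomial test on a cut point is essentially a sign test: count how many observations fall below the hypothesized median and compare that count with a Binomial(n, 0.5) distribution. A minimal sketch using only the standard library (the data and function name are hypothetical, chosen for illustration):

```python
from math import comb

def sign_test_p(data, m0):
    """Two-sided binomial (sign) test of H0: median = m0.

    Observations exactly equal to m0 are discarded; under H0 the
    count below m0 follows a Binomial(n, 0.5) distribution.
    """
    below = sum(1 for x in data if x < m0)
    above = sum(1 for x in data if x > m0)
    n = below + above
    k = min(below, above)
    # two-sided p-value: double the smaller tail (capped at 1)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical data, testing H0: median = 10
print(sign_test_p([12, 15, 9, 14, 11, 13, 16, 8], 10))
```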
6. Inferences on Two Samples
In the preceding chapter, we used a statistical test of hypothesis to compare the unknown
mean or proportion of a single population to some fixed known value. In practical applications,
however, it is far more common to compare the means of two different populations, where
both parameters are unknown.
In order to perform inference on the difference of two population means, we must first
determine whether the data come from an independent or dependent sample.
 Samples are independent when the individuals selected for one sample do not dictate
which individuals are to be in the second sample.
 Samples are dependent when the individuals selected to be in one sample are used to
determine the individuals to be in the second sample.
6.1 Testing hypotheses concerning two populations means μ1 and μ2: Dependent
Samples
Let (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) be a random sample of paired observations. Suppose
that the x's are identically distributed with population mean μ1 and population variance σ1²,
respectively. Also suppose that the y's are identically distributed with population mean μ2 and
population variance σ2², respectively.
Let μd be a known constant. Consider the following hypotheses:
Two-Tailed: H0: μ1 - μ2 = μd vs H1: μ1 - μ2 ≠ μd
Left-Tailed: H0: μ1 - μ2 ≥ μd vs H1: μ1 - μ2 < μd
Right-Tailed: H0: μ1 - μ2 ≤ μd vs H1: μ1 - μ2 > μd
Rather than consider the two sets of observations to be distinct samples, we focus on the
difference in measurements within each pair. Suppose that our two groups of observations are
as follows:
Sample 01   Sample 02   Differences within each pair
x11         x12         d1 = x11 - x12
x21         x22         d2 = x21 - x22
x31         x32         d3 = x31 - x32
…           …           …
xn1         xn2         dn = xn1 - xn2

d̄ = (1/n) Σ di  and  sd² = (1/(n-1)) Σ (di - d̄)²
If the differences are normally distributed or the sample size n is large,
the test statistic is
U = (d̄ - μd) / (sd/√n), which follows a t distribution with n-1 degrees of freedom under H0.
Compare the critical value with the test statistic, using the guideline below:
Two-Tailed: if U < -t(α/2, n-1) or U > t(α/2, n-1), reject the null hypothesis.
Left-Tailed: if U < -t(α, n-1), reject the null hypothesis.
Right-Tailed: if U > t(α, n-1), reject the null hypothesis.
6.1.2 Confidence Interval for Matched-Pairs Data
We can also create a confidence interval for the mean difference μd, using the sample mean
difference d̄, the sample standard deviation of the differences sd, and the sample size n.
Remember, the format for a confidence interval about a population mean is of the following form:
Point estimate ± Margin of error
Based on the preceding formula, a (1-α)100% confidence interval for μd is given by
d̄ ± t(α/2, n-1) · sd/√n
SPSS Command
Command for Paired-Samples T test
Analyze → Compare Means → Paired Samples T Test
Example 6.1
A dietitian hopes to reduce a person's cholesterol level by using a special diet supplemented
with a combination of vitamin pills. Six (6) subjects were pre-tested and then placed on the diet
for two weeks. Their cholesterol levels were checked after the two-week period. The results
are shown below. Cholesterol levels are measured in milligrams per deciliter.
I. Test the claim that the cholesterol level before the special diet is greater than the
cholesterol level after the special diet at the α = 0.01 level of significance.
II. Construct a 99% confidence interval for the difference in mean cholesterol levels.
Assume that the cholesterol levels are normally distributed both before and after.
Subject   1     2     3     4     5     6
Before    210   235   208   190   172   244
After     190   170   210   188   173   228
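As a hand cross-check on what SPSS will report for Example 6.1, the paired t statistic can be computed from the formula in Section 6.1, using only the before/after values above:

```python
import math
import statistics

before = [210, 235, 208, 190, 172, 244]
after = [190, 170, 210, 188, 173, 228]

# Differences within each pair: d_i = before_i - after_i
d = [b - a for b, a in zip(before, after)]
n = len(d)
dbar = statistics.mean(d)   # mean difference
sd = statistics.stdev(d)    # standard deviation of the differences

# Paired t statistic: t = dbar / (sd / sqrt(n)), df = n - 1
t = dbar / (sd / math.sqrt(n))
print(f"dbar = {dbar:.2f}, t = {t:.3f}, df = {n - 1}")
```

For the one-sided claim (before greater than after), this t value would be compared with the right-tail critical value t(0.01, 5) from a t table.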
Example 6.2
A physician is evaluating a new diet for patients with a family history of heart disease. To test
the effectiveness of this diet, 16 patients are placed on the diet for 6 months. Their weights
are measured before and after the study, and the physician wants to know if either set of
measurements has changed. Test whether there are statistically significant differences
between the pre- and post-diet weights of these patients. Use a 5% level of significance.
Step 01: Calculating differences
Transform → Compute Variable
Step 02:
Because the sample size is small, we must verify that the difference data are normally distributed.
Analyze → Descriptive Statistics → Explore
Note: Use ‘Plots…’ in the ‘Explore’ command and check ‘Normality plots with tests’.
Step 03:
Command for Paired-Samples T test
Analyze → Compare Means → Paired Samples T Test
6.4 Performing One-tail Tests using the Paired-Samples T Test procedure
The Paired-Samples T Test procedure in SPSS is designed to test two-tail hypotheses.
However, a researcher may need to test a one-tail (left-tail or right-tail) hypothesis. In this
situation the p-value for the corresponding test has to be computed using the following
criteria.
1. For left-tail tests (i.e. H1: μd < 0)
If the sample mean of differences is less than 0 (i.e. t < 0) then, p-value = Sig/2.
Otherwise, p-value = 1 - Sig/2
2. For right-tail tests (i.e. H1: μd > 0)
If the sample mean of differences is greater than 0 (i.e. t > 0) then, p-value = Sig/2.
Otherwise, p-value = 1 - Sig/2
Example: If a researcher tries to find whether post-diet weights have increased significantly,
determine the p-value and state your findings at the 5% level of significance.
6.5 Nonparametric Wilcoxon Test for Two Related Samples
Hypothesis
H0: μd = 0 vs H1: μd ≠ 0
SPSS command
Analyze → Nonparametric → 2 Related Samples
Note: Ensure that ‘Wilcoxon’ is checked in the ‘Test Type’ dialog box.
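SPSS computes the Wilcoxon statistic internally; for intuition, the signed-rank statistic itself can be sketched by hand. A minimal version (zero differences dropped and ties given average ranks, which is the usual convention and an assumption here, not stated in the text):

```python
def wilcoxon_w(x, y):
    """Wilcoxon signed-rank statistic W = min(W+, W-) for paired samples."""
    d = [a - b for a, b in zip(x, y) if a != b]   # drop zero differences
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(d):   # assign average ranks to tied |d| values
        j = i
        while j < len(d) and abs(d[order[j]]) == abs(d[order[i]]):
            j += 1
        avg = (i + 1 + j) / 2   # mean of rank positions i+1 .. j
        for k in range(i, j):
            ranks[order[k]] = avg
        i = j
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    w_minus = sum(r for r, v in zip(ranks, d) if v < 0)
    return min(w_plus, w_minus)

# Tiny check: differences 1, -2, 3, 4 get ranks 1, 2, 3, 4, so W- = 2
print(wilcoxon_w([5, 3, 8, 10], [4, 5, 5, 6]))
```

The p-value still requires the Wilcoxon reference distribution (or the normal approximation), which is what SPSS supplies.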
6.6 Testing hypotheses concerning two population means μ1 and μ2: Independent
samples
Let x1, x2, x3, …, xm be a random sample of observations from a certain population with
population mean μ1 and population variance σ1², respectively. Also let y1, y2, …, yn be a
random sample of observations from a certain population with population mean μ2 and
population variance σ2², respectively. Further suppose that the two samples are
independent.
Let μd be a known constant. Consider the following hypotheses:
Two-Tailed: H0: μ1 - μ2 = μd vs H1: μ1 - μ2 ≠ μd
Left-Tailed: H0: μ1 - μ2 ≥ μd vs H1: μ1 - μ2 < μd
Right-Tailed: H0: μ1 - μ2 ≤ μd vs H1: μ1 - μ2 > μd
Case 01: Data from normal distributions, both variances are known
The test statistic is
U = (x̄ - ȳ - μd) / √(σ1²/m + σ2²/n), which follows the standard normal distribution under H0.
Compare the critical value with the test statistic, using the guideline below:
Two-Tailed: if U < -z(α/2) or U > z(α/2), reject the null hypothesis.
Left-Tailed: if U < -z(α), reject the null hypothesis.
Right-Tailed: if U > z(α), reject the null hypothesis.
Case 02: Data from two normal distributions with unequal variances (σ1² ≠ σ2²), both
variances are unknown, m and n are small
The test statistic is
U = (x̄ - ȳ - μd) / √(s1²/m + s2²/n)
Compare the critical value with the test statistic, using the guideline below:
Two-Tailed: if Ucal < -t(α/2, ν) or Ucal > t(α/2, ν), reject the null hypothesis.
Left-Tailed: if Ucal < -t(α, ν), reject the null hypothesis.
Right-Tailed: if Ucal > t(α, ν), reject the null hypothesis.
Where
ν = (s1²/m + s2²/n)² / [ (s1²/m)²/(m-1) + (s2²/n)²/(n-1) ]
(1-α)100% Confidence Interval about the Difference of Two Means
(x̄ - ȳ) ± t(α/2, ν) √(s1²/m + s2²/n)
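Case 02 is the Welch test, which SPSS reports on the 'Equal variances not assumed' row of the Independent-Samples T Test output. A hedged sketch of the statistic and the degrees-of-freedom formula for μd = 0 (the sample data below are hypothetical):

```python
import math
import statistics

def welch_t(x, y):
    """Welch two-sample t statistic and its approximate df (mu_d = 0)."""
    m, n = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)  # s1^2, s2^2
    se2 = v1 / m + v2 / n
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(se2)
    # Welch-Satterthwaite degrees of freedom
    df = se2 ** 2 / ((v1 / m) ** 2 / (m - 1) + (v2 / n) ** 2 / (n - 1))
    return t, df

t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(f"t = {t:.3f}, df = {df:.2f}")
```

Note that the df is generally not an integer; SPSS rounds it in its output table.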
Case 03: Data normal, both variances are unknown, but known that they are equal (σ1² = σ2² = σ²).
The pooled estimate of the common variance is
sp² = [ (m-1)s1² + (n-1)s2² ] / (m+n-2)
The test statistic is
U = (x̄ - ȳ - μd) / ( sp √(1/m + 1/n) )
Compare the critical value with the test statistic, using the guideline below:
Two-Tailed: if Ucal < -t(α/2, m+n-2) or Ucal > t(α/2, m+n-2), reject the null hypothesis.
Left-Tailed: if Ucal < -t(α, m+n-2), reject the null hypothesis.
Right-Tailed: if Ucal > t(α, m+n-2), reject the null hypothesis.
(1-α)100% Confidence Interval about the Difference of Two Means
(x̄ - ȳ) ± t(α/2, m+n-2) · sp √(1/m + 1/n)
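Case 03 corresponds to the 'Equal variances assumed' row of the SPSS output. A sketch of the pooled-variance computation for μd = 0 (again with hypothetical data):

```python
import math
import statistics

def pooled_t(x, y):
    """Pooled two-sample t statistic (mu_d = 0), df = m + n - 2."""
    m, n = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    # pooled estimate of the common variance
    sp2 = ((m - 1) * v1 + (n - 1) * v2) / (m + n - 2)
    t = (statistics.mean(x) - statistics.mean(y)) / (
        math.sqrt(sp2) * math.sqrt(1 / m + 1 / n))
    return t, m + n - 2

t, df = pooled_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(f"t = {t:.3f}, df = {df}")
```

With equal sample sizes the pooled statistic equals the Welch statistic; only the degrees of freedom differ, which is why SPSS prints both rows.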
SPSS Command for the Independent-Samples T test
Analyze → Compare Means → Independent Samples T Test
Note: On ‘Define Groups’ option, apply relevant codes of the groups to be compared.
6.6.1 Performing One-tail Tests using the Independent-Samples T Test procedure
The Independent-Samples T Test procedure in SPSS is designed to test two-tail hypotheses.
However, a researcher may need to test a one-tail (left-tail or right-tail) hypothesis. In this
situation the p-value for the corresponding test has to be computed using the following
criteria.
1. For left-tail tests (i.e. H1: μ1 < μ2)
If the difference of the sample means is less than 0 (i.e. t < 0) then, p-value = Sig/2.
Otherwise, p-value = 1 - Sig/2
2. For right-tail tests (i.e. H1: μ1 > μ2)
If the difference of the sample means is greater than 0 (i.e. t > 0) then, p-value = Sig/2.
Otherwise, p-value = 1 - Sig/2
6.7 The Nonparametric Mann – Whitney U Test for Two Independent Samples
What should you do if the t test assumptions are markedly violated (e.g., what if the response
variable is not normal)? One answer is to run the appropriate nonparametric test, which in
this case is called the Mann-Whitney (M-W) U test.
Hypothesis
H0: η1 = η2 vs H1: η1 ≠ η2, where η1 and η2 denote the two population medians.
SPSS command
Analyze → Nonparametric → 2 Independent Samples
Note: Ensure that ‘Mann – Whitney U test’ is checked.
On ‘Define Groups’ option, apply relevant codes of the groups to be compared.
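The U statistic itself has a simple counting definition: U for one sample is the number of (x, y) pairs in which the x value exceeds the y value, with ties counted as one half. A minimal sketch (data hypothetical):

```python
def mann_whitney_u(x, y):
    """U statistic for sample x: pairs with x > y, ties counting 1/2."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1
            elif a == b:
                u += 0.5
    return u

# With no overlap between the samples, U hits its extremes 0 and m*n
print(mann_whitney_u([1, 2], [3, 4]))
print(mann_whitney_u([3, 4], [1, 2]))
```

SPSS converts U to a p-value using the exact distribution or a normal approximation, depending on the sample sizes.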
Example 6.3:
The purpose of a study by Eidelman et al. was to investigate the nature of lung destruction in
cigarette smokers before the development of marked emphysema. Three lung destructive
index measurements were made on the lungs of lifelong nonsmokers and smokers who died
suddenly outside the hospital of nonrespiratory causes. A large score indicates greater lung
damage. For one of the indexes the scores yielded by the lungs of a sample of nine
nonsmokers and a sample of 16 smokers are shown in Table 02. We wish to know if we may
conclude, on the basis of these data, that smoker, in general, have greater lung damage as
measured by this destructive index than do smokers.
Nonsmokers   18.1  6.0  10.8  11.0  7.7  17.9  8.5  13.0  18.9
Smokers      16.6  13.9  11.3  26.5  17.4  15.3  15.8  12.3  18.6  12.0  24.1  16.5  21.8  16.3  23.4  18.8
Example 6.4:
Researchers wished to know if they could conclude that two populations of infants differ with
respect to mean age at which they walked alone. The following data (age in months) were
collected:
Sample from population A: 9.5, 10.5, 9.0, 9.75, 10.0, 13.0, 10.0, 13.5, 10.0, 9.5, 10.0, 9.75
Sample from population B: 12.5, 9.5, 13.5, 13.75, 12.0, 13.75, 12.5, 9.5, 12.0, 13.5, 12.0,
12.0
7. Comparison of Multiple Groups
In the preceding chapter, we covered techniques for determining whether a difference exists
between the means of two independent populations. It is not unusual, however, to encounter
situations in which we wish to test for differences among three or more independent means
rather than just two. The extension of the two-sample t test to three or more samples is known
as the Analysis of Variance, or ANOVA for short.
Definition:
Analysis of Variance (ANOVA) is an inferential method that is used to test the equality of
three or more population means.
7.1 One- Way Analysis of Variance
It is the simplest type of analysis of variance. The one-way analysis of variance is a form of
design and subsequent analysis utilized when the data can be classified into k categories or
levels of a single factor, and the equality of the k class means in the population is to be
investigated.
For example, suppose five fertilizers are applied to four plots each of wheat and the yield of
wheat on each of the plots is given. We may be interested in finding out whether the effects of
these fertilizers on the yield are significantly different, or, in other words, whether the samples
have come from the same normal population. The answer to this problem is provided
by the technique of analysis of variance. The basic purpose of the analysis of variance is to test
the homogeneity of several means.
In order to perform ANOVA test, certain requirements must be satisfied.
7.2 Requirements of ANOVA Test
1. Independent random samples have been taken from each population.
2. The populations are normally distributed.
3. The population variances are all equal.
7.3 The Hypothesis Test of Analysis of Variance
H0: μ1 = μ2 = … = μk
H1: At least one of the population means differs from the others
7.4 Decomposition of Total Sum of Squares
The name analysis of variance is derived from a partitioning of total variability into its
component parts. Let yij be the jth observation of the ith factor level. The data collected under the
factor levels can be represented as follows.
Group (Factor Level/Treatment)   1     2     3     …    k
Number of observations           n1    n2    n3    …    nk
Mean                             ȳ1    ȳ2    ȳ3    …    ȳk
Variance                         s1²   s2²   s3²   …    sk²

Grand mean (ȳ) = (Σi Σj yij) / n, where n = n1 + n2 + … + nk
The total variation present in the data is measured by the sum of squares of all the
deviations from the grand mean. Thus
Total Sum of Squares (SSTo) = Σi Σj (yij - ȳ)²
The total variation in the observations can be split into the following two components.
1. The variation between the classes, or the variation due to different bases of
classification, commonly known as treatments.
2. The variation within the classes, i.e., the inherent variation of the random variable
within the observations of a class. This variation is due to chance causes which are
beyond human control.
The sum of squares due to differences in the treatment means is called the treatment sum of
squares or between-treatments sum of squares and is given by the expression
Treatment Sum of Squares (SSTr) = Σi ni (ȳi - ȳ)²
The sum of squares due to inherent variability in the experimental material is called the sum
of squares of the differences within the treatments:
Error Sum of Squares (SSE) = Σi Σj (yij - ȳi)²
It can be shown that
SSTo = SSTr + SSE
Total sum of squares = Sum of squares between treatments + Sum of squares within
treatments
7.5 The Mean Squares
In finding the average squared deviations due to treatment and to error, we divide each sum
of squares by its degrees of freedom. We call the two resulting averages mean square
treatment (MSTr) and mean square error (MSE), respectively.
The number of degrees of freedom associated with SSTr = k-1
MSTr = SSTr / (k-1)
The number of degrees of freedom associated with SSE = n-k
MSE = SSE / (n-k)
The Expected Values of the Statistics MSTr and MSE under the null hypothesis
E(MSE) = σ² ……………………………………………….(1)
E(MSTr) = σ² + Σi ni(μi - μ)² / (k-1) ……………………………. (2)
μi - mean of population i
μ - combined mean of all k populations
When the null hypothesis of ANOVA is true and all k population means are equal, MSTr and
MSE are two independent, unbiased estimators of the common population variance σ².
If, on the other hand, the null hypothesis is not true and differences do exist among the k
population means, then MSTr will tend to be larger than MSE. This happens because, when
not all population means are equal, the second term in eq. (2) is a positive number.
7.6 The test statistic in analysis of variance
Under the assumptions of ANOVA, the ratio MSTr/MSE possesses an F distribution with k-1
degrees of freedom for the numerator and n-k degrees of freedom for the denominator when
the null hypothesis is true.
Decision rule
If Fcal > F(α, k-1, n-k), reject H0.
Alternatively,
p-value = Pr(F > Fcal) under the F(k-1, n-k) distribution.
Thus reject H0 if p-value < α (level of significance).
ANOVA Table
Source
of Sum
of Degrees
of Mean
F
variation
Squares
freedom
Squares
statistics
Treatment
SSTr
k-1
MSTr
F=
Error
SSE
n-k
MSE
Total
SSTo
n-1
test p-value
Pr( F >
)
Example 7.1
A family doctor claims that the mean HDL cholesterol levels of males in the age groups 2029 years old, 40-49 years old and 60 - 69 years old are equal. He obtains a simple random
sample of 12 individuals from each group and determines their HDL cholesterol level. The
results are presented in table 7.1
Table 7.1
20-29 years old   40-49 years old   60-69 years old
54                61                44
43                41                65
38                44                62
30                47                53
61                33                51
53                29                49
35                59                49
34                35                42
39                34                35
46                74                44
50                50                37
35                65                38
Approach: We must verify the requirements.
1. As was stated in the problem, the data were collected using a random sampling method.
2. None of the subjects selected are related in any way, so the samples are independent.
3. The normality tests suggest that the sample data come from populations that are normally
distributed.
Because all requirements are satisfied, we can perform a one-way ANOVA.
Hypothesis:……………………………………………………………………………………
…………………………………………………………………………………………………
…………………………………………………………………………………………………
………………………………………………………………………………………………..
Decision:……………………………………………………………………………….
Conclusion:……………………………………………………………………………………
…………………………………………………………………………………………………
………………………………………………………………………………………..
Example 7.2
An experimenter wished to study the effect of 5 fertilizers on the yield of a crop. He divided
the field into 45 plots and assigned each fertilizer at random to 9 plots. The data in the table
below represent the number of pods on soybean plants for the various plot types.
Fertilizer   Pods
A            32  31  36  35  41  34  39  37  38
B            29  31  33  32  19  30  36  30  32
C            34  30  31  27  40  33  37  42  39
D            34  37  24  23  32  33  27  34  30
E            27  34  36  32  35  29  35  30  31
Test at the 5% level to see whether the fertilizers differed significantly.
Part 01:
Hypothesis:……………………………………………………………………………………
…………………………………………………………………………………………………
…………………………………………………………………………………………………
………………………………………………………………………………………………..
Decision:……………………………………………………………………………………
Conclusion:……………………………………………………………………………………
…………………………………………………………………………………………………
…………………………………………………………………………………………………..
Part 02:
Where are the differences?
After performing a one-factor independent-measures ANOVA and finding that the results
are significant, we know that the means are not all the same. This relatively simple
conclusion, however, actually raises more questions. Is μ1 different from μ2? Are all five
means different? Post-hoc tests provide answers to these questions whenever we have a
significant ANOVA result.
There are many different kinds of post-hoc tests that examine which means are different
from each other. One commonly used procedure is Tukey's Honestly Significant Difference
Test.
SPSS Command
Analyze → Compare Means → One-Way ANOVA
The variables are still selected, as earlier. Click on Post Hoc… and select only Tukey.
………………………………………………………………………………………………
……………………………………………………………………………………………...
………………………………………………………………………………………………
The Nonparametric Kruskal – Wallis Test
SPSS Command
Analyze → Nonparametric → K Independent Samples
Ensure that ‘Kruskal-Wallis H’ is checked.
On ‘Define Groups’ option, apply relevant codes of the groups to be compared.
Example 7.3
To compare the effectiveness of three types of weight-reduction diets, a homogeneous group
of 22 women was divided into three sub-groups and each sub-group followed one of these
diet plans for a period of two months. The weight reductions, in kg, were noted as given
below.
Diet plan   Weight reduction
I           4.3  3.2  2.7  6.2  5.0  3.9
II          5.3  7.4  8.3  5.5  6.7  7.2  8.5
III         1.4  2.1  2.7  3.1  1.5  0.7  4.3  3.5  0.3
Test whether the effectiveness of the three weight-reduction diet plans is the same at the 5%
level of significance.
Hypothesis:……………………………………………………………………………………
…………………………………………………………………………………………………
…………………………………………………………………………………………………
…………………………………………………………………
Decision:……………………………………………………………………………………
Conclusion:……………………………………………………………………………………
…………………………………………………………………………………………………
…………………………………………………………………………………………
Exercise:
Inference about Two Means: Dependent Samples
1. A dietitian hopes to reduce a person's cholesterol level by using a special diet
supplemented with a combination of vitamin pills. 16 subjects were pre-tested and then
placed on the diet for two weeks. Their cholesterol levels were checked after the two-week
period. The results are shown in Table 01. Cholesterol levels are measured in milligrams
per deciliter.
I. Test the claim that the cholesterol level before the special diet is greater than
the cholesterol level after the special diet at the α = 0.05 level of significance.
II. Construct a 95% confidence interval for the difference in mean cholesterol
levels.
Subject   Before   After
1         273      222
2         219      164
3         235      171
4         252      149
5         140      167
6         260      173
7         214      196
8         194      220
9         135      186
10        202      167
11        196      217
12        215      186
13        180      231
14        266      208
15        219      137
16        209      211
Table 01
Step 01: Set up the null hypothesis and alternative hypothesis
…………………………………………………………………………………………………
…………………………………………………………………………………………………
Step 02: Compute the difference between the before and after cholesterol level for each
individual
Step 03: Before proceeding with the test of hypothesis, we must verify that the difference data are
normally distributed, because the sample size is small.
We will construct a normal Q-Q plot and a normality test to verify the assumption.
Hypothesis:………………………………………………………………………………………
…...……………………………………………………………………………………………
Decision:………………………………………………………………………………………
Conclusion:……………………………………………………………………………………
…..………………………………………………………………………………………………
Step 04: Now we can proceed with the hypothesis test.
Decision:………………………………………………………………………………………
…………………………………………………………………………………………………
Conclusion:……………………………………………………………………………………
…………………………………………………………………………………………………
…………………………………………………………………………………………………
………………
Part 02: 95% confidence interval
…………………………………………………………………………………………………
Interpret the result
…………………………………………………………………………………………………
…………………………………………………………………………………………………
…………………………………………………………………………………………………
Nonparametric Wilcoxon Test for Two Related Samples
2. Suppose that you are interested in examining the effects of the transition from fetal to
postnatal circulation among premature infants. For each of 14 healthy newborns,
respiratory rate is measured at two different times: once when the infant is less than 15
days old, and again when he or she is more than 25 days old.
Subject   Respiratory Rate (breaths/minute)
          Time 1   Time 2
1         62       63
2         35       42
3         38       40
4         80       42
5         48       36
6         48       46
7         68       45
8         26       70
9         48       42
10        27       80
11        43       46
12        67       67
13        52       52
14        88       89
I. At the α = 0.1 level of significance, test the null hypothesis that the median difference
in respiratory rates for the two times is equal to 0.
Hypothesis………………………………………………………………………………
…………………………………………………………………………………………
Decision…………………………………………………………………………………
Conclusion………………………………………………………………………………
…………………………………………………………………………………………
II.
Do you feel that it would have been appropriate to use the paired t – test to evaluate
these data? Why or why not?
…………………………………………………………………………………………
Inference about Two Means: Independent Samples
Case 01: Data from two normal distributions with unequal variances and both variances
are unknown.
3. A physical therapist wanted to know whether the mean step pulse of men was less than
the mean step pulse of women. She randomly selected 51 men and 70 women to
participate in the study. Each subject was required to step up and down onto a six-inch
platform for three minutes. The pulse of each subject (in beats per minute) was then
recorded.
Data: pulse.sav
State the null and alternative hypothesis:
………………………………………………………………………………………………
Identify the p-value and state the researcher’s conclusion if the level of significance was α
= 0.01.
………………………………………………………………………………………………
………………………………………………………………………………………………
What is the 95% confidence interval for the mean difference in pulse rates of men versus
women? Interpret this interval.
………………………………………………………………………………………………
………………………………………………………………………………………………
Case 02: Data normal, both variances are unknown, but known that they are equal.
4. Researchers wanted to determine whether carpeted rooms contain more bacteria than
uncarpeted rooms. To determine the amount of bacteria in a room, the researchers pumped
the air from the room over a Petri dish at the rate of one cubic foot per minute for eight
carpeted rooms and eight uncarpeted rooms. Colonies of bacteria were allowed to form in
the 16 Petri dishes. The results are presented in the table below.
Test the claim that carpeted rooms have more bacteria than uncarpeted rooms at the α =
0.05 level of significance.
Carpeted Rooms (Bacteria/cubic foot)     11.8  12.1  8.2  8.3  7.1  3.8  13.0  7.2
Uncarpeted Rooms (Bacteria/cubic foot)   10.8  12.0  10.1  11.1  14.6  10.1  14.0  13.7
Hypothesis………………………………………………………………………………
…………………………………………………………………………………………
Decision…………………………………………………………………………………
Conclusion………………………………………………………………………………
…………………………………………………………………………………………
The Nonparametric Mann – Whitney U Test for Two Independent Samples
5. When a person is exposed to an infection, the person typically develops antibodies. The
extent to which the antibodies respond can be measured by looking at a person's titer,
which is a measure of the number of antibodies present. The higher the titer, the more
antibodies are present. The data in Table 02 represent the titers of 11 ill and 11 healthy
people exposed to the tularemia virus in Vermont.
Ill        640  1280  10  160  80  640  320  10  1280  160  80
Healthy    160  320  320  160  640  160  320  320  640  320  640
Table 02
Test the claim that the level of titer in the ill group is greater than the level of titer in the
healthy group, at the α = 0.1 level of significance.
Approach:…………………………………………………………………………………...
Hypothesis:…………………………………………………………………………………
………………………………………………………………………………………………
Decision:…………………………………………………………………………………..
Conclusion:…………………………………………………………………………………
………………………………………………………………………………………………
………………………………………………………………………………………………
Mann – Whitney Using Qualitative Data
6. The Mann-Whitney test can be performed on qualitative data if the data can be ranked. For
example, a letter grade received in a class is qualitative data that can be ranked: an “A”
ranks higher than a “B”. Suppose a department chair wants to discover whether there is
a difference in the grades of students learning a computer program based on the style of
the teaching methods. The chair randomly selects 15 students from Professor A's class
and 15 from Professor B's class and obtains the data below. Test whether the grades
administered in each class are equivalent.
Professor A   C  F  C  B  A  B  D  A  B  D  A  B  C  C  C
Professor B   D  C  B  A  C  B  C  B  C  A  C  D  B  F  B
Hypothesis:…………………………………………………………………………………
………………………………………………………………………………………………
Decision:…………………………………………………………………………………..
Conclusion:…………………………………………………………………………………
………………………………………………………………………………………………
………………………………………………………………………………………………