Uploaded by sohai bok

BHMC3004
Chapter 1
Chapter 1 INTRODUCTION to STATISTICS
1.1 Introduction
Statistics is the science of conducting studies to collect, organize, summarize, analyze, and draw
conclusions from data.
Reasons for studying statistics:
1. Able to read and understand the various statistical studies performed in your fields.
2. May be called on to conduct research in your field, since statistical procedures are
basic to research.
3. Use the knowledge gained to become better consumers and citizens.
Role of Statistics in social research process:
1. Asking the research question
2. Formulating the hypothesis
3. Collecting data
4. Analyzing data
5. Evaluating the hypothesis
The field of statistics is usually divided into two categories:
A. Descriptive / Deductive Statistics
o Description and analysis of data without drawing conclusions or inferences about a larger
group.
o Collecting, organizing, summarizing, presenting the data by using tables, graphs,
and summary measures.
B. Inferential / Inductive Statistics
o Making inferences / drawing conclusions about population, based on
information obtained from the samples.
o Performing estimations and hypothesis tests, determining relationships among
variables, and making predictions.
Example:
On the last 3 Sundays, Henry sold 2, 1, and 0 new cars, respectively.
o An example of descriptive statistics is:
Henry averaged 1 new car sold for the last 3 Sundays.
o An example of inferential statistics is:
Henry never sells more than 2 cars on a Sunday.
Example 1
The last four semesters an instructor taught Introductory Statistics, the following numbers of
students passed the course: 17, 19, 4, and 20.
Determine which of the following statements are descriptive in nature and which are inferential.
i) The last four semesters the instructor taught Introductory Statistics, an average of 15
students passed the course.
ii) The next time the instructor teaches Introductory Statistics, we can expect
approximately 15 students to pass the course.
iii) The instructor will never pass more than 20 students in an Introductory Statistics class.
iv) The last four semesters the instructor taught Introductory Statistics, no more than 20
students passed the course.
v) Only 4 students passed one semester because the instructor was in a bad mood the entire
semester.
vi) The instructor passed so few students in his Introductory Statistics class because he
does not like teaching that course.
1.2 Basic Terms
o Population
✓ A collection, or set, of individuals, objects, or events whose properties are to be
analyzed.
✓ Set of all the items under consideration.
o Sample
✓ A subset of population.
✓ Should possess the same or similar characteristics as the subjects in the
population.
✓ Draw conclusions about the population.
o Variable
✓ A characteristic of interest about each individual element of a population or
sample.
✓ Dependent variable: variable that the researcher wants to explain (the “effect”);
the object of the research.
✓ Independent variable: variable that is expected to “cause” or account for the
dependent variable.
✓ The independent variable usually occurs earlier in time than the dependent
variable.
o Data
✓ The set of values collected for the variable from each of the elements belonging to
the sample.
o Parameter
✓ A numerical value summarizing all the data of an entire population.
o Statistic
✓ A numerical value summarizing the sample data.
Example 2
A statistics student is interested in finding out the percentage of all households in Malaysia that have a
single woman as the head of the household. To estimate the percentage, the student conducts a survey
of 200 households, and the findings show that 75 of them are headed by a single woman.
Identify each of the following terms.
i) Population
ii) Sample
iii) Variable
iv) Data
v) Parameter
vi) Statistic
1.3 Types of Variables
Data
   Qualitative / Attribute
   Quantitative / Numerical
      Discrete
      Continuous
• Qualitative Variable / Attribute
o Cannot assume a numerical value.
o Two or more non-numerical categories.
o E.g., Hair colour, hometown, level of satisfaction.
• Quantitative Variable
o Can be measured numerically.
o E.g., Number of cars owned, time it takes to get to school.
o Can be further divided into two types:
1. Discrete
✓ Values are countable.
✓ Certain values with no intermediate values.
✓ E.g., Number of children in the family, number of houses.
2. Continuous
✓ Any numerical value over a certain interval.
✓ Any variable that involves money is considered a continuous variable.
✓ E.g., Height of students, income.
• The table below provides examples of the various types of data.
Data type | Question type | Responses / Data
Qualitative | Do you own a car? | Yes / No
Qualitative | What type of car do you own? | Toyota / Honda / Perodua
Quantitative, Discrete | How many cars do you own? | 1 / 2 / 3 / … (integer)
Quantitative, Continuous | What is the price of your car? | … (figures)
1.4 Levels of Measurement
• Important in determining which statistical inference test should be used to analyze the
data.
• 4 levels: Nominal → Ordinal → Interval → Ratio
• All are mutually exclusive and exhaustive.
* Mutually Exclusive
An individual, object, or measurement is included in only one category
(nonoverlapping).
* Exhaustive
Each individual, object, or measurement must appear in one of the categories.
1) Nominal Variable
o Qualitative variable that can only be categorized and counted, no particular order.
o Arithmetic operations are not meaningful.
o The lowest / most primitive measurement, less informative.
o E.g., Hair colour, religion and hometown.
2) Ordinal Variable
o A qualitative variable that incorporates an ordered position or ranking.
o Precise differences between data values cannot be determined or are meaningless.
o Higher than nominal variable.
o E.g., Level of satisfaction (“very satisfied”, “satisfied”, “not satisfied”) and grade (A, B,
C, F).
3) Interval Variable
o Next highest level of measurement.
o Meaningful differences between data values can be determined.
o No natural zero point.
o E.g., Temperature on the Celsius scale.
✓ 0°C is just a point on the scale and does not represent the absence of the
condition (no heat).
✓ It is incorrect to say that 60°C is twice as hot as 30°C, only that it is 30°C warmer.
4) Ratio Variable
o The highest level, gives the most information.
o The interval level with an inherent zero starting point, i.e., the 0 point is meaningful,
which means the zero point is the absence of the characteristic.
o Ratios and differences between two numbers are meaningful.
o E.g., Monthly income, Age.
Levels of Data
Nominal: Data may only be classified
Ordinal: Data are ranked
Interval: Meaningful difference between values
Ratio: Meaningful 0 point and ratio between values
Example 3
Identify the level of measurement for the following data.
i) Number of persons in a family.
ii) Colour of cars.
iii) Marital status of people.
iv) Length of a frog’s jump.
v) Reading group of a student (low, medium, or high).
vi) The most frequent use of your microwave oven (reheating, defrosting, warming, other).
vii) Number of consumers who refuse to answer a telephone survey.
viii) The door chosen by a mouse in an experiment (A, B, or C).
1.5 Sources of Data
• The availability of accurate and appropriate data is essential for deriving reliable results.
Data may be obtained from internal sources, external sources, or surveys and
experiments.
A) Internal Data
o Data taken from the records of the organization itself, such as a company’s own
personnel files or accounting records.
o For example, a company that wishes to forecast the future sales of its product may use
the data of past periods from its own records.
o Accurate and reliable, since these records are kept by the organization itself.
B) External Data
o Data taken from sources outside the organization, often collected for another purpose.
o A large number of government and private publications can be used as external
sources of data.
o For instance, the Statistical Abstract of the United States, Employment and Earnings,
the Handbook of Labour Statistics, and census data.
C) Primary Data
o The data are published or released by the same organization that collected them (collected for
the first time, specifically for the present purpose of one particular statistical
inquiry, e.g., surveys and experiments).
o Can be time-consuming and costly to collect. However, they can be more accurate, more
detailed, and more complete.
o For example, if we want to study the relationship between the family background
and the course selected by students, we could collect all the relevant information by
means of questionnaire. We then have to process the data and present them in the
most convenient form for our study.
D) Secondary Data
o The data are published by an organization other than the one that collected them, or
were collected for other purposes, e.g., data obtained from internal or
external sources.
o Secondary data are more convenient and cheaper to collect. However, they may be inadequate
for the purpose of the inquiry.
• Collecting primary data is much more complicated and time-consuming than
collecting secondary data. The method of collection has to be decided upon,
questionnaires have to be designed for the collection of data, and the researcher has to
decide whether to conduct a census or a sample survey.
Chapter 2 DATA COLLECTION
2.1 Data Collection Process
2.1.1 Personal Interview
• Advantages:
i) Purpose and meaning of each question are explained so that answers given are more
valid.
ii) High response rate (80 - 90%).
• Disadvantages:
i) Interviewer bias, whether conscious or unconscious.
ii) More expensive (recruit, train and pay the interviewers).
iii) People may not like to give confidential or embarrassing information.
2.1.2 Postal Questionnaire
• Advantages:
i) Cheaper.
ii) Wider area coverage.
iii) Can ask many things including personal habits.
iv) No interruption, the respondent will answer questions in a convenient way.
• Disadvantages:
i) Poor response rate (about 20%) and hard to get a good sample.
ii) For questions that are not clear, answers given would not be accurate or relevant.
iii) Need a mailing list.
2.1.3 Direct Observation
• Advantage:
i) Most accurate and precise among the methods of collecting data.
• Disadvantages:
i) Expensive and time consuming.
ii) Not applicable and uneconomical in many situations.
2.1.4 Telephone Enquiries
• Advantages:
i) Cheaper.
ii) Wider area coverage.
iii) All sessions can be controlled and monitored properly.
iv) Questionnaire can be computerized, and questions can be changed based on
respondent’s answer.
• Disadvantages:
i) Poor response rate (respondents may hang up on the interviewer).
ii) Time wasted (respondent not at home).
iii) Limited interview time.
2.1.5 Online Survey
• Advantages:
i) Faster and large volume of data collection.
ii) Save cost and flexible design.
iii) Anonymity.
iv) Respondent acceptability.
• Disadvantages:
i) Sample bias.
ii) Length, response and dropout rates.
iii) Technical problems.
2.1.6 Focus Group
• Advantages:
i) Require fewer resources and less time.
ii) Can request clarification of unclear responses.
iii) Can view both sides of the coin and build a balanced perspective on the matter.
• Disadvantages:
i) Sample selected may not represent the population accurately.
ii) Dominant participants can influence the responses of others.
2.2 Designing a Questionnaire
• Prepared either to be used as a postal questionnaire or as a basis for a personal or
telephone interview.
• Consists of two sections:
(a) Classification section
o Personal details of the respondents such as gender, age, marital status,
occupation, etc.
(b) Questioning section
Related to the subject matter of inquiry. The characteristics of the questions:
o Simple question
o Not ambiguous
o Short question
o Capable of a precise answer
o Not too personal
o Avoid questions that lead to a particular answer
o Questions are in a logical sequence
o Questionnaire should be as short as possible
o Cover the exact object of the inquiry
2.3 Sample and Census Data
• Sample survey
o Technique of collecting information from a portion of the population.
o The results of the sample survey are usually used to make inferences about the larger
population.
o Sample data.
• Census
o Survey that includes every member of the population.
o Many countries carry out a census study of their population every ten years - update
the information on the residents.
o Census data.
• Pilot study
o A study that is done before the actual fieldwork is carried out.
o The purpose:
✓ to identify possible problems and difficulties
✓ to test out and improve questionnaires
• A sample survey can reduce cost and time, and its results may be as accurate as those of a
census if the sample is selected using a proper sampling technique.
2.4 Sampling Techniques
• Sampling
o Process of selecting a representative subset (random process) from the population.
• Sampling Techniques
o Scientific methods of selecting samples from populations.
• Sampling Frame
o A list of all elements in the population from which the sample will be drawn.
o Should be complete, up to date and adequate for the purpose.
• Reasons for Sampling:
1) The destructive nature of certain tests.
2) The physical impossibility of checking all items in the population.
3) Studying the whole population is often prohibitively costly and time-consuming.
4) The adequacy of sample results.
• A useful sample (one from which conclusions can be drawn about the population) is a sample with
1) proper size (the larger, the more reliable);
2) random selection (to avoid bias).
• Two types of sampling methods:
A) Probability Sampling
o Each item or person in the population being studied has a known, nonzero likelihood of being included in the sample.
o Simple random sampling, systematic random sampling, stratified random sampling,
cluster sampling, multi-stage random sampling.
B) Non-probability Sampling
o Not all items or persons have a chance of being included in the sample.
o Sample is based on the judgment of the person selecting the sample.
o Convenience sampling, judgment sampling, quota sampling, snowball sampling.
2.4.1 Methods of Probability Sampling
2.4.1.1 Simple Random Sampling
• Each item or person has the same chance of being chosen.
• Can be obtained by
a) Thorough mixing and simply picking;
b) Using the output of some mechanical process, such as a revolving drum in the
drawing of lottery tickets;
c) Using a random number table.
• The use of a random number table:
a) Number all items in the sampling frame (population) in sequential order.
b) Select a starting point randomly in the random number table.
c) After that, continue selecting random numbers in a consistent manner, that is, row
by row or column by column.
Select groups of random numbers with the same number of digits as the total population
size.
d) Select the items whose numbers match the random numbers chosen in step (c).
Example 1
Sample of 10 students out of 300 students for a seminar.
1) Number the students from 001 to 300.
2) Refer to random number table, start from _________________________________________________
(starting point).
3) Referring to the first three digits, the random numbers are
_______________________________________________________________________________________
________________________________________________________________________________________________.
4) Thus, students numbered___________________________________________________________________
________________________________________________________________________________________________
are being chosen.
Note: Do not select the same number more than once.
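The table-reading procedure above can be sketched in Python. This is only an illustration, not the worked answer to Example 1: the function name `pick_from_table` is hypothetical, and the starting point (row 1 of column 1–5 in the Table of Random Numbers, reading down the column) is an assumption made so the sketch has entries to read.

```python
def pick_from_table(groups, population_size, sample_size, digits=3):
    """Select labels by reading the first `digits` digits of each table entry."""
    chosen = []
    for group in groups:
        label = int(group[:digits])
        # Skip labels outside 1..population_size and labels already chosen.
        if 1 <= label <= population_size and label not in chosen:
            chosen.append(label)
        if len(chosen) == sample_size:
            break
    return chosen


# First ten entries of column 1-5 in the Table of Random Numbers
# (assumed starting point: row 1, reading down the column).
column_1_to_5 = ["13962", "43905", "00504", "61274", "43753",
                 "83503", "36807", "19110", "82615", "05621"]

# Only three students are drawn here; a full sample of 10 would keep reading.
first_three = pick_from_table(column_1_to_5, population_size=300, sample_size=3)
print(first_three)
```

Entries whose first three digits exceed 300 (for example 439) are simply skipped, which is why most of the ten entries read do not yield a student.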
Table of Random Numbers
      Column
Row   1–5    6–10   11–15  16–20  21–25  26–30  31–35  36–40  41–45  46–50
1     13962  70992  65172  28053  02190  83634  66012  70305  66761  88344
2     43905  46941  72300  11641  43548  30455  07686  31840  03261  89139
3     00504  48658  38051  59408  16508  82979  92002  63606  41078  86326
4     61274  57238  47267  35303  29066  02140  60867  39847  50968  96719
5     43753  21159  16239  50595  62509  61207  86816  29902  23395  72640
6     83503  51662  21636  68192  84294  38754  84755  34053  94582  29215
7     36807  71420  35804  44862  23577  79551  42003  58684  09271  68396
8     19110  55680  18792  41487  16614  83053  00812  16749  45347  88199
9     82615  86984  93290  87971  60022  35415  20852  02909  99476  45568
10    05621  26584  36493  63013  68181  57702  49510  75304  38724  15712
11    06936  37293  55875  71213  83025  46063  74665  12178  10741  58362
12    84981  60458  16194  92403  80951  80068  47076  23310  74899  87929
13    66354  88441  96191  04794  14714  64749  43097  83976  83281  72038
14    49602  94109  36460  62353  00721  66980  82554  90270  12312  56299
15    78430  72391  96973  70437  97803  78683  04670  70667  58912  21883
16    33331  51803  15934  75807  46561  80188  78984  29317  27971  16440
17    62843  84445  56652  91797  45284  25842  96246  73504  21631  81223
18    19528  15445  77764  33446  41204  70067  33354  70680  66664  75486
19    16737  01887  50934  43306  75190  86997  56561  79018  34273  25196
20    99389  06685  45945  62000  76228  60645  87750  46329  46544  95665
21    36160  38196  77705  28891  12106  56281  86222  66116  39626  06080
22    05505  45420  44016  79662  92069  27628  50002  32540  19848  27319
23    85962  19758  92795  00458  71289  05884  37963  23322  73243  98185
24    28763  04900  54460  22083  89279  43492  00066  40857  86568  49336
25    42222  40446  82240  79159  44168  38213  46839  26598  29983  67645
26    43626  40039  51492  36488  70280  24218  14596  04744  89336  35630
27    97761  43444  95895  24102  07006  71923  04800  32062  41425  66862
28    49275  44270  52512  03951  21651  53867  73531  70073  45542  22831
29    15797  75134  39856  73527  78417  36208  59510  76913  22499  68467
30    04497  24853  43879  07613  26400  17180  18880  66083  02196  10638
31    95468  87411  30647  88711  01765  57688  60665  57636  36070  37285
32    01420  74218  71047  14401  74537  14820  45248  78007  65911  38583
33    74633  40171  97092  79137  30698  97915  36305  42613  87251  75608
34    46662  99688  59576  04887  02310  35508  69481  30300  94047  57096
35    10853  10393  03013  90372  89639  65800  88532  71789  59964  50681
36    68583  01032  67938  29733  71176  35699  10551  15091  52947  20134
37    75818  78982  24258  93051  02081  83890  66944  99856  87950  13952
38    16395  16837  00538  57133  89398  78205  72122  99655  25294  20941
39    53892  15105  40963  69267  85534  00533  27130  90420  72584  84576
40    66009  26869  91829  65078  89616  49016  14200  97469  88307  92282
41    45292  93427  92326  70206  15847  14302  60043  30530  57149  08642
42    34033  45008  41621  79437  98745  84455  66769  94729  17975  50963
43    13364  09937  00535  88122  47278  90758  23542  35273  67912  97670
44    03343  62593  93332  09921  25306  57483  98115  33460  55304  43572
45    46145  24476  62507  19530  41257  97919  02290  40357  38408  50031
46    37703  51658  17420  30593  39637  64220  45486  03698  80220  12139
47    12622  98083  17689  59677  56603  93316  79858  52548  67367  72416
48    56043  00251  70085  28067  78135  53000  18138  40564  77086  49557
49    43401  35924  28308  55140  07515  53854  23023  70268  80435  24269
50    18053  53460  32125  81357  26935  67234  78460  47833  20496  35645
2.4.1.2 Systematic Random Sampling
• The items or individuals of the population are arranged in some order.
• A random starting point is selected and then every k-th member of the population is
selected for the sample.
• k, sampling interval = population size (N) ÷ sample size (n)
• Can be biased if the population has repetitive or systematic pattern.
Example 2
Consider a population of 200; select a sample that is 10% of the population.
Sample size = 10% of population =
Then sampling interval is
Select a starting point randomly, say ______, the items selected for the sample would be
___________________________________________________________________________________________________________
__________________________________________________________________________________________________________.
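A minimal sketch of systematic selection for a population of 200 and a 10% sample follows. The function name `systematic_sample` and the starting point of 7 are assumptions for illustration only; in practice the starting point is drawn at random from 1 to k.

```python
def systematic_sample(population_size, sample_size, start):
    """Return the labels chosen by taking every k-th item, beginning at `start`."""
    k = population_size // sample_size  # sampling interval k = N / n
    if not 1 <= start <= k:
        raise ValueError("start must lie between 1 and k")
    return [start + j * k for j in range(sample_size)]


N = 200
n = N * 10 // 100  # 10% of the population
sample = systematic_sample(N, n, start=7)  # start=7 is a hypothetical choice
print(n, N // n)
print(sample)
```

Restricting the starting point to the first k items keeps every selected label inside the population.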
2.4.1.3 Stratified Random Sampling
• A population is first divided into strata, according to its various prominent
characteristics such as sex, age, and household income.
• Elements within each subgroup or stratum are homogeneous.
• Sub-sample is drawn utilizing a simple random sample within each stratum.
• Advantage: More accurate in reflecting the characteristics of the population.
• Can be divided proportionately or non-proportionately.
Step | Example 1 | Example 2 | Example 3
Population | All people in United States | All intercollegiate athletes | All primary students in the local school district
Strata | 4 time zones in the United States (Eastern, Central, Mountain, Pacific) | 26 intercollegiate teams | 11 different primary schools in the local school district
Obtain a Simple Random Sample | 500 people from each of the 4 time zones | 5 athletes from each of the 26 teams | 20 students from each of the 11 primary schools
Sample | 4 × 500 = 2000 selected people | 26 × 5 = 130 selected athletes | 11 × 20 = 220 selected students
Example 3
Referring to Example 1, the 300 students are classified according to their year of study as shown in
the following table. Draw a proportionate stratified sample of 10 students.
Year | Number of students | Label | n
1 | 120 | 001 – 120 |
2 | 90 | 121 – 210 |
3 | 90 | 211 – 300 |
Using the same starting point in the random number table as in Example 1, the random numbers
(within the range) are
Year | Label | n | Sample
1 | 001 – 120 | |
2 | 121 – 210 | |
3 | 211 – 300 | |
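Proportionate allocation gives each stratum a share of the sample equal to its share of the population. The sketch below shows that arithmetic for the three years of study; the function name `proportionate_allocation` is hypothetical, and simple rounding is an assumption that works here because the shares are whole numbers (in general the rounded values may need adjusting so that they add up to the sample size).

```python
def proportionate_allocation(strata_sizes, sample_size):
    """Allocate the sample to each stratum in proportion to its size."""
    total = sum(strata_sizes.values())
    return {name: round(sample_size * size / total)
            for name, size in strata_sizes.items()}


years = {"Year 1": 120, "Year 2": 90, "Year 3": 90}
allocation = proportionate_allocation(years, sample_size=10)
print(allocation)
```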
2.4.1.4 Cluster Sampling
• The population is first divided into small subdivisions, called primary units or clusters.
• Clusters or primary units should be as heterogeneous as the population itself.
• Then randomly choose the clusters. All the items in the chosen clusters are included in
the sample.
• A simple and less costly procedure.
• Area sample is the most popular type of cluster sample.
Step | Example 1 | Example 2 | Example 3
Population | All people in United States | All intercollegiate athletes | All primary students in the local school district
Clusters | 4 time zones in the United States (Eastern, Central, Mountain, Pacific) | 26 intercollegiate teams | 11 different primary schools in the local school district
Obtain a Simple Random Sample | 2 time zones from the 4 possible time zones | 8 teams from the 26 possible teams | 4 primary schools from the 11 possible primary schools
Sample | every person in the 2 selected time zones | every athlete on the 8 selected teams | every student in the 4 selected primary schools
2.4.1.5 Multi-stage Sampling
• The area of survey is divided into a number of areas, and three or four areas are selected
by random means.
• Each area selected is again sub-divided and another sample of smaller areas is selected
at random.
• The process continues until ultimately a number of quite small areas has been selected.
• A random sample of the relevant people within each of these areas is then interviewed.
• It reduces the area of survey and thus brings down the cost to a reasonable bound.
2.4.2 Methods of Non-probability Sampling
2.4.2.1 Convenience Sampling
• Used for pre-testing questionnaires, gathering ideas and insights, or forming
hypotheses.
• The selection is left primarily to the interviewers.
• Often, respondents are selected because they happen to be in the right place at the right
time.
2.4.2.2 Judgment Sampling
• The researcher selects a respondent whom he feels possesses certain characteristics
that represent the population of interest based on his experience.
2.4.2.3 Quota Sampling
• Like stratified random sampling, one has to take note of the various characteristics of the
population, for example, the divisions on gender, age and job type.
• The sample size is then divided into sub-sample sizes (quota) to include similar
proportions of people within these characteristics.
• Each interviewer is then given the quota of people with these characteristics to contact.
The final selection of the individuals is left up to the interviewers (similar to convenience
sampling).
2.4.2.4 Snowball Sampling
• An initial group of respondents is selected, usually at random.
• After being interviewed, these respondents are asked to identify others who belong to
the target population of interest.
• This procedure is applied until the researcher obtains the required number of
respondents.
Summary:
Strengths and Weaknesses of Basic Sampling Techniques
Probability Sampling Techniques | Strengths | Weaknesses
Simple Random Sampling | Easily applied. Results can be projected on population. | Difficult to obtain sampling frame, expensive, sometimes no assurance of representativeness.
Systematic Sampling | Easier to implement than simple random sampling. | Can decrease representativeness if certain patterns exist in sampling frame.
Stratified Sampling | Includes all important subpopulations, precision is improved. | Difficult to select relevant stratification variables, not feasible to stratify on many variables, expensive.
Cluster Sampling | Easy to implement, cost effective and work is reduced. | Imprecise, difficult to compute and to interpret results.
Non-probability Sampling Techniques | Strengths | Weaknesses
Convenience Sampling | Less expensive, less time consuming, most convenient. | Selection bias, sample not representative, not recommended for descriptive or causal research.
Judgment Sampling | Less expensive, less time consuming, most convenient. | Does not allow generalisation, subjective.
Quota Sampling | Sample can be controlled for certain characteristics. | Selection bias, no assurance of representativeness.
Snowball Sampling | Can estimate rare characteristics. | Time consuming.
Chapter 3 DATA PRESENTATION
3.1 Frequency Distribution
• A grouping of data into mutually exclusive categories showing the number of
observations in each class.
3.1.1 Qualitative Data
• Lists all categories and the number of elements that belong to each of the categories.
• Nominal Variable
Relative Frequency = $\dfrac{f}{\sum f}$
Percentage, % = $\dfrac{f}{\sum f} \times 100$
Gender | Frequency, f | Relative Frequency | Percentage, %
Female | 15 | 15/40 = 0.375 | 0.375 × 100 = 37.5%
Male | 25 | 25/40 = 0.625 | 0.625 × 100 = 62.5%
Total, Σf | 40 | 1 | 100
• Ordinal Variable
Relative Frequency = $\dfrac{f}{\sum f}$
Grade | f | Relative Frequency | Percentage, %
A | 8 | |
B | 15 | |
C | 10 | |
F | 7 | |
Total, Σf | 40 | 1 | 100
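The relative frequency and percentage columns follow directly from the formulas above. A minimal sketch for the grade data, with `frequency_table` as a hypothetical helper name:

```python
def frequency_table(frequencies):
    """Return (category, f, relative frequency, percentage) rows and the total."""
    total = sum(frequencies.values())
    rows = []
    for category, f in frequencies.items():
        relative = f / total
        rows.append((category, f, relative, relative * 100))
    return rows, total


grades = {"A": 8, "B": 15, "C": 10, "F": 7}
rows, total = frequency_table(grades)
for category, f, relative, percent in rows:
    print(f"{category}: f={f}, relative frequency={relative:.3f}, percentage={percent:.1f}%")
```

The relative frequencies always sum to 1 and the percentages to 100, which is a quick check on any completed table.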
• Joint Frequency Distribution
Gender | Grade A | Grade B | Grade C | Grade F | Total
Female | 3 | 6 | 4 | 2 | 15
Male | 5 | 9 | 6 | 5 | 25
Total | 8 | 15 | 10 | 7 | 40
o The table is referred to as a bivariate table or contingency table, reporting the overlap
between two variables.
3.1.2 Quantitative Data
• Lists all the classes and the number of values that belong to each class.
• Discrete Variable - Ungrouped
Number of Children | f | Relative Frequency | Percentage
0 | | |
1 | | |
2 | | |
3 | | |
4 | | |
Total | | |
o Number of classes, c = 5
• Discrete Variable - Grouped
Class Limit | Class Boundary | f | Class Midpoint
1 – 5 | 0.5 -< 5.5 | |
6 – 10 | 5.5 -< 10.5 | |
11 – 15 | 10.5 -< 15.5 | |
16 – 20 | 15.5 -< 20.5 | |
Total | | |
Note: the class limit 1 – 5 means 1 to 5, i.e., 1, 2, 3, 4 and 5.
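For whole-number class limits such as these, each boundary sits halfway between adjacent limits, and the midpoint is conventionally the average of the two limits. The sketch below illustrates that arithmetic; the helper name `boundaries_and_midpoint` and the `gap` parameter (the distance between one class's upper limit and the next class's lower limit, 1 here) are assumptions for illustration.

```python
def boundaries_and_midpoint(lower_limit, upper_limit, gap=1):
    """Return (lower boundary, upper boundary, midpoint) for one class."""
    half_gap = gap / 2
    lower_boundary = lower_limit - half_gap
    upper_boundary = upper_limit + half_gap
    midpoint = (lower_limit + upper_limit) / 2
    return lower_boundary, upper_boundary, midpoint


class_limits = [(1, 5), (6, 10), (11, 15), (16, 20)]
for lower, upper in class_limits:
    print((lower, upper), boundaries_and_midpoint(lower, upper))
```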
• Continuous Variable
Class Limit = Class Boundary | f | Class Midpoint
0 -< 5 | |
5 -< 10 | |
10 -< 15 | |
15 -< 20 | |
20 and above (open-ended class; assume 20 -< 25) | |
Total | |
Note: 0 -< 5 means 0 to less than 5.
o Current upper class limit = subsequent lower class limit
o Number of classes, c = 5
o For an open-ended class, in further calculations, assume it to be of the same size as the
immediate neighbouring class.
• Steps in Constructing a Grouped Frequency Distribution from a Set of Raw Data or
Ungrouped Data
o Step 1: Decide the number of classes, c.
$2^c > n$, where n = number of observations
o Step 2: Determine the class width, i (same for all classes).
$i > \dfrac{\text{Highest value} - \text{Lowest value}}{c}$
o Step 3: Set the class limits and class boundaries, if necessary.
o Step 4: Tally mark.
o Disadvantage: Lose the information on individual observations.
Example 1
Each student in a random sample of 30 students was asked to give the number of hours (to the nearest hour)
spent per week studying outside of class. Their eye colour and the number of pets they
owned were also recorded. The results are given as follows.
Student | Eye Colour | Number of Pets | Number of Hours Studying
1 | Blue | 1 | 10
2 | Brown | 0 | 7
3 | Brown | 3 | 15
4 | Green | 1 | 20
5 | Blue | 2 | 6
6 | Green | 1 | 25
7 | Hazel | 0 | 22
8 | Brown | 3 | 13
9 | Blue | 4 | 12
10 | Hazel | 3 | 21
11 | Blue | 1 | 16
12 | Green | 1 | 22
13 | Brown | 1 | 25
14 | Grey | 2 | 20
15 | Brown | 0 | 29
16 | Green | 4 | 25
17 | Green | 0 | 27
18 | Hazel | 1 | 15
19 | Grey | 2 | 14
20 | Brown | 2 | 17
21 | Grey | 1 | 8
22 | Blue | 0 | 18
23 | Blue | 3 | 24
24 | Brown | 0 | 28
25 | Hazel | 1 | 24
26 | Blue | 1 | 25
27 | Brown | 2 | 11
28 | Grey | 2 | 9
29 | Hazel | 4 | 10
30 | Brown | 2 | 17
Construct the frequency distributions for the data on eye colour, number of pets owned, and
number of hours spent per week studying outside of class.
Eye Colour | f | Relative Frequency | Percentage
Blue | | |
Brown | | |
Grey | | |
Hazel | | |
Green | | |
Total | 30 | 1 | 100
No. of Pets | f | Relative Frequency | Percentage
0 | 6 | |
1 | 10 | |
2 | 7 | |
3 | 4 | |
4 | 3 | |
Total | 30 | |
Number of Hours Studying:
n =
lowest =
highest =
i >
Class limit:
BHMC3004
Chapter 3
Example 2
A random sample of 30 students was selected and the average number of hours each student
studied in a week was determined.
15.0 23.7 19.7 15.4 18.3 23.0
14.2 20.8 13.5 20.7 17.4 18.6
12.9 20.3 13.7 21.4 18.3 29.8
17.1 18.9 10.3 26.1 15.7 14.0
17.8 33.8 23.2 12.9 27.1 16.6
Organize the data into a frequency distribution.
n =
lowest =
highest =
i >
Class limit:
Number of Hours | Frequency | Relative Frequency | Percentage
3.2 Cumulative Frequency Distribution
• For variables that are ordinal level and above.
• Gives the total number of values that fall below/above the upper/lower boundary of
each class.
• “Less than” cumulative frequency distribution
o A table showing the total frequency of all values less than the upper class boundary
of each class interval.
• “More than” cumulative frequency distribution
o A table showing the total frequency of all values more than or equal to the lower class
boundary of each class interval.
• Cumulative relative frequency = $\dfrac{\text{Cumulative frequency of a class}}{\text{Total frequency}}$
• Cumulative percentage = (Cumulative relative frequency) × 100
Example 3
Construct a “Less than” and a “More than” cumulative frequency distribution for the data in
Example 2.
Number of Hours | Cumulative Frequency
Less than |
Less than |
Less than |
Less than |
Less than |
Less than |
Number of Hours | Cumulative Frequency
More than or equal to |
More than or equal to |
More than or equal to |
More than or equal to |
More than or equal to |
More than or equal to |
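Both cumulative distributions can be read straight off the raw data once class boundaries are fixed. The sketch below assumes boundaries of 10, 15, 20, 25, 30 and 35 for the Example 2 data (five classes of width 5); that choice is an assumption for illustration, not the only valid one.

```python
hours = [15.0, 23.7, 19.7, 15.4, 18.3, 23.0, 14.2, 20.8, 13.5, 20.7,
         17.4, 18.6, 12.9, 20.3, 13.7, 21.4, 18.3, 29.8, 17.1, 18.9,
         10.3, 26.1, 15.7, 14.0, 17.8, 33.8, 23.2, 12.9, 27.1, 16.6]

# Assumed class boundaries: 10 -< 15, 15 -< 20, ..., 30 -< 35.
boundaries = [10, 15, 20, 25, 30, 35]

# "Less than" counts values strictly below each boundary.
less_than = [sum(v < b for v in hours) for b in boundaries]

# "More than or equal to" counts values at or above each boundary.
more_than_or_equal = [sum(v >= b for v in hours) for b in boundaries]

for b, below, above in zip(boundaries, less_than, more_than_or_equal):
    print(f"Less than {b}: {below}    More than or equal to {b}: {above}")
```

At every boundary the two counts add up to the total frequency, which gives a simple check on a completed table.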
3.3 Graphic Presentation for Qualitative Data
• Well suited for non-technical audiences such as executives and managers.
• Provide overview information rather than detail.
• Examples: Bar chart, Pie chart
• A good diagram or graph should have
i. a title
ii. a source
iii. units of measurement
iv. a key, if appropriate
v. a scale that is appropriately chosen and stated
vi. axes that are clearly stated
3.3.1 Bar Chart
• A graph made of bars of the same width; the heights or lengths of the bars represent the
frequencies of the respective categories.
• Can be used to depict any of the level of measurement (nominal, ordinal, interval, or
ratio).
• Can be constructed vertically or horizontally.
• Can show positive and negative values.
• 3 types:
A) Simple Bar Chart
B) Component Bar Chart
C) Multiple Bar Chart
Note:
• Leave a small gap between the adjacent bars (say 1/2 of the bar width).
3.3.1.1 Simple Bar Chart
• Used to represent a qualitative variable.
• E.g., Daily sales of ice cream
Day | Sales
Monday | 100
Tuesday | 140
Wednesday | 100
Thursday | 100
Friday | 170
o Vertical Bar Chart
o Horizontal Bar Chart
3.3.1.2 Component Bar Chart
• Useful to illustrate a breakdown in the figures.
• The constituent parts of each bar are always stacked in the same order with the height
of each representing the individual values or frequencies.
• E.g., The total sales of ice cream could be broken down into sales by flavours.
Flavour | Monday | Tuesday | Wednesday | Thursday | Friday
Vanilla | 50 | 60 | 40 | 50 | 80
Chocolate | 50 | 80 | 60 | 50 | 90
Total | 100 | 140 | 100 | 100 | 170
3.3.1.3 Multiple Bar Chart
• Uses a separate bar to represent each constituent part of the total.
• These bars are joined into a set for each class of data.
• E.g., The multiple bar chart for the previous example.
9
BHMC3004
Chapter 3
Example 4
Construct a multiple bar chart and a component bar chart for the following data.
Month
Jan
Feb
Mar
Apr
May
Total
Product (in thousand)
A
B
C
15
20
30
29
18
14
30
25
15
24
18
32
35
27
15
3.3.3 Pie Chart
• A circle divided into sectors in proportion to the relative frequencies, with one sector allocated to each group.
• The angle of a sector, θ = relative frequency × 360°
• Useful for displaying a relative frequency distribution.
• Whole pie or chart represents the total sample or population.
Example 5
In a study of retractions in biomedical journals, 436 were due to error, 201 were due to
plagiarism, 888 were due to fraud, 291 were duplications of publications, and 287 had other
causes. Illustrate the above information in a graph and interpret the graph.
Cause         Frequency   Relative frequency   Angle
Error         436         20.74%
Plagiarism    201         9.56%
Fraud         888         42.25%
Duplication   290         13.80%
Other         287         13.65%
Total         2102
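The angle rule for a pie chart can be sketched in a few lines of Python. This is only an illustration: the frequencies are the ones listed in the table above, and the helper name `sector` is hypothetical, not part of the notes.

```python
# Frequencies as listed in the Example 5 table.
freq = {"Error": 436, "Plagiarism": 201, "Fraud": 888,
        "Duplication": 290, "Other": 287}
total = sum(freq.values())


def sector(f, total):
    """Return (relative frequency, angle in degrees) for one category."""
    rel = f / total
    return rel, rel * 360


for cause, f in freq.items():
    rel, angle = sector(f, total)
    print(f"{cause:<12} {f:>4} {rel:7.2%} {angle:6.1f} degrees")
```

The angles always add up to 360° because the relative frequencies add up to 1.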
3.4 Graphic Presentation for Quantitative Data
• Well suited for technical audience such as engineers, supervisors, etc.
• Provide more numerical details.
• Examples: Histogram, Frequency polygon, Cumulative frequency polygon, Stem and leaf display
3.4.1 Histogram
• A graph, with a set of rectangles, in which classes (midpoints or class boundaries) are marked on the horizontal axis and frequencies (called the frequency histogram), relative frequencies (relative frequency histogram), or percentages (percentage histogram) are marked on the vertical axis.
• Each rectangle is constructed so that its area is proportional to the frequency of the class interval it represents.
• 2 types:
A) Equal-width histogram
B) Unequal-width histogram
• The bars in a histogram are drawn adjacent to each other.
• Symbol -⁄⁄- (truncation) is used to indicate that the entire axis is not shown.
3.4.1.1
Equal Width Histogram
• All the class intervals have the same width (or size).
• The vertical axis which represents the height of each rectangle is the class frequency,
relative frequency or percentage.
Example 6
The data below represent the defective items produced by machines of varying age. Draw a
histogram for the data.
Age (to the nearest month)   Frequency
1 – 5                        2
6 – 10                       3
11 – 15                      7
16 – 20                      15
21 – 25                      20
26 – 30                      22
31 – 35                      17

[Figure: Histogram for the Defective Items Produced by Machine of Varying Age; vertical axis: Frequency, f (0 to 22); horizontal axis: Age (to the nearest month), class boundaries 0.5 to 35.5]
3.4.1.2 Unequal-width Histogram
• Class intervals are of unequal width (or size).
• The height of each rectangle must be adjusted where its width differs from the “standard” class width, i.e., from the class width of the majority of class intervals.
• For example, when the width of a particular class interval (base of the rectangle) doubles in length, the height (class frequency) must be halved, and so on.
• Vertical axis is frequency density/adjusted height, where
\( \text{Frequency density} = \dfrac{\text{frequency}}{\text{no. of standard class widths}} \)
Example 7
Construct a histogram for the following data.
Time (minutes)   Frequency   Class width   No. of standard class widths   Height (Frequency density)
40 –< 45         8           5             1
45 –< 50         13          5             1
50 –< 55         16          5             1
55 –< 60         24          5             1
60 –< 70         24          10            2                              24 / 2 =
70 –< 85         15          15            3
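The height adjustment can be sketched as follows, taking the Example 7 classes and assuming a standard class width of 5 minutes (the width of the majority of the classes). The function name `frequency_density` is only illustrative.

```python
# Example 7 classes: (lower boundary, upper boundary, frequency).
classes = [(40, 45, 8), (45, 50, 13), (50, 55, 16),
           (55, 60, 24), (60, 70, 24), (70, 85, 15)]
STANDARD_WIDTH = 5  # assumed: the class width shared by most classes


def frequency_density(lower, upper, freq, standard_width=STANDARD_WIDTH):
    """Frequency divided by the number of standard class widths."""
    n_standard = (upper - lower) / standard_width
    return freq / n_standard


densities = [frequency_density(lo, hi, f) for lo, hi, f in classes]
print(densities)
```

Classes of standard width keep their frequency as the height; only the two wider classes are scaled down.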
[Figure: Histogram for the Time; vertical axis: Frequency density / Adjusted frequency (0 to 25); horizontal axis: Time (minutes), 35 to 85]
3.4.1.3 Shapes of Histogram
1. Symmetric / normal / triangular
• identical on both sides of its central point
2. Skewed
• non-symmetric, a longer tail on one side than the other
i. skewed to the right
o longer tail on the right side
ii. skewed to the left
o longer tail on the left side
3. Uniform / rectangular
• same frequency for each class
3.4.2 Frequency Polygon
• Consists of line segments connecting the points formed by the class midpoint and the
class frequency.
• Join the midpoints of the tops of successive bars in a histogram with straight lines.
• Join the points at each end of the diagram to the base line at the centers of the adjoining
class intervals (2 classes with 0 frequencies).
Example 8
Construct a frequency polygon for the data in Example 6 and Example 7. (Draw on the
histogram)
[Figure: Example 6, Histogram for the Defective Items Produced by Machine of Varying Age; vertical axis: Frequency, f; horizontal axis: Age (to the nearest month), 0.5 to 35.5]
[Figure: Example 7, Histogram for the Time; vertical axis: Frequency density / Adjusted frequency; horizontal axis: Time (minutes), 35 to 85]
3.4.3 Cumulative Frequency Polygon
• A line drawn for a cumulative frequency distribution by joining the dots marked above
the upper boundaries of classes at heights equal to the cumulative frequencies of
respective classes.
• Used to determine how many or what proportion of the data values are below or above
a certain value.
• If the dots are joined by a smooth curve, it is called a cumulative frequency curve or ogive.
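As a rough sketch, the points of a "less than" cumulative frequency polygon can be generated from the class boundaries and frequencies. The data are those of Example 6; the function name `less_than_cumulative` is hypothetical.

```python
from itertools import accumulate

# Example 6: upper class boundaries and class frequencies.
upper_boundaries = [5.5, 10.5, 15.5, 20.5, 25.5, 30.5, 35.5]
frequencies = [2, 3, 7, 15, 20, 22, 17]


def less_than_cumulative(lower_boundary, upper_boundaries, frequencies):
    """Return (boundary, cumulative frequency) points, starting at F = 0."""
    points = [(lower_boundary, 0)]
    points += list(zip(upper_boundaries, accumulate(frequencies)))
    return points


points = less_than_cumulative(0.5, upper_boundaries, frequencies)
for boundary, F in points:
    print(f"Less than {boundary:<5} {F}")
```

Each point is plotted above an upper class boundary, and the last cumulative frequency equals the total frequency.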
Example 9
Construct a cumulative frequency polygon for the data in Example 6 and Example 7.
Example 6
Age in months     F
Less than 0.5
Less than 5.5
Less than 10.5
Less than 15.5
Less than 20.5
Less than 25.5
Less than 30.5
Less than 35.5

Example 7
Time (minute)               F
More than or equal to 40
More than or equal to 45
More than or equal to 50
More than or equal to 55
More than or equal to 60
More than or equal to 70
More than or equal to 85
[Figure: Example 6, Cumulative Frequency Polygon for the Defective Items Produced by Machine of Varying Age; vertical axis: Cumulative frequency, F (0 to 100); horizontal axis: Age (to the nearest month), –4.5 to 35.5]
[Figure: Example 7, Cumulative Frequency Polygon for the Time; vertical axis: Cumulative frequency, F (0 to 100); horizontal axis: Time (minutes), 40 to 85]
3.4.4 Stem and Leaf Display
• A display of data in which each numerical value is divided into two parts: a leading
digit(s) becomes the stem and the trailing digit(s) becomes the leaf.
• The purpose is to display the shape of a distribution.
• Steps:
i) Split each value into two parts; the stem and the leaf.
ii) Draw a vertical line and write the stems on the left side of it, from the lowest to the
highest.
iii) Record the leaves next to the corresponding stems on the right side of the vertical line.
• The leaves are usually arranged in increasing order.
• No commas are placed between leaf digits.
• Each leaf contains only a single digit, while the stem may have as many digits as needed.
• The advantage of the display is that no information on individual observations is lost.
• The stem-and-leaf display reveals some important features:
i) Range of data values
ii) Where the values are concentrated
iii) Whether the distribution is symmetrical or not
iv) Whether gaps exist or not
v) Presence of outliers
o If the leaves become too crowded, then each distinct stem from the basic plot can be split
into either 2 or 5 different intervals.
2 intervals:
Stem   Leaf digits
1st    0, 1, 2, 3, or 4
2nd    5, 6, 7, 8, or 9

5 intervals:
Stem   Leaf digits
1st    0 or 1
2nd    2 or 3
3rd    4 or 5
4th    6 or 7
5th    8 or 9
o E.g., 12, 13, 13, 15, 17, 18, 19, 20, 21, 23, 25, 27.
✓ Split at the ones digit:
1 | 2335789
2 | 01357
Too few stems, shape is not clearly seen → not a suitable display
✓ Thus, if each stem is duplicated:
1 | 233
1 | 5789
2 | 013
2 | 57
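The three construction steps can be sketched in Python for whole-number data, with the ones digit as the leaf. The function name `stem_and_leaf` and the `|` separator are choices made here for illustration only.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return display rows: leading digit(s) as stem, ones digit as leaf."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    # Keep stems with no leaves so that gaps in the data stay visible.
    stems = range(min(rows), max(rows) + 1)
    return [f"{s} | {''.join(str(d) for d in rows.get(s, []))}" for s in stems]


data = [12, 13, 13, 15, 17, 18, 19, 20, 21, 23, 25, 27]
for row in stem_and_leaf(data):
    print(row)
```

This reproduces the basic (unsplit) display; splitting each stem into 2 or 5 intervals would need an extra grouping step on the leaf digit.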
o E.g., 10, 12, 13, 13, 14, 15, 15, 15, 16, 16, 19
✓ Split at the ones digit:
1 | 02334555669
Unable to comment on the shape
✓ If each stem is duplicated:
1 | 02334
1 | 555669
Also, not suitable
✓ Thus, with each stem split into 5 intervals:
1 | 0
1 | 233
1 | 4555
1 | 66
1 | 9
Better display → shape can be seen more clearly
o If the range between the smallest and largest data values is large and there are relatively few data values, the stem and leaf display will have many stem rows with few leaves in any one row (or empty rows). In that case, we may produce a condensed stem and leaf display by truncating the last digit of the data values and reconstructing the plot.
o E.g., 4, 25, 78, 105, 136, 143, 198, 200, 261
✓ Split at the ones digit:
0 | 4
1 |
2 | 5
3 |
4 |
:
7 | 8
:
10 | 5
11 |
:
25 |
26 | 1
Too many stems, shape is not clearly seen → not a suitable display
*Cannot skip the “in-between” stems which have no leaf
o Consider: 004, 025, 078, 105, 136, 143, 198, 200, 261 (truncate the last digit, no rounding up)
✓ Split at the tens digit:
0 | 0 2 7
1 | 0 3 4 9
2 | 0 6
Better display → shape can be seen more clearly
Example 10
Alice achieved the following scores on her quizzes this semester: 86, 79, 92, 84, 69, 88, 91, 83,
96, 78, 82, 85. Construct a stem and leaf display for the data.
6 |
7 |
8 |
9 |
Example 11
Below is the weight for a sample of 30 students (in kg):
19.1 19.8 18.0 19.2 19.5 17.3 20.0 20.3 19.6
18.5 18.1 19.7 18.4 17.6 21.2 20.6 22.2 19.1
21.1 19.3 20.8 21.2 21.0 18.7 19.9 18.7 22.1
17.2 18.4 21.4
Construct a stem and leaf display.
Example 12
The ages (in months) at which 25 children were first enrolled in a preschool are listed below.
38  40  38  35  39
34  37  36  35  36
45  35  36  36  43
41  36  37  43  38
40  34  41  39  36
Construct a stem and leaf display for the distribution of the age of the preschoolers.
Chapter 4 DESCRIPTIVE STATISTICS
4.1 Measure of Central Location (Average)
• A single value within the range of data used to represent all the values in the series.
• The point of location around which individual values cluster.
• Also known as measure of central value or central tendency.
• Two types:
o Mathematical Average: Mean
o Positional Average: Median, Mode, Fractiles
4.1.1 Mean
• Properties:
o For interval-level and ratio level data.
o All values are used.
o Unique.
o Σ(𝑥𝑖 − 𝑥̅ ) = 0.
o Used to compare populations.
• Advantages:
o Simple, always exists and unique.
o Fully representative.
o For further mathematical analysis.
o Can be calculated even when only the total value and the number of items are known.
o Relatively reliable.
• Disadvantages:
o Affected by extreme values.
o Cannot determine the mean for open-ended class(es) data. If such classes contain a
large proportion of the values, then the mean may be subject to substantial error.
A) Raw Data
• Population Mean, \( \mu = \dfrac{\sum X}{N} \)
• Sample Mean, \( \bar{x} = \dfrac{\sum x}{n} \)
• μ is a parameter and 𝑥̅ is a statistic.
Example 1
The following are the ages (in years) of all eight employees of a small company:
53  32  61  27  39  44  49  57
Find the mean age of these employees. Interpret your answer.
\( \sum X \) = 53 + 32 + 61 + 27 + 39 + 44 + 49 + 57 =
\( \mu = \dfrac{\sum X}{N} \) =
The average age of all eight employees of this company is ____ years old.
Example 2
Following are the list prices (in $) of eight homes randomly selected from all homes for sale in
a city.
245,670  176,200  360,280  272,440  450,394  310,160  393,610  374,480
Calculate and interpret the mean.
\( \sum x \) = 245,670 + … + 374,480 =
\( \bar{x} = \dfrac{\sum x}{n} \) =
The average
B) Ungrouped Frequency Distribution
• \( \bar{x} = \dfrac{\sum fx}{\sum f} \), where x : data value and f : frequency
Example 3
In a survey of 50 households, the number of children in each household are shown as below.
Number of children     0   1    2    3   4   5
Number of households   8   15   13   9   3   2
Find and interpret the mean for the above data.
\( \bar{x} = \dfrac{\sum fx}{\sum f} \) =
The average
C) Grouped Frequency Distribution
• \( \bar{x} = \dfrac{\sum fx}{\sum f} \), where x : class midpoint
Example 4
Determine the mean for the data below.
Score     f    x    fx
60 – 62   10
63 – 65   36
66 – 68   84
69 – 71   54
72 – 74   16
Total
\( \bar{x} = \dfrac{\sum fx}{\sum f} \) =
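A minimal sketch of the grouped-data mean, using the Example 4 classes and taking each class midpoint as the average of its two class limits. The function name `grouped_mean` is hypothetical.

```python
# Example 4 classes: (lower limit, upper limit, frequency).
classes = [(60, 62, 10), (63, 65, 36), (66, 68, 84), (69, 71, 54), (72, 74, 16)]


def grouped_mean(classes):
    """Mean of a grouped frequency distribution: sum(f * midpoint) / sum(f)."""
    total_f = sum(f for _, _, f in classes)
    total_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    return total_fx / total_f


print(grouped_mean(classes))
```

Because every value in a class is replaced by the class midpoint, the result is an estimate of the mean of the underlying raw scores rather than their exact mean.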
4.1.2 Median, 𝑥̃
• The midpoint of the ordered values.
• There are as many values above the median as below it in the data array.
• Properties:
o Unique
o Ratio, interval and ordinal-level data
o Open-ended frequency distribution
• Advantages:
o Not influenced by outliers.
o Preferred for data sets that contain outliers.
• Disadvantages:
o Data have to be arranged.
o Does not fully reflect the distribution.
o Unsuitable for use in further calculations.
o May not be truly representative if there are too few items.
A) Raw Data
• 𝑥̃ = the \( \left(\dfrac{n+1}{2}\right) \)th item (n is odd)
  𝑥̃ = the mean of the \( \left(\dfrac{n}{2}\right) \)th and \( \left(\dfrac{n}{2} + 1\right) \)th items (n is even)
Example 5
The following data relates to the marks obtained by 15 students. Find and interpret the median
value.
30, 35, 52, 52, 35, 40, 59, 60, 41, 46, 61, 65, 47, 70, 72
Rank the data:
30, 35, 35, 40, 41, 46, 47, 52, 52, 59, 60, 61, 65, 70, 72
n = 15 (odd), 𝑥̃ = the \( \left(\dfrac{15+1}{2}\right) \)th item = the 8th item = 52
Half of the group of students obtained less than or equal to 52 marks while the other half of them obtained more than or equal to 52 marks.
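The position rule for raw data can be sketched as below, using the marks from Example 5. The function name `median` is chosen here for illustration; Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Median by position: middle item (odd n) or mean of the two middle items (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # the ((n + 1) / 2)th item, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2  # (n/2)th and (n/2 + 1)th items


marks = [30, 35, 52, 52, 35, 40, 59, 60, 41, 46, 61, 65, 47, 70, 72]
print(median(marks))
```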
Example 6
Find the median of the data speeds (in Mbps) of smartphones from six different
telecommunication companies. Interpret the finding.
38.5 55.6 22.4 14.1 23.1 24.5
Rank the data:
n = 6 (even), 𝑥̃ = between the \( \left(\dfrac{6}{2}\right) \)th and \( \left(\dfrac{6}{2} + 1\right) \)th items
Half of the telecommunication
B) Ungrouped Frequency Distribution
• Determined using the cumulative frequency distribution table.
• Position of median is the same as in raw data set.
Example 7
Given below are quiz scores (out of 10) obtained by 150 students. Determine the median.
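Score   f
6       30
7       52
8       26
9       22
10      20
𝑥̃ = between the \( \left(\dfrac{150}{2}\right) \)th and \( \left(\dfrac{150}{2} + 1\right) \)th items
C) Grouped Frequency Distribution
• Estimate from cumulative frequency polygon.
• 𝑥̃ = \( x_{n/2} \), i.e., the value at 50% of the total (regardless of even or odd n)
[Figure: cumulative frequency polygon with n and n/2 marked on the F axis and 𝑥̃ read off the class boundary axis]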
Example 8
Estimate the median for the data in Example 4 using cumulative frequency polygon.
Cumulative frequency distribution
Class Boundary   F
Less than 59.5   0
Less than 62.5   10
Less than 65.5   46
Less than 68.5   130
Less than 71.5   184
Less than 74.5   200
𝑥̃ = the \( \left(\dfrac{200}{2}\right) \)th item =
[Figure: Cumulative Frequency Polygon for the Score; vertical axis: F (0 to 200); horizontal axis: Score, class boundaries 59.5 to 74.5]
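Reading a value off a cumulative frequency polygon is equivalent to linear interpolation between the two plotted points that bracket the required position. The sketch below applies this to the Example 8 table; the function name `value_at` is hypothetical, and a reading taken from a hand-drawn graph will only agree approximately.

```python
from bisect import bisect_left


def value_at(position, boundaries, cumulative):
    """Interpolate the value at a given cumulative-frequency position."""
    i = bisect_left(cumulative, position)
    if i == 0:
        return boundaries[0]
    lo_b, hi_b = boundaries[i - 1], boundaries[i]
    lo_f, hi_f = cumulative[i - 1], cumulative[i]
    return lo_b + (position - lo_f) / (hi_f - lo_f) * (hi_b - lo_b)


boundaries = [59.5, 62.5, 65.5, 68.5, 71.5, 74.5]
cumulative = [0, 10, 46, 130, 184, 200]

# Median: the value at the (n/2)th position, with n = 200.
estimated_median = value_at(cumulative[-1] / 2, boundaries, cumulative)
print(round(estimated_median, 2))
```

The same helper can be used for quartiles, deciles, and percentiles by changing the position passed in.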
Example 9
The following table gives the frequency distribution of the workers of a factory according to
their average monthly income in a certain year. Estimate the median value using cumulative
frequency polygon.
Income (RM)      f
500 –< 1000      28
1000 –< 1500     34
1500 –< 2000     46
2000 –< 2500     32
2500 –< 3000     24
3000 –< 3500     12

Income (RM)       F
Less than 500
Less than 1000
Less than 1500
Less than 2000
Less than 2500
Less than 3000
Less than 3500
𝑥̃ =
[Figure: Cumulative Frequency Polygon for the Average Monthly Income in a Certain Year; vertical axis: F (0 to 180); horizontal axis: Income (RM), 500 to 3500]
4.1.3 Mode, 𝑥̂
• The value that appears most frequently.
• Useful for nominal and ordinal data.
• Advantages:
o Simple and easy to understand.
o Not affected by extreme values.
o Can be found for open-ended classes.
o For quantitative and qualitative variables.
o Can be the value of an actual item in the distribution.
• Disadvantages:
o May not exist and may not be unique.
o Not suitable for further calculations or mathematical analysis.
o Data have to be arranged.
• A distribution with one mode is called unimodal.
• When two values occur with the same (highest) frequency, the distribution is called bimodal.
• If more than two modal values occur, it is said to be multimodal.
A)
Raw data
Example 10
The following data gives the speed (in km per hour) of the cars that were stopped for speeding
violations at two locations, A and B.
Location A: 125, 130, 120, 135, 127, 125, 118, 125
Location B: 115, 120, 110, 113, 112, 125, 118, 123
Determine the mode for the two locations.
𝑥̂A =
The most frequent
𝑥̂B =
The distribution of speed that are stopped for speeding violations in location B
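A small sketch of finding the mode(s) of raw data with a frequency count, applied to the speeds in Example 10. The function name `modes` is hypothetical, and treating a data set in which no value repeats as having no mode is a convention assumed here.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest frequency, or [] if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once: treated here as "no mode"
    return sorted(v for v, c in counts.items() if c == top)


location_a = [125, 130, 120, 135, 127, 125, 118, 125]
location_b = [115, 120, 110, 113, 112, 125, 118, 123]
print(modes(location_a), modes(location_b))
```

The same function works for qualitative data, since it only counts occurrences.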
Example 11
A printing press turns out 5 impressions:
‘very sharp’, ‘sharp’, ‘sharp’, ‘sharp’, ‘blurred’.
Then the modal value is
B) Ungrouped Frequency Distribution
• Choose the item with the highest frequency.
Example 12
i)
Days of birth   freq
Monday          22
Tuesday         10
Wednesday       32
Thursday        17
Friday          13
Saturday        32
Sunday          14
𝑥̂ =
The most frequent
ii)
Height (cm)   f
155           3
156           7
157           10
158           15
159           16
160           9
161           2
𝑥̂ =
The most frequent
C) Grouped Frequency Distribution
• If frequency polygon or curve is given, 𝑥̂ is the x value with the highest peak.
• Plot a histogram.
[Figure: histogram with the modal class marked; vertical axis: f; 𝑥̂ read off the horizontal axis]
Example 13
Estimate the mode using the histogram
Weight (gram)   No. of packages
450 –< 452      11
452 –< 454      26
454 –< 456      34
456 –< 458      24
458 –< 460      20
From graph,
𝑥̂ =
The most frequent
[Figure: Histogram for the Weight; vertical axis: f (0 to 36); horizontal axis: weight, 450 to 460]
• Considerations for Choosing a Measure of Central Tendency
4.1.4 The Relative Positions of the Mean, Median, and Mode
a) Symmetric Distribution
o Zero skewness
𝑥̅ = 𝑥̃ = 𝑥̂
b) Positively Skewed
o Skewed to the right
𝑥̂ < 𝑥̃ < 𝑥̅
c) Negatively Skewed
o Skewed to the left
𝑥̅ < 𝑥̃ < 𝑥̂
4.1.5 Fractiles and Quartiles
• Measures of location/position.
• Include not only central location but also any position based on the number of equal
divisions in a given distribution
• Median (𝑥̃) – divide the distribution into 2 equal parts
• Quartiles (Qi) – divide into 4 equal parts
• Deciles (Di) – divide into 10 equal parts
• Percentiles (Pk) – divide into 100 equal parts
• Q2 = D5 = P50 = 𝑥̃
A) Raw data/Ungrouped frequency distribution
• Organize the data into ascending order and calculate the location (exact location):
o Q1 = \( \dfrac{1}{4}(\Sigma f + 1) \)th item
o Q3 = \( \dfrac{3}{4}(\Sigma f + 1) \)th item
o Di = \( \dfrac{i}{10}(\Sigma f + 1) \)th item, i = 1, 2, 3, …, 8, 9, 10
o Pk = \( \dfrac{k}{100}(\Sigma f + 1) \)th item, k = 1, 2, …, 99, 100
B) Grouped Frequency Distribution
• Location (approximate location):
o Q1 = \( \dfrac{1}{4}\Sigma f \)th item
o Q3 = \( \dfrac{3}{4}\Sigma f \)th item
o Di = \( \dfrac{i}{10}\Sigma f \)th item
o Pk = \( \dfrac{k}{100}\Sigma f \)th item
• Use the ogive / cumulative frequency polygon to estimate the quartiles, deciles, and percentiles.
Example 14
For the following data, determine the Q1, Q3, D7, P59.
46  47  49  49  51  53  54  54  55  55  59
Q1 = \( \dfrac{1}{4}(11 + 1) \)th item
Q3 = \( \dfrac{3}{4}(11 + 1) \)th item
D7 = \( \dfrac{7}{10}(11 + 1) \)th item
P59 = \( \dfrac{59}{100}(11 + 1) \)th item
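The location formulas often give a fractional position, such as the 8.4th item. The sketch below handles this by linear interpolation between the two neighbouring items, which is an assumption made here; the notes do not state which rule to apply. The function names `item_at` and `fractile` are hypothetical.

```python
def item_at(ordered, position):
    """Value at a 1-based, possibly fractional position (linear interpolation)."""
    whole = int(position)
    frac = position - whole
    value = ordered[whole - 1]
    if frac > 0:
        value += frac * (ordered[whole] - ordered[whole - 1])
    return value


def fractile(ordered, fraction):
    """Item at position fraction * (n + 1) in data already sorted ascending."""
    return item_at(ordered, fraction * (len(ordered) + 1))


data = [46, 47, 49, 49, 51, 53, 54, 54, 55, 55, 59]
q1 = fractile(data, 1 / 4)       # 3rd item
q3 = fractile(data, 3 / 4)       # 9th item
d7 = fractile(data, 7 / 10)      # 8.4th item
p59 = fractile(data, 59 / 100)   # 7.08th item
print(q1, q3, d7, p59)
```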
Example 15
A company selling a consumer product directly to retail outlets has collected the following
information:
Number of Order   No. of salesman
10 – 19           3
20 – 29           8
30 – 39           16
40 – 49           22
50 – 59           19
60 – 69           8
70 – 79           4
Determine the Quartiles, D2, P66.

Class Boundary   F
Less than 9.5    0
Less than 19.5   3
Less than 29.5   11
Less than 39.5   27
Less than 49.5   49
Less than 59.5   68
Less than 69.5   76
Less than 79.5   80

Q1 = \( \dfrac{80}{4} \)th =
Q3 = \( \dfrac{3(80)}{4} \)th =
D2 = \( \dfrac{2(80)}{10} \)th =
P66 = \( \dfrac{66(80)}{100} \)th =
[Figure: Cumulative Frequency Polygon for the Number of Order; vertical axis: F (0 to 80); horizontal axis: Number of Order, class boundaries 9.5 to 79.5]
4.2 Measure of Dispersion
• Measure of variability - describes diversity and variability in the distribution of a variable.
• Nominal variable: Index of Qualitative Variation
• Interval/Ratio: Two main types:
A) Distance measures:
o Measure the distance between any two significant positional values
o Range, Interquartile Range.
B) Average Deviation Measures
o Measure the average deviation of all the data from some measure of central tendency.
o Variance, Standard Deviation and Coefficient of Variation.
4.2.1 Index of Qualitative Variation
• Used for nominal variables, to compare the diversity of a variable in different groups or to find out whether a group has become more diverse over time.
• Based on the ratio of the total number of differences in the distribution to the maximum
number of possible differences within the same distribution.
• Vary from 0.00 to 1.00.
• When all the cases in the distribution are in one category, there is no variation or
diversity, IQV = 0.00.
• When the cases in the distribution are distributed evenly across the categories, there is
a maximum of variability or diversity, IQV = 1.00.
• \( \mathrm{IQV} = \dfrac{K(100^2 - \Sigma Pct^2)}{100^2 (K - 1)} \)
where K : Number of categories
ΣPct² : Sum of all squared percentages in the distribution
Example 16
The following table shows the top five ethnic groups for two states by percentage, 2010.
Comment and compare the diversity for the ethnicity between the following two states.
Ethnic Group                          Maine (%)   Hawaii (%)
White                                 97.3        29.7
Latino                                1.3         10.7
Asian                                 1.1         46.3
Native Hawaiian or Pacific Islander   -           11.9
Other                                 0.3         1.5
Total                                 100.0       100.0

Ethnic Group                          Maine (%)   (%)²
White                                 97.3
Latino                                1.3
Asian                                 1.1
Native Hawaiian or Pacific Islander   -           -
Other                                 0.3
Total                                 100.0

IQVMaine = \( \dfrac{K(100^2 - \Sigma Pct^2)}{100^2 (K - 1)} \) =
The number of ethnic differences in Maine is 7% of the maximum possible differences.

Ethnic Group                          Hawaii (%)   (%)²
White                                 29.7
Latino                                10.7
Asian                                 46.3
Native Hawaiian or Pacific Islander   11.9
Other                                 1.5
Total                                 100.0

IQVHawaii =
The number of ethnic differences in Hawaii is 84% of the maximum possible differences.
____ is considerably more ethnic variation than in ____
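The IQV formula can be checked with a short sketch using the Example 16 percentages. Two assumptions are made here: the "-" entry for Maine is treated as 0, and K is taken as 5 for both states. The function name `iqv` is hypothetical.

```python
def iqv(percentages):
    """Index of Qualitative Variation from the category percentages."""
    k = len(percentages)
    sum_sq = sum(p ** 2 for p in percentages)
    return k * (100 ** 2 - sum_sq) / (100 ** 2 * (k - 1))


# Example 16 percentages; Maine's "-" entry is entered as 0.0 (assumption).
maine = [97.3, 1.3, 1.1, 0.0, 0.3]
hawaii = [29.7, 10.7, 46.3, 11.9, 1.5]
print(round(iqv(maine), 2), round(iqv(hawaii), 2))
```

Under these assumptions the rounded values agree with the 7% and 84% quoted in the interpretations above.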
4.2.2 Range
• Influenced by an extreme value(s), especially if they are unrepresentative values.
• Easy to compute and understand.
A) Raw Data/Ungrouped Frequency Distribution
• Range = highest value – lowest value
B) Grouped Frequency Distribution
• Range = Upper class boundary of the last class – Lower class boundary of the first class
Example 17
Find the range for the below data.
       Data Set                          Range
i)     {2, 2, 3, 4, 5}
ii)    {2, 5, 7, 10, 100}
iii)   {-4, -8, 12, 10, 17, 7, 1, -3}
Example 18
Find the range for the data in Example 15.
Lowest value =
Highest value =
Range =
4.2.3 Interquartile Range and Quartile Deviation
• Interquartile Range
o Measures the spread of the middle 50 percent of the observations
o IR = Q3 – Q1
• Quartile Deviation
o QD = \( \dfrac{Q_3 - Q_1}{2} \)
o The smaller the QD, the greater the concentration of the middle half of the observations in the data.
• Can be computed for the open-ended classes.
• Not influenced by the extreme values.
• Not fully representative of a set of measurements as it is not based on all the information available.
Example 18
For a set of heights for a group of students, the upper quartile is 24cm and the lower quartile is
10cm. What is the quartile deviation? Give an interpretation for the finding.
IR = Q3 – Q1 =
[The height of the middle 50 percent of the students varied with a spread of ____.]
QD = \( \dfrac{Q_3 - Q_1}{2} \) =
The height of half of the middle 50 percent of the students varied with a spread of
4.2.4 Standard Deviation and Variance
• Variance
o The arithmetic mean of the squared deviations from the mean.
o All values are used.
o Influenced by extreme values.
A) Raw data
• Population variance, \( \sigma^2 = \dfrac{\Sigma (X - \mu)^2}{N} = \dfrac{\Sigma X^2}{N} - \mu^2 \), where \( \mu = \dfrac{\Sigma X}{N} \)
• Sample variance, \( s^2 = \dfrac{\Sigma (x - \bar{x})^2}{n - 1} = \dfrac{\Sigma x^2 - \frac{(\Sigma x)^2}{n}}{n - 1} \), where \( \bar{x} = \dfrac{\Sigma x}{n} \)
* The first form is the deviation formula; the second is the direct formula.
B) Grouped Frequency Distribution
• Population variance, \( \sigma^2 = \dfrac{\Sigma f (X - \mu)^2}{\Sigma f} = \dfrac{\Sigma f X^2}{\Sigma f} - \mu^2 \)
• Sample variance, \( s^2 = \dfrac{\Sigma f (x - \bar{x})^2}{\Sigma f - 1} = \dfrac{\Sigma f x^2 - \frac{(\Sigma f x)^2}{\Sigma f}}{\Sigma f - 1} \)
* X, x : class midpoint, f : class frequency
* The first form is the deviation formula; the second is the direct formula.
• Standard Deviation
o The square root of the variance.
o Population standard deviation, \( \sigma = \sqrt{\sigma^2} \).
o Sample standard deviation, \( s = \sqrt{s^2} \).
• For a data set with a large amount of variation, the data values will, on the average, be far from the mean; the standard deviation will be large.
• For a data set with a small amount of variation, the data values will, on the average, be close to the mean; the standard deviation will be small.
Example 19
Refer to Example 1. Find the population mean and standard deviation.
Ages (in years) of all eight employees of a small company:
53  32  61  27  39  44  49  57
\( \Sigma X \) = 362, μ = 45.25 years old
\( \sigma^2 = \dfrac{\Sigma (X - \mu)^2}{N} \) =
Example 20
The hourly wages earned by a sample of five students are $7, $5, $11, $8, $6. Find the variance and the standard deviation.
\( \Sigma x^2 \) = 295, \( \Sigma x \) = 37
\( s^2 = \dfrac{\Sigma x^2 - \frac{(\Sigma x)^2}{n}}{n - 1} \) =
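A minimal sketch of the direct formula for the sample variance, applied to the five hourly wages in Example 20. The function name `sample_variance` is hypothetical; the standard library's `statistics.variance` computes the same quantity.

```python
from math import sqrt


def sample_variance(values):
    """Sample variance by the direct formula: (sum x^2 - (sum x)^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(v * v for v in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


wages = [7, 5, 11, 8, 6]
s2 = sample_variance(wages)
s = sqrt(s2)
print(round(s2, 4), round(s, 4))
```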
Example 21
Calculate the sample standard deviation of the following set of data.
Score     No. of students, f   Midpoint, x   fx     fx²
60 – 62   10
63 – 65   36                                 2304   147456
66 – 68   84                                 5628   377076
69 – 71   54                                 3780   264600
72 – 74   16                                 1168   85264
          200
4.2.5 Coefficient of Deviation / Coefficient of Variation
• Ratio of the standard deviation to the arithmetic mean.
• Expressed as a percentage.
• \( CV = \dfrac{\sigma}{\mu} \times 100\% \) for population
• \( CV = \dfrac{s}{\bar{x}} \times 100\% \) for sample
• Used to compare the variability between two or more different distributions or when the means differ markedly.
Example 22
Consider the measurement on yield and plant height of a paddy variety. The mean and standard
deviation for yield are 50kg and 10kg respectively. The mean and standard deviation for plant
height are 55cm and 5cm respectively. Compare and comment on the variability of the
distributions.
CVyield =
CVheight =
The distribution of ____ of the paddy is more disperse/variable as compared to the distribution of the ____ of the paddy.
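The comparison in Example 22 can be sketched as follows. The function name `coefficient_of_variation` is hypothetical; the figures are the means and standard deviations given in the example.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100


cv_yield = coefficient_of_variation(10, 50)   # yield: mean 50 kg, s.d. 10 kg
cv_height = coefficient_of_variation(5, 55)   # height: mean 55 cm, s.d. 5 cm
print(round(cv_yield, 2), round(cv_height, 2))
```

Because the CV is unit-free, the two distributions can be compared even though one is measured in kg and the other in cm.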
• Considerations for Choosing a Measure of Variation
4.3 Measure of Skewness
• Measurement of the lack of symmetry of the distribution.
4.3.1 Pearsonian Coefficient of Skewness
• Pearson first coefficient of skewness,
\( Sk_{(1)} = \dfrac{\text{mean} - \text{mode}}{\text{standard deviation}} \)
• Pearson second coefficient of skewness,
\( Sk_{(2)} = \dfrac{3(\text{mean} - \text{median})}{\text{standard deviation}} \)
\( Sk_{(2)} = \dfrac{3(\bar{x} - \tilde{x})}{s} \) for sample and \( Sk_{(2)} = \dfrac{3(\mu - \tilde{\mu})}{\sigma} \) for population
• Range from –3.00 up to 3.00.
• A value of 0 indicates a symmetric distribution.
4.3.2 Quartile Measure of Skewness
• \( Sk_Q = \dfrac{Q_3 + Q_1 - 2\,\text{median}}{\text{Interquartile range}} \)
• Takes values between –1 and +1.
• Convenient to use when the median and the quartiles are used to describe the distribution.
Example 23
The lengths of stay by patients on the cancer floor of a local hospital were organized into a
frequency distribution. The mean length of stay was 28 days, the median 25 days, and the
standard deviation was found to be 4.2 days.
Calculate the coefficient of skewness. Interpret the result.
\( Sk = \dfrac{3(\text{mean} - \text{median})}{\text{standard deviation}} \) =
The distribution for the lengths of stay by patients on the cancer floor of a local hospital is
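As a quick sketch, Pearson's second coefficient for Example 23 can be evaluated directly from the three summary figures. The function name `pearson_second_skewness` is hypothetical.

```python
def pearson_second_skewness(mean, median, std_dev):
    """Pearson's second coefficient of skewness: 3(mean - median) / s.d."""
    return 3 * (mean - median) / std_dev


sk = pearson_second_skewness(mean=28, median=25, std_dev=4.2)
print(round(sk, 2))
```

A positive result indicates that the mean lies above the median, i.e., a longer tail on the right side.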
4.4 Box Plot/Box and Whisker Plot
• A graphical display, based on quartiles, that helps to picture a set of data.
• Five values are needed:
i) Minimum value
ii) First Quartile
iii) Median
iv) Third Quartile
v) Maximum Value
[Figure: box plot showing the box and the two whiskers, with the values (i) to (v) marked]
• Right-skewed: the right side whisker is much longer than the left side whisker.
• Perfectly symmetrical: the length of the left whisker will equal the length of the right whisker, and the median line will divide the box in half.
• Left-skewed: the left side whisker is much longer than the right side whisker.
Example 24
In a study of memory recall times, a series of stimulus words was shown to a subject on a
computer screen. For each word, the subject was instructed to recall either a pleasant or an
unpleasant memory associated with that word.
Successful recall of a memory was indicated by the subject pressing a bar on the computer
keyboard. Table below shows the recall times (in seconds) for 11 pleasant and 7 unpleasant
memories.
Pleasant memory:    1.07  1.22  1.63  2.12  2.56  2.93  3.03  3.22  4.63  5.55  6.17
Unpleasant memory:  1.45  1.9   2.32  2.43  2.57  3.87  4.33
Pleasant memory:
n = 11
Minimum =
Q1 =
Median =
Q3 =
Maximum =
Unpleasant memory:
n=7
Minimum =
Q1 =
Median =
Q3 =
Maximum =
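The five values needed for each box plot can be sketched with the (n + 1) location rule from Section 4.1.5. Two assumptions apply: the recall times are assigned to the two groups as read from the flattened table (11 pleasant, 7 unpleasant), and fractional positions, which do not arise for these sample sizes, would be interpolated linearly. The function names are hypothetical.

```python
def item_at(ordered, position):
    """Value at a 1-based, possibly fractional position (linear interpolation)."""
    whole = int(position)
    frac = position - whole
    value = ordered[whole - 1]
    if frac > 0:
        value += frac * (ordered[whole] - ordered[whole - 1])
    return value


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using (n + 1)-based positions."""
    ordered = sorted(values)
    n = len(ordered)
    return (ordered[0],
            item_at(ordered, (n + 1) / 4),
            item_at(ordered, (n + 1) / 2),
            item_at(ordered, 3 * (n + 1) / 4),
            ordered[-1])


# Recall times (seconds) as read from the table; the grouping is an assumption.
pleasant = [1.07, 1.22, 1.63, 2.12, 2.56, 2.93, 3.03, 3.22, 4.63, 5.55, 6.17]
unpleasant = [1.45, 1.9, 2.32, 2.43, 2.57, 3.87, 4.33]
print(five_number_summary(pleasant))
print(five_number_summary(unpleasant))
```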
[Figure: Boxplots for the Recall Time, one for Unpleasant memory and one for Pleasant memory; horizontal axis: 1 to 6]
The distributions for the recall times for ____
However, the distribution for ____ is more disperse/variable as compared to unpleasant memory.
4.5 Reliability and Validity
• All measurements, especially measurements of behaviours, opinions, and constructs, are subject to fluctuations (error) that can affect the measurement’s reliability and validity.
• Reliability
o Measurement of consistency and stability of test scores.
o Prerequisite for validity.
o Analogous to variance (low reliability = high variance)
o Reliability coefficient ≥ 0.70 is considered to have good reliability; if below 0.50, it would not be considered a very reliable test.
Type | Definition | Measured by
Over time (test-retest reliability) | Administer the same test twice over a period of time to the same individuals. | Correlation between scores at Time 1 and Time 2
Across items (internal consistency) | Consistency of people’s responses across the items on a multiple-item measure. | Cronbach’s alpha, α
Across different researchers (inter-rater reliability) | The extent to which different observers are consistent in their judgments. | Cronbach’s alpha (quantitative) or Cohen’s kappa, κ (categorical)
Alternate forms (parallel-forms reliability) | Administer different versions of an assessment tool (different in wording but both contain items that probe the same construct) to the same group of individuals. | Correlation between the responses to the pairs of questions
• Validity
o Suitability or meaningfulness of the measurement.
o Analogous to unbiasedness (valid = unbiased).
Type | Definition | Measured by
Content | The extent to which the content of the test matches the instructional objectives. | Correlating experts’ judgment, or item-item or item-total correlation
Criterion | The extent to which scores on the test are in agreement with (concurrent validity) or predict (predictive validity) an external criterion. | Correlating the test with the criteria during data collection (concurrent validity) or at some point in the future (predictive validity)
Construct | The extent to which an assessment corresponds to other variables, as predicted by some rationale or theory. | Factor analysis, or correlating with other theoretical measure with which the developing instrument should correlate
• Measures to ensure validity of a research:
o Appropriate time scale for the study
o Appropriate methodology
o Most suitable sampling method
o The respondents must not be pressured in any way to select specific choices among the answer sets
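Example
Sample raw data: 13, 24, 44, 56, 67, 70, 82
After entering the data:
\( \Sigma x \) = 356, \( \Sigma x^2 \) = 21930, n = 7
\( \bar{x} = \dfrac{\Sigma x}{n} = \dfrac{356}{7} = 50.8571 \)
\( s = \sqrt{\dfrac{\Sigma x^2 - \frac{(\Sigma x)^2}{n}}{n - 1}} = \sqrt{\dfrac{21930 - \frac{(356)^2}{7}}{7 - 1}} = 25.2483 \)

Sample grouped data:
Class     f   Midpoint, x
10-<15    4   12.5
15-<20    7   17.5
20-<25    3   22.5
After entering the data:
\( \Sigma fx \) = 240, \( \Sigma fx^2 \) = 4287.5, \( \Sigma f \) = 14
\( \bar{x} = \dfrac{\Sigma fx}{\Sigma f} = \dfrac{240}{14} = 17.1429 \)
\( s = \sqrt{\dfrac{\Sigma fx^2 - \frac{(\Sigma fx)^2}{\Sigma f}}{\Sigma f - 1}} = \sqrt{\dfrac{4287.5 - \frac{(240)^2}{14}}{14 - 1}} = 3.6502 \)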
Chapter 5 REGRESSION
5.1 Regression Analysis
• A prediction model using one or more independent/explanatory/predictor variables to predict the values of a dependent/response/outcome variable.
• Explain and predict the dependent variable on the basis of information on the independent variable(s).
• Bivariate regression/Simple linear regression
o Examines changes in the dependent variable as a function of changes or differences in values of ONE independent variable.
o E.g.
i) What is the relationship between education and income? For each year of education, how much does income increase (on average)?
ii) What will be the rate of return on investment? For each dollar invested, how much will sales increase?
iii) For a political candidate, how many votes will he or she get for each dollar spent on advertising?
• Multiple linear regression
o Attempts to model the relationship between two or more independent variables and a dependent variable by fitting a linear equation to observed data.
o E.g.
i) Do age and IQ scores effectively predict GPA?
ii) Do weight, height, and age explain the variance in cholesterol levels?
• Nonlinear Regression
o Observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables.
5.2 Scatter Diagram
• A plot of paired observations.
• Illustrates whether there is
o any relationship between the DV and IVs;
o a positive / negative relationship;
o a linear / non-linear relationship.
• Positive relationship: An increase in IV will lead to an increase in DV, and vice versa.
• Negative relationship: An increase in IV will lead to a decrease in DV, and vice versa.
Example 1
The following data shows the educational attainment (in year), X and the Internet usage (in hour)
per week, Y, for a sample of 10 individuals. Draw a scatter diagram of the two variables and
comment on the graph.
X | 10 | 9 | 12 | 13 | 19 | 11 | 16 | 23 | 14 | 21
Y | 1 | 0 | 3 | 4 | 7 | 2 | 6 | 9 | 5 | 8
[Figure: Scatter Diagram for Educational Attainment and Internet Usage. Horizontal axis: Educational Attainment (year); vertical axis: Internet Usage per Week (hour).]
There exists a ______ relationship between ______.
5.3 Simple Linear Equation
• The relationship between the variables is linear, i.e., the model equation is a straight line.
• The general form: 𝑌̂ = a + bX
  𝑌̂ – predicted value of the Y variable.
  X – any value of the independent variable.
  * 𝑌̂ can also be denoted as Y′.
• In the general form 𝑌̂ = a + bX:
  a : Y-intercept; the estimated value of Y when X = 0, i.e., the point where the regression
  line crosses the Y-axis.
  b : slope of the regression line; the average change in 𝑌̂ for each change of one unit in X.
  o b positive – positive linear relationship.
  o b negative – negative linear relationship.
  o Be careful when interpreting a. If X = 0 is outside the range of X in the data set,
    the prediction may not carry much credibility.
• By Least Squares Method:
  o Slope of the regression line, b :
    $b = \dfrac{n(\Sigma XY) - (\Sigma X)(\Sigma Y)}{n(\Sigma X^2) - (\Sigma X)^2}$
    n : the total number of observations in (X, Y)
  o Y-intercept, a :
    $a = \dfrac{\Sigma Y}{n} - b\,\dfrac{\Sigma X}{n}$ or $a = \bar{Y} - b\bar{X}$
    * $\bar{Y}$ and $\bar{X}$ are the means of Y and X, respectively.
• a and b : estimated regression coefficients or regression coefficients.
• The model 𝑌̂ = a + bX is also called the least-squares regression line of Y on X.
• Assumptions:
  o For each value of X, there is a group of Y values, and these Y values are normally
    distributed.
  o The means of these normal distributions of Y values all lie on the regression line.
  o The standard deviations of these normal distributions are equal.
  o The Y values are statistically independent.
• Two types of estimation using the regression equation:
  1. Interpolation estimate
     o Estimate the values of Y within the range of the observations of X in the data set.
     o More accurate and more reliable.
  2. Extrapolation estimate
     o Estimate the values of Y outside the range of the observations of X in the data set.
     o Most commonly used for forecasting using a time series.
     o May be less accurate and unreliable to a certain extent.
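The least-squares formulas for b and a can be sketched in a few lines of Python. This is only an illustration: the function name `least_squares` and the variable names are assumptions made here, and the data are the ten (X, Y) pairs from Example 1.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line Y-hat = a + bX using the sums-based formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: b = [n(ΣXY) - (ΣX)(ΣY)] / [n(ΣX²) - (ΣX)²]
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a = ΣY/n - b ΣX/n
    a = sum_y / n - b * sum_x / n
    return a, b


# Example 1 data: educational attainment (years) and Internet usage (hours per week)
education = [10, 9, 12, 13, 19, 11, 16, 23, 14, 21]
usage = [1, 0, 3, 4, 7, 2, 6, 9, 5, 8]

a, b = least_squares(education, usage)
print(f"Y-hat = {a:.4f} + {b:.4f} X")
```

Carrying full precision gives b ≈ 0.6166 and a ≈ –4.6252. Substituting the rounded slope b = 0.6166 into the intercept formula gives the –4.6257 quoted in these notes, so small differences in the last decimal places are a rounding effect only.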
Example 2
Find the least squares equation for the Internet usage on educational attainment based on the data
in Example 1. Interpret the regression coefficients obtained.
X | Y | XY | X²
10 | 1 | |
9 | 0 | 0 | 81
12 | 3 | 36 | 144
13 | 4 | 52 | 169
19 | 7 | 133 | 361
11 | 2 | 22 | 121
16 | 6 | 96 | 256
23 | 9 | 207 | 529
14 | 5 | 70 | 196
21 | 8 | 168 | 441
$b = \dfrac{n(\Sigma XY) - (\Sigma X)(\Sigma Y)}{n(\Sigma X^2) - (\Sigma X)^2}$
$a = \dfrac{\Sigma Y}{n} - b\,\dfrac{\Sigma X}{n}$
The least squares line: 𝑌̂ = ______
The regression coefficient b is 0.6166 and is interpreted as
“The average change in the estimated ______ with every 1 year of change in ______”
or
“With each additional ______ of ______, ______ is predicted to ______.”
The Y-intercept a is -4.6257; it may not have a clear substantive interpretation.
(The estimated ______ is ______ when ______.)
Example 3
The age of the respondents in the sample from Example 1 are recorded as follows.
Age, X | 55 | 60 | 45 | 35 | 23 | 40 | 22 | 27 | 41 | 30
Internet Usage per week (hour), Y | 1 | 0 | 3 | 4 | 7 | 2 | 6 | 9 | 5 | 8
i) Refer to the scatter diagram for age and Internet usage per week, and interpret the diagram.
   The diagram suggests there is a ______ relationship between ______.
ii) Obtain a least squares regression line for Internet usage per week on age.
   $b = \dfrac{n(\Sigma XY) - (\Sigma X)(\Sigma Y)}{n(\Sigma X^2) - (\Sigma X)^2}$
   $a = \dfrac{\Sigma Y}{n} - b\,\dfrac{\Sigma X}{n}$
   The least squares regression line: ______
iii) Interpret the regression coefficients obtained from ii).
   The average change in the estimated ______ with every 1 ______ of change in the ______
   or With each additional ______
   The estimated ______
iv) Predict the Internet usage per week for a respondent of age
   (1) 50;
   (2) 20.
   Comment on the reliability and accuracy of each estimate.
   (1) 𝑌̂ = 12.2641 – 0.2054(50)
   The value of 50 falls ______ the range of the data set, hence the estimate is obtained
   by the technique of ______, thus it is considered as ______.
   (2) 𝑌̂ = 12.2641 – 0.2054(20)
   The value of 20 falls ______ the range of the data set, hence the estimate is obtained
   by the technique of ______, thus it is considered as ______.
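The two predictions in part iv) can be checked with a short sketch. It assumes the fitted line 𝑌̂ = 12.2641 – 0.2054X quoted above; the helper names `predict` and `estimate_type` are illustrative only. An X value is treated as an interpolation when it lies between the smallest and largest observed ages, and as an extrapolation otherwise.

```python
# Observed ages from Example 3 (they define the range of X in the data set)
ages = [55, 60, 45, 35, 23, 40, 22, 27, 41, 30]

INTERCEPT = 12.2641   # a, as quoted in the notes
SLOPE = -0.2054       # b, as quoted in the notes


def predict(age):
    """Predicted Internet usage per week (hours) from the fitted line."""
    return INTERCEPT + SLOPE * age


def estimate_type(x, observed_x):
    """Classify a prediction by whether x lies inside the observed range of X."""
    if min(observed_x) <= x <= max(observed_x):
        return "interpolation"
    return "extrapolation"


for age in (50, 20):
    print(age, round(predict(age), 4), estimate_type(age, ages))
```

The observed ages run from 22 to 60, which is what decides the classification of each prediction.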
5.4 Coefficient of Determination, r²
• The proportion of the total variation in the dependent variable (Y) that is explained or
  accounted for by the variation in the independent variable (X).
• Also known as goodness of fit.
• 0 ≤ r² ≤ 1.
• r² = 0: the DV cannot be predicted from the IV.
• r² = 1: the DV can be predicted without error from the IV.
• The closer the value is to 1 (or 100%), the better the fit of the regression model.
$r^2 = \left( \dfrac{n\Sigma XY - (\Sigma X)(\Sigma Y)}{\sqrt{[n\Sigma X^2 - (\Sigma X)^2][n\Sigma Y^2 - (\Sigma Y)^2]}} \right)^2$
Example 4
Calculate and interpret the coefficient of determination for Example 2 and Example 3.
Example 2
ΣX = 148, ΣY = 45, ΣXY = 794, ΣX² = 2398, ΣY² = 285, n = 10
$r^2 = \left( \dfrac{10(794) - (148)(45)}{\sqrt{[10(2398) - (148)^2][10(285) - (45)^2]}} \right)^2$
About ______ of the total variation in the ______ is explained or
accounted for by the variation in the ______.
Thus the regression line 𝑌̂ = -4.6257 + 0.6166X is ______.
Example 3
$r^2 = \left( \dfrac{10(1391) - (378)(45)}{\sqrt{[10(15798) - (378)^2][10(285) - (45)^2]}} \right)^2$
About ______ of the total variation in the ______ is explained or
accounted for by the variation in the ______.
Thus the regression line 𝑌̂ = 12.2641 – 0.2054X ______.
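The r² formula works directly from the summary sums, so the two calculations in Example 4 can be sketched as below. The function name `r_squared` is an assumption for illustration; the sums are those listed in the example.

```python
def r_squared(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Coefficient of determination from the summary sums."""
    numerator = n * sum_xy - sum_x * sum_y
    # Squaring the whole ratio removes the square root in the denominator
    denominator_squared = (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    return numerator ** 2 / denominator_squared


# Example 2: educational attainment (X) and Internet usage (Y)
r2_education = r_squared(10, 148, 45, 794, 2398, 285)
# Example 3: age (X) and Internet usage (Y)
r2_age = r_squared(10, 378, 45, 1391, 15798, 285)

print(f"r^2 (education) = {r2_education:.4f}")
print(f"r^2 (age)       = {r2_age:.4f}")
```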
5.5 Multiple Linear Regression
• Extension of bivariate regression used to examine the effect of two or more independent
  variables on the dependent variable.
• General form: 𝑌̂ = a + b1* X1 + b2* X2
  o where 𝑌̂ = the predicted value of the DV
  o X1 = the value of IV X1
  o X2 = the value of IV X2
• a = the Y-intercept; or the estimated value of Y when X1 = 0 and X2 = 0
• bi* = the partial slope of Y on Xi; the average change in Y with a unit change in a specific Xi,
  while controlling or holding constant the value of the other IV(s)
• If there is a curvilinear relation between the IV and DV, the model can be a polynomial
  regression model,
  $\hat{Y} = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_h X^h$.
  o Considered as multiple linear regression since it is linear in the regression coefficients
    $\beta_1, \beta_2, \dots, \beta_h$.
  o When h = 2, the model is called quadratic regression;
    h = 3 is a cubic regression;
    h = 4 is a quartic regression, and so on.
• Multiple coefficient of determination, R², measures the proportion of the total
  variation in the DV that is explained jointly by two or more IVs.
• Pearson’s multiple correlation coefficient, R, measures the linear relationship between the
  DV and the combined effect of two or more IVs.
Example 5
Refer to the previous examples, let educational attainment and age be the independent variables
and Internet usage per week be the dependent variable.
Let
Y : Internet usage per week (Usage)
X1 : Educational Attainment (Edu)
X2 : Age
Output from SPSS

Variables Entered/Removed(a)
Model | Variables Entered | Variables Removed | Method
1 | Age, Edu(b) | . | Enter
a. Dependent Variable: Usage
b. All requested variables entered.

Model Summary
Model | R | R Square | Adjusted R Square | Std. Error of the Estimate
1 | .9884(a) | .9769 | .9703 | .5220
a. Predictors: (Constant), Age, Edu

ANOVA(a)
Model | Sum of Squares | df | Mean Square | F | Sig.
1 Regression | 80.5924 | 2.0000 | 40.2962 | 147.8695 | .0000(b)
1 Residual | 1.9076 | 7.0000 | .2725 | |
1 Total | 82.5000 | 9.0000 | | |
a. Dependent Variable: Usage
b. Predictors: (Constant), Age, Edu

Coefficients(a)
Model | Unstandardized B | Unstandardized Std. Error | Standardized Beta | t | Sig.
1 (Constant) | -.6051 | 1.7175 | | -.3523 | .7350
1 Edu | .491 | .062 | .779 | 7.883 | .000
1 Age | -.057 | .023 | -.245 | -2.477 | .042
a. Dependent Variable: Usage
b1* = 0.49: the estimated Internet usage increases by 0.49 hours for each one-year increase in
educational attainment, holding age constant.
b2* = -0.06: the estimated Internet usage decreases by 0.06 hours with each ______ of increase in
age when educational attainment is held constant.
a = -0.61: the estimated Internet usage per week is -0.61 hours when both educational attainment
and age are 0 (not meaningful).
Multiple coefficient of determination, R² = 0.98
98% of the total variation in the Internet usage per week can be explained by the model containing
Educational Attainment and Age.
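The SPSS coefficients can be reproduced by hand for the two-predictor case. The sketch below is not how SPSS computes them internally; it simply solves the least-squares normal equations for two IVs using deviations from the means. The function name `fit_two_predictors` is an assumption for illustration.

```python
def fit_two_predictors(x1, x2, y):
    """Least-squares fit of Y-hat = a + b1*X1 + b2*X2; returns (a, b1, b2, R^2)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Sums of squares and cross-products about the means
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    syy = sum((w - my) ** 2 for w in y)
    # Solve the 2x2 normal equations by Cramer's rule
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = my - b1 * m1 - b2 * m2
    # Explained variation divided by total variation
    r2 = (b1 * s1y + b2 * s2y) / syy
    return a, b1, b2, r2


edu = [10, 9, 12, 13, 19, 11, 16, 23, 14, 21]
age = [55, 60, 45, 35, 23, 40, 22, 27, 41, 30]
usage = [1, 0, 3, 4, 7, 2, 6, 9, 5, 8]

a, b1, b2, r2 = fit_two_predictors(edu, age, usage)
print(f"Y-hat = {a:.4f} + {b1:.3f} Edu + {b2:.3f} Age, R^2 = {r2:.4f}")
```

The results should agree with the B column and the R Square value in the SPSS output above, up to rounding.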
5.6 Non-linear Regression
• Observational data are modeled by a function which is a non-linear combination of the model
  parameters and depends on one or more IVs.
• Some non-linear equations can be transformed to mimic a linear equation. If this happens,
  the non-linear equation is called “intrinsically linear”.
• Non-linear Transformation
Model | Equation | Model Transformation | Parameter Transformation
Standard Linear | y = α + βx | None | NA
Power | y = ax^b | log y = log a + b log x; Y = log y, X = log x | α = log a, β = b
Logarithmic | y = ln(ax^b) | y = ln a + b(ln x); Y = y, X = ln x | α = ln a, β = b
Exponential | y = ae^(bx) | ln y = ln a + bx; Y = ln y, X = x | α = ln a, β = b
Reciprocal | y = 1/(a + bx) | 1/y = a + bx; Y = 1/y, X = x | α = a, β = b
Square Root | y = 1/(a + bx)² | 1/√y = a + bx; Y = 1/√y, X = x | α = a, β = b
Square Root | y = a + b√x | Y = y, X = √x | α = a, β = b
Example 6
Fit the data using a suitable regression model.
Dose | 0 | 1.3 | 2.8 | 5.0 | 10.2 | 16.5 | 21.3 | 31.8 | 52.2
Response | 0.1 | 0.5 | 0.9 | 2.6 | 7.1 | 12.3 | 15.3 | 20.4 | 24.2
Plot a scatter diagram.
There is a possible non-linear relationship between the variables.
Result from statistical software:
Model Summary and Parameter Estimates
Dependent Variable: response
(R Square, F and Sig. belong to the Model Summary; Constant, b1, b2 and b3 are the Parameter Estimates.)
Equation | R Square | F | Sig. | Constant | b1 | b2 | b3
Linear | 0.9294 | 92.0945 | 0.0000 | 1.2464 | 0.5116 | |
Quadratic | 0.9955 | 659.7932 | 0.0000 | -1.0123 | 0.9378 | -0.0087 |
Cubic | 0.9972 | 601.7636 | 0.0000 | -0.6141 | 0.7637 | 0.0014 | -0.0001
Exponential | 0.6200 | 11.4205 | 0.0118 | 0.8864 | 0.0875 | |
The independent variable is dose.
Models that fit the data:
Linear:
Quadratic:
Cubic:
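As an illustration of the "intrinsically linear" idea from Section 5.6, the exponential model y = ae^(bx) can be fitted to the Example 6 data by regressing ln y on x with the ordinary least-squares formulas and then transforming the intercept back. This is a sketch under the assumption that the Exponential row of the software output was obtained by this kind of linearisation; the function name `fit_exponential` is illustrative.

```python
from math import exp, log

dose = [0, 1.3, 2.8, 5.0, 10.2, 16.5, 21.3, 31.8, 52.2]
response = [0.1, 0.5, 0.9, 2.6, 7.1, 12.3, 15.3, 20.4, 24.2]


def fit_exponential(xs, ys):
    """Fit y = a * e^(b x) via the linearised form ln y = ln a + b x."""
    n = len(xs)
    log_ys = [log(y) for y in ys]          # Y = ln y, X = x
    sum_x = sum(xs)
    sum_ly = sum(log_ys)
    sum_xly = sum(x * ly for x, ly in zip(xs, log_ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xly - sum_x * sum_ly) / (n * sum_x2 - sum_x ** 2)
    ln_a = sum_ly / n - b * sum_x / n      # alpha = ln a
    return exp(ln_a), b                    # back-transform alpha to a


a_exp, b_exp = fit_exponential(dose, response)
print(f"y = {a_exp:.4f} * exp({b_exp:.4f} x)")
```

The fitted a and b land close to the Constant and b1 reported in the Exponential row. The low R Square of that row compared with the quadratic and cubic rows is what argues against the exponential model for these data.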
BHMC3004
Chapter 6
Chapter 6 CORRELATION
6.1 Correlation Analysis
• A group of statistical techniques used to measure the strength of the association
  between two variables.
6.2 Coefficient of Correlation
• A measure of the strength of the relationship between two variables.
• Ranges between -1 and 1.
• A value of 0 indicates that there is no linear relationship between the two variables.
6.2.1 Pearson’s Correlation Coefficient, r
• Pearson product-moment correlation coefficient.
• A measure for interval-ratio variables that describes the strength and the direction of the
  linear relationship between the variables.
• r = ±√(coefficient of determination, r²), taking the same sign as the slope b:
  $r = \dfrac{\Sigma(X - \bar{X})(Y - \bar{Y})}{(n - 1)S_X S_Y} = \dfrac{n\Sigma XY - (\Sigma X)(\Sigma Y)}{\sqrt{[n\Sigma X^2 - (\Sigma X)^2][n\Sigma Y^2 - (\Sigma Y)^2]}}$
• A symmetrical measure, thus the correlation between X and Y is identical to the
  correlation between Y and X.
• Guideline for the strength of a relationship:
Strength/ Degree | Negative correlation | Positive correlation
Perfect | –1 | 1
Very strong | –1.0 < r ≤ –0.8 | 0.8 ≤ r < 1.0
Strong | –0.8 < r ≤ –0.6 | 0.6 ≤ r < 0.8
Moderate | –0.6 < r ≤ –0.4 | 0.4 ≤ r < 0.6
Weak | –0.4 < r ≤ –0.2 | 0.2 ≤ r < 0.4
Very weak | –0.2 < r < 0 | 0 < r < 0.2
No relationship | 0 | 0
Example 1
The weight of a car can influence the mileage that the car can obtain. Based on the given data,
calculate and interpret the coefficient of correlation. Hence, calculate the coefficient of
determination and interpret the result.
Weight (in hundreds of pounds) | 23 | 25 | 28 | 30 | 35 | 35 | 40
Mileage (mpg) | 53.3 | 40.9 | 46.9 | 32.2 | 31.3 | 28.0 | 23.1
X = Weight, Y = Mileage
ΣX = ______, ΣY = ______, ΣX² = ______, ΣY² = ______, ΣXY = ______
$r = \dfrac{n\Sigma XY - (\Sigma X)(\Sigma Y)}{\sqrt{[n\Sigma X^2 - (\Sigma X)^2][n\Sigma Y^2 - (\Sigma Y)^2]}}$
There exists a ______ linear relationship between the ______.
When the ______ increases, the ______.
Coefficient of determination, r² = ______
______ of the variability in ______ can be predicted from the relationship
with the ______.
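The working for this example can be checked with a small sketch of the sums-based Pearson formula. The function name `pearson_r` is an assumption for illustration; the data are the seven weight and mileage pairs given above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation from the sums-based formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


weight = [23, 25, 28, 30, 35, 35, 40]                 # hundreds of pounds
mileage = [53.3, 40.9, 46.9, 32.2, 31.3, 28.0, 23.1]  # mpg

r = pearson_r(weight, mileage)
print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}")
```

The sign of r gives the direction of the relationship, and r² gives the share of variability used in the final interpretation sentence.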
Example 2
Use Pearson product-moment correlation coefficient to illustrate the strength of the
relationship between
i) Educational attainment and Internet usage;
ii) Age and Internet usage.
Educational Attainment (year) | Age | Internet Usage per week (hour)
10 | 55 | 1
9 | 60 | 0
12 | 45 | 3
13 | 35 | 4
19 | 23 | 7
11 | 40 | 2
16 | 22 | 6
23 | 27 | 9
14 | 41 | 5
21 | 30 | 8
Let r1 be the correlation coefficient between Educational attainment and Internet usage, and
r2 be the correlation coefficient between Age and Internet usage.
[Refer to Chapter 5, Example 4]
ΣX1 = 148, ΣY = 45, ΣX1Y = 794, ΣX1² = 2398, ΣY² = 285, n = 10
$r_1 = \dfrac{10(794) - (148)(45)}{\sqrt{[10(2398) - (148)^2][10(285) - (45)^2]}}$
ΣX2 = 378, ΣY = 45, ΣX2Y = 1391, ΣX2² = 15798, ΣY² = 285, n = 10
$r_2 = \dfrac{10(1391) - (378)(45)}{\sqrt{[10(15798) - (378)^2][10(285) - (45)^2]}}$
6.2.2 Spearman’s Rank Correlation Coefficient
• Rank Correlation
o Used to measure the strength of a relationship between the variables that are of at
least ordinal data. E.g., discipline and exam marks, job performance and qualification.
• Spearman’s Rank Correlation Coefficient, rs
o A measure of rank correlation.
o Can be used even though the variables to be correlated are not representable in
numeric form.
• Spearman’s Rank Correlation Coefficient,
  $r_s = 1 - \dfrac{6\Sigma D^2}{n(n^2 - 1)}$
  where D = rX – rY ,
  rX / rY = rank of X / Y,
  X and Y are the characteristics of the data.
• Guideline for the strength of a relationship:
Strength/ Degree | Agreement between the rankings | Disagreement between the rankings
Perfect | 1 | –1
Very high | 0.8 ≤ r < 1.0 | –1.0 < r ≤ –0.8
High | 0.6 ≤ r < 0.8 | –0.8 < r ≤ –0.6
Moderate | 0.4 ≤ r < 0.6 | –0.6 < r ≤ –0.4
Low | 0.2 ≤ r < 0.4 | –0.4 < r ≤ –0.2
Very low | 0 < r < 0.2 | –0.2 < r < 0
No relationship | 0 | 0
Example 3
Consider a musical talent contest where 10 competitors are evaluated by two judges, X and Y.
The scores of the judges (out of 10) were as follows:
Contestant | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Score by Judge X | 5 | 9 | 3 | 8 | 6 | 7 | 4 | 8 | 4 | 6
Score by Judge Y | 7 | 8 | 6 | 7 | 8 | 5 | 10 | 6 | 5 | 8
Describe the relationship between the scores by the judges using Spearman rank correlation
coefficient.
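Contestant | Score by Judge X | Score by Judge Y | rX | rY | D = rX – rY | D²
1 | 5 | 7 | | | |
2 | 9 | 8 | | | |
3 | 3 | 6 | | | |
4 | 8 | 7 | | | |
5 | 6 | 8 | | | |
6 | 7 | 5 | | | |
7 | 4 | 10 | | | |
8 | 8 | 6 | | | |
9 | 4 | 5 | | | |
10 | 6 | 8 | | | |
ΣD² = ______
rs = ______
There is a ______ degree of ______ between the rankings of ______.
If Judge X evaluates a particular contestant with a higher score, then ______.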
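The ranking and the rs formula can be sketched as follows. Two choices here are assumptions, since the notes do not state them: rank 1 goes to the lowest score, and tied scores share the average of the ranks they occupy. Reversing the ranking direction for both judges leaves ΣD² unchanged. The function names are illustrative.

```python
def average_ranks(scores):
    """Rank scores from lowest (rank 1) upward; ties get the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1   # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rs(xs, ys):
    """Return (rs, sum of D^2) using rs = 1 - 6*sum(D^2) / (n(n^2 - 1))."""
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2


judge_x = [5, 9, 3, 8, 6, 7, 4, 8, 4, 6]
judge_y = [7, 8, 6, 7, 8, 5, 10, 6, 5, 8]

rs, sum_d2 = spearman_rs(judge_x, judge_y)
print(f"sum of D^2 = {sum_d2}, rs = {rs:.4f}")
```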
6.2.3 Comparison of Rank and Product Moment Correlation
• Product moment coefficient
o The standard measure of correlation.
o Data must be numeric.
• Rank coefficient
o Approximation to the r.
o Easier to use, less calculations.
o Non-numeric data.
o Insensitive to small changes in actual values.
6.3 Other Measures of Association
• If one or both of the variables is nominal:
o Contingency Coefficient
o Phi and Cramer’s V
o Lambda
• If both of the variables are ordinal:
o Gamma
o Kendall’s tau-b
o Kendall’s tau-c
* Dichotomies should be treated as ordinal.
6.3.1 Contingency Coefficient
• Ranges between 0 and 1, with higher values indicating a stronger association.
• Highly sensitive to the size of the table. The larger the number of categories, the closer the
  maximum value is to 1.
6.3.2 Phi and Cramer’s V
• Vary between 0 and 1, regardless of the number of rows and columns.
• Nondirectional measures that range between 0 and 1, with 0 indicating no association
  and 1 indicating perfect association.
6.3.3 Lambda, λ
• Asymmetrical measure of association; its value varies depending on which variable is
  considered the independent variable and which the dependent variable.
• Often underestimates the strength of the relationship.
• 3 versions of Lambda – one that you would use when one variable is the dependent
  variable, another that you would use if the other variable was dependent, and a third
  you would use if you don’t want to designate either of the variables as dependent.
• Ranges from 0 to 1.
• 0.0: nothing to be gained by using the IV to predict the DV.
• 1.0: by using the IV as a predictor, we are able to predict the DV without any error.
6.3.4 Gamma, Kendall’s tau-b, and Kendall’s tau-c
• Symmetrical measures of association.
• Vary from –1.0 to 1.0 and provide an indication of the strength and direction of the
  association between the variables.
• Gamma will always be the largest in magnitude of the three.
BHMC3004
Chapter 7
Chapter 7 HYPOTHESIS TESTING
7.1 Introduction
• A hypothesis is a statement about a population parameter developed for the purpose of
  testing.
• Hypothesis testing is
  o a procedure based on sample evidence and probability theory to determine whether
    the hypothesis is a reasonable statement, or
  o an inferential procedure that uses the data from a sample to draw a general
    conclusion about a population.
  o The procedure follows five steps:
    Step 1: State the null and alternative hypotheses.
    Step 2: Determine the significance level and the critical value.
    Step 3: Identify the test statistic.
    Step 4: Formulate a decision rule.
    Step 5: Take a sample and calculate the value of the test statistic.
    Make a decision: 1. Reject H0; 2. Fail to reject H0.
7.2 Definition
• Null Hypothesis, H0
  o A statement about the value of a population parameter.
  o No effect, no change, or no significant difference.
  o Includes =, ≤ or ≥. It always contains the equal sign, as the null hypothesis is the
    statement to be tested and we need a specific value to include in our calculations.
  o Also used to state that there is no relationship between two variables.
• Alternative Hypothesis, H1
  o Research hypothesis.
  o Inverse, or opposite, of H0.
  o Expressed in terms of population parameters, but its specific form varies from test
    to test.
  o Can include ≠, > or <; directly contradicts the H0.
  o A statement that there is a statistically significant relationship between two variables.
  o A statement that is accepted if the sample data provide sufficient evidence that the
    H0 is false.
• Level of Significance, α
  o The probability of rejecting the H0 when it is true.
  o Level of risk.
• Critical Region
  o Composed of extreme sample values that are very unlikely to be obtained if the null
    hypothesis is true.
  o If the outcome of a statistical test falls in the critical region, the H0 is rejected.
• Critical Value
  o The dividing point between the acceptance region and the rejection region (critical
    region).
  o The boundary of the critical region.
  o Based on the level of significance, type of test and type of test statistic.
• Type I Error
  o Rejecting the H0 when it is true.
  o In a typical research situation, a Type I error means the researcher concludes that a
    treatment does have an effect when in fact it has no effect.
  o The probability of committing a Type I error is α.
• Type II Error
  o Failing to reject H0 when it is false.
  o In a typical research situation, a Type II error means that the hypothesis test has
    failed to detect a real treatment effect.
  o The probability of committing a Type II error is β.
• Test Statistic
  o A value, determined from sample information, used to determine whether to reject
    the null hypothesis.
• Decision Rule
  o A statement of the specific condition under which the null hypothesis is rejected and the
    condition under which it is not rejected.
7.3 Hypothesis Testing for One Population Mean
• The claims are statements about a population mean, μ.
• Type of hypothesis test:
  i) One-tailed test / Directional hypothesis test.
     o H1 specifies either an increase (right-tailed test) or a decrease (left-tailed test) in the
       population mean score.
     o Makes a statement about the direction of the effect.
     o The rejection region is at the right tail or left tail of the distribution.
  ii) Two-tailed test / Non-directional hypothesis test.
     o The primary concern is deciding whether a population mean is different from a
       specific value.
     o The rejection region is in both tails of the distribution.
  Sign in H0 | Sign in H1 | Type of Test | Keywords
  ≤ | > | Right-tailed test | more / not more than, at most
  = | ≠ | Two-tailed test | different, change / same, equal
  ≥ | < | Left-tailed test | less / not less than, at least
• Type of test statistic:
  i) If σ, the population standard deviation, is known, the test statistic is the z-test,
     $z = \dfrac{\text{sample mean} - \text{hypothesized population mean}}{\text{standard error between } \bar{x} \text{ and } \mu} = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
  ii) If σ is unknown but n ≥ 30, the test statistic is the z-test where σ is estimated by
     s, the sample standard deviation,
     $z = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$
  iii) If σ is unknown and n < 30, the test statistic is the t-statistic, where σ is estimated
     by s,
     $t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$ with (n − 1) degrees of freedom
Percentage Points of the Normal Distribution
The table gives the 100α percentage points, uα, of a standardised Normal distribution where
$\alpha = \dfrac{1}{\sqrt{2\pi}} \displaystyle\int_{u_\alpha}^{\infty} e^{-u^2/2}\, du$. Thus, uα is the value of a standardised Normal variate which has
probability α of being exceeded.
α | uα | α | uα | α | uα
0.50 | 0.0000 | 0.029 | 1.8957 | 0.009 | 2.3656
0.45 | 0.1257 | 0.028 | 1.9110 | 0.008 | 2.4089
0.40 | 0.2533 | 0.027 | 1.9268 | 0.007 | 2.4573
0.35 | 0.3853 | 0.026 | 1.9431 | 0.006 | 2.5121
0.30 | 0.5244 | 0.025 | 1.9600 | 0.005 | 2.5758
0.25 | 0.6745 | 0.024 | 1.9774 | 0.004 | 2.6521
0.20 | 0.8416 | 0.023 | 1.9954 | 0.003 | 2.7478
0.15 | 1.0364 | 0.022 | 2.0141 | 0.002 | 2.8782
0.10 | 1.2816 | 0.021 | 2.0335 | 0.001 | 3.0902
0.05 | 1.6449 | 0.020 | 2.0537 | 0.0005 | 3.2905
0.048 | 1.6646 | 0.019 | 2.0749 | 0.0001 | 3.7190
0.046 | 1.6849 | 0.018 | 2.0969 | 0.00005 | 3.8906
0.044 | 1.7060 | 0.017 | 2.1201 | 0.00001 | 4.2649
0.042 | 1.7279 | 0.016 | 2.1444 | 0.000005 | 4.4172
0.040 | 1.7507 | 0.015 | 2.1701 | |
0.038 | 1.7744 | 0.014 | 2.1973 | |
0.036 | 1.7991 | 0.013 | 2.2262 | |
0.034 | 1.8250 | 0.012 | 2.2571 | |
0.032 | 1.8522 | 0.011 | 2.2904 | |
0.030 | 1.8808 | 0.010 | 2.3263 | |
Critical Values of Student’s t Distribution
[Figure: t distribution curves with area α shaded in one tail (One-Tailed Tests) and area α/2 shaded in each tail (Two-Tailed Tests).]
Significance level, α
Degrees of freedom | One-tailed 0.1 | One-tailed 0.05 | One-tailed 0.02 | One-tailed 0.01 | Two-tailed 0.1 | Two-tailed 0.05 | Two-tailed 0.02 | Two-tailed 0.01
1 | 3.078 | 6.314 | 15.895 | 31.821 | 6.314 | 12.706 | 31.821 | 63.657
2 | 1.886 | 2.920 | 4.849 | 6.965 | 2.920 | 4.303 | 6.965 | 9.925
3 | 1.638 | 2.353 | 3.482 | 4.541 | 2.353 | 3.182 | 4.541 | 5.841
4 | 1.533 | 2.132 | 2.999 | 3.747 | 2.132 | 2.776 | 3.747 | 4.604
5 | 1.476 | 2.015 | 2.757 | 3.365 | 2.015 | 2.571 | 3.365 | 4.032
6 | 1.440 | 1.943 | 2.612 | 3.143 | 1.943 | 2.447 | 3.143 | 3.707
7 | 1.415 | 1.895 | 2.517 | 2.998 | 1.895 | 2.365 | 2.998 | 3.499
8 | 1.397 | 1.860 | 2.449 | 2.896 | 1.860 | 2.306 | 2.896 | 3.355
9 | 1.383 | 1.833 | 2.398 | 2.821 | 1.833 | 2.262 | 2.821 | 3.250
10 | 1.372 | 1.812 | 2.359 | 2.764 | 1.812 | 2.228 | 2.764 | 3.169
11 | 1.363 | 1.796 | 2.328 | 2.718 | 1.796 | 2.201 | 2.718 | 3.106
12 | 1.356 | 1.782 | 2.303 | 2.681 | 1.782 | 2.179 | 2.681 | 3.055
13 | 1.350 | 1.771 | 2.282 | 2.650 | 1.771 | 2.160 | 2.650 | 3.012
14 | 1.345 | 1.761 | 2.264 | 2.624 | 1.761 | 2.145 | 2.624 | 2.977
15 | 1.341 | 1.753 | 2.249 | 2.602 | 1.753 | 2.131 | 2.602 | 2.947
16 | 1.337 | 1.746 | 2.235 | 2.583 | 1.746 | 2.120 | 2.583 | 2.921
17 | 1.333 | 1.740 | 2.224 | 2.567 | 1.740 | 2.110 | 2.567 | 2.898
18 | 1.330 | 1.734 | 2.214 | 2.552 | 1.734 | 2.101 | 2.552 | 2.878
19 | 1.328 | 1.729 | 2.205 | 2.539 | 1.729 | 2.093 | 2.539 | 2.861
20 | 1.325 | 1.725 | 2.197 | 2.528 | 1.725 | 2.086 | 2.528 | 2.845
21 | 1.323 | 1.721 | 2.189 | 2.518 | 1.721 | 2.080 | 2.518 | 2.831
22 | 1.321 | 1.717 | 2.183 | 2.508 | 1.717 | 2.074 | 2.508 | 2.819
23 | 1.319 | 1.714 | 2.177 | 2.500 | 1.714 | 2.069 | 2.500 | 2.807
24 | 1.318 | 1.711 | 2.172 | 2.492 | 1.711 | 2.064 | 2.492 | 2.797
25 | 1.316 | 1.708 | 2.167 | 2.485 | 1.708 | 2.060 | 2.485 | 2.787
26 | 1.315 | 1.706 | 2.162 | 2.479 | 1.706 | 2.056 | 2.479 | 2.779
27 | 1.314 | 1.703 | 2.158 | 2.473 | 1.703 | 2.052 | 2.473 | 2.771
28 | 1.313 | 1.701 | 2.154 | 2.467 | 1.701 | 2.048 | 2.467 | 2.763
29 | 1.311 | 1.699 | 2.150 | 2.462 | 1.699 | 2.045 | 2.462 | 2.756
30 | 1.310 | 1.697 | 2.147 | 2.457 | 1.697 | 2.042 | 2.457 | 2.750
35 | 1.306 | 1.690 | 2.133 | 2.438 | 1.690 | 2.030 | 2.438 | 2.724
∞ | 1.282 | 1.645 | 2.054 | 2.326 | 1.645 | 1.960 | 2.326 | 2.576
Hypothesis | Test Statistic | Critical Value | Critical Region
Left-tailed test; H0: μ ≥ μ0, H1: μ < μ0 | z-statistic | –zα | z < –zα
Left-tailed test; H0: μ ≥ μ0, H1: μ < μ0 | t-statistic | –tα, n – 1 | t < –tα, n – 1
Right-tailed test; H0: μ ≤ μ0, H1: μ > μ0 | z-statistic | zα | z > zα
Right-tailed test; H0: μ ≤ μ0, H1: μ > μ0 | t-statistic | tα, n – 1 | t > tα, n – 1
Two-tailed test; H0: μ = μ0, H1: μ ≠ μ0 | z-statistic | zα/2 | z > zα/2 or z < –zα/2
Two-tailed test; H0: μ = μ0, H1: μ ≠ μ0 | t-statistic | tα/2, n – 1 | t > tα/2, n – 1 or t < –tα/2, n – 1
• Refer to the Standard Normal z table or Student’s t table for the critical value.
• Compare the test statistic to the critical value and make a decision to reject or not to
  reject the null hypothesis.
• Interpret the results of the test.
Example 1
It is known that, nationally, doctors working for health maintenance organizations (HMOs)
average 13.5 years of experience in their specialties, with a standard deviation of 7.6 years. The
executive director of an HMO in a Western state is interested in determining whether its doctors
have less experience than the national average.
A random sample of 150 doctors from HMOs shows a mean of only 10 years of experience. Test
at the 0.01 level of significance.
Let μ be the true population mean number of years of experience.
H0: μ ≥ 13.5
H1: μ < 13.5 (Claim, left-tailed test)
Since σ is known, the z test is used.
α = 0.01, critical value = –z0.01 = –2.3263
Test statistic,
$z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \dfrac{10 - 13.5}{7.6 / \sqrt{150}} = -5.6403$
[Figure: standard normal curve with the rejection region to the left of -2.3263.]
If z < -2.3263, H0 is rejected. Otherwise, we fail to reject H0.
Since z = -5.6403 < -2.3263, H0 is rejected.
Therefore, at the 0.01 level of significance, the doctors have less experience than the national
average.
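The arithmetic of Example 1 can be verified with Python's standard library. This is a sketch only: `one_sample_z` is an illustrative name, and `statistics.NormalDist().inv_cdf` is used here in place of looking up the percentage-points table.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(sample_mean, mu0, sigma, n):
    """z = (x-bar - mu0) / (sigma / sqrt(n)) for a known population sigma."""
    return (sample_mean - mu0) / (sigma / sqrt(n))


alpha = 0.01
z = one_sample_z(sample_mean=10, mu0=13.5, sigma=7.6, n=150)

# Left-tailed test: the critical value is the alpha quantile of N(0, 1)
critical_value = NormalDist().inv_cdf(alpha)
reject_h0 = z < critical_value

print(f"z = {z:.4f}, critical value = {critical_value:.4f}, reject H0: {reject_h0}")
```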
Example 2
The average cost of a hotel room in town A is said to be $168 per night. To determine if this is
true, a random sample of 25 hotels is taken and resulted in a mean of $172.50 and a standard
deviation of $15.40. Test the appropriate hypothesis at 0.05 level of significance.
Let μ be the true population mean cost of a hotel room per night.
H0:
H1:
Since σ is not given and n < 30, the test statistic is the t test.
α = 0.05, df = ______
critical value = ______
Test statistic,
$t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$ =
If ______, H0 is rejected. Otherwise, we fail to reject H0.
Since ______
Therefore, we can conclude that the average cost of a hotel room in town A is ______
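A sketch of the t calculation for this example follows. Two things are assumptions made for illustration: the test is treated as two-tailed with H0: μ = 168 (the example only asks whether the stated figure "is true"), and the critical value 2.064 is read from the two-tailed 0.05 column at 24 degrees of freedom in the Student's t table above, since the Python standard library has no t quantile function.

```python
from math import sqrt


def one_sample_t(sample_mean, mu0, s, n):
    """Return (t, df) with t = (x-bar - mu0) / (s / sqrt(n)) and df = n - 1."""
    return (sample_mean - mu0) / (s / sqrt(n)), n - 1


t_stat, df = one_sample_t(sample_mean=172.50, mu0=168, s=15.40, n=25)

# Table lookup (assumed two-tailed test, alpha = 0.05, df = 24)
T_CRITICAL = 2.064
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"t = {t_stat:.4f}, df = {df}, reject H0: {reject_h0}")
```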
7.4 Hypothesis Testing for Two Population Means
7.4.1 Independent Groups
• Compare the means of two independent populations and test the hypothesis about
  μ1 – μ2.
• E.g., A social psychologist may want to compare men and women in terms of their
  attitudes towards abortion.
• Assumptions:
  i) The observations within each sample must be independent.
  ii) The two populations from which the samples are selected must be normal.
  iii) The two populations from which the samples are selected must have equal variances.
• Types of Test Statistic:
  i) If σ1 and σ2 are known, the test statistic is the z-test,
     $z = \dfrac{\text{sample mean difference} - \text{hypothesized population mean difference}}{\text{estimated standard error}} = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$.
     If $\sigma^2 = \sigma_1^2 = \sigma_2^2$, $z = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$.
  ii) If σ1 and σ2 are unknown but n1 ≥ 30 and n2 ≥ 30, the test statistic is the z-test where
     σ1 and σ2 are estimated by s1 and s2, the sample standard deviations,
     $z = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$
  iii) If σ1 and σ2 are unknown and n1 < 30 and n2 < 30, the test statistic is the t-statistic,
     where σ1 and σ2 are estimated by s1 and s2,
     $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_w\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$ with $df = n_1 + n_2 - 2$,
     where $s_w^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$ (pooled variance).
Hypothesis | Test Statistic | Critical Value | Critical Region
Left-tailed test; H0: μ1 – μ2 ≥ d0, H1: μ1 – μ2 < d0 | z-statistic | –zα | z < –zα
Left-tailed test; H0: μ1 – μ2 ≥ d0, H1: μ1 – μ2 < d0 | t-statistic | –tα, n1+n2–2 | t < –tα, n1+n2–2
Right-tailed test; H0: μ1 – μ2 ≤ d0, H1: μ1 – μ2 > d0 | z-statistic | zα | z > zα
Right-tailed test; H0: μ1 – μ2 ≤ d0, H1: μ1 – μ2 > d0 | t-statistic | tα, n1+n2–2 | t > tα, n1+n2–2
Two-tailed test; H0: μ1 – μ2 = d0, H1: μ1 – μ2 ≠ d0 | z-statistic | zα/2 | z > zα/2 or z < –zα/2
Two-tailed test; H0: μ1 – μ2 = d0, H1: μ1 – μ2 ≠ d0 | t-statistic | tα/2, n1+n2–2 | t > tα/2, n1+n2–2 or t < –tα/2, n1+n2–2
Example 3
The salaries for 35 faculty members from private institutions and 30 faculty members from
public institutions are randomly and independently selected. Their annual salaries ($000) are
recorded and the summary of the information are as follows.
Private Institutions | Public Institutions
$\bar{x}_1$ = 98.19 | $\bar{x}_2$ = 83.18
s1 = 26.21 | s2 = 23.95
n1 = 35 | n2 = 30
At the 5% significance level, do the data provide evidence to conclude that mean salaries for
faculty in private and public institutions differ?
Let μ1 be the true population mean annual salary for faculty members from private institutions;
and μ2 be the true population mean annual salary for faculty members from public institutions.
H0:
H1:
σ1 and σ2 are unknown but n1 ≥ 30 and n2 ≥ 30, so the z test is used.
α = 0.05, critical value = ______
Test statistic,
$z = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$ =
If ______, H0 is rejected. Otherwise, we fail to reject H0.
Since ______
Therefore, the data provide ______
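The large-sample z statistic for this example can be sketched as below. The function name `two_sample_z` is illustrative, and the hypothesized difference d0 = 0 is an assumption that follows from the question asking whether the mean salaries differ (a two-tailed test).

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean1, s1, n1, mean2, s2, n2, d0=0.0):
    """z for two independent large samples, sigmas estimated by s1 and s2."""
    standard_error = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return ((mean1 - mean2) - d0) / standard_error


alpha = 0.05
z = two_sample_z(98.19, 26.21, 35, 83.18, 23.95, 30)

# Two-tailed test: compare |z| with the upper alpha/2 point of N(0, 1)
z_critical = NormalDist().inv_cdf(1 - alpha / 2)
reject_h0 = abs(z) > z_critical

print(f"z = {z:.4f}, critical value = ±{z_critical:.4f}, reject H0: {reject_h0}")
```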
Example 4
A sample of 10 children from City A showed that the mean time they spent watching television
is 28.50 hours per week with a standard deviation of 4 hours. Another sample of 15 children
from City B showed that the mean time spent by them watching television is 23.25 hours per
week with a standard deviation of 5 hours.
Using a 1% level of significance, can you conclude that the mean time spent watching television
by children in City A is greater than that for children in City B ? Assume that the standard
deviations for the two populations are equal.
Let μA be the true population mean time spent watching television by children in City A;
and μB be the true population mean time spent watching television by children in City B.
H0:
H1:
σ1 and σ2 are unknown, n1 < 30 and n2 < 30, so the t test is used.
α = 0.01, df = 10 + 15 – 2 = 23
critical value = ______
sw² = ______
Test statistic, t = ______
If ______, H0 is rejected. Otherwise, we fail to reject H0.
Since ______
Thus, we are 99% confident that the mean time spent watching television by children in City A ______
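A sketch of the pooled-variance t calculation for this example is given below. The function name `pooled_t` is illustrative. The critical value 2.500 is taken from the one-tailed 0.01 column at 23 degrees of freedom in the Student's t table above, and the right-tailed form of the test is an assumption based on the wording "greater than".

```python
from math import sqrt


def pooled_t(mean1, s1, n1, mean2, s2, n2, d0=0.0):
    """Return (t, df, pooled variance) for two small independent samples."""
    df = n1 + n2 - 2
    pooled_variance = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    standard_error = sqrt(pooled_variance) * sqrt(1 / n1 + 1 / n2)
    return ((mean1 - mean2) - d0) / standard_error, df, pooled_variance


t_stat, df, sw2 = pooled_t(28.50, 4, 10, 23.25, 5, 15)

# Table lookup (assumed right-tailed test, alpha = 0.01, df = 23)
T_CRITICAL = 2.500
reject_h0 = t_stat > T_CRITICAL

print(f"sw^2 = {sw2:.4f}, t = {t_stat:.4f}, df = {df}, reject H0: {reject_h0}")
```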
7.4.2 Correlated Groups
• A single sample of individuals is measured more than once on the same dependent
variable. The same subjects are used in all the treatment conditions.
o E.g., A clinical psychologist may want to evaluate a therapy technique by comparing
depression scores for patients before therapy with their scores after therapy.
• In a matched-subjects study, each individual in one sample is matched with a subject in
the other sample. The matching is done so that the two individuals are equivalent (or
nearly equivalent) with respect to a specific variable that the researcher would like to
control.
• Assumptions:
  o The observations within each treatment condition must be independent.
  o The population distribution of difference scores (D values) must be normal.
• The t test begins by computing a difference between the first and second measurements
  for each subject (or the difference for each matched pair).
  o The difference scores are obtained by
    D = X2 – X1.
  o The mean difference, $\bar{x}_D = \dfrac{\Sigma D}{n}$, where ΣD is the sum of differences.
  o The test statistic is $t = \dfrac{\bar{x}_D - \mu_D}{s_D / \sqrt{n}}$ with df = n – 1, where $s_D = \sqrt{\dfrac{\Sigma D^2 - \dfrac{(\Sigma D)^2}{n}}{n - 1}}$.
Hypothesis | Test Statistic | Critical Value | Critical Region
Left-tailed test; H0: μD ≥ 0, H1: μD < 0 | t-statistic | –tα, n–1 | t < –tα, n–1
Right-tailed test; H0: μD ≤ 0, H1: μD > 0 | t-statistic | tα, n–1 | t > tα, n–1
Two-tailed test; H0: μD = 0, H1: μD ≠ 0 | t-statistic | tα/2, n–1 | t > tα/2, n–1 or t < –tα/2, n–1
Example 5
The following data are weight changes of a group of 10 participants in a study, after
administration of a drug proposed to result in weight loss.
At the α = 0.05 level of significance, do these data provide sufficient evidence to indicate that the
drug will help reduce weight?
Subject | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Before | 55.4 | 63.9 | 60.1 | 78.8 | 59.2 | 68.7 | 70.0 | 69.2 | 84.9 | 75.3
After | 55.2 | 63.6 | 58.8 | 77.2 | 58.5 | 69.2 | 70.0 | 68.9 | 83.9 | 74.8
D = xa – xb | | | | -1.6 | -0.7 | 0.5 | 0.0 | -0.3 | -1.0 | -0.5
ΣD = –5.4
$\bar{x}_D$ = –0.54
ΣD² = 6.46
sD = 0.6275
Let μD be the true population mean difference between the weights before and after the
administration of the drug,
where D = xafter – xbefore.
H0:
H1:
df = 10 – 1 = 9, α = 0.05, critical value = ______
Test statistic,
$t = \dfrac{\bar{x}_D - \mu_D}{s_D / \sqrt{n}}$
If ______, H0 is rejected. Otherwise, we fail to reject H0.
Since ______
Therefore, these data ______
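The summary values quoted for this example (ΣD, ΣD², sD) and the resulting t statistic can be reproduced with the sketch below. The variable names are illustrative. The critical value –1.833 is taken from the one-tailed 0.05 column at 9 degrees of freedom in the Student's t table above, and the left-tailed form (H1: μD < 0, with D = after – before) is an assumption based on the drug being expected to reduce weight.

```python
from math import sqrt

before = [55.4, 63.9, 60.1, 78.8, 59.2, 68.7, 70.0, 69.2, 84.9, 75.3]
after = [55.2, 63.6, 58.8, 77.2, 58.5, 69.2, 70.0, 68.9, 83.9, 74.8]

# D = x_after - x_before, rounded to one decimal to remove float noise
diffs = [round(a - b, 1) for b, a in zip(before, after)]

n = len(diffs)
sum_d = sum(diffs)
sum_d2 = sum(d * d for d in diffs)
mean_d = sum_d / n
s_d = sqrt((sum_d2 - sum_d ** 2 / n) / (n - 1))

# Test statistic under H0: mu_D = 0
t_stat = (mean_d - 0) / (s_d / sqrt(n))

# Table lookup (assumed left-tailed test, alpha = 0.05, df = 9)
T_CRITICAL = -1.833
reject_h0 = t_stat < T_CRITICAL

print(f"sum D = {sum_d:.1f}, sum D^2 = {sum_d2:.2f}, sD = {s_d:.4f}, t = {t_stat:.4f}")
```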
Example 6
Listed below are brain volumes (cm³) of 10 pairs of twins. Use α = 0.10 to test the claim that
there is no difference in brain volumes between the first-born and the second-born twins.
First Born | 1005 | 1035 | 1281 | 1051 | 1034 | 1079 | 1104 | 1439 | 1029 | 1160
Second Born | 963 | 1027 | 1272 | 1079 | 1070 | 1173 | 1067 | 1347 | 1100 | 1204
D = X1 – X2 | | | 9 | -28 | -36 | -94 | 37 | 92 | -71 | -44
D² | | | 81 | 784 | 1296 | 8836 | 1369 | 8464 | 5041 | 1936
Let μD be the true population mean difference in brain volumes between the first-born and
second-born twins.
H0:
H1:
df = ______, α = 0.1, critical value ______
Test statistic,
$t = \dfrac{\bar{x}_D - \mu_D}{s_D / \sqrt{n}}$ =
If t ______, H0 is rejected. Otherwise, we fail to reject H0.
Since −1.833 < t = −0.4742 < 1.833, we fail to reject H0.
Therefore, there is ______
BHMC3004
7.5
•
•
•
•
Chapter 7
The Chi-Square Test
An inferential statistical technique designed to test on qualitative variables.
Used to test on the
i) shape of the distribution of a variable (Goodness of Fit Test);
ii) significance of the relationship between two variables (Independence Test);
iii) comparison of the distributions of a variable between two or more populations
(Homogeneity Test).
Rely on Chi-square distribution, 2.
Critical value = 2df ; critical region: 2 > 2df .
Acceptance region
•
Rejection region
Test statistic,
(𝑓𝑜 − 𝑓𝑒 )2
2
χ =Σ
𝑓𝑒
with a specific degree of freedom, df
where fo: observed frequency (from sample)
fe: expected frequency (predicted from the H0)
7.5.1 Goodness of Fit Test
• Determines how well the obtained sample proportions fit the population proportions
specified by the H0.
• The null hypothesis assumes that there is no significant difference between the observed
and expected distribution.
• The alternative hypothesis states that the population distribution has a different shape
from that specified in H0.
• Degrees of freedom, df = C – 1, where C is the number of categories.
• fe = np, where n is the sample size and p is the proportion stated in H0.
Chi-square (χ²) Distribution
Proportion in Critical Region
df | 0.1 | 0.05 | 0.025 | 0.01 | 0.005
1 | 2.706 | 3.841 | 5.024 | 6.635 | 7.879
2 | 4.605 | 5.991 | 7.378 | 9.210 | 10.597
3 | 6.251 | 7.815 | 9.348 | 11.345 | 12.838
4 | 7.779 | 9.488 | 11.143 | 13.277 | 14.860
5 | 9.236 | 11.070 | 12.833 | 15.086 | 16.750
6 | 10.645 | 12.592 | 14.449 | 16.812 | 18.548
7 | 12.017 | 14.067 | 16.013 | 18.475 | 20.278
8 | 13.362 | 15.507 | 17.535 | 20.090 | 21.955
9 | 14.684 | 16.919 | 19.023 | 21.666 | 23.589
10 | 15.987 | 18.307 | 20.483 | 23.209 | 25.188
11 | 17.275 | 19.675 | 21.920 | 24.725 | 26.757
12 | 18.549 | 21.026 | 23.337 | 26.217 | 28.300
13 | 19.812 | 22.362 | 24.736 | 27.688 | 29.819
14 | 21.064 | 23.685 | 26.119 | 29.141 | 31.319
15 | 22.307 | 24.996 | 27.488 | 30.578 | 32.801
16 | 23.542 | 26.296 | 28.845 | 32.000 | 34.267
17 | 24.769 | 27.587 | 30.191 | 33.409 | 35.718
18 | 25.989 | 28.869 | 31.526 | 34.805 | 37.156
19 | 27.204 | 30.144 | 32.852 | 36.191 | 38.582
20 | 28.412 | 31.410 | 34.170 | 37.566 | 39.997
21 | 29.615 | 32.671 | 35.479 | 38.932 | 41.401
22 | 30.813 | 33.924 | 36.781 | 40.289 | 42.796
23 | 32.007 | 35.172 | 38.076 | 41.638 | 44.181
24 | 33.196 | 36.415 | 39.364 | 42.980 | 45.559
25 | 34.382 | 37.652 | 40.646 | 44.314 | 46.928
26 | 35.563 | 38.885 | 41.923 | 45.642 | 48.290
27 | 36.741 | 40.113 | 43.195 | 46.963 | 49.645
28 | 37.916 | 41.337 | 44.461 | 48.278 | 50.993
29 | 39.087 | 42.557 | 45.722 | 49.588 | 52.336
30 | 40.256 | 43.773 | 46.979 | 50.892 | 53.672
40 | 51.805 | 55.758 | 59.342 | 63.691 | 66.766
50 | 63.167 | 67.505 | 71.420 | 76.154 | 79.490
60 | 74.397 | 79.082 | 83.298 | 88.379 | 91.952
70 | 85.527 | 90.531 | 95.023 | 100.425 | 104.215
80 | 96.578 | 101.879 | 106.629 | 112.329 | 116.321
90 | 107.565 | 113.145 | 118.136 | 124.116 | 128.299
100 | 118.498 | 124.342 | 129.561 | 135.807 | 140.169
Example 7
The human resources manager at a company is concerned about absenteeism among hourly workers. She
decides to sample the records to determine whether absenteeism is distributed evenly
throughout the six working days. The sample results are as follows:
Day | Mon | Tues | Wed | Thurs | Fri | Sat
No. Absent | 12 | 9 | 11 | 10 | 9 | 9
Use a 0.01 level of significance to test the hypothesis.
H0 : In the general population, absenteeism is distributed evenly throughout the six working
days, with the following distribution:
Day | Mon | Tues | Wed | Thurs | Fri | Sat
p | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6
H1 : Absenteeism is not distributed evenly throughout the six working days.
α = 0.01, df = 6 – 1 = 5, Critical value = χ²0.01,5 = 15.086
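Day | Mon | Tues | Wed | Thurs | Fri | Sat | Total
fo | 12 | 9 | 11 | 10 | 9 | 9 | 60
fe | 10 | 10 | 10 | 10 | 10 | 10 | 60
Test statistic,
$\chi^2 = \sum \dfrac{(f_o - f_e)^2}{f_e} = \dfrac{2^2 + 1^2 + 1^2 + 0^2 + 1^2 + 1^2}{10} = 0.8$
If χ² > 15.086, H0 is rejected. Otherwise, we fail to reject H0.
Since χ² = 0.8 < 15.086, we fail to reject H0.
Therefore, the absenteeism is consistent with an even distribution throughout the six working days; there is insufficient evidence at the 0.01 level that it is unevenly distributed.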
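The goodness-of-fit steps used in Example 7 (fe = np, df = C – 1, then the χ² sum) can be sketched as a small function. The name `chi_square_gof` is hypothetical, and the critical value is assumed to be read from the chi-square table (df = 5, α = 0.01).

```python
def chi_square_gof(observed, proportions):
    """Chi-square goodness-of-fit statistic, df, and expected counts (f_e = n * p)."""
    n = sum(observed)
    expected = [n * p for p in proportions]
    chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
    return chi2, len(observed) - 1, expected


absent = [12, 9, 11, 10, 9, 9]  # Mon..Sat, from Example 7
chi2, df, expected = chi_square_gof(absent, [1 / 6] * 6)
critical_value = 15.086         # chi-square table, df = 5, alpha = 0.01
print(f"chi2 = {chi2:.2f}, df = {df}, reject H0: {chi2 > critical_value}")
```

Keeping the proportions as an argument lets the same function handle an unequal H0 distribution, such as the 60/30/10 split in Example 8.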
Example 8
The American Accounting Association classifies accounts receivable as “current”, “late” and “not
collectable”. Industry figures show that 60 percent of accounts receivable are current, 30
percent are late, and 10 percent are not collectable.
An accountancy firm has 500 accounts receivable: 320 are current, 120 are late, and 60 are not
collectable. Are these numbers in agreement with the industry distribution? Use a 0.05 level of
significance.
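H0: In the general population, the distribution of the classification of the accounts receivable is
as follows: 60% current, 30% late, and 10% not collectable
H1: The distribution of the classification of the accounts receivable is different from that
specified in H0.
α = 0.05, df = 3 – 1 = 2, Critical value = χ²0.05,2 = 5.991
 | Current | Late | Not collectable | Total
fo | 320 | 120 | 60 | 500
fe | 300 | 150 | 50 | 500
$\chi^2 = \sum \dfrac{(f_o - f_e)^2}{f_e} = \dfrac{(320 - 300)^2}{300} + \dfrac{(120 - 150)^2}{150} + \dfrac{(60 - 50)^2}{50} = 1.333 + 6.000 + 2.000 = 9.333$
If χ² > 5.991, H0 is rejected. Otherwise, we fail to reject H0.
Since χ² = 9.333 > 5.991, H0 is rejected.
Therefore, at the 0.05 level of significance, the distribution of accounts receivable at this firm differs from the industry distribution.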
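To see which category drives the result in Example 8, the sketch below prints each category's contribution to χ². It uses only the figures given in the example; the critical value is assumed to be read from the chi-square table (df = 2, α = 0.05).

```python
categories = ["current", "late", "not collectable"]
observed = [320, 120, 60]      # the firm's 500 accounts
industry = [0.60, 0.30, 0.10]  # proportions stated in H0

n = sum(observed)
chi2 = 0.0
for name, fo, p in zip(categories, observed, industry):
    fe = n * p                 # expected frequency under H0
    contribution = (fo - fe) ** 2 / fe
    chi2 += contribution
    print(f"{name:16s} fo = {fo:3d}  fe = {fe:5.1f}  (fo - fe)^2 / fe = {contribution:.3f}")

df = len(categories) - 1
critical_value = 5.991         # chi-square table, df = 2, alpha = 0.05
print(f"chi2 = {chi2:.3f}, df = {df}, reject H0: {chi2 > critical_value}")
```

The "late" category contributes the most: the firm has 120 late accounts against 150 expected.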
7.5.2 Independence Test
• The null hypothesis always states that the two variables are independent, that is, there is no
consistent, predictable relationship between them.
• The data are presented in the form of a matrix called a contingency table.
• df = (R – 1)(C – 1), where R is the number of rows and C is the number of columns.
• $f_e = \dfrac{\text{Column total} \times \text{Row total}}{\text{Grand total}}$
Example 9
The recent recession and poor economic conditions forced many people to hold more than one job.
A sample of 500 persons who held more than one job produced the following two-way table.
Test at the 5% level of significance whether gender and marital status are related for all people who
hold more than one job.
 | Single | Married | Other
Male | 72 | 209 | 39
Female | 33 | 102 | 45
H0 : In the general population, there is no relationship between gender and marital status for
all people who hold more than one job.
H1 : There is a consistent and predictable relationship between gender and marital status for
all people who hold more than one job.
α = 0.05, df = (2 – 1)(3 – 1) = 2, Critical value = χ²0.05,2 = 5.991
fo (fe) | Single | Married | Other | Total
Male | 72 (67.20) | 209 (199.04) | 39 (53.76) | 320
Female | 33 (37.80) | 102 (111.96) | 45 (30.24) | 180
Total | 105 | 311 | 84 | 500
$\chi^2 = \sum \dfrac{(f_o - f_e)^2}{f_e} = 0.34 + 0.50 + 4.05 + 0.61 + 0.89 + 7.20 = 13.59$
If χ² > 5.991, H0 is rejected. Otherwise, we fail to reject H0.
Since χ² = 13.59 > 5.991, H0 is rejected.
Therefore, gender and marital status are related for all people who hold more than one job.
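The expected frequencies and test statistic for Example 9 can be reproduced with the sketch below. The helper name `chi_square_table` is hypothetical; it applies fe = (row total × column total) / grand total to every cell, and the critical value is assumed to be read from the chi-square table.

```python
def chi_square_table(table):
    """Return (chi2, df, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # f_e = (row total x column total) / grand total, cell by cell
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (fo - fe) ** 2 / fe
        for obs_row, exp_row in zip(table, expected)
        for fo, fe in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, expected


# Rows: Male, Female; columns: Single, Married, Other (Example 9).
jobs = [[72, 209, 39], [33, 102, 45]]
chi2, df, expected = chi_square_table(jobs)
critical_value = 5.991  # chi-square table, df = 2, alpha = 0.05
print([[round(fe, 2) for fe in row] for row in expected])
print(f"chi2 = {chi2:.3f}, df = {df}, reject H0: {chi2 > critical_value}")
```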
7.5.3 Homogeneity Test
• The null hypothesis states that the populations are homogeneous with respect to
the variable.
• The steps for carrying out the independence test and the homogeneity test are the same.
Example 10
In a study of the television viewing habits of children, a developmental psychologist selects a
random sample of 300 first graders - 100 boys and 200 girls. Each child is asked which of the
following TV programs they like best: The Lone Ranger, Sesame Street, or The Simpsons.
Results are shown in the contingency table below.
Viewing Preferences
 | Lone Ranger | Sesame Street | The Simpsons | Total
Boys | 50 | 30 | 20 | 100
Girls | 50 | 80 | 70 | 200
Total | 100 | 110 | 90 | 300
Do the boys’ preferences for these TV programs differ significantly from the girls’ preferences?
Use a 0.05 level of significance.
H0: In the general population, the boys’ preferences for these TV programs do not differ
significantly from the girls’ preferences.
H1: The boys’ preferences for these TV programs differ significantly from the girls’ preferences.
α = 0.05, df = (2 – 1)(3 – 1) = 2, Critical value = χ²0.05,2 = 5.991
fo (fe) | Lone Ranger | Sesame Street | The Simpsons | Total
Boys | 50 (33.3) | 30 (36.7) | 20 (30) | 100
Girls | 50 (66.7) | 80 (73.3) | 70 (60) | 200
Total | 100 | 110 | 90 | 300
$\chi^2 = \sum \dfrac{(f_o - f_e)^2}{f_e} = 8.33 + 1.21 + 3.33 + 4.17 + 0.61 + 1.67 = 19.32$
If χ² > 5.991, H0 is rejected. Otherwise, we fail to reject H0.
Since χ² = 19.32 > 5.991, H0 is rejected.
Therefore, the boys' preferences for these TV programs differ significantly from the girls' preferences.
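For a homogeneity test it helps to look at each group's own percentage distribution before computing χ². The sketch below does both for Example 10. The variable names are illustrative, and the critical value is assumed to be read from the chi-square table.

```python
# Rows: Boys, Girls; columns: Lone Ranger, Sesame Street, The Simpsons.
programs = ["Lone Ranger", "Sesame Street", "The Simpsons"]
observed = {"Boys": [50, 30, 20], "Girls": [50, 80, 70]}

col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(col_totals)

chi2 = 0.0
for group, row in observed.items():
    row_total = sum(row)
    # Row percentages show each group's preference distribution.
    shares = ", ".join(f"{p}: {100 * fo / row_total:.0f}%" for p, fo in zip(programs, row))
    print(f"{group}: {shares}")
    for fo, col_total in zip(row, col_totals):
        fe = row_total * col_total / grand_total  # expected frequency under H0
        chi2 += (fo - fe) ** 2 / fe

df = (len(observed) - 1) * (len(programs) - 1)
critical_value = 5.991  # chi-square table, df = 2, alpha = 0.05
print(f"chi2 = {chi2:.3f}, df = {df}, reject H0: {chi2 > critical_value}")
```

Half of the boys chose The Lone Ranger compared with a quarter of the girls, which is where most of the χ² value comes from.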
7.6 p-Value in Hypothesis Testing
• p-Value
o The probability of observing a sample value as extreme as, or more extreme than, the
value observed, given that the null hypothesis is true.
o If p-value ≤ α, H0 is rejected. Otherwise, we fail to reject H0.
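As an illustration of the p-value rule, the sketch below computes the p-value for the goodness-of-fit data of Example 8. It relies on a special case: for df = 2 the upper-tail probability of the chi-square distribution is exactly exp(–χ²/2). For other degrees of freedom a statistics library or table would be needed. The function name `chi2_p_value_df2` is hypothetical.

```python
import math


def chi2_p_value_df2(chi2):
    """Upper-tail p-value of a chi-square statistic with df = 2.

    With df = 2 the chi-square distribution is exponential with mean 2,
    so P(X >= chi2) = exp(-chi2 / 2). This shortcut holds only for df = 2.
    """
    return math.exp(-chi2 / 2)


# Example 8 data: observed counts and the proportions stated in H0.
observed = [320, 120, 60]
expected = [500 * p for p in (0.60, 0.30, 0.10)]
chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

alpha = 0.05
p_value = chi2_p_value_df2(chi2)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.4f}, reject H0: {p_value <= alpha}")
```

The p-value rule and the critical-value rule give the same decision: the statistic exceeds the tabled value for 0.01 (9.210) but not for 0.005 (10.597), so the p-value lies between 0.005 and 0.01.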
Chapter 8 Cross Tabulation
8.1 Introduction
• A technique for analysing the relationship between two or more nominal or ordinal
variables that have been organized in a table.
• A type of bivariate analysis, a statistical method designed to detect and describe the
relationship between two nominal or ordinal variables.
8.2 Bivariate Table
• Contingency table, a joint frequency distribution of two nominal or ordinal variables.
• r × c table
o r : number of rows
o c : number of columns.
• Characteristics:
o Title: Description of the variables
o Column Variable: Independent variable
o Row Variable: Dependent variable
o Order the categories from lowest to highest: from left to right across the columns;
from top to bottom along the rows.
o Cell: Intersection of a row and a column
o Marginal: Row and column totals
o Source of Data
• E.g., 2 × 2 Contingency Table
Dependent Variable (row variable) | Independent Variable (column variable): I1 | I2 | Total
D1 | a | b | a+b
D2 | c | d | c+d
Total | a+c | b+d | a+b+c+d
The entries a, b, c and d are cells; the row and column totals are marginals.
• Two basic rules:
1. Calculate percentages within each category of the IV.
Independent Variable | Dependent Variable: D1 | D2 | Total
I1 | $\frac{a}{a+c} \times 100\%$ | $\frac{c}{a+c} \times 100\%$ | 100% (Total I1)
I2 | $\frac{b}{b+d} \times 100\%$ | $\frac{d}{b+d} \times 100\%$ | 100% (Total I2)
2. Interpret the table by comparing the percentage point difference for different categories
of the independent variable.
o Limit comparisons to categories with a difference of at least 10 percentage points.
o For a 2 × 2 table, only one comparison is needed for interpretation.
8.3 Properties of a Bivariate Relationship
1. Existence of a relationship
• Percentage distributions vary across the different categories of the independent
variable.
2. Strength of the relationship
• The larger the percentage difference across the categories, the stronger the association.
• Percentage differences are a rough indicator of the strength of a relationship between
two variables.
3. Direction of the relationship
• Applicable to variables measured at the ordinal or interval-ratio level.
• Positive relationship: the variables vary in the same direction (both go up or both go down).
• Negative relationship: the variables vary in opposite directions (when one goes up, the other goes
down).
Example 1
Referring to the following bivariate table, describe the relationship between race and home
ownership.
Home Ownership by Race
Home Ownership | Race: Black | White | Total
Own | 3 | 7 | 10
Rent | 6 | 4 | 10
Total | 9 | 11 | 20
Let Race be the independent variable.
Home Ownership | Race: Black | White | Total
Own | 33% | 64% | 50%
Rent | 67% | 36% | 50%
Total (N) | 100% (9) | 100% (11) | 100% (20)
There is a 31 percentage point difference between the percentage of white homeowners
(64%) and black homeowners (33%).
In other words, in this group, whites are more likely to be homeowners than blacks.
Therefore, we can conclude that one’s race appears to be associated with the likelihood of being
a homeowner.
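The two rules for reading a bivariate table can be sketched in code using the counts from Example 1. The dictionary layout and variable names are illustrative choices, not part of the original notes.

```python
# Counts from "Home Ownership by Race": rows are the dependent variable,
# columns are the categories of the independent variable.
counts = {
    "Own": {"Black": 3, "White": 7},
    "Rent": {"Black": 6, "White": 4},
}

races = ["Black", "White"]
column_totals = {race: sum(row[race] for row in counts.values()) for race in races}

# Rule 1: percentages are calculated within each category of the IV.
percentages = {
    outcome: {race: 100 * row[race] / column_totals[race] for race in races}
    for outcome, row in counts.items()
}

# Rule 2: compare percentage points across the IV categories
# (rounded to whole percentages first, as in the table).
difference = round(percentages["Own"]["White"]) - round(percentages["Own"]["Black"])
for outcome, row in percentages.items():
    print(outcome, {race: f"{pct:.0f}%" for race, pct in row.items()})
print(f"Difference in home ownership: {difference} percentage points")
```

Dividing by the column totals (9 and 11) rather than the row totals is what makes the two racial groups comparable despite their different sizes.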
Example 2
Analyse the following bivariate table to examine whether the frequency of church attendance
by respondents had an effect on their support for abortion. Support for abortion was measured
with the following question: “Please tell me whether or not you think it should be possible for
a pregnant woman to obtain a legal abortion if the woman wants it for any reason.”
Frequency of church attendance was determined by asking respondents to indicate how often
they attend religious services.
Support for Abortion by Church Attendance
Abortion | Church Attendance: Never | Infrequently | Frequently | Total
Yes | 55% | 50% | 26% | 43%
No | 45% | 50% | 74% | 57%
Total (N) | 100% (111) | 100% (212) | 100% (157) | 100% (480)
Let the hypothesis be that those who attend church frequently are more likely to be pro-life.
We may observe that the percentage that supports abortion changes across
Thus, the table indicates
Besides that, the largest percentage difference between respondents who
The differences indicate a
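The percentage point comparisons for Example 2 can be listed with a short sketch. It takes the "Yes" percentages from the table and applies the 10-point guideline from Section 8.2; the variable names are illustrative.

```python
from itertools import combinations

# Percentage answering "Yes" in each church-attendance category (Example 2).
support = {"Never": 55, "Infrequently": 50, "Frequently": 26}

# Percentage point difference for every pair of IV categories.
differences = {
    (a, b): abs(support[a] - support[b]) for a, b in combinations(support, 2)
}
largest_pair = max(differences, key=differences.get)

for (a, b), diff in differences.items():
    flag = "" if diff >= 10 else "  (below the 10-point guideline)"
    print(f"{a} vs {b}: {diff} percentage points{flag}")
print("Largest difference:", largest_pair, differences[largest_pair])
```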
Example 3
Describe the direction of the relationship based on the following bivariate tables.
i) Health Condition by Social Class
Health | Class: Low | Middle | High
Poor | 39% | 12% | 9%
Fair | 36% | 45% | 28%
Good | 25% | 43% | 63%
Total (N) | 100% (39) | 100% (254) | 100% (202)
* As “class” goes up “health” goes up.
There is a positive relationship between social class and health condition.
ii) Frequency of Trauma by Social Class
Trauma | Class: Low | Middle | High
0 | 31% | 41% | 48%
1 | 22% | 42% | 20%
2+ | 47% | 17% | 32%
Total (N) | 100% (48) | 100% (220) | 100% (180)
* As “class” goes up “trauma” goes down.
There is a negative relationship between social class and frequency of trauma.