Uploaded by Marivic Tanola

QUARTER4-STATISTICS-LECTURE-NOTES

STATISTICS AND PROBABILITY LECTURE NOTES
(Prepared by: Marivic D. Taňola)
NAME: ____________________________________
4th QUARTER – Week 1:
HYPOTHESIS TESTING
JUST REFLECT
• Sometimes we hear claims on social media that we find unbelievable, such as a whitening
product advertisement stating that if you use the product, your skin will become as white as
snow.
• The weatherman stating that there is a 90% chance of rain tomorrow.
We might feel compelled to challenge such claims. To challenge claims, we must run a
research study on a sample (since surveying the entire population would be impossible).
To test a claim, you must write two hypotheses.
Hypothesis testing is a decision-making process for evaluating claims about a population.
• The null hypothesis (Ho) basically says, “The population is like this.” It states, in formal
terms, that the population is no different than usual.
• The alternative hypothesis (Ha) says, “The population is like something else.” It states that
the population is different from the usual, that something has happened to this population, and
as a result it has a different mean or a different shape than the usual case.
In order to state the hypothesis correctly, the researcher must translate the claim into
mathematical symbols. There are three possible sets of statistical hypotheses.
TWO-TAILED TEST
1. Ho : parameter = specific value
Ha : parameter ≠ specific value
LEFT-TAILED TEST
2. Ho : parameter = specific value
Ha : parameter < specific value
RIGHT-TAILED TEST
3. Ho : parameter = specific value
Ha : parameter > specific value
In hypothesis testing, there are four possible outcomes.
                     Ho is True          Ho is False
Reject Ho            Type I Error        Correct Decision
Do not Reject Ho     Correct Decision    Type II Error
• A type I error occurs if one rejects the null hypothesis when it is true.
• A type II error occurs if one does not reject the null hypothesis when it is false.
The decision is made based on probabilities: “How large a difference is necessary to reject
the null hypothesis?” Here is where the level of significance is used.
The level of significance is the maximum probability of committing a type I error. This
probability is symbolized by α (Greek letter alpha). That is, P(type I error) = α. The probability
of a type II error is symbolized by β (Greek letter beta). That is, P(type II error) = β, although in
most hypothesis testing situations, β cannot be computed.
Generally, statisticians agree on using three arbitrary significance levels: 0.10, 0.05,
and 0.01. That is, when the null hypothesis is true, the probability of rejecting it (a type I
error) will be 10%, 5%, or 1%, and the probability of a correct decision will be 90%, 95%, or 99%,
depending on which level of significance is used. In other words, when α = 0.05, there is a 5%
chance of rejecting a true null hypothesis.
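The meaning of α can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the population (mean 100, standard deviation 15), the sample size, and the function name `type_i_error_rate` are hypothetical and do not come from these notes. It draws many samples from a population where Ho is actually true and counts how often a two-tailed z-test rejects Ho anyway.

```python
import random
from statistics import NormalDist, mean

def type_i_error_rate(alpha=0.05, n=30, trials=4000, seed=1):
    """Estimate how often a two-tailed z-test rejects a TRUE null hypothesis."""
    rng = random.Random(seed)
    mu0, sigma = 100.0, 15.0  # hypothetical population; Ho: mu = 100 is true
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # positive critical value
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (mean(sample) - mu0) / (sigma / n ** 0.5)
        if abs(z) > z_crit:  # test value falls in the rejection region
            rejections += 1
    return rejections / trials

rate = type_i_error_rate()
print(f"Estimated P(type I error): {rate:.3f}")
```

The printed proportion should be close to 0.05, which is what "a 5% chance of rejecting a true null hypothesis" means in practice.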
You can reflect on these figures, which are commonly used in hypothesis testing in research:
• After a significance level is chosen, a critical value is selected from a table for the
appropriate test.
• The critical value determines the critical and non-critical regions.
• The critical region, or rejection region, is the range of values of the test value that
indicates that there is a significant difference and that the null hypothesis should be
rejected.
• The non-critical, or non-rejection, region is the range of values of the test value that
indicates that the difference was probably due to chance and that the null hypothesis should
not be rejected.
If the test is two-tailed, there are two critical values, one negative and one positive. If the
test is left-tailed, the critical value will be negative. If the test is right-tailed, the critical
value will be positive.
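The signs of the critical values can be seen by computing them instead of reading them from a table. This is a minimal sketch for the z-test only, using Python's standard `statistics.NormalDist`; the function name `z_critical` and the tail labels are assumptions made for this illustration.

```python
from statistics import NormalDist

def z_critical(alpha, tail):
    """Return the critical z value(s) for a given significance level and tail."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "left":
        return (std_normal.inv_cdf(alpha),)        # negative value
    if tail == "right":
        return (std_normal.inv_cdf(1 - alpha),)    # positive value
    if tail == "two":
        upper = std_normal.inv_cdf(1 - alpha / 2)  # alpha is split between both tails
        return (-upper, upper)
    raise ValueError("tail must be 'left', 'right', or 'two'")

for tail in ("left", "right", "two"):
    values = ", ".join(f"{v:.3f}" for v in z_critical(0.05, tail))
    print(f"{tail}-tailed, alpha = 0.05: {values}")
```

At α = 0.05 this gives about -1.645 for a left-tailed test, +1.645 for a right-tailed test, and ±1.960 for a two-tailed test.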
JUST LEARN
A hypothesis is essentially an idea about the population that you think might be true, but which
you cannot prove to be true. While you usually have good reasons to think it is true, and you
often hope that it is true, you need to show that the sample data support your idea.
In hypothesis testing the following steps should be considered:
1. State the null and alternative hypotheses.
2. Select the level of significance.
3. Determine the critical value and the rejection region/s.
4. State the decision rule.
5. Compute the test statistic.
6. Make a decision, whether to reject or not to reject the null hypothesis.
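The six steps can be sketched as one small function. This is an illustration only, assuming a z-test on a population mean with a known population standard deviation; the function name `one_sample_z_test` and the numbers in the usage example (claimed mean 50, sample mean 52.5, σ = 8, n = 64) are hypothetical and are not taken from these notes.

```python
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05, tail="two"):
    """Follow the six steps for a z-test on a mean; return (test value, reject?)."""
    # Step 1: Ho: mu = mu0; Ha uses <, > or != depending on `tail`.
    # Step 2: the level of significance is `alpha`.
    # Step 3: critical value(s) that bound the rejection region.
    std_normal = NormalDist()
    lower = upper = None
    if tail == "left":
        lower = std_normal.inv_cdf(alpha)
    elif tail == "right":
        upper = std_normal.inv_cdf(1 - alpha)
    elif tail == "two":
        upper = std_normal.inv_cdf(1 - alpha / 2)
        lower = -upper
    else:
        raise ValueError("tail must be 'left', 'right', or 'two'")
    # Step 4: decision rule: reject Ho if the test value is beyond a critical value.
    # Step 5: compute the test statistic.
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    # Step 6: make the decision.
    reject = (lower is not None and z < lower) or (upper is not None and z > upper)
    return z, reject

z, reject = one_sample_z_test(52.5, mu0=50, sigma=8, n=64, alpha=0.05, tail="right")
print(f"z = {z:.2f}; decision: {'reject Ho' if reject else 'do not reject Ho'}")
```

In this made-up example the test value is 2.50, which lies beyond the right-tailed critical value of about 1.645, so Ho is rejected.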
JUST EVALUATE
4th QUARTER – Week 2:
JUST REFLECT
You can reflect on these statements, which are commonly used in research.
• The symbol ≠ in the alternative hypothesis suggests either a greater than ( > ) relation or a
less than ( < ) relation.
• When the alternative hypothesis utilizes the ≠ symbol, the test is said to be non-directional.
It is also called a two-tailed test.
• When the alternative hypothesis utilizes the > or the < symbol, the test is said to be
directional. It may be called either right-tailed ( > ) or left-tailed ( < ).
These are the graphical representations of two-tailed test and the one-tailed test:
JUST LEARN:
JUST EVALUATE
4th QUARTER – Week 3:
JUST RECALL AND REFLECT:
Directions: Choose the letter that corresponds to your answer. Write your answer on a
separate sheet.
1. Which of the following is a Null Hypothesis test formula?
A. Test statistic B. Population statistic C. Variance statistic D. Null statistic
2. If the hypothesis contains the greater than symbol (>) the rejection region is ______.
A. Left-tailed B. Right-tailed C. Center-tailed D. Cross-tailed
3. If the hypothesis contains the less than symbol (<) the rejection region is ____.
A. Center tailed B. Right tailed C. Left tailed D. Cross tailed
4. What is the main purpose of hypothesis testing?
A. Test how far the mean of a sample is from zero.
B. Determine whether a statistical result is significant.
C. Determine the appropriate value of the significance level.
D. Derive the standard error of the data.
5. What do you call a population for testing purposes?
A. Statistic B. Level of Significance C. Hypothesis D. Test-Statistic
• The rejection region (RR) specifies the values of the test statistic for which the null
hypothesis is rejected in favor of the alternative hypothesis.
• If the computed value of the test statistic falls in the RR, we reject the null hypothesis (Ho)
and accept the alternative hypothesis (H1).
• If the value of the test statistic does not fall into the rejection (critical) region, we accept Ho.
The region other than the rejection region is the acceptance region.
• Typical values for α are 0.01, 0.05 and 0.1. It is a value that we select based on the certainty
we need. In most cases, the choice of α is determined by the context we are operating in,
but 0.05 is the most commonly used value.
JUST LEARN:
DO IT IN A GROUP:
1. Directions: Briefly answer the Self-Assessment Questions (SAQ) below.
SAQ 1: When do we accept the Null Hypothesis?
SAQ 2: When do we reject the Null Hypothesis?
2. Directions: Identify the Rejection Region.
PROBLEM 1. Professor Balenciaga has reported her students’ grades for several semesters,
and the average of all these grades is 83. Her new class of 28 students seems to be above
average in ability, and she wants to demonstrate that the current class is superior to the
previous classes according to their average. Is there sufficient evidence, given the class
average of 86.2 and the standard deviation of 12, to support her argument that the current
class is superior? Use the 0.05 significance level.
PROBLEM 2. Professor Balenciaga has reported her students’ grades for several semesters,
and the average of all these grades is 83. Her new class of 30 students seems to be above
average in ability, and she wants to demonstrate that the current class is superior to the
previous classes according to their average. Is there sufficient evidence, given the class
average of 86.2 and the standard deviation of 12, to support her argument that the current
class is superior? Use the 0.05 significance level.
JUST EVALUATE:
Directions: Choose the letter that corresponds to your answer. Write your answer on a
separate sheet.
1. Null hypothesis is rejected as direct evidence that the alternative hypothesis is:
a. True b. False c. Either d. Neither
2. One or two tailed tests will determine ________.
a. that the hypothesis has one or two conclusions.
b. the two values of the sample need to be rejected.
c. the rejection region is located in one or two tails of the distribution.
d. the rejection region is located in one tail of the distribution.
4th QUARTER – Week4:
JUST RECALL AND REFLECT
A test statistic is a value computed from the sample data. The test statistic is used to assess
the evidence for rejecting or not rejecting the null hypothesis. Each test statistic is used for a
different test.
JUST LEARN
HYPOTHESIS TESTING ON A POPULATION MEAN
STEP 6: Draw the appropriate conclusion. Since H0 is rejected, there is enough evidence to
support the claim that college students watch less television than the general public.
JUST LEARN WITH THE GROUP
JUST EVALUATE
For items 4 and 5, refer to the following information:
Previously, an organization reported that teenagers spent 4.5 hours per week, on
average, on the phone. The organization thinks that, currently, the mean is higher. Fifteen (15)
randomly chosen teenagers were asked how many hours per week they spend on the phone.
The sample mean was 4.75 hours with a sample standard deviation of 2.0. Conduct a
hypothesis test.
4. The null and alternate hypotheses are:
(a) Ho: x̄ = 4.5, Ha: x̄ > 4.5
(b) Ho: μ ≥ 4.5, Ha: μ < 4.5
(c) Ho: μ = 4.75, Ha: μ > 4.75
(d) Ho: μ = 4.5, Ha: μ > 4.5
5. At a significance level of α = 0.05, what is the correct conclusion?
(a) There is enough evidence to conclude that the mean number of hours is more
than 4.75.
(b) There is enough evidence to conclude that the mean number of hours is more
than 4.5.
(c) There is not enough evidence to conclude that the mean number of hours is
more than 4.5.
(d) There is not enough evidence to conclude that the mean number of hours is
more than 4.75.
4th QUARTER – Week 5
JUST RECALL AND REFLECT
JUST LEARN
6. Decision
• If we reject H0, we can conclude that HA is true.
• If, however, we do not reject H0, we may conclude that H0 may be true.
Decision rule using the p-value:
• If the p-value is less than or equal to α, we reject the null hypothesis (p ≤ α).
• If the p-value is greater than α, we do not reject the null hypothesis (p > α).
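The p-value decision rule can be sketched in a few lines. This is an illustration for a z-test only; the function names `p_value_from_z` and `decide`, and the test value z = -2.10, are hypothetical and are not taken from these notes.

```python
from statistics import NormalDist

def p_value_from_z(z, tail):
    """p-value of a z test value for a left-, right-, or two-tailed test."""
    cdf = NormalDist().cdf  # area to the left of z under the standard normal curve
    if tail == "left":
        return cdf(z)
    if tail == "right":
        return 1 - cdf(z)
    if tail == "two":
        return 2 * (1 - cdf(abs(z)))
    raise ValueError("tail must be 'left', 'right', or 'two'")

def decide(p, alpha):
    """Apply the rule: reject Ho when p <= alpha, otherwise do not reject Ho."""
    return "reject Ho" if p <= alpha else "do not reject Ho"

p = p_value_from_z(-2.10, "left")  # hypothetical left-tailed test value
print(f"p = {p:.4f}")
print("alpha = 0.05:", decide(p, 0.05))
print("alpha = 0.01:", decide(p, 0.01))
```

The same test value leads to different decisions at α = 0.05 and α = 0.01, which is why the level of significance must be chosen before the decision is made.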
SHARE INSIGHTS (BY GROUP)
JUST EVALUATE
Directions: Study the problem and answer the task given.
PROBLEM 1: A company manufactures calculators with an average mass of 500 g. An
engineer believes the average mass to be different and decides to calculate the average
mass of 60 calculators.
TASK: State the null and alternative hypotheses.
Ho:
H1:
PROBLEM 2: Reyes performed a study to validate a translated version of the Western
Mindanao State University (WMSU) questionnaire used with English-speaking patients with
hip or knee osteoarthritis. For the 76 women classified with severe hip pain, the WMSU mean
function score was 70.7 with a standard deviation of 14.6. We wish to know if we may conclude
that the mean function score for a population of similar women subjects with severe hip pain is
less than 75. Let α = 0.01.
TASK: Perform hypothesis testing by following the steps below.
1. Data:
2. Assumption:
3. Hypothesis:
4. Test Statistic:
5. Decision Rule:
6. Decision:
4th QUARTER – Week 6
Test Statistic for Population Proportion
JUST LEARN:
Step 5: Make a statistical decision.
Since the computed test statistic z = −2.133 falls in the rejection region, reject the null
hypothesis.
Step 6: Draw the appropriate conclusion.
Since H0 is rejected, there is enough evidence to conclude that the percentage of voters
for the administration candidate is different from 65%.
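A test value for a population proportion can be computed as in the sketch below. It assumes the usual large-sample formula z = (p̂ − p0) / √(p0(1 − p0)/n). The sample figures (116 of 200 voters) are hypothetical and are not the data behind the z = −2.133 example above; only the claimed proportion of 65% is taken from that example.

```python
from statistics import NormalDist

def proportion_z_test(successes, n, p0, alpha=0.05):
    """Two-tailed z-test for a population proportion; returns (z, reject?)."""
    p_hat = successes / n                      # sample proportion
    standard_error = (p0 * (1 - p0) / n) ** 0.5
    z = (p_hat - p0) / standard_error          # test statistic
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z, abs(z) > z_crit                  # reject Ho if z is in either tail

# Hypothetical survey: 116 of 200 voters favor the candidate; Ho: p = 0.65.
z, reject = proportion_z_test(116, 200, p0=0.65)
print(f"z = {z:.3f}; decision: {'reject Ho' if reject else 'do not reject Ho'}")
```

With these made-up numbers the test value is about −2.08, which is beyond −1.96, so Ho: p = 0.65 would be rejected at α = 0.05.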
4th QUARTER – Week 7
BIVARIATE DATA AND SCATTERPLOT
JUST RECALL AND REFLECT
Have you ever wondered whether tall people have longer arms than short people? We’ll
explore this question by collecting data on two variables — height and arm span (measured
from left fingertip to right fingertip).
• Do people with above-average arm spans tend to have above-average heights?
• Do people with below-average arm spans tend to have below-average heights?
Directions: Study the table given and answer the questions that follow.
Person Number    1    2    3    4    5    6    7    8    9   10   11   12
Arm Span       156  157  159  160  161  161  162  165  170  170  173  173
Height         162  160  162  155  160  162  170  166  170  167  185  176
The methods we employ to measure the relationship between two variables depend on the
type of variables we are dealing with; that is, they depend on whether the data are numerical
or categorical. There are ways of measuring the relationship between the following pairs of
variables:
• a numerical variable and categorical variable (for example, height and nationality)
• two categorical variables (for example, gender and religious denomination)
• two numerical variables (for example, height and weight)
In a relationship between two variables, if the values of one variable ‘depend’ on the
values of another variable, then the former variable is referred to as the dependent variable
and the latter variable is referred to as the independent variable.
BIVARIATE DATA - consist of two (2) variables, which can be independent or dependent.
The independent variable is the variable that can cause the dependent variable to change.
The dependent variable is the variable that is influenced or affected by the independent
variable.
It is useful to identify the independent and dependent variables where possible, since the
usual practice when displaying data on a graph is to place the independent variable on the
horizontal axis and the dependent variable on the vertical axis.
EXAMPLE 1.
You want to test a new dosage of a drug that supposedly prevents sneezing in people
allergic to flowers.
• Variable on the x-axis: new dosage of drug
• Variable on the y-axis: sneezing
EXAMPLE 2.
A soap manufacturer wants to prove that a small amount of detergent can remove a
greater amount of stain.
• Variable on the x-axis: amount of detergent
• Variable on the y-axis: amount of stain removed
SCATTERPLOT – a diagram that is used to show the degree and pattern of the
relationship between two (2) sets of data. Scatterplots are constructed on the coordinate
plane, and each data point on a scatterplot represents two (2) values.
A scatterplot is used to determine if there is a relationship between two numerical
variables. Data are collected on the two variables and often displayed in a table of ordered
pairs. A scatterplot is a graph of the ordered pairs of numbers. Each ordered pair is a dot on
the graph.
PATTERNS OF DATA IN SCATTERPLOT
1. SHAPE (FORM) - Refers to whether a data pattern is linear (straight) or nonlinear
(curved).
LINEAR FORM
If the points seem to approximate a straight line, the association is a linear form.
NON-LINEAR FORM
If the points seem to approximate a curve, the association is a non-linear form.
FORM OF AN ASSOCIATION
• Linear form – when the points tend to follow a straight line.
• Non-linear form – when the points tend to follow a curved line.
2. TREND (DIRECTION) - Refers to the direction of change in variable Y when variable X gets
bigger. If variable Y also gets bigger, the slope is positive; but if variable Y gets smaller, the
slope is negative.
POSITIVE
Positive association exists between the variables if the gradient of the line is positive, that
is, the dots on the scatterplot tend to go up as we go from left to right.
NEGATIVE
Negative association exists between the variables if the gradient of the line is negative, that
is, the dots on the scatterplot tend to go down as we go from left to right.
DIRECTION OF AN ASSOCIATION
• Positive – gradient of the line is positive.
• Negative – gradient of the line is negative.
3. VARIATION (STRENGTH) - Refers to the degree of “scatter” in the plot. If the dots are widely
spread, the relationship between the variables is weak. If the dots are concentrated
around a line, the relationship is strong.
STRONG
In strong association, the dots tend to follow a single stream. A pattern is clearly seen. There
is only a small amount of scatter in the plot.
MODERATE
In moderate association, the amount of scatter in the plot increases and the pattern becomes
less clear. This indicates that the association is less strong.
WEAK
In weak association, the amount of scatter increases further, and the pattern becomes even
less clear. Linear form is less evident.
STRENGTH OF AN ASSOCIATION
• Strong – small amount of scatter in the plot.
• Moderate – modest amount of scatter in the plot.
• Weak – large amount of scatter in the plot.
EXAMPLE 3.
Determine the relationship between height and arm span. The data collected on these
variables are shown in the table of ordered pairs.
Height (cm)      172 159 178 162 156 174 151 162 165 185 186 176 166 180 158
Arm Span (cm)    172 162 182 164 159 180 151 165 168 189 188 184 167 184 161
Each person has two numerical variables, height and arm span. To construct a scatterplot,
draw a number plane with the height on the horizontal axis and arm span on the vertical axis.
Plot each ordered pair as a dot. The scatterplot shows there is a relationship between these
variables.
JUST EVALUATE
Directions: Construct a scatterplot using the tables and describe the a. shape (form), b. trend
(direction), and c. strength (variation).
4th QUARTER – Week 8
THE PEARSON PRODUCT-MOMENT CORRELATION
JUST RECALL AND REFLECT
Directions: Identify the direction and the strength of each correlation given. Choose your
answer from the box.
a. Strong positive correlation
b. Moderate positive correlation
c. No correlation
d. Moderate negative correlation
e. Strong negative correlation
f. Perfect correlation
TASK: Research the life of Karl Pearson and his important contributions to the field of
statistics. Do not forget to copy and study the formula he proposed for computing the
coefficient of correlation (r).
The correlation coefficient, computed from the sample data, measures the strength and
direction of a linear relationship between two variables. The strength of correlation is indicated
by the coefficient of correlation. There are several coefficients of correlation. The one most
commonly used in linear correlation is the Pearson Product-Moment coefficient of correlation,
symbolized by r, named in honor of Karl Pearson, the statistician who did a lot of research in
this area.
Where:
r is the Pearson correlation coefficient, which indicates the degree of relationship between
the two sets of values,
X is the values in the first set of data,
Y is the values in the second set of data, and
n is the total number of values/data pairs.
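Assuming the notes use the standard computational form of Pearson's r (it matches the xy, x², and y² columns built in the worked example in Lesson 2), the formula these symbols belong to can be written as:

```latex
r = \frac{n\sum XY - \left(\sum X\right)\left(\sum Y\right)}
         {\sqrt{\left[\,n\sum X^{2} - \left(\sum X\right)^{2}\right]
                \left[\,n\sum Y^{2} - \left(\sum Y\right)^{2}\right]}}
```

Here ∑XY is the sum of the products of paired values, ∑X² and ∑Y² are the sums of the squared values, and n is the number of data pairs.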
Analyze the diagram below:
The Pearson correlation coefficient, r, can take a range of values from +1 to -1.
• A value greater than 0 indicates a positive correlation; that is, as the value of one
variable increases, so does the value of the other variable.
• A value less than 0 indicates a negative correlation; that is, as the value of one
variable increases, the value of the other variable decreases.
• A value of 0 indicates that there is no linear correlation between the two variables.
• The direction of the points scattered tells the direction of correlation that exists
between the variables.
Explore the Correlation Scale.
The stronger the association of the two variables, the closer the Pearson correlation
coefficient, r, will be to either +1 or -1 depending on whether the relationship is positive or
negative, respectively. See table below (Table of range of values).
PEARSON R                 QUALITATIVE DESCRIPTION
±1                        Perfect
±0.75 to < ±1             Very high
±0.50 to < ±0.75          Moderately high
±0.25 to < ±0.50          Moderately low
> 0 to < ±0.25            Very low
0                         No correlation
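The table of ranges can be turned into a small lookup function. This is only a sketch of the scale above; the function name `describe_r` and the wording of the returned labels are assumptions made for this illustration.

```python
def describe_r(r):
    """Give the qualitative description of a Pearson r using the table of ranges."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and +1")
    size = abs(r)
    if size == 0:
        return "No correlation"
    if size == 1:
        strength = "Perfect"
    elif size >= 0.75:
        strength = "Very high"
    elif size >= 0.50:
        strength = "Moderately high"
    elif size >= 0.25:
        strength = "Moderately low"
    else:
        strength = "Very low"
    direction = "positive" if r > 0 else "negative"  # the sign gives the direction
    return f"{strength} {direction} correlation"

for r in (1, 0.82, 0.53, -0.30, -0.10, 0):
    print(r, "->", describe_r(r))
```

The absolute value of r sets the strength and its sign sets the direction, so -0.30 is described as moderately low and negative.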
Different relationships and their correlation coefficients are shown in the diagram below:
• Achieving a value of +1 or -1 means that all your data points are included on the line
of best fit; there are no data points that show any variation away from this line.
• Values for r between +1 and -1 (for example, r = 0.7 or -0.3) indicate that there is
variation around the line of best fit.
• The closer the value of r is to 0, the greater the variation around the line of best fit. The
value of r indicates the closeness of the points to the trend line. The closer the points are to
the trend line, the stronger the relationship is.
LESSON 2
Correlation coefficient formula is used to find how strong a relationship is between
data. The formula returns a value between -1 and 1, where:
• 1 indicates a strong (perfect) positive relationship.
• -1 indicates a strong (perfect) negative relationship.
• A result of zero indicates no linear relationship at all.
Meaning
✓ A correlation coefficient of 1 means that for every positive increase in one variable, there
is a positive increase of a fixed proportion in the other. For example, shoe sizes go up in
(almost) perfect correlation with foot length.
✓ A correlation coefficient of -1 means that for every positive increase in one variable,
there is a decrease of a fixed proportion in the other. For example, the amount of
gas in a tank decreases in (almost) perfect correlation with speed.
✓ Zero means that for every increase in one variable, there is no consistent increase or
decrease in the other. The two just aren’t related.
Let’s find the value of the correlation coefficient from the table below.
SUBJECT    AGE X    GLUCOSE LEVEL Y
1          43       99
2          21       65
3          25       79
4          42       75
5          57       87
6          59       81
STEP 1: Make a chart. Use the given data, and add three more columns: xy, x², and y².
Subject   Age x   Glucose level y   xy   x²   y²
1         43      99
2         21      65
3         25      79
4         42      75
5         57      87
6         59      81
STEP 2: Multiply x and y together to fill the xy column. For example, row 1 would be 43 × 99 =
4,257.
Subject   Age x   Glucose level y   xy     x²   y²
1         43      99                4257
2         21      65                1365
3         25      79                1975
4         42      75                3150
5         57      87                4959
6         59      81                4779
STEP 3: Take the square of the numbers in the x column and put the results in the x² column.
Subject   Age x   Glucose level y   xy     x²     y²
1         43      99                4257   1849
2         21      65                1365   441
3         25      79                1975   625
4         42      75                3150   1764
5         57      87                4959   3249
6         59      81                4779   3481
STEP 4: Take the square of the numbers in the y column and put the results in the y² column.
Subject   Age x   Glucose level y   xy     x²     y²
1         43      99                4257   1849   9801
2         21      65                1365   441    4225
3         25      79                1975   625    6241
4         42      75                3150   1764   5625
5         57      87                4959   3249   7569
6         59      81                4779   3481   6561
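Adding up each column gives ∑x = 247, ∑y = 486, ∑xy = 20,485, ∑x² = 11,409, and
∑y² = 40,022. Substituting these sums and n = 6 into the Pearson formula gives r ≈ 0.5298.
The range of the correlation coefficient is from -1 to 1. Our result is 0.5298, which
means the relationship between the variables is a moderate positive correlation.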
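The result can be checked by letting a short program build the same columns and sums. This is a sketch assuming the computational form of Pearson's r (the one based on ∑x, ∑y, ∑xy, ∑x², and ∑y²); the function name `pearson_r` is chosen for this illustration, and the data are the age and glucose values from the table.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r from the column sums, as in the chart of x, y, xy, x², y²."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))  # xy column
    sum_x2 = sum(x * x for x in xs)              # x² column
    sum_y2 = sum(y * y for y in ys)              # y² column
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

age = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]
r = pearson_r(age, glucose)
print(round(r, 4))  # 0.5298
```

The same function can be reused for any table of paired values, as long as both lists have the same length and neither variable is constant.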
Assumptions
For the Pearson r correlation, both variables should be normally distributed (normally
distributed variables have a bell-shaped curve). Other assumptions include linearity and
homoscedasticity. Linearity assumes a straight-line relationship between each of the two
variables and homoscedasticity assumes that data are equally distributed about the
regression line.
JUST EVALUATE
I. Directions: Calculate r and make a generalization regarding the information that you
get from the computed correlation coefficient for each of the following:
a. ∑X = 225    ∑Y = 22      ∑X² = 9653    ∑Y² = 143       ∑XY = 651     n = 6
b. ∑X = 32     ∑Y = 1105    ∑X² = 220     ∑Y² = 364525    ∑XY = 3402    n = 6
c. ∑X = 180    ∑Y = 147     ∑X² = 6914    ∑Y² = 5273      ∑XY = 4013    n = 7
II. Directions: Solve the Problem.
The following are the heights of fathers and their eldest sons, in inches:
Heights of the Father        71 69 67 68 68 66 70 72 65 60
Heights of the Eldest Son    71 69 69 65 66 63 68 70 60 58
QUESTION: Do the data support the hypothesis that height is hereditary? Explain.
Accompany your explanation with statistical computations.
III. Directions: Read each statement carefully and choose the best answer. For items
1 – 5, complete the table below.
Consider the scores obtained in Math (X) and Statistics (Y) subjects by 10 students.
Observation   Math Score (X)   Stat Score (Y)   X²    Y²    XY
1             5                2                25    4     10
2             8                7                64    49    56
3             10               8                100   64    80
4             12               9                144   81    108
5             12               10               144   100   120
6             14               12               196   144   168
7             15               14               225   196   210
8             16               10               256   100   160
9             18               16               324   256   288
10            20               12               400   144   240
Sum
1. The ∑X² is equal to ________.
a. 1118 b. 1138 c. 1878 d. 1873
2. Find ∑XY.
a. 1440 b. 1040 c. 1400 d. 1140
3. How many respondents are being observed?
a. 20 b. 12 c. 10 d. 6
4. Based on the given data, solve for the Pearson’s correlation coefficient.
a. 0.78 b. 0.87 c. 0.86 d. 0.76
5. Evaluate what conclusion can be derived from the result of r obtained in the data.
a. There is no relationship between math scores and statistics scores of the
students.
b. There is a strong negative relationship between math scores and statistics scores of
the students.
c. There is a moderately positive relationship between math scores and statistics
scores of the students.
d. There is a strong positive relationship between math scores and statistics scores of
the students.