STATISTICS AND PROBABILITY LECTURE NOTES
(Prepared by: Marivic D. Taňola)
NAME: ____________________________________

4th QUARTER – Week 1: HYPOTHESIS TESTING

JUST REFLECT
• Sometimes we hear claims on social media that we find unbelievable, such as a whitening product advertisement stating that if you use the product, your skin will become as white as snow, or a weatherman stating that there is a 90% chance of rain tomorrow. We might feel compelled to challenge such claims.
• To challenge a claim, we must run a research study on a sample (since surveying the entire population would be impossible). To test a claim, you must write two hypotheses.

Hypothesis testing is a decision-making process for evaluating claims about a population.
• Null hypothesis (Ho) is basically, "The population is like this." It states, in formal terms, that the population is no different than usual.
• Alternative hypothesis (Ha) is, "The population is like something else." It states that the population is different than usual, that something has happened to this population, and as a result it has a different mean or a different shape than the usual case.

In order to state the hypotheses correctly, the researcher must translate the claim into mathematical symbols. There are three possible sets of statistical hypotheses.

1. TWO-TAILED TEST
   Ho : parameter = specific value
   Ha : parameter ≠ specific value
2. LEFT-TAILED TEST
   Ho : parameter = specific value
   Ha : parameter < specific value
3. RIGHT-TAILED TEST
   Ho : parameter = specific value
   Ha : parameter > specific value

In hypothesis testing, there are four possible outcomes.

                 Reject Ho           Do not reject Ho
Ho is true       Type I error        Correct decision
Ho is false      Correct decision    Type II error

• A type I error occurs if one rejects the null hypothesis when it is true.
• A type II error occurs if one does not reject the null hypothesis when it is false.
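The three hypothesis sets and the four possible outcomes can also be written as a short Python sketch. This is only an illustration, not part of the original notes: the function names `state_hypotheses` and `classify_outcome` and the tail labels "two", "left", and "right" are assumptions made for this example.

```python
from typing import Tuple

# Symbol used in the alternative hypothesis for each kind of test
# (the labels "two", "left", "right" are assumed names for this sketch).
TAIL_SYMBOLS = {"two": "≠", "left": "<", "right": ">"}


def state_hypotheses(parameter: str, value: float, tail: str) -> Tuple[str, str]:
    """Return (Ho, Ha) as text for a two-, left-, or right-tailed test."""
    if tail not in TAIL_SYMBOLS:
        raise ValueError("tail must be 'two', 'left', or 'right'")
    null_hypothesis = f"Ho: {parameter} = {value}"
    alternative_hypothesis = f"Ha: {parameter} {TAIL_SYMBOLS[tail]} {value}"
    return null_hypothesis, alternative_hypothesis


def classify_outcome(h0_is_true: bool, reject_h0: bool) -> str:
    """Look up one cell of the four-outcome table."""
    if h0_is_true and reject_h0:
        return "Type I error"       # rejected a true null hypothesis
    if not h0_is_true and not reject_h0:
        return "Type II error"      # failed to reject a false null hypothesis
    return "Correct decision"


if __name__ == "__main__":
    print(state_hypotheses("mean", 83, "right"))
    print(classify_outcome(h0_is_true=True, reject_h0=True))
```

In practice we never know whether Ho is actually true, so `classify_outcome` is only a way to study the table, not something a researcher can evaluate from sample data.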
The decision is made based on probabilities: "How large a difference is necessary to reject the null hypothesis?" This is where the level of significance is used.

The level of significance is the maximum probability of committing a type I error. This probability is symbolized by α (Greek letter alpha). That is, P(type I error) = α. The probability of a type II error is symbolized by β (Greek letter beta). That is, P(type II error) = β, although in most hypothesis testing situations β cannot easily be computed.

Generally, statisticians agree on using three arbitrary significance levels: the 0.10, 0.05, and 0.01 levels. That is, if the null hypothesis is rejected, the probability of a type I error will be 10%, 5%, or 1%, and the probability of a correct decision will be 90%, 95%, or 99%, respectively, depending on which level of significance is used. In other words, when α = 0.05, there is a 5% chance of rejecting a true null hypothesis.

• You can reflect on these figures, which are commonly used in hypothesis testing in research:

After a significance level is chosen, a critical value is selected from a table for the appropriate test.
• The critical value determines the critical and non-critical regions.
• The critical region or rejection region is the range of values of the test value that indicates that there is a significant difference and that the null hypothesis should be rejected.
• The non-critical or non-rejection region is the range of values of the test value that indicates that the difference was probably due to chance and that the null hypothesis should not be rejected.

If the test is two-tailed, there are two critical values, one positive and one negative. If the test is left-tailed, the critical value will be negative. If the test is right-tailed, the critical value will be positive.

JUST LEARN
A hypothesis is essentially an idea about the population that you think might be true, but which you cannot prove to be true.
While you usually have good reasons to think it is true, and you often hope that it is true, you need to show that the sample data support your idea.

In hypothesis testing, the following steps should be considered:
1. State the null and alternative hypotheses.
2. Select the level of significance.
3. Determine the critical value and the rejection region/s.
4. State the decision rule.
5. Compute the test statistic.
6. Make a decision, whether to reject or not to reject the null hypothesis.

JUST EVALUATE

4th QUARTER – Week 2:

JUST REFLECT
You can reflect on these statements, which are commonly used in research.
• The symbol ≠ in the alternative hypothesis suggests either a greater than ( > ) relation or a less than ( < ) relation.
• When the alternative hypothesis uses the ≠ symbol, the test is said to be nondirectional. It is also called a two-tailed test.
• When the alternative hypothesis uses the > or the < symbol, the test is said to be directional. It may be called either left-tailed or right-tailed.
• These are the graphical representations of the two-tailed test and the one-tailed test:

JUST LEARN:

JUST EVALUATE

4th QUARTER – Week 3:

JUST RECALL AND REFLECT:
Directions: Choose the letter that corresponds to your answer. Write your answer on a separate sheet.
1. Which of the following is a Null Hypothesis test formula?
   A. Test statistic
   B. Population statistic
   C. Variance statistic
   D. Null statistic
2. If the hypothesis contains the greater than symbol (>), the rejection region is ______.
   A. Left-tailed
   B. Right-tailed
   C. Center-tailed
   D. Cross-tailed
3. If the hypothesis contains the less than symbol (<), the rejection region is ____.
   A. Center tailed
   B. Right tailed
   C. Left tailed
   D. Cross tailed
4. What is the main purpose of hypothesis testing?
   A. Test how far the mean of a sample is from zero.
   B. Determine whether a statistical result is significant.
   C. Determine the appropriate value of the significance level.
   D. Derive the standard error of the data.
5. What do you call a population for testing purposes?
   A. Statistic
   B. Level of Significance
   C. Hypothesis
   D. Test-Statistic

• The rejection region (RR) specifies the values of the test statistic for which the null hypothesis is rejected in favor of the alternative hypothesis.
• If the computed value of the test statistic falls in the RR, we reject the null hypothesis (Ho) and accept the alternative hypothesis (H1).
• If the value of the test statistic does not fall into the rejection (critical) region, we do not reject Ho (often loosely described as accepting Ho). The region other than the rejection region is the acceptance region.
• Typical values for α are 0.01, 0.05, and 0.1. It is a value that we select based on the certainty we need. In most cases, the choice of α is determined by the context we are operating in, but 0.05 is the most commonly used value.

JUST LEARN:

DO IT IN A GROUP:
1. Directions: Briefly answer the Self-Assessment Questions (SAQ) below.
   SAQ 1: When do we accept the Null Hypothesis?
   SAQ 2: When do we reject the Null Hypothesis?
2. Directions: Identify the Rejection Region.

PROBLEM 1. Professor Balenciaga has recorded her students' grades for several semesters, and the average of all these grades is 83. Her new class of 28 students seems to be of above-average ability, and she wants to demonstrate that the current class is superior to the previous classes based on its average. With a class average of 86.2 and a standard deviation of 12, is there sufficient evidence to support her argument that the current class is superior? Use the 0.05 significance level.

PROBLEM 2. Professor Balenciaga has recorded her students' grades for several semesters, and the average of all these grades is 83. Her new class of 30 students seems to be of above-average ability, and she wants to demonstrate that the current class is superior to the previous classes based on its average.
With a class average of 86.2 and a standard deviation of 12, is there sufficient evidence to support her argument that the current class is superior? Use the 0.05 significance level.

JUST EVALUATE:
Directions: Choose the letter that corresponds to your answer. Write your answer on a separate sheet.
1. Null hypothesis is rejected as direct evidence that the alternative hypothesis is:
   a. True
   b. False
   c. Either
   d. Neither
2. One or two tailed tests will determine ________.
   a. that the hypothesis has one or two conclusions.
   b. the two values of the sample need to be rejected.
   c. the rejection region is located in one or two tails of the distribution.
   d. the rejection region is located in one tail of the distribution.

4th QUARTER – Week 4:

JUST RECALL AND REFLECT
A test statistic is a value computed from the sample data. It is used to assess the evidence for rejecting or not rejecting the null hypothesis. Each kind of test uses its own test statistic.

JUST LEARN
HYPOTHESIS TESTING ON A POPULATION MEAN

STEP 6: Draw the appropriate conclusion. Since H0 is rejected, there is enough evidence to support the claim that college students watch less television than the general public.

JUST LEARN WITH THE GROUP

JUST EVALUATE
For items 4 and 5, refer to the following information:
Previously, an organization reported that teenagers spent 4.5 hours per week, on average, on the phone. The organization thinks that, currently, the mean is higher. Fifteen (15) randomly chosen teenagers were asked how many hours per week they spend on the phone. The sample mean was 4.75 hours with a sample standard deviation of 2.0. Conduct a hypothesis test.
4. The null and alternate hypotheses are:
   (a) Ho : x = 4.5, Ha : x > 4.5
   (b) Ho : μ ≥ 4.5, Ha : μ < 4.5
   (c) Ho : μ = 4.75, Ha : μ > 4.75
   (d) Ho : μ = 4.5, Ha : μ > 4.5
5. At a significance level of α = 0.05, what is the correct conclusion?
   (a) There is enough evidence to conclude that the mean number of hours is more than 4.75.
   (b) There is enough evidence to conclude that the mean number of hours is more than 4.5.
   (c) There is not enough evidence to conclude that the mean number of hours is more than 4.5.
   (d) There is not enough evidence to conclude that the mean number of hours is more than 4.75.

4th QUARTER – Week 5

JUST RECALL AND REFLECT

JUST LEARN
6. Decision
• If we reject H0, we can conclude that HA is true.
• If, however, we do not reject H0, we may conclude that H0 may be true.

Decision rule using the p-value:
• If the p-value is less than or equal to α, we reject the null hypothesis (p ≤ α).
• If the p-value is greater than α, we do not reject the null hypothesis (p > α).

SHARE INSIGHTS (BY GROUP)

JUST EVALUATE
Directions: Study the problem and answer the task given.

PROBLEM 1: A company manufactures calculators with an average mass of 500 g. An engineer believes the average mass to be different and decides to calculate the average mass of 60 calculators.
TASK: State the null and alternative hypotheses.
Ho:
H1:

PROBLEM 2: Reyes performed a study to validate a translated version of the Western Mindanao State University (WMSU) questionnaire used with English-speaking patients with hip or knee osteoarthritis. For the 76 women classified with severe hip pain, the WMSU mean function score was 70.7 with a standard deviation of 14.6. We wish to know if we may conclude that the mean function score for a population of similar women subjects with severe hip pain is less than 75. Let α = 0.01.
TASK: Perform hypothesis testing by following the steps below.
1. Data:
2. Assumption:
3. Hypothesis:
4. Test Statistic:
5. Decision Rule:
6. Decision:

4th QUARTER – Week 6
Test Statistic for Population Proportion

JUST LEARN:
Step 5: Make a statistical decision. Since the computed test statistic z = −2.133 falls in the rejection region, reject the null hypothesis.
Step 6: Draw the appropriate conclusion.
Since H0 is rejected, there is enough evidence to conclude that the percentage of voters for the administration candidate is different from 65%.

4th QUARTER – Week 7
BIVARIATE DATA AND SCATTERPLOT

JUST RECALL AND REFLECT
Have you ever wondered whether tall people have longer arms than short people? We'll explore this question by collecting data on two variables: height and arm span (measured from left fingertip to right fingertip).
• Do people with above-average arm spans tend to have above-average heights?
• Do people with below-average arm spans tend to have below-average heights?

Directions: Study the table given and answer the questions that follow.
Person Number 1 Arm Span Height 156 157 162 160 2 3 4 159 160 162 155 5 6 161 161 160 162 7 8 162 165 170 166 9 10 11 12 170 170 173 170 167 185 173 176

The methods we employ to measure the relationship between two variables depend on the type of variables we are dealing with; that is, they depend on whether the data are numerical or categorical. There are ways of measuring the relationship between the following pairs of variables:
• a numerical variable and a categorical variable (for example, height and nationality)
• two categorical variables (for example, gender and religious denomination)
• two numerical variables (for example, height and weight)

In a relationship between two variables, if the values of one variable "depend" on the values of another variable, then the former variable is referred to as the dependent variable and the latter variable is referred to as the independent variable.

BIVARIATE DATA consist of two (2) variables, which can be independent or dependent.
• The independent variable is the variable that can cause the dependent variable to change.
• The dependent variable is the variable that is influenced or affected by the independent variable.

It is useful to identify the independent and dependent variables where possible, since the usual practice when displaying data on a graph is to place the independent variable on the horizontal axis and the dependent variable on the vertical axis.
EXAMPLE 1. You want to test a new dosage of a drug that supposedly prevents sneezing in people allergic to flowers.
• Variable in the x-axis: new dosage of drug
• Variable in the y-axis: sneezing

EXAMPLE 2. A soap manufacturer wants to prove that a small amount of detergent can remove a greater amount of stain.
• Variable in the x-axis: amount of detergent
• Variable in the y-axis: amount of stain removed

SCATTERPLOT – a diagram that is used to show the degree and pattern of relationship between two (2) sets of data. Scatterplots are constructed on the coordinate plane, and each data point on a scatterplot represents two (2) values.

A scatterplot is used to determine if there is a relationship between two numerical variables. Data are collected on the two variables and often displayed in a table of ordered pairs. A scatterplot is a graph of the ordered pairs of numbers. Each ordered pair is a dot on the graph.

PATTERNS OF DATA IN SCATTERPLOT

1. SHAPE (FORM) - Refers to whether a data pattern is linear (straight) or nonlinear (curved).
LINEAR FORM: If the points seem to approximate a straight line, the association is a linear form.
NON-LINEAR FORM: If the points seem to approximate a curve, the association is a non-linear form.

FORM OF AN ASSOCIATION
• Linear form – when the points tend to follow a straight line.
• Non-linear form – when the points tend to follow a curved line.

2. TREND (DIRECTION) - Refers to the direction of change in variable Y when variable X gets bigger. If variable Y also gets bigger, the slope is positive; but if variable Y gets smaller, the slope is negative.
POSITIVE: Positive association exists between the variables if the gradient of the line is positive, that is, the dots on the scatterplot tend to go up as we go from left to right.
NEGATIVE: Negative association exists between the variables if the gradient of the line is negative, that is, the dots on the scatterplot tend to go down as we go from left to right.

DIRECTION OF AN ASSOCIATION
• Positive – gradient of the line is positive.
• Negative – gradient of the line is negative.

3. VARIATION (STRENGTH) - Refers to the degree of "scatter" in the plot. If the dots are widely spread, the relationship between variables is weak. If the dots are concentrated around a line, the relationship is strong.
STRONG: In strong association the dots will tend to follow a single stream. A pattern is clearly seen. There is only a small amount of scatter in the plot.
MODERATE: In moderate association the amount of scatter in the plot increases and the pattern becomes less clear. This indicates that the association is less strong.
WEAK: In weak association the amount of scatter increases further, and the pattern becomes even less clear. Linear form is less evident.

STRENGTH OF AN ASSOCIATION
• Strong – small amount of scatter in the plot.
• Moderate – modest amount of scatter in the plot.
• Weak – large amount of scatter in the plot.

EXAMPLE 3. Determine the relationship between height and arm span. The data collected on these variables are shown in the table of ordered pairs.

Height (cm)   Arm Span (cm)
172           172
159           162
178           182
162           164
156           159
174           180
151           151
162           165
165           168
185           189
186           188
176           184
166           167
180           184
158           161

Each person has two numerical variables, height and arm span. To construct a scatterplot, draw a number plane with height on the horizontal axis and arm span on the vertical axis. Plot each ordered pair as a dot. The scatterplot shows there is a relationship between these variables.

JUST EVALUATE
Directions: Construct a scatterplot using the tables and describe the
a. shape (form),
b. trend (direction), and
c. strength (variation).

4th QUARTER – Week 8
THE PEARSON PRODUCT-MOMENT CORRELATION

JUST RECALL AND REFLECT
Directions: Identify the direction and the strength of each correlation given. Choose your answer from the box.
a. Strong positive correlation
b. Moderate positive correlation
c. No correlation
d. Moderate negative correlation
e. Strong negative correlation
f. Perfect correlation

TASK: Research the life of Karl Pearson and his important contributions in the field of statistics. Do not forget to copy and study the formula he proposed for computing the coefficient of correlation (r).

The correlation coefficient, computed from the sample data, measures the strength and direction of a linear relationship between two variables. The strength of correlation is indicated by the coefficient of correlation. There are several coefficients of correlation. The one most commonly used in linear correlation is the Pearson Product-Moment coefficient of correlation, symbolized by r and named in honor of Karl Pearson, the statistician who did a lot of research in this area.

$$r = \frac{n\sum XY - \left(\sum X\right)\left(\sum Y\right)}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}}$$

where r is the Pearson correlation coefficient, which indicates the degree of relationship between the two sets of values, X is the values in the first set of data, Y is the values in the second set of data, and n is the total number of data pairs.

Analyze the diagram below:

The Pearson correlation coefficient, r, can take a range of values from -1 to +1.
• A value greater than 0 indicates a positive correlation; that is, as the value of one variable increases, so does the value of the other variable.
• A value less than 0 indicates a negative correlation; that is, as the value of one variable increases, the value of the other variable decreases.
• A value of 0 indicates that there is no linear correlation between the two variables.
• The direction of the scattered points tells the direction of the correlation that exists between the variables.

Explore the Correlation Scale.
The stronger the association of the two variables, the closer the Pearson correlation coefficient, r, will be to either +1 or -1, depending on whether the relationship is positive or negative, respectively. See table below (Table of range of values).
PEARSON r              QUALITATIVE DESCRIPTION
±1                     Perfect
±0.75 to < ±1          Very high
±0.50 to < ±0.75       Moderately high
±0.25 to < ±0.50       Moderately low
> 0 to < ±0.25         Very low
0                      No correlation

Different relationships and their correlation coefficients are shown in the diagram below:
• Achieving a value of +1 or -1 means that all your data points are included on the line of best fit; there are no data points that show any variation away from this line.
• Values for r between +1 and -1 (for example, r = 0.7 or -0.3) indicate that there is variation around the line of best fit.
• The closer the value of r is to 0, the greater the variation around the line of best fit.

The value of r indicates the closeness of the points to the trend line. The closer the points are to the trend line, the stronger the relationship is.

LESSON 2
The correlation coefficient formula is used to find how strong a relationship is between data. The formula returns a value between -1 and 1, where:
• 1 indicates a perfect (strongest possible) positive relationship.
• -1 indicates a perfect (strongest possible) negative relationship.
• A result of zero indicates no linear relationship at all.

Meaning
✓ A correlation coefficient of 1 means that for every positive increase in one variable, there is a positive increase of a fixed proportion in the other. For example, shoe sizes go up in (almost) perfect correlation with foot length.
✓ A correlation coefficient of -1 means that for every positive increase in one variable, there is a decrease of a fixed proportion in the other. For example, the amount of gas in a tank decreases in (almost) perfect correlation with speed.
✓ Zero means that for every increase, there isn't a positive or negative increase. The two just aren't related.

Let's find the value of the correlation coefficient from the table below.

SUBJECT   AGE X   GLUCOSE LEVEL Y
1         43      99
2         21      65
3         25      79
4         42      75
5         57      87
6         59      81

STEP 1: Make a chart. Use the given data, and add three more columns: xy, x², and y².
Subject   Age x   Glucose level y   xy   x²   y²
1         43      99
2         21      65
3         25      79
4         42      75
5         57      87
6         59      81

STEP 2: Multiply x and y together to fill the xy column. For example, row 1 would be 43 × 99 = 4,257.

Subject   Age x   Glucose level y   xy     x²   y²
1         43      99                4257
2         21      65                1365
3         25      79                1975
4         42      75                3150
5         57      87                4959
6         59      81                4779

STEP 3: Take the square of the numbers in the x column and put the results in the x² column.

Subject   Age x   Glucose level y   xy     x²     y²
1         43      99                4257   1849
2         21      65                1365   441
3         25      79                1975   625
4         42      75                3150   1764
5         57      87                4959   3249
6         59      81                4779   3481

STEP 4: Take the square of the numbers in the y column and put the results in the y² column.

Subject   Age x   Glucose level y   xy     x²     y²
1         43      99                4257   1849   9801
2         21      65                1365   441    4225
3         25      79                1975   625    6241
4         42      75                3150   1764   5625
5         57      87                4959   3249   7569
6         59      81                4779   3481   6561

Adding up each column gives Σx = 247, Σy = 486, Σxy = 20,485, Σx² = 11,409, and Σy² = 40,022, with n = 6. Substituting these into the correlation coefficient formula:

r = [6(20,485) − (247)(486)] / √{[6(11,409) − (247)²][6(40,022) − (486)²]} = 2,868 / √(7,445 × 3,936) ≈ 0.5298

The range of the correlation coefficient is from -1 to 1. Our result is 0.5298, which means there is a moderate positive correlation between the variables.

Assumptions
For the Pearson r correlation, both variables should be normally distributed (normally distributed variables have a bell-shaped curve). Other assumptions include linearity and homoscedasticity. Linearity assumes a straight-line relationship between the two variables, and homoscedasticity assumes that the data are equally distributed about the regression line.

JUST EVALUATE
I. Directions: Calculate r and make a generalization regarding the information that you get from the computed correlation coefficient for each of the following:

         a.       b.         c.
∑X    =  225      32         180
∑Y    =  22       1105       147
∑X²   =  9653     220        6914
∑Y²   =  143      364525     5273
∑XY   =  651      3402       4013
n     =  6        6          7

II. Directions: Solve the problem.
The following are the heights of a father and his eldest son, in inches:

Heights of the Father       71  69  67  68  68  66  70  72  65  60
Heights of the Eldest Son   71  69  69  65  66  63  68  70  60  58

QUESTION: Do the data support the hypothesis that height is hereditary?
Explain. Accompany your explanation with statistical computations.

III. Directions: Read each statement carefully and choose the best answer.
For items 1 – 5, complete the table below. Consider the scores obtained in Math (X) and Statistics (Y) subjects by 10 students.

Observation   Math Score (X)   Stat Score (Y)   X²    Y²    XY
1             5                2                25    4     10
2             8                7                64    49    56
3             10               8                100   64    80
4             12               9                144   81    108
5             12               10               144   100   120
6             14               12               196   144   168
7             15               14               225   196   210
8             16               10               256   100   160
9             18               16               324   256   288
10            20               12               400   144   240
Sum

1. The ∑X² is equal to ________.
   a. 1118
   b. 1138
   c. 1878
   d. 1873
2. Find ∑XY.
   a. 1440
   b. 1040
   c. 1400
   d. 1140
3. How many respondents are being observed?
   a. 20
   b. 12
   c. 10
   d. 6
4. Based on the given data, solve for the Pearson's correlation coefficient.
   a. 0.78
   b. 0.87
   c. 0.86
   d. 0.76
5. Evaluate what conclusion can be derived from the result of r obtained in the data.
   a. There is no relationship between math scores and statistics scores of the students.
   b. There is a strong negative relationship between math scores and statistics scores of the students.
   c. There is a moderately positive relationship between math scores and statistics scores of the students.
   d. There is a strong positive relationship between math scores and statistics scores of the students.
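
As a way to check hand computations like the ones practiced in these notes, the column-sum method from Lesson 2 can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the original notes: the function names `pearson_r` and `describe_r` are made up for this example, the data are the six age and glucose pairs from Lesson 2, and the labels follow the Table of range of values from Week 8.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson r from the column sums: n, Σx, Σy, Σxy, Σx², Σy²."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))   # the xy column
    sum_x2 = sum(a * a for a in x)              # the x² column
    sum_y2 = sum(b * b for b in y)              # the y² column
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when one variable has no variation")
    return numerator / denominator


def describe_r(r: float) -> str:
    """Qualitative label for r, following the Table of range of values."""
    size = abs(r)
    if size == 0:
        return "No correlation"
    if size < 0.25:
        return "Very low"
    if size < 0.50:
        return "Moderately low"
    if size < 0.75:
        return "Moderately high"
    if size < 1:
        return "Very high"
    return "Perfect"


if __name__ == "__main__":
    age = [43, 21, 25, 42, 57, 59]        # x values from Lesson 2
    glucose = [99, 65, 79, 75, 87, 81]    # y values from Lesson 2
    r = pearson_r(age, glucose)
    print(round(r, 4), describe_r(r))     # 0.5298 Moderately high
```

The same two functions can be applied to the exercise tables above after the hand computation is finished, to confirm the sums and the value of r.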