5.1 DATA SUMMARY AND DISPLAY

Statistics

Meaning:
- Numerical facts
- A field or discipline of study
- A collection of methods for planning experiments, obtaining data, and then organizing, analyzing and interpreting the data to draw conclusions or make decisions.

BASIC TERMS IN STATISTICS

Population - The entire collection of individuals or objects whose characteristics are being studied.
Sample - A subset of the population.
Census - A survey that includes every member of the population.
Sample survey - A technique for collecting information from only a portion of the population.
Element - A specific subject or object about which information is collected.
Variable - A characteristic that takes different values for different elements.
Observation - The value of a variable for an element.
Data set - A collection of observations on one or more variables.

Table 1: Students' Scores for Business Statistics

Name | Score
Mohd Amirul bin Hamdi | 90
Hashimah | 78

Each student (row) is an element, Score is the variable, and each score is an observation (measurement).

TYPES OF VARIABLES

Variable
- Quantitative
  - Discrete (e.g., number of houses, cars, accidents)
  - Continuous (e.g., length, age, height, weight, time)
- Qualitative (e.g., gender, marital status)

QUANTITATIVE AND QUALITATIVE VARIABLE

1) Quantitative variable
A variable that can be measured numerically. Data collected on a quantitative variable are called quantitative data. There are two types of quantitative variables:
i. Discrete variable - A variable whose values are countable; it can assume only certain values, with no intermediate values.
ii. Continuous variable - A variable that can assume any numerical value over a certain interval or intervals.

2) Qualitative variable
A variable that cannot assume a numerical value but can be classified into two or more nonnumeric categories. Data collected on such a variable are called qualitative data.

STATISTICS

DESCRIPTIVE STATISTICS - Using tables, graphs and summary measures.
INFERENTIAL STATISTICS - Using sample results to make decisions or predictions about a population. Also called inductive reasoning or inductive statistics.
Descriptive Statistics

Consists of methods for organizing, displaying and describing data by using tables, graphs and summary measures. In general it is divided into two categories:
- Data presentation (display)
- Statistics

Inferential Statistics

Consists of methods that use sample results to help make decisions or predictions about a population. It is the area of statistics that deals with decision-making procedures.
Example: To estimate the starting salary of a college graduate, we may select 2000 recent college graduates, find their starting salaries and make a decision based on this information.

DATA PRESENTATION

Data with a lot of observations usually look uninformative.
- We cannot get much information from the raw data.
- We have to summarize or organize the data in such a way that we can get some information from them.

DATA PRESENTATION OF QUALITATIVE DATA

Tabular presentation of qualitative data is usually in the form of a frequency table.
Frequency table - a table that gives the number of times each observation occurs in the data.
A graphic display can reveal at a glance the main characteristics of a data set. Three types of graphs are used to display qualitative data:
- bar graph
- pie chart
- line chart

Example 5.1

Table 5.1 shows the data of 50 UNIMAP students and their background. Codes used:
• For gender: 1 is male and 2 is female
• For ethnic group: 1 is Malay, 2 is Chinese, 3 is Indian and 4 is others
• Not much information can be obtained from the data in raw form. They have to be summarized so that we can get more information.

If the data from Table 5.1 are summarized by gender and by ethnic group, we obtain the frequency tables below.

Table 5.2: Frequency Table for the Gender

Observation | Frequency
Male | 28
Female | 22
Total | 50

Table 5.3: Frequency Table for the Ethnic Group

Observation | Frequency
Malay | 33
Chinese | 9
Indian | 6
Others | 2
Total | 50

Bar Chart

A bar chart is used to display the frequency distribution in graphical form.
It consists of two orthogonal axes: one axis represents the observations and the other represents the frequency of the observations. The frequency of each observation is represented by a bar.
*The bar chart is for the data from Table 5.3.
Figure 1: Bar Chart of the Ethnic Group

2.1.2 Pie Chart

A pie chart is used to display the frequency distribution. It displays the ratio of the observations. It is a circle consisting of a few sectors. The sectors represent the observations, while the area of each sector represents the proportion of the frequency of that observation.
*The pie chart is for the data from Table 5.2.
Figure 2: The Pie Chart for the Gender

2.1.3 Line Chart

A line chart is used to display the trend of observations. It consists of two orthogonal axes: one axis represents the observations and the other represents the frequency of the observations. The frequencies of the observations are joined by lines.
Example: Table 2.4 below shows the number of sandpipers recorded between January 1989 and December 1989.

Month | Jan | Feb | Mar | Apr | May | June | July | Aug | Sept | Oct | Nov | Dec
Number | 10 | 7 | 5 | 10 | 39 | 7 | 260 | 316 | 142 | 11 | 4 | 9

Table 2.4: The number of sandpipers
Figure 3: The Line Chart for the Number of Common Sandpipers

DATA PRESENTATION OF QUANTITATIVE DATA

Tabular presentation of quantitative data is usually in the form of a frequency distribution.
Frequency distribution - a table that gives the frequency of the observations that fall inside some specific classes (intervals).
There are a few graphs available for the graphical presentation of quantitative data. The most popular are:
- Histogram
- Frequency polygon
- Ogive

FREQUENCY DISTRIBUTION

When summarizing large quantities of raw data, it is often useful to distribute the data into classes. There are no specific rules for determining the classes, but statisticians suggest that the number of classes be between 5 and 20.

Sturges's Rule
Number of classes, $c = 1 + 3.3 \log_{10} n$, where $n$ is the number of observations in the data set.
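As a quick check of Sturges's rule, the following minimal Python sketch evaluates $c$ for a given number of observations. The function names are hypothetical, rounding $c$ up to a whole number is an assumption (the rule itself only suggests a value), and the class width is taken as the range divided by the number of classes.

```python
import math


def sturges_classes(n: int) -> float:
    """Suggested number of classes by Sturges's rule: c = 1 + 3.3 * log10(n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 + 3.3 * math.log10(n)


def class_width(smallest: float, largest: float, num_classes: int) -> float:
    """Class width taken as range / number of classes."""
    return (largest - smallest) / num_classes


# For a data set of 50 observations the rule suggests about 6.6 classes.
c = sturges_classes(50)
# Assumption: round up to the next whole number of classes.
num_classes = math.ceil(c)
print(round(c, 2), num_classes)
```

The rule is only a guide; an analyst may still choose a nearby number of classes that gives convenient class limits.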
Class width: $i = \frac{\text{Largest value} - \text{Smallest value}}{\text{Number of classes}} = \frac{\text{Range}}{c}$

Example 5.2

CGPA (Class) | Frequency
2.50 - 2.75 | 2
2.75 - 3.00 | 10
3.00 - 3.25 | 15
3.25 - 3.50 | 13
3.50 - 3.75 | 7
3.75 - 4.00 | 3
Total | 50

Table 5.5: The Frequency Distribution of the Students' CGPA

Cumulative Frequency Distributions

A cumulative frequency distribution gives the total number of values that fall below the upper boundary of each class. In a cumulative frequency distribution table, each class has the same lower limit but a different upper limit.

Table 5.7: Class Limits, Class Boundaries, Class Width and Cumulative Frequency

Weekly Earnings in dollars (Class Limits) | Number of Employees, f | Class Boundaries | Class Width | Cumulative Frequency
801-1000 | 9 | 800.5 - 1000.5 | 200 | 9
1001-1200 | 22 | 1000.5 - 1200.5 | 200 | 9 + 22 = 31
1201-1400 | 39 | 1200.5 - 1400.5 | 200 | 31 + 39 = 70
1401-1600 | 15 | 1400.5 - 1600.5 | 200 | 70 + 15 = 85
1601-1800 | 9 | 1600.5 - 1800.5 | 200 | 85 + 9 = 94
1801-2000 | 6 | 1800.5 - 2000.5 | 200 | 94 + 6 = 100

Histogram

A histogram looks like a bar chart except that the horizontal axis represents data that are quantitative in nature. There are no gaps between the bars.

Frequency Polygon

A frequency polygon looks like a line chart except that the horizontal axis represents the class marks of data that are quantitative in nature.

Ogive

An ogive is a line graph in which the horizontal axis represents the upper limits of the class intervals and the vertical axis represents the cumulative frequencies.

DATA SUMMARY

What is a statistic?
A statistic is a number that describes a sample, such as the sample mean, which describes the sample average.
Types of statistic:
i. Measure of central tendency
ii. Measure of dispersion

MEASURE OF CENTRAL TENDENCY

There are 3 popular measures of central tendency: mean, median and mode.

1) Mean
The mean of a sample is the sum of the measurements divided by the number of measurements in the set.
The mean is denoted by $\bar{x}$.

Mean = Sum of all values / Number of values

The mean can be obtained as follows:
- For raw data, the mean is defined by
$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$, for $i = 1, 2, \dots, n$, or $\bar{x} = \frac{\sum x}{n}$
- For tabular/grouped data, the mean is defined by
$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$ or $\bar{x} = \frac{\sum fx}{\sum f}$
where $f$ = class frequency and $x$ = class mark (midpoint).

Example: The sample mean of the students' CGPA (raw data) is
$\bar{x} = \frac{\sum x}{n} = \frac{160.98}{50} = 3.22$

Example: The sample mean for Table 5.8

CGPA (Class) | Frequency, f | Class Mark (Midpoint), x | fx
2.50 - 2.75 | 2 | 2.625 | 5.250
2.75 - 3.00 | 10 | 2.875 | 28.750
3.00 - 3.25 | 15 | 3.125 | 46.875
3.25 - 3.50 | 13 | 3.375 | 43.875
3.50 - 3.75 | 7 | 3.625 | 25.375
3.75 - 4.00 | 3 | 3.875 | 11.625
Total | 50 | | 161.750

Table 5.8

$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} = \frac{161.75}{50} = 3.235$

2) Median
The median is the middle value of a set of observations arranged in order of magnitude and is normally denoted by $\tilde{x}$.

i) The median for ungrouped data
- The median depends on the number of observations in the data, $n$.
- If $n$ is odd, the median is the $\left(\frac{n+1}{2}\right)$th observation of the ordered observations.
- If $n$ is even, the median is the arithmetic mean of the $\left(\frac{n}{2}\right)$th observation and the $\left(\frac{n}{2} + 1\right)$th observation.

ii) The median of grouped data / frequency distribution
The median of a frequency distribution is defined by
$\tilde{x} = L + c\left(\frac{\frac{\sum f}{2} - F_{j-1}}{f_j}\right)$
where
• $L$ = the lower class boundary of the median class;
• $c$ = the size of the median class interval;
• $F_{j-1}$ = the sum of the frequencies of all classes lower than the median class;
• $f_j$ = the frequency of the median class.

Example for ungrouped data: The median of the data 4, 6, 3, 1, 2, 5, 7, 3 is 3.5.
Proof: Rearranged in order of magnitude, the data become 1, 2, 3, 3, 4, 5, 6, 7. As $n = 8$ (even), the median is the mean of the 4th and 5th observations, that is $(3 + 4)/2 = 3.5$.

Example for grouped data: Find the median for the frequency distribution below.

CGPA (Class) | Frequency, f | Cumulative frequency
2.50 - 2.75 | 2 | 2
2.75 - 3.00 | 10 | 12
3.00 - 3.25 | 15 | 27
3.25 - 3.50 | 13 | 40
3.50 - 3.75 | 7 | 47
3.75 - 4.00 | 3 | 50
Total | 50 |

Since $\sum f / 2 = 25$, the median class is 3.00 - 3.25, and
$\tilde{x} = L + c\left(\frac{\frac{\sum f}{2} - F_{j-1}}{f_j}\right) = 3.00 + 0.25\left(\frac{25 - 12}{15}\right) = 3.217$

3) Mode
• The mode of a set of observations is the observation with the highest frequency and is usually denoted by $\hat{x}$. The mode can also be used to describe qualitative data.

i) Mode of ungrouped data
- Defined as the value that occurs most frequently.
- The mode has the advantage that it is easy to calculate and is not affected by extreme values.
- However, the mode may not exist, and even if it does exist, it may not be unique.

*Note: If two measurements in a data set share the highest frequency, both are taken as modes and the data are called bimodal. If more than two measurements share the highest frequency, the data can be regarded as having no mode.

ii) The mode for grouped data / frequency distribution data
- When data have been grouped into classes and a frequency curve is drawn to fit the data, the mode is the value of $x$ corresponding to the maximum point on the curve.
$\hat{x} = L + c\left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right)$
where
$L$ = the lower class boundary of the modal class;
$c$ = the size of the modal class interval;
$\Delta_1$ = the difference between the modal class frequency and the frequency of the class before it; and
$\Delta_2$ = the difference between the modal class frequency and the frequency of the class after it.
*Note: The class which has the highest frequency is called the modal class.

Example for ungrouped data: The mode for the observations 4, 6, 3, 1, 2, 5, 7, 3 is 3.

Example for grouped data based on the table:

CGPA (Class) | Frequency
2.50 - 2.75 | 2
2.75 - 3.00 | 10
3.00 - 3.25 (modal class) | 15
3.25 - 3.50 | 13
3.50 - 3.75 | 7
3.75 - 4.00 | 3
Total | 50

$\hat{x} = L + c\left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) = 3.00 + 0.25\left(\frac{5}{5 + 2}\right) = 3.179$

Measure of Dispersion

The measure of dispersion (spread) is the degree to which a set of data tends to spread around the average value. It shows whether the data set is concentrated around the mean or scattered. The common measures of dispersion are:
1) range
2) variance
3) standard deviation
The standard deviation is the square root of the variance. The sample variance is denoted by $s^2$ and the sample standard deviation is denoted by $s$.

Variance

i) Variance for ungrouped data
The variance of a sample (also known as the mean square) for raw (ungrouped) data is denoted by $s^2$ and defined by
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

ii) Variance for grouped data
The variance for a frequency distribution is defined by
$s^2 = \frac{\sum f(x - \bar{x})^2}{\sum f - 1} = \frac{\sum fx^2 - \frac{(\sum fx)^2}{n}}{n - 1}$, where $n = \sum f$.

Example for ungrouped data: The incomes of 5 workers are RM 1000, RM 2500, RM 2000, RM 4000 and RM 3500. Find the variance of these data.
Solution:
Mean, $\bar{x} = \frac{1000 + 2500 + 2000 + 4000 + 3500}{5} = 2600$
Variance, $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{(1000 - 2600)^2 + (2500 - 2600)^2 + (2000 - 2600)^2 + (4000 - 2600)^2 + (3500 - 2600)^2}{5 - 1} = \frac{5\,700\,000}{4} = 1\,425\,000$

Example for grouped data: The variance for the frequency distribution in the table is:

Class boundaries | Frequency, f | Class Mark, x | fx | fx^2
2.50 - 2.75 | 2 | 2.625 | 5.250 | 13.781
2.75 - 3.00 | 10 | 2.875 | 28.750 | 82.656
3.00 - 3.25 | 15 | 3.125 | 46.875 | 146.484
3.25 - 3.50 | 13 | 3.375 | 43.875 | 148.078
3.50 - 3.75 | 7 | 3.625 | 25.375 | 91.984
3.75 - 4.00 | 3 | 3.875 | 11.625 | 45.047
Total | 50 | | 161.750 | 528.031

$s^2 = \frac{\sum fx^2 - \frac{(\sum fx)^2}{n}}{n - 1} = \frac{528.031 - \frac{(161.75)^2}{50}}{49} = 0.0973$

ESTIMATION

Introduction
The field of statistical inference consists of those methods used to make decisions or to draw conclusions about a population. These methods utilize the information contained in a sample from the population in drawing conclusions.

ESTIMATOR VS ESTIMATE

Estimator
• In statistics, the method used.
Estimate
• The value that is obtained from a sample.

Example: I have a sample of 5 numbers and I take the average. Taking the average of the sample is the estimator; here it is the estimator of the mean.
Say the average is 4: this value is the estimate.

CONFIDENCE INTERVAL ESTIMATES

Definition: An Interval Estimate
In interval estimation, an interval is constructed around the point estimate, and it is stated that this interval is likely to contain the corresponding population parameter.

Definition: Confidence Level and Confidence Interval
Each interval is constructed with regard to a given confidence level and is called a confidence interval. The confidence level associated with a confidence interval states how much confidence we have that this interval contains the true population parameter. The confidence level is denoted by $(1 - \alpha)100\%$.

CONFIDENCE INTERVAL ESTIMATES FOR POPULATION MEAN

The $(1 - \alpha)100\%$ Confidence Interval of Population Mean, $\mu$

(i) $\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$ if $\sigma$ is known and the population is normally distributed,
or $\left(\bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\right)$

(ii) $\bar{x} \pm z_{\alpha/2} \frac{s}{\sqrt{n}}$ if $\sigma$ is unknown and $n$ is large ($n \ge 30$),
or $\left(\bar{x} - z_{\alpha/2} \frac{s}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2} \frac{s}{\sqrt{n}}\right)$

(iii) $\bar{x} \pm t_{n-1,\alpha/2} \frac{s}{\sqrt{n}}$ if $\sigma$ is unknown, the population is normally distributed and the sample size is small ($n < 30$),
or $\left(\bar{x} - t_{n-1,\alpha/2} \frac{s}{\sqrt{n}} < \mu < \bar{x} + t_{n-1,\alpha/2} \frac{s}{\sqrt{n}}\right)$

EXAMPLE

If a random sample of size $n = 20$ from a normal population with variance $\sigma^2 = 225$ has the mean $\bar{x} = 64.3$, construct a 95% confidence interval for the population mean, $\mu$.

SOLUTION

It is known that $n = 20$, $\bar{x} = 64.3$ and $\sigma = 15$.
For a 95% CI,
$95\% = 100(1 - \alpha)\%$
$1 - \alpha = 0.95$
$\alpha = 0.05$
$\alpha/2 = 0.025$
$z_{\alpha/2} = z_{0.025} = 1.96$
Hence, 95% CI $= \bar{x} \pm z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) = 64.3 \pm 1.96\left(\frac{15}{\sqrt{20}}\right) = 64.3 \pm 6.57 = [57.73, 70.87]$, or $57.73 < \mu < 70.87$.
Thus, we are 95% confident that the population mean is between 57.73 and 70.87.

Example: A publishing company has just published a new textbook. Before the company decides the price at which to sell this textbook, it wants to know the average price of all such textbooks in the market.
The research department at the company took a sample of 36 comparable textbooks and collected information on their prices. This information produced a mean price of RM 70.50 for this sample. It is known that the standard deviation of the prices of all such textbooks is RM 4.50. Construct a 90% confidence interval for the mean price of all such college textbooks.

Solution

It is known that $n = 36$, $\bar{x} = \text{RM } 70.50$ and $\sigma = \text{RM } 4.50$.
For a 90% CI,
$90\% = 100(1 - \alpha)\%$
$1 - \alpha = 0.90$
$\alpha = 0.1$
$\alpha/2 = 0.05$
$z_{\alpha/2} = z_{0.05} = 1.65$
Hence, 90% CI $= \bar{x} \pm z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) = 70.50 \pm 1.65\left(\frac{4.50}{\sqrt{36}}\right) = 70.50 \pm 1.24 = [\text{RM } 69.26, \text{RM } 71.74]$
Thus, we are 90% confident that the mean price of all such college textbooks is between RM 69.26 and RM 71.74.

EXAMPLE

Consider a survey on the height of male students in a certain IPTA, in which a random sample of 100 male students is taken. The height of the male students is normally distributed with mean 178.2 cm and variance 17.75 cm^2.
i) Construct a 95% CI for the mean height of the male students.
ii) If the mean height of the female students is 170.2 cm, use a 98% CI to check whether this supports the claim that the male students are taller than the female students.

SOLUTION

i) It is known that $n = 100$, $\bar{x} = 178.2$ and $\sigma^2 = 17.75$.
For a 95% CI,
$95\% = (1 - \alpha)100\%$
$1 - \alpha = 0.95$
$\alpha = 0.05$
$\alpha/2 = 0.025$
$z_{\alpha/2} = z_{0.025} = 1.96$
Hence, 95% CI $= \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} = 178.2 \pm 1.96\left(\frac{\sqrt{17.75}}{\sqrt{100}}\right) = 178.2 \pm 0.83 = [177.37, 179.03]$

ii) It is known that $\bar{x} = 178.2$ and $\sigma^2 = 17.75$.
For a 98% CI,
$98\% = (1 - \alpha)100\%$
$1 - \alpha = 0.98$
$\alpha = 0.02$, thus $\alpha/2 = 0.01$
$z_{\alpha/2} = z_{0.01} = 2.33$
Hence, 98% CI $= \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} = 178.2 \pm 2.33\left(\frac{\sqrt{17.75}}{\sqrt{100}}\right) = 178.2 \pm 0.98 = [177.22, 179.18]$
The mean height of the female students, 170.2 cm, does not lie in the interval [177.22, 179.18] and falls below its lower limit; hence this indicates that the male students are taller than the female students.
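The interval for the male students' height can be reproduced with a short Python sketch. This is only an illustration under stated assumptions: the function name `z_confidence_interval` is hypothetical, and the critical value is computed from the standard library's `statistics.NormalDist` rather than read from a table, so results for other examples may differ in the last decimal from values obtained with rounded table entries such as 1.65 or 2.33.

```python
import math
from statistics import NormalDist


def z_confidence_interval(mean, sigma, n, confidence):
    """(1 - alpha)100% CI for the population mean when sigma is known."""
    alpha = 1 - confidence
    # Critical value z_{alpha/2}: the upper alpha/2 point of the standard normal.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / math.sqrt(n)
    return mean - margin, mean + margin


sigma = math.sqrt(17.75)  # the example gives the variance, not sigma

# Part i): 95% CI for the mean height of male students.
low95, high95 = z_confidence_interval(178.2, sigma, 100, 0.95)
print(round(low95, 2), round(high95, 2))

# Part ii): 98% CI, then compare with the female mean of 170.2 cm.
low98, high98 = z_confidence_interval(178.2, sigma, 100, 0.98)
print(round(low98, 2), round(high98, 2), 170.2 < low98)
```

The last comparison mirrors the reasoning in part ii): the female mean lies below the lower limit of the interval.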
CONFIDENCE INTERVAL ESTIMATES FOR POPULATION PROPORTION

The $(1 - \alpha)100\%$ Confidence Interval for $p$ for Large Samples ($n \ge 30$)

$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
or
$\hat{p} - z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} < p < \hat{p} + z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$

Example

According to the analysis of Women Magazine in June 2005, "Stress has become a common part of everyday life among working women in Malaysia. The demands of work, family and home place an increasing burden on average Malaysian women". According to this poll, 40% of the working women included in the survey indicated that they had a limited amount of time to relax. The poll was based on a randomly selected sample of 1502 working women aged 30 and above. Construct a 95% confidence interval for the corresponding population proportion.

Solution

Let $p$ be the proportion of all working women aged 30 and above who have a limited amount of time to relax, and let $\hat{p}$ be the corresponding sample proportion. From the given information,
$n = 1502$, $\hat{p} = 0.40$, $\hat{q} = 1 - \hat{p} = 1 - 0.40 = 0.60$
Hence, 95% CI $= \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}} = 0.40 \pm 1.96 \sqrt{\frac{0.40(0.60)}{1502}} = 0.40 \pm 0.02478 = [0.375, 0.425]$, or 37.5% to 42.5%.
Thus, we can state with 95% confidence that the proportion of all working women aged 30 and above who have a limited amount of time to relax is between 37.5% and 42.5%.

HYPOTHESIS TESTS

Hypothesis and Test Procedures
A statistical test of hypothesis consists of:
1. The null hypothesis, $H_0$
2. The alternative hypothesis, $H_1$
3. The test statistic and its p-value
4. The rejection region
5. The conclusion

Definition
Hypothesis testing can be used to determine whether a statement about the value of a population parameter should or should not be rejected.
Null hypothesis, $H_0$: A null hypothesis is a claim (or statement) about a population parameter that is assumed to be true. (The null hypothesis is either rejected or fails to be rejected.)
Alternative hypothesis, $H_1$: An alternative hypothesis is a claim about a population parameter that will be true if the null hypothesis is false.
Test statistic: a function of the sample data on which the decision is to be based.
p-value: the probability calculated using the test statistic. The smaller the p-value, the more the data contradict $H_0$.

9.1 DEVELOPING NULL AND ALTERNATIVE HYPOTHESIS

It is not always obvious how the null and alternative hypotheses should be formulated. When formulating them, the nature or purpose of the test must also be taken into account. We will examine:
1) The claim or assertion leading to the test.
2) The null hypothesis to be evaluated.
3) The alternative hypothesis.
4) Whether the test will be two-tail or one-tail.
5) A visual representation of the test itself.
In some cases it is easier to identify the alternative hypothesis first. In other cases the null is easier.

9.1.1 Alternative Hypothesis as a Research Hypothesis

• Many applications of hypothesis testing involve an attempt to gather evidence in support of a research hypothesis.
• In such cases, it is often best to begin with the alternative hypothesis and make it the conclusion that the researcher hopes to support.
• The conclusion that the research hypothesis is true is made if the sample data provide sufficient evidence to show that the null hypothesis can be rejected.

Example 9.1: A new drug is developed with the goal of lowering blood pressure more than the existing drug.
• Alternative Hypothesis: The new drug lowers blood pressure more than the existing drug.
• Null Hypothesis: The new drug does not lower blood pressure more than the existing drug.

9.1.2 Null Hypothesis as an Assumption to be Challenged

• We might begin with a belief or assumption that a statement about the value of a population parameter is true.
• We then use a hypothesis test to challenge the assumption and determine whether there is statistical evidence to conclude that the assumption is incorrect.
• In these situations, it is helpful to develop the null hypothesis first.

Example 9.2: The label on a soft drink bottle states that it contains at least 67.6 fluid ounces.
• Null Hypothesis: The label is correct. $\mu \ge 67.6$ ounces.
• Alternative Hypothesis: The label is incorrect. $\mu < 67.6$ ounces.

Example 9.3: Average tire life is 35000 miles.
• Null Hypothesis: $\mu = 35000$ miles
• Alternative Hypothesis: $\mu \ne 35000$ miles

How to decide whether to reject or accept $H_0$?

The entire set of values that the test statistic may assume is divided into two regions. One set, consisting of values that support $H_1$ and lead us to reject $H_0$, is called the rejection region. The other, consisting of values that support $H_0$, is called the acceptance region.
$H_0$ always gets "=".

Tails of a Test

 | Two-Tailed Test | Left-Tailed Test | Right-Tailed Test
Sign in $H_0$ | $=$ | $=$ or $\ge$ | $=$ or $\le$
Sign in $H_1$ | $\ne$ | $<$ | $>$
Rejection Region | In both tails | In the left tail | In the right tail

Population Mean, $\mu$ ($\sigma$ known and unknown)

Null Hypothesis: $H_0: \mu = \mu_0$
Test Statistic:
• $Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$ for any population where $\sigma$ is known and $n$ is large, or for a normal population where $\sigma$ is known and $n$ is small
• $Z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$ for any population where $\sigma$ is unknown and $n$ is large
• $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$ with $v = n - 1$ degrees of freedom for a normal population where $\sigma$ is unknown and $n$ is small

Alternative hypothesis | Rejection Region
$H_1: \mu \ne \mu_0$ | $Z < -z_{\alpha/2}$ or $Z > z_{\alpha/2}$
$H_1: \mu > \mu_0$ | $Z > z_{\alpha}$
$H_1: \mu < \mu_0$ | $Z < -z_{\alpha}$

Definition: p-value
The p-value is the smallest significance level at which the null hypothesis is rejected.
Using the p-value approach, we reject the null hypothesis $H_0$ if
$p\text{-value} < \alpha$ for a one-tailed test
$\frac{p\text{-value}}{2} < \frac{\alpha}{2}$ for a two-tailed test
and we do not reject the null hypothesis $H_0$ if
$p\text{-value} \ge \alpha$ for a one-tailed test
$\frac{p\text{-value}}{2} \ge \frac{\alpha}{2}$ for a two-tailed test

Example

The average monthly earnings for women in managerial and professional positions is RM 2400. Do men in the same positions have average monthly earnings that are higher than those for women? A random sample of $n = 40$ men in managerial and professional positions showed $\bar{x} = \text{RM } 3600$ and $s = \text{RM } 400$. Test the appropriate hypothesis using $\alpha = 0.01$.

Solution

1. The hypotheses to be tested are
$H_0: \mu \le 2400$
$H_1: \mu > 2400$
2. We use the normal distribution since $n > 30$.
3. Rejection region: $Z > z_{\alpha}$; $z_{\alpha} = z_{0.01} = 2.33$
4. Test statistic:
$Z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{3600 - 2400}{400 / \sqrt{40}} = 18.97$
Since $18.97 > 2.33$, the test statistic falls in the rejection region. We reject $H_0$ and conclude that the average monthly earnings for men in managerial and professional positions are significantly higher than those for women.

POPULATION PROPORTION, P

Null Hypothesis: $H_0: p = p_0$
Test Statistic: $Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 q_0}{n}}}$

Alternative hypothesis | Rejection Region
$H_1: p \ne p_0$ | $Z < -z_{\alpha/2}$ or $Z > z_{\alpha/2}$
$H_1: p > p_0$ | $Z > z_{\alpha}$
$H_1: p < p_0$ | $Z < -z_{\alpha}$

Example

When working properly, a machine that is used to make chips for calculators does not produce more than 4% defective chips. Whenever the machine produces more than 4% defective chips, it needs an adjustment. To check whether the machine is working properly, the quality control department at the company often takes samples of chips and inspects them to determine whether they are good or defective. One such random sample of 200 chips taken recently from the production line contained 14 defective chips. Test at the 5% significance level whether or not the machine needs an adjustment.
Solution

The hypotheses to be tested are
$H_0: p \le 0.04$
$H_1: p > 0.04$
The sample proportion is $\hat{p} = 14/200 = 0.07$, so the test statistic is
$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 q_0}{n}}} = \frac{0.07 - 0.04}{\sqrt{\frac{0.04(0.96)}{200}}} = 2.17$
Rejection region: $Z > z_{\alpha}$; $z_{\alpha} = z_{0.05} = 1.65$
Since $2.17 > 1.65$, the test statistic falls in the rejection region. We reject $H_0$ and conclude that the machine needs an adjustment.

REGRESSION AND CORRELATION

Regression is a statistical procedure for establishing the relationship between 2 or more variables. This is done by fitting a linear equation to the observed data. The regression line is then used by the researcher to see the trend and to predict values for the data.
There are 2 types of relationship:
- Simple (2 variables)
- Multiple (more than 2 variables)

THE SIMPLE LINEAR REGRESSION MODEL

The simple linear regression model is an equation that describes a dependent variable ($Y$) in terms of an independent variable ($X$) plus a random error:
$Y = \beta_0 + \beta_1 X + \varepsilon$
where
$\beta_0$ = intercept of the line with the Y-axis
$\beta_1$ = slope of the line
$\varepsilon$ = random error
The random error, $\varepsilon$, is the difference between a data point and the deterministic value $\beta_0 + \beta_1 X$.
This regression line is estimated from the data collected by fitting a straight line to the data set and obtaining the equation of the straight line,
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$

LEAST SQUARES METHOD

• The least squares method is commonly used to determine values for $\beta_0$ and $\beta_1$ that ensure a best fit of the estimated regression line to the sample data points.
• The straight line fitted to the data set is the line
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
where $\hat{y}$ is the estimated or predicted value of $y$ for a given value of $x$. In other words, the predicted value of the dependent variable $y$ for a given value of the independent variable $x$ can be obtained simply by substituting the given value of $x$.
We can find the least squares estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ by using the formulas
$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
where
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$, $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$
$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$
$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$
$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$

EXAMPLE

Suppose we take a sample of seven households from a low-to-moderate income neighborhood and collect information on their incomes and food expenditures for the past month. The information obtained (in hundreds of Ringgit Malaysia) is given below.

Income | 35 | 49 | 21 | 39 | 15 | 28 | 25
Food expenditure | 9 | 15 | 7 | 11 | 5 | 8 | 9

Find the least squares regression line of food expenditure (Y) on income (X).

SOLUTION

Income, x | Food Expenditure, y | xy | x^2 | y^2
35 | 9 | 315 | 1225 | 81
49 | 15 | 735 | 2401 | 225
21 | 7 | 147 | 441 | 49
39 | 11 | 429 | 1521 | 121
15 | 5 | 75 | 225 | 25
28 | 8 | 224 | 784 | 64
25 | 9 | 225 | 625 | 81
$\sum x = 212$ | $\sum y = 64$ | $\sum xy = 2150$ | $\sum x^2 = 7222$ | $\sum y^2 = 646$

Compute $\sum x$, $\sum y$, $\bar{x}$ and $\bar{y}$:
$\sum x = 212$, $\sum y = 64$
$\bar{x} = \frac{\sum x}{n} = \frac{212}{7} = 30.2857$, $\bar{y} = \frac{\sum y}{n} = \frac{64}{7} = 9.1429$
Compute $\sum xy$ and $\sum x^2$:
$\sum xy = 2150$, $\sum x^2 = 7222$
Compute $S_{xy}$ and $S_{xx}$:
$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} = 2150 - \frac{(212)(64)}{7} = 211.7143$
$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 7222 - \frac{(212)^2}{7} = 801.4286$
Compute $\hat{\beta}_0$ and $\hat{\beta}_1$:
$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{211.7143}{801.4286} = 0.2642$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 9.1429 - (0.2642)(30.2857) = 1.1414$
Thus our regression model is $\hat{y} = 1.1414 + 0.2642x$.

CORRELATION (R)

Correlation measures the strength of a linear relationship between two variables. It is also known as Pearson's product moment coefficient of correlation.
The symbol for the sample coefficient of correlation is $r$.
Formula: $r = \frac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}}$
Properties of $r$:
- $-1 \le r \le 1$
- Values of $r$ close to 1 imply a strong positive linear relationship between $x$ and $y$.
- Values of $r$ close to -1 imply a strong negative linear relationship between $x$ and $y$.
- Values of $r$ close to 0 imply little or no linear relationship between $x$ and $y$.

EXAMPLE

Refer to the previous example. Calculate the value of $r$ and interpret its meaning.
Solution

From the previous example we know that $S_{xy} = 211.7143$ and $S_{xx} = 801.4286$.
Compute $S_{yy}$:
$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 646 - \frac{(64)^2}{7} = 60.8571$
$r = \frac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}} = \frac{211.7143}{\sqrt{(801.4286)(60.8571)}} = 0.9587$
Since the value of $r$ is close to 1, there is a strong positive linear relationship between income ($x$) and food expenditure ($y$).
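The regression and correlation calculations for the income and food expenditure data can be checked with a short Python sketch. It is a minimal illustration using only the $S_{xy}$, $S_{xx}$ and $S_{yy}$ formulas given earlier; the function names (`corrected_sums`, `fit_line`, `pearson_r`) are hypothetical and not part of any library.

```python
import math

# Data from the example, in hundreds of Ringgit Malaysia.
income = [35, 49, 21, 39, 15, 28, 25]
food = [9, 15, 7, 11, 5, 8, 9]


def corrected_sums(x, y):
    """Return (Sxy, Sxx, Syy) using the computational formulas."""
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    return sxy, sxx, syy


def fit_line(x, y):
    """Least squares estimates (intercept, slope) of y = b0 + b1 * x."""
    sxy, sxx, _ = corrected_sums(x, y)
    slope = sxy / sxx
    intercept = sum(y) / len(y) - slope * sum(x) / len(x)
    return intercept, slope


def pearson_r(x, y):
    """Sample coefficient of correlation r = Sxy / sqrt(Sxx * Syy)."""
    sxy, sxx, syy = corrected_sums(x, y)
    return sxy / math.sqrt(sxx * syy)


b0, b1 = fit_line(income, food)
r = pearson_r(income, food)
print(round(b0, 4), round(b1, 4), round(r, 4))
```

Because the sketch carries full precision instead of substituting the rounded slope 0.2642, its intercept comes out near 1.1422 rather than 1.1414; the slope and $r$ agree with the worked values to four decimals.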