CHE357

CHE357: Experimental Data Analysis
Dr. A.Y Omari Sasu
Kwame Nkrumah University of Science and Technology
Department of Statistics and Actuarial Science

STATISTICS

Statistics is the art of learning from data. It is concerned with the collection of data, their subsequent description, and their analysis, which often leads to the drawing of conclusions.

Key Concept

In this section we begin with a few very basic definitions, and then we consider an overview of the process involved in conducting a statistical study. This process consists of “prepare, analyze, and conclude.” “Preparation” involves consideration of the context, the source of data, and sampling method. In future chapters we construct suitable graphs, explore the data, and execute computations required for the statistical method being used. In future chapters we also form conclusions by determining whether results have statistical significance and practical significance.

Statistical thinking involves critical thinking and the ability to make sense of results. Statistical thinking demands so much more than the ability to execute complicated calculations. Through numerous examples, exercises, and discussions, this text will help you develop the statistical thinking skills that are so important in today’s world.

Basic definitions

A variable is a characteristic or attribute that can assume different values.

Data are collections of observations, such as measurements, genders, or survey responses. (A single data value is called a datum, a term rarely used.) Data are the values (measurements or observations) that the variables can assume.

Variables whose values are determined by chance are called random variables.

Statistics is the science of planning studies and experiments; obtaining data; and organizing, summarizing, presenting, analyzing, and interpreting those data and then drawing conclusions based on them.

A population is the complete collection of all measurements or data that are being considered.
Typically, a population is the complete collection of data that we would like to make inferences about.

A census is the collection of data from every member of the population.

A sample is a subcollection of members selected from a population.

Because populations are often very large, a common objective of the use of statistics is to obtain data from a sample and then use those data to form a conclusion about the population.

Example

In the journal article “Residential Carbon Monoxide Detector Failure Rates in the United States” (by Ryan and Arnold, American Journal of Public Health, Vol. 101, No. 10), it was stated that there are 38 million carbon monoxide detectors installed in the United States. When 30 of them were randomly selected and tested, it was found that 12 of them failed to provide an alarm in hazardous carbon monoxide conditions. In this case, the population and sample are as follows:

Population: All 38 million carbon monoxide detectors in the United States
Sample: The 30 carbon monoxide detectors that were selected and tested

The objective is to use the sample data as a basis for drawing a conclusion about the population of all carbon monoxide detectors, and methods of statistics are helpful in drawing such conclusions.

The body of knowledge called statistics is sometimes divided into two main areas, depending on how data are used. The two areas are

• Descriptive statistics
• Inferential statistics

Descriptive statistics consists of the collection, organization, summarization, and presentation of data. In descriptive statistics the statistician tries to describe a situation and present the data in some meaningful form, such as charts, graphs, or tables.

Inferential statistics consists of generalizing from samples to populations, performing estimations and hypothesis tests, determining relationships among variables, and making predictions. The statistician tries to make inferences from samples to populations.
Inferential statistics uses probability, i.e., the chance of an event occurring.

TRY

Determine whether descriptive or inferential statistics were used.
a. The average price of a 30-second ad for the Academy Awards show in a recent year was 1.90 million dollars.
b. The Department of Economic and Social Affairs predicts that the population of Mexico City, Mexico, in 2030 will be 238,647,000 people.
c. A medical report stated that taking statins is proven to lower heart attacks, but some people are at a slightly higher risk of developing diabetes when taking statins.
d. A survey of 2234 people conducted by the Harris Poll found that 55% of the respondents said that excessive complaining by adults was the most annoying social media habit.

Types of Data

Key Concept

A major use of statistics is to collect and use sample data to make conclusions about populations. We should know and understand the meanings of the terms statistic and parameter, as defined below.

Basic Types of Data

Definitions

A parameter is a numerical measurement describing some characteristic of a population.

A statistic is a numerical measurement describing some characteristic of a sample.

Example

There are 17,246,372 high school students in the United States. In a study of 8505 U.S. high school students 16 years of age or older, 44.5% of them said that they texted while driving at least once during the previous 30 days.

1. Parameter: The population size of 17,246,372 high school students is a parameter, because it is the entire population of all high school students in the United States. If we somehow knew the percentage of all 17,246,372 high school students who reported they had texted while driving, that percentage would also be a parameter.
2. Statistic: The sample size of 8505 surveyed high school students is a statistic, because it is based on a sample, not the entire population of all high school students in the United States.
The value of 44.5% is another statistic, because it is also based on the sample, not on the entire population.

Quantitative and Qualitative data

Definitions

Quantitative (or numerical) data consist of numbers representing counts or measurements.

Qualitative (or categorical or attribute) data consist of names or labels (not numbers that represent counts or measurements).

CAUTION Categorical data are sometimes coded with numbers, with those numbers replacing names. Although such numbers might appear to be quantitative, they are actually categorical data.

Example

1. Quantitative Data: The ages (in years) of subjects enrolled in a clinical trial
2. Categorical Data as Labels: The genders (male/female) of subjects enrolled in a clinical trial
3. Categorical Data as Numbers: The identification numbers 1, 2, 3, ..., 25 are assigned randomly to the 25 subjects in a clinical trial. Those numbers are substitutes for names. They don’t measure or count anything, so they are categorical data.

Discrete / Continuous

Quantitative data can be further described by distinguishing between discrete and continuous types.

Discrete data result when the data values are quantitative and the number of values is finite, or “countable.” (If there are infinitely many values, the collection of values is countable if it is possible to count them individually, such as the number of tosses of a coin before getting tails.)

Continuous (numerical) data result from infinitely many possible quantitative values, where the collection of values is not countable. (That is, it is impossible to count the individual items because at least some of them are on a continuous scale, such as the lengths of distances from 0 cm to 12 cm.)

Example

1. Discrete Data of the Finite Type: Each of several physicians plans to count the number of physical examinations given during the next full week. The data are discrete data because they are finite numbers, such as 27 and 46, that result from a counting process.
2. Discrete Data of the Infinite Type: Casino employees plan to roll a fair die until the number 5 turns up, and they count the number of rolls required to get a 5. It is possible that the rolls could go on forever without ever getting a 5, but the numbers of rolls can be counted, even though the counting might go on forever. The collection of the numbers of rolls is therefore countable.
3. Continuous Data: When the typical patient has blood drawn as part of a routine examination, the volume of blood drawn is between 0 mL and 50 mL. There are infinitely many values between 0 mL and 50 mL. Because it is impossible to count the number of different possible values on such a continuous scale, these amounts are continuous data.

The classification of variables can be summarized as follows:

variable
    qualitative
    quantitative
        discrete
        continuous

Try

Classify each variable as a discrete or continuous variable.
a. The number of hours during a week that children ages 12 to 15 reported that they watched television.
b. The number of touchdowns a quarterback scored each year in his college football career.
c. The amount of money a person earns per week working at a fast-food restaurant.
d. The weights of the football players on the teams that play in the NFL this year.

Levels of Measurement

Another common way of classifying data is to use four levels of measurement: nominal, ordinal, interval, and ratio, all defined below.

Level of Measurement   Brief Description                                                                                      Example
Ratio                  There is a natural zero starting point and ratios make sense                                           Heights, lengths, distances, volumes
Interval               Differences are meaningful, but there is no natural zero starting point and ratios are meaningless     Body temperatures in degrees Fahrenheit or Celsius
Ordinal                Data can be arranged in order, but differences either can’t be found or are meaningless                Ranks of colleges in U.S. News & World Report
Nominal                Categories only. Data cannot be arranged in order                                                      Eye colors
The nominal level of measurement is characterized by data that consist of names, labels, or categories only. The data cannot be arranged in some order (such as low to high). Examples include state names, names of individuals, course names, gender, race, religion, or sport. These do not need to be placed in any order. The logical operators are ∪, ∩.

Data are at the ordinal level of measurement if they can be arranged in some order, but differences (obtained by subtraction) between data values either cannot be determined or are meaningless. The order may be either increasing or decreasing. One example would be income levels. The data could have numeric values such as 1, 2, 3, or values such as high, medium, or low. Also, categorical variables that judge size (small, medium, large, etc.) are ordinal variables. The logical operators are ∪, ∩, <, >, =.

Data are at the interval level of measurement if they can be arranged in order, and differences between data values can be found and are meaningful. Data at this level do not have a natural zero starting point at which none of the quantity is present. The logical operators for the interval scale are ∪, ∩, <, >, +, −, =.

Data are at the ratio level of measurement if they can be arranged in order, differences can be found and are meaningful, and there is a natural zero starting point (where zero indicates that none of the quantity is present). For data at this level, differences and ratios are both meaningful. The logical operators for the ratio scale are ∪, ∩, <, >, +, −, ÷, =.

Figure 1: Summary of data types and scale measures

Exercise

Indicate the scale of measurement for the following sets of data.

a. January, February, ..., June
b. Single, Married, Divorced
c. 30, 45, 15, 12
d. 10th Feb, 12th Aug, 25th Sep

a. First class honors, Second class honors, Pass
b. 5 cm, 25 cm, 10 cm, 15 cm
c. 30°C, 37°C, 19°C
d. Christian, Muslim, Hindu, Jewish

a. $559, $870, $170
b. 3 km, 8 km, 4 km, 10 km
c. Breakfast, Lunch, Dinner
d. CPP, PNDC, NPP, NDC

Numerical Descriptive Measure

• Measure of central tendencies
• Measure of dispersion
• Measure of position
• Measure of shape

Measure of Central Tendencies

These are the averages which determine the central location or middle of the data.

The Arithmetic Mean

This is the best known and most commonly used average. Let $x_1, x_2, x_3, \dots, x_n$ be a data set. The mean is given as

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{1} \]

When the data is grouped in a frequency distribution, $x_i$ becomes the class mark or midpoint of the $i$th class with frequency $f_i$, and the mean becomes $\bar{x} = \sum f_i x_i / \sum f_i$.

Assumed Mean

The mean may also be computed using

\[ \bar{x} = A + \frac{\sum f_i d_i}{\sum f_i} \tag{2} \]

where
A = assumed mean
$d_i = x_i - A$ is called the deviation of the $i$th class mark

Weighted Mean

We sometimes associate with each observation a certain weighting factor or weight, depending on the significance attached to the observation. Let the values $x_1, x_2, x_3, \dots, x_n$ be the set of data with weights $w_1, w_2, w_3, \dots, w_n$ respectively. Then the weighted mean is given as

\[ \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \tag{3} \]

Median

The median is the middle-ranked value of an ordered array of data. It divides the data set into two equal parts after the observations have been arranged in order of magnitude. Let $x_1, x_2, x_3, \dots, x_n$ be the observations arranged in increasing order of magnitude. The median, denoted M, is defined as

\[ M = \begin{cases} x_{(n+1)/2}, & \text{if } n \text{ is odd} \\ \dfrac{x_{n/2} + x_{n/2+1}}{2}, & \text{if } n \text{ is even} \end{cases} \tag{4} \]

For grouped data, the median is given as

\[ M = L_m + \left( \frac{\frac{n}{2} - f_{cm}}{f_m} \right) C_m \tag{5} \]

where
$L_m$ = lower class boundary of the median class
$f_{cm}$ = cumulative frequency just before the median class
$f_m$ = frequency of the median class
$C_m$ = class width of the median class
$n$ = total number of observations (total frequency)

Mode

The mode is defined as the most frequently observed value of a given set of observations. For grouped data, the mode is given as

\[ \text{Mode} = LCB + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) C \tag{6} \]

where
LCB = lower class boundary of the modal class
$\Delta_1$ = absolute difference between the frequency of the modal class and the pre-modal class
$\Delta_2$ = absolute difference between the frequency of the modal class and the post-modal class
C = width of the modal class

Example

1. The following data represent the scores on a statistics examination of a sample of students: 87, 63, 91, 72, 80, 77, 93, 69, 75, 79, 70, 83, 94, 75, 88. Find the mean and median mark.

2. The average fuel efficiencies, in miles per gallon, of cars sold in the United States in the years 1999 to 2003 were 28.2, 28.3, 28.4, 28.5, 29.0. Find the sample mean of this set of data.

3. The following is a frequency table of the ages of a sample of members of a symphony for young adults.

Age Value   Frequency
16          9
17          12
18          15
19          8
20          10

a. Find the sample mean of the given ages.

4. A company runs two manufacturing plants. A sample of 30 engineers at plant 1 yielded a sample mean salary of $33,600. A sample of 20 engineers at plant 2 yielded a sample mean salary of $42,400. What is the sample mean salary for all 50 engineers?

5. A student’s final end of semester examination marks in six courses are: 56, 68, 65, 70, 78, 80. If the credits for the courses are 4, 3, 3, 4, 3, 2 respectively, determine the approximate average mark.

6. The distribution below gives measurements on 40 different subjects.

Class Interval   No. of Subjects (f_i)
110-119          4
120-129          6
130-139          3
140-149          5
150-159          10
160-169          4
170-179          8

Using an assumed mean of A = 144.5, compute the mean, mode and median.

Measure of Dispersion

The degree to which the numerical data tend to spread about an average is the dispersion or variation of the data. Numerous measures of dispersion exist, the most common being the range, mean deviation, variance (or standard deviation), quartile deviation and coefficient of variation.

Range

The range is the simplest measure of dispersion. The range of a set of measurements $x_1, x_2, x_3, \dots, x_n$ is defined as the difference between the largest and smallest measurements. In the case of grouped data, the range is defined as the difference between the last and the first class marks.
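As a rough numerical check on the definitions above, the following Python sketch computes the mean, median, and range for the exam scores in Example 1 and the credit-weighted mean for Example 5. The names `weighted_mean`, `data_range`, `exam_scores`, `marks`, and `credits` are illustrative choices, not part of the course notes, and the standard-library `statistics` module stands in for the hand calculation.

```python
from statistics import mean, median

# Example 1: exam scores of a sample of students (values taken from the text).
exam_scores = [87, 63, 91, 72, 80, 77, 93, 69, 75, 79, 70, 83, 94, 75, 88]

# Example 5: marks in six courses and their credit weights (from the text).
marks = [56, 68, 65, 70, 78, 80]
credits = [4, 3, 3, 4, 3, 2]


def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def data_range(values):
    """Range of ungrouped data: largest minus smallest observation."""
    return max(values) - min(values)


if __name__ == "__main__":
    print("mean          =", round(mean(exam_scores), 2))
    print("median        =", median(exam_scores))
    print("range         =", data_range(exam_scores))
    print("weighted mean =", round(weighted_mean(marks, credits), 2))
```

Run as a script, this reports a mean of about 79.73, a median of 79, a range of 31, and a weighted mean of about 68.26 (that is, 1297/19).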
Mean Deviation

The mean deviation (MD) is a measure of the average amount by which the observations $x_1, x_2, x_3, \dots, x_n$ differ from the arithmetic mean, $\bar{x}$. It is given as

\[ MD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} \tag{7} \]

and, for grouped data with class marks $x_i$ and frequencies $f_i$,

\[ MD = \frac{\sum f_i |x_i - \bar{x}|}{\sum f_i} \tag{8} \]

Variance and Standard Deviation

The variance of a set of observations $x_1, x_2, x_3, \dots, x_n$ is the average of the squared deviations from the arithmetic mean. It is denoted by $\sigma^2$ for population data and $S^2$ for sample data. The variance is given as

\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \tag{9} \]

\[ S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \tag{10} \]

For grouped data the variance is given as

\[ \sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{\sum f_i} \tag{11} \]

\[ S^2 = \frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i - 1} \tag{12} \]

Note: The standard deviation is defined as the positive square root of the variance.

Co-efficient of Variation

The standard deviation is useful as a measure of dispersion within a given set of data. Sometimes, we may be interested in comparing variations between two or more sets of data. The standard deviation or the variance can be used for this purpose when the variables are given in the same units and are such that their means are approximately equal. When the units or the means differ, a direct comparison is no longer meaningful. Consider, for instance, comparing the distributions of annual incomes and absenteeism for a group of employees. In order to make a meaningful comparison of the dispersion in incomes and absenteeism, we need to convert each of these standard deviations to a relative value. This relative measure of dispersion is called the co-efficient of variation (CV). The co-efficient of variation is defined as

\[ CV = \frac{S}{\bar{x}} \times 100\% \tag{13} \]

Example

1. During the past few months, one runner averaged 12 miles per week with a standard deviation of 2 miles, while another runner averaged 24 miles per week with a standard deviation of 3 miles. Which of the two runners is relatively more consistent in his weekly running habits?

Measures of Position

1. Quartiles
2. Deciles
3. Percentiles

The general formula for a measure of position for grouped data is given as

\[ \text{Value at position } P = LCB + \left( \frac{P - f_{cm}}{f} \right) C \tag{14} \]

where
LCB is the lower class boundary of the class containing the position
P is the position
C is the class width of that class
$f_{cm}$ is the cumulative frequency just before that class
$f$ is the frequency of that class
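Returning to the dispersion measures defined earlier (mean deviation, variance, and coefficient of variation), a minimal Python sketch for ungrouped data is given below. It assumes the usual n − 1 divisor for the sample variance, and the function names are hypothetical, chosen only for this illustration. The last two lines apply the coefficient of variation to the two runners in the example.

```python
def mean_deviation(values):
    """Average absolute deviation of the observations from their mean."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)


def sample_variance(values):
    """Sample variance; ASSUMES the n - 1 divisor for sample data."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)


def coefficient_of_variation(std_dev, mean):
    """Relative dispersion in percent: (standard deviation / mean) * 100."""
    return std_dev / mean * 100.0


# Runner example from the text: (mean, standard deviation) in miles per week.
cv_runner_1 = coefficient_of_variation(2, 12)   # about 16.67 %
cv_runner_2 = coefficient_of_variation(3, 24)   # 12.5 %
```

Because the second runner has the smaller coefficient of variation, that runner is relatively more consistent, even though the second standard deviation is larger in absolute terms.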
Measure of shape

Measures of shape determine whether the distribution of data exhibits a symmetric pattern or stretches out in a particular direction. Two such measures of shape are the skewness and kurtosis.

Skewness

The skewness of a distribution indicates its degree of symmetry or nonsymmetry. It is measured by the Pearson co-efficient of skewness (Sk), defined by

\[ Sk = \frac{3(\bar{x} - M)}{S.D} \tag{15} \]

where
$\bar{x}$ is the mean
M is the median
S.D is the standard deviation

Interpretation

If Sk = 0, the distribution is said to be symmetric.
If Sk > 0, the distribution is said to be skewed to the right.
If Sk < 0, the distribution is said to be skewed to the left.

Figure 2: Graph of symmetrical and non-symmetrical distributions

Co-efficient of Peakness

The degree of peakness or kurtosis of a distribution is described by the coefficient of kurtosis, k, defined by

\[ k = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^4}{(S.D)^4} \tag{16} \]

which is compared to the value 3.

Interpretation

If k = 3, the distribution is said to be normal.
If k < 3, the distribution is less peaked than the normal distribution.
If k > 3, the distribution is more peaked than the normal distribution.

Figure 3: Graphs of distributions indicating their peakness

Example

The following data give the total number of fires in Ontario, Canada, for 11 months in the year 2002: 6, 13, 5, 7, 7, 3, 7, 2, 5, 9, 8. Compute the co-efficient of skewness.

EXERCISE

1. Here are grouped data for heights of 100 randomly selected male students:

Height (inches)   Class mark (x)   Frequency (f)
59.5-62.5         61               5
62.5-65.5         64               18
65.5-68.5         67               42
68.5-71.5         70               27
71.5-74.5         73               8

a. Determine the coefficient of
   i. Skewness
   ii. Peakness
b. Interpret the results.

2. A study of the test scores for a course in Principles of Management and years of service of the employees enrolled in a Business programme resulted in a mean score of 200 with a standard deviation of 40, and a mean number of years of service of 20 with a standard deviation of 2. Compare the relative dispersion in the two distributions using the coefficient of variation.

3. The variation in the annual incomes of executives is to be compared with the variation in incomes of unskilled employees. For a sample of executives, the mean income is $500,000 with a standard deviation of $50,000, while the unskilled employees have a mean income of $22,000 with a standard deviation of $2,200. Compute the coefficients of variation for a meaningful comparison of variation in annual incomes.

4. Calculate the mean deviation for the following data.

Class   Frequency
0-9     6
10-19   7
20-29   15
30-39   16
40-49   4
50-59   2

Compute the
i. Mode
ii. Median
iii. Variance
iv. Standard deviation.

Graphical display of data and descriptive statistics

When conducting a statistical study, the researcher must gather data for the particular variable under study. For example, if a researcher wishes to study the number of people who were bitten by poisonous snakes in a specific geographic area over the past several years, he or she has to gather the data from various doctors, hospitals, or health departments. To describe situations, draw conclusions, or make inferences about events, the researcher must organize the data in some meaningful way. The most convenient method of organizing data is to construct a frequency distribution.

After organizing the data, the researcher must present them so they can be understood by those who will benefit from reading the study. The most useful method of presenting the data is by constructing statistical charts and graphs. There are many different types of charts and graphs, and each one has a specific purpose.

Frequency distribution

When working with large data sets, a frequency distribution (or frequency table) is often helpful in organizing and summarizing data. A frequency distribution helps us to understand the nature of the distribution of a data set.
A frequency distribution (or frequency table) shows how data are partitioned among several categories (or classes) by listing the categories along with the number (frequency) of data values in each of them.

Time (seconds)   Class boundary   Frequency
75-124           74.5-124.5       11
125-174          124.5-174.5      24
175-224          174.5-224.5      10
225-274          224.5-274.5      3
275-324          274.5-324.5      2

The frequency for a particular class is the number of original values that fall into that class. For example, the first class has a frequency of 11, so 11 of the service times are between 75 seconds and 124 seconds, inclusive.

Lower class limits are the smallest numbers that can belong to each of the different classes. (The above table has lower class limits of 75, 125, 175, 225, and 275.)

Upper class limits are the largest numbers that can belong to each of the different classes. (The above table has upper class limits of 124, 174, 224, 274, and 324.)

Class boundaries are the numbers used to separate the classes, but without the gaps created by class limits.

Class midpoints are the values in the middle of the classes. The above table has class midpoints of 99.5, 149.5, 199.5, 249.5, and 299.5. Each class midpoint can be found by adding the lower class limit to the upper class limit and dividing the sum by 2.

Class width is the difference between two consecutive lower class limits (or two consecutive lower class boundaries) in a frequency distribution. The above table uses a class width of 50. (The first two lower class limits are 75 and 125, and their difference is 50.)

Graphical representation of data

After you have organized the data into a frequency distribution, you can present them in graphical form. The purpose of graphs in statistics is to convey the data to the viewers in pictorial form. It is easier for most people to comprehend the meaning of data presented graphically than data presented numerically in tables or frequency distributions.
This is especially true if the users have little or no statistical knowledge.

Histogram

While a frequency distribution is a useful tool for summarizing data and investigating the distribution of data, an even better tool is a histogram, which is a graph that is easier to interpret than a table of numbers.

A histogram is a graph consisting of bars of equal width drawn adjacent to each other (unless there are gaps in the data). The horizontal scale represents classes of quantitative data values, and the vertical scale represents frequencies. The heights of the bars correspond to frequency values.

Important Uses of a Histogram

■ Visually displays the shape of the distribution of the data
■ Shows the location of the center of the data
■ Shows the spread of the data
■ Identifies outliers

From the table above:

[Figure: Histogram of frequency (vertical axis) against time in seconds (horizontal axis) for the classes 74.5-124.5, 124.5-174.5, 174.5-224.5, 224.5-274.5, and 274.5-324.5]

The Ogive

Another graph that can be used represents the cumulative frequencies for the classes. This type of graph is called the cumulative frequency graph, or ogive. The cumulative frequency is the sum of the frequencies accumulated up to the upper boundary of a class in the distribution.

With the same table above:

Time (seconds)   Time less than     Cumulative frequency
                 Less than 74.5     0
75-124           Less than 124.5    11
125-174          Less than 174.5    35
175-224          Less than 224.5    45
225-274          Less than 274.5    48
275-324          Less than 324.5    50

[Figure: Ogive of cumulative frequency (vertical axis) against times less than (horizontal axis)]

The Pie Graph

The purpose of the pie graph is to show the relationship of the parts to the whole by visually comparing the sizes of the sections. Percentages or proportions can be used. The variable is nominal or categorical.
A pie graph is a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution.

Example

This frequency distribution shows calls received each shift by a local municipality for a recent year. Construct a pie graph for the data.

Shift     Frequency   Angle of sector               Percentage
Day       2594        (2594/7830) × 360° = 119°     (2594/7830) × 100% = 33%
Evening   2800        129°                          36%
Night     2436        112°                          31%

The frequency for each class must be converted to a proportional part of the circle. This conversion is done by using the formula

\[ \text{Degrees} = \frac{f}{n} \cdot 360° \]

where f = frequency for each class and n = sum of the frequencies. Here the total frequency is 7830, and the conversions shown in the table are obtained. The degrees should sum to 360°.

Using a protractor, graph each section and write its name and corresponding percentage as shown.

[Figure: Pie graph of call frequency by shift: Day 33%, Evening 36%, Night 31%]

Introduction to Probability Theory

Terminologies and Notations

Experiment: An experiment is any process that generates well-defined outcomes. There are two types of experiments, namely deterministic and random (or chance) experiments. In deterministic experiments, the observed results are not subject to chance, while the outcomes of random experiments cannot be predicted with certainty.

Trial: A trial is a single performance of an experiment (that is, a repetition of an experiment).

Outcome: The possible result of each trial of an experiment is called an outcome.

Sample Space: It is the set of all possible outcomes of an experiment. It is denoted by the letter S.

Event: An event is a collection of one or more outcomes from an experiment, which is a subset of a sample space. It is denoted by a capital letter.
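The terminology above can be made concrete with a short Python sketch. It treats one coin toss as a trial, enumerates the sample space for two tosses, and picks out an event as a subset of that sample space. The helper names (`sample_space`, `event`, `classical_probability`) are hypothetical, and the probability calculation assumes that all outcomes are equally likely, which is an assumption of this sketch rather than something stated in the definitions.

```python
from fractions import Fraction
from itertools import product

# One trial: a single coin toss with two possible outcomes.
COIN = ("H", "T")


def sample_space(num_tosses):
    """All possible outcomes of tossing a coin num_tosses times."""
    return ["".join(outcome) for outcome in product(COIN, repeat=num_tosses)]


def event(space, condition):
    """An event: the subset of the sample space that satisfies a condition."""
    return [outcome for outcome in space if condition(outcome)]


def classical_probability(event_outcomes, space):
    """Favourable over total outcomes; ASSUMES equally likely outcomes."""
    return Fraction(len(event_outcomes), len(space))


S = sample_space(2)                                 # HH, HT, TH, TT
at_least_one_head = event(S, lambda o: "H" in o)    # HH, HT, TH
p = classical_probability(at_least_one_head, S)     # 3/4
```

Using `Fraction` keeps the result exact, which makes it easy to compare with a hand calculation.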
(17)

Axioms of Probability

• Axiom 1: For every event A, 0 ≤ P(A) ≤ 1
• Axiom 2: For every event A, P(A) ≥ 0
• Axiom 3: For the sure or certain event S, P(S) = 1
• Axiom 4: For any number of mutually exclusive events $A_1, A_2, \dots$, $P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots$

Some Important Definitions in Probability

• Events A and B are mutually exclusive if both cannot occur at the same time, i.e. A ∩ B = ∅ and P(A ∩ B) = 0
• A and B are independent events if and only if P(A ∩ B) = P(A) × P(B)
• If ∅ is an empty set, then P(∅) = 0
• If A′ is the complement of an event A, then P(A′) = 1 − P(A)
• If A and B are any two events, then P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
• Conditional Probability: Let A and B be two events in the sample space S, with P(B) > 0. The probability that an event A occurs given that event B has already occurred, denoted P(A|B), is called the conditional probability of A given B. The conditional probability of A given B is defined as

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \tag{18} \]

EXERCISE

1. A box contains three balls: one red, one blue, and one yellow. Consider an experiment that consists of withdrawing a ball from the box, replacing it, and withdrawing a second ball.
a. What is the sample space of this experiment?
b. What is the event that the first ball drawn is yellow?
c. What is the event that the same ball is drawn twice?

2. An experiment consists of flipping a coin three times and each time noting whether it lands heads or tails.
a. What is the sample space of this experiment?
b. What is the event that tails occur more often than heads?

3. Suppose a coin is flipped twice. Assume that all four possibilities are equally likely to occur. Find the conditional probability that both coins land heads given that the first one does.

Application of Counting Techniques

In a sample space with a large number of outcomes, determining the number of outcomes associated with the events through direct enumeration could be tedious.
In this section, we develop some counting techniques and use them in probability computations. We shall examine three basic counting techniques, namely the Multiplication Principle, Permutation and Combination.

The Multiplication Principle

If an operation can be performed in $n_1$ ways, a second operation can be performed in $n_2$ ways, and so on for the $k$th operation, which can be performed in $n_k$ ways, then the combined experiment or operations can be performed in $n_1 \times n_2 \times n_3 \times \cdots \times n_k$ ways.

Example

• How many different 7-place license plates are possible if the first 3 places are to be occupied by letters and the final 4 by numbers?
Solution: By the generalized version of the basic principle, the answer is 26 · 26 · 26 · 10 · 10 · 10 · 10 = 175,760,000.

• How many license plates would be possible if repetition among letters or numbers were prohibited?
Solution: In this case, there would be 26 · 25 · 24 · 10 · 9 · 8 · 7 = 78,624,000 possible license plates.

Permutation

An ordered arrangement of objects is called a permutation. The number of permutations of

• n distinct objects, taken all together, is $n! = n(n-1)(n-2) \times \cdots \times 3 \times 2 \times 1$
• n distinct objects taken k at a time is ${}^{n}P_{k} = \dfrac{n!}{(n-k)!}$, where k < n
• n objects consisting of groups, of which $n_1$ of the first group are alike, $n_2$ of the second group are alike, and so on for the $k$th group with $n_k$ objects which are alike, is $\dfrac{n!}{n_1! \, n_2! \cdots n_k!}$
• n distinct objects arranged in a circle, called circular permutations, is given by

\[ (n-1)! \tag{20} \]

Example

• How many distinct three-digit numbers can be formed using the digits 2, 4, 6, and 8 if no digit can be repeated?
Solution: The number of distinct three-digit numbers will be ${}^{4}P_{3} = \dfrac{4!}{(4-3)!} = 24$.

• How many different letter arrangements can be formed from the letters PEPPER?
Solution: The 6-letter word PEPPER has 3 P’s, 2 E’s and 1 R. Hence, there are $\dfrac{6!}{3! \, 2! \, 1!} = 60$ possible letter arrangements of the letters PEPPER.

Combination

A combination is a selection of objects in which the order of selection does not matter.
The number of ways in which k objects can be selected from n distinct objects, irrespective of their order, is defined by

\[ {}^{n}C_{k} = \binom{n}{k} = \frac{n!}{k! \, (n-k)!} \tag{21} \]

Example

• The number of ways of choosing a committee of 5 from 9 persons is ${}^{9}C_{5} = \dfrac{9!}{5! \, 4!} = 126$.

• In a tank containing 10 fish, there are three yellow and seven black fish. We select three fish at random.
a. What is the probability that exactly one yellow fish gets selected?
b. What is the probability that at least one yellow fish gets selected?

Solution: Let A be the event that exactly one yellow fish gets selected, and B be the event that at least one yellow fish gets selected. There are ${}^{10}C_{3} = 120$ ways to select three fish from 10.
a. There are ${}^{3}C_{1} = 3$ ways to select a yellow fish and ${}^{7}C_{2} = 21$ ways to select two black fish. By the multiplication rule, the probability of selecting exactly one yellow fish is $P(A) = \dfrac{3 \times 21}{120} = \dfrac{63}{120} = 0.525$.
b. The probability that at least one yellow fish gets selected is the same as 1 − P(none). Since P(none) = ${}^{7}C_{3} / {}^{10}C_{3} = 35/120 \approx 0.292$, we get P(B) = 1 − 0.292 = 0.708.

Probability Distributions

Basic Concepts of a Probability Distribution

A random variable is a variable (typically represented by x) that has a single numerical value, determined by chance, for each outcome of a procedure.

A probability distribution is a description that gives the probability for each value of the random variable. It is often expressed in the format of a table, formula, or graph.

Random variables may also be discrete or continuous.

A discrete random variable has a collection of values that is finite or countable. (If there are infinitely many values, the number of values is countable if it is possible to count them individually, such as the number of tosses of a coin before getting heads.)

A continuous random variable has infinitely many values, and the collection of values is not countable. (That is, it is impossible to count the individual items because at least some of them are on a continuous scale, such as body temperatures.)
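The fish-selection example in the Combination section can be checked with a few lines of Python. This is only a sketch: `prob_exactly_k_yellow` and the constants are hypothetical names introduced here, and `math.comb` (available from Python 3.8) plays the role of the combination formula.

```python
from fractions import Fraction
from math import comb

# Tank from the example: 3 yellow fish, 7 black fish, 3 fish drawn at random.
YELLOW, BLACK, DRAWN = 3, 7, 3


def prob_exactly_k_yellow(k, yellow=YELLOW, black=BLACK, drawn=DRAWN):
    """P(exactly k yellow) = C(yellow, k) * C(black, drawn - k) / C(total, drawn)."""
    favourable = comb(yellow, k) * comb(black, drawn - k)
    return Fraction(favourable, comb(yellow + black, drawn))


p_exactly_one = prob_exactly_k_yellow(1)         # 63/120 = 21/40 = 0.525
p_at_least_one = 1 - prob_exactly_k_yellow(0)    # 1 - 35/120 = 17/24
```

Converting `p_at_least_one` to a float gives about 0.708, in agreement with the worked solution.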
Probability Distribution: Requirements
Every probability distribution must satisfy each of the following three requirements.
1. There is a numerical (not categorical) random variable $x$, and its numerical values are associated with corresponding probabilities.
2. $\sum P(x) = 1$, where $x$ assumes all possible values. (The sum of all probabilities must be 1, but sums such as 0.999 or 1.001 are acceptable because they result from rounding errors.)
3. $0 \le P(x) \le 1$ for every individual value of the random variable $x$. (That is, each probability value must be between 0 and 1 inclusive.)
The second requirement comes from the simple fact that the random variable x represents all possible events in the entire sample space, so we are certain (with probability 1) that one of the events will occur. The third requirement comes from the basic principle that any probability value must be 0 or 1 or a value between 0 and 1.

Example
Construct probability distributions for the following random variables:
I. The number of heads when four fair coins are tossed.
II. The difference between the results of two fair dice rolled together.

Solution
I. The sample space for tossing four fair coins is
S = {HHHH, HHHT, HHTH, HHTT, HTHH, HTHT, HTTH, HTTT, THHH, THHT, THTH, THTT, TTHH, TTHT, TTTH, TTTT}
The random variable X is the number of heads occurring in that experiment, which assumes the values X = 0, 1, 2, 3, 4. Since the 16 outcomes are equally likely, the required probability distribution is

x      0      1      2      3      4
P(x)   1/16   4/16   6/16   4/16   1/16

II. The table below gives the (absolute) difference for every possible pair of outcomes (Die 1, Die 2).

       1   2   3   4   5   6
  1    0   1   2   3   4   5
  2    1   0   1   2   3   4
  3    2   1   0   1   2   3
  4    3   2   1   0   1   2
  5    4   3   2   1   0   1
  6    5   4   3   2   1   0

The random variable X is the difference occurring in that experiment, which assumes the values X = 0, 1, 2, 3, 4, 5.
The required probability distribution is given as

x      0      1       2      3      4      5
P(x)   6/36   10/36   8/36   6/36   4/36   2/36

Example
Let's consider tossing two coins, with the following random variable:
x = number of heads when two coins are tossed
The above x is a random variable because its numerical values depend on chance. With two coins tossed, the number of heads can be 0, 1, or 2, and the table below is a probability distribution because it gives the probability for each value of the random variable x and it satisfies the three requirements listed earlier:
1. The variable x is a numerical random variable, and its values are associated with probabilities, as in the table below.
2. $\sum P(x) = 0.25 + 0.50 + 0.25 = 1$
3. Each value of P(x) is between 0 and 1. (Specifically, 0.25 and 0.50 and 0.25 are each between 0 and 1 inclusive.)
The random variable x in the table below is a discrete random variable, because it has three possible values (0, 1, 2), and three is a finite number, so this satisfies the requirement of being finite or countable.

Probability Distribution for the Number of Heads in Two Coin Tosses

x: Number of Heads When Two Coins Are Tossed    P(x)
0                                               0.25
1                                               0.50
2                                               0.25

EXERCISE
1. The daily demand for cakes at a bakery has the probability function given by

x      0      1      2      3      4      5
P(x)   0.15   0.20   0.35   0.15   0.10   0.05

Let X denote the number of cakes demanded.
I. Verify that it is a probability mass function.
II. Find the probability that there will be at most 3 orders.
2. If P(x) is a probability mass function, find k , elsewhere
3. The random variable k has the probability function; , k = 1,2,3,... Compute the value of b, the expected value and variance of k.

Continuous Probability Distribution
The function f(x) is said to be a probability density function of the continuous random variable x if, for an interval of real numbers [a, b], the following properties are satisfied:
• $f(x) \ge 0$ for any value of x
• $\int_{-\infty}^{\infty} f(x)\,dx = 1$
• $P(a \le x \le b) = \int_{a}^{b} f(x)\,dx$, where $-\infty \le a \le x \le b \le \infty$

Example
a. Let x be a continuous random variable with probability density function: , elsewhere Determine the value of k.
b. Let x have the probability density function
$f(x) = \begin{cases} kx, & 0 \le x \le 3 \\ 3k(4 - x), & 3 < x \le 4 \\ 0, & \text{elsewhere} \end{cases}$ with $k > 0$.
Determine the value of k and hence compute the probabilities $P(1 \le x \le 2)$ and $P(x > 2)$.

Cumulative Distribution Function
The cumulative distribution function (cdf) for a random variable x, denoted F(x), is defined by $F(x) = P(X \le x)$.
If x is a discrete random variable with probability mass function P(x), then $F(x) = \sum_{t \le x} P(t)$, which is a step function.
If, however, X is a continuous random variable with probability density function f(x), where $-\infty \le x \le \infty$, then $F(x) = \int_{-\infty}^{x} f(t)\,dt$ and $P(x_1 \le x \le x_2) = F(x_2) - F(x_1)$.

Properties of F(x)
• $F(a) \le F(b)$ whenever $a \le b$
• $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$
• $0 \le F(x) \le 1$

Exercise
1. A random variable X has the following distribution:

x      -5     0      3      6
P(x)   0.2    0.1    0.4    0.3

Find the cumulative distribution function F(x).
2. The CDF of a discrete random variable X is given in the following table:

x      -1     0      2      5      6
F(x)   0.1    0.15   0.4    0.8    1

a. Find P(X = 2)
b. Find P(X > 0).
3. Let the function
$f(x) = \begin{cases} c x^2, & 0 < x < 3 \\ 0, & \text{elsewhere} \end{cases}$
a. Find the value of c so that f(x) is a density function.
b. Compute P(2 < X < 3).
c. Find the distribution function F(x).
4. The random variable X has a cumulative distribution function: Find the probability density function of X.

Parameters of a Probability Distribution
Remember that with a probability distribution, we have a description of a population instead of a sample, so the values of the mean, standard deviation, and variance are parameters, not statistics.
The mean, variance, and standard deviation of a probability distribution can be found with the following formulas.

Mean $\mu$ for a probability distribution
$\mu = \sum x \cdot P(x)$, if x is discrete
$\mu = \int_{-\infty}^{\infty} x f(x)\,dx$, if x is continuous and $-\infty \le x \le \infty$

Variance $\sigma^2$ for a probability distribution
The variance of the random variable x with probability distribution P(x) or f(x) is defined by
$\mathrm{Var}(x) = \sigma^2 = \sum (x - \mu)^2 \cdot P(x)$, or equivalently $\mathrm{Var}(x) = \sigma^2 = \sum x^2 \cdot P(x) - \mu^2$, if x is discrete
$\mathrm{Var}(x) = \sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$, if x is continuous.

Standard deviation $\sigma$ is the positive square root of the variance.

Median of a distribution
The median of a distribution of the random variable x is that value x = m such that $P(x \le m)$ or $P(x \ge m)$ equals 0.5 or is close to it. The median is obtained from the equation
I. $\sum_{x \le m} P(x) = 0.5$, or close to it, if x is discrete.
II. $\int_{a}^{m} f(x)\,dx = 0.5$, if x is continuous and is such that $a \le x \le b$.

Mode
The mode of a distribution of the random variable x is that value $x = m_0$ that maximizes the probability distribution function P(x) or f(x).

Finding the Mean, Variance, and Standard Deviation
Using the table below, find the mean, variance and standard deviation.

x: Number of Heads When Two Coins Are Tossed    P(x)    x·P(x)    x²·P(x)
0                                               0.25    0         0
1                                               0.5     0.5       0.5
2                                               0.25    0.5       1
Total                                           1       1         1.5

Therefore the mean is $\mu = \sum x \cdot P(x) = 1$,
the variance is $\sigma^2 = \sum x^2 \cdot P(x) - \mu^2 = 1.5 - 1^2 = 0.5$,
and the standard deviation is $\sigma = \sqrt{0.5} = 0.707$.

Example
1. Let X take the values 1 and 0, each with probability $\frac{1}{2}$. Then $E(X) = 1\left(\frac{1}{2}\right) + 0\left(\frac{1}{2}\right) = \frac{1}{2}$.
2. Let X be a discrete random variable whose probability function is given in the following table: X -1 0 1 2 3 P(x) Find E(X) and the standard deviation of the random variable x.
3. Let Y be a random variable with pdf
a. Find the expected value and variance of Y.
b. Let X = 300Y + 50. Find E(X) and Var(X).
c. Find P(X > 750).
d. Determine the median.

SPECIAL PROBABILITY DISTRIBUTIONS
Discrete Probability Distributions
1. Bernoulli Process
2. Binomial Distribution
3. Poisson Distribution
Continuous Probability Distributions
1. The Uniform Distribution
2. The Exponential Distribution
3. The Normal Distribution

Discrete Probability Distribution
Bernoulli Process
A random variable x is said to have a Bernoulli distribution if it assumes the values 0 and 1 for the two outcomes of a trial. The probability distribution for the success in the trial, x, is defined by
$P(x) = p^x (1 - p)^{1 - x}$, $x = 0, 1$ and $0 < p < 1$ (22)
where the mean and variance of the distribution are $\mu = E(x) = p$ and $\sigma^2 = \mathrm{Var}(x) = p(1 - p)$.

Binomial Probability Distributions
Binomial probability distributions allow us to deal with circumstances in which the outcomes belong to two categories, such as heads/tails or acceptable/defective or survived/died.
A binomial probability distribution results from a procedure that meets these four requirements:
1. The procedure has a fixed number of trials. (A trial is a single observation.)
2. The trials must be independent, meaning that the outcome of any individual trial doesn't affect the probabilities in the other trials.
3. Each trial must have all outcomes classified into exactly two categories, commonly referred to as success and failure.
4. The probability of a success remains the same in all trials.

Notation for Binomial Probability Distributions
S and F (success and failure) denote the two possible categories of all outcomes.
P(S) = p (p = probability of a success)
P(F) = 1 - p = q (q = probability of a failure)
n: the fixed number of trials
x: a specific number of successes in n trials, so x can be any whole number between 0 and n, inclusive
p: probability of success in one of the n trials
q: probability of failure in one of the n trials
P(x): probability of getting exactly x successes among the n trials

In a binomial probability distribution, probabilities can be calculated by using the formula
$P(x) = \binom{n}{x} p^x q^{n - x}$ for $x = 0, 1, 2, 3, \ldots, n$
where
n = number of trials
x = number of successes among n trials
p = probability of success in any one trial
q = probability of failure in any one trial (q = 1 - p)
$\binom{n}{x} = \dfrac{n!}{(n - x)!\, x!}$

Example
Given that there is a 0.85 probability that a randomly selected adult knows what Twitter is, use the binomial probability formula to find the probability that when five adults are randomly selected, exactly three of them know what Twitter is.
We are to find P(3) given that n = 5, x = 3, p = 0.85, and q = 0.15. Using the formula,
$P(3) = \dfrac{5!}{(5 - 3)!\, 3!} (0.85)^3 (0.15)^{5 - 3} = (10)(0.614125)(0.0225) = 0.138178 \approx 0.138$ (rounded to three significant digits)

Poisson Probability Distributions
The following definition states that Poisson distributions are used with occurrences of an event over a specified interval, and here are some applications:
• Number of Internet users logging onto a website in one day
• Number of patients arriving at an emergency room in one hour
• Number of Atlantic hurricanes in one year
A Poisson probability distribution is a discrete probability distribution that applies to occurrences of some event over a specified interval. The random variable x is the number of occurrences of the event in an interval. The interval can be time, distance, area, volume, or some similar unit. The probability of the event occurring x times over an interval is given by
$P(x) = \dfrac{\mu^x e^{-\mu}}{x!}$, $x = 0, 1, 2, 3, \ldots$ and $\mu > 0$
where $e \approx 2.71828$ and $\mu$ = mean number of occurrences of the event in the interval.
The mean and variance are the same: $E(x) = \mu = \mathrm{Var}(x)$. The distribution of x may simply be denoted as $x \sim p(\mu)$.

Example
1. If X is a Poisson random variable with parameter λ = 2 (that is, μ = 2 in the notation above), find P(X = 0).
Solution: Using the facts that $2^0 = 1$ and $0! = 1$, we obtain $P(X = 0) = e^{-2} = 0.1353$.

Continuous Probability Distributions
Uniform Distribution
A random variable X is said to have a uniform probability distribution on (a, b), denoted by U(a, b), if the density function of X is given by
$f(x) = \begin{cases} \dfrac{1}{b - a}, & a < x < b \\ 0, & \text{elsewhere} \end{cases}$
where the mean and variance are $E(X) = \dfrac{a + b}{2}$ and $\mathrm{Var}(X) = \dfrac{(b - a)^2}{12}$.

Figure 4: Uniform probability density

Example
1. If X is a uniformly distributed random variable over (0, 10), calculate the probability that
a. X < 3
b. X > 6
c. 3 < X < 8.
Solution
a. $P(X < 3) = \dfrac{3 - 0}{10} = \dfrac{3}{10}$
b. $P(X > 6) = \dfrac{10 - 6}{10} = \dfrac{4}{10}$
c. $P(3 < X < 8) = \dfrac{8 - 3}{10} = \dfrac{1}{2}$
2. You are to meet a friend at 2 p.m. However, while you are always exactly on time, your friend is always late and indeed will arrive at the meeting place at a time uniformly distributed between 2 and 3 p.m. Find the probability that you will have to wait
a. At least 30 minutes
b. Less than 15 minutes
c. Between 10 and 35 minutes
d. Less than 45 minutes
3. Buses arrive at a specified stop at 15-minute intervals starting at 7 A.M. That is, they arrive at 7, 7:15, 7:30, 7:45, and so on. If a passenger arrives at the stop at a time that is uniformly distributed between 7 and 7:30, find the probability that he waits
a. less than 5 minutes for a bus;
b. more than 10 minutes for a bus.
4. The melting point, X, of a certain solid may be assumed to be a continuous random variable that is uniformly distributed between the temperatures 100°C and 120°C. Find the probability that such a solid will melt between 112°C and 115°C.

Exponential Distribution
A continuous random variable whose probability density function is given, for some λ > 0, by , elsewhere is said to be an exponential random variable (or, more simply, is said to be exponentially distributed) with parameter λ > 0 as the mean. The random variable x represents length of time or space. and

Example
1. The time, in hours, during which an electrical generator is operational is a random variable that follows the exponential distribution with λ = 160. What is the probability that a generator of this type will be operational for
a. Less than 40 hours?
b. Between 60 and 160 hours?
c. More than 200 hours?
2. The time (in hours) required to repair a machine is an exponentially distributed random variable with parameter λ = 12. What is
a. the probability that a repair time exceeds 2 hours?
b. the conditional probability that a repair takes at least 10 hours, given that its duration exceeds 9 hours?
3. The number of years a radio functions is exponentially distributed with parameter λ = 18. If Jones buys a used radio, what is the probability that it will be working after an additional 8 years?

Normal Distribution
The probability density function for the normal random variable x, which is simply called the normal distribution, is defined by
$f(x) = \dfrac{1}{\sigma \sqrt{2\pi}} \exp\left( -\dfrac{(x - \mu)^2}{2\sigma^2} \right)$, $-\infty < x < \infty$
where $\sigma > 0$, $E(x) = \mu$ and $\mathrm{Var}(x) = \sigma^2$.
A random variable modelled by the normal distribution with mean $\mu$ and variance $\sigma^2$ is denoted as $x \sim N(\mu, \sigma^2)$.
The normal probability density function is a bell-shaped density curve that is symmetric about the value $\mu$. Its variability is measured by $\sigma$. The larger $\sigma$ is, the more variability there is in this curve. The figure below presents three different normal probability density functions. Note how the curves flatten out as $\sigma$ increases.

Figure 5: Three normal probability density functions

Computations of Probabilities of Normal Random Variable
To compute the probability that x lies within the interval [a, b], $P(a \le x \le b)$, the normal random variable x is standardized using the transformation $z = \dfrac{x - \mu}{\sigma}$, called the Z-score.

Example
1. Find the following probabilities using the normal table
i. $P(z \le -1.95)$
ii. $P(-1.18 \le z \le 0.48)$
iii. $P(0 \le z \le 2.58)$
iv. $P(z > 2.63)$
v. $P(-2.35 \le z \le 2.35)$
2. Suppose that $y \sim N(6, 4)$. What percentage of the values of y will fall between 5 and 10?
3. Let $X \sim N(12, 5)$. Find the value of $x_0$ such that
a. $P(X > x_0) = 0.05$
b. $P(X < x_0) = 0.98$
c. $P(X < x_0) = 0.20$
d. $P(X > x_0) = 0.90$.
4. The scores, X, of an examination may be assumed to be normally distributed with $\mu = 70$ and $\sigma^2 = 49$. What is the probability that:
a. A score chosen at random will be between 80 and 85?
b. A score will be greater than 75?
c. A score will be less than 90?
d. Interpret the meaning of (a), (b), and (c).
The Standard Normal Distribution
The Standard Score
The standard score, or z-score, represents the number of standard deviations a value of the random variable x falls from the mean.
$z = \dfrac{x - \mu}{\sigma}$

Example
The test scores for a civil service exam are normally distributed with a mean of 152 and a standard deviation of 7. Find the standard z-score for a person with a score of 161, 148 and 152.
Solution
$z = \dfrac{161 - 152}{7} = 1.29$
$z = \dfrac{148 - 152}{7} = -0.57$
$z = \dfrac{152 - 152}{7} = 0$

Finding Probabilities
To find the probability that z is less than a given value, read the cumulative area in the table corresponding to that z-score.
E.g. $P(z < -1.45) = 0.0735$
To find the probability that z is greater than a given value, subtract the cumulative area in the table from 1.
E.g. $P(z > -1.24) = 1 - 0.1075 = 0.8925$
To find the probability that z is between two given values, find the cumulative area for each and subtract the smaller area from the larger.
E.g. $P(-1.25 < z < 1.17) = 0.8790 - 0.1056 = 0.7734$

Exercise
1. (a) Find the following probabilities using the normal table
i. $P(z \le -1.95)$
ii. $P(-1.18 \le z \le 0.48)$
iii. $P(0 \le z \le 2.58)$
iv. $P(z > 2.63)$
v. $P(-2.35 \le z \le 2.35)$
(b) Suppose that $y \sim N(6, 4)$. What percentage of the values of y will fall between 5 and 10?
2. (a) The nicotine content of a brand of cigarettes is normally distributed with a mean of 2.0 mg and a standard deviation of 0.25 mg. What is the probability that a cigarette will have nicotine content
i. of 1.65 mg or less?
ii. between 1.50 mg and 2.25 mg?
iii. of 2.18 mg or more?
(b) The weekly amount spent for maintenance and repairs in a certain company was observed, over a long period of time, to be approximately normally distributed with a mean of $400 and a standard deviation of $20. If $450 is budgeted for the week, what is the probability that the actual costs will exceed the budgeted amount? How much should be budgeted for weekly repairs and maintenance so that the budgeted amount is exceeded with a probability of 0.1?
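The table look-ups in the worked examples of this section (not the exercises) can be reproduced in software. The sketch below assumes Python 3.8 or later and uses `statistics.NormalDist` from the standard library in place of the printed normal table; the helper name `z_score` is illustrative only.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution: mean 0, standard deviation 1

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

# Civil service exam example: mean 152, standard deviation 7.
exam_z = [round(z_score(x, 152, 7), 2) for x in (161, 148, 152)]

# P(z < -1.45): cumulative area to the left of -1.45.
p_less = Z.cdf(-1.45)

# P(z > -1.24): one minus the cumulative area.
p_greater = 1 - Z.cdf(-1.24)

# P(-1.25 < z < 1.17): difference of two cumulative areas.
p_between = Z.cdf(1.17) - Z.cdf(-1.25)

print(exam_z)
print(round(p_less, 4), round(p_greater, 4), round(p_between, 4))
```

Because the printed table rounds each area to four decimal places before subtracting, the software value for an interval probability can differ from the table-based answer in the last digit.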
Central Limit Theorem
The Central Limit Theorem states that, under rather general conditions, sums and means of random samples of measurements drawn from a population tend to have an approximately normal distribution.
Let a random sample of n observations be selected from a population with mean $\mu$ and variance $\sigma^2$. The sampling distribution of the sample mean $\bar{x}$ will be approximately normal with mean $\mu_{\bar{x}} = \mu$ and standard deviation $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$, provided n is sufficiently large.

Example
The mean height of African men (ages 20-29) is $\mu = 69.2$ inches with $\sigma = 2.9$ inches. Random samples of 60 such men are selected. Find the
i. mean and standard deviation of the sampling distribution.
ii. probability that the mean height is greater than 70 inches.
Solution
$\mu = 69.2$, $\sigma = 2.9$, $n = 60$.
i. The distribution of means of samples of size 60 will be approximately normal with $\mu_{\bar{x}} = \mu = 69.2$ and $\sigma_{\bar{x}} = \dfrac{2.9}{\sqrt{60}} = 0.3744$.
ii. The z-score for a sample mean of 70 is $z = \dfrac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \dfrac{70 - 69.2}{0.3744} = 2.14$, so $P(\bar{x} > 70) = P(z > 2.14) = 1 - 0.9838 = 0.0162$.

Example
If x and y are independent normal random variables with $E(x) = 1$, $\mathrm{Var}(x) = 4$, $E(y) = 10$ and $\mathrm{Var}(y) = 9$, determine the following:
i. $E(2x + 3y)$ and $\mathrm{Var}(2x + 3y)$
ii. $P(2x + 3y < 40)$
Solution
i. Given that $x \sim N(1, 4)$ and $y \sim N(10, 9)$, we let $T = 2x + 3y$. Then
$\mu_T = E(T) = E(2x + 3y) = 2E(x) + 3E(y) = 2(1) + 3(10) = 32$
$\sigma_T^2 = \mathrm{Var}(T) = \mathrm{Var}(2x + 3y) = 2^2\,\mathrm{Var}(x) + 3^2\,\mathrm{Var}(y) = 4(4) + 9(9) = 97 = 9.85^2$
ii. From (i), $T \sim N(32, 97)$, so
$P(T < 40) = \Phi\left( \dfrac{40 - 32}{9.85} \right) = \Phi(0.81) = 0.7910$

Exercise
The mass of a biscuit is a normal random variable (x) with mean 50 grams and standard deviation 4 grams. A packet contains 20 biscuits, and the mass of the packaging material is also a normal random variable, with mean 100 grams and standard deviation 3 grams. Find the probability that the mass of the total packet
i. Will exceed 1,047 grams
ii. Lies between 1,050 and 1,200 grams

The Normal Approximation to Binomial
The normal distribution provides a good approximation to the binomial distribution when the number of trials, n, is large, the probability of a success in a trial, p, is not close to 0 or 1, and both $np$ and $np(1 - p)$ are greater than 5. The binomial random variable x then becomes approximately a normal random variable with mean $\mu = np$ and variance $\sigma^2 = np(1 - p)$.
To improve the approximation, a continuity correction may be applied by adding 0.5 to, or subtracting 0.5 from, x to account for the fact that a discrete distribution is being approximated by a continuous distribution. In this case the standardized random variable becomes
$z = \dfrac{x \pm 0.5 - np}{\sqrt{np(1 - p)}}$

Example
Suppose that x has a binomial distribution with n = 200 and p = 0.4. Using the continuity correction, use the normal approximation to the binomial to find each of the following probabilities:
• $P(x = 90)$
• $P(x \le 95)$
• $P(x > 65)$
• $P(x < 60)$
• $P(70 < x < 100)$
Solution
$\mu = 200(0.4) = 80$ and $\sigma = \sqrt{200(0.4)(0.6)} = 6.9282$
• $P(x = 90) = P(89.5 \le x \le 90.5) = \Phi\left( \dfrac{90.5 - 80}{6.9282} \right) - \Phi\left( \dfrac{89.5 - 80}{6.9282} \right) = \Phi(1.52) - \Phi(1.37) = 0.9357 - 0.9147 = 0.0210$
• $P(x \le 95) = P(x \le 95.5) = \Phi\left( \dfrac{95.5 - 80}{6.9282} \right) = \Phi(2.24) = 0.9875$
• $P(x > 65) = 1 - P(x \le 65.5) = 1 - \Phi\left( \dfrac{65.5 - 80}{6.9282} \right) = 1 - \Phi(-2.09) = 1 - 0.0183 = 0.9817$
• $P(x < 60) = P(x \le 59.5) = \Phi\left( \dfrac{59.5 - 80}{6.9282} \right) = \Phi(-2.96) = 0.0015$
• $P(70 < x < 100) = P(70.5 \le x \le 99.5) = \Phi\left( \dfrac{99.5 - 80}{6.9282} \right) - \Phi\left( \dfrac{70.5 - 80}{6.9282} \right) = \Phi(2.81) - \Phi(-1.37) = 0.9975 - 0.0853 = 0.9122$

Exercise
A manufacturer of components for electric motors has found that about 10% of the production will not meet customer specifications. If 500 components are examined,
• Find the expected number of components which did not meet customer specifications.
• Find the probability that 52 or more components did not meet customer specifications.
• Find the probability that between 36 and 58 (inclusive) components did not meet customer specifications.
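To see how well the continuity-corrected approximation performs, the worked example with n = 200 and p = 0.4 can be compared against exact binomial probabilities. This is a minimal sketch using only the Python standard library (Python 3.8+); the function names `binom_pmf` and `normal_approx` are assumptions made for illustration, not part of the notes.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_pmf(k, n, p):
    """Exact binomial probability P(X = k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def normal_approx(n, p, lo=None, hi=None):
    """Approximate P(lo <= X <= hi) for X ~ Binomial(n, p).

    lo and hi are integers (or None for an open end); the continuity
    correction widens the interval by 0.5 on each closed side.
    """
    dist = NormalDist(n * p, sqrt(n * p * (1 - p)))
    upper = dist.cdf(hi + 0.5) if hi is not None else 1.0
    lower = dist.cdf(lo - 0.5) if lo is not None else 0.0
    return upper - lower

n, p = 200, 0.4

approx_eq_90 = normal_approx(n, p, lo=90, hi=90)   # P(x = 90)
exact_eq_90 = binom_pmf(90, n, p)

approx_le_95 = normal_approx(n, p, hi=95)          # P(x <= 95)
exact_le_95 = sum(binom_pmf(k, n, p) for k in range(0, 96))

approx_71_99 = normal_approx(n, p, lo=71, hi=99)   # P(70 < x < 100)
exact_71_99 = sum(binom_pmf(k, n, p) for k in range(71, 100))

for label, a, e in [("P(x = 90)", approx_eq_90, exact_eq_90),
                    ("P(x <= 95)", approx_le_95, exact_le_95),
                    ("P(70 < x < 100)", approx_71_99, exact_71_99)]:
    print(f"{label}: approx {a:.4f}, exact {e:.4f}")
```

Note that strict inequalities are first rewritten in terms of integers (for example, 70 < x < 100 becomes 71 ≤ x ≤ 99) before the 0.5 correction is applied. Small differences from the table-based answers arise because the table rounds z to two decimal places.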
Confidence Intervals
Point Estimate
A point estimate is a single value estimate for a population parameter. The best point estimate of the population mean $\mu$ is the sample mean $\bar{x}$.

Interval Estimate
An interval estimate is an interval or range of values used to estimate a population parameter. The level of confidence, c, is the probability that the interval estimate contains the population parameter.

Confidence Level
The confidence level of an interval estimate of a parameter is the probability that the interval estimate will contain the parameter.

Maximum Error of Estimate
The maximum error of estimate E is the maximum likely difference between the point estimate of a parameter and the actual value of the parameter.
$E = z_c \sigma_{\bar{x}} = z_c \dfrac{\sigma}{\sqrt{n}}$
When $n \ge 30$, the sample standard deviation, s, can be used for $\sigma$.

Confidence Intervals
A confidence interval is a specific interval estimate of a parameter determined by using data obtained from a sample and by using the specific confidence level of the estimate. A confidence interval for the population mean is
$\bar{x} - E < \mu < \bar{x} + E$

Example
The president of a large university wishes to estimate the average age of the students presently enrolled. From past studies, the standard deviation is known to be 2 years. A sample of 50 students is selected, and the mean is found to be 23.2 years. Find the 95% confidence interval of the population mean.
Solution
Since the 95% confidence interval is desired, $z_{\alpha/2} = 1.96$. Hence, substituting in the formula
$\bar{x} - z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}}$
$23.2 - 1.96 \left( \dfrac{2}{\sqrt{50}} \right) < \mu < 23.2 + 1.96 \left( \dfrac{2}{\sqrt{50}} \right)$
$22.6 < \mu < 23.8$
Hence, the president can say, with 95% confidence, that the average age of the students is between 22.6 and 23.8 years, based on 50 students.

Exercises
• A survey of 30 adults found that the mean age of a person's primary vehicle is 5.6 years. Assuming the standard deviation of the population is 0.8 year, find the 99% confidence interval of the population mean.
• The following data represent a sample of the assets (in millions of dollars) of 30 credit unions in southwestern Pennsylvania. Find the 90% confidence interval of the mean.
12.23  16.56   4.39   2.89   1.24   2.17
13.19   9.16   1.42  73.25   1.91  14.64
11.59   6.69   1.06   8.74   3.17  18.13
 7.92   4.78  16.85  40.22   2.42  21.58
 5.01   1.47  12.24   2.27  12.77   2.76

Formula for the Minimum Sample Size Needed for an Interval Estimate of the Population Mean
$n = \left( \dfrac{z_c \sigma}{E} \right)^2$
where E is the maximum error of estimate.

Example
The college president asks the statistics teacher to estimate the average age of the students at their college. How large a sample is necessary? The statistics teacher would like to be 99% confident that the estimate is accurate within 1 year, and the standard deviation of the ages is taken to be 3 years.
Solution
With $E = 1$, $z_c = 2.58$ and $\sigma = 3$,
$n = \left( \dfrac{z_c \sigma}{E} \right)^2 = \left( \dfrac{2.58(3)}{1} \right)^2 = 59.9 \approx 60$
Therefore, to be 99% confident that the estimate is within 1 year of the true mean age, the teacher needs a sample size of at least 60 students.

Exercise
a. You want to estimate the mean one-way fare. How many fares must be included in your sample if you want to be 95% confident that the sample mean is within $2 of the population mean?
b. The growing seasons for a random sample of 35 U.S. cities were recorded, yielding a sample mean of 190.7 days and a sample standard deviation of 54.2 days. Estimate the true population mean of the growing season with 95% confidence.
c. How many cities' growing seasons would have to be sampled in order to estimate the true mean growing season with 95% confidence within 2 days? (Use a standard deviation of 54.2 days.)
d. A restaurant owner wishes to find the 99% confidence interval of the true mean cost of a dry martini. How large should the sample be if she wishes to be accurate within $0.10? A previous study showed that the standard deviation of the price was $0.12.
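The two worked examples above (the 95% interval for the mean student age and the minimum sample size at 99% confidence) follow directly from the formulas for E and n. The sketch below is one way to reproduce them, assuming Python 3.8+ and using `statistics.NormalDist.inv_cdf` to obtain the critical value instead of a printed table; all function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical value z_c for a given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def mean_confidence_interval(xbar, sigma, n, confidence):
    """Interval xbar - E < mu < xbar + E with E = z_c * sigma / sqrt(n)."""
    margin = z_critical(confidence) * sigma / sqrt(n)
    return xbar - margin, xbar + margin

def min_sample_size_mean(sigma, margin, confidence):
    """Smallest n with z_c * sigma / sqrt(n) <= margin (always rounded up)."""
    return ceil((z_critical(confidence) * sigma / margin) ** 2)

# University example: xbar = 23.2, sigma = 2, n = 50, 95% confidence.
low, high = mean_confidence_interval(23.2, 2, 50, 0.95)

# College example: sigma = 3, E = 1 year, 99% confidence.
n_needed = min_sample_size_mean(3, 1, 0.99)

print(round(low, 1), round(high, 1), n_needed)
```

The computed critical values (about 1.960 and 2.576) are slightly more precise than the table values 1.96 and 2.58, but for these two examples the rounded answers agree with the hand calculations.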
Confidence Intervals for the mean (Small samples)
If the distribution of a random variable x is normal and n < 30, then the sampling distribution of $\bar{x}$ is a t-distribution with n - 1 degrees of freedom.

Degrees of freedom
They are the number of values that are free to vary after a sample statistic has been computed. For example, if the mean of 5 values is 10, then 4 of the 5 values are free to vary. But once 4 values are selected, the fifth value must be a specific number to get a sum of 50, since 50/5 = 10. Hence, the degrees of freedom are 5 - 1 = 4, and this value tells the researcher which t curve to use.

Confidence interval for small samples
Maximum error of estimate: $E = t_c \left( \dfrac{s}{\sqrt{n}} \right)$

Formula for a Specific Confidence Interval for the Mean When $\sigma$ Is Unknown and n < 30
$\bar{x} - t_c \dfrac{s}{\sqrt{n}} < \mu < \bar{x} + t_c \dfrac{s}{\sqrt{n}}$
The degrees of freedom are n - 1.

Example
Find the $t_{\alpha/2}$ value for a 95% confidence interval when the sample size is 22.
Solution
The d.f. = 22 - 1, or 21. Find 21 in the left column and 95% in the row labeled "Confidence intervals." The intersection where the two meet gives the value for $t_{\alpha/2}$, which is 2.080. See Figure 2 below.

Figure: t table

Example
Ten randomly selected automobiles were stopped, and the tread depth of the right front tire was measured. The mean was 0.32 inch, and the standard deviation was 0.08 inch. Find the 95% confidence interval of the mean depth. Assume that the variable is approximately normally distributed.
Solution
Since $\sigma$ is unknown and s must replace it, the t distribution must be used for the 95% confidence interval. Hence, with 9 degrees of freedom, $t_c = 2.262$:
$\bar{x} - t_c \dfrac{s}{\sqrt{n}} < \mu < \bar{x} + t_c \dfrac{s}{\sqrt{n}}$
$0.32 - (2.262) \left( \dfrac{0.08}{\sqrt{10}} \right) < \mu < 0.32 + (2.262) \left( \dfrac{0.08}{\sqrt{10}} \right)$
$0.26 < \mu < 0.38$
Therefore, one can be 95% confident that the population mean tread depth of all right front tires is between 0.26 and 0.38 inch, based on a sample of 10 tires.

Exercises
a. The data represent a sample of the number of home fires started by candles for the past several years.
Find the 99% confidence interval for the mean number of home fires started by candles each year.
5460 5900 6090 6310 7160 8440 9930
b. The average hemoglobin reading for a sample of 20 teachers was 16 grams per 100 milliliters, with a sample standard deviation of 2 grams. Find the 99% confidence interval of the true mean.
c. A sample of 17 states had these cigarette taxes (in cents):
112 120 98 55 71 35 99 124 64 150 150 55 100 132 20 70 93
Find a 98% confidence interval for the mean cigarette tax in all 50 states.
d. The number of grams of carbohydrates in a 12-ounce serving of a regular soft drink is listed here for a random sample of sodas. Estimate the mean number of carbohydrates in all brands of soda with 95% confidence.
48 37 52 40 43 46 41 38 41 45 45 33 35 52 45 41 30 34 46 40

Figure: Summary on when to use z or t distribution

Population Proportions
A proportion represents a part of a whole. The proportion of successes in a sample is given by
$\hat{p} = \dfrac{x}{n}$
where x is the number of sample units that possess the characteristic of interest and n is the sample size.
$\hat{q}$ is the point estimate for the proportion of failures, where $\hat{q} = 1 - \hat{p}$.
If $np \ge 5$ and $nq \ge 5$, the sampling distribution of $\hat{p}$ is approximately normal.

Confidence interval for population proportions
The maximum error of estimate, E, for the confidence interval is
$E = z_c \sqrt{\dfrac{\hat{p}\hat{q}}{n}}$
The confidence interval for the population proportion, p, is
$\hat{p} - E < p < \hat{p} + E$

Example
A sample of 500 nursing applications included 60 from men. Find the 90% confidence interval of the true proportion of men who applied to the nursing program.
Solution
$\hat{p} = \dfrac{60}{500} = 0.12$, $\hat{q} = 1 - 0.12 = 0.88$ and $z_c = 1.65$. Then
$\hat{p} - z_c \sqrt{\dfrac{\hat{p}\hat{q}}{n}} < p < \hat{p} + z_c \sqrt{\dfrac{\hat{p}\hat{q}}{n}}$
$0.12 - 1.65 \sqrt{\dfrac{0.12(0.88)}{500}} < p < 0.12 + 1.65 \sqrt{\dfrac{0.12(0.88)}{500}}$
$0.096 < p < 0.144$
Hence, one can be 90% confident that the percentage of applicants who are men is between 9.6% and 14.4%.
Note: If no approximation of $\hat{p}$ is known, one should use $\hat{p} = 0.5$.
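The nursing-applications example can be reproduced with a few lines of code. This is a minimal sketch under the same normal-approximation assumption as the text, using `statistics.NormalDist` (Python 3.8+) for the critical value; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence):
    """Normal-approximation interval p_hat - E < p < p_hat + E."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    z_c = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_c * sqrt(p_hat * q_hat / n)   # E = z_c * sqrt(p_hat*q_hat/n)
    return p_hat - margin, p_hat + margin

# 60 applications from men out of 500, 90% confidence.
low, high = proportion_confidence_interval(60, 500, 0.90)
print(round(low, 3), round(high, 3))
```

The code uses the critical value 1.645 rather than the table value 1.65 used in the hand calculation; for this example both give the same interval to three decimal places.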
Exercises
• In a study of 1907 fatal traffic accidents, 449 were alcohol related. Construct a 99% confidence interval for the proportion of fatal traffic accidents that are alcohol related.
• A survey of 200,000 boat owners found that 12% of the pleasure boats were named Serenity. Find the 95% confidence interval of the true proportion of boats named Serenity.
• A survey found that out of 200 workers, 168 said they were interrupted three or more times an hour by phone messages, faxes, etc. Find the 90% confidence interval of the population proportion of workers who are interrupted three or more times an hour.

Minimum Sample Size
If you have a preliminary estimate for p and q, the minimum sample size needed to estimate p, given a confidence level and a maximum error of estimate, is
$n = \hat{p}\hat{q} \left( \dfrac{z_c}{E} \right)^2$

Example
You wish to estimate the proportion of fatal accidents that are alcohol related at a 99% level of confidence. Find the minimum sample size needed to be accurate to within 2% of the population proportion. Use an estimate of p = 0.235.
Solution
$n = \hat{p}\hat{q} \left( \dfrac{z_c}{E} \right)^2 = (0.235)(0.765) \left( \dfrac{2.575}{0.02} \right)^2 = 2980.05$
With this preliminary estimate, you need a sample of at least n = 2981.

Exercise
• A researcher wishes to estimate, with 95% confidence, the proportion of people who own a home computer. A previous study shows that 40% of those interviewed had a computer at home. The researcher wishes to be accurate within 2% of the true proportion. Find the minimum sample size necessary.
• The Gallup Poll found that 27% of adults surveyed nationwide said they had personally been in a tornado. How many adults should be surveyed to estimate the true proportion of adults who have been in a tornado with a 95% confidence interval 5% wide?
• A researcher wishes to estimate the proportion of executives who own a car phone. She wants to be 90% confident and be accurate within 5% of the true proportion. Find the minimum sample size necessary.
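The minimum-sample-size formula for a proportion always rounds up, since rounding down would leave the margin of error slightly too large. The sketch below reproduces the worked example on alcohol-related accidents; it takes the critical value as an explicit argument so that the table value 2.575 used in the notes can be passed in. The function name is an assumption made for illustration.

```python
from math import ceil

def min_sample_size_proportion(p_hat, margin, z_c):
    """Smallest n satisfying n >= p_hat * q_hat * (z_c / margin)**2."""
    q_hat = 1 - p_hat
    return ceil(p_hat * q_hat * (z_c / margin) ** 2)

# Worked example: p_hat = 0.235, E = 0.02, z_c = 2.575 (99% confidence).
n_accidents = min_sample_size_proportion(0.235, 0.02, 2.575)

# With no preliminary estimate, p_hat = 0.5 gives the most conservative n.
n_no_estimate = min_sample_size_proportion(0.5, 0.02, 2.575)

print(n_accidents, n_no_estimate)
```

Using $\hat{p} = 0.5$ maximizes the product $\hat{p}\hat{q}$, which is why it is the recommended choice when no preliminary estimate is available. A more precise critical value (about 2.5758) gives a slightly larger product and may change the rounded-up result by one.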
Confidence Intervals for Variance and Standard Deviation
To calculate the confidence intervals for variance and standard deviation, a new statistical distribution is needed. It is called the chi-square distribution. The chi-square variable is similar to the t variable in that its distribution is a family of curves based on the number of degrees of freedom. The symbol for chi-square is $\chi^2$.

Confidence Intervals for Variance
$\dfrac{(n - 1)s^2}{\chi^2_R} < \sigma^2 < \dfrac{(n - 1)s^2}{\chi^2_L}$
Degrees of freedom = n - 1

Example
Find the values for $\chi^2_{Right}$ and $\chi^2_{Left}$ for a 90% confidence interval when n = 25.
Solution
When the sample size is 25, there are 24 degrees of freedom.
The area to the right of $\chi^2_{Right}$ is $\dfrac{1 - 0.9}{2} = 0.05$, so $\chi^2_{Right} = 36.415$.
The area to the right of $\chi^2_{Left}$ is $\dfrac{1 + 0.9}{2} = 0.95$, so $\chi^2_{Left} = 13.848$.

Figure: $\chi^2$ for the example above

Example
Find the 95% confidence interval for the variance of the nicotine content of cigarettes manufactured if a sample of 20 cigarettes has a standard deviation of 1.6 milligrams.
Solution
Since $\alpha = 0.05$, the two critical values, respectively, for the 0.025 and 0.975 levels for 19 degrees of freedom are 32.852 and 8.907.
$\dfrac{(n - 1)s^2}{\chi^2_R} < \sigma^2 < \dfrac{(n - 1)s^2}{\chi^2_L}$
$\dfrac{(20 - 1)(1.6)^2}{32.852} < \sigma^2 < \dfrac{(20 - 1)(1.6)^2}{8.907}$
$1.5 < \sigma^2 < 5.5$
Hence, one can be 95% confident that the true variance for the nicotine content is between 1.5 and 5.5.

Exercises
• Find the 90% confidence interval for the variance and standard deviation for the price in dollars of an adult single-day beach ticket. The data represent a selected sample of nationwide beach resorts. Assume the variable is normally distributed.
59 54 53 52 51 39 49 46 49 48
• Find the 99% confidence interval for the variance and standard deviation of the weights of 25 one-gallon containers of motor oil if a sample of 14 containers has a variance of 3.2. The weights are given in ounces. Assume the variable is normally distributed.
• A random sample of stock prices per share (in dollars) is shown. Find the 90% confidence interval for the variance and standard deviation for the prices.
Assume the variable is normally distributed.
26.69 13.88 28.37 75.37 7.50 47.50 3.81 53.81 13.62 6.94 28.25 28.00 40.25 10.87 46.12 12.00 43.00 45.12 60.50 14.75

Hypothesis Testing

A statistical hypothesis is a claim about a population.

Null Hypothesis
It is denoted by $H_0$. It contains a statement of equality, such as ≥, =, or ≤. The null hypothesis is assumed to be true unless there is strong evidence to the contrary, similar to how a person is assumed to be innocent until proven guilty.

Alternative Hypothesis
It is denoted by $H_a$. It contains a statement of strict inequality, such as <, ≠, or >.

Example
• A hospital claims its ambulance response time is less than 10 minutes.
Solution: $H_0: \mu \ge 10$ min, $H_a: \mu < 10$ min
• A consumer magazine claims the proportion of cell phone calls made during evenings and weekends is at most 60%.
Solution: $H_0: p \le 0.60$, $H_a: p > 0.60$

Errors and Level of Significance

Type I error
We reject the null hypothesis when it is true. The probability of a Type I error is $\alpha$.

Type II error
We fail to reject the null hypothesis when it is false. The probability of a Type II error is $\beta$.

Level of Significance ($\alpha$)
The maximum probability of committing a Type I error.

One-Tailed and Two-Tailed Tests

One-tailed test
Indicates that the null hypothesis should be rejected when the test value is in the critical region on one side.

Left-tailed test
The critical region is on the left side of the distribution of the test value. The alternative hypothesis is $H_a: \mu < \text{value}$.
Figure: Left-tailed test

Right-tailed test
The critical region is on the right side of the distribution of the test value. The alternative hypothesis is $H_a: \mu > \text{value}$.
Figure: Right-tailed test

Two-tailed test
The null hypothesis should be rejected when the test value is in either of two critical regions, one on each side of the distribution of the test value.
The alternative hypothesis for a two-tailed test is $H_a: \mu \neq \text{value}$.
Figure: Two-tailed test

P-Value
The probability of observing a test statistic at least as extreme as the one computed from the sample, given that the null hypothesis is true.

Finding P-values: one-tailed test
The test statistic for a right-tailed test is z = 1.56. Find the P-value.
The area to the right of z = 1.56 is 1 − 0.9406 = 0.0594.
The P-value is 0.0594.

Finding P-values: two-tailed test
The test statistic for a two-tailed test is z = −2.63. Find the corresponding P-value.
The area to the left of z = −2.63 is 0.0043.
The P-value is 2(0.0043) = 0.0086.

Test Decisions with P-values
The decision about whether there is enough evidence to reject the null hypothesis can be made by comparing the P-value to $\alpha$, the level of significance of the test.
• If P ≤ $\alpha$, reject the null hypothesis.
• If P > $\alpha$, fail to reject the null hypothesis.

Example
• If the P-value of a hypothesis test is 0.0749, then at a 0.05 level of significance we fail to reject $H_0$, since P > $\alpha$.
• If the P-value of a hypothesis test is 0.0245, then at a 0.05 level of significance we reject $H_0$, since P ≤ $\alpha$.

The steps of a hypothesis test using a P-value are:
• Write the null and alternative hypotheses
• State the level of significance
• Identify the sampling distribution
• Find the test statistic and standardize it
• Calculate the P-value for the test statistic
• Make your decision
• Interpret your decision

Hypothesis Testing for the Mean (n ≥ 30)

The z-Test for a Mean
The z-test is a statistical test for a population mean. It can be used if the population is normal and $\sigma$ is known, or when the sample size n is at least 30. The test statistic is the sample mean $\bar{x}$, and the standardized test statistic is

$$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}, \qquad \text{where } \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

Example (1)
A cereal company claims the mean sodium content in one serving of its cereal is no more than 230 mg. You work for a national health service and are asked to test this claim.
You find that a random sample of 52 servings has a mean sodium content of 232 mg and a standard deviation of 10 mg. At $\alpha = 0.05$, do you have enough evidence to reject the company's claim?

Solution
• Write the null and alternative hypotheses. $H_0: \mu \le 230$ mg, $H_a: \mu > 230$ mg
• State the level of significance. $\alpha = 0.05$
• Determine the sampling distribution. Since the sample size is at least 30, the sampling distribution is normal.
• Find the test statistic and standardize it. Because n ≥ 30, the sample standard deviation s = 10 is used in place of $\sigma$:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{52}} = 1.387, \qquad z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{232 - 230}{1.387} = 1.44$$

• Calculate the P-value for the test statistic. Since this is a right-tailed test, the P-value is the area to the right of z = 1.44 in the normal distribution. From the table, P = 1 − 0.9251 = 0.0749.
• Make your decision. Compare the P-value to $\alpha$: since 0.0749 > 0.05, fail to reject $H_0$.
• Interpret your decision. There is not enough evidence to reject the claim that the mean sodium content of one serving of the cereal is no more than 230 mg.

Rejection Regions
The set of values of the test statistic that lead to rejection of $H_0$.

Critical Values
The values of the test statistic that separate the rejection and non-rejection regions.

Using the critical value to make test decisions
• Write the null and alternative hypotheses
• State the level of significance
• Identify the sampling distribution
• Find the critical value
• Find the rejection region
• Find the test statistic and standardize it
• Make your decision
• Interpret your decision

Example
From Example (1), the critical value for a right-tailed test at $\alpha = 0.05$ is 1.645, so the rejection region is z > 1.645. The standardized test statistic is

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{52}} = 1.387, \qquad z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{232 - 230}{1.387} = 1.44$$

Since z = 1.44 does not fall in the rejection region, we fail to reject $H_0$.

Hypothesis Testing for the Mean (n < 30)

The t Sampling Distribution
Find the critical value $t_0$ for a left-tailed test given $\alpha = 0.01$ and n = 18.
d.f. = 18 − 1 = 17, so $t_0 = -2.567$.

Find the critical values $-t_0$ and $t_0$ for a two-tailed test given $\alpha = 0.05$ and n = 11.
d.f. = 11 − 1 = 10, so $-t_0 = -2.228$ and $t_0 = 2.228$.

Example
A university says the mean number of classroom hours per week for full-time faculty is 11.0. A random sample of the number of classroom hours for full-time faculty for one week is listed below. You work for a student organization and are asked to test the claim. At $\alpha = 0.01$, do you have enough evidence to reject the university's claim?
11.8 8.6 12.6 7.9 6.4 10.4 13.6 9.1

Solution
• Write the null and alternative hypotheses. $H_0: \mu = 11.0$, $H_a: \mu \neq 11.0$
• State the level of significance. $\alpha = 0.01$
• Determine the sampling distribution. Since the sample size is 8, the sampling distribution is a t-distribution with 8 − 1 = 7 degrees of freedom. Since $H_a$ contains the ≠ symbol, this is a two-tailed test.
• Find the critical values. For a two-tailed test, an area of $\alpha/2 = 0.005$ lies in each tail: $-t_0 = -3.499$ and $t_0 = 3.499$.
• Find the rejection region. Reject $H_0$ if t < −3.499 or t > 3.499.
• Find the test statistic and standardize it. n = 8, $\bar{x} = 10.05$, s = 2.485, $\mu = 11.0$

$$s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{2.485}{\sqrt{8}} = 0.87858, \qquad t = \frac{\bar{x} - \mu}{s_{\bar{x}}} = \frac{10.05 - 11.0}{0.87858} = -1.08$$

• Make your decision. t = −1.08 does not fall in the rejection region, so fail to reject $H_0$ at $\alpha = 0.01$.
• Interpret your decision. There is not enough evidence to reject the university's claim that full-time faculty spend a mean of 11 classroom hours per week.

Hypothesis Testing for Proportions

p is the population proportion of successes. The test statistic is $\hat{p} = x/n$, the proportion of sample successes. If $np \ge 5$ and $nq \ge 5$ (where q = 1 − p), the sampling distribution of $\hat{p}$ is approximately normal.

Test Statistic
The standardized test statistic is

$$z = \frac{\hat{p} - p}{\sqrt{pq/n}}$$

Example
A communications industry spokesperson claims that over 40% of Africans either own a cellular phone or have a family member who does. In a random survey of 1036 Africans, 456 said they or a family member owned a cellular phone. Test the spokesperson's claim at $\alpha = 0.05$.

Solution
• Write the null and alternative hypotheses. $H_0: p \le 0.40$, $H_a: p > 0.40$
• State the level of significance. $\alpha = 0.05$
• Determine the sampling distribution.
1036(0.40) > 5 and 1036(0.60) > 5, so the sampling distribution is normal.
• Find the critical value. For a right-tailed test at $\alpha = 0.05$, the critical value is 1.645.
• Find the test statistic and standardize it. n = 1036, x = 456

$$\hat{p} = \frac{x}{n} = \frac{456}{1036} = 0.44, \qquad z = \frac{0.44 - 0.40}{\sqrt{(0.40)(0.60)/1036}} = \frac{0.04}{0.01522} = 2.63$$

• Make your decision. z = 2.63 falls in the rejection region (z > 1.645), so reject $H_0$.
• Interpret your decision. There is enough evidence to support the claim that over 40% either own a cellular phone or have a family member who does.

Hypothesis Testing for Variance and Standard Deviation

$s^2$ is the test statistic for the population variance. Its sampling distribution is a $\chi^2$ distribution with n − 1 degrees of freedom.

Test Statistic
The standardized test statistic is

$$\chi^2 = \frac{(n-1)s^2}{\sigma^2}$$

Example
A state administrator says that the standard deviation of test scores for 8th grade students who took a life-science assessment test is less than 30. You work for the administrator and are asked to test this claim. You find that a random sample of 10 tests has a standard deviation of 28.8. At $\alpha = 0.01$, do you have enough evidence to support the administrator's claim? Assume test scores are normally distributed.

Solution
• Write the null and alternative hypotheses. $H_0: \sigma \ge 30$, $H_a: \sigma < 30$
• State the level of significance. $\alpha = 0.01$
• Determine the sampling distribution. The sampling distribution is $\chi^2$ with 10 − 1 = 9 degrees of freedom.
• Find the critical value. For a left-tailed test, the critical value is 2.088, so the rejection region is $\chi^2 < 2.088$.
• Find the test statistic. n = 10, s = 28.8

$$\chi^2 = \frac{(n-1)s^2}{\sigma^2} = \frac{(10-1)(28.8)^2}{30^2} = 8.2944$$

• Make your decision. $\chi^2 = 8.2944$ does not fall in the rejection region, so fail to reject $H_0$.
• Interpret your decision. There is not enough evidence to support the administrator's claim that the standard deviation is less than 30.

Correlation and Regression

Correlation
A relationship between two variables.

Types of correlation
• Negative correlation
• Positive correlation
• No linear correlation

Correlation Coefficient
A measure of the strength and direction of a linear relationship between two variables.
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$$

The range of r is from −1 to 1.
• If r is close to −1, there is a strong negative correlation.
• If r is close to 0, there is no linear correlation.
• If r is close to 1, there is a strong positive correlation.

Exercise
Fit a regression line and find the correlation between absences and final grade in the data below.

Absences (x)      8   2   5  12  15   9   6
Final Grade (y)  78  92  90  58  43  74  81

Hypothesis Test for Significance
r is the correlation coefficient for the sample. The correlation coefficient for the population is $\rho$ (rho). For a two-tailed test for significance:
$H_0: \rho = 0$ (the correlation is not significant)
$H_a: \rho \neq 0$ (the correlation is significant)
The sampling distribution for r is a t-distribution with n − 2 degrees of freedom. The standardized test statistic is

$$t = \frac{r - 0}{\sigma_r} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$

Example
The correlation between the number of times absent and the final grade is r = −0.975. There were seven pairs of data. Test the significance of this correlation. Use $\alpha = 0.01$.

Exercises
The height and weight of 10 boys were measured and the results are given in the table below. Find the correlation coefficient between the height (cm) and the weight (kg). Is the correlation significant at 5%?

Wt   38   39   43   44   35   32   31   42   49   41
Ht  150  152  146  158  142  144  135  145  155  150

In each of the following, find the correlation coefficient between y and x and test its significance at 5%, from the information given:

        n    ∑x    ∑y    ∑x²     ∑y²    ∑xy
i)     12   129    63   1500     800    700
ii)     9    56    76    500     830    620
iii)   10    69   436    585   31218   1858

Linear Regression
Once you know there is a significant linear correlation, you can write an equation describing the relationship between the x and y variables. This equation is called the line of regression or least squares line. The equation of a line may be written as y = mx + b, where m is the slope of the line and b is the y-intercept.
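As a check on the correlation formulas above, the following minimal Python sketch computes r and the standardized test statistic t for the absences and final-grade data in this section. The function names pearson_r and t_statistic are illustrative choices, not part of the course material.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient from the sums formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sxy - sx * sy
    denominator = sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2)
    return numerator / denominator

def t_statistic(r, n):
    """Standardized test statistic for H0: rho = 0, with n - 2 d.f."""
    return r / sqrt((1 - r * r) / (n - 2))

# Absences (x) and final grades (y) from the exercise in this section
absences = [8, 2, 5, 12, 15, 9, 6]
grades = [78, 92, 90, 58, 43, 74, 81]

r = pearson_r(absences, grades)
t = t_statistic(r, len(absences))
print(round(r, 3), round(t, 2))
```

For these data the sketch gives r close to the value −0.975 quoted in the example. The magnitude of t is well above the two-tailed table value for 5 degrees of freedom at α = 0.01 (about 4.032 in standard t tables), which points to a significant correlation.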
The line of regression is $\hat{y} = mx + b$. The slope m is

$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$$

and the intercept is

$$b = \bar{y} - m\bar{x}$$

Exercises
1. Find the line of regression of y on x from the following values:
• n = 5, ∑x = 10, ∑x² = 30, ∑y = 13.1, ∑y² = 54.41, ∑xy = 40.3
• n = 6, ∑x = 15, ∑x² = 55, ∑y = 6.4, ∑y² = 8.06, ∑xy = 20.4
• n = 5, ∑x = 20, ∑x² = 90, ∑y = 18.7, ∑y² = 75.77, ∑xy = 82.3
• n = 5, ∑x = 15, ∑x² = 55, ∑y = 77, ∑y² = 1503, ∑xy = 177
2. For each of the situations in question 1, estimate the value of y when x = 10.

Measures of Regression and Correlation

The Coefficient of Determination
The coefficient of determination, $r^2$, is the ratio of the explained variation in y to the total variation in y:

$$r^2 = \frac{\text{explained variation}}{\text{total variation}}$$

The correlation coefficient between the number of times absent and the final grade is r = −0.975. The coefficient of determination is $r^2 = (-0.975)^2 = 0.9506$.

Interpretation
About 95% of the variation in final grades can be explained by the number of times a student is absent. The other 5% is unexplained and can be due to sampling error or to other variables such as intelligence, amount of time studied, etc.

The Standard Error of Estimate
The standard error of estimate, $s_e$, is the standard deviation of the observed $y_i$ values about the predicted values $\hat{y}_i$:

$$s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$$

Chi-Square Test of Goodness of Fit

A chi-square distribution is skewed to the right and is not symmetric. The value of $\chi^2$ is always greater than or equal to 0. A chi-square goodness-of-fit test is used to test whether a frequency distribution fits an expected distribution.

Chi-Square Test
If the observed frequencies are obtained from a random sample and each expected frequency is at least 5, the sampling distribution for the goodness-of-fit test is a chi-square distribution with k − 1 degrees of freedom (where k is the number of categories).
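The two conditions above (each expected frequency at least 5, and k − 1 degrees of freedom) can be checked before running the test. The sketch below is a minimal illustration in Python. It assumes the usual rule that the expected frequency of a category is the sample size times its claimed proportion, and it borrows the claimed proportions and sample size from the marriage example in this section. The helper name expected_frequencies is hypothetical.

```python
def expected_frequencies(proportions, n):
    """Expected count per category: E = n * claimed proportion (assumed rule)."""
    return [p * n for p in proportions]

# Claimed distribution and sample size taken from the marriage example
claimed = [0.50, 0.12, 0.14, 0.24]
sample_size = 103

expected = expected_frequencies(claimed, sample_size)

# Condition for using the chi-square distribution: every E is at least 5
condition_ok = all(e >= 5 for e in expected)

# Degrees of freedom: number of categories minus one
degrees_of_freedom = len(claimed) - 1

print([round(e, 2) for e in expected], condition_ok, degrees_of_freedom)
```

If any expected frequency fell below 5, the chi-square approximation described above would not be justified for that sample.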
The test statistic is

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

O = observed frequency in each category
E = expected frequency in each category

Example
A social service organization claims 50% of all marriages are the first marriage for both bride and groom, 12% are the first for the bride only, 14% for the groom only, and 24% are a remarriage for both. The results of a study of 103 randomly selected married couples are listed in the table. Test the distribution claimed by the agency. Use $\alpha = 0.01$.

First Marriage       f
Bride and Groom     55
Bride only          12
Groom only          12
Neither             24

Solution
• Write the null and alternative hypotheses.
$H_0$: The distribution of first-time marriages is 50% for both bride and groom, 12% for bride only, 14% for groom only, and 24% remarriages for both.
$H_a$: The distribution of first-time marriages differs from the claimed distribution.
• State the level of significance. $\alpha = 0.01$
• Determine the sampling distribution. A chi-square distribution with 4 − 1 = 3 degrees of freedom.
• Find the critical value. The critical value is 11.34, so the rejection region is $\chi^2 > 11.34$.
• Find the test statistic.

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

First Marriage      %     O      E       (O − E)²   (O − E)²/E
Bride and Groom    50    55    51.50     12.2500     0.2379
Bride only         12    12    12.36      0.1296     0.0105
Groom only         14    12    14.42      5.8564     0.4061
Neither            24    24    24.72      0.5184     0.0210
Total             100   103   103.00                 0.6755

$\chi^2 = 0.6755$

• Make your decision. The test statistic 0.6755 does not fall in the rejection region, so we fail to reject $H_0$.
• Interpret your decision. There is not enough evidence to reject the claimed distribution of first-time marriages; the sample is consistent with it.

Exercise
A die is rolled 120 times. The results are given in the table below. Check whether or not it is biased.

Number       1   2   3   4   5   6
Frequency   14  13  21  16  27  29

Comparing Two Variances

Two-Sample Test for Variances
To compare population variances $\sigma_1^2$ and $\sigma_2^2$, use the F-distribution. Let $s_1^2$ and $s_2^2$ represent the sample variances from two different populations. If both populations are normal and the population variances $\sigma_1^2$ and $\sigma_2^2$ are equal, then the sampling distribution of the ratio of the sample variances is called an F-distribution.
$s_1^2$ always represents the larger of the two sample variances, so that F ≥ 1:

$$F = \frac{s_1^2}{s_2^2}$$

Analysis of Variance

One-Way Analysis of Variance (ANOVA)
This is a hypothesis testing technique that is used to compare means from three or more populations.
$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$ (all population means are equal)
$H_a$: at least one of the means is different from the others.
The variance is calculated in two different ways and the ratio of the two values is formed:

$$F = \frac{MSB}{MSW}$$

MSB, Mean Square Between, the variance between samples, measures the differences related to the treatment given to each sample.
MSW, Mean Square Within, the variance within samples, measures the differences related to entries within the same sample. The variance within samples is due to sampling error.

Mean Square Between
Each group is given a different "treatment". The variation from the grand mean (the mean of all values in all groups) is measured. The treatment (or factor) is the variable that distinguishes members of one sample from another. First calculate SSB, then divide by k − 1, the degrees of freedom (k is the number of treatments or factors).

$$SSB = \sum n_i(\bar{x}_i - \bar{x})^2, \qquad MSB = \frac{SSB}{k - 1} = \frac{\sum n_i(\bar{x}_i - \bar{x})^2}{k - 1}$$

Mean Square Within
Calculate SSW and divide by N − k, the degrees of freedom (N is the total number of observations).

$$SSW = \sum (n_i - 1)s_i^2, \qquad MSW = \frac{SSW}{N - k} = \frac{\sum (n_i - 1)s_i^2}{N - k}$$

If MSB is close in value to MSW, the variation is not attributed to different effects of the treatments on the variable, and the F-ratio is close to 1.
If MSB is significantly greater than MSW, the variation is probably due to differences in the treatments or factors, and the F-ratio will be significantly greater than 1.

Example
The table below shows the annual amount spent on reading (in $) for a random sample of consumers from four regions. At $\alpha = 0.10$, can you conclude that the mean annual amounts spent are different?
Northeast   Midwest   South   West
   308        246      103     223
    58        169      143     184
   141        246      164     221
   109        158      119     269
   220        167       99     199
   144         76      214     171
   316                 108     204

Solution
• Write the null and alternative hypotheses. $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$; $H_a$: at least one of the means is different from the others.
• State the level of significance. $\alpha = 0.10$
• Determine the sampling distribution. An F-distribution with d.f. (numerator) = 4 − 1 = 3 and d.f. (denominator) = 27 − 4 = 23.
• Find the critical value. The critical value is 2.34, so the rejection region is F > 2.34.
• Find the test statistic F = MSB/MSW.

Summary statistics for each region:

        Northeast   Midwest    South     West
n           7          6         7         7
x̄        185.14     177.00    135.71    210.14
s²      9838.66    4050.05   1741.39   1020.80

Grand mean: $\bar{x} = 4779/27 = 177$

Mean Square Between: $SSB = \sum n_i(\bar{x}_i - \bar{x})^2$

Mean      n    (x̄ᵢ − x̄)²    nᵢ(x̄ᵢ − x̄)²
185.14    7       66.26          463.8
177.00    6        0.00            0.0
135.71    7     1704.86        11934.0
210.14    7     1098.26         7687.6

$$MSB = \frac{SSB}{k - 1} = \frac{20086}{3} = 6695.33$$

Mean Square Within: $SSW = \sum (n_i - 1)s_i^2$

n      s²        (nᵢ − 1)s²
7    9838.66      59031.9
6    4050.05      20250.2
7    1741.39      10448.4
7    1020.80       6124.8

$$MSW = \frac{SSW}{N - k} = \frac{95855}{23} = 4167.61, \qquad F = \frac{MSB}{MSW} = \frac{6695.33}{4167.61} \approx 1.61$$

• Make your decision. Since F ≈ 1.61 does not fall in the rejection region, fail to reject the null hypothesis.
• Interpret your decision. There is not enough evidence to support the claim that the means are not equal. The data do not show that mean reading expenses differ among the regions.
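The ANOVA computation above can be reproduced directly from the raw data. The following Python sketch is a minimal illustration, not a definitive implementation: the function name one_way_anova_f is hypothetical, and SSW is computed as the sum of squared deviations within each group, which equals $\sum (n_i - 1)s_i^2$.

```python
def one_way_anova_f(groups):
    """Return (F, d.f. numerator, d.f. denominator) for one-way ANOVA."""
    k = len(groups)                                # number of treatments
    n_total = sum(len(g) for g in groups)          # N, total observations
    grand_mean = sum(sum(g) for g in groups) / n_total

    ssb = 0.0  # sum of squares between groups
    ssw = 0.0  # sum of squares within groups
    for g in groups:
        mean = sum(g) / len(g)
        ssb += len(g) * (mean - grand_mean) ** 2
        ssw += sum((x - mean) ** 2 for x in g)     # equals (n_i - 1) * s_i^2

    msb = ssb / (k - 1)
    msw = ssw / (n_total - k)
    return msb / msw, k - 1, n_total - k

# Annual amounts spent on reading, by region (from the example above)
northeast = [308, 58, 141, 109, 220, 144, 316]
midwest = [246, 169, 246, 158, 167, 76]
south = [103, 143, 164, 119, 99, 214, 108]
west = [223, 184, 221, 269, 199, 171, 204]

f_ratio, df_num, df_den = one_way_anova_f([northeast, midwest, south, west])
print(round(f_ratio, 2), df_num, df_den)
```

Because the sketch works from unrounded means and variances, its F-ratio can differ slightly in later decimal places from a hand calculation based on rounded table values. The comparison with the critical value 2.34 is unaffected.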
