Engineering Data Analysis Module - Batangas State University

MATH 403 Engineering Data Analysis

CABACES, DONNALYN C.
MARCAIDA, MARJORIE G.
SOTTO, RODOLFO JR. C.

Chapter 1 OBTAINING DATA

Introduction

Statistics may be defined as the science that deals with the collection, organization, presentation, analysis, and interpretation of data in order to draw judgments or conclusions that help in the decision-making process. The two parts of this definition correspond to the two main divisions of Statistics: Descriptive Statistics and Inferential Statistics.

Descriptive Statistics, which is referred to in the first part of the definition, deals with the procedures that organize, summarize, and describe quantitative data. It seeks merely to describe data.

Inferential Statistics, implied in the second part of the definition, deals with making a judgment or a conclusion about a population based on the findings from a sample that is taken from the population.

Intended Learning Outcomes

At the end of this module, it is expected that the students will be able to:
1. Demonstrate an understanding of the different methods of obtaining data.
2. Explain the procedures in planning and conducting surveys and experiments.

Statistical Terms

Before proceeding to the discussion of the different methods of obtaining data, let us first define some statistical terms:

Population or Universe refers to the totality of objects, persons, places, or things used in a particular study.
It comprises all members of a particular group of objects (items) or people (individuals) that are the subjects or respondents of a study.

Sample is any subset of a population, that is, a few members of the population.

Data are facts, figures, and information collected on some characteristics of a population or sample. These can be classified as qualitative or quantitative data.

Ungrouped (or raw) data are data which are not organized in any specific way. They are simply the collection of data as they are gathered.

Grouped data are raw data organized into groups or categories with corresponding frequencies. Organized in this manner, the data are referred to as a frequency distribution.

Parameter is a descriptive measure of a characteristic of a population.

Statistic is a measure of a characteristic of a sample.

Constant is a characteristic or property of a population or sample which is common to all members of the group.

Variable is a measure, characteristic, or property of a population or sample that may take a number of different values. It differentiates a particular member from the rest of the group. It is the characteristic or property that is measured, controlled, or manipulated in research. Variables differ in many respects, most notably in the role they are given in the research and in the type of measures that can be applied to them.

1.1 Methods of Data Collection

Collection of data is the first step in conducting a statistical inquiry. It refers to data gathering: a systematic method of collecting and measuring data from different sources of information in order to provide answers to relevant questions. This involves acquiring information from published literature, surveys through questionnaires or interviews, experimentation, documents and records, tests or examinations, and other forms of data-gathering instruments.
The person who conducts the inquiry is an investigator, the one who helps in collecting information is an enumerator, and the one from whom information is collected is a respondent.

Data can be primary or secondary. According to Wessel, “Data collected in the process of investigation are known as primary data.” These are collected for the investigator’s use from the primary source. Secondary data, on the other hand, are collected by some other organization for its own use, but the investigator also obtains them for his use. According to M.M. Blair, “Secondary data are those already in existence for some other purpose than answering the question in hand.”

In the field of engineering, the three basic methods of collecting data are the retrospective study, the observational study, and the designed experiment.

A retrospective study uses the population or a sample of historical data that has been archived over some period of time. It may involve a significant amount of data, but those data may contain relatively little useful information about the problem: some of the relevant data may be missing, recording or transcription errors may be present, or other important data may not have been gathered and archived. As a result, statistical analysis of historical data may identify interesting phenomena, but solid and reliable explanations are often difficult to obtain.

In an observational study, the process or population is observed and disturbed as little as possible, and the quantities of interest are recorded.

In a designed experiment, deliberate or purposeful changes are made in the controllable variables of the system or process. The resulting system output data are observed, and an inference or decision is made about which variables are responsible for the observed changes in output performance.
Experiments designed with basic principles such as randomization are needed to establish cause-and-effect relationships. Much of what we know in the engineering and physical-chemical sciences is developed through testing or experimentation. In engineering, there are problem areas for which no scientific or engineering theory is directly or completely applicable, so experimentation and observation of the resulting data is the only way to solve them. Even when there is a good underlying scientific theory to explain the phenomena of interest, tests or experiments are almost always necessary to confirm the applicability and validity of the theory in a specific situation or environment. Designed experiments are very important in engineering design and development and in the improvement of manufacturing processes, in which statistical thinking and statistical methods play an important role in planning, conducting, and analyzing the data. (Montgomery, et al., 2018)

1.2 Planning and Conducting Surveys

A survey is a method of asking respondents some well-constructed questions. It is an efficient way of collecting information and is easy to administer, and a wide variety of information can be collected. The researcher can stay focused on the questions that interest him and that are necessary in his statistical inquiry or study.

However, surveys depend on the respondents’ honesty, motivation, memory, and ability to respond. Sometimes answers may lead to vague data. Surveys can be conducted through face-to-face interviews or self-administered through the use of questionnaires.
The advantages of face-to-face interviews include fewer misunderstood questions, fewer incomplete responses, higher response rates, and greater control over the environment in which the survey is administered; also, the researcher can collect additional information if any of the respondents’ answers need clarifying. The disadvantages of face-to-face interviews are that they can be expensive and time-consuming and may require a large staff of trained interviewers. In addition, the response can be biased by the appearance or attitude of the interviewer.

Self-administered surveys are less expensive than interviews. They can be administered in large numbers, do not require many interviewers, and put less pressure on respondents. However, in self-administered surveys, the respondents are more likely to stop participating midway through the survey, and they cannot ask for clarification. Response rates are also lower than in personal interviews.

When designing a survey, the following steps are useful:
1. Determine the objectives of your survey: What questions do you want to answer?
2. Identify the target population and sample: Whom will you interview? Who will be the respondents? What sampling method will you use?
3. Choose an interviewing method: face-to-face interview, phone interview, self-administered paper survey, or internet survey.
4. Decide what questions you will ask, in what order, and how to phrase them.
5. Conduct the interview and collect the information.
6. Analyze the results by making graphs and drawing conclusions.

In choosing the respondents, sampling techniques are necessary. Sampling is the process of selecting units (e.g., people, organizations) from a population of interest. A sample must be representative of the target population.
The target population is the entire group a researcher is interested in, that is, the group about which the researcher wishes to draw conclusions.

There are two ways of selecting a sample: non-probability sampling and probability sampling.

Non-Probability Sampling

Non-probability sampling is also called judgment or subjective sampling. This method is convenient and economical, but the inferences made based on the findings are not so reliable. The most common types of non-probability sampling are convenience sampling, purposive sampling, and quota sampling.

In convenience sampling, the researcher obtains information from respondents by whatever means is most convenient for the researcher, which can bias the set of respondents.

In purposive sampling, the selection of respondents is predetermined according to the characteristic of interest set by the researcher. Randomization is absent in this type of sampling.

There are two types of quota sampling: proportional and non-proportional. In proportional quota sampling, the major characteristics of the population are represented by sampling a proportional amount of each. For instance, if you know the population has 40% women and 60% men, and you want a total sample size of 100, you will continue sampling until you get those percentages (40 women and 60 men) and then you will stop.

Non-proportional quota sampling is a bit less restrictive. In this method, a minimum number of sampled units in each category is specified, without concern for having numbers that match the proportions in the population.

Probability Sampling

In probability sampling, every member of the population has a known, nonzero chance of being selected as part of the sample; in the simplest case, every member has an equal chance. There are several probability sampling techniques. Among these are simple random sampling, stratified sampling, and cluster sampling.
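As a preview of the three probability sampling techniques just named, the following is a minimal Python sketch, not a prescribed procedure. The population of 100 numbered units, the two strata, the five clusters, and all function names are assumptions made for illustration only; the stratified sample uses proportional allocation, which is one common choice.

```python
import random


def simple_random_sample(population, n, rng):
    """Draw n distinct units; every unit is equally likely to be chosen."""
    return rng.sample(population, n)


def stratified_sample(strata, n, rng):
    """Draw from every stratum in proportion to its size.

    Rounding may make the total differ slightly from n when the
    proportions do not divide evenly.
    """
    total = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        k = round(n * len(members) / total)
        sample.extend(rng.sample(members, k))
    return sample


def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 units identified by the numbers 0..99.
population = list(range(100))
strata = {"women": population[:40], "men": population[40:]}
clusters = {f"section_{i}": population[i * 20:(i + 1) * 20] for i in range(5)}

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(strata, 10, rng)   # 4 women and 6 men
clus = cluster_sample(clusters, 2, rng)      # 2 whole sections, 40 units

print(srs)
print(strat)
print(len(clus))
```

The stratified draw always contains units from both strata, whereas a simple random sample of the same size could, by chance, over-represent one of them.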
Simple Random Sampling

Simple random sampling is the basic sampling technique in which a group of subjects (a sample) is selected for study from a larger group (a population). Each individual is chosen entirely by chance, and each member of the population has an equal chance of being included in the sample. Every possible sample of a given size has the same chance of selection; that is, each member of the population is equally likely to be chosen at any stage in the sampling process.

Stratified Sampling

There may often be factors which divide the population into sub-populations (groups or strata), and the measurement of interest may vary among the different sub-populations. This has to be accounted for when a sample is selected in order to obtain a sample that is representative of the population. This is achieved by stratified sampling.

A stratified sample is obtained by taking samples from each stratum or sub-group of a population. When a sample is to be taken from a population with several strata, the proportion of each stratum in the sample should be the same as in the population.

Stratified sampling techniques are generally used when the population is heterogeneous, or dissimilar, but certain homogeneous, or similar, sub-populations (strata) can be isolated. Simple random sampling is most appropriate when the entire population from which the sample is taken is homogeneous. Some reasons for using stratified sampling over simple random sampling are:
1. the cost per observation in the survey may be reduced;
2. estimates of the population parameters may be wanted for each sub-population;
3. increased accuracy at a given cost.

Cluster Sampling

Cluster sampling is a sampling technique in which the entire population is divided into groups, or clusters, and a random sample of these clusters is selected.
All observations in the selected clusters are included in the sample.

1.3 Planning and Conducting Experiments: Introduction to Design of Experiments

The products and processes in the engineering and scientific disciplines are mostly derived from experimentation. An experiment is a series of tests conducted in a systematic manner to increase the understanding of an existing process or to explore a new product or process. Design of Experiments, or DOE, is a tool to develop an experimentation strategy that maximizes learning using minimum resources. DOE is widely used by engineers and scientists in improving existing processes, by maximizing the yield and decreasing the variability, and in developing new products and processes. It is a technique for identifying the "vital few" factors in the most efficient manner and then directing the process to its best setting to meet the ever-increasing demand for improved quality and increased productivity.

The methodology of DOE ensures that all factors and their interactions are systematically investigated, resulting in reliable and complete information. There are five stages to be carried out in the design of experiments: planning, screening, optimization, robustness testing, and verification.

1. Planning

It is important to plan the course of experimentation carefully before embarking upon the process of testing and data collection. At this stage, the objectives of the experiment or investigation are identified, and the time and resources available to achieve those objectives are assessed. Individuals from different disciplines related to the product or process should compose the team that will conduct the investigation. They are to identify possible factors to investigate and the most appropriate responses to measure.
A team approach promotes synergy that gives a richer set of factors to study and thus a more complete experiment. Experiments which are carefully planned generally lead to increased understanding of the product or process. Well-planned experiments are easy to execute and analyze using the available statistical software.

2. Screening

Screening experiments are used to identify the important factors that affect the process under investigation out of the large pool of potential factors. The screening process eliminates unimportant factors so that attention is focused on the key factors. Screening experiments are usually efficient designs which require few runs and focus on the vital factors rather than on interactions.

3. Optimization

After narrowing down the important factors affecting the process, the next step is to determine the best setting of these factors to achieve the objectives of the investigation. The objectives may be to increase yield, to decrease variability, or to find settings that achieve both at the same time, depending on the product or process under investigation.

4. Robustness Testing

Once the optimal settings of the factors have been determined, it is important to make the product or process insensitive to variations resulting from changes in factors that affect the process but are beyond the control of the analyst. Such factors are referred to as noise or uncontrollable factors, and they are likely to be experienced in the application environment. It is important to identify such sources of variation and take measures to ensure that the product or process is made robust, or insensitive, to these factors.

5. Verification

This final stage involves validation of the optimum settings by conducting a few follow-up experimental runs. This is to confirm that the process functions as expected and that all objectives are achieved.
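To make the idea of an experimental plan concrete, the following is a minimal Python sketch of a run sheet for a small experiment. It enumerates a two-level full factorial design and randomizes the run order. This is an assumption chosen for simplicity: a full factorial is the easiest design to list, not the few-run screening design described above. The factor names and levels are hypothetical.

```python
import itertools
import random


def full_factorial(factors):
    """List every combination of factor levels, one dict per run."""
    names = list(factors)
    level_sets = [factors[name] for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*level_sets)]


def randomized(runs, seed=None):
    """Return the runs in random order to guard against time-related nuisance effects."""
    rng = random.Random(seed)
    shuffled = list(runs)
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical factors, each at a low and a high level.
factors = {
    "temperature_C": (150, 180),
    "pressure_bar": (1.0, 2.0),
    "time_min": (10, 20),
}

design = full_factorial(factors)        # 2 * 2 * 2 = 8 runs
run_sheet = randomized(design, seed=1)  # same 8 runs, shuffled order

for number, run in enumerate(run_sheet, start=1):
    print(number, run)
```

With many candidate factors the number of runs in a full factorial grows quickly, which is one reason screening designs that use fewer runs are preferred at the screening stage.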
REFERENCES:

Montgomery, Douglas C., et al., Applied Statistics and Probability for Engineers, 7th ed., John Wiley & Sons (Asia) Pte Ltd, 2018
Panopio, Felix M. (2004). Statistics with Probability. Batangas City, Philippines: Feliber Publishing House
Rawley, Eve. Planning and Conducting Surveys. https://www.ck12.org/statistics/planning-andconducting-surveys/lesson/Planning-and-Conducting-Surveys-ALG-I/ Date accessed: July 27, 2020
Walpole, Ronald E., et al., Probability and Statistics for Engineers and Scientists, 9th ed., Pearson Education Inc., 2016
Introduction to Design of Experiments. https://www.weibull.com/hotwire/issue84/hottopics84.htm. Date accessed: April 15, 2020
https://mathspace.co/learn/world-of-maths/language-and-use-of-statistics/planning-a-statisticalinvestigation-i-investigation-18643/investigation-statistical-inquiry-916/

CHAPTER TEST

Answer the following.

1. Explain the different methods by which you can obtain data.

2. As one of the students of the EDA class, you are tasked to conduct a survey to show which extracurricular activities the students from the College of Engineering, Architecture and Fine Arts would like to engage in during the first semester. Follow the presented steps in conducting a survey.

3. You are asked to conduct an experiment on the catapult shown in the figure. It is a table-top wooden device used in teaching design of experiments and statistical process control. The objective of the experiment is to determine the significant factors that affect the distance travelled by the ball as it is thrown by the catapult. Also, you are to establish the settings to reach 25, 50, 75 and 100 inches.
The response variable is the distance, and the factors are the band height, start angle, number of rubber bands used (1 or 2), arm length, and stop angle. Explain how you are going to conduct the experiment, taking note of the stages of planning and conducting a designed experiment.

Chapter 2 PROBABILITY

Introduction

Probability is simply how likely an event is to happen. “The chance of rain today is 50%” is a statement that quantifies our expectation of the possibility of rain. The likelihood of an outcome is measured by assigning a number from the interval [0, 1] or a percentage from 0 to 100%. A higher number means the event is more likely to happen. A probability of zero (0) indicates that the outcome is impossible, while a probability of one (1) indicates that the outcome is certain to occur.

This module discusses the concept of probability for discrete sample spaces, its application, and ways of solving for the probabilities of different statistical data.

Intended Learning Outcomes

At the end of this module, it is expected that the students will be able to:
1. Understand and describe sample spaces and events for random experiments
2. Explain the concept of probability and its application to different situations
3. Define and illustrate the different probability rules
4. Solve for the probability of different statistical data.

Probability

Probability is the likelihood or chance of an event occurring.

$$\text{Probability} = \frac{\text{the number of ways of achieving success}}{\text{the total number of possible outcomes}}$$

For example, the probability of flipping a coin and it being heads is ½, because there is 1 way of getting a head and the total number of possible outcomes is 2 (a head or tail).
We write P(heads) = ½.

• The probability of something which is certain to happen is 1.
• The probability of something which is impossible is 0.
• The probability of something not happening is 1 minus the probability that it will happen.

Experiment: any process that generates a set of data.
Event: a set of possible outcomes of a probability experiment. It can consist of one outcome or more than one outcome.
Simple event: an event with one outcome.
Compound event: an event with more than one outcome.

2.1 Sample Space and Relationships among Events

Sample space is the set of all possible outcomes or results of a random experiment. The sample space is represented by the letter S. Each outcome in the sample space is called an element of that set. An event is a subset of this sample space and is represented by the letter E. This can be illustrated in a Venn diagram. In Figure 2.1, the sample space is represented by the rectangle and the events by the circles inside the rectangle. The events A and B (in a to c) and A, B and C (in d and e) are all subsets of the sample space S.

Figure 2.1 Venn diagrams of sample space with events (adapted from Montgomery et al., 2003)

For example, if a die is rolled, the sample space is {1, 2, 3, 4, 5, 6}. An event can be {1, 3, 5}, which is the set of odd numbers. Similarly, when a coin is tossed twice, the sample space is {HH, HT, TH, TT}.

Difference between Sample Space and Events

As discussed in the beginning, the sample space is the set of all possible outcomes of an experiment, and an event is a subset of the sample space. Let us try to understand this with a few examples. What happens when we toss a coin thrice?
If a coin is tossed three times, we get the following outcomes:

HHH, HHT, HTH, THH, TTH, THT, HTT and TTT

All these are the outcomes of the experiment of tossing a coin three times. Hence, the sample space is the set given by

S = {HHH, HHT, HTH, THH, TTH, THT, HTT, TTT}

Now, suppose the event is the set of outcomes in which there are exactly two heads. The outcomes with exactly two heads are HHT, HTH and THH; hence the event is given by

E = {HHT, HTH, THH}

We can clearly see that each element of set E is in set S, so E is a subset of S. There can be more than one event. In this case, we can have an event of getting only one tail or an event of getting only one head. If we have more than one event, we can represent these events by E1, E2, E3, etc. We can have more than one event for a sample space, but there is one and only one sample space for an event. If the events E1, E2, E3, …, En together contain every outcome of the sample space, then we have

S = E1 ∪ E2 ∪ E3 ∪ … ∪ En

We can understand this with the help of a simple example. Consider an experiment of rolling a die. We have the sample space

S = {1, 2, 3, 4, 5, 6}

Now, if event E1 is getting an odd number as the outcome and E2 is getting an even number as the outcome of this experiment, then we can represent E1 and E2 as the following sets:

E1 = {1, 3, 5}
E2 = {2, 4, 6}

So we have

{1, 3, 5} ∪ {2, 4, 6} = {1, 2, 3, 4, 5, 6}

or

S = E1 ∪ E2

Hence, the union of events E1 and E2 is S.

Null space: a subset of the sample space that contains no elements, denoted by the symbol ∅. It is also called the empty set.

Operations with Events

Intersection of Events

The intersection of two events A and B is denoted by the symbol A ∩ B. It is the event containing all elements that are common to A and B. This is illustrated as the shaded region in Figure 2.1 (c).
For example,
Let A = {3, 6, 9, 12, 15} and B = {1, 3, 5, 8, 12, 15, 17}; then A ∩ B = {3, 12, 15}
Let X = {q, w, e, r, t} and Y = {a, s, d, f}; then X ∩ Y = ∅, since X and Y have no elements in common.

Mutually Exclusive Events

Two events are mutually exclusive if they have no elements in common. This is illustrated in Figure 2.1 (b), where we can see that A ∩ B = ∅.

Union of Events

The union of events A and B is the event containing all the elements that belong to A or to B or to both, and is denoted by the symbol A ∪ B. The elements of A ∪ B may be listed or defined by the rule A ∪ B = { x | x ∈ A or x ∈ B }.

For example,
Let A = {a, e, i, o, u} and B = {b, c, d, e, f}; then A ∪ B = {a, b, c, d, e, f, i, o, u}
Let X = {1, 2, 3, 4} and Y = {3, 4, 5, 6}; then X ∪ Y = {1, 2, 3, 4, 5, 6}

Complement of an Event

The complement of an event A with respect to S is the set of all elements of S that are not in A, and is denoted by A’. The shaded region in Figure 2.1 (e) shows (A ∩ C)’.

For example,
Consider the sample space S = {dog, cow, bird, snake, pig}
Let A = {dog, bird, pig}; then A’ = {cow, snake}

Probability of an Event

Sample space and events play important roles in probability. Once we have the sample space and the event, we can find the probability of that event, provided all outcomes in the sample space are equally likely. We have the following formula for the probability of an event:

$$\text{Probability of an event} = \frac{\text{number of elements in the event set}}{\text{number of elements in the sample space of an experiment}}$$

$$P(E) = \frac{n(E)}{n(S)}$$

where
n(S) represents the number of elements in the sample space of an experiment;
n(E) represents the number of elements in the event set; and
P(E) represents the probability of the event.
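The formula P(E) = n(E)/n(S) can be checked by direct counting. The following is a minimal Python sketch that reuses the three-coin-toss example from earlier in this section; it assumes all outcomes are equally likely, and the function name probability is chosen here only for illustration.

```python
from fractions import Fraction
from itertools import product


def probability(event, sample_space):
    """Return P(E) = n(E) / n(S), assuming equally likely outcomes."""
    event, sample_space = set(event), set(sample_space)
    if not event <= sample_space:
        raise ValueError("the event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))


# Sample space for tossing a coin three times: HHH, HHT, ..., TTT.
S = {"".join(tosses) for tosses in product("HT", repeat=3)}

# Event: exactly two heads.
E = {outcome for outcome in S if outcome.count("H") == 2}
p_two_heads = probability(E, S)

# Complement of E with respect to S, and its probability.
E_complement = S - E
p_not_two_heads = probability(E_complement, S)

print(sorted(E))         # ['HHT', 'HTH', 'THH']
print(p_two_heads)       # 3/8
print(p_not_two_heads)   # 5/8
```

Using Fraction keeps the result exact, so the probabilities of an event and its complement add up to exactly 1.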
When probabilities are assigned to the outcomes in a sample space, each probability must lie between 0 and 1 inclusive, and the sum of all probabilities assigned must be equal to 1. Therefore,

0 ≤ P(E) ≤ 1 and P(S) = 1

Let us try to understand this with the help of an example. If a die is tossed, the sample space is {1, 2, 3, 4, 5, 6}. This set has 6 elements. Now, if the event is the set of odd numbers on a die, then we have {1, 3, 5} as the event. This set has 3 elements. So, the probability of getting an odd number in a single throw of a die is given by

$$\text{Probability} = \frac{3}{6} = \frac{1}{2}$$

2.2 Counting Rules Useful in Probability

Multiplicative Rule

Suppose you have j sets of elements, n1 in the first set, n2 in the second set, ..., and nj in the jth set. Suppose you wish to form a sample of j elements by taking one element from each of the j sets. The number of possible samples is then

$$n_1 \cdot n_2 \cdot \ldots \cdot n_j$$

Permutation Rule

The arrangement of elements in a distinct order is called a permutation. Given a single set of n distinct elements, you wish to select k elements from the n and arrange them within k positions. The number of different permutations of the n elements taken k at a time is denoted $P_k^n$ and is equal to

$$P_k^n = \frac{n!}{(n-k)!}$$

Partitions Rule

Suppose a single set of n distinct elements exists. You wish to partition them into k sets, with the first set containing n1 elements, the second containing n2 elements, ..., and the kth set containing nk elements. The number of different partitions is

$$\frac{n!}{n_1!\, n_2! \cdots n_k!}$$

where n1 + n2 + … + nk = n.

The numerator gives the permutations of the n elements.
The terms in the denominator remove the duplicates that arise from reordering the elements within each of the k sets; the resulting quantity is a multinomial coefficient.

Combinations Rule

A sample of k elements is to be chosen from a set of n elements. The number of different samples of k elements that can be selected from n is equal to

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

2.3 Rules of Probability

Before discussing the rules of probability, we state the following definitions:

• Two events are mutually exclusive or disjoint if they cannot occur at the same time.
• The probability that Event A occurs, given that Event B has occurred, is called a conditional probability. The conditional probability of Event A, given Event B, is denoted by the symbol P(A|B).
• The complement of an event is the event not occurring. The probability that Event A will not occur is denoted by P(A’).
• The probability that Events A and B both occur is the probability of the intersection of A and B. The probability of the intersection of Events A and B is denoted by P(A ∩ B). If Events A and B are mutually exclusive, P(A ∩ B) = 0.
• The probability that Event A or Event B occurs is the probability of the union of A and B. The probability of the union of Events A and B is denoted by P(A ∪ B).

If the occurrence of Event A changes the probability of Event B, then Events A and B are dependent. On the other hand, if the occurrence of Event A does not change the probability of Event B, then Events A and B are independent.

Rule of Addition

Rule 1: If two events A and B are mutually exclusive, then:

$$P(A \cup B) = P(A) + P(B)$$

Rule 2: If events A and B are not mutually exclusive, then:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Example 1. A student goes to the library. The probability that she checks out (a) a work of fiction is 0.40, (b) a work of non-fiction is 0.30, and (c) both fiction and non-fiction is 0.20.
What is the probability that the student checks out a work of fiction, non-fiction, or both?

Solution:
Let F = the event that the student checks out fiction;
Let N = the event that the student checks out non-fiction.
Then, based on the rule of addition:

$$P(F \cup N) = P(F) + P(N) - P(F \cap N)$$
$$P(F \cup N) = 0.4 + 0.3 - 0.2 = \mathbf{0.5}$$

Rule of Multiplication

Rule 1: When two events A and B are independent, then:

$$P(A \cap B) = P(A)\,P(B)$$

Dependent: Two outcomes are said to be dependent if knowing that one of the outcomes has occurred affects the probability that the other occurs.

Conditional Probability: The conditional probability of an event B in relation to an event A is the probability that event B occurs given that event A has already occurred. The probability is denoted by P(B|A).

Rule 2: When two events are dependent, the probability of both occurring is:

$$P(A \cap B) = P(A)\,P(B|A)$$

where

$$P(B|A) = \frac{P(A \cap B)}{P(A)}, \quad \text{provided that } P(A) > 0$$

Example 1. A day’s production of 850 manufactured parts contains 50 parts that do not meet customer requirements. Two parts are selected randomly without replacement from the batch. What is the probability that the second part is defective given that the first part is defective?

Solution:
Let A = event that the first part selected is defective
Let B = event that the second part selected is defective.
P(B|A) = ?
If the first part is defective, then prior to selecting the second part the batch contains 849 parts, of which 49 are defective. Therefore,
P(B|A) = 49/849

Example 2. An urn contains 6 red marbles and 4 black marbles. Two marbles are drawn without replacement from the urn. What is the probability that both of the marbles are black?

Solution:
Let A = the event that the first marble is black; and let B = the event that the second marble is black.
We know the following:
• In the beginning, there are 10 marbles in the urn, 4 of which are black. Therefore, P(A) = 4/10.
• After a black marble is drawn first, there are 9 marbles in the urn, 3 of which are black. Therefore, P(B|A) = 3/9.

$$P(A \cap B) = \left(\frac{4}{10}\right)\left(\frac{3}{9}\right) = \frac{2}{15} \approx 0.133$$

Example 3. Two cards are selected without replacement from a standard pack of 52 cards. What is the probability that they are both queens?

Solution: Let A = the event that the first card is a queen, and let B = the event that the second card is also a queen. We require P(A ∩ B). Notice that these events are dependent because the probability that the second card is a queen depends on whether or not the first card is a queen.

P(A ∩ B) = P(A) P(B|A)
P(A) = 4/52 = 1/13 and P(B|A) = 3/51
P(A ∩ B) = (1/13)(3/51) = 1/221 ≈ 0.004525

Rule of Subtraction

The probability that event A will occur is equal to 1 minus the probability that event A will not occur.

$$P(A) = 1 - P(A')$$

Example 1. The probability of Bill not graduating from college is 0.8. What is the probability that Bill will graduate from college?

Solution: Let A = the event that Bill graduates from college, so P(A') = 0.8. Then

$$P(A) = 1 - 0.8 = 0.2$$

CHAPTER TEST

Solve the following problems completely.

1. Three events are shown on the Venn diagram in the following figure. Reproduce the figure and shade the region that corresponds to each of the following events.
   a. A'
   b. A ∩ B
   c. (A ∩ B) ∪ C
   d. (B ∪ C)'
   e. (A ∩ B)' ∪ C

2. Each of the possible five outcomes of a random experiment is equally likely. The sample space is {a, b, c, d, e}. Let A denote the event {a, b}, and let B denote the event {c, d, e}. Determine the following:
   a. P(A)
   b. P(B)
   c. P(A')
   d. P(A ∪ B)
   e. P(A ∩ B)

3. If A, B, and C are mutually exclusive events with P(A) = 0.2, P(B) = 0.3, and P(C) = 0.4, determine the following probabilities:
   a. P(A ∪ B ∪ C)
   b. P(A ∩ B ∩ C)
   c. P(A ∩ B)
   d. P[(A ∪ B) ∩ C]
   e. P(A' ∩ B' ∩ C')

4. A lot of 100 semiconductor chips contains 20 that are defective. Two are selected randomly, without replacement, from the lot.
   a. What is the probability that the first one selected is defective?
   b. What is the probability that the second one selected is defective given that the first one was defective?
   c. What is the probability that both are defective?
   d. How does the answer to part (b) change if chips selected were replaced prior to the next selection?

5. Suppose 2% of cotton fabric rolls and 3% of nylon fabric rolls contain flaws. Of the rolls used by a manufacturer, 70% are cotton and 30% are nylon. What is the probability that a randomly selected roll used by the manufacturer contains flaws?

Chapter 3
DISCRETE PROBABILITY DISTRIBUTIONS

Introduction

Many physical systems can be modelled by the same or similar random experiments and random variables. The distribution of the random variables involved in each of these common systems can be analyzed, and the result of that analysis can be used in different applications and examples. In this chapter, the analysis of several random experiments and discrete random variables that often appear in applications is discussed.
A discussion of the basic sample space of the random experiment is frequently omitted, and the distribution of a particular random variable is described directly.

Intended Learning Outcomes

At the end of this module, it is expected that the students will be able to:
1. Determine probabilities from probability mass functions.
2. Determine probabilities from cumulative distribution functions, and cumulative distribution functions from probability mass functions.
3. Calculate means and variances for discrete random variables.
4. Understand the assumptions for each of the discrete probability distributions presented.
5. Select an appropriate discrete probability distribution to calculate probabilities in specific applications.
6. Calculate probabilities and determine means and variances for each of the discrete probability distributions presented.

Discrete Probability Distribution

A discrete distribution describes the probability of occurrence of each value of a discrete random variable. A discrete random variable is a random variable that has countable values, such as a list of non-negative integers. With a discrete probability distribution, each possible value of the discrete random variable can be associated with a non-zero probability. Thus, a discrete probability distribution is often presented in tabular form.

3.1 Random Variables and Their Probability Distributions

Random Variables

In probability and statistics, a random variable is a variable whose value is subject to variations due to chance (i.e., randomness, in a mathematical sense). As opposed to other mathematical variables, a random variable conceptually does not have a single, fixed value (even if unknown); rather, it can take on a set of possible different values, each with an associated probability.
A random variable's possible values might represent the possible outcomes of a yet-to-be-performed experiment, or the possible outcomes of a past experiment whose already-existing value is uncertain (for example, as a result of incomplete information or imprecise measurements). They may also conceptually represent either the results of an "objectively" random process (such as rolling a die), or the "subjective" randomness that results from incomplete knowledge of a quantity.

Random variables can be classified as either discrete (that is, taking any of a specified list of exact values) or continuous (taking any numerical value in an interval or collection of intervals). The mathematical function describing the possible values of a random variable and their associated probabilities is known as a probability distribution.

Discrete Random Variables

Discrete random variables can take on either a finite or at most a countably infinite set of discrete values (for example, the integers). Their probability distribution is given by a probability mass function, which directly maps each value of the random variable to a probability. For example, the value x1 takes on the probability p1, the value x2 takes on the probability p2, and so on. The probabilities pi must satisfy two requirements: every probability pi is a number between 0 and 1, and the sum of all the probabilities is 1 (p1 + p2 + ⋯ + pk = 1).

Figure: Discrete Probability Distribution. This shows the probability mass function of a discrete probability distribution. The probabilities of the singletons {1}, {3}, and {7} are respectively 0.2, 0.5, and 0.3. A set not containing any of these points has probability zero.
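The two requirements on a probability mass function can be checked mechanically. The following is a minimal sketch, assuming the distribution from the figure above (values 1, 3, and 7 with probabilities 0.2, 0.5, and 0.3) is stored as a Python dictionary. The names `pmf`, `is_valid_pmf`, and `prob` are illustrative choices and are not part of the module.

```python
from fractions import Fraction

# PMF from the figure: P(X=1) = 0.2, P(X=3) = 0.5, P(X=7) = 0.3.
# Exact fractions are used so that the sum-to-one check has no rounding error.
pmf = {1: Fraction(2, 10), 3: Fraction(5, 10), 7: Fraction(3, 10)}


def is_valid_pmf(pmf):
    """Check both requirements: each p is in [0, 1], and all p sum to 1."""
    in_range = all(0 <= p <= 1 for p in pmf.values())
    return in_range and sum(pmf.values()) == 1


def prob(pmf, event):
    """P(X in event): add the probabilities of the values in the event.

    Values that are not listed in the PMF contribute probability zero.
    """
    return sum(pmf.get(x, 0) for x in event)


print(is_valid_pmf(pmf))   # True
print(prob(pmf, {1, 3}))   # 7/10
print(prob(pmf, {2, 4}))   # 0
```

The last call illustrates the statement in the caption: a set that contains none of the points 1, 3, or 7 has probability zero.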
Examples of discrete random variables include the values obtained from rolling a die and the grades received on a test out of 100.

Probability Distributions for Discrete Random Variables

Probability distributions for discrete random variables can be displayed as a formula, in a table, or in a graph. A discrete random variable X has a countable number of possible values. The probability distribution of a discrete random variable X lists the values and their probabilities, where value x1 has probability p1, value x2 has probability p2, and so on. Every probability pi is a number between 0 and 1, and the sum of all the probabilities is equal to 1.

Examples of discrete random variables include:
• The number of eggs that a hen lays in a given day (it can't be 2.3)
• The number of people going to a given soccer match
• The number of students that come to class on a given day
• The number of people in line at McDonald's on a given day and time

For example, suppose that X is a random variable that represents the number of people waiting in line at a fast-food restaurant, and that it takes only the values 2, 3, or 5 with probabilities 2/10, 3/10, and 5/10 respectively. This can be expressed through the function f(x) = x/10, x = 2, 3, 5, or through the table below. Notice that these two representations are equivalent, and that this can be represented graphically as in the probability histogram below.

Probability Histogram: This histogram displays the probabilities of each of the three values of the discrete random variable.

The formula, table, and probability histogram satisfy the following necessary conditions of discrete probability distributions:
1. 0 ≤ f(x) ≤ 1, i.e., the values of f(x) are probabilities, hence between 0 and 1.
2. ∑ f(x) = 1, i.e., adding the probabilities of all disjoint cases, we obtain the probability of the sample space, 1.

Sometimes, the discrete probability distribution is referred to as the probability mass function (pmf). The probability mass function has the same purpose as the probability histogram, and displays specific probabilities for each value of the discrete random variable. The only difference is how it looks graphically.

Probability Mass Function: This shows the graph of a probability mass function. All the values of this function must be non-negative and sum up to 1.

x      2     3     5
f(x)   0.2   0.3   0.5

Discrete Probability Distribution: This table shows the values the discrete random variable can take on and their corresponding probabilities.

Example 1. A shipment of 20 similar laptop computers to a retail outlet contains 3 that are defective. If a school makes a random purchase of 2 of these computers, find the probability distribution for the number of defectives.

Solution: Let X be a random variable whose values x are the possible numbers of defective computers purchased by the school. Then x can only take the numbers 0, 1, and 2. Now,

$$f(0) = P(X = 0) = \frac{\binom{3}{0}\binom{17}{2}}{\binom{20}{2}} = \frac{68}{95}, \quad f(1) = P(X = 1) = \frac{\binom{3}{1}\binom{17}{1}}{\binom{20}{2}} = \frac{51}{190}, \quad f(2) = P(X = 2) = \frac{\binom{3}{2}\binom{17}{0}}{\binom{20}{2}} = \frac{3}{190}$$

Thus, the probability distribution of X is

x      0       1        2
f(x)   68/95   51/190   3/190

3.2 Cumulative Distribution Functions

You might recall that the cumulative distribution function is defined for discrete random variables as:

$$F(x) = P(X \le x) = \sum_{t \le x} f(t)$$

Again, F(x) accumulates all of the probability less than or equal to x. The cumulative distribution function for continuous random variables is just a straightforward extension of that of the discrete case. All we need to do is replace the summation with an integral.
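Before moving to the continuous case, the discrete definition can be tried out as a running sum. The sketch below reuses the laptop example (20 computers, 3 defective, 2 purchased) and assumes the probabilities come from the usual counting argument with combinations. The function names `f` and `F` mirror the notation in the text but are otherwise illustrative.

```python
from fractions import Fraction
from math import comb

# Laptop example: 20 computers, 3 defective, 2 purchased at random.


def f(x):
    """PMF: probability of exactly x defectives among the 2 purchased."""
    return Fraction(comb(3, x) * comb(17, 2 - x), comb(20, 2))


def F(x):
    """CDF: add f(t) over the possible values t = 0, 1, 2 with t <= x."""
    return sum((f(t) for t in (0, 1, 2) if t <= x), Fraction(0))


for x in (0, 1, 2):
    print(x, f(x), F(x))
```

Because the sum runs over every possible value t ≤ x, `F` is defined for any real x, not only for 0, 1, and 2. For instance, `F(1.5)` equals `F(1)` and `F(-1)` is 0.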
The cumulative distribution function ("c.d.f.") of a continuous random variable X is defined as:

$$F(x) = \int_{-\infty}^{x} f(t)\,dt, \quad \text{for } -\infty < x < \infty$$

Example 1. Suppose that a day's production of 850 manufactured parts contains 50 parts that do not conform to customer requirements. Two parts are selected at random, without replacement, from the batch. Let the random variable X equal the number of nonconforming parts in the sample. What is the cumulative distribution function of X?

Solution: The question can be answered by first finding the probability mass function of X.

$$P(X = 0) = \frac{800}{850} \cdot \frac{799}{849} = 0.886$$

$$P(X = 1) = 2 \cdot \frac{800}{850} \cdot \frac{50}{849} = 0.111$$

$$P(X = 2) = \frac{50}{850} \cdot \frac{49}{849} = 0.003$$

Therefore,

$$F(0) = P(X \le 0) = 0.886, \quad F(1) = P(X \le 1) = 0.886 + 0.111 = 0.997, \quad F(2) = P(X \le 2) = 1$$

The cumulative distribution function for this example is graphed in the figure below. Note that F(x) is defined for all x in −∞ < x < ∞ and not only for 0, 1, and 2.

Figure: Graph of the cumulative distribution function for the above example.

3.3 Expected Values of Random Variables

The expected value of a random variable is the weighted average of all possible values that this random variable can take on.

Discrete Random Variable

A discrete random variable X has a countable number of possible values. The probability distribution of a discrete random variable X lists the values and their probabilities, such that xi has a probability of pi. The probabilities pi must satisfy two requirements:
1. Every probability pi is a number between 0 and 1.
2. The sum of the probabilities is 1: p1 + p2 + ⋯ + pk = 1.

Expected Value Definition

In probability theory, the expected value (or expectation, mathematical expectation, EV, mean, or first moment) of a random variable is the weighted average of all possible values that this random variable can take on. The weights used in computing this average are probabilities in the case of a discrete random variable.
The expected value may be intuitively understood by the law of large numbers: the expected value, when it exists, is almost surely the limit of the sample mean as sample size grows to infinity. More informally, it can be interpreted as the long-run average of the results of many independent repetitions of an experiment (e.g., a dice roll). The value may not be expected in the ordinary sense: the "expected value" itself may be unlikely or even impossible (such as having 2.5 children), as is also the case with the sample mean.

How To Calculate Expected Value

Suppose random variable X can take value x1 with probability p1, value x2 with probability p2, and so on, up to value xk with probability pk. Then the expected value of the random variable X is defined as

$$E[X] = x_1 p_1 + x_2 p_2 + \cdots + x_k p_k$$

which can also be written as:

$$E[X] = \sum_{i=1}^{k} x_i p_i$$

If all outcomes xi are equally likely (that is, p1 = p2 = ⋯ = pk), then the weighted average turns into the simple average. This is intuitive: the expected value of a random variable is the average of all values it can take; thus, the expected value is what one expects to happen on average. If the outcomes xi are not equally probable, then the simple average must be replaced with the weighted average, which takes into account the fact that some outcomes are more likely than others. The intuition, however, remains the same: the expected value of X is what one expects to happen on average.

For example, let X represent the outcome of a roll of a six-sided die. The possible values for X are 1, 2, 3, 4, 5, and 6, all equally likely (each having the probability of 1/6). The expectation of X is:

$$E[X] = 1\left(\tfrac{1}{6}\right) + 2\left(\tfrac{1}{6}\right) + 3\left(\tfrac{1}{6}\right) + 4\left(\tfrac{1}{6}\right) + 5\left(\tfrac{1}{6}\right) + 6\left(\tfrac{1}{6}\right) = \tfrac{21}{6} = 3.5$$

In this case, since all outcomes are equally likely, we could have simply averaged the numbers together: (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5.
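The weighted-average formula and its long-run interpretation can both be sketched in a few lines. The example below assumes a fair six-sided die and, as an unequal-probability case, the fast-food queue distribution f(x) = x/10 for x = 2, 3, 5 from Section 3.1. The helper name `expected_value`, the seed, and the number of simulated rolls are arbitrary illustrative choices.

```python
import random
from fractions import Fraction


def expected_value(pmf):
    """E[X] = sum of x * p over all (value, probability) pairs."""
    return sum(x * p for x, p in pmf.items())


# Fair die: every face has probability 1/6, so E[X] is the simple average.
die = {face: Fraction(1, 6) for face in range(1, 7)}
print(expected_value(die))     # 7/2

# Unequal probabilities: queue length with f(x) = x/10 for x = 2, 3, 5.
queue = {2: Fraction(2, 10), 3: Fraction(3, 10), 5: Fraction(5, 10)}
print(expected_value(queue))   # 19/5

# Long-run average of simulated die rolls; the seed is fixed only so that
# repeated runs give the same rolls.
rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(100_000)]
print(sum(rolls) / len(rolls))  # close to 3.5
```

The queue example gives 19/5 = 3.8, which is pulled toward the most likely value 5 and away from the simple average (2 + 3 + 5)/3 ≈ 3.33.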
Figure: Average Dice Value Against Number of Rolls. An illustration of the convergence of sequence averages of rolls of a die to the expected value of 3.5 as the number of rolls (trials) grows.

3.4 The Binomial Distribution

Binomial Experiment

A binomial experiment is a statistical experiment that has the following properties:
• The experiment consists of n repeated trials.
• Each trial can result in just two possible outcomes. We call one of these outcomes a success and the other, a failure.
• The probability of success, denoted by P, is the same on every trial.
• The trials are independent; that is, the outcome on one trial does not affect the outcome on other trials.

Consider the following statistical experiment. You flip a coin 2 times and count the number of times the coin lands on heads. This is a binomial experiment because:
1. The experiment consists of repeated trials. We flip a coin 2 times.
2. Each trial can result in just two possible outcomes: heads or tails.
3. The probability of success is constant: 0.5 on every trial.
4. The trials are independent; that is, getting heads on one trial does not affect whether we get heads on other trials.

The following notation is helpful when we talk about binomial probability.
• x: The number of successes that result from the binomial experiment.
• n: The number of trials in the binomial experiment.
• P: The probability of success on an individual trial.
• Q: The probability of failure on an individual trial. (This is equal to 1 - P.)
• n!: The factorial of n (also known as n factorial).
• b(x; n, P): Binomial probability, the probability that an n-trial binomial experiment results in exactly x successes, when the probability of success on an individual trial is P.
• nCr: The number of combinations of n things, taken r at a time.

Binomial Distribution

A binomial random variable is the number of successes x in n repeated trials of a binomial experiment. The probability distribution of a binomial random variable is called a binomial distribution.

Suppose we flip a coin two times and count the number of heads (successes). The binomial random variable is the number of heads, which can take on values of 0, 1, or 2. The binomial distribution is presented below.

Number of Heads   Probability
0                 0.25
1                 0.50
2                 0.25

The binomial distribution has the following properties:
• The mean of the distribution is μx = nP.
• The variance is σx² = nP(1 - P).
• The standard deviation is σx = sqrt[nP(1 - P)].

Binomial Formula and Binomial Probability

The binomial probability refers to the probability that a binomial experiment results in exactly x successes. For example, in the above table, we see that the binomial probability of getting exactly one head in two coin flips is 0.50. Given x, n, and P, we can compute the binomial probability based on the binomial formula.

Binomial Formula. Suppose a binomial experiment consists of n trials and results in x successes. If the probability of success on an individual trial is P, then the binomial probability is:

$$b(x; n, P) = {}_{n}C_{x}\, P^{x} (1 - P)^{n - x}$$

or

$$b(x; n, P) = \frac{n!}{x!\,(n - x)!}\, P^{x} (1 - P)^{n - x}$$

Example 1. Suppose a die is tossed 5 times. What is the probability of getting exactly 2 fours?

Solution: This is a binomial experiment in which the number of trials is equal to 5, the number of successes is equal to 2, and the probability of success on a single trial is 1/6 or about 0.167.
Therefore, the binomial probability is:

$$b(2; 5, 0.167) = {}_{5}C_{2}\,(0.167)^{2}(0.833)^{3} = 0.161$$

Cumulative Binomial Probability

A cumulative binomial probability refers to the probability that the binomial random variable falls within a specified range (e.g., is greater than or equal to a stated lower limit and less than or equal to a stated upper limit).

For example, we might be interested in the cumulative binomial probability of obtaining 45 or fewer heads in 100 tosses of a coin. This would be the sum of all these individual binomial probabilities:

b(x ≤ 45; 100, 0.5) = b(x = 0; 100, 0.5) + b(x = 1; 100, 0.5) + ... + b(x = 44; 100, 0.5) + b(x = 45; 100, 0.5)

Example 1. What is the probability of obtaining 45 or fewer heads in 100 tosses of a coin?

Solution: To solve this problem, we compute 46 individual probabilities (for x = 0, 1, ..., 45), using the binomial formula. The sum of all these probabilities is the answer we seek. Thus,

b(x ≤ 45; 100, 0.5) = b(x = 0; 100, 0.5) + b(x = 1; 100, 0.5) + ... + b(x = 45; 100, 0.5)
b(x ≤ 45; 100, 0.5) = 0.184

Example 2. The probability that a student is accepted to a prestigious college is 0.3. If 5 students from the same school apply, what is the probability that at most 2 are accepted?

Solution: To solve this problem, we compute 3 individual probabilities, using the binomial formula. The sum of all these probabilities is the answer we seek. Thus,

b(x ≤ 2; 5, 0.3) = b(x = 0; 5, 0.3) + b(x = 1; 5, 0.3) + b(x = 2; 5, 0.3)
b(x ≤ 2; 5, 0.3) = 0.16807 + 0.36015 + 0.30870
b(x ≤ 2; 5, 0.3) = 0.8369

3.5 The Poisson Distribution

A Poisson distribution is the probability distribution that results from a Poisson experiment.
Attributes of a Poisson Experiment

A Poisson experiment is a statistical experiment that has the following properties:
• The experiment results in outcomes that can be classified as successes or failures.
• The average number of successes (μ) that occurs in a specified region is known.
• The probability that a success will occur is proportional to the size of the region.
• The probability that a success will occur in an extremely small region is virtually zero.

Note that the specified region could take many forms. For instance, it could be a length, an area, a volume, a period of time, etc.

Notation

The following notation is helpful when we talk about the Poisson distribution.
• e: A constant equal to approximately 2.71828. (Actually, e is the base of the natural logarithm system.)
• μ: The mean number of successes that occur in a specified region.
• x: The actual number of successes that occur in a specified region.
• P(x; μ): The Poisson probability that exactly x successes occur in a Poisson experiment, when the mean number of successes is μ.

Poisson Distribution

A Poisson random variable is the number of successes that result from a Poisson experiment. The probability distribution of a Poisson random variable is called a Poisson distribution.

Given the mean number of successes (μ) that occur in a specified region, we can compute the Poisson probability based on the following Poisson formula.

Poisson Formula. Suppose we conduct a Poisson experiment, in which the average number of successes within a given region is μ. Then, the Poisson probability is:

$$P(x; \mu) = \frac{e^{-\mu}\,\mu^{x}}{x!}$$

where x is the actual number of successes that result from the experiment, and e is approximately equal to 2.71828.

The Poisson distribution has the following properties:
• The mean of the distribution is equal to μ.
• The variance is also equal to μ.

Example 1.
The average number of homes sold by the Acme Realty company is 2 homes per day. What is the probability that exactly 3 homes will be sold tomorrow?

Solution: This is a Poisson experiment in which we know the following:
• μ = 2; since 2 homes are sold per day, on average.
• x = 3; since we want to find the likelihood that 3 homes will be sold tomorrow.
• e = 2.71828; since e is a constant equal to approximately 2.71828.

We plug these values into the Poisson formula as follows:

$$P(x; \mu) = \frac{e^{-\mu}\,\mu^{x}}{x!}$$

$$P(3; 2) = \frac{(2.71828^{-2})(2^{3})}{3!} = \frac{(0.13534)(8)}{6} = 0.180$$

Thus, the probability of selling 3 homes tomorrow is 0.180.

Cumulative Poisson Probability

A cumulative Poisson probability refers to the probability that the Poisson random variable falls within a specified range, that is, between some specified lower limit and some specified upper limit.

Example. Suppose the average number of lions seen on a 1-day safari is 5. What is the probability that tourists will see fewer than four lions on the next 1-day safari?

Solution: This is a Poisson experiment in which we know the following:
• μ = 5; since 5 lions are seen per safari, on average.
• x = 0, 1, 2, or 3; since we want to find the likelihood that tourists will see fewer than 4 lions; that is, we want the probability that they will see 0, 1, 2, or 3 lions.
• e = 2.71828; since e is a constant equal to approximately 2.71828.

To solve this problem, we need to find the probability that tourists will see 0, 1, 2, or 3 lions. Thus, we need to calculate the sum of four probabilities: P(0; 5) + P(1; 5) + P(2; 5) + P(3; 5). To compute this sum, we use the Poisson formula:

$$P(x \le 3; 5) = P(0; 5) + P(1; 5) + P(2; 5) + P(3; 5)$$

$$P(x \le 3; 5) = \frac{e^{-5}\,5^{0}}{0!} + \frac{e^{-5}\,5^{1}}{1!} + \frac{e^{-5}\,5^{2}}{2!} + \frac{e^{-5}\,5^{3}}{3!}$$

$$P(x \le 3; 5) = \frac{(0.006738)(1)}{1} + \frac{(0.006738)(5)}{1} + \frac{(0.006738)(25)}{2} + \frac{(0.006738)(125)}{6}$$

$$P(x \le 3; 5) = 0.006738 + 0.03369 + 0.084224 + 0.140375 = 0.2650$$

Thus, the probability of seeing no more than 3 lions is 0.2650.

CHAPTER TEST

Solve the following problems completely.

1. The sample space of a random experiment is {a, b, c, d, e, f}, and each outcome is equally likely. A random variable is defined as follows:

   Outcome   a   b   c     d     e   f
   x         0   0   1.5   1.5   2   3

   Determine the probability mass function of X.

2. Marketing estimates that a new instrument for the analysis of soil samples will be very successful, moderately successful, or unsuccessful, with probabilities 0.3, 0.6, and 0.1, respectively. The yearly revenue associated with a very successful, moderately successful, or unsuccessful product is $10 million, $5 million, and $1 million, respectively. Let the random variable X denote the yearly revenue of the product. Determine the probability mass function of X.

3. An assembly consists of three mechanical components. Suppose that the probabilities that the first, second, and third components meet specifications are 0.95, 0.98, and 0.99. Assume that the components are independent.
Determine the probability mass function of the number of components in the assembly that meet specifications.

4. Marketing estimates that a new instrument for the analysis of soil samples will be very successful, moderately successful, or unsuccessful, with probabilities 0.3, 0.6, and 0.1, respectively. The yearly revenue associated with a very successful, moderately successful, or unsuccessful product is $10 million, $5 million, and $1 million, respectively. Let the random variable X denote the yearly revenue of the product. Determine the probability mass function of X.

5. Given: Find:
   a. P(X  3)
   b. P(X  2)
   c. P(1  X  2)
   d. P(X > 2)

Chapter 4
CONTINUOUS PROBABILITY DISTRIBUTIONS

Introduction

The previous chapter discussed probability distributions for discrete random variables. We now extend that discussion to continuous random variables. In experiments where continuous variables are of interest (such as length measurements), it is reasonable to model the range of possible values of the random variable by an interval (finite or infinite) of real numbers. Since the random variable X can take any value in the interval, the number of possible values of X is uncountably infinite, and X has a different kind of distribution from the discrete random variables previously discussed. In this chapter, the distributions, probability computations, means, and variances for continuous random variables are discussed.

Intended Learning Outcomes

At the end of this module, it is expected that the students will be able to:
1. Determine probabilities from probability density functions.
2. Determine probabilities from cumulative distribution functions.
3. Calculate means and variances for continuous random variables.
4. Standardize normal random variables.
5. Use the table for the cumulative distribution function of a standard normal distribution to calculate probabilities.
6. Approximate probabilities for some binomial and Poisson distributions.
7. Use continuity corrections to improve the normal approximations to those binomial and Poisson distributions.

4.1 Continuous Random Variables and their Probability Distribution

A continuous random variable has a probability of zero of assuming exactly any of its values. Consequently, its probability distribution cannot be given in tabular form. At first this may seem startling, but it becomes more plausible when we consider a particular example. Let us discuss a random variable whose values are the heights of all people over 21 years of age. Between any two values, say 163.5 and 164.5 centimeters, or even 163.99 and 164.01 centimeters, there are an infinite number of heights, one of which is 164 centimeters. The probability of selecting a person at random who is exactly 164 centimeters tall, and not one of the infinitely large set of heights so close to 164 centimeters that you cannot humanly measure the difference, is remote, and thus we assign a probability of zero to the event. This is not the case, however, if we talk about the probability of selecting a person who is at least 163 centimeters but not more than 165 centimeters tall. Now we are dealing with an interval rather than a point value of our random variable.

We shall concern ourselves with computing probabilities for various intervals of continuous random variables such as P(a < X < b), P(W > c), and so forth. Note that when X is continuous, P(a < X ≤ b) = P(a < X < b) + P(X = b) = P(a < X < b).
That is, it does not matter whether we include an endpoint of the interval or not. This is not true, though, when X is discrete.

Although the probability distribution of a continuous random variable cannot be presented in tabular form, it can be stated as a formula. Such a formula would necessarily be a function of the numerical values of the continuous random variable X and as such will be represented by the functional notation f(x). In dealing with continuous variables, f(x) is usually called the probability density function, or simply the density function of X. Since X is defined over a continuous sample space, it is possible for f(x) to have a finite number of discontinuities. However, most density functions that have practical applications in the analysis of statistical data are continuous, and their graphs may take any of several forms, some of which are shown in Figure 4.1.

Figure 4.1 Typical Density Functions

Because areas will be used to represent probabilities and probabilities are positive numerical values, the density function must lie entirely above the x axis. A probability density function is constructed so that the area under its curve bounded by the x axis is equal to 1 when computed over the range of X for which f(x) is defined. Should this range of X be a finite interval, it is always possible to extend the interval to include the entire set of real numbers by defining f(x) to be zero at all points in the extended portions of the interval. In Figure 4.2, the probability that X assumes a value between a and b is equal to the shaded area under the density function between the ordinates at x = a and x = b, and from integral calculus is given by

$$P(a < X < b) = \int_{a}^{b} f(x)\,dx$$

Figure 4.2 P(a < X < b)

Example 1. For the density function

$$f(x) = \begin{cases} \dfrac{x^{2}}{3}, & -1 < x < 2, \\ 0, & \text{elsewhere}, \end{cases}$$

find F(x), and use it to evaluate P(0 < X ≤ 1).
Solution: For $-1 < x < 2$,

$$F(x) = \int_{-\infty}^{x} f(t)\,dt = \int_{-1}^{x} \frac{t^2}{3}\,dt = \left.\frac{t^3}{9}\right|_{-1}^{x} = \frac{x^3 + 1}{9}.$$

Therefore,

$$F(x) = \begin{cases} 0, & x < -1, \\ \dfrac{x^3 + 1}{9}, & -1 \le x < 2, \\ 1, & x \ge 2. \end{cases}$$

The cumulative distribution function F(x) is expressed graphically in Figure 4.3. Now,

$$P(0 < X \le 1) = F(1) - F(0) = \frac{2}{9} - \frac{1}{9} = \frac{1}{9}.$$

Figure 4.3 Continuous cumulative distribution function

4.2 Expected Values of Continuous Random Variables

Let X be a continuous random variable with range [a, b] and probability density function f(x). The expected value of X is defined by

$$E(X) = \int_a^b x f(x)\,dx$$

Let us see how this compares with the formula for a discrete random variable:

$$E(X) = \sum_{i=1}^{n} x_i\, p(x_i)$$

The discrete formula says to take a weighted sum of the values $x_i$ of X, where the weights are the probabilities $p(x_i)$. Recall that f(x) is a probability density. Its units are probability per unit of X, so f(x) dx represents the probability that X is in an infinitesimal range of width dx around x. Thus we can interpret the formula for E(X) as a weighted integral of the values x of X, where the weights are the probabilities f(x) dx.

As before, the expected value is also called the mean or average.

The variance of X, denoted V(X) or $\sigma^2$, is

$$\sigma^2 = V(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2$$

The standard deviation of X is $\sigma = \sqrt{\sigma^2}$.

Example 1. Let X ∼ uniform(0, 1). Find E(X).

Solution: Since X has a range of [0, 1] and a density of f(x) = 1:

$$E(X) = \int_0^1 x\,dx = \left.\frac{x^2}{2}\right|_0^1 = \frac{1}{2}$$

Not surprisingly, the mean is at the midpoint of the range.

Example 2. Let X have range [0, 2] and density $f(x) = \frac{3}{8}x^2$. Find E(X).
$$E(X) = \int_0^2 x f(x)\,dx = \int_0^2 \frac{3}{8}x^3\,dx = \left.\frac{3x^4}{32}\right|_0^2 = \frac{3}{2}$$

Does it make sense that this X has its mean in the right half of its range? Yes. Since the probability density increases as x increases over the range, the average value of x should be in the right half of the range. The mean µ is "pulled" to the right of the midpoint 1 because there is more mass to the right.

Properties of E(X)

The properties of E(X) for continuous random variables are the same as for discrete ones:
1. If X and Y are random variables on a sample space Ω, then E(X + Y) = E(X) + E(Y).
2. If a and b are constants, then E(aX + b) = aE(X) + b.

Expectation of Functions of X

This works exactly the same as in the discrete case. If h(x) is a function, then Y = h(X) is a random variable and

$$E(Y) = E(h(X)) = \int_{-\infty}^{\infty} h(x) f_X(x)\,dx$$

Example 1. Let X ∼ exp(λ). Find $E(X^2)$.

$$E(X^2) = \int_0^{\infty} x^2 \lambda e^{-\lambda x}\,dx = \left[-x^2 e^{-\lambda x} - \frac{2x}{\lambda} e^{-\lambda x} - \frac{2}{\lambda^2} e^{-\lambda x}\right]_0^{\infty} = \frac{2}{\lambda^2}$$

4.3 Continuous Uniform Distribution

This is the simplest continuous distribution, as it is analogous to its discrete counterpart. A continuous random variable X with probability density function

$$f(x) = \frac{1}{b - a}, \quad a \le x \le b$$

is a continuous uniform random variable. Its mean and variance are

$$\mu = E(X) = \frac{a + b}{2} \quad \text{and} \quad \sigma^2 = V(X) = \frac{(b - a)^2}{12}$$

4.4 Normal Distribution

The Normal Distribution is the most important and most widely used continuous probability distribution.
It is the cornerstone of the application of statistical inference in the analysis of data because the distributions of several important sample statistics tend towards a Normal distribution as the sample size increases.

Empirical studies have indicated that the Normal distribution provides an adequate approximation to the distributions of many physical variables. Specific examples include meteorological data, such as temperature and rainfall, measurements on living organisms, scores on aptitude tests, physical measurements of manufactured parts, weights of contents of food packages, volumes of liquids in bottles and cans, instrumentation errors and other deviations from established norms, and so on.

The graphical appearance of the Normal distribution is a symmetrical bell-shaped curve that extends without bound in both the positive and negative directions. The probability density function is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right], \quad -\infty < x < \infty;\ -\infty < \mu < \infty,\ \sigma > 0$$

where μ and σ are parameters. These turn out to be the mean and standard deviation, respectively, of the distribution. As a shorthand notation, we write $X \sim N(\mu, \sigma^2)$. The curve never actually reaches the horizontal axis but gets close to it beyond about 3 standard deviations on each side of the mean.

For any Normally distributed variable:
68.3% of all values will lie between μ − σ and μ + σ (i.e. μ ± σ)
95.45% of all values will lie within μ ± 2σ
99.73% of all values will lie within μ ± 3σ

The graphs below illustrate the effect of changing the values of μ and σ on the shape of the probability density function. Low variability (σ = 0.71) with respect to the mean gives a pointed bell-shaped curve with little spread.
Variability of σ = 1.41 produces a flatter bell-shaped curve with a greater spread.

Example 1. The volume of water in commercially supplied fresh drinking water containers is approximately Normally distributed with mean 70 litres and standard deviation 0.75 litres. Estimate the proportion of containers likely to contain
(i) in excess of 70.9 litres,
(ii) at most 68.2 litres,
(iii) less than 70.5 litres.

Solution: Let X denote the volume of water in a container, in litres. Then $X \sim N(70, 0.75^2)$, i.e. μ = 70, σ = 0.75 and Z = (X − 70)/0.75.
(i) X = 70.9; Z = (70.9 − 70)/0.75 = 1.20
P(X > 70.9) = P(Z > 1.20) = 0.1151 or 11.51%
(ii) X = 68.2; Z = (68.2 − 70)/0.75 = −2.40
P(X < 68.2) = P(Z < −2.40) = 0.0082 or 0.82%
(iii) X = 70.5; Z = (70.5 − 70)/0.75 = 0.67
P(X > 70.5) = P(Z > 0.67) = 0.2514; P(X < 70.5) = 1 − 0.2514 = 0.7486 or 74.86%

4.5 Normal Approximation to Binomial and Poisson Distribution

Binomial Approximation

The normal distribution can be used as an approximation to the binomial distribution. If X is a binomial random variable with parameters n and p, then

$$Z = \frac{X - np}{\sqrt{np(1 - p)}}$$

is approximately a standard normal random variable. The equation above standardizes the random variable X, so probabilities involving X can be approximated by using the standard normal distribution. The approximation is good when n is large enough that np > 5 and n(1 − p) > 5. In some cases, working out a problem using the Normal distribution may be easier than using the Binomial.

Poisson Approximation

The Poisson distribution was developed as the limit of a binomial distribution as the number of trials increases to infinity. Therefore, the normal distribution can also be used to approximate probabilities of a Poisson random variable.
If X is a Poisson random variable with E(X) = λ and V(X) = λ, then

$$Z = \frac{X - \lambda}{\sqrt{\lambda}}$$

is approximately a standard normal random variable, and this approximation is good for λ > 5.

Continuity Correction

The binomial and Poisson distributions are discrete, whereas the normal distribution is continuous. We need to take this into account when we use the normal distribution to approximate a binomial or Poisson distribution, and we do so with a continuity correction.

In the discrete distribution, each probability is represented by a rectangle (right-hand diagram). When working out probabilities, we want to include whole rectangles, which is what the continuity correction is all about.

Example 1. Suppose we toss a fair coin 20 times. What is the probability of getting between 9 and 11 heads?

Solution: Let X be the random variable representing the number of heads thrown. Then X ~ Bin(20, ½). Since p is close to ½ (it equals ½!), we can use the normal approximation to the binomial: X is approximately N(20 × ½, 20 × ½ × ½), that is, N(10, 5).

In this diagram, the rectangles represent the binomial distribution and the curve is the normal distribution.

We want P(9 ≤ X ≤ 11), which is the red shaded area. Notice that the first rectangle starts at 8.5 and the last rectangle ends at 11.5. Using a continuity correction, therefore, our probability becomes P(8.5 < X < 11.5) in the normal distribution. Standardizing with μ = 10 and σ = √5 ≈ 2.236 gives P(−0.67 < Z < 0.67) = 2(0.7486) − 1 ≈ 0.497.

4.6 Exponential Distribution

The exponential distribution obtains its name from the exponential function in the probability density function, $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$. Plots of the exponential distribution for selected values of λ are shown in Fig. 4.4. For any value of λ, the exponential distribution is quite skewed.
Figure 4.4 Probability density function of exponential random variables for selected values of λ

If the random variable X has an exponential distribution with parameter λ,

$$\mu = E(X) = \frac{1}{\lambda} \quad \text{and} \quad \sigma^2 = V(X) = \frac{1}{\lambda^2}$$

It is important to use consistent units in the calculation of probabilities, means, and variances involving exponential random variables. The following example illustrates unit conversions.

Example 1. In a large corporate computer network, user log-ons to the system can be modeled as a Poisson process with a mean of 25 log-ons per hour. What is the probability that there are no log-ons in an interval of 6 minutes?

Solution: Let X denote the time in hours from the start of the interval until the first log-on. Then X has an exponential distribution with λ = 25 log-ons per hour. We are interested in the probability that X exceeds 6 minutes. Because λ is given in log-ons per hour, we express all time units in hours. That is, 6 minutes = 0.1 hour. The probability requested is shown as the shaded area under the probability density function in Fig. 4.4. Therefore,

$$P(X > 0.1) = \int_{0.1}^{\infty} 25 e^{-25x}\,dx = e^{-25(0.1)} = 0.082$$

Figure 4.4 Probability for the exponential distribution

In the previous example, the probability that there are no log-ons in a 6-minute interval is 0.082 regardless of the starting time of the interval. A Poisson process assumes that events occur uniformly throughout the interval of observation; that is, there is no clustering of events. If the log-ons are well modeled by a Poisson process, the probability that the first log-on after noon occurs after 12:06 P.M. is the same as the probability that the first log-on after 3:00 P.M. occurs after 3:06 P.M.
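The log-on calculation can be checked numerically. The sketch below assumes only the exponential tail probability $P(X > x) = e^{-\lambda x}$ obtained from the integral in the solution; the helper name `exp_survival` and the waiting time `s` are hypothetical, introduced here for illustration. It also checks that the probability of waiting a further 0.1 hour does not depend on how long one has already waited.

```python
import math


def exp_survival(rate: float, x: float) -> float:
    """Return P(X > x) for an exponential random variable with the given rate (x >= 0)."""
    return math.exp(-rate * x)


rate = 25.0        # log-ons per hour, from the example
six_minutes = 0.1  # 6 minutes expressed in hours, to match the units of the rate

# Probability of no log-on in a 6-minute interval.
p_no_logon = exp_survival(rate, six_minutes)

# Starting-time check: P(X > s + t | X > s) should equal P(X > t).
s = 0.5  # an arbitrary time already waited, in hours (illustrative value)
p_conditional = exp_survival(rate, s + six_minutes) / exp_survival(rate, s)

print(round(p_no_logon, 3))                     # 0.082
print(math.isclose(p_no_logon, p_conditional))  # True
```

Working in hours throughout is the design choice that matters here: mixing a rate per hour with a time in minutes would give $e^{-150}$ instead of $e^{-2.5}$.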
And if someone logs on at 2:22 P.M., the probability that the next log-on occurs after 2:28 P.M. is still 0.082.

Our starting point for observing the system does not matter. However, if there are high-use periods during the day, such as right after 8:00 A.M., followed by a period of low use, a Poisson process is not an appropriate model for log-ons and the exponential distribution is not appropriate for computing probabilities. It might be reasonable to model each of the high- and low-use periods by a separate Poisson process, employing a larger value for λ during the high-use periods and a smaller value otherwise. Then an exponential distribution with the corresponding value of λ can be used to calculate log-on probabilities for the high- and low-use periods.

REFERENCES:
Montgomery, D. C. et al. (2003). Applied Statistics and Probability for Engineers, 3rd Edition. USA: John Wiley & Sons, Inc.
Walpole, R. E. et al. (2016). Probability & Statistics for Engineers & Scientists, 9th Edition. England: Pearson Education Limited.
Orloff, J. and Bloom, J. 18.05 Introduction to Probability and Statistics. Spring 2014. Massachusetts Institute of Technology: MIT OpenCourseWare, https://ocw.mit.edu. License: Creative Commons BY-NC-SA.

CHAPTER TEST

Solve the following problems completely.

1. Suppose that $f(x) = e^{-x}$ for 0 < x. Determine the following probabilities:
a. P(1 < X)
b. P(1 < X < 2.5)
c. P(X = 3)
d. P(X < 4)
e. P(3 ≤ X)

2. The probability density function of the length of a metal rod is f(x) = 2 for 2.3 < x < 2.8 meters.
a. If the specifications for this process are from 2.25 to 2.75 meters, what proportion of the bars fail to meet the specifications?
b. Assume that the probability density function is f(x) = 2 for an interval of length 0.5 meters. Over what value should the density be centered to achieve the greatest proportion of bars within specifications?

3. Suppose f(x) = 0.125x for 0 < x < 4. Find the mean and variance of X.

4. Suppose the time it takes a data collection operator to fill out an electronic form for a database is uniformly distributed between 1.5 and 2.2 minutes.
a. What are the mean and variance of the time it takes the operator to fill out the form?
b. What is the probability that it will take less than two minutes to fill out the form?
c. Determine the cumulative distribution function of the time it takes to fill out the form.

5. Suppose that X is a binomial random variable with n = 200 and p = 0.4.
a. Approximate the probability that X is less than or equal to 70.
b. Approximate the probability that X is greater than 70 and less than 90.

Chapter 5 JOINT PROBABILITY DISTRIBUTIONS

Introduction

The study of random variables and their probability distributions in the preceding sections was restricted to one-dimensional sample spaces, in that we recorded outcomes of an experiment as values assumed by a single random variable. However, it is often useful to have more than one random variable defined in a random experiment.

In general, if X and Y are two random variables, the probability distribution that defines their simultaneous behaviour is called a joint probability distribution. In this chapter, we will investigate some important properties of these joint probability distributions.

Intended Learning Outcomes

At the end of this module, it is expected that the students will be able to:
1. Understand and use joint probability mass functions and joint probability density functions to calculate probabilities and calculate marginal probability distributions.
2. Understand and calculate conditional probability distributions from joint probability distributions and assess independence of random variables.
3. Calculate means and variances for linear functions of random variables and calculate probabilities for linear functions of normally distributed random variables.
4. Determine the distribution of a general function of a random variable.

5.1 JOINT PROBABILITY DISTRIBUTIONS FOR TWO RANDOM VARIABLES

In the previous chapter, we studied probability distributions for a single random variable. There will be situations, however, where we may find it desirable to record the simultaneous outcomes of several random variables. For example, we might measure the amount of precipitate P and volume V of gas released from a controlled chemical experiment, giving rise to a two-dimensional sample space consisting of the outcomes (p, v), or we might be interested in the hardness H and tensile strength T of cold-drawn copper, resulting in the outcomes (h, t). In a study to determine the likelihood of success in college based on high school data, we might use a three-dimensional sample space and record for each individual his or her aptitude test score, high school rank, and grade-point average at the end of freshman year in college.

If X and Y are two discrete random variables, the probability distribution for their simultaneous occurrence can be represented by a function with values f(x, y) for any pair of values (x, y) within the range of the random variables X and Y. It is customary to refer to this function as the joint probability distribution of X and Y. Hence, in the discrete case,

$$f(x, y) = P(X = x, Y = y);$$

that is, the values f(x, y) give the probability that outcomes x and y occur at the same time.
For example, if an 18-wheeler is to have its tires serviced and X represents the number of miles these tires have been driven and Y represents the number of tires that need to be replaced, then f(30000, 5) is the probability that the tires are used over 30,000 miles and the truck needs 5 new tires.

Discrete case. The function f(x, y) is a joint probability distribution or probability mass function of the discrete random variables X and Y if
1. $f(x, y) \ge 0$ for all (x, y)
2. $\sum_x \sum_y f(x, y) = 1$
3. $P(X = x, Y = y) = f(x, y)$

For any region A in the xy plane, $P[(X, Y) \in A] = \sum\sum_A f(x, y)$.

Just as the probability mass function of a single random variable X is assumed to be zero at all values outside the range of X, so the joint probability mass function of X and Y is assumed to be zero at all points for which a probability is not specified.

Example 1. Two ballpoint pens are selected at random from a box that contains 3 blue pens, 2 red pens, and 3 green pens. If X is the number of blue pens selected and Y is the number of red pens selected, find
a.) the joint probability function f(x, y);
b.) $P[(X, Y) \in A]$, where A is the region $\{(x, y) \mid x + y \le 1\}$.

Solution: The possible pairs of values (x, y) are (0, 0), (0, 1), (1, 0), (1, 1), (0, 2), and (2, 0).

Now, f(0, 1), for example, represents the probability that a red and a green pen are selected. The total number of equally likely ways of selecting any 2 pens from the 8 is $\binom{8}{2} = 28$. The number of ways of selecting 1 red from 2 red pens and 1 green from 3 green pens is $\binom{2}{1}\binom{3}{1} = 6$. Hence, f(0, 1) = 6/28 = 3/14.

Similar calculations yield the probabilities for the other cases, which are presented in Table 1. Note that the probabilities sum to 1.
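The counting argument used for f(0, 1) can be repeated for every pair (x, y). The sketch below is one way to do this in Python, using exact fractions so the results can be compared with the table; the function name `pen_pmf` and the dictionary `table` are hypothetical names chosen for this illustration.

```python
from fractions import Fraction
from math import comb


def pen_pmf(x: int, y: int) -> Fraction:
    """P(X = x, Y = y): x blue and y red pens among 2 drawn from 3 blue, 2 red, 3 green."""
    green = 2 - x - y  # the remaining pens drawn must be green
    if x < 0 or y < 0 or green < 0:
        return Fraction(0)
    ways = comb(3, x) * comb(2, y) * comb(3, green)
    return Fraction(ways, comb(8, 2))


# Evaluate the pmf on the grid x, y in {0, 1, 2}; impossible pairs get probability 0.
table = {(x, y): pen_pmf(x, y) for x in range(3) for y in range(3)}
total = sum(table.values())

print(table[(0, 1)], total)  # 3/14 1
```

Exact `Fraction` arithmetic is used instead of floats so that the check that the probabilities sum to 1 is an equality, not an approximation.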
It will become clear that the joint probability distribution of Table 1 can be represented by the formula

$$f(x, y) = \frac{\binom{3}{x}\binom{2}{y}\binom{3}{2 - x - y}}{\binom{8}{2}}$$

for x = 0, 1, 2; y = 0, 1, 2; and 0 ≤ x + y ≤ 2.

(b) The probability that (X, Y) falls in the region A is

$$P[(X, Y) \in A] = P(X + Y \le 1) = f(0, 0) + f(0, 1) + f(1, 0) = \frac{3}{28} + \frac{3}{14} + \frac{9}{28} = \frac{9}{14}$$

Table 1. Joint Probability Distribution for Example 1

f(x, y)          x = 0    x = 1    x = 2    Row totals
y = 0            3/28     9/28     3/28     15/28
y = 1            3/14     3/14     0        3/7
y = 2            1/28     0        0        1/28
Column totals    5/14     15/28    3/28     1

Example 2. Suppose we toss a pair of fair, four-sided dice, in which one of the dice is RED and the other is BLACK. Let
X = the outcome on the RED die = {1, 2, 3, 4}
Y = the outcome on the BLACK die = {1, 2, 3, 4}
Find the probability that X takes on a particular value x and Y takes on a particular value y; that is, find P(X = x, Y = y).

Solution: Just as in the case with one discrete random variable, in order to find the "joint probability distribution" of X and Y, we first need to define the support of X and Y. The support of X is

S1 = {1, 2, 3, 4}

and the support of Y is

S2 = {1, 2, 3, 4}

Now, if we let (x, y) denote one of the possible outcomes of one toss of the pair of dice, then certainly (1, 1) is a possible outcome, as are (1, 2), (1, 3) and (1, 4). If we continue to enumerate all of the possible outcomes, we soon see that the joint support S has 16 possible outcomes:

S = {(1,1), (1,2), (1,3), (1,4), (2,1), (2,2), (2,3), (2,4), (3,1), (3,2), (3,3), (3,4), (4,1), (4,2), (4,3), (4,4)}

Now, because the dice are fair, we should expect each of the 16 possible outcomes to be equally likely.
Therefore, using the classical approach to assigning probability, the probability that X equals any particular x value, and Y equals any particular y value, is 1/16. That is, for all (x, y) in the support S:

$$P(X = x, Y = y) = \frac{1}{16}$$

Because we have identified the probability for each (x, y), we have found what we call the joint probability mass function. Perhaps it is not too surprising that the joint probability mass function, which is typically denoted as f(x, y), can be defined as a formula (as we have above) or as a table. Here is what our joint probability mass function would look like in tabular form:

f(x, y)        Black (Y) = 1    2       3       4       f_X(x)
Red (X) = 1    1/16             1/16    1/16    1/16    4/16
Red (X) = 2    1/16             1/16    1/16    1/16    4/16
Red (X) = 3    1/16             1/16    1/16    1/16    4/16
Red (X) = 4    1/16             1/16    1/16    1/16    4/16
f_Y(y)         4/16             4/16    4/16    4/16    1

When X and Y are continuous random variables, the joint density function f(x, y) is a surface lying above the xy plane, and $P[(X, Y) \in A]$, where A is any region in the xy plane, is equal to the volume of the right cylinder bounded by the base A and the surface.

Continuous case. The case where both variables are continuous is obtained easily by analogy with the discrete case on replacing sums by integrals. The resulting function is the joint probability function of the random variables X and Y or, as it is more commonly called, the joint density function of X and Y.

The function f(x, y) is a joint density function of the continuous random variables X and Y if
1. $f(x, y) \ge 0$ for all (x, y)
2. $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$
3. $P[(X, Y) \in A] = \iint_A f(x, y)\,dx\,dy$, for any region A in the xy plane.

Example 1. A privately owned business operates both a drive-in facility and a walk-in facility.
On a randomly selected day, let X and Y, respectively, be the proportions of the time that the drive-in and the walk-in facilities are in use, and suppose that the joint density function of these random variables is

$$f(x, y) = \begin{cases} \dfrac{2}{5}(2x + 3y), & 0 \le x \le 1,\ 0 \le y \le 1, \\ 0, & \text{elsewhere}. \end{cases}$$

Find the following:
(a) Verify the condition $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$ for a joint density function.
(b) Find $P[(X, Y) \in A]$, where $A = \{(x, y) \mid 0 < x < \tfrac{1}{2},\ \tfrac{1}{4} < y < \tfrac{1}{2}\}$.

Solution:
(a) The integration of f(x, y) over the whole region is

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dx\,dy = \int_0^1\int_0^1 \frac{2}{5}(2x + 3y)\,dx\,dy = \int_0^1 \left.\left(\frac{2x^2}{5} + \frac{6xy}{5}\right)\right|_{x=0}^{x=1} dy$$

$$= \int_0^1 \left(\frac{2}{5} + \frac{6y}{5}\right) dy = \left.\left(\frac{2y}{5} + \frac{3y^2}{5}\right)\right|_0^1 = \frac{2}{5} + \frac{3}{5} = 1$$

(b) To calculate the probability, we use

$$P[(X, Y) \in A] = P\left(0 < X < \frac{1}{2},\ \frac{1}{4} < Y < \frac{1}{2}\right) = \int_{1/4}^{1/2}\int_0^{1/2} \frac{2}{5}(2x + 3y)\,dx\,dy$$

$$= \int_{1/4}^{1/2} \left.\left(\frac{2x^2}{5} + \frac{6xy}{5}\right)\right|_{x=0}^{x=1/2} dy = \int_{1/4}^{1/2} \left(\frac{1}{10} + \frac{3y}{5}\right) dy = \left.\left(\frac{y}{10} + \frac{3y^2}{10}\right)\right|_{1/4}^{1/2}$$

$$= \frac{1}{10}\left[\left(\frac{1}{2} + \frac{3}{4}\right) - \left(\frac{1}{4} + \frac{3}{16}\right)\right] = \frac{13}{160}$$

Practice Problem:
1. From a sack of fruit containing 3 oranges, 2 apples, and 3 bananas, a random sample of 4 pieces of fruit is selected. If X is the number of oranges and Y is the number of apples in the sample, find
(a) the joint probability distribution of X and Y;
(b) P[(X, Y) ∈ A], where A is the region that is given by {(x, y) | x + y ≤ 2}.
2. Determine the values of c so that the following functions represent joint probability distributions of the random variables X and Y:
(a) f(x, y) = cxy, for x = 1, 2, 3; y = 1, 2, 3;
(b) f(x, y) = c|x − y|, for x = −2, 0, 2; y = −2, 3.
3. If the joint probability distribution of X and Y is given by

$$f(x, y) = \frac{x + y}{30}, \quad \text{for } x = 0, 1, 2, 3;\ y = 0, 1, 2,$$

find
(a) P(X ≤ 2, Y = 1);
(b) P(X > 2, Y ≤ 1);
(c) P(X > Y);
(d) P(X + Y = 4).

Marginal Probability Distributions. If more than one random variable is defined in a random experiment, it is important to distinguish between the joint probability distribution of X and Y and the probability distribution of each variable individually. The individual probability distribution of a random variable is referred to as its marginal probability distribution.

The marginal probability distribution of X can be determined from the joint probability distribution of X and other random variables. For example, consider discrete random variables X and Y. To determine P(X = x), we sum P(X = x, Y = y) over all points in the range of (X, Y) for which X = x. Subscripts on the probability mass functions distinguish between the random variables.

For continuous random variables, an analogous approach is used to determine marginal probability distributions. In the continuous case, an integral replaces the sum.

The marginal distributions of X alone and of Y alone are

$$g(x) = \sum_y f(x, y) \quad \text{and} \quad h(y) = \sum_x f(x, y)$$

for the discrete case, and

$$g(x) = \int_{-\infty}^{\infty} f(x, y)\,dy \quad \text{and} \quad h(y) = \int_{-\infty}^{\infty} f(x, y)\,dx$$

for the continuous case.

The term marginal is used here because, in the discrete case, the values of g(x) and h(y) are just the marginal totals of the respective columns and rows when the values of f(x, y) are displayed in a rectangular table.

Example 1. Show that the column and row totals of Table 1 give the marginal distribution of X alone and of Y alone.
Solution: For the random variable X, we see that

$$g(0) = f(0, 0) + f(0, 1) + f(0, 2) = \frac{3}{28} + \frac{3}{14} + \frac{1}{28} = \frac{5}{14},$$

$$g(1) = f(1, 0) + f(1, 1) + f(1, 2) = \frac{9}{28} + \frac{3}{14} + 0 = \frac{15}{28},$$

$$g(2) = f(2, 0) + f(2, 1) + f(2, 2) = \frac{3}{28} + 0 + 0 = \frac{3}{28},$$

which are just the column totals of Table 1. In a similar manner we could show that the values of h(y) are given by the row totals. In tabular form, these marginal distributions may be written as follows:

x       0        1        2
g(x)    5/14     15/28    3/28

y       0        1        2
h(y)    15/28    3/7      1/28

Example 2. Find g(x) and h(y) for the joint density function of the drive-in and walk-in facilities example (Example 1 of the continuous case).

Solution: By definition, we have

$$g(x) = \int_{-\infty}^{\infty} f(x, y)\,dy = \int_0^1 \frac{2}{5}(2x + 3y)\,dy = \left.\left(\frac{4xy}{5} + \frac{6y^2}{10}\right)\right|_{y=0}^{y=1} = \frac{4x + 3}{5}$$

for 0 ≤ x ≤ 1, and g(x) = 0 elsewhere. Similarly, we have

$$h(y) = \int_{-\infty}^{\infty} f(x, y)\,dx = \int_0^1 \frac{2}{5}(2x + 3y)\,dx = \frac{2(1 + 3y)}{5}$$

for 0 ≤ y ≤ 1, and h(y) = 0 elsewhere.

Practice Problem
1. A fast-food restaurant operates both a drive-through facility and a walk-in facility. On a randomly selected day, let X and Y, respectively, be the proportions of the time that the drive-through and walk-in facilities are in use, and suppose that the joint density function of these random variables is

$$f(x, y) = \begin{cases} \dfrac{2}{3}(x + 2y), & 0 \le x \le 1,\ 0 \le y \le 1, \\ 0, & \text{elsewhere}. \end{cases}$$

(a) Find the marginal density of X.
(b) Find the marginal density of Y.
(c) Find the probability that the drive-through facility is busy less than one-half of the time.

Conditional Probability Distribution. To compute conditional probabilities effectively, we use a special type of distribution of the form f(x, y)/g(x).

Let X and Y be two random variables, discrete or continuous.
The conditional distribution of the random variable Y given that X = x is

$$f(y \mid x) = \frac{f(x, y)}{g(x)}, \quad \text{provided } g(x) > 0.$$

Similarly, the conditional distribution of X given that Y = y is

$$f(x \mid y) = \frac{f(x, y)}{h(y)}, \quad \text{provided } h(y) > 0.$$

If we wish to find the probability that the discrete random variable X falls between a and b when it is known that the discrete variable Y = y, we evaluate

$$P(a < X < b \mid Y = y) = \sum_{a < x < b} f(x \mid y)$$

where the summation extends over all values of X between a and b. When X and Y are continuous, we evaluate

$$P(a < X < b \mid Y = y) = \int_a^b f(x \mid y)\,dx$$

Example 1. Referring to Example 1 of the discrete case (the ballpoint pens), find the conditional distribution of X, given that Y = 1, and use it to determine P(X = 0 | Y = 1).

Solution: We need to find f(x | y), where y = 1. First, we find that

$$h(1) = \sum_{x=0}^{2} f(x, 1) = \frac{3}{14} + \frac{3}{14} + 0 = \frac{3}{7}$$

Now

$$f(x \mid 1) = \frac{f(x, 1)}{h(1)} = \left(\frac{7}{3}\right) f(x, 1), \quad x = 0, 1, 2$$

Therefore,

$$f(0 \mid 1) = \left(\frac{7}{3}\right) f(0, 1) = \left(\frac{7}{3}\right)\left(\frac{3}{14}\right) = \frac{1}{2},$$

$$f(1 \mid 1) = \left(\frac{7}{3}\right) f(1, 1) = \left(\frac{7}{3}\right)\left(\frac{3}{14}\right) = \frac{1}{2},$$

$$f(2 \mid 1) = \left(\frac{7}{3}\right) f(2, 1) = \left(\frac{7}{3}\right)(0) = 0$$

and the conditional distribution of X, given that Y = 1, is shown below.

x           0      1      2
f(x | 1)    1/2    1/2    0

Finally,

$$P(X = 0 \mid Y = 1) = f(0 \mid 1) = \frac{1}{2}$$

Therefore, if it is known that 1 of the 2 pens selected is red, we have a probability equal to 1/2 that the other pen is not blue.

Statistical Independence. If f(x | y) does not depend on y, then f(x | y) = g(x) and f(x, y) = g(x) h(y). The proof follows by substituting the equation below into the marginal distribution of X.
$$f(x, y) = f(x \mid y)\,h(y)$$

That is,

$$g(x) = \int_{-\infty}^{\infty} f(x, y)\,dy = \int_{-\infty}^{\infty} f(x \mid y)\,h(y)\,dy$$

If f(x | y) does not depend on y, we may write

$$g(x) = f(x \mid y)\int_{-\infty}^{\infty} h(y)\,dy$$

Now

$$\int_{-\infty}^{\infty} h(y)\,dy = 1$$

since h(y) is the probability density function of Y. Therefore

$$g(x) = f(x \mid y) \quad \text{and then} \quad f(x, y) = g(x)\,h(y)$$

It should make sense to the reader that if f(x | y) does not depend on y, then the outcome of the random variable Y has no impact on the outcome of the random variable X. In other words, we say that X and Y are independent random variables. We now offer the following formal definition of statistical independence.

Let X and Y be two random variables, discrete or continuous, with joint probability distribution f(x, y) and marginal distributions g(x) and h(y), respectively. The random variables X and Y are said to be statistically independent if and only if

$$f(x, y) = g(x)\,h(y)$$

for all (x, y) within their range.

Checking for statistical independence of discrete random variables requires a more thorough investigation, since it is possible to have the product of the marginal distributions equal to the joint probability distribution for some but not all combinations of (x, y). If you can find any point (x, y) for which f(x, y) is defined such that f(x, y) ≠ g(x) h(y), the discrete variables X and Y are not statistically independent.

Example 1. Show that the random variables of Example 1 of the discrete case (the ballpoint pens) are not statistically independent.

Solution: Let us consider the point (0, 1). From Table 1, we find the three probabilities f(0, 1), g(0), and h(1) to be

$$f(0, 1) = \frac{3}{14}$$

$$g(0) = \sum_{y=0}^{2} f(0, y) = \frac{3}{28} + \frac{3}{14} + \frac{1}{28} = \frac{5}{14}$$

$$h(1) = \sum_{x=0}^{2} f(x, 1) = \frac{3}{14} + \frac{3}{14} + 0 = \frac{3}{7}$$

Clearly,

$$f(0, 1) \ne g(0)\,h(1)$$

and therefore X and Y are not statistically independent.
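The point-by-point check described above is easy to automate for a discrete table. The following sketch hard-codes the joint probabilities of the ballpoint-pen example, computes both marginals, and lists every point where f(x, y) differs from g(x) h(y). The variable names (`joint`, `g`, `h`, `violations`) are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction as F

# Joint pmf of the ballpoint-pen example, keyed by (x, y).
# Pairs that are not listed have probability 0.
joint = {
    (0, 0): F(3, 28), (1, 0): F(9, 28), (2, 0): F(3, 28),
    (0, 1): F(3, 14), (1, 1): F(3, 14),
    (0, 2): F(1, 28),
}

xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

# Marginals: sum the joint pmf over the other variable.
g = {x: sum(joint.get((x, y), F(0)) for y in ys) for x in xs}
h = {y: sum(joint.get((x, y), F(0)) for x in xs) for y in ys}

# Independence requires f(x, y) == g(x) * h(y) at every point of the grid.
violations = [(x, y) for x in xs for y in ys
              if joint.get((x, y), F(0)) != g[x] * h[y]]
independent = not violations

print(g[0], h[1], independent)  # 5/14 3/7 False
```

A single violating point is enough to rule out independence, which is why the worked solution only needs to examine (0, 1).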
All the preceding definitions concerning two random variables can be generalized to the case of n random variables. Let $f(x_1, x_2, \ldots, x_n)$ be the joint probability function of the random variables $X_1, X_2, \ldots, X_n$. The marginal distribution of $X_1$, for example, is

$$g(x_1) = \sum_{x_2} \cdots \sum_{x_n} f(x_1, x_2, \ldots, x_n)$$

for the discrete case, and

$$g(x_1) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, x_2, \ldots, x_n)\,dx_2\,dx_3 \cdots dx_n$$

for the continuous case. We can now obtain joint marginal distributions such as $g(x_1, x_2)$, shown below:

$$g(x_1, x_2) = \begin{cases} \displaystyle\sum_{x_3} \cdots \sum_{x_n} f(x_1, x_2, \ldots, x_n) & \text{(discrete case)}, \\[2ex] \displaystyle\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, x_2, \ldots, x_n)\,dx_3\,dx_4 \cdots dx_n & \text{(continuous case)}. \end{cases}$$

We could consider numerous conditional distributions. For example, the joint conditional distribution of $X_1$, $X_2$ and $X_3$, given that $X_4 = x_4, X_5 = x_5, \ldots, X_n = x_n$, is written

$$f(x_1, x_2, x_3 \mid x_4, x_5, \ldots, x_n) = \frac{f(x_1, x_2, \ldots, x_n)}{g(x_4, x_5, \ldots, x_n)}$$

where $g(x_4, x_5, \ldots, x_n)$ is the joint marginal distribution of the random variables $X_4, X_5, \ldots, X_n$.

This leads to the following definition for the mutual statistical independence of the variables $X_1, X_2, \ldots, X_n$.

Let $X_1, X_2, \ldots, X_n$ be n random variables, discrete or continuous, with joint probability distribution $f(x_1, x_2, \ldots, x_n)$ and marginal distributions $f_1(x_1), f_2(x_2), \ldots, f_n(x_n)$, respectively. The random variables $X_1, X_2, \ldots, X_n$ are said to be mutually statistically independent if and only if

$$f(x_1, x_2, \ldots, x_n) = f_1(x_1)\,f_2(x_2) \cdots f_n(x_n)$$

for all $(x_1, x_2, \ldots, x_n)$ within their range.
For example, suppose that the shelf life, in years, of a certain perishable food product packaged in cardboard containers is a random variable whose probability density function is given by

$$f(x) = \begin{cases} e^{-x}, & x > 0, \\ 0, & \text{elsewhere.} \end{cases}$$

Let $X_1$, $X_2$, and $X_3$ represent the shelf lives of three of these containers selected independently, and find $P(X_1 < 2,\ 1 < X_2 < 3,\ X_3 > 2)$.

Since the containers were selected independently, we can assume that the random variables $X_1$, $X_2$, and $X_3$ are statistically independent, having the joint probability density

$$f(x_1, x_2, x_3) = f(x_1)\,f(x_2)\,f(x_3) = e^{-x_1} e^{-x_2} e^{-x_3} = e^{-x_1 - x_2 - x_3}$$

for $x_1 > 0$, $x_2 > 0$, $x_3 > 0$, and $f(x_1, x_2, x_3) = 0$ elsewhere. Hence

$$P(X_1 < 2,\ 1 < X_2 < 3,\ X_3 > 2) = \int_{2}^{\infty} \int_{1}^{3} \int_{0}^{2} e^{-x_1 - x_2 - x_3}\,dx_1\,dx_2\,dx_3 = (1 - e^{-2})(e^{-1} - e^{-3})\,e^{-2} = 0.0372.$$

Practice Problem:

1. The amount of kerosene, in thousands of liters, in a tank at the beginning of any day is a random amount Y from which a random amount X is sold during that day. Suppose that the tank is not resupplied during the day so that $x \le y$, and assume that the joint density function of these variables is

$$f(x, y) = \begin{cases} 2, & 0 < x \le y < 1, \\ 0, & \text{elsewhere.} \end{cases}$$

(a) Determine if X and Y are independent.

2. Let the random variable X denote the time until a computer server connects to your machine (in milliseconds), and let Y denote the time until the server authorizes you as a valid user (in milliseconds). Each of these random variables measures the wait from a common starting time, and X < Y. Assume that the joint probability density function for X and Y is given by the equation below. Determine the conditional probability density function for Y given that X = x.

$$f_{XY}(x, y) = 6 \times 10^{-6}\, e^{-0.001x - 0.002y} \quad \text{for } x < y$$

Joint Probability Distributions for More Than Two Random Variables

More than two random variables can be defined in a random experiment.
Results for multiple random variables are straightforward extensions of those for two random variables. The joint probability distribution of random variables $X_1, X_2, X_3, \ldots, X_p$ can be specified with a method to calculate the probability that $X_1, X_2, X_3, \ldots, X_p$ assume a value in any region A of p-dimensional space. For continuous random variables, a joint probability density function $f_{X_1 X_2 \ldots X_p}(x_1, x_2, \ldots, x_p)$ is used to determine the probability that $(X_1, X_2, X_3, \ldots, X_p) \in A$ by the multiple integral of $f_{X_1 X_2 \ldots X_p}(x_1, x_2, \ldots, x_p)$ over the region A.

• A joint probability density function for the continuous random variables $X_1, X_2, X_3, \ldots, X_p$, denoted as $f_{X_1 X_2 \ldots X_p}(x_1, x_2, \ldots, x_p)$, satisfies the following properties:

1. $f_{X_1 X_2 \ldots X_p}(x_1, x_2, \ldots, x_p) \ge 0$ for all $(x_1, x_2, \ldots, x_p)$

2. $\displaystyle\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{X_1 X_2 \ldots X_p}(x_1, x_2, \ldots, x_p)\,dx_1\,dx_2 \cdots dx_p = 1$

3. $\displaystyle P[(X_1, X_2, \ldots, X_p) \in A] = \int \cdots \int_{A} f_{X_1 X_2 \ldots X_p}(x_1, x_2, \ldots, x_p)\,dx_1\,dx_2 \cdots dx_p$ for any region A of p-dimensional space.

• If the joint probability density function of continuous random variables $X_1, X_2, \ldots, X_p$ is $f_{X_1 X_2 \ldots X_p}(x_1, x_2, \ldots, x_p)$, the marginal probability density function of $X_i$ is

$$f_{X_i}(x_i) = \int \cdots \int f_{X_1 X_2 \ldots X_p}(x_1, x_2, \ldots, x_p)\,dx_1\,dx_2 \cdots dx_{i-1}\,dx_{i+1} \cdots dx_p,$$

where the integral is over all points in the range of $X_1, X_2, \ldots, X_p$ for which $X_i = x_i$.

• Conditional probability distributions can be developed for multiple random variables by an extension of the ideas used for two random variables.
For example, the joint conditional probability distribution of $X_1$, $X_2$, and $X_3$ given $(X_4 = x_4, X_5 = x_5)$ is

$$f_{X_1 X_2 X_3 \mid x_4 x_5}(x_1, x_2, x_3) = \frac{f_{X_1 X_2 X_3 X_4 X_5}(x_1, x_2, x_3, x_4, x_5)}{f_{X_4 X_5}(x_4, x_5)} \quad \text{for } f_{X_4 X_5}(x_4, x_5) > 0.$$

The concept of independence can be extended to multiple random variables.

• Random variables $X_1, X_2, \ldots, X_p$ are independent if and only if

$$f_{X_1 X_2 \ldots X_p}(x_1, x_2, \ldots, x_p) = f_{X_1}(x_1)\,f_{X_2}(x_2) \cdots f_{X_p}(x_p) \quad \text{for all } x_1, x_2, \ldots, x_p.$$

5.2 Linear Functions of Random Variables

A random variable is sometimes defined as a function of one or more random variables. Results for linear functions are important. For example, if the random variables $X_1$ and $X_2$ denote the length and width, respectively, of a manufactured part, then $Y = 2X_1 + 2X_2$ is a random variable that represents the perimeter of the part. In this section, we develop results for random variables that are linear functions of random variables.

Given random variables $X_1, X_2, \ldots, X_p$ and constants $c_0, c_1, c_2, \ldots, c_p$, the equation below is a linear function of $X_1, X_2, \ldots, X_p$:

$$Y = c_0 + c_1 X_1 + c_2 X_2 + \cdots + c_p X_p$$

• Mean of a Linear Function

If $Y = c_0 + c_1 X_1 + c_2 X_2 + \cdots + c_p X_p$, then

$$E(Y) = c_0 + c_1 E(X_1) + c_2 E(X_2) + \cdots + c_p E(X_p).$$

• Variance of a Linear Function

If $X_1, X_2, \ldots, X_p$ are random variables and $Y = c_0 + c_1 X_1 + c_2 X_2 + \cdots + c_p X_p$, then in general,

$$V(Y) = c_1^2 V(X_1) + c_2^2 V(X_2) + \cdots + c_p^2 V(X_p) + 2 \sum_{i<j} \sum c_i c_j \operatorname{cov}(X_i, X_j).$$

If $X_1, X_2, \ldots, X_p$ are independent,

$$V(Y) = c_1^2 V(X_1) + c_2^2 V(X_2) + \cdots + c_p^2 V(X_p).$$

Example 1. A semiconductor product consists of three layers.
Suppose that the variances in thickness of the first, second, and third layers are 25, 40, and 30 square nanometers, respectively, and that the layer thicknesses are independent. What is the variance of the thickness of the final product?

Solution: Let $X_1$, $X_2$, $X_3$, and X be random variables that denote the thicknesses of the respective layers and the final product. Then

$$X = X_1 + X_2 + X_3.$$

The variance of X is

$$V(X) = V(X_1) + V(X_2) + V(X_3) = 25 + 40 + 30 = 95\ \text{nm}^2.$$

Consequently, the standard deviation of the thickness of the final product is $95^{1/2} = 9.75$ nm, which shows how the variation in each layer is propagated to the final product.

Practice Problem:

1. The expected value of the random variable X is E(X) = 10. Find E(3X + 4).

2. Given the following probability function of the random variable X, find E(X), VAR(X), E(3X + 4), and VAR(3X + 4).

x           5     10    15
P(X = x)    0.2   0.5   0.3

3. A lot containing 7 components is sampled by a quality inspector; the lot contains 4 good components and 3 defective components. A sample of 3 is taken by the inspector. Find the expected value of the number of good components in this sample.

5.3 General Functions of Random Variables

Suppose that X is a discrete random variable with probability distribution $f_X(x)$. Let Y = h(X) be a function of X that defines a one-to-one transformation between the values of X and Y, and suppose that we wish to find the probability distribution of Y. By a one-to-one transformation, we mean that each value x is related to one and only one value of y = h(x) and that each value of y is related to one and only one value of x, say, x = u(y), where u(y) is found by solving y = h(x) for x in terms of y. Now the random variable Y takes on the value y when X takes on the value u(y).
Therefore, the probability distribution of Y is

$$f_Y(y) = P(Y = y) = P[X = u(y)] = f_X[u(y)].$$

• General Function of a Discrete Random Variable

Suppose that X is a discrete random variable with probability distribution $f_X(x)$. Let Y = h(X) define a one-to-one transformation between the values of X and Y so that the equation y = h(x) can be solved uniquely for x in terms of y. Let this solution be x = u(y). Then the probability mass function of the random variable Y is

$$f_Y(y) = f_X[u(y)].$$

Example 1. Let X be a geometric random variable with probability distribution

$$f_X(x) = p(1 - p)^{x-1}, \quad x = 1, 2, \ldots$$

Find the probability distribution of $Y = X^2$.

Solution: Because X takes only positive values, the transformation is one to one; that is, $y = x^2$ and $x = \sqrt{y}$. Therefore, the distribution of the random variable Y is

$$f_Y(y) = f_X(\sqrt{y}) = p(1 - p)^{\sqrt{y} - 1}, \quad y = 1, 4, 9, 16, \ldots$$

• General Function of a Continuous Random Variable

Suppose that X is a continuous random variable with probability distribution $f_X(x)$. The function Y = h(X) is a one-to-one transformation between the values of Y and X, so that the equation y = h(x) can be uniquely solved for x in terms of y. Let the solution be x = u(y). The probability distribution of Y is

$$f_Y(y) = f_X[u(y)]\,|J|,$$

where $J = u'(y)$ is called the Jacobian of the transformation and the absolute value of J is used.
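The continuous change-of-variable rule can be checked numerically. The sketch below assumes a hypothetical density $f_X(x) = x/8$ on $0 \le x \le 4$ and the transformation $Y = 2X + 4$, neither of which comes from this module; the inverse is $u(y) = (y - 4)/2$ with Jacobian $J = 1/2$. The `integrate` helper is a plain midpoint rule written for this illustration.

```python
def f_X(x):
    """Hypothetical density of X: x/8 on [0, 4], zero elsewhere."""
    return x / 8 if 0 <= x <= 4 else 0.0

def u(y):
    """Inverse of the assumed transformation y = h(x) = 2x + 4."""
    return (y - 4) / 2

J = 0.5  # Jacobian u'(y) of the inverse transformation

def f_Y(y):
    """Density of Y from the rule f_Y(y) = f_X[u(y)] * |J|."""
    return f_X(u(y)) * abs(J)

def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    w = (b - a) / n
    return sum(f(a + (k + 0.5) * w) for k in range(n)) * w

total = integrate(f_Y, 4, 12)  # area under f_Y over the range of Y
p_y = integrate(f_Y, 4, 8)     # P(Y <= 8)
p_x = integrate(f_X, 0, 2)     # P(X <= 2), the same event expressed in X

print(round(total, 6), round(p_y, 6), round(p_x, 6))
```

The transformed density integrates to 1, and an event gets the same probability whether it is computed from $f_Y$ or from $f_X$. Leaving out $|J|$ would break both checks.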
REFERENCES:

Walpole, Ronald E., et al., Probability and Statistics for Engineers and Scientists, 9th ed., Pearson Education Inc., 2016
Montgomery, Douglas C., et al., Applied Statistics and Probability for Engineers, 7th ed., John Wiley & Sons (Asia) Pte Ltd, 2018
Spiegel, Murray R., et al., Probability and Statistics, 4th ed., McGraw Hill Companies Inc., 2013
https://online.stat.psu.edu/stat414/lesson/17/17.1
http://bestmaths.net/online/index.php/year-levels/year-12/year-12-topic-list/functions-randomvariables/

CHAPTER TEST

Solve the following problems completely.

1. A candy company distributes boxes of chocolates with a mixture of creams, toffees, and cordials. Suppose that the weight of each box is 1 kilogram, but the individual weights of the creams, toffees, and cordials vary from box to box. For a randomly selected box, let X and Y represent the weights of the creams and the toffees, respectively, and suppose that the joint density function of these variables is

$$f(x, y) = \begin{cases} 24xy, & 0 \le x \le 1,\ 0 \le y \le 1,\ x + y \le 1, \\ 0, & \text{elsewhere.} \end{cases}$$

(a) Find the probability that in a given box the cordials account for more than 1/2 of the weight.
(b) Find the marginal density for the weight of the creams.
(c) Find the probability that the weight of the toffees in a box is less than 1/8 of a kilogram if it is known that creams constitute 3/4 of the weight.

2. Given the joint density function below, find P(1 < Y < 3 | X = 1).

$$f(x, y) = \begin{cases} \dfrac{6 - x - y}{8}, & 0 < x < 2,\ 2 < y < 4, \\ 0, & \text{elsewhere.} \end{cases}$$

3. An industrial process manufactures items that can be classified as either defective or not defective. The probability that an item is defective is 0.1. An experiment is conducted in which 5 items are drawn randomly from the process.
Let the random variable X be the number of defectives in this sample of 5. What is the probability mass function of X?

4. Consider the random variables X and Y that represent the number of vehicles that arrive at two separate street corners during a certain 2-minute period. These street corners are fairly close together, so it is important that traffic engineers deal with them jointly if necessary. The joint distribution of X and Y is known to be

$$f(x, y) = \frac{9}{16} \cdot \frac{1}{4^{(x+y)}}$$

for x = 0, 1, 2, ... and y = 0, 1, 2, ...

(a) Are the two random variables X and Y independent? Explain why or why not.
(b) What is the probability that during the time period in question fewer than 4 vehicles arrive at the two street corners?

5. Three cards are drawn without replacement from the 12 face cards (jacks, queens, and kings) of an ordinary deck of 52 playing cards. Let X be the number of kings selected and Y the number of jacks. Find the following:

(a) The joint probability distribution of X and Y;
(b) P[(X, Y) ∈ A], where A is the region given by {(x, y) | x + y ≥ 2}.

Chapter 6
Sampling Distributions and Point Estimation of Parameters

Introduction

Statistical methods are used to make decisions and draw conclusions about populations. This aspect of statistics is generally called statistical inference. These techniques use the information in a sample to draw conclusions. This chapter covers the statistical methods used in decision making.

One major area of statistical inference is parameter estimation. In practice, the engineer will use sample data to compute a number that is in some sense a reasonable value (a good guess) of the true population mean. This number is called a point estimate.
In this chapter, we will see that procedures are available for developing point estimates of parameters that have good statistical properties.

Intended Learning Outcomes

At the end of this module, it is expected that the students will be able to:

1. Explain and understand the general concepts of estimating the parameters of a population or a probability distribution.
2. Calculate and explain the important role of the normal distribution as a sampling distribution and the central limit theorem.
3. Solve and explain important properties of point estimators, including bias, variance, standard error, and mean square error.

6.1 Point Estimation

Statistical inference always focuses on drawing conclusions about one or more parameters of a population. An important part of this process is obtaining estimates of the parameters. Suppose that we want to obtain a point estimate (a reasonable value) of a population parameter. We know that before the data are collected, the observations are considered to be random variables, say, $X_1, X_2, \ldots, X_n$. Therefore, any function of the observations, or any statistic, is also a random variable. For example, the sample mean $\bar{X}$ and the sample variance $S^2$ are statistics and random variables.

A simple way to visualize this is as follows. Suppose we take a sample of n = 10 observations from a population and compute the sample average, getting the result $\bar{x} = 10.2$. Now we repeat this process, taking a second sample of n = 10 observations from the same population, and the resulting sample average is 10.4. The sample average depends on the observations in the sample, which differ from sample to sample because they are random variables. Consequently, the sample average (or any other function of the sample data) is a random variable.

Because a statistic is a random variable, it has a probability distribution.
We call the probability distribution of a statistic a sampling distribution. The notion of a sampling distribution is very important and is discussed and illustrated later in the chapter.

When discussing inference problems, it is convenient to have a general symbol to represent the parameter of interest. We use the Greek symbol θ (theta) to represent the parameter. The symbol θ can represent the mean μ, the variance $\sigma^2$, or any parameter of interest to us. The objective of point estimation is to select a single number based on sample data that is the most plausible value for θ. The numerical value of a sample statistic is used as the point estimate.

In general, if X is a random variable with probability distribution f(x), characterized by the unknown parameter θ, and if $X_1, X_2, \ldots, X_n$ is a random sample of size n from X, the statistic $\hat{\Theta} = h(X_1, X_2, \ldots, X_n)$ is called a point estimator of θ. Note that $\hat{\Theta}$ is a random variable because it is a function of random variables. After the sample has been selected, $\hat{\Theta}$ takes on a particular numerical value $\hat{\theta}$ called the point estimate of θ.

• A point estimate of some population parameter θ is a single numerical value $\hat{\theta}$ of a statistic $\hat{\Theta}$. The statistic $\hat{\Theta}$ is called the point estimator.

Point estimation is the process of using the available data to estimate the unknown value of a parameter when some representative statistical model has been proposed for the variation observed in some chance phenomenon.

As an example, suppose that the random variable X is normally distributed with an unknown mean μ. The sample mean is a point estimator of the unknown population mean μ. That is, $\hat{\mu} = \bar{X}$. After the sample has been selected, the numerical value $\bar{x}$ is the point estimate of μ.
Thus, if $x_1 = 25$, $x_2 = 30$, $x_3 = 29$, and $x_4 = 31$, the point estimate of μ is

$$\bar{x} = \frac{25 + 30 + 29 + 31}{4} = 28.75.$$

Similarly, if the population variance $\sigma^2$ is also unknown, a point estimator for $\sigma^2$ is the sample variance $S^2$, and the numerical value $s^2 = 6.9$ calculated from the sample data is called the point estimate of $\sigma^2$.

Estimation problems occur frequently in engineering. We often need to estimate

• The mean μ of a single population
• The variance $\sigma^2$ (or standard deviation σ) of a single population
• The proportion p of items in a population that belong to a class of interest
• The difference in means of two populations, $\mu_1 - \mu_2$
• The difference in two population proportions, $p_1 - p_2$

Practice Problem:

1. Let X be the height of a randomly chosen individual from a population. In order to estimate the mean and variance of X, we observe a random sample $X_1, X_2, \ldots, X_7$. We obtain the following values (in centimeters):

166.8, 171.4, 169.1, 178.5, 168.0, 157.9, 170.1

Find the values of the sample mean, the sample variance, and the sample standard deviation for the observed sample.

6.2 Sampling Distributions and the Central Limit Theorem

The field of statistical inference is basically concerned with generalizations and predictions. For example, we might claim, based on the opinions of several people interviewed on the street, that in a forthcoming election 60% of the eligible voters in the city of Detroit favor a certain candidate. In this case, we are dealing with a random sample of opinions from a very large finite population.
As a second illustration, we might state that the average cost to build a residence in Charleston, South Carolina, is between $330,000 and $335,000, based on the estimates of 3 contractors selected at random from the 30 now building in this city. The population being sampled here is again finite but very small. Finally, let us consider a soft-drink machine designed to dispense, on average, 240 millilitres per drink. A company official who computes the mean of 40 drinks obtains $\bar{x} = 236$ millilitres and, on the basis of this value, decides that the machine is still dispensing drinks with an average content of μ = 240 millilitres. The 40 drinks represent a sample from the infinite population of possible drinks that will be dispensed by this machine.

• Random Sample

The observations in a sample are usually assumed to be independent and identically distributed random variables. Such random variables are known as a random sample. The random variables $X_1, X_2, \ldots, X_n$ are a random sample of size n if (a) the $X_i$'s are independent random variables and (b) every $X_i$ has the same probability distribution.

• Statistic

A statistic is any function of the observations in a random sample. We have encountered statistics before. For example, if $X_1, X_2, \ldots, X_n$ is a random sample of size n, the sample mean $\bar{X}$, the sample variance $S^2$, and the sample standard deviation S are statistics. Because a statistic is a random variable, it has a probability distribution.

• Sampling Distribution

The probability distribution of a statistic is called a sampling distribution. The sampling distribution of a statistic depends on the distribution of the population, the size of the samples, and the method of choosing the samples. The probability distribution of $\bar{X}$ is called the sampling distribution of the mean.
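For a very small finite population sampled with replacement, the sampling distribution of the mean can be listed exactly by enumerating every possible sample. The sketch below uses a hypothetical population {1, 2, 3, 4} with equally likely values and samples of size 2. These numbers are chosen only for illustration and do not come from the module.

```python
from collections import Counter
from fractions import Fraction as F
from itertools import product

population = [1, 2, 3, 4]  # hypothetical population, each value equally likely
n = 2                      # sample size, drawn with replacement

# Every ordered sample of size n is equally likely.
samples = list(product(population, repeat=n))

# Sampling distribution of the mean: P(sample mean = m) for each attainable m.
counts = Counter(F(sum(s), n) for s in samples)
dist = {m: F(c, len(samples)) for m, c in sorted(counts.items())}

# Mean and variance of the sampling distribution.
mean_of_means = sum(m * p for m, p in dist.items())
var_of_means = sum((m - mean_of_means) ** 2 * p for m, p in dist.items())

# Population mean and variance, for comparison.
mu = F(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

for m, p in dist.items():
    print(m, p)
print(mean_of_means, mu, var_of_means, sigma2 / n)
```

In this sketch the mean of the sampling distribution equals the population mean, and its variance equals the population variance divided by n.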
Consider determining the sampling distribution of the sample mean $\bar{X}$. Suppose that a random sample of size n is taken from a normal population with mean μ and variance $\sigma^2$. Now each observation in this sample, say, $X_1, X_2, \ldots, X_n$, is a normally and independently distributed random variable with mean μ and variance $\sigma^2$. Then, because linear functions of independent, normally distributed random variables are also normally distributed, as discussed in the previous chapters, we conclude that the sample mean

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

has a normal distribution with mean

$$\mu_{\bar{X}} = \frac{\mu + \mu + \cdots + \mu}{n} = \mu$$

and variance

$$\sigma_{\bar{X}}^2 = \frac{\sigma^2 + \sigma^2 + \cdots + \sigma^2}{n^2} = \frac{\sigma^2}{n}.$$

Central Limit Theorem

If we are sampling from a population that has an unknown probability distribution, the sampling distribution of the sample mean will still be approximately normal with mean μ and variance $\sigma^2/n$ if the sample size n is large. This is one of the most useful theorems in statistics, called the central limit theorem.

If $X_1, X_2, \ldots, X_n$ is a random sample of size n taken from a population (either finite or infinite) with mean μ and finite variance $\sigma^2$, and if $\bar{X}$ is the sample mean, the limiting form of the distribution of

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

as n → ∞ is the standard normal distribution.

Figure 1. Illustration of the Central Limit Theorem (distribution of $\bar{X}$ for n = 1, moderate n, and large n)

Figure 1 illustrates how the theorem works. It shows how the distribution of $\bar{X}$ becomes closer to normal as n grows larger, beginning with the clearly nonsymmetric distribution of an individual observation (n = 1). It also illustrates that the mean of $\bar{X}$ remains μ for any sample size and that the variance of $\bar{X}$ gets smaller as n increases.

Example 1.
An electrical firm manufactures light bulbs that have a length of life that is approximately normally distributed, with mean equal to 800 hours and a standard deviation of 40 hours. Find the probability that a random sample of 16 bulbs will have an average life of less than 775 hours.

Solution: The sampling distribution of $\bar{X}$ will be approximately normal, with $\mu_{\bar{X}} = 800$ and $\sigma_{\bar{X}} = 40/\sqrt{16} = 10$. The desired probability is given by the area of the shaded region shown in the figure.

Figure 2. Area for Example 1

Corresponding to $\bar{x} = 775$, we find that

$$z = \frac{775 - 800}{10} = -2.5,$$

and therefore

$$P(\bar{X} < 775) = P(Z < -2.5) = 0.0062.$$

Practice Problem:

1. An electronics company manufactures resistors that have a mean resistance of 100 ohms and a standard deviation of 10 ohms. The distribution of resistance is normal. Find the probability that a random sample of n = 25 resistors will have an average resistance of less than 95 ohms.

Approximate Sampling Distribution of a Difference in Sample Means

If we have two independent populations with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$, and if $\bar{X}_1$ and $\bar{X}_2$ are the sample means of two independent random samples of sizes $n_1$ and $n_2$ from these populations, then the sampling distribution of

$$Z = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}}$$

is approximately standard normal if the conditions of the central limit theorem apply. If the two populations are normal, the sampling distribution of Z is exactly standard normal.

Example 1. Two independent experiments are run in which two different types of paint are compared. Eighteen specimens are painted using type A, and the drying time, in hours, is recorded for each. The same is done with type B. The population standard deviations are both known to be 1.0.
Assuming that the mean drying time is equal for the two types of paint, find $P(\bar{X}_A - \bar{X}_B > 1.0)$, where $\bar{X}_A$ and $\bar{X}_B$ are average drying times for samples of size $n_A = n_B = 18$.

Solution: From the sampling distribution of $\bar{X}_A - \bar{X}_B$, we know that the distribution is approximately normal with mean

$$\mu_{\bar{X}_A - \bar{X}_B} = \mu_A - \mu_B = 0$$

and variance

$$\sigma_{\bar{X}_A - \bar{X}_B}^2 = \frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B} = \frac{1}{18} + \frac{1}{18} = \frac{1}{9}.$$

Figure 3. Area for Example 2

The desired probability is given by the shaded region in Figure 3. Corresponding to the value $\bar{x}_A - \bar{x}_B = 1.0$, we have

$$z = \frac{\bar{x}_A - \bar{x}_B - (\mu_A - \mu_B)}{\sqrt{\sigma_A^2 / n_A + \sigma_B^2 / n_B}} = \frac{1 - (\mu_A - \mu_B)}{\sqrt{1/9}} = \frac{1 - 0}{\sqrt{1/9}} = 3.$$

Therefore,

$$P(Z > 3.0) = 1 - P(Z < 3.0) = 1 - 0.9987 = 0.0013.$$

Practice Problem:

1. The television picture tubes of manufacturer A have a mean lifetime of 6.5 years and a standard deviation of 0.9 year, while those of manufacturer B have a mean lifetime of 6.0 years and a standard deviation of 0.8 year. What is the probability that a random sample of 36 tubes from manufacturer A will have a mean lifetime that is at least 1 year more than the mean lifetime of a sample of 49 tubes from manufacturer B?

6.3 General Concepts of Point Estimation

A point estimate of some population parameter θ is a single value $\hat{\theta}$ of a statistic $\hat{\Theta}$. For example, the value $\bar{x}$ of the statistic $\bar{X}$, computed from a sample of size n, is a point estimate of the population parameter μ. Similarly, $\hat{p} = x/n$ is a point estimate of the true proportion p for a binomial experiment.

6.3.1 Unbiased Estimator

An estimator should be "close" in some sense to the true value of the unknown parameter. Formally, we say that $\hat{\Theta}$ is an unbiased estimator of θ if the expected value of $\hat{\Theta}$ is equal to θ.
This is equivalent to saying that the mean of the probability distribution of $\hat{\Theta}$ (or the mean of the sampling distribution of $\hat{\Theta}$) is equal to θ.

Bias of an Estimator

The point estimator $\hat{\Theta}$ is an unbiased estimator for the parameter θ if

$$E(\hat{\Theta}) = \theta.$$

If the estimator is not unbiased, then the difference

$$E(\hat{\Theta}) - \theta$$

is called the bias of the estimator $\hat{\Theta}$. When an estimator is unbiased, the bias is zero; that is, $E(\hat{\Theta}) - \theta = 0$.

Example 1. Let $X_1, X_2, X_3, \ldots, X_n$ be a random sample. Show that the sample mean

$$\hat{\Theta} = \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

is an unbiased estimator of $\theta = E(X_i)$.

Solution:

$$B(\hat{\Theta}) = E(\hat{\Theta}) - \theta = E(\bar{X}) - \theta = E(X_i) - \theta = 0.$$

Note that if an estimator is unbiased, it is not necessarily a good estimator. In the above example, the single observation $\hat{\Theta}_1 = X_1$ is also unbiased:

$$B(\hat{\Theta}_1) = E(\hat{\Theta}_1) - \theta = E(X_1) - \theta = 0.$$

Practice Problem:

1. Suppose that X is a random variable with mean μ and variance $\sigma^2$. Let $X_1, X_2, \ldots, X_n$ be a random sample of size n from the population represented by X. Show that the sample mean $\bar{X}$ and sample variance $S^2$ are unbiased estimators of μ and $\sigma^2$, respectively.

6.3.2 Variance of Point Estimator

Suppose that $\hat{\Theta}_1$ and $\hat{\Theta}_2$ are unbiased estimators of θ. This indicates that the distribution of each estimator is centered at the true value of θ. However, the variances of these distributions may be different. Figure 4 illustrates the situation. Because $\hat{\Theta}_1$ has a smaller variance than $\hat{\Theta}_2$, the estimator $\hat{\Theta}_1$ is more likely to produce an estimate close to the true value of θ. A logical principle of estimation when selecting among several unbiased estimators is to choose the estimator that has minimum variance.
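The minimum variance principle can be illustrated by computing the exact bias and variance of two unbiased estimators over every possible sample. This is a sketch under stated assumptions: the population {2, 4, 9}, the sample size n = 3, sampling with replacement, and the helper `moments` are all hypothetical. The two estimators compared are the sample mean and the first observation alone.

```python
from fractions import Fraction as F
from itertools import product

population = [2, 4, 9]  # hypothetical population, each value equally likely
n = 3                   # sample size, drawn with replacement

samples = list(product(population, repeat=n))  # all equally likely samples

def moments(estimator):
    """Exact mean and variance of an estimator over all possible samples."""
    values = [F(estimator(s)) for s in samples]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

def sample_mean(s):
    return F(sum(s), len(s))

def first_observation(s):
    return s[0]

mu = F(sum(population), len(population))  # true parameter being estimated

mean_bar, var_bar = moments(sample_mean)
mean_first, var_first = moments(first_observation)

print("bias of sample mean:", mean_bar - mu, "variance:", var_bar)
print("bias of first observation:", mean_first - mu, "variance:", var_first)
```

Both estimators have zero bias here, so the comparison rests entirely on the variance, which is smaller for the sample mean.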
Figure 4. Sampling Distributions of Two Unbiased Estimators $\hat{\Theta}_1$ and $\hat{\Theta}_2$

Minimum Variance Unbiased Estimator

If we consider all unbiased estimators of θ, the one with the smallest variance is called the minimum variance unbiased estimator (MVUE).

If $X_1, X_2, \ldots, X_n$ is a random sample of size n from a normal distribution with mean μ and variance $\sigma^2$, the sample mean $\bar{X}$ is the MVUE for μ.

When we do not know whether an MVUE exists, we could still use a minimum variance principle to choose among competing estimators. Suppose, for example, we wish to estimate the mean of a population (not necessarily a normal population). We have a random sample of n observations $X_1, X_2, \ldots, X_n$, and we wish to compare two possible estimators for μ: the sample mean $\bar{X}$ and a single observation from the sample, say, $X_i$. Note that both $\bar{X}$ and $X_i$ are unbiased estimators of μ; for the sample mean, we have $V(\bar{X}) = \sigma^2/n$ from previous chapters, and the variance of any observation is $V(X_i) = \sigma^2$. Because $V(\bar{X}) < V(X_i)$ for sample sizes n ≥ 2, we would conclude that the sample mean is a better estimator of μ than a single observation $X_i$.

6.3.3 Standard Error

When the numerical value or point estimate of a parameter is reported, it is usually desirable to give some idea of the precision of estimation. The measure of precision usually employed is the standard error of the estimator that has been used.

• Standard Error of an Estimator

The standard error of an estimator $\hat{\Theta}$ is its standard deviation, given by $\sigma_{\hat{\Theta}} = \sqrt{V(\hat{\Theta})}$. If the standard error involves unknown parameters that can be estimated, substitution of those values into $\sigma_{\hat{\Theta}}$ produces an estimated standard error, denoted by $\hat{\sigma}_{\hat{\Theta}}$. Sometimes the estimated standard error is denoted by $s_{\hat{\Theta}}$ or $SE(\hat{\Theta})$.

Suppose that we are sampling from a normal distribution with mean μ and variance $\sigma^2$.
Now the distribution of $\bar{X}$ is normal with mean μ and variance $\sigma^2/n$, so the standard error of $\bar{X}$ is

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}.$$

If we did not know σ but substituted the sample standard deviation S into the preceding equation, the estimated standard error of $\bar{X}$ would be

$$SE(\bar{X}) = \hat{\sigma}_{\bar{X}} = \frac{S}{\sqrt{n}}.$$

Table 1 presents standard error formulas for some sample statistics. Sampling distributions for these statistics, or at least their means and standard deviations (standard errors), can often be found.

Table 1. Standard Errors for Some Sample Statistics

Example 1. An article in the Journal of Heat Transfer (Trans. ASME, Sec. C, 96, 1974, p. 59) described a new method of measuring the thermal conductivity of Armco iron. Using a temperature of 100°F and a power input of 550 watts, the following 10 measurements of thermal conductivity (in Btu/hr-ft-°F) were obtained:

41.60, 41.48, 42.34, 41.95, 41.86, 42.18, 41.72, 42.26, 41.81, 42.04

A point estimate of the mean thermal conductivity at 100°F and 550 watts is the sample mean, or

$$\bar{x} = 41.924\ \text{Btu/hr-ft-°F}.$$

The standard error of the sample mean is $\sigma_{\bar{X}} = \sigma/\sqrt{n}$, and because σ is unknown, we may replace it by the sample standard deviation s = 0.284 to obtain the estimated standard error of $\bar{X}$ as

$$SE(\bar{X}) = \hat{\sigma}_{\bar{X}} = \frac{s}{\sqrt{n}} = \frac{0.284}{\sqrt{10}} = 0.0898.$$

6.3.4 Mean Square Error of an Estimator

Sometimes it is necessary to use a biased estimator. In such cases, the mean squared error of the estimator can be important. The mean squared error of an estimator $\hat{\Theta}$ is the expected squared difference between $\hat{\Theta}$ and θ.

Figure 5. A biased estimator $\hat{\Theta}_1$ that has smaller variance than the unbiased estimator $\hat{\Theta}_2$

• Mean Squared Error of an Estimator

The mean squared error of an estimator $\hat{\Theta}$ of the parameter θ is defined as

$$MSE(\hat{\Theta}) = E(\hat{\Theta} - \theta)^2.$$

The mean squared error can be rewritten as follows:

$$MSE(\hat{\Theta}) = E[\hat{\Theta} - E(\hat{\Theta})]^2 + [\theta - E(\hat{\Theta})]^2 = V(\hat{\Theta}) + (\text{bias})^2.$$

That is, the mean squared error of $\hat{\Theta}$ is equal to the variance of the estimator plus the squared bias. If $\hat{\Theta}$ is an unbiased estimator of θ, the mean squared error of $\hat{\Theta}$ is equal to the variance of $\hat{\Theta}$.

The mean squared error is an important criterion for comparing two estimators. Let $\hat{\Theta}_1$ and $\hat{\Theta}_2$ be two estimators of the parameter θ, and let $MSE(\hat{\Theta}_1)$ and $MSE(\hat{\Theta}_2)$ be the mean squared errors of $\hat{\Theta}_1$ and $\hat{\Theta}_2$. Then the relative efficiency of $\hat{\Theta}_2$ to $\hat{\Theta}_1$ is defined as

$$\frac{MSE(\hat{\Theta}_1)}{MSE(\hat{\Theta}_2)}.$$

If this relative efficiency is less than 1, we would conclude that $\hat{\Theta}_1$ is a more efficient estimator of θ than $\hat{\Theta}_2$.

REFERENCES:

Walpole, Ronald E., et al., Probability and Statistics for Engineers and Scientists, 9th ed., Pearson Education Inc., 2016
Montgomery, Douglas C., et al., Applied Statistics and Probability for Engineers, 7th ed., John Wiley & Sons (Asia) Pte Ltd, 2018
Spiegel, Murray R., et al., Probability and Statistics, 4th ed., McGraw Hill Companies Inc., 2013
https://www.probabilitycourse.com/chapter8/8_2_5_solved_probs.php

CHAPTER TEST

Solve the following problems completely.

1. A population consists of the four numbers 3, 7, 11, 15. Consider all possible samples of size two that can be drawn with replacement from this population. Find (a) the population mean, (b) the population standard deviation, (c) the mean of the sampling distribution of means, (d) the standard deviation of the sampling distribution of means.
Verify (c) and (d) directly from (a) and (b) by use of suitable formulas.

2. The mean score of students on an aptitude test is 72 points with a standard deviation of 8 points. What is the probability that two groups of students, consisting of 28 and 36 students, respectively, will differ in their mean scores by (a) 3 or more points, (b) 6 or more points, (c) between 2 and 5 points?

3. A normal population has a variance of 15. If samples of size 5 are drawn from this population, what percentage can be expected to have variances (a) less than 10, (b) more than 20, (c) between 5 and 10?

4. Measurements of a sample of weights were determined as 8.3, 10.6, 9.7, 8.8, 10.2, and 9.4 lb, respectively. Determine unbiased estimates of (a) the population mean, and (b) the population variance.

5. X has a continuous distribution:

$$f(x, y) = \begin{cases} \frac{1}{2}(2x + 3y), & 4 \le x \le 6 \\ 0, & \text{otherwise} \end{cases}$$

Find the distribution of the sample mean of a random sample of size n = 40.

Figure 5. The distributions of X and $\bar{X}$ in problem 5.

Chapter 7
STATISTICAL INTERVALS

Introduction

Engineers are often involved in estimating parameters. Statistical intervals represent the uncertainty that exists in the data because we work with samples that are obtained from a larger population or process. Statistical intervals are staples of the quality and validation practitioner's statistical toolbox. They can appear as plus-or-minus limits on test data, represent a margin of error in a scientific poll, or indicate the level of confidence associated with a predicted value. This chapter discusses the three most common intervals: the confidence interval, the prediction interval, and the tolerance interval.
In this part, confidence intervals are discussed.

Intended Learning Outcomes
At the end of this module, it is expected that the students will be able to:
1. Construct confidence intervals using a single sample and multiple samples
2. Construct a prediction interval for a future observation
3. Construct a tolerance interval for a normal distribution
4. Explain the three types of interval estimates: confidence intervals, prediction intervals, and tolerance intervals

7.1 Single Sample: Estimating the Mean

A point estimate alone gives no indication of how close it is likely to be to the true parameter value. A way to avoid this is to report the estimate in terms of a range of plausible values called a confidence interval. An interval estimate for a population parameter is called a confidence interval. A confidence interval always specifies a confidence level, usually 90%, 95%, or 99%, which is a measure of the reliability of the procedure. Information about the precision of estimation is conveyed by the length of the interval: a short interval implies precise estimation. We cannot be certain that the interval contains the true, unknown population parameter, because we use only a sample from the full population to compute the point estimate and the interval. However, the confidence interval is constructed so that we have high confidence that it does contain the unknown population parameter. Confidence intervals are widely used in engineering and the sciences.

The basic ideas of a confidence interval (CI) are most easily understood by initially considering a simple situation. Suppose that we have a normal population with unknown mean μ and known variance σ². This is a somewhat unrealistic scenario because typically both the mean and variance are unknown. However, in subsequent sections, we present confidence intervals for more general situations.
• Confidence Interval on the Mean of a Normal Distribution, Variance Known

If $\bar{x}$ is the sample mean of a random sample of size n from a normal population with known variance σ², a 100(1 − α)% confidence interval on μ is given by

$$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

where $z_{\alpha/2}$ is the upper 100α/2 percentage point of the standard normal distribution.

For small samples selected from non-normal populations, we cannot expect our degree of confidence to be accurate. However, for samples of size n ≥ 30, with the shape of the distributions not too skewed, sampling theory guarantees good results.

Example 1. ASTM Standard E23 defines standard test methods for notched bar impact testing of metallic materials. The Charpy V-notch (CVN) technique measures impact energy and is often used to determine whether or not a material experiences a ductile-to-brittle transition with decreasing temperature. Ten measurements of impact energy (J) on specimens of A238 steel cut at 60°C are as follows: 64.1, 64.7, 64.5, 64.6, 64.5, 64.3, 64.6, 64.8, 64.2, and 64.3. Assume that impact energy is normally distributed with σ = 1 J. We want to find a 95% CI for μ, the mean impact energy. The required quantities are $z_{\alpha/2} = z_{0.025} = 1.96$, n = 10, σ = 1, and $\bar{x}$ = 64.46.

Solution: Using the equation above, the resulting 95% CI is as follows:

$$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

$$64.46 - 1.96\left(\frac{1}{\sqrt{10}}\right) \le \mu \le 64.46 + 1.96\left(\frac{1}{\sqrt{10}}\right)$$

$$63.84 \le \mu \le 65.08$$

Based on the sample data, a range of highly plausible values for mean impact energy for A238 steel at 60°C is 63.84 J ≤ μ ≤ 65.08 J.

Practice Problem:
1. The average zinc concentration recovered from a sample of measurements taken in 36 different locations in a river is found to be 2.6 grams per milliliter.
Find the 95% and 99% confidence intervals for the mean zinc concentration in the river. Assume that the population standard deviation is 0.3 gram per milliliter.
Ans. 95%: 2.50 < μ < 2.70; 99%: 2.47 < μ < 2.73.

• Choice of Sample Size

The length of the confidence interval in the equation above, which measures its precision, is $2z_{\alpha/2}\,\sigma/\sqrt{n}$. This means that in using $\bar{x}$ to estimate μ, the error $E = |\bar{x} - \mu|$ is less than or equal to $z_{\alpha/2}\,\sigma/\sqrt{n}$ with confidence 100(1 − α)%. This is shown graphically in Figure 1.

Figure 1. Error in Estimating μ with $\bar{x}$

In situations where the sample size can be controlled, we can choose n so that we are 100(1 − α)% confident that the error in estimating μ is less than a specified bound E on the error. The appropriate sample size is found by choosing n such that

$$z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = E$$

If $\bar{x}$ is used as an estimate of μ, we can be 100(1 − α)% confident that the error $|\bar{x} - \mu|$ will not exceed a specified amount E when the sample size is

$$n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2$$

Example 1. Consider the CVN test described in the impact-energy example above, and suppose that we want to determine how many specimens must be tested to ensure that the 95% CI on μ for A238 steel cut at 60°C has a length of at most 1.0 J. The bound on error in estimation E is one-half of the length of the CI, so E = 0.5.

Solution: E = 0.5, σ = 1, and $z_{\alpha/2}$ = 1.96. The required sample size is

$$n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2 = \left(\frac{(1.96)(1)}{0.5}\right)^2 = 15.37$$

and because n must be an integer, the required sample size is n = 16.

• One-Sided Confidence Bounds on the Mean of a Normal Distribution, Variance Known

The confidence interval given above for the variance-known case provides both a lower confidence bound and an upper confidence bound for μ. Thus, it provides a two-sided CI. It is also possible to obtain one-sided confidence bounds for μ by setting either the lower bound l = −∞ or the upper bound u = ∞ and replacing $z_{\alpha/2}$ by $z_{\alpha}$.
A 100(1 − α)% upper-confidence bound for μ is

$$\mu \le \bar{x} + z_{\alpha}\frac{\sigma}{\sqrt{n}}$$

and a 100(1 − α)% lower-confidence bound for μ is

$$\bar{x} - z_{\alpha}\frac{\sigma}{\sqrt{n}} \le \mu$$

Example 1. The same impact-testing data from the CVN example are used to construct a lower, one-sided 95% confidence interval for the mean impact energy. Recall that $\bar{x}$ = 64.46, σ = 1 J, and n = 10. What is the interval?

Solution: With $z_{\alpha} = z_{0.05} = 1.645$,

$$\bar{x} - z_{\alpha}\frac{\sigma}{\sqrt{n}} \le \mu$$

$$64.46 - 1.645\left(\frac{1}{\sqrt{10}}\right) \le \mu$$

$$63.94 \le \mu$$

The lower limit for the two-sided interval in the CVN example was 63.84. Because $z_{\alpha} < z_{\alpha/2}$, the lower limit of a one-sided interval is always greater than the lower limit of a two-sided interval of equal confidence. The one-sided interval does not bound μ from above, so it still achieves 95% confidence with a slightly larger lower limit. If our interest is only in the lower limit for μ, then the one-sided interval is preferred because it provides equal confidence with a greater limit. Similarly, a one-sided upper limit is always less than a two-sided upper limit of equal confidence.

Practice Problem:
1. In a psychological testing experiment, 25 subjects are selected randomly and their reaction time, in seconds, to a particular stimulus is measured. Past experience suggests that the variance in reaction times to these types of stimuli is 4 sec² and that the distribution of reaction times is approximately normal. The average time for the subjects is 6.2 seconds. Give an upper 95% bound for the mean reaction time.
Ans: 6.858 seconds.

• Large-Sample Confidence Interval on the Mean

When n is large, the quantity

$$\frac{\bar{X} - \mu}{S/\sqrt{n}}$$

has an approximate standard normal distribution. Consequently,

$$\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}$$

is a large-sample confidence interval for μ, with confidence level of approximately 100(1 − α)%.

Example 1.
An article in the 1993 volume of the Transactions of the American Fisheries Society reports the results of a study to investigate the mercury contamination in largemouth bass. A sample of fish was selected from 53 Florida lakes, and mercury concentration in the muscle tissue was measured (ppm). Find an approximate 95% CI on μ.

[Table: mercury concentration values (ppm) for the 53 lakes]

Solution:
[Table: summary statistics for the mercury concentration data]

The required quantities are n = 53, $\bar{x}$ = 0.5250, s = 0.3486, and $z_{0.025}$ = 1.96. The approximate 95% CI on μ is

$$\bar{x} - z_{0.025}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + z_{0.025}\frac{s}{\sqrt{n}}$$

$$0.5250 - 1.96\left(\frac{0.3486}{\sqrt{53}}\right) \le \mu \le 0.5250 + 1.96\left(\frac{0.3486}{\sqrt{53}}\right)$$

$$0.4311 \le \mu \le 0.6189$$

This interval is fairly wide because there is substantial variability in the mercury concentration measurements. A larger sample size would have produced a shorter interval.

7.2 Confidence Interval on the Mean of a Normal Distribution, Variance Unknown

If $\bar{x}$ and s are the mean and standard deviation of a random sample from a normal distribution with unknown variance σ², a 100(1 − α)% confidence interval on μ is given by

$$\bar{x} - t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$$

where $t_{\alpha/2,\,n-1}$ is the upper 100α/2 percentage point of the t distribution with n − 1 degrees of freedom.

Example 1. An article in the Journal of Materials Engineering ["Instrumented Tensile Adhesion Tests on Plasma Sprayed Thermal Barrier Coatings" (1989, Vol. 11(4), pp. 275–282)] describes the results of tensile adhesion tests on 22 U-700 alloy specimens. Find the confidence interval (CI).
The load at specimen failure is as follows (in megapascals):

[Table: failure loads for the 22 specimens]

Solution: The sample mean is $\bar{x}$ = 13.71, and the sample standard deviation is s = 3.55. A box plot and a normal probability plot of the tensile adhesion test data provide good support for the assumption that the population is normally distributed. We want to find a 95% CI on μ. Since n = 22, we have n − 1 = 21 degrees of freedom for t, so $t_{0.025,21}$ = 2.080. The resulting confidence interval (CI) is

$$\bar{x} - t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$$

$$13.71 - 2.080\left(\frac{3.55}{\sqrt{22}}\right) \le \mu \le 13.71 + 2.080\left(\frac{3.55}{\sqrt{22}}\right)$$

$$13.71 - 1.57 \le \mu \le 13.71 + 1.57$$

$$12.14 \le \mu \le 15.28$$

The CI is fairly wide because there is a lot of variability in the tensile adhesion test measurements. A larger sample size would have led to a shorter interval.

• t Distribution

Let X1, X2, …, Xn be a random sample from a normal distribution with unknown mean μ and unknown variance σ². The random variable

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

has a t distribution with n − 1 degrees of freedom.

7.3 Confidence Interval on the Variance and Standard Deviation of a Normal Distribution

Sometimes confidence intervals on the population variance or standard deviation are needed. When the population is modelled by a normal distribution, the tests and intervals described in this section are applicable. The following result provides the basis for constructing these confidence intervals.

• χ² Distribution

Let X1, X2, …, Xn be a random sample from a normal distribution with mean μ and variance σ², and let S² be the sample variance. Then the random variable

$$\chi^2 = \frac{(n-1)S^2}{\sigma^2}$$

has a chi-square (χ²) distribution with n − 1 degrees of freedom.
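As a quick numerical check of the t-based interval in the tensile adhesion example of Section 7.2, the following Python sketch recomputes the limits from the summary statistics. The function name `t_confidence_interval` is hypothetical, and the critical value 2.080 is taken from the t table as quoted in the example rather than computed.

```python
import math


def t_confidence_interval(xbar, s, n, t_crit):
    """Two-sided CI on the mean, variance unknown.

    t_crit is the tabulated upper alpha/2 point of the t distribution
    with n - 1 degrees of freedom (assumed to be looked up by the caller).
    """
    half_width = t_crit * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width


# Summary statistics quoted in the tensile adhesion example
lower, upper = t_confidence_interval(xbar=13.71, s=3.55, n=22, t_crit=2.080)
print(f"{lower:.2f} <= mu <= {upper:.2f}")  # prints: 12.14 <= mu <= 15.28
```

The same helper can be reused for any variance-unknown example by substituting the appropriate table value for `t_crit`.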
• Confidence Interval on the Variance

If s² is the sample variance from a random sample of n observations from a normal distribution with unknown variance σ², then a 100(1 − α)% confidence interval on σ² is

$$\frac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}}$$

where $\chi^2_{\alpha/2,\,n-1}$ and $\chi^2_{1-\alpha/2,\,n-1}$ are the upper and lower 100α/2 percentage points of the chi-square distribution with n − 1 degrees of freedom, respectively. A confidence interval for σ has lower and upper limits that are the square roots of the corresponding limits in the above equation.

• One-Sided Confidence Bounds on the Variance

The 100(1 − α)% lower and upper confidence bounds on σ² are, respectively,

$$\frac{(n-1)s^2}{\chi^2_{\alpha,\,n-1}} \le \sigma^2 \quad \text{and} \quad \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha,\,n-1}}$$

Example 1. An automatic filling machine is used to fill bottles with liquid detergent. A random sample of 20 bottles results in a sample variance of fill volume of s² = 0.0153 (fluid ounce)². If the variance of fill volume is too large, an unacceptable proportion of bottles will be under- or overfilled. We will assume that the fill volume is approximately normally distributed. Find a 95% upper confidence bound on the variance.

Solution: A 95% upper confidence bound is found from

$$\sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha,\,n-1}}$$

$$\sigma^2 \le \frac{(20-1)(0.0153)}{\chi^2_{0.95,19}} = 0.0287 \text{ (fluid ounce)}^2$$

This last expression may be converted into a confidence bound on the standard deviation σ by taking the square root of both sides, resulting in

$$\sigma \le 0.17$$

Therefore, at the 95% level of confidence, the data indicate that the process standard deviation could be as large as 0.17 fluid ounce. The process engineer or manager now needs to determine whether a standard deviation this large could lead to an operational problem with under- or overfilled bottles.
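The upper bound in the filling-machine example can be reproduced with a short Python sketch. The helper name `variance_upper_bound` is hypothetical, and the chi-square percentage point 10.117 for 19 degrees of freedom is an assumed table lookup (it is not stated in the text above), so it should be checked against the chi-square table in use.

```python
import math


def variance_upper_bound(s2, n, chi2_lower):
    """One-sided upper confidence bound on the variance of a normal population.

    chi2_lower is the tabulated chi-square value with n - 1 degrees of
    freedom that leaves an area of 1 - alpha to its right.
    """
    return (n - 1) * s2 / chi2_lower


# s^2 = 0.0153 (fluid ounce)^2, n = 20; 10.117 is an assumed table value
var_bound = variance_upper_bound(s2=0.0153, n=20, chi2_lower=10.117)
sd_bound = math.sqrt(var_bound)  # bound on the standard deviation
print(f"sigma^2 <= {var_bound:.4f}, sigma <= {sd_bound:.2f}")
```

Taking the square root of the variance bound gives the bound on σ because the square root is an increasing function.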
7.4 Two Samples: Estimating the Difference between Two Means

If we have two populations with means μ1 and μ2 and variances $\sigma_1^2$ and $\sigma_2^2$, respectively, a point estimator of the difference between μ1 and μ2 is given by the statistic $\bar{X}_1 - \bar{X}_2$. Therefore, to obtain a point estimate of μ1 − μ2, we shall select two independent random samples, one from each population, of sizes n1 and n2, and compute $\bar{x}_1 - \bar{x}_2$, the difference of the sample means. Clearly, we must consider the sampling distribution of $\bar{X}_1 - \bar{X}_2$.

• Confidence Interval for Difference between Two Means, Variances Known

If $\bar{x}_1$ and $\bar{x}_2$ are means of independent random samples of sizes n1 and n2 from populations with known variances $\sigma_1^2$ and $\sigma_2^2$, respectively, a 100(1 − α)% confidence interval for μ1 − μ2 is given by

$$(\bar{x}_1 - \bar{x}_2) - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

where $z_{\alpha/2}$ is the z-value leaving an area of α/2 to the right.

7.5 Large-Sample Confidence Interval for a Population Proportion

It is often necessary to construct confidence intervals on a population proportion.

• Normal Approximation for a Binomial Proportion

If n is large, the distribution of

$$Z = \frac{X - np}{\sqrt{np(1-p)}} = \frac{\hat{P} - p}{\sqrt{\dfrac{p(1-p)}{n}}}$$

is approximately standard normal.

• Approximate Confidence Interval on a Binomial Proportion

If $\hat{p}$ is the proportion of observations in a random sample of size n that belongs to a class of interest, an approximate 100(1 − α)% confidence interval on the proportion p of the population that belongs to this class is

$$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \le p \le \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

where $z_{\alpha/2}$ is the upper α/2 percentage point of the standard normal distribution.

• Sample Size for a Specified Error on a Binomial Proportion

In situations when the sample size can be selected, we may choose n to be 100(1 − α)% confident that the error is less than some specified value E.
If we set

$$E = z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$

and solve for n, the appropriate sample size is

$$n = \left(\frac{z_{\alpha/2}}{E}\right)^2 p(1-p)$$

• Approximate One-Sided Confidence Bounds on a Binomial Proportion

The approximate 100(1 − α)% lower and upper confidence bounds are, respectively,

$$\hat{p} - z_{\alpha}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \le p \quad \text{and} \quad p \le \hat{p} + z_{\alpha}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

7.6 Prediction Interval for Future Observation

In some problem situations, we may be interested in predicting a future observation of a variable. This is a different problem than estimating the mean of that variable, so a confidence interval is not appropriate. In this section, we show how to obtain a 100(1 − α)% prediction interval on a future value of a normal random variable. A prediction interval provides bounds on one (or more) future observations from the population. For example, a prediction interval could be used to bound a single, new measurement of viscosity.

For a normal distribution of measurements with unknown mean μ and known variance σ², a 100(1 − α)% prediction interval of a future observation $x_0$ is

$$\bar{x} - z_{\alpha/2}\,\sigma\sqrt{1 + \frac{1}{n}} < x_0 < \bar{x} + z_{\alpha/2}\,\sigma\sqrt{1 + \frac{1}{n}}$$

where $z_{\alpha/2}$ is the z-value leaving an area of α/2 to the right.

For a normal distribution of measurements with unknown mean μ and unknown variance σ², a 100(1 − α)% prediction interval of a future observation $x_0$ is

$$\bar{x} - t_{\alpha/2}\,s\sqrt{1 + \frac{1}{n}} < x_0 < \bar{x} + t_{\alpha/2}\,s\sqrt{1 + \frac{1}{n}}$$

where $t_{\alpha/2}$ is the t-value with v = n − 1 degrees of freedom, leaving an area of α/2 to the right.

Example 1. A meat inspector has randomly selected 30 packs of 95% lean beef. The sample resulted in a mean of 96.2% with a sample standard deviation of 0.8%. Find a 99% prediction interval for the leanness of a new pack. Assume normality.
Solution: For v = 29 degrees of freedom, $t_{0.005}$ = 2.756. Hence

$$\bar{x} - t_{\alpha/2}\,s\sqrt{1 + \frac{1}{n}} < x_0 < \bar{x} + t_{\alpha/2}\,s\sqrt{1 + \frac{1}{n}}$$

$$96.2 - (2.756)(0.8)\sqrt{1 + \frac{1}{30}} < x_0 < 96.2 + (2.756)(0.8)\sqrt{1 + \frac{1}{30}}$$

$$93.96 < x_0 < 98.44$$

Notice that a prediction interval is considerably longer than the corresponding CI on the mean. This is because the CI is an interval estimate of a parameter, whereas the PI is an interval estimate of a single future observation.

Practice Problem:
Due to the decrease in interest rates, the First Citizens Bank received a lot of mortgage applications. A recent sample of 50 mortgage loans resulted in an average loan amount of $257,300. Assume a population standard deviation of $25,000. For the next customer who fills out a mortgage application, find a 95% prediction interval for the loan amount.

7.7 Tolerance Interval

A tolerance interval is another important type of interval estimate. For example, the chemical product viscosity data might be assumed to be normally distributed. We might like to calculate limits that bound 95% of the viscosity values.

A tolerance interval for capturing at least γ% of the values in a normal distribution with confidence level 100(1 − α)% is

$$\bar{x} - ks,\ \bar{x} + ks \quad \text{or} \quad \bar{x} \pm ks$$

where k is a tolerance interval factor found in Table I. Values are given for γ = 90%, 95%, and 99%, and for 90%, 95%, and 99% confidence. This interval is very sensitive to the normality assumption. One-sided tolerance bounds can also be computed. The tolerance factors for these bounds are also given in Table I.

Table I. Tolerance Factors for Normal Distributions

Example 1. Consider the lean-beef example of Section 7.6. With the information given, find a tolerance interval that gives two-sided 95% bounds on 90% of the distribution of packages of 95% lean beef.
Assume the data came from an approximately normal distribution. Recall from that example that n = 30, the sample mean is 96.2%, and the sample standard deviation is 0.8%. From Table I, k = 2.14.

Solution: Using

$$\bar{x} \pm ks = 96.2 \pm (2.14)(0.8)$$

we find that the lower and upper bounds are 94.5 and 97.9. We are 95% confident that the above range covers the central 90% of the distribution of 95% lean beef packages.

REFERENCES:
Walpole, Ronald E., et al., Probability and Statistics for Engineers and Scientists, 9th ed., Pearson Education Inc., 2016
Montgomery, Douglas C., et al., Applied Statistics and Probability for Engineers, 7th ed., John Wiley & Sons (Asia) Pte Ltd, 2018
Spiegel, Murray R., et al., Probability and Statistics, 4th ed., McGraw Hill Companies Inc., 2013

CHAPTER TEST
Solve the following problems completely.

1. A random sample of size n1 = 25, taken from a normal population with a standard deviation σ1 = 5, has a mean $\bar{x}_1$ = 80. A second random sample of size n2 = 36, taken from a different normal population with a standard deviation σ2 = 3, has a mean $\bar{x}_2$ = 75. Find a 94% confidence interval for μ1 − μ2.

2. An electrical firm manufactures light bulbs that have a length of life that is approximately normally distributed with a standard deviation of 40 hours. If a sample of 30 bulbs has an average life of 780 hours, find a 96% confidence interval for the population mean of all bulbs produced by this firm.

3. A machine produces metal pieces that are cylindrical in shape. A sample of pieces is taken, and the diameters are found to be 1.01, 0.97, 1.03, 1.04, 0.99, 0.98, 0.99, 1.01, and 1.03 centimeters. Find a 99% confidence interval for the mean diameter of pieces from this machine, assuming an approximately normal distribution.

4. A machine produces metal pieces that are cylindrical in shape.
A sample of these pieces is taken and the diameters are found to be 1.01, 0.97, 1.03, 1.04, 0.99, 0.98, 0.99, 1.01, and 1.03 centimeters. Use these data to calculate three interval types and draw interpretations that illustrate the distinction between them in the context of the system. For all computations, assume an approximately normal distribution. The sample mean and standard deviation for the given data are $\bar{x}$ = 1.0056 and s = 0.0246.
(a) Find a 99% confidence interval on the mean diameter.
(b) Compute a 99% prediction interval on a measured diameter of a single metal piece taken from the machine.
(c) Find the 99% tolerance limits that will contain 95% of the metal pieces produced by this machine.

5. A random sample of 20 students yielded a mean of $\bar{x}$ = 72 and a variance of s² = 16 for scores on a college placement test in mathematics. Assuming the scores to be normally distributed, construct a 98% confidence interval for σ².

Chapter 8
TEST ON HYPOTHESIS FOR A SINGLE SAMPLE

Introduction

The previous chapters discussed how a population parameter can be estimated from sample data using a point estimate or a confidence interval. In many situations there are two competing claims about the value of a parameter, and it must be determined which claim is correct. This can be done by statistical inference. Inferential statistics is the branch of statistics that deals with estimating population values, called parameters, and with making statements about computed statistics that are acceptable to some degree of confidence. Statistical inference is the method concerned with making estimates of population values. One such method, called hypothesis testing, helps in determining how accurate the generalizations are.
This chapter focuses on the basic principles of hypothesis testing of means, variances, and proportions involving a single sample of data.

Intended Learning Outcomes
At the end of this module, it is expected that the students will be able to:
1. Test hypotheses on the mean of a normal distribution using either a Z-test or a t-test.
2. Test hypotheses on the variance or standard deviation of a normal distribution.
3. Test hypotheses on a population proportion.
4. Use the P-value approach for making decisions in hypothesis tests.

8.1. Hypothesis Testing

Hypothesis testing is a decision-making process for evaluating claims about a population. The goal of this process is to make a judgment about the difference between a sample statistic and a hypothesized population parameter. In this process, the researcher must define the population under study, state the hypothesis to be investigated, give the significance level, select a sample, collect data, perform the required test, and reach a conclusion. The z test and t test are statistical tests for hypotheses on means, while the chi-square test is used for testing the standard deviation.

Null and Alternative Hypothesis

The null hypothesis, denoted as H0, is the statement of equality indicating that no relationship exists between the variables under study. This statement is tested for the purpose of being rejected or not rejected.

The alternative hypothesis, denoted as H1 (or Ha), is also termed the research hypothesis. It is a statement of the expectation derived from the theory under study.

Type I and Type II Error

In hypothesis testing, there are four possible outcomes.
                    Reject H0           Do not reject H0
H0 is true          Type I error        Correct decision
H0 is false         Correct decision    Type II error

A type I error occurs if one rejects the null hypothesis when it is true. The probability of a type I error is referred to as the significance level and is denoted by the Greek letter alpha (α). The common values of α are 1%, 5% and 10%.

A type II error occurs if one does not reject the null hypothesis when it is false. Its probability is denoted by the Greek letter beta (β).

Significance Level and Confidence Level

The level of significance is the maximum probability of committing a type I error. That is, P(type I error) = α. Generally, statisticians agree on using three arbitrary significance levels: 0.10, 0.05 and 0.01. That is, when the null hypothesis is true, the probability of a type I error is 10%, 5% or 1% and the probability of a correct decision is 90%, 95% or 99%, depending on which level of significance is used. This probability of a correct decision, 1 − α, is the confidence level, which represents the chance of not rejecting the null hypothesis when in fact it is true.

8.1.1. One-sided and Two-sided Hypothesis

In order to state the hypotheses correctly, the researcher must translate the claim correctly into mathematical symbols. There are three possible sets of statistical hypotheses.

1. H0: parameter = specific value
   H1: parameter ≠ specific value
   This is a two-tailed test.
2. H0: parameter = specific value
   H1: parameter < specific value
   This is a left-tailed test.
3. H0: parameter = specific value
   H1: parameter > specific value
   This is a right-tailed test.

8.1.2. P-value in Hypothesis Tests

In testing hypotheses in which the test statistic is discrete, the critical region may be chosen arbitrarily. If α is too large, it can be reduced by making an adjustment in the critical value.
It may be necessary to increase the sample size to offset the decrease that occurs automatically in the power of the test.

In statistical analysis, it has become customary to choose a significance level of 0.10, 0.05 or 0.01 and to select the critical region accordingly; the rejection or non-rejection of the null hypothesis H0 then depends on that critical region. For example, if the test is two-tailed, α is set at the 0.05 level of significance, and the test statistic involves, say, the standard normal distribution, then a z-value is observed from the data and the critical region is z > 1.96 or z < −1.96, where the value 1.96 is found as $z_{0.025}$ in the table of Areas Under the Normal Curve. A value of z in the critical region prompts the statement "The value of the test statistic is significant," which we can then translate into the user's language. For example, if the hypotheses are given by H0: μ = 12, H1: μ ≠ 12, one might say, "The mean differs significantly from the value 12."

The philosophy that the maximum risk of making a type I error should be controlled is the root of the pre-selection of a significance level. However, this approach does not account for values of test statistics that are "close" to the critical region. Suppose, for example, in the illustration with H0: μ = 12 versus H1: μ ≠ 12, a value of z = 1.84 is observed; strictly speaking, with α = 0.05, the value is not significant. But the risk of committing a type I error if one rejects H0 in this case could hardly be considered severe. In fact, in a two-tailed scenario, one can quantify this risk as

P = 2P(Z > 1.84 when μ = 12) = 2(0.0329) = 0.0658.

As a result, 0.0658 is the probability of obtaining a value of z as large as or larger in magnitude than 1.84 when in fact μ = 12.
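The two-tailed P-value quoted above can be reproduced with a short Python sketch. It assumes the test statistic follows the standard normal distribution under H0; the helper name `two_sided_p_value` is illustrative only, and `statistics.NormalDist` is part of the Python standard library (version 3.8 and later).

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Two-tailed P-value for an observed standard normal test statistic z."""
    upper_tail = 1.0 - NormalDist().cdf(abs(z))  # P(Z > |z|)
    return 2.0 * upper_tail                      # both tails, by symmetry


p = two_sided_p_value(1.84)
print(round(p, 4))  # prints: 0.0658
```

Comparing `p` with a chosen α reproduces the fixed-level decision: 0.0658 exceeds 0.05, so H0 is not rejected at the 5% level, yet it would be rejected at the 10% level.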
This P-value is important information for the user, although the evidence against H0 is not as strong as that which would result from rejection at the α = 0.05 level. As a result, the P-value approach has been extensively used in applied statistics. It is designed to give an alternative, in terms of a probability, to a mere "reject" or "do not reject" conclusion. The P-value also gives important information when the z-value falls well into the ordinary critical region. For example, if z is 2.75, it is informative for the user to observe that P = 2(0.0030) = 0.0060, and thus the z-value is significant at a level considerably less than 0.05. It is important to know that under the condition of H0, a value of z = 2.75 is an extremely rare event. That is, a value at least that large in magnitude would occur only 60 times in 10,000 experiments.

A P-value is the lowest level of significance at which the observed value of the test statistic is significant. It is the smallest level of α that would lead to rejection of H0 with the given data.

8.1.3. General Procedure for Test of Hypothesis

The following are the steps in hypothesis testing using the fixed probability of Type I error approach.
1. State the null and alternative hypotheses.
2. Determine the level of significance and the direction of the test. The direction of the test is based on whether the alternative hypothesis is stated as a left-tailed, right-tailed, or two-tailed test.
3. Determine the appropriate statistical test based on the level of measurement of the data gathered.
4. Write the decision rule stating when to reject or not reject the null hypothesis.
5. Compute the test statistic and compare it with the critical value. The test statistic plays a vital role in rejecting or not rejecting the null hypothesis.
6.
State the decision based on the resulting computed value when compared to the critical value. 7. Draw scientific or engineering conclusion for the given problem. If you will be testing the hypothesis using Significant Testing or the P-value approach, follow these steps: 1. State the null and alternative hypothesis. Downloaded by Jheriel Urbano (jherielpraderourbano08@gmail.com) lOMoARcPSD|12687361 MATH 403- ENGINEERING DATA ANALYSIS 2. Determine the appropriate statistical test based on the level of measurement of the data gathered. 3. Compute the test statistic. 4. Compute the P-value based on the computed value of the test statistic. 5. State the decision based on the resulting P-value and knowledge of the scientific system. 6. Draw scientific or engineering conclusion for the given problem. 8.2. Test on the Mean of a Normal Distribution Variance Known Following the steps in hypothesis testing for only single mean, the hypothesized value referred to as the hypothesized mean (µo). The null hypothesis is stated as: Ho: µ = µo The alternative hypothesis can be written as: H1 : µ  µo H1 : µ > µo H1: µ < µo The decision rule is stated as follows: reject the null hypothesis if the absolute value of the test statistic exceeds the critical value. Otherwise, do not reject the null hypothesis. In order to draw inference on a mean in one-population case assuming that the entries are normally distributed and the variance is known, Z-test is used. It can be used when the sample size is equal or greater than 30 (n  30). The Z-statistic, Zc, is the test statistic Downloaded by Jheriel Urbano (jherielpraderourbano08@gmail.com) lOMoARcPSD|12687361 MATH 403- ENGINEERING DATA ANALYSIS used in order to lead for the rejection of null hypothesis in favor of the alternative hypothesis. 
This is computed as:

$$z_c = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$

where $\bar{X}$ is the mean computed from the gathered data, $\mu_0$ is the hypothesized mean, $\sigma$ is the population standard deviation, which is known or given, and n is the sample size.

The critical value is obtained from the z-table. For a two-tailed test, the z-value with cumulative area $1 - \alpha/2$, written symbolically as $z_{\alpha/2}$, is used. For a one-tailed test, the z-value with cumulative area $1 - \alpha$, written as $z_\alpha$, is used.

Figure 1. The Normal Distribution or Z-Distribution for Testing the Hypothesis $H_0: \mu = \mu_0$ with critical values for (a) $H_1: \mu \neq \mu_0$, (b) $H_1: \mu > \mu_0$, (c) $H_1: \mu < \mu_0$

Example 1. A random sample of 100 students enrolled in a Statistics course under Professor X shows that the average grade in the midterm examination is 85%. Professor X claims that the average grade of the students in the midterm is at least 80% with a standard deviation of 16%. Is there evidence that the claim is correct at the 5% level of significance?

Solution:

1. $H_0: \mu = 80\%$; $H_1: \mu > 80\%$
2. $\alpha = 0.05$, right-tailed test
3. Test statistic: $z_c = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$
4. Critical region: $z > 1.645$. Reject $H_0$ if $z_c$ is greater than 1.645.
5. Computing the z-statistic:

$$z_c = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{85 - 80}{16/\sqrt{100}} = 3.125$$

6. Reject $H_0$ since 3.125 is greater than 1.645.
7. Therefore, the Professor's claim is supported at the 5% level of significance.

Using the P-value approach, the P-value corresponding to z = 3.125 is 0.0009 from the table for Areas Under the Normal Curve. This is evidence in favor of the alternative hypothesis $H_1$ that is much stronger than what the 0.05 level of significance requires.

Example 2. A manufacturer of solar lamps claims that the mean useful life of their new product is 8 months with a standard deviation of 0.5 month. To test this claim, a random sample of 50 solar lamps was tested and found to have a mean life of 7.8 months.
Test the hypothesis that $\mu = 8$ months against the alternative hypothesis that $\mu \neq 8$ months using the 1% level of significance.

Solution:

1. $H_0: \mu = 8$ months; $H_1: \mu \neq 8$ months
2. $\alpha = 0.01$, two-tailed test
3. Test statistic: $z_c = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$
4. Critical region: $z < -2.575$ or $z > 2.575$. Reject $H_0$ if $z_c < -2.575$ or $z_c > 2.575$.
5. Computing the z-statistic:

$$z_c = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{7.8 - 8}{0.5/\sqrt{50}} = -2.8284 \approx -2.83$$

6. Reject $H_0$ since -2.83 is less than -2.575.
7. Therefore, the mean useful life of the new product is not equal to 8 months. In fact, it is less than 8 months at the 1% level of significance.

Using the P-value approach, and considering that this is a two-tailed test, the P-value is twice the area to the left of z = -2.83. Using the table for Areas Under the Normal Curve,

$$P = P(|z| > 2.83) = 2P(z < -2.83) = 2(0.0023) = 0.0046$$

This results in rejection of $H_0$ at a level of $\alpha$ less than 1%.

8.3. Test on the Mean of a Normal Distribution, Variance Unknown

To draw an inference on a mean in the one-population case, assuming the population is normally distributed but the variance is unknown and the sample size is less than 30, the t-test is used. The test statistic is the t-statistic, $t_c$, which is computed as follows:

$$t_c = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$$

where $\bar{X}$ is the mean computed from the gathered data, $\mu_0$ is the hypothesized mean, s is the sample standard deviation and n is the sample size.

The critical value is obtained from the t-table. For a two-sided test, the critical value is obtained at $\alpha/2$ with degrees of freedom (d.f.) equal to $n - 1$, written as $t_{\alpha/2}(n-1)$. For a one-sided test, the value is obtained at $\alpha$ with $n - 1$ degrees of freedom, written as $t_\alpha(n-1)$.

Figure 2. t-Distribution for Testing the Hypothesis $H_0: \mu = \mu_0$ with critical values for (a) $H_1: \mu \neq \mu_0$, (b) $H_1: \mu > \mu_0$, (c) $H_1: \mu < \mu_0$

Example 1.
The College of Engineering of a State University gives an entrance exam to incoming freshmen. Those who get scores equal to or higher than the set passing score are accepted in the College. The average score of the incoming freshmen was 80% before the implementation of the K to 12 education system. Due to this implementation, the entrance exam was suspended for two years, and it is thought that the quality of the first-year students has diminished. However, in line with the vision, mission, goals and objectives of the University and the College towards quality education, the Dean wants to determine whether the quality of freshmen students has changed. He wants to know if it has improved or diminished, so he takes a small random sample of 15 freshmen students and administers the same entrance exam. The average score is found to be 83% with a standard deviation of 5%. Determine whether the quality has changed using the 1% level of significance.

Solution:

1. $H_0: \mu = 80\%$; $H_1: \mu \neq 80\%$
2. $\alpha = 0.01$, two-tailed test
3. Test statistic: $t_c = \dfrac{\bar{X} - \mu_0}{s/\sqrt{n}}$
4. Critical region: $t < -2.977$ or $t > 2.977$. Reject $H_0$ if $t_c$ is less than -2.977 or greater than 2.977. This is obtained from the table for Critical Values of the t-distribution using $\alpha/2 = 0.005$ and degrees of freedom $\nu = 15 - 1 = 14$.
5. Computing the t-statistic:

$$t_c = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} = \frac{83 - 80}{5/\sqrt{15}} = 2.32$$

6. Do not reject $H_0$ since 2.32 is less than 2.977 but greater than -2.977.
7. Therefore, there is no sufficient evidence that the quality of freshmen students has changed at the 1% level of significance.

The P-value corresponding to t = 2.32 is about 0.036 or 3.6%. Since this is a two-tailed test,

$$P = P(|t| > 2.32) = 2P(t > 2.32) \approx 0.036$$

which is greater than $\alpha = 0.01$, in agreement with the decision above.

8.4. Test on Variance and Standard Deviation of a Normal Distribution

The chi-square distribution is used to test a claim about a single variance or standard deviation. The formula for the chi-square test for a single variance is given by:

$$\chi^2 = \frac{(n-1)s^2}{\sigma^2}$$

where n is the sample size, $s^2$ is the sample variance and $\sigma^2$ is the population variance, with degrees of freedom equal to $n - 1$. There are three assumptions for the chi-square test: the sample must be randomly selected from the population, the population must be normally distributed for the variable under study, and the observations must be independent of each other.

Figure 3. Chi-Squared Distribution for Testing the Hypothesis $H_0: \sigma^2 = \sigma_0^2$ with critical values for (a) $H_1: \sigma^2 \neq \sigma_0^2$, (b) $H_1: \sigma^2 > \sigma_0^2$, (c) $H_1: \sigma^2 < \sigma_0^2$

Example 1. A company claims that the variance of the sugar content of its ice cream is equal to 25 mg/oz. A sample of 20 servings is selected, and the sugar content is measured. The variance of the sample is found to be 36. At the 10% level of significance, is there enough evidence to reject the claim?

Solution:

1. $H_0: \sigma^2 = 25$ mg/oz; $H_1: \sigma^2 \neq 25$ mg/oz
2. $\alpha = 0.10$, two-tailed test
3. Test statistic: $\chi^2 = \dfrac{(n-1)s^2}{\sigma^2}$
4. Critical region: $\chi^2 < 10.117$ or $\chi^2 > 30.144$. Reject $H_0$ if $\chi^2$ is less than 10.117 or greater than 30.144. This is obtained from the table for Critical Values of the Chi-Squared distribution using $\alpha/2 = 0.05$ and degrees of freedom $\nu = 20 - 1 = 19$.
5. Computing the $\chi^2$-statistic:

$$\chi^2 = \frac{(n-1)s^2}{\sigma^2} = \frac{(19)(36)}{25} = 27.36$$

6. Do not reject $H_0$ since $10.117 < 27.36 < 30.144$.
7. Therefore, there is not enough evidence to reject the company's claim that the variance of the sugar content is equal to 25 mg/oz at the 10% level of significance.

8.5. Test on a Population Proportion

This section considers the problem of testing the hypothesis that the proportion of successes in a binomial experiment equals some specified value.
That is, the null hypothesis $H_0$ that $p = p_0$, where p is the parameter of the binomial distribution, is tested. The alternative hypothesis may be one of the usual one-sided or two-sided alternatives: $p < p_0$, $p > p_0$ or $p \neq p_0$.

The following are the steps in testing a proportion for small samples:

1. $H_0: p = p_0$; $H_1$: one of the alternatives $p < p_0$, $p > p_0$ or $p \neq p_0$.
2. Choose a level of significance equal to $\alpha$.
3. Test statistic: binomial variable X with $p = p_0$.
4. Computations: find x, the number of successes, and compute the appropriate P-value.
5. Decision: draw the appropriate conclusion based on the P-value.

Example 1. A home developer claims that solar panels are installed in 65% of all homes being constructed today in a certain subdivision. Would you agree with this claim if a random survey of new homes in this subdivision shows that 8 out of 15 had solar panels installed? Use a 0.10 level of significance.

Solution:

1. $H_0: p = 0.65$; $H_1: p \neq 0.65$
2. $\alpha = 0.10$, two-tailed test
3. Test statistic: binomial variable X with p = 0.65 and n = 15
4. Computations: x = 8 and $np_0 = (15)(0.65) = 9.75$. Since x is below $np_0$, the lower tail is used. From the binomial probability sums, the computed P-value is

$$P = 2P(X \leq 8 \text{ when } p = 0.65) = 2\sum_{x=0}^{8} b(x; 15, 0.65) \approx 0.4903$$

5. Since the P-value is much greater than $\alpha = 0.10$, do not reject $H_0$ and conclude that there is not enough evidence to doubt the claim of the home developer.

For large n, approximation procedures are required. When the hypothesized value $p_0$ is very close to 0 or 1, the Poisson distribution with parameter $\mu = np_0$ may be used. However, the normal-curve approximation, with parameters $\mu = np_0$ and $\sigma^2 = np_0q_0$, is usually preferred for large n and is very accurate as long as $p_0$ is not extremely close to 0 or 1.
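As a numerical check on the small-sample solar-panel example above, the exact binomial P-value can be computed directly instead of being read from a table. This is a minimal sketch in Python using only the standard library; the function names `binom_cdf` and `two_sided_p_value` are illustrative assumptions, and the two-sided P-value is taken as twice the smaller tail, capped at 1, which is one common convention rather than the only one.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """Cumulative binomial probability P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1.0 - p) ** (n - x) for x in range(k + 1))


def two_sided_p_value(x: int, n: int, p0: float) -> float:
    """Two-sided exact P-value: twice the smaller tail probability, capped at 1."""
    lower_tail = binom_cdf(x, n, p0)            # P(X <= x)
    upper_tail = 1.0 - binom_cdf(x - 1, n, p0)  # P(X >= x)
    return min(1.0, 2.0 * min(lower_tail, upper_tail))


# Example: 8 of 15 homes have solar panels, H0: p = 0.65.
lower = binom_cdf(8, 15, 0.65)
p_value = two_sided_p_value(8, 15, 0.65)

print(f"P(X <= 8) = {lower:.4f}")
print(f"two-sided P-value = {p_value:.4f}")
```

The exact sum evaluates to roughly 0.49, far above the 0.10 level of significance, so the decision not to reject $H_0$ is unchanged.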
Using the normal approximation, the z-value for testing $p = p_0$ is given by

$$z = \frac{x - np_0}{\sqrt{np_0q_0}}$$

which is a value of the standard normal variable Z. Hence, for a two-tailed test at the $\alpha$ level of significance, the critical region is $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$. For the one-sided alternative $p < p_0$, the critical region is $z < -z_\alpha$, and for the alternative $p > p_0$, the critical region is $z > z_\alpha$.

Example 2. A semiconductor company produces microcontrollers for robotic applications. The company is said to demonstrate capability to its customers if the process produces defective items not exceeding 5%. To determine this, a random sample of 200 microcontrollers was tested and four defective items were found. Would you agree that the company demonstrates process capability at the 0.05 level of significance? Use the P-value approach.

Solution:

1. $H_0: p = 0.05$; $H_1: p < 0.05$
2. Test statistic: $z = \dfrac{x - np_0}{\sqrt{np_0q_0}}$
3. Computing the z-statistic:

$$z = \frac{x - np_0}{\sqrt{np_0q_0}} = \frac{4 - 200(0.05)}{\sqrt{200(0.05)(0.95)}} = -1.95$$

4. From the table for Areas Under the Normal Curve, the P-value is $P(z < -1.95) = 0.0256$.
5. Since the P-value is less than 0.05, reject $H_0$.
6. Therefore, at the 5% level of significance, the company demonstrates process capability for the customers.

REFERENCES:

Garcia, George A. Fundamental Concepts and Methods in Statistics. Manila: University of Sto. Tomas Publishing House, 2004.
Montgomery, Douglas C., et al. Applied Statistics and Probability for Engineers, 7th ed. John Wiley & Sons (Asia) Pte Ltd, 2018.
Walpole, Ronald E., et al. Probability and Statistics for Engineers and Scientists, 9th ed. Pearson Education Inc., 2016.

CHAPTER TEST

Solve the following problems completely.

1.
A company producing lubricating oil claims that the average content of the containers is 20 liters. Test this claim if the contents of a random sample of ten containers are 20.4, 19.4, 20.2, 20.6, 20.2, 19.6, 19.8, 20.8, 20.6 and 19.6 liters. Assume a normal distribution and use the 1% level of significance.
2. It is claimed that a personal vehicle is driven 25,000 kilometers per year. Would you agree with this claim if a random sample of 100 vehicle owners, who were asked to keep records of their travel, showed an average of 28,500 kilometers with a standard deviation of 3,950 kilometers? Use the P-value in your conclusion.
3. A marketing expert for mobile operating systems believes that 40% of the users prefer Android. If 9 out of 20 choose Android over iOS, what can you conclude about the marketing expert's claim? Use the 5% level of significance.
4. The volume of the containers of the lubricating oil in Problem 1 is known to be normally distributed with a variance of 0.06 liter. Test this hypothesis against the alternative that the variance is not equal to 0.06 liter. Use the 0.01 level of significance.

Chapter 9
STATISTICAL INFERENCE OF TWO SAMPLES

Introduction

The previous chapter discussed hypothesis testing of the mean, variance and proportion for a single sample. In this chapter, statistical inference for two samples concerning means, variances and proportions will be discussed.

Intended Learning Outcomes

At the end of this module, it is expected that the students will be able to:

1. Test hypotheses on the difference in means of two normal distributions using either a Z-test or a t-test.
2. Test hypotheses on the difference between variances of normal distributions.
3. Test hypotheses on the difference between population proportions.
4. Use the P-value approach for making decisions in hypothesis tests.
9.1. Inference on the Difference in Means of Two Normal Distributions, Variance Known

The test of difference of means is used to determine whether there is a significant difference between two populations with respect to the same characteristic. For example, we may want to determine whether there is a significant difference between the performance of two classes of engineering students enrolled in Statistics. To do this, take a sample from each class, specify the level of significance and test the hypothesis on the difference of the means of the two classes being compared.

The null hypothesis is stated as follows: there is no significant difference in the performance of the two classes. In mathematical symbols: $H_0: \mu_1 = \mu_2$.

The alternative hypothesis is: there is a significant difference in the performance of the two classes. In symbols: $H_1: \mu_1 \neq \mu_2$. This is a two-tailed test.

Alternatively, the alternative hypothesis may state that the performance of the first class is better (or poorer) than that of the second class. This is a one-tailed test, either right-tailed or left-tailed. The inequality statement is $H_1: \mu_1 > \mu_2$ or $H_1: \mu_1 < \mu_2$.

For large samples ($n \geq 30$), when there are two independent random samples of size $n_1$ and $n_2$, respectively, drawn from two populations with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$, and the random variable is normally distributed, the Z-statistic can be computed using this formula:

$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

The null hypothesis on two means can be written as $H_0: \mu_1 - \mu_2 = d_0$, and the formula then reduces to:

$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - d_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

If the population variances are not known and the samples are large, the sample standard deviations ($s_1$ and $s_2$) are used in the above formula.

Example 1.
The Bureau of Agricultural Research is studying two varieties of high-yielding corn. Based on past studies, the difference in yield is significant. To know if there is really a significant difference, the Director of the Bureau decided to conduct an experiment. Forty hectares of the first variety and thirty hectares of the second variety are planted and grown under the same laboratory conditions. After harvesting, the mean yield is 250 sacks for the first variety with a standard deviation of 20 sacks, and 240 sacks for the second variety with a standard deviation of 15 sacks. Is there a significant difference in the yield of the two varieties of corn? Use the 1% level of significance.

Solution:

1. $H_0: \mu_1 = \mu_2$; $H_1: \mu_1 \neq \mu_2$
2. $\alpha = 0.01$, two-tailed test
3. Test statistic: $Z = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$, since only the sample variances are known
4. Critical region: $Z < -2.575$ or $Z > 2.575$. Reject $H_0$ if Z is less than -2.575 or greater than 2.575.
5. Computing Z:

$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{250 - 240}{\sqrt{\dfrac{(20)^2}{40} + \dfrac{(15)^2}{30}}} = 2.39$$

6. Do not reject $H_0$ since $-2.575 < 2.39 < 2.575$.
7. Therefore, there is no significant difference in the yield of the two varieties of corn at the 1% level of significance.

Using the P-value approach,

$$P = P(|z| > 2.39) = 2P(z < -2.39) = 2(0.0084) = 0.0168$$

Since 0.0168 is greater than 0.01, the decision not to reject $H_0$ is confirmed.

9.2. Inference on the Difference in Means of Two Normal Distributions, Variance Unknown

For small samples ($n < 30$): if the variances are unknown and are assumed to be equal, the test statistic for the pooled t-test (often called the two-sample t-test) is used.
It is given by:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

where $s_p$ is computed from the pooled variance given by this equation:

$$s_p^2 = \frac{s_1^2(n_1 - 1) + s_2^2(n_2 - 1)}{n_1 + n_2 - 2}$$

When the variances of the two normal populations are unknown and are not equal, the test statistic

$$t' = \frac{(\bar{X}_1 - \bar{X}_2) - d_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

has an approximate t-distribution with approximate degrees of freedom

$$v = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$

Example 1. An experiment was performed to compare the hardness of two different materials. Twelve pieces of material A were tested by exposing each piece to a Brinell Hardness Tester. Ten pieces of material B were also tested in the same machine. In each case the hardness was determined and recorded. The samples of material A gave an average hardness of 85 with a sample standard deviation of 4, while the samples of material B gave an average of 81 and a standard deviation of 5. Can we conclude at the 5% level of significance that the hardness of material A exceeds that of material B by more than 2 BHN?

Solution: Let $\mu_1$ and $\mu_2$ be the population means of the hardness of material A and material B, respectively. The population variances are unknown, and let us first assume that they are equal. Since $n < 30$, the t-test will be used.

1. $H_0: \mu_1 - \mu_2 = 2$; $H_1: \mu_1 - \mu_2 > 2$
2. $\alpha = 0.05$, right-tailed test
3. Test statistic: $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - d_0}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
4. Critical region: $t > 1.725$. Reject $H_0$ if $t_c$ is greater than 1.725. (This is obtained from the table for the Critical Values of the t-Distribution at degrees of freedom $\nu = 12 + 10 - 2 = 20$ and $\alpha = 0.05$.)
5. Computing the t-statistic, starting with $s_p$:

$$s_p = \sqrt{\frac{s_1^2(n_1 - 1) + s_2^2(n_2 - 1)}{n_1 + n_2 - 2}} = \sqrt{\frac{(11)(16) + (9)(25)}{12 + 10 - 2}} = 4.478$$

$$t_c = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{(85 - 81) - 2}{4.478\sqrt{\dfrac{1}{12} + \dfrac{1}{10}}} = 1.04$$

6. Do not reject $H_0$ since 1.04 is less than 1.725.
7. Therefore, we are unable to conclude that the hardness of material A exceeds that of material B by more than 2 units at the 5% level of significance.

Using the P-value approach, the P-value corresponding to $t > 1.04$ is about 0.16, which is higher than 0.05.

9.3. Inference on the Variance of Two Normal Distributions

Now, consider the problem of testing the equality of the variances $\sigma_1^2$ and $\sigma_2^2$ of two populations. The null hypothesis $H_0$ that $\sigma_1^2 = \sigma_2^2$ is tested against one of the usual alternatives $\sigma_1^2 < \sigma_2^2$, $\sigma_1^2 > \sigma_2^2$ or $\sigma_1^2 \neq \sigma_2^2$. For independent random samples of size $n_1$ and $n_2$, respectively, from the two populations, the f-value for testing $\sigma_1^2 = \sigma_2^2$ is the ratio

$$f = \frac{s_1^2}{s_2^2}$$

where $s_1^2$ and $s_2^2$ are the variances computed from the two samples. If the two populations are approximately normally distributed and the null hypothesis is true, the ratio f is a value of the F-distribution with $v_1 = n_1 - 1$ and $v_2 = n_2 - 1$ degrees of freedom. Therefore, the critical regions corresponding to the one-sided alternatives $\sigma_1^2 < \sigma_2^2$ and $\sigma_1^2 > \sigma_2^2$ are, respectively, $f < f_{1-\alpha}(v_1, v_2)$ and $f > f_\alpha(v_1, v_2)$. For the two-sided alternative $\sigma_1^2 \neq \sigma_2^2$, the critical region is $f < f_{1-\alpha/2}(v_1, v_2)$ or $f > f_{\alpha/2}(v_1, v_2)$.

Example 1. In testing for the difference in the hardness of the two materials in the previous example, the variances of the two populations were assumed to be equal. Is this assumption justified? Use a 0.10 level of significance.

Solution: Let $\sigma_1^2$ and $\sigma_2^2$ be the population variances for the hardness of material A and material B, respectively.

1. $H_0: \sigma_1^2 = \sigma_2^2$; $H_1: \sigma_1^2 \neq \sigma_2^2$
2. $\alpha = 0.10$, two-tailed test
3. Test statistic: $f = \dfrac{s_1^2}{s_2^2}$
4. Critical region: $f < 0.34$ or $f > 3.11$. Reject $H_0$ if $f_c$ is less than 0.34 or greater than 3.11. (This is obtained from the table for the Critical Values of the F-Distribution.) Take note that $f_{0.95}$, which is 0.34, is obtained from:

$$f_{1-\alpha}(v_1, v_2) = \frac{1}{f_\alpha(v_2, v_1)}, \qquad f_{0.95}(11, 9) = \frac{1}{f_{0.05}(9, 11)} = \frac{1}{2.90} = 0.34$$

5. Computing the f-statistic:

$$f = \frac{s_1^2}{s_2^2} = \frac{16}{25} = 0.64$$

6. Do not reject $H_0$ since $0.34 < 0.64 < 3.11$.
7. Therefore, the assumption of equal variances is justified at the 10% level of significance.

9.4. Inference on Two Population Proportions

In testing the null hypothesis that two proportions, or binomial parameters, are equal, the hypothesis $p_1 = p_2$ is tested against one of the alternatives $p_1 < p_2$, $p_1 > p_2$ or $p_1 \neq p_2$. The statistic on which the decision is based is the random variable $\hat{P}_1 - \hat{P}_2$. Independent samples of size $n_1$ and $n_2$ are selected at random from two binomial populations and the
proportion of successes $\hat{P}_1$ and $\hat{P}_2$ for the two samples is computed. In the construction of confidence intervals for $p_1$ and $p_2$, it was noted that, for $n_1$ and $n_2$ sufficiently large, the point estimator $\hat{P}_1 - \hat{P}_2$ is approximately normally distributed with mean

$$\mu_{\hat{P}_1 - \hat{P}_2} = p_1 - p_2$$

and variance

$$\sigma^2_{\hat{P}_1 - \hat{P}_2} = \frac{p_1q_1}{n_1} + \frac{p_2q_2}{n_2}$$

Therefore, acceptance and critical regions can be established by using the standard normal variable

$$Z = \frac{(\hat{P}_1 - \hat{P}_2) - (p_1 - p_2)}{\sqrt{\dfrac{p_1q_1}{n_1} + \dfrac{p_2q_2}{n_2}}}$$

When $H_0$ is true, $p_1 = p_2 = p$ and $q_1 = q_2 = q$, and the formula for Z becomes

$$Z = \frac{\hat{P}_1 - \hat{P}_2}{\sqrt{pq\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

where the pooled estimate of the proportion p is

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$

and $x_1$ and $x_2$ are the numbers of successes in each of the two samples. Substituting $\hat{p}$ for p and $\hat{q} = 1 - \hat{p}$ for q, the z-value for testing $p_1 = p_2$ is determined from the formula

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

Hence, for the alternative $p_1 \neq p_2$ at the $\alpha$ level of significance, the critical region is $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$. For the one-sided alternative $p_1 < p_2$, the critical region is $z < -z_\alpha$, and for the alternative $p_1 > p_2$, the critical region is $z > z_\alpha$.

Example 1. A telecommunication company proposed the construction of a cell site tower in a certain city. To determine whether this is to be constructed, a vote is to be taken among the residents of the city and the surrounding barangays. Many residents in the barangays feel that the proposal will pass because of the large proportion of city voters who favor the construction. A poll is taken to determine if there is a significant difference in the proportion of city voters and barangay voters favoring the proposal. If 180 of 300 city voters favor the proposal and 280 of 500 barangay residents favor it, would you agree that the proportion of city voters favoring the proposal is higher than the proportion of barangay voters? Use a 0.025 level of significance.

Solution:

1. $H_0: p_1 = p_2$; $H_1: p_1 > p_2$
2. $\alpha = 0.025$, right-tailed test
3. Test statistic: $z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$
4. Critical region: $z > 1.96$. Reject $H_0$ if z is greater than 1.96.
5. Computing the z-statistic:

$$\hat{p}_1 = \frac{x_1}{n_1} = \frac{180}{300} = 0.60, \qquad \hat{p}_2 = \frac{x_2}{n_2} = \frac{280}{500} = 0.56$$

$$\hat{p} = \frac{180 + 280}{300 + 500} = \frac{460}{800} = 0.575, \qquad \hat{q} = 1 - 0.575 = 0.425$$

$$z = \frac{0.60 - 0.56}{\sqrt{(0.575)(0.425)\left(\dfrac{1}{300} + \dfrac{1}{500}\right)}} = 1.108$$

6. Do not reject $H_0$ since $1.108 < 1.96$.
7. Therefore, we do not agree that the proportion of city voters in favor of the construction of the cell site tower is higher than the proportion of barangay voters.

REFERENCES:

Garcia, George A. Fundamental Concepts and Methods in Statistics. Manila: University of Sto. Tomas Publishing House, 2004.
Montgomery, Douglas C., et al. Applied Statistics and Probability for Engineers, 7th ed. John Wiley & Sons (Asia) Pte Ltd, 2018.
Walpole, Ronald E., et al. Probability and Statistics for Engineers and Scientists, 9th ed. Pearson Education Inc., 2016.

CHAPTER TEST

Solve the following problems completely.

1. Professor X teaches the course Engineering Data Analysis (EDA) using the conventional method in one of his classes. He then began to teach the course using computers and statistical software in a second class. Professor X gives the same examinations to these two classes. It was observed that the students who are taught using computers and statistical software tend to get higher scores, but this is not true every time. He decides to test this hypothesis at the 1% level of significance. From the final exam results, he takes a random sample of 15 students from the first class and 10 from the second class.
He gets the following results: for the class using the conventional method, a mean of 84 and a standard deviation of 8; for the second class using computers and statistical software, a mean of 92 and a standard deviation of 5. As a student of EDA, will you agree with Professor X?
2. A wire and cable company claims that the average tensile strength of cable A exceeds that of cable B by at least 15 kilograms. To test this claim, 100 pieces of each type of cable are tested under similar conditions. Cable A has an average tensile strength of 87.6 kilograms with a standard deviation of 6.82 kilograms, while cable B has an average tensile strength of 78.8 kilograms with a standard deviation of 5.66 kilograms. Test the manufacturer's claim using a 0.01 level of significance.
3. A cosmetic company would like to determine the level of acceptance of its new product among its customers, who are normally young ladies (teenagers) and women. A random sample of 300 teenagers and 250 women is selected. Among the teenagers, 100 affirm that they will buy the product, while 65 women said that they will buy the new product. With this information, is there a significant difference in the level of acceptance of the new product between the teenagers and the women? Use the 5% level of significance.
4. Using the same problem about the hardness of two materials, conduct hypothesis testing, but this time assume that the variances are not equal.

Chapter 10
SIMPLE LINEAR REGRESSION AND CORRELATION

Introduction

Another area of statistics is regression and correlation, which involves determining whether a relationship between two or more numerical or quantitative variables exists. To describe the nature of the relationship, regression is used.
There are two types of relationships: simple, when there are two variables under study, and multiple, when there are many variables under study. Simple relationships can be further classified as positive or negative. A positive relationship exists when both variables increase or decrease at the same time. A negative relationship exists when one variable increases while the other decreases, or vice versa.

The fundamental principles of linear regression and correlation analysis will be discussed in this chapter. It includes empirical models using linear regression, their estimation using the least-squares approach, hypothesis testing using the t-test and analysis of variance (ANOVA), prediction of future observations using the model, determination of the adequacy of the model using residual analysis and the coefficient of determination, and the correlation model.

Intended Learning Outcomes

At the end of this module, it is expected that the students will be able to:

1. Construct empirical models using simple linear regression.
2. Estimate the parameters in a linear regression model using the least-squares approach.
3. Test hypotheses on simple linear regression.
4. Predict future observations using the regression model.
5. Determine the adequacy of the regression model using residual analysis and the coefficient of determination.
6. Apply the correlation model.

10.1. Empirical Models

Many problems in engineering and the sciences involve analysis of the relationship between variables. The pressure and temperature of a gas in a container, the velocity and the area of a channel, and the displacement and velocity of a particle are related to each other. In the case of the displacement and velocity relationship, if $d_0$ is the displacement of the particle at time t = 0 and v is the velocity, then the displacement at any time t is $d_t = d_0 + vt$.
This is an example of a deterministic linear relationship because the model predicts displacement perfectly.

However, in many situations, the relationship between variables is not deterministic. For example, consider the fuel usage of a car (y) and its weight (x), or the electrical energy consumption of a house (y) and the size of the house (x) in square feet: y and x are related, but the relationships are not deterministic. This means that the value of y (fuel usage, energy consumption) cannot be predicted perfectly from the corresponding value of x. Different cars may have different fuel usage even if they have the same weight, and different houses may consume different amounts of electrical energy even if they are of the same size.

Regression analysis is the collection of statistical tools used to model and explore relationships between variables that are related in a nondeterministic manner. It is one of the most widely used statistical tools because these types of problems occur so frequently in many fields of engineering and science.

In this chapter, only one independent variable x will be considered, and the relationship with the response y is assumed to be linear. This may seem to be a simple scenario, but many practical problems fit this assumption. For example, in a chemical process, suppose that the yield of the product is related to the operating temperature. Regression analysis can be used to build a model that predicts yield at a given temperature. It can also be used for process optimization or for process control purposes.

To illustrate, consider the data presented in the table below. This shows the purity of oxygen (y) produced in a chemical distillation process and the percentage of hydrocarbons (x) present in the main condenser unit.
Table 1. Purity of Oxygen and Percentage of Hydrocarbons

| Observation Number | Percentage of Hydrocarbons, x (%) | Purity of Oxygen, y (%) |
|---|---|---|
| 1 | 0.99 | 90.01 |
| 2 | 1.15 | 91.43 |
| 3 | 1.46 | 96.73 |
| 4 | 0.87 | 87.59 |
| 5 | 1.55 | 99.42 |
| 6 | 1.19 | 93.54 |
| 7 | 0.98 | 90.56 |
| 8 | 1.11 | 89.85 |
| 9 | 1.26 | 93.25 |
| 10 | 1.43 | 94.98 |
| 11 | 1.02 | 89.05 |
| 12 | 1.29 | 93.74 |
| 13 | 1.36 | 94.45 |
| 14 | 1.23 | 91.77 |
| 15 | 1.40 | 93.65 |
| 16 | 1.15 | 92.52 |
| 17 | 1.01 | 89.54 |
| 18 | 1.20 | 90.39 |
| 19 | 1.32 | 93.41 |
| 20 | 0.95 | 87.33 |

A scatter plot diagram of the data in Table 1 is presented in Figure 4. A scatter plot is a graph of the ordered pairs (x, y) consisting of the independent variable x and the dependent variable y. The independent variable, the variable that can be controlled or manipulated, is plotted on the horizontal axis. The dependent variable, plotted on the vertical axis, is the variable that cannot be controlled or manipulated. The purpose of this graph is to determine the nature of the relationship between the variables, which may be positive linear, negative linear, curvilinear, or no relationship.

Figure 4. Scatter plot diagram showing the purity of oxygen and percentage of hydrocarbon in a distillation unit

It can be seen from the scatter plot diagram that there is no simple curve that will pass exactly through all the given data points. But there is a strong indication that these data points lie scattered randomly around a straight line. Therefore, it is probably reasonable to assume that a straight-line relationship exists between the mean of the purity of oxygen (y) and the percentage of hydrocarbon present (x). That is, $E(y \mid x) = \mu_{y|x} = a + bx$, where a and b are the intercept and slope, respectively. They are called regression coefficients.
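As a rough illustration of the scatter plot described above, the sketch below draws the Table 1 data as a text-mode scatter plot using only the Python standard library. The function name `text_scatter` and the grid size are illustrative choices and are not part of the module; in practice a plotting library or a spreadsheet chart would normally be used instead.

```python
# Table 1 data: hydrocarbon level x (%) and oxygen purity y (%)
x = [0.99, 1.15, 1.46, 0.87, 1.55, 1.19, 0.98, 1.11, 1.26, 1.43,
     1.02, 1.29, 1.36, 1.23, 1.40, 1.15, 1.01, 1.20, 1.32, 0.95]
y = [90.01, 91.43, 96.73, 87.59, 99.42, 93.54, 90.56, 89.85, 93.25, 94.98,
     89.05, 93.74, 94.45, 91.77, 93.65, 92.52, 89.54, 90.39, 93.41, 87.33]


def text_scatter(xs, ys, cols=50, rows=15):
    """Return a list of strings forming a crude scatter plot.

    The top row corresponds to the largest y, the rightmost column to the
    largest x. Points that fall in the same cell share one '*'.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * cols for _ in range(rows)]
    for xi, yi in zip(xs, ys):
        col = round((xi - x_min) / (x_max - x_min) * (cols - 1))
        row = round((yi - y_min) / (y_max - y_min) * (rows - 1))
        grid[rows - 1 - row][col] = "*"  # flip so large y is at the top
    return ["".join(line) for line in grid]


plot = text_scatter(x, y)
print("\n".join(plot))
```

The points should drift from the lower left to the upper right, which is the positive linear pattern the text describes.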
Although the mean of y is a linear function of x, the actual observed value y does not fall exactly on a straight line. To generalize this to a probabilistic linear model, it is necessary to assume that the expected value of y is a linear function of x but that, for a fixed value of x, the actual value of y is determined by the mean value function of the linear model plus a random error term. That is, $y = a + bx + \varepsilon$, where $\varepsilon$ is the random error term. The equation has only one independent variable or regressor, and the model is called the simple linear regression model.

Sometimes a model like this arises from a theoretical relationship. At other times, there is no theoretical knowledge of the relationship between x and y, and the choice of the model is based on inspection of a scatter plot diagram, as in the example above. The regression model is then thought of as an empirical model.

Figure 5 shows the scatter plot diagram with an added trend line whose equation is $y = 75 + 15x$. This was obtained in an Excel sheet by plotting the data points in a scatter diagram and adding a trend line with its equation displayed. The slope and y-intercept are rounded off to integers. With this model we can estimate the value of y for any given value of x; for example, at x = 1.00 the model gives y = 75 + 15(1.00) = 90.

Figure 5. Linear regression model showing the relationship of y and x

10.2. Regression: Modeling Linear Relationships - The Least Squares Approach

A simple linear regression has only one dependent or response variable (y) and one independent, regressor, or predictor variable (x). Suppose that the value of y at each value of x is a random variable and that the true relationship between them is a straight line.
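The idea that y at a fixed x is a random variable scattered around a straight line can be illustrated with a small simulation. The sketch below assumes, purely for illustration, the rounded trend line y = 75 + 15x quoted above, normally distributed errors, and an error standard deviation of 1; these are assumptions for the demonstration, not estimates from the text.

```python
import random

A, B = 75.0, 15.0   # assumed intercept and slope (rounded trend line)
SIGMA = 1.0         # assumed error standard deviation (illustrative only)


def simulate_y(x, rng):
    """Draw one observation from y = A + B*x + eps, with eps ~ N(0, SIGMA^2)."""
    return A + B * x + rng.gauss(0.0, SIGMA)


rng = random.Random(403)  # fixed seed so the run is repeatable
x0 = 1.00
draws = [simulate_y(x0, rng) for _ in range(10_000)]
mean_y = sum(draws) / len(draws)

# Individual observations differ even though x is held fixed ...
print("five simulated y values at x = 1.00:", [round(v, 2) for v in draws[:5]])
# ... but their average settles near the mean function a + b*x.
print("mean of simulated y:", round(mean_y, 2), "| a + b*x =", A + B * x0)
```

Repeated observations at the same x differ from one another, yet their average stays close to a + bx, which is what the model's mean function describes.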
As mentioned above, the expected value of y for each value of x is

$$E(y \mid x) = \mu_{y|x} = a + bx$$

where a and b are the intercept and slope, respectively, called regression coefficients. This assumes that y can be described by the model

$$y = a + bx + \varepsilon$$

where $\varepsilon$ is the random error term with mean zero and unknown variance $\sigma^2$. It is also assumed that the random errors corresponding to different observations are uncorrelated random variables.

Suppose that we have n pairs of observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, as in the data presented in Table 1 and in the scatter plot in Figure 4. These data are used to estimate the regression line. The estimates of a and b should result in a line that is the "best fit" to the given data. Karl Gauss (1777–1855), a German scientist, proposed estimating the parameters a and b so as to minimize the sum of the squares of the vertical deviations, as shown in Figure 6.

Figure 6. Deviations of the data from the estimated regression model

This criterion for estimating the regression coefficients is called the method of least squares. Using the equation $y = a + bx + \varepsilon$, the n observations in the sample may be expressed as

$$y_i = a + bx_i + \varepsilon_i, \qquad i = 1, 2, \ldots, n$$

The least-squares estimates of the slope and intercept of the linear regression model are:

$$\hat{b} = \frac{SS_{xy}}{SS_{xx}}, \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x}$$

where:

$$SS_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}, \qquad SS_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}, \qquad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

The "best fit" or estimated regression line is therefore:

$$\hat{y} = \hat{a} + \hat{b}x$$

Note that each pair of observations satisfies the relationship

$$y_i = \hat{a} + \hat{b}x_i + e_i, \qquad i = 1, 2, \ldots, n$$

where $e_i$ is called the residual and is computed as $e_i = y_i - \hat{y}_i$. The residual describes the error in the fit of the model to the $i$th observation $y_i$. In Section 10.6 the residuals will be used to determine the adequacy of the fitted regression model.

Example 1.
Using the data in Table 1:

| Quantity | Value |
|---|---|
| $n$ | 20 |
| $\sum x_i$ | 23.92 |
| $\bar{x}$ | 1.1960 |
| $\sum y_i$ | 1,843.21 |
| $\bar{y}$ | 92.1605 |
| $\sum x_i^2$ | 29.2892 |
| $\sum y_i^2$ | 170,044.5321 |
| $\sum x_i y_i$ | 2,214.6566 |

Solution:

$$SS_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 29.2892 - \frac{(23.92)^2}{20} = 0.68088$$

$$SS_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} = 2{,}214.6566 - \frac{(23.92)(1{,}843.21)}{20} = 10.17744$$

$$\hat{b} = \frac{SS_{xy}}{SS_{xx}} = \frac{10.17744}{0.68088} = 14.94748$$

$$\hat{a} = \bar{y} - \hat{b}\,\bar{x} = 92.1605 - (14.94748)(1.196) = 74.28331$$

Therefore:

$$\hat{y} = \hat{a} + \hat{b}x = 74.283 + 14.947x$$

The residuals $e_i = y_i - \hat{y}_i$ are used to obtain an estimate of the variance of the error term, $\sigma^2$. The sum of the squares of the residuals is called the error sum of squares:

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

The expected value of the error sum of squares is $E(SSE) = (n - 2)\sigma^2$. Therefore an unbiased estimator of the variance is

$$\hat{\sigma}^2 = \frac{SSE}{n - 2}$$

Computing $SSE$ and $\hat{\sigma}^2$ using the above data:

$$SSE = 21.2498, \qquad \hat{\sigma}^2 = \frac{21.2498}{18} = 1.1805$$

10.3. Correlation: Estimating the Strength of Linear Relation

Correlation is a statistical method used to determine whether there is a relationship between variables and how strong that relationship is. Statisticians use a measure called the correlation coefficient, which measures how closely the points in a scatter diagram are spread around a line. The symbol for the sample correlation coefficient is r. The symbol for the population correlation coefficient is the Greek letter rho ($\rho$).

The correlation coefficient can be calculated using the following equation:

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}$$

where

$$SS_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}, \qquad SS_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}, \qquad SS_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n}$$

where n is the number of data pairs and SS denotes a sum of squares.

Example 1.
Determine the correlation coefficient of the previous example.

$$SS_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 0.68088$$

$$SS_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n} = 173.3769$$

$$SS_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} = 10.17744$$

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \frac{10.17744}{\sqrt{(0.68088)(173.3769)}} = 0.9367$$

A correlation coefficient of 0.9367 indicates a strong positive linear relationship between the two variables. Since $r^2 = 0.8774$, approximately 88% of the variation in the y values is accounted for by a linear relationship with x.

10.4. Hypothesis Tests in Simple Linear Regression

Testing statistical hypotheses about the model parameters and constructing certain confidence intervals are an important part of assessing the adequacy of a linear regression model. To test hypotheses about the slope and intercept of the regression model, the error component in the model, $\varepsilon$, is assumed to be normally and independently distributed with mean zero and variance $\sigma^2$, abbreviated NID$(0, \sigma^2)$.

10.4.1. Use of t-tests

Suppose that we wish to test the hypothesis that the slope equals a constant, say $b_0$. Assuming a two-sided alternative, the appropriate hypotheses are

$$H_0: b = b_0, \qquad H_1: b \neq b_0$$

Because the errors $\varepsilon_i$ are NID$(0, \sigma^2)$, it follows directly that the observations $y_i$ are NID$(a + bx_i, \sigma^2)$. Now $\hat{b}$ is a linear combination of independent normal random variables and, consequently, $\hat{b}$ is $N(b, \sigma^2/S_{xx})$, using the bias and variance properties of the slope. (Here and in the sections that follow, $S_{xx}$, $S_{xy}$, and $S_{yy}$ denote the same quantities written $SS_{xx}$, $SS_{xy}$, and $SS_{yy}$ above.) In addition, $(n - 2)\hat{\sigma}^2/\sigma^2$ has a chi-square distribution with $n - 2$ degrees of freedom, and $\hat{b}$ is independent of $\hat{\sigma}^2$. As a result of those properties, the test statistic for the slope,

$$t_0 = \frac{\hat{b} - b_0}{\sqrt{\hat{\sigma}^2 / S_{xx}}} = \frac{\hat{b} - b_0}{se(\hat{b})}$$

follows the t distribution with $n - 2$ degrees of freedom under $H_0$, where $\sqrt{\hat{\sigma}^2/S_{xx}}$ is the standard error of the slope, $se(\hat{b})$.

The same procedure can be used to test the hypotheses about the y-intercept.
The hypotheses are:

$$H_0: a = a_0, \qquad H_1: a \neq a_0$$

The test statistic is:

$$t_0 = \frac{\hat{a} - a_0}{\sqrt{\hat{\sigma}^2\left[\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}\right]}} = \frac{\hat{a} - a_0}{se(\hat{a})}$$

where $\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]}$ is the standard error of the intercept, $se(\hat{a})$.

In both cases, the null hypothesis is rejected if the computed value of the test statistic, $t_0$, is such that

$$|t_0| > t_{\alpha/2,\,n-2}$$

A very important special case of the hypotheses $H_0: b = b_0$, $H_1: b \neq b_0$ is

$$H_0: b = 0, \qquad H_1: b \neq 0$$

These hypotheses relate to the significance of regression. Failure to reject the null hypothesis $H_0: b = 0$ is equivalent to concluding that there is no linear relationship between the dependent and the independent variables; this may also indicate that the true relationship between the two variables is not linear. If the null hypothesis $H_0: b = 0$ is rejected, it could mean either that the straight-line model is adequate or that, although there is a linear effect of the independent variable, better results could be obtained by adding higher-order terms.

10.4.2. Analysis of Variance Approach to Test Significance of Regression

The analysis-of-variance (ANOVA) approach is used in analyzing the quality of the estimated regression line. It is a procedure in which the total variation in the dependent variable is subdivided into meaningful components that are then observed and treated systematically. Suppose that we have $n$ experimental data points in the usual form $(x_i, y_i)$ and that the regression line is estimated. The estimation of $\sigma^2$ rests on the identity

$$S_{yy} = \hat{b}\,S_{xy} + SSE$$

An alternative and more informative formulation partitions the total corrected sum of squares of $y$ into two components:

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Symbolically,

$$SST = SSR + SSE$$

where SSR is the regression sum of squares, which reflects the amount of variation in the y-values explained by the straight-line model, and SSE is the error sum of squares, which reflects the variation about the regression line.
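The t-test for significance of regression and the partition SST = SSR + SSE can be checked numerically on the Table 1 data. The sketch below is one possible way to do this in plain Python; it computes the sums of squares in the mathematically equivalent centered form, the variable names are illustrative, and the critical value 2.101 is the tabulated $t_{0.025,18}$ for a two-sided test at the 5% level (an assumed significance level for this illustration).

```python
from math import sqrt

# Table 1 data: hydrocarbon level x (%) and oxygen purity y (%)
x = [0.99, 1.15, 1.46, 0.87, 1.55, 1.19, 0.98, 1.11, 1.26, 1.43,
     1.02, 1.29, 1.36, 1.23, 1.40, 1.15, 1.01, 1.20, 1.32, 0.95]
y = [90.01, 91.43, 96.73, 87.59, 99.42, 93.54, 90.56, 89.85, 93.25, 94.98,
     89.05, 93.74, 94.45, 91.77, 93.65, 92.52, 89.54, 90.39, 93.41, 87.33]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Centered sums of squares (equal to the SS_xx and SS_xy formulas)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares estimates and fitted values
b_hat = s_xy / s_xx
a_hat = y_bar - b_hat * x_bar
y_hat = [a_hat + b_hat * xi for xi in x]

# ANOVA partition of the total corrected sum of squares
sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))

# t statistic for H0: b = 0 against H1: b != 0
sigma2_hat = sse / (n - 2)
se_b = sqrt(sigma2_hat / s_xx)
t0 = (b_hat - 0.0) / se_b
T_CRIT = 2.101  # tabulated t(0.025, 18); assumes a 5% two-sided test

print(f"b_hat = {b_hat:.5f}, a_hat = {a_hat:.5f}")
print(f"SST = {sst:.4f}, SSR = {ssr:.4f}, SSE = {sse:.4f}")
print(f"SSR + SSE = {ssr + sse:.4f}")
print(f"t0 = {t0:.2f}, reject H0: {abs(t0) > T_CRIT}")
```

With these data the statistic is far beyond the critical value, so the hypothesis of zero slope is rejected, and the two components add up to the total sum of squares as the identity requires.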
If we are to test the hypothesis $H_0: \beta = 0$ against $H_1: \beta \neq 0$, the null hypothesis says that the model is $\mu_{y|x} = \alpha$. (Here $\alpha$ and $\beta$ denote the intercept and slope, written a and b in the earlier sections.) This means that the variation in $Y$ results from chance fluctuations that are independent of $x$. To test this hypothesis, the f-statistic is used. It is given by

$$f = \frac{SSR/1}{SSE/(n - 2)} = \frac{SSR}{s^2}$$

where $s^2 = SSE/(n - 2)$ is the estimate of the error variance. The null hypothesis is rejected at significance level $\alpha$ if $f > f_\alpha(1, n - 2)$. An analysis-of-variance table summarizing how to compute the f-statistic is presented below.

Table 2. Analysis of Variance for Testing $\beta = 0$

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | Computed f |
|---|---|---|---|---|
| Regression | SSR | 1 | SSR | SSR/s² |
| Error | SSE | n − 2 | s² = SSE/(n − 2) | |
| Total | SST | n − 1 | | |

10.5. Prediction of New Observations

One of the reasons for building a linear regression model is to predict response values at one or more values of the independent variable. This section focuses on the errors associated with that prediction.

The equation $\hat{y} = \hat{a} + \hat{b}x$ may be used to predict or estimate the mean response $\mu_{Y|x_0}$ at $x = x_0$. It can also be used to predict a single value of the response when $x = x_0$. The error of prediction is expected to be larger when predicting a single value than when predicting a mean value, and this affects the width of the intervals for the values being predicted.

To construct a confidence interval for $\mu_{Y|x_0}$, we use the point estimator $\hat{Y}_0 = A + Bx_0$ to estimate $\mu_{Y|x_0} = \alpha + \beta x_0$, where A and B denote the least-squares estimators of the intercept and slope, regarded as random variables. It can be shown that the sampling distribution of $\hat{Y}_0$ is normal with mean

$$\mu_{\hat{Y}_0} = E(\hat{Y}_0) = E(A + Bx_0) = \alpha + \beta x_0 = \mu_{Y|x_0}$$

and variance

$$\sigma^2_{\hat{Y}_0} = \sigma^2_{A + Bx_0} = \sigma^2_{\bar{Y} + B(x_0 - \bar{x})} = \sigma^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]$$

Thus, the $(1 - \alpha)100\%$ confidence interval on the mean response $\mu_{Y|x_0}$ can now be constructed from the statistic:
$$T = \frac{\hat{Y}_0 - \mu_{Y|x_0}}{S\sqrt{\dfrac{1}{n} + \dfrac{(x_0 - \bar{x})^2}{S_{xx}}}}$$

which has a t-distribution with $n - 2$ degrees of freedom. That is,

$$\hat{y}_0 - t_{\alpha/2}\,s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} < \mu_{Y|x_0} < \hat{y}_0 + t_{\alpha/2}\,s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$

where $t_{\alpha/2}$ is a value of the t-distribution with $n - 2$ degrees of freedom.

Example 1. Using the above example about the level of purity of oxygen, construct a 95% confidence interval on the mean response. In particular, estimate the mean oxygen purity when the hydrocarbon level is $x_0 = 1.00\%$.

Solution: The fitted model is $\hat{\mu}_{Y|x_0} = 74.283 + 14.947x_0$. With $t_{0.025,18} = 2.101$ and $s^2 = \hat{\sigma}^2 = 1.18$, the 95% confidence interval is

$$\hat{\mu}_{Y|x_0} \pm 2.101\sqrt{1.18\left[\frac{1}{20} + \frac{(x_0 - 1.1960)^2}{0.68088}\right]}$$

When $x_0 = 1.00\%$,

$$\hat{\mu}_{Y|x_0} = 74.283 + 14.947(1.00) = 89.23$$

So the confidence interval is computed as

$$89.23 \pm 2.101\sqrt{1.18\left[\frac{1}{20} + \frac{(1.00 - 1.1960)^2}{0.68088}\right]} = 89.23 \pm 0.75$$

Therefore the 95% confidence interval on $\mu_{Y|1.00}$ is

$$88.48 \leq \mu_{Y|1.00} \leq 89.98$$

Now consider the prediction interval for a single response. A $(1 - \alpha)100\%$ prediction interval for a single response $y_0$ is given by:

$$\hat{y}_0 - t_{\alpha/2}\,s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} < y_0 < \hat{y}_0 + t_{\alpha/2}\,s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$

where $t_{\alpha/2}$ is a value of the t-distribution with $n - 2$ degrees of freedom.

Example 2. Using the above example about the level of purity of oxygen, find a 95% prediction interval on the next observation of the level of purity of oxygen at $x_0 = 1.00\%$. Recall that $\hat{y}_0 = 89.23$.

Solution:

$$89.23 - 2.101\sqrt{1.18\left[1 + \frac{1}{20} + \frac{(1.00 - 1.1960)^2}{0.68088}\right]} \leq Y_0 \leq 89.23 + 2.101\sqrt{1.18\left[1 + \frac{1}{20} + \frac{(1.00 - 1.1960)^2}{0.68088}\right]}$$

Simplifying,

$$86.83 \leq Y_0 \leq 91.63$$

10.6. Adequacy of the Regression Model

Several assumptions are required to fit a regression model. Estimating the parameters of the model requires the assumption that the errors are uncorrelated random variables with mean zero and constant variance.
Tests of hypotheses and interval estimation require that the errors be normally distributed. In addition, we assume that the order of the model is correct; that is, if we fit a simple linear regression model, we are assuming that the phenomenon actually behaves in a linear or first-order manner. It is always necessary to consider the validity of these assumptions, and analyses to examine the adequacy of the model should be conducted. These can be done through residual analysis and the coefficient of determination.

10.6.1. Residual Analysis

The residuals from a regression model are $e_i = y_i - \hat{y}_i$, $i = 1, 2, \ldots, n$, where $y_i$ is an actual observation and $\hat{y}_i$ is the corresponding fitted value from the regression model. Residual analysis is frequently helpful in checking the assumption that the errors are approximately normally distributed with constant variance and in determining whether additional terms in the model would be useful. A frequency histogram of the residuals or a normal probability plot of the residuals can be constructed and used as an approximate check of normality. Plots of the residuals against the fitted values or against x typically resemble one of the patterns in Figure 7: pattern (a) is satisfactory, the funnel (b) and double bow (c) suggest that the variance is not constant, and the nonlinear pattern (d) suggests that higher-order terms are needed in the model.

Figure 7. Patterns for residual plots. (a) Satisfactory (b) Funnel (c) Double bow (d) Nonlinear

Example 1. Determine the residuals of the previous problem and plot the graph.
Solution:

| Hydrocarbon level, x | Oxygen purity, y | Predicted value, $\hat{y}$ | Residual, $e = y - \hat{y}$ |
|---|---|---|---|
| 0.99 | 90.01 | 89.081 | 0.929 |
| 1.15 | 91.43 | 91.472 | -0.042 |
| 1.46 | 96.73 | 96.106 | 0.624 |
| 0.87 | 87.59 | 87.287 | 0.303 |
| 1.55 | 99.42 | 97.451 | 1.969 |
| 1.19 | 93.54 | 92.070 | 1.470 |
| 0.98 | 90.56 | 88.931 | 1.629 |
| 1.11 | 89.85 | 90.874 | -1.024 |
| 1.26 | 93.25 | 93.116 | 0.134 |
| 1.43 | 94.98 | 95.657 | -0.677 |
| 1.02 | 89.05 | 89.529 | -0.479 |
| 1.29 | 93.74 | 93.565 | 0.175 |
| 1.36 | 94.45 | 94.611 | -0.161 |
| 1.23 | 91.77 | 92.668 | -0.898 |
| 1.40 | 93.65 | 95.209 | -1.559 |
| 1.15 | 92.52 | 91.472 | 1.048 |
| 1.01 | 89.54 | 89.379 | 0.161 |
| 1.20 | 90.39 | 92.219 | -1.829 |
| 1.32 | 93.41 | 94.013 | -0.603 |
| 0.95 | 87.33 | 88.483 | -1.153 |

Figure 8. Normal probability plot of residuals (left). Plot of residuals versus predicted values (center). Plot of residuals versus hydrocarbon level (right)

10.6.2. Coefficient of Determination

The coefficient of determination, $R^2$, is often used to judge the adequacy of a regression model. It is the square of the correlation coefficient between jointly distributed random variables $X$ and $Y$, and it satisfies $0 \leq R^2 \leq 1$ by the analysis of variance identity. $R^2$ is often referred to as the amount of variability in the data explained or accounted for by the regression model. For the oxygen purity regression model we have

$$R^2 = \frac{SSR}{SST} = \frac{152.13}{173.38} = 0.877$$

that is, the model accounts for 87.7% of the variability in the data.

It is always possible to make $R^2$ unity by adding enough terms to the model, and therefore $R^2$ should always be used with caution. For example, a "perfect" fit can be obtained with a polynomial of degree $n - 1$. Generally, $R^2$ will increase if a variable is added to the model, but this does not necessarily imply that the new model is superior to the old one.
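The caution about $R^2$ can be checked numerically. The sketch below fits a straight line and, as an illustrative extra model that the text itself does not fit, a quadratic to the Table 1 data by solving the least-squares normal equations in plain Python. The helper names (`solve`, `fit_poly`, `summarize`) are hypothetical, and $R^2$ is computed as $1 - SSE/SST$, which equals $SSR/SST$ because $SST = SSR + SSE$.

```python
# Table 1 data: hydrocarbon level x (%) and oxygen purity y (%)
x = [0.99, 1.15, 1.46, 0.87, 1.55, 1.19, 0.98, 1.11, 1.26, 1.43,
     1.02, 1.29, 1.36, 1.23, 1.40, 1.15, 1.01, 1.20, 1.32, 0.95]
y = [90.01, 91.43, 96.73, 87.59, 99.42, 93.54, 90.56, 89.85, 93.25, 94.98,
     89.05, 93.74, 94.45, 91.77, 93.65, 92.52, 89.54, 90.39, 93.41, 87.33]


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol


def fit_poly(xs, ys, degree):
    """Least-squares polynomial coefficients [c0, c1, ...] via the normal equations."""
    k = degree + 1
    xtx = [[sum(xi ** (i + j) for xi in xs) for j in range(k)] for i in range(k)]
    xty = [sum((xi ** i) * yi for xi, yi in zip(xs, ys)) for i in range(k)]
    return solve(xtx, xty)


def summarize(xs, ys, degree):
    """Return (R^2, error mean square) for a polynomial fit of the given degree."""
    coef = fit_poly(xs, ys, degree)
    fitted = [sum(c * xi ** p for p, c in enumerate(coef)) for xi in xs]
    y_bar = sum(ys) / len(ys)
    sst = sum((yi - y_bar) ** 2 for yi in ys)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(ys, fitted))
    error_df = len(ys) - (degree + 1)  # one degree of freedom lost per coefficient
    return 1.0 - sse / sst, sse / error_df


r2_lin, mse_lin = summarize(x, y, 1)
r2_quad, mse_quad = summarize(x, y, 2)
print(f"linear:    R^2 = {r2_lin:.4f}, error mean square = {mse_lin:.4f}")
print(f"quadratic: R^2 = {r2_quad:.4f}, error mean square = {mse_quad:.4f}")
```

The quadratic fit cannot have a smaller $R^2$ than the straight line, so the quantity worth comparing is the error mean square, which also accounts for the degree of freedom spent on the extra term.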
Unless the error sum of squares in the new model is reduced by an amount equal to the original error mean square, the new model will have a larger error mean square than the old one because of the loss of 1 error degree of freedom. Thus, the new model will actually be worse than the old one. The magnitude of $R^2$ is also affected by the dispersion of the variable x: the larger the dispersion, the larger the value of $R^2$ will usually be.

There are some misconceptions about $R^2$. It does not measure the magnitude of the slope of the regression line; a large value of $R^2$ does not imply a steep slope. Also, $R^2$ does not measure the appropriateness of the model, because it can be artificially inflated by adding higher-order polynomial terms to the model. Even if y and x are related in a nonlinear fashion, $R^2$ will often be large. Lastly, even though $R^2$ is large, this does not necessarily imply that the regression model will provide accurate predictions of future observations.

10.7. Correlation

Correlation analysis attempts to measure the strength of the relationship between two variables by means of a single number called a correlation coefficient. This correlation coefficient measures how closely the points in a scatter diagram are spread around a line. The symbol for the sample correlation coefficient is $r$. The symbol for the population correlation coefficient is the Greek letter rho ($\rho$).

The population correlation coefficient satisfies $\rho^2 = 1 - \sigma^2/\sigma_Y^2$, where $\sigma_Y^2$ is the variance of Y and $\sigma^2$ is the error variance. The value of $\rho$ is 0 when the slope of the regression line is zero ($\beta = 0$), which results when there essentially is no linear regression; that is, the regression line is horizontal and any knowledge of X is useless in predicting Y. Since $\sigma_Y^2 \geq \sigma^2$, we must have $\rho^2 \leq 1$ and hence $-1 \leq \rho \leq 1$. Values of $\rho = \pm 1$ occur only when $\sigma^2 = 0$, in which case we have a perfect linear relationship between the two variables.
Thus, a value of $\rho$ equal to +1 implies a perfect linear relationship with a positive slope, while a value of $\rho$ equal to −1 results from a perfect linear relationship with a negative slope. It might be said, then, that sample estimates of $\rho$ close to unity in magnitude imply good correlation, or linear association, between X and Y, whereas values near zero indicate little or no correlation.

The measure $\rho$ of linear association between two variables X and Y is estimated by the sample correlation coefficient $r$, where

$$r = \hat{b}\sqrt{\frac{S_{xx}}{S_{yy}}} = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$

and $\hat{b}$ is the estimated slope of the regression line.

REFERENCES:

Montgomery, Douglas C., et al., Applied Statistics and Probability for Engineers, 7th ed., John Wiley & Sons (Asia) Pte Ltd, 2018

Walpole, Ronald E., et al., Probability and Statistics for Engineers and Scientists, 9th ed., Pearson Education Inc., 20

CHAPTER TEST

Solve the following problems completely.

An article in the Journal of Environmental Engineering (1989, Vol. 115(3)) reported the results of a study on the occurrence of sodium and chloride in surface streams in central Rhode Island. The following data are chloride concentration y (in milligrams per liter) and roadway area in the watershed x (in percentage).

| x | y | x | y | x | y |
|---|---|---|---|---|---|
| 0.19 | 4.4 | 0.15 | 6.6 | 0.57 | 9.7 |
| 0.70 | 10.6 | 0.67 | 10.8 | 0.63 | 10.9 |
| 0.47 | 11.8 | 0.70 | 12.1 | 0.60 | 14.3 |
| 0.78 | 14.7 | 0.81 | 15.0 | 0.78 | 17.3 |
| 0.69 | 19.2 | 1.30 | 23.1 | 1.05 | 27.4 |
| 1.06 | 27.7 | 1.74 | 31.8 | 1.62 | 39.5 |

1. Draw a scatter diagram of the data.
2. Fit the simple linear regression model using the method of least squares. Find an estimate of $\sigma^2$.
3. Estimate the mean chloride concentration for a watershed that has 1% roadway area.
4. Find the fitted value corresponding to x = 0.47 and the associated residual.
5.
Test the hypothesis $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$ using the analysis of variance procedure with $\alpha = 0.01$.
6. Find a 99% confidence interval on the mean chloride concentration when roadway area x = 1.0%.
7. Find a 99% prediction interval on chloride concentration when roadway area x = 1.0%.
8. Plot the residuals versus $\hat{y}$ and versus x. Interpret these plots.
9. Prepare a normal probability plot of the residuals. Does the normality assumption appear to be satisfied?
10. Determine the correlation coefficient and the coefficient of determination.
