Chapter 1 – Introduction: The Role of Statistics in Engineering

Example: The manufacturer of a medical laser used in ophthalmic surgery wants to quote quality characteristics of the laser to potential customers. One characteristic they want to use is the average lifetime of the laser under normal use. They could obtain the exact average lifetime by running each laser produced until it wears out, recording its lifetime, and averaging over all lasers produced. The drawback to this procedure is that they would be left with no product to sell. To both stay in business and advertise quality characteristics to potential customers, they need a way to estimate the average lifetime from a relatively small sample of lasers. Because they are using only some of the lasers produced, not all, their estimate will have some uncertainty. One major use of statistics is to quantify the degree of uncertainty in such situations.

Definition: Statistics is the branch of applied mathematics that deals with the collection, organization, and interpretation of numerical data, especially the analysis of population characteristics by inference from sampling.

There are two general branches of statistics: 1) descriptive statistics and 2) inferential statistics.

Definition: Descriptive statistics consists of methods of organizing, summarizing, and presenting data in an informative way.

Graphical techniques allow us to present summaries of data in pictorial form, so that data characteristics may be easily seen. Numerical techniques provide various summary values which represent the characteristics of the data set.

Definition: A unit is a single entity, usually an object or a person, whose characteristics are of interest to the researcher.

Definition: A population of units is the set of all items of interest in a statistical problem.
Example 1: All registered voters in Florida in November 2012.
Example 2: All cars of a certain model coming off an assembly line in October 2009.
Example 3: All 12-oz. cans of Pepsi-Cola produced at a certain factory in the year 2009.

Definition: A statistical population is the set of all measurements corresponding to each unit in the entire population of units.

Note: We will generally use the term population to refer to either a population of units or a statistical population.

Definition: A parameter is a numerical characteristic of a population.
Example 1: The proportion of all registered voters in Florida who intend to vote for Pres. Barack Obama in November 2012.
Example 2: The average time until the first major repair job for all cars of a certain model coming off an assembly line in October 2009.
Example 3: The average amount of Pepsi-Cola, by weight, in all 12-oz. cans of Pepsi-Cola produced at a certain factory in the year 2009.

Definition: A sample is a subset of a population. We will also use the term sample to denote the subset of measurements that are actually collected by the researcher.
Example 1: One thousand randomly selected registered voters from across Florida in October 2012.
Example 2: Every 50th car of a certain model coming off an assembly line in October 2009.
Example 3: Every 100th 12-oz. can of Pepsi-Cola produced at a certain factory in the year 2009.

Note 1: The description of a sample must refer to the population from which the sample was selected.
Note 2: Depending on the method of selection, a sample may or may not be representative of the population from which it was selected.

Definition: A statistic is a numerical characteristic of a sample.
Example 1: The proportion of one thousand randomly selected voters from across Florida in October 2012 who intend to vote for Pres. Barack Obama.
Example 2: The average time until the first major repair job for a sample consisting of every 50th car of a certain model coming off an assembly line in October 2009.
Example 3: The average amount of Pepsi-Cola in a sample consisting of every 100th can of Pepsi-Cola produced at a certain factory in the year 2009.

Definition: Inferential statistics consists of methods of drawing conclusions about the characteristics of a population based on the information obtained from a sample selected from the population.

Inferential statistics is divided into the fields of 1) estimation and 2) hypothesis testing. An example of inferential statistics is the estimation of the average lifetime for the population of medical lasers based on data from a small sample of the lasers.

Note: Inferential statistics amounts to making decisions based on incomplete information.
Note: The particular statistical inferential technique used depends strongly on the method by which the sample was selected from the population.

Definition: A representative sample is a sample whose characteristics reflect the characteristics of the population from which the sample was selected.
Example 1: Is this sample representative? If the sample was randomly chosen, then it has a good chance of being representative of the population.
Example 2: Is this sample representative? What if there were some cyclically occurring flaw in the manufacturing process which affected every 50th car produced?
Example 3: Is this sample representative? What if there were some flaw in the manufacturing process which led to overfilling of every 100th can?

Note: A primary reason for working with samples instead of entire populations is that the populations are often too large to handle easily.
Example 1: All registered voters in Florida in October 2012. There are roughly 12 million of them.
Example 2: How easy would it be to actually examine the entire month's production of cars, following them over time to see when the first major repair job was required?
Example 3: Would we actually want to weigh the amount of Pepsi-Cola in every 12-oz. can coming off the assembly line at a certain factory in 2009?

Whenever we infer a population characteristic based on sample data, there is always the chance that our inference will be incorrect.

Example: A public opinion poll conducted in 1936 for Literary Digest Magazine (R.I.P.) predicted that Alf Landon would defeat Franklin Delano Roosevelt in the Presidential election by a 3 to 2 margin. Actually, F.D.R. won 62% of the ballots. Why was the prediction so incorrect?
1) The pollsters sent out 10 million sample ballots to prospective voters, based on the magazine’s subscription list and on telephone directories. (Poor identification of population.)
2) Only 2.3 million of the mailed ballots were actually returned. (Self-selection of sample.)

Note: To do valid statistical inference, we need a sample which is likely to be representative of the population.

We want to build measures of reliability into our statistical inferential procedures; these measures tell us how likely it is that our inference is correct or incorrect. They depend on the sampling method used. For estimation of parameters based on sample statistics, the measure of reliability is called the confidence level. For testing hypotheses about parameters based on sample statistics, the measure of reliability is called the significance level.

Definition: A simple random sample (SRS) of size n is a sample drawn from a population by a method which makes every sample of size n equally likely to be chosen.

Steps in choosing an SRS of size n:
1) Obtain a list of all members of the population; this list is called a sampling frame. (Note: This is the most difficult step in the whole process, and is also error-prone.)
2) Assign a unique ID number to each member of the population.
3) Go to a table of random numbers; choose a convenient starting point; go down the column, recording numbers within the range of the assigned ID numbers, until n distinct numbers are selected. (This step may also be done using the random number generator on your calculator.)
4) The population members that have the ID numbers obtained by this process make up the SRS of size n.

Note: We can never be absolutely certain that our sample is representative, but simple random sampling gives us a good chance.

Example: I want to estimate the average height of the class without gathering height data for every person in the class. I will select a simple random sample of size 3 and use the average height of the members of the sample as the estimate of the average height of the members of the class. I assign a unique ID number to each person in the class; the first person on the class roll will have the ID number 001, the second person 002, etc. I then go to a table of random numbers, open it, and blindly choose a starting point. Reading down the column from the starting point, I record numbers until I have 3 distinct numbers within the range of the assigned ID numbers. The class members with these 3 ID numbers constitute the SRS.
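The selection steps above can be sketched in code. This is only a minimal illustration, assuming Python's standard `random` module stands in for the table of random numbers; the function name `simple_random_sample`, the class size of 30, and the heights are all hypothetical values invented for the example, not data from these notes.

```python
import random


def simple_random_sample(sampling_frame, n, seed=None):
    """Draw a simple random sample of size n from a sampling frame.

    random.sample selects without replacement, so the n members are
    distinct and every subset of size n is equally likely to be chosen.
    """
    frame = list(sampling_frame)
    if not 0 <= n <= len(frame):
        raise ValueError("n must be between 0 and the population size")
    rng = random.Random(seed)  # a seed only makes the draw repeatable
    return rng.sample(frame, n)


# Steps 1 and 2: a hypothetical class roll of 30 students with IDs 001..030.
class_roll = [f"{i:03d}" for i in range(1, 31)]

# Made-up heights in centimetres, one per ID (illustrative only).
heights_cm = {sid: 150 + (7 * k) % 40 for k, sid in enumerate(class_roll)}

# Steps 3 and 4: select the SRS of size 3 and compute the sample statistic.
sample_ids = simple_random_sample(class_roll, 3, seed=2024)
sample_mean = sum(heights_cm[sid] for sid in sample_ids) / len(sample_ids)

# The parameter is known here only because the population is tiny and invented.
class_mean = sum(heights_cm.values()) / len(heights_cm)

print("Sampled IDs:", sample_ids)
print("Sample mean height (statistic):", round(sample_mean, 1))
print("Class mean height (parameter):", round(class_mean, 1))
```

The sample mean will generally differ from the class mean, and a different seed gives a different sample; that sample-to-sample variation is the uncertainty that inferential statistics tries to quantify.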
Collecting Engineering Data

Observational study: members of the sample are simply observed during routine operation, and measurements are taken.
- Used to build empirical models
- Cause-and-effect relationships cannot be confirmed

Designed experiment: the engineer makes deliberate, purposeful changes in controllable variables (called factors) and observes the results of these changes.
- Involves designing and running very efficient experiments
- Cause-and-effect relationships can be examined, using hypothesis testing and parameter estimation

A carefully designed data collection procedure (including the method of selecting a sample from the population) will usually lead to interpretable and useful results; a poorly designed data collection procedure will often lead to worthless data. As R. A. Fisher said, "Often the only thing you can do with a poorly designed experiment is to try to find out what it died of."

Example of an Observational Study to Build an Empirical Model¹

The table contains data collected on three variables in an observational study conducted at a semiconductor manufacturing plant. In this plant, the semiconductor is wire-bonded to a frame. The variables are: Pull Strength – the force required to break the bond; Wire Length; and Die Height. We want to be able to predict the Pull Strength from the Wire Length and Die Height.

¹ Montgomery, D. C.; Runger, G. C.; and Hubele, N. F. Engineering Statistics, 3rd Edition, John Wiley & Sons, Inc. (2004).

The linear regression model that we want to estimate has the following form (we will cover linear regression in Chapter 11):

$\text{Pull Strength} = \beta_0 + \beta_1 (\text{Wire Length}) + \beta_2 (\text{Die Height}) + \varepsilon$

We wish to:
a) test whether this model adequately represents the relationship between Pull Strength and the predictors Wire Length and Die Height (is the relationship linear?), and
b) estimate the values of the constant term and the coefficients in this equation, so that we will have a useful model.
(The term ε in the equation is a random error term.)
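Part b) can be sketched numerically once data are available. The following is a minimal illustration of ordinary least squares for a model with a constant term and two predictors, using only the Python standard library. The function name `fit_two_predictor_model` and every number below are hypothetical: the data are made up (generated from chosen coefficients with no error term) and are not the values from the Montgomery, Runger, and Hubele table.

```python
def fit_two_predictor_model(x1, x2, y):
    """Least-squares estimates (b0, b1, b2) for y = b0 + b1*x1 + b2*x2 + error.

    Solves the normal equations (X^T X) b = X^T y by Gaussian elimination.
    Assumes the predictors are not perfectly collinear (otherwise a pivot
    is zero and the division below fails).
    """
    rows = [[1.0, a, b] for a, b in zip(x1, x2)]  # design matrix X

    # Augmented 3x4 matrix [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(3)
    ]

    # Forward elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, 3):
            factor = A[r][col] / A[col][col]
            for c in range(col, 4):
                A[r][c] -= factor * A[col][c]

    # Back substitution.
    b = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(A[i][j] * b[j] for j in range(i + 1, 3))
        b[i] = (A[i][3] - tail) / A[i][i]
    return tuple(b)


# Made-up, noise-free data built from assumed coefficients 2.0, 3.0 and 0.01.
wire_length = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
die_height = [100.0, 300.0, 200.0, 500.0, 400.0, 600.0]
pull_strength = [2.0 + 3.0 * w + 0.01 * h
                 for w, h in zip(wire_length, die_height)]

b0, b1, b2 = fit_two_predictor_model(wire_length, die_height, pull_strength)
print("Estimated constant term:", round(b0, 6))
print("Estimated Wire Length coefficient:", round(b1, 6))
print("Estimated Die Height coefficient:", round(b2, 6))
```

Because the made-up responses contain no random error, the estimates reproduce the assumed coefficients up to rounding. With real measurements the error term is present, so the estimates would only approximate the underlying values, and part a) would still require a test of model adequacy.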