INTRODUCTION TO STATISTICS AND DATA ANALYSIS

ENGINEERING DATA ANALYSIS

THE CHALLENGE
Advances in science and engineering occur in large part through the collection and analysis of data. Proper analysis of data is challenging because scientific data are subject to random variation. How can one draw conclusions from the results of an experiment when those results could have come out differently? The methods of statistics allow scientists and engineers to design valid experiments and to draw reliable conclusions from the data they produce.

THE ENGINEERING METHOD AND STATISTICAL THINKING
Many of the engineering sciences are employed in the engineering problem-solving method:
• mechanical sciences, such as statics and dynamics
• fluid science
• thermal sciences, such as thermodynamics and heat transfer
• electrical sciences
• materials science
• chemical sciences

Engineering data can be collected by:
• Retrospective study
• Observational study
• Designed experiment

THE BASIC IDEA
The basic idea behind all statistical methods of data analysis is to make inferences about a population by studying a relatively small sample from it.

For example, consider a machine that makes steel balls for ball bearings used in clutch systems. The specification for the diameter of the balls is 0.65 ± 0.03 cm. During the last hour, the machine has made 2000 balls. The quality engineer (QE) wants to know how many of these balls meet the specification. He does not have time to measure all 2000 balls, so he draws a random sample of 80 balls, 72 of which (90%) meet the specification. How can he be sure that 90% of the whole population meets the specification?

TWO FIELDS OF STATISTICS
• INFERENTIAL STATISTICS is the process of using data analysis to make predictions ("inferences") from the data.
• DESCRIPTIVE STATISTICS are used to describe the basic features of the data in a study, in the form of charts, graphs, plots, etc.
COLLECTING ENGINEERING DATA

[Figure: Sample, Population]

DEFINITION
• A population is the entire collection of objects or outcomes about which information is sought.
• A sample is a subset of a population, containing the objects or outcomes that are actually observed.

TANGIBLE VS. CONCEPTUAL POPULATIONS
DEFINITION
• A tangible population is a population consisting of actual physical objects. It is countable and always finite.
• A conceptual population consists of all the values that might possibly have been observed. It arises when a simple random sample consists of values obtained from a process under identical experimental conditions.

Example: Each of the following processes involves sampling from a population. Define the population, and state whether it is tangible or conceptual.
• A shipment of bolts is received from a vendor. To check whether the shipment is acceptable with regard to shear strength, an engineer reaches into the container and selects 10 bolts, one by one, to test.
• The resistance of a certain resistor is measured 5 times with the same ohmmeter.

SAMPLING
DEFINITION
• A simple random sample of size n is a sample chosen by a method in which each collection of n population items is equally likely to comprise the sample, just as in a lottery.

Think of a lottery with 10,000 tickets from which 5 winners will be chosen. What is the fairest way to choose the winners?

Example: A utility company wants to conduct a survey to measure the satisfaction level of its customers in a certain town. There are 10,000 customers in the town, and utility employees want to draw a sample of size 200 to interview personally. They obtain a list of all 10,000 customers and number them from 1 to 10,000. They use a computer random number generator to generate 200 random integers between 1 and 10,000 and then contact the customers who correspond to those numbers. Is this a simple random sample?
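The procedure in the utility company example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `simple_random_sample` and the seed are hypothetical, and the sketch assumes the 200 generated integers must be distinct, so that no customer is contacted twice.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw customer numbers 1..population_size as a simple random sample.

    Hypothetical helper for illustration; the seed is only there to make
    the run repeatable.
    """
    rng = random.Random(seed)
    # random.sample draws without replacement, so every collection of
    # sample_size distinct customers is equally likely to be chosen.
    return rng.sample(range(1, population_size + 1), sample_size)

customers = simple_random_sample(10_000, 200, seed=1)
print(len(customers), len(set(customers)))  # sample size, number of distinct customers
```

Drawing without replacement matches the lottery picture: each customer holds one ticket, and 200 tickets are drawn.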
Example: A quality engineer wants to inspect electronic microcircuits in order to obtain information on the proportion that are defective. She decides to draw a sample of 100 circuits from a day's production. Each hour for 5 hours, she takes the 20 most recently produced circuits and tests them. Is this a simple random sample?

Example: A construction engineer has just received a shipment of 1000 concrete blocks, each weighing approximately 25 kilograms. The blocks have been delivered in a large pile. The engineer wishes to investigate the compressive strength of the blocks by measuring the strengths in a sample of blocks. What is the most appropriate method of selecting the sample?

DEFINITION
• A sample of convenience is a sample that is not drawn by a well-defined random method.

SAMPLING VARIATION
Suppose a quality inspector draws a random sample of 40 bolts from a large shipment, measures the length of each, and finds that 32 of them (80%) meet a length specification. By chance, a second inspector gets a few more good bolts, about 90% in her sample. The proportion of good bolts in the population is likely to be close to 80% or 90%, but it is not likely to be exactly equal to either value.

DEFINITION
• Sampling variation is the phenomenon that two or more different samples from the same population differ from each other.

SAMPLING WITH AND WITHOUT REPLACEMENT
DEFINITION
• When sampling with replacement, what one gets in one sample does not affect what one gets in a different sample. In this case, we say that the samples are independent.
• When sampling without replacement, what one gets in one sample does affect what one gets in a different sample. In this case, we say that the samples are dependent.

Example: An urn contains five balls numbered 1 through 5. I pick two balls, write down their numbers, and place them back in the urn. Then I pick another two balls and write down their numbers. Are the two samples dependent or independent?
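The sampling variation described in this section can be illustrated with a small simulation. The sketch below is not taken from the slides: the shipment of 1000 bolts with 850 conforming bolts is a made-up population, and the names `sample_proportion` and `shipment` are hypothetical.

```python
import random

def sample_proportion(population, sample_size, rng):
    """Proportion of conforming items in one simple random sample."""
    sample = rng.sample(population, sample_size)  # drawn without replacement
    return sum(sample) / sample_size

rng = random.Random(42)

# Hypothetical shipment: 1 marks a bolt that meets the length specification.
shipment = [1] * 850 + [0] * 150

# Two inspectors each draw their own sample of 40 bolts.
first = sample_proportion(shipment, 40, rng)
second = sample_proportion(shipment, 40, rng)
print(f"inspector 1: {first:.3f}, inspector 2: {second:.3f}")

# Repeating the draw many times shows how much the proportion varies.
proportions = [sample_proportion(shipment, 40, rng) for _ in range(2000)]
print(f"smallest: {min(proportions):.3f}, largest: {max(proportions):.3f}")
print(f"average:  {sum(proportions) / len(proportions):.3f}")
```

Individual sample proportions scatter around the population proportion, while their average over many samples stays close to it.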
OTHER SAMPLING METHODS
• Weighted sampling is when some items are given a greater chance of being selected than others (e.g., a lottery in which some people have more tickets than others).
• Stratified random sampling is when the population is divided into subpopulations known as strata, and a simple random sample is drawn from each stratum.
• Cluster sampling is when items are drawn from the population in groups or clusters.

TYPES OF DATA
DEFINITION
• When a numerical quantity designating how much or how many is assigned to each item in a sample, the resulting set of values is called numerical or quantitative.
• When sample items are placed into categories, and category names are assigned to the sample items, the data are categorical or qualitative.

Example: In a loading test of column-to-beam welded connections, data may be collected both on the torque applied at failure and on the location of the failure (weld or beam).
Quantitative variable: Torque
Qualitative variable: Location (weld or beam)

SUMMARY STATISTICS

SAMPLE MEAN
The sample mean, also known as the "arithmetic mean" or the "average", is the sum of the numbers in a sample divided by how many there are.

DEFINITION
Let $X_1, \ldots, X_n$ be a sample. The sample mean is:

$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$

SAMPLE VARIANCE AND STANDARD DEVIATION
The sample standard deviation is a quantity that measures the degree of spread in a sample. The square of the sample standard deviation is the sample variance.

DEFINITION
Let $X_1, \ldots, X_n$ be a sample. The sample variance is the quantity:

$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$

An equivalent formula can be used:

$s^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right)$

DEFINITION
Let $X_1, \ldots, X_n$ be a sample. The sample standard deviation is the quantity:

$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}$

An equivalent formula can be used:

$s = \sqrt{\frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right)}$

OUTLIERS
Sometimes, a sample may contain a few points that are much larger or smaller than the rest. Such points are called outliers.
Outliers may result from data entry errors. They should be scrutinized, and any outlier found to result from an error should be corrected or deleted.

SAMPLE MEDIAN
The median is a measure of center.

DEFINITION
If n numbers are ordered from smallest to largest:
• If n is odd, the sample median is the number in position $\frac{n+1}{2}$.
• If n is even, the sample median is the average of the numbers in positions $\frac{n}{2}$ and $\frac{n}{2} + 1$.

QUARTILES
If the median divides the sample in half, quartiles divide it as nearly as possible into quarters. A sample has 3 quartiles. Let n represent the sample size. Each quartile is the value at the following position in the ordered sample:
• First quartile: position 0.25(n + 1)
• Second quartile: position 0.50(n + 1)
• Third quartile: position 0.75(n + 1)
Note that the second quartile is the same as the median.

Example: In the article "Evaluation of Low-Temperature Properties of HMA Mixtures" (P. Sebasly, A. Lake, and J. Epps, Journal of Transportation Engineering, 2002:578-583), the following values of fracture stress (in MPa) were measured for a sample of 22 mixtures of hot-mixed asphalt (HMA).

30 75 79 80 80 105 126 138 149 179 191 223 232 236 240 242 245 247 254 274 384 470

Find the first and third quartiles.

PERCENTILES
The pth percentile of a sample, for a number p between 0 and 100, divides the sample so that as nearly as possible p% of the sample values are less than the pth percentile and (100 - p)% are greater. Let n represent the sample size. The pth percentile is the value at the following position in the ordered sample:

pth percentile: position $\frac{p}{100}(n + 1)$

If this position is an integer, the sample value in that position is the percentile. Otherwise, take the average of the two sample values on either side.

Note that the 25th percentile is the 1st quartile, the 50th percentile is the median and the 2nd quartile, and the 75th percentile is the 3rd quartile.

Example: In the article "Evaluation of Low-Temperature Properties of HMA Mixtures" (P. Sebasly, A. Lake, and J. Epps, Journal of Transportation Engineering, 2002:578-583), the following values of fracture stress (in MPa) were measured for a sample of 22 mixtures of hot-mixed asphalt (HMA).

30 75 79 80 80 105 126 138 149 179 191 223 232 236 240 242 245 247 254 274 384 470

Find the 65th percentile.

GRAPHICAL SUMMARIES

STEM-AND-LEAF PLOT
Example: The table below shows the results of a study of the bioactivity of a certain antifungal drug. The drug was applied to the skin of 48 subjects. After 3 hours, the amounts of drug remaining in the skin were measured in units of ng/cm². The list has been sorted in numerical order (reading down each column).

 3   15   22   27   40
 4   16   22   33   41
 4   16   22   34   41
 7   17   23   34   51
 7   17   24   35   53
 8   18   25   36   55
 9   20   26   36   55
 9   20   26   37   74
12   21   26   38
12   21   26   40

Stem-and-leaf plot:

Stem   Leaf
0      34477899
1      22566778
2      001122234566667
3      34456678
4      0011
5      1355
6
7      4

DOTPLOT
A dotplot is a graph that can be used to give a rough impression of the shape of a sample. It is useful when the sample size is not too large and when the sample contains some repeated values.

HISTOGRAM
A histogram is a graphic that gives an idea of the "shape" of a sample, indicating regions where sample points are concentrated and regions where they are sparse.

Example: The table below shows particulate matter (PM) emissions, in g/gal, of 62 vehicles driven at high altitude.

 7.50   6.28   6.07   5.23   5.54   3.46   2.44   3.01  13.63  13.02
23.38   9.24   3.22   2.06   4.04  17.11  12.26  19.91   8.50   7.81
 7.18   6.95  18.64   7.10   6.04   5.66   8.86   4.40   3.57   4.35
 3.84   2.37   3.81   5.32   5.84   2.85   4.68   1.85   9.14   8.67
 9.52   2.68  10.14   9.20   7.31   2.09   6.32   6.53   6.32   2.01
 5.91   5.60   5.61   1.50   6.46   5.29   5.64   2.07   1.11   3.32
 1.83   7.56

Construct a frequency table.

Class interval (g/gal)   Frequency   Relative frequency
 1 ≤ x < 3               12          0.1935
 3 ≤ x < 5               11          0.1774
 5 ≤ x < 7               18          0.2903
 7 ≤ x < 9                9          0.1452
 9 ≤ x < 11               5          0.0806
11 ≤ x < 13               1          0.0161
13 ≤ x < 15               2          0.0323
15 ≤ x < 17               0          0.0000
17 ≤ x < 19               2          0.0323
19 ≤ x < 21               1          0.0161
21 ≤ x < 23               0          0.0000
23 ≤ x < 25               1          0.0161
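The frequency table for the PM emissions example can be reproduced with a short Python sketch. The function name `frequency_table` and its parameters are hypothetical; the class intervals (width 2 g/gal, starting at 1) are the ones used in the table for this example.

```python
import math

# PM emissions (g/gal) of the 62 vehicles, in the order listed in the example.
emissions = [
    7.50, 6.28, 6.07, 5.23, 5.54, 3.46, 2.44, 3.01, 13.63, 13.02,
    23.38, 9.24, 3.22, 2.06, 4.04, 17.11, 12.26, 19.91, 8.50, 7.81,
    7.18, 6.95, 18.64, 7.10, 6.04, 5.66, 8.86, 4.40, 3.57, 4.35,
    3.84, 2.37, 3.81, 5.32, 5.84, 2.85, 4.68, 1.85, 9.14, 8.67,
    9.52, 2.68, 10.14, 9.20, 7.31, 2.09, 6.32, 6.53, 6.32, 2.01,
    5.91, 5.60, 5.61, 1.50, 6.46, 5.29, 5.64, 2.07, 1.11, 3.32,
    1.83, 7.56,
]

def frequency_table(data, start, width, n_classes):
    """Count data into intervals start + k*width <= x < start + (k+1)*width."""
    counts = [0] * n_classes
    for x in data:
        k = math.floor((x - start) / width)  # index of the class containing x
        if not 0 <= k < n_classes:
            raise ValueError(f"{x} falls outside the class intervals")
        counts[k] += 1
    return counts

counts = frequency_table(emissions, start=1, width=2, n_classes=12)
for k, count in enumerate(counts):
    low = 1 + 2 * k
    relative = count / len(emissions)
    print(f"{low:>2} <= x < {low + 2:<2}  {count:>3}  {relative:.4f}")
```

Each relative frequency is the class frequency divided by the sample size of 62, so the relative frequencies sum to 1.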
The data are counted into several class intervals. There is no hard-and-fast rule for deciding how many class intervals to use.

To construct a histogram:
(1) Determine the number of classes to use and construct intervals of equal width.
(2) Compute the frequency and relative frequency for each class.
(3) Draw a rectangle for each class. The heights of the rectangles may be set equal to the frequencies or to the relative frequencies.

SKEWNESS
Skewness refers to the asymmetry of a histogram. A symmetric histogram has a right half that is a mirror image of its left half. A histogram that is skewed to the left, or negatively skewed, has a long left-hand tail. On the other hand, a histogram that is skewed to the right, or positively skewed, has a long right-hand tail.

HISTOGRAM MODES
A mode of a histogram is a "peak", or local maximum, in the histogram. A histogram is said to be unimodal if it has only one peak or mode, and bimodal if it has two clearly distinct modes. A bimodal histogram may indicate that the sample can be divided into two subsamples that differ from each other in some scientifically important way.
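Returning to the summary statistics defined earlier, the sketch below checks the sample mean, the two equivalent variance formulas, and the position rule for quartiles and percentiles on the fracture stress sample. The function names are hypothetical, and `percentile` assumes an integer p and implements only the (p/100)(n + 1) position rule given in these notes. Statistical software often uses other interpolation rules and may give slightly different values.

```python
import math

# Fracture stress (MPa) for the 22 HMA mixtures from the quartile example.
fracture_stress = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 191,
                   223, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sample_mean(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def sample_variance_shortcut(xs):
    """Equivalent formula: (sum of X_i^2 - n * mean^2) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - n * sample_mean(xs) ** 2) / (n - 1)

def percentile(xs, p):
    """pth percentile (integer p) using the (p/100)(n + 1) position rule."""
    ordered = sorted(xs)
    n = len(ordered)
    # Integer arithmetic avoids rounding trouble when testing for a whole position.
    whole, remainder = divmod(p * (n + 1), 100)
    if whole < 1 or whole > n or (remainder and whole == n):
        raise ValueError("percentile position falls outside the sample")
    if remainder == 0:
        return ordered[whole - 1]  # positions are 1-based
    return (ordered[whole - 1] + ordered[whole]) / 2

print("mean:", sample_mean(fracture_stress))
print("standard deviation:", math.sqrt(sample_variance(fracture_stress)))
print("first quartile:", percentile(fracture_stress, 25))
print("median:", percentile(fracture_stress, 50))
print("third quartile:", percentile(fracture_stress, 75))
print("65th percentile:", percentile(fracture_stress, 65))
```

Under this rule the first quartile sits at position 5.75, so it is the average of the 5th and 6th ordered values, (80 + 105)/2 = 92.5. The third quartile (position 17.25) is (245 + 247)/2 = 246, and the 65th percentile (position 14.95) is (236 + 240)/2 = 238.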