Chapter 1 Lecture Slides

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Chapter 1: Summarizing Univariate Data

Introduction
• How can one draw conclusions from the results of an experiment when those results could have come out differently?
  – Example: Normal blood sugar levels should be between 70 and 120 mg/dL.
• A knowledge of statistics is essential to address this question.

Quality Control: Example 1
Consider a machine that makes steel balls for ball bearings used in clutch systems. The specification for the diameter of the balls is 0.65 ± 0.03 cm. During the last hour, the machine has made 2000 balls. The quality engineer wants to know approximately how many of these balls meet the specification. He does not have time to measure all 2000 balls, so he draws a random sample of 80 balls, measures them, and finds that 72 of them (90%) meet the diameter specification. It is unlikely that the sample of 80 balls represents the population of 2000 exactly.

Section 1.1: Sampling
Definitions:
A population is the entire collection of objects or outcomes about which information is sought.
A sample is a subset of a population, containing the objects or outcomes that are actually observed.
A simple random sample (SRS) of size n is a sample chosen by a method in which each collection of n population items is equally likely to comprise the sample, just as in a lottery.

SRS Sample: Example 2
Q: A utility company wants to conduct a survey to measure the satisfaction level of its customers in a certain town. There are 10,000 customers in the town, and utility employees want to draw a sample of size 200 to interview over the telephone. They obtain a list of all 10,000 customers and number them from 1 to 10,000. They use a computer random number generator to generate 200 random integers between 1 and 10,000 and then telephone the customers who correspond to those numbers. Is this a simple random sample?
A: Yes, this is a simple random sample.

Sampling (cont.)
Definition: A sample of convenience is a sample that is not drawn by a well-defined random method.
Things to consider with convenience samples:
• They may differ systematically in some way from the population.
• Use them only when it is not feasible to draw a random sample.

Sample of Convenience: Example 3
A construction engineer has received a shipment of 1000 concrete blocks, each weighing approximately 50 pounds. The blocks are in a large pile. The engineer wishes to investigate the crushing strength of the blocks by measuring the strengths in a sample of 10 blocks. It may be difficult to take an SRS, since that would involve getting blocks from the middle and bottom of the pile, so the engineer may just take 10 off the top. This would be a sample of convenience.

Simple Random Sampling
• An SRS is not guaranteed to reflect the population perfectly.
• SRSs always differ in some ways from each other; occasionally a sample is substantially different from the population.
• Two different samples from the same population will vary from each other as well. This phenomenon is known as sampling variation.

Example 2 cont.
Which sample is a simple random sample?

Tangible Population
• Populations that consist of actual physical objects (customers, blocks, balls) are called tangible populations.
• Tangible populations are always finite.
• After we sample an item, the population size decreases by 1.

More on SRS
Definition: A conceptual population consists of items that are not actual objects.
• For example, a geologist weighs a rock several times on a sensitive scale. Each time, the scale gives a slightly different reading.
• Here the population is conceptual. It consists of all the readings that the scale could in principle produce.

SRS (cont.)
• The items in a sample are independent if knowing the values of some of the items does not help to predict the values of the others.
• Items in a simple random sample may be treated as independent in most cases encountered in practice. The exception occurs when the population is finite and the sample comprises a substantial fraction (more than 5%) of the population.

Types of Data
• Data are numerical, or quantitative, if a numerical quantity is assigned to each item in the sample.
  – Height
  – Weight
  – Age
• Data are categorical, or qualitative, if the sample items are placed into categories.
  – Gender
  – Hair color
  – Zip code

Controlled Experiments
• Suppose that a chemical engineer wants to determine how the concentrations of reagent and catalyst affect the yield of a process.
• The engineer can run the process several times, changing the concentrations each time, and compare the yields that result.
• This sort of experiment is called a controlled experiment because the values of the factors are under the control of the experimenter.

Observational Studies
• There are many situations in which scientists cannot control the levels of the factors.
• Many studies have been conducted to determine the effect of cigarette smoking on the risk of lung cancer.
• In these studies, rates of cancer among smokers are compared with rates among nonsmokers.
• The experimenter cannot control who smokes and who doesn't.
• This kind of study is called an observational study.

Section 1.2: Summary Statistics
• Sample mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
• Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)$
• The sample standard deviation $s$ is the square root of the sample variance.

More on Summary Statistics
• If $X_1, \ldots, X_n$ is a sample and $Y_i = a + bX_i$, where $a$ and $b$ are constants, then $\bar{Y} = a + b\bar{X}$, $s_y^2 = b^2 s_x^2$, and $s_y = |b|\, s_x$.

Definition of a Median
The median is another measure of center, like the mean. To find it, order the n sample values from smallest to largest. Then:
If n is odd, the sample median is the number in position $\frac{n+1}{2}$.
If n is even, the sample median is the average of the numbers in positions $\frac{n}{2}$ and $\frac{n}{2} + 1$.

Example 3
A simple random sample of five men is chosen from a large population of men, and their heights are measured. The five heights (in cm) are 166.4, 183.6, 173.5, 170.3, and 179.5. Find the sample mean, sample variance, sample standard deviation, and the median.

Quartiles
The first quartile is the median of the lower half of the data (include the median in the lower half of the data if n is odd).
The third quartile is the median of the upper half of the data (include the median in the upper half of the data if n is odd).

Definition of Percentile
• The pth percentile of a sample, for a number p between 0 and 100, divides the sample so that as nearly as possible p% of the sample values are less than the pth percentile.

To Find Percentiles
Order the sample values from smallest to largest.
Then compute the quantity (p/100)(n+1), where n is the sample size.
If this quantity is an integer, the sample value in this position is the pth percentile. Otherwise, average the two sample values on either side.

Note on Percentiles
• The first quartile is the 25th percentile.
• The median is the 50th percentile.
• The third quartile is the 75th percentile.

Example 4
The following values of fracture stress (in megapascals) were measured for a sample of 24 mixtures of hot-mixed asphalt:
30 75 79 80 80 105 126 138 149 179 179 191 223 232 232 236 240 242 245 247 254 274 384 470
• What is the mean of these data?
• What is the median?
• What is the first quartile?
• What is the third quartile?
• What is the 65th percentile?

Section 1.3: Graphical Summaries
• Stem-and-leaf plot
• Dotplot
• Histogram
• Boxplot

Stem-and-leaf Plot
• A simple way to summarize a data set.
• Each item in the sample is divided into two parts: a stem, consisting of the leftmost one or two digits, and the leaf, which consists of the next digit.
• It is a compact way to represent the data.
• It also gives us some indication of the shape of our data.
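The summary statistics of Section 1.2 can be checked numerically. The following is a minimal Python sketch, not part of the original slides, that applies the sample mean, the sample variance with the n − 1 divisor, and the (p/100)(n+1) percentile rule to the data of Example 3 (heights) and Example 4 (fracture stress). The function names are illustrative, and rejecting positions outside 1..n is an assumption, since the rule as stated does not cover that case.

```python
import math

def sample_mean(xs):
    """Sample mean: sum of the values divided by n."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sample variance with the n - 1 divisor."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def percentile(xs, p):
    """p-th percentile by the (p/100)(n+1) position rule.

    Positions are 1-based in the sorted sample. An integer position
    returns that value; otherwise the two neighbors are averaged.
    Positions outside 1..n are rejected (an assumption of this sketch).
    """
    s = sorted(xs)
    n = len(s)
    pos = p * (n + 1) / 100
    lo = math.floor(pos)
    hi = lo if pos == lo else lo + 1
    if lo < 1 or hi > n:
        raise ValueError("percentile position falls outside the sample")
    return (s[lo - 1] + s[hi - 1]) / 2

# Example 3: heights in cm
heights = [166.4, 183.6, 173.5, 170.3, 179.5]
print(round(sample_mean(heights), 2))                  # 174.66
print(round(sample_variance(heights), 3))              # 47.983
print(round(math.sqrt(sample_variance(heights)), 3))   # 6.927
print(percentile(heights, 50))                         # 173.5

# Example 4: fracture stress in megapascals
stress = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191,
          223, 232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]
print(round(sample_mean(stress), 2))   # 195.42
print(percentile(stress, 50))          # 207.0 (median)
print(percentile(stress, 25))          # 115.5 (first quartile)
print(percentile(stress, 75))          # 243.5 (third quartile)
print(percentile(stress, 65))          # 238.0
```

For n = 24 the percentile rule and the "median of each half" rule give the same quartiles. For small odd samples such as Example 3 the two rules can disagree, so it is worth stating which one is in use.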

Example 5
• Amount of Drug in Skin
• Stem-and-leaf plot:

Stem | Leaf
0 | 34477899
1 | 22566778
2 | 001122234566667
3 | 34456678
4 | 0011
5 | 1355
6 |
7 | 4

Let's look at the first line of the stem-and-leaf plot. This represents measurements of 3, 4, 4, 7, 7, 8, 9, and 9 minutes. A good feature of these plots is that they display all the sample values. One can reconstruct the data in its entirety from a stem-and-leaf plot.

Dotplot
• A dotplot is a graph that can be used to give a rough impression of the shape of a sample.
• It is useful when the sample size is not too large and when the sample contains some repeated values.
• Along with the stem-and-leaf plot, it is a good method for informally examining a sample.
• It is not generally used in formal presentations.
[Figure: Dotplot for HiAltitude]

Histogram
• Graphical display that gives an idea of the shape of the sample.
• We want a reasonable number of observations in each interval.
• The bars of the histogram touch each other. A space indicates that there are no observations in that interval.

Creating a Histogram
• Determine the number of classes to use, and construct class intervals of equal width.
• Compute the frequency and relative frequency for each class.
• Draw a rectangle for each class. The heights of the rectangles may be set equal to the frequencies or to the relative frequencies.

Example 6

Example 6 cont.

Symmetry and Skewness
• A histogram is perfectly symmetric if its right half is a mirror image of its left half.
  – Example: heights of randomly selected men.
• Histograms that are not symmetric are referred to as skewed.
• A histogram with a long right-hand tail is said to be skewed to the right, or positively skewed.
  – Example: incomes are right skewed.
• A histogram with a long left-hand tail is said to be skewed to the left, or negatively skewed.
  – Example: grades on an easy test are left skewed.

Symmetry and Skewness

Shape of Histogram
• A histogram with only one peak is called unimodal.
• If a histogram has two peaks, then we say that it is bimodal.
• If there are more than two peaks in a histogram, then it is said to be multimodal.

Boxplots
• A boxplot is a graphic that presents the median, the first and third quartiles, and any outliers present in the sample.
• The interquartile range (IQR) is the difference between the third and first quartiles. This is the distance needed to span the middle half of the data.

Boxplots

Creating a Boxplot
Compute the median and the first and third quartiles of the sample. Indicate these with horizontal lines. Draw vertical lines to complete the box.
Find the largest sample value that is no more than 1.5 IQR above the third quartile, and the smallest sample value that is no more than 1.5 IQR below the first quartile. Extend vertical lines (whiskers) from the quartile lines to these points.
Points more than 1.5 IQR above the third quartile, or more than 1.5 IQR below the first quartile, are designated as outliers. Plot each outlier individually.

Example 5 cont.
Notice there are no outliers in these data. Looking at the four pieces of the boxplot, we can tell that the sample values are comparatively densely packed between the median and the third quartile. The lower whisker is a bit longer than the upper one, indicating that the data have a slightly longer lower tail than upper tail. The distance between the first quartile and the median is greater than the distance between the median and the third quartile. This boxplot suggests that the data are skewed to the left.

Comparative Boxplots
• Sometimes we want to compare more than one sample.
• We can place the boxplots of the two samples side by side.
• This allows us to compare how the medians differ between samples, as well as the first and third quartiles.
• It also tells us about the difference in spread between the two samples.

Example 7

Example 7 cont.
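
The boxplot construction steps described in this section can be sketched in Python. This is an illustrative sketch, not taken from the slides. It reuses the fracture stress data of Example 4 as a convenient input, computes the quartiles with the (p/100)(n+1) percentile rule, and then applies the 1.5 IQR rule to find the whisker ends and the outliers. The names `percentile` and `boxplot_summary` are hypothetical helpers introduced here.

```python
import math

def percentile(xs, p):
    """p-th percentile by the (p/100)(n+1) position rule (1-based positions)."""
    s = sorted(xs)
    n = len(s)
    pos = p * (n + 1) / 100
    lo = math.floor(pos)
    hi = lo if pos == lo else lo + 1
    if lo < 1 or hi > n:
        raise ValueError("percentile position falls outside the sample")
    return (s[lo - 1] + s[hi - 1]) / 2

def boxplot_summary(xs):
    """Return the numbers needed to draw a boxplot with the 1.5 IQR rule."""
    q1, median, q3 = (percentile(xs, p) for p in (25, 50, 75))
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Whiskers end at the most extreme values that are still within the limits.
    inside = [x for x in xs if lower_limit <= x <= upper_limit]
    outliers = sorted(x for x in xs if x < lower_limit or x > upper_limit)
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": outliers,
    }

# Example 4: fracture stress in megapascals
stress = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191,
          223, 232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]
summary = boxplot_summary(stress)
print(summary["q1"], summary["median"], summary["q3"])   # 115.5 207.0 243.5
print(summary["iqr"])                                     # 128.0
print(summary["whisker_low"], summary["whisker_high"])    # 30 384
print(summary["outliers"])                                # [470]
```

With these data the upper limit is 243.5 + 1.5(128) = 435.5, so 470 is plotted individually as an outlier and the upper whisker stops at 384.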

Summary
• Types of data
• Sampling
• Summary statistics
• Graphical displays of data

HW1: 1.1 (4, 6); 1.2 (2, 10, 12, 16); 1.3 (5, 7, 9, 12, 14); Ch (8, 13)