STAT 303 Introduction to Engineering Statistics
Fall 2015 Lecture Packet
by Chris Malone
Winona State University
cmalone@winona.edu

Chapter 1: Introduction

Section 1.1: What is statistics?

Your definitions
-----

Why am I here?

Thinking about this question and arriving at a reasonable answer is likely to make this class a bit more enjoyable.

Section 1.2: Getting Started

Conceptual Model for Statistics

Definition: Statistics is a collection of techniques that provides information for the decision-making process.

Some Advantages:
• Removes some of the burden of making a decision and places it on the data
• Most advantageous when faced with uncertainties
• Allows us to make decisions without prejudice
•
•

Some Disadvantages (some perceived):
• It is hard to do and understand
• You can show what you want
•
•

Statistics has two basic forms:
• Descriptive: Techniques and / or measures to describe data
• Inferential: Process in which conclusions are made about a larger group using a smaller group

Comments:
• Most of the “bad publicity” is due to inferential statistics. We must be careful about the generalizations / conclusions that are made about the larger group.
• We will spend a lot more time on inferential statistics in this class.

Data has two basic forms, each with two measurement scales:
• Categorical (i.e. Qualitative): measurements that are classified into categories
   -- Nominal:
   -- Ordinal:
• Numerical (i.e. Quantitative): measurements taken on a (naturally) numeric scale
   -- Discrete:
   -- Continuous:

It is necessary to identify the data type before doing a statistical analysis. The data type determines which analyses are appropriate and which are not. Traditionally, analyses for numerical variables have been emphasized more than analyses for categorical variables.

Example 1.1: This data is from a study of lead pollution near El Paso, TX. The American Smelting and Refining Company (ASARCO) operated a lead and copper smelter in El Paso from 1887 to 1999.
This particular study investigated the amount of lead exposure in children over a two-year time period. A snippet of the data is given here.

The symbols in the Columns box identify the data type of each variable. The data type determines which analyses are appropriate. You can change (force) a particular data type onto a variable, but this should be done with care.

Consider an analysis of a categorical variable, say Location. Select Analyze > Distribution. Place Location in the Y, Columns box and click OK. This is shown here.

The following output is given.

Discuss…

Next, consider an analysis of a numerical variable, say IQ. The output is given here.

Discuss…

Realize that the appropriate summaries and graphical procedures are different for different types of variables. This course will proceed through appropriate statistical procedures for various data types and combinations of data types.

Section 1.3: Getting Data

Consider the following definitions before getting into a discussion of the various methods of collecting data.

Definition: Population -- the entire collection of objects under study. The objects are often people, but they can be anything.

Definition: Sample -- the observed collection of objects under study

Definition: Census -- when there is no difference between the sample and the population (i.e. when you observe the entire collection of objects under study)

The objects in the population are the ones we want to understand and/or make decisions about. If a census is taken, then making decisions about the objects in the population is straightforward (i.e. finding averages, ranges, and making graphs). Furthermore, inferential methods are not needed when a census of the entire population is completed. However, it is often not reasonable to collect measurements on all objects in the population, so a sample of the population is obtained.
If you are going to use a sample to make decisions about a population, it is very important that the sample represent the population. This should be obvious, but it is much easier to say than to do in practice.

Example 1.2: Can you determine what is representative?

Consider the following exercise. The goal of this simple exercise is to determine the average number of squares per bunch in the following picture. Instead of counting the number of squares in all 100 bunches, identify 10 representative bunches and record their identification numbers in the first row of the table below. For each bunch, count the number of squares and record the result in the second row.

Bunch ID:      ____ ____ ____ ____ ____ ____ ____ ____ ____ ____
# of Squares:  ____ ____ ____ ____ ____ ____ ____ ____ ____ ____

Questions:
• What is the average number of squares per bunch for your representative sample?
• What is the average number of squares per bunch obtained by one or more of your neighbors?
• How do the results from your representative sample compare to those of your neighbors?

Obtain the average number of squares per bunch from several individuals in the class and record their values in the following table.

Individual:
Average:

How well did we do?
• On the following number line, mark each of the averages recorded above.
____________________________________________________
• The average number of squares per bunch (for all 100 bunches) is _____

Discussion…

What about random sampling?

The goal of random sampling is to ensure that a representative sample is taken. There are various random sampling methods, the simplest being simple random sampling.

Definition: Simple random sampling -- a sampling method in which each observation in the population has an equal chance of being selected.

Taking a simple random sample traditionally meant putting a piece of paper for each “observation” in a hat and randomly selecting observations. Even though this may sound exciting, statisticians use computers to select simple random samples.

Obtaining a simple random sample using JMP

Open the Random_Rectangles.JMP data file.
Select Tables > Subset.

In the subset window, select 10 in the Random – sample size: box. Specify that you want All Columns from the original table. Finally, give the resulting table of randomly selected observations a name in the Output table name: box.

The following randomly selected subset is returned.

Example 1.3: Summarizing the random sample results.

In the following table, list the IDs and counts for the randomly selected observations given above.

Bunch ID:      ____ ____ ____ ____ ____ ____ ____ ____ ____ ____
# of Squares:  ____ ____ ____ ____ ____ ____ ____ ____ ____ ____

How well does simple random sampling do?

Consider the following 10 random samples I’ve selected.

Plot the averages from these 10 random samples on the same number line on which you plotted the class results earlier. How do the results from the 10 random samples compare to the results from the 10 representative samples selected in Example 1.2? Discuss the similarities / differences.

Section 1.4: Sampling Errors

There are two types of errors to consider when sampling.

Sampling: Errors that naturally occur in a random sampling process. The behavior of these errors is well understood when good sampling techniques are used.

Summary
> errors caused by the act of sampling
> have the potential to be bigger in smaller samples than in larger samples
> it is possible to determine to what degree they will affect the outcome
> unavoidable (this is the price of ensuring a representative sample)

Nonsampling: Errors due to things other than the sampling process. These errors are more difficult to control and should be of concern whenever measurements are taken.
Some Examples:
> Nonresponse
> Voluntary Response
> Hidden Biases / Lurking Variables
> Survey design effects / question effect

Summary
> are more problematic than sampling errors
> are always present
> may be impossible to correct after data is collected
> nearly impossible to determine the degree to which they adversely affect the analysis
> minimized by using good survey / data collection methodologies

Section 1.5: Random Variables / Distributions

Definition: Observation -- the collection of measurements from a particular object

Definition: Variable -- any measurable characteristic of an observation

The term variable is often used more loosely to represent the set of measurable characteristics across all observations.

Example 1.4: Consider the following data from the Lead El Paso study. Of interest here are the Location=Close children in the study.

Questions:
• Give an example of two different observations.
• Give an example of three variables.

The concepts of a random variable and a probability distribution are important to your understanding of inferential statistics.

Definition: Random Variable -- a variable or measurement that is obtained through some random process

Definition: Distribution -- a table or graph of all possible values of a random variable. A distribution lists the possible values of the random variable and also gives the frequency of occurrence of each value.

Comments:
• All random variables have a distribution
• Certain types of random variables occur so frequently that we name their distribution. For example, the bell-shaped distribution is thought to occur so frequently that we’ve labeled it the normal distribution.

Example 1.5: Consider the following 22 observations from the El Paso Lead Study whose Location = Close. Let these 22 observations represent the population. That is, we only care about making decisions about these 22 individuals.

Take a simple random sample of 5 individuals from this population.
Place their values in the table below.

ID Sex Age Colic Clum Irr Loc Years Test Year1 IQ Lead1 Lead2 Close Type Year2
1
2
3
4
5

Main Ideas:
• EVERYTHING in the population is unknown and fixed
• EVERYTHING in the sample is known and random
• EVERYTHING in the population has a corresponding component in the sample

Two final definitions

Definition: Parameter -- summary characteristic of a distribution

Definition: Statistic -- summary characteristic of a sample
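
The packet draws its samples in JMP, but the ideas of population, sample, parameter, statistic, and sampling error can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 22 population values are made up (they are not the El Paso measurements), and the function names simple_random_sample and sampling_errors are hypothetical helpers written for this sketch. The population mean plays the role of the parameter (fixed), the sample mean plays the role of the statistic (random), and repeating the sampling many times gives a rough sense of how sampling error shrinks as the sample size grows.

```python
import random
import statistics


def simple_random_sample(population, n, rng):
    """Draw n distinct objects; every object has an equal chance of selection."""
    return rng.sample(population, n)


def sampling_errors(population, n, repeats, rng):
    """Sample mean minus population mean for `repeats` independent samples."""
    mu = statistics.mean(population)
    return [
        statistics.mean(simple_random_sample(population, n, rng)) - mu
        for _ in range(repeats)
    ]


# Hypothetical population: 22 made-up scores, NOT the El Paso measurements.
population_rng = random.Random(2015)
population = [population_rng.randint(70, 130) for _ in range(22)]

# Separate generator for sampling, seeded so the run is repeatable.
rng = random.Random(303)

parameter = statistics.mean(population)   # fixed; unknown in practice
sample = simple_random_sample(population, 5, rng)
statistic = statistics.mean(sample)       # known; varies from sample to sample

print("parameter (population mean):", round(parameter, 2))
print("sample of 5:", sample)
print("statistic (sample mean):", round(statistic, 2))

# Spread of the sampling error for a small and a larger sample size.
spreads = {}
for n in (5, 15):
    errors = sampling_errors(population, n, 2000, rng)
    spreads[n] = statistics.pstdev(errors)
    print(f"n={n}: standard deviation of sampling error = {spreads[n]:.2f}")
```

Changing the seed in random.Random(303) changes the sample and the statistic, but not the parameter, because the population itself is left untouched. Setting n equal to 22 amounts to a census, in which case the sampling error is zero.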