Practicing the Concepts #1 * Basic Concepts and Terminology

STAT 303
Introduction to
Engineering Statistics
Fall 2015
Lecture Packet
by
Chris Malone
Winona State University
cmalone@winona.edu
Chapter 1: Introduction
Section 1.1: What is statistics?
Your definitions
-----
Why am I here?
Thinking about this question and arriving at a reasonable answer is likely
to make this class a bit more enjoyable.
Section 1.2: Getting Started
Conceptual Model for Statistics
Definition: Statistics is a collection of techniques that provides information for
the decision-making process.
Some Advantages:
• Removes some of the burden of making a decision and places it on the data
• Most advantageous when faced with uncertainties
• Allows us to make decisions without prejudice
•
•
Some Disadvantages (some perceived)
• It is hard to do and understand
• You can show what you want
•
•
Statistics has two basic forms
• Descriptive: Techniques and / or measures to describe data
• Inferential: Process in which conclusions are made about a larger group using a smaller group
Comments:
• Most of the “bad publicity” is due to inferential statistics. We must be careful of the generalizations / conclusions that are being made about the larger group.
• We will spend a lot more time on inferential statistics in this class.
Data has two basic forms, each with two measurement scales
• Categorical (i.e. Qualitative): measurements that are classified into categories
-- Nominal:
-- Ordinal:
• Numerical (i.e. Quantitative): measurements taken on a (naturally) numeric scale
-- Discrete:
-- Continuous:
It is necessary to identify the data type before doing a statistical analysis. The
data type determines which analyses are appropriate and which are not.
Traditionally, analyses for numerical variables have been emphasized more than
analyses for categorical variables.
Example 1.1: This data is from a study of lead pollution near El Paso, TX. The
American Smelting and Refining Company (ASARCO) operated a lead and
copper smelter in El Paso from 1887 to 1999. This particular study investigated
the amount of lead exposure in children over a two-year time period. A snippet of
the data is given here.
The various symbols in the Columns box identify the various data types. The
data type determines which analyses are more appropriate. You can change
(force) a particular data type onto a variable, but this should be done with care.
Consider an analysis of a categorical variable, say Location. Select Analyze >
Distribution. Place Location in the Y, Columns box and click OK. This is shown
here.
The following output is given.
Discuss…
Next, consider an analysis of a numerical variable, say IQ.
The output is given here.
Discuss…
Realize that the appropriate summaries and graphical procedures are different
for different types of variables. This course will proceed through appropriate
statistical procedures for various data types and combinations of data types.
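To make the contrast between data types concrete outside of JMP, here is a minimal Python sketch. The records are invented for illustration (they are not values from the El Paso study), and the helper names summarize_categorical and summarize_numerical are hypothetical. It shows counts and proportions for a categorical variable and a mean, median, and standard deviation for a numerical one.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical records; these values are made up for illustration only.
records = [
    {"Location": "Close", "IQ": 85},
    {"Location": "Far", "IQ": 96},
    {"Location": "Close", "IQ": 91},
    {"Location": "Far", "IQ": 104},
    {"Location": "Far", "IQ": 99},
]


def summarize_categorical(values):
    """Return (count, proportion) for each category."""
    counts = Counter(values)
    total = len(values)
    return {level: (n, n / total) for level, n in counts.items()}


def summarize_numerical(values):
    """Return a few common numerical summaries."""
    return {
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
        "min": min(values),
        "max": max(values),
    }


location_summary = summarize_categorical([r["Location"] for r in records])
iq_summary = summarize_numerical([r["IQ"] for r in records])

print(location_summary)
print(iq_summary)
```

An average of Location would be meaningless, and a table of counts for IQ would usually be too sparse to be useful, which is why the two kinds of variables get different summaries.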
Section 1.3: Getting Data
Consider the following definitions before getting into a discussion of the various
methods of collecting data.
Definition: Population – the entire collection of objects under study. The objects
are often people, but they can be anything.
Definition: Sample – the observed collection of objects under study
Definition: Census – when there is no difference between the sample and the
population (i.e. when you observe the entire collection of objects under study)
The objects in the population are the ones we want to understand and/or make
decisions about. If a census is taken, then making decisions about the objects in
the population is straightforward (e.g. finding averages, ranges, and making
graphs). Furthermore, inferential methods are not needed when a census of the
entire population is completed.
However, it is often not reasonable to collect measurements on all of the objects in
the population, so a sample of the population is obtained. If you are going to use
a sample to make decisions about a population, it is very important that the
sample represent the population. This should be obvious, but it is much easier to
say than to do in practice.
Example 1.2 Can you determine what is representative?
Consider the following exercise. The goal of this simple exercise is to determine
the average number of squares per bunch in the following picture.
Instead of counting the number of squares in all 100 bunches, identify 10
representative bunches and record their identification number in the first row of
the table below. For each bunch, count and record the number of squares and
place the results in the second row.
Bunch ID
# of Squares
Questions:
• What is the average number of squares per bunch for your representative sample?
• What is the average number of squares from one or more of your neighbors?
• How do the results from your representative sample compare to your neighbors?
Obtain the average number of squares per bunch from several individuals in the
class and record their values in the following table.
Individual
Average
How well did we do?
• On the following number line, sketch each of the averages recorded above.
____________________________________________________
• The average number of squares per bunch (for all 100 bunches) is _____
Discussion…
What about random sampling?
The goal of random sampling is to ensure that a representative sample is taken.
There are various random sampling methods with the simplest being simple
random sampling.
Definition: Simple random sampling – a sampling method in which each
observation in the population has an equal chance of being selected.
Taking a simple random sample traditionally meant putting a piece of paper for
each “observation” in a hat and randomly selecting observations. Even though this
may sound exciting, statisticians use computers to select simple random
samples.
Obtaining a simple random sample using JMP
Open the Random_Rectangles.JMP data file. Select Tables > Subset.
In the subset window, enter 10 in the Random – sample size: box. Specify that
you want All Columns from the original table. Finally, give the resulting table of
randomly selected observations a name in the Output table name: box.
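For readers without JMP, the same idea can be sketched in a few lines of Python. This is only an illustration: it assumes the 100 bunches are identified by the numbers 1 through 100, and the function name simple_random_sample and the seed value are hypothetical choices, not part of the JMP workflow.

```python
import random


def simple_random_sample(ids, n, seed=None):
    """Draw n distinct IDs so that every ID has the same chance of selection."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(ids, n)  # sampling without replacement


# Assumption: the bunches are labeled 1 through 100.
bunch_ids = list(range(1, 101))
sample_ids = simple_random_sample(bunch_ids, 10, seed=2015)

print(sorted(sample_ids))
```

Leaving the seed as None gives a different sample on every run, which is the computer's version of drawing slips of paper from a hat.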
The following randomly selected subset is returned.
Example 1.3 Summarizing the random sample results. In the following table, list
the IDs and counts for the randomly selected observations given above.
Bunch ID
# of Squares
How well does simple random sampling do?
Consider the following 10 random samples I’ve selected.
Plot the averages from these 10 random samples on the same number line on
which you plotted the class results earlier.
How do the results from the 10 random samples compare to the results from the
10 representative samples selected in Example 1.2? Discuss the similarities /
differences.
Section 1.4: Sampling Errors
There are two types of errors: sampling errors and nonsampling errors.
Sampling: Errors that naturally occur in a random sampling process
The behavior of these errors is well understood when good sampling
techniques are used
Summary
> errors caused by the act of sampling
> have the potential to be bigger in smaller samples than in larger samples
> it is possible to determine to what degree they will affect the outcome
> unavoidable (this is the price of ensuring a representative sample)
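The claim that sampling errors tend to be bigger in smaller samples can be checked with a small simulation. The sketch below is an illustration under stated assumptions: the population of 100 bunch sizes is randomly generated (it is not the Random_Rectangles data), and the sample sizes, number of repetitions, and the helper name sample_means are arbitrary choices.

```python
import random
from statistics import mean, pstdev

rng = random.Random(303)

# Hypothetical population: 100 bunches, each with 1 to 18 squares.
population = [rng.randint(1, 18) for _ in range(100)]
population_mean = mean(population)


def sample_means(population, n, reps, rng):
    """Average of each of `reps` simple random samples of size n."""
    return [mean(rng.sample(population, n)) for _ in range(reps)]


center = {}  # average of the sample means, by sample size
spread = {}  # standard deviation of the sample means, by sample size
for n in (5, 10, 40):
    means = sample_means(population, n, 2000, rng)
    center[n] = mean(means)
    spread[n] = pstdev(means)
    print(n, round(center[n], 2), round(spread[n], 2))

print("population mean:", population_mean)
```

For every sample size the sample averages center near the population average, but their spread shrinks as the sample size grows. That spread is the sampling error, and because it follows a predictable pattern it can be quantified.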
Nonsampling: Errors due to things other than the sampling process
These errors are more difficult to control and should be of concern whenever
measurements are taken.
Some Examples:
> Nonresponse
> Voluntary Response
> Hidden Biases / Lurking Variables
> Survey design effects / question effect
Summary
> are more problematic than sampling errors
> are always present
> may be impossible to correct after data is collected
> nearly impossible to determine the degree to which they adversely affect
the analysis
> minimized by using good survey / data collection methodologies
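A simulation can also suggest why nonsampling errors are more troublesome. The sketch below illustrates voluntary response under a made-up assumption: each person in a hypothetical population has a score from 1 to 10, and a person chooses to respond with probability equal to their score divided by 10. None of these numbers come from a real survey, and the function name voluntary_response is hypothetical.

```python
import random
from statistics import mean

rng = random.Random(1)

# Hypothetical population: scores from 1 to 10 for 10,000 people.
population = [rng.randint(1, 10) for _ in range(10_000)]
true_mean = mean(population)


def voluntary_response(population, rng):
    """Assumed response model: higher scores are more likely to respond."""
    return [x for x in population if rng.random() < x / 10]


# A simple random sample of 500 people.
random_sample_mean = mean(rng.sample(population, 500))

# A much larger group of self-selected volunteers.
volunteers = voluntary_response(population, rng)
volunteer_mean = mean(volunteers)

print("population mean:     ", round(true_mean, 2))
print("random sample (500): ", round(random_sample_mean, 2))
print("volunteers:", len(volunteers), "mean:", round(volunteer_mean, 2))
```

Under this assumed response model the volunteer average lands well above the population average even though there are thousands of volunteers, while the much smaller random sample stays close. Collecting more self-selected responses does not remove this kind of error.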
Section 1.5: Random Variables / Distributions
Definition: Observation – the collection of measurements from a particular object
Definition: Variable – any measurable characteristic of an observation
The term variable is often used more loosely to represent the
set of measurable characteristics across all observations.
Example 1.4 Consider the following data from the Lead El Paso study. Of
interest here is the Location=Close children in the study.
Questions:
• Give an example of two different observations.
• Give an example of three variables.
The concepts of a random variable and a probability distribution are important to
your understanding of inferential statistics.
Definition: Random Variable – a variable or measurement that is
obtained through some random process
Definition: Distribution – a table or graph of all possible values of a random variable. A
distribution lists the possible values for the random variable and also gives the
frequency of occurrence for each value.
Comments
• All random variables have a distribution
• Certain types of random variables occur so frequently that we name their distribution. For example, the bell-shaped distribution is thought to occur so frequently that we’ve labeled it the normal distribution.
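As a small illustration of a distribution written out as a table, consider a random variable that does not appear elsewhere in this packet: the sum of two fair six-sided dice. The Python sketch below lists every possible value together with how often it occurs; the dice setting is chosen only because all of the outcomes can be enumerated exactly.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes for two fair dice, reduced to their sum.
outcomes = [a + b for a, b in product(range(1, 7), repeat=2)]

# The distribution: each possible value paired with its relative frequency.
counts = Counter(outcomes)
distribution = {
    value: Fraction(count, len(outcomes))
    for value, count in sorted(counts.items())
}

for value, prob in distribution.items():
    print(value, prob)
```

The relative frequencies add up to 1, and the table answers both questions in the definition: which values are possible, and how often each one occurs.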
Example 1.5: Consider the following 22 observations from the El Paso Lead
Study whose Location = Close. Let these 22 observations represent the
population. That is, we only care about making decisions about these 22
individuals.
Take a simple random sample of 5 individuals from this population. Place their
values in the table below.
ID | Sex | Age | Years Close | Loc Type | Colic | Clum | Irr | IQ Test | Lead1 Year1 | Lead2 Year2
1
2
3
4
5
Main Ideas:
• EVERYTHING in the population is unknown and fixed
• EVERYTHING in the sample is known and random
• EVERYTHING in the population has a corresponding component in the sample
Two final definitions
Definition: Parameter – summary characteristic of a distribution
Definition: Statistic – summary characteristic of a sample
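The difference between a parameter and a statistic can be seen in a short Python sketch. It assumes a made-up population of 22 values standing in for the 22 Location = Close children; the numbers are randomly generated, not taken from the study.

```python
import random
from statistics import mean

rng = random.Random(22)

# Hypothetical stand-in for a population of 22 individuals (invented values).
population = [rng.randint(70, 120) for _ in range(22)]

# Parameter: a summary of the whole population. It is fixed, and in
# practice it is unknown because the whole population is not observed.
parameter = mean(population)

# Statistic: the same summary computed from a sample. It is known once the
# sample is taken, but it changes from one random sample to the next.
statistics_seen = [mean(rng.sample(population, 5)) for _ in range(3)]

print("parameter (population mean):", round(parameter, 1))
print("statistics (sample means):  ", [round(s, 1) for s in statistics_seen])
```

Each sample mean is an attempt to estimate the same fixed population mean. The variation among the sample means is what inferential methods are designed to account for.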