Stat 328 1st Week Outline
1 Introduction
Statistics is:
•data collection
•data summarization
•quantitative inference from data
all in a framework that recognizes the reality and omnipresence of
variation.
Some sources of variation in business data include:
•real item-to-item variation
•sampling variation
•measurement error
1.2 Issues in Measurement
The fundamental issues in metrology (the science of measurement)
With precise enough measurement, even repeat measurements of
the same unit will typically vary.
1.3 Issues in Sample Selection
Data are sometimes collected from concrete/well-defined
"populations" of items/units of interest.
•(for purposes of protecting oneself and producing inferences
with quantifiable reliability) this is ideally done "at random"
•in reality, it is often done haphazardly or according to
Data are often collected from "processes" (where there is no fixed
concrete/well-defined set of items or units under consideration)
and the object is to understand the nature of the process. If there is
any hope of doing this, the process must be "stable," i.e., must not
be changing in a completely unpredictable fashion over time.
(SPC methods are aimed at verifying this.)
The same mathematics is typically used to support inference-making
for both populations and processes. (This unfortunately
sometimes leads to muddled thinking and expositions.)
1.4 Some Terminology
Data may be:
-measurements (JMP: continuous)
Data may be:
•multivariate (including bivariate, including "paired")
1.5 Mathematical Models and Data Analysis
Mathematical models are descriptions of real systems and
phenomena in terms of numbers, symbols, equations and the like.
The most common of these are "deterministic" and don't allow for
randomness or variation from a particular prediction they generate.
Probability models are mathematical descriptions of "chance"
phenomena and do allow for the kind of variation seen in real
world business data. These are used to support quantitative
inference from data.
1.6 Basic Descriptive Statistics
The place to start with data description is with a single sample ...
data from a single population or single set of process conditions.
Some basic tools for quantitative data are:
•graphical representations
-dot diagrams
-stem and leaf diagrams
-histograms (bar charts)
•numerical summaries
-sample minimum and sample maximum
-variance and standard deviation
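As a rough sketch of the numerical summaries listed above, the following uses Python's standard `statistics` module on a small sample. The data values and the helper name `summarize` are made up for illustration and are not course data.

```python
import statistics

def summarize(sample):
    """Return basic numerical summaries for a single sample."""
    return {
        "n": len(sample),
        "min": min(sample),
        "max": max(sample),
        # statistics.variance and statistics.stdev use the n - 1 ("sample") divisor
        "variance": statistics.variance(sample),
        "std_dev": statistics.stdev(sample),
    }

# Hypothetical sample, invented purely for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]
summary = summarize(data)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The n - 1 divisor is the usual choice when the data are treated as a sample rather than as a complete population.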
1.7 The Simplest Possible Examples of Probability-Based Inference
Slightly non-standard but informative examples of probability-based
inference concern what can be learned from the sample
minimum and sample maximum, if one adopts a model of
independent observations from an unknown (but fixed)
"continuous distribution." The interval with end-points at
the sample extremes can be used as
•a "confidence interval" for the distribution median
•a "prediction interval" for a single additional observation
with known confidence/reliability.
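The confidence levels attached to the interval between the sample extremes can be computed directly. The sketch below assumes n independent observations (n at least 2) from a fixed continuous distribution; the function names are illustrative, not from the course materials.

```python
def median_confidence(n):
    """Confidence that (sample min, sample max) contains the distribution median."""
    # The interval misses the median only if all n observations
    # fall on the same side of it: probability 2 * (1/2)^n.
    return 1 - 2 * 0.5 ** n

def prediction_confidence(n):
    """Confidence that (sample min, sample max) contains one additional observation."""
    # The new observation is equally likely to hold any of the n + 1 ranks;
    # it falls outside the interval only if it is the smallest or the largest.
    return (n - 1) / (n + 1)

for n in (5, 10, 20):
    print(n, round(median_confidence(n), 4), round(prediction_confidence(n), 4))
```

Neither confidence level depends on the shape of the underlying distribution, only on the sample size.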
1.8 Normal Probability Models
A simple and convenient probability model for a single observation
is that of a normal distribution. This is the famous, archetypal
bell-shaped continuous distribution and is completely specified by
its mean and standard deviation. It has a number of famous
properties, including the fact that essentially all of the distribution
(about 99.7%) is within 3 standard deviations of the mean.
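The fraction of a normal distribution lying within a given number of standard deviations of the mean can be checked with the error function in Python's standard library. This is a minimal sketch; the function name `prob_within` is an illustrative choice.

```python
import math

def prob_within(k):
    """P(|X - mean| < k * sd) for a normal distribution with any mean and sd."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sd: {prob_within(k):.4f}")
```

The three values are approximately 0.6827, 0.9545 and 0.9973.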
If one models several observations as independent from a fixed
normal distribution, it is possible for mathematicians to derive
(implied) distributions for important data summaries (statistics).
These in turn can lead to methods of quantitative inference. For
example, the fact that the sample mean minus the distribution mean,
divided by the sample standard deviation over √n (n being the sample
size), has the t distribution with n - 1 degrees of freedom leads to
a confidence interval formula for the distribution mean.
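Written out in symbols, the result and the interval it produces look as follows. The notation is assumed here rather than taken from the course formula sheet: \bar{x} is the sample mean, s the sample standard deviation, \mu the distribution mean, n the sample size, and t^{*} the quantile of the t distribution with n - 1 degrees of freedom that matches the desired confidence level.

```latex
T = \frac{\bar{x} - \mu}{s / \sqrt{n}} \sim t_{n-1}
\qquad\Longrightarrow\qquad
\bar{x} \pm t^{*} \, \frac{s}{\sqrt{n}}
```

The interval comes from solving the inequality -t^{*} < T < t^{*} for \mu.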
The course formula sheet has examples of the kinds of methods
that are possible starting from a "random sampling from a normal
distribution" model.
1.9 Hypothesis Testing
In Vardeman's opinion, statistical intervals are more informative
than "hypothesis tests." Nevertheless, some mention of them is
necessary. A statistical hypothesis is a statement about a model
parameter or parameters. To test such a hypothesis is to use data to decide
whether or not to continue under the assumption embodied by it.
To make this decision, one collects data, computes some data
summary (the value of a test statistic), compares the observed value
to a reference distribution (describing behavior of the test statistic
if the hypothesis is true) and throws out the hypothesis if the
observed value is extreme/rare in comparison to this reference
distribution. For example, if a test statistic has a standard
normal reference distribution, an observed value of 3.0 would be
rare and (unless the alternatives of interest tend to produce small
rather than large observed values) the null hypothesis would
typically be "rejected."
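To see how rare an observed value of 3.0 is under a standard normal reference distribution, one can compute the corresponding tail probabilities. The sketch below uses only Python's standard library; the function name `upper_tail` is an illustrative choice.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal random variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

z_obs = 3.0
one_sided = upper_tail(z_obs)      # large values count as extreme
two_sided = 2 * upper_tail(z_obs)  # values far from 0 in either direction count as extreme
print(f"P(Z > 3.0)   = {one_sided:.5f}")
print(f"P(|Z| > 3.0) = {two_sided:.5f}")
```

These are roughly 0.00135 and 0.0027. Which tail is relevant depends on the alternatives of interest, as noted above.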
A variant on the accept/reject approach to testing is one where the
final product is a so-called p-value or observed level of
significance. This is the probability (calculated using the reference
distribution) of obtaining a value more extreme than the one in
Further, testing information is available as a by-product of
confidence interval making. If a particular value (of a model
parameter), #, is of interest and is inside one's confidence interval,
the null hypothesis "parameter = #" should not be rejected. If # is
not inside the confidence interval, the hypothesis could be rejected.
2. The Goals of Regression
The basic goal of so-called "regression analysis" is the modeling
of a response/output/dependent variable, y, as an approximate
function of one or more input/system/independent variables
x_1, x_2, ..., x_k. To do this, one begins with n data vectors
(y, x_1, x_2, ..., x_k) and uses the technology of Stat 328 to find
equations that allow adequate prediction of the y's based on the
(x_1, x_2, ..., x_k)'s. The descriptive statistics part of this can be done
without any appeal to probability models. This part is simply
"curve-fitting" (or "surface-fitting") using "least squares" software.
In order to make quantitative inferences and predictions with
plausible "reliability" or "confidence" figures, one must adopt and
use some probability model. The most convenient (and standard)
such model is one that says that y is a deterministic function of
(x_1, x_2, ..., x_k) plus normally distributed "error" or "noise" that has
mean 0 and a standard deviation that remains constant as
(x_1, x_2, ..., x_k) changes.
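For the simplest case of a single input variable, both pieces can be sketched in a few lines: the least squares fit, which needs no probability model, and the usual estimate of the constant error standard deviation, which only has meaning under the normal-error model. This is a minimal illustration in plain Python, not the JMP workflow used in the course; the data and function names are invented.

```python
import math

def fit_line(xs, ys):
    """Least squares intercept b0 and slope b1 for the line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def residual_sd(xs, ys, b0, b1):
    """Estimate of the constant error standard deviation (n - 2 divisor)."""
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))

# Hypothetical data, invented purely for illustration
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit_line(xs, ys)
s = residual_sd(xs, ys, b0, b1)
print(f"fitted line: y = {b0:.3f} + {b1:.3f} x, residual sd = {s:.3f}")
```

The divisor n - 2 reflects the two fitted coefficients. With k input variables and an intercept the corresponding divisor is n - k - 1.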