Stat 328 1st Week Outline

1 Introduction

1.1 "Statistics"

Statistics is:
•data collection
•data summarization
•quantitative inference from data
all in a framework that recognizes the reality and omnipresence of variation.

Some sources of variation in business data include:
•real item-to-item variation
•sampling variation
•measurement error

1.2 Issues in Measurement

The fundamental issues in metrology (the science of measurement) are:
•validity
•precision
•accuracy

With precise enough measurement, even repeat measurements of the same unit will typically vary.

1.3 Issues in Sample Selection

Data are sometimes collected from concrete/well-defined "populations" of items/units of interest.
•(for purposes of protecting oneself and producing inferences with quantifiable reliability) this is ideally done "at random"
•in reality, it is often done haphazardly or according to convenience

Data are often collected from "processes" (where there is no fixed concrete/well-defined set of items or units under consideration) and the object is to understand the nature of the process. If there is any hope of doing this, the process must be "stable," i.e. must not be changing in a completely unpredictable fashion over time. (SPC methods are aimed at verifying this.)

The same mathematics is typically used to support inference-making for both populations and processes. (This unfortunately sometimes leads to muddled thinking and expositions.)

1.4 Some Terminology

Data may be:
•quantitative
 -measurements (JMP: continuous)
 -counts
•qualitative
 -ordinal
 -nominal

Data may be:
•univariate
•multivariate (including bivariate, including "paired")

1.5 Mathematical Models and Data Analysis

Mathematical models are descriptions of real systems and phenomena in terms of numbers, symbols, equations and the like. The most common of these are "deterministic" and don't allow for randomness or variation from a particular prediction they generate.

Probability models are mathematical descriptions of "chance" phenomena and do allow for the kind of variation seen in real-world business data. These are used to support quantitative inference from data.

1.6 Basic Descriptive Statistics

The place to start with data description is with a single sample ... data from a single population or single set of process conditions. Some basic tools for quantitative data are:
•graphical representations
 -dot diagrams
 -stem and leaf diagrams
 -histograms (bar charts)
•numerical summaries
 -sample minimum and sample maximum
 -quantiles
 -mean
 -variance and standard deviation

1.7 The Simplest Possible Examples of Probability-Based Inference

Slightly non-standard but informative examples of probability-based inference concern what can be learned from the sample minimum and sample maximum, if one adopts a model of independent observations from an unknown (but fixed) "continuous distribution." The interval with end-points at the sample extremes can be used as
•a "confidence interval" for the distribution median
•a "prediction interval" for a single additional observation
with known confidence/reliability.

1.8 Normal Probability Models

A simple and convenient probability model for a single observation is that of a normal distribution. This is the famous, archetypal bell-shaped continuous distribution and is completely specified by its mean and standard deviation. It has a number of famous properties, including the fact that essentially all of the distribution is within 3 standard deviations of the mean.
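As a quick numerical check of that last fact, here is a minimal sketch (assuming Python with scipy installed; the mean and standard deviation are illustrative values, not course data) of how much probability a normal distribution puts within 3 standard deviations of its mean.

```python
# Sketch (not from the course notes): probability that a normal distribution
# places within 3 standard deviations of its mean.
from scipy.stats import norm

mu, sigma = 100.0, 15.0          # illustrative values; any choice gives the same answer
dist = norm(loc=mu, scale=sigma)

within_3sd = dist.cdf(mu + 3 * sigma) - dist.cdf(mu - 3 * sigma)
print(f"P(mu - 3*sigma < X < mu + 3*sigma) = {within_3sd:.4f}")  # about 0.9973
```

The roughly 0.9973 figure is what is meant above by "essentially all" of the distribution.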
If one models several observations as independent from a fixed normal distribution, it is possible for mathematicians to derive (implied) distributions for important data summaries (statistics). These in turn can lead to methods of quantitative inference. For example, the fact that (x̄ − μ)/(s/√n), i.e. the sample mean minus the distribution mean divided by the sample standard deviation over √n, has the t distribution leads to a confidence interval formula for the distribution mean.

The course formula sheet has examples of the kinds of methods that are possible starting from a "random sampling from a normal distribution" model.

1.9 Hypothesis Testing

In Vardeman's opinion statistical intervals are more informative than "hypothesis tests." Nevertheless, some mention of these is necessary.

A statistical hypothesis is a statement about a model parameter or parameters. To test such a hypothesis is to use data to decide whether or not to continue under the assumption embodied by it. To make this decision, one collects data, computes some data summary (the value of a test statistic), compares the observed value to a reference distribution (describing the behavior of the test statistic if the hypothesis is true), and throws out the hypothesis if the observed value is extreme/rare in comparison to this reference distribution. For example, if a test statistic has a standard normal reference distribution, an observed value of 3.0 would be rare and (unless the alternatives of interest tend to produce small rather than large observed values) the null hypothesis would typically be "rejected."

A variant on the accept/reject approach to testing is one where the final product is a so-called p-value or observed level of significance. This is the probability (calculated using the reference distribution) of obtaining a value at least as extreme as the one in hand.

Further, testing information is available as a by-product of confidence interval making. If a particular value (of a model parameter), #, is of interest and is inside one's confidence interval, the null hypothesis "parameter = #" should not be rejected. If # is not inside the confidence interval, the hypothesis could be rejected.

2 The Goals of Regression

The basic goal of so-called "regression analysis" is the modeling of a response/output/dependent variable, y, as an approximate function of one or more input/system/independent variables x1, x2, ..., xk. To do this, one begins with n data vectors (y, x1, x2, ..., xk) and uses the technology of Stat 328 to find equations that allow adequate prediction of the y's based on the (x1, x2, ..., xk)'s.

The descriptive statistics part of this can be done without any appeal to probability models. This part is simply "curve-fitting" (or "surface-fitting") using "least squares" software.

In order to make quantitative inferences and predictions with plausible "reliability" or "confidence" figures, one must adopt and use some probability model. The most convenient (and standard) such model is one that says that y is a deterministic function of (x1, x2, ..., xk) plus normally distributed "error" or "noise" that has mean 0 and a standard deviation that remains constant as (x1, x2, ..., xk) changes.
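To make the "curve-fitting" part concrete, the following is a minimal sketch (in Python with numpy, using a small invented data set rather than anything from the course) of fitting a straight line by least squares, i.e. the single-input case k = 1.

```python
# Sketch with invented data: least squares fit of y on a single input x (k = 1).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # hypothetical input values
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])   # hypothetical responses

b1, b0 = np.polyfit(x, y, deg=1)               # slope and intercept minimizing the
                                               # sum of squared residuals
y_hat = b0 + b1 * x                            # fitted values
residuals = y - y_hat

print(f"fitted line: y_hat = {b0:.3f} + {b1:.3f} x")
print(f"sum of squared residuals = {np.sum(residuals**2):.3f}")
```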
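Once the constant-standard-deviation normal-error model just described is adopted, the same sort of fit also supports quantitative inference. The sketch below (again only an illustration with the invented data above, assuming scipy; it is not a prescription of the course's software) turns the least squares slope and its standard error into a 95% confidence interval for the slope.

```python
# Sketch continuing the invented data above: inference for the slope under the
# "normal errors with constant standard deviation" model.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

fit = stats.linregress(x, y)              # slope, intercept, and standard error of the slope
n = len(x)
t_star = stats.t.ppf(0.975, df=n - 2)     # t critical value with n - 2 degrees of freedom

lo = fit.slope - t_star * fit.stderr
hi = fit.slope + t_star * fit.stderr
print(f"95% confidence interval for the slope: ({lo:.3f}, {hi:.3f})")
```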