Introduction

STAT 600: 1 - Introduction to Statistics
Spring 2014
WHAT IS STATISTICS?
Statistics is the science of learning from data.
This involves collecting, describing, and drawing conclusions from data.
The application of statistics can be divided into two broad areas:

Descriptive Statistics: Graphical methods or numerical summaries which
describe data.

Inferential Statistics: Process in which a smaller group (sample) is used to draw
conclusions about a larger group (population).
TYPES OF DATA
Recall that statistics is the science of data. All data (and hence the variables that we
measure) can be classified as one of two types, and each of these has two subgroups.

Qualitative (or categorical): Measurements that are classified into one of a group
of categories.
- Nominal: Order is not important.
- Ordinal: Measurements fall in some natural order.

Quantitative (or numerical): Measurements that are recorded on a naturally
occurring numerical scale.
- Discrete: There are gaps between possible data values.
- Continuous: There are no gaps between possible data values; that is, the
measurements occur on a continuous scale.
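The four subgroups above can be sketched as a small lookup in Python. This is only an illustration: the variable names (blood_type, pain_rating, num_siblings, height_cm) are hypothetical and are not taken from the examples in these notes.

```python
from enum import Enum


class VariableType(Enum):
    NOMINAL = "qualitative, nominal"
    ORDINAL = "qualitative, ordinal"
    DISCRETE = "quantitative, discrete"
    CONTINUOUS = "quantitative, continuous"


def is_quantitative(vtype: VariableType) -> bool:
    """Return True for the two numerical subgroups."""
    return vtype in (VariableType.DISCRETE, VariableType.CONTINUOUS)


# Hypothetical variables, invented for illustration only.
variables = {
    "blood_type": VariableType.NOMINAL,      # categories with no order
    "pain_rating": VariableType.ORDINAL,     # mild < moderate < severe
    "num_siblings": VariableType.DISCRETE,   # counts: 0, 1, 2, ...
    "height_cm": VariableType.CONTINUOUS,    # any value in an interval
}

for name, vtype in variables.items():
    kind = "quantitative" if is_quantitative(vtype) else "qualitative"
    print(f"{name}: {vtype.name.lower()} ({kind})")
```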
Example 1.1: The National Center for Health Statistics administers a survey to National
Health and Nutrition Examination Survey participants on an annual basis. The survey
participants are randomly selected from the U.S. population
(http://www.cdc.gov/nchs/nhanes/participant.htm).
The following survey items refer to the participant’s dermatological health. Classify
each item (variable) as nominal, ordinal, discrete, or continuous.
Survey Item/Variable: Variable Type
- How many moles do you have that are at least ¼ inch in diameter? ___________
- What is your natural hair color? ___________
- When you go outside on a very sunny day for more than an hour, how often do you wear sunscreen (always, most of the time, sometimes, rarely, or never)? ___________
- How many times in the past year have you had a sunburn? ___________
- If applicable, diameter of moles or lesions suspicious of melanoma or other malignancies. ___________
Example 1.2: Assessing Mercury Levels Found in Fish in Maine Lakes
Mercury is a toxic metal sometimes found in fish consumed by humans. The state of
Maine conducted a field study of 115 lakes to characterize mercury levels in fish,
measuring mercury and 10 variables on lake characteristics. From these data we could
potentially investigate the following research questions:
1. Are mercury levels high enough to be of concern in Maine lakes?
2. Do dams and other man-made flowage controls increase/decrease mercury
levels?
3. Do different types of lakes have different mercury levels?
4. Which lake characteristics best predict mercury levels?
The variables measured by the researchers as part of this field study are listed below.
Classify each variable according to type.
Merc (ppm): Mercury level found in fish fillets in parts per million ___________
N: number of fish in the composite ___________
Elevation: elevation of the lake (feet) ___________
Surf Area: surface area of the lake (acres) ___________
Z: maximum depth (feet) ___________
Lake type: 1 = oligotrophic, 2 = eutrophic, 3 = mesotrophic ___________
ST: lake stratification indicator (1 = yes, 0 = no) ___________
This refers to whether or not there is temperature stratification within a lake. In summer, the lake surface
warms up and a decreasing temperature gradient may exist with the bottom remaining cold. A lake is
considered stratified if a temperature decrease of 1 degree per meter or greater exists with depth.
DA: drainage area (square miles) ___________
Area of land which collects and drains the rainwater which falls on it, such as the area around the lake.
RF: runoff factor = (total runoff during year)/(total precipitation during year). ________
Runoff factor (RF) is the amount of rainwater or melted snow which flows in rivers and streams. In
general, higher runoff factors may lead to more surface water from the lake watershed reaching lakes. If
contaminants are from a local source, this may influence the concentration found in fish.
FR: flushing rate = (total inflow volume during year)/(total volume of lake). ________
Flushing rate (FR) gives the number of times all water is theoretically exchanged during a year.
DAM: Dept. of Inland Fisheries and Wildlife impoundment class. ________
0 = no functional dam present; all natural flowage
1 = some man-made flowage in the drainage area
SOME BASIC DEFINITIONS
Most of what we’ll be doing in this course centers on trying to understand a set of
information. This set of information is from a . . .
Population: The complete collection of ALL elements to be studied.
The population is often so big that obtaining all information about its elements is
either difficult or impossible. So, we work with a more manageable set of data
obtained from a . . .
Sample: A subcollection of elements drawn from a population.
Example 1.3: Consider the National Health and Nutrition Examination Survey
mentioned above. Identify the following:

Population of interest:

Sample:
Census: All elements are drawn from the population (hence there is no difference
between the population and the sample). Note that inferential methods are not
needed when a census of the entire population is taken.
Observation: The collection of measurements from a particular unit in a population.
Variable: Any measurable characteristic of an observation.
When creating a data set to be imported into a statistical software package such as JMP,
you should place each variable in its own column. Then, each row will consist of a
separate observation.
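As a rough sketch of this layout, the Python snippet below writes a small comma-separated file with one column per variable and one row per observation. The variable names and values (subject_id, age_years, smoker) are hypothetical and invented for illustration; a file in this shape is the kind a package such as JMP can typically import.

```python
import csv
import io

# Hypothetical variables and values, invented for illustration only.
columns = ["subject_id", "age_years", "smoker"]
observations = [
    {"subject_id": 1, "age_years": 34, "smoker": "no"},
    {"subject_id": 2, "age_years": 51, "smoker": "yes"},
    {"subject_id": 3, "age_years": 27, "smoker": "no"},
]

# Write to an in-memory buffer; swap in open("data.csv", "w", newline="")
# to produce an actual file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
writer.writeheader()              # first row: one name per variable
writer.writerows(observations)    # each later row: one observation
csv_text = buffer.getvalue()
print(csv_text)
```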
Example 1.4: Show how you would construct a data set for the National Health and
Nutrition Examination Survey mentioned above.
Parameter: A numerical descriptive measure of a population. This value is almost
always unknown.
• 𝜋 for population proportion
• μ for population mean
Statistic: A numerical descriptive measure of a sample. This value is calculated from the
observed data. We will use the following notation.
• 𝜋̂ for sample proportion
• 𝑦̅ for sample mean
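To make the notation concrete, here is a minimal Python sketch that computes both statistics from a small sample. The data are hypothetical (n = 8 invented values); the names pi_hat and y_bar simply mirror the notation above.

```python
from statistics import mean

# Hypothetical sample of n = 8 units; values invented for illustration.
# 1 = unit has the characteristic of interest, 0 = it does not.
has_trait = [1, 0, 0, 1, 0, 0, 0, 1]
measurements = [2.0, 3.5, 1.5, 4.0, 2.5, 3.0, 3.5, 4.0]

n = len(has_trait)
pi_hat = sum(has_trait) / n    # sample proportion: 3 of 8 units
y_bar = mean(measurements)     # sample mean: 24.0 / 8
print(pi_hat, y_bar)           # prints 0.375 3.0
```

The corresponding parameters, 𝜋 and μ, would require the same calculations on the entire population, which is why they are almost always unknown.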
Example 1.5: Suppose you are interested in estimating the proportion of the U.S.
population that never wears sunscreen when outside for more than an hour on a sunny
day. Identify the following:

Parameter of interest:

Statistic of interest:
General Approach to Statistical Process
TYPES OF STUDIES – (from Powerpoint in class and on website)
Two Main Types of Studies
Observational – researcher collects info on attributes or measurements of interest, but
does not influence results.
Experimental – researcher deliberately influences events and investigates the effects of
the intervention, e.g. clinical trials and laboratory experiments.
EXPERIMENTAL STUDIES – basic terms and concepts
1. Completely Randomized Design (CRD)
The treatments are allocated entirely by chance to the experimental units.
Example 1.6: Tomato Plants
Which of two varieties of tomatoes (A & B) yields a greater quantity of market quality fruit?
Factors that may affect yield: soil fertility; exposure to wind/sun; soil pH levels; soil
water content etc. Divide the field into plots and randomly allocate the tomato varieties
(treatments) to each plot (unit).
Situation 1: 8 plots – 4 get variety A
Situation 2: 8 plots – 4 get variety A (field diagram labeled UPHILL)
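One way to carry out a completely randomized allocation for 8 plots is sketched below in Python. The function name completely_randomized, the plot numbering 1 through 8, and the seed value are assumptions made for illustration; the notes specify only that 4 of the 8 plots get variety A.

```python
import random


def completely_randomized(units, treatments, seed=None):
    """Assign equal numbers of each treatment to units purely by chance."""
    if len(units) % len(treatments) != 0:
        raise ValueError("units must divide evenly among treatments")
    reps = len(units) // len(treatments)
    labels = [t for t in treatments for _ in range(reps)]
    rng = random.Random(seed)   # seed only makes the draw reproducible
    rng.shuffle(labels)         # chance alone decides which plot gets what
    return dict(zip(units, labels))


plots = list(range(1, 9))       # 8 plots, numbered 1-8 (hypothetical labels)
allocation = completely_randomized(plots, ["A", "B"], seed=600)
for plot, variety in allocation.items():
    print(f"plot {plot}: variety {variety}")
```

Because the allocation ignores where each plot sits in the field, a feature such as a slope can by chance line up with one variety.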
2. Blocking
Group (block) experimental units by some known factor and then randomize within each
block in an attempt to balance out the unknown factors.
Example 1.7: Tomato Plants Again
It is recognized that there are two areas in the field — well drained and poorly drained.
Partition the field into two blocks and then randomly allocate the tomato varieties to
plots within each block.
Well-drained
Poorly drained
How should we allocate varieties to the 12 plots?
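A possible allocation scheme is sketched below in Python: randomize separately within each block, so that both varieties appear equally often in the well-drained and the poorly drained area. The split of 6 plots per block, the plot numbers, and the function name randomize_within_blocks are assumptions for illustration; the notes state only that there are 12 plots in two blocks.

```python
import random


def randomize_within_blocks(blocks, treatments, seed=None):
    """Randomly allocate treatments separately inside each block."""
    rng = random.Random(seed)
    allocation = {}
    for block_name, plots in blocks.items():
        if len(plots) % len(treatments) != 0:
            raise ValueError("plots in a block must divide evenly among treatments")
        reps = len(plots) // len(treatments)
        labels = [t for t in treatments for _ in range(reps)]
        rng.shuffle(labels)     # chance decides the order within this block only
        for plot, variety in zip(plots, labels):
            allocation[plot] = (block_name, variety)
    return allocation


# Assumed layout: 6 plots in each drainage area (not stated in the notes).
blocks = {
    "well_drained": [1, 2, 3, 4, 5, 6],
    "poorly_drained": [7, 8, 9, 10, 11, 12],
}
allocation = randomize_within_blocks(blocks, ["A", "B"], seed=600)
for plot, (block_name, variety) in sorted(allocation.items()):
    print(f"plot {plot} ({block_name}): variety {variety}")
```

Unlike a completely randomized design, this guarantees that drainage cannot favor one variety, since each variety is grown on 3 plots of each type.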
Example 1.8: Comparing Three Pain Relievers for Headache Sufferers
How could we design an experiment? How could blocking be used to increase precision
of our experiment?
Example 1.9: Horse Leg Wraps
• 17 “boots” tested, each boot is tested n = 5 times. Why?
• Because of time constraints, not all boots were tested on the same day.
• 8 tested 1st day, 5 tested 2nd day, 4 tested 3rd day.
• Leg was placed in freezer and thawed before the 2nd and 3rd days of testing.
Horse Leg Diagram:
Questions:
What problems do you foresee with this experimental design?
What actually happened? Below is a plot of the force readings when no wrap was used
on the leg during the three days of testing.
What is the implication of the results shown above?
Final Results of Horse Leg Wrap Study
Q: What should have been done?
3. Using People as Experimental Units (Medical Studies/Clinical Trials)
Example 1.10: Cholesterol Drug Study
Suppose we wish to determine whether a drug will help lower the cholesterol level of
patients who take it.
How should we design the study?
Important Concepts for Experiments with Human Subjects
• control group:
– Receive no treatment or an existing treatment
• blinding:
– Subjects don’t know which treatment they receive
• double blind:
– Both subjects and administrators/diagnosticians are blinded
• placebo:
– Inert dummy treatment
• placebo effect:
– A common response in humans when they believe they have been treated.
– Approximately 35% of people respond positively to dummy treatments - the
placebo effect
OBSERVATIONAL STUDIES
There are two major types of observational studies: prospective studies and
retrospective studies.
1. Prospective Studies
Choose samples now, measure variables and follow up in the future, e.g. choose a group
of smokers and non-smokers now and observe their health in the future.
2. Retrospective Studies
Look back at the past, e.g. a case-control study.
Separate samples for cases and controls (non-cases). Why?
Look back into the past and compare histories. For example, we could choose two
groups: lung cancer patients and non-lung cancer patients. Compare their smoking
histories.
3. Cross-sectional Studies
Choose samples now, measure variables of interest, some of which may be retrospective
in nature.
Important Note:
1. Observational studies should use some form of random sampling to obtain
representative samples.
2. Observational studies cannot reliably establish causation.
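As an illustration of the first note, a simple random sample can be drawn from a list of population IDs as sketched below in Python. The population size (1,000), the sample size (50), and the seed are hypothetical choices; simple random sampling is only one of several forms of random sampling.

```python
import random

# Hypothetical sampling frame: ID numbers for a population of 1,000 units.
population_ids = list(range(1, 1001))

rng = random.Random(2014)   # fixed seed so the draw can be reproduced
# Draw 50 distinct IDs; every unit has the same chance of selection.
sample_ids = rng.sample(population_ids, k=50)
print(sorted(sample_ids)[:10])
```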
Example 1.11: “Home Births Give Babies a Good Chance”, NZ Herald, 1990

An Australian report stated that babies are twice as likely to die during or
soon after a hospital delivery as during or soon after a home birth.

The report was based upon simple random samples of home births and hospital
births.
Comments:
Example 1.12: “Lead Exposure Linked to Bad Teeth in Children”, USA Today

The study involved 24,901 children ages 2 and older.

It showed that the greater the child’s exposure to lead, the more decayed or
missing teeth.
Comments:
Additional Examples:
Example 1.13: Determine whether age at 1st pregnancy is a risk factor for cervical
cancer.
Example 1.14: Determine what factors might influence the “success” of a duck nest.
Example 1.15: Test the toxicity of a new pesticide/herbicide on aquatic organisms.