Statistics Basics

Statistical Lingo
As the planning committee thought about this statistical training, we
thought it would be important to have a list of key statistical concepts that might
be useful to you. At the training we may use many terms, and we want our staff
to be on the same page coming into the program. Below you will find a list of
definitions and key concepts that we think are important for you to understand as
you use statistics. The definitions below have come from a wide range of
sources.
Statistics: A tool to help one process and make sense of large amounts of
quantitative information, or data, collected in research trials, studies of human
behavior, or related projects. Statistics is to data as grammar is to words. (Ellen Fireman)
Population: The entire group about which you want to make a statement. If you
are interested in knowing what potatoes Maine people want to eat, the population
is all Maine people who eat potatoes. You will rarely, if ever, have all the
information about a population. Instead, we sample that population and make
inferences about the whole population from the sample data.
Sample: A set of measurements that constitutes part of the population of
interest. Generally, the larger the sample, the greater the confidence that you
have accurately described the population. A random sample is one in which any
individual measurement is as likely to be included as any other.
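Drawing a random sample can be sketched in a few lines of Python; the population of 100 numbered units here is purely hypothetical:

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical population: 100 numbered units (e.g., survey respondents)
population = list(range(1, 101))

# random.sample gives every unit the same chance of being drawn,
# which is the defining property of a random sample
sample = random.sample(population, 10)

print(sample)  # 10 distinct units drawn at random
```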
Sample Number: This is the number of observations that make up a set of data.
A capital N signifies the size of the whole population; a lowercase n signifies the
number in your sample. Example: if you were working with 4-H youth in Maine,
N is the total number of 4-H youth and n is the size of your subsample of 4-H
youth.
Parameters: These are the characteristics of a population. Characteristics
include averages, ranges, midpoints, variation about the average, etc.
Normal Distribution: This is also called the bell-shaped curve. It describes a
characteristic or variable of biological populations (humans, plants, etc.) that has
more values near the average and fewer toward the ends of the range. It is a
distribution that is symmetrical about the mean, median and mode, with no
skewed tails. Consider shoe sizes: if your foot is very small or very large, you
have trouble finding shoes. If you are a size 9 or 10 in men's sizes or 7 to 9 in
women's, your selection generally improves. If you were to plot the shoe sizes of
the women who work in Extension in Maine, the plot would likely take this shape.
[Figure: Normal Distribution of Women's Shoe Sizes in UMCE – frequency of
occurrence (0 to 25) plotted against women's shoe sizes (2 to 14).]
Parametric Statistics: terms used to describe populations that take on this
bell-curve shape (i.e., average, variation, etc.), which is also called a "normal
distribution."
Non-parametric Statistics: statistics used to describe data that do not follow
this type of bell-curve shape (for example, microbiological samples, water flow
or infiltration rates, etc.). Different statistics are used to describe these
populations.
Experiment: A planned organized process to determine if a specific treatment
(drug, educational program, ad campaign) has an effect.
Hypothesis: the theory or question you want to test with your study.
Note: steps to good experimentation, a) form hypothesis, b) plan
experiment (to reduce bias), c) take good data, d) interpret data, e)
confirm or reject hypothesis, f) write up and publish the study.
Treatment: the specific material or concept that you want to test in your study
(drug, chemical or organic fertilizer, educational program, curriculum, etc.)
The method of manipulating your samples, what you intend to expose or apply to
your samples (dose of chemical to plants, dose of blueberry capsule to
participants, concentrated dip to potatoes) in your study to determine any effects.
Observational studies: no treatments are applied – observations are made on
people, plants, animals, etc., which may have different characteristics. The data
are simply observed, noted and recorded.
Note: these are frequently studies done with humans (malnutrition, fetal
alcohol syndrome, etc.) where it would not be ethical to apply a treatment.
Controlled Experiments: planned processes to determine if a treated material
(human, crop, etc.) is different from an untreated material (your control).
Treated groups – groups receiving the treatment
Control groups – groups not receiving the treatment
Historical controls – subjects from a past study act as controls for a
current study
Key concept in all experimentation: the control group should be as
similar as absolutely possible to the treated groups so that the real
differences found can be attributed to the treatment effect.
Experimental Unit: the unit of experimental material to which a treatment is
applied: may be a person, leaf, individual plot of corn, … whatever.
Experimental Error: Variability (natural, genetic, environmental soil type
variability) among experimental units that cannot be controlled by the researcher.
We design our studies to try to limit experimental error as much as possible.
Bias: defined as the chance that a researcher’s preconceived notions about a
treatment may influence an experimental outcome.
Subject bias: important in human studies and often called the placebo
effect; subjects are influenced positively or negatively by the belief that they are
receiving the treatment.
Evaluator bias: a potential influence by the person evaluating the
treatment. Randomized code labeling of plots helps control this source of bias.
Blind: not allowing the subject to know if they are taking the treatment or
placebo. Double blind treatment assignment eliminates both sources of bias.
Variable: a measurable characteristic of an experimental unit.
Quantitative variables: measured numerically. Size of human feet, yield
of corn, etc. One can make frequency distributions of these data. These types of
data represent a measured quantity.
Categorical variables: variables that take on names or labels. Some
examples would be breed of dog (collie, shepherd, husky) or color of a ball
(red, green, yellow).
Quantitative variables can be further broken down into two more types of
variables:
Discrete variables: take only specific, separate values – each observation
must be a single countable number (ex. the number of times heads came up
in a series of coin tosses, the number of treatments, or the number of dogs
owned).
Continuous variables: can take any value within a range. In a study of
female college student weight with previously specified weight ranges
(120-130 lbs, 130-140 lbs, etc.), a student's weight could fall anywhere
within one of those ranges. Other examples include weight of grain, years
of ownership of a home, etc.
Qualitative variables: not measured numerically – eye color, Social
Security numbers, and much survey information (agree, strongly agree, etc.).
Independent variable: the treatment level or factor that you set when
conducting a trial (ex. the rate of nitrogen in a corn trial).
Dependent variable: this is the information (data) that you collect from the
trial … this is dependent on the treatment level applied.
Measures of distributions:
These are three common measures used to describe data in
statistics: Mean/Average, Median and Mode:
Average: to find the average, sum your data set (the numbers you have) and
divide by the total number of samples. Ex. 3, 7, 5, 15, 2, 8, 10: sum = 50, n
(number of samples) = 7; 50 ÷ 7 ≈ 7.1, which is your average.
Mean: the same as the average. The population mean is given the Greek
symbol μ (mu); the sample mean is a statistic – the estimate of the population
mean drawn from a sample. The sample mean is given the symbol x̄ (x-bar).
Median: the midpoint of a series of data: rank the numbers in order and find the
middle one. 2, 3, 5, 7, 8, 10, 15 – the median is 7.
Note: if median and mean are similar, data are likely parametric, or
normally distributed
Mode: The mode is the number repeated most often in your data set. If there are
no repeating numbers, then there is no mode for the data set.
Ex. Data set: 3, 5, 10, 2, 3, 6, 15, 3. The mode would be 3 because it is the
number repeated the most. Mode = 3
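The three measures can be checked with Python's standard library, using the example data sets from the definitions above:

```python
import statistics

data = [3, 7, 5, 15, 2, 8, 10]            # data set from the Average example
mean = sum(data) / len(data)               # 50 / 7, about 7.1
median = statistics.median(data)           # ranked: 2, 3, 5, 7, 8, 10, 15 -> 7

mode_data = [3, 5, 10, 2, 3, 6, 15, 3]     # data set from the Mode example
mode = statistics.mode(mode_data)          # 3 is repeated most often

print(round(mean, 1), median, mode)        # 7.1 7 3
```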
Variance or standard deviation: measures of the spread (or
variability) around the mean of a specific set of data. Generally speaking, the
lower the variance or standard deviation around a mean, the more confidence we
have that the sample mean is a real or valid number. The figure below
illustrates this concept. A variance of 0.5 means that the numbers that make up
the mean are tightly distributed around that mean; a variance of 2 suggests
more variability. The symbol for standard deviation is σ; for variance it is σ².
Variance is the square of the standard deviation.
Standard deviation around a population mean is defined by this equation:
σ = √( Σ(x − μ)² / n )
[x is a value from the population, μ is the mean of all x, n is the number of x in
the population, Σ indicates summation]
The estimate of the population standard deviation calculated from a random sample
is:
s = √( Σ(x − x̄)² / (n − 1) )
[x is an observation from the sample, x̄ is the sample mean, n is the sample
size, Σ indicates summation]
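As a cross-check, both versions of the standard deviation can be computed directly; the data set here is made up for illustration:

```python
import math

data = [2, 3, 5, 7, 8, 10, 15]   # made-up data set, n = 7
n = len(data)
mean = sum(data) / n
ss = sum((x - mean) ** 2 for x in data)   # summed squared deviations

# Population standard deviation: divide by n
pop_sd = math.sqrt(ss / n)

# Sample estimate: divide by n - 1 (a degree of freedom was spent on the mean)
sample_sd = math.sqrt(ss / (n - 1))

print(round(pop_sd, 2), round(sample_sd, 2))  # the sample estimate is the larger
```

These match Python's statistics.pstdev and statistics.stdev, respectively.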
Standard error of the mean or standard error:
The standard error is an important statistic for determining whether the
mean of one sample is really different from the mean of another. To calculate the
standard error, divide the standard deviation of a data set by the square root of
the number of values that went into making the mean (equivalently, take the
square root of the variance divided by that number). If you teach a program and
you have 14 pre-test and 14 post-test scores, the standard error of the post-test
mean is the standard deviation of those scores divided by the square root of the
number of participants (√14). The symbol for the standard error is SE.
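A minimal sketch of the calculation, using 14 hypothetical post-test scores (the standard error is the standard deviation divided by √n, i.e., the square root of the variance divided by n):

```python
import math
import statistics

# 14 hypothetical post-test scores
scores = [72, 75, 78, 80, 81, 83, 85, 86, 88, 90, 91, 93, 95, 97]
n = len(scores)

sd = statistics.stdev(scores)      # sample standard deviation
se = sd / math.sqrt(n)             # standard error of the mean

# Equivalent form: square root of (variance / n)
se_alt = math.sqrt(statistics.variance(scores) / n)

print(round(se, 2), round(se_alt, 2))
```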
Coefficient of Variation
There are times when you want to compare the variability of two measurements
in a research project. You might have measured participants' weight and waist
size before and after an exercise program. Loss in weight and change in waist
size will be numerically quite different: after the program, the average weight
loss might be 17.5 lbs, while the waist size change might be 3.2 inches. We can
use the standard deviations of these data sets to say whether the weight loss
data were more variable than the change in waist size by using the coefficient of
variation. It is calculated by dividing the standard deviation by the mean and
multiplying by 100 to put it on a percentage basis.
                     Weight loss   Waist size
Sample mean              17.5          3.2
Standard deviation        3.3          1.5
CV                      18.9%          47%
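The percentages come straight from the definition; recomputing them from the means and standard deviations given above:

```python
def cv(sd, mean):
    """Coefficient of variation: standard deviation over mean, as a percentage."""
    return sd / mean * 100

weight_cv = cv(3.3, 17.5)   # weight loss
waist_cv = cv(1.5, 3.2)     # waist size

print(round(weight_cv, 1), round(waist_cv))  # 18.9 47
```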
These statistics may be explained by looking at the people who attend. Some
people concentrate weight gain in their waist (for example: beer-drinking,
stock-car-racing male Southerners). If your program had some of these
participants, the change in waist size would likely be the more variable
measurement, because there is such great potential for change once they stop
reaching into the refrigerator for beer.
CVs are useful for getting a sense of the general variability of a given test
procedure or measurement. A test with an average CV of 10% is likely a pretty
accurate test. In field research, we can use a CV to see how well we controlled
variability in one location compared to another. It is a useful measure.
For Statistics Junkies … these are important but difficult theoretical
concepts:
Degrees of Freedom: This is an important and difficult concept in statistics. It is
like an accounting procedure. When you calculate the sample mean for a set of
data and then use that statistic to calculate a second statistic (the standard
deviation), you have to divide by n − 1 instead of n because you have already
calculated the mean. Another way to see this: imagine you have four numbers
(2, 3, 4 and 5) that must add up to a total of 14. You are free to choose the first
three numbers at random, but the fourth must be chosen so that it makes the
total equal 14. With four numbers, you have three degrees of freedom.
Likewise, when you have four treatments in an experiment, you have three
degrees of freedom associated with them, and that number is used in all
calculations of error.
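The four-numbers example can be written out directly; once three values and the required total are fixed, the fourth number has no freedom left:

```python
total = 14
free_choices = [2, 3, 4]            # the first three numbers, chosen freely
forced = total - sum(free_choices)  # the fourth is forced to be 5

values = free_choices + [forced]
df = len(values) - 1                # three degrees of freedom

print(forced, df)  # 5 3
```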
Z scores:
Another way to standardize a set of data is to calculate a Z score for each
value: subtract the mean from the value and divide by the standard deviation,
z = (x − x̄) / s.
With normally distributed data, about 95% of the data fall within 2 standard
deviations of the mean, and 99.7% of all data fall within 3 standard deviations of
the sample mean.
This has practical value when evaluating whether a single data point should be
thrown out. Generally, if a data point is more than 3 standard deviations away
from the average, it is considered an outlier and is safe to eliminate, particularly
if there are good reasons to do so. For example, suppose you have 14 people in
a class who take a pre-test and a post-test. If one or two participants attended
only 10% of the class, their post-test scores would likely be far below the
post-test average. If they are more than 3 standard deviations below, these
would be logical scores to throw out.
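The outlier check can be sketched as follows; the 14 scores are hypothetical, with one participant who barely attended:

```python
import statistics

# 13 typical post-test scores plus one from a participant who barely attended
scores = [82, 85, 88, 78, 90, 84, 86, 79, 91, 83, 87, 80, 89, 35]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)

# Flag scores more than 3 standard deviations from the mean as outliers
outliers = [x for x in scores if abs(x - mean) / sd > 3]

print(outliers)  # [35]
```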