Stats Glossary

Section 1.1

Data – numbers collected in a particular context.
For example, if you asked everyone in our class how many brothers and/or sisters
they had, the numbers that your class members gave as responses would represent
data.

Variable – any characteristic of a person or thing that can be assigned a number
or a category.
For example, in the scenario described above, the variable would be the number
of brothers and/or sisters.
Students sometimes have trouble determining whether or not a statement
represents a variable. Suppose that the statement was “name of your math
teacher.” If the observational units were students in our class, would this vary
from student to student? Pretend to ask each student, “What is the name of your
math teacher?” Aren’t they all going to say the same name? Since the students’
answers would not vary, this cannot be a variable.

Observational unit – the person or thing to which the number or category is
assigned; also called the case.
For example, each class member that you asked for the number of brothers and/or
sisters is an observational unit or case.
________________________________________________________________________
Section 1.2

Quantitative variable – a variable that measures a numerical characteristic; also
called a measurement variable.
For example, since the response to how many brothers and/or sisters a person has
is a number, this variable is a quantitative variable.
Count variable – a type of quantitative variable; answers the question, “How
many?”

Categorical variable – a variable that records a category designation; also called
a qualitative variable.
For example, if you were to record the gender of your class members, gender
would be a categorical variable because the class members are either in the female
category or in the male category.
Here is another example. Suppose you asked your class members for their
favorite flavor of ice cream and you allowed them to choose from the following:
vanilla, chocolate, strawberry, or other. Ice cream flavor would be a categorical
variable because the class responses would fall into one of the four categories.

Binary variable – a special categorical variable for which only two possible
categories exist.
For example, the variable gender would be a binary variable. The variable ice
cream flavor would not be a binary variable because it has more than two
categories for the responses.

Displays for a Categorical Variable
1. Frequency Table
2. Picture Graph
3. Bar Graph
4. Segmented Bar Graph
5. Circle Graph
________________________________________________________________________
Section 1.3



Read Question – requires the respondent to read information from the table to
determine a solution; important but low-level.
Derive Question – requires some type of computation involving information read
from a table.
Interpret Question – requires an extension, prediction, or inference to read
beyond the data; higher-level thinking.
________________________________________________________________________
Section 1.4

Displays for Quantitative Variables
1. Dot Plot
2. Stem-Leaf Plot
3. Grouped Frequency Table
4. Histogram

Symmetric – a distribution is symmetric if one half is roughly a mirror image of
the other.







Skewed to the right – a distribution is skewed to the right if it tails off toward
larger values.
Skewed to the left – a distribution is skewed to the left if it tails off toward
smaller values.
Outliers – observations that differ markedly from the pattern established by the
vast majority.
Granularity – a distribution has this characteristic if it has values occurring only
at fixed intervals.
Six Features of Data Distribution that are typically of interest –
1. center – to be discussed in section 2.1
2. variability or spread – to be discussed in sections 2.2 and 2.3
3. shape – while the shape may vary, many times the shape may be identified
as symmetric, skewed to the right, or skewed to the left
4. cluster/peaks – peaks or clusters indicate that the data fall into natural
subgroups
5. outliers – if outliers are present, they warrant close examination
6. granularity
Side-by-side Stemplot
o A common set of stems is used in the middle of the display with leaves for
each category branching out in either direction
o Order the leaves from the middle out toward either side
Statistical Tendency
o Pertains to average or typical cases but not necessarily to individual cases
o Ex: Men tend to be taller than women. This does not mean that all men
are taller than all women.
________________________________________________________________________
Section 1.5 Part 1
Response Variable vs. Explanatory Variable




Many times we would like to offer an explanation as to why a person gives a
particular response.
Example: Do you believe that a person who is 50 years old is “old”? A person’s
response to this question can most likely be explained by that person’s age. That
is, someone who is 20 might believe 50 is old. However, someone who is 49 or
60 might not consider 50 as old.
In the above example, there are two variables of interest, namely age and the “do
you believe 50 is old” variable. Since we are thinking that a person’s age might
predict the response to the statement, the variable age is called the explanatory
variable. The “do you believe 50 is old” variable is the response variable.
The response variable is affected or predicted by the explanatory variable.
Two-way Table


This is a table which classifies a person in 2 ways.
Continuing the above example, suppose the following data were collected:
Age    Agree
5      Y
10     Y
20     Y
25     Y
30     Y
35     N
45     N
50     N
60     N
65     N
Here is a two-way table for this data. (Notice: The ages are placed in categories so as
to create a categorical variable.)
            0-25    26-50    51-75
Agree        4        1        0
Disagree     0        3        2
The explanatory variable should be in columns and the response variable in
rows.
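
As a minimal sketch, the counts in the two-way table can be tallied from the ten raw (age, response) pairs listed above. The helper name age_group and the use of Python's collections.Counter are illustrative choices, not part of the original glossary.

```python
from collections import Counter

# Raw data from the example: (age, agrees that 50 is "old")
responses = [(5, "Y"), (10, "Y"), (20, "Y"), (25, "Y"), (30, "Y"),
             (35, "N"), (45, "N"), (50, "N"), (60, "N"), (65, "N")]

def age_group(age):
    """Place an age into one of the three categories used in the table."""
    if age <= 25:
        return "0-25"
    if age <= 50:
        return "26-50"
    return "51-75"

# Count each (age group, response) combination.
two_way = Counter((age_group(age), answer) for age, answer in responses)

for group in ["0-25", "26-50", "51-75"]:
    print(group, "agree:", two_way[(group, "Y")], "disagree:", two_way[(group, "N")])
```

Grouping the ages first is what turns the quantitative variable age into a categorical one, so that each person can be classified in two ways.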
Marginal Distribution


Calculated by finding the proportion of responses in each category
Example: Continuing the above example, the marginal distribution for the age
variable is 4/10 = .4 (there are 4 people in the 0-25 age category out of
10 people total), 4/10 = .4 (there are 4 people in the 26-50 age category out of
10 people total), and 2/10 = .2 (there are 2 people in the 51-75 age category out of
10 people total).
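
The same arithmetic can be sketched in a few lines of Python. The variable names here (age_counts, marginal) are illustrative; the counts are the age-group totals from the example.

```python
# Number of respondents in each age category (totals from the two-way table).
age_counts = {"0-25": 4, "26-50": 4, "51-75": 2}
total = sum(age_counts.values())  # 10 people in all

# Marginal distribution: each category's share of the total.
marginal = {group: count / total for group, count in age_counts.items()}
print(marginal)
```

The proportions in a marginal distribution always add up to 1, which is a quick check on the arithmetic.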
Conditional Distribution


Distribution of one variable for given categories of the other variable.
From the above example, the proportion of “middle-aged” (26-50) respondents who agree
is 1/4 = .25 (one agrees out of the total of 4 people in that age group).
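
A short sketch of the same calculation, assuming the counts for the 26-50 group from the example (1 agrees, 3 disagree). The variable names are illustrative.

```python
# Counts for the 26-50 ("middle-aged") group from the two-way table.
middle_aged = {"agree": 1, "disagree": 3}
group_total = sum(middle_aged.values())  # 4 people in this age group

# Conditional distribution of the response, given the 26-50 age group.
conditional = {answer: count / group_total for answer, count in middle_aged.items()}
print(conditional)
```

The key difference from a marginal distribution is the denominator: it is the total for the one age group (4), not the total for everyone (10).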
Segmented Bar Graphs



Visual display for conditional distributions.
Each rectangle has a height of 100%.
Each rectangle is divided into segments whose lengths correspond to the
conditional proportions.
________________________________________________________________________
Section 2.1
Three Measures of Center
1. Mean – the arithmetic average—The mean is found by adding up the values of
the observations and dividing by the number of observations.
Example: Let 5, 10, 8, 7, 4 be the data set. To find the mean, add these numbers (5
+ 10 + 8 + 7 + 4 = 34) and divide by how many numbers there were in the set
(34/5 = 6.8). The mean for this data set is 6.8.
The mean can be thought of as the “balance point” of the distribution. Also, the
mean can be calculated only on quantitative variables.
2. Median – the middle observation when the observations are listed in order.
To find the median:
o Arrange the values in order
o If there are an odd number of values, the median is in the (n + 1)/2
position.
o If there are an even number of values, the median is the average of the
values in the n/2 and (n/2) + 1 positions.
Example: Let 5, 10, 8, 7, 4 be the data set. To find the median, we must first list
these numbers in order—4, 5, 7, 8, 10. Since there are 5 numbers in the set (an
odd number) the median is the middle number, in this case 7. Sometimes a set is
very large so it is easier to figure out which numbered position the median is in.
If so, use the formula (n + 1)/2 to find the position number. For this example, n
would be 5. Using the formula, (5 + 1)/2 gives us 3. If you look in the third
position, the median is 7.
Example: Let 5, 10, 8, 7, 4, 12 be the data set. To find the median we must first
list these numbers in order—4, 5, 7, 8, 10, 12. Since there are 6 numbers in the
set (an even number) the median is the average of the two middle numbers, in this
case the average of 7 and 8 is 7.5. If a data set is very large, it may be beneficial
to use the formulas n/2 and (n/2) + 1 to find the two numbers that you must
average to get the median. In this example, n is 6. Using the formulas we get 6/2
= 3 and (6/2) + 1 = 4. We need to average the numbers in the third position (7)
and in the fourth position (8). If you average 7 and 8 you get 7.5.
3. Mode – the most common value; the value that occurs most frequently.
Example: Let 5, 7, 3, 4, 4, 1 be the data set. The mode is 4 simply because it
occurs twice and the other values occur only once. Suppose the data set had been
orange, blue, orange, blue, orange, blue, red, black, green, red. The mode here
would be both orange and blue since each of these occurred the most (three times
each).
The mode applies to all categorical variables but is only useful with some
quantitative variables.
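
The three measures of center can be sketched in Python using the example data sets above. The function names mean and median are written out here for illustration and follow the position rules given for the median; multimode comes from Python's standard statistics module (Python 3.8 or later) and returns every value tied for most frequent.

```python
from statistics import multimode

def mean(values):
    """Add up the observations and divide by how many there are."""
    return sum(values) / len(values)

def median(values):
    """Median by the position rules above (positions are counted from 1)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the value in position (n + 1)/2.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average the values in positions n/2 and (n/2) + 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

colors = ["orange", "blue", "orange", "blue", "orange",
          "blue", "red", "black", "green", "red"]

print(mean([5, 10, 8, 7, 4]))          # 6.8
print(median([5, 10, 8, 7, 4]))        # 7
print(median([5, 10, 8, 7, 4, 12]))    # 7.5
print(multimode([5, 7, 3, 4, 4, 1]))   # [4]
print(multimode(colors))               # ['orange', 'blue']
```

Note that the mean and median calls only make sense for the numeric data, while the mode works for the list of colors as well.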
Sample Size – the number of observations in the data set; the variable n usually denotes
the sample size.
The relationship of the mean and the median –



Symmetric distribution – the mean is close to the median
Skewed right distribution – the mean is typically greater than the median
Skewed left distribution – the mean is typically less than the median
Resistant – a measure whose value is relatively unaffected by the presence of outliers.
Note: Measures of center are often important, but they do not summarize all aspects of a
distribution.
________________________________________________________________________
Section 2.2
Range



A measure of variability
Simple but not very useful
Maximum value minus the minimum value
Inter-quartile Range (IQR)



A measure of variability
It is the upper quartile minus the lower quartile
The range of the middle 50% of the data
Lower Quartile



25th percentile
The value such that 25% of the observations fall below that value and 75% of the
observations fall above the value
To find the lower quartile
1. Find the median for the entire data set. (This number divides the set into
two halves.)
2. Find the median for the portion of the data set that falls below the actual
median (which was found in step 1). This is your Lower Quartile. (By
dividing the bottom half of the data set in half, you have found the
quarters of the entire data set.)
3. Note: If there are an odd number of observations in the original data set,
the actual median is not included in the bottom half when finding the
lower quartile.
Upper Quartile



75th percentile
The value such that 75% of the observations fall below that value and 25% of the
observations fall above the value.
To find the upper quartile
1. Find the median for the entire data set. (This number divides the set into
two halves.)
2. Find the median for the portion of the data set that falls above the actual
median (which was found in step 1). This is your Upper Quartile. (By
dividing the upper half of the data set in half, you have found the quarters
of the entire data set.)
3. Note: If there are an odd number of observations in the original data set,
the actual median is not included in the upper half when finding the upper
quartile.
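
The quartile steps above can be sketched as follows. This follows the glossary's rule of leaving the overall median out of both halves when the number of observations is odd; other textbooks and software may use slightly different rules. The function name quartiles and the sample data are illustrative, and the sketch assumes at least two observations.

```python
from statistics import median

def quartiles(values):
    """Lower and upper quartiles as medians of the lower and upper halves.

    When the number of observations is odd, the overall median is left
    out of both halves. Assumes at least two observations.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[len(ordered) - half:]
    return median(lower_half), median(upper_half)

# Illustrative data (not from the glossary): 7 observations, median is 8.
data = [4, 5, 7, 8, 10, 12, 15]
lower_q, upper_q = quartiles(data)
print(lower_q, upper_q)            # 5 12
print("IQR:", upper_q - lower_q)   # IQR: 7
```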
Five-number summary


Provides a quick, convenient description of where the four quarters of the data fall
Includes the minimum value, the lower quartile, the median, the upper quartile,
and the maximum value.
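
As a rough sketch, the five-number summary can be collected in one function. The quartiles are found as medians of the lower and upper halves, with the overall median excluded when the number of observations is odd. The function name and the sample data are illustrative, and the sketch assumes at least two observations.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_q = median(ordered[:half])
    upper_q = median(ordered[len(ordered) - half:])
    return ordered[0], lower_q, median(ordered), upper_q, ordered[-1]

print(five_number_summary([5, 10, 8, 7, 4, 12]))  # (4, 5, 7.5, 10, 12)
```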
Boxplot





A visual display which is based on the 5-number summary.
Draw a box between the quartiles. This box demonstrates where the middle 50%
of the data fall.
Draw horizontal lines (or whiskers) that extend from the left and right sides of the
box to the minimum and maximum, respectively.
Mark the median with a vertical line inside the box.
One weakness of box plots – the effect of an outlier
Modified Boxplots



Outliers are marked with symbols.
“Whiskers” extend to the most extreme, nonoutlying value.
Rule for identifying outliers: outliers are observations lying more than 1.5 times
the IQR away from the nearer quartile.
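
The 1.5 times IQR rule can be sketched as a small function that takes the data and the two quartiles. The function name find_outliers and the sample data are illustrative and are not from the glossary.

```python
def find_outliers(values, lower_q, upper_q):
    """Return observations more than 1.5 * IQR beyond the nearer quartile."""
    iqr = upper_q - lower_q
    low_cutoff = lower_q - 1.5 * iqr
    high_cutoff = upper_q + 1.5 * iqr
    return [v for v in values if v < low_cutoff or v > high_cutoff]

# Illustrative data with lower quartile 5 and upper quartile 12 (IQR = 7),
# so the cutoffs are 5 - 10.5 = -5.5 and 12 + 10.5 = 22.5.
data = [4, 5, 7, 8, 10, 12, 40]
print(find_outliers(data, lower_q=5, upper_q=12))  # [40]
```

In a modified boxplot for this data, 40 would be marked with a symbol and the right whisker would stop at 12, the most extreme value that is not an outlier.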
________________________________________________________________________
Section 2.3
Standard Deviation

A widely used measure of variability.


To compute:
1. Calculate the difference between each observation and the mean.
2. Square each of these differences.
3. Add these squares.
4. Divide this sum by n-1.
5. Take the square root.
Denoted by s
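
The five steps can be followed directly in code. This is a minimal sketch using the data set from the Section 2.1 mean example (5, 10, 8, 7, 4); the function name sample_std_dev is illustrative.

```python
import math

def sample_std_dev(values):
    """Standard deviation s, computed by the five steps listed above."""
    n = len(values)
    mean = sum(values) / n
    squared_diffs = [(x - mean) ** 2 for x in values]  # steps 1 and 2
    total = sum(squared_diffs)                         # step 3
    return math.sqrt(total / (n - 1))                  # steps 4 and 5

s = sample_std_dev([5, 10, 8, 7, 4])
print(round(s, 3))
```

For this data the mean is 6.8, the squared differences add up to 22.8, and 22.8 divided by n - 1 = 4 is 5.7, so s is the square root of 5.7.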
Empirical Rule


With mound-shaped data
o About 68% of the observations fall within 1 standard deviation of the
mean.
o About 95% of the observations fall within 2 standard deviations of the
mean.
o Virtually all observations fall within 3 standard deviations of the mean
This is not necessarily true for distributions of other shapes.
z-score or standardized score




Useful for comparing individual scores from different distributions.
To calculate a z-score
1. Subtract the mean from the value of interest.
2. Divide by the standard deviation.
The z-score indicates how many standard deviations above or below the mean a
particular value falls.
It should only be used when working with mound-shaped distributions.
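
A small sketch of the calculation, using made-up test scores to compare two values from different distributions. The function name z_score and all of the numbers are hypothetical.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

# Hypothetical scores: 85 on a test with mean 75 and s = 5,
# and 90 on a different test with mean 82 and s = 8.
print(z_score(85, 75, 5))  # 2.0
print(z_score(90, 82, 8))  # 1.0
```

Although 90 is the higher raw score, the 85 is the more impressive result relative to its own distribution because it sits 2 standard deviations above the mean rather than 1.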
Note: A common misconception about variability is to believe that a “bumpier”
histogram indicates a more variable distribution, but this is not the case. Similarly, the
number of distinct values represented in a histogram does not necessarily indicate greater
variability.
________________________________________________________________________
Section 1.5 Part 2
Scatterplot




A scatterplot is similar to a dot plot except that it displays two quantitative
variables simultaneously.
The vertical axis represents one variable and the horizontal axis represents the
other.
A dot represents an observational pair.
Generally, the response variable is on the vertical axis and the explanatory
variable is on the horizontal axis.

For example, I believe that if I know your foot length, then I can tell you your
height. The variable foot length is predicting the variable height. Foot length is
the explanatory variable and should be on the horizontal axis. Height is the
response variable and should be on the vertical axis.
Positive Association


Two variables are positively associated if larger values of one variable tend to
occur with larger values of the other variable.
For example, consider the variables “number of hours worked” and “money
earned.” One would assume that if a large number of hours are worked, then a
large amount of money is earned. Therefore, these two variables are positively
associated.
Negative Association


Two variables are negatively associated if larger values of one variable tend to
occur with smaller values of the other.
For example, consider the variables “the number of days absent from class” and
“grade in class.” Generally, someone with a high number of absences will have a
lower grade in the class. These two variables are negatively associated.
Correlation Coefficient







The letter r is used to denote the correlation coefficient.
The correlation coefficient is a measure of the degree to which two variables are
associated.
The value of the correlation coefficient ranges from -1 to +1.
If the correlation coefficient equals +1 or -1, then the observations form a
perfectly straight line.
The sign of the correlation coefficient reflects the direction of the association.
That is, if r is positive then the two variables are positively associated. If r is
negative, then the two variables are negatively associated.
Values of r that are closer to +1 or -1 indicate stronger associations. Therefore,
the correlation coefficient indicates the magnitude or strength of the correlation.
The correlation coefficient only measures linear relationships between two
variables. Therefore, it is always important to look at the scatterplot when
interpreting r.
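
The glossary does not give a formula for r. The sketch below assumes the standard (Pearson) formula found in most introductory textbooks: the sum of the products of the deviations from each mean, divided by the square root of the product of the two sums of squared deviations. The function name correlation and both data sets are hypothetical.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r for paired observations (standard Pearson formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(spread_x * spread_y)

# Hypothetical data: hours worked and money earned at a fixed hourly wage.
hours = [1, 2, 3, 4, 5]
earned = [10, 20, 30, 40, 50]
print(correlation(hours, earned))  # 1.0

# A perfect curved (not linear) relationship: y is x squared.
xs = [-2, -1, 0, 1, 2]
ys = [4, 1, 0, 1, 4]
print(correlation(xs, ys))  # 0.0
```

The second data set shows why the scatterplot matters: the two variables are perfectly related by a curve, yet r is 0 because r only measures linear relationships.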
Association vs. Causation


Two variables may be strongly associated without a cause-and-effect relationship.
Often, if two variables are associated but a cause-and-effect relationship is not
apparent, then it is likely that the two variables are related to a third variable that
is not being measured. This third variable is called a lurking variable or a
confounding variable.