STATISTICS Vocabulary
1. Area principle: In a statistical display, each data value should be represented by the same amount of
area.
2. Attack of the Logarithms: Used when the Ladder of Powers does not produce a linear relationship. When
none of the data values is zero or negative, logarithms can be a helpful ally in the search for a useful
model.
• Exponential model: x-axis x, y-axis log(y). This model is the "0" power in the Ladder approach, useful for values that grow by percentage increases.
• Logarithmic model: x-axis log(x), y-axis y. A wide range of x-values, or a scatterplot descending rapidly at the left but leveling off toward the right, may benefit from trying this model.
• Power model: x-axis log(x), y-axis log(y). The Goldilocks model: when one of the Ladder's powers is too big and the next is too small, this one may be just right.
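As an illustration of the Exponential model row above, here is a minimal Python sketch (with made-up data) that fits a line to log(y) against x and back-transforms the fit to the original scale:

```python
import math

# Hypothetical data following an exact exponential pattern: y = 2 * 3**x
xs = [0, 1, 2, 3, 4]
ys = [2 * 3 ** x for x in xs]

# Re-express: work with log(y) against x (the Exponential model row)
log_ys = [math.log10(y) for y in ys]

# Ordinary least squares on the re-expressed data
n = len(xs)
mean_x = sum(xs) / n
mean_ly = sum(log_ys) / n
slope = sum((x - mean_x) * (ly - mean_ly) for x, ly in zip(xs, log_ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_ly - slope * mean_x

def predict(x):
    """Back-transform the linear fit on log(y) to the original y scale."""
    return 10 ** (intercept + slope * x)
```

Because the fit is linear in log(y), the slope recovers log10 of the growth factor (here log10(3)) and 10 raised to the intercept recovers the starting value (here 2).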
3. Association:
• Direction: A positive direction or association means that, in general, as one variable increases, so does the other (the graph goes up). When increases in one variable generally correspond to decreases in the other, the association is negative (the graph goes down).
• Form: The form we care about most is straight, but you should certainly describe other patterns you see in scatterplots, such as clumping or sparseness of the data.
• Strength: A scatterplot shows a strong association if there is little scatter around the underlying relationship, and a weak association if the points scatter widely around it.
4. Bar Chart: Bar charts show a bar representing the count of each category in a categorical variable.
5. Bins: To display a quantitative variable in a histogram, the data are sliced into equal-width intervals
called bins. The bin counts give the heights of the bars in the histogram.
6. Boxplot: A boxplot displays the 5-number summary as a central box with whiskers that extend to the
non-outlying data values (minimum on the left, maximum on the right). Boxplots are particularly
effective for comparing groups. A boxplot reveals some features of a distribution not easily seen in a
histogram: the center, the middle 50%, and outliers. It is not as good at showing the whole shape, so
use it with a supporting histogram.
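The numbers a boxplot displays can be computed directly. A minimal sketch with made-up data, using the median-of-halves rule for the quartiles (other software may use slightly different quartile conventions):

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max).

    Quartiles are medians of the lower and upper halves of the sorted data
    (the median itself is excluded when the count is odd).
    """
    s = sorted(data)
    n = len(s)
    lower = s[: n // 2]          # values below the median position
    upper = s[(n + 1) // 2:]     # values above the median position
    return s[0], median(lower), median(s), median(upper), s[-1]

summary = five_number_summary([1, 3, 5, 7, 9, 11, 13])  # (1, 3, 7, 11, 13)
```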
7. Case: A case is an individual about whom or which we have data.
8. Center: A value that attempts the impossible by summarizing the entire distribution with a single
"typical" number: usually the mean (the average), the median (the middle value), or the mode (the
most frequent value).
9. Changing Center and Spread: Changing the center and spread of a variable is equivalent to
changing its units.
10. Comparing Boxplots: When comparing groups with boxplots:
• Compare the medians: which group has the higher center?
• Compare the IQRs: which group is more spread out?
• Judged by the size of the IQRs, are the medians very different or similar?
• Check for possible outliers; identify them if you can and suggest why they may have occurred. Show both boxplots, with and without the outliers, to justify your analysis.
11. Conditional Distribution: The distribution of a variable restricting the Who to consider only a
smaller group of individuals.
12. Categorical Variable: A variable that names categories (whether with words or numbers).
13. Context: The context ideally tells us the W's: who was measured, what was measured, how the data
were collected, where and when the data were collected, and why the study was performed.
14. Contingency Table: A contingency table displays counts and, sometimes, percentages of individuals
falling into named categories on 2 or more variables. The table categorizes the individuals on all
variables at once, to reveal possible patterns in one variable that may be contingent on the category of
the other.
15. Continuous Data: Variables that can take on an infinite number of possible values; the values may be
fractions, decimals, or mixed numbers as well as integers.
16. Correlation: Correlation is a numerical measure of the direction and strength of a linear association
in a scatterplot.
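A minimal sketch of computing r from its definition (the average product of z-scores, dividing by n − 1), using made-up values:

```python
import math

def correlation(xs, ys):
    """Pearson correlation r: sum of products of deviations, scaled by
    (n - 1) times both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)

r = correlation([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear and positive
```

For these perfectly linear points r comes out to 1; a perfectly decreasing set gives −1.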
17. CUSS & BS: Description required to answer questions about the center, unusual (gaps and outliers),
shape, spread and be specific about each of these.
18. Data: systematically recorded information, whether numbers or labels, together with its context.
Data values, no matter what kind, are useless without their context. Newspaper journalists know that
the lead paragraph of a good story should establish the “Five W’s”: Who, what, when, where, and (if
possible) why. Often, we add How to the list as well. Answering these questions can provide the
context for the data values. If you can't find answers to the Who and What, then you don't have data,
and you don't have any useful information.
19. Data Table: An arrangement of data in which each row represents a case and each column represents
a variable.
20. Deviations: The difference of a data value from the mean. Squaring these values and adding them
together is needed to find the variance.
21. Discrete Data: Data collected for a variable that can only take on a finite number of values, typically
integers. All qualitative variables are discrete. You can't have partial or fractional parts of something
counted whole (you can't have 3.7 people; you can have 3 whole people or 4 whole people), so knowing
what is represented is important.
22. Distribution: The distribution of a variable gives the possible values of the variable and the relative
frequency of each value.
23. Dot Plot: A dot plot graphs a dot for each case against a single axis.
24. Explanatory Variable: In a scatterplot, you must choose a role for each variable: a predictor and a
response. The explanatory (predictor) variable accounts for, explains, or is responsible for changes in
the y-variable. The explanatory variable is put on the x-axis.
25. Extrapolation: Although linear models provide an easy way to predict values of y for a given value of
x, it is unsafe to predict for values of x far from the ones used to find the linear model equation. Such
extrapolation may pretend to see into the future, but the predictions should not be trusted.
Extrapolation far from the mean can lead to silly and useless predictions.
26. Five Number Summary: A 5-number summary consists of the minimum and maximum, the quartiles
(Q1 and Q3), and the median. It is used to describe data and to assess skewness.
27. Frequency Table (Relative Frequency Table): A frequency table lists the categories in a categorical
variable and gives the count of observations for each category; a relative frequency table gives the
percentage instead.
28. Goals of Re-expression:
• Goal 1: Make the distribution of a variable (as seen in its histogram, for example) more symmetric. It's easier to summarize the center of a symmetric distribution, and for nearly symmetric distributions we can use the mean and standard deviation. If the distribution is unimodal, the re-expressed distribution may be closer to the Normal model, allowing us to use the 68-95-99.7 Rule. Skewed distributions are often made much more symmetric by taking logs.
• Goal 2: Make the spread of several groups (as seen in side-by-side boxplots) more alike, even if their centers differ. Groups that share a common spread are easier to compare. Taking logs often makes the individual boxplots more symmetric and gives them spreads that are more nearly equal.
• Goal 3: Make the form of a scatterplot more nearly linear. Linear scatterplots are easier to describe. The great value of re-expressing to straighten a relationship is that we can fit a linear model once the relationship is nearly straight. If taking logs leaves it a little bent, try the square power.
• Goal 4: Make the scatter in a scatterplot spread out evenly rather than following a fan shape. An even scatter is a condition of many methods of Statistics. This goal is closely related to Goal 2, and it often comes along with Goal 3.
29. Histogram: A histogram uses adjacent bars to show the distribution of values in a quantitative
variable. Each bar represents the frequency of values falling in an interval of values. Histograms show
the shape of a data set well.
30. Homogeneous: When something is homogeneous, it is made up of things (people, events, objects,
etc.) that are similar to each other: all the same category. Homogeneous data are drawn from a single
population. In other words, all outside processes that could potentially affect the data must remain
constant for the complete time period of the sample.
• Of the same or similar nature or kind.
• Uniform in structure or composition throughout.
• Consisting of terms of the same degree or elements of the same dimension.
31. Independence: Variables are said to be independent if the conditional distribution of one variable is
the same for each category of the other. To claim independence, you need to show that these
conditional distributions match.
32. Interquartile Range (IQR): The IQR is the difference between the first and the third quartiles.
IQR = Q3 – Q1.
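Python's standard library can produce the quartiles directly; note that quartile conventions differ slightly between methods and calculators, so this is one reasonable choice, shown on made-up data:

```python
from statistics import quantiles

data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15]

# "inclusive" interpolates between order statistics; other methods differ slightly
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # IQR = Q3 - Q1
```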
33. Intercept: The intercept, b (in a linear regression) gives a starting value in y-units. It’s the y-value
when x = 0.
34. Influential Point: If omitting a point from the data results in a very different regression model, then
that point is called an influential point.
35. Ladder of Powers: The Ladder of Powers places in order the effects that many re-expressions have on
the data.
• Power 2 (the square of the data values, y²): Try this for unimodal distributions that are skewed to the left.
• Power 1 (the raw data, no change at all): This is home base; the farther you step from here up or down the ladder, the greater the effect. Data that can take on both positive and negative values with no bounds are less likely to benefit from re-expression.
• Power ½ (the square root of the data values, √y): Counts often benefit from a square root re-expression. For counted data, start here.
• Power "0" (the logarithm): Although mathematicians define the "0th" power differently, for us the place is held by the logarithm. You may feel uneasy about logarithms; don't worry, the computer or calculator does all the work. Measurements that cannot be negative, and especially values that grow by percentage increases such as salaries or populations, often benefit from a log re-expression. When in doubt, start here. If your data have zeros, try adding a small constant to all values before finding the logs.
• Power −1/2 (the negative reciprocal square root, −1/√y): An uncommon re-expression, but sometimes useful. Changing the sign to take the negative of the reciprocal square root preserves the direction of the relationships, which can be a bit simpler.
• Power −1 (the negative reciprocal, −1/y): Ratios of two quantities (miles per hour, for example) often benefit from a reciprocal. (You have about a 50-50 chance that the original ratio was taken in the "wrong" order for simple statistical analysis and would benefit from re-expression.) Often, the reciprocal will have simple units (hours per mile). Change the sign if you want to preserve the direction of the relationships. If your data have zeros, try adding a small constant to all values before finding the reciprocal.
36. Least Squares: The least squares criterion specifies the unique line that minimizes the variance of
the residuals or, equivalently, the sum of the squared residuals.
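The least squares line can be computed in closed form. A minimal sketch with hypothetical data that lie exactly on y = 1 + 2x:

```python
def least_squares(xs, ys):
    """Slope b1 and intercept b0 minimizing the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx  # the line passes through (x-bar, y-bar)
    return b0, b1

b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])  # data on y = 1 + 2x
```

Because the data are exactly linear here, the residuals are all zero and the fitted line recovers the intercept 1 and slope 2.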
37. Leverage: Data points whose x-values are far from the mean of x are said to exert leverage on a linear
model. High leverage points pull the line close to them, and so they can have a large effect on the line,
sometimes completely determining the slope and intercept. With high enough leverage, their residuals
can appear to be deceptively small.
38. Linear Model: A linear model is an equation of the form y = ax + b. To interpret a linear model we
need to know the variables (along with their W’s) and their units.
39. Lurking Variable: A variable other than x and y that simultaneously affects both variables,
accounting for the correlation between the two.
• A variable that is not explicitly part of a model but affects the way the variables in the model appear to be related is called a lurking variable.
• Because we can never be certain that observational data are not hiding a lurking variable that influences both x and y, it is never safe to conclude that a linear model demonstrates a causal relationship, no matter how strong the linear association.
40. Marginal Distribution: In a contingency table, the distribution of either variable alone is called the
marginal distribution. The counts or percentages are the totals found in the margins (the last row or
column of the table).
41. Mean: The average of the data, found by summing all the data values and dividing by the count. The
mean is the balance point of a histogram.
42. Median: A middle value with half of the data above and half below it.
43. Mode: The value that occurs most often in the data.
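The three measures of center in entries 41-43 are one-liners with Python's statistics module (the data values here are made up):

```python
from statistics import mean, median, mode

data = [2, 3, 3, 5, 7, 10]

center_mean = mean(data)      # sum 30 over 6 values: 5
center_median = median(data)  # average of the middle pair (3 and 5): 4
center_mode = mode(data)      # 3 appears most often
```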
44. Model: An equation or formula that simplifies and represents reality.
45. Normal Model: A useful family of models for unimodal, symmetric distributions. Provides a useful
way to understand data. We can decide whether a Normal model is appropriate by checking the Nearly
Normal Condition with a histogram or Normal probability plot. Normal models follow the 68-95-99.7
Rule, and we can use technology or tables for a more detailed analysis.
46. Normal Percentile: The Normal percentile corresponding to a z-score gives the percentage of values
in a standard Normal distribution found at the z-score or below.
47. Normal Probability Plot: A display to help assess whether a distribution of data is approximately
Normal. If the plot is nearly straight, the data satisfy the Nearly Normal Condition.
48. Outliers: Outliers are extreme values that don't appear to belong with the rest of the data. They may
be unusual values that deserve further investigation, or just mistakes; there's no obvious way to tell.
• Don't delete outliers automatically; you have to think about them. Outliers can affect many statistical analyses, so you should always be alert for them. When removing an outlier to better understand the data, you MUST explain your reasons for doing so.
• A point that does not fit the overall pattern seen in graphs or scatterplots.
• Any data point that stands away from the others can be called an outlier. In regression, outliers can be extraordinary in 2 ways: by having a large residual or by having high leverage.
49. Outlier Condition: Points with large residuals or high leverage (especially both) can influence the
regression model significantly. It's a good idea to perform the regression analysis both with and
without such points to see their impact.
50. Parameter: A numerically valued attribute of a model. For example, the values of µ (mean) and
σ (standard deviation) in a N(µ, σ) model are parameters.
51. Percentile: The ith percentile is the value below which i% of the data fall.
52. Pie Chart: Pie charts show how a “whole” divides into categories by showing a wedge of a circle
whose area corresponds to the proportion in each category, usually by percentages.
53. Predicted Values: The value of ŷ found for each x-value in the data. A predicted value is found by
substituting the x-value in the regression equation. The predicted values are the values on the fitted
line; the points (x, ŷ) all lie exactly on the fitted line.
54. Qualitative Variable: A qualitative variable cannot be measured numerically; it describes quality,
not quantity. Data such as hair color, eye color, favorite music, favorite TV show, or favorite movie
have no numerical meaning and cannot be ordered numerically, so these categories contain
qualitative data.
55. Quantitative Variable: A variable in which the numbers act as numerical values. Quantitative
variables ALWAYS have units. A quantitative variable can be measured on an ordinal, interval, or
ratio scale.
56. Quartile: The lower quartile (Q1) is the value with a quarter of the data below it and 75% of the data
above it. The upper quartile (Q3) has a quarter of the data above it and 75% of the data below it. The
median and quartiles divide the data into four equal parts.
57. r: The correlation, r, tells us about the regression:
• The slope of the line is based on the correlation, adjusted for the units of x and y. We've learned to interpret the slope in context.
• For each standard deviation in x that we are away from the x mean, we expect to be r standard deviations in y away from the y mean.
• Because r is always between −1 and +1, each predicted y is fewer standard deviations from its mean than the corresponding x was, a phenomenon called regression to the mean.
• The square of the correlation coefficient, R², gives us the fraction of the variation of the response accounted for by the regression model. The remaining 1 − R² of the variation is left in the residuals.
58. R²: Correlation squared.
• R² is the square of the correlation between y and x.
• R² gives the fraction of the variability of y accounted for by the least squares linear regression on x.
• R² is an overall measure of how successful the regression is in linearly relating y to x.
• Even an R² near 100% doesn't indicate that x caused y (or the other way around). Watch out for lurking variables that may affect both x and y.
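A small sketch (with hypothetical data) confirming that the squared correlation equals the fraction of variability accounted for by the least squares line, 1 − SS_residual/SS_total:

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]  # made-up data

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

r = sxy / math.sqrt(sxx * syy)  # correlation

# Least squares fit and its predicted values
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x
preds = [b0 + b1 * x for x in xs]

ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))  # variation left over
ss_tot = syy                                           # total variation in y
r_squared = 1 - ss_res / ss_tot
```

For this data set r = 0.9, and the variance-explained computation gives the same 0.81.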
59. Range: The difference between the lowest (minimum) and the highest (maximum) values in a data
set. Range = maximum – minimum.
60. Re-express Data: We re-express data by taking the logarithm, the square root, the reciprocal, or
some other mathematical operation on all values in the data set to transform the data. We've learned
that when seeking a useful re-expression, taking logs is often a good, simple starting point. To search
further, the Ladder of Powers or the log-log approach can help us find a good re-expression. We've
come to understand that our models won't be perfect, but that re-expression can lead us to a useful
model. We've learned that when the conditions for regression are not met, a simple re-expression of
the data may help. There are several reasons to consider a re-expression:
• To make the distribution of a variable more symmetric.
• To make the spread across different groups more similar.
• To make the form of a scatterplot straighter.
• To make the scatter around the line in a scatterplot more consistent.
61. Good Regression: A good regression is nearly linear, and we've learned that even a good regression
doesn't mean we should believe the model completely:
• Extrapolation far from the mean can lead to silly and useless predictions.
• Even an R² near 100% doesn't indicate that x caused y (or the other way around). Watch out for lurking variables that may affect both x and y.
• Watch out for regressions based on summaries of the data sets. These regressions tend to look stronger than the regression on the original data.
62. Regression Line (Line of Best Fit): The particular linear equation ŷ = b₀ + b₁x that satisfies the least
squares criterion is called the least squares regression line. Casually, we often just call it the regression
line, or the line of best fit. (On a TI calculator, enter the scatterplot data in lists L1 and L2, then choose
STAT → CALC → 4 to compute the line.)
63. Regression to the mean: Because the correlation is always less than 1 in magnitude, each
predicted ŷ tends to be fewer standard deviations from its mean than its corresponding x was from its
mean. This is called regression to the mean.
64. Relative Frequency Histogram: A histogram that replaces the counts on the vertical axis with the
percentage of the total number of cases falling in each bin.
65. Rescaling: Multiplying each data value by a constant multiplies (changes) both the measures of
position (mean, median, and quartiles) and the measure of spread (standard deviation and IQR) by that
constant.
66. Residuals: Residuals are the differences between data values and the corresponding values predicted
by the regression model, or, more generally, values predicted by any model. The residuals also
reveal how well the least squares regression model works: if a plot of residuals against predicted
values shows a pattern, we should re-examine the data to see why. The standard deviation of the
residuals, s_e, quantifies the amount of scatter around the line.
Residual = observed value − predicted value.
67. Response Variable: The variable that you hope to predict or explain; it is put on the y-axis of a
scatterplot.
68. Scatterplots: A scatterplot shows the relationship between 2 quantitative variables measured on the
same cases. Describe the direction of the association (positive or negative), the form it takes
(e.g., linear), and its strength (the r value).
69. Shape: To describe the shape of a distribution by looking for single vs. multiple modes (unimodal,
bimodal and multi-modal) and symmetry vs. skewness, or length of tails.
70. Shifting: Adding a constant to each data value adds the same constant to the mean, the median, and
the quartiles (it affects measures of center and position by that constant), but does not change the
standard deviation or IQR (measures of spread).
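The shifting and rescaling rules (entries 65 and 70) are easy to verify numerically; a quick sketch with arbitrary made-up data:

```python
from statistics import mean, stdev

data = [10, 20, 30, 40]

shifted = [x + 5 for x in data]  # shifting: center moves by 5, spread unchanged
scaled = [x * 2 for x in data]   # rescaling: center and spread both double
```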
71. Simpson’s Paradox: When averages are taken across different groups, they can appear to
contradict the overall averages.
72. 68-95-99.7 Rule: In a Normal model , about 68% of values fall within 1 standard deviation of the
mean, about 95% fall within 2 standard deviations of the mean, and about 99.7% fall within 3 standard
deviations of the mean.
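The rule can be checked against the exact Normal curve using the error function from Python's math module, via the standard identity Φ(z) = (1 + erf(z/√2))/2:

```python
import math

def normal_cdf(z):
    """P(Z <= z) for the standard Normal model."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def within(k):
    """Proportion of a Normal model within k standard deviations of the mean."""
    return normal_cdf(k) - normal_cdf(-k)

# within(1), within(2), within(3) are about 0.683, 0.954, and 0.997
```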
73. Skewed: A distribution is skewed if it’s not symmetric and one tail stretches out farther than the
other. Distributions are said to be skewed left when the longer tail stretches to the left, and skewed
right when it stretches to the right.
74. Slope: The slope gives a value in "y-units per x-unit." Changes of one unit in x are associated with
changes of b₁ units in the predicted value of y. As rise over run, slope = (y₂ − y₁)/(x₂ − x₁).
75. Spread: A numerical summary of how tightly the values are clustered around the center, such as the
range (minimum to maximum), the standard deviation, or the interquartile range from the 5-number
summary.
76. Standard Deviation: The standard deviation is the square root of the variance; it takes into
account how far each value is from the mean. Like the mean, the standard deviation is appropriate
only for symmetric data.
77. Standard Normal Distribution: The Normal distribution N(0, 1), with mean 0 and standard
deviation 1; its density curve is centered at 0.
78. Standardized Value: A value found by subtracting the mean and dividing by the standard deviation.
This is often called a z-score: z = (x − µ)/σ.
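The z-score formula as a one-line function (the values passed in are illustrative only):

```python
def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean mu."""
    return (x - mu) / sigma

z = z_score(85, 70, 10)  # 1.5 standard deviations above the mean
```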
79. Standardizing: We standardize to eliminate units. Standardized values can be compared and
combined even if the original variable had different units and magnitudes. Used especially with density
curves. Standardizing uses the standard deviation as a ruler to measure distance from the mean,
creating z-scores.
80. Statistic: A value calculated from data to summarize aspects of the data. For example, the mean, ȳ,
and the standard deviation, s, are statistics.
81. Stem-and-Leaf Display: Shows quantitative data values in a way that sketches the distribution of
the data. It’s best described in detail by example.
82. Straight Enough Condition: The relationship should be reasonably straight to justify fitting a
regression line. Somewhat paradoxically, it's sometimes easier to see that the relationship is not
straight after fitting the regression model, by examining the residuals.
83. Subset: One unstated condition for fitting a linear model is that the data be homogeneous. If,
instead, the data consist of 2 or more groups that have been thrown together, it is usually best to fit
a different linear model to each group rather than to try to fit a single model to all the data. Displays
of the residuals can often help you find subsets in the data.
84. Symmetric: A distribution is symmetric if the two halves on either side of the center look
approximately like mirror images of each other.
85. Tails: The tails of a distribution are the parts that typically trail off on either side. Distributions can be
characterized as having long tails (if they straggle off for some distance) or short tails (if they don't
straggle off very far). A long tail to the right of the median or mean indicates skew to the right; a
long tail to the left indicates skew to the left.
86. Timeplot: A timeplot displays data that change over time. Often successive values are connected with
lines to show trends and tendencies over time more clearly.
87. Unimodal: Having one mode. This is a useful term for describing the shape of a histogram when it’s
generally mound-shaped. Distributions with 2 modes are called bimodal. Those with more than 2 are
called multimodal.
88. Units: A quantity or amount adopted as a standard of measurement, such as dollars, hours, or grams.
89. Variable: A variable holds information about the same characteristic for many cases.
90. Variance: The variance is the sum of squared deviations from the mean, divided by the count minus
one. This is needed to find the standard deviation for a set of data.
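Entries 76 and 90 translate directly into code; a minimal sketch with made-up values:

```python
import math

def variance(data):
    """Sum of squared deviations from the mean, divided by the count minus one."""
    m = sum(data) / len(data)
    return sum((x - m) ** 2 for x in data) / (len(data) - 1)

def std_dev(data):
    """The standard deviation is the square root of the variance."""
    return math.sqrt(variance(data))

v = variance([2, 4, 4, 4, 5, 5, 7, 9])  # mean is 5; squared deviations sum to 32
```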
91. Z-score: A z-score tells how many standard deviations a value is from the mean; z-scores have a mean
of zero and a standard deviation of one. Standardizing to z-scores lets us compare values measured on
different scales (comparing apples to oranges), and a z-score can identify unusual or surprising values
among data.