Vocabulary STATISTICS 1. Area principle: In a statistical display, each data value should be represented by the same amount of area. 2. Attack of the Logarithms: An approach to try when the Ladder of Powers does not produce a linear relationship. When none of the data values is zero or negative, logarithms can be a helpful ally in the search for a useful model.
Model Name | x-axis | y-axis | Comment
Exponential | x | log(y) | This model is the "0" power in the Ladder approach, useful for values that grow by percentage increases.
Logarithmic | log(x) | y | A wide range of x-values, or a scatterplot descending rapidly at the left but leveling off toward the right, may benefit from trying this model.
Power | log(x) | log(y) | The Goldilocks model: when one of the Ladder's powers is too big and the next is too small, this one may be just right.
3. Association: Direction: A positive direction or association means that, in general, as one variable increases, so does the other (the graph goes up). When increases in one variable generally correspond to decreases in the other, the association is negative (the graph goes down). Form: The form we care about most is straight, but you should certainly describe other patterns you see in scatterplots, such as clumping or sparseness of the data. Strength: A scatterplot is said to show a strong association if there is little scatter around the underlying relationship; a weak association shows a great deal of scatter about the underlying form. 4. Bar Chart: Bar charts show a bar representing the count of each category in a categorical variable. 5. Bins: To display quantitative data in a histogram, the values are sliced up into equal-width intervals (bins). The bin counts give the heights of the bars in a histogram. 6. Boxplot: A boxplot displays the 5-number summary as a central box with whiskers that extend to the non-outlying data values (minimum on the left and maximum on the right). Boxplots are particularly effective for comparing groups.
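The 5-number summary behind a boxplot can be computed with Python's standard library. This is a minimal sketch with made-up data; `statistics.quantiles` (with its default "exclusive" method) supplies the quartiles, and `min`/`max` supply the whisker ends.

```python
import statistics

# Hypothetical sample data, sorted for clarity.
data = [1, 2, 3, 4, 5, 6, 7]

# statistics.quantiles with n=4 returns the three cut points
# [Q1, median, Q3] (default "exclusive" method).
q1, median, q3 = statistics.quantiles(data, n=4)

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_number_summary = (min(data), q1, median, q3, max(data))
print(five_number_summary)  # (1, 2.0, 4.0, 6.0, 7)
```

Note that different tools use slightly different quartile conventions, so the quartiles here may differ a little from a calculator's values on other data sets.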
A boxplot reveals some features of a distribution not easily seen in a histogram: the center, the middle 50%, and outliers. It is not as good at showing the whole shape, so use it with a supporting histogram. 7. Case: A case is an individual about whom or which we have data. 8. Center: A value that attempts the impossible by summarizing the entire distribution with a single number, a "typical" value. Usually given as the mean (average), median (the middle value), or mode (the most frequent value). 9. Changing Center and Spread: Changing the center and spread of a variable is equivalent to changing its units. 10. Comparing Boxplots: When comparing groups with boxplots: Compare the medians; which group has the higher center? Compare the IQRs; which group is more spread out? Judged by the size of the IQRs, are the medians very different or similar? Check for possible outliers, identify them if you can, suggest why or what they may be, and show the boxplots both with and without the outliers to justify your analysis. 11. Conditional Distribution: The distribution of a variable restricting the Who to a smaller group of individuals. 12. Categorical Variable: A variable that names categories (whether with words or numbers). 13. Context: The context ideally tells us the W's: who was measured, what was measured, how the data were collected, where and when the data were collected, and why the study was performed. 14. Contingency Table: A contingency table displays counts and, sometimes, percentages of individuals falling into named categories on 2 or more variables. The table categorizes the individuals on all variables at once, to reveal possible patterns in one variable that may be contingent on the category of the other. 15. Continuous Data: Data from a variable that can take on an infinite number of possible values; fractions, integers, mixed numbers, and decimals can all represent such a variable. 16.
Correlation: Correlation is a numerical measure of the direction and strength of a linear association in a scatterplot. 17. CUSS & BS: The description required to answer questions about a distribution: center, unusual features (gaps and outliers), shape, and spread, and be specific about each of these. 18. Data: Systematically recorded information, whether numbers or labels, together with its context. Data values, no matter what kind, are useless without their context. Newspaper journalists know that the lead paragraph of a good story should establish the "Five W's": who, what, when, where, and (if possible) why. Often, we add how to the list as well. Answering these questions can provide the context for the data values. If you can't find answers to the Who and What, then you don't have data, and you don't have any useful information. 19. Data Table: An arrangement of data in which each row represents a case and each column represents a variable. 20. Deviations: The differences of the data values from the mean. Squaring these values and adding them together is needed to find the variance. 21. Discrete Data: Data collected for a variable that can take on only a finite number of values, such as integers. All qualitative variables are discrete. You can't have partial or fractional parts of something that is a whole (you can't have 3.7 people; you can have 3 whole people or 4 whole people), so knowing what is represented is important. 22. Distribution: The distribution of a variable gives the possible values of the variable and the relative frequency of each value. 23. Dot Plot: A dot plot graphs a dot for each case against a single axis. 24. Explanatory Variable: In a scatterplot, you must choose a role for each variable. The explanatory (predictor) variable accounts for, explains, or is responsible for the response (y) variable, and is put on the x-axis. 25.
Extrapolation: Although linear models provide an easy way to predict values of y for a given value of x, it is unsafe to predict for values of x far from the ones used to find the linear model equation. Such extrapolation may pretend to see into the future, but the predictions should not be trusted. Extrapolation far from the mean can lead to silly and useless predictions. 26. Five Number Summary: A 5-number summary consists of the minimum and maximum, the first and third quartiles (Q1 and Q3), and the median. It is used to describe data and to support claims of skewness. 27. Frequency Table / Relative Frequency Table: A frequency table lists the categories of a categorical variable and gives the count (or percentage) of observations for each category. 28. Goals of Re-expression:
Goal 1: Make the distribution of a variable (as seen in its histogram, for example) more symmetric. It's easier to summarize the center of a symmetric distribution, and for nearly symmetric distributions we can use the mean and standard deviation. If the distribution is unimodal, the resulting distribution may be closer to the Normal model, allowing us to use the 68-95-99.7 Rule. Skewed distributions are often made much more symmetric by taking logs.
Goal 2: Make the spread of several groups (as seen in side-by-side boxplots) more alike, even if their centers differ. Groups that share a common spread are easier to compare. Taking logs makes the individual boxplots more symmetric and gives them spreads that are more nearly equal.
Goal 3: Make the form of a scatterplot more nearly linear. Linear scatterplots are easier to describe. The greater value of re-expressing to straighten a relationship is that we can fit a linear model once the relationship is nearly straight. If taking logs leaves it a little bent, try the square power.
Goal 4: Make the scatter in a scatterplot spread out evenly rather than following a fan shape. Having an even scatter is a condition of many methods of Statistics.
This goal is closely related to Goal 2, but it often comes along with Goal 3. 29. Histogram: A histogram uses adjacent bars to show the distribution of values in a quantitative variable. Each bar represents the frequency of values falling in an interval of values. Histograms show the shape of a set of data well. 30. Homogeneous: When something is homogeneous, it is made up of things (people, events, objects, etc.) that are similar to each other: all of the same category, of the same or similar nature or kind, and uniform in structure or composition throughout. Homogeneous data are drawn from a single population; in other words, all outside processes that could potentially affect the data must remain constant for the complete time period of the sample. 31. Independence: Variables are said to be independent if the conditional distribution of one variable is the same for each category of the other. 32. Interquartile Range (IQR): The IQR is the difference between the first and the third quartiles. IQR = Q3 - Q1. 33. Intercept: The intercept, b0 (in a linear regression), gives a starting value in y-units. It's the y-value when x = 0. 34. Influential Point: If omitting a point from the data results in a very different regression model, then that point is called an influential point. 35. Ladder of Powers: The Ladder of Powers places in order the effects that many re-expressions have on the data.
Power | Name | Comment
2 | The square of the data values, y^2 | Try this for unimodal distributions that are skewed to the left.
1 | The raw data - no change at all. This is the home base. The farther you step from here up or down the ladder, the greater the effect. | Data that can take on both positive and negative values with no bounds are less likely to benefit from re-expression.
1/2 | The square root of the data values, sqrt(y) | Counts often benefit from a square root re-expression. For counted data, start here.
0 | Although mathematicians define the "0" power differently, for us this place is held by the logarithm. You may feel uneasy about logarithms; don't worry, the computer or calculator does all the work. | Measurements that cannot be negative, and especially values that grow by percentage increases such as salaries or populations, often benefit from a log re-expression. When in doubt, start here. If your data have zeros, try adding a small constant to all values before finding the logs.
-1/2 | The (negative) reciprocal square root, -1/sqrt(y) | An uncommon re-expression, but sometimes useful. Taking the negative of the reciprocal square root preserves the direction of the relationships, which can be a bit simpler.
-1 | The (negative) reciprocal, -1/y | Ratios of 2 quantities (miles per hour, for example) often benefit from a reciprocal. (You have about a 50-50 chance that the original ratio was taken in the "wrong" order for simple statistical analysis and would benefit from re-expression.) Often, the reciprocal will have simple units (hours per mile). Change the sign if you want to preserve the direction of the relationships. If your data have zeros, try adding a small constant to all values before finding the reciprocal.
36. Least Squares: The least squares criterion specifies the unique line that minimizes the variance of the residuals or, equivalently, the sum of the squared residuals. 37. Leverage: Data points whose x-values are far from the mean of x are said to exert leverage on a linear model. High-leverage points pull the line close to them, and so they can have a large effect on the line, sometimes completely determining the slope and intercept. With high enough leverage, their residuals can appear deceptively small. 38. Linear Model: A linear model is an equation of the form ŷ = b0 + b1x. To interpret a linear model we need to know the variables (along with their W's) and their units. 39.
Lurking Variable: A variable other than x and y that simultaneously affects both variables, accounting for the correlation between the two. A variable that is not explicitly part of a model but affects the way the variables in the model appear to be related is called a lurking variable. Because we can never be certain that observational data are not hiding a lurking variable that influences both x and y, it is never safe to conclude that a linear model demonstrates a causal relationship, no matter how strong the linear association. 40. Marginal Distribution: In a contingency table, the distribution of either variable alone is called the marginal distribution. The counts or percentages are the totals found in the margins (the last row or column of the table). 41. Mean: The average of the data, found by summing all the data values and dividing by the count. The mean is the point of balance of a histogram. 42. Median: The middle value, with half of the data above it and half below it. 43. Mode: The value that occurs most often. 44. Model: An equation or formula that simplifies and represents reality. 45. Normal Model: A useful family of models for unimodal, symmetric distributions. It provides a useful way to understand data. We can decide whether a Normal model is appropriate by checking the Nearly Normal Condition with a histogram or Normal probability plot. Normal models follow the 68-95-99.7 Rule, and we can use technology or tables for a more detailed analysis. 46. Normal Percentile: The Normal percentile corresponding to a z-score gives the percentage of values in a standard Normal distribution found at that z-score or below. 47. Normal Probability Plot: A display to help assess whether a distribution of data is approximately Normal. If the plot is nearly straight, the data satisfy the Nearly Normal Condition. 48. Outliers: Outliers are extreme values that don't appear to belong with the rest of the data.
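One common convention for flagging outliers (the fences used by many boxplots) marks values more than 1.5 IQRs beyond the quartiles. A minimal sketch with made-up data:

```python
import statistics

# Hypothetical data with one suspiciously large value.
data = [1, 2, 3, 4, 5, 6, 7, 50]

# Quartiles via the default "exclusive" method.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Common boxplot convention: values beyond 1.5*IQR from the
# quartiles are flagged as potential outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]
print(outliers)  # [50]
```

Flagging a point this way is only the start: as the entry above says, you still have to think about whether it is a mistake or a genuinely unusual value.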
They may be unusual values that deserve further investigation, or just mistakes; there's no obvious way to tell. Don't delete outliers automatically; you have to think about them. Outliers can affect many statistical analyses, so you should always be alert for them. When removing an outlier to better understand the statistics, you MUST explain your considerations on why this was done. An outlier is a point that does not fit the overall pattern seen in graphs or scatterplots; any data point that stands away from the others can be called an outlier. In regression, outliers can be extraordinary in 2 ways: by having a large residual or by having high leverage. 49. Outlier Condition: Points with large residuals or high leverage (especially both) can influence the regression model significantly. It's a good idea to perform the regression analysis with and without such points to see their impact. 50. Parameter: A numerically valued attribute of a model. For example, the values of µ (mean) and σ (standard deviation) in a N(µ, σ) model are parameters. 51. Percentile: The ith percentile is the value that falls above i% of the data. 52. Pie Chart: Pie charts show how a "whole" divides into categories by showing a wedge of a circle whose area corresponds to the proportion in each category, usually given as percentages. 53. Predicted Values: The value of ŷ found for each x-value in the data. A predicted value is found by substituting the x-value into the regression equation. The predicted values are the values on the fitted line; the points (x, ŷ) all lie exactly on the fitted line. 54. Qualitative Variable: A qualitative variable cannot be measured; you can identify qualitative data by recognizing that it describes quality, not quantity. Data for hair color, eye color, favorite music, favorite TV show, or favorite movie have no numerical meaning and cannot be ordered numerically, so these categories contain qualitative data. 55.
Quantitative Variable: A variable in which the numbers act as numerical values. Quantitative variables ALWAYS have units. A quantitative variable is measurable and computable, and can be measured on an ordinal, interval, or ratio scale. 56. Quartile: The lower quartile (Q1) is the value with a quarter of the data below it and 75% of the data above it. The upper quartile (Q3) has a quarter of the data above it and 75% of the data below it. The median and quartiles divide the data into four equal parts. 57. r: The correlation, r, tells us about the regression: the slope of the line is based on the correlation, adjusted for the units of x and y. We've learned to interpret the slope in context: for each standard deviation in x that we are away from the x mean, we expect to be r standard deviations in y away from the y mean. Because r is always between -1 and +1, each predicted y is fewer standard deviations from its mean than the corresponding x was, a phenomenon called regression to the mean. The square of the correlation coefficient, R², gives us the fraction of the variation of the response accounted for by the regression model. The remaining 1 - R² of the variation is left in the residuals. 58. R²: R² is the square of the correlation between y and x. It gives the fraction of the variability of y accounted for by the least squares linear regression on x, and is an overall measure of how successful the regression is in linearly relating y to x. Even an R² near 100% doesn't indicate that x caused y (or the other way around); watch out for lurking variables that may affect both x and y. 59. Range: The difference between the lowest (minimum) and highest (maximum) values in a data set. Range = maximum - minimum. 60. Re-express Data: We re-express data by taking the logarithm, the square root, the reciprocal, or some other mathematical operation on all values in the data set to transform the data.
We've learned that when seeking a useful re-expression, taking logs is often a good, simple starting point. To search further, the Ladder of Powers or the log-log approach can help us find a good re-expression. We've come to understand that our models won't be perfect, but that re-expression can lead us to a useful model. We've learned that when the conditions for regression are not met, a simple re-expression of the data may help. There are several reasons to consider a re-expression:
To make the distribution of a variable more symmetric
To make the spread across different groups more similar
To make the form of a scatterplot straighter
To make the scatter around the line in a scatterplot more consistent
61. Good Regression: A good regression is nearly linear, and even a good regression doesn't mean we should believe the model completely: Extrapolation far from the mean can lead to silly and useless predictions. Even an R² near 100% doesn't indicate that x caused y (or the other way around); watch out for lurking variables that may affect both x and y. Watch out for regressions based on summaries of the data sets; these regressions tend to look stronger than regressions on the original data. 62. Regression Line (Line of Best Fit): The particular linear equation ŷ = b0 + b1x that satisfies the least squares criterion is called the least squares regression line. Casually, we often just call it the regression line, or the line of best fit. On the calculator, enter the scatterplot data in L1 and L2, then press STAT, CALC, 4, ENTER. 63. Regression to the Mean: Because the correlation is always less than 1 in magnitude, each predicted ŷ tends to be fewer standard deviations from its mean than its corresponding x was from its mean. This is called regression to the mean. 64. Relative Frequency Histogram: A histogram in which the counts on the vertical axis are replaced with the percentage of the total number of cases falling in each bin. 65.
Rescaling: Multiplying each data value by a constant multiplies both the measures of position (mean, median, and quartiles) and the measures of spread (standard deviation and IQR) by that constant. 66. Residuals: Residuals are the differences between the data values and the corresponding values predicted by the regression model (or, more generally, values predicted by any model): Residual = observed value - predicted value. The residuals also reveal how well the least squares regression model works: if a plot of residuals against predicted values shows a pattern, we should re-examine the data to see why. The standard deviation of the residuals, se, quantifies the amount of scatter around the line. 67. Response Variable: The variable that you hope to predict or explain; it is put on the y-axis of a scatterplot. 68. Scatterplots: A scatterplot shows the relationship between 2 quantitative variables measured on the same cases. Describe the direction of the association (positive or negative), the form it takes (e.g., linear), and its strength (the r value). 69. Shape: To describe the shape of a distribution, look for single vs. multiple modes (unimodal, bimodal, or multimodal) and symmetry vs. skewness, as well as the length of the tails. 70. Shifting: Adding a constant to each data value adds the same constant to the mean, the median, and the quartiles (it affects measures of center and position by that constant) but does not change the standard deviation or IQR (measures of spread). 71. Simpson's Paradox: When averages are taken across different groups, they can appear to contradict the overall averages. 72. 68-95-99.7 Rule: In a Normal model, about 68% of values fall within 1 standard deviation of the mean, about 95% fall within 2 standard deviations of the mean, and about 99.7% fall within 3 standard deviations of the mean. 73. Skewed: A distribution is skewed if it's not symmetric and one tail stretches out farther than the other.
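A long tail pulls the mean toward it while barely moving the median, which gives a quick numerical check for skewness. A minimal sketch with made-up right-skewed data:

```python
import statistics

# Hypothetical right-skewed data: one long tail of large values.
skewed_right = [1, 2, 2, 3, 3, 4, 50]

# The single large value drags the mean well above the median,
# a telltale sign of right skew.
mean = statistics.mean(skewed_right)
median = statistics.median(skewed_right)
print(mean, median)  # mean is far above the median of 3
```

For a left-skewed data set the comparison flips: the mean falls below the median.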
Distributions are said to be skewed left when the longer tail stretches to the left, and skewed right when it stretches to the right. 74. Slope: The slope gives a value in "y-units per x-unit." Changes of one unit in x are associated with changes of b1 units in the predicted value of y. Between any two points on the line, slope = (y2 - y1)/(x2 - x1). 75. Spread: A numerical summary of how tightly the values are clustered around the center, such as the range (maximum - minimum), the standard deviation, or the interquartile range from the 5-number summary. 76. Standard Deviation: The standard deviation is the square root of the variance; it takes into account how far each value is from the mean. Like the mean, the standard deviation is appropriate only for symmetric data. 77. Standard Normal Distribution: The Normal distribution N(0, 1), with mean 0 and standard deviation 1, as shown in a density graph. 78. Standardized Value: A value found by subtracting the mean and dividing by the standard deviation; this is often called a z-score: z = (x - µ)/σ. 79. Standardizing: We standardize to eliminate units. Standardized values can be compared and combined even if the original variables had different units and magnitudes; this is used especially with density curves. Standardizing uses the standard deviation as a ruler to measure distance from the mean, creating z-scores. 80. Statistic: A value calculated from data to summarize an aspect of the data. For example, the mean, ȳ, and the standard deviation, s, are statistics. 81. Stem-and-Leaf Display: Shows quantitative data values in a way that sketches the distribution of the data; it's best described in detail by example. 82. Straight Enough Condition: Says that the relationship should be reasonably straight to fit a regression line. Somewhat paradoxically, it is sometimes easier to see that the relationship is not straight after fitting the regression model, by examining the residuals. 83.
Subset: One unstated condition for finding a linear model is that the data be homogeneous. If, instead, the data consist of 2 or more groups that have been thrown together, it is usually best to fit a different linear model to each group rather than to try to fit a single model to all the data. Displays of the residuals can often help you find subsets in the data. 84. Symmetric: A distribution is symmetric if the two halves on either side of the center look approximately like mirror images of each other. 85. Tails: The tails of a distribution are the parts that typically trail off on either side. Distributions can be characterized as having long tails (if they straggle off for some distance) or short tails (if they don't straggle off very far). A long tail to the right of the median or mean makes a distribution skewed to the right; a long tail to the left makes it skewed to the left. 86. Timeplot: A timeplot displays data that change over time. Often, successive values are connected with lines to show trends and tendencies over time more clearly. 87. Unimodal: Having one mode. This is a useful term for describing the shape of a histogram when it's generally mound-shaped. Distributions with 2 modes are called bimodal; those with more than 2 are called multimodal. 88. Units: A quantity or amount adopted as a standard of measurement, such as dollars, hours, or grams. 89. Variable: A variable holds information about the same characteristic for many cases. 90. Variance: The variance is the sum of the squared deviations from the mean, divided by the count minus one. It is needed to find the standard deviation for a set of data. 91. Z-score: A z-score tells how many standard deviations a value is from the mean; z-scores have a mean of zero and a standard deviation of one. Z-scores let us compare apples to oranges by putting data on a common scale, and can identify unusual or surprising values among the data; compare using density graphs to assess normality.
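The z-score definition z = (x - mean)/s can be checked directly: standardizing a made-up data set gives values whose mean is 0 and whose standard deviation is 1, as the entry states. A minimal sketch using the sample standard deviation (the n - 1 version defined under Variance):

```python
import statistics

# Hypothetical sample data.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)   # 5.0 for this data
s = statistics.stdev(data)     # sample standard deviation (n - 1)

# z = (x - mean) / s : distance from the mean in SD units.
z_scores = [(x - mean) / s for x in data]

# Standardized values are unit-free with mean 0 and SD 1.
print(statistics.mean(z_scores), statistics.stdev(z_scores))
```

Because standardizing removes the units, z-scores from two different variables (say, heights in cm and weights in kg) can be compared on the same scale.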