Stats Glossary

Section 1.1

Data – numbers collected in a particular context. For example, if you asked everyone in our class how many brothers and/or sisters they had, the numbers that your class members gave as responses would represent data.

Variable – any characteristic of a person or thing that can be assigned a number or a category. For example, in the scenario described above, the variable would be the number of brothers and/or sisters. Students sometimes have trouble determining whether or not a statement represents a variable. Suppose that the statement was “name of your math teacher.” If the observational units were students in our class, would this vary from student to student? Pretend to ask each student, “What is the name of your math teacher?” Aren’t they all going to say the same name? Since the students’ answers would not vary, this cannot be a variable.

Observational unit – the person or thing to which the number or category is assigned; also called the case. For example, each class member that you asked for the number of brothers and/or sisters is an observational unit or case.

________________________________________________________________________

Section 1.2

Quantitative variable – a variable that measures a numerical characteristic; also called a measurement variable. For example, since the response to how many brothers and/or sisters a person has is a number, this variable is a quantitative variable.

Count variable – a type of quantitative variable; answers the question, “How many?”

Categorical variable – a variable that records a category designation; also called a qualitative variable. For example, if you were to record the gender of your class members, gender would be a categorical variable because the class members are either in the female category or in the male category. Here is another example.
Suppose you asked your class members for their favorite flavor of ice cream and you allowed them to choose from the following: vanilla, chocolate, strawberry, or other. Ice cream flavor would be a categorical variable because the class responses would fall into one of the four categories.

Binary variable – a special categorical variable for which only two possible categories exist. For example, the variable gender would be a binary variable. The variable ice cream flavor would not be a binary variable because it has more than two categories for the responses.

Displays for a Categorical Variable
1. Frequency Table
2. Picture Graph
3. Bar Graph
4. Segmented Bar Graph
5. Circle Graph

________________________________________________________________________

Section 1.3

Read Question – requires the respondent to read information from the table to determine a solution; important but low-level.

Derive Question – requires some type of computation involving information read from a table.

Interpret Question – requires an extension, prediction, or inference to read beyond the data; higher-level thinking.

________________________________________________________________________

Section 1.4

Displays for Quantitative Variables
1. Dot Plot
2. Stem-Leaf Plot
3. Grouped Frequency Table
4. Histogram

Symmetric – a distribution is symmetric if one half is roughly a mirror image of the other.

Skewed to the right – a distribution is skewed to the right if it tails off toward larger values.

Skewed to the left – a distribution is skewed to the left if it tails off toward smaller values.

Outliers – observations that differ markedly from the pattern established by the vast majority.

Granularity – a distribution has this characteristic if it has values occurring only at fixed intervals.

Six Features of a Data Distribution that are typically of interest
1. center – to be discussed in section 2.1
2. variability or spread – to be discussed in sections 2.2 and 2.3
3. shape – while the shape may vary, many times the shape may be identified as symmetric, skewed to the right, or skewed to the left
4. clusters/peaks – peaks or clusters indicate that the data fall into natural subgroups
5. outliers – if outliers are present, they warrant close examination
6. granularity

Side-by-side Stemplot
- A common set of stems is used in the middle of the display, with leaves for each category branching out in either direction.
- Order the leaves from the middle out toward either side.

Statistical Tendency
- Pertains to average or typical cases but not necessarily to individual cases.
- Example: Men tend to be taller than women. This does not mean that all men are taller than all women.

________________________________________________________________________

Section 1.5 Part 1

Response Variable vs. Explanatory Variable
Many times we would like to offer an explanation as to why a person gives a particular response.
Example: Do you believe that a person who is 50 years old is “old”? A person’s response to this question can most likely be explained by that person’s age. That is, someone who is 20 might believe 50 is old. However, someone who is 49 or 60 might not consider 50 old.
In the above example, there are two variables of interest, namely age and the “do you believe 50 is old” variable. Since we are thinking that a person’s age might predict the response to the statement, the variable age is called the explanatory variable. The “do you believe 50 is old” variable is the response variable. The response variable is affected or predicted by the explanatory variable.

Two-way Table
This is a table which classifies a person in 2 ways. Continuing the above example, suppose the following data were collected:

Age    Agree
5      Y
10     Y
20     Y
25     Y
30     Y
35     N
45     N
50     N
60     N
65     N

Here is a two-way table for this data. (Notice: The ages are placed in categories so as to create a categorical variable.)
         Agree   Disagree
0-25     4       0
26-50    1       3
51-75    0       2

The explanatory variable should be in columns and the response variable in rows.

Marginal Distribution
Calculated by finding the proportion of responses in each category.
Example: Continuing the above example, the marginal distribution for the age variable is
- 4/10 = 0.4 (there are 4 people in the 0-25 age category out of 10 people total),
- 4/10 = 0.4 (there are 4 people in the 26-50 age category out of 10 people total),
- 2/10 = 0.2 (there are 2 people in the 51-75 age category out of 10 people total).

Conditional Distribution
Distribution of one variable for given categories of the other variable. From the above example, the proportion of “middle-aged” respondents (the 26-50 age category) who agree is 1/4 = 0.25 (one agrees out of the total of 4 people in that age group).

Segmented Bar Graphs
Visual display for conditional distributions. Each rectangle has a height of 100%. Each rectangle is divided into segments whose lengths correspond to the conditional proportions.

________________________________________________________________________

Section 2.1

Three Measures of Center

1. Mean – the arithmetic average. The mean is found by adding up the values of the observations and dividing by the number of observations.
Example: Let 5, 10, 8, 7, 4 be the data set. To find the mean, add these numbers (5 + 10 + 8 + 7 + 4 = 34) and divide by how many numbers there are in the set (34/5 = 6.8). The mean for this data set is 6.8.
The mean can be thought of as the “balance point” of the distribution. Also, the mean can be calculated only on quantitative variables.

2. Median – the middle observation when the observations are listed in order. To find the median:
- Arrange the values in order.
- If there are an odd number of values, the median is in the (n + 1)/2 position.
- If there are an even number of values, the median is the average of the values in the n/2 and (n/2) + 1 positions.
Example: Let 5, 10, 8, 7, 4 be the data set. To find the median, we must first list these numbers in order: 4, 5, 7, 8, 10. Since there are 5 numbers in the set (an odd number), the median is the middle number, in this case 7. Sometimes a set is very large, so it is easier to figure out which numbered position the median is in. If so, use the formula (n + 1)/2 to find the position number. For this example, n would be 5. Using the formula, (5 + 1)/2 gives us 3. If you look in the third position, the median is 7.
Example: Let 5, 10, 8, 7, 4, 12 be the data set. To find the median, we must first list these numbers in order: 4, 5, 7, 8, 10, 12. Since there are 6 numbers in the set (an even number), the median is the average of the two middle numbers; in this case the average of 7 and 8 is 7.5. If a data set is very large, it may be beneficial to use the formulas n/2 and (n/2) + 1 to find the positions of the two numbers that you must average to get the median. In this example, n is 6. Using the formulas we get 6/2 = 3 and (6/2) + 1 = 4. We need to average the numbers in the third position (7) and in the fourth position (8). If you average 7 and 8 you get 7.5.

3. Mode – the most common value; the value that occurs most frequently.
Example: Let 5, 7, 3, 4, 4, 1 be the data set. The mode is 4 simply because it occurs twice and the other values occur only once. Suppose the data set had been orange, blue, orange, blue, orange, blue, red, black, green, red. The mode here would be both orange and blue since each of these occurred the most (three times each).
The mode applies to all categorical variables but is only useful with some quantitative variables.

Sample Size – the number of observations in the data set; the variable n usually denotes the sample size.
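
The three measures of center above can be sketched in a few lines of Python. This is only an illustration, not part of the course materials: the function names are made up for this example, and the data sets are the ones used in the examples above.

```python
from collections import Counter


def mean(values):
    """Add up the observations and divide by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Middle observation of the ordered data (average of the two middle ones if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the median sits in position (n + 1)/2, counting from 1.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average the values in positions n/2 and (n/2) + 1, counting from 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


def modes(values):
    """Every value that ties for the highest frequency (works for categories too)."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


colors = ["orange", "blue", "orange", "blue", "orange",
          "blue", "red", "black", "green", "red"]

print(mean([5, 10, 8, 7, 4]))        # 6.8
print(median([5, 10, 8, 7, 4]))      # 7
print(median([5, 10, 8, 7, 4, 12]))  # 7.5
print(modes([5, 7, 3, 4, 4, 1]))     # [4]
print(modes(colors))                 # ['orange', 'blue']
```

Note that modes returns a list because, as in the color example, more than one value can tie for most frequent.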
The relationship of the mean and the median
- Symmetric distribution – the mean is close to the median
- Skewed right distribution – the mean is greater than the median
- Skewed left distribution – the mean is less than the median

Resistant – a measure whose value is relatively unaffected by the presence of outliers.

Note: Measures of center are often important, but they do not summarize all aspects of a distribution.

________________________________________________________________________

Section 2.2

Range
- A measure of variability
- Simple but not very useful
- Maximum value minus the minimum value

Inter-quartile Range (IQR)
- A measure of variability
- It is the upper quartile minus the lower quartile
- The range of the middle 50% of the data

Lower Quartile
- 25th percentile
- The value such that 25% of the observations fall below that value and 75% of the observations fall above the value
To find the lower quartile:
1. Find the median for the entire data set. (This number divides the set into two halves.)
2. Find the median for the portion of the data set that falls below the actual median (which was found in step 1). This is your Lower Quartile. (By dividing the bottom half of the data set in half, you have found the quarters of the entire data set.)
3. Note: If there are an odd number of observations in the original data set, the actual median is not included in the bottom half when finding the lower quartile.

Upper Quartile
- 75th percentile
- The value such that 75% of the observations fall below that value and 25% of the observations fall above the value
To find the upper quartile:
1. Find the median for the entire data set. (This number divides the set into two halves.)
2. Find the median for the portion of the data set that falls above the actual median (which was found in step 1). This is your Upper Quartile. (By dividing the upper half of the data set in half, you have found the quarters of the entire data set.)
3. Note: If there are an odd number of observations in the original data set, the actual median is not included in the upper half when finding the upper quartile.

Five-number summary
- Provides a quick, convenient description of where the four quarters of the data fall
- Includes the minimum value, the lower quartile, the median, the upper quartile, and the maximum value

Boxplot
- A visual display which is based on the 5-number summary.
- Draw a box between the quartiles. This box demonstrates where the middle 50% of the data fall.
- Draw horizontal lines (or whiskers) that extend from the left and right sides of the box to the minimum and maximum, respectively.
- Mark the median with a vertical line inside the box.
- One weakness of boxplots: because the whiskers run to the minimum and maximum, a single outlier can stretch a whisker a long way.

Modified Boxplots
- Outliers are marked with symbols.
- “Whiskers” extend to the most extreme, nonoutlying value.
- Rule for identifying outliers: outliers are observations lying more than 1.5 times the IQR away from the nearer quartile.

________________________________________________________________________

Section 2.3

Standard Deviation
- A widely used measure of variability.
- Denoted by s.
To compute:
1. Calculate the difference between each observation and the mean.
2. Square each of these differences.
3. Add these squares.
4. Divide this sum by n - 1.
5. Take the square root.

Empirical Rule
With mound-shaped data:
- About 68% of the observations fall within 1 standard deviation of the mean.
- About 95% of the observations fall within 2 standard deviations of the mean.
- Virtually all observations fall within 3 standard deviations of the mean.
This is not necessarily true for distributions of other shapes.

z-score or standardized score
Useful for comparing individual scores from different distributions.
To calculate a z-score:
1. Subtract the mean from the value of interest.
2. Divide by the standard deviation.
The z-score indicates how many standard deviations above or below the mean a particular value falls.
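
The step-by-step procedures in Sections 2.2 and 2.3 (quartiles, the 1.5 × IQR outlier rule, the standard deviation, and the z-score) can be sketched in Python as follows. This is a minimal sketch, not an official implementation: the function names and the seven-value data set are invented for illustration, and the quartiles follow the method described above (the median is left out of each half when n is odd), which can differ slightly from what statistical software reports.

```python
import math


def median_of_sorted(ordered):
    """Median of a list that is already in order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]  # never contains the median itself
    # When n is odd, skip the median before taking the upper half.
    upper_half = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0], median_of_sorted(lower_half), median_of_sorted(ordered),
            median_of_sorted(upper_half), ordered[-1])


def outliers(values):
    """Observations more than 1.5 * IQR away from the nearer quartile."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]


def standard_deviation(values):
    """Steps 1-5 above: deviations, squares, sum, divide by n - 1, square root."""
    n = len(values)
    m = sum(values) / n
    sum_of_squares = sum((v - m) ** 2 for v in values)
    return math.sqrt(sum_of_squares / (n - 1))


def z_score(value, mean, sd):
    """How many standard deviations the value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


scores = [4, 5, 7, 8, 10, 12, 30]   # hypothetical data with one unusually large value
print(five_number_summary(scores))  # (4, 5, 8, 12, 30)
print(outliers(scores))             # [30]

data = [5, 10, 8, 7, 4]             # data set from Section 2.1, mean 6.8
s = standard_deviation(data)
print(round(s, 2))                      # 2.39
print(round(z_score(10, 6.8, s), 2))    # 1.34
```

In the last line, the value 10 lies about 1.34 standard deviations above the mean of 6.8, which is exactly what the z-score is meant to report.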
The z-score should only be used when working with mound-shaped distributions.

Note: A common misconception about variability is to believe that a “bumpier” histogram indicates a more variable distribution, but this is not the case. Similarly, a larger number of distinct values represented in a histogram does not necessarily indicate greater variability.

________________________________________________________________________

Section 1.5 Part 2

Scatterplot
A scatterplot is similar to a dot plot except that it displays two quantitative variables simultaneously. The vertical axis represents one variable and the horizontal axis represents the other. A dot represents an observational pair. Generally, the response variable is on the vertical axis and the explanatory variable is on the horizontal axis. For example, I believe that if I know your foot length, then I can tell you your height. The variable foot length is predicting the variable height. Foot length is the explanatory variable and should be on the horizontal axis. Height is the response variable and should be on the vertical axis.

Positive Association
Two variables are positively associated if larger values of one variable tend to occur with larger values of the other variable. For example, consider the variables “number of hours worked” and “money earned.” One would assume that if a large number of hours are worked, then a large amount of money is earned. Therefore, these two variables are positively associated.

Negative Association
Two variables are negatively associated if larger values of one variable tend to occur with smaller values of the other. For example, consider the variables “the number of days absent from class” and “grade in class.” Generally, someone with a high number of absences will have a lower grade in the class. These two variables are negatively associated.

Correlation Coefficient
The letter r is used to denote the correlation coefficient.
The correlation coefficient is a measure of the degree to which two variables are associated. The value of the correlation coefficient ranges from -1 to +1. If the correlation coefficient equals +1 or -1, then the observations form a perfectly straight line. The sign of the correlation coefficient reflects the direction of the association. That is, if r is positive, then the two variables are positively associated. If r is negative, then the two variables are negatively associated. Values of r that are closer to +1 or -1 indicate stronger associations. Therefore, the size of r indicates the magnitude or strength of the association. The correlation coefficient only measures linear relationships between two variables. Therefore, it is always important to look at the scatterplot when interpreting r.

Association vs. Causation
Two variables may be strongly associated without a cause-and-effect relationship. Often, if two variables are associated but a cause-and-effect relationship is not apparent, then it is likely that the two variables are related to a third variable that is not being measured. This third variable is called a lurking variable or a confounding variable.
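
The glossary does not give a formula for r, so the following Python sketch assumes the standard (Pearson) correlation coefficient. The function name and all three data sets are hypothetical, chosen only to illustrate the properties of r described above: a perfect line, a negative association, and a strong but non-linear relationship.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired observations (assumed formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of the deviations from each mean.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: pay is exactly 10 per hour, so the points form a straight line.
hours_worked = [1, 2, 3, 4, 5]
money_earned = [10, 20, 30, 40, 50]
print(correlation(hours_worked, money_earned))  # 1.0

# Hypothetical data: more absences tend to go with lower grades, so r is negative.
days_absent = [0, 2, 4, 6, 8]
grade = [95, 90, 80, 75, 60]
print(correlation(days_absent, grade) < 0)      # True

# Hypothetical data: y = x squared is a perfect curve, yet r is 0
# because r only measures linear relationships.
x = [-2, -1, 0, 1, 2]
y = [4, 1, 0, 1, 4]
print(correlation(x, y))                        # 0.0
```

The last example is the reason to look at the scatterplot as well as r: the two variables are clearly related, but not along a line.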