University of Saskatchewan
College of Arts & Science
Department of Mathematics and Statistics

STATS 245.3 (02): Introduction to Statistical Methods

Introduction

What is Statistics?

Any beginning statistics student should know at least one version of the definition of Statistics. The authors of our textbook provide the following definition: “Statistics as a subject provides a body of principles and methodology for designing the process of data collection, summarizing and interpreting the data, and drawing conclusions or generalities [from the data].”

This course provides an introduction to each of these three areas, which together form the field of Statistics. The branch of statistics referred to as experimental design deals specifically with the collection of data. The branch of statistics referred to as descriptive statistics deals with summarizing the data after it has been collected. The branch of statistics referred to as inferential statistics deals with assessing and evaluating the data and drawing conclusions based on this assessment.

Common to all three parts of the definition of Statistics is data. Data, simply put, is the information that is recorded during some investigation. Data can be either numerical (sometimes referred to as measurement) or qualitative (sometimes referred to as categorical). Information that can be represented or measured numerically is referred to as numerical (or measurement) data. Examples are properties such as heights and weights. Information that cannot be measured numerically but can be classified by categories is referred to as qualitative (or categorical) data. Examples of this type of data are properties such as eye colour and gender.

Whether the data to be collected is numerical or categorical, the data is based on observations of some characteristic of a collection of objects. The individual entities, whether persons or objects, whose characteristics are of interest are referred to as units.
A population is the complete collection of the units from which information is sought. A statistical population is the set of measurements (or record of some qualitative trait) on the population.

Examples:

Population | Unit | Characteristic

Populations are generally quite large, and perhaps even infinite in size, so it is often extremely difficult to analyze the entire population. When the population is too large or too difficult (or both) to analyze, we have to be content with a smaller subset of the original population. This smaller subset is referred to as a sample. In order to get an accurate representation of the general population, the units to be included in the sample must be selected randomly from the general population.

Example:

Properties of our random sample usually constitute our data. Once we have collected our data, then what?

Describing A Data Set (Descriptive Statistics)

Numerical Data: Suppose we have a data set of n readings (heights, weights, etc.). We denote the individual readings \(x_1, x_2, x_3, \dots, x_n\), and we denote the i’th reading by \(x_i\). Sometimes we wish to add the readings, that is, to compute \(x_1 + x_2 + x_3 + \dots + x_n\). A short-hand notation for this sum is Sigma notation:

\[ x_1 + x_2 + x_3 + \dots + x_n = \sum_{i=1}^{n} x_i . \]

We can use Sigma notation for other expressions. For example,

\[ \sum_{k=1}^{m} x_k^2 = x_1^2 + x_2^2 + \dots + x_m^2 , \qquad 2y_1 + 2y_2 + 2y_3 + \dots + 2y_n = \sum_{i=1}^{n} 2y_i . \]

Example: Suppose a research scientist records four observations: \(x_1 = 3\), \(x_2 = 1\), \(x_3 = 0\), \(x_4 = 4\). Compute the following:

1. \(\sum_{i=1}^{4} x_i\)
2. \(\sum_{i=1}^{4} x_i^2\)
3. \(\sum_{i=1}^{4} (x_i - 2)^2\)
4. \(\left( \sum_{i=1}^{4} x_i \right)^2\)

Measures of Central Tendency

A numerical value describing a sample is called a statistic. A numerical value describing a population is called a parameter. We often use statistics to estimate parameters.
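Returning briefly to the sigma-notation example above (observations 3, 1, 0, 4), the four sums can be checked numerically. The following is a minimal sketch in Python, not part of the original notes; it reads the third expression as the sum of \((x_i - 2)^2\), which is an assumption about the intended exercise, and the variable names are illustrative only.

```python
# Observations from the example: x1 = 3, x2 = 1, x3 = 0, x4 = 4
x = [3, 1, 0, 4]

sum_x = sum(x)                                 # sum of x_i
sum_x_sq = sum(xi ** 2 for xi in x)            # sum of x_i^2
sum_dev_sq = sum((xi - 2) ** 2 for xi in x)    # sum of (x_i - 2)^2 (assumed reading)
sq_of_sum = sum(x) ** 2                        # (sum of x_i)^2

print(sum_x, sum_x_sq, sum_dev_sq, sq_of_sum)
```

Note that the second and fourth quantities differ: squaring each reading and then adding is not the same as adding the readings and then squaring.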
One of the most common examples of a statistic is the sample mean (or average). Given individual readings \(x_1, x_2, x_3, \dots, x_n\), the sample mean is denoted \(\bar{x}\) and is calculated using the formula

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} . \]

Example: Find the average of the five readings 2.1, 3.2, 4.1, 5.6, and 3.7.

Notes: When doing any calculations, rounding off introduces error. Whenever possible, do NOT round off until you are ready to report the result of your calculation!

The sample mean \(\bar{x}\) is an example of a statistic, and it is quite often used to estimate the parameter \(\mu\), the population mean. Why? Quite often it is unreasonable (if not impossible) to determine the population mean \(\mu\). Using \(\bar{x}\) to estimate \(\mu\) is a topic from “inferential statistics” which we will be discussing later in the course.

Both the population mean \(\mu\) and the sample mean \(\bar{x}\) are measures of central tendency: \(\mu\) is the average or “central” value for the population and \(\bar{x}\) is the average or “central” value for the sample.

The mean (or average) is not the only measure of central tendency. Two other measures of central tendency are the median and the mode.

The sample median is denoted \(\tilde{x}\). The median represents the middle value of a set when the set is expressed in ascending or descending order. If there is an odd number of elements in this ordered set, then the median is the middle value. If there is an even number of elements in this ordered set, then the median is the average of the two middle values.

Example: Find the median time of the following times: 37, 49.7, 43, 45.1, 44.6.

Example: Find the median height of the following set: 123 cm, 178 cm, 167 cm, 159 cm.

The sample mode is another measure of central tendency. The mode is the value which occurs most frequently in a sample.

Example: Find the mode of the following shoe sizes: 9, 5, 8, 6, 9, 7.

What happens if no value occurs more than once in a data set?

Which measure of central tendency is the “best”?
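One way to explore this question is to compute all three measures for the small examples above. The sketch below is an illustration only, using Python's standard `statistics` module (Python 3.8 or later is assumed for `multimode`); the variable names are hypothetical and not from the notes.

```python
from statistics import mean, median, multimode

readings = [2.1, 3.2, 4.1, 5.6, 3.7]      # readings from the sample mean example
times = [37, 49.7, 43, 45.1, 44.6]        # odd number of values
heights_cm = [123, 178, 167, 159]         # even number of values
shoe_sizes = [9, 5, 8, 6, 9, 7]

mean_reading = mean(readings)             # sum of the readings divided by n
median_time = median(times)               # middle value of the ordered times
median_height = median(heights_cm)        # average of the two middle heights
modes = multimode(shoe_sizes)             # list of the most frequent value(s)

print(mean_reading, median_time, median_height, modes)

# When every value occurs exactly once, multimode returns all of the values,
# which is one way of saying that no single mode stands out.
print(multimode([5, 6, 7, 8]))
```

How a data set with no repeated values should be reported (no mode, or every value a mode) is a matter of convention; the library's behaviour shown here is just one choice.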
The answer is very problem specific. For instance, calculate the mean and median of the following survival times: 3, 15, 46, 64, 126, 623.

Conclusions:

If the number of observations is quite large (greater than or equal to 30, for instance), the idea of the median can be extended to dividing the data into quarters, tenths, or even hundredths. Just as the point which divides a data set in half is called the median, the points which divide the data set into quarters are called quartiles, the points which divide a data set into tenths are called deciles, and the points which divide a data set into hundredths are called percentiles.

Definition: The sample 100p-th percentile is a value such that, after the data are ordered from smallest to largest, at least 100p% of the observations are at or below this value and at least 100(1 - p)% are at or above this value.

The deciles D1, D2, D3, D4, D5, D6, D7, D8, and D9 are the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, and 90th percentiles, respectively. The quartiles Q1, Q2, and Q3 are the 25th, 50th, and 75th percentiles, respectively.

Calculating the Sample 100p-th Percentile:
1. Order the data from smallest to largest.
2. Calculate np.
3. If np is not an integer, round it up to the next integer and find the corresponding ordered value. If np is an integer, say k, calculate the average of the k’th and the (k+1)’st ordered values.

Example: The times (in seconds) of 50 runners in a marathon, ordered from smallest to largest, were:

238 239 240 240 240 241 242 242 242 242
243 243 243 243 244 244 244 244 245 245
245 245 245 245 246 246 246 246 246 246
246 246 247 247 247 247 248 248 248 248
248 248 248 249 249 249 250 250 250 251

Locate the quartiles and also compute the first decile.

Measures of Variation

The simplest measure of the variability among a set of data is the range.
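Before turning to the range, the three-step percentile rule from the previous section can be sketched as a short function. This is a minimal sketch under stated assumptions, not an official course implementation: the function name `sample_percentile` is hypothetical, p is assumed to lie strictly between 0 and 1, and a floating-point tolerance is used to decide whether np is an integer.

```python
import math

def sample_percentile(data, p):
    """Sample 100p-th percentile using the 'np' rule from these notes."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    ordered = sorted(data)              # step 1: order the data
    position = len(ordered) * p         # step 2: calculate np
    k = round(position)
    if math.isclose(position, k):
        # np is an integer k: average the k'th and (k+1)'st ordered values
        return (ordered[k - 1] + ordered[k]) / 2
    # np is not an integer: round up and take that ordered value
    return ordered[math.ceil(position) - 1]

# The 50 ordered marathon times from the example above
marathon = [
    238, 239, 240, 240, 240, 241, 242, 242, 242, 242,
    243, 243, 243, 243, 244, 244, 244, 244, 245, 245,
    245, 245, 245, 245, 246, 246, 246, 246, 246, 246,
    246, 246, 247, 247, 247, 247, 248, 248, 248, 248,
    248, 248, 248, 249, 249, 249, 250, 250, 250, 251,
]

q1 = sample_percentile(marathon, 0.25)
q2 = sample_percentile(marathon, 0.50)
q3 = sample_percentile(marathon, 0.75)
d1 = sample_percentile(marathon, 0.10)
print(q1, q2, q3, d1)
```

Statistical software often uses interpolation rules that differ from this one, so library percentile functions may return slightly different values for the same data.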
The range of a sample is simply the largest value in the set minus the smallest value in the set.

Example: Find the range of the following shoe sizes: 9, 5, 8, 6, 9, 7.

Find the range of the following survival times: 3, 15, 46, 64, 126, 623.

Find the range of the marathon runners’ times.

Consider the ranges of the sample 2, 3, 4, 5, 6, 7 and the sample 2, 6, 6, 6, 6, 7. Though both samples have a range of 5, the first sample is certainly more spread out than the second sample.

We would like a statistic which measures the variability in the sample associated with how the data is spread out. One such method uses \(\bar{x}\), the center of the sample, as a “frame of reference”. Because \(\bar{x}\) is a measure of the center of a data set, one way of describing the variability of the individual data points \(x_i\) about the center is their deviation from the mean, that is, \(x_i - \bar{x}\).

Example:

Observation \(x_i\) | Deviation \(x_i - \bar{x}\)
3 |
5 |
7 |
7 |
8 |

\(\bar{x} =\)
\(\sum [x_i - \bar{x}] =\)

We refer to the quantity \(\sum [x_i - \bar{x}]\) as the total deviation. It is not a good measure of the variation within a data set because the total deviation is always 0 (to be proved on Assignment #1).

To obtain a measure of how “spread out” the data is from the mean, we need to eliminate the negative signs of the individual deviations. One way is to square the individual deviations. If we add these individual squared deviations and divide this sum by one less than the number of data points, we get the sample variance, denoted \(s^2\), with n - 1 degrees of freedom.

Example:

Observation \(x_i\) | Deviation \(x_i - \bar{x}\) | Squared Deviation \([x_i - \bar{x}]^2\)
3 | |
5 | |
7 | |
7 | |
8 | |

\(\bar{x} =\)
\(\sum [x_i - \bar{x}] =\)
\(\sum [x_i - \bar{x}]^2 =\)
\(s^2 = \dfrac{\sum [x_i - \bar{x}]^2}{n - 1} =\)

Now we define the sample standard deviation, denoted s, as

\[ s = \sqrt{s^2} = \sqrt{\frac{\sum [x_i - \bar{x}]^2}{n - 1}} . \]

We use the sample standard deviation, not the sample variance, as a basic measure of variability.

Example:
1. a. Compute the sample standard deviations for
      i. 1, 4, 5, 9, 11
      ii. 3, 5, 7, 7, 8
   b. Compare the two standard deviations you calculated above.
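The definition of the sample variance and standard deviation above translates directly into code. The following is a minimal sketch for checking hand calculations; the function names are hypothetical and the code simply follows the definition (sum of squared deviations divided by n - 1).

```python
import math

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)

def sample_std_dev(data):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(data))

sample_i = [1, 4, 5, 9, 11]
sample_ii = [3, 5, 7, 7, 8]

# The deviations from the mean always add up to zero
x_bar_ii = sum(sample_ii) / len(sample_ii)
total_deviation = sum(x - x_bar_ii for x in sample_ii)

s_i = sample_std_dev(sample_i)
s_ii = sample_std_dev(sample_ii)
print(total_deviation, s_i, s_ii)
```

Both samples have the same mean, so the larger standard deviation of the first sample reflects only its greater spread about that mean.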
The formula used to calculate the sample variance,

\[ s^2 = \frac{\sum [x_i - \bar{x}]^2}{n - 1} , \]

is somewhat tedious to calculate by hand. An equivalent, but somewhat more calculation-friendly, formula for the sample variance is

\[ s^2 = \frac{1}{n - 1} \left[ \sum x_i^2 - \frac{\left( \sum x_i \right)^2}{n} \right] . \]

Proving the equivalency will be a question on Assignment 1.

Example: Calculate the sample standard deviation of 4, 2, 3, 3, 6, 3.

Another measure of variation is the sample interquartile range, which is the length of the interval covering the center half of the data points, that is, \(Q_3 - Q_1\).

Example: Calculate the sample interquartile range for the marathon runner example.

Describing A Data Set with a Table

In a data set, the frequency of a value is the number of times the value appears in the data set. The relative frequency of a value is the proportion of times the value appears in the data set.

Example: On a Stats 245 quiz (marked out of 5), the grades of twenty students were

5 4 2 5 4 2 1 4 4 2
3 3 3 3 5 4 3 5 0 3

Determine the frequency and relative frequency of each grade.

The table we completed above is referred to as the Frequency Distribution Table.

Grouped Frequency Distribution Table

Quite often when the range of the data set is large, the individual frequencies do not reveal much information about the data. We commonly group the data into intervals. If the intervals are not given a priori, a rule of thumb is to form TEN intervals. This by no means always holds (for instance, when the range is small or when there is a lot of data). With experience, one is able to determine an appropriate number of intervals to represent the data. The intervals should not overlap and should continuously cover the range of the data. A template for the Grouped Frequency Distribution Table is as follows:

Interval | Midpoint | Tally | Freq. | Cumulative Frequency | Percentage Frequency | Cumulative Per. Freq.
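For the ungrouped quiz grades earlier in this section, the frequency and relative frequency of each grade can be tallied mechanically. The sketch below is illustrative only and uses `collections.Counter` from the Python standard library; the variable names are hypothetical.

```python
from collections import Counter

# Quiz grades (out of 5) of the twenty students
grades = [5, 4, 2, 5, 4, 2, 1, 4, 4, 2,
          3, 3, 3, 3, 5, 4, 3, 5, 0, 3]

n = len(grades)
frequency = Counter(grades)                                         # grade -> count
relative_frequency = {g: frequency[g] / n for g in sorted(frequency)}

print("Grade  Freq.  Rel. Freq.")
for g in sorted(frequency):
    print(f"{g:>5}  {frequency[g]:>5}  {relative_frequency[g]:>10.2f}")
```

The frequencies should add up to the number of observations and the relative frequencies should add up to 1, which is a quick check on any hand-built table.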
Example: The times (in seconds) of 50 runners in a marathon were:

246 243 244 244 242 246 250 242 244 247
238 245 246 248 245 248 242 246 240 250
246 243 249 240 239 248 243 246 245 247
251 241 246 243 244 249 245 245 247 248
240 248 245 249 246 248 242 247 248 250

Suppose the classes for the runners were defined (inclusively) by:

237-239
240-242
243-245
246-248
249-251

Complete a Grouped Frequency Distribution Table.

Interval | Midpoint | Tally | Freq. | Cumulative Frequency | Relative Frequency | Cumulative Rel. Freq.

Notes on a Grouped Frequency Distribution Table:

1. Suitable class sizes are subjective and depend on the data. Usually the classes should all be of the same width. If the classes are too wide, then too much of the information is lost, and the graphic representation of the Grouped Frequency Distribution (a histogram, to be discussed in the next section) would appear to be “box-like”. If the classes are too narrow, then the histogram has many little boxes and little discernible information can be obtained from the graph.

2. It is a good strategy to set the class boundaries (which we use to form the class intervals) at values where no data points lie. Usually the trick we used above works. In the above example, our data consisted of whole numbers, so we were able to use ###.5’s as the boundaries. If our data consisted of values of the form $###.##, our class boundaries would have to be of the form $###.##5.

Describing A Data Set Graphically

We will look at some of the more popular graphical representations of data.

1. Pie Chart: The Manitoba Government decided that in the 1994 fiscal year, each dollar would be spent as follows: Health $0.25, Education $0.25, Highways $0.15, Administration $0.20, Parks $0.10, and Other $0.05. Draw the corresponding Pie Chart.

2. Pareto Diagram: A Pareto Diagram displays events according to their frequency. The events are plotted in descending order of frequency, with the miscellaneous (or other) category always being plotted last.

Example: Draw a Pareto Diagram for the grades of the Stats 245 quiz (marked out of 5).

3. Line (or Stick) Diagram: Line Diagrams are generally used to graphically illustrate the frequency (or relative frequency) of discrete events. We always label the horizontal axis with the events and the vertical axis with the frequencies (or relative frequencies). We should take particular note that if we are graphing the events against the relative frequencies, the labeling on the vertical axis should always start with 0 (zero) and finish with 1 (one).

Example: Draw a Line Diagram for the frequency of the grades of the Stats 245 quiz (marked out of 5).

Example: Draw a Line Diagram for the relative frequency of the grades of the Stats 245 quiz (marked out of 5).

4. Dot Diagram: Suppose we have the following set of measurements (in cm): {37.0, 43.0, 44.6, 45.1, 37.0, 49.7}. The corresponding Dot Diagram plots the data points in the sample, as well as the sample mean, on a number line with integer labellings.

5. Stem-and-Leaf Display: A stem-and-leaf display can be thought of as a variation of the histogram (to be discussed in a moment), especially when the observations are two-digit numbers. To draw a stem-and-leaf display,
   a. List the digits 0 to 9 in a column and draw a vertical line. These correspond to the leading digits of the observations.
   b. For each observation, record its second digit to the right of this vertical line in the row where the first digit appears.
   c. Arrange the second digits in each row so that they are in increasing order.
The column of the first digits is referred to as the stem and the second digits are the leaves.
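Steps a to c above can be sketched in a few lines of code. The example below is a hedged illustration, assuming every observation is a two-digit (or one-digit) whole number between 0 and 99; the function name `stem_and_leaf` is hypothetical, and the data are the Stats 245 grades used in the example that follows.

```python
def stem_and_leaf(values):
    """Map each stem 0-9 to its sorted list of leaves (second digits)."""
    display = {stem: [] for stem in range(10)}      # step a: stems 0 to 9
    for v in values:
        display[v // 10].append(v % 10)             # step b: record the second digit
    return {stem: sorted(leaves) for stem, leaves in display.items()}  # step c

grades = [75, 78, 15, 90, 98, 99, 57, 62, 42, 90,
          68, 58, 75, 80, 60, 49, 84, 89, 77, 52]

display = stem_and_leaf(grades)
for stem, leaves in display.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Stems with no observations are still printed with an empty row, which keeps the shape of the display comparable to a histogram turned on its side.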
Example: Construct a Stem-and-Leaf Display of the following Stats 245 grades:

75 78 15 90 98 99 57 62 42 90
68 58 75 80 60 49 84 89 77 52

Stem-and-Leaf Display for Stats 245 Grades

0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |

To see variations of the Stem-and-Leaf Display for data other than two-digit whole numbers, please refer to the textbook.

6. Boxplots: A Boxplot is a graphical illustration of the data which is based on the information in the quartiles. To construct a Boxplot, you need the minimum and maximum observations and the values Q1, Q2, and Q3.

Example: Draw a Boxplot for the “Marathon Times” example.

7. Scatter Diagram: A Scatter Diagram can be used to display data sets consisting of two variables, where each variable is measured numerically. We call such a data set bivariate. A Scatter Diagram is a first step when trying to determine a relationship between the two variables. Graphically, one variable is represented by the horizontal axis and the other variable is represented by the vertical axis.

Example: Let y denote the height (centimeters relative to a fixed reference level) of the water in a reservoir, and let x be the monthly rainfall (centimeters). The data for a period of 14 months is as follows.

y | x
30 | 5
-10 | 2
0 | 1
50 | 9
10 | 6
-40 | 3
-5 | 4
25 | 8
-15 | 6
-60 | 0
15 | 5
35 | 7
20 | 10
-5 | 9

Draw a Scatter Diagram for the above information.

8. Histogram: The histogram is the most common graph used to represent data pictorially. A histogram can be used to display frequencies (or relative frequencies).

Example: Construct a histogram for the frequency and also the relative frequency of the grades of the Stats 245 quiz (marked out of 5).

Example: Construct a histogram for the grouped frequencies of the marathon runners.
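The class counts needed for this histogram can be tallied from the raw marathon times and the five inclusive classes given earlier. The sketch below is an illustration only, with hypothetical variable names. It also computes midpoint-based estimates of the mean and standard deviation, of the kind the notes introduce next, under the assumption that the standard deviation estimate is the square root of the grouped version of the shortcut variance formula.

```python
import math

times = [
    246, 243, 244, 244, 242, 246, 250, 242, 244, 247,
    238, 245, 246, 248, 245, 248, 242, 246, 240, 250,
    246, 243, 249, 240, 239, 248, 243, 246, 245, 247,
    251, 241, 246, 243, 244, 249, 245, 245, 247, 248,
    240, 248, 245, 249, 246, 248, 242, 247, 248, 250,
]
classes = [(237, 239), (240, 242), (243, 245), (246, 248), (249, 251)]

n = len(times)
midpoints = [(low + high) / 2 for low, high in classes]
freqs = [sum(low <= t <= high for t in times) for low, high in classes]

# Grouped frequency distribution table
cumulative = 0
print("Interval  Midpoint  Freq.  Cum. Freq.  Rel. Freq.  Cum. Rel. Freq.")
for (low, high), mid, f in zip(classes, midpoints, freqs):
    cumulative += f
    print(f"{low}-{high}   {mid:7.1f}  {f:5d}  {cumulative:9d}  "
          f"{f / n:10.2f}  {cumulative / n:15.2f}")

# Midpoint-based estimates (each observation is treated as its class midpoint)
sum_yf = sum(y * f for y, f in zip(midpoints, freqs))
sum_y2f = sum(y ** 2 * f for y, f in zip(midpoints, freqs))
grouped_mean = sum_yf / n
grouped_sd = math.sqrt((n * sum_y2f - sum_yf ** 2) / (n * (n - 1)))
print(grouped_mean, grouped_sd)
```

Because grouping replaces each observation by its class midpoint, the estimates are close to, but not exactly equal to, the values computed from the raw data.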
We can estimate the mean and standard deviation of a data set from the information stored in a histogram:

The Mean

\[ \bar{y} = \frac{1}{n} \sum_{i=1}^{\#\text{intervals}} y_i f_i , \]

where \(y_i\) is the midpoint of the i’th interval and \(f_i\) is the frequency of the i’th interval.

The Standard Deviation

\[ s = \sqrt{ \frac{ n \sum_{i=1}^{\#\text{intervals}} y_i^2 f_i - \left( \sum_{i=1}^{\#\text{intervals}} y_i f_i \right)^2 }{ n(n - 1) } } , \]

where \(y_i\) is the midpoint of the i’th interval and \(f_i\) is the frequency of the i’th interval.

Example: For the previous example, use the above two formulae to estimate the mean and standard deviation. Compare these values to the mean and standard deviation calculated from the raw data: mean = 245.28 and st. dev. = 3.11.

Chapter 3

When discussing the topic of Scatter Diagrams, we were introduced to the idea of collecting numerical data for two distinct variables. We called this bivariate data. Generally, when one collects data for more than one variable, one wishes to determine:
1. if any relationship exists between the different variables,
2. how strong the relationship is, and
3. whether one variable can be predicted from the other variable(s).

When two traits which can be recorded in some qualitative, categorical manner are observed for each sampling unit, the information can be represented in a two-way frequency table. The categories for one trait are labeled along the left margin of the table and the categories for the second trait are labeled along the top margin. The frequencies are stored in the corresponding cells. This two-way frequency table is called a contingency table.

From the information contained in a contingency table we can create two different tables: the relative frequency table and the group relative frequency table.

Example: For the following contingency table, complete the corresponding relative frequency table and the group relative frequency table.
 | Died | Lived | Total
Research Hospital | 90 | 2110 | 2200
Community Hospital | 23 | 677 | 700
Total | 113 | 2787 | 2900

Notes:
(1) One must be careful when combining the information from two contingency tables from different sources into one table. The information found in this new table can be misleading. There is a risk that some unreported variable(s) may cause a reversal of the findings recorded in the original tables. This reversal of the findings is referred to as Simpson’s Paradox. For an explicit example, please refer to the textbook pages 93 to 94.
(2) Quite often one of the traits in the contingency table is the placebo effect, that is, the tendency for a person to respond to a treatment just because of the attention that is being paid to him/her.

~~~

The following is the midterm and final exam data from a Law 30 class.

Student | Midterm Mark (X) | Final Mark (Y)
1 | 55 | 60
2 | 60 | 70
3 | 65 | 60
4 | 85 | 75
5 | 75 | 80
6 | 70 | 85
7 | 75 | 80
8 | 65 | 70
9 | 65 | 60
10 | 85 | 80

By creating a scatter diagram with the above data, we get the following graph:

[Figure: Scatter Diagram of Midterm and Final Marks. Horizontal axis: Midterm Marks (X); vertical axis: Final Marks (Y).]

We would like to determine if there is a linear relationship between the Midterm Marks (X) and the Final Marks (Y). This relationship is referred to as the correlation \(\rho\) (rho). (This is NOT a p.) We measure the correlation by the sample correlation coefficient r, where

\[ r = \frac{ n \sum x_i y_i - \sum x_i \sum y_i }{ \sqrt{ n \sum x_i^2 - \left( \sum x_i \right)^2 } \, \sqrt{ n \sum y_i^2 - \left( \sum y_i \right)^2 } } . \]

The standard deviation of r is given by

\[ s = \sqrt{ \frac{1 - r^2}{n - 2} } . \]

It can be shown using the above formula for r that we always have \(-1 \le r \le +1\). In other words, the sample correlation coefficient can NEVER be smaller than -1 or greater than +1.

[Figure: a line (green) with a positive slope and a line (green) with a negative slope.]

Notes:
(1) r does NOT measure the slope of the line (referred to as the regression line or the line of best fit) that we are trying to fit to our data, apart from the sign.
(2) The size of r describes the strength of the correlation:
    If +0.7 < r < +1, then we have a strong positive correlation between the two variables.
    If +0.4 < r < +0.7, then we have a moderate positive correlation between the two variables.
    If 0.0 < r < +0.4, then we have a weak positive correlation between the two variables.
    If -1.0 < r < -0.7, then we have a strong negative correlation between the two variables.
    If -0.7 < r < -0.4, then we have a moderate negative correlation between the two variables.
    If -0.4 < r < 0.0, then we have a weak negative correlation between the two variables.
    If r is close to ZERO, then there is little to no correlation (or linear relationship) between the two variables.

Example: Calculate the linear correlation between the Midterm Marks (X) and the Final Marks (Y).

Does \(|r| \approx 1\) always imply a strong linear relationship between the two variables?

A common error is to interpret a strong correlation as cause and effect. Sometimes such a relationship does exist (Smokers and Physical Endurance), but in many cases no such causal relationship exists even if the correlation is strong (our midterm and final exam scores). In such situations, there usually are “hidden” variables linking the two quantities of interest.

Also note that if, say, r = 0.6, this does not mean that the independent variable(s) explain(s) 60% of the variability in the dependent variable.

Predicting One Variable From Another

Once we have determined that a correlation exists between the two variables of interest, we would like to draw a regression equation (or a line of best fit) through the data points.
One reason for determining the regression equation is that we can use it to predict the dependent variable given a specific value for the independent variable.

For two variables, the equation of the sample regression line is

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x , \]

where the ^ indicates a sample estimator,

\[ \hat{\beta}_1 = \frac{ n \sum x_i y_i - \sum x_i \sum y_i }{ n \sum x_i^2 - \left( \sum x_i \right)^2 } \qquad \text{and} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} . \]

Example: Calculate the regression line for the midterm and final exam data.

We will revisit correlation and linear regression later in the course.
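The sample correlation coefficient and the regression line share the same building blocks (the sums of x, y, xy, x squared, and y squared), so both can be checked with one short script. The following is a minimal sketch using the Law 30 midterm and final marks; the variable names are hypothetical, and the code follows the sum-based formulas for r and for the slope and intercept rather than any particular library routine.

```python
import math

midterm = [55, 60, 65, 85, 75, 70, 75, 65, 65, 85]   # X
final = [60, 70, 60, 75, 80, 85, 80, 70, 60, 80]     # Y
n = len(midterm)

sum_x = sum(midterm)
sum_y = sum(final)
sum_xy = sum(x * y for x, y in zip(midterm, final))
sum_x2 = sum(x * x for x in midterm)
sum_y2 = sum(y * y for y in final)

# Shared numerator and the two "n * sum of squares - square of sum" terms
numerator = n * sum_xy - sum_x * sum_y
x_term = n * sum_x2 - sum_x ** 2
y_term = n * sum_y2 - sum_y ** 2

r = numerator / math.sqrt(x_term * y_term)     # sample correlation coefficient
beta1 = numerator / x_term                     # slope of the regression line
beta0 = sum_y / n - beta1 * (sum_x / n)        # intercept: y-bar - slope * x-bar

print(f"r = {r:.4f}")
print(f"y-hat = {beta0:.4f} + {beta1:.4f} x")
print(f"predicted final mark for a midterm of 80: {beta0 + beta1 * 80:.1f}")
```

The slope and r have the same sign because they share a numerator, which is the sense in which r reflects the direction, but not the steepness, of the fitted line.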