CHAPTER 11. STATISTICAL ANALYSIS IN SCIENTIFIC RESEARCH

11.1 Basics of Statistics
Statistics is a branch of mathematics that deals with the collection,
organization, and analysis of numerical data and with such problems as experiment
design and decision making.
The term statistics has its origin in the Italian word statista (meaning "statesman"), but the direct ancestor of the modern term is the German word Statistik, first used by Gottfried Achenwall (1719-1772), a professor at Marburg and Gottingen. The word statistics itself was introduced into English by E. A. W. Zimmermann. However, even before the eighteenth century people were recording and using data. The popularity of statistics grew with Sir John Sinclair and his Statistical Account of Scotland, which covers the period 1791-99.
There are various statistical techniques that can be applied in every branch of public and private enterprise, but statisticians generally divide the field into two main parts: descriptive statistics and inferential statistics.
In short, in descriptive statistics there is no generalization from sample to population (the concepts of sample and population are discussed below). We simply describe a data set with tables, charts, or graphs, without implying any generalization to other data or to a population. In inferential statistics, on the other hand, there is a generalization from sample to population: the conclusions drawn go beyond the data at hand. Such a generalization may not be true and valid, so the statistician should specify how likely it is to be true, because it is based on estimation. Inferential statistics is also called statistical inference. Statistical inference is also applied in decision theory, a branch of statistics, because the two are closely related: decisions are made under conditions of uncertainty, and statistical inference is therefore very useful in decision making.
11.2 Arranging Data: Data Array, Frequency Distributions, and Cross-Tabulations
Data are collections of any number of related observations. We can, for example, collect information about the number of students at Eastern Mediterranean University (EMU) in the Turkish Republic of Northern Cyprus (TRNC) and divide them into categories such as nationality, gender, and age group. A collection of data is called a data set, and a single observation in the data is called a data point. People can gather data from past records or by observation, and data on the past can be used to make decisions about the future. So data play a very important role in decision making.
Most of the time it is not possible to gather data for the whole population, so statisticians gather data from a sample and use this information to make inferences about the population that the sample represents. A population is a whole, whereas a sample is only a fraction of the population. Assume that there are currently 10,300 students at EMU and we want to evaluate the expectations and opinions of EMU students toward the university. It would be very hard to consider every student in the university, so we select a fraction of the total number. If we decide to take 15% of the total, the selected number of students is 1,545; this number is called the sample size. The total number of students (10,300), on the other hand, is called the population size.
A sample or a population can be collected randomly or non-randomly. When data are selected randomly, all observations have an equal chance of being included in the data regardless of their characteristics. When data are not selected randomly, the selection is biased with respect to some characteristic of the observations.
In order to use data efficiently for any purpose, we need to arrange them, and this arrangement can take various forms. Data that have not yet been arranged and analyzed are called raw data; they are still unprocessed by statistical methods.
Data Array The first form of arranging data is the data array. It is one of the simplest ways to present data: it arranges the values in ascending or descending order.
Table 11.1 Grades of Students

Raw Data:           88   78   15   65   55   76   30   64   45  100   96   17   47   32   33   68
Ascending Array:    15   17   30   32   33   45   47   55   64   65   68   76   78   88   96  100
Descending Array:  100   96   88   78   76   68   65   64   55   47   45   33   32   30   17   15
When we use a data array, we can immediately see the lowest and highest values in the data, we can divide the data into sections, and we can see whether a value appears more than once. But when we have large quantities of data, a data array is not very helpful, and we need to arrange the data by another method.
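As an aside, a data array is trivial to produce with software. A minimal sketch in Python, using the grades of Table 11.1:

    # Ascending and descending data arrays for the raw grades of Table 11.1
    raw = [88, 78, 15, 65, 55, 76, 30, 64, 45, 100, 96, 17, 47, 32, 33, 68]
    print(sorted(raw))                   # ascending array
    print(sorted(raw, reverse=True))     # descending array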
Frequency Distributions The second form of arranging data is the frequency distribution. It is one of the best-known ways of presenting data in statistics. It divides the data into classes with lower and upper limits, and it shows the number of observations that fall into each class. We can also express the frequency of each class as a fraction or percentage of the total number of observations, which gives the relative frequency distribution. Table 11.2 shows the frequency distribution and relative frequency distribution.
Table 11.2 Frequency Distribution of Student Grades

Class        Frequency   Relative Frequency
0 - 25           2             0.13
26 - 50          5             0.31
51 - 75          4             0.25
76 - 100         5             0.31
Total           16             1.00
As you will notice from the table, the sum of the relative frequencies over all classes is equal to 1.00, or 100%. It can never exceed 1.00, because the relative frequencies are the result of dividing the frequency of each class by the total. The classes in a frequency distribution are all-inclusive: all the data fit into one class or another. And the classes are mutually exclusive: no data point can fall into more than one class.
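This bookkeeping is easy to automate. Below is a minimal sketch in Python that rebuilds the frequency and relative frequency distributions of Table 11.2 from the raw grades of Table 11.1 (the values agree with the table up to rounding):

    # Frequency and relative frequency distribution of the grades in Table 11.1
    grades = [88, 78, 15, 65, 55, 76, 30, 64, 45, 100, 96, 17, 47, 32, 33, 68]
    classes = [(0, 25), (26, 50), (51, 75), (76, 100)]   # (lower, upper) limits

    n = len(grades)
    for lower, upper in classes:
        freq = sum(lower <= g <= upper for g in grades)
        print(f"{lower:>3}-{upper:<3}  frequency: {freq:2d}  relative: {freq / n:.2f}")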
Frequency distributions can be qualitative or quantitative, open-ended or closed-ended, and discrete or continuous. We can classify data according to quantitative characteristics such as age group, salary, or income level, or according to qualitative characteristics such as sex, occupation, or nationality. On the other hand, we can arrange the data in open-ended or closed-ended classes: in an open-ended classification scheme the last class in the frequency distribution has no upper limit. Lastly, the classes in a frequency distribution can be discrete or continuous. Discrete data consist of separate values that do not progress from one class to another without a break (e.g. 1, 2, 5, 10, 100), whereas continuous data progress from one class to another without a break (e.g. 1.1, 1.2, 22.5, 110.56).
You can see various types of frequency distributions below:
Table 11.3 Types of Frequency Distribution Tables

(a) Quantitative and discrete data with an open-ended class

Income level ($)   Frequency   Relative Frequency
0 - 500               15             0.30
500 - 1000            25             0.50
1000 -                10             0.20
TOTAL                 50             1.00

(b) Qualitative data

Gender     Frequency   Relative Frequency
Male          20             0.40
Female        30             0.60
TOTAL         50             1.00

(c) Continuous data with closed-ended classes

Student GPAs   Frequency   Relative Frequency
1.00 - 1.99       100            0.20
2.00 - 2.99       250            0.50
3.00 - 4.00       150            0.30
TOTAL             500            1.00
Cross-Tabulations The third form of arranging data is the "cross-tabulation", a two-way table that represents data with two separate characteristics along its row and column dimensions. Consider Table 11.4 (a) for the distribution of income level with respect to gender: gender is on the row dimension and income level on the column dimension. Table 11.4 (b) shows the same two-way distribution of income level with respect to gender, both in absolute numbers and in relative frequencies (percentages).
Table 11.4 (a) Cross-Tabulation of Income Level with respect to Gender

                     Income Level ($)
Gender      0-500   500-1000   1000-    Row Total
Male          7         9        4         20
Female        8        16        6         30
Column
Total        15        25       10         50

Table 11.4 (b) Cross-Tabulation of Income Level with respect to Gender
(each cell: count, row %, column %, total %)

                     Income Level ($)
Gender      0-500   500-1000   1000-    Row Total
Male          7         9        4         20
             35.0      45.0     20.0       40.0
             46.7      36.0     40.0
             14.0      18.0      8.0
Female        8        16        6         30
             26.7      53.3     20.0       60.0
             53.3      64.0     60.0
             16.0      32.0     12.0
Column       15        25       10         50
Total        30.0      50.0     20.0      100.0
Interpretation of these two-way tables is essential in statistics, especially in scientific research and in decision making. On the basis of Table 11.4 (b), for example, the sample size is 50; 35% of the males in this sample have an income level between 0 and $500, which corresponds to 7 persons out of the total of 20 males. Likewise, 46.7 percent of those who have an income level between 0 and $500 are male, which corresponds to 7 persons out of the 15 persons in that income class. Lastly, 14% is the fraction of males with an income level between 0 and $500 out of the total sample size of 50. The total number of males (20) constitutes 40% of the sample (n = 50), and the total number of those with an income level between 0 and $500 constitutes 30% of the sample.
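Cross-tabulations like Table 11.4 can be produced directly from unit-record data. Below is a hedged sketch with pandas; the records are reconstructed here from the cell counts of Table 11.4 (a), and the column names are illustrative assumptions, not from the original study:

    import pandas as pd

    # Rebuild the 50 individual records from the cell counts of Table 11.4 (a)
    cells = {("Male", "0-500"): 7, ("Male", "500-1000"): 9, ("Male", "1000-"): 4,
             ("Female", "0-500"): 8, ("Female", "500-1000"): 16, ("Female", "1000-"): 6}
    rows = [(g, inc) for (g, inc), count in cells.items() for _ in range(count)]
    df = pd.DataFrame(rows, columns=["gender", "income"])

    # Absolute counts with row/column totals, as in Table 11.4 (a)
    print(pd.crosstab(df["gender"], df["income"], margins=True))

    # Row percentages (e.g. 35% of males earn $0-500), as in Table 11.4 (b)
    print(pd.crosstab(df["gender"], df["income"], normalize="index") * 100)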
For large amounts of data it is very hard and time-consuming to organize and arrange data with frequency distributions or cross-tabulations by hand. Nowadays, computer packages, especially SPSS (Statistical Package for the Social Sciences), make it very easy to create these types of tables. We will study these subjects in the following chapters.
11.3 Using Graphs to Describe Distributions
We can represent the distribution of a data set (especially a frequency distribution) in various forms of graphs. Graphs of distributions usually have two dimensions, X and Y: the values or characteristics of the variables are placed on the X-axis, and their frequencies, in absolute or relative terms, on the Y-axis. Graphs with relative frequencies are often more useful because they attract more attention from the reader and are easier to understand and to base decisions on. Nowadays, there are advanced computer packages that are effective for drawing graphs; we will discuss these in later chapters. Figure 11.1 includes a few examples of the types of graphs available in Microsoft Excel ’97 for Windows.
Figure 11.1 Types of Graphs: (a) column bar graph, (b) line graph, (c) pie charts, (d) XY (scatter) graphs
11.4 Measures of Central Tendency and Dispersion
After data have been collected and tabulated, analysis begins with the calculation of single numbers that summarize or represent all the data, called summary statistics. We use summary statistics to describe the characteristics of a data set, and nowadays almost every statistical package provides summary statistics in its output.
Two kinds of summary statistics are especially important for decision-makers: central tendency and dispersion. Before we get into the details of these two concepts, let us define them briefly.
Central Tendency Because data often exhibit a cluster around a central point, a number describing that point is called a measure of central tendency: it refers to the central or middle point of a distribution. Measures of central tendency are also known as measures of location. We can show the concept of central tendency in a graph:
Figure 11.2 Central tendency for three types of distribution (curves A, B, and C)
It is clear from the figure that the central locations of curves A and C are equal to each other, while the central location of curve B lies to the right of those of curves A and C.
Dispersion Dispersion refers to the spread of the data in a distribution. Notice in Figure 11.2 that curve B has a wider spread, or dispersion, than A and C, and curve C has a wider spread than curve A.
Besides central tendency and dispersion, an investigator may benefit from two other measures of a data set: skewness and kurtosis.
Skewness The curve of a distribution may be either symmetrical or skewed. In a symmetrical curve, the area is divided into two equal parts by a vertical line drawn from the peak of the curve to the horizontal axis. For example, we know that the total of a relative frequency distribution is equal to 1.00; in a symmetrical curve, 50% of the data lie on the left-hand side of the curve and the other 50% on the right-hand side.
Figure 11.3 Symmetrical curve (50% of the area on each side of the peak)
On the other hand, curves A and B in Figure 11.4 are skewed curves: their frequency distributions are concentrated at either the low end or the high end of the measuring scale on the horizontal axis. Curve A is said to be positively skewed, and curve B negatively skewed.
Figure 11.4 Positively and negatively skewed curves. Curve A: skewed to the right; Curve B: skewed to the left
Kurtosis Kurtosis is the peakedness of a distribution. Notice in Figure 11.5 that the two curves have the same central location and dispersion, and both are symmetrical, but curve A is said to be more peaked than curve B.
Figure 11.5 Measure of degree of kurtosis (curve A more peaked than curve B)
Measures of Central Tendency
In statistics, the arithmetic mean, the weighted mean, the geometric mean, the median, and the mode are referred to as the measures of central tendency. We will consider the arithmetic mean first.
The Arithmetic Mean
The arithmetic mean is the simple average of a data set. We can calculate the average age in a class, the average monthly expenditure of students at EMU, the average number of tourists coming to TRNC each year, and so on.
The arithmetic mean of a population is represented by the symbol µ, and that of a sample by x̄. The formulas for µ and x̄ are:

Population: µ = ∑X / N,   where N represents the population size
Sample:     x̄ = ∑x / n,   where n represents the sample size
Table 11.4 provides the ages of students in a class. In this case, we assume that the data represent a sample drawn from the whole university.
Table 11.4 Ages of Students in a Class

ID     Name         Age
1.     Ali           25
2.     Veli          24
3.     Ayla          23
4.     George        24
5.     Mohammed      22
6.     Asher         26
7.     Samah         25
8.     Ayse          27
9.     Mahmut        26
10.    John          28
Now, let's calculate the arithmetic mean for this ungrouped data:

x̄ = ∑x / n = (25 + 24 + 23 + 24 + 22 + 26 + 25 + 27 + 26 + 28) / 10 = 250 / 10

So the arithmetic mean of the ages in the class is x̄ = 25.
But what if the data are grouped? In grouped data, we do not know the separate values of each observation, so we are only able to estimate the mean. In ungrouped data, by contrast, we know all the observations, so whatever mean we find from the data is the actual value.
To calculate the arithmetic mean of grouped data we use the following formula:

x̄ = ∑(f × x) / n

where
• x̄ = sample mean
• ∑ = summation
• f = number of observations in each class
• x = midpoint of each class
• n = sample size
Let's look at the following frequency distribution of student GPAs, which is at the same time a grouped data set.
Table 11.5 Frequency Distribution of Student GPAs

Student GPAs   Frequency   Relative Frequency
1.00 - 1.99       100            0.20
2.00 - 2.99       250            0.50
3.00 - 4.00       150            0.30
TOTAL             500            1.00
The first step in calculating the arithmetic mean is to find the midpoint (x) corresponding to each class. To find a midpoint, we add the lower limit of the class to the lower limit of the following class and divide by two. For example, the midpoint of the first class is (1.00 + 2.00)/2 = 1.5. This process continues until we reach the last class interval. Then we multiply each midpoint by the corresponding absolute frequency and add the products up. Lastly, we divide this sum by the total number of observations in the data. This exercise is shown in Table 11.6.
Table 11.6 Arithmetic Mean for Student GPAs

Student GPAs   Frequency   Midpoint (x)    f × x
1.00 - 1.99       100        × 1.50          150
2.00 - 2.99       250        × 2.50          625
3.00 - 4.00       150        × 3.50          525
TOTAL             500                      1,300

x̄ = ∑(f × x) / n = 1,300 / 500 = 2.6
So our approximated, or estimated, mean for the student GPAs from the grouped data is 2.6. A useful practice is to round the midpoints to convenient whole numbers (or whole cents, for money data) for easier calculation.
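A minimal sketch of this grouped-mean computation in Python, using the frequencies and midpoints of Table 11.6:

    # Estimated mean from grouped data: sum of frequency x midpoint, over n
    freqs = [100, 250, 150]          # class frequencies from Table 11.6
    midpoints = [1.5, 2.5, 3.5]      # class midpoints (rounded as in the text)

    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, midpoints)) / n
    print(mean)                      # 2.6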
Today, statistical packages produce these frequency distributions ready-made, and computers calculate the arithmetic mean from the original data, so the grouped-data formula is often unnecessary in practice.
The arithmetic mean is the best known and most frequently used measure of central tendency. One of its most important uses is that it allows easy comparison between different data sets.
The arithmetic mean has two important disadvantages. Firstly, it is affected by extreme values. Secondly, it cannot be calculated for grouped data with open-ended classes.
The Median
In its simplest meaning, the median divides a distribution into two equal parts. It is a single number representing the most central, or middlemost, item in the data: half of the data lie below this number, and the other half lie above it.
In order to calculate the median of ungrouped data, we first array the data in ascending or descending order. If we have an odd number of observations, the median is the most central item in the data. Let's consider the following simple data in Table 11.7:
Table 11.7 Graduated Students in Each Year

Year    No of Students
1991         10
1992         15
1993         13
1994         14
1995         17
Firstly, let's array the data in ascending order:

10,  13,  14,  15,  17

In this case, the most central item of this odd-numbered data set is 14, which is at the same time the median of the data set.
Another way of finding the median is to use the following formula:

Median = the ((n + 1) / 2)th item in the data array

where n represents the number of items in the data. If we apply this formula to the above data:

Median = the ((5 + 1) / 2 = 3)rd item in the data, which corresponds to 14.

This formula is particularly useful for even-numbered data, where it takes the average of the two middle items in the data.
In order to calculate the median of even-numbered data, we need to take the average of the two middlemost items, since there is no single most central item in the data set. So we use the formula above. Now let's extend Table 11.7 to 1996 and try to calculate the median; in this case the number of observations is 6 (1991-1996).
Table 11.7 Graduated Students in Each Year

Year    No of Students
1991         10
1992         15
1993         13
1994         14
1995         17
1996         21
Again we have to sort the data in ascending order:

10,  13,  14,  15,  17,  21
 6 +1
From the formula, median is 
 = 3.5 th item in the data which is
 2 
 14 + 15 
included between 14 and 15. And the average of 14 and 15 is 
 = 14.5 . That
 2 
is the median of this data set. So the median number of graduated students for the
period of 1991-96 is 14.5.
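Both cases are handled by Python's statistics module, which applies exactly this rule (middle item for odd n, average of the two middle items for even n):

    import statistics

    print(statistics.median([10, 13, 14, 15, 17]))       # odd n: 14
    print(statistics.median([10, 13, 14, 15, 17, 21]))   # even n: 14.5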
For grouped data, we have to find an estimated value for the median, which falls within some class interval: we do not know the individual observations, only the frequency distribution with its class intervals. The formula for the median of grouped data is given below:
m̃ = L + [ ((n + 1)/2 − (F + 1)) / f ] × w

where
• m̃ = the median estimated from the sample distribution
• L = the lower limit of the class interval containing the median
• F = the cumulative sum of the frequencies up to, but not including, the median class
• f = the frequency of the class interval containing the median
• w = the width of the class interval containing the median
• n = total number of observations in the data

In cases where we work with the population, m̃ is replaced by Md and n by N.
Let's consider Table 11.5 from the previous examples and try to find the median for these data:

Table 11.8 Finding the Median for Student GPAs

Student GPAs   Frequency
1.00 - 1.99       100
2.00 - 2.99       250   ← Median class
3.00 - 4.00       150
TOTAL             500
The first step is to find the class interval that contains the median. The median is the ((500 + 1)/2 = 250.5)th item in the data. Secondly, we have to find the class interval in which the 250.5th item is included. To do that, we add the frequencies together from the beginning until the cumulative sum reaches 250.5, and then we stop. In these data the median falls into the class (2.00-2.99), because 100 + 250 = 350 and we have already passed 250.5. So the median class is (2.00-2.99).
Now we put the values into the formula:

m̃ = 2.00 + [ ((500 + 1)/2 − (100 + 1)) / 250 ] × 1.00 = 2.598
So the median value of the student GPAs is ≈ 2.60. It is an estimated sample median, since the data are grouped.
Unlike the mean, the median is not affected by extreme values in the data. It can be calculated even for open-ended grouped data, unless the median falls into the open-ended class.
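Below is a hedged sketch of the grouped-data median formula above; the function name and argument layout are illustrative, not from the text:

    def grouped_median(lowers, freqs, widths):
        """Estimate the median from class lower limits, frequencies, and widths."""
        n = sum(freqs)
        target = (n + 1) / 2               # position of the median item
        cum = 0                            # cumulative frequency F
        for L, f, w in zip(lowers, freqs, widths):
            if cum + f >= target:          # the median falls in this class
                return L + ((n + 1) / 2 - (cum + 1)) / f * w
            cum += f

    # Table 11.8: GPA classes 1.00-1.99, 2.00-2.99, 3.00-4.00
    print(grouped_median([1.00, 2.00, 3.00], [100, 250, 150], [1, 1, 1]))  # 2.598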
The Mode
The mode is the value or observation that occurs most frequently in the data. If two or more distinct observations occur with equal frequencies, and none with greater frequency, the set of observations may be said to have no single mode, or to be bimodal (with modes at the two most frequent observations) or trimodal (with modes at the three most frequent observations). When a single value occurs most frequently, the distribution is unimodal.
In order to find the mode of ungrouped data, we again array the data in ascending or descending order. Let's consider the following ungrouped data, which represent the final exam marks of 35 students in a class.
Table 11.9 Student Marks in the Final Exam (arrayed in ascending order)

10   21   35   60   79   89   96
12   23   42   65   81   90   97
16   23   48   67   83   91   98
19   23   50   68   85   93   99
20   30   56   76   87   94   99
It is clear that the most frequently repeated observation, or student mark, is 23: it occurs 3 times, so the mode of these ungrouped data is 23 and the distribution is unimodal. (As we can observe from the data, 99 occurs only 2 times.) Now let's consider the following table of student marks again:
Table 11.10 Student Marks in the Final Exam (arrayed in ascending order)

10   21   35   60   79   89   96
12   23   42   65   81   90   97
16   23   48   67   83   91   98
19   23   50   68   83   93   99
20   30   56   76   83   94   99
This time we have changed the observations, and there are two values that occur most frequently, 23 and 83, each repeated 3 times. The modes of these data are 23 and 83, and the distribution is called bimodal.
Lastly, if three observations in a data set share the highest frequency, the distribution is trimodal. Let's make one more change to the previous table:
Table 11.11 Student Marks in the Final Exam (arrayed in ascending order)

10   21   35   60   79   90   96
12   23   42   65   81   90   97
16   23   48   67   83   90   98
19   23   50   68   83   93   99
20   30   56   76   83   94  100
This time we have three observations that occur most frequently: 23, 83, and 90, each repeated 3 times. However, a generally accepted convention is that when two or more observations in a distribution share the highest frequency, the distribution is simply called bimodal.
When we have grouped data, we assume that the mode is located in the class interval with the highest frequency, called the modal class. To find the mode of grouped data, we use the following formula:

M0 = LM0 + [ (fm − fb) / ((fm − fb) + (fm − fa)) ] × w

where M0 = the mode of the frequency distribution or grouped data
• LM0 = lower limit of the modal class
• fm = frequency of the modal class
• fb = frequency of the class interval below the modal class
• fa = frequency of the class interval above the modal class
• w = the width of the modal class
Let's apply this formula to find the mode of the following frequency distribution of student GPAs:

Table 11.12 Finding the Mode for Student GPAs

Student GPAs   Frequency
1.00 - 1.99       100
2.00 - 2.99       250   ← Modal class
3.00 - 4.00       150
TOTAL             500
As we can see from the table, the modal class of this frequency distribution is 2.00-2.99, since it has the highest frequency. Now we put the values into the formula:

M0 = 2.00 + [ (250 − 100) / ((250 − 100) + (250 − 150)) ] × 1.00 = 2.60

So the mode of this frequency distribution is 2.60. Since these are grouped data and we do not know every observation, 2.60 is an estimate of the mode.
Like the median, and unlike the mean, the mode is not affected by extreme values in the data, and we can use it even with open-ended class intervals.
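A minimal sketch of the grouped-data mode formula, with variable names following the text (fm, fb, fa, w); the function itself is illustrative:

    def grouped_mode(lower, fm, fb, fa, w):
        # lower limit of the modal class plus a weighted fraction of its width
        return lower + (fm - fb) / ((fm - fb) + (fm - fa)) * w

    # Modal class 2.00-2.99: fm = 250, class below fb = 100, class above fa = 150
    print(grouped_mode(2.00, 250, 100, 150, 1.00))   # 2.6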
Comparison of the Mean, the Median, and the Mode
Among these three measures of central tendency, the mean is the most popular and most usable one, and the mean and the median are generally preferred to the mode. Often the data contain no mode at all, because no value occurs more than once. Still, how frequently each of the three measures is used depends on the conditions and the area of research in which it is applied.
On the other hand, we can compare these measures of central tendency with respect to their positions in a distribution. When a distribution is symmetrical, the mean, the median, and the mode are all equal. Figure 11.6 shows this relationship:
Figure 11.6 Mean, median, and mode in a symmetrical distribution (all three coincide)
In this case there is no reason to prefer one measure over another, since they are equal to each other. But what about a skewed distribution? Figure 11.7 shows the positions of the three measures of central tendency when the distribution is skewed to the right and to the left:
Figure 11.7 Mean, median, and mode in skewed distributions. Curve A, skewed to the right: mode, median, mean in increasing order; Curve B, skewed to the left: mean, median, mode in increasing order
When the distribution is skewed, the median is the preferable measure of central tendency, because it lies between the mean and the mode in both positively and negatively skewed distributions.
Measures of Dispersion
When we compare two or more distributions using only measures of central tendency, we may not be satisfied: we need more information about these distributions. Knowing the means of two data sets, for example, may not be enough to compare them; we may need to know which distribution is more consistent than the other. The variability, or dispersion, is a useful measure here, since it measures the spread of the observations in the data around their mean. As the dispersion of the data decreases, the consistency and reliability of the data increase, and the central location (mean, median, or mode) becomes more representative of the data as a whole.
The concept of dispersion also plays an important role in business life. For example, a financial manager may be concerned with the earnings of firms: widely dispersed earnings indicate higher risk, because earnings that vary widely around their mean are inconsistent.
Figure 11.8 shows the spread of three curves that have the same mean. Although they have the same central location, curve A has the least spread and curve C the widest. So the distribution of curve A is said to be the most consistent and reliable of the three.
Figure 11.8 Measure of dispersion for three curves (A, B, C) having the same mean
Range, Interfractile Range and Interquartile Range
These are distance measures of dispersion. The range is the difference between the highest and the lowest values in a data set:

Range = Highest value − Lowest value

The interfractile range is the difference between two fractiles. The commonly used fractiles are:

Third fractiles = divide the data into 3 equal parts
Quartiles       = divide the data into 4 equal parts
Deciles         = divide the data into 10 equal parts
Percentiles     = divide the data into 100 equal parts
Let's consider the following data on student grades:

52   35   48   46   43   40   61   49   57   58   65   46
72   69   38   37   55   52   50   31   41   60   45   41
55   38   51   49   46   43   64   52   60   61   68   49
69   66   35   34   52   49   47   28   38   57   42   38
As a first example, let's divide the data into thirds and find the interfractile range between the 1/3 and 2/3 fractiles. How do we organize the data into three equal parts? We first order the data from the lowest to the highest and then split them. Since the sample size, n, is 48, we divide 48 by 3 and get 16, which means the sorted data can be arranged in a 3 × 16 layout:
       1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
1     28  31  34  35  35  37  38  38  38  38  40  41  41  42  43  43
2     45  45  46  46  47  48  49  49  49  49  50  51  52  52  52  52
3     55  55  57  57  58  60  60  61  61  64  65  66  68  69  69  72

So the 1/3 fractile = 43, the 2/3 fractile = 52, and the 3/3 fractile = 72.
The interfractile range between the 1/3 and 2/3 fractiles is then 52 − 43 = 9.
As a second example, what is the interfractile range between the 30th and 70th percentiles? The 30th percentile is 30% of 48 = 14.4 ≈ the 14th element in the data, and the 70th percentile is 70% of 48 = 33.6 ≈ the 34th element. The 14th element is 42 and the 34th element is 55, so the interfractile range is 55 − 42 = 13.
As a third example, let's find the interquartile range, which is the difference between the first and third quartiles; quartiles divide the data into 4 equal parts. Since 48/4 = 12, the sorted data can be arranged in a 4 × 12 layout:

       1   2   3   4   5   6   7   8   9  10  11  12
1     28  31  34  35  35  37  38  38  38  38  40  41
2     41  42  43  43  45  45  46  46  47  48  49  49
3     49  49  50  51  52  52  52  52  55  55  57  57
4     58  60  60  61  61  64  65  66  68  69  69  72

So:
1st quartile = 1/4 = 41
2nd quartile = 2/4 = 49
3rd quartile = 3/4 = 57
4th quartile = 4/4 = 72

And: Interquartile Range = Q3 − Q1 = 57 − 41 = 16
If, instead, we want the range between the 1/4 and 2/4 fractiles, it is 49 − 41 = 8. And the 30th and 70th percentiles are the same values as in the previous example, 42 and 55, with range 55 − 42 = 13: no matter whether we arrange the data into 3, 4, or any other number of equal parts, the percentiles keep the same values.
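These distance measures can be checked with NumPy, with one caveat: np.percentile interpolates between observations, so its quartiles need not coincide exactly with the simple end-of-each-quarter convention used above:

    import numpy as np

    grades = np.array([
        28, 31, 34, 35, 35, 37, 38, 38, 38, 38, 40, 41,
        41, 42, 43, 43, 45, 45, 46, 46, 47, 48, 49, 49,
        49, 49, 50, 51, 52, 52, 52, 52, 55, 55, 57, 57,
        58, 60, 60, 61, 61, 64, 65, 66, 68, 69, 69, 72])

    print(grades.max() - grades.min())        # range: 72 - 28 = 44
    q1, q3 = np.percentile(grades, [25, 75])
    print(q3 - q1)                            # interquartile range, about 16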
Variance and Standard Deviation
Variance and, especially, standard deviation are the most commonly used statistical measures of dispersion. They specify the average distance of an observation in a data set from the mean of the data. We might first try to measure the average deviation of the observations from the mean with the following formula:

Average deviation = ∑(Xi − µ) / N

where
• Xi = observations in the population
• µ = population mean
• N = population size

But when we use this formula, we find that the sum of the deviations, and therefore the average deviation, is always equal to zero. To prevent this problem, we square each deviation, which leads to the variance and the standard deviation.
The standard deviation is the square root of the variance. It is more widely used than the variance in statistical analyses, because the variance does not express the average dispersion in the original units but in squared units; taking the square root of the variance transforms it back, so the standard deviation measures the average dispersion of the data in the original units of measurement. The variance and the standard deviation of a population are given by:

Variance:             σ² = ∑(Xi − µ)² / N

Standard deviation:   σ = √σ² = √( ∑(Xi − µ)² / N )
However, most of the time it is not possible to know all the observations in the population, so we adapt the population formula to sampling units. To calculate the standard deviation of a given sample, we use:

s = √( ∑(xi − x̄)² / (n − 1) )

where
• xi = each sample unit in the distribution
• x̄ = sample mean
• n − 1 = sample size minus 1
What is the reason for using n − 1? It can be shown that if we select many different samples from a population, compute the standard deviation of each sample with n in the denominator, and take the average of these values, this average does not tend to equal the population standard deviation. In order to prevent this bias, we use n − 1 as the denominator.
Now let's calculate the standard deviation of student CGPAs for a randomly selected sample of 15 students.
Table 11.13 Calculating the Variance and Standard Deviation for an Ungrouped Sample of Student CGPAs

CGPA      x − x̄     (x − x̄)²
3.45       0.24       0.060
3.55       0.34       0.118
3.01      -0.20       0.038
3.56       0.35       0.125
3.84       0.63       0.402
2.99      -0.22       0.047
2.78      -0.43       0.181
3.33       0.12       0.015
3.52       0.31       0.099
3.25       0.04       0.002
3.02      -0.19       0.035
3.11      -0.10       0.009
3.16      -0.05       0.002
3.43       0.22       0.050
2.09      -1.12       1.245
Sum:       0.00       2.43

x̄ = 3.21

s = √( 2.43 / (15 − 1) ) = √0.17357 = 0.41655   (sample standard deviation)
The standard deviation (s) of this sample of student CGPAs is approximately 0.42, showing that, on average, each observation in the sample deviates from the mean (x̄ = 3.21) by 0.42, downwards or upwards. The variance of this sample (s²) is 0.17. As you can observe from the table, the sum of the raw deviations of the observations is equal to zero, which is why we square each deviation before adding them up.
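A minimal sketch of the same computation with Python's statistics module, which divides by n − 1 exactly as in the sample formula above:

    import statistics

    cgpa = [3.45, 3.55, 3.01, 3.56, 3.84, 2.99, 2.78, 3.33,
            3.52, 3.25, 3.02, 3.11, 3.16, 3.43, 2.09]

    print(statistics.mean(cgpa))       # about 3.21
    print(statistics.variance(cgpa))   # s^2, about 0.17
    print(statistics.stdev(cgpa))      # s, about 0.42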
Calculating Variance and Standard Deviation by Using Grouped Data
Up to this point we have discussed the variance and the standard deviation of ungrouped data, that is, of unprocessed raw data. But what if the data are grouped? Then we need a different formula. Since the standard deviation (σ) is the square root of the variance (σ²), we will work directly with the standard deviation. The formulas for grouped data are:

σ = √σ² = √( ∑f(xi − µ)² / N )        for a population

s = √s² = √( ∑f(xi − x̄)² / (n − 1) )   for a sample

Here xi represents the midpoint of each class interval, and f the frequency of each class.
Table 11.14 Standard Deviation for Grouped Data of Student GPAs

Student GPAs   Frequency   Midpoint (x)   xi − x̄   (xi − x̄)²   f·(xi − x̄)²
1.00 - 1.99        0           1.50        -1.80      3.24          0
2.00 - 2.99        3           2.50        -0.80      0.64          1.92
3.00 - 4.00       12           3.50         0.20      0.04          0.48
Total             15                                                2.40

x̄ = ∑(f · x) / n = ((0 × 1.5) + (3 × 2.5) + (12 × 3.5)) / 15 = 49.5 / 15 = 3.30

s = √( 2.40 / (15 − 1) ) = √0.17 = 0.414
An important point deserves mention here. Since we do not know every single observation in grouped data, we use the class midpoints as approximations of the real observations: we multiply each squared deviation of a midpoint from the mean by the corresponding frequency, add the products, and divide by N (for a population) or n − 1 (for a sample). So a standard deviation or variance computed from grouped data is an approximated, or estimated, value. For ungrouped data, by contrast, we know every single observation, so whatever we calculate is a real value.
A Relative Measure of Dispersion: The Coefficient of Variation
The standard deviation and the variance are absolute measures of dispersion. The coefficient of variation (CV), on the other hand, is a relative measure of dispersion that expresses the standard deviation as a percentage of the mean. Using the CV, we can easily compare the dispersions of two data sets in percentage terms. The formulas for the CV are:

CV = (σ / µ) × 100      for a population

CV = (s / x̄) × 100      for a sample
Let's consider the following example to better understand the use of the CV. Suppose that the common stock of Sabanci Inc. sold at an average of $50,000 per share with a standard deviation of $5,000 over the period 1990-1996, while Koc Inc. sold its common stock at an average of $60,000 per share with a standard deviation of $5,800 over the same period. The CVs of the two firms are:
CV(Sabanci) = (s / x̄) × 100 = (5,000 / 50,000) × 100 = 10%

CV(Koc)     = (s / x̄) × 100 = (5,800 / 60,000) × 100 = 9.66%
On the basis of these results, although Sabanci Inc. has less absolute variation in its common stock (standard deviation s = $5,000) than Koc Inc., it has more relative variation. This is because of the significant difference in their means.
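A minimal sketch of this comparison:

    def cv(std, mean):
        return std / mean * 100      # standard deviation as a percentage of the mean

    print(cv(5_000, 50_000))         # Sabanci: 10.0
    print(cv(5_800, 60_000))         # Koc: about 9.67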
11.5 Statistical Inference: Estimation and Hypothesis Tests
Statistical inference, estimation, and hypothesis testing are three important and closely related concepts in statistics. Statistical inference, defined in Section 11.1, deals with uncertainty by using probability concepts in decision making; it is based on estimation and comprises both estimation and hypothesis testing. We will start with estimation.
Estimation
When we deal with uncertainty, we have to make estimates. In statistics, we use sample statistics to estimate population parameters. Generally, there are two types of estimates: a point estimate and an interval estimate. A point estimate is a single value, a sample statistic, used to estimate an unknown population parameter; being a single number, it does not provide much information by itself. An interval estimate, on the other hand, is a range of values within which the population parameter is expected to fall.
The sample statistics used to estimate population parameters are called estimators. For example, x̄, the sample mean, is the estimator of the population mean µ, and s, the sample standard deviation, is the estimator of the population standard deviation σ. The observed values of the estimators are called estimates: if x̄ = 23, then x̄ is the estimator and 23 is the estimate of the true population mean.
An Alternative Way For Hypothesis Tests: Using Prob Values (p-values)
Recall that α is the predetermined significance level, the probability of rejecting a true null hypothesis, called a type I error. Selecting the level of α is up to the researcher, and the generally accepted rule is to consider the trade-off between α and β (the probability of a type II error). If the cost of making a type I error is relatively high for the researcher, he or she will want to avoid type I errors and will select a low level of α. Conversely, if the cost of making a type II error is relatively high, he or she will want to avoid type II errors and will select a high level of α.
On the other hand, the probability, computed from the standardized test statistic, of obtaining a result at least as extreme as the observed one when the null hypothesis is true is called the prob value (p-value). It is found directly from the z-table by using the z formula. Let's consider the following example:

Ho: µ = 15
Ha: µ ≠ 15

with σ = 2.1, n = 20, and x̄ = 13.6.
In this simple example, we have a two-tailed hypothesis on whether the mean of the population is equal to 15 or not. We are also given the population standard deviation, the sample size selected for the test, and the sample mean. The probability of observing a sample mean this far from 15, in either direction, is the prob value, so the prob value is the sum of the probabilities in both rejection tails. Let's find the prob value now.
Firstly, we have to find the standard error of the mean:

σx̄ = σ / √n = 2.1 / √20 = 0.47

The next step is to find the z score for x̄:

z = (x̄ − µ) / σx̄ = (13.6 − 15) / 0.47 = −2.98
Figure 11.9 Prob values in the normal curve: areas of 0.0014 in each tail beyond z = −2.98 and z = +2.98, and 0.4986 between 0 and each of these z scores
In this example, the p-value for the test is 2 × 0.0014 = 0.0028, so the standardized probability of accepting the alternative hypothesis is 0.28%.
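A hedged sketch of this two-tailed prob-value computation, using scipy.stats for the normal tail areas instead of a printed z-table:

    from math import sqrt
    from scipy.stats import norm

    mu0, sigma, n, xbar = 15, 2.1, 20, 13.6

    se = sigma / sqrt(n)          # standard error of the mean, about 0.47
    z = (xbar - mu0) / se         # about -2.98
    p = 2 * norm.sf(abs(z))       # two-tailed p-value, about 0.0029
    print(z, p)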
Now let's continue the test, selecting a significance level of α = 0.05. Figure 11.10 shows how α and the p-value are used together to test the hypothesis.
Figure 11.10 Use of prob values in testing the hypothesis: the z critical values ±1.96 bound the acceptance region at α = 0.05; the computed z = −2.98 falls in the rejection region
As you can see from Figure 11.10, the computed z score falls outside the region bounded by the z critical values, so we reject Ho and accept Ha: the true mean of the population is not equal to 15.
One more conclusion can be drawn from the above discussion. The p-value for the example is 0.0028, which is lower than α = 0.05. In general:

p-value > α   →   we accept Ho
p-value < α   →   we reject Ho and accept Ha

This rule holds not only for two-tailed tests but also for one-tailed tests.
11.6 Chi-Square and Analysis of Variance (ANOVA) Tests
Chi-square and ANOVA tests are two statistical techniques used in hypothesis testing. Usually, we use the chi-square test as a test of independence between two or more variables and of the goodness of fit of a particular probability distribution, and ANOVA as a test of difference between two or more population means. Let's consider these tests in more detail.
Chi-Square Test for Independence
Two-way tables (cross-tabulations) play an important role in carrying out and evaluating the chi-square test. If we conduct a chi-square hypothesis test in SPSS, the output provides the chi-square statistic, the degrees of freedom (df), and the significance level together with the table. To carry out a chi-square test we first need the computed value of the chi-square statistic (χ²):

χ² = ∑ (f0 − fe)² / fe

where
• χ² : chi-square statistic
• f0 : observed frequency in the distribution
• fe : expected frequency in the distribution
But how do we find fe? The following formula is used to calculate it:

fe = (rt × ct) / n

where
• rt : row total of the corresponding frequency cell
• ct : column total of the corresponding frequency cell
• n : total number of observations (sample size)
Secondly, we need to determine a significance level for the hypothesis test (recall that the significance level is the level of error accepted in the test). This might be 0.05 or 0.10; the level is up to the researcher.
Lastly, we need to find the table value of the chi-square statistic. To do that, we compute the degrees of freedom (df):

df = (r − 1) × (c − 1)

where
• df : degrees of freedom
• r : number of rows in the table
• c : number of columns in the table
We can then find the table value of the chi-square statistic from the chi-square distribution table by looking up the df and the significance level. If the null hypothesis is true, the sampling distribution of χ² can be approximated by a continuous curve known as the chi-square distribution. There is a different chi-square distribution for each level of df: the degrees of freedom increase as the column and/or row dimensions, and/or the number of variables in the test, increase. As df increases, the chi-square distribution becomes more symmetrical; with small df it is skewed to the right, as you can observe in Figure 11.11.
Figure 11.11 Chi-square distributions for 1, 5, and 10 degrees of freedom
Carrying Out a Hypothesis Test by Using Chi-Square
Figure 11.12 is a representative graph of the chi-square distribution as used in hypothesis testing. The shaded area in the right tail is the significance level, the level of error accepted for a true null hypothesis, i.e. the probability of rejecting a true null hypothesis. The left-hand side contains the confidence level for the null hypothesis and shows the probability of accepting a true null hypothesis.
Figure 11.12 Representative graph of the χ² distribution for a hypothesis test: acceptance region (C.L. = 0.90) to the left of the table value of χ², rejection region (α = 0.10) to the right
The intersection point of the acceptance and rejection regions corresponds to the table value of the chi-square statistic. If the computed value of the chi-square statistic falls into the acceptance region (i.e. if the computed value is less than the table value), the null hypothesis is accepted; otherwise it is rejected and the alternative hypothesis is accepted.
To understand this better, let's solve a problem on the chi-square test of independence. In line with the aim of this book, we will mostly work with computer-based output in these types of problems; the reader can refer to any statistics book for the theoretical computation of the formulas. Table 11.4 shows the evaluation of the teaching ability of lecturers by faculty.
The frequency in bold characters in each cell of Table 11.4 represents the expected frequency corresponding to each observed frequency. Recall from the fe formula that rt in Table 11.4 is equal to 10 for the 1st, 2nd, and 3rd rows and 20 for the 4th row; ct is equal to 4 for the 1st column, 18 for the 2nd column, 23 for the 3rd column, 5 for the 4th column, and 0 for the 5th column.
Each row or column total gives the proportion of the corresponding row or column variable in the total number of observations. For example, rt for the 1st row is 10, the total number of B&E students out of n (= 50); its proportion of n is 0.20 (20%), which is 10/50. ct for the 2nd column is 18, the total number of students who rated the teaching ability of lecturers as High; its proportion of n is 0.36 (36%), which is 18/50. Combining rt and ct gives the expected frequency of each cell, fe = (rt × ct) / n.
Now let's continue with our exercise.
Table 11.4 Evaluation of teaching ability of lecturers by faculty
(each cell: observed count with expected frequency in parentheses;
below: row %, column %, total %)

Faculty    Very High    High        Medium      Poor       Very Poor   Row Total
B&E        1 (0.8)      3 (3.6)     6 (4.6)     0 (1.0)    0 (0.0)        10
           10.0         30.0        60.0         0.0        0.0           20.0
           25.0         16.7        26.1         0.0         -
            2.0          6.0        12.0         0.0        0.0
A&S        0 (0.8)      5 (3.6)     5 (4.6)     0 (1.0)    0 (0.0)        10
            0.0         50.0        50.0         0.0        0.0           20.0
            0.0         27.8        21.7         0.0         -
            0.0         10.0        10.0         0.0        0.0
ENG        0 (0.8)      0 (3.6)     6 (4.6)     4 (1.0)    0 (0.0)        10
            0.0          0.0        60.0        40.0        0.0           20.0
            0.0          0.0        26.1        80.0         -
            0.0          0.0        12.0         8.0        0.0
OTHER      3 (1.6)     10 (7.2)     6 (9.2)     1 (2.0)    0 (0.0)        20
           15.0         50.0        30.0         5.0        0.0           40.0
           75.0         55.6        26.1        20.0         -
            6.0         20.0        12.0         2.0        0.0
Column     4           18          23           5          0              50
Total      8.0         36.0        46.0        10.0        0.0           100.0

Computed value of Pearson's chi-square statistic (χ²): 21.70833    df: 12    Significance level: ≈ 0.04
The null and alternative hypotheses for the chi-square test of this exercise are:

Ho: Teaching ability of lecturers is independent of faculty
Ha: Teaching ability of lecturers depends on faculty

In the chi-square test, the null hypothesis specifies independence and the alternative hypothesis specifies dependence. The computed value of the chi-square statistic is 21.70833, called Pearson's chi-square statistic; the degrees of freedom are (4 − 1)(5 − 1) = 12, and the corresponding significance level (p-value) is approximately 0.04.
Let's test the hypothesis at the 0.01 level of significance (α = 0.01). The next step is to find the table value of χ². Appendix table … provides the chi-square distribution table for different levels of α and df. The table value for our exercise is:

χ²(0.01, 12) = 26.217
Now let's represent these data in a graph:

Figure 11.13 Hypothesis test for the evaluation of teaching ability of lecturers by faculty: the computed χ² (21.70833) falls in the acceptance region, below the table value 26.217 at α = 0.01
As you can see in Figure 11.13, the computed value of χ² is less than the table value and falls within the acceptance region. So we accept our null hypothesis that the teaching ability of lecturers is independent of faculty, according to these n = 50 observations.
The p-value leads to the same decision: since p-value ≈ 0.04 > α = 0.01, we again accept the null hypothesis.
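Below is a hedged sketch reproducing the computation of this exercise. Following the text, the statistic is summed over the cells with non-zero expected frequency and referred to df = (4 − 1)(5 − 1) = 12; note that statistical packages typically exclude the empty Very Poor column altogether (scipy's chi2_contingency, for instance, rejects tables with zero expected frequencies), which would change the degrees of freedom:

    import numpy as np
    from scipy.stats import chi2

    observed = np.array([
        [1,  3, 6, 0, 0],    # B&E
        [0,  5, 5, 0, 0],    # A&S
        [0,  0, 6, 4, 0],    # ENG
        [3, 10, 6, 1, 0]])   # OTHER

    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    mask = expected > 0                  # skip the empty column's cells
    diff = observed - expected
    chi_sq = (diff[mask] ** 2 / expected[mask]).sum()

    print(chi_sq)                 # about 21.708
    print(chi2.ppf(0.99, 12))     # table value at alpha = 0.01: 26.217
    print(chi2.sf(chi_sq, 12))    # p-value, about 0.04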
Analysis of Variance (ANOVA) Test for Difference
Analysis of variance (ANOVA) is used to test for differences among more than two sample means. To do so, ANOVA compares two different estimates of the variance, σ², of the same population: the first is estimated from the variation among the samples, and the second from the variation within the samples. If the null hypothesis is true, the two estimates should be about equal.
The ANOVA test is based on the F-ratio, which compares the two estimates of the variance:

F = (estimate of the variance among the sample means) / (estimate of the variance within the samples)
This can be written as F = σ̂²(among) / σ̂²(within), where

σ̂²(among) = ∑ nj (x̄j − x̿)² / (k − 1)

σ̂²(within) = ∑ [ (nj − 1) / (nT − k) ] s²j

and
• nj = size of the jth sample
• x̄j = sample mean of the jth sample
• x̿ = mean (average) of the sample means (the grand mean)
• k = number of samples
• s²j = variance of the jth sample
• nT = total of the sample sizes (∑ nj)
Let's consider an example of the test for difference. Suppose we want to test whether there is a significant difference between the salaries of males and females in a questionnaire study for a corporation. The selected sample size is 475. Salary and gender are categorized in the questionnaire form as:

SAL:
1. 0 - $50,000
2. $50,000 - $100,000
3. $100,000 - ……..

GENDER:
1. Male
2. Female

We can formulate our hypotheses as:

Ho: µ(male) = µ(female) (salaries of employees do not differ by gender)
Ha: µ(male) ≠ µ(female) (salaries of employees differ by gender)

α = 0.01
Below is the SPSS output of the ANOVA test for the employee data.

ANOVA (dependent variable: SAL)

                  Sum of Squares    df    Mean Square      F       Sig.
Between Groups         6.753          1       6.753      49.510    .000
Within Groups         64.380        472        .136
Total                 71.133        473
To test our null hypothesis we have to compare the computed F value with the F-table value, which is found using the degrees of freedom (df). In the ANOVA test there are two degrees of freedom:

df in the numerator of the F-ratio = k − 1 = 2 − 1 = 1
df in the denominator of the F-ratio = ∑(nj − 1) = nT − k = 472

where
• k = number of samples
• nj = size of the jth sample
• nT = total sample size (the output is based on 474 valid cases, hence nT − k = 474 − 2 = 472)
Then the F-table value is:

F(1, 472) at α = 0.01 ≈ 6.63

Now let's test our hypothesis at α = 0.01:
Figure 11.14 Hypothesis test for the difference in salaries between males and females: the computed F (49.510) falls in the rejection region, beyond the table value 6.63 at α = 0.01

F-computed = 49.510 > F-table = 6.63. Since the computed F falls within the rejection region, we reject our null hypothesis and accept the alternative hypothesis: in this corporation, the salaries of employees differ between males and females.
Alternatively, p-value = 0.000 < α = 0.01, and again we reject the null hypothesis.
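A minimal sketch checking this output with scipy: the critical value and the p-value of the computed F ratio under an F distribution with (1, 472) degrees of freedom:

    from scipy.stats import f

    print(f.ppf(0.99, 1, 472))     # F-table value at alpha = 0.01, about 6.7
    print(f.sf(49.510, 1, 472))    # p-value, far below 0.01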
The shape of the F-distribution at different levels of degrees of freedom is shown in Figure 11.15. The first number in each parenthesis is the df in the numerator of the F-ratio formula, and the second number the df in the denominator.
31
(25,25) df
Figure 11.15
(5,5) df
Representing F
distribution
with different
levels of
degrees of
freedom
(2,1) df
F distribution
The figure above shows that as the df in both the numerator and the denominator of the F-ratio increase, the shape of the F distribution approaches that of the normal distribution.
11.7 Correlation, Simple and Multiple Regression
Correlation and regression analyses are used to determine the nature and the strength of the relationship between two variables, say X and Y. In regression analysis, one of the variables is the independent variable and the other is the dependent variable; in correlation analysis this distinction is not made. The number of independent variables can be increased, whereas there can be only one dependent variable. Regression presumes a causal relationship between the dependent and independent variable(s): changes in the independent variables are taken to cause changes in the dependent variable.
We usually expect either a direct (positive) or an inverse (negative) relationship between two variables in correlation and regression analyses. Figure 11.16 illustrates these relationships:

Figure 11.16 (a) Direct relationship between X and Y; (b) inverse relationship between X and Y
The curve in Figure 11.16 (a) has a positive slope, corresponding to a direct relationship between X and Y, whereas the curve in Figure 11.16 (b) has a negative slope, corresponding to an inverse relationship between the two variables. Usually, the independent variable (X) is placed on the x-axis and the dependent variable (Y) on the y-axis.
The relationship between two variables, including all the data points of the observed data, is well represented in a scatter diagram. When the relationship between two variables is described by a straight line, we say that there is a linear relationship between them, although there may be some deviations from the straight line. When the relationship takes the form of a curve, it is called curvilinear. These types of relationships are represented in Figure 11.17:

Figure 11.17 Types of relationships between x and y in scatter diagrams: (a) direct linear, (b) inverse linear, (c) direct curvilinear, (d) inverse curvilinear, (e) inverse linear, more scattered, (f) no relationship

The last panel, (f), is extremely scattered, which indicates no relationship between the x and y variables. The more scattered the data points are around the straight line, the weaker the relationship between the two variables.
Correlation Analysis
Correlation analysis is used to determine the strength of the relationship between any two variables. In statistical theory, two measures describe this: the correlation coefficient (R) and the coefficient of determination (R²), both of which express the degree of linear relationship between two variables.
The correlation coefficient shows the degree of linear relationship between any two variables. The coefficient of determination, on the other hand, is the square of the correlation coefficient and is used especially in regression analysis, where it indicates how far the changes in the dependent variable can be explained by the changes in the independent variable(s).
The correlation coefficient (R) takes values between −1 and +1; its absolute value can never exceed 1. As R approaches ±1, the degree of relationship between the two variables increases. If R is positive there is a direct relationship; if R is negative there is an inverse relationship. If R is exactly 0 there is no relationship, and if |R| is 1 there is perfect correlation between the two variables. The correlation coefficient of the variables x and y is given by:

R(xy) = ∑(x − x̄)(y − ȳ) / √( ∑(x − x̄)² ∑(y − ȳ)² )
The coefficient of determination (R²) is the square of the correlation coefficient (R):

R² = (R)²

The coefficient of determination takes values between 0 and 1. If R² is 0, there is no relationship between the dependent variable and the independent variable(s) in the regression analysis; if it is 1, the dependent variable is perfectly determined by the independent variable(s).
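A minimal sketch of the correlation coefficient formula above; the short x and y series are made-up illustration data, not from the text:

    def correlation(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        sxx = sum((xi - mx) ** 2 for xi in x)
        syy = sum((yi - my) ** 2 for yi in y)
        return sxy / (sxx * syy) ** 0.5

    x = [1, 2, 3, 4, 5]
    y = [2, 4, 5, 4, 5]
    R = correlation(x, y)
    print(R, R ** 2)    # R about 0.775, R^2 about 0.6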
Simple Regression Analysis
Remember that scatter diagrams showed us how data points scatter around a straight line. In this section, we will calculate this regression line.
In simple regression analysis, there is one dependent and one independent variable, and the regression line is:

Y = a + bX

where
• Y : dependent variable
• a : y-intercept of the regression line
• b : slope of the regression line
• X : independent variable
34
The dependent variable (Y) is determined by the independent variable (X).
This formula is used to see how X determines Y variable. But we can also use this
formula to make estimation for Y values. Then we would use the following formula
instead:
Yˆ = a + bX
where Ŷ is the estimated value of Y.
35
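A hedged sketch of fitting the line Ŷ = a + bX by least squares (the standard fitting method, though the text has not yet derived it); the x and y data are the same made-up series used above:

    def fit_line(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))     # slope
        a = my - b * mx                             # intercept: line through the means
        return a, b

    a, b = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
    print(a, b)                                     # a = 2.2, b = 0.6
    print([a + b * xi for xi in [1, 2, 3, 4, 5]])   # estimated Y values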