Business Statistics

Doing Statistics in Excel

1. Transforming and Classifying Data

Transforming Data

We have the following data on the height and weight of students:

      A     B        C        D            E     F
1     Sex   Height   Weight   Department   BMI   Rel
2     m     172      75       1
3     f     167      63       2
4     f     158      52       3
5     m     185      78       3
6     m     166      68       2

Here height is in centimetres and weight is in kilograms.

The body mass index (BMI) of an individual is given by BMI = x/y^2, where x is weight in kilograms and y is height in metres. To obtain height in metres, we divide the height in centimetres by 100, and thus we obtain BMI = 100^2 x/z^2, where z is height in centimetres.

We wish to enter the BMIs of the students in column E. To do this, we first calculate the BMI of the first student in cell E2. In this cell, we write =100^2*C2/B2^2, since C2 is the weight of this individual and B2 is the height of this individual. We then copy this formula and paste it into the remaining cells in the column.

When we paste a formula into new cells, Excel changes the row numbers and column names accordingly (unless we fix the column name or row number, see below). This means that when we move down one row, Excel increases each row number by one. Hence, the value which appears in E3 is 100^2*C3/B3^2. Similarly, if we copied this formula into F2 (i.e. moved one column to the right), the value which appears would be 100^2*D2/C2^2 (i.e. each column is moved one position to the right). Note that this value has no physical interpretation.

In order to fix a row or column in a formula, we put $ before the column name or row number as appropriate. For example, suppose we wish to calculate in column F the height of the i-th individual as a proportion of the height of the first individual. Since the i-th individual is in row i+1, this proportion is B(i+1)/B2. Hence, we would like the row number to change in the numerator of this formula, but be fixed in the denominator.
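As a cross-check outside Excel, the BMI formula and the relative-height calculation just described can be sketched in Python. The names used here (heights_cm, weights_kg, bmi, rel_heights) are illustrative assumptions and are not part of the worksheet; the numbers are the five students from the table above.

```python
# Heights (cm) and weights (kg) of the five students from the table above.
heights_cm = [172, 167, 158, 185, 166]
weights_kg = [75, 63, 52, 78, 68]


def bmi(weight_kg, height_cm):
    """BMI = 100^2 * x / z^2, with x in kg and z in cm (the same expression as =100^2*C2/B2^2)."""
    return 100**2 * weight_kg / height_cm**2


# One BMI per student, as in column E.
bmis = [bmi(w, h) for w, h in zip(weights_kg, heights_cm)]

# Height relative to the first student: the numerator changes from row to row,
# while the denominator stays fixed (the role played by B$2).
rel_heights = [h / heights_cm[0] for h in heights_cm]

for b, r in zip(bmis, rel_heights):
    print(f"BMI = {b:.2f}, relative height = {r:.3f}")
```

The fixed index heights_cm[0] in the denominator does the same job as the $ in front of the row number.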
In cell F2, we write =B2/B$2, which gives the answer 1, and then copy this to the remaining cells in column F.

Classifying Data

Suppose that we wish to classify a numerical variable into 2 groups. We can use the IF (JEŻELI) function. Let us divide individuals according to height into 2 groups:
1 - height is less than 170cm,
2 - height is at least 170cm,
and write the appropriate group numbers in column G.

We first classify individual number 1. In cell G2, we write =IF(B2<170;1;2). Note that semicolons are used here to separate the arguments of a function (the separator depends on the regional settings; with English (US) settings, commas are used instead). The first argument of the function is the condition which defines group 1 (here B2 is the height of the first individual). If this condition is satisfied, then the output of the function is the second argument (here 1). Otherwise, the output of the function is the third argument (here 2). We then copy this formula and paste it into the rest of column G.

Now suppose that we wish to classify a variable into k groups, where k>2. In this case, we can use k-1 nested IF functions. For example, suppose we wish to divide individuals into three groups based on weight as follows:
1 - less than 60kg,
2 - at least 60kg, but less than 70kg,
3 - 70kg or above,
and write the appropriate categorical variable in column H.

In cell H2, we write =IF(C2<60;1;IF(C2<70;2;3)). Here, the first argument is again the condition which defines the first group. If this condition is satisfied, then the output of the function is the second argument (here 1). If the first condition is not satisfied, then the function passes on to the nested IF function. Note that if the second IF function is reached, we know that the weight of the individual must be at least 60kg, thus the second condition is then sufficient to define which individuals are in group 2. If this second condition is satisfied, then the output of the function is 2 (the second argument of the nested IF function).
Otherwise, the output of the function is 3 (the third argument of the nested IF function). If we wish to categorize into four groups, we would have to add another nested IF function, and so on.

2. Frequency Tables, Grouped Data and Graphical Presentation of Data

Frequency Tables for Discrete and Categorical Data

Note that if the number of values a discrete variable takes is large, then we treat such a variable as a continuous variable (see histograms). By large we mean both greater than 10 and clearly greater than the number of intervals that would be used to draw a histogram. For n≤50, we can use about √n intervals; for n>50, we can use about 1+1.44 ln(n) intervals, where ln is the natural logarithm function.

For example, the data file dice.xls contains the number of children in 200 families in column B. When we are dealing with discrete data, we may need to first determine the range of the data (i.e. the largest and smallest observations). It is easy to notice that the minimum number of children is 0 [we could find this by using the formula =MIN(B2:B201)]. Note that a colon is used to define a range of data in the spreadsheet. Here, B2 is the top left cell in the set of data to be used and B201 is the bottom right cell. Using the formula =MAX(B2:B201), we find that the maximum number of children is 7. Hence, from 0 to 7 children are observed in these families. Note that the total number of possible values is max-min+1, here 8. We input these values into column D, thus obtaining the table below (only the first rows of the data are shown):

      A     B          C     D
1     Die   Children         0
2     5     1                1
3     2     2                2
4     6     3                3
5     6     1                4
6     6     0                5
7     1     2                6
8     4     4                7
9     6     1

We now calculate the frequencies with which each family size is observed. To count the number of families with no children, we use the COUNTIF (LICZ.JEŻELI) function. In cell E1, we write =COUNTIF(B$2:B$201;D1). The first argument of the COUNTIF function is the range of data to be used. The function counts how many times the value in D1 (here 0) appears in column B.
We then copy this formula and paste it into cells E2 to E8. We need to fix the row numbers in the formula, since the range of data used must be the same each time. This gives us the frequency table for the number of children in a family. We can calculate the relative frequencies of these family sizes by dividing by 200 (the number of observations). These are calculated in column F. In cell F1, we write =E1/200 and copy this into cells F2 to F8. As a result, we obtain the following table:

      A     B          C     D     E     F
1     Die   Children         0     19    0.095
2     5     1                1     46    0.23
3     2     2                2     53    0.265
4     6     3                3     37    0.185
5     6     1                4     25    0.125
6     6     0                5     13    0.065
7     1     2                6     4     0.02
8     4     4                7     3     0.015
9     6     1

Bar Charts for Discrete and Categorical Data

We can obtain a bar chart (histogram) for the number of children by first highlighting the column containing the relative frequencies, choosing the INSERT (WSTAWIANIE) menu, clicking on the figure showing a bar chart and choosing the first option there. We can insert a title by first double clicking in the title field. It should be noted that the labels on the x-axis refer to the row numbers rather than the number of children. In order to change this, we right click on the x-axis labels and choose the option SELECT DATA (ZAZNACZ DANE). We click on the right hand side EDIT (EDYTUJ) button, in the Horizontal (Category) Axis Labels (Etykiety osi poziomej) field. Then we highlight the values in column D (the possible family sizes). As a result, we obtain the following bar chart.

Pie Charts for Discrete and Categorical Data

We can obtain a pie chart for the number of children by first highlighting the column containing the ABSOLUTE frequencies (column E), choosing the INSERT (WSTAWIANIE) menu, clicking on the figure showing a pie chart and choosing the first option there. A title can be added and the labels of the groups changed as for bar charts. We obtain the following pie chart.
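The frequency table built above with MIN, MAX and COUNTIF can be cross-checked with a short Python sketch. The raw 200 observations from dice.xls are not reproduced in the text, so the list children below is an assumption: it is rebuilt from the published frequencies and only stands in for column B. All variable names are illustrative.

```python
from collections import Counter

# Stand-in for column B of dice.xls: rebuilt from the frequency table above,
# because the raw 200 observations are not reproduced in the text.
frequencies = {0: 19, 1: 46, 2: 53, 3: 37, 4: 25, 5: 13, 6: 4, 7: 3}
children = [size for size, count in frequencies.items() for _ in range(count)]

# Possible values run from the minimum to the maximum (max - min + 1 of them).
values = range(min(children), max(children) + 1)

counts = Counter(children)  # plays the role of COUNTIF for each value
n = len(children)           # number of observations

# One row per value: (value, absolute frequency, relative frequency).
table = [(v, counts[v], counts[v] / n) for v in values]

for value, freq, rel in table:
    print(f"{value}  {freq:3d}  {rel:.3f}")
```

Counter returns 0 for a value that never occurs, so a family size with no observations would still appear in the table with frequency 0, just as COUNTIF would report it.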
Calculating the Mean and Variance when Data is Grouped

Suppose we have the following frequency table describing the observations of a discrete variable:

      A                 B
1     No. of Children   Frequency
2     0                 19
3     1                 46
4     2                 53
5     3                 37
6     4                 25
7     5                 13
8     6                 4
9     7                 3

Note that the total number of observations is the sum of the frequencies. Hence, we define B10=SUM(B2:B9). To calculate the mean, we then calculate the products of the numbers of children with their frequencies. To do this, we define C2=A2*B2 and copy this into C3 to C9. Summing this column gives us the sum of all the observations. Hence, let C10=SUM(C2:C9). Dividing this sum (C10) by the number of observations (B10), we obtain the mean. Hence, let C11=C10/B10. We thus obtain the table:

      A                 B           C       D
1     No. of Children   Frequency   x n_x   sq. dev.
2     0                 19          0
3     1                 46          46
4     2                 53          106
5     3                 37          111
6     4                 25          100
7     5                 13          65
8     6                 4           24
9     7                 3           21
10    Sum               200         473
11    Average                       2.365

We now enter the squared deviations from the mean in column D. In D2, we write =(A2-C$11)^2 and copy this formula into D3 to D9. We then calculate the products of the frequencies and squared deviations in column E. In E2, we write =B2*D2 and copy this formula into cells E3 to E9. As a result, we obtain the following table:

      A                 B           C       D          E
1     No. of Children   Frequency   x n_x   sq. dev.   sq. dev.*n_x
2     0                 19          0       5.593      106.27
3     1                 46          46      1.863      85.71
4     2                 53          106     0.133      7.06
5     3                 37          111     0.403      14.92
6     4                 25          100     2.673      66.83
7     5                 13          65      6.943      90.26
8     6                 4           24      13.213     52.85
9     7                 3           21      21.483     64.45
10    Sum               200         473
11    Average                       2.365

To obtain the variance of the number of children, we first sum the entries in column E, E10=SUM(E2:E9), and then divide this by n-1, thus E11 (the variance) =E10/(B10-1). This gives us s^2 = 2.454.

Histograms for Continuous Variables

We now draw a histogram for the size of flats (see flats.xls). There are 1170 items of data, thus we split them into around 1+1.44ln(1170)=11.17 groups. The sizes of these flats are in cells B2 to B1171.
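The rule of thumb for the number of intervals can be checked with a few lines of Python. This is only a sketch of the rule as stated in this handout; the function name suggested_intervals is a hypothetical helper introduced for illustration.

```python
import math


def suggested_intervals(n):
    """Rule of thumb from the text: about sqrt(n) intervals for n <= 50,
    about 1 + 1.44*ln(n) intervals for n > 50."""
    if n <= 50:
        return math.sqrt(n)
    return 1 + 1.44 * math.log(n)  # math.log is the natural logarithm


print(round(suggested_intervals(1170), 2))  # 11.17, as quoted for the flats data
print(suggested_intervals(25))              # 5.0
```

The result is a guide rather than a requirement: in practice the number is adjusted so that the intervals have convenient end points.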
Using the minimum and maximum functions, we find that the values range from 20 to 135. It thus seems reasonable to split the data into 13 intervals of length 10 (defining the upper end of an interval, but not the lower end, to belong to that interval). These intervals are: (10,20], (20,30], (30,40],...,(120,130],(130,140].

We enter the upper end points of these intervals in a new column (say column G). Hence, we enter 20 into G1, define G2=G1+10 and copy this formula into the cells G3 to G13.

We now count the number of observations in each interval and write these in column H. To obtain the number of observations in the first group, we count all the observations less than or equal to 20, using the COUNTIF (LICZ.JEŻELI) function, i.e. H1=COUNTIF(B2:B1171;"<="&G1). The first argument defines where the data are. The second argument is the condition that the observation must be less than or equal to G1 (20). Note the use of inverted commas around the comparison operator and of the symbol & before the cell to be used: & joins the value of the cell to the string "<=", so that the condition reads "<=20". Also, this formula is not going to be copied, so we do not need to fix the range of data.

To count the number of observations in the other groups, note that we can count the number of observations that do not exceed the upper limit for a group and then subtract the number of observations that do not exceed the lower limit. Hence,

H2=COUNTIF(B$2:B$1171;"<="&G2)-COUNTIF(B$2:B$1171;"<="&G1)

The dollar sign is used here to fix the data set, since this formula is then copied into H3 to H13. Dividing these frequencies by the number of observations (1170), we obtain the relative frequencies.
Letting I1=H1/1170 and copying this into I2 to I13, we obtain the following frequency table (upper end points in column G, frequencies in column H, relative frequencies in column I):

      G      H      I
1     20     4      0.0034
2     30     60     0.0513
3     40     178    0.1521
4     50     260    0.2222
5     60     259    0.2214
6     70     209    0.1786
7     80     104    0.0889
8     90     46     0.0393
9     100    27     0.0231
10    110    11     0.0094
11    120    10     0.0085
12    130    1      0.0009
13    140    1      0.0009

We can obtain the histogram in the same way as we obtain a bar chart. We obtain the following histogram.

It should be noted that when drawing a histogram by hand, the blocks should touch each other and the lower and upper end points of an interval should be at the left and right, respectively, of the bottom of the bar representing that interval. Here the upper end point is printed in the middle of the bar. However, it is clear that the distribution of the size of flats is somewhat right-skewed, with the peak coming at around 50 m².
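The counting technique used for column H (cumulative counts up to each upper end point, then differences) can be sketched in Python as follows. The list sizes is a small hypothetical sample and is NOT taken from flats.xls; the function name interval_counts is likewise an assumption made for illustration.

```python
def interval_counts(data, upper_ends):
    """Count observations in (lower, upper] intervals via cumulative counts.

    Mirrors the spreadsheet approach: the number of values <= the upper end
    of an interval, minus the number of values <= the previous upper end.
    The first interval counts every value <= its upper end, as H1 does.
    """
    cumulative = [sum(1 for x in data if x <= end) for end in upper_ends]
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]


# Hypothetical flat sizes, NOT taken from flats.xls.
sizes = [20, 25, 30, 31, 38, 40, 47, 52, 55, 61]
upper_ends = [20, 30, 40, 50, 60, 70]

counts = interval_counts(sizes, upper_ends)
relative = [c / len(sizes) for c in counts]

print(counts)    # [1, 2, 3, 1, 2, 1]
print(relative)
```

Because each upper end point belongs to its own interval, the values 20, 30 and 40 in the sample fall into (10,20], (20,30] and (30,40] respectively, matching the convention chosen above.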