Business Statistics
Doing Statistics in Excel
1. Transforming and Classifying Data
Transforming Data
We have the following data on the height and weight of students:
     A     B        C        D            E     F
1    Sex   Height   Weight   Department   BMI   Rel
2    m     172      75       1
3    f     167      63       2
4    f     158      52       3
5    m     185      78       3
6    m     166      68       2
Here height is in centimetres and weight is in kilograms.
The body mass index (BMI) of an individual is given by
BMI = x/y^2,
where x is weight in kilograms and y is height in metres. To obtain height in metres, we divide the
height in centimetres by 100 and thus we obtain
BMI = x/(z/100)^2 = 100^2 x/z^2,
where z is height in centimetres.
We wish to enter the BMIs of the students in column E. To do this, we first calculate the
BMI of the first student in cell E2. In this cell, we write
= 100^2*c2/b2^2,
since c2 is the weight of this individual and b2 is the height of this individual. We then copy this
formula and paste it into the remaining cells in the column (E3 to E6). Note that when we paste a
formula into new cells, Excel adjusts the row numbers and column letters to match the new position
(unless we fix the column letter or row number - see below). This means that when we move down
one row, Excel increases each row number by one. Hence, the formula which appears in E3 is
= 100^2*c3/b3^2.
Similarly, if we copied this formula into F2 (i.e. moved one column to the right), the formula which
appears would be
= 100^2*d2/c2^2
(i.e. each column is moved one position to the right). Note that this value has no physical
interpretation, since it divides the department number by the square of the weight.
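For readers who wish to check the spreadsheet results outside Excel, the same transformation can be sketched in Python. This is only an illustrative sketch: the list names and the helper function bmi_from_cm are assumptions introduced here, not part of the worksheet. The heights and weights are those of the five students in the table above.

```python
# Heights (cm) and weights (kg) of the five students, as in columns B and C.
heights_cm = [172, 167, 158, 185, 166]
weights_kg = [75, 63, 52, 78, 68]


def bmi_from_cm(weight_kg, height_cm):
    """Return BMI = 100^2 * weight / height^2, with height given in centimetres."""
    return 100 ** 2 * weight_kg / height_cm ** 2


# One BMI per student, mirroring the formula copied down column E.
bmis = [bmi_from_cm(w, h) for w, h in zip(weights_kg, heights_cm)]

# Print the values next to the spreadsheet cells they correspond to (E2 to E6).
for row, value in enumerate(bmis, start=2):
    print(f"E{row}: {value:.2f}")
```

The list comprehension plays the role of copying the formula down the column: the same rule is applied to each row in turn.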
In order to fix a row or column in a formula, we put $ before the column letter or row number as
appropriate. For example, suppose we wish to calculate in column F the height of the i-th
individual as a proportion of the height of the first individual. Since the i-th individual is in row
i+1, this proportion is B(i+1)/B2. Hence, we would like the row number to change in the numerator
of this formula, but be fixed in the denominator. In the cell F2, we write
= B2/B$2,
which gives the answer 1 for the first individual, and then copy this to the remaining cells in column F.
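The effect of fixing the denominator can be illustrated with a short Python sketch. The variable names are assumptions made for illustration; the variable reference plays the role of the fixed cell B$2, while the loop variable plays the role of the changing numerator.

```python
# Heights (cm) of the five students, as in cells B2 to B6.
heights_cm = [172, 167, 158, 185, 166]

# The fixed denominator: the height of the first individual (cell B$2).
reference = heights_cm[0]

# The numerator changes from row to row, the denominator does not.
relative_heights = [h / reference for h in heights_cm]

for row, value in enumerate(relative_heights, start=2):
    print(f"F{row}: {value:.3f}")
```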
Classifying Data
Suppose that we wish to classify a numerical variable into 2 groups. We can use the IF (JEŻELI)
function. Let us divide individuals according to height into 2 groups: 1 - height is less than 170cm,
2 - height is at least 170cm, and write the appropriate group numbers in column G. We first classify
individual number 1. In cell G2, we write
=IF(b2<170;1;2).
Note that semi-colons are used to separate the arguments of a function. The first argument of the
function is the condition which defines group 1 (here b2 is the height of the first individual). If this
condition is satisfied, then the output of the function is the second argument (here 1). Otherwise, the
output of the function is the third argument (here 2). We then copy this formula and paste it into the
rest of column G.
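The logic of the IF function corresponds to a conditional expression. A minimal Python sketch is given below; the function name height_group is an assumption made for illustration, and the heights are those of the five students in the first table.

```python
# Heights (cm) of the five students, as in cells B2 to B6.
heights_cm = [172, 167, 158, 185, 166]


def height_group(height_cm):
    """Return 1 if the height is less than 170cm and 2 otherwise."""
    return 1 if height_cm < 170 else 2


# Classify each student, mirroring the formula copied down column G.
groups = [height_group(h) for h in heights_cm]
print(groups)
```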
Now suppose that we wish to classify a variable into k groups, where k>2. In this case, we can use
k-1 nested IF functions. For example, suppose we wish to divide individuals into three groups based
on weight as follows: 1 - less than 60kg, 2 - at least 60kg but less than 70kg, 3 - 70kg or above, and
write the appropriate categorical variable in column H. In the cell H2, we write
= IF(c2<60;1;IF(c2<70;2;3)).
Here, the first argument is again the condition which defines the first group. If this condition is
satisfied, then the output of the function is the second argument (here 1). If the first condition is not
satisfied, then the function passes on to the nested IF function. Note that if the nested IF function is
evaluated, we know that the weight of the individual must be at least 60kg, thus the second condition
is then sufficient to define which individuals are in group 2. If this second condition is satisfied, then
the output of the function is 2 (the second argument of the nested IF function). Otherwise, the output
of the function is 3 (the third argument of the nested IF function). If we wish to categorize into four
groups, we would have to add another nested IF function, and so on.
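The nested IF functions can be sketched in Python as a chain of conditions. The function names below are assumptions made for illustration. The second function is an alternative for k groups which looks up the group in a sorted list of cut-off points using the standard bisect module; it is not something Excel does, but it gives the same classification.

```python
import bisect

# Weights (kg) of the five students, as in cells C2 to C6.
weights_kg = [75, 63, 52, 78, 68]


def weight_group_nested(weight_kg):
    """Mirror of IF(c2<60;1;IF(c2<70;2;3))."""
    if weight_kg < 60:
        return 1
    # Reaching this point means the weight is at least 60kg.
    if weight_kg < 70:
        return 2
    return 3


def weight_group_cutoffs(weight_kg, cutoffs=(60, 70)):
    """Classify into len(cutoffs)+1 groups; each cut-off belongs to the higher group."""
    return bisect.bisect_right(cutoffs, weight_kg) + 1


groups = [weight_group_nested(w) for w in weights_kg]
print(groups)
```

With the cut-off version, a fourth group only requires adding one more cut-off point rather than another level of nesting.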
2. Frequency Tables, Grouped Data and Graphical Presentation of Data
Frequency Tables for Discrete and Categorical Data
Note that if the number of distinct values a discrete variable takes is large, then we treat such a
variable as a continuous variable (see histograms). By large we mean both greater than 10 and
clearly greater than the number of intervals that would be used to draw a histogram. For n≤50
observations, we can use about √n intervals; for n>50, we can use about 1+1.44 ln(n) intervals,
where ln is the natural logarithm function.
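The rule of thumb for the number of intervals can be written as a small Python function. This is a sketch only; the function name suggested_intervals is an assumption made for illustration, and the result still has to be rounded to a convenient whole number by the user.

```python
import math


def suggested_intervals(n):
    """Return the suggested number of intervals for n observations (not rounded)."""
    if n <= 50:
        return math.sqrt(n)
    return 1 + 1.44 * math.log(n)  # math.log is the natural logarithm


for n in (25, 200, 1170):
    print(n, round(suggested_intervals(n), 2))
```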
For example, the data file dice.xls contains the number of children in 200 families in column B.
When we are dealing with discrete data, we may need to first determine the range of the data (i.e.
the largest and smallest observations). It is easy to notice that the minimum number of children is 0
[we could find this by using the formula =min(B2:B201)]. Note that a colon is used to define a
range of data in the spreadsheet. Here, B2 is the top left cell in the set of data to be used and B201 is
the bottom right cell.
Using the formula =max(B2:B201), we find that the maximum number of children is 7. Hence,
from 0 to 7 children are observed in these families. Note that the total number of values is
max-min+1, here 8.
We input these values into column D, thus obtaining the table below:
     A     B          C     D     E     F
1    Die   Children         0
2    5     1                1
3    2     2                2
4    6     3                3
5    6     1                4
6    6     0                5
7    1     2                6
8    4     4                7
9    6     1
We now calculate the frequencies with which each family size is observed. To count the number of
families with no children, we use the COUNTIF (LICZ.JEŻELI) function. In cell E1, we write
=COUNTIF(B$2:B$201;D1).
The first argument of the COUNTIF function is the range of data to be used. The function counts how
many times the value D1 (here 0) appears in column B. We then copy this formula and paste it into
cells E2 to E8. We need to fix the row numbers in the formula, since the range of data used must be
the same each time. This gives us the frequency table for the number of children in a family.
We can calculate the relative frequencies of these family sizes by dividing by 200 (the number of
observations). These are calculated in column F. In cell F1, we write = E1/200 and copy this into
cells F2 to F8. As a result of this, we obtain the following table:
     A     B          C     D     E     F
1    Die   Children         0     19    0.095
2    5     1                1     46    0.23
3    2     2                2     53    0.265
4    6     3                3     37    0.185
5    6     1                4     25    0.125
6    6     0                5     13    0.065
7    1     2                6     4     0.02
8    4     4                7     3     0.015
9    6     1
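The counting carried out by COUNTIF can be sketched in Python with collections.Counter. Note that the data used below are only the eight values of column B visible in the table above (cells B2 to B9), not the full set of 200 families, so the resulting frequencies differ from those in columns E and F. The variable names are assumptions made for illustration.

```python
from collections import Counter

# The eight values of the number of children visible in cells B2 to B9 only.
children = [1, 2, 3, 1, 0, 2, 4, 1]

# The possible values run from the minimum to the maximum (max-min+1 values in total).
values = list(range(min(children), max(children) + 1))

# Absolute frequencies: how many times each value appears (the role of COUNTIF).
counter = Counter(children)
frequencies = [counter[v] for v in values]

# Relative frequencies: divide by the number of observations.
n = len(children)
relative_frequencies = [f / n for f in frequencies]

for v, f, r in zip(values, frequencies, relative_frequencies):
    print(v, f, r)
```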
Bar Charts for Discrete and Categorical Data
We can obtain a bar chart (histogram) for the number of children by first highlighting the column
containing the relative frequencies (column F), choosing the INSERT (WSTAWIANIE) menu, clicking
on the icon showing a bar chart and choosing the first option there. We can insert a title by first
double clicking in the title field. It should be noted that the labels on the x-axis refer to the row
numbers rather than the number of children. In order to change this, we right click on the x-axis
labels and choose the option SELECT DATA (ZAZNACZ DANE). We click on the right hand side EDIT
(EDYTUJ) button - in the horizontal axis labels (Etykiety osi poziomej) field. Then we highlight the
values in column D (the possible family sizes). As a result, we obtain the following bar chart.
Pie Charts for Discrete and Categorical Data
We can obtain a pie chart for the number of children by first highlighting the column containing the
ABSOLUTE frequencies (column E), choosing the INSERT (WSTAWIANIE) menu, clicking on
the figure showing a pie chart and choosing the first option there. A title can be added and the labels
of the groups changed as for bar charts. We obtain the following pie chart.
Calculating the Mean and Variance when Data is Grouped
Suppose we have the following frequency table describing the observations of a discrete variable:
     A                 B           C     D     E     F
1    No. of Children   Frequency
2    0                 19
3    1                 46
4    2                 53
5    3                 37
6    4                 25
7    5                 13
8    6                 4
9    7                 3
10
Note that the total number of observations is the sum of the frequencies. Hence, we define
B10=SUM(B2:B9). To calculate the mean, we then calculate the products of the no. of children
with their frequencies. To do this, we define C2=A2*B2 and copy this into C3 to C9. Summing this
column gives us the sum of all the observations. Hence, let C10=SUM(C2:C9). Dividing this sum
(C10) by the number of observations (B10), we obtain the mean. Hence, let C11=C10/B10. We thus
obtain the table:
     A                 B           C       D          E     F
1    No. of Children   Frequency   xn_x    sq. dev.
2    0                 19          0
3    1                 46          46
4    2                 53          106
5    3                 37          111
6    4                 25          100
7    5                 13          65
8    6                 4           24
9    7                 3           21
10   Sum               200         473
11   Average                       2.365
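The calculation of the mean from the frequency table can be checked with a short Python sketch. The variable names are assumptions made for illustration; the values and frequencies are those in columns A and B.

```python
# Number of children (column A) and the corresponding frequencies (column B).
values = [0, 1, 2, 3, 4, 5, 6, 7]
frequencies = [19, 46, 53, 37, 25, 13, 4, 3]

# Total number of observations: the role of B10=SUM(B2:B9).
n = sum(frequencies)

# Products of value and frequency (column C) and their sum (C10).
products = [x * f for x, f in zip(values, frequencies)]
total = sum(products)

# Mean: the role of C11=C10/B10.
mean = total / n
print(n, total, mean)
```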
We now enter the squared deviations from the mean in column D. In D2, we write =(A2-C$11)^2
and copy this formula into D3 to D9. We then calculate the products of the frequencies and squared
deviations in column E. In E2, we write =B2*D2 and copy this formula into cells E3 to E9. As a
result, we obtain the following table:
     A                 B           C       D          E              F
1    No. of Children   Frequency   xn_x    sq. dev.   sq. dev.*n_x
2    0                 19          0       5.593      106.27
3    1                 46          46      1.863      85.71
4    2                 53          106     0.133      7.06
5    3                 37          111     0.403      14.92
6    4                 25          100     2.673      66.83
7    5                 13          65      6.943      90.26
8    6                 4           24      13.213     52.85
9    7                 3           21      21.483     64.45
10   Sum               200         473
11   Average                       2.365
To obtain the variance of the number of children, we first sum the entries in column E,
E10=SUM(E2:E9), and then divide this by n-1, thus E11 (the variance) = E10/(B10-1). Here
E10 = 488.35 and B10-1 = 199, which gives us s^2 = 488.35/199 = 2.454.
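The variance calculation can be checked in the same way. The following Python sketch is self-contained and uses the values and frequencies from columns A and B; the variable names are assumptions made for illustration.

```python
# Number of children (column A) and the corresponding frequencies (column B).
values = [0, 1, 2, 3, 4, 5, 6, 7]
frequencies = [19, 46, 53, 37, 25, 13, 4, 3]

n = sum(frequencies)
mean = sum(x * f for x, f in zip(values, frequencies)) / n

# Squared deviations from the mean (column D).
squared_deviations = [(x - mean) ** 2 for x in values]

# Products of frequency and squared deviation (column E) and their sum (E10).
weighted = [f * d for f, d in zip(frequencies, squared_deviations)]
sum_of_squares = sum(weighted)

# Sample variance: divide by n-1, the role of E10/(B10-1).
variance = sum_of_squares / (n - 1)
print(round(sum_of_squares, 2), round(variance, 3))
```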
Histograms for Continuous Variables
We now draw a histogram for the size of flats (see flats.xls). There are 1170 items of data, thus we
split into around 1+1.44 ln(1170) ≈ 11.17 groups. The sizes of these flats are in cells B2 to B1171.
Using the MIN and MAX functions, we find that the values range between 20 and 135. It thus
seems reasonable to split the data into 13 intervals of length 10 (defining the upper end of an
interval, but not the lower end, to belong to that interval). These intervals are:
(10,20], (20,30], (30,40],...,(120,130],(130,140]. We enter the upper end points of these intervals
in a new column (say column G). Hence, we enter 20 into G1, define G2=G1+10 and copy this
formula into the cells G3 to G13.
We now count the number of observations in each interval and write these in column H. To obtain
the number of observations in the first group, we count all the observations less than or equal to 20,
using the COUNTIF (LICZ.JEŻELI) function, i.e.
H1=COUNTIF(B2:B1171;"<="&G1).
The first argument defines where the data are. The second argument is the condition that the
observation must be less than or equal to G1 (20). Note the use of quotation marks around the
comparison operator and the symbol & before the cell reference, which joins the operator and the
value in G1 into a single text condition (here "<=20"). Also, this formula is not going to be copied,
so we do not need to fix the range of data.
To count the number of observations in the other groups, note that we can count the number of
observations that do not exceed the upper limit for a group and then subtract the number of
observations that do not exceed the lower limit. Hence,
H2=COUNTIF(B$2:B$1171;"<="&G2)-COUNTIF(B$2:B$1171;"<="&G1)
The dollar sign is used here to fix the data set, since this formula is then copied into H3 to H13.
Dividing these frequencies by the number of observations (1170), we obtain the relative
frequencies, i.e. letting I1=H1/1170 and copying this into I2 to I13, we obtain the following
frequency table:
     A     B     C
1    20    4     0.0034
2    30    60    0.0513
3    40    178   0.1521
4    50    260   0.2222
5    60    259   0.2214
6    70    209   0.1786
7    80    104   0.0889
8    90    46    0.0393
9    100   27    0.0231
10   110   11    0.0094
11   120   10    0.0085
12   130   1     0.0009
13   140   1     0.0009
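The method of counting by differences of cumulative counts can be sketched in Python. Since the file flats.xls is not reproduced here, the flat sizes below are a small made-up sample used purely for illustration, and the names sizes, upper_limits and count_at_most are assumptions. Only the interval end points (20, 30, ..., 140) are taken from the text.

```python
# Hypothetical flat sizes, invented for illustration (not the data from flats.xls).
sizes = [20, 25, 30, 31, 45, 50, 58, 62, 135]

# Upper end points of the intervals (10,20], (20,30], ..., (130,140].
upper_limits = list(range(20, 141, 10))


def count_at_most(data, limit):
    """Count the observations less than or equal to limit (the role of COUNTIF with "<=")."""
    return sum(1 for x in data if x <= limit)


# First interval: everything up to the first limit.
frequencies = [count_at_most(sizes, upper_limits[0])]

# Other intervals: cumulative count at the upper limit minus that at the lower limit.
for lower, upper in zip(upper_limits, upper_limits[1:]):
    frequencies.append(count_at_most(sizes, upper) - count_at_most(sizes, lower))

# Relative frequencies: divide by the number of observations.
relative_frequencies = [f / len(sizes) for f in frequencies]
print(frequencies)
```

Because each count uses "less than or equal to", an observation equal to an end point (such as 30) falls into the interval for which it is the upper end point.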
We can obtain the histogram in the same way as we obtained the bar chart. We obtain the following
histogram.
It should be noted that when drawing a histogram by hand, the blocks should touch each other and
the lower and upper end points of an interval should be at the left and right, respectively, of the
bottom of the bar representing that interval. Here the upper end point is printed in the middle of the
bar. However, it is clear that the distribution of the size of flats is somewhat right skewed, with the
peak at around 50 m^2.