by Pat McGillion, Examiner, Formation 1 Business Mathematics and Quantitative Methods, December 2007.

This article describes the most common ways to present, describe and summarise data sets, using techniques that are specifically covered in the syllabus for this subject. This area is pertinent to most management areas since managers frequently present data to a wide range of stakeholders. The particular techniques outlined are frequently covered in examinations.

Methods for Describing Sets of Data

Introduction.

In all accountancy, management and consultancy areas, a large volume of data is subject to analysis, interpretation and description. The methods and techniques used can distort the description of the data and, ultimately, the decision made. Characteristics of a data set may include the most frequent score, the variability in the scores, the ‘shape’ of the data, the highest and lowest scores, and whether or not the data set contains any unusual values. Interpreting or extracting information from raw data by eye is difficult at best, since we cannot readily comprehend large volumes of information.

Some formal methods for summarising and characterising the information in a data set are essential. Most populations are large data sets. Therefore, methods for describing such data sets are also essential for statistical inference. There are two key methods used for describing data – one graphical and the other numerical. Both play an important role in statistics and both methods can be used for describing both qualitative and quantitative data.

Describing qualitative data.

When data is grouped into non-numerical categories, the resulting table is a categorical or qualitative distribution. The values of a qualitative variable can be classified into categories called classes. Such data can be summarised numerically in two ways: by computing the class frequency, that is, the number of observations in the data set that fall into each class, or by computing the class relative frequency, that is, the class frequency divided by the total number of observations in the data set (the proportion of all observations falling into each class). Although a summary table of the data may be drawn up, a graphical presentation may be required to demonstrate the features of the data clearly. Two of the most widely used graphical methods for describing qualitative data are bar graphs and pie charts. The bar graph plots the class frequency against the class, where the height of the ‘bar’ is equal to the class frequency. A pie chart shows the relative frequencies of the classes, where the size of the ‘slice’ apportioned to each class is proportional to the class relative frequency.
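As a minimal sketch of the two numerical summaries just described, the following computes class frequencies and class relative frequencies, and from the latter the angle of each pie-chart slice. The observations (payment methods) are invented purely for illustration and are not taken from the article.

```python
from collections import Counter

# Hypothetical qualitative observations (payment method per invoice),
# invented for illustration only.
observations = ["cash", "card", "card", "cheque", "card", "cash", "card", "cash"]

# Class frequency: number of observations falling into each class.
class_frequency = Counter(observations)
n = len(observations)

# Class relative frequency: class frequency divided by the total number of observations.
class_relative_frequency = {c: f / n for c, f in class_frequency.items()}

# A pie-chart slice is proportional to the relative frequency (share of 360 degrees).
slice_angle = {c: 360 * rf for c, rf in class_relative_frequency.items()}

for c in sorted(class_frequency):
    print(f"{c:<8}{class_frequency[c]:>3}{class_relative_frequency[c]:>8.3f}{slice_angle[c]:>8.1f}")
```

The bar heights of a bar graph would be the values in `class_frequency`; the relative frequencies always sum to 1.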

Describing quantitative data.

When data are grouped according to numerical size, the resulting table is a numerical or quantitative distribution. The most common graphical methods for describing, summarising and detecting patterns in such frequency distributions are histograms, frequency curves (such as polygons or ogives) and scattergrams.

Histograms.

Histograms can be used to display either the frequencies or the relative frequencies (the frequency of each class divided by the total frequency) of the measurements falling into specified intervals known as measurement classes. Two important facts are apparent from a histogram. First, the proportion of the total area above an interval is equal to the relative frequency of measurements falling in that interval. Second, as the number of elements in the data set increases, a better description of the data set can be obtained by decreasing the width of the class intervals. When the class intervals become small enough, a relative frequency histogram will appear as a smooth curve. While histograms provide good visual descriptions of data sets, they do not allow the identification of individual measurements.

Another form of presentation is the frequency polygon. The frequencies are plotted at the class marks (mid-points) and successive points are connected. Applying a similar technique to a cumulative distribution (usually a ‘less than’ distribution), an ogive can be obtained. This is where the cumulative frequencies are plotted at the class boundaries.

See Fig. 1 for examples of the above presentations.

x              Less than   Frequency   Cumulative frequency   Mid point
100 < 110      110          1           1                     105
110 < 120      120          4           5                     115
120 < 130      130          7          12                     125
130 < 140      140         13          25                     135
140 < 150      150          7          32                     145
150 < 160      160          3          35                     155
160 or more    180          1          36                     170

Fig. 1: Histogram (with the mode marked), frequency polygon and ‘less than’ ogive for the distribution tabulated above. Each panel plots frequency (cumulative frequency for the ogive) against x.
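The derived columns of the table, and the points plotted in Fig. 1, can be recomputed from the class boundaries and frequencies alone. The sketch below is one way to do this; the variable names are illustrative, and the open-ended class "160 or more" is assumed to close at 180, as the table's "less than" column suggests.

```python
# Class boundaries and frequencies from the table above. The open-ended last
# class ("160 or more") is assumed to close at 180, as in the "less than" column.
boundaries = [100, 110, 120, 130, 140, 150, 160, 180]
frequencies = [1, 4, 7, 13, 7, 3, 1]

total = sum(frequencies)
classes = list(zip(boundaries, boundaries[1:]))  # (lower, upper) pairs

# Class marks (mid-points): where the frequency polygon is plotted.
class_marks = [(lower + upper) / 2 for lower, upper in classes]

# Relative frequency: class frequency divided by the total frequency.
relative = [f / total for f in frequencies]

# Cumulative "less than" frequencies: plotted at the upper class boundaries (ogive).
cumulative = []
running = 0
for f in frequencies:
    running += f
    cumulative.append(running)

# Bar heights chosen so that each bar's AREA equals its relative frequency.
# This matters here because the last class is twice as wide as the others.
heights = [r / (upper - lower) for r, (lower, upper) in zip(relative, classes)]

polygon_points = list(zip(class_marks, frequencies))
ogive_points = list(zip(boundaries[1:], cumulative))

print(polygon_points)
print(ogive_points)
```

Dividing by the class width is a design choice that follows from the area property of histograms stated earlier; with equal class widths it changes only the vertical scale.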


Scattergram.

A common method to describe the relationship between two quantitative variables (a bivariate relationship) is to plot the data on a scattergram. This is a simple and powerful tool, but no measure of reliability can be attached to inferences made about bivariate populations based on scattergrams of sample data. When an increase in one variable is associated with an increase in the second variable, the two variables are said to be positively correlated; when an increase in one is associated with a decrease in the other, they are negatively correlated. Both the form of the relationship between the variables and its strength can be determined by linear regression and correlation analysis. The result can be presented both visually and quantitatively and makes an effective display.
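To illustrate how a relationship and its strength might be quantified, the sketch below fits a least-squares line and computes the correlation coefficient for a small bivariate sample. The paired observations are invented for illustration, and the textbook least-squares formulas are used as one standard choice; the article itself does not prescribe a particular calculation.

```python
from math import sqrt

# Hypothetical paired observations, invented for illustration:
# x = advertising spend, y = sales (arbitrary units).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products about the means.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)

# Least-squares line y = intercept + slope * x.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Correlation coefficient: sign gives the direction, magnitude the strength.
r = s_xy / sqrt(s_xx * s_yy)

print(f"y = {intercept:.2f} + {slope:.2f}x, r = {r:.3f}")
```

A positive `r` corresponds to positively correlated variables and a negative `r` to negatively correlated ones.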

The above graphical techniques are used frequently for summarising and describing quantitative data, but they are often used alongside numerical methods that serve the same objective. A large number of numerical methods are available to describe quantitative data sets. Most of these methods measure one of two data characteristics: (1) the central tendency of the set, that is, the tendency of the data to cluster about certain values; and (2) the variability of the set, that is, the spread of the data.

Arithmetic Mean.

The most popular measure of central tendency is the arithmetic mean. This is the sum of the measurements divided by the number of measurements contained in the data set.

In many business cases the sample mean, $\bar{x}$, is used to estimate or make an inference about the population mean, since we may not have access to measurements for the entire population. In this case the accuracy of the estimate depends on the size of the sample (the larger the sample, the more accurate the estimate will tend to be) and the variability of the data (the more variable the data, the less accurate the estimate).

Median.

Another important measure of central tendency is the median. This is the middle number when the quantitative data set is arranged in ascending or descending order (with an even number of measurements, the mean of the two middle numbers is used).

The median is of most value in describing large data sets: 50% of the data lie above it and 50% below it. In certain situations the median may be a better measure of central tendency than the mean, because it is less sensitive to extremely large or small values. In general, extreme values (large or small) affect the mean more than the median, since every value is used explicitly to calculate the mean. The median is not affected directly by extreme values, since only the middle value is explicitly used to calculate it. Consequently, if measurements are pulled towards one end of the distribution, the mean will shift toward that tail more than the median.
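The different sensitivity of the mean and the median to extreme values can be seen in a small example. The figures below (salaries in thousands) are hypothetical and chosen only to make the effect visible.

```python
from statistics import mean, median

# Hypothetical salaries in thousands, invented for illustration.
salaries = [28, 30, 31, 33, 35]

# The same data with one extremely large value added.
with_extreme = salaries + [150]

print("without extreme value:", mean(salaries), median(salaries))
print("with extreme value:   ", mean(with_extreme), median(with_extreme))
```

Adding the single extreme value moves the mean by almost 20 while the median moves by only 1, which is the shift toward the tail described above.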

Mode.

The mode is particularly useful for describing qualitative data. The modal category is the class that occurs most frequently. Because it emphasises data concentrations, the mode is also used with quantitative data sets to locate the region in which much of the data is concentrated. However, for some quantitative data sets the mode may not be very meaningful, since there may be more than one mode in the sample. A more meaningful measure is the modal class, which can be obtained from a relative frequency histogram. For most data, however, the mean and median provide more descriptive information than the mode. See Fig. 2.

Fig. 2: Positively skewed distribution, with the positions of the mode, median and mean marked.
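The three uses of the mode discussed in this section can be sketched as follows. The category and score lists are hypothetical; the grouped frequencies are those of the distribution tabulated with Fig. 1.

```python
from statistics import multimode

# Hypothetical qualitative responses: the modal category is the most frequent class.
responses = ["B", "A", "B", "C", "B", "A"]
modal_categories = multimode(responses)

# Hypothetical quantitative scores with two modes, where "the" mode is less meaningful.
scores = [3, 5, 5, 7, 9, 9, 10]
modes = multimode(scores)

# Modal class of the grouped distribution tabulated with Fig. 1:
# the class with the highest frequency.
class_labels = ["100 < 110", "110 < 120", "120 < 130", "130 < 140",
                "140 < 150", "150 < 160", "160 or more"]
frequencies = [1, 4, 7, 13, 7, 3, 1]
modal_class = class_labels[frequencies.index(max(frequencies))]

print(modal_categories, modes, modal_class)
```

`multimode` returns every value tied for the highest count, which makes the "more than one mode" case explicit.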

Standard Deviation.

These measures of central tendency provide only a partial description of a quantitative data set. The description is incomplete without a measure of the variability of the data set. Knowledge of the data’s variability along with its centre can help us visualise the shape of the data as well as its extreme values. The simplest measure of the variability of a quantitative data set is its range. This is equal to the largest measurement less the smallest. The range is easy to compute and to understand, but it is an insensitive measure of data variation when the data sets are large: two data sets can have the same range but be vastly different with respect to data variation. A more sensitive measure of variation can be obtained from the distance between each measurement and the mean. Because positive and negative deviations would otherwise cancel, the deviations are squared to provide the sample variance, $s^2 = \sum (x_i - \bar{x})^2 / (n - 1)$. This is the preliminary step in calculating the standard deviation of the data set, $\sqrt{s^2}$.

Sample statistics like $s^2$ are primarily used to estimate population parameters like $\sigma^2$; $(n - 1)$ is preferred to $n$ when defining the sample variance, since dividing by $n$ would tend to underestimate $\sigma^2$. To understand how the standard deviation provides a measure of the variability of a data set, it is necessary to determine how many measurements fall within 1, 2 or 3 standard deviations of the mean. The empirical rule for interpreting the standard deviation of data that is bell shaped and symmetrical (where the mean, median and mode are approximately the same) is that approximately 68% of measurements fall within 1 standard deviation of the mean ($\mu \pm \sigma$), 95% fall within 2 standard deviations ($\mu \pm 2\sigma$) and 99.7% fall within 3 standard deviations ($\mu \pm 3\sigma$), for populations.
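The calculation of the sample variance and standard deviation, and the counting of measurements within a given number of standard deviations of the mean, can be sketched as below. The data are hypothetical; with so few measurements the proportions should not be expected to match the empirical rule, which applies to bell-shaped distributions.

```python
from math import sqrt
from statistics import variance

# Hypothetical measurements, invented for illustration.
data = [4, 8, 6, 5, 3, 7, 9, 6]

n = len(data)
x_bar = sum(data) / n

# Sample variance: squared deviations from the mean, divided by (n - 1).
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)
s = sqrt(s2)

def share_within(k):
    """Proportion of measurements within k standard deviations of the mean."""
    return sum(1 for x in data if abs(x - x_bar) <= k * s) / n

print("range:", max(data) - min(data))
print("variance:", s2, "standard deviation:", s)
for k in (1, 2, 3):
    print(f"within {k} s.d.: {share_within(k):.0%}")
```

The standard library's `statistics.variance` also divides by $(n - 1)$, so it can be used to check the hand calculation.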

In addition to the above, it may be of interest to describe the relative location of a particular measurement within a quantitative data set. One such method is the percentile ranking. Percentile rankings are of practical value only for large data sets. The measurements are ranked in order and a rule is selected to define the location of each percentile. For example, if your company reports that its annual sales are in the 75th percentile of all companies in the industry, the implication is that 75% of all companies have annual sales less than your company's and 25% have annual sales exceeding your company's.
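As the text notes, a rule must be selected to locate a percentile. The sketch below uses one simple convention, the percentage of measurements strictly below a given value; this convention and the industry sales figures are assumptions made for illustration, and other rules give slightly different answers.

```python
def percentile_rank(values, x):
    """Percentage of measurements strictly below x (one possible convention)."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Hypothetical annual sales of 20 companies in an industry (arbitrary units).
industry_sales = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35,
                  38, 40, 42, 45, 48, 50, 55, 60, 70, 90]

# Hypothetical company with annual sales of 50.
rank = percentile_rank(industry_sales, 50)
print(f"Sales of 50 are at the {rank:.0f}th percentile")
```

Here 15 of the 20 companies have sales below 50, which corresponds to the 75th percentile in the example above.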


Time Series Plot

Most of the above methods have been concerned with describing the information contained in a sample or population of data. Often these data are viewed as having been produced at a similar point in time. Therefore, time has not been a factor.

However, data of interest to managers are often produced over a time period. When data are produced over time it is important to record both the measurements and the time period associated with each measurement. With this information a time series plot can be constructed. The plot describes the time series data, helps us learn about the process that generated the data, and allows us to monitor the movement (trend) and changes (variations) in the variable being examined. This type of information would not be revealed by most other graphical displays. See Fig. 3.

Fig. 3: Time series plot of original data v trend. The variable (0 to 300) is on the vertical axis and quarters (0 to 18) on the horizontal axis.
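The article does not state how the trend line in Fig. 3 was obtained. One common approach for quarterly data, assumed here purely for illustration, is a centred four-quarter moving average; the series below is likewise hypothetical.

```python
def centred_moving_average(series, period=4):
    """Centred moving average for an even period (e.g. 4 quarters).

    Averages each run of `period` values, then averages successive pairs of
    those averages so that each trend value lines up with an actual quarter.
    """
    averages = [sum(series[i:i + period]) / period
                for i in range(len(series) - period + 1)]
    return [(a + b) / 2 for a, b in zip(averages, averages[1:])]

# Hypothetical quarterly observations over two years, invented for illustration.
quarterly = [100, 120, 140, 110, 120, 140, 160, 130]

trend = centred_moving_average(quarterly)

# The first trend value corresponds to quarter 3, the next to quarter 4, and so on.
for quarter, value in enumerate(trend, start=3):
    print(quarter, quarterly[quarter - 1], value)
```

Plotting `quarterly` and `trend` against the quarter number gives a display of the same kind as Fig. 3, with the seasonal variation visible as the gap between the two lines.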


Lies, damned lies and statistics. Distorting the Truth.

Many of the presentations outlined can misrepresent or distort the data presented, or allow the target audience to misinterpret it. One common way to change the impression created by a pictorial or graphical presentation is to change the scale on one or both axes. Stretching the vertical axis, or increasing the distance between vertical units, can give a misleading visual impression of the data. A histogram may appear vertically elongated and horizontally compressed, or vice versa, and this may lead to incorrect conclusions. A visual distortion can be achieved with bar graphs by making the width of the bars proportional to the height. A similar effect can be achieved by using a scale break for the vertical axis. Further distortions can also occur with numerical descriptive measures. If only a measure of central tendency is reported for a sample, this can distort the information. Both a measure of central tendency and a measure of variability are needed to obtain an accurate mental image of a data set. The conclusion! Look at graphical descriptions with a critical eye, ignore the visual changes and concentrate on the actual numerical changes associated with the graph or chart.
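Two of these distortions can be put into numbers. The sketch below, using hypothetical figures, shows how starting the vertical axis above zero exaggerates the apparent difference between two bars, and how two data sets with the same mean can have very different variability.

```python
from statistics import mean, stdev

def drawn_height_ratio(a, b, axis_start=0):
    """Ratio of the drawn bar heights for values a and b when the vertical
    axis starts at axis_start instead of zero."""
    return (b - axis_start) / (a - axis_start)

# Hypothetical sales figures for two years.
last_year, this_year = 100, 110

full_axis = drawn_height_ratio(last_year, this_year)                  # axis from 0
broken_axis = drawn_height_ratio(last_year, this_year, axis_start=95)  # axis from 95

print(f"actual increase: {this_year / last_year - 1:.0%}")
print(f"bar height ratio, full axis: {full_axis:.1f}, broken axis: {broken_axis:.1f}")

# Two hypothetical data sets with the same mean but different spread.
steady = [48, 49, 50, 51, 52]
volatile = [10, 30, 50, 70, 90]
print(mean(steady), round(stdev(steady), 2))
print(mean(volatile), round(stdev(volatile), 2))
```

With the broken axis a 10% increase is drawn as a bar three times as tall, and reporting only the mean of 50 would hide the difference between the two data sets.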
