Uploaded by Yussuf Abdullahi

Summarizing and Analyzing Data: Big Data & Statistics

Summarizing and Analyzing Data
Big Data
Grouped and ungrouped data
Probabilities and expected values
Normal distribution
The standard normal distribution
Using the normal distribution to calculate probabilities
1. Big Data
Big Data refers to the mass of data that society creates each year,
extending far beyond the traditional financial and enterprise data
created by companies. Sources of Big Data include social
networking sites, internet search engines, and mobile devices.
Big Data is a term for extremely large collections of data that may
be analysed to reveal patterns, trends and associations.
The ability to harness these vast amounts of information could
transform an organisation’s performance management.
However, many conventional methods of storing and
processing data will not work.
The Sources of Big Data
Social data:
Comes from the Likes, Tweets & Retweets, Comments, Video Uploads, and
general media that are uploaded and shared via the world’s favorite social media
platforms. This kind of data provides invaluable insights into consumer behavior
and sentiment and can be enormously influential in marketing analytics. The
public web is another good source of social data, and tools like Google Trends can
be used to good effect to increase the volume of big data.
Machine data
Defined as information which is generated by industrial equipment, sensors
that are installed in machinery, and even web logs which track user behavior.
This type of data is expected to grow exponentially as the internet of things
grows ever more pervasive and expands around the world. Sensors such as
medical devices, smart meters, road cameras, satellites, games and the
rapidly growing Internet Of Things will deliver high velocity, value, volume
and variety of data in the very near future.
Transactional data
Generated from all the daily transactions that take place both online and
offline. Invoices, payment orders, storage records, delivery receipts – all are
characterized as transactional data. Yet data alone is almost meaningless, and
most organizations struggle to make sense of the data that they are
generating and how it can be put to good use.
Characteristics of Big Data
Volume. The scale of information which can now be created and
stored is staggering. Advancing technology has allowed embedded
sensors to be placed in everyday items such as cars, video games and
refrigerators. Mobile devices have led to an increasingly networked
world where people's consumer preferences, spending habits, and
even their movements can be recorded. Advances in data storage
technology as well as a fall in price of this storage has allowed for the
captured data to be stored for further analysis.
Velocity. Timeliness is a key factor in the usefulness of financial
information to decision makers, and it is no different for the users of
Big Data. One source of high-velocity data is Twitter.
Variety. Big Data consists of both structured and unstructured data.
Processing Big Data
The processing of Big Data is known as Big Data analytics.
For example:
Hadoop software allows the processing of large data sets by
utilising multiple servers simultaneously.
Google Analytics tracks many features of website traffic.
The uses of Big Data
Big Data is an emerging technology that has implications
across all business departments. It involves the collection
and analysis of large amounts of data to find trends,
understand customer needs and help organisations to focus
resources more effectively and to make better decisions.
Big Data and business value
Business value is measured in many ways, such as profit,
shareholder value, brand value and intellectual value. Big
Data can be used to analyse opportunities to increase
revenue and reduce costs, thereby increasing profit. For
example, a holiday company can use Big Data to analyse
trends in where tourists are visiting in order to improve the
range of holiday locations that it offers. It can reduce its
offering in unpopular areas and increase its offering of
popular areas and increase its revenue.
Big Data and the customer
Understanding the customer is a key benefit of Big Data
analytics. By understanding the customer, the business can
respond to their needs and tailor the customer experience
to be more personal and therefore improve customer loyalty.
Big Data and corporate strategy
To be successful, Big Data must fit into the organisation's overall aims
and objectives. After identifying how business value can be improved,
and the requirements of the customer, business priorities can be
determined – for example, which markets or customers are the most
important in terms of increasing business value.
Big Data is a key source of innovation, helping to create new
products and services. The volume and velocity of data help speed up
decision making. This means that Big Data can help create new
sources of income for a business and contribute to an improvement in
the organisation's competitive advantage.
Big Data and performance
It can help the organisation to understand its customers’ needs and
It can improve forecasting so that more appropriate decisions can be made
It can help the organisation to automate business processes
It can help to provide more detailed, relevant and up to date performance
Effect of Big Data on decisions
Decisions can be made quickly.
Businesses can respond earlier to environmental changes and be more
flexible in their response.
Decisions can be based on the current situation but also have an element
of taking potential future situations into account.
Decisions are made on hard data evidence that can be quantified.
Decisions can be made on a collaborative basis because data is easily
shared and converted from one form into another.
'Outside the box' decisions are more likely because all factors are
taken into account, not just the ones managers think of.
Benefits of Big Data analytics
Criticisms of Big Data
Big Data is simply a buzzword, a vague term that has turned into an
obsession in large organisations and the media.
There is a focus on finding correlations between data sets and less of
an emphasis on causation.
Security and data protection
Personnel issues: the skills to use Big Data systems are not always available
There may be technical difficulties involved when integrating new Big
Data systems with existing technology
Time spent measuring relationships that have no organisational value.
Poor veracity leading to incorrect conclusions
Cost of establishing hardware and analytics software
2. Grouped and ungrouped data
Grouped data is data where the frequency is shown in terms
of a range. Ungrouped data is data where the frequency is
shown in terms of a specific measure or value.
Discrete data can only take on a countable number of
values. Continuous data can take on any value.
If there is a large set of data or if every (or nearly every)
data item is different, it is often convenient to group
frequencies (how often data occurs) together into bands or classes.
Frequency distributions
Frequency diagrams are used if values of particular variables occur
more than once.
Frequently the data collected from a statistical survey or investigation
is simply a mass of numbers.
Many sets of data, however, contain a limited number of data values,
even though there may be many occurrences of each value. It can
therefore be useful to organise the data into what is known as a
frequency distribution (or frequency table) which records the
number of times each value occurs (the frequency).
Individual data items are arranged in a table showing the frequency
with which each individual data item occurs.
Output of employees in one week in units:
Construct a frequency distribution for the above data (the
output in units of 20 employees during one week).
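The output figures for this exercise are not reproduced in these notes. As a minimal sketch of the method, the tally below uses a short list of made-up values (the name `outputs` and its contents are illustrative assumptions, not the original data) to build a frequency distribution.

```python
from collections import Counter

# Hypothetical weekly outputs in units (illustrative only, not the original data).
outputs = [65, 69, 70, 71, 70, 68, 69, 67, 70, 68]

# Count how many times each value occurs, then list the values in ascending order.
frequency = Counter(outputs)
frequency_table = sorted(frequency.items())

print("Output  Frequency")
for value, count in frequency_table:
    print(f"{value:>6}  {count:>9}")
```

The frequencies should always add up to the number of data items, which is a quick check that nothing has been missed or counted twice.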
Grouped frequency distributions
If there is a large set of data or if every (or nearly every) data
item is different, it is often convenient to group frequencies
together into bands or classes.
Once items have been 'grouped' in this way their individual values
are lost. As well as being used for discrete variables (as above),
grouped frequency distributions (or grouped frequency tables) can
be used to present data for continuous variables.
There is an open-ended class at the end of the range. Class
intervals must be carefully considered so that they capture all of
the data once (and only once!).
For example, suppose that the output produced by a group of 20 employees during one
week is as follows, in units
The range of output from the lowest to the highest producer is 792 to 1,265, a range of 473
units. This range could be divided into classes of say, 100 units (the class width or class
interval), and the number of employees producing output within each class could then be
grouped into a single frequency, as follows:
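The original figures and class table are not reproduced in these notes. The sketch below shows the grouping step on made-up data: only the lowest value (792), the highest value (1,265) and the class width of 100 come from the text, and every other number is an assumption chosen for illustration.

```python
from collections import Counter

# Hypothetical outputs for 20 employees. Only the minimum (792) and the
# maximum (1265) are taken from the text; the rest are invented.
outputs = [
    1087, 850, 1084, 792, 924, 1226, 1012, 1205, 1265, 1028,
    1230, 1182, 1086, 1130, 989, 1155, 1134, 1166, 1129, 1160,
]

class_width = 100

# Map each value to the lower limit of its class, e.g. 792 -> 700.
# Each class is "lower limit and less than upper limit", so every
# value is captured once and only once.
class_counts = Counter((x // class_width) * class_width for x in outputs)

grouped = [
    (lower, lower + class_width, class_counts[lower])
    for lower in sorted(class_counts)
]

for lower, upper, count in grouped:
    print(f"{lower:>5} and less than {upper:<5} {count:>3}")
```

Defining each class as "lower limit and less than upper limit" avoids the ambiguity of a value such as 800 falling into two classes.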
Cumulative frequency distributions
A cumulative frequency distribution (or cumulative
frequency table) can be used to show the total number of
times that a value above or below a certain amount occurs.
Cumulative frequencies are obtained by adding the
individual frequencies together.
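As a small sketch of that running total, assuming a made-up grouped distribution (the class limits and frequencies below are illustrative, not taken from the notes):

```python
from itertools import accumulate

# Hypothetical grouped distribution: upper class limits and class frequencies.
upper_limits = [800, 900, 1000, 1100, 1200, 1300]
frequencies = [1, 1, 2, 5, 7, 4]

# Running total of the frequencies: the number of items below each upper limit.
cumulative = list(accumulate(frequencies))

for limit, total in zip(upper_limits, cumulative):
    print(f"less than {limit:>5}: {total:>3}")
```

The last cumulative frequency always equals the total number of items, here 20.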
A cumulative frequency distribution can be graphed as an ogive.
The ogive is drawn by plotting the cumulative frequencies
on the graph, and joining them with straight lines.
Although many ogives are more accurately curved lines, you
can use straight lines to make them easier to draw. An ogive
drawn with straight lines may be referred to as a cumulative
frequency polygon (or cumulative frequency diagram)
whereas one drawn as a curve may be referred to as a
cumulative frequency curve.
Consider the following frequency distribution.
Draw an ogive
A frequency distribution can be represented pictorially by
means of a histogram. The number of observations in a class
is represented by the area covered by the bar, rather than
by its height.
A histogram with equal class intervals
If all the class intervals are the same, the bars of the histogram all
have the same width and the heights will be proportional to the
frequencies. The histogram looks almost identical to a bar chart except
that the bars are joined together. Because the bars are joined together,
when presenting discrete data the data must be treated as continuous so
that there are no gaps between class intervals.
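When class intervals are not all the same width, the bar height has to be adjusted so that the area still represents the frequency. The sketch below assumes the common approach of dividing each frequency by its class width; the classes and frequencies are made up for illustration.

```python
# Hypothetical classes with unequal widths: (lower limit, upper limit, frequency).
classes = [(0, 10, 5), (10, 20, 8), (20, 40, 12), (40, 80, 8)]

# Bar height = frequency / class width, so that height x width = frequency.
heights = [freq / (upper - lower) for lower, upper, freq in classes]

# Multiplying each height by its width recovers the original frequency.
areas = [h * (upper - lower) for h, (lower, upper, _) in zip(heights, classes)]

tallest_bar = heights.index(max(heights))
highest_frequency = max(range(len(classes)), key=lambda i: classes[i][2])

print("Class with the tallest bar:", classes[tallest_bar][:2])
print("Class with the highest frequency:", classes[highest_frequency][:2])
```

In this made-up data the class 20 to 40 has the highest frequency, but the class 10 to 20 has the tallest bar because its observations are packed into a narrower interval.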
3. Averages
(a) The arithmetic mean
The arithmetic mean is the best known type of
average and is widely understood.
Illustration: the arithmetic mean of data in a frequency distribution
The arithmetic mean of grouped data
To calculate the arithmetic mean of grouped data we therefore need to decide on a
value which best represents all of the values in a particular class interval. This value
is known as the mid-point.
The mid-point of each class interval is conventionally taken, on the
assumption that the frequencies occur evenly over the class interval
range. In the example above, the variable is discrete, so the first class
includes 1, 2, 3, 4 and 5, giving a mid-point of 3. With a continuous
variable, the mid-points would have been 2.5, 7.5 and so on. Once the
value of x has been decided, the mean is calculated using the formula for
the arithmetic mean of grouped data.
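The example and formula referred to here are not reproduced in these notes. The sketch below assumes the usual formula, mean = Σfx / Σf, where x is the class mid-point and f the class frequency. The class limits follow the discrete case described above (first class 1 to 5, mid-point 3), while the frequencies are made up.

```python
# Hypothetical discrete classes: (lower limit, upper limit, frequency).
# The frequencies are invented for illustration.
classes = [(1, 5, 4), (6, 10, 10), (11, 15, 6)]

# Mid-point x of each class, e.g. (1 + 5) / 2 = 3 for the first class.
mid_points = [(lower + upper) / 2 for lower, upper, _ in classes]

# Sum of f, and sum of f times x.
total_f = sum(freq for _, _, freq in classes)
total_fx = sum(x * freq for x, (_, _, freq) in zip(mid_points, classes))

mean = total_fx / total_f
print("Mid-points:", mid_points)
print("Estimated mean:", mean)
```

Because the mid-point stands in for every value in its class, the result is an estimate of the mean rather than the exact figure.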
(b) The mode
The mode or modal value is an average which means 'the most
frequently occurring value'.
The mode of a grouped frequency distribution can be calculated
from a histogram. The modal class is always the class with the
tallest bar. This may not be the class with the highest frequency if
the classes do not all have the same width. Hence the mode in a
grouped frequency distribution is only an estimate.
(c) The median
The median is the value of the middle member of an array (the values arranged
in order). The middle item of an odd number of items, n, is the ((n+1)/2)th item.
With an even number of items, we normally take the arithmetic mean of the
two middle ones.
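Both cases can be sketched in a few lines. The function name `median` and the sample values are illustrative assumptions.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)      # the array: values arranged in order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th item, which is index n // 2 counting from 0.
        return ordered[middle]
    # Even n: the arithmetic mean of the two middle items.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([8, 3, 5]))       # middle item of 3, 5, 8
print(median([1, 9, 4, 6]))    # mean of the two middle items, 4 and 6
```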
The median of an ungrouped frequency distribution
The median of an ungrouped frequency distribution is found in a similar way
using cumulative frequencies. Consider the following distribution.
The median of a grouped frequency distribution
The median of a grouped frequency distribution can be
established from an ogive.
Construct an ogive of the following frequency distribution
and hence establish the median.
Since the values are assumed to be spread evenly within
each class, the median calculated is only approximate.
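The frequency distribution for this exercise is not reproduced in these notes. Reading the median off an ogive drawn with straight lines amounts to linear interpolation between two plotted points, which can be sketched as follows. The function name, the made-up ogive points and the choice to read across from half the total frequency are all assumptions for illustration.

```python
def median_from_ogive(points):
    """Estimate the median from (value, cumulative frequency) points.

    The points must be in ascending order and start at the lower limit
    of the first class with a cumulative frequency of 0.
    """
    target = points[-1][1] / 2    # read across from half the total frequency
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c1 > c0 and c0 <= target <= c1:
            # Straight-line interpolation along this segment of the ogive.
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("points do not describe a rising ogive")


# Hypothetical ogive: 20 items in classes of width 100 from 700 to 1300.
ogive = [(700, 0), (800, 1), (900, 2), (1000, 4),
         (1100, 9), (1200, 16), (1300, 20)]

estimate = median_from_ogive(ogive)
print(round(estimate, 1))
```

Here the target cumulative frequency of 10 falls in the class 1,100 to 1,200, so the estimate lies one seventh of the way along that class.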
4. Dispersion
Averages are a method of determining the 'location' or
central point of a set of data but they give no information
about the dispersion of values. Measures of dispersion give
some idea of the spread of a variable about its average.
(a) Standard deviation
Standard deviation (σ) is one of the most important measures of dispersion.
The standard deviation measures the spread of data around the mean.
The variance is the square of the standard deviation (variance = σ²).
In general, the larger the standard deviation value in relation to the mean,
the more dispersed the data.
Calculate the mean, variance and standard deviation.
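The data for this exercise is not reproduced in these notes. As a sketch of the calculation on made-up values, the code below assumes the population formulas, dividing by n; a sample calculation would divide by n - 1 instead.

```python
import math

# Hypothetical data set (illustrative only).
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n

# Variance: the average of the squared deviations from the mean.
variance = sum((x - mean) ** 2 for x in values) / n

# Standard deviation: the square root of the variance.
std_dev = math.sqrt(variance)

print("Mean:", mean)
print("Variance:", variance)
print("Standard deviation:", std_dev)
```

For this data the squared deviations sum to 32, so the variance is 32 / 8 = 4 and the standard deviation is 2.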
(b) Coefficient of variation
The spreads of two distributions can be compared using the coefficient of
variation.
Coefficient of variation (coefficient of relative spread)
= standard deviation ÷ mean
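A short sketch of the comparison, assuming the usual definition (standard deviation divided by the mean). The function name and the two distributions are hypothetical.

```python
def coefficient_of_variation(mean, std_dev):
    """Relative spread: the standard deviation as a proportion of the mean."""
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return std_dev / mean


# Hypothetical distributions: A has the larger standard deviation in absolute terms.
cv_a = coefficient_of_variation(mean=100, std_dev=10)
cv_b = coefficient_of_variation(mean=20, std_dev=5)

print(f"A: {cv_a:.0%}  B: {cv_b:.0%}")
```

Although A has the larger standard deviation, B is the more widely spread relative to its mean, which is the point of using a relative measure.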
5. Probabilities and expected values
An expected value is a weighted average value of the
different possible outcomes from a decision, where
weightings are based on the probability of each possible
outcome. Expected values indicate what an outcome is
likely to be in the long term, if the decision can be
repeated many times over. Fortunately, many business
transactions do occur over and over again.
Expected values
Where probabilities are assigned to different outcomes we can measure the
weighted average value of the different possible outcomes. Each possible
outcome is given a weighting equal to the probability that it will occur.
If the probability of an outcome of an event is p, then the expected number
of times that this outcome will occur in x events (the expected value) is equal
to x × p (the number of events multiplied by the probability).
The expected value (EV) decision rule is that the decision option with the
highest EV of benefit or the lowest EV of cost should be selected.
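The rule can be sketched as below. The two options, their probabilities and their profits are made up for illustration, and the names `options` and `expected_value` are assumptions.

```python
# Hypothetical decision options: each is a list of (probability, profit) outcomes.
options = {
    "A": [(0.5, 5000), (0.5, 1000)],
    "B": [(0.2, 10000), (0.8, 1500)],
}


def expected_value(outcomes):
    """Weighted average of the outcomes, weighted by their probabilities."""
    return sum(probability * value for probability, value in outcomes)


evs = {name: expected_value(outcomes) for name, outcomes in options.items()}

# EV decision rule for benefits: choose the option with the highest EV.
best = max(evs, key=evs.get)

for name, ev in evs.items():
    print(f"Option {name}: EV of profit = {ev:,.0f}")
print("Selected option:", best)
```

Note that the chosen option's EV is a long-run average: on any single occasion the actual profit will be one of the listed outcomes, not the EV itself.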
6. Normal distribution
A probability distribution is an analysis of the proportion of times each
particular value occurs in a set of items. There are a number of different
probability distributions, but the focus here is the normal distribution.
In calculating the probability of x, (written as P(x)), x can be any value,
and does not have to be a whole number.
The normal distribution can also apply to discrete variables which can
take many possible values. For example, the volume of sales, in units,
of a product might be any whole number in the range 100 – 5,000 units.
There are so many possibilities within this range that the variable is for
all practical purposes continuous.
Graphing the normal distribution
The normal distribution can be drawn as a graph, and it would
be a bell-shaped curve.
Properties of the normal distribution
It is symmetrical and bell-shaped
It has a mean, μ (pronounced mew)
The area under the curve totals exactly 1
The area to the left of μ = area to the right of μ = 0.5
Importance of the normal distribution
The normal distribution is important because in the practical
application of statistics, it has been found that many
probability distributions are close enough to a normal
distribution to be treated as one without any significant loss
of accuracy. This means that the normal distribution can be
used as a tool in business decision making involving probabilities.
7. The standard normal distribution
For any normal distribution, the dispersion around the mean (μ) of the
frequency of occurrences can be measured exactly in terms of the standard
deviation (σ).
The standard normal distribution has a mean (μ) of 0 and a standard
deviation (σ) of 1.
In general 68% of values are within one standard deviation
(between –1 and 1), 95% of values are within two standard
deviations (between –2 and 2) and 99.7% of values are
within three standard deviations (between –3 and 3).
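These rounded percentages can be checked numerically. The sketch below builds the standard normal cumulative probability from the error function in Python's `math` module; the function names are illustrative assumptions.

```python
import math


def standard_normal_cdf(z):
    """Probability that a standard normal variable is less than or equal to z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def share_within(k):
    """Proportion of values within k standard deviations of the mean."""
    return standard_normal_cdf(k) - standard_normal_cdf(-k)


for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {share_within(k):.1%}")
```

The exact proportions are slightly different from the rounded figures (for example, a little over 95% for two standard deviations), which is why tables use 1.96 rather than 2 for a 95% range.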
Normal distribution tables
Although there is an infinite number of normal distributions, depending
on the values of the mean (μ) and the standard deviation (σ), the relative
dispersion of frequencies around the mean, measured as proportions
of the total population, is exactly the same for all normal distributions.
In other words, whatever the normal distribution, 47.5% of outcomes will
always be in the range between the mean and 1.96 standard deviations
below the mean, 49.5% of outcomes will always be in the range between
the mean and 2.58 standard deviations below the mean and so on.
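The two table values quoted above can be reproduced with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The helper name `area_between_mean_and` is an illustrative assumption.

```python
from statistics import NormalDist

standard_normal = NormalDist()    # mean 0, standard deviation 1


def area_between_mean_and(z):
    """Proportion of outcomes between the mean and z standard deviations from it."""
    # By symmetry the area is the same whether z is taken above or below the mean.
    return standard_normal.cdf(abs(z)) - 0.5


for z in (1.96, 2.58):
    print(f"z = {z}: {area_between_mean_and(z):.4f}")
```

Doubling each figure gives the familiar ranges: about 95% of outcomes lie within 1.96 standard deviations of the mean, and about 99% within 2.58.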
8. Using the normal distribution to calculate probabilities
In order to calculate probabilities, we need to convert a
normal distribution (X) with a mean (μ) and standard
deviation (σ), to the standard normal distribution (z) before
using the table to find the probability figure. The normal
distribution is, in fact, a way of calculating probabilities.
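The conversion uses the standard formula z = (x - μ) / σ. The sketch below applies it to a made-up case (the mean, standard deviation and value of x are assumptions) and uses `statistics.NormalDist` in place of a printed table.

```python
from statistics import NormalDist

# Hypothetical normal distribution X: weekly sales with mean 1,000 units
# and standard deviation 100 units.
mu, sigma = 1000, 100
x = 1196

# Step 1: convert x to a z score, the number of standard deviations from the mean.
z = (x - mu) / sigma

# Step 2: look up the probability, here computed rather than read from a table.
p_below = NormalDist().cdf(z)      # P(X <= x)
p_between = p_below - 0.5          # P(mean <= X <= x)
p_above = 1 - p_below              # P(X > x)

print(f"z = {z:.2f}")
print(f"P(X <= {x}) = {p_below:.4f}")
print(f"P({mu} <= X <= {x}) = {p_between:.4f}")
print(f"P(X > {x}) = {p_above:.4f}")
```

Here z works out at 1.96, so the probability between the mean and x is about 0.475, matching the table value quoted earlier for 1.96 standard deviations.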