Summarizing and Analyzing Data: Big Data & Statistics

Overview

1) Big Data
2) Grouped and ungrouped data
3) Averages
4) Dispersion
5) Probabilities and expected values
6) Normal distribution
7) The standard normal distribution
8) Using the normal distribution to calculate probabilities

1. Big Data

Big Data refers to the mass of data that society creates each year, extending far beyond the traditional financial and enterprise data created by companies. Sources of Big Data include social networking sites, internet search engines, and mobile devices.

Big Data is a term for extremely large collections of data that may be analysed to reveal patterns, trends and associations. The ability to harness these vast amounts of information could transform an organisation's performance management. However, many conventional methods of storing and processing data will not work.

The Sources of Big Data

Social data: Comes from the Likes, Tweets & Retweets, Comments, Video Uploads, and general media that are uploaded and shared via the world's favorite social media platforms. This kind of data provides invaluable insights into consumer behavior and sentiment and can be enormously influential in marketing analytics. The public web is another good source of social data, and tools like Google Trends can be used to good effect to increase the volume of big data.

Machine data: Defined as information which is generated by industrial equipment, sensors that are installed in machinery, and even web logs which track user behavior. This type of data is expected to grow exponentially as the internet of things grows ever more pervasive and expands around the world. Sensors such as medical devices, smart meters, road cameras, satellites, games and the rapidly growing Internet of Things will deliver high velocity, value, volume and variety of data in the very near future.

Transactional data: Generated from all the daily transactions that take place both online and offline.
Invoices, payment orders, storage records and delivery receipts are all characterized as transactional data. Yet data alone is almost meaningless, and most organizations struggle to make sense of the data that they are generating and how it can be put to good use.

Characteristics of Big Data

• Volume. The scale of information which can now be created and stored is staggering. Advancing technology has allowed embedded sensors to be placed in everyday items such as cars, video games and refrigerators. Mobile devices have led to an increasingly networked world where people's consumer preferences, spending habits, and even their movements can be recorded. Advances in data storage technology, as well as a fall in the price of this storage, have allowed the captured data to be stored for further analysis.
• Velocity. Timeliness is a key factor in the usefulness of financial information to decision makers, and it is no different for the users of Big Data. One source of high-velocity data is Twitter.
• Variety. Big Data consists of both structured and unstructured data.

Processing Big Data

The processing of Big Data is known as Big Data analytics. For example:
• Hadoop software allows the processing of large data sets by utilising multiple servers simultaneously.
• Google Analytics tracks many features of website traffic.

The uses of Big Data

Big Data is an emerging technology that has implications across all business departments. It involves the collection and analysis of large amounts of data to find trends, understand customer needs and help organisations to focus resources more effectively and to make better decisions.

Big Data and business value

Business value is measured in many ways, such as profit, shareholder value, brand value and intellectual value. Big Data can be used to analyse opportunities to increase revenue and reduce costs, thereby increasing profit.
For example, a holiday company can use Big Data to analyse trends in where tourists are visiting in order to improve the range of holiday locations that it offers. It can reduce its offering in unpopular areas and increase its offering in popular areas, and so increase its revenue.

Big Data and the customer

Understanding the customer is a key benefit of Big Data analytics. By understanding the customer, the business can respond to their needs and tailor the customer experience to be more personal and therefore improve customer loyalty.

Big Data and corporate strategy

• To be successful, Big Data must fit into the organisation's overall aims and objectives. After identifying how business value can be improved, and the requirements of the customer, business priorities can be determined: for example, which markets or customers are the most important in terms of increasing business value.
• Big Data is a key source of innovation, helping to create new products and services. The volume and velocity of data help to speed up decision making. This means that Big Data can help create new sources of income for a business and contribute to an improvement in the organisation's competitive advantage.

Big Data and performance management

• It can help the organisation to understand its customers' needs and preferences.
• It can improve forecasting so that more appropriate decisions can be made.
• It can help the organisation to automate business processes.
• It can help to provide more detailed, relevant and up-to-date performance measurement.

Effect of Big Data on decisions

• Decisions can be made quickly.
• Businesses can respond earlier to environmental changes and be more flexible in their response.
• Decisions can be based on the current situation but can also take potential future situations into account.
• Decisions are made on hard data evidence that can be quantified.
• Decisions can be made on a collaborative basis because data is easily shared and converted from one form into another.
• 'Outside the box' decisions are more likely because all factors are taken into account, not just the ones managers think of.

Benefits of Big Data analytics

Criticisms of Big Data

• Big Data is simply a buzzword, a vague term that has turned into an obsession in large organisations and the media.
• There is a focus on finding correlations between data sets and less of an emphasis on causation.
• Security and data protection.
• Privacy.
• Personnel issues: the skills needed to use Big Data systems are not always available.
• There may be technical difficulties involved when integrating new Big Data systems with existing technology.
• Time may be spent measuring relationships that have no organisational value.
• Poor veracity may lead to incorrect conclusions.
• The cost of establishing hardware and analytics software.

2. Grouped and ungrouped data

Grouped data is data where the frequency is shown in terms of a range. Ungrouped data is data where the frequency is shown in terms of a specific measure or value.

Discrete data can only take on a countable number of values. Continuous data can take on any value within a range.

Frequency distributions

• Frequency diagrams are used if values of particular variables occur more than once.
• Frequently the data collected from a statistical survey or investigation is simply a mass of numbers.
• Many sets of data, however, contain a limited number of data values, even though there may be many occurrences of each value. It can therefore be useful to organise the data into what is known as a frequency distribution (or frequency table) which records the number of times each value occurs (the frequency).
• Individual data items are arranged in a table showing the frequency with which each individual data item occurs.

Illustration

Output of employees in one week in units:

Construct a frequency distribution for the above data (the output in units of 20 employees during one week).

Grouped frequency distributions

If there is a large set of data or if every (or nearly every) data item is different, it is often convenient to group frequencies together into bands or classes. Once items have been 'grouped' in this way their individual values are lost.

As well as being used for discrete variables (as above), grouped frequency distributions (or grouped frequency tables) can be used to present data for continuous variables. There may be an open-ended class at the end of the range. Class intervals must be carefully considered so that they capture all of the data once (and only once!).

Illustration

Suppose that the output produced by a group of 20 employees during one week is as follows, in units:

The range of output from the lowest to the highest producer is 792 to 1,265, a range of 473 units. This range could be divided into classes of, say, 100 units (the class width or class interval), and the number of employees producing output within each class could then be grouped into a single frequency, as follows:

Cumulative frequency distributions

A cumulative frequency distribution (or cumulative frequency table) can be used to show the total number of times that a value above or below a certain amount occurs. Cumulative frequencies are obtained by adding the individual frequencies together.

Ogives

A cumulative frequency distribution can be graphed as an ogive. The ogive is drawn by plotting the cumulative frequencies on the graph, and joining them with straight lines.
Although many ogives are more accurately curved lines, you can use straight lines to make them easier to draw. An ogive drawn with straight lines may be referred to as a cumulative frequency polygon (or cumulative frequency diagram) whereas one drawn as a curve may be referred to as a cumulative frequency curve.

Illustration

Consider the following frequency distribution. Draw an ogive.

Histograms

A frequency distribution can be represented pictorially by means of a histogram. The number of observations in a class is represented by the area covered by the bar, rather than by its height.

A histogram with equal class intervals

If all the class intervals are the same, the bars of the histogram all have the same width and the heights will be proportional to the frequencies. The histogram looks almost identical to a bar chart except that the bars are joined together. Because the bars are joined together, when presenting discrete data the data must be treated as continuous so that there are no gaps between class intervals.

3. Averages

(a) The arithmetic mean

The arithmetic mean is the best known type of average and is widely understood. It is calculated as the sum of the values of the items divided by the number of items.

Illustration: the arithmetic mean of data in a frequency distribution

The arithmetic mean of grouped data

Once data has been grouped, the individual values are lost. To calculate the arithmetic mean of grouped data we therefore need to decide on a value which best represents all of the values in a particular class interval. This value is known as the mid-point.

The mid-point of each class interval is conventionally taken, on the assumption that the frequencies occur evenly over the class interval range. In the example above, the variable is discrete, so the first class includes 1, 2, 3, 4 and 5, giving a mid-point of 3.
With a continuous variable, the mid-points would have been 2.5, 7.5 and so on. Once the value of x has been decided, the mean is calculated using the formula for the arithmetic mean of grouped data: mean = Σfx / Σf, where f is the frequency of each class.

(b) The mode

• The mode or modal value is an average which means 'the most frequently occurring value'.
• The mode of a grouped frequency distribution can be calculated from a histogram. The modal class is always the class with the tallest bar. This may not be the class with the highest frequency if the classes do not all have the same width. Hence the mode in a grouped frequency distribution is only an estimate.

(c) The median

• The median is the value of the middle member of an array. The middle item of an odd number of items (n) is calculated as the (n + 1)/2 th item.
• With an even number of items, we normally take the arithmetic mean of the two middle ones.

The median of an ungrouped frequency distribution

• The median of an ungrouped frequency distribution is found in a similar way using cumulative frequencies. Consider the following distribution.

The median of a grouped frequency distribution

• The median of a grouped frequency distribution can be established from an ogive.
• Construct an ogive of the following frequency distribution and hence establish the median.
• Since the values are assumed to be spread evenly within each class, the median calculated is only approximate.

4. Dispersion

Averages are a method of determining the 'location' or central point of a set of data but they give no information about the dispersion of values. Measures of dispersion give some idea of the spread of a variable about its average (mean).

(a) Standard deviation

• Standard deviation (σ) is one of the most important measures of dispersion. The standard deviation measures the spread of data around the mean.
• The variance is the square of the standard deviation (variance = σ²).
• In general, the larger the standard deviation value in relation to the mean, the more dispersed the data.
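
The relationship between the mean, the variance and the standard deviation can be illustrated with a short calculation. The following is a minimal Python sketch, not taken from the source: the frequency distribution and the function name mean_variance_sd are hypothetical and chosen only for illustration. For grouped data, each x would be the mid-point of a class.

```python
from math import sqrt

# Hypothetical frequency distribution (not from the source): value x -> frequency f.
frequencies = {1: 2, 2: 3, 3: 5, 4: 3, 5: 2}

def mean_variance_sd(freq):
    """Return (mean, variance, standard deviation) of a frequency distribution."""
    n = sum(freq.values())                          # total frequency, Σf
    mean = sum(x * f for x, f in freq.items()) / n  # Σfx / Σf
    # Variance: frequency-weighted average of squared deviations from the mean.
    variance = sum(f * (x - mean) ** 2 for x, f in freq.items()) / n
    return mean, variance, sqrt(variance)           # σ is the square root of the variance

mean, variance, sd = mean_variance_sd(frequencies)
print(f"mean = {mean:.2f}, variance = {variance:.2f}, standard deviation = {sd:.2f}")
```

The sketch divides by the total frequency, so it treats the data as a whole population rather than a sample.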
Formula

Standard deviation σ = √( Σ(x − mean)² / n ), where n is the number of items. For a frequency distribution, σ = √( Σfx² / Σf − mean² ).

Illustration

Calculate the mean, variance and standard deviation.

(b) Coefficient of variation

• The spreads of two distributions can be compared using the coefficient of variation.
• Coefficient of variation (coefficient of relative spread) = standard deviation / mean.

5. Probabilities and expected values

• An expected value is a weighted average value of the different possible outcomes from a decision, where weightings are based on the probability of each possible outcome. Expected values indicate what an outcome is likely to be in the long term, if the decision can be repeated many times over. Fortunately, many business transactions do occur over and over again.

Expected values

• Where probabilities are assigned to different outcomes we can measure the weighted average value of the different possible outcomes. Each possible outcome is given a weighting equal to the probability that it will occur.
• If the probability of an outcome of an event is p, then the expected number of times that this outcome will occur in x events (the expected value) is equal to p × x.
• The expected value (EV) decision rule is that the decision option with the highest EV of benefit or the lowest EV of cost should be selected.

6. Normal distribution

A probability distribution is an analysis of the proportion of times each particular value occurs in a set of items. There are a number of different probability distributions, but the focus here is the normal distribution.

In calculating the probability of x (written as P(x)), x can be any value, and does not have to be a whole number. The normal distribution can also apply to discrete variables which can take many possible values. For example, the volume of sales, in units, of a product might be any whole number in the range 100 to 5,000 units. There are so many possibilities within this range that the variable is for all practical purposes continuous.

Graphing the normal distribution

The normal distribution can be drawn as a graph, and it would be a bell-shaped curve.
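
The bell shape can be explored numerically. This is a rough Python sketch using the standard library's statistics.NormalDist; the mean of 100 and standard deviation of 15 are hypothetical values chosen only for illustration. It prints the height of the curve at the mean and at equal distances on either side of it.

```python
from statistics import NormalDist

# Hypothetical normal distribution: mean 100, standard deviation 15.
dist = NormalDist(mu=100, sigma=15)

# Height of the curve at the mean and at points either side of it.
for x in (70, 85, 100, 115, 130):
    print(x, round(dist.pdf(x), 5))

peak = dist.pdf(100)                        # the curve is highest at the mean
left, right = dist.pdf(85), dist.pdf(115)   # equal distances either side of the mean
area_below_mean = dist.cdf(100)             # area under the curve to the left of the mean
```

Under these assumptions the heights at 85 and 115 are equal and lower than the height at 100, and half of the area under the curve lies below the mean.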
Properties of the normal distribution

• It is symmetrical and bell-shaped.
• It has a mean, μ (pronounced mew).
• The area under the curve totals exactly 1.
• The area to the left of μ = area to the right of μ = 0.5.

Importance of the normal distribution

The normal distribution is important because in the practical application of statistics, it has been found that many probability distributions are close enough to a normal distribution to be treated as one without any significant loss of accuracy. This means that the normal distribution can be used as a tool in business decision making involving probabilities.

7. The standard normal distribution

• For any normal distribution, the dispersion around the mean of the frequency of occurrences can be measured exactly in terms of the standard deviation σ.
• The standard normal distribution has a mean of 0 and a standard deviation of 1.

In general, approximately 68% of values are within one standard deviation of the mean (between -1 and 1), 95% of values are within two standard deviations (between -2 and 2) and 99.7% of values are within three standard deviations (between -3 and 3).

Normal distribution tables

Although there is an infinite number of normal distributions, depending on the values of the mean μ and the standard deviation σ, the relative dispersion of frequencies around the mean, measured as proportions of the total population, is exactly the same for all normal distributions. In other words, whatever the normal distribution, 47.5% of outcomes will always be in the range between the mean and 1.96 standard deviations below the mean, 49.5% of outcomes will always be in the range between the mean and 2.58 standard deviations below the mean, and so on.

8. Using the normal distribution to calculate probabilities

In order to calculate probabilities, we need to convert a normal distribution (X) with a mean μ and standard deviation σ to the standard normal distribution (z) before using the table to find the probability figure.
The normal distribution is, in fact, a way of calculating probabilities.
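
The conversion to the standard normal distribution can be sketched as follows. This Python example is illustrative only: the sales figures and the helper name z_score are hypothetical, and statistics.NormalDist from the standard library stands in for the printed normal distribution table. The conversion used is the usual one, z = (x − μ) / σ.

```python
from statistics import NormalDist

# Hypothetical figures: weekly sales are assumed to be normally distributed
# with a mean of 500 units and a standard deviation of 50 units.
mean, sd = 500, 50

def z_score(x, mean, sd):
    """Number of standard deviations by which x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

standard_normal = NormalDist()  # mean 0, standard deviation 1

z = z_score(598, mean, sd)  # (598 - 500) / 50

# Area between the mean and z, the kind of figure a normal distribution table gives.
p_mean_to_x = standard_normal.cdf(z) - 0.5

# Probability that sales exceed 598 units: the upper tail beyond z.
p_above_x = 1 - standard_normal.cdf(z)

print(f"z = {z:.2f}")
print(f"P(500 < X < 598) = {p_mean_to_x:.4f}")
print(f"P(X > 598) = {p_above_x:.4f}")
```

With these assumed figures z works out at 1.96, so the area between the mean and x is about 0.475, in line with the 47.5% figure quoted for 1.96 standard deviations.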
