Business Analytics Unit-4
CO: CO1, CO4  PO: PO1-PO8  KL: K3

DESCRIPTIVE ANALYTICS

Data Analytics
• Summary of data.
• Understanding the pattern of data.
• Depicts what happened in the past.
• It involves:
  – Descriptive
  – Inferential

• Data analytics is a summary of data that reveals what happened in the past; it is about understanding the pattern of the data and drawing inferences from it. It therefore has two major facets: descriptive and inferential.
• Descriptive analytics summarizes the features of a collection of data; inferential analytics draws conclusions that go beyond the data at hand.
• Descriptive statistics and inferential statistics are the two major areas of statistics. Descriptive statistics describe the properties of sample and population data (what has happened). Inferential statistics use those properties to test hypotheses, reach conclusions, and make predictions (what you can expect).

Population and Sample
• In statistics, the population comprises all observations (data points) about the subject under study, and a sample is a subset of the population: a small portion of the total population. For instance, a cat food company would like to know all the pet stores where it can sell its canned fish. The company has population data on all the pet stores in a particular place, say Delhi. The manufacturer can then create a research sample by selecting only the pet stores that sell cat food.

Measures of Central Tendency
• Mean
• Median
• Mode

• A measure of central tendency describes the distribution of data using a single value. The mean, median and mode are the three measures of central tendency.
• The mean is the average of all the data points and is very sensitive to outliers; the median is the middle value that divides the data into two equal halves once the data is sorted in ascending order; and the mode is the value that occurs most often.
The median and the mode are both resistant to outliers.
• For instance, these measures of central tendency can be used to rank data on consumer preferences and ratings. The median and mode are especially useful when deciding what to keep in inventory during the festive season: the mode identifies the most popular shirt color, and the median or mode identifies the typical collar size.

Activity: Choosing the "best" measure of central tendency

Part 1: The mean
A golf team's 6 members had the scores below in their most recent tournament:
70, 72, 74, 76, 80, 114
Mean score:
What is a correct interpretation of the mean score?

Part 2: The median
70, 72, 74, 76, 80, 114
Median score:
What is a correct interpretation of the median score?

Part 3: The "best" measure of central tendency
Which measure best describes the scores of the team? Why?
The _____________ best describes the scores of the team.

DESCRIPTIVE ANALYTICS: MEASURES OF DISPERSION

Measures of Dispersion
• Range
• Quartile
• Standard Deviation and Variance

• The main measures of dispersion are the range, the interquartile range (IQR), the variance and the standard deviation.
• Range is simply the difference between the maximum value and the minimum value. In everyday life it appears whenever the smallest value is subtracted from the largest, for example when calculating the time that has passed: if the current year is 2020 and you were born in 2005, your age is 2020 - 2005 = 15 years.
• Quartiles are the values that divide your data into quarters. However, quartiles aren't shaped like pizza slices; instead they divide your data into four segments according to where the numbers fall on the number line. The four quarters that divide a data set into quartiles are:
  – The lowest 25% of numbers.
  – The next lowest 25% of numbers (up to the median).
  – The second highest 25% of numbers (above the median).
  – The highest 25% of numbers.
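
As a rough illustration of the range and quartile definitions above, the following Python sketch uses the golf scores from the activity and the standard-library `statistics` module. The variable names are illustrative, and `method="inclusive"` is an assumption: textbooks and software use several slightly different quartile conventions, so other tools may report slightly different cut points.

```python
import statistics

# Golf scores from the central-tendency activity above.
scores = [70, 72, 74, 76, 80, 114]

# Range: maximum value minus minimum value.
score_range = max(scores) - min(scores)

# Quartiles: three cut points (Q1, Q2, Q3) that split the sorted data
# into four segments. "inclusive" is one of several common conventions
# and is an assumption of this sketch.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")

# Interquartile range (IQR): the spread of the middle 50% of the data.
iqr = q3 - q1

print("Range:", score_range)
print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
```

The single high score of 114 stretches the range considerably, while the quartiles and the IQR depend only on where the middle of the data falls.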
Measures of Dispersion
• Standard deviation is a measure of the amount of variation or dispersion in a set of values. Mathematically, it is the square root of the variance.
• Variance and standard deviation are measures of fit: they indicate how well the mean represents the data. For instance, suppose employees in a company frequently complain that they are paid less than their colleagues and that this is unfair. The employer can check for disparities by calculating the average salary and its standard deviation for the employees in that department. If the standard deviation is higher than expected, the matter is looked into. On going through the accounts, the employer may find that the data is skewed because three of the employees are almost 10 years senior to the others and are paid more.
• Another example comes from everyday life: you can set a mean amount of money to spend and use the standard deviation to check whether you are spending too much. You will not actually go around doing the calculations; it is an intuitive estimate your mind makes for you.

Classroom Activity
• Write down a list of 10 numbers.
• Calculate the mean of the numbers.
• Calculate the range of the numbers (i.e., the difference between the largest and smallest numbers in the list).
• Calculate the variance of the numbers.
• Calculate the standard deviation of the numbers.
• Compare the range, variance, and standard deviation. Which one provides a better measure of the spread of the data? Why?
• Try changing one or two of the numbers in the list and recalculate the range, variance, and standard deviation. What effect does this have on each measure?
• Discuss with your classmates the implications of the measures of dispersion in data analysis and interpretation.

Calculating Range
• Range: The range is calculated as the difference between the maximum and minimum values in a dataset.
To calculate the range of the dataset below, first arrange the data in ascending or descending order:
9, 10, 12, 15, 20, 24, 25, 27, 28, 30
The minimum value is 9 and the maximum value is 30, so the range is:
Range = Maximum value - Minimum value
Range = 30 - 9
Range = 21
Therefore, the range of the dataset is 21.

Calculating Variance
• Variance: The variance is a measure of how spread out a dataset is. It is calculated as the average of the squared differences from the mean.
To calculate the variance of the same dataset, first calculate the mean:
Mean = (9 + 10 + 12 + 15 + 20 + 24 + 25 + 27 + 28 + 30) / 10
Mean = 200 / 10
Mean = 20
Next, calculate the squared difference from the mean for each value in the dataset:
(9 - 20)^2 = 121
(10 - 20)^2 = 100
(12 - 20)^2 = 64
(15 - 20)^2 = 25
(20 - 20)^2 = 0
(24 - 20)^2 = 16
(25 - 20)^2 = 25
(27 - 20)^2 = 49
(28 - 20)^2 = 64
(30 - 20)^2 = 100
Then take the average of these squared differences:
Variance = (121 + 100 + 64 + 25 + 0 + 16 + 25 + 49 + 64 + 100) / 10
Variance = 564 / 10
Variance = 56.4
Therefore, the variance of the dataset is 56.4. Dividing by 10, the number of values, treats the data as a complete population; if the ten values were a sample from a larger population, the sample variance would divide by 9 instead, giving 564 / 9 ≈ 62.67.

Calculating Standard Deviation
• Standard deviation: The standard deviation is the square root of the variance. It measures the spread of a dataset in the same units as the original data.
To calculate the standard deviation of the dataset, take the square root of the variance:
Standard deviation = sqrt(56.4)
Standard deviation ≈ 7.510
Therefore, the standard deviation of the dataset is approximately 7.51.

Comparison
• Range is the simplest measure of spread, as it only looks at the difference between the maximum and minimum values. It does not take into account the distribution of the data or how the data is spread within the range.
• Variance is a more sophisticated measure of spread that takes every data point into account. It is the average of the squared differences from the mean, which indicates how far the data is spread out from the mean.
• Standard deviation is the most commonly used measure of spread, as it expresses how far the data is spread out in the same units as the original data. It is the square root of the variance, which makes it easier to interpret than the variance.

Conclusion
• Which measure describes the spread best depends on the situation and on what information is needed. Range is the simplest measure and gives a quick idea of the spread, but it does not provide a detailed understanding of the distribution. Variance and standard deviation provide more information about the distribution of the data, but they, like the range, can be influenced by outliers. It is therefore important to consider the context of the data and the purpose of the analysis when choosing which measure to use.
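
The measures covered in this unit can be cross-checked with a short Python sketch built on the standard-library `statistics` module. The `summarize` helper and its dictionary keys are hypothetical names chosen for illustration; the two datasets are the ten-value dataset from the worked range and variance calculations and the golf scores from the activity. The helper assumes each dataset is a whole population (dividing by n), which matches the worked variance calculation.

```python
import statistics

def summarize(data):
    """Return common descriptive statistics for a list of numbers.

    Variance and standard deviation use the population formulas
    (divide by n); this is an assumption about how the data is viewed.
    """
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        # multimode lists every most-frequent value; if no value
        # repeats, it returns all the values.
        "modes": statistics.multimode(data),
        "range": max(data) - min(data),
        "variance": statistics.pvariance(data),
        "std_dev": statistics.pstdev(data),
    }

worked_example = [9, 10, 12, 15, 20, 24, 25, 27, 28, 30]
golf_scores = [70, 72, 74, 76, 80, 114]

for name, data in [("worked example", worked_example),
                   ("golf scores", golf_scores)]:
    print(name)
    for measure, value in summarize(data).items():
        print(f"  {measure}: {value}")

# Sample variance (divide by n - 1), for comparison with the
# population variance reported by summarize().
print("sample variance:", statistics.variance(worked_example))
```

Changing one value in either list and re-running the sketch is a quick way to see how each measure reacts to an outlier.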