Uploaded by Vansh Jain

BA

Business Analytics
Unit-4
CO: CO1, CO4
PO:PO1-PO8
KL: K3
DESCRIPTIVE ANALYTICS
Data Analytics
• Summary of Data
• Understanding pattern of data.
• Depicts what happened in past.
• It involves:
– Descriptive
– Inferential
Data Analytics
• Data analytics summarizes data to reveal what happened in the
past; it is about understanding the patterns in data and drawing
inferences from them. Data analytics therefore has two major
facets: descriptive and inferential.
• Descriptive analytics summarizes the features of a collection of
data; inferential analytics draws conclusions that go beyond the
data at hand.
• Descriptive statistics and inferential statistics are the two
major areas of statistics. Descriptive statistics describe the
properties of sample and population data (what has happened).
Inferential statistics use those properties to test hypotheses,
reach conclusions, and make predictions (what you can expect).
Population and Sample
• In statistics, the population comprises all observations (data
points) about the subject under study, and a sample is a subset
of the population: a small portion of the total population that
is actually observed.
For instance, a cat food company would like to know all the
pet stores where it can sell its canned fish. The company has
population data on the total number of pet stores in a
particular place, say Delhi. The manufacturer can then create a
research sample by selecting only the pet stores that sell cat
food.
Measures of Central
Tendency
• Mean
• Median
• Mode
Measures of Central
Tendency
• Used to describe the distribution of data using a single value. Mean,
median and mode are the three measures of central tendency.
• Mean is the average of all the data points and is very sensitive to
outliers.
• Median is the middle value that divides the data into two equal parts
once the data is sorted in ascending order.
• Mode is the value that occurs most often.
• Median and mode are both resistant to outliers.
• For instance, these measures can be used to summarize ranked data such
as consumer preferences and ratings. The median and the mode are
especially useful there, and the mode in particular when you want to
keep in inventory the most popular shirt in terms of color or collar
size during the festive season.
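The three measures can be compared on a small example. The sketch below uses Python's standard `statistics` module; the collar sizes are hypothetical values invented for illustration, with one deliberately extreme value.

```python
from statistics import fmean, median, mode

# Hypothetical collar sizes (cm) of shirts sold; 56 is a deliberate outlier.
collar_sizes = [38, 40, 40, 42, 40, 44, 38, 40, 42, 56]

mean_size = fmean(collar_sizes)     # arithmetic average, pulled up by the outlier
median_size = median(collar_sizes)  # middle of the sorted values
mode_size = mode(collar_sizes)      # most frequent value

print(f"mean={mean_size}, median={median_size}, mode={mode_size}")
```

In this made-up data the single large size raises the mean to 42, while the median and the mode both stay at 40, the size that actually sells most often.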
Activity: Choosing the "best"
measure of central tendency
Part 1: The mean
A golf team's 6 members had the scores below in their most
recent tournament:
70,72,74,76,80,114
Mean Score :
What is a correct interpretation of the mean score?
Part 2: The median
70,72,74,76,80,114
Median Score:
What is a correct interpretation of the median score?
Part 3: The "best"
measure of central
tendency
Which measure best describes the scores of the team? Why?
The _____________ best describes the scores of the team.
DESCRIPTIVE ANALYTICS: MEASURES
OF DISPERSION
Measures of
Dispersion
• Range
• Quartile
• Standard Deviation and Variance
Measures of Dispersion
• The main measures of dispersion are range, interquartile range (IQR,
the difference between the third and first quartiles), variance and
standard deviation.
• Range is simply the difference between the maximum value and the
minimum value. In everyday life it appears whenever you calculate the
amount of time that has passed, as when calculating your age: if the
current year is 2020 and you were born in 2005, the range is
2020 - 2005 = 15 years.
• Quartiles are the values that divide your data into quarters.
However, quartiles aren’t shaped like pizza slices; instead they divide
your data into four segments according to where the numbers fall on
the number line. The four quarters are:
• The lowest 25% of numbers.
• The next lowest 25% of numbers (up to the median).
• The second highest 25% of numbers (above the median).
• The highest 25% of numbers.
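As a rough illustration, the quartile cut points can be computed with Python's standard library. The sketch below applies `statistics.quantiles` to the ten-value dataset used in this unit's worked examples. Its default interpolation method is only one of several conventions, so other textbooks or tools may report slightly different first and third quartiles.

```python
from statistics import quantiles

data = [9, 10, 12, 15, 20, 24, 25, 27, 28, 30]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1  # interquartile range: spread of the middle 50% of the data

print(f"Q1={q1}, Q2 (median)={q2}, Q3={q3}, IQR={iqr}")
```

The second cut point is the median, so half of the values lie on each side of it.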
Measures of Dispersion
• Standard deviation is a measure of the amount of variation or dispersion
of a set of values. In mathematical terms it is the square root of the
variance.
• Variance and standard deviation indicate how well the mean represents
the data: the smaller they are, the closer the values lie to the mean.
For instance, in a company, employees may complain that they are paid
less than others and that this is unfair on the part of the employer. The
employer can then check for disparities by calculating the average salary
and its standard deviation for employees in that department. If the
standard deviation is higher than expected, the matter will be looked
into. For example, on going through the accounts, the employer may find
that the data is skewed because three of the employees are almost 10
years senior to the others and are paid more.
Another example comes from everyday life: you can set a mean amount of
money to spend and use the standard deviation to check whether you are
spending too much. You will obviously not go around doing calculations;
it is simply an instinctive estimate your mind makes for you.
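The salary check described above could be sketched as follows. All figures are hypothetical, and the rule of flagging salaries more than one standard deviation above the mean is an assumption made for illustration, not a standard threshold.

```python
from statistics import fmean, pstdev

# Hypothetical monthly salaries (in thousands) for one department;
# the last three values stand in for the much more senior employees.
salaries = [40, 42, 41, 39, 43, 40, 41, 90, 92, 95]

mean_salary = fmean(salaries)
sd_salary = pstdev(salaries)  # population standard deviation

# Assumed rule: flag salaries more than one standard deviation above the mean.
high_earners = [s for s in salaries if s > mean_salary + sd_salary]

print(f"mean={mean_salary:.1f}, sd={sd_salary:.1f}, flagged={high_earners}")
```

With these invented numbers the three senior salaries are the only ones flagged, which matches the explanation the employer finds in the accounts.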
Classroom Activity
• Write down a list of 10 numbers.
• Calculate the mean of the numbers.
• Calculate the range of the numbers (i.e., the difference
between the largest and smallest numbers in the list).
• Calculate the variance of the numbers.
• Calculate the standard deviation of the numbers.
• Compare the range, variance, and standard deviation. Which
one provides a better measure of the spread of the data?
Why?
• Try changing one or two of the numbers in the list and
recalculate the range, variance, and standard deviation. What
effect does this have on each measure?
• Discuss with your classmates the implications of the measures
of dispersion in data analysis and interpretation.
Calculating Range
• Range:
The range is calculated as the difference between the maximum
and minimum values in a dataset. To calculate the range of the
given dataset, we first need to arrange the data in ascending or
descending order:
9, 10, 12, 15, 20, 24, 25, 27, 28, 30
The minimum value is 9 and the maximum value is 30, so the range
is:
Range = Maximum value - Minimum value
Range = 30 - 9
Range = 21
Therefore, the range of the given dataset is 21.
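The same calculation as a minimal Python sketch, using only built-in functions:

```python
data = [9, 10, 12, 15, 20, 24, 25, 27, 28, 30]

# Sorting is not required here: min() and max() scan the whole list.
data_range = max(data) - min(data)

print(data_range)
```

Sorting the data by hand mainly makes the smallest and largest values easy to spot; the code finds them directly.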
Calculating Variance
• Variance:
The variance is a measure of how spread out a dataset is. It is calculated as the average of the
squared differences from the mean. To calculate the variance of the given dataset, we first
need to calculate the mean:
Mean = (9 + 10 + 12 + 15 + 20 + 24 + 25 + 27 + 28 + 30) / 10
Mean = 20
Next, we calculate the squared differences from the mean for each value in the dataset:
(9 - 20)^2 = 121
(10 - 20)^2 = 100
(12 - 20)^2 = 64
(15 - 20)^2 = 25
(20 - 20)^2 = 0
(24 - 20)^2 = 16
(25 - 20)^2 = 25
(27 - 20)^2 = 49
(28 - 20)^2 = 64
(30 - 20)^2 = 100
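Then we take the average of these squared differences:
Variance = (121 + 100 + 64 + 25 + 0 + 16 + 25 + 49 + 64 + 100) / 10
Variance = 564 / 10
Variance = 56.4
Therefore, the variance of the given dataset is 56.4. Because we divide by the number of
values (10), this is the population variance; the sample variance would divide by 9 (n - 1)
instead.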
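The steps above can be sketched in Python. The manual calculation mirrors the worked example, and `statistics.pvariance` from the standard library serves as a cross-check; both divide by the number of values, so both give the population variance.

```python
from statistics import pvariance

data = [9, 10, 12, 15, 20, 24, 25, 27, 28, 30]

mean = sum(data) / len(data)
squared_diffs = [(x - mean) ** 2 for x in data]
variance = sum(squared_diffs) / len(data)  # divide by n: population variance

# Cross-check against the standard library's population variance.
library_variance = pvariance(data)

print(mean, variance, library_variance)
```

For the sample variance, which divides by n - 1, the standard library offers `statistics.variance` instead.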
Calculating Standard Deviation
• Standard deviation:
The standard deviation is the square root of the variance. It
measures the spread of a dataset in the same units as the
original data.
To calculate the standard deviation of the given dataset, we
simply take the square root of the variance:
Standard deviation = sqrt(56.4)
Standard deviation ≈ 7.510
Therefore, the standard deviation of the given dataset is approximately 7.510.
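A short Python sketch of this step, taking the square root of the variance and comparing it with `statistics.pstdev`, which computes the population standard deviation directly from the data:

```python
import math
from statistics import pstdev

data = [9, 10, 12, 15, 20, 24, 25, 27, 28, 30]

variance = 56.4  # population variance from the worked example
std_from_variance = math.sqrt(variance)
std_from_data = pstdev(data)  # same quantity, computed from the raw data

print(round(std_from_variance, 3), round(std_from_data, 3))
```

Both routes agree; rounded to three decimal places the value is 7.510.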
Comparison
• Range is the simplest measure of spread as it only looks at the
difference between the maximum and minimum values. It
does not take into account the distribution of the data or how
the data is spread throughout the range.
• Variance is a more sophisticated measure of spread that
considers the distribution of the data. It calculates the average
of the squared differences from the mean, which gives us an
idea of how much the data is spread out from the mean.
• Standard deviation is the most commonly used measure of
spread as it gives us an idea of how much the data is spread
out in the same units as the original data. It is the square root
of the variance, which makes it easier to interpret than the
variance.
Conclusion
• Which measure best describes the spread ultimately depends on
the situation and what information is needed. Range is the
simplest measure and can give a quick idea of the spread, but
it does not provide a detailed understanding of the
distribution. Variance and standard deviation provide more
information about the distribution of the data but, like the
range, they can be influenced by outliers. Therefore, it is
important to consider the context of the data and the purpose
of the analysis when choosing which measure to use.
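To see how a single outlier affects each measure, the sketch below recomputes all three after one value is changed. The replacement value 300 is a hypothetical choice made only for illustration, and `spread` is a helper name introduced here.

```python
from statistics import pstdev, pvariance

def spread(values):
    """Return (range, population variance, population standard deviation)."""
    return max(values) - min(values), pvariance(values), pstdev(values)

original = [9, 10, 12, 15, 20, 24, 25, 27, 28, 30]
# Hypothetical change: replace the largest value with an extreme outlier.
with_outlier = original[:-1] + [300]

for label, values in (("original", original), ("with outlier", with_outlier)):
    r, var, sd = spread(values)
    print(f"{label}: range={r}, variance={var:.1f}, sd={sd:.2f}")
```

All three measures grow when the outlier is introduced, so none of them is resistant to extreme values.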