Uploaded by Lorenzo Miguel Pescador

Chapter 1 & 2 Business Analytics - Descriptive Statistics

MGT1102:
Fundamentals of Business
Analytics with Spreadsheet
Business Analytics DEFINED
System of computing hardware, high speed data processing, and
analytical algorithms combined to make data-based
recommendations, which learns over time.
WATSON
Objective: How can the vast amounts of data on the internet support smarter,
more data-driven decisions?
Speeds up approval of
medical procedures, assists
with the diagnosis and
treatment of patients
Better customer service and
product offerings; instant
decisioning and approval
for cards and loans
Three Key Factors to
ANALYTICS EXPLOSION
TECHNOLOGICAL
ADVANCES
METHODOLOGICAL
DEVELOPMENTS
IMPROVEMENTS ON COMPUTING
POWER & STORAGE
NEURAL NETWORKS
DECISION MAKING
DECISION TYPE
STRATEGIC: Overall goals, aspirations, and direction of the organization.
Who: High-level management/executives. Horizon: long term, 3 to 5 years.
TACTICAL: How the organization should achieve the goals and objectives set by its strategy.
Who: Mid-level management. Horizon: short term, 1 year.
OPERATIONAL: How the firm is run from day to day.
Who: Operations managers. Horizon: daily.
The Thoroughbred Running Company (TRC)
TRC had been a catalog-based retail seller of running
shoes and apparel. TRC sales revenues grew quickly as
it changed its emphasis from catalog-based sales to
Internet-based sales. Recently, TRC decided that it
should also establish retail stores in the malls and
downtown areas of major cities.
STRATEGIC: Establish retail stores in malls to complement e-commerce platforms.
TACTICAL: How many stores to open and where to open them, including distribution stores.
OPERATIONAL: Day-to-day activity (inventory, crew schedules, etc.).
Decision making is a
SCIENCE
1. Identify and define the problem.
2. Determine the set of alternative solutions.
3. Determine the criteria that will be used to evaluate alternative solutions.
4. Evaluate the alternatives.
5. Choose an alternative.
BUSINESS
ANALYTICS
Scientific process of transforming data into insight for making better decisions.
It is used for data-driven or fact-based decision making, which is often seen as
more objective than other alternatives for decision making.
Categories of Analytic Techniques
DESCRIPTIVE: What happened, and why did it happen?
PREDICTIVE: What will happen?
PRESCRIPTIVE: What should you do about it?
DESCRIPTIVE ANALYTICS
Data Queries, Reports, and Statistics
Requests for information with certain characteristics from a database.
Data Dashboards
Collections of tables, charts, maps, and summary statistics that are updated as new
data become available.
Data Mining
Use of analytical techniques for better understanding patterns and relationships
that exist in large data sets.
PREDICTIVE ANALYTICS
Linear Regression
Uncovers relationships across variables using a linear equation.
Simulation
Use of probability and statistics to construct a computer model to study the impact
of uncertainty on a decision.
PRESCRIPTIVE ANALYTICS
Rule-Based Models
Types of prescriptive models that rely on a rule or set of rules.
Optimization Models
Models that give the best decision subject to the constraints of the situation.
Simulation Optimization
Combines the use of probability and statistics to model uncertainty with
optimization techniques to find good decisions in highly complex and highly
uncertain settings.
BIG DATA
Any set of data that is too large or too complex to be handled by standard
data-processing techniques and typical desktop software.
Volume: Because data are collected electronically, we are able to collect more of
it. To be useful, these data must be stored, and this storage has led to vast
quantities of data.
Velocity: Real-time capture and analysis of data present unique challenges both in
how data are stored and the speed with which those data can be
analyzed for decision making.
Variety: More complicated types of data are now available and are proving to be
of great value to businesses (text data, audio data, video data).
Veracity: Veracity has to do with how much uncertainty is in the data. Inconsistencies
in units of measure and the lack of reliability of responses in terms of bias
also increase the complexity of the data.
Descriptive Statistics
Data are the facts and figures collected, analyzed, and summarized for presentation and interpretation.
A characteristic or a quantity of interest that can take on different values is
known as a variable.
An observation is a set of values corresponding to a set of variables.
Variation is the difference in a variable measured over observations (time,
customers, items, etc.).
A quantity whose values are not known with certainty is called a random variable.
TYPES OF DATA
Population vs Sample
Quantitative vs
Qualitative
Cross Sectional vs
Longitudinal
SOURCES OF DATA
Experimental
In an experimental study, a
variable of interest is first
identified. Then one or more
other variables are identified
and controlled or
manipulated to obtain data
about how these variables
influence the variable of
interest.
Observational
Nonexperimental, or
observational, studies make
no attempt to control the
variables of interest.
Existing
Data from previously conducted studies, whether experimental or observational.
Sources: private (non-government) firms and government agencies.
Modifying Data in Excel
1. Sorting and Filtering Data
2. Conditional Formatting
DISTRIBUTION
Distributions summarize many characteristics of a data set by describing how
often certain values for a variable appear in that data set. Distributions can be
created for both categorical and quantitative data, and they assist the
analyst in gauging variation.
Frequency
Distributions
A summary of data that
shows the number
(frequency) of
observations in each of
several nonoverlapping
classes, typically referred
to as bins.
The frequency distribution shows that
Coca-Cola is the leader, Pepsi is second,
Diet Coke is third, and Sprite and Dr.
Pepper are tied for fourth.
FDT for Categorical Data
FREQUENCY
The frequency of a bin
summarizes the number of
times the value has
occurred.
RELATIVE FREQUENCY
The relative frequency of a bin equals the fraction or proportion of items
belonging to that bin. A relative frequency distribution is a tabular summary of
data showing the relative frequency for each bin.
PERCENT FREQUENCY
The percent frequency of a bin is the relative frequency multiplied by 100.
A percent frequency distribution summarizes the percent frequency of the
data for each bin.
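The three columns of a frequency distribution table for categorical data can be sketched in a few lines of Python. This is only an illustration: the purchase list is made up for the example and is not the soft drink data summarized on the earlier slide, and the helper name `frequency_table` is hypothetical.

```python
from collections import Counter

# Hypothetical sample of soft drink purchases (illustrative data only).
purchases = [
    "Coca-Cola", "Pepsi", "Coca-Cola", "Diet Coke", "Sprite",
    "Coca-Cola", "Pepsi", "Dr. Pepper", "Diet Coke", "Coca-Cola",
]

def frequency_table(values):
    """Return {bin: (frequency, relative frequency, percent frequency)}."""
    counts = Counter(values)
    n = len(values)
    return {
        label: (count, count / n, 100 * count / n)
        for label, count in counts.most_common()
    }

table = frequency_table(purchases)
for label, (freq, rel, pct) in table.items():
    print(f"{label:<12}{freq:>4}{rel:>8.2f}{pct:>8.1f}%")
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a quick check on any table built by hand or in a spreadsheet.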
FDT for Quantitative Data
Bins are formed by specifying the ranges used to group the data.
1. Determine the number of nonoverlapping bins.
Number of Bins: The goal is to use enough bins to show the variation in the data,
but not so many that some contain only a few data items. Sturges' rule is one
guideline for choosing the number of bins.
2. Determine the width of each bin.
Bin Width: As a general guideline, the width should be the same for each bin. A
larger number of bins means a smaller bin width and vice versa.
Bin Width = Range / Bin Count
3. Determine the bin limits.
Bin Limits: Bin limits must be chosen so that each data item belongs to one and
only one bin. The lower bin limit identifies the smallest possible data value
assigned to the bin. The upper bin limit identifies the largest possible data
value assigned to the bin.
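The three steps above can be sketched as follows. The sketch assumes the common statement of Sturges' rule, k = 1 + log2(n) rounded up, and assumes half-open bins (lower limit included, upper limit excluded, except for the last bin) so that each value lands in exactly one bin. The data and the function names are hypothetical.

```python
import math

def sturges_bin_count(n):
    """Number of bins suggested by Sturges' rule: 1 + log2(n), rounded up."""
    return math.ceil(1 + math.log2(n))

def bin_frequencies(values, bin_count):
    """Group values into equal-width bins and count the items in each bin.

    Assumes the values are not all identical (the range must be positive).
    """
    low, high = min(values), max(values)
    width = (high - low) / bin_count          # bin width = range / bin count
    counts = [0] * bin_count
    for v in values:
        # The largest value would fall just past the last bin, so clamp it.
        index = min(int((v - low) / width), bin_count - 1)
        counts[index] += 1
    limits = [(low + i * width, low + (i + 1) * width) for i in range(bin_count)]
    return width, limits, counts

# Hypothetical data: 20 measurements (illustrative only).
data = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
        22, 23, 22, 21, 33, 28, 14, 18, 16, 13]

k = sturges_bin_count(len(data))
width, limits, counts = bin_frequencies(data, k)
for (lower, upper), count in zip(limits, counts):
    print(f"{lower:5.1f} to {upper:5.1f}: {count}")
```

In practice the computed width is often rounded to a convenient number, so the bin limits are easier to read; the sketch keeps the raw value to stay close to the formula.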
FDT for Quantitative Data
Cumulative Frequency
Distribution
The cumulative frequency distribution shows the number of data items with values
less than or equal to the upper class limit of each class.
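A cumulative frequency distribution is just a running total of the bin frequencies. A minimal sketch, assuming the frequencies are listed from the lowest bin to the highest (the numbers are hypothetical):

```python
from itertools import accumulate

# Hypothetical bin frequencies, listed from the lowest bin to the highest.
frequencies = [6, 5, 5, 1, 2, 1]

# Running total: items with values at or below each bin's upper limit.
cumulative = list(accumulate(frequencies))

# Cumulative relative frequency: the share of items at or below each upper limit.
total = sum(frequencies)
cumulative_relative = [c / total for c in cumulative]

print(cumulative)
print(cumulative_relative)
```

The last cumulative frequency always equals the number of observations, and the last cumulative relative frequency equals 1.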
FDT for Quantitative Data
A histogram is a plot that shows the underlying frequency distribution or shape
of a set of continuous data. This allows the inspection of the data for its
underlying distribution (e.g., normal distribution), outliers, skewness, etc.
[Figure: HISTOGRAM, with the variable of interest on the horizontal axis and
frequency on the vertical axis]
DESCRIPTIVE STATISTICS
CENTRAL TENDENCY: MEAN, MEDIAN, MODE
VARIABILITY: RANGE, VARIANCE, STANDARD DEVIATION, COEFFICIENT OF VARIATION
LOCATION: PERCENTILE, QUARTILE, DECILE, z-SCORES
SHAPE: SKEWNESS, KURTOSIS
RELATIONSHIP: CORRELATION COEFFICIENT, COVARIANCE
CENTRAL TENDENCY
MEAN (ARITHMETIC)
The mean provides a measure of central location for the data. If the data are
for a sample, the mean is denoted by x̄ (x-bar); if the data are for a
population, it is denoted by μ (mu). The sample mean is a point estimate of the
population mean for the variable of interest.
MEDIAN
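The median, or x̃ (x-tilde), another measure of central location, is the value
in the middle when the data are arranged in order.
Odd case: the median is the value in position (n+1)/2 of the ordered data.
Even case: the median is the average of the values in positions n/2 and (n/2)+1:
\tilde{x} = \frac{x_{(n/2)} + x_{(n/2)+1}}{2}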
MODE
A third measure of location,
the mode, is the value that
occurs most frequently in a
data set.
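The three measures can be checked with Python's standard `statistics` module. The two samples below are hypothetical and chosen only to show the odd and even cases of the median.

```python
import statistics

# Hypothetical sample with an odd number of observations (illustrative only).
sample = [46, 54, 42, 46, 32]

mean = statistics.mean(sample)      # sum of the values divided by n
median = statistics.median(sample)  # middle value of the sorted data
mode = statistics.mode(sample)      # most frequently occurring value

# Even-sized sample: the median is the average of the two middle values.
even_sample = [27, 25, 27, 30, 20, 26]
even_median = statistics.median(even_sample)

print(mean, median, mode, even_median)
```

In a spreadsheet the same quantities are typically obtained with the AVERAGE, MEDIAN, and MODE.SNGL functions.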
CENTRAL TENDENCY
VARIATIONS OF THE MEAN
Geometric Mean
The geometric mean is a measure of location that is
calculated by finding the nth root of the product of n
values.
Example: averaging growth rates
Harmonic Mean
The reciprocal of the arithmetic mean of the reciprocals of the values.
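Both definitions translate directly into code. The sketch below is a plain transcription of the two formulas; the growth factors and speeds are hypothetical, and the harmonic mean example (average speed over equal distances) is an added illustration, not one taken from the slides.

```python
import math

def geometric_mean(values):
    """nth root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(values) / sum(1 / v for v in values)

# Hypothetical growth factors: +10%, -5%, +20% in three successive years.
growth_factors = [1.10, 0.95, 1.20]
average_factor = geometric_mean(growth_factors)

# Hypothetical speeds (km/h) over two legs of equal distance.
speeds = [40, 60]
average_speed = harmonic_mean(speeds)

print(f"average growth factor: {average_factor:.4f}")
print(f"average speed: {average_speed:.1f} km/h")
```

Growth rates are entered as growth factors (1 + rate) so that compounding the average factor three times reproduces the overall three-year growth.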
VARIABILITY
RANGE
The simplest measure of variability is the range. The range can be found by
subtracting the smallest value from the largest value in a data set.
VARIANCE
The variance is a measure of
variability that utilizes all the
data. The variance is based
on the deviation about the
mean, which is the
difference between the
value of each observation
and the mean.
Seldom used. Why?
STANDARD
DEVIATION
The standard deviation is
defined to be the positive
square root of the variance.
It is used more often than the
variance. Why?
COEFFICIENT OF
VARIATION
The coefficient of variation (CV)
indicates how large the
standard deviation is
relative to the mean.
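The four measures of variability can be computed step by step as below. The slides do not say whether the sample or the population formula is meant; this sketch assumes the sample variance (dividing by n − 1), and the data are hypothetical.

```python
import math

def sample_variance(values):
    """Sum of squared deviations about the mean, divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)

# Hypothetical sample (illustrative only).
sample = [46, 54, 42, 46, 32]

value_range = max(sample) - min(sample)   # largest value minus smallest value
variance = sample_variance(sample)        # in squared units of the data
std_dev = math.sqrt(variance)             # positive square root of the variance
mean = sum(sample) / len(sample)
cv_percent = 100 * std_dev / mean         # standard deviation relative to the mean

print(value_range, variance, std_dev, round(cv_percent, 1))
```

One point relevant to the question posed above: the standard deviation is expressed in the same units as the data, whereas the variance is in squared units.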
LOCATION
PERCENTILE
A percentile is the value of a variable below which a specified percentage of
observations fall. The pth percentile tells us the point in the data where
approximately p% of the observations have values less than the pth percentile;
hence, approximately (100 − p)% of the observations have values greater than the
pth percentile.
QUARTILE
It is often desirable to divide data into four parts, with each part containing
approximately one-fourth, or 25 percent, of the observations. The division
points are the quartiles.
Q1 = 25th Percentile
Q2 = 50th Percentile
(Median)
Q3 = 75th Percentile
IQR = Q3 – Q1
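Quartiles and the IQR can be sketched with `statistics.quantiles`. Several percentile conventions exist and give slightly different answers; this sketch assumes the "exclusive" method, which places the pth percentile at position (p/100)(n + 1) in the ordered data. The data are hypothetical.

```python
import statistics

# Hypothetical sample of 12 values (illustrative only).
data = [108, 138, 138, 142, 186, 199, 208, 228, 232, 254, 265, 310]

# n=4 returns the three cut points Q1, Q2 (median), and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

Passing `n=10` or `n=100` to the same function returns deciles or percentiles, so one call pattern covers all three measures of location.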
DECILE
Deciles divide the data into ten parts, each containing approximately one-tenth,
or 10%, of the observations in the data set.
Z-SCORE
A z-score allows us to measure the relative location of a value in the data set.
More specifically, a z-score helps us determine how far a particular value is
from the mean relative to the data set's standard deviation.
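A z-score is the deviation from the mean divided by the standard deviation, z = (x − x̄)/s. A minimal sketch, assuming the sample standard deviation and using hypothetical data:

```python
import math

def z_scores(values):
    """z = (x - mean) / s, using the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return [(x - mean) / s for x in values]

# Hypothetical sample (illustrative only): mean 44, standard deviation 8.
sample = [46, 54, 42, 46, 32]
scores = z_scores(sample)

for x, z in zip(sample, scores):
    print(f"value {x}: z = {z:+.2f}")
```

A positive z-score means the value lies above the mean and a negative one means it lies below; the z-scores of a data set always sum to zero.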
LOCATION
EMPIRICAL
RULE
The empirical rule can be used to determine the percentage of data values
that are within a specified number of standard deviations of the mean, provided
the distribution follows a symmetric bell-shaped curve.
For data having a bell-shaped distribution:
Approximately 68% of the data values will
be within 1 standard deviation of the
mean.
Approximately 95% of the data values will
be within 2 standard deviations of the
mean.
Almost all of the data values will be within
3 standard deviations of the mean.
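The three percentages can be reproduced from the normal distribution, which is the usual model behind a "bell-shaped" curve. This sketch assumes an exactly normal distribution; real data will only match approximately.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.4f}")
```

The results are close to 0.68, 0.95, and 0.997, which is where the rounded figures in the rule come from.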
LOCATION
OUTLIERS
Outliers are unusually large or unusually small observations in the data set.
They are commonly identified with z-scores: a value with a z-score below −3 or
above +3 is treated as an outlier.
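The z-score screen can be sketched as a short filter. The threshold of 3 follows the guideline above; the data and the function name are hypothetical.

```python
import statistics

def z_score_outliers(values, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)   # sample standard deviation
    return [x for x in values if abs((x - mean) / s) > threshold]

# Hypothetical data: 19 values near 50 and one extreme value (illustrative only).
data = [50, 52, 48, 51, 49, 50, 53, 47, 50, 51,
        49, 52, 48, 50, 51, 49, 50, 52, 48, 200]
outliers = z_score_outliers(data)

print(outliers)
```

Note that the extreme value itself inflates the mean and the standard deviation, so in very small samples (ten or fewer observations) no z-score can reach 3 and this screen flags nothing. A flagged value should be reviewed, not deleted automatically.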
LOCATION
BOX PLOTS
A box plot is a graphical summary of the distribution of data developed from
the quartiles for a data set.
1. A box is drawn with the ends of the box located at the first and third
quartiles.
2. A vertical line is drawn in the box at the location of the median.
3. Determine the limits. The limits for the box plot are 1.5(IQR) below Q1 and
1.5(IQR) above Q3.
4. The whiskers are drawn from the ends of the box to the smallest and
largest values inside the limits.
5. Locate each outlier with an asterisk.
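The numbers needed to draw a box plot by these steps can be computed as below. The sketch assumes the "exclusive" quartile convention of `statistics.quantiles`; other conventions shift the box slightly. The data and the function name are hypothetical.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, limits, whisker ends, and outliers for a box plot."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr   # 1.5(IQR) below Q1
    upper_limit = q3 + 1.5 * iqr   # 1.5(IQR) above Q3
    inside = [x for x in ordered if lower_limit <= x <= upper_limit]
    outliers = [x for x in ordered if x < lower_limit or x > upper_limit]
    return {
        "q1": q1, "median": median, "q3": q3,
        "limits": (lower_limit, upper_limit),
        "whiskers": (inside[0], inside[-1]),  # smallest and largest values inside the limits
        "outliers": outliers,
    }

# Hypothetical sample with one unusually large value (illustrative only).
data = [108, 138, 138, 142, 186, 199, 208, 228, 232, 254, 265, 456]
summary = box_plot_summary(data)

for name, value in summary.items():
    print(f"{name}: {value}")
```

The whiskers stop at actual data values, not at the limits themselves; the limits only decide which points are drawn separately as outliers.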
ASSOCIATION
COVARIANCE
Covariance is a descriptive measure of the linear association between two
variables.
For the bottled water example, the covariance is positive, indicating
that higher temperatures (x) are associated with higher sales (y). If
the covariance is near 0, then the x and y variables are not linearly
related. If the covariance is less than 0, then the x and y variables are
negatively related, which means that as x increases, y generally
decreases.
Note: Covariance is directly affected by the units of measurement
(e.g., cm vs. in).
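The effect of units can be seen by computing the covariance of the same data in two temperature scales. The sketch assumes the sample covariance (dividing by n − 1); the temperatures and sales figures are hypothetical and are not the bottled water data from the example.

```python
def sample_covariance(xs, ys):
    """Sum of (x - x_mean)(y - y_mean), divided by n - 1."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    return sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: daily high temperature (deg F) and bottled water sales (cases).
temps_f = [70, 75, 80, 85, 90]
sales = [20, 24, 29, 33, 39]

# The same temperatures expressed in degrees Celsius.
temps_c = [(f - 32) * 5 / 9 for f in temps_f]

cov_f = sample_covariance(temps_f, sales)
cov_c = sample_covariance(temps_c, sales)

print(f"covariance with deg F: {cov_f:.2f}")
print(f"covariance with deg C: {cov_c:.2f}")
```

Both values are positive, so the direction of the relationship is unchanged, but the magnitude shrinks by the conversion factor 5/9. This is why the size of a covariance cannot be compared across data sets measured in different units.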
ASSOCIATION
Scatter Plots
A scatter chart is a useful graph for analyzing the relationship between two
variables.
The scatter chart in the figure suggests that higher daily
high temperatures are associated with higher bottled
water sales. This is an example of a positive relationship,
because when one variable (high temperature)
increases, the other variable (sales of bottled water)
generally also increases. The scatter chart also suggests
that a straight line could be used as an approximation
for the relationship between high temperature and
sales of bottled water.
ASSOCIATION
[Figure: scatter charts illustrating a positive relationship, no relationship,
and a negative relationship]
ASSOCIATION
CORRELATION
The correlation coefficient measures the LINEAR relationship between two
variables, and, unlike covariance, the relationship between two variables is
not affected by the units of measurement for x and y.
The correlation coefficient can take only values between −1
and +1. Correlation coefficient values near 0 indicate no
linear relationship between the x and y variables.
Correlation coefficients greater than 0 indicate a positive
linear relationship between the x and y variables. The closer
the correlation coefficient is to +1, the closer the x and y
values are to forming a straight line that trends upward to
the right (positive slope).
ASSOCIATION
CORRELATION
Because the correlation coefficient defined here measures only the strength of
the linear relationship between two quantitative variables, it is possible for the
correlation coefficient to be near zero, suggesting no linear relationship, when
the relationship between the two variables is nonlinear.
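The point about nonlinear relationships can be checked numerically. The sketch below computes the Pearson correlation coefficient from its definition; the data are constructed for the illustration, with `curved` depending perfectly, but not linearly, on x.

```python
import math

def correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    # The (n - 1) divisors in the numerator and denominator cancel.
    return sxy / math.sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # perfect positive linear relationship
curved = [x ** 2 for x in xs]      # perfect relationship, but not linear

r_linear = correlation(xs, linear)
r_curved = correlation(xs, curved)

# Rescaling x (a change of units) leaves the correlation unchanged.
r_rescaled = correlation([100 * x for x in xs], linear)

print(r_linear, r_curved, r_rescaled)
```

Here y is completely determined by x in both cases, yet the correlation for the curved data is zero. A scatter chart is therefore worth checking before concluding that two variables are unrelated.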
Data Cleansing
1. Missing Data
2. Identification of Erroneous Outliers and Other Erroneous Values
3. Variable Representation