MGT1102: Fundamentals of Business Analytics with Spreadsheet

BUSINESS ANALYTICS DEFINED

WATSON
A system of computing hardware, high-speed data processing, and analytical algorithms combined to make data-based recommendations, which learns over time.
Objective: How can vast amounts of data on the internet create more data-driven, smarter decisions?
Speeds up approval of medical procedures; assists with the diagnosis and treatment of patients.
Better customer service and product offerings; instant decisioning and approval for cards and loans.

THREE KEY FACTORS TO THE ANALYTICS EXPLOSION
1. TECHNOLOGICAL ADVANCES
2. METHODOLOGICAL DEVELOPMENTS
3. IMPROVEMENTS IN COMPUTING POWER & STORAGE
NEURAL NETWORKS

DECISION MAKING
DECISION TYPE

STRATEGIC
Overall goals, aspirations, and direction of the organization.
High-level management / executives.
Long term: 3 to 5 years.

TACTICAL
How the organization should achieve the goals and objectives set by its strategy.
Mid-level management.
Short term: 1 year.

OPERATIONAL
How the firm is run from day to day.
Operations managers.
Daily.

THE THOROUGHBRED RUNNING COMPANY (TRC)
TRC had been a catalog-based retail seller of running shoes and apparel. TRC sales revenues grew quickly as it changed its emphasis from catalog-based sales to Internet-based sales. Recently, TRC decided that it should also establish retail stores in the malls and downtown areas of major cities.
STRATEGIC: Establish retail stores in malls to complement e-commerce platforms.
TACTICAL: How many stores to open, and where to open them, including distribution stores.
OPERATIONAL: Day-to-day activity: inventory, crew schedules, etc.

DECISION MAKING IS A SCIENCE
1. Identify and define the problem.
2. Determine the criteria that will be used to evaluate alternative solutions.
3. Determine the set of alternative solutions.
4. Evaluate the alternatives.
5. Choose an alternative.

BUSINESS ANALYTICS
Scientific process of transforming data into insight for making better decisions.
It is used for data-driven or fact-based decision making, which is often seen as more objective than other approaches to decision making.

CATEGORIES OF ANALYTIC TECHNIQUES
DESCRIPTIVE: What happened, and why did it happen?
PREDICTIVE: What will happen?
PRESCRIPTIVE: What should you do about it?

DESCRIPTIVE ANALYTICS
Data Queries, Reports, and Statistics: A data query is a request for information with certain characteristics from a database.
Data Dashboards: Collections of tables, charts, maps, and summary statistics that are updated as new data become available.
Data Mining: Use of analytical techniques for better understanding patterns and relationships that exist in large data sets.

PREDICTIVE ANALYTICS
Linear Regression: Uncovers relationships across variables using a linear equation.
Simulation: Use of probability and statistics to construct a computer model to study the impact of uncertainty on a decision.

PRESCRIPTIVE ANALYTICS
Rule-Based Models: Types of prescriptive models that rely on a rule or set of rules.
Optimization Models: Models that give the best decision subject to the constraints of the situation.
Simulation Optimization: Combines the use of probability and statistics to model uncertainty with optimization techniques to find good decisions in highly complex and highly uncertain settings.

BIG DATA
Any set of data that is too large or too complex to be handled by standard data-processing techniques and typical desktop software.
VOLUME: Because data are collected electronically, we are able to collect more of it. To be useful, these data must be stored, and this storage has led to vast quantities of data.
VELOCITY: Real-time capture and analysis of data present unique challenges both in how data are stored and the speed with which those data can be analyzed for decision making.
VARIETY: More complicated types of data are now available and are proving to be of great value to businesses, such as text data, audio data, and video data.
VERACITY: Veracity has to do with how much uncertainty is in the data.
Inconsistencies in units of measure and the lack of reliability of responses in terms of bias also increase the complexity of the data.

DESCRIPTIVE STATISTICS

Data are the facts and figures collected, analyzed, and summarized for presentation and interpretation.
A variable is a characteristic or a quantity of interest that can take on different values.
An observation is a set of values corresponding to a set of variables.
Variation is the difference in a variable measured over observations (time, customers, items, etc.).
A random variable is a quantity whose values are not known with certainty.

TYPES OF DATA
Population vs Sample
Quantitative vs Qualitative
Cross-Sectional vs Longitudinal

SOURCES OF DATA
Experimental: In an experimental study, a variable of interest is first identified. Then one or more other variables are identified and controlled or manipulated to obtain data about how these variables influence the variable of interest.
Observational: Nonexperimental, or observational, studies make no attempt to control the variables of interest.
Existing: Data from previously conducted studies, whether experimental or observational. Sources: private (nongovernment) firms and government agencies.

MODIFYING DATA IN EXCEL
1. Sorting and Filtering Data
2. Conditional Formatting

DISTRIBUTION
Distributions summarize many characteristics of a data set by describing how often certain values for a variable appear in that data set. Distributions can be created for both categorical and quantitative data, and they assist the analyst in gauging variation.

FREQUENCY DISTRIBUTIONS
A summary of data that shows the number (frequency) of observations in each of several nonoverlapping classes, typically referred to as bins.
The frequency distribution shows that Coca-Cola is the leader, Pepsi is second, Diet Coke is third, and Sprite and Dr. Pepper are tied for fourth.
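A frequency distribution for categorical data amounts to counting how often each category occurs. The sketch below is only an illustration: the `purchases` list and the function name `frequency_distribution` are hypothetical, and the counts are invented so that the ranking matches the one described above; they are not the data behind the slide.

```python
from collections import Counter

def frequency_distribution(values):
    """Return (category, frequency) pairs, most frequent first."""
    counts = Counter(values)
    # most_common() sorts by frequency, highest first; tied categories
    # keep the order in which they were first seen.
    return counts.most_common()

# Hypothetical purchases, invented for illustration only.
purchases = (
    ["Coca-Cola"] * 5 + ["Pepsi"] * 4 + ["Diet Coke"] * 3
    + ["Sprite"] * 2 + ["Dr. Pepper"] * 2
)

for category, frequency in frequency_distribution(purchases):
    print(f"{category:<12}{frequency}")
```

Each category is its own bin here, so every observation belongs to exactly one bin.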
FDT (FREQUENCY DISTRIBUTION TABLE) FOR CATEGORICAL DATA
FREQUENCY: The frequency of a bin is the number of observations that fall in that bin.
RELATIVE FREQUENCY: The relative frequency of a bin equals the fraction or proportion of items belonging to that bin.
PERCENT FREQUENCY: The percent frequency of a bin is the relative frequency multiplied by 100.
A relative frequency distribution is a tabular summary of data showing the relative frequency for each bin.
A percent frequency distribution summarizes the percent frequency of the data for each bin.

FDT FOR QUANTITATIVE DATA
Bins are formed by specifying the ranges used to group the data.
1. Determine the number of nonoverlapping bins.
2. Determine the width of each bin.
3. Determine the bin limits.

Number of Bins: The goal is to use enough bins to show the variation in the data, but not so many that some contain only a few data items. Sturges' rule is one guideline for choosing the number of bins.
Bin Width: As a general guideline, the width should be the same for each bin. A larger number of bins means a smaller bin width and vice versa.
Bin Width = Range / Bin Count, where Range = largest value - smallest value.
Bin Limits: Bin limits must be chosen so that each data item belongs to one and only one bin. The lower bin limit identifies the smallest possible data value assigned to the bin. The upper bin limit identifies the largest possible data value assigned to the bin.

CUMULATIVE FREQUENCY DISTRIBUTION
The cumulative frequency distribution shows the number of data items with values less than or equal to the upper bin limit of each bin.

HISTOGRAM
A histogram is a plot that shows the underlying frequency distribution or shape of a set of continuous data. This allows the inspection of the data for its underlying distribution (e.g., normal distribution), outliers, skewness, etc.
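The three binning steps and the cumulative frequency column can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the course's prescribed method: the `data` values are hypothetical, the function names are invented, Sturges' rule is taken in its commonly quoted form k = 1 + log2(n) rounded up, and each bin includes its lower limit but not its upper limit (except the last bin, which includes the maximum).

```python
import math

def sturges_bin_count(n):
    """Bin count from Sturges' rule, k = 1 + log2(n), rounded up."""
    return math.ceil(1 + math.log2(n))

def frequency_table(data, bin_count=None):
    """Return rows of (lower limit, upper limit, frequency, cumulative frequency).

    Assumes the data contain at least two distinct values.
    """
    k = bin_count or sturges_bin_count(len(data))
    low, high = min(data), max(data)
    width = (high - low) / k  # bin width = range / bin count
    freqs = [0] * k
    for x in data:
        # Each value lands in exactly one bin; the maximum goes in the last bin.
        index = min(int((x - low) / width), k - 1)
        freqs[index] += 1
    rows, running = [], 0
    for i, f in enumerate(freqs):
        running += f  # cumulative frequency up to and including this bin
        rows.append((low + i * width, low + (i + 1) * width, f, running))
    return rows

# Hypothetical data, invented for illustration only.
data = [10, 12, 14, 15, 16, 18, 19, 21, 22, 22, 24, 26, 27, 31, 35]

for lower, upper, freq, cum in frequency_table(data):
    print(f"{lower:5.1f} to {upper:5.1f}   frequency {freq}   cumulative {cum}")
```

The frequency column is what a histogram plots on its vertical axis, with the bins along the horizontal axis.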
[Figure: histogram. Horizontal axis: variable of interest. Vertical axis: frequency.]

DESCRIPTIVE STATISTICS
CENTRAL TENDENCY: Mean, Median, Mode
VARIABILITY: Range, Variance, Standard Deviation, Coefficient of Variation
LOCATION: Percentile, Quartile, Decile, z-Scores
SHAPE: Skewness, Kurtosis
RELATIONSHIP: Correlation Coefficient, Covariance

CENTRAL TENDENCY
MEAN (ARITHMETIC)
The mean provides a measure of central location for the data. If the data are for a sample, the mean is denoted by x-bar; the population mean is denoted by mu. The sample mean is a point estimate of the population mean for the variable of interest.
MEDIAN
The median, or x-tilde, another measure of central location, is the value in the middle when the data are arranged in ascending order.
Odd case: the median is the value at position (n + 1) / 2.
Even case: the median is the average of the values at positions n / 2 and (n / 2) + 1.
MODE
A third measure of location, the mode, is the value that occurs most frequently in a data set.

CENTRAL TENDENCY: VARIATIONS OF THE MEAN
Geometric Mean: The geometric mean is a measure of location that is calculated by finding the nth root of the product of n values. Example: growth rates.
Harmonic Mean: The reciprocal of the arithmetic mean of the reciprocals.

VARIABILITY
RANGE
The simplest measure of variability is the range. The range can be found by subtracting the smallest value from the largest value in a data set.
VARIANCE
The variance is a measure of variability that utilizes all the data. The variance is based on the deviation about the mean, which is the difference between the value of each observation and the mean. Seldom reported on its own. Why? Its units are the square of the data's units, which makes it hard to interpret.
STANDARD DEVIATION
The standard deviation is defined to be the positive square root of the variance. It is used more often than the variance. Why? It is measured in the same units as the data.
COEFFICIENT OF VARIATION
The coefficient of variation (CV) indicates how large the standard deviation is relative to the mean. It is usually expressed as a percentage: CV = (standard deviation / mean) x 100%.

LOCATION
PERCENTILE AND QUARTILE
A percentile is the value of a variable at which a specified percentage of observations are below that value.
The pth percentile tells us the point in the data where approximately p% of the observations have values less than the pth percentile; hence, approximately (100 - p)% of the observations have values greater than the pth percentile.

QUARTILE
It is often desirable to divide data into four parts. Each part contains approximately one-fourth, or 25 percent, of the observations.
Q1 = 25th percentile
Q2 = 50th percentile (median)
Q3 = 75th percentile
IQR (interquartile range) = Q3 - Q1

DECILE
Deciles divide the data into ten parts, each containing approximately one-tenth, or 10%, of the observations in the data set.

Z-SCORE
A z-score allows us to measure the relative location of a value in the data set. More specifically, a z-score helps us determine how far a particular value is from the mean relative to the data set's standard deviation.
z = (value - mean) / standard deviation

EMPIRICAL RULE
The empirical rule can be used to determine the percentage of data values that are within a specified number of standard deviations of the mean, provided the distribution follows a symmetric bell-shaped curve. For data having a bell-shaped distribution:
Approximately 68% of the data values will be within 1 standard deviation of the mean.
Approximately 95% of the data values will be within 2 standard deviations of the mean.
Almost all of the data values will be within 3 standard deviations of the mean.

OUTLIERS
Unusually large or unusually small observations in the data set.
Identified via z-scores: values with z-scores of about +3 or higher, or about -3 or lower.

BOX PLOTS
A box plot is a graphical summary of the distribution of data developed from the quartiles for a data set.
1. A box is drawn with the ends of the box located at the first and third quartiles.
2. A vertical line is drawn in the box at the location of the median.
3. Determine the limits. The limits for the box plot are 1.5(IQR) below Q1 and 1.5(IQR) above Q3.
4. The whiskers are drawn from the ends of the box to the smallest and largest values inside the limits.
5. Locate each outlier with an asterisk.

ASSOCIATION
COVARIANCE
Covariance is a descriptive measure of the linear association between two variables. For the bottled water example, the covariance is positive, indicating that higher temperatures (x) are associated with higher sales (y).
If the covariance is near 0, then the x and y variables are not linearly related.
If the covariance is less than 0, then the x and y variables are negatively related, which means that as x increases, y generally decreases.
Note: Covariance is directly affected by units of measurement (e.g., cm vs in).

SCATTER PLOTS
A scatter chart is a useful graph for analyzing the relationship between two variables. The scatter chart in the figure suggests that higher daily high temperatures are associated with higher bottled water sales. This is an example of a positive relationship, because when one variable (high temperature) increases, the other variable (sales of bottled water) generally also increases. The scatter chart also suggests that a straight line could be used as an approximation for the relationship between high temperature and sales of bottled water.
[Figure: three scatter charts showing a positive relationship, no relationship, and a negative relationship.]

CORRELATION
The correlation coefficient measures the linear relationship between two variables and, unlike covariance, is not affected by the units of measurement for x and y. The correlation coefficient can take only values between -1 and +1.
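The effect of units on covariance, and the lack of any effect on correlation, can be checked numerically. The sketch below is illustrative only: the temperature and sales figures are hypothetical (they are not the bottled water data from the slides), the function names are invented, and the sample (n - 1) form of the covariance is an assumption, since the slides do not state the formula.

```python
import math

def sample_covariance(x, y):
    """Sample covariance of x and y, using the n - 1 divisor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Correlation coefficient: covariance divided by both standard deviations."""
    sd_x = math.sqrt(sample_covariance(x, x))
    sd_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (sd_x * sd_y)

# Hypothetical daily high temperatures (Fahrenheit) and sales (cases).
temp_f = [78, 79, 80, 80, 82, 83, 85, 86, 87, 87, 88, 90, 92, 95]
sales = [23, 22, 24, 22, 24, 26, 27, 25, 28, 26, 29, 30, 31, 33]

# The same temperatures expressed in Celsius.
temp_c = [(f - 32) * 5 / 9 for f in temp_f]

print("covariance (F):", round(sample_covariance(temp_f, sales), 3))
print("covariance (C):", round(sample_covariance(temp_c, sales), 3))
print("correlation (F):", round(correlation(temp_f, sales), 3))
print("correlation (C):", round(correlation(temp_c, sales), 3))
```

Changing the temperature unit rescales the covariance by 5/9, while the two correlation values agree.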
Correlation coefficient values near 0 indicate no linear relationship between the x and y variables. Correlation coefficients greater than 0 indicate a positive linear relationship between the x and y variables; coefficients less than 0 indicate a negative linear relationship. The closer the correlation coefficient is to +1, the closer the x and y values are to forming a straight line that trends upward to the right (positive slope).

Because the correlation coefficient defined here measures only the strength of the linear relationship between two quantitative variables, it is possible for the correlation coefficient to be near zero, suggesting no linear relationship, when the relationship between the two variables is nonlinear.

DATA CLEANSING
1. Missing Data
2. Identification of Erroneous Outliers and Other Erroneous Values
3. Variable Representation
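The first two cleansing steps can be sketched with the z-score idea from the LOCATION slides. This is a rough illustration under stated assumptions: the `readings` list is hypothetical, missing values are assumed to be recorded as `None`, the function name is invented, and the cutoff of 3 is the approximate z-score threshold mentioned for outliers. A flagged value is only a candidate; deciding whether it is actually erroneous still takes judgment.

```python
import math

def summarize_quality(values, z_cutoff=3.0):
    """Count missing entries and list values whose |z-score| meets the cutoff.

    Assumes missing entries are None and the remaining values are not all equal.
    """
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    n = len(present)
    mean = sum(present) / n
    # Sample standard deviation (n - 1 divisor).
    sd = math.sqrt(sum((v - mean) ** 2 for v in present) / (n - 1))
    flagged = [v for v in present if abs((v - mean) / sd) >= z_cutoff]
    return missing, flagged

# Hypothetical readings: two missing entries and one suspicious value (500).
readings = [50, 48, 52, 51, 49, None, 50, 47, 53, 50, 51,
            49, 50, 52, 48, None, 50, 51, 49, 50, 50, 500]

missing, flagged = summarize_quality(readings)
print("missing entries:", missing)
print("possible outliers:", flagged)
```

Note that a single extreme value inflates both the mean and the standard deviation, so in small data sets a z-score cutoff can miss outliers that the box plot limits would catch.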