Chapter 1 Data and Statistics
Xue Yan (薛艳)
School of Economics
Email: xueyan@jsut.edu.cn
Office: 2# 214

Overview
◼ What are the applications of statistics?
◼ What is statistics?
◼ Why study statistics?

Applications of Statistics
◼ Weather forecasting
◼ Medical treatment, e.g., the relationship between smoking and lung cancer

Applications in Accounting
◼ Public accounting firms use statistical sampling procedures when conducting audits for their clients.
◼ For instance, the audit staff selects a subset of the accounts called a sample. After reviewing the accuracy of the sampled accounts, the auditors draw a conclusion as to whether the accounts receivable amount shown on the client’s balance sheet is acceptable.
◼ Collecting statistical data
  ▪ Census-taking
  ▪ Sampling

Applications in Finance
◼ Financial analysts use a variety of statistical information to guide their investment recommendations.
◼ For instance, the analysts review a variety of financial data including price/earnings ratios and dividend yields. By comparing the information for an individual stock with information about the stock market averages, a financial analyst can begin to draw a conclusion as to whether an individual stock is over- or underpriced.

Applications in Marketing
◼ Electronic point-of-sale (POS) scanners at retail checkout counters are used to collect data for a variety of marketing research applications.
◼ For example, brand managers can review the scanner statistics and the promotional activity statistics to gain a better understanding of the relationship between promotional activities and sales. Such analyses often prove helpful in establishing future marketing strategies for the various products.
◼ Examples: 7-11, Wal-Mart, etc.

Applications in Production
◼ A variety of statistical quality control charts are used to monitor the output of a production process.
◼ For example, suppose that a machine fills containers with 12 ounces of a soft drink.
Periodically, a production worker selects a sample of containers and computes the average number of ounces in the sample. Properly interpreting the average can help determine when adjustments are necessary to correct the production process.
◼ Example: yield rate in semiconductor chip manufacturing.

Applications in Economics
◼ Economists frequently provide forecasts about the future of the economy, using a variety of statistical information in making such forecasts.
◼ For instance, in forecasting inflation rates, economists use statistical information on such indicators as the Producer Price Index, the unemployment rate, and manufacturing capacity utilization.
  Inflation Rate = I + a*Producer Price Index + b*unemployment rate + c*manufacturing capacity utilization + e
  Here I is the intercept, a, b, and c are the coefficients of the three indicators, and e is the error term.
◼ Often these statistical indicators are entered into computerized forecasting models that predict the inflation rate.

In Today’s Business World You Cannot Escape From Data
◼ In today’s digital world, ever-increasing amounts of data are gathered, stored, reported on, and made available for further study.
◼ You hear the word “data” everywhere.
◼ Data are facts about the world and are constantly reported as numbers by an ever-increasing number of sources.

Statistics is all around us
◼ Is housing becoming more expensive over time?
◼ Has the unemployment rate fallen over the past year?
◼ Who is the highest-scoring basketball player in the NBA?
◼ Are millennials more likely to rent than other age groups?
◼ Who is the highest-paid actress in Hollywood?
◼ What is the average salary of a starting business analyst?
◼ Is the average salary of a newly graduated engineer higher than that of a newly graduated economist?
◼ Has the crime rate declined in China in recent years?
The language of Statistics
◼ We use statistics every day without really being mindful of it:
  ▪ Average income, age, height, …
  ▪ Highest-paid (maximum) athlete
  ▪ Fastest (maximum speed) sprinter
  ▪ Lowest (minimum) unemployment rate of all OECD countries
  ▪ Percentage of females studying engineering
  ▪ How consistent (variance) has a stock’s performance been over the past three months?
  ▪ On average, do men spend more on clothes than women (t-test)?

To Properly Apply Statistics You Should Follow A Framework To Minimize Possible Errors
In this book we will use DCOVA:
◼ Define the data you want to study in order to solve a problem or meet an objective
◼ Collect the data from appropriate sources
◼ Organize the data collected by developing tables
◼ Visualize the data by developing charts
◼ Analyze the data collected to reach conclusions and present results

Definition Of Some Terms
VARIABLE: A characteristic of an item or individual.
DATA: The set of individual values associated with a variable.
STATISTICS: see “What is statistics?” below.

Company | Stock Exchange | Annual Sales | Earn/Share
Dataram | AMEX | 73.10 | 0.86
Energy South | OTC | 74.00 | 1.67
Keystone | NYSE | 365.70 | 0.86
Land Care | NYSE | 111.40 | 0.33
Psychemedics | AMEX | 17.60 | 0.13

Business Statistics: A First Course, 5e © 2009 Prentice-Hall, Inc.

What is statistics?
◼ A branch of mathematics taking and transforming numbers into useful information for decision makers
◼ Methods for processing & analyzing numbers
◼ Methods for helping reduce the uncertainty inherent in decision making
Statistics is the art and science of collecting, analyzing, presenting, and interpreting data.

Types of Statistics
Statistics: The branch of mathematics that transforms data into useful information for decision makers.
◼ Descriptive Statistics: Collecting, summarizing, and describing data
◼ Inferential Statistics: Drawing conclusions and/or making decisions concerning a population based only on sample data

Descriptive Statistics
◼ Collect data
  ▪ e.g., Survey
◼ Present data
  ▪ e.g., Tables and graphs
◼ Characterize data
  ▪ e.g., Sample mean = $\frac{\sum X_i}{n}$

Example: Hudson Auto Repair (1)
The manager of Hudson Auto would like to have a better understanding of the cost of parts used in the engine tune-ups performed in the shop. She examines 50 customer invoices for tune-ups. The costs of parts, rounded to the nearest dollar, are listed on the next slide.

Example: Hudson Auto Repair (2)
◼ Sample of Parts Cost for 50 Tune-ups
 91  78  93  57  75  52  99  80  97  62
 71  69  72  89  66  75  79  75  72  76
104  74  62  68  97 105  77  65  80 109
 85  97  88  68  83  68  71  69  67  74
 62  82  98 101  79 105  79  69  62  73

Tabular Summary: Frequency and Percent Frequency
Parts Cost ($) | Frequency | Percent Frequency
50-59 | 2 | 4, i.e., (2/50)100
60-69 | 13 | 26
70-79 | 16 | 32
80-89 | 7 | 14
90-99 | 7 | 14
100-109 | 5 | 10
Total | 50 | 100

Graphical Summary: Histogram
[Figure: histogram titled “Tune-up Parts Cost”; horizontal axis: Parts Cost ($), one bar per class in the table above; vertical axis: Frequency]

Inferential Statistics
◼ Estimation
  ▪ e.g., Estimate the population mean weight using the sample mean weight
◼ Hypothesis testing
  ▪ e.g., Test the claim that the population mean weight is 120 pounds
Drawing conclusions about a large group of individuals based on a subset of the large group.

Why Study Statistics?
◼ To visualize & summarize business data
  ▪ Descriptive methods used to create charts & tables
◼ To draw conclusions from business data
  ▪ Inferential methods used to reach conclusions about a large group based on data from a smaller group
◼ To make reliable forecasts about business activities
  ▪ Inferential methods utilizing statistical models based on business data
◼ To improve business processes
  ▪ Involves managerial approaches like Six Sigma

References
◼ Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference [M]. Springer, 2010.
◼ Bradley Efron; Trevor Hastie. Computer Age Statistical Inference: Algorithms, Evidence, and Data Science [M]. Cambridge University Press, 2016.
◼ David S. Moore; George P. McCabe; Bruce A. Craig. Introduction to the Practice of Statistics [M]. Macmillan Learning, 2021.

Objectives
In this chapter you learn:
◼ How to define variables
◼ How to collect data
◼ To identify different ways to collect a sample
◼ To understand the types of survey errors

Classifying Variables By Type
◼ Categorical (qualitative) variables take categories as their values, such as “yes”, “no”, or “blue”, “brown”, “green”.
◼ Numerical (quantitative) variables have values that represent a counted or measured quantity.
  ▪ Discrete variables arise from a counting process
  ▪ Continuous variables arise from a measuring process

Examples of Types of Variables
Question | Responses | Variable Type
Do you have a Facebook profile? | Yes or No | Categorical (qualitative)
How many text messages have you sent in the past three days? | --------------- | Numerical (discrete)
How long did the mobile app update take to download? | --------------- | Numerical (continuous)

Types of Variables
Variables
◼ Categorical (defined categories)
  ▪ Examples: Marital Status, Political Party, Eye Color
◼ Numerical
  ▪ Discrete (counted items). Examples: Number of Children, Defects per hour
  ▪ Continuous (measured characteristics). Examples: Weight, Voltage

Collecting Data Correctly Is A Critical Task
◼ Need to avoid data flawed by biases, ambiguities, or other types of errors.
◼ Results from flawed data will be suspect or in error.
◼ Even the most sophisticated statistical methods are not very useful when the data is flawed.

Developing Operational Definitions Is Crucial To Avoid Confusion / Errors
◼ An operational definition is a clear and precise statement that provides a common understanding of meaning.
◼ In the absence of an operational definition, miscommunications and errors are likely to occur.
◼ Arriving at operational definition(s) is a key part of the Define step of DCOVA.

Sources of Data
◼ Primary Sources: The data collector is the one using the data for analysis
  ▪ Data from a political survey
  ▪ Data collected from an experiment
  ▪ Observed data
◼ Secondary Sources: The person performing data analysis is not the data collector
  ▪ Analyzing census data
  ▪ Examining data from print journals or data published on the internet

Sources of data fall into five categories
◼ Data distributed by an organization or an individual
◼ The outcomes of a designed experiment
◼ The responses from a survey
◼ The results of conducting an observational study
◼ Data collected by ongoing business activities

Examples Of Data Distributed By Organizations or Individuals
◼ Financial data on a company provided by investment services.
◼ Industry or market data from market research firms and trade associations.
◼ Stock prices, weather conditions, and sports statistics in daily newspapers.
Examples of Data From A Designed Experiment
◼ Consumer testing of different versions of a product to help determine which product should be pursued further.
◼ Material testing to determine which supplier’s material should be used in a product.
◼ Market testing on alternative product promotions to determine which promotion to use more broadly.

Examples of Survey Data
◼ A survey asking people which laundry detergent has the best stain-removing abilities.
◼ Political polls of registered voters during political campaigns.
◼ People being surveyed to determine their satisfaction with a recent product or service experience.

Examples of Data Collected From Observational Studies
◼ Market researchers utilizing focus groups to elicit unstructured responses to open-ended questions.
◼ Measuring the time it takes for customers to be served in a fast food establishment.
◼ Measuring the volume of traffic through an intersection to determine if some form of advertising at the intersection is justified.

Examples of Data Collected From Ongoing Business Activities
◼ A bank studies years of financial transactions to help it identify patterns of fraud.
◼ Economists utilize data on searches done via Google to help forecast future economic conditions.
◼ Marketing companies use tracking data to evaluate the effectiveness of a web site.

Data Is Collected From Either A Population or A Sample
POPULATION: A population consists of all the items or individuals about which you want to draw a conclusion. The population is the “large group”.
SAMPLE: A sample is the portion of a population selected for analysis. The sample is the “small group”.

Population vs. Sample
Population: All the items or individuals about which you want to draw conclusion(s)
Sample: A portion of the population of items or individuals

Collecting Data Via Sampling Is Used When Selecting A Sample Is
◼ Less time consuming than selecting every item in the population.
◼ Less costly than selecting every item in the population.
◼ Less cumbersome and more practical than analyzing the entire population.

Things To Consider / Deal With In Potential Sources Of Data
◼ Is the source of data structured or unstructured?
◼ How is electronic data formatted?
◼ How is data encoded?

Structured Data Follows An Organizing Principle & Unstructured Data Does Not
◼ A stock ticker provides structured data:
  ▪ The stock ticker repeatedly reports a company name, the number of shares last traded, the bid price, and the percent change in the stock price.
◼ Due to their inherent structure, data from tables and forms are structured data.
◼ E-mails from five people concerning stock trades are an example of unstructured data.
  ▪ In these e-mails you cannot count on the information being shared in a specific order or format.
◼ This book deals exclusively with structured data.

All Of The Methods In This Book Deal With Structured Data
◼ To use the techniques in this book on unstructured data, you need to convert the unstructured data into structured data.
◼ For many of the questions you might want to answer, the starting point can / will be tabular data.

Data Can Be Formatted and / or Encoded In More Than One Way
◼ Some electronic formats are more readily usable than others.
◼ Different encodings can impact the precision of numerical variables and can also impact data compatibility.
◼ As you identify and choose sources of data, you need to consider / deal with these issues.

Data Cleaning Is Often A Necessary Activity When Collecting Data
◼ You will often find “irregularities” in the data:
  ▪ Typographical or data entry errors
  ▪ Values that are impossible or undefined
  ▪ Missing values
  ▪ Outliers
◼ When found, these irregularities should be reviewed / addressed.
◼ Both Excel & Minitab can be used to address irregularities.

After Collection It Is Often Helpful To Recode Some Variables
◼ Recoding a variable can either supplement or replace the original variable.
◼ Recoding a categorical variable involves redefining categories.
◼ Recoding a quantitative variable involves changing this variable into a categorical variable.
◼ When recoding, be sure that the new categories are mutually exclusive (categories do not overlap) and collectively exhaustive (categories cover all possible values).

A Sampling Process Begins With A Sampling Frame
◼ The sampling frame is a listing of items that make up the population.
◼ Frames are data sources such as population lists, directories, or maps.
◼ Inaccurate or biased results can occur if a frame excludes certain portions of the population.
◼ Using different frames to generate data can lead to dissimilar conclusions.

Types of Samples
Samples
◼ Non-Probability Samples: Judgment, Convenience
◼ Probability Samples: Simple Random, Stratified, Systematic, Cluster

Types of Samples: Nonprobability Sample
◼ In a nonprobability sample, items included are chosen without regard to their probability of occurrence.
  ▪ In convenience sampling, items are selected based only on the fact that they are easy, inexpensive, or convenient to sample.
  ▪ In a judgment sample, you get the opinions of preselected experts in the subject matter.

Types of Samples: Probability Sample
◼ In a probability sample, items in the sample are chosen on the basis of known probabilities.
Probability Samples: Simple Random, Systematic, Stratified, Cluster

Probability Sample: Simple Random Sample
◼ Every individual or item from the frame has an equal chance of being selected.
◼ Selection may be with replacement (selected individual is returned to the frame for possible reselection) or without replacement (selected individual isn’t returned to the frame).
◼ Samples are obtained from a table of random numbers or from computer random number generators.

Selecting a Simple Random Sample Using A Random Number Table
Sampling Frame For Population With 850 Items
Item Name | Item #
Bev R. | 001
Ulan X. | 002
. . . | . . .
Joann P. | 849
Paul F. | 850

Portion Of A Random Number Table
49280 88924 35779 00283 81163 07275
11100 02340 12860 74697 96644 89439
09893 23997 20048 49420 88872 08401

The First 5 Items in a simple random sample (reading the table three digits at a time)
Item # 492
Item # 808
Item # 892 -- does not exist so ignore
Item # 435
Item # 779
Item # 002

Probability Sample: Systematic Sample
◼ Decide on sample size: n
◼ Divide the frame of N individuals into groups of k individuals: k = N/n
◼ Randomly select one individual from the 1st group
◼ Select every kth individual thereafter
[Figure: frame with N = 40, n = 4, k = 10; the first group is marked]

Probability Sample: Stratified Sample
◼ Divide the population into two or more subgroups (called strata) according to some common characteristic.
◼ A simple random sample is selected from each subgroup, with sample sizes proportional to strata sizes.
◼ Samples from subgroups are combined into one.
◼ This is a common technique when sampling a population of voters, stratifying across racial or socio-economic lines.
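The simple random, systematic, and stratified procedures described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the fixed seed, and the three strata "A", "B", "C" are invented for the example and are not taken from the textbook; the frame of 40 items with n = 4 follows the systematic-sample slide.

```python
import random


def systematic_sample(frame, n, rng):
    """Pick a random start in the first group of k = N // n items, then every k-th item."""
    k = len(frame) // n
    start = rng.randrange(k)  # random position inside the first group
    return [frame[start + i * k] for i in range(n)]


def stratified_sample(strata, n, rng):
    """Simple random sample (without replacement) from each stratum,
    with sample sizes proportional to stratum sizes."""
    total = sum(len(items) for items in strata.values())
    sample = []
    for items in strata.values():
        # Rounding can make the sizes miss n slightly when proportions are not whole numbers.
        size = round(n * len(items) / total)
        sample.extend(rng.sample(items, size))
    return sample


rng = random.Random(0)          # fixed seed so the sketch is repeatable (assumption)
frame = list(range(1, 41))      # N = 40 items, numbered 1..40

srs = rng.sample(frame, 4)                      # simple random sample, without replacement
sys_sample = systematic_sample(frame, 4, rng)   # n = 4, so k = 10

# Hypothetical strata of sizes 50, 30, and 20 (invented for illustration).
strata = {
    "A": list(range(0, 50)),
    "B": list(range(50, 80)),
    "C": list(range(80, 100)),
}
strat_sample = stratified_sample(strata, 10, rng)  # 5 from A, 3 from B, 2 from C
```

In the systematic sample the four selected items are always exactly k = 10 positions apart, while the stratified sample guarantees that each stratum contributes in proportion to its size, which a simple random sample does not.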
[Figure: stratified sample, population divided into 4 strata]

Probability Sample: Cluster Sample
◼ Population is divided into several “clusters,” each representative of the population.
◼ A simple random sample of clusters is selected.
◼ All items in the selected clusters can be used, or items can be chosen from a cluster using another probability sampling technique.
◼ A common application of cluster sampling involves election exit polls, where certain election districts are selected and sampled.
[Figure: population divided into 16 clusters, with randomly selected clusters forming the sample]

Probability Sample: Comparing Sampling Methods
◼ Simple random sample and systematic sample
  ▪ Simple to use
  ▪ May not be a good representation of the population’s underlying characteristics
◼ Stratified sample
  ▪ Ensures representation of individuals across the entire population
◼ Cluster sample
  ▪ More cost effective
  ▪ Less efficient (need larger sample to acquire the same level of precision)

Evaluating Survey Worthiness
◼ What is the purpose of the survey?
◼ Is the survey based on a probability sample?
◼ Coverage error – appropriate frame?
◼ Nonresponse error – follow up
◼ Measurement error – good questions elicit good responses
◼ Sampling error – always exists

Exercise
Suppose that 10,000 customers in a retailer’s customer database are categorized by three customer types: 3,500 prospective buyers, 4,500 first time buyers, and 2,000 repeat (loyal) buyers. A sample of 1,000 customers is needed.
a. What type of sampling should you do? Why?
b. Explain how you would carry out the sampling according to the method stated in (a).
c. Why is the sampling in (a) not simple random sampling?
Types of Survey Errors
◼ Coverage error or selection bias
  ▪ Exists if some groups are excluded from the frame and have no chance of being selected
◼ Nonresponse error or bias
  ▪ People who do not respond may be different from those who do respond
◼ Sampling error
  ▪ Variation from sample to sample will always exist
◼ Measurement error
  ▪ Due to weaknesses in question design and / or respondent error

Types of Survey Errors (continued)
◼ Coverage error: Excluded from frame
◼ Nonresponse error: Follow up on nonresponses
◼ Sampling error: Random differences from sample to sample
◼ Measurement error: Bad or leading question

Chapter Summary
In this chapter we have discussed:
◼ The types of variables used in statistics
◼ How to collect data
◼ The different ways to collect a sample
◼ The types of survey errors