Chapter 1
Data and Statistics

• Why Study Statistics?
• Applications in Business and Economics
• Data and Data Sources
• Descriptive Statistics
• Statistical Inference
• Computers and Description of Raw Data

Learning Objectives
On completion of this chapter, students will be able to:
1. Understand why we study statistics and its applications
2. Explain what is meant by descriptive statistics and inferential statistics
3. Distinguish between a qualitative variable and a quantitative variable
4. Distinguish between a discrete variable and a continuous variable
5. Distinguish among the nominal, ordinal, interval, and ratio levels of measurement
6. Define the terms mutually exclusive and exhaustive

Why Study Statistics?
• Three main reasons why we study statistics are:
1. Data are everywhere
2. Statistical techniques are used to make many decisions that affect our lives
3. Whatever your future career, you will make decisions that involve data
• An understanding of statistical methods helps in making decisions more effectively

What Is Statistics? (source: McClave, Benson, Sincich)
1. Collecting Data, e.g., Survey
2. Presenting Data, e.g., Charts & Tables
3. Characterizing Data, e.g., Average
Why? Data Analysis → Decision-Making

What Is Statistics?
• Statistics is the science of data and involves collecting, organizing, presenting, analyzing, and interpreting data to assist in making more effective decisions.
• Descriptive statistics are the tabular, graphical, and numerical methods used to summarize and present data.
• Inferential statistics relate to making inferences or predictions about a population from observations and analyses of a sample.
It means that the results of an analysis based on sample data can be generalized to reflect the larger population that the sample represents.

Types of Statistics

Descriptive Statistics
• Methods of organizing, summarizing, and presenting data in an informative way
– Frequency distributions
– Chart forms
– Central tendency measures
– Data clustering

Inferential Statistics
• The methods used to determine something about a population, based on a sample
– A population is the entire set of individuals or objects of interest or the measurements obtained from all individuals or objects of interest
– A sample is a portion, or part, of the population of interest

Applications in Business and Economics
• Accounting: Public accounting firms use statistical sampling procedures when conducting audits for their clients.
• Economics: Economists use statistical information in making forecasts about the future of the economy or some aspect of it.
• Marketing: Electronic point-of-sale scanners at retail checkout counters are used to collect data for a variety of marketing research applications.
• Production: Emphasis on quality makes quality control an important application of statistics in production. A variety of statistical quality control charts are used to monitor the output of a production process.
• Finance: Financial analysts use a variety of statistical information to guide their investment recommendations. For example, financial advisors use price-earnings ratios and dividend yields to guide their investment recommendations.

Summary - Applications in Business and Economics
• Economics
– Forecasting
– Demographics
• Sports
– Individual & Team Performance
• Engineering
– Construction
– Materials
• Business
– Consumer Preferences
– Financial Trends

Data and Data Sets
• What is Data?
• Data are the facts and figures collected, summarized, analyzed, and interpreted.
• The data collected in a particular study are referred to as the data set.

Elements, Variables, and Observations
• The elements are the entities on which data are collected.
• A variable is a characteristic of interest for the elements.
• The set of measurements collected for a particular element is called an observation.

Examples of Data, Data Sets, Elements, Variables, and Observations

Company         Stock Exchange   Annual Sales ($M)   Earn/Share ($)
Dataram         NQ                73.10               0.86
EnergySouth     N                 74.00               1.67
Keystone        N                365.70               0.86
LandCare        NQ               111.40               0.33
Psychemedics    N                 17.60               0.13

The company names are the element names; Stock Exchange, Annual Sales, and Earn/Share are the variables; the table as a whole is the data set.

Data Definitions (Table 2.2)
Number of Variables and Typical Tasks

Data Set        Variables        Typical Tasks
Univariate      One              Histograms, descriptive statistics, frequency tallies
Bivariate       Two              Scatter plots, correlations, simple regression
Multivariate    More than two    Multiple regression, data mining, econometric modeling

Data Definitions
A Small Multivariate Data Set: 8 subjects, 5 variables

Data Definitions
Binary Data
A binary variable has only two values, 1 = presence, 0 = absence of a characteristic of interest (codes themselves are arbitrary). For example,
- 1 = employed, 0 = not employed
- 1 = married, 0 = not married
- 1 = male, 0 = female
- 1 = female, 0 = male
The coding itself has no numerical value, so binary variables are attribute data.

Scales of Measurement
• Scales of measurement include:
– Nominal
– Ordinal
– Interval
– Ratio
• The scale determines the amount of information contained in the data.
• The scale indicates the data summarization and statistical analyses that are most appropriate.

Scales of Measurement
• Nominal
• Data are labels or names used to identify an attribute of the element.
• A nonnumeric label or numeric code may be used.

Scales of Measurement
• Nominal
Example: Students of a university are classified by the school in which they are enrolled using a nonnumeric label such as Business, Humanities, Education, and so on.
Alternatively, a numeric code could be used for the school variable (e.g., 1 denotes Business, 2 denotes Humanities, 3 denotes Education, and so on).

Scales of Measurement
• Ordinal
• The data have the properties of nominal data, and the order or rank of the data is meaningful.
• A nonnumeric label or numeric code may be used.

Scales of Measurement
• Ordinal
Example: Students of a university are classified by their class standing using a nonnumeric label such as Freshman, Sophomore, Junior, or Senior. Alternatively, a numeric code could be used for the class standing variable (e.g., 1 denotes Freshman, 2 denotes Sophomore, and so on).

Scales of Measurement
• Interval
• The data have the properties of ordinal data, and the interval between observations is expressed in terms of a fixed unit of measure.
• Interval data are always numeric.
Example: Melissa has an SAT score of 1205, while Kevin has an SAT score of 1090. Melissa scored 115 points more than Kevin.

Scales of Measurement
• Ratio
• The data have all the properties of interval data, and the ratio of two values is meaningful.
• Variables such as distance, height, weight, and time use the ratio scale.
• This scale must contain a zero value that indicates that nothing exists for the variable at the zero point.

Scales of Measurement
• Ratio
Example: Melissa's college record shows 36 credit hours earned, while Kevin's record shows 72 credit hours earned. Kevin has twice as many credit hours earned as Melissa.

Qualitative and Quantitative Data
• Data can also be classified as being qualitative or quantitative.
• The statistical analysis that is appropriate depends on whether the data for the variable are qualitative or quantitative.
• In general, there are more alternatives for statistical analysis when the data are quantitative.
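Returning to the four scales of measurement, the examples given for them reduce to simple arithmetic. The following is a minimal Python sketch of which operations are meaningful on each scale. The variable names are illustrative assumptions, and the codes 3 = Junior and 4 = Senior are an assumed continuation of the "and so on" in the class standing example.

```python
# Illustrative sketch: which operations are meaningful on each scale.
# Variable names are hypothetical; values come from the slide examples.

# Nominal: numeric codes are only labels, so only equality checks make sense.
school_codes = {1: "Business", 2: "Humanities", 3: "Education"}

# Ordinal: codes carry rank, so comparisons are meaningful (differences are not).
# Codes 3 and 4 are an assumed continuation of the slide's "and so on".
class_standing = {"Freshman": 1, "Sophomore": 2, "Junior": 3, "Senior": 4}
senior_outranks_freshman = class_standing["Senior"] > class_standing["Freshman"]

# Interval: differences between values are meaningful (SAT scores).
melissa_sat, kevin_sat = 1205, 1090
sat_difference = melissa_sat - kevin_sat

# Ratio: a true zero point makes ratios meaningful (credit hours earned).
melissa_credits, kevin_credits = 36, 72
credit_ratio = kevin_credits / melissa_credits

print(sat_difference)  # 115
print(credit_ratio)    # 2.0
```

Averaging the school codes would run without error in Python but would have no meaning, which is the practical reason the scale of measurement has to be identified before choosing an analysis.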
Qualitative Data
• Qualitative data are labels or names used to identify an attribute of each element; they are often referred to as categorical data
• They use either the nominal or ordinal scale of measurement
• They can be either numeric or nonnumeric
• The statistical analyses appropriate for them are rather limited

Qualitative Data (Doane/Seward)
Data Types
Categorical or qualitative data: values are described by words rather than numbers. For example,
- Automobile style (e.g., X = full, midsize, compact, subcompact).
- Mutual fund (e.g., X = load, no-load).

Quantitative Data
• Quantitative data indicate how many or how much:
– Discrete, if measuring how many
– Continuous, if measuring how much
• Quantitative data are always numeric.
• Ordinary arithmetic operations are meaningful for quantitative data.

Data Definitions (Doane/Seward)
Discrete Data
A numerical variable with a countable number of values that can be represented by an integer (no fractional values). For example,
- Number of Medicaid patients (e.g., X = 2).
- Number of takeoffs at O'Hare (e.g., X = 37).

Data Definitions (Doane/Seward)
Continuous Data
A numerical variable that can have any value within an interval (e.g., length, weight, time, sales, price/earnings ratios). Any continuous interval contains infinitely many possible values (e.g., 426 < X < 428).

Scales of Measurement
• Data
– Qualitative
  - Numerical: Nominal, Ordinal
  - Non-numerical: Nominal, Ordinal
– Quantitative
  - Numerical: Interval, Ratio

Time Series vs. Cross-Sectional Data
• Time Series Data: values that correspond to specific measurements taken over a range of time periods
• Cross-Sectional Data: values collected from a number of subjects during a single time period

Time Series versus Cross-Sectional Data
• Time Series Data
Each observation in the sample represents a different equally spaced point in time (e.g., years, months, days). Periodicity may be annual, quarterly, monthly, weekly, daily, hourly, etc.
We are interested in trends and patterns over time (e.g., annual growth in consumer debit card use from 2015 to 2020).

Time Series Plot
• Used to graphically display data produced over time
• Shows trends and changes in the data over time
• Time recorded on the horizontal axis
• Measurements recorded on the vertical axis
• Points connected by straight lines

Time Series Plot Example
• The following data show the average retail price of regular gasoline for 8 weeks in 2006.
• Draw a time series plot for these data.

Date            Average Price
Oct 16, 2006    $2.219
Oct 23, 2006    $2.173
Oct 30, 2006    $2.177
Nov 6, 2006     $2.158
Nov 13, 2006    $2.185
Nov 20, 2006    $2.208
Nov 27, 2006    $2.236
Dec 4, 2006     $2.298

Time Series Plot Example
[Figure: time series plot of Price (vertical axis, 2.05 to 2.35) against Date (horizontal axis, 10/16 to 12/4)]

Time Series Data
• Time series data are collected over several time periods.
• Example: data detailing the number of building permits issued in Mississauga municipality, Ontario, in each of the last 36 months

Cross-Sectional Data
• Cross-sectional data are collected at the same or approximately the same point in time.
• Example: data detailing the number of building permits issued in June 2010 in each of the municipalities of Ontario

Time Series versus Cross-Sectional Data
• Cross-Sectional Data
Each observation represents a different individual unit (e.g., person) at the same point in time (e.g., monthly VISA balances). We are interested in
- variation among observations or in
- relationships.
We can combine the two data types to get pooled cross-sectional and time series data.

Fundamental Elements
1. Experimental unit
• Object upon which we collect data
2. Population
• All items of interest
3. Variable
• Characteristic of an individual experimental unit
4. Sample
• Subset of the units of a population
Memory aid: P in Population & Parameter; S in Sample & Statistic

Why Sample?
1. Prohibitive cost of surveying the whole population
2. Destructive nature of some tests
3. Physical impossibility of capturing the population

Parameters and Statistics
• Statistics are computed from a sample of n items, chosen from a population of N items.
• Statistics can be used as estimates of parameters found in the population.

Parameter or Statistic?
• Statistic: Any measurement computed from a sample. Usually, the statistic is regarded as an estimate of a population parameter. Sample statistics are often (but not always) represented by Roman letters.
• Parameter: Any measurement that describes an entire population. Usually, the parameter value is unknown since we rarely can observe the entire population. Parameters are often (but not always) represented by Greek letters.

Situations Where A Sample May Be Preferred
• Infinite Population: No census is possible if the population is infinite or of indefinite size (an assembly line can keep producing bolts, a doctor can keep seeing more patients).
• Destructive Testing: The act of sampling may destroy or devalue the item (measuring battery life, testing auto crashworthiness, or testing aircraft turbofan engine life).
• Timely Results: Sampling may yield more timely results than a census (checking wheat samples for moisture and protein content, checking peanut butter for aflatoxin contamination).
• Accuracy: Sample estimates can be more accurate than a census. Instead of spreading limited resources thinly to attempt a census, our budget of time and money might be better spent to hire experienced staff, improve training of field interviewers, and improve data safeguards.
• Cost: Even if it is feasible to take a census, the cost, either in time or money, may exceed our budget.
• Sensitive Information: Some kinds of information are better captured by a well-designed sample, rather than attempting a census.
Confidentiality may also be improved in a carefully done sample.

Situations Where A Census May Be Preferred
• Small Population: If the population is small, there is little reason to sample, for the effort of data collection may be only a small part of the total cost.
• Large Sample Size: If the required sample size approaches the population size, we might as well go ahead and take a census.
• Database Exists: If the data are on disk, we can examine 100% of the cases. But auditing or validating data against physical records may raise the cost.
• Legal Requirements: Banks must count all the cash in bank teller drawers at the end of each business day. The U.S. Congress forbade sampling in the 2000 decennial population census.

Finite or Infinite?
• A population is finite if it has a definite size, even if its size is unknown.
• A population is infinite if it is of arbitrarily large size.
• Rule of Thumb: A population may be treated as infinite when N is at least 20 times n (i.e., when N/n ≥ 20).

Descriptive Statistics
• Descriptive statistics are the tabular, graphical, and numerical methods used to summarize and present data.

Example: Hudson Auto Repair
The manager of Hudson Auto would like to have a better understanding of the cost of parts used in the engine tune-ups performed in the shop. She examines 50 customer invoices for tune-ups. The costs of parts, rounded to the nearest dollar, are listed on the next slide.
Example: Hudson Auto Repair
Sample of Parts Cost ($) for 50 Tune-ups

 91   71  104   85   62   78   69   74   97   82
 93   57   72   89   62   68   88   68   98  101
 75   52   99   66   75   79   97  105   77   83
 68   71   79  105   79   80   75   65   69   69
 97   62   72   76   80  109   67   74   62   73

Tabular Summary: Frequency and Percent Frequency

Parts Cost ($)   Parts Frequency   Percent Frequency
50-59             2                  4    (2/50)100
60-69            13                 26
70-79            16                 32
80-89             7                 14
90-99             7                 14
100-109           5                 10
Total            50                100

Graphical Summary: Histogram
[Figure: histogram titled "Tune-up Parts Cost", with Parts Cost ($) classes on the horizontal axis and Frequency on the vertical axis]

Numerical Descriptive Statistics
• The most common numerical descriptive statistic is the average (or mean).
• Hudson's average cost of parts, based on the 50 tune-ups studied, is $79 (found by summing the 50 cost values and then dividing by 50).

Pareto Diagram
Like a bar graph, but with the categories arranged by height in descending order from left to right.
• Vertical bars for qualitative variables
• Bar height shows frequency or % (percent used also)
• Equal bar widths
• Zero point
[Figure: Pareto diagram of Frequency (0 to 150) by Major (Acct., Mgmt., Econ.)]

Statistical Inference
• Population: the set of all elements of interest in a particular study
• Sample: a subset of the population
• Statistical inference: the process of using data obtained from a sample to make estimates and test hypotheses about the characteristics of a population
• Census: collecting data for a population
• Sample survey: collecting data for a sample

Process of Statistical Inference
1. Population consists of all tune-ups. Average cost of parts is unknown.
2. A sample of 50 engine tune-ups is examined.
3. The sample data provide a sample average parts cost of $79 per tune-up.
4. The sample average is used to estimate the population average.
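The Hudson Auto summaries above (frequency, percent frequency, and the sample average of about $79) can be reproduced with a few lines of code. The following is a minimal Python sketch using only the 50 parts costs listed in the example; the variable names are illustrative assumptions, not part of the original slides.

```python
# Sketch: tabular and numerical summaries for the Hudson Auto sample.
# The 50 parts costs ($) are taken from the example; names are hypothetical.
parts_cost = [
    91, 71, 104, 85, 62, 78, 69, 74, 97, 82,
    93, 57, 72, 89, 62, 68, 88, 68, 98, 101,
    75, 52, 99, 66, 75, 79, 97, 105, 77, 83,
    68, 71, 79, 105, 79, 80, 75, 65, 69, 69,
    97, 62, 72, 76, 80, 109, 67, 74, 62, 73,
]

# Class limits as (lower, upper), both inclusive, matching the frequency table.
classes = [(50, 59), (60, 69), (70, 79), (80, 89), (90, 99), (100, 109)]

n = len(parts_cost)

# Frequency: how many costs fall in each class.
frequency = [sum(1 for c in parts_cost if lo <= c <= hi) for lo, hi in classes]

# Percent frequency: (frequency / n) * 100, e.g. (2/50)100 = 4.
percent = [100 * f / n for f in frequency]

# Sample mean: sum of the 50 values divided by 50.
sample_mean = sum(parts_cost) / n

for (lo, hi), f, p in zip(classes, frequency, percent):
    print(f"{lo}-{hi}: frequency {f}, percent {p:.0f}")
print(f"sample mean = {sample_mean:.2f}")  # 78.98, which rounds to $79
```

In the language of statistical inference, `sample_mean` is a statistic: it is computed from the 50 sampled invoices and serves as an estimate of the unknown population average for all tune-ups.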
Chapter Summary
• Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making more effective decisions
• There are two types of statistics: descriptive and inferential
• There are two types of variables: qualitative and quantitative
• There are two types of quantitative variables: discrete and continuous
• There are four levels of measurement: nominal, ordinal, interval, and ratio

Chapter END!