IST 203 Statistics for Social Sciences (Section 5231, 5232)
Review for Final Examination [ Lectures 1, 2, 3 (A, B), 4A, 5A, 6, 7 ]
Bangkok University International College
May 7, 2011

IST 203: Statistics for Social Sciences
Lecture 1

What is Statistics?
• Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data.
• A statistic is a single measure (number) used to summarize a sample data set. For example, the average height of students in this class.
• A statistician is an expert with at least a master’s degree in mathematics or statistics or a trained professional in a related field.

McGraw-Hill/Irwin © 2008 The McGraw-Hill Companies, Inc. All rights reserved.

Uses of Statistics
Two primary uses for statistics:
• Descriptive statistics – the collection, organization, presentation and summary of data.
• Inferential statistics – generalizing from a sample to a population, estimating unknown parameters, drawing conclusions, making decisions.

Statistical Challenges
Working with Imperfect Data
• State any assumptions and limitations and use generally accepted statistical tests to detect unusual data points or to deal with missing data.
Dealing with Practical Constraints
• You will face constraints on the type and quantity of data you can collect.
Upholding Ethical Standards
• Know and follow accepted procedures, maintain data integrity, carry out accurate calculations, report procedures, protect confidentiality, cite sources and financial support.
Using Consultants
• Hire consultants at the beginning of the project, when your team lacks certain skills or when an unbiased or informed view is needed.

Statistical Pitfalls
Pitfall 1: Making Conclusions about a Large Population from a Small Sample
• Be careful about making generalizations from small samples (e.g., a group of 10 patients).
Pitfall 2: Making Conclusions from Nonrandom Samples
• Be careful about making generalizations from retrospective studies of special groups (e.g., heart attack patients).
Pitfall 3: Attaching Importance to Rare Observations from Large Samples
• Be careful about drawing strong inferences from events that are not surprising when looking at the entire population (e.g., winning the lottery).
Pitfall 4: Using Poor Survey Methods
• Be careful about using poor sampling methods or vaguely worded questions (e.g., anonymous survey or quiz).
Pitfall 5: Assuming a Causal Link Based on Observations
• Be careful about drawing conclusions when no cause-and-effect link exists (e.g., most shark attacks occur between 12 p.m. and 2 p.m.).
Pitfall 6: Making Generalizations about Individuals from Observations about Groups
• Avoid reading too much into statistical generalizations (e.g., men are taller than women).
Pitfall 7: Unconscious Bias
• Be careful about unconsciously or subtly allowing bias to color handling of data (e.g., heart disease in men vs. women).
Pitfall 8: Attaching Practical Importance to Every Statistically Significant Study Result
• Statistically significant effects may lack practical importance (e.g., Austrian military recruits born in the spring average 0.6 cm taller than those born in the fall).

IST 203: Statistics for Social Sciences
Lecture 2

Data Vocabulary
• Data is the plural form of the Latin datum (a “given” fact).
• In scientific research, data arise from experiments whose results are recorded systematically.
• In business, data usually arise from accounting transactions or management processes.
• Important decisions may depend on data.

Data Vocabulary: Subjects, Variables, Data Sets
• We will treat "data" as plural and use "data set" for a particular collection of data as a whole.
• Observation – each data value.
• Subject (or individual) – an item for study (e.g., an employee in your company).
• Variable – a characteristic about the subject or individual (e.g., employee’s income).
• Three types of data sets:

Data Set       Variables       Typical Tasks
Univariate     One             Histograms, descriptive statistics, frequency tallies
Bivariate      Two             Scatter plots, correlations, simple regression
Multivariate   More than two   Multiple regression, data mining, econometric modeling

• Consider a multivariate data set with 5 variables and 8 subjects: 5 × 8 = 40 observations.

Data Vocabulary: Attribute Data
• Also called categorical, nominal or qualitative data.
• Values are described by words rather than numbers.
• For example,
- Automobile style (e.g., X = full, midsize, compact, subcompact).
- Mutual fund (e.g., X = load, no-load).

Data Vocabulary: Data Coding
• Coding refers to using numbers to represent categories to facilitate statistical analysis.
• Coding an attribute as a number does not make the data numerical.
• For example, 1 = Bachelor’s, 2 = Master’s, 3 = Doctorate
• Rankings may exist, for example, 1 = Liberal, 2 = Moderate, 3 = Conservative

Data Vocabulary: Binary Data
• A binary variable has only two values, 1 = presence, 0 = absence of a characteristic of interest (codes themselves are arbitrary).
• For example,
- 1 = employed, 0 = not employed
- 1 = married, 0 = not married
- 1 = male, 0 = female
- 1 = female, 0 = male
• The coding itself has no numerical value so binary variables are attribute data.

Data Vocabulary: Numerical Data
• Numerical or quantitative data arise from counting or some kind of mathematical operation.
• For example,
- Number of auto insurance claims filed in March (e.g., X = 114 claims).
- Ratio of profit to sales for last quarter (e.g., X = 0.0447).
• Can be broken down into two types – discrete or continuous data.

Data Vocabulary: Discrete Data
• A numerical variable with a countable number of values that can be represented by an integer (no fractional values).
• For example,
- Number of Medicaid patients (e.g., X = 2).
- Number of takeoffs at O’Hare (e.g., X = 37).

Data Vocabulary: Continuous Data
• A numerical variable that can have any value within an interval (e.g., length, weight, time, sales, price/earnings ratios).
• Any continuous interval contains infinitely many possible values (e.g., 426 < X < 428).

Level of Measurement
Four levels of measurement for data:

Level of Measurement   Characteristics          Example
Nominal                Categories only          Eye color (blue, brown, green, hazel)
Ordinal                Rank has meaning         Bond ratings (Aaa, Aab, C, D, F, etc.)
Interval               Distance has meaning     Temperature (57° Celsius)
Ratio                  Meaningful zero exists   Accounts payable ($21.7 million)

Level of Measurement: Nominal Measurement
• Nominal data merely identify a category.
• Nominal data are qualitative, attribute, categorical or classification data (e.g., Apple, Compaq, Dell, HP).
• Nominal data are usually coded numerically; the codes are arbitrary (e.g., 1 = Apple, 2 = Compaq, 3 = Dell, 4 = HP).
• The only permissible mathematical operations are counting (e.g., frequencies) and simple statistics.

Level of Measurement: Ordinal Measurement
• Ordinal data codes can be ranked (e.g., 1 = Frequently, 2 = Sometimes, 3 = Rarely, 4 = Never).
• Distance between codes is not meaningful (e.g., distance between 1 and 2, or between 2 and 3, or between 3 and 4 lacks meaning).
• Many useful statistical tests exist for ordinal data. Especially useful in social science, marketing and human resource research.

Level of Measurement: Interval Measurement
• Data can not only be ranked, but also have meaningful intervals between scale points (e.g., the difference between 60°F and 70°F is the same as the difference between 20°F and 30°F).
• Since intervals between numbers represent distances, mathematical operations can be performed (e.g., average).
• Zero point of interval scales is arbitrary, so ratios are not meaningful (e.g., 60°F is not twice as warm as 30°F).

Level of Measurement: Likert Scales
• A special case of interval data frequently used in survey research.
• The coarseness of a Likert scale refers to the number of scale points (typically 5 or 7).
• Example item: “College-bound high school students should be required to study a foreign language.” (check one)
- Strongly Agree
- Somewhat Agree
- Neither Agree Nor Disagree
- Somewhat Disagree
- Strongly Disagree
• A neutral midpoint (“Neither Agree Nor Disagree”) is available when an odd number of scale points is used; it can be omitted (an even number of points) to force the respondent to “lean” one way or the other.
• Likert data are coded numerically (e.g., 1 to 5) but any equally spaced values will work.

Likert coding: 1 to 5 scale   Likert coding: -2 to +2 scale
5 = Help a lot                +2 = Help a lot
4 = Help a little             +1 = Help a little
3 = No effect                  0 = No effect
2 = Hurt a little             -1 = Hurt a little
1 = Hurt a lot                -2 = Hurt a lot

Sampling Concepts: Sample or Census?
• A sample involves looking only at some items selected from the population.
• A census is an examination of all items in a defined population.
• Why can’t the United States Census survey every person in the population?
- Mobility
- Illegal immigrants
- Budget constraints
- Incomplete responses or nonresponses

Sampling Concepts: Situations Where A Sample May Be Preferred
Infinite Population: No census is possible if the population is infinite or of indefinite size (an assembly line can keep producing bolts, a doctor can keep seeing more patients).
Destructive Testing: The act of sampling may destroy or devalue the item (measuring battery life, testing auto crashworthiness, or testing aircraft turbofan engine life).
Timely Results: Sampling may yield more timely results than a census (checking wheat samples for moisture and protein content, checking peanut butter for aflatoxin contamination).
Accuracy: Sample estimates can be more accurate than a census. Instead of spreading limited resources thinly to attempt a census, our budget of time and money might be better spent to hire experienced staff, improve training of field interviewers, and improve data safeguards.
Cost: Even if it is feasible to take a census, the cost, either in time or money, may exceed our budget.
Sensitive Information: Some kinds of information are better captured by a well-designed sample, rather than attempting a census. Confidentiality may also be improved in a carefully-done sample.

Sampling Concepts: Situations Where A Census May Be Preferred
Small Population: If the population is small, there is little reason to sample, for the effort of data collection may be only a small part of the total cost.
Large Sample Size: If the required sample size approaches the population size, we might as well go ahead and take a census.
Database Exists: If the data are on disk we can examine 100% of the cases. But auditing or validating data against physical records may raise the cost.
Legal Requirements: Banks must count all the cash in bank teller drawers at the end of each business day. The U.S. Congress forbade sampling in the 2000 decennial population census.

Sampling Concepts: Parameters and Statistics
• Statistics are computed from a sample of n items, chosen from a population of N items.
• Statistics can be used as estimates of parameters found in the population.
• Symbols are used to represent population parameters and sample statistics.

Parameter or Statistic?
Parameter: Any measurement that describes an entire population.
Usually, the parameter value is unknown since we rarely can observe the entire population. Parameters are often (but not always) represented by Greek letters.
Statistic: Any measurement computed from a sample. Usually, the statistic is regarded as an estimate of a population parameter. Sample statistics are often (but not always) represented by Roman letters.

Sampling Concepts: Parameters and Statistics
• The population must be carefully specified and the sample must be drawn scientifically so that the sample is representative.
Target Population
• The target population is the population we are interested in (e.g., U.S. gasoline prices).
• The sampling frame is the group from which we take the sample (e.g., 115,000 stations).
• The frame should not differ from the target population.

Sampling Concepts: Finite or Infinite?
• A population is finite if it has a definite size, even if its size is unknown.
• A population is infinite if it is of arbitrarily large size.
• Rule of Thumb: A population may be treated as infinite when N is at least 20 times n (i.e., when N/n ≥ 20).

Sampling Methods: Probability Samples
Simple Random Sample: Use random numbers to select items from a list (e.g., VISA cardholders).
Systematic Sample: Select every kth item from a list or sequence (e.g., restaurant customers).
Stratified Sample: Select randomly within defined strata (e.g., by age, occupation, gender).
Cluster Sample: Like stratified sampling except strata are geographical areas (e.g., zip codes).

Sampling Methods: Nonprobability Samples
Judgment Sample: Use expert knowledge to choose “typical” items (e.g., which employees to interview).
Convenience Sample: Use a sample that happens to be available (e.g., ask co-worker opinions at lunch).

Sampling Methods: With or Without Replacement
• If we allow duplicates when sampling, then we are sampling with replacement.
• Duplicates are unlikely when n is much smaller than N.
• If we do not allow duplicates when sampling, then we are sampling without replacement.

Sampling Methods: Systematic Sampling
• Sample by choosing every kth item from a list, starting from a randomly chosen entry on the list.
• For example, starting at item 2, we sample every k = 4 items to obtain a sample of n = 20 items from a list of N = 78 items.
• Note that N/n = 78/20 ≈ 4.
• A systematic sample of n items from a population of N items requires that periodicity k be approximately N/n.
• Systematic sampling should yield acceptable results unless patterns in the population happen to recur at periodicity k.
• Can be used with unlistable or infinite populations.
• Systematic samples are well-suited to linearly organized physical populations.
• For example, out of 501 companies, we want to obtain a sample of 25. What should the periodicity k be? k = N/n = 501/25 ≈ 20.
• So, we should choose every 20th company from a random starting point.

Sampling Methods: Stratified Sampling
• Utilizes prior information about the population.
• Applicable when the population can be divided into relatively homogeneous subgroups of known size (strata).
• A simple random sample of the desired size is taken within each stratum.
• For example, from a population containing 55% males and 45% females, randomly sample 110 males and 90 females (n = 200), so that the sample matches the population proportions.
• Or, take a random sample of the entire population and then combine individual strata estimates using appropriate weights.
• For a population with L strata, the population size N is the sum of the stratum sizes: $N = N_1 + N_2 + \dots + N_L$
• The weight assigned to stratum j is its share of the population: $w_j = N_j / N$
• For example, take a random sample of n = 200 and then weight the responses for males by $w_M = 0.55$ and for females by $w_F = 0.45$.

Sampling Methods: Cluster Sample
• Strata consist of geographical regions.
• One-stage cluster sampling – sample consists of all elements in each of k randomly chosen subregions (clusters).
• Two-stage cluster sampling – first choose k subregions (clusters), then choose a random sample of elements within each cluster.

Sampling Methods: Judgment Sample
• A nonprobability sampling method that relies on the expertise of the sampler to choose items that are representative of the population.
• Can be affected by subconscious bias (i.e., nonrandomness in the choice).
• Quota sampling is a special kind of judgment sampling, in which the interviewer chooses a certain number of people in each category.

Sampling Methods: Convenience Sample
• Take advantage of whatever sample is available at that moment. A quick way to sample.
Sample Size
• Sample size depends on the inherent variability of the quantity being measured and on the desired precision of the estimate.

IST 203: Statistics for Social Sciences
Lecture 3 (A, B)

Visual Description
• Methods of organizing, exploring and summarizing data include:
- Visual (charts and graphs) provides insight into characteristics of a data set without using mathematics.
- Numerical (statistics or tables) provides insight into characteristics of a data set using mathematics.
• Begin with univariate data (a set of n observations on one variable) and consider the following:

Characteristic: Interpretation
Measurement: What are the units of measurement? Are the data integer or continuous? Any missing observations? Any concerns with accuracy or sampling methods?
Central Tendency: Where are the data values concentrated? What seem to be typical or middle data values?
Dispersion: How much variation is there in the data? How spread out are the data values? Are there unusual values?
Shape: Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?
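
Looking back at the Sampling Methods slides, the systematic-sampling rule (periodicity k approximately N/n, random starting point) and the stratum weights (each stratum's share of the population, N_j / N) can be sketched in a few lines of Python. This is only an illustration, not part of the original lecture: the function names `systematic_sample` and `stratum_weights` and the company labels are hypothetical.

```python
import random

def systematic_sample(items, n, seed=None):
    """Take every k-th item, with k roughly N/n, from a random start."""
    N = len(items)
    k = max(1, round(N / n))      # periodicity k is approximately N/n
    rng = random.Random(seed)
    start = rng.randrange(k)      # random starting point among the first k items
    return items[start::k][:n]    # keep at most n items

def stratum_weights(stratum_sizes):
    """Weight for stratum j is N_j / N, where N is the sum of the stratum sizes."""
    N = sum(stratum_sizes.values())
    return {name: size / N for name, size in stratum_sizes.items()}

# Hypothetical list of N = 501 companies, sample size n = 25 (so k is about 20).
companies = [f"company_{i}" for i in range(1, 502)]
sample = systematic_sample(companies, n=25, seed=1)

# Strata sizes expressed per 100 people: 55% males, 45% females.
weights = stratum_weights({"male": 55, "female": 45})

print(len(sample), sample[:3])
print(weights)
```

The seed is fixed only to make the sketch repeatable; in practice the starting point would be left random.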

Visual Description
Measurement
• Look at the data and visualize how it was collected and measured.
Sorting
• Sort the data and then summarize in a graphical display. Here are the sorted P/E ratios:
8 10 10 10 13 13 14 14 15 15 16 16 17 18 19 19 20 20 21 22 23 26 26 27 29 29 34 48 55 68
• A histogram graphically displays sorted data.
• Sorting allows you to observe central tendency, dispersion and shape as well as minimum, maximum and range.
• What else do you observe?

Dot Plots
• A dot plot is the simplest graphical display of n individual values of numerical data.
- Easy to understand
- Not good for large samples (e.g., > 5,000).
Steps in Making a Dot Plot
1. Make a scale that covers the data range
2. Mark the axes and label them
3. Plot each data value as a dot above the scale at its approximate location
If more than one data value lies at about the same axis location, the dots are piled up vertically.
• Range of data shows dispersion.
• Clustering shows central tendency.
• Dot plots do not tell much about the shape of the distribution.
• Can add annotations (text boxes) to call attention to specific features.

Dot Plots: Small Sample: Home Prices
• Consider the following median home prices for nine U.S. cities.

Metropolitan Area      Median Home Price (000)
Akron OH               119.6
Bergen-Passaic NJ      363.0
Bradenton FL           170.4
Colorado Springs CO    181.7
Hartford CT            198.5
Milwaukee WI           186.2
Raleigh-Durham NC      173.8
San Francisco CA       560.2
Topeka KS              100.7

• A dot plot is useful to realtors as they discuss patterns in home selling prices within their community.

Dot Plots: Comparing Groups
• A stacked dot plot compares two or more groups using a common X-axis scale.

Frequency Distributions and Histograms
Bins and Bin Limits
• A frequency distribution is a table formed by classifying n data values into k classes (bins).
• Bin limits define the values to be included in each bin.
Widths must all be the same.
• Frequencies are the number of observations within each bin.
• Express as relative frequencies (frequency divided by the total) or percentages (relative frequency times 100).

Frequency Distributions and Histograms
Constructing a Frequency Distribution
1. Sort data in ascending order (e.g., P/E ratios)
8 10 10 10 13 13 14 14 15 15 16 16 17 18 19 19 20 20 21 22 23 26 26 27 29 29 34 48 55 68
2. Choose the number of bins (k)
- k should be much smaller than n.
- Too many bins results in sparsely populated bins; too few and dissimilar data values are lumped together.
- Herbert Sturges proposes the following rule:

Sample Size (n)   Number of Bins (k)
16                5
32                6
64                7
128               8
256               9
512               10
1024              11

3. Set the bin limits:
$\text{Bin width} \approx \dfrac{x_{\max} - x_{\min}}{k}$
For example, for k = 7 bins, the approximate bin width is:
$\text{Bin width} \approx \dfrac{68 - 8}{7} = \dfrac{60}{7} \approx 8.57$
To obtain “nice” limits, we round the width to 10 and start the first bin at 0 to get bin limits: 0, 10, 20, 30, 40, 50, 60, 70
4. Put the data values in the appropriate bin
In general, the lower limit is included in the bin while the upper limit is excluded.
5. Create the table; you can include
Frequencies – counts for each bin
Relative frequencies – absolute frequency divided by total number of data values.
Cumulative frequencies – accumulated relative frequency values as bin limits increase.

Frequency Distributions and Histograms
What are the bin limits for the P/E ratio data?
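
One way to work through this question is to follow the five construction steps in code. The sketch below is an illustration rather than part of the original slides: it assumes k = 7 bins and the "nice" width of 10 chosen above, and the helper name `frequency_distribution` is hypothetical. Each bin includes its lower limit and excludes its upper limit.

```python
# Sorted P/E ratios from the slides (n = 30).
pe_ratios = [8, 10, 10, 10, 13, 13, 14, 14, 15, 15, 16, 16, 17, 18, 19, 19,
             20, 20, 21, 22, 23, 26, 26, 27, 29, 29, 34, 48, 55, 68]

def frequency_distribution(data, limits):
    """Tabulate bins [lower, upper): lower limit included, upper limit excluded."""
    n = len(data)
    rows = []
    cumulative = 0
    for lower, upper in zip(limits[:-1], limits[1:]):
        freq = sum(1 for x in data if lower <= x < upper)
        cumulative += freq
        # (lower, upper, frequency, relative frequency, cumulative relative frequency)
        rows.append((lower, upper, freq, freq / n, cumulative / n))
    return rows

k = 7                                                  # number of bins, as in the example
raw_width = (max(pe_ratios) - min(pe_ratios)) / k      # (68 - 8) / 7
nice_width = 10                                        # rounded to a "nice" value
limits = [i * nice_width for i in range(k + 1)]        # 0, 10, ..., 70

table = frequency_distribution(pe_ratios, limits)
for lower, upper, freq, rel, cum in table:
    print(f"{lower:>2} <= P/E < {upper:<2}  {freq:>2}  {rel:.4f}  {cum:.4f}")
```

Accumulating the counts before dividing avoids the small rounding drift that appears when already-rounded relative frequencies are added together.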

Bin Range              Frequency   Relative Frequency   Cumulative Relative Frequency
 0 ≤ P/E Ratio < 10     1          0.0333               0.0333
10 ≤ P/E Ratio < 20    15          0.5000               0.5333
20 ≤ P/E Ratio < 30    10          0.3333               0.8667
30 ≤ P/E Ratio < 40     1          0.0333               0.9000
40 ≤ P/E Ratio < 50     1          0.0333               0.9333
50 ≤ P/E Ratio < 60     1          0.0333               0.9667
60 ≤ P/E Ratio < 70     1          0.0333               1.0000

Frequency Distributions and Histograms
Histograms
• A histogram is a graphical representation of a frequency distribution. Y-axis shows frequency within each bin.
• A histogram is a bar chart. X-axis ticks show end points of each bin.
• Consider 3 histograms for the P/E ratio data with different bin widths. What do they tell you?

Frequency Distributions and Histograms
Modal Class
• A histogram bar that is higher than those on either side.
• Monomodal – a single modal class.
• Bimodal – two modal classes.
• Multimodal – more than two modal classes.
• Modal classes may be artifacts of the way bin limits are chosen.

Frequency Distributions and Histograms
Shape
• A histogram suggests the shape of the population.
• It is influenced by number of bins and bin limits.
• Skewness – indicated by the direction of the longer tail of the histogram.
- Left-skewed – (negatively skewed) a longer left tail.
- Right-skewed – (positively skewed) a longer right tail.
- Symmetric – both tail areas approximately the same.

Line Charts
Log Scales
• Arithmetic scale – distances on the Y-axis are proportional to the magnitude of the variable being displayed.
• Logarithmic scale – (ratio scale) equal distances represent equal ratios.
• Use a log scale for the vertical axis when data vary over a wide range, say, by more than an order of magnitude.
• This will reveal more detail for small data values.

Scatter Plots
• A scatter plot shows n pairs of observations as dots (or some other symbol) on an XY graph.
• A starting point for bivariate data analysis.
• Allows observations about the relationship between two variables.
• Answers the question: Is there an association between the two variables and if so, what kind of association?

Scatter Plots
Example: Birth Rates and Life Expectancy
• Consider the following data:

Nation          Birth Rate   Life Expectancy
Afghanistan     41.03        46.60
Canada          11.09        79.70
Finland         10.60        77.80
Guatemala       34.17        66.90
Japan           10.03        80.90
Mexico          22.36        72.00
Pakistan        30.40        62.70
Spain            9.29        79.10
United States   14.10        77.40

• Here is a scatter plot with life expectancy on the X-axis and birth rates on the Y-axis.
• Is there an association between the two variables?
• Is there a cause-and-effect relationship?

Scatter Plots
Example: Aircraft Fuel Consumption
• Consider five observations on flight time and fuel consumption for a twin-engine Piper Cheyenne aircraft.
• A causal relationship is assumed since a longer flight would consume more fuel.

Trip Leg   Flight Time (hours)   Fuel Used (pounds)
1          2.3                   145
2          4.2                   258
3          3.6                   219
4          4.7                   276
5          4.9                   283

• Here is the scatter plot with flight time on the X-axis and fuel use on the Y-axis.
• Is there an association between variables?

Scatter Plots
Degree of Association
[Figure: four scatter plots showing very strong association, strong association, moderate association, and little or no association]

Tables
• Tables are the simplest form of data display.
• A compound table is a table that contains time series data down the columns and variables across the rows.
Example: School Expenditures
• Arrangement of data is in rows and columns to enhance meaning.
• The data can be viewed by focusing on the time pattern (down the columns) or by comparing the variables (across the rows).
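
Returning briefly to the two scatter plot examples: the slides ask whether an association exists but leave the judgment to the eye. One common numerical companion, which these slides list only by name under bivariate tasks ("correlations"), is the Pearson correlation coefficient. The sketch below is an assumption-laden illustration rather than the lecture's method; the helper name `pearson_r` is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Aircraft fuel consumption example (five trip legs).
flight_hours = [2.3, 4.2, 3.6, 4.7, 4.9]
fuel_pounds = [145, 258, 219, 276, 283]

# Birth rate and life expectancy example (nine nations, in table order).
birth_rate = [41.03, 11.09, 10.60, 34.17, 10.03, 22.36, 30.40, 9.29, 14.10]
life_expectancy = [46.60, 79.70, 77.80, 66.90, 80.90, 72.00, 62.70, 79.10, 77.40]

r_fuel = pearson_r(flight_hours, fuel_pounds)
r_birth = pearson_r(birth_rate, life_expectancy)
print(f"flight time vs. fuel used: r = {r_fuel:.3f}")
print(f"birth rate vs. life expectancy: r = {r_birth:.3f}")
```

The fuel data give a coefficient close to +1 and the birth-rate data a strongly negative one. Neither number says anything about cause and effect, which is the point of Pitfall 5.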

Tables
Example: School Expenditures

                     Elementary and Secondary      Colleges and Universities
Year   All Schools   Total    Public   Private     Total    Public   Private
1960   142.2          99.6     93.0     6.6         42.6     23.3     19.3
1970   317.3         200.2    188.6    11.6        117.2     75.2     41.9
1980   373.6         232.7    216.4    16.2        140.9     93.4     47.4
1990   526.1         318.5    293.4    25.1        207.6    132.9     74.7
2000   691.9         418.2    387.8    30.3        273.8    168.8    105.0

Source: U.S. Census Bureau, Statistical Abstract of the United States: 2002, p. 133.
Note: All figures are in billions of constant 2000/2001 dollars.
• Units of measure are stated in the footnote.
• Note merged headings to group columns.

Pie Charts
An Oft-Abused Chart
• A pie chart can only convey a general idea of the data.
• Pie charts should be used to portray data which sum to a total (e.g., percent market shares).
• A pie chart should only have a few (i.e., 2 or 3) slices.
• Each slice should be labeled with data values or percents.
Common Errors in Pie Chart Usage
• Pie charts can only convey a general idea of the data values.
• Pie charts are ineffective when they have too many slices.
• Pie chart data must represent parts of a whole (e.g., percent market share).

Maps and Pictograms
Pictograms
• A visual display in which data values are replaced by pictures.

Deceptive Graphs
Error 1: Nonzero Origin
• A nonzero origin will exaggerate the trend.
[Figure: deceptive and correct versions of the same graph]

IST 203: Statistics for Social Sciences
Lecture 4A

Numerical Description
• Statistics are descriptive measures derived from a sample (n items).
• Parameters are descriptive measures derived from a population (N items).
• Three key characteristics of numerical data:

Characteristic: Interpretation
Central Tendency: Where are the data values concentrated? What seem to be typical or middle data values?
Dispersion: How much variation is there in the data? How spread out are the data values? Are there unusual values?
Shape: Are the data values distributed symmetrically? Skewed?
Sharply peaked? Flat? Bimodal?

Central Tendency
Six Measures of Central Tendency

Mean
- Formula: $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$
- Excel formula: =AVERAGE(Data)
- Pro: Familiar and uses all the sample information.
- Con: Influenced by extreme values.
Median
- Formula: Middle value in sorted array
- Excel formula: =MEDIAN(Data)
- Pro: Robust when extreme data values exist.
- Con: Ignores extremes and can be affected by gaps in data values.
Mode
- Formula: Most frequently occurring data value
- Excel formula: =MODE(Data)
- Pro: Useful for attribute data or discrete data with a small range.
- Con: May not be unique, and is not helpful for continuous data.
Midrange
- Formula: $\dfrac{x_{\min} + x_{\max}}{2}$
- Excel formula: =0.5*(MIN(Data)+MAX(Data))
- Pro: Easy to understand and calculate.
- Con: Influenced by extreme values and ignores most data values.
Geometric mean (G)
- Formula: $\sqrt[n]{x_1 x_2 \cdots x_n}$
- Excel formula: =GEOMEAN(Data)
- Pro: Useful for growth rates and mitigates high extremes.
- Con: Less familiar and requires positive data.
Trimmed mean
- Formula: Same as the mean except omit highest and lowest k% of data values (e.g., 5%)
- Excel formula: =TRIMMEAN(Data, %)
- Pro: Mitigates effects of extreme values.
- Con: Excludes some data values that could be relevant.

Central Tendency: Mean
• A familiar measure of central tendency.
Population formula: $\mu = \dfrac{1}{N}\sum_{i=1}^{N} x_i$
Sample formula: $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$
• In Excel, use function =AVERAGE(Data) where Data is an array of data values.
• For the sample of n = 37 car brands:
$\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i = \dfrac{87 + 93 + 98 + \dots + 159 + 164 + 173}{37} = \dfrac{4639}{37} = 125.38$

Central Tendency: Characteristics of the Mean
• Arithmetic mean is the most familiar average.
• Affected by every sample item.
• The balancing point or fulcrum for the data.
• Regardless of the shape of the distribution, the signed deviations of the data points from the mean always sum to zero.
$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
• Consider the following asymmetric distribution of quiz scores whose mean = 65.
$\sum_{i=1}^{n} (x_i - \bar{x})$ = (42 – 65) + (60 – 65) + (70 – 65) + (75 – 65) + (78 – 65)
= (-23) + (-5) + (5) + (10) + (13) = -28 + 28 = 0

Central Tendency: Median
• The median (M) is the 50th percentile or midpoint of the sorted sample data.
• M separates the upper and lower half of the sorted observations.
• If n is odd, the median is the middle observation in the data array.
• If n is even, the median is the average of the middle two observations in the data array.
• For n = 8, the median is between the fourth and fifth observations in the data array.
• For n = 9, the median is the fifth observation in the data array.
• Consider the following n = 6 data values: 11 12 15 17 21 32
• What is the median? For even n, $\text{Median} = \dfrac{x_{n/2} + x_{(n/2+1)}}{2}$
n/2 = 6/2 = 3 and n/2 + 1 = 6/2 + 1 = 4
M = (x3 + x4)/2 = (15 + 17)/2 = 16, which falls between 15 and 17 in the sorted data.
• Consider the following n = 7 data values: 12 23 23 25 27 34 41
• What is the median? For odd n, $\text{Median} = x_{(n+1)/2}$
(n+1)/2 = (7+1)/2 = 8/2 = 4
M = x4 = 25, the middle value of the sorted data.
• Use Excel’s function =MEDIAN(Data) where Data is an array of data values.
• For the 37 vehicle quality ratings (odd n) the position of the median is (n+1)/2 = (37+1)/2 = 19.
• So, the median is x19 = 121.
• When there are several duplicate data values, the median does not provide a clean “50-50” split in the data.

Central Tendency: Characteristics of the Median
• The median is insensitive to extreme data values.
• For example, consider the following quiz scores for 3 students:
Tom’s scores: 20, 40, 70, 75, 80 (Mean = 57, Median = 70, Total = 285)
Jake’s scores: 60, 65, 70, 90, 95 (Mean = 76, Median = 70, Total = 380)
Mary’s scores: 50, 65, 70, 75, 90 (Mean = 70, Median = 70, Total = 350)
• What does the median for each student tell you?
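
The mean and median examples above can be checked with Python's standard `statistics` module. This is a small sketch added for review purposes, not part of the original slides; the variable names are illustrative.

```python
from statistics import mean, median

# Deviations from the mean sum to zero (quiz scores with mean 65).
quiz = [42, 60, 70, 75, 78]
x_bar = mean(quiz)
deviations = [x - x_bar for x in quiz]
print(x_bar, deviations, sum(deviations))

# Median for even n (average of the two middle values) and odd n (middle value).
even_median = median([11, 12, 15, 17, 21, 32])
odd_median = median([12, 23, 23, 25, 27, 34, 41])
print(even_median, odd_median)

# Same median, different means: the median ignores how extreme the tails are.
scores = {
    "Tom": [20, 40, 70, 75, 80],
    "Jake": [60, 65, 70, 90, 95],
    "Mary": [50, 65, 70, 75, 90],
}
summary = {name: (mean(s), median(s), sum(s)) for name, s in scores.items()}
for name, (m, med, total) in summary.items():
    print(f"{name}: mean = {m}, median = {med}, total = {total}")
```

All three students share a median of 70 even though their totals differ by almost 100 points, which is what "insensitive to extreme data values" means in practice.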

Central Tendency: Mode
• The most frequently occurring data value.
• Similar to mean and median if data values occur often near the center of sorted data.
• May have multiple modes or no mode.
• For example, consider the following quiz scores for 4 students:
Lee’s scores: 60, 70, 70, 70, 80 (Mean = 70, Median = 70, Mode = 70)
Pat’s scores: 45, 45, 70, 90, 100 (Mean = 70, Median = 70, Mode = 45)
Sam’s scores: 50, 60, 70, 80, 90 (Mean = 70, Median = 70, Mode = none)
Xiao’s scores: 50, 50, 70, 90, 90 (Mean = 70, Median = 70, Modes = 50, 90)
• What does the mode for each student tell you?
• Easy to define, not easy to calculate in large samples.
• Use Excel’s function =MODE(Array)
- will return #N/A if there is no mode.
- will return first mode found if multimodal.
• May be far from the middle of the distribution and not at all typical.
• Generally isn’t useful for continuous data since data values rarely repeat.
• Best for attribute data or a discrete variable with a small range (e.g., Likert scale).

Central Tendency
Example: Price/Earnings Ratios and Mode
• Consider the following P/E ratios for a random sample of 68 Standard & Poor’s 500 stocks.
7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14 14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19 19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26 26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
• What is the mode?
• Excel’s descriptive statistics results are:

Mean      22.7206
Median    19
Mode      13
Range     84
Minimum   7
Maximum   91
Sum       1545
Count     68

• The mode 13 occurs 7 times, but what does the dot plot show?
• The dot plot shows local modes (a peak with valleys on either side) at 10, 13, 15, 19, 23, 26, 29.
• These multiple modes suggest that the mode is not a stable measure of central tendency.
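
The mode examples above can be reproduced with a short Python sketch. It is an illustration only: the helper name `modes` is hypothetical, and it follows the slides' convention that a data set in which no value repeats has no mode (Python's own `statistics.multimode` would instead return every value in that case).

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] when no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []                      # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == top)

scores = {
    "Lee": [60, 70, 70, 70, 80],
    "Pat": [45, 45, 70, 90, 100],
    "Sam": [50, 60, 70, 80, 90],
    "Xiao": [50, 50, 70, 90, 90],
}
student_modes = {name: modes(s) for name, s in scores.items()}
print(student_modes)

# P/E ratios for the sample of 68 stocks.
pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14,
      15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19, 19, 19,
      20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26, 26, 26, 27,
      29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]
pe_modes = modes(pe)
pe_mode_count = Counter(pe)[pe_modes[0]]
print(pe_modes, pe_mode_count, len(pe), sum(pe))
```

Returning a list rather than a single number keeps the bimodal and no-mode cases visible instead of silently picking one value.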
1-106 Central Tendency Example: Rose Bowl Winners' Points
• Points scored by the winning NCAA football team tend to have modes in multiples of 7 because each touchdown typically yields 7 points (6 points plus the extra point).
• Consider the dot plot of the points scored by the winning team in the first 87 Rose Bowl games.
• What is the mode?

1-107 Central Tendency Mode
• A bimodal distribution refers to the shape of the histogram rather than the mode of the raw data.
• Occurs when dissimilar populations are combined in one sample.

1-108 Central Tendency Symptoms of Skewness
Distribution's Shape: Histogram Appearance; Statistics
Skewed left (negative skewness): Long tail of histogram points left (a few low values but most data on right); Mean < Median
Symmetric: Tails of histogram are balanced (low/high values offset); Mean ≈ Median
Skewed right (positive skewness): Long tail of histogram points right (most data on left but a few high values); Mean > Median

1-109 Central Tendency Geometric Mean
• The geometric mean (G) is a multiplicative average.
$G = \sqrt[n]{x_1 x_2 \cdots x_n}$
• For the J. D. Power quality data (n = 37):
$G = \sqrt[37]{(87)(93)(98)\cdots(164)(173)} = \sqrt[37]{2.37667 \times 10^{77}} = 123.38$
• In Excel use =GEOMEAN(Array)
• The geometric mean tends to mitigate the effects of high outliers.

1-110 Central Tendency Midrange
• The midrange is the point halfway between the lowest and highest values of X.
• Easy to use but sensitive to extreme data values.
Midrange = $\frac{x_{min} + x_{max}}{2}$
• For the J. D. Power quality data (n = 37):
Midrange = $\frac{x_{min} + x_{max}}{2} = \frac{x_1 + x_{37}}{2} = \frac{87 + 173}{2} = 130$
• Here, the midrange (130) is higher than the mean (125.38) or median (121).

1-111 Central Tendency Trimmed Mean
• To calculate the trimmed mean, first remove the highest and lowest k percent of the observations.
• For example, for the n = 68 P/E ratios, we want a 5 percent trimmed mean (i.e., k = .05).
• To determine how many observations to trim, multiply k x n = 0.05 x 68 = 3.4, which is rounded down to 3 observations.
• So, we would remove the three smallest and three largest observations before averaging the remaining values.

1-112 Dispersion
• Variation is the "spread" of data points about the center of the distribution in a sample. Consider the following measures of dispersion:
Measures of Variation
Range
- Formula: $x_{max} - x_{min}$
- Excel: =MAX(Data)-MIN(Data)
- Pro: Easy to calculate.
- Con: Sensitive to extreme data values.
Variance ($s^2$)
- Formula: $\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
- Excel: =VAR(Data)
- Pro: Plays a key role in mathematical statistics.
- Con: Non-intuitive meaning.

1-113 Dispersion
Measures of Variation
Standard deviation (s)
- Formula: $\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
- Excel: =STDEV(Data)
- Pro: Most common measure. Uses same units as the raw data ($, £, ¥, etc.).
- Con: Non-intuitive meaning.
Coefficient of variation (CV)
- Formula: $100 \times \frac{s}{\bar{x}}$
- Excel: None
- Pro: Measures relative variation in percent so can compare data sets.
- Con: Requires nonnegative data.

1-114 Dispersion Range
• The difference between the largest and smallest observation.
Range = xmax – xmin
• For example, for the n = 68 P/E ratios, Range = 91 – 7 = 84

1-115 Dispersion Variance
• The population variance ($\sigma^2$) is defined as the sum of squared deviations around the mean divided by the population size.
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
• For the sample variance ($s^2$), we divide by n – 1 instead of n, otherwise $s^2$ would tend to underestimate the unknown population variance $\sigma^2$.
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

1-116 Dispersion Standard Deviation
• The square root of the variance.
• Explains how individual values in a data set vary from the mean.
• Units of measure are the same as X.
Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

1-117 Dispersion Calculating a Standard Deviation
• Consider the following five quiz scores for Stephanie.
1-118 Dispersion Calculating a Standard Deviation
• Now, calculate the sample standard deviation:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{2380}{5 - 1}} = \sqrt{595} = 24.39$
• Somewhat easier, the two-sum formula can also be used:
$s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}} = \sqrt{\frac{28300 - \frac{(360)^2}{5}}{5 - 1}} = \sqrt{\frac{28300 - 25920}{5 - 1}} = \sqrt{595} = 24.39$

1-119 IST 203: Statistics for Social Sciences Lecture 5A

1-120 Random Experiments Sample Space
• A random experiment is an observational process whose results cannot be known in advance.
• The set of all outcomes (S) is the sample space for the experiment.
• A sample space with a countable number of outcomes is discrete.

1-121 Random Experiments Sample Space
• For a single roll of a die, the sample space is: S = {1, 2, 3, 4, 5, 6}
• When two dice are rolled, the sample space is the following pairs:
S = {(1,1), (1,2), (1,3), (1,4), (1,5), (1,6),
(2,1), (2,2), (2,3), (2,4), (2,5), (2,6),
(3,1), (3,2), (3,3), (3,4), (3,5), (3,6),
(4,1), (4,2), (4,3), (4,4), (4,5), (4,6),
(5,1), (5,2), (5,3), (5,4), (5,5), (5,6),
(6,1), (6,2), (6,3), (6,4), (6,5), (6,6)}

1-122 Random Experiments Sample Space
• Consider the sample space to describe a randomly chosen United Airlines employee by 2 genders, 21 job classifications, 6 home bases (major hubs) and 4 education levels.
There are: 2 x 21 x 6 x 4 = 1008 possible outcomes
• It would be impractical to enumerate this sample space.

1-123 Random Experiments Sample Space
• If the outcome is a continuous measurement, the sample space can be described by a rule.
• For example, the sample space for the length of a randomly chosen cell phone call would be S = {all X such that X > 0} or written as S = {X | X > 0}
• The sample space to describe a randomly chosen student's GPA would be S = {X | 0.00 ≤ X ≤ 4.00}

1-124 Random Experiments Events
• An event is any subset of outcomes in the sample space.
• A simple event, or elementary event, is a single outcome.
• A discrete sample space S consists of all the simple events (Ei): S = {E1, E2, …, En}

1-125 Random Experiments Events
• Consider the random experiment of tossing a balanced coin. What is the sample space? S = {H, T}
• What are the chances of observing a H or T?
• These two elementary events are equally likely.
• When you buy a lottery ticket, the sample space S = {win, lose} has only two events.
• Are these two events equally likely to occur?

1-126 Random Experiments Events
• A compound event consists of two or more simple events.
• For example, in a sample space of 6 simple events, we could define the compound events
A = {E1, E2}
B = {E3, E5, E6}
• These can be displayed in a Venn diagram.

1-127 Random Experiments Events
• Many different compound events could be defined.
• Compound events can be described by a rule.
• For example, the compound event A = "rolling a seven" on a roll of two dice consists of 6 simple events:
A = {(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)}

1-128 Probability Definitions
• The probability of an event is a number that measures the relative likelihood that the event will occur.
• The probability of event A [denoted P(A)] must lie within the interval from 0 to 1: 0 ≤ P(A) ≤ 1
If P(A) = 0, then the event cannot occur.
If P(A) = 1, then the event is certain to occur.

1-129 Probability Definitions
• In a discrete sample space, the probabilities of all simple events must sum to unity:
P(S) = P(E1) + P(E2) + … + P(En) = 1
• For example, if the following proportions of purchases were made by each payment method:
credit card: 32%   P(credit card) = .32
debit card: 20%    P(debit card) = .20
cash: 35%          P(cash) = .35
check: 13%         P(check) = .13
Sum = 100%         Sum = 1.0

1-130 Probability Law of Large Numbers
• The law of large numbers is an important probability theorem. It states that as the number of trials increases, the observed proportion of an event approaches its true probability, which is why a large sample is preferred to a small one.
• Flip a coin 50 times. We would expect the proportion of heads to be near .50.
• However, in a small finite sample, any ratio can be obtained (e.g., 1/3, 7/13, 10/22, 28/50, etc.).
• A large n may be needed to get close to .50.
• Consider the results of 10, 20, 50, and 500 coin flips.

1-131 Probability
[Figure: proportion of heads observed in 10, 20, 50, and 500 coin flips]

1-132 Probability Practical Issues for Actuaries
• Actuarial science is a high-paying career that involves estimating empirical probabilities.
• For example, actuaries
- calculate payout rates on life insurance, pension plans, and health care plans
- create tables that guide IRA withdrawal rates for individuals from age 70 to 99

1-133 Rules of Probability Complement of an Event
• The complement of an event A is denoted by A′ and consists of everything in the sample space S except event A.

1-134 Rules of Probability Complement of an Event
• Since A and A′ together comprise the entire sample space, P(A) + P(A′) = 1
• The probability of A′ is found by P(A′) = 1 – P(A)
• For example, The Wall Street Journal reports that about 33% of all new small businesses fail within the first 2 years. The probability that a new small business will survive is:
P(survival) = 1 – P(failure) = 1 – .33 = .67 or 67%

1-135 Rules of Probability Odds of an Event
• The odds in favor of event A occurring are
Odds = $\frac{P(A)}{P(A')} = \frac{P(A)}{1 - P(A)}$
• Odds are used in sports and games of chance.
• For a pair of fair dice, P(7) = 6/36 (or 1/6). What are the odds in favor of rolling a 7?
Odds = $\frac{P(\text{rolling seven})}{1 - P(\text{rolling seven})} = \frac{1/6}{1 - 1/6} = \frac{1/6}{5/6} = \frac{1}{5}$

1-136 Rules of Probability Odds of an Event
• On the average, for every time a 7 is rolled, there will be 5 times that it is not rolled.
• In other words, the odds are 1 to 5 in favor of rolling a 7.
• The odds are 5 to 1 against rolling a 7.
• In horse racing and other sports, odds are usually quoted against winning.
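The complement rule and the odds formula above can be traced with exact fractions. The following is a minimal sketch in Python, not part of the lecture; the helper name `odds_in_favor` is hypothetical, and P(7) is obtained by enumerating the 36 outcomes for two dice.

```python
from fractions import Fraction
from itertools import product

def odds_in_favor(p):
    """Odds in favor of an event with probability p: P(A) / (1 - P(A))."""
    return p / (1 - p)

# Enumerate the 36 equally likely outcomes of rolling two dice
rolls = list(product(range(1, 7), repeat=2))
sevens = [r for r in rolls if sum(r) == 7]
p_seven = Fraction(len(sevens), len(rolls))

odds_seven = odds_in_favor(p_seven)       # 1 to 5 in favor
odds_against = odds_in_favor(1 - p_seven) # 5 to 1 against

# Complement rule: survival probability when 33% of businesses fail
p_survive = 1 - Fraction(33, 100)

print(p_seven, odds_seven, odds_against, float(p_survive))
```

Using `Fraction` keeps the results as 1/6 and 1/5 instead of rounded decimals, which matches how odds are quoted on the slides.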
1-137 Rules of Probability Odds of an Event
• If the odds against event A are quoted as b to a, then the implied probability of event A is:
P(A) = $\frac{a}{a + b}$
• For example, if a race horse has 4 to 1 odds against winning, then P(win) is
P(win) = $\frac{a}{a + b} = \frac{1}{4 + 1} = \frac{1}{5} = 0.20$ or 20%

1-138 Rules of Probability Union of Two Events
• The union of two events consists of all outcomes in the sample space S that are contained either in event A or in event B or both (denoted A ∪ B or "A or B").
• ∪ may be read as "or" since one or the other or both events may occur.

1-139 Rules of Probability Union of Two Events
• For example, randomly choose a card from a deck of 52 playing cards.
• If Q is the event that we draw a queen and R is the event that we draw a red card, what is Q ∪ R?
• It is the possibility of drawing either a queen (4 ways) or a red card (26 ways) or both (2 ways).

1-140 Rules of Probability Intersection of Two Events
• The intersection of two events A and B (denoted A ∩ B or "A and B") is the event consisting of all outcomes in the sample space S that are contained in both event A and event B.
• ∩ may be read as "and" since both events occur. This is a joint probability.

1-141 Rules of Probability Intersection of Two Events
• For example, randomly choose a card from a deck of 52 playing cards.
• If Q is the event that we draw a queen and R is the event that we draw a red card, what is Q ∩ R?
• It is the possibility of getting both a queen and a red card (2 ways).

1-142 Rules of Probability General Law of Addition
• The general law of addition states that the probability of the union of two events A and B is:
P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
• When you add P(A) and P(B) together, you count P(A and B) twice.
• So, you have to subtract P(A ∩ B) to avoid overstating the probability.
1-143 Rules of Probability General Law of Addition
• For the card example:
P(Q) = 4/52 (4 queens in a deck)
P(R) = 26/52 (26 red cards in a deck)
P(Q ∩ R) = 2/52 (2 red queens in a deck)
P(Q ∪ R) = P(Q) + P(R) – P(Q ∩ R) = 4/52 + 26/52 – 2/52 = 28/52 = .5385 or 53.85%
[Venn diagram: Q = 4/52, R = 26/52, Q and R = 2/52]

1-144 Rules of Probability Mutually Exclusive Events
• Events A and B are mutually exclusive (or disjoint) if their intersection is the null set (∅) that contains no elements.
If A ∩ B = ∅, then P(A ∩ B) = 0
Special Law of Addition
• In the case of mutually exclusive events, the addition law reduces to: P(A ∪ B) = P(A) + P(B)

1-145 Rules of Probability Forced Dichotomy
• Polytomous events can be made dichotomous (binary) by defining the second category as everything not in the first category.
Polytomous Events: Binary (Dichotomous) Variable
- Vehicle type (SUV, sedan, truck, motorcycle): X = 1 if SUV, 0 otherwise
- A randomly chosen NBA player's height: X = 1 if height exceeds 7 feet, 0 otherwise
- Tax return type (single, married filing jointly, married filing separately, head of household, qualifying widower): X = 1 if single, 0 otherwise

1-146 Rules of Probability Conditional Probability
• The probability of event A given that event B has occurred.
• Denoted P(A | B). The vertical line " | " is read as "given."
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$ for P(B) > 0 and undefined otherwise

1-147 Rules of Probability Conditional Probability
• Consider the logic of this formula by looking at the Venn diagram.
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
• The sample space is restricted to B, an event that has occurred.
• A ∩ B is the part of B that is also in A.
• The ratio of the relative size of A ∩ B to B is P(A | B).
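The addition law and the conditional probability formula can both be verified by enumerating a 52-card deck. This is an illustrative sketch in Python under the usual assumption of equally likely cards; the helper `prob` and the P(Q | R) calculation are additions for illustration and do not appear on the slides.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suit_color = {"hearts": "red", "diamonds": "red", "clubs": "black", "spades": "black"}
deck = list(product(ranks, suit_color))  # 52 (rank, suit) pairs

Q = {card for card in deck if card[0] == "Q"}                 # queens
R = {card for card in deck if suit_color[card[1]] == "red"}   # red cards

def prob(event):
    """Probability of an event when every card is equally likely."""
    return Fraction(len(event), len(deck))

# General law of addition: P(Q or R) = P(Q) + P(R) - P(Q and R)
p_union = prob(Q | R)
p_by_formula = prob(Q) + prob(R) - prob(Q & R)

# Conditional probability: P(Q | R) = P(Q and R) / P(R)
p_q_given_r = prob(Q & R) / prob(R)

print(p_union, round(float(p_union), 4), p_q_given_r)
```

Counting the union directly and applying the formula give the same 28/52, which is the point of subtracting the two red queens that would otherwise be counted twice.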
1-148 Rules of Probability Example: High School Dropouts
• First define
U = the event that the person is unemployed
D = the event that the person is a high school dropout
P(U) = .1350
P(D) = .2905
P(U ∩ D) = .0532
$P(U \mid D) = \frac{P(U \cap D)}{P(D)} = \frac{.0532}{.2905} = .1831$ or 18.31%
• P(U | D) = .1831 > P(U) = .1350
• Therefore, being a high school dropout is related to being unemployed.

1-149 IST 203: Statistics for Social Sciences Lecture 6

1-150 Probability Models Probability Models
• A random (or stochastic) process is a repeatable random experiment.
• For example, each call arriving at the L.L. Bean order center is a random experiment in which the variable of interest is the amount of the order.
• Probability can be used to analyze random (or stochastic) processes and to understand business processes.

1-151 Discrete Distributions Random Variables
• A random variable is a function or rule that assigns a numerical value to each outcome in the sample space of a random experiment.
• Nomenclature:
- Capital letters are used to represent random variables (e.g., X, Y).
- Lower case letters are used to represent values of the random variable (e.g., x, y).
• A discrete random variable has a countable number of distinct values.

1-152 Discrete Distributions Probability Distributions
• A discrete probability distribution assigns a probability to each value of a discrete random variable X.
• To be a valid probability distribution, each probability must be between 0 and 1:
$0 \le P(x_i) \le 1$
• and the sum of all the probabilities for the values of X must be equal to unity:
$\sum_{i=1}^{n} P(x_i) = 1$

1-153 Discrete Distributions Example: Coin Flips
When you flip a coin three times, the sample space has eight equally likely simple events.
They are:
1st Toss   H H H H T T T T
2nd Toss   H H T T H H T T
3rd Toss   H T H T H T H T

1-154 Discrete Distributions Example: Coin Flips
If X is the number of heads, then X is a random variable whose probability distribution is as follows:
Possible Events    x    P(x)
TTT                0    1/8
HTT, THT, TTH      1    3/8
HHT, HTH, THH      2    3/8
HHH                3    1/8
Total                   1

1-155 Discrete Distributions Expected Value
• The expected value E(X) of a discrete random variable is the sum of all X-values weighted by their respective probabilities.
• If there are n distinct values of X,
$E(X) = \mu = \sum_{i=1}^{n} x_i P(x_i)$
• The E(X) is a measure of central tendency.

1-156 Discrete Distributions Example: Service Calls
The probability distribution of emergency service calls on Sunday by Ace Appliance Repair is:
x        P(x)
0        0.05
1        0.10
2        0.30
3        0.25
4        0.20
5        0.10
Total    1.00
What is the average or expected number of service calls?

1-157 Discrete Distributions Example: Service Calls
First calculate xiP(xi):
x        P(x)    xP(x)
0        0.05    0.00
1        0.10    0.10
2        0.30    0.60
3        0.25    0.75
4        0.20    0.80
5        0.10    0.50
Total    1.00    2.75
The sum of the xP(x) column is the expected value or mean of the discrete distribution.
$E(X) = \mu = \sum_{i=1}^{5} x_i P(x_i) = 2.75$

1-158 Discrete Distributions Example: Service Calls
This particular probability distribution is not symmetric around the mean μ = 2.75.
[Figure: bar chart of probability versus number of service calls (0 to 5), with the mean μ = 2.75 marked]
However, the mean is still the balancing point, or fulcrum. Because E(X) is an average, it does not have to be an observable point.

1-159 Discrete Distributions Application: Life Insurance
• Expected value is the basis of life insurance.
• For example, what is the probability that a 30-year-old white female will die within the next year?
• Based on mortality statistics, the probability is .00059 and the probability of living another year is 1 - .00059 = .99941.
• What premium should a life insurance company charge to break even on a $500,000 1-year term policy?
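The expected-value calculations above follow one pattern: multiply each value by its probability and add. The sketch below illustrates it in Python for the coin-flip, service-call, and life-insurance examples. The break-even premium is taken here to be the expected payout alone, which is a simplifying assumption (no administrative costs, interest, or profit margin); variable names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import isclose

# X = number of heads in three flips of a fair coin (8 equally likely outcomes)
outcomes = list(product("HT", repeat=3))
head_counts = Counter(outcome.count("H") for outcome in outcomes)
heads_dist = {x: Fraction(c, len(outcomes)) for x, c in sorted(head_counts.items())}

# Sunday service calls for Ace Appliance Repair: value -> probability
service_calls = {0: 0.05, 1: 0.10, 2: 0.30, 3: 0.25, 4: 0.20, 5: 0.10}
expected_calls = sum(x * p for x, p in service_calls.items())

# Life insurance: expected payout on a $500,000 one-year term policy
p_death = 0.00059
expected_payout = 500_000 * p_death + 0 * (1 - p_death)

print(heads_dist)
print(round(expected_calls, 2), round(expected_payout, 2))
```

Under that assumption the insurer breaks even when the premium equals the expected payout, since over many similar policies the average claim per policy approaches E(X).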
1-160 Discrete Distributions Application: Raffle Tickets
• Now, calculate the E(X):
E(X) = (value if you win)P(win) + (value if you lose)P(lose)
= $(55{,}000)\frac{1}{29{,}346} + (0)\frac{29{,}345}{29{,}346}$
= (55,000)(.000034076) + (0)(.999965924) = $1.87
• The raffle ticket is actually worth $1.87. Is it worth spending $2.00 for it?

1-161 Discrete Distributions Variance and Standard Deviation
• If there are n distinct values of X, then the variance of a discrete random variable is:
$V(X) = \sigma^2 = \sum_{i=1}^{n} [x_i - \mu]^2 P(x_i)$
• The variance is a weighted average of the dispersion about the mean and is denoted either as $\sigma^2$ or V(X).
• The standard deviation is the square root of the variance and is denoted σ.
$\sigma = \sqrt{\sigma^2} = \sqrt{V(X)}$

1-162 Discrete Distributions Example: Bed and Breakfast
The Bay Street Inn is a 7-room bed-and-breakfast in Santa Theresa, Ca. The probability distribution of room rentals during February is:
x        P(x)
0        0.05
1        0.05
2        0.06
3        0.10
4        0.13
5        0.20
6        0.15
7        0.26
Total    1.00

1-163 Discrete Distributions Example: Bed and Breakfast
First find the expected value:
$E(X) = \mu = \sum_{i=1}^{7} x_i P(x_i) = 4.71$ rooms
x        P(x)    xP(x)
0        0.05    0.00
1        0.05    0.05
2        0.06    0.12
3        0.10    0.30
4        0.13    0.52
5        0.20    1.00
6        0.15    0.90
7        0.26    1.82
Total    1.00    μ = 4.71

1-164 Discrete Distributions Example: Bed and Breakfast
The E(X) is then used to find the variance:
$V(X) = \sigma^2 = \sum_{i=1}^{7} [x_i - \mu]^2 P(x_i) = 4.2259$ rooms²
x        P(x)    xP(x)    [x − μ]²    [x − μ]² P(x)
0        0.05    0.00     22.1841     1.109205
1        0.05    0.05     13.7641     0.688205
2        0.06    0.12     7.3441      0.440646
3        0.10    0.30     2.9241      0.292410
4        0.13    0.52     0.5041      0.065533
5        0.20    1.00     0.0841      0.016820
6        0.15    0.90     1.6641      0.249615
7        0.26    1.82     5.2441      1.363466
Total    1.00    μ = 4.71             σ² = 4.225900
The standard deviation is: $\sigma = \sqrt{4.2259} = 2.0557$ rooms

1-165 Discrete Distributions What is a PDF or CDF?
• A probability distribution function (PDF) is a mathematical function that shows the probability of each X-value.
• A cumulative distribution function (CDF) is a mathematical function that shows the cumulative sum of probabilities, adding from the smallest to the largest X-value, gradually approaching unity.

1-166 Discrete Distributions What is a PDF or CDF?
Consider the following illustrative histograms:
[Figures: illustrative PDF (probability distribution function) and CDF (cumulative distribution function); x-axis: value of X from 0 to 14; y-axis: probability]
The equations for these functions depend on the parameter(s) of the distribution.

1-167 Uniform Distribution Characteristics of the Uniform Distribution
• The uniform distribution describes a random variable with a finite number of integer values from a to b (the only two parameters).
• Each value of the random variable is equally likely to occur.
• Consider the following summary of the uniform distribution:

1-168 Uniform Distribution
Parameters: a = lower limit, b = upper limit
PDF: $P(x) = \frac{1}{b - a + 1}$
Range: a ≤ x ≤ b
Mean: $\frac{a + b}{2}$
Std. Dev.: $\sqrt{\frac{(b - a + 1)^2 - 1}{12}}$
Random data generation in Excel: =a+INT((b-a+1)*RAND())
Comments: Used as a benchmark, to generate random integers, or to create other distributions.

1-169 Uniform Distribution Example: Rolling a Die
• The number of dots on the roll of a die forms a uniform random variable with six equally likely integer values: 1, 2, 3, 4, 5, 6
• What is the probability of rolling any of these?
[Figures: PDF for one die and CDF for one die; x-axis: number of dots showing on the die; y-axis: probability]

1-170 Uniform Distribution Example: Rolling a Die
• The PDF for all x is:
$P(x) = \frac{1}{b - a + 1} = \frac{1}{6 - 1 + 1} = \frac{1}{6}$
• Calculate the mean as:
$\mu = \frac{a + b}{2} = \frac{1 + 6}{2} = 3.5$
• Calculate the standard deviation as:
$\sigma = \sqrt{\frac{(b - a + 1)^2 - 1}{12}} = \sqrt{\frac{(6 - 1 + 1)^2 - 1}{12}} = 1.708$

1-171 Uniform Distribution Application: Pumping Gas
On a gas pump, the last two digits (pennies) displayed will be a uniform random integer (assuming the pump stops automatically).
[Figures: PDF and CDF; x-axis: pennies digits on pump, 0 to 99; y-axis: probability]
The parameters are: a = 00 and b = 99

1-172 Uniform Distribution Application: Pumping Gas
• The PDF for all x is:
$P(x) = \frac{1}{b - a + 1} = \frac{1}{99 - 0 + 1} = \frac{1}{100} = .010$
• Calculate the mean as:
$\mu = \frac{a + b}{2} = \frac{0 + 99}{2} = 49.5$
• Calculate the standard deviation as:
$\sigma = \sqrt{\frac{(b - a + 1)^2 - 1}{12}} = \sqrt{\frac{(99 - 0 + 1)^2 - 1}{12}} = 28.87$

1-173 Bernoulli Distribution Bernoulli Experiments
• A random experiment with only 2 outcomes is a Bernoulli experiment.
• One outcome is arbitrarily labeled a "success" (denoted X = 1) and the other a "failure" (denoted X = 0). p is the P(success), 1 – p is the P(failure).
• "Success" is usually defined as the less likely outcome so that p < .5 for convenience.
• Note that P(0) + P(1) = (1 – p) + p = 1 and 0 ≤ p ≤ 1.
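The discrete uniform formulas used for the die and the gas pump can be wrapped in three small functions. This is a sketch in Python rather than the lecture's Excel; the function names are hypothetical, and the last lines cross-check the closed-form standard deviation against a direct calculation over the six die faces.

```python
from math import isclose, sqrt
from statistics import pstdev

def uniform_pmf(a, b):
    """P(x) for any integer x with a <= x <= b."""
    return 1 / (b - a + 1)

def uniform_mean(a, b):
    return (a + b) / 2

def uniform_sd(a, b):
    """Standard deviation of a discrete uniform on the integers a..b."""
    return sqrt(((b - a + 1) ** 2 - 1) / 12)

# One die: a = 1, b = 6
die = (uniform_pmf(1, 6), uniform_mean(1, 6), round(uniform_sd(1, 6), 3))

# Pennies digits on a gas pump: a = 0, b = 99
pump = (uniform_pmf(0, 99), uniform_mean(0, 99), round(uniform_sd(0, 99), 2))

# Cross-check: population standard deviation of the six faces
direct_sd = pstdev(range(1, 7))

print(die, pump, round(direct_sd, 3))
```

The population form `pstdev` is the right comparison because the six faces are the entire distribution, not a sample from it.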
1-174 Bernoulli Distribution Bernoulli Experiments
Consider the following Bernoulli experiments:
Bernoulli Experiment: Possible Outcomes; Probability of "Success"
- Flip a coin: 1 = heads, 0 = tails; p = .50
- Inspect a jet turbine blade: 1 = crack found, 0 = no crack found; p = .001
- Purchase a tank of gas: 1 = pay by credit card, 0 = do not pay by credit card; p = .78
- Do a mammogram test: 1 = positive test, 0 = negative test; p = .0004

1-175 Bernoulli Distribution Bernoulli Experiments
• The expected value (mean) of a Bernoulli experiment is calculated as:
$E(X) = \sum_{i=1}^{2} x_i P(x_i) = (0)(1 - p) + (1)(p) = p$
• The variance of a Bernoulli experiment is calculated as:
$V(X) = \sum_{i=1}^{2} [x_i - E(X)]^2 P(x_i) = (0 - p)^2 (1 - p) + (1 - p)^2 (p) = p(1 - p)$
• The mean and variance are useful in developing the next model.

1-176 Binomial Distribution Characteristics of the Binomial Distribution
• The binomial distribution arises when a Bernoulli experiment is repeated n times.
• Each Bernoulli trial is independent so the probability of success p remains constant on each trial.
• In a binomial experiment, we are interested in X = number of successes in n trials. So, X = x1 + x2 + ... + xn
• The probability of a particular number of successes P(X) is determined by parameters n and p.

1-177 Binomial Distribution Characteristics of the Binomial Distribution
• The mean of a binomial distribution is found by adding the means for each of the n independent Bernoulli events:
p + p + … + p = np
• The variance of a binomial distribution is found by adding the variances for each of the n independent Bernoulli events:
p(1-p) + p(1-p) + … + p(1-p) = np(1-p)
• The standard deviation is $\sqrt{np(1 - p)}$

1-178 Binomial Distribution
Parameters: n = number of trials, p = probability of success
PDF: $P(x) = \frac{n!}{x!(n - x)!} p^x (1 - p)^{n - x}$
Excel function: =BINOMDIST(x,n,p,0)
Range: X = 0, 1, 2, . . ., n
Mean: np
Std. Dev.:
$\sqrt{np(1 - p)}$
Random data generation in Excel: Sum n values of =1+INT(2*RAND()) or use Excel's Tools | Data Analysis
Comments: Skewed right if p < .50, skewed left if p > .50, and symmetric if p = .50.

1-179 Binomial Distribution Example: Quick Oil Change Shop
• It is important to quick oil change shops to ensure that a car's service time is not considered "late" by the customer.
• Service times are defined as either late or not late.
• X is the number of cars that are late out of the total number of cars serviced.
• Assumptions:
- cars are independent of each other
- probability of a late car is constant

1-180 Binomial Distribution Example: Quick Oil Change Shop
• What is the probability that exactly 2 of the next n = 10 cars serviced are late, P(X = 2)?
• P(car is late) = p = .10
• P(car not late) = 1 - p = .90
$P(x) = \frac{n!}{x!(n - x)!} p^x (1 - p)^{n - x}$
$P(X = 2) = \frac{10!}{2!(10 - 2)!} (.10)^2 (1 - .10)^{10 - 2} = .1937$

1-181 Binomial Distribution Application: Uninsured Patients
• On average, 20% of the emergency room patients at Greenwood General Hospital lack health insurance.
• In a random sample of 4 patients, what is the probability that at least 2 will be uninsured?
• X = number of uninsured patients ("success")
• P(uninsured) = p = 20% or .20
• P(insured) = 1 – p = 1 – .20 = .80
• n = 4 patients
• The range is X = 0, 1, 2, 3, 4 patients.

1-182 Binomial Distribution Application: Uninsured Patients
• What are the mean and standard deviation of this binomial distribution?
Mean = μ = np = (4)(.20) = 0.8 patients
Standard deviation = σ = $\sqrt{np(1 - p)} = \sqrt{4(.20)(1 - .20)}$ = 0.8 patients

1-183 IST 203: Statistics for Social Sciences Lecture 7

1-184 Continuous Variables Events as Intervals
• Discrete Variable – each value of X has its own probability P(X).
• Continuous Variable – events are intervals and probabilities are areas underneath smooth curves. A single point has no probability.
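Both binomial examples from the preceding slides can be reproduced directly from the binomial formula. The sketch below uses Python's `math.comb` for n!/(x!(n − x)!); the function name `binom_pmf` is hypothetical, and the "at least 2 uninsured" answer is computed here with the complement rule because the slides pose that question without showing the arithmetic.

```python
from math import comb, isclose, sqrt

def binom_pmf(x, n, p):
    """P(X = x) for a binomial with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Quick oil change: exactly 2 of the next 10 cars are late, p = .10
p_two_late = binom_pmf(2, 10, 0.10)

# Uninsured patients: n = 4, p = .20
n, p = 4, 0.20
# P(X >= 2) = 1 - P(0) - P(1)
p_at_least_two = 1 - binom_pmf(0, n, p) - binom_pmf(1, n, p)
mean_patients = n * p
sd_patients = sqrt(n * p * (1 - p))

print(round(p_two_late, 4), round(p_at_least_two, 4))
print(round(mean_patients, 2), round(sd_patients, 2))
```

In Excel the first result corresponds to =BINOMDIST(2,10,0.1,0), the function named in the binomial summary.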
1-185 Describing a Continuous Distribution PDFs and CDFs
Continuous PDFs:
• Denoted f(x)
• Must be nonnegative
• Total area under curve = 1
• Mean, variance and shape depend on the PDF parameters
• Reveals the shape of the distribution
[Figure: Normal PDF]

1-186 Uniform Continuous Distribution Characteristics of the Uniform Distribution

1-187 Uniform Continuous Distribution Example: Anesthesia Effectiveness
• An oral surgeon injects a painkiller prior to extracting a tooth.
• Given the varying characteristics of patients, the dentist views the time for anesthesia effectiveness as a uniform random variable that takes between 15 minutes and 30 minutes.
• X is U(15, 30), so a = 15 and b = 30. Find the mean and standard deviation.

1-188 Uniform Continuous Distribution Example: Anesthesia Effectiveness
$\mu = \frac{a + b}{2} = \frac{15 + 30}{2} = 22.5$ minutes
$\sigma = \sqrt{\frac{(b - a)^2}{12}} = \sqrt{\frac{(30 - 15)^2}{12}} = 4.33$ minutes
Find the probability that the anesthetic takes between 20 and 25 minutes.
P(c < X < d) = (d – c)/(b – a)
P(20 < X < 25) = (25 – 20)/(30 – 15) = 5/15 = 0.3333 or 33.33%

1-189 Normal Distribution What is Normal?
A normal random variable should:
• Be measured on a continuous scale.
• Possess clear central tendency.
• Have only one peak (unimodal).
• Exhibit tapering tails.
• Be symmetric about the mean (equal tails).

1-190 Standard Normal Distribution Characteristics of the Standard Normal
• Since for every value of μ and σ there is a different normal distribution, we transform a normal random variable to a standard normal distribution with μ = 0 and σ = 1 using the formula:
$z = \frac{x - \mu}{\sigma}$
• Denoted N(0,1)

1-191 Standard Normal Distribution Finding Areas by using Standardized Variables
• Suppose John took an economics exam and scored 86 points. The class mean was 75 with a standard deviation of 7. What percentile is John in (i.e., find P(X < 86))?
$z_{John} = \frac{x - \mu}{\sigma} = \frac{86 - 75}{7} = \frac{11}{7} = 1.57$
• So John's score is 1.57 standard deviations above the mean.
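The continuous uniform and standard normal calculations above can be sketched in a few lines. This is an illustration in Python using `statistics.NormalDist` from the standard library; it assumes exam scores are normally distributed with the stated mean and standard deviation, and the variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Anesthesia effectiveness: X is U(15, 30)
a, b = 15, 30
mu_uniform = (a + b) / 2
sigma_uniform = sqrt((b - a) ** 2 / 12)
# P(c < X < d) = (d - c) / (b - a)
p_20_to_25 = (25 - 20) / (b - a)

# John's exam score: x = 86, class mean 75, standard deviation 7
z_john = (86 - 75) / 7
# Area to the left of 86 under N(75, 7), assuming normal scores
p_below_john = NormalDist(mu=75, sigma=7).cdf(86)

print(mu_uniform, round(sigma_uniform, 2), round(p_20_to_25, 4))
print(round(z_john, 2), round(p_below_john, 3))
```

The area to the left of z ≈ 1.57 comes out at roughly 0.94, so under the normality assumption John's score sits at about the 94th percentile.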