Week 5
Dr. Jenne Meyer

• Discuss syllabus
• Ground rules
• Introductions
• Descriptive Statistics and Probability Distributions

• Research and Sampling Designs
• Research Methods and Business Decisions
• Data Collection
• Data Analysis
• Correlation, Linear Regression, and Multiple Regression Analysis

Research is…
• systematic, controlled, empirical, and critical investigation of hypothetical propositions about the presumed relations among phenomena. (University of Phoenix (Ed.). (2001). Statistics and research methods for managerial decisions [University of Phoenix Custom Edition e-text]. Cincinnati, OH).
• systematic, controlled, empirical, and critical investigation of phenomena of interest to decision makers. (book definition)
• a systematic process of collecting and analyzing data or information in order to increase our understanding of the phenomena about which we are concerned or interested. (Leedy and Ormrod, 2001)

Research is… a systematic, controlled, empirical, and critical investigation of hypothetical propositions about presumed relations among phenomena.
• What happened
• How it happened
• Why it happened (give meaning)

• Business research is the primary means of gathering data for decision making
• Understanding of the research process can lead to better decisions

• Research helps make better decisions
• Understanding the process can help you ask the right questions

• Identify and effectively solve minor problems in the work setting.
• Know how to discriminate good from bad research.
• Appreciate and be constantly aware of the multiple influences and multiple effects of factors impinging on a situation.
• Take calculated risks in decision making, knowing full well the probabilities associated with the different possible outcomes.
• Prevent possible vested interests from exercising their influence in a situation.
• Relate to hired researchers and consultants more effectively.
• Combine experience with scientific knowledge while making decisions.
Symbols
• Σ (Uppercase Sigma) = Summation
• μ (Mu) = Population mean
• σ (Lowercase Sigma) = Standard deviation
• π (Pi) = Probability of success in a binomial trial
• ε (Epsilon) = Maximum allowable error
• χ² (Chi Square) = Nonparametric hypothesis test
• ! = Factorial
• H0 = Null hypothesis
• H1 = Alternate hypothesis

A single value that summarizes a set of data. It locates the center of the values.
• Arithmetic mean
• Weighted mean
• Median
• Mode
• Geometric mean

If x1, x2, ..., xn denote a sample of n observations, then the mean of the sample is called "x-bar" and is denoted by:
$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
The mean of a population is denoted by the Greek letter μ:
$\mu = \frac{\sum X}{N}$

• Every set of interval data has a mean
• All values are included
• The mean is unique: there is only one
• Useful to compare two or more populations
• The sum of the deviations of each value from the mean will always be zero

Disadvantage of arithmetic mean
• Because the mean is sensitive to extreme values, it may not always be a good representation of the data.
• Can't be used for open-ended (range) data

Example of "skewed" Mean:
Consider the annual incomes of five families in a neighborhood: $12K $12K $12K $13K $100K
The mean income in this case: $29.8K
In this case, the mean is pulled ("positively skewed") toward the high-value outlier, and it does not appear to best represent the income of this neighborhood.
What we need in this case is a measurement that is less sensitive to large values. We can consider using the median...
Median: the midpoint of the values (exactly half are below, half are above)
• If the number of observations is odd, the median is the "middle observation"
• If the number of observations is even, the median is the mean or average of the two middle observations
• Used when the mean is not representative due to high value outliers

• Unique number
• Not affected by extremely large or small values
• Can be used with open-ended range values
• Can be used for several measurement types

Using Previous Examples:
Five Incomes: $12K $12K $12K $13K $100K
Median is: $12K (better representation of neighborhood)
(# of observations is odd, take the middle value = $12K)

Mode: the value that appears most frequently
▪ Five Incomes Example: $12K $12K $12K $13K $100K
▪ Mode is: $12K

• Can be used for any measurement type
• Not affected by extremely large or small values
• Sometimes it doesn't exist
• Sometimes it represents more than one value

Consider Previous Example: Neighborhood Income
• Mean income: $29.8K
• Median income: $12K
• Modal income: $12K
If you were trying to promote that this is an affluent neighborhood, you might prefer to report the mean income.
If you were trying to argue against a tax increase, you might argue that income is too low to afford a tax increase and report the median and/or the mode.
Note: these are 3 different measures, each valid and informative in its own way. Like all statistics, they have the potential to inform or dis-inform!
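The three measures for the neighborhood income example can be checked with a short Python sketch. It uses only the standard `statistics` module; the variable names are illustrative and not part of the course material.

```python
from statistics import mean, median, multimode

# Annual incomes of the five families, in thousands of dollars (from the example)
incomes = [12, 12, 12, 13, 100]

mean_income = mean(incomes)      # sum of the values divided by their count
median_income = median(incomes)  # middle value of the sorted list (odd count)
modes = multimode(incomes)       # list of the most frequent value(s)

print(f"Mean:   ${mean_income}K")
print(f"Median: ${median_income}K")
print(f"Mode:   {modes}")
```

`multimode` returns a list because, as noted above, a data set can have more than one mode.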
• Range
• Mean deviation
• Variance
• Standard deviation

• Range = highest value – lowest value
• Mean deviation: the arithmetic mean of the absolute values of the deviations from the mean (on average, the values deviate from the mean by this amount)
• Variance: the arithmetic mean of the squared deviations from the mean; used to compare the dispersion of two or more sets of data
• Standard deviation: the square root of the variance; represents the spread or variability of the data, the average distance from the center point

[Figure: Normal Curves with Equal Means and Different Standard Deviations; horizontal axis: Values of X]

Range
• Simplest measure of variability or spread
• Range = Max value – Min value
• Can give a misleading picture of the actual pattern of variation. Two distributions could have the same range but different patterns of variation.
• Is sensitive to extreme data values

[Figure: two dot plots over the values 20 to 30]

Population variance, =varp(…):
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Sample variance, =var(…):
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Population standard deviation, =stdevp(…):
$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
Sample standard deviation, =stdev(…):
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

The sample standard deviation is the most commonly used of these statistics:
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

Example:
Numbers                          Mean   Standard Deviation
100, 100, 100, 100, 100, 100     100    0
90, 90, 100, 110, 110            100    10

Computing the standard deviation:
• find the mean (100)
• find the deviation of each value from the mean (-10, -10, 0, 10, 10)
• square the deviations (100, 100, 0, 100, 100)
• sum the squared deviations (100 + 100 + 0 + 100 + 100 = 400)
• divide the sum by the # of values minus 1 (# of values = 5; 5 – 1 = 4; 400/4 = 100); this is the variance
• take the square root of the variance (10)
(Will be important in research when you are trying to determine the range of information.)
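The computing steps above can be sketched in Python, one line per step, and then compared against the standard library. The variable names are illustrative; `statistics.stdev` uses the same n – 1 divisor as the sample formula (the counterpart of Excel's =stdev), while `statistics.pstdev` divides by N.

```python
from math import sqrt
from statistics import stdev

values = [90, 90, 100, 110, 110]  # second data set from the example
n = len(values)

mean_value = sum(values) / n                     # step 1: the mean (100)
deviations = [v - mean_value for v in values]    # step 2: deviations from the mean
squared = [d ** 2 for d in deviations]           # step 3: squared deviations
sum_squared = sum(squared)                       # step 4: sum of squares (400)
sample_variance = sum_squared / (n - 1)          # step 5: divide by n - 1 (100)
sample_sd = sqrt(sample_variance)                # step 6: square root (10)

value_range = max(values) - min(values)          # range = max - min (20)

print("Manual sample standard deviation:", sample_sd)
print("statistics.stdev:", stdev(values))
print("Range:", value_range)
```

Both routes should agree on this data set, which is a quick way to confirm that a hand calculation used the sample (n – 1) divisor rather than the population (N) divisor.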
To compare dispersion in data sets with dissimilar units of measurement (e.g., kilograms and ounces) or dissimilar means (e.g., home prices in two different cities), we define the coefficient of variation (CV), which is a unit-free measure of dispersion:
$CV = 100 \times \frac{s}{\bar{x}}$

Two Investments A & B: Which should I pick?
Choices:
Project   Mean % Return   Std Dev % Return
A         7.6             3.2
B         6.8             2.5

Normal distribution
[Figure: normal curve over x with the areas 68%, 95%, and 99% marked]

If all samples of a particular size are selected from any population, the sampling distribution of the sample mean is approximately a normal distribution. This approximation improves with larger samples (the larger the sample, the more closely the sampling distribution resembles a normal distribution).

Probability
• The "chance" or "likelihood" of something happening
• A value between zero and one
▪ zero = "cannot happen"; one = "sure to happen"
▪ expressed as a decimal or fraction
[Figure: probability scale from 0 to 1, "Increasing Likelihood of Occurrence"; at .5 the occurrence of the event is just as likely as it is unlikely]

Discrete Probability (discrete random variables):
▪ fixed number of clearly separated outcomes
▪ examples: rolling a die (6 outcomes); coin flip (2 outcomes)
▪ Binomial Probability
Continuous Probability (continuous random variables):
▪ infinite number of outcomes within a certain range
▪ example: life expectancies
▪ Workshop 4: find probabilities under bell shaped curve

Discrete probabilities are individual, singular, and unique values; the number of outcomes is limited; they are graphed as bars or rectangles.
[Figure: bar chart of Probability P(x) against Number of Cars Sold on a Saturday, x]
Not a smooth curve…unless sample size gets large...

Continuous probabilities:
• are the area under the standard normal curve
• can be an infinite number of values within a certain range
• "Z" is a calculated value, indicating the number of standard deviations from the mean.

Experiment: a process involving chance or probability that leads to results called outcomes.
Outcome: the result of a single trial of an experiment.
Event: one or more outcomes of an experiment.
Sample space: the set of all possible outcomes from an experiment.
Independent Event: the probability of one event is not affected or changed by another.
Example: Sampling With Replacement. Take a random sample from a population, then replace it before taking another. As a result, each random sample is not affected by another. The population remains with all data intact.
Dependent Event: the probability of one event IS affected or changed by another.
Example: Sampling Without Replacement. Take a random sample from a population, then do not replace it before taking another. As a result, samples taken this way affect each other. Each removed sample changes the characteristics of the population.
Trial: the act of testing something

Classical probability
Classical probability assumes the outcomes of an experiment are equally likely. Using this classical viewpoint:
$P(A) = \frac{\text{Number of ways that A can occur}}{\text{Total number of possible outcomes}}$

Experiment: A spinner has 4 equal sectors colored yellow, blue, green, and red. After spinning the spinner, what is the probability of landing on each color?
Outcomes: The possible outcomes of this experiment are yellow, blue, green, and red.
Probabilities:
P(yellow) = number of ways to land on yellow / total number of colors = 1/4
P(blue) = number of ways to land on blue / total number of colors = 1/4
P(green) = number of ways to land on green / total number of colors = 1/4
P(red) = number of ways to land on red / total number of colors = 1/4

Chapter 7
• Discrete Variable – each value of X has its own probability P(X).
• Continuous Variable – events are intervals and probabilities are areas underneath smooth curves. A single point has no probability.
• Probability Density Function (PDF) – For a continuous random variable, the PDF is an equation that shows the height of the curve f(x) at each possible value of X over the range of X.
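The contrast between discrete and continuous variables can be illustrated with a minimal Python sketch. The discrete part reuses the spinner; the continuous part assumes, purely for illustration, a normal variable with mean 0 and standard deviation 1. The helper `interval_probability` is a hypothetical name, not something from the course material.

```python
from fractions import Fraction
from statistics import NormalDist

# Discrete: each outcome of the spinner has its own probability P(X)
colors = ["yellow", "blue", "green", "red"]
spinner = {color: Fraction(1, len(colors)) for color in colors}
total_probability = sum(spinner.values())  # the probabilities add up to 1

# Continuous: probabilities are areas under the curve (assumed N(0, 1) here)
curve = NormalDist(mu=0, sigma=1)

def interval_probability(dist, low, high):
    """Area under the PDF between low and high, via the cumulative distribution."""
    return dist.cdf(high) - dist.cdf(low)

point_probability = interval_probability(curve, 1.0, 1.0)   # zero-width interval
curve_height = curve.pdf(1.0)                               # height f(x), not a probability
band_probability = interval_probability(curve, -1.0, 1.0)   # area within one std dev

print("P(red) =", spinner["red"], "| total =", total_probability)
print("P(X = 1.0) =", point_probability)
print("f(1.0) =", round(curve_height, 4))
print("P(-1 < X < 1) =", round(band_probability, 4))
```

The point of the comparison is that `curve.pdf(1.0)` is a height, so it is positive even though the probability of any single point is zero.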
Normal PDF
Continuous PDFs:
• Denoted f(x)
• Total area under curve = 1
• Mean, variance and shape depend on the PDF parameters
• Reveals the shape of the distribution

Normal PDF: Probabilities as Areas
Continuous probability functions are smooth curves.
• Unlike discrete distributions, the area at any single point = 0.
• The entire area under any PDF must be 1.
• Mean is the balance point of the distribution.

Normal PDF
• f(x) reaches a maximum at μ and has points of inflection at μ ± σ
• Bell-shaped curve

Since for every value of μ and σ there is a different normal distribution, we transform a normal random variable to a standard normal distribution with μ = 0 and σ = 1 using the formula:
$z = \frac{x - \mu}{\sigma}$
• Denoted N(0,1)
• A common scale from -3 to +3 is used.
• Entire area under the curve is unity.
• The probability of an event P(z1 < Z < z2) is a definite integral of f(z). However, standard normal tables or Excel functions can be used to find the desired probabilities.

Now find the tail area P(Z > 1.96), which by symmetry equals P(Z < -1.96): .5000 - .4750 = .0250
Now find P(-1.96 < Z < 1.96). Due to symmetry, P(-1.96 < Z < 0) is the same as P(0 < Z < 1.96) = .4750.
So, P(-1.96 < Z < 1.96) = .4750 + .4750 = .9500, or 95% of the area under the curve.

Suppose John took an economics exam and scored 86 points. The class mean was 75 with a standard deviation of 7. What percentile is John in (i.e., find P(X < 86))?
$z_{John} = \frac{x - \mu}{\sigma} = \frac{86 - 75}{7} = \frac{11}{7} = 1.57$
So John's score is 1.57 standard deviations above the mean.

normal distribution
p(lower)   p(upper)   z      x    mean   std.dev
.9420      .0580      1.57   86   75     7

Suppose John took an economics exam and scored 86 points. The class mean was 75 with a standard deviation of 7.
What percentile is John in (i.e., find P(X < 86))? John is approximately in the 94th percentile.

For example, let μ = 2.040 cm and σ = .001 cm. What is the probability that a given steel bearing will have a diameter between 2.039 and 2.042 cm? In other words, P(2.039 < X < 2.042).
Excel only gives left tail areas, so break the formula into two: find P(X < 2.039) and P(X < 2.042), then subtract them to find the desired probability:
P(X < 2.042) = .9773
P(X < 2.039) = .1587
P(2.039 < X < 2.042) = .9773 - .1587 = .8186 or 81.9%

Suppose we wanted the probability of selecting a foreman who earned less than $1,100. In probability notation we write this statement as P(weekly income < $1,100).
P(weekly income < $1,100) = .8413

The mean of a normal probability distribution is 500; the standard deviation is 10.
a. About 68 percent of the observations lie between what two values?
b. About 95 percent of the observations lie between what two values?
c. Practically all of the observations lie between what two values?

a. 490 and 510, found by 500 +/- 1(10).
b. 480 and 520, found by 500 +/- 2(10).
c. 470 and 530, found by 500 +/- 3(10).

A normal distribution has a mean of 50 and a standard deviation of 4.
a. Compute the probability of a value between 44.0 and 55.0.
b. Compute the probability of a value greater than 55.0.
c. Compute the probability of a value between 52.0 and 55.0.

a.
0.8276: First find z = -1.5, found by (44 - 50)/4, and z = 1.25, found by (55 - 50)/4. The area between -1.5 and 0 is 0.4332 and the area between 0 and 1.25 is 0.3944, both from Appendix D. Then adding the two areas we find that 0.4332 + 0.3944 = 0.8276.
b. 0.1056, found by 0.5000 - 0.3944, where z = 1.25.
c. 0.2029: Recall that the area for z = 1.25 is 0.3944, and the area for z = 0.5, found by (52 - 50)/4, is 0.1915. Then subtract 0.3944 - 0.1915 and find 0.2029.

Problem 7.25 (p272)
The Layton Tire and Rubber Company wishes to set a minimum mileage guarantee on its new MX100 tire. Tests reveal the mean mileage is 67,900 with a standard deviation of 2,050 miles and that the distribution of miles follows the normal distribution. They want to set the minimum guaranteed mileage so that no more than 4 percent of the tires will have to be replaced. What minimum guaranteed mileage should Layton announce?

Draw it out
Notice that there are two unknowns, z and X. To find X, we first find z, and then solve for X. Notice the area under the normal curve to the left of μ is .5000. The area between μ and X is .4600, found by .5000 - .0400. Now refer to Appendix C. Search the body of the table for the area closest to .4600. The closest area is .4599, which corresponds to z = 1.75. Because X lies below the mean, z is negative:
$z = \frac{X - 67{,}900}{2{,}050} = -1.75$
-1.75(2,050) = X – 67,900
X = 67,900 – 1.75(2,050)
X = 64,312
So Layton can advertise that it will replace for free any tire that wears out before it reaches 64,312 miles, and the company will know that only about 4 percent of the tires will be replaced under this plan.
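The Layton calculation can be reproduced with a short Python sketch using `statistics.NormalDist` from the standard library. It shows the table-based route (z = -1.75) next to the inverse cumulative distribution, which plays the role of searching the body of the table. The variable names are illustrative, and the exact result differs slightly from the table answer because the table area .4599 is only the closest match to .4600.

```python
from statistics import NormalDist

mean_mileage = 67_900   # mean mileage from the tests
sd_mileage = 2_050      # standard deviation in miles
max_replaced = 0.04     # at most 4 percent of tires replaced

# Table route: z = -1.75 is the table value closest to a lower-tail area of .04
z_table = -1.75
guarantee_table = mean_mileage + z_table * sd_mileage

# Direct route: invert the cumulative distribution at .04
mileage = NormalDist(mu=mean_mileage, sigma=sd_mileage)
guarantee_exact = mileage.inv_cdf(max_replaced)

# Check: share of tires expected to fail before the table-based guarantee
share_replaced = mileage.cdf(guarantee_table)

print("Guarantee from table z:", guarantee_table)
print("Guarantee from inverse CDF:", round(guarantee_exact, 1))
print("Share replaced at table guarantee:", round(share_replaced, 4))
```

The same `cdf` call can be used to check the earlier left-tail probabilities, for example John's percentile or the steel bearing interval, by subtracting two left-tail areas as described above.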