Basic Statistics II
Biostatistics, MHA, CDC, Jul 09
Prof. KG Satheesh Kumar, Asian School of Business

Frequency Distribution and Probability Distribution
• Frequency Distribution: plot of frequency along the y-axis and the variable along the x-axis; a histogram is an example
• Probability Distribution: plot of probability along the y-axis and the variable along the x-axis
• Both have the same shape
• Properties of probability distributions:
– Each probability is always between 0 and 1
– The probabilities must sum to 1

Theoretical Probability Distributions
• For a discrete variable we have a discrete probability distribution:
– Binomial Distribution
– Poisson Distribution
– Geometric Distribution
– Hypergeometric Distribution
• For a continuous variable we have a continuous probability distribution:
– Uniform (rectangular) Distribution
– Exponential Distribution
– Normal Distribution

The Normal Distribution
• If a random variable X is affected by many independent causes, none of which is overwhelmingly large, the probability distribution of X closely follows the normal distribution. X is then called a normal variate and we write X ~ N(μ, σ²), where μ is the mean and σ² is the variance.
• A normal pdf is completely defined by its mean μ and variance σ². The square root of the variance is called the standard deviation σ.
• If several independent random variables are normally distributed, their sum is also normally distributed, with mean equal to the sum of the individual means and variance equal to the sum of the individual variances.

The Normal pdf
• The area under any pdf between two given values of X is the probability that X falls between those two values.

Standard Normal Variate, Z
• The SNV, Z, is the normal random variable with mean 0 and standard deviation 1
• Tables are available for standard normal probabilities
• X and Z are connected by: Z = (X − μ)/σ and X = μ + Zσ
• The area under the X curve between X1 and X2 equals the area under the Z curve between the corresponding Z1 and Z2.
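The X-to-Z conversion can be checked numerically. A minimal sketch in Python, assuming scipy is installed (the figures anticipate the tube-light illustration that follows):

    from scipy.stats import norm

    mu, sigma = 4500, 1500                   # mean and SD of tube-light life
    x = 6000
    z = (x - mu) / sigma                     # Z = (X - mu) / sigma = 1.0
    print(norm.cdf(z))                       # P(Z < 1) ~ 0.8413
    print(norm.cdf(x, loc=mu, scale=sigma))  # same probability via X directly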
Standard Normal Probabilities (Table of the Z distribution)
[Table omitted: z-values from 0.00 to 3.49 are read from the left and top margins, and the probability (the shaded area under the curve between 0 and z) is in the body of the table. For example, the entry for z = 1.00 is 0.3413 and for z = 1.78 it is 0.4625.]

Illustration
Q. A tube light has a mean life of 4500 hours with a standard deviation of 1500 hours. In a lot of 1000 tubes, estimate the number of tubes lasting between 4000 and 6000 hours.
A. P(4000 < X < 6000) = P(−1/3 < Z < 1) = 0.1306 + 0.3413 = 0.4719. Hence the probable number of tubes in a lot of 1000 lasting 4000 to 6000 hours is about 472.

Illustration
Q. The cost of a certain procedure is estimated to average Rs.25,000 per patient. Assuming a normal distribution and a standard deviation of Rs.5,000, find a value such that 95% of the patients pay less than that.
A. Using tables, P(Z < Z1) = 0.95 gives Z1 = 1.645. Hence X1 = 25000 + 1.645 × 5000 = Rs.33,225. So 95% of the patients pay less than Rs.33,225.
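Both illustrations can be reproduced without the printed table. A minimal sketch, assuming scipy is installed:

    from scipy.stats import norm

    # Tube lights: P(4000 < X < 6000) with mu = 4500, sigma = 1500
    p = norm.cdf(6000, 4500, 1500) - norm.cdf(4000, 4500, 1500)
    print(p)                                      # ~0.4719, so about 472 of 1000 tubes

    # Procedure cost: the 95th percentile of N(25000, 5000^2)
    print(norm.ppf(0.95, loc=25000, scale=5000))  # ~33224, i.e. about Rs.33,225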
Sampling Basics
• Population or Universe: the collection of all units of interest, e.g. households of a specific type in a given city at a certain time. A population may be finite or infinite.
• Sampling Frame: the list of all units in the population, with identification such as serial numbers, house numbers, telephone numbers, etc.
• Sample: a set of units drawn from the population according to some specified procedure
• Unit: an element or group of elements on which observations are made, e.g. a person, a family, a school, a book, a piece of furniture

Census vs Sampling
• Census
– Thought to be accurate and reliable, but often not so if the population is large
– Needs more resources (money, time, manpower)
– Unsuitable for destructive tests
• Sampling
– Needs fewer resources
– Highly qualified and skilled persons can be used
– Subject to sampling error, which can be reduced by using a large, representative sample

Sampling Methods
• Probability Sampling (Random Sampling)
– Simple Random Sampling
– Systematic Random Sampling
– Stratified Random Sampling
– Cluster Sampling (single-stage, multi-stage)
• Non-probability Sampling
– Convenience Sampling
– Judgment Sampling
– Quota Sampling

Limitations of Non-Random Sampling
• Selection does not give each unit a known chance of being selected (i.e. the sample may be non-representative)
• Inaccurate because of selection bias
• Results cannot be generalised, because inferential statistics requires probability sampling for valid conclusions
• Useful for pilot studies and exploratory research

Sampling Distribution and Standard Error of the Mean
• The sampling distribution of x̄ is the probability distribution of all possible values of x̄ for a given sample size n taken from the population
• According to the Central Limit Theorem, for a large enough sample size n, the sampling distribution is approximately normal with mean μ and standard deviation σ/√n. This standard deviation is called the standard error of the mean.
• The CLT holds for non-normal populations as well and states: for large enough n, x̄ ~ N(μ, σ²/n)

Illustration
Q. When sampling from a population with SD 55, using a sample size of 150, what is the probability that the sample mean will be at least 8 units away from the population mean?
A. Standard error of the mean, SE = 55/√150 = 4.4907. Hence 8 units = 1.7815 SE. The area within 1.7815 SE on both sides of the mean = 2 × 0.4625 = 0.925. Hence the required probability = 1 − 0.925 = 0.075.

Illustration
Q. An economist wishes to estimate the average family income in a certain population. The population SD is known to be $4,500 and the economist uses a random sample of size 225. What is the probability that the sample mean will fall within $800 of the population mean? (See the sketch after the next slide.)

Point and Interval Estimation
• The value of an estimator (see next slide), obtained from a sample, can be used to estimate the value of the population parameter. Such an estimate is called a point estimate.
• A point estimate is a 50:50 estimate, in the sense that the actual parameter value is equally likely to lie on either side of it.
• A more useful estimate is the interval estimate, where an interval is specified along with a measure of confidence (90%, 95%, 99%, etc.)
• The interval estimate with its associated measure of confidence is called a confidence interval.
• A confidence interval is a range of numbers believed to include the unknown population parameter, with a certain level of confidence.
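The economist illustration above is left unanswered on the slide; a minimal sketch of the computation, assuming scipy is installed:

    from math import sqrt
    from scipy.stats import norm

    sigma, n = 4500, 225
    se = sigma / sqrt(n)            # standard error = 4500/15 = 300
    z = 800 / se                    # $800 is about 2.67 standard errors
    p = norm.cdf(z) - norm.cdf(-z)  # P(|sample mean - mu| < 800)
    print(se, z, p)                 # 300.0, 2.67, ~0.992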
Estimators
• Population parameters (μ, σ², p) and sample statistics (x̄, s², ps)
• An estimator of a population parameter is a sample statistic used to estimate that parameter
• The statistic x̄ is an estimator of the parameter μ
• The statistic s² is an estimator of the parameter σ²
• The statistic ps is an estimator of the parameter p

Illustration
Q. A wine importer needs to report the average percentage of alcohol in bottles of French wine. From experience with previous kinds of wine, the importer believes the population SD is 1.2%. The importer randomly samples 60 bottles of the new wine and obtains a sample mean of 9.3%. Find the 90% confidence interval for the average percentage of alcohol in the population.
A. Standard error = 1.2%/√60 = 0.1549%. For a 90% confidence interval, Z = 1.645. Hence the margin of error = 1.645 × 0.1549% = 0.2548%. The 90% confidence interval is therefore 9.3% ± 0.25%, i.e. (9.05%, 9.55%).

More Sampling Distributions
• A sampling distribution is the probability distribution of a given test statistic (e.g. Z), a numerical quantity calculated from sample statistics
• The sampling distribution depends on the distribution of the population, the statistic being considered and the sample size
• Distribution of the sample mean: Z or t distribution
• Distribution of the sample proportion: Z (large samples)
• Distribution of the sample variance: chi-square distribution

The t-distribution
• The t-distribution is also bell-shaped and very similar to the standard normal distribution N(0, 1)
• Its mean is 0 and its variance is df/(df − 2)
• df = degrees of freedom = n − 1, where n = sample size
• For large sample sizes, t and Z are practically identical
• For small n, the variance of t is larger than that of Z, giving wider tails that reflect the extra uncertainty introduced by the unknown population SD and the small sample size

Illustration
Q. A large drugstore wants to estimate the average weekly sales for a brand of soap. A random sample of 13 weeks gives the following numbers: 123, 110, 95, 120, 87, 89, 100, 105, 98, 88, 75, 125, 101. Determine the 90% confidence interval for average weekly sales.
A. Sample mean = 101.23 and sample SD = 15.13. From the t-table, for 90% confidence at df = 12, t = 1.782. Hence the margin of error = 1.782 × 15.13/√13 = 7.48. The 90% confidence interval is (93.75, 108.71).

Chi-Square Distribution
• The chi-square distribution is the probability distribution of the sum of several independent squared Z variables
• It has a df parameter associated with it (like the t distribution). Its mean is df and its variance is 2df.
• Being a sum of squares, a chi-square variable cannot be negative; hence the distribution curve lies entirely on the positive side and is skewed to the right.

Confidence Interval for the Population Variance using the Chi-Square Distribution
Q. A random sample of 30 gives a sample variance of 18,540 for a certain variable. Give a 95% confidence interval for the population variance.
A. The point estimate for the population variance is 18,540. For df = 29, Excel gives the chi-square values 45.7 (right-tail area 2.5%) and 16.0 (right-tail area 97.5%). Hence the lower limit of the confidence interval for the population variance = 18540 × 29/45.7 = 11,765 and the upper limit = 18540 × 29/16.0 = 33,604.
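Both interval estimates above can be reproduced in a few lines. A minimal sketch, assuming numpy and scipy are installed (small differences from the slide values come from rounding the table entries):

    import numpy as np
    from scipy import stats

    # 90% t confidence interval for the drugstore sales
    sales = np.array([123, 110, 95, 120, 87, 89, 100, 105, 98, 88, 75, 125, 101])
    m = sales.mean()                          # ~101.23
    s = sales.std(ddof=1)                     # ~15.13
    t = stats.t.ppf(0.95, df=len(sales) - 1)  # ~1.782 (5% in each tail)
    me = t * s / np.sqrt(len(sales))          # margin of error ~7.48
    print(m - me, m + me)                     # ~(93.75, 108.71)

    # 95% confidence interval for the population variance
    n, s2, df = 30, 18540, 29
    print(df * s2 / stats.chi2.ppf(0.975, df),  # lower limit ~11,760
          df * s2 / stats.chi2.ppf(0.025, df))  # upper limit ~33,500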
Chi-Square Test for Goodness of Fit
• A goodness-of-fit test is a statistical test of how well sample data support an assumption about the distribution of a population
• The chi-square statistic used is χ² = Σ(O − E)²/E, where O is the observed value and E the expected value
• The computed value is compared with the critical value (obtained from a table or using Excel) for the given df and the required level of significance, α (1% or 5%)

Illustration
Q. A company comes out with a new watch and wants to find out whether people have special preferences for colour, or whether all four colours under consideration are equally preferred. A random sample of 80 prospective buyers indicated preferences as follows: 12, 40, 8, 20. Is there a colour preference at 1% significance?
A. Assuming no preference, the expected values would all be 20. Hence the chi-square value is 64/20 + 400/20 + 144/20 + 0 = 30.4. For df = 3 and 1% significance, the critical value is 11.3. The computed value of 30.4 is far greater than 11.3 and hence lies deep in the rejection region, so we reject the assumption of no colour preference.

Q. The following data concern the births of babies on the various days of the week during the past year in a hospital. Can we assume that births are independent of the day of the week? Sun: 116, Mon: 184, Tue: 148, Wed: 145, Thu: 153, Fri: 150, Sat: 154 (Total: 1050)
A. Assuming independence, the expected values would all be 1050/7 = 150. Hence the chi-square value is (34² + 34² + 2² + 5² + 3² + 0² + 4²)/150 = 2366/150 = 15.77. For df = 6 and 5% significance, the critical value is 12.6. The computed value of 15.77 is greater than the critical value of 12.6 and hence falls in the rejection region, so we reject the assumption of independence.
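Both goodness-of-fit tests can be cross-checked with scipy's chisquare, which defaults to equal expected counts, matching the assumptions above. A minimal sketch, assuming scipy is installed:

    from scipy.stats import chisquare, chi2

    # Watch colours: expected counts default to equal (80/4 = 20 each)
    stat, pval = chisquare([12, 40, 8, 20])
    print(stat, chi2.ppf(0.99, df=3))   # 30.4 vs critical value ~11.34

    # Births by weekday: expected 1050/7 = 150 per day
    stat, pval = chisquare([116, 184, 148, 145, 153, 150, 154])
    print(stat, chi2.ppf(0.95, df=6))   # ~15.77 vs critical value ~12.59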
Correlation
• Correlation refers to the concomitant variation between two variables, such that a change in one is associated with a change in the other
• The statistical technique used to analyse the strength and direction of this association is called correlation analysis

Correlation and Causation
• Even if an association is established between two variables, no cause-effect relationship is implied
• An association between x and y may arise because:
– x causes y
– y causes x
– x and y influence each other (mutual influence)
– x and y are both influenced by z, v (influence of third variables)
– of chance (spurious association)
• Hence caution is needed when interpreting correlations

Types of Correlations
• Positive (direct) and negative (inverse)
– Positive: the direction of change is the same
– Negative: the direction of change is opposite
• Linear and non-linear
– Linear: the changes are in a constant ratio
– Non-linear: the ratio of change varies
• Simple, partial and multiple
– Simple: only two variables are involved
– Partial: there are third and further variables, but they are held constant
– Multiple: the association of several variables is considered simultaneously

Scatter Diagrams
[Figure omitted: six scatter plots illustrating r = 1, r = −0.94, r = −0.54, r = 0.42, r = 0.85 and r = 0.17]

Correlation Coefficient
• The correlation coefficient (r) indicates the strength and direction of association
• The value of r lies between −1 and +1
– −1: perfect negative correlation
– +1: perfect positive correlation
– Above 0.75: very high correlation
– 0.50 to 0.75: high correlation
– 0.25 to 0.50: low correlation
– Below 0.25: very low correlation

Methods of Correlation Analysis
• Scatter diagram
– Gives a quick, approximate visual idea of the association
• Karl Pearson's coefficient of correlation
– For numeric data measured on an interval or ratio scale
– r = Cov(x, y)/(SDx × SDy)
• Spearman's rank correlation
– For ordinal (rank) data
– R = 1 − 6 × (sum of squared differences of ranks)/[n(n² − 1)]
• Method of least squares
– r² = bxy × byx, i.e. the product of the regression coefficients

Karl Pearson Correlation Coefficient (Product-Moment Correlation)
• r = Cov(x, y)/(SD of x × SD of y)
• Recall: n Var(X) = SSxx, n Var(Y) = SSyy and n Cov(X, Y) = SSxy
• Thus r² = Cov²(X, Y)/[Var(X) Var(Y)] = SS²xy/(SSxx × SSyy)
• Note: r² is called the coefficient of determination

Sample Problem
The following data refer to two variables, promotional expense (Rs. lakhs) and sales ('000 units), collected in the context of a promotional study. Calculate the correlation coefficient.

X     Y     X−X̄   Y−Ȳ   (X−X̄)(Y−Ȳ)   (X−X̄)²   (Y−Ȳ)²
7     12     0      2         0          0         4
10    14     3      4        12          9        16
9     13     2      3         6          4         9
4     5     −3     −5        15          9        25
11    15     4      5        20         16        25
5     7     −2     −3         6          4         9
3     4     −4     −6        24         16        36
X̄ = 7, Ȳ = 10            SSxy = 83   SSxx = 58  SSyy = 124

Coefficient of determination, r² = 83 × 83/(58 × 124) = 0.95787
Coefficient of correlation, r = √0.95787 = 0.9787
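The same coefficient can be obtained directly from the raw data. A minimal sketch, assuming numpy is installed:

    import numpy as np

    promo = np.array([7, 10, 9, 4, 11, 5, 3])   # Rs. lakhs
    sales = np.array([12, 14, 13, 5, 15, 7, 4]) # '000 units
    r = np.corrcoef(promo, sales)[0, 1]         # Pearson's r
    print(r, r ** 2)                            # ~0.9787, ~0.9579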
Spearman's Rank Correlation Coefficient
Q. The ranks of 15 students in two subjects A and B are given below. Find Spearman's rank correlation coefficient.
(1,10); (2,7); (3,2); (4,6); (5,4); (6,8); (7,3); (8,1); (9,11); (10,15); (11,9); (12,5); (13,14); (14,12); (15,13)
A. Sum of squared differences of ranks = 81 + 25 + 1 + 4 + 1 + 4 + 16 + 49 + 4 + 25 + 4 + 49 + 1 + 4 + 4 = 272
R = 1 − 6 × 272/[n(n² − 1)] = 1 − 1632/(15 × 224) = 0.5143
Hence there is a moderate degree of positive correlation between the ranks of the students in the two subjects.

Regression Analysis
• A statistical technique for expressing the relationship between two (or more) variables in the form of an equation (the regression equation)
• Dependent (response, predicted) variable
• Independent (regressor, predictor) variable
• Used for prediction or forecasting

Types of Regression Models
• Simple and multiple regression models
– Simple: only one independent variable
– Multiple: more than one independent variable
• Linear and nonlinear regression models
– Linear: the value of the response variable changes in proportion to the change in the predictor, so that Y = a + bX

Simple Linear Regression Model
• Y = a + bX, where a and b are constants to be determined from the given data
• Note: more strictly, we may write Y = ayx + byxX
• To determine a and b, solve the following two equations (the "normal equations"):
ΣY = na + bΣX ……. (1)
ΣXY = aΣX + bΣX² ……. (2)

Calculating the Regression Coefficients
• Instead of solving the simultaneous equations, one may use the formulae directly
• For Y = a + bX, i.e. the regression of Y on X:
– byx = SSxy/SSxx
– ayx = Ȳ − byxX̄, where the mean values of Y and X are used
• For the form X = a + bY (regression of X on Y):
– bxy = SSxy/SSyy
– axy = X̄ − bxyȲ, where the mean values of X and Y are used

Example
For the earlier problem of sales (dependent variable) vs promotional expenses (independent variable), set up the simple linear regression model and predict the sales when promotional spending is Rs.13 lakhs.
Solution: We need to find a and b in Y = a + bX.
b = SSxy/SSxx = 83/58 = 1.4310
a = Ȳ − bX̄ = 10 − 1.4310 × 7 = −0.017
Hence the regression equation is Y = −0.017 + 1.4310X.
For X = 13 lakhs, we get Y = 18.59, i.e. predicted sales of 18,590 units.

Linear Regression using Excel
[Chart omitted: scatter plot of sales against promotional expense with fitted trendline y = 1.431x − 0.0172, R² = 0.9579]
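The same fit and prediction can be reproduced outside Excel. A minimal sketch, assuming numpy is installed:

    import numpy as np

    x = np.array([7, 10, 9, 4, 11, 5, 3])    # promo (Rs. lakhs)
    y = np.array([12, 14, 13, 5, 15, 7, 4])  # sales ('000 units)
    b, a = np.polyfit(x, y, 1)               # slope ~1.431, intercept ~-0.017
    print(a + b * 13)                        # predicted sales at X = 13: ~18.59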
Properties of Regression Coefficients
• Coefficient of determination r² = byx × bxy
• If one regression coefficient is greater than one in absolute value, the other must be less than one, because r² lies between 0 and 1
• Both regression coefficients must have the same sign, which is also the sign of the correlation coefficient r
• The regression lines intersect at the means of X and Y
• Each regression coefficient gives the slope of the respective regression line

Coefficient of Determination
• Recall: SSyy = sum of squared deviations of Y from its mean
• Define:
– SSR as the sum of squared deviations of the estimated values of Y (from the regression equation) from the mean
– SSE as the sum of squared errors (an error is actual Y minus estimated Y)
• It can be shown that SSyy = SSR + SSE, i.e.
Total Variation = Explained Variation + Unexplained (Error) Variation
• r² = SSR/SSyy = Explained Variation/Total Variation
• Thus r² represents the proportion of the total variability of the dependent variable Y that is accounted for, or explained, by the independent variable X

Coefficient of Determination for Statistical Validity of the Promo-Sales Regression Model

Promo (X)  Sales (Y)  Ye = −0.017 + 1.4310X  (Ye−Ȳ)²   (Y−Ye)²   (Y−Ȳ)²
7          12         10.00                    0.00      4.00       4
10         14         14.29                   18.43      0.09      16
9          13         12.86                    8.19      0.02       9
4          5           5.71                   18.43      0.50      25
11         15         15.72                   32.76      0.52      25
5          7           7.14                    8.19      0.02       9
3          4           4.28                   32.76      0.08      36
X̄ = 7, Ȳ = 10                          SSR = 118.77  SSE = 5.22  SSyy = 124

Coefficient of determination, r² = 118.77/124 = 0.9578
Thus about 96% of the variation in sales is explained by promotional expenses.
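The SSyy = SSR + SSE decomposition in the table can be verified numerically. A minimal sketch, assuming numpy is installed:

    import numpy as np

    x = np.array([7, 10, 9, 4, 11, 5, 3])
    y = np.array([12, 14, 13, 5, 15, 7, 4])
    b, a = np.polyfit(x, y, 1)              # fitted slope and intercept
    ye = a + b * x                          # estimated (fitted) values of Y
    ssr = ((ye - y.mean()) ** 2).sum()      # explained variation ~118.8
    sse = ((y - ye) ** 2).sum()             # error variation ~5.2
    ssyy = ((y - y.mean()) ** 2).sum()      # total variation = 124
    print(ssr + sse, ssyy, ssr / ssyy)      # decomposition holds; r^2 ~0.958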