Changing Unit of Measurement • • • A linear transformation changes the original value x into a new variable xnew . xnew is given by an equation of the form, xnew a bx Example 1.21 on page 45 in IPS. (i) A distance x measured in km. can be expressed in miles as follow, x 0.62x . new (ii) A temperature x measured in degrees Fahrenheit can be converted to degrees Celsius by xnew 5(x 32) 160 5 x 9 9 9 week3 1 Effect of a Linear Transformation • Multiplying each observation in a data set by a number b multiplies both the measures of center (mean, median, and trimmed means) by b and the measures of spread (range, standard deviation and IQR) by |b| that is the absolute value of b. • Adding the same number a to each observation in a data set adds a to measures of center, quartiles and percentiles but does not change the measures of spread. • Linear transformations do NOT change the overall shape of a distribution. week3 2 Measure x xnew a bx Median M a+bM Mode Mode Range R bR IQR IQR b IQR Stdev s bs Mean x week3 a+bMode 3 Example 1 • A sample of 20 employees of a company was taken and their salaries were recorded. Suppose each employee receives a $300 raise in the salary for the next year. State whether the following statements are true or false. a) The IQR of the salaries will i. be unchanged ii. increase by $300 iii. be multiplied by $300 b) The mean of the salaries will i. be unchanged ii. increase by $300 iii. be multiplied by $300 week3 4 Density curves • Using software, clever algorithms can describe a distribution in a way that is not feasible by hand, by fitting a smooth curve to the data in addition to or instead of a histogram. The curves used are called density curves. • It is easier to work with a smooth curve, because histogram depends on the choice of classes. • Density Curve Density curve is a curve that is always on or above the horizontal axis. has area exactly 1 underneath it. • A density curve describes the overall pattern of a distribution. week3 5 • The area under the curve and above any range of values is the relative frequency (proportion) of all observations that fall in that range of values. • Example: The curve below shows the density curve for scores in an exam and the area of the shaded region is the proportion of students who scores between 60 and 80. week3 6 Median and mean of Density Curve • The median of a distribution described by a density curve is the point that divides the area under the curve in half. • A mode of a distribution described by a density curve is a peak point of the curve, the location where the curve is highest. • Quartiles of a distribution can be roughly located by dividing the area under the curve into quarters as accurately as possible by eye. week3 7 Normal distributions • An important class of density curves are the symmetric unimodal bell-shaped curves known as normal curves. They describe normal distributions. • All normal distributions have the same overall shape. • The exact density curve for a particular normal distribution is specified by giving its mean (mu) and its standard deviation (sigma). • The mean is located at the center of the symmetric curve and is the same as the median and the mode. • Changing without changing moves the normal curve along the horizontal axis without changing its spread. week3 8 • The standard deviation controls the spread of a normal curve. week3 9 • There are other symmetric bell-shaped density curves that are not normal e.g. t distribution. • Normal density function is mathematical model of process producing data. • If histogram with bars matching normal density curve, data is said to have a normal distribution. • Notation: A normal distribution with mean and standard deviation is denoted by N(, ). week3 10 The 68-95-99.7 rule In the normal distribution with mean and standard deviation , Approx. 68% of the observations fall within of the mean . Approx. 95% of the observations fall within 2 of the mean . Approx. 99.7% of the observations fall within 3 of the mean . week3 11 Example 1.23 on p72 in IPS • The distribution of heights of women aged 18-24 is approximately N(64.5, 2.5), that is ,normal with mean = 64.5 inches and standard deviation = 2.5 inches. • The 68-95-99.7 rule says that the middle 95% (approx.) of women are between 64.5-5 to 64.5+5 inches tall. The other 5% have heights outside the range from 59.5 to 69.5 inches, and 2.5% of the women are taller than 69.5 . • Exercise: 1) The middle 68% (approx.) of women are between ____to ___ inches tall. 2) ___% of the women are taller than 66.75. 3) ___% of the women are taller than 72. week3 12 Standardizing and z-scores • If x is an observation from a distribution that has mean and standard deviation , the standardized value of x is given by x z • A standardized value is often called a z-score. • A z-score tells us how many standard deviations the original observation falls away from the mean of the distribution. • Standardizing is a linear transformation that transform the data into the standard scale of z-scores. Therefore, standardizing does not change the shape of a distribution, but changes the value of the mean and stdev. week3 13 Example 1.26 on p61 in IPS • The heights of women is approximately normal with mean = 64.5 inches and standard deviation = 2.5 inches. • The standardized height is z height 64.5 2.5 • The standardized value (z-score) of height 68 inches is z 68 64.5 1.4 2.5 or 1.4 std. dev. above the mean. • A woman 60 inches tall has standardized height z 60 64.5 1.8 2.5 or 1.8 std. dev. below the mean. week3 14 The Standard Normal distribution • The standard normal distribution is the normal distribution N(0, 1) that is, the mean = 0 and the sdev = 1 . • If a random variable X has normal distribution N(, ), then the standardized variable Z X has the standard normal distribution. • Areas under a normal curve represent proportion of observations from that normal distribution. • There is no formula to calculate areas under a normal curve. Calculations use either software or a table of areas. The table and most software calculate one kind of area: cumulative proportions . A cumulative proportion is the proportion of observations in a distribution that fall at or below a given value and is also the area under the curve to the left of a given value. week3 15 The standard normal tables • Table A gives cumulative proportions for the standard normal distribution. The table entry for each value z is the area under the curve to the left of z, the notation used is P( Z ≤ z). e.g. P( Z ≤ 1.4 ) = 0.9192 week3 16 Standard Normal Distribution z 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 .00 .01 .02 .03 .04 .05 .06 .07 .08 .09 .5000 .5398 .5793 .6179 .6554 .6915 .7257 .7580 .7881 .8159 .8413 .8643 .8849 .9032 .9192 .9332 .9452 .9554 .9641 .9713 .9772 .9821 .9861 .9893 .9918 .9938 .9953 .9965 .9974 .9981 .9987 .5040 .5438 .5832 .6217 .6591 .6950 .7291 .7611 .7910 .8186 .8438 .8665 .8869 .9049 .9207 .9345 .9463 .9564 .9649 .9719 .9778 .9826 .9864 .9896 .9920 .9940 .9955 .9966 .9975 .9982 .9987 .5080 .5478 .5871 .6255 .6628 .6985 .7324 .7642 .7939 .8212 .8461 .8686 .8888 .9066 .9222 .9357 .9474 .9573 .9656 .9726 .9783 .9830 .9868 .9898 .9922 .9941 .9956 .9967 .9976 .9982 .9987 .5120 .5517 .5910 .6293 .6664 .7019 .7357 .7673 .7967 .8238 .8485 .8708 .8907 .9082 .9236 .9370 .9484 .9582 .9664 .9732 .9788 .9834 .9871 .9901 .9925 .9943 .9957 .9968 .9977 .9983 .9988 .5160 .5557 .5948 .6331 .6700 .7054 .7389 .7703 .7995 .8264 .8508 .8729 .8925 .9099 .9251 .9382 .9495 .9591 .9671 .9738 .9793 .9838 .9875 .9904 .9927 .9945 .9959 .9969 .9977 .9984 .9988 .5199 .5596 .5987 .6368 .6736 .7088 .7422 .7734 .8023 .8289 .8531 .8749 .8944 .9115 .9265 .9394 .9505 .9599 .9678 .9744 .9798 .9842 .9878 .9906 .9929 .9946 .9960 .9970 .9978 .9984 .9989 .5239 .5636 .6026 .6406 .6772 .7123 .7454 .7764 .8051 .8315 .8554 .8770 .8962 .9131 .9279 .9406 .9515 .9608 .9686 .9750 .9803 .9846 .9881 .9909 .9931 .9948 .9961 .9971 .9979 .9985 .9989 .5279 .5675 .6064 .6443 .6808 .7157 .7486 .7794 .8078 .8340 .8577 .8790 .8980 .9147 .9292 .9418 .9525 .9616 .9693 .9756 .9808 .9850 .9884 .9911 .9932 .9949 .9962 .9972 .9979 .9985 .9989 .5319 .5714 .6103 .6480 .6844 .7190 .7517 .7823 .8106 .8365 .8599 .8810 .8997 .9162 .9306 .9429 .9535 .9625 .9699 .9761 .9812 .9854 .9887 .9913 .9934 .9951 .9963 .9973 .9980 .9986 .9990 .5359 .5753 .6141 .6517 .6879 .7224 .7549 .7852 .8133 .8389 .8621 .8830 .9015 .9177 .9319 .9441 .9545 .9633 .9706 .9767 .9817 .9857 .9890 .9916 .9936 .9952 .9964 .9974 .9981 .9986 .9990 The table shows area to left of ‘z’ under standard normal curve 17 The standard normal tables - Example • What proportion of the observations of a N(0,1) distribution takes values a) less than z = 1.4 ? b) greater than z = 1.4 ? c) greater than z = -1.96 ? d) between z = 0.43 and z = 2.15 ? week3 18 Properties of Normal distribution • If a random variable Z has a N(0,1) distribution then P(Z = z)=0. The area under the curve below any point is 0. • The area between any two points a and b (a < b) under the standard normal curve is given by P(a ≤ Z ≤ b) = P(Z ≤ b) – P(Z ≤ a) • As mentioned earlier, if a random variable X has a N(, ) distribution, then the standardized variable X Z has a standard normal distribution and any calculations about X can be done using the following rules: week3 19 • P(X = k) = 0 for all k. a P X a P Z b P X b 1 P Z b a Pa X b P Z • The solution to the equation P(X ≤ k) = p is k = μ + σzp Where zp is the value z from the standard normal table that has area (and cumulative proportion) p below it, i.e. zp is the pth percentile of the standard normal distribution. week3 20 Questions 1. The marks of STA221 students has N(65, 15) distribution. Find the proportion of students having marks (a) less then 50. (b) greater than 80. (c) between 50 and 80. 2. Example 1.30 on page 65 in IPS: Scores on SAT verbal test follow approximately the N(505, 110) distribution. How high must a student score in order to place in the top 10% of all students taking the SAT? 3. The time it takes to complete a stat220 term test is normally distributed with mean 100 minutes and standard deviation 14 minutes. How much time should be allowed if we wish to ensure that at least 9 out of 10 students (on average) can complete it? (final exam Dec. 2001) week3 21 4. General Motors of Canada has a deal: ‘an oil filter and lube job in 25 minutes or the next one free’. Suppose that you worked for GM and knew that the time needed to provide these services was approximately normal with mean 15 minutes and std. dev. 2.5 minutes. How many minutes would you have recommended to put in the ad above if it was decided that about 5 free services for 100 customers was reasonable? 5. In a survey of patients of a rehabilitation hospital the mean length of stay in the hospital was 12 weeks with a std. dev. of 1 week. The distribution was approximately normal. Out of 100 patients how many would you expect to stay longer than 13 weeks? What is the percentile rank of a stay of 11.3 weeks? What percentage of patients would you expect to be in longer than 12 weeks? What is the length of stay at the 90th percentile? What is the median length of stay? a) b) c) d) e) week3 22 Normal quantile plots and their use • • • • A histogram or stem plot can reveal distinctly nonnormal features of a distribution. If the stem-plot or histogram appears roughly symmetric and unimodal, we use another graph, the normal quantile plot as a better way of judging the adequacy of a normal model. Any normal distribution produces a straight line on the plot. Use of normal quantile plots: If the points on a normal quantile plot lie close to a straight line, the plot indicates that the data are normal. Systematic deviations from a straight line indicate a nonnormal distribution. Outliers appear as points that are far away from the overall pattern of the plot. week3 23 • Histogram, the nscores plot and the normal quantile plot for data generated from a normal distribution (N(500, 20)). 10 5 0 460 470 480 490 500 510 520 530 540 value Normal Probability Plot for value 99 ML Estimates 95 Mean: 500.343 StDev: 17.4618 90 80 Percent Frequency 15 70 60 50 40 30 20 10 5 1 450 week3 500 Data 550 24 • Histogram, the nscores plots and the normal quantile plot for data generated from a right skewed distribution Frequency 10 5 0 0 5 10 value week3 25 Normal Probability Plot for value 99 ML Estimates 95 Mean: 2.64938 StDev: 2.17848 90 Percent 80 70 60 50 40 30 20 10 5 1 0 5 10 Data week3 26 • Histogram, the nscores plots and the normal quantile plot for data generated from a left skewed distribution Frequency 10 5 0 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95 1.05 value week3 27 Normal Probability Plot for value 99 ML Estimates 95 Mean: 0.8102 StDev: 0.161648 90 Percent 80 70 60 50 40 30 20 10 5 1 0.50 0.75 1.00 1.25 Data week3 28 • Histogram, the nscores plots and the normal quantile plot for data generated from a uniform distribution (0,5) 9 8 Frequency 7 6 5 4 3 2 1 0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 value week3 29 Normal Probability Plot for value 99 ML Estimates 95 Mean: 2.21603 StDev: 1.46678 90 Percent 80 70 60 50 40 30 20 10 5 1 -2 -1 0 1 2 3 4 5 6 Data week3 30 Looking at data - relationships • Two variables measured on the same individuals are associated if some values of one variable tend to occur more often with some values of the second variable than with other values of that variable. • When examining the relationship between two or more variables, we should first think about the following questions: – What individuals do the data describe? – What variables are present? How are they measured? – Which variables are quantitative and which are categorical? – Is the purpose of the study is simply to explore the nature of the relationship, or do we hope to show that one variable can explain variation in the other? week3 31 Response and explanatory variables • A response variable measure an outcome of a study. An explanatory variable explains or causes changes in the response variables. • Explanatory variables are often called independent variables and response variables are called dependent variables. The ides behind this is that response variables depend on explanatory variables. • We usually call the explanatory variable x and the response variable y. week3 32 Scatterplot • A scatterplot shows the relationship between two quantitative variables measured on the same individuals. • Each individual in the data appears as a point in the plot fixed by the values of both variables for that individual. • Always plot the explanatory variable, if there is one, on the horizontal axis (the x axis) of a scatterplot. • Examining and interpreting Scatterplots – Look for overall pattern and striking deviations from that pattern. – The overall pattern of a scatterplot can be described by the form, direction and strength of the relationship. – An important kind of deviation is an outlier, an individual value that falls outside the overall pattern. week3 33 Example • There is some evidence that drinking moderate amounts of wine helps prevent heart attack. A data set contain information on yearly wine consumption (litters per person) and yearly deaths from heart disease (deaths per 100,000 people) in 19 developed nations. Answer the following questions. • What is the explanatory variable? • What is the response variable? • Examine the scatterplot below. week3 34 Heart disease deaths 300 200 100 0 1 2 3 4 5 6 7 8 9 Wine week3 35 • Interpretation of the scatterplot – The pattern is fairly linear with a negative slope. No outliers. – The direction of the association is negative . This means that higher levels of wine consumption are associated with lower death rates. – This does not mean there is a causal effect. There could be lurking variables. For example, higher wine consumption could be linked to higher income, which would allow better medical care. • MINITAB command for scatterplot Graph > Plot week3 36 Categorical variables in scatterplots • To add a categorical variable to a scatterplot, use a different colour or symbol for each category. • The scatterplot below shows the relationship between the world record times for 10,000m run and the year for both men and women. 2300 F M Time (seconds) 2200 2100 2000 1900 1800 1700 1600 1900 1950 2000 Year week3 37 Correlation • A sctterplot displays the form, direction and strength of the relationship between two quantitative variables. • Correlation (denoted by r) measures the direction and strength of the liner relationship between two quantitative variables. • Suppose that we have data on variables x and y for n individuals. The correlation r between x and y is given by n xi yi nx y n 1 1 i 1 xi x yi y r n 1 i 1 sx s y s x s y n 1 week3 38 Example • Family income and annual savings in thousand of $ for a sample of eight families are given below. savings 1 2 2 5 5 6 7 8 income 36 39 42 45 48 51 54 56 C3 -1.42887 -1.02062 -0.61237 -0.20412 0.20412 0.61237 1.02062 1.42887 C4 -1.45101 -1.03144 -0.61187 -0.19230 0.22727 0.64684 1.06641 1.34612 C5 2.07331 1.05271 0.37469 0.03925 0.04433 0.39611 1.08840 1.92343 Sum of C5 = 6.99429 • r = 6.99429/7 = 0.999185 • MINITAB command: Stat > Basic Statistics > Correlation week3 39 Properties of correlation • Correlation requires both variables to be quantitative and make no use of the distinction between explanatory and response variables. • Correlation r has no unit if measurement. • Positive r indicates positive association between the variables and negative r indicates negative association. • Correlation measures the strength of only the linear relationship between two variables, it does not describe curved relationship! • r is always a number between –1 and 1. Values of r near 0 indicates a weak linear relationship. The strength of the linear relationship increases as r moves away from 0. Values of r close to –1 or 1 indicates that the points lie close to a straight line. r is not resistant. r is strongly affected by a few outliers. week3 40 week3 41 Question from Term test, summer 99 • MINITAB analyses of math and verbal SAT scores is given below. Variable Verbal Math GPA Variable Verbal Math GPA N 200 200 200 Mean Median 595.65 586.00 649.53 649.00 2.6300 2.6000 Minimum Maximum 361.00 780.00 441.00 800.00 0.3000 3.9000 TrMean 595.57 650.37 2.6439 StDev 73.21 66.35 0.5803 SE Mean 5.18 4.69 0.0410 Stem-and-leaf of Verbal N = 200 Leaf Unit = 10 1 3 6 4 4 034 19 4 566888888889999 52 5 000000122222222333333333444444444 (56) 5 55555555555556666666777777777777778888888888888889999999 92 6 00000000011111111222222333333333444444444444444 45 6 555555666666666778888888889999 15 7 0011112244 5 7 55568 week3 42 Stem-and-leaf of Math Leaf Unit = 10 1 3 12 38 (63) 99 51 12 2 4 4 5 5 6 6 7 7 8 N = 200 4 79 001222234 55555666677777778888889999 000000000000001111111111112222222222222222333333333344444444444 555555555666666666666667777777777788888889999999 000000000011111111111112222222333334444 5566777789 00 30 20 Frequency Frequency 20 10 0 10 0 400 500 600 Math 700 800 400 week3 500 600 Verbal 700 800 43 a) Find the 25th percentile, 75th percentile and the IQR of the math SAT scores. b) You were one of the students of this study and your math SAT score was 532. What is your z-score and percentile standing? c) If the math SAT scores were in fact left (negatively) skewed, but the mean was still 650, what could you say about the percentile standing of someone who obtains a score of 650? d) e) What is the class width ? i) of the histogram for verbal SAT scores? ii) of the stemplot of the verbal SAT scores? Describe both the verbal and math score distributions and compare one with the other. week3 44 g) Give a rough sketch of how a normal probability plot would look if the verbal scores were i. Right (positively) skewed ii. Uniform in shape h) For verbal scores, aside from running through the data and tallying, can you determine the approx. percentage of scores which fall between 523 and 668? If so give the percentage. week3 45