Chapter 1 Exploring Data

Section 1.1 – Displaying Distributions with Graphs

Introduction
• Any set of data contains information about some group of individuals.
• Individuals: the objects described by a set of data.
• Variable: any characteristic of an individual.
• Variables can be divided into two types: categorical and quantitative.
• Categorical: places an individual into one of several categories.
• Quantitative: takes numerical values.

Let’s try!
A political scientist selects a large sample of registered voters. For each voter, she records gender, age, and household income. Which variables are quantitative and which are categorical?
• Gender: Categorical
• Age: Quantitative
• Household income: Quantitative

Distribution
• The distribution of a variable tells us what values the variable takes and how often it takes these values.
• Exploratory data analysis examines and describes the main features of the data.
• Two basic strategies:
1. Examine each variable by itself, then study the relationships among the variables.
2. Begin with a graph, then add numerical summaries of specific aspects of the data.

Bar Graphs
• Bar graphs help the audience grasp the distribution quickly.
• To construct a bar graph:
1. Label your axes and title your graph.
2. Scale your axes. Use the counts in each category to help you scale your vertical axis.
3. Draw a vertical bar above each category name to a height that corresponds to the count in that category.

Example: Color Preference
Determine which color students prefer to wear to class:
Red: 5
Green: 2
Blue: 5
Black: 3
[Bar graph: number of students for each color (Red, Green, Blue, Black)]

Pie Charts
• Pie charts help us see what part of the whole each group forms.
• How to construct a pie chart (tip: using a statistical software package is recommended):
1. Convert the counts into percents of the total.
2. Give each category a slice whose share of the circle matches its percent.
3. Check that the percents add up to 100% (a total of 1 as proportions), apart from rounding.

Color    Count    Proportion     Percent
Red      5        5/15 ≈ 0.33    33%
Green    2        2/15 ≈ 0.13    13%
Blue     5        5/15 ≈ 0.33    33%
Black    3        3/15 = 0.20    20%

Dotplot
• Helps display quantitative data.
• How to construct:
1.
Draw one horizontal line going across.
2. Label the axis.
3. Scale the axis.
4. Put a dot in the correct place for every value that appears in the data.

Example: You roll a die 50 times and record the numbers that come up. Using the data provided, construct a dotplot.
Data: eight 1’s, six 2’s, six 3’s, nine 4’s, seven 5’s, fourteen 6’s

Overall Pattern of a Distribution
• To describe the overall pattern of a distribution:
1. Give the center and the spread.
2. See if the distribution has a simple shape that you can describe.
• The center is the value that divides the observations in half.
• The spread is given by the smallest and largest values.
• An outlier in any graph of data is an observation that falls outside the overall pattern of the graph.

Stemplot
• Stemplots are used when the values of a variable are too spread out for us to make a reasonable dotplot.
• How to construct a stemplot:
1. Separate each observation into a stem, consisting of all but the rightmost digit, and a leaf, the final digit.
2. Write the stems vertically in increasing order from top to bottom, and draw a vertical line to the right of the stems. Then write each leaf to the right of its stem.
3. Write the stems again, and rearrange the leaves in increasing order out from the stem.
4. Title your graph and add a key describing what the stems and leaves represent.

Example: Given these values, construct a stemplot.
40 26 39 14 42 18 25 43 46 27 19 47 19 26 35 34 15 44 40 38 31 46 52 59

1 | 4 5 8 9 9
2 | 5 6 6 7
3 | 1 4 5 8 9
4 | 0 0 2 3 4 6 6 7
5 | 2 9
Key: 5 | 2 = 52

Histograms
• The most common graph of the distribution of one quantitative variable is a histogram.
• How to construct a histogram:
1. Divide the range of the data into classes of equal width. Count the number of observations in each class.
2. Label and scale your axes and title your graph. The vertical axis contains the scale of counts.
3. Draw a bar that represents the count in each class.
The base of a bar should cover its class, and the bar height is the class count.

Example: The data below are the number of unprovoked attacks by alligators on people in Florida each year for a 33-year period. Construct a histogram for this distribution:
6 12 2 4 17 4 6 10 3 9 13 9 15
14 6 18 1 9 6 6 11 24 14 14 5
17 17 5 13 22 20 3 5

Class      Count
1–5        9
6–10       9
11–15      8
16–20      5
21–25      2
[Histogram: Frequency (vertical axis) against Number of Unprovoked Attacks (horizontal axis), with classes 1–5, 6–10, 11–15, 16–20, 21–25]

Symmetric and Skewed Distributions
• A distribution is symmetric if the right and left sides of the histogram are almost mirror images of each other.
• A distribution is skewed to the right if the right side of the histogram extends much farther out than the left side.
• A distribution is skewed to the left if the left side of the histogram extends much farther out than the right side.
[Figures: left-skewed distribution, right-skewed distribution, symmetric distribution]

Percentile
• The pth percentile of a distribution is the value such that p percent of the observations fall at or below it.
• For example: You may have received a standardized test score report that said you were in the 80th percentile. This means that 80% of the people who took the test earned scores that were less than or equal to your score. The remaining 20% earned a higher score than you.
Tip: Think of it like your SAT scores. If you are in the 60th percentile, you scored as well as or better than 60% of the students who also took the SAT.

Ogive
• Also known as a cumulative relative frequency graph.
• Helps us understand the relative standing of an individual observation.
• How to construct an ogive:
1. Decide on class intervals and make a frequency table, just as in making a histogram. Add three columns to your frequency table: relative frequency, cumulative frequency, and relative cumulative frequency.
2.
Label and scale your axes and title your graph.
3. Plot a point corresponding to the relative cumulative frequency in each class interval at the left endpoint of the next class interval.

Ogive (Cont’d)
• Frequency: count the number of observations in each class.
• Relative frequency: divide each class count by the total number of observations.
• Cumulative frequency: add the counts for the current class and all classes below it.
• Relative cumulative frequency: divide each cumulative frequency by the total number of observations (equivalently, add the relative frequencies together).

Example: Construct an ogive with the data provided. Twenty-nine female raccoons were observed, and the number of male partners during the time the female was accepting partners (generally 1 to 4 days each year) was recorded for each female:
1 3 2 1 1 4 2 4 1 1 1 3 1 1 1 1 2 2 1 1 4 1 1 2 1 1 1 1 3

Time Plot
• A time plot of a variable plots each observation against the time at which it was measured. Always mark the time scale on the horizontal axis and the variable of interest on the vertical axis.
• When examining a time plot, look once again for an overall pattern and for strong deviations from the pattern.
• Trend: a long-term upward or downward movement over time.
• Seasonal variation: a pattern that repeats itself at regular time intervals.

Time Plot (Cont’d)
Example of a Time Plot: [figure not reproduced]

Section 1.2 – Describing Distributions with Numbers
• Mean: the average of the observations.
• Median: the midpoint of the values (center).
• Interquartile range (IQR): IQR = Q3 – Q1.
• Outlier: an observation less than Q1 – 1.5 × IQR or more than Q3 + 1.5 × IQR.

The Five-Number Summary
• Overall description of a distribution: Min, Q1, M, Q3, Max

Example:
22 25 34 |35| 41 41 46 |46| 46 47 49 |54| 54 59 60
Min = 22, Q1 = 35, M = 46, Q3 = 54, Max = 60

IQR and Outlier
• IQR = Q3 – Q1 = 54 – 35 = 19
• Finding outliers:
Q1 – 1.5 × IQR = 35 – 1.5 × 19 = 6.5 (lower cutoff)
Q3 + 1.5 × IQR = 54 + 1.5 × 19 = 82.5 (upper cutoff)
• Every observation lies between 6.5 and 82.5, so there are no outliers.
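
The five-number summary and the 1.5 × IQR rule above can be sketched in a few lines of Python. This is only an illustration: the function names `five_number_summary` and `outlier_cutoffs` are made up for this sketch, and it assumes the quartiles are found as the medians of the lower and upper halves of the sorted data (leaving out the median itself when the number of observations is odd), which matches the example. Software packages may use slightly different quartile rules.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method.

    Assumes at least two observations. Illustrative helper, not a library function.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # observations to the left of the median position
    upper = data[half + n % 2:]    # observations to the right of the median position
    return data[0], median(lower), median(data), median(upper), data[-1]


def outlier_cutoffs(q1, q3):
    """Return the (lower, upper) cutoffs of the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Data from the five-number summary example.
scores = [22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60]

minimum, q1, med, q3, maximum = five_number_summary(scores)
low_cut, high_cut = outlier_cutoffs(q1, q3)   # 35 - 28.5 = 6.5 and 54 + 28.5 = 82.5

# An observation is flagged only if it falls outside the two cutoffs.
outliers = [x for x in scores if x < low_cut or x > high_cut]

print("Five-number summary:", minimum, q1, med, q3, maximum)
print("Cutoffs:", low_cut, high_cut)
print("Outliers:", outliers)
```

Because the smallest observation (22) is above the lower cutoff and the largest (60) is below the upper cutoff, the list of outliers comes back empty.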
Chapter 7 Random Variables

Section 7.1 – Discrete and Continuous Random Variables
• Random variable: a variable whose value is a numerical outcome of a random phenomenon.
• Discrete random variables
• The outcome probabilities must be between 0 and 1 and have a sum of 1.
• When the outcomes are numerical, they are values of a random variable.
• A discrete random variable X has a countable number of possible values. The probability distribution of X lists the values and their probabilities:
Value of X:    \(x_1\)  \(x_2\)  \(x_3\)  …  \(x_k\)
Probability:   \(p_1\)  \(p_2\)  \(p_3\)  …  \(p_k\)
The probabilities \(p_i\) must satisfy two requirements:
1) Every probability \(p_i\) is a number between 0 and 1.
2) \(p_1 + p_2 + \cdots + p_k = 1\).
Find the probability of any event by adding the probabilities \(p_i\) of the particular values \(x_i\) that make up the event.

7.1 - Example #1
• The instructor of a large class gives 15% each of A’s and D’s, 30% each of B’s and C’s, and 10% F’s. Choose a student at random from this class. To “choose at random” means to give every student the same chance to be chosen. The student’s grade on a four-point scale (A = 4) is a random variable X.
• The value of X changes when we repeatedly choose students at random, but it is always one of 0, 1, 2, 3, or 4. This is the distribution of X:
Grade:         0      1      2      3      4
Probability:   0.10   0.15   0.30   0.30   0.15
• The probability that the student got a B or better is the sum of the probabilities of an A and a B:
P(grade is 3 or 4) = P(X = 3) + P(X = 4) = 0.30 + 0.15 = 0.45

Probability Histogram
• We can use histograms to display probability distributions as well as distributions of data.
• For example, probability histograms can be used to compare the probability model for random digits with the model given by Benford’s law (Chapter 6).
• The height of each bar shows the probability of the outcome at its base.
• The heights of the bars add to 1.
• Using histograms helps us quickly compare the two distributions.

Continuous Random Variables
• When we use the table of random digits to select a digit between 0 and 9, the result is a discrete random variable.
• Using the random digits table is one way of assigning probabilities.
• However, for certain events this may be impossible because there are infinitely many possible values.
• A different way of assigning probabilities to events is to use areas under a density curve.
• The total area under a density curve is exactly 1, corresponding to a total probability of 1.
• This is an important way of assigning probabilities to events.

Continuous Random Variables (Cont’d)
• A continuous random variable X takes all values in an interval of numbers.
• The probability distribution of X is described by a density curve.
• The probability of any event is the area under the density curve and above the values of X that make up the event.
• The probability model for a continuous random variable assigns probabilities to intervals of outcomes rather than to individual outcomes.
• All continuous probability distributions assign probability 0 to every individual outcome.

Normal Distributions as Probability Distributions
• Normal distributions are probability distributions.
• This is because density curves describe an assignment of probabilities.
• As we know, N(μ, σ) is the shorthand notation for a normal distribution. In the language of random variables, if X has the N(μ, σ) distribution, then the standardized variable
\[ Z = \frac{X - \mu}{\sigma} \]
is a standard normal random variable having the distribution N(0, 1).

Section 7.2 – Means and Variances of Random Variables

Rules for Variances
• Two random variables X and Y are independent if knowing that any event involving X alone did or did not occur tells us nothing about the occurrence of any event involving Y alone.
• When random variables are not independent, the variance of their sum depends on the correlation between them as well as on their individual variances.
• We use ρ, the Greek letter rho, for the correlation between two random variables.
• The correlation between two independent random variables is zero.
• Rule 1.
If X is a random variable and a and b are fixed numbers, then
\[ \sigma^2_{a+bX} = b^2\sigma^2_X \]
• Rule 2. If X and Y are independent random variables, then
\[ \sigma^2_{X+Y} = \sigma^2_X + \sigma^2_Y \]
\[ \sigma^2_{X-Y} = \sigma^2_X + \sigma^2_Y \]
• This is the addition rule for variances of independent random variables.
• Rule 3. If X and Y have correlation ρ, then
\[ \sigma^2_{X+Y} = \sigma^2_X + \sigma^2_Y + 2\rho\sigma_X\sigma_Y \]
\[ \sigma^2_{X-Y} = \sigma^2_X + \sigma^2_Y - 2\rho\sigma_X\sigma_Y \]
• This is the general addition rule for variances of random variables.

Combining Normal Random Variables
• Any linear combination of independent normal random variables is also normally distributed. That is, if X and Y are independent normal random variables and a and b are any fixed numbers, aX + bY is also normally distributed. In particular, the sum or difference of independent normal random variables has a normal distribution.

7.2 - Example #1
• A college uses SAT scores as one criterion for admission. Experience has shown that the distribution of SAT scores among its entire population of applicants is such that
SAT Math score X:     \(\mu_X = 625\), \(\sigma_X = 90\)
SAT Verbal score Y:   \(\mu_Y = 590\), \(\sigma_Y = 100\)
What are the mean and standard deviation of the total score X + Y among students applying to this college?
The mean overall SAT score is
\[ \mu_{X+Y} = \mu_X + \mu_Y = 625 + 590 = 1215 \]
The variance and standard deviation of the total cannot be computed from the information given. SAT Verbal and Math scores are not independent, because students who score high on one exam tend to score high on the other also. Therefore Rule 2 does not apply, and we need to know ρ, the correlation between X and Y, to apply Rule 3.

7.2 - Example #1 (Cont’d)
• Nationally, the correlation between SAT Math and Verbal scores is about ρ = 0.7. If this is true for these students,
\[ \sigma^2_{X+Y} = \sigma^2_X + \sigma^2_Y + 2\rho\sigma_X\sigma_Y = (90)^2 + (100)^2 + (2)(0.7)(90)(100) = 30{,}700 \]
The variance of the sum X + Y is greater than the sum of the variances \(\sigma^2_X + \sigma^2_Y\) because of the positive correlation between SAT Math scores and SAT Verbal scores.
That is, X and Y tend to move up together and down together, which increases the variability of their sum. We find the standard deviation from the variance:
\[ \sigma_{X+Y} = \sqrt{30{,}700} \approx 175 \]

7.2 - Example #2
• Zadie has invested 20% of her funds in Treasury bills and 80% in an “index fund” that represents all U.S. common stocks. The rate of return on an investment over a time period is the percent change in the price during the time period, plus any income received. If X is the annual return on T-bills and Y the annual return on stocks, the portfolio rate of return is
\[ R = 0.2X + 0.8Y \]
The returns X and Y are random variables because they vary from year to year. Based on annual returns between 1950 and 2000, we have
X = annual return on T-bills:   \(\mu_X = 5.2\%\), \(\sigma_X = 2.9\%\)
Y = annual return on stocks:    \(\mu_Y = 13.3\%\), \(\sigma_Y = 17.0\%\)
Correlation between X and Y:    \(\rho = -0.1\)
Stocks had higher returns than T-bills on average, but the standard deviations show that returns on stocks varied much more from year to year. That is, the risk of investing in stocks is greater than the risk for T-bills because stock returns are less predictable.

7.2 - Example #2 (Cont’d)
• For the return R on Zadie’s portfolio of 20% T-bills and 80% stocks,
\[ R = 0.2X + 0.8Y \]
\[ \mu_R = 0.2\mu_X + 0.8\mu_Y = (0.2 \times 5.2) + (0.8 \times 13.3) = 11.68\% \]
To find the variance of the portfolio return, combine Rule 1 and Rule 3:
\[ \sigma^2_R = \sigma^2_{0.2X} + \sigma^2_{0.8Y} + 2\rho\,\sigma_{0.2X}\,\sigma_{0.8Y} \]
\[ = (0.2)^2\sigma^2_X + (0.8)^2\sigma^2_Y + 2\rho(0.2\sigma_X)(0.8\sigma_Y) \]
\[ = (0.2)^2(2.9)^2 + (0.8)^2(17.0)^2 + (2)(-0.1)(0.2 \times 2.9)(0.8 \times 17.0) = 183.719 \]
\[ \sigma_R = \sqrt{183.719} = 13.55\% \]
The portfolio has a smaller mean return than an all-stock portfolio, but it is also less risky. As a proportion of the all-stock values, the reduction in standard deviation is greater than the reduction in mean return. That’s why Zadie put some funds into Treasury bills.

7.2 Means and Variances of Random Variables (Continued)
• Mean \(\bar{x}\): the ordinary average of a set of observations.
• Mean of a random variable X: an average of the possible values of X.
Example: Take X to be the amount your ticket pays you. The probability distribution of X is:
Payoff x:      $0      $500
Probability:   0.999   0.001
Long-run average: $500 × (1/1000) + $0 × (999/1000) = $0.50
• You will often find the mean of a random variable X called the expected value.

Mean of a Discrete Random Variable
The mean of a discrete random variable X is a weighted average of the possible values that the random variable can take. Unlike the sample mean of a group of observations, which gives each observation equal weight, the mean of a random variable weights each outcome \(x_i\) according to its probability, \(p_i\). The common symbol for the mean (also known as the expected value of X) is \(\mu_X\), formally defined by
\[ \mu_X = x_1p_1 + x_2p_2 + \cdots + x_kp_k = \sum x_ip_i \]
The mean of a random variable provides the long-run average of the variable, or the expected average outcome over many observations.

Example: Suppose an individual plays a gambling game where it is possible to lose $1.00, break even, win $3.00, or win $5.00 each time she plays. The probability distribution for each outcome is provided by the following table:
Outcome:       -$1.00   $0.00   $3.00   $5.00
Probability:   0.30     0.40    0.20    0.10
The mean outcome for this game is calculated as follows:
\(\mu_X\) = (-1 × 0.3) + (0 × 0.4) + (3 × 0.2) + (5 × 0.1) = -0.3 + 0 + 0.6 + 0.5 = 0.8
In the long run, then, the player can expect to win about 80 cents per play on average, so the odds are in her favor.

• Continuous random variable X: its distribution is described by a density curve.
• Mean: a measure of the center of a distribution.

Variance of a Random Variable
• The variance of a random variable X is denoted by \(\sigma^2_X\), and is sometimes written as Var(X).
• The variance of a random variable can be defined as the expected value of the square of the difference between the random variable and its mean.
• Given that the random variable X has a mean of μ, the variance is expressed as:
\[ \sigma^2_X = \mathrm{Var}(X) = E\big((X - \mu)^2\big) \]

Variance of a Discrete Random Variable
• Discrete random variables are introduced here.
The related concepts of mean, expected value, variance, and standard deviation are also discussed.
• Let X be a numerically valued random variable with expected value μ = E(X). Then the variance of X, denoted by V(X), is
\[ V(X) = E\big((X - \mu)^2\big) \]
• Law of large numbers: as the number of observations increases, the mean \(\bar{x}\) of the observed values gets closer and closer to the mean μ. This is a remarkable fact because it holds for any population, not just for some special class such as normal distributions.
• The mean μ of a random variable is the average value of the variable in two senses.
• μ is the average of the possible values, weighted by their probability of occurring.
• μ is also the long-run average of many independent observations on the variable.

Rules for Means:
• RULE 1: If X is a random variable and a and b are fixed numbers, then
\[ \mu_{a+bX} = a + b\mu_X \]
• RULE 2: If X and Y are random variables, then
\[ \mu_{X+Y} = \mu_X + \mu_Y \]
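
As a rough check on the definitions and rules in this section, the sketch below computes the mean and variance of a discrete random variable directly from its probability table, then verifies Rule 1 for means numerically and reworks the portfolio example. The helper names `rv_mean` and `rv_variance` and the constants a = 2, b = 3 are illustrative choices, not part of the notes. The table is the one from the gambling example (outcomes -$1, $0, $3, $5), and the portfolio figures are the ones quoted in Example #2.

```python
import math


def rv_mean(values, probs):
    """Mean (expected value) of a discrete random variable: sum of x_i * p_i."""
    return sum(x * p for x, p in zip(values, probs))


def rv_variance(values, probs):
    """Variance: expected squared deviation from the mean, sum of (x_i - mu)^2 * p_i."""
    mu = rv_mean(values, probs)
    return sum((x - mu) ** 2 * p for x, p in zip(values, probs))


# Probability table from the gambling example (outcomes in dollars).
outcomes = [-1.0, 0.0, 3.0, 5.0]
probs = [0.30, 0.40, 0.20, 0.10]
assert math.isclose(sum(probs), 1.0)  # a legitimate distribution sums to 1

mu_x = rv_mean(outcomes, probs)       # about 0.8 dollars per play
var_x = rv_variance(outcomes, probs)  # about 3.96

# Rule 1 for means: the mean of a + bX equals a + b * mu_X.
a, b = 2.0, 3.0                       # arbitrary illustrative constants
shifted = [a + b * x for x in outcomes]
mu_shifted = rv_mean(shifted, probs)

# Portfolio example: R = 0.2X + 0.8Y with correlation rho between X and Y.
mu_t, sd_t = 5.2, 2.9                 # T-bills (percent)
mu_s, sd_s = 13.3, 17.0               # stocks (percent)
rho = -0.1
w_t, w_s = 0.2, 0.8

mu_r = w_t * mu_t + w_s * mu_s        # rules for means
var_r = (w_t * sd_t) ** 2 + (w_s * sd_s) ** 2 + 2 * rho * (w_t * sd_t) * (w_s * sd_s)
sd_r = math.sqrt(var_r)               # variance Rules 1 and 3 combined

print(f"Game: mean = {mu_x:.2f}, variance = {var_x:.2f}")
print(f"Mean of a + bX = {mu_shifted:.2f}, a + b*mean = {a + b * mu_x:.2f}")
print(f"Portfolio: mean = {mu_r:.2f}%, standard deviation = {sd_r:.2f}%")
```

Computing the mean of a + bX from the transformed table and comparing it with a + b·μ is only a numerical spot check of Rule 1 for one distribution, not a proof of the rule.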