MAT 320

Probability is the branch of mathematics that deals with randomness and uncertainty. It may be the most applicable of the mathematical disciplines.

“It is unlike other branches of math because they may not have Probability on Mars.”

“Statistics … the most important science in the whole world: for upon it depends the practical application of every other science and of every art; the one science essential to all political and social administration, all education, all organization based upon experience, for it only gives the result of our experience.”
Florence Nightingale (1820 – 1910)

Statistics: the science that deals with the collection, description, analysis and interpretation of data. (Mugno, 1997)

When most people hear "statistics" they think of descriptive statistics: numbers or graphics that summarize a data set. Examples include batting average, median income, disease prevalence, etc. Descriptive statistics are important but usually fairly simple.

Inferential statistics use sample data to make estimates, decisions and predictions about a larger data set or population. This is why statistics is so important.

Four major themes to keep in mind throughout the course:

Design: How the data are collected is extremely important and will greatly affect your analysis and interpretation. Designed experiments, surveys, polls, etc.
Description: How the data are presented can strongly affect how your results are perceived by the reader. Bad practices can produce very misleading results. Descriptive statistics, graphs, tables, etc.
Analysis: It is very important to use a proper methodology and to know which descriptive statistics are appropriate. Weighted, biased, etc.
Interpretation: Making inferences or decisions based on the design, descriptive statistics and analysis. This is what makes statistics so important and powerful, especially in today’s data-driven society.

Ex. THE FOLLOWING RESULTS ARE BASED ON A POLL OF 664 LIKELY REPUBLICAN PRIMARY VOTERS

1REP.
If the Republican primary for Governor were being held today, would you vote for Tom Foley, Mike Fedele, or Oz Griebel? (If undecided q1REP) As of today, would you say that you lean a little more toward Foley, Fedele, or Griebel? (This table includes "Leaners".)

LIKELY REP PRIMARY VOTERS
                      Tot    Men    Wom
Foley                 38%    43%    33%
Fedele                30     27     33
Griebel               17     18     16
SMONE ELSE(VOL)        -      -      -
WLDN'T VOTE(VOL)       -      -      -
DK/NA                 14     12     17

From August 3 - 8, Quinnipiac University surveyed 664 Connecticut Republican likely primary voters with a margin of error of +/- 3.8 percentage points and 464 Democratic likely primary voters with a margin of error of +/- 4.6 percentage points. These likely voters were selected from lists of people who have voted in past elections. The Quinnipiac University Poll conducts public opinion surveys in New York, New Jersey, Connecticut, Pennsylvania, Florida, Ohio and the nation as a public service and for research.

What type of design was used to collect these data?
How are these data described?
How were the data analyzed?
What conclusions (inferences) are drawn from the data?

The design here is a poll. Specifics are given as to how the poll participants were selected. This can be very important and can greatly affect the results and interpretations.
We know about 3 variables here: gender, party affiliation (R), and which candidate the respondent is likely to vote for.
The data are described by percentages (relative frequencies), and the total is given. They are placed in a table (a two-dimensional frequency table, or contingency table).
The data analysis was very simple: they just found the percentages of responses.
The conclusions generalize these results from the sample to the larger population of likely voters.

Note: Polls are not always reliable, but assume that this one is.
The margin of error is given as ±3.8%. Is this good news for Foley or Fedele? Why?
DK/NA is 14%. Is this good news for Foley or Fedele? Why?
Notice anything else of interest?
What is the probability that Foley wins?
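The quoted margins of error can be roughly reproduced from the sample sizes alone. The sketch below assumes a simple random sample, a 95% confidence level (z = 1.96) and the conservative proportion p = 0.5; none of these assumptions is stated in the poll release, and margin_of_error is an illustrative helper name, not part of any polling library.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumptions (not taken from the poll release): simple random
    sample of size n, worst-case proportion p = 0.5, and z = 1.96
    for 95% confidence.
    """
    return z * math.sqrt(p * (1 - p) / n)

moe_rep = margin_of_error(664)   # Republican likely primary voters
moe_dem = margin_of_error(464)   # Democratic likely primary voters
print(f"n=664: +/- {100 * moe_rep:.1f} points")
print(f"n=464: +/- {100 * moe_dem:.1f} points")

# Foley (38%) vs. Fedele (30%): do the +/- intervals overlap?
foley_low = 0.38 - moe_rep
fedele_high = 0.30 + moe_rep
print("intervals overlap:", fedele_high >= foley_low)
```

Under these assumptions the first value comes out close to the quoted 3.8 points. The second lands slightly below the quoted 4.6, which may reflect rounding or weighting choices that the release does not describe.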
Some key terms:

Subjects: entities that we wish to measure.
Population: the total set of subjects that we wish to study.
Sample: a subset of the population.
Variable: a characteristic of the subject.
Design: the plan to obtain the data.
Inference: a decision or generalization about the population based on the sample.
Probability: the branch of mathematics that deals with randomness and chance.
Descriptive Statistics: methods for summarizing data.
Inferential Statistics: methods for making decisions or generalizations about a population.
Parameter: a numerical summary of the population.
Statistic: a numerical summary of the sample.

General methodology: a researcher wants to know about a parameter. Because of limited resources (time, money, etc.) the researcher takes a (representative) sample from the population, calculates the statistics that will enable the estimation of the parameter, then makes an inference about the population based on the statistics and probability. The researcher may also include graphs, charts or tables to help describe the findings.

Randomness helps ensure a representative sample.
Random number generation: each subject of the population has an equal chance of being included in the sample.

Ex. Polio Vaccine Trial
The trial was conducted by the National Foundation for Infantile Paralysis (NFIP). First a sample of 3rd grade children was selected, all of whose parents consented to vaccination. The sample was randomly divided into two groups. One group was given the polio vaccination; the other group was given a placebo (three injections of inert saltwater that appeared identical to the three injections of the real vaccine). Additionally, none of the participants knew the group identity: not the child, not the parents, and not the examining doctors. The results are listed in the table below. Does this provide evidence at the 1% significance level that the polio vaccine lowers the risk of polio?
http://wps.aw.com/wps/media/objects/14/15269/projects/ch12_salk/index.html

Treatment      Vaccine    Placebo
Sample size    200745     201229
Polio              57        142

Subjects: 3rd grade children
Population: all children
Sample: the 400,000+ children given a treatment
Variables: vaccine or placebo, and polio or no polio
Design: randomized trial (see below)
Inference: comparing the true (population) proportion of polio for the vaccinated and unvaccinated groups
Probability: stated with the inference as a significance level, like 1% or 5%
Parameter: the proportion of all vaccinated children who get polio and the proportion of all non-vaccinated children who get polio
Statistic: 57/200745 and 142/201229 are the sample proportions

This type of design is known as a randomized control experiment. The randomization tends to nullify all effects (confounding variables) except the treatment effect. An experiment of this type, in which both the subjects and the evaluators are ignorant of the treatment/control status, is known as a double-blind experiment. The randomized control, double-blind design is considered the gold standard of statistical designs.

Univariate: data set with observations on a single variable
Bivariate: data set with observations on each of two variables
Multivariate: data set with observations on more than one variable

Concrete vs. conceptual populations
Concrete population: when the population really exists. Examples: all college students, all voters, all black widow spiders, all widgets produced in a factory.
Conceptual population: when the population does not actually exist. Examples: all speeds a car can crash at, all altitudes a plane can fly at, all temperatures possible for making widgets.

Enumerative vs. Analytical studies: unchanging and finite vs.
futuristic
Enumerative: all college students in 2008, all Toyota Camrys in 2006
Analytical: all college students over the next 3 years, all hybrids until 2010

Chapter 1.2

Two types of data: qualitative and quantitative

Qualitative: non-numeric; sometimes called categorical, because the data can be divided up into classes or categories. Can be further divided into nominal and ordinal data.
Ordinal: not numeric in the true sense, but the order of the classes is inherent and important. Ex. Grade in school: Freshman, Sophomore, Junior, Senior, which could be coded as 1, 2, 3, 4.
Nominal: the order is arbitrary. Ex. Favorite color: Blue, Red, Green, Other.

Quantitative: numeric data. Can be divided into discrete and continuous.
Discrete: finite or countably infinite number of possibilities. 0, 1, 2, …
Continuous: the range of possibilities forms an interval. (0,1)

Examples:
Population: Students at SCSU
Sample: Take a random sample of 100 students
Possible variables:
GPA: quantitative, continuous.
Height in inches: quantitative, discrete.
Hair color: qualitative, nominal.
Home area code: qualitative, nominal (it is numeric, but the numbers do not count or measure anything).
Letter grade you received in Calculus I: qualitative and ordinal (order matters).

Graphical Displays (review these if you need to)

For summarizing categorical data the primary displays are Pie Charts and Bar Graphs.
Pie Charts: circles where each “slice” represents a category and the size of each slice corresponds to the proportion or percentage of observations in that category.
Bar Graphs: display a vertical bar for each category. The height of the bar is the percentage or proportion of observations in that category.

The proportion of observations in a class or category is the frequency of observations that fall in the class divided by the total number of observations. The percentage is the proportion times 100. Proportions and percentages are both known as relative frequencies.
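The relative-frequency definitions above can be sketched in a few lines of Python. The favorite-color responses below are hypothetical data made up for illustration; only the definitions of proportion and percentage come from the notes.

```python
from collections import Counter

# Hypothetical responses for a nominal variable (favorite color).
responses = ["Blue", "Red", "Blue", "Green", "Other",
             "Blue", "Red", "Green", "Blue", "Red"]

counts = Counter(responses)          # frequency of each category
n = len(responses)                   # total number of observations

# proportion = frequency / total; percentage = proportion * 100
proportions = {cat: freq / n for cat, freq in counts.items()}
percentages = {cat: 100 * p for cat, p in proportions.items()}

for cat in counts:
    print(f"{cat:<6} freq={counts[cat]:>2}  "
          f"prop={proportions[cat]:.2f}  pct={percentages[cat]:.0f}%")
```

The proportions always sum to 1 (and the percentages to 100), which is a quick check that no observation was dropped or counted twice.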
A frequency table is a listing of all the classes and their corresponding frequencies. It is necessary to create a frequency table before making a pie chart, bar graph or histogram (later).

Examples:
Nominal: Cell phone carrier.
Ordinal: Year at college.

Graphs for quantitative variables

Dot plots: a dot for each observation is placed above the appropriate number on a number line.
Stem and Leaf Plots: each observation is represented as a stem and a leaf. The stem usually consists of all the digits of the number except for the last one, which is the leaf.
Dot plots and stem and leaf plots are only reasonable for small data sets. Ex. Heights.

For larger data sets a Histogram is used. A Histogram is a graph that uses bars to portray the frequencies or relative frequencies of the possible outcomes for a quantitative variable.

Steps for constructing a Histogram
1. Divide the range of the data into classes: non-overlapping intervals of equal length. For a discrete set of data with a small number of values, use the actual values as the classes.
2. Count the number of observations in each class, forming a frequency table.
3. On the horizontal axis, label the values or the endpoints of the intervals. Draw a bar over each class or value with height equal to its frequency (or percentage). The vertical axis should be scaled and labeled with either the raw or relative frequencies. Both the horizontal and vertical axes should be scaled so that all the classes and frequencies fit and are distinguishable.

Histograms are one of the most misunderstood concepts in statistics. A histogram is simply a bar graph of the frequency distribution of the data.

Histogram example. Heights of students (inches) from a previous class.

Heights  57 59 66 60 70 47 61 55 57 71
Heights  74 48 70 67 62 62 58 55 62 68

We have 20 data points and want to break them up into 4 or 5 classes.
Guideline: if n is the total number of observations and k is the number of classes, then k ≈ √n. Here √20 ≈ 4.5.

The range of the data = max – min = 74 – 47 = 27. Note that there are 28 integers between 47 and 74 if you include the endpoints. So 4 classes that are 7 units long will work fine. The classes would then be [47, 53], [54, 60], [61, 67], [68, 74]. Note that all the data points are included and no point is in more than one class.

The frequency distribution would then be:

Class       Freq
[47, 53]     2
[54, 60]     7
[61, 67]     6
[68, 74]     5

Note that the total is 20. Now just make a bar graph of the frequency distribution.

[Figure: "Histogram of Heights". Horizontal axis: Height in inches, with classes [47, 53], [54, 60], [61, 67], [68, 74]. Vertical axis: Number (Freq), scaled 0 to 8.]

Note that the axes are labeled, the histogram is titled, and there are no gaps between the bars.

For quantitative data there are 3 kinds of plots:
Dot plots
Stem and leaf plots
Histograms

Dot plots and stem and leaf plots are used for small data sets (under 50 observations). Histograms are more flexible, because the classes can be chosen to suit the data.

Histograms, dot plots and stem-and-leaf plots allow us to see the shape of the distribution. Things to look for:
1. Outlier detection: rare or unusual observations.
2. The mode, or most common observation or class. Unimodal vs. bimodal.
3. Symmetry of the data set.
a. Symmetric: when you divide the histogram down the middle, the left side is a mirror image of the right side.
b. Skewed left: the left tail of the histogram is longer than the right tail. The small observations are more extreme than the large observations.
c. Skewed right: the right tail of the histogram is longer than the left tail. The large observations are more extreme than the small observations.

Another Histogram example: page 22 #20
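Before turning to that exercise, the frequency table for the height data can be tallied in code. This is a minimal sketch in Python that follows the guideline of roughly √n classes and closed integer classes of equal width; the rounding choices and the text-bar display are assumptions made for illustration.

```python
import math

heights = [57, 59, 66, 60, 70, 47, 61, 55, 57, 71,
           74, 48, 70, 67, 62, 62, 58, 55, 62, 68]

n = len(heights)
k = round(math.sqrt(n))                  # guideline: about sqrt(n) classes
lo, hi = min(heights), max(heights)
width = math.ceil((hi - lo + 1) / k)     # integers covered by each class

# Closed integer classes [start, start + width - 1], no overlaps.
classes = [(lo + i * width, lo + (i + 1) * width - 1) for i in range(k)]
freq = {c: sum(c[0] <= h <= c[1] for h in heights) for c in classes}

# Crude text histogram: one '#' per observation in the class.
for (a, b), f in freq.items():
    print(f"[{a}, {b}]  {f:>2}  {'#' * f}")
```

Every observation falls in exactly one class, so the frequencies should add up to n.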
0 | 123334555599
1 | 00122234688
2 | 1112344477
3 | 0113338
4 | 37
5 | 23778

Bin             Frequency
(0, 1000]          13
(1000, 2000]       10
(2000, 3000]       10
(3000, 4000]        7
(4000, 5000]        2
(5000, 6000]        5

23 points < 2000: 23/47 = .489
17 points between 2000 and 4000: 17/47 = .362
Positively skewed (skewed right).
(Note: the exercise really should have specified left or right endpoint inclusion.)

Measures of Central Tendency

Mean: the average
Sample mean: the average of the sample.
This is a capital sigma: ∑. It means to take the sum. Symbolically we write the formula for the sample mean as:

$\bar{x} = \frac{\sum x}{n}$ (x-bar)

x-bar is used to estimate μ = population mean.

Median: the middle
Sample median: the middle of the sample. Symbolically we will write the sample median as x~.

To find the sample median:
1. Sort the n observations (in ascending order).
2. If n is odd, let k = (n + 1) / 2. Then x~ = the kth observation.
3. If n is even, let k = n / 2 and j = (n + 2) / 2. Then x~ is the average of the kth and the jth observations.

x~ is used to estimate μ~ = the population median.

Outlier: an observation that falls outside the pattern of the data.

Ex. The following sample is 10 scores from a test given last semester.

75 84 86 68 93 97 32 90 80 70

Find the mean. Find the median. Make a dot plot. Are there any outliers? If so, identify them.

∑x = 775 and n = 10, so the mean = 775 / 10 = 77.5
Mean = 77.5

Sort the data: 32 68 70 75 80 84 86 90 93 97
n = 10, 10 / 2 = 5 and 12 / 2 = 6, so the median is the average of the 5th and 6th observations = (80 + 84) / 2 = 82.
Median = 82

[Dot plot: one dot per score above a number line labeled 30, 40, 50, 60, 70, 80, 90.]

32 seems to be an outlier.

How can outliers affect the mean and the median?

Assume that the person who got the 32 drops the class, because the student got really sick. The remaining (sorted) data look like:

68 70 75 80 84 86 90 93 97

So now n = 9 and ∑x = 743, so
Mean = 743 / 9 ≈ 82.6
Median = 5th observation = 84

The mean increased about 5 points, but the median only increased 2 points.
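The effect of the outlier can be checked directly. The sketch below uses Python's standard statistics module on the test scores from the example; the variable names are illustrative.

```python
import statistics

scores = [75, 84, 86, 68, 93, 97, 32, 90, 80, 70]
without_outlier = [x for x in scores if x != 32]   # the student who dropped

for label, data in [("all scores", scores), ("without 32", without_outlier)]:
    mean = statistics.mean(data)       # sum of observations / n
    median = statistics.median(data)   # middle value (or average of middle two)
    print(f"{label:<11} n={len(data):>2}  mean={mean:.1f}  median={median}")
```

Removing a single extreme score moves the mean noticeably more than the median, which is the sense in which the median resists outliers.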
The mean is a weighted measure, whereas the median is a resistant measure. A measure is resistant if extreme observations have little, if any, effect on it.

To calculate the mean and the median, as well as some other important statistics, on the TI83/TI84:
1. Enter the data into a list. Hit [STAT]. Choose 1:Edit. In L1, enter the data. Ex. 75 [ENTER] 84 [ENTER] … 70 [ENTER]
2. Hit [STAT]. Hit the right arrow to highlight CALC. Choose 1:1-Var Stats and hit [ENTER]. The screen should read: 1-Var Stats. Then hit [2nd] [1] (L1), so that the screen reads: 1-Var Stats L1. Hit [ENTER].

The output should look like:

1-Var Stats
x-bar = 77.5
∑x = 775
∑x² = 63183
Sx = 18.62047857
σx = 17.66493702
n = 10
(to see more hit the down arrow)
minX = 32
Q1 = 70
Med = 82
Q3 = 90
maxX = 97

Q1 is the first quartile = 25th percentile.
Q3 is the third quartile = 75th percentile.
Sx = sample standard deviation.
σx = population standard deviation.
More on these later.

Trimmed Mean of p percent: removes the top and bottom p% of the observations and then finds the mean of what remains. This is a compromise between x-bar and x~. It is a weighted measure that is more resistant to outliers than x-bar.

Ex. The 10% trimmed mean of the data below is:

Original: 32 68 70 75 80 84 86 90 93 97
Trimmed:     68 70 75 80 84 86 90 93

10% trimmed mean = 80.75

Ex. An airline company is wondering about the number of cancellations it receives for a specific commuter flight. The airline takes a random sample of 15 days. The data are listed below. Find the mean and the median for the sample. Make a dot plot of the data. Are there any outliers? Describe the symmetry of the data.

4, 24, 17, 17, 9, 12, 9, 12, 13, 14, 14, 15, 15, 16, 16

x-bar = 13.8
x~ = 14

Another way to determine symmetry:
Data are symmetric if x-bar = x~ (does not have to be exact; within 10%).
Data are skewed right if x-bar > x~.
Data are skewed left if x-bar < x~.

These data appear to be symmetric. There are no clear outliers, but one could argue that both 4 and 24 are outliers.
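The trimmed mean and the mean-versus-median symmetry check can be sketched as follows. The helper trimmed_mean is hypothetical (not a TI or textbook routine), it assumes the trim count is rounded down to a whole number of observations, and the "within 10%" rule is interpreted here as relative to the median, which is an assumption.

```python
import statistics

def trimmed_mean(data, percent):
    """Mean after dropping `percent`% of the observations from each end.

    Assumption: the number trimmed per tail is rounded down to a
    whole observation; textbooks differ on handling fractions.
    """
    ordered = sorted(data)
    k = int(len(ordered) * percent / 100)      # observations cut per tail
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.mean(kept)

scores = [75, 84, 86, 68, 93, 97, 32, 90, 80, 70]
print("10% trimmed mean:", trimmed_mean(scores, 10))

cancellations = [4, 24, 17, 17, 9, 12, 9, 12, 13, 14, 14, 15, 15, 16, 16]
xbar = statistics.mean(cancellations)
xmed = statistics.median(cancellations)

# Rule of thumb: roughly symmetric if mean and median agree within 10%
# (measured here relative to the median, an assumed interpretation).
roughly_symmetric = abs(xbar - xmed) / xmed <= 0.10
print(f"x-bar={xbar:.1f}  median={xmed}  roughly symmetric: {roughly_symmetric}")
```

With 10 observations, a 10% trim removes exactly one value from each end, so the 32 and the 97 are both dropped before averaging.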
Measures of Variability

First we measured the center of the data: the mean and the median. We also looked at the shape of the data: unimodal or bimodal, symmetric or skewed. Now we look at how spread out the data are.

The first measure is simple but does not tell us much about the spread. The range is the difference between the largest and smallest observations.

Range = max - min

A better measure would summarize the deviations from the center of the data. A deviation of an observation x from the mean xbar is the difference (x - xbar). A deviation is positive if x is bigger than xbar. A deviation is negative if x is smaller than xbar.

Unfortunately, if we sum all the deviations of any data set we get 0, because of how xbar is defined. So before we sum up the deviations, we square them, which makes them all non-negative. The sum of these squared deviations divided by n - 1 (essentially their average) is called the variance and is denoted by s². The formulae for s and s² are given in your text on pages 32 and 34.

The square root of s² is s, which is called the standard deviation. The bigger the standard deviation, the more spread out the data are. We use the standard deviation more often than the variance because the standard deviation is in the same units as the data and the mean.

s² is used to estimate σ² = population variance.
s is used to estimate σ = population standard deviation.

We will not use the formulae much because your calculator will do it for you. Remember that under 1-Var Stats there was Sx, which is the standard deviation. Technically this is the sample standard deviation, which is what we want.

Ex. A random sample of 10 grades is given below. Calculate the mean and standard deviation of the sample.

Grades (x)   (x - xbar)   (x - xbar)^2
    95          17.8         316.84
    87           9.8          96.04
    45         -32.2        1036.84
    76          -1.2           1.44
    76          -1.2           1.44
    82           4.8          23.04
    68          -9.2          84.64
    63         -14.2         201.64
    92          14.8         219.04
    88          10.8         116.64
  Sum           0.000       2097.600

x-bar = 77.2
s^2 = 2097.600 / 9 = 233.067
s = 15.267

Interpreting the standard deviation.
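Before interpreting s, it may help to see the deviation table for the grades rebuilt step by step. This is a minimal sketch in plain Python, assuming the sample formula with divisor n - 1 (the one the calculator reports as Sx).

```python
import math

grades = [95, 87, 45, 76, 76, 82, 68, 63, 92, 88]
n = len(grades)
xbar = sum(grades) / n

deviations = [x - xbar for x in grades]     # (x - xbar), sums to 0
squared = [d ** 2 for d in deviations]      # (x - xbar)^2, all >= 0

variance = sum(squared) / (n - 1)           # sample variance s^2
std_dev = math.sqrt(variance)               # sample standard deviation s

print(f"x-bar = {xbar:.1f}")
print(f"sum of deviations = {sum(deviations):.3f}")
print(f"sum of squared deviations = {sum(squared):.3f}")
print(f"s^2 = {variance:.3f}, s = {std_dev:.3f}")
```

Dividing by n instead of n - 1 would give the population version (σx on the calculator), which is slightly smaller for the same data.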
In general, the greater the spread, the greater s is. Also, s = 0 means that there is no deviation, which only happens when all the observations are the same. For example, if your data set was 20, 20, 20, 20, 20, 20, then s = 0.

Proposition: Let x1, x2, x3, …, xn be a sample and let c be any non-zero constant. Then
a. if y1 = x1 + c, y2 = x2 + c, …, yn = xn + c, then $s_y^2 = s_x^2$, and
b. if y1 = cx1, y2 = cx2, …, yn = cxn, then $s_y^2 = c^2 s_x^2$ and $s_y = |c| s_x$.

Measures of Relative Standing and Boxplots

The pth percentile is a value such that p percent of the observations fall below or at that value. You have probably seen percentiles on standardized tests. The median is a percentile, the 50th.

Three useful percentiles that we will use are the quartiles.
The first quartile is called Q1 and is the 25th percentile. It is also the median of the lower half of the data.
The median is the second quartile, called Q2.
The third quartile is called Q3 and is the 75th percentile. It is also the median of the upper half of the data.
The TI calculators can calculate all 3 of them for you.

Some people look at Q0 as the minimum observation and Q4 as the maximum observation. These 5 numbers together are called the 5-number summary of the data. These numbers can be used to detect outliers and to create a visual display of the data called a box plot.

First we need to calculate the Inter Quartile Range (IQR = Q3 - Q1 = fourth spread = fs).

Constructing a box plot
1. Calculate the 5-number summary.
2. A box is drawn from Q1 to Q3 (vertical lines at the quartiles).
3. A (vertical) line is drawn at the median.
4. A whisker (horizontal line) is drawn from Q1 to the smallest observation that is bigger than Q1 - 1.5*IQR. A whisker (horizontal line) is drawn from Q3 to the largest observation that is smaller than Q3 + 1.5*IQR.

Any observation that is outside the whiskers, either less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR, is a potential outlier.

Ex.
Grades (x)  95 87 35 76 76 82 68 63 92 88
Sorted x    95 92 88 87 82 76 76 68 63 35

Median = (76 + 82) / 2 = 79
Q1 = 68
Q3 = 88
Min = 35
Max = 95
IQR = 88 – 68 = 20
1.5 * IQR = 30

68 – 30 = 38. Since 35 < 38, it is a potential outlier. (35 is more extreme than 38.)
88 + 30 = 118. Since 95 < 118, it is NOT a potential outlier. (95 is less extreme than 118.)

Note that your book distinguishes between outliers and extreme outliers:
Outlier: any observation x that is more than 1.5 IQR from the closest quartile (Q1, Q3). Q1 – x > 1.5IQR or x – Q3 > 1.5IQR
Extreme Outlier: any observation x that is more than 3 IQR from the closest quartile (Q1, Q3). Q1 – x > 3IQR or x – Q3 > 3IQR
They mark outliers with a solid circle and extreme outliers with an open circle.

Boxplots can be used to compare two sets of observations. They are usually graphed next to each other. Also note that boxplots can be vertical or horizontal.

Misleading your audience with statistics.

Guidelines for Constructing Effective Graphs
1. Label both axes and provide a title.
2. Compare relative sizes accurately; scale correctly! The Y axis should start at 0.
3. Use standard shapes and symbols.
4. Displaying more than one group on a single graph can be difficult.
5. Mean vs. Median (Baseball examples)
6. Percents vs. Frequencies
7. Simpson’s Paradox.

Don'ts
1. Do not use scale breaks in any of your axes!
2. When making a histogram, use classes and bars of the same width.
3. Do not make inferences about the population from one simple statistic like the mean, especially when you have a small sample size.

Simpson’s Paradox.

A baseball example: Batting average = number of hits / number of qualifying at bats. Who has the better batting average?

               A (Hits/AB)   Avg     B (Hits/AB)   Avg
Vs Lefties     200/500       .400    5/10          .500
Vs Righties    30/100        .300    210/590       .356

Batter B is better vs. left-handed pitching and better vs. right-handed pitching.
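The split averages, and what happens when the splits are pooled, can be tabulated exactly with fractions. This is an illustrative sketch in Python using the hits and at-bats from the batting example; the helper names average and overall are made up for this example.

```python
from fractions import Fraction

# (hits, at-bats) for each batter against each type of pitcher
batter_a = {"lefties": (200, 500), "righties": (30, 100)}
batter_b = {"lefties": (5, 10), "righties": (210, 590)}

def average(hits, at_bats):
    """Batting average as an exact fraction."""
    return Fraction(hits, at_bats)

def overall(splits):
    """Pooled average: total hits over total at-bats (not a mean of averages)."""
    hits = sum(h for h, _ in splits.values())
    at_bats = sum(ab for _, ab in splits.values())
    return Fraction(hits, at_bats)

for split in batter_a:
    a = average(*batter_a[split])
    b = average(*batter_b[split])
    print(f"vs {split:<8} A={float(a):.3f}  B={float(b):.3f}  B better: {b > a}")

a_all, b_all = overall(batter_a), overall(batter_b)
print(f"overall     A={float(a_all):.3f}  B={float(b_all):.3f}  A better: {a_all > b_all}")
```

The pooled averages weight each split by its number of at-bats, and the two batters have very different mixes of at-bats, which is what allows the comparison to flip.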
               A (Hits/AB)   Avg     B (Hits/AB)   Avg
Vs Lefties     200/500       .400    5/10          .500
Vs Righties    30/100        .300    210/590       .356
Totals         230/600       .383    215/600       .358

But overall batter A has a better average.

Simpson’s Reversal of Inequalities:
• E.H. Simpson in 1951 noted that it is possible to have

$\frac{a}{b} < \frac{A}{B}$ and $\frac{c}{d} < \frac{C}{D}$, but $\frac{a+c}{b+d} > \frac{A+C}{B+D}$

Recall that, in general:

$\frac{a}{b} + \frac{c}{d} \neq \frac{a+c}{b+d}$

Hiring Practices at University of California at Berkeley: Was there a hiring preference given to males?

Dept     History      Geography    Total
Men      1/5 (20%)    6/8 (75%)    7/13 (54%)
Women    2/8 (25%)    4/5 (80%)    6/13 (46%)

Note first that the overall percentage of males hired (54%) is greater than that for females (46%). However, the History department hired a greater percentage of females (25% to 20%), as did the Geography department (80% to 75%).

Two important aspects to note: First, the History department and the Geography department do not talk to each other before making a hire. Second, more females applied for the jobs that were harder to get: History only had 3 positions available, whereas Geography had 10 positions available. Knowing these two facts, there is no evidence of gender discrimination (in favor of males, anyway).

This is data from a real case that was actually brought before the California legislature, which, even though it was presented with similar evidence, concluded that something like this could never happen and that Berkeley was guilty of gender bias.

If $\frac{a}{b} = \frac{2a}{2b}$ and $\frac{c}{d} = \frac{3c}{3d}$, does that mean that $\frac{a+c}{b+d} = \frac{2a+3c}{2b+3d}$?

Example 1
$\frac{1}{2} = \frac{2}{4}$ and $\frac{1}{3} = \frac{3}{9}$, but $\frac{1+1}{2+3} = \frac{2}{5}$ while $\frac{2+3}{4+9} = \frac{5}{13}$, and .4 > .385

Example 2
$\frac{1}{3} = \frac{2}{6}$ and $\frac{1}{2} = \frac{3}{6}$, but $\frac{1+1}{3+2} = \frac{2}{5}$ while $\frac{2+3}{6+6} = \frac{5}{12}$, and .4 < .417

Conclusions:
Be very careful how and when you add ratios or fractions.
Be very careful how you interpret your findings when adding ratios or fractions.

Bibliography
Simpson, E.H. (1951), “The interpretation of interaction in contingency tables.”
http://en.wikipedia.org/wiki/Simpson%27s_paradox
http://en.wikipedia.org/wiki/Low_birth_weight_paradox
http://plato.stanford.edu/entries/paradox-simpson/