Department of Statistics
TEXAS A&M UNIVERSITY
STAT 211
Instructor: Keith Hatfield

Topic 1: Data collection and summarization
• Populations and samples
• Frequency distributions
• Histograms
• Mean, median, variance and standard deviation
• Quartiles, interquartile range
• Boxplots

What is Statistics?
• What do you think of when you hear the word “statistics”? (sports, boring, not applicable to my field of study)
• Statistics: The science of collecting, classifying, and interpreting data.
• Anticipated learning outcomes:
  – appreciate and apply basic statistical methods in an everyday life setting (Election polls, clinical trials, lies, big lies & statistics)
  – appreciate and apply basic statistical methods in their scientific field

Collecting data
• Observational study
  – Observe a group and measure quantities of interest.
  – This is passive data collection in that one does not attempt to influence the group.
  – The purpose of the study is to describe the group.
• Experimental study
  – Deliberately impose treatments on groups in order to observe responses.
  – The purpose is to study whether the treatments cause a change in the responses.

Observational Study Terms
• Population: The entire group of interest
• Sample: A part of the population selected to draw conclusions about the entire population
• Census: A sample that attempts to include the entire population
• Parameter: A concept that describes the population
• Statistic: A number produced from a sample that estimates a population parameter

Horry County SC, Murder Case
• Do juries properly represent the racial makeup of Horry County, which is 13% African American?
• What is the population parameter of interest?
• What sample statistic could be used to estimate the parameter and does the sample support the claim?
• 295 jurors summoned, 22 were African American

Experiment Terms
• Experimental Group: A collection of experimental units subjected to a difference in treatment, imposed by the experimenter.
• Control Group: A collection of experimental units subjected to the same conditions as those in an experimental group except that no treatment is imposed.
• This design helps control for potential confounding effects.

What are “confounding” effects?
• Confounding occurs when a study has multiple factors and you can’t tell which factor causes a change in the variable of interest.
• Example: Does going to church make you live longer? Not necessarily. There are too many other factors or “lurking variables”, discussed later.
• It is best to set up a study with everything else held constant and only one factor changed. That way, you’re more apt to identify that the change in the variable is due to the change you instituted in the study.

NCTR study (National Center for Toxicological Research)
• A large-scale study was conducted to see if a new drug might have potential toxic effects. They used rats for the experiment.
• Dose groups of 0, 100, 200, and 400 ppg were evaluated for liver tumors at the end of a two-week exposure to the drug. (Which is the control and which are the experimental groups?)
• What comparisons would you want to make?
• Should you evaluate each group on consecutive days at the end of the study?

Analyzing data with StatCrunch
• StatCrunch is a statistical software package that runs through a Web browser.
• You can access StatCrunch once you have registered and created an account ($$). See the information tab in eCampus for details.
• There are no tutorials for StatCrunch, but demonstrations of how to perform basic tasks and tests will be done in class.
• Note that the homework uses StatCrunch. Several datasets will be given in the homework and in class examples. I don’t advise using your calculator for this purpose as it can be tedious and lead to input errors.

All about variables
• Variable: Any characteristic or quantity to be measured on units in a study
• Categorical variable: Places a unit into one of several categories
  – Examples: Gender, race, political party
• Quantitative variable: Takes on numerical values for which arithmetic makes sense
  – Examples: SAT score, number of siblings, cost of textbooks
• Univariate data has one variable.
• Bivariate data has two variables.
• Multivariate data has three or more variables.

Cereal data
mfr        A = American Home; G = General Mills; K = Kelloggs; N = Nabisco; P = Post; Q = Quaker Oats; R = Ralston Purina
type       cold or hot
calories   calories per serving
protein    grams of protein
fat        grams of fat
sodium     milligrams of sodium
fiber      grams of dietary fiber
carbo      grams of complex carbohydrates
sugars     grams of sugars
potass     milligrams of potassium
vitamins   vitamins and minerals - 0, 25, or 100, indicating the typical percentage of FDA recommended
shelf      display shelf (1, 2, or 3, counting from the floor)
weight     weight in ounces of one serving
cups       number of cups in one serving
rating     a rating of the cereal

Summarizing a single categorical variable
• Frequency - number of times the value occurs in the data
• Relative frequency - proportion of the data with the value

Cereal data:
mfr   Frequency   Relative Frequency
A     1           0.012987013
G     22          0.2857143
K     23          0.2987013
N     6           0.077922076
P     9           0.116883114
Q     8           0.103896104
R     8           0.103896104

Analyzing a single quantitative variable
• Consider the concentration data, which contains the concentration of suspended solids in parts per million at 50 locations along a river.
• What is a typical concentration? (Generally characterized by the center of the data)
• How much spread is there in the concentrations along the river? (Generally, the relative “width” of the data: how dispersed they are around the center)
  – Wide versus narrow and the inherent good and bad things about spread.
  – Discuss the difference in typical and spread if taken at a single point on the river, versus several points along the river.

Histograms
• Histogram - bar graph of binned or grouped data where the height of the bar above each bin denotes the frequency (relative frequency) of values in the bin
• Typical concentration?
• Spread?
• Roughly how many concentrations below 50?

Choosing the number of histogram bins
• General rule: # of bins ≈ √(# of observations)
  – Most stat packages will do this for you, but sometimes you may want to change the number of bins or categories, depending on what you want the data to convey.
• Following is a sample of historical geyser eruptions from Old Faithful in Yellowstone National Park. A demonstration is done in class; the typical outputs on the next two slides show the same data from different perspectives.

Old Faithful data

Data presented from an alarmist point of view

Data presented from a “calming” point of view

Describing the shape of quantitative data
• Symmetric data has roughly the same mirror image on each side of a center value.
• Skewed data has one side (either right or left) which is much longer than the other relative to the mode (peak value).
  – The above definitions are most useful when describing data with a single mode.
• Multimodal data has more than one mode.
• Beware of outliers when describing shape.
• Shape of the concentration data?

States data from 1996
• Define the shape of each variable.
POVERTY   percentage of the state population living in poverty
CRIME     violent crime rate per 100,000 population
COLLEGE   percentage of the state's population who are enrolled in college
METRO     percentage of the state population living in a metropolitan area
INCOME    median household income in 1996 dollars

Shapes of states data – Percentage living in poverty

Shapes of states data – Violent crime rates per 100K

Shapes of states data – % living in metro area

Shapes of states data – Income

Summary statistics for quantitative data
• Measures of central tendency (typical)
  – The sample median is the middle observation if the values are arranged in increasing order.
  – The sample mean of n observations is the average, the sum of the values divided by n.

$X_1, \ldots, X_n$ represent the n data values:

\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \]

Summary statistics for quantitative data
• pth percentile - the value such that p×100% of values are below it and (1−p)×100% are above it. (How to actually find the value? Multiply the percentile by # of observations and round up if necessary.)
  – first quartile (Q1) is the 25th percentile
  – second quartile (Q2) is the 50th percentile (median)
  – third quartile (Q3) is the 75th percentile
• 5-number summary: Min, Q1, Q2, Q3, Max
  – Boxplots: Stacking boxplots can be very useful for comparing multiple groups (you’ll see in 2 slides).

• From the boxplot above
  – Are more than 75% of the values below 80?
  – Are more than 75% of the values above 40?
  – What percentage of values fall roughly between 45 and 70?
  – Is the data symmetrical?
  – What are the approximate maximum and minimum values?

Summary statistics for quantitative data
• Measures of spread:
  – Interquartile range, IQR = Q3 − Q1, the range of the middle 50% of the data
  – sample variance, $s^2$, is the sum of squared deviations from the sample mean divided by n − 1

\[ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \]

  – sample standard deviation, s, is the square root of the sample variance. Preferred because it has the same units as the data.
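
The definitions of the mean, variance, standard deviation, quartiles and IQR can be checked with a few lines of Python. This is a minimal sketch, not the StatCrunch procedure used in class. The ten values are the observations from the lecture's worked variance table. The helper `percentile` is a hypothetical name, and it assumes one reading of the "multiply by # of observations and round up" rule: take the k-th smallest value, with k = ceil(p × n). Software packages often interpolate instead, so their quartiles can differ slightly.

```python
import math
import statistics

# The ten observations from the lecture's worked variance table.
x = [5, 4, 3, 2, 2, 5, 7, 3, 4, 9]
n = len(x)

# Sample mean: sum of the values divided by n.
x_bar = sum(x) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
squared_deviations = [(xi - x_bar) ** 2 for xi in x]
s2 = sum(squared_deviations) / (n - 1)

# Sample standard deviation: square root of the variance (same units as the data).
s = math.sqrt(s2)


def percentile(values, p):
    """Return the p-th percentile (0 < p <= 1) as the k-th smallest value,
    where k = ceil(p * n). Assumed convention; other conventions exist."""
    ordered = sorted(values)
    k = math.ceil(p * len(ordered))  # 1-based position in the sorted data
    return ordered[k - 1]


q1 = percentile(x, 0.25)
q3 = percentile(x, 0.75)
iqr = q3 - q1
median = statistics.median(x)

print(f"mean = {x_bar:.1f}, median = {median}")
print(f"variance = {s2:.3f}, std dev = {s:.3f}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

The hand-computed variance agrees with the standard library's `statistics.variance(x)`, which also divides by n − 1.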

Calculation of sample variance (partial from data)

Squared deviations are shown rounded to one decimal place.

Obs      x    x bar   (x-xbar)   (x-xbar)^2   x^2
1        5    4.4      0.6        0.4          25
2        4    4.4     -0.4        0.2          16
3        3    4.4     -1.4        2.0           9
4        2    4.4     -2.4        5.8           4
5        2    4.4     -2.4        5.8           4
6        5    4.4      0.6        0.4          25
7        7    4.4      2.6        6.8          49
8        3    4.4     -1.4        2.0           9
9        4    4.4     -0.4        0.2          16
10       9    4.4      4.6       21.2          81
Totals   44            0         44.4         238

\[ s^2 = \frac{44.4}{10 - 1} \approx 4.93 \]

Cereal data
• Compare rating across shelf…
  – Numerically using StatCrunch “Summary Stats”

Cereal info – Comparative boxplots
• Boxplot/outliers
  – An example of comparative boxplots.
  – Graphically using StatCrunch “Graphics>Boxplots”

Comparing measures of central tendency and spread
• The sample mean and the sample standard deviation are good measures of center and spread, respectively, for symmetric data.
• If the data set is skewed or has outliers, the sample median and the interquartile range are more commonly used.
• Note about trimmed mean.

Case Study: Salary data
• A fictitious large university decides to study the salaries of its graduates. A survey was conducted of 2232 recent graduates from engineering and education majors.
• The salary data consists of three variables:
  – Gender: Male or Female
  – Major: Education or Engineering
  – Salary: Reported in $
• What types of variables do we have?

Salary data by major
• Are both majors equally represented in the survey?
• Do salaries differ across major?

Salary data by gender
• Are both genders equally represented in the survey?

Summary statistics for Salary:
Group by: Gender
Gender   n       Mean     Variance     Std. Dev.   Median   Min      Max
Female   1,088   41,108   97,633,984   9,881       36,369   33,070   64,279
Male     1,144   50,589   86,189,224   9,284       54,471   29,027   61,533

• Do salaries differ across gender? Discrimination?

Salary data by gender within each major
• How do male and female salaries compare in engineering?

Summary statistics for Salary:
Where: Major=Engineering
Group by: Gender
Gender   n     Mean     Variance    Std. Dev.   Median   Min      Max
Female   232   59,921   3,900,454   1,975       59,994   53,598   64,279
Male     924   55,022   4,146,587   2,036       55,019   48,019   61,533

• How do male and female salaries compare in education?

Summary statistics for Salary:
Where: Major=Education
Group by: Gender
Gender   n     Mean     Variance    Std. Dev.   Median   Min      Max
Female   856   36,009   1,004,212   1,002       36,009   33,070   39,411
Male     220   31,971   1,238,722   1,113       32,002   29,027   35,608

Please read the additional file for Topic 1 for more info
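
One way to connect the by-gender table with the two within-major tables is to rebuild each gender's overall mean as a weighted average of its per-major means. The sketch below is illustrative only and uses the group sizes and means from the summary tables above. The function names `overall_mean` and `engineering_share` are hypothetical helpers, and because the table means are rounded, the recomputed values are approximate.

```python
# (n, mean salary) for each gender-by-major group, copied from the summary tables.
groups = {
    ("Female", "Engineering"): (232, 59_921),
    ("Male", "Engineering"): (924, 55_022),
    ("Female", "Education"): (856, 36_009),
    ("Male", "Education"): (220, 31_971),
}


def overall_mean(gender):
    """Weighted average of the per-major mean salaries for one gender."""
    rows = [(n, mean) for (g, _major), (n, mean) in groups.items() if g == gender]
    total_n = sum(n for n, _ in rows)
    return sum(n * mean for n, mean in rows) / total_n


def engineering_share(gender):
    """Fraction of one gender's respondents who majored in engineering."""
    total_n = sum(n for (g, _major), (n, _mean) in groups.items() if g == gender)
    return groups[(gender, "Engineering")][0] / total_n


for gender in ("Female", "Male"):
    print(f"{gender}: overall mean ≈ {overall_mean(gender):,.0f}, "
          f"engineering share = {engineering_share(gender):.1%}")
```

The weighted averages reproduce the overall means in the by-gender table to the nearest dollar. Within each major the female mean is the higher one, yet about 81% of the men surveyed are engineering majors compared with about 21% of the women, so major behaves like a lurking variable in the gender-only comparison.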