Statistics 571
Statistical Methods for Bioscience I
Lecture 1: Cecile Ane
Lecture 2: Nicholas Keuler
Department of Statistics, University of Wisconsin–Madison
Fall 2009

Outline
  1 Course Information
  2 Introduction to Statistics
  3 Descriptive Statistics
  4 Shape of distributions

Course Information
www.stat.wisc.edu/courses/st571-ane/
Read the entire syllabus carefully.
Complete the survey sheet.
Switch section?
Late homework
Block the dates and time for the exams NOW:
  Tuesday, October 13
  Tuesday, November 24
  Wednesday, December 20, 7:45am - 9:45am
No discussion this week

Course Information
Get help beyond lectures: reading materials, course website, forum, discussion sections, office hours, etc.
Your feedback is highly appreciated. Examples of comments from previous years:
  make shorter exams
  make slides or write big
  “powerpoint is good, but instead of having the examples printed off, leave blank space, go over on the board [...] have us copy them down”
  get more practical advice
Your evaluations are most valuable to me!
Ask questions, get involved! Forum on Learn UW.

Course Information: Why R?
Why not Microsoft Excel? Limitations of Microsoft Excel:
  65K “raw” data size limit
  little data protection, little/no tracking
  XL2000 has many errors and gives no warning. Can get negative correlation coefficients, wrong pie charts, wrong paired t-test with missing values; does not accept categorical predictors in multiple regression, etc.
  Some bugs are fixed and new bugs are created in XL2003. It still doesn’t have distributions right.
  Lots of errors have been known for over 10 years without fixes.
References:
  McCullough & Wilson (2005) On the accuracy of statistical procedures in Microsoft Excel 2003. Computational Statistics & Data Analysis 49(4):1244-1252
  Foresight: The International Journal of Applied Forecasting, issue 3 (2006):
    R. Hesse. Incorrect Nonlinear Trend Curves in Excel
    B. McCullough. The Unreliability of Excel’s Statistical Procedures
    P. Fields. On the Use and Abuse of Microsoft Excel

Expectations with Computing and R
Resources on course webpage. Tutorial at first discussion.
Good practice: keep assignments/projects in separate folders. Keep a plain text file (.r extension) with the list of commands needed to replicate what you have done. Example...
Being able to use statistical software is essential for you to analyze your own data when the time comes. My goal = you own the methods and gain independence.
I expect that you will experiment with R and try things on your own, so as to get a good understanding of how R works. Getting error and warning messages is normal while experimenting.
Don’t get stuck: get help! Forum, friends, TAs, instructor.

Expectations with Assignments
Must be written clearly.
When including R commands and output, don’t put them alone. Add comments to explain in English what the commands are doing, and interpret the results.
When using graphs, include axis labels, a legend if necessary, etc. Handwritten legends are okay.

Introduction to Statistics
What is statistics?
  Branch of scientific inquiry: helps to determine cause and association, and to make predictions.
  Organize and summarize data from a sample (i.e. a subset of a population).
  Use information in the data to draw conclusions about a population (i.e. all individuals of a particular type).
Population vs. sample
  A book vs. a few pages of the book.
  All corn plants vs. 100 plants in a field.

Introduction to Statistics
Probability vs. statistics
  Probability: mathematics of chance and randomness. Properties of samples when the population is known. Deductive approach.
  Statistics: a sample is available. Conclusions about a population when one sample is known. Inductive approach.
Three main topics
  Descriptive statistics: display & summarize data in a sample.
  Probability: given a population, study the uncertainty associated with a sample taken from the population.
  Statistics: given a sample, learn methods to draw conclusions about a population, while taking into account the uncertainties in the sample.

Russell et al. (2007) Science 317:941-943

Descriptive Statistics
Example: height of seedlings
Thirteen (13) red pine seedlings were sampled from a nursery in Wisconsin. The heights of these seedlings were (in cm):
  42 23 43 34 49 56 31 47 61 54 46 34 26
Graphical methods describe data by visual/graphical techniques.
  Stem-and-leaf plot*, dot plot
  Histogram
Numerical methods extract summarizing numbers that characterize the data set and reveal main features.
  Measures of location/center:
    Sample mean
    Sample median*
    Sample quantiles, box plot*
  Measures of spread:
    Sample range
    Interquartile range (IQR)*
    Sample variance, standard deviation

Descriptive Statistics: stem-and-leaf, dotplots
A stem-and-leaf plot:
  2 | 36
  3 | 144
  4 | 23679
  5 | 46
  6 | 1
An alternative is a dot plot.
Stem-and-leaf plots and dot plots have information about the shape, center, spread of the data distribution, as well as outliers and # of observations.

Descriptive Statistics: Histogram
Divide data into non-overlapping classes.
Determine the number of observations (i.e. frequencies) in each class (i.e. tally).
Draw rectangles with height = frequencies and base = class intervals.
For the height of seedlings,
  class         19.5-29.5   29.5-39.5   39.5-49.5   49.5-59.5   59.5-69.5
  frequencies       2           3           5           2           1

Descriptive Statistics: Histogram
Ex: milk production of organic cows
[Figure: dot plot of milk and histograms of milk; horizontal axis: milk]

Descriptive Statistics: Remarks
Histogram is a pictorial representation of the data frequency distribution.
Note the boundary values for the class intervals.
Histograms have information about shape, center, spread of the data distribution.

Descriptive Statistics: Sample mean
The sample mean of a data set $y_1, y_2, \ldots, y_n$ provides a measure of location/center of the data set.
To compute the sample mean:
  add all the values: $\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n$
  divide by the number of observations $n$
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$
Seedlings: 42 23 43 34 49 56 31 47 61 54 46 34 26
$y_1 = 42$, $y_2 = 23$, $y_3 = 43$, ..., $y_{13} = 26$ and thus
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{546}{13} = 42 \text{ cm}.$$
$\bar{y}$ is the balance point of the dot plot.
Sometimes $\sum_{i=1}^{n} y_i$ is abbreviated as $\sum y_i$.

Descriptive Statistics: Sample variance
$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$$
Height of seedlings: $y_1 = 42$, $y_2 = 23$, ..., and we had $\bar{y} = 42$.
  42 23 43 34 49 56 31 47 61 54 46 34 26
$$s^2 = \frac{(42-42)^2 + (23-42)^2 + \cdots + (26-42)^2}{13-1} = \frac{1658}{12} = 138.17$$
Sample variance is roughly the average squared deviation from the sample mean.
Why divide by $n-1$ rather than $n$?
For hand calculation, use the working formulas
$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right]$$
or
$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} y_i^2 - n\,\bar{y}^2\right]$$

Descriptive Statistics: Sample standard deviation
Sample standard deviation (SD) is the square root of sample variance:
$$s = \sqrt{s^2}$$
Height of seedlings: $s = \sqrt{138.17} = 11.75$ cm.
Sample standard deviation is a typical deviation, as $\bar{y} \pm 1s$ captures about 2/3 of bell-shaped data.
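The mean, variance and SD formulas above can be checked numerically in R. The sketch below is an illustration on the seedling heights and is not part of the original slides; the object names (n, ybar, s2_def, s2_work1, s2_work2, s, within_1s) are chosen here for the example.

```r
# Seedling heights (cm) from the example above
hts <- c(42, 23, 43, 34, 49, 56, 31, 47, 61, 54, 46, 34, 26)
n <- length(hts)

# Sample mean: add all the values, then divide by n
ybar <- sum(hts) / n

# Sample variance computed directly from the definition
s2_def <- sum((hts - ybar)^2) / (n - 1)

# The two working formulas for hand calculation
s2_work1 <- (sum(hts^2) - sum(hts)^2 / n) / (n - 1)
s2_work2 <- (sum(hts^2) - n * ybar^2) / (n - 1)

# Sample standard deviation
s <- sqrt(s2_def)

# Each version should agree with R's built-in var()
all.equal(s2_def, var(hts))
all.equal(s2_work1, var(hts))
all.equal(s2_work2, var(hts))

# Proportion of seedlings within one SD of the mean
within_1s <- mean(abs(hts - ybar) <= s)
```

For these data, 8 of the 13 heights fall within one SD of the mean, which is close to the 2/3 rule of thumb even though the sample is small.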
Ex: milk production
    > mean(milk)
    [1] 36.21429
    > sd(milk)
    [1] 9.760033
[Figure: dot plot of milk with marks at 16.6, 26.4, 36.2, 46.0 and 55.8, i.e. the mean ± 1 and ± 2 sd (sd = 9.8)]

The mean is sensitive to large values
Suppose data values are 2 4 6 7 8 10 12
  Then $\bar{y} = 7$, $s = 3.42$
Suppose data values are 2 4 6 7 8 10 102
  Then $\bar{y} \approx 19.86$, $s = 36.32$

Key R commands
    > hts = c(42, 23, 43, 34, 49, 56, 31, 47, 61, 54, 46, 34, 26)  # enter data
    > hts
    [1] 42 23 43 34 49 56 31 47 61 54 46 34 26
    > length(hts)  # sample size
    [1] 13
    > stem(hts)    # stem-and-leaf plot

      The decimal point is 1 digit(s) to the right of the |

      2 | 36
      3 | 144
      4 | 23679
      5 | 46
      6 | 1

    > hist(hts)    # histogram plot
    > mean(hts)    # sample mean
    [1] 42
    > var(hts)     # sample variance
    [1] 138.1667
    > sd(hts)      # sample standard deviation
    [1] 11.75443

Shape of the distribution of the data
Weight of soil: example 1
Actual weight of 15 2-lb. bags of soil used for a lab experiment.
  2.36 2.27 2.42 2.13 2.19 2.33 2.54 2.21 2.06 2.36 2.51 2.45 2.12 2.32 2.29

  The decimal point is 1 digit(s) to the left of the |

  20 | 6
  21 | 239
  22 | 179
  23 | 2366
  24 | 25
  25 | 14
mean $\bar{y} = 2.30$, standard deviation $s = 0.14$

Shape of the distribution of the data
Weight of soil: example 2
  19 | 3
  20 | 5
  21 | 8
  22 | 47
  23 | 344789
  24 | 1124
mean $\bar{y} = 2.30$, standard deviation $s = 0.15$
Mean and spread are similar to ex.1, but the distribution is ...

Shape of the distribution of the data
Weight of soil: example 3
  20 | 69
  21 | 35779
  22 | 9
  23 |
  24 | 01358
  25 | 15
mean $\bar{y} = 2.30$, standard deviation $s = 0.17$
Mean and spread are similar to ex.1, but the distribution is ...
Need to look at the data, not just at numerical summaries!

[Figure: Soil weight examples. Histograms of sample1, sample2 and sample3; vertical axis: Frequency]

Types of data
There are two broad classes of data: quantitative (i.e. numerical) and qualitative (i.e. categorical) data.
For quantitative data, each observation has a number associated with it.
  ex: weight, milk yield, or # of cows on a farm.
  Quantitative data are either continuous or discrete.
  ex: weight and milk yield are continuous data and # of cows on a farm are discrete data.
For qualitative data, each observation can be put into a category, which is either nominal or ordered.
  ex: 15 cows are assigned to 3 types of beds or 3 different diet types (VC = Vitamin and Choline):

  bed types     Hay          Cement      Others
  # of cows     5            6           4

  diet types    high in VC   low in VC   control
  # of cows     5            5           5
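In R, the type of data determines how a variable is stored: numeric vectors for quantitative data and factors for qualitative data. The following is a minimal sketch built on the cow counts above, not part of the original slides. The variable names and the farm counts in n_cows are hypothetical, and treating the diet types as ordered (control < low < high) is an assumption made only for illustration.

```r
# Quantitative, continuous: three of the soil weights from example 1
weight <- c(2.36, 2.27, 2.42)

# Quantitative, discrete: # of cows on three farms (hypothetical values)
n_cows <- c(12L, 30L, 45L)

# Qualitative, nominal: bed type for each of the 15 cows
bed <- factor(rep(c("Hay", "Cement", "Others"), times = c(5, 6, 4)),
              levels = c("Hay", "Cement", "Others"))

# Qualitative, ordered: diet type for each of the 15 cows
# (the ordering of the levels is an assumption for illustration)
diet <- factor(rep(c("control", "low in VC", "high in VC"), each = 5),
               levels = c("control", "low in VC", "high in VC"),
               ordered = TRUE)

# Frequency tables recover the counts shown in the slide
bed_counts <- table(bed)
diet_counts <- table(diet)
```

Printing bed_counts and diet_counts gives the number of cows per category, while summary(weight) gives numerical summaries that only make sense for quantitative data.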