Statistics 571 Statistical Methods for Bioscience I

advertisement
Statistics 571
Statistical Methods for Bioscience I
Lecture 1: Cecile Ane
Lecture 2: Nicholas Keuler
Department of Statistics
University of Wisconsin–Madison
Fall 2009
Outline
1
Course Information
2
Introduction to Statistics
3
Descriptive Statistics
4
Shape of distributions
Outline
1
Course Information
2
Introduction to Statistics
3
Descriptive Statistics
4
Shape of distributions
Course Information
www.stat.wisc.edu/courses/st571-ane/
Read the entire syllabus carefully.
Complete the survey sheet.
Switch section?
Late homework
Block the dates and time for the exams NOW:
Tuesday, October 13
Tuesday, November 24
Wednesday, December 20, 7:45am - 9:45am
No discussion this week
Course Information
Get help beyond lectures: Reading materials, course
website, forum, discussion sections, office hours, etc.
Your feedback is highly appreciated. Examples of
comments from previous years:
make shorter exams
make slides or write big
“powerpoint is good, but instead of having the examples
printed off, leave blank space, go over on the board [...]
have us copy them down”
get more practical advice
Your evaluations are most valuable to me! Ask questions,
get involved! Forum on Learn UW.
Course Information: Why R? Why not Microsoft Excel?
Limitations of Microsoft Excel:
65K “raw” data size limit
little data protection, little/no tracking
XL2000 has many errors, without warning. Can get
negative correlation coefficients, wrong pie charts, wrong
paired t-test with missing values, does not accept
categorical predictors in multiple regression, etc.
Some bugs are fixed, new bugs are created in XL2003.
Still doesn’t have distributions right. Lots of errors known
over 10 years without fixes.
McCullough & Wilson (2005) On the accuracy of statistical procedures in Microsoft
Excel 2003. Computational Statistics & Data Analysis 49(4):1244-1252
Foresight: The International Journal of Applied Forecasting, issue 3 (2006)
R. Hesse. Incorrect Nonlinear Trend Curves in Excel
B. McCullough. The Unreliability of Excel’s Statistical Procedures
P. Fields. On the Use and Abuse of Microsoft Excel
Expectations with Computing and R.
Resources on course webpage. Tutorial at first discussion.
Good practice: keep assignments/projects in separate folders.
Keep a plain text file (.r extension) with the list of commands
to replicate what you have done. Example...
Being able to use a computing software is essential for you to
analyze your own data when the time comes. My goal = you
own the methods and gain independence.
I expect that you will experiment with R, try things on your own,
so as to get a good understanding of how R works. Getting
error and warning messages is normal while experimenting.
Don’t get stuck: get help! Forum, friends, TAs, instructor.
Expectations with Assignments.
Must be written clearly.
When including R commands and output, don’t put them alone.
Add comments to explain in English what the commands are
doing, and interpret the results.
When using graphs, include axis labels, legend if necessary,
etc. Handwritten legends are okay.
Outline
1
Course Information
2
Introduction to Statistics
3
Descriptive Statistics
4
Shape of distributions
Introduction to Statistics
What is statistics?
Branch of scientific inquiry: helps to determine cause and
association, and to make predictions.
Organize and summarize data from a sample (i.e. a subset
of a population).
Use information in the data to draw conclusions about a
population (i.e. all individuals of a particular type).
Population vs. sample
A book vs. a few pages of the book.
All corn plants vs. 100 plants in a field.
Introduction to Statistics
Probability vs. statistics
Probability: mathematics of chance and randomness.
Properties of samples when the population is known,
Deductive approach.
Statistics: a sample is available, Conclusions about a
population when one sample is known. Inductive approach.
Three main topics
Descriptive statistics: display & summarize data in a
sample.
Probability: Given a population, study the uncertainty
associated with a sample taken from the population.
Statistics: Given a sample, learn methods to draw
conclusions about a population, while taking into account
of uncertainties in the sample.
Russell et al. (2007) Science 317:941-943
Russell et al. (2007) Science 317:941-943
Outline
1
Course Information
2
Introduction to Statistics
3
Descriptive Statistics
4
Shape of distributions
Descriptive Statistics
Example: height of seedlings
Thirteen (13) red pine seedlings were sampled from a nursery
in Wisconsin. The heights of these seedlings were (in cm):
42 23 43 34 49 56 31 47 61 54 46 34 26
Graphical methods describe data by visual/graphical
techniques.
Stem-and-leaf plot*, dot plot
Histogram
Numerical methods extract summarizing numbers that
characterize the data set and reveal main features.
Measures of location/center:
Sample mean
Sample median*
Sample quantiles, box plot*
Measures of spread:
Sample range
Interquartile range (IQR)*
Sample variance, standard deviation
Descriptive Statistics: stem-and-leaf, dotplots
A stem-and-leaf plot:
2
3
4
5
6
|
|
|
|
|
36
144
23679
46
1
An alternative is a dot plot.
Stem-and-leaf plots and dot plots have information about
the shape, center, spread of the data distribution, as well
as outliers and # of observations.
Descriptive Statistics: Histogram
Divide data into non-overlapping classes.
Decide the number of obs (i.e. frequencies) in each class
(i.e. tally).
Draw rectangles with height = frequencies and
base = class intervals.
For the height of seedlings,
class
19.5-29.5
29.5-39.5
39.5-49.5
49.5-59.5
59.5-69.5
frequencies
Descriptive Statistics: Histogram
Ex: milk production of organic cows
Dot−plot of milk
●
10
20
●
●
●
30
●
●
●
●
●
●
40
●
●
●
●
50
60
50
60
0
2
4
6
Histogram of milk
10
20
30
40
0
2
4
6
Histogram
of milk
milk
Descriptive Statistics: Remarks
Histogram is a pictorial representation of the data
frequency distribution.
Note the boundary values for the class intervals.
Histograms have information about shape, center, spread
of the data distribution.
Descriptive Statistics: Sample mean
The sample mean of a data set of y1 , y2 , . . . , yn provides a
measure of location/center of the data set. To compute the
sample mean:
P
add all the values ni=1 yi = y1 + y2 + · · · + yn
divide by the number of observations n
Pn
ȳ =
i=1 yi
n
Seedlings: 42 23 43 34 49 56 31 47 61 54 46 34 26
y1 = 42, y2 = 23, y3 = 43, . . . , y13 = 26 and thus
Pn
yi
546
= 42 cm.
ȳ = i=1 =
n
13
ȳ is the balance point of the dot plot.
P
P
Sometimes ni=1 yi is abbreviated as
yi .
Descriptive Statistics: Sample variance
2
s =
Pn
− ȳ )2
n−1
i=1 (yi
Height of seedlings: y1 = 42, y2 = 23, . . ., and we had ȳ = 42.
42 23 43 34 49 56 31 47 61 54 46 34 26
2
s =
= 138.17
Sample variance measures the average squared deviation.
Why dividing by n − 1 but not n?
For hand calculation, use working formulas
" n
#
Pn
2
X
(
y
)
1
i
i=1
yi2 −
s2 =
n−1
n
i=1
or
s2 =
" n
#
X
1
yi2 − n(ȳ )2
n−1
i=1
Descriptive Statistics: Sample standard deviation
Sample standard deviation (SD) is the square root of
sample variance
p
s = s2
√
Height of seedlings: s = 138.17 = 11.75 cm.
Sample standard deviation is a typical deviation, as ±1s
captures about 2/3 of bell-shaped data.
> mean(milk)
[1] 36.21429
> sd(milk)
[1] 9.760033
sd=9.8
●
16.6 20
●
●
● ● ●
26.4 30
sd=9.8
●
●
● ● ●
● ●
36.2 40
46.0 50
●
55.8
The mean is sensitive to large values
Suppose data values are
2 4 6 7 8 10 12
Then
ȳ =
, s = .42
Suppose data values are
2 4 6 7 8 10 102
Then
ȳ ≈
, s = 36.32
Key R commands
> hts = c(42, 23,43,34,49,56,31,47,61, 54,46,34, 26) # enter data
> hts
[1] 42 23 43 34 49 56 31 47 61 54 46 34 26
> length(hts)
# sample size
[1] 13
> stem(hts)
# stem-and-leaf plot
The decimal point is 1 digit(s) to the right of the |
2 | 36
3 | 144
4 | 23679
5 | 46
6 | 1
> hist(hts)
# histogram plot
> mean(hts)
# sample mean
[1] 42
> var(hts)
# sample variance
[1] 138.1667
> sd(hts)
# sample standard deviation
[1] 11.75443
Outline
1
Course Information
2
Introduction to Statistics
3
Descriptive Statistics
4
Shape of distributions
Shape of the distribution of the data
Weight of soil: example 1
Actual weight of 15 2-lb. bags of soil used for a lab experiment.
2.36 2.27 2.42 2.13 2.19 2.33 2.54 2.21 2.06 2.36
2.51 2.45 2.12 2.32 2.29
The decimal point is 1 digit(s) to the left of
the |
20
21
22
23
24
25
|
|
|
|
|
|
6
239
179
2366
25
14
mean ȳ = 2.30, standard deviation s = 0.14
Shape of the distribution of the data
Weight of soil: example 2
19
20
21
22
23
24
|
|
|
|
|
|
3
5
8
47
344789
1124
mean ȳ = 2.30, standard deviation s = 0.15
Mean and spread are similar to ex.1, but the distribution is ...
Shape of the distribution of the data
Weight of soil: example 3
20
21
22
23
24
25
|
|
|
|
|
|
69
35779
9
01358
15
mean ȳ = 2.30, standard deviation s = 0.17
Mean and spread are similar to ex.1, but the distribution is ...
Need to look at the data! not just at numerical summaries.
5
6
2.0 2.1 2.2 2.3 2.4 2.5 2.6
sample1
Frequency
2
3
0
0
0
1
1
1
2
Frequency
2
Frequency
3
4
3
4
5
4
Soil weight examples
1.9 2.0 2.1 2.2 2.3 2.4 2.5
sample2
2.0 2.1 2.2 2.3 2.4 2.5 2.6
sample3
Types of data
There are two broad classes of data: quantitative (i.e.
numerical) and qualitative (i.e. categorical) data.
For quantitative data, each observation has a number
associated with it.
ex: weight, milk yield, or # of cows on a farm.
either continuous or discrete.
ex: weight and milk yield are
data and
# of cows on a farm are
data.
For qualitative data, each observation can be put into a
category, which is either nominal or ordered.
ex: 15 cows are assigned to 3 types of beds or 3 different
diet types (VC=Vitamin and Choline):
bed types
Hay
Cement
Others
# of cows
5
6
4
diet types
high in VC
low in VC
control
# of cows
5
5
5
Download