I_Descriptive_Statistics

advertisement
R INSTALLATION
R is an open source software package for
statistical data analysis
R Installation
• R download: http://www.r-project.org/
• Germany, Stefan Drees Bonn: http://cran.r-mirror.de/
• This is the main program!
• RStudio (Auxiliary program for editing R files):
http://www.rstudio.com/ide/download/
• Install the programs (R should be there already).
Working with the R command prompt
(without RStudio)
• Practical session
Working with RStudio
• Start RStudio
• „File -> New -> R Script“
Lets you edit a R command or R script (= small programme =
several consecutive commands)
Working with RStudio
New R files
The command prompt
Select
• Files
• Plots
• Packages (for advanced analyses)
• Help
Working with RStudio
• Commands and programs can be stored in R files
• Execute one command line:
• Ctrl+Enter or
• Button „Run“
• Execute several lines:
• Mark lines and use „Ctrl+Enter“ or „Run“ button
WHAT IS STATISTICS?
What is statistics?
• Statistics is a means to connect empirical knowledge and
theory and is constituted as follows:
• Data representation (Empirics)
• Methods for description, analysis, and interpretation of data, in
order to allow predictions, conclusions and decisions (Statistical
Theory)
What is statistics?
I.
Descriptive Statistics
II.
Probability theory
III.
Test theory
DESCRIPTIVE STATISTICS
Descriptive Statistics
• Basic concepts:
• Population: Collection of objects for which a conclusion shall be made
(can be human beings but also a collection of atoms when applied in
physics)
• Sample: a representative part/sub-set of the population
• Random sample: elements of the population drawn randomly and
independently of each other
• Example: „Mietspiegel“ (= statistics of rents) for the city of Bonn
• Population: all rooms, flats etc. for rent in Bonn (← too many to investigate
all)
• Sample: selected part; all flats from Poppelsdorf
• Random sample: Investigation of n = 100, 200,… random objects from
Bonn
Descriptive statistics
Observational objects
Attributes / traits
patients, blood samples, DNA samples, houses,
atoms
blood pressure, weight, age, blood group,
number of siblings, marital status, rent
Values
quantitative
qualitative
Blood group, marital status
discrete
Number of siblings
continuous
blood pressure, weight, age, rent
Descriptive statistics
• Scaling:
• Nominal scale: attribute values that are not directly comparable
(sex, subject of studies, country of origin) (qualitative)
• Ordinal scale: attribute values that have a „natural“ order (grades,
font sizes: tiny-small-medium-large-huge)
• Interval scale: difference between attribute values is interpretable
(temperature in °C) (quantitative)
• To be distinguished:
• Discrete attributes: Attribute values can be counted
• Continuous attributes: All real numbers, or at least all numbers from an
interval, are possible
Descriptive statistics
• Frequencies:
• Absolute frequency ni:
• Number of obersvations with attribute value i (counts)
• Relative frequency hi:
• Portion of elements with attribute value i
• To be computed as absolute frequency devided by total number of
objects N: ni / N
• Relative frequencies lie between 0 and 1
• Relative frequencies have to add up to 1 (<- can be used to check
computation)
Descriptive statistics
AB0 blood group
value
tally sheet
absolute frequency ni
Bonn
1
0
Köln
17
relative frequency hi
Bonn
Köln
78
0.34
0.39
0.38
2
A1
19
76
0.38
3
A2
6
20
0.12
0.10
4
B
5
18
0.10
0.09
5
A1B
2
6
0.04
0.03
6
A2B
1
2
0.02
0.01
7
other
0
0
0.00
0.00
N = 50
200
1.00
1.00
k
 ni  N
i 1
hi 
ni
N
0  hi  1
k
 hi  1
i 1
Descriptive statistics
• Frequencies:
• Cumulative frequency:
• Sum of all frequencies up to a given value i.
• Denoted as 𝑁𝑖 for absolute frequencies and denoted as 𝐻 i for relative
frequencies
• Often used when values are subdivided into classes
• Classification:
• Arrangement of attribute values into disjoint groups, so called „classes“
• Classes are disjoint, i.e. non-overlapping, and neighbouring intervals
of attribute values, which are defined by a lower and an upper bound.
Neighbouring values implies that each value belongs to a class and
does not lie outiside (completeness of the classification).
Descriptive statistics
height
classification:
• complete
• disjoint (each value belongs to only
one class)
150
200
(
](
](
](
](
](
150
160
170
180
190
200
class limits:
• (160; 170] contains all values, that are
> 160 but  170.
height [cm]
height [cm]
Descriptive statistics
height [cm]
Class
number i
Class limits
(ai-1; ai]
1
Ni 
frequency
Cumulative frequency
Tally sheet
absolute ni
relative hi
absolute Ni
relative Hi
 150
0
0.00
0
0.00
2
(150; 160]
5
0.05
5
0.05
3
(160; 170]
30
0.30
35
0.35
4
(170; 180]
35
0.35
70
0.70
5
(180; 190]
25
0.25
95
0.95
6
(190; 200]
5
0.05
100
1.00
7
> 200
0
0.00
100
1.00
N
Hi  i
N
N=100
1,00
i
 nk
k 1
DESCRIPTIVE
STATISTICS
Graphical representation
Descriptive statistics
• Pie chart (R function: pie() )
• Shows absolute frequencies
• Example: blood groups
Descriptive statistics
• Bar chart (R function: barplot() )
• Shows relative frequencies
• Example: blood groups
Descriptive statistics
• Representation of cumulative frequencies with empirical
distribution function F
• Discrete trait: Number of Children
Number of
children
Tally sheet
Frequencies
Cumulative frequencies
absolute
ni
relative
hi
absolute
Ni
relative
Hi
1
0
5
0.10
5
0.10
2
1
20
0.40
25
0.50
3
2
15
0.30
40
0.80
4
3
5
0.10
45
0.90
5
4
3
0.06
48
0.96
6
>4
2
0.04
50
1.00
N = 50
1.00
Descriptive statistics
Number of children
hi
1.0
0.8
0.6
Bar chart
0.4
0.2
0.0
hi
0
1
2
3
4
>4
Hi F
1.0
0.8
hi
0.6
F: Empirical distribution function
0.4
Since the attribute is quantitative discrete,
we obtain a step function
0.2
0.0
0
1
2
3
4
>4
Descriptive statistics
• Histogramms (R function: hist() )
• Construction:
• Data is subdevided into classes
• Surface area of columns is proportional to the respective frequencies
• Columns are neighbouring since classes are neighbouring
Descriptive statistics
Example: Height [cm]
hi
Histogram
1
0,8
0,6
0,4
0,2
0
height [cm]
150
160
170
180
190
200
Descriptive statistics
f
0,8
empirical density
function f
0,6
0,4
0,2
0
height [cm]
150
160
170
180
190
F
•
0,8
200
•
empirical distribution
function F
•
0,6
0,4
(for continuous trait)
•
0,2
0
•
150
•
height [cm]
160
170
180
190
200
Descriptive statistics
f
0,8
0,6
empirical density
function f
0,4
0,2
0
f  F
hi
height [cm]
150
160
170
180
200
190
F
•
•
hi
0,8
•
empirical distribution
function F
0,6
0,4
•
0,2
0
•
150
•
height [cm]
160
170
180
190
200
F  f
Descriptive Statistics
• Note: Slides 23 and 26 both show empirical distribution
functions. In the first case, we obtain a step function since
the trait under investigation is discrete.
DESCRIPTIVE
STATISTICS
Measures of central tendency, dispersion and
spread
Descriptive statistics
• Measures of central tendency:
• A number to characterize the „center“ of the data
• Most important:
• Mean
• Median
Descriptive statistics
• Median (R function: median() )
• Sample: 𝑥1 , 𝑥2 , … , 𝑥𝑛 → Order according to: 𝑥1 ≤ 𝑥2 ≤ ⋯ ≤ 𝑥𝑛 →Ordered sample:
𝑥(1) , 𝑥(2) , … , 𝑥(𝑛)
𝑥
• Median 𝑥 =
sample
1
2
𝑛+1
2
𝑥
𝑛
2
,
𝑖𝑛 𝑐𝑎𝑠𝑒 𝑛 𝑖𝑠 𝑜𝑑𝑑 (𝑣𝑎𝑙𝑢𝑒 𝑖𝑛 𝑡ℎ𝑒 "𝑚𝑖𝑑𝑑𝑙𝑒")
+𝑥
𝑛
+1
2
, 𝑖𝑛 𝑐𝑎𝑠𝑒 𝑛 𝑖𝑠 𝑒𝑣𝑒𝑛
sample
ranks
ranks
n = 7 odd:
𝑥 = 𝑥 7+1 = 𝑥(4) = 6
x1=5
x(1)=3
x2 =9
x(2)=4
x3=3
x(3)=5
x4=8
x(4)=6
x5=19
x(5)=7
x1=5
x(1)=3
x2 =9
x(2)=4
x3=3
x(3)=5
x4=8
x(4)=6
x5=19
x(5)=8
x6=4
x(6)=8
x6=4
x(6)=9
x7=6
x(7)=9
x7=6
x(7)=19
x8=7
x(8)=19
2
n = 8 even:
1
𝑥 = 𝑥 8 +𝑥 8
2 2
2+1
1
= 𝑥 4 +𝑥 5
2
1
= 6 + 7 = 6.5
2
Descriptive statistics
• Mean (R function: mean() )
• Sample: 𝑥1 , 𝑥2 , … , 𝑥𝑛
• Sample size: n
• Mean 𝑥 =
1
𝑛
𝑛
𝑖=1 𝑥𝑖
Descriptive statistics
• Comparison of median and
mean:
i
𝒙𝒊
𝒙′𝒊
1 2000 2000
ordered 𝒙𝒊 and 𝒙′𝒊
1500
• Both samples have median 2500
2 5000 15000 2000
• 𝑥 = 3000 and 𝑥 ′ = 5000 are the
3 4000 4000
2500
4 1500 1500
4000
5 2500 2500
5000 / 15000
mean values
• Mean can strongly be influenced
by a single value
 Median is more robust against
extreme values („outliers“)
Nevertheless, the mean is more
often used in practice since it has
other desirable properties (see
later).
Descriptive statistics
x
x
outlier
How to treat outliers?
1) Discard
No!
2) Check value and correct
Yes!
Descriptive statistics
sample A
x1
x2

xnA
nA
sample B
x
xA
x
x1
x2

x nB
nB
xB
 The mean (or median) is not sufficent to describe a sample
• Measure the amount of variation of the data!
Descriptive statistics
• Measures of dispersion and spread:
• Numbers to characterize the amount variation around the center (=
mean)
• Most important:
• Minimum, maximum, range (dispersion)
• Empirical variance (spread)
• Empirical standard deviation (spread)
Descriptive statistics
• range:
sample
x1=5
ranks
x(1)=3
x2 =9
x(2)=4
x3=3
x(3)=5
x4=8
x(4)=6
x5=19
x(5)=8
x6=4
x(6)=9
x7=6
x(7)=19
n=7
n=7
minimum:
min = x(1)
maximum:
max = x(n)
range: R = x(n) – x(1)
Descriptive statistics
• Variance (R function: var() ):
• A measure to express the spread around the center (mean) by a
single value
• The squared deviation 𝑥𝑖 − 𝑥 ² of each attribute value 𝑥𝑖 from the
mean is considered.
• Formula for the empirical variance from a sample of 𝑛 elements:
𝑠2 =
1
𝑛−1
𝑛
𝑖=1
𝑥𝑖 − 𝑥 ²
• The empircal standard deviation 𝑠 is just the square root of the
variance, s =
+
𝑠 2 . (R function: sd() ).
Why devide by 𝑛 − 1 instead of 𝑛?
x1
x2

xn
n
x
Example: x1
=
75
x2 =
2
x3 =
270
x4 =
4  100  75  2  270  53
n
=
4
x
=
100
x4 =53 is not free,
but given by other values when the mean is
known.
s2 has (n-1) degrees of freedom (f)
1
s  
f
2
n
 x  x 
i
i 1
2
Data
• If you have data you want to analyse, please bring it
along!
Download