R INSTALLATION R is an open source software package for statistical data analysis R Installation • R download: http://www.r-project.org/ • Germany, Stefan Drees Bonn: http://cran.r-mirror.de/ • This is the main program! • RStudio (Auxiliary program for editing R files): http://www.rstudio.com/ide/download/ • Install the programs (R should be there already). Working with the R command prompt (without RStudio) • Practical session Working with RStudio • Start RStudio • „File -> New -> R Script“ Lets you edit a R command or R script (= small programme = several consecutive commands) Working with RStudio New R files The command prompt Select • Files • Plots • Packages (for advanced analyses) • Help Working with RStudio • Commands and programs can be stored in R files • Execute one command line: • Ctrl+Enter or • Button „Run“ • Execute several lines: • Mark lines and use „Ctrl+Enter“ or „Run“ button WHAT IS STATISTICS? What is statistics? • Statistics is a means to connect empirical knowledge and theory and is constituted as follows: • Data representation (Empirics) • Methods for description, analysis, and interpretation of data, in order to allow predictions, conclusions and decisions (Statistical Theory) What is statistics? I. Descriptive Statistics II. Probability theory III. Test theory DESCRIPTIVE STATISTICS Descriptive Statistics • Basic concepts: • Population: Collection of objects for which a conclusion shall be made (can be human beings but also a collection of atoms when applied in physics) • Sample: a representative part/sub-set of the population • Random sample: elements of the population drawn randomly and independently of each other • Example: „Mietspiegel“ (= statistics of rents) for the city of Bonn • Population: all rooms, flats etc. for rent in Bonn (← too many to investigate all) • Sample: selected part; all flats from Poppelsdorf • Random sample: Investigation of n = 100, 200,… random objects from Bonn Descriptive statistics Observational objects Attributes / traits patients, blood samples, DNA samples, houses, atoms blood pressure, weight, age, blood group, number of siblings, marital status, rent Values quantitative qualitative Blood group, marital status discrete Number of siblings continuous blood pressure, weight, age, rent Descriptive statistics • Scaling: • Nominal scale: attribute values that are not directly comparable (sex, subject of studies, country of origin) (qualitative) • Ordinal scale: attribute values that have a „natural“ order (grades, font sizes: tiny-small-medium-large-huge) • Interval scale: difference between attribute values is interpretable (temperature in °C) (quantitative) • To be distinguished: • Discrete attributes: Attribute values can be counted • Continuous attributes: All real numbers, or at least all numbers from an interval, are possible Descriptive statistics • Frequencies: • Absolute frequency ni: • Number of obersvations with attribute value i (counts) • Relative frequency hi: • Portion of elements with attribute value i • To be computed as absolute frequency devided by total number of objects N: ni / N • Relative frequencies lie between 0 and 1 • Relative frequencies have to add up to 1 (<- can be used to check computation) Descriptive statistics AB0 blood group value tally sheet absolute frequency ni Bonn 1 0 Köln 17 relative frequency hi Bonn Köln 78 0.34 0.39 0.38 2 A1 19 76 0.38 3 A2 6 20 0.12 0.10 4 B 5 18 0.10 0.09 5 A1B 2 6 0.04 0.03 6 A2B 1 2 0.02 0.01 7 other 0 0 0.00 0.00 N = 50 200 1.00 1.00 k ni N i 1 hi ni N 0 hi 1 k hi 1 i 1 Descriptive statistics • Frequencies: • Cumulative frequency: • Sum of all frequencies up to a given value i. • Denoted as 𝑁𝑖 for absolute frequencies and denoted as 𝐻 i for relative frequencies • Often used when values are subdivided into classes • Classification: • Arrangement of attribute values into disjoint groups, so called „classes“ • Classes are disjoint, i.e. non-overlapping, and neighbouring intervals of attribute values, which are defined by a lower and an upper bound. Neighbouring values implies that each value belongs to a class and does not lie outiside (completeness of the classification). Descriptive statistics height classification: • complete • disjoint (each value belongs to only one class) 150 200 ( ]( ]( ]( ]( ]( 150 160 170 180 190 200 class limits: • (160; 170] contains all values, that are > 160 but 170. height [cm] height [cm] Descriptive statistics height [cm] Class number i Class limits (ai-1; ai] 1 Ni frequency Cumulative frequency Tally sheet absolute ni relative hi absolute Ni relative Hi 150 0 0.00 0 0.00 2 (150; 160] 5 0.05 5 0.05 3 (160; 170] 30 0.30 35 0.35 4 (170; 180] 35 0.35 70 0.70 5 (180; 190] 25 0.25 95 0.95 6 (190; 200] 5 0.05 100 1.00 7 > 200 0 0.00 100 1.00 N Hi i N N=100 1,00 i nk k 1 DESCRIPTIVE STATISTICS Graphical representation Descriptive statistics • Pie chart (R function: pie() ) • Shows absolute frequencies • Example: blood groups Descriptive statistics • Bar chart (R function: barplot() ) • Shows relative frequencies • Example: blood groups Descriptive statistics • Representation of cumulative frequencies with empirical distribution function F • Discrete trait: Number of Children Number of children Tally sheet Frequencies Cumulative frequencies absolute ni relative hi absolute Ni relative Hi 1 0 5 0.10 5 0.10 2 1 20 0.40 25 0.50 3 2 15 0.30 40 0.80 4 3 5 0.10 45 0.90 5 4 3 0.06 48 0.96 6 >4 2 0.04 50 1.00 N = 50 1.00 Descriptive statistics Number of children hi 1.0 0.8 0.6 Bar chart 0.4 0.2 0.0 hi 0 1 2 3 4 >4 Hi F 1.0 0.8 hi 0.6 F: Empirical distribution function 0.4 Since the attribute is quantitative discrete, we obtain a step function 0.2 0.0 0 1 2 3 4 >4 Descriptive statistics • Histogramms (R function: hist() ) • Construction: • Data is subdevided into classes • Surface area of columns is proportional to the respective frequencies • Columns are neighbouring since classes are neighbouring Descriptive statistics Example: Height [cm] hi Histogram 1 0,8 0,6 0,4 0,2 0 height [cm] 150 160 170 180 190 200 Descriptive statistics f 0,8 empirical density function f 0,6 0,4 0,2 0 height [cm] 150 160 170 180 190 F • 0,8 200 • empirical distribution function F • 0,6 0,4 (for continuous trait) • 0,2 0 • 150 • height [cm] 160 170 180 190 200 Descriptive statistics f 0,8 0,6 empirical density function f 0,4 0,2 0 f F hi height [cm] 150 160 170 180 200 190 F • • hi 0,8 • empirical distribution function F 0,6 0,4 • 0,2 0 • 150 • height [cm] 160 170 180 190 200 F f Descriptive Statistics • Note: Slides 23 and 26 both show empirical distribution functions. In the first case, we obtain a step function since the trait under investigation is discrete. DESCRIPTIVE STATISTICS Measures of central tendency, dispersion and spread Descriptive statistics • Measures of central tendency: • A number to characterize the „center“ of the data • Most important: • Mean • Median Descriptive statistics • Median (R function: median() ) • Sample: 𝑥1 , 𝑥2 , … , 𝑥𝑛 → Order according to: 𝑥1 ≤ 𝑥2 ≤ ⋯ ≤ 𝑥𝑛 →Ordered sample: 𝑥(1) , 𝑥(2) , … , 𝑥(𝑛) 𝑥 • Median 𝑥 = sample 1 2 𝑛+1 2 𝑥 𝑛 2 , 𝑖𝑛 𝑐𝑎𝑠𝑒 𝑛 𝑖𝑠 𝑜𝑑𝑑 (𝑣𝑎𝑙𝑢𝑒 𝑖𝑛 𝑡ℎ𝑒 "𝑚𝑖𝑑𝑑𝑙𝑒") +𝑥 𝑛 +1 2 , 𝑖𝑛 𝑐𝑎𝑠𝑒 𝑛 𝑖𝑠 𝑒𝑣𝑒𝑛 sample ranks ranks n = 7 odd: 𝑥 = 𝑥 7+1 = 𝑥(4) = 6 x1=5 x(1)=3 x2 =9 x(2)=4 x3=3 x(3)=5 x4=8 x(4)=6 x5=19 x(5)=7 x1=5 x(1)=3 x2 =9 x(2)=4 x3=3 x(3)=5 x4=8 x(4)=6 x5=19 x(5)=8 x6=4 x(6)=8 x6=4 x(6)=9 x7=6 x(7)=9 x7=6 x(7)=19 x8=7 x(8)=19 2 n = 8 even: 1 𝑥 = 𝑥 8 +𝑥 8 2 2 2+1 1 = 𝑥 4 +𝑥 5 2 1 = 6 + 7 = 6.5 2 Descriptive statistics • Mean (R function: mean() ) • Sample: 𝑥1 , 𝑥2 , … , 𝑥𝑛 • Sample size: n • Mean 𝑥 = 1 𝑛 𝑛 𝑖=1 𝑥𝑖 Descriptive statistics • Comparison of median and mean: i 𝒙𝒊 𝒙′𝒊 1 2000 2000 ordered 𝒙𝒊 and 𝒙′𝒊 1500 • Both samples have median 2500 2 5000 15000 2000 • 𝑥 = 3000 and 𝑥 ′ = 5000 are the 3 4000 4000 2500 4 1500 1500 4000 5 2500 2500 5000 / 15000 mean values • Mean can strongly be influenced by a single value Median is more robust against extreme values („outliers“) Nevertheless, the mean is more often used in practice since it has other desirable properties (see later). Descriptive statistics x x outlier How to treat outliers? 1) Discard No! 2) Check value and correct Yes! Descriptive statistics sample A x1 x2 xnA nA sample B x xA x x1 x2 x nB nB xB The mean (or median) is not sufficent to describe a sample • Measure the amount of variation of the data! Descriptive statistics • Measures of dispersion and spread: • Numbers to characterize the amount variation around the center (= mean) • Most important: • Minimum, maximum, range (dispersion) • Empirical variance (spread) • Empirical standard deviation (spread) Descriptive statistics • range: sample x1=5 ranks x(1)=3 x2 =9 x(2)=4 x3=3 x(3)=5 x4=8 x(4)=6 x5=19 x(5)=8 x6=4 x(6)=9 x7=6 x(7)=19 n=7 n=7 minimum: min = x(1) maximum: max = x(n) range: R = x(n) – x(1) Descriptive statistics • Variance (R function: var() ): • A measure to express the spread around the center (mean) by a single value • The squared deviation 𝑥𝑖 − 𝑥 ² of each attribute value 𝑥𝑖 from the mean is considered. • Formula for the empirical variance from a sample of 𝑛 elements: 𝑠2 = 1 𝑛−1 𝑛 𝑖=1 𝑥𝑖 − 𝑥 ² • The empircal standard deviation 𝑠 is just the square root of the variance, s = + 𝑠 2 . (R function: sd() ). Why devide by 𝑛 − 1 instead of 𝑛? x1 x2 xn n x Example: x1 = 75 x2 = 2 x3 = 270 x4 = 4 100 75 2 270 53 n = 4 x = 100 x4 =53 is not free, but given by other values when the mean is known. s2 has (n-1) degrees of freedom (f) 1 s f 2 n x x i i 1 2 Data • If you have data you want to analyse, please bring it along!