Lecture 1: Descriptive statistics

Population: group of individuals (= respondents in a poll)
Variable: property of an individual (= answer to a question in a poll)

Qualitative variable: the possible values are categories or levels. We distinguish 2 kinds of scales:
Nominal: the values of the variable are not ordered, e.g. sex, political preference.
Ordinal: the values of the variable are ordered, e.g. 4 levels of education (none, lower, middle and higher).

Quantitative variable: the possible values are numerical/measurable. These variables are interval variables or ratio variables (when there is a natural zero).

Information about a variable: often information about the population as a whole is not available, but information about a sample is: we have n observed values x1, x2, …, xn of the variable at our disposal as an indication of the population characteristics.

Describing the observations: Exploratory Data Analysis (EDA)

EDA techniques:
Summarizing the data: measures of location (centre), dispersion (spread) etc., especially for quantitative variables
Presentation with graphs and diagrams

Measures of location (centre):
1. The sample mean:
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
2. The median M: the middle observation (measurement) in order of size. First arrange the observations from small to large (order statistics): x1, x2, …, xn → x(1), x(2), …, x(n). If there is an even number of observations, compute the mean of the middle two.
3. The mode: the most frequently observed value.

Percentiles and quartiles:
The median is also called the 50th percentile: (about) 50% of the observations are smaller and 50% are greater than the median M.

In a diagram of the observations (• = observation):

     25%            25%        25%              25%
• .......... •   • .... •   • .......... •   • ................ •
               Q1         M                Q3

The quartiles Q1, Q2 and Q3 are the 25th, 50th and 75th percentiles: they split the observations into roughly 4 equal quarters.
Rules (software sometimes uses different rules):
Q1 is the median of all observations smaller than the overall median M.
Q2 is the median M.
Q3 is the median of the observations greater than M.

Percentiles (what follows is a description, not a formal definition):
Example: the 90th percentile is a limiting value such that (about) 10% of the observations are greater and 90% are smaller.

Measures of dispersion (spread)
1. Interquartile distance: IQD = Q3 - Q1
2. Sample variance s^2:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$$
s^2 is said to have n-1 degrees of freedom.
3. Sample standard deviation s:
$$s = \sqrt{s^2}$$

Properties of s and s^2: s ≥ 0 and s^2 ≥ 0. If s = 0, then all observations are equal.

Outliers: unusually large or small observations
The 1.5×IQD rule: observations greater than Q3 + 1.5×IQD or smaller than Q1 - 1.5×IQD are outliers.

Resistant measures: not sensitive to outliers.
Resistant: median and IQD
Non-resistant: sample mean, variance and standard deviation.

If the sample consists of n observations then:
Frequency: the number of observations with the same specific numerical or categorical value.
Relative frequency: the quotient of frequency and n (the fraction or proportion frequency/n). Often written as a percentage, e.g. 13/50 = 26%.
The distribution of a variable: all possible values and their relative frequencies.
The 5-number summary of the observations: the smallest, Q1, M, Q3 and the largest.

Graphs and diagrams:
1. Bar graph:
-On the x-axis: the categories/values.
-The height of the bars represents the frequencies or relative frequencies (given on the y-axis).
2. Sector, circle or pie diagram
-Especially for qualitative variables.
-Every sector gives the proportion of the value.
3. Stem-and-leaf diagram
Example:

stem | leaf
  1  |
  1  | 5556668
  2  | 01334
  2  | 55678999
  3  | 00123
  3  | 579
  4  | 022
  4  |

This diagram represents 31 observations: 15 is the smallest, 42 the largest. 42 occurred twice as an observation.
We place the smallest leaf closest to the stem.
Split the stem whenever there are too many observations per value of the stem.
In the example we have split the tens (leaves 0-4 and 5-9).
Back-to-back stem-and-leaf diagram: for comparing two samples we use the same stems for both samples.
4. Box plot: a diagram of the 5-number summary.
If there are outliers, first graph them separately and then graph the 5-number summary of the original observations excluding the outliers.
5. Histogram
Histogram of (relative) frequencies (grouping data):
First make a frequency table: choose intervals (classes) of equal length and determine their frequencies. (Rule of thumb: roughly 10ln(n) intervals)
The histogram consists of rectangles placed on the intervals on the x-axis, with height equal to the (relative) frequency on the y-axis.
Histogram of intervals having different lengths: the area of the rectangle equals the relative frequency.
Write down in the frequency table for every interval: the relative frequency, the length and the height of the rectangle. Use the formula:
$$\text{height} = \frac{\text{relative frequency}}{\text{length}}$$
The height is called the frequency density (the higher, the more observations per unit).
6. Graph for time series
The x-axis is the time axis and the y-axis shows the values of the variable. Pay attention to the (overall) trend and the effect of seasons.
7. Scatter diagram: a graph of the relation between two quantitative variables (on the x-axis and the y-axis respectively).
Ex: [Figure: scatter diagram of the reduction in % of NOx emissions as a function of the amount of catalyst in petrol; x-axis: amount of added substance.]

When commenting on the graphs or diagrams, pay attention to their shape (especially in the case of histograms):
The overall shape of the distribution: is it symmetric or is there a "tail" to the right or left, how many peaks?
The location of the centre and the spread.
Gaps and (possible) outliers.
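
The measures of location and dispersion from this lecture can be tried out on the 31 observations of the stem-and-leaf example. What follows is a minimal Python sketch, not a reference implementation. The function names `median` and `summarize` are hypothetical. The quartile rule is read here as "median of the lower or upper half by position, leaving out the middle observation when n is odd", which is one possible reading of the rule given in the lecture; software may use other rules.

```python
import math


def median(sorted_xs):
    """Middle observation of an already sorted list (mean of the middle two if n is even)."""
    n = len(sorted_xs)
    mid = n // 2
    if n % 2 == 1:
        return sorted_xs[mid]
    return (sorted_xs[mid - 1] + sorted_xs[mid]) / 2


def summarize(observations):
    """Descriptive measures for a sample of at least 2 observations."""
    xs = sorted(observations)           # order statistics x(1), ..., x(n)
    n = len(xs)
    mean = sum(xs) / n
    m = median(xs)
    # Assumption: Q1 and Q3 are the medians of the lower and upper half by position.
    q1 = median(xs[: n // 2])
    q3 = median(xs[(n + 1) // 2:])
    iqd = q3 - q1
    # Sample variance, definition and shortcut formula (n - 1 degrees of freedom).
    var_def = sum((x - mean) ** 2 for x in xs) / (n - 1)
    var_short = (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)
    # 1.5 x IQD rule for outliers.
    low, high = q1 - 1.5 * iqd, q3 + 1.5 * iqd
    outliers = [x for x in xs if x < low or x > high]
    return {
        "n": n,
        "mean": mean,
        "median": m,
        "q1": q1,
        "q3": q3,
        "iqd": iqd,
        "variance": var_def,
        "variance_shortcut": var_short,
        "std": math.sqrt(var_def),
        "five_numbers": (xs[0], q1, m, q3, xs[-1]),
        "outliers": outliers,
    }


# Observations read off the stem-and-leaf diagram (stem = tens, leaf = units).
stem_leaf = {1: "5556668", 2: "0133455678999", 3: "00123579", 4: "022"}
data = [10 * stem + int(leaf) for stem, leaves in stem_leaf.items() for leaf in leaves]

summary = summarize(data)
for name, value in summary.items():
    print(name, value)
```

For these data the sketch gives M = 27, Q1 = 20 and Q3 = 32, so IQD = 12, and the 1.5×IQD rule flags no outliers. The two variance formulas agree up to rounding.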