L1: Lecture notes Descriptive Statistics

Lecture 1: Descriptive statistics:
Population: group of individuals
(= respondents in a poll)
Variable: property of an individual
(= answer to a question in a poll)
• Qualitative variable: the possible values are
categories or levels.
We distinguish 2 kinds of scales:
Nominal: the values of the variable have no natural order,
e.g. sex, political preference.
Ordinal: the values of the variable are ordered,
e.g. 4 levels of education (none, lower, middle and
higher).
• Quantitative variable: the possible values are
numerical/measurable.
These variables are measured on an interval scale or,
when there is a true zero, on a ratio scale.
Information about a variable: often information about
the population as a whole is not available, but information
about a sample is: we have n observed values
x1, x2, …, xn of the variable at our disposal as an
indication of the population characteristics.
Describing the observations:
Exploratory Data Analysis (EDA)
EDA techniques:
• Summarizing the data:
measures of location (centre), dispersion
(spread) etc., especially for quantitative variables
• Presentation with graphs and diagrams
Measures of location (centre):
1. The sample mean:
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
2. The median M: the middle observation
(measurement) when the observations are ordered by size.
First arrange them from small to large (order
statistics): x1, x2, …, xn → x(1), x(2), …, x(n).
If there is an even number of
observations, compute the mean of the middle two.
3. The mode: the most frequently observed value.
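The three measures of location can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the sample `data` are made up for the example, and `modes` returns a list because several values can tie for "most frequent".

```python
from collections import Counter

def sample_mean(xs):
    """Sum of the observations divided by their number n."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle order statistic; mean of the middle two when n is even."""
    ordered = sorted(xs)          # x(1) <= x(2) <= ... <= x(n)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(xs):
    """All values that occur most often (there may be more than one)."""
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

data = [4, 1, 3, 3, 9, 2]         # hypothetical observations
print(sample_mean(data), median(data), modes(data))
```

With an even number of observations, as here, the median is the mean of the two middle values of the ordered sample.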
Percentiles and quartiles:
The median is also called the 50th percentile: (about)
50% of the observations are smaller and 50% are greater
than the median M.
In a diagram of the observations (• = observation):
    25%       25%       25%        25%
• ……… • | • …… • | • …… • | • ………… •
        Q1        M        Q3
The quartiles Q1, Q2 and Q3 are the 25th, 50th and 75th
percentiles: they split the observations into 4 roughly
equal parts.
Rules (software sometimes uses different rules):
• Q1 is the median of all observations smaller than
the overall median M.
• Q2 is the median M.
• Q3 is the median of all observations greater than M.
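The quartile rules can be sketched as follows. One assumption is made loudly here: "smaller than M" and "greater than M" are read positionally in the ordered sample (the lower and upper halves, leaving out the middle observation when n is odd), which matters when several observations equal M. The function names and data are illustrative only.

```python
def median(xs):
    """Middle order statistic; mean of the middle two when n is even."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(xs):
    """Return (Q1, M, Q3) using the halves below and above the median.

    Assumption: the halves are taken by position in the ordered sample,
    so for odd n the middle observation belongs to neither half.
    Requires at least 2 observations.
    """
    ordered = sorted(xs)
    n = len(ordered)
    lower = ordered[: n // 2]           # observations before the median
    upper = ordered[(n + 1) // 2 :]     # observations after the median
    return median(lower), median(ordered), median(upper)

print(quartiles([7, 1, 3, 9, 5, 11, 13]))      # (3, 7, 11)
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))     # (2.5, 4.5, 6.5)
```

Statistical packages often interpolate between order statistics instead, so their quartiles can differ slightly from these values.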
Percentiles (what follows is a description, not a
formal definition):
Example: the 90th percentile is a limiting value such
that (about) 10% of the observations are greater and
90% are smaller.
Measures of dispersion (spread)
1. Interquartile distance: IQD = Q3 – Q1
2. Sample variance s²:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$
s² is said to have n-1 degrees of freedom
3. Sample standard deviation s: $s = \sqrt{s^2}$
Properties of s and s²:
• s ≥ 0 and s² ≥ 0
• If s = 0, then all observations are equal.
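The sample variance can be written either as the sum of squared deviations from the mean or in a computational form based on the sum of squares, both divided by n-1. The sketch below checks numerically that the two forms agree; the function names and the data are made up for the illustration.

```python
import math

def sample_variance(xs):
    """Definition: squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Computational form: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]            # hypothetical observations, mean 5
s2 = sample_variance(data)                 # 32 / 7
s = math.sqrt(s2)                          # sample standard deviation
print(s2, sample_variance_shortcut(data), s)
```

The computational form is convenient by hand, but with floating-point numbers it can lose precision when the mean is large relative to the spread, so the definition is usually the safer choice in code.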
Outliers: unusually large or small observations.
The 1.5×IQD rule: observations greater than Q3 +
1.5×IQD or smaller than Q1 - 1.5×IQD are outliers.
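A minimal sketch of the 1.5×IQD rule follows. The quartiles are passed in as arguments so that the example does not commit to one particular quartile rule; the function name and the data are hypothetical.

```python
def iqd_outliers(xs, q1, q3):
    """Return the observations flagged by the 1.5 x IQD rule."""
    iqd = q3 - q1
    lower_limit = q1 - 1.5 * iqd
    upper_limit = q3 + 1.5 * iqd
    return [x for x in xs if x < lower_limit or x > upper_limit]

# Hypothetical sample with Q1 = 3 and Q3 = 11, so IQD = 8
# and the limits are 3 - 12 = -9 and 11 + 12 = 23.
data = [1, 3, 5, 7, 9, 11, 40]
print(iqd_outliers(data, q1=3, q3=11))     # [40]
```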
Resistant measures: not sensitive to outliers.
• Resistant: median and IQD
• Non-resistant: sample mean, variance and standard
deviation.
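To see the difference between resistant and non-resistant measures, one can replace the largest observation by an extreme value and compare the summaries. The sketch below uses Python's standard `statistics` module (`stdev` is the sample standard deviation with divisor n-1); both samples are invented for the illustration.

```python
import statistics

clean = [2, 3, 4, 5, 6]
contaminated = [2, 3, 4, 5, 60]    # largest value replaced by an extreme one

for name, xs in (("clean", clean), ("contaminated", contaminated)):
    print(
        name,
        "mean:", statistics.mean(xs),
        "median:", statistics.median(xs),
        "s:", round(statistics.stdev(xs), 2),
    )
```

The median is 4 in both samples, while the mean and the standard deviation change considerably once the extreme value is present.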
If the sample consists of n observations, then:
• Frequency: the number of observations with the
same specific numerical or categorical value.
• Relative frequency: the quotient of frequency and
n (the fraction or proportion frequency/n).
Often written as a percentage, e.g. 13/50 = 26%
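Frequencies and relative frequencies are easy to tabulate. The sketch below reproduces the 13/50 = 26% example; the category labels "yes" and "no" and the function name are assumptions made for the illustration.

```python
from collections import Counter

def frequency_table(values):
    """Map each distinct value to (frequency, relative frequency)."""
    n = len(values)
    counts = Counter(values)
    return {value: (count, count / n) for value, count in counts.items()}

# Hypothetical poll with n = 50 respondents, 13 of whom answered "yes".
answers = ["yes"] * 13 + ["no"] * 37
table = frequency_table(answers)
for value, (freq, rel) in table.items():
    print(f"{value}: frequency {freq}, relative frequency {rel:.0%}")
```

The relative frequencies always add up to 1 (100%), which is a quick check on a hand-made table.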
The distribution of a variable:
all possible values and their relative frequency.
The five-number summary of the observations:
the smallest, Q1, M, Q3 and the largest.
Graphs and diagrams:
1. Bar graph:
- On the x-axis: the categories/values.
- The height of each bar represents the frequency or
relative frequency (given on the y-axis).
2. Sector, circle or pie diagram:
- Especially for qualitative variables.
- Every sector gives the proportion of the value.
3. Stem-and-leaf diagram
Example:
stem | leaf
   1 |
   1 | 5556668
   2 | 01334
   2 | 55678999
   3 | 00123
   3 | 579
   4 | 022
   4 |
This diagram represents 31 observations: 15 is the smallest,
42 the largest. 42 occurred twice as an observation.
We place the smallest leaf closest to the stem.
• Split the stem whenever there are too many
observations per value of the stem. In the example
we have split the tens (leaves 0-4 and 5-9).
• Back-to-back stem-and-leaf diagram: for comparing
two samples we use the same stems for both
samples.
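A stem-and-leaf diagram with split stems can be generated with a short function. The sketch below assumes non-negative integers with the tens as stem and the units as leaf; the function name is hypothetical, and the 31 observations are read off the leaves of the example diagram.

```python
def stem_and_leaf(values, split=True):
    """Return the rows of a stem-and-leaf diagram as strings.

    Assumes non-negative integers: stem = tens, leaf = units.
    With split=True each stem gets two rows (leaves 0-4 and 5-9).
    """
    ordered = sorted(values)             # smallest leaf closest to the stem
    halves = (0, 1) if split else (0,)
    rows = {}
    for v in ordered:
        stem, leaf = divmod(v, 10)
        half = 1 if split and leaf >= 5 else 0
        rows.setdefault((stem, half), []).append(str(leaf))
    first, last = ordered[0] // 10, ordered[-1] // 10
    lines = []
    for stem in range(first, last + 1):  # keep empty stems visible
        for half in halves:
            leaves = "".join(rows.get((stem, half), []))
            lines.append(f"{stem} | {leaves}".rstrip())
    return lines

observations = [
    15, 15, 15, 16, 16, 16, 18,
    20, 21, 23, 23, 24,
    25, 25, 26, 27, 28, 29, 29, 29,
    30, 30, 31, 32, 33,
    35, 37, 39,
    40, 42, 42,
]
print("\n".join(stem_and_leaf(observations)))
```

Rows without leaves are kept on purpose: dropping them would hide gaps in the data and distort the shape of the distribution.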
4. Box plot: a diagram of the five-number summary.
If there are outliers, first plot them separately and
then draw the five-number summary of the original
observations excluding the outliers.
5. Histogram
• Histogram of (relative) frequencies (grouping data):
First make a frequency table: choose intervals
(classes) of equal length and determine their
frequencies. (Rule of thumb: roughly 10ln(n)
intervals)
The histogram consists of rectangles placed upon
the intervals on the x-axis, with height equal
to the (relative) frequency on the y-axis.
• Histogram of intervals having different lengths:
the area of each rectangle equals the relative
frequency.
Write down in the frequency table for every
interval: the relative frequency, the length and the
height of the rectangle. Use the formula:
$$\text{height} = \frac{\text{relative frequency}}{\text{length}}$$
The height is called the frequency density
(the higher, the more observations per unit).
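For intervals of different lengths, the frequency table with heights can be computed as below. This is a sketch under stated assumptions: the class boundaries and frequencies are invented, and the function name is hypothetical.

```python
def frequency_density(boundaries, frequencies):
    """Build rows (left, right, relative frequency, length, height).

    boundaries: increasing class limits, one more than the number of classes.
    frequencies: number of observations in each class.
    height = relative frequency / length, so each area is a relative frequency.
    """
    n = sum(frequencies)
    rows = []
    for left, right, freq in zip(boundaries, boundaries[1:], frequencies):
        length = right - left
        rel = freq / n
        rows.append((left, right, rel, length, rel / length))
    return rows

# Hypothetical classes [0, 10), [10, 20), [20, 50) with 5, 10 and 5 observations.
rows = frequency_density([0, 10, 20, 50], [5, 10, 5])
for left, right, rel, length, height in rows:
    print(f"[{left}, {right}): rel. freq. {rel:.2f}, length {length}, height {height:.4f}")
```

The first and last class contain the same number of observations, but the last one is three times as wide, so its rectangle is only one third as high. The areas of all rectangles add up to 1.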
6. Graph for time series
The x-axis is the time axis and the y-axis consists of
the values of the variable.
Pay attention to the (overall) trend and to seasonal
effects.
7. Scatter diagram: a graph of the relation between two
quantitative variables (on the x-axis and y-axis respectively). Example:
[Figure: scatter diagram "Reduction (in %) of NOx emissions as a function of the amount of catalyst in petrol". x-axis: amount of added substance (0 to 8); y-axis: reduction in % (2.0 to 5.0).]
When commenting on graphs or diagrams, pay
attention to their shape (especially in the case of
histograms):
• The overall shape of the distribution: is it symmetric
or does it have a "tail" to the right or left, and how many peaks are there?
• The location of the centre and the spread.
• Gaps and (possible) outliers.