Stat 330 (Spring 2015)
Slide set 25
Last update: March 22, 2015

Other descriptive statistics
Review: Descriptive statistics, inferential statistics, sample/population
mean, sample/population variance, sample/population median, range
Population quantile: A p-quantile of a population is a number x that solves
the inequalities P(X < x) ≤ p, P(X > x) ≤ 1 − p
Sample quantile: A sample p-quantile is any number that exceeds at most
100 · p% of the sample and is exceeded by at most 100(1 − p)% of the
sample.
Percentile: A p-quantile is also called a 100p-th percentile.
Quartile: The 1st, 2nd, and 3rd quartiles are the 25th, 50th, and 75th
percentiles. They split a population or a sample into four parts.
♣ A median is at the same time a 0.5-quantile, the 50th percentile, and the
2nd quartile.

Other descriptive statistics (Cont'd)
Interquartile range (IQR): the difference between the third and the first
quartiles
IQR = Q3 − Q1
It measures the variability of the data and is not significantly affected by
outliers.
♣ In practice, outliers may be a serious problem that is hard to avoid.
To detect and identify outliers, we use the IQR to measure the variability of
the data.
Rule of thumb: the 1.5(IQR) rule is used to detect and identify outliers.
♦ Measure 1.5(IQR) down from the first quartile and up from the third
quartile. All data points outside this range are considered suspiciously
extreme. They are the candidate outliers.

Example: The CPU times for randomly chosen tasks are
70, 36, 43, 49, 82, 48, 34, 62, 35, 15
with ordered sample
15, 34, 35, 36, 43, 48, 49, 62, 70, 82
♥ The first quartile: for p = 0.25 and n = 10, 25% of the sample equals
np = 2.5 and 75% of the sample equals n(1 − p) = 7.5.
♥ From the ordered sample, we see that only the 3rd element, 35, has no more
than 2.5 observations to the left and no more than 7.5 observations to the
right of it. Hence, the first quartile is Q̂1 = 35.
♦ The third quartile is Q̂3 = 62.
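The sample quantile definition and the 1.5(IQR) rule can be checked numerically
on the CPU-time sample. The following is a minimal Python sketch, not part of
the original slides. The function name sample_quantiles is hypothetical, and it
only tests the observed sample values, although the definition allows any
number (for p = 0.5 every number between 43 and 48 qualifies).

```python
def sample_quantiles(sample, p):
    """Return the sample values x that have at most n*p observations
    below them and at most n*(1 - p) observations above them."""
    n = len(sample)
    return [
        x for x in sorted(set(sample))
        if sum(v < x for v in sample) <= n * p
        and sum(v > x for v in sample) <= n * (1 - p)
    ]

cpu_times = [70, 36, 43, 49, 82, 48, 34, 62, 35, 15]

q1 = sample_quantiles(cpu_times, 0.25)[0]   # first quartile
q3 = sample_quantiles(cpu_times, 0.75)[0]   # third quartile
iqr = q3 - q1

# 1.5(IQR) rule: anything outside [lower, upper] is a candidate outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
suspects = [x for x in cpu_times if x < lower or x > upper]

print(q1, q3, iqr, (lower, upper), suspects)
```

For this sample the script reports Q̂1 = 35, Q̂3 = 62, IQR = 27, and an empty
list of suspected outliers.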
Previous Example: Ordered sample 15, 34, 35, 36, 43, 48, 49, 62, 70, 82 with
Q1 = 35 and Q3 = 62. Then
IQR = Q3 − Q1 = 62 − 35 = 27
and measure 1.5 interquartile ranges from each quartile
Q1 − 1.5(IQR) = 35 − 1.5 · 27 = −5.5
Q3 + 1.5(IQR) = 62 + 1.5 · 27 = 102.5
♣ None of the data in the sample is outside the interval [−5.5, 102.5]. No
outliers are suspected.

Graphical Statistics
♠ To illustrate these graphical tools, consider a data set consisting of
measurements of the girth, height, and volume of timber in 31 felled black
cherry trees.
♠ Note that girth is the diameter of the tree (in inches) measured at 4 ft 6
in above the ground.
♠ We can collect the data and draw plots to illustrate how the data are
distributed.
Histogram: A histogram shows the shape of the pmf or pdf of the data, helps
check for homogeneity, and suggests possible outliers.
♥ To construct a histogram, we split the range of the data into equal
intervals, 'bins', and count how many (or what proportion of) observations
fall into each bin.

Stem-and-leaf: A stem-and-leaf plot is similar to a histogram. Unlike a
histogram, however, it shows how the data are distributed within each column.
♥ To construct a stem-and-leaf plot, we need to draw a stem and a leaf.
♥ The first one or several digits form a stem, and the next digit forms a
leaf. Other digits are dropped. For example
239 ⇔ 23 | 9, 23 ⇔ 2 | 3
Example: cherry tree (again) Stem-and-leaf plot for height (leaf unit = 1,
6|34 = 63, 64)
5 | 34
6 | 34569
7 | 01224455566789
8 | 000001123567
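The stem-and-leaf construction rule (leading digits form the stem, the next
digit forms the leaf) can be sketched in Python. This is an illustration only:
the function names stem_and_leaf and render are hypothetical, the sketch
assumes non-negative data, and it is applied to the CPU-time sample from the
earlier example because the individual cherry-tree heights are not listed on
the slides.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Map each stem (leading digits) to its sorted list of leaves."""
    rows = defaultdict(list)
    for v in sorted(values):
        scaled = int(v // leaf_unit)     # drop digits below the leaf unit
        stem, leaf = divmod(scaled, 10)  # last kept digit is the leaf
        rows[stem].append(leaf)
    return dict(rows)

def render(rows):
    """Draw one text line per stem, including stems that have no leaves."""
    lines = []
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(d) for d in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return "\n".join(lines)

cpu_times = [70, 36, 43, 49, 82, 48, 34, 62, 35, 15]
rows = stem_and_leaf(cpu_times)
print(render(rows))
```

Empty stems are printed on purpose, so that gaps in the data remain visible in
the same way an empty bin is visible in a histogram.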
Boxplot: To construct a boxplot, we draw a box between the first and the
third quartiles, a line inside the box for the median, and extend whiskers to
the smallest and the largest observations.
♠ This representation is also called the five-point summary (xi is the sample
value obtained for random variable Xi)
five points = (min xi, Q̂1, M̂, Q̂3, max xi)
Example: cherry tree (again) The boxplot of girth

Scatter plot and time series plot: Scatter plots are used to see and
understand a relationship between two variables. In particular, if one of the
variables is time, the plot is referred to as a time plot.
♣ A scatter plot consists of n points on an (x, y)-plane, with the x- and
y-coordinates representing the two recorded variables.
Example: cherry tree (again) The scatter plot of girth vs. height (the
x-coordinate is girth, the y-coordinate is height)

Parameter Estimation
Estimators
♠ Why do we need estimators and what is an estimator?
Some motivations: Suppose we are interested in the average annual income
of people in the U.S.; we use a parameter θ to denote it.
♠ Ideally, if we knew all the data in the population, say x1, · · · , xN (they
are the sample values of X1, · · · , XN), then
$$\theta = \frac{1}{N}\sum_{i=1}^{N} x_i$$
However, we are not able to record the annual income of each individual!
♠ What we do is select a good and appropriate sample, a subset of the
whole population, say X1, · · · , Xn with sample size n < N.
♣ Their values are x1, · · · , xn, and we can compute the sample mean x̄ based
on those values. We hope that x̄ is a good representation of θ, i.e. that it
estimates θ well.
♥ X̄ is then an estimator for θ, and x̄ is a value of this estimator.

Estimator: Let X1, · · · , Xn be i.i.d. random variables with distribution
Fθ with (unknown) parameter θ. A statistic θ̂ = θ̂(X1, · · · , Xn) used to
estimate the value of θ is called an estimator of θ.
Estimate: For each realization x1, · · · , xn, θ̂(x1, · · · , xn), which is a
number, is called an estimate of θ.
A very natural question: is that estimate good or bad?
♠ We need some terminology to compare our estimators.
♣ Unbiasedness: An estimator for θ is unbiased if the expected value of the
estimator is the true parameter, i.e. E(θ̂) = θ
♣ Efficiency: For two estimators of θ, say θ̂1 and θ̂2, θ̂1 is considered to be
more efficient than θ̂2 if
E(θ̂1 − θ)² < E(θ̂2 − θ)²
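Unbiasedness and efficiency can be illustrated by simulation. The Python sketch
below is not from the slides: it assumes a hypothetical normal population with
mean 50 and standard deviation 10, and compares two estimators of the mean, the
sample mean and the first observation alone. Both are unbiased, but their
values of E(θ̂ − θ)² differ. All names and constants are illustrative
assumptions.

```python
import random
import statistics

random.seed(330)          # fixed seed so the run is reproducible
MU, SIGMA = 50.0, 10.0    # hypothetical true mean and standard deviation
N, REPS = 25, 5000        # sample size and number of simulated samples

est_mean, est_first = [], []
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    est_mean.append(statistics.fmean(sample))  # estimator 1: sample mean
    est_first.append(sample[0])                # estimator 2: first observation

def expected_sq_error(estimates, theta):
    """Monte Carlo approximation of E(theta_hat - theta)^2."""
    return statistics.fmean([(e - theta) ** 2 for e in estimates])

# Approximate bias E(theta_hat) - theta of each estimator (both near 0).
bias_mean = statistics.fmean(est_mean) - MU
bias_first = statistics.fmean(est_first) - MU

# The sample mean should have the smaller expected squared error.
err_mean = expected_sq_error(est_mean, MU)
err_first = expected_sq_error(est_first, MU)
print(bias_mean, bias_first, err_mean, err_first)
```

Under these assumptions the expected squared error of the sample mean is close
to σ²/n = 4, against about σ² = 100 for a single observation, so the sample
mean is the more efficient of the two.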
♠ E(θ̂ − θ)² is called the MSE (Mean Squared Error).
♣ Consistency: If we have a large sample size n, we want the estimator θ̂
to be close to the true parameter in the sense that
$$\lim_{n\to\infty} P(|\hat\theta - \theta| > \varepsilon) = 0$$
for any ε > 0.
♥ Example: The sample mean X̄ is unbiased for the population mean μ, and the
sample variance is unbiased for the population variance σ².
♥ Reason: We have
$$E(\bar X) = E\Big(n^{-1}\sum_{i=1}^{n} X_i\Big) = n^{-1}\sum_{i=1}^{n} E(X_i) = n^{-1}\sum_{i=1}^{n} \mu = n^{-1} n\mu = \mu$$
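Consistency of the sample mean can be illustrated by estimating
P(|X̄ − μ| > ε) for growing n. The Python sketch below is an illustration under
assumed values that do not come from the slides: a standard normal population,
ε = 0.2, and 1000 simulated samples per sample size.

```python
import random
import statistics

random.seed(25)                   # fixed seed so the run is reproducible
MU, SIGMA, EPS = 0.0, 1.0, 0.2    # hypothetical population and tolerance
REPS = 1000                       # simulated samples per sample size

def exceed_prob(n):
    """Fraction of simulated samples of size n with |xbar - MU| > EPS."""
    hits = 0
    for _ in range(REPS):
        xbar = statistics.fmean([random.gauss(MU, SIGMA) for _ in range(n)])
        hits += abs(xbar - MU) > EPS
    return hits / REPS

probs = {n: exceed_prob(n) for n in (10, 100, 1000)}
print(probs)
```

The estimated probabilities shrink toward 0 as n grows, which is the behaviour
the definition of consistency asks for.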
and
$$S^2 = (n-1)^{-1}\sum_{i=1}^{n}(X_i - \bar X)^2 = (n-1)^{-1}\Big(\sum_{i=1}^{n} X_i^2 - n\bar X^2\Big)$$
so
$$S^2 = \frac{1}{n-1}\Big[\sum_{i=1}^{n}(X_i - \mu)^2 - n(\bar X - \mu)^2\Big]$$
where μ = E(Xi), thus
$$E(S^2) = \frac{1}{n-1}\big(n\sigma^2 - n\,\mathrm{Var}(\bar X)\big) = \frac{1}{n-1}\big(n\sigma^2 - \sigma^2\big) = \sigma^2.$$
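The result E(S²) = σ² can be checked by simulation. The Python sketch below is
an illustration, not part of the slides: it assumes a hypothetical normal
population with σ² = 4 and samples of size n = 5, and compares the divisor
n − 1 (statistics.variance) with the divisor n (statistics.pvariance).

```python
import random
import statistics

random.seed(2015)          # fixed seed so the run is reproducible
SIGMA2 = 4.0               # hypothetical true variance
N, REPS = 5, 10000         # sample size and number of simulated samples

s2_values, divide_by_n_values = [], []
for _ in range(REPS):
    x = [random.gauss(0.0, SIGMA2 ** 0.5) for _ in range(N)]
    s2_values.append(statistics.variance(x))            # divisor n - 1
    divide_by_n_values.append(statistics.pvariance(x))  # divisor n

avg_s2 = statistics.fmean(s2_values)
avg_divide_by_n = statistics.fmean(divide_by_n_values)
print(avg_s2, avg_divide_by_n)
```

The average of S² lands near σ² = 4, while the version that divides by n
averages near (n − 1)σ²/n = 3.2, which is why the divisor n − 1 is used.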