Stat 330 (Spring 2015): slide set 25

Other descriptive statistics (Cont'd)

Interquartile range (IQR): the difference between the third and the first quartiles,

    IQR = Q3 − Q1.

It measures the variability of the data and is not significantly affected by outliers.

♣ In practice, outliers can be a serious problem that is hard to avoid. To detect and identify outliers, we use the IQR to measure the variability of the data.

Rule of thumb: the 1.5(IQR) rule is used to detect and identify outliers.

♦ Measure 1.5(IQR) down from the first quartile and up from the third quartile. All data points outside this range are considered suspiciously extreme; they are candidates for outliers.

Example (CPU times, cont'd):

♥ The first quartile: for p = 0.25 and n = 10, 25% of the sample is np = 2.5 observations and 75% of the sample is n(1 − p) = 7.5 observations.

♥ From the ordered sample 15, 34, 35, 36, 43, 48, 49, 62, 70, 82, only the 3rd element, 35, has no more than 2.5 observations to its left and no more than 7.5 observations to its right. Hence the first quartile is Q̂1 = 35.

♦ Similarly, the third quartile is Q̂3 = 62.

Sample quantile: A sample p-quantile is any number that exceeds at most 100·p% of the sample and is exceeded by at most 100(1 − p)% of the sample.

Percentile: A p-quantile is also called a 100p-th percentile.

Quartile: The 1st, 2nd, and 3rd quartiles are the 25th, 50th, and 75th percentiles. They split a population or a sample into four parts.

♣ A median is at the same time a 0.5-quantile, a 50th percentile, and the 2nd quartile.
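The counting definition of a sample p-quantile lends itself to a direct check in code. Below is a minimal Python sketch (the function name sample_quantile is our own, not from the slides) that scans an ordered sample for a value with at most np observations below it and at most n(1 − p) above it:

```python
def sample_quantile(sorted_x, p):
    """Return a sample p-quantile: a value exceeding at most 100p% of the
    sample and exceeded by at most 100(1 - p)% of the sample."""
    n = len(sorted_x)
    for v in sorted_x:
        below = sum(1 for x in sorted_x if x < v)   # observations to the left
        above = sum(1 for x in sorted_x if x > v)   # observations to the right
        if below <= n * p and above <= n * (1 - p):
            return v
    return None

# CPU-time data from the earlier example
data = sorted([70, 36, 43, 49, 82, 48, 34, 62, 35, 15])
print(sample_quantile(data, 0.25))  # 35, matching the slide's first quartile
print(sample_quantile(data, 0.75))  # 62, matching the third quartile
```

Note that a sample p-quantile need not be unique; this sketch simply returns the smallest qualifying observation.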
Stat 330 (Spring 2015): Slide set 25. Last update: March 22, 2015.

Other descriptive statistics

Population quantile: A p-quantile of a population is a number x that solves the inequalities

    P(X < x) ≤ p and P(X > x) ≤ 1 − p.

Review: descriptive statistics, inferential statistics, sample/population mean, sample/population variance, sample/population median, range.

Example: The CPU times for randomly chosen tasks are

    70, 36, 43, 49, 82, 48, 34, 62, 35, 15,

with ordered sample

    15, 34, 35, 36, 43, 48, 49, 62, 70, 82.

♣ None of the data in this sample falls outside the interval [−5.5, 102.5] obtained from the 1.5(IQR) rule, so no outliers are suspected.

Graphical Statistics

♠ To illustrate these graphical tools, consider a data set consisting of measurements of the girth, height, and volume of timber in 31 felled black cherry trees.

♠ Note that girth is the diameter of the tree (in inches), measured at 4 ft 6 in above the ground.

♠ We can collect the data and draw schematics to illustrate how the data are distributed.

Histogram: A histogram shows the shape of the pmf or pdf of the data, checks for homogeneity, and suggests possible outliers.

♥ To construct a histogram, we split the range of the data into equal intervals ("bins") and count how many observations (or what proportion of them) fall into each bin.

Stem-and-leaf plot: A stem-and-leaf plot is similar to a histogram, but it also shows how the data are distributed within columns.

♥ To construct a stem-and-leaf plot, the first one or several digits of each observation form a stem, and the next digit forms a leaf. Other digits are dropped. For example,

    239 ⇔ 23 | 9,    23 ⇔ 2 | 3.

Example: cherry tree (again). Stem-and-leaf plot for height (leaf unit = 1, so 6 | 34 stands for 63, 64):

    6 | 34569
    7 | 01224455566789
    8 | 000001123567
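The stem-and-leaf construction is easy to automate. A minimal Python sketch follows (the helper name stem_and_leaf is our own; the 31 heights are read back from the leaves of the cherry-tree plot, an assumption on our part):

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Split each observation into a stem (all digits but the last)
    and a leaf (the last digit); leaf unit = 1."""
    stems = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)
        stems[stem].append(leaf)
    return dict(stems)

# The 31 cherry-tree heights, read back from the plot's leaves
heights = [63, 64, 65, 66, 69,
           70, 71, 72, 72, 74, 74, 75, 75, 75, 76, 76, 77, 78, 79,
           80, 80, 80, 80, 80, 81, 81, 82, 83, 85, 86, 87]

for stem, leaves in stem_and_leaf(heights).items():
    print(stem, "|", "".join(map(str, leaves)))
# 6 | 34569
# 7 | 01224455566789
# 8 | 000001123567
```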
Previous example: For the ordered sample 15, 34, 35, 36, 43, 48, 49, 62, 70, 82 with Q1 = 35 and Q3 = 62, we have

    IQR = Q3 − Q1 = 62 − 35 = 27,

and measuring 1.5 interquartile ranges from each quartile gives

    Q1 − 1.5(IQR) = 35 − 1.5 · 27 = −5.5 and Q3 + 1.5(IQR) = 62 + 1.5 · 27 = 102.5.

Example: cherry tree (again). The scatter plot of girth vs. height: the x-coordinate is girth, the y-coordinate is height (figure not shown).

♠ Five points = (min xi, Q̂1, M̂, Q̂3, max xi).

Some motivation: Suppose we are interested in the average annual income of people in the U.S.; we use a parameter θ to denote it.

♠ Ideally, if we knew all the data in the population, say x1, …, xN (the sample values of X1, …, XN), then θ = Σ_{i=1}^{N} xi / N. However, we are not able to record the annual income of each individual!

♠ What we do is select a good and appropriate sample, a subset of the whole population, say X1, …, Xn with sample size n < N.

♣ Their values are x1, …, xn, and from them we compute the sample mean x̄; we hope this is a good representation of θ, i.e., an estimate of θ.

♥ X̄ is then an estimator of θ, and x̄ is a value of this estimator.

Estimator: Let X1, …, Xn be i.i.d. random variables with distribution Fθ and (unknown) parameter θ. A statistic θ̂ = θ̂(X1, …, Xn) used to estimate the value of θ is called an estimator of θ.

Estimate: For each realization x1, …, xn, the number θ̂(x1, …, xn) is called an estimate of θ.

A very natural question: is that estimate good or bad?

♠ We need some terminology to compare our estimators.

♣ Unbiasedness: An estimator of θ is unbiased if the expected value of the estimator is the true parameter, i.e., E(θ̂) = θ.

♣ Efficiency: For two estimators of θ, say θ̂1 and θ̂2, θ̂1 is considered to be more efficient than θ̂2 if

    E(θ̂1 − θ)² < E(θ̂2 − θ)².
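Efficiency can be illustrated with a small simulation; the sketch below is our own example, not from the slides. For samples from a normal population, it compares Monte Carlo estimates of E(θ̂ − θ)² for two estimators of the center, the sample mean and the sample median; the mean comes out more efficient:

```python
import random
import statistics

def mc_mse(estimator, theta=0.0, n=25, reps=2000):
    """Monte Carlo estimate of E[(theta_hat - theta)^2] for N(theta, 1) samples."""
    errors = []
    for _ in range(reps):
        sample = [random.gauss(theta, 1.0) for _ in range(n)]
        errors.append((estimator(sample) - theta) ** 2)
    return sum(errors) / reps

random.seed(0)
mse_mean = mc_mse(statistics.mean)      # close to sigma^2 / n = 0.04
mse_median = mc_mse(statistics.median)  # larger: the median is less efficient here
print(mse_mean, mse_median)
```

For heavier-tailed populations the comparison can flip, which is why efficiency is always stated relative to an assumed model.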
Boxplot: To construct a boxplot, we draw a box between the first and the third quartiles, a line inside the box for the median, and extend whiskers to the smallest and the largest observations.

♠ This representation is also called the five-point summary (xi is the sample value obtained for the random variable Xi).

Example: cherry tree (again). The boxplot of girth (figure not shown).

Scatter plot and time series plot: Scatter plots are used to see and understand the relationship between two variables. In particular, if one of the variables is time, the plot is referred to as a time plot.

♣ A scatter plot consists of n points on an (x, y)-plane, with the x- and y-coordinates representing the two recorded variables.

Parameter Estimation

Estimators

♠ Why do we need estimators, and what is an estimator?

♥ Example: The sample mean X̄ is unbiased for the population mean μ; likewise, the sample variance S² is unbiased for the population variance σ².

♥ Reason: We have

    E(X̄) = E(n⁻¹ Σ_{i=1}^{n} Xi) = n⁻¹ Σ_{i=1}^{n} E(Xi) = n⁻¹ Σ_{i=1}^{n} μ = n⁻¹ · nμ = μ.

♠ E(θ̂ − θ)² is called the MSE (mean squared error).

♣ Consistency: If we have a large sample size n, we want the estimator θ̂ to be close to the true parameter, in the sense that

    lim_{n→∞} P(|θ̂ − θ| > ε) = 0 for any ε > 0.

For the sample variance,

    S² = (n − 1)⁻¹ Σ_{i=1}^{n} (Xi − X̄)² = (n − 1)⁻¹ (Σ_{i=1}^{n} Xi² − n X̄²) = (n − 1)⁻¹ (Σ_{i=1}^{n} (Xi − μ)² − n (X̄ − μ)²),

where μ = E(Xi). Thus

    E(S²) = (n − 1)⁻¹ (nσ² − n Var(X̄)) = (n − 1)⁻¹ (nσ² − σ²) = σ².
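The unbiasedness of S² can also be checked numerically. Below is a minimal Python sketch (our own illustration; it assumes N(0, 2²) data with n = 5) that averages S² over many samples and compares it with the biased version that divides by n:

```python
import random

def sample_variance(xs):
    """Unbiased sample variance S^2, with the n - 1 divisor."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

random.seed(1)
sigma2 = 4.0                     # true variance of the N(0, 2^2) population
n, reps = 5, 20000
s2_values = [sample_variance([random.gauss(0.0, 2.0) for _ in range(n)])
             for _ in range(reps)]
avg_s2 = sum(s2_values) / reps        # close to sigma^2 = 4: unbiased
avg_biased = avg_s2 * (n - 1) / n     # the n-divisor version underestimates sigma^2
print(avg_s2, avg_biased)
```

The small sample size n = 5 makes the bias of the n-divisor version easy to see: its average falls short of σ² by the factor (n − 1)/n = 0.8.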