Descriptive Statistics: Data Visualization & Summaries

Why are descriptive statistics important? In January 1986, the space shuttle Challenger broke apart shortly after liftoff. The accident was caused by a part that was not designed to fly at the unusually cold temperature of 29◦ F at launch. 6 4 2 0 Frequency 8 10 Here are the launch-temperatures of the first 25 shuttle missions (in degrees F): 66,70,69,80,68,67,72,70,70,57,63,70,78,67,53,67,75,70,81,76,79,75,76,58,29 20 30 40 50 60 Temperature 70 80 90 The two most important functions of descriptive statistics are: I Communicate information I Support reasoning about data When exploring data of large size, it becomes essential to use summaries. Graphical summaries of data It is best to use a graphical summary to communicate information, because people prefer to look at pictures rather than at numbers. There are many ways to visualize data. The nature of the data and the goal of the visualization determine which method to choose. Pie chart and dot plot California International Other US Washington Oregon International Oregon Washington Other US California 0 10 20 30 40 Percent The dot plot makes it easier to compare frequencies of various categories, while the pie chart allows more easily to eyeball what fraction of the total a category corresponds to. Bar graph When the data are quantitative (i.e. numbers), then they should be put on a number line. This is because the ordering and the distance between the numbers convey important information. 0 2 4 6 8 The bar graph is essentially a dot plot put on its side. 3 4 5 Number of assignments completed 6 The histogram The histogram allows to use blocks with different widths. Key point: The areas of the blocks are proportional to frequency. So the percentage falling into a block can be figured without a vertical scale since the total area equals 100%. But it’s helpful to have a vertical scale (density scale). Its unit is ‘% per unit’, so in the above example the vertical unit is ‘% per year’. The histogram gives two kinds of information about the data: 1. Density (crowding): The height of the bar tells how many subjects there are for one unit on the horizontal scale. For example, the highest density is around age 19 as .04 = 4% of all subjects are age 19. In contrast, only about 0.7% of subjects fall into each one year range for ages 60–80. 2. Percentages (relative frequences): Those are given by area = height x width. For example, about 14% of all subjects fall into the age range 60–80, because the corresponding area is (20 years) x (0.7 % per year)=14 %. Alternatively, you can find this answer by eyeballing that this area makes up roughly 1/7 of the total area of the histogram, so roughly 1/7=14% of all subjects fall in that range. The boxplot (box-and-whisker plot) 25 20 15 10 Miles per gallon for 32 cars 30 The boxplot depicts five key numbers of the data: 25 20 15 10 Miles per gallon 30 The boxplot conveys less information than a histogram, but it takes up less space and so is well suited to compare several datasets: 4 6 Number of Cylinders 8 The scatterplot The scatterplot is used to depict data that come as pairs. 25000 20000 15000 Income The scatterplot visualizes the relationship between the two variables. 10000 5000 0 6 8 10 12 Education 14 16 Providing context is important Statistical analyses typically compare the observed data to a reference. Therefore context is essential for graphical integrity. I ‘The Visual Display of Quantitative Information’ by Edward Tufte (p.74) One way to provide context is by using small multiples. The compact design of the boxplot makes it well suited for this task: Providing context with small multiples ale up the projects so it can elf," Webb said. gether experts from manyse cooperation results in a over the 2000 figure. "The total is surprisinglylarge in light of the overall sharp decline in stock market values over this periogical breakthroughs. od," RAND's Council for Aid to Edis heading towards interucation said. The non-profit CounWebb said. "Having the Sophisticated software makes it tempting cil forAid has trackedto to Education biologists all in the same roughs. I think it is a good erdisciplinary program at of cooperation between dising funding requirements. fficult because it is generalexperiments. Often bioloogy development is critical, ding. You are going after al sources that are used to al way," Webb said. fense Advanced Research n more willing to take the ayoff projects, according to rtment of Defense funding ects that are long-range, ard, like those taken on by I "1 expected growth in giving by foundations to be there, but 1 didn't expect it to be nearly as high as it was," Kaplan said. "There was so much bad news in the stock market, especially durproduce showy but poor period, with the ing that fiscal Pitfalls when visualizing data dent Hennessy in 2000 with th goal to raise $ 1 billion over five years. visualizations: This is "the largest campaign specifically for undergraduat education ever undertaken b any university," Henness wrote in a 2000 letter introducing the program. As of March 31, Stanford campaign had already raise $733 million, although a portio of that total reflected as-yet unme commitments to match dona tions. Columbia University, wi $359 million in 2001, was just be hind Stanford. Indiana Universi ty was the most-funded pub university, in seventh place wi $301 million. AARON STAPLE/The Stanford Daily The ‘Ghettysburg Powerpoint Presentation’ by Peter Norvig Steakburger" ority now is its Campaign fo Undergraduate Education launched by University Presi- Numerical summary measures For summerizing data with one number, use the mean (=average) or the median. The median is the number that is larger than half the data and smaller than the other half. Mean vs. median Mean and median are the same when the histogram is symmetric. 0 5 10 15 20 25 30 100 measurements of the speed of light 600 700 800 900 km/sec 1000 1100 Mean vs. median When the histogram is skewed to the right, then the mean can be much larger than the median. So if the histogram is very skewed, then use the median. Mean vs. median If the median sales price of 10 homes is $ 1 million, then we know that 5 homes sold for $ 1 million or more. If we are told that the average sale price is $ 1 million, then we can’t draw such a conclusion: Percentiles The 90th percentile of incomes is $ 135,000: 90% of households report an income of $ 135,000 or less, 10% report more. The 75th percentile is called 3rd quartile: $ 85,000 The 50th percentile is the median: $ 50,000 The 25th percentile is called 1st quartile. Five-number summary 25 20 10 15 Miles per gallon for 32 cars 30 Recall that the boxplot gives a five-number summary of the data: the smallest number, 1st quartile, median, 3rd quartile, largest number. The interquartile range = 3rd quartile − 1st quartile. It measures how spread out the data are. The standard deviation A more commonly used measure of spread is the standard deviation. x̄ stands for the average of the numbers x1 , . . . , xn . The standard deviation of these numbers is v u n u1 X s = t (xi − x̄)2 or n i=1 v u u t n 1 X (xi − x̄)2 n−1 i=1 The two numbers x̄ and s are often used to summarize data. Both are sensitive to a few large or small data. If that is a concern, use the median and the interquartile range.

Descriptive Statistics: Data Visualization & Summaries

Related documents

Products

Support

Descriptive Statistics: Data Visualization & Summaries

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib