Review Distribution The distribution of a variable describes WHAT values the variable takes and HOW often it takes these values. Stat 226 – Introduction to Business Statistics I Spring 2009 Professor: Dr. Petrutza Caragea Section A Tuesdays and Thursdays 9:30-10:50 a.m. Depending on the type of the data (categorical or quantitative) we need to use different graphical and numerical tools to analyze and summarize the data at hand. We will start by describing data graphically: Chapter 1, Section 1.1 bar graphs, pie charts and pareto charts can be used to graphically summarize categorical data. a common graphical display for quantitative data is a histogram. Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 1 / 37 Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 2 / 37 Chapter 1.1 – Displaying Distributions with Graphs Strategies for data analysis: Chapter 1.1 – Displaying Distributions with Graphs examine each variable in data set individually and look for possible relationships among variables start with graphs and add numerical summaries as needed Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 3 / 37 Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 4 / 37 Chapter 1.1 – Displaying Distributions with Graphs Chapter 1.1 – Displaying Distributions with Graphs One of the first graphical displays is the so-called “coxcomb” by Florence Nightingale: The text in the lower left corner reads: “The Areas of the blue, red, & black wedges are each measured from the center as the common vertex. The blue wedges measured from the center of the circle represent area for the deaths from Preventable or Mitigable Zymotic diseases, the red wedges measured from the center the deaths from wounds, & the black wedges measured from the center the deaths from all other causes. The black line across the red triangle in Nov. 1854 marks the boundary of the deaths from all other causes during the month. In October 1854, & April 1855, the black area coincides with the red, in January & February 1855, the blue coincides with the black. The entire areas may be compared by following the blue, the red, & the black lines enclosing them.” Source: Nightingale, Florence. Notes on Matters Affecting the Health, Efficiency and Hospital Administration of the British Army, 1858. Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 5 / 37 Chapter 1.1 – Displaying Distributions with Graphs Stat 226 (Spring 2009, Section A) ∗ 6 / 37 Chart 24.6 Example: data on loan amounts owed by Asia to several foreign loaners: percentage 35.8 11.7 9.1 8.8 5 29.8 ≈ 100∗∗ 31.7 80.9 loan amount loan amount∗ 97.2 31.7 24.6 23.8 16.3 80.9 274.5 Chapter 1, Section 1.1 Chapter 1.1 – Displaying Distributions with Graphs Categorical variables: Bar graphs and Pie charts country Japan Germany France US Great Britain others Total Introduction to Business Statistics I 13.6 23.8 97.2 country country loan amount in Billions of Dollars, due to rounding percentages will add up to 100.2 France Japan Germany US Great Britain others ∗∗ Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 7 / 37 Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 8 / 37 Chapter 1.1 – Displaying Distributions with Graphs Chapter 1.1 – Displaying Distributions with Graphs Pie Chart A pie chart shows the amount of data that belong to each category as a proportional part of the circle Chart 9.1% useful when only one variable is of interest 11.7% loan amount 29.8% easy to compare: relative size of the parts to each other as well as the size of each part compared to the total 5.0% percentage have to add up to 100, i.e. we need to account for all possible categories∗ 8.8% 35.8% ∗ Assume, we did not have complete information on loan amounts for all other countries: a pie chart can no longer be constructed because loan amounts for Japan, Germany, France, US and Great Britain do not account for 100% of loan amount given to Asia. (see also AYK problem 1.3, page 9) country France Japan country Stat 226 (Spring 2009, Section A) Germany US Great Britain others Introduction to Business Statistics I Chapter 1, Section 1.1 9 / 37 Chapter 1.1 – Displaying Distributions with Graphs Chart 125 125 35.8% 50 9.1% 11.7% 8.8% 75 Japan 16.6% 9.1% 0 country US Japan France Germany US others Japan France Germany Great Britain Introduction to Business Statistics I Chapter 1, Section 1.1 country Great Britain country others 11.7% France 12.5% 25 50 75 100 125 loan amount country Germany US 5.0% Germany 12.9% 0 France Japan 35.8% Great Britain 7.1% Stat 226 (Spring 2009, Section A) 8.8% 50 25 0 29.8% US 5.0% country others 50.9% Great Britain 25 10 / 37 Bar graphs are valuable presentation tools as they are effective at reinforcing differences in magnitude (note that bars have to be of equal widths and are equally spaced) Bar graphs are (obviously) useful when the observed outcomes of the variable, our data, can be placed into different categories Bar graphs can be either horizontal or vertical country 75 Chapter 1, Section 1.1 Chart 100 29.8% loan amount loan amount 100 Introduction to Business Statistics I Chapter 1.1 – Displaying Distributions with Graphs Bar Graph A bar graph shows the amount of data that belong to each category as proportionally sized rectangular areas. They are more flexible than pie charts because we don’t need to account for all possible categories of the variable. Chart Stat 226 (Spring 2009, Section A) France Japan Germany US Great Britain 11 / 37 Stat 226 (Spring 2009, Section A) France Japan Germany US Great Britain others Introduction to Business Statistics I Chapter 1, Section 1.1 12 / 37 Chapter 1.1 – Displaying Distributions with Graphs Chapter 1.1 – Displaying Distributions with Graphs Chart Comments: 125 Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 13 / 37 75 50 25 Chapter 1.1 – Displaying Distributions with Graphs Great Britain US France Germany Introduction to country Business Statistics I Stat 226 (Spring 2009, Section A) country others 0 Japan Sometimes, however, it is more useful to arrange the bars with respect to their magnitude, i.e. order them from tallest to shortest in order to highlight the categories with the highest frequencies (most important classes). This type of bar graph is called a Pareto chart. 100 loan amount Often classes have a “natural order”, so it makes sense to put the bars in that order. For example consider how many Fr, So, J, S are enrolled in this 226 section. Japan France others US Chapter 1, Section 1.1 14 / 37 Germany Great Britain Chapter 1.1 – Histograms Example: Ages and annual salaries for the CEOs of the 60 top ranked small companies in America in 1993. Source: Forbes, Nov. 8, 1993, ”America’s Best Small Companies,” Quantitative Variables: Frequency Tables and Histograms Distribution of Variable Age Numerical data (observations are numbers rather than categories) are very common in business. One of the most common ways to summarize observations from a quantitative variable is a histogram. Idea: group values into “classes” (intervals) of equal width count how many observations fall into each class draw a bar graph for the counts (keeping the intervals in numerical order, adjacent but non-overlapping) 30 35 40 45 50 55 60 65 70 75 Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 15 / 37 Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 16 / 37 Chapter 1.1 – Histograms Chapter 1.1 – Histograms General Guidelines for Drawing a Histogram Distribution of Salary in Thousands all classes have same width classes must not overlap, i.e. each observation can only belong to one class √ reasonable number of classes ( n for samples of size n ≤ 150 observations) there is no “best choice” of class width and number of classes ⇒ use good judgement to decide (see also next example for effect of class width) 0 200 400 600 800 frequency f for each class corresponds to number of observations that belong to that class 1000 1200 sum of all class frequencies must add to n, the size of the sample Note: Classifying data into classes leads to a loss of information, always be aware of that!!! Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 17 / 37 Chapter 1.1 – Histograms Distributions Age Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 18 / 37 Chapter 1.1 – Histograms Distributions Distributions Effect of class width Age Age Different shapes of distributions/histograms We distinguish the following main shapes of distributions symmetric 30 34 38 42 46 50 54 58 62 66 70 74 78 30 36 42 48 54 60 66 72 78 30 35 40 45 50 55 60 65 70 75 skewed to the right skewed to the left Distributions Age Distributions Distributions Age Age uniform/rectangular J-shaped 35 40 45 50 55 60 65 70 75 Stat 226 (Spring 2009, Section A) 40 50 60 70 80 Introduction to Business Statistics I 30 45 60 in addition we distinguish between bimodal and multi-modal distributions as opposed to uni-modal distributions. 75 Chapter 1, Section 1.1 19 / 37 Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 20 / 37 Chapter 1.1 – Histograms Chapter 1.1 – Histograms Example: Boston Housing Data - housing values for Boston suburbs, 506 observations on 14 different variables Example: Boston Housing Data cont’d — different shapes of distributions 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. variable Crime rate Residential zone nonretail river NO concentration #of rooms age distance Access to higway taxrate pt ration Black people Lower status M value of home explanation per capita crime rate proportion of residential land zoned for large lots proportion of nonretail business acres Charles river nitric oxides concentration average number of rooms per dwelling proportion of owner-occupied units built prior to 1940 weighted distances to five Boston employment centers index of accessibility to radial highways full-value property tax-rate per $ 10,000 pupil--teacher ratio 1000(B - 0.63)^2I(B less than 0.63)where B is the proportion of blacks % lower status of the population median value of owner-occupied homes in $1000 type quantitative quantitative quantitative categorical quantitative quantitative quantitative quantitative quantitative quantitative quantitative quantitative quantitative quantitative NO concentration # of rooms value of home comments 1 if tract bounds river, 0 otherwise .4 .5 .6 .7 .8 5 Tax rate Bar chart for the only categorical variable (variable 4) 6 7 8 10 lower status 20 30 40 50 black people 400 N 300 200 100 200 0 0 300 400 500 600 700 0 5 10 15 20 25 30 35 50 100 150 200 250 300 350 400 1 chas Stat 226 (Spring 2009, Section A) chas 0 1 Introduction to Business Statistics I Chapter 1, Section 1.1 21 / 37 Chapter 1.1 – Histograms Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I 140 150 Histogram of Amount Spent on Most Recent Haircut 100 100 80 60 possible outliers Number of Students 4 40 spread 0 0 20 3 50 center Number of Students 2 120 Describing distributions/histograms overall shape of histogram 22 / 37 Chapter 1.1 – Histograms Histogram of Time Spent Sleeping the Night before Classes Began 1 Chapter 1, Section 1.1 0 Stat 226 (Spring 2009, Section A) 2 4 6 8 10 12 0 20 Sleep Time in Hours Introduction to Business Statistics I Chapter 1, Section 1.1 23 / 37 Stat 226 (Spring 2009, Section A) 40 60 80 100 Amount Spent in Dollars Introduction to Business Statistics I Chapter 1, Section 1.1 24 / 37 Chapter 1.1 – Stemplots Chapter 1.1 – Stem-and-Leaf Plot (Stemplot) Example: random sample of package design ratings ranging from 0 - 45: shows the actual digits Each numerical value is divided into two components 22 21 31 20 25 21 32 26 43 30 27 30 27 36 28 33 38 35 19 30 34 41 leading digits – stem trailing digits – leaf How to make a stem-and-leaf plot: 1 Separate each observation into a stem consisting of all but the last (most right) digit and a leaf (the last digit). Stems may have as many digits as needed, but each leaf contains only a single digit. 2 Write stems in a vertical column with smallest at the top, then draw a vertical line at the right if this column 3 Write each leaf in the row to the right of its stem, in increasing order out from the stem 4 add legend on how to read stem plot Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 25 / 37 Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1.1 – Stemplots Chapter 1.1 – Stemplots Split Stem-and Leaf Plot: Back-to-Back Stem-and Leaf Plot: Sometimes it might be more informative to have a so-called Split Stem-and Leaf Plot where we split the stem further, namely into two groups for the leading digits: 0 to 4 and 5 to 9. Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 411 555 4210 99885 3322110 98776655 100 876 4 27 / 37 Stat 226 (Spring 2009, Section A) 2 2 3 3 4 4 5 5 6 6 7 7 Chapter 1, Section 1.1 26 / 37 Chapter 1, Section 1.1 28 / 37 234 5589 023 566788 0023345 55679 0134 68 24 5 Introduction to Business Statistics I Chapter 1.1 – Time Plots Chapter 1.1 – Time Plots A trader at the stock exchange in Frankfurt. Photo from NY Times, January 21, 2008 Stocks Plunge in Europe and Asia on U.S. Recession Fear Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 29 / 37 Chapter 1.1 – Time Plots Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 30 / 37 Chapter 1.1 – Time Plots Time Series average number of occupied rooms By month average number of occupied rooms Time Plots A time-plot displays data that are observed over a given period of time. Helps us understand the behavior of the data over time. 1100 1000 900 800 700 600 500 0 Always put time on the horizontal axis of your plot and the variable of interest in the vertical axis 12 24 36 48 60 72 84 96 108 120 132 144 156 168 month important features: trend seasonal variation Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 31 / 37 Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 32 / 37 Chapter 1.1 – Time Plots Chapter 1.1 – Time Plots A trend in a time series refers to a long-term, persistent rise or fall, i.e. a positive (upward) or negative (downward trend) A pattern in a time series that repeats itself at known regular intervals is called seasonal variation. Time Series average number of occupied rooms By month 1100 average number of occupied rooms average number of occupied rooms Time Series average number of occupied rooms By month 1000 900 800 700 600 1100 1000 900 800 700 600 500 500 0 12 24 36 48 60 72 84 0 96 108 120 132 144 156 168 12 24 36 48 60 72 Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 84 96 108 120 132 144 156 168 month month 33 / 37 Chapter 1.1 – Time Plots Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 34 / 37 Chapter 1.1 – Time Plots The southern oscillation is defined as the barametric pressure difference between Tahiti and the Darwin Islands at sea level. The southern oscillation is a predictor of El Nino which in turn is thought to be a driver of world-wide weather. Specifically, repeated southern oscillation values less than -1 typically defines an El Nino. Time Series Sunspots By Year Mean Std N 150 46.93 37.17775 100 Time Series Oscillation By Year Mean -0.103509 Std 1.0898398 N 456 2 Oscillation 1 100 Sunspots 3 50 0 -1 0 -2 -3 1770 1780 1790 1800 1810 1820 1830 1840 1850 1860 1870 Year -4 1954 1956 1958 1960 1962 1964 1966 1968 1970 1972 1974 1976 1978 1980 1982 1984 1986 1988 1990 1992 1994 Year Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 35 / 37 Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 36 / 37 Chapter 1.1 – Time Plots Time Series number of employed persons in thousands By date Mean 138862.22 Std 3761.5691 N 103 number of employed persons in thousands 145000 140000 135000 01/2007 01/2006 01/2005 01/2004 01/2003 01/2002 01/2001 01/2000 01/1999 130000 date Stat 226 (Spring 2009, Section A) Introduction to Business Statistics I Chapter 1, Section 1.1 37 / 37